# Gen AI RAG System 

**This notebook is split up into five parts:**
1. Setup
2. Data Acquisition
3. Experiments
4. Key Runs and Evaluation
5. Results

Credit to UC Berkeley ISchool and the Data Science 290 course team for the set up and data acquisition portion of this notebook

**The overall scenario is as follows:**

You work at a tech company that is looking for new ways to organize their question answering and search capabilities to accelerate both engineering activity and the marketing team. The company also wants to roll out new GenAI-based products, so a lot of the questions will center around Generative AI concepts. The company has about 300 engineers and a marketing staff of 40. Product releases are done quarterly.

Your role is to implement and conduct a (mini-)POC helping the company to evaluate RAG capabilities for the improvement of their document search (and corresponding question answering), supporting particularly the engineering and marketing organizations. You will have a gold dataset with 'good' responses to questions from marketing and engineering teams. You need to develop metric(s) that help you to evaluate how well your RAG system performs relative to the gold data. You should work with the tunables of the setup (LLM, chunking, embeddings, ...) for your iterations.

We are given the following validation set:



In [None]:
validation_questions_answers = {
    0: {"question": "What purpose do large language models serve in the field of natural language processing?",
  "gold_answer_research": "Large language models (LLMs) serve the purpose of enabling general-purpose language generation and other natural language processing tasks such as classification. They achieve this by learning statistical relationships from text documents during computationally intensive self-supervised and semi-supervised training. LLMs can be used for text generation by predicting the next token or word, making them valuable for tasks like speech recognition, machine translation, and information retrieval. Additionally, LLMs have superseded previous models like recurrent neural networks, showcasing their efficiency and effectiveness in NLP tasks.",
  "gold_answer_marketing": "Large language models serve the purpose of improving performance in various natural language processing tasks, such as speech recognition, machine translation, natural language generation, optical character recognition, handwriting recognition, grammar induction, and information retrieval."},
1: {"question": "How does a large language model learn from text during training?",
  "gold_answer_research": "A large language model learns from text during training by first going through an unsupervised generative 'pretraining' stage where it sets initial parameters using a language modeling objective. Then, it goes through a supervised discriminative 'fine-tuning' stage where it refines its parameters based on annotated examples or task demonstrations. This dual-stage approach allows the model to learn statistical relationships from text documents in a computationally intensive process, enabling it to achieve general-purpose language generation and natural language processing tasks.",
  "gold_answer_marketing": "A large language model learns from text during training by first pretraining on a diverse dataset to acquire general language knowledge, and then fine-tuning on specific tasks or demonstrations to adapt its parameters for more targeted performance."},
2: {"question": "What are some key architectures behind the development of large language models?",
  "gold_answer_research": "Key architectures behind the development of large language models include the use of self-attention mechanisms, such as those seen in Transformer decoders. These architectures have been applied to tasks like autoregressive language modeling and have led to the dominance of Transformer-based language models in NLP. Models like BERT and GPT-2 have further advanced this paradigm, showcasing the power of large Transformer language models in achieving state-of-the-art results across various NLP tasks. Additionally, architectures like neural-retriever-in-the-loop generative-based models have shown improvements in tasks like open-domain QA and knowledge-grounded dialogue, emphasizing the importance of consistent and engaging responses in long-form generation and multi-turn conversations.",
  "gold_answer_marketing": "Key architectures behind the development of large language models include Transformer-based models such as BERT and GPT-2, which utilize self-attention mechanisms for tasks like autoregressive language modeling and knowledge-grounded dialogue. These models have shown significant success in NLP tasks and have led to advancements in general-purpose language generation and natural language processing."},
3: {"question": "Can you name some specific large language models and the companies or organizations that have developed them?",
  "gold_answer_research": "Some specific large language models include GPT-3 by OpenAI, Chinchilla by DeepMind, and BERT by Google. OpenAI developed GPT-3, DeepMind developed Chinchilla, and Google developed BERT. These models have been significant advancements in the field of natural language processing.",
  "gold_answer_marketing": "Chinchilla by DeepMind, GPT-3 by OpenAI."},
7: {"question": "What licensing models have been adopted for the distribution of source-available language models?",
  "gold_answer_research": "Based on the provided context, it seems that licensing models for the distribution of source-available language models have not been explicitly discussed in the referenced papers. However, it is crucial to consider potential licensing options such as open-source licenses (e.g., GPL, MIT) or proprietary licenses when distributing language models to ensure legal compliance and control over usage rights. Additionally, considering the implications of different licensing models on accessibility, collaboration, and commercialization is essential for determining the most suitable approach for sharing language models with the community. Further research or consultation with legal experts may be necessary to explore specific licensing strategies for source-available language models.",
  "gold_answer_marketing": "Answer: Some organizations choose open-sourcing, while others restrict access to a few organizations with resources or offer end-to-end deployment via API."},
8: {"question": "What are language models and what is their purpose in natural language processing?",
  "gold_answer_research": "Language models are probabilistic models of natural language that help predict or correct text. Their purpose in natural language processing is to assist in various tasks such as speech recognition, machine translation, natural language generation, and information retrieval. By analyzing the performance of human subjects, language models improve the understanding and generation of human-like text.",
  "gold_answer_marketing": "Language models are probabilistic models of natural language that are used in tasks such as speech recognition, machine translation, and natural language generation in natural language processing."},
9: {"question": "How have language models evolved in terms of architecture, from the 1980s to present times?",
  "gold_answer_research": "Language models have evolved significantly in terms of architecture from the 1980s to present times. In the 1980s, the first statistical language model was proposed, leading to experiments by IBM that identified areas for improvement by observing human subjects. However, it wasn't until 2017 when the transformer architecture was introduced by Google, revolutionizing the field. This development paved the way for models like BERT in 2018, which marked a shift towards large-scale transformer-based language models. These modern architectures, based on self-attention mechanisms, have dominated the field of natural language processing, achieving state-of-the-art performance in various tasks.",
  "gold_answer_marketing": "Language models have evolved from early statistical models in the 1980s to modern transformer architectures, such as BERT and GPT-2, which use self-attention mechanisms and have become dominant in natural language processing tasks."},
11: {"question": "Can you explain how maximum entropy language models work and what the partition function signifies?",
  "gold_answer_research": "Maximum entropy language models use feature functions to encode the relationship between a word and its n-gram history, aiming to maximize reward while satisfying a KL-constrained objective. The partition function, denoted as Z(x), is crucial in normalizing the probabilities of all possible outputs given the input. It represents the sum of the exponential of the reward function over all possible output sequences, making it computationally expensive to estimate but essential for accurate modeling. The partition function ensures that the model's predicted probabilities sum up to 1, providing a foundation for effective language modeling.",
  "gold_answer_marketing": "Maximum entropy language models encode the relationship between a word and the n-gram history using feature functions. The partition function in this context represents the total probability of all possible outcomes, making it a crucial factor in determining the optimal solution for the reward maximization objective."},
12: {"question": "What is the benefit of using continuous space embeddings in recurrent neural network language models?",
  "gold_answer_research": "Continuous space embeddings in recurrent neural network language models help alleviate the curse of dimensionality by representing words as non-linear combinations of weights in the embedding space. This approach helps address the data sparsity problem caused by the exponential increase in possible word sequences with vocabulary size. By utilizing continuous space embeddings, neural networks can effectively capture semantic relationships and meaning within the language model.",
  "gold_answer_marketing": "Continuous space embeddings in recurrent neural network language models help alleviate the curse of dimensionality caused by the exponential increase in possible word sequences, reducing data sparsity issues."},
13: {"question": "What challenges do large language models face in mirroring human cognitive patterns?",
  "gold_answer_research": "Large language models face challenges in mirroring human cognitive patterns because they sometimes learn patterns that humans do not learn, while also failing to learn patterns that humans typically learn. This discrepancy suggests that the models may not be plausible cognitive models, despite matching human performance in some tasks. Further research is needed to address these limitations and improve the alignment of large language models with human cognitive patterns.",
  "gold_answer_marketing": "Large language models sometimes learn patterns that humans do not learn and fail to learn patterns that humans typically do learn."},
16: {"question": "What factors influenced the development of generative language models by Anthropic?",
  "gold_answer_research": "Several factors influenced the development of generative language models by Anthropic, including the limitations in coding, math, and reasoning capabilities of the initial version Claude, the partnerships with companies like Notion and Quora to enhance the model's capabilities, and the need to address biases, unsafe content, and ethical considerations in training data. Additionally, the reliance on supervised learning and the need for controlled generation in generative models played a role in shaping the development of Anthropic's language models.",
  "gold_answer_marketing": "Factors that influenced the development of generative language models by Anthropic include partnerships with companies like Notion and Quora, limitations in coding, math, and reasoning capabilities in initial models like Claude, and the need to address biases and unsafe content in training datasets."},
17: {"question": "What is Constitutional AI and how does it affect the functionality of AI systems?",
  "gold_answer_research": "Constitutional AI is an approach developed by Anthropic for training AI systems, particularly language models like Claude, to be harmless and helpful without relying on extensive human feedback. It involves two phases: supervised learning, where the model generates responses to prompts and self-critiques based on a set of guiding principles, and reinforcement learning, where the model is trained with AI-generated feedback according to constitutional principles. This approach enables the training of AI assistants that are both helpful and harmless, with the ability to explain objections to harmful requests, enhancing transparency and reducing the need for human supervision.",
  "gold_answer_marketing": "Constitutional AI is an approach developed by Anthropic for training AI systems, particularly language models like Claude, to be harmless and helpful without relying on extensive human feedback. It involves supervised learning and reinforcement learning phases to guide the model's responses based on a set of guiding principles (a 'constitution'). This approach aims to create AI systems that are both helpful and transparent in their decision-making process, reducing the need for constant human supervision."},
18: {"question": "How do advances in AI models impact their ability to interact with different types of data, such as images?",
  "gold_answer_research": "Advances in AI models, such as multimodal models like RA-CM3, have significantly improved their ability to interact with different types of data, such as images. These models can refer to external memory, like web data, to increase their knowledge capacity, allowing them to generate correct images from entity-rich captions. Additionally, these models can perform image editing and manually specify examples in-context for better results. The use of large language models, combined with larger datasets and neural networks, has also enhanced their performance in tasks like image generation and text generation.",
  "gold_answer_marketing": "Advances in AI models, such as multimodal models like RA-CM3, allow for better interaction with different types of data, like images, by accessing external memory for increased knowledge capacity and improving performance in tasks like image generation and image editing."},
19: {"question": "What are the potential trade-offs between AI system alignment with ethical guidelines and practical utility?",
  "gold_answer_research": "The potential trade-offs between AI system alignment with ethical guidelines and practical utility include the risk of reduced performance and usability due to stringent ethical alignment measures, as seen with Claude 2. Users may face limitations and refusal of assistance for benign requests, leading to debates over the 'alignment tax' in AI development. Balancing ethical considerations with practical functionality is crucial to ensure alignment with ethical guidelines without compromising the practical utility of AI systems. Research is needed to find a middle ground that prioritizes ethical alignment while maintaining usability and performance.",
  "gold_answer_marketing": "The potential trade-offs between AI system alignment with ethical guidelines and practical utility include balancing stringent ethical alignment that may reduce usability and performance, ensuring transparency and fairness in alignment processes, and addressing the alignment tax that may impact adoption of AI systems."},
20: {"question": "How has the token handling capacity changed between different versions of the Claude model?",
  "gold_answer_research": "The token handling capacity has increased with each new version of the Claude model. Claude Instant has a context length of 100,000 tokens, Claude 2.1 doubled this to 200,000 tokens, and Claude 3 Opus default version has a context window of 200,000 tokens but can be expanded to 1 million for specific use cases. This progression shows a trend towards handling larger amounts of text data for improved performance and capabilities.",
  "gold_answer_marketing": "The token handling capacity has increased from Claude to Claude Instant to Claude 2.1, with Claude Instant having a input context length of 100,000 tokens, Claude 2.1 having a context window of 200,000 tokens, and Claude 3 Opus having a context window of 1 million tokens."},
22: {"question": "In what ways has the Claude model's ability to self-critique and revise its responses enhanced its transparency?",
  "gold_answer_research": "The Claude model's ability to self-critique and revise its responses has enhanced its transparency by allowing for iterative improvements based on past actions and mistakes. Through self-reflection, the model can refine its output by learning from feedback and generating special tokens to signal the need for retrieval or confirm the relevance, support, or completeness of its responses. This process ensures that the model's statements about the world are truthful and accurate, ultimately increasing transparency in its decision-making and reasoning processes.",
  "gold_answer_marketing": "The Claude model's ability to self-critique and revise its responses has enhanced its transparency by allowing it to generate text informed by retrieved passages, criticize the output, and signal the need for retrieval or confirm the output's relevance, support, or completeness. This self-reflection process helps improve the model's accuracy and reliability in generating responses."},
23: {"question": "How do subsequent versions of Claude compare in terms of their likelihood to produce false statements?",
  "gold_answer_research": "Claude Instant is a faster and lighter version of Claude, with an input context length of 100,000 tokens. In contrast, Claude 3 has faced criticism for its stringent ethical alignment, leading to a debate over the 'alignment tax' in AI development. Users have been refused assistance with benign requests, which has sparked discussions on balancing ethical considerations and practical functionality. This suggests that Claude Instant may have a lower likelihood of producing false statements compared to Claude 3 due to its focus on usability and performance.",
  "gold_answer_marketing": "Claude Instant is a faster, less expensive, and lighter version of Claude with a shorter input context length. Claude 3 has faced criticism for ethical alignment issues that may affect usability and performance."},
24: {"question": "Who developed the language model family known as Chinchilla?",
  "gold_answer_research": "The Chinchilla language model family was developed by the research team at DeepMind and presented in March 2022. It is named 'Chinchilla' as an advancement over the previous Gopher model family. The Chinchilla family has been trained to investigate the scaling laws of large language models and is designed to outperform GPT-3.",
  "gold_answer_marketing": "The research team at DeepMind developed the language model family known as Chinchilla."},
25: {"question": "What benchmark did Chinchilla achieve an average accuracy of 67.5% on?",
  "gold_answer_research": "Chinchilla achieved an average accuracy of 67.5% on the MMLU benchmark (Measuring Massive Multitask Language Understanding).",
  "gold_answer_marketing": "Chinchilla achieved an average accuracy of 67.5% on the MMLU benchmark (Measuring Massive Multitask Language Understanding)."},
27: {"question": "What is the relationship between Chinchilla and the Gopher language model families?",
  "gold_answer_research": "The Chinchilla family of transformer models is essentially the same as the Gopher family, with minor modifications and different training optimizers. Chinchilla uses AdamW optimizer while Gopher uses Adam optimizer. Additionally, Chinchilla uses relative positional encoding and RMSNorm instead of absolute positional encoding and LayerNorm used by Gopher. Chinchilla has 70B parameters and outperforms Gopher on the MMLU benchmark by 7%, showcasing an improvement in performance. Both families follow similar naming conventions and were developed to investigate the scaling laws of large language models.",
  "gold_answer_marketing": "Chinchilla is a family of transformer models developed by DeepMind, which is a further development over a previous model family named Gopher. Both model families were trained to investigate the scaling laws of large language models."},
28: {"question": "What distinguishes the architectures of the Chinchilla and Gopher family models in terms of optimization techniques used?",
  "gold_answer_research": "The main distinction in optimization techniques between the Chinchilla and Gopher family models lies in the choice of optimizers. The Gopher family utilizes the Adam optimizer, whereas the Chinchilla family is trained using the AdamW optimizer. Additionally, the Gopher family employs RMSNorm instead of LayerNorm, and relative positional encoding rather than absolute positional encoding. These differences in optimization techniques contribute to the unique characteristics and performance of each model family.",
  "gold_answer_marketing": "The Chinchilla family uses AdamW optimizer, while the Gopher family uses the Adam optimizer."},
30: {"question": "What is the recommended strategy for training large autoregressive language models with limited compute resources, as contributed by the Chinchilla team?",
  "gold_answer_research": "The Chinchilla team recommends that the number of training tokens should be doubled for every model size doubling to achieve better results on downstream tasks. They also suggest using larger, higher-quality training datasets to improve performance. Additionally, they mention the importance of balancing model size and efficiency to address computational costs and inference latency limitations. It is advised to focus on Transformer language models and consider sharing model parameters for quick task-switching when deploying as a service.",
  "gold_answer_marketing": "The Chinchilla team recommends doubling the number of training tokens for every model size doubling and using larger, higher-quality training datasets to achieve better results on downstream tasks."},
33: {"question": "What are some key areas of research in the field of artificial intelligence as reflected in recent academic literature?",
  "gold_answer_research": "Recent academic literature in the field of artificial intelligence reflects key areas of research such as natural language processing with state-of-the-art transformers, feature learning in infinite-width neural networks, diverse beam search for complex scene description, and the development of generative AI models capable of generating text and images. Additionally, research focuses on human preferences in dueling bandits, the use of few-shot learners in language models, and the exploration of knowledge-grounded neural conversation models. These areas of research highlight the advancements in AI technology and its applications across various domains.",
  "gold_answer_marketing": "Some key areas of research in artificial intelligence include natural language processing, deep neural networks, generative AI, AI safety, AI art, reinforcement learning, and language agents alignment."},
34: {"question": "What are some of the limitations of traditional position encoding methods in the architecture of pre-trained language models (PLMs), and what novel approach does the paper propose to address these issues?",
  "gold_answer_research": "One limitation of traditional position encoding methods in PLMs is that they may not enable length extrapolation of pre-existing models, leading to the need for substantial pre-training costs. The paper proposes a novel approach called Position Interpolation, which extends existing PLMs without deviating far from existing definitions of position encoding or attention mechanisms. This method allows for much extended context windows for text modeling, leading to significant perplexity gains and improved model performance.",
  "gold_answer_marketing": "Traditional position encoding methods in PLMs have limitations in enabling length extrapolation and adapting to extended context windows. The paper proposes a novel approach called Position Interpolation, which generates strong models that can effectively make use of much extended context windows. This method allows for substantial pre-training cost savings and preserves the quality of the original models, even for small context window tasks."},
35: {"question": "How does the Rotary Position Embedding (RoPE) approach in Transformers differ from the traditional additive method of position embedding with respect to encoding position information?",
  "gold_answer_research": "The RoPE approach in Transformers differs from the traditional additive method of position embedding by being multiplicative instead of additive. While traditional methods add position encoding to context representations, RoPE incorporates relative position information through rotation matrix product. This means that RoPE naturally includes relative position dependency in the self-attention formulation, without altering terms in the expanded formulation like the additive method does. Additionally, RoPE's properties show that it decays as the relative distance between positions increases, providing a clear theoretical interpretation of how position information is encoded.",
  "gold_answer_marketing": "The RoPE approach in Transformers differs from the traditional additive method of position embedding by incorporating relative position information through rotation matrix product instead of altering terms in the expanded formulation of additive position encoding."},
36: {"question": "What is the significance of comparing the normalized subspace similarity between ∆Wq, ∆Wv, and random Gaussian matrices when analyzing the adaptation of pre-trained language models?",
  "gold_answer_research": "Comparing the normalized subspace similarity between ∆Wq, ∆Wv, and random Gaussian matrices provides insight into the underlying mechanism for adapting pre-trained language models. It helps determine the intrinsic rank of the adaptation matrix ∆W and sheds light on the connection between ∆W and the original weight matrix W. By analyzing these similarities, we can understand how much of the adaptation is specific to the task at hand and how much is influenced by the pre-trained model. This comparison is crucial for optimizing the adaptation process and maximizing downstream performance in NLP tasks.",
  "gold_answer_marketing": "Comparing the normalized subspace similarity between ∆Wq, ∆Wv, and random Gaussian matrices helps understand the underlying mechanism for adapting pre-trained language models. It reveals the intrinsic rank and common singular value directions learned by different runs, shedding light on the fundamental principles of using pre-trained language models for downstream tasks in NLP."},
38: {"question": "What issues are associated with the homogeneity of language model training contractors, and how might it affect the behavior of the models?",
  "gold_answer_research": "The issues associated with the homogeneity of language model training contractors include potential biases in the labeling process, lack of diverse perspectives leading to limited coverage of sensitive content, and reduced robustness in model performance across different tasks. This homogeneity can affect the behavior of the models by reinforcing certain biases, increasing the risk of harmful content generation, and limiting the models' ability to generalize effectively. To address these issues, it is important to ensure diversity among labelers, incorporate varied perspectives in training data, and implement measures to enhance model robustness and performance across a range of tasks.",
  "gold_answer_marketing": "The homogeneity of language model training contractors can lead to biased or limited perspectives in the data, which may result in the models producing harmful content, gaming objectives, or lacking sensitivity to diverse viewpoints. This can affect the behavior of the models by reinforcing stereotypes, increasing toxicity, and reducing their ability to accurately represent under-represented groups."},
39: {"question": "What are common research topics and themes found in recent publications about artificial intelligence and natural language processing?",
  "gold_answer_research": "Recent publications in artificial intelligence and natural language processing have covered topics such as transformer models, feature learning in neural networks, attention mechanisms, multi-task benchmark platforms, semantic search using sentence embeddings, cross-task generalization, and question generation for question answering. Themes commonly explored include machine comprehension of text, reinforcement learning algorithms, sentence embeddings, semantic compositionality, reasoning with language models and knowledge graphs, and the gap between neural text and human text. These publications also delve into deep language understanding, retrieval-augmented transformers, image captioning, and open datasets for image-text pairs.",
  "gold_answer_marketing": "Common research topics and themes in recent publications on artificial intelligence and natural language processing include transformer models, attention mechanisms, semantic search, sentence embeddings, and question answering using language models and knowledge graphs."},
41: {"question": "Question: When conducting demographic and technical assessments of teams or research subjects, what types of data categories are typically collected and analyzed to ensure a comprehensive understanding of the group's composition and the methods used?",
  "gold_answer_research": "When conducting demographic and technical assessments of teams or research subjects, it is important to collect and analyze data categories such as age, gender, education level, professional background, and expertise in specific areas. By gathering information on these categories, you can ensure a comprehensive understanding of the group's composition and the methods used in your assessments. Additionally, it may be helpful to consider factors like cultural background, language proficiency, and geographical location to capture a more nuanced picture of the group being assessed. This detailed approach to data collection and analysis can provide valuable insights for making informed decisions and recommendations based on the gathered information.",
  "gold_answer_marketing": "Answer: Demographic data such as age, gender, education level, and technical data related to skills and experience are typically collected and analyzed for comprehensive understanding."},
43: {"question": "What kind of tasks can be performed using the datasets described in the provided text, and what are some common features of these datasets?",
  "gold_answer_research": "The datasets described in the provided text can be used for tasks such as question answering, duplicate question retrieval, entity retrieval, citation prediction, query understanding, document understanding, passage retrieval, text summarization, fact verification, and code search. Common features of these datasets include diverse task categories, comprehensive instructions, a wide range of synthetic user personalities and interaction patterns, and a focus on enhancing comprehension of documents to deliver accurate results. Additionally, the datasets cover a variety of domains such as public health, scientific exams, climate, and general knowledge.",
  "gold_answer_marketing": "The datasets described in the provided text can be used for tasks such as question answering, document summarization, duplicate question retrieval, code search, sentence simplification, dialogue generation, body retrieval, caption generation, fact verification, and more. Some common features of these datasets include diverse input-output pairs, incorporation of various knowledge-intensive datasets, and a focus on generating high-quality synthetic data points."},
44: {"question": "What conclusions can be drawn about the relationship between input prompt toxicity and output toxicity when using different language models and prompts?",
  "gold_answer_research": "Based on the findings presented in the results section, it can be concluded that the relationship between input prompt toxicity and output toxicity varies depending on the language model used and the specific prompt given. When instructed to produce a safe and respectful output, InstructGPT models generate less toxic outputs compared to GPT-3, but this advantage disappears when the respectful prompt is removed. On the other hand, when explicitly prompted to produce a toxic output, InstructGPT outputs are much more toxic than GPT-3 outputs. Additionally, the toxicity of the model outputs is highly correlated with the toxicity of the input prompt, as shown in Figure 39.",
  "gold_answer_marketing": "The study found that when instructed to produce a safe and respectful output, InstructGPT models generate less toxic outputs compared to GPT-3. However, this advantage disappears when the respectful prompt is removed. Interestingly, when explicitly prompted to produce a toxic output, InstructGPT outputs are much more toxic than GPT-3. This suggests that the toxicity of the output is highly correlated with the toxicity of the input prompt."},
45: {"question": "What are some challenges in training retrieval systems and how are negative samples used to address them?",
  "gold_answer_research": "Training retrieval systems face challenges such as redundancy in retrieved documents and lack of diversity in retrieval. Negative samples, including randomly sampled negatives, denoised hard negatives, and instruction-unfollowing negatives, are crucial for improving system performance. Carefully designed negative samples help the system effectively learn the task, but they can also lead to performance drops in out-of-domain datasets. Combining random samples and challenging negatives during training is key to building a competitive system for both in-domain and out-of-domain retrieval.",
  "gold_answer_marketing": "Some challenges in training retrieval systems include high cost of annotating datasets for new tasks and improving performance in zero-shot settings. Negative samples, such as denoised hard negative documents and instruction-unfollowing negative documents, are used to train retrieval systems effectively and address performance drops in out-of-domain datasets."},
46: {"question": "What factors have been found to potentially impact the ability of models to follow instructions, based on the analysis provided?",
  "gold_answer_research": "Based on the analysis provided, factors that have been found to potentially impact the ability of models to follow instructions include the human feedback obtained from contractors, which may be influenced by their beliefs, cultural backgrounds, and personal history. Additionally, the model's behavior can be affected by false premises in instructions, tendencies to hedge, and performance degradation with multiple explicit constraints in instructions. The models are also not fully aligned or safe, as they can generate toxic or biased outputs, make up facts, and fail to generate reasonable outputs in some cases.",
  "gold_answer_marketing": "Factors that may impact the ability of models to follow instructions include false premises in instructions, models hedging unnecessarily, performance degradation with multiple constraints in instructions, generation of toxic or biased outputs, and over-generalization leading to refusal of innocuous instructions."},
47: {"question": "What are some key factors to consider when building a successful multi-task instruction-following retrieval system as identified in the research?",
  "gold_answer_research": "Some key factors to consider when building a successful multi-task instruction-following retrieval system include the need for cross-task interdependence for training a single retriever, the flexibility and zero-shot transfer enabled by instructions compared to task identifiers, and the elimination of the need for hosting multiple task-specific retrievers. Additionally, optimizing the mix and volume of instructional data for diverse tasks is crucial, as well as considering the impact of ranking strategy in data construction. Finally, the effectiveness of the dataset scale in retrieval and the importance of carefully designed negative samples should be taken into account for improved efficiency of instruction-following retrievers.",
  "gold_answer_marketing": "Key factors to consider when building a successful multi-task instruction-following retrieval system include the effectiveness of the dataset scale in retrieval, the diversity in data and model scale, carefully designed negative samples, and the ability to adapt to new tasks via instructions."},
48: {"question": "What are the benefits of using retrieval-augmented techniques in multimodal language modeling, as demonstrated by the performance of the RA-CM3 model in the document?",
  "gold_answer_research": "The benefits of using retrieval-augmented techniques in multimodal language modeling, as demonstrated by the performance of the RA-CM3 model, include significantly better training efficiency with less training compute, outperforming existing models by using less training data, compute, and parameters. The retrieval augmentation allows the model to focus on learning how to use retrieved documents in context, leading to improved accuracy in classification tasks. Additionally, the RA-CM3 model achieves strong performance in image and caption generation, surpassing existing models like DALL-E and Flamingo despite using fewer resources.",
  "gold_answer_marketing": "The benefits of using retrieval-augmented techniques in multimodal language modeling, as demonstrated by the performance of the RA-CM3 model in the document, include outperforming existing models by using less training data, compute, and parameters, achieving significantly better training efficiency, and improving accuracy in k-shot classification tasks. Additionally, retrieval augmentation allows the model to focus on learning how to use retrieved documents in context, leading to stronger performance in tasks such as image and caption generation."},
50: {"question": "What methods are typically employed to create training data for embedding models that use task-specific instructions?",
  "gold_answer_research": "To create training data for embedding models that use task-specific instructions, a common method is to combine datasets from different sources, such as the SuperNaturalInstructions dataset with existing collections designed for embedding training. The SuperNaturalInstructions dataset provides natural language instructions, which can be paired with positive and negative examples to form training samples. Additionally, for tasks like classification or similarity, training samples can be constructed by selecting text sequences associated with different classes or similarities. This diverse training data is essential for instruction-based finetuning, which enables the embedding model to learn from a wide range of tasks and domains.",
  "gold_answer_marketing": "Training data for embedding models that use task-specific instructions is typically created by formulating a wide variety of tasks as text-to-text problems, distinguishing good/bad candidate outputs given an input text. This is done by combining datasets with natural language instructions and constructing positive and negative pairs for training."},
51: {"question": "Question: What are some of the challenges and innovations associated with fine-tuning large language models, and how does the approach discussed in the referenced text aim to address them?",
  "gold_answer_research": "Some challenges associated with fine-tuning large language models include limited access to and manipulation of knowledge, lagging performance on knowledge-intensive tasks, and the need for provenance in decision-making and updating world knowledge. The approach discussed in the referenced text aims to address these challenges by utilizing Retrieval Augmented Generation (RAG), which involves retrieving relevant passages from a corpus to feed to the language model for improved performance in tasks such as question-answering and dialogue. This iterative approach focuses on improving alignment with user intent and fine-tuning models to control sentiment and improve response quality in various language tasks.",
  "gold_answer_marketing": "The challenges with fine-tuning large language models include aligning them with user intent and controlling the quality of generated outputs. The approach discussed in the referenced text aims to address these challenges by using Retrieval Augmented Generation (RAG) to retrieve relevant passages from a corpus and feed them to the language model, improving alignment and performance."},
52: {"question": "What is a common technique used to address the outlier issue when applying block-wise k-bit quantization to input tensors, and how does it work?",
  "gold_answer_research": "A common technique used to address the outlier issue when applying block-wise k-bit quantization to input tensors is to chunk the input tensor into blocks that are independently quantized, each with their own quantization constant. This approach involves dividing the input tensor into contiguous blocks of size B by flattening the tensor and slicing it into n blocks, where n is determined by the size of the blocks. Each block is then quantized independently using a quantization constant c, which helps prevent outlier values from causing performance degradation.",
  "gold_answer_marketing": "A common technique used to address the outlier issue when applying block-wise k-bit quantization to input tensors is to chunk the input tensor into blocks that are independently quantized, each with their own quantization constant. This helps prevent performance degradation by reducing the impact of outliers on the quantization process."},
54: {"question": "What considerations or techniques are commonly implemented when setting up finetuning experiments for machine learning models?",
  "gold_answer_research": "When setting up finetuning experiments for machine learning models, it is common to use a two-stage approach. The initial stage involves setting the initial parameters using a language modeling objective. This is followed by a supervised discriminative 'fine-tuning' stage to adapt these parameters to the target task. Additionally, it is typical to train all models using the Adam optimizer and a triangular learning rate scheduler with 10% warmup. Experimentation with different hyperparameters such as number of epochs, peak learning rate, and batch size is also conducted to optimize model performance. Finally, utilizing a mixture of datasets and balancing the sizes of datasets can help improve the robustness and generalization of the finetuned models.",
  "gold_answer_marketing": "Considerations for setting up finetuning experiments for machine learning models commonly include using a language modeling objective for initial parameter setting and supervised discriminative fine-tuning for adapting parameters to the target task. Techniques such as hyperparameter search, Adam optimizer with triangular learning rate scheduler, and balancing dataset sizes through mixing strategies are also commonly implemented. Additionally, freezing some model layers during fine-tuning and incorporating negative examples for contrastive learning can be effective strategies."},
55: {"question": "What are the implications of the equivalence relation defined in the theoretical analysis of the DPO model for understanding the relationship between reward functions in reinforcement learning?",
  "gold_answer_research": "The equivalence relation defined in the theoretical analysis of the DPO model implies that two reward functions are considered equivalent if they differ by a constant function. This means that the class of learned reward models is not constrained by this reparameterization, allowing for the exact recovery of the optimal policy. Understanding this relationship between reward functions in reinforcement learning helps in defining a unique reward function within each equivalence class, which is crucial for optimizing policies under existing models of human preferences. It also highlights the generality and flexibility in the reward model due to the proposed reparameterization.",
  "gold_answer_marketing": "The equivalence relation defined in the theoretical analysis of the DPO model shows that two reward functions are considered equivalent if they differ by a fixed function. This implies that different reward functions can lead to the same optimal policy, allowing for flexibility in designing reward models in reinforcement learning."},
59: {"question": "Considering the structure and content of the provided text, what guidelines should be used to evaluate the effectiveness of a summary or chatbot response in this context?",
  "gold_answer_research": "To evaluate the effectiveness of a summary or chatbot response in this context, guidelines should include assessing the faithfulness of the answer to the retrieved context, the relevance of the answer to the question, and the focus of the retrieved context. Additionally, consider using quality metrics such as answer relevancy to rank responses based on how directly they address the question and avoid redundant or incomplete information. Lastly, take into account the performance of different tasks such as summarization, citation prediction, and passage ranking to determine the overall effectiveness of the response.",
  "gold_answer_marketing": "Answer: Evaluate based on faithfulness, answer relevance, and context relevance."},
60: {"question": "What are some recent methods and technologies that have been developed to enhance the capabilities and performance of natural language processing models?",
  "gold_answer_research": "Recent methods and technologies developed to enhance natural language processing models include retrieval-augmented multimodal language modeling, which outperforms existing models with less training data and parameters. Another advancement is the use of feature learning in infinite-width neural networks to improve performance. Additionally, embedding techniques in NLP have been developed to map words or phrases to real number vectors, enhancing the model's understanding of language. These innovations have led to improvements in tasks like query reformulation, document ranking, and fine-tuning larger language models for various applications.",
  "gold_answer_marketing": "Recent methods and technologies include retrieval-augmented language models, feature learning in infinite-width neural networks, and word embeddings."},
61: {"question": "What are some potential directions for future work mentioned in the document related to enhancing question-answering techniques for document-oriented tasks?",
  "gold_answer_research": "One potential direction for future work mentioned in the document is the development of multi-modal approaches that incorporate table and figure information into GPT-4 question-answering for documents. Another direction is to incorporate question type in the PDFTriage approach to improve the efficiency and efficacy of the approach. Additionally, the document suggests further research in document-grounded, information-seeking question answering, which the dataset is designed to facilitate.",
  "gold_answer_marketing": "Some potential future directions mentioned in the document include developing multi-modal approaches that incorporate table and figure information into question-answering for documents, and incorporating question type in the PDFTriage approach to improve efficiency and efficacy."},
62: {"question": "What information would you expect to find in section 2 of a document, based on the types of questions classified under Summarization?",
  "gold_answer_research": "Based on the types of questions classified under Summarization, you would expect to find key takeaways, concise summaries, and specific content extraction related to different sections of the document in section 2. The section likely contains detailed summaries of specific parts of the document, along with structured metadata representation and instructions for summarizing the content effectively. It may also include guidelines for extracting specific information and rewriting text for clarity and conciseness.",
  "gold_answer_marketing": "Based on the types of questions classified under Summarization, you would expect to find key takeaways, concise summaries, and specific content extraction related to the document in section 2."},
63: {"question": "What are the main advantages and attention mechanisms that contribute to the enhanced performance and efficiency of the newly introduced language model as compared to its predecessors?",
  "gold_answer_research": "The main advantages of the newly introduced language model include utilizing retrieval-augmentation to incorporate external knowledge, which improves prediction accuracy. Additionally, the model employs attention mechanisms that allow for better understanding of dependencies between source and target sequences, leading to more informed predictions. These attention mechanisms have been extended from machine translation to various other fields, enhancing the model's adaptability and performance across different tasks. Finally, the model's use of self-attention mechanisms enables better contextual representation learning, parallelization, and modeling of longer intra-token relations, improving efficiency and performance compared to previous models.",
  "gold_answer_marketing": "The main advantages of the newly introduced language model include the use of retrieval-augmented mechanisms, attention mechanisms, and context representation learning, which contribute to enhanced performance and efficiency compared to its predecessors."},
64: {"question": "What criteria are used to assess the quality of recommendations provided by different language models in a comparison study?",
  "gold_answer_research": "In a comparison study of language models, criteria such as sentence relevance, lexical accuracy, and contextual understanding are used to assess the quality of recommendations. Different tasks may benefit from different evaluation measures, such as STRINC, LEXICAL, and CXMI. Additionally, template selection plays a vital role in the quality of recommendations, with deliberate template design being important for tasks like query suggestion. The overall quality of recommendations is often judged using a Likert scale, along with metadata collection for each model output.",
  "gold_answer_marketing": "The criteria used to assess the quality of recommendations provided by different language models in a comparison study include comparing to human-created benchmarks, examining intrinsic character, comparing two models, investigating rate of learning, and analyzing learning curves."},
65: {"question": "What approaches have been proposed to enhance the task performance of language models while considering the trade-offs such as runtime efficiency, robustness to irrelevant context, and attribution quality?",
  "gold_answer_research": "Several approaches have been proposed to enhance the task performance of language models while considering trade-offs. These include using compression and selective augmentation methods to decrease the propensity of models to generate toxic or biased outputs. Adversarial setups have been suggested where labelers find worst-case behaviors of the model and add them to the dataset. Additionally, models like BART and T5 leverage bi-directional attention to achieve stronger performance on both discriminative and generative tasks. These methods aim to balance model performance with considerations such as runtime efficiency, robustness to irrelevant context, and attribution quality.",
  "gold_answer_marketing": "Approaches proposed to enhance language model task performance include compression and selective augmentation, adversarial set-ups for labeling worst-case behaviors, retrieval-augmented models, and extending existing models to enable length extrapolation while maintaining quality."},
67: {"question": "What metrics are commonly used to compare the performance of language models in various tasks, as outlined in an experimental results table?",
  "gold_answer_research": "Common metrics used to compare the performance of language models in various tasks, as outlined in an experimental results table, include Exact Match and Unigram F1. These metrics have become standard in evaluating language models. Additionally, other metrics such as BLEU score, FactScore (factuality), precision, and recall are also commonly used to assess the performance of language models across different tasks. It is important to consider a variety of metrics to get a comprehensive understanding of the effectiveness of a language model in different contexts.",
  "gold_answer_marketing": "The metrics commonly used to compare the performance of language models in various tasks are Exact Match and Unigram F1."},
69: {"question": "What is the role of manual assessment in the validation of language model predictions according to the text provided?",
  "gold_answer_research": "Manual assessment plays a crucial role in the validation of language model predictions. The engineers evaluate the quality of model outputs by having labelers rate them on test sets consisting of prompts from held-out customers. This manual assessment helps ensure that the models are aligned with a broad distribution of language tasks and can identify any behavioral issues that may arise from misalignment. Additionally, human annotators find that certain reflection token predictions are aligned with their assessments, providing valuable insights into the accuracy and effectiveness of the models.",
  "gold_answer_marketing": "Answer: Manual assessment plays a key role in evaluating the quality of language model predictions by having labelers rate the model outputs and comparing them to prompts from held-out customers."},
70: {"question": "What are the general steps outlined for training a language model in the document, and how is the training data for the generator language model collected and utilized?",
  "gold_answer_research": "The document outlines the general steps for training a language model, including incorporating retrieved documents into the main input sequence and optimizing the loss function to train the generator. The training data for the generator language model is collected through various techniques such as supervised fine-tuning, critic learning, and custom retrievers for downstream tasks. The collected data is used to train the generator on specific tasks like summarization, machine reading comprehension, and natural language to SQL translation, improving performance on those tasks.",
  "gold_answer_marketing": "The general steps for training a language model include fine-tuning on specific datasets, filtering pretraining data, and using critic learning. Training data for the generator language model is collected from open-access NLP papers and used for downstream conditional text generation tasks."},
73: {"question": "What are the three main categories used to refine language model abilities in understanding and executing search tasks according to the given document?",
  "gold_answer_research": "The three main categories used to refine language model abilities in understanding and executing search tasks are query understanding, document understanding, and query-document relationship understanding. Tasks within these categories focus on interpreting queries, comprehending documents, and understanding the relationships between queries and documents. This approach aims to enhance the models' performance in interpreting and responding to search-related instructions effectively, improving their utility in complex information retrieval scenarios.",
  "gold_answer_marketing": "The three main categories used to refine language model abilities in understanding and executing search tasks are query understanding, document understanding, and query-document relationship understanding."},
74: {"question": "What are some of the emerging research topics and challenges in the field of natural language processing and information retrieval according to recent academic conferences and publications?",
  "gold_answer_research": "Recent academic conferences and publications have highlighted emerging research topics and challenges in natural language processing and information retrieval. Some key areas of focus include efficient retrieval augmented generation, unsupervised dense information retrieval with contrastive learning, citation-informed transformers, and knowledge refinement via interaction between search engines and large language models. Additionally, challenges such as zero-shot retrieval, semantic search using GPT sentence embeddings, and prompt-based effective input reformulation for legal case retrieval have been identified as important research directions. These topics reflect the ongoing advancements and complexities in the field, driving innovation and progress in NLP and IR research.",
  "gold_answer_marketing": "Some emerging research topics and challenges in the field of natural language processing and information retrieval include efficient generation from unstructured knowledge, semantic code search evaluation, unsupervised dense information retrieval, context-aware document term weighting, knowledge refinement through interaction with large language models, and investigating the effectiveness of large language models in search re-ranking."},
75: {"question": "Question: How do models with different fine-tuning strategies compare in terms of accuracy and F1 score for fact verification tasks?",
  "gold_answer_research": "Models with different fine-tuning strategies are compared in terms of accuracy and F1 score for fact verification tasks. The introduction of LLMs has led to notable developments, with some studies leveraging prompting methods to apply LLMs in IR tasks. However, not all LLMs consistently outperform fine-tuned smaller models. For example, RankGPT based on gpt-3.5-turbo underperforms monoBERT in certain scenarios. Fine-tuning is not strictly necessary for models like GPT3, which has been evaluated on closed book question answering tasks without any updates or fine-tuning.",
  "gold_answer_marketing": "Models with different fine-tuning strategies have shown mixed results in terms of accuracy and F1 score for fact verification tasks. Some studies have found that large language models (LLMs) outperform smaller fine-tuned models, while others have reported inconsistent performance. Factors such as task complexity and the need for prompt methods to apply LLMs in information retrieval tasks can also impact the comparison."},
76: {"question": "What components does a fact verification task typically involve in order to assess the accuracy of a given statement?",
  "gold_answer_research": "A fact verification task typically involves assessing the relationship between a claim and the evidence provided, analyzing if there is enough information for a conclusive judgment. This task requires a detailed understanding of the claim and evidence to determine if it is supported or refuted. The use of performance metrics based on including gold answers in model generations instead of exact matching can help search engines deliver accurate and relevant results. Additionally, incorporating lexical measures and verification functions can aid in determining the accuracy of statements.",
  "gold_answer_marketing": "A fact verification task typically involves assessing the relationship between a claim and supporting evidence to determine accuracy."},
78: {"question": "What are the key factors that determine the performance of HALO-aligned models compared to non-HALO models, according to the results presented in the analysis?",
  "gold_answer_research": "According to the analysis presented, the key factors that determine the performance of HALO-aligned models compared to non-HALO models include the specific alignment method used (such as DPO and PPO variant), the model size (significant gap at 13B+ model sizes), and the ability to match or exceed the generation quality of SFT target sequences. Additionally, the study suggests that the cost of increasing model alignment is modest relative to pretraining, and that the modeling of human biases in HALOs may have practical benefits in improving overall performance.",
  "gold_answer_marketing": "The key factor that determines the performance of HALO-aligned models compared to non-HALO models is the model size, with HALO-aligned models generally outperforming non-HALO models at larger sizes (13B+ model sizes)."},
80: {"question": "How does the performance of KTO compare to DPO in model alignment, and what are the potential implications for data usage and training efficiency?",
  "gold_answer_research": "Based on the provided data and experiments, KTO consistently outperforms DPO in model alignment, even with restrictions such as using only one output per input. This suggests that KTO can achieve higher win rates and improve performance across various benchmarks compared to DPO. The implications of this performance difference include the ability to achieve quality generation results with significantly fewer desirable examples, potentially leading to more efficient data usage and training processes. This indicates that KTO may offer a more efficient and effective approach to model alignment compared to DPO.",
  "gold_answer_marketing": "KTO outperforms DPO in model alignment with up to 90% fewer examples. This suggests that KTO can achieve high performance even with imbalanced data, potentially leading to more efficient training processes."},
81: {"question": "What are some common approaches to building an open-domain question answering system?",
  "gold_answer_research": "Some common approaches to building an open-domain question answering system include using the RAG model, which minimizes the negative log-likelihood of answers, and comparing it to extractive QA paradigms that rely on non-parametric knowledge retrieval. Another approach is to incorporate question rewriting techniques to make open-domain QA more conversational. Additionally, utilizing datasets like QASPER, which contain questions requiring complex reasoning, can improve the performance of the system. References to papers by Anantha et al. and Asai et al. provide further insights into building ODQA systems.",
  "gold_answer_marketing": "Common approaches to building an open-domain question answering system include using retrieval over a knowledge base and incorporating the retrieved content as part of the prompt. Other methods involve pretraining models on large amounts of text data and fine-tuning them for question answering tasks."},
82: {"question": "What is the difference between open-book and closed-book question answering?",
  "gold_answer_research": "Open-book question answering involves the use of external sources of knowledge, such as Wikipedia, to retrieve information and generate a response. In contrast, closed-book question answering relies on pre-trained language models that have memorized factual knowledge within their parameters to generate responses without explicit context. Closed-book QA can be seen as analogous to a closed-book exam where no external resources are allowed. The key distinction lies in the reliance on external knowledge sources for open-book QA versus internal memorized knowledge for closed-book QA.",
  "gold_answer_marketing": "Open-book question answering involves using external sources of knowledge to answer questions, while closed-book question answering relies on pre-trained language models to provide answers without explicit context."},
84: {"question": "What are the basic components of the Retriever-Reader framework in open-domain QA?",
  "gold_answer_research": "The basic components of the Retriever-Reader framework in open-domain QA include a retriever model, which fetches relevant information based on input prompts efficiently using FAISS. The retriever component is responsible for retrieving contextually relevant documents or evidence blocks based on the input question. The reader component then processes this retrieved information to generate answers to the questions posed. This framework combines information retrieval and machine reading comprehension to achieve state-of-the-art results in open-domain question answering tasks.",
  "gold_answer_marketing": "The basic components of the Retriever-Reader framework in open-domain QA are the retriever and the reader components, which can be set up and trained independently or jointly trained end-to-end. The retriever component automatically fetches relevant information based on input prompts, while the reader component processes and comprehends the retrieved information to answer questions."},
85: {"question": "How is the TF-IDF model used in question answering retrieval systems?",
  "gold_answer_research": "In question answering retrieval systems, the TF-IDF model is used to represent queries and documents as bag-of-word vectors with terms weighted by term frequency multiplied by inverse document frequency. This allows for efficient non-learning-based search engine operations based on the vector space model. The TF-IDF model helps in calculating the relevance of documents to queries by measuring the importance of terms in the context of the entire document collection. This classic information retrieval approach aids in retrieving relevant information to answer questions accurately and efficiently.",
  "gold_answer_marketing": "The TF-IDF model is used in question answering retrieval systems to weight terms in queries and documents based on their importance in determining relevance."},
86: {"question": "Can neural networks enhance the process of information retrieval in QA systems?",
  "gold_answer_research": "Neural networks, such as MLP, LSTM, and bidirectional LSTM, can be used to learn dense representations of text for information retrieval in QA systems. These approaches, known as 'Neural IR', are a new category of methods that can improve performance in retrieval problems. The introduction of neural retrievers in recent QA literature has shown to outperform traditional word-similarity-based architectures, such as BM25, and can scale to handle knowledge-grounded dialogue tasks effectively. Additionally, incorporating pre-trained retrievers in QA systems has been shown to enhance the performance of generative language models.",
  "gold_answer_marketing": "Yes, neural networks can enhance the process of information retrieval in QA systems by improving performance in open-domain QA tasks and enabling the generation of more accurate answers."},
87: {"question": "What is the importance of fine-tuning in the context of QA data for open-domain question answering models?",
  "gold_answer_research": "Fine-tuning is important in the context of QA data for open-domain question answering models because it allows the model to adapt and improve its performance on specific QA datasets. By fine-tuning the model with common QA datasets, engineers can optimize the model's ability to answer questions accurately. However, there is a concern about the significant overlap between questions in the train and test sets of public QA datasets, which could affect the generalization ability of the fine-tuned models. Engineers should carefully consider this overlap and potentially explore ways to mitigate its impact during the fine-tuning process to ensure the model's effectiveness in real-world applications.",
  "gold_answer_marketing": "Fine-tuning is important in the context of QA data for open-domain question answering models to improve search task performance and the ability to generalize to unseen datasets."},
88: {"question": "How does pre-training with tasks like the Inverse Cloze Task benefit open-domain question answering models?",
  "gold_answer_research": "Pre-training with tasks like the Inverse Cloze Task benefits open-domain question answering models by improving the retrieval process over a knowledge base. By predicting the context given a sentence, the model can better understand the relationship between the question and the evidence. This approach helps in incorporating retrieved content effectively into the prompt, leading to higher accuracy in the question answering task. Additionally, using models pretrained with ICT can enhance the overall performance of the QA system by providing a better understanding of the context.",
  "gold_answer_marketing": "Pre-training with tasks like the Inverse Cloze Task benefits open-domain question answering models by improving retrieval and generation steps, ultimately enhancing the accuracy of the process."},
89: {"question": "What is the main goal of prompt engineering in language models?",
  "gold_answer_research": "The main goal of prompt engineering in language models is to effectively steer the behavior of the model towards desired outcomes without updating the model weights. This is achieved by composing and formatting prompts in a way that maximizes the model's performance on a specific task. Prompt engineering involves treating prompts as trainable parameters and optimizing them directly on the embedding space through methods like AutoPrompt, Prefix-Tuning, P-tuning, and Prompt-Tuning. The ultimate aim is to enhance the model's performance and alignment with user-defined tasks.",
  "gold_answer_marketing": "The main goal of prompt engineering in language models is to steer the behavior of the model for desired outcomes without updating the model weights."},
91: {"question": "What are some known biases that can affect the performance of few-shot classification in LLMs?",
  "gold_answer_research": "Some known biases that can affect the performance of few-shot classification in LLMs include majority label bias, recency bias, and common token bias. Majority label bias occurs when the distribution of labels among examples is unbalanced, recency bias refers to the tendency for the model to repeat the label at the end, and common token bias indicates that LLM tends to produce common tokens more often than rare tokens. These biases can contribute to high variance in few-shot classification tasks and may impact the model's ability to generalize effectively.",
  "gold_answer_marketing": "Some known biases that can affect the performance of few-shot classification in LLMs are majority label bias, recency bias, and common token bias."},
92: {"question": "Why might increasing model size not reduce variance in model performance with varying prompts?",
  "gold_answer_research": "Increasing model size may not necessarily reduce variance in model performance with varying prompts because the model's ability to generalize and adapt to different prompts is not solely dependent on its size. Factors such as the quality and relevance of the training examples, the learning rate or schedule, and the model's sensitivity to different hyperparameters can also play a significant role in determining performance variability. Additionally, the complexity of the task or dataset being used for training can impact how effectively the model scales with size. It is essential to consider these factors holistically when optimizing model performance rather than relying solely on increasing model size.",
  "gold_answer_marketing": "Increasing model size may not reduce variance in model performance with varying prompts because the same order of prompts may work well for one model but poorly for another. Additionally, when the validation set is limited, choosing the order of prompts that prevents the model from producing extremely unbalanced predictions or being overconfident can also affect performance."},
93: {"question": "What is the benefit of instruction-based finetuning in language models?",
  "gold_answer_research": "Instruction-based finetuning improves models' ability to generalize to unseen domains and tasks by providing task-specific representations that can be used for many downstream language tasks without additional training. This method also allows pretrained language models to follow instructions provided in prompts, enabling them to generate the desired output given specific inputs. Additionally, instruction finetuning helps transform raw pretrained LLMs into chatbot-like models, making finetuning more accessible and common, particularly for researchers with limited resources. Overall, the benefit of instruction-based finetuning is improved model performance, enhanced generalizability, and reduced communication costs in aligning with human intentions.",
  "gold_answer_marketing": "The benefit of instruction-based finetuning in language models is improved ability to generalize to unseen domains and tasks, without the need for additional training."},
94: {"question": "Can you describe a situation where retrieval-based methods would be necessary to enhance language model performance?",
  "gold_answer_research": "Retrieval-based methods are necessary to enhance language model performance in scenarios where the model needs to generate accurate and informative responses for entity-rich queries, such as 'George Washington standing in front of the Eiffel Tower.' In such cases, incorporating a retrieval module can provide additional context and relevant information to improve the model's understanding and generation of the desired output. Additionally, retrieval-based methods are crucial for question answering tasks, where the model needs to access external knowledge sources to provide accurate and comprehensive answers. By utilizing retrieval mechanisms, the language model can benefit from a wider range of information and improve its performance in handling complex and ambiguous queries effectively.",
  "gold_answer_marketing": "Retrieval-based methods are necessary to enhance language model performance in tasks like question answering, where incorporating additional information from external sources can improve the model's ability to generate accurate and relevant responses."},
95: {"question": "What is the Chain-of-Thought prompting technique and for which types of tasks is it particularly beneficial?",
  "gold_answer_research": "Chain-of-Thought (CoT) prompting is a technique that generates reasoning chains or rationales step by step to lead to a final answer, benefiting complicated reasoning tasks using large models with more than 50B parameters. It can be implemented through iterative Monte Carlo search methods or through a three-step process called augment-prune-select. CoT is particularly beneficial for enhancing model performance on complex tasks by decomposing them into smaller and simpler steps, shedding light on the model's thinking process. Task decomposition in CoT can be done with simple prompting, task-specific instructions, or human inputs.",
  "gold_answer_marketing": "Chain-of-Thought (CoT) prompting is a technique that generates reasoning chains or rationales step by step to lead to a final answer. It is particularly beneficial for complicated reasoning tasks when using large models with more than 50B parameters. Simple tasks only benefit slightly from CoT prompting."},
96: {"question": "How do augmented language models with external tools differ from regular models in functionality?",
  "gold_answer_research": "Augmented language models with external tools, such as TALM and Toolformer, are fine-tuned to learn how to use external tool APIs, expanding their capabilities beyond traditional language processing tasks. These models are trained to incorporate external tool API calls in order to improve the quality of their outputs, allowing them to perform tasks like speech recognition, machine translation, and information retrieval more effectively. By leveraging external tools, these models have the ability to access and utilize a wider range of resources and functionalities, enhancing their overall performance and versatility compared to regular language models.",
  "gold_answer_marketing": "Augmented language models with external tools differ from regular models by fine-tuning a LM to use external tool APIs, expanding the dataset to improve model outputs and enhancing tasks like speech recognition, machine translation, and natural language generation."},
97: {"question": "What can be inferred about the utilization of attention in neural networks?",
  "gold_answer_research": "Attention mechanisms in neural networks play a crucial role in allowing models to focus on specific parts of input data when making predictions or generating outputs. By assigning importance weights to different elements, such as pixels in an image or words in a sentence, attention helps the model to attend to relevant information and make more accurate predictions. The use of attention can improve the interpretability of neural networks by showing which parts of the input data are being focused on during the prediction process. Additionally, attention mechanisms, like multi-head attention, can enhance model performance by allowing the model to jointly attend to information from different representation subspaces at different positions.",
  "gold_answer_marketing": "Attention in neural networks allows the model to focus on specific parts of input data, such as images or text, in order to make predictions or generate output. It helps the model to learn relationships and correlations between different elements and improve performance in tasks like image captioning or language translation."},
101: {"question": "Can the use of attention mechanisms in deep learning models be applied to both machine translation and computer vision?",
  "gold_answer_research": "Yes, attention mechanisms in deep learning models have shown success in both machine translation and computer vision tasks. In machine translation, attention allows the model to capture dependencies between source and target sequences regardless of distance, leading to improved translation quality. Similarly, in computer vision, attention mechanisms have been used to focus on relevant parts of an image during caption generation, showcasing the ability to handle details and global dependencies effectively. Therefore, utilizing attention in both domains can enhance the performance of deep learning models significantly.",
  "gold_answer_marketing": "Yes, attention mechanisms in deep learning models can be applied to both machine translation and computer vision."},
102: {"question": "What are the potential benefits of incorporating self-attention mechanisms into Generative Adversarial Networks (GANs)?",
  "gold_answer_research": "Incorporating self-attention mechanisms into GANs can help the generator and discriminator better model relationships between spatial regions, leading to improved generation of detailed and realistic images. This is particularly useful for capturing global dependencies and enhancing the performance of transformer architectures. Additionally, self-attention can enable the model to assess its own predictions after each generated segment, allowing for customizable decoding algorithms to meet specific constraints or user preferences. Overall, self-attention in GANs can enhance detail handling and overall performance.",
  "gold_answer_marketing": "Incorporating self-attention mechanisms into GANs can help the generator and discriminator better model relationships between spatial regions, leading to improved performance in handling details and capturing global dependencies."},
103: {"question": "How does the transformer model variate from traditional sequence-aligned recurrent architectures?",
  "gold_answer_research": "The transformer model differs from traditional sequence-aligned recurrent architectures by not having a recurrent or convolutional structure. Instead, it heavily relies on self-attention mechanisms for processing sequences. This lack of recurrence and convolution, even with positional encoding, weakly incorporates sequential order, which can be a drawback for tasks sensitive to positional dependencies. Additionally, the transformer's architecture includes embedding layers, sinusoid-wave-based positional encoding, and softmax and linear layers in the final decoder output to maintain position information and facilitate processing of long sequences efficiently.",
  "gold_answer_marketing": "The transformer model differs from traditional sequence-aligned recurrent architectures by not having a recurrent or convolutional structure, and instead making heavy use of self-attention. This allows for handling very long sequences efficiently and achieving better performance on tasks involving long texts."},
104: {"question": "What implications does the concept of a Neural Turing Machine have for the theoretical power of neural networks?",
  "gold_answer_research": "The concept of a Neural Turing Machine (NTM) expands the theoretical power of neural networks by incorporating external memory storage, allowing for more complex computations and tasks. This mimics the Turing machine tape, enabling the neural network to control operation heads for reading and writing to the tape. However, the finite memory in NTM suggests it may resemble more of a 'Neural von Neumann Machine,' limiting its mathematical limitlessness seen in traditional Turing machines. Overall, the addition of external memory in NTM enhances the capabilities and potential applications of neural networks in solving more advanced problems.",
  "gold_answer_marketing": "The concept of a Neural Turing Machine suggests that neural networks can be equipped with external memory storage for more complex operations, potentially increasing their theoretical power."},
}


test_questions = {
4: {"question": "When was the transformer architecture introduced, and by which organization?"},
5: {"question": "How has the accessibility of powerful language models, such as GPT-3 and GPT-4, been controlled by their developers?"},
6: {"question": "What benchmarks or ratings are used to compare the capabilities of different language models?"},
10: {"question": "What are some of the primary applications for language models in technology and computing?"},
14: {"question": "How are language models typically evaluated and what benchmarks are used for this purpose?"},
15: {"question": "What datasets are available for evaluating language processing systems?"},
21: {"question": "What collaborations with other companies have contributed to the development of Claude's capabilities?"},
26: {"question": "According to DeepMind, how should the number of training tokens change relative to the model size?"},
29: {"question": "How do the sizes of models in the Gopher family range?"},
31: {"question": "What type of model architecture do the Gopher and Chinchilla families belong to?"},
32: {"question": "Can you name the author who wrote the novels A Farewell to Arms and The Sun Also Rises?"},
37: {"question": "What are the key advantages of InstructGPT models over GPT-3 models according to the findings in the research?"},
40: {"question": "What metrics are used to compare the performance of different models on training and validation splits according to the document provided?"},
42: {"question": "What types of evaluation metrics are commonly used to assess the accuracy of answers in AI-driven question and answer datasets?"},
49: {"question": "What factors contribute to the performance improvement in retrieval-augmented language models compared to non-retrieval-augmented models?"},
56: {"question": "What are the benchmarks used to evaluate the performance of the Deep Policy Optimization (DPO) method compared to other preference learning algorithms in the document provided?"},
57: {"question": "What methodologies have been evaluated for training language models to align with human preferences, and how do they compare in terms of effectiveness?"},
58: {"question": "What methods have been discussed in the literature for improving the alignment of language models with human preferences or feedback?"},
66: {"question": "What are some of the evaluation metrics used for assessing different types of text generation tasks presented in the study?"},
68: {"question": "Consider a document related to research in natural language processing or artificial intelligence. Can you name some of the recent topics or methods that have been discussed or introduced in the field according to the document?"},
71: {"question": "What is the significance of using reflection tokens in a model like SELF-RAG?"},
72: {"question": "How does the inclusion of selected context as opposed to appending all retrieved text spans impact computational cost during both training and inference times in language model generation tasks?"},
77: {"question": "What are the benefits of modeling human biases in Human-Aware Loss Optimizations (HALOs), and how do they compare to non-HALOs on the same datasets?"},
79: {"question": "What are the modifications made to the traditional Kahneman-Tversky model to adapt it for optimizing language model performance?"},
83: {"question": "How does a model's ability to answer questions relate to its exposure to specific types of questions during training?"},
90: {"question": "How can adding examples to a prompt affect the performance of language models?"},
98: {"question": "What are the main components of a Neural Turing Machine (NTM) architecture?"},
99: {"question": "How might a seq2seq model's limitations be addressed in natural language processing tasks?"},
100: {"question": "What differentiates hard attention from soft attention in image processing algorithms?"},
}



## 1. Setup

We will first install a number of libraries and import what we will need.

In [1]:
%%capture
!pip -q install git+https://github.com/huggingface/transformers
!pip install -q datasets loralib sentencepiece
!pip -q install bitsandbytes accelerate
!pip -q install langchain
!pip install einops
!pip install faiss-gpu
!pip install langchain_community
!pip install --upgrade --quiet chromadb bs4 qdrant-client
!pip install langchainhub
!pip install -U langchain-huggingface
!pip install -U langchain-cohere
!pip install --upgrade --quiet  wikipedia
!pip install --upgrade --quiet  arxiv
!pip install --upgrade --quiet  pymupdf

!pip install xmltodict

!pip install cohere


In [2]:
import torch
import os
import bs4
import json
import numpy as np
import pandas as pd
import time


from pprint import pprint

import locale

from transformers import AutoTokenizer , AutoModelForCausalLM
from transformers import pipeline, BitsAndBytesConfig
from langchain_huggingface import HuggingFacePipeline
from langchain.llms import HuggingFacePipeline
from langchain_cohere import ChatCohere
from langchain import PromptTemplate, LLMChain
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain_core.output_parsers import StrOutputParser
from langchain import hub
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_community.vectorstores import Chroma
from langchain_community.vectorstores import Qdrant
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.utils.math import cosine_similarity

from langchain_community.document_loaders import ArxivLoader
from langchain_community.document_loaders import TextLoader
from langchain_community.document_loaders import WikipediaLoader
from langchain_community.document_loaders import OnlinePDFLoader
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.document_loaders import PubMedLoader

#from langchain_community.chat_models import ChatCohere

from google.colab import userdata



In [3]:
locale.getpreferredencoding = lambda: "UTF-8"

In [4]:
%%capture
!pip install sentence_transformers

In [5]:
COHERE_API_KEY = userdata.get('COHERE_API_KEY')

## 2. Data Acquisition


Next, we will need to retrieve the data.  We will work with three types of documents:

* A few papers from the ArXiv on RAG and NLP
* A few blogs from Lily Weng that talk about Open Domain Question Answering and related topics
* A number of Wikipedia articles on that topic

To make testing easier  we'll define a global record number so we can trace back to see which chunk came from which specific document.


In [17]:
#assign a unique number to each document we ingest
global_doc_number = 1

First we'll grab some papers from ArXiv.  We'll grab the pdf files and get all of the pages as separate documents.

In [18]:
arxiv_numbers = ('2005.11401', '2104.07567', '2104.09864', '2105.03011', '2106.09685', '2203.02155', '2211.09260', '2211.12561',
                 '2212.09741', '2305.14314', '2305.18290', '2306.15595', '2309.08872', '2309.15217', '2310.06825', '2310.11511',
                 '2311.08377', '2312.05708', '2401.06532', '2401.17268', '2402.01306', '2402.19473', '2406.04744')

In [19]:
all_arxiv_pages = []

#loop through the papers
for identifier in arxiv_numbers:
    # Construct URL using the arXiv unique identifier
    arx_url = f"https://arxiv.org/pdf/{identifier}.pdf"

    # Extract pages from the document and add them to the list of pages
    arx_loader = PyMuPDFLoader(arx_url)
    arx_pages = arx_loader.load()
    for page_num in range(len(arx_pages)):
        page = arx_pages[page_num]
        #CHANGED
        page.metadata['page_num'] = page_num
        page.metadata['doc_num'] = global_doc_number
        page.metadata['doc_source'] = "ArXiv"
        all_arxiv_pages.append(page)


    global_doc_number += 1

How many docs did we get?  Is that the correct number? And what is the content?

In [20]:
num_pages = len(all_arxiv_pages)
num_docs = global_doc_number - 1

print(f"{num_docs} documents in total")
print(f"{num_pages} pages in total")

23 documents in total
485 pages in total


In [None]:
all_arxiv_pages[5].page_content[:150]  # all pages of the Document content

'Table 1: Open-Domain QA Test Scores. For TQA,\nleft column uses the standard test set for Open-\nDomain QA, right column uses the TQA-Wiki\ntest set. See'

Next, let's get some information from Wikipedia on our main topic -- Gen AI.  LangChain provides a DocumentLoader that accesses the Wikipedia API.

In [21]:
wiki_docs = WikipediaLoader(query="Generative Artificial Intelligence", load_max_docs=4).load()
for idx, text in enumerate(wiki_docs):
    wiki_docs[idx].metadata['doc_num'] = global_doc_number
    wiki_docs[idx].metadata['doc_source'] = "Wikipedia"

global_doc_number += 1

print('Number of documents: ', len(wiki_docs))

#index docs
wiki_splits = text_splitter.split_documents(wiki_docs)
for idx, text in enumerate(wiki_splits):
    wiki_splits[idx].metadata['split_id'] = idx

print('Number of splits/chunks: ', len(wiki_splits))


Number of documents:  4
Number of splits/chunks:  153


Same with a couple of other queries:

In [22]:
wiki_docs = WikipediaLoader(query="Information Retrieval", load_max_docs=4).load()
for idx, text in enumerate(wiki_docs):
    wiki_docs[idx].metadata['doc_num'] = global_doc_number
    wiki_docs[idx].metadata['doc_source'] = "Wikipedia"

global_doc_number += 1

print('Number of documents: ', len(wiki_docs))

#index docs
wiki_splits = text_splitter.split_documents(wiki_docs)
for idx, text in enumerate(wiki_splits):
    wiki_splits[idx].metadata['split_id'] = idx

print('Number of splits/chunks: ', len(wiki_splits))

Number of documents:  4
Number of splits/chunks:  160


And yet another related Wikipedia article.

In [23]:
wiki_docs = WikipediaLoader(query="Large Language Models", load_max_docs=4).load()
for idx, text in enumerate(wiki_docs):
    wiki_docs[idx].metadata['doc_num'] = global_doc_number
    wiki_docs[idx].metadata['doc_source'] = "Wikipedia"

global_doc_number += 1

print('Number of documents: ', len(wiki_docs))

#index docs
wiki_splits = text_splitter.split_documents(wiki_docs)
for idx, text in enumerate(wiki_splits):
    wiki_splits[idx].metadata['split_id'] = idx

print('Number of splits/chunks: ', len(wiki_splits))

Number of documents:  4
Number of splits/chunks:  162


We'll also augment our collection with some blog entries about Open Domain Question Answering, of which RAG is an approach, and some related topics in case users want to ask how the new Search system works.

In [24]:
web_loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2020-10-29-odqa/",
               "https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/",
               "https://lilianweng.github.io/posts/2018-06-24-attention/",
               "https://lilianweng.github.io/posts/2023-06-23-agent/",
               "https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/"),

    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)

web_documents = web_loader.load()

for idx, text in enumerate(web_documents):
    web_documents[idx].metadata['doc_num'] = global_doc_number
    web_documents[idx].metadata['doc_source'] = "WWW"
global_doc_number += 1

print('Number of documents: ', len(web_documents))


Number of documents:  5


Now we will begin to chunk this data. Let us first start with a default  chunk size and overlap, as well as the type of splitter. 

In [None]:
#Note that these defaults may or may not be ideal!
CHUNK_SIZE=128
OVERLAP=0

text_splitter = RecursiveCharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=OVERLAP)

In [25]:
web_splits = text_splitter.split_documents(web_documents)

for idx, text in enumerate(web_splits):
    web_splits[idx].metadata['split_id'] = idx

print('Number of splits: ', len(web_splits))

Number of splits:  2103


In [None]:
web_splits[35]

## 3. Experiments

### Experimentation with Chunking

Currently, we are using the default chunk size of 128 and an overlap of 0. However, when examining the chunks produced with these settings, it appears that a lot of the chunks quite short and are cut off mid sentence. This causes information loss and many of the chunks to be less useful.

As an example, with the current settings, the following sentences are split as:

Chunk 1: Self-reflection is a vital aspect that allows autonomous agents to improve iteratively by refining past action decisions and

Chunk 2: correcting previous mistakes. It plays a crucial role in real-world tasks where trial and error are inevitable.


We can see two issues here. First, Chunk 2 has no way to tell that "It" is refering to "Self Reflection", as it is not referenced anywhere in the chunk, so we are not able to understand the subject that the rest of the chunk is describing. Second, the information provided by Chunk 1 is less useful since it is missing part of the sentence.

Our main goal with this section will be to address these two issues. Increasing the chunk size should help with the chunks being cut off mid sentences. The increased chunk size and adding a chunk overlap should help to increase the chance of grouping sentences that are dependent on each other in the same chunk, or at least maintain the subject of the sentence across chunk boundaries. Just increasing the chunk size and adding a chunk overlap may not be enough to solve issue, and we may need to experiment with more sophisticated methods of chunking that involve grouping sentences based on the semantic meaning if this initial plan is insufficient in creating effective chunks.

Also, since we have different document types, we will also be using different text_splitter classes from langchain to handle each document type. This helps create better chunks since different document types often use different formatting characters. Lets start with the blog post documents using markdown:

In [None]:
#%%capture
#!pip install unstructured
# !pip uninstall pdfminer
# !pip install pdfminer.six

In [27]:
from langchain.text_splitter import MarkdownTextSplitter, LatexTextSplitter
from langchain_experimental.text_splitter import SemanticChunker
#from unstructured.partition.pdf import partition_pdf

Document variables: all_arxiv_pages, wiki_docs, web_documents. Let's start with the web_documents (blog posts) first.

In [28]:
markdown_text_splitter = MarkdownTextSplitter(
    chunk_size=650,
    chunk_overlap=50
)

markdown_web_splits = markdown_text_splitter.split_documents(web_documents)
print('Number of splits/chunks: ', str(len(markdown_web_splits)))

Number of splits/chunks:  449


Lets see how many of our chunks now end in an alpha numeric character, indicating our sentence was potentially cut off mid sentence.

In [27]:
def compute_percentage_alphanumeric(splits):
  count = 0
  for i in range(len(splits)):
    if splits[i].page_content == "":
      continue
    last_char = splits[i].page_content[-1]
    if last_char.isalnum():
      count += 1

  return count / len(splits)

print("Default Splits: " + str(compute_percentage_alphanumeric(web_splits)))
print("Markdown Splits: " + str(compute_percentage_alphanumeric(markdown_web_splits)))

Default Splits: 0.43271516880646693
Markdown Splits: 0.14471544715447154


Nice! This is a huge improvement from the default text splitter. From experimenting more with the chunk size, I found that the percentage chunks that end in an alpha numeric value decreases even further if the chunk size is further increased. However, we also don't want the chunk size to be too big or else we will lose context precision as well as increased computational costs. 500 as a chunk size seemed to be a good balance.

Lets check how our chunks look, and more specifically for the chunks that do end in an alphanumeric value to see if they really are cut off mid sentence.

In [None]:
def find_alphanumeric_ending_chunks(splits):
  res = []
  for i in range(len(splits)):
    last_char = splits[i].page_content[-1]
    if last_char.isalnum():
      res.append(splits[i].page_content)

  return res

alphanumeric_ending_chunks = find_alphanumeric_ending_chunks(markdown_web_splits)
alphanumeric_ending_chunks[0:5]

['How to Build an Open-Domain Question Answering System?\n    \nDate: October 29, 2020  |  Estimated Reading Time: 33 min  |  Author: Lilian Weng',
 'Assume we have access to a powerful pretrained language model.\nWe do not cover how to use structured knowledge base (e.g. Freebase, WikiData) here.\nWe only focus on a single-turn QA instead of a multi-turn conversation style QA.\nWe mostly focus on QA models that contain neural networks, specially Transformer-based language models.\nI admit that I missed a lot of papers with architectures designed specifically for QA tasks between 2017-2019\uf8ffüòî',
 'BERTserini (Yang et al., 2019) pairs the open-source Anserini IR toolkit as the retriever with a fine-tuned pre-trained BERT model as the reader. The top $k$ documents ($k=10$) are retrieved via the post-v3.0 branch of Anserini with the query treated as a bag of words. The retrieved text segments are ranked by BM25, a classic TF-IDF-based retrieval scoring function. In terms of the effec

The sentences do appear to be cut off mid sentence. This is not ideal, but the percentage of sentences that are cut off are a lot lower now, so we will live with this for now. With more time, I would want to dig more into ways to further improve this.

Now let's look at the chunks retrieved based on a query. For embedding model, we are allowed to use 'all-mpnet-base-v2', 'all-MiniLM-L6-v2', 'multi-qa-mpnet-base-dot-v1', 'all-distilroberta-v1', and 'avsolatorio/GIST-Embedding-v0'. Let's start with 'all-mpnet-base-v2' first and then do more experimentation later.

In [None]:
base_embeddings = HuggingFaceEmbeddings(model_name="all-mpnet-base-v2")

In [None]:
vectorstore = Qdrant.from_documents(markdown_web_splits,
    base_embeddings,
    location=":memory:",  # Local mode with in-memory storage only
    collection_name="markdown_test",
)
retriever = vectorstore.as_retriever()

In [None]:
query = "What is Chain of Thought doing?"
docs = vectorstore.similarity_search_by_vector(base_embeddings.embed_query(query)) # will rank the splits

In [None]:
docs

[Document(metadata={'source': 'https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/', 'doc_num': 27, 'doc_source': 'WWW', '_id': 'dafd6d0de84140baa5801dae06fe9a7f', '_collection_name': 'markdown_test'}, page_content='Chain-of-Thought (CoT)#\nChain-of-thought (CoT) prompting (Wei et al. 2022) generates a sequence of short sentences to describe reasoning logics step by step, known as reasoning chains or rationales, to eventually lead to the final answer. The benefit of CoT is more pronounced for complicated reasoning tasks, while using large models (e.g. with more than 50B parameters). Simple tasks only benefit slightly from CoT prompting.\nTypes of CoT prompts#\nTwo main types of CoT prompting:'),
 Document(metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'doc_num': 27, 'doc_source': 'WWW', '_id': 'ef26fba1534d46cf8b7dde57a8998702', '_collection_name': 'markdown_test'}, page_content='Tree of Thoughts (Yao et al. 2023) extends CoT by exploring mult

Chunks look good and provide sufficient information to answer the inputted question. Let's move on to the wikipedia documents now:

Wikipedia uses a mark up language call WikiText. There is currently not an existing text splitter class for this language in LangChain, but we can achieve the same thing by using the RecursiveCharacterTextSplitter class and add separators specific to the WikiText language. Here we use the common text separators as well as "==" which is WikiText indicates a section header.

In [29]:
wiki_text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=650,
    chunk_overlap=50,
    separators=["\n\n", "\n", "==", ".", " ", ""]
)

wiki_web_splits = wiki_text_splitter.split_documents(wiki_docs)
print('Number of splits/chunks: ', str(len(wiki_web_splits)))

Number of splits/chunks:  40


In [None]:
vectorstore = Qdrant.from_documents(wiki_web_splits,
    base_embeddings,
    location=":memory:",  # Local mode with in-memory storage only
    collection_name="wiki_test",
)
retriever = vectorstore.as_retriever()

In [None]:
query = "What is Chain of Thought doing?"
docs = vectorstore.similarity_search_by_vector(base_embeddings.embed_query(query)) # will rank the splits
for doc in docs:
  print(doc.page_content + "\n")

In the supervised learning phase, the model generates responses to prompts, self-critiques these responses based on a set of guiding principles (a "constitution"), and revises the responses. Then the model is fine-tuned on these revised responses.

Claude 3 has seemed to perform meta-cognitive reasoning, including the ability to realize it is being artificially tested during needle in a haystack tests.

. This technique is similar to reinforcement learning from human feedback (RLHF), except that the comparisons used to train the preference model are AI-generated, and that they are based on the constitution.

. LLMs can be used for text generation, a form of generative AI, by taking an input text and repeatedly predicting the next token or word.



The chunks pulled don't seem to be very relevant to Chain of Thought, but looking at the wikipedia articles that we loaded in, none of them are containing information on Chain of Thought so this is to be expected. Let's try a different query.

In [None]:
query = "When were the gpt models created?"
docs = vectorstore.similarity_search_by_vector(base_embeddings.embed_query(query)) # will rank the splits
for doc in docs:
  print(doc.page_content + "\n")

Competing language models have for the most part been attempting to equal the GPT series, at least in terms of number of parameters.
Since 2022, source-available models have been gaining popularity, especially at first with BLOOM and LLaMA, though both have restrictions on the field of use. Mistral AI's

Although decoder-only GPT-1 was introduced in 2018, it was GPT-2 in 2019 that caught widespread attention because OpenAI at first deemed it too powerful to release publicly, out of fear of malicious use. GPT-3 in 2020 went a step further and as of 2024 is available only via API with no offering of downloading the model to execute locally. But it was the 2022 consumer-facing browser-based ChatGPT that captured the imaginations of the general population and caused some media hype and online buzz

. The 2023 GPT-4 was praised for its increased accuracy and as a "holy grail" for its multimodal capabilities. OpenAI did not reveal high-level architecture and the number of parameters of GPT-4

In [None]:
compute_percentage_alphanumeric(wiki_web_splits)

0.2692307692307692

Chunks look good for this query and provide sufficient information to answer the inputted question.
Let's move on to the Arxiv documents now. The Arxiv documents are in pdf form, which have a fairly standard text formatting, so let try using the  RecursiveCharacterTextSplitter class with the default separators.

In [31]:
arxiv_text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=650,
    chunk_overlap=50,
)

arxiv_splits = arxiv_text_splitter.split_documents(all_arxiv_pages)
print('Number of splits/chunks: ', str(len(arxiv_splits)))

Number of splits/chunks:  3106


In [None]:
vectorstore = Qdrant.from_documents(arxiv_splits,
    base_embeddings,
    location=":memory:",  # Local mode with in-memory storage only
    collection_name="arxiv_test",
)
retriever = vectorstore.as_retriever()

In [None]:
#query = "What is Chain of Thought doing?"
query = "What is a GAN?"
docs = vectorstore.similarity_search_by_vector(base_embeddings.embed_query(query)) # will rank the splits
for doc in docs:
  print(doc.page_content + "\n")

late and generate realistic images, audio, and other data [66].
As shown in Fig. 2, a typical GAN consists of two main
components: a generator and a discriminator. These two parts
compete with each other through adversarial learning, allowing
the generator to continuously improve its ability to generate re-
alistic samples, while the discriminator continuously improves
its ability to distinguish between true and false samples.
C. Retriever

generative models that can create realistic and diverse samples
of data (including images, texts, videos, molecules, etc.) [65].
As shown in Fig. 2, diffusion models work by gradually adding
noise to data until it becomes random, then reversing the
process to generate new data from noise. This process is based
on probabilistic modeling and neural networks.
4) GAN: Generative Adversarial Networks (GANs) [14]
are highly anticipated deep learning models which can simu-

images even for rare or unseen subjects, but also reduces the
parameter count and c

In [None]:
compute_percentage_alphanumeric(arxiv_splits)

0.6282592500620809

Chunks do not appear to provide useful information and many sentences are cut off. Let try a different approach using the SemanticChunker class.

In [None]:
arxiv_semantic_splitter = SemanticChunker(base_embeddings)

arxiv_splits = arxiv_semantic_splitter.split_documents(all_arxiv_pages)
print('Number of splits/chunks: ', str(len(arxiv_splits)))

Number of splits/chunks:  1693


In [None]:
vectorstore = Qdrant.from_documents(arxiv_splits,
    base_embeddings,
    location=":memory:",  # Local mode with in-memory storage only
    collection_name="arxiv_test",
)
retriever = vectorstore.as_retriever()

In [None]:
query = "What is Chain of Thought doing?"
docs = vectorstore.similarity_search_by_vector(base_embeddings.embed_query(query)) # will rank the splits
for doc in docs:
  print(doc.page_content + "\n")

Zhou. Chain of thought prompting elicits reasoning in large language models. In A. H. Oh, A. Agarwal,
D. Belgrave, and K. Cho, editors, Advances in Neural Information Processing Systems, 2022. URL
https://openreview.net/forum?id=_VjQlMeSB_J. S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze, S. Gehrmann, P. Kambadur, D. Rosenberg, and
G. Mann. Bloomberggpt: A large language model for finance, 2023. K. Yang, Y. Tian, N. Peng, and D.

19
[130] M. Xia, S. Malladi, S. Gururangan et al., “LESS: selecting influential
data for targeted instruction tuning,” arXiv:2402.04333, 2024. [131] A.-L. Bornea, F. Ayed et al., “Telco-rag: Navigating the challenges
of retrieval-augmented language models for telecommunications,”
arXiv:2404.15939, 2024. [132] S. Yao, J. Zhao, D. Yu et al., “React: Synergizing reasoning and acting
in language models,” in ICLR, 2023. [133] J. Wei, X. Wang, D. Schuurmans et al., “Chain-of-thought prompting
elicits reasoning in large language models,” in NeurIPS, 2022. [134] T.

In [None]:
compute_percentage_alphanumeric(arxiv_splits)

0.0

Chunks are no longer cut off mid sentence, although the content doesn't seem that helpful. Might just be the same issue where Chain of Thought isn't described in the arvix papers. Lets see if it does better on a different prompt.

In [None]:
query = "What is a GAN model?"
docs = vectorstore.similarity_search_by_vector(base_embeddings.embed_query(query)) # will rank the splits
for doc in docs:
  print(doc.page_content + "\n")

Overview
As shown in Fig. 1, the entire RAG system consists of
two core modules: the retriever and the generator, where the
retriever searches for relevant information from the data store
and the generator produces the required contents. The RAG
process unfolds as follows: (i) the retriever initially receives
the input query and searches for relevant information; (ii) then,
the original query and the retrieval results are fed into the
generator through a specific augmentation methodology; (iii)
finally, the generator produces the desired outcomes. B. Generator
The remarkable performance of generative AI across di-
verse tasks has ushered in the era of AIGC. The generation
module plays a crucial role within the RAG system. Different
generative models are applied for different scenarios, such
as transformer models for text-to-text tasks, VisualGPT [62]
for image-to-text tasks, Stable Diffusion [10] for text-to-
image tasks, Codex [2] for text-to-code tasks, etc. Here we
introduce 4 typic

Chunks retrieved don't align with the best content for answering the query. There is one paragraph in 2402.19473 that was directly addressing what a GAN model is. Lets try other embedding engines to see if we can get more informative chunks retrieved.

Here is the paragraph we want returned:

> "Generative Adversarial Networks (GANs) [14] are highly anticipated deep learning models which can simulate and generate realistic images, audio, and other data [66]. As shown in Fig. 2, a typical GAN consists of two main components: a generator and a discriminator. These two parts compete with each other through adversarial learning, allowing the generator to continuously improve its ability to generate realistic samples, while the discriminator continuously improves its ability to distinguish between true and false samples."


In [30]:
# load in embedding options
%%capture
mpnet_v2_embeddings = HuggingFaceEmbeddings(model_name="all-mpnet-base-v2")
mini_lm_embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
mpnet_qa_v1_embeddings = HuggingFaceEmbeddings(model_name="multi-qa-mpnet-base-dot-v1")
roberta_embeddings = HuggingFaceEmbeddings(model_name="all-distilroberta-v1")
gist_embeddings = HuggingFaceEmbeddings(model_name="avsolatorio/GIST-Embedding-v0")

Here we will test the cosine similarities of the query and desired chunk using the different embeddings.

In [None]:
embedding_models_mapping = {"mpnet_v2_embeddings": mpnet_v2_embeddings,
                            "mini_lm_embeddings": mini_lm_embeddings,
                            "mpnet_qa_v1_embeddings": mpnet_qa_v1_embeddings,
                            "roberta_embeddings": roberta_embeddings,
                            "gist_embeddings": gist_embeddings}

for name, model in embedding_models_mapping.items():
  text = "What is a GAN model?"
  query_result = model.embed_query(text)
  doc_result = model.embed_documents(["Generative Adversarial Networks (GANs) [14] are highly anticipated deep learning models which can simulate and generate realistic images, audio, and other data [66]. As shown in Fig. 2, a typical GAN consists of two main components: a generator and a discriminator. These two parts compete with each other through adversarial learning, allowing the generator to continuously improve its ability to generate realistic samples, while the discriminator continuously improves its ability to distinguish between true and false samples."])
  similarity = cosine_similarity([query_result], doc_result)[0]
  print(name + ": " + str(similarity))

mpnet_v2_embeddings: [0.61145242]
mini_lm_embeddings: [0.66174942]
mpnet_qa_v1_embeddings: [0.65688943]
roberta_embeddings: [0.53257887]
gist_embeddings: [0.85950966]


Looks like the GIST embeddings has a lot higher of a similiarity score compared to the other embedding models. Lets try to use this embedding model with the SemanticChunker and vector store to see if we can get better chunks.

In [32]:
arxiv_semantic_splitter = SemanticChunker(gist_embeddings)

arxiv_splits = arxiv_semantic_splitter.split_documents(all_arxiv_pages)
print('Number of splits/chunks: ', str(len(arxiv_splits)))

Number of splits/chunks:  1693


In [None]:
vectorstore = Qdrant.from_documents(arxiv_splits,
    gist_embeddings,
    location=":memory:",  # Local mode with in-memory storage only
    collection_name="arxiv_test",
)
retriever = vectorstore.as_retriever()

In [None]:
query = "What is a GAN model?"
docs = vectorstore.similarity_search_by_vector(gist_embeddings.embed_query(query)) # will rank the splits
for doc in docs:
  print(doc.page_content + "\n")

V. Houdt et al., “A review on the long short-term memory model,”
Artif. Intell. Rev., vol. 53, no. 8, pp. 5929–5955, 2020. [65] L. Yang, Z. Zhang et al., “Diffusion models: A comprehensive survey
of methods and applications,” CSUR, vol. 56, no. 4, pp. 1–39, 2023. [66] J. Gui, Z. Sun, Y. Wen et al., “A review on generative adversarial
networks: Algorithms, theory, and applications,” TKDE, vol. 35, no. 4,
pp. 3313–3332, 2023. [67] S. E. Robertson and S. Walker, “On relevance weights with little
relevance information,” in SIGIR, 1997. [68] J.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. JMLR. Ben Wang and Aran Komatsuzaki. 2021. GPT-J-
6B: A 6 Billion Parameter Autoregressive Lan-
guage Model. https://github.com/kingoflolz/
mesh-transformer-jax. Liang Wang, Nan Yang, Xiaolong Huang, Binxing
Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder,
and Furu Wei.

Gan. Salmon: Self-alignment
with principle-following reward models, 2023. Gemini Team. Gemini: A fami

The retriever is still not giving useful chunks for answering what a GAN model is. Let find the chunk with the desired information to see what it looks like.

In [None]:
for i, split in enumerate(arxiv_splits):
  if "Generative Adversarial Networks" in split.page_content:
    print(i)
    print(split)

1499
page_content='(email: zeus@tencent.com)
novel model algorithms, explosive scale of foundation models,
and massive high-quality datasets. Specifically, sequence-to-
sequence tasks have transitioned from utilizing Long Short-
Term Memory (LSTM) networks [12] to Transformer-based
models [13], and image-generation tasks have shifted from
Generative Adversarial Networks (GANs) [14] to Latent Dif-
fusion Models (LDMs) [10] as well. Notably, the architecture
of foundation models, initially constituted by millions of
parameters [15], [16], has now grown to billions or even
trillions of parameters [1], [4], [17]. These advancements
are further bolstered by the availability of rich, high-quality
datasets [1], [18], which provide ample training samples to
fully optimize model parameters. Information retrieval is another pivotal application within
the field of computer science. Different from generation,
retrieval aims to locate relevant existing objects from a vast
pool of resources. The mos

We can see that the information we want is in chunk 1505. However, the chunk is so big that the other text in the chunk is lowering the similiarity score and causing the desired chunk to not be returned in the vector search. Let's see if we can chunk again to create smaller chunks. I notice that the pdf files are written in Latex, so I will use a Latex specific text splitter here instead of the default RecursiveCharacterTextSplitter.

In [33]:
arxiv_latex_splitter = LatexTextSplitter(chunk_size=500, chunk_overlap=50)

arxiv_latex_splits = arxiv_latex_splitter.split_documents(arxiv_splits)
print('Number of splits/chunks: ', str(len(arxiv_latex_splits)))

Number of splits/chunks:  4700


In [None]:
vectorstore = Qdrant.from_documents(arxiv_latex_splits,
    gist_embeddings,
    location=":memory:",  # Local mode with in-memory storage only
    collection_name="arxiv_test",
)
retriever = vectorstore.as_retriever()

In [None]:
query = "What is a GAN model?"
docs = vectorstore.similarity_search_by_vector(gist_embeddings.embed_query(query)) # will rank the splits
for doc in docs:
  print(doc.page_content + "\n")

outputs. 3) Diffusion Model: Diffusion models are a family of deep
generative models that can create realistic and diverse samples
of data (including images, texts, videos, molecules, etc.) [65]. As shown in Fig. 2, diffusion models work by gradually adding
noise to data until it becomes random, then reversing the
process to generate new data from noise. This process is based
on probabilistic modeling and neural networks. 4) GAN: Generative Adversarial Networks (GANs) [14]
are highly

Adversarial Networks (GANs) [14]
are highly anticipated deep learning models which can simu-
late and generate realistic images, audio, and other data [66]. As shown in Fig. 2, a typical GAN consists of two main
components: a generator and a discriminator. These two parts
compete with each other through adversarial learning, allowing
the generator to continuously improve its ability to generate re-
alistic samples, while the discriminator continuously improves
its ability to distinguish between

machine l

In [None]:
compute_percentage_alphanumeric(arxiv_latex_splits)

0.6106382978723405

Now the chunks returned are giving good information, but a majority of the sentences are cut off. I experimented with increasing the chunk size, which helped, but the number of sentences that were cut off were still high. We will live with this for now. After testing with our models, if performance of the retrieval aspect is insufficient, we will revisit this. We may need to implement a custom function that generate chunks that are sentence aware.

### Final Embeddings, Chunking Strategy, and Vector Store

Here we will create the final vector store using the following settings:
1. GIST embeddings
2. For blog posts: MarkdownTextSplitter with chunk size of 650 and chunk overlap of 50
3. For wikipedia docs: RecursiveTextSplitter with WikiText specfic separators and a chunk size of 650 and chunk overlap of 50
4. For Arxiv pdf files: Chunk with SemanticChunker first, then LatexTextSplitter with chunk size of 650 and chunk overlap of 50

In [34]:
gist_embeddings = HuggingFaceEmbeddings(model_name="avsolatorio/GIST-Embedding-v0")

In [35]:
markdown_text_splitter = MarkdownTextSplitter(
    chunk_size=650,
    chunk_overlap=50
)

markdown_web_splits = markdown_text_splitter.split_documents(web_documents)
print('Number of splits/chunks: ', str(len(markdown_web_splits)))

Number of splits/chunks:  449


In [36]:
wiki_text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=650,
    chunk_overlap=50,
    separators=["\n\n", "\n", "==", ".", " ", ""]
)

wiki_web_splits = wiki_text_splitter.split_documents(wiki_docs)
print('Number of splits/chunks: ', str(len(wiki_web_splits)))

Number of splits/chunks:  40


In [37]:
arxiv_semantic_splitter = SemanticChunker(gist_embeddings)
arxiv_latex_splitter = LatexTextSplitter(chunk_size=650, chunk_overlap=50)

arxiv_semantic_splits = arxiv_semantic_splitter.split_documents(all_arxiv_pages)
arxiv_latex_splits = arxiv_latex_splitter.split_documents(arxiv_semantic_splits)

print('Number of splits/chunks: ', str(len(arxiv_latex_splits)))

Number of splits/chunks:  3794


In [38]:
all_chunks = markdown_web_splits + wiki_web_splits + arxiv_latex_splits
qdrant_vectorstore = Qdrant.from_documents(all_chunks,
    base_embeddings,
    location=":memory:",  # Local mode with in-memory storage only
    collection_name="rag_tech_db",
    force_recreate=True
)

retriever = qdrant_vectorstore.as_retriever()

### Experimentation with Prompt Engineering

Next, we will build better prompts for our use case. The main strategy I will be using here is to create a set of prompt components, one specific to engineers and one specific to marketing. Then I will test the different permutations of the components, and select the final prompt that has the best evaluation metrics.

Areas of emphasis for the prompt include:
* Audience part is important -> describe purpose of the user asking to question. Why are they asking the question?
* Output Charactistics -> High Level vs Technical, word choice
* Response length -> compute average and max response length of validation set and have llm try to match response level as a way to determine how detailed the responses should be.


* The engineers, who require pretty detailed information when they ask questions
  * Help engineer to explain (or implement) the technical intricate details of a gen ai concept with your response
  * Slightly longer
* The marketing team and supporting staff who also will ask questions around GenAI in order to better understand the products and the field as a whole, but a lot more high level answers would likely be in order
  * Give enough information to the user in order to promote the company’s gen ai products to the clients

In [None]:
# compute average number of sentences in validation answers
total_sentences_eng = 0
total_sentences_marketing = 0
max_sentence_length_eng = 0
max_sentence_length_marketing = 0
for i in validation_questions_answers.keys():
  # get number of sentences in the gold answer
  cur_eng_sentence_length = len(validation_questions_answers[i]["gold_answer_research"].split("."))
  cur_marketing_sentence_length = len(validation_questions_answers[i]["gold_answer_marketing"].split("."))

  # add to running count
  total_sentences_eng += cur_eng_sentence_length
  total_sentences_marketing += cur_marketing_sentence_length

  # update max sentence length if necessary
  if cur_eng_sentence_length > max_sentence_length_eng:
    max_sentence_length_eng = cur_eng_sentence_length
  if cur_marketing_sentence_length > max_sentence_length_marketing:
    max_sentence_length_marketing = cur_marketing_sentence_length


print("Average number of sentences in validation answers for engineers: " + str(total_sentences_eng / len(validation_questions_answers)))
print("Average number of sentences in validation answers for marketing: " + str(total_sentences_marketing / len(validation_questions_answers)))
print("Max number of sentences in validation answers for engineers: " + str(max_sentence_length_eng))
print("Max number of sentences in validation answers for marketing: " + str(max_sentence_length_marketing))

Average number of sentences in validation answers for engineers: 4.733333333333333
Average number of sentences in validation answers for marketing: 2.493333333333333
Max number of sentences in validation answers for engineers: 7
Max number of sentences in validation answers for marketing: 5


Let's start with creating the prompt components for the engineering pipeline.

In [39]:
# function to create prompt permutations
def create_prompt_permutations(role, task, audience, audience_characteristics, audience_whys, output_characteristics, nots, mollick):
  default_end = """
  Answer the question based on the following context:
  {context}

  Question: {question}
  [/INST]
  """

  permutations = []
  for audience_char in audience_characteristics:
    for audience_why in audience_whys:
      for output_char in output_characteristics:
        prompt = "[INST] " + role + " " + task + " " + audience + " " + audience_char + " " + audience_why + " " + output_char + " " + nots + " " + mollick + default_end
        permutations.append(prompt)

  return permutations

In [40]:
prompt_role = "You are a question-answering assistant for Generative AI topics."
prompt_task = "Your task is to provide factual, concise, and informative answers to Generative AI related questions. Use information in the provided context to answer the question."
prompt_audience = "You will be answering questions for a team of engineers working at a startup creating new Generative AI products."
prompt_audience_characteristics = ["The engineers are new to the Generative AI field.",
                                   "The engineers have a graduate student level understanding of traditional machine learning, but are not familiar with advanced Generative AI topics.",
                                   "The engineers have a graduate student level understanding of Generative AI, so you can assume they are familiar with advanced topics."]
prompt_audience_why = ["The engineers are asking questions in order to gain the technical knowledge around Generative AI topics.",
                       "The engineers are asking questions in order to gain the technical knowledge necessary to build new Generative AI products.",
                       "The engineers need detailed, technical explanations to conceptual questions in order to guide them on how to design and implement their Generative AI product."]
prompt_output_characteristics = ["Your response should include clarity, technical depth, and precision.",
                                 "Your response should break complex concepts into smaller, easily digestable parts.",
                                 "Your response should help the engineers in staying updated with the latest technological advancements.",
                                 "Your response should include clarity, technical depth, and precision. Your response should break complex concepts into smaller, easily digestable parts. Your response should help the engineers in staying updated with the latest technological advancements."]
prompt_nots = "Do not use information that is not in the provided context to answer the question. Do not use more than 7 sentences in your answer. Do not say 'Based on the context'"
prompt_mollick = "You are very capable."

In [41]:
eng_prompts = create_prompt_permutations(prompt_role, prompt_task, prompt_audience, prompt_audience_characteristics, prompt_audience_why, prompt_output_characteristics, prompt_nots, prompt_mollick)
print("Total number of Eng Prompts: " + str(len(eng_prompts)))
print(eng_prompts[0])

Total number of Eng Prompts: 36
[INST] You are a question-answering assistant for Generative AI topics. Your task is to provide factual, concise, and informative answers to Generative AI related questions. Use information in the provided context to answer the question. You will be answering questions for a team of engineers working at a startup creating new Generative AI products. The engineers are new to the Generative AI field. The engineers are asking questions in order to gain the technical knowledge around Generative AI topics. Your response should include clarity, technical depth, and precision. Do not use information that is not in the provided context to answer the question. Do not use more than 7 sentences in your answer. Do not say 'Based on the context' You are very capable.
  Answer the question based on the following context:
  {context}

  Question: {question}
  [/INST]
  


Now lets do the same thing for the marketing pipeline. I will create less prompt combinations for the marketing team as there are less people on the team and I want to save on computation costs.

In [42]:
marketing_prompt_role = "You are a question-answering assistant for Generative AI topics."
marketing_prompt_task = "Your task is to provide factual, concise, and informative answers to Generative AI related questions. Use information in the provided context to answer the question."
marketing_prompt_audience = "You will be answering questions for the marketing team working at a startup creating new Generative AI products."
marketing_prompt_audience_characteristics = ["The marketing team is new to the Generative AI field."]
marketing_prompt_audience_why = ["The marketing team is asking questions in order to gain a high level overview around Generative AI topics.",
                       "The marketing team is asking questions in order to gain the high level knowledge necessary to promote their new Generative AI products.",
                       "The marketing team is asking questions around Generative AI in order to better understand the products and the field as a whole, but a lot more high level answers would likely be in order."]
marketing_prompt_output_characteristics = ["Your response should be high level, concise, and nontechnical.",
                                 "Your response should break complex concepts into smaller, nontechnical parts that are easy for someone who does not have a background in Generative AI to understand.",
                                 "Your response should be around 2 to 3 sentences long.",
                                 "Your response should be high level, concise, and nontechnical. Your response should break complex concepts into smaller, nontechnical parts that are easy for someone who does not have a background in Generative AI to understand. Your response should be around 2 to 3 sentences long."]
marketing_prompt_nots = "Do not use information that is not in the provided context to answer the question. Do not use more than 5 sentences in your answer. Do not say 'Based on the context'."
marketing_prompt_mollick = "You are very capable."


In [43]:
marketing_prompts = create_prompt_permutations(marketing_prompt_role, marketing_prompt_task, marketing_prompt_audience, marketing_prompt_audience_characteristics, marketing_prompt_audience_why, marketing_prompt_output_characteristics, marketing_prompt_nots, marketing_prompt_mollick)
print("Total number of Marketing Prompts: " + str(len(marketing_prompts)))
print(marketing_prompts[0])

Total number of Marketing Prompts: 12
[INST] You are a question-answering assistant for Generative AI topics. Your task is to provide factual, concise, and informative answers to Generative AI related questions. Use information in the provided context to answer the question. You will be answering questions for the marketing team working at a startup creating new Generative AI products. The marketing team is new to the Generative AI field. The marketing team is asking questions in order to gain a high level overview around Generative AI topics. Your response should be high level, concise, and nontechnical. Do not use information that is not in the provided context to answer the question. Do not use more than 5 sentences in your answer. Do not say 'Based on the context'. You are very capable.
  Answer the question based on the following context:
  {context}

  Question: {question}
  [/INST]
  


## 4. Key Runs and Evaluation

### Define Metrics



For my main metric, I chose to use a weighted evaluation metric, consisting of two parts: similarity to the reference answer and general RAG performance. To evaluate the similiarity to the evaluation metric, I chose to use the BERTScore and cosine similarity between generated and reference answers. I decided against including other traditional NLP metrics like BLEU and ROUGE due to their limitation from word matching. The closeness of generated answers to reference material is important, thus these metrics were weighted at 65%. Additionally, the RAGAs framework was utilized to evaluate general RAG performance, focusing on answer relevance, faithfulness, and context relevancy, and was weighted at 35% to provide a comprehensive assessment of the system's capabilities.

The final equation for computing the weighted evaluation score is as follows:

0.65 * ((BERTScore + Cosine Sim) / 2) + 0.35 * ((Faithfulness + Answer Relevance + Context Relevance) / 3)

In [44]:
%%capture
!pip install bert-score
!pip install ragas==0.1.12

In [45]:
from bert_score import BERTScorer
from ragas.metrics import answer_relevancy, faithfulness, context_precision, context_recall
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas import evaluate
from datasets import Dataset

In [46]:
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')

In [47]:
def compute_bert_score(generated_answers, reference_answers):
  # Initialize the BERTScorer
  scorer = BERTScorer(lang="en")

  # Compute the scores
  P, R, F1 = scorer.score(generated_answers, reference_answers)

  # Return values
  return P, R, F1

def compute_cosine_similarity(generated_answers, reference_answers):
  res = []

  for gen_answer, ref_answer in zip(generated_answers, reference_answers):
    gen_answer_emb = base_embeddings.embed_query(gen_answer)
    ref_answer_emb = base_embeddings.embed_query(ref_answer)
    res.append(cosine_similarity([gen_answer_emb], [ref_answer_emb])[0])

  res = np.array(res)
  avg_similiarity = np.mean(res)

  return avg_similiarity


def compute_ragas_metrics(questions, generated_answers, reference_answers, contexts):
  """
    questions: list of questions
    generated_answers: list of generated answers
    reference_answers: list of reference answers
    contexts: list of lists of contexts
  """

  # create dictionary in form that RAGAs expects
  data_samples = {
    'question': questions,
    'answer': generated_answers,
    'ground_truth': reference_answers,
    'contexts' : contexts
  }

  dataset = Dataset.from_dict(data_samples)
  llm_wrapped = LangchainLLMWrapper(cohere_chat_model)
  embeddings_wrapped = LangchainEmbeddingsWrapper(base_embeddings)
  score = evaluate(dataset, metrics=[answer_relevancy, faithfulness, context_precision, context_recall], llm = llm_wrapped, embeddings = embeddings_wrapped)

  return score.to_pandas()


def compute_metrics(questions, generated_answers, reference_answers, contexts):
  # compute the BERTScore
  bert_P, bert_R, bert_F1 = compute_bert_score(generated_answers, reference_answers)

  # compute cosine similarity from embedding
  cosine_sim = compute_cosine_similarity(generated_answers, reference_answers)

  # compute ragas metrics
  ragas_metrics_df = compute_ragas_metrics(questions, generated_answers, reference_answers, contexts)
  ragas_metrics_df["context_relevancy_f1"] = (2 * ragas_metrics_df["context_precision"] * ragas_metrics_df["context_recall"]) / (ragas_metrics_df["context_precision"] + ragas_metrics_df["context_recall"])

  # create results dataframe
  full_results = ragas_metrics_df.copy()
  full_results['bert_P'] = bert_P
  full_results['bert_R'] = bert_R
  full_results['bert_F1'] = bert_F1
  full_results['cosine_similarity'] = cosine_sim

  # compute the final weighted eval score that we will use to judge our RAG system
  full_results["Weighted_Eval_Score"] = 0.65 * ((full_results["bert_F1"] + full_results["cosine_similarity"]) / 2) + 0.35 * ((full_results["context_relevancy_f1"] + full_results["faithfulness"] + full_results["answer_relevancy"]) / 3)

  # create aggregated metrics
  agg_results = full_results.drop(columns = ["question", "answer", "ground_truth", "contexts"]).mean()

  return full_results, agg_results

In [None]:
# test the metric eval with dummy data
questions = ['When was the first super bowl?', 'Who won the most super bowls?']
answers = ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots']
contexts = [['The Super Bowl started season in 1966,','with the first superbowl officially starting 1/15/1967.'],
    ['The super bowl is the most sought after title, with the most won by the Patriots','The New England Patriots won 6 times']]
ground_truth = ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times']


# full_results, agg_results = compute_metrics(questions, answers, ground_truth, contexts)

In [None]:
#full_results

In [None]:
#agg_results

### Key Runs

#### Create Validation Subset

Let's create a subset of 10 validation set questions in order to evaluate the models on.

In [57]:
import random

random.seed(42)

# get 10 random indices from val set
validation_subset_idx = list(validation_questions_answers.keys())
validation_subset_idx = random.sample(validation_subset_idx, 10)

# get questions and answers for the subset
validation_subset_questions = []
validation_subset_research_answers = []
validation_subset_marketing_answers = []

for i in validation_subset_idx:
  validation_subset_questions.append(validation_questions_answers[i]["question"])
  validation_subset_research_answers.append(validation_questions_answers[i]["gold_answer_research"])
  validation_subset_marketing_answers.append(validation_questions_answers[i]["gold_answer_marketing"])

# save as dictionary
validation_subset = {"question": validation_subset_questions, "gold_answer_research": validation_subset_research_answers, "gold_answer_marketing": validation_subset_marketing_answers}

### LLMs

We will use one Open Source Model ("mistralai/Mistral-7B-Instruct-v0.1") and one Proprietery Model (Cohere) for our tests. Let's first set up the OS model:

#### Default Mistral Model

This is the Mistral model we will use for experimentation.

In [None]:
%%capture

quantization_config = BitsAndBytesConfig(load_in_4bit=True,
                                         )


llm_mistral_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    torch_dtype=torch.float32,
    device_map='auto',
    quantization_config=quantization_config
)

llm_mistral_tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

In [None]:
mistral_pipe = pipeline(
    "text-generation",
    model=llm_mistral_model,
    tokenizer=llm_mistral_tokenizer,
    max_new_tokens=1000,
    temperature=0.6,
    top_p=0.95,
    do_sample=True,
    repetition_penalty=1.2
)
mistral_pipe.model.config.pad_token_id = mistral_pipe.model.config.eos_token_id

mistral_llm_lc = HuggingFacePipeline(pipeline=mistral_pipe) # wrap into LangChain object

#### Default Cohere Model

Now we will load in the default cohere model. We do not need to wrap the cohere model into a langchain object like we do for mistral.

In [None]:
cohere_chat_model = ChatCohere(cohere_api_key=COHERE_API_KEY)

#### Key Run 1: Engineering Prompt Permutations

This first run is evaluating the 36 prompt permutations with the validation subset of 10 questions. The main goal is to find the engineering prompt that yields the highest weighted eval score.

We are using the Mistral model with the default hyperparameters. Ideally, we would test each of the prompt with different sets of hyperparameters, however, we are limited by computation costs and time. This first run below cost around $23 and took around 90 minutes. Each additional set of hyperparameters would double these values. Instead, we will first find the prompt that performs best on the default hyperparameters, and then test that prompt with different sets of hyperparameters.

Note: The bulk of the computation cost is coming from the RAGAS metrics computation. I am using the cohere production key here as the trial key and mistral were not sufficient (rate limited).

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

In [59]:
# parses llm output string into the prompt, contexts, question, and response
def output_parser(text):
  # separate parts of the response
  split1 = text.split("[/INST]\n")
  prompt_and_contexts = split1[0]
  response = split1[1]

  split2 = prompt_and_contexts.split("Answer the question based on the following context:\n")
  prompt = split2[0]
  prompt = prompt[7:] # remove "Human: "" that is added by langchain
  contexts_and_question = split2[1]

  split3 = contexts_and_question.split("Question: ")
  contexts = split3[0]
  question = split3[1]

  return {"prompt": prompt, "contexts": contexts, "question": question, "response": response}

In [60]:
# formats the llm output into the expected format for metric evaluation, then runs metric evaluation
def evaluate_research(output, validation_set):
  generated_answers = []
  reference_answers = []
  contexts = []
  questions = []

  for i in range(len(output)):
    generated_answers.append(output[i]["response"])
    reference_answers.append(validation_set["gold_answer_research"][i])
    contexts.append(output[i]["contexts"].split("\n\n")[:-1])
    questions.append(output[i]["question"])

  return compute_metrics(questions, generated_answers, reference_answers, contexts)

Now let's run the prompt permutations on the subset of validation questions.

In [None]:
outputs = []

# loop through each prompt permutation
for i in range(len(eng_prompts)):

  rag_prompt = ChatPromptTemplate.from_template(eng_prompts[i])

  rag_chain = (
      {"context": retriever | format_docs,
      "question": RunnablePassthrough()}
      | rag_prompt
      | mistral_llm_lc
      | output_parser
  )

  # run the chain on the subset of validation questions
  batch_output = rag_chain.batch(validation_subset["question"])
  outputs.append(batch_output)

Check shapes to make sure they are as expected

In [None]:
print("Number of Prompts: " + str(len(outputs)))
print("Number of outputs for each prompt: " + str(len(outputs[0])))

In [None]:
# show example output
outputs[0][0]

Now we will compute the evaluation metrics on each of the different prompts.

In [None]:
# loop through all prompts, computing and saving results to csv
for i in range(len(outputs)):
  full_results, agg_results = evaluate_research(outputs[i], validation_subset)
  full_results.to_csv("/content/drive/MyDrive/DS 290 - Gen AI Assignments/prompt_" + str(i) + "_full_results.csv")
  agg_results.to_csv("/content/drive/MyDrive/DS 290 - Gen AI Assignments/prompt_" + str(i) + "_agg_results.csv")

In [49]:
# load in dataframe of results
full_results = []
agg_results = []

for i in range(len(eng_prompts)):
  full_results.append(pd.read_csv("/content/drive/MyDrive/DS 290 - Gen AI Assignments/prompt_" + str(i) + "_full_results.csv"))
  agg_results.append(pd.read_csv("/content/drive/MyDrive/DS 290 - Gen AI Assignments/prompt_" + str(i) + "_agg_results.csv"))

In [50]:
# show example of full results
full_results[12].head(3)

Unnamed: 0.1,Unnamed: 0,question,answer,ground_truth,contexts,answer_relevancy,faithfulness,context_precision,context_recall,context_relevancy_f1,bert_P,bert_R,bert_F1,cosine_similarity,Weighted_Eval_Score
0,0,How has the token handling capacity changed be...,The token handling capacity of the Claude mo...,The token handling capacity has increased with...,[' ==== Claude 2.1 ====\nClaude 2.1 doubled t...,0.928808,1.0,1.0,1.0,1.0,0.919766,0.91836,0.919062,0.813095,0.904645
1,1,Can you name some specific large language mode...,1. Gibbon: Developed by Google in the late 2...,Some specific large language models include GP...,"[' Large language models, currently their mos...",0.877451,0.416667,0.0,0.0,,0.804278,0.866485,0.834223,0.813095,
2,2,What methods are typically employed to create ...,The creation of training data for embedding ...,To create training data for embedding models t...,"[' across all\ndatasets in MEDI, we design a ...",0.921639,0.666667,0.638889,0.0,0.0,0.879988,0.873792,0.876879,0.813095,0.734544


In [51]:
# show example of agg results
agg_results[4]

Unnamed: 0.1,Unnamed: 0,0
0,answer_relevancy,0.604244
1,faithfulness,0.87874
2,context_precision,0.675
3,context_recall,0.45
4,context_relevancy_f1,0.565476
5,bert_P,0.861678
6,bert_R,0.886285
7,bert_F1,0.873741
8,cosine_similarity,0.819099
9,Weighted_Eval_Score,0.822791


In [56]:
# find the prompt with the highest Weighted Eval Score
highest_score = 0
highest_score_idx = 0
for i in range(len(agg_results)):
  if agg_results[i]["0"][9] > highest_score:
    highest_score = agg_results[i]["0"][9]
    highest_score_idx = i

print("Highest Weighted Eval Score: " + str(highest_score))
print("Index of Highest Weighted Eval Score: " + str(highest_score_idx))
print("Prompt of Highest Weighted Eval Score:\n" + eng_prompts[highest_score_idx])

Highest Weighted Eval Score: 0.8227914890580635
Index of Highest Weighted Eval Score: 4
Prompt of Highest Weighted Eval Score:
[INST] You are a question-answering assistant for Generative AI topics. Your task is to provide factual, concise, and informative answers to Generative AI related questions. Use information in the provided context to answer the question. You will be answering questions for a team of engineers working at a startup creating new Generative AI products. The engineers are new to the Generative AI field. The engineers are asking questions in order to gain the technical knowledge necessary to build new Generative AI products. Your response should include clarity, technical depth, and precision. Do not use information that is not in the provided context to answer the question. Do not use more than 7 sentences in your answer. Do not say 'Based on the context' You are very capable.
  Answer the question based on the following context:
  {context}

  Question: {question}


#### Key Run 2: Engineering Hyperparameter Tuning With Mistral Model

Now we have our "best" prompt for the engineering pipeline. Next we will test different sets of hyperparameters with the Mistral model.

Again ideally, we would be able to test each hyperparameter set with each prompt, since different prompt would yield different results with each hyperparameter set. The "best" prompt for the default hyperparameter settings may not be the best prompt for a different set of hyperparameters. However, we are restricted by computation costs and time.

The hyperparameter values chosen here lean toward having a more deterministic output. Our use case does not need the model to have more creative outputs, rather it should give factual information that it is certain about. This means chosing lower temperatures and higher top_p values. We also want the answer to be concise, so we will experiment with higher repetition_penalities as well.

We will also be reduce the validation subset we are using from 10 questions to 6 since the computation costs are getting a bit high.

In [57]:
# took 46 min with 10 samples, 25 mins with 6 samples
temperatures = [0.1, 0.3, 0.6]
top_ps = [0.9,0.95]
repetition_penalties = [1.2, 1.5]

hyperparam_inputs = [] # will use to store the hyperparam set for each run
hyperparam_outputs = []

for temp in temperatures:
  for top_p in top_ps:
    for rep_pen in repetition_penalties:

      # initialize pipeline with current hyperparameter set
      mistral_pipe = pipeline(
          "text-generation",
          model=llm_mistral_model,
          tokenizer=llm_mistral_tokenizer,
          max_new_tokens=1000,
          temperature=temp,
          top_p=top_p,
          do_sample=True,
          repetition_penalty=rep_pen
      )
      mistral_pipe.model.config.pad_token_id = mistral_pipe.model.config.eos_token_id
      mistral_llm_lc = HuggingFacePipeline(pipeline=mistral_pipe)

      # evaluate using prompt with the highest weight eval score from key run 1
      rag_prompt = ChatPromptTemplate.from_template(eng_prompts[4])
      rag_chain = (
          {"context": retriever | format_docs,
          "question": RunnablePassthrough()}
          | rag_prompt
          | mistral_llm_lc
          | output_parser
      )

      hyperparam_output = rag_chain.batch(validation_subset["question"][0:6])
      hyperparam_outputs.append(hyperparam_output)
      hyperparam_inputs.append([temp, top_p, rep_pen])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for o

In [58]:
print("Number of Hyper parameter sets: " + str(len(hyperparam_outputs)))
print("Number of sample per set: " + str(len(hyperparam_outputs[0])))

Number of Hyper parameter sets: 12
Number of sample per set: 6


In [None]:
# loop through all hyperparameter sets' outputs, evaluating metrics and saving results to csv
for i in range(len(hyperparam_outputs)):
  full_results, agg_results = evaluate_research(hyperparam_outputs[i], validation_subset)
  full_results.to_csv("/content/drive/MyDrive/DS 290 - Gen AI Assignments/mistral_hyperparam_" + str(i) + "_full_results.csv")
  agg_results.to_csv("/content/drive/MyDrive/DS 290 - Gen AI Assignments/mistral_hyperparam_" + str(i) + "_agg_results.csv")

Now we have our evaluation metrics for each hyperparameter set saved. Let's load them back in and find which hyperparameter set had the highest weighted evaluation score.

In [64]:
# load in dataframes of results
mistral_hyperparam_full_results = []
mistral_hyperparam_agg_results = []

for i in range(len(hyperparam_outputs)):
  mistral_hyperparam_full_results.append(pd.read_csv("/content/drive/MyDrive/DS 290 - Gen AI Assignments/mistral_hyperparam_" + str(i) + "_full_results.csv"))
  mistral_hyperparam_agg_results.append(pd.read_csv("/content/drive/MyDrive/DS 290 - Gen AI Assignments/mistral_hyperparam_" + str(i) + "_agg_results.csv"))

In [65]:
# find the hyperparameter set with the highest Weighted Eval Score
highest_score = 0
highest_score_idx = 0
for i in range(len(mistral_hyperparam_agg_results)):
  if mistral_hyperparam_agg_results[i]["0"][9] > highest_score:
    highest_score = mistral_hyperparam_agg_results[i]["0"][9]
    highest_score_idx = i

print("Highest Weighted Eval Score: " + str(highest_score))
print("Index of Highest Weighted Eval Score: " + str(highest_score_idx))
print("Hyperparameter set of Highest Weighted Eval Score:\n" + str(hyperparam_inputs[highest_score_idx]))

Highest Weighted Eval Score: 0.7930844590171613
Index of Highest Weighted Eval Score: 2
Hyperparameter set of Highest Weighted Eval Score:
[0.1, 0.95, 1.2]


#### Key Run 3: Engineering Hyperparameter Tuning With Cohere Model

Next we will test different sets of hyperparameters with the Cohere model.

It is important to call out the limitation here again. The "best" prompt we chose was based on the result using the Mistral model with the default hyperparameters. Different models react differently to prompts, so we can not be sure that the prompt we are using here would be the best for the Cohere model. However, we are limited by computation costs and time, so we will live with this for now.

In [116]:
temperatures = [0.1, 0.3, 0.6]
top_ps = [0.9,0.95]
repetition_penalties = [1.2, 1.5]

cohere_hyperparam_inputs = [] # will use to store the hyperparam set for each run
cohere_hyperparam_outputs = []

for temp in temperatures:
  for top_p in top_ps:
    for rep_pen in repetition_penalties:

      # initialize pipeline with current hyperparameter set
      cohere_model = ChatCohere(
          cohere_api_key=COHERE_API_KEY,
          temperature=temp,
          top_p = top_p,
          repetition_penalty = rep_pen
      )

      # evaluate using prompt with the highest weight eval score from key run 1
      rag_prompt = ChatPromptTemplate.from_template(eng_prompts[4])

      # create chain for retrieval part
      cohere_retrieval_chain = (
          {"context": retriever | format_docs,
           "question": RunnablePassthrough()}
          | rag_prompt
      )

      # create chain for generation
      cohere_generate_chain = (
          cohere_model
      )

      # run chains
      retrieval_output = cohere_retrieval_chain.batch(validation_subset["question"][0:6])
      generation_output = cohere_generate_chain.batch(retrieval_output)

      # parse results into prompt, contexts, question, and response
      cohere_hyperparam_output = []
      for i in range(len(generation_output)):
        cohere_hyperparam_output.append(output_parser(retrieval_output[i].messages[0].content))
        cohere_hyperparam_output[i]["response"] = generation_output[i].content
      cohere_hyperparam_outputs.append(cohere_hyperparam_output)

      # save hyperparameter settings
      cohere_hyperparam_inputs.append([temp, top_p, rep_pen])

In [117]:
print("Number of Hyper parameter sets: " + str(len(cohere_hyperparam_outputs)))
print("Number of sample per set: " + str(len(cohere_hyperparam_outputs[0])))

Number of Hyper parameter sets: 12
Number of sample per set: 6


In [None]:
# loop through all hyperparameter sets' outputs, evaluating and saving results to csv
for i in range(len(cohere_hyperparam_outputs)):
  full_results, agg_results = evaluate_research(cohere_hyperparam_outputs[i], validation_subset)
  full_results.to_csv("/content/drive/MyDrive/DS 290 - Gen AI Assignments/cohere_hyperparam_" + str(i) + "_full_results.csv")
  agg_results.to_csv("/content/drive/MyDrive/DS 290 - Gen AI Assignments/cohere_hyperparam_" + str(i) + "_agg_results.csv")

In [120]:
# load in dataframes of results
cohere_hyperparam_full_results = []
cohere_hyperparam_agg_results = []

for i in range(len(cohere_hyperparam_outputs)):
  cohere_hyperparam_full_results.append(pd.read_csv("/content/drive/MyDrive/DS 290 - Gen AI Assignments/cohere_hyperparam_" + str(i) + "_full_results.csv"))
  cohere_hyperparam_agg_results.append(pd.read_csv("/content/drive/MyDrive/DS 290 - Gen AI Assignments/cohere_hyperparam_" + str(i) + "_agg_results.csv"))

In [121]:
# find the hyperparameter set with the highest Weighted Eval Score
cohere_highest_score = 0
cohere_highest_score_idx = 0
for i in range(len(cohere_hyperparam_agg_results)):
  if cohere_hyperparam_agg_results[i]["0"][9] > cohere_highest_score:
    cohere_highest_score = cohere_hyperparam_agg_results[i]["0"][9]
    cohere_highest_score_idx = i

print("Highest Weighted Eval Score: " + str(cohere_highest_score))
print("Index of Highest Weighted Eval Score: " + str(cohere_highest_score_idx))
print("Hyperparameter set of Highest Weighted Eval Score:\n" + str(cohere_hyperparam_inputs[cohere_highest_score_idx]))

Highest Weighted Eval Score: 0.8167944966201452
Index of Highest Weighted Eval Score: 3
Hyperparameter set of Highest Weighted Eval Score:
[0.1, 0.95, 1.5]


#### Key Run 4: Marketing Prompt Permutations

Next, we will be doing the same thing as the first run, but with the marketing prompt permutations instead. We are evaluating the 12 prompt permutations with the validation subset of 10 questions. The main goal is to find the marketing prompt that yields the highest weighted eval score.

We are using the Mistral model with the default hyperparameters here again.

In [102]:
# formats the llm output into the expected format for metric evaluation, then runs metric evaluation
def evaluate_marketing(output, validation_set):
  generated_answers = []
  reference_answers = []
  contexts = []
  questions = []

  for i in range(len(output)):
    generated_answers.append(output[i]["response"])
    reference_answers.append(validation_set["gold_answer_marketing"][i])
    contexts.append(output[i]["contexts"].split("\n\n")[:-1])
    questions.append(output[i]["question"])

  return compute_metrics(questions, generated_answers, reference_answers, contexts)

In [126]:
# initialize mistral pipeline with original setting again
mistral_pipe = pipeline(
    "text-generation",
    model=llm_mistral_model,
    tokenizer=llm_mistral_tokenizer,
    max_new_tokens=1000,
    temperature=0.6,
    top_p=0.95,
    do_sample=True,
    repetition_penalty=1.2
)
mistral_pipe.model.config.pad_token_id = mistral_pipe.model.config.eos_token_id
mistral_llm_lc = HuggingFacePipeline(pipeline=mistral_pipe)

In [None]:
# took around 30 mins to run

marketing_prompt_outputs = []

# loop through each prompt permutation
for i in range(len(marketing_prompts)):

  rag_prompt = ChatPromptTemplate.from_template(marketing_prompts[i])

  rag_chain = (
      {"context": retriever | format_docs,
      "question": RunnablePassthrough()}
      | rag_prompt
      | mistral_llm_lc
      | output_parser
  )

  # run the chain on the subset of validation questions
  batch_output = rag_chain.batch(validation_subset["question"])
  marketing_prompt_outputs.append(batch_output)

Check output shapes to make sure they are as expected.

In [128]:
print("Number of Prompts: " + str(len(marketing_prompt_outputs)))
print("Number of outputs for each prompt: " + str(len(marketing_prompt_outputs[0])))

Number of Prompts: 12
Number of outputs for each prompt: 10


In [131]:
# show example output
marketing_prompt_outputs[11][9]

{'prompt': "[INST] You are a question-answering assistant for Generative AI topics. Your task is to provide factual, concise, and informative answers to Generative AI related questions. Use information in the provided context to answer the question. You will be answering questions for the marketing team working at a startup creating new Generative AI products. The marketing team is new to the Generative AI field. The marketing team is asking questions around Generative AI in order to better understand the products and the field as a whole, but a lot more high level answers would likely be in order. Your response should be high level, concise, and nontechnical. Your response should break complex concepts into smaller, nontechnical parts that are easy for someone who does not have a background in Generative AI to understand. Your response should be around 2 to 3 sentences long. Do not use information that is not in the provided context to answer the question. Do not use more than 5 sente

Now we will compute the evaluation metrics on each of the different prompts.

In [None]:
# loop through all prompts, computing and saving results to csv
for i in range(len(marketing_prompt_outputs)):
  full_results, agg_results = evaluate_research(marketing_prompt_outputs[i], validation_subset)
  full_results.to_csv("/content/drive/MyDrive/DS 290 - Gen AI Assignments/marketing_prompt_" + str(i) + "_full_results.csv")
  agg_results.to_csv("/content/drive/MyDrive/DS 290 - Gen AI Assignments/marketing_prompt_" + str(i) + "_agg_results.csv")

In [149]:
# load in dataframe of results
full_results = []
agg_results = []

for i in range(len(marketing_prompts)):
  full_results.append(pd.read_csv("/content/drive/MyDrive/DS 290 - Gen AI Assignments/marketing_prompt_" + str(i) + "_full_results.csv"))
  agg_results.append(pd.read_csv("/content/drive/MyDrive/DS 290 - Gen AI Assignments/marketing_prompt_" + str(i) + "_agg_results.csv"))

Now we have the evaluation metrics for each prompt permutation. Let's find the one that performed the best.

In [134]:
# find the prompt with the highest Weighted Eval Score
highest_score = 0
best_marketing_prompt_idx = 0

for i in range(len(agg_results)):
  if agg_results[i]["0"][9] > highest_score:
    highest_score = agg_results[i]["0"][9]
    best_marketing_prompt_idx = i

print("Highest Weighted Eval Score: " + str(highest_score))
print("Index of Highest Weighted Eval Score: " + str(best_marketing_prompt_idx))
print("Prompt of Highest Weighted Eval Score:\n" + marketing_prompts[best_marketing_prompt_idx])

Highest Weighted Eval Score: 0.806979593106999
Index of Highest Weighted Eval Score: 4
Prompt of Highest Weighted Eval Score:
[INST] You are a question-answering assistant for Generative AI topics. Your task is to provide factual, concise, and informative answers to Generative AI related questions. Use information in the provided context to answer the question. You will be answering questions for the marketing team working at a startup creating new Generative AI products. The marketing team is new to the Generative AI field. The marketing team is asking questions in order to gain the high level knowledge necessary to promote their new Generative AI products. Your response should be high level, concise, and nontechnical. Do not use information that is not in the provided context to answer the question. Do not use more than 5 sentences in your answer. Do not say 'Based on the context'. You are very capable.
  Answer the question based on the following context:
  {context}

  Question: {q

Just out of curiousity, I wanted to see how the other prompts performed. It looks like most prompts did similiarly, all between 0.77 to 0.81.

In [135]:
for i in range(len(agg_results)):
  print(agg_results[i]["0"][9])

0.798179864473412
0.79755192185339
0.7843747302392476
0.7961906579217133
0.806979593106999
0.7888073854671911
0.7828718470724554
0.8069709633024024
0.7998867180394303
0.7706602739750545
0.7799633412009574
0.7977309829248007


#### Key Run 5: Marketing Hyperparameter Tuning with Mistral Model

Now we have our "best" prompt for the marketing pipeline. Here we will be doing the same thing as key run 2 but for the marketing pipeline. We will test different sets of hyperparameters with the Mistral model and the best prompt we found.

In [None]:
temperatures = [0.1, 0.3, 0.6]
top_ps = [0.9, 0.95]
repetition_penalties = [1.2, 1.5]

hyperparam_inputs = [] # will use to store the hyperparam set for each run
hyperparam_outputs = []

for temp in temperatures:
  for top_p in top_ps:
    for rep_pen in repetition_penalties:

      # initialize pipeline with current hyperparameter set
      mistral_pipe = pipeline(
          "text-generation",
          model=llm_mistral_model,
          tokenizer=llm_mistral_tokenizer,
          max_new_tokens=1000,
          temperature=temp,
          top_p=top_p,
          do_sample=True,
          repetition_penalty=rep_pen
      )
      mistral_pipe.model.config.pad_token_id = mistral_pipe.model.config.eos_token_id
      mistral_llm_lc = HuggingFacePipeline(pipeline=mistral_pipe)

      # evaluate using prompt with the highest weight eval score from key run 4
      rag_prompt = ChatPromptTemplate.from_template(marketing_prompts[best_marketing_prompt_idx])
      rag_chain = (
          {"context": retriever | format_docs,
          "question": RunnablePassthrough()}
          | rag_prompt
          | mistral_llm_lc
          | output_parser
      )

      hyperparam_output = rag_chain.batch(validation_subset["question"][0:6])
      hyperparam_outputs.append(hyperparam_output)
      hyperparam_inputs.append([temp, top_p, rep_pen])

Check shapes of outputs to make sure they are as expected before saving.

In [138]:
print("Number of Hyper parameter sets: " + str(len(hyperparam_outputs)))
print("Number of sample per set: " + str(len(hyperparam_outputs[0])))

Number of Hyper parameter sets: 12
Number of sample per set: 6


In [None]:
# loop through all hyperparameter sets' outputs, evaluating metrics and saving results to csv
for i in range(len(hyperparam_outputs)):
  full_results, agg_results = evaluate_research(hyperparam_outputs[i], validation_subset)
  full_results.to_csv("/content/drive/MyDrive/DS 290 - Gen AI Assignments/marketing_mistral_hyperparam_" + str(i) + "_full_results.csv")
  agg_results.to_csv("/content/drive/MyDrive/DS 290 - Gen AI Assignments/marketing_mistral_hyperparam_" + str(i) + "_agg_results.csv")

Now we have our evaluation metrics for each hyperparameter set saved. Let's load them back in and find which hyperparameter set had the highest weighted evaluation score.

In [144]:
# load in dataframes of results
mistral_hyperparam_full_results = []
mistral_hyperparam_agg_results = []

for i in range(len(hyperparam_outputs)):
  mistral_hyperparam_full_results.append(pd.read_csv("/content/drive/MyDrive/DS 290 - Gen AI Assignments/marketing_mistral_hyperparam_" + str(i) + "_full_results.csv"))
  mistral_hyperparam_agg_results.append(pd.read_csv("/content/drive/MyDrive/DS 290 - Gen AI Assignments/marketing_mistral_hyperparam_" + str(i) + "_agg_results.csv"))

In [142]:
# find the hyperparameter set with the highest Weighted Eval Score
highest_score = 0
highest_score_idx = 0

for i in range(len(mistral_hyperparam_agg_results)):
  if mistral_hyperparam_agg_results[i]["0"][9] > highest_score:
    highest_score = mistral_hyperparam_agg_results[i]["0"][9]
    highest_score_idx = i

print("Highest Weighted Eval Score: " + str(highest_score))
print("Index of Highest Weighted Eval Score: " + str(highest_score_idx))
print("Hyperparameter set of Highest Weighted Eval Score:\n" + str(hyperparam_inputs[highest_score_idx]))

Highest Weighted Eval Score: 0.8010015706009299
Index of Highest Weighted Eval Score: 6
Hyperparameter set of Highest Weighted Eval Score:
[0.3, 0.95, 1.2]


In [161]:
for i in range(len(mistral_hyperparam_agg_results)):
  print(mistral_hyperparam_agg_results[i]["0"][9])

0.7857681923401499
0.6981472629762034
0.792544926213811
0.6810361614003216
0.8007874299915443
0.6887618924433848
0.8010015706009299
0.7098558470929224
0.787707937751937
0.700953573446715
0.7939625333682294
0.7095145659347429


#### Key Run 6: Marketing Hyperparameter Tuning With Cohere Model

Here we will be doing the same thing as key run 3, but for the marketing pipeline. We will test the different sets of hyperparameters using the "best" prompt with the Cohere Model.

In [73]:
temperatures = [0.1, 0.3, 0.6]
top_ps = [0.9,0.95]
repetition_penalties = [1.2, 1.5]

cohere_hyperparam_inputs = [] # will use to store the hyperparam set for each run
cohere_hyperparam_outputs = []

for temp in temperatures:
  for top_p in top_ps:
    for rep_pen in repetition_penalties:

      # initialize pipeline with current hyperparameter set
      cohere_model = ChatCohere(
          cohere_api_key=COHERE_API_KEY,
          temperature=temp,
          top_p = top_p,
          repetition_penalty = rep_pen
      )

      # evaluate using prompt with the highest weight eval score from key run 4
      rag_prompt = ChatPromptTemplate.from_template(marketing_prompts[4])

      # create chain for retrieval part
      cohere_retrieval_chain = (
          {"context": retriever | format_docs,
           "question": RunnablePassthrough()}
          | rag_prompt
      )

      # create chain for generation
      cohere_generate_chain = (
          cohere_model
      )

      # run chains
      retrieval_output = cohere_retrieval_chain.batch(validation_subset["question"][0:6])
      generation_output = cohere_generate_chain.batch(retrieval_output)

      # parse results into prompt, contexts, question, and response
      cohere_hyperparam_output = []
      for i in range(len(generation_output)):
        cohere_hyperparam_output.append(output_parser(retrieval_output[i].messages[0].content))
        cohere_hyperparam_output[i]["response"] = generation_output[i].content
      cohere_hyperparam_outputs.append(cohere_hyperparam_output)

      # save hyperparameter settings
      cohere_hyperparam_inputs.append([temp, top_p, rep_pen])

Check that shape of outputs are as expected

In [74]:
print("Number of Hyper parameter sets: " + str(len(cohere_hyperparam_outputs)))
print("Number of sample per set: " + str(len(cohere_hyperparam_outputs[0])))

Number of Hyper parameter sets: 12
Number of sample per set: 6


In [None]:
# loop through all hyperparameter sets' outputs, evaluating and saving results to csv
for i in range(len(cohere_hyperparam_outputs)):
  full_results, agg_results = evaluate_research(cohere_hyperparam_outputs[i], validation_subset)
  full_results.to_csv("/content/drive/MyDrive/DS 290 - Gen AI Assignments/marketing_cohere_hyperparam_" + str(i) + "_full_results.csv")
  agg_results.to_csv("/content/drive/MyDrive/DS 290 - Gen AI Assignments/marketing_cohere_hyperparam_" + str(i) + "_agg_results.csv")

In [76]:
# load in dataframes of results
cohere_hyperparam_full_results = []
cohere_hyperparam_agg_results = []

for i in range(len(cohere_hyperparam_outputs)):
  cohere_hyperparam_full_results.append(pd.read_csv("/content/drive/MyDrive/DS 290 - Gen AI Assignments/marketing_cohere_hyperparam_" + str(i) + "_full_results.csv"))
  cohere_hyperparam_agg_results.append(pd.read_csv("/content/drive/MyDrive/DS 290 - Gen AI Assignments/marketing_cohere_hyperparam_" + str(i) + "_agg_results.csv"))

In [77]:
# find the hyperparameter set with the highest Weighted Eval Score
cohere_highest_score = 0
cohere_highest_score_idx = 0
for i in range(len(cohere_hyperparam_agg_results)):
  if cohere_hyperparam_agg_results[i]["0"][9] > cohere_highest_score:
    cohere_highest_score = cohere_hyperparam_agg_results[i]["0"][9]
    cohere_highest_score_idx = i

print("Highest Weighted Eval Score: " + str(cohere_highest_score))
print("Index of Highest Weighted Eval Score: " + str(cohere_highest_score_idx))
print("Hyperparameter set of Highest Weighted Eval Score:\n" + str(cohere_hyperparam_inputs[cohere_highest_score_idx]))

Highest Weighted Eval Score: 0.8350029480644124
Index of Highest Weighted Eval Score: 2
Hyperparameter set of Highest Weighted Eval Score:
[0.1, 0.95, 1.2]


In [78]:
for i in range(len(cohere_hyperparam_agg_results)):
  print(cohere_hyperparam_agg_results[i]["0"][9])

0.8166988464933567
0.7980494711788049
0.8350029480644124
0.8073239462381877
0.8104974313069965
0.8086994754078833
0.8040161162857998
0.7637323544710503
0.8077331449871886
0.8098489177085405
0.8012859798516969
0.7701831399185768


## 5. Results

### Final Engineering Model

For our final engineering model, we are using the Cohere model with the following prompt and hyperparameter:
1. Prompt: [INST] You are a question-answering assistant for Generative AI topics. Your task is to provide factual, concise, and informative answers to Generative AI related questions. Use information in the provided context to answer the question. You will be answering questions for a team of engineers working at a startup creating new Generative AI products. The engineers are new to the Generative AI field. The engineers are asking questions in order to gain the technical knowledge necessary to build new Generative AI products. Your response should include clarity, technical depth, and precision. Do not use information that is not in the provided context to answer the question. Do not use more than 7 sentences in your answer. Do not say 'Based on the context'. You are very capable. Answer the question based on the following context:
{context}
Question: {question}
[/INST]
2. Temperature: 0.1
3. Top_p: 0.95
4. repetition_penalty: 1.5

Let's now evaluate our final model on the entire validation set.


In [17]:
# get all validation questions / answers
validation_questions = []
validation_research_answers = []
validation_marketing_answers = []

for i in validation_questions_answers.keys():
  validation_questions.append(validation_questions_answers[i]["question"])
  validation_research_answers.append(validation_questions_answers[i]["gold_answer_research"])
  validation_marketing_answers.append(validation_questions_answers[i]["gold_answer_marketing"])

# save as dictionary
validation_set = {"question": validation_questions, "gold_answer_research": validation_research_answers, "gold_answer_marketing": validation_marketing_answers}

In [61]:
# initialize pipeline with best hyperparameter set
final_eng_model = ChatCohere(
    cohere_api_key=COHERE_API_KEY,
    temperature=0.1,
    top_p = 0.95,
    repetition_penalty = 1.5
)

# evaluate using prompt with the highest weight eval score from key run 1
rag_prompt = ChatPromptTemplate.from_template(eng_prompts[4])

# create chain for retrieval part
cohere_retrieval_chain = (
    {"context": retriever | format_docs,
      "question": RunnablePassthrough()}
    | rag_prompt
)

# create chain for generation
cohere_generate_chain = (
    final_eng_model
)

# run chains
retrieval_output = cohere_retrieval_chain.batch(validation_set["question"])
generation_output = cohere_generate_chain.batch(retrieval_output)

# parse results into prompt, contexts, question, and response
cohere_output = []
for i in range(len(generation_output)):
  cohere_output.append(output_parser(retrieval_output[i].messages[0].content))
  cohere_output[i]["response"] = generation_output[i].content

In [64]:
print(len(cohere_output))

75


In [63]:
cohere_output[0]

{'prompt': "You are a question-answering assistant for Generative AI topics. Your task is to provide factual, concise, and informative answers to Generative AI related questions. Use information in the provided context to answer the question. You will be answering questions for a team of engineers working at a startup creating new Generative AI products. The engineers are new to the Generative AI field. The engineers are asking questions in order to gain the technical knowledge necessary to build new Generative AI products. Your response should include clarity, technical depth, and precision. Do not use information that is not in the provided context to answer the question. Do not use more than 7 sentences in your answer. Do not say 'Based on the context' You are very capable.\n  ",
 'contexts': '  Large language models, currently their most advanced form, are a combination of larger datasets (frequently using words scraped from the public internet), feedforward neural networks, and tr

In [None]:
full_results, agg_results = evaluate_research(cohere_output, validation_set)
full_results.to_csv("/content/drive/MyDrive/DS 290 - Gen AI Assignments/final_engineer_" + str(i) + "_full_results.csv")
agg_results.to_csv("/content/drive/MyDrive/DS 290 - Gen AI Assignments/final_engineer" + str(i) + "_agg_results.csv")

In [70]:
agg_results

Unnamed: 0,0
answer_relevancy,0.717992
faithfulness,0.885483
context_precision,0.803742
context_recall,0.471111
context_relevancy_f1,0.505174
bert_P,0.891877
bert_R,0.87558
bert_F1,0.883552
cosine_similarity,0.760966
Weighted_Eval_Score,0.781844


### Final Marketing Model

For our final marketing model, we are using the Cohere model with the following prompt and hyperparameter:
1. Prompt: [INST] You are a question-answering assistant for Generative AI topics. Your task is to provide factual, concise, and informative answers to Generative AI related questions. Use information in the provided context to answer the question. You will be answering questions for the marketing team working at a startup creating new Generative AI products. The marketing team is new to the Generative AI field. The marketing team is asking questions in order to gain the high level knowledge necessary to promote their new Generative AI products. Your response should be high level, concise, and nontechnical. Do not use information that is not in the provided context to answer the question. Do not use more than 5 sentences in your answer. Do not say 'Based on the context'. You are very capable. Answer the question based on the following context: {context} Question: {question}
  [/INST]
2. Temperature: 0.1
3. Top_p: 0.95
4. repetition_penalty: 1.2

Let's now evaluate our final model on the entire validation set.

In [98]:
# initialize pipeline with best hyperparameter set
final_marketing_model = ChatCohere(
    cohere_api_key=COHERE_API_KEY,
    temperature=0.1,
    top_p = 0.95,
    repetition_penalty = 1.2
)

# evaluate using prompt with the highest weight eval score from key run 4
rag_prompt = ChatPromptTemplate.from_template(marketing_prompts[4])

# create chain for retrieval part
cohere_retrieval_chain = (
    {"context": retriever | format_docs,
      "question": RunnablePassthrough()}
    | rag_prompt
)

# create chain for generation
cohere_generate_chain = (
    final_marketing_model
)

# run chains
retrieval_output = cohere_retrieval_chain.batch(validation_set["question"])
generation_output = cohere_generate_chain.batch(retrieval_output)

# parse results into prompt, contexts, question, and response
final_marketing_output = []
for i in range(len(generation_output)):
  final_marketing_output.append(output_parser(retrieval_output[i].messages[0].content))
  final_marketing_output[i]["response"] = generation_output[i].content

In [None]:
marketing_full_results, marketing_agg_results = evaluate_marketing(final_marketing_output, validation_set)

marketing_full_results.to_csv("/content/drive/MyDrive/DS 290 - Gen AI Assignments/final_marketing_" + str(i) + "_full_results.csv")
marketing_agg_results.to_csv("/content/drive/MyDrive/DS 290 - Gen AI Assignments/final_marketing_" + str(i) + "_agg_results.csv")

In [118]:
marketing_agg_results

Unnamed: 0,0
answer_relevancy,0.721454
faithfulness,0.840558
context_precision,0.808133
context_recall,0.50991
context_relevancy_f1,0.533432
bert_P,0.8906
bert_R,0.899131
bert_F1,0.894626
cosine_similarity,0.761062
Weighted_Eval_Score,0.782404



### 5.2 Some Test Questions

**QUESTIONS:**


Please study the answers generated by your chosen setup for these specific test questions:

1. "What purpose do large language models serve in the field of natural language processing?" (Question 0)

2. "What methods are typically employed to create training data for embedding models that use task-specific instructions?" (Question 50)

3. "How does a model's ability to answer questions relate to its exposure to specific types of questions during training?" (Question 83, no labeled answers)

For each of the three questions above please provide:

a) The RAG results (research and marketing response)  
b) The context provided  
c) The document sources for the context  
d) Also discuss your metric(s) for the first two examples (for both responses) compared to the gold responses


#### 5.2.1 Test Question 1

Please run the query:








In [4]:
# load in results
research_full_result = pd.read_csv("/content/drive/MyDrive/DS 290 - Gen AI Assignments/final_engineer_74_full_results.csv")
research_agg_result = pd.read_csv("/content/drive/MyDrive/DS 290 - Gen AI Assignments/final_engineer74_agg_results.csv")
marketing_full_results = pd.read_csv("/content/drive/MyDrive/DS 290 - Gen AI Assignments/final_marketing_74_full_results.csv")
marketing_agg_results = pd.read_csv("/content/drive/MyDrive/DS 290 - Gen AI Assignments/final_marketing_74_agg_results.csv")

In [7]:
# used to translated the index of validation set to the index of the results
validation_set_keys = list(validation_questions_answers.keys())

#### a) The RAG results (research and marketing response)

In [143]:
# get RAG results
output_index = validation_set_keys.index(0)

marketing_ans = marketing_full_results.iloc[output_index]["answer"]
print("Marketing Answer: " + marketing_ans)
research_ans = research_full_result.iloc[output_index]["answer"]
print("Research Answer: " + research_ans)

Marketing Answer: Large language models are used for a range of tasks in natural language processing, including speech recognition, machine translation, and natural language generation, to make these technologies more human-like and capable. They are also useful for writing assistance, as per Yann LeCun.
Research Answer: Large language models (LLMs) are a pivotal component of natural language processing (NLP), offering a wide range of functionalities. They are trained on vast datasets and advanced neural network architectures, surpassing earlier models in performance. LLMs facilitate tasks like speech recognition, machine translation, and natural language generation, enhancing human-machine interactions and making them more intuitive and human-like.


#### b) The context provided

In [167]:
# function to convert contexts string back into a list of strings
def convert_contexts(contexts):
  contexts = contexts.split("\'\n")
  return contexts

In [168]:
# get the marketing contexts
marketing_context = marketing_full_results.iloc[output_index]["contexts"]
convert_contexts(marketing_context)

["['  Large language models, currently their most advanced form, are a combination of larger datasets (frequently using words scraped from the public internet), feedforward neural networks, and transformers. They have superseded recurrent neural network-based models, which had previously superseded the pure statistical models, such as word n-gram language model.",
 " 'Language models are useful for a variety of tasks, including speech recognition (helping prevent predictions of low-probability (e.g. nonsense) sequences), machine translation, natural language generation (generating more human-like text), optical character recognition, handwriting recognition, grammar induction, and information retrieval.",
 ' \'== History ==\\nBefore 2017, there were a few language models that were large as compared to capacities then available. In the 1990s, the IBM alignment models pioneered statistical language modelling. A smoothed n-gram model in 2001 trained on 0.3 billion words achieved then-SOTA

In [170]:
# get the engineering/research contexts
research_context = research_full_result.iloc[output_index]["contexts"]
convert_contexts(research_context)

["['  Large language models, currently their most advanced form, are a combination of larger datasets (frequently using words scraped from the public internet), feedforward neural networks, and transformers. They have superseded recurrent neural network-based models, which had previously superseded the pure statistical models, such as word n-gram language model.",
 " 'Language models are useful for a variety of tasks, including speech recognition (helping prevent predictions of low-probability (e.g. nonsense) sequences), machine translation, natural language generation (generating more human-like text), optical character recognition, handwriting recognition, grammar induction, and information retrieval.",
 ' \'== History ==\\nBefore 2017, there were a few language models that were large as compared to capacities then available. In the 1990s, the IBM alignment models pioneered statistical language modelling. A smoothed n-gram model in 2001 trained on 0.3 billion words achieved then-SOTA

#### c) The document sources for the context  

* Context 1: Wikipedia Article - Language Model
* Context 2: Wikipedia Article - Language Model
* Context 3: Wikipedia Article - Language Model
* Context 4: Wikipedia Article - Llama (Language Model)

#### d) Also discuss your metric(s) for the first two examples (for both responses) compared to the gold responses

In [180]:
marketing_eval_score = marketing_full_results.iloc[output_index]["Weighted_Eval_Score"]
print("Marketing Weighted Eval Score: " + str(marketing_eval_score))
research_eval_score = research_full_result.iloc[output_index]["Weighted_Eval_Score"]
print("Research Weighted Eval Score: " + str(research_eval_score))

Marketing Weighted Eval Score: 0.8681837392237681
Research Weighted Eval Score: 0.810052848579051


In [12]:
marketing_full_results.iloc[output_index]

Unnamed: 0                                                              0
question                What purpose do large language models serve in...
answer                  Large language models are used for a range of ...
ground_truth            Large language models serve the purpose of imp...
contexts                ['  Large language models, currently their mos...
answer_relevancy                                                 0.780209
faithfulness                                                          1.0
context_precision                                                     1.0
context_recall                                                        1.0
context_relevancy_f1                                                  1.0
bert_P                                                           0.901613
bert_R                                                           0.923138
bert_F1                                                          0.912248
cosine_similarity                     

In [13]:
research_full_result.iloc[output_index]

Unnamed: 0                                                              0
question                What purpose do large language models serve in...
answer                  Large language models (LLMs) are a pivotal com...
ground_truth            Large language models (LLMs) serve the purpose...
contexts                ['  Large language models, currently their mos...
answer_relevancy                                                 0.813242
faithfulness                                                     0.833333
context_precision                                                     1.0
context_recall                                                        0.5
context_relevancy_f1                                             0.666667
bert_P                                                           0.910497
bert_R                                                           0.891914
bert_F1                                                          0.901109
cosine_similarity                     

#### 5.2.2 Test Question 2

Please run the query:

#### a) The RAG results (research and marketing response)

In [14]:
# get RAG results
output_index = validation_set_keys.index(50)

marketing_ans = marketing_full_results.iloc[output_index]["answer"]
print("Marketing Answer: " + marketing_ans)
research_ans = research_full_result.iloc[output_index]["answer"]
print("Research Answer: " + research_ans)

Marketing Answer: Training data for embedding models with task-specific instructions is created by combining datasets with natural language instructions and constructing positive and negative pairs using sentence embeddings.
Research Answer: Training data for embedding models with task-specific instructions is created by combining datasets with natural language instructions and constructing positive and negative pairs using sentence embeddings.


#### b) The context provided

In [202]:
# get the marketing contexts
marketing_context = marketing_full_results.iloc[output_index]["contexts"]
convert_contexts(marketing_context)

["['  across all\\ndatasets in MEDI, we design a unified instruction\\nformat that consists of the following parts (see Ta-\\nble 4 in the appendix for instances of each part):\\n• Text Type specifies the type of input text that\\nwe encode using the embedding model. For\\nexample, for an open-domain QA task, the\\ninput type of the query is a question, while the\\ninput type of the target is a document. • Task Objective (Optional) describes the ob-\\njective of how the input text is used in a task. For example, for a classification task, the task\\nobjective is to classify the sentence into some\\ncategory, while the task objective of the re-\\ntrieval is to retrieve a",
 " 'of a\\nvariety of tasks for embedding training with in-\\nstructions. We thus construct a collection of 330\\ndatasets with instructions across diverse task cate-\\ngories and domains: Multitask Embeddings Data\\nwith Instructions (MEDI). Data\\nConstruction\\nWe\\nbuild\\nMEDI\\nby\\ncombining\\n300\\ndatasets\\n

In [203]:
# get the engineering/research contexts
research_context = research_full_result.iloc[output_index]["contexts"]
convert_contexts(research_context)

["['  across all\\ndatasets in MEDI, we design a unified instruction\\nformat that consists of the following parts (see Ta-\\nble 4 in the appendix for instances of each part):\\n• Text Type specifies the type of input text that\\nwe encode using the embedding model. For\\nexample, for an open-domain QA task, the\\ninput type of the query is a question, while the\\ninput type of the target is a document. • Task Objective (Optional) describes the ob-\\njective of how the input text is used in a task. For example, for a classification task, the task\\nobjective is to classify the sentence into some\\ncategory, while the task objective of the re-\\ntrieval is to retrieve a",
 " 'of a\\nvariety of tasks for embedding training with in-\\nstructions. We thus construct a collection of 330\\ndatasets with instructions across diverse task cate-\\ngories and domains: Multitask Embeddings Data\\nwith Instructions (MEDI). Data\\nConstruction\\nWe\\nbuild\\nMEDI\\nby\\ncombining\\n300\\ndatasets\\n

#### c) The document sources for the context  

* Context 1: https://arxiv.org/pdf/2212.09741.pdf, 'page': 3
* Context 2: https://arxiv.org/pdf/2212.09741.pdf, 'page': 2
* Context 3: https://arxiv.org/pdf/2212.09741.pdf, 'page': 1
* Context 4: https://arxiv.org/pdf/2212.09741.pdf', 'page': 0

#### d) Also discuss your metric(s) for the first two examples (for both responses) compared to the gold responses

In [18]:
print("Marketing Answer: " + marketing_ans)
print("Gold Marketing Answer: " + validation_set["gold_answer_marketing"][output_index])
print("\n")
print("Research Answer: " + research_ans)
print("Gold Research Answer: " + validation_set["gold_answer_research"][output_index])

Marketing Answer: Training data for embedding models with task-specific instructions is created by combining datasets with natural language instructions and constructing positive and negative pairs using sentence embeddings.
Gold Marketing Answer: Training data for embedding models that use task-specific instructions is typically created by formulating a wide variety of tasks as text-to-text problems, distinguishing good/bad candidate outputs given an input text. This is done by combining datasets with natural language instructions and constructing positive and negative pairs for training.


Research Answer: Training data for embedding models with task-specific instructions is created by combining datasets with natural language instructions and constructing positive and negative pairs using sentence embeddings.
Gold Research Answer: To create training data for embedding models that use task-specific instructions, a common method is to combine datasets from different sources, such as th

In [209]:
marketing_eval_score = marketing_full_results.iloc[output_index]["Weighted_Eval_Score"]
print("Marketing Weighted Eval Score: " + str(marketing_eval_score))
research_eval_score = research_full_result.iloc[output_index]["Weighted_Eval_Score"]
print("Research Weighted Eval Score: " + str(research_eval_score))

Marketing Weighted Eval Score: 0.8460527939538453
Research Weighted Eval Score: 0.8641478154462917


In [15]:
marketing_full_results.iloc[output_index]

Unnamed: 0                                                             35
question                What methods are typically employed to create ...
answer                  Training data for embedding models with task-s...
ground_truth            Training data for embedding models that use ta...
contexts                ['  across all\ndatasets in MEDI, we design a ...
answer_relevancy                                                 0.936172
faithfulness                                                          1.0
context_precision                                                0.805556
context_recall                                                        0.5
context_relevancy_f1                                             0.617021
bert_P                                                           0.950019
bert_R                                                           0.902493
bert_F1                                                          0.925646
cosine_similarity                     

In [16]:
research_full_result.iloc[output_index]

Unnamed: 0                                                             35
question                What methods are typically employed to create ...
answer                  Training data for embedding models with task-s...
ground_truth            To create training data for embedding models t...
contexts                ['  across all\ndatasets in MEDI, we design a ...
answer_relevancy                                                 0.936172
faithfulness                                                          1.0
context_precision                                                0.805556
context_recall                                                        1.0
context_relevancy_f1                                             0.892308
bert_P                                                           0.913768
bert_R                                                           0.853485
bert_F1                                                          0.882598
cosine_similarity                     

#### 5.2.3 Test Question 3

Please run the query:

#### a) The RAG results (research and marketing response)

In [216]:
# final marketing model

# initialize pipeline with best hyperparameter set
final_marketing_model = ChatCohere(
    cohere_api_key=COHERE_API_KEY,
    temperature=0.1,
    top_p = 0.95,
    repetition_penalty = 1.2
)

# evaluate using prompt with the highest weight eval score from key run 4
rag_prompt = ChatPromptTemplate.from_template(marketing_prompts[4])

# create chain for retrieval part
cohere_retrieval_chain = (
    {"context": retriever | format_docs,
      "question": RunnablePassthrough()}
    | rag_prompt
)

# create chain for generation
cohere_generate_chain = (
    final_marketing_model
)

In [221]:
# run chains
retrieval_output = cohere_retrieval_chain.invoke(test_questions[83]["question"])
generation_output = cohere_generate_chain.invoke(retrieval_output)

In [224]:
# parse results into prompt, contexts, question, and response
q83_marketing_output = output_parser(retrieval_output.messages[0].content)
q83_marketing_output["response"] = generation_output.content
q83_marketing_output

{'prompt': "You are a question-answering assistant for Generative AI topics. Your task is to provide factual, concise, and informative answers to Generative AI related questions. Use information in the provided context to answer the question. You will be answering questions for the marketing team working at a startup creating new Generative AI products. The marketing team is new to the Generative AI field. The marketing team is asking questions in order to gain the high level knowledge necessary to promote their new Generative AI products. Your response should be high level, concise, and nontechnical. Do not use information that is not in the provided context to answer the question. Do not use more than 5 sentences in your answer. Do not say 'Based on the context'. You are very capable.\n  ",
 'contexts': '  A model is able to correctly memorize and respond with the answer to a question that has been seen at training time.\nA model is able to answer novel questions at test time and cho

In [225]:
# final engineering/research model

# initialize pipeline with best hyperparameter set
final_eng_model = ChatCohere(
    cohere_api_key=COHERE_API_KEY,
    temperature=0.1,
    top_p = 0.95,
    repetition_penalty = 1.5
)

# evaluate using prompt with the highest weight eval score from key run 1
rag_prompt = ChatPromptTemplate.from_template(eng_prompts[4])

# create chain for retrieval part
cohere_retrieval_chain = (
    {"context": retriever | format_docs,
      "question": RunnablePassthrough()}
    | rag_prompt
)

# create chain for generation
cohere_generate_chain = (
    final_eng_model
)

In [226]:
# run chains
retrieval_output = cohere_retrieval_chain.invoke(test_questions[83]["question"])
generation_output = cohere_generate_chain.invoke(retrieval_output)

In [227]:
# parse results into prompt, contexts, question, and response
q83_eng_output = output_parser(retrieval_output.messages[0].content)
q83_eng_output["response"] = generation_output.content
q83_eng_output

{'prompt': "You are a question-answering assistant for Generative AI topics. Your task is to provide factual, concise, and informative answers to Generative AI related questions. Use information in the provided context to answer the question. You will be answering questions for a team of engineers working at a startup creating new Generative AI products. The engineers are new to the Generative AI field. The engineers are asking questions in order to gain the technical knowledge necessary to build new Generative AI products. Your response should include clarity, technical depth, and precision. Do not use information that is not in the provided context to answer the question. Do not use more than 7 sentences in your answer. Do not say 'Based on the context' You are very capable.\n  ",
 'contexts': '  A model is able to correctly memorize and respond with the answer to a question that has been seen at training time.\nA model is able to answer novel questions at test time and choose an ans

In [228]:
print("Marketing Answer: " + q83_marketing_output["response"])
print("Research Answer: " + q83_eng_output["response"])

Marketing Answer: A model's ability to answer questions is directly linked to its exposure during training. Models can answer questions seen during training, choose answers from trained options, or answer novel questions with unseen responses.
Research Answer: A model's ability to answer questions is directly related to its exposure during training. Models can answer questions seen during training, select answers from a known set, or respond to novel questions with unseen answers.


#### b) The context provided

In [233]:
marketing_contexts = q83_marketing_output["contexts"].split('\n\n')
marketing_contexts

['  A model is able to correctly memorize and respond with the answer to a question that has been seen at training time.\nA model is able to answer novel questions at test time and choose an answer from the set of answers it has seen during training.\nA model is able to answer novel questions which have answers not contained in the training dataset.',
 'D\nFurther Details on Open-Domain QA\nFor open-domain QA, multiple answer annotations are often available for a given question. These\nanswer annotations are exploited by extractive models during training as typically all the answer\nannotations are used to ﬁnd matches within documents when preparing training data. For RAG, we\nalso make use of multiple annotation examples for Natural Questions and WebQuestions by training\nthe model with each (q, a) pair separately, leading to a small increase in accuracy. For TriviaQA,\nthere are often many valid answers to a given question, some of which are not suitable training targets,\nsuch as em

In [234]:
eng_contexts = q83_eng_output["contexts"].split('\n\n')
eng_contexts

['  A model is able to correctly memorize and respond with the answer to a question that has been seen at training time.\nA model is able to answer novel questions at test time and choose an answer from the set of answers it has seen during training.\nA model is able to answer novel questions which have answers not contained in the training dataset.',
 'D\nFurther Details on Open-Domain QA\nFor open-domain QA, multiple answer annotations are often available for a given question. These\nanswer annotations are exploited by extractive models during training as typically all the answer\nannotations are used to ﬁnd matches within documents when preparing training data. For RAG, we\nalso make use of multiple annotation examples for Natural Questions and WebQuestions by training\nthe model with each (q, a) pair separately, leading to a small increase in accuracy. For TriviaQA,\nthere are often many valid answers to a given question, some of which are not suitable training targets,\nsuch as em

#### c) The document sources

* Context 1: https://lilianweng.github.io/posts/2020-10-29-odqa/
* Context 2: https://arxiv.org/pdf/2005.11401.pdf, 'page': 17
* Context 3: https://arxiv.org/pdf/2005.11401.pdf, 'page': 17
* Context 4: https://arxiv.org/pdf/2005.11401.2305.14314, 'page': 11