## DATASCI 290 - GenAI - Assignment 5 - Provided Overview

Below is the provided overview for this project.

The overall scenario is as follows:

You work at a tech company that is looking for new ways to organize their question answering and search capabilities to accelerate both engineering activity and the marketing team. The company also wants to roll out new GenAI-based products, so a lot of the questions will center around Generative AI concepts. The company has about 300 engineers and a marketing staff of 40. Product releases are done quarterly.

Your role is to implement and conduct a (mini-)POC helping the company to evaluate RAG capabilities for the improvement of their document search (and corresponding question answering), supporting particularly the engineering and marketing organizations. You will have a gold dataset with 'good' responses to questions from marketing and engineering teams. You need to develop metric(s) that help you to evaluate how well your RAG system performs relative to the gold data. You should work with the tunables of the setup (LLM, chunking, embeddings, ...) for your iterations.
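
As a rough illustration of the evaluation loop described above, the sketch below scores candidate answers against both gold answers for each question and averages the scores per audience. The `token_f1` metric, the `evaluate` helper, and the toy `gold` and `predictions` dictionaries are assumptions made for illustration only; the key names mirror the validation set shown later in this notebook, and a real run would plug in the metrics chosen for the project.

```python
from collections import Counter
from statistics import mean


def token_f1(candidate: str, reference: str) -> float:
    """Harmonic mean of unigram precision and recall (a simple stand-in metric)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate(gold: dict, predictions: dict, metric=token_f1) -> dict:
    """Average the metric per audience over every question that has a prediction."""
    scores = {"research": [], "marketing": []}
    for qid, entry in gold.items():
        if qid not in predictions:
            continue
        for audience in scores:
            reference = entry[f"gold_answer_{audience}"]
            scores[audience].append(metric(predictions[qid][audience], reference))
    return {aud: mean(vals) if vals else 0.0 for aud, vals in scores.items()}


# Toy data in the same shape as the validation set (hypothetical content).
gold = {
    0: {
        "question": "Who developed Chinchilla?",
        "gold_answer_research": "Chinchilla was developed by DeepMind",
        "gold_answer_marketing": "DeepMind developed Chinchilla",
    }
}
predictions = {
    0: {
        "research": "Chinchilla was developed by DeepMind",
        "marketing": "OpenAI built it",
    }
}

results = evaluate(gold, predictions)
print(results)
```

Keeping the metric as a parameter means the same loop can be reused when swapping in a different scoring function for a later iteration.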

You will also need to write up your findings as a short proposal.

## Model, Parameter, Prompt & Evaluation Options



Below are the tuning options we are choosing to explore to improve this RAG chain.

<br>

**Embedding Models**
 - 'multi-qa-mpnet-base-dot-v1'
 - 'all-MiniLM-L6-v2'
 - 'avsolatorio/GIST-Embedding-v0'

**Splitter Chunk Parameters**
 - 'CHUNK_SIZE'
 - 'OVERLAP'
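
To make the two splitter parameters concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is a simplified stand-in, not the LangChain splitter logic (which also takes separators into account), and `chunk_text` is a hypothetical helper introduced only for this example.

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters.

    Each window shares its first `overlap` characters with the end of the previous one.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


CHUNK_SIZE = 10
OVERLAP = 4
chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", CHUNK_SIZE, OVERLAP)
print(chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

A larger `OVERLAP` reduces the chance that an answer is split across a chunk boundary, at the cost of storing and embedding more redundant text.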

**Retriever Parameters**
 - Number of documents
 - Type of search
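
The two retriever knobs can be illustrated with toy vectors: the number of documents is the `k` returned, and the type of search decides how those `k` are picked. The sketch below contrasts plain cosine-similarity ranking with a simplified maximal marginal relevance (MMR) selection. The functions, the `lambda_mult` weight, and the two-dimensional vectors are assumptions for illustration; real vector stores ship their own implementations.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_search(query_vec, doc_vecs, k):
    """Return the indices of the k documents most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


def mmr_search(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Greedy MMR: trade relevance to the query against redundancy with earlier picks."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Documents 0 and 1 are near-duplicates; document 2 is relevant but different.
docs = [[1.0, 0.0], [1.0, 0.1], [0.8, 0.6]]
query = [1.0, 0.2]

print(similarity_search(query, docs, k=2))  # [1, 0]
print(mmr_search(query, docs, k=2))         # [1, 2]
```

With similarity search both near-duplicates are returned, while MMR replaces the second one with the more diverse document, which can matter when only a few chunks fit into the prompt.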

**LLM Model**
 - Cohere
 - 'mistralai/Mistral-7B-Instruct-v0.1'

**RAG Prompt Template**

**Evaluation Metrics**
 - BERTScore
 - BLEU
 - ROUGE
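
For intuition about what the overlap-based metrics measure, the sketch below implements two of their building blocks in plain Python: the clipped n-gram precision that BLEU is built on, and an LCS-based ROUGE-L F1. It is a simplification under stated assumptions (whitespace tokenization, no stemming, no brevity penalty, a single reference) and will not reproduce library scores exactly. BERTScore is left out because it requires a pretrained model.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Share of candidate n-grams found in the reference, with counts clipped."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    return sum((cand & ref).values()) / total


def lcs_length(a, b):
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """F1 of LCS-based precision and recall."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"

print(clipped_precision(candidate, reference, n=1))  # 5 of 6 unigrams match
print(clipped_precision(candidate, reference, n=2))  # 3 of 5 bigrams match
print(rouge_l_f1(candidate, reference))              # LCS has 5 tokens
```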
---

## 1) Setup

### 1.A) Provided Setup

Below is the setup that was provided with the original notebook.

In [None]:
%%capture
!pip -q install git+https://github.com/huggingface/transformers
!pip install -q datasets loralib sentencepiece
!pip -q install bitsandbytes accelerate
!pip -q install langchain
!pip install einops
!pip install faiss-gpu
!pip install --upgrade --quiet  langchain-community chromadb bs4 qdrant-client
!pip install langchainhub

!pip install --upgrade --quiet  wikipedia
!pip install --upgrade --quiet  arxiv
!pip install --upgrade --quiet  pymupdf

!pip install xmltodict

!pip install cohere

!pip install rouge_score

!pip install -U langchain-cohere

In [None]:
import torch
import os
import bs4
import json
import numpy as np
import time


from pprint import pprint

import locale

from transformers import AutoTokenizer , AutoModelForCausalLM
from transformers import pipeline, BitsAndBytesConfig

from langchain_community.llms import HuggingFacePipeline
from langchain.chains import LLMChain
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain_core.output_parsers import StrOutputParser
from langchain import hub
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_community.vectorstores import Chroma
from langchain_community.vectorstores import Qdrant
from langchain_core.runnables import RunnablePassthrough
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.utils.math import cosine_similarity

from langchain_community.document_loaders import ArxivLoader
from langchain_community.document_loaders import WikipediaLoader
from langchain_community.document_loaders import OnlinePDFLoader
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.document_loaders import PubMedLoader

# from langchain_community.chat_models import ChatCohere
from langchain_cohere import ChatCohere

from google.colab import userdata

In [None]:
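# Workaround commonly used in Colab: force UTF-8 as the preferred encoding
# so that shell commands (e.g. `!pip ...`) do not fail on a non-UTF-8 locale.
locale.getpreferredencoding = lambda: "UTF-8"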

In [None]:
%%capture
!pip install sentence_transformers

Add your keys from the secret store (do **NOT** print them out or leave them exposed as plaintext in your notebook!):

In [None]:
COHERE_API_KEY = userdata.get('COHERE_PRODUCTION_API_KEY')
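
If the notebook is ever run outside Colab, `google.colab.userdata` is not available. The sketch below shows one possible fallback to an environment variable; `load_secret` and the `DEMO_API_KEY` name are hypothetical and only serve the example. The value is never printed.

```python
import os


def load_secret(name: str) -> str:
    """Read a secret from Colab's secret store, else from an environment variable."""
    try:
        from google.colab import userdata  # only importable inside Colab

        value = userdata.get(name)
    except Exception:
        # Not running in Colab, or the secret is not defined there.
        value = None
    value = value or os.environ.get(name)
    if not value:
        raise RuntimeError(f"Secret {name!r} is not set")
    return value


# Placeholder value for demonstration only; real keys belong in the secret store.
os.environ["DEMO_API_KEY"] = "not-a-real-key"
api_key = load_secret("DEMO_API_KEY")
print("key loaded:", bool(api_key))
```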

### 1.B) Validation Question/Answer Set

**Below is an exact copy of the validation set for evaluation. No changes made, just a copy/paste.**

In [None]:
validation_questions_answers = {
    0: {"question": "What purpose do large language models serve in the field of natural language processing?",
  "gold_answer_research": "Large language models (LLMs) serve the purpose of enabling general-purpose language generation and other natural language processing tasks such as classification. They achieve this by learning statistical relationships from text documents during computationally intensive self-supervised and semi-supervised training. LLMs can be used for text generation by predicting the next token or word, making them valuable for tasks like speech recognition, machine translation, and information retrieval. Additionally, LLMs have superseded previous models like recurrent neural networks, showcasing their efficiency and effectiveness in NLP tasks.",
  "gold_answer_marketing": "Large language models serve the purpose of improving performance in various natural language processing tasks, such as speech recognition, machine translation, natural language generation, optical character recognition, handwriting recognition, grammar induction, and information retrieval."},
1: {"question": "How does a large language model learn from text during training?",
  "gold_answer_research": "A large language model learns from text during training by first going through an unsupervised generative 'pretraining' stage where it sets initial parameters using a language modeling objective. Then, it goes through a supervised discriminative 'fine-tuning' stage where it refines its parameters based on annotated examples or task demonstrations. This dual-stage approach allows the model to learn statistical relationships from text documents in a computationally intensive process, enabling it to achieve general-purpose language generation and natural language processing tasks.",
  "gold_answer_marketing": "A large language model learns from text during training by first pretraining on a diverse dataset to acquire general language knowledge, and then fine-tuning on specific tasks or demonstrations to adapt its parameters for more targeted performance."},
2: {"question": "What are some key architectures behind the development of large language models?",
  "gold_answer_research": "Key architectures behind the development of large language models include the use of self-attention mechanisms, such as those seen in Transformer decoders. These architectures have been applied to tasks like autoregressive language modeling and have led to the dominance of Transformer-based language models in NLP. Models like BERT and GPT-2 have further advanced this paradigm, showcasing the power of large Transformer language models in achieving state-of-the-art results across various NLP tasks. Additionally, architectures like neural-retriever-in-the-loop generative-based models have shown improvements in tasks like open-domain QA and knowledge-grounded dialogue, emphasizing the importance of consistent and engaging responses in long-form generation and multi-turn conversations.",
  "gold_answer_marketing": "Key architectures behind the development of large language models include Transformer-based models such as BERT and GPT-2, which utilize self-attention mechanisms for tasks like autoregressive language modeling and knowledge-grounded dialogue. These models have shown significant success in NLP tasks and have led to advancements in general-purpose language generation and natural language processing."},
3: {"question": "Can you name some specific large language models and the companies or organizations that have developed them?",
  "gold_answer_research": "Some specific large language models include GPT-3 by OpenAI, Chinchilla by DeepMind, and BERT by Google. OpenAI developed GPT-3, DeepMind developed Chinchilla, and Google developed BERT. These models have been significant advancements in the field of natural language processing.",
  "gold_answer_marketing": "Chinchilla by DeepMind, GPT-3 by OpenAI."},
7: {"question": "What licensing models have been adopted for the distribution of source-available language models?",
  "gold_answer_research": "Based on the provided context, it seems that licensing models for the distribution of source-available language models have not been explicitly discussed in the referenced papers. However, it is crucial to consider potential licensing options such as open-source licenses (e.g., GPL, MIT) or proprietary licenses when distributing language models to ensure legal compliance and control over usage rights. Additionally, considering the implications of different licensing models on accessibility, collaboration, and commercialization is essential for determining the most suitable approach for sharing language models with the community. Further research or consultation with legal experts may be necessary to explore specific licensing strategies for source-available language models.",
  "gold_answer_marketing": "Answer: Some organizations choose open-sourcing, while others restrict access to a few organizations with resources or offer end-to-end deployment via API."},
8: {"question": "What are language models and what is their purpose in natural language processing?",
  "gold_answer_research": "Language models are probabilistic models of natural language that help predict or correct text. Their purpose in natural language processing is to assist in various tasks such as speech recognition, machine translation, natural language generation, and information retrieval. By analyzing the performance of human subjects, language models improve the understanding and generation of human-like text.",
  "gold_answer_marketing": "Language models are probabilistic models of natural language that are used in tasks such as speech recognition, machine translation, and natural language generation in natural language processing."},
9: {"question": "How have language models evolved in terms of architecture, from the 1980s to present times?",
  "gold_answer_research": "Language models have evolved significantly in terms of architecture from the 1980s to present times. In the 1980s, the first statistical language model was proposed, leading to experiments by IBM that identified areas for improvement by observing human subjects. However, it wasn't until 2017 when the transformer architecture was introduced by Google, revolutionizing the field. This development paved the way for models like BERT in 2018, which marked a shift towards large-scale transformer-based language models. These modern architectures, based on self-attention mechanisms, have dominated the field of natural language processing, achieving state-of-the-art performance in various tasks.",
  "gold_answer_marketing": "Language models have evolved from early statistical models in the 1980s to modern transformer architectures, such as BERT and GPT-2, which use self-attention mechanisms and have become dominant in natural language processing tasks."},
11: {"question": "Can you explain how maximum entropy language models work and what the partition function signifies?",
  "gold_answer_research": "Maximum entropy language models use feature functions to encode the relationship between a word and its n-gram history, aiming to maximize reward while satisfying a KL-constrained objective. The partition function, denoted as Z(x), is crucial in normalizing the probabilities of all possible outputs given the input. It represents the sum of the exponential of the reward function over all possible output sequences, making it computationally expensive to estimate but essential for accurate modeling. The partition function ensures that the model's predicted probabilities sum up to 1, providing a foundation for effective language modeling.",
  "gold_answer_marketing": "Maximum entropy language models encode the relationship between a word and the n-gram history using feature functions. The partition function in this context represents the total probability of all possible outcomes, making it a crucial factor in determining the optimal solution for the reward maximization objective."},
12: {"question": "What is the benefit of using continuous space embeddings in recurrent neural network language models?",
  "gold_answer_research": "Continuous space embeddings in recurrent neural network language models help alleviate the curse of dimensionality by representing words as non-linear combinations of weights in the embedding space. This approach helps address the data sparsity problem caused by the exponential increase in possible word sequences with vocabulary size. By utilizing continuous space embeddings, neural networks can effectively capture semantic relationships and meaning within the language model.",
  "gold_answer_marketing": "Continuous space embeddings in recurrent neural network language models help alleviate the curse of dimensionality caused by the exponential increase in possible word sequences, reducing data sparsity issues."},
13: {"question": "What challenges do large language models face in mirroring human cognitive patterns?",
  "gold_answer_research": "Large language models face challenges in mirroring human cognitive patterns because they sometimes learn patterns that humans do not learn, while also failing to learn patterns that humans typically learn. This discrepancy suggests that the models may not be plausible cognitive models, despite matching human performance in some tasks. Further research is needed to address these limitations and improve the alignment of large language models with human cognitive patterns.",
  "gold_answer_marketing": "Large language models sometimes learn patterns that humans do not learn and fail to learn patterns that humans typically do learn."},
16: {"question": "What factors influenced the development of generative language models by Anthropic?",
  "gold_answer_research": "Several factors influenced the development of generative language models by Anthropic, including the limitations in coding, math, and reasoning capabilities of the initial version Claude, the partnerships with companies like Notion and Quora to enhance the model's capabilities, and the need to address biases, unsafe content, and ethical considerations in training data. Additionally, the reliance on supervised learning and the need for controlled generation in generative models played a role in shaping the development of Anthropic's language models.",
  "gold_answer_marketing": "Factors that influenced the development of generative language models by Anthropic include partnerships with companies like Notion and Quora, limitations in coding, math, and reasoning capabilities in initial models like Claude, and the need to address biases and unsafe content in training datasets."},
17: {"question": "What is Constitutional AI and how does it affect the functionality of AI systems?",
  "gold_answer_research": "Constitutional AI is an approach developed by Anthropic for training AI systems, particularly language models like Claude, to be harmless and helpful without relying on extensive human feedback. It involves two phases: supervised learning, where the model generates responses to prompts and self-critiques based on a set of guiding principles, and reinforcement learning, where the model is trained with AI-generated feedback according to constitutional principles. This approach enables the training of AI assistants that are both helpful and harmless, with the ability to explain objections to harmful requests, enhancing transparency and reducing the need for human supervision.",
  "gold_answer_marketing": "Constitutional AI is an approach developed by Anthropic for training AI systems, particularly language models like Claude, to be harmless and helpful without relying on extensive human feedback. It involves supervised learning and reinforcement learning phases to guide the model's responses based on a set of guiding principles (a 'constitution'). This approach aims to create AI systems that are both helpful and transparent in their decision-making process, reducing the need for constant human supervision."},
18: {"question": "How do advances in AI models impact their ability to interact with different types of data, such as images?",
  "gold_answer_research": "Advances in AI models, such as multimodal models like RA-CM3, have significantly improved their ability to interact with different types of data, such as images. These models can refer to external memory, like web data, to increase their knowledge capacity, allowing them to generate correct images from entity-rich captions. Additionally, these models can perform image editing and manually specify examples in-context for better results. The use of large language models, combined with larger datasets and neural networks, has also enhanced their performance in tasks like image generation and text generation.",
  "gold_answer_marketing": "Advances in AI models, such as multimodal models like RA-CM3, allow for better interaction with different types of data, like images, by accessing external memory for increased knowledge capacity and improving performance in tasks like image generation and image editing."},
19: {"question": "What are the potential trade-offs between AI system alignment with ethical guidelines and practical utility?",
  "gold_answer_research": "The potential trade-offs between AI system alignment with ethical guidelines and practical utility include the risk of reduced performance and usability due to stringent ethical alignment measures, as seen with Claude 2. Users may face limitations and refusal of assistance for benign requests, leading to debates over the 'alignment tax' in AI development. Balancing ethical considerations with practical functionality is crucial to ensure alignment with ethical guidelines without compromising the practical utility of AI systems. Research is needed to find a middle ground that prioritizes ethical alignment while maintaining usability and performance.",
  "gold_answer_marketing": "The potential trade-offs between AI system alignment with ethical guidelines and practical utility include balancing stringent ethical alignment that may reduce usability and performance, ensuring transparency and fairness in alignment processes, and addressing the alignment tax that may impact adoption of AI systems."},
20: {"question": "How has the token handling capacity changed between different versions of the Claude model?",
  "gold_answer_research": "The token handling capacity has increased with each new version of the Claude model. Claude Instant has a context length of 100,000 tokens, Claude 2.1 doubled this to 200,000 tokens, and Claude 3 Opus default version has a context window of 200,000 tokens but can be expanded to 1 million for specific use cases. This progression shows a trend towards handling larger amounts of text data for improved performance and capabilities.",
  "gold_answer_marketing": "The token handling capacity has increased from Claude to Claude Instant to Claude 2.1, with Claude Instant having a input context length of 100,000 tokens, Claude 2.1 having a context window of 200,000 tokens, and Claude 3 Opus having a context window of 1 million tokens."},
22: {"question": "In what ways has the Claude model's ability to self-critique and revise its responses enhanced its transparency?",
  "gold_answer_research": "The Claude model's ability to self-critique and revise its responses has enhanced its transparency by allowing for iterative improvements based on past actions and mistakes. Through self-reflection, the model can refine its output by learning from feedback and generating special tokens to signal the need for retrieval or confirm the relevance, support, or completeness of its responses. This process ensures that the model's statements about the world are truthful and accurate, ultimately increasing transparency in its decision-making and reasoning processes.",
  "gold_answer_marketing": "The Claude model's ability to self-critique and revise its responses has enhanced its transparency by allowing it to generate text informed by retrieved passages, criticize the output, and signal the need for retrieval or confirm the output's relevance, support, or completeness. This self-reflection process helps improve the model's accuracy and reliability in generating responses."},
23: {"question": "How do subsequent versions of Claude compare in terms of their likelihood to produce false statements?",
  "gold_answer_research": "Claude Instant is a faster and lighter version of Claude, with an input context length of 100,000 tokens. In contrast, Claude 3 has faced criticism for its stringent ethical alignment, leading to a debate over the 'alignment tax' in AI development. Users have been refused assistance with benign requests, which has sparked discussions on balancing ethical considerations and practical functionality. This suggests that Claude Instant may have a lower likelihood of producing false statements compared to Claude 3 due to its focus on usability and performance.",
  "gold_answer_marketing": "Claude Instant is a faster, less expensive, and lighter version of Claude with a shorter input context length. Claude 3 has faced criticism for ethical alignment issues that may affect usability and performance."},
24: {"question": "Who developed the language model family known as Chinchilla?",
  "gold_answer_research": "The Chinchilla language model family was developed by the research team at DeepMind and presented in March 2022. It is named 'Chinchilla' as an advancement over the previous Gopher model family. The Chinchilla family has been trained to investigate the scaling laws of large language models and is designed to outperform GPT-3.",
  "gold_answer_marketing": "The research team at DeepMind developed the language model family known as Chinchilla."},
25: {"question": "What benchmark did Chinchilla achieve an average accuracy of 67.5% on?",
  "gold_answer_research": "Chinchilla achieved an average accuracy of 67.5% on the MMLU benchmark (Measuring Massive Multitask Language Understanding).",
  "gold_answer_marketing": "Chinchilla achieved an average accuracy of 67.5% on the MMLU benchmark (Measuring Massive Multitask Language Understanding)."},
27: {"question": "What is the relationship between Chinchilla and the Gopher language model families?",
  "gold_answer_research": "The Chinchilla family of transformer models is essentially the same as the Gopher family, with minor modifications and different training optimizers. Chinchilla uses AdamW optimizer while Gopher uses Adam optimizer. Additionally, Chinchilla uses relative positional encoding and RMSNorm instead of absolute positional encoding and LayerNorm used by Gopher. Chinchilla has 70B parameters and outperforms Gopher on the MMLU benchmark by 7%, showcasing an improvement in performance. Both families follow similar naming conventions and were developed to investigate the scaling laws of large language models.",
  "gold_answer_marketing": "Chinchilla is a family of transformer models developed by DeepMind, which is a further development over a previous model family named Gopher. Both model families were trained to investigate the scaling laws of large language models."},
28: {"question": "What distinguishes the architectures of the Chinchilla and Gopher family models in terms of optimization techniques used?",
  "gold_answer_research": "The main distinction in optimization techniques between the Chinchilla and Gopher family models lies in the choice of optimizers. The Gopher family utilizes the Adam optimizer, whereas the Chinchilla family is trained using the AdamW optimizer. Additionally, the Gopher family employs RMSNorm instead of LayerNorm, and relative positional encoding rather than absolute positional encoding. These differences in optimization techniques contribute to the unique characteristics and performance of each model family.",
  "gold_answer_marketing": "The Chinchilla family uses AdamW optimizer, while the Gopher family uses the Adam optimizer."},
30: {"question": "What is the recommended strategy for training large autoregressive language models with limited compute resources, as contributed by the Chinchilla team?",
  "gold_answer_research": "The Chinchilla team recommends that the number of training tokens should be doubled for every model size doubling to achieve better results on downstream tasks. They also suggest using larger, higher-quality training datasets to improve performance. Additionally, they mention the importance of balancing model size and efficiency to address computational costs and inference latency limitations. It is advised to focus on Transformer language models and consider sharing model parameters for quick task-switching when deploying as a service.",
  "gold_answer_marketing": "The Chinchilla team recommends doubling the number of training tokens for every model size doubling and using larger, higher-quality training datasets to achieve better results on downstream tasks."},
33: {"question": "What are some key areas of research in the field of artificial intelligence as reflected in recent academic literature?",
  "gold_answer_research": "Recent academic literature in the field of artificial intelligence reflects key areas of research such as natural language processing with state-of-the-art transformers, feature learning in infinite-width neural networks, diverse beam search for complex scene description, and the development of generative AI models capable of generating text and images. Additionally, research focuses on human preferences in dueling bandits, the use of few-shot learners in language models, and the exploration of knowledge-grounded neural conversation models. These areas of research highlight the advancements in AI technology and its applications across various domains.",
  "gold_answer_marketing": "Some key areas of research in artificial intelligence include natural language processing, deep neural networks, generative AI, AI safety, AI art, reinforcement learning, and language agents alignment."},
34: {"question": "What are some of the limitations of traditional position encoding methods in the architecture of pre-trained language models (PLMs), and what novel approach does the paper propose to address these issues?",
  "gold_answer_research": "One limitation of traditional position encoding methods in PLMs is that they may not enable length extrapolation of pre-existing models, leading to the need for substantial pre-training costs. The paper proposes a novel approach called Position Interpolation, which extends existing PLMs without deviating far from existing definitions of position encoding or attention mechanisms. This method allows for much extended context windows for text modeling, leading to significant perplexity gains and improved model performance.",
  "gold_answer_marketing": "Traditional position encoding methods in PLMs have limitations in enabling length extrapolation and adapting to extended context windows. The paper proposes a novel approach called Position Interpolation, which generates strong models that can effectively make use of much extended context windows. This method allows for substantial pre-training cost savings and preserves the quality of the original models, even for small context window tasks."},
35: {"question": "How does the Rotary Position Embedding (RoPE) approach in Transformers differ from the traditional additive method of position embedding with respect to encoding position information?",
  "gold_answer_research": "The RoPE approach in Transformers differs from the traditional additive method of position embedding by being multiplicative instead of additive. While traditional methods add position encoding to context representations, RoPE incorporates relative position information through rotation matrix product. This means that RoPE naturally includes relative position dependency in the self-attention formulation, without altering terms in the expanded formulation like the additive method does. Additionally, RoPE's properties show that it decays as the relative distance between positions increases, providing a clear theoretical interpretation of how position information is encoded.",
  "gold_answer_marketing": "The RoPE approach in Transformers differs from the traditional additive method of position embedding by incorporating relative position information through rotation matrix product instead of altering terms in the expanded formulation of additive position encoding."},
36: {"question": "What is the significance of comparing the normalized subspace similarity between ∆Wq, ∆Wv, and random Gaussian matrices when analyzing the adaptation of pre-trained language models?",
  "gold_answer_research": "Comparing the normalized subspace similarity between ∆Wq, ∆Wv, and random Gaussian matrices provides insight into the underlying mechanism for adapting pre-trained language models. It helps determine the intrinsic rank of the adaptation matrix ∆W and sheds light on the connection between ∆W and the original weight matrix W. By analyzing these similarities, we can understand how much of the adaptation is specific to the task at hand and how much is influenced by the pre-trained model. This comparison is crucial for optimizing the adaptation process and maximizing downstream performance in NLP tasks.",
  "gold_answer_marketing": "Comparing the normalized subspace similarity between ∆Wq, ∆Wv, and random Gaussian matrices helps understand the underlying mechanism for adapting pre-trained language models. It reveals the intrinsic rank and common singular value directions learned by different runs, shedding light on the fundamental principles of using pre-trained language models for downstream tasks in NLP."},
38: {"question": "What issues are associated with the homogeneity of language model training contractors, and how might it affect the behavior of the models?",
  "gold_answer_research": "The issues associated with the homogeneity of language model training contractors include potential biases in the labeling process, lack of diverse perspectives leading to limited coverage of sensitive content, and reduced robustness in model performance across different tasks. This homogeneity can affect the behavior of the models by reinforcing certain biases, increasing the risk of harmful content generation, and limiting the models' ability to generalize effectively. To address these issues, it is important to ensure diversity among labelers, incorporate varied perspectives in training data, and implement measures to enhance model robustness and performance across a range of tasks.",
  "gold_answer_marketing": "The homogeneity of language model training contractors can lead to biased or limited perspectives in the data, which may result in the models producing harmful content, gaming objectives, or lacking sensitivity to diverse viewpoints. This can affect the behavior of the models by reinforcing stereotypes, increasing toxicity, and reducing their ability to accurately represent under-represented groups."},
39: {"question": "What are common research topics and themes found in recent publications about artificial intelligence and natural language processing?",
  "gold_answer_research": "Recent publications in artificial intelligence and natural language processing have covered topics such as transformer models, feature learning in neural networks, attention mechanisms, multi-task benchmark platforms, semantic search using sentence embeddings, cross-task generalization, and question generation for question answering. Themes commonly explored include machine comprehension of text, reinforcement learning algorithms, sentence embeddings, semantic compositionality, reasoning with language models and knowledge graphs, and the gap between neural text and human text. These publications also delve into deep language understanding, retrieval-augmented transformers, image captioning, and open datasets for image-text pairs.",
  "gold_answer_marketing": "Common research topics and themes in recent publications on artificial intelligence and natural language processing include transformer models, attention mechanisms, semantic search, sentence embeddings, and question answering using language models and knowledge graphs."},
41: {"question": "When conducting demographic and technical assessments of teams or research subjects, what types of data categories are typically collected and analyzed to ensure a comprehensive understanding of the group's composition and the methods used?",
  "gold_answer_research": "When conducting demographic and technical assessments of teams or research subjects, it is important to collect and analyze data categories such as age, gender, education level, professional background, and expertise in specific areas. By gathering information on these categories, you can ensure a comprehensive understanding of the group's composition and the methods used in your assessments. Additionally, it may be helpful to consider factors like cultural background, language proficiency, and geographical location to capture a more nuanced picture of the group being assessed. This detailed approach to data collection and analysis can provide valuable insights for making informed decisions and recommendations based on the gathered information.",
  "gold_answer_marketing": "Answer: Demographic data such as age, gender, education level, and technical data related to skills and experience are typically collected and analyzed for comprehensive understanding."},
43: {"question": "What kind of tasks can be performed using the datasets described in the provided text, and what are some common features of these datasets?",
  "gold_answer_research": "The datasets described in the provided text can be used for tasks such as question answering, duplicate question retrieval, entity retrieval, citation prediction, query understanding, document understanding, passage retrieval, text summarization, fact verification, and code search. Common features of these datasets include diverse task categories, comprehensive instructions, a wide range of synthetic user personalities and interaction patterns, and a focus on enhancing comprehension of documents to deliver accurate results. Additionally, the datasets cover a variety of domains such as public health, scientific exams, climate, and general knowledge.",
  "gold_answer_marketing": "The datasets described in the provided text can be used for tasks such as question answering, document summarization, duplicate question retrieval, code search, sentence simplification, dialogue generation, body retrieval, caption generation, fact verification, and more. Some common features of these datasets include diverse input-output pairs, incorporation of various knowledge-intensive datasets, and a focus on generating high-quality synthetic data points."},
44: {"question": "What conclusions can be drawn about the relationship between input prompt toxicity and output toxicity when using different language models and prompts?",
  "gold_answer_research": "Based on the findings presented in the results section, it can be concluded that the relationship between input prompt toxicity and output toxicity varies depending on the language model used and the specific prompt given. When instructed to produce a safe and respectful output, InstructGPT models generate less toxic outputs compared to GPT-3, but this advantage disappears when the respectful prompt is removed. On the other hand, when explicitly prompted to produce a toxic output, InstructGPT outputs are much more toxic than GPT-3 outputs. Additionally, the toxicity of the model outputs is highly correlated with the toxicity of the input prompt, as shown in Figure 39.",
  "gold_answer_marketing": "The study found that when instructed to produce a safe and respectful output, InstructGPT models generate less toxic outputs compared to GPT-3. However, this advantage disappears when the respectful prompt is removed. Interestingly, when explicitly prompted to produce a toxic output, InstructGPT outputs are much more toxic than GPT-3. This suggests that the toxicity of the output is highly correlated with the toxicity of the input prompt."},
45: {"question": "What are some challenges in training retrieval systems and how are negative samples used to address them?",
  "gold_answer_research": "Training retrieval systems face challenges such as redundancy in retrieved documents and lack of diversity in retrieval. Negative samples, including randomly sampled negatives, denoised hard negatives, and instruction-unfollowing negatives, are crucial for improving system performance. Carefully designed negative samples help the system effectively learn the task, but they can also lead to performance drops in out-of-domain datasets. Combining random samples and challenging negatives during training is key to building a competitive system for both in-domain and out-of-domain retrieval.",
  "gold_answer_marketing": "Some challenges in training retrieval systems include high cost of annotating datasets for new tasks and improving performance in zero-shot settings. Negative samples, such as denoised hard negative documents and instruction-unfollowing negative documents, are used to train retrieval systems effectively and address performance drops in out-of-domain datasets."},
46: {"question": "What factors have been found to potentially impact the ability of models to follow instructions, based on the analysis provided?",
  "gold_answer_research": "Based on the analysis provided, factors that have been found to potentially impact the ability of models to follow instructions include the human feedback obtained from contractors, which may be influenced by their beliefs, cultural backgrounds, and personal history. Additionally, the model's behavior can be affected by false premises in instructions, tendencies to hedge, and performance degradation with multiple explicit constraints in instructions. The models are also not fully aligned or safe, as they can generate toxic or biased outputs, make up facts, and fail to generate reasonable outputs in some cases.",
  "gold_answer_marketing": "Factors that may impact the ability of models to follow instructions include false premises in instructions, models hedging unnecessarily, performance degradation with multiple constraints in instructions, generation of toxic or biased outputs, and over-generalization leading to refusal of innocuous instructions."},
47: {"question": "What are some key factors to consider when building a successful multi-task instruction-following retrieval system as identified in the research?",
  "gold_answer_research": "Some key factors to consider when building a successful multi-task instruction-following retrieval system include the need for cross-task interdependence for training a single retriever, the flexibility and zero-shot transfer enabled by instructions compared to task identifiers, and the elimination of the need for hosting multiple task-specific retrievers. Additionally, optimizing the mix and volume of instructional data for diverse tasks is crucial, as well as considering the impact of ranking strategy in data construction. Finally, the effectiveness of the dataset scale in retrieval and the importance of carefully designed negative samples should be taken into account for improved efficiency of instruction-following retrievers.",
  "gold_answer_marketing": "Key factors to consider when building a successful multi-task instruction-following retrieval system include the effectiveness of the dataset scale in retrieval, the diversity in data and model scale, carefully designed negative samples, and the ability to adapt to new tasks via instructions."},
48: {"question": "What are the benefits of using retrieval-augmented techniques in multimodal language modeling, as demonstrated by the performance of the RA-CM3 model in the document?",
  "gold_answer_research": "The benefits of using retrieval-augmented techniques in multimodal language modeling, as demonstrated by the performance of the RA-CM3 model, include significantly better training efficiency with less training compute, outperforming existing models by using less training data, compute, and parameters. The retrieval augmentation allows the model to focus on learning how to use retrieved documents in context, leading to improved accuracy in classification tasks. Additionally, the RA-CM3 model achieves strong performance in image and caption generation, surpassing existing models like DALL-E and Flamingo despite using fewer resources.",
  "gold_answer_marketing": "The benefits of using retrieval-augmented techniques in multimodal language modeling, as demonstrated by the performance of the RA-CM3 model in the document, include outperforming existing models by using less training data, compute, and parameters, achieving significantly better training efficiency, and improving accuracy in k-shot classification tasks. Additionally, retrieval augmentation allows the model to focus on learning how to use retrieved documents in context, leading to stronger performance in tasks such as image and caption generation."},
50: {"question": "What methods are typically employed to create training data for embedding models that use task-specific instructions?",
  "gold_answer_research": "To create training data for embedding models that use task-specific instructions, a common method is to combine datasets from different sources, such as the SuperNaturalInstructions dataset with existing collections designed for embedding training. The SuperNaturalInstructions dataset provides natural language instructions, which can be paired with positive and negative examples to form training samples. Additionally, for tasks like classification or similarity, training samples can be constructed by selecting text sequences associated with different classes or similarities. This diverse training data is essential for instruction-based finetuning, which enables the embedding model to learn from a wide range of tasks and domains.",
  "gold_answer_marketing": "Training data for embedding models that use task-specific instructions is typically created by formulating a wide variety of tasks as text-to-text problems, distinguishing good/bad candidate outputs given an input text. This is done by combining datasets with natural language instructions and constructing positive and negative pairs for training."},
51: {"question": "What are some of the challenges and innovations associated with fine-tuning large language models, and how does the approach discussed in the referenced text aim to address them?",
  "gold_answer_research": "Some challenges associated with fine-tuning large language models include limited access to and manipulation of knowledge, lagging performance on knowledge-intensive tasks, and the need for provenance in decision-making and updating world knowledge. The approach discussed in the referenced text aims to address these challenges by utilizing Retrieval Augmented Generation (RAG), which involves retrieving relevant passages from a corpus to feed to the language model for improved performance in tasks such as question-answering and dialogue. This iterative approach focuses on improving alignment with user intent and fine-tuning models to control sentiment and improve response quality in various language tasks.",
  "gold_answer_marketing": "The challenges with fine-tuning large language models include aligning them with user intent and controlling the quality of generated outputs. The approach discussed in the referenced text aims to address these challenges by using Retrieval Augmented Generation (RAG) to retrieve relevant passages from a corpus and feed them to the language model, improving alignment and performance."},
52: {"question": "What is a common technique used to address the outlier issue when applying block-wise k-bit quantization to input tensors, and how does it work?",
  "gold_answer_research": "A common technique used to address the outlier issue when applying block-wise k-bit quantization to input tensors is to chunk the input tensor into blocks that are independently quantized, each with their own quantization constant. This approach involves dividing the input tensor into contiguous blocks of size B by flattening the tensor and slicing it into n blocks, where n is determined by the size of the blocks. Each block is then quantized independently using a quantization constant c, which helps prevent outlier values from causing performance degradation.",
  "gold_answer_marketing": "A common technique used to address the outlier issue when applying block-wise k-bit quantization to input tensors is to chunk the input tensor into blocks that are independently quantized, each with their own quantization constant. This helps prevent performance degradation by reducing the impact of outliers on the quantization process."},
54: {"question": "What considerations or techniques are commonly implemented when setting up finetuning experiments for machine learning models?",
  "gold_answer_research": "When setting up finetuning experiments for machine learning models, it is common to use a two-stage approach. The initial stage involves setting the initial parameters using a language modeling objective. This is followed by a supervised discriminative 'fine-tuning' stage to adapt these parameters to the target task. Additionally, it is typical to train all models using the Adam optimizer and a triangular learning rate scheduler with 10% warmup. Experimentation with different hyperparameters such as number of epochs, peak learning rate, and batch size is also conducted to optimize model performance. Finally, utilizing a mixture of datasets and balancing the sizes of datasets can help improve the robustness and generalization of the finetuned models.",
  "gold_answer_marketing": "Considerations for setting up finetuning experiments for machine learning models commonly include using a language modeling objective for initial parameter setting and supervised discriminative fine-tuning for adapting parameters to the target task. Techniques such as hyperparameter search, Adam optimizer with triangular learning rate scheduler, and balancing dataset sizes through mixing strategies are also commonly implemented. Additionally, freezing some model layers during fine-tuning and incorporating negative examples for contrastive learning can be effective strategies."},
55: {"question": "What are the implications of the equivalence relation defined in the theoretical analysis of the DPO model for understanding the relationship between reward functions in reinforcement learning?",
  "gold_answer_research": "The equivalence relation defined in the theoretical analysis of the DPO model implies that two reward functions are considered equivalent if they differ by a constant function. This means that the class of learned reward models is not constrained by this reparameterization, allowing for the exact recovery of the optimal policy. Understanding this relationship between reward functions in reinforcement learning helps in defining a unique reward function within each equivalence class, which is crucial for optimizing policies under existing models of human preferences. It also highlights the generality and flexibility in the reward model due to the proposed reparameterization.",
  "gold_answer_marketing": "The equivalence relation defined in the theoretical analysis of the DPO model shows that two reward functions are considered equivalent if they differ by a fixed function. This implies that different reward functions can lead to the same optimal policy, allowing for flexibility in designing reward models in reinforcement learning."},
59: {"question": "Considering the structure and content of the provided text, what guidelines should be used to evaluate the effectiveness of a summary or chatbot response in this context?",
  "gold_answer_research": "To evaluate the effectiveness of a summary or chatbot response in this context, guidelines should include assessing the faithfulness of the answer to the retrieved context, the relevance of the answer to the question, and the focus of the retrieved context. Additionally, consider using quality metrics such as answer relevancy to rank responses based on how directly they address the question and avoid redundant or incomplete information. Lastly, take into account the performance of different tasks such as summarization, citation prediction, and passage ranking to determine the overall effectiveness of the response.",
  "gold_answer_marketing": "Answer: Evaluate based on faithfulness, answer relevance, and context relevance."},
60: {"question": "What are some recent methods and technologies that have been developed to enhance the capabilities and performance of natural language processing models?",
  "gold_answer_research": "Recent methods and technologies developed to enhance natural language processing models include retrieval-augmented multimodal language modeling, which outperforms existing models with less training data and parameters. Another advancement is the use of feature learning in infinite-width neural networks to improve performance. Additionally, embedding techniques in NLP have been developed to map words or phrases to real number vectors, enhancing the model's understanding of language. These innovations have led to improvements in tasks like query reformulation, document ranking, and fine-tuning larger language models for various applications.",
  "gold_answer_marketing": "Recent methods and technologies include retrieval-augmented language models, feature learning in infinite-width neural networks, and word embeddings."},
61: {"question": "What are some potential directions for future work mentioned in the document related to enhancing question-answering techniques for document-oriented tasks?",
  "gold_answer_research": "One potential direction for future work mentioned in the document is the development of multi-modal approaches that incorporate table and figure information into GPT-4 question-answering for documents. Another direction is to incorporate question type in the PDFTriage approach to improve the efficiency and efficacy of the approach. Additionally, the document suggests further research in document-grounded, information-seeking question answering, which the dataset is designed to facilitate.",
  "gold_answer_marketing": "Some potential future directions mentioned in the document include developing multi-modal approaches that incorporate table and figure information into question-answering for documents, and incorporating question type in the PDFTriage approach to improve efficiency and efficacy."},
62: {"question": "What information would you expect to find in section 2 of a document, based on the types of questions classified under Summarization?",
  "gold_answer_research": "Based on the types of questions classified under Summarization, you would expect to find key takeaways, concise summaries, and specific content extraction related to different sections of the document in section 2. The section likely contains detailed summaries of specific parts of the document, along with structured metadata representation and instructions for summarizing the content effectively. It may also include guidelines for extracting specific information and rewriting text for clarity and conciseness.",
  "gold_answer_marketing": "Based on the types of questions classified under Summarization, you would expect to find key takeaways, concise summaries, and specific content extraction related to the document in section 2."},
63: {"question": "What are the main advantages and attention mechanisms that contribute to the enhanced performance and efficiency of the newly introduced language model as compared to its predecessors?",
  "gold_answer_research": "The main advantages of the newly introduced language model include utilizing retrieval-augmentation to incorporate external knowledge, which improves prediction accuracy. Additionally, the model employs attention mechanisms that allow for better understanding of dependencies between source and target sequences, leading to more informed predictions. These attention mechanisms have been extended from machine translation to various other fields, enhancing the model's adaptability and performance across different tasks. Finally, the model's use of self-attention mechanisms enables better contextual representation learning, parallelization, and modeling of longer intra-token relations, improving efficiency and performance compared to previous models.",
  "gold_answer_marketing": "The main advantages of the newly introduced language model include the use of retrieval-augmented mechanisms, attention mechanisms, and context representation learning, which contribute to enhanced performance and efficiency compared to its predecessors."},
64: {"question": "What criteria are used to assess the quality of recommendations provided by different language models in a comparison study?",
  "gold_answer_research": "In a comparison study of language models, criteria such as sentence relevance, lexical accuracy, and contextual understanding are used to assess the quality of recommendations. Different tasks may benefit from different evaluation measures, such as STRINC, LEXICAL, and CXMI. Additionally, template selection plays a vital role in the quality of recommendations, with deliberate template design being important for tasks like query suggestion. The overall quality of recommendations is often judged using a Likert scale, along with metadata collection for each model output.",
  "gold_answer_marketing": "The criteria used to assess the quality of recommendations provided by different language models in a comparison study include comparing to human-created benchmarks, examining intrinsic character, comparing two models, investigating rate of learning, and analyzing learning curves."},
65: {"question": "What approaches have been proposed to enhance the task performance of language models while considering the trade-offs such as runtime efficiency, robustness to irrelevant context, and attribution quality?",
  "gold_answer_research": "Several approaches have been proposed to enhance the task performance of language models while considering trade-offs. These include using compression and selective augmentation methods to decrease the propensity of models to generate toxic or biased outputs. Adversarial setups have been suggested where labelers find worst-case behaviors of the model and add them to the dataset. Additionally, models like BART and T5 leverage bi-directional attention to achieve stronger performance on both discriminative and generative tasks. These methods aim to balance model performance with considerations such as runtime efficiency, robustness to irrelevant context, and attribution quality.",
  "gold_answer_marketing": "Approaches proposed to enhance language model task performance include compression and selective augmentation, adversarial set-ups for labeling worst-case behaviors, retrieval-augmented models, and extending existing models to enable length extrapolation while maintaining quality."},
67: {"question": "What metrics are commonly used to compare the performance of language models in various tasks, as outlined in an experimental results table?",
  "gold_answer_research": "Common metrics used to compare the performance of language models in various tasks, as outlined in an experimental results table, include Exact Match and Unigram F1. These metrics have become standard in evaluating language models. Additionally, other metrics such as BLEU score, FactScore (factuality), precision, and recall are also commonly used to assess the performance of language models across different tasks. It is important to consider a variety of metrics to get a comprehensive understanding of the effectiveness of a language model in different contexts.",
  "gold_answer_marketing": "The metrics commonly used to compare the performance of language models in various tasks are Exact Match and Unigram F1."},
69: {"question": "What is the role of manual assessment in the validation of language model predictions according to the text provided?",
  "gold_answer_research": "Manual assessment plays a crucial role in the validation of language model predictions. The engineers evaluate the quality of model outputs by having labelers rate them on test sets consisting of prompts from held-out customers. This manual assessment helps ensure that the models are aligned with a broad distribution of language tasks and can identify any behavioral issues that may arise from misalignment. Additionally, human annotators find that certain reflection token predictions are aligned with their assessments, providing valuable insights into the accuracy and effectiveness of the models.",
  "gold_answer_marketing": "Answer: Manual assessment plays a key role in evaluating the quality of language model predictions by having labelers rate the model outputs and comparing them to prompts from held-out customers."},
70: {"question": "What are the general steps outlined for training a language model in the document, and how is the training data for the generator language model collected and utilized?",
  "gold_answer_research": "The document outlines the general steps for training a language model, including incorporating retrieved documents into the main input sequence and optimizing the loss function to train the generator. The training data for the generator language model is collected through various techniques such as supervised fine-tuning, critic learning, and custom retrievers for downstream tasks. The collected data is used to train the generator on specific tasks like summarization, machine reading comprehension, and natural language to SQL translation, improving performance on those tasks.",
  "gold_answer_marketing": "The general steps for training a language model include fine-tuning on specific datasets, filtering pretraining data, and using critic learning. Training data for the generator language model is collected from open-access NLP papers and used for downstream conditional text generation tasks."},
73: {"question": "What are the three main categories used to refine language model abilities in understanding and executing search tasks according to the given document?",
  "gold_answer_research": "The three main categories used to refine language model abilities in understanding and executing search tasks are query understanding, document understanding, and query-document relationship understanding. Tasks within these categories focus on interpreting queries, comprehending documents, and understanding the relationships between queries and documents. This approach aims to enhance the models' performance in interpreting and responding to search-related instructions effectively, improving their utility in complex information retrieval scenarios.",
  "gold_answer_marketing": "The three main categories used to refine language model abilities in understanding and executing search tasks are query understanding, document understanding, and query-document relationship understanding."},
74: {"question": "What are some of the emerging research topics and challenges in the field of natural language processing and information retrieval according to recent academic conferences and publications?",
  "gold_answer_research": "Recent academic conferences and publications have highlighted emerging research topics and challenges in natural language processing and information retrieval. Some key areas of focus include efficient retrieval augmented generation, unsupervised dense information retrieval with contrastive learning, citation-informed transformers, and knowledge refinement via interaction between search engines and large language models. Additionally, challenges such as zero-shot retrieval, semantic search using GPT sentence embeddings, and prompt-based effective input reformulation for legal case retrieval have been identified as important research directions. These topics reflect the ongoing advancements and complexities in the field, driving innovation and progress in NLP and IR research.",
  "gold_answer_marketing": "Some emerging research topics and challenges in the field of natural language processing and information retrieval include efficient generation from unstructured knowledge, semantic code search evaluation, unsupervised dense information retrieval, context-aware document term weighting, knowledge refinement through interaction with large language models, and investigating the effectiveness of large language models in search re-ranking."},
75: {"question": "How do models with different fine-tuning strategies compare in terms of accuracy and F1 score for fact verification tasks?",
  "gold_answer_research": "Models with different fine-tuning strategies are compared in terms of accuracy and F1 score for fact verification tasks. The introduction of LLMs has led to notable developments, with some studies leveraging prompting methods to apply LLMs in IR tasks. However, not all LLMs consistently outperform fine-tuned smaller models. For example, RankGPT based on gpt-3.5-turbo underperforms monoBERT in certain scenarios. Fine-tuning is not strictly necessary for models like GPT3, which has been evaluated on closed book question answering tasks without any updates or fine-tuning.",
  "gold_answer_marketing": "Models with different fine-tuning strategies have shown mixed results in terms of accuracy and F1 score for fact verification tasks. Some studies have found that large language models (LLMs) outperform smaller fine-tuned models, while others have reported inconsistent performance. Factors such as task complexity and the need for prompt methods to apply LLMs in information retrieval tasks can also impact the comparison."},
76: {"question": "What components does a fact verification task typically involve in order to assess the accuracy of a given statement?",
  "gold_answer_research": "A fact verification task typically involves assessing the relationship between a claim and the evidence provided, analyzing if there is enough information for a conclusive judgment. This task requires a detailed understanding of the claim and evidence to determine if it is supported or refuted. The use of performance metrics based on including gold answers in model generations instead of exact matching can help search engines deliver accurate and relevant results. Additionally, incorporating lexical measures and verification functions can aid in determining the accuracy of statements.",
  "gold_answer_marketing": "A fact verification task typically involves assessing the relationship between a claim and supporting evidence to determine accuracy."},
78: {"question": "What are the key factors that determine the performance of HALO-aligned models compared to non-HALO models, according to the results presented in the analysis?",
  "gold_answer_research": "According to the analysis presented, the key factors that determine the performance of HALO-aligned models compared to non-HALO models include the specific alignment method used (such as DPO and PPO variant), the model size (significant gap at 13B+ model sizes), and the ability to match or exceed the generation quality of SFT target sequences. Additionally, the study suggests that the cost of increasing model alignment is modest relative to pretraining, and that the modeling of human biases in HALOs may have practical benefits in improving overall performance.",
  "gold_answer_marketing": "The key factor that determines the performance of HALO-aligned models compared to non-HALO models is the model size, with HALO-aligned models generally outperforming non-HALO models at larger sizes (13B+ model sizes)."},
80: {"question": "How does the performance of KTO compare to DPO in model alignment, and what are the potential implications for data usage and training efficiency?",
  "gold_answer_research": "Based on the provided data and experiments, KTO consistently outperforms DPO in model alignment, even with restrictions such as using only one output per input. This suggests that KTO can achieve higher win rates and improve performance across various benchmarks compared to DPO. The implications of this performance difference include the ability to achieve quality generation results with significantly fewer desirable examples, potentially leading to more efficient data usage and training processes. This indicates that KTO may offer a more efficient and effective approach to model alignment compared to DPO.",
  "gold_answer_marketing": "KTO outperforms DPO in model alignment with up to 90% fewer examples. This suggests that KTO can achieve high performance even with imbalanced data, potentially leading to more efficient training processes."},
81: {"question": "What are some common approaches to building an open-domain question answering system?",
  "gold_answer_research": "Some common approaches to building an open-domain question answering system include using the RAG model, which minimizes the negative log-likelihood of answers, and comparing it to extractive QA paradigms that rely on non-parametric knowledge retrieval. Another approach is to incorporate question rewriting techniques to make open-domain QA more conversational. Additionally, utilizing datasets like QASPER, which contain questions requiring complex reasoning, can improve the performance of the system. References to papers by Anantha et al. and Asai et al. provide further insights into building ODQA systems.",
  "gold_answer_marketing": "Common approaches to building an open-domain question answering system include using retrieval over a knowledge base and incorporating the retrieved content as part of the prompt. Other methods involve pretraining models on large amounts of text data and fine-tuning them for question answering tasks."},
82: {"question": "What is the difference between open-book and closed-book question answering?",
  "gold_answer_research": "Open-book question answering involves the use of external sources of knowledge, such as Wikipedia, to retrieve information and generate a response. In contrast, closed-book question answering relies on pre-trained language models that have memorized factual knowledge within their parameters to generate responses without explicit context. Closed-book QA can be seen as analogous to a closed-book exam where no external resources are allowed. The key distinction lies in the reliance on external knowledge sources for open-book QA versus internal memorized knowledge for closed-book QA.",
  "gold_answer_marketing": "Open-book question answering involves using external sources of knowledge to answer questions, while closed-book question answering relies on pre-trained language models to provide answers without explicit context."},
84: {"question": "What are the basic components of the Retriever-Reader framework in open-domain QA?",
  "gold_answer_research": "The basic components of the Retriever-Reader framework in open-domain QA include a retriever model, which fetches relevant information based on input prompts efficiently using FAISS. The retriever component is responsible for retrieving contextually relevant documents or evidence blocks based on the input question. The reader component then processes this retrieved information to generate answers to the questions posed. This framework combines information retrieval and machine reading comprehension to achieve state-of-the-art results in open-domain question answering tasks.",
  "gold_answer_marketing": "The basic components of the Retriever-Reader framework in open-domain QA are the retriever and the reader components, which can be set up and trained independently or jointly trained end-to-end. The retriever component automatically fetches relevant information based on input prompts, while the reader component processes and comprehends the retrieved information to answer questions."},
85: {"question": "How is the TF-IDF model used in question answering retrieval systems?",
  "gold_answer_research": "In question answering retrieval systems, the TF-IDF model is used to represent queries and documents as bag-of-word vectors with terms weighted by term frequency multiplied by inverse document frequency. This allows for efficient non-learning-based search engine operations based on the vector space model. The TF-IDF model helps in calculating the relevance of documents to queries by measuring the importance of terms in the context of the entire document collection. This classic information retrieval approach aids in retrieving relevant information to answer questions accurately and efficiently.",
  "gold_answer_marketing": "The TF-IDF model is used in question answering retrieval systems to weight terms in queries and documents based on their importance in determining relevance."},
86: {"question": "Can neural networks enhance the process of information retrieval in QA systems?",
  "gold_answer_research": "Neural networks, such as MLP, LSTM, and bidirectional LSTM, can be used to learn dense representations of text for information retrieval in QA systems. These approaches, known as 'Neural IR', are a new category of methods that can improve performance in retrieval problems. The introduction of neural retrievers in recent QA literature has shown to outperform traditional word-similarity-based architectures, such as BM25, and can scale to handle knowledge-grounded dialogue tasks effectively. Additionally, incorporating pre-trained retrievers in QA systems has been shown to enhance the performance of generative language models.",
  "gold_answer_marketing": "Yes, neural networks can enhance the process of information retrieval in QA systems by improving performance in open-domain QA tasks and enabling the generation of more accurate answers."},
87: {"question": "What is the importance of fine-tuning in the context of QA data for open-domain question answering models?",
  "gold_answer_research": "Fine-tuning is important in the context of QA data for open-domain question answering models because it allows the model to adapt and improve its performance on specific QA datasets. By fine-tuning the model with common QA datasets, engineers can optimize the model's ability to answer questions accurately. However, there is a concern about the significant overlap between questions in the train and test sets of public QA datasets, which could affect the generalization ability of the fine-tuned models. Engineers should carefully consider this overlap and potentially explore ways to mitigate its impact during the fine-tuning process to ensure the model's effectiveness in real-world applications.",
  "gold_answer_marketing": "Fine-tuning is important in the context of QA data for open-domain question answering models to improve search task performance and the ability to generalize to unseen datasets."},
88: {"question": "How does pre-training with tasks like the Inverse Cloze Task benefit open-domain question answering models?",
  "gold_answer_research": "Pre-training with tasks like the Inverse Cloze Task benefits open-domain question answering models by improving the retrieval process over a knowledge base. By predicting the context given a sentence, the model can better understand the relationship between the question and the evidence. This approach helps in incorporating retrieved content effectively into the prompt, leading to higher accuracy in the question answering task. Additionally, using models pretrained with ICT can enhance the overall performance of the QA system by providing a better understanding of the context.",
  "gold_answer_marketing": "Pre-training with tasks like the Inverse Cloze Task benefits open-domain question answering models by improving retrieval and generation steps, ultimately enhancing the accuracy of the process."},
89: {"question": "What is the main goal of prompt engineering in language models?",
  "gold_answer_research": "The main goal of prompt engineering in language models is to effectively steer the behavior of the model towards desired outcomes without updating the model weights. This is achieved by composing and formatting prompts in a way that maximizes the model's performance on a specific task. Prompt engineering involves treating prompts as trainable parameters and optimizing them directly on the embedding space through methods like AutoPrompt, Prefix-Tuning, P-tuning, and Prompt-Tuning. The ultimate aim is to enhance the model's performance and alignment with user-defined tasks.",
  "gold_answer_marketing": "The main goal of prompt engineering in language models is to steer the behavior of the model for desired outcomes without updating the model weights."},
91: {"question": "What are some known biases that can affect the performance of few-shot classification in LLMs?",
  "gold_answer_research": "Some known biases that can affect the performance of few-shot classification in LLMs include majority label bias, recency bias, and common token bias. Majority label bias occurs when the distribution of labels among examples is unbalanced, recency bias refers to the tendency for the model to repeat the label at the end, and common token bias indicates that LLM tends to produce common tokens more often than rare tokens. These biases can contribute to high variance in few-shot classification tasks and may impact the model's ability to generalize effectively.",
  "gold_answer_marketing": "Some known biases that can affect the performance of few-shot classification in LLMs are majority label bias, recency bias, and common token bias."},
92: {"question": "Why might increasing model size not reduce variance in model performance with varying prompts?",
  "gold_answer_research": "Increasing model size may not necessarily reduce variance in model performance with varying prompts because the model's ability to generalize and adapt to different prompts is not solely dependent on its size. Factors such as the quality and relevance of the training examples, the learning rate or schedule, and the model's sensitivity to different hyperparameters can also play a significant role in determining performance variability. Additionally, the complexity of the task or dataset being used for training can impact how effectively the model scales with size. It is essential to consider these factors holistically when optimizing model performance rather than relying solely on increasing model size.",
  "gold_answer_marketing": "Increasing model size may not reduce variance in model performance with varying prompts because the same order of prompts may work well for one model but poorly for another. Additionally, when the validation set is limited, choosing the order of prompts that prevents the model from producing extremely unbalanced predictions or being overconfident can also affect performance."},
93: {"question": "What is the benefit of instruction-based finetuning in language models?",
  "gold_answer_research": "Instruction-based finetuning improves models' ability to generalize to unseen domains and tasks by providing task-specific representations that can be used for many downstream language tasks without additional training. This method also allows pretrained language models to follow instructions provided in prompts, enabling them to generate the desired output given specific inputs. Additionally, instruction finetuning helps transform raw pretrained LLMs into chatbot-like models, making finetuning more accessible and common, particularly for researchers with limited resources. Overall, the benefit of instruction-based finetuning is improved model performance, enhanced generalizability, and reduced communication costs in aligning with human intentions.",
  "gold_answer_marketing": "The benefit of instruction-based finetuning in language models is improved ability to generalize to unseen domains and tasks, without the need for additional training."},
94: {"question": "Can you describe a situation where retrieval-based methods would be necessary to enhance language model performance?",
  "gold_answer_research": "Retrieval-based methods are necessary to enhance language model performance in scenarios where the model needs to generate accurate and informative responses for entity-rich queries, such as 'George Washington standing in front of the Eiffel Tower.' In such cases, incorporating a retrieval module can provide additional context and relevant information to improve the model's understanding and generation of the desired output. Additionally, retrieval-based methods are crucial for question answering tasks, where the model needs to access external knowledge sources to provide accurate and comprehensive answers. By utilizing retrieval mechanisms, the language model can benefit from a wider range of information and improve its performance in handling complex and ambiguous queries effectively.",
  "gold_answer_marketing": "Retrieval-based methods are necessary to enhance language model performance in tasks like question answering, where incorporating additional information from external sources can improve the model's ability to generate accurate and relevant responses."},
95: {"question": "What is the Chain-of-Thought prompting technique and for which types of tasks is it particularly beneficial?",
  "gold_answer_research": "Chain-of-Thought (CoT) prompting is a technique that generates reasoning chains or rationales step by step to lead to a final answer, benefiting complicated reasoning tasks using large models with more than 50B parameters. It can be implemented through iterative Monte Carlo search methods or through a three-step process called augment-prune-select. CoT is particularly beneficial for enhancing model performance on complex tasks by decomposing them into smaller and simpler steps, shedding light on the model's thinking process. Task decomposition in CoT can be done with simple prompting, task-specific instructions, or human inputs.",
  "gold_answer_marketing": "Chain-of-Thought (CoT) prompting is a technique that generates reasoning chains or rationales step by step to lead to a final answer. It is particularly beneficial for complicated reasoning tasks when using large models with more than 50B parameters. Simple tasks only benefit slightly from CoT prompting."},
96: {"question": "How do augmented language models with external tools differ from regular models in functionality?",
  "gold_answer_research": "Augmented language models with external tools, such as TALM and Toolformer, are fine-tuned to learn how to use external tool APIs, expanding their capabilities beyond traditional language processing tasks. These models are trained to incorporate external tool API calls in order to improve the quality of their outputs, allowing them to perform tasks like speech recognition, machine translation, and information retrieval more effectively. By leveraging external tools, these models have the ability to access and utilize a wider range of resources and functionalities, enhancing their overall performance and versatility compared to regular language models.",
  "gold_answer_marketing": "Augmented language models with external tools differ from regular models by fine-tuning a LM to use external tool APIs, expanding the dataset to improve model outputs and enhancing tasks like speech recognition, machine translation, and natural language generation."},
97: {"question": "What can be inferred about the utilization of attention in neural networks?",
  "gold_answer_research": "Attention mechanisms in neural networks play a crucial role in allowing models to focus on specific parts of input data when making predictions or generating outputs. By assigning importance weights to different elements, such as pixels in an image or words in a sentence, attention helps the model to attend to relevant information and make more accurate predictions. The use of attention can improve the interpretability of neural networks by showing which parts of the input data are being focused on during the prediction process. Additionally, attention mechanisms, like multi-head attention, can enhance model performance by allowing the model to jointly attend to information from different representation subspaces at different positions.",
  "gold_answer_marketing": "Attention in neural networks allows the model to focus on specific parts of input data, such as images or text, in order to make predictions or generate output. It helps the model to learn relationships and correlations between different elements and improve performance in tasks like image captioning or language translation."},
101: {"question": "Can the use of attention mechanisms in deep learning models be applied to both machine translation and computer vision?",
  "gold_answer_research": "Yes, attention mechanisms in deep learning models have shown success in both machine translation and computer vision tasks. In machine translation, attention allows the model to capture dependencies between source and target sequences regardless of distance, leading to improved translation quality. Similarly, in computer vision, attention mechanisms have been used to focus on relevant parts of an image during caption generation, showcasing the ability to handle details and global dependencies effectively. Therefore, utilizing attention in both domains can enhance the performance of deep learning models significantly.",
  "gold_answer_marketing": "Yes, attention mechanisms in deep learning models can be applied to both machine translation and computer vision."},
102: {"question": "What are the potential benefits of incorporating self-attention mechanisms into Generative Adversarial Networks (GANs)?",
  "gold_answer_research": "Incorporating self-attention mechanisms into GANs can help the generator and discriminator better model relationships between spatial regions, leading to improved generation of detailed and realistic images. This is particularly useful for capturing global dependencies and enhancing the performance of transformer architectures. Additionally, self-attention can enable the model to assess its own predictions after each generated segment, allowing for customizable decoding algorithms to meet specific constraints or user preferences. Overall, self-attention in GANs can enhance detail handling and overall performance.",
  "gold_answer_marketing": "Incorporating self-attention mechanisms into GANs can help the generator and discriminator better model relationships between spatial regions, leading to improved performance in handling details and capturing global dependencies."},
103: {"question": "How does the transformer model variate from traditional sequence-aligned recurrent architectures?",
  "gold_answer_research": "The transformer model differs from traditional sequence-aligned recurrent architectures by not having a recurrent or convolutional structure. Instead, it heavily relies on self-attention mechanisms for processing sequences. This lack of recurrence and convolution, even with positional encoding, weakly incorporates sequential order, which can be a drawback for tasks sensitive to positional dependencies. Additionally, the transformer's architecture includes embedding layers, sinusoid-wave-based positional encoding, and softmax and linear layers in the final decoder output to maintain position information and facilitate processing of long sequences efficiently.",
  "gold_answer_marketing": "The transformer model differs from traditional sequence-aligned recurrent architectures by not having a recurrent or convolutional structure, and instead making heavy use of self-attention. This allows for handling very long sequences efficiently and achieving better performance on tasks involving long texts."},
104: {"question": "What implications does the concept of a Neural Turing Machine have for the theoretical power of neural networks?",
  "gold_answer_research": "The concept of a Neural Turing Machine (NTM) expands the theoretical power of neural networks by incorporating external memory storage, allowing for more complex computations and tasks. This mimics the Turing machine tape, enabling the neural network to control operation heads for reading and writing to the tape. However, the finite memory in NTM suggests it may resemble more of a 'Neural von Neumann Machine,' limiting its mathematical limitlessness seen in traditional Turing machines. Overall, the addition of external memory in NTM enhances the capabilities and potential applications of neural networks in solving more advanced problems.",
  "gold_answer_marketing": "The concept of a Neural Turing Machine suggests that neural networks can be equipped with external memory storage for more complex operations, potentially increasing their theoretical power."},
}


test_questions = {
4: {"question": "When was the transformer architecture introduced, and by which organization?"},
5: {"question": "How has the accessibility of powerful language models, such as GPT-3 and GPT-4, been controlled by their developers?"},
6: {"question": "What benchmarks or ratings are used to compare the capabilities of different language models?"},
10: {"question": "What are some of the primary applications for language models in technology and computing?"},
14: {"question": "How are language models typically evaluated and what benchmarks are used for this purpose?"},
15: {"question": "What datasets are available for evaluating language processing systems?"},
21: {"question": "What collaborations with other companies have contributed to the development of Claude's capabilities?"},
26: {"question": "According to DeepMind, how should the number of training tokens change relative to the model size?"},
29: {"question": "How do the sizes of models in the Gopher family range?"},
31: {"question": "What type of model architecture do the Gopher and Chinchilla families belong to?"},
32: {"question": "Can you name the author who wrote the novels A Farewell to Arms and The Sun Also Rises?"},
37: {"question": "What are the key advantages of InstructGPT models over GPT-3 models according to the findings in the research?"},
40: {"question": "What metrics are used to compare the performance of different models on training and validation splits according to the document provided?"},
42: {"question": "What types of evaluation metrics are commonly used to assess the accuracy of answers in AI-driven question and answer datasets?"},
49: {"question": "What factors contribute to the performance improvement in retrieval-augmented language models compared to non-retrieval-augmented models?"},
56: {"question": "What are the benchmarks used to evaluate the performance of the Deep Policy Optimization (DPO) method compared to other preference learning algorithms in the document provided?"},
57: {"question": "What methodologies have been evaluated for training language models to align with human preferences, and how do they compare in terms of effectiveness?"},
58: {"question": "What methods have been discussed in the literature for improving the alignment of language models with human preferences or feedback?"},
66: {"question": "What are some of the evaluation metrics used for assessing different types of text generation tasks presented in the study?"},
68: {"question": "Consider a document related to research in natural language processing or artificial intelligence. Can you name some of the recent topics or methods that have been discussed or introduced in the field according to the document?"},
71: {"question": "What is the significance of using reflection tokens in a model like SELF-RAG?"},
72: {"question": "How does the inclusion of selected context as opposed to appending all retrieved text spans impact computational cost during both training and inference times in language model generation tasks?"},
77: {"question": "What are the benefits of modeling human biases in Human-Aware Loss Optimizations (HALOs), and how do they compare to non-HALOs on the same datasets?"},
79: {"question": "What are the modifications made to the traditional Kahneman-Tversky model to adapt it for optimizing language model performance?"},
83: {"question": "How does a model's ability to answer questions relate to its exposure to specific types of questions during training?"},
90: {"question": "How can adding examples to a prompt affect the performance of language models?"},
98: {"question": "What are the main components of a Neural Turing Machine (NTM) architecture?"},
99: {"question": "How might a seq2seq model's limitations be addressed in natural language processing tasks?"},
100: {"question": "What differentiates hard attention from soft attention in image processing algorithms?"},
}
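
Both dictionaries are keyed by question ID, and each validation entry carries one gold answer per audience. As a minimal sketch of how the entries could be unpacked for one audience, the snippet below uses `sample_validation`, a hypothetical two-entry stand-in with placeholder answer text, and `gold_pairs`, an illustrative helper that is not part of the provided notebook.

```python
# Hypothetical stand-in with the same shape as the validation dictionary.
# The answer strings are placeholders, not the real gold answers.
sample_validation = {
    89: {
        "question": "What is the main goal of prompt engineering in language models?",
        "gold_answer_research": "<research gold answer>",
        "gold_answer_marketing": "<marketing gold answer>",
    },
    82: {
        "question": "What is the difference between open-book and closed-book question answering?",
        "gold_answer_research": "<research gold answer>",
        "gold_answer_marketing": "<marketing gold answer>",
    },
}


def gold_pairs(dataset, audience):
    """Return (question_id, question, gold_answer) tuples for one audience.

    `audience` is expected to be "research" or "marketing"; entries that
    lack the matching key are skipped.
    """
    key = f"gold_answer_{audience}"
    return [
        (qid, entry["question"], entry[key])
        for qid, entry in sorted(dataset.items())
        if key in entry
    ]


for qid, question, answer in gold_pairs(sample_validation, "marketing"):
    print(qid, question, answer)
```

Sorting by question ID keeps the evaluation order stable across runs, which makes per-question metric tables easier to compare between iterations.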


### 1.C) My Additional Setup

#### 1.C.1) Installs & Imports

Below are some additional installs and imports that we are using.

In [None]:
%%capture
# necessary installs for metrics
!pip install evaluate
!pip install bert_score

In [None]:
from evaluate import load
import pandas as pd # for df of metrics
import time # for pauses
import gc # for garbage collection
from operator import itemgetter # for RAG chain arguments
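
The metrics loaded through `evaluate` compare a generated answer with a gold answer. To make the underlying idea concrete, here is a minimal, dependency-free sketch of a unigram-overlap F1 score, similar in spirit to ROUGE-1. It is an illustration only: it is not the `evaluate` implementation, `unigram_f1` is a hypothetical helper, and it assumes simple lowercase whitespace tokenization.

```python
from collections import Counter


def unigram_f1(prediction, reference):
    """Unigram-overlap F1 between a prediction and a reference string."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0

    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(unigram_f1("the retriever fetches documents", "the retriever fetches relevant documents"))
```

A lexical score like this rewards shared wording only, which is one reason to pair it with an embedding-based metric such as BERTScore when answers are paraphrased.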

#### 1.C.2) Define Functions

To make it easy to explore and tune the model incrementally, we converted the previously provided steps into a series of functions, so that a new configuration can be run by changing only the function arguments.
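
One way to take advantage of this function-based design is to enumerate the configurations to try up front. The sketch below builds such a grid from the embedding models listed in the overview; the chunk sizes and overlaps are hypothetical values chosen only for illustration, and `experiment_grid` is an assumed helper rather than part of the notebook.

```python
from itertools import product

# Embedding models named in the tuning options above.
EMBEDDING_MODELS = [
    "multi-qa-mpnet-base-dot-v1",
    "all-MiniLM-L6-v2",
    "avsolatorio/GIST-Embedding-v0",
]
# Hypothetical values, chosen only to illustrate the grid.
CHUNK_SIZES = [128, 256]
OVERLAPS = [0, 32]


def experiment_grid(embedding_models, chunk_sizes, overlaps):
    """Return one configuration dict per valid parameter combination."""
    grid = []
    for model, size, overlap in product(embedding_models, chunk_sizes, overlaps):
        # An overlap as large as the chunk itself would never advance.
        if overlap >= size:
            continue
        grid.append(
            {"embedding_model": model, "chunk_size": size, "chunk_overlap": overlap}
        )
    return grid


configs = experiment_grid(EMBEDDING_MODELS, CHUNK_SIZES, OVERLAPS)
print(len(configs), "configurations to evaluate")
```

Each dictionary can then be unpacked into the arguments of the builder functions defined below, so a full sweep becomes a single loop.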

##### Document Functions

In [None]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

def vectorstore_wiki_docs(query, global_doc_number, text_splitter):
    """
    Queries Wikipedia to retrieve documents based on a specified query, annotates them with a global document number,
    and then splits each document into smaller parts. Each part is also indexed with a split ID. This function is
    designed for preparing and structuring Wikipedia text data for use in vector storage or further processing.

    The function performs three main steps:
    1. Queries Wikipedia based on the given query parameter and annotates each retrieved document with a global document number
        and the document's source ("Wikipedia").
    2. Increments the global document number by one.
    3. Uses the provided text_splitter object to split each Wikipedia document into smaller parts; each part is then annotated
        with a split ID that is unique across all parts returned by this call.

    Parameters:
    - query (str): The search query to be used for retrieving Wikipedia documents.
    - global_doc_number (int): The starting global document number to be assigned to the first document retrieved. This number
                                is incremented by 1 for each new set of document retrievals.
    - text_splitter (object): An object capable of splitting text documents into smaller parts. This object must have a
                              method `split_documents` that takes a list of documents and returns a list of document splits.

    Returns:
    - tuple: A tuple containing two elements:
        - The first element is a list of the split parts of the Wikipedia documents, each part annotated with metadata including
          a unique 'split_id' and inheriting 'doc_num' and 'doc_source' from its parent document.
        - The second element is the updated global document number after incrementation.

    Note:
    The WikipediaLoader and its method `load` are used to retrieve documents from Wikipedia. The actual implementation
    of WikipediaLoader and text_splitter's `split_documents` method are assumed to be defined elsewhere.
    """
    # querying wikipedia
    wiki_docs = WikipediaLoader(query=query, load_max_docs=4).load()
    for idx, text in enumerate(wiki_docs):
        wiki_docs[idx].metadata['doc_num'] = global_doc_number
        wiki_docs[idx].metadata['doc_source'] = "Wikipedia"

    # updating doc number
    global_doc_number += 1

    # splitting and indexing splits
    wiki_splits = text_splitter.split_documents(wiki_docs)
    for idx, text in enumerate(wiki_splits):
        wiki_splits[idx].metadata['split_id'] = idx

    return wiki_splits, global_doc_number
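
The annotation pattern used above (tag every document from one query with the same document number, split, then number the splits) can be tried without any network access. The following is a minimal sketch under stated assumptions: `Doc` and `FixedSizeSplitter` are hypothetical stand-ins for the LangChain document and splitter classes, and `annotate_and_split` mirrors only the bookkeeping, not the Wikipedia query.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document with text and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class FixedSizeSplitter:
    """Hypothetical splitter that cuts text into fixed-size character chunks."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size

    def split_documents(self, docs):
        splits = []
        for doc in docs:
            text = doc.page_content
            for start in range(0, len(text), self.chunk_size):
                # Copy the parent metadata so each split can be annotated separately.
                splits.append(Doc(text[start:start + self.chunk_size], dict(doc.metadata)))
        return splits


def annotate_and_split(docs, doc_number, source, splitter):
    """Tag documents, split them, number the splits, and advance the counter."""
    for doc in docs:
        doc.metadata["doc_num"] = doc_number
        doc.metadata["doc_source"] = source
    splits = splitter.split_documents(docs)
    for idx, split in enumerate(splits):
        split.metadata["split_id"] = idx
    return splits, doc_number + 1


docs = [Doc("abcdefghij"), Doc("klmno")]
splits, next_number = annotate_and_split(docs, 21, "Wikipedia", FixedSizeSplitter(4))
for split in splits:
    print(split.metadata, repr(split.page_content))
```

Note that the split IDs run across the whole batch rather than restarting for each document, which matches the `enumerate` call in the function above.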



##### Building RAG Model Functions

In [None]:
def build_embedding_splitter_vectorstore(embedding_model, text_splitter,
                                         retr_search_type = "similarity",
                                         retr_k = 4,
                                         retr_score_threshold = 0):
  """
  Initializes an in-memory vector store with document embeddings for text
  retrieval.
  This function initializes a vector store using embeddings from the specified
  `embedding_model` and splits documents using the provided `text_splitter`.
  It loads documents from a pre-defined web source, splits them into segments,
  and stores their embeddings in a Qdrant vector store. A retriever is then
  created to facilitate document retrieval based on similarity or other specified search type.

  Parameters:
  - embedding_model (str): The name of the embedding model to use for generating document embeddings.
  - text_splitter (object): An object capable of splitting documents into segments for embedding.
  - retr_search_type (str, optional): The type of retrieval search to perform. Defaults to "similarity".
  - retr_k (int, optional): The number of results to return for each retrieval query. Defaults to 4.
  - retr_score_threshold (float, optional): A threshold for filtering retrieval results based on their score. If None, no threshold is applied. Defaults to 0.

  Returns:
  - tuple: A tuple containing the base embeddings object, text splitter object,
  the initialized Qdrant vector store, and the configured retriever object.
  """

  # assigning the base embeddings
  base_embeddings = HuggingFaceEmbeddings(model_name=embedding_model)

  # sample doc for content to initiate vectorstore
  loader = WebBaseLoader(
      web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
      bs_kwargs=dict(
          parse_only=bs4.SoupStrainer(
              class_=("post-content", "post-title", "post-header")
          )
      ),
  )

  documents = loader.load()
  splits = text_splitter.split_documents(documents)

  # creating vector store in memory
  qdrant_vectorstore = Qdrant.from_documents(splits,
      base_embeddings,
      location=":memory:",  # Local mode with in-memory storage only
      collection_name="rag_tech_db",
      force_recreate=True
  )

  # assigning retriever
  retriever = qdrant_vectorstore.as_retriever(
      search_type=retr_search_type,
      search_kwargs={
        "k": int(retr_k),
        "score_threshold": float(retr_score_threshold) if retr_score_threshold is not None else None
      },
  )

  return base_embeddings, text_splitter, qdrant_vectorstore, retriever
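
The retriever above always passes a `score_threshold` key, even when its value is `None`. One possible alternative, offered here as an assumption rather than a required change, is to build the dictionary so that the key is omitted when no threshold is wanted. `build_search_kwargs` is a hypothetical helper and is not part of LangChain or Qdrant.

```python
def build_search_kwargs(k=4, score_threshold=None):
    """Build retriever search kwargs, omitting the threshold when it is None."""
    kwargs = {"k": int(k)}
    if score_threshold is not None:
        kwargs["score_threshold"] = float(score_threshold)
    return kwargs


print(build_search_kwargs(4))
print(build_search_kwargs("3", 0))
```

The result could be passed as `search_kwargs=build_search_kwargs(retr_k, retr_score_threshold)`, keeping the type coercion in one place.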




def vectorize_documents(text_splitter, qdrant_vectorstore):
  """
  Processes and vectorizes documents from multiple sources, adding them to a Qdrant vector store.
  This function retrieves documents from a predefined list of ArXiv papers, a set of Wikipedia queries, and selected web pages. It assigns a unique number to each document, splits them into smaller chunks using the provided `text_splitter`, and enriches them with metadata (including document number, source, and split ID). These document chunks are then added to the specified `qdrant_vectorstore` for indexing and retrieval purposes.

  Parameters:
  - text_splitter (object): An object capable of splitting documents into manageable chunks for processing.
  - qdrant_vectorstore (object): The Qdrant vector store instance where document vectors will be stored.

  Returns:
  - object: The updated Qdrant vector store containing the newly added document vectors.
  """
  #assign a unique number to each document we ingest
  global_doc_number = 1

  ########### ARXIV PAPERS ############
  arxiv_numbers = ('2005.11401', '2104.07567', '2104.09864', '2105.03011', '2106.09685', '2203.02155', '2211.09260', '2211.12561',
                  '2212.09741', '2305.14314', '2305.18290', '2306.15595', '2309.08872', '2309.15217', '2310.06825', '2310.11511',
                  '2311.08377', '2312.05708', '2401.06532', '2402.01306')

  all_arxiv_pages = []

  # loop through the papers
  for identifier in arxiv_numbers:
      # Construct URL using the arXiv unique identifier
      arx_url = f"https://arxiv.org/pdf/{identifier}.pdf"

      # Extract pages from the document and add them to the list of pages
      arx_loader = PyMuPDFLoader(arx_url)
      arx_pages = arx_loader.load()
      for page_num in range(len(arx_pages)):
          page = arx_pages[page_num]
          #CHANGED
          page.metadata['page_num'] = page_num
          page.metadata['doc_num'] = global_doc_number
          page.metadata['doc_source'] = "ArXiv"
          all_arxiv_pages.append(page)

      global_doc_number += 1

  # index doc chunks
  splits = text_splitter.split_documents(all_arxiv_pages)
  for idx, text in enumerate(splits):
      splits[idx].metadata['split_id'] = idx

  # adding to vector store
  qdrant_vectorstore.add_documents(documents=splits)


  ########## WIKI DOCS ##############
  queries = ['Generative Artificial Intelligence', 'Information Retrieval', 'Large Language Models']

  for query in queries:
      wiki_splits, global_doc_number = vectorstore_wiki_docs(query, global_doc_number, text_splitter)
      # adding to vector store
      qdrant_vectorstore.add_documents(documents=wiki_splits)

  ############ LILIANWENG ############
  web_loader = WebBaseLoader(
      web_paths=("https://lilianweng.github.io/posts/2020-10-29-odqa/",
                "https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/",
                "https://lilianweng.github.io/posts/2018-06-24-attention/"),

      bs_kwargs=dict(
          parse_only=bs4.SoupStrainer(
              class_=("post-content", "post-title", "post-header")
          )
      ),
  )

  web_documents = web_loader.load()

  for idx, text in enumerate(web_documents):
      web_documents[idx].metadata['doc_num'] = global_doc_number
      web_documents[idx].metadata['doc_source'] = "WWW"
      # increment per page so that each web document gets its own number
      global_doc_number += 1

  web_splits = text_splitter.split_documents(web_documents)

  for idx, text in enumerate(web_splits):
      web_splits[idx].metadata['split_id'] = idx

  qdrant_vectorstore.add_documents(documents=web_splits)
  return qdrant_vectorstore




def load_llm(model_name, temperature=0.6, top_p=0.5):
  """
  Loads a large language model (LLM) based on the specified model name and configuration.
  Depending on the `model_name`, this function can load either a quantized version of the Mistral model from Hugging Face with specific settings for temperature and top_p, or initialize a Cohere model for text generation tasks. For the Mistral model, it applies quantization to reduce memory footprint, sets up the device mapping for efficiency, and configures the generation pipeline with specified parameters. For the Cohere model, it simply initializes it with an API key.

  Parameters:
  - model_name (str): The name of the model to load. Supports "mistral" or "cohere".
  - temperature (float, optional): The sampling temperature to use for generation with the Mistral model. Defaults to 0.6.
  - top_p (float, optional): The nucleus sampling (top_p) threshold to use for generation with the Mistral model. Defaults to 0.5.

  Returns:
  - object: A model pipeline object for the loaded LLM, ready for text generation tasks.
  """
  if model_name == "mistral":
    # 4-bit quantization; the CPU offload flag is spelled with "int8" in transformers
    quantization_config = BitsAndBytesConfig(load_in_4bit=True,
                                         llm_int8_enable_fp32_cpu_offload=True)

    llm_mistral_model = AutoModelForCausalLM.from_pretrained(
        "mistralai/Mistral-7B-Instruct-v0.1",
        torch_dtype=torch.float32,
        device_map='auto',
        quantization_config=quantization_config
    )

    llm_mistral_tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

    mistral_pipe = pipeline(
        "text-generation",
        model=llm_mistral_model,
        tokenizer=llm_mistral_tokenizer,
        max_length=1500,
        temperature=temperature,  # use the function arguments rather than hard-coded values
        top_p=top_p,
        do_sample=True,
        repetition_penalty=1.2
    )
    mistral_pipe.model.config.pad_token_id = mistral_pipe.model.config.eos_token_id

    llm_model = HuggingFacePipeline(pipeline=mistral_pipe)
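  elif model_name == "cohere":
    llm_model = ChatCohere(cohere_api_key=COHERE_API_KEY)
  else:
    # fail early instead of hitting an UnboundLocalError at the return below
    raise ValueError(f"Unsupported model_name: {model_name!r}; expected 'mistral' or 'cohere'.")

  return llm_model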



def print_input(input, label="Input to LLM"):
  """
  Prints the given input with an optional label and returns the input unchanged.
  This function is primarily used for logging or debugging purposes, allowing the inspection of data as it flows through different stages of a pipeline or processing sequence. By printing the input with a customizable label, it facilitates the tracking of data at specific points.

  Parameters:
  - input: The data to be printed. Can be of any type that supports string representation.
  - label (str, optional): A descriptive label to prefix the printed input, enhancing clarity. Defaults to "Input to LLM".

  Returns:
  - The original input, unchanged, facilitating its use in a pipeline without altering the data flow.
  """
  print(f"{label}: {input}")
  return input



def build_RAG_prompt_chain(rag_template, llm_model, retriever, format_docs):
  """
  Constructs a retrieval-augmented generation (RAG) prompt chain for question answering or text generation.
  This function creates a RAG prompt chain pipeline that integrates document retrieval, document formatting, question generation based on a template, and text generation with a large language model (LLM). It retrieves documents relevant to a given context or question, formats these documents, injects them into a templated prompt, and then feeds the prompt to an LLM for generating a response. The output from the LLM is parsed to a string for the final response.

  Parameters:
  - rag_template (str): The template string for generating prompts that include retrieved document content.
  - llm_model (object): The large language model pipeline object for generating text responses.
  - retriever (object): The document retriever object for fetching relevant documents based on a query.
  - format_docs (object): The document formatting object to structure retrieved documents before prompt generation.

  Returns:
  - object: A pipeline object representing the complete RAG prompt chain, ready for executing the full query-to-response process.
  """

  output_parser = StrOutputParser()

  logging_step1 = lambda input: print_input(input, "After Retriever")
  logging_step2 = lambda input: print_input(input, "Before LLM")

  rag_prompt = ChatPromptTemplate.from_template(rag_template)

  rag_chain = (
    {"context": retriever | logging_step1 | format_docs,
     "question": RunnablePassthrough()}
    | rag_prompt
    | logging_step2
    | llm_model
    | output_parser
  )

  return rag_chain

##### Evaluation Functions

In [None]:
def evaluate(metrics, validation_set, rag_chain, iterations=0, verbose=True, dept_specific=False, sleep=True, print_results=False):
  '''
  Function that predicts answers for a validation set using a defined RAG Chain,
  then computes the BLEU, ROUGE, & BERTScore of the predictions compared to the gold answers.
  Returns a dataframe of those results.
  '''
  # initializing a dataframe for results
  columns = ['Sample']
  if 'bleu' in metrics:
    columns += ['eng_bleu', 'mk_bleu']
  if 'rouge' in metrics:
    columns += ['eng_rouge', 'mk_rouge']
  if 'bertscore' in metrics:
    columns += ['eng_f1', 'mk_f1']
  results = pd.DataFrame(columns=columns)

  # setting iterations to length of validation set if not defined.
  if iterations == 0:
    iterations = len(validation_set)

  # sorting validation keys
  sorted_keys = sorted(validation_set.keys())

  # running through metrics for specified validation question/answers
  for i in range(min(iterations, len(sorted_keys))):
    # retrieving example key
    key = sorted_keys[i]
    # assigning questions & answers
    question = validation_set[key]['question']
    engineering_answer = validation_set[key]["gold_answer_research"]
    marketing_answer = validation_set[key]["gold_answer_marketing"]

    if dept_specific:
      eng_prediction = rag_chain["engineering"].invoke(question)
      mk_prediction = rag_chain["marketing"].invoke(question)
    else:
      eng_prediction = rag_chain.invoke(question)
      mk_prediction = rag_chain.invoke(question)

    # print predictions and targets
    if print_results:
      print("-"*60)
      print(f"Question: {question}")
      print(" ")
      print(f"Engineering Prediction: {eng_prediction}")
      print(f"Engineering Answer: {engineering_answer}")
      print("")
      print(f"Marketing Prediction: {mk_prediction}")
      print(f"Marketing Answer: {marketing_answer}")

    # calculating metrics
    resulting_row = [key]
    if 'bleu' in metrics:
      en_bl = bleu.compute(predictions=[eng_prediction],
                          references=[engineering_answer])
      mk_bl = bleu.compute(predictions=[mk_prediction],
                          references=[marketing_answer])
      resulting_row += [en_bl['bleu'], mk_bl['bleu']]
    if 'rouge' in metrics:
      en_ro = rouge.compute(predictions=[eng_prediction],
                          references=[engineering_answer])
      mk_ro = rouge.compute(predictions=[mk_prediction],
                          references=[marketing_answer])
      resulting_row += [en_ro['rougeLsum'], mk_ro['rougeLsum']]
    if 'bertscore' in metrics:
      en_bs = bertscore.compute(predictions=[eng_prediction],
                                references=[engineering_answer],
                                lang = 'en')
      mk_bs = bertscore.compute(predictions=[mk_prediction],
                                references=[marketing_answer],
                                lang = 'en')
      resulting_row += [en_bs['f1'][0], mk_bs['f1'][0]]

    results.loc[len(results)] = resulting_row

    if verbose:
      print(key, end=" -> ")
    # sleep to stay within Cohere's rate limit of 20 API calls per minute
    if sleep:
      time.sleep(4)

  return results

def composite_evaluation(results):
  """
  Calculates composite scores for engineering and marketing prompt results,
  and a combined total composite score, based on weighted averages of F1, ROUGE, and BLEU scores.

  The function takes `results` (the DataFrame returned by `evaluate`, or any dict-like object) containing the
  BERTScore F1, ROUGE, and BLEU scores for engineering (`eng_f1`, `eng_rouge`, `eng_bleu`) and marketing
  (`mk_f1`, `mk_rouge`, `mk_bleu`). It calculates a composite score for each department by applying specific
  weights to these scores: 50% to F1, 30% to ROUGE, and 20% to BLEU. The total composite score is then
  calculated as a weighted sum of the engineering and marketing composite scores, with weights of 60% and 40%,
  respectively.

  Parameters:
  - results (DataFrame or dict): The F1, ROUGE, and BLEU scores for engineering and marketing. Expected keys are:
      - 'eng_f1', 'eng_rouge', 'eng_bleu': The F1, ROUGE, and BLEU scores for engineering.
      - 'mk_f1', 'mk_rouge', 'mk_bleu': The F1, ROUGE, and BLEU scores for marketing.
  Returns:
  - DataFrame or dict: The input object updated with the following keys:
      - 'composite_eng': The composite score for engineering.
      - 'composite_mk': The composite score for marketing.
      - 'composite_total': The overall composite score, combining the engineering and marketing scores.
  """
  results['composite_eng'] = (0.5*results['eng_f1'] + 0.3*results['eng_rouge'] + 0.2*results['eng_bleu'])
  results['composite_mk'] = (0.5*results['mk_f1'] + 0.3*results['mk_rouge'] + 0.2*results['mk_bleu'])
  results['composite_total'] = 0.6*results['composite_eng'] + 0.4*results['composite_mk']
  return results

## 2) Loading & Exploring Evaluation Metrics

In [None]:
%%capture

# loading BLEU
bleu = load("bleu")

# loading BERTScore
bertscore = load("bertscore")

# loading ROUGE
rouge = load('rouge')

Below we explore a few different metrics for evaluating our model.

### BLEU Score

The first metric we are exploring is the BLEU score. Originally designed for evaluating the quality of machine translation, this score measures the n-gram overlap between a prediction and its reference. A pitfall is that it does not compare meaning.

A couple of other drawbacks of BLEU may be relevant to this use case:
 - Shorter predicted translations achieve higher scores than longer ones, simply due to how the score is calculated. A brevity penalty is introduced to attempt to counteract this.
 - BLEU scores can vary greatly depending on which parameters are used to generate the scores, especially when different tokenization and normalization techniques are used. It is therefore not possible to compare BLEU scores generated using different parameters, or when these parameters are unknown.
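To make the "overlap, not meaning" point concrete, here is a minimal sketch of a BLEU-style score written in plain Python. It is a simplified, single-reference version (unigrams and bigrams only, whitespace tokenization), and the helper names `ngrams` and `simple_bleu` are our own illustrations; it is not the implementation behind the `bleu` metric loaded in this notebook.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(prediction, reference, max_n=2):
    """Toy single-reference BLEU: clipped n-gram precisions plus a brevity penalty."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred:
        return 0.0
    precisions = []
    for n in range(1, max_n + 1):
        pred_counts, ref_counts = ngrams(pred, n), ngrams(ref, n)
        total = sum(pred_counts.values())
        # Clip each n-gram count by how often it appears in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in pred_counts.items())
        precisions.append(matched / total if total else 0.0)
    if min(precisions) == 0.0:
        return 0.0
    # Geometric mean of the n-gram precisions.
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # The brevity penalty only punishes predictions shorter than the reference.
    bp = 1.0 if len(pred) > len(ref) else math.exp(1 - len(ref) / len(pred))
    return bp * geo_mean


reference = "large language models generate text"
exact = simple_bleu("large language models generate text", reference)
partial = simple_bleu("large language models write text", reference)
paraphrase = simple_bleu("big neural networks produce writing", reference)
print(exact, partial, paraphrase)
```

The paraphrase shares no tokens with the reference, so it scores zero even though a human reader would consider it close in meaning. This is the behavior we need to keep in mind when reading the low BLEU values later in the notebook.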

In [None]:
# test prediction and reference
predictions = ["hello there general kenobi", "foo bar foobar"]
references = [
              ["hello there general kenobi", "hello there !"],
              ["foo bar foobar"]
              ]

# quantifying results
results_bleu = bleu.compute(predictions=predictions,
                            references=references)

print(results_bleu)

{'bleu': 1.0, 'precisions': [1.0, 1.0, 1.0, 1.0], 'brevity_penalty': 1.0, 'length_ratio': 1.1666666666666667, 'translation_length': 7, 'reference_length': 6}


### ROUGE

The second metric we are using is ROUGE (Recall-Oriented Understudy for Gisting Evaluation). ROUGE mainly measures the overlap in content between generated and target summaries. While it can capture some surface similarities, it does not fully capture semantic quality. We are primarily focused on ROUGE-L, which is based on the longest common subsequence between the summaries.
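The longest-common-subsequence idea behind ROUGE-L can be sketched in a few lines of plain Python. This is an illustrative approximation under simple assumptions (lowercased whitespace tokens, a single reference, F-measure form); the helper names `lcs_length` and `rouge_l_f1` are hypothetical, and the `rouge` metric loaded above applies its own tokenization.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # dp[i][j] holds the LCS length of a[:i] and b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def rouge_l_f1(prediction, reference):
    """Toy ROUGE-L: F-measure of LCS-based precision and recall."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)  # share of the prediction that is in the LCS
    recall = lcs / len(ref)      # share of the reference that is in the LCS
    return 2 * precision * recall / (precision + recall)


reference = "llms learn statistical relationships from text"
prediction = "llms learn patterns from large amounts of text"
shared = lcs_length(prediction.split(), reference.split())
score = rouge_l_f1(prediction, reference)
print(shared, round(score, 3))
```

Unlike n-gram matching, the subsequence does not have to be contiguous: here "llms learn ... from ... text" is matched in order even though other words sit in between. A verbose prediction still lowers precision, which is worth remembering given how long our generated answers tend to be.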


In [None]:
predictions = ["hello there", "general kenobi"]
references = ["hello there", "general kenobi"]
results_rouge = rouge.compute(predictions=predictions,
                        references=references)
print(results_rouge)

{'rouge1': 1.0, 'rouge2': 1.0, 'rougeL': 1.0, 'rougeLsum': 1.0}


### BERTScore

The third metric we are exploring is BERTScore. This uses embeddings from a pretrained BERT-style model to calculate the cosine similarity between the tokens of the generated and target summaries. Of the three, this measure does a much better job of capturing semantic similarity between the summaries. While it is likely more robust than ROUGE and BLEU, BERTScore may still not fully capture readability or factual accuracy.
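The matching step behind BERTScore can be illustrated without downloading a model. The sketch below uses a handful of made-up 3-dimensional vectors in place of real contextual embeddings (the values in `TOY_EMBEDDINGS` are invented purely for illustration) and applies greedy cosine matching to obtain precision, recall, and F1. It omits the optional IDF weighting and baseline rescaling.

```python
import math

# Hypothetical toy "embeddings"; real BERTScore uses contextual model outputs.
TOY_EMBEDDINGS = {
    "large": (1.0, 0.1, 0.0),
    "big": (0.9, 0.2, 0.0),
    "model": (0.1, 1.0, 0.0),
    "network": (0.2, 0.9, 0.0),
    "banana": (0.0, 0.1, 1.0),
}


def cosine(u, v):
    """Cosine similarity between two vectors given as tuples."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_f1(pred_tokens, ref_tokens):
    """Greedy-matching F1 in the style of BERTScore (no IDF weighting)."""
    pred_vecs = [TOY_EMBEDDINGS[t] for t in pred_tokens]
    ref_vecs = [TOY_EMBEDDINGS[t] for t in ref_tokens]
    # Precision: each predicted token is matched to its most similar reference token.
    precision = sum(max(cosine(p, r) for r in ref_vecs) for p in pred_vecs) / len(pred_vecs)
    # Recall: each reference token is matched to its most similar predicted token.
    recall = sum(max(cosine(r, p) for p in pred_vecs) for r in ref_vecs) / len(ref_vecs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


same = greedy_f1(["large", "model"], ["large", "model"])
synonym = greedy_f1(["big", "network"], ["large", "model"])
unrelated = greedy_f1(["banana"], ["large", "model"])
print(round(same, 3), round(synonym, 3), round(unrelated, 3))
```

With these toy vectors the synonym pair scores almost as high as the exact match, which is the property that lets BERTScore reward paraphrases that BLEU and ROUGE would miss. It also hints at why scores from real models tend to cluster in a narrow, high range.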


In [None]:
# test prediction and reference
predictions = ["hello world", "general kenobi"]
references = ["hello world", "general kenobi"]

# quantifying results
results_bert = bertscore.compute(predictions=predictions,
                            references=references,
                            lang = 'en') # default model is 'roberta-large' which requires 1.4GB of storage.
                            # model_type='roberta-large')
print(results_bert)


{'precision': [1.0000001192092896, 0.9999998807907104], 'recall': [1.0000001192092896, 0.9999998807907104], 'f1': [1.0000001192092896, 0.9999998807907104], 'hashcode': 'roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.40.0.dev0)'}


### Human Review

While the intent is to use the above metrics to measure the effect of tuning our models, human feedback will also be necessary to ensure the readability and factual accuracy of the generated responses.

## 3) Baseline Evaluation

Now we generate a baseline scenario to measure our progress against. As part of this baseline, we define what we consider our train dataset. We also review the various metric scores and define the specific metric that we will use to measure our tuning.

We define the parameters of the initial model as the baseline. This includes the parameters highlighted in the model rebuild below.

### 3.A) **Rebuilding Baseline Model**

We define our "baseline" model as the model that was provided to us in the initial notebook.

Below we are rebuilding the parameters, prompts and pipelines that represented the baseline model we were provided.

In [None]:
%%capture

embedding_model = "multi-qa-mpnet-base-dot-v1"
chunk_size = 128
chunk_overlap = 0
splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
llm = "cohere"
rag_template = """[INST]Please answer the question below only based on the context information provided.\n\nHere is a context:\n{context} \n\nHere is a question: \n{question}.[/INST]"""

base_embeddings, text_splitter, qdrant_vectorstore, retriever = build_embedding_splitter_vectorstore(embedding_model, splitter)
qdrant_vectorstore = vectorize_documents(text_splitter, qdrant_vectorstore)
llm_model = load_llm(llm)
rag_chain = build_RAG_prompt_chain(rag_template, llm_model, retriever, format_docs)


Unused kwargs: ['llm_int4_enable_fp32_cpu_offload']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.


### 3.B) **Single Prediction Sample**

Now we review one sample question/answer pair.

In [None]:
# selecting a sample question and answers
question = validation_questions_answers[0]['question']
engineering_answer = validation_questions_answers[0]["gold_answer_research"]
marketing_answer = validation_questions_answers[0]["gold_answer_marketing"]
print("Question: ", question)
print("Engineering Answer: ", engineering_answer)
print("Marketing Answer: ", marketing_answer)

Question:  What purpose do large language models serve in the field of natural language processing?
Engineering Answer:  Large language models (LLMs) serve the purpose of enabling general-purpose language generation and other natural language processing tasks such as classification. They achieve this by learning statistical relationships from text documents during computationally intensive self-supervised and semi-supervised training. LLMs can be used for text generation by predicting the next token or word, making them valuable for tasks like speech recognition, machine translation, and information retrieval. Additionally, LLMs have superseded previous models like recurrent neural networks, showcasing their efficiency and effectiveness in NLP tasks.
Marketing Answer:  Large language models serve the purpose of improving performance in various natural language processing tasks, such as speech recognition, machine translation, natural language generation, optical character recognition, 

Let's see what our model predicts for this question. . .

In [None]:
# invoking the RAG chain for one question
prediction = rag_chain.invoke(question)
prediction

'Large language models serve the purpose of improving the performance of natural language processing (NLP) tasks by providing a larger and more diverse set of data for models to learn from. They are designed to understand and generate human language, and can be applied to a wide range of language-related tasks such as language generation, machine translation, text classification, and question answering. These models aim to capture the complex nuances of human language, including context, syntax, and semantics, to provide more accurate and contextually relevant responses. \n\nBy training on vast amounts of text data, large language models can learn to predict the next word in a sequence, generating coherent and contextually appropriate responses. They can also understand and interpret user queries, providing relevant and useful answers. This makes them extremely useful for a variety of applications, such as virtual assistants, language translation services, content generation, and senti

Our predicted answer is decent and could be described as a 'middle-of-the-road' summary blending the gold standards for both research and marketing. It appears to cover the contents of both the engineering and marketing target answers. It does not, however, capture the 'statistical relationships' point present in the research answer, indicating that it may not dive into the deeper technical aspects. Overall, the response is also too verbose for either target answer and creeps into descriptions beyond the initial question. As an example, it appears to go off on a tangent about the shortcomings of the models. Now we look at the various evaluation scores for this answer.

**BLEU**

As can be seen below, the BLEU scores are very low for the baseline. This indicates that the word choices of the prediction and the gold answers are not well aligned. However, as previously noted, this does not provide much insight into semantic similarity. Even so, we do see value in having some measurement of n-gram overlap between the summaries.

In [None]:
%%capture
# calculating BLEU for sample answer
test_engbleu = bleu.compute(predictions=[prediction],
                            references=[engineering_answer])

test_mkbleu = bleu.compute(predictions=[prediction],
                            references=[marketing_answer])

In [None]:
print("Engineering: ", test_engbleu)
print("Marketing: ", test_mkbleu)

Engineering:  {'bleu': 0.045795704397007975, 'precisions': [0.23076923076923078, 0.07296137339055794, 0.03017241379310345, 0.008658008658008658], 'brevity_penalty': 1.0, 'length_ratio': 2.4893617021276597, 'translation_length': 234, 'reference_length': 94}
Marketing:  {'bleu': 0.05776699519634414, 'precisions': [0.12393162393162394, 0.07296137339055794, 0.04741379310344827, 0.025974025974025976], 'brevity_penalty': 1.0, 'length_ratio': 5.571428571428571, 'translation_length': 234, 'reference_length': 42}


**ROUGE**

Below are the ROUGE scores for the sample answer. From the output, we are specifically focused on the **'rougeLsum'** values, as we consider them the most holistic evaluation of the generated answer's quality: they evaluate the presence of phrases at a sentence level. This will act as a baseline score for the sequence-based similarity between the summaries. We will continue to monitor it as we tune to see if we can improve the phrase-level similarity between the generated and target summaries.

In [None]:
%%capture
# calculating ROUGE for sample answer
test_engrouge = rouge.compute(predictions=[prediction],
                            references=[engineering_answer])

test_mkrouge = rouge.compute(predictions=[prediction],
                            references=[marketing_answer])

In [None]:
print("Engineering: ", test_engrouge)
print("Marketing: ", test_mkrouge)

Engineering:  {'rouge1': 0.30344827586206896, 'rouge2': 0.10416666666666669, 'rougeL': 0.18620689655172415, 'rougeLsum': 0.19999999999999998}
Marketing:  {'rouge1': 0.17647058823529413, 'rouge2': 0.11016949152542373, 'rougeL': 0.1680672268907563, 'rougeLsum': 0.1680672268907563}


**BERTScore**

Below are the BERTScores for the sample answer. For this output, we are specifically focused on the **'F1'** scores, since F1 balances the typically opposing metrics of precision and recall. In our particular case, a balance between these is acceptable. We are not in a 'high-stakes' situation that would require maximizing one metric over the other (e.g., medical diagnosis).

As can be seen, this sample answer already scores relatively well for both departments. While BERTScore is the most context-based metric of the three, this high baseline reduces the room for improvement that we can make on this particular metric, so we will factor that into our metric decision.

*Note: While there are numerous parameter and embedding options available for BERTScore, we have chosen to use just the default settings for this POC. Should this project advance, additional exploration of other parameters for this metric would be advised.*

In [None]:
%%capture
# calculating BERTscore for sample answer
test_engbert = bertscore.compute(predictions=[prediction],
                            references=[engineering_answer],
                            lang = 'en')

test_mkbert = bertscore.compute(predictions=[prediction],
                            references=[marketing_answer],
                            lang = 'en')

In [None]:
print("Engineering: ", test_engbert)
print("Marketing: ", test_mkbert)

Engineering:  {'precision': [0.8595860004425049], 'recall': [0.8819125890731812], 'f1': [0.8706061840057373], 'hashcode': 'roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.40.0.dev0)'}
Marketing:  {'precision': [0.8385220766067505], 'recall': [0.9097399711608887], 'f1': [0.8726804256439209], 'hashcode': 'roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.40.0.dev0)'}


### 3.C) **Train Set Evaluation**

We will be splitting the validation question/answer set into train and validation sets. To minimize the compute requirements as we explore various parameter settings, we are setting our exploratory train set to the first 10 examples. As we get closer to our desired model, we will then expand to a larger dataset of 50 examples. Once we have chosen our final model, we will then evaluate against all 70 examples in the validation set.

In [None]:
%%capture

# using function to capture all metrics
metrics = ['rouge', 'bleu', 'bertscore']
train_baseline = evaluate(metrics, validation_questions_answers, rag_chain, iterations=10, verbose=True)

In [None]:
train_baseline.describe()

| | Sample | eng_bleu | mk_bleu | eng_rouge | mk_rouge | eng_f1 | mk_f1 |
|---|---|---|---|---|---|---|---|
| count | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 |
| mean | 6.6 | 0.036154 | 0.031349 | 0.206718 | 0.141839 | 0.862086 | 0.850168 |
| std | 4.788876 | 0.040034 | 0.037755 | 0.046351 | 0.082965 | 0.018634 | 0.026099 |
| min | 0.0 | 0.0 | 0.0 | 0.129213 | 0.03268 | 0.837753 | 0.805979 |
| 25% | 2.25 | 0.0 | 0.0 | 0.192153 | 0.077106 | 0.84859 | 0.83842 |
| 50% | 7.5 | 0.034476 | 0.020574 | 0.201474 | 0.136511 | 0.860356 | 0.84543 |
| 75% | 10.5 | 0.051532 | 0.052379 | 0.212566 | 0.176126 | 0.87004 | 0.871407 |
| max | 13.0 | 0.124233 | 0.103572 | 0.298969 | 0.305556 | 0.891206 | 0.886961 |


### 3.D) **Composite Evaluation Metric**

After reviewing the training set scores that we are getting for the three separate metrics, we have determined that we want to create a 'composite' metric that is a weighted combination of the three. Given the individual normalized scores for BLEU, ROUGE-Lsum, and BERTScore F1 for both Engineering (eng) and Marketing (mk), the composite score can be calculated as follows:

<br>

\begin{align*}
\text{Composite Score}_{\text{eng}} &= w_{\text{BLEU}} \cdot N_{\text{BLEU}_{\text{eng}}} + w_{\text{ROUGE-Lsum}} \cdot N_{\text{ROUGE-Lsum}_{\text{eng}}} + w_{\text{BERTScoreF1}} \cdot N_{\text{BERTScoreF1}_{\text{eng}}} \\
\text{Composite Score}_{\text{mk}} &= w_{\text{BLEU}} \cdot N_{\text{BLEU}_{\text{mk}}} + w_{\text{ROUGE-Lsum}} \cdot N_{\text{ROUGE-Lsum}_{\text{mk}}} + w_{\text{BERTScoreF1}} \cdot N_{\text{BERTScoreF1}_{\text{mk}}}
\end{align*}

<br>

where $w_{\text{BLEU}}$, $w_{\text{ROUGE-Lsum}}$, and $w_{\text{BERTScoreF1}}$ are the weights for the BLEU, ROUGE-Lsum, and BERTScore F1 scores respectively, and $N_{\text{Metric}_{\text{dept}}}$ represents the normalized score for each metric in each department (Engineering or Marketing).

We are choosing the following weights based on the importance that each metric brings to evaluating our generated answers. With BERTScore being best at measuring the contextual similarity between the summaries, it is given the most weight, followed by ROUGE, with its measurement of phrase-based similarities, and lastly BLEU, which captures specific word-choice similarities. Below are the weights:
\begin{align*}
    w_{\text{BLEU}} &= 0.2 \\
    w_{\text{ROUGE-Lsum}} &= 0.3 \\
    w_{\text{BERTScoreF1}} &= 0.5
\end{align*}

We also decided to weight the importance of the responses between the departments. We understand that this is just a POC and that, should a production system be implemented, it would likely contain much more content in the document store, and with that a significant increase in the technical jargon and complexity within those documents. With the company having 300 engineers and 40 marketers, this system will likely see more usage from engineers. In addition, accurately capturing specific (technical) details matters more for engineering. Marketing requirements will be higher-level and more generalized, and as such will suffer less from missing specifics. Given this, we are choosing the following departmental weighting:

\begin{align*}
    w_{\text{eng}} &= 0.6 \\
    w_{\text{mk}} &= 0.4
\end{align*}

<br>
The final composite score, adjusted for departmental impact, is calculated as:

<br>

\begin{equation*}
\text{Final Composite Score} = w_{\text{eng}} \cdot \text{Composite Score}_{\text{eng}} + w_{\text{mk}} \cdot \text{Composite Score}_{\text{mk}}
\end{equation*}

<br>

This equation provides a framework to evaluate the performance of the RAG system, taking into account the precision, recall, semantic similarity, and the strategic importance of each department to the company. Below is that calculation for the train dataset.



In [None]:
train_baseline = composite_evaluation(train_baseline)
train_baseline.describe()

| | Sample | eng_bleu | mk_bleu | eng_rouge | mk_rouge | eng_f1 | mk_f1 | composite_eng | composite_mk | composite_total |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 |
| mean | 6.6 | 0.036154 | 0.031349 | 0.206718 | 0.141839 | 0.862086 | 0.850168 | 0.500289 | 0.473905 | 0.489736 |
| std | 4.788876 | 0.040034 | 0.037755 | 0.046351 | 0.082965 | 0.018634 | 0.026099 | 0.025552 | 0.043826 | 0.027881 |
| min | 0.0 | 0.0 | 0.0 | 0.129213 | 0.03268 | 0.837753 | 0.805979 | 0.45764 | 0.412793 | 0.439702 |
| 25% | 2.25 | 0.0 | 0.0 | 0.192153 | 0.077106 | 0.84859 | 0.83842 | 0.488342 | 0.442528 | 0.477322 |
| 50% | 7.5 | 0.034476 | 0.020574 | 0.201474 | 0.136511 | 0.860356 | 0.84543 | 0.496478 | 0.466079 | 0.481197 |
| 75% | 10.5 | 0.051532 | 0.052379 | 0.212566 | 0.176126 | 0.87004 | 0.871407 | 0.502899 | 0.497392 | 0.510889 |
| max | 13.0 | 0.124233 | 0.103572 | 0.298969 | 0.305556 | 0.891206 | 0.886961 | 0.548372 | 0.555862 | 0.538268 |


**Our baseline model returns a 'Composite Score' of 0.489736.**
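As a sanity check, the mean composite can be reproduced by hand from the column means reported above, because the composite is a linear combination of the individual scores. The snippet below is a small worked example; the dictionary and helper names (`baseline_means`, `dept_composite`) are our own and are not part of the notebook's functions.

```python
# Column means copied from the baseline describe() output above.
baseline_means = {
    "eng_bleu": 0.036154, "eng_rouge": 0.206718, "eng_f1": 0.862086,
    "mk_bleu": 0.031349, "mk_rouge": 0.141839, "mk_f1": 0.850168,
}

# Weights as defined in the composite metric section.
METRIC_WEIGHTS = {"f1": 0.5, "rouge": 0.3, "bleu": 0.2}
DEPT_WEIGHTS = {"eng": 0.6, "mk": 0.4}


def dept_composite(scores, dept):
    """Weighted sum of the three metric means for one department."""
    return sum(w * scores[f"{dept}_{metric}"] for metric, w in METRIC_WEIGHTS.items())


composite_eng = dept_composite(baseline_means, "eng")
composite_mk = dept_composite(baseline_means, "mk")
composite_total = (DEPT_WEIGHTS["eng"] * composite_eng
                   + DEPT_WEIGHTS["mk"] * composite_mk)

# Share of the engineering composite contributed by the BERTScore F1 term.
f1_share_eng = METRIC_WEIGHTS["f1"] * baseline_means["eng_f1"] / composite_eng

print(round(composite_eng, 4), round(composite_mk, 4), round(composite_total, 4))
print(round(f1_share_eng, 2))
```

Up to rounding of the inputs, this matches the mean `composite_total` in the table. It also shows that the BERTScore term accounts for most of the composite at baseline, so meaningful movement in the composite will largely have to come from the ROUGE and BLEU terms or from shifts in BERTScore itself.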

### 3.E) Reviewing Baseline Model Document Context Retrieval

Now we perform some additional EDA on the train set, exploring some of the interim steps within the RAG system. Below we look at what we are retrieving from our docstore as context and feeding to the LLM.


In [None]:
rag_chain = build_RAG_prompt_chain(rag_template, llm_model, retriever, format_docs)
metrics = ['rouge', 'bleu', 'bertscore']
train_baseline = evaluate(metrics, validation_questions_answers, rag_chain, iterations=1, verbose=True)

Before LLM: messages=[HumanMessage(content='[INST]Please answer the question below only based on the context information provided.\n\nHere is a context:\namong the largest language models today and we apply them on a wide range of language tasks,\n\nlimitations, and societal impact of large language models. arXiv preprint arXiv:2102.02503.\n\narli, and Denny Zhou. Large language models can be easily distracted by irrelevant context.\n\nimportant for language models that are deployed and used in hundreds of applications. \n\nHere is a question: \nWhat purpose do large language models serve in the field of natural language processing?.[/INST]')]
Before LLM: messages=[HumanMessage(content='[INST]Please answer the question below only based on the context information provided.\n\nHere is a context:\namong the largest language models today and we apply them on a wide range of language tasks,\n\nlimitations, and societal impact of large language models. arXiv preprint arXiv:2102.02503.\n\narl

Below is the actual prompt sent to the LLM, reformatted for clarity:

<blockquote>

'[INST]Please answer the question below only based on the context information provided.

Here is a context:

among the largest language models today and we apply them on a wide range of language tasks,

limitations, and societal impact of large language models. arXiv preprint arXiv:2102.02503.

arli, and Denny Zhou. Large language models can be easily distracted by irrelevant context.

important for language models that are deployed and used in hundreds of applications.

Here is a question:

What purpose do large language models serve in the field of natural language processing?.[/INST]'

</blockquote>

While the retrieved content is relevant to large language models, none of the chunks directly helps answer the question being asked; some are distracting, like a blurb on how LLMs can be easily distracted (see what I did there...). Based on the generated responses we were receiving, it appears the LLM may be filling in many of the context gaps on its own to arrive at adequate answers. We will focus significant tuning effort on giving the LLM more relevant content to work with.
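
One rough way to put a number on this observation is to measure how many of the question's content words appear anywhere in the retrieved chunks. The sketch below is a crude lexical proxy, not part of the notebook's evaluation; `content_tokens`, `coverage`, and the small stopword list are hypothetical choices made for illustration.

```python
import re

# A small, hand-picked stopword list (an assumption; any stopword list would do).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "do", "what",
             "is", "are", "on", "for", "by", "we", "them", "can", "be", "that"}


def content_tokens(text):
    """Lowercase alphabetic tokens with stopwords removed."""
    return {tok for tok in re.findall(r"[a-z]+", text.lower()) if tok not in STOPWORDS}


def coverage(reference, chunks):
    """Fraction of the reference's content tokens found in at least one chunk."""
    reference_tokens = content_tokens(reference)
    if not reference_tokens:
        return 0.0
    context_tokens = set().union(*(content_tokens(chunk) for chunk in chunks))
    return len(reference_tokens & context_tokens) / len(reference_tokens)


# The four chunks retrieved by the baseline for the sample question.
baseline_chunks = [
    "among the largest language models today and we apply them on a wide range of language tasks,",
    "limitations, and societal impact of large language models. arXiv preprint arXiv:2102.02503.",
    "arli, and Denny Zhou. Large language models can be easily distracted by irrelevant context.",
    "important for language models that are deployed and used in hundreds of applications.",
]
question = "What purpose do large language models serve in the field of natural language processing?"

print(f"question-term coverage: {coverage(question, baseline_chunks):.2f}")
```

For this question the only overlapping content words are the generic terms "large", "language", and "models", which is consistent with the impression that the chunks match the topic but not the question. The same function could be pointed at a gold answer instead of the question.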

## 4) Fine Tuning Model and Parameter Choices

Here we explore the tuning options available to improve our model.

#### 4.A) **Exploring embedding options**

We were provided three embedding model options: 'multi-qa-mpnet-base-dot-v1', 'all-MiniLM-L6-v2', and 'avsolatorio/GIST-Embedding-v0', with the first being the default.  After some research, it appears that 'all-MiniLM-L6-v2' is a smaller model built for efficiency, generally at some cost to performance.  However, we will still test its performance to confirm.

In [None]:
# defining model parameters
chunk_size = 128
chunk_overlap = 0
splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
llm = "cohere"
rag_template = """[INST]Please answer the question below only based on the context information provided.\n\nHere is a context:\n{context} \n\nHere is a question: \n{question}.[/INST]"""

# embedding models to iterate through
embedding_models = ['avsolatorio/GIST-Embedding-v0',  'all-MiniLM-L6-v2', 'multi-qa-mpnet-base-dot-v1']

results = []
index = 0
# for each embedding model, completing the RAG chain and evaluating.
for embedding_model in embedding_models:
  print("")
  print("Index ", index)
  index += 1
  print(f"Embedding Model: {embedding_model}")

  #building model for current embedding
  base_embeddings, text_splitter, qdrant_vectorstore, retriever = build_embedding_splitter_vectorstore(embedding_model, splitter)
  qdrant_vectorstore = vectorize_documents(text_splitter, qdrant_vectorstore)
  llm_model = load_llm(llm)
  rag_chain = build_RAG_prompt_chain(rag_template, llm_model, retriever, format_docs)

  # computing metrics
  metrics = ['rouge', 'bleu', 'bertscore']
  embed_result = evaluate(metrics, validation_questions_answers, rag_chain, iterations=10, verbose=True, sleep=False)
  print(embed_result.mean())
  results.append(embed_result)

  # deleting previous variables
  del qdrant_vectorstore
  del text_splitter
  del llm_model
  del rag_chain
  del base_embeddings
  del retriever
  gc.collect() # forcing garbage collection


Index  0
Embedding Model: avsolatorio/GIST-Embedding-v0
0 -> 1 -> 2 -> 3 -> 7 -> 8 -> 9 -> 11 -> 12 -> 13 -> Sample       6.600000
eng_bleu     0.033869
mk_bleu      0.021581
eng_rouge    0.194728
mk_rouge     0.121817
eng_f1       0.855194
mk_f1        0.839183
dtype: float64

Index  1
Embedding Model: all-MiniLM-L6-v2


0 -> 1 -> 2 -> 3 -> 7 -> 8 -> 9 -> 11 -> 12 -> 13 -> Sample       6.600000
eng_bleu     0.051392
mk_bleu      0.026366
eng_rouge    0.225280
mk_rouge     0.132156
eng_f1       0.865265
mk_f1        0.847400
dtype: float64

Index  2
Embedding Model: multi-qa-mpnet-base-dot-v1
0 -> 1 -> 2 -> 3 -> 7 -> 8 -> 9 -> 11 -> 12 -> 13 -> Sample       6.600000
eng_bleu     0.042218
mk_bleu      0.024269
eng_rouge    0.204649
mk_rouge     0.127378
eng_f1       0.862459
mk_f1        0.846005
dtype: float64


In [None]:
# calculating the composite score for each embedding
for i, result in enumerate(results):
  print("Index: ,", i)
  result = composite_evaluation(result)
  print(result['composite_total'].mean())

Index: , 0
0.479854879028463
Index: , 1
0.4937449646786769
Index: , 2
0.48706869462383756


**Context Check:** To our surprise, the 'Mini' model performed best, with a composite score of 0.4937 versus 0.4871 for 'multi-qa-mpnet-base-dot-v1' and 0.4799 for 'avsolatorio/GIST-Embedding-v0'.  Next we look at the context retrieval to see whether this embedding makes any noticeable difference.

In [None]:
base_embeddings, text_splitter, qdrant_vectorstore, retriever = build_embedding_splitter_vectorstore('all-MiniLM-L6-v2', splitter)
qdrant_vectorstore = vectorize_documents(text_splitter, qdrant_vectorstore)
llm_model = load_llm(llm)
rag_chain = build_RAG_prompt_chain(rag_template, llm_model, retriever, format_docs)
metrics = ['rouge', 'bleu', 'bertscore']
train_baseline = evaluate(metrics, validation_questions_answers, rag_chain, iterations=1, verbose=True)

Before LLM: messages=[HumanMessage(content='[INST]Please answer the question below only based on the context information provided.\n\nHere is a context:\nlarge language models efficient. Through our work, our aim is to help the community create more\n\n[10] Y. Chen, R. Wang, H. Jiang, S. Shi, and R.-L. Xu. Exploring the use of large language models\n\nThis implies that large models are more generaliz-\nable to compute texts in various domains and task\n\namong the largest language models today and we apply them on a wide range of language tasks, \n\nHere is a question: \nWhat purpose do large language models serve in the field of natural language processing?.[/INST]')]
Before LLM: messages=[HumanMessage(content='[INST]Please answer the question below only based on the context information provided.\n\nHere is a context:\nlarge language models efficient. Through our work, our aim is to help the community create more\n\n[10] Y. Chen, R. Wang, H. Jiang, S. Shi, and R.-L. Xu. Exploring the 

Below is the prompt with context, reformatted.

<blockquote>

'[INST]Please answer the question below only based on the context information provided.

Here is a context:

large language models efficient. Through our work, our aim is to help the community create more

[10] Y. Chen, R. Wang, H. Jiang, S. Shi, and R.-L. Xu. Exploring the use of large language models

This implies that large models are more generalizable to compute texts in various domains and task

among the largest language models today and we apply them on a wide range of language tasks,

Here is a question:

What purpose do large language models serve in the field of natural language processing?.[/INST]'

</blockquote>

While the retrieved context changes somewhat, it is hard to say that the difference is more than random chance.  To stay true to our defined performance metric, we will continue with the 'Mini' embedding.

**Embedding Decision**: Based on these results, with all else remaining as baseline, the 'Mini' embedding model produced the best results. We will be proceeding with the **'all-MiniLM-L6-v2'** model.
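
To get a feel for how decisive the gap between embeddings is, the sketch below compares it to a rough standard error. It reuses the baseline standard deviation of `composite_total` (0.027881 over 10 samples) as a stand-in for every configuration's spread and ignores that the runs share the same questions; both are simplifying assumptions, so this is an order-of-magnitude check rather than a significance test.

```python
import math

baseline_std = 0.027881  # std of composite_total across the 10 baseline samples
n_samples = 10

# Mean composite_total per embedding model, as printed by the notebook.
composite_by_embedding = {
    "avsolatorio/GIST-Embedding-v0": 0.479854879028463,
    "all-MiniLM-L6-v2": 0.4937449646786769,
    "multi-qa-mpnet-base-dot-v1": 0.48706869462383756,
}

# Standard error of a mean over n_samples, under the assumptions stated above.
standard_error = baseline_std / math.sqrt(n_samples)

ranked = sorted(composite_by_embedding.items(), key=lambda kv: kv[1], reverse=True)
best_name, best_score = ranked[0]
gaps = {name: best_score - score for name, score in ranked[1:]}

for name, gap in gaps.items():
    print(f"{best_name} vs {name}: gap={gap:.4f}, gap/SE={gap / standard_error:.2f}")
```

Under these assumptions the lead of 'all-MiniLM-L6-v2' over the default embedding is smaller than one standard error, so the choice follows the metric but should be read as a weak preference.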

#### 4.B) **Adjusting Chunking**

Now we perform a grid search over various chunk sizes and overlaps to find which combination performs best.  Based on our previous review of the retrieved context, we suspect there is room for improvement here.
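
To make the two parameters concrete, here is a simplified fixed-window chunker. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which also tries to split on separators such as paragraph and line breaks; `sliding_window_chunks` is a hypothetical helper that only illustrates what a character-based chunk size and overlap mean.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the last window is not fully contained in the previous one.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


sample = (
    "among the largest language models today and we apply them on a wide range "
    "of language tasks, including classification, summarization, "
    "question-answering, creative writing, dialogue, and others."
)

for size, overlap in [(64, 0), (64, 16), (256, 16)]:
    chunks = sliding_window_chunks(sample, size, overlap)
    print(f"chunk_size={size}, chunk_overlap={overlap}: {len(chunks)} chunk(s)")
    for chunk in chunks:
        print("   ", repr(chunk))
```

With small windows the list of tasks is separated from the clause that introduces it, which is the kind of fragmentation seen in the baseline context; larger windows keep the two together in one chunk.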

In [None]:
# assigning model parameters
embedding_model = 'all-MiniLM-L6-v2'
llm = "cohere"
rag_template = """[INST]Please answer the question below only based on the context information provided.\n\nHere is a context:\n{context} \n\nHere is a question: \n{question}.[/INST]"""
llm_model = load_llm(llm)

# chunk sizes and overlaps to iterate through
chunk_sizes = [32, 64, 128, 256]
chunk_overlaps = [0, 4, 8, 16]

results = []
index = 0
# for each chunk size & overlap combo, completing the RAG system and evaluating.
for chunk_size in chunk_sizes:
    for chunk_overlap in chunk_overlaps:
        print("")
        print("Index ", index)
        index += 1
        print(f"Chunk Size: {chunk_size}, Chunk Overlap: {chunk_overlap}")

        # building model for current chunking
        splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
        base_embeddings, text_splitter, qdrant_vectorstore, retriever = build_embedding_splitter_vectorstore(embedding_model, splitter)
        qdrant_vectorstore = vectorize_documents(text_splitter, qdrant_vectorstore)
        rag_chain = build_RAG_prompt_chain(rag_template, llm_model, retriever, format_docs)

        # computing metrics
        metrics = ['rouge', 'bleu', 'bertscore']
        chunk_result = evaluate(metrics, validation_questions_answers, rag_chain, iterations=10, verbose=True, sleep=False)
        print(chunk_result.mean())
        results.append(chunk_result)

        # deleting previous variables
        del qdrant_vectorstore
        del text_splitter
        gc.collect() # forcing garbage collection


Index  0
Chunk Size: 32, Chunk Overlap: 0
0 -> 1 -> 2 -> 3 -> 7 -> 8 -> 9 -> 11 -> 12 -> 13 -> Sample       6.600000
eng_bleu     0.028118
mk_bleu      0.012418
eng_rouge    0.194782
mk_rouge     0.109207
eng_f1       0.855823
mk_f1        0.840538
dtype: float64

Index  1
Chunk Size: 32, Chunk Overlap: 4
0 -> 1 -> 2 -> 3 -> 7 -> 8 -> 9 -> 11 -> 12 -> 13 -> Sample       6.600000
eng_bleu     0.030376
mk_bleu      0.007748
eng_rouge    0.197281
mk_rouge     0.095496
eng_f1       0.851986
mk_f1        0.835213
dtype: float64

Index  2
Chunk Size: 32, Chunk Overlap: 8
0 -> 1 -> 2 -> 3 -> 7 -> 8 -> 9 -> 11 -> 12 -> 13 -> Sample       6.600000
eng_bleu     0.024835
mk_bleu      0.021533
eng_rouge    0.158797
mk_rouge     0.118179
eng_f1       0.850785
mk_f1        0.845171
dtype: float64

Index  3
Chunk Size: 32, Chunk Overlap: 16
0 -> 1 -> 2 -> 3 -> 7 -> 8 -> 9 -> 11 -> 12 -> 13 -> Sample       6.600000
eng_bleu     0.035043
mk_bleu      0.011213
eng_rouge    0.177818
mk_rouge     0.09995

In [None]:
# calculating the composite score for each iteration
for i, result in enumerate(results):
  print("Index: ,", i)
  result = composite_evaluation(result)
  print(result['composite_total'].mean())

Index: , 0
0.47738752984398225
Index: , 1
0.47387331441219416
Index: , 2
0.471737410212241
Index: , 3
0.47247251703615556
Index: , 4
0.475619310549057
Index: , 5
0.47434139872304604
Index: , 6
0.47363885699726627
Index: , 7
0.4777977152501647
Index: , 8
0.4921290047686167
Index: , 9
0.49110410919102854
Index: , 10
0.4869879168879005
Index: , 11
0.4927676679165994
Index: , 12
0.49762912126782427
Index: , 13
0.49664616910159837
Index: , 14
0.4975537489503412
Index: , 15
0.5002383253742265


Based on the above results, the largest chunk size (256) combined with the largest chunk overlap (16) performed best (index 15, composite score 0.5002).  Because the best result sits at the edge of the grid, we expand the maximum values to see whether the score continues to improve.
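
Because the results are printed by index only, it can help to map each index back to its configuration. The sketch below rebuilds the grid in the same order as the nested loops and attaches the printed composite scores; the variable names `grid`, `scored`, and `best_config` are illustrative and not from the notebook.

```python
from itertools import product

chunk_sizes = [32, 64, 128, 256]
chunk_overlaps = [0, 4, 8, 16]

# Mean composite_total per run, in the order printed by the notebook (index 0..15).
composite_scores = [
    0.47738752984398225, 0.47387331441219416, 0.471737410212241, 0.47247251703615556,
    0.475619310549057, 0.47434139872304604, 0.47363885699726627, 0.4777977152501647,
    0.4921290047686167, 0.49110410919102854, 0.4869879168879005, 0.4927676679165994,
    0.49762912126782427, 0.49664616910159837, 0.4975537489503412, 0.5002383253742265,
]

# product() varies its last argument fastest, matching the nested for-loops.
grid = list(product(chunk_sizes, chunk_overlaps))
scored = dict(zip(grid, composite_scores))

best_config = max(scored, key=scored.get)
print("best (chunk_size, chunk_overlap):", best_config)

# Averaging over overlaps shows how much of the effect comes from chunk size alone.
for size in chunk_sizes:
    mean_score = sum(scored[(size, overlap)] for overlap in chunk_overlaps) / len(chunk_overlaps)
    print(f"chunk_size={size}: mean composite={mean_score:.4f}")
```

Grouping the scores this way makes it easier to see that chunk size moves the composite score more than overlap does in this grid.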

In [None]:
embedding_model = 'all-MiniLM-L6-v2'
llm = "cohere"
rag_template = """[INST]Please answer the question below only based on the context information provided.\n\nHere is a context:\n{context} \n\nHere is a question: \n{question}.[/INST]"""
llm_model = load_llm(llm)
chunk_sizes = [512]
chunk_overlaps = [8, 16, 32]

results = []
index = 0
for chunk_size in chunk_sizes:
    for chunk_overlap in chunk_overlaps:
        print("")
        print("Index ", index)
        index += 1
        print(f"Chunk Size: {chunk_size}, Chunk Overlap: {chunk_overlap}")

        # building model for current chunking
        splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
        base_embeddings, text_splitter, qdrant_vectorstore, retriever = build_embedding_splitter_vectorstore(embedding_model, splitter)
        qdrant_vectorstore = vectorize_documents(text_splitter, qdrant_vectorstore)
        rag_chain = build_RAG_prompt_chain(rag_template, llm_model, retriever, format_docs)

        # computing metrics
        metrics = ['rouge', 'bleu', 'bertscore']
        chunk_result = evaluate(metrics, validation_questions_answers, rag_chain, iterations=10, verbose=True, sleep=False)
        print(chunk_result.mean())
        results.append(chunk_result)

        # deleting previous variables
        del qdrant_vectorstore
        del text_splitter
        gc.collect() # forcing garbage collection


Index  0
Chunk Size: 512, Chunk Overlap: 8
0 -> 1 -> 2 -> 3 -> 7 -> 8 -> 9 -> 11 -> 12 -> 13 -> Sample       6.600000
eng_bleu     0.055507
mk_bleu      0.028529
eng_rouge    0.220229
mk_rouge     0.145474
eng_f1       0.867614
mk_f1        0.854710
dtype: float64

Index  1
Chunk Size: 512, Chunk Overlap: 16
0 -> 1 -> 2 -> 3 -> 7 -> 8 -> 9 -> 11 -> 12 -> 13 -> Sample       6.600000
eng_bleu     0.043765
mk_bleu      0.025808
eng_rouge    0.224259
mk_rouge     0.126912
eng_f1       0.863763
mk_f1        0.850277
dtype: float64

Index  2
Chunk Size: 512, Chunk Overlap: 32
0 -> 1 -> 2 -> 3 -> 7 -> 8 -> 9 -> 11 -> 12 -> 13 -> Sample       6.600000
eng_bleu     0.053143
mk_bleu      0.031342
eng_rouge    0.210684
mk_rouge     0.141335
eng_f1       0.862925
mk_f1        0.848733
dtype: float64


In [None]:
for i, result in enumerate(results):
  print("Index: ,", i)
  result = composite_evaluation(result)
  print(result['composite_total'].mean())

Index: , 0
0.49726734330825567
Index: , 1
0.4920968688966886
Index: , 2
0.49239176281594227


##### **Context Check:**


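None of the chunk size 512 configurations beat the 0.5002 composite score of chunk size 256 with overlap 16 (the best, overlap 8, scored 0.4973), so we keep 256 and 16. Now we review the difference this has made in our context retrieval.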

In [None]:
chunk_size = 256
chunk_overlap = 16  # singular name, so the splitter below does not pick up the stale loop value
splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
base_embeddings, text_splitter, qdrant_vectorstore, retriever = build_embedding_splitter_vectorstore('all-MiniLM-L6-v2', splitter)
qdrant_vectorstore = vectorize_documents(text_splitter, qdrant_vectorstore)
llm_model = load_llm(llm)
rag_chain = build_RAG_prompt_chain(rag_template, llm_model, retriever, format_docs)
metrics = ['rouge', 'bleu', 'bertscore']
train_baseline = evaluate(metrics, validation_questions_answers, rag_chain, iterations=1, verbose=True)

Before LLM: messages=[HumanMessage(content='[INST]Please answer the question below only based on the context information provided.\n\nHere is a context:\nalgorithm. Machine Learning, 97(3):327–351, July 2014. doi: 10.1007/s10994-014-5458-8.\nURL https://doi.org/10.1007/s10994-014-5458-8.\n[10] Y. Chen, R. Wang, H. Jiang, S. Shi, and R.-L. Xu. Exploring the use of large language models\n\narli, and Denny Zhou. Large language models can be easily distracted by irrelevant context.\nIn Proceedings of the 40th International Conference on Machine Learning, 2023. URL https:\n//proceedings.mlr.press/v202/shi23a.html.\n\namong the largest language models today and we apply them on a wide range of language tasks,\nincluding classiﬁcation, summarization, question-answering, creative writing, dialogue, and others.\n\n[7] S. Biderman, H. Schoelkopf, Q. Anthony, H. Bradley, K. O’Brien, E. Hallahan, M. A. Khan,\nS. Purohit, U. S. Prashanth, E. Raff, et al. Pythia: A suite for analyzing large language

Below is the reformatted prompt with embedded context.

<blockquote>

'[INST]Please answer the question below only based on the context information provided.

Here is a context:

algorithm. Machine Learning, 97(3):327–351, July 2014. doi: 10.1007/s10994-014-5458-8. URL https://doi.org/10.1007/s10994-014-5458-8. [10] Y. Chen, R. Wang, H. Jiang, S. Shi, and R.-L. Xu. Exploring the use of large language models

arli, and Denny Zhou. Large language models can be easily distracted by irrelevant context. In Proceedings of the 40th International Conference on Machine Learning, 2023. URL https://proceedings.mlr.press/v202/shi23a.html.

among the largest language models today and we apply them on a wide range of language tasks, including classiﬁcation, summarization, question-answering, creative writing, dialogue, and others.

[7] S. Biderman, H. Schoelkopf, Q. Anthony, H. Bradley, K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, et al. Pythia: A suite for analyzing large language models

Here is a question:

What purpose do large language models serve in the field of natural language processing?.[/INST]'

</blockquote>

With the new chunking we capture additional, helpful context on the types of language tasks that LLMs are used for, which is good.  However, the larger chunks also bring more noise into the context, such as what appear to be reference-list snippets that happen to match the question's wording.  Overall, since this adds some value and is aligned with our performance metric, we will proceed with the best-scoring scenario.

**Chunking Decision:** From our grid search, we found that, for our current model, on this train set, the chunking of **chunk size = 256** and **chunk overlap = 16** returned the best results, so we will proceed with those values.

#### 4.C) **Adjusting the Retriever**

After our review of chunking we are noticing an improvement in some of the content coming in, but also some additional noise.  Next we want to explore some of the retriever parameters to see if there is an opportunity to improve the chunks that we retrieve.  We will be exploring the search type, and number of chunks to return.

*Note: In a prior run of this grid search we also explored various 'score_thresholds' to accompany the 'similarity_score_threshold' search. Because this multiplies the compute required, and because it did not perform as well, it was dropped and not rerun. The output below uses only the default value of '0.5'.*

In [None]:
embedding_model = 'all-MiniLM-L6-v2'
llm = "cohere"
rag_template = """[INST]Please answer the question below only based on the context information provided.\n\nHere is a context:\n{context} \n\nHere is a question: \n{question}.[/INST]"""
llm_model = load_llm(llm)
chunk_size = 256
chunk_overlap = 16
splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

retr_search_types = ["similarity_score_threshold", "similarity", "mmr"]
retr_ks = [2, 4, 6, 8, 10]
retr_score_thresholds = [0.5]
results = []
index = 0
for search_type in retr_search_types:
    for k in retr_ks:
        if search_type != "similarity_score_threshold":
            thresholds = [None]
        else:
            thresholds = retr_score_thresholds.copy()
        for threshold in thresholds:
          threshold = float(threshold) if threshold is not None else None
          print("")
          print("Index ", index)
          index += 1
          print(f"Retriever Search Type: {search_type}, Retriever K: {k}, Retriever Score Threshold: {threshold}")

          # building model for current retriever
          base_embeddings, text_splitter, qdrant_vectorstore, retriever = build_embedding_splitter_vectorstore(embedding_model, splitter, retr_search_type=search_type, retr_k=k)
          qdrant_vectorstore = vectorize_documents(text_splitter, qdrant_vectorstore)
          rag_chain = build_RAG_prompt_chain(rag_template, llm_model, retriever, format_docs)

          # computing metrics
          metrics = ['rouge', 'bleu', 'bertscore']
          retriever_result = evaluate(metrics, validation_questions_answers, rag_chain, iterations=10, verbose=True, sleep=False)
          print(retriever_result.mean())
          results.append(retriever_result)

          # deleting previous variables
          del qdrant_vectorstore
          del text_splitter
          del retriever
          del rag_chain
          gc.collect() # forcing garbage collection


Index  0
Retriever Search Type: similarity_score_threshold, Retriever K: 2, Retriever Score Threshold: 0.5


0 -> 1 -> 2 -> 3 -> 7 -> 8 -> 9 -> 11 -> 12 -> 13 -> Sample       6.600000
eng_bleu     0.044431
mk_bleu      0.022195
eng_rouge    0.216947
mk_rouge     0.130084
eng_f1       0.862653
mk_f1        0.844525
dtype: float64

Index  1
Retriever Search Type: similarity_score_threshold, Retriever K: 4, Retriever Score Threshold: 0.5
0 -> 1 -> 2 -> 3 -> 7 -> 8 -> 9 -> 11 -> 12 -> 13 -> Sample       6.600000
eng_bleu     0.063517
mk_bleu      0.030256
eng_rouge    0.216755
mk_rouge     0.130160
eng_f1       0.866460
mk_f1        0.850197
dtype: float64

Index  2
Retriever Search Type: similarity_score_threshold, Retriever K: 6, Retriever Score Threshold: 0.5
0 -> 1 -> 2 -> 3 -> 7 -> 8 -> 9 -> 11 -> 12 -> 13 -> Sample       6.600000
eng_bleu     0.066372
mk_bleu      0.028718
eng_rouge    0.222179
mk_rouge     0.141708
eng_f1       0.869331
mk_f1        0.850582
dtype: float64

Index  3
Retriever Search Type: similarity_score_threshold, Retriever K: 8, Retriever Score Threshold: 0.5
0 -> 1 -> 

In [None]:
for i, result in enumerate(results):
  print("Index: ,", i)
  result = composite_evaluation(result)
  print(result['composite_total'].mean())

Index: , 0
0.48946872115337003
Index: , 1
0.49465515025590856
Index: , 2
0.49817502517732154
Index: , 3
0.49980076225463976
Index: , 4
0.5033544772597616
Index: , 5
0.4832889073471821
Index: , 6
0.5037578324854686
Index: , 7
0.49687409315131637
Index: , 8
0.5029232284077463
Index: , 9
0.5021960673241921
Index: , 10
0.49718776473797616
Index: , 11
0.5048170650110574
Index: , 12
0.5052131639923401
Index: , 13
0.5002370761126699
Index: , 14
0.5029710566734872


##### **Context Check**

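Below we check the retrieved context for our best-performing retriever configuration: index 12 (composite score 0.5052), which in the loop order above corresponds to search type 'mmr' with k = 6.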

In [None]:
chunk_size = 256
chunk_overlap = 16
splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
base_embeddings, text_splitter, qdrant_vectorstore, retriever = build_embedding_splitter_vectorstore('all-MiniLM-L6-v2', splitter, retr_search_type="mmr", retr_k=6)
qdrant_vectorstore = vectorize_documents(text_splitter, qdrant_vectorstore)
llm_model = load_llm(llm)
rag_chain = build_RAG_prompt_chain(rag_template, llm_model, retriever, format_docs)
rag_chain.invoke(question)

Before LLM: messages=[HumanMessage(content='[INST]Please answer the question below only based on the context information provided.\n\nHere is a context:\nalgorithm. Machine Learning, 97(3):327–351, July 2014. doi: 10.1007/s10994-014-5458-8.\nURL https://doi.org/10.1007/s10994-014-5458-8.\n[10] Y. Chen, R. Wang, H. Jiang, S. Shi, and R.-L. Xu. Exploring the use of large language models\n\nmisuse, there are many domains where large language models should be deployed only with great\ncare, or not at all. Examples include high-stakes domains such as medical diagnoses, classifying\n\nAfter the success of many large-scale general language models, many QA models embrace the following approach:\n\nthem to do what a given set of humans want them to do. By default, language models optimize\nthe next word prediction objective, which is only a proxy for what we want these models to do.\n\nMany applications in natural language processing rely on adapt-\ning one large-scale, pre-trained language mod

'Large language models serve as the foundation for many natural language processing (NLP) applications. They are pre-trained on vast amounts of text data and can be adapted to various downstream tasks through fine-tuning. This allows for efficient and effective language model deployment in multiple domains, including classification, summarization, question-answering, creative writing, and dialogue.'

Below is the reformatted prompt with the context embedded:

<blockquote>

'[INST]Please answer the question below only based on the context information provided.

Here is a context:

algorithm. Machine Learning, 97(3):327–351, July 2014. doi: 10.1007/s10994-014-5458-8. URL https://doi.org/10.1007/s10994-014-5458-8. [10] Y. Chen, R. Wang, H. Jiang, S. Shi, and R.-L. Xu. Exploring the use of large language models

misuse, there are many domains where large language models should be deployed only with great care, or not at all. Examples include high-stakes domains such as medical diagnoses, classifying

After the success of many large-scale general language models, many QA models embrace the following approach:

them to do what a given set of humans want them to do. By default, language models optimize the next word prediction objective, which is only a proxy for what we want these models to do.

Many applications in natural language processing rely on adapting one large-scale, pre-trained language model to multiple downstream applications. Such adaptation is usually done via ﬁne-tuning,

among the largest language models today and we apply them on a wide range of language tasks, including classiﬁcation, summarization, question-answering, creative writing, dialogue, and others.

Here is a question:

What purpose do large language models serve in the field of natural language processing?.[/INST]'

</blockquote>

For this specific instance, changing the search type does appear to have removed some of the 'references' snippets that appeared previously.  The increase in the number of chunks returned also provides some additional, beneficial context without too much additional noise.
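
The reduction in near-duplicate snippets is what one would expect from maximal marginal relevance (MMR), which trades relevance to the query against similarity to chunks already selected. The sketch below is a minimal pure-Python version of the greedy MMR rule on toy 2-D vectors; it is not the vector store's implementation, and `cosine`, `mmr_select`, and the `lambda_mult` default of 0.5 are illustrative choices.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Greedy MMR: return indices of up to k documents, balancing relevance and diversity."""
    relevance = [cosine(query_vec, doc) for doc in doc_vecs]
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            # Penalize similarity to the closest already-selected document.
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy embeddings: documents 0 and 1 are near duplicates, document 2 covers a different angle.
query = [1.0, 1.0]
docs = [[1.0, 0.2], [1.0, 0.25], [0.1, 1.0]]

top_by_similarity = sorted(range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True)[:2]
top_by_mmr = mmr_select(query, docs, k=2)
print("similarity:", top_by_similarity)
print("mmr:       ", top_by_mmr)
```

In this toy example plain similarity returns the two near-duplicate documents (1 and 0), while MMR keeps document 1 and replaces the duplicate with document 2.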

**Tangent Curiosity Check:** The gold standard answers below contain content that we are not retrieving from our documentation, such as 'speech recognition'.  Out of curiosity, and as an experiment, we pass the gold answer through the RAG chain, with the number of chunks to retrieve increased to 12, to see what context it picks up.

<blockquote>

Target Responses Engineering = "Large language models (LLMs) serve the purpose of enabling general-purpose language generation and other natural language processing tasks such as classification. They achieve this by learning statistical relationships from text documents during computationally intensive self-supervised and semi-supervised training. LLMs can be used for text generation by predicting the next token or word, making them valuable for tasks like **speech recognition, machine translation**, and information retrieval. Additionally, LLMs have superseded previous models like recurrent neural networks, showcasing their efficiency and effectiveness in NLP tasks."

Marketing = "Large language models serve the purpose of improving performance in various natural language processing tasks, such as **speech recognition, machine translation,** natural language generation, optical character recognition, handwriting recognition, grammar induction, and information retrieval."

</blockquote>

In [None]:
# selecting a sample question and answers
question = validation_questions_answers[0]['question']
engineering_answer = validation_questions_answers[0]["gold_answer_research"]
marketing_answer = validation_questions_answers[0]["gold_answer_marketing"]
base_embeddings, text_splitter, qdrant_vectorstore, retriever = build_embedding_splitter_vectorstore('all-MiniLM-L6-v2', splitter, retr_search_type="mmr", retr_k=12)
qdrant_vectorstore = vectorize_documents(text_splitter, qdrant_vectorstore)
llm_model = load_llm(llm)
rag_chain = build_RAG_prompt_chain(rag_template, llm_model, retriever, format_docs)
rag_chain.invoke(marketing_answer)

Before LLM: messages=[HumanMessage(content='[INST]Please answer the question below only based on the context information provided.\n\nHere is a context:\namong the largest language models today and we apply them on a wide range of language tasks,\nincluding classiﬁcation, summarization, question-answering, creative writing, dialogue, and others.\n\nguage models trained on a large amount of text – where ﬁne-tuning on task-speciﬁc data after pre-\ntraining on general domain data provides a signiﬁcant performance gain compared to training on\n\nFig. 13. The amount of computation used for training big language models of different sizes is getting big. (Image source: Brown et al., 2020).\n\n[2017], have achieved the state-of-the-art performance of various natural language processing (NLP) tasks, including\ncontext representation learning Devlin et al. [2019], machine translation Vaswani et al. [2017], and language modeling\n\naugmentation for language models, recent works also study\nretrie

'Yes, that is correct. Large language models (LLMs) are designed to enhance performance across a broad range of natural language processing (NLP) tasks. The list of tasks you provided, including speech recognition, machine translation, natural language generation, and others, falls within the scope of applications where LLMs strive to excel.'



<blockquote>

'[INST]Please answer the question below only based on the context information provided.

Here is a context:

among the largest language models today and we apply them on a wide range of language tasks, including classiﬁcation, summarization, question-answering, creative writing, dialogue, and others.

guage models trained on a large amount of text where ﬁne-tuning on task-speciﬁc data after pre-training on general domain data provides a signiﬁcant performance gain compared to training on

Fig. 13. The amount of computation used for training big language models of different sizes is getting big. (Image source: Brown et al., 2020).

[2017], have achieved the state-of-the-art performance of various natural language processing (NLP) tasks, including context representation learning Devlin et al. [2019], machine translation Vaswani et al. [2017], and language modeling

augmentation for language models, recent works also study retrieval for computer vision models (Ashual et al., 2022; Blattmann et al., 2022; Gur et al., 2021; Sarto et al., 2022; Li et al., 2022; Ramos et al., 2023; Wang et al., 2022a). More

Mistral 7B takes a significant step in balancing the goals of getting high performance while keeping large language models efficient. Through our work, our aim is to help the community create more

Q19-1026. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model

at https://github.com/DaoD/INTERS. 1 Introduction Large language models (LLMs) have shown re-markable capabilities across various natural language processing (NLP) tasks. While these models have learned vast knowledge from large text cor-

Big language models have been pre-trained on a large collection of unsupervised textual corpus. Given enough parameters, these models are able to memorize some factual knowledge within parameter weights. Therefore, we can use these models to do

affordable, efficient, and high-performing language models that can be used in a wide range of real-world applications. Architectural details Figure 1: Sliding Window Attention. The number of operations in vanilla attention is quadratic in the sequence

speciﬁc downstream tasks that were learned but not emphasized in the general pre-training model. 8 CONCLUSION AND FUTURE WORK Fine-tuning enormous language models is prohibitively expensive in terms of the hardware required

INTERS: Unlocking the Power of Large Language Models in Search with Instruction Tuning Yutao Zhu1, Peitian Zhang1, Chenghao Zhang1,2, Yifei Chen1,3, Binyu Xie1 Zhicheng Dou1†, Zheng Liu4, and Ji-Rong Wen1

Here is a question:

Large language models serve the purpose of improving performance in various natural language processing tasks, such as speech recognition, machine translation, natural language generation, optical character recognition, handwriting recognition, grammar induction, and information retrieval..[/INST]'

</blockquote>

So this shows that even when we use the gold answer as our query for document retrieval, we do not return any context that speaks specifically to 'speech recognition', which is found in both gold standard answers. Given this, we suspect that our ability to reproduce the gold standard answers may be limited by the documentation that we have available.
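The check described above can be sketched as a simple substring search over the retrieved chunks. This is a minimal illustration, not part of the original notebook: `chunks_mentioning` is a hypothetical helper, and the two chunks are excerpts copied from the context shown above.

```python
def chunks_mentioning(term, chunks):
    """Return the indices of the chunks that contain `term` (case-insensitive)."""
    term = term.lower()
    return [i for i, chunk in enumerate(chunks) if term in chunk.lower()]


# Two excerpts from the retrieved context above (illustrative subset only).
retrieved_chunks = [
    "among the largest language models today and we apply them on a wide range "
    "of language tasks, including classification, summarization, "
    "question-answering, creative writing, dialogue, and others.",
    "[2017], have achieved the state-of-the-art performance of various natural "
    "language processing (NLP) tasks, including context representation learning "
    "Devlin et al. [2019], machine translation Vaswani et al. [2017], and "
    "language modeling",
]

print(chunks_mentioning("speech recognition", retrieved_chunks))   # []
print(chunks_mentioning("machine translation", retrieved_chunks))  # [1]
```

Running the same check over every gold-answer key phrase would give a rough estimate of how much of the gold content the document store can support at all.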

We will continue to hold true to our guiding metric and proceed with the highest performing parameters for the Retriever.

**Retriever Decision:** From our grid search, we found that, for our current model on this train set, using the default **mmr** search type and returning **k=6** documents produces the best results.
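For readers unfamiliar with the **mmr** search type, the following is a minimal pure-Python sketch of maximal marginal relevance selection. It is not the vector store's implementation: the function names, the toy vectors, and the `lambda_mult` default are assumptions made for illustration. Each step picks the candidate that maximizes relevance to the query minus similarity to what has already been selected.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query_vec, doc_vecs, k=6, lambda_mult=0.5):
    """Greedy MMR: trade off query relevance against redundancy with prior picks."""
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            # Highest similarity to any already-selected document (0 if none yet).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


# Toy example: documents 0 and 1 are exact duplicates, document 2 is different.
docs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
query = [1.0, 0.5]
print(mmr_select(query, docs, k=2, lambda_mult=0.5))  # [0, 2]
print(mmr_select(query, docs, k=2, lambda_mult=1.0))  # [0, 1]
```

With `lambda_mult=1.0` the selection reduces to plain similarity ranking, which is why the duplicate is returned in the second call.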

#### 4.D) **Exploring LLMs**

This concludes our exploration of the setup and retrieval of our documentation. Now we move on to the downstream components: LLMs and prompts.

Below we compare the two models we have available to us.  Before we explore these two models, we want to highlight some of the pros and cons of both.

**Cohere:** While Cohere offers a free trial, this model, like any production usage, requires a subscription. The expectation would be that a paid service will have less downtime, better maintenance, and more consistent upgrades, increasing its value and justifying the cost. Another consideration is that Cohere is an external resource. Right now we are using publicly available documents. Should we at some point want to include sensitive documentation in our document store, we would need to consider the security of that context being shared externally. An additional advantage of Cohere is that it should be a safer model in terms of avoiding inappropriate output.

**Mistral:** Because Mistral is an open-source model, there are no subscription costs, but that comes with the potential drawbacks of downtime, reduced maintenance, and slower upgrades. It is also, in theory, more likely to output inappropriate answers. That said, one major advantage of Mistral being open source is that, should we start including sensitive documentation, we could host the model locally to ensure that no sensitive content is shared externally. We would also have more freedom and flexibility to adjust parameters and fine-tune for specific use cases.

Now we compare the models' outputs.

In [None]:
%%capture

embedding_model = 'all-MiniLM-L6-v2'
rag_template = """[INST]Please answer the question below only based on the context information provided.\n\nHere is a context:\n{context} \n\nHere is a question: \n{question}.[/INST]"""
chunk_size = 256
chunk_overlap = 16
splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
base_embeddings, text_splitter, qdrant_vectorstore, retriever = build_embedding_splitter_vectorstore(embedding_model, splitter, retr_search_type="mmr", retr_k=6)
qdrant_vectorstore = vectorize_documents(text_splitter, qdrant_vectorstore)

llms = ['cohere', 'mistral']
results = []
index = 0
for llm in llms:
    print("")
    print("Index ", index)
    index += 1
    print(f"LLM: {llm}")

    # building model for current llm
    llm_model = load_llm(llm)
    rag_chain = build_RAG_prompt_chain(rag_template, llm_model, retriever, format_docs)

    # computing metrics
    metrics = ['rouge', 'bleu', 'bertscore']
    llm_result = evaluate(metrics, validation_questions_answers, rag_chain, iterations=10, verbose=True, sleep=False)
    results.append(llm_result)

    # deleting previous variables
    del llm_model
    del rag_chain
    gc.collect() # forcing garbage collection

Unused kwargs: ['llm_int4_enable_fp32_cpu_offload']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

In [None]:
for i, result in enumerate(results):
  print("Index: ,", i)
  result = composite_evaluation(result)
  print(result.mean())

Index: , 0
Sample             6.600000
eng_bleu           0.058681
mk_bleu            0.030267
eng_rouge          0.240201
mk_rouge           0.166196
eng_f1             0.875064
mk_f1              0.859098
composite_eng      0.521329
composite_mk       0.485461
composite_total    0.506981
dtype: float64
Index: , 1
Sample             6.600000
eng_bleu           0.031601
mk_bleu            0.018405
eng_rouge          0.171653
mk_rouge           0.105633
eng_f1             0.817403
mk_f1              0.809078
composite_eng      0.466518
composite_mk       0.439910
composite_total    0.455875
dtype: float64
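As a sanity check on the numbers above, the composite columns can be reproduced from the individual metric means. The weights below are an assumption: they are inferred from the printed means rather than read from `composite_evaluation`, and they happen to match the reported values to within rounding. Because the combination is linear, applying it to the column means gives the same result as averaging per-sample composites.

```python
# Assumed weights, inferred from the reported means (not read from the notebook's code).
METRIC_WEIGHTS = {"bleu": 0.2, "rouge": 0.3, "f1": 0.5}
DEPT_WEIGHTS = {"eng": 0.6, "mk": 0.4}


def composite(bleu, rouge, f1):
    """Weighted combination of BLEU, ROUGE, and BERTScore F1."""
    return (
        METRIC_WEIGHTS["bleu"] * bleu
        + METRIC_WEIGHTS["rouge"] * rouge
        + METRIC_WEIGHTS["f1"] * f1
    )


def composite_total(eng, mk):
    """Weighted combination of the engineering and marketing composites."""
    return DEPT_WEIGHTS["eng"] * eng + DEPT_WEIGHTS["mk"] * mk


# Mean scores for index 0 (Cohere), copied from the output above.
cohere_eng = composite(0.058681, 0.240201, 0.875064)
cohere_mk = composite(0.030267, 0.166196, 0.859098)
cohere_total = composite_total(cohere_eng, cohere_mk)

# Mean scores for index 1 (Mistral), copied from the output above.
mistral_eng = composite(0.031601, 0.171653, 0.817403)
mistral_mk = composite(0.018405, 0.105633, 0.809078)
mistral_total = composite_total(mistral_eng, mistral_mk)

print(f"Cohere:  {cohere_total:.4f}")
print(f"Mistral: {mistral_total:.4f}")
print(f"Gap:     {cohere_total - mistral_total:.4f}")
```

Under these assumed weights, BERTScore F1 contributes half of each composite, so the gap between the two models is driven mainly by the F1 difference.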


##### **Output Readability Check**

Although the Cohere model shows clearly higher scores, we still want to compare the actual output of the two models.

In [None]:
%%capture

embedding_model = 'all-MiniLM-L6-v2'
rag_template = """[INST]Please answer the question below only based on the context information provided.\n\nHere is a context:\n{context} \n\nHere is a question: \n{question}.[/INST]"""
chunk_size = 256
chunk_overlap = 16
splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
base_embeddings, text_splitter, qdrant_vectorstore, retriever = build_embedding_splitter_vectorstore(embedding_model, splitter, retr_search_type="mmr", retr_k=6)
qdrant_vectorstore = vectorize_documents(text_splitter, qdrant_vectorstore)

# selecting a sample question and answers
question = validation_questions_answers[0]['question']
engineering_answer = validation_questions_answers[0]["gold_answer_research"]
marketing_answer = validation_questions_answers[0]["gold_answer_marketing"]

In [None]:
llm_model = load_llm('cohere')
rag_chain = build_RAG_prompt_chain(rag_template, llm_model, retriever, format_docs)
rag_chain.invoke(question)

Before LLM: messages=[HumanMessage(content='[INST]Please answer the question below only based on the context information provided.\n\nHere is a context:\nalgorithm. Machine Learning, 97(3):327–351, July 2014. doi: 10.1007/s10994-014-5458-8.\nURL https://doi.org/10.1007/s10994-014-5458-8.\n[10] Y. Chen, R. Wang, H. Jiang, S. Shi, and R.-L. Xu. Exploring the use of large language models\n\nmisuse, there are many domains where large language models should be deployed only with great\ncare, or not at all. Examples include high-stakes domains such as medical diagnoses, classifying\n\nAfter the success of many large-scale general language models, many QA models embrace the following approach:\n\nthem to do what a given set of humans want them to do. By default, language models optimize\nthe next word prediction objective, which is only a proxy for what we want these models to do.\n\nMany applications in natural language processing rely on adapt-\ning one large-scale, pre-trained language mod

'Large language models serve as the foundation for many natural language processing (NLP) applications. They are pre-trained on vast amounts of text data and can be adapted to various downstream tasks through fine-tuning. This allows for efficient and effective language model deployment in multiple domains, including classification, summarization, question-answering, creative writing, and dialogue. While there are concerns about potential misuse, large language models have revolutionized NLP by providing adaptable and contextually aware language understanding and generation capabilities.'

In [None]:
llm_model = load_llm('mistral')
rag_chain = build_RAG_prompt_chain(rag_template, llm_model, retriever, format_docs)
rag_chain.invoke(question)

Unused kwargs: ['llm_int4_enable_fp32_cpu_offload']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Before LLM: messages=[HumanMessage(content='[INST]Please answer the question below only based on the context information provided.\n\nHere is a context:\nalgorithm. Machine Learning, 97(3):327–351, July 2014. doi: 10.1007/s10994-014-5458-8.\nURL https://doi.org/10.1007/s10994-014-5458-8.\n[10] Y. Chen, R. Wang, H. Jiang, S. Shi, and R.-L. Xu. Exploring the use of large language models\n\nmisuse, there are many domains where large language models should be deployed only with great\ncare, or not at all. Examples include high-stakes domains such as medical diagnoses, classifying\n\nAfter the success of many large-scale general language models, many QA models embrace the following approach:\n\nthem to do what a given set of humans want them to do. By default, language models optimize\nthe next word prediction objective, which is only a proxy for what we want these models to do.\n\nMany applications in natural language processing rely on adapt-\ning one large-scale, pre-trained language mod

"Human: [INST]Please answer the question below only based on the context information provided.\n\nHere is a context:\nalgorithm. Machine Learning, 97(3):327–351, July 2014. doi: 10.1007/s10994-014-5458-8.\nURL https://doi.org/10.1007/s10994-014-5458-8.\n[10] Y. Chen, R. Wang, H. Jiang, S. Shi, and R.-L. Xu. Exploring the use of large language models\n\nmisuse, there are many domains where large language models should be deployed only with great\ncare, or not at all. Examples include high-stakes domains such as medical diagnoses, classifying\n\nAfter the success of many large-scale general language models, many QA models embrace the following approach:\n\nthem to do what a given set of humans want them to do. By default, language models optimize\nthe next word prediction objective, which is only a proxy for what we want these models to do.\n\nMany applications in natural language processing rely on adapt-\ning one large-scale, pre-trained language model to multiple down-\nstream applica

Below is a breakdown of the prompt, responses and target answers.

**LLM Prompt**

<blockquote>

'[INST]Please answer the question below only based on the context information provided.

Here is a context:

algorithm. Machine Learning, 97(3):327–351, July 2014. doi: 10.1007/s10994-014-5458-8. URL https://doi.org/10.1007/s10994-014-5458-8. [10] Y. Chen, R. Wang, H. Jiang, S. Shi, and R.-L. Xu. Exploring the use of large language models

misuse, there are many domains where large language models should be deployed only with great care, or not at all. Examples include high-stakes domains such as medical diagnoses, classifying

After the success of many large-scale general language models, many QA models embrace the following approach:

them to do what a given set of humans want them to do. By default, language models optimize the next word prediction objective, which is only a proxy for what we want these models to do.

Many applications in natural language processing rely on adapting one large-scale, pre-trained language model to multiple downstream applications. Such adaptation is usually done via ﬁne-tuning,

among the largest language models today and we apply them on a wide range of language tasks, including classiﬁcation, summarization, question-answering, creative writing, dialogue, and others.

Here is a question:

What purpose do large language models serve in the field of natural language processing?.[/INST]'

</blockquote>

**Model Answers**

<blockquote>

**Cohere Answer** 'Large language models serve as the foundation for many natural language processing (NLP) applications. They are pre-trained on vast amounts of text data and can be adapted to various downstream tasks through fine-tuning. This allows for efficient and effective language model deployment in multiple domains, including classification, summarization, question-answering, creative writing, and dialogue. While there are concerns about potential misuse, large language models have revolutionized NLP by providing adaptable and contextually aware language understanding and generation capabilities.'

**Mistral Answer** 'Large language models serve as a tool to solve various problems in natural language processing by providing a means of understanding and generating textual data. They can be used for classification, summarization, question answering, creative writing, dialogue, among other applications. However, it's important to note that they may have limitations and potential misuses, especially in high-stakes domains such as medical diagnosis.'

</blockquote>

**Target Responses**

<blockquote>

**Engineering** "Large language models (LLMs) serve the purpose of enabling general-purpose language generation and other natural language processing tasks such as classification. They achieve this by learning statistical relationships from text documents during computationally intensive self-supervised and semi-supervised training. LLMs can be used for text generation by predicting the next token or word, making them valuable for tasks like speech recognition, machine translation, and information retrieval. Additionally, LLMs have superseded previous models like recurrent neural networks, showcasing their efficiency and effectiveness in NLP tasks."

**Marketing** "Large language models serve the purpose of improving performance in various natural language processing tasks, such as speech recognition, machine translation, natural language generation, optical character recognition, handwriting recognition, grammar induction, and information retrieval."

</blockquote>

For this example, the Cohere model does appear to provide a more comprehensive answer than Mistral. Once again, we will proceed with the LLM that our evaluation metric has identified as superior, which is also the default.

**LLM Decision:** We will be proceeding with the 'Cohere' LLM.

#### 4.E) **Modifying Prompt**

With our LLM chosen, we now look at the last (but not least) step in tuning our RAG chain: the prompt. We first look to improve the standardized response, and then define specific responses for the separate departments.

*Note: The RAG chain functions were defined to allow for the delineation between 'engineering' and 'marketing' departments.*

In [None]:
%%capture

embedding_model = 'all-MiniLM-L6-v2'
chunk_size = 256
chunk_overlap = 16
splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
llm = "cohere"

base_embeddings, text_splitter, qdrant_vectorstore, retriever = build_embedding_splitter_vectorstore(embedding_model, splitter, retr_search_type="mmr", retr_k=6)
qdrant_vectorstore = vectorize_documents(text_splitter, qdrant_vectorstore)
llm_model = load_llm(llm)

##### **Improving Agnostic Language in Prompt**

**First Pass:**
Below is a simple adjustment (from the baseline prompt) in formatting and word selection.

In [None]:
rag_template = """[INST]
              Please provide an precise and concise answer to the question below based on the context information provided.\n\n
              Below is a context:\n{context}\n
              Below is a question:\n{question}\n[/INST]
              """

rag_chain = build_RAG_prompt_chain(rag_template, llm_model, retriever, format_docs)

In [None]:
# using function to capture all metrics
metrics = ['rouge', 'bleu', 'bertscore']
impr_base_prompt = evaluate(metrics, validation_questions_answers, rag_chain, iterations=10, verbose=False, sleep=False, print_results=True)
impr_base_prompt = composite_evaluation(impr_base_prompt)

Before LLM: messages=[HumanMessage(content='[INST]\n              Please provide an precise and concise answer to the question below based on the context information provided.\n\n\n              Below is a context:\nalgorithm. Machine Learning, 97(3):327–351, July 2014. doi: 10.1007/s10994-014-5458-8.\nURL https://doi.org/10.1007/s10994-014-5458-8.\n[10] Y. Chen, R. Wang, H. Jiang, S. Shi, and R.-L. Xu. Exploring the use of large language models\n\nmisuse, there are many domains where large language models should be deployed only with great\ncare, or not at all. Examples include high-stakes domains such as medical diagnoses, classifying\n\nAfter the success of many large-scale general language models, many QA models embrace the following approach:\n\nthem to do what a given set of humans want them to do. By default, language models optimize\nthe next word prediction objective, which is only a proxy for what we want these models to do.\n\nMany applications in natural language processing

In [None]:
impr_base_prompt.describe()

Unnamed: 0,Sample,eng_bleu,mk_bleu,eng_rouge,mk_rouge,eng_f1,mk_f1,composite_eng,composite_mk,composite_total
count,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0
mean,6.6,0.085301,0.056981,0.27715,0.22631,0.875679,0.869244,0.538045,0.513911,0.528391
std,4.788876,0.037857,0.05617,0.030955,0.110159,0.020937,0.024893,0.020511,0.053888,0.024182
min,0.0,0.036419,0.0,0.231293,0.0625,0.840178,0.812753,0.507959,0.435503,0.494234
25%,2.25,0.062209,0.0,0.25482,0.155309,0.862992,0.862882,0.526855,0.480649,0.508022
50%,7.5,0.076656,0.054083,0.280987,0.225564,0.872152,0.873871,0.535689,0.512473,0.531047
75%,10.5,0.118162,0.112289,0.285273,0.279781,0.886593,0.884759,0.554789,0.549099,0.549279
max,13.0,0.142238,0.125896,0.338462,0.394366,0.912465,0.898166,0.567109,0.591089,0.55601


Our composite score has increased by about 1% with these slight modifications. A quick review of the output answers highlights a few things that need to be addressed:
 - Some outputs are in bullet points, but the target answers are all in sentences.
 - The way that our output handles missing information is different than the target answers.
 - Some answers are separated into multiple paragraphs, but should be singular.
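The formatting issues listed above were found by manual review. A rough automated check could look like the sketch below; `format_issues` and its regular expression are hypothetical and cover only the bullet-point and multi-paragraph cases, not how missing information is handled.

```python
import re

# Lines that start with a bullet marker ("-", "*", "•") or a numbered-list marker.
BULLET_RE = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+", re.MULTILINE)


def format_issues(answer):
    """Return a list of formatting problems detected in a generated answer."""
    issues = []
    if BULLET_RE.search(answer):
        issues.append("bullet_points")
    # Paragraphs are blocks of text separated by at least one blank line.
    paragraphs = [p for p in re.split(r"\n\s*\n", answer.strip()) if p.strip()]
    if len(paragraphs) > 1:
        issues.append("multiple_paragraphs")
    return issues


print(format_issues("LLMs are useful for many NLP tasks."))  # []
print(format_issues("- classification\n- summarization"))    # ['bullet_points']
print(format_issues("First point.\n\nSecond point."))        # ['multiple_paragraphs']
```

Counting flagged answers before and after a prompt change gives a quick signal that is independent of the similarity metrics.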

**Second Pass:** Now we will update the prompt to address some of the above issues with the results.

In [None]:
rag_template = """[INST]
              Please provide an precise and concise answer to the question below based on the context information provided.\n\n
              Below is a context:\n{context}\n
              Below is a question:\n{question}\n
              Below are answer instructions:
- Keep your answer to one paragraph.
- Answer with complete sentences or phrases. Do not use bullet points.
[/INST]
"""

rag_chain = build_RAG_prompt_chain(rag_template, llm_model, retriever, format_docs)
print(rag_template)

[INST]
              Please provide an precise and concise answer to the question below based on the context information provided.


              Below is a context:
{context}

              Below is a question:
{question}

              Below are answer instructions:
- Keep your answer to one paragraph.
- Answer with complete sentences or phrases. Do not use bullet points. 
[/INST]



In [None]:
# using function to capture all metrics
metrics = ['rouge', 'bleu', 'bertscore']
impr_base_prompt = evaluate(metrics, validation_questions_answers, rag_chain, iterations=10, verbose=False, sleep=False, print_results=True)
impr_base_prompt = composite_evaluation(impr_base_prompt)

Before LLM: messages=[HumanMessage(content='[INST]\n              Please provide an precise and concise answer to the question below based on the context information provided.\n\n\n              Below is a context:\nalgorithm. Machine Learning, 97(3):327–351, July 2014. doi: 10.1007/s10994-014-5458-8.\nURL https://doi.org/10.1007/s10994-014-5458-8.\n[10] Y. Chen, R. Wang, H. Jiang, S. Shi, and R.-L. Xu. Exploring the use of large language models\n\nmisuse, there are many domains where large language models should be deployed only with great\ncare, or not at all. Examples include high-stakes domains such as medical diagnoses, classifying\n\nAfter the success of many large-scale general language models, many QA models embrace the following approach:\n\nthem to do what a given set of humans want them to do. By default, language models optimize\nthe next word prediction objective, which is only a proxy for what we want these models to do.\n\nMany applications in natural language processing

In [None]:
impr_base_prompt.describe()

Unnamed: 0,Sample,eng_bleu,mk_bleu,eng_rouge,mk_rouge,eng_f1,mk_f1,composite_eng,composite_mk,composite_total
count,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0
mean,6.6,0.082288,0.03357,0.266774,0.197963,0.886655,0.870535,0.539817,0.50137,0.524438
std,4.788876,0.063534,0.03874,0.066192,0.072478,0.012793,0.012934,0.026867,0.032884,0.017455
min,0.0,0.0,0.0,0.176211,0.076923,0.873402,0.841682,0.496033,0.443918,0.495907
25%,2.25,0.016754,0.0,0.243405,0.142622,0.875753,0.865623,0.527241,0.474888,0.514752
50%,7.5,0.093354,0.019997,0.266273,0.2,0.882927,0.871686,0.54017,0.507126,0.521807
75%,10.5,0.136794,0.056816,0.28051,0.250646,0.89399,0.880954,0.560596,0.522799,0.531066
max,13.0,0.155562,0.099238,0.415842,0.311111,0.908026,0.884005,0.576867,0.54591,0.555555


While these additional instructions cause a slight decrease in our evaluation metric score, they do show significant improvements in the structure and formatting of the answers. It is now time to separate the departments' responses and start tuning them for the intricacies of each department.

##### **Basic Inclusion of Department**

**First Pass:** We start by creating separate prompts for Marketing and Engineering, based on the agnostic prompt created above, and adding context about the specific user.
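Since the department prompts differ only in the audience and the list of instructions, one way to keep them in sync is to generate them from a shared base. The sketch below is hypothetical: `build_template` and its constants are not part of the notebook, and the wording and whitespace are simplified relative to the templates actually evaluated, so it would not necessarily reproduce the reported scores.

```python
# Doubled braces survive str.format() as the {context}/{question} placeholders
# that the prompt template fills in later.
BASE_TEMPLATE = """[INST]
Please provide a precise and concise answer to the {audience} question below based on the context information provided.

Below is a context:
{{context}}

Below is a question:
{{question}}

Below are answer instructions:
{instructions}
[/INST]
"""

SHARED_INSTRUCTIONS = [
    "Keep your answer to one paragraph.",
    "Answer with complete sentences or phrases. Do not use bullet points.",
]


def build_template(audience, extra_instructions=()):
    """Build a department-specific RAG template from the shared base."""
    lines = list(SHARED_INSTRUCTIONS) + list(extra_instructions)
    bullet_block = "\n".join(f"- {line}" for line in lines)
    return BASE_TEMPLATE.format(audience=audience, instructions=bullet_block)


eng_template = build_template("engineer's")
mk_template = build_template("marketer's")
print(mk_template)
```

Later passes could then add department-specific instructions through `extra_instructions` without duplicating the shared text.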

In [None]:
eng_rag_template = """[INST]
              Please provide an precise and concise answer to the engineer's question below based on the context information provided.\n\n
              Below is a context:\n{context}\n
              Below is a question:\n{question}\n
              Below are answer instructions:
- Keep your answer to one paragraph.
- Answer with complete sentences or phrases. Do not use bullet points.
[/INST]
"""

mk_rag_template = """[INST]
              Please provide an precise and concise answer to the marketer's question below based on the context information provided.\n\n
              Below is a context:\n{context}\n
              Below is a question:\n{question}\n
              Below are answer instructions:
- Keep your answer to one paragraph.
- Answer with complete sentences or phrases. Do not use bullet points.
[/INST]
"""

eng_rag_prompt = ChatPromptTemplate.from_template(eng_rag_template)
mk_rag_prompt = ChatPromptTemplate.from_template(mk_rag_template)

In [None]:
eng_rag_chain = build_RAG_prompt_chain(eng_rag_template, llm_model, retriever, format_docs)
mk_rag_chain = build_RAG_prompt_chain(mk_rag_template, llm_model, retriever, format_docs)

metrics = ['rouge', 'bleu', 'bertscore']

rag_chains = {"engineering": eng_rag_chain, "marketing": mk_rag_chain}

dpt_prompt = evaluate(metrics, validation_questions_answers, rag_chains, iterations=10, verbose=False, dept_specific=True, sleep=False, print_results=True)
dpt_prompt = composite_evaluation(dpt_prompt)

------------------------------------------------------------
Question: What purpose do large language models serve in the field of natural language processing?
 
Engineering Prediction: Large language models serve as the foundation for many natural language processing applications. These models are pre-trained on vast amounts of text data and can be adapted to various downstream tasks through fine-tuning. They excel at predicting the next word in a sequence, which forms the basis for more complex language tasks such as classification, summarization, question-answering, and creative writing. The ability to leverage large language models enables more accurate and contextually aware language processing, powering a new generation of language-based applications.
Engineering Answer: Large language models (LLMs) serve the purpose of enabling general-purpose language generation and other natural language processing tasks such as classification. They achieve this by learning statistical relatio

In [None]:
dpt_prompt.describe()

Unnamed: 0,Sample,eng_bleu,mk_bleu,eng_rouge,mk_rouge,eng_f1,mk_f1,composite_eng,composite_mk,composite_total
count,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0
mean,6.6,0.10276,0.040998,0.269231,0.200481,0.880977,0.870698,0.54181,0.503693,0.526563
std,4.788876,0.078244,0.044402,0.084117,0.073555,0.019346,0.014568,0.048173,0.033488,0.028314
min,0.0,0.0,0.0,0.179775,0.065574,0.855122,0.843806,0.486989,0.441575,0.468824
25%,2.25,0.07048,0.0,0.212644,0.139139,0.86853,0.863022,0.513371,0.478186,0.520719
50%,7.5,0.085833,0.032646,0.240105,0.232763,0.874516,0.869736,0.526741,0.516,0.52436
75%,10.5,0.150724,0.075738,0.299668,0.243902,0.894291,0.88323,0.564459,0.526708,0.550636
max,13.0,0.251152,0.098555,0.440367,0.277778,0.9158,0.890389,0.623024,0.53793,0.559457


From a metric standpoint, the engineering answers improve slightly (mean `composite_eng` of 0.542, up from 0.540), but the marketing answers (mean `composite_mk` of 0.504) still fall short of the 0.514 reached by the first agnostic prompt. Comparing the marketing answers to the targets, we notice the following:
 - Every marketing answer is longer than its accompanying target.
 - There are specific references to papers, which are not present in the target answers.
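Both observations above can be quantified with simple heuristics. The sketch below is illustrative only: `length_ratio`, `mentions_paper`, and the citation pattern are hypothetical, and the regular expression will miss some reference styles and may flag some innocent text.

```python
import re

# Common citation cues: "et al.", bracketed reference numbers, parenthesized years,
# and explicit mentions of "the paper" or "the authors".
CITATION_RE = re.compile(
    r"et al\.|\[\d+\]|\(\d{4}\)|\bthe (?:paper|authors)\b", re.IGNORECASE
)


def length_ratio(prediction, target):
    """Word count of the prediction divided by the word count of the target."""
    return len(prediction.split()) / max(len(target.split()), 1)


def mentions_paper(answer):
    """True if the answer appears to reference a paper explicitly."""
    return bool(CITATION_RE.search(answer))


print(length_ratio("a b c d", "a b"))                                # 2.0
print(mentions_paper("As shown by Brown et al. (2020), LLMs scale."))  # True
print(mentions_paper("Large language models improve NLP tasks."))      # False
```

A length ratio consistently above 1.0 for the marketing predictions would confirm the first observation across the whole validation set rather than a single example.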

**Second Pass:** We will now modify the prompts to address these issues.

In [None]:
eng_rag_template = """[INST]
              Please provide an precise and concise answer to the engineer's question below based on the context information provided.\n\n
              Below is a context:\n{context}\n
              Below is a question:\n{question}\n
              Below are answer instructions:
- Keep your answer to one paragraph.
- Answer with complete sentences or phrases. Do not use bullet points.
- Do not explicitly reference papers in your answers.
[/INST]
"""

mk_rag_template = """[INST]
              Please provide an precise and concise answer to the marketer's question below based on the context information provided.\n\n
              Below is a context:\n{context}\n
              Below is a question:\n{question}\n
              Below are answer instructions:
- Keep your answer to one paragraph.
- Your answers need to be brief and to the point. Only focus on the most important aspects.
- Answer with complete sentences or phrases. Do not use bullet points.
- Do not explicitly reference papers in your answers.
[/INST]
"""

eng_rag_prompt = ChatPromptTemplate.from_template(eng_rag_template)
mk_rag_prompt = ChatPromptTemplate.from_template(mk_rag_template)

In [None]:
eng_rag_chain = build_RAG_prompt_chain(eng_rag_template, llm_model, retriever, format_docs)
mk_rag_chain = build_RAG_prompt_chain(mk_rag_template, llm_model, retriever, format_docs)

metrics = ['rouge', 'bleu', 'bertscore']

rag_chains = {"engineering": eng_rag_chain, "marketing": mk_rag_chain}

dpt_prompt = evaluate(metrics, validation_questions_answers, rag_chains, iterations=10, verbose=False, dept_specific=True, sleep=False, print_results=True)
dpt_prompt = composite_evaluation(dpt_prompt)

------------------------------------------------------------
Question: What purpose do large language models serve in the field of natural language processing?
 
Engineering Prediction: Large language models serve as the foundation for many natural language processing applications, providing a versatile and adaptable tool for a wide range of tasks. These models are pre-trained on vast amounts of text data, enabling them to understand and generate human-like language. Their primary purpose is to act as a flexible language-understanding system that can be fine-tuned for specific tasks, such as classification, summarization, question-answering, and creative writing. This adaptability allows large language models to be applied to diverse domains and has led to their widespread adoption in the field of natural language processing.
Engineering Answer: Large language models (LLMs) serve the purpose of enabling general-purpose language generation and other natural language processing tasks suc

In [None]:
dpt_prompt.describe()

Unnamed: 0,Sample,eng_bleu,mk_bleu,eng_rouge,mk_rouge,eng_f1,mk_f1,composite_eng,composite_mk,composite_total
count,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0
mean,6.6,0.094416,0.062217,0.272868,0.218768,0.885499,0.876008,0.543493,0.516078,0.532527
std,4.788876,0.06824,0.075487,0.081954,0.093901,0.015445,0.015154,0.040963,0.048777,0.036278
min,0.0,0.0,0.0,0.189055,0.064516,0.867752,0.852505,0.497641,0.445607,0.488332
25%,2.25,0.043476,0.0,0.207168,0.160839,0.873349,0.865376,0.50919,0.483244,0.505423
50%,7.5,0.097114,0.046431,0.245907,0.192946,0.881872,0.873887,0.539244,0.500279,0.521146
75%,10.5,0.152868,0.083261,0.346076,0.296894,0.898585,0.886098,0.579225,0.54685,0.56395
max,13.0,0.187549,0.208805,0.411215,0.369748,0.908937,0.898655,0.597204,0.602013,0.59095


These additional instructions improved the answers and metrics for both engineering and marketing. However, the marketing answers are still too long, and as a result too detailed and verbose.

**Third Pass:** We will further explore prompting to improve the marketing answers.

In [None]:
eng_rag_template = """[INST]
              Please provide an precise and concise answer to the engineer's question below based on the context information provided.\n\n
              Below is a context:\n{context}\n
              Below is a question:\n{question}\n
              Below are answer instructions:
- Keep your answer to one paragraph.
- Answer with complete sentences or phrases. Do not use bullet points.
- Do not explicitly reference papers in your answer.
[/INST]
"""

mk_rag_template = """[INST]
              Please provide an precise and concise answer to the marketer's question below based on the context information provided.\n\n
              Below is a context:\n{context}\n
              Below is a question:\n{question}\n
              Below are answer instructions:
- Tailor your answer for someone in marketing. Keep your answer high-level.
- Keep your answer to one paragraph.
- Your answer need to be brief and to the point. Only focus on the most important aspects.
- Answer with complete sentences or phrases. Do not use bullet points.
- Do not explicitly reference papers in your answer.
[/INST]
"""

eng_rag_prompt = ChatPromptTemplate.from_template(eng_rag_template)
mk_rag_prompt = ChatPromptTemplate.from_template(mk_rag_template)

In [None]:
eng_rag_chain = build_RAG_prompt_chain(eng_rag_template, llm_model, retriever, format_docs)
mk_rag_chain = build_RAG_prompt_chain(mk_rag_template, llm_model, retriever, format_docs)

metrics = ['rouge', 'bleu', 'bertscore']

rag_chains = {"engineering": eng_rag_chain, "marketing": mk_rag_chain}

dpt_prompt = evaluate(metrics, validation_questions_answers, rag_chains, iterations=10, verbose=False, dept_specific=True, sleep=False, print_results=True)
dpt_prompt = composite_evaluation(dpt_prompt)

------------------------------------------------------------
Question: What purpose do large language models serve in the field of natural language processing?
 
Engineering Prediction: Large language models serve as the foundation for many natural language processing applications. These models are pre-trained on vast amounts of text data and can be adapted to various downstream tasks through fine-tuning. They excel at predicting the next word in a sequence, which forms the basis for more complex language tasks such as classification, summarization, question-answering, and creative writing. The ability to leverage large language models in this way has revolutionized the field of natural language processing, enabling the development of more sophisticated and versatile language-based applications.
Engineering Answer: Large language models (LLMs) serve the purpose of enabling general-purpose language generation and other natural language processing tasks such as classification. They achie

In [None]:
dpt_prompt.describe()

Unnamed: 0,Sample,eng_bleu,mk_bleu,eng_rouge,mk_rouge,eng_f1,mk_f1,composite_eng,composite_mk,composite_total
count,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0
mean,6.6,0.091181,0.024989,0.259,0.175901,0.883504,0.867653,0.537688,0.491595,0.519251
std,4.788876,0.073101,0.032995,0.081297,0.058967,0.015155,0.008976,0.043316,0.026986,0.027061
min,0.0,0.0,0.0,0.164835,0.072727,0.865019,0.853006,0.483224,0.448321,0.473266
25%,2.25,0.014269,0.0,0.19225,0.140231,0.874698,0.86117,0.49782,0.471076,0.49822
50%,7.5,0.110333,0.0,0.242978,0.192685,0.879172,0.87024,0.540679,0.494379,0.51949
75%,10.5,0.149372,0.052693,0.306723,0.209895,0.896641,0.874439,0.569333,0.509681,0.543573
max,13.0,0.173959,0.076532,0.421053,0.253968,0.908103,0.879368,0.60977,0.527218,0.553119


At this point we seem to be getting diminishing returns from the additional instructions provided to the model. Because these additional instructions added no value and lowered our evaluation metrics (mean composite_total fell from 0.533 to 0.519), we will revert to the previous prompt.
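To make the comparison concrete, the composite means in the two `describe()` tables can be recomputed from the component means. This is only a minimal sketch: the weights below (0.5 for BERTScore F1, 0.3 for ROUGE, 0.2 for BLEU, and a 0.6/0.4 engineering/marketing split) are an assumption inferred because they reproduce the reported means; they are not copied from `composite_evaluation`, which is defined elsewhere in the notebook.

```python
# Assumed weights: inferred from the reported means, NOT taken from composite_evaluation().
METRIC_WEIGHTS = {"f1": 0.5, "rouge": 0.3, "bleu": 0.2}
DEPT_WEIGHTS = {"eng": 0.6, "mk": 0.4}


def composite(bleu, rouge, f1):
    """Weighted blend of the three metric means for one department."""
    return (METRIC_WEIGHTS["f1"] * f1
            + METRIC_WEIGHTS["rouge"] * rouge
            + METRIC_WEIGHTS["bleu"] * bleu)


def composite_total(eng, mk):
    """Blend the engineering and marketing composites into one score."""
    return DEPT_WEIGHTS["eng"] * eng + DEPT_WEIGHTS["mk"] * mk


# Mean (bleu, rouge, f1) values copied from the describe() tables
# of the previous pass and the third pass.
previous_pass = {"eng": (0.094416, 0.272868, 0.885499), "mk": (0.062217, 0.218768, 0.876008)}
third_pass = {"eng": (0.091181, 0.259, 0.883504), "mk": (0.024989, 0.175901, 0.867653)}

for name, run in [("previous pass", previous_pass), ("third pass", third_pass)]:
    eng = composite(*run["eng"])
    mk = composite(*run["mk"])
    print(f"{name}: eng={eng:.4f} mk={mk:.4f} total={composite_total(eng, mk):.4f}")
```

Under these assumed weights, most of the drop comes from the marketing side, where BLEU and ROUGE both fell.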

As a final prompt experiment, we explore the opposite extreme: giving the model extensive instructions and direction on what to focus on, to see how much the results change.

##### **Detailed instructions by department**

**First Pass:** Below we explore the extreme case of detailed instructions for the model to adhere to in developing its answer. Because we are struggling to improve the marketing responses, we focus on those specifically.

In [None]:
eng_rag_template = """[INST]
              Please provide an precise and concise answer to the engineer's question below based on the context information provided.\n\n
              Below is a context:\n{context}\n
              Below is a question:\n{question}\n
              Below are answer instructions:
- Formatting: Answers need to be in paragraph format. Do not use bullet points. Do not explicitly reference papers in your answer.
- Comprehensive Explanations: Provide a detailed and comprehensive explanation of the topic at hand, reflecting a depth of understanding and research. Focus on covering the subject matter thoroughly.
- Technical Detail: Include technical details and terminologies that relate to the question.
- Research Focus: Orient answers towards the research aspects of the questions.
- Objective Tone: Maintain an objective and informative tone, aiming to educate the reader without persuasive language.
- Forward-Looking Insights: Include insights into the future direction of the topic at hand.
- Addressing Broad Implications: Go beyond the technical details to address broader implications, including ethical considerations, societal impacts, and practical applications.
[/INST]
"""

mk_rag_template = """[INST]
              Please provide a precise and concise answer to the marketer's question below based on the context provided.\n\n
              Below is a context:\n{context}\n
              Below is a question:\n{question}\n
              Below are answer instructions:
- Formatting: Answers need to be in paragraph format. Do not use bullet points. Do not explicitly reference papers in your answer.
- Succinctness: Make sure your answer is concise and to the point. Provide essential information without delving into the technical depth. Answer in the least amount of sentences possible. Answer with a single sentence where possible.
- Broad Overview: Give a broad overview of the topic, focusing on the practical applications and implications. Aim to communicate the value and utility for a wider audience.
- Focus on Applications: Emphasize the applications of the topic at hand. Focus on real-world uses and benefits and highlight how technology can solve problems or create opportunities.
- Accessibility: The language and presentation of information needs to be accessible to lay audiences. This means less jargon and a more straightforward explanation, making it easier for individuals without a technical background to understand.
- Promotional Tone: While still informative, the answer should have a subtly promotional tone, aiming to generate interest or enthusiasm for LLMs and their potential.
- User-Oriented: Make the answer user-oriented, considering the interests and needs of potential users or stakeholders. Frame the information in a way that relates directly to how individuals or organizations might use or benefit from LLMs.
[/INST]
"""

eng_rag_prompt = ChatPromptTemplate.from_template(eng_rag_template)
mk_rag_prompt = ChatPromptTemplate.from_template(mk_rag_template)

In [None]:
eng_rag_chain = build_RAG_prompt_chain(eng_rag_template, llm_model, retriever, format_docs)
mk_rag_chain = build_RAG_prompt_chain(mk_rag_template, llm_model, retriever, format_docs)

metrics = ['rouge', 'bleu', 'bertscore']

rag_chains = {"engineering": eng_rag_chain, "marketing": mk_rag_chain}

dpt_prompt = evaluate(metrics, validation_questions_answers, rag_chains, iterations=10, verbose=False, dept_specific=True, print_results=True)
dpt_prompt = composite_evaluation(dpt_prompt)

------------------------------------------------------------
Question: What purpose do large language models serve in the field of natural language processing?
 
Engineering Prediction: Large language models serve as the foundation for advanced natural language processing (NLP) applications, offering a versatile and powerful tool for a wide range of tasks. These models are designed to understand and generate human language, capturing the intricacies of language structure and semantics. With their ability to process vast amounts of text data, they have revolutionized NLP research and development. 

At their core, large language models are trained on massive text corpora, learning to predict the next word in a sequence, which is known as the next word prediction objective. This foundational capability enables the models to grasp language patterns, syntax, and context, forming the basis for more complex language understanding and generation tasks. 

The versatility of these models lies in

In [None]:
dpt_prompt.describe()

Unnamed: 0,Sample,eng_bleu,mk_bleu,eng_rouge,mk_rouge,eng_f1,mk_f1,composite_eng,composite_mk,composite_total
count,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0
mean,6.6,0.022457,0.022664,0.163657,0.163158,0.857672,0.864816,0.482425,0.485888,0.48381
std,4.788876,0.022514,0.031799,0.027217,0.058593,0.01122,0.013611,0.016665,0.026859,0.012091
min,0.0,0.0,0.0,0.13308,0.053333,0.846499,0.85208,0.465014,0.443277,0.464387
25%,2.25,0.0,0.0,0.138702,0.130832,0.849387,0.854787,0.468065,0.467576,0.477666
50%,7.5,0.024847,0.0,0.166814,0.160544,0.85414,0.860159,0.480602,0.484277,0.486436
75%,10.5,0.033557,0.043609,0.18021,0.184439,0.864253,0.870092,0.491857,0.502284,0.489412
max,13.0,0.064525,0.076333,0.215517,0.26087,0.881512,0.890612,0.518316,0.527233,0.505513


The first thing that stands out is that the engineering answers no longer abide by the single-paragraph requirement. Overall, it also appears that an excessive number of instructions causes the model to produce very lengthy, very detailed responses, likely in an attempt to cover every requirement.
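A check like this can be automated rather than spotted by eye. The sketch below is a hypothetical helper, not part of the notebook's `evaluate` function: it counts paragraphs and words in a prediction so that multi-paragraph or overly long answers can be flagged across a whole run.

```python
import re


def paragraph_count(text):
    """Count blocks of text separated by one or more blank lines."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return sum(1 for block in blocks if block.strip())


def word_count(text):
    """Count whitespace-separated tokens."""
    return len(text.split())


def violates_single_paragraph(text):
    """True when an answer spans more than one paragraph."""
    return paragraph_count(text) > 1


# Hypothetical, shortened stand-in for a multi-paragraph prediction.
prediction = (
    "Large language models serve as the foundation for advanced NLP applications.\n"
    "\n"
    "At their core, they are trained on massive text corpora."
)
print(paragraph_count(prediction), word_count(prediction), violates_single_paragraph(prediction))
```

Applied to every prediction in a run, the share of answers that violate the format could be reported alongside the similarity metrics.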

**Second Pass:** We now see if we can reduce some of these instructions and counteract some of the decay we are seeing in the responses.

In [None]:
eng_rag_template = """[INST]
              Please provide an precise and concise answer to the engineer's question below based on the context information provided.\n\n
              Below is a context:\n{context}\n
              Below is a question:\n{question}\n
              Below are answer instructions in order of importance:
- Formatting: Provide a concise, single-paragraph answer that uses the fewest words necessary to fully address the question. Do not use bullet points. Do not explicitly reference papers in your answer.
- Technical Detail: Include technical details and terminologies that relate to the question.
- Research Focus: Orient answers towards the research aspects of the questions.
- Objective Tone: Maintain an objective and informative tone, aiming to educate the reader without persuasive language.
[/INST]
"""

mk_rag_template = """[INST]
              Please provide a precise and concise answer to the marketer's question below based on the context provided.\n\n
              Below is a context:\n{context}\n
              Below is a question:\n{question}\n
              Below are answer instructions in order of importance:
- Formatting: Provide a concise, single-paragraph answer that uses the fewest words necessary to fully address the question. Do not use bullet points. Do not explicitly reference papers in your answer.
- Succinctness: Make sure your answer is concise and to the point. Provide essential information without delving into the technical depth.
- Broad Overview: Give a broad overview of the topic, focusing on the practical applications and implications. Aim to communicate the value and utility for a wider audience.
- Focus on Applications: Emphasize the applications of the topic at hand. Focus on real-world uses and benefits and highlight how technology can solve problems or create opportunities.
[/INST]
"""

eng_rag_prompt = ChatPromptTemplate.from_template(eng_rag_template)
mk_rag_prompt = ChatPromptTemplate.from_template(mk_rag_template)

In [None]:
eng_rag_chain = build_RAG_prompt_chain(eng_rag_template, llm_model, retriever, format_docs)
mk_rag_chain = build_RAG_prompt_chain(mk_rag_template, llm_model, retriever, format_docs)

metrics = ['rouge', 'bleu', 'bertscore']

rag_chains = {"engineering": eng_rag_chain, "marketing": mk_rag_chain}

dpt_prompt = evaluate(metrics, validation_questions_answers, rag_chains, iterations=10, verbose=False, dept_specific=True, print_results=True)
dpt_prompt = composite_evaluation(dpt_prompt)

------------------------------------------------------------
Question: What purpose do large language models serve in the field of natural language processing?
 
Engineering Prediction: Large language models serve as foundational components in natural language processing (NLP), offering a versatile framework adaptable to diverse downstream tasks. These models, through extensive training on vast text corpora, master the intricacies of language, enabling them to excel across various NLP applications, including classification, summarization, question-answering, and creative content generation. The adaptability of these models empowers their integration into specialized domains, making them invaluable tools for researchers and practitioners in the field of NLP.
Engineering Answer: Large language models (LLMs) serve the purpose of enabling general-purpose language generation and other natural language processing tasks such as classification. They achieve this by learning statistical relatio

In [None]:
dpt_prompt.describe()

Unnamed: 0,Sample,eng_bleu,mk_bleu,eng_rouge,mk_rouge,eng_f1,mk_f1,composite_eng,composite_mk,composite_total
count,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0
mean,6.6,0.076019,0.035172,0.281328,0.186434,0.887596,0.870335,0.5434,0.498132,0.525293
std,4.788876,0.066993,0.048409,0.094311,0.060871,0.017845,0.012767,0.047071,0.032103,0.029187
min,0.0,0.0,0.0,0.195122,0.097561,0.867403,0.848807,0.49423,0.45728,0.485284
25%,2.25,0.00893,0.0,0.22549,0.137547,0.874111,0.861171,0.508116,0.472447,0.500436
50%,7.5,0.066554,0.0,0.245545,0.195209,0.883182,0.868754,0.534388,0.498154,0.525159
75%,10.5,0.142629,0.070567,0.302726,0.240754,0.896885,0.882575,0.562151,0.524658,0.537392
max,13.0,0.164778,0.114861,0.511111,0.265306,0.922094,0.887209,0.643705,0.545109,0.569876


This less verbose prompting greatly improved the output for both types of answers, and we are getting close to homing in on finalized prompts for our model. If anything, the engineering answers are now too concise and could benefit from additional detail, while the marketing prompt is still too verbose and could be shortened while still conveying the same requirements.

**Third Pass:** We will perform some additional tweaking to the marketing prompt to see if we can get better results.

In [None]:
eng_rag_template = """[INST]
              Please provide an precise and concise answer to the engineer's question below based on the context information provided.\n\n
              Below is a context:\n{context}\n
              Below is a question:\n{question}\n
              Below are answer instructions in order of importance:
- Formatting: Provide a succint, single-paragraph answer. Do not use bullet points. Do not explicitly reference papers in your answer.
- Technical Detail: Include technical details and terminologies that relate to the question.
- Research Focus: Orient answers towards the research aspects of the questions.
- Objective Tone: Maintain an objective and informative tone, aiming to educate the reader without persuasive language.
[/INST]
"""

mk_rag_template = """[INST]
              Please provide a precise and concise answer to the marketer's question below based on the context provided.\n\n
              Below is a context:\n{context}\n
              Below is a question:\n{question}\n
              Below are answer instructions in order of importance:
- Formatting: Provide a concise, single-paragraph answer that uses the fewest words necessary to fully address the question. Answer in a single sentence or phrase if you can. Do not use bullet points. Do not explicitly reference papers in your answer.
- Succinctness: Make sure your answer is concise and to the point. Provide only the essential information without delving into the technical depth.
- Broad Overview: Give a broad overview of the topic. Provide only the essential information without delving into the technical depth.
- Focus on Applications: EFocus on real-world uses and benefits and highlight how technology can solve problems or create opportunities.
[/INST]
"""

eng_rag_prompt = ChatPromptTemplate.from_template(eng_rag_template)
mk_rag_prompt = ChatPromptTemplate.from_template(mk_rag_template)

In [None]:
eng_rag_chain = build_RAG_prompt_chain(eng_rag_template, llm_model, retriever, format_docs)
mk_rag_chain = build_RAG_prompt_chain(mk_rag_template, llm_model, retriever, format_docs)

metrics = ['rouge', 'bleu', 'bertscore']

rag_chains = {"engineering": eng_rag_chain, "marketing": mk_rag_chain}

dpt_prompt = evaluate(metrics, validation_questions_answers, rag_chains, iterations=10, verbose=False, dept_specific=True, print_results=True)
dpt_prompt = composite_evaluation(dpt_prompt)

------------------------------------------------------------
Question: What purpose do large language models serve in the field of natural language processing?
 
Engineering Prediction: Large language models serve as foundational components in natural language processing (NLP), offering a versatile framework adaptable to diverse downstream tasks. These models, through extensive training on vast text corpora, master the intricacies of language, including syntax, semantics, and context. This enables them to excel at tasks like classification, summarization, question-answering, and creative content generation. The key advantage of large language models lies in their ability to be fine-tuned for specific tasks, leveraging their inherent language understanding to quickly learn and excel in specialized domains, thus forming the backbone of modern NLP research and applications.
Engineering Answer: Large language models (LLMs) serve the purpose of enabling general-purpose language generation a

In [None]:
dpt_prompt.describe()

Unnamed: 0,Sample,eng_bleu,mk_bleu,eng_rouge,mk_rouge,eng_f1,mk_f1,composite_eng,composite_mk,composite_total
count,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0
mean,6.6,0.05463,0.082257,0.248699,0.253923,0.879226,0.882839,0.525149,0.534048,0.528708
std,4.788876,0.055017,0.091525,0.073032,0.074496,0.017931,0.012112,0.036096,0.044037,0.026515
min,0.0,0.0,0.0,0.174242,0.123077,0.853045,0.865691,0.489071,0.47164,0.493352
25%,2.25,0.0,0.0,0.204674,0.219192,0.871204,0.874695,0.500971,0.503105,0.51248
50%,7.5,0.047056,0.064106,0.219867,0.250355,0.875027,0.881328,0.507349,0.531632,0.527421
75%,10.5,0.098188,0.153673,0.277629,0.301277,0.892838,0.891498,0.551115,0.563969,0.535807
max,13.0,0.141085,0.235765,0.425926,0.354839,0.904271,0.904543,0.600129,0.605876,0.577518


After a thorough review of the output from our latest prompts, we are happy with these results. Since this is a POC, we are going to stop here. Should a decision be made to pursue a production version, we would recommend additional prompt experimentation to maximize the accuracy of the outputs.

### Validation Set

With our POC model nearly tuned, we now test it against our validation questions and answers. Recall that we used the first ten examples for training, will use the next 30 for validation, and keep the remaining 30 as our hold-out set.

In [None]:
# Sort the question IDs (the dictionary keys)
sorted_items = sorted(validation_questions_answers.keys())

# The first 10 IDs were used for prompt tuning; the next 30 form the validation set
# and the remainder is the hold-out (test) set
val_range = sorted_items[10:40]
test_range = sorted_items[40:]

# validation dictionary
validation_dict = {key: validation_questions_answers[key] for key in val_range}
# test dictionary
test_dict = {key: validation_questions_answers[key] for key in test_range}

print(validation_dict)

{16: {'question': 'What factors influenced the development of generative language models by Anthropic?', 'gold_answer_research': "Several factors influenced the development of generative language models by Anthropic, including the limitations in coding, math, and reasoning capabilities of the initial version Claude, the partnerships with companies like Notion and Quora to enhance the model's capabilities, and the need to address biases, unsafe content, and ethical considerations in training data. Additionally, the reliance on supervised learning and the need for controlled generation in generative models played a role in shaping the development of Anthropic's language models.", 'gold_answer_marketing': 'Factors that influenced the development of generative language models by Anthropic include partnerships with companies like Notion and Quora, limitations in coding, math, and reasoning capabilities in initial models like Claude, and the need to address biases and unsafe content in train

In [None]:
%%capture

embedding_model = 'all-MiniLM-L6-v2'
chunk_size = 256
chunk_overlap = 16
llm = "cohere"
retreiver_search_type = "mmr"
retreiver_k = 6
eng_rag_template = """[INST]
              Please provide an precise and concise answer to the engineer's question below based on the context information provided.\n\n
              Below is a context:\n{context}\n
              Below is a question:\n{question}\n
              Below are answer instructions in order of importance:
- Formatting: Provide a succint, single-paragraph answer. Do not use bullet points. Do not explicitly reference papers in your answer.
- Technical Detail: Include technical details and terminologies that relate to the question.
- Research Focus: Orient answers towards the research aspects of the questions.
- Objective Tone: Maintain an objective and informative tone, aiming to educate the reader without persuasive language.
[/INST]
"""

mk_rag_template = """[INST]
              Please provide a precise and concise answer to the marketer's question below based on the context provided.\n\n
              Below is a context:\n{context}\n
              Below is a question:\n{question}\n
              Below are answer instructions in order of importance:
- Formatting: Provide a concise, single-paragraph answer that uses the fewest words necessary to fully address the question. Answer in a single sentence or phrase if you can. Do not use bullet points. Do not explicitly reference papers in your answer.
- Succinctness: Make sure your answer is concise and to the point. Provide only the essential information without delving into the technical depth.
- Broad Overview: Give a broad overview of the topic. Provide only the essential information without delving into the technical depth.
- Focus on Applications: EFocus on real-world uses and benefits and highlight how technology can solve problems or create opportunities.
[/INST]
"""

splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
base_embeddings, text_splitter, qdrant_vectorstore, retriever = build_embedding_splitter_vectorstore(embedding_model, splitter, retr_search_type=retreiver_search_type, retr_k=retreiver_k)
qdrant_vectorstore = vectorize_documents(text_splitter, qdrant_vectorstore)
llm_model = load_llm(llm)
eng_rag_prompt = ChatPromptTemplate.from_template(eng_rag_template)
mk_rag_prompt = ChatPromptTemplate.from_template(mk_rag_template)
eng_rag_chain = build_RAG_prompt_chain(eng_rag_template, llm_model, retriever, format_docs)
mk_rag_chain = build_RAG_prompt_chain(mk_rag_template, llm_model, retriever, format_docs)
rag_chains = {"engineering": eng_rag_chain, "marketing": mk_rag_chain}
metrics = ['rouge', 'bleu', 'bertscore']

In [None]:
val_eval = evaluate(metrics, validation_dict, rag_chains, iterations=0, verbose=False, dept_specific=True, print_results=True)
val_eval = composite_evaluation(val_eval)

------------------------------------------------------------
Question: What factors influenced the development of generative language models by Anthropic?
 
Engineering Prediction: The development of generative language models by Anthropic was influenced by advancements in transformer architectures, such as the attention mechanism described in the paper "Attention Is All You Need," which paved the way for large language models like BERT. Additionally, the emergence of techniques for training generative models in other modalities and the steering of language models using additional language models, as referenced in the context, likely played a role in informing Anthropic's research direction.
Engineering Answer: Several factors influenced the development of generative language models by Anthropic, including the limitations in coding, math, and reasoning capabilities of the initial version Claude, the partnerships with companies like Notion and Quora to enhance the model's capabilities, 

In [None]:
val_eval.describe()

Unnamed: 0,Sample,eng_bleu,mk_bleu,eng_rouge,mk_rouge,eng_f1,mk_f1,composite_eng,composite_mk,composite_total
count,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0
mean,35.333333,0.088765,0.120483,0.257284,0.327426,0.881109,0.895641,0.535493,0.570144,0.549353
std,12.318568,0.075619,0.11161,0.082174,0.130646,0.017558,0.018309,0.046027,0.066699,0.046993
min,16.0,0.0,0.0,0.124031,0.08,0.853427,0.86354,0.469208,0.45577,0.470262
25%,24.25,0.041496,0.049608,0.215513,0.241538,0.867283,0.879976,0.50258,0.523488,0.52103
50%,35.5,0.077352,0.103649,0.24107,0.316435,0.880387,0.895908,0.528582,0.56164,0.550129
75%,45.75,0.14363,0.168739,0.295682,0.387278,0.893217,0.90785,0.563955,0.601913,0.562159
max,55.0,0.259686,0.4497,0.45614,0.758621,0.920762,0.941401,0.638634,0.776219,0.686209


Surprisingly, from a metric standpoint, we now score noticeably better on marketing questions than on engineering questions (mean composite of 0.570 vs. 0.535). Both departments also scored higher on the validation set than on the training set, which suggests that we have not overfit our prompts to the original questions we tuned on.

However, a quick review of the answers being returned highlights some issues. Below is an example:

 - Question: How has the token handling capacity changed between different versions of the Claude model?
 - The predicted answers indicate that the capacity is limited to 2,048 tokens, which they describe as a decrease from earlier versions.
 - The target answers indicate that the capacity has continuously increased and now tops out at 1 million tokens.

And another example:

 - Question: What benchmark did Chinchilla achieve an average accuracy of 67.5% on?
 - One predicted answer states WebSource Corpus benchmark and the other Vicuna benchmark.
 - Both target answers indicate the MMLU benchmark.

The inaccuracy of these results is concerning. Unfortunately, we have already dedicated significant effort and resources to this model and have reached our cap. We will note these issues in our write-up.
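Similarity metrics such as BERTScore, ROUGE, and BLEU can score an answer well even when one specific fact is wrong, so a simple factual spot check may be a useful complement. The sketch below is a hypothetical helper, not part of the notebook: it assumes a hand-picked list of key terms per question (for example, the benchmark name in the gold answer) and reports which ones are missing from a prediction. The example predictions are paraphrases written for illustration, not actual model output.

```python
def missing_key_terms(prediction, key_terms):
    """Return the key terms that do not appear in the prediction (case-insensitive)."""
    lowered = prediction.lower()
    return [term for term in key_terms if term.lower() not in lowered]


# Hypothetical predictions paraphrasing the errors noted above; key terms chosen by hand.
checks = [
    ("Chinchilla achieved 67.5% on the Vicuna benchmark.", ["MMLU"]),
    ("Chinchilla reached 67.5% average accuracy on MMLU.", ["MMLU"]),
]
for prediction, terms in checks:
    missing = missing_key_terms(prediction, terms)
    print("PASS" if not missing else f"MISSING {missing}", "|", prediction)
```

Substring matching is deliberately crude; it only catches answers that omit a required term, not answers that mention it in a wrong context.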

##### **Curiosity on Embedding Model**
At the very beginning of our process we leaned on our evaluation metric and chose the 'all-MiniLM-L6-v2' embedding model because it performed the best. Because our research indicated that this model is designed for compute efficiency, which can come at the expense of performance, we are curious how our tuned model would perform if we switched to the 'multi-qa-mpnet-base-dot-v1' embedding model. We check that now on our validation set.

In [None]:
%%capture

embedding_model = 'multi-qa-mpnet-base-dot-v1'
chunk_size = 256
chunk_overlap = 16
llm = "cohere"
retreiver_search_type = "mmr"
retreiver_k = 6
eng_rag_template = """[INST]
              Please provide an precise and concise answer to the engineer's question below based on the context information provided.\n\n
              Below is a context:\n{context}\n
              Below is a question:\n{question}\n
              Below are answer instructions in order of importance:
- Formatting: Provide a succint, single-paragraph answer. Do not use bullet points. Do not explicitly reference papers in your answer.
- Technical Detail: Include technical details and terminologies that relate to the question.
- Research Focus: Orient answers towards the research aspects of the questions.
- Objective Tone: Maintain an objective and informative tone, aiming to educate the reader without persuasive language.
[/INST]
"""

mk_rag_template = """[INST]
              Please provide a precise and concise answer to the marketer's question below based on the context provided.\n\n
              Below is a context:\n{context}\n
              Below is a question:\n{question}\n
              Below are answer instructions in order of importance:
- Formatting: Provide a concise, single-paragraph answer that uses the fewest words necessary to fully address the question. Answer in a single sentence or phrase if you can. Do not use bullet points. Do not explicitly reference papers in your answer.
- Succinctness: Make sure your answer is concise and to the point. Provide only the essential information without delving into the technical depth.
- Broad Overview: Give a broad overview of the topic. Provide only the essential information without delving into the technical depth.
- Focus on Applications: EFocus on real-world uses and benefits and highlight how technology can solve problems or create opportunities.
[/INST]
"""

splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
base_embeddings, text_splitter, qdrant_vectorstore, retriever = build_embedding_splitter_vectorstore(embedding_model, splitter, retr_search_type=retreiver_search_type, retr_k=retreiver_k)
qdrant_vectorstore = vectorize_documents(text_splitter, qdrant_vectorstore)
llm_model = load_llm(llm)
eng_rag_prompt = ChatPromptTemplate.from_template(eng_rag_template)
mk_rag_prompt = ChatPromptTemplate.from_template(mk_rag_template)
eng_rag_chain = build_RAG_prompt_chain(eng_rag_template, llm_model, retriever, format_docs)
mk_rag_chain = build_RAG_prompt_chain(mk_rag_template, llm_model, retriever, format_docs)
rag_chains = {"engineering": eng_rag_chain, "marketing": mk_rag_chain}
metrics = ['rouge', 'bleu', 'bertscore']

In [None]:
val_eval = evaluate(metrics, validation_dict, rag_chains, iterations=0, verbose=False, dept_specific=True, print_results=True)
val_eval = composite_evaluation(val_eval)

------------------------------------------------------------
Question: What factors influenced the development of generative language models by Anthropic?
 
Engineering Prediction: The development of generative language models by Anthropic was influenced by advancements in transformer models and the emergence of large language models like BERT. The two-stage approach of unsupervised generative pre-training and supervised discriminative fine-tuning played a crucial role in setting initial parameters and adapting the model to specific tasks. Additionally, the limitations of traditional supervised learning methods, such as the need for well-annotated datasets and the high computational cost of training large language models, prompted the exploration of semi-supervised approaches. Furthermore, the study by Petroni et al. highlighted the importance of context retrieval in improving the performance of generative language models, influencing the techniques employed by Anthropic.
Engineering A

In [None]:
val_eval.describe()

| statistic | Sample | eng_bleu | mk_bleu | eng_rouge | mk_rouge | eng_f1 | mk_f1 | composite_eng | composite_mk | composite_total |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 |
| mean | 35.333333 | 0.107417 | 0.10268 | 0.272335 | 0.307178 | 0.885394 | 0.892369 | 0.545881 | 0.558874 | 0.551078 |
| std | 12.318568 | 0.090938 | 0.109589 | 0.091762 | 0.138567 | 0.014865 | 0.022163 | 0.049358 | 0.070004 | 0.054151 |
| min | 16.0 | 0.0 | 0.0 | 0.164557 | 0.126984 | 0.858475 | 0.85046 | 0.481927 | 0.463325 | 0.482244 |
| 25% | 24.25 | 0.059236 | 0.0 | 0.226618 | 0.220343 | 0.875913 | 0.876366 | 0.519325 | 0.509367 | 0.522835 |
| 50% | 35.5 | 0.096139 | 0.075702 | 0.253537 | 0.27178 | 0.88368 | 0.888437 | 0.533379 | 0.545674 | 0.536198 |
| 75% | 45.75 | 0.13019 | 0.16729 | 0.293094 | 0.35686 | 0.892692 | 0.903393 | 0.557593 | 0.589656 | 0.56494 |
| max | 55.0 | 0.478152 | 0.4497 | 0.647059 | 0.758621 | 0.927063 | 0.942075 | 0.742196 | 0.774845 | 0.755255 |


From a metric standpoint, Marketing and Engineering now score about evenly (mean composite of 0.559 vs. 0.546). However, as we highlighted in the previous run, there are significant concerns about the accuracy of the model's output. We review the same two questions to see what changed:

 - Question: How has the token handling capacity changed between different versions of the Claude model?
 - The predicted answers now state that the capacity has increased over time, without explicitly stating the token limits. This is a step better than what we previously got.

And for the second reviewed question:

 - Question: What benchmark did Chinchilla achieve an average accuracy of 67.5% on?
 - The predicted answers now return a different wrong benchmark, HellaSwag.

While these are slight improvements over the previous answers, there is still plenty of room for concern. Nevertheless, we will move forward with this embedding for our final evaluation.

## 5) Results

We have now come to the final selection of our model. With the selection confirmed, we evaluate on the holdout set, both for the original baseline model and for our finalized model.

#### Baseline Model on Holdout Questions

In [None]:
%%capture

# Baseline configuration: the original settings, before any tuning.
embedding_model = "multi-qa-mpnet-base-dot-v1"
chunk_size = 128
chunk_overlap = 0
splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
llm = "cohere"
rag_template = """[INST]Please answer the question below only based on the context information provided.\n\nHere is a context:\n{context} \n\nHere is a question: \n{question}.[/INST]"""

# Build the embeddings, vector store, and retriever (default retriever settings), then index the documents.
base_embeddings, text_splitter, qdrant_vectorstore, retriever = build_embedding_splitter_vectorstore(embedding_model, splitter)
qdrant_vectorstore = vectorize_documents(text_splitter, qdrant_vectorstore)
llm_model = load_llm(llm)
# The baseline uses a single chain (one prompt) for both departments.
rag_chain = build_RAG_prompt_chain(rag_template, llm_model, retriever, format_docs)
metrics = ['rouge', 'bleu', 'bertscore']

In [None]:
baseline_test_eval = evaluate(metrics, test_dict, rag_chain, iterations=0, verbose=False, dept_specific=False, print_results=True)
baseline_test_eval = composite_evaluation(baseline_test_eval)

------------------------------------------------------------
Question: Considering the structure and content of the provided text, what guidelines should be used to evaluate the effectiveness of a summary or chatbot response in this context?
 
Engineering Prediction: To evaluate the effectiveness of a summary or chatbot response in this context, the following guidelines should be considered: 

- Precision and conciseness: A good summary should capture the most important information accurately and succinctly. Irrelevant details should be omitted to maintain clarity and focus on the key points. 

- Relevance: The summary or response should directly address the topic or query. It should provide information that is pertinent to the specific issue or question raised. 

- Coherence: The structure of the summary or response should be logical and easy to follow. Ideas should be presented in a clear and organized manner, ensuring a coherent argument or narrative. 

- Completeness: While concise

In [None]:
baseline_test_eval.describe()

| statistic | Sample | eng_bleu | mk_bleu | eng_rouge | mk_rouge | eng_f1 | mk_f1 | composite_eng | composite_mk | composite_total |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 35.0 | 35.0 | 35.0 | 35.0 | 35.0 | 35.0 | 35.0 | 35.0 | 35.0 | 35.0 |
| mean | 81.314286 | 0.076826 | 0.080432 | 0.254461 | 0.232605 | 0.874972 | 0.876381 | 0.529189 | 0.524059 | 0.527137 |
| std | 13.785695 | 0.051092 | 0.105101 | 0.071135 | 0.142133 | 0.019283 | 0.032526 | 0.037121 | 0.077232 | 0.04825 |
| min | 59.0 | 0.0 | 0.0 | 0.101695 | 0.041958 | 0.844042 | 0.818897 | 0.466631 | 0.422036 | 0.45842 |
| 25% | 69.5 | 0.042885 | 0.023644 | 0.214925 | 0.127077 | 0.854528 | 0.849923 | 0.505706 | 0.468872 | 0.495613 |
| 50% | 82.0 | 0.077324 | 0.035725 | 0.244224 | 0.205128 | 0.877001 | 0.872199 | 0.532112 | 0.502946 | 0.515046 |
| 75% | 92.5 | 0.095641 | 0.100431 | 0.280654 | 0.273116 | 0.887044 | 0.896754 | 0.54599 | 0.544595 | 0.542411 |
| max | 104.0 | 0.190747 | 0.433725 | 0.447761 | 0.625 | 0.919355 | 0.959609 | 0.631146 | 0.754049 | 0.659628 |


#### Final Selected Model on Holdout Questions

In [None]:
%%capture

embedding_model = 'multi-qa-mpnet-base-dot-v1'
chunk_size = 256
chunk_overlap = 16
llm = "cohere"
retreiver_search_type = "mmr"
retreiver_k = 6
eng_rag_template = """[INST]
              Please provide an precise and concise answer to the engineer's question below based on the context information provided.\n\n
              Below is a context:\n{context}\n
              Below is a question:\n{question}\n
              Below are answer instructions in order of importance:
- Formatting: Provide a succint, single-paragraph answer. Do not use bullet points. Do not explicitly reference papers in your answer.
- Technical Detail: Include technical details and terminologies that relate to the question.
- Research Focus: Orient answers towards the research aspects of the questions.
- Objective Tone: Maintain an objective and informative tone, aiming to educate the reader without persuasive language.
[/INST]
"""

mk_rag_template = """[INST]
              Please provide a precise and concise answer to the marketer's question below based on the context provided.\n\n
              Below is a context:\n{context}\n
              Below is a question:\n{question}\n
              Below are answer instructions in order of importance:
- Formatting: Provide a concise, single-paragraph answer that uses the fewest words necessary to fully address the question. Answer in a single sentence or phrase if you can. Do not use bullet points. Do not explicitly reference papers in your answer.
- Succinctness: Make sure your answer is concise and to the point. Provide only the essential information without delving into the technical depth.
- Broad Overview: Give a broad overview of the topic. Provide only the essential information without delving into the technical depth.
- Focus on Applications: EFocus on real-world uses and benefits and highlight how technology can solve problems or create opportunities.
[/INST]
"""

splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
base_embeddings, text_splitter, qdrant_vectorstore, retriever = build_embedding_splitter_vectorstore(embedding_model, splitter, retr_search_type=retreiver_search_type, retr_k=retreiver_k)
qdrant_vectorstore = vectorize_documents(text_splitter, qdrant_vectorstore)
llm_model = load_llm(llm)
eng_rag_prompt = ChatPromptTemplate.from_template(eng_rag_template)
mk_rag_prompt = ChatPromptTemplate.from_template(mk_rag_template)
eng_rag_chain = build_RAG_prompt_chain(eng_rag_template, llm_model, retriever, format_docs)
mk_rag_chain = build_RAG_prompt_chain(mk_rag_template, llm_model, retriever, format_docs)
rag_chains = {"engineering": eng_rag_chain, "marketing": mk_rag_chain}
metrics = ['rouge', 'bleu', 'bertscore']

In [None]:
final_test_eval = evaluate(metrics, test_dict, rag_chains, iterations=0, verbose=False, dept_specific=True, print_results=True)
final_test_eval = composite_evaluation(final_test_eval)

------------------------------------------------------------
Question: Considering the structure and content of the provided text, what guidelines should be used to evaluate the effectiveness of a summary or chatbot response in this context?
 
Engineering Prediction: To evaluate the effectiveness of a summary or chatbot response in this context, one should consider the following guidelines: a concise and precise comparison of the two responses should be made, highlighting the differences in structure, content, and effectiveness in addressing the user query. The evaluation should also consider the relevance and faithfulness of the responses to the original text, ensuring that important points are not omitted or distorted. Finally, the evaluation should take into account the research focus, assessing whether the responses provide insightful and accurate summaries that reflect an understanding of the technical aspects discussed in the text, such as fine-tuning LMs using PPO on human prefe

In [None]:
final_test_eval.describe()

| statistic | Sample | eng_bleu | mk_bleu | eng_rouge | mk_rouge | eng_f1 | mk_f1 | composite_eng | composite_mk | composite_total |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 35.0 | 35.0 | 35.0 | 35.0 | 35.0 | 35.0 | 35.0 | 35.0 | 35.0 | 35.0 |
| mean | 81.314286 | 0.102436 | 0.122358 | 0.266928 | 0.337275 | 0.887978 | 0.903985 | 0.544555 | 0.577647 | 0.557791 |
| std | 13.785695 | 0.069939 | 0.11538 | 0.063827 | 0.135472 | 0.015008 | 0.026057 | 0.036596 | 0.073995 | 0.04374 |
| min | 59.0 | 0.0 | 0.0 | 0.148148 | 0.071429 | 0.858473 | 0.844432 | 0.473681 | 0.445593 | 0.480705 |
| 25% | 69.5 | 0.070237 | 0.0 | 0.223808 | 0.265046 | 0.8769 | 0.887983 | 0.521151 | 0.527932 | 0.530832 |
| 50% | 82.0 | 0.102122 | 0.109031 | 0.259259 | 0.319149 | 0.889604 | 0.900319 | 0.543725 | 0.566947 | 0.552785 |
| 75% | 92.5 | 0.146574 | 0.203389 | 0.307961 | 0.416747 | 0.898214 | 0.918959 | 0.570543 | 0.614744 | 0.579747 |
| max | 104.0 | 0.289633 | 0.388273 | 0.410256 | 0.64 | 0.921361 | 0.955494 | 0.635767 | 0.737396 | 0.676418 |
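
To put the two holdout runs side by side, the short sketch below computes the absolute and relative change in the mean composite scores. It is only an illustration: the helper name `improvement` is hypothetical, and the numbers are the `mean` rows copied from the baseline and final `describe()` outputs above.

```python
# Mean composite scores on the holdout set, copied from the two describe() outputs.
baseline_mean = {"composite_eng": 0.529189, "composite_mk": 0.524059, "composite_total": 0.527137}
final_mean = {"composite_eng": 0.544555, "composite_mk": 0.577647, "composite_total": 0.557791}


def improvement(before, after):
    """Return (absolute gain, relative gain) going from `before` to `after`."""
    gain = after - before
    return gain, gain / before


for name in baseline_mean:
    gain, rel = improvement(baseline_mean[name], final_mean[name])
    print(f"{name}: {gain:+.4f} ({rel:+.1%})")
```

On these means, the final model gains roughly 0.015 (about 2.9%) for engineering, 0.054 (about 10.2%) for marketing, and 0.031 (about 5.8%) on the joint composite, so most of the improvement comes from the marketing responses.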




### 5.1) Model Specifications

Document the detailed specs of your choices. Also comment on how you valued the needs of the marketing team vs the needs of the researchers, in case you had to make a trade-off.


To reiterate from the beginning of this notebook, below are the parameters that were explored and the values that were selected for the final model:

**Embedding Models**
 - 'multi-qa-mpnet-base-dot-v1'

**Splitter Chunk Parameters**
 - 'CHUNK_SIZE' = 256
 - 'OVERLAP' = 16

**Retriever Parameters**
 - 'k' = 6 (Num of chunks)
 - search_type = 'mmr' (Type of retriever search)

**LLM Model**
 - Cohere

**RAG Prompt Template(s)**

    **For Engineering**
<blockquote>


              Please provide an precise and concise answer to the engineer's question below based on the context information provided.
              
              Below is a context:
        {context}
              Below is a question:
        {question}
              Below are answer instructions in order of importance:
        - Formatting: Provide a succint, single-paragraph answer. Do not use bullet points. Do not explicitly reference papers in your answer.
        - Technical Detail: Include technical details and terminologies that relate to the question.
        - Research Focus: Orient answers towards the research aspects of the questions.
        - Objective Tone: Maintain an objective and informative tone, aiming to educate the reader without persuasive language.

</blockquote>

    **For Marketing**
<blockquote>

              Please provide a precise and concise answer to the marketer's question below based on the context provided.
              
              Below is a context:
        {context}
              Below is a question:
        {question}
              Below are answer instructions in order of importance:
        - Formatting: Provide a concise, single-paragraph answer that uses the fewest words necessary to fully address the question. Answer in a single sentence or phrase if you can. Do not use bullet points. Do not explicitly reference papers in your answer.
        - Succinctness: Make sure your answer is concise and to the point. Provide only the essential information without delving into the technical depth.
        - Broad Overview: Give a broad overview of the topic. Provide only the essential information without delving into the technical depth.
        - Focus on Applications: EFocus on real-world uses and benefits and highlight how technology can solve problems or create opportunities.

</blockquote>

**Evaluation Metric**

The evaluation metric is copied from section 3.D of this notebook:

After reviewing the training set scores that we are getting for the three separate metrics, we have determined that we want to create a 'composite' metric that is a weighted combination of the three. Given the individual normalized scores for BLEU, ROUGE-Lsum, and BERTScore F1 for both Engineering (eng) and Marketing (mk), the composite score can be calculated as follows:

<br>

\begin{align*}
\text{Composite Score}_{\text{eng}} &= w_{\text{BLEU}} \cdot N_{\text{BLEU}_{\text{eng}}} + w_{\text{ROUGE-Lsum}} \cdot N_{\text{ROUGE-Lsum}_{\text{eng}}} + w_{\text{BERTScoreF1}} \cdot N_{\text{BERTScoreF1}_{\text{eng}}} \\
\text{Composite Score}_{\text{mk}} &= w_{\text{BLEU}} \cdot N_{\text{BLEU}_{\text{mk}}} + w_{\text{ROUGE-Lsum}} \cdot N_{\text{ROUGE-Lsum}_{\text{mk}}} + w_{\text{BERTScoreF1}} \cdot N_{\text{BERTScoreF1}_{\text{mk}}}
\end{align*}

<br>

where $w_{\text{BLEU}}$, $w_{\text{ROUGE-Lsum}}$, and $w_{\text{BERTScoreF1}}$ are the weights for the BLEU, ROUGE-Lsum, and BERTScore F1 scores respectively, and $N_{\text{Metric}_{\text{dept}}}$ represents the normalized score for each metric in each department (Engineering or Marketing).

We chose the following weights based on how much each metric contributes to evaluating our generated answers. Because BERTScore is best at measuring the contextual similarity between the generated and gold answers, it is given the most weight. ROUGE follows, with its measurement of phrase-based similarity, and BLEU comes last, capturing similarity in specific word choice. Below are the weights:
\begin{align*}
    w_{\text{BLEU}} &= 0.2 \\
    w_{\text{ROUGE-Lsum}} &= 0.3 \\
    w_{\text{BERTScoreF1}} &= 0.5
\end{align*}

We also decided to weight the importance of the responses between the departments. We understand that this is just a POC, and that a production system would likely hold much more content in the document store, with a corresponding increase in technical jargon and complexity. With the company having 300 engineers and 40 marketers, this system will likely see more usage from engineers. In addition, accurately capturing specific (technical) details matters more for engineering. Marketing requirements will be higher-level and more generalized, and so will suffer less from missing specifics. We therefore chose the following departmental weighting:

\begin{align*}
    w_{\text{eng}} &= 0.6 \\
    w_{\text{mk}} &= 0.4
\end{align*}

<br>
The final composite score, adjusted for departmental impact, is calculated as:

<br>

\begin{equation*}
\text{Final Composite Score} = w_{\text{eng}} \cdot \text{Composite Score}_{\text{eng}} + w_{\text{mk}} \cdot \text{Composite Score}_{\text{mk}}
\end{equation*}
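
As a concrete check, the weighting above can be sketched in a few lines of Python. This is a minimal illustration and not the notebook's `composite_evaluation` implementation: the names `composite_score` and `final_composite` are hypothetical, and the inputs are the (rounded) Question 0 scores reported in section 5.2.1.

```python
# Metric and department weights as stated in the text above.
METRIC_WEIGHTS = {"bleu": 0.2, "rouge": 0.3, "f1": 0.5}
DEPT_WEIGHTS = {"eng": 0.6, "mk": 0.4}


def composite_score(scores):
    """Weighted sum of one department's BLEU, ROUGE-Lsum, and BERTScore F1."""
    return sum(METRIC_WEIGHTS[m] * scores[m] for m in METRIC_WEIGHTS)


def final_composite(eng_scores, mk_scores):
    """Combine the two department composites with the departmental weights."""
    return (DEPT_WEIGHTS["eng"] * composite_score(eng_scores)
            + DEPT_WEIGHTS["mk"] * composite_score(mk_scores))


# Question 0 scores, copied from the describe() output in section 5.2.1.
eng = {"bleu": 0.096852, "rouge": 0.214634, "f1": 0.882951}
mk = {"bleu": 0.0, "rouge": 0.282051, "f1": 0.884043}

print(f"composite_eng:   {composite_score(eng):.4f}")
print(f"composite_mk:    {composite_score(mk):.4f}")
print(f"composite_total: {final_composite(eng, mk):.4f}")
```

Recomputing from the rounded inputs agrees with the reported Question 0 composites (0.525236, 0.526637, and 0.525797) up to rounding in the last digit.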

<br>


### 5.2) Some Test Questions

**QUESTIONS:**


Please study the answers generated by your chosen setup for these specific test questions:

1. "What purpose do large language models serve in the field of natural language processing?" (Question 0)

2. "What methods are typically employed to create training data for embedding models that use task-specific instructions?" (Question 50)

3. "How does a model's ability to answer questions relate to its exposure to specific types of questions during training?" (Question 83, no labeled answers)

For each of the three questions above please provide:

a) The RAG results (research and marketing response)  
b) The context provided  
c) The document sources for the context  
d) Also discuss your metric(s) for the first two examples (for both responses) compared to the gold responses

Then, for questions 1 and 2, comment on how well you feel your metrics captured the differences and similarities between your answer and the gold answer?

Put your answers to these questions into the answers file as you have done on previous assignments. Please consult the answer file for further details.










#### 5.2.1 Test Question 1

Please run the query:

In [None]:
eng_rag_chain = build_RAG_prompt_chain(eng_rag_template, llm_model, retriever, format_docs)
mk_rag_chain = build_RAG_prompt_chain(mk_rag_template, llm_model, retriever, format_docs)
rag_chains = {"engineering": eng_rag_chain, "marketing": mk_rag_chain}
metrics = ['rouge', 'bleu', 'bertscore']
q0_dict = {
    0: {"question": "What purpose do large language models serve in the field of natural language processing?",
  "gold_answer_research": "Large language models (LLMs) serve the purpose of enabling general-purpose language generation and other natural language processing tasks such as classification. They achieve this by learning statistical relationships from text documents during computationally intensive self-supervised and semi-supervised training. LLMs can be used for text generation by predicting the next token or word, making them valuable for tasks like speech recognition, machine translation, and information retrieval. Additionally, LLMs have superseded previous models like recurrent neural networks, showcasing their efficiency and effectiveness in NLP tasks.",
  "gold_answer_marketing": "Large language models serve the purpose of improving performance in various natural language processing tasks, such as speech recognition, machine translation, natural language generation, optical character recognition, handwriting recognition, grammar induction, and information retrieval."}
}
question_0 = evaluate(metrics, q0_dict, rag_chains, iterations=0, verbose=False, dept_specific=True, print_results=True)
question_0 = composite_evaluation(question_0)
question_0.describe()

After Retriever: [Document(page_content='limitations, and societal impact of large language models. arXiv preprint arXiv:2102.02503.\nThoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H.-T., Jin, A., Bos,', metadata={'source': 'https://arxiv.org/pdf/2203.02155.pdf', 'file_path': 'https://arxiv.org/pdf/2203.02155.pdf', 'page': 24, 'total_pages': 68, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'LaTeX with hyperref', 'producer': 'pdfTeX-1.40.21', 'creationDate': 'D:20220307013712Z', 'modDate': 'D:20220307013712Z', 'trapped': '', 'page_num': 24, 'doc_num': 6, 'doc_source': 'ArXiv', 'split_id': 2014, '_id': '5b052b48b0654036bd513dfc447dcdbc', '_collection_name': 'rag_tech_db'}), Document(page_content='them to do what a given set of humans want them to do. By default, language models optimize\nthe next word prediction objective, which is only a proxy for what we want these models to do.', metadata={'source': 'https://

| statistic | Sample | eng_bleu | mk_bleu | eng_rouge | mk_rouge | eng_f1 | mk_f1 | composite_eng | composite_mk | composite_total |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| mean | 0.0 | 0.096852 | 0.0 | 0.214634 | 0.282051 | 0.882951 | 0.884043 | 0.525236 | 0.526637 | 0.525797 |
| std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| min | 0.0 | 0.096852 | 0.0 | 0.214634 | 0.282051 | 0.882951 | 0.884043 | 0.525236 | 0.526637 | 0.525797 |
| 25% | 0.0 | 0.096852 | 0.0 | 0.214634 | 0.282051 | 0.882951 | 0.884043 | 0.525236 | 0.526637 | 0.525797 |
| 50% | 0.0 | 0.096852 | 0.0 | 0.214634 | 0.282051 | 0.882951 | 0.884043 | 0.525236 | 0.526637 | 0.525797 |
| 75% | 0.0 | 0.096852 | 0.0 | 0.214634 | 0.282051 | 0.882951 | 0.884043 | 0.525236 | 0.526637 | 0.525797 |
| max | 0.0 | 0.096852 | 0.0 | 0.214634 | 0.282051 | 0.882951 | 0.884043 | 0.525236 | 0.526637 | 0.525797 |


a) RAG Result:

Engineering Prediction:
<blockquote>
     Large language models (LLMs) have become integral to natural language processing (NLP) due to their ability to understand and generate human-like language. With pre-training on vast textual corpora, LLMs can memorize factual knowledge within their parameters, enabling a wide range of language tasks such as classification, summarization, and question-answering. The adaptation of LLMs to specific downstream applications via fine-tuning has led to their widespread use in NLP research and applications. This process allows models to learn task-specific objectives beyond next-word prediction, making them versatile tools for language understanding and generation. The effectiveness of LLMs in NLP is further enhanced by their capacity for instruction tuning, which enables them to follow human instructions accurately.
</blockquote>

Marketing Prediction:
<blockquote>
    Large language models are an essential tool in natural language processing, enabling a wide range of applications such as classification, summarization, and question-answering by leveraging their capacity for language understanding and ability to be fine-tuned for specific tasks with minimal additional data.
</blockquote>

b,c) Context Provided and Sources
<blockquote>

page_content='limitations, and societal impact of large language models. arXiv preprint arXiv:2102.02503.\nThoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H.-T., Jin, A., Bos,',
'source': 'https://arxiv.org/pdf/2203.02155.pdf'

page_content='them to do what a given set of humans want them to do. By default, language models optimize\nthe next word prediction objective, which is only a proxy for what we want these models to do.',
'source': 'https://arxiv.org/pdf/2203.02155.pdf'

page_content='Big language models have been pre-trained on a large collection of unsupervised textual corpus. Given enough parameters, these models are able to memorize some factual knowledge within parameter weights. Therefore, we can use these models to do',
'source': 'https://lilianweng.github.io/posts/2020-10-29-odqa/'

page_content='among the largest language models today and we apply them on a wide range of language tasks,\nincluding classiﬁcation, summarization, question-answering, creative writing, dialogue, and others.',
'source': 'https://arxiv.org/pdf/2203.02155.pdf'

page_content='Many applications in natural language processing rely on adapt-\ning one large-scale, pre-trained language model to multiple down-\nstream applications. Such adaptation is usually done via ﬁne-tuning,',
'source': 'https://arxiv.org/pdf/2106.09685.pdf',

page_content='substantial data volume can benefit the efficacy of\ninstruction tuning.\n2\nRelated Work\nLarge Language Models for Information Re-\ntrieval\nLLMs possess a remarkable capacity for\nlanguage understanding, enabling them to be highly',
'source': 'https://arxiv.org/pdf/2401.06532.pdf'

</blockquote>
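
The context and source listing above was condensed from the raw retriever output. A sketch of that condensing step is shown below, assuming each retrieved document exposes `page_content` and a `metadata` dict with a `source` key, as in the printed output. The `Doc` class is a hypothetical stand-in so the example runs on its own, the helper names are illustrative, and the chunk texts are placeholders rather than the real retrieved text.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_context(docs):
    """Return (page_content, source) pairs in retrieval order."""
    return [(d.page_content, d.metadata.get("source", "unknown")) for d in docs]


def unique_sources(docs):
    """Return the distinct sources, preserving first-seen order."""
    seen = []
    for _, source in summarize_context(docs):
        if source not in seen:
            seen.append(source)
    return seen


# Sources in the order listed for Question 1; chunk texts are placeholders.
docs = [
    Doc("chunk 1", {"source": "https://arxiv.org/pdf/2203.02155.pdf"}),
    Doc("chunk 2", {"source": "https://arxiv.org/pdf/2203.02155.pdf"}),
    Doc("chunk 3", {"source": "https://lilianweng.github.io/posts/2020-10-29-odqa/"}),
    Doc("chunk 4", {"source": "https://arxiv.org/pdf/2203.02155.pdf"}),
    Doc("chunk 5", {"source": "https://arxiv.org/pdf/2106.09685.pdf"}),
    Doc("chunk 6", {"source": "https://arxiv.org/pdf/2401.06532.pdf"}),
]

for text, source in summarize_context(docs):
    print(f"{source}: {text}")
print(f"{len(unique_sources(docs))} distinct sources")
```

For Question 1, the six retrieved chunks come from four distinct sources, with three of the six drawn from the same arXiv paper.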

d) Metric(s)

 - Engineering Composite: 0.525236
 - Marketing Composite: 0.526637
 - Joint Composite: 0.525797

#### 5.2.2 Test Question 2

Please run the query:

In [None]:
q50_dict = {
  50: {"question": "What methods are typically employed to create training data for embedding models that use task-specific instructions?",
    "gold_answer_research": "To create training data for embedding models that use task-specific instructions, a common method is to combine datasets from different sources, such as the SuperNaturalInstructions dataset with existing collections designed for embedding training. The SuperNaturalInstructions dataset provides natural language instructions, which can be paired with positive and negative examples to form training samples. Additionally, for tasks like classification or similarity, training samples can be constructed by selecting text sequences associated with different classes or similarities. This diverse training data is essential for instruction-based finetuning, which enables the embedding model to learn from a wide range of tasks and domains.",
    "gold_answer_marketing": "Training data for embedding models that use task-specific instructions is typically created by formulating a wide variety of tasks as text-to-text problems, distinguishing good/bad candidate outputs given an input text. This is done by combining datasets with natural language instructions and constructing positive and negative pairs for training."}
}
question_50 = evaluate(metrics, q50_dict, rag_chains, iterations=0, verbose=False, dept_specific=True, print_results=True)
question_50 = composite_evaluation(question_50)
question_50.describe()

After Retriever: [Document(page_content='datasets with instructions across diverse task cate-\ngories and domains: Multitask Embeddings Data\nwith Instructions (MEDI).\nData\nConstruction\nWe\nbuild\nMEDI\nby\ncombining\n300\ndatasets\nfrom\nSuper-\nNaturalInstructions\n(super-NI;\nWang\net\nal.,', metadata={'source': 'https://arxiv.org/pdf/2212.09741.pdf', 'file_path': 'https://arxiv.org/pdf/2212.09741.pdf', 'page': 2, 'total_pages': 18, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'LaTeX with hyperref', 'producer': 'pdfTeX-1.40.25', 'creationDate': 'D:20230531004557Z', 'modDate': 'D:20230531004557Z', 'trapped': '', 'page_num': 2, 'doc_num': 9, 'doc_source': 'ArXiv', 'split_id': 3222, '_id': '65728727919742d1bfdfbf7d188c85e5', '_collection_name': 'rag_tech_db'}), Document(page_content='where the goal is to retrieve a few in-context learn-\ning (i.e., demonstration) examples from annotated\nexamples given a test instance. The embedding\nmode

| statistic | Sample | eng_bleu | mk_bleu | eng_rouge | mk_rouge | eng_f1 | mk_f1 | composite_eng | composite_mk | composite_total |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| mean | 50.0 | 0.152844 | 0.087211 | 0.25 | 0.263736 | 0.890255 | 0.882311 | 0.550696 | 0.537719 | 0.545505 |
| std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| min | 50.0 | 0.152844 | 0.087211 | 0.25 | 0.263736 | 0.890255 | 0.882311 | 0.550696 | 0.537719 | 0.545505 |
| 25% | 50.0 | 0.152844 | 0.087211 | 0.25 | 0.263736 | 0.890255 | 0.882311 | 0.550696 | 0.537719 | 0.545505 |
| 50% | 50.0 | 0.152844 | 0.087211 | 0.25 | 0.263736 | 0.890255 | 0.882311 | 0.550696 | 0.537719 | 0.545505 |
| 75% | 50.0 | 0.152844 | 0.087211 | 0.25 | 0.263736 | 0.890255 | 0.882311 | 0.550696 | 0.537719 | 0.545505 |
| max | 50.0 | 0.152844 | 0.087211 | 0.25 | 0.263736 | 0.890255 | 0.882311 | 0.550696 | 0.537719 | 0.545505 |


a) RAG Result:

Engineering Prediction:
<blockquote>
     Training data for embedding models that employ task-specific instructions can be generated through a few approaches. One method involves combining large-scale datasets that contain diverse task instructions, such as the Super-NaturalInstructions dataset, to train models on a wide range of tasks and domains. Another approach is to utilize unlabeled corpora and generate training data automatically by training customized retrievers for each specific task. This can be done concurrently with the use of task-specific templates and prompts to fine-tune language models on instruction-following tasks. These methods aim to address the challenge of adapting to new knowledge and contexts that may not have been present during the model's pretraining phase.
</blockquote>

Marketing Prediction:
<blockquote>
    Training data for embedding models with task-specific instructions can be generated by leveraging another model to automatically create training examples for a customized retriever model, which then encodes and retrieves relevant in-context learning examples for specific tasks.
</blockquote>

b,c) Context Provided and Sources
<blockquote>

page_content='datasets with instructions across diverse task cate-\ngories and domains: Multitask Embeddings Data\nwith Instructions (MEDI).\nData\nConstruction\nWe\nbuild\nMEDI\nby\ncombining\n300\ndatasets\nfrom\nSuper-\nNaturalInstructions\n(super-NI;\nWang\net\nal.,',
'source': 'https://arxiv.org/pdf/2212.09741.pdf'

page_content='where the goal is to retrieve a few in-context learn-\ning (i.e., demonstration) examples from annotated\nexamples given a test instance. The embedding\nmodel is used to encode all annotated examples\nand to find the few most similar examples to the',
'source': 'https://arxiv.org/pdf/2212.09741.pdf'

page_content='[43] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal,\nK. Slama, A. Ray, et al. Training language models to follow instructions with human feedback.',
'source': 'https://arxiv.org/pdf/2305.14314.pdf'

page_content='address this, the third paradigm trains customized\nretrievers for each task using unlabeled corpora,\nleveraging another model to automatically generate\ntraining data (Wang et al., 2022a). Concurrent to\nour work, Dai et al. (2022) use task-speciﬁc tem-',
'source': 'https://arxiv.org/pdf/2211.09260.pdf'

page_content='Often we need to complete tasks that require latest knowledge after the model pretraining time cutoff or internal/private knowledge base. In that case, the model would not know the context if we don’t explicitly provide it in the prompt. Many methods for',
'source': 'https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/'

page_content='(3) Task execution: Expert models execute on the specific tasks and log results.\nInstruction:',
'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'

</blockquote>

d) Metric(s)

 - Engineering Composite: 0.550696
 - Marketing Composite: 0.537719
 - Joint Composite: 0.545505

#### 5.2.3 Test Question 3

Please run the query:

In [None]:
question = "How does a model's ability to answer questions relate to its exposure to specific types of questions during training?"
eng_prediction = rag_chains["engineering"].invoke(question)
mk_prediction = rag_chains["marketing"].invoke(question)

print("-"*60)
print(f"Question: {question}")
print(" ")
print(f"Engineering Prediction: {eng_prediction}")
print("")
print(f"Marketing Prediction: {mk_prediction}")


After Retriever: [Document(page_content='A model is able to answer novel questions which have answers not contained in the training dataset.', metadata={'source': 'https://lilianweng.github.io/posts/2020-10-29-odqa/', 'doc_num': 24, 'doc_source': 'WWW', 'split_id': 15, '_id': '22078dca273d4b3aa79c784e2c11a77e', '_collection_name': 'rag_tech_db'}), Document(page_content='as the associated general questions:\n1. The overall quality of the question, such as its\ndifficulty, clarity, and information needed for\nanswering it.', metadata={'source': 'https://arxiv.org/pdf/2309.08872.pdf', 'file_path': 'https://arxiv.org/pdf/2309.08872.pdf', 'page': 5, 'total_pages': 17, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'LaTeX with hyperref', 'producer': 'pdfTeX-1.40.25', 'creationDate': 'D:20231109011357Z', 'modDate': 'D:20231109011357Z', 'trapped': '', 'page_num': 5, 'doc_num': 13, 'doc_source': 'ArXiv', 'split_id': 4699, '_id': '93ac6c5433b54b81b43342

a) RAG Result:

Engineering Prediction:
<blockquote>
     A model's ability to answer questions is closely tied to the diversity and specificity of the training data it is exposed to. The more diverse and comprehensive the training dataset is, the better the model becomes at generalizing and providing accurate responses to novel questions. This is because the model learns patterns, contexts, and relationships between the questions and answers during training, which enables it to extrapolate this knowledge to new, unseen scenarios. However, if the training data lacks certain types of questions or covers a limited range of topics, the model may struggle with questions that deviate from its learned patterns, exhibiting lower performance in terms of accuracy and relevance of responses.

</blockquote>

Marketing Prediction:
<blockquote>
    A model's ability to answer questions is directly related to its exposure to specific question types during training; the more diverse and comprehensive the training data, the better equipped the model is to handle novel questions and provide accurate responses.
</blockquote>

b,c) Context Provided and Sources
<blockquote>

page_content='A model is able to answer novel questions which have answers not contained in the training dataset.',
'source': 'https://lilianweng.github.io/posts/2020-10-29-odqa/',

page_content='as the associated general questions:\n1. The overall quality of the question, such as its\ndifficulty, clarity, and information needed for\nanswering it.',
'source': 'https://arxiv.org/pdf/2309.08872.pdf',

page_content='on open Natural Questions [29], WebQuestions [3] and CuratedTrec [2] and strongly outperform\nrecent approaches that use specialised pre-training objectives on TriviaQA [24]. Despite these being',
'source': 'https://arxiv.org/pdf/2005.11401.pdf',

page_content='avenues for having the model sometimes prioritizing truthfulness and harmlessness over helpfulness\nduring training, particularly through the use of refusals: having the model refuse to answer certain',
'source': 'https://arxiv.org/pdf/2203.02155.pdf',

page_content='0.71\n0.34\nHuman (lower bound)\n-\n71.62\nTable 3: Model and lower-bound human performance\non selecting evidence for questions in QASPER\nFigure 2: Learning curves showing Answer-F1 and\nEvidence-F1 on the dev. set while varying training data\nsize.',
'source': 'https://arxiv.org/pdf/2105.03011.pdf',

page_content='With the recommended system prompt, the model properly\ndeclines to answer 100% of the harmful questions.\nAs an illustration, we provide in Table 5 the answers of\nboth Mistral 7B – Instruct and Llama 2 Chat 13B to the',
'source': 'https://arxiv.org/pdf/2310.06825.pdf',

</blockquote>
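A listing like the one above can be produced directly from the retriever output rather than copied by hand. The following is a minimal sketch, not the notebook's actual code: `RetrievedDoc` is a hypothetical stand-in that exposes only the `page_content` and `metadata` fields visible in the printed `Document` objects, and `format_context` is an illustrative helper name.

```python
from dataclasses import dataclass, field


@dataclass
class RetrievedDoc:
    """Hypothetical stand-in for a retrieved document (assumption: only the
    `page_content` and `metadata` fields shown in the output above are needed)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_context(docs) -> str:
    """Render each document as its text followed by its source URL."""
    blocks = []
    for doc in docs:
        # Fall back to a placeholder if a document carries no 'source' key.
        source = doc.metadata.get("source", "unknown")
        blocks.append(
            f"page_content={doc.page_content!r},\n'source': {source!r},"
        )
    # Separate documents with a blank line, as in the listing above.
    return "\n\n".join(blocks)


if __name__ == "__main__":
    docs = [
        RetrievedDoc(
            "A model is able to answer novel questions which have answers "
            "not contained in the training dataset.",
            {"source": "https://lilianweng.github.io/posts/2020-10-29-odqa/"},
        ),
    ]
    print(format_context(docs))
```

Because the helper only reads two attributes, it should also work on any retriever result whose items expose `page_content` and a `metadata` dictionary.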

### 5.3 Other Questions

Below are a few questions to think about. Answer each one directly in the answer file (in a short paragraph), and consider whether it is relevant to your final write-up.

**QUESTION:**

5.3.a. How would you expect your response quality to change if you had a chunk size of 50?

5.3.b. How would you expect your response quality to change if you had a chunk size of 5000?

5.3.c. If you had time, how do you think fine-tuning of the LLM could help? What type of data would you want for that? And which training approach would you take?

5.3.d. What was your design philosophy for the prompts? How did they differ between engineering and marketing support?

5.3.e. What are your average and peak load estimates for the system? Given that, would you suggest a pay-per-use deployment or one that reserves the LLM?

5.3.f. What type of limitations/risks would you see in using this system?
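
For questions 5.3.a and 5.3.b, it can help to see how the number and size of chunks change with the chunk size before reasoning about response quality. The sketch below is a simplified, character-based sliding-window splitter written for illustration only; it is not the splitter used in the notebook, and the overlap values and the synthetic 12,000-character text are assumptions chosen for the demo. The parameter names mirror the `CHUNK_SIZE` and `OVERLAP` tunables listed in the overview.

```python
def split_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split `text` into windows of at most `chunk_size` characters,
    where consecutive windows share `overlap` characters.

    Illustrative only: real splitters usually also respect separators
    such as paragraphs or sentences.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    # Synthetic document (assumption): 12,000 characters of filler text.
    document = "x" * 12_000

    # Overlap values here are arbitrary choices for the demonstration.
    for chunk_size, overlap in [(50, 10), (5000, 500)]:
        chunks = split_text(document, chunk_size, overlap)
        print(f"chunk_size={chunk_size}: {len(chunks)} chunks, "
              f"longest={max(len(c) for c in chunks)} characters")
```

With very small chunks, each retrieved passage carries little surrounding context, as in the short fragments listed under "Context Provided and Sources" above. With very large chunks, a fixed number of retrieved documents consumes far more of the prompt and each embedding has to summarize more mixed content.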
