# DATASCI 290 - GenAI - Assignment 5

> Submitted by [Eshwaran Venkat](eshwaran@ischool.berkeley.edu)

In Assignment 5 you will create and test a RAG system yourself, and write a corresponding business proposal.

The overall scenarios is as follows:

You work at a tech company that is looking for new ways to organize their question answering and search capabilities to accelerate both engineering activity and the marketing team. The company also wants to roll out new GenAI-based products, so a lot of the questions will center around Generative AI concepts. The company has about 300 engineers and a marketing staff of 40. Product releases are done quarterly.

Your role is to implement and conduct a (mini-)POC helping the company to evaluate RAG capabilities for the improvement of their document search (and corresponding question answering), supporting particularly the engineering and marketing organizations. You will have a gold dataset with 'good' responses to questions from marketing and engineering teams. You need to develop metric(s) that help you to evaluate how well your RAG system performs relative to the gold data. You should work with the tunables of the setup (LLM, chunking, embeddings, ...) for your iterations.

You will also need to write up your findings as a short proposal.

(See instructions throughout this notebook.)

So overall, the goals of this assignment is for you to:

*  To implement a RAG system using LangChain
*  Be able to formulate metric(s) that you may want to choose as your evaluation to what degree your system replicates gold answers (labeled data) that we will provide.
* Try out various hyper-parameters and settings to see which configuration works the best (given your chosen metric)  
* Write a comprehensive evaluation, which also includes risks and limitations (and a lot more)

The notebook is organized as follows:

1. Set-Up

2. Base RAG components

    We will provide a base LangChain-based framework for you to use for your RAG system. The components we’ll need include:  

  2.1 Text Embeddings    
  2.2 Text Chunking   
  2.3 The Vector DB & Semantic Search  
  2.4 The Language Model   
  2.5 Testing the LLM in a LangChain Chain   
  2.6. Setting up a simple RAG Chain     


3. Using RAG  
  3.1 Loading of Data  
  3.2 Test Queries


4.  Evaluations

  Here, you will conduct your evaluations


5. Final Results

  In this section you provide the RAG answers to the test questions

RULES:  

* You can only use the language models specified here  
* You can only use the embedding methods we discuss  
* You can only use the focuments we provide. And they all must be in your store   
* Apart from the provided specifications, some of the things you can freely experiment with include chunk sizes, prompts, etc.


**To run this notebook** you should copy it to your personal Colab Pro Google account by uploading it into your Google Drive. From there you can open it as a Colab notebook and run it.  Note it needs a T4 GPU to run.  You may be able to run it in a free Colab notebook.

NOTES:
* The Open Source Model is not trained for safety. So unsafe answers could be returned.

Reference Blogs:

* [Understanding BLEU and ROUGE](https://medium.com/@sthanikamsanthosh1994/understanding-bleu-and-rouge-score-for-nlp-evaluation-1ab334ecadcb)
* [Document Retrieval - ChatGPT RAG](https://medium.com/@filipkny/chatgpt-for-document-retrieval-langchain-rag-with-code-0c68ebb19c9c)
* [Langchain RAG - Bible](https://medium.com/@mosesdaudu001/indexing-and-querying-the-holy-bible-with-rag-langchain-gcp-postgresql-hnsw-and-vertex-ai-for-92a8f49ff613)


Let's begin!

---

##1. Setup

We will first install a number of libraries and import what we will need.






In [1]:
# @title Colab Setup
# For google colab environment variables
from google.colab import userdata

# For temporary checkpoints of data and resuming
from google.colab import drive

drive.mount("/content/gdrive")

Mounted at /content/gdrive


In [2]:
# @title Package Installation
%%capture
!pip -q install git+https://github.com/huggingface/transformers
!pip install -q datasets loralib sentencepiece
!pip -q install bitsandbytes accelerate
!pip -q install langchain
!pip install einops
!pip install faiss-gpu
!pip install --upgrade --quiet  langchain-community chromadb bs4 qdrant-client
!pip install langchainhub
!pip install joblib
!pip install evaluate
!pip install rouge_score bert_score

!pip install --upgrade --quiet  wikipedia
!pip install --upgrade --quiet  arxiv
!pip install --upgrade --quiet  pymupdf
!pip install -U langchain-cohere

!pip install xmltodict

!pip install cohere

In [3]:
# @title Imports
# Ignore Warnings
import warnings

warnings.filterwarnings("ignore")

# Native Libraries
import json
import time
import os

# Other External Libraries
import torch
import numpy as np
from pprint import pprint
import locale
import bs4  # For parsing web text
from tqdm.autonotebook import tqdm
import pandas as pd
import joblib

# Huggingface Transformers
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    pipeline,
    BitsAndBytesConfig,
)  # To declare and download Mistral LLM

# Main Langchain Imports
from langchain.llms import (
    HuggingFacePipeline,
)  # Can use an existing transformer pipeline and convert to LC
from langchain import PromptTemplate, LLMChain, hub
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.embeddings import (
    HuggingFaceEmbeddings,
)  # To generate embeddings to store in vectorstore

# Vector Stores
from langchain_community.vectorstores import FAISS, Chroma, Qdrant  # Main vectorstore
from langchain_community.utils.math import cosine_similarity

# Text
from langchain_text_splitters import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
)  # for vector store generation
from langchain_core.output_parsers import (
    StrOutputParser,
)  # for parsing [INST] in Mistral
from langchain_core.runnables import (
    RunnablePassthrough,
)  # for passing str {} formatters to llm prompt template
import evaluate


# Cohere Community Specific Model
from langchain_community.chat_models import ChatCohere
from langchain_cohere import ChatCohere

# Document Loaders for RAG
from langchain_community.document_loaders import (
    WebBaseLoader,
    ArxivLoader,
    TextLoader,
    WikipediaLoader,
    OnlinePDFLoader,
    PyMuPDFLoader,
    PubMedLoader,
)

In [4]:
# @title Setting Configs
locale.getpreferredencoding = lambda: "UTF-8"
os.makedirs("mids_290_a5", exist_ok=True)
tqdm.pandas()

In [5]:
%%capture
!pip install sentence_transformers

Add your keys from the secret store (do **NOT** print them out or leave them exposed as plaintext in your notebook!):

In [6]:
COHERE_API_KEY: str = userdata.get("COHERE_API_KEY")

---



#### Fine Tuning Vs Indexing

To get large language models to answer questions based on domain specific knowledge, we can either fine-tune a large language model or let the large language model use an external index to first find similar embeddings related to the query and subsequently answer the query based on similar embeddings.

With `Indexing`, new documents are available in real time compared to tuning where we would could require a couple of hours depending on the size of the data as well as the model. When `Fine tuning`, we could also run into the context size limitation which is posed by most LLMs. Most LLMs allow around 4000 tokens per request and this makes it impossible to provide a large lump of data. With the indexing approach, we could literally provide an unlimited amount of data because of the retrieval of similar documents which we only send to the LLM

Fine-tuning might not be necessary when:

* The pre-trained model already performs well on the type of data or domain in question.
* Computational resources or time are limited.
* The task involves only retrieving relevant documents without needing to adapt the model's generation capabilities to specific nuances of the new data.

In such cases, you could directly use the retriever to fetch relevant documents sorted by their relevance to the input question, leveraging existing embeddings and a similarity measure without adjusting the model's parameters.

##2. Building the Components of our RAG System

Let us introduce and test the base components of our RAG system. We will largely use the `Hugging Face` and `LangChain` libraries.


#### RAG Components

A pre-trained `Retrieval-Augmented Generation (RAG)` model is not just any large language model (LLM) like Mistral7B; it specifically combines the powers of a retriever and a generator to enhance question-answering capabilities. The RAG setup uses:

* `Retriever`: This component fetches relevant documents or passages (e.g., from a database of Wikipedia or arXiv texts) based on the input query. Typically, this uses dense passage retrieval techniques like `DPR (Dense Passage Retrieval)`.

* `Generator`: After retrieval, this component, usually a transformer-based model, generates an answer based on the retrieved texts combined with the query. This is often a fine-tuned version of a large language model.


###2.1 The Embedding Model

We will need to represent text (pieces) as vectors. For this, we will use the sentence_transformer architecture.

**NOTE:** The models you can use are: `all-MiniLM-L6-v2`, `multi-qa-mpnet-base-dot-v1` and `avsolatorio/GIST-Embedding-v0`



In [7]:
# @title Embeddings Setup
%%capture

minilm_embd: str = "all-MiniLM-L6-v2"
multi_qa_embd: str = "multi-qa-mpnet-base-dot-v1"
avso_embd: str = "avsolatorio/GIST-Embedding-v0"

base_embeddings = HuggingFaceEmbeddings(model_name=multi_qa_embd)

In [8]:
text = "This is a test document."
query_result = base_embeddings.embed_query(text)
print(f"Embedding dimension: {len(query_result)}")

doc_result = base_embeddings.embed_documents([text, "This is not a test document."])
len(doc_result)

Embedding dimension: 768


2

In [9]:
query_result[0:5]

[-0.33193400502204895,
 -0.31659939885139465,
 -0.43882352113723755,
 -0.12748152017593384,
 -0.15996131300926208]

That looks reasonable. This is how you should define your embedding models.

Next, we turn to text chunks.

###2.2. Loading and Chunking Texts

We first need to load the documents. Here is an example: [`https://lilianweng.github.io/posts/2023-06-23-agent/`](https://lilianweng.github.io/posts/2023-06-23-agent/)

In [10]:
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)

documents = loader.load()

In [11]:
doc_content = documents[0].to_json()["kwargs"]["page_content"]
print(len(doc_content.strip()))

43122


Once the documents are loaded, they may need to be split or chunked into smaller parts. This is particularly useful for large documents. LangChain also offers different algorithms for splitting documents, optimized for specific document types like code or markdown. Some of the splitters are:

* `RecursiveCharacterTextSplitter`: Splits by characters; adjustable fragment size
* `CharacterTextSplitter`: Like the above, but with custom separators like `“\n\n”` and `“ “`.
* `RecursiveTextSplitter`: Splits by words/tokens, ideal for content analysis.
* `TokenTextSplitter`: Uses OpenAI for token-based, context-aware segmentation. Ideal for advanced NLP.

---
We will need to split the  text in chunks that are 'suitable' as retrieval units. Let's for starters define a chunk size of 128 and have no overlap between the chunks:  


In [12]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=128, chunk_overlap=0)
splits = text_splitter.split_documents(documents)
print("Number of splits/chunks: ", str(len(splits)))

Number of splits/chunks:  444


Ok, so it looks like we have now many splits (chunks) from one document. Here is how you can get the content:

In [13]:
splits[39].page_content

'correcting previous mistakes. It plays a crucial role in real-world tasks where trial and error are inevitable.'

Perfect. Now we have the splits and embeddings. Next, the embeddings need to be stores in a vector db.

###2.3 Storing the Embeddings of Chunks in Vectorstores

> `Collections` in `Qdrant` are similar to tables in traditional databases. They store vectors along with their payload

After loading and chunking the data, we need to save the vector representations of the chunks in a vectorstore. We will use Qdrant here for simplicity. We load the splits (structured chunks) and the embeddings:

In [14]:
vectorstore = Qdrant.from_documents(
    splits,
    base_embeddings,
    location=":memory:",  # Local mode with in-memory storage only
    collection_name="test",
)
retriever = vectorstore.as_retriever()

The nice thing is that the vector store also does the similarity searches for us:

In [15]:
query = "What is Chain of Thought doing?"
docs = vectorstore.similarity_search_by_vector(
    base_embeddings.embed_query(query)
)  # will rank the splits

In [16]:
docs

[Document(page_content='the model’s thinking process.', metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/', '_id': 'c441d03b159d4dd689f7aac012651613', '_collection_name': 'test'}),
 Document(page_content='[1] Wei et al. “Chain of thought prompting elicits reasoning in large language models.” NeurIPS 2022', metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/', '_id': 'c35878417fbc4c19b66abd58c86dde38', '_collection_name': 'test'}),
 Document(page_content='the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process', metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/', '_id': '8f56e0b635684f0d8e4b40292ceca0fb', '_collection_name': 'test'}),
 Document(page_content='Chain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex', metadata={'source': 'https://lilianweng.github.io/posts/2023-06-2

Looks good! We have an ordered list of documents that seem to relate to the question. That is what we need.

The last major component is the actual LLM.

###2.4. The LLM

We will use one Open Source Model (`mistralai/Mistral-7B-Instruct-v0.1`) and one Proprietery Model (`Cohere`) for our tests. Let's first set up the OS model:

In [17]:
# @title LLM Setup - `Mistral7B`
# %%capture

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True, llm_int4_enable_fp32_cpu_offload=True
)


llm_mistral_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    torch_dtype=torch.float32,
    device_map="auto",
    quantization_config=quantization_config,
)

llm_mistral_tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1"
)

Unused kwargs: ['llm_int4_enable_fp32_cpu_offload']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.


config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.94G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.47k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/72.0 [00:00<?, ?B/s]

We use the model first to generate a Hugging Face pipeline. A pipeline simplifies the process of actually generating responses.

In [18]:
# @title `Mistral7B` Transformers Pipeline
mistral_pipe = pipeline(
    "text-generation",
    model=llm_mistral_model,
    tokenizer=llm_mistral_tokenizer,
    max_length=1000,
    temperature=0.6,
    top_p=0.95,
    do_sample=True,
    repetition_penalty=1.2,
)
mistral_pipe.model.config.pad_token_id = mistral_pipe.model.config.eos_token_id

Does it work?

In [19]:
q: list[str] = [
    "[INST]Give me a two-sentence story about an apple![/INST]",
    "[INST]What is the capital of France[/INST]",
    "[INST]What is RAG (Retrieval Augmented Generation)[/INST]",
    "[INST]What are the best places to dine-out in Bangalore?[/INST]",
]

for each_q in q:
    print(mistral_pipe(each_q))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[{'generated_text': '[INST]Give me a two-sentence story about an apple![/INST] Once upon a time, there was a beautiful red apple that grew on a tree in the orchard. One day, a little girl came and picked the apple, and it became her favorite fruit.'}]


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[{'generated_text': '[INST]What is the capital of France[/INST] The capital city of France is Paris.'}]


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[{'generated_text': '[INST]What is RAG (Retrieval Augmented Generation)[/INST] Retrieval-Augmented Generation (RAG) refers to a type of natural language processing technique used in text generation. In this approach, the system uses both retrieval and generative processes to generate new text.\n\nThe retrieval process involves searching for relevant information from existing sources such as databases or texts that can be incorporated into the generated text. The generative process involves creating new text based on the input data using machine learning algorithms such as sequence-to-sequence models or transformers.\n\nIn RAG systems, the output of the retrieval process is fed into the generative model as an additional input. This allows the system to leverage the strengths of both approaches - the ability to incorporate relevant information through retrieval, and the flexibility and creativity of generative models to create new content.\n\nRAG has shown promise in generating high-qual

Reasonable!

We will also use a Cohere model, but will create this below as part of the LangChain framework.

###2.5 Testing the LLM in a LangChain Chain

#### Langchain Components
To build a RAG using LangChain, you'll primarily interact with the following components:

* `Language Models`: These are the core components used for generating text. In LangChain, you can integrate various LLMs, including custom models trained or fine-tuned for specific tasks.

* `Retrievers`: LangChain supports integrating various retrieval systems that can fetch relevant information from a knowledge base. This is critical for the RAG's first step.

* `Chains`: These are sequences of operations (like retrieving then answering) that LangChain can execute. Building a RAG typically involves creating a chain that first retrieves relevant documents and then uses a language model to generate an answer based on the retrieved content.

* `Adapters`: These allow LangChain to connect to different backends for both language models and databases, facilitating the integration of specific models or data sources.

LangChain provides a high-level API that abstracts much of the complexity involved in setting up these components, allowing you to focus more on tailoring the solution to your needs

---

Chains will be defined and discussed in Week 11. In short, they are convenient programmatic ways to deal with 'chains' of actions that involve LLMs. For example, a list of events like 'here is a city name. Plug that city name into prompt template, then generate a story about that city. Lastly, format the model output as a string' can be easily handled by LangChain's Chain framework. In this case, the Chain would consist of the prompt template, the LLM, and the String Formatter. The parameter (the city in this case) will be provided at run time by invocation of the Chain. Let's test that.

To use a Hugging Face model in a LangChain environment, we need to wrap the model into a LangChain pipeline object:

In [20]:
# @title `Mistral7B` Langchain Pipeline
# Create a langchain instance of the mistral transformer pipeline
mistral_llm_lc = HuggingFacePipeline(pipeline=mistral_pipe)

Next, we need to define a template and create a corresponding prompt template that can take any questiion

In [21]:
test_llm_template: str = (
    """[INST] Give me a two-sentence story about {object}! [/INST]"""
)
test_llm_prompt_template = PromptTemplate(
    template=test_llm_template, input_variables=["object"]
)

Let's define a Chain, a static flow of actions that (usually) involve at least a definition of the variables used in the chain, one or more templates, LLM step(s) and potentially other actions. This would be a chain that declares the variable 'object' to be expected when the chain is invoked, then inserts it into the template, and passes this to our mistral model pipeline (wrapped as a LangChain object):    

In [22]:
test_llm_chain_short = (
    {"object": RunnablePassthrough()} | test_llm_prompt_template | mistral_llm_lc
)

In [23]:
obs: list[str] = [
    "an apple",
    "a planet",
    "a mountain flower",
    "a beach",
    "a skyscraper",
]
for ob in obs:
    print(test_llm_chain_short.invoke(ob))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[INST] Give me a two-sentence story about an apple! [/INST] Once upon a time, there was a small red apple that grew on the branch of an old oak tree. One day, a curious little girl plucked it from the tree and discovered that inside was a tiny note written in delicate cursive: "To whoever finds this - may you always have joy in your heart."


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[INST] Give me a two-sentence story about a planet! [/INST] Once upon a time, on a distant planet far beyond our solar system, there lived an intelligent species of beings who thrived in their lush and colorful world. But one day, a terrible natural disaster struck the planet, threatening to destroy everything they had built and cherished.


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[INST] Give me a two-sentence story about a mountain flower! [/INST] As the sun rose over the mountain, the delicate petals of the flower slowly unfurled to reveal its beauty. Despite the harsh conditions and rocky terrain, the flower thrived with resilience, brightening up the barren landscape.


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[INST] Give me a two-sentence story about a beach! [/INST] The sun was setting over the endless horizon, painting the sky with shades of orange and pink as Sarah walked barefoot on the warm sand of the beach, feeling the waves gently kissing her toes. She closed her eyes, taking in the peaceful sound of the ocean and the salty breeze that rustled through her hair, feeling at peace in this moment.
[INST] Give me a two-sentence story about a skyscraper! [/INST] Once a symbol of human ambition and progress, the towering skyscraper now stood as an abandoned relic of a bygone era, its once gleaming glass facade tarnished with age and neglect. As the sun set behind it, casting long shadows across the deserted streets below, the skyscraper whispered stories of the people who had once worked and lived within its walls, their hopes and dreams now lost to the sands of time.


Works too. We will use this notation moving forward.

Next, how would we do this with a Cohere Chat Model instead of Mistral?

In [24]:
# @title Cohere LLM Setup
cohere_chat_model = ChatCohere(cohere_api_key=COHERE_API_KEY)

This can be plugged straight into the Chain:

In [25]:
test_cohere_llm_chain_short = (
    {"object": RunnablePassthrough()} | test_llm_prompt_template | cohere_chat_model
)

In [26]:
for ob in obs:
    print(test_cohere_llm_chain_short.invoke(ob))

content="The apple, once bitter and unappealing, fell from the tree and transformed into a sweet, juicy treat. Its once sour notes now a distant memory, it became a symbol of nature's surprising gifts." additional_kwargs={'documents': None, 'citations': None, 'search_results': None, 'search_queries': None, 'is_search_required': None, 'generation_id': '49d35faf-84f6-46e9-b774-ca82f347c69b'} response_metadata={'documents': None, 'citations': None, 'search_results': None, 'search_queries': None, 'is_search_required': None, 'generation_id': '49d35faf-84f6-46e9-b774-ca82f347c69b'} id='run-9feea12e-5889-437c-a35d-83d336b9394e-0'
content='The planet Xylara, a lush, verdant world teeming with exotic life, orbits a distant star. Its peaceful existence is disrupted when an ancient evil, long dormant, stirs and threatens to consume all that is good and pure.' additional_kwargs={'documents': None, 'citations': None, 'search_results': None, 'search_queries': None, 'is_search_required': None, 'gener

Works! (Note: you may want to review the format of the template. The one we used here is the one from Mistral, and the format may or may not be optimal for Cohere.)

How can we get the output formatting under control? We can add a String Formatter to the chain:


In [27]:
output_parser = StrOutputParser()

test_cohere_llm_chain_short_formatted = (
    {"object": RunnablePassthrough()}
    | test_llm_prompt_template
    | cohere_chat_model
    | output_parser
)

for ob in obs:
    print(test_cohere_llm_chain_short_formatted.invoke(ob))

The apple, once bitter and unappealing, transformed into a sweet and juicy delight after a long, sunny summer. It was then that the farmer knew his hard work had paid off, and he could finally reap the rewards of his labor.
The planet Xylara, a lush, verdant world teeming with exotic life, orbits a distant star. Its peaceful existence is threatened when an alien race arrives, seeking to exploit its natural resources, but the brave resistance fighters of Xylara rise up to defend their beloved home.
A delicate mountain flower, rare and pristine, blooms amidst the rugged peaks. Its beauty inspires a sense of wonder and awe in all who are fortunate enough to behold it.
The sun-soaked beach, with its golden sand and crystal-clear waters, was a haven of relaxation. With the gentle breeze carrying the scent of the ocean, it was the perfect place to unwind and forget the world.
The towering skyscraper, a symbol of ambition and innovation, soared above the city. Its sleek glass and steel struct

###2.6 Setting Up a Simple RAG Chain

For RAG, we will follow the same approach. Except... you will **later** need to change the chain to include the retrieval step.

We first do a simple test: create a RAG template that takes a question and a pre-defined context as input, and generates the answer based on the provided context:

In [28]:
rag_template: str = """
[INST] Answer the question based only on the following context:
{context}

Question: {question}
[/INST]
"""
rag_prompt_template = ChatPromptTemplate.from_template(rag_template)

base_rag_chain = (
    {"context": RunnablePassthrough(), "question": RunnablePassthrough()}
    | rag_prompt_template
    | mistral_llm_lc
    | output_parser
)

predefined_context = "Germany has won the World Cup 4 times."
question = "How many times did Germany win the world cup?"

resp = base_rag_chain.invoke({"context": predefined_context, "question": question})

print(resp)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Human: 
[INST] Answer the question based only on the following context:
{'context': 'Germany has won the World Cup 4 times.', 'question': 'How many times did Germany win the world cup?'}

Question: {'context': 'Germany has won the World Cup 4 times.', 'question': 'How many times did Germany win the world cup?'}
[/INST]
Based on the provided context, Germany won the World Cup 4 times.


That's great. But of course, the context needs to be created in an earlier retrieval step. More precisely, the documents will be first retrieved as a list, and then they will need to be formatted into one string to pass to the LLM in the context window.

Here is a simple formatting function that can be hooked into the chain, which combines a list of chunks into one string:



In [29]:
def format_docs(docs: list[str]) -> str:
    return "\n\n".join(doc.page_content for doc in docs)

So how could we build a simple chain? Let's first just get the retrieval done and the formatted retrieved data and the question inserted into the prompt template:

In [30]:
rag_template = (
    """Here is a context:\n{context} \n\nand here is a question: \n{question}"""
)

rag_prompt = ChatPromptTemplate.from_template(rag_template)

rag_chain = (
    # retriever from vector store declaration - piped with format docs function above
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
)

In [31]:
output = rag_chain.invoke("What is Chain of Thought?")

Ok... with some formatting... this looks good:

In [32]:
print(output.messages[0].content)

Here is a context:
the model’s thinking process.

[1] Wei et al. “Chain of thought prompting elicits reasoning in large language models.” NeurIPS 2022

the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process

Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes 

and here is a question: 
What is Chain of Thought?


Let's complete the RAG Chain by using a `StrOutputParser` directly, instead of indexing the output like before:

In [33]:
output_parser = StrOutputParser()

rag_template = """
[INST]Please answer the question below only based on the context information provided.

Here is a context: {context}
Here is a question: {question}.
[/INST]"""

rag_prompt = ChatPromptTemplate.from_template(rag_template)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | mistral_llm_lc
    | output_parser
)

In [34]:
rag_chain.invoke("What is Chain of Thought?")

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


'Human: \n[INST]Please answer the question below only based on the context information provided.\n\nHere is a context: the model’s thinking process.\n\n[1] Wei et al. “Chain of thought prompting elicits reasoning in large language models.” NeurIPS 2022\n\nthe problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process\n\nTree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes\nHere is a question: What is Chain of Thought?.\n[/INST] Based on the given context information, "Chain of Thought" refers to a method used in natural language processing and machine learning that involves breaking down complex problems or tasks into smaller sub-problems or sub-tasks, and generating a sequence of intermediate solutions or ideas that lead to a final solution. This approach is often used in training large language models, such as those used in artificial intelligence

What about the Cohere models?

In [35]:
cohere_rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | cohere_chat_model
    | output_parser
)

In [36]:
cohere_rag_chain.invoke("What is Chain of Thought?")

"Chain of Thought (CoT) is a prompting technique used to elicit reasoning and improve the performance of large language models on complex tasks. It involves decomposing a problem into a series of thought steps, generating multiple thoughts for each step, and then using these thoughts to guide the model's search process. By organizing the model's thinking process in a structured manner, CoT helps the model to provide more accurate and reasoned responses."

Works too! Time to build the real thing and do experimentation.

##3. The RAG Model & Experimentation

With this we can get started. First, we need to acquire the data, chunk it, vectorize it, and store the embeddings (and in this simple case also the docs) in our Qdrant vector db.


###3.1 The Vector Database

We will start by creating our datastore, Qdrant. Usually, you would deploy the vector db as a server, but in this case let's simply put everything in memory. Also, in this case we will store not only the embeddings but the whole document in the vector store. We will seed the store with the splits from the blog post we had used before.

We will also create the retriever, which defines the way the documents are being retrieved. The retriever parameters define for example which method is used, how many docs are retrieved, etc. See [this LangChain link ](https://python.langchain.com/docs/modules/data_connection/retrievers/vectorstore)for more information.


In [37]:
# @title Vectorstore Initialization - `Qdrant`
qdrant_vectorstore = Qdrant.from_documents(
    splits,
    base_embeddings,
    location=":memory:",  # Local mode with in-memory storage only
    collection_name="rag_tech_db",
    force_recreate=True,
)

retriever = qdrant_vectorstore.as_retriever()

###3.2 Data Acquisition, Chunking, and Vectorization

Now where we have our store we need to get the data into it. We will need to retrieve the data, create the chunks, then vectorize them, and finally store the vectors (along with the docs in this case) in the vector db.

Let us first set chunk size and overlap, as well as the type of splitter. These are starting parameters and you may want to experiment with them:

In [38]:
# @title **Hyperparams**
# Note that these defaults may or may not be ideal!

# CHUNK_SIZE: int = 64
# CHUNK_SIZE: int = 128
CHUNK_SIZE: int = 256
OVERLAP: int = 0

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE, chunk_overlap=OVERLAP
)

Now let's work with an actual document collection.  We will work with four types of documents:

* A few papers from the `ArXiv` on RAG and NLP
* A few blogs from `Lily Weng` that talk about Open Domain Question Answering and related topics
* A number of `Wikipedia` articles on that topic


First we'll grab some papers from ArXiv.  We'll grab the pdf files and get all of the pages as separate documents.

In [39]:
arxiv_numbers = (
    "2005.11401",
    "2104.07567",
    "2104.09864",
    "2105.03011",
    "2106.09685",
    "2203.02155",
    "2211.09260",
    "2211.12561",
    "2212.09741",
    "2305.14314",
    "2305.18290",
    "2306.15595",
    "2309.08872",
    "2309.15217",
    "2310.06825",
    "2310.11511",
    "2311.08377",
    "2312.05708",
    "2401.06532",
    "2402.01306",
)

In [40]:
# @title Arxiv Data Loader
all_arxiv_pages = []

# assign a unique number to each document we ingest
global_doc_number = 1

# loop through the papers
for identifier in tqdm(arxiv_numbers):
    # Construct URL using the arXiv unique identifier
    arx_url: str = f"https://arxiv.org/pdf/{identifier}.pdf"

    # Extract pages from the document and add them to the list of pages
    arx_loader = PyMuPDFLoader(arx_url)
    arx_pages = arx_loader.load()
    for page_num in range(len(arx_pages)):
        page = arx_pages[page_num]

        # CHANGED
        page.metadata["page_num"] = page_num
        page.metadata["doc_num"] = global_doc_number
        page.metadata["doc_source"] = "ArXiv"
        all_arxiv_pages.append(page)

    global_doc_number += 1

print(f"{len(all_arxiv_pages)=}")
print(f"{len(arxiv_numbers)=}")
print(f"{global_doc_number-1=}")

  0%|          | 0/20 [00:00<?, ?it/s]

len(all_arxiv_pages)=420
len(arxiv_numbers)=20
global_doc_number-1=20


In [41]:
print(all_arxiv_pages[5].page_content[:150])  # all pages of the Document content

Table 1: Open-Domain QA Test Scores. For TQA,
left column uses the standard test set for Open-
Domain QA, right column uses the TQA-Wiki
test set. See


Now we need to split the docs into chunks.  LangChain provides a couple of ways to do that.  We'll use for now the `RecursiveCharacterTextSplitter`.

In [42]:
# index doc chunks
splits = text_splitter.split_documents(all_arxiv_pages)
for idx, text in enumerate(splits):
    splits[idx].metadata["split_id"] = idx

print("Number of splits/chunks: ", len(splits))

Number of splits/chunks:  6712


In [43]:
splits[0]

Document(page_content='Retrieval-Augmented Generation for\nKnowledge-Intensive NLP Tasks\nPatrick Lewis†‡, Ethan Perez⋆,\nAleksandra Piktus†, Fabio Petroni†, Vladimir Karpukhin†, Naman Goyal†, Heinrich Küttler†,', metadata={'source': 'https://arxiv.org/pdf/2005.11401.pdf', 'file_path': 'https://arxiv.org/pdf/2005.11401.pdf', 'page': 0, 'total_pages': 19, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'LaTeX with hyperref', 'producer': 'pdfTeX-1.40.21', 'creationDate': 'D:20210413004838Z', 'modDate': 'D:20210413004838Z', 'trapped': '', 'page_num': 0, 'doc_num': 1, 'doc_source': 'ArXiv', 'split_id': 0})

Let's add the vectors to the datastore and see whether we can retrieve a nearest neighbor to a query. Let's look at the second closest match:

In [44]:
%%capture
vs_exception: bool = True

try:
    qdrant_vectorstore.add_documents(documents=splits)
except Exception as exc:
    print(exc)
    vs_exception = False

In [45]:
assert vs_exception, "There was an error storing all documents in the vectorstore"

In [46]:
query = "How can we train a model for preferences?"
found_docs = qdrant_vectorstore.similarity_search_with_score(query)

In [47]:
print(found_docs[0][0].page_content)
print(found_docs[0][1])
print(found_docs[1][0].page_content)
print(found_docs[1][1])

preferences at once, or where everyone would endorse the tradeoffs.
One path forward could be to train models that can be conditioned on the preferences of certain
0.7036675973413791
a large text dataset. While the most straightforward approach to preference learning is supervised
fine-tuning on human demonstrations of high quality responses, the most successful class of methods
0.6703005536978632


Next, let's get some information from Wikipedia on our main topic -- Gen AI.  LangChain provides a `DocumentLoader` that accesses the `Wikipedia API`.

In [48]:
# @title Wikipedia Data Loader
wiki_queries: list[str] = [
    "Generative Artificial Intelligence",
    "Information Retrieval",
    "Large Language Models",
]

for wq in tqdm(wiki_queries):
    wiki_docs = WikipediaLoader(query=wq, load_max_docs=4).load()

    for idx, text in enumerate(wiki_docs):
        wiki_docs[idx].metadata["doc_num"] = global_doc_number
        wiki_docs[idx].metadata["doc_source"] = "Wikipedia"

    global_doc_number += 1

    print(f"Number of documents {wq}: ", len(wiki_docs))

    # index docs

    wiki_splits = text_splitter.split_documents(wiki_docs)
    for idx, text in enumerate(wiki_splits):
        wiki_splits[idx].metadata["split_id"] = idx

    print(f"Number of splits/chunks {wq}: ", len(wiki_splits))
    try:
        qdrant_vectorstore.add_documents(documents=wiki_splits)
    except Exception as exc:
        print(exc)
        vs_exception = False

    assert (
        vs_exception
    ), f"{wq}: There was an error storing all documents in the vectorstore"

  0%|          | 0/3 [00:00<?, ?it/s]

Number of documents Generative Artificial Intelligence:  4
Number of splits/chunks Generative Artificial Intelligence:  80
Number of documents Information Retrieval:  4
Number of splits/chunks Information Retrieval:  94
Number of documents Large Language Models:  4
Number of splits/chunks Large Language Models:  86


We'll also augment our collection with some blog entries about Open Domain Question Answering, of which RAG is an approach, and some related topics in case users want to ask how the new Search system works.

In [49]:
# @title Online Blog Data Loader
web_loader = WebBaseLoader(
    web_paths=(
        "https://lilianweng.github.io/posts/2020-10-29-odqa/",
        "https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/",
        "https://lilianweng.github.io/posts/2018-06-24-attention/",
    ),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)

web_documents = web_loader.load()

for idx, text in enumerate(web_documents):
    web_documents[idx].metadata["doc_num"] = global_doc_number
    web_documents[idx].metadata["doc_source"] = "WWW"

global_doc_number += 1

print("Number of documents: ", len(web_documents))

web_splits = text_splitter.split_documents(web_documents)

for idx, text in enumerate(web_splits):
    web_splits[idx].metadata["split_id"] = idx

print("Number of splits: ", len(web_splits))

try:
    qdrant_vectorstore.add_documents(documents=web_splits)
except Exception as exc:
    print(exc)
    vs_exception = False

Number of documents:  3
Number of splits:  625


In [50]:
joblib.dump(
    qdrant_vectorstore,
    f"mids_290_a5/qdrant_vectorstore_CHUNK_{CHUNK_SIZE}_OVR_{OVERLAP}.pkl",
)

['mids_290_a5/qdrant_vectorstore_CHUNK_256_OVR_0.pkl']

---
### 3.3 The Test Data

You will want to test the system that you (will) have built. Below we give you a validation set that you could take as labeled data (imagine, your user personas would have had these questions and deemed the answers to be good). We also will give you a test set that only contains questions. (This is the set that we will use to get a feel for how well your RAG system corresponds to our Gold model).

Here are is the gold validation set and the test questions. **DO NOT CHANGE OR DELETE!!**

In [51]:
# @title
validation_questions_answers = {
    0: {
        "question": "What purpose do large language models serve in the field of natural language processing?",
        "gold_answer_research": "Large language models (LLMs) serve the purpose of enabling general-purpose language generation and other natural language processing tasks such as classification. They achieve this by learning statistical relationships from text documents during computationally intensive self-supervised and semi-supervised training. LLMs can be used for text generation by predicting the next token or word, making them valuable for tasks like speech recognition, machine translation, and information retrieval. Additionally, LLMs have superseded previous models like recurrent neural networks, showcasing their efficiency and effectiveness in NLP tasks.",
        "gold_answer_marketing": "Large language models serve the purpose of improving performance in various natural language processing tasks, such as speech recognition, machine translation, natural language generation, optical character recognition, handwriting recognition, grammar induction, and information retrieval.",
    },
    1: {
        "question": "How does a large language model learn from text during training?",
        "gold_answer_research": "A large language model learns from text during training by first going through an unsupervised generative 'pretraining' stage where it sets initial parameters using a language modeling objective. Then, it goes through a supervised discriminative 'fine-tuning' stage where it refines its parameters based on annotated examples or task demonstrations. This dual-stage approach allows the model to learn statistical relationships from text documents in a computationally intensive process, enabling it to achieve general-purpose language generation and natural language processing tasks.",
        "gold_answer_marketing": "A large language model learns from text during training by first pretraining on a diverse dataset to acquire general language knowledge, and then fine-tuning on specific tasks or demonstrations to adapt its parameters for more targeted performance.",
    },
    2: {
        "question": "What are some key architectures behind the development of large language models?",
        "gold_answer_research": "Key architectures behind the development of large language models include the use of self-attention mechanisms, such as those seen in Transformer decoders. These architectures have been applied to tasks like autoregressive language modeling and have led to the dominance of Transformer-based language models in NLP. Models like BERT and GPT-2 have further advanced this paradigm, showcasing the power of large Transformer language models in achieving state-of-the-art results across various NLP tasks. Additionally, architectures like neural-retriever-in-the-loop generative-based models have shown improvements in tasks like open-domain QA and knowledge-grounded dialogue, emphasizing the importance of consistent and engaging responses in long-form generation and multi-turn conversations.",
        "gold_answer_marketing": "Key architectures behind the development of large language models include Transformer-based models such as BERT and GPT-2, which utilize self-attention mechanisms for tasks like autoregressive language modeling and knowledge-grounded dialogue. These models have shown significant success in NLP tasks and have led to advancements in general-purpose language generation and natural language processing.",
    },
    3: {
        "question": "Can you name some specific large language models and the companies or organizations that have developed them?",
        "gold_answer_research": "Some specific large language models include GPT-3 by OpenAI, Chinchilla by DeepMind, and BERT by Google. OpenAI developed GPT-3, DeepMind developed Chinchilla, and Google developed BERT. These models have been significant advancements in the field of natural language processing.",
        "gold_answer_marketing": "Chinchilla by DeepMind, GPT-3 by OpenAI.",
    },
    7: {
        "question": "What licensing models have been adopted for the distribution of source-available language models?",
        "gold_answer_research": "Based on the provided context, it seems that licensing models for the distribution of source-available language models have not been explicitly discussed in the referenced papers. However, it is crucial to consider potential licensing options such as open-source licenses (e.g., GPL, MIT) or proprietary licenses when distributing language models to ensure legal compliance and control over usage rights. Additionally, considering the implications of different licensing models on accessibility, collaboration, and commercialization is essential for determining the most suitable approach for sharing language models with the community. Further research or consultation with legal experts may be necessary to explore specific licensing strategies for source-available language models.",
        "gold_answer_marketing": "Answer: Some organizations choose open-sourcing, while others restrict access to a few organizations with resources or offer end-to-end deployment via API.",
    },
    8: {
        "question": "What are language models and what is their purpose in natural language processing?",
        "gold_answer_research": "Language models are probabilistic models of natural language that help predict or correct text. Their purpose in natural language processing is to assist in various tasks such as speech recognition, machine translation, natural language generation, and information retrieval. By analyzing the performance of human subjects, language models improve the understanding and generation of human-like text.",
        "gold_answer_marketing": "Language models are probabilistic models of natural language that are used in tasks such as speech recognition, machine translation, and natural language generation in natural language processing.",
    },
    9: {
        "question": "How have language models evolved in terms of architecture, from the 1980s to present times?",
        "gold_answer_research": "Language models have evolved significantly in terms of architecture from the 1980s to present times. In the 1980s, the first statistical language model was proposed, leading to experiments by IBM that identified areas for improvement by observing human subjects. However, it wasn't until 2017 when the transformer architecture was introduced by Google, revolutionizing the field. This development paved the way for models like BERT in 2018, which marked a shift towards large-scale transformer-based language models. These modern architectures, based on self-attention mechanisms, have dominated the field of natural language processing, achieving state-of-the-art performance in various tasks.",
        "gold_answer_marketing": "Language models have evolved from early statistical models in the 1980s to modern transformer architectures, such as BERT and GPT-2, which use self-attention mechanisms and have become dominant in natural language processing tasks.",
    },
    11: {
        "question": "Can you explain how maximum entropy language models work and what the partition function signifies?",
        "gold_answer_research": "Maximum entropy language models use feature functions to encode the relationship between a word and its n-gram history, aiming to maximize reward while satisfying a KL-constrained objective. The partition function, denoted as Z(x), is crucial in normalizing the probabilities of all possible outputs given the input. It represents the sum of the exponential of the reward function over all possible output sequences, making it computationally expensive to estimate but essential for accurate modeling. The partition function ensures that the model's predicted probabilities sum up to 1, providing a foundation for effective language modeling.",
        "gold_answer_marketing": "Maximum entropy language models encode the relationship between a word and the n-gram history using feature functions. The partition function in this context represents the total probability of all possible outcomes, making it a crucial factor in determining the optimal solution for the reward maximization objective.",
    },
    12: {
        "question": "What is the benefit of using continuous space embeddings in recurrent neural network language models?",
        "gold_answer_research": "Continuous space embeddings in recurrent neural network language models help alleviate the curse of dimensionality by representing words as non-linear combinations of weights in the embedding space. This approach helps address the data sparsity problem caused by the exponential increase in possible word sequences with vocabulary size. By utilizing continuous space embeddings, neural networks can effectively capture semantic relationships and meaning within the language model.",
        "gold_answer_marketing": "Continuous space embeddings in recurrent neural network language models help alleviate the curse of dimensionality caused by the exponential increase in possible word sequences, reducing data sparsity issues.",
    },
    13: {
        "question": "What challenges do large language models face in mirroring human cognitive patterns?",
        "gold_answer_research": "Large language models face challenges in mirroring human cognitive patterns because they sometimes learn patterns that humans do not learn, while also failing to learn patterns that humans typically learn. This discrepancy suggests that the models may not be plausible cognitive models, despite matching human performance in some tasks. Further research is needed to address these limitations and improve the alignment of large language models with human cognitive patterns.",
        "gold_answer_marketing": "Large language models sometimes learn patterns that humans do not learn and fail to learn patterns that humans typically do learn.",
    },
    16: {
        "question": "What factors influenced the development of generative language models by Anthropic?",
        "gold_answer_research": "Several factors influenced the development of generative language models by Anthropic, including the limitations in coding, math, and reasoning capabilities of the initial version Claude, the partnerships with companies like Notion and Quora to enhance the model's capabilities, and the need to address biases, unsafe content, and ethical considerations in training data. Additionally, the reliance on supervised learning and the need for controlled generation in generative models played a role in shaping the development of Anthropic's language models.",
        "gold_answer_marketing": "Factors that influenced the development of generative language models by Anthropic include partnerships with companies like Notion and Quora, limitations in coding, math, and reasoning capabilities in initial models like Claude, and the need to address biases and unsafe content in training datasets.",
    },
    17: {
        "question": "What is Constitutional AI and how does it affect the functionality of AI systems?",
        "gold_answer_research": "Constitutional AI is an approach developed by Anthropic for training AI systems, particularly language models like Claude, to be harmless and helpful without relying on extensive human feedback. It involves two phases: supervised learning, where the model generates responses to prompts and self-critiques based on a set of guiding principles, and reinforcement learning, where the model is trained with AI-generated feedback according to constitutional principles. This approach enables the training of AI assistants that are both helpful and harmless, with the ability to explain objections to harmful requests, enhancing transparency and reducing the need for human supervision.",
        "gold_answer_marketing": "Constitutional AI is an approach developed by Anthropic for training AI systems, particularly language models like Claude, to be harmless and helpful without relying on extensive human feedback. It involves supervised learning and reinforcement learning phases to guide the model's responses based on a set of guiding principles (a 'constitution'). This approach aims to create AI systems that are both helpful and transparent in their decision-making process, reducing the need for constant human supervision.",
    },
    18: {
        "question": "How do advances in AI models impact their ability to interact with different types of data, such as images?",
        "gold_answer_research": "Advances in AI models, such as multimodal models like RA-CM3, have significantly improved their ability to interact with different types of data, such as images. These models can refer to external memory, like web data, to increase their knowledge capacity, allowing them to generate correct images from entity-rich captions. Additionally, these models can perform image editing and manually specify examples in-context for better results. The use of large language models, combined with larger datasets and neural networks, has also enhanced their performance in tasks like image generation and text generation.",
        "gold_answer_marketing": "Advances in AI models, such as multimodal models like RA-CM3, allow for better interaction with different types of data, like images, by accessing external memory for increased knowledge capacity and improving performance in tasks like image generation and image editing.",
    },
    19: {
        "question": "What are the potential trade-offs between AI system alignment with ethical guidelines and practical utility?",
        "gold_answer_research": "The potential trade-offs between AI system alignment with ethical guidelines and practical utility include the risk of reduced performance and usability due to stringent ethical alignment measures, as seen with Claude 2. Users may face limitations and refusal of assistance for benign requests, leading to debates over the 'alignment tax' in AI development. Balancing ethical considerations with practical functionality is crucial to ensure alignment with ethical guidelines without compromising the practical utility of AI systems. Research is needed to find a middle ground that prioritizes ethical alignment while maintaining usability and performance.",
        "gold_answer_marketing": "The potential trade-offs between AI system alignment with ethical guidelines and practical utility include balancing stringent ethical alignment that may reduce usability and performance, ensuring transparency and fairness in alignment processes, and addressing the alignment tax that may impact adoption of AI systems.",
    },
    20: {
        "question": "How has the token handling capacity changed between different versions of the Claude model?",
        "gold_answer_research": "The token handling capacity has increased with each new version of the Claude model. Claude Instant has a context length of 100,000 tokens, Claude 2.1 doubled this to 200,000 tokens, and Claude 3 Opus default version has a context window of 200,000 tokens but can be expanded to 1 million for specific use cases. This progression shows a trend towards handling larger amounts of text data for improved performance and capabilities.",
        "gold_answer_marketing": "The token handling capacity has increased from Claude to Claude Instant to Claude 2.1, with Claude Instant having a input context length of 100,000 tokens, Claude 2.1 having a context window of 200,000 tokens, and Claude 3 Opus having a context window of 1 million tokens.",
    },
    22: {
        "question": "In what ways has the Claude model's ability to self-critique and revise its responses enhanced its transparency?",
        "gold_answer_research": "The Claude model's ability to self-critique and revise its responses has enhanced its transparency by allowing for iterative improvements based on past actions and mistakes. Through self-reflection, the model can refine its output by learning from feedback and generating special tokens to signal the need for retrieval or confirm the relevance, support, or completeness of its responses. This process ensures that the model's statements about the world are truthful and accurate, ultimately increasing transparency in its decision-making and reasoning processes.",
        "gold_answer_marketing": "The Claude model's ability to self-critique and revise its responses has enhanced its transparency by allowing it to generate text informed by retrieved passages, criticize the output, and signal the need for retrieval or confirm the output's relevance, support, or completeness. This self-reflection process helps improve the model's accuracy and reliability in generating responses.",
    },
    23: {
        "question": "How do subsequent versions of Claude compare in terms of their likelihood to produce false statements?",
        "gold_answer_research": "Claude Instant is a faster and lighter version of Claude, with an input context length of 100,000 tokens. In contrast, Claude 3 has faced criticism for its stringent ethical alignment, leading to a debate over the 'alignment tax' in AI development. Users have been refused assistance with benign requests, which has sparked discussions on balancing ethical considerations and practical functionality. This suggests that Claude Instant may have a lower likelihood of producing false statements compared to Claude 3 due to its focus on usability and performance.",
        "gold_answer_marketing": "Claude Instant is a faster, less expensive, and lighter version of Claude with a shorter input context length. Claude 3 has faced criticism for ethical alignment issues that may affect usability and performance.",
    },
    24: {
        "question": "Who developed the language model family known as Chinchilla?",
        "gold_answer_research": "The Chinchilla language model family was developed by the research team at DeepMind and presented in March 2022. It is named 'Chinchilla' as an advancement over the previous Gopher model family. The Chinchilla family has been trained to investigate the scaling laws of large language models and is designed to outperform GPT-3.",
        "gold_answer_marketing": "The research team at DeepMind developed the language model family known as Chinchilla.",
    },
    25: {
        "question": "What benchmark did Chinchilla achieve an average accuracy of 67.5% on?",
        "gold_answer_research": "Chinchilla achieved an average accuracy of 67.5% on the MMLU benchmark (Measuring Massive Multitask Language Understanding).",
        "gold_answer_marketing": "Chinchilla achieved an average accuracy of 67.5% on the MMLU benchmark (Measuring Massive Multitask Language Understanding).",
    },
    27: {
        "question": "What is the relationship between Chinchilla and the Gopher language model families?",
        "gold_answer_research": "The Chinchilla family of transformer models is essentially the same as the Gopher family, with minor modifications and different training optimizers. Chinchilla uses AdamW optimizer while Gopher uses Adam optimizer. Additionally, Chinchilla uses relative positional encoding and RMSNorm instead of absolute positional encoding and LayerNorm used by Gopher. Chinchilla has 70B parameters and outperforms Gopher on the MMLU benchmark by 7%, showcasing an improvement in performance. Both families follow similar naming conventions and were developed to investigate the scaling laws of large language models.",
        "gold_answer_marketing": "Chinchilla is a family of transformer models developed by DeepMind, which is a further development over a previous model family named Gopher. Both model families were trained to investigate the scaling laws of large language models.",
    },
    28: {
        "question": "What distinguishes the architectures of the Chinchilla and Gopher family models in terms of optimization techniques used?",
        "gold_answer_research": "The main distinction in optimization techniques between the Chinchilla and Gopher family models lies in the choice of optimizers. The Gopher family utilizes the Adam optimizer, whereas the Chinchilla family is trained using the AdamW optimizer. Additionally, the Gopher family employs RMSNorm instead of LayerNorm, and relative positional encoding rather than absolute positional encoding. These differences in optimization techniques contribute to the unique characteristics and performance of each model family.",
        "gold_answer_marketing": "The Chinchilla family uses AdamW optimizer, while the Gopher family uses the Adam optimizer.",
    },
    30: {
        "question": "What is the recommended strategy for training large autoregressive language models with limited compute resources, as contributed by the Chinchilla team?",
        "gold_answer_research": "The Chinchilla team recommends that the number of training tokens should be doubled for every model size doubling to achieve better results on downstream tasks. They also suggest using larger, higher-quality training datasets to improve performance. Additionally, they mention the importance of balancing model size and efficiency to address computational costs and inference latency limitations. It is advised to focus on Transformer language models and consider sharing model parameters for quick task-switching when deploying as a service.",
        "gold_answer_marketing": "The Chinchilla team recommends doubling the number of training tokens for every model size doubling and using larger, higher-quality training datasets to achieve better results on downstream tasks.",
    },
    33: {
        "question": "What are some key areas of research in the field of artificial intelligence as reflected in recent academic literature?",
        "gold_answer_research": "Recent academic literature in the field of artificial intelligence reflects key areas of research such as natural language processing with state-of-the-art transformers, feature learning in infinite-width neural networks, diverse beam search for complex scene description, and the development of generative AI models capable of generating text and images. Additionally, research focuses on human preferences in dueling bandits, the use of few-shot learners in language models, and the exploration of knowledge-grounded neural conversation models. These areas of research highlight the advancements in AI technology and its applications across various domains.",
        "gold_answer_marketing": "Some key areas of research in artificial intelligence include natural language processing, deep neural networks, generative AI, AI safety, AI art, reinforcement learning, and language agents alignment.",
    },
    34: {
        "question": "What are some of the limitations of traditional position encoding methods in the architecture of pre-trained language models (PLMs), and what novel approach does the paper propose to address these issues?",
        "gold_answer_research": "One limitation of traditional position encoding methods in PLMs is that they may not enable length extrapolation of pre-existing models, leading to the need for substantial pre-training costs. The paper proposes a novel approach called Position Interpolation, which extends existing PLMs without deviating far from existing definitions of position encoding or attention mechanisms. This method allows for much extended context windows for text modeling, leading to significant perplexity gains and improved model performance.",
        "gold_answer_marketing": "Traditional position encoding methods in PLMs have limitations in enabling length extrapolation and adapting to extended context windows. The paper proposes a novel approach called Position Interpolation, which generates strong models that can effectively make use of much extended context windows. This method allows for substantial pre-training cost savings and preserves the quality of the original models, even for small context window tasks.",
    },
    35: {
        "question": "How does the Rotary Position Embedding (RoPE) approach in Transformers differ from the traditional additive method of position embedding with respect to encoding position information?",
        "gold_answer_research": "The RoPE approach in Transformers differs from the traditional additive method of position embedding by being multiplicative instead of additive. While traditional methods add position encoding to context representations, RoPE incorporates relative position information through rotation matrix product. This means that RoPE naturally includes relative position dependency in the self-attention formulation, without altering terms in the expanded formulation like the additive method does. Additionally, RoPE's properties show that it decays as the relative distance between positions increases, providing a clear theoretical interpretation of how position information is encoded.",
        "gold_answer_marketing": "The RoPE approach in Transformers differs from the traditional additive method of position embedding by incorporating relative position information through rotation matrix product instead of altering terms in the expanded formulation of additive position encoding.",
    },
    36: {
        "question": "What is the significance of comparing the normalized subspace similarity between ∆Wq, ∆Wv, and random Gaussian matrices when analyzing the adaptation of pre-trained language models?",
        "gold_answer_research": "Comparing the normalized subspace similarity between ∆Wq, ∆Wv, and random Gaussian matrices provides insight into the underlying mechanism for adapting pre-trained language models. It helps determine the intrinsic rank of the adaptation matrix ∆W and sheds light on the connection between ∆W and the original weight matrix W. By analyzing these similarities, we can understand how much of the adaptation is specific to the task at hand and how much is influenced by the pre-trained model. This comparison is crucial for optimizing the adaptation process and maximizing downstream performance in NLP tasks.",
        "gold_answer_marketing": "Comparing the normalized subspace similarity between ∆Wq, ∆Wv, and random Gaussian matrices helps understand the underlying mechanism for adapting pre-trained language models. It reveals the intrinsic rank and common singular value directions learned by different runs, shedding light on the fundamental principles of using pre-trained language models for downstream tasks in NLP.",
    },
    38: {
        "question": "What issues are associated with the homogeneity of language model training contractors, and how might it affect the behavior of the models?",
        "gold_answer_research": "The issues associated with the homogeneity of language model training contractors include potential biases in the labeling process, lack of diverse perspectives leading to limited coverage of sensitive content, and reduced robustness in model performance across different tasks. This homogeneity can affect the behavior of the models by reinforcing certain biases, increasing the risk of harmful content generation, and limiting the models' ability to generalize effectively. To address these issues, it is important to ensure diversity among labelers, incorporate varied perspectives in training data, and implement measures to enhance model robustness and performance across a range of tasks.",
        "gold_answer_marketing": "The homogeneity of language model training contractors can lead to biased or limited perspectives in the data, which may result in the models producing harmful content, gaming objectives, or lacking sensitivity to diverse viewpoints. This can affect the behavior of the models by reinforcing stereotypes, increasing toxicity, and reducing their ability to accurately represent under-represented groups.",
    },
    39: {
        "question": "What are common research topics and themes found in recent publications about artificial intelligence and natural language processing?",
        "gold_answer_research": "Recent publications in artificial intelligence and natural language processing have covered topics such as transformer models, feature learning in neural networks, attention mechanisms, multi-task benchmark platforms, semantic search using sentence embeddings, cross-task generalization, and question generation for question answering. Themes commonly explored include machine comprehension of text, reinforcement learning algorithms, sentence embeddings, semantic compositionality, reasoning with language models and knowledge graphs, and the gap between neural text and human text. These publications also delve into deep language understanding, retrieval-augmented transformers, image captioning, and open datasets for image-text pairs.",
        "gold_answer_marketing": "Common research topics and themes in recent publications on artificial intelligence and natural language processing include transformer models, attention mechanisms, semantic search, sentence embeddings, and question answering using language models and knowledge graphs.",
    },
    41: {
        "question": "Question: When conducting demographic and technical assessments of teams or research subjects, what types of data categories are typically collected and analyzed to ensure a comprehensive understanding of the group's composition and the methods used?",
        "gold_answer_research": "When conducting demographic and technical assessments of teams or research subjects, it is important to collect and analyze data categories such as age, gender, education level, professional background, and expertise in specific areas. By gathering information on these categories, you can ensure a comprehensive understanding of the group's composition and the methods used in your assessments. Additionally, it may be helpful to consider factors like cultural background, language proficiency, and geographical location to capture a more nuanced picture of the group being assessed. This detailed approach to data collection and analysis can provide valuable insights for making informed decisions and recommendations based on the gathered information.",
        "gold_answer_marketing": "Answer: Demographic data such as age, gender, education level, and technical data related to skills and experience are typically collected and analyzed for comprehensive understanding.",
    },
    43: {
        "question": "What kind of tasks can be performed using the datasets described in the provided text, and what are some common features of these datasets?",
        "gold_answer_research": "The datasets described in the provided text can be used for tasks such as question answering, duplicate question retrieval, entity retrieval, citation prediction, query understanding, document understanding, passage retrieval, text summarization, fact verification, and code search. Common features of these datasets include diverse task categories, comprehensive instructions, a wide range of synthetic user personalities and interaction patterns, and a focus on enhancing comprehension of documents to deliver accurate results. Additionally, the datasets cover a variety of domains such as public health, scientific exams, climate, and general knowledge.",
        "gold_answer_marketing": "The datasets described in the provided text can be used for tasks such as question answering, document summarization, duplicate question retrieval, code search, sentence simplification, dialogue generation, body retrieval, caption generation, fact verification, and more. Some common features of these datasets include diverse input-output pairs, incorporation of various knowledge-intensive datasets, and a focus on generating high-quality synthetic data points.",
    },
    44: {
        "question": "What conclusions can be drawn about the relationship between input prompt toxicity and output toxicity when using different language models and prompts?",
        "gold_answer_research": "Based on the findings presented in the results section, it can be concluded that the relationship between input prompt toxicity and output toxicity varies depending on the language model used and the specific prompt given. When instructed to produce a safe and respectful output, InstructGPT models generate less toxic outputs compared to GPT-3, but this advantage disappears when the respectful prompt is removed. On the other hand, when explicitly prompted to produce a toxic output, InstructGPT outputs are much more toxic than GPT-3 outputs. Additionally, the toxicity of the model outputs is highly correlated with the toxicity of the input prompt, as shown in Figure 39.",
        "gold_answer_marketing": "The study found that when instructed to produce a safe and respectful output, InstructGPT models generate less toxic outputs compared to GPT-3. However, this advantage disappears when the respectful prompt is removed. Interestingly, when explicitly prompted to produce a toxic output, InstructGPT outputs are much more toxic than GPT-3. This suggests that the toxicity of the output is highly correlated with the toxicity of the input prompt.",
    },
    45: {
        "question": "What are some challenges in training retrieval systems and how are negative samples used to address them?",
        "gold_answer_research": "Training retrieval systems face challenges such as redundancy in retrieved documents and lack of diversity in retrieval. Negative samples, including randomly sampled negatives, denoised hard negatives, and instruction-unfollowing negatives, are crucial for improving system performance. Carefully designed negative samples help the system effectively learn the task, but they can also lead to performance drops in out-of-domain datasets. Combining random samples and challenging negatives during training is key to building a competitive system for both in-domain and out-of-domain retrieval.",
        "gold_answer_marketing": "Some challenges in training retrieval systems include high cost of annotating datasets for new tasks and improving performance in zero-shot settings. Negative samples, such as denoised hard negative documents and instruction-unfollowing negative documents, are used to train retrieval systems effectively and address performance drops in out-of-domain datasets.",
    },
    46: {
        "question": "What factors have been found to potentially impact the ability of models to follow instructions, based on the analysis provided?",
        "gold_answer_research": "Based on the analysis provided, factors that have been found to potentially impact the ability of models to follow instructions include the human feedback obtained from contractors, which may be influenced by their beliefs, cultural backgrounds, and personal history. Additionally, the model's behavior can be affected by false premises in instructions, tendencies to hedge, and performance degradation with multiple explicit constraints in instructions. The models are also not fully aligned or safe, as they can generate toxic or biased outputs, make up facts, and fail to generate reasonable outputs in some cases.",
        "gold_answer_marketing": "Factors that may impact the ability of models to follow instructions include false premises in instructions, models hedging unnecessarily, performance degradation with multiple constraints in instructions, generation of toxic or biased outputs, and over-generalization leading to refusal of innocuous instructions.",
    },
    47: {
        "question": "What are some key factors to consider when building a successful multi-task instruction-following retrieval system as identified in the research?",
        "gold_answer_research": "Some key factors to consider when building a successful multi-task instruction-following retrieval system include the need for cross-task interdependence for training a single retriever, the flexibility and zero-shot transfer enabled by instructions compared to task identifiers, and the elimination of the need for hosting multiple task-specific retrievers. Additionally, optimizing the mix and volume of instructional data for diverse tasks is crucial, as well as considering the impact of ranking strategy in data construction. Finally, the effectiveness of the dataset scale in retrieval and the importance of carefully designed negative samples should be taken into account for improved efficiency of instruction-following retrievers.",
        "gold_answer_marketing": "Key factors to consider when building a successful multi-task instruction-following retrieval system include the effectiveness of the dataset scale in retrieval, the diversity in data and model scale, carefully designed negative samples, and the ability to adapt to new tasks via instructions.",
    },
    48: {
        "question": "What are the benefits of using retrieval-augmented techniques in multimodal language modeling, as demonstrated by the performance of the RA-CM3 model in the document?",
        "gold_answer_research": "The benefits of using retrieval-augmented techniques in multimodal language modeling, as demonstrated by the performance of the RA-CM3 model, include significantly better training efficiency with less training compute, outperforming existing models by using less training data, compute, and parameters. The retrieval augmentation allows the model to focus on learning how to use retrieved documents in context, leading to improved accuracy in classification tasks. Additionally, the RA-CM3 model achieves strong performance in image and caption generation, surpassing existing models like DALL-E and Flamingo despite using fewer resources.",
        "gold_answer_marketing": "The benefits of using retrieval-augmented techniques in multimodal language modeling, as demonstrated by the performance of the RA-CM3 model in the document, include outperforming existing models by using less training data, compute, and parameters, achieving significantly better training efficiency, and improving accuracy in k-shot classification tasks. Additionally, retrieval augmentation allows the model to focus on learning how to use retrieved documents in context, leading to stronger performance in tasks such as image and caption generation.",
    },
    50: {
        "question": "What methods are typically employed to create training data for embedding models that use task-specific instructions?",
        "gold_answer_research": "To create training data for embedding models that use task-specific instructions, a common method is to combine datasets from different sources, such as the SuperNaturalInstructions dataset with existing collections designed for embedding training. The SuperNaturalInstructions dataset provides natural language instructions, which can be paired with positive and negative examples to form training samples. Additionally, for tasks like classification or similarity, training samples can be constructed by selecting text sequences associated with different classes or similarities. This diverse training data is essential for instruction-based finetuning, which enables the embedding model to learn from a wide range of tasks and domains.",
        "gold_answer_marketing": "Training data for embedding models that use task-specific instructions is typically created by formulating a wide variety of tasks as text-to-text problems, distinguishing good/bad candidate outputs given an input text. This is done by combining datasets with natural language instructions and constructing positive and negative pairs for training.",
    },
    51: {
        "question": "Question: What are some of the challenges and innovations associated with fine-tuning large language models, and how does the approach discussed in the referenced text aim to address them?",
        "gold_answer_research": "Some challenges associated with fine-tuning large language models include limited access to and manipulation of knowledge, lagging performance on knowledge-intensive tasks, and the need for provenance in decision-making and updating world knowledge. The approach discussed in the referenced text aims to address these challenges by utilizing Retrieval Augmented Generation (RAG), which involves retrieving relevant passages from a corpus to feed to the language model for improved performance in tasks such as question-answering and dialogue. This iterative approach focuses on improving alignment with user intent and fine-tuning models to control sentiment and improve response quality in various language tasks.",
        "gold_answer_marketing": "The challenges with fine-tuning large language models include aligning them with user intent and controlling the quality of generated outputs. The approach discussed in the referenced text aims to address these challenges by using Retrieval Augmented Generation (RAG) to retrieve relevant passages from a corpus and feed them to the language model, improving alignment and performance.",
    },
    52: {
        "question": "What is a common technique used to address the outlier issue when applying block-wise k-bit quantization to input tensors, and how does it work?",
        "gold_answer_research": "A common technique used to address the outlier issue when applying block-wise k-bit quantization to input tensors is to chunk the input tensor into blocks that are independently quantized, each with their own quantization constant. This approach involves dividing the input tensor into contiguous blocks of size B by flattening the tensor and slicing it into n blocks, where n is determined by the size of the blocks. Each block is then quantized independently using a quantization constant c, which helps prevent outlier values from causing performance degradation.",
        "gold_answer_marketing": "A common technique used to address the outlier issue when applying block-wise k-bit quantization to input tensors is to chunk the input tensor into blocks that are independently quantized, each with their own quantization constant. This helps prevent performance degradation by reducing the impact of outliers on the quantization process.",
    },
    54: {
        "question": "What considerations or techniques are commonly implemented when setting up finetuning experiments for machine learning models?",
        "gold_answer_research": "When setting up finetuning experiments for machine learning models, it is common to use a two-stage approach. The initial stage involves setting the initial parameters using a language modeling objective. This is followed by a supervised discriminative 'fine-tuning' stage to adapt these parameters to the target task. Additionally, it is typical to train all models using the Adam optimizer and a triangular learning rate scheduler with 10% warmup. Experimentation with different hyperparameters such as number of epochs, peak learning rate, and batch size is also conducted to optimize model performance. Finally, utilizing a mixture of datasets and balancing the sizes of datasets can help improve the robustness and generalization of the finetuned models.",
        "gold_answer_marketing": "Considerations for setting up finetuning experiments for machine learning models commonly include using a language modeling objective for initial parameter setting and supervised discriminative fine-tuning for adapting parameters to the target task. Techniques such as hyperparameter search, Adam optimizer with triangular learning rate scheduler, and balancing dataset sizes through mixing strategies are also commonly implemented. Additionally, freezing some model layers during fine-tuning and incorporating negative examples for contrastive learning can be effective strategies.",
    },
    55: {
        "question": "What are the implications of the equivalence relation defined in the theoretical analysis of the DPO model for understanding the relationship between reward functions in reinforcement learning?",
        "gold_answer_research": "The equivalence relation defined in the theoretical analysis of the DPO model implies that two reward functions are considered equivalent if they differ by a constant function. This means that the class of learned reward models is not constrained by this reparameterization, allowing for the exact recovery of the optimal policy. Understanding this relationship between reward functions in reinforcement learning helps in defining a unique reward function within each equivalence class, which is crucial for optimizing policies under existing models of human preferences. It also highlights the generality and flexibility in the reward model due to the proposed reparameterization.",
        "gold_answer_marketing": "The equivalence relation defined in the theoretical analysis of the DPO model shows that two reward functions are considered equivalent if they differ by a fixed function. This implies that different reward functions can lead to the same optimal policy, allowing for flexibility in designing reward models in reinforcement learning.",
    },
    59: {
        "question": "Considering the structure and content of the provided text, what guidelines should be used to evaluate the effectiveness of a summary or chatbot response in this context?",
        "gold_answer_research": "To evaluate the effectiveness of a summary or chatbot response in this context, guidelines should include assessing the faithfulness of the answer to the retrieved context, the relevance of the answer to the question, and the focus of the retrieved context. Additionally, consider using quality metrics such as answer relevancy to rank responses based on how directly they address the question and avoid redundant or incomplete information. Lastly, take into account the performance of different tasks such as summarization, citation prediction, and passage ranking to determine the overall effectiveness of the response.",
        "gold_answer_marketing": "Answer: Evaluate based on faithfulness, answer relevance, and context relevance.",
    },
    60: {
        "question": "What are some recent methods and technologies that have been developed to enhance the capabilities and performance of natural language processing models?",
        "gold_answer_research": "Recent methods and technologies developed to enhance natural language processing models include retrieval-augmented multimodal language modeling, which outperforms existing models with less training data and parameters. Another advancement is the use of feature learning in infinite-width neural networks to improve performance. Additionally, embedding techniques in NLP have been developed to map words or phrases to real number vectors, enhancing the model's understanding of language. These innovations have led to improvements in tasks like query reformulation, document ranking, and fine-tuning larger language models for various applications.",
        "gold_answer_marketing": "Recent methods and technologies include retrieval-augmented language models, feature learning in infinite-width neural networks, and word embeddings.",
    },
    61: {
        "question": "What are some potential directions for future work mentioned in the document related to enhancing question-answering techniques for document-oriented tasks?",
        "gold_answer_research": "One potential direction for future work mentioned in the document is the development of multi-modal approaches that incorporate table and figure information into GPT-4 question-answering for documents. Another direction is to incorporate question type in the PDFTriage approach to improve the efficiency and efficacy of the approach. Additionally, the document suggests further research in document-grounded, information-seeking question answering, which the dataset is designed to facilitate.",
        "gold_answer_marketing": "Some potential future directions mentioned in the document include developing multi-modal approaches that incorporate table and figure information into question-answering for documents, and incorporating question type in the PDFTriage approach to improve efficiency and efficacy.",
    },
    62: {
        "question": "What information would you expect to find in section 2 of a document, based on the types of questions classified under Summarization?",
        "gold_answer_research": "Based on the types of questions classified under Summarization, you would expect to find key takeaways, concise summaries, and specific content extraction related to different sections of the document in section 2. The section likely contains detailed summaries of specific parts of the document, along with structured metadata representation and instructions for summarizing the content effectively. It may also include guidelines for extracting specific information and rewriting text for clarity and conciseness.",
        "gold_answer_marketing": "Based on the types of questions classified under Summarization, you would expect to find key takeaways, concise summaries, and specific content extraction related to the document in section 2.",
    },
    63: {
        "question": "What are the main advantages and attention mechanisms that contribute to the enhanced performance and efficiency of the newly introduced language model as compared to its predecessors?",
        "gold_answer_research": "The main advantages of the newly introduced language model include utilizing retrieval-augmentation to incorporate external knowledge, which improves prediction accuracy. Additionally, the model employs attention mechanisms that allow for better understanding of dependencies between source and target sequences, leading to more informed predictions. These attention mechanisms have been extended from machine translation to various other fields, enhancing the model's adaptability and performance across different tasks. Finally, the model's use of self-attention mechanisms enables better contextual representation learning, parallelization, and modeling of longer intra-token relations, improving efficiency and performance compared to previous models.",
        "gold_answer_marketing": "The main advantages of the newly introduced language model include the use of retrieval-augmented mechanisms, attention mechanisms, and context representation learning, which contribute to enhanced performance and efficiency compared to its predecessors.",
    },
    64: {
        "question": "What criteria are used to assess the quality of recommendations provided by different language models in a comparison study?",
        "gold_answer_research": "In a comparison study of language models, criteria such as sentence relevance, lexical accuracy, and contextual understanding are used to assess the quality of recommendations. Different tasks may benefit from different evaluation measures, such as STRINC, LEXICAL, and CXMI. Additionally, template selection plays a vital role in the quality of recommendations, with deliberate template design being important for tasks like query suggestion. The overall quality of recommendations is often judged using a Likert scale, along with metadata collection for each model output.",
        "gold_answer_marketing": "The criteria used to assess the quality of recommendations provided by different language models in a comparison study include comparing to human-created benchmarks, examining intrinsic character, comparing two models, investigating rate of learning, and analyzing learning curves.",
    },
    65: {
        "question": "What approaches have been proposed to enhance the task performance of language models while considering the trade-offs such as runtime efficiency, robustness to irrelevant context, and attribution quality?",
        "gold_answer_research": "Several approaches have been proposed to enhance the task performance of language models while considering trade-offs. These include using compression and selective augmentation methods to decrease the propensity of models to generate toxic or biased outputs. Adversarial setups have been suggested where labelers find worst-case behaviors of the model and add them to the dataset. Additionally, models like BART and T5 leverage bi-directional attention to achieve stronger performance on both discriminative and generative tasks. These methods aim to balance model performance with considerations such as runtime efficiency, robustness to irrelevant context, and attribution quality.",
        "gold_answer_marketing": "Approaches proposed to enhance language model task performance include compression and selective augmentation, adversarial set-ups for labeling worst-case behaviors, retrieval-augmented models, and extending existing models to enable length extrapolation while maintaining quality.",
    },
    67: {
        "question": "What metrics are commonly used to compare the performance of language models in various tasks, as outlined in an experimental results table?",
        "gold_answer_research": "Common metrics used to compare the performance of language models in various tasks, as outlined in an experimental results table, include Exact Match and Unigram F1. These metrics have become standard in evaluating language models. Additionally, other metrics such as BLEU score, FactScore (factuality), precision, and recall are also commonly used to assess the performance of language models across different tasks. It is important to consider a variety of metrics to get a comprehensive understanding of the effectiveness of a language model in different contexts.",
        "gold_answer_marketing": "The metrics commonly used to compare the performance of language models in various tasks are Exact Match and Unigram F1.",
    },
    69: {
        "question": "What is the role of manual assessment in the validation of language model predictions according to the text provided?",
        "gold_answer_research": "Manual assessment plays a crucial role in the validation of language model predictions. The engineers evaluate the quality of model outputs by having labelers rate them on test sets consisting of prompts from held-out customers. This manual assessment helps ensure that the models are aligned with a broad distribution of language tasks and can identify any behavioral issues that may arise from misalignment. Additionally, human annotators find that certain reflection token predictions are aligned with their assessments, providing valuable insights into the accuracy and effectiveness of the models.",
        "gold_answer_marketing": "Answer: Manual assessment plays a key role in evaluating the quality of language model predictions by having labelers rate the model outputs and comparing them to prompts from held-out customers.",
    },
    70: {
        "question": "What are the general steps outlined for training a language model in the document, and how is the training data for the generator language model collected and utilized?",
        "gold_answer_research": "The document outlines the general steps for training a language model, including incorporating retrieved documents into the main input sequence and optimizing the loss function to train the generator. The training data for the generator language model is collected through various techniques such as supervised fine-tuning, critic learning, and custom retrievers for downstream tasks. The collected data is used to train the generator on specific tasks like summarization, machine reading comprehension, and natural language to SQL translation, improving performance on those tasks.",
        "gold_answer_marketing": "The general steps for training a language model include fine-tuning on specific datasets, filtering pretraining data, and using critic learning. Training data for the generator language model is collected from open-access NLP papers and used for downstream conditional text generation tasks.",
    },
    73: {
        "question": "What are the three main categories used to refine language model abilities in understanding and executing search tasks according to the given document?",
        "gold_answer_research": "The three main categories used to refine language model abilities in understanding and executing search tasks are query understanding, document understanding, and query-document relationship understanding. Tasks within these categories focus on interpreting queries, comprehending documents, and understanding the relationships between queries and documents. This approach aims to enhance the models' performance in interpreting and responding to search-related instructions effectively, improving their utility in complex information retrieval scenarios.",
        "gold_answer_marketing": "The three main categories used to refine language model abilities in understanding and executing search tasks are query understanding, document understanding, and query-document relationship understanding.",
    },
    74: {
        "question": "What are some of the emerging research topics and challenges in the field of natural language processing and information retrieval according to recent academic conferences and publications?",
        "gold_answer_research": "Recent academic conferences and publications have highlighted emerging research topics and challenges in natural language processing and information retrieval. Some key areas of focus include efficient retrieval augmented generation, unsupervised dense information retrieval with contrastive learning, citation-informed transformers, and knowledge refinement via interaction between search engines and large language models. Additionally, challenges such as zero-shot retrieval, semantic search using GPT sentence embeddings, and prompt-based effective input reformulation for legal case retrieval have been identified as important research directions. These topics reflect the ongoing advancements and complexities in the field, driving innovation and progress in NLP and IR research.",
        "gold_answer_marketing": "Some emerging research topics and challenges in the field of natural language processing and information retrieval include efficient generation from unstructured knowledge, semantic code search evaluation, unsupervised dense information retrieval, context-aware document term weighting, knowledge refinement through interaction with large language models, and investigating the effectiveness of large language models in search re-ranking.",
    },
    75: {
        "question": "Question: How do models with different fine-tuning strategies compare in terms of accuracy and F1 score for fact verification tasks?",
        "gold_answer_research": "Models with different fine-tuning strategies are compared in terms of accuracy and F1 score for fact verification tasks. The introduction of LLMs has led to notable developments, with some studies leveraging prompting methods to apply LLMs in IR tasks. However, not all LLMs consistently outperform fine-tuned smaller models. For example, RankGPT based on gpt-3.5-turbo underperforms monoBERT in certain scenarios. Fine-tuning is not strictly necessary for models like GPT3, which has been evaluated on closed book question answering tasks without any updates or fine-tuning.",
        "gold_answer_marketing": "Models with different fine-tuning strategies have shown mixed results in terms of accuracy and F1 score for fact verification tasks. Some studies have found that large language models (LLMs) outperform smaller fine-tuned models, while others have reported inconsistent performance. Factors such as task complexity and the need for prompt methods to apply LLMs in information retrieval tasks can also impact the comparison.",
    },
    76: {
        "question": "What components does a fact verification task typically involve in order to assess the accuracy of a given statement?",
        "gold_answer_research": "A fact verification task typically involves assessing the relationship between a claim and the evidence provided, analyzing if there is enough information for a conclusive judgment. This task requires a detailed understanding of the claim and evidence to determine if it is supported or refuted. The use of performance metrics based on including gold answers in model generations instead of exact matching can help search engines deliver accurate and relevant results. Additionally, incorporating lexical measures and verification functions can aid in determining the accuracy of statements.",
        "gold_answer_marketing": "A fact verification task typically involves assessing the relationship between a claim and supporting evidence to determine accuracy.",
    },
    78: {
        "question": "What are the key factors that determine the performance of HALO-aligned models compared to non-HALO models, according to the results presented in the analysis?",
        "gold_answer_research": "According to the analysis presented, the key factors that determine the performance of HALO-aligned models compared to non-HALO models include the specific alignment method used (such as DPO and PPO variant), the model size (significant gap at 13B+ model sizes), and the ability to match or exceed the generation quality of SFT target sequences. Additionally, the study suggests that the cost of increasing model alignment is modest relative to pretraining, and that the modeling of human biases in HALOs may have practical benefits in improving overall performance.",
        "gold_answer_marketing": "The key factor that determines the performance of HALO-aligned models compared to non-HALO models is the model size, with HALO-aligned models generally outperforming non-HALO models at larger sizes (13B+ model sizes).",
    },
    80: {
        "question": "How does the performance of KTO compare to DPO in model alignment, and what are the potential implications for data usage and training efficiency?",
        "gold_answer_research": "Based on the provided data and experiments, KTO consistently outperforms DPO in model alignment, even with restrictions such as using only one output per input. This suggests that KTO can achieve higher win rates and improve performance across various benchmarks compared to DPO. The implications of this performance difference include the ability to achieve quality generation results with significantly fewer desirable examples, potentially leading to more efficient data usage and training processes. This indicates that KTO may offer a more efficient and effective approach to model alignment compared to DPO.",
        "gold_answer_marketing": "KTO outperforms DPO in model alignment with up to 90% fewer examples. This suggests that KTO can achieve high performance even with imbalanced data, potentially leading to more efficient training processes.",
    },
    81: {
        "question": "What are some common approaches to building an open-domain question answering system?",
        "gold_answer_research": "Some common approaches to building an open-domain question answering system include using the RAG model, which minimizes the negative log-likelihood of answers, and comparing it to extractive QA paradigms that rely on non-parametric knowledge retrieval. Another approach is to incorporate question rewriting techniques to make open-domain QA more conversational. Additionally, utilizing datasets like QASPER, which contain questions requiring complex reasoning, can improve the performance of the system. References to papers by Anantha et al. and Asai et al. provide further insights into building ODQA systems.",
        "gold_answer_marketing": "Common approaches to building an open-domain question answering system include using retrieval over a knowledge base and incorporating the retrieved content as part of the prompt. Other methods involve pretraining models on large amounts of text data and fine-tuning them for question answering tasks.",
    },
    82: {
        "question": "What is the difference between open-book and closed-book question answering?",
        "gold_answer_research": "Open-book question answering involves the use of external sources of knowledge, such as Wikipedia, to retrieve information and generate a response. In contrast, closed-book question answering relies on pre-trained language models that have memorized factual knowledge within their parameters to generate responses without explicit context. Closed-book QA can be seen as analogous to a closed-book exam where no external resources are allowed. The key distinction lies in the reliance on external knowledge sources for open-book QA versus internal memorized knowledge for closed-book QA.",
        "gold_answer_marketing": "Open-book question answering involves using external sources of knowledge to answer questions, while closed-book question answering relies on pre-trained language models to provide answers without explicit context.",
    },
    84: {
        "question": "What are the basic components of the Retriever-Reader framework in open-domain QA?",
        "gold_answer_research": "The basic components of the Retriever-Reader framework in open-domain QA include a retriever model, which fetches relevant information based on input prompts efficiently using FAISS. The retriever component is responsible for retrieving contextually relevant documents or evidence blocks based on the input question. The reader component then processes this retrieved information to generate answers to the questions posed. This framework combines information retrieval and machine reading comprehension to achieve state-of-the-art results in open-domain question answering tasks.",
        "gold_answer_marketing": "The basic components of the Retriever-Reader framework in open-domain QA are the retriever and the reader components, which can be set up and trained independently or jointly trained end-to-end. The retriever component automatically fetches relevant information based on input prompts, while the reader component processes and comprehends the retrieved information to answer questions.",
    },
    85: {
        "question": "How is the TF-IDF model used in question answering retrieval systems?",
        "gold_answer_research": "In question answering retrieval systems, the TF-IDF model is used to represent queries and documents as bag-of-word vectors with terms weighted by term frequency multiplied by inverse document frequency. This allows for efficient non-learning-based search engine operations based on the vector space model. The TF-IDF model helps in calculating the relevance of documents to queries by measuring the importance of terms in the context of the entire document collection. This classic information retrieval approach aids in retrieving relevant information to answer questions accurately and efficiently.",
        "gold_answer_marketing": "The TF-IDF model is used in question answering retrieval systems to weight terms in queries and documents based on their importance in determining relevance.",
    },
    86: {
        "question": "Can neural networks enhance the process of information retrieval in QA systems?",
        "gold_answer_research": "Neural networks, such as MLP, LSTM, and bidirectional LSTM, can be used to learn dense representations of text for information retrieval in QA systems. These approaches, known as 'Neural IR', are a new category of methods that can improve performance in retrieval problems. The introduction of neural retrievers in recent QA literature has shown to outperform traditional word-similarity-based architectures, such as BM25, and can scale to handle knowledge-grounded dialogue tasks effectively. Additionally, incorporating pre-trained retrievers in QA systems has been shown to enhance the performance of generative language models.",
        "gold_answer_marketing": "Yes, neural networks can enhance the process of information retrieval in QA systems by improving performance in open-domain QA tasks and enabling the generation of more accurate answers.",
    },
    87: {
        "question": "What is the importance of fine-tuning in the context of QA data for open-domain question answering models?",
        "gold_answer_research": "Fine-tuning is important in the context of QA data for open-domain question answering models because it allows the model to adapt and improve its performance on specific QA datasets. By fine-tuning the model with common QA datasets, engineers can optimize the model's ability to answer questions accurately. However, there is a concern about the significant overlap between questions in the train and test sets of public QA datasets, which could affect the generalization ability of the fine-tuned models. Engineers should carefully consider this overlap and potentially explore ways to mitigate its impact during the fine-tuning process to ensure the model's effectiveness in real-world applications.",
        "gold_answer_marketing": "Fine-tuning is important in the context of QA data for open-domain question answering models to improve search task performance and the ability to generalize to unseen datasets.",
    },
    88: {
        "question": "How does pre-training with tasks like the Inverse Cloze Task benefit open-domain question answering models?",
        "gold_answer_research": "Pre-training with tasks like the Inverse Cloze Task benefits open-domain question answering models by improving the retrieval process over a knowledge base. By predicting the context given a sentence, the model can better understand the relationship between the question and the evidence. This approach helps in incorporating retrieved content effectively into the prompt, leading to higher accuracy in the question answering task. Additionally, using models pretrained with ICT can enhance the overall performance of the QA system by providing a better understanding of the context.",
        "gold_answer_marketing": "Pre-training with tasks like the Inverse Cloze Task benefits open-domain question answering models by improving retrieval and generation steps, ultimately enhancing the accuracy of the process.",
    },
    89: {
        "question": "What is the main goal of prompt engineering in language models?",
        "gold_answer_research": "The main goal of prompt engineering in language models is to effectively steer the behavior of the model towards desired outcomes without updating the model weights. This is achieved by composing and formatting prompts in a way that maximizes the model's performance on a specific task. Prompt engineering involves treating prompts as trainable parameters and optimizing them directly on the embedding space through methods like AutoPrompt, Prefix-Tuning, P-tuning, and Prompt-Tuning. The ultimate aim is to enhance the model's performance and alignment with user-defined tasks.",
        "gold_answer_marketing": "The main goal of prompt engineering in language models is to steer the behavior of the model for desired outcomes without updating the model weights.",
    },
    91: {
        "question": "What are some known biases that can affect the performance of few-shot classification in LLMs?",
        "gold_answer_research": "Some known biases that can affect the performance of few-shot classification in LLMs include majority label bias, recency bias, and common token bias. Majority label bias occurs when the distribution of labels among examples is unbalanced, recency bias refers to the tendency for the model to repeat the label at the end, and common token bias indicates that LLM tends to produce common tokens more often than rare tokens. These biases can contribute to high variance in few-shot classification tasks and may impact the model's ability to generalize effectively.",
        "gold_answer_marketing": "Some known biases that can affect the performance of few-shot classification in LLMs are majority label bias, recency bias, and common token bias.",
    },
    92: {
        "question": "Why might increasing model size not reduce variance in model performance with varying prompts?",
        "gold_answer_research": "Increasing model size may not necessarily reduce variance in model performance with varying prompts because the model's ability to generalize and adapt to different prompts is not solely dependent on its size. Factors such as the quality and relevance of the training examples, the learning rate or schedule, and the model's sensitivity to different hyperparameters can also play a significant role in determining performance variability. Additionally, the complexity of the task or dataset being used for training can impact how effectively the model scales with size. It is essential to consider these factors holistically when optimizing model performance rather than relying solely on increasing model size.",
        "gold_answer_marketing": "Increasing model size may not reduce variance in model performance with varying prompts because the same order of prompts may work well for one model but poorly for another. Additionally, when the validation set is limited, choosing the order of prompts that prevents the model from producing extremely unbalanced predictions or being overconfident can also affect performance.",
    },
    93: {
        "question": "What is the benefit of instruction-based finetuning in language models?",
        "gold_answer_research": "Instruction-based finetuning improves models' ability to generalize to unseen domains and tasks by providing task-specific representations that can be used for many downstream language tasks without additional training. This method also allows pretrained language models to follow instructions provided in prompts, enabling them to generate the desired output given specific inputs. Additionally, instruction finetuning helps transform raw pretrained LLMs into chatbot-like models, making finetuning more accessible and common, particularly for researchers with limited resources. Overall, the benefit of instruction-based finetuning is improved model performance, enhanced generalizability, and reduced communication costs in aligning with human intentions.",
        "gold_answer_marketing": "The benefit of instruction-based finetuning in language models is improved ability to generalize to unseen domains and tasks, without the need for additional training.",
    },
    94: {
        "question": "Can you describe a situation where retrieval-based methods would be necessary to enhance language model performance?",
        "gold_answer_research": "Retrieval-based methods are necessary to enhance language model performance in scenarios where the model needs to generate accurate and informative responses for entity-rich queries, such as 'George Washington standing in front of the Eiffel Tower.' In such cases, incorporating a retrieval module can provide additional context and relevant information to improve the model's understanding and generation of the desired output. Additionally, retrieval-based methods are crucial for question answering tasks, where the model needs to access external knowledge sources to provide accurate and comprehensive answers. By utilizing retrieval mechanisms, the language model can benefit from a wider range of information and improve its performance in handling complex and ambiguous queries effectively.",
        "gold_answer_marketing": "Retrieval-based methods are necessary to enhance language model performance in tasks like question answering, where incorporating additional information from external sources can improve the model's ability to generate accurate and relevant responses.",
    },
    95: {
        "question": "What is the Chain-of-Thought prompting technique and for which types of tasks is it particularly beneficial?",
        "gold_answer_research": "Chain-of-Thought (CoT) prompting is a technique that generates reasoning chains or rationales step by step to lead to a final answer, benefiting complicated reasoning tasks using large models with more than 50B parameters. It can be implemented through iterative Monte Carlo search methods or through a three-step process called augment-prune-select. CoT is particularly beneficial for enhancing model performance on complex tasks by decomposing them into smaller and simpler steps, shedding light on the model's thinking process. Task decomposition in CoT can be done with simple prompting, task-specific instructions, or human inputs.",
        "gold_answer_marketing": "Chain-of-Thought (CoT) prompting is a technique that generates reasoning chains or rationales step by step to lead to a final answer. It is particularly beneficial for complicated reasoning tasks when using large models with more than 50B parameters. Simple tasks only benefit slightly from CoT prompting.",
    },
    96: {
        "question": "How do augmented language models with external tools differ from regular models in functionality?",
        "gold_answer_research": "Augmented language models with external tools, such as TALM and Toolformer, are fine-tuned to learn how to use external tool APIs, expanding their capabilities beyond traditional language processing tasks. These models are trained to incorporate external tool API calls in order to improve the quality of their outputs, allowing them to perform tasks like speech recognition, machine translation, and information retrieval more effectively. By leveraging external tools, these models have the ability to access and utilize a wider range of resources and functionalities, enhancing their overall performance and versatility compared to regular language models.",
        "gold_answer_marketing": "Augmented language models with external tools differ from regular models by fine-tuning a LM to use external tool APIs, expanding the dataset to improve model outputs and enhancing tasks like speech recognition, machine translation, and natural language generation.",
    },
    97: {
        "question": "What can be inferred about the utilization of attention in neural networks?",
        "gold_answer_research": "Attention mechanisms in neural networks play a crucial role in allowing models to focus on specific parts of input data when making predictions or generating outputs. By assigning importance weights to different elements, such as pixels in an image or words in a sentence, attention helps the model to attend to relevant information and make more accurate predictions. The use of attention can improve the interpretability of neural networks by showing which parts of the input data are being focused on during the prediction process. Additionally, attention mechanisms, like multi-head attention, can enhance model performance by allowing the model to jointly attend to information from different representation subspaces at different positions.",
        "gold_answer_marketing": "Attention in neural networks allows the model to focus on specific parts of input data, such as images or text, in order to make predictions or generate output. It helps the model to learn relationships and correlations between different elements and improve performance in tasks like image captioning or language translation.",
    },
    101: {
        "question": "Can the use of attention mechanisms in deep learning models be applied to both machine translation and computer vision?",
        "gold_answer_research": "Yes, attention mechanisms in deep learning models have shown success in both machine translation and computer vision tasks. In machine translation, attention allows the model to capture dependencies between source and target sequences regardless of distance, leading to improved translation quality. Similarly, in computer vision, attention mechanisms have been used to focus on relevant parts of an image during caption generation, showcasing the ability to handle details and global dependencies effectively. Therefore, utilizing attention in both domains can enhance the performance of deep learning models significantly.",
        "gold_answer_marketing": "Yes, attention mechanisms in deep learning models can be applied to both machine translation and computer vision.",
    },
    102: {
        "question": "What are the potential benefits of incorporating self-attention mechanisms into Generative Adversarial Networks (GANs)?",
        "gold_answer_research": "Incorporating self-attention mechanisms into GANs can help the generator and discriminator better model relationships between spatial regions, leading to improved generation of detailed and realistic images. This is particularly useful for capturing global dependencies and enhancing the performance of transformer architectures. Additionally, self-attention can enable the model to assess its own predictions after each generated segment, allowing for customizable decoding algorithms to meet specific constraints or user preferences. Overall, self-attention in GANs can enhance detail handling and overall performance.",
        "gold_answer_marketing": "Incorporating self-attention mechanisms into GANs can help the generator and discriminator better model relationships between spatial regions, leading to improved performance in handling details and capturing global dependencies.",
    },
    103: {
        "question": "How does the transformer model variate from traditional sequence-aligned recurrent architectures?",
        "gold_answer_research": "The transformer model differs from traditional sequence-aligned recurrent architectures by not having a recurrent or convolutional structure. Instead, it heavily relies on self-attention mechanisms for processing sequences. This lack of recurrence and convolution, even with positional encoding, weakly incorporates sequential order, which can be a drawback for tasks sensitive to positional dependencies. Additionally, the transformer's architecture includes embedding layers, sinusoid-wave-based positional encoding, and softmax and linear layers in the final decoder output to maintain position information and facilitate processing of long sequences efficiently.",
        "gold_answer_marketing": "The transformer model differs from traditional sequence-aligned recurrent architectures by not having a recurrent or convolutional structure, and instead making heavy use of self-attention. This allows for handling very long sequences efficiently and achieving better performance on tasks involving long texts.",
    },
    104: {
        "question": "What implications does the concept of a Neural Turing Machine have for the theoretical power of neural networks?",
        "gold_answer_research": "The concept of a Neural Turing Machine (NTM) expands the theoretical power of neural networks by incorporating external memory storage, allowing for more complex computations and tasks. This mimics the Turing machine tape, enabling the neural network to control operation heads for reading and writing to the tape. However, the finite memory in NTM suggests it may resemble more of a 'Neural von Neumann Machine,' limiting its mathematical limitlessness seen in traditional Turing machines. Overall, the addition of external memory in NTM enhances the capabilities and potential applications of neural networks in solving more advanced problems.",
        "gold_answer_marketing": "The concept of a Neural Turing Machine suggests that neural networks can be equipped with external memory storage for more complex operations, potentially increasing their theoretical power.",
    },
}


test_questions = {
    4: {
        "question": "When was the transformer architecture introduced, and by which organization?"
    },
    5: {
        "question": "How has the accessibility of powerful language models, such as GPT-3 and GPT-4, been controlled by their developers?"
    },
    6: {
        "question": "What benchmarks or ratings are used to compare the capabilities of different language models?"
    },
    10: {
        "question": "What are some of the primary applications for language models in technology and computing?"
    },
    14: {
        "question": "How are language models typically evaluated and what benchmarks are used for this purpose?"
    },
    15: {
        "question": "What datasets are available for evaluating language processing systems?"
    },
    21: {
        "question": "What collaborations with other companies have contributed to the development of Claude's capabilities?"
    },
    26: {
        "question": "According to DeepMind, how should the number of training tokens change relative to the model size?"
    },
    29: {"question": "How do the sizes of models in the Gopher family range?"},
    31: {
        "question": "What type of model architecture do the Gopher and Chinchilla families belong to?"
    },
    32: {
        "question": "Can you name the author who wrote the novels A Farewell to Arms and The Sun Also Rises?"
    },
    37: {
        "question": "What are the key advantages of InstructGPT models over GPT-3 models according to the findings in the research?"
    },
    40: {
        "question": "What metrics are used to compare the performance of different models on training and validation splits according to the document provided?"
    },
    42: {
        "question": "What types of evaluation metrics are commonly used to assess the accuracy of answers in AI-driven question and answer datasets?"
    },
    49: {
        "question": "What factors contribute to the performance improvement in retrieval-augmented language models compared to non-retrieval-augmented models?"
    },
    56: {
        "question": "What are the benchmarks used to evaluate the performance of the Deep Policy Optimization (DPO) method compared to other preference learning algorithms in the document provided?"
    },
    57: {
        "question": "What methodologies have been evaluated for training language models to align with human preferences, and how do they compare in terms of effectiveness?"
    },
    58: {
        "question": "What methods have been discussed in the literature for improving the alignment of language models with human preferences or feedback?"
    },
    66: {
        "question": "What are some of the evaluation metrics used for assessing different types of text generation tasks presented in the study?"
    },
    68: {
        "question": "Consider a document related to research in natural language processing or artificial intelligence. Can you name some of the recent topics or methods that have been discussed or introduced in the field according to the document?"
    },
    71: {
        "question": "What is the significance of using reflection tokens in a model like SELF-RAG?"
    },
    72: {
        "question": "How does the inclusion of selected context as opposed to appending all retrieved text spans impact computational cost during both training and inference times in language model generation tasks?"
    },
    77: {
        "question": "What are the benefits of modeling human biases in Human-Aware Loss Optimizations (HALOs), and how do they compare to non-HALOs on the same datasets?"
    },
    79: {
        "question": "What are the modifications made to the traditional Kahneman-Tversky model to adapt it for optimizing language model performance?"
    },
    83: {
        "question": "How does a model's ability to answer questions relate to its exposure to specific types of questions during training?"
    },
    90: {
        "question": "How can adding examples to a prompt affect the performance of language models?"
    },
    98: {
        "question": "What are the main components of a Neural Turing Machine (NTM) architecture?"
    },
    99: {
        "question": "How might a seq2seq model's limitations be addressed in natural language processing tasks?"
    },
    100: {
        "question": "What differentiates hard attention from soft attention in image processing algorithms?"
    },
}

###3.3 Running the RAG System

Let's have a quick look at the validation and test data:

In [52]:
validation_df: pd.DataFrame = pd.DataFrame(validation_questions_answers).transpose()
validation_df.to_pickle("mids_290_a5/val_df_base.pkl")
print(validation_df.shape)
validation_df.head()

(75, 3)


Unnamed: 0,question,gold_answer_research,gold_answer_marketing
0,What purpose do large language models serve in...,Large language models (LLMs) serve the purpose...,Large language models serve the purpose of imp...
1,How does a large language model learn from tex...,A large language model learns from text during...,A large language model learns from text during...
2,What are some key architectures behind the dev...,Key architectures behind the development of la...,Key architectures behind the development of la...
3,Can you name some specific large language mode...,Some specific large language models include GP...,"Chinchilla by DeepMind, GPT-3 by OpenAI."
7,What licensing models have been adopted for th...,"Based on the provided context, it seems that l...",Answer: Some organizations choose open-sourcin...


In [53]:
test_df: pd.DataFrame = pd.DataFrame(test_questions).transpose()
print(test_df.shape)
test_df.head()

(29, 1)


Unnamed: 0,question
4,When was the transformer architecture introduc...
5,How has the accessibility of powerful language...
6,What benchmarks or ratings are used to compare...
10,What are some of the primary applications for ...
14,How are language models typically evaluated an...


Let's now use the data to ask questions against it. So we need to define our prompt templates, the RAG Chain, etc.

We have two types of User Personas we need to support:

1. The engineers, who require pretty detailed information when they ask questions  
2. The marketing team and supporting staff who also will ask questions around GenAI in order to better understand the products and the field as a whole, but a lot more high level answers would likely be in order

**Below, please build your RAG pipeline including the relevant prompts. This is free form so you will need to create your own cells, text documentation as you need, etc.**

---

In [54]:
# Verify vectorstore contains relevant info
pd.DataFrame(
    qdrant_vectorstore.client.get_collection(collection_name="rag_tech_db").dict()
)

Unnamed: 0,status,optimizer_status,vectors_count,indexed_vectors_count,points_count,segments_count,config,payload_schema
params,green,ok,8041,0,8041,1,"{'vectors': {'size': 768, 'distance': 'Cosine'...",
hnsw_config,green,ok,8041,0,8041,1,"{'m': 16, 'ef_construct': 100, 'full_scan_thre...",
optimizer_config,green,ok,8041,0,8041,1,"{'deleted_threshold': 0.2, 'vacuum_min_vector_...",
wal_config,green,ok,8041,0,8041,1,"{'wal_capacity_mb': 32, 'wal_segments_ahead': 0}",
quantization_config,green,ok,8041,0,8041,1,,


The number of vectors should be approximately equivalent to the number of splits. This seems to add up, since 14k of the splits would come from `arxiv`

In [55]:
# Check if documents from each source are present in the vectorstore
print(qdrant_vectorstore.similarity_search("arxiv.org")[0])
print(qdrant_vectorstore.similarity_search("wikipedia.org large language models")[0])
print(qdrant_vectorstore.similarity_search("lilianweng.github.io")[0])

page_content='arXiv:2104.09864v5  [cs.CL]  8 Nov 2023' metadata={'source': 'https://arxiv.org/pdf/2104.09864.pdf', 'file_path': 'https://arxiv.org/pdf/2104.09864.pdf', 'page': 0, 'total_pages': 14, 'format': 'PDF 1.5', 'title': 'RoFormer', 'author': '', 'subject': '', 'keywords': '', 'creator': 'LaTeX with hyperref', 'producer': 'pdfTeX-1.40.25', 'creationDate': 'D:20231109011924Z', 'modDate': 'D:20231109011924Z', 'trapped': '', 'page_num': 0, 'doc_num': 3, 'doc_source': 'ArXiv', 'split_id': 726, '_id': '732f8b3bbefc4b58a908ef256db6783a', '_collection_name': 'rag_tech_db'}
page_content='=== Large language models ===' metadata={'title': 'Language model', 'summary': 'A language model is a probabilistic model of a natural language. In 1980, the first significant statistical language model was proposed, and during the decade IBM performed ‘Shannon-style’ experiments, in which potential sources for language modeling improvement were identified by observing and analyzing the performance of h

## Iteration 1: Baseline - Naive Solution

We first try to use the existing `rag_chain` with the basic prompt of providing context and a question to see how well model's answers perform compared to the validation set. This will behave as a baseline from where we can improve.

In [None]:
# @title Mistral Raw Response
validation_df["naive_response_raw"] = validation_df["question"].progress_apply(
    rag_chain.invoke
)
validation_df["naive_answer"] = (
    validation_df["naive_response_raw"]
    .dropna()
    .apply(lambda cell: cell.split("[/INST]")[-1])
)

In [None]:
# @title Cohere Raw Response
validation_df["naive_response_raw_cohere"] = validation_df["question"].progress_apply(
    cohere_rag_chain.invoke
)

  0%|          | 0/75 [00:00<?, ?it/s]

In [None]:
# Save checkpoint of data after iteration 1
validation_df.to_pickle(
    f"mids_290_a5/val_df_iteration_1_CHUNK_{CHUNK_SIZE}_OVR_{OVERLAP}.pkl"
)

##4. Tests & Evaluations

Here you should evaluate the results. First, you should define your evaluation metrics and then you should run evaluation tests. This is really your area, but key results to show are:

1) Your metrics of choice  
2) How  your various models compare to the labeled validation data.

Make sure you look at the results for the marketing team and the research team separately.

**Note:** You do not need to run all models against all labeled questions, as that may take some time. Just do that for a few models/configs, and test a larger set with a smaller subset.

**This is free form so you will need to create your own cells, text documentation as you need, etc.**

###4.1. Metrics

Please define and motivate your metrics here. Please feel free to add more text and code cells as needed.



We define the following metrics (more details in report):
* Cosine Similarity of Base Embeddings
* `BLEU` score
* `ROUGE` score
* `BERT` score

We compare samples with the baseline results - since that would be the poorest performing metric. We expect low numbers when using this sample on metrics.

In [None]:
validation_df: pd.DataFrame = pd.read_pickle(
    f"mids_290_a5/val_df_iteration_1_CHUNK_{CHUNK_SIZE}_OVR_{OVERLAP}.pkl"
)
print(validation_df.shape)
validation_df.head()

In [56]:
# @title Load Metric Evaluators
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
bertsc = evaluate.load("bertscore")

Downloading builder script:   0%|          | 0.00/5.94k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.55k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/3.34k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/7.95k [00:00<?, ?B/s]

In [None]:
# Declare Sample Data
x = validation_df.iloc[0]["gold_answer_research"]
y = validation_df.iloc[0]["naive_answer"]
z = validation_df.iloc[0]["naive_response_raw_cohere"]

print(f"{x=}")
print(f"{y=}")
print(f"{z=}")

x='Large language models (LLMs) serve the purpose of enabling general-purpose language generation and other natural language processing tasks such as classification. They achieve this by learning statistical relationships from text documents during computationally intensive self-supervised and semi-supervised training. LLMs can be used for text generation by predicting the next token or word, making them valuable for tasks like speech recognition, machine translation, and information retrieval. Additionally, LLMs have superseded previous models like recurrent neural networks, showcasing their efficiency and effectiveness in NLP tasks.'
y=' In the field of natural language processing, large language models serve as controllers for building agents that can engage in deliberative problem solving by leveraging their ability to process and generate text. They are capable of learning from vast amounts of data, enabling them to understand complex language patterns and make accurate prediction

In [None]:
# @title BLEU Sample
print(bleu.compute(predictions=[y], references=[x]))
print(bleu.compute(predictions=[z], references=[x]))

{'bleu': 0.0, 'precisions': [0.39285714285714285, 0.04819277108433735, 0.012195121951219513, 0.0], 'brevity_penalty': 0.8877655252065778, 'length_ratio': 0.8936170212765957, 'translation_length': 84, 'reference_length': 94}
{'bleu': 0.05691366651073745, 'precisions': [0.20948616600790515, 0.0873015873015873, 0.035856573705179286, 0.016], 'brevity_penalty': 1.0, 'length_ratio': 2.6914893617021276, 'translation_length': 253, 'reference_length': 94}


In [None]:
# @title Rouge Sample
print(rouge.compute(predictions=[y], references=[x]))
print(rouge.compute(predictions=[z], references=[x]))

{'rouge1': 0.34355828220858897, 'rouge2': 0.049689440993788817, 'rougeL': 0.15950920245398773, 'rougeLsum': 0.15950920245398773}
{'rouge1': 0.2977346278317152, 'rouge2': 0.10423452768729642, 'rougeL': 0.18122977346278318, 'rougeLsum': 0.22006472491909382}


In [None]:
# @title BERT Score Sample
print(bertsc.compute(predictions=[y], references=[x], lang="en"))
print(bertsc.compute(predictions=[z], references=[x], lang="en"))

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'precision': [0.8741947412490845], 'recall': [0.8582159280776978], 'f1': [0.866131603717804], 'hashcode': 'roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.40.0.dev0)'}
{'precision': [0.8594847917556763], 'recall': [0.8865609169006348], 'f1': [0.8728128671646118], 'hashcode': 'roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.40.0.dev0)'}


In [None]:
# @title Embeddings Cosine Similarity Sample
csxy = cosine_similarity(
    [np.array(base_embeddings.embed_query(y))],
    [np.array(base_embeddings.embed_query(x))],
)[0][0]

csxz = cosine_similarity(
    [np.array(base_embeddings.embed_query(z))],
    [np.array(base_embeddings.embed_query(x))],
)[0][0]

print(f"{csxy=:.2f}")
print(f"{csxz=:.2f}")

csxy=0.75
csxz=0.90


Based on a sample test of few metrics on the baseline response vs. gold responses -

1. `Cohere` almost always performs better (which might be expected)
2. `Cosine Similarity` and `BERT Score` tend to have much higher values, which would make threshold setting a lot harder, since the minimum for the baseline consists of already high scores.

###4.2. Evaluation Comparisons

Document your key runs here. Feel free to add more text and code cells as needed.

In [57]:
# @title Evaluate Baseline (Iteration 1)

gold_col_acronyms: dict[str, str] = {
    "gold_answer_research": "eng",
    "gold_answer_marketing": "mkt",
}


def assign_metrics(
    dataframe: pd.DataFrame, iteration: int, col_names_iter: dict
) -> pd.DataFrame:
    df = dataframe.copy()

    for model_choice in ["mistral", "cohere"]:
        for ref_answer in ["gold_answer_research", "gold_answer_marketing"]:
            acro = gold_col_acronyms[ref_answer]
            df[
                f"{model_choice}_bleu_{gold_col_acronyms[ref_answer]}_it_{iteration}"
            ] = df.apply(
                lambda row: bleu.compute(
                    references=[row[ref_answer]],
                    predictions=[row[col_names_iter[acro][model_choice]]],
                )["bleu"],
                axis=1,
            )

            df[
                f"{model_choice}_rouge_{gold_col_acronyms[ref_answer]}_it_{iteration}"
            ] = df.progress_apply(
                lambda row: rouge.compute(
                    references=[row[ref_answer]],
                    predictions=[row[col_names_iter[acro][model_choice]]],
                )["rougeL"],
                axis=1,
            )

            df[
                f"{model_choice}_bert_{gold_col_acronyms[ref_answer]}_it_{iteration}"
            ] = df.progress_apply(
                lambda row: bertsc.compute(
                    references=[row[ref_answer]],
                    predictions=[row[col_names_iter[acro][model_choice]]],
                    lang="en",
                )["f1"][0],
                axis=1,
            )
    return df

In [None]:
validation_df: pd.DataFrame = pd.read_pickle(
    f"mids_290_a5/val_df_iteration_1_CHUNK_{CHUNK_SIZE}_OVR_{OVERLAP}.pkl"
)
validation_df = assign_metrics(
    validation_df,
    iteration=1,
    col_names_iter={
        "eng": {"mistral": "naive_answer", "cohere": "naive_response_raw_cohere"},
        "mkt": {"mistral": "naive_answer", "cohere": "naive_response_raw_cohere"},
    },
)
validation_df[
    [c for c in validation_df.columns if "cohere" in c or "mistral" in c]
].describe().transpose().round(2)

  0%|          | 0/75 [00:00<?, ?it/s]

  0%|          | 0/75 [00:00<?, ?it/s]

  0%|          | 0/75 [00:00<?, ?it/s]

  0%|          | 0/75 [00:00<?, ?it/s]

  0%|          | 0/75 [00:00<?, ?it/s]

  0%|          | 0/75 [00:00<?, ?it/s]

  0%|          | 0/75 [00:00<?, ?it/s]

  0%|          | 0/75 [00:00<?, ?it/s]

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
mistral_bleu_eng_it_1,75.0,0.08,0.06,0.0,0.03,0.07,0.12,0.29
mistral_rouge_eng_it_1,75.0,0.22,0.06,0.04,0.18,0.21,0.26,0.44
mistral_bert_eng_it_1,75.0,0.87,0.02,0.81,0.86,0.87,0.88,0.91
mistral_bleu_mkt_it_1,75.0,0.07,0.09,0.0,0.0,0.05,0.1,0.44
mistral_rouge_mkt_it_1,75.0,0.22,0.11,0.0,0.14,0.22,0.27,0.64
mistral_bert_mkt_it_1,75.0,0.87,0.02,0.81,0.85,0.87,0.89,0.93
cohere_bleu_eng_it_1,75.0,0.06,0.06,0.0,0.02,0.05,0.08,0.43
cohere_rouge_eng_it_1,75.0,0.2,0.08,0.07,0.14,0.19,0.23,0.67
cohere_bert_eng_it_1,75.0,0.86,0.02,0.82,0.84,0.86,0.87,0.91
cohere_bleu_mkt_it_1,75.0,0.04,0.06,0.0,0.0,0.02,0.05,0.43


In [None]:
validation_df.to_pickle(
    f"mids_290_a5/val_df_iteration_1_METRICS_CHUNK_{CHUNK_SIZE}_OVR_{OVERLAP}.pkl"
)
validation_df.describe().transpose().round(2).to_pickle(
    f"mids_290_a5/val_df_iteration_1_DESCRIBE_METRICS_CHUNK_{CHUNK_SIZE}_OVR_{OVERLAP}.pkl"
)

### Iteration 2: Prompt Engineering

In this iteration, we will use separate prompts for engineering and marketing. We will first construct the prompts by inspection of the validation dataset. And then, we will add keywords like `arxiv` for engineering team, and `wikipedia` for marketing so that they get relevant results.

In [58]:
# @title Distributions of Tokens for Engineering and Marketing Gold Responses
print(validation_df["gold_answer_marketing"].str.split().apply(len).describe())
print(validation_df["gold_answer_research"].str.split().apply(len).describe())

count    75.000000
mean     37.120000
std      15.507592
min       6.000000
25%      27.000000
50%      34.000000
75%      47.500000
max      75.000000
Name: gold_answer_marketing, dtype: float64
count     75.000000
mean      85.853333
std       15.619369
min       16.000000
25%       81.000000
50%       87.000000
75%       94.500000
max      113.000000
Name: gold_answer_research, dtype: float64


* Engineering has on average `80-100` tokens per gold response
* Marketing has on average `30-50` tokens per gold response

In [59]:
# @title Declare Mistral Pipeline for Prompting - Less Temperature (Fix Context)
mistral_prompt_pipe = pipeline(
    "text-generation",
    model=llm_mistral_model,
    tokenizer=llm_mistral_tokenizer,
    max_length=1000,
    temperature=0.1,
    top_p=0.95,
    do_sample=True,
    repetition_penalty=1.2,
)
mistral_prompt_pipe.model.config.pad_token_id = (
    mistral_prompt_pipe.model.config.eos_token_id
)

mistral_prompt_lc = HuggingFacePipeline(pipeline=mistral_prompt_pipe)

In [60]:
# @title Marketing Prompt
rag_2_mkt_template = """
[INST]
Please answer the `question` below only based on the `context` information provided, and the rules mentioned below.
You are a chatbot for knowledge search on "Large Language Models" and "Generative AI".
The target audience for this is marketing professionals.
Provide a detailed response not exceeding sixty words and reduce use of adjectives.

Here is a `context`: {context}
Here is a `question`: {question}.
[/INST]"""

rag_2_mkt_prompt = ChatPromptTemplate.from_template(rag_2_mkt_template)

rag_2_sample_q: str = validation_df.iloc[1]["question"]
gold_2_mkt: str = validation_df.iloc[1]["gold_answer_marketing"]

In [61]:
# @title Mistral Marketing Sample
rag_2_mistral_mkt_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_2_mkt_prompt
    | mistral_prompt_lc
)

response_2_mkt_mistral = rag_2_mistral_mkt_chain.invoke(rag_2_sample_q).split(
    "[/INST]"
)[-1]

print("Mistral Response:", response_2_mkt_mistral)
len(response_2_mkt_mistral.split())
print()
print("Gold:", gold_2_mkt)

print("Mistral Iteration 2 Sample Metrics - Marketing")
print(bleu.compute(references=[gold_2_mkt], predictions=[response_2_mkt_mistral]))
print(rouge.compute(references=[gold_2_mkt], predictions=[response_2_mkt_mistral]))
print(
    bertsc.compute(
        references=[gold_2_mkt], predictions=[response_2_mkt_mistral], lang="en"
    )
)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Mistral Response:  A large language model learns from text during training through a process called self-attention, which allows it to weigh the importance of different parts of the input text when generating output. This enables the model to capture complex relationships between words and phrases, as well as contextual clues that help it generate coherent and accurate responses.

Gold: A large language model learns from text during training by first pretraining on a diverse dataset to acquire general language knowledge, and then fine-tuning on specific tasks or demonstrations to adapt its parameters for more targeted performance.
Mistral Iteration 2 Sample Metrics - Marketing
{'bleu': 0.14405591864498601, 'precisions': [0.25, 0.13559322033898305, 0.1206896551724138, 0.10526315789473684], 'brevity_penalty': 1.0, 'length_ratio': 1.5384615384615385, 'translation_length': 60, 'reference_length': 39}
{'rouge1': 0.2736842105263158, 'rouge2': 0.17204301075268819, 'rougeL': 0.2526315789473684

tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/482 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.42G [00:00<?, ?B/s]

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'precision': [0.8771520853042603], 'recall': [0.8769662380218506], 'f1': [0.8770591616630554], 'hashcode': 'roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.40.0.dev0)'}


In [62]:
# @title Cohere Marketing Sample
rag_2_cohere_mkt_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_2_mkt_prompt
    | cohere_chat_model
    | output_parser
)

response_2_mkt_cohere = rag_2_cohere_mkt_chain.invoke(rag_2_sample_q)

print("Cohere Response:", response_2_mkt_cohere)
print()
print("Gold:", gold_2_mkt)

print("Cohere Iteration 2 Sample Metrics - Marketing")
print(bleu.compute(references=[gold_2_mkt], predictions=[response_2_mkt_cohere]))
print(rouge.compute(references=[gold_2_mkt], predictions=[response_2_mkt_cohere]))
print(
    bertsc.compute(
        references=[gold_2_mkt], predictions=[response_2_mkt_cohere], lang="en"
    )
)

Cohere Response: Large language models are trained on vast amounts of text data, allowing them to learn patterns and memorize factual knowledge. This process involves unsupervised learning, where the model identifies patterns and relationships in the data without explicit guidance. The models' capacity to learn from text is due to their ability to memorize and generalize, forming the basis for their language understanding and generation capabilities.

Gold: A large language model learns from text during training by first pretraining on a diverse dataset to acquire general language knowledge, and then fine-tuning on specific tasks or demonstrations to adapt its parameters for more targeted performance.
Cohere Iteration 2 Sample Metrics - Marketing
{'bleu': 0.0, 'precisions': [0.18571428571428572, 0.014492753623188406, 0.0, 0.0], 'brevity_penalty': 1.0, 'length_ratio': 1.794871794871795, 'translation_length': 70, 'reference_length': 39}
{'rouge1': 0.23529411764705882, 'rouge2': 0.04, 'ro

We see that both the `rougeL` and `bleu` scores have improved from the baseline - showcasing that the prompt appears to improve the model performance. We will now run it en-batch across all validation points

In [63]:
# @title Engineering Prompt and Model Declarations
rag_2_eng_template = """
[INST]
Please answer the `question` below only based on the `context` information provided, and the rules mentioned below.
You are a chatbot for knowledge search on "Large Language Models" and "Generative AI".
The target audience for this is engineering and resarch professionals. You can dive deep into technical details.
Provide a detailed response not exceeding one hundred words and reduce use of adjectives.
Prefer responses from arxiv.org

Here is a `context`: {context}
Here is a `question`: {question}.
[/INST]"""

rag_2_eng_prompt = ChatPromptTemplate.from_template(rag_2_eng_template)

rag_2_mistral_eng_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_2_eng_prompt
    | mistral_prompt_lc
)

rag_2_cohere_eng_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_2_eng_prompt
    | cohere_chat_model
    | output_parser
)

In [103]:
validation_df = pd.read_pickle("mids_290_a5/val_df_base.pkl")

In [None]:
# @title Mistral Raw Marketing Response - Iteration 2
validation_df["mistral_prompt_response_mkt_raw"] = validation_df[
    "question"
].progress_apply(rag_2_mistral_mkt_chain.invoke)
validation_df["mistral_prompt_answer_mkt"] = (
    validation_df["mistral_prompt_response_mkt_raw"]
    .dropna()
    .apply(lambda cell: cell.split("[/INST]")[-1])
)

In [None]:
# @title Mistral Raw Engineering Response - Iteration 2
validation_df["mistral_prompt_response_eng_raw"] = validation_df[
    "question"
].progress_apply(rag_2_mistral_eng_chain.invoke)
validation_df["mistral_prompt_answer_eng"] = (
    validation_df["mistral_prompt_response_eng_raw"]
    .dropna()
    .apply(lambda cell: cell.split("[/INST]")[-1])
)

In [106]:
# @title Cohere Raw Engineering Response - Iteration 2
def safe_call_cohere(chain, q: str) -> str:
    """Cohere has a rate limit of 20 rpm"""
    time.sleep(3)
    return chain.invoke(q)


validation_df["cohere_prompt_response_eng_raw"] = validation_df[
    "question"
].progress_apply(lambda cell: safe_call_cohere(rag_2_cohere_eng_chain, cell))

  0%|          | 0/75 [00:00<?, ?it/s]

In [107]:
# @title Cohere Raw Marketing Response - Iteration 2
validation_df["cohere_prompt_response_mkt_raw"] = validation_df[
    "question"
].progress_apply(lambda cell: safe_call_cohere(rag_2_cohere_mkt_chain, cell))

  0%|          | 0/75 [00:00<?, ?it/s]

In [108]:
# Save checkpoint of data after iteration 2
validation_df.to_pickle(
    f"mids_290_a5/val_df_iteration_2_CHUNK_{CHUNK_SIZE}_OVR_{OVERLAP}.pkl"
)

In [109]:
print(validation_df.shape)
validation_df.head()

(75, 9)


Unnamed: 0,question,gold_answer_research,gold_answer_marketing,mistral_prompt_response_mkt_raw,mistral_prompt_answer_mkt,mistral_prompt_response_eng_raw,mistral_prompt_answer_eng,cohere_prompt_response_eng_raw,cohere_prompt_response_mkt_raw
0,What purpose do large language models serve in...,Large language models (LLMs) serve the purpose...,Large language models serve the purpose of imp...,Human: \n[INST]\nPlease answer the `question` ...,Large language models serve as powerful tools...,Human: \n[INST]\nPlease answer the `question` ...,Large language models (LLMs) are used to gene...,Large language models (LLMs) have become integ...,Large language models (LLMs) are an essential ...
1,How does a large language model learn from tex...,A large language model learns from text during...,A large language model learns from text during...,Human: \n[INST]\nPlease answer the `question` ...,Large language models learn through a process...,Human: \n[INST]\nPlease answer the `question` ...,Large language models learn from text through...,Large language models learn from text through ...,Large Language Models (LLMs) learn by ingestin...
2,What are some key architectures behind the dev...,Key architectures behind the development of la...,Key architectures behind the development of la...,Human: \n[INST]\nPlease answer the `question` ...,Some key architectures include transformer-ba...,Human: \n[INST]\nPlease answer the `question` ...,The development of large language models invo...,Some key architectures that have played a sign...,Some key architectures that have driven the de...
3,Can you name some specific large language mode...,Some specific large language models include GP...,"Chinchilla by DeepMind, GPT-3 by OpenAI.",Human: \n[INST]\nPlease answer the `question` ...,Some specific large language models include B...,Human: \n[INST]\nPlease answer the `question` ...,Some specific large language models include B...,Some prominent examples of large language mode...,Some prominent examples of large language mode...
7,What licensing models have been adopted for th...,"Based on the provided context, it seems that l...",Answer: Some organizations choose open-sourcin...,Human: \n[INST]\nPlease answer the `question` ...,Licensing models for source-available languag...,Human: \n[INST]\nPlease answer the `question` ...,There has been no specific licensing model ad...,The licensing of source-available language mod...,The licensing models adopted for distributing ...


In [110]:
validation_df: pd.DataFrame = pd.read_pickle(
    f"mids_290_a5/val_df_iteration_2_CHUNK_{CHUNK_SIZE}_OVR_{OVERLAP}.pkl"
)
validation_df = assign_metrics(
    validation_df,
    iteration=2,
    col_names_iter={
        "eng": {
            "mistral": "mistral_prompt_answer_eng",
            "cohere": "cohere_prompt_response_eng_raw",
        },
        "mkt": {
            "mistral": "mistral_prompt_answer_mkt",
            "cohere": "cohere_prompt_response_mkt_raw",
        },
    },
)
validation_df.describe().transpose().round(2)

  0%|          | 0/75 [00:00<?, ?it/s]

  0%|          | 0/75 [00:00<?, ?it/s]

  0%|          | 0/75 [00:00<?, ?it/s]

  0%|          | 0/75 [00:00<?, ?it/s]

  0%|          | 0/75 [00:00<?, ?it/s]

  0%|          | 0/75 [00:00<?, ?it/s]

  0%|          | 0/75 [00:00<?, ?it/s]

  0%|          | 0/75 [00:00<?, ?it/s]

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
mistral_bleu_eng_it_2,75.0,0.1,0.08,0.0,0.05,0.1,0.14,0.3
mistral_rouge_eng_it_2,75.0,0.26,0.07,0.12,0.2,0.25,0.3,0.45
mistral_bert_eng_it_2,75.0,0.88,0.01,0.85,0.87,0.89,0.9,0.91
mistral_bleu_mkt_it_2,75.0,0.11,0.12,0.0,0.0,0.09,0.15,0.56
mistral_rouge_mkt_it_2,75.0,0.3,0.12,0.05,0.21,0.29,0.38,0.72
mistral_bert_mkt_it_2,75.0,0.89,0.02,0.84,0.88,0.89,0.91,0.96
cohere_bleu_eng_it_2,75.0,0.08,0.07,0.0,0.05,0.09,0.12,0.38
cohere_rouge_eng_it_2,75.0,0.26,0.06,0.14,0.22,0.25,0.29,0.49
cohere_bert_eng_it_2,75.0,0.88,0.02,0.84,0.87,0.88,0.89,0.93
cohere_bleu_mkt_it_2,75.0,0.09,0.09,0.0,0.0,0.08,0.14,0.43


In [111]:
validation_df.to_pickle(
    f"mids_290_a5/val_df_iteration_2_METRICS_CHUNK_{CHUNK_SIZE}_OVR_{OVERLAP}.pkl"
)
validation_df.describe().transpose().round(2).to_pickle(
    f"mids_290_a5/val_df_iteration_2_DESCRIBE_METRICS_CHUNK_{CHUNK_SIZE}_OVR_{OVERLAP}.pkl"
)

##5. Results

###5.1 Model Specifications

Document the detailed specs of your choices. Also comment on how you valued the needs of the marketing tean vs the needs of the researchers, in case you had to make a trade-off.


> Marketing and Engineering got separate chains per model, with separate prompts. The Engineering prompt valued technical details and a higher word limit, while marketing used a lower word count limit reducing technical details.

In [73]:
{
    "mistral_eng": rag_2_mistral_eng_chain,
    "mistral_mkt": rag_2_mistral_mkt_chain,
}

{'mistral_eng': {
   context: VectorStoreRetriever(tags=['Qdrant', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorstores.qdrant.Qdrant object at 0x7d55fafc0820>)
            | RunnableLambda(format_docs),
   question: RunnablePassthrough()
 }
 | ChatPromptTemplate(input_variables=['context', 'question'], messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template='\n[INST]\nPlease answer the `question` below only based on the `context` information provided, and the rules mentioned below.\nYou are a chatbot for knowledge search on "Large Language Models" and "Generative AI".\nThe target audience for this is engineering and resarch professionals. You can dive deep into technical details.\nProvide a detailed response not exceeding one hundred words and reduce use of adjectives.\nPrefer responses from arxiv.org\n\nHere is a `context`: {context}\nHere is a `question`: {question}.\n[/INST]'))])
 | HuggingFacePipeline(pipeline


###5.2 Some Test Questions

**QUESTIONS:**


Please study the answers generated by your chosen setup for these specific test questions:

1. "What purpose do large language models serve in the field of natural language processing?" (Question 0)

2. "What methods are typically employed to create training data for embedding models that use task-specific instructions?" (Question 50)

3. "How does a model's ability to answer questions relate to its exposure to specific types of questions during training?" (Question 83, no labeled answers)

For each of the three questions above please provide:

a) The RAG results (research and marketing response)  
b) The context provided  
c) The document sources for the context  
d) Also discuss your metric(s) for the first two examples (for both responses) compared to the gold responses

Then, for questions 1 and 2, comment on how well you feel your metrics captured the differences and similarities between your answer and the gold answer?

Put your answers to these questions into the answers file as you have done on previous assignments. Please consult the answer file for further details.

####5.2.1 Test Question 1

Please run the query:








In [65]:
tq1: str = validation_df.loc[0]["question"]
print(tq1)
tq1_eng_response: str = rag_2_mistral_eng_chain.invoke(tq1)
tq1_mkt_response: str = rag_2_mistral_mkt_chain.invoke(tq1)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [68]:
print("A: Engineering Response", tq1_eng_response.split("[/INST]")[-1])
print("A: Marketing Response", tq1_mkt_response.split("[/INST]")[-1])

A: Engineering Response  Large language models serve as powerful tools in the field of natural language processing (NLP) by enabling various language tasks such as classification, summarization, question-answering, creative writing, and dialogue generation. These models are pre-trained on vast amounts of unsupervised textual data and possess the ability to memorize factual knowledge through their parameter weights. As a result, they can be used to perform a wide range of NLP tasks with high accuracy and efficiency.

A: Marketing Response  Large language models serve as powerful tools in natural language processing by enabling various language tasks such as classification, summarization, question-answering, creative writing, and dialogue. They achieve this through being pre-trained on vast amounts of unsupervised textual data and possessing sufficient parameters to store factual knowledge.


In [71]:
print("B,C: Engineering Context")
print(tq1_eng_response.split("`question`:")[0])

Human: 
[INST]
Please answer the `question` below only based on the `context` information provided, and the rules mentioned below.
You are a chatbot for knowledge search on "Large Language Models" and "Generative AI".
The target audience for this is engineering and resarch professionals. You can dive deep into technical details.
Provide a detailed response not exceeding one hundred words and reduce use of adjectives.
Prefer responses from arxiv.org

Here is a `context`: === Large language models ===

limitations, and societal impact of large language models. arXiv preprint arXiv:2102.02503.
Thoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H.-T., Jin, A., Bos,

among the largest language models today and we apply them on a wide range of language tasks,
including classiﬁcation, summarization, question-answering, creative writing, dialogue, and others.

Big language models have been pre-trained on a large collection of unsupervised textual corpus. Given enoug

In [72]:
print("B,C: Marketing Context")
print(tq1_mkt_response.split("`question`:")[0])

B: Marketing Context
Human: 
[INST]
Please answer the `question` below only based on the `context` information provided, and the rules mentioned below.
You are a chatbot for knowledge search on "Large Language Models" and "Generative AI".
The target audience for this is marketing professionals.
Provide a detailed response not exceeding sixty words and reduce use of adjectives.

Here is a `context`: === Large language models ===

limitations, and societal impact of large language models. arXiv preprint arXiv:2102.02503.
Thoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H.-T., Jin, A., Bos,

among the largest language models today and we apply them on a wide range of language tasks,
including classiﬁcation, summarization, question-answering, creative writing, dialogue, and others.

Big language models have been pre-trained on a large collection of unsupervised textual corpus. Given enough parameters, these models are able to memorize some factual knowledge wi

In [77]:
print("D: Marketing Metrics")
print(
    "BLEU",
    bleu.compute(
        predictions=[tq1_mkt_response.split("[/INST]")[-1].strip()],
        references=[validation_df.loc[0]["gold_answer_marketing"]],
    )["bleu"],
)
print(
    "ROUGE",
    rouge.compute(
        predictions=[tq1_mkt_response.split("[/INST]")[-1].strip()],
        references=[validation_df.loc[0]["gold_answer_marketing"]],
    )["rougeL"],
)
print(
    "BERT",
    bertsc.compute(
        predictions=[tq1_mkt_response.split("[/INST]")[-1].strip()],
        references=[validation_df.loc[0]["gold_answer_marketing"]],
        lang="en",
    )["f1"][0],
)

D: Marketing Metrics
BLEU 0.08966592262979808
ROUGE 0.2926829268292683
BERT 0.8874030709266663


In [78]:
print("D: Engineering Metrics")
print(
    "BLEU",
    bleu.compute(
        predictions=[tq1_eng_response.split("[/INST]")[-1].strip()],
        references=[validation_df.loc[0]["gold_answer_research"]],
    )["bleu"],
)
print(
    "ROUGE",
    rouge.compute(
        predictions=[tq1_eng_response.split("[/INST]")[-1].strip()],
        references=[validation_df.loc[0]["gold_answer_research"]],
    )["rougeL"],
)
print(
    "BERT",
    bertsc.compute(
        predictions=[tq1_eng_response.split("[/INST]")[-1].strip()],
        references=[validation_df.loc[0]["gold_answer_research"]],
        lang="en",
    )["f1"][0],
)

D: Engineering Metrics
BLEU 0.0736091550477613
ROUGE 0.2484472049689441
BERT 0.8888468742370605


* BERT Scored high on both engineering and marketing at `88`
* ROUGE is reasonable for both

####5.2.2 Test Question 2

Please run the query:

In [95]:
tq1: str = validation_df.loc[50]["question"]
print(tq1)
tq1_eng_response: str = rag_2_mistral_eng_chain.invoke(tq1)
tq1_mkt_response: str = rag_2_mistral_mkt_chain.invoke(tq1)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


What methods are typically employed to create training data for embedding models that use task-specific instructions?


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [96]:
print("A: Engineering Response", tq1_eng_response.split("[/INST]")[-1])
print("A: Marketing Response", tq1_mkt_response.split("[/INST]")[-1])

A: Engineering Response  To create training data for embedding models using task-specific instructions, several methods are typically employed. One approach is to combine multiple datasets with task instructions, as demonstrated by Wang et al. (2022) through their construction of the Multi-Task Embeddings Data with Instructions (MEDI) dataset. Another method involves fine-tuning the model on specific tasks or domains, which has been shown to address challenges such as training a single model on diverse datasets (Sanh et al., 2022; Zhou et al.). Additionally, providing explicit context in prompts can be necessary when completing tasks that require up-to-date knowledge or access to internal/private knowledge bases.
A: Marketing Response  To create training data for embedding models using task-specific instructions, methods such as fine-tuning and explicit prompting are often employed. Fine-tuning involves adjusting the model's parameters based on new task-specific data, while explicit pr

In [97]:
print("D: Marketing Metrics")
print(
    "BLEU",
    bleu.compute(
        predictions=[tq1_mkt_response.split("[/INST]")[-1].strip()],
        references=[validation_df.loc[50]["gold_answer_marketing"]],
    )["bleu"],
)
print(
    "ROUGE",
    rouge.compute(
        predictions=[tq1_mkt_response.split("[/INST]")[-1].strip()],
        references=[validation_df.loc[50]["gold_answer_marketing"]],
    )["rougeL"],
)
print(
    "BERT",
    bertsc.compute(
        predictions=[tq1_mkt_response.split("[/INST]")[-1].strip()],
        references=[validation_df.loc[50]["gold_answer_marketing"]],
        lang="en",
    )["f1"][0],
)

D: Marketing Metrics
BLEU 0.06043056408431284
ROUGE 0.20202020202020202
BERT 0.8703487515449524


In [98]:
print("D: Engineering Metrics")
print(
    "BLEU",
    bleu.compute(
        predictions=[tq1_eng_response.split("[/INST]")[-1].strip()],
        references=[validation_df.loc[50]["gold_answer_research"]],
    )["bleu"],
)
print(
    "ROUGE",
    rouge.compute(
        predictions=[tq1_eng_response.split("[/INST]")[-1].strip()],
        references=[validation_df.loc[50]["gold_answer_research"]],
    )["rougeL"],
)
print(
    "BERT",
    bertsc.compute(
        predictions=[tq1_eng_response.split("[/INST]")[-1].strip()],
        references=[validation_df.loc[50]["gold_answer_research"]],
        lang="en",
    )["f1"][0],
)

D: Engineering Metrics
BLEU 0.10532164022986051
ROUGE 0.25120772946859904
BERT 0.8725976347923279


####5.2.3 Test Question 3

Please run the query:

In [None]:
tq1: str = "How does a model's ability to answer questions relate to its exposure to specific types of questions during training?"
print(tq1)
tq1_eng_response: str = rag_2_mistral_eng_chain.invoke(tq1)
tq1_mkt_response: str = rag_2_mistral_mkt_chain.invoke(tq1)

In [91]:
print(tq1_eng_response.split("[/INST]")[-1])

 To create training data for embedding models using task-specific instructions, several methods are typically employed. One approach is to combine multiple datasets with task instructions, as demonstrated by Wang et al. (2022) through their construction of the Multi-Task Embeddings Data with Instructions (MEDI) dataset. Another method involves fine-tuning the model on specific tasks or domains, which has been shown to address challenges such as training a single model on diverse datasets (Sanh et al., 2022; Zhou et al.). Additionally, providing explicit context in prompts may be necessary when completing tasks that require up-to-date knowledge or access to internal/private knowledge bases.


In [92]:
print(tq1_mkt_response.split("[/INST]")[-1])

 To create training data for embedding models using task-specific instructions, methods such as combining multiple datasets from Super-NaturalInstructions (super-NI) and explicit provision of context through prompts are often used.


###5.3 Other Questions

Below are a few questions that you should think about. Please answer them in the answer file directly (in a short paragraph) and also see whether they may be relevant for your final write-up.

**QUESTION:**

5.3.a. How would you expect your response quality to change if you had a chunk size of `50`?

> Having tried a chunk size of `64`, the results were not as favourable as a larger chunk size between `128-256`. This could likely be because the particular subject matter in question (LLMs and Gen AI) may be more technical, and cannot be expressed in such a small chunk of characters, as you're almost always going to not capture the complete relevant information

5.3.b. How would you expect your response quality to change if you had a chunk size of 5000?

> I think a chunk size of `5000` would limit the number of documents the model can look at, since each split would contain 5000 characters in this case, rendering the breadth of search as being signficantly reduced. Larger chunks might improve detailed understanding of fewer documents but reduce the number of documents the system can consider simultaneously. Another issue with `5000` is that it might get close to the model's parameterized token limit

5.3.c. If you had time, how do you think fine-tuning of the LLM could help?  What type of data would you want for that? And which training approach would you take?

> Fine-tuning could enhance the LLM's ability to generate more relevant and context-specific answers, particularly for technical content or industry-specific jargon. We would use continued pretraining on the context from the gold validation set. This could be achieved through a training loop in `pytorch`, but continous retraining based on document updates may become tedious to maintain technically long term, as opposed to using a RAG.

5.3.d. What was your design philosophy  of the prompts? How did they differ between engineering and marketing support?

> The prompts were designed to elicit tailored responses suitable for the specific needs of the engineering and marketing departments. Engineering prompts were more technical, requiring detailed, longer responses, while marketing prompts were designed to be concise and to the point, focusing on clarity and accessibility. The word limit for each was deduced from a statistical distribution of the gold validation set answers.

5.3.e. What are your average and peak load estimates for the system? Given that, would you suggest a pay-per-use deployment or one that reserves the LLM?

> When starting out, and for internal use - a pay per use model may work better. This is because until adoption reaches a critical mass within the company, it's preferable to have it as pay-per-use / serverless. When the product goes customer-facing, it might be more prudent to reserve a managed LLM since it offer more predictable cost at scale.

5.3.f. What type of limitations/risks would you see in using this system?

> * `Quality and Freshness of Data`: The performance heavily depends on the quality and timeliness of the data fed into the system. Outdated or low-quality data can lead to irrelevant or incorrect answers.

> * `Model Bias and Safety`: There is a risk of generating biased or unsafe responses, especially if the underlying LLM has not been adequately trained to handle sensitive topics safely.

> * `System Complexity and Maintenance`: The sophisticated infrastructure might require significant maintenance and technical expertise, which could be a challenge for the operational team
