# Evaluation of RAG Using Ragas

In this notebook, we'll explore how to evaluate RAG pipelines using a powerful open-source tool called "Ragas". It lets us compute component-wise metrics, as well as end-to-end metrics, for the performance of our RAG pipelines.

We'll complete the following tasks:

- 🤝 Breakout Room #1:
  1. Install required libraries
  2. Set Environment Variables
  3. Creating a simple RAG pipeline with [LangChain v0.1.0](https://blog.langchain.dev/langchain-v0-1-0/)
  

- 🤝 Breakout Room #2:
  1. Synthetic Dataset Generation for Evaluation using [Ragas](https://github.com/explodinggradients/ragas)
  2. Evaluating our pipeline with Ragas
  3. Making Adjustments to our RAG Pipeline
  4. Evaluating our Adjusted pipeline against our baseline
  5. Testing OpenAI's Claim

The only way to get started is to get started - so let's grab our dependencies for the day!

## Motivation

OpenAI claims that their `text-embedding-3-small` model is (generally) better than their `text-embedding-ada-002` model.

Here are some passages from their [blog](https://openai.com/blog/new-embedding-models-and-api-updates) about the `text-embedding-3` release:

> `text-embedding-3-small` is our new highly efficient embedding model and provides a significant upgrade over its predecessor, the `text-embedding-ada-002` model...

> **Stronger performance.** Comparing `text-embedding-ada-002` to `text-embedding-3-small`, the average score on a commonly used benchmark for multi-language retrieval ([MIRACL](https://github.com/project-miracl/miracl)) has increased from 31.4% to 44.0%, while the average score on a commonly used benchmark for English tasks ([MTEB](https://github.com/embeddings-benchmark/mteb)) has increased from 61.0% to 62.3%.

Well, with a library like Ragas - we can put that claim to the test!

If what they claim is true - we should see an increase on related metrics by using the new embedding model!

# 🤝 Breakout Room #1

## Task 1: Installing Required Libraries

A reminder that one of the [key features](https://blog.langchain.dev/langchain-v0-1-0/) of LangChain v0.1.0 is the compartmentalization of the various LangChain ecosystem packages!

So let's begin grabbing all of our LangChain related packages!

In [1]:
!pip install -U -q langchain langchain-openai langchain_core langchain-community langchainhub openai


We'll also get the "star of the show" today, which is Ragas!

In [2]:
!pip install -qU ragas


Also, instead of the remotely hosted solution that we used last week (Pinecone), we'll be leveraging Meta's [FAISS](https://github.com/facebookresearch/faiss) as the backend for our LangChain `VectorStore`.

We'll also install `pymupdf`, which will allow us to load PDFs using the `PyMuPDFLoader` in the `langchain-community` package!

In [3]:
!pip install -qU faiss_cpu pymupdf


## Task 2: Set Environment Variables

Let's set up our OpenAI API key so we can leverage their API later on.

In [4]:
import os
import openai
from getpass import getpass

openai.api_key = getpass("Please provide your OpenAI Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key

Please provide your OpenAI Key: ··········


## Task 3: Creating a Simple RAG Pipeline with LangChain v0.1.0

Building on what we learned last week, we'll be leveraging LangChain v0.1.0 and LCEL to build a simple RAG pipeline that we can baseline with Ragas.

## Building our RAG pipeline

Let's review the basic steps of RAG again:

- Create an Index
- Use retrieval to obtain pieces of context from our Index that are similar to our query
- Use a LLM to generate responses based on the retrieved context

Let's get started by creating our index.

> NOTE: We're going to start leaning on the term "index" to refer to our `VectorStore`, `VectorDatabase`, etc. We can think of "index" as the catch-all term, whereas `VectorStore` and the like relate to the specific technologies used to create, store, and interact with the index.

### Creating an Index

You'll notice that the largest changes (outside of some import changes) are that our old favourite chains are back to being bundled in an easily usable abstraction.

We can still create custom chains using LCEL - but we can also be more confident that our pre-packaged chains are created using LCEL under the hood.

#### Loading Data

Let's start by loading some data!

> NOTE: You'll notice that we're using a document loader from the community package of LangChain. This is part of the v0.1.0 changes that keep the base (`langchain-core`) package lightweight while still providing access to some of the more powerful community integrations.

In [5]:
!git clone https://github.com/AI-Maker-Space/DataRepository

Cloning into 'DataRepository'...
remote: Enumerating objects: 54, done.
remote: Counting objects: 100% (46/46), done.
remote: Compressing objects: 100% (34/34), done.
remote: Total 54 (delta 15), reused 20 (delta 7), pack-reused 8
Receiving objects: 100% (54/54), 51.28 MiB | 37.11 MiB/s, done.
Resolving deltas: 100% (15/15), done.


In [6]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader(
    "DataRepository/MuskComplaint.pdf",
)

documents = loader.load()

In [7]:
print(f"The document has a length of {len(documents)} ")
for i in documents:
  print(f"{i} \n\n")

The document has a length of 46 
page_content='1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n9 \n10 \n11 \n12 \n13 \n14 \n15 \n16 \n17 \n18 \n19 \n20 \n21 \n22 \n23 \n24 \n25 \n26 \n27 \n28 \n \n \nCOMPLAINT \n \nIRELL & MANELLA LLP \nMorgan Chu (SBN 70446) \nAlan Heinrich (SBN 212782) \nIian Jablon (SBN 205458) \nAbigail Sellers (SBN 342380) \nJustin Koo (SBN 351547) \nHenry White (SBN 351549) \n1800 Avenue of the Stars, Suite 900 \nLos Angeles, CA 90067 \nTelephone: (310) 277-1010 \nFax: (310) 203-7199 \nmchu@irell.com \naheinrich@irell.com  \nijablon@irell.com \nasellers@irell.com \njkoo@irell.com \nhwhite@irell.com \n \n \nAttorneys for Plaintiff Elon Musk \nSUPERIOR COURT OF CALIFORNIA \nIN AND FOR THE COUNTY OF SAN FRANCISCO \nELON MUSK, an individual, \nPlaintiff, \nvs. \nSAMUEL ALTMAN, an individual, GREGORY \nBROCKMAN, an individual, OPENAI, INC., a \ncorporation, OPENAI, L.P., a limited \npartnership, OPENAI, L.L.C., a limited liability \ncompany, OPENAI GP, L.L.C., a limited \nliability c

In [8]:
# class langchain_core.documents.base.Document[source]

# - Pass page_content in as positional or named arg.

# - param metadata: dict [Optional] - Arbitrary metadata about the page content (e.g., source, relationships to other documents, etc.).

# - param page_content: str [Required]

# - param type

# print(f"The length of doc is : {type(documents[0])}")
# for i in documents[0]:
#   print(f"{i} \n")


print(f"TYPE : {documents[0].type} \n")
print(f"METADATA : {documents[0].metadata} \n")
print(f"PAGE CONTENT : {documents[0].page_content} \n ")

TYPE : Document 

METADATA : {'source': 'DataRepository/MuskComplaint.pdf', 'file_path': 'DataRepository/MuskComplaint.pdf', 'page': 0, 'total_pages': 46, 'format': 'PDF 1.7', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': '', 'creationDate': '', 'modDate': '', 'trapped': ''} 

PAGE CONTENT : 1 
2 
3 
4 
5 
6 
7 
8 
9 
10 
11 
12 
13 
14 
15 
16 
17 
18 
19 
20 
21 
22 
23 
24 
25 
26 
27 
28 
 
 
COMPLAINT 
 
IRELL & MANELLA LLP 
Morgan Chu (SBN 70446) 
Alan Heinrich (SBN 212782) 
Iian Jablon (SBN 205458) 
Abigail Sellers (SBN 342380) 
Justin Koo (SBN 351547) 
Henry White (SBN 351549) 
1800 Avenue of the Stars, Suite 900 
Los Angeles, CA 90067 
Telephone: (310) 277-1010 
Fax: (310) 203-7199 
mchu@irell.com 
aheinrich@irell.com  
ijablon@irell.com 
asellers@irell.com 
jkoo@irell.com 
hwhite@irell.com 
 
 
Attorneys for Plaintiff Elon Musk 
SUPERIOR COURT OF CALIFORNIA 
IN AND FOR THE COUNTY OF SAN FRANCISCO 
ELON MUSK, an individual, 
Plaintiff, 
v

In [9]:
# for i in documents[0]:
#   print(i)

documents[0].page_content

'1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n9 \n10 \n11 \n12 \n13 \n14 \n15 \n16 \n17 \n18 \n19 \n20 \n21 \n22 \n23 \n24 \n25 \n26 \n27 \n28 \n \n \nCOMPLAINT \n \nIRELL & MANELLA LLP \nMorgan Chu (SBN 70446) \nAlan Heinrich (SBN 212782) \nIian Jablon (SBN 205458) \nAbigail Sellers (SBN 342380) \nJustin Koo (SBN 351547) \nHenry White (SBN 351549) \n1800 Avenue of the Stars, Suite 900 \nLos Angeles, CA 90067 \nTelephone: (310) 277-1010 \nFax: (310) 203-7199 \nmchu@irell.com \naheinrich@irell.com  \nijablon@irell.com \nasellers@irell.com \njkoo@irell.com \nhwhite@irell.com \n \n \nAttorneys for Plaintiff Elon Musk \nSUPERIOR COURT OF CALIFORNIA \nIN AND FOR THE COUNTY OF SAN FRANCISCO \nELON MUSK, an individual, \nPlaintiff, \nvs. \nSAMUEL ALTMAN, an individual, GREGORY \nBROCKMAN, an individual, OPENAI, INC., a \ncorporation, OPENAI, L.P., a limited \npartnership, OPENAI, L.L.C., a limited liability \ncompany, OPENAI GP, L.L.C., a limited \nliability company, OPENAI OPCO, LLC, a \nlimited liabilit

In [10]:
documents[0].metadata

{'source': 'DataRepository/MuskComplaint.pdf',
 'file_path': 'DataRepository/MuskComplaint.pdf',
 'page': 0,
 'total_pages': 46,
 'format': 'PDF 1.7',
 'title': '',
 'author': '',
 'subject': '',
 'keywords': '',
 'creator': '',
 'producer': '',
 'creationDate': '',
 'modDate': '',
 'trapped': ''}

#### Transforming Data

Now that we've got our single document - let's split it into smaller pieces so we can more effectively leverage it with our retrieval chain!

We'll start with the classic: `RecursiveCharacterTextSplitter`.

In [11]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 700,
    chunk_overlap = 50
)

documents = text_splitter.split_documents(documents)

Let's confirm we've split our document.

In [12]:
len(documents)

159
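To build intuition for what `chunk_size` and `chunk_overlap` control, here is a minimal sketch of a fixed-size sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and newlines); the function name `split_with_overlap` and the sample text are hypothetical and only illustrate the two parameters.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into windows of at most `chunk_size` characters.

    Each window starts `chunk_size - chunk_overlap` characters after the
    previous one, so neighbouring chunks share `chunk_overlap` characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


# Toy input (an assumption for illustration, not the complaint PDF).
sample = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in toy_chunks:
    print(chunk)
```

Each chunk repeats the last three characters of the previous one, which is the role `chunk_overlap = 50` plays in the cell above: text near a chunk boundary still appears with some surrounding context.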

#### Loading OpenAI Embeddings Model

We'll need a way to convert our text into vectors so that we can compare each chunk against our query vector.

Let's use OpenAI's `text-embedding-ada-002` for this task!

In [13]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="text-embedding-ada-002"
)

In [14]:
embeddings

OpenAIEmbeddings(client=<openai.resources.embeddings.Embeddings object at 0x7b607aaece80>, async_client=<openai.resources.embeddings.AsyncEmbeddings object at 0x7b6079e4c610>, model='text-embedding-ada-002', dimensions=None, deployment='text-embedding-ada-002', openai_api_version='', openai_api_base=None, openai_api_type='', openai_proxy='', embedding_ctx_length=8191, openai_api_key=SecretStr('**********'), openai_organization=None, allowed_special=set(), disallowed_special='all', chunk_size=1000, max_retries=2, request_timeout=None, headers=None, tiktoken_enabled=True, tiktoken_model_name=None, show_progress_bar=False, model_kwargs={}, skip_empty=False, default_headers=None, default_query=None, retry_min_seconds=4, retry_max_seconds=20, http_client=None)
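Comparing a chunk vector to a query vector comes down to a similarity (or distance) function. A minimal sketch using cosine similarity on made-up three-dimensional vectors is shown below; the vectors and names are assumptions for illustration, real embeddings have far more dimensions, and the metric a vector store actually uses depends on how it is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors standing in for real embeddings.
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "doc_a": [1.0, 0.0, 1.0],  # same direction as the query
    "doc_b": [0.0, 1.0, 0.0],  # orthogonal to the query
    "doc_c": [1.0, 1.0, 0.0],  # partially aligned with the query
}

scores = {name: cosine_similarity(query_vec, vec) for name, vec in doc_vecs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['doc_a', 'doc_c', 'doc_b']
```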

#### Creating a FAISS VectorStore

Now that we have documents - we'll need a place to store them alongside their embeddings.

In [15]:
from langchain_community.vectorstores import FAISS

vector_store = FAISS.from_documents(documents, embeddings)

In [16]:
vector_store

<langchain_community.vectorstores.faiss.FAISS at 0x7b60785324d0>

In [17]:
### Test the searching technique with FAISS
# Your sentence to embed
sentence = "Why Elon Musk hates OpenAI and wants to sue them."

# Use the embeddings object to get the embedding vector for the sentence
embedding_vector = embeddings.embed_query(sentence)

# The embedding_vector is now a numerical representation of your sentence
print(embedding_vector)



top_k = 4
results = vector_store.similarity_search(sentence, top_k)
for i in results:
  print(f"\nTop K nearest sentences : {i}")


results = vector_store.similarity_search_by_vector(embedding_vector, top_k)
for i in results:
  print(f"\nTop K nearest sentences : {i}")

[0.0017677240672397173, -0.02317386171651162, 0.01937158665317027, -0.006330453421474386, -2.79073176081549e-05, -0.009992643607925981, -0.018744545826707912, 0.024774820030054306, -0.009032069178593883, -0.0036088254060388496, 0.022533478763623556, 0.004442657550568157, 0.021759681997058588, -0.022213288218502044, 0.0034253823435556273, 0.00576011197570868, 0.02340066482723335, -0.006390489265099985, 0.02428119115462381, 0.0035054302126666355, -0.008711877702149852, 0.005753441378157086, -0.0286304598738084, -0.003692208573925654, -0.011833745295971058, -0.006707345675598851, 0.011707002545507009, -0.022920376215583523, 0.015289145561339486, -0.0016184681082466975, 0.028577093230750618, -0.00260155609535371, 0.003101855335505168, 0.01859779174925034, -0.030631655903091717, -0.0007191801722696358, 0.02238672468616599, -0.004739501702750983, 0.03722226402606202, -0.03490087558901215, 0.001137347010679042, 0.021612927919601017, -0.00038794038766354767, -0.022800304528332323, -0.037942697

In [18]:
### Test the searching technique with FAISS
# Your sentence to embed
sentence = "Who is the plantiff?"

# Use the embeddings object to get the embedding vector for the sentence
embedding_vector = embeddings.embed_query(sentence)

# The embedding_vector is now a numerical representation of your sentence
print(embedding_vector)


top_k = 4
results = vector_store.similarity_search(sentence, top_k)
for i in results:
  print(f"\nTop K nearest sentences : {i}")


results = vector_store.similarity_search_by_vector(embedding_vector, top_k)
for i in results:
  print(f"\nTop K nearest sentences : {i}")

[-0.010692491246009428, -0.018966114237033063, -0.013695458945488589, 0.00773075319710096, -0.017852885731947044, 0.03578823449032578, 0.001017023581171081, 0.0027040457479223096, -0.03628300147304502, -0.003590505354103413, 0.023597695043570566, 0.005274091871860847, -0.0020976110925096086, 0.008802751120293712, 0.016670940521694814, 0.004916759230796597, 0.03133532046998169, -0.00047028748887333255, -0.0011398566474330866, -0.0027744814417386345, 0.010032800073071954, 0.00010898214582936232, -0.020381702205314888, -0.0040440429190826055, -4.4935004762586825e-05, 0.010238952982538303, 0.021192571247205107, -0.03007091223329777, 0.0042364527669124874, -0.0044907089532435635, 0.0020168677532199663, -0.009716698013900973, -0.01642355516769004, -0.014169612034245061, -0.007675779087909933, -0.014705610530180147, 0.012382948828923808, -0.006569422346472789, 0.004525067771487955, 0.016299864353332808, 0.03565079735470306, -0.005933782579137033, -0.01353053661791551, -0.002779635380890616, -

#### ❓ Question #1:

List out a few of the techniques that FAISS uses that make it performant.

> NOTE: Check the [repository](https://github.com/facebookresearch/faiss) for more information about FAISS!


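#### Answer #1:
FAISS (Facebook AI Similarity Search) uses the following techniques to handle large-scale datasets efficiently:
- quantization (for example, product quantization, which compresses vectors so that more of them fit in memory)
- indexing (for example, inverted-file indexes and HNSW graphs, which avoid comparing the query against every vector)
- efficient distance computation (SIMD, multithreading, and optional GPU implementations)

By partitioning the data points into clusters or graph structures, FAISS reduces the search space, which gives faster search times and lower memory usage than exhaustive search, usually at the cost of some accuracy.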
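The idea of reducing the search space can be sketched with a toy inverted-file index in plain Python. This is only an illustration of the concept, not FAISS's implementation: the centroids are hard-coded assumptions (a real index would learn them, for example with k-means), and `ivf_search` and `nprobe` are names chosen here to mirror common terminology.

```python
import math


def l2(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical coarse centroids and toy 2-D vectors.
centroids = [(0.0, 0.0), (10.0, 10.0)]
vectors = {
    "a": (0.5, 0.2),
    "b": (1.0, 1.5),
    "c": (9.0, 9.5),
    "d": (10.5, 11.0),
}

# Build the inverted lists: each vector goes into the bucket of its nearest centroid.
buckets = {i: [] for i in range(len(centroids))}
for name, vec in vectors.items():
    nearest = min(range(len(centroids)), key=lambda i: l2(vec, centroids[i]))
    buckets[nearest].append(name)


def ivf_search(query, k=1, nprobe=1):
    """Search only the `nprobe` buckets whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda i: l2(query, centroids[i]))[:nprobe]
    candidates = [name for i in probe for name in buckets[i]]
    return sorted(candidates, key=lambda name: l2(query, vectors[name]))[:k]


print(buckets)                       # {0: ['a', 'b'], 1: ['c', 'd']}
print(ivf_search((9.2, 9.4), k=1))   # ['c']
```

With `nprobe=1` the query is compared against only two of the four vectors. Raising `nprobe` searches more buckets, trading speed for recall.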




#### Creating a Retriever

To complete our index, all that's left to do is expose our vectorstore as a retriever - which we can do the same way we would in previous versions of LangChain!

In [19]:
retriever = vector_store.as_retriever()

The `retriever` object is an interface to the underlying `vector_store`: it accepts a query string, embeds it with the same embedding model, and returns the most similar chunks as `Document` objects. In applications such as semantic search, document retrieval, or open-domain question answering, the retriever narrows the search space by selecting candidate information for further processing.

In summary, a retriever here is a thin wrapper around a `vector_store` that retrieves information based on vector similarity. It's a key component in systems that need fast and accurate retrieval from large datasets.
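As a rough mental model, a retriever takes a string query, scores every stored chunk against it, and returns the top `k`. The sketch below is a toy stand-in, not LangChain's `VectorStoreRetriever`: the `ToyRetriever` class is hypothetical, and it scores by counting shared words instead of comparing embeddings.

```python
def toy_score(query: str, text: str) -> int:
    """Hypothetical similarity: number of distinct words the two strings share."""
    return len(set(query.lower().split()) & set(text.lower().split()))


class ToyRetriever:
    """Toy stand-in for a retriever: string query in, top-k texts out."""

    def __init__(self, texts, k=4):
        self.texts = list(texts)
        self.k = k

    def invoke(self, query: str):
        # Rank every stored text by its score against the query, best first.
        ranked = sorted(self.texts, key=lambda text: toy_score(query, text), reverse=True)
        return ranked[: self.k]


toy_retriever = ToyRetriever(
    [
        "the plaintiff is elon musk",
        "the defendants include openai",
        "the court is in san francisco",
    ],
    k=2,
)
toy_hits = toy_retriever.invoke("who is the plaintiff")
print(toy_hits)
```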







In [20]:
retriever

VectorStoreRetriever(tags=['FAISS', 'OpenAIEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x7b60785324d0>)

#### Testing our Retriever

Now that we've gone through the trouble of creating our retriever - let's see it in action!

In [21]:
retrieved_documents = retriever.invoke("Who is the plantiff?")

In [22]:
for doc in retrieved_documents:
  print(doc)

page_content='would be owned by the foundation and used ‘for the good of the world’[.]” Plaintiff \nreplied: “Agree on all.” Ex. 2 at 1.' metadata={'source': 'DataRepository/MuskComplaint.pdf', 'file_path': 'DataRepository/MuskComplaint.pdf', 'page': 27, 'total_pages': 46, 'format': 'PDF 1.7', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': '', 'creationDate': '', 'modDate': '', 'trapped': ''}
page_content='property and derivative works funded by those monies, Plaintiff is presently unable to ascertain his \ninterest in or the use, allocation, or distribution of assets without an accounting. Plaintiff is therefore \nentitled to an accounting.' metadata={'source': 'DataRepository/MuskComplaint.pdf', 'file_path': 'DataRepository/MuskComplaint.pdf', 'page': 32, 'total_pages': 46, 'format': 'PDF 1.7', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': '', 'creationDate': '', 'modDate': '', 'trapped': ''}
page_content='1

### Creating a RAG Chain

Now that we have the "R" in RAG taken care of - let's look at creating the "AG"!

#### Creating a Prompt Template

There are a few different ways we could create our prompt template - we could create a custom template, as seen in the code below, or we could simply pull a prompt from the prompt hub! Let's look at an example of that!

In [23]:
from langchain import hub

retrieval_qa_prompt = hub.pull("langchain-ai/retrieval-qa-chat")

In [24]:
print(retrieval_qa_prompt.messages[0].prompt.template)

Answer any use questions based solely on the context below:

<context>
{context}
</context>


As you can see - the prompt template is simple (and has a small error) - so we'll create our own to be a bit more specific!

In [25]:
from langchain.prompts import ChatPromptTemplate

template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

Context:
{context}

Question:
{question}
"""

prompt = ChatPromptTemplate.from_template(template)
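To see what the model will eventually receive, the same template can be filled in with plain `str.format`. This is only a sketch of the substitution step: the retrieved chunks below are hand-picked strings (an assumption for illustration), whereas in the chain the `{context}` slot is filled with whatever the retriever returns.

```python
template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

Context:
{context}

Question:
{question}
"""

# Hypothetical retrieved chunks, written out by hand for this example.
retrieved_chunks = [
    "Attorneys for Plaintiff Elon Musk",
    "ELON MUSK, an individual, Plaintiff,",
]

filled_prompt = template.format(
    context="\n\n".join(retrieved_chunks),
    question="Who is the plaintiff?",
)
print(filled_prompt)
```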

#### Setting Up our Basic QA Chain

Now we can instantiate our basic RAG chain!

We'll use LCEL directly just to see an example of it - but you could just as easily use an abstraction here to achieve the same goal!

We'll also make sure to pass our context through to the output - which is critical for Ragas, since it needs the retrieved context to evaluate our pipeline.

In [26]:
from operator import itemgetter

from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

primary_qa_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

retrieval_augmented_qa_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and piping it into the retriever
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    # "context"  : re-assigned with RunnablePassthrough.assign, which passes the incoming dict through
    #              and sets "context" to the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": prompt | primary_qa_llm, "context": itemgetter("context")}
)
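The data flow of this chain can be traced in plain Python. The sketch below is not LCEL; `fake_retriever` and `fake_llm` are hypothetical stubs, so it only shows which keys exist after each step.

```python
def fake_retriever(question: str) -> list[str]:
    """Hypothetical stand-in for `retriever`: returns canned context."""
    return ["Attorneys for Plaintiff Elon Musk"]


def fake_llm(prompt_text: str) -> str:
    """Hypothetical stand-in for the `prompt | primary_qa_llm` step."""
    return f"(answer generated from a prompt of {len(prompt_text)} characters)"


def toy_chain(inputs: dict) -> dict:
    # Step 1: build {"context", "question"} from the incoming question.
    state = {
        "context": fake_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2: mirror RunnablePassthrough.assign(context=...): keep every key
    # and set "context" to the value it already has.
    state = {**state, "context": state["context"]}
    # Step 3: produce the response and carry the context forward for evaluation.
    prompt_text = f"Context:\n{state['context']}\n\nQuestion:\n{state['question']}\n"
    return {"response": fake_llm(prompt_text), "context": state["context"]}


toy_result = toy_chain({"question": "Who is the plaintiff?"})
print(toy_result)
```

The output dictionary has exactly two keys, `response` and `context`, which is the shape the later cells rely on when they read `result["response"]` and `result["context"]`.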

In [27]:
prompt | primary_qa_llm

ChatPromptTemplate(input_variables=['context', 'question'], messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template="Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':\n\nContext:\n{context}\n\nQuestion:\n{question}\n"))])
| ChatOpenAI(client=<openai.resources.chat.completions.Completions object at 0x7b6077d33c70>, async_client=<openai.resources.chat.completions.AsyncCompletions object at 0x7b6079a691e0>, temperature=0.0, openai_api_key=SecretStr('**********'), openai_proxy='')

#### 🏗️ Activity #1:

Describe the pipeline shown above in simple terms. You can include a diagram if desired.

Let's test it out!

In [28]:
question = "Who is the plantiff?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)

Elon Musk


In [29]:
question = "What does this complaint pertain to?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)
print(result["context"])

The complaint pertains to breach of fiduciary duty, unfair business practices, and accounting.
[Document(page_content='1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n9 \n10 \n11 \n12 \n13 \n14 \n15 \n16 \n17 \n18 \n19 \n20 \n21 \n22 \n23 \n24 \n25 \n26 \n27 \n28 \n \n \n– 31 – \nCOMPLAINT \n \nTHIRD CAUSE OF ACTION \nBreach of Fiduciary Duty  \nAgainst All Defendants \n133. \nPlaintiff realleges and incorporates by reference only paragraphs of this Complaint \nnecessary for his claim of Breach of Fiduciary Duty. \n134. \nUnder California law, Defendants owe fiduciary duties to Plaintiff, including a duty \nto use Plaintiff’s contributions for the purposes for which they were made. E.g., Cal. Bus. & Prof. \nCode § 17510.8. Defendants have repeatedly breached their fiduciary duties to Plaintiff, including \nby:', metadata={'source': 'DataRepository/MuskComplaint.pdf', 'file_path': 'DataRepository/MuskComplaint.pdf', 'page': 30, 'total_pages': 46, 'format': 'PDF 1.7', 'title': '', 'author': '', 'subject

We can already see that there are some improvements we could make here.

For now, let's switch gears to Ragas to see how we can leverage that tool to give us insight into how our pipeline is performing!

# 🤝 Breakout Room #2

## Task 1: Synthetic Dataset Generation for Evaluation using Ragas

Ragas is a powerful library that lets us evaluate our RAG pipeline by collecting input/output/context triplets and obtaining metrics relating to a number of different aspects of our RAG pipeline.

We'll be evaluating on every core metric today, but in order to do that - we'll need to create a test set. Luckily for us, Ragas can do that directly!

### Synthetic Test Set Generation

We can leverage Ragas' [`Synthetic Test Data generation`](https://docs.ragas.io/en/stable/concepts/testset_generation.html) functionality to generate our own synthetic question/context (QC) pairs - as well as a synthetic ground truth - quite easily!

> NOTE: This process will use `gpt-3.5-turbo-16k` as the base generator and `gpt-4` as the critic - if you're attempting to create a lot of samples please be aware of cost, as well as rate limits.

In [30]:
eval_documents = documents

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 400
)

eval_documents = text_splitter.split_documents(eval_documents)

#### ❓ Question #2:

Why is it important to split our documents using different parameters when creating our synthetic data?


#### Answer #2:

It depends on our use case:
- How information is split between chunks matters. Different chunk sizes and overlaps produce different representations of the same document, so generating the test set from differently sized chunks checks that our pipeline can find and use the information regardless of where the chunk boundaries fall.
- Tuning `chunk_size` and `chunk_overlap` lets us test different retrieval strategies and see how the pipeline's performance varies with changes in document segmentation.
- Different parameters help the dataset better represent real-world complexity, where the context needed for an answer rarely lines up with the exact chunks stored in our index.

In [31]:
len(eval_documents)

159

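Why is the length still 159? `eval_documents` was assigned from `documents`, which had already been split into chunks of at most 700 characters. A text splitter only breaks text apart, it never merges chunks, so re-splitting with `chunk_size = 1500` leaves every chunk unchanged. To get genuinely different chunks, we would need to split the originally loaded pages (for example, by calling `loader.load()` again) instead.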

In [32]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

generator = TestsetGenerator.with_openai()

testset = generator.generate_with_langchain_docs(eval_documents, test_size=10, distributions={simple: 0.25, reasoning: 0.25, multi_context: 0.5})

embedding nodes:   0%|          | 0/318 [00:00<?, ?it/s]



Generating:   0%|          | 0/10 [00:00<?, ?it/s]

In [33]:
testset

TestDataset(test_data=[DataRow(question='What principles are outlined in the Founding Agreement of OpenAI?', contexts=['profit developing AGI for the benefit of humanity, not for a for-profit company seeking to maximize \nshareholder profits; and (b) would be open-source, balancing only countervailing safety \nconsiderations, and would not keep its technology closed and secret for proprietary commercial \nreasons (The “Founding Agreement”). Reflecting the Founding Agreement, Mr. Musk named this \nnew AI lab “OpenAI,” which would compete with, and serve as a vital counterbalance to, \nGoogle/DeepMind in the race for AGI, but would do so to benefit humanity, not the shareholders \nof a private, for-profit company (much less one of the largest technology companies in the world).  \n25.'], ground_truth='The principles outlined in the Founding Agreement of OpenAI are developing AGI for the benefit of humanity and being open-source.', evolution_type='simple'), DataRow(question='How did OpenA

#### ❓ Question #3:

`{simple: 0.25, reasoning: 0.25, multi_context: 0.5}`

What exactly does this mapping refer to?

> NOTE: Check out the Ragas documentation on this generation process [here](https://docs.ragas.io/en/stable/concepts/testset_generation.html).


#### Answer #3:

The mapping sets the target share of each question type (evolution) in the generated test set:

- **Simple**: A question generated directly from a chunk, with no added complexity. The weight of 0.25 means that roughly 25% of the generated questions will be of this type. These typically involve basic recall of facts or definitions and don't require deep understanding or complex reasoning.

- **Reasoning**: The question is rewritten so that answering it requires reasoning, analysis, or problem-solving. The weight of 0.25 means that roughly 25% of the generated questions will be reasoning questions. These might require understanding relationships between concepts, inferring conclusions from given information, or applying principles to new situations.

- **Multi-Context**: The question is rephrased so that answering it requires information from multiple related sections or chunks. With a weight of 0.5, roughly half of the generated questions will be of this type, which means information must be synthesized from several segments and connections drawn across them.
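As a quick sanity check, the weights can be turned into expected counts and compared with the `evolution_type` values of the ten questions generated in this run. The sketch below uses plain strings as keys instead of the Ragas evolution objects, and it makes no assumption about how Ragas rounds or samples internally.

```python
from collections import Counter

test_size = 10
# String keys stand in for the Ragas `simple`, `reasoning`, `multi_context` objects.
weights = {"simple": 0.25, "reasoning": 0.25, "multi_context": 0.5}

# Expected number of questions of each type if the weights were met exactly.
expected_counts = {name: weight * test_size for name, weight in weights.items()}

# evolution_type values of the ten rows generated in this notebook run.
observed_types = [
    "simple", "simple", "reasoning", "reasoning", "multi_context",
    "multi_context", "multi_context", "multi_context", "multi_context", "reasoning",
]
observed_counts = Counter(observed_types)

for name in weights:
    print(f"{name}: expected {expected_counts[name]}, observed {observed_counts[name]}")
```

With only ten samples, 2.5 questions of a type is not achievable, so the observed split (2 simple, 3 reasoning, 5 multi-context) only approximates the requested distribution.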







Let's look at the output and see what we can learn about it!

In [35]:
testset.test_data[0]

DataRow(question='What principles are outlined in the Founding Agreement of OpenAI?', contexts=['profit developing AGI for the benefit of humanity, not for a for-profit company seeking to maximize \nshareholder profits; and (b) would be open-source, balancing only countervailing safety \nconsiderations, and would not keep its technology closed and secret for proprietary commercial \nreasons (The “Founding Agreement”). Reflecting the Founding Agreement, Mr. Musk named this \nnew AI lab “OpenAI,” which would compete with, and serve as a vital counterbalance to, \nGoogle/DeepMind in the race for AGI, but would do so to benefit humanity, not the shareholders \nof a private, for-profit company (much less one of the largest technology companies in the world).  \n25.'], ground_truth='The principles outlined in the Founding Agreement of OpenAI are developing AGI for the benefit of humanity and being open-source.', evolution_type='simple')

### Generating Responses with RAG Pipeline

Now that we have some QC pairs, and some ground truths, let's evaluate our RAG pipeline using Ragas.

The process is, again, quite straightforward - thanks to Ragas and LangChain!

Let's start by extracting our questions and ground truths from the test set we created.

We can start by converting our test dataset into a Pandas DataFrame.

In [36]:
test_df = testset.to_pandas()

In [37]:
test_df

| | question | contexts | ground_truth | evolution_type | episode_done |
|---|---|---|---|---|---|
| 0 | What principles are outlined in the Founding A... | [profit developing AGI for the benefit of huma... | The principles outlined in the Founding Agreem... | simple | True |
| 1 | How did OpenAI use reinforcement learning in t... | [77. \nInitial work at OpenAI followed much in... | OpenAI used reinforcement learning to play Dot... | simple | True |
| 2 | What were the benefits of releasing models pub... | [challenging.” At the time, OpenAI stated that... | The paper by OpenAI and independent scientists... | reasoning | True |
| 3 | Which company's collaboration with DeepMind ca... | [a superhuman level of play in the games of ch... | Google's collaboration with DeepMind caused co... | reasoning | True |
| 4 | What were Elon Musk's concerns about AGI and i... | [to its mission to develop AGI for the benefit... | Elon Musk's concerns about AGI and its develop... | multi_context | True |
| 5 | "What game did OpenAI use reinforcement learni... | [77. \nInitial work at OpenAI followed much in... | OpenAI used Dota 2, a strategy video game, to ... | multi_context | True |
| 6 | "What was the impact of OpenAI's GPT-3 on futu... | [1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n9 \n10 \n11 \... | Their publication did prove to be useful to th... | multi_context | True |
| 7 | What is one risk of AGI mentioned in the context? | [100. \nJURISDICTION AND VENUE \n14. \nOn inf... | | multi_context | True |
| 8 | What danger did Mr. Musk and Mr. Hassabis disc... | [18. \nMr. Musk has long recognized that AGI p... | The potential danger that the advancement of A... | multi_context | True |
| 9 | In which strategy video game did OpenAI's team... | [77. \nInitial work at OpenAI followed much in... | OpenAI's team competed in Dota 2, a strategy v... | reasoning | True |


In [38]:
test_questions = test_df["question"].values.tolist()
test_groundtruths = test_df["ground_truth"].values.tolist()

In [39]:
test_questions, test_groundtruths

(['What principles are outlined in the Founding Agreement of OpenAI?',
  'How did OpenAI use reinforcement learning in their work?',
  'What were the benefits of releasing models publicly according to the paper by OpenAI and independent scientists?',
  "Which company's collaboration with DeepMind caused concern for Mr. Musk in the race for AGI?",
  "What were Elon Musk's concerns about AGI and its development by a for-profit company?",
  '"What game did OpenAI use reinforcement learning to surpass human-level play achieved by a world-champion program in chess, shogi, and Go?"',
  '"What was the impact of OpenAI\'s GPT-3 on future model development and the communities that emerged to enhance it?"',
  'What is one risk of AGI mentioned in the context?',
  "What danger did Mr. Musk and Mr. Hassabis discuss regarding AI's impact on machine usefulness with strong AGI?",
  "In which strategy video game did OpenAI's team compete using reinforcement learning, with more moving pieces than chess

Now we'll generate responses from our RAG pipeline for the questions we've generated - we'll also need to collect the retrieved contexts for each question.

We'll do this in a simple loop to see exactly what's happening!

In [40]:
answers = []
contexts = []

for question in test_questions:
  response = retrieval_augmented_qa_chain.invoke({"question" : question})
  answers.append(response["response"].content)
  contexts.append([context.page_content for context in response["context"]])

Now we can wrap our information in a Hugging Face dataset for use in the Ragas library.

In [41]:
from datasets import Dataset

response_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})
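Before handing the columns to `Dataset.from_dict`, it can help to confirm that they line up. The following is a minimal sketch, not part of Ragas or `datasets`: `check_columns` is a hypothetical helper, and the sample row is made up purely for illustration.

```python
def check_columns(columns):
    """Return the shared row count, or raise if the columns are misaligned."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    # Each entry of "contexts" should be a list of strings, one per retrieved chunk.
    for i, ctx in enumerate(columns["contexts"]):
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            raise TypeError(f"contexts[{i}] must be a list of strings")
    return next(iter(lengths.values()))


# Made-up sample data, only to exercise the helper.
sample = {
    "question": ["Who is the plaintiff?"],
    "answer": ["Elon Musk."],
    "contexts": [["chunk one", "chunk two"]],
    "ground_truth": ["The plaintiff is Elon Musk."],
}
n_rows = check_columns(sample)
print(n_rows)
```

A mismatch here usually means the collection loop skipped or duplicated a question, which is cheaper to catch now than after an evaluation run.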

Let's take a peek and see what that looks like!

In [42]:
response_dataset[0]

{'question': 'What principles are outlined in the Founding Agreement of OpenAI?',
 'answer': 'The principles outlined in the Founding Agreement of OpenAI include developing AGI for the benefit of humanity, being a non-profit organization, and being open-source.',
 'contexts': ['Agreement: “Open AI Institute,” or simply, “OpenAI.”  \n55. \nWith these principles of the Founding Agreement in mind, Mr. Musk joined forces \nwith Mr. Altman and Mr. Brockman to formalize and launch the project. Mr. Musk was actively \ninvolved in the project, even prior to it being publicly announced. For example, Mr. Musk advised \nMr. Brockman on compensation packages for employees, sharing with Mr. Brockman his strategies \nfor compensation and retaining talent.  \n56. \nOn December 8, 2015, a Certificate of Incorporation for OpenAI, Inc. was filed with \nthe Delaware Secretary of State. The Certificate memorialized in writing the Founding Agreement:',
  'a key role in recruiting world-class talent to Open

## Task 2: Evaluating our Pipeline with Ragas

Now that we have our response dataset - we can finally get into the "meat" of Ragas - evaluation!

First, we'll import the desired metrics, then we can use them to evaluate our created dataset!

Check out the specific metrics we'll be using in the Ragas documentation:

- [Faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html)
- [Answer Relevancy](https://docs.ragas.io/en/stable/concepts/metrics/answer_relevance.html)
- [Context Precision](https://docs.ragas.io/en/stable/concepts/metrics/context_precision.html)
- [Context Recall](https://docs.ragas.io/en/stable/concepts/metrics/context_recall.html)
- [Answer Correctness](https://docs.ragas.io/en/stable/concepts/metrics/answer_correctness.html)

See the accompanying presentation for more in-depth explanations about each of the metrics!
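As a rough intuition for context precision: as we understand the Ragas documentation, the score averages precision@k over the ranks where a relevant chunk appears, so relevant chunks that are ranked early score higher. Here is a minimal sketch under that assumption. In Ragas the relevance verdicts come from an LLM judge; below they are replaced by hand-written, hypothetical 0/1 lists.

```python
def context_precision(verdicts):
    """Average precision@k over the ranks that hold a relevant chunk.

    verdicts: list of 0/1 relevance judgements, in retrieval order.
    """
    relevant_seen = 0
    weighted_sum = 0.0
    for k, verdict in enumerate(verdicts, start=1):
        if verdict:
            relevant_seen += 1
            weighted_sum += relevant_seen / k  # precision@k at a relevant rank
    return weighted_sum / relevant_seen if relevant_seen else 0.0


# Hypothetical verdict patterns for four retrieved chunks.
print(round(context_precision([1, 1, 0, 0]), 6))  # 1.0 (relevant chunks ranked first)
print(round(context_precision([0, 1, 1, 0]), 6))  # 0.583333 (relevant chunks ranked lower)
```

Both patterns contain two relevant chunks, yet the score differs because of where those chunks sit in the ranking.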

In [43]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_correctness,
    context_recall,
    context_precision,
)

metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
]

All that's left to do is call "evaluate" and away we go!

In [44]:
results = evaluate(response_dataset, metrics)

Evaluating:   0%|          | 0/50 [00:00<?, ?it/s]

In [45]:
results

{'faithfulness': 0.8750, 'answer_relevancy': 0.9519, 'context_recall': 0.8000, 'context_precision': 0.7472, 'answer_correctness': 0.8239}

In [46]:
results_df = results.to_pandas()
results_df

Unnamed: 0,question,answer,contexts,ground_truth,faithfulness,answer_relevancy,context_recall,context_precision,answer_correctness
0,What principles are outlined in the Founding A...,The principles outlined in the Founding Agreem...,"[Agreement: “Open AI Institute,” or simply, “O...",The principles outlined in the Founding Agreem...,1.0,1.0,1.0,0.583333,0.845501
1,How did OpenAI use reinforcement learning in t...,OpenAI used reinforcement learning to compete ...,[77. \nInitial work at OpenAI followed much in...,OpenAI used reinforcement learning to play Dot...,1.0,0.971244,1.0,1.0,0.74696
2,What were the benefits of releasing models pub...,The benefits of releasing models publicly acco...,"[challenging.” At the time, OpenAI stated that...",The paper by OpenAI and independent scientists...,1.0,0.992501,1.0,1.0,0.841254
3,Which company's collaboration with DeepMind ca...,Google,[to its mission to develop AGI for the benefit...,Google's collaboration with DeepMind caused co...,,0.883154,1.0,1.0,0.949842
4,What were Elon Musk's concerns about AGI and i...,Elon Musk's concerns about AGI were that it po...,[to its mission to develop AGI for the benefit...,Elon Musk's concerns about AGI and its develop...,1.0,0.969848,1.0,0.583333,0.53922
5,"""What game did OpenAI use reinforcement learni...",Dota 2,[a superhuman level of play in the games of ch...,"OpenAI used Dota 2, a strategy video game, to ...",0.0,0.906029,0.0,0.5,0.961786
6,"""What was the impact of OpenAI's GPT-3 on futu...",The impact of OpenAI's GPT-3 on future model d...,[1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n9 \n10 \n11 \...,Their publication did prove to be useful to th...,1.0,0.955954,1.0,0.805556,0.731562
7,What is one risk of AGI mentioned in the context?,One risk of AGI mentioned in the context is th...,[18. \nMr. Musk has long recognized that AGI p...,,1.0,1.0,0.0,0.0,0.935808
8,What danger did Mr. Musk and Mr. Hassabis disc...,They discussed the danger of machines becoming...,[18. \nMr. Musk has long recognized that AGI p...,The potential danger that the advancement of A...,1.0,0.907018,1.0,1.0,0.723913
9,In which strategy video game did OpenAI's team...,Dota 2,[77. \nInitial work at OpenAI followed much in...,"OpenAI's team competed in Dota 2, a strategy v...",,0.932811,1.0,1.0,0.962836
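The aggregate scores appear to be the mean of the per-question scores, with missing values (the blank cells above) skipped. Here is a quick check, with the faithfulness column copied by hand from the table above and `nan_mean` as a hypothetical helper:

```python
import math


def nan_mean(values):
    """Mean of the values that are not NaN."""
    present = [v for v in values if not math.isnan(v)]
    return sum(present) / len(present)


nan = float("nan")
# Per-question faithfulness from the table above; blank cells become NaN.
faithfulness_scores = [1.0, 1.0, 1.0, nan, 1.0, 0.0, 1.0, 1.0, 1.0, nan]

print(nan_mean(faithfulness_scores))  # 0.875, matching the aggregate faithfulness
```

Two of the ten questions contribute nothing to the faithfulness average, so that score rests on only eight samples.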


## Task 3: Making Adjustments to our RAG Pipeline

Now that we have established a baseline - we can see how any changes impact our pipeline's performance!

Let's modify our retriever and see how that impacts our Ragas metrics!

In [47]:
from langchain.retrievers import MultiQueryRetriever

advanced_retriever = MultiQueryRetriever.from_llm(retriever=retriever, llm=primary_qa_llm)
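Conceptually, a multi-query retriever asks an LLM for several rephrasings of the question, runs each one through the base retriever, and returns the de-duplicated union of the results. The sketch below mimics that flow with stand-ins: `fake_rephrase`, `fake_retrieve`, and `FAKE_INDEX` are hypothetical placeholders for the LLM and the vector store retriever, not LangChain APIs.

```python
def fake_rephrase(question):
    """Stand-in for the LLM step that writes alternative phrasings."""
    return [
        question.lower(),
        question.replace("Who", "Which person"),
    ]


# Tiny made-up index mapping a query to the chunks it would retrieve.
FAKE_INDEX = {
    "who is the plaintiff?": ["chunk B", "chunk C"],
    "Which person is the plaintiff?": ["chunk A", "chunk D"],
}


def fake_retrieve(query):
    """Stand-in for the base retriever."""
    return FAKE_INDEX.get(query, [])


def multi_query_retrieve(question):
    """Union of the chunks retrieved for every rephrasing, in first-seen order."""
    unique_chunks = []
    for query in fake_rephrase(question):
        for chunk in fake_retrieve(query):
            if chunk not in unique_chunks:
                unique_chunks.append(chunk)
    return unique_chunks


print(multi_query_retrieve("Who is the plaintiff?"))
# ['chunk B', 'chunk C', 'chunk A', 'chunk D']
```

Because the union can hold more chunks than a single query returns, the generator sees more context, which may help recall but can also dilute precision.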

We'll also re-create our RAG pipeline using the abstractions that come packaged with LangChain v0.1.0!

First, let's create a chain to "stuff" our documents into our context!

In [48]:
from langchain.chains.combine_documents import create_stuff_documents_chain

document_chain = create_stuff_documents_chain(primary_qa_llm, retrieval_qa_prompt)

Next, we'll create the retrieval chain!

In [49]:
from langchain.chains import create_retrieval_chain

retrieval_chain = create_retrieval_chain(advanced_retriever, document_chain)
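Roughly speaking, the "stuff" step joins the retrieved chunks into one block of text and drops it into the prompt, and the retrieval chain hands back the original input alongside the retrieved context and the answer. Here is a minimal sketch of that data flow. The prompt template, the chunk texts, and the retriever and LLM stand-ins are all hypothetical; none of them are LangChain objects.

```python
PROMPT_TEMPLATE = "Answer using only this context:\n{context}\n\nQuestion: {input}"


def stuff_documents(chunks, separator="\n\n"):
    """Join every retrieved chunk into a single context string."""
    return separator.join(chunks)


def run_retrieval_chain(question, retrieve, generate):
    """Retrieve chunks, stuff them into the prompt, and call the generator."""
    chunks = retrieve(question)
    prompt = PROMPT_TEMPLATE.format(context=stuff_documents(chunks), input=question)
    return {"input": question, "context": chunks, "answer": generate(prompt)}


# Stand-ins for the retriever and the LLM, with made-up chunk text.
result = run_retrieval_chain(
    "Who is the plaintiff?",
    retrieve=lambda q: ["made-up chunk one", "made-up chunk two"],
    generate=lambda prompt: f"(answer based on {len(prompt)} prompt characters)",
)
print(sorted(result))  # ['answer', 'context', 'input']
```

This is why the chain is invoked with an `"input"` key and read back through `"answer"` and `"context"`.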

In [50]:
response = retrieval_chain.invoke({"input": "Who is the plaintiff?"})

In [51]:
print(response["answer"])

The plaintiff is Elon Musk.


In [52]:
response = retrieval_chain.invoke({"input": "What does this complaint pertain to?"})

In [53]:
print(response["answer"])

The complaint pertains to a legal case involving Plaintiff Elon Musk alleging breach of fiduciary duty, unfair business practices under California Business and Professions Code sections 17200 et seq, and demanding an accounting, restitution, disgorgement of monies received by Defendants, prejudgment interest, and an injunction against future activities. The complaint also includes a demand for a jury trial.


Well, just from those responses this chain *feels* better - but let's see how it performs on our eval!

Let's do the same process we did before to collect our pipeline's contexts and answers.

In [54]:
answers = []
contexts = []

for question in test_questions:
  response = retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])

Now we can convert this into a dataset, just like we did before.

In [55]:
response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})
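The collect-and-wrap loop repeats for every pipeline, with only the input key and the way the answer is read changing. One way to factor it out is sketched below. `collect_responses` is a hypothetical helper (not part of LangChain or Ragas), and `StubChain` and `StubDoc` stand in for a real chain and its documents so the sketch runs on its own.

```python
def collect_responses(chain, questions, input_key, get_answer):
    """Invoke the chain per question and gather answers and context texts."""
    answers, contexts = [], []
    for question in questions:
        response = chain.invoke({input_key: question})
        answers.append(get_answer(response))
        contexts.append([doc.page_content for doc in response["context"]])
    return answers, contexts


class StubDoc:
    """Minimal stand-in for a retrieved document."""

    def __init__(self, page_content):
        self.page_content = page_content


class StubChain:
    """Minimal stand-in for a chain exposing .invoke()."""

    def invoke(self, inputs):
        question = inputs["input"]
        return {
            "answer": f"echo: {question}",
            "context": [StubDoc("chunk for " + question)],
        }


answers, contexts = collect_responses(
    StubChain(), ["Q1", "Q2"], input_key="input", get_answer=lambda r: r["answer"]
)
print(answers)
print(contexts)
```

With the baseline chain, the same helper would be called with `input_key="question"` and `get_answer=lambda r: r["response"].content`.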

Let's evaluate on the same metrics we did for the first pipeline and see how it does!

In [56]:
advanced_retrieval_results = evaluate(response_dataset_advanced_retrieval, metrics)

Evaluating:   0%|          | 0/50 [00:00<?, ?it/s]

In [57]:
advanced_retrieval_results_df = advanced_retrieval_results.to_pandas()
advanced_retrieval_results_df

Unnamed: 0,question,answer,contexts,ground_truth,faithfulness,answer_relevancy,context_recall,context_precision,answer_correctness
0,What principles are outlined in the Founding A...,The principles outlined in the Founding Agreem...,"[Agreement: “Open AI Institute,” or simply, “O...",The principles outlined in the Founding Agreem...,1.0,1.0,1.0,0.555556,0.742263
1,How did OpenAI use reinforcement learning in t...,OpenAI used reinforcement learning to compete ...,[77. \nInitial work at OpenAI followed much in...,OpenAI used reinforcement learning to play Dot...,0.666667,0.872291,0.5,1.0,0.540233
2,What were the benefits of releasing models pub...,According to the paper by OpenAI and independe...,"[challenging.” At the time, OpenAI stated that...",The paper by OpenAI and independent scientists...,1.0,0.950691,1.0,1.0,0.839107
3,Which company's collaboration with DeepMind ca...,Google's collaboration with DeepMind caused co...,[to its mission to develop AGI for the benefit...,Google's collaboration with DeepMind caused co...,1.0,0.927679,1.0,0.876667,1.0
4,What were Elon Musk's concerns about AGI and i...,Elon Musk's concerns about AGI included the po...,[to its mission to develop AGI for the benefit...,Elon Musk's concerns about AGI and its develop...,0.8,0.969903,1.0,0.583333,0.568283
5,"""What game did OpenAI use reinforcement learni...",OpenAI used reinforcement learning to surpass ...,[a superhuman level of play in the games of ch...,"OpenAI used Dota 2, a strategy video game, to ...",,0.946309,0.0,0.5,0.615178
6,"""What was the impact of OpenAI's GPT-3 on futu...",The publication of OpenAI's GPT-3 proved to be...,[1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n9 \n10 \n11 \...,Their publication did prove to be useful to th...,,0.949865,1.0,0.804167,0.738901
7,What is one risk of AGI mentioned in the context?,One risk of AGI mentioned in the context is th...,[18. \nMr. Musk has long recognized that AGI p...,,1.0,1.0,0.0,0.0,0.182433
8,What danger did Mr. Musk and Mr. Hassabis disc...,Mr. Musk and Mr. Hassabis discussed the potent...,[18. \nMr. Musk has long recognized that AGI p...,The potential danger that the advancement of A...,,0.892477,1.0,1.0,0.597081
9,In which strategy video game did OpenAI's team...,"OpenAI's team competed in Dota 2, a strategy v...",[77. \nInitial work at OpenAI followed much in...,"OpenAI's team competed in Dota 2, a strategy v...",1.0,0.93145,1.0,1.0,0.746695


## Task 4: Evaluating our Adjusted Pipeline Against Our Baseline

Now we can compare our results and see what directional changes occurred!

Let's refresh with our initial metrics.

In [58]:
results

{'faithfulness': 0.8750, 'answer_relevancy': 0.9519, 'context_recall': 0.8000, 'context_precision': 0.7472, 'answer_correctness': 0.8239}

And see how our advanced retrieval modified our chain!

In [59]:
advanced_retrieval_results

{'faithfulness': 0.9238, 'answer_relevancy': 0.9441, 'context_recall': 0.7500, 'context_precision': 0.7320, 'answer_correctness': 0.6570}

In [60]:
import pandas as pd

df_original = pd.DataFrame(list(results.items()), columns=['Metric', 'Baseline'])
df_comparison = pd.DataFrame(list(advanced_retrieval_results.items()), columns=['Metric', 'MultiQueryRetriever with Document Stuffing'])

df_merged = pd.merge(df_original, df_comparison, on='Metric')

df_merged['Delta'] = df_merged['MultiQueryRetriever with Document Stuffing'] - df_merged['Baseline']

df_merged

| | Metric | Baseline | MultiQueryRetriever with Document Stuffing | Delta |
|---|---|---|---|---|
| 0 | faithfulness | 0.875 | 0.92381 | 0.04881 |
| 1 | answer_relevancy | 0.951856 | 0.944067 | -0.007789 |
| 2 | context_recall | 0.8 | 0.75 | -0.05 |
| 3 | context_precision | 0.747222 | 0.731972 | -0.01525 |
| 4 | answer_correctness | 0.823868 | 0.657017 | -0.166851 |
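With only ten questions, a single average can move a lot on one or two rows. One way to gauge that, sketched here under our own assumptions rather than anything Ragas provides, is a paired bootstrap over the per-question `answer_correctness` scores copied from the two results tables above. The resample count and the seed are arbitrary choices.

```python
import random

# Per-question answer_correctness, copied from the two results tables above.
baseline = [0.845501, 0.74696, 0.841254, 0.949842, 0.53922,
            0.961786, 0.731562, 0.935808, 0.723913, 0.962836]
multi_query = [0.742263, 0.540233, 0.839107, 1.0, 0.568283,
               0.615178, 0.738901, 0.182433, 0.597081, 0.746695]

# Paired differences: same question, two pipelines.
diffs = [m - b for m, b in zip(multi_query, baseline)]
mean_diff = sum(diffs) / len(diffs)

rng = random.Random(0)  # fixed seed so reruns give the same interval
resampled_means = sorted(
    sum(rng.choices(diffs, k=len(diffs))) / len(diffs) for _ in range(2000)
)
# Percentile interval: roughly the 2.5th and 97.5th percentiles of the resampled means.
low, high = resampled_means[49], resampled_means[1949]

print(f"mean delta: {mean_diff:.4f}")
print(f"approx. 95% interval: [{low:.4f}, {high:.4f}]")
```

If the interval comfortably excludes zero, the drop is less likely to be noise. If it straddles zero, ten questions may simply be too few to tell.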


## Task 5: Testing OpenAI's Claim

Now that we've seen how our retriever can impact the performance of our RAG pipeline - let's see how changing our embedding model impacts performance.

#### 🏗️ Activity #2:

Please provide markdown, or code comments, to explain what each of the following steps is doing!

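In [61]:
# Create an embeddings client for OpenAI's `text-embedding-3-small` model.
new_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

In [62]:
# Embed our documents with the new model and index them in a FAISS vector store.
vector_store = FAISS.from_documents(documents, new_embeddings)

In [63]:
# Expose the new vector store as a retriever.
new_retriever = vector_store.as_retriever()

In [64]:
# Wrap the new retriever in a MultiQueryRetriever, which uses the LLM to generate query variations.
new_advanced_retriever = MultiQueryRetriever.from_llm(retriever=new_retriever, llm=primary_qa_llm)

In [65]:
# Rebuild the retrieval chain around the new retriever, reusing the same document-stuffing chain.
new_retrieval_chain = create_retrieval_chain(new_advanced_retriever, document_chain)

In [66]:
# Run every test question through the new chain, collecting answers and retrieved contexts.
answers = []
contexts = []

for question in test_questions:
  response = new_retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])

In [67]:
# Package the questions, answers, contexts, and ground truths as a Hugging Face dataset.
new_response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

In [68]:
# Score the new pipeline on the same Ragas metrics as before.
new_advanced_retrieval_results = evaluate(new_response_dataset_advanced_retrieval, metrics)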

Evaluating:   0%|          | 0/50 [00:00<?, ?it/s]

In [69]:
new_advanced_retrieval_results

{'faithfulness': 0.9259, 'answer_relevancy': 0.9475, 'context_recall': 0.8000, 'context_precision': 0.8162, 'answer_correctness': 0.7624}

In [70]:
df_baseline = pd.DataFrame(list(results.items()), columns=['Metric', 'Baseline'])
df_original = pd.DataFrame(list(advanced_retrieval_results.items()), columns=['Metric', 'ADA'])
df_comparison = pd.DataFrame(list(new_advanced_retrieval_results.items()), columns=['Metric', 'Text Embedding 3'])

df_merged = pd.merge(df_original, df_comparison, on='Metric')
df_merged = pd.merge(df_baseline, df_merged, on="Metric")

df_merged['Delta (TE3 -> ADA)'] = df_merged['Text Embedding 3'] - df_merged['ADA']
df_merged['Delta (TE3 -> Baseline)'] = df_merged['Text Embedding 3'] - df_merged['Baseline']

df_merged

| | Metric | Baseline | ADA | Text Embedding 3 | Delta (TE3 -> ADA) | Delta (TE3 -> Baseline) |
|---|---|---|---|---|---|---|
| 0 | faithfulness | 0.875 | 0.92381 | 0.925926 | 0.002116 | 0.050926 |
| 1 | answer_relevancy | 0.951856 | 0.944067 | 0.947539 | 0.003472 | -0.004317 |
| 2 | context_recall | 0.8 | 0.75 | 0.8 | 0.05 | 0.0 |
| 3 | context_precision | 0.747222 | 0.731972 | 0.81625 | 0.084278 | 0.069028 |
| 4 | answer_correctness | 0.823868 | 0.657017 | 0.762394 | 0.105377 | -0.061474 |


#### ❓ Question #4:

In your opinion, is `text-embedding-3-small` significantly better than `ada`?


#### Answer #4:

- Based on the comparison, there is no significant difference between `ADA` and `text-embedding-3-small`. Ada might be more suitable for completing prompts in a human-like fashion.

- `text-embedding-3-small` can produce high-quality results faster and with fewer computational resources than the general-purpose `ADA` model when it comes to generating embeddings. However, there may be an issue with this embedding: the ordering by cosine distance can prefer data that is clearly more different.

https://arxiv.org/html/2403.05440v1#:~:text=While%20similarity%20in%20'cosine%20similarity,the%20norm%20of%20the%20learned
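For reference, the cosine similarity used to compare embeddings measures only the angle between two vectors and ignores their length. Here is a minimal sketch with made-up three-dimensional vectors (real embeddings have far more dimensions):

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [1.0, 2.0, 3.0]
scaled = [2.0, 4.0, 6.0]      # same direction, twice the length
different = [3.0, -1.0, 0.5]  # different direction

print(round(cosine_similarity(query, scaled), 6))  # 1.0
print(round(cosine_similarity(query, different), 6))
```

Scaling a vector leaves the score unchanged, which is one reason rankings by cosine similarity can behave differently from rankings by raw distance.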

## BONUS ACTIVITY: Showcase Multi-Context Performance Changes

Now that we've looked at a number of different examples - showcase the difference on the multi-context *specific* questions that were synthetically generated.

> NOTE: You have all the data you'll need already in the notebook if you made it to this step!

In [74]:
### YOUR CODE HERE
# Filter out the multi-context specific questions
multi_context_test_df = test_df.loc[test_df["evolution_type"] == "multi_context"]
multi_context_test_questions = multi_context_test_df["question"].values.tolist()
multi_context_test_groundtruths = multi_context_test_df["ground_truth"].values.tolist()
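The boolean mask above keeps only the rows whose `evolution_type` is `multi_context`. The same idea in plain Python, using the question types from the test set table; tracking the one blank `ground_truth` through a set of row indices is our own choice of representation for this sketch.

```python
# evolution_type per row, copied from the test set table.
evolution_types = [
    "simple", "simple", "reasoning", "reasoning", "multi_context",
    "multi_context", "multi_context", "multi_context", "multi_context", "reasoning",
]
missing_ground_truth = {7}  # row 7 has a blank ground_truth in the table

# Keep the row indices whose type matches, like the DataFrame mask does.
multi_context_rows = [i for i, kind in enumerate(evolution_types) if kind == "multi_context"]
rows_without_truth = [i for i in multi_context_rows if i in missing_ground_truth]

print(multi_context_rows)  # [4, 5, 6, 7, 8]
print(rows_without_truth)  # [7]
```

Five questions times five metrics gives the 25 evaluation jobs in the progress bars below, and the row with no ground truth will likely drag down ground-truth-dependent metrics such as context recall.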

In [82]:
from datasets import Dataset

# Baseline pipeline

# Generate responses and collect the retrieved contexts for each question.
answers = []
contexts = []

for question in multi_context_test_questions:
  response = retrieval_augmented_qa_chain.invoke({"question" : question})
  answers.append(response["response"].content)
  contexts.append([context.page_content for context in response["context"]])

# Wrap the results in a Hugging Face dataset for Ragas.
response_dataset = Dataset.from_dict({
    "question" : multi_context_test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : multi_context_test_groundtruths
})

In [76]:
# Evaluate on the same metrics
multi_context_results = evaluate(response_dataset, metrics)

Evaluating:   0%|          | 0/25 [00:00<?, ?it/s]

In [77]:
multi_context_results

{'faithfulness': 0.8000, 'answer_relevancy': 0.9432, 'context_recall': 0.6000, 'context_precision': 0.5778, 'answer_correctness': 0.7792}

In [88]:
# advanced_retrieval
answers = []
contexts = []

for question in multi_context_test_questions:
  response = retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])


multi_context_response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : multi_context_test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : multi_context_test_groundtruths
})

In [89]:
# response_dataset_advanced_retrieval = generate_response(multi_context_test_questions, multi_context_test_groundtruths, retrieval_chain)

multi_context_advanced_retrieval_results = evaluate(multi_context_response_dataset_advanced_retrieval, metrics)
multi_context_advanced_retrieval_results

Evaluating:   0%|          | 0/25 [00:00<?, ?it/s]

{'faithfulness': 1.0000, 'answer_relevancy': 0.9517, 'context_recall': 0.6000, 'context_precision': 0.5639, 'answer_correctness': 0.5646}

In [86]:
# new advanced_retrieval
answers = []
contexts = []

for question in multi_context_test_questions:
  response = new_retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])


multi_context_new_response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : multi_context_test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : multi_context_test_groundtruths
})

In [87]:
# multi_context_new_response_dataset_advanced_retrieval = generate_response(multi_context_test_questions, multi_context_test_groundtruths, new_retrieval_chain)

# Evaluate on the same metrics
multi_context_new_advanced_retrieval_results = evaluate(multi_context_new_response_dataset_advanced_retrieval, metrics)
multi_context_new_advanced_retrieval_results

Evaluating:   0%|          | 0/25 [00:00<?, ?it/s]

{'faithfulness': 0.8000, 'answer_relevancy': 0.9587, 'context_recall': 0.6000, 'context_precision': 0.6967, 'answer_correctness': 0.7740}

In [90]:
df_baseline = pd.DataFrame(list(multi_context_results.items()), columns=['Metric', 'Baseline'])
df_original = pd.DataFrame(list(multi_context_advanced_retrieval_results.items()), columns=['Metric', 'ADA'])
df_comparison = pd.DataFrame(list(multi_context_new_advanced_retrieval_results.items()), columns=['Metric', 'Text Embedding 3'])

df_multi_context = pd.merge(df_original, df_comparison, on='Metric')
df_multi_context = pd.merge(df_baseline, df_multi_context, on="Metric")

df_multi_context['Delta (TE3 -> ADA)'] = df_multi_context['Text Embedding 3'] - df_multi_context['ADA']
df_multi_context['Delta (TE3 -> Baseline)'] = df_multi_context['Text Embedding 3'] - df_multi_context['Baseline']

df_multi_context

| | Metric | Baseline | ADA | Text Embedding 3 | Delta (TE3 -> ADA) | Delta (TE3 -> Baseline) |
|---|---|---|---|---|---|---|
| 0 | faithfulness | 0.8 | 1.0 | 0.8 | -0.2 | 0.0 |
| 1 | answer_relevancy | 0.94315 | 0.951655 | 0.958748 | 0.007092 | 0.015598 |
| 2 | context_recall | 0.6 | 0.6 | 0.6 | 0.0 | 0.0 |
| 3 | context_precision | 0.577778 | 0.563944 | 0.696667 | 0.132722 | 0.118889 |
| 4 | answer_correctness | 0.779165 | 0.56458 | 0.774016 | 0.209436 | -0.00515 |
