# Evaluation of RAG Using Ragas

In this notebook we'll explore how to evaluate RAG pipelines using a powerful open-source tool called "Ragas". It gives us both component-wise metrics and end-to-end metrics for the performance of our RAG pipelines.

We'll complete the following tasks:

- 🤝 Breakout Room #1:
  1. Install required libraries
  2. Set Environment Variables
  3. Creating a simple RAG pipeline with [LangChain v0.1.0](https://blog.langchain.dev/langchain-v0-1-0/)
  

- 🤝 Breakout Room #2:
  1. Synthetic Dataset Generation for Evaluation using [Ragas](https://github.com/explodinggradients/ragas)
  2. Evaluating our pipeline with Ragas
  3. Making Adjustments to our RAG Pipeline
  4. Evaluating our Adjusted pipeline against our baseline
  5. Testing OpenAI's Claim

The only way to get started is to get started - so let's grab our dependencies for the day!

## Motivation

OpenAI claims that their `text-embedding-3-small` model is generally better than their `text-embedding-ada-002` model.

Here are some passages from their [blog](https://openai.com/blog/new-embedding-models-and-api-updates) about the `text-embedding-3` release:

> `text-embedding-3-small` is our new highly efficient embedding model and provides a significant upgrade over its predecessor, the `text-embedding-ada-002` model...

> **Stronger performance.** Comparing `text-embedding-ada-002` to `text-embedding-3-small`, the average score on a commonly used benchmark for multi-language retrieval ([MIRACL](https://github.com/project-miracl/miracl)) has increased from 31.4% to 44.0%, while the average score on a commonly used benchmark for English tasks ([MTEB](https://github.com/embeddings-benchmark/mteb)) has increased from 61.0% to 62.3%.

Well, with a library like Ragas - we can put that claim to the test!

If what they claim is true - we should see an increase on related metrics by using the new embedding model!

# 🤝 Breakout Room #1

## Task 1: Installing Required Libraries

A reminder that one of the [key features](https://blog.langchain.dev/langchain-v0-1-0/) of LangChain v0.1.0 is the compartmentalization of the various LangChain ecosystem packages!

So let's begin grabbing all of our LangChain related packages!

In [362]:
!pip install -U -q langchain langchain-openai langchain_core langchain-community langchainhub openai

We'll also get the "star of the show" today, which is Ragas!

In [363]:
!pip install -qU ragas

As well, instead of the remote hosted solution that we used last week (Pinecone), we'll be leveraging Meta's [FAISS](https://github.com/facebookresearch/faiss) as the backend for our LangChain `VectorStore`.

We'll also install `pymupdf`, which will allow us to load PDFs using the `PyMuPDFLoader` in the `langchain-community` package, along with `pandas` for working with our results!

In [364]:
!pip install -qU faiss_cpu pymupdf pandas

## Task 2: Set Environment Variables

Let's set up our OpenAI API key so we can leverage their API later on.

In [365]:
import os
import openai
from getpass import getpass

openai.api_key = getpass("Please provide your OpenAI Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key

## Task 3: Creating a Simple RAG Pipeline with LangChain v0.1.0

Building on what we learned last week, we'll be leveraging LangChain v0.1.0 and LCEL to build a simple RAG pipeline that we can baseline with Ragas.

## Building our RAG pipeline

Let's review the basic steps of RAG again:

- Create an Index
- Use retrieval to obtain pieces of context from our Index that are similar to our query
- Use an LLM to generate responses based on the retrieved context
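
To make the three steps concrete, here is a minimal, dependency-free sketch. It is not how LangChain or FAISS work internally: the bag-of-words `embed` function, the `toy_chunks` list, and the helper names are all hypothetical stand-ins, and the final step only assembles the prompt an LLM would receive.

```python
import math
from collections import Counter

def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Step 1: create an index (here, just a list of (vector, text) pairs).
toy_chunks = [
    "the plaintiff filed a complaint",
    "the weather is sunny today",
    "the defendants deny the complaint",
]
toy_index = [(embed(chunk), chunk) for chunk in toy_chunks]

def retrieve(query, k=2):
    """Step 2: return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(toy_index, key=lambda pair: cosine(q, pair[0]), reverse=True)
    return [text for _, text in ranked[:k]]

def build_prompt(query):
    """Step 3 (input side): assemble the context and question an LLM would receive."""
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion:\n{query}"

print(build_prompt("who filed the complaint"))
```

In the real pipeline, each of these toy pieces is swapped for an embedding model, a vector store, and a chat model respectively.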

Let's get started by creating our index.

> NOTE: We're going to start leaning on the term "index" to refer to our `VectorStore`, `VectorDatabase`, etc. We can think of "index" as the catch-all term, whereas `VectorStore` and the like relate to the specific technologies used to create, store, and interact with the index.

### Creating an Index

You'll notice that the largest changes (outside of some import changes) are that our old favourite chains are back to being bundled in an easily usable abstraction.

We can still create custom chains using LCEL - but we can also be more confident that our pre-packaged chains are created using LCEL under the hood.

#### Loading Data

Let's start by loading some data!

> NOTE: You'll notice that we're using a document loader from the community package of LangChain. This is part of the v0.1.0 changes that make the base (`langchain-core`) package remain lightweight while still providing access to some of the more powerful community integrations.

In [366]:
!git clone https://github.com/AI-Maker-Space/DataRepository

fatal: destination path 'DataRepository' already exists and is not an empty directory.


In [367]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader(
    "DataRepository/MuskComplaint.pdf",
)

documents = loader.load()

In [368]:
documents[0].metadata

{'source': 'DataRepository/MuskComplaint.pdf',
 'file_path': 'DataRepository/MuskComplaint.pdf',
 'page': 0,
 'total_pages': 46,
 'format': 'PDF 1.7',
 'title': '',
 'author': '',
 'subject': '',
 'keywords': '',
 'creator': '',
 'producer': '',
 'creationDate': '',
 'modDate': '',
 'trapped': ''}

#### Transforming Data

Now that we've got our single document - let's split it into smaller pieces so we can more effectively leverage it with our retrieval chain!

We'll start with the classic: `RecursiveCharacterTextSplitter`.

In [369]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 700,
    chunk_overlap = 50
)

documents = text_splitter.split_documents(documents)

Let's confirm we've split our document.

In [370]:
len(documents)

159

In [371]:
documents[1]

Document(page_content='ELON MUSK, an individual, \nPlaintiff, \nvs. \nSAMUEL ALTMAN, an individual, GREGORY \nBROCKMAN, an individual, OPENAI, INC., a \ncorporation, OPENAI, L.P., a limited \npartnership, OPENAI, L.L.C., a limited liability \ncompany, OPENAI GP, L.L.C., a limited \nliability company, OPENAI OPCO, LLC, a \nlimited liability company, OPENAI GLOBAL, \nLLC, a limited liability company, OAI \nCORPORATION, LLC, a limited liability \ncompany, OPENAI HOLDINGS, LLC, a limited \nliability company, and DOES 1 through 100, \ninclusive, \nDefendants. \nCase No.:  \n[UNLIMITED JURISDICTION] \n \nCOMPLAINT FOR (1) BREACH OF \nCONTRACT, (2) PROMISSORY \nESTOPPEL, (3) BREACH OF FIDUCIARY \nDUTY, (4) UNFAIR COMPETITION', metadata={'source': 'DataRepository/MuskComplaint.pdf', 'file_path': 'DataRepository/MuskComplaint.pdf', 'page': 0, 'total_pages': 46, 'format': 'PDF 1.7', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': '', 'creationDate': '', 'modD
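
To build intuition for what `chunk_size` and `chunk_overlap` control, here is a simplified sketch of a fixed-size sliding window over characters. This is an assumption-laden illustration, not the `RecursiveCharacterTextSplitter` algorithm, which first tries to split on separators such as paragraphs and lines; the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample_chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(sample_chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the role overlap plays: text near a boundary stays available to both neighbouring chunks.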

#### Loading OpenAI Embeddings Model

We'll need a way to convert our text into vectors that we can compare against our query vector.

Let's use OpenAI's `text-embedding-ada-002` for this task!

In [372]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="text-embedding-ada-002"
)

#### Creating a FAISS VectorStore

Now that we have documents - we'll need a place to store them alongside their embeddings.

In [373]:
from langchain_community.vectorstores import FAISS

vector_store = FAISS.from_documents(documents, embeddings)

#### ❓ Question #1:

List out a few of the techniques that FAISS uses that make it performant.

**Answer**: FAISS is flexible in its retrieval options. Its basic indexes perform exact search by Euclidean (L2) distance or dot product; cosine similarity is essentially a dot product on normalized vectors. FAISS also offers approximate indexes, such as inverted-file partitioning, product quantization, and HNSW graphs, that offer various trade-offs between memory, search speed, and accuracy. Its GPU implementation significantly accelerates search and supports multi-GPU configurations. Together, these let FAISS handle billions of vectors, which is crucial for the large-scale datasets common in AI applications.

> NOTE: Check the [repository](https://github.com/facebookresearch/faiss) for more information about FAISS!
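
The relationship between cosine similarity, dot product, and Euclidean distance mentioned in the answer can be checked with a few lines of plain Python. This is only a numeric illustration with made-up vectors `u` and `v`, not FAISS code.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

u = [3.0, 4.0]
v = [4.0, 3.0]
u_hat, v_hat = normalize(u), normalize(v)

cos_uv = cosine_similarity(u, v)
# On unit vectors, the dot product IS the cosine similarity...
print(cos_uv, dot(u_hat, v_hat))
# ...and squared L2 distance equals 2 - 2 * cosine, so both give the same ranking.
print(squared_l2(u_hat, v_hat), 2 - 2 * cos_uv)
```

This is why normalizing embeddings lets a dot-product or L2 index be used for cosine-style retrieval.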

#### Creating a Retriever

To complete our index, all that's left to do is expose our vectorstore as a retriever - which we can do the same way we would in previous versions of LangChain!

In [374]:
retriever = vector_store.as_retriever()

#### Testing our Retriever

Now that we've gone through the trouble of creating our retriever - let's see it in action!

In [375]:
retrieved_documents = retriever.invoke("Who is the plantiff?")

In [376]:
for doc in retrieved_documents:
  print(doc)

page_content='would be owned by the foundation and used ‘for the good of the world’[.]” Plaintiff \nreplied: “Agree on all.” Ex. 2 at 1.' metadata={'source': 'DataRepository/MuskComplaint.pdf', 'file_path': 'DataRepository/MuskComplaint.pdf', 'page': 27, 'total_pages': 46, 'format': 'PDF 1.7', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': '', 'creationDate': '', 'modDate': '', 'trapped': ''}
page_content='property and derivative works funded by those monies, Plaintiff is presently unable to ascertain his \ninterest in or the use, allocation, or distribution of assets without an accounting. Plaintiff is therefore \nentitled to an accounting.' metadata={'source': 'DataRepository/MuskComplaint.pdf', 'file_path': 'DataRepository/MuskComplaint.pdf', 'page': 32, 'total_pages': 46, 'format': 'PDF 1.7', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': '', 'creationDate': '', 'modDate': '', 'trapped': ''}
page_content='1

### Creating a RAG Chain

Now that we have the "R" in RAG taken care of - let's look at creating the "AG"!

#### Creating a Prompt Template

There are a few different ways we could create our prompt template - we could create a custom template, as seen in the code below, or we could simply pull a prompt from the prompt hub! Let's look at an example of that!

In [377]:
from langchain import hub

retrieval_qa_prompt = hub.pull("langchain-ai/retrieval-qa-chat")

In [378]:
print(retrieval_qa_prompt.messages[0].prompt.template)

Answer any use questions based solely on the context below:

<context>
{context}
</context>


As you can see - the prompt template is simple (and has a small error) - so we'll create our own to be a bit more specific!

In [379]:
from langchain.prompts import ChatPromptTemplate

template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

Context:
{context}

Question:
{question}
"""

prompt = ChatPromptTemplate.from_template(template)
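
For intuition, the placeholder substitution can be sketched with plain `str.format`. This is a simplification: `ChatPromptTemplate` additionally wraps the result in a chat message, and the chunk strings, `toy_template` name, and `fill_template` helper here are hypothetical.

```python
toy_template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

Context:
{context}

Question:
{question}
"""

def fill_template(context_chunks, question):
    """Join retrieved chunks and substitute them into the template placeholders."""
    return toy_template.format(context="\n\n".join(context_chunks), question=question)

filled = fill_template(["Chunk one.", "Chunk two."], "Who is the plaintiff?")
print(filled)
```

Every `{name}` in the template becomes a required input variable, which is why our chain has to supply both `context` and `question`.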

#### Setting Up our Basic QA Chain

Now we can instantiate our basic RAG chain!

We'll use LCEL directly just to see an example of it - but you could just as easily use an abstraction here to achieve the same goal!

We'll also make sure to pass our context through to the output - Ragas needs the retrieved contexts for evaluation.

In [380]:
from operator import itemgetter

from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

primary_qa_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

retrieval_augmented_qa_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and piping it into the retriever
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    # "context"  : re-assigned unchanged from the previous step; RunnablePassthrough.assign
    #              forwards every existing key (including "question") to the next step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object, which is
    #              then piped into the LLM; the result is stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": prompt | primary_qa_llm, "context": itemgetter("context")}
)

#### 🏗️ Activity #1:

Describe the pipeline shown above in simple terms. You can include a diagram if desired.

**Answer**: Above we have a RAG chain that first uses Python's itemgetter to extract the "question" from input, passing it to a retriever but also keeping the original "question" intact. A RunnablePassthrough then temporarily holds the "context" (which is obtained as an output of the "question" chained into the retriever) without altering it. Finally, the "context" and "question" are used as inputs for a prompt for ChatOpenAI, generating a "response".
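
The same data flow can be traced with ordinary functions and dictionaries. This is only a sketch of the shape of the data at each step; `toy_retriever` and `toy_llm` are hypothetical stand-ins, and no LCEL machinery is involved.

```python
def toy_retriever(question):
    """Hypothetical stand-in for the vector store retriever."""
    return [f"chunk relevant to: {question}"]

def toy_llm(prompt_text):
    """Hypothetical stand-in for the chat model; reports the prompt size."""
    return f"answer based on {len(prompt_text)} prompt characters"

def toy_chain(inputs):
    # Step 1: build {"context", "question"} from the incoming "question".
    state = {"context": toy_retriever(inputs["question"]), "question": inputs["question"]}
    # Step 2: pass the state through unchanged (mirrors RunnablePassthrough.assign).
    state = {**state, "context": state["context"]}
    # Step 3: format the prompt, call the model, and keep the context next to the response.
    prompt_text = f"Context:\n{state['context']}\n\nQuestion:\n{state['question']}"
    return {"response": toy_llm(prompt_text), "context": state["context"]}

toy_result = toy_chain({"question": "Who is the plaintiff?"})
print(toy_result)
```

The key point is the final dictionary: returning `context` alongside `response` is what lets us hand the retrieved chunks to an evaluator later.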

Let's test it out!

In [381]:
question = "Who is the plantiff?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)

Elon Musk


In [382]:
question = "What does this complaint pertain to?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)
print(result["context"])

The complaint pertains to breach of fiduciary duty, unfair business practices, accounting, and a demand for a jury trial.
[Document(page_content='1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n9 \n10 \n11 \n12 \n13 \n14 \n15 \n16 \n17 \n18 \n19 \n20 \n21 \n22 \n23 \n24 \n25 \n26 \n27 \n28 \n \n \n– 31 – \nCOMPLAINT \n \nTHIRD CAUSE OF ACTION \nBreach of Fiduciary Duty  \nAgainst All Defendants \n133. \nPlaintiff realleges and incorporates by reference only paragraphs of this Complaint \nnecessary for his claim of Breach of Fiduciary Duty. \n134. \nUnder California law, Defendants owe fiduciary duties to Plaintiff, including a duty \nto use Plaintiff’s contributions for the purposes for which they were made. E.g., Cal. Bus. & Prof. \nCode § 17510.8. Defendants have repeatedly breached their fiduciary duties to Plaintiff, including \nby:', metadata={'source': 'DataRepository/MuskComplaint.pdf', 'file_path': 'DataRepository/MuskComplaint.pdf', 'page': 30, 'total_pages': 46, 'format': 'PDF 1.7', 'title':

We can already see that there are some improvements we could make here.

For now, let's switch gears to Ragas to see how we can leverage that tool to provide us insight into how our pipeline is performing!

# 🤝 Breakout Room #2

## Task 1: Synthetic Dataset Generation for Evaluation using Ragas

Ragas is a powerful library that lets us evaluate our RAG pipeline by collecting input/output/context triplets and obtaining metrics relating to a number of different aspects of our RAG pipeline.

We'll be evaluating on every core metric today, but in order to do that - we'll need to create a test set. Luckily for us, Ragas can do that directly!

### Synthetic Test Set Generation

We can leverage Ragas' [`Synthetic Test Data generation`](https://docs.ragas.io/en/stable/concepts/testset_generation.html) functionality to generate our own synthetic question/context (QC) pairs - as well as a synthetic ground truth - quite easily!

> NOTE: This process will use `gpt-3.5-turbo-16k` as the base generator and `gpt-4` as the critic - if you're attempting to create a lot of samples please be aware of cost, as well as rate limits.

In [383]:
loader = PyMuPDFLoader(
    "DataRepository/MuskComplaint.pdf",
)

eval_documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 400
)

eval_documents = text_splitter.split_documents(eval_documents)

#### ❓ Question #2:

Why is it important to split our documents using different parameters when creating our synthetic data?

**Answer**: Because we want to test whether the system can handle unseen data and diverse scenarios effectively, not just the specific chunking it was built and optimized on. If the synthetic questions came from exactly the same chunks that sit in our index, the evaluation would favor our particular splitting strategy. A different strategy might reveal strengths or weaknesses that would otherwise stay hidden, providing a better understanding of the system's performance and areas for improvement.

In [384]:
len(eval_documents)

92

In [385]:
documents == eval_documents

False

In [386]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

generator = TestsetGenerator.with_openai()

testset = generator.generate_with_langchain_docs(eval_documents, test_size=12, distributions={simple: 0.25, reasoning: 0.25, multi_context: 0.5})

#### ❓ Question #3:

`{simple: 0.25, reasoning: 0.25, multi_context: 0.5}`

What exactly does this mapping refer to?

**Answer**: Ragas provides a synthetic Q&A data generation module that can cover different levels of complexity. The mapping sets the proportion of the test set that each question type should make up. First, 'simple' questions are generated, with seeding used to ensure diversity. Then, the original questions might undergo "evolutions" that make them more complex. The new questions might require reasoning in order to be answered ('reasoning' questions), or might require information contained in multiple chunks ('multi_context' questions). This is a way to simulate the variability in queries that production RAG systems might receive.

> NOTE: Check out the Ragas documentation on this generation process [here](https://docs.ragas.io/en/stable/concepts/testset_generation.html).
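
As a rough illustration of what such a mapping implies, here is how proportions translate into target counts for `test_size=12`. This is a sketch using simple rounding; Ragas' own allocation logic is not reproduced here, the `allocate_counts` helper is hypothetical, and a real run may produce different counts.

```python
def allocate_counts(test_size, distribution):
    """Turn proportions into per-type question counts (simple rounding)."""
    total = sum(distribution.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("proportions should sum to 1")
    return {name: round(test_size * share) for name, share in distribution.items()}

counts = allocate_counts(12, {"simple": 0.25, "reasoning": 0.25, "multi_context": 0.5})
print(counts)  # {'simple': 3, 'reasoning': 3, 'multi_context': 6}
```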

Let's look at the output and see what we can learn about it!

In [387]:
testset.test_data[0]

DataRow(question='What did Mr. Altman suggest as a means to ensure AI is created safely?', contexts=['1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n9 \n10 \n11 \n12 \n13 \n14 \n15 \n16 \n17 \n18 \n19 \n20 \n21 \n22 \n23 \n24 \n25 \n26 \n27 \n28 \n \n \n– 11 – \nCOMPLAINT \n \n43. \nMr. Altman appeared to share Mr. Musk’s concerns surrounding AI. In public blog \nposts dating back to 2014, Mr. Altman stated that AGI, if made, would “be the biggest development \nin technology ever.” Mr. Altman pointed out that there are many companies making strides towards \nachieving AGI, but acknowledged the unfortunate reality that the “good ones are very secretive \nabout it.” \n44. \nOn February 25, 2015, Mr. Altman also expressed his concern surrounding the \ndevelopment of what he referred to as “superhuman machine intelligence” which he identified as \n“probably the greatest threat to the continued existence of humanity” and emphasized that “as a \nhuman programmed to survive and reproduce, I feel we should f

### Generating Responses with RAG Pipeline

Now that we have some QC pairs, and some ground truths, let's evaluate our RAG pipeline using Ragas.

The process is, again, quite straightforward - thanks to Ragas and LangChain!

Let's start by extracting our questions and ground truths from our created test set.

We can start by converting our test dataset into a Pandas DataFrame.

In [388]:
test_df = testset.to_pandas()

In [389]:
test_df

| | question | contexts | ground_truth | evolution_type | episode_done |
|---|---|---|---|---|---|
| 0 | What did Mr. Altman suggest as a means to ensu... | `[1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n9 \n10 \n11 \...` | Mr. Altman suggested government regulation as ... | simple | True |
| 1 | How did researchers at the University of Tokyo... | `[implementation for others to build on. \n84. ...` | Researchers at the University of Tokyo and Goo... | simple | True |
| 2 | What strategy video game did OpenAI compete in? | `[multiple members including OpenAI, Inc., Aest...` | OpenAI competed in Dota 2, a strategy video game. | simple | True |
| 3 | How did the Board restructuring affect OpenAI'... | `[the Board from which it could keep a close ey...` | The Board restructuring affected OpenAI's chec... | reasoning | True |
| 4 | What does Mr. Nadella say about Microsoft's st... | `[Indeed, during an interview shortly after Mr....` | Mr. Nadella states that if OpenAI disappeared,... | reasoning | True |
| 5 | Which organization used DeepMind's influence t... | `[multiple members including OpenAI, Inc., Aest...` | OpenAI | reasoning | True |
| 6 | How does GPT-4's reasoning compare to humans o... | `[dramatically compressing. \n86. \nOn March 14...` | GPT-4's reasoning is better than average human... | multi_context | True |
| 7 | What was OpenAI's original intention in the AG... | `[profit developing AGI for the benefit of huma...` | OpenAI's original intention in the AGI race, a... | multi_context | True |
| 8 | What is the ownership and control relationship... | `[multiple members including OpenAI, Inc., Aest...` | OpenAI, Inc. manages OpenAI Global, LLC. | multi_context | True |
| 9 | "What is Artificial General Intelligence (AGI)... | `[food is shown in a photo. One of the hallmark...` | The basic concept of AGI is a general purpose ... | multi_context | True |


In [390]:
test_questions = test_df["question"].values.tolist()
test_groundtruths = test_df["ground_truth"].values.tolist()

Now we'll generate responses using our RAG pipeline using the questions we've generated - we'll also need to collect our retrieved contexts for each question.

We'll do this in a simple loop to see exactly what's happening!

In [391]:
answers = []
contexts = []

for question in test_questions:
  response = retrieval_augmented_qa_chain.invoke({"question" : question})
  answers.append(response["response"].content)
  contexts.append([context.page_content for context in response["context"]])

Now we can wrap our information in a Hugging Face dataset for use in the Ragas library.

In [392]:
from datasets import Dataset

response_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

Let's take a peek and see what that looks like!

In [393]:
response_dataset[0]

{'question': 'What did Mr. Altman suggest as a means to ensure AI is created safely?',
 'answer': 'Government regulation.',
 'contexts': ['“probably the greatest threat to the continued existence of humanity” and emphasized that “as a \nhuman programmed to survive and reproduce, I feel we should fight it.” Further, Mr. Altman \ncriticized those who believed that “superhuman machine intelligence” was dangerous but dismissed \nit as “never going to happen or definitely very far off.” He accused them of engaging in “sloppy, \ndangerous thinking.” \n45. \nIndeed, in early March 2015, Mr. Altman extolled the importance of government \nregulation as a means to ensure AI is created safely and suggested that “a group of very smart people \nwith a lot of resources” likely involving “US companies in some way” would be the most probable',
  'to ensure that AI was developed and practiced safely. \n40. \nFollowing Google’s acquisition of DeepMind, Mr. Musk began “hosting his own \nseries of dinner 

## Task 2: Evaluating our Pipeline with Ragas

Now that we have our response dataset - we can finally get into the "meat" of Ragas - evaluation!

First, we'll import the desired metrics, then we can use them to evaluate our created dataset!

Check out the specific metrics we'll be using in the Ragas documentation:

- [Faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html)
- [Answer Relevancy](https://docs.ragas.io/en/stable/concepts/metrics/answer_relevance.html)
- [Context Precision](https://docs.ragas.io/en/stable/concepts/metrics/context_precision.html)
- [Context Recall](https://docs.ragas.io/en/stable/concepts/metrics/context_recall.html)
- [Answer Correctness](https://docs.ragas.io/en/stable/concepts/metrics/answer_correctness.html)

See the accompanying presentation for more in-depth explanations about each of the metrics!
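
To give a feel for how a ranking-aware metric such as context precision rewards relevant chunks that appear early, here is a minimal sketch. It assumes the relevance verdicts are already known (in Ragas they come from an LLM judge, which this sketch does not reproduce), and it follows my reading of the documented formula: the mean of precision@k over the ranks that hold a relevant context. The function name and the 0/1 flags are hypothetical.

```python
def context_precision(relevance):
    """Average precision@k over the ranks that hold a relevant context.

    relevance: list of 0/1 flags, one per retrieved context, in ranked order.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision@rank at a relevant position
    return precision_sum / hits if hits else 0.0

print(context_precision([1, 0, 0, 1]))  # relevant chunks at ranks 1 and 4 -> 0.75
print(context_precision([0, 1, 1, 0]))  # relevant chunks at ranks 2 and 3
print(context_precision([1, 1, 1, 1]))  # everything relevant -> 1.0
```

Note how the same number of relevant chunks scores lower when they sit further down the ranking.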

In [394]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_correctness,
    context_recall,
    context_precision,
)

metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
]

All that's left to do is call "evaluate" and away we go!

In [395]:
results = evaluate(response_dataset, metrics)

In [426]:
print(results)

{'faithfulness': 0.9167, 'answer_relevancy': 0.9298, 'context_recall': 0.8194, 'context_precision': 0.8403, 'answer_correctness': 0.7327}


In [397]:
results_df = results.to_pandas()
results_df

| | question | answer | contexts | ground_truth | faithfulness | answer_relevancy | context_recall | context_precision | answer_correctness |
|---|---|---|---|---|---|---|---|---|---|
| 0 | What did Mr. Altman suggest as a means to ensu... | Government regulation. | `[“probably the greatest threat to the continue...` | Mr. Altman suggested government regulation as ... | 1.0 | 0.890794 | 1.0 | 0.75 | 0.96254 |
| 1 | How did researchers at the University of Tokyo... | By simply adding 'Let's think step by step' be... | `[implementation for others to build on. \n84. ...` | Researchers at the University of Tokyo and Goo... | 1.0 | 0.908146 | 1.0 | 1.0 | 0.719096 |
| 2 | What strategy video game did OpenAI compete in? | Dota 2 | `[77. \nInitial work at OpenAI followed much in...` | OpenAI competed in Dota 2, a strategy video game. | 1.0 | 0.953836 | 1.0 | 1.0 | 0.716374 |
| 3 | How did the Board restructuring affect OpenAI'... | The Board restructuring collapsed OpenAI's cor... | `[the Board from which it could keep a close ey...` | The Board restructuring affected OpenAI's chec... | 1.0 | 0.916064 | 1.0 | 1.0 | 0.671755 |
| 4 | What does Mr. Nadella say about Microsoft's st... | Mr. Nadella stated that Microsoft was very con... | `[Indeed, during an interview shortly after Mr....` | Mr. Nadella states that if OpenAI disappeared,... | 1.0 | 0.974927 | 1.0 | 1.0 | 0.736825 |
| 5 | Which organization used DeepMind's influence t... | OpenAI | `[a superhuman level of play in the games of ch...` | OpenAI | 0.0 | 0.872235 | 0.333333 | 0.583333 | 1.0 |
| 6 | How does GPT-4's reasoning compare to humans o... | GPT-4's reasoning is superior to humans on exa... | `[titled “Sparks of Artificial General Intellig...` | GPT-4's reasoning is better than average human... | 1.0 | 0.921261 | 1.0 | 1.0 | 0.484975 |
| 7 | What was OpenAI's original intention in the AG... | OpenAI's original intention in the AGI race as... | `[profit developing AGI for the benefit of huma...` | OpenAI's original intention in the AGI race, a... | 1.0 | 0.946791 | 1.0 | 1.0 | 0.748239 |
| 8 | What is the ownership and control relationship... | OpenAI Global, LLC has two members: Microsoft ... | `[LLC through its general partner, OpenAI GP, L...` | OpenAI, Inc. manages OpenAI Global, LLC. | 1.0 | 0.896853 | 1.0 | 1.0 | 0.531793 |
| 9 | "What is Artificial General Intelligence (AGI)... | Artificial General Intelligence (AGI) is a gen... | `[food is shown in a photo. One of the hallmark...` | The basic concept of AGI is a general purpose ... | 1.0 | 0.929179 | 0.5 | 0.75 | 0.53936 |


## Task 3: Making Adjustments to our RAG Pipeline

Now that we have established a baseline - we can see how any changes impact our pipeline's performance!

Let's modify our retriever and see how that impacts our Ragas metrics!

In [398]:
from langchain.retrievers import MultiQueryRetriever

advanced_retriever = MultiQueryRetriever.from_llm(retriever=retriever, llm=primary_qa_llm)
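
The idea behind multi-query retrieval can be sketched without any LLM: rephrase the question several ways, retrieve for each variant, and keep the unique union of the results. Everything in this sketch is a hypothetical stand-in (the canned rephrasings, the `fake_results` lookup, and the chunk ids); it illustrates the technique rather than LangChain's implementation.

```python
def toy_generate_queries(question):
    """Hypothetical stand-in for the LLM that rephrases the question."""
    return [f"Rephrased: {question}", f"In other words: {question}"]

def toy_search(query):
    """Hypothetical stand-in for the base retriever; returns chunk ids."""
    fake_results = {
        "Rephrased: Who sued?": ["chunk-2", "chunk-3"],
        "In other words: Who sued?": ["chunk-1", "chunk-2"],
    }
    return fake_results.get(query, [])

def multi_query_retrieve(question):
    """Retrieve for every query variant and keep the unique union, in first-seen order."""
    seen = []
    for query in toy_generate_queries(question):
        for chunk in toy_search(query):
            if chunk not in seen:
                seen.append(chunk)
    return seen

print(multi_query_retrieve("Who sued?"))  # ['chunk-2', 'chunk-3', 'chunk-1']
```

Because different phrasings surface different chunks, the union tends to cover more of the relevant material than a single query, at the cost of extra LLM and retrieval calls.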

We'll also re-create our RAG pipeline using the abstractions that come packaged with LangChain v0.1.0!

First, let's create a chain to "stuff" our documents into our context!

In [399]:
from langchain.chains.combine_documents import create_stuff_documents_chain

document_chain = create_stuff_documents_chain(primary_qa_llm, retrieval_qa_prompt)

Next, we'll create the retrieval chain!

In [400]:
from langchain.chains import create_retrieval_chain

retrieval_chain = create_retrieval_chain(advanced_retriever, document_chain)
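
Conceptually, "stuffing" means concatenating every retrieved chunk into one context string, and the retrieval chain returns the input, the retrieved context, and the answer together. The sketch below mimics that shape with plain functions; the `"\n\n"` separator is an assumption, and the lambda retriever and answerer are hypothetical stand-ins for the real components.

```python
def toy_stuff_documents(docs, separator="\n\n"):
    """Concatenate every retrieved chunk into a single context string."""
    return separator.join(docs)

def toy_retrieval_chain(inputs, retriever, answerer):
    """Mimic the output shape used in this notebook: input, context, and answer keys."""
    docs = retriever(inputs["input"])
    context_text = toy_stuff_documents(docs)
    return {
        "input": inputs["input"],
        "context": docs,
        "answer": answerer(context_text, inputs["input"]),
    }

toy_response = toy_retrieval_chain(
    {"input": "Who is the plaintiff?"},
    retriever=lambda q: ["Chunk A.", "Chunk B."],                   # hypothetical retriever
    answerer=lambda ctx, q: f"{len(ctx)} context characters seen",  # hypothetical LLM
)
print(toy_response)
```

Note that this chain is invoked with an `"input"` key and answers under `"answer"`, unlike our hand-built LCEL chain, which used `"question"` and `"response"`.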

In [401]:
response = retrieval_chain.invoke({"input": "Who is the plantiff?"})

In [402]:
print(response["answer"])

The plaintiff is Elon Musk.


In [403]:
response = retrieval_chain.invoke({"input": "What does this complaint pertain to?"})

In [404]:
print(response["answer"])

The complaint pertains to a legal case involving Plaintiff Elon Musk alleging breach of fiduciary duty, unfair business practices, and seeking an accounting, restitution, disgorgement of funds, and injunctive relief against all Defendants. The complaint also includes a demand for a jury trial.


Well, just from those responses this chain *feels* better - but let's see how it performs on our eval!

Let's do the same process we did before to collect our pipeline's contexts and answers.

In [405]:
answers = []
contexts = []

for question in test_questions:
  response = retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])

Now we can convert this into a dataset, just like we did before.

In [406]:
response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

Let's evaluate on the same metrics we did for the first pipeline and see how it does!

In [407]:
advanced_retrieval_results = evaluate(response_dataset_advanced_retrieval, metrics)

In [408]:
advanced_retrieval_results_df = advanced_retrieval_results.to_pandas()
advanced_retrieval_results_df

| | question | answer | contexts | ground_truth | faithfulness | answer_relevancy | context_recall | context_precision | answer_correctness |
|---|---|---|---|---|---|---|---|---|---|
| 0 | What did Mr. Altman suggest as a means to ensu... | Mr. Altman suggested that government regulatio... | `[“probably the greatest threat to the continue...` | Mr. Altman suggested government regulation as ... | 1.0 | 0.974245 | 1.0 | 1.0 | 0.741462 |
| 1 | How did researchers at the University of Tokyo... | Researchers at the University of Tokyo and Goo... | `[implementation for others to build on. \n84. ...` | Researchers at the University of Tokyo and Goo... | 1.0 | 0.986865 | 1.0 | 1.0 | 0.999481 |
| 2 | What strategy video game did OpenAI compete in? | OpenAI competed in Dota 2, a strategy video ga... | `[77. \nInitial work at OpenAI followed much in...` | OpenAI competed in Dota 2, a strategy video game. | 1.0 | 1.0 | 1.0 | 1.0 | 0.741034 |
| 3 | How did the Board restructuring affect OpenAI'... | The restructuring of the Board at OpenAI, Inc.... | `[the Board from which it could keep a close ey...` | The Board restructuring affected OpenAI's chec... | 1.0 | 0.886073 | 1.0 | 0.916667 | 0.842188 |
| 4 | What does Mr. Nadella say about Microsoft's st... | Mr. Nadella stated that Microsoft was very con... | `[Indeed, during an interview shortly after Mr....` | Mr. Nadella states that if OpenAI disappeared,... | 1.0 | 0.974927 | 1.0 | 1.0 | 0.841375 |
| 5 | Which organization used DeepMind's influence t... | OpenAI used DeepMind's influence to compete in... | `[77. \nInitial work at OpenAI followed much in...` | OpenAI | 1.0 | 0.901718 | 0.5 | 1.0 | 0.718147 |
| 6 | How does GPT-4's reasoning compare to humans o... | GPT-4's reasoning capabilities were found to b... | `[titled “Sparks of Artificial General Intellig...` | GPT-4's reasoning is better than average human... | 0.8 | 0.921381 | 1.0 | 1.0 | 0.915019 |
| 7 | What was OpenAI's original intention in the AG... | OpenAI's original intention, as per the Foundi... | `[profit developing AGI for the benefit of huma...` | OpenAI's original intention in the AGI race, a... | 1.0 | 0.926395 | 1.0 | 1.0 | 0.541145 |
| 8 | What is the ownership and control relationship... | OpenAI Global, LLC has two members: Microsoft ... | `[LLC through its general partner, OpenAI GP, L...` | OpenAI, Inc. manages OpenAI Global, LLC. | 1.0 | 0.896853 | 1.0 | 0.916667 | 0.446089 |
| 9 | "What is Artificial General Intelligence (AGI)... | Artificial General Intelligence (AGI) refers t... | `[food is shown in a photo. One of the hallmark...` | The basic concept of AGI is a general purpose ... | NaN | 0.929179 | 0.5 | 1.0 | 0.798882 |


## Task 4: Evaluating our Adjusted Pipeline Against Our Baseline

Now we can compare our results and see what directional changes occurred!

Let's refresh with our initial metrics.

In [409]:
results

{'faithfulness': 0.9167, 'answer_relevancy': 0.9298, 'context_recall': 0.8194, 'context_precision': 0.8403, 'answer_correctness': 0.7327}

And see how our advanced retrieval modified our chain!

In [410]:
advanced_retrieval_results

{'faithfulness': 0.9600, 'answer_relevancy': 0.9436, 'context_recall': 0.8333, 'context_precision': 0.9028, 'answer_correctness': 0.7209}

In [411]:
import pandas as pd

df_original = pd.DataFrame(list(results.items()), columns=['Metric', 'Baseline'])
df_comparison = pd.DataFrame(list(advanced_retrieval_results.items()), columns=['Metric', 'MultiQueryRetriever with Document Stuffing'])

df_merged = pd.merge(df_original, df_comparison, on='Metric')

df_merged['Delta'] = df_merged['MultiQueryRetriever with Document Stuffing'] - df_merged['Baseline']

df_merged

| | Metric | Baseline | MultiQueryRetriever with Document Stuffing | Delta |
|---|---|---|---|---|
| 0 | faithfulness | 0.916667 | 0.96 | 0.043333 |
| 1 | answer_relevancy | 0.929754 | 0.943593 | 0.013839 |
| 2 | context_recall | 0.819444 | 0.833333 | 0.013889 |
| 3 | context_precision | 0.840278 | 0.902778 | 0.0625 |
| 4 | answer_correctness | 0.732664 | 0.720944 | -0.01172 |
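
The same comparison can be sketched without pandas. The snippet below is a minimal illustration rather than part of the original notebook: it hard-codes the two result dictionaries printed above, and `metric_deltas` is a hypothetical helper name.

```python
# Scores copied from the `results` and `advanced_retrieval_results` outputs above.
baseline = {
    "faithfulness": 0.9167,
    "answer_relevancy": 0.9298,
    "context_recall": 0.8194,
    "context_precision": 0.8403,
    "answer_correctness": 0.7327,
}
advanced = {
    "faithfulness": 0.9600,
    "answer_relevancy": 0.9436,
    "context_recall": 0.8333,
    "context_precision": 0.9028,
    "answer_correctness": 0.7209,
}


def metric_deltas(before, after):
    """Return {metric: after - before} for every metric present in both runs."""
    return {name: after[name] - before[name] for name in before if name in after}


deltas = metric_deltas(baseline, advanced)

# Print the metrics from largest gain to largest loss.
for name, delta in sorted(deltas.items(), key=lambda item: item[1], reverse=True):
    direction = "improved" if delta > 0 else "regressed" if delta < 0 else "unchanged"
    print(f"{name:<20} {delta:+.4f}  {direction}")
```

With these numbers, four of the five metrics move up and only `answer_correctness` moves down, which agrees with the direction of the `Delta` column.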


## Task 5: Testing OpenAI's Claim

Now that we've seen how our retriever can impact the performance of our RAG pipeline - let's see how changing our embedding model impacts performance.

#### 🏗️ Activity #2:

Please provide markdown, or code comments, to explain what each of the following steps is doing!

In [430]:
#Let us repeat the process by using "text-embedding-3-small" instead of "text-embedding-ada-002" as our embedding model:
new_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

In [431]:
#Again, we store our documents alongside their new embeddings in an index:
new_vector_store = FAISS.from_documents(documents, new_embeddings)

In [432]:
#And expose our vector_store as a retriever:
new_retriever = new_vector_store.as_retriever()

In [433]:
# As we did before with "text-embedding-ada-002", we will also consider the effect of an advanced retriever:
new_advanced_retriever = MultiQueryRetriever.from_llm(retriever=new_retriever, llm=primary_qa_llm)
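
Conceptually, a multi-query retriever asks the LLM for several rephrasings of the question, runs the base retriever on each one, and returns the de-duplicated union of the results. The sketch below illustrates only that idea. It is not LangChain's implementation, and `multi_query_retrieve`, `toy_generate_queries`, and `toy_retrieve` are hypothetical stand-ins so the example runs offline.

```python
def multi_query_retrieve(question, generate_queries, retrieve):
    """Retrieve documents for several rephrasings and return the unique union."""
    seen = set()
    unique_docs = []
    for query in generate_queries(question):
        for doc in retrieve(query):
            if doc not in seen:  # keep first occurrence, preserve order
                seen.add(doc)
                unique_docs.append(doc)
    return unique_docs


def toy_generate_queries(question):
    # Stand-in for the LLM call that rephrases the question (assumption).
    return [
        "Which game did OpenAI play?",
        "OpenAI Dota 2 competition",
        "OpenAI strategy game",
    ]


# Stand-in for a vector store: each query maps to a fixed list of chunk ids.
TOY_RESULTS = {
    "Which game did OpenAI play?": ["chunk-A", "chunk-B"],
    "OpenAI Dota 2 competition": ["chunk-B", "chunk-C"],
    "OpenAI strategy game": ["chunk-A"],
}


def toy_retrieve(query):
    return TOY_RESULTS.get(query, [])


docs = multi_query_retrieve(
    "What strategy video game did OpenAI compete in?",
    toy_generate_queries,
    toy_retrieve,
)
print(docs)  # ['chunk-A', 'chunk-B', 'chunk-C']
```

Because the union can contain more chunks than a single query would return, this kind of retriever tends to trade some extra context for better coverage.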

In [434]:
## As we did before with "text-embedding-ada-002", we will also consider the effect of document stuffing:
new_retrieval_chain = create_retrieval_chain(new_advanced_retriever, document_chain)

In [435]:
answers = []
contexts = []

for question in test_questions:
  response = new_retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])
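
This answer-and-context loop appears several times in the notebook, so it could be wrapped in a small helper. The following is a sketch under assumptions: `collect_responses` is a hypothetical name, and `EchoChain` is a stub that mimics only the `invoke` call and the `"answer"` / `"context"` keys used above.

```python
from types import SimpleNamespace


def collect_responses(chain, questions, input_key="input", answer_key="answer"):
    """Invoke `chain` once per question and gather answers plus retrieved context texts."""
    answers, contexts = [], []
    for question in questions:
        response = chain.invoke({input_key: question})
        answers.append(response[answer_key])
        contexts.append([doc.page_content for doc in response["context"]])
    return answers, contexts


class EchoChain:
    """Hypothetical stand-in for a retrieval chain so the sketch runs offline."""

    def invoke(self, inputs):
        question = inputs["input"]
        return {
            "answer": f"stub answer to: {question}",
            "context": [SimpleNamespace(page_content=f"stub context for: {question}")],
        }


demo_answers, demo_contexts = collect_responses(EchoChain(), ["Q1", "Q2"])
print(demo_answers)
print(demo_contexts)
```

In the notebook the call would look like `answers, contexts = collect_responses(new_retrieval_chain, test_questions)`. The baseline chain takes a `"question"` key and returns `response["response"].content`, so it would need a small adapter before it could use the same helper.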

In [436]:
new_response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

In [437]:
new_advanced_retrieval_results = evaluate(new_response_dataset_advanced_retrieval, metrics)

Evaluating:   0%|          | 0/60 [00:00<?, ?it/s]

In [438]:
new_advanced_retrieval_results

{'faithfulness': 0.8750, 'answer_relevancy': 0.9443, 'context_recall': 0.9167, 'context_precision': 0.8805, 'answer_correctness': 0.6977}

In [440]:
df_baseline = pd.DataFrame(list(results.items()), columns=['Metric', 'Baseline'])
df_original = pd.DataFrame(list(advanced_retrieval_results.items()), columns=['Metric', 'ADA'])
df_comparison = pd.DataFrame(list(new_advanced_retrieval_results.items()), columns=['Metric', 'Text Embedding 3'])

df_merged = pd.merge(df_original, df_comparison, on='Metric')
df_merged = pd.merge(df_baseline, df_merged, on="Metric")

df_merged['Delta - TE3 -> ADA'] = df_merged['Text Embedding 3'] - df_merged['ADA']
df_merged['Delta - TE3 -> Baseline'] = df_merged['Text Embedding 3'] - df_merged['Baseline']

df_merged

| | Metric | Baseline | ADA | Text Embedding 3 | Delta - TE3 -> ADA | Delta - TE3 -> Baseline |
|---|---|---|---|---|---|---|
| 0 | faithfulness | 0.916667 | 0.96 | 0.875 | -0.085 | -0.041667 |
| 1 | answer_relevancy | 0.929754 | 0.943593 | 0.944336 | 0.000742 | 0.014581 |
| 2 | context_recall | 0.819444 | 0.833333 | 0.916667 | 0.083333 | 0.097222 |
| 3 | context_precision | 0.840278 | 0.902778 | 0.880463 | -0.022315 | 0.040185 |
| 4 | answer_correctness | 0.732664 | 0.720944 | 0.697707 | -0.023238 | -0.034957 |


#### ❓ Question #4:

In your opinion, is `text-embedding-3-small` significantly better than `ada`?

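**Answer** Based on the results we obtained, `text-embedding-3-small` does not appear to be significantly better than `ada`. On this test set, `context_recall` rose by about 0.083 and `answer_relevancy` was essentially flat (+0.001), while `faithfulness` (-0.085), `context_precision` (-0.022), and `answer_correctness` (-0.023) all dropped relative to `ada`.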
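
Whether a difference is "significant" is hard to judge from aggregate means alone, especially on a small test set. One common way to get a rough sense of the uncertainty is a paired bootstrap over per-question scores. The sketch below is an illustration only: `paired_bootstrap_ci` is a hypothetical helper, and the two score lists are invented placeholders, not values from this notebook.

```python
import random
from statistics import mean


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Bootstrap a confidence interval for mean(scores_b - scores_a) over paired questions."""
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired per question")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    n = len(diffs)
    # Resample the per-question differences with replacement and record each mean.
    resampled_means = sorted(
        mean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lower, upper


# Placeholder per-question scores (invented for illustration, NOT notebook results).
ada_scores = [0.74, 0.84, 0.84, 0.72, 0.92, 0.54, 0.45, 0.80]
te3_scores = [0.70, 0.86, 0.80, 0.75, 0.88, 0.50, 0.52, 0.78]

observed, lower, upper = paired_bootstrap_ci(ada_scores, te3_scores)
print(f"mean difference: {observed:+.4f}, 95% interval: [{lower:+.4f}, {upper:+.4f}]")
```

If the interval comfortably includes zero, the data do not support calling either model better on that metric. With only a handful of questions the interval will usually be wide, which is itself a useful signal that more test questions are needed.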

## BONUS ACTIVITY: Showcase Multi-Context Performance Changes

Now that we've looked at a number of different examples - showcase the difference on the multi-context *specific* questions that were synthetically generated.

> NOTE: You have all the data you'll need already in the notebook if you made it to this step!

In [441]:
#The quick way to do that would be realising that the 'multi_context' questions in our dataset correspond to rows 6 to 11, by inspecting test_df. 
# However, a rigorous solution would be methodically filtering the rows of test_df that have ['evolution_type']== 'multi_context':
multi_test_df = test_df[test_df['evolution_type']== 'multi_context'] 

#Then proceeding with all the steps with this filtered data frame:
multi_test_questions = multi_test_df["question"].values.tolist()
multi_test_groundtruths = multi_test_df["ground_truth"].values.tolist()

answers = []
contexts = []

#We invoke our RAG pipeline to get answers and contexts:
for question in multi_test_questions:
  response = retrieval_augmented_qa_chain.invoke({"question" : question})
  answers.append(response["response"].content)
  contexts.append([context.page_content for context in response["context"]])

multi_response_dataset = Dataset.from_dict({
    "question" : multi_test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : multi_test_groundtruths
})
#Finally we get results for the metrics when only multi-context questions were used in the evaluation:
multi_results = evaluate(multi_response_dataset, metrics)

multi_results

Evaluating:   0%|          | 0/30 [00:00<?, ?it/s]

{'faithfulness': 1.0000, 'answer_relevancy': 0.9410, 'context_recall': 0.7500, 'context_precision': 0.7917, 'answer_correctness': 0.7080}

In [442]:
#We do the same procedure for the modified RAG with advanced retrieval and document stuffing: 

answers = []
contexts = []

for question in multi_test_questions:
  response = retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])

multi_response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : multi_test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : multi_test_groundtruths
})

multi_advanced_retrieval_results = evaluate(multi_response_dataset_advanced_retrieval, metrics)
multi_advanced_retrieval_results

Evaluating:   0%|          | 0/30 [00:00<?, ?it/s]

{'faithfulness': 0.9600, 'answer_relevancy': 0.9367, 'context_recall': 0.7500, 'context_precision': 0.8194, 'answer_correctness': 0.5457}

In [448]:
#Finally we repeat the procedure considering the change of embeddings model from ADA to 3-small:

answers = []
contexts = []

for question in multi_test_questions:
  response = new_retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])



In [453]:

multi_new_response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : multi_test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : multi_test_groundtruths
})

multi_new_advanced_retrieval_results = evaluate(multi_new_response_dataset_advanced_retrieval, metrics)
multi_new_advanced_retrieval_results

KeyboardInterrupt: 

In [None]:
#Now all that is left is to build a table for comparison:

df_baseline = pd.DataFrame(list(multi_results.items()), columns=['Metric', 'Baseline'])
df_original = pd.DataFrame(list(multi_advanced_retrieval_results.items()), columns=['Metric', 'ADA'])
df_comparison = pd.DataFrame(list(multi_new_advanced_retrieval_results.items()), columns=['Metric', 'Text Embedding 3'])

df_merged = pd.merge(df_original, df_comparison, on='Metric')
df_merged = pd.merge(df_baseline, df_merged, on="Metric")

df_merged['Delta - TE3 -> ADA'] = df_merged['Text Embedding 3'] - df_merged['ADA']
df_merged['Delta - TE3 -> Baseline'] = df_merged['Text Embedding 3'] - df_merged['Baseline']

df_merged
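
Because the `text-embedding-3-small` evaluation above was interrupted, `multi_new_advanced_retrieval_results` may not exist when the comparison cell runs. The following is a minimal sketch, using a hypothetical `comparison_rows` helper, that builds the comparison from whichever runs did finish. The scores are copied from the two multi-context outputs printed above.

```python
def comparison_rows(runs):
    """Build one row per metric from {run_name: scores or None}, skipping missing runs."""
    finished = {name: scores for name, scores in runs.items() if scores is not None}
    if not finished:
        return []
    metrics = next(iter(finished.values())).keys()
    return [
        {"Metric": metric, **{name: scores[metric] for name, scores in finished.items()}}
        for metric in metrics
    ]


runs = {
    "Baseline": {
        "faithfulness": 1.0000,
        "answer_relevancy": 0.9410,
        "context_recall": 0.7500,
        "context_precision": 0.7917,
        "answer_correctness": 0.7080,
    },
    "ADA": {
        "faithfulness": 0.9600,
        "answer_relevancy": 0.9367,
        "context_recall": 0.7500,
        "context_precision": 0.8194,
        "answer_correctness": 0.5457,
    },
    "Text Embedding 3": None,  # evaluation was interrupted, so there is no result to show
}

rows = comparison_rows(runs)
for row in rows:
    # Same naming scheme as the notebook's delta columns: first name minus second name.
    row["Delta - ADA -> Baseline"] = row["ADA"] - row["Baseline"]
    print(row)
```

On the multi-context subset, the multi-query `ada` pipeline raises `context_precision` from 0.7917 to 0.8194 and leaves `context_recall` at 0.7500, while `answer_correctness` falls from 0.7080 to 0.5457.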