# Evaluation of RAG Using Ragas

In this notebook we'll explore how to evaluate RAG pipelines using a powerful open-source tool called "Ragas". This will give us tools to evaluate component-wise metrics, as well as end-to-end metrics about the performance of our RAG pipelines.

We'll complete the following tasks:

- 🤝 Breakout Room Part #1:
  1. Install required libraries
  2. Set Environment Variables
  3. Creating a simple RAG pipeline with [LangChain v0.2.0](https://python.langchain.com/v0.2/docs/versions/v0_2/)
  4. Synthetic Dataset Generation for Evaluation using the [Ragas](https://github.com/explodinggradients/ragas) framework.
  

- 🤝 Breakout Room Part #2:
  1. Evaluating our pipeline with Ragas
  2. Making Adjustments to our RAG Pipeline
  3. Evaluating our Adjusted pipeline against our baseline
  4. Testing OpenAI's Claim

The only way to get started is to get started - so let's grab our dependencies for the day!

> NOTE: Using this notebook as presented will incur a charge of ~$3USD from OpenAI usage. Most of this cost is produced by the Synthetic Data Generation step - if you want to reduce costs, please use the provided commented code to leverage `GPT-3.5-Turbo` as the `critic_llm`!

## Motivation

A claim, made by OpenAI, is that their `text-embedding-3-small` is better (generally) than their `text-embedding-ada-002` model.

Here are some passages from their [blog](https://openai.com/blog/new-embedding-models-and-api-updates) about the `text-embedding-3` release:

> `text-embedding-3-small` is our new highly efficient embedding model and provides a significant upgrade over its predecessor, the `text-embedding-ada-002` model...

> **Stronger performance.** Comparing `text-embedding-ada-002` to `text-embedding-3-small`, the average score on a commonly used benchmark for multi-language retrieval ([MIRACL](https://github.com/project-miracl/miracl)) has increased from 31.4% to 44.0%, while the average score on a commonly used benchmark for English tasks ([MTEB](https://github.com/embeddings-benchmark/mteb)) has increased from 61.0% to 62.3%.

Well, with a library like Ragas - we can put that claim to the test!

If what they claim is true - we should see an increase on related metrics by using the new embedding model!

# 🤝 Breakout Room Part #1

## Task 1: Installing Required Libraries

A reminder that one of the [key features](https://blog.langchain.dev/langchain-v0-1-0/) of LangChain v0.1.0 is the compartmentalization of the various LangChain ecosystem packages!

So let's begin grabbing all of our LangChain related packages!

In [25]:
!pip install -U -q langchain langchain-openai langchain_core langchain-community langchainhub openai

We'll also get the "star of the show" today, which is Ragas!

In [26]:
!pip install -qU ragas

We'll be leveraging [Qdrant](https://qdrant.tech/) again as our LangChain `VectorStore`.

We'll also install `pymupdf` and its dependencies which will allow us to load PDFs using the `PyMuPDFLoader` in the `langchain-community` package!

In [27]:
!pip install -qU qdrant-client pymupdf pandas

## Task 2: Set Environment Variables

Let's set up our OpenAI API key so we can leverage their API later on.

In [28]:
import os
import openai
from getpass import getpass

openai.api_key = getpass("Please provide your OpenAI Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key

## Task 3: Creating a Simple RAG Pipeline with LangChain v0.1.0

Building on what we learned last week, we'll be leveraging LangChain v0.1.0 and LCEL to build a simple RAG pipeline that we can baseline with Ragas.

## Building our RAG pipeline

Let's review the basic steps of RAG again:

- Create an Index
- Use retrieval to obtain pieces of context from our Index that are similar to our query
- Use an LLM to generate responses based on the retrieved context

Let's get started by creating our index.

> NOTE: We're going to start leaning on the term "index" to refer to our `VectorStore`, `VectorDatabase`, etc. We can think of "index" as the catch-all term, whereas `VectorStore` and the like relate to the specific technologies used to create, store, and interact with the index.

### Creating an Index

You'll notice that the largest changes (outside of some import changes) are that our old favourite chains are back to being bundled in an easily usable abstraction.

We can still create custom chains using LCEL - but we can also be more confident that our pre-packaged chains are created using LCEL under the hood.

#### Loading Data

Let's start by loading some data!

> NOTE: You'll notice that we're using a document loader from the community package of LangChain. This is part of the v0.2.0 changes that make the base (`langchain-core`) package remain lightweight while still providing access to some of the more powerful community integrations.

In [29]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader(
    "https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf",
)

documents = loader.load()

In [30]:
documents[0].metadata

{'source': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf',
 'file_path': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf',
 'page': 0,
 'total_pages': 195,
 'format': 'PDF 1.3',
 'title': 'The Pmarca Blog Archives',
 'author': '',
 'subject': '',
 'keywords': '',
 'creator': '',
 'producer': 'Mac OS X 10.10 Quartz PDFContext',
 'creationDate': "D:20150110020418Z00'00'",
 'modDate': "D:20150110020418Z00'00'",
 'trapped': ''}

#### Transforming Data

Now that we've got our single document - let's split it into smaller pieces so we can more effectively leverage it with our retrieval chain!

We'll start with the classic: `RecursiveCharacterTextSplitter`.

In [31]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 200,
    chunk_overlap = 50
)

documents = text_splitter.split_documents(documents)

Let's confirm we've split our document.

In [32]:
len(documents)

1864
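To build intuition for what `chunk_size` and `chunk_overlap` do, here is a minimal sketch of fixed-size character chunking with overlap. It is a simplification and not `RecursiveCharacterTextSplitter` itself, which also tries to break on separators such as paragraphs and newlines. The helper name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int = 200, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Hypothetical 500-character sample text, just to show the window arithmetic.
sample_text = "".join(str(i % 10) for i in range(500))
sample_chunks = sliding_window_chunks(sample_text)
print([len(chunk) for chunk in sample_chunks])
```

Each window starts 150 characters after the previous one, so neighbouring chunks share 50 characters. That shared text helps a sentence cut at a boundary still appear whole in at least one chunk.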

#### Loading OpenAI Embeddings Model

We'll need a way to convert our text into vectors that we can compare against our query vector.

Let's use OpenAI's `text-embedding-ada-002` for this task!

In [33]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="text-embedding-ada-002"
)

#### Creating a Qdrant VectorStore

Now that we have documents - we'll need a place to store them alongside their embeddings.

In [21]:
from langchain_community.vectorstores import Qdrant

qdrant_vector_store = Qdrant.from_documents(
    documents,
    embeddings,
    location=":memory:",
    collection_name="PMarca Blogs",
)

#### ❓ Question #1:

List out a few of the techniques that Qdrant uses that make it performant.

> NOTE: Check the [documentation](https://qdrant.tech/documentation/overview/) for more information about Qdrant!

##### ANSWER:

1. Specialized indexing techniques: Qdrant uses advanced indexing methods like Hierarchical Navigable Small World (HNSW) to implement Approximate Nearest Neighbors search. This allows for efficient similarity searching in high-dimensional vector spaces.

2. Optimized distance metrics: Qdrant fully supports common distance metrics like Euclidean Distance, Cosine Similarity, and Dot Product. These are efficiently implemented to allow fast similarity calculations between vectors.

3. Flexible storage options: Qdrant offers both in-memory storage for highest speed (storing all vectors in RAM) and memmap storage (creating a virtual address space associated with files on disk). This allows users to optimize for performance or memory usage as needed.

4. Efficient data structures: Qdrant uses collections to organize sets of points (vectors with payloads), allowing for efficient management and searching of related data.

5. Support for payload filtering: The ability to attach JSON payloads to vectors allows for additional filtering and refinement of search results beyond just vector similarity.

6. Real-time indexing: Qdrant supports real-time updates and queries, allowing for dynamic data management without significant performance degradation.

7. Scalability: The architecture is designed to handle large-scale datasets with billions of data points efficiently.

These techniques combined allow Qdrant to provide fast and accurate similarity search capabilities, making it highly performant for vector database applications.

#### Creating a Retriever

To complete our index, all that's left to do is expose our vectorstore as a retriever - which we can do the same way we would in previous versions of LangChain!

In [22]:
retriever = qdrant_vector_store.as_retriever()
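Conceptually, a retriever embeds the query, scores every stored chunk by similarity, and returns the top `k`. The sketch below illustrates that idea with a toy word-count "embedding" and cosine similarity. It is an assumption-laden stand-in, not how Qdrant or OpenAI embeddings work internally, and the names `embed`, `cosine_similarity`, and `retrieve` are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query, best first."""
    query_vector = embed(query)
    ranked = sorted(chunks, key=lambda chunk: cosine_similarity(query_vector, embed(chunk)), reverse=True)
    return ranked[:k]


# Hypothetical mini-corpus for illustration only.
toy_chunks = [
    "pick an industry where change is happening",
    "get right to the center of the industry",
    "coffee is best in seattle",
]
top_chunks = retrieve("which industry should i pick", toy_chunks, k=2)
print(top_chunks)
```

A real vector store replaces the full sort with an approximate nearest-neighbour index, but the contract is the same: query in, a ranked list of chunks out.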

#### Testing our Retriever

Now that we've gone through the trouble of creating our retriever - let's see it in action!

In [23]:
retrieved_documents = retriever.invoke("What is a rule of thumb for selecting an industry to invest in?")

In [24]:
for doc in retrieved_documents:
  print(doc)

page_content='the existing order — and make sure that those forces of change\nhave a reasonable chance at succeeding.\nSecond rule of thumb:\nOnce you have picked an industry, get right to the center of it' metadata={'source': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf', 'file_path': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf', 'page': 125, 'total_pages': 195, 'format': 'PDF 1.3', 'title': 'The Pmarca Blog Archives', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': 'Mac OS X 10.10 Quartz PDFContext', 'creationDate': "D:20150110020418Z00'00'", 'modDate': "D:20150110020418Z00'00'", 'trapped': '', '_id': 'ea71f24eeb46499ab8cf7417bf4f414b', '_collection_name': 'PMarca Blogs'}
page_content='Third rule:\nIn a rapidly changing Held like technology, the best place to\nget experience when you’re starting out is in younger, high-\ngrowth companies.' metadata={'source': 'https://d1lamhf6l6yk6d

### Creating a RAG Chain

Now that we have the "R" in RAG taken care of - let's look at creating the "AG"!

#### Creating a Prompt Template

There are a few different ways we could create our prompt template - we could create a custom template, as seen in the code below, or we could simply pull a prompt from the prompt hub! Let's look at an example of that!

In [27]:
from langchain import hub

retrieval_qa_prompt = hub.pull("langchain-ai/retrieval-qa-chat")

In [28]:
print(retrieval_qa_prompt.messages[0].prompt.template)

Answer any use questions based solely on the context below:

<context>
{context}
</context>


As you can see - the prompt template is simple (and has a small error) - so we'll create our own to be a bit more specific!

In [29]:
from langchain.prompts import ChatPromptTemplate

template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

Context:
{context}

Question:
{question}
"""

prompt = ChatPromptTemplate.from_template(template)
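To see what the model actually receives, here is a small sketch that fills the same template text with plain `str.format`. This is only a stand-in for what `ChatPromptTemplate` does when it substitutes `{context}` and `{question}`; the two chunks in `example_chunks` are made up for illustration.

```python
demo_template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

Context:
{context}

Question:
{question}
"""

# Hypothetical retrieved chunks, standing in for the retriever's output.
example_chunks = [
    "Second rule of thumb: once you have picked an industry, get right to the center of it.",
    "Go to the city where all the action is happening.",
]

filled_prompt = demo_template.format(
    context="\n\n".join(example_chunks),  # chunks are concatenated into one context string
    question="What is a rule of thumb for selecting an industry to invest in?",
)
print(filled_prompt)
```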

#### Setting Up our Basic QA Chain

Now we can instantiate our basic RAG chain!

We'll use LCEL directly just to see an example of it - but you could just as easily use an abstraction here to achieve the same goal!

We'll also make sure to pass our context through to the output - which is critical for Ragas.

In [35]:
from operator import itemgetter

from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

primary_qa_llm = ChatOpenAI(model_name="gpt-3.5-turbo-0125", temperature=0)

retrieval_augmented_qa_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and piping it into `retriever`
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": prompt | primary_qa_llm, "context": itemgetter("context")}
)
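One way to read the chain above is as three dictionary-to-dictionary steps. The sketch below mimics that data flow with plain Python functions; `fake_retriever` and `fake_llm` are hypothetical stand-ins, not LangChain objects, so treat it as an illustration of the shapes involved rather than of LCEL itself.

```python
def fake_retriever(question: str) -> list[str]:
    # Hypothetical stand-in for the vector store retriever.
    return [f"chunk related to: {question}"]


def fake_llm(prompt_text: str) -> str:
    # Hypothetical stand-in for the chat model.
    return f"answer based on {len(prompt_text)} prompt characters"


def step_1(inputs: dict) -> dict:
    # Mirrors {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    return {"context": fake_retriever(inputs["question"]), "question": inputs["question"]}


def step_2(state: dict) -> dict:
    # Mirrors RunnablePassthrough.assign(context=...): keep every key and (re)assign "context".
    return {**state, "context": state["context"]}


def step_3(state: dict) -> dict:
    # Mirrors {"response": prompt | llm, "context": itemgetter("context")}
    prompt_text = f"Context:\n{state['context']}\n\nQuestion:\n{state['question']}"
    return {"response": fake_llm(prompt_text), "context": state["context"]}


flow_result = step_3(step_2(step_1({"question": "What is a rule of thumb?"})))
print(sorted(flow_result))
```

The final dictionary keeps both the generated response and the retrieved context, which is what lets us hand the contexts to an evaluator later.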

#### 🏗️ Activity #1:

Describe the pipeline shown above in simple terms. You can include a diagram if desired.

Let's test it out!

In [36]:
question = "What is a rule of thumb for selecting an industry to invest in?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)

Get right to the center of it.


In [37]:
question = "What did Pink Floyd have to say about how to proceed when investing in a new industry?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)
print(result["context"])

I don't know.
[Document(page_content='ask if you can call them again if things change.\nTrust me — they’d much rather be saying “yes” than “no” —\nthey need all the good investments they can get.\nSecond, consider the environment.', metadata={'source': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf', 'file_path': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf', 'page': 15, 'total_pages': 195, 'format': 'PDF 1.3', 'title': 'The Pmarca Blog Archives', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': 'Mac OS X 10.10 Quartz PDFContext', 'creationDate': "D:20150110020418Z00'00'", 'modDate': "D:20150110020418Z00'00'", 'trapped': '', '_id': '58957e2fd83f428fae625086ca920d38', '_collection_name': 'PMarca Blogs'}), Document(page_content='watching carefully — if everyone agrees right up front that\nwhatever you are doing makes total sense, it probably isn’t a new\nand radical enough idea to justify a

We can already see that there are some improvements we could make here.

For now, let's switch gears to RAGAS to see how we can leverage that tool to provide us insight into how our pipeline is performing!

## Task 4: Synthetic Dataset Generation for Evaluation using Ragas

Ragas is a powerful library that lets us evaluate our RAG pipeline by collecting input/output/context triplets and obtaining metrics relating to a number of different aspects of our RAG pipeline.

We'll be evaluating on every core metric today, but in order to do that - we'll need to create a test set. Luckily for us, Ragas can do that directly!

### Synthetic Test Set Generation

We can leverage Ragas' [`Synthetic Test Data generation`](https://docs.ragas.io/en/stable/concepts/testset_generation.html) functionality to generate our own synthetic question/context (QC) pairs - as well as a synthetic ground truth - quite easily!

In [40]:
loader = PyMuPDFLoader(
    "https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf",
)

eval_documents = loader.load()

text_splitter_eval = RecursiveCharacterTextSplitter(
    chunk_size = 600,
    chunk_overlap = 50
)

eval_documents = text_splitter_eval.split_documents(eval_documents)

#### ❓ Question #2:

Why is it important to split our documents using different parameters when creating our synthetic data?

##### Answer:
It's important to split documents using different parameters when creating synthetic data for several reasons:

1. Diversity in test data: Using different splitting parameters ensures a more diverse range of document chunks, which better represents real-world scenarios.

2. Avoid overfitting: If the synthetic data is split in the same way as the data in our index, the evaluation might be biased towards the specific splitting method used in the RAG pipeline.

3. Robustness testing: Different splitting parameters challenge the RAG system's ability to handle various document structures and lengths.

4. Realistic evaluation: In practice, input documents may come in different formats and lengths. Using varied splitting parameters simulates this variability.

5. Identify weaknesses: Different splitting methods might reveal strengths or weaknesses in the RAG pipeline that wouldn't be apparent with a single splitting approach.

In [41]:
len(eval_documents)

624


> NOTE: 🛑 Using this notebook as presented will incur a charge of ~$3USD from OpenAI usage. Most of this cost is produced by the Synthetic Data Generation step - if you want to reduce costs, please use the provided commented code to leverage GPT-3.5-Turbo as the critic_llm. If you're attempting to create a lot of samples please be aware of cost, as well as rate limits. 🛑

In [42]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

generator_llm = ChatOpenAI(model="gpt-3.5-turbo-0125")
# critic_llm = ChatOpenAI(model="gpt-3.5-turbo") <--- If you don't have GPT-4 access, or to reduce cost/rate limiting issues.
critic_llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1
}

testset = generator.generate_with_langchain_docs(eval_documents, 20, distributions, is_async = False)
testset.to_pandas()

Filename and doc_id are the same for all nodes.                     
Generating: 100%|██████████| 20/20 [00:53<00:00,  2.69s/it]


| | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|
| 0 | What industry is Los Angeles known for in term... | [most interesting opportunity available — the ... | Los Angeles is known for entertainment opportu... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 1 | How can combining a useful graduate degree wit... | [workforce in a high-impact way when you gradu... | Combining a useful graduate degree with a subs... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 2 | How does the Influence-from-Mere-Association T... | [One very practical consequence of Liking/Lovi... | The Influence-from-Mere-Association Tendency a... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 3 | What are the 9 most important steps for a CEO ... | [coherent message and strategy.\nThen go dark ... | The 9 most important steps for a CEO of a turn... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 4 | How can you bootstrap off initial customers to... | [This obviously raises the issue of how you’re... | Try to raise angel money, or bootstrap off ini... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 5 | How can blogging help entrepreneurs interact w... | [looking for funding to blog — about their sta... | Blogging can help entrepreneurs interact with ... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 6 | What are some common layers of risk for a high... | [as if it’s an onion. Just like you peel an on... | It depends on the startup, but here are some o... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 7 | What data is available on the relationship bet... | [Age and the Entrepreneur: Some\ndata\nA short... | I’m not aware of any systematic data on age an... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 8 | What does Dr. Simonton suggest is more importa... | [becomes irrelevant to determining the success... | Dr. Simonton suggests that focusing on more at... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 9 | How can violating the chain of command help ga... | [sureXre signal that the executive is not work... | Violating the chain of command can help gather... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |


#### ❓ Question #3:

`{simple: 0.5, reasoning: 0.25, multi_context: 0.25}`

What exactly does this mapping refer to?

> NOTE: Check out the Ragas documentation on this generation process [here](https://docs.ragas.io/en/stable/concepts/testset_generation.html).

##### Answer:
This mapping refers to the distribution of different question types in the synthetic dataset generation process:

1. simple: 50% of the generated questions will be straightforward, requiring direct retrieval of information.

2. reasoning: 25% of the questions will require some level of reasoning or inference based on the retrieved information.

3. multi_context: 25% of the questions will need information from multiple contexts to be answered correctly.

This distribution allows for a balanced evaluation of the RAG pipeline's performance across different types of queries, from simple fact retrieval to more complex reasoning tasks.
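As a rough illustration of what such a mapping implies, the sketch below turns the shares from the question into per-type sample counts for a test set of 20. The helper `allocate_samples` is hypothetical, the string keys stand in for the Ragas evolution objects, and the exact way Ragas rounds or assigns samples internally may differ.

```python
def allocate_samples(test_size: int, distributions: dict[str, float]) -> dict[str, int]:
    """Convert fractional shares into approximate per-type sample counts."""
    if abs(sum(distributions.values()) - 1.0) > 1e-9:
        raise ValueError("distribution shares must sum to 1.0")
    # Simple rounding; with awkward shares the counts may not add up exactly to test_size.
    return {name: round(test_size * share) for name, share in distributions.items()}


sample_counts = allocate_samples(20, {"simple": 0.5, "reasoning": 0.25, "multi_context": 0.25})
print(sample_counts)
```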

Let's look at the output and see what we can learn about it!

In [43]:
testset.test_data[0]

DataRow(question='What industry is Los Angeles known for in terms of entertainment opportunities?', contexts=['most interesting opportunity available — the new markets that\nare growing fast and changing rapidly.\nAlso apply this rule when selecting which city to live in. Go to\nthe city where all the action is happening.\nFor technology, at least in the US, this is Silicon Valley. For\nentertainment, this is Los Angeles. For politics, Washington DC.\nFor coWee, Seattle. For Xnancial services, New York — unless\nyou are convinced that there are equally compelling opportuni-\nties someplace else, like London or Hong Kong or Shanghai.\nIn my opinion, living anywhere other than the center of your industry'], ground_truth='Los Angeles is known for entertainment opportunities in the industry of entertainment.', evolution_type='simple', metadata=[{'source': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf', 'file_path': 'https://d1lamhf6l6yk6d.cloudfront.ne

### Generating Responses with RAG Pipeline

Now that we have some QC pairs, and some ground truths, let's evaluate our RAG pipeline using Ragas.

The process is, again, quite straightforward - thanks to Ragas and LangChain!

Let's start by extracting our questions and ground truths from the test set we created.

We can start by converting our test dataset into a Pandas DataFrame.

In [46]:
test_df = testset.to_pandas()

In [47]:
test_df

| | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|
| 0 | What industry is Los Angeles known for in term... | [most interesting opportunity available — the ... | Los Angeles is known for entertainment opportu... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 1 | How can combining a useful graduate degree wit... | [workforce in a high-impact way when you gradu... | Combining a useful graduate degree with a subs... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 2 | How does the Influence-from-Mere-Association T... | [One very practical consequence of Liking/Lovi... | The Influence-from-Mere-Association Tendency a... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 3 | What are the 9 most important steps for a CEO ... | [coherent message and strategy.\nThen go dark ... | The 9 most important steps for a CEO of a turn... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 4 | How can you bootstrap off initial customers to... | [This obviously raises the issue of how you’re... | Try to raise angel money, or bootstrap off ini... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 5 | How can blogging help entrepreneurs interact w... | [looking for funding to blog — about their sta... | Blogging can help entrepreneurs interact with ... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 6 | What are some common layers of risk for a high... | [as if it’s an onion. Just like you peel an on... | It depends on the startup, but here are some o... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 7 | What data is available on the relationship bet... | [Age and the Entrepreneur: Some\ndata\nA short... | I’m not aware of any systematic data on age an... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 8 | What does Dr. Simonton suggest is more importa... | [becomes irrelevant to determining the success... | Dr. Simonton suggests that focusing on more at... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 9 | How can violating the chain of command help ga... | [sureXre signal that the executive is not work... | Violating the chain of command can help gather... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |


In [45]:
test_questions = test_df["question"].values.tolist()
test_groundtruths = test_df["ground_truth"].values.tolist()

Now we'll generate responses with our RAG pipeline for the questions we've generated - we'll also need to collect our retrieved contexts for each question.

We'll do this in a simple loop to see exactly what's happening!

In [48]:
answers = []
contexts = []

for question in test_questions:
  response = retrieval_augmented_qa_chain.invoke({"question" : question})
  answers.append(response["response"].content)
  contexts.append([context.page_content for context in response["context"]])

Now we can wrap our information in a Hugging Face dataset for use in the Ragas library.

In [49]:
from datasets import Dataset

response_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

Let's take a peek and see what that looks like!

In [50]:
response_dataset[0]

{'question': 'What industry is Los Angeles known for in terms of entertainment opportunities?',
 'answer': 'Los Angeles is known for the entertainment industry.',
 'contexts': ['the city where all the action is happening.\nFor technology, at least in the US, this is Silicon Valley. For\nentertainment, this is Los Angeles. For politics, Washington DC.',
  'services.\nIt doesn’t seem to happen ever in certain other industries which\nI won’t name for fear of being permanently cut oW from my\nnecessary supply of oil, gas, music, and movies.',
  'you’re not in a major center of entrepreneurialism and you’re\nhaving trouble raising money, you probably need to move.\nThere’s a reason why most Xlms get made in Los Angeles, and',
  'and place. This is a big motivator for me, by the way.\nGrowing up, I would have never dreamed that an industry\nlike this would exist or that I would get to be a part of it. I\npinch myself every day.'],
 'ground_truth': 'Los Angeles is known for entertainment oppo

# 🤝 Breakout Room Part #2

## Task 1: Evaluating our Pipeline with Ragas

Now that we have our response dataset - we can finally get into the "meat" of Ragas - evaluation!

First, we'll import the desired metrics, then we can use them to evaluate our created dataset!

Check out the specific metrics we'll be using in the Ragas documentation:

- [Faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html)
- [Answer Relevancy](https://docs.ragas.io/en/stable/concepts/metrics/answer_relevance.html)
- [Context Precision](https://docs.ragas.io/en/stable/concepts/metrics/context_precision.html)
- [Context Recall](https://docs.ragas.io/en/stable/concepts/metrics/context_recall.html)
- [Answer Correctness](https://docs.ragas.io/en/stable/concepts/metrics/answer_correctness.html)

See the accompanying presentation for more in-depth explanations about each of the metrics!
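To make a few of these metrics more concrete, here is a simplified sketch of the final arithmetic behind faithfulness, context recall, and context precision. In Ragas the per-claim and per-chunk verdicts come from an LLM judge; in this sketch they are hand-written booleans, so the function names and example verdicts are assumptions for illustration only.

```python
import math


def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of claims in the answer that the retrieved context supports."""
    if not claim_supported:
        return float("nan")  # no claims could be extracted from the answer
    return sum(claim_supported) / len(claim_supported)


def context_recall_score(statement_attributed: list[bool]) -> float:
    """Fraction of ground-truth statements that can be attributed to the retrieved context."""
    if not statement_attributed:
        return float("nan")
    return sum(statement_attributed) / len(statement_attributed)


def context_precision_score(chunk_relevant: list[bool]) -> float:
    """Mean of precision@k over the ranks k that hold a relevant chunk."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision@k at this relevant position
    return precision_sum / hits if hits else 0.0


# Hand-written verdicts, standing in for the LLM judge.
print(faithfulness_score([True, True, False, True]))
print(context_recall_score([True, False]))
print(round(context_precision_score([True, False, True, True]), 6))
```

Context precision rewards relevant chunks that appear early in the ranking, which is why the same set of chunks can score differently depending on their order.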

In [51]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_correctness,
    context_recall,
    context_precision,
)

metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
]

All that's left to do is call "evaluate" and away we go!

In [68]:
# ragas picks its own default evaluator models when none are passed - this run used gpt-3.5-turbo
results = evaluate(response_dataset, metrics)

Evaluating:  82%|████████▏ | 82/100 [00:23<00:05,  3.16it/s]No statements were generated from the answer.
Evaluating: 100%|██████████| 100/100 [00:29<00:00,  3.34it/s]


In [69]:
results

{'faithfulness': 0.7746, 'answer_relevancy': 0.8040, 'context_recall': 0.6033, 'context_precision': 0.7750, 'answer_correctness': 0.5605}

In [70]:
results_df = results.to_pandas()
results_df

| | question | answer | contexts | ground_truth | faithfulness | answer_relevancy | context_recall | context_precision | answer_correctness |
|---|---|---|---|---|---|---|---|---|---|
| 0 | What industry is Los Angeles known for in term... | Los Angeles is known for the entertainment ind... | [the city where all the action is happening.\n... | Los Angeles is known for entertainment opportu... | 1.0 | 0.911584 | 1.0 | 0.0 | 0.996126 |
| 1 | How can combining a useful graduate degree wit... | Combining a useful graduate degree with a subs... | [degree, you are much better oW combining it w... | Combining a useful graduate degree with a subs... | 1.0 | 0.90868 | 0.5 | 0.583333 | 0.67632 |
| 2 | How does the Influence-from-Mere-Association T... | The Influence-from-Mere-Association Tendency a... | [other facts to facilitate love.\nThe applicat... | The Influence-from-Mere-Association Tendency a... | 0.5 | 0.966484 | 0.0 | 0.805556 | 0.574931 |
| 3 | What are the 9 most important steps for a CEO ... | I don't know. | [that’s recently been getting trounced by comp... | The 9 most important steps for a CEO of a turn... | NaN | 0.0 | 0.333333 | 0.5 | 0.179041 |
| 4 | How can you bootstrap off initial customers to... | You can bootstrap off initial customers to sta... | [This obviously raises the issue of how you’re... | Try to raise angel money, or bootstrap off ini... | 0.5 | 1.0 | 1.0 | 1.0 | 0.437117 |
| 5 | How can blogging help entrepreneurs interact w... | Blogging can help entrepreneurs interact with ... | [esting things going on, about their point of ... | Blogging can help entrepreneurs interact with ... | 0.75 | 0.983404 | 1.0 | 0.805556 | 0.745164 |
| 6 | What are some common layers of risk for a high... | Founder risk, technology risk, and product ris... | [What are the layers of risk for a high-tech\n... | It depends on the startup, but here are some o... | 1.0 | 0.990778 | 1.0 | 0.833333 | 0.767007 |
| 7 | What data is available on the relationship bet... | I don't know. | [Age and the Entrepreneur: Some data\n155, and... | I’m not aware of any systematic data on age an... | 1.0 | 0.0 | 1.0 | 0.833333 | 0.197943 |
| 8 | What does Dr. Simonton suggest is more importa... | Dr. Simonton suggests that focusing on more at... | [progress through a creative career. Instead y... | Dr. Simonton suggests that focusing on more at... | 0.5 | 0.968397 | 1.0 | 0.916667 | 0.999276 |
| 9 | How can violating the chain of command help ga... | Violating the chain of command can help gather... | [nization.\nSecond, the minute you have a bad ... | Violating the chain of command can help gather... | 0.5 | 1.0 | 0.0 | 0.5 | 0.61269 |
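Row 3 above has no faithfulness score, which lines up with the "No statements were generated from the answer." message in the evaluation log. When averaging a column like this, one reasonable approach is to skip the missing values. The sketch below does that for the ten faithfulness values displayed above; it assumes a NaN-skipping mean, which may or may not be exactly how Ragas aggregates, and because only the displayed rows are used, the result is not expected to match the overall score reported earlier.

```python
import math

# Faithfulness values copied from the displayed rows; row 3 is missing (NaN).
faithfulness_column = [1.0, 1.0, 0.5, float("nan"), 0.5, 0.75, 1.0, 1.0, 0.5, 0.5]


def nan_mean(values: list[float]) -> float:
    """Average the values, ignoring any NaN entries."""
    kept = [value for value in values if not math.isnan(value)]
    return sum(kept) / len(kept) if kept else float("nan")


displayed_faithfulness_mean = nan_mean(faithfulness_column)
print(displayed_faithfulness_mean)
```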


## Task 2: Making Adjustments to our RAG Pipeline

Now that we have established a baseline - we can see how any changes impact our pipeline's performance!

Let's modify our retriever and see how that impacts our Ragas metrics!

> NOTE: MultiQueryRetriever is expanded on [here](https://python.langchain.com/docs/modules/data_connection/retrievers/MultiQueryRetriever) but for now, the implementation is not important to our lesson!

In [71]:
from langchain.retrievers import MultiQueryRetriever

advanced_retriever = MultiQueryRetriever.from_llm(retriever=retriever, llm=primary_qa_llm)
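The core idea behind a multi-query retriever is to rewrite the question several ways, retrieve for each rewrite, and keep the unique union of the results. The sketch below shows that flow with toy components: `generate_query_variants` is a hypothetical stand-in for the LLM rewriting step, and `keyword_retriever` is a toy replacement for the vector store retriever.

```python
def generate_query_variants(question: str) -> list[str]:
    # Hypothetical stand-in for the LLM call that rewrites the question.
    return [question, f"Explain: {question}", f"Background on: {question}"]


def keyword_retriever(query: str, corpus: list[str]) -> list[str]:
    # Toy retriever: return chunks that share at least one lowercase word with the query.
    query_words = set(query.lower().replace(":", " ").replace("?", " ").split())
    return [chunk for chunk in corpus if query_words & set(chunk.lower().split())]


def multi_query_retrieve(question: str, corpus: list[str]) -> list[str]:
    """Retrieve for every query variant and merge results, dropping duplicates."""
    seen = set()
    merged = []
    for variant in generate_query_variants(question):
        for chunk in keyword_retriever(variant, corpus):
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged


# Hypothetical mini-corpus for illustration only.
toy_corpus = ["pick a growing industry", "some background reading", "unrelated text"]
merged_chunks = multi_query_retrieve("Which industry?", toy_corpus)
print(merged_chunks)
```

Here the second chunk is only found by one of the rewrites, which is the kind of recall gain the technique is after. The trade-off is extra LLM and retrieval calls per question.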

We'll also re-create our RAG pipeline using the abstractions that come packaged with LangChain v0.1.0!

First, let's create a chain to "stuff" our documents into our context!

In [72]:
from langchain.chains.combine_documents import create_stuff_documents_chain

document_chain = create_stuff_documents_chain(primary_qa_llm, retrieval_qa_prompt)

Next, we'll create the retrieval chain!

In [73]:
from langchain.chains import create_retrieval_chain

retrieval_chain = create_retrieval_chain(advanced_retriever, document_chain)

In [74]:
response = retrieval_chain.invoke({"input": "Who is Taylor Swift fueding with?"})

In [75]:
print(response["answer"])

I'm sorry, I cannot provide an answer to that question based on the context provided.


In [76]:
response = retrieval_chain.invoke({"input": "Why are they fueding?"})

In [77]:
print(response["answer"])

The text does not provide any information about a feud or conflict between individuals or groups. It mainly discusses factors that contribute to success or failure in business, the impact of human nature on decision-making, and the dynamics of decision-making in big companies.


Well, just from those responses this chain *feels* better - but let's see how it performs on our eval!

Let's do the same process we did before to collect our pipeline's contexts and answers.

In [78]:
answers = []
contexts = []

for question in test_questions:
  response = retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])

Now we can convert this into a dataset, just like we did before.

In [79]:
response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})
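
Since this collect-then-convert step repeats for every pipeline we evaluate, it could be wrapped in a small helper. The sketch below is one possible shape, not part of the original notebook: `build_eval_records`, `StubDoc`, and `StubChain` are hypothetical names, and the stub chain only mimics the `{"answer": ..., "context": [...]}` response that the retrieval chain returns above.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


def build_eval_records(
    chain: Any, questions: List[str], ground_truths: List[str]
) -> Dict[str, list]:
    """Run each question through the chain and return columns ready for Dataset.from_dict."""
    answers, contexts = [], []
    for question in questions:
        response = chain.invoke({"input": question})
        answers.append(response["answer"])
        contexts.append([doc.page_content for doc in response["context"]])
    return {
        "question": list(questions),
        "answer": answers,
        "contexts": contexts,
        "ground_truth": list(ground_truths),
    }


@dataclass
class StubDoc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str


class StubChain:
    """Hypothetical chain that echoes the question, so the helper runs without an API key."""

    def invoke(self, inputs: Dict[str, str]) -> Dict[str, Any]:
        return {
            "answer": f"echo: {inputs['input']}",
            "context": [StubDoc("chunk A"), StubDoc("chunk B")],
        }


records = build_eval_records(StubChain(), ["q1", "q2"], ["gt1", "gt2"])
print(records)
```

With a real chain, the returned dictionary would be passed straight to `Dataset.from_dict(...)`.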

Let's evaluate on the same metrics we did for the first pipeline and see how it does!

In [80]:
advanced_retrieval_results = evaluate(response_dataset_advanced_retrieval, metrics)

Evaluating: 100%|██████████| 100/100 [01:07<00:00,  1.47it/s]


In [81]:
advanced_retrieval_results_df = advanced_retrieval_results.to_pandas()
advanced_retrieval_results_df

Unnamed: 0,question,answer,contexts,ground_truth,faithfulness,answer_relevancy,context_recall,context_precision,answer_correctness
0,What industry is Los Angeles known for in term...,Los Angeles is known for the entertainment ind...,[the city where all the action is happening.\n...,Los Angeles is known for entertainment opportu...,1.0,0.974996,1.0,0.2,0.539693
1,How can combining a useful graduate degree wit...,Combining a useful graduate degree with a subs...,"[degree, you are much better oW combining it w...",Combining a useful graduate degree with a subs...,1.0,0.908313,0.0,0.866667,0.846475
2,How does the Influence-from-Mere-Association T...,The Influence-from-Mere-Association Tendency c...,[ing to be liked can be a major impediment to ...,The Influence-from-Mere-Association Tendency a...,0.0,0.961569,1.0,0.477778,0.693072
3,What are the 9 most important steps for a CEO ...,The 9 important steps for a CEO of a turnaroun...,[clearly in charge.\nA company that requires a...,The 9 most important steps for a CEO of a turn...,1.0,0.95134,0.222222,0.5,0.382016
4,How can you bootstrap off initial customers to...,You can bootstrap off initial customers by foc...,"[existing “traction” of some form — customers,...","Try to raise angel money, or bootstrap off ini...",0.0,0.99285,1.0,0.830357,0.212088
5,How can blogging help entrepreneurs interact w...,Blogging can help entrepreneurs interact with ...,"[esting things going on, about their point of ...",Blogging can help entrepreneurs interact with ...,1.0,0.983405,1.0,1.0,0.687614
6,What are some common layers of risk for a high...,Some common layers of risk for a high-tech sta...,[What are the layers of risk for a high-tech\n...,"It depends on the startup, but here are some o...",1.0,1.0,1.0,1.0,0.206978
7,What data is available on the relationship bet...,There is no systematic data available on the r...,[and why?\nAge and the Entrepreneur: Some data...,I’m not aware of any systematic data on age an...,0.75,0.0,1.0,0.583333,0.520194
8,What does Dr. Simonton suggest is more importa...,Dr. Simonton suggests that instead of focusing...,[progress through a creative career. Instead y...,Dr. Simonton suggests that focusing on more at...,0.5,0.950158,1.0,1.0,0.743596
9,How can violating the chain of command help ga...,Violating the chain of command can help gather...,"[nization.\nSecond, the minute you have a bad ...",Violating the chain of command can help gather...,1.0,1.0,0.0,0.916667,0.800858


## Task 3: Evaluating our Adjusted Pipeline Against Our Baseline

Now we can compare our results and see what directional changes occurred!

Let's refresh with our initial metrics.

In [82]:
results

{'faithfulness': 0.7746, 'answer_relevancy': 0.8040, 'context_recall': 0.6033, 'context_precision': 0.7750, 'answer_correctness': 0.5605}

And see how our advanced retrieval modified our chain!

In [83]:
advanced_retrieval_results

{'faithfulness': 0.6479, 'answer_relevancy': 0.9098, 'context_recall': 0.6561, 'context_precision': 0.7873, 'answer_correctness': 0.6128}

In [84]:
import pandas as pd

df_original = pd.DataFrame(list(results.items()), columns=['Metric', 'Baseline'])
df_comparison = pd.DataFrame(list(advanced_retrieval_results.items()), columns=['Metric', 'MultiQueryRetriever with Document Stuffing'])

df_merged = pd.merge(df_original, df_comparison, on='Metric')

df_merged['Delta'] = df_merged['MultiQueryRetriever with Document Stuffing'] - df_merged['Baseline']

df_merged

| Metric | Baseline | MultiQueryRetriever with Document Stuffing | Delta |
|---|---|---|---|
| faithfulness | 0.774561 | 0.647917 | -0.126645 |
| answer_relevancy | 0.804027 | 0.909752 | 0.105725 |
| context_recall | 0.603333 | 0.656111 | 0.052778 |
| context_precision | 0.775 | 0.787341 | 0.012341 |
| answer_correctness | 0.560545 | 0.612804 | 0.052258 |
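
The `Delta` column is an absolute difference in score. To read the same change relative to the baseline, a small helper like the one below could be used. This is a sketch using the rounded scores printed above; `relative_change` is a hypothetical helper name, not something from Ragas or pandas.

```python
# Rounded aggregate scores copied from the two result printouts above.
baseline = {
    "faithfulness": 0.7746,
    "answer_relevancy": 0.8040,
    "context_recall": 0.6033,
    "context_precision": 0.7750,
    "answer_correctness": 0.5605,
}
multi_query = {
    "faithfulness": 0.6479,
    "answer_relevancy": 0.9098,
    "context_recall": 0.6561,
    "context_precision": 0.7873,
    "answer_correctness": 0.6128,
}


def relative_change(before: dict, after: dict) -> dict:
    """Return (after - before) / before for every metric in `before`."""
    return {metric: (after[metric] - before[metric]) / before[metric] for metric in before}


changes = relative_change(baseline, multi_query)
for metric, change in changes.items():
    print(f"{metric:<20} {change:+.1%}")
```

Read this way, the faithfulness drop of about 0.13 in absolute score is a decline of roughly 16% relative to the baseline, which is easy to underestimate from the raw delta alone.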


## Task 4: Testing OpenAI's Claim

Now that we've seen how our retriever can impact the performance of our RAG pipeline - let's see how changing our embedding model impacts performance.

#### 🏗️ Activity #2:

Please provide markdown, or code comments, to explain what each of the following steps is doing!

In [85]:
# Switch to the embedding model under test: text-embedding-3-small
new_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

In [86]:
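# Embed the document chunks with the new model and index them in a fresh in-memory Qdrant collection
vector_store = Qdrant.from_documents(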
    documents,
    new_embeddings,
    location=":memory:",
    collection_name="PMarca Blogs - TE3 - MQR",
)

In [87]:
# Expose the new vector store as a basic retriever
new_retriever = vector_store.as_retriever()

In [88]:
# Wrap the new retriever in MultiQueryRetriever so the LLM generates query variants, as in Task 2
new_advanced_retriever = MultiQueryRetriever.from_llm(retriever=new_retriever, llm=primary_qa_llm)

In [89]:
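# Rebuild the retrieval chain, reusing the same document-stuffing chain so only the embeddings change
new_retrieval_chain = create_retrieval_chain(new_advanced_retriever, document_chain)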

In [90]:
answers = []
contexts = []

for question in test_questions:
  response = new_retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])

In [91]:
new_response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

In [92]:
new_advanced_retrieval_results = evaluate(new_response_dataset_advanced_retrieval, metrics)

Evaluating: 100%|██████████| 100/100 [01:14<00:00,  1.35it/s]


In [93]:
new_advanced_retrieval_results

{'faithfulness': 0.7500, 'answer_relevancy': 0.8554, 'context_recall': 0.6950, 'context_precision': 0.7850, 'answer_correctness': 0.6616}

In [94]:
df_baseline = pd.DataFrame(list(results.items()), columns=['Metric', 'ADA + Baseline'])
df_original = pd.DataFrame(list(advanced_retrieval_results.items()), columns=['Metric', 'ADA + MQR'])
df_comparison = pd.DataFrame(list(new_advanced_retrieval_results.items()), columns=['Metric', 'TE3 + MQR'])

df_merged = pd.merge(df_original, df_comparison, on='Metric')
df_merged = pd.merge(df_baseline, df_merged, on="Metric")

df_merged['ADA + MQR -> TE3 + MQR'] = df_merged['TE3 + MQR'] - df_merged['ADA + MQR']
df_merged['Baseline -> TE3 + MQR'] = df_merged['TE3 + MQR'] - df_merged['ADA + Baseline']

df_merged

| Metric | ADA + Baseline | ADA + MQR | TE3 + MQR | ADA + MQR -> TE3 + MQR | Baseline -> TE3 + MQR |
|---|---|---|---|---|---|
| faithfulness | 0.774561 | 0.647917 | 0.75004 | 0.102123 | -0.024522 |
| answer_relevancy | 0.804027 | 0.909752 | 0.85537 | -0.054382 | 0.051343 |
| context_recall | 0.603333 | 0.656111 | 0.695 | 0.038889 | 0.091667 |
| context_precision | 0.775 | 0.787341 | 0.785011 | -0.00233 | 0.010011 |
| answer_correctness | 0.560545 | 0.612804 | 0.661634 | 0.048831 | 0.101089 |


#### ❓ Question #4:

In your opinion, is `text-embedding-3-small` significantly better than `ada`?

##### ANSWER:

Based on the results presented in the notebook, text-embedding-3-small does show improvements over ada (text-embedding-ada-002) in several metrics, but whether it's "significantly" better is subjective. Here's an analysis:

1. Improvements:
   - Context recall increased from 0.6033 to 0.6950
   - Answer correctness improved from 0.5605 to 0.6616
   - Context precision slightly improved from 0.7750 to 0.7850

2. Mixed results:
   - Faithfulness decreased from 0.7746 to 0.7500
   - Answer relevancy increased from 0.8040 to 0.8554 when comparing to the baseline, but decreased compared to the advanced retrieval method with ada

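3. Overall impact:
   - Relative to the ADA baseline, the new pipeline improves four of the five metrics
   - Those gains range from about 0.01 to 0.10 in absolute score (roughly 1 to 10 percentage points), depending on the metric
   - Holding the retriever fixed (ADA + MQR versus TE3 + MQR), which isolates the embedding change, faithfulness, context recall, and answer correctness improve while answer relevancy and context precision decrease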

While text-embedding-3-small does show improvements, calling it "significantly" better might be an overstatement. The improvements are noticeable and consistent across most metrics, which is promising. However, the magnitude of improvement varies, and there's even a slight decrease in faithfulness.

In practice, the importance of these improvements would depend on the specific use case and requirements of the application. For some applications, these improvements could lead to noticeably better performance, while for others, the difference might be less impactful.

It's also worth noting that the evaluation is based on a specific dataset and task, so results might vary in different contexts. To make a definitive statement about significance, it would be beneficial to perform statistical tests and evaluate the model across a wider range of tasks and datasets.
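
One way such a statistical test could look is a paired sign-flip permutation test on per-question scores from the two pipelines. The sketch below is an illustration under stated assumptions: `paired_sign_flip_test` is a hypothetical helper, and the two score lists are made-up placeholders, NOT values from the runs above. In practice the per-question columns of the two `to_pandas()` result frames would be used instead.

```python
import random
from statistics import mean
from typing import List


def paired_sign_flip_test(
    scores_a: List[float],
    scores_b: List[float],
    n_resamples: int = 5000,
    seed: int = 0,
) -> float:
    """Two-sided paired permutation test: randomly flip the sign of each per-question difference."""
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical per-question scores for illustration only (not from this notebook's runs).
scores_ada = [0.50, 0.60, 0.40, 0.70, 0.55, 0.65, 0.45, 0.75]
scores_te3 = [0.55, 0.58, 0.47, 0.72, 0.50, 0.71, 0.49, 0.74]

p_value = paired_sign_flip_test(scores_ada, scores_te3)
print(f"p-value: {p_value:.3f}")
```

A small p-value would suggest the mean difference is unlikely under the assumption that the two pipelines are interchangeable. With only 20 test questions, though, such a test has limited power, so a larger evaluation set would still be needed.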

## BONUS ACTIVITY: Using a Better Generator

Now that we've seen how much more effective a better Retrieval pipeline is, let's look at what impact a better(?) Generator has!

Adapt the above `TE3 + MQR` pipeline to use `GPT-4o` and compare the results below!

### Load the Data

In [53]:
import os

QDRANT_API_KEY = os.getenv("QDRANT_API_KEY")
QDRANT_API_URL = os.getenv("QDRANT_URL")
collection_name="PMarca Blogs - TE3 - MQR - 4o"

In [54]:
from langchain_community.document_loaders import PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_qdrant import Qdrant
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

newer_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
primary_qa_llm = ChatOpenAI(model_name="gpt-4o", temperature=0)

# Doc Loader
loader = PyMuPDFLoader(
    "https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf",
)

# Text Splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 200,
    chunk_overlap = 50
)

# Load the PDF first; otherwise the loader defined above is never used
documents = loader.load()
documents = text_splitter.split_documents(documents)

# Vector Store
vector_store = Qdrant.from_documents(
    embedding=newer_embeddings,
    collection_name=collection_name,
    url=QDRANT_API_URL,
    api_key=QDRANT_API_KEY,
    prefer_grpc=True,   
    documents=documents,
)

### Retrieve the Data

In [87]:
from langchain import hub

prompt = hub.pull("langchain-ai/retrieval-qa-chat")

In [88]:
print(prompt.messages[0].prompt.template)

Answer any use questions based solely on the context below:

<context>
{context}
</context>


In [89]:
newer_retriever = vector_store.as_retriever()

In [90]:
from langchain_openai import ChatOpenAI

primary_ka_llm = ChatOpenAI(model_name="gpt-4o", temperature=0)

In [91]:
from langchain.retrievers import MultiQueryRetriever
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

newer_advanced_retriever = MultiQueryRetriever.from_llm(retriever=newer_retriever, llm=primary_ka_llm)

dokument_chain = create_stuff_documents_chain(primary_ka_llm, prompt)

newer_retrieval_chain = create_retrieval_chain(newer_advanced_retriever, dokument_chain)

In [92]:
response = newer_retrieval_chain.invoke({"input": "Who is Taylor Swift fueding with?"})

In [93]:
print(response["answer"])

The provided context does not contain any information about Taylor Swift or any feuds she may be involved in.


In [94]:
response = newer_retrieval_chain.invoke({"input": "Why are they fueding?"})

In [95]:
print(response["answer"])

The context suggests that humans are "born to dislike and hate" due to various triggering forces in life, which has led to a long history of continuous war. This inherent tendency to dislike and hate can also manifest in competitive environments, such as startups, where rivalry with competitors can trigger similar dynamics. Therefore, the feuding likely stems from these deep-seated human tendencies and competitive pressures.


In [97]:
# answers = []
# contexts = []

# for question in test_questions:
#   response = newer_retrieval_chain.invoke({"input" : question})
#   answers.append(response["answer"])
#   contexts.append([context.page_content for context in response["context"]])

In [98]:
# newer_response_dataset_advanced_retrieval = Dataset.from_dict({
#     "question" : test_questions,
#     "answer" : answers,
#     "contexts" : contexts,
#     "ground_truth" : test_groundtruths
# })

In [99]:
# newer_advanced_retrieval_results = evaluate(newer_response_dataset_advanced_retrieval, metrics)