# The Art of RAG Evaluation

In this notebook we'll explore the following:

- Creating a simple RAG pipeline with [LangChain v0.1.0](https://blog.langchain.dev/langchain-v0-1-0/)
- Evaluating our pipeline with the [Ragas](https://github.com/explodinggradients/ragas) library
- Making an adjustment to our RAG pipeline
- Evaluating our adjusted pipeline against our baseline

The only way to get started is to get started - so let's grab our dependencies for the day!

> NOTE: You'll notice we're including a number of `pip install` commands relating to LangChain now - this is part of their v0.1.0 release! Keep in mind that not all of these are critical to building a LangChain pipeline - we're only using them to show the plethora of options we have with the LangChain package!

In [None]:
!pip install -U -q langchain langchain-openai langchain_core langchain-community langchainhub openai ragas tiktoken cohere faiss_cpu

In [None]:
import langchain
print(f"LangChain Version: {langchain.__version__}")

LangChain Version: 0.1.5


Since we'll be using OpenAI to power both our RAG pipeline and part of the functionality of the Ragas library - we'll need an OpenAI API key!

In [None]:
import os
import openai
from getpass import getpass

openai.api_key = getpass("Please provide your OpenAI Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key

Please provide your OpenAI Key: ··········


## Building our RAG pipeline

While the version may have changed - the process of creating our RAG pipeline remains largely the same:

- Create an Index
- Use an LLM to generate responses based on the retrieved context

Let's get started by creating our index.

### Creating an Index

You'll notice that the largest changes (outside of some import changes) are that our old favourite chains are back to being bundled in an easily usable abstraction.

We can still create custom chains using LCEL - but we can also be more confident that our pre-packaged chains are created using LCEL under the hood.

#### Loading Data

Let's start by loading some data - we'll be using the LangChain v0.1.0 blog to both keep things simple, and keep things meta.

> NOTE: You'll notice that some specific loaders, LLMs, etc., are in their own libraries now. This allows you to stay as lightweight as you'd like while using LangChain!

In [None]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader(
    "https://blog.langchain.dev/langchain-v0-1-0/"
)

documents = loader.load()

In [None]:
documents[0].metadata

{'source': 'https://blog.langchain.dev/langchain-v0-1-0/',
 'title': 'LangChain v0.1.0',
 'language': 'en'}

#### Transforming Data

Now that we've got our single document - let's split it into smaller pieces so we can more effectively leverage it with our retrieval chain!

We'll start with the classic: `RecursiveCharacterTextSplitter`.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 700,
    chunk_overlap = 50
)

documents = text_splitter.split_documents(documents)

Let's confirm we've split our document.

In [None]:
len(documents)

29
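
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding-window splitter in plain Python. It is a simplification and not LangChain's actual algorithm: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int = 700, chunk_overlap: int = 50) -> list[str]:
    """Fixed-size sliding-window splitter (simplified illustration only)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Toy input: 2000 characters of repeating digits
sample = "".join(str(i % 10) for i in range(2000))
chunks = split_with_overlap(sample)
print(len(chunks), [len(c) for c in chunks])
```

Each chunk repeats the last 50 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.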

#### Loading OpenAI Embeddings Model

We'll need a process by which we can convert our text into vectors that allow us to compare to our query vector.

Let's use OpenAI's `text-embedding-ada-002` for this task! (soon we'll be able to leverage OpenAI's newest embedding model which is waiting on an approved PR to be merged as we speak!)

In [None]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="text-embedding-ada-002"
)
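
As a rough illustration of what "comparing vectors" means, the sketch below computes cosine similarity between a query vector and a few chunk vectors in plain Python. The three-dimensional vectors are made up for readability (real `text-embedding-ada-002` vectors have 1536 dimensions), and the names `query_vec` and `doc_vecs` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Made-up toy vectors standing in for real embeddings
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "chunk_a": [1.0, 0.0, 1.0],
    "chunk_b": [0.0, 1.0, 0.0],
    "chunk_c": [1.0, 1.0, 0.0],
}

scores = {name: cosine_similarity(query_vec, vec) for name, vec in doc_vecs.items()}
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```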

#### Creating a FAISS VectorStore

Now that we have documents - we'll need a place to store them alongside their embeddings.

We'll be leveraging Meta's FAISS for this task.

In [None]:
from langchain_community.vectorstores import FAISS

vector_store = FAISS.from_documents(documents, embeddings)

#### Creating a Retriever

To complete our index, all that's left to do is expose our vectorstore as a retriever - which we can do the same way we would in previous versions of LangChain!

In [None]:
retriever = vector_store.as_retriever()
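
Conceptually, a retriever embeds the query and returns the `k` closest chunks. The sketch below shows that idea as a brute-force search, assuming an exhaustive (flat) index and Euclidean distance. The distance metric and `k` a real vector store uses depend on its configuration, and `top_k` and `toy_index` are hypothetical names.

```python
import math


def top_k(query: list[float], vectors: dict[str, list[float]], k: int = 4) -> list[str]:
    """Return the names of the k vectors closest to the query (smallest distance first)."""
    scored = sorted((math.dist(query, vec), name) for name, vec in vectors.items())
    return [name for _, name in scored[:k]]


# Made-up 2-D vectors standing in for embedded chunks
toy_index = {
    "a": [0.0, 0.0],
    "b": [1.0, 1.0],
    "c": [5.0, 5.0],
    "d": [0.5, 0.0],
}

print(top_k([0.0, 0.1], toy_index, k=2))
```

Libraries like FAISS exist because this exhaustive scan gets expensive as the number of vectors grows. For 29 chunks it would be perfectly adequate.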

#### Testing our Retriever

Now that we've gone through the trouble of creating our retriever - let's see it in action!

In [None]:
retrieved_documents = retriever.invoke("Why did they change to version 0.1.0?")

In [None]:
for doc in retrieved_documents:
  print(doc)

page_content='your feedback, so we can address it. They say, “A journey of a thousand miles begins with a single step.” – or in our case, version 0.1.' metadata={'source': 'https://blog.langchain.dev/langchain-v0-1-0/', 'title': 'LangChain v0.1.0', 'language': 'en'}
page_content='0 created a few challenges:Users couldn’t be confident that updating would not have breaking changeslangchain became bloated and unstable as we took a “maintain everything” approach to reduce breaking changes and deprecation notificationsHowever, starting today with the release of langchain 0.1.0, all future releases will follow a new versioning standard. Specifically:Any breaking changes to the public API will result in a minor version bump (the second digit)Any bug fixes or new features will result in a patch version bump (the third digit)We hope that this, combined with the previous architectural changes, will:Communicate clearly if breaking changes are made, allowing developers to' metadata={'s

### Creating a RAG Chain

Now that we have the "R" in RAG taken care of - let's look at creating the "AG"!

#### Creating a Prompt Template

There are a few different ways we could create our prompt template - we could write a custom template, or we could simply pull a prompt from the prompt hub! Let's look at the hub option first!

In [None]:
from langchain import hub

retrieval_qa_prompt = hub.pull("langchain-ai/retrieval-qa-chat")

In [None]:
print(retrieval_qa_prompt.messages[0].prompt.template)

Answer any use questions based solely on the context below:

<context>
{context}
</context>


As you can see - the prompt template is simple - but we'll create our own to be a bit more specific!

In [None]:
from langchain.prompts import ChatPromptTemplate

template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

Context:
{context}

Question:
{question}
"""

prompt = ChatPromptTemplate.from_template(template)
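
To see what the model will actually receive, we can fill the same template string by hand with Python's `str.format`. This is only a sketch of the placeholder substitution, not of `ChatPromptTemplate` itself, and the chunk strings and question are made up for illustration.

```python
template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

Context:
{context}

Question:
{question}
"""

# Hypothetical retrieved chunks and question, used only to fill the placeholders
retrieved_chunks = [
    "LangChain 0.1.0 is the first stable release.",
    "Breaking changes bump the minor version.",
]
filled_prompt = template.format(
    context="\n\n".join(retrieved_chunks),
    question="What is new in v0.1.0?",
)
print(filled_prompt)
```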

#### Setting Up our Basic QA Chain

Now we can instantiate our basic RAG chain!

We'll use LCEL directly just to see an example of it - but you could just as easily use an abstraction here to achieve the same goal!

We'll also make sure to pass our retrieved context through to the chain's output - which is critical for Ragas.

In [None]:
from operator import itemgetter

from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

primary_qa_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

retrieval_augmented_qa_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and piping it into the retriever
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    # RunnablePassthrough.assign forwards the incoming dict and (re)sets "context" to the value it
    # already holds, so both "question" and "context" stay available to the next step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object, which is
    #              then piped into the LLM; the resulting message is stored under "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": prompt | primary_qa_llm, "context": itemgetter("context")}
)
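
If the LCEL pipes feel opaque, the plain-Python sketch below traces the same flow of dictionaries step by step. `fake_retriever` and `fake_llm` are hypothetical stand-ins (no API calls are made), and the prompt text is abbreviated - the point is only which keys exist at each stage.

```python
def fake_retriever(question: str) -> list[str]:
    # Stand-in for the real retriever: always returns the same two chunks
    return ["chunk about versioning", "chunk about LangGraph"]


def fake_llm(prompt_text: str) -> str:
    # Stand-in for the chat model: reports how much text it was given
    return f"(answer based on {len(prompt_text)} prompt characters)"


def run_chain(inputs: dict) -> dict:
    # Step 1: build {"context", "question"} from the incoming question
    step1 = {
        "context": fake_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2: the passthrough-assign step keeps every key and re-sets "context" to itself
    step2 = {**step1, "context": step1["context"]}
    # Step 3: format the prompt, call the model, and carry "context" along for evaluation
    prompt_text = f"Context:\n{step2['context']}\n\nQuestion:\n{step2['question']}\n"
    return {"response": fake_llm(prompt_text), "context": step2["context"]}


result = run_chain({"question": "What is LangGraph?"})
print(result)
```

Returning `"context"` alongside `"response"` is the design choice that matters here: without it we would have no record of what the model was shown when it answered.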

Let's test it out!

In [None]:
question = "What are the major changes in v0.1.0?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)

The major changes in v0.1.0 are the implementation of a new versioning standard, improved focus through functionality and documentation, and the release of the first stable version of LangChain.


In [None]:
question = "What is LangGraph?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)
print(result["context"])

I don't know.
[Document(page_content="an LLM in some sort of a loop. So far, the only way we've had to do that is with AgentExecutor. We've added a lot of parameters and functionality to AgentExecutor, but its still just one way of running a loop.💡We're excited to announce the release of langgraph, a new library to allow for creating language agents as graphs.This will allow users to create far more custom cyclical behavior. You can define explicit planning steps, explicit reflection steps, or easily hard code it so that a specific tool is always called first.It is inspired by\xa0Pregel\xa0and\xa0Apache Beam. The current interface exposed is one inspired by\xa0NetworkX, and looks something like:from langgraph.graph import END, Graph", metadata={'source': 'https://blog.langchain.dev/langchain-v0-1-0/', 'title': 'LangChain v0.1.0', 'language': 'en'}), Document(page_content='main way we’ve tackled this is by building LangSmith. One of the main value props that LangSmith provides

We can already see that there are some improvements we could make here.

For now, let's switch gears to Ragas to see how we can leverage that tool to provide us insight into how our pipeline is performing!

## Ragas Evaluation

Ragas is a powerful library that lets us evaluate our RAG pipeline by collecting input/output/context triplets and obtaining metrics relating to a number of different aspects of our RAG pipeline.

We'll be evaluating on every core metric today, but in order to do that - we'll need to create a test set. Luckily for us, Ragas can do that directly!

#### Synthetic Test Set Generation

We can leverage Ragas' [`Synthetic Test Data generation`](https://docs.ragas.io/en/stable/concepts/testset_generation.html) functionality to generate our own synthetic question-context (QC) pairs - as well as a synthetic ground truth - quite easily!

> NOTE: This process will use `gpt-3.5-turbo-16k` as the base generator and `gpt-4` as the critic - if you're attempting to create a lot of samples please be aware of cost, as well as rate limits.

Let's create a new set of documents to ensure we're not accidentally creating a sample test set that favours our base model too much!

In [None]:
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 200
)
documents = text_splitter.split_documents(documents)

In [None]:
len(documents)

24

In [None]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

generator = TestsetGenerator.with_openai()

testset = generator.generate_with_langchain_docs(documents, test_size=10, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25})

Let's look at the output and see what we can learn about it!

In [None]:
testset.test_data[0]

DataRow(question='What is the purpose of LangSmith in making Langchain more observable and debuggable?', contexts=['putting a non-deterministic component at the center of your system. These models can often output unexpected results, so having visibility into exactly what is happening in your system is integral. 💡We want to make langchain as observable and as debuggable as possible, whether through architectural decisions or tools we build on the side.We’ve set about this in a few ways.The main way we’ve tackled this is by building LangSmith. One of the main value props that LangSmith provides is a best-in-class debugging experience for your LLM application. We log exactly what steps are happening, what the inputs of each step are, what the outputs of each step are, how long each step takes, and more data. We display this in a user-friendly way, allowing you to identify which steps are taking the longest, enter a playground to debug unexpected LLM responses, track token usa

#### Generating Responses with RAG Pipeline

Now that we have some QC pairs, and some ground truths, let's evaluate our RAG pipeline using Ragas.

The process is, again, quite straightforward - thanks to Ragas and LangChain!

Let's start by extracting our questions and ground truths from the test set we created.

We can start by converting our test dataset into a Pandas DataFrame.

In [None]:
test_df = testset.to_pandas()

In [None]:
test_df

| | question | contexts | ground_truth | evolution_type | episode_done |
|---|---|---|---|---|---|
| 0 | What is the purpose of LangSmith in making Lan... | [putting a non-deterministic component at the ... | The purpose of LangSmith is to provide a best-... | simple | True |
| 1 | What changes have been made to improve the rob... | [is our community – both the user base and t... | The changes made to improve the robustness, st... | simple | True |
| 2 | How does prompting work in the context of inte... | [prompting. When you resort to using prompting... | | simple | True |
| 3 | How does LangChain enable an LLM to call a too... | [systems, we are not overly opinionated on how... | LangChain enables an LLM to call a tool multip... | simple | True |
| 4 | What were the reasons for separating out partn... | [we made two large architectural changes: sepa... | The reasons for separating out partner package... | simple | True |
| 5 | How does LangChain facilitate reasoning and to... | [systems, we are not overly opinionated on how... | LangChain facilitates reasoning and tool use i... | reasoning | True |
| 6 | How does LangChain's output parsing feature en... | [prompting. When you resort to using prompting... | LangChain's output parsing feature enhances us... | reasoning | True |
| 7 | "What changes will be made to the versioning s... | [releases will follow a new versioning standar... | Any breaking changes to the public API will re... | multi_context | True |
| 8 | "What improvements have been made to LangChain... | [is our community – both the user base and t... | The information provided does not mention any ... | multi_context | True |
| 9 | "What are the two main aspects of agentic work... | [systems, we are not overly opinionated on how... | The two main aspects of agentic workloads in L... | reasoning | True |


In [None]:
test_questions = test_df["question"].values.tolist()
test_groundtruths = test_df["ground_truth"].values.tolist()

Now we'll generate responses using our RAG pipeline using the questions we've generated - we'll also need to collect our retrieved contexts for each question.

We'll do this in a simple loop to see exactly what's happening!

In [None]:
answers = []
contexts = []

for question in test_questions:
  response = retrieval_augmented_qa_chain.invoke({"question" : question})
  answers.append(response["response"].content)
  contexts.append([context.page_content for context in response["context"]])

Now we can wrap our information in a Hugging Face dataset for use in the Ragas library.

In [None]:
from datasets import Dataset

response_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})
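
Before handing the columns to `Dataset.from_dict`, a quick sanity check can catch shape mistakes early. The helper below is a hypothetical convenience and not part of Ragas or `datasets`. It only assumes the four column names used above, with `contexts` holding a list of strings per question.

```python
def validate_ragas_columns(columns: dict) -> None:
    """Raise if the column dict is missing keys, ragged, or has malformed contexts."""
    required = ("question", "answer", "contexts", "ground_truth")
    missing = [name for name in required if name not in columns]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    lengths = {name: len(columns[name]) for name in required}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"columns have different lengths: {lengths}")
    for row, ctx in enumerate(columns["contexts"]):
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            raise TypeError(f"contexts[{row}] must be a list of strings")


# Made-up single-row example in the same layout as the dataset above
example_columns = {
    "question": ["What is LangGraph?"],
    "answer": ["A library for building agents as graphs."],
    "contexts": [["langgraph, a new library to allow for creating language agents as graphs"]],
    "ground_truth": ["LangGraph is a library for creating language agents as graphs."],
}
validate_ragas_columns(example_columns)
print("columns look consistent")
```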

Let's take a peek and see what that looks like!

In [None]:
response_dataset[0]

{'question': 'What is the purpose of LangSmith in making Langchain more observable and debuggable?',
 'answer': 'The purpose of LangSmith is to provide a best-in-class debugging experience for LangChain applications by logging and displaying detailed information about the steps, inputs, outputs, and performance of each step.',
 'contexts': ['main way we’ve tackled this is by building LangSmith. One of the main value props that LangSmith provides is a best-in-class debugging experience for your LLM application. We log exactly what steps are happening, what the inputs of each step are, what the outputs of each step are, how long each step takes, and more data. We display this in a user-friendly way, allowing you to identify which steps are taking the longest, enter a playground to debug unexpected LLM responses, track token usage and more. Even in private beta, the demand for LangSmith has been overwhelming, and we’re investing a lot in scalability so that we can release a public bet

#### Evaluating with Ragas

Now that we have our response dataset - we can finally get into the "meat" of Ragas - evaluation!

First, we'll import the desired metrics, then we can use them to evaluate our created dataset!

Check out the specific metrics we'll be using in the Ragas documentation:

- [Faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html)
- [Answer Relevancy](https://docs.ragas.io/en/stable/concepts/metrics/answer_relevance.html)
- [Context Precision](https://docs.ragas.io/en/stable/concepts/metrics/context_precision.html)
- [Context Recall](https://docs.ragas.io/en/stable/concepts/metrics/context_recall.html)
- [Answer Correctness](https://docs.ragas.io/en/stable/concepts/metrics/answer_correctness.html)

See the accompanying presentation for more in-depth explanations about each of the metrics!
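
As a rough mental model, and going by our reading of the Ragas documentation, two of these metrics boil down to simple ratios: faithfulness is the share of claims in the answer that the retrieved context supports, and context recall is the share of ground-truth sentences that can be attributed to the retrieved context. In Ragas an LLM produces the per-claim verdicts. In the sketch below the verdicts are hand-written, hypothetical values, and `ratio_metric` is not a Ragas function.

```python
import math


def ratio_metric(verdicts: list[bool]) -> float:
    """Fraction of True verdicts; NaN when there is nothing to judge."""
    if not verdicts:
        return math.nan
    return sum(verdicts) / len(verdicts)


# Hypothetical verdicts: is each claim in the answer supported by the context?
answer_claims_supported = [True, True, False, True]
# Hypothetical verdicts: is each ground-truth sentence found in the context?
ground_truth_sentences_found = [True, False]

faithfulness_score = ratio_metric(answer_claims_supported)
context_recall_score = ratio_metric(ground_truth_sentences_found)
print(faithfulness_score, context_recall_score)  # 0.75 0.5
```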

In [None]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_correctness,
    context_recall,
    context_precision,
)

metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
]

All that's left to do is call "evaluate" and away we go!

In [None]:
results = evaluate(response_dataset, metrics)

In [None]:
results

{'faithfulness': 0.8750, 'answer_relevancy': 0.7527, 'context_recall': 0.6250, 'context_precision': 0.7722, 'answer_correctness': 0.5141}

In [None]:
results_df = results.to_pandas()
results_df

| | question | answer | contexts | ground_truth | faithfulness | answer_relevancy | context_recall | context_precision | answer_correctness |
|---|---|---|---|---|---|---|---|---|---|
| 0 | What is the purpose of LangSmith in making Lan... | The purpose of LangSmith is to provide a best-... | [main way we’ve tackled this is by building ... | The purpose of LangSmith is to provide a best-... | 1.0 | 0.899628 | 1.0 | 1.0 | 0.542937 |
| 1 | What changes have been made to improve the rob... | The changes made to improve the robustness, st... | [breaking changes are made, allowing developer... | The changes made to improve the robustness, st... | 1.0 | 0.995921 | 1.0 | 1.0 | 0.623775 |
| 2 | How does prompting work in the context of inte... | I don't know. | [(an early prompting strategy for doing so) fr... | | | 0.0 | 0.0 | 0.0 | 0.178298 |
| 3 | How does LangChain enable an LLM to call a too... | I don't know. | [how to best enable an LLM to call a tool mult... | LangChain enables an LLM to call a tool multip... | | 0.0 | 1.0 | 0.805556 | 0.179147 |
| 4 | What were the reasons for separating out partn... | The reasons for separating out partner package... | [decided to make significant changes to the L... | The reasons for separating out partner package... | 1.0 | 1.0 | 1.0 | 1.0 | 0.572775 |
| 5 | How does LangChain facilitate reasoning and to... | LangChain facilitates reasoning and tool use i... | [in search). We’ve also made sure to support... | LangChain facilitates reasoning and tool use i... | 0.0 | 0.911262 | 0.0 | 1.0 | 0.377573 |
| 6 | How does LangChain's output parsing feature en... | LangChain's output parsing feature enhances us... | [of just the LLM call (for example, in output ... | LangChain's output parsing feature enhances us... | 1.0 | 0.962432 | 0.25 | 0.916667 | 0.454692 |
| 7 | "What changes will be made to the versioning s... | Any breaking changes to the public API will re... | [breaking changes are made, allowing developer... | Any breaking changes to the public API will re... | 1.0 | 0.892689 | 1.0 | 1.0 | 1.0 |
| 8 | "What improvements have been made to LangChain... | The improvements made to LangChain's integrati... | [base and the 2000+ contributors – and we wa... | The information provided does not mention any ... | 1.0 | 0.914379 | 0.0 | 0.0 | 0.21223 |
| 9 | "What are the two main aspects of agentic work... | The two main aspects of agentic workloads in L... | [in search). We’ve also made sure to support... | The two main aspects of agentic workloads in L... | 1.0 | 0.950255 | 1.0 | 1.0 | 1.0 |
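
The per-question scores also let us check where the aggregate numbers come from. Judging from the values shown (this is inferred from the output, not from the Ragas source), each aggregate looks like the column mean with missing values skipped. The sketch below reproduces two of them in plain Python, treating the two empty faithfulness cells as NaN.

```python
import math

# Per-question scores copied from the results table (rows 0-9)
faithfulness_scores = [1.0, 1.0, math.nan, math.nan, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0]
context_recall_scores = [1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.25, 1.0, 0.0, 1.0]


def mean_ignoring_nan(values: list[float]) -> float:
    """Average of the values that are not NaN; NaN if nothing is left."""
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept) if kept else math.nan


print(mean_ignoring_nan(faithfulness_scores))    # 0.875
print(mean_ignoring_nan(context_recall_scores))  # 0.625
```

Note what this implies for faithfulness: the two "I don't know." answers have no score at all, so the 0.8750 aggregate is an average over only eight of the ten questions.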


## Testing a More Performant Retriever

Now that we have established a baseline - we can see how any changes impact our pipeline's performance!

Let's modify our retriever and see how that impacts our Ragas metrics!

In [None]:
from langchain.retrievers import MultiQueryRetriever

advanced_retriever = MultiQueryRetriever.from_llm(retriever=retriever, llm=primary_qa_llm)
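
The idea behind a multi-query retriever can be sketched without any LLM: rewrite the question a few ways, retrieve for each rewrite, and keep the unique union of the results. In the real retriever an LLM writes the rewrites. Here `toy_generate_queries` returns hard-coded, hypothetical variants and `toy_retrieve` is a keyword lookup over a made-up corpus.

```python
toy_corpus = {
    "langgraph": "langgraph lets you build agents as graphs",
    "versioning": "breaking changes bump the minor version",
    "agents": "AgentExecutor runs an LLM in a loop",
}


def toy_retrieve(query: str) -> list[str]:
    # Stand-in for vector search: return chunks whose keyword appears in the query
    return [text for keyword, text in toy_corpus.items() if keyword in query.lower()]


def toy_generate_queries(question: str) -> list[str]:
    # Stand-in for the LLM rewrite step: fixed, hand-written variants
    return [
        "Explain LangGraph",
        "How do langgraph agents work?",
        "What replaces AgentExecutor for agents?",
    ]


def multi_query_retrieve(question: str) -> list[str]:
    merged, seen = [], set()
    for query in toy_generate_queries(question):
        for chunk in toy_retrieve(query):
            if chunk not in seen:  # keep each chunk once, in first-seen order
                seen.add(chunk)
                merged.append(chunk)
    return merged


print(multi_query_retrieve("What is LangGraph?"))
```

The trade-off is visible even in this toy: more rewrites can surface chunks a single phrasing would miss, but they can also pull in less relevant ones.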

We'll also re-create our RAG pipeline using the abstractions that come packaged with LangChain v0.1.0!

First, let's create a chain to "stuff" our documents into our context!

In [None]:
from langchain.chains.combine_documents import create_stuff_documents_chain

document_chain = create_stuff_documents_chain(primary_qa_llm, retrieval_qa_prompt)
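
"Stuffing" simply means concatenating every retrieved chunk into the single `{context}` slot of the prompt. The sketch below shows the idea with the hub prompt's system text. The blank-line separator is an assumption about the default, and `stuff_documents` is a hypothetical helper, not the LangChain function.

```python
def stuff_documents(page_contents: list[str], separator: str = "\n\n") -> str:
    """Join all chunk texts into one context string (separator is an assumed default)."""
    return separator.join(page_contents)


system_template = (
    "Answer any use questions based solely on the context below:\n\n"
    "<context>\n{context}\n</context>"
)

# Made-up chunk texts standing in for retrieved documents
stuffed = stuff_documents(["first chunk", "second chunk"])
system_message = system_template.format(context=stuffed)
print(system_message)
```

Because everything goes into one prompt, this approach only works while the combined chunks fit in the model's context window.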

Next, we'll create the retrieval chain!

In [None]:
from langchain.chains import create_retrieval_chain

retrieval_chain = create_retrieval_chain(advanced_retriever, document_chain)

In [None]:
response = retrieval_chain.invoke({"input": "What are the major changes in v0.1.0?"})

In [None]:
print(response["answer"])

The major changes in v0.1.0 of LangChain include:

- Adoption of a new versioning standard: Any breaking changes to the public API will result in a minor version bump, while bug fixes or new features will result in a patch version bump.
- Improved focus through both functionality and documentation.
- Full backward compatibility.
- Availability in both Python and JavaScript.
- The release of stable versions helps earn developer trust and allows for systematic and safe evolution of the library.


In [None]:
response = retrieval_chain.invoke({"input": "What is LangGraph?"})

In [None]:
print(response["answer"])

LangGraph is a new library that allows users to create language agents as graphs. It provides the capability to define explicit planning steps, explicit reflection steps, or easily hard code specific tools to be called first. LangGraph is inspired by Pregel and Apache Beam, and its current interface is similar to NetworkX.


Well, just from those responses this chain *feels* better - but let's see how it performs on our eval!

Let's do the same process we did before to collect our pipeline's contexts and answers.

In [None]:
answers = []
contexts = []

for question in test_questions:
  response = retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])

Now we can convert this into a dataset, just like we did before.

In [None]:
response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

Let's evaluate on the same metrics we did for the first pipeline and see how it does!

In [None]:
advanced_retrieval_results = evaluate(response_dataset_advanced_retrieval, metrics)

### Comparing Results

Now we can compare our results and see what directional changes occurred!

Let's refresh with our initial metrics.

In [None]:
results

{'faithfulness': 0.8750, 'answer_relevancy': 0.7527, 'context_recall': 0.6250, 'context_precision': 0.7722, 'answer_correctness': 0.5141}

And see how our advanced retrieval modified our chain!

In [None]:
advanced_retrieval_results

{'faithfulness': 0.8889, 'answer_relevancy': 0.9235, 'context_recall': 0.6583, 'context_precision': 0.7120, 'answer_correctness': 0.4103}

In [None]:
import pandas as pd

df_original = pd.DataFrame(list(results.items()), columns=['Metric', 'Baseline'])
df_comparison = pd.DataFrame(list(advanced_retrieval_results.items()), columns=['Metric', 'MultiQueryRetriever with Document Stuffing'])

df_merged = pd.merge(df_original, df_comparison, on='Metric')

df_merged['Delta'] = df_merged['MultiQueryRetriever with Document Stuffing'] - df_merged['Baseline']

df_merged

| | Metric | Baseline | MultiQueryRetriever with Document Stuffing | Delta |
|---|---|---|---|---|
| 0 | faithfulness | 0.875 | 0.888889 | 0.013889 |
| 1 | answer_relevancy | 0.752657 | 0.923545 | 0.170888 |
| 2 | context_recall | 0.625 | 0.658333 | 0.033333 |
| 3 | context_precision | 0.772222 | 0.711972 | -0.06025 |
| 4 | answer_correctness | 0.514143 | 0.410307 | -0.103836 |


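We can see that faithfulness (+0.014), answer relevancy (+0.171), and context recall (+0.033) improved, while context precision (-0.060) and answer correctness (-0.104) dropped. With only 10 test questions, the smaller deltas may well be noise. Note also that the second pipeline swaps in the hub prompt as well as the new retriever, so the differences cannot be attributed to the retriever alone.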

We'd need to do some more experimentation to determine how to improve our pipeline!