# The Art of RAG Evaluation

In this notebook we'll explore the following:

- Creating a simple RAG pipeline with [LangChain v0.1.0](https://blog.langchain.dev/langchain-v0-1-0/)
- Evaluating our pipeline with the [Ragas](https://github.com/explodinggradients/ragas) library
- Making an adjustment to our RAG pipeline
- Evaluating our adjusted pipeline against our baseline

The only way to get started is to get started - so let's grab our dependencies for the day!

> NOTE: You'll notice we're including a number of `pip install` commands relating to LangChain now - this is part of their v0.1.0 release! Keep in mind that not all of these are critical to building a LangChain pipeline - we're only using them to show the plethora of options we have with the LangChain package!

In [1]:
!pip install -U -q langchain langchain-openai langchain_core langchain-community langchainhub openai ragas tiktoken cohere faiss_cpu




In [2]:
import langchain
print(f"LangChain Version: {langchain.__version__}")

LangChain Version: 0.1.9


Since we'll be using OpenAI to power both our RAG pipeline and parts of the Ragas library - we'll need an OpenAI API key!

In [3]:
import os
import openai
from getpass import getpass

openai.api_key = getpass("Please provide your OpenAI Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key

## Building our RAG pipeline

While the version may have changed - the process of creating our RAG pipeline remains largely the same:

- Create an Index
- Use an LLM to generate responses based on the retrieved context
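
To make the two steps above concrete, here is a minimal, dependency-free sketch. It is only a toy: word overlap stands in for embedding search, the prompt is built but never sent to an LLM, and the sample chunks and the helper names (`tokenize`, `retrieve`, `build_prompt`) are made up for illustration.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Toy retrieval: rank chunks by how many words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(query_words & tokenize(c)), reverse=True)
    return ranked[:k]

def build_prompt(question: str, context: list[str]) -> str:
    """Toy augmentation: place the retrieved chunks ahead of the question."""
    joined = "\n".join(context)
    return f"Context:\n{joined}\n\nQuestion:\n{question}"

# Hypothetical chunks, loosely paraphrasing a contract
chunks = [
    "Either party may terminate this Agreement upon fourteen days notice.",
    "The Advisor shall safeguard the Confidential Information.",
    "Any Work Product shall be owned by the Company.",
]

top = retrieve("Who can terminate the Agreement?", chunks, k=1)
prompt_text = build_prompt("Who can terminate the Agreement?", top)
print(prompt_text)
```

In the real pipeline the ranking comes from vector similarity and the prompt is passed to a chat model, but the shape of the flow is the same.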

Let's get started by creating our index.

### Creating an Index

You'll notice that the largest changes (outside of some import changes) are that our old favourite chains are back to being bundled in an easily usable abstraction.

We can still create custom chains using LCEL - but we can also be more confident that our pre-packaged chains are created using LCEL under the hood.

#### Loading Data

Let's start by loading some data - we'll be using the LangChain v0.1.0 blog to both keep things simple, and keep things meta.

> NOTE: You'll notice that some specific loaders, LLMs, etc., are in their own libraries now. This allows you to stay as lightweight as you'd like while using LangChain!

In [4]:
!pip install docx2txt






In [5]:
import docx2txt

In [11]:
import os
import sys
from dotenv import load_dotenv
from langchain.chains import ConversationalRetrievalChain
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.document_loaders import Docx2txtLoader
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

load_dotenv('.env')

documents = []
data_dir = "./Evaluation Sets"

# Build a list of documents from every supported file in the data folder
for file in os.listdir(data_dir):
    path = f"{data_dir}/{file}"
    if file.endswith(".pdf"):
        loader = PyPDFLoader(path)
    elif file.endswith((".docx", ".doc")):
        loader = Docx2txtLoader(path)
    elif file.endswith(".txt"):
        loader = TextLoader(path)
    else:
        continue  # skip unsupported file types
    documents.extend(loader.load())



FileNotFoundError: [WinError 3] The system cannot find the path specified: 'Evaluation Sets'

In [None]:
documents[0].metadata

{'source': 'https://blog.langchain.dev/langchain-v0-1-0/',
 'title': 'LangChain v0.1.0',
 'language': 'en'}

#### Transforming Data

Now that we've got our single document - let's split it into smaller pieces so we can more effectively leverage it with our retrieval chain!

We'll start with the classic: `RecursiveCharacterTextSplitter`.

In [12]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 700,
    chunk_overlap = 50
)

documents = text_splitter.split_documents(documents)





Let's confirm we've split our document.

In [13]:
len(documents)

58
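
To build some intuition for `chunk_size` and `chunk_overlap`, here is a simplified sliding-window splitter. This is not how `RecursiveCharacterTextSplitter` works internally (it prefers to break on separators such as paragraphs and sentences); the hypothetical `sliding_window_chunks` helper only shows how the two parameters interact.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
```

Each piece starts with the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.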

#### Loading OpenAI Embeddings Model

We'll need a process by which we can convert our text into vectors that allow us to compare to our query vector.

Let's use OpenAI's `text-embedding-ada-002` for this task! (soon we'll be able to leverage OpenAI's newest embedding model which is waiting on an approved PR to be merged as we speak!)

In [14]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="text-embedding-ada-002"
)
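
Comparing a query vector with document vectors usually comes down to a similarity or distance measure. Cosine similarity is one common choice; the sketch below uses made-up three-dimensional vectors (real `text-embedding-ada-002` vectors have 1536 dimensions) and a hand-written `cosine_similarity` helper.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, chosen by hand for illustration
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "termination": [0.9, 0.1, 0.8],
    "ip": [0.0, 1.0, 0.1],
}

best = max(doc_vecs, key=lambda name: cosine_similarity(query_vec, doc_vecs[name]))
print(best)
```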

#### Creating a FAISS VectorStore

Now that we have documents - we'll need a place to store them alongside their embeddings.

We'll be leveraging Meta's FAISS for this task.

In [15]:
from langchain_community.vectorstores import FAISS

vector_store = FAISS.from_documents(documents, embeddings)

#### Creating a Retriever

To complete our index, all that's left to do is expose our vectorstore as a retriever - which we can do the same way we would in previous versions of LangChain!

In [16]:
retriever = vector_store.as_retriever()
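
Conceptually, a retriever is a "give me the k closest chunks" function over the stored vectors. The brute-force sketch below ranks by Euclidean distance on toy two-dimensional vectors. Both the `top_k` helper and the default of `k=4` are assumptions made for illustration (four is, to my understanding, LangChain's default for vector-store retrievers), and FAISS itself uses far more efficient index structures.

```python
import math

def top_k(query_vec, index, k=4):
    """Return the k stored texts whose vectors are closest to query_vec."""
    ranked = sorted(index, key=lambda item: math.dist(query_vec, item[0]))
    return [text for _, text in ranked[:k]]

# Hypothetical (vector, text) pairs standing in for an embedded document store
index = [
    ([0.0, 0.0], "chunk A"),
    ([1.0, 1.0], "chunk B"),
    ([5.0, 5.0], "chunk C"),
]

nearest = top_k([0.9, 0.8], index, k=2)
print(nearest)
```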

#### Testing our Retriever

Now that we've gone through the trouble of creating our retriever - let's see it in action!

In [17]:
retrieved_documents = retriever.invoke("Why did they change to version 0.1.0?")

In [18]:
for doc in retrieved_documents:
  print(doc)

page_content='to the Company and become its full and exclusive property.' metadata={'source': './Evaluation Sets/Robinson Advisory.docx'}
page_content='to the Company and become its full and exclusive property.' metadata={'source': './Evaluation Sets/Robinson Advisory.docx'}
page_content='IP: Any Work Product, upon creation, shall be fully and exclusively owned by the Company. The Advisor, immediately upon Company’s request, shall sign any document and/or perform any action needed to formalize such ownership. The Advisor shall not obtain any rights in the Work Product, including moral rights and/or rights for royalties or other consideration under any applicable law (including Section 134 of the Israeli Patent Law – 1967 if applicable), and shall not be entitled to any compensation with respect to the Services, which was not specifically agreed, in writing, between the Advisor and the Company.' metadata={'source': './Evaluation Sets/Robinson Advisory.docx'}
page_content='IP: Any Work P

### Creating a RAG Chain

Now that we have the "R" in RAG taken care of - let's look at creating the "AG"!

#### Creating a Prompt Template

There are a few different ways we could create our prompt template - we could create a custom template, as seen in the code below, or we could simply pull a prompt from the prompt hub! Let's look at an example of that!

In [19]:
from langchain import hub

retrieval_qa_prompt = hub.pull("langchain-ai/retrieval-qa-chat")

In [20]:
print(retrieval_qa_prompt.messages[0].prompt.template)

Answer any use questions based solely on the context below:

<context>
{context}
</context>


As you can see - the prompt template is simple - but we'll create our own to be a bit more specific!

In [21]:
from langchain.prompts import ChatPromptTemplate

template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

Context:
{context}

Question:
{question}
"""

prompt = ChatPromptTemplate.from_template(template)
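
At its core, the template is a string with two placeholders. `ChatPromptTemplate` wraps the result in chat message objects, but the substitution step can be previewed with plain `str.format`, as in this sketch (the context and question values are sample text for illustration only).

```python
template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

Context:
{context}

Question:
{question}
"""

# Fill both placeholders the way the chain will at run time
filled = template.format(
    context="Either party may terminate this Agreement upon fourteen days notice.",
    question="What is the termination notice?",
)
print(filled)
```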

#### Setting Up our Basic QA Chain

Now we can instantiate our basic RAG chain!

We'll use LCEL directly just to see an example of it - but you could just as easily use an abstraction here to achieve the same goal!

We'll also make sure to pass our context through to the output - which is critical for Ragas.

In [22]:
from operator import itemgetter

from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

primary_qa_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

retrieval_augmented_qa_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the base_retriever
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": prompt | primary_qa_llm, "context": itemgetter("context")}
)
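
If the LCEL pipe syntax is new to you, it can help to see the same data flow written as ordinary Python. The sketch below mirrors the three stages of the chain with stand-ins: `fake_retriever`, `fake_llm`, and `format_prompt` are hypothetical stubs, not LangChain objects.

```python
def fake_retriever(question: str) -> list[str]:
    """Stand-in for the vector-store retriever."""
    return [f"chunk relevant to: {question}"]

def fake_llm(prompt_text: str) -> str:
    """Stand-in for the chat model."""
    return f"answer based on {len(prompt_text)} prompt characters"

def format_prompt(inputs: dict) -> str:
    """Stand-in for the prompt template."""
    context = "\n".join(inputs["context"])
    return f"Context:\n{context}\n\nQuestion:\n{inputs['question']}"

def run_chain(inputs: dict) -> dict:
    # Stage 1: look up context for the question and keep the question itself
    step1 = {"context": fake_retriever(inputs["question"]), "question": inputs["question"]}
    # Stage 2: assign "context" from the previous step (keeps every existing key)
    step2 = {**step1, "context": step1["context"]}
    # Stage 3: answer from the formatted prompt and carry the context along
    return {"response": fake_llm(format_prompt(step2)), "context": step2["context"]}

out = run_chain({"question": "What is the termination notice?"})
print(out)
```

The important detail for evaluation is the last line of `run_chain`: the output holds both the response and the context that produced it.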

Let's test it out!

In [24]:
question = "Who are the parties to the Agreement and what are their defined names?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)

The parties to the Agreement are Cloud Investments Ltd. and Advisor. Their defined names are Company and Jack Robinson, respectively.


In [None]:
# Q1: Who are the parties to the Agreement and what are their defined names?
# A1:  Cloud Investments Ltd. (“Company”) and Jack Robinson (“Advisor”)
# Q2:   What is the termination notice?
# A2: According to section 4:14 days for convenience by both parties. The Company may terminate without notice if the Advisor refuses or cannot perform the Services or is in breach of any provision of this Agreement.
# Q3: What are the payments to the Advisor under the Agreement?
# A3: According to section 6: 1. Fees of $9 per hour up to a monthly limit of $1,500, 2. Workspace expense of $100 per month, 3. Other reasonable and actual expenses if approved by the company in writing and in advance.
# Q4:  Can the Agreement or any of its obligations be assigned?
# A4: 1. Under section 1.1 the Advisor can’t assign any of his obligations without the prior written consent of the Company, 2. Under section 9  the Advisor may not assign the Agreement and the Company may assign it, 3 Under section 9 of the Undertaking the Company may assign the Undertaking.
# Q5: Who owns the IP?
# A5: According to section 4 of the Undertaking (Appendix A), Any Work Product, upon creation, shall be fully and exclusively owned by the Company.
# Q6: Is there a non-compete obligation to the Advisor?
# A6: Yes. During the term of engagement with the Company and for a period of 12 months thereafter.
# Q7: Can the Advisor charge for meal time?
# A7: No. See Section 6.1, Billable Hour doesn’t include meals or travel time.
# Q8: In which street does the Advisor live?
# A8: 1 Rabin st, Tel Aviv, Israel
# Q9: Is the Advisor entitled to social benefits?
# #

In [25]:
question = "What is the termination notice?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)
print(result["context"])

The termination notice is fourteen (14) days' prior written notice.
[Document(page_content='Term: The term of this Agreement shall commence on the Effective Date and shall continue until terminated in accordance with the provisions herein (the "Term").  \n\n\n\n\t\tTermination: Either party, at any given time, may terminate this Agreement, for any reason whatsoever, with or without cause, upon fourteen (14) days’ prior written notice. Notwithstanding the above, the Company may terminate this Agreement immediately and without prior notice if Advisor refuses or is unable to perform the Services, or is in breach of any provision of this Agreement. \n\n\n\n\t\tCompensation:', metadata={'source': './Evaluation Sets/Robinson Advisory.docx'}), Document(page_content='Term: The term of this Agreement shall commence on the Effective Date and shall continue until terminated in accordance with the provisions herein (the "Term").  \n\n\n\n\t\tTermination: Either party, at any given time, may termin

In [28]:
question = " Can the Agreement or any of its obligations be assigned? "

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)
print(result["context"])

No, the Agreement or any of its obligations cannot be assigned.
[Document(page_content='Entire Agreement; No Waiver or Assignment: This Agreement together with the Exhibits, which are attached hereto and incorporated herein, set forth the entire Agreement between the parties and shall supersede all previous communications and agreements between the parties, either oral or written. This Agreement may be modified only by a written amendment executed by both parties. This Agreement may not be assigned, sold, delegated or transferred in any manner by Advisor for any reason whatsoever. The Company may assign the Agreement to a successor of all or substantially all of its assets or business, provided the assignee has assumed the Company’s obligations under this Agreement.', metadata={'source': './Evaluation Sets/Robinson Advisory.docx'}), Document(page_content='Entire Agreement; No Waiver or Assignment: This Agreement together with the Exhibits, which are attached hereto and incorporated her

We can already see that there are some improvements we could make here.
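
One example: the retrieved context contains the same chunk more than once. A possible cause (an assumption, not something verified here) is that the files were loaded into `documents` more than once before indexing. A small, hypothetical `dedupe_chunks` helper like the one sketched below could drop repeated chunks while preserving their order.

```python
def dedupe_chunks(chunks: list[str]) -> list[str]:
    """Remove repeated chunks, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for chunk in chunks:
        key = " ".join(chunk.split())  # ignore differences in whitespace only
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique

retrieved = [
    "to the Company and become its full and exclusive property.",
    "to the Company and become its full and exclusive property.",
    "IP: Any Work Product, upon creation, shall be fully and exclusively owned by the Company.",
]
unique_chunks = dedupe_chunks(retrieved)
print(len(retrieved), "->", len(unique_chunks))
```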

For now, let's switch gears to RAGAS to see how we can leverage that tool to provide us insight into how our pipeline is performing!

## Ragas Evaluation

Ragas is a powerful library that lets us evaluate our RAG pipeline by collecting input/output/context triplets and obtaining metrics relating to a number of different aspects of our RAG pipeline.

We'll be evaluating on every core metric today, but in order to do that - we'll need to create a test set. Luckily for us, Ragas can do that directly!

#### Synthetic Test Set Generation

We can leverage Ragas' [`Synthetic Test Data generation`](https://docs.ragas.io/en/stable/concepts/testset_generation.html) functionality to generate our own synthetic question/context (QC) pairs - as well as a synthetic ground truth - quite easily!

> NOTE: This process will use `gpt-3.5-turbo-16k` as the base generator and `gpt-4` as the critic - if you're attempting to create a lot of samples please be aware of cost, as well as rate limits.

Let's create a new set of documents to ensure we're not accidentally creating a sample test set that favours our base model too much!

In [29]:
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 200
)
documents = text_splitter.split_documents(documents)

In [30]:
len(documents)

20

In [90]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

generator = TestsetGenerator.with_openai()

testset = generator.generate_with_langchain_docs(documents, test_size=10, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25})

embedding nodes:   0%|          | 0/40 [00:00<?, ?it/s]

Generating:   0%|          | 0/10 [00:00<?, ?it/s]

Let's look at the output and see what we can learn about it!

In [91]:
testset.test_data[0]

DataRow(question='What are the obligations of the Advisor regarding the use and disclosure of Confidential Information while providing the Services?', contexts=['Use: The Advisor may use the Confidential Information only for the purpose of providing the Services and shall not obtain any rights in it. The Advisor shall stop using Confidential Information and/or return it to the Company and/or destroy it immediately upon Company’s request. The Advisor may disclose Confidential Information in case this is required by law, but only to the extent required and after providing the Company a prompt written notice and subject to promptly cooperate with the Company in seeking a protective order. \n\n\n\nSafeguard: The Advisor shall safeguard the Confidential Information, keep it in strict confidence and shall not disclose it to any third party without the prior written consent of the Company.'], ground_truth='The obligations of the Advisor regarding the use and disclosure of Confidential Informa

#### Generating Responses with RAG Pipeline

Now that we have some QC pairs, and some ground truths, let's evaluate our RAG pipeline using Ragas.

The process is, again, quite straightforward - thanks to Ragas and LangChain!

Let's start by extracting our questions and ground truths from the test set we just created.

We can start by converting our test dataset into a Pandas DataFrame.

In [92]:
test_df = testset.to_pandas()

In [93]:
test_df

| | question | contexts | ground_truth | evolution_type | episode_done |
|---|---|---|---|---|---|
| 0 | What are the obligations of the Advisor regard... | [Use: The Advisor may use the Confidential Inf... | The obligations of the Advisor regarding the u... | simple | True |
| 1 | What is the scope of the Company's Business an... | [Definitions: (a) Company’s Business: developm... | The scope of the Company's Business is the dev... | simple | True |
| 2 | What are the obligations of the Advisor regard... | [Use: The Advisor may use the Confidential Inf... | The obligations of the Advisor regarding the d... | simple | True |
| 3 | What are the expectations for the Advisor's pe... | [Without derogating from the foregoing, the Ad... | The expectations for the Advisor's performance... | simple | True |
| 4 | What are the obligations of the Advisor regard... | [Use: The Advisor may use the Confidential Inf... | The obligations of the Advisor regarding the h... | simple | True |
| 5 | According to the context, what is the duration... | [Without derogating from the foregoing, the Ad... | The duration of the period in which the Adviso... | reasoning | True |
| 6 | According to the context, what is the maximum ... | [Without derogating from the foregoing, the Ad... | The maximum number of days in a 12-month perio... | reasoning | True |
| 7 | What are the Advisor's responsibilities for Co... | [Use: The Advisor may use the Confidential Inf... | The Advisor's responsibilities for Confidentia... | multi_context | True |
| 8 | What are the Advisor's responsibilities and du... | [Without derogating from the foregoing, the Ad... | The Advisor's responsibilities and duties towa... | multi_context | True |
| 9 | Under what circumstances can the Advisor discl... | [Use: The Advisor may use the Confidential Inf... | The Advisor can disclose Confidential Informat... | simple | True |


In [94]:
test_questions = test_df["question"].values.tolist()
test_groundtruths = test_df["ground_truth"].values.tolist()

In [95]:
test_questions

['What are the obligations of the Advisor regarding the use and disclosure of Confidential Information while providing the Services?',
 "What is the scope of the Company's Business and what types of information are considered Confidential Information?",
 'What are the obligations of the Advisor regarding the disclosure and return of Confidential Information to the Company?',
 "What are the expectations for the Advisor's performance and dedication to the Company during the 12-month period?",
 'What are the obligations of the Advisor regarding the handling and disclosure of Confidential Information?',
 'According to the context, what is the duration of the period in which the Advisor is not obligated to provide services to the Company?',
 'According to the context, what is the maximum number of days in a 12-month period that the Advisor is not required to provide the Services?',
 "What are the Advisor's responsibilities for Confidential Information and Work Product?",
 "What are the Advi

Now we'll run the generated questions through our RAG pipeline to produce responses - we'll also need to collect the retrieved contexts for each question.

We'll do this in a simple loop to see exactly what's happening!

In [96]:
answers = []
contexts = []

for question in test_questions:
  response = retrieval_augmented_qa_chain.invoke({"question" : question})
  answers.append(response["response"].content)
  contexts.append([context.page_content for context in response["context"]])

In [97]:
type(response)

dict

Now we can wrap our information in a Hugging Face dataset for use in the Ragas library.

In [98]:
from datasets import Dataset

response_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})
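
Before handing the data over, it can be worth checking that the four columns line up: one entry per question in each, and `contexts` as a list of lists of strings. The check below is a hypothetical helper (`check_ragas_columns` is not part of Ragas or `datasets`) applied to a one-row example with sample values.

```python
def check_ragas_columns(data: dict) -> int:
    """Return the number of rows if the columns line up, otherwise raise ValueError."""
    required = ["question", "answer", "contexts", "ground_truth"]
    missing = [name for name in required if name not in data]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    lengths = {name: len(data[name]) for name in required}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"column lengths differ: {lengths}")
    for row in data["contexts"]:
        # every question needs its own list of retrieved chunk texts
        if not isinstance(row, list) or not all(isinstance(c, str) for c in row):
            raise ValueError("each 'contexts' entry must be a list of strings")
    return lengths["question"]

example = {
    "question": ["What is the termination notice?"],
    "answer": ["Fourteen days' prior written notice."],
    "contexts": [["Either party may terminate upon fourteen days' prior written notice."]],
    "ground_truth": ["14 days for convenience by both parties."],
}
row_count = check_ragas_columns(example)
print(row_count)
```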

Let's take a peek and see what that looks like!

In [99]:
response_dataset[0]

{'question': 'What are the obligations of the Advisor regarding the use and disclosure of Confidential Information while providing the Services?',
 'answer': "The obligations of the Advisor regarding the use and disclosure of Confidential Information while providing the Services are to use the information only for the purpose of providing the Services, not obtain any rights in it, stop using or return the information upon the Company's request, disclose the information only if required by law and after providing prompt notice to the Company, and safeguard the information in strict confidence without disclosing it to any third party without the Company's prior written consent.",
 'contexts': ['Use: The Advisor may use the Confidential Information only for the purpose of providing the Services and shall not obtain any rights in it. The Advisor shall stop using Confidential Information and/or return it to the Company and/or destroy it immediately upon Company’s request. The Advisor may di

#### Evaluating with Ragas

Now that we have our response dataset - we can finally get into the "meat" of Ragas - evaluation!

First, we'll import the desired metrics, then we can use them to evaluate our created dataset!

Check out the specific metrics we'll be using in the Ragas documentation:

- [Faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html)
- [Answer Relevancy](https://docs.ragas.io/en/stable/concepts/metrics/answer_relevance.html)
- [Context Precision](https://docs.ragas.io/en/stable/concepts/metrics/context_precision.html)
- [Context Recall](https://docs.ragas.io/en/stable/concepts/metrics/context_recall.html)
- [Answer Correctness](https://docs.ragas.io/en/stable/concepts/metrics/answer_correctness.html)

See the accompanying presentation for more in-depth explanations about each of the metrics!
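
Several of these metrics end in the same arithmetic: a fraction of items judged "supported". As I understand the Ragas docs, faithfulness is the share of statements in the answer that the retrieved context supports, and context recall is the share of ground-truth sentences attributable to the retrieved context. Ragas obtains the individual judgements from an LLM; the sketch below covers only the final ratio, with hand-written verdicts and a hypothetical `supported_fraction` helper. Returning NaN for an empty list is a choice made for this sketch.

```python
import math

def supported_fraction(verdicts: list[bool]) -> float:
    """Share of verdicts that are True; NaN when there is nothing to judge."""
    if not verdicts:
        return float("nan")
    return sum(verdicts) / len(verdicts)

# Hypothetical verdicts: three statements in an answer, all backed by the context
faithfulness_example = supported_fraction([True, True, True])

# Hypothetical verdicts: two ground-truth sentences, only one found in the context
context_recall_example = supported_fraction([True, False])

print(faithfulness_example, context_recall_example)
```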

In [100]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_correctness,
    context_recall,
    context_precision,
)

metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
]

All that's left to do is call "evaluate" and away we go!

In [101]:
!pip install --upgrade

ERROR: You must give at least one requirement to install (see "pip help install")

In [137]:
results = evaluate(response_dataset, metrics)

Evaluating:   0%|          | 0/50 [00:00<?, ?it/s]

In [87]:
!pip install ragas

In [None]:
metrics = [
    Faithfulness(llm="Non-Causal"),
    Faithfulness(llm="Causal"),
    Faithfulness(llm="GroundTruth"),
    BLEU(),
    ROUGE(),
]

In [103]:
results

{'faithfulness': 1.0000, 'answer_relevancy': 0.8614, 'context_recall': 0.9500, 'context_precision': 1.0000, 'answer_correctness': 0.7422}

In [104]:
results_df = results.to_pandas()
results_df

| | question | answer | contexts | ground_truth | faithfulness | answer_relevancy | context_recall | context_precision | answer_correctness |
|---|---|---|---|---|---|---|---|---|---|
| 0 | What are the obligations of the Advisor regard... | The obligations of the Advisor regarding the u... | [Use: The Advisor may use the Confidential Inf... | The obligations of the Advisor regarding the u... | 1.0 | 0.985787 | 1.0 | 1.0 | 0.825408 |
| 1 | What is the scope of the Company's Business an... | The scope of the Company's Business is the dev... | [Definitions: (a) Company’s Business: developm... | The scope of the Company's Business is the dev... | 1.0 | 0.916908 | 1.0 | 1.0 | 0.848313 |
| 2 | What are the obligations of the Advisor regard... | The obligations of the Advisor regarding the d... | [Use: The Advisor may use the Confidential Inf... | The obligations of the Advisor regarding the d... | 1.0 | 0.993479 | 1.0 | 1.0 | 0.677106 |
| 3 | What are the expectations for the Advisor's pe... | The Advisor is expected to devote his time, kn... | [Without derogating from the foregoing, the Ad... | The expectations for the Advisor's performance... | 1.0 | 0.901011 | 1.0 | 1.0 | 0.741594 |
| 4 | What are the obligations of the Advisor regard... | The obligations of the Advisor regarding the h... | [Safeguard: The Advisor shall safeguard the Co... | The obligations of the Advisor regarding the h... | 1.0 | 1.0 | 1.0 | 1.0 | 0.871378 |
| 5 | According to the context, what is the duration... | According to the context, the duration of the ... | [Without derogating from the foregoing, the Ad... | The duration of the period in which the Adviso... | 1.0 | 0.974496 | 1.0 | 1.0 | 0.9981 |
| 6 | According to the context, what is the maximum ... | 18 days | [Without derogating from the foregoing, the Ad... | The maximum number of days in a 12-month perio... | 1.0 | 0.925031 | 1.0 | 1.0 | 0.949182 |
| 7 | What are the Advisor's responsibilities for Co... | The Advisor's responsibilities for Confidentia... | [Use: The Advisor may use the Confidential Inf... | The Advisor's responsibilities for Confidentia... | 1.0 | 0.962193 | 0.5 | 1.0 | 0.66721 |
| 8 | What are the Advisor's responsibilities and du... | The Advisor's responsibilities and duties towa... | [Without derogating from the foregoing, the Ad... | The Advisor's responsibilities and duties towa... | 1.0 | 0.954723 | 1.0 | 1.0 | 0.664138 |
| 9 | Under what circumstances can the Advisor discl... | I don't know. | [Use: The Advisor may use the Confidential Inf... | The Advisor can disclose Confidential Informat... | NaN | 0.0 | 1.0 | 1.0 | 0.179282 |
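
The headline numbers appear to be simple averages of the per-question scores, with missing values skipped. That is an inference from the figures shown here, not a statement about Ragas internals. The sketch below recomputes three of them from the rows above in plain Python; `nan_mean` is a hand-written helper.

```python
import math
from statistics import fmean

def nan_mean(values: list[float]) -> float:
    """Average the values, ignoring NaN entries."""
    kept = [v for v in values if not math.isnan(v)]
    return fmean(kept) if kept else float("nan")

# Per-question scores copied from the results table
faithfulness_rows = [1.0] * 9 + [float("nan")]
answer_relevancy_rows = [0.985787, 0.916908, 0.993479, 0.901011, 1.0,
                         0.974496, 0.925031, 0.962193, 0.954723, 0.0]
answer_correctness_rows = [0.825408, 0.848313, 0.677106, 0.741594, 0.871378,
                           0.9981, 0.949182, 0.66721, 0.664138, 0.179282]

summary = {
    "faithfulness": round(nan_mean(faithfulness_rows), 4),
    "answer_relevancy": round(nan_mean(answer_relevancy_rows), 4),
    "answer_correctness": round(nan_mean(answer_correctness_rows), 4),
}
print(summary)
```

Notice how much the single "I don't know." answer costs: it contributes 0.0 to answer relevancy and about 0.18 to answer correctness, so one refusal in ten questions moves the averages noticeably.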


## Testing a More Performant Retriever

Now that we have established a baseline - we can see how any changes impact our pipeline's performance!

Let's modify our retriever and see how that impacts our Ragas metrics!

In [105]:
from langchain.retrievers import MultiQueryRetriever

advanced_retriever = MultiQueryRetriever.from_llm(retriever=retriever, llm=primary_qa_llm)
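
The idea behind multi-query retrieval is to rephrase the question several ways, retrieve for each phrasing, and keep the unique union of the results. The real `MultiQueryRetriever` asks the LLM for the rephrasings; in the sketch below the variants and the lookup table are hard-coded, and `fake_variants`, `fake_retrieve`, and `FAKE_INDEX` are all hypothetical stand-ins.

```python
def multi_query_retrieve(question, generate_variants, retrieve):
    """Retrieve for each query variant and return the unique union, in first-seen order."""
    merged, seen = [], set()
    for query in generate_variants(question):
        for chunk in retrieve(query):
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged

def fake_variants(question):
    # Hypothetical rewrites; the real retriever generates these with an LLM
    return [question,
            "notice period for ending the agreement",
            "how can the contract be terminated"]

# Hypothetical query -> chunks lookup standing in for the vector store
FAKE_INDEX = {
    "What is the termination notice?": ["termination clause"],
    "notice period for ending the agreement": ["termination clause", "term clause"],
    "how can the contract be terminated": ["term clause", "breach clause"],
}

def fake_retrieve(query):
    return FAKE_INDEX.get(query, [])

merged_chunks = multi_query_retrieve("What is the termination notice?", fake_variants, fake_retrieve)
print(merged_chunks)
```

The trade-off is more retrieval calls (and one extra LLM call) in exchange for a better chance that at least one phrasing lands near the relevant chunk.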

We'll also re-create our RAG pipeline using the abstractions that come packaged with LangChain v0.1.0!

First, let's create a chain to "stuff" our documents into our context!

In [106]:
from langchain.chains.combine_documents import create_stuff_documents_chain

document_chain = create_stuff_documents_chain(primary_qa_llm, retrieval_qa_prompt)

Next, we'll create the retrieval chain!

In [107]:
from langchain.chains import create_retrieval_chain

retrieval_chain = create_retrieval_chain(advanced_retriever, document_chain)

In [108]:
response = retrieval_chain.invoke({"input": "What are the major changes in v0.1.0?"})

In [109]:
print(response["answer"])

The major changes in version 0.1.0 include the requirement for the Advisor to provide the Company with a written report detailing the number of hours spent providing services on a daily basis, as well as an aggregated monthly report at the end of each calendar month.


In [110]:
response = retrieval_chain.invoke({"input": "What is LangGraph?"})

In [111]:
print(response["answer"])

I'm sorry, but there is no information provided in the context about LangGraph.


Well, just from those responses this chain *feels* better - but let's see how it performs on our eval!

Let's do the same process we did before to collect our pipeline's contexts and answers.

In [112]:
answers = []
contexts = []

for question in test_questions:
  response = retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])

Now we can convert this into a dataset, just like we did before.

In [None]:
response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

Let's evaluate on the same metrics we did for the first pipeline and see how it does!

In [None]:
!pip3 install  --upgrade ragas

In [141]:
advanced_retrieval_results = evaluate(response_dataset_advanced_retrieval, metrics)

Evaluating:   0%|          | 0/50 [00:00<?, ?it/s]

Exception in thread Thread-44:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.10/dist-packages/ragas/executor.py", line 75, in run
    results = self.loop.run_until_complete(self._aresults())
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.10/dist-packages/ragas/executor.py", line 63, in _aresults
    raise e
  File "/usr/local/lib/python3.10/dist-packages/ragas/executor.py", line 58, in _aresults
    r = await future
  File "/usr/lib/python3.10/asyncio/tasks.py", line 571, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/usr/local/lib/python3.10/dist-packages/ragas/executor.py", line 91, in wrapped_callable_async
    return counter, await callable(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ragas/metrics/base.py", line 91, in ascore
    r

ExceptionInRunner: The runner thread which was running the jobs raised an exeception. Read the traceback above to debug it. You can also pass `raise_exceptions=False` incase you want to show only a warning message instead.

### Comparing Results

Now we can compare our results and see what directional changes occurred!

Let's refresh with our initial metrics.

In [136]:
advanced_retrieval_results

NameError: name 'advanced_retrieval_results' is not defined

In [126]:
results

{'faithfulness': 1.0000, 'answer_relevancy': 0.8614, 'context_recall': 0.9500, 'context_precision': 1.0000, 'answer_correctness': 0.7422}

And see how our advanced retrieval modified our chain!

In [127]:
advanced_retrieval_results

NameError: name 'advanced_retrieval_results' is not defined

In [None]:
import pandas as pd

df_original = pd.DataFrame(list(results.items()), columns=['Metric', 'Baseline'])
df_comparison = pd.DataFrame(list(advanced_retrieval_results.items()), columns=['Metric', 'MultiQueryRetriever with Document Stuffing'])

df_merged = pd.merge(df_original, df_comparison, on='Metric')

df_merged['Delta'] = df_merged['MultiQueryRetriever with Document Stuffing'] - df_merged['Baseline']

df_merged

We can see that our faithfulness has improved - as well as our answer relevancy - but we lost a significant amount of answer correctness.

We'd need to do some more experimentation to determine how to improve our pipeline!