# Evaluation with RAGAS and Advanced Retrieval Methods Using LangChain

In the following notebook we'll discuss a major component of LLM Ops:

- Evaluation

We're going to be leveraging the [RAGAS]() framework for our evaluations today as it's becoming a standard method of evaluating (at least directionally) RAG systems.

We're also going to discuss a few more powerful Retrieval Systems that can potentially improve the quality of our generations!

Let's start as we always do: Grabbing our dependencies!

In [2]:
%pip install -U -q langchain langchain-ollama ragas arxiv pymupdf chromadb wandb tiktoken

Note: you may need to restart the kernel to use updated packages.


In [3]:
import os
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.schema.runnable import RunnableParallel, RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser
from operator import itemgetter
# import openai
from getpass import getpass

# openai.api_key = getpass("Please provide your OpenAI Key: ")
# os.environ["OPENAI_API_KEY"] = openai.api_key

### Data Collection

We're going to be using papers from Arxiv as our context today.

We can collect these documents rather straightforwardly with the `ArxivLoader` document loader from LangChain.

Let's grab and load 5 documents.

- [`ArxivLoader`](https://api.python.langchain.com/en/latest/document_loaders/langchain.document_loaders.arxiv.ArxivLoader.html)

In [4]:
from langchain.document_loaders import ArxivLoader

base_docs = ArxivLoader(query="Retrieval Augmented Generation", load_max_docs=5).load()
len(base_docs)

3

In [5]:
for doc in base_docs:
  print(doc.metadata)

{'Published': '2024-10-30', 'Title': 'R^2AG: Incorporating Retrieval Information into Retrieval Augmented Generation', 'Authors': 'Fuda Ye, Shuangyin Li, Yongqi Zhang, Lei Chen', 'Summary': "Retrieval augmented generation (RAG) has been applied in many scenarios to\naugment large language models (LLMs) with external documents provided by\nretrievers. However, a semantic gap exists between LLMs and retrievers due to\ndifferences in their training objectives and architectures. This misalignment\nforces LLMs to passively accept the documents provided by the retrievers,\nleading to incomprehension in the generation process, where the LLMs are\nburdened with the task of distinguishing these documents using their inherent\nknowledge. This paper proposes R$^2$AG, a novel enhanced RAG framework to fill\nthis gap by incorporating Retrieval information into Retrieval Augmented\nGeneration. Specifically, R$^2$AG utilizes the nuanced features from the\nretrievers and employs a R$^2$-Former to capt

### Creating an Index

Let's use a naive index creation strategy of just using `RecursiveCharacterTextSplitter` on our documents and embedding each into our `VectorStore` using `OpenAIEmbeddings()`.

- [`RecursiveCharacterTextSplitter()`](https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.RecursiveCharacterTextSplitter.html)
- [`Chroma`](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.chroma.Chroma.html?highlight=chroma#langchain.vectorstores.chroma.Chroma)
- [`OpenAIEmbeddings()`](https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.openai.OpenAIEmbeddings.html?highlight=openaiembeddings#langchain-embeddings-openai-openaiembeddings)

In [6]:
# from langchain.vectorstores import Chroma
# from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=250)

docs = text_splitter.split_documents(base_docs)

# vectorstore = Chroma.from_documents(docs, OllamaEmbeddings())

# Initialize Ollama embeddings
ollama_embeddings = OllamaEmbeddings(model="llama3.2-vision", temperature=0.9)

# Create the vector store using Ollama embeddings
vectorstore = Chroma.from_documents(docs, ollama_embeddings)

# Initialize Ollama LLM
ollama_llm = OllamaLLM(model="llama3.2-vision", temperature=0.9)

# Create the base retriever
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

# Define the prompt template
template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

### CONTEXT
{context}

### QUESTION
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

# Update the retrieval_augmented_qa_chain to use Ollama LLM
retrieval_augmented_qa_chain = (
    RunnableParallel({
        'context': itemgetter('question') | base_retriever,
        'question': RunnablePassthrough()
    }) | {
        'response': prompt | ollama_llm | StrOutputParser(),
        'context': itemgetter('context')
    }
)

  ollama_embeddings = OllamaEmbeddings(model="llama3.2-vision", temperature=0.9)


ValueError: Error raised by inference API HTTP code: 500, {"error":{}}

In [None]:
len(docs)

In [8]:
print(max([len(chunk.page_content) for chunk in docs]))

249


Let's convert our `Chroma` vectorstore into a retriever with the `.as_retriever()` method.

In [9]:
base_retriever = vectorstore.as_retriever(search_kwargs={"k" : 2})

Now to give it a test!

In [10]:
relevant_docs = base_retriever.get_relevant_documents("What is Retrieval Augmented Generation?")

  relevant_docs = base_retriever.get_relevant_documents("What is Retrieval Augmented Generation?")


In [11]:
len(relevant_docs)

2

## Creating a Retrieval Augmented Generation Prompt

Now we can set up a prompt template that will be used to provide the LLM with the necessary contexts, user query, and instructions!

In [12]:
from langchain.prompts import ChatPromptTemplate

template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

### CONTEXT
{context}

### QUESTION
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

### Setting Up our Basic QA Chain

Now we can instantiate our basic RAG chain!

We'll follow *exactly* the chain we made on Tuesday to keep things simple for now - if you need a refresher on what it looked like - check out last week's notebook!

In [13]:
from operator import itemgetter

# from langchain.chat_models import ChatOpenAI
from langchain_ollama import ChatOllama
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnableLambda, RunnablePassthrough

primary_qa_llm = ChatOllama(model="llama3.2-vision", temperature=0)

retrieval_augmented_qa_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the base_retriever
    {"context": itemgetter("question") | base_retriever, "question": itemgetter("question")}
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": prompt | primary_qa_llm, "context": itemgetter("context")}
)

Let's test it out!

In [14]:
question = "What is RAG?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result)

{'response': AIMessage(content='Retrieval Augmented Generation.', additional_kwargs={}, response_metadata={'model': 'llama3.2-vision', 'created_at': '2024-12-23T13:40:27.060249Z', 'done': True, 'done_reason': 'stop', 'total_duration': 5993951750, 'load_duration': 32725334, 'prompt_eval_count': 858, 'prompt_eval_duration': 5601000000, 'eval_count': 8, 'eval_duration': 357000000, 'message': Message(role='assistant', content='', images=None, tool_calls=None)}, id='run-ed4cc78f-b3f8-4a1a-b811-cf1ef3bef1e6-0', usage_metadata={'input_tokens': 858, 'output_tokens': 8, 'total_tokens': 866}), 'context': [Document(metadata={'Authors': 'Fuda Ye, Shuangyin Li, Yongqi Zhang, Lei Chen', 'Published': '2024-10-30', 'Summary': "Retrieval augmented generation (RAG) has been applied in many scenarios to\naugment large language models (LLMs) with external documents provided by\nretrievers. However, a semantic gap exists between LLMs and retrievers due to\ndifferences in their training objectives and archi

### Ground Truth Dataset Creation Using GPT-3.5-turbo(mistral-nemo) and GPT-4 (llama3.2)

The next section might take you a long time to run, so the evaluation dataset is provided.

The basic idea is that we can use LangChain to create questions based on our contexts, and then answer those questions.

Let's look at how that works in the code!

In [15]:
from langchain.output_parsers import ResponseSchema
from langchain.output_parsers import StructuredOutputParser

question_schema = ResponseSchema(
    name="question",
    description="a question about the context."
)

question_response_schemas = [
    question_schema,
]

In [16]:
question_output_parser = StructuredOutputParser.from_response_schemas(question_response_schemas)
format_instructions = question_output_parser.get_format_instructions()

In [17]:
question_generation_llm = ChatOllama(model="mistral-nemo")

bare_prompt_template = "{content}"
bare_template = ChatPromptTemplate.from_template(template=bare_prompt_template)

In [18]:
from langchain.prompts import ChatPromptTemplate

qa_template = """\
You are a University Professor creating a test for advanced students. For each context, create a question that is specific to the context. Avoid creating generic or general questions.

question: a question about the context.

Format the output as JSON with the following keys:
question

context: {context}
"""

prompt_template = ChatPromptTemplate.from_template(template=qa_template)

messages = prompt_template.format_messages(
    context=docs[0],
    format_instructions=format_instructions
)

question_generation_chain = bare_template | question_generation_llm

response = question_generation_chain.invoke({"content" : messages})
output_dict = question_output_parser.parse(response.content)

In [19]:
for k, v in output_dict.items():
  print(k)
  print(v)

question
How does the proposed 'R^2AG' framework address the semantic gap between Large Language Models (LLMs) and retrievers?
context
{'page_content': 'R2AG: Incorporating Retrieval Information into Retrieval Augmented Generation\nFuda Ye1, Shuangyin Li1,*, Yongqi Zhang2, Lei Chen2,3\n1School of Computer Science, South China Normal University', 'metadata': {'Published': '2024-10-30', 'Title': 'R^2AG: Incorporating Retrieval Information into Retrieval Augmented Generation', 'Authors': 'Fuda Ye, Shuangyin Li, Yongqi Zhang, Lei Chen', 'Summary': "Retrieval augmented generation (RAG) has been applied in many scenarios to augment large language models (LLMs) with external documents provided by retrievers. However, a semantic gap exists between LLMs and retrievers due to differences in their training objectives and architectures. This misalignment forces LLMs to passively accept the documents provided by the retrievers, leading to incomprehension in the generation process, where the LLMs ar

In [20]:
%pip install -q -U tqdm

Note: you may need to restart the kernel to use updated packages.


In [21]:
from tqdm import tqdm

qac_triples = []

for text in tqdm(docs[:10]):
  messages = prompt_template.format_messages(
      context=text,
      format_instructions=format_instructions
  )
  response = question_generation_chain.invoke({"content" : messages})
  try:
    output_dict = question_output_parser.parse(response.content)
  except Exception as e:
    continue
  output_dict["context"] = text
  qac_triples.append(output_dict)

100%|███████████████████████████████████████████████████████████████████████████████████| 10/10 [05:12<00:00, 31.25s/it]


In [22]:
qac_triples[5]

{'question': 'What is the primary goal of the proposed method R^2AG?',
 'context': Document(metadata={'Published': '2024-10-30', 'Title': 'R^2AG: Incorporating Retrieval Information into Retrieval Augmented Generation', 'Authors': 'Fuda Ye, Shuangyin Li, Yongqi Zhang, Lei Chen', 'Summary': "Retrieval augmented generation (RAG) has been applied in many scenarios to\naugment large language models (LLMs) with external documents provided by\nretrievers. However, a semantic gap exists between LLMs and retrievers due to\ndifferences in their training objectives and architectures. This misalignment\nforces LLMs to passively accept the documents provided by the retrievers,\nleading to incomprehension in the generation process, where the LLMs are\nburdened with the task of distinguishing these documents using their inherent\nknowledge. This paper proposes R$^2$AG, a novel enhanced RAG framework to fill\nthis gap by incorporating Retrieval information into Retrieval Augmented\nGeneration. Specif

In [23]:
answer_generation_llm = ChatOllama(model="llama3.2-vision", temperature=0)

answer_schema = ResponseSchema(
    name="answer",
    description="an answer to the question"
)

answer_response_schemas = [
    answer_schema,
]

answer_output_parser = StructuredOutputParser.from_response_schemas(answer_response_schemas)
format_instructions = answer_output_parser.get_format_instructions()

qa_template = """\
You are a University Professor creating a test for advanced students. For each question and context, create an answer.

answer: a answer about the context.

Format the output as JSON with the following keys:
answer

question: {question}
context: {context}
"""

prompt_template = ChatPromptTemplate.from_template(template=qa_template)

messages = prompt_template.format_messages(
    context=qac_triples[0]["context"],
    question=qac_triples[0]["question"],
    format_instructions=format_instructions
)

answer_generation_chain = bare_template | answer_generation_llm

response = answer_generation_chain.invoke({"content" : messages})
output_dict = answer_output_parser.parse(response.content)

In [24]:
for k, v in output_dict.items():
  print(k)
  print(v)

answer
The enhanced RAG framework proposed in the paper, R^2AG, addresses the 'semantic gap' between Large Language Models (LLMs) and retrievers by incorporating retrieval information into the generation process. Specifically, it utilizes nuanced features from the retrievers and employs a R^2-Former to capture retrieval information, which is then integrated into LLMs' generation using a retrieval-aware prompting strategy. This approach fills the semantic gap by providing an anchor for LLMs to aid in the generation process, making it more effective, robust, and efficient.


In [25]:
for triple in tqdm(qac_triples):
  messages = prompt_template.format_messages(
      context=triple["context"],
      question=triple["question"],
      format_instructions=format_instructions
  )
  response = answer_generation_chain.invoke({"content" : messages})
  try:
    output_dict = answer_output_parser.parse(response.content)
  except Exception as e:
    continue
  triple["answer"] = output_dict["answer"]

100%|█████████████████████████████████████████████████████████████████████████████████████| 9/9 [02:11<00:00, 14.60s/it]


In [26]:
%pip install -q -U datasets

Note: you may need to restart the kernel to use updated packages.


In [27]:
import pandas as pd
from datasets import Dataset

ground_truth_qac_set = pd.DataFrame(qac_triples)
ground_truth_qac_set["context"] = ground_truth_qac_set["context"].map(lambda x: str(x.page_content))
ground_truth_qac_set = ground_truth_qac_set.rename(columns={"answer" : "ground_truth"})


eval_dataset = Dataset.from_pandas(ground_truth_qac_set)

In [28]:
eval_dataset

Dataset({
    features: ['question', 'context', 'ground_truth'],
    num_rows: 9
})

In [29]:
eval_dataset[0]

{'question': "Compare and contrast Retrieval Augmented Generation (RAG) with the enhanced RAG framework proposed in the paper. How does the novel approach address the 'semantic gap' between Large Language Models (LLMs) and retrievers?",
 'context': 'R2AG: Incorporating Retrieval Information into Retrieval Augmented\nGeneration\nFuda Ye1, Shuangyin Li1,*, Yongqi Zhang2, Lei Chen2,3\n1School of Computer Science, South China Normal University',
 'ground_truth': "The enhanced RAG framework proposed in the paper, R^2AG, addresses the 'semantic gap' between Large Language Models (LLMs) and retrievers by incorporating retrieval information into the generation process. Specifically, it utilizes nuanced features from the retrievers and employs a R^2-Former to capture retrieval information, which is then integrated into LLMs' generation using a retrieval-aware prompting strategy. This approach fills the semantic gap by providing an anchor for LLMs to aid in the generation process, making it more

In [30]:
eval_dataset.to_csv("groundtruth_eval_dataset.csv")

Creating CSV from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

6110

### Evaluating RAG Pipelines

If you skipped ahead and need to load the `.csv` directly - uncomment the code below.

If you're using Colab to do this notebook - please ensure you add it to your session files.

In [None]:
# from datasets import Dataset
# eval_dataset = Dataset.from_csv("groundtruth_eval_dataset.csv")

In [31]:
eval_dataset

Dataset({
    features: ['question', 'context', 'ground_truth'],
    num_rows: 9
})

### Evaluation Using RAGAS

Now we can evaluate using RAGAS!

The set-up is fairly straightforward - we simply need to create a dataset with our generated answers and our contexts, and then evaluate using the framework.

In [35]:
%pip install -U ragas

Note: you may need to restart the kernel to use updated packages.


In [64]:
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    # context_precision,
    answer_correctness,
    answer_similarity
)

# from ragas.metrics.critique import harmfulness
from ragas import evaluate

def create_ragas_dataset(rag_pipeline, eval_dataset):
  rag_dataset = []
  for row in tqdm(eval_dataset):
    answer = rag_pipeline.invoke({"question" : row["question"]})
    rag_dataset.append(
        {"question" : row["question"],
         "answer" : answer["response"].content,
         "contexts" : [context.page_content for context in answer["context"]],
         "ground_truths" : [row["ground_truth"]]
         }
    )
  rag_df = pd.DataFrame(rag_dataset)
  rag_eval_dataset = Dataset.from_pandas(rag_df)
  return rag_eval_dataset

def evaluate_ragas_dataset(ragas_dataset):
  result = evaluate(
    ragas_dataset,
    metrics=[
        # context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
        answer_correctness,
        answer_similarity
    ],
  )
  return result

Lets create our dataset first:

In [65]:
from tqdm import tqdm
import pandas as pd

basic_qa_ragas_dataset = create_ragas_dataset(retrieval_augmented_qa_chain, eval_dataset)

100%|█████████████████████████████████████████████████████████████████████████████████████| 9/9 [01:17<00:00,  8.60s/it]


In [66]:
basic_qa_ragas_dataset

Dataset({
    features: ['question', 'answer', 'contexts', 'ground_truths'],
    num_rows: 9
})

In [67]:
basic_qa_ragas_dataset[0]

{'question': "Compare and contrast Retrieval Augmented Generation (RAG) with the enhanced RAG framework proposed in the paper. How does the novel approach address the 'semantic gap' between Large Language Models (LLMs) and retrievers?",
 'answer': "I don't know",
 'contexts': ['This "brute force" optimization approach, scoring 25.92%, is outperformed\nnot only by the proposed method of meta-prompting optimization, with 34.69%,\nbut even also by the baseline, plain RAG, with 26.12%.',
  'Put to empirical test with the demanding multi-hop question answer-\ning task from the StrategyQA dataset, the evaluation results indicate\nthat this method outperforms a similar retrieval-augmented system but\nwithout this method by over 30 %.'],
 'ground_truths': ["The enhanced RAG framework proposed in the paper, R^2AG, addresses the 'semantic gap' between Large Language Models (LLMs) and retrievers by incorporating retrieval information into the generation process. Specifically, it utilizes nuanced 

Save it for later:

In [68]:
basic_qa_ragas_dataset.to_csv("basic_qa_ragas_dataset.csv")

Creating CSV from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

11976

And finally - evaluate how it did!

In [69]:
basic_qa_result = evaluate_ragas_dataset(basic_qa_ragas_dataset)

ValueError: The metric [context_recall] that is used requires the following additional columns ['reference'] to be present in the dataset.

In [57]:
basic_qa_result

NameError: name 'basic_qa_result' is not defined

### Testing Other Retrievers

Now we can test our how changing our Retriever impacts our RAGAS evaluation!

We'll build this simple qa_chain factory to create standardized qa_chains where the only different component will be the retriever.

In [70]:
def create_qa_chain(retriever):
  primary_qa_llm = ChatOllama(model="mistral-nemo", temperature=0)
  created_qa_chain = (
    {"context": itemgetter("question") | retriever,
     "question": itemgetter("question")
    }
    | RunnablePassthrough.assign(
        context=itemgetter("context")
      )
    | {
         "response": prompt | primary_qa_llm,
         "context": itemgetter("context"),
      }
  )

  return created_qa_chain

#### Parent Document Retriever

One of the easier ways we can imagine improving a retriever is to embed our documents into small chunks, and then retrieve a significant amount of additional context that "surrounds" the found context.

You can read more about this method [here](https://python.langchain.com/docs/modules/data_connection/retrievers/parent_document_retriever)!

The basic outline of this retrieval method is as follows:

1. Obtain User Question
2. Retrieve child documents using Dense Vector Retrieval
3. Merge the child documents based on their parents. If they have the same parents - they become merged.
4. Replace the child documents with their respective parent documents from an in-memory-store.
5. Use the parent documents to augment generation.

In [71]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1500)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

vectorstore = Chroma(collection_name="split_parents", embedding_function=OllamaEmbeddings())

store = InMemoryStore()

  vectorstore = Chroma(collection_name="split_parents", embedding_function=OllamaEmbeddings())


In [72]:
parent_document_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

In [73]:
parent_document_retriever.add_documents(base_docs)

ValueError: Error raised by inference API HTTP code: 404, {"error":"model \"llama2\" not found, try pulling it first"}

Let's create, test, and then evaluate our new chain!

In [None]:
parent_document_retriever_qa_chain = create_qa_chain(parent_document_retriever)

In [None]:
parent_document_retriever_qa_chain.invoke({"question" : "What is RAG?"})["response"].content

'RAG stands for Retrieval-Augmented Generation.'

In [None]:
pdr_qa_ragas_dataset = create_ragas_dataset(parent_document_retriever_qa_chain, eval_dataset)

100%|██████████| 10/10 [00:17<00:00,  1.80s/it]


In [None]:
pdr_qa_ragas_dataset.to_csv("pdr_qa_ragas_dataset.csv")

Creating CSV from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

55620

In [None]:
pdr_qa_result = evaluate_ragas_dataset(pdr_qa_ragas_dataset)

evaluating with [context_precision]


100%|██████████| 1/1 [01:01<00:00, 61.80s/it]


evaluating with [faithfulness]


100%|██████████| 1/1 [01:09<00:00, 69.23s/it]


evaluating with [answer_relevancy]


100%|██████████| 1/1 [01:04<00:00, 64.76s/it]


evaluating with [context_recall]


100%|██████████| 1/1 [00:06<00:00,  6.54s/it]


evaluating with [context_relevancy]


100%|██████████| 1/1 [00:04<00:00,  4.02s/it]


evaluating with [answer_correctness]


100%|██████████| 1/1 [00:06<00:00,  6.83s/it]


evaluating with [answer_similarity]


100%|██████████| 1/1 [00:00<00:00,  1.54it/s]


In [None]:
pdr_qa_result

{'context_precision': 0.6972, 'faithfulness': 0.3500, 'answer_relevancy': 0.9439, 'context_recall': 1.0000, 'context_relevancy': 0.0134, 'answer_correctness': 0.6000, 'answer_similarity': 1.0000}

#### Ensemble Retrieval

Next let's look at ensemble retrieval!

You can read more about this [here](https://python.langchain.com/docs/modules/data_connection/retrievers/ensemble)!

The basic idea is as follows:

1. Obtain User Question
2. Hit the Retriever Pair
    - Retrieve Documents with BM25 Sparse Vector Retrieval
    - Retrieve Documents with Dense Vector Retrieval Method
3. Collect and "fuse" the retrieved docs based on their weighting using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm into a single ranked list.
4. Use those documents to augment our generation.

Ensure your `weights` list - the relative weighting of each retriever - sums to 1!

In [None]:
!pip install -q -U rank_bm25

In [None]:
from langchain.retrievers import BM25Retriever, EnsembleRetriever

text_splitter = RecursiveCharacterTextSplitter(chunk_size=450, chunk_overlap=75)
docs = text_splitter.split_documents(base_docs)

bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 2

embedding = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(docs, embedding)
chroma_retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

ensemble_retriever = EnsembleRetriever(retrievers=[bm25_retriever, chroma_retriever], weights=[0.75, 0.25])

In [None]:
ensemble_retriever_qa_chain = create_qa_chain(ensemble_retriever)

In [None]:
ensemble_retriever_qa_chain.invoke({"question" : "What is RAG?"})["response"].content

'RAG stands for Retrieval-Augmented Generation.'

In [None]:
ensemble_qa_ragas_dataset = create_ragas_dataset(ensemble_retriever_qa_chain, eval_dataset)

100%|██████████| 10/10 [00:20<00:00,  2.07s/it]


In [None]:
ensemble_qa_ragas_dataset.to_csv("ensemble_qa_ragas_dataset.csv")

Creating CSV from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

22820

In [None]:
ensemble_qa_result = evaluate_ragas_dataset(ensemble_qa_ragas_dataset)

evaluating with [context_precision]


100%|██████████| 1/1 [01:01<00:00, 61.76s/it]


evaluating with [faithfulness]


100%|██████████| 1/1 [01:08<00:00, 68.62s/it]


evaluating with [answer_relevancy]


100%|██████████| 1/1 [00:05<00:00,  5.37s/it]


evaluating with [context_recall]


100%|██████████| 1/1 [00:11<00:00, 11.67s/it]


evaluating with [context_relevancy]


100%|██████████| 1/1 [01:02<00:00, 62.45s/it]


evaluating with [answer_correctness]


100%|██████████| 1/1 [00:08<00:00,  9.00s/it]


evaluating with [answer_similarity]


100%|██████████| 1/1 [00:00<00:00,  1.57it/s]


In [None]:
ensemble_qa_result

{'context_precision': 0.8858, 'faithfulness': 0.7000, 'answer_relevancy': 0.8918, 'context_recall': 0.9800, 'context_relevancy': 0.0192, 'answer_correctness': 0.7750, 'answer_similarity': 1.0000}

### Conclusion

Observe your results in a table!

In [None]:
basic_qa_result

{'context_precision': 0.5000, 'faithfulness': 0.4000, 'answer_relevancy': 0.9535, 'context_recall': 1.0000, 'context_relevancy': 0.0559, 'answer_correctness': 0.6167, 'answer_similarity': 1.0000}

In [None]:
pdr_qa_result

{'context_precision': 0.6972, 'faithfulness': 0.3500, 'answer_relevancy': 0.9439, 'context_recall': 1.0000, 'context_relevancy': 0.0134, 'answer_correctness': 0.6000, 'answer_similarity': 1.0000}

In [None]:
ensemble_qa_result

{'context_precision': 0.8858, 'faithfulness': 0.7000, 'answer_relevancy': 0.8918, 'context_recall': 0.9800, 'context_relevancy': 0.0192, 'answer_correctness': 0.7750, 'answer_similarity': 1.0000}

We can also zoom in on each result and find specific information about each of the questions and answers.

In [None]:
ensemble_qa_result_df = ensemble_qa_result.to_pandas()

In [None]:
ensemble_qa_result_df

Unnamed: 0,question,contexts,answer,ground_truths,context_precision,faithfulness,answer_relevancy,context_recall,context_relevancy,answer_correctness,answer_similarity
0,What is the focus of this paper?,[has to make an important career decision.\nNe...,The focus of this paper is on a framework call...,[The focus of this paper is on retrieval-augme...,1.0,0.666667,0.784617,1.0,0.0,0.5,True
1,What is the title of the paper?,[of War. The game was released worldwide in\nG...,"Title: Self-RAG: Learning to Retrieve, Generat...",[The title of the paper is 'A Survey on Retrie...,0.5,1.0,0.976911,1.0,0.0,0.5,True
2,What is the aim of this paper?,[has to make an important career decision.\nNe...,The aim of this paper is to introduce a new fr...,[The aim of this paper is to conduct a compreh...,1.0,0.333333,0.800732,1.0,0.078947,0.75,True
3,What is the main focus of the paper 'A Survey ...,[A Survey on Retrieval-Augmented Text Generati...,The main focus of the paper 'A Survey on Retri...,[The main focus of the paper 'A Survey on Retr...,0.679167,1.0,0.982435,0.8,0.017857,1.0,True
4,What is the main focus of this paper?,[example of completions of the prompt by diffe...,I don't know.,[The main focus of this paper is to conduct a ...,1.0,0.0,0.742601,1.0,0.0,0.5,True
5,What is the main focus of this paper?,[example of completions of the prompt by diffe...,I don't know.,[The main focus of this paper is to conduct a ...,1.0,0.0,0.742601,1.0,0.0,0.5,True
6,What are the advantages of retrieval-augmented...,[attracted increasing attention of the compu-\...,The advantages of retrieval-augmented text gen...,[The advantages of retrieval-augmented text ge...,1.0,1.0,0.968712,1.0,0.025641,1.0,True
7,What is the main focus of the paper 'A Survey ...,[A Survey on Retrieval-Augmented Text Generati...,The main focus of the paper 'A Survey on Retri...,[The main focus of the paper 'A Survey on Retr...,0.679167,1.0,0.982422,1.0,0.017857,1.0,True
8,What are the advantages of retrieval-augmented...,[attracted increasing attention of the compu-\...,The advantages of retrieval-augmented text gen...,[The advantages of retrieval-augmented text ge...,1.0,1.0,0.968731,1.0,0.025641,1.0,True
9,What are the advantages of retrieval-augmented...,[attracted increasing attention of the compu-\...,The advantages of retrieval-augmented text gen...,[The advantages of retrieval-augmented text ge...,1.0,1.0,0.968692,1.0,0.025641,1.0,True


We'll also look at combining the results and looking at them in a single table so we can make inferences about them!

In [None]:
def create_df_dict(pipeline_name, pipeline_items):
  df_dict = {"name" : pipeline_name}
  for name, score in pipeline_items:
    df_dict[name] = score
  return df_dict

In [None]:
basic_rag_df_dict = create_df_dict("basic_rag", basic_qa_result.items())

In [None]:
pdr_rag_df_dict = create_df_dict("pdr_rag", pdr_qa_result.items())

In [None]:
ensemble_rag_df_dict = create_df_dict("ensemble_rag", ensemble_qa_result.items())

In [None]:
results_df = pd.DataFrame([basic_rag_df_dict, pdr_rag_df_dict, ensemble_rag_df_dict])

In [None]:
results_df.sort_values("answer_correctness", ascending=False)

Unnamed: 0,name,context_precision,faithfulness,answer_relevancy,context_recall,context_relevancy,answer_correctness,answer_similarity
2,ensemble_rag,0.885833,0.7,0.891845,0.98,0.019158,0.775,1.0
0,basic_rag,0.5,0.4,0.953475,1.0,0.055904,0.616667,1.0
1,pdr_rag,0.697222,0.35,0.943909,1.0,0.013386,0.6,1.0


### ❓QUESTION❓

What conclusions can you draw about the above results?

Describe in your own words what the metrics are expressing.

In [None]:
retrieval_augmented_qa_chain = (
    RunnableParallel({
        'context': itemgetter('question') | base_retriever,
        'question': RunnablePassthrough()
    }) | {
        'response': prompt | primary_qa_llm | parser,
        'context': itemgetter('context')
    }
)