# Evaluating RAG pipelines

A RAG pipeline is a combination of two components: a retriever and a generator. The retriever is responsible for finding the most relevant documents to the input question, and the generator is responsible for generating the answer based on the retrieved documents. For this reason it makes sense to define metrics for each of these components separately:

- Retriever metrics:
  - **Context precision**: Whether the ground-truth relevant items in the retrieved context are ranked at the top
  - **Context recall**: How well the retrieved context aligns with the ground truth answer
  - **Context relevancy**: How relevant the retrieved context is to the question
  - **Context entity recall**: How many of the entities in the ground truth also appear in the retrieved context
- Generator metrics:
  - **Answer relevancy**: How relevant the generated answer is to the question
  - **Answer correctness**: Combined factual similarity and semantic similarity between the answer and the ground truth
  - **Faithfulness**: How many of the factual claims in the answer can be inferred directly from the context

These metrics are defined by the [**Ragas**](https://docs.ragas.io/en/latest/concepts/metrics/index.html) framework, which is a Python library for evaluating RAG pipelines. Another framework used for evaluation is Giskard's RAG Evaluation Toolkit, or [**RAGET**](https://docs.giskard.ai/en/stable/open_source/testset_generation/index.html).
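
To make the definitions above concrete, here is a minimal sketch of the ratios behind two of these metrics. It is not the Ragas implementation: Ragas uses an LLM to extract claims and judge relevance, whereas the helper names (`faithfulness_score`, `context_precision_at_k`) and the hand-labelled inputs below are assumptions made purely for illustration.

```python
from typing import Sequence


def faithfulness_score(claim_supported: Sequence[bool]) -> float:
    """Fraction of the answer's claims that can be inferred from the context."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def context_precision_at_k(relevant: Sequence[bool]) -> float:
    """Average of precision@k over the ranks that hold a relevant chunk.

    `relevant[i]` says whether the retrieved chunk at rank i + 1 is relevant.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Hand-labelled toy inputs (in practice an LLM judge would produce these labels).
print(faithfulness_score([True, True, False, True]))  # 3 of 4 claims supported
print(context_precision_at_k([True, False, True]))    # relevant chunks at ranks 1 and 3
```

Both scores lie between 0 and 1, and context precision rewards rankings that place relevant chunks early.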

We will also look at how to generate synthetic testsets for evaluating RAG, since hand-made testsets are a luxury that can take a lot of effort to create.

In [None]:
%pip install "giskard[llm]" -U

In [None]:
%pip install ragas pandas llama-index-readers-wikipedia "unstructured[md]"

Make sure that your `.env` file contains the following variables:

```
OPENAI_API_KEY=<your_key>
```


In [None]:
import os
from dotenv import load_dotenv

load_dotenv(override=True, verbose=True)

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

In [None]:
# NOTE: This is ONLY necessary in a Jupyter notebook.
# Details: Jupyter runs an event loop behind the scenes.
#          Starting another event loop to make async queries would nest the two,
#          which is normally not allowed. nest_asyncio patches asyncio to permit it.
import nest_asyncio
nest_asyncio.apply()

Imports

In [None]:
import pandas as pd
import webbrowser

from util.helpers import create_and_save_wiki_md_files, get_wiki_pages

# RAGET
import giskard.llm
from giskard.llm.client.openai import OpenAIClient
from giskard.rag import generate_testset, evaluate, KnowledgeBase, QATestset
from giskard.rag.question_generators import (
    distracting_questions,
    double_questions,
    simple_questions,
    situational_questions,
    complex_questions,
)
from giskard.rag.metrics.ragas_metrics import (
    ragas_context_recall, 
    ragas_context_precision,
    ragas_faithfulness,
)

# Llama Index
from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    Settings,
)
from llama_index.core.query_engine import (
    FLAREInstructQueryEngine, 
    BaseQueryEngine
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# RAGAs
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_community.document_loaders import DirectoryLoader
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

Fetch and save documents

In [None]:
pages = get_wiki_pages(articles=["Albert Einstein"])
docs_path = "./data/docs/eval"
create_and_save_wiki_md_files(pages, path=docs_path + "/")

## Generating a testset

When evaluating the performance of an ML system, it's almost always necessary to have a testset of some kind. In the case of RAG, you need a testset consisting of *queries* together with their respective *answers* and *contexts*. The context is the document that the retriever should retrieve, and the answer is the expected output of the generator.

In some cases you might already have access to a testset. For example, if you're working on a pipeline that automatically answers support tickets, you might have a set of successfully handled tickets together with the guides or manuals that your support staff base their answers on. But in many cases it takes a lot of work to create a viable testset, and then it can make sense to create a synthetic testset using LLMs.
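
As a rough sketch of what one row of such a testset holds, the record below pairs a query with its expected answer and the context it should be answered from. The `QAExample` class, its field names, the `is_answerable` check, and the example row are all hypothetical and only illustrate the shape; the frameworks used below define their own schemas.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class QAExample:
    question: str            # the query sent to the RAG pipeline
    reference_answer: str    # what the generator is expected to say
    reference_context: str   # the passage the retriever should surface


def is_answerable(case: QAExample) -> bool:
    """Cheap sanity check: the reference answer should appear in the context.

    This substring test only suits short, extractive answers.
    """
    return case.reference_answer.lower() in case.reference_context.lower()


example = QAExample(
    question="For which year was Albert Einstein awarded the Nobel Prize in Physics?",
    reference_answer="1921",
    reference_context=(
        "Einstein received the 1921 Nobel Prize in Physics for his "
        "services to theoretical physics."
    ),
)
print(asdict(example))
print(is_answerable(example))
```

A check like this is useful for hand-made or synthetic rows alike, since a reference answer that is not supported by its context makes retrieval and generation scores hard to interpret.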


### Using Ragas

**Note**: Ragas currently has a bug that often causes a deadlock when generating the testset. I recommend skipping this section for now, or checking the issue [*here*](https://github.com/explodinggradients/ragas/issues/833).

In [None]:
langchain_documents = DirectoryLoader(docs_path).load()

In [None]:
generator_llm = ChatOpenAI(model="gpt-3.5-turbo-16k", api_key=OPENAI_API_KEY)
critic_llm = ChatOpenAI(model="gpt-4-turbo", api_key=OPENAI_API_KEY)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small", api_key=OPENAI_API_KEY)

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

ragas_testset = generator.generate_with_langchain_docs(
    langchain_documents,
    test_size=1,
    with_debugging_logs=True,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
    is_async=False,
)


### Using RAGET

In [None]:
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
giskard.llm.set_llm_api("openai")
oc = OpenAIClient(model="gpt-4-turbo")
giskard.llm.set_default_client(oc)

In [None]:
llamaindex_documents = SimpleDirectoryReader(docs_path).load_data()
splitter = SentenceSplitter(chunk_size=512)
text_nodes = splitter(llamaindex_documents)

pd_dataframe = pd.DataFrame([node.text for node in text_nodes], columns=["text"])

knowledge_base = KnowledgeBase(data=pd_dataframe)

Now we generate the testset. We generate 5 different types of questions, each of which targets specific components of the RAG pipeline:

- **Simple**: Simple questions generated from an excerpt of the knowledge base 
    - targeted at evaluating *generation* and *retrieval*
- **Complex**: Questions made more complex by paraphrasing 
    - targeted at evaluating *generation*
- **Situational**: Questions that include user context, to evaluate the generator's ability to produce an answer that is relevant to that context 
    - targeted at evaluating *generation*
- **Double**: Questions with two distinct parts 
    - targeted at evaluating *generation* and *rewriting*
- **Distracting**: Questions made to confuse the retrieval part of the RAG with a distracting element that comes from the knowledge base but is irrelevant to the question 
    - targeted at evaluating *generation*, *retrieval* and *rewriting*

Other question types include:
- **Conversational**: Questions asked as part of a conversation, where the first message describes the context of the question that is asked in the last message
    - targeted at evaluating *rewriting* and *routing*
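
One way to use this mapping is to translate failed questions back into the components they target. The sketch below simply encodes the list above as a lookup table; `TARGETED_COMPONENTS` and `implicated_components` are hypothetical names, not part of RAGET, and the failure list is made up for illustration.

```python
from collections import Counter
from typing import Iterable

# Components each question type targets, copied from the list above.
TARGETED_COMPONENTS = {
    "simple": ("generation", "retrieval"),
    "complex": ("generation",),
    "situational": ("generation",),
    "double": ("generation", "rewriting"),
    "distracting": ("generation", "retrieval", "rewriting"),
    "conversational": ("rewriting", "routing"),
}


def implicated_components(failed_question_types: Iterable[str]) -> Counter:
    """Count how often each component is targeted by the failed questions."""
    counts: Counter = Counter()
    for question_type in failed_question_types:
        counts.update(TARGETED_COMPONENTS[question_type])
    return counts


# Hypothetical failures: two "double" questions and one "distracting" question.
print(implicated_components(["double", "double", "distracting"]))
```

A high count is only a hint about where to look first, since every question type also exercises the generator.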


In [None]:
raget_testset = generate_testset(
    knowledge_base=knowledge_base,
    num_questions=20,
    agent_description="A chatbot answering questions about Albert Einstein",
    question_generators=[simple_questions, complex_questions, situational_questions, double_questions, distracting_questions],
)

In [None]:
dir = "./data/eval"
if not os.path.exists(dir):
    print("Creating directory: ", dir)
    os.makedirs(dir)

path = f"{dir}/einstein_testset"


In [None]:
raget_testset.save(path=path)

In [None]:
raget_testset = QATestset.load(path)

In [None]:
raget_testset.to_pandas()

## Evaluating a RAG pipeline

Now, given our generated testset, we can evaluate a RAG pipeline. In this example we use RAGET (with Ragas metrics) to evaluate two different pipelines:

- **Baseline RAG**
- **Advanced RAG with FLARE**

First we create our query engines.

In [None]:
Settings.llm = OpenAI(api_key=OPENAI_API_KEY, model="gpt-3.5-turbo")
Settings.embed_model = OpenAIEmbedding(api_key=OPENAI_API_KEY, model="text-embedding-3-small")
Settings.chunk_size = 512
Settings.chunk_overlap = 20


index = VectorStoreIndex.from_documents(documents=llamaindex_documents, show_progress=True)

### Create simple baseline RAG pipeline

We create a simple RAG setup with the default values from LlamaIndex.

In [None]:
base_query_engine = index.as_query_engine()

### Create FLARE query engine

We create a FLARE query engine using the `FLAREInstructQueryEngine` class from LlamaIndex, wrapping a default query engine built from the same index.

In [None]:
flare_query_engine = FLAREInstructQueryEngine(
    query_engine=index.as_query_engine(), 
    max_iterations=5,
)

### Create `answer_fn` for RAGET to use in evaluation

In [None]:
def answer_fn(question: str, query_engine: BaseQueryEngine) -> str:
    answer = query_engine.query(question)
    
    return str(answer)

def base_answer_fn(question: str, history=None) -> str:
    return answer_fn(question, base_query_engine)

def flare_answer_fn(question: str, history=None) -> str:
    return answer_fn(question, flare_query_engine)
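
Before spending API calls on a full evaluation, it can help to check that an answer function has the shape used above: it takes a question plus an optional `history` and returns a string. The sketch below does this with a stand-in engine; `EchoQueryEngine` and `make_answer_fn` are hypothetical names and not part of LlamaIndex or Giskard.

```python
from typing import Callable, Optional


class EchoQueryEngine:
    """Hypothetical stand-in exposing a `.query()` method like a query engine."""

    def query(self, question: str) -> str:
        return f"stub answer to: {question}"


def make_answer_fn(query_engine) -> Callable[..., str]:
    """Bind a query engine into a `(question, history=None) -> str` callable."""

    def answer_fn(question: str, history: Optional[list] = None) -> str:
        # `history` is accepted for signature compatibility but ignored here.
        return str(query_engine.query(question))

    return answer_fn


stub_answer_fn = make_answer_fn(EchoQueryEngine())
print(stub_answer_fn("Where was Einstein born?"))
```

The same factory could wrap `base_query_engine` or `flare_query_engine` instead of writing one function per engine, at the cost of a little indirection.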

### Evaluate

We use RAGET's `evaluate` function to evaluate the two pipelines on the following metrics: Answer Correctness (computed by default by `evaluate`), Faithfulness, Context Precision, and Context Recall.


In [None]:
base_report = evaluate(
    base_answer_fn,
    testset=raget_testset,
    knowledge_base=knowledge_base,
    metrics=[
        ragas_faithfulness,
        ragas_context_recall,
        ragas_context_precision,
    ],
)

In [None]:
base_report.save(f"{dir}/base_report")

In [None]:
url = f"file://{os.getcwd()}{dir[1:]}/base_report/report.html"
print(url)
webbrowser.open(url=url, new=2)

In [None]:
flare_report = evaluate(
    flare_answer_fn,
    testset=raget_testset,
    knowledge_base=knowledge_base,
    metrics=[
        ragas_faithfulness,
        ragas_context_recall,
        ragas_context_precision,
    ],
)

In [None]:
flare_report.save(f"{dir}/flare_report")
url = f"file://{os.getcwd()}{dir[1:]}/flare_report/report.html"
print(url)
webbrowser.open(url=url, new=2)