(dataset-preparation)=
## Prepare your test dataset

Evaluating any ML pipeline requires several data points that together constitute a test dataset. For Ragas, the data points required to evaluate your RAG pipeline end to end are:

- `question`: A question or query that is relevant to your RAG.
- `contexts`: The retrieved contexts corresponding to each question. This is a `list[list]` since each question can retrieve multiple text chunks.
- `answer`:  The answer generated by your RAG corresponding to each question.
- `ground_truth`: The expected correct answer corresponding to each question.
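
To make the expected shape concrete, here is a minimal sketch of a one-row dataset in plain Python, together with a small shape check. The helper `find_schema_problems` and the sample strings are hypothetical and are not part of Ragas; only the four column names come from the list above.

```python
REQUIRED_COLUMNS = ("question", "contexts", "answer", "ground_truth")


def find_schema_problems(data):
    """Return a list of readable problems; an empty list means the shape looks right."""
    problems = [f"missing column: {c}" for c in REQUIRED_COLUMNS if c not in data]
    if problems:
        return problems

    # Every column must have one entry per question.
    n_rows = len(data["question"])
    for column in REQUIRED_COLUMNS:
        if len(data[column]) != n_rows:
            problems.append(f"{column} has {len(data[column])} rows, expected {n_rows}")

    # Each question maps to a list of retrieved text chunks (list of lists overall).
    for i, chunks in enumerate(data["contexts"]):
        if not isinstance(chunks, list) or not all(isinstance(c, str) for c in chunks):
            problems.append(f"contexts[{i}] is not a list of strings")
    return problems


# Made-up sample row, for illustration only.
sample = {
    "question": ["What is instruction tuning?"],
    "contexts": [["Chunk one of retrieved text.", "Chunk two of retrieved text."]],
    "answer": ["A made-up answer produced by the pipeline."],
    "ground_truth": ["A made-up reference answer."],
}

print(find_schema_problems(sample))
```

A dictionary shaped like `sample` is what gets passed to `Dataset.from_dict` later in this notebook.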

For the purpose of this notebook, I have this dataset prepared from a simple RAG that I created myself to help me with NLP research. Let's use it.

In [1]:
from datasets import load_dataset

In [2]:
eval_dataset = load_dataset("explodinggradients/prompt-engineering-guide-papers")
eval_dataset = eval_dataset["test"].to_pandas()
eval_dataset.head()

| | question | ground_truth |
|---|---|---|
| 0 | How does instruction tuning affect the zero-sh... | For larger models on the order of 100B paramet... |
| 1 | What is the Zero-shot-CoT method and how does ... | Zero-shot-CoT is a zero-shot template-based pr... |
| 2 | How does prompt tuning affect model performanc... | Prompt tuning improves model performance in im... |
| 3 | What is the purpose of instruction tuning in l... | The purpose of instruction tuning in language ... |
| 4 | What distinguishes Zero-shot-CoT from Few-shot... | Zero-shot-CoT differs from Few-shot-CoT in tha... |


As you can see, the dataset contains two of the required attributes: `question` and `ground_truth`. The next step is to collect the other two.

:::{note}
*We know that it's hard to put together test data with question and ground-truth answer pairs when starting out. Ragas offers a synthetic test data generation feature for exactly this: the questions and ground-truth answers in this dataset were created with [ragas synthetic data generation](./testset_generation.md). Check it out once you finish this notebook.*
:::

#### Simple RAG pipeline

After the step above, we have two of the attributes needed for evaluation: `question` and `ground_truth`. We now need to feed these test questions to our RAG pipeline to collect the other two, i.e. `contexts` and `answer`. Let's build a simple RAG using llama-index to do that.

:::{note}
I'm using a sample corpus of NLP papers and OpenAI models to build the RAG pipeline. This is purely for demonstration purposes: you should run the test questions through your own RAG pipeline instead. I assume that if you're here, you already have a RAG pipeline ready to use.
:::

In [8]:
! git clone https://huggingface.co/datasets/explodinggradients/prompt-engineering-guide-papers

Cloning into 'prompt-engineering-guide-papers'...
remote: Enumerating objects: 19, done.
remote: Counting objects: 100% (15/15), done.
remote: Compressing objects: 100% (15/15), done.
remote: Total 19 (delta 1), reused 0 (delta 0), pack-reused 4
Unpacking objects: 100% (19/19), 3.07 MiB | 6.46 MiB/s, done.
Filtering content: 100% (3/3), 18.03 MiB | 6.34 MiB/s, done.


In [14]:
import os

PATH = "./prompt-engineering-guide-papers"
os.environ["OPENAI_API_KEY"] = "your-open-ai-key"

In [9]:
import nest_asyncio
from llama_index.core.indices import VectorStoreIndex
from llama_index.core.readers import SimpleDirectoryReader
from llama_index.core.service_context import ServiceContext
from datasets import Dataset

nest_asyncio.apply()


def build_query_engine(documents):
    vector_index = VectorStoreIndex.from_documents(
        documents,
        service_context=ServiceContext.from_defaults(chunk_size=512),
    )

    query_engine = vector_index.as_query_engine(similarity_top_k=3)
    return query_engine


# Query the engine synchronously, one question at a time, because LlamaIndex
# does not support async evaluation for the HFInference API.
def generate_responses(query_engine, test_questions, test_answers):
    responses = [query_engine.query(q) for q in test_questions]

    answers = []
    contexts = []
    for r in responses:
        answers.append(r.response)
        contexts.append([c.node.get_content() for c in r.source_nodes])
    dataset_dict = {
        "question": test_questions,
        "answer": answers,
        "contexts": contexts,
    }
    if test_answers is not None:
        dataset_dict["ground_truth"] = test_answers
    ds = Dataset.from_dict(dataset_dict)
    return ds
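
Before spending API calls, it can help to dry-run the collection logic against a stub. The sketch below mirrors the steps of `generate_responses` but returns a plain dict, and it assumes only the attributes that function already reads: `response.response` and `source_nodes[i].node.get_content()`. The `Fake*` classes and `collect_columns` are hypothetical stand-ins, not llama-index classes.

```python
from dataclasses import dataclass, field


@dataclass
class FakeNode:
    text: str

    def get_content(self):
        return self.text


@dataclass
class FakeNodeWithScore:
    node: FakeNode


@dataclass
class FakeResponse:
    response: str
    source_nodes: list = field(default_factory=list)


class FakeQueryEngine:
    """Hypothetical stand-in that returns canned text instead of calling an LLM."""

    def query(self, question):
        chunks = [FakeNodeWithScore(FakeNode(f"chunk {i} for: {question}")) for i in range(2)]
        return FakeResponse(response=f"stub answer to: {question}", source_nodes=chunks)


def collect_columns(query_engine, test_questions, test_answers=None):
    # Same collection steps as generate_responses, but returns a plain dict.
    responses = [query_engine.query(q) for q in test_questions]
    columns = {
        "question": list(test_questions),
        "answer": [r.response for r in responses],
        "contexts": [[c.node.get_content() for c in r.source_nodes] for r in responses],
    }
    if test_answers is not None:
        columns["ground_truth"] = list(test_answers)
    return columns


columns = collect_columns(FakeQueryEngine(), ["q1", "q2"], ["gt1", "gt2"])
print(columns["contexts"][0])
```

Because the stub exposes the same attributes, a real query engine could in principle be passed to the same helper.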

In [10]:
reader = SimpleDirectoryReader(PATH, num_files_limit=30, required_exts=[".pdf"])
documents = reader.load_data()

In [12]:
test_questions = eval_dataset["question"].values.tolist()
test_answers = eval_dataset["ground_truth"].values.tolist()

In [15]:
query_engine1 = build_query_engine(documents)
result_ds = generate_responses(query_engine1, test_questions, test_answers)

In [16]:
result_ds

Dataset({
    features: ['question', 'answer', 'contexts', 'ground_truth'],
    num_rows: 20
})
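
Before moving on to evaluation, a quick completeness check can catch questions for which the pipeline returned an empty answer or retrieved nothing. This is a minimal sketch with a hypothetical helper, `find_incomplete_rows`; it works on any iterable of dict-like rows, which is what iterating over a Hugging Face `Dataset` such as `result_ds` yields. The rows below are made up for illustration.

```python
def find_incomplete_rows(rows):
    """Return indices of rows with an empty answer or no retrieved contexts."""
    incomplete = []
    for i, row in enumerate(rows):
        has_answer = bool((row.get("answer") or "").strip())
        has_contexts = bool(row.get("contexts"))
        if not (has_answer and has_contexts):
            incomplete.append(i)
    return incomplete


# Made-up rows standing in for the collected dataset.
rows = [
    {"question": "q1", "answer": "a1", "contexts": ["c1"], "ground_truth": "g1"},
    {"question": "q2", "answer": "  ", "contexts": ["c2"], "ground_truth": "g2"},
    {"question": "q3", "answer": "a3", "contexts": [], "ground_truth": "g3"},
]

print(find_incomplete_rows(rows))
```

With the real data, the call would be `find_incomplete_rows(result_ds)`; an empty list means every row has both an answer and at least one context.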

In [1]:
result_ds.to_pandas().head()

| | question | ground_truth | answer | contexts |
|---|---|---|---|---|
| 0 | How does instruction tuning affect the zero-sh... | For larger models on the order of 100B paramet... | For larger models with around 100B parameters,... | `[Published as a conference paper at ICLR 2022\...` |
| 1 | What is the Zero-shot-CoT method and how does ... | Zero-shot-CoT is a zero-shot template-based pr... | The Zero-shot-CoT method is a zero-shot templa... | `[Similar to\nFew-shot-CoT, Zero-shot-CoT facil...` |
| 2 | How does prompt tuning affect model performanc... | Prompt tuning improves model performance in im... | Prompt tuning has been shown to enhance model ... | `[The orange bars indicate standard deviation a...` |
| 3 | What is the purpose of instruction tuning in l... | The purpose of instruction tuning in language ... | The purpose of instruction tuning in language ... | `[Although one might\nexpect labeled data to ha...` |
| 4 | What distinguishes Zero-shot-CoT from Few-shot... | Zero-shot-CoT differs from Few-shot-CoT in tha... | Zero-shot-CoT requires prompting LLMs twice bu... | `[Baselines We compare our Zero-shot-CoT mainly...` |


Done. You now have the dataset required for evaluating your RAG system. The next step is the actual evaluation; check out the evaluation guide [here](./evaluation.md).