<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/llama_dataset/labelled-rag-datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Benchmarking RAG Pipelines With A `LabelledRagDataset`

The `LabelledRagDataset` is meant for evaluating any given RAG pipeline, for which there could be several configurations (e.g., the choice of `LLM` and the values of `similarity_top_k`, `chunk_size`, and others). We've likened this abstraction to traditional machine learning datasets, where `X` features are meant to predict a ground-truth label `y`. In this case, we use the `query` as well as the retrieved `contexts` as the "features", and the answer to the query, called `reference_answer`, as the ground-truth label.

And of course, such datasets are composed of observations or examples. In the case of `LabelledRagDataset`, these are a set of `LabelledRagDataExample`'s.
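
To make the analogy concrete, here is a minimal, standard-library-only sketch that splits one example into "features" and a "label". The `ToyRagExample` dataclass and the `split_features_and_label` helper are hypothetical names used purely for illustration; they are not part of `llama_index`.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ToyRagExample:
    """Hypothetical stand-in mirroring three fields of a labelled RAG example."""

    query: str
    reference_contexts: List[str]
    reference_answer: str


def split_features_and_label(
    example: ToyRagExample,
) -> Tuple[Tuple[str, List[str]], str]:
    """Return ((query, contexts), reference_answer), i.e. the (X, y) pair."""
    features = (example.query, example.reference_contexts)
    label = example.reference_answer
    return features, label


example = ToyRagExample(
    query="This is a test query, is it not?",
    reference_contexts=["This is a sample context"],
    reference_answer="Yes it is.",
)
X, y = split_features_and_label(example)
print(X)
print(y)
```

A RAG pipeline under evaluation plays the role of the model: it receives the query, retrieves its own contexts, and its answer is compared against `y`.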

In this notebook, we will show how one can construct a `LabelledRagDataset` from scratch. Please note that the alternative is to simply download a community-supplied `LabelledRagDataset` from `llama-hub` in order to evaluate/benchmark your own RAG pipeline on it.

### The `LabelledRagDataExample` Class

In [1]:
# %pip install llama-index-llms-openai
# %pip install llama-index-readers-wikipedia

In [2]:
# attach to the same event-loop
import nest_asyncio

nest_asyncio.apply()

In [3]:
from jet.llm.ollama import initialize_ollama_settings, create_llm
initialize_ollama_settings({
    "embedding_model": "mxbai-embed-large",
    "chunk_size": 1024,
})

_Settings(_llm=Ollama(callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x16352f1d0>, system_prompt=None, messages_to_prompt=<function messages_to_prompt at 0x1425f9ee0>, completion_to_prompt=<function default_completion_to_prompt at 0x142841a80>, output_parser=None, pydantic_program_mode=<PydanticProgramMode.DEFAULT: 'default'>, query_wrapper_prompt=None, base_url='http://localhost:11434', model='llama3.1', temperature=0.0, context_window=4096, request_timeout=300.0, prompt_key='prompt', json_mode=False, additional_kwargs={}, is_function_calling_model=True, keep_alive=None), _embed_model=OllamaEmbedding(model_name='mxbai-embed-large', embed_batch_size=10, callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x16352f1d0>, num_workers=None, base_url='http://localhost:11434', ollama_additional_kwargs={}), _callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x16352f1d0>, _tokenizer=None, _node_parser=SentenceSplitte

In [4]:
from llama_index.core.llama_dataset import (
    LabelledRagDataExample,
    CreatedByType,
    CreatedBy,
)

# constructing a LabelledRagDataExample
query = "This is a test query, is it not?"
query_by = CreatedBy(type=CreatedByType.AI, model_name="llama3.1")
reference_answer = "Yes it is."
reference_answer_by = CreatedBy(type=CreatedByType.HUMAN)
reference_contexts = ["This is a sample context"]

rag_example = LabelledRagDataExample(
    query=query,
    query_by=query_by,
    reference_contexts=reference_contexts,
    reference_answer=reference_answer,
    reference_answer_by=reference_answer_by,
)

The `LabelledRagDataExample` is a Pydantic model, so converting it to `json` or `dict` (and back again) is possible.

In [5]:
print(rag_example.json())

{"query":"This is a test query, is it not?","query_by":{"model_name":"llama3.1","type":"ai"},"reference_contexts":["This is a sample context"],"reference_answer":"Yes it is.","reference_answer_by":{"model_name":"","type":"human"}}


In [6]:
LabelledRagDataExample.parse_raw(rag_example.json())

LabelledRagDataExample(query='This is a test query, is it not?', query_by=CreatedBy(model_name='llama3.1', type=<CreatedByType.AI: 'ai'>), reference_contexts=['This is a sample context'], reference_answer='Yes it is.', reference_answer_by=CreatedBy(model_name='', type=<CreatedByType.HUMAN: 'human'>))

In [7]:
rag_example.dict()

{'query': 'This is a test query, is it not?',
 'query_by': {'model_name': 'llama3.1', 'type': <CreatedByType.AI: 'ai'>},
 'reference_contexts': ['This is a sample context'],
 'reference_answer': 'Yes it is.',
 'reference_answer_by': {'model_name': '',
  'type': <CreatedByType.HUMAN: 'human'>}}

In [8]:
LabelledRagDataExample.parse_obj(rag_example.dict())

LabelledRagDataExample(query='This is a test query, is it not?', query_by=CreatedBy(model_name='llama3.1', type=<CreatedByType.AI: 'ai'>), reference_contexts=['This is a sample context'], reference_answer='Yes it is.', reference_answer_by=CreatedBy(model_name='', type=<CreatedByType.HUMAN: 'human'>))
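
The four cells above amount to a round trip: object to JSON or `dict`, and back to an equal object. The sketch below reproduces that property with the standard library only. `ToyExample` is a hypothetical stand-in, and this is an analogue of the behaviour, not how Pydantic implements it.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class ToyExample:
    """Hypothetical stand-in; not the llama_index class."""

    query: str
    reference_contexts: List[str]
    reference_answer: str


original = ToyExample(
    query="This is a test query, is it not?",
    reference_contexts=["This is a sample context"],
    reference_answer="Yes it is.",
)

# object -> dict -> object
as_dict = asdict(original)
from_dict = ToyExample(**as_dict)

# object -> JSON string -> object
as_json = json.dumps(as_dict)
from_json = ToyExample(**json.loads(as_json))

# Dataclasses compare field by field, so equality confirms a lossless round trip.
round_trip_ok = (from_dict == original) and (from_json == original)
print(round_trip_ok)
```

A check of this kind is a cheap way to confirm that nothing is lost when examples are serialized and reloaded.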

Let's create a second example, so we can have a (slightly) more interesting `LabelledRagDataset`.

In [9]:
query = "This is a test query, is it so?"
reference_answer = "I think yes, it is."
reference_contexts = ["This is a second sample context"]

rag_example_2 = LabelledRagDataExample(
    query=query,
    query_by=query_by,
    reference_contexts=reference_contexts,
    reference_answer=reference_answer,
    reference_answer_by=reference_answer_by,
)

### The `LabelledRagDataset` Class

In [10]:
from llama_index.core.llama_dataset import LabelledRagDataset

rag_dataset = LabelledRagDataset(examples=[rag_example, rag_example_2])

There is a convenience method to view the dataset as a `pandas.DataFrame`.

In [11]:
rag_dataset.to_pandas()

| | query | reference_contexts | reference_answer | reference_answer_by | query_by |
|---|---|---|---|---|---|
| 0 | This is a test query, is it not? | [This is a sample context] | Yes it is. | human | ai (llama3.1) |
| 1 | This is a test query, is it so? | [This is a second sample context] | I think yes, it is. | human | ai (llama3.1) |


#### Serialization

To persist and load the dataset to and from disk, there are the `save_json` and `from_json` methods.

In [12]:
rag_dataset.save_json("rag_dataset.json")

In [13]:
reload_rag_dataset = LabelledRagDataset.from_json("rag_dataset.json")

In [14]:
reload_rag_dataset.to_pandas()

| | query | reference_contexts | reference_answer | reference_answer_by | query_by |
|---|---|---|---|---|---|
| 0 | This is a test query, is it not? | [This is a sample context] | Yes it is. | human | ai (llama3.1) |
| 1 | This is a test query, is it so? | [This is a second sample context] | I think yes, it is. | human | ai (llama3.1) |


### Building a synthetic `LabelledRagDataset` over Wikipedia 

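For this section, we'll first create a `LabelledRagDataset` using a synthetic generator. Ultimately, we will use an LLM (`llama3.1` in the cells that follow) to produce both the `query` and `reference_answer` for the synthetic `LabelledRagDataExample`'s. Note that the Wikipedia loading code in this section is commented out; the documents are read from local files instead.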

NOTE: if one has queries, reference answers, and contexts over a text corpus, then it is not necessary to use data synthesis to be able to predict and subsequently evaluate said predictions.

In [15]:
# !pip install wikipedia -q

In [16]:
# wikipedia pages
from llama_index.readers.wikipedia import WikipediaReader
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings

# cities = [
#     "San Francisco",
# ]
# documents = WikipediaReader().load_data(
#     pages=[f"History of {x}" for x in cities]
# )
context_files = [
    "/Users/jethroestrada/Desktop/External_Projects/AI/chatbot/open-webui/backend/crewAI/docs/installation.mdx",
    "/Users/jethroestrada/Desktop/External_Projects/AI/chatbot/open-webui/backend/crewAI/docs/introduction.mdx",
    "/Users/jethroestrada/Desktop/External_Projects/AI/chatbot/open-webui/backend/crewAI/docs/quickstart.mdx",
]
documents = SimpleDirectoryReader(
    input_files=context_files,
).load_data()

index = VectorStoreIndex.from_documents(documents)

In [17]:
len(documents)

3

The `RagDatasetGenerator` can be built over a set of documents to generate `LabelledRagDataExample`'s.

In [18]:
# generate questions against chunks
from llama_index.core.llama_dataset.generator import RagDatasetGenerator

# set context for llm provider
llm = create_llm(model="llama3.1", temperature=0.3)
num_questions_per_chunk = 2

# instantiate a DatasetGenerator
dataset_generator = RagDatasetGenerator.from_documents(
    documents,
    llm=llm,
    num_questions_per_chunk=num_questions_per_chunk,  # set the number of questions per node
    show_progress=True,
    question_gen_query=f"You are a Senior Programmer. Your task is to setup {num_questions_per_chunk} questions about features or code implementation. Restrict the questions to the context information provided."
)

Parsing nodes:   0%|          | 0/3 [00:00<?, ?it/s]

In [19]:
len(dataset_generator.nodes)

7

In [20]:
dataset_generator.nodes[0]



In [21]:
questions_length = len(dataset_generator.nodes) * num_questions_per_chunk
questions_length

14
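
The expected size of the generated dataset is simple arithmetic: the number of nodes times the number of questions per chunk. Here is a small sketch, assuming the generator yields exactly the requested number of questions for every node (in practice an LLM may return fewer or more). `expected_question_count` is a hypothetical helper, not a `llama_index` function.

```python
def expected_question_count(num_nodes: int, questions_per_chunk: int) -> int:
    """Upper-level estimate: one batch of questions is requested per node."""
    if num_nodes < 0 or questions_per_chunk < 0:
        raise ValueError("counts must be non-negative")
    return num_nodes * questions_per_chunk


# 7 nodes and 2 questions per chunk, as reported by the cells above.
expected = expected_question_count(num_nodes=7, questions_per_chunk=2)
print(expected)  # 14
```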

In [22]:
# since there are 7 nodes and 2 questions per node, there should be a total of 14 questions
rag_dataset = dataset_generator.generate_dataset_from_nodes()

100%|██████████| 7/7 [02:19<00:00, 19.89s/it]
100%|██████████| 2/2 [00:51<00:00, 25.71s/it]
100%|██████████| 2/2 [01:38<00:00, 49.44s/it]
100%|██████████| 2/2 [02:05<00:00, 62.87s/it]
100%|██████████| 2/2 [02:11<00:00, 65.80s/it] 
100%|██████████| 2/2 [02:11<00:00, 65.91s/it]
100%|██████████| 2/2 [02:29<00:00, 74.52s/it] 
100%|██████████| 2/2 [00:41<00:00, 20.88s/it]


In [23]:
rag_dataset.to_pandas()

| | query | reference_contexts | reference_answer | reference_answer_by | query_by |
|---|---|---|---|---|---|
| 0 | Here are two questions as a Senior Programmer: | [---\ntitle: Installation\ndescription: Get st... | Here are the answers to the two questions:\n\n... | ai (llama3.1) | ai (llama3.1) |
| 1 | How does CrewAI handle Python version requirem... | [---\ntitle: Installation\ndescription: Get st... | According to the provided context, CrewAI requ... | ai (llama3.1) | ai (llama3.1) |
| 2 | Here are two questions as a Senior Programmer: | [---\ntitle: Introduction\ndescription: Build ... | Here are two answers to the queries based on t... | ai (llama3.1) | ai (llama3.1) |
| 3 | How does CrewAI's Process component ensure smo... | [---\ntitle: Introduction\ndescription: Build ... | According to the provided documentation, the P... | ai (llama3.1) | ai (llama3.1) |
| 4 | Here are two questions based on the context in... | [---\ntitle: Quickstart\ndescription: Build yo... | I'm ready to help. What are the two questions? | ai (llama3.1) | ai (llama3.1) |
| 5 | How can you modify the `agents.yaml` file to i... | [---\ntitle: Quickstart\ndescription: Build yo... | According to the provided context, any variabl... | ai (llama3.1) | ai (llama3.1) |
| 6 | Here are two potential questions as a Senior P... | [@crew\n def crew(self) -> Crew:\n ... | Based on the provided code snippets and contex... | ai (llama3.1) | ai (llama3.1) |
| 7 | How can I add before and after kickoff functio... | [@crew\n def crew(self) -> Crew:\n ... | To add before and after kickoff functions to y... | ai (llama3.1) | ai (llama3.1) |
| 8 | Here are two questions as a Senior Programmer: | [## 2. Benefits of AI Agents\n AI agents br... | I'm happy to help! However, I don't see any sp... | ai (llama3.1) | ai (llama3.1) |
| 9 | How would you implement the \*\*Task Automation\*... | [## 2. Benefits of AI Agents\n AI agents br... | Based on the provided context, it seems that t... | ai (llama3.1) | ai (llama3.1) |
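
Several rows in the table above have a `query` that is the model's preamble (for example, "Here are two questions as a Senior Programmer:") instead of an actual question. One possible cleanup, sketched below with the standard library only, keeps the rows whose query ends with a question mark. The `looks_like_question` heuristic and the `filter_question_rows` helper are hypothetical and not part of `llama_index`; the heuristic is an assumption that may need tuning for other prompts or models, and the second sample row is illustrative, not taken from the generated dataset.

```python
from typing import Dict, List


def looks_like_question(query: str) -> bool:
    """Heuristic (assumption): a real question ends with '?' once whitespace is trimmed."""
    return query.strip().endswith("?")


def filter_question_rows(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the rows whose 'query' field passes the heuristic."""
    return [row for row in rows if looks_like_question(row["query"])]


rows = [
    # Preamble line copied from the table above.
    {"query": "Here are two questions as a Senior Programmer:", "reference_answer": "..."},
    # Illustrative question, made up for this sketch.
    {"query": "Which Python versions are supported?", "reference_answer": "..."},
]

kept = filter_question_rows(rows)
print(len(kept))  # 1
```

With a real dataset, the same predicate could be applied to the DataFrame returned by `rag_dataset.to_pandas()`, for instance `df[df["query"].map(looks_like_question)]`.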


In [24]:
rag_dataset.save_json("rag_dataset.json")