# Evaluating LlamaIndex

[LlamaIndex](https://github.com/jerryjliu/llama_index) is a data framework for LLM applications to ingest, structure, and access private or domain-specific data. It makes it easy to connect LLMs with your own data. But to find the best configuration of LlamaIndex for your data, you need an objective measure of performance. This is where Ragas comes in. Ragas helps you evaluate your `QueryEngine` and gives you the confidence to tweak the configuration to get the highest score.

This guide assumes you are familiar with the LlamaIndex framework.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# attach to the same event-loop
import nest_asyncio

nest_asyncio.apply()

## Building the `VectorStoreIndex` and `QueryEngine`

To start, let's build a `VectorStoreIndex` over New York City's [Wikipedia page](https://en.wikipedia.org/wiki/New_York_City) as an example and use Ragas to evaluate it.

In [7]:
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext

import pandas as pd

Load the data, build the `VectorStoreIndex`, and create the `QueryEngine`.

In [8]:
documents = SimpleDirectoryReader("./nyc_wikipedia/").load_data()
vector_index = VectorStoreIndex.from_documents(
    documents, service_context=ServiceContext.from_defaults(chunk_size=512)
)

query_engine = vector_index.as_query_engine()

Let's try a sample question to check that it is working.

In [9]:
response_vector = query_engine.query("How did New York City get its name?")

print(response_vector)


New York City was named in honor of the Duke of York, who would become King James II of England. In 1664, King Charles II appointed the Duke as proprietor of the former territory of New Netherland, including the city of New Amsterdam, when England seized it from Dutch control. The city was then renamed New York in his honor.


## Evaluating with Ragas

Now that we have a `QueryEngine` for the `VectorStoreIndex`, we can evaluate it with the LlamaIndex integration that Ragas provides.

To run an evaluation with Ragas and LlamaIndex, you need three things:

1. LlamaIndex `QueryEngine`: what we will be evaluating
2. Metrics: Ragas defines a set of metrics that can measure different aspects of the `QueryEngine`. The available metrics and their meaning can be found [here](https://github.com/explodinggradients/ragas/blob/main/docs/metrics.md)
3. Questions: A list of questions that Ragas will test the `QueryEngine` against.

First, let's generate the questions. Ideally, you should use questions that you see in production, so that the distribution of questions used for evaluation matches the distribution seen in production. This ensures that the scores reflect the performance you will see in production.
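If you do have a log of real user questions, one way to keep the evaluation set close to the production distribution is to sample uniformly from that log. The following is a minimal sketch under that assumption; `sample_production_questions` and the `logged_questions` list are hypothetical names made up for illustration and are not part of Ragas or LlamaIndex.

```python
import random


def sample_production_questions(logged_questions, k, seed=0):
    """Draw up to k questions uniformly at random from a query log."""
    # Collapse stray whitespace and drop empty entries.
    cleaned = [" ".join(q.split()) for q in logged_questions]
    cleaned = [q for q in cleaned if q]
    # A seeded generator keeps the evaluation set reproducible between runs.
    rng = random.Random(seed)
    # Sampling the raw log (without de-duplicating) means frequently asked
    # questions are more likely to be picked, as they are in production.
    return rng.sample(cleaned, min(k, len(cleaned)))


# Hypothetical log entries, for illustration only.
logged_questions = [
    "How did New York City get its name?",
    "  What is the population of New York City?  ",
    "",
    "Which boroughs make up New York City?",
    "How did New York City get its name?",
]

eval_sample = sample_production_questions(logged_questions, k=3)
```

Because the log is sampled as-is, the same question can appear more than once in the sample. This guide has no production traffic to draw from, so the questions below are generated instead.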

We're using the `DatasetGenerator` from LlamaIndex for this.

In [10]:
from llama_index.evaluation import DatasetGenerator

question_generator = DatasetGenerator.from_documents(documents)
# generate 5 questions
eval_questions = question_generator.generate_questions_from_nodes(5)

len(eval_questions)

5

In [11]:
# let's look at the questions
eval_questions

['What is the population of New York City as of 2020?',
 'Which city is the second-largest in the United States after New York City?',
 'What is the geographical and demographic center of the Northeast megalopolis?',
 'How many people live within 250 miles of New York City?',
 'What is the largest metropolitan economy in the world as of 2021?']

Now let's import the metrics we will use for the evaluation.

In [12]:
from ragas.metrics import faithfulness, answer_relevancy, context_relevancy
from ragas.metrics.critique import harmfulness

metrics = [faithfulness, answer_relevancy, context_relevancy, harmfulness]

Finally, let's run the evaluation.

In [14]:
from ragas.llama_index import evaluate

result = evaluate(query_engine, metrics, eval_questions)

evaluating with [faithfulness]

100%|████████████████████████████████████████████████████████████| 1/1 [00:23<00:00, 23.48s/it]

evaluating with [answer_relevancy]

Map:   0%|          | 0/5 [00:00<?, ? examples/s]

100%|████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.08it/s]

evaluating with [context_ relevancy]

100%|████████████████████████████████████████████████████████████| 1/1 [00:18<00:00, 18.57s/it]

evaluating with [harmfulness]

100%|████████████████████████████████████████████████████████████| 1/1 [00:13<00:00, 13.62s/it]

In [15]:
# final scores
print(result)

{'ragas_score': 0.2165, 'faithfulness': 1.0000, 'answer_relevancy': 0.9106, 'context_ relevancy': 0.0850, 'harmfulness': 0.0000}
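
The combined `ragas_score` is much lower than two of the three main metrics, which suggests it is not a plain average. The numbers are consistent with a harmonic mean of `faithfulness`, `answer_relevancy`, and `context_ relevancy`, with `harmfulness` left out. That reading is an assumption inferred from this output, not something this guide states. A small sketch to check it, using the per-question scores from the DataFrame shown further down:

```python
from statistics import harmonic_mean, mean

# Per-question scores copied from the evaluation output in this guide.
faithfulness_scores = [1.0, 1.0, 1.0, 1.0, 1.0]
answer_relevancy_scores = [0.904, 0.932, 0.913, 0.893, 0.911]
context_relevancy_scores = [0.117409, 0.035847, 0.104823, 0.12287, 0.04419]

# Average each metric over the questions first.
metric_means = [
    mean(faithfulness_scores),
    mean(answer_relevancy_scores),
    mean(context_relevancy_scores),
]

# Assumption: the overall score is the harmonic mean of the metric averages.
combined = harmonic_mean(metric_means)
print(round(combined, 4))
```

A harmonic mean is dominated by its smallest input, so under this assumption the low context relevancy is what pulls the overall score down to roughly 0.22.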


You can convert the result into a pandas DataFrame for further analysis.

In [16]:
result.to_pandas()

| | question | answer | contexts | faithfulness | answer_relevancy | context_ relevancy | harmfulness |
|---|---|---|---|---|---|---|---|
| 0 | What is the population of New York City as of ... | \nThe population of New York City as of 2020 i... | [Aeromedical Staging Squadron, and a military ... | 1.0 | 0.904 | 0.117409 | 0 |
| 1 | Which city is the second-largest in the United... | \nLos Angeles is the second-largest city in th... | [New York, often called New York City or NYC, ... | 1.0 | 0.932 | 0.035847 | 0 |
| 2 | What is the geographical and demographic cente... | \nNew York City is the geographical and demogr... | [New York, often called New York City or NYC, ... | 1.0 | 0.913 | 0.104823 | 0 |
| 3 | How many people live within 250 miles of New Y... | \nOver 58 million people live within 250 miles... | [Aeromedical Staging Squadron, and a military ... | 1.0 | 0.893 | 0.12287 | 0 |
| 4 | What is the largest metropolitan economy in th... | \nThe largest metropolitan economy in the worl... | [New York, often called New York City or NYC, ... | 1.0 | 0.911 | 0.04419 | 0 |
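
One simple analysis is to look for the questions with the weakest score on a given metric. The sketch below works on plain row dictionaries, which could be obtained from the DataFrame with `result.to_pandas().to_dict("records")`. The `lowest_scoring` helper is a hypothetical name made up for illustration, and the rows are copied from the evaluation above with only two columns kept.

```python
def lowest_scoring(records, metric, n=2):
    """Return the n rows with the smallest value for the given metric."""
    return sorted(records, key=lambda row: row[metric])[:n]


# The metric key includes a space because that is how it appears
# in the evaluation output of this guide.
records = [
    {"question": "What is the population of New York City as of 2020?",
     "context_ relevancy": 0.117409},
    {"question": "Which city is the second-largest in the United States after New York City?",
     "context_ relevancy": 0.035847},
    {"question": "What is the geographical and demographic center of the Northeast megalopolis?",
     "context_ relevancy": 0.104823},
    {"question": "How many people live within 250 miles of New York City?",
     "context_ relevancy": 0.12287},
    {"question": "What is the largest metropolitan economy in the world as of 2021?",
     "context_ relevancy": 0.04419},
]

worst = lowest_scoring(records, "context_ relevancy", n=2)
for row in worst:
    print(row["context_ relevancy"], row["question"])
```

Questions that surface here are reasonable starting points when tweaking retrieval settings such as the chunk size.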
