# Evaluation

Evaluation and benchmarking play a pivotal role in the development of LLM Applications. For optimizing the performance of applications such as RAG (Retrieval Augmented Generation), a robust measurement mechanism is indispensable.

LlamaIndex provides modules for assessing the quality of generated outputs, along with modules for evaluating retrieval quality. It groups evaluation into two primary types:

*   **Response Evaluation**
*   **Retrieval Evaluation**

[Documentation](https://gpt-index.readthedocs.io/en/latest/core_modules/supporting_modules/evaluation/root.html)

# Response Evaluation

Evaluating LLM output differs from traditional machine learning, where predictions can be compared directly against labels. LlamaIndex's evaluation modules use a benchmark LLM such as GPT-4 as a judge of answer quality. These modules typically combine the query, the retrieved context, and the response, which minimizes the need for ground-truth labels.

The evaluation modules fall into the following categories:

*   **Faithfulness:** Assesses whether the response remains true to the retrieved contexts, ensuring there's no distortion or "hallucination."
*   **Context Relevancy:** Evaluates the relevance of both the retrieved context and the generated answer to the initial query.
*   **Correctness:** Determines if the generated answer aligns with the reference answer based on the query (this does require labels).

LlamaIndex can also generate questions from your data automatically, which makes it possible to build an evaluation pipeline for a RAG application without hand-written questions.

**Evaluating RAG can be costly because a GPT-4 class model is used as the judge. Keep track of the cost; you can run on less data to reduce it.**

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
# attach to the same event-loop
import nest_asyncio

nest_asyncio.apply()

import logging
import sys

# Set up the root logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # Set logger level to INFO

# Clear out any existing handlers
logger.handlers = []

# Set up the StreamHandler to output to sys.stdout (Colab's output)
handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.INFO)  # Set handler level to INFO

# Add the handler to the logger
logger.addHandler(handler)

In [3]:
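import logging
import sys
import pandas as pd

# Note: the root logger already has a stdout handler from the previous cell, so
# basicConfig() is a no-op here and addHandler() attaches a second handler.
# As a result, every log line in the outputs below is printed twice.
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))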

In [4]:
from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    Response,
)

In [5]:
from llama_index.core.evaluation import (
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    CorrectnessEvaluator,
    RetrieverEvaluator,
    generate_question_context_pairs,
    EmbeddingQAFinetuneDataset
)

In [6]:
from llama_index.llms.openai import OpenAI

In [13]:
from llama_index.core.llama_dataset.generator import RagDatasetGenerator

In [8]:
import os

In [9]:
from dotenv import load_dotenv, find_dotenv
# load_dotenv('D:/.env')
# OPENAI_API_KEY = os.environ['OPENAI_API_KEY']

In [10]:
load_dotenv('/home/santhosh/Projects/courses/Pinnacle/.env')

True
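
A `True` return from `load_dotenv` indicates that the `.env` file was loaded, but it can still be worth confirming that the key itself is present before making paid API calls. The helper below is a minimal sketch; `require_env` is a hypothetical name, and the variable name `OPENAI_API_KEY` is taken from the commented-out line above.

```python
import os


def require_env(name: str, env=os.environ) -> str:
    """Return the value of an environment variable, or raise if it is missing or empty."""
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or export it.")
    return value


# Usage in the notebook (assumes the key is named OPENAI_API_KEY):
# api_key = require_env("OPENAI_API_KEY")
```

Failing early here avoids a less obvious authentication error later in the run.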

### Download Data

In [15]:
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/9607a05a923ddf07deee86a56d386b42943ce381/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

--2024-11-19 16:41:41--  https://raw.githubusercontent.com/run-llama/llama_index/9607a05a923ddf07deee86a56d386b42943ce381/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8000::154, 2606:50c0:8001::154, 2606:50c0:8002::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8000::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’


2024-11-19 16:41:41 (716 KB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]



### Load Data

In [11]:
reader = SimpleDirectoryReader("./data/paul_graham/")
documents = reader.load_data()

### Generate Questions

In [20]:
data_generator = RagDatasetGenerator.from_documents(documents, llm=OpenAI(temperature=0, model="gpt-4o-mini"),
                                                   num_questions_per_chunk=2)

In [21]:
eval_dataset = data_generator.generate_dataset_from_nodes()

HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.co

In [24]:
eval_dataset.examples[0].query

"Describe the author's early experiences with programming on the IBM 1401 and how these experiences influenced their later interest in microcomputers."

In [25]:
eval_dataset.examples[0].reference_answer

"The author's early experiences with programming on the IBM 1401 were marked by a sense of confusion and limitation. At the age of 13 or 14, the author, along with a friend, had the opportunity to use the IBM 1401, which was housed in the basement of their junior high school. They learned to program using an early version of Fortran, but the process was cumbersome, involving the use of punch cards to input data. The author struggled to find meaningful applications for the machine, as they lacked the necessary data and mathematical knowledge to create interesting programs. Their most memorable experience was realizing that programs could run indefinitely without terminating, which highlighted the technical challenges of programming on a non-time-sharing machine.\n\nThese early frustrations with the IBM 1401 contrasted sharply with the later advent of microcomputers, which revolutionized the programming experience. The author was inspired by a friend who built his own microcomputer and w

In [27]:
eval_questions = [example.query for example in eval_dataset.examples]
eval_answers = [example.reference_answer for example in eval_dataset.examples]

In [15]:
len(eval_questions)

20
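
To keep costs down, one option is to evaluate on a reproducible subset of the generated pairs rather than all of them. The sketch below assumes the two lists are aligned by position; `sample_eval_pairs`, the seed, and `k=5` are illustrative choices, not part of LlamaIndex.

```python
import random


def sample_eval_pairs(questions, answers, k, seed=42):
    """Return k aligned (questions, answers) lists chosen reproducibly."""
    if len(questions) != len(answers):
        raise ValueError("questions and answers must have the same length")
    k = min(k, len(questions))
    rng = random.Random(seed)
    # Sort the sampled indices so the subset keeps the original ordering
    indices = sorted(rng.sample(range(len(questions)), k))
    return [questions[i] for i in indices], [answers[i] for i in indices]


# Example with the notebook's lists (k=5 is an arbitrary choice):
# small_questions, small_answers = sample_eval_pairs(eval_questions, eval_answers, k=5)
```

Using a fixed seed means repeated runs score the same questions, so results stay comparable between experiments.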

In [16]:
eval_questions[0]

'What were the two main things that the author worked on before college?'

In [17]:
eval_answers[0]

'The two main things that the author worked on before college were writing and programming.'

For consistency, we fix a single evaluation query and reuse it in the examples that follow.

In [18]:
eval_query = eval_questions[0]

**Check https://openai.com/pricing to choose a less costly LLM variant.**

In [43]:
# LLM intended for generating responses (gpt-4o-mini)
gpt4o_mini = OpenAI(temperature=0, model="gpt-4o-mini")

# Judge LLM for evaluation (the variable is named gpt4, but the model is gpt-4o)
gpt4 = OpenAI(temperature=0, model="gpt-4o")

In [20]:
# create vector index
vector_index = VectorStoreIndex.from_documents(
    documents, llm=OpenAI(temperature=0, model="gpt-4o")
)

# Query engine to generate response
query_engine = vector_index.as_query_engine()

HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


In [21]:
retriever = vector_index.as_retriever(similarity_top_k=3)
nodes = retriever.retrieve(eval_query)

HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
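
Conceptually, `similarity_top_k=3` asks the retriever for the three nodes whose embeddings are most similar to the query embedding. The sketch below illustrates that idea with cosine similarity over toy vectors; it is a simplified illustration, not LlamaIndex's implementation, and the vectors and node names are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, node_vecs, k=3):
    """Return the k (node_id, score) pairs with the highest similarity to the query."""
    scored = [(node_id, cosine_similarity(query_vec, vec)) for node_id, vec in node_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-D "embeddings" (hypothetical values, for illustration only)
toy_nodes = {"n1": [1.0, 0.0], "n2": [0.7, 0.7], "n3": [0.0, 1.0], "n4": [-1.0, 0.0]}
ranked = top_k([1.0, 0.1], toy_nodes, k=3)
```

Raising `k` returns more context for the LLM at the price of including less similar nodes, which matters for the per-node relevancy checks later on.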


In [22]:
from IPython.display import display, HTML
display(HTML(f'<p style="font-size:14px">{nodes[1].get_text()}</p>'))

## Context Relevancy Evaluation

Measures whether the response and its source nodes match the query.

In [23]:
# Create RelevancyEvaluator using the GPT-4o judge LLM
relevancy_evaluator = RelevancyEvaluator(llm=gpt4)

In [35]:
# Generate response
response_vector = query_engine.query(eval_query)

HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


In [40]:
response_vector.response

'The author worked on writing and programming before college.'

In [44]:
print(response_vector.get_formatted_sources())

> Source (Doc id: 7c2aa4f8-b0e0-43de-a06e-8c5464055823): What I Worked On

February 2021

Before college the two main things I worked on, outside of schoo...

> Source (Doc id: 6325d880-f3e2-40ad-a310-f3d058169f24): Grad students could take classes in any department, and my advisor, Tom Cheatham, was very easy g...

> Source (Doc id: a0663a26-51e3-4225-b521-a264a609e368): Now anyone could publish anything.

This had been possible in principle since 1993, but not many ...


In [None]:
# Evaluation
eval_result = relevancy_evaluator.evaluate_response(query=eval_query, response=response_vector)

HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


In [25]:
eval_result.query

'What were the two main things that the author worked on before college?'

In [26]:
eval_result.response

'Writing and programming'

In [27]:
eval_result.passing

True

Relevancy evaluation with multiple source nodes.

In [28]:
# Create Query Engine with similarity_top_k=3
query_engine = vector_index.as_query_engine(similarity_top_k=3)

# Create response
response_vector = query_engine.query(eval_query)

# Evaluate with each source node
eval_source_result_full = [
    relevancy_evaluator.evaluate(
        query=eval_query,
        response=response_vector.response,
        contexts=[source_node.get_content()],
    )
    for source_node in response_vector.source_nodes
]

# Evaluation result
eval_source_result = [
    "Pass" if result.passing else "Fail" for result in eval_source_result_full
]

HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


In [29]:
eval_source_result

['Pass', 'Fail', 'Fail']

## Faithfulness Evaluator

Measures whether the response from a query engine is supported by any of the source nodes. This is useful for detecting hallucinated responses.

In [30]:
faithfulness_evaluator = FaithfulnessEvaluator(llm=gpt4)

In [31]:
eval_result = faithfulness_evaluator.evaluate_response(response=response_vector)

HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


In [32]:
eval_result.feedback

'YES'

In [33]:
eval_result.passing

True

In [34]:
eval_result.score

1.0
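
The three fields above suggest that the judge's YES/NO verdict is mapped to a boolean `passing` flag and a binary score. The function below sketches that mapping under this assumption; `verdict_to_result` is a hypothetical helper, not the library's code.

```python
def verdict_to_result(raw_verdict: str) -> dict:
    """Map a judge's YES/NO text verdict to a passing flag and a binary score."""
    passing = raw_verdict.strip().lower().startswith("yes")
    return {
        "feedback": raw_verdict,
        "passing": passing,
        "score": 1.0 if passing else 0.0,
    }
```

For example, `verdict_to_result("YES")` yields a passing result with a score of 1.0, mirroring the output shown above.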

## Correctness Evaluator

Evaluates the relevance and correctness of a generated answer against a reference answer.

In [45]:
correctness_evaluator = CorrectnessEvaluator(llm=gpt4)

In [46]:
eval_reference_answer = eval_answers[0]

correctness_result = correctness_evaluator.evaluate(
    query=eval_query,
    response=response_vector.response,
    reference=eval_reference_answer,
)

HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


In [47]:
correctness_result.score

5.0

In [48]:
correctness_result.passing

True

In [49]:
correctness_result.feedback

'The generated answer is completely relevant and correct, providing the same information as the reference answer.'
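
The relationship between `score` and `passing` can be sketched as a simple threshold check. This assumes scores on a 1 to 5 scale and a pass threshold of 4.0, which I believe is the default `score_threshold` of `CorrectnessEvaluator`; verify this against your installed version. `is_passing` is a hypothetical helper.

```python
def is_passing(score, threshold=4.0):
    """Return True when a 1-to-5 correctness score meets the threshold.

    Returns None when no score could be parsed from the judge's output.
    The default threshold of 4.0 is an assumption, not a confirmed library value.
    """
    if score is None:
        return None
    return score >= threshold
```

Under this assumption, the score of 5.0 above passes, while a score of 3.5 would not.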

## BatchEvalRunner: Run Evaluations in Batches

In [50]:
from llama_index.core.evaluation import BatchEvalRunner

runner = BatchEvalRunner(
    {
        "faithfulness": faithfulness_evaluator,
        "relevancy": relevancy_evaluator,
        "correctness": correctness_evaluator,
    },
    workers=8,  # number of concurrent workers
)

# `reference` is forwarded to the evaluators; the correctness evaluator needs it
eval_results = await runner.aevaluate_queries(
    query_engine, queries=eval_questions, reference=eval_answers
)

HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.
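
The `workers=8` setting bounds how many evaluation jobs run at the same time. A common way to get that behaviour in asyncio is a semaphore, sketched below as a conceptual illustration rather than `BatchEvalRunner`'s actual implementation; `run_bounded`, `fake_eval`, and `demo` are hypothetical names.

```python
import asyncio


async def run_bounded(jobs, workers=8):
    """Run coroutine factories with at most `workers` in flight; results keep input order."""
    semaphore = asyncio.Semaphore(workers)

    async def guarded(job):
        # Each job waits here until one of the `workers` slots is free
        async with semaphore:
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))


async def fake_eval(i):
    """Hypothetical stand-in for one LLM-judged evaluation."""
    await asyncio.sleep(0.01)
    return i * i


def demo(n=20, workers=8):
    """Evaluate n fake jobs with bounded concurrency and return their results."""
    jobs = [lambda i=i: fake_eval(i) for i in range(n)]
    return asyncio.run(run_bounded(jobs, workers=workers))
```

Inside a notebook, where an event loop is already running, call `await run_bounded(jobs)` directly instead of `demo()`. A lower `workers` value is gentler on API rate limits; a higher one finishes sooner.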

In [51]:
def get_eval_results(key, eval_results):
    results = eval_results[key]
    correct = 0
    for result in results:
        if result.passing:
            correct += 1
    score = correct / len(results)
    print(f"{key} Score: {score}")
    return score

In [52]:
_ = get_eval_results("faithfulness", eval_results)

faithfulness Score: 1.0


In [53]:
_ = get_eval_results("relevancy", eval_results)

relevancy Score: 1.0


In [54]:
_ = get_eval_results("correctness", eval_results)

correctness Score: 0.9
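
The three calls above can also be folded into a single summary. The sketch below relies only on each result exposing a `passing` attribute; `StubResult` is a hypothetical stand-in used to make the example self-contained, and the toy values are made up.

```python
from dataclasses import dataclass


@dataclass
class StubResult:
    """Hypothetical stand-in for an evaluation result; only `passing` is used."""
    passing: bool


def summarize_pass_rates(eval_results):
    """Return {metric_name: pass_rate} for a dict of metric_name -> list of results."""
    summary = {}
    for key, results in eval_results.items():
        if not results:
            summary[key] = None  # nothing to score for this metric
            continue
        summary[key] = sum(1 for r in results if r.passing) / len(results)
    return summary


# Toy data shaped like the batch output (values are made up)
toy = {
    "faithfulness": [StubResult(True)] * 4,
    "correctness": [StubResult(True), StubResult(True), StubResult(True), StubResult(False)],
}
toy_summary = summarize_pass_rates(toy)
```

In the notebook, `summarize_pass_rates(eval_results)` would return one pass rate per evaluator, and the empty-list guard avoids a division by zero when a metric produced no results.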


## Benchmark using [LlamaDatasets](https://llamahub.ai/?tab=llama_datasets).

It is a three-step process:

1. Download the dataset.
2. Build your RAG pipeline.
3. Evaluate using `RagEvaluatorPack`.

In [None]:
# !llamaindex-cli download-llamapack RagEvaluatorPack --download-dir ./rag_evaluator_pack

In [55]:
from llama_index.core.llama_dataset import LabelledRagDataset
from llama_index.packs.rag_evaluator import RagEvaluatorPack
from llama_index.core.llama_pack import download_llama_pack
from llama_index.core import VectorStoreIndex

Download the required files (`rag_dataset.json` and the `source_files` folder) from the link below and place them under `./data/`, matching the paths used in the code that follows.

https://github.com/run-llama/llama-datasets/tree/main/llama_datasets/paul_graham_essay

In [56]:
rag_dataset = LabelledRagDataset.from_json("./data/rag_dataset.json")
documents = SimpleDirectoryReader(input_dir="./data/source_files").load_data()

In [57]:
rag_dataset.to_pandas()

|   | query | reference_contexts | reference_answer | reference_answer_by | query_by |
|---|-------|--------------------|------------------|---------------------|----------|
| 0 | In the essay, the author mentions his early ex... | [What I Worked On\n\nFebruary 2021\n\nBefore c... | The first computer the author used for program... | ai (gpt-4) | ai (gpt-4) |
| 1 | The author switched his major from philosophy ... | [What I Worked On\n\nFebruary 2021\n\nBefore c... | The two specific influences that led the autho... | ai (gpt-4) | ai (gpt-4) |
| 2 | In the essay, the author discusses his initial... | [I couldn't have put this into words when I wa... | The two main influences that initially drew th... | ai (gpt-4) | ai (gpt-4) |
| 3 | The author mentions his shift of interest towa... | [I couldn't have put this into words when I wa... | The author shifted his interest towards Lisp a... | ai (gpt-4) | ai (gpt-4) |
| 4 | In the essay, the author mentions his interest... | [So I looked around to see what I could salvag... | The author in the essay is Paul Graham, who wa... | ai (gpt-4) | ai (gpt-4) |
| 5 | The author discusses his decision to write a b... | [So I looked around to see what I could salvag... | The author decided to write a book on Lisp hac... | ai (gpt-4) | ai (gpt-4) |
| 6 | In the essay, the author mentions a quick deci... | [I didn't want to drop out of grad school, but... | The author decided to attempt writing his diss... | ai (gpt-4) | ai (gpt-4) |
| 7 | The author describes the atmosphere and practi... | [I didn't want to drop out of grad school, but... | According to the author's account, the student... | ai (gpt-4) | ai (gpt-4) |
| 8 | In the essay, the author discusses his experie... | [We actually had one of those little stoves, f... | In the essay, the author explains that paintin... | ai (gpt-4) | ai (gpt-4) |
| 9 | The author shares his work experience at a com... | [We actually had one of those little stoves, f... | Interleaf, the company where the author worked... | ai (gpt-4) | ai (gpt-4) |


In [58]:
rag_dataset.examples[0] #query, reference_answer

LabelledRagDataExample(query='In the essay, the author mentions his early experiences with programming. Describe the first computer he used for programming, the language he used, and the challenges he faced.', query_by=CreatedBy(model_name='gpt-4', type=<CreatedByType.AI: 'ai'>), reference_contexts=['What I Worked On\n\nFebruary 2021\n\nBefore college the two main things I worked on, outside of school, were writing and programming. I didn\'t write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\nThe first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district\'s 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It

In [59]:
# build basic RAG system
index = VectorStoreIndex.from_documents(documents=documents)
query_engine = index.as_query_engine()

HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


In [62]:
rag_evaluator_pack = RagEvaluatorPack(rag_dataset=rag_dataset, query_engine=query_engine, judge_llm=gpt4)

In [64]:
# evaluate using the RagEvaluatorPack
benchmark_df = await rag_evaluator_pack.arun(
    batch_size=20,  # batches the number of openai api calls to make
    sleep_time_in_seconds=1,  # seconds to sleep before making an api call
)

Batch processing of evaluations:   0%|          | 0/8.0 [00:00<?, ?it/s]

HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.

Batch processing of evaluations:  12%|█▎        | 1/8.0 [00:08<01:02,  8.91s/it]

HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.

Batch processing of evaluations:  38%|███▊      | 3/8.0 [00:24<00:39,  7.99s/it]

HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.

Batch processing of evaluations:  62%|██████▎   | 5/8.0 [00:34<00:19,  6.59s/it]

HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.

Batch processing of evaluations:  75%|███████▌  | 6/8.0 [00:46<00:16,  8.06s/it]

HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.

Batch processing of evaluations: 100%|██████████| 8/8.0 [00:59<00:00,  7.30s/it]

HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.

Batch processing of evaluations: 9it [01:09,  8.02s/it]                         

HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.

Batch processing of evaluations: 10it [01:22,  9.10s/it]

HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.

Batch processing of evaluations: 100%|██████████| 8/8.0 [01:33<00:00, 11.64s/it]


In [65]:
benchmark_df

| metrics | base_rag |
|---------|----------|
| mean_correctness_score | 4.170455 |
| mean_relevancy_score | 0.954545 |
| mean_faithfulness_score | 0.954545 |
| mean_context_similarity_score | 0.925233 |


# Retrieval Evaluation

Retrieval evaluation measures the quality of any Retriever module defined in LlamaIndex.

It uses metrics such as hit rate and MRR (Mean Reciprocal Rank), which compare the retrieved results against the ground-truth context for each question. To simplify building the evaluation dataset, we rely on synthetic data generation.

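Hit Rate: the fraction of queries for which the ground-truth node appears among the top-k retrieved nodes.

MRR (Mean Reciprocal Rank): the average over all queries of 1/rank, where rank is the position of the ground-truth node in the retrieved list (the score is 0 if it is not retrieved).

The example below walks through both metrics. A document D is split into nodes N1 to N5, which are indexed. Each (Q, N) pair is a generated question and the node it was generated from. Each retrieval line reads: query -> retriever -> top-3 retrieved nodes -> hit (1 or 0) -> reciprocal rank.

Document -> D

D -> N1, N2, N3, N4, N5 -> Index/ Retriever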

(Q1, N1)
(Q2, N1)
(Q3, N2)
(Q4, N2)
(Q5, N3)
(Q6, N3)
(Q7, N4)
(Q8, N4)
(Q9, N5)
(Q10, N5)

Q1 -> Index/ Retriever -> N2, N1, N3 -> 1 -> 1/2

Q2 -> Index/ Retriever -> N5, N4, N3 -> 0 -> 0

Q3 -> Index/ Retriever -> N1, N2, N3 -> 1 -> 1/2

Q4 -> Index/ Retriever -> N2, N3, N5 -> 1 -> 1/1

Q5 -> Index/ Retriever -> N3, N1, N4 -> 1 -> 1/1

Q6 -> Index/ Retriever -> N1, N2, N3 -> 1 -> 1/3

Q7 -> Index/ Retriever -> N4, N1, N2 -> 1 -> 1/1

Q8 -> Index/ Retriever -> N1, N3, N4 -> 1 -> 1/3

Q9 -> Index/ Retriever -> N2, N3, N4 -> 0 -> 0

Q10 -> Index/ Retriever -> N2, N5, N3 -> 1 -> 1/2

Hit Rate: 8/10 -> 80%

MRR: (0.5 + 0 + 0.5 + 1 + 1 + 0.33 + 1 + 0.33 + 0 + 0.5)/10 = 5.16/10 -> ~52%
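
The worked example can be reproduced with a few lines of plain Python. The sketch below encodes the ground-truth pairs and the top-3 retrieved lists exactly as listed above; the function names are illustrative and are not LlamaIndex APIs.

```python
def reciprocal_rank(expected, retrieved):
    """Return 1/rank of the expected node in the retrieved list, or 0.0 if absent."""
    for rank, node in enumerate(retrieved, start=1):
        if node == expected:
            return 1.0 / rank
    return 0.0


def hit_rate_and_mrr(expected_by_query, retrieved_by_query):
    """Compute (hit_rate, mrr) over all queries."""
    ranks = [
        reciprocal_rank(expected_by_query[q], retrieved_by_query[q])
        for q in expected_by_query
    ]
    hits = sum(1 for rr in ranks if rr > 0)
    return hits / len(ranks), sum(ranks) / len(ranks)


# Ground-truth node for each query, copied from the (Q, N) pairs
expected = {"Q1": "N1", "Q2": "N1", "Q3": "N2", "Q4": "N2", "Q5": "N3",
            "Q6": "N3", "Q7": "N4", "Q8": "N4", "Q9": "N5", "Q10": "N5"}

# Top-3 retrieved nodes for each query, copied from the retrieval lines
retrieved = {
    "Q1": ["N2", "N1", "N3"], "Q2": ["N5", "N4", "N3"], "Q3": ["N1", "N2", "N3"],
    "Q4": ["N2", "N3", "N5"], "Q5": ["N3", "N1", "N4"], "Q6": ["N1", "N2", "N3"],
    "Q7": ["N4", "N1", "N2"], "Q8": ["N1", "N3", "N4"], "Q9": ["N2", "N3", "N4"],
    "Q10": ["N2", "N5", "N3"],
}

hit_rate, mrr = hit_rate_and_mrr(expected, retrieved)
print(f"Hit rate: {hit_rate:.2f}, MRR: {mrr:.4f}")
```

Hit rate only records whether the right node was retrieved at all, while MRR also rewards ranking it near the top, so the two can diverge noticeably for the same retriever.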

In [17]:
from llama_index.core.text_splitter import SentenceSplitter

In [20]:
from llama_index.embeddings.openai import OpenAIEmbedding

In [33]:
reader = SimpleDirectoryReader("./data/paul_graham/")
documents = reader.load_data()

# create parser and parse document into nodes
parser = SentenceSplitter(chunk_size=1024, chunk_overlap=100)
nodes = parser(documents)
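
To build intuition for `chunk_size` and `chunk_overlap`, the sketch below splits a token list with a plain sliding window. This is a simplification: it ignores the sentence boundaries that `SentenceSplitter` tries to respect, and `sliding_chunks` is a hypothetical helper.

```python
def sliding_chunks(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that overlap by chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


# Toy example with small numbers so the overlap is easy to see
toy_chunks = sliding_chunks(list(range(10)), chunk_size=4, chunk_overlap=1)
```

With these toy settings, each chunk repeats the last token of the previous one. The overlap helps keep context that straddles a chunk boundary retrievable from either side.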

In [26]:
vector_index = VectorStoreIndex(nodes, embed_model=OpenAIEmbedding(model='text-embedding-3-small'))

HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


In [31]:
# Define the retriever
retriever = vector_index.as_retriever(similarity_top_k=2)

In [70]:
retrieved_nodes = retriever.retrieve(eval_query)

HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


In [71]:
from llama_index.core.response.notebook_utils import display_source_node

for node in retrieved_nodes:
    display_source_node(node, source_length=2000)

**Node ID:** 4e4f32e4-6318-4cb8-a643-cb93bbf226ea<br>**Similarity:** 0.8373553355002803<br>**Text:** What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.

The language we used was an early version of Fortran. You had to type programs on punch cards, then stack them in the card reader and press a button to load the program into memory and run it. The result would ordinarily be to print something on the spectacularly loud printer.

I was puzzled by the 1401. I couldn't figure out what to do with it. And in retrospect there's not much I could have done with it. The only form of input to programs was data stored on punched cards, and I didn't have any data stored on punched cards. The only other option was to do things that didn't rely on any input, like calculate approximations of pi, but I didn't know enough math to do anything interesting of that type. So I'm not surprised I can't remember any programs I wrote, because they can't have done much. My clearest memory is of the moment I learned it was possible for programs not to terminate, when one of mine didn't. On a machine without time-sharing, this was a social as well as a technical error, as the data center manager's expression made clear.

With microcomputers, everything changed. Now you could h...<br>

**Node ID:** 4f5a4212-9ab4-4e48-a0cb-9cf2e55fb91c<br>**Similarity:** 0.8129028551467912<br>**Text:** I started working on the application builder, Dan worked on network infrastructure, and the two undergrads worked on the first two services (images and phone calls). But about halfway through the summer I realized I really didn't want to run a company — especially not a big one, which it was looking like this would have to be. I'd only started Viaweb because I needed the money. Now that I didn't need money anymore, why was I doing this? If this vision had to be realized as a company, then screw the vision. I'd build a subset that could be done as an open source project.

Much to my surprise, the time I spent working on this stuff was not wasted after all. After we started Y Combinator, I would often encounter startups working on parts of this new architecture, and it was very useful to have spent so much time thinking about it and even trying to write some of it.

The subset I would build as an open source project was the new Lisp, whose parentheses I now wouldn't even have to hide. A lot of Lisp hackers dream of building a new Lisp, partly because one of the distinctive features of the language is that it has dialects, and partly, I think, because we have in our minds a Platonic form of Lisp that all existing dialects fall short of. I certainly did. So at the end of the summer Dan and I switched to working on this new dialect of Lisp, which I called Arc, in a house I bought in Cambridge.

The following spring, lightning struck. I was invited to give a talk at a Lisp conference, so I gave one about how we'd used Lisp at Viaweb. Afterward I put a postscript file of this talk online, on paulgraham.com, which I'd created years before using Viaweb but had never used for anything. In one day it got 30,000 page views. What on earth had happened? The referring urls showed that someone had posted it on Slashdot. [10]

Wow, I thought, there's an audience. If I write something and put it on the web, anyone can read it. That may seem obvious now, but it was surprising then....<br>

In [44]:
qa_dataset = generate_question_context_pairs(nodes[0:2], llm=gpt4, num_questions_per_chunk=2)

  0%|          | 0/2 [00:00<?, ?it/s]

HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


 50%|█████     | 1/2 [00:01<00:01,  1.39s/it]

HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


100%|██████████| 2/2 [00:03<00:00,  1.67s/it]


In [46]:
qa_dataset.queries

{'818acc0e-656a-487f-91ea-b91f2b8aa579': 'Describe the early experiences of the author with programming on the IBM 1401 and how these experiences influenced their understanding of computers. What challenges did they face, and how did the introduction of microcomputers change their approach to programming?',
 '14130ef9-6532-4fc0-bc7c-ef8c30a09e14': "Discuss the author's initial academic interests upon entering college and how their perspective shifted over time. What factors contributed to their decision to switch from studying philosophy to focusing on artificial intelligence?",
 'bf19d1f0-fe2b-4e0b-831e-87592d8ccce4': "Discuss the author's journey in the field of Artificial Intelligence during their time at Cornell and how their perspective on AI evolved during their first year of grad school. What realizations did they come to about the limitations of AI as practiced at the time?",
 'd8739711-46b9-4b28-81d0-26d722545967': "Explain the significance of Lisp in the author's academic and
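The IDs printed above are keys that tie each generated question to the node it was generated from. As a rough sketch with plain dictionaries, the relationship might look like the following. The IDs and texts are made up, the helper `expected_texts` is hypothetical, and the sketch assumes the dataset keeps a `corpus` mapping of node text alongside the `queries` and `relevant_docs` mappings used in this notebook.

```python
# Hypothetical stand-in for a generated question/context dataset.
queries = {
    "q1": "What machine did the author first program?",
    "q2": "Which language was used on that machine?",
}
corpus = {
    "n1": "The first programs I tried writing were on the IBM 1401 ...",
}
# Each query ID maps to the node IDs a retriever is expected to return.
relevant_docs = {"q1": ["n1"], "q2": ["n1"]}


def expected_texts(query_id: str) -> list:
    """Look up the text of every node considered relevant for a query."""
    return [corpus[node_id] for node_id in relevant_docs[query_id]]


for query_id, question in queries.items():
    print(query_id, "|", question, "->", relevant_docs[query_id])
```

Because every question is generated from a known node, the expected retrieval result comes for free, with no manual labeling.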

In [37]:
from llama_index.llms.groq import Groq

In [38]:
groq = Groq(model='llama-3.2-90b-vision-preview')

In [53]:
DEFAULT_QA_GENERATE_PROMPT_TMPL = """\
Context information is below.

---------------------
{context_str}
---------------------

Given the context information and no prior knowledge,
generate only questions based on the below query.

You are a Teacher/Professor. Your task is to set up \
{num_questions_per_chunk} questions for an upcoming \
quiz/examination. The questions should be diverse in nature \
across the document. Restrict the questions to the \
context information provided.

Just provide the questions and nothing else.

"""
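The template above carries two placeholders, `{context_str}` and `{num_questions_per_chunk}`. As a minimal sketch, assuming the generator fills them with ordinary `str.format` substitution, the prompt for a single chunk can be previewed before spending any API calls. The shortened `TEMPLATE` and the `render_prompt` helper are hypothetical and exist only for illustration.

```python
# Hypothetical, shortened template with the same two placeholders as above.
TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Set up {num_questions_per_chunk} questions based only on this context.\n"
)


def render_prompt(template: str, context: str, num_questions: int) -> str:
    """Fill the two placeholders for a single chunk of text."""
    return template.format(
        context_str=context,
        num_questions_per_chunk=num_questions,
    )


prompt = render_prompt(TEMPLATE, "The IBM 1401 read punched cards.", 2)
print(prompt)
```

Any literal braces added to a custom template would need to be doubled (`{{` and `}}`) under this assumption, since single braces are read as placeholders.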

In [56]:
qa_dataset_groq = generate_question_context_pairs(
    nodes[0:3],
    llm=groq,
    num_questions_per_chunk=2,
    qa_generate_prompt_tmpl=DEFAULT_QA_GENERATE_PROMPT_TMPL,
)

  0%|          | 0/3 [00:00<?, ?it/s]

HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


 33%|███▎      | 1/3 [00:00<00:01,  1.19it/s]

HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


 67%|██████▋   | 2/3 [00:01<00:00,  1.13it/s]

HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


100%|██████████| 3/3 [00:02<00:00,  1.24it/s]


In [57]:
qa_dataset_groq.queries

{'36536694-f7c2-40f1-b632-ddc8acc51487': "What was the primary input method for programs on the IBM 1401, and what limitations did this pose for the author's early programming endeavors?",
 'ac57244c-0a0a-436a-bb02-eab7fd6e1a89': "What two influences are mentioned in the text as having sparked the author's interest in pursuing Artificial Intelligence (AI) as a field of study?",
 '7062e6d4-f442-482f-81a6-5f5040758c16': 'What was the primary reason the author was drawn to learning Lisp, and how did it expand their concept of a program?',
 '0530b143-e791-4197-afa3-7345ea93cd74': 'What realization did the author come to during their first year of graduate school regarding the state of Artificial Intelligence at the time, and what implications did this have for their future plans?',
 '00ca0e9f-b720-4bde-8c22-102627c23a3f': 'What motivated the author to consider a career in art, and what realization did they have while visiting the Carnegie Institute?',
 '133d4da1-4ada-4330-b9c6-aa97d5f5b928

In [73]:
queries = qa_dataset.queries.values()
print(list(queries)[:1])

["Describe the author's initial experiences with programming on the IBM 1401. What were some of the challenges he faced and how did the advent of microcomputers change his approach to programming?"]


In [74]:
len(list(queries))

2

In [75]:
retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=retriever
)

In [76]:
# try it out on a sample query
sample_id, sample_query = list(qa_dataset.queries.items())[0]
sample_expected = qa_dataset.relevant_docs[sample_id]

eval_result = retriever_evaluator.evaluate(sample_query, sample_expected)
print(eval_result)

HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Query: Describe the author's initial experiences with programming on the IBM 1401. What were some of the challenges he faced and how did the advent of microcomputers change his approach to programming?
Metrics: {'mrr': 1.0, 'hit_rate': 1.0}
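
The two metrics reported above can be checked by hand. As a minimal sketch, assume the usual definitions: hit rate is 1 if any expected node ID appears among the retrieved IDs, and reciprocal rank is 1 divided by the position of the first expected ID (0 if none is retrieved). The function names and node IDs below are hypothetical, and this is not the library's implementation.

```python
from typing import List


def hit_rate(expected_ids: List[str], retrieved_ids: List[str]) -> float:
    """Return 1.0 if at least one expected ID was retrieved, else 0.0."""
    return 1.0 if any(doc_id in expected_ids for doc_id in retrieved_ids) else 0.0


def reciprocal_rank(expected_ids: List[str], retrieved_ids: List[str]) -> float:
    """Return 1/rank of the first expected ID in the ranking, or 0.0 if absent."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in expected_ids:
            return 1.0 / rank
    return 0.0


expected = ["node-a"]
rankings = [
    ["node-a", "node-b"],  # expected node ranked first
    ["node-b", "node-a"],  # expected node ranked second
    ["node-b", "node-c"],  # expected node missing
]
for retrieved in rankings:
    print(retrieved, hit_rate(expected, retrieved), reciprocal_rank(expected, retrieved))
```

Under these definitions, a result of `{'mrr': 1.0, 'hit_rate': 1.0}` means the expected node was ranked first for the sample query.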



In [77]:
# try it out on an entire dataset
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)

HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


In [78]:
import pandas as pd


def display_results(name, eval_results):
    """Summarize retrieval results as a one-row DataFrame of mean hit rate and MRR."""

    # Collect the per-query metric values, e.g. {"mrr": ..., "hit_rate": ...}.
    metric_dicts = [eval_result.metric_vals_dict for eval_result in eval_results]

    full_df = pd.DataFrame(metric_dicts)

    # Average each metric over all evaluated queries.
    hit_rate = full_df["hit_rate"].mean()
    mrr = full_df["mrr"].mean()

    metric_df = pd.DataFrame(
        {"retrievers": [name], "hit_rate": [hit_rate], "mrr": [mrr]}
    )

    return metric_df

In [79]:
display_results("top-2 eval", eval_results)

|   | retrievers | hit_rate | mrr |
|---|------------|----------|-----|
| 0 | top-2 eval | 1.0      | 1.0 |
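
With only two queries, both answered at rank one, the averages are trivially 1.0. As a small sketch of the same averaging on less uniform input, the example below uses invented per-query values and the standard library in place of pandas.

```python
from statistics import mean

# Invented per-query metrics, in the same shape as the per-query metric dicts.
per_query = [
    {"hit_rate": 1.0, "mrr": 1.0},
    {"hit_rate": 1.0, "mrr": 0.5},
    {"hit_rate": 0.0, "mrr": 0.0},
    {"hit_rate": 1.0, "mrr": 0.5},
]

# Average each metric across queries, as display_results does with DataFrame.mean().
summary = {
    name: mean([row[name] for row in per_query])
    for name in ("hit_rate", "mrr")
}
print(summary)
```

A larger question set gives these averages more resolution, at the price of more LLM and embedding calls.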
