<a href="https://colab.research.google.com/github/girijesh-ai/llamaIndex-projects/blob/main/Evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Evaluation

Evaluation and benchmarking are central to developing LLM applications. Optimizing an application such as RAG (Retrieval Augmented Generation) requires a reliable way to measure its performance.

LlamaIndex provides modules for assessing the quality of generated outputs, along with modules designed specifically to evaluate retrieval quality. It groups evaluation into two primary types:

*   **Response Evaluation**
*   **Retrieval Evaluation**

[Documentation](https://gpt-index.readthedocs.io/en/latest/core_modules/supporting_modules/evaluation/root.html)

In [None]:
!pip install llama-index

Collecting llama-index
  Downloading llama_index-0.8.53.post3-py3-none-any.whl (794 kB)
Collecting aiostream<0.6.0,>=0.5.2 (from llama-index)
  Downloading aiostream-0.5.2-py3-none-any.whl (39 kB)
Collecting dataclasses-json<0.6.0,>=0.5.7 (from llama-index)
  Downloading dataclasses_json-0.5.14-py3-none-any.whl (26 kB)
Collecting deprecated>=1.2.9.3 (from llama-index)
  Downloading Deprecated-1.2.14-py2.py3-none-any.whl (9.6 kB)
Collecting langchain>=0.0.303 (from llama-index)
  Downloading langchain-0.0.325-py3-none-any.whl (1.9 MB)
Collecting openai>=0.26.4 (from llama-index)
  Downloading openai-0.28.1-py3-none-any.whl (76 kB)


## Response Evaluation

Evaluating LLM output differs from traditional machine learning, where predictions can be compared directly against labels. LlamaIndex's evaluation modules use a benchmark LLM such as GPT-4 to judge answer quality. Most of these modules work from the query, context, and response alone, which reduces the need for ground-truth labels.

The evaluation modules manifest in the following categories:

*   **Faithfulness:** Assesses whether the response remains true to the retrieved contexts, ensuring there's no distortion or "hallucination."
*   **Context Relevancy:** Evaluates the relevance of both the retrieved context and the generated answer to the initial query.
*   **Correctness:** Determines if the generated answer aligns with the reference answer based on the query (this does require labels).
*   **Guideline Adherence:** Examines whether the predicted answer conforms to specific predefined guidelines.

LlamaIndex can also generate questions from your data automatically, which makes it possible to build an evaluation pipeline for a RAG application without writing the questions by hand.

In [None]:
# attach to the same event-loop
import nest_asyncio

nest_asyncio.apply()

import logging
import sys

# Set up the root logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # Set logger level to INFO

# Clear out any existing handlers
logger.handlers = []

# Set up the StreamHandler to output to sys.stdout (Colab's output)
handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.INFO)  # Set handler level to INFO

# Add the handler to the logger
logger.addHandler(handler)

In [None]:
import pandas as pd

# Logging is already configured in the previous cell. Adding a second
# StreamHandler here would print every log message twice.

from llama_index.evaluation import (
    DatasetGenerator,
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    CorrectnessEvaluator,
    GuidelineEvaluator,
    RetrieverEvaluator,
    generate_question_context_pairs,
    EmbeddingQAFinetuneDataset
)

from llama_index import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    ServiceContext,
    LLMPredictor,
    Response,
)

from llama_index.llms import OpenAI
from llama_index.node_parser import SimpleNodeParser

import os
import openai

NumExpr defaulting to 2 threads.


In [None]:
# Read the API key from the environment instead of hard-coding it in the notebook.
openai.api_key = os.environ["OPENAI_API_KEY"]

#### Download Data

In [None]:
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

--2023-10-28 09:32:52--  https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’


2023-10-28 09:32:52 (3.37 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]



#### Load Data

In [None]:
reader = SimpleDirectoryReader("./data/paul_graham/")
documents = reader.load_data()

#### Generate Questions

In [None]:
data_generator = DatasetGenerator.from_documents(documents)
eval_questions = data_generator.generate_questions_from_nodes()

chunk_size_limit is deprecated, please specify chunk_size instead


[nltk_data] Downloading package punkt to /tmp/llama_index...
[nltk_data]   Unzipping tokenizers/punkt.zip.


message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=3069 request_id=5844ee2d051ae232bc779634abbb0e99 response_code=200
message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=3138 request_id=3f787fec87e055fb0e9ea03149b1e1c2 response_code=200
message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=3119 request_id=aa320e02b1b41c341d7a3edc25adb160 response_code=200
message='OpenAI API response' path=https://api

In [None]:
eval_questions

['What were the two main things the author worked on before college?',
 'How did the author describe their early attempts at writing short stories?',
 'What type of computer did the author first work on for programming?',
 'What language did the author use for programming on the IBM 1401?',
 "What was the author's experience with programming on the IBM 1401?",
 'What type of computer did the author eventually convince their father to buy?',
 'Why did the author decide to switch from studying philosophy to AI in college?',
 "What were the two things that influenced the author's interest in AI?",
 'What language did the author focus on during their self-teaching of AI?',
 'Why did the author decide to focus on Lisp instead of AI?',
 'What was the purpose of the entrance exam for studying art in Florence?',
 'How did the author manage to pass the written exam despite limited vocabulary?',
 'What was the arrangement between the students and faculty in the painting department at the Accadem

For consistency, we fix a single evaluation query and reuse it in the cells that follow.

In [None]:
eval_query = 'How did the author describe their early attempts at writing short stories?'

In [None]:
# Use GPT-3.5-Turbo as the LLM that generates responses
gpt35 = OpenAI(temperature=0, model="gpt-3.5-turbo")
service_context_gpt35 = ServiceContext.from_defaults(llm=gpt35)

# Use GPT-4 as the LLM that performs the evaluation
gpt4 = OpenAI(temperature=0, model="gpt-4")
service_context_gpt4 = ServiceContext.from_defaults(llm=gpt4)

In [None]:
# create vector index
vector_index = VectorStoreIndex.from_documents(
    documents, service_context=service_context_gpt35
)

# Query engine to generate response
query_engine = vector_index.as_query_engine()

In [None]:
retriever = vector_index.as_retriever(similarity_top_k=3)
nodes = retriever.retrieve(eval_query)

In [None]:
from IPython.display import display, HTML
display(HTML(f'<p style="font-size:20px">{nodes[1].get_text()}</p>'))

#### Context Relevancy Evaluation

Measures whether the response and its source nodes match the query.

In [None]:
# Create RelevancyEvaluator using GPT-4 LLM
relevancy_evaluator = RelevancyEvaluator(service_context=service_context_gpt4)

In [None]:
# Generate response
response_vector = query_engine.query(eval_query)

# Evaluate the response against the same query that produced it
eval_result = relevancy_evaluator.evaluate_response(
    query=eval_query, response=response_vector
)

message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=668 request_id=5474d1f3d0b2512438324a2b0ed23053 response_code=200


In [None]:
eval_result.query

'How did the author describe their early attempts at writing short stories?'

In [None]:
eval_result.response

'The author described their early attempts at writing short stories as awful. They mentioned that their stories had hardly any plot and mainly focused on characters with strong feelings, which they believed made them deep.'

In [None]:
eval_result.passing

True

Relevancy evaluation with multiple source nodes.

In [None]:
# Create Query Engine with similarity_top_k=3
query_engine = vector_index.as_query_engine(similarity_top_k=3)

# Create response
response_vector = query_engine.query(eval_query)

# Evaluate with each source node
eval_source_result_full = [
    relevancy_evaluator.evaluate(
        query=eval_query,
        response=response_vector.response,
        contexts=[source_node.get_content()],
    )
    for source_node in response_vector.source_nodes
]

# Evaluation result
eval_source_result = [
    "Pass" if result.passing else "Fail" for result in eval_source_result_full
]

message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=491 request_id=7be26f180d7395d81037252dec236c4a response_code=200
message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=941 request_id=151ce961ee2cebc33045717eeeaa3551 response_code=200
message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=781 request_id=db9a818ee0474c4b1f868ed65372fbd5 response_code=200


In [None]:
eval_source_result

['Fail', 'Pass', 'Fail']
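
The per-node verdicts above can be condensed into a single pass rate. The following is a minimal sketch in plain Python that takes the printed list as input; `pass_rate` is a hypothetical helper introduced here for illustration and is not part of LlamaIndex.

```python
def pass_rate(verdicts):
    """Return the fraction of "Pass" entries in a list of "Pass"/"Fail" strings."""
    if not verdicts:
        return 0.0
    return sum(v == "Pass" for v in verdicts) / len(verdicts)


# Verdicts copied from the output above (one per retrieved source node).
eval_source_result = ["Fail", "Pass", "Fail"]
print(f"Pass rate: {pass_rate(eval_source_result):.2f}")
```

Here only one of the three retrieved nodes is judged relevant, which is plausible when the answer to the query sits in a single passage of the essay.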

#### Faithfulness Evaluator

Measures whether the response from a query engine is supported by any of its source nodes. This is useful for detecting hallucinated responses.

In [None]:
faithfulness_evaluator = FaithfulnessEvaluator(service_context=service_context_gpt4)

In [None]:
eval_result = faithfulness_evaluator.evaluate_response(response=response_vector)

message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=1181 request_id=013b8f59b2bd0386e84c11de4d47a325 response_code=200


In [None]:
eval_result

EvaluationResult(query=None, contexts=["[10]\n\nWow, I thought, there's an audience. If I write something and put it on the web, anyone can read it. That may seem obvious now, but it was surprising then. In the print era there was a narrow channel to readers, guarded by fierce monsters known as editors. The only way to get an audience for anything you wrote was to get it published as a book, or in a newspaper or magazine. Now anyone could publish anything.\n\nThis had been possible in principle since 1993, but not many people had realized it yet. I had been intimately involved with building the infrastructure of the web for most of that time, and a writer as well, and it had taken me 8 years to realize it. Even then it took me several years to understand the implications. It meant there would be a whole new generation of essays. [11]\n\nIn the print era, the channel for publishing essays had been vanishingly small. Except for a few officially anointed thinkers who went to the right par

In [None]:
eval_result.passing

True

#### Correctness Evaluator

Evaluates the relevance and correctness of a generated answer against a reference answer.

In [None]:
correctness_evaluator = CorrectnessEvaluator(service_context=service_context_gpt4)

In [None]:
query = (
    "Can you explain the theory of relativity proposed by Albert Einstein in detail?"
)

reference = """
Certainly! Albert Einstein's theory of relativity consists of two main components: special relativity and general relativity. Special relativity, published in 1905, introduced the concept that the laws of physics are the same for all non-accelerating observers and that the speed of light in a vacuum is a constant, regardless of the motion of the source or observer. It also gave rise to the famous equation E=mc², which relates energy (E) and mass (m).

General relativity, published in 1915, extended these ideas to include the effects of gravity. According to general relativity, gravity is not a force between masses, as described by Newton's theory of gravity, but rather the result of the warping of space and time by mass and energy. Massive objects, such as planets and stars, cause a curvature in spacetime, and smaller objects follow curved paths in response to this curvature. This concept is often illustrated using the analogy of a heavy ball placed on a rubber sheet, causing it to create a depression that other objects (representing smaller masses) naturally move towards.

In essence, general relativity provided a new understanding of gravity, explaining phenomena like the bending of light by gravity (gravitational lensing) and the precession of the orbit of Mercury. It has been confirmed through numerous experiments and observations and has become a fundamental theory in modern physics.
"""

response = """
Certainly! Albert Einstein's theory of relativity consists of two main components: special relativity and general relativity. Special relativity, published in 1905, introduced the concept that the laws of physics are the same for all non-accelerating observers and that the speed of light in a vacuum is a constant, regardless of the motion of the source or observer. It also gave rise to the famous equation E=mc², which relates energy (E) and mass (m).

However, general relativity, published in 1915, extended these ideas to include the effects of magnetism. According to general relativity, gravity is not a force between masses but rather the result of the warping of space and time by magnetic fields generated by massive objects. Massive objects, such as planets and stars, create magnetic fields that cause a curvature in spacetime, and smaller objects follow curved paths in response to this magnetic curvature. This concept is often illustrated using the analogy of a heavy ball placed on a rubber sheet with magnets underneath, causing it to create a depression that other objects (representing smaller masses) naturally move towards due to magnetic attraction.
"""

In [None]:
correctness_result = correctness_evaluator.evaluate(
    query=query,
    response=response,
    reference=reference,
)

message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=5841 request_id=6125174196a7946bea6aa8249bc8ff68 response_code=200


In [None]:
correctness_result

EvaluationResult(query='Can you explain the theory of relativity proposed by Albert Einstein in detail?', contexts=None, response="\nCertainly! Albert Einstein's theory of relativity consists of two main components: special relativity and general relativity. Special relativity, published in 1905, introduced the concept that the laws of physics are the same for all non-accelerating observers and that the speed of light in a vacuum is a constant, regardless of the motion of the source or observer. It also gave rise to the famous equation E=mc², which relates energy (E) and mass (m).\n\nHowever, general relativity, published in 1915, extended these ideas to include the effects of magnetism. According to general relativity, gravity is not a force between masses but rather the result of the warping of space and time by magnetic fields generated by massive objects. Massive objects, such as planets and stars, create magnetic fields that cause a curvature in spacetime, and smaller objects foll

In [None]:
correctness_result.score

2.5

In [None]:
correctness_result.passing

False

In [None]:
correctness_result.feedback

'The generated answer is relevant to the user query and starts off correctly by explaining special relativity. However, it contains significant mistakes when explaining general relativity. General relativity is about the warping of space and time by mass and energy, not magnetic fields. The analogy used is also incorrect as it introduces magnets, which are not part of the original concept.'
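
The result above pairs a score of 2.5 with `passing == False`. The following is a minimal sketch of how a pass/fail flag can be derived from a numeric score. It assumes a 1 to 5 scale and a passing threshold of 4.0; both are assumptions that should be checked against the evaluator in your installed version. `is_passing` is a hypothetical helper, not a LlamaIndex function.

```python
def is_passing(score, threshold=4.0):
    """Return True when a correctness score meets the threshold.

    The default threshold of 4.0 is an assumption made for this sketch.
    """
    return score >= threshold


print(is_passing(2.5))  # False
print(is_passing(4.5))  # True
```

Under these assumptions, the response with the magnetism error fails because its score falls well below the threshold, even though the feedback notes that the first half of the answer is correct.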

#### Guideline Evaluator

Evaluates a question-answering system against user-specified guidelines.

In [None]:
GUIDELINES = [
    "The response should fully answer the query.",
    "The response should avoid being vague or ambiguous.",
    "The response should be specific and use statistics or numbers when possible.",
]

In [None]:
evaluators = [
    GuidelineEvaluator(service_context=service_context_gpt4, guidelines=guideline)
    for guideline in GUIDELINES
]

In [None]:
sample_data = {
    "query": "Tell me about global warming.",
    "contexts": [
        "Global warming refers to the long-term increase in Earth's average surface temperature due to human activities such as the burning of fossil fuels and deforestation.",
        "It is a major environmental issue with consequences such as rising sea levels, extreme weather events, and disruptions to ecosystems.",
        "Efforts to combat global warming include reducing carbon emissions, transitioning to renewable energy sources, and promoting sustainable practices.",
    ],
    "response": "Global warming is a critical environmental issue caused by human activities that lead to a rise in Earth's temperature. It has various adverse effects on the planet.",
}

In [None]:
for guideline, evaluator in zip(GUIDELINES, evaluators):
    eval_result = evaluator.evaluate(
        query=sample_data["query"],
        contexts=sample_data["contexts"],
        response=sample_data["response"],
    )
    print("=====")
    print(f"Guideline: {guideline}")
    print(f"Pass: {eval_result.passing}")
    print(f"Feedback: {eval_result.feedback}")

message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=6244 request_id=7989fda2339313abe75537653e3acc1b response_code=200
=====
Guideline: The response should fully answer the query.
Pass: False
Feedback: The response does not fully answer the query. While it does provide a brief overview of global warming, it does not delve into the specifics such as the causes, effects, and potential solutions to global warming. The response should be more detailed and comprehensive to fully answer the query.
message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=5081 request_id=6cac8bf18d0fa819278f24f163ec894c response_code=200

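**Hit Rate:** the fraction of queries for which the expected node appears among the top-k retrieved nodes.

**MRR (Mean Reciprocal Rank):** the average, over all queries, of 1/rank of the expected node in the retrieved list. A query whose expected node is not retrieved contributes 0.

Worked example: a document D is split into five nodes, N1 to N5, which are indexed and served by a retriever. Two questions are generated per node, giving ten (question, expected node) pairs: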

(Q1, N1)
(Q2, N1)
(Q3, N2)
(Q4, N2)
(Q5, N3)
(Q6, N3)
(Q7, N4)
(Q8, N4)
(Q9, N5)
(Q10, N5)

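Retrieving the top 3 nodes for each question gives the results below, written as: question -> retriever -> retrieved nodes -> hit (1 or 0) -> reciprocal rank of the expected node.

Q1 -> Index/ Retriever -> N2, N1, N3 -> 1 -> 1/2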

Q2 -> Index/ Retriever -> N5, N4, N3 -> 0 -> 0

Q3 -> Index/ Retriever -> N1, N2, N3 -> 1 -> 1/2

Q4 -> Index/ Retriever -> N2, N3, N5 -> 1 -> 1/1

Q5 -> Index/ Retriever -> N3, N1, N4 -> 1 -> 1/1

Q6 -> Index/ Retriever -> N1, N2, N3 -> 1 -> 1/3

Q7 -> Index/ Retriever -> N4, N1, N2 -> 1 -> 1/1

Q8 -> Index/ Retriever -> N1, N3, N4 -> 1 -> 1/3

Q9 -> Index/ Retriever -> N2, N3, N4 -> 0 -> 0

Q10 -> Index/ Retriever -> N2, N5, N3 -> 1 -> 1/2

Hit Rate: 8/10 -> 80%

MRR: (0.5 + 0 + 0.5 + 1 + 1 + 0.33 + 1 + 0.33 + 0 + 0.5)/10 -> 0.52
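
The worked example can be reproduced in a few lines of plain Python. This is a sketch for checking the arithmetic, not the LlamaIndex implementation; `hit_rate_and_mrr` and its `k` parameter are hypothetical names introduced here.

```python
# Expected (ground-truth) node for each question, as in the pairs above.
expected = {
    "Q1": "N1", "Q2": "N1", "Q3": "N2", "Q4": "N2", "Q5": "N3",
    "Q6": "N3", "Q7": "N4", "Q8": "N4", "Q9": "N5", "Q10": "N5",
}

# Ranked retrieval results (best match first) for each question.
retrieved = {
    "Q1": ["N2", "N1", "N3"],
    "Q2": ["N5", "N4", "N3"],
    "Q3": ["N1", "N2", "N3"],
    "Q4": ["N2", "N3", "N5"],
    "Q5": ["N3", "N1", "N4"],
    "Q6": ["N1", "N2", "N3"],
    "Q7": ["N4", "N1", "N2"],
    "Q8": ["N1", "N3", "N4"],
    "Q9": ["N2", "N3", "N4"],
    "Q10": ["N2", "N5", "N3"],
}


def hit_rate_and_mrr(expected, retrieved, k=3):
    """Compute hit rate and MRR over the top-k results of every question."""
    hits = 0
    reciprocal_rank_sum = 0.0
    for question, target in expected.items():
        top_k = retrieved[question][:k]
        if target in top_k:
            hits += 1
            # Ranks are 1-based, so the first position has reciprocal rank 1.
            reciprocal_rank_sum += 1.0 / (top_k.index(target) + 1)
    n = len(expected)
    return hits / n, reciprocal_rank_sum / n


hit_rate, mrr = hit_rate_and_mrr(expected, retrieved)
print(f"Hit rate: {hit_rate:.2f}, MRR: {mrr:.4f}")
```

Calling `hit_rate_and_mrr(expected, retrieved, k=2)` shows how a smaller top-k affects both metrics: Q6 and Q8 find their expected node only at rank 3, so they stop counting as hits and the hit rate drops to 0.6.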

## Retrieval Evaluation

Evaluates the quality of any Retriever module defined in LlamaIndex.

To assess a Retriever module, we use metrics such as hit rate and MRR, which compare the retrieved results against the ground-truth context for each question. To simplify building the evaluation dataset, we use synthetic data generation.
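
To make "ground-truth context" concrete, here is a minimal sketch of the data shape involved. It assumes the dataset exposes a `queries` mapping (query ID to question text) and a `relevant_docs` mapping (query ID to a list of expected node IDs), which is how the cells in this section access `qa_dataset`. All IDs, texts, and the `fake_retrieval` stand-in are hypothetical, and `evaluate_query` is not a LlamaIndex function.

```python
# Hypothetical query IDs and question texts, for illustration only.
queries = {
    "q1": "What did the author work on before college?",
    "q2": "Which language was used on the IBM 1401?",
}

# Ground truth: the node IDs each question was generated from.
relevant_docs = {
    "q1": ["node-a"],
    "q2": ["node-b"],
}

# Stand-in for a retriever: maps question text to ranked node IDs.
fake_retrieval = {
    queries["q1"]: ["node-a", "node-c"],
    queries["q2"]: ["node-c", "node-d"],
}


def evaluate_query(query_id):
    """Score one query by comparing retrieved node IDs with the expected ones."""
    expected_ids = set(relevant_docs[query_id])
    ranked_ids = fake_retrieval[queries[query_id]]
    for rank, node_id in enumerate(ranked_ids, start=1):
        if node_id in expected_ids:
            return {"hit_rate": 1.0, "mrr": 1.0 / rank}
    return {"hit_rate": 0.0, "mrr": 0.0}


for query_id in queries:
    print(query_id, evaluate_query(query_id))
```

Because each synthetic question is generated from a known node, the expected node ID comes for free, and no manual labelling is needed.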

In [None]:
reader = SimpleDirectoryReader("./data/paul_graham/")
documents = reader.load_data()

node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)

In [None]:
vector_index = VectorStoreIndex(nodes, service_context=service_context_gpt4)

In [None]:
# Define the retriever
retriever = vector_index.as_retriever(similarity_top_k=2)

In [None]:
retrieved_nodes = retriever.retrieve(eval_query)

In [None]:
from llama_index.response.notebook_utils import display_source_node

for node in retrieved_nodes:
    display_source_node(node, source_length=2000)

**Node ID:** 221d51e9-22fd-4745-af04-89f24da138e9<br>**Similarity:** 0.827003857875047<br>**Text:** What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.

The language we used was an early version of Fortran. You had to type programs on punch cards, then stack them in the card reader and press a button to load the program into memory and run it. The result would ordinarily be to print something on the spectacularly loud printer.

I was puzzled by the 1401. I couldn't figure out what to do with it. And in retrospect there's not much I could have done with it. The only form of input to programs was data stored on punched cards, and I didn't have any data stored on punched cards. The only other option was to do things that didn't rely on any input, like calculate approximations of pi, but I didn't know enough math to do anything interesting of that type. So I'm not surprised I can't remember any programs I wrote, because they can't have done much. My clearest memory is of the moment I learned it was possible for programs not to terminate, when one of mine didn't. On a machine without time-sharing, this was a social as well as a technical error, as the data center manager's expression made clear.

With microcomputers, everything changed. Now you could h...<br>

**Node ID:** 252c8c8a-dafd-4d71-b9a5-baab20a245a4<br>**Similarity:** 0.8202532618916515<br>**Text:** Now they could be, and I was going to write them. [12]

I've worked on several different things, but to the extent there was a turning point where I figured out what to work on, it was when I started publishing essays online. From then on I knew that whatever else I did, I'd always write essays too.

I knew that online essays would be a marginal medium at first. Socially they'd seem more like rants posted by nutjobs on their GeoCities sites than the genteel and beautifully typeset compositions published in The New Yorker. But by this point I knew enough to find that encouraging instead of discouraging.

One of the most conspicuous patterns I've noticed in my life is how well it has worked, for me at least, to work on things that weren't prestigious. Still life has always been the least prestigious form of painting. Viaweb and Y Combinator both seemed lame when we started them. I still get the glassy eye from strangers when they ask what I'm writing, and I explain that it's an essay I'm going to publish on my web site. Even Lisp, though prestigious intellectually in something like the way Latin is, also seems about as hip.

It's not that unprestigious types of work are good per se. But when you find yourself drawn to some kind of work despite its current lack of prestige, it's a sign both that there's something real to be discovered there, and that you have the right kind of motives. Impure motives are a big danger for the ambitious. If anything is going to lead you astray, it will be the desire to impress people. So while working on things that aren't prestigious doesn't guarantee you're on the right track, it at least guarantees you're not on the most common type of wrong one.

Over the next several years I wrote lots of essays about all kinds of different topics. O'Reilly reprinted a collection of them as a book, called Hackers & Painters after one of the essays in it. I also worked on spam filters, and did some more painting. I used to have dinners for a group...<br>

In [None]:
qa_dataset = generate_question_context_pairs(nodes, llm=gpt4, num_questions_per_chunk=2)

100%|██████████| 36/36 [04:17<00:00,  7.16s/it]


In [None]:
queries = qa_dataset.queries.values()
print(list(queries)[50])

"Discuss the initial investment model of Y Combinator (YC) for startups and explain how it was considered fair for both the investors and the founders."


In [None]:
len(list(queries))

72

In [None]:
retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=retriever
)

In [None]:
# try it out on a sample query
sample_id, sample_query = list(qa_dataset.queries.items())[0]
sample_expected = qa_dataset.relevant_docs[sample_id]

eval_result = retriever_evaluator.evaluate(sample_query, sample_expected)
print(eval_result)

message='OpenAI API response' path=https://api.openai.com/v1/embeddings processing_ms=23 request_id=f3e496212e2eb4caa147464c6ab0c169 response_code=200
Query: In the context, the author mentions his early experiences with programming on an IBM 1401. Describe the process he used to write and run a program on this machine, and explain why he found it challenging to create meaningful programs on this system.
Metrics: {'mrr': 1.0, 'hit_rate': 1.0}



In [None]:
# try it out on an entire dataset
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)

message='OpenAI API response' path=https://api.openai.com/v1/embeddings processing_ms=20 request_id=8f8769d000d3d977aeb8922b78f6f0ec response_code=200
message='OpenAI API response' path=https://api.openai.com/v1/embeddings processing_ms=29 request_id=ab82fcb4ebba619cd6057c9abbd5d70e response_code=200
message='OpenAI API response' path=https://api.openai.com/v1/embeddings processing_ms=23 request_id=961ec5e8d6327868053c616d7c2969c2 response_code=200
message='OpenAI API response' path=https://api.openai.com/v1/embeddings processing_ms=39 reque

In [None]:
def display_results(name, eval_results):
    """Display results from evaluate."""

    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)

    full_df = pd.DataFrame(metric_dicts)

    hit_rate = full_df["hit_rate"].mean()
    mrr = full_df["mrr"].mean()

    metric_df = pd.DataFrame(
        {"retrievers": [name], "hit_rate": [hit_rate], "mrr": [mrr]}
    )

    return metric_df

In [None]:
display_results("top-2 eval", eval_results)

|   | retrievers | hit_rate | mrr |
|---|------------|----------|-----|
| 0 | top-2 eval | 0.861111 | 0.791667 |
