In [3]:
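# The imports below use the pre-0.10 llama_index package layout, so pin the version accordingly.
!pip install "llama-index<0.10" python-dotenv --quiet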

In [4]:
# The nest_asyncio module enables the nesting of asynchronous functions within an already running async loop.
# This is necessary because Jupyter notebooks inherently operate in an asynchronous loop.
# By applying nest_asyncio, we can run additional async functions within this existing loop without conflicts.
import nest_asyncio

nest_asyncio.apply()

from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.evaluation import (
    RetrieverEvaluator,
    generate_question_context_pairs,
)
from llama_index.llms import OpenAI
from llama_index.node_parser import SimpleNodeParser

import os
import pandas as pd
from dotenv import load_dotenv

load_dotenv()

True

In [5]:
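# load_dotenv() has already exported the key from .env; fail early with a clear message if it is missing.
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is not set; add it to your .env file"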

In [6]:
!mkdir -p 'data/paul_graham/'
!curl 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -o 'data/paul_graham/paul_graham_essay.txt'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 75042  100 75042    0     0   219k      0 --:--:-- --:--:-- --:--:--  219k


In [9]:
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
documents[0]

Document(id_='d55e6890-8365-4703-971b-13331d19cc8a', embedding=None, metadata={'file_path': 'data/paul_graham/paul_graham_essay.txt', 'file_name': 'paul_graham_essay.txt', 'file_type': 'text/plain', 'file_size': 75042, 'creation_date': '2024-02-09', 'last_modified_date': '2024-02-09', 'last_accessed_date': '2024-02-09'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='\n\nWhat I Worked On\n\nFebruary 2021\n\nBefore college the two main things I worked on, outside of school, were writing and programming. I didn\'t write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\nThe first

In [10]:
llm = OpenAI(model="gpt-4")

node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)
vector_index = VectorStoreIndex(nodes)

In [11]:
query_engine = vector_index.as_query_engine()
response_vector = query_engine.query("What did the author do growing up?")

In [12]:
response_vector.response

'The author wrote short stories and worked on programming, specifically on an IBM 1401 computer in 9th grade.'

In [13]:
response_vector.source_nodes[0].get_text()

'What I Worked On\n\nFebruary 2021\n\nBefore college the two main things I worked on, outside of school, were writing and programming. I didn\'t write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\nThe first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district\'s 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain\'s lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.\n\nThe language we used was an early version of Fortran. You had to type programs on punch cards, then stack

In [14]:
# Second retrieved node
response_vector.source_nodes[1].get_text()

"It felt like I was doing life right. I remember that because I was slightly dismayed at how novel it felt. The good news is that I had more moments like this over the next few years.\n\nIn the summer of 2016 we moved to England. We wanted our kids to see what it was like living in another country, and since I was a British citizen by birth, that seemed the obvious choice. We only meant to stay for a year, but we liked it so much that we still live there. So most of Bel was written in England.\n\nIn the fall of 2019, Bel was finally finished. Like McCarthy's original Lisp, it's a spec rather than an implementation, although like McCarthy's Lisp it's a spec expressed as code.\n\nNow that I could write essays again, I wrote a bunch about topics I'd had stacked up. I kept writing essays through 2020, but I also started to think about other things I could work on. How should I choose what to do? Well, how had I chosen what to work on in the past? I wrote an essay for myself to answer that 

## Evaluation
A RAG system is evaluated in two stages:
- Retrieval Evaluation: This assesses the accuracy and relevance of the information retrieved by the system.
- Response Evaluation: This measures the quality and appropriateness of the responses generated by the system based on the retrieved information.

Evaluating a RAG system requires queries paired with the context that should answer them. LlamaIndex provides the generate_question_context_pairs function, which uses an LLM to generate questions from each chunk. The resulting question-context pairs can be used for both Retrieval and Response Evaluation.
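To make the idea of question-context pairs concrete, here is a minimal sketch of the bookkeeping involved. It is not the LlamaIndex implementation: QAPairs, build_pairs, and fake_ask are hypothetical names, the fake_ask function stands in for the LLM call, and the queries / corpus / relevant_docs layout is an assumption about how such a dataset is typically organised.

```python
import uuid
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class QAPairs:
    """Hypothetical container for generated question-context pairs."""
    queries: Dict[str, str] = field(default_factory=dict)              # query_id -> question text
    corpus: Dict[str, str] = field(default_factory=dict)               # chunk_id -> chunk text
    relevant_docs: Dict[str, List[str]] = field(default_factory=dict)  # query_id -> [chunk_id]


def build_pairs(
    chunks: Dict[str, str],
    ask: Callable[[str, int], List[str]],
    num_questions_per_chunk: int = 2,
) -> QAPairs:
    """Ask for questions about each chunk and record which chunk answers each question."""
    pairs = QAPairs(corpus=dict(chunks))
    for chunk_id, text in chunks.items():
        questions = ask(text, num_questions_per_chunk)[:num_questions_per_chunk]
        for question in questions:
            query_id = str(uuid.uuid4())
            pairs.queries[query_id] = question
            pairs.relevant_docs[query_id] = [chunk_id]
    return pairs


def fake_ask(text: str, n: int) -> List[str]:
    """Stand-in for the LLM call (assumption: a real generator would prompt a model here)."""
    return [f"Question {i + 1} about: {text[:30]}" for i in range(n)]


pairs = build_pairs(
    {"chunk-a": "Before college I worked on writing and programming.",
     "chunk-b": "The first programs I wrote ran on an IBM 1401."},
    fake_ask,
)
print(len(pairs.queries))  # 4: two questions for each of the two chunks
```

Because every question remembers the chunk it was generated from, a retriever can later be scored on whether it brings that chunk back for that question.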

In [15]:
qa_dataset = generate_question_context_pairs(
    nodes,
    llm=llm,
    num_questions_per_chunk=2
)

100%|███████████████████████████████████████████████████████████████████████████████████| 58/58 [07:53<00:00,  8.17s/it]


In [16]:
qa_dataset

EmbeddingQAFinetuneDataset(queries={'34b38e92-bf0f-4d7f-bb8a-5b9e7ae8a846': 'In the context provided, the author mentions his early experiences with programming. Describe the programming environment he used, including the type of computer, the programming language, and the method of input and output.', '2eb05e13-894a-4835-bfac-85cabfa3ad36': "The author mentions that his early attempts at writing short stories were characterized by strong feelings and lack of plot. Based on this, what can you infer about the author's understanding of storytelling at that time?", '1d4cd766-2762-4023-a9b9-5e13d7b26032': "Based on the author's experience, explain the differences between programming on a 1401 machine and a microcomputer like the TRS-80. What were the limitations of the 1401 and how did the TRS-80 overcome them?", 'a5a2712c-271a-4d40-815d-9fba9e960930': 'The author mentions that he wrote a word processor program for his father to write a book. Discuss the constraints of this program and how

### Retrieval Evaluation

In [18]:
retriever = vector_index.as_retriever(similarity_top_k=2)

In [19]:
retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=retriever
)
# Evaluate
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)

In [20]:
def display_results(name, eval_results):
    """Display results from evaluate."""

    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)

    full_df = pd.DataFrame(metric_dicts)

    hit_rate = full_df["hit_rate"].mean()
    mrr = full_df["mrr"].mean()

    metric_df = pd.DataFrame(
        {"Retriever Name": [name], "Hit Rate": [hit_rate], "MRR": [mrr]}
    )

    return metric_df

display_results("OpenAI Embedding Retriever", eval_results)

               Retriever Name  Hit Rate       MRR
0  OpenAI Embedding Retriever  0.767241  0.650862


With similarity_top_k=2, the retriever reaches a hit rate of 0.767241: the expected context appears among the top two results for about 77% of the queries. The MRR of 0.650862 is lower than the hit rate, which means the expected context is sometimes ranked second rather than first, so there is room for improvement in putting the most relevant result at the top. One way to raise MRR is to add a reranker, which reorders the retrieved documents.
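The relationship between the two metrics can be checked with a few lines of plain Python. This is a sketch of the standard definitions, not the library's code. The per-query ranks below are hypothetical: they are inferred from the reported scores under the assumption of 116 queries (58 chunks with 2 questions each), not read from the notebook output.

```python
def hit_rate(ranks):
    """Fraction of queries whose expected chunk was retrieved at all."""
    return sum(r is not None for r in ranks) / len(ranks)


def mrr(ranks):
    """Mean reciprocal rank: 1/rank of the expected chunk, 0 when it was not retrieved."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Assumed outcomes per query: 1-based rank of the expected chunk, None = not in the top 2.
ranks = [1] * 62 + [2] * 27 + [None] * 27

print(round(hit_rate(ranks), 6))  # 0.767241
print(round(mrr(ranks), 6))       # 0.650862
```

Under these assumed counts, roughly 53% of queries have the expected chunk at rank 1 and another 23% at rank 2. A reranker that promoted the rank-2 hits to rank 1 would lift MRR toward the hit rate without changing the hit rate itself.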

## Response Evaluation

- **FaithfulnessEvaluator**: Measures whether the response from a query engine is supported by any of the source nodes, which is useful for detecting hallucinated responses.
- **RelevancyEvaluator**: Measures whether the response and the source nodes together match the query.

In [22]:
# Get the list of queries from the above created dataset
queries = list(qa_dataset.queries.values())
queries

['In the context provided, the author mentions his early experiences with programming. Describe the programming environment he used, including the type of computer, the programming language, and the method of input and output.',
 "The author mentions that his early attempts at writing short stories were characterized by strong feelings and lack of plot. Based on this, what can you infer about the author's understanding of storytelling at that time?",
 "Based on the author's experience, explain the differences between programming on a 1401 machine and a microcomputer like the TRS-80. What were the limitations of the 1401 and how did the TRS-80 overcome them?",
 'The author mentions that he wrote a word processor program for his father to write a book. Discuss the constraints of this program and how it was still an improvement over a typewriter.',
 '"Discuss the author\'s initial perception of philosophy as a field of study and how it changed over time during his college years. What fact

### FaithfulnessEvaluator
We will use gpt-3.5-turbo to generate the response for a given query and gpt-4 to evaluate it.

In [23]:
# gpt-3.5-turbo
gpt35 = OpenAI(temperature=0, model="gpt-3.5-turbo")
service_context_gpt35 = ServiceContext.from_defaults(llm=gpt35)

# gpt-4
gpt4 = OpenAI(temperature=0, model="gpt-4")
service_context_gpt4 = ServiceContext.from_defaults(llm=gpt4)

In [24]:
vector_index = VectorStoreIndex(nodes, service_context=service_context_gpt35)
query_engine = vector_index.as_query_engine()

In [25]:
from llama_index.evaluation import FaithfulnessEvaluator

faithfulness_gpt4 = FaithfulnessEvaluator(service_context=service_context_gpt4)

In [26]:
# evaluate on 1 question
eval_query = queries[10]
eval_query

"Based on the author's experience in grad school, explain his perspective on the limitations of AI as practiced during his time. What led him to conclude that the traditional way of doing AI, with explicit data structures representing concepts, was not going to work?"

In [27]:
response_vector = query_engine.query(eval_query)

# Compute faithfulness evaluation
eval_result = faithfulness_gpt4.evaluate_response(response=response_vector)

eval_result.passing

True

### RelevancyEvaluator

In [28]:
from llama_index.evaluation import RelevancyEvaluator

relevancy_gpt4 = RelevancyEvaluator(service_context=service_context_gpt4)

# Pick a query
query = queries[10]

query

"Based on the author's experience in grad school, explain his perspective on the limitations of AI as practiced during his time. What led him to conclude that the traditional way of doing AI, with explicit data structures representing concepts, was not going to work?"

In [29]:
response_vector = query_engine.query(query)

# Relevancy evaluation
eval_result = relevancy_gpt4.evaluate_response(
    query=query, response=response_vector
)

# check if it passed the evaluation.
eval_result.passing

True

In [30]:
eval_result.feedback

'YES'

### Batch Evaluator
LlamaIndex provides BatchEvalRunner to run multiple evaluations over a set of queries in batches.

In [33]:
from llama_index.evaluation import BatchEvalRunner

# Pick the first 3 queries for evaluation
batch_eval_queries = queries[:3]

# Initialize BatchEvalRunner to compute the faithfulness and relevancy evaluations.
runner = BatchEvalRunner(
    {"faithfulness": faithfulness_gpt4, "relevancy": relevancy_gpt4},
    workers=8,
)

# Compute evaluation
eval_results = await runner.aevaluate_queries(
    query_engine, queries=batch_eval_queries
)

RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4 in organization org-1jrAGLGgy7sBToath5hGlVEC on tokens per min (TPM): Limit 10000, Used 9880, Requested 1405. Please try again in 7.71s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}

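Note: execution stopped here. The batch evaluation above hit the gpt-4 tokens-per-minute rate limit, and further runs were skipped to limit API costs, so the cells below have no output.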
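The 429 error above asks the caller to wait and try again. One common way to handle that is to retry the call with exponential backoff, possibly combined with a smaller workers value. The helper below is a minimal sketch under that assumption: retry_with_backoff and its parameters are hypothetical and not part of LlamaIndex or the OpenAI client, and the commented usage line only reuses names defined earlier in this notebook.

```python
import asyncio
import random


async def retry_with_backoff(
    make_call,
    max_attempts=5,
    base_delay=1.0,
    is_retryable=lambda exc: True,
    sleep=asyncio.sleep,
):
    """Await make_call(); on a retryable error wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return await make_call()
        except Exception as exc:
            # Give up on the last attempt or when the error is not worth retrying.
            if attempt == max_attempts - 1 or not is_retryable(exc):
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            await sleep(delay)


# Possible use in this notebook (not executed here; it would call the paid API):
# eval_results = await retry_with_backoff(
#     lambda: runner.aevaluate_queries(query_engine, queries=batch_eval_queries),
#     base_delay=8.0,  # the error message suggested waiting about 8 seconds
# )
```

Passing a callable rather than a coroutine matters here: a coroutine object can only be awaited once, so each retry has to create a fresh one. Note that a retry reruns the whole batch, so queries that already succeeded are billed again.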

In [None]:
faithfulness_score = sum(result.passing for result in eval_results['faithfulness']) / len(eval_results['faithfulness'])
faithfulness_score

In [None]:
relevancy_score = sum(result.passing for result in eval_results['relevancy']) / len(eval_results['relevancy'])
relevancy_score

A faithfulness score of 1.0 would signify that the generated answers contain no hallucinations and are based entirely on the retrieved context.

A relevancy score of 1.0 would suggest that the generated answers are consistently aligned with both the queries and the retrieved context.
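Both scores are pass rates: the fraction of evaluation results whose passing flag is true. The sketch below computes that for every evaluator in one go. The pass_rates helper is a hypothetical name, and the data is a stand-in built with SimpleNamespace rather than real evaluation output, since the batch run above did not complete.

```python
from types import SimpleNamespace


def pass_rates(eval_results):
    """Return {evaluator_name: fraction of results whose .passing is truthy}."""
    return {
        name: sum(bool(r.passing) for r in results) / len(results)
        for name, results in eval_results.items()
        if results  # skip evaluators with no results to avoid dividing by zero
    }


# Stand-in data shaped like a dict of result lists keyed by evaluator name (made-up values).
fake_results = {
    "faithfulness": [SimpleNamespace(passing=True)] * 3,
    "relevancy": [
        SimpleNamespace(passing=True),
        SimpleNamespace(passing=True),
        SimpleNamespace(passing=False),
    ],
}
print(pass_rates(fake_results))
```

With only three queries in the batch, each score can take just four values (0, 1/3, 2/3, 1), so a score of 1.0 on such a small sample is weak evidence on its own.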