# Evaluating LLM, RAG, Agents


## Loading Documents


In [None]:
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("./data/samples/").load_data()
len(documents)

12

## Initialization


In [3]:
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

from ragas.testset import TestsetGenerator
from ragas.llms import LlamaIndexLLMWrapper
from ragas.embeddings import LlamaIndexEmbeddingsWrapper

from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Build the test set generator from LlamaIndex-wrapped OpenAI models
generator_llm = OpenAI(model="gpt-4o-mini")
embeddings = OpenAIEmbedding(model="text-embedding-3-small")

generator = TestsetGenerator.from_llama_index(
    llm=generator_llm,
    embedding_model=embeddings,
)

In [4]:
testset = generator.generate_with_llamaindex_docs(
    documents,
    testset_size=5,
)

Applying HeadlinesExtractor:   0%|          | 0/8 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/12 [00:00<?, ?it/s]

unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node


Applying SummaryExtractor:   0%|          | 0/11 [00:00<?, ?it/s]

Property 'summary' already exists in node 'ffe5c6'. Skipping!
Property 'summary' already exists in node '4ed724'. Skipping!
Property 'summary' already exists in node '86d33c'. Skipping!


Applying CustomNodeFilter:   0%|          | 0/16 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/43 [00:00<?, ?it/s]

Property 'summary_embedding' already exists in node 'ffe5c6'. Skipping!
Property 'summary_embedding' already exists in node '86d33c'. Skipping!
Property 'summary_embedding' already exists in node '4ed724'. Skipping!


Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/6 [00:00<?, ?it/s]

In [5]:
df = testset.to_pandas()
df.head()

|  | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | What GitLab say about being ally and how it re... | [--- title: "The Ally Lab" description: Learn ... | At GitLab, it is required to be inclusive, but... | single_hop_specifc_query_synthesizer |
| 1 | How can Zoom be utilized to promote allyship i... | [Skills and Behaviors of allies To be an effec... | Zoom can be utilized to promote allyship by pr... | single_hop_specifc_query_synthesizer |
| 2 | How can company engagement surveys be utilized... | [<1-hop>\n\n--- title: "Building an Inclusive ... | Company engagement surveys can be utilized to ... | multi_hop_abstract_query_synthesizer |
| 3 | What are the goals of the Privilege for Sale a... | [<1-hop>\n\n--- title: "Roundtables" descripti... | The goals of the Privilege for Sale activity i... | multi_hop_abstract_query_synthesizer |
| 4 | What role does Marina Brownrigg play in the DI... | [<1-hop>\n\nDIB Monthly Initiatives Call We ho... | Marina Brownrigg serves as the Directly Respon... | multi_hop_specific_query_synthesizer |


## Build a `QueryEngine`


In [6]:
from llama_index.core import VectorStoreIndex

vector_index = VectorStoreIndex.from_documents(documents)

query_engine = vector_index.as_query_engine()

In [9]:
response_vector = query_engine.query(df["user_input"][0])

print(response_vector)

Being an ally at GitLab involves taking proactive and purposeful action to support marginalized groups and remove barriers that hinder individuals from contributing their skills and talents in the workplace or community. It goes beyond just being inclusive and requires individuals to actively educate themselves about the experiences and struggles of others. GitLab emphasizes the importance of skills and behaviors such as active listening, empathy, active learning about other experiences, humility, courage, and self-awareness in order to effectively support and advocate for marginalized groups. By embodying these qualities and actively engaging in allyship, individuals can contribute to creating a more diverse, inclusive, and supportive workplace environment.


## Evaluating the `QueryEngine`


In [11]:
# Import the Ragas metrics used to score the query engine
from ragas.metrics import (
    ContextPrecision,
    ContextRecall,
    Faithfulness,
    AnswerRelevancy,
    AnswerCorrectness
)

# Initialize each metric with a stronger evaluator LLM (gpt-4o)
from ragas.llms import LlamaIndexLLMWrapper

evaluator_llm = LlamaIndexLLMWrapper(OpenAI(model="gpt-4o"))
metrics = [
    Faithfulness(llm=evaluator_llm),
    AnswerRelevancy(llm=evaluator_llm),
    ContextPrecision(llm=evaluator_llm),
    ContextRecall(llm=evaluator_llm),
    AnswerCorrectness(llm=evaluator_llm)
]

In [12]:
# Convert the generated test set to a Ragas EvaluationDataset
ragas_dataset = testset.to_evaluation_dataset()
ragas_dataset

EvaluationDataset(features=['user_input', 'reference_contexts', 'reference'], len=6)

In [13]:
from ragas.integrations.llama_index import evaluate

result = evaluate(
    query_engine=query_engine,
    metrics=metrics,
    dataset=ragas_dataset,
)

Running Query Engine:   0%|          | 0/6 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/30 [00:00<?, ?it/s]

In [16]:
from pprint import pprint
pprint(result, indent=4)

{'faithfulness': 0.8778, 'answer_relevancy': 0.9574, 'context_precision': 1.0000, 'context_recall': 1.0000, 'answer_correctness': 0.5576}


In [17]:
result.to_pandas()

|  | user_input | retrieved_contexts | reference_contexts | response | reference | faithfulness | answer_relevancy | context_precision | context_recall | answer_correctness |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | What GitLab say about being ally and how it re... | [---\ntitle: "The Ally Lab"\ndescription: Lear... | [--- title: "The Ally Lab" description: Learn ... | Being an ally at GitLab involves taking proact... | At GitLab, it is required to be inclusive, but... | 1.0 | 0.91789 | 1.0 | 1.0 | 0.560615 |
| 1 | How can Zoom be utilized to promote allyship i... | [Teach people how to disagree, set the expecta... | [Skills and Behaviors of allies To be an effec... | Zoom can be utilized to promote allyship in di... | Zoom can be utilized to promote allyship by pr... | 1.0 | 1.0 | 1.0 | 1.0 | 0.527588 |
| 2 | How can company engagement surveys be utilized... | [---\ntitle: "Building an Inclusive Remote Cul... | [<1-hop>\n\n--- title: "Building an Inclusive ... | Company engagement surveys can be utilized to ... | Company engagement surveys can be utilized to ... | 0.933333 | 0.952433 | 1.0 | 1.0 | 0.54058 |
| 3 | What are the goals of the Privilege for Sale a... | [A DIB Team Member will set up a time to discu... | [<1-hop>\n\n--- title: "Roundtables" descripti... | The goals of the Privilege for Sale activity i... | The goals of the Privilege for Sale activity i... | 1.0 | 0.982884 | 1.0 | 1.0 | 0.748442 |
| 4 | What role does Marina Brownrigg play in the DI... | [---\ntitle: Diversity Inclusion & Belonging C... | [<1-hop>\n\nDIB Monthly Initiatives Call We ho... | Marina Brownrigg serves as the DRI (Directly R... | Marina Brownrigg serves as the Directly Respon... | 0.333333 | 0.891996 | 1.0 | 1.0 | 0.510638 |
| 5 | What are some essential skills and strategies ... | [--- One of the mistakes that often happens he... | [<1-hop>\n\nWhat it means to be an ally - Take... | Some essential skills and strategies for being... | To be an effective ally, it is essential to id... | 1.0 | 0.998924 | 1.0 | 1.0 | 0.457451 |
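
The aggregate scores printed earlier can be related back to the per-row scores. The sketch below is not part of the original run: it copies the per-row values by hand and assumes the aggregate is the unweighted mean of each column, which is consistent with the numbers shown here. `rows_below` is a hypothetical helper added only for illustration.

```python
from statistics import mean

# Per-row scores copied by hand from the result.to_pandas() output.
per_row = {
    "faithfulness": [1.0, 1.0, 0.933333, 1.0, 0.333333, 1.0],
    "answer_relevancy": [0.91789, 1.0, 0.952433, 0.982884, 0.891996, 0.998924],
    "context_precision": [1.0] * 6,
    "context_recall": [1.0] * 6,
    "answer_correctness": [0.560615, 0.527588, 0.54058, 0.748442, 0.510638, 0.457451],
}

# Assumption: each aggregate score is the plain mean of its column.
aggregate = {name: round(mean(scores), 4) for name, scores in per_row.items()}


def rows_below(scores, threshold):
    """Hypothetical helper: indices of rows scoring under the threshold."""
    return [i for i, score in enumerate(scores) if score < threshold]


low_faithfulness = rows_below(per_row["faithfulness"], 0.5)

print(aggregate)
print(low_faithfulness)
```

Under that assumption the means reproduce the aggregate dictionary, and the helper points at row 4 (the Marina Brownrigg question) as the one sample pulling faithfulness down.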


## Questions: Single-Hop vs. Multi-Hop


In [22]:
from ragas.testset.synthesizers.single_hop.specific import SingleHopSpecificQuerySynthesizer
from ragas.testset.synthesizers.multi_hop.specific import MultiHopSpecificQuerySynthesizer
from ragas.testset.synthesizers.multi_hop.abstract import MultiHopAbstractQuerySynthesizer

In [23]:
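# Generate one 30-question test set per synthesizer; a weight of 1.0 means
# every question in that set comes from the single listed synthesizer.
single_hop_testset = generator.generate_with_llamaindex_docs(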
    documents=documents,
    testset_size=30,
    query_distribution=[(SingleHopSpecificQuerySynthesizer(name="single_hop_specific"), 1.0)]
)
multi_hop_specific_testset = generator.generate_with_llamaindex_docs(
    documents=documents,
    testset_size=30,
    query_distribution=[(MultiHopSpecificQuerySynthesizer(name="multi_hop_specific"), 1.0)]
)
multi_hop_abstract_testset = generator.generate_with_llamaindex_docs(
    documents=documents,
    testset_size=30,
    query_distribution=[(MultiHopAbstractQuerySynthesizer(name="multi_hop_abstract"), 1.0)]
)


Applying HeadlinesExtractor:   0%|          | 0/8 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/12 [00:00<?, ?it/s]

unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node


Applying SummaryExtractor:   0%|          | 0/11 [00:00<?, ?it/s]

Property 'summary' already exists in node '59c3f4'. Skipping!
Property 'summary' already exists in node 'bc5355'. Skipping!
Property 'summary' already exists in node 'b0f265'. Skipping!


Applying CustomNodeFilter:   0%|          | 0/16 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/43 [00:00<?, ?it/s]

Property 'summary_embedding' already exists in node 'bc5355'. Skipping!
Property 'summary_embedding' already exists in node 'b0f265'. Skipping!
Property 'summary_embedding' already exists in node '59c3f4'. Skipping!


Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/1 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/30 [00:00<?, ?it/s]

Applying HeadlinesExtractor:   0%|          | 0/8 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/12 [00:00<?, ?it/s]

unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node


Applying SummaryExtractor:   0%|          | 0/11 [00:00<?, ?it/s]

Property 'summary' already exists in node 'a047b3'. Skipping!
Property 'summary' already exists in node '59f5dd'. Skipping!
Property 'summary' already exists in node '09439b'. Skipping!


Applying CustomNodeFilter:   0%|          | 0/16 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/43 [00:00<?, ?it/s]

Property 'summary_embedding' already exists in node 'a047b3'. Skipping!
Property 'summary_embedding' already exists in node '09439b'. Skipping!
Property 'summary_embedding' already exists in node '59f5dd'. Skipping!


Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/1 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/30 [00:00<?, ?it/s]

Applying HeadlinesExtractor:   0%|          | 0/8 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/12 [00:00<?, ?it/s]

unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node


Applying SummaryExtractor:   0%|          | 0/11 [00:00<?, ?it/s]

Property 'summary' already exists in node 'b15a3d'. Skipping!
Property 'summary' already exists in node 'e031ab'. Skipping!
Property 'summary' already exists in node '232844'. Skipping!


Applying CustomNodeFilter:   0%|          | 0/16 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/43 [00:00<?, ?it/s]

Property 'summary_embedding' already exists in node '232844'. Skipping!
Property 'summary_embedding' already exists in node 'e031ab'. Skipping!
Property 'summary_embedding' already exists in node 'b15a3d'. Skipping!


Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/1 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/30 [00:00<?, ?it/s]

In [25]:
type(single_hop_testset)

ragas.testset.synthesizers.testset_schema.Testset

In [26]:
result = evaluate(
    query_engine=query_engine,
    metrics=metrics,
    dataset=single_hop_testset.to_evaluation_dataset(),
)
pprint(result, indent=4)

Running Query Engine:   0%|          | 0/30 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/150 [00:00<?, ?it/s]

{'faithfulness': 0.8113, 'answer_relevancy': 0.9375, 'context_precision': 0.8167, 'context_recall': 0.7389, 'answer_correctness': 0.4814}


In [27]:
result = evaluate(
    query_engine=query_engine,
    metrics=metrics,
    dataset=multi_hop_specific_testset.to_evaluation_dataset(),
)
pprint(result, indent=4)

Running Query Engine:   0%|          | 0/30 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/150 [00:00<?, ?it/s]

{'faithfulness': 0.8234, 'answer_relevancy': 0.9598, 'context_precision': 0.8500, 'context_recall': 0.7911, 'answer_correctness': 0.5685}


In [28]:
result = evaluate(
    query_engine=query_engine,
    metrics=metrics,
    dataset=multi_hop_abstract_testset.to_evaluation_dataset(),
)
pprint(result, indent=4)

Running Query Engine:   0%|          | 0/30 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/150 [00:00<?, ?it/s]

{'faithfulness': 0.7602, 'answer_relevancy': 0.9717, 'context_precision': 0.9833, 'context_recall': 0.8500, 'answer_correctness': 0.5806}
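
To read the three runs side by side, the aggregate scores can be placed in one small table. This is a minimal sketch using only the standard library. The numbers are copied by hand from the three outputs above, and the dictionary keys reuse the synthesizer names passed to `query_distribution`. The `weakest` mapping is an illustrative addition, not something Ragas returns.

```python
# Aggregate scores copied by hand from the three evaluation runs.
results = {
    "single_hop_specific": {
        "faithfulness": 0.8113, "answer_relevancy": 0.9375,
        "context_precision": 0.8167, "context_recall": 0.7389,
        "answer_correctness": 0.4814,
    },
    "multi_hop_specific": {
        "faithfulness": 0.8234, "answer_relevancy": 0.9598,
        "context_precision": 0.8500, "context_recall": 0.7911,
        "answer_correctness": 0.5685,
    },
    "multi_hop_abstract": {
        "faithfulness": 0.7602, "answer_relevancy": 0.9717,
        "context_precision": 0.9833, "context_recall": 0.8500,
        "answer_correctness": 0.5806,
    },
}

metric_names = list(next(iter(results.values())))

# One row per metric, one column per test set.
print(f"{'metric':<20}" + "".join(f"{name:>22}" for name in results))
for metric in metric_names:
    cells = "".join(f"{scores[metric]:>22.4f}" for scores in results.values())
    print(f"{metric:<20}{cells}")

# For each metric, the test set with the lowest score.
weakest = {
    metric: min(results, key=lambda name: results[name][metric])
    for metric in metric_names
}
print(weakest)
```

In this run the single-hop set scores lowest on every metric except faithfulness, where the multi-hop abstract set is lowest. With 30 samples per set and LLM-judged metrics, small gaps may not be meaningful.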
