Evaluation in LlamaIndex is the process of assessing the performance and quality of retrieval-augmented generation (RAG) systems. It is how you measure, and then improve, an LLM application's retrieval accuracy and response quality. LlamaIndex provides two main types of evaluation:
1. Response Evaluation
2. Retrieval Evaluation

# Response Evaluation


Response evaluation assesses the quality of generated answers. LlamaIndex offers several evaluation modules, most of them LLM-based, including:
- Correctness
- Semantic Similarity
- Faithfulness
- Context Relevancy
- Answer Relevancy
- Guideline Adherence

Here's an example of how to use the Faithfulness evaluator:

In [None]:
from llama_index.core import VectorStoreIndex
from llama_index.llms.openai import OpenAI
from llama_index.core.evaluation import FaithfulnessEvaluator

# Create LLM
llm = OpenAI(model="gpt-4", temperature=0.0)

# Build index (assuming it's already created)
vector_index = VectorStoreIndex(...)

# Define evaluator
evaluator = FaithfulnessEvaluator(llm=llm)

# Query index
query_engine = vector_index.as_query_engine()
response = query_engine.query(
    "What battles took place in New York City in the American Revolution?"
)

# Evaluate response
eval_result = evaluator.evaluate_response(response=response)
print(eval_result.passing)

# The evaluator checks whether the response is supported by the retrieved context,
# which helps detect hallucinations.

# Retrieval Evaluation

Retrieval evaluation focuses on assessing the relevance and accuracy of retrieved sources. LlamaIndex provides metrics such as:
- Mean Reciprocal Rank (MRR)
- Hit Rate
- Precision
- Recall
- Average Precision (AP)
- Normalized Discounted Cumulative Gain (NDCG)

The example below evaluates a retriever on a single query using four of these metrics:

In [None]:
from llama_index.core.evaluation import RetrieverEvaluator

# Define retriever (assuming it's already created)
retriever = ...

# Create evaluator with specific metrics
retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate", "precision", "recall"], retriever=retriever
)

# Evaluate a single query
eval_result = retriever_evaluator.evaluate(
    query="Sample query", expected_ids=["node_id1", "node_id2"]
)
print(eval_result)

# This evaluation compares the retrieved node IDs against the expected relevant IDs
# and reports one score per requested metric.
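
To make the metrics listed above concrete, here is a minimal pure-Python sketch that computes them for one query under binary relevance (a retrieved ID is either in the expected set or not). It follows the common textbook definitions and is not LlamaIndex's own implementation, whose normalization choices may differ. The function name `retrieval_metrics` and the sample node IDs are hypothetical.

```python
import math
from typing import Dict, List


def retrieval_metrics(retrieved_ids: List[str], expected_ids: List[str]) -> Dict[str, float]:
    """Textbook ranking metrics for one query (hypothetical helper).

    Assumes retrieved_ids is ordered best-first and contains no duplicates.
    """
    expected = set(expected_ids)
    if not retrieved_ids or not expected:
        raise ValueError("retrieved_ids and expected_ids must be non-empty")

    # 1 where the result at that rank is relevant, 0 otherwise
    hits = [1 if doc_id in expected else 0 for doc_id in retrieved_ids]
    num_hits = sum(hits)

    # Reciprocal rank of the first relevant result (0.0 if there is none)
    reciprocal_rank = 0.0
    # Average precision: precision@k summed over the ranks k that hold a
    # relevant result, divided by the number of expected results
    running_hits = 0
    precision_sum = 0.0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            if reciprocal_rank == 0.0:
                reciprocal_rank = 1.0 / rank
            running_hits += 1
            precision_sum += running_hits / rank

    # NDCG: discounted gain of the actual ranking over that of an ideal ranking
    dcg = sum(hit / math.log2(rank + 1) for rank, hit in enumerate(hits, start=1))
    ideal_hits = min(len(expected), len(retrieved_ids))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))

    return {
        "hit_rate": 1.0 if num_hits else 0.0,
        "mrr": reciprocal_rank,
        "precision": num_hits / len(retrieved_ids),
        "recall": num_hits / len(expected),
        "ap": precision_sum / len(expected),
        "ndcg": dcg / idcg,
    }


# Two relevant nodes, retrieved at ranks 2 and 4
metrics = retrieval_metrics(["n3", "n1", "n7", "n2"], ["n1", "n2"])
print(metrics)
```

For a single query the "mean" in MRR is just the reciprocal rank; averaging these per-query values over a query set gives the dataset-level figures.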

# Batch Evaluation

For evaluating multiple queries, LlamaIndex offers batch evaluation capabilities:

In [None]:
from llama_index.core.evaluation import BatchEvalRunner

# Define evaluation metrics (assuming they're already created)
faithfulness_evaluator = ...
relevancy_evaluator = ...

# Create BatchEvalRunner
runner = BatchEvalRunner(
    {"faithfulness": faithfulness_evaluator, "relevancy": relevancy_evaluator},
    workers=8,  # Number of parallel workers
)

# Evaluate multiple queries (top-level `await` works in a notebook cell).
# `query_engine` comes from the earlier example; `batch_eval_queries` is a
# list of query strings you supply.
eval_results = await runner.aevaluate_queries(query_engine, queries=batch_eval_queries)

# Calculate scores
faithfulness_score = sum(
    result.passing for result in eval_results["faithfulness"]
) / len(eval_results["faithfulness"])
relevancy_score = sum(
    result.passing for result in eval_results["relevancy"]
) / len(eval_results["relevancy"])

# Each score is the fraction of queries whose response passed that evaluator,
# giving an overall view of faithfulness and relevancy across the query set.
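
The averaging above can be wrapped in a small helper that works for any number of evaluators. The sketch below assumes that `passing` may be `None` for a result the evaluator could not judge, in which case a plain `sum` would fail, so those results are skipped. `StubResult` is a hypothetical stand-in that models only the `passing` field, and `pass_rates` is a hypothetical name.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class StubResult:
    """Hypothetical stand-in for an evaluation result; only `passing` is modeled."""

    passing: Optional[bool]


def pass_rates(eval_results: Dict[str, List[StubResult]]) -> Dict[str, float]:
    """Return the fraction of judged results that passed, per evaluator name."""
    rates: Dict[str, float] = {}
    for name, results in eval_results.items():
        # Skip results without a verdict instead of counting them as failures
        judged = [r for r in results if r.passing is not None]
        if judged:
            rates[name] = sum(r.passing for r in judged) / len(judged)
        else:
            rates[name] = math.nan  # no verdicts at all for this evaluator
    return rates


# Same shape as the dict returned by the batch run: evaluator name -> results
sample_results = {
    "faithfulness": [StubResult(True), StubResult(True), StubResult(False)],
    "relevancy": [StubResult(True), StubResult(None)],
}
rates = pass_rates(sample_results)
print(rates)
```

Whether to skip unjudged results or count them as failures is a design choice; skipping keeps the score from being dragged down by evaluator errors, but the number of skipped results is worth reporting alongside it.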

LlamaIndex's evaluation modules allow developers to:
- Measure the accuracy and relevance of retrieved information
- Assess the quality and faithfulness of generated responses
- Identify areas for improvement in RAG systems
- Benchmark different retrieval and generation strategies