## RAG Evaluation

In [2]:
%load_ext autoreload
%autoreload 2

import sys
import os
from pathlib import Path
import time
import json
from typing import Dict, Any

# Add the src directory to Python path
sys.path.append('../src/')


In [3]:
from simple_rag.evaluation.rag_generator import RAGEvaluator

import time

evaluator = RAGEvaluator(
        collection_name="efficient_rag",
        embedding_model="BAAI/bge-base-en-v1.5",
        llm_model="qwen3:4b-instruct",              # Your RAG's LLM
        judge_llm_model="gemini-1.5-flash", # The LLM that *scores*
        judge_llm_provider="gemini",        # Use Gemini
        retriever_k=5                       # Explicitly set K
    )


JSON_PATH = "../src/simple_rag/evaluation/pair_answers.json"
try:

    evaluator.generate_rag_outputs(JSON_PATH, num_questions=25)


    evaluator.save_rag_outputs("../src/simple_rag/evaluation/pair_answers_rag_rerank_metadata.json")


except Exception as e:
    print(f"Error generating RAG outputs: {str(e)}")
    



Using Gemini API for RAGAS judge: gemini-1.5-flash
Using HuggingFace embeddings for RAGAS: BAAI/bge-base-en-v1.5

[Phase 1] Generating RAG outputs...
  - Current working directory: /home/alvar/CascadeProjects/windsurf-project/RAG/notebooks
  - Benchmark JSON path (provided): ../src/simple_rag/evaluation/pair_answers.json
  ✓ Found benchmark file: /home/alvar/CascadeProjects/windsurf-project/RAG/src/simple_rag/evaluation/pair_answers.json

[Phase 1] Initializing RAG system...
  - Collection: efficient_rag
  - Embedding Model: BAAI/bge-base-en-v1.5
  - LLM Model: qwen3:4b-instruct
  - Retriever K: 5
✓ Qdrant server is already running
Loading reranker model 'mixedbread-ai/mxbai-rerank-base-v1'...
This may take a moment on first run...
Reranker loaded successfully.
✓ Ollama server is already running
LLM loaded successfully
  ✓ RAG system initialized successfully

[Warmup] Warming up RAG system LLM...
Warming up Ollama model 'qwen3:4b-instruct'...
✓ Model warmed up. Response: assi

Now let's try a more powerful model to find out where the bottleneck of our RAG pipeline is.
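One way to read such a comparison, assuming both runs retrieve the same context: questions where the smaller model abstains but the stronger one answers point at generation, while questions where both abstain point at retrieval. The sketch below does that bookkeeping. The record layout (dicts with `question` and `answer` keys) and the helper name `classify_bottleneck` are assumptions for illustration, not the schema of the saved JSON files; the abstention string is the one that appears in the failed answers printed in this notebook.

```python
from typing import Dict, List

ABSTENTION = "I don't know!"  # abstention string seen in this notebook's failed answers


def classify_bottleneck(
    small_run: List[Dict[str, str]],
    strong_run: List[Dict[str, str]],
) -> Dict[str, List[str]]:
    """Group questions by the pipeline stage that most likely failed.

    Both arguments are lists of {"question": ..., "answer": ...} dicts
    (a hypothetical layout, not necessarily the saved JSON schema).
    """
    strong_by_question = {r["question"]: r["answer"] for r in strong_run}
    groups: Dict[str, List[str]] = {"generation": [], "retrieval": [], "ok": []}

    for record in small_run:
        question = record["question"]
        if question not in strong_by_question:
            continue  # no counterpart in the stronger run, so nothing to compare

        small_abstains = record["answer"].strip() == ABSTENTION
        strong_abstains = strong_by_question[question].strip() == ABSTENTION

        if small_abstains and strong_abstains:
            groups["retrieval"].append(question)   # neither model could answer
        elif small_abstains:
            groups["generation"].append(question)  # only the small model failed
        else:
            groups["ok"].append(question)          # the small model gave an answer
    return groups
```

This only separates abstentions; an answer that is present but wrong still needs a metric such as faithfulness or answer relevancy to be caught.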

In [None]:
from simple_rag.evaluation.ragas import RAGEvaluator

import time

evaluator = RAGEvaluator(
        collection_name="efficient_rag",
        embedding_model="BAAI/bge-base-en-v1.5",
        llm_model="qwen2.5:7b-instruct",              # Your RAG's LLM
        judge_llm_model="gemini-1.5-flash", # The LLM that *scores*
        judge_llm_provider="gemini",        # Use Gemini
        retriever_k=5                       # Explicitly set K
    )

GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
JSON_PATH = "../src/simple_rag/evaluation/pair_answers.json"
try:

    
    evaluator.generate_rag_outputs_gemini(JSON_PATH, GOOGLE_API_KEY, model_name="gemini-2.5-flash", temperature=0.0)


    evaluator.save_rag_outputs("../src/simple_rag/evaluation/pair_answers_rag_rerank_gemini.json")


except Exception as e:
    print(f"Error generating RAG outputs: {str(e)}")
    

ModuleNotFoundError: No module named 'simple_rag'

## Evaluation with the DeepEval framework

In [4]:
# Evaluate only the first 15 test cases for quick testing
from simple_rag.evaluation.evaluation import DeepEvalEvaluator


evaluator = DeepEvalEvaluator(
    model_name="qwen2.5:7b-instruct",
    faithfulness_threshold=0.6,
    answer_relevancy_threshold=0.6,
    contextual_relevancy_threshold=0.6,
    include_reason=True
)

evaluator.load_dataset("../src/simple_rag/evaluation/pair_answers_rag_rerank.json")
results = evaluator.evaluate(metrics= ['answer_relevancy'], num_test_cases=15)



DeepEval Evaluator initialized with model: qwen2.5:7b-instruct

Loading dataset from: ../src/simple_rag/evaluation/pair_answers_rag_rerank.json
Absolute path: /home/alvar/CascadeProjects/windsurf-project/RAG/src/simple_rag/evaluation/pair_answers_rag_rerank.json
✓ Loaded 61 test cases

Starting DeepEval Evaluation (SUBSET)
  Model: qwen2.5:7b-instruct
  Total test cases available: 61
  Evaluating: 15 test cases
  Metrics: answer_relevancy



Output()

INFO:deepeval.evaluate.execute:in _a_execute_llm_test_cases






Metrics Summary

  - ✅ Answer Relevancy (score: 1.0, threshold: 0.6, strict: False, evaluation model: qwen2.5:7b-instruct, reason: The score is 1.00 because the response does not contain any irrelevant statements that would lower the score, and it directly addresses the input query about the Vanguard High Dividend Yield Index Fund's inception date., error: None)

For test case:

  - input: When was the Vanguard High Dividend Yield Index Fund inception date?
  - actual output: The Vanguard High Dividend Yield Index Fund's inception date was February 7, 2019.
  - expected output: The Vanguard High Dividend Yield Index Fund was launched on February 7, 2019 (inception date: 02/07/19).
  - context: None
  - retrieval context: ["High Dividend Yield Index Fund\n## Total returns\n\nPeriods ended June 30, 2025\n\n| Total returns   | Quarter   | Year to date   | One year   | Three years   | Five years   | Since inception   |\n|-----------------|-----------|----------------|------------|-----


Evaluation Complete!



In [8]:
results.test_results[0]

TestResult(name='test_case_4', success=True, metrics_data=[MetricData(name='Answer Relevancy', threshold=0.6, success=True, score=1.0, reason='The score is 1.00 because there are no irrelevant statements in the output that would lower it further.', strict_mode=False, evaluation_model='qwen2.5:7b-instruct', error=None, evaluation_cost=None, verbose_logs='Statements:\n[\n    "The total net assets of VHYAX are $14,523MM."\n] \n \nVerdicts:\n[\n    {\n        "verdict": "yes",\n        "reason": null\n    }\n]')], conversational=False, multimodal=False, input='What is the total net assets of VHYAX?', actual_output='The total net assets of VHYAX are $14,523MM.', expected_output='The total net assets of VHYAX is $14,523 million (approximately $14.5 billion).', context=None, retrieval_context=['Value Index Fund\n## Fund facts\n\n| Risk level Low   | Total net assets   | Expense ratio as of 04/29/25   | Ticker symbol   | Turnover rate   | Inception date   |   Fund number |\n|------------------

In [16]:
print("\nFailed Results:")
for test_result in results.test_results:
        if not test_result.success:
            for metric in test_result.metrics_data:
                print(f"\nInput (Question):\n{test_result.input}")
                print(f"\nActual Output (Answer):\n{test_result.actual_output}")
                print(f"\nExpected Output:\n{test_result.expected_output}")
                # --- Access MetricData-level data ---
                print(f"\nMetric: {metric.name}")
                print(f"  Passed: {metric.success}")
                print(f"  Score: {metric.score}")
                print(f"  Threshold: {metric.threshold}")
                print(f"  Reason: {metric.reason}")
                print(f"  Judge Model: {metric.evaluation_model}")
                


Failed Results:

Input (Question):
How did VHYAX perform in 2024?

Actual Output (Answer):
I don't know!

Expected Output:
VHYAX returned 17.59% in 2024.

Metric: Answer Relevancy
  Passed: False
  Score: 0.0
  Threshold: 0.6
  Reason: The score is 0.00 because the output did not provide any information related to VHYAX's performance in 2024, making all content irrelevant.
  Judge Model: qwen2.5:7b-instruct

Input (Question):
What was the one-year return of VHYAX as of June 30, 2025?

Actual Output (Answer):
I don't know!

Expected Output:
The one-year return of VHYAX as of June 30, 2025 was 15.52%.

Metric: Answer Relevancy
  Passed: False
  Score: 0.0
  Threshold: 0.6
  Reason: The score is 0.00 because the output completely failed to address the question about the one-year return of VHYAX as of June 30, 2025, and provided no relevant information.
  Judge Model: qwen2.5:7b-instruct

Input (Question):
What was the VHYAX fund's return in Q2 2025?

Actual Output (Answer):
The fund's re

In [None]:
results = evaluator.evaluate(metrics=['faithfulness'], num_test_cases=15)

print("\nFailed Results:")
for test_result in results.test_results:
    if not test_result.success:
        print(f"\n  [Input]: {test_result.input}")
        # Score, threshold, and reason live on each MetricData entry,
        # not on the TestResult itself.
        for metric in test_result.metrics_data:
            if not metric.success:
                print(f"  [Metric Failed]: {metric.name}")
                print(f"  [Score]: {metric.score}")
                print(f"  [Threshold]: {metric.threshold}")
                print(f"  [Reason]: {metric.reason}")  # The most important part!
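To see at a glance how a run went, the per-test results can also be aggregated per metric. The sketch below relies only on the attributes visible in the `TestResult` output above (`metrics_data`, and on each entry `name`, `success`, `score`). The helper name `summarize_metrics` is hypothetical, and the `SimpleNamespace` objects are stand-ins so the example runs without DeepEval; with a real run you would pass `results.test_results`.

```python
from collections import defaultdict
from types import SimpleNamespace
from typing import Any, Dict, Iterable


def summarize_metrics(test_results: Iterable[Any]) -> Dict[str, Dict[str, float]]:
    """Return the pass rate and mean score for each metric name."""
    totals = defaultdict(lambda: {"count": 0, "passed": 0, "score_sum": 0.0})

    for test_result in test_results:
        for metric in test_result.metrics_data or []:
            entry = totals[metric.name]
            entry["count"] += 1
            entry["passed"] += int(bool(metric.success))
            # Assumption: a missing score (e.g. the judge errored) counts as 0.0.
            entry["score_sum"] += metric.score or 0.0

    return {
        name: {
            "pass_rate": entry["passed"] / entry["count"],
            "mean_score": entry["score_sum"] / entry["count"],
        }
        for name, entry in totals.items()
    }


# Stand-in objects that mimic the shape of results.test_results
demo_results = [
    SimpleNamespace(metrics_data=[
        SimpleNamespace(name="Answer Relevancy", success=True, score=1.0),
    ]),
    SimpleNamespace(metrics_data=[
        SimpleNamespace(name="Answer Relevancy", success=False, score=0.0),
    ]),
]
print(summarize_metrics(demo_results))
```

Counting a missing score as 0.0 is a conservative choice; skipping those entries instead would make the mean reflect only the cases the judge actually scored.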