# 📊 RAG FITNESS - SYSTEM EVALUATION

**Purpose:** Evaluate retrieval quality and system performance

**Input:** Golden Dataset (20 question-answer pairs)

**Output:** Metrics report (Recall@5, MRR, Precision@5)

**Run time:** ~5 minutes

---

## 📋 Evaluation Strategy

**Retrieval Metrics (Automated):**
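- Recall@5: Does at least one relevant document appear in the top 5? (per-query hit, averaged over questions)
- MRR: Mean of 1/rank of the first relevant document (0 when none is retrieved)
- Precision@5: Share of the top 5 results that come from a relevant document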

**Generation Quality (Manual):**
- Sample 5 answers and inspect manually
- LLM-as-Judge with small models is NOT reliable
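
To make the three retrieval metrics concrete, here is a minimal sketch on two hand-made rankings. The helper `score_query` and the file names are illustrative and not part of the project; it matches sources by exact name, whereas the evaluation cell in Step 4 uses substring matching.

```python
from typing import Dict, List, Optional


def score_query(retrieved: List[str], relevant: List[str], k: int = 5) -> Dict[str, float]:
    """Score one ranked result list against the relevant sources."""
    top_k = retrieved[:k]
    hits = [source in relevant for source in top_k]

    # 1-based rank of the first relevant result, or None if there is none.
    first_rank: Optional[int] = next(
        (pos for pos, hit in enumerate(hits, 1) if hit), None
    )

    return {
        "hit": 1.0 if first_rank else 0.0,  # per-query Recall@k as used here
        "reciprocal_rank": 1.0 / first_rank if first_rank else 0.0,
        "precision": sum(hits) / k,  # share of the k slots that are relevant
    }


# Two toy queries (hypothetical file names).
toy_runs = [
    (["a.pdf", "b.pdf", "a.pdf", "c.pdf", "d.pdf"], ["a.pdf"]),  # relevant at ranks 1 and 3
    (["c.pdf", "d.pdf", "e.pdf", "b.pdf", "f.pdf"], ["b.pdf"]),  # relevant at rank 4
]

scores = [score_query(retrieved, relevant) for retrieved, relevant in toy_runs]
toy_recall = sum(s["hit"] for s in scores) / len(scores)
toy_mrr = sum(s["reciprocal_rank"] for s in scores) / len(scores)
toy_precision = sum(s["precision"] for s in scores) / len(scores)

print(f"Recall@5={toy_recall:.2f}  MRR={toy_mrr:.3f}  Precision@5={toy_precision:.2f}")
```

In this toy case the first relevant result sits at ranks 1 and 4, so the mean rank is 2.5 while 1/MRR is 1.6. MRR rewards top-ranked hits and should not be read as an average position.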

---

## 📦 STEP 1: Setup

In [1]:
import sys
from pathlib import Path
import json
from typing import List, Dict
from datetime import datetime

# Add src to path
sys.path.insert(0, str(Path.cwd().parent / "src"))

# Import
from retriever import Retriever
from chatbot import RAGChatbot

print("✅ Imports successful")

✅ Imports successful


## 📂 STEP 2: Load Golden Dataset

In [2]:
# Load Golden Dataset
golden_dataset_path = Path.cwd().parent / "data" / "golden_dataset.json"

with open(golden_dataset_path, 'r', encoding='utf-8') as f:
    golden_dataset = json.load(f)

print("📚 Golden Dataset loaded")
print(f"   Total questions: {len(golden_dataset)}")

# Count by category
from collections import Counter
categories = [item['category'] for item in golden_dataset]
cat_counts = Counter(categories)

print("\n📊 Breakdown by category:")
for cat, count in cat_counts.items():
    print(f"   {cat}: {count}")

📚 Golden Dataset loaded
   Total questions: 20

📊 Breakdown by category:
   nutrition: 7
   rom: 5
   volume: 5
   out_of_scope: 3
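
The cells in this notebook read three fields from each entry: `question`, `category`, and `relevant_docs`. As a hedged sketch, a small check like the one below could flag entries that the evaluation loop would silently skip. The function name `find_dataset_issues` and the sample records are invented for illustration; the real file may contain additional fields.

```python
from typing import Dict, List

REQUIRED_KEYS = ("question", "category")


def find_dataset_issues(dataset: List[Dict]) -> List[str]:
    """Return human-readable problems found in golden-dataset entries."""
    issues = []
    for index, item in enumerate(dataset):
        for key in REQUIRED_KEYS:
            if not item.get(key):
                issues.append(f"entry {index}: missing '{key}'")
        # In-scope questions need at least one expected source to be scored.
        if item.get("category") != "out_of_scope" and not item.get("relevant_docs"):
            issues.append(f"entry {index}: in-scope but no 'relevant_docs'")
    return issues


# Invented sample records that mirror the fields used in this notebook.
sample_dataset = [
    {"question": "Is creatine effective?", "category": "nutrition",
     "relevant_docs": ["helms_bodybuilding_nutrition.pdf"]},
    {"question": "What is the capital of France?", "category": "out_of_scope"},
    {"question": "How many sets per week?", "category": "volume", "relevant_docs": []},
]

for problem in find_dataset_issues(sample_dataset):
    print(problem)
```

Running the same function on `golden_dataset` right after loading would show whether every in-scope question can actually be scored.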


## 🔧 STEP 3: Initialize Retriever

In [3]:
print("🔧 Initializing retriever...\n")

retriever = Retriever()

print("\n✅ Retriever ready")

🔧 Initializing retriever...

🔧 Initializing Retriever...
   📥 Loading embedding model: BAAI/bge-large-en-v1.5
   💾 Connecting to ChromaDB: c:\RAG-Fitness-Test\data\processed\chroma_db
   ✅ Collection 'fitness_knowledge_base': 1728 documents
   🔤 Initializing BM25 index...
      ✅ BM25 indexed: 1728 documents
   🎯 Loading Cross-Encoder for re-ranking...
      ✅ Cross-Encoder loaded

✅ Retriever ready (Hybrid Search enabled)

✅ Retriever ready


## 📊 STEP 4: Evaluate Retriever

**Metrics:**
- **Recall@5**: Does at least one relevant document appear in the top 5 results?
- **MRR (Mean Reciprocal Rank)**: Mean of 1/rank of the first relevant document (0 if none is retrieved)
- **Precision@5**: Share of the top 5 results that come from a relevant document

In [4]:
print("📊 EVALUATING RETRIEVER\n")
print("=" * 80)

# Metrics storage
recall_at_5 = []
reciprocal_ranks = []
precision_at_5 = []

# Filter out "out of scope" questions
in_scope_questions = [
    item for item in golden_dataset 
    if item.get('category') != 'out_of_scope'
]

print(f"Evaluating {len(in_scope_questions)} in-scope questions...\n")

for i, item in enumerate(in_scope_questions, 1):
    query = item['question']
    relevant_docs = item.get('relevant_docs', [])
    
    # Skip if no relevant docs specified
    if not relevant_docs:
        print(f"{i:2d}. ⚠️ {query[:60]}... (no relevant_docs)")
        continue
    
    # Retrieve with hybrid search
    results = retriever.hybrid_search(
        query=query,
        top_k=5,
        retrieve_k=20,
        alpha=0.5
    )
    
    # Extract sources
    retrieved_sources = [doc['source'] for doc in results]
    
    # Calculate Recall@5 (is ANY relevant doc in top 5?)
    found = any(
        any(rel_doc in source for source in retrieved_sources)
        for rel_doc in relevant_docs
    )
    recall_at_5.append(1 if found else 0)
    
    # Calculate MRR (position of FIRST relevant doc)
    rank = None
    for j, source in enumerate(retrieved_sources, 1):
        if any(rel_doc in source for rel_doc in relevant_docs):
            rank = j
            break
    
    if rank:
        reciprocal_ranks.append(1.0 / rank)
    else:
        reciprocal_ranks.append(0.0)
    
    # Calculate Precision@5
    relevant_in_top5 = sum(
        1 for source in retrieved_sources 
        if any(rel_doc in source for rel_doc in relevant_docs)
    )
    precision_at_5.append(relevant_in_top5 / 5.0)
    
    # Progress
    status = "✅" if found else "❌"
    print(f"{i:2d}. {status} {query[:60]}...")
    if found and rank:
        print(f"     Found: {retrieved_sources[rank-1]} (position {rank})")
    elif not found:
        print(f"     Expected: {relevant_docs[0]}")
        # Guard against an empty result list
        print(f"     Got: {retrieved_sources[0] if retrieved_sources else 'no results'}")

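# Calculate metrics over the questions that were actually scored
evaluated = len(recall_at_5)
if evaluated == 0:
    raise ValueError("No in-scope question has 'relevant_docs'; nothing to evaluate")

recall_5 = sum(recall_at_5) / evaluated * 100
mrr = sum(reciprocal_ranks) / evaluated
precision_5 = sum(precision_at_5) / evaluated * 100

# 1/MRR is the harmonic mean of first-hit ranks, not the arithmetic average position.
# Format it separately: applying ':.1f' to the 'N/A' string would raise ValueError.
avg_position = f"{1 / mrr:.1f}" if mrr > 0 else "N/A"

print("\n" + "=" * 80)
print("\n📊 RETRIEVER METRICS\n")
print(f"Recall@5:    {recall_5:.1f}%")
print(f"MRR:         {mrr:.3f} (avg position: {avg_position})")
print(f"Precision@5: {precision_5:.1f}%")
print(f"\nEvaluated:   {evaluated} questions")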

# Interpretation
print("\n📈 INTERPRETATION\n")
if recall_5 >= 80:
    print("   ✅ Recall > 80%: EXCELLENT")
elif recall_5 >= 60:
    print("   ⚠️ Recall 60-80%: ACCEPTABLE")
else:
    print("   ❌ Recall < 60%: NEEDS IMPROVEMENT")

print("=" * 80)

📊 EVALUATING RETRIEVER

Evaluating 17 in-scope questions...

 1. ✅ What is the optimal protein intake for muscle hypertrophy in...
     Found: issn_protein_position.pdf (position 2)
 2. ❌ How much protein per meal for optimal muscle protein synthes...
     Expected: issn_protein_position.pdf
     Got: schoenfeld_rom_hypertrophy.pdf
 3. ✅ Is creatine supplementation effective for muscle growth?...
     Found: helms_bodybuilding_nutrition.pdf (position 1)
 4. ✅ Should protein intake be higher during a caloric deficit?...
     Found: helms_bodybuilding_nutrition.pdf (position 1)
 5. ✅ What is the protein timing window after training?...
     Found: issn_protein_position.pdf (position 5)
 6. ✅ Are BCAAs necessary if protein intake is adequate?...
     Found: issn_protein_position.pdf (position 1)
 7. ✅ What supplements are most effective for muscle hypertrophy?...
     Found: helms_bodybuilding_nutrition.pdf (position 1)
 8. ✅ Does full range of motion improve muscle hyp

## 🤖 STEP 5: Test Generator (Sample Answers)

**Manual inspection is more reliable than LLM-as-Judge**

In [5]:
print("🤖 TESTING GENERATOR\n")
print("=" * 80)

# Initialize chatbot
print("Initializing chatbot...")
chatbot = RAGChatbot()

# Sample 5 questions for manual inspection
import random
random.seed(42)  # fixed seed so the same questions are sampled on every run
sample_questions = random.sample(in_scope_questions, min(5, len(in_scope_questions)))

print(f"\nTesting {len(sample_questions)} sample questions...\n")

for i, item in enumerate(sample_questions, 1):
    print("\n" + "─" * 80)
    print(f"\n📝 QUESTION {i}\n")
    print(f"Q: {item['question']}")
    print(f"\nRelevant docs: {item['relevant_docs']}")
    print(f"Category: {item['category']}")
    
    # Generate answer
    result = chatbot.answer(
        question=item['question'],
        doc_type="scientific_paper",  # Only scientific papers
        top_k=5
    )
    
    print("\n💬 ANSWER:\n")
    print(result['answer'])
    
    print("\n📚 SOURCES USED:\n")
    for j, source in enumerate(result['sources'][:3], 1):
        print(f"{j}. {source['source']} (page {source['page']})")
        print(f"   Score: {source['score']:.3f}")

print("\n" + "=" * 80)
print("\n✅ Sample answers generated")
print("\n👀 MANUAL INSPECTION:")
print("   Please review the answers above and assess:")
print("   - Faithfulness: Does it only use info from sources?")
print("   - Completeness: Does it answer the question fully?")
print("   - Relevance: Is it concise and useful?")

🤖 TESTING GENERATOR

Initializing chatbot...

🤖 INITIALIZING RAG CHATBOT
🔧 Initializing Retriever...
   📥 Loading embedding model: BAAI/bge-large-en-v1.5
   💾 Connecting to ChromaDB: c:\RAG-Fitness-Test\data\processed\chroma_db
   ✅ Collection 'fitness_knowledge_base': 1728 documents
   🔤 Initializing BM25 index...
      ✅ BM25 indexed: 1728 documents
   🎯 Loading Cross-Encoder for re-ranking...
      ✅ Cross-Encoder loaded

✅ Retriever ready (Hybrid Search enabled)

🔍 Checking Ollama (http://localhost:11434)...
   ✅ Ollama is available
   🧠 Model: llama3.2:3b
   🌡️ Temperature: 0.1
   📏 Max tokens: 256

✅ Chatbot ready!


Testing 5 sample questions...


────────────────────────────────────────────────────────────────────────────────

📝 QUESTION 1

Q: What is the mechanism behind

## 📄 STEP 6: Generate Report

In [6]:
# Generate report
report = f"""# 📊 RAG FITNESS - EVALUATION REPORT

Date: {datetime.now().strftime('%Y-%m-%d %H:%M')}

---

## 🎯 GOLDEN DATASET

- **Total questions**: {len(golden_dataset)}
- **In-scope questions**: {len(in_scope_questions)}
- **Categories**:
"""

for cat, count in cat_counts.items():
    report += f"  - {cat}: {count} questions\n"

report += f"""
---

## 🔍 RETRIEVER METRICS

| Metric | Score | Interpretation |
|--------|-------|----------------|
| **Recall@5** | {recall_5:.1f}% | {'✅ Excellent' if recall_5 >= 80 else '⚠️ Acceptable' if recall_5 >= 60 else '❌ Needs improvement'} |
| **MRR** | {mrr:.3f} | Average position: {f'{1/mrr:.1f}' if mrr > 0 else 'N/A'} |
| **Precision@5** | {precision_5:.1f}% | {precision_5:.0f}% of retrieved docs are relevant |

**Evaluation**: {len(in_scope_questions)} in-scope questions

**Interpretation**:
- Recall > 80%: ✅ Excellent
- Recall 60-80%: ⚠️ Acceptable
- Recall < 60%: ❌ Needs improvement

---

## 🤖 GENERATOR QUALITY

**Evaluation method**: Manual inspection of {len(sample_questions)} sample answers

**Note**: LLM-as-Judge with small models (Llama 3.2 3B) is not reliable.
Manual inspection is recommended for assessing:
- Faithfulness (no hallucinations)
- Completeness (answers the question fully)
- Relevance (concise and useful)

**To evaluate generator quality**:
1. Run this notebook
2. Review sample answers in Step 5
3. Rate each answer manually (1-5 scale)
4. Average scores give true quality estimate

---

## 📁 FILES

- Golden Dataset: `data/golden_dataset.json`
- Knowledge Base: `data/processed/chroma_db/`
- Evaluation Notebook: `notebooks/02_evaluate_system.ipynb`
"""

# Save report
report_path = Path.cwd().parent / "EVALUATION_REPORT.md"
with open(report_path, 'w', encoding='utf-8') as f:
    f.write(report)

print("📄 Report generated")
print(f"   Saved to: {report_path}")
print("\n" + report)

📄 Report generated
   Saved to: c:\RAG-Fitness-Test\EVALUATION_REPORT.md

# 📊 RAG FITNESS - EVALUATION REPORT

Date: 2025-12-23 18:37

---

## 🎯 GOLDEN DATASET

- **Total questions**: 20
- **In-scope questions**: 17
- **Categories**:
  - nutrition: 7 questions
  - rom: 5 questions
  - volume: 5 questions
  - out_of_scope: 3 questions

---

## 🔍 RETRIEVER METRICS

| Metric | Score | Interpretation |
|--------|-------|----------------|
| **Recall@5** | 88.2% | ✅ Excellent |
| **MRR** | 0.776 | Average position: 1.3 |
| **Precision@5** | 69.4% | 69% of retrieved docs are relevant |

**Evaluation**: 17 in-scope questions

**Interpretation**:
- Recall > 80%: ✅ Excellent
- Recall 60-80%: ⚠️ Acceptable
- Recall < 60%: ❌ Needs improvement

---

## 🤖 GENERATOR QUALITY

**Evaluation method**: Manual inspection of 5 sample answers

**Note**: LLM-as-Judge with small models (Llama 3.2 3B) is not reliable.
Manual inspection is recommended for assessing:
- Faithfulness (no ha

## ✅ EVALUATION COMPLETE

**Summary:**
- Retriever metrics calculated (Recall@5, MRR, Precision@5)
- Sample answers generated for manual inspection
- Report saved to `EVALUATION_REPORT.md`

**Next steps:**
1. Review sample answers above (Step 5)
2. If Recall@5 < 80%, consider:
   - Adding more diverse documents
   - Tuning hybrid search parameters
   - Improving chunking strategy
3. If answers have issues, adjust:
   - System prompt (src/config.py)
   - Temperature (currently 0.1)
   - Context window (top_k)
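
One way to approach the "tuning hybrid search parameters" step is a small sweep over `alpha`. The sketch below assumes a search callable with the same keyword arguments as the `retriever.hybrid_search` call in Step 4; `recall_for_alpha`, `stub_search`, and `stub_questions` are hypothetical stand-ins so the example runs without the real index.

```python
from typing import Callable, Dict, List


def recall_for_alpha(
    search_fn: Callable[..., List[Dict]],
    questions: List[Dict],
    alpha: float,
    top_k: int = 5,
) -> float:
    """Fraction of questions with at least one relevant source in the top_k."""
    hits = 0
    for item in questions:
        results = search_fn(query=item["question"], top_k=top_k, retrieve_k=20, alpha=alpha)
        sources = [doc["source"] for doc in results]
        if any(rel in src for rel in item["relevant_docs"] for src in sources):
            hits += 1
    return hits / len(questions) if questions else 0.0


def stub_search(query: str, top_k: int, retrieve_k: int, alpha: float) -> List[Dict]:
    """Hypothetical stand-in for the retriever; finds the doc only when alpha >= 0.5."""
    source = "issn_protein_position.pdf" if alpha >= 0.5 else "other.pdf"
    return [{"source": source}][:top_k]


stub_questions = [{"question": "protein?", "relevant_docs": ["issn_protein_position.pdf"]}]

# Recall@5 for each candidate alpha; max() keeps the first best value on ties.
sweep = {
    alpha: recall_for_alpha(stub_search, stub_questions, alpha)
    for alpha in (0.0, 0.25, 0.5, 0.75, 1.0)
}
best_alpha = max(sweep, key=sweep.get)
print(sweep, "best:", best_alpha)
```

With the real system, `retriever.hybrid_search` and the questions that have `relevant_docs` would replace the stubs. With only 17 in-scope questions, a single question moves Recall@5 by about 6 points, so small differences between `alpha` values are likely noise.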

**Remember**: Manual evaluation > LLM-as-Judge for quality assessment!
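
To turn the manual review into numbers, the 1-5 ratings can be averaged per criterion. This is a minimal sketch: `summarize_ratings` is a hypothetical helper, and the ratings shown are placeholders for illustration, not results from this evaluation.

```python
from statistics import mean
from typing import Dict, List

CRITERIA = ("faithfulness", "completeness", "relevance")


def summarize_ratings(ratings: List[Dict[str, int]]) -> Dict[str, float]:
    """Average 1-5 ratings per criterion, plus an overall mean."""
    for row in ratings:
        for criterion in CRITERIA:
            if not 1 <= row[criterion] <= 5:
                raise ValueError(f"{criterion} rating out of range: {row[criterion]}")
    summary = {c: mean(row[c] for row in ratings) for c in CRITERIA}
    summary["overall"] = mean(summary[c] for c in CRITERIA)
    return summary


# Placeholder ratings for illustration only.
example_ratings = [
    {"faithfulness": 5, "completeness": 3, "relevance": 4},
    {"faithfulness": 4, "completeness": 4, "relevance": 4},
]

rating_summary = summarize_ratings(example_ratings)
print(rating_summary)
```

Keeping the three criteria separate makes it easier to tell a faithfulness problem (prompt or temperature) from a completeness problem (context size or `top_k`).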