# 📊 COMPLETE RAG EVALUATION - Golden Dataset + Metrics

**Goal**: Quantitatively measure the quality of the RAG pipeline

**Components**:
1. Golden Dataset (ground truth)
2. Recall@K (retrieval quality)
3. LLM-as-Judge (generation quality)
4. Monitoring (thumbs up/down)

---

## 📦 STEP 1: Imports

In [1]:
import json
import pandas as pd
from pathlib import Path
import sys

# Add the project root (parent of the notebook folder) to the path so `src` is importable
sys.path.append(str(Path.cwd().parent))

from src.retriever import Retriever
from src.chatbot import RAGChatbot

print("✅ Imports OK")

✅ Imports OK


## 🎯 STEP 2: Create the Golden Dataset

**Golden Dataset** = question/answer pairs plus their source documents, validated by an expert

Format:
```json
{
  "question": "...",
  "expected_answer": "...",
  "relevant_docs": ["schoenfeld_rom.pdf", ...],
  "category": "nutrition" | "rom" | "volume" | "out_of_scope"
}
```
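The format above can be checked mechanically before running any evaluation. The sketch below is one possible validator, not part of the original notebook: `validate_entry`, `REQUIRED_KEYS` and `ALLOWED_CATEGORIES` are hypothetical names, and the rule that only out-of-scope questions may have an empty `relevant_docs` list is an assumption inferred from the entries in this notebook.

```python
# Hypothetical helper names; the allowed categories mirror the ones used in this notebook.
REQUIRED_KEYS = {"question", "expected_answer", "relevant_docs", "category"}
ALLOWED_CATEGORIES = {"nutrition", "rom", "volume", "out_of_scope"}


def validate_entry(entry: dict) -> list:
    """Return the list of problems found in one golden-dataset entry (empty if valid)."""
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        return [f"missing keys: {sorted(missing)}"]

    problems = []
    if not isinstance(entry["question"], str) or not entry["question"].strip():
        problems.append("question must be a non-empty string")
    if not isinstance(entry["relevant_docs"], list):
        problems.append("relevant_docs must be a list")
    if entry["category"] not in ALLOWED_CATEGORIES:
        problems.append(f"unknown category: {entry['category']!r}")
    elif isinstance(entry["relevant_docs"], list):
        # Assumed rule: out-of-scope questions have no source docs, all others have at least one.
        is_out_of_scope = entry["category"] == "out_of_scope"
        if is_out_of_scope != (len(entry["relevant_docs"]) == 0):
            problems.append("only out_of_scope entries may have an empty relevant_docs list")
    return problems


good = {
    "question": "Is creatine supplementation effective for muscle growth?",
    "expected_answer": "Yes, creatine monohydrate increases lean mass and strength, 3-5g/day",
    "relevant_docs": ["issn_protein_position.pdf"],
    "category": "nutrition",
}
bad = {
    "question": "Is cardio bad for muscle growth?",
    "expected_answer": "N/A - Not in sources",
    "relevant_docs": [],
    "category": "cardio",  # deliberately wrong category
}

print(validate_entry(good))  # []
print(validate_entry(bad))   # ["unknown category: 'cardio'"]
```

Running such a check over the whole list before saving it catches typos in category names, which would otherwise silently change the per-category counts.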

In [2]:
# Golden Dataset - 20 Q/A pairs
GOLDEN_DATASET = [
    # ========================================================================
    # NUTRITION (6 questions)
    # ========================================================================
    {
        "question": "What is the optimal protein intake for muscle hypertrophy in resistance training?",
        "expected_answer": "1.4-2.0 g/kg/day for active individuals, up to 2.3-3.1 g/kg lean mass during caloric restriction",
        "relevant_docs": ["issn_protein_position.pdf", "helms_bodybuilding_nutrition.pdf"],
        "category": "nutrition"
    },
    {
        "question": "How much protein per meal for optimal muscle protein synthesis?",
        "expected_answer": "20-40g per meal, containing 2-3g leucine",
        "relevant_docs": ["issn_protein_position.pdf"],
        "category": "nutrition"
    },
    {
        "question": "Is creatine supplementation effective for muscle growth?",
        "expected_answer": "Yes, creatine monohydrate increases lean mass and strength, 3-5g/day",
        "relevant_docs": ["issn_protein_position.pdf"],
        "category": "nutrition"
    },
    {
        "question": "What is the protein content of chicken breast?",
        "expected_answer": "~22-30g protein per 100g depending on preparation",
        "relevant_docs": ["ciqual_2020.xls"],
        "category": "nutrition"
    },
    {
        "question": "Should protein intake increase during caloric deficit?",
        "expected_answer": "Yes, up to 2.3-3.1 g/kg lean body mass to preserve muscle",
        "relevant_docs": ["helms_bodybuilding_nutrition.pdf"],
        "category": "nutrition"
    },
    {
        "question": "What are the best protein sources for vegetarians?",
        "expected_answer": "Soy, legumes, quinoa - combine sources for complete amino acid profile",
        "relevant_docs": ["ciqual_2020.xls", "issn_protein_position.pdf"],
        "category": "nutrition"
    },
    
    # ========================================================================
    # RANGE OF MOTION (5 questions)
    # ========================================================================
    {
        "question": "Does full range of motion improve muscle hypertrophy compared to partial ROM?",
        "expected_answer": "Yes, full ROM produces greater muscle growth, especially in stretched position",
        "relevant_docs": ["schoenfeld_rom_hypertrophy.pdf"],
        "category": "rom"
    },
    {
        "question": "What ROM is best for quadriceps hypertrophy?",
        "expected_answer": "0-130° knee flexion (full ROM) superior to 50-100° partial ROM",
        "relevant_docs": ["schoenfeld_rom_hypertrophy.pdf"],
        "category": "rom"
    },
    {
        "question": "Does training at long muscle lengths improve hypertrophy?",
        "expected_answer": "Yes, exercises emphasizing stretch position show greater growth",
        "relevant_docs": ["schoenfeld_rom_hypertrophy.pdf"],
        "category": "rom"
    },
    {
        "question": "Can partial ROM be beneficial for advanced lifters?",
        "expected_answer": "Limited evidence, full ROM generally superior for hypertrophy",
        "relevant_docs": ["schoenfeld_rom_hypertrophy.pdf"],
        "category": "rom"
    },
    {
        "question": "What is the mechanism for ROM effects on muscle growth?",
        "expected_answer": "Increased mechanical tension, metabolic stress, and muscle damage at full ROM",
        "relevant_docs": ["schoenfeld_rom_hypertrophy.pdf"],
        "category": "rom"
    },
    
    # ========================================================================
    # TRAINING VOLUME (5 questions)
    # ========================================================================
    {
        "question": "What is the optimal training volume for muscle hypertrophy?",
        "expected_answer": "10-20 sets per muscle per week, dose-response relationship",
        "relevant_docs": ["bernardez_training_variables.pdf"],
        "category": "volume"
    },
    {
        "question": "Is there a maximum effective training volume?",
        "expected_answer": "Yes, beyond 20-25 sets/week may cause diminishing returns or overtraining",
        "relevant_docs": ["bernardez_training_variables.pdf"],
        "category": "volume"
    },
    {
        "question": "How does training frequency affect hypertrophy?",
        "expected_answer": "2-3x per muscle per week optimal when total volume equated",
        "relevant_docs": ["bernardez_training_variables.pdf"],
        "category": "volume"
    },
    {
        "question": "Should beginners train with high volume?",
        "expected_answer": "No, start lower (5-10 sets/week), progressively increase",
        "relevant_docs": ["bernardez_training_variables.pdf"],
        "category": "volume"
    },
    {
        "question": "What is the minimum effective volume for muscle growth?",
        "expected_answer": "~10 sets per muscle per week, below this growth is suboptimal",
        "relevant_docs": ["bernardez_training_variables.pdf"],
        "category": "volume"
    },
    
    # ========================================================================
    # EDGE CASES (4 questions)
    # ========================================================================
    {
        "question": "What is the best time to train for muscle growth?",
        "expected_answer": "N/A - Not in sources",
        "relevant_docs": [],
        "category": "out_of_scope"
    },
    {
        "question": "How to create a beginner workout program?",
        "expected_answer": "N/A - Not in sources",
        "relevant_docs": [],
        "category": "out_of_scope"
    },
    {
        "question": "What supplements should I take?",
        "expected_answer": "Creatine, protein powder - others limited evidence",
        "relevant_docs": ["issn_protein_position.pdf"],
        "category": "nutrition"
    },
    {
        "question": "Is cardio bad for muscle growth?",
        "expected_answer": "N/A - Not in sources",
        "relevant_docs": [],
        "category": "out_of_scope"
    }
]

print(f"✅ Golden Dataset created: {len(GOLDEN_DATASET)} Q/A pairs")
print("\nCategory breakdown:")
for cat in set([q['category'] for q in GOLDEN_DATASET]):
    count = len([q for q in GOLDEN_DATASET if q['category'] == cat])
    print(f"   {cat}: {count}")

✅ Golden Dataset created: 20 Q/A pairs

Category breakdown:
   nutrition: 7
   volume: 5
   out_of_scope: 3
   rom: 5


## 💾 STEP 3: Save the Golden Dataset

In [3]:
# Save as JSON
GOLDEN_PATH = Path.cwd().parent / "data" / "golden_dataset.json"

# Create the data/ folder if it does not exist yet
GOLDEN_PATH.parent.mkdir(parents=True, exist_ok=True)

with open(GOLDEN_PATH, 'w', encoding='utf-8') as f:
    json.dump(GOLDEN_DATASET, f, indent=2, ensure_ascii=False)

print(f"✅ Golden Dataset saved: {GOLDEN_PATH}")

✅ Golden Dataset saved: c:\RAG-Fitness-Test\data\golden_dataset.json
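Since later runs are meant to reuse the saved file, it can be worth confirming that it reads back unchanged. The snippet below is a minimal round-trip sketch under stated assumptions: `load_golden_dataset` is a hypothetical helper, and a temporary file stands in for `data/golden_dataset.json`.

```python
import json
import tempfile
from pathlib import Path


def load_golden_dataset(path: Path) -> list:
    """Load a golden dataset written with json.dump and check its top-level shape."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of question entries")
    return data


# One entry copied from the dataset above; it contains a non-ASCII character (°).
sample = [
    {
        "question": "What ROM is best for quadriceps hypertrophy?",
        "expected_answer": "0-130° knee flexion (full ROM) superior to 50-100° partial ROM",
        "relevant_docs": ["schoenfeld_rom_hypertrophy.pdf"],
        "category": "rom",
    }
]

# Write to a throwaway folder, then read the file back.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "golden_dataset.json"
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, indent=2, ensure_ascii=False)
    reloaded = load_golden_dataset(path)

print(reloaded == sample)  # True
```

Writing and reading with the same explicit `encoding="utf-8"` is what keeps characters such as `°` intact when `ensure_ascii=False` is used.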


## 🔍 STEP 4: Evaluate the RETRIEVER (Recall@K, MRR, Precision)

In [4]:
print("🔍 RETRIEVER EVALUATION")
print("="*80)

# Initialize the retriever
retriever = Retriever()

# Per-question metric values
recall_at_5 = []
mrr_scores = []
precision_at_5 = []

for item in GOLDEN_DATASET:
    query = item['question']
    relevant_docs = set(item['relevant_docs'])

    # Skip questions with no expected docs (out of scope)
    if not relevant_docs:
        continue

    # Retrieval
    #results = retriever.search(query=query, top_k=5)
    results = retriever.hybrid_search(query=query, top_k=5, retrieve_k=20, alpha=0.5)

    retrieved_docs = [r['source'] for r in results]

    # ========================================================================
    # Recall@5: is at least 1 relevant doc in the top 5?
    # (measured as a hit rate: 1.0 or 0.0 per question)
    # ========================================================================
    has_relevant = any(doc in relevant_docs for doc in retrieved_docs)
    recall_at_5.append(1.0 if has_relevant else 0.0)

    # ========================================================================
    # MRR (Mean Reciprocal Rank): at which rank is the first relevant doc?
    # ========================================================================
    reciprocal_rank = 0.0
    for rank, doc in enumerate(retrieved_docs, 1):
        if doc in relevant_docs:
            reciprocal_rank = 1.0 / rank
            break
    mrr_scores.append(reciprocal_rank)

    # ========================================================================
    # Precision@5: how many of the top-5 results come from a relevant doc?
    # ========================================================================
    relevant_retrieved = sum(1 for doc in retrieved_docs if doc in relevant_docs)
    precision_at_5.append(relevant_retrieved / 5.0)

# Compute averages
avg_recall = sum(recall_at_5) / len(recall_at_5) * 100
avg_mrr = sum(mrr_scores) / len(mrr_scores)
avg_precision = sum(precision_at_5) / len(precision_at_5) * 100

print("\n📊 RETRIEVER METRICS")
print(f"   Recall@5     : {avg_recall:.1f}% (at least one relevant doc in the top 5)")
print(f"   MRR          : {avg_mrr:.3f} (mean reciprocal rank of the first relevant doc)")
print(f"   Precision@5  : {avg_precision:.1f}% (share of top-5 results from a relevant doc)")
print(f"\n   Evaluated on {len(recall_at_5)} questions")

🔍 RETRIEVER EVALUATION
🔧 Initializing Retriever...
   📥 Loading model: BAAI/bge-large-en-v1.5
   💾 Connecting to ChromaDB: c:\RAG-Fitness-Test\data\processed\chroma_db
   ✅ Collection 'fitness_knowledge_base' loaded: 1438 documents
   🔤 Initializing BM25...
      ✅ BM25 indexed: 1438 documents
   🎯 Loading Cross-Encoder...
      ✅ Cross-Encoder loaded

📊 RETRIEVER METRICS
   Recall@5     : 82.4% (at least one relevant doc in the top 5)
   MRR          : 0.580 (mean reciprocal rank of the first relevant doc)
   Precision@5  : 51.8% (share of top-5 results from a relevant doc)

   Evaluated on 17 questions
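The three metrics computed in the loop can also be written as small standalone functions, which makes them easy to test on a hand-made ranking. This is an illustrative sketch with hypothetical function names; it additionally includes a strict recall (fraction of all relevant docs found), which differs from the hit rate used above whenever a question has more than one relevant doc.

```python
def hit_at_k(retrieved: list, relevant: set, k: int) -> float:
    """1.0 if at least one relevant doc appears in the top k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in retrieved[:k]) else 0.0


def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of the relevant docs that appear somewhere in the top k."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of the k result slots filled by a relevant doc."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k


def reciprocal_rank(retrieved: list, relevant: set) -> float:
    """1 / rank of the first relevant doc, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, 1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


# Made-up ranking: five chunks, identified by their source file.
retrieved = [
    "ciqual_2020.xls",
    "issn_protein_position.pdf",
    "issn_protein_position.pdf",
    "schoenfeld_rom_hypertrophy.pdf",
    "bernardez_training_variables.pdf",
]
relevant = {"issn_protein_position.pdf", "helms_bodybuilding_nutrition.pdf"}

print(hit_at_k(retrieved, relevant, 5))        # 1.0
print(recall_at_k(retrieved, relevant, 5))     # 0.5
print(precision_at_k(retrieved, relevant, 5))  # 0.4
print(reciprocal_rank(retrieved, relevant))    # 0.5
```

In this example the hit rate is 1.0 while strict recall is only 0.5, because one of the two relevant files never appears. Two chunks from the same file both count toward precision, which matches the chunk-level counting in the loop above.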


## 🤖 STEP 5: Evaluate the GENERATOR (LLM-as-Judge)

**⚠️ IMPORTANT**: Requires a running Ollama server

In [5]:
import requests

print("🤖 GENERATOR EVALUATION (LLM-as-Judge)")
print("="*80)
print("⚠️ This will take 5-10 minutes (generation + evaluation)...\n")

# Initialize the chatbot
chatbot = RAGChatbot()

# Prompt for the LLM-as-Judge.
# NOTE: the retrieved context is not passed to the judge, so faithfulness is
# estimated from the question and the answer alone.
JUDGE_PROMPT = """You are an expert evaluator of RAG systems.

Rate this answer on 3 criteria (1-5):

1. **Faithfulness**: Does the answer contain ONLY information from the context?
   - 5: 100% faithful, no hallucination
   - 1: Makes up information, hallucinations

2. **Completeness**: Does it fully answer the question?
   - 5: Complete answer, all aspects covered
   - 1: Partial or evasive answer

3. **Relevance**: Is it concise and useful?
   - 5: Concise, direct, useful
   - 1: Verbose, off-topic

Question: {question}

Answer: {answer}

Reply ONLY with this JSON format:
{{
  "faithfulness": X,
  "completeness": X,
  "relevance": X,
  "justification": "..."
}}"""

def llm_as_judge(question: str, answer: str) -> dict:
    """Score an answer with the LLM-as-Judge"""

    prompt = JUDGE_PROMPT.format(question=question, answer=answer)

    try:
        response = requests.post(
            "http://localhost:11434/api/generate",
            json={
                "model": "llama3.2:3b",
                "prompt": prompt,
                # Ollama reads sampling parameters from "options"
                "options": {"temperature": 0.1},
                "stream": False
            },
            timeout=60
        )
        response.raise_for_status()

        result = response.json()
        text = result.get('response', '')

        # Strip Markdown code fences around the JSON
        text = text.replace('```json', '').replace('```', '').strip()

        # Parse
        scores = json.loads(text)
        return scores

    except Exception as e:
        print(f"   ⚠️ Evaluation error: {e}")
        # NOTE: a failed evaluation returns zeros, which are counted in the
        # averages below and pull them down.
        return {
            "faithfulness": 0,
            "completeness": 0,
            "relevance": 0,
            "justification": "Error"
        }

# Evaluate a sample (5 questions, for speed)
sample_questions = [
    GOLDEN_DATASET[0],  # Protein
    GOLDEN_DATASET[6],  # ROM
    GOLDEN_DATASET[11], # Volume
    GOLDEN_DATASET[16], # Out of scope
    GOLDEN_DATASET[2]   # Creatine
]

judge_scores = {
    'faithfulness': [],
    'completeness': [],
    'relevance': []
}

for i, item in enumerate(sample_questions, 1):
    print(f"\n[{i}/{len(sample_questions)}] {item['question'][:60]}...")

    # Generate the answer
    result = chatbot.answer(item['question'], top_k=3)
    answer = result['answer']

    # Score it
    scores = llm_as_judge(item['question'], answer)

    judge_scores['faithfulness'].append(scores['faithfulness'])
    judge_scores['completeness'].append(scores['completeness'])
    judge_scores['relevance'].append(scores['relevance'])

    print(f"   Faithfulness: {scores['faithfulness']}/5")
    print(f"   Completeness: {scores['completeness']}/5")
    print(f"   Relevance: {scores['relevance']}/5")

# Averages
print("\n" + "="*80)
print("📊 LLM-AS-JUDGE SCORES (Averages)")
print("="*80)
print(f"   Faithfulness : {sum(judge_scores['faithfulness'])/len(judge_scores['faithfulness']):.2f}/5")
print(f"   Completeness : {sum(judge_scores['completeness'])/len(judge_scores['completeness']):.2f}/5")
print(f"   Relevance    : {sum(judge_scores['relevance'])/len(judge_scores['relevance']):.2f}/5")
print(f"\n   Evaluated on {len(sample_questions)} questions")

🤖 GENERATOR EVALUATION (LLM-as-Judge)
⚠️ This will take 5-10 minutes (generation + evaluation)...


🤖 INITIALIZING RAG CHATBOT
🔧 Initializing Retriever...
   📥 Loading model: BAAI/bge-large-en-v1.5
   💾 Connecting to ChromaDB: c:\RAG-Fitness-Test\data\processed\chroma_db
   ✅ Collection 'fitness_knowledge_base' loaded: 1438 documents
   🔤 Initializing BM25...
      ✅ BM25 indexed: 1438 documents
   🎯 Loading Cross-Encoder...
      ✅ Cross-Encoder loaded

🔍 Checking Ollama (http://localhost:11434)...
   ✅ Ollama available
   🧠 Model: llama3.2:3b

✅ Chatbot ready!


[1/5] What is the optimal protein intake for muscle hypertrophy in...
   Faithfulness: 5/5
   Completeness: 4/5
   Relevance: 4/5

[2/5] Does full range of motion improve muscle hypertrophy compare...
   Faithfulness: 5/5
   Completeness: 4/5
   Relevance: 5/5

[3/5] What is the optimal training volume for muscle hypertrophy?...
   ⚠️ Evaluation error [output truncated]
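Small models often wrap their JSON in extra text, and a reply that cannot be parsed says nothing about answer quality. The sketch below shows one way to handle both points: it extracts the first `{...}` object from a reply and averages only the replies that parsed, reporting the failures separately. `parse_judge_reply` and `average_scores` are hypothetical helpers, the sample replies are made up, and this is an alternative to the zero-on-error behaviour in the cell above, not a description of it.

```python
import json
import re
from statistics import mean
from typing import Optional

CRITERIA = ("faithfulness", "completeness", "relevance")


def parse_judge_reply(text: str) -> Optional[dict]:
    """Extract the JSON object from a judge reply; return None if it is unusable."""
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        scores = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    # Keep the reply only if every criterion is a number in the 1-5 range.
    for name in CRITERIA:
        value = scores.get(name)
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None
        if not 1 <= value <= 5:
            return None
    return scores


def average_scores(replies: list) -> tuple:
    """Average each criterion over parsed replies; also return the number of failures."""
    parsed = [parse_judge_reply(reply) for reply in replies]
    valid = [p for p in parsed if p is not None]
    averages = {name: mean(p[name] for p in valid) for name in CRITERIA} if valid else {}
    return averages, len(parsed) - len(valid)


# Made-up replies: fenced JSON with a preamble, bare JSON, and a refusal.
replies = [
    'Here is my evaluation:\n```json\n{"faithfulness": 5, "completeness": 4, '
    '"relevance": 4, "justification": "ok"}\n```',
    '{"faithfulness": 5, "completeness": 4, "relevance": 5, "justification": "ok"}',
    "Sorry, I cannot rate this answer.",
]

averages, failed = average_scores(replies)
print(averages)  # {'faithfulness': 5, 'completeness': 4, 'relevance': 4.5}
print(failed)    # 1
```

Reporting the failure count next to the averages keeps a parsing problem visible without letting it masquerade as a low quality score.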

## 👍 STEP 6: Monitoring System (Thumbs Up/Down)

Code to integrate into Gradio for user feedback

In [6]:
# Code for app.py
monitoring_code = '''
# ============================================================================
# MONITORING - User Feedback
# ============================================================================

import json
from datetime import datetime
from pathlib import Path

FEEDBACK_LOG = Path("data/feedback_log.jsonl")

def log_feedback(question: str, answer: str, feedback: str, sources: list):
    """
    Record user feedback

    Args:
        question: Question asked
        answer: Generated answer
        feedback: "positive" or "negative"
        sources: List of source documents
    """

    entry = {
        "timestamp": datetime.now().isoformat(),
        "question": question,
        "answer": answer[:200],  # Truncated
        "feedback": feedback,
        "sources": [s["source"] for s in sources],
        "num_sources": len(sources)
    }

    # Append to JSONL
    with open(FEEDBACK_LOG, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\\n")


# In the Gradio interface, add the feedback buttons:
# (assumes gr is Gradio and last_question, last_answer, last_sources are components defined in app.py)

with gr.Row():
    thumbs_up = gr.Button("👍 Good answer")
    thumbs_down = gr.Button("👎 Bad answer")

# Event handlers
thumbs_up.click(
    fn=lambda q, a, s: log_feedback(q, a, "positive", s),
    inputs=[last_question, last_answer, last_sources]
)

thumbs_down.click(
    fn=lambda q, a, s: log_feedback(q, a, "negative", s),
    inputs=[last_question, last_answer, last_sources]
)
'''

print("👍 MONITORING CODE")
print("="*80)
print(monitoring_code)
print("\n✅ To be integrated into app.py for user feedback")

👍 MONITORING CODE

# MONITORING - User Feedback

import json
from datetime import datetime
from pathlib import Path

FEEDBACK_LOG = Path("data/feedback_log.jsonl")

def log_feedback(question: str, answer: str, feedback: str, sources: list):
    """
    Record user feedback

    Args:
        question: Question asked
        answer: Generated answer
        feedback: "positive" or "negative"
        sources: List of source documents
    """

    entry = {
        "timestamp": datetime.now().isoformat(),
        "question": question,
        "answer": answer[:200],  # Truncated
        "feedback": feedback,
        "sources": [s["source"] for s in sources],
        "num_sources": len(sources)
    }

    # Append to JSONL
    with open(FEEDBACK_LOG, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")


# In the Gradio interface, add the feedback buttons:
# (assumes gr is Gradio and last_question, last_answer, last_sources are components defined in app.py)

with gr.Row():
    thumbs_up = gr.Button("👍 Good answer")
    thum

## 📊 STEP 7: Feedback Analysis (Monitoring)

In [7]:
# Feedback analysis function
def analyze_feedback(feedback_log_path: Path):
    """
    Analyze the user feedback log (JSONL, one entry per line)
    """

    if not feedback_log_path.exists():
        print("⚠️ No user feedback yet")
        return

    # Load the log entries, skipping blank lines
    feedbacks = []
    with open(feedback_log_path, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip():
                feedbacks.append(json.loads(line))

    if not feedbacks:
        print("⚠️ No feedback recorded")
        return

    # Statistics
    total = len(feedbacks)
    positive = sum(1 for fb in feedbacks if fb['feedback'] == 'positive')
    negative = total - positive

    satisfaction = positive / total * 100

    print("\n📊 USER FEEDBACK STATISTICS")
    print("="*80)
    print(f"   Total feedback entries : {total}")
    print(f"   👍 Positive            : {positive} ({positive/total*100:.1f}%)")
    print(f"   👎 Negative            : {negative} ({negative/total*100:.1f}%)")
    print(f"\n   Satisfaction rate : {satisfaction:.1f}%")

    # Questions that received negative feedback (first 5)
    negative_questions = [fb for fb in feedbacks if fb['feedback'] == 'negative']

    if negative_questions:
        print("\n⚠️ PROBLEMATIC QUESTIONS (negative feedback):")
        for i, fb in enumerate(negative_questions[:5], 1):
            print(f"   {i}. {fb['question'][:60]}...")
            print(f"      Sources: {', '.join(fb['sources'])}")

# Example usage (once feedback has been collected)
print("✅ analyze_feedback() function created")
print("\nUsage:")
print("   analyze_feedback(Path('data/feedback_log.jsonl'))")

✅ analyze_feedback() function created

Usage:
   analyze_feedback(Path('data/feedback_log.jsonl'))
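Until real feedback exists, the analysis can be tried on a synthetic log. The sketch below writes four made-up entries to a temporary JSONL file and summarizes them; `summarize_feedback` is a hypothetical variant that returns a dictionary instead of printing, which makes the numbers easy to assert on or feed into a dashboard.

```python
import json
import tempfile
from pathlib import Path


def summarize_feedback(path: Path) -> dict:
    """Read a JSONL feedback log and return counts plus the satisfaction rate."""
    entries = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():  # tolerate blank lines
                entries.append(json.loads(line))

    total = len(entries)
    positive = sum(1 for e in entries if e["feedback"] == "positive")
    negative = sum(1 for e in entries if e["feedback"] == "negative")
    return {
        "total": total,
        "positive": positive,
        "negative": negative,
        "satisfaction_pct": round(positive / total * 100, 1) if total else None,
        "negative_questions": [e["question"] for e in entries if e["feedback"] == "negative"],
    }


# Made-up log entries, in the same shape as the ones written by log_feedback().
rows = [
    {"question": "How much protein per meal?", "feedback": "positive", "sources": ["issn_protein_position.pdf"]},
    {"question": "Is cardio bad for muscle growth?", "feedback": "negative", "sources": []},
    {"question": "What ROM is best for quadriceps?", "feedback": "positive", "sources": ["schoenfeld_rom_hypertrophy.pdf"]},
    {"question": "Should beginners train with high volume?", "feedback": "positive", "sources": ["bernardez_training_variables.pdf"]},
]

with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "feedback_log.jsonl"
    with open(log_path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
        f.write("\n")  # trailing blank line, as can happen with manual edits
    summary = summarize_feedback(log_path)

print(summary["total"], summary["positive"], summary["negative"])  # 4 3 1
print(summary["satisfaction_pct"])                                 # 75.0
print(summary["negative_questions"])                               # ['Is cardio bad for muscle growth?']
```

Counting negatives explicitly, rather than as `total - positive`, means an entry with an unexpected `feedback` value is not silently treated as negative.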


## 📋 STEP 8: Final Report

Generate the full evaluation report

In [8]:
from datetime import datetime

# Build the report
report = f"""# 📊 RAG FITNESS EVALUATION REPORT

Date: {datetime.now().strftime('%Y-%m-%d %H:%M')}

---

## 🎯 GOLDEN DATASET

- **Total questions**: {len(GOLDEN_DATASET)}
- **Categories**:
  - Nutrition: 7 questions
  - ROM: 5 questions
  - Volume: 5 questions
  - Out of scope: 3 questions

---

## 🔍 RETRIEVER METRICS

| Metric | Score | Interpretation |
|--------|-------|----------------|
| **Recall@5** | {avg_recall:.1f}% | A relevant document is in the top 5 for {avg_recall:.0f}% of questions |
| **MRR** | {avg_mrr:.3f} | 1/MRR = {(1/avg_mrr if avg_mrr else float('inf')):.1f} (harmonic mean rank of the first relevant doc) |
| **Precision@5** | {avg_precision:.1f}% | {avg_precision:.0f}% of the top-5 results come from a relevant doc |

**Interpretation**:
- Recall > 80%: ✅ Excellent
- Recall 60-80%: ⚠️ Acceptable
- Recall < 60%: ❌ Needs improvement

---

## 🤖 GENERATOR METRICS (LLM-as-Judge)

| Criterion | Average Score | Target |
|-----------|---------------|--------|
| **Faithfulness** | {sum(judge_scores['faithfulness'])/len(judge_scores['faithfulness']):.2f}/5 | > 4.0 |
| **Completeness** | {sum(judge_scores['completeness'])/len(judge_scores['completeness']):.2f}/5 | > 3.5 |
| **Relevance** | {sum(judge_scores['relevance'])/len(judge_scores['relevance']):.2f}/5 | > 4.0 |

**Sample evaluated**: {len(sample_questions)} questions

---

## 👍 MONITORING (User Feedback)

**Status**: To be implemented in app.py

**Features**:
- 👍 👎 buttons in the interface
- Automatic logging to `feedback_log.jsonl`
- Periodic analysis with `analyze_feedback()`

**KPIs to track**:
- Satisfaction rate (target > 80%)
- Problematic questions
- Drift detection (shift in topics)

---

## 🎯 RECOMMENDATIONS

### If Recall@5 < 70%
1. Implement Hybrid Search (BM25 + Dense)
2. Add Cross-Encoder re-ranking
3. Improve chunking (semantic)

### If Faithfulness < 4.0
1. Strengthen the system prompt
2. Lower the LLM temperature
3. Filter the context (top 3 instead of 5)

### If Completeness < 3.5
1. Increase retrieval top_k
2. Increase LLM max_tokens
3. Improve the prompt wording

---

## 📁 GENERATED FILES

- `data/golden_dataset.json`: Ground truth (20 Q/A)
- `data/feedback_log.jsonl`: User feedback logs
- `notebooks/04_evaluation.ipynb`: Evaluation code

"""

# Save the report
REPORT_PATH = Path.cwd().parent / "EVALUATION_REPORT.md"

with open(REPORT_PATH, 'w', encoding='utf-8') as f:
    f.write(report)

print("✅ Evaluation report generated")
print(f"   📄 {REPORT_PATH}")
print("\n" + report)

✅ Evaluation report generated
   📄 c:\RAG-Fitness-Test\EVALUATION_REPORT.md

# 📊 RAG FITNESS EVALUATION REPORT

Date: 2025-12-23 13:35

---

## 🎯 GOLDEN DATASET

- **Total questions**: 20
- **Categories**:
  - Nutrition: 7 questions
  - ROM: 5 questions
  - Volume: 5 questions
  - Out of scope: 3 questions

---

## 🔍 RETRIEVER METRICS

| Metric | Score | Interpretation |
|--------|-------|----------------|
| **Recall@5** | 82.4% | A relevant document is in the top 5 for 82% of questions |
| **MRR** | 0.580 | 1/MRR = 1.7 (harmonic mean rank of the first relevant doc) |
| **Precision@5** | 51.8% | 52% of the top-5 results come from a relevant doc |

**Interpretation**:
- Recall > 80%: ✅ Excellent
- Recall 60-80%: ⚠️ Acceptable
- Recall < 60%: ❌ Needs improvement

---

## 🤖 GENERATOR METRICS (LLM-as-Judge)

| Criterion | Average Score | Target |
|-----------|---------------|--------|
| **Faithfulness** | 2.80/5 | > 4.0 |
| **Completeness** | 2.20/5 | > 3.5 |
| **Relevance** | 2.80/5 | > 4.0 |
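The thresholds and targets stated in the report can be applied in code so that a run flags its own weak spots. The sketch below is one possible encoding: `grade_recall`, `TARGETS` and `below_target` are hypothetical names, the cut-offs are the ones listed in the report, and the inputs are the scores reported above.

```python
def grade_recall(recall_pct: float) -> str:
    """Map a Recall@5 percentage to the report's three interpretation bands."""
    if recall_pct > 80:
        return "excellent"
    if recall_pct >= 60:
        return "acceptable"
    return "needs improvement"


# Targets from the generator table: each average must exceed its target.
TARGETS = {"faithfulness": 4.0, "completeness": 3.5, "relevance": 4.0}


def below_target(averages: dict) -> list:
    """Return the criteria whose average score does not exceed its target."""
    return [name for name, target in TARGETS.items() if averages[name] <= target]


print(grade_recall(82.4))  # excellent
print(below_target({"faithfulness": 2.80, "completeness": 2.20, "relevance": 2.80}))
# ['faithfulness', 'completeness', 'relevance']
```

With the reported numbers, retrieval lands in the top band while all three generator criteria miss their targets, so the recommendations for Faithfulness and Completeness are the ones that apply.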

## ✅ CONCLUSION

**Full evaluation implemented!**

**Components built**:
1. ✅ Golden Dataset (20 Q/A)
2. ✅ Retriever metrics (Recall, MRR, Precision)
3. ✅ LLM-as-Judge (Faithfulness, Completeness, Relevance)
4. ✅ Monitoring code (Thumbs up/down)
5. ✅ Evaluation report

**Next steps**:
1. Integrate monitoring into app.py
2. Track metrics in production
3. Iterate based on user feedback

**Recommended frequency**:
- Golden Dataset evaluation: Monthly
- Feedback analysis: Weekly
- Metrics review: Daily (dashboard)