# Minimal RAG with Leave-One-Out (LOO) Testing

**Goal**: Assess the value of each data chunk in a RAG system using the Leave-One-Out method.

**Method**:
- Build a simple RAG pipeline with vector similarity search
- Evaluate performance with each chunk removed in turn
- Compute a LOO score to identify the most important chunks

**Environment**: Google Colab (or local, with GPU or CPU)

## 1. Setup & Dependencies

In [None]:
# Install dependencies (uncomment if running on Colab)
# !pip install -q transformers torch sentence-transformers scikit-learn pandas numpy matplotlib seaborn

In [None]:
import torch
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Tuple
import warnings
warnings.filterwarnings('ignore')

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Device selection
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
else:
    device = 'cpu'

print(f"Using device: {device}")

## 2. Sample Data Creation

Create a small knowledge base about Machine Learning with 8 chunks and 5 test questions.

In [None]:
# Knowledge base: 8 chunks about Machine Learning
knowledge_chunks = [
    "Machine learning is a subset of artificial intelligence that enables systems to learn from data without explicit programming. It uses algorithms to identify patterns and make predictions.",
    
    "Supervised learning requires labeled training data where each example has an input and corresponding output. Common algorithms include linear regression, decision trees, and neural networks.",
    
    "Unsupervised learning works with unlabeled data to discover hidden patterns. Clustering algorithms like K-means and dimensionality reduction techniques like PCA are popular examples.",
    
    "Deep learning uses artificial neural networks with multiple layers to learn hierarchical representations. It excels at tasks like image recognition, natural language processing, and speech recognition.",
    
    "Overfitting occurs when a model learns the training data too well, including noise and outliers. This results in poor generalization to new, unseen data. Regularization techniques help prevent overfitting.",
    
    "Cross-validation is a technique to assess model performance by splitting data into training and validation sets multiple times. K-fold cross-validation is the most common approach.",
    
    "Feature engineering involves creating new features from raw data to improve model performance. It requires domain knowledge and understanding of the problem being solved.",
    
    "Gradient descent is an optimization algorithm used to minimize the loss function by iteratively adjusting model parameters. Learning rate controls the step size in each iteration."
]

# Test questions with ground-truth answers
test_questions = [
    {
        "question": "What is machine learning?",
        "ground_truth": "Machine learning is a subset of AI that enables systems to learn from data without explicit programming.",
        "relevant_chunks": [0]  # Chunk 0 contains the main answer
    },
    {
        "question": "What is the difference between supervised and unsupervised learning?",
        "ground_truth": "Supervised learning uses labeled data while unsupervised learning works with unlabeled data to discover patterns.",
        "relevant_chunks": [1, 2]  # Chunks 1 and 2
    },
    {
        "question": "How does cross-validation work?",
        "ground_truth": "Cross-validation splits data into training and validation sets multiple times to assess model performance.",
        "relevant_chunks": [5]  # Chunk 5
    },
    {
        "question": "What is overfitting and how to prevent it?",
        "ground_truth": "Overfitting is when a model learns training data too well including noise. Regularization helps prevent it.",
        "relevant_chunks": [4]  # Chunk 4
    },
    {
        "question": "What is gradient descent?",
        "ground_truth": "Gradient descent is an optimization algorithm that minimizes loss by iteratively adjusting parameters.",
        "relevant_chunks": [7]  # Chunk 7
    }
]

print(f"Knowledge base: {len(knowledge_chunks)} chunks")
print(f"Test questions: {len(test_questions)} questions")
print("\nSample chunk:")
print(f"  {knowledge_chunks[0]}")

## 3. RAG Implementation

In [None]:
# Load embedding model
print("Loading embedding model...")
embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', device=device)
print("Model loaded successfully!")

In [None]:
class SimpleRAG:
    """Simple RAG system with vector similarity search"""
    
    def __init__(self, embedding_model, chunks: List[str]):
        self.embedding_model = embedding_model
        self.chunks = chunks
        self.chunk_embeddings = None
        self._build_index()
    
    def _build_index(self):
        """Embed all chunks and store in memory"""
        print(f"Embedding {len(self.chunks)} chunks...")
        self.chunk_embeddings = self.embedding_model.encode(
            self.chunks, 
            convert_to_numpy=True,
            show_progress_bar=True
        )
        print(f"Index built: {self.chunk_embeddings.shape}")
    
    def retrieve(self, query: str, top_k: int = 2) -> List[Tuple[int, str, float]]:
        """Retrieve top-k most similar chunks"""
        # Embed query
        query_embedding = self.embedding_model.encode([query], convert_to_numpy=True)
        
        # Calculate cosine similarity
        similarities = cosine_similarity(query_embedding, self.chunk_embeddings)[0]
        
        # Get top-k indices
        top_indices = np.argsort(similarities)[::-1][:top_k]
        
        # Return (index, chunk, score)
        results = [
            (idx, self.chunks[idx], similarities[idx]) 
            for idx in top_indices
        ]
        return results
    
    def generate_answer(self, query: str, top_k: int = 2) -> Dict:
        """Generate answer by concatenating retrieved chunks"""
        retrieved = self.retrieve(query, top_k)
        
        # Simple answer: concatenate retrieved chunks
        answer = " ".join([chunk for _, chunk, _ in retrieved])
        
        return {
            "answer": answer,
            "retrieved_chunks": retrieved
        }
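
The ranking step inside `retrieve` can be seen in isolation with hand-written vectors in place of real embeddings. This is a minimal sketch: the 2-D vectors and the helper names `cosine` and `top_k_indices` are illustrative assumptions, not part of the notebook, and it uses only the standard library rather than scikit-learn.

```python
import math
from typing import List, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_indices(query: List[float], vectors: List[List[float]],
                  top_k: int = 2) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the top_k most similar vectors, best first."""
    scored = [(i, cosine(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical 2-D "embeddings" for three chunks
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [1.0, 0.2]

ranked = top_k_indices(toy_query, toy_vectors, top_k=2)
print(ranked)  # chunk 0 ranks first, then chunk 2
```

The query points almost along the first axis, so chunk 0 ranks first and the diagonal chunk 2 second; chunk 1 is dropped by `top_k=2`.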

In [None]:
# Test RAG system
rag = SimpleRAG(embedding_model, knowledge_chunks)

# Try a single question
test_query = test_questions[0]["question"]
result = rag.generate_answer(test_query, top_k=2)

print(f"Question: {test_query}")
print(f"\nRetrieved chunks:")
for idx, chunk, score in result["retrieved_chunks"]:
    print(f"  [{idx}] (score: {score:.3f}): {chunk[:100]}...")
print(f"\nGenerated answer:\n  {result['answer'][:200]}...")

## 4. Evaluation Metrics

In [None]:
def calculate_retrieval_precision(retrieved_indices: List[int], relevant_indices: List[int]) -> float:
    """Calculate precision: how many retrieved chunks are relevant"""
    if len(retrieved_indices) == 0:
        return 0.0
    
    relevant_set = set(relevant_indices)
    retrieved_set = set(retrieved_indices)
    
    correct = len(relevant_set.intersection(retrieved_set))
    return correct / len(retrieved_set)

def calculate_retrieval_recall(retrieved_indices: List[int], relevant_indices: List[int]) -> float:
    """Calculate recall: how many relevant chunks were retrieved"""
    if len(relevant_indices) == 0:
        return 1.0  # No relevant chunks to retrieve
    
    relevant_set = set(relevant_indices)
    retrieved_set = set(retrieved_indices)
    
    correct = len(relevant_set.intersection(retrieved_set))
    return correct / len(relevant_set)

def calculate_answer_similarity(answer: str, ground_truth: str, model) -> float:
    """Calculate semantic similarity between generated answer and ground truth"""
    embeddings = model.encode([answer, ground_truth], convert_to_numpy=True)
    similarity = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
    return similarity

def evaluate_rag(rag_system: SimpleRAG, questions: List[Dict], top_k: int = 2) -> Dict:
    """Evaluate RAG system on test questions"""
    precisions = []
    recalls = []
    similarities = []
    
    for q in questions:
        result = rag_system.generate_answer(q["question"], top_k)
        retrieved_indices = [idx for idx, _, _ in result["retrieved_chunks"]]
        
        # Calculate metrics
        precision = calculate_retrieval_precision(retrieved_indices, q["relevant_chunks"])
        recall = calculate_retrieval_recall(retrieved_indices, q["relevant_chunks"])
        similarity = calculate_answer_similarity(
            result["answer"], 
            q["ground_truth"], 
            rag_system.embedding_model
        )
        
        precisions.append(precision)
        recalls.append(recall)
        similarities.append(similarity)
    
    return {
        "avg_precision": np.mean(precisions),
        "avg_recall": np.mean(recalls),
        "avg_similarity": np.mean(similarities),
        "precisions": precisions,
        "recalls": recalls,
        "similarities": similarities
    }
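
A small worked example may help when reading the retrieval numbers. The sketch below re-implements the two set-based metrics in compact form (the short names `precision` and `recall` are stand-ins for the notebook's functions) and applies them to made-up retrieval results.

```python
from typing import List

def precision(retrieved: List[int], relevant: List[int]) -> float:
    """Fraction of retrieved chunks that are relevant."""
    retrieved_set = set(retrieved)
    if not retrieved_set:
        return 0.0
    return len(retrieved_set & set(relevant)) / len(retrieved_set)

def recall(retrieved: List[int], relevant: List[int]) -> float:
    """Fraction of relevant chunks that were retrieved."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 1.0
    return len(relevant_set & set(retrieved)) / len(relevant_set)

# One relevant chunk, top_k = 2: the relevant chunk is found,
# but the second retrieved slot is necessarily irrelevant.
p_single = precision([0, 3], [0])
r_single = recall([0, 3], [0])

# Two relevant chunks, only one of them retrieved
p_pair = precision([1, 3], [1, 2])
r_pair = recall([1, 3], [1, 2])

print(p_single, r_single, p_pair, r_pair)  # 0.5 1.0 0.5 0.5
```

With `top_k=2`, a question that has a single relevant chunk cannot score above 0.5 precision. Four of the five test questions are of that kind, so the average precision in this notebook is capped at 0.6 even for a perfect retriever.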

In [None]:
# Baseline evaluation (with all chunks)
baseline_metrics = evaluate_rag(rag, test_questions, top_k=2)

print("Baseline Performance (all chunks):")
print(f"  Average Precision: {baseline_metrics['avg_precision']:.3f}")
print(f"  Average Recall: {baseline_metrics['avg_recall']:.3f}")
print(f"  Average Answer Similarity: {baseline_metrics['avg_similarity']:.3f}")

## 5. Leave-One-Out (LOO) Evaluation

Assess the value of each chunk as follows:
1. Remove chunk `i` from the knowledge base
2. Re-evaluate RAG performance
3. Compute the LOO score: baseline performance minus performance without chunk `i` (the performance drop)
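
The three steps above can be sketched on a toy problem that needs no embedding model. Everything here is an illustrative assumption: the helper `loo_scores`, the keyword-coverage score `coverage`, and the four toy chunks are not part of the notebook.

```python
from typing import Callable, List, Sequence

def loo_scores(items: Sequence[str],
               score_fn: Callable[[List[str]], float]) -> List[float]:
    """LOO score per item: baseline score minus the score without that item."""
    baseline = score_fn(list(items))
    scores = []
    for i in range(len(items)):
        remaining = [item for j, item in enumerate(items) if j != i]
        scores.append(baseline - score_fn(remaining))
    return scores

# Hypothetical score: fraction of required keywords present in the remaining chunks
required = {"labels", "clusters"}

def coverage(chunks: List[str]) -> float:
    text = " ".join(chunks)
    return sum(word in text for word in required) / len(required)

toy_chunks = ["uses labels", "finds clusters", "finds clusters too", "unrelated"]
toy_loo = loo_scores(toy_chunks, coverage)
print(toy_loo)  # [0.5, 0.0, 0.0, 0.0]
```

The two "clusters" chunks each get a LOO score of 0 because the other one covers for it. A score near zero can therefore mean the chunk is redundant, not that its content is unimportant.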

In [None]:
def leave_one_out_evaluation(chunks: List[str], questions: List[Dict], embedding_model, top_k: int = 2) -> Dict:
    """Perform Leave-One-Out evaluation"""
    
    # Baseline with all chunks
    print("Calculating baseline...")
    baseline_rag = SimpleRAG(embedding_model, chunks)
    baseline_metrics = evaluate_rag(baseline_rag, questions, top_k)
    baseline_score = baseline_metrics['avg_similarity']  # Use similarity as main metric
    
    # LOO evaluation
    loo_scores = []
    loo_details = []
    
    print(f"\nRunning LOO for {len(chunks)} chunks...")
    for i in range(len(chunks)):
        # Remove chunk i
        chunks_without_i = chunks[:i] + chunks[i+1:]
        
        # Build RAG without chunk i
        rag_without_i = SimpleRAG(embedding_model, chunks_without_i)
        
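        # Evaluate. Note: removing chunk i shifts the indices of all later chunks,
        # so the precision/recall inside metrics_without_i are compared against
        # stale `relevant_chunks` labels. Only avg_similarity is used below.
        metrics_without_i = evaluate_rag(rag_without_i, questions, top_k)
        score_without_i = metrics_without_i['avg_similarity']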
        
        # LOO score = performance drop
        loo_score = baseline_score - score_without_i
        loo_scores.append(loo_score)
        
        loo_details.append({
            "chunk_id": i,
            "chunk_text": chunks[i][:100] + "...",
            "loo_score": loo_score,
            "baseline_similarity": baseline_score,
            "without_chunk_similarity": score_without_i,
            "performance_drop": loo_score
        })
        
        print(f"  Chunk {i}: LOO score = {loo_score:.4f}")
    
    return {
        "baseline_score": baseline_score,
        "loo_scores": loo_scores,
        "loo_details": loo_details,
        "baseline_metrics": baseline_metrics
    }
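
One caveat about the function above: `chunks_without_i` is re-indexed from 0, while each question's `relevant_chunks` still refers to positions in the full list. The LOO score is unaffected because it is based on `avg_similarity`, but the precision and recall computed inside the loop would be off for any chunk after position `i`. A possible fix, sketched here with a hypothetical helper `remap_relevant_indices` that is not part of the notebook, is to translate the labels before evaluating.

```python
from typing import List

def remap_relevant_indices(relevant: List[int], removed: int) -> List[int]:
    """Map original chunk indices to their positions after `removed` is dropped."""
    remapped = []
    for idx in relevant:
        if idx == removed:
            continue  # the relevant chunk itself was removed
        # Chunks after the removed one move down by one position
        remapped.append(idx - 1 if idx > removed else idx)
    return remapped

print(remap_relevant_indices([1, 2], removed=0))  # [0, 1]
print(remap_relevant_indices([1, 2], removed=1))  # [1]
print(remap_relevant_indices([5], removed=7))     # [5]
```

When the only relevant chunk is the one removed, the remapped list is empty, and `calculate_retrieval_recall` as written returns 1.0 for an empty list. Such questions are probably best reported separately rather than averaged in.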

In [None]:
# Run LOO evaluation
loo_results = leave_one_out_evaluation(
    knowledge_chunks, 
    test_questions, 
    embedding_model, 
    top_k=2
)

print("\n" + "="*60)
print("LOO Evaluation Complete!")
print("="*60)

## 6. Results Visualization

In [None]:
# Create results dataframe
results_df = pd.DataFrame(loo_results['loo_details'])
results_df = results_df.sort_values('loo_score', ascending=False)

print("\nTop 5 Most Valuable Chunks:")
print(results_df[['chunk_id', 'loo_score', 'chunk_text']].head())

print("\nBottom 5 Least Valuable Chunks:")
print(results_df[['chunk_id', 'loo_score', 'chunk_text']].tail())

In [None]:
# Visualization 1: Bar chart of LOO scores
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Bar chart
ax1 = axes[0]
colors = ['red' if score > 0.01 else 'lightblue' for score in results_df['loo_score']]
ax1.barh(results_df['chunk_id'].astype(str), results_df['loo_score'], color=colors)
ax1.set_xlabel('LOO Score (Performance Drop)', fontsize=12)
ax1.set_ylabel('Chunk ID', fontsize=12)
ax1.set_title('Chunk Importance (LOO Scores)', fontsize=14, fontweight='bold')
ax1.axvline(x=0, color='black', linestyle='--', linewidth=1)
ax1.grid(axis='x', alpha=0.3)

# Heatmap of metrics
ax2 = axes[1]
heatmap_data = results_df[['chunk_id', 'baseline_similarity', 'without_chunk_similarity', 'performance_drop']].set_index('chunk_id')
sns.heatmap(heatmap_data.T, annot=True, fmt='.3f', cmap='RdYlGn_r', ax=ax2, cbar_kws={'label': 'Score'})
ax2.set_title('Performance Metrics Heatmap', fontsize=14, fontweight='bold')
ax2.set_xlabel('Chunk ID', fontsize=12)

plt.tight_layout()
plt.show()

In [None]:
# Visualization 2: Distribution of LOO scores
fig, ax = plt.subplots(1, 1, figsize=(10, 6))

ax.hist(results_df['loo_score'], bins=10, color='skyblue', edgecolor='black', alpha=0.7)
ax.axvline(x=results_df['loo_score'].mean(), color='red', linestyle='--', linewidth=2, label=f"Mean: {results_df['loo_score'].mean():.4f}")
ax.set_xlabel('LOO Score', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
ax.set_title('Distribution of Chunk Importance Scores', fontsize=14, fontweight='bold')
ax.legend()
ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

## 7. Interpretation & Analysis

In [None]:
print("="*60)
print("SUMMARY & INTERPRETATION")
print("="*60)

print(f"\n1. Baseline Performance:")
print(f"   - Average Similarity: {loo_results['baseline_score']:.3f}")
print(f"   - Precision: {loo_results['baseline_metrics']['avg_precision']:.3f}")
print(f"   - Recall: {loo_results['baseline_metrics']['avg_recall']:.3f}")

print(f"\n2. LOO Analysis:")
print(f"   - Mean LOO Score: {np.mean(loo_results['loo_scores']):.4f}")
print(f"   - Std LOO Score: {np.std(loo_results['loo_scores']):.4f}")
print(f"   - Max LOO Score: {np.max(loo_results['loo_scores']):.4f} (Chunk {np.argmax(loo_results['loo_scores'])})")
print(f"   - Min LOO Score: {np.min(loo_results['loo_scores']):.4f} (Chunk {np.argmin(loo_results['loo_scores'])})")

print(f"\n3. Most Critical Chunks:")
top_3 = results_df.head(3)
for idx, row in top_3.iterrows():
    print(f"   - Chunk {row['chunk_id']} (LOO: {row['loo_score']:.4f}):")
    print(f"     {row['chunk_text']}")

print(f"\n4. Least Critical Chunks:")
bottom_3 = results_df.tail(3)
for idx, row in bottom_3.iterrows():
    print(f"   - Chunk {row['chunk_id']} (LOO: {row['loo_score']:.4f}):")
    print(f"     {row['chunk_text']}")

print(f"\n5. Key Insights:")
high_value_chunks = results_df[results_df['loo_score'] > results_df['loo_score'].mean()]
print(f"   - {len(high_value_chunks)} chunks are above-average in importance")
print(f"   - Removing the most critical chunk lowers average answer similarity by {np.max(loo_results['loo_scores']):.4f} (absolute)")
print(f"   - A LOO score ≈ 0 suggests the chunk is redundant or is never retrieved for these test questions")

print(f"\n6. Recommendations:")
print(f"   ✓ Focus on maintaining high-quality chunks with LOO > {results_df['loo_score'].mean():.4f}")
print(f"   ✓ Review low-value chunks (LOO ≈ 0) before removing them: they may simply not be covered by the test questions")
print(f"   ✓ Check whether the chunks listed in relevant_chunks are the ones with the highest LOO scores")
print(f"   ✓ Scale this approach to larger datasets with Monte Carlo sampling")

## 8. Next Steps

### Scaling to Production:
1. **Larger datasets**: Use Monte Carlo LOO once the knowledge base grows beyond about 10 chunks
2. **Better metrics**: Integrate Ragas (faithfulness, answer relevancy)
3. **Advanced retrieval**: Use vector databases (Pinecone, Weaviate)
4. **LLM generation**: Replace concatenation with GPT/Claude
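
The notebook does not define "Monte Carlo LOO". One plausible reading, sketched here as an assumption, is to run the remove-and-re-evaluate loop on a random subset of chunks only, so the cost no longer grows with the full knowledge base. The helper name `sample_chunk_ids` and the sizes used are hypothetical.

```python
import random
from typing import List

def sample_chunk_ids(n_chunks: int, n_samples: int, seed: int = 0) -> List[int]:
    """Pick a reproducible random subset of chunk indices to run LOO on."""
    rng = random.Random(seed)
    n_samples = min(n_samples, n_chunks)  # never ask for more chunks than exist
    return sorted(rng.sample(range(n_chunks), n_samples))

# Hypothetical sizes: score 50 of 1,000 chunks instead of all of them
sampled_ids = sample_chunk_ids(n_chunks=1000, n_samples=50, seed=42)
print(len(sampled_ids), sampled_ids[:5])
```

Each sampled index would then go through the same loop body as in `leave_one_out_evaluation`. Chunks that were not sampled simply receive no score in that run.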

### Integration with existing codebase:
```python
# Connect to src/dv/interfaces.py
from src.dv.interfaces import Valuator

class LOOValuator(Valuator):
    def value(self, chunks: List[str]) -> np.ndarray:
        # Implement LOO logic here
        pass
```

### Experiment variations:
- Try different embedding models (OpenAI, Cohere)
- Vary top_k retrieval parameter
- Test on different domains (medical, legal, etc.)
- Compare LOO with Shapley values
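
For the LOO versus Shapley comparison, a permutation-sampling estimate is a common starting point. The sketch below is a toy under stated assumptions: `monte_carlo_shapley` and `toy_value` are hypothetical names, and the value function stands in for a real RAG evaluation (items 1 and 2 are duplicates, item 3 is irrelevant).

```python
import random
from typing import Callable, List

def monte_carlo_shapley(n_items: int,
                        value_fn: Callable[[List[int]], float],
                        n_permutations: int = 200,
                        seed: int = 0) -> List[float]:
    """Estimate Shapley values by averaging marginal gains over random orderings."""
    rng = random.Random(seed)
    totals = [0.0] * n_items
    for _ in range(n_permutations):
        order = list(range(n_items))
        rng.shuffle(order)
        coalition: List[int] = []
        prev_value = value_fn(coalition)
        for item in order:
            coalition.append(item)
            new_value = value_fn(coalition)
            totals[item] += new_value - prev_value  # marginal contribution
            prev_value = new_value
    return [t / n_permutations for t in totals]

def toy_value(coalition: List[int]) -> float:
    """Toy score: half credit for item 0, half credit for either duplicate (1 or 2)."""
    has_first = 0 in coalition
    has_duplicate = 1 in coalition or 2 in coalition
    return (has_first + has_duplicate) / 2

n = 4
shapley = monte_carlo_shapley(n, toy_value, n_permutations=2000, seed=0)
loo = [toy_value(list(range(n))) - toy_value([j for j in range(n) if j != i])
       for i in range(n)]
print("LOO:    ", loo)
print("Shapley:", [round(s, 3) for s in shapley])
```

In this toy, LOO assigns 0 to both duplicates, while the Shapley estimate splits their shared credit at roughly 0.25 each. The trade-off is cost: every permutation calls the value function once per item, which for a real RAG system means one full evaluation per call.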

In [None]:
# Export results to CSV for further analysis
results_df.to_csv('loo_results.csv', index=False)
print("Results exported to loo_results.csv")