# 🚀 Production RAG Pipeline - Google Colab

**100% FREE & Open Source - No API Keys Required!**

This notebook runs a complete RAG pipeline with:
- ✅ Hybrid Retrieval (Dense + BM25)
- ✅ BGE Reranker (#1 on MTEB Leaderboard)
- ✅ Systematic Grid Search
- ✅ MLflow Experiment Tracking
- ✅ Final Evaluation on Golden Test Set
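
This notebook does not show how the dense and BM25 results are merged. As one illustration of hybrid retrieval, the sketch below uses reciprocal rank fusion (RRF), a common way to combine two ranked lists. This is an assumption, not necessarily what the repository does: the function name `reciprocal_rank_fusion`, the constant `k=60`, and the document IDs are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1. k=60 is a conventional default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of a dense retriever and a BM25 retriever
dense_ranking = ["doc_a", "doc_b", "doc_c"]
bm25_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
print(fused)
```

Documents that rank well in both lists rise to the top, while documents found by only one retriever are still kept as candidates for the reranker.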

---

## Before You Start

1. **Enable GPU**: `Runtime` → `Change runtime type` → `T4 GPU` → `Save`
2. **Run cells in order** (top to bottom)

In [None]:
#@title 1. Setup Environment
import os

# Clone the repository (replace YOUR_USERNAME with your GitHub username)
!git clone https://github.com/YOUR_USERNAME/production-rag-pipeline.git
%cd production-rag-pipeline

# Install PyTorch with CUDA
!pip install -q torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install RAG dependencies
!pip install -q transformers sentence-transformers datasets faiss-cpu
!pip install -q mlflow pandas pyyaml tqdm rouge-score nltk rank-bm25

# Verify GPU
import torch
if torch.cuda.is_available():
    print(f"✅ GPU Available: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("⚠️ No GPU detected - will run on CPU (slower)")

print("\n✅ Setup complete!")

In [None]:
#@title 2. Run Grid Search (Phase 1 - Hyperparameter Tuning)
%cd project1_rag_production

# Run grid search with:
# - 300 documents from SQuAD dataset
# - 30 Q&A pairs for evaluation
# - Max 10 experiment configurations

!python scripts/run_grid_search.py \
    --num-docs 300 \
    --num-qa 30 \
    --max-experiments 10

print("\n✅ Phase 1 Complete! Best configuration identified.")

In [None]:
#@title 3. View Grid Search Results
import pandas as pd

print("=" * 70)
print("PHASE 1: GRID SEARCH RESULTS (Ranked by Composite Score)")
print("=" * 70)

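df = pd.read_csv("outputs/grid_search_results.csv")

# Rank by composite score (highest first) in case the CSV is not already sorted
if 'composite_score' in df.columns:
    df = df.sort_values('composite_score', ascending=False).reset_index(drop=True)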

# Display key columns
display_cols = ['name', 'correct', 'hallucination', 'quality_score', 
                'gen_f1_score', 'ret_mrr', 'composite_score']
available_cols = [c for c in display_cols if c in df.columns]

print(df[available_cols].head(10).to_string())

# Best configuration
print("\n" + "=" * 70)
print("BEST CONFIGURATION")
print("=" * 70)
best = df.iloc[0]
print(f"Name: {best['name']}")
print(f"Composite Score: {best['composite_score']:.4f}")

In [None]:
#@title 4. Run Final Evaluation (Phase 2 - Golden Test Set)
# Uses the BEST configuration from Phase 1 automatically
# Evaluates on SQuAD validation set (never seen during tuning)

!python scripts/run_final_evaluation.py --num-docs 500 --num-qa 50

print("\n✅ Phase 2 Complete! Unbiased metrics on golden test set.")

In [None]:
#@title 5. View Final Evaluation Report
import os
import glob

print("=" * 70)
print("PHASE 2: FINAL EVALUATION REPORT")
print("=" * 70)

# Find the latest report
report_dir = "outputs/phase2_final_evaluation"
if os.path.exists(report_dir):
    reports = sorted(glob.glob(f"{report_dir}/FINAL_EVALUATION_REPORT_*.txt"))
    if reports:
        latest_report = reports[-1]
        print(f"Latest Report: {os.path.basename(latest_report)}\n")
        with open(latest_report, 'r') as f:
            print(f.read())
    else:
        print("No reports found. Run Phase 2 first.")
else:
    print("Phase 2 not run yet. Execute Cell 4 first.")

In [None]:
#@title 6. View MLflow Experiment Logs
import mlflow
import pandas as pd

# Set MLflow tracking URI
mlflow.set_tracking_uri("outputs/mlflow_tracking")

# Get all experiments
experiments = mlflow.search_experiments()
print("=" * 70)
print("MLFLOW EXPERIMENTS")
print("=" * 70)

for exp in experiments:
    print(f"\nExperiment: {exp.name}")
    print(f"  ID: {exp.experiment_id}")
    
    # Get runs for this experiment
    runs = mlflow.search_runs(experiment_ids=[exp.experiment_id])
    if len(runs) > 0:
        print(f"  Total Runs: {len(runs)}")
        # Print the ranking header only when the metric was actually logged
        if 'metrics.composite_score' in runs.columns:
            print("\n  Top 5 Runs by Composite Score:")
            top_runs = runs.nlargest(5, 'metrics.composite_score')
            for i, (_, run) in enumerate(top_runs.iterrows(), 1):
                print(f"    {i}. Run {run['run_id'][:8]}...")
                print(f"       Composite Score: {run.get('metrics.composite_score', 0):.4f}")
                print(f"       Quality Score: {run.get('metrics.quality_score', 0):.4f}")
                print(f"       F1 Score: {run.get('metrics.gen_f1_score', 0):.4f}")

## 📊 All Logged Metrics

The following metrics are tracked for each experiment run:


In [None]:
#@title 7. View All Logged Metrics & Parameters
import mlflow

mlflow.set_tracking_uri("outputs/mlflow_tracking")

# Get all runs
experiments = mlflow.search_experiments()
if experiments:
    exp_id = experiments[0].experiment_id
    runs = mlflow.search_runs(experiment_ids=[exp_id])
    
    # Select metric columns
    metric_cols = [c for c in runs.columns if c.startswith('metrics.')]
    param_cols = [c for c in runs.columns if c.startswith('params.')]
    
    print("=" * 70)
    print("ALL LOGGED METRICS")
    print("=" * 70)
    for col in sorted(metric_cols):
        print(f"  • {col.replace('metrics.', '')}")
    
    print("\n" + "=" * 70)
    print("ALL LOGGED PARAMETERS")
    print("=" * 70)
    for col in sorted(param_cols):
        print(f"  • {col.replace('params.', '')}")
else:
    print("No experiments found. Run grid search first.")

## 💾 Download Results

Download all results including:
- Grid search results (CSV)
- MLflow tracking logs
- Final evaluation reports


In [None]:
#@title 8. Download All Results
from google.colab import files
import shutil
import os

# Create zip of all outputs
if os.path.exists('outputs'):
    shutil.make_archive('rag_pipeline_results', 'zip', 'outputs')
    
    # Download
    files.download('rag_pipeline_results.zip')
    
    print("✅ Results downloaded!")
    print("   Contains:")
    print("   - grid_search_results.csv")
    print("   - MLflow tracking logs")
    print("   - Phase 2 evaluation reports")
else:
    print("No outputs found. Run the pipeline first.")

In [None]:
#@title 9. Compare All Experiment Runs (Table View)
import mlflow
import pandas as pd

mlflow.set_tracking_uri("outputs/mlflow_tracking")

experiments = mlflow.search_experiments()
if experiments:
    exp_id = experiments[0].experiment_id
    runs = mlflow.search_runs(experiment_ids=[exp_id])
    
    # Create comparison table
    comparison_cols = [
        'params.config_name',
        'metrics.quality_score',
        'metrics.hallucination_rate', 
        'metrics.gen_f1_score',
        'metrics.ret_mrr',
        'metrics.composite_score'
    ]
    available = [c for c in comparison_cols if c in runs.columns]
    
    if available:
        comparison_df = runs[available].copy()
        comparison_df.columns = [c.replace('params.', '').replace('metrics.', '') for c in available]
        # Sort only if the composite score was actually logged (avoids a KeyError)
        if 'composite_score' in comparison_df.columns:
            comparison_df = comparison_df.sort_values('composite_score', ascending=False)
        print(comparison_df.to_string())
else:
    print("No experiments found.")

In [None]:
#@title 10. Export MLflow Data to CSV
import mlflow
import pandas as pd

mlflow.set_tracking_uri("outputs/mlflow_tracking")

experiments = mlflow.search_experiments()
if experiments:
    all_runs = []
    for exp in experiments:
        runs = mlflow.search_runs(experiment_ids=[exp.experiment_id])
        all_runs.append(runs)
    
    if all_runs:
        full_df = pd.concat(all_runs, ignore_index=True)
        full_df.to_csv("outputs/mlflow_all_runs_export.csv", index=False)
        print(f"✅ Exported {len(full_df)} runs to outputs/mlflow_all_runs_export.csv")
        print(f"   Columns: {len(full_df.columns)}")
else:
    print("No experiments found.")

In [None]:
#@title 11. Quick Tips

print("""
================================================================================
QUICK TIPS
================================================================================

🔧 CUSTOMIZE GRID SEARCH:
   Edit: configs/grid_search_config.yaml
   
   Example changes:
   - Add more chunk sizes: options: [256, 512, 768, 1024]
   - Test different rerankers: options: ["bge", "cross_encoder"]
   - Increase top_k: options: [5, 10, 15, 20]

📊 UNDERSTAND METRICS:
   - quality_score: % of correct answers
   - hallucination_rate: % of fabricated answers (lower = better)
   - gen_f1_score: Token-level match with ground truth
   - ret_mrr: Mean Reciprocal Rank (retrieval quality)
   - composite_score: Weighted combination (what we optimize for)

💡 REDUCE MEMORY USAGE:
   - Use fewer documents: --num-docs 100
   - Use fewer Q&A pairs: --num-qa 20
   - Reduce max experiments: --max-experiments 5

🚀 BEST PRACTICES:
   1. Run grid search with small data first (--num-docs 100)
   2. Review results to identify promising configurations
   3. Run final evaluation with more data (--num-docs 500)
   4. Compare metrics to validate improvements

================================================================================
""")

---

## 🎉 Complete!

You've successfully run a production RAG pipeline with:

| Phase | What It Does |
|-------|-------------|
| **Phase 1** | Grid search to find optimal hyperparameters |
| **Phase 2** | Final evaluation on held-out test data |
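
As a rough illustration of what Phase 1 does, the sketch below expands a parameter grid into individual configurations and caps the count, mirroring the `--max-experiments` flag. The option values come from the Quick Tips cell. The parameter names (`chunk_size`, `reranker`, `top_k`), the `build_configs` helper, and the truncation strategy are assumptions; the real `run_grid_search.py` may select configurations differently.

```python
from itertools import product


def build_configs(grid, max_experiments=None):
    """Expand a dict of option lists into a list of config dicts."""
    keys = list(grid)
    configs = [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]
    # Assumed behavior: keep only the first max_experiments combinations
    return configs if max_experiments is None else configs[:max_experiments]


# Option values taken from the Quick Tips cell; the key names are hypothetical
grid = {
    "chunk_size": [256, 512, 768, 1024],
    "reranker": ["bge", "cross_encoder"],
    "top_k": [5, 10, 15, 20],
}

all_configs = build_configs(grid)
capped = build_configs(grid, max_experiments=10)
print(len(all_configs), len(capped))
print(capped[0])
```

With these example options the full grid has 4 × 2 × 4 = 32 combinations, which is why a cap on the number of experiments helps keep a Colab session short.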

### Key Metrics Explained

| Metric | What It Measures |
|--------|------------------|
| **Accuracy** | % of questions answered correctly |
| **Hallucination Rate** | % of fabricated answers (lower is better) |
| **F1 Score** | Token overlap with ground truth |
| **MRR** | How high the correct document ranks |
| **Hit Rate@3** | % of questions whose correct document appears in the top 3 |
| **Composite Score** | Weighted combination of all metrics |
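
To make the retrieval and generation metrics above concrete, here is a minimal sketch of MRR, Hit Rate@k, and token-level F1. These follow the standard definitions; the repository's own implementations (for example, its text normalization) may differ, and the function names and sample data are hypothetical.

```python
from collections import Counter


def reciprocal_rank(ranked_ids, relevant_id):
    """Return 1/rank of the relevant document, or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0


def hit_at_k(ranked_ids, relevant_id, k=3):
    """Return 1.0 if the relevant document is in the top k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall (whitespace tokens, lowercased)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical retrieval results: (ranked document IDs, ID of the correct document)
queries = [
    (["d1", "d2", "d3"], "d1"),  # correct doc at rank 1
    (["d4", "d5", "d6"], "d5"),  # correct doc at rank 2
    (["d7", "d8", "d9"], "d0"),  # correct doc not retrieved
]

mrr = sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)
hit_rate_at_3 = sum(hit_at_k(r, rel, k=3) for r, rel in queries) / len(queries)
f1 = token_f1("the eiffel tower", "eiffel tower")

print(f"MRR: {mrr:.3f}, Hit Rate@3: {hit_rate_at_3:.3f}, F1: {f1:.3f}")
```

MRR rewards placing the correct document near the top, while Hit Rate@3 only checks whether it appears among the first three results, so the two can diverge for the same retriever.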

---

**All models are FREE and open source - no API keys needed!**

⭐ Star this repo if you found it useful!