# Week 10: Regression & Version Comparison
### BenchRight LLM Evaluation Master Program (18 Weeks)

---

## 🎯 Learning Objectives

By the end of this notebook, you will:

1. Understand regression testing concepts and why they matter
2. Use the `compare_runs` function to compare benchmark results
3. Use the `summarize_regressions` function to identify performance drops
4. Analyze regression severity and prioritize fixes
5. Generate comprehensive regression reports

---

## 🧠 Why Regression Testing Matters

### The Challenge

When you update an LLM, improvements in some areas can cause degradation in others:

| Scenario | Improvement | Potential Regression |
|----------|-------------|---------------------|
| **Model fine-tuning** | Better task performance | Worse general knowledge |
| **Quantization** | Faster inference | Lower accuracy |
| **Architecture change** | More efficient | Different failure modes |

### Why Compare Runs?

- Catch degradation before deployment
- Make data-driven upgrade decisions
- Prioritize which regressions to fix
- Track quality over time

---

## 🛠️ Step 1: Setup & Dependencies

In [None]:
# Standard library imports
import sys
from typing import Dict, Any, List

# Add the current directory to the import path so the `src` package is importable (e.g., in Colab)
sys.path.insert(0, '.')

# Install dependencies if needed
# !pip install pandas numpy

import pandas as pd
import numpy as np

print("✅ Setup complete!")

---

## 📦 Step 2: Import the Reporting Module

In [None]:
# Import the reporting functions
from src.benchmark_engine.reporting import (
    compare_runs,
    summarize_regressions,
    generate_regression_report,
)

print("✅ Reporting module imported successfully!")
print("\n📋 Available functions:")
print("   - compare_runs: Compare metrics between two benchmark runs")
print("   - summarize_regressions: Identify cases where new model is worse")
print("   - generate_regression_report: Create comprehensive regression analysis")

---

## 📊 Step 3: Create Synthetic Benchmark Data

We'll create two synthetic DataFrames representing benchmark results from:
- **Run A (Baseline):** Current production model
- **Run B (New Model):** Updated model we're evaluating
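
Before comparing two runs, it can help to confirm that both cover the same prompts, since a prompt that appears in only one run has nothing to be compared against. The sketch below uses a hypothetical helper, `check_alignment`, which is not part of the BenchRight reporting module, and tiny made-up frames rather than the benchmark data in this notebook.

```python
import pandas as pd


def check_alignment(baseline: pd.DataFrame, candidate: pd.DataFrame, on: str = "prompt") -> dict:
    """Hypothetical helper: report keys that are missing or duplicated across two runs."""
    base_keys = set(baseline[on])
    cand_keys = set(candidate[on])
    return {
        "only_in_baseline": sorted(base_keys - cand_keys),
        "only_in_candidate": sorted(cand_keys - base_keys),
        "duplicates_in_baseline": int(baseline[on].duplicated().sum()),
        "duplicates_in_candidate": int(candidate[on].duplicated().sum()),
    }


# Illustrative data only (not the notebook's benchmark results)
baseline = pd.DataFrame({"prompt": ["p1", "p2", "p3"], "score": [0.9, 0.8, 0.7]})
candidate = pd.DataFrame({"prompt": ["p1", "p2", "p4"], "score": [0.9, 0.7, 0.6]})

alignment = check_alignment(baseline, candidate)
print(alignment)
```

If either "only in" list is non-empty, or a duplicate count is above zero, it is worth resolving that before reading anything into the per-prompt differences.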

In [None]:
# Create synthetic benchmark data for baseline model (Run A)
run_a = pd.DataFrame({
    "prompt": [
        "What is the capital of France?",
        "Explain machine learning in simple terms.",
        "What is 2+2?",
        "Define artificial intelligence.",
        "What is the speed of light?",
        "Summarize the key principles of software engineering.",
        "What is the largest planet in our solar system?",
        "Explain quantum computing basics.",
    ],
    "score": [0.95, 0.88, 1.00, 0.92, 0.85, 0.78, 0.90, 0.82],
    "latency_ms": [45, 62, 38, 55, 70, 85, 48, 95],
    "tokens_per_second": [120, 95, 140, 105, 85, 72, 115, 65],
})

# Create synthetic benchmark data for new model (Run B)
# Note: Some metrics improve, some regress
run_b = pd.DataFrame({
    "prompt": [
        "What is the capital of France?",
        "Explain machine learning in simple terms.",
        "What is 2+2?",
        "Define artificial intelligence.",
        "What is the speed of light?",
        "Summarize the key principles of software engineering.",
        "What is the largest planet in our solar system?",
        "Explain quantum computing basics.",
    ],
    "score": [0.92, 0.91, 1.00, 0.85, 0.88, 0.82, 0.88, 0.79],
    "latency_ms": [42, 58, 40, 65, 68, 78, 52, 88],
    "tokens_per_second": [125, 100, 135, 90, 88, 80, 110, 70],
})

print("📊 Run A (Baseline Model):")
print("=" * 80)
display(run_a)

print("\n📊 Run B (New Model):")
print("=" * 80)
display(run_b)

---

## 🔍 Step 4: Compare Benchmark Runs

The `compare_runs` function merges two DataFrames and calculates differences for each metric.
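
To build intuition for what a merge-and-diff involves, here is a minimal pandas sketch. `compare_runs_sketch` is a hypothetical stand-in written for illustration, not the actual `compare_runs` implementation. It assumes the `_a` / `_b` / `_diff` column naming used later in this notebook and computes each difference as new minus baseline.

```python
import pandas as pd


def compare_runs_sketch(run_a: pd.DataFrame, run_b: pd.DataFrame, on: str = "prompt") -> pd.DataFrame:
    """Illustrative stand-in: merge two runs on a key and add <metric>_diff columns."""
    # Inner merge keeps only keys present in both runs; shared columns get _a / _b suffixes
    merged = run_a.merge(run_b, on=on, suffixes=("_a", "_b"))
    # Assumption: every non-key column present in both runs is a numeric metric
    metrics = [c for c in run_a.columns if c != on and c in run_b.columns]
    for metric in metrics:
        # Difference is new (b) minus baseline (a)
        merged[f"{metric}_diff"] = merged[f"{metric}_b"] - merged[f"{metric}_a"]
    return merged


# Two-row illustrative example
a = pd.DataFrame({"prompt": ["p1", "p2"], "score": [0.95, 0.88], "latency_ms": [45, 62]})
b = pd.DataFrame({"prompt": ["p1", "p2"], "score": [0.92, 0.91], "latency_ms": [42, 58]})

diff = compare_runs_sketch(a, b)
print(diff[["prompt", "score_diff", "latency_ms_diff"]])
```

With this sign convention, a negative `score_diff` means the new run scored lower, while a negative `latency_ms_diff` means the new run was faster.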

In [None]:
# Compare the two runs
print("📈 Comparing Runs...")
print("=" * 80)

diff_df = compare_runs(run_a, run_b, on="prompt")

print("\n📊 Comparison DataFrame Columns:")
print(diff_df.columns.tolist())

print("\n📊 Merged Data with Differences:")
display(diff_df[['prompt', 'score_a', 'score_b', 'score_diff', 'latency_ms_diff']])

In [None]:
# Analyze the differences
print("📊 Difference Summary Statistics")
print("=" * 60)

print("\n🎯 Score Differences (new - baseline):")
print(f"   Mean:  {diff_df['score_diff'].mean():.4f}")
print(f"   Std:   {diff_df['score_diff'].std():.4f}")
print(f"   Min:   {diff_df['score_diff'].min():.4f}")
print(f"   Max:   {diff_df['score_diff'].max():.4f}")

print("\n⏱️ Latency Differences (new - baseline):")
print(f"   Mean:  {diff_df['latency_ms_diff'].mean():.2f} ms")
print(f"   Std:   {diff_df['latency_ms_diff'].std():.2f} ms")
print(f"   Min:   {diff_df['latency_ms_diff'].min():.2f} ms")
print(f"   Max:   {diff_df['latency_ms_diff'].max():.2f} ms")

# Interpretation
avg_score_diff = diff_df['score_diff'].mean()
avg_latency_diff = diff_df['latency_ms_diff'].mean()

print("\n📝 Interpretation:")
if avg_score_diff > 0:
    print(f"   ✅ Average score improved by {avg_score_diff:.4f}")
else:
    print(f"   ⚠️ Average score decreased by {abs(avg_score_diff):.4f}")

if avg_latency_diff < 0:
    print(f"   ✅ Average latency improved by {abs(avg_latency_diff):.2f} ms")
else:
    print(f"   ⚠️ Average latency increased by {avg_latency_diff:.2f} ms")

---

## 🔍 Step 5: Identify Score Regressions

For score metrics where **higher is better**, a regression occurs when the new model scores lower.
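
The direction rule can be sketched as a small filter. `find_regressions` below is a hypothetical helper, not the real `summarize_regressions`. The definition of `regression_severity` used here (the size of the change in the "worse" direction) is an assumption made for illustration.

```python
import pandas as pd


def find_regressions(diff_df: pd.DataFrame, metric: str, higher_is_better: bool = True) -> pd.DataFrame:
    """Hypothetical filter: keep rows where the new run is worse, ranked worst first."""
    diff = diff_df[f"{metric}_diff"]
    # Flip the sign when higher is better so a positive severity always means "worse"
    severity = -diff if higher_is_better else diff
    out = diff_df.assign(regression_severity=severity)
    out = out[out["regression_severity"] > 0]
    return out.sort_values("regression_severity", ascending=False)


# Illustrative scores for four prompts
scores = pd.DataFrame({
    "prompt": ["p1", "p2", "p3", "p4"],
    "score_a": [0.95, 0.88, 1.00, 0.92],
    "score_b": [0.92, 0.91, 1.00, 0.85],
})
scores["score_diff"] = scores["score_b"] - scores["score_a"]

regs = find_regressions(scores, metric="score", higher_is_better=True)
print(regs[["prompt", "score_diff", "regression_severity"]])
```

Unchanged prompts (a difference of exactly zero) and improved prompts are excluded, so only `p4` and `p1` remain, with the larger drop listed first.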

In [None]:
# Find score regressions (higher is better)
print("🔍 Finding Score Regressions (higher is better)")
print("=" * 80)

score_regressions = summarize_regressions(
    diff_df, 
    metric="score", 
    higher_is_better=True
)

if not score_regressions.empty:
    print(f"\n⚠️ Found {len(score_regressions)} score regressions:")
    print("-" * 80)
    
    # Display regressions sorted by severity
    display(score_regressions[['prompt', 'score_a', 'score_b', 'score_diff', 'regression_severity']])
    
    print("\n📋 Regression Details:")
    for i, (_, row) in enumerate(score_regressions.iterrows(), 1):
        print(f"\n   [{i}] {row['prompt'][:50]}...")
        print(f"       Baseline: {row['score_a']:.2f} → New: {row['score_b']:.2f}")
        print(f"       Change: {row['score_diff']:.4f} (Severity: {row['regression_severity']:.4f})")
else:
    print("\n✅ No score regressions found!")

---

## 🔍 Step 6: Identify Latency Regressions

For latency metrics where **lower is better**, a regression occurs when the new model is slower.
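
Because lower is better here, a positive difference (new minus baseline) is the regression. An absolute slowdown in milliseconds can also be misleading when prompts have very different baseline latencies, so one option is to look at the relative slowdown as well. The sketch below uses made-up numbers, and the `slowdown_pct` column is an illustrative addition, not something the reporting module is known to produce.

```python
import pandas as pd

# Illustrative latencies in milliseconds (baseline vs. new model)
latency = pd.DataFrame({
    "prompt": ["p1", "p2", "p3"],
    "latency_ms_a": [40, 50, 100],
    "latency_ms_b": [38, 60, 105],
})

# Positive diff = slower = regression for a lower-is-better metric
latency["latency_ms_diff"] = latency["latency_ms_b"] - latency["latency_ms_a"]
# Relative change expresses each slowdown as a percentage of its own baseline
latency["slowdown_pct"] = 100 * latency["latency_ms_diff"] / latency["latency_ms_a"]

slower = latency[latency["latency_ms_diff"] > 0].sort_values("slowdown_pct", ascending=False)
print(slower[["prompt", "latency_ms_diff", "slowdown_pct"]])
```

Here `p2` (+10 ms on a 50 ms baseline) ranks above `p3` (+5 ms on a 100 ms baseline), and `p1` is excluded because it got faster.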

In [None]:
# Find latency regressions (lower is better)
print("🔍 Finding Latency Regressions (lower is better)")
print("=" * 80)

latency_regressions = summarize_regressions(
    diff_df, 
    metric="latency_ms", 
    higher_is_better=False
)

if not latency_regressions.empty:
    print(f"\n⚠️ Found {len(latency_regressions)} latency regressions:")
    print("-" * 80)
    
    # Display regressions sorted by severity
    display(latency_regressions[['prompt', 'latency_ms_a', 'latency_ms_b', 'latency_ms_diff', 'regression_severity']])
    
    print("\n📋 Regression Details:")
    for i, (_, row) in enumerate(latency_regressions.iterrows(), 1):
        print(f"\n   [{i}] {row['prompt'][:50]}...")
        print(f"       Baseline: {row['latency_ms_a']:.0f}ms → New: {row['latency_ms_b']:.0f}ms")
        print(f"       Slowdown: +{row['latency_ms_diff']:.0f}ms")
else:
    print("\n✅ No latency regressions found!")

---

## 🔍 Step 7: Identify Throughput Regressions

In [None]:
# Find throughput regressions (higher is better)
print("🔍 Finding Throughput Regressions (higher is better)")
print("=" * 80)

throughput_regressions = summarize_regressions(
    diff_df, 
    metric="tokens_per_second", 
    higher_is_better=True
)

if not throughput_regressions.empty:
    print(f"\n⚠️ Found {len(throughput_regressions)} throughput regressions:")
    print("-" * 80)
    
    display(throughput_regressions[['prompt', 'tokens_per_second_a', 'tokens_per_second_b', 'tokens_per_second_diff', 'regression_severity']])
else:
    print("\n✅ No throughput regressions found!")

---

## 📋 Step 8: Generate Comprehensive Report

In [None]:
# Generate comprehensive regression report
print("📋 Generating Full Regression Report...")
print("=" * 80)

report = generate_regression_report(
    run_a,
    run_b,
    on="prompt",
    metrics=["score", "latency_ms", "tokens_per_second"],
    metric_directions={
        "score": True,             # Higher is better
        "latency_ms": False,       # Lower is better
        "tokens_per_second": True, # Higher is better
    },
)

print(f"\n📊 Report Summary:")
print(f"   Total test cases: {report['total_cases']}")
print(f"   Metrics analyzed: {report['metrics_analyzed']}")

In [None]:
# Display per-metric summary
print("📈 Per-Metric Analysis")
print("=" * 80)

for metric, stats in report['summary'].items():
    direction_symbol = "↑" if stats['higher_is_better'] else "↓"
    
    print(f"\n🔹 {metric} ({direction_symbol} = better):")
    print(f"   Mean difference: {stats['mean_diff']:.4f}")
    print(f"   Std difference:  {stats['std_diff']:.4f}")
    print(f"   Min difference:  {stats['min_diff']:.4f}")
    print(f"   Max difference:  {stats['max_diff']:.4f}")
    print(f"   Regressions:     {stats['total_regressions']} ({stats['regression_rate']:.1%})")
    
    if stats['total_regressions'] > 0:
        print(f"   Max severity:    {stats['max_regression_severity']:.4f}")
        print(f"   Mean severity:   {stats['mean_regression_severity']:.4f}")

---

## 📊 Step 9: Create Regression Summary Table

In [None]:
# Create a summary table for all regressions
print("📊 Regression Summary Table")
print("=" * 80)

summary_data = []
for metric, stats in report['summary'].items():
    direction = "↑ better" if stats['higher_is_better'] else "↓ better"
    summary_data.append({
        "Metric": metric,
        "Direction": direction,
        "Mean Δ": f"{stats['mean_diff']:.4f}",
        "Regressions": stats['total_regressions'],
        "Rate": f"{stats['regression_rate']:.1%}",
        "Max Severity": f"{stats.get('max_regression_severity', 0):.4f}" if stats['total_regressions'] > 0 else "N/A",
    })

summary_table = pd.DataFrame(summary_data)
display(summary_table)

---

## 🔧 Step 10: Using Thresholds to Filter Noise

In [None]:
# Use thresholds to filter out small fluctuations
print("🔧 Filtering Regressions with Threshold")
print("=" * 80)

# Only count regressions > 5% drop as significant
threshold = 0.05

print(f"\n📋 Using threshold: {threshold} (5% minimum regression)")

significant_regressions = summarize_regressions(
    diff_df,
    metric="score",
    threshold=threshold,
    higher_is_better=True,
)

print(f"\n📊 Without threshold: {len(score_regressions)} regressions")
print(f"📊 With 5% threshold: {len(significant_regressions)} significant regressions")

if not significant_regressions.empty:
    print("\n⚠️ Significant regressions (> 5%):")
    display(significant_regressions[['prompt', 'score_a', 'score_b', 'score_diff', 'regression_severity']])
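
A threshold such as `0.05` can be read in two ways: as an absolute drop in score points, or as a drop relative to the baseline score. This notebook does not show which interpretation `summarize_regressions` applies, so the sketch below simply illustrates both on made-up data. The `drop` and `relative_drop` columns are hypothetical names chosen for this example.

```python
import pandas as pd

# Illustrative scores (not the notebook's benchmark data)
scores = pd.DataFrame({
    "prompt": ["p1", "p2", "p3"],
    "score_a": [0.50, 0.90, 0.80],
    "score_b": [0.46, 0.84, 0.79],
})
scores["drop"] = scores["score_a"] - scores["score_b"]        # absolute drop in score points
scores["relative_drop"] = scores["drop"] / scores["score_a"]  # drop as a fraction of the baseline

threshold = 0.05
absolute_hits = scores[scores["drop"] > threshold]
relative_hits = scores[scores["relative_drop"] > threshold]

print("Absolute threshold:", absolute_hits["prompt"].tolist())
print("Relative threshold:", relative_hits["prompt"].tolist())
```

The two readings disagree on `p1`: a 0.04-point drop is under an absolute threshold of 0.05, but it is 8% of a 0.50 baseline. Whichever convention your tooling uses, it is worth stating it explicitly in the report.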

---

## 📋 Step 11: Identify Improvements

In [None]:
# Find cases where the new model improved
print("✅ Identifying Improvements")
print("=" * 80)

# Score improvements (positive diff for higher-is-better)
score_improvements = diff_df[diff_df['score_diff'] > 0].copy()

print(f"\n🎯 Score Improvements: {len(score_improvements)} cases")
if not score_improvements.empty:
    for _, row in score_improvements.iterrows():
        print(f"   ✓ '{row['prompt'][:40]}...': {row['score_a']:.2f} → {row['score_b']:.2f} (+{row['score_diff']:.2f})")

# Latency improvements (negative diff for lower-is-better)
latency_improvements = diff_df[diff_df['latency_ms_diff'] < 0].copy()

print(f"\n⏱️ Latency Improvements: {len(latency_improvements)} cases")
if not latency_improvements.empty:
    for _, row in latency_improvements.iterrows():
        print(f"   ✓ '{row['prompt'][:40]}...': {row['latency_ms_a']:.0f}ms → {row['latency_ms_b']:.0f}ms ({row['latency_ms_diff']:.0f}ms)")

---

## 🎓 Mini-Project: Regression Audit

### Task

Create a comprehensive regression analysis for a model update.

### Template

In [None]:
# Your regression audit code here

# Step 1: Create or load your benchmark data
# my_baseline = pd.DataFrame({...})
# my_new_model = pd.DataFrame({...})

# Step 2: Compare runs
# diff = compare_runs(my_baseline, my_new_model, on='prompt')

# Step 3: Analyze regressions for each metric
# - Use summarize_regressions for quality metrics
# - Use summarize_regressions with higher_is_better=False for latency

# Step 4: Generate report
# report = generate_regression_report(...)

# Step 5: Make a deployment decision
# - How many regressions are acceptable?
# - What severity threshold is critical?
# - Do improvements outweigh regressions?

print("📝 Complete the mini-project using the template above.")
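
For Step 5 of the template, one way to make the deployment decision explicit is to encode your limits as a small rule. The sketch below is only a starting point. `MetricSummary`, `deployment_decision`, the field names, and the limit values are all hypothetical and should be replaced with whatever your team agrees on.

```python
from dataclasses import dataclass


@dataclass
class MetricSummary:
    """Hypothetical per-metric summary; field names are illustrative."""
    name: str
    regression_rate: float  # fraction of test cases that regressed
    max_severity: float     # worst single regression, in the metric's own units


def deployment_decision(summaries, max_rate=0.25, critical_severity=None):
    """Return ("deploy" or "hold", reasons) based on simple, team-chosen limits."""
    critical_severity = critical_severity or {}
    reasons = []
    for s in summaries:
        if s.regression_rate > max_rate:
            reasons.append(f"{s.name}: regression rate {s.regression_rate:.0%} exceeds {max_rate:.0%}")
        limit = critical_severity.get(s.name)
        if limit is not None and s.max_severity > limit:
            reasons.append(f"{s.name}: max severity {s.max_severity} exceeds {limit}")
    return ("hold" if reasons else "deploy"), reasons


# Illustrative values only
summaries = [
    MetricSummary("score", regression_rate=0.50, max_severity=0.07),
    MetricSummary("latency_ms", regression_rate=0.375, max_severity=10.0),
]
decision, reasons = deployment_decision(
    summaries, max_rate=0.6, critical_severity={"score": 0.05}
)
print(decision)
for reason in reasons:
    print(" -", reason)
```

A rule like this does not weigh improvements against regressions; it only makes the "what is unacceptable" part of the decision reproducible.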

---

## 🤔 Paul-Elder Critical Thinking Questions

Reflect on these questions:

### Question 1: EVIDENCE
**If a model improves average score but has severe regressions on 10% of cases, should you deploy?**
*Consider: Severity of regressions, user impact, rollback capability, monitoring.*

### Question 2: ASSUMPTIONS
**What assumptions are we making about the representativeness of our test prompts?**
*Consider: Production distribution, edge cases, domain coverage, user behavior.*

### Question 3: IMPLICATIONS
**If we set a threshold to ignore small regressions, what might we miss over time?**
*Consider: Accumulation of small drops, compounding effects, gradual degradation.*
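
To make Question 3 concrete, the sketch below uses invented mean scores for five successive model versions. Each release-to-release drop stays under the per-release threshold, yet the drop relative to the first version is several times larger.

```python
# Hypothetical mean scores for five successive model versions (illustrative numbers)
version_scores = [0.90, 0.88, 0.86, 0.84, 0.82]
threshold = 0.03  # per-release drop that would be flagged

# Drop between each version and the one before it
step_drops = [prev - curr for prev, curr in zip(version_scores, version_scores[1:])]
flagged_steps = [d for d in step_drops if d > threshold]

# Drop between the first and the latest version
cumulative_drop = version_scores[0] - version_scores[-1]

print(f"Flagged release-to-release drops: {len(flagged_steps)}")
print(f"Cumulative drop vs. first version: {cumulative_drop:.2f}")
```

Comparing each new model against a fixed reference version, in addition to the previous release, is one way to surface this kind of gradual degradation.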

---

## ⚠️ Limitations of Regression Testing

### What These Tests DON'T Cover

1. **Statistical Significance:** The current implementation doesn't test whether differences are statistically significant
2. **Distribution Shift:** Test prompts may not represent production distribution
3. **Qualitative Changes:** Some regressions are subjective and hard to measure
4. **Side Effects:** Changes may affect other metrics not being tracked
5. **Long-term Trends:** Single comparison doesn't show degradation over time
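
As an illustration of the first limitation, a percentile bootstrap over the per-prompt differences gives a rough confidence interval for the mean change. This is a sketch using NumPy only, not part of the reporting module. `bootstrap_mean_diff_ci` is a hypothetical helper, and with only eight prompts the interval should be treated as indicative at best.

```python
import numpy as np


def bootstrap_mean_diff_ci(diffs, n_resamples=5000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of paired differences."""
    diffs = np.asarray(diffs, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample the per-prompt differences with replacement and record each resample's mean
    idx = rng.integers(0, len(diffs), size=(n_resamples, len(diffs)))
    means = diffs[idx].mean(axis=1)
    alpha = (1.0 - confidence) / 2.0
    low, high = np.quantile(means, [alpha, 1.0 - alpha])
    return diffs.mean(), low, high


# Score differences (new - baseline) for the eight prompts from Step 3
score_diffs = [-0.03, 0.03, 0.00, -0.07, 0.03, 0.04, -0.02, -0.03]

mean_diff, low, high = bootstrap_mean_diff_ci(score_diffs)
print(f"mean diff = {mean_diff:.4f}, 95% CI = [{low:.4f}, {high:.4f}]")
if low <= 0.0 <= high:
    print("The interval contains 0, so this sample does not clearly show a change.")
```

Resampling whole prompts keeps each baseline/new pair together, which is why the bootstrap runs on the differences rather than on the two score columns separately.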

### Future Improvements (TODO)

- Statistical significance testing (p-values, confidence intervals)
- Multi-version trend analysis
- Automatic regression alerting
- Visualization of regression distributions
- Integration with CI/CD pipelines

---

## ‚úÖ Knowledge Mastery Checklist

Before moving to Week 11, ensure you can check all boxes:

- [ ] I understand why regression testing is critical for model updates
- [ ] I can use `compare_runs` to compare two benchmark DataFrames
- [ ] I can use `summarize_regressions` to identify performance drops
- [ ] I understand the difference between metrics where higher/lower is better
- [ ] I can interpret regression severity and prioritize fixes
- [ ] I know the limitations of regression testing

---

**Week 10 Complete!** 🎉

**Next:** *Week 11: Banking & Finance Use Cases*