# Summarisation Results Analysis

This notebook analyses the results of summarisation experiments using the Model Evaluation Suite (MES) framework.

## Features
- **Weighted Scoring**: Combines multiple evaluation metrics with configurable weights
- **Hard Failure Thresholds**: Auto-fails experiments below critical safety/groundedness thresholds  
- **Performance Analysis**: Latency statistics and grade distribution
- **Comparative Visualisation**: Charts comparing experiments across multiple dimensions

## Scoring Methodology
The scoring system uses a weighted combination of:
- **VertexAI Groundedness** (25%) - Factual accuracy
- **VertexAI Safety** (20%) - Safety compliance
- **Summarisation Quality** (15%) - Framework's quality metric
- **VertexAI Summarisation Quality** (10%) - VertexAI's assessment
- **Other metrics** (30%) - Coherence, instruction following, fluency, verbosity

Any result that scores below 0.5 on safety or groundedness has its weighted score capped at 0.59, so it is graded 'F' regardless of its other metrics.
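
The arithmetic behind the methodology can be sketched in plain Python. This is a minimal illustration rather than the notebook's own implementation. The weights, the 0.5 thresholds, the 0.59 cap and the letter cut-offs (0.90 / 0.80 / 0.70 / 0.60) mirror the configuration used in the cells that follow, while `score_and_grade` and the sample values are hypothetical names introduced only for this example.

```python
# Weights copied from the scoring methodology; they sum to 1.0.
WEIGHTS = {
    "vertexai_groundedness": 0.25,
    "vertexai_safety": 0.20,
    "summarisation_quality": 0.15,
    "vertexai_summarization_quality": 0.10,
    "vertexai_coherence": 0.10,
    "vertexai_instruction_following": 0.10,
    "vertexai_fluency": 0.06,
    "vertexai_verbosity": 0.04,
}
HARD_FAIL_METRICS = ("vertexai_groundedness", "vertexai_safety")
HARD_FAIL_THRESHOLD = 0.5
FAIL_CAP = 0.59  # highest score that still maps to 'F'


def score_and_grade(metrics):
    """Hypothetical helper: return (weighted score, letter grade) for one result."""
    # Use only the metrics that are present, then renormalise by their weight.
    available = {m: w for m, w in WEIGHTS.items() if metrics.get(m) is not None}
    total_weight = sum(available.values())
    if total_weight > 0:
        score = sum(metrics[m] * w for m, w in available.items()) / total_weight
    else:
        score = 0.0

    # Hard-fail rule: cap the score so the result cannot grade above 'F'.
    for m in HARD_FAIL_METRICS:
        value = metrics.get(m)
        if value is not None and value < HARD_FAIL_THRESHOLD:
            score = min(score, FAIL_CAP)

    for cutoff, grade in ((0.90, "A"), (0.80, "B"), (0.70, "C"), (0.60, "D")):
        if score >= cutoff:
            return score, grade
    return score, "F"


# Hypothetical sample: strong on every metric except groundedness.
sample = {name: 0.9 for name in WEIGHTS}
sample["vertexai_groundedness"] = 0.4
score, grade = score_and_grade(sample)
print(f"{score:.3f} {grade}")  # prints "0.590 F"
```

Without the hard-fail rule this sample would score 0.775 (a 'C'); the cap is what pushes it down to 'F'.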

In [None]:
import sys
import yaml
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Add src directory to path
sys.path.append('../src')

from clients.bigquery_client import BigQueryClient

# Load configuration from YAML
CONFIG_PATH = '../config/sample_experiments.yaml'

with open(CONFIG_PATH, 'r') as f:
    config = yaml.safe_load(f)

print("Configuration loaded successfully.")
print(f"Project: {config['project']}")
print(f"Location: {config['location']}")
print(f"BigQuery Dataset: {config['bigquery']['dataset_name']}")
print(f"BigQuery Table: {config['bigquery']['table_name']}")

In [None]:
# Initialise BigQuery client using framework
bq_client = BigQueryClient(
    project_id=config['bigquery']['project_id'],
    location=config['bigquery']['location']
)

# Build table reference from config
bq_dataset = config['bigquery']['dataset_name']
bq_table = config['bigquery']['table_name']
full_table_id = f"{config['bigquery']['project_id']}.{bq_dataset}.{bq_table}"

print(f"BigQuery client initialised for: {full_table_id}")

# Query experiment results; column names follow the framework schema
query = f"""
SELECT
  experiment_name,
  model_id,
  summarisation_quality,
  vertexai_groundedness,
  vertexai_summarization_quality,
  vertexai_coherence, 
  vertexai_safety,
  vertexai_instruction_following,
  vertexai_fluency,
  vertexai_verbosity,
  latency_ms,
  timestamp,
  metadata
FROM `{full_table_id}`
WHERE use_case = 'summarisation'
  AND error IS NULL
"""

print("Running query...")
results = bq_client.query(query)
df_raw = pd.DataFrame([dict(row) for row in results])

print(f"Loaded {len(df_raw)} rows from BigQuery")
df_raw.head()

In [None]:
# Scoring weights configuration
WEIGHTS = {
    "vertexai_groundedness":        0.25,  # HIGH weight - factual accuracy
    "vertexai_safety":              0.20,  # HIGH weight - safety compliance  
    "summarisation_quality":        0.15,  # Framework's summarisation metric
    "vertexai_summarization_quality": 0.10, # VertexAI's summarisation metric
    "vertexai_coherence":           0.10,  # Logical flow and consistency
    "vertexai_instruction_following": 0.10, # Following the prompt instructions
    "vertexai_fluency":             0.06,  # Language quality
    "vertexai_verbosity":           0.04,  # Appropriate length
}

# Hard failure thresholds - experiments scoring below these are auto-failed
HARD_FAILS = {
    "vertexai_groundedness":  0.50,   # Must be factually grounded
    "vertexai_safety":        0.50,   # Must be safe
}

# Validate that we have the required columns
required_cols = list(WEIGHTS.keys())
missing_cols = [col for col in required_cols if col not in df_raw.columns]
if missing_cols:
    print(f"WARNING: Missing columns in data: {missing_cols}")
    print(f"Available columns: {list(df_raw.columns)}")
else:
    print("✅ All required metric columns are present")

print(f"\nWeights configuration:")
for metric, weight in WEIGHTS.items():
    print(f"  {metric}: {weight:.2f}")
print(f"\nTotal weight: {sum(WEIGHTS.values()):.2f}")


In [None]:
def weighted_score(row):
    """Calculate weighted score from available metrics"""
    score = 0
    total_weight = 0
    
    for metric, weight in WEIGHTS.items():
        if metric in row and pd.notna(row[metric]):
            score += row[metric] * weight
            total_weight += weight
        else:
            print(f"Warning: Missing or null value for {metric} in row")
    
    # Normalise by actual total weight if some metrics are missing
    return score / total_weight if total_weight > 0 else 0

def letter_grade(score):
    """Convert numerical score to letter grade"""
    if   score >= 0.90: return "A"
    elif score >= 0.80: return "B" 
    elif score >= 0.70: return "C"
    elif score >= 0.60: return "D"
    else:               return "F"

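# Copy up front so df is always defined, even when the query returns no rows
# (later cells reference df unconditionally)
df = df_raw.copy()

# Only proceed if we have data
if len(df) == 0:
    print("❌ No data found. Check your query and table.")
else: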
    
    # Calculate weighted scores
    df["weighted_score"] = df.apply(weighted_score, axis=1)
    
    # Apply hard failure thresholds
    for metric, threshold in HARD_FAILS.items():
        if metric in df.columns:
            failing_mask = df[metric] < threshold
            num_failing = failing_mask.sum()
            if num_failing > 0:
                print(f"⚠️  {num_failing} experiments failed {metric} threshold ({threshold})")
                # Cap failing rows at 0.59, just below the 0.60 'D' cut-off, so they grade 'F'
                df.loc[failing_mask, "weighted_score"] = df.loc[failing_mask, "weighted_score"].clip(upper=0.59)
    
    # Assign letter grades
    df["grade"] = df["weighted_score"].apply(letter_grade)
    
    print(f"\n✅ Processed {len(df)} experiments")
    print(f"Grade distribution:")
    print(df["grade"].value_counts().sort_index())


In [None]:
# Latency analysis
if 'latency_ms' in df.columns and df['latency_ms'].notna().any():
    lat_stats = df["latency_ms"].describe(percentiles=[0.5, 0.9, 0.99])[
        ["50%", "90%", "99%", "mean", "min", "max"]
    ]
    print("📊 Latency Statistics (ms):")
    print(lat_stats)
    
    # Additional latency insights
    print(f"\n🚀 Performance Insights:")
    print(f"Median latency: {df['latency_ms'].median():.0f}ms")
    print(f"95th percentile: {df['latency_ms'].quantile(0.95):.0f}ms")
    print(f"Experiments > 5s: {(df['latency_ms'] > 5000).sum()}/{len(df)}")
else:
    print("⚠️  No latency data available")

In [None]:
# Create experiment scoreboard
if len(df) > 0:
    # Group by experiment_name (framework uses this instead of experiment_id)
    groupby_col = 'experiment_name' if 'experiment_name' in df.columns else 'model_id'
    
    scoreboard = (
        df.groupby(groupby_col)
          .agg(
              samples          = ("weighted_score", "size"),
              avg_score        = ("weighted_score", "mean"),
              std_score        = ("weighted_score", "std"),
              p50_latency_ms   = ("latency_ms", lambda s: np.percentile(s.dropna(), 50) if len(s.dropna()) > 0 else np.nan),
              p90_latency_ms   = ("latency_ms", lambda s: np.percentile(s.dropna(), 90) if len(s.dropna()) > 0 else np.nan),
              fail_rate        = ("grade", lambda g: (g == "F").mean()),
              a_grade_rate     = ("grade", lambda g: (g == "A").mean()),
          )
          .sort_values("avg_score", ascending=False)
    )
    
    print(f"📈 Experiment Scoreboard (Top 10):")
    print("=" * 80)
    display(scoreboard.head(10))
    
    # Summary insights
    print(f"\n🎯 Key Insights:")
    print(f"Best performing experiment: {scoreboard.index[0]} (score: {scoreboard.iloc[0]['avg_score']:.3f})")
    print(f"Worst performing experiment: {scoreboard.index[-1]} (score: {scoreboard.iloc[-1]['avg_score']:.3f})")
    print(f"Average fail rate: {scoreboard['fail_rate'].mean():.1%}")
    print(f"Experiments with A grades: {(scoreboard['a_grade_rate'] > 0).sum()}/{len(scoreboard)}")
else:
    print("❌ No data to create scoreboard")


In [None]:
import matplotlib.pyplot as plt

# Enhanced visualisations
if len(df) > 0 and 'weighted_score' in df.columns:
    
    # Set up the plotting style
    plt.style.use('default')
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    fig.suptitle('Summarisation Experiment Analysis', fontsize=16, fontweight='bold')
    
    # 1. Average Weighted Score per Experiment
    ax1 = axes[0, 0]
    scoreboard["avg_score"].plot(kind="barh", ax=ax1, color='skyblue')
    ax1.set_title("Average Weighted Score per Experiment")
    ax1.set_xlabel("Weighted Score")
    ax1.grid(True, alpha=0.3)
    ax1.invert_yaxis()
    
    # 2. Latency comparison (if available)
    ax2 = axes[0, 1]
    if 'latency_ms' in df.columns and df['latency_ms'].notna().any():
        latency_cols = ["p50_latency_ms", "p90_latency_ms"]
        available_cols = [col for col in latency_cols if col in scoreboard.columns]
        if available_cols:
            scoreboard[available_cols].plot(kind="barh", ax=ax2, color=['lightgreen', 'orange'])
            ax2.set_title("Latency Distribution (p50 / p90)")
            ax2.set_xlabel("Latency (ms)")
            ax2.grid(True, alpha=0.3)
            ax2.invert_yaxis()
        else:
            ax2.text(0.5, 0.5, 'No latency data', ha='center', va='center', transform=ax2.transAxes)
    else:
        ax2.text(0.5, 0.5, 'No latency data available', ha='center', va='center', transform=ax2.transAxes)
    
    # 3. Grade distribution
    ax3 = axes[1, 0]
    grade_counts = df["grade"].value_counts().reindex(['A', 'B', 'C', 'D', 'F'], fill_value=0)
    colors = ['green', 'lightgreen', 'yellow', 'orange', 'red']
    grade_counts.plot(kind='bar', ax=ax3, color=colors[:len(grade_counts)])
    ax3.set_title("Grade Distribution")
    ax3.set_ylabel("Number of Experiments")
    ax3.set_xlabel("Grade")
    ax3.tick_params(axis='x', rotation=0)
    ax3.grid(True, alpha=0.3)
    
    # 4. Score vs Fail Rate
    ax4 = axes[1, 1]
    if len(scoreboard) > 1:
        ax4.scatter(scoreboard['avg_score'], scoreboard['fail_rate'], 
                   s=scoreboard['samples']*20, alpha=0.6, color='purple')
        ax4.set_xlabel('Average Score')
        ax4.set_ylabel('Fail Rate')
        ax4.set_title('Score vs Fail Rate\n(bubble size = sample count)')
        ax4.grid(True, alpha=0.3)
    else:
        ax4.text(0.5, 0.5, 'Need >1 experiment\nfor comparison', 
                ha='center', va='center', transform=ax4.transAxes)
    
    plt.tight_layout()
    plt.show()
    
else:
    print("❌ No data available for visualisation")


## Summary

This analysis scores each summarisation result with a weighted combination of eight metrics, applies hard-failure thresholds for groundedness and safety, and compares experiments on average score, latency, and grade distribution.

### Key Takeaways
1. **Quality Focus**: High weights on groundedness (25%) and safety (20%), backed by hard-failure thresholds, prioritise factual and safe outputs
2. **Performance Monitoring**: Latency percentiles (p50, p90) show which experiments respond slowly
3. **Comparative Analysis**: Easy identification of best-performing configurations

### Next Steps
- **Experiment Optimisation**: Focus on improving low-scoring experiments
- **Weight and Threshold Tuning**: Adjust the metric weights and hard-failure thresholds to match use case requirements
- **Cost Analysis**: Combine performance data with token usage for ROI analysis
- **A/B Testing**: Use insights to design follow-up experiments
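
When the weights are adjusted for a particular use case, it can help to rescale them so they still sum to 1 and the letter-grade cut-offs stay comparable between runs. The sketch below is one possible way to do that. `rescale_weights` is a hypothetical helper, not part of the framework, and the retuned values are made up for illustration.

```python
def rescale_weights(weights):
    """Hypothetical helper: scale non-negative weights so they sum to 1."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {name: w / total for name, w in weights.items()}


# Illustrative retuning: raise groundedness from 0.25 to 0.40 and keep the
# other weights as they are; the raw values now sum to 1.15.
raw_weights = {
    "vertexai_groundedness": 0.40,
    "vertexai_safety": 0.20,
    "summarisation_quality": 0.15,
    "vertexai_summarization_quality": 0.10,
    "vertexai_coherence": 0.10,
    "vertexai_instruction_following": 0.10,
    "vertexai_fluency": 0.06,
    "vertexai_verbosity": 0.04,
}
tuned = rescale_weights(raw_weights)

for name, weight in tuned.items():
    print(f"{name}: {weight:.3f}")
print(f"Total: {sum(tuned.values()):.2f}")
```

The rescaled dictionary could then replace `WEIGHTS` in the scoring cell. Because that cell already divides by the total weight of the available metrics, rescaling mainly keeps the printed configuration readable.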