# Summary Evaluation Demo - Complete Metrics Including BLEURT

This notebook demonstrates **all evaluation metrics** including BLEURT.

## Metrics Covered:
1. ✅ **ROUGE** (n-gram overlap)
2. ✅ **BERTScore** (semantic similarity)
3. ✅ **BLEURT** (learned human-correlated metric) ⭐
4. ✅ **Style Similarity** (persona matching)

## Table of Contents
1. [Setup](#setup)
2. [Single Example with All Metrics](#single)
3. [Batch Evaluation](#batch)
4. [Results & Analysis](#results)

<a id='setup'></a>
## 1. Setup

Install required packages if not already installed:
```bash
pip install rouge-score bert-score tensorflow
pip install git+https://github.com/google-research/bleurt.git
```
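
Before running the cells below, it can help to confirm which of these packages are actually importable. The following is a minimal sketch using only the standard library; the `REQUIRED` list and the helper name `missing_modules` are illustrative and not part of this project.

```python
import importlib.util

# Import names (not pip names) for the packages installed above.
# This list is an assumption for illustration; adjust it to your environment.
REQUIRED = ["rouge_score", "bert_score", "tensorflow", "bleurt"]


def missing_modules(names):
    """Return the module names that cannot be located in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Missing modules:", missing_modules(REQUIRED))
```

An empty list means every package was found. `find_spec` only locates a top-level module and does not import it, so the check stays fast even for TensorFlow.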

In [None]:
# Imports
import warnings
warnings.filterwarnings('ignore')

import yaml
import pandas as pd
from src.io_utils import load_jsonl, load_persona_assignments
from src.style_features import StyleAnalyzer

print("✓ Basic imports successful")

In [None]:
# Load configuration and data
with open('config.yaml', 'r') as f:
    config = yaml.safe_load(f)

records = list(load_jsonl('data/input.jsonl'))
persona_map = load_persona_assignments('data/persona_assignments.csv')

print(f"✓ Loaded {len(records)} articles")
print(f"✓ Loaded {len(persona_map)} persona assignments")

In [None]:
# Initialize ROUGE
from rouge_score import rouge_scorer
rouge_scorer_obj = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeLsum'], use_stemmer=True)
print("✓ ROUGE initialized")

In [None]:
# Initialize BERTScore
try:
    import bert_score
    print("✓ BERTScore available")
    bertscore_available = True
except ImportError:
    # Catch only a missing package; other errors should surface normally
    print("✗ BERTScore not available (pip install bert-score)")
    bertscore_available = False

In [None]:
# Initialize BLEURT - THIS IS THE IMPORTANT PART!
print("="*80)
print("INITIALIZING BLEURT")
print("="*80)
print("\nNote: BleurtScorer loads the checkpoint from a local directory, so download and unzip it first. Loading may take a few minutes.\n")

try:
    from bleurt import score as bleurt_score_module
    
    # Get checkpoint from config
    bleurt_checkpoint = config['content'].get('bleurt_checkpoint', 'BLEURT-20-D12')
    print(f"Loading BLEURT checkpoint: {bleurt_checkpoint}")
    print("Please wait...\n")
    
    # Initialize scorer
    bleurt_scorer = bleurt_score_module.BleurtScorer(bleurt_checkpoint)
    
    print("="*80)
    print("✓ ✓ ✓  BLEURT SUCCESSFULLY INITIALIZED  ✓ ✓ ✓")
    print("="*80)
    bleurt_available = True
    
except ImportError as e:
    print("="*80)
    print("✗ BLEURT NOT AVAILABLE")
    print("="*80)
    print(f"\nError: {e}")
    print("\nTo install BLEURT:")
    print('  1. pip install "tensorflow>=2.0.0"')
    print("  2. pip install git+https://github.com/google-research/bleurt.git")
    print("\nNote: BLEURT requires TensorFlow which can be large.")
    bleurt_available = False
    bleurt_scorer = None
    
except Exception as e:
    print("="*80)
    print("✗ BLEURT INITIALIZATION FAILED")
    print("="*80)
    print(f"\nError: {e}")
    print("\nCommon issues:")
    print("  - TensorFlow version incompatibility")
    print("  - Checkpoint directory missing or incomplete (download and unzip it first)")
    print("  - Insufficient disk space (~1GB needed)")
    bleurt_available = False
    bleurt_scorer = None

In [None]:
# Initialize Style Analyzer
style_analyzer = StyleAnalyzer(config)
style_analyzer.build_persona_centroids()
print("✓ Style analyzer initialized")

In [None]:
# Summary of available metrics
print("\n" + "="*80)
print("METRICS SUMMARY")
print("="*80)
print("  ✓ ROUGE")
print(f"  {'✓' if bertscore_available else '✗'} BERTScore")
print(f"  {'✓' if bleurt_available else '✗'} BLEURT  {'<-- PRIMARY METRIC' if bleurt_available else '<-- UNAVAILABLE'}")
print("  ✓ Style Similarity")
print("="*80)

<a id='single'></a>
## 2. Single Example with All Metrics

Let's evaluate one article with all metrics to see how they compare.
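
To see why it is worth comparing several metrics on the same example, here is a simplified sketch of a unigram-overlap F1 in the spirit of ROUGE-1. It is an illustration only: it uses lowercased whitespace tokens and no stemming, so its values will not match the `rouge_score` package exactly, and the helper name `rouge1_f1` and the sample sentences are made up for this example.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 on lowercased whitespace tokens (no stemming)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# A near-copy scores high; a paraphrase with no shared words scores zero
print(rouge1_f1("the cat sat on the mat", "the cat lay on the mat"))
print(rouge1_f1("the cat sat on the mat", "a feline rested upon a rug"))
```

The second call returns zero even though the sentences mean roughly the same thing. Embedding-based and learned metrics such as BERTScore and BLEURT are meant to cover that gap.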

In [None]:
# Select first example
example = records[0]
reference = example['expected_summary']
generated = example['agented_summary']
persona = persona_map.get(example['write_id'])

print("📰 Article:", example['document_title'])
print("🎭 Persona:", persona)
print("\n" + "="*80)
print("REFERENCE SUMMARY:")
print(reference)
print("\n" + "="*80)
print("GENERATED SUMMARY:")
print(generated)
print("="*80)

In [None]:
# Calculate ROUGE
rouge_scores = rouge_scorer_obj.score(reference, generated)

print("\n📊 ROUGE SCORES:")
print(f"  ROUGE-1 F1:    {rouge_scores['rouge1'].fmeasure:.4f}")
print(f"  ROUGE-2 F1:    {rouge_scores['rouge2'].fmeasure:.4f}")
print(f"  ROUGE-Lsum F1: {rouge_scores['rougeLsum'].fmeasure:.4f}")

In [None]:
# Calculate BERTScore
if bertscore_available:
    P, R, F1 = bert_score.score([generated], [reference], 
                                 model_type='distilbert-base-uncased', verbose=False)
    bert_f1 = F1.item()
    print("\n📊 BERTSCORE:")
    print(f"  F1: {bert_f1:.4f}")
else:
    bert_f1 = None
    print("\n⚠ BERTScore not available")

In [None]:
# Calculate BLEURT - THE KEY METRIC!
print("\n" + "="*80)
print("🌟 BLEURT SCORE (HUMAN-CORRELATED METRIC) 🌟")
print("="*80)

if bleurt_available and bleurt_scorer is not None:
    print("\nCalculating BLEURT score...")
    
    bleurt_scores = bleurt_scorer.score(
        references=[reference],
        candidates=[generated]
    )
    
    bleurt_val = bleurt_scores[0]
    
    print("\n✓ BLEURT calculation complete!\n")
    print(f"📊 BLEURT Score: {bleurt_val:.4f}")
    
    # Interpretation (rule-of-thumb thresholds for this demo, not calibrated cut-offs)
    if bleurt_val > 0.6:
        quality = "EXCELLENT"
        color = "🟢"
    elif bleurt_val > 0.4:
        quality = "GOOD"
        color = "🟡"
    elif bleurt_val > 0.2:
        quality = "FAIR"
        color = "🟠"
    else:
        quality = "NEEDS IMPROVEMENT"
        color = "🔴"
    
    print(f"\n{color} Quality Assessment: {quality}")
    print("\n💡 BLEURT Interpretation:")
    print("   BLEURT is fine-tuned on human ratings of text quality (mainly machine translation).")
    print("   BLEURT-20 checkpoints score roughly between 0 and 1 (not strictly bounded); higher is better.")
    print("   As a rough guide, this demo treats 0.4-0.7 as typical for good summaries.")
    
else:
    bleurt_val = None
    print("\n✗ BLEURT not available")
    print("   Please install: pip install tensorflow && pip install git+https://github.com/google-research/bleurt.git")

print("="*80)

In [None]:
# Calculate Style Similarity
style_sim = style_analyzer.calculate_style_similarity(generated, persona)

print("\n📊 STYLE SIMILARITY:")
print(f"  Score: {style_sim:.4f}")
print(f"  Target persona: {persona}")

In [None]:
# COMPARISON TABLE
print("\n" + "="*80)
print("ALL METRICS COMPARISON")
print("="*80)
print(f"\n{'Metric':<20} {'Score':<10} {'Interpretation'}")
print("-" * 80)
print(f"{'ROUGE-1 F1':<20} {rouge_scores['rouge1'].fmeasure:<10.4f} {'Word overlap'}")
print(f"{'ROUGE-2 F1':<20} {rouge_scores['rouge2'].fmeasure:<10.4f} {'Phrase overlap'}")
print(f"{'ROUGE-Lsum F1':<20} {rouge_scores['rougeLsum'].fmeasure:<10.4f} {'Sentence structure'}")

if bert_f1 is not None:
    print(f"{'BERTScore F1':<20} {bert_f1:<10.4f} {'Semantic similarity'}")

if bleurt_val is not None:
    print(f"{'BLEURT':<20} {bleurt_val:<10.4f} {'Human-like quality ⭐'}")

print(f"{'Style Similarity':<20} {style_sim:<10.4f} {'Persona matching'}")
print("="*80)

<a id='batch'></a>
## 3. Batch Evaluation

Now let's evaluate all articles.
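
Scoring one pair at a time is simple but slow for model-based metrics. The sketch below shows one way to send pairs to a scorer in fixed-size batches. The helper `score_in_batches` and the stand-in scorer `exact_match` are hypothetical names for illustration; the only assumption is a scoring function that takes two equal-length lists and returns one score per pair.

```python
from typing import Callable, List, Sequence


def score_in_batches(
    score_fn: Callable[[Sequence[str], Sequence[str]], Sequence[float]],
    references: Sequence[str],
    candidates: Sequence[str],
    batch_size: int = 16,
) -> List[float]:
    """Call score_fn on consecutive slices and concatenate the results."""
    if len(references) != len(candidates):
        raise ValueError("references and candidates must have the same length")
    scores: List[float] = []
    for start in range(0, len(references), batch_size):
        stop = start + batch_size
        scores.extend(score_fn(references[start:stop], candidates[start:stop]))
    return scores


# Stand-in scorer so the sketch runs without any model installed
def exact_match(refs, cands):
    return [1.0 if r == c else 0.0 for r, c in zip(refs, cands)]


print(score_in_batches(exact_match, ["a", "b", "c"], ["a", "x", "c"], batch_size=2))
```

With BLEURT, the scoring function could be a small wrapper such as `lambda refs, cands: bleurt_scorer.score(references=list(refs), candidates=list(cands))`, which uses the same call as the cells in this notebook.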

In [None]:
# Evaluate all articles
results = []

print(f"\nEvaluating {len(records)} articles...")
print("\nMetrics being calculated:")
print("  ✓ ROUGE (fast)")
if bertscore_available:
    print("  ✓ BERTScore (moderate)")
if bleurt_available:
    print("  ✓ BLEURT (slower; learned from human ratings) ⭐")
print("  ✓ Style Similarity (fast)")
print()

for i, rec in enumerate(records, 1):
    print(f"Processing {i}/{len(records)}: {rec['document_title'][:50]}...")
    
    ref = rec['expected_summary']
    gen = rec['agented_summary']
    pers = persona_map.get(rec['write_id'])
    
    # ROUGE
    rouge = rouge_scorer_obj.score(ref, gen)
    
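    # BERTScore
    bert = None
    if bertscore_available:
        try:
            _, _, F = bert_score.score([gen], [ref], model_type='distilbert-base-uncased', verbose=False)
            bert = F.item()
        except Exception as e:
            # Report the failure instead of silently dropping the score
            print(f"  → BERTScore failed: {e}")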
    
    # BLEURT
    bleurt = None
    if bleurt_available and bleurt_scorer is not None:
        try:
            scores = bleurt_scorer.score(references=[ref], candidates=[gen])
            bleurt = scores[0]
            print(f"  → BLEURT: {bleurt:.4f}")
        except Exception as e:
            print(f"  → BLEURT failed: {e}")
    
    # Style
    style = style_analyzer.calculate_style_similarity(gen, pers)
    
    results.append({
        'article': rec['document_title'][:50],
        'persona': pers,
        'rouge1_f1': rouge['rouge1'].fmeasure,
        'rouge2_f1': rouge['rouge2'].fmeasure,
        'rougeLsum_f1': rouge['rougeLsum'].fmeasure,
        'bertscore_f1': bert,
        'bleurt': bleurt,
        'style_sim': style
    })

df = pd.DataFrame(results)
print("\n✓ Evaluation complete!")

<a id='results'></a>
## 4. Results & Analysis
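
The BLEURT analysis in this section reports correlations through pandas `Series.corr`, which computes the Pearson coefficient by default and ignores pairs with missing values. As a rough illustration of what that number means, here is a pure-Python sketch; the helper name `pearson` is an assumption for this example and is not used by the notebook.

```python
import math


def pearson(xs, ys):
    """Pearson correlation over pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return float("nan")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined when one variable is constant
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


print(pearson([1, 2, 3], [2, 4, 6]))
print(pearson([1, 2, 3], [3, 2, 1]))
```

With only a handful of articles, these coefficients are noisy, so treat them as a hint about how the metrics relate rather than as a stable estimate.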

In [None]:
# Display results
print("\n" + "="*120)
print("COMPLETE RESULTS TABLE")
print("="*120 + "\n")

# Select columns to display
display_cols = ['article', 'persona', 'rouge1_f1', 'rouge2_f1', 'rougeLsum_f1']
if bertscore_available and df['bertscore_f1'].notna().any():
    display_cols.append('bertscore_f1')
if bleurt_available and df['bleurt'].notna().any():
    display_cols.append('bleurt')
display_cols.append('style_sim')

print(df[display_cols].to_string(index=False, float_format=lambda x: f'{x:.4f}' if pd.notna(x) else 'N/A'))

In [None]:
# Summary Statistics
print("\n" + "="*80)
print("SUMMARY STATISTICS")
print("="*80 + "\n")

print("Mean Scores:")
print(f"  ROUGE-1 F1:       {df['rouge1_f1'].mean():.4f}")
print(f"  ROUGE-2 F1:       {df['rouge2_f1'].mean():.4f}")
print(f"  ROUGE-Lsum F1:    {df['rougeLsum_f1'].mean():.4f}")

if bertscore_available and df['bertscore_f1'].notna().any():
    print(f"  BERTScore F1:     {df['bertscore_f1'].mean():.4f}")

if bleurt_available and df['bleurt'].notna().any():
    print(f"  BLEURT:           {df['bleurt'].mean():.4f} ⭐ (Most human-like)")

print(f"  Style Similarity: {df['style_sim'].mean():.4f}")

print("\nStandard Deviations:")
print(f"  ROUGE-Lsum F1:    {df['rougeLsum_f1'].std():.4f}")
if bleurt_available and df['bleurt'].notna().any():
    print(f"  BLEURT:           {df['bleurt'].std():.4f}")
print(f"  Style Similarity: {df['style_sim'].std():.4f}")

In [None]:
# BLEURT-specific analysis
if bleurt_available and df['bleurt'].notna().any():
    print("\n" + "="*80)
    print("🌟 BLEURT ANALYSIS 🌟")
    print("="*80 + "\n")
    
    print("BLEURT Score Distribution:")
    print(f"  Min:    {df['bleurt'].min():.4f}")
    print(f"  Max:    {df['bleurt'].max():.4f}")
    print(f"  Mean:   {df['bleurt'].mean():.4f}")
    print(f"  Median: {df['bleurt'].median():.4f}")
    
    # Quality breakdown
    excellent = (df['bleurt'] > 0.6).sum()
    good = ((df['bleurt'] > 0.4) & (df['bleurt'] <= 0.6)).sum()
    fair = ((df['bleurt'] > 0.2) & (df['bleurt'] <= 0.4)).sum()
    poor = (df['bleurt'] <= 0.2).sum()
    
    print("\nQuality Breakdown:")
    print(f"  🟢 Excellent (>0.6):  {excellent} articles")
    print(f"  🟡 Good (0.4-0.6):    {good} articles")
    print(f"  🟠 Fair (0.2-0.4):    {fair} articles")
    print(f"  🔴 Poor (<=0.2):      {poor} articles")
    
    # Best and worst
    best_idx = df['bleurt'].idxmax()
    worst_idx = df['bleurt'].idxmin()
    
    print("\nBest BLEURT Score:")
    print(f"  Article: {df.loc[best_idx, 'article']}")
    print(f"  BLEURT:  {df.loc[best_idx, 'bleurt']:.4f}")
    
    print("\nWorst BLEURT Score:")
    print(f"  Article: {df.loc[worst_idx, 'article']}")
    print(f"  BLEURT:  {df.loc[worst_idx, 'bleurt']:.4f}")
    
    # Correlation with other metrics
    print("\nCorrelation with other metrics:")
    print(f"  BLEURT vs ROUGE-Lsum:  {df['bleurt'].corr(df['rougeLsum_f1']):.4f}")
    if bertscore_available and df['bertscore_f1'].notna().any():
        print(f"  BLEURT vs BERTScore:   {df['bleurt'].corr(df['bertscore_f1']):.4f}")
    print(f"  BLEURT vs Style Sim:   {df['bleurt'].corr(df['style_sim']):.4f}")
    
    print("\n💡 Interpretation:")
    print("   Higher BLEURT scores tend to indicate summaries that humans would rate as higher quality.")
    print("   BLEURT is sensitive to fluency, coherence, and meaning preservation.")
    print("   Unlike ROUGE, BLEURT can recognize good paraphrases and penalize awkward text.")
    
else:
    print("\n⚠ BLEURT analysis not available")
    print("   Install BLEURT to see human-correlated quality scores.")

In [None]:
# Save results
output_file = 'outputs/notebook_evaluation_results.csv'
df.to_csv(output_file, index=False)
print(f"\n✓ Results saved to: {output_file}")

## Summary

This notebook demonstrated evaluation with **all metrics including BLEURT**.

### Key Takeaways:

1. **ROUGE** - measures word and phrase overlap; a good baseline
2. **BERTScore** - captures semantic similarity; handles paraphrasing
3. **BLEURT** ⭐ - fine-tuned on human ratings; treated here as the primary quality indicator
4. **Style Similarity** - measures persona matching

### Why BLEURT Matters:

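- **Correlates with human judgments**: generally reported to track human ratings more closely than ROUGE or BERTScore, though this depends on domain and checkpoint
- **Considers fluency and coherence**, not just word overlap
- **Recognizes quality** that humans would appreciate
- **Range**: BLEURT-20 checkpoints score roughly 0 (poor) to 1 (excellent), though values can fall slightly outside that range; this demo treats roughly 0.4-0.7 as good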
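
The quality labels used throughout this notebook can be collected into one small function. This is a sketch that mirrors the demo's rule-of-thumb thresholds (0.6, 0.4, 0.2); the function name `bleurt_quality` is hypothetical, and the cut-offs are not calibrated values from the BLEURT authors.

```python
from collections import Counter


def bleurt_quality(score: float) -> str:
    """Map a BLEURT score to the demo's rule-of-thumb quality label."""
    if score > 0.6:
        return "EXCELLENT"
    if score > 0.4:
        return "GOOD"
    if score > 0.2:
        return "FAIR"
    return "NEEDS IMPROVEMENT"


# Made-up scores, only to show how the labels can be tallied
sample_scores = [0.72, 0.55, 0.41, 0.18]
print(Counter(bleurt_quality(s) for s in sample_scores))
```

Keeping the thresholds in one place means the single-example cell and the batch breakdown cannot drift apart.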

### Next Steps:

1. Analyze low-BLEURT articles to understand failures
2. Use BLEURT to compare different summarization approaches
3. Optimize for BLEURT scores to improve human-perceived quality
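
For the first step, a small helper can pull out the weakest summaries for manual review. This sketch assumes rows shaped like the result dictionaries built in the batch loop, with `article` and `bleurt` keys. The function name `lowest_scoring` and the sample rows are invented for illustration.

```python
def lowest_scoring(rows, key="bleurt", k=3):
    """Return the k rows with the lowest score, skipping rows without one."""
    scored = [row for row in rows if row.get(key) is not None]
    return sorted(scored, key=lambda row: row[key])[:k]


# Hypothetical rows, not real evaluation results
sample_rows = [
    {"article": "A", "bleurt": 0.55},
    {"article": "B", "bleurt": None},
    {"article": "C", "bleurt": 0.12},
    {"article": "D", "bleurt": 0.31},
]

for row in lowest_scoring(sample_rows, k=2):
    print(row["article"], row["bleurt"])
```

Rows where scoring failed (a `None` value) are skipped rather than being treated as the worst results.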

---

**For production evaluation**, run:
```bash
python -m src.eval_runner  # Uses all metrics including BLEURT
python -m src.report       # Generates comprehensive report
```