# Persona Summarization Evaluation - Demonstration

This notebook walks through the evaluation strategy, demonstrating each metric with concrete examples.

## Table of Contents
1. [Setup & Data Loading](#setup)
2. [ROUGE Metrics](#rouge)
3. [BERTScore](#bertscore)
4. [BLEURT](#bleurt)
5. [Stylometric Features](#style)
6. [Complete Evaluation](#complete)
7. [Persona Comparison](#persona)
8. [Summary & Insights](#summary)

<a id='setup'></a>
## 1. Setup & Data Loading

First, let's load our test data and set up the evaluation environment.

In [None]:
import json
import pandas as pd
import numpy as np
import yaml
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Import our modules
from src.io_utils import load_jsonl, load_persona_assignments
from src.text_utils import tokenize_sentences, tokenize_words, count_tokens
from src.content_metrics import ContentMetricsCalculator
from src.style_features import StyleAnalyzer

print("✓ Imports successful")

In [None]:
# Load configuration
with open('config.yaml', 'r') as f:
    config = yaml.safe_load(f)

# Load test data
records = list(load_jsonl('data/input.jsonl'))
persona_map = load_persona_assignments('data/persona_assignments.csv')

print(f"✓ Loaded {len(records)} articles")
print(f"✓ Loaded {len(persona_map)} persona assignments")
print(f"\nPersonas: {list(set(persona_map.values()))}")

In [None]:
# Let's look at our first example
example = records[0]

print("📰 Article Title:", example['document_title'])
print("📝 Sector:", example['metadata']['sector'])
print("👤 Author:", example['metadata']['author'])
print("🎭 Assigned Persona:", persona_map.get(example['write_id']))
print("\n" + "="*80)
print("\n📄 SOURCE (first 300 chars):")
print(example['document_content'][:300] + "...")
print("\n" + "="*80)
print("\n🎯 GOLD SUMMARY:")
print(example['expected_summary'])
print("\n" + "="*80)
print("\n🤖 GENERATED SUMMARY:")
print(example['agented_summary'])

<a id='rouge'></a>
## 2. ROUGE Metrics

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures n-gram overlap between generated and reference summaries.

### What each ROUGE variant measures:
- **ROUGE-1**: Unigram (single word) overlap - measures basic word coverage
- **ROUGE-2**: Bigram (two consecutive words) overlap - measures phrase preservation
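- **ROUGE-L**: Longest common subsequence - measures sentence-level structure similarity (the code in this notebook uses the `rougeLsum` variant, which splits each text on newlines and aggregates the LCS over those segments)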
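
To make the longest-common-subsequence idea concrete, here is a minimal sketch that scores two toy sentences by hand. The helper `lcs_length` and the toy sentences are illustrative assumptions, not part of `rouge_score`; the library additionally tokenizes and stems, so its numbers can differ.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # dp[i][j] holds the LCS length of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

# Hypothetical toy sentences (whitespace tokenization only)
toy_ref = "the cat sat on the mat".split()
toy_gen = "the cat lay on a mat".split()

lcs = lcs_length(toy_ref, toy_gen)  # in-order shared tokens: the, cat, on, mat
lcs_precision = lcs / len(toy_gen)
lcs_recall = lcs / len(toy_ref)
lcs_f1 = (2 * lcs_precision * lcs_recall / (lcs_precision + lcs_recall)) if lcs else 0.0

print(f"LCS={lcs}  P={lcs_precision:.3f}  R={lcs_recall:.3f}  F1={lcs_f1:.3f}")
```

Unlike ROUGE-2, the matched tokens do not have to be adjacent; they only have to appear in the same order.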

### Metrics explained:
- **Precision**: What % of words in generated summary are in reference?
- **Recall**: What % of words in reference are captured in generated summary?
- **F1**: Harmonic mean of precision and recall
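
The three quantities above can be computed by hand for a toy pair. This is a simplified sketch assuming lowercase whitespace tokenization and clipped n-gram counts; `rouge_n` is a hypothetical helper, and `rouge_score` applies its own tokenizer and optional stemming, so results on real text may not match exactly.

```python
from collections import Counter

def rouge_n(reference, candidate, n=1):
    """Return (precision, recall, f1) for n-gram overlap between two strings."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    ref_counts, cand_counts = ngrams(reference), ngrams(candidate)
    # Counter intersection keeps the minimum count per n-gram (clipped overlap)
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

# A short candidate that copies the start of the reference
p1, r1, f1_1gram = rouge_n("the cat sat on the mat", "the cat sat", n=1)
p2, r2, f1_2gram = rouge_n("the cat sat on the mat", "the cat sat", n=2)
print(f"ROUGE-1: P={p1:.3f} R={r1:.3f} F1={f1_1gram:.3f}")
print(f"ROUGE-2: P={p2:.3f} R={r2:.3f} F1={f1_2gram:.3f}")
```

The short candidate gets perfect precision but low recall, which is why F1 is the headline number reported in the rest of this notebook.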

In [None]:
from rouge_score import rouge_scorer

# Initialize ROUGE scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeLsum'], use_stemmer=True)

# Calculate ROUGE for our first example
reference = example['expected_summary']
generated = example['agented_summary']

scores = scorer.score(reference, generated)

print("📊 ROUGE Scores for Example 1:")
print("\n" + "="*80)
for metric_name, metric_scores in scores.items():
    print(f"\n{metric_name.upper()}:")
    print(f"  Precision: {metric_scores.precision:.4f}")
    print(f"  Recall:    {metric_scores.recall:.4f}")
    print(f"  F1:        {metric_scores.fmeasure:.4f}")

### Understanding the scores:

Let's visualize what these numbers mean by looking at the actual word overlap.

In [None]:
# Tokenize and compare
ref_words = set(reference.lower().split())
gen_words = set(generated.lower().split())

overlap_words = ref_words & gen_words
only_in_ref = ref_words - gen_words
only_in_gen = gen_words - ref_words

print("📝 Word-level Analysis:")
print(f"\nReference summary words: {len(ref_words)}")
print(f"Generated summary words: {len(gen_words)}")
print(f"Overlapping words: {len(overlap_words)}")
print(f"\n✓ Words in BOTH: {len(overlap_words)} words")
print(f"Examples: {list(overlap_words)[:15]}")
print(f"\n⚠ Only in REFERENCE: {len(only_in_ref)} words")
print(f"Examples: {list(only_in_ref)[:10]}")
print(f"\n⚠ Only in GENERATED: {len(only_in_gen)} words")
print(f"Examples: {list(only_in_gen)[:10]}")

### Compare multiple examples

In [None]:
# Calculate ROUGE for first 5 examples
rouge_results = []

for i in range(min(5, len(records))):
    rec = records[i]
    # Use a separate name so `scores` (Example 1) is not overwritten for later cells
    rec_scores = scorer.score(rec['expected_summary'], rec['agented_summary'])
    
    rouge_results.append({
        'article': rec['document_title'][:40] + '...',
        'persona': persona_map.get(rec['write_id']),
        'rouge1_f': rec_scores['rouge1'].fmeasure,
        'rouge2_f': rec_scores['rouge2'].fmeasure,
        'rougeLsum_f': rec_scores['rougeLsum'].fmeasure,
    })

df_rouge = pd.DataFrame(rouge_results)
print("\n📊 ROUGE F1 Scores Comparison:\n")
print(df_rouge.to_string(index=False))
print(f"\n📈 Average Scores:")
print(f"ROUGE-1:    {df_rouge['rouge1_f'].mean():.4f}")
print(f"ROUGE-2:    {df_rouge['rouge2_f'].mean():.4f}")
print(f"ROUGE-Lsum: {df_rouge['rougeLsum_f'].mean():.4f}")

<a id='bertscore'></a>
## 3. BERTScore

BERTScore uses contextual embeddings from BERT to measure semantic similarity.

**Why BERTScore?**
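- ROUGE only matches surface forms (exact words, or shared stems when `use_stemmer=True`)
- BERTScore compares contextual embeddings, so synonyms and paraphrases can still score well
- Example: "car" and "automobile" share no n-grams, so ROUGE is 0, while BERTScore still rates them as similar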

**Note:** BERTScore requires downloading models and can be slow on first run.
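
The core of BERTScore is a greedy match: each token embedding is paired with its most similar counterpart in the other text by cosine similarity. The sketch below illustrates only that matching step, with made-up 2-D vectors standing in for real contextual embeddings. `greedy_match_scores` is a hypothetical helper, and it leaves out the optional IDF weighting and baseline rescaling that the `bert_score` package supports.

```python
import numpy as np

def greedy_match_scores(cand_emb, ref_emb):
    """Greedy cosine matching between candidate and reference token embeddings."""
    # Normalize rows so that dot products equal cosine similarities
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = cand @ ref.T                      # shape: (n_candidate, n_reference)
    precision = sim.max(axis=1).mean()      # best reference match per candidate token
    recall = sim.max(axis=0).mean()         # best candidate match per reference token
    f1 = 2 * precision * recall / (precision + recall)
    return float(precision), float(recall), float(f1)

# Made-up embeddings: the candidate covers only one of two reference "tokens"
toy_ref_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
toy_cand_emb = np.array([[1.0, 0.0]])

bs_p, bs_r, bs_f1 = greedy_match_scores(toy_cand_emb, toy_ref_emb)
print(f"P={bs_p:.3f}  R={bs_r:.3f}  F1={bs_f1:.3f}")
```

Because similarity is computed in embedding space, two different words with nearby vectors contribute a high score even though they share no characters.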

In [None]:
try:
    import bert_score
    
    print("Calculating BERTScore (this may take a minute on first run)...\n")
    
    # Use roberta-large for better quality (from config)
    bertscore_model = config['content'].get('bertscore_model', 'roberta-large')
    print(f"Using model: {bertscore_model}\n")
    
    P, R, F1 = bert_score.score(
        [example['agented_summary']], 
        [example['expected_summary']],
        model_type=bertscore_model,
        lang='en',
        rescale_with_baseline=True,
        verbose=False
    )
    
    print(f"📊 BERTScore for Example 1:")
    print(f"  Precision: {P.item():.4f}")
    print(f"  Recall:    {R.item():.4f}")
    print(f"  F1:        {F1.item():.4f}")
    
    print(f"\n💡 Interpretation:")
    print(f"BERTScore F1 of {F1.item():.4f} indicates {'excellent' if F1.item() > 0.9 else 'good' if F1.item() > 0.85 else 'fair'} semantic similarity")
    print(f"\nCompare to ROUGE-1 F1: {scores['rouge1'].fmeasure:.4f}")
    print("BERTScore can credit paraphrases and synonyms that ROUGE misses; with baseline rescaling the two scales are not directly comparable.")
    
    bertscore_available = True
    
except Exception as e:
    print(f"⚠ BERTScore demo skipped: {e}")
    print("Install with: pip install bert-score")
    bertscore_available = False

### BERTScore Example: Paraphrasing Detection

In [None]:
# Demonstrate BERTScore advantage over ROUGE
ref_text = "The company reported strong revenue growth in the third quarter."
gen_text1 = "The firm posted robust sales increases during Q3."  # Paraphrase
gen_text2 = "The weather was nice yesterday afternoon."  # Unrelated

# ROUGE scores
rouge1 = scorer.score(ref_text, gen_text1)['rouge1'].fmeasure
rouge2 = scorer.score(ref_text, gen_text2)['rouge1'].fmeasure

print("🔬 Comparing ROUGE vs BERTScore:\n")
print("Reference: 'The company reported strong revenue growth in the third quarter.'\n")
print(f"Text 1 (paraphrase): '{gen_text1}'")
print(f"  ROUGE-1 F1: {rouge1:.4f}")

if bertscore_available:
    try:
        P1, R1, F1_1 = bert_score.score([gen_text1], [ref_text], 
                                        model_type=bertscore_model, 
                                        lang='en',
                                        verbose=False)
        print(f"  BERTScore F1: {F1_1.item():.4f} ✓ Correctly identifies semantic similarity\n")
    except Exception:
        print("  BERTScore: [calculation error]\n")
else:
    print("  BERTScore: [not available]\n")

print(f"Text 2 (unrelated): '{gen_text2}'")
print(f"  ROUGE-1 F1: {rouge2:.4f}")

if bertscore_available:
    try:
        P2, R2, F1_2 = bert_score.score([gen_text2], [ref_text], 
                                        model_type=bertscore_model,
                                        lang='en',
                                        verbose=False)
        print(f"  BERTScore F1: {F1_2.item():.4f} ✓ Correctly identifies no similarity")
    except Exception:
        print("  BERTScore: [calculation error]")
else:
    print("  BERTScore: [not available]")

print("\n💡 BERTScore handles paraphrasing much better than ROUGE!")

<a id='bleurt'></a>
## 4. BLEURT

BLEURT (Bilingual Evaluation Understudy with Representations from Transformers) is a learned metric trained on human judgments.

**Why BLEURT?**
- Trained to match human quality assessments
- Considers fluency, coherence, and meaning
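- Score range depends on the checkpoint: the BLEURT-20 family used here yields values roughly between 0 (poor) and 1 (excellent), while older checkpoints can return values well outside that range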
- Correlates better with human judgments than ROUGE

**Note:** BLEURT requires TensorFlow and downloads large models (~1GB). First run will be slow.

In [None]:
import os

# Check if BLEURT is enabled in config
bleurt_enabled = config['content'].get('use_bleurt', False)
bleurt_checkpoint = config['content'].get('bleurt_checkpoint', 'bleurt_checkpoints/BLEURT-20-D3')

if not bleurt_enabled:
    print("⚠ BLEURT is disabled in config.yaml")
    print(f"\nBLEURT requires:")
    print("  - TensorFlow (heavy dependency)")
    print("  - Large checkpoint download (~300MB-1GB)")
    print("  - May have compatibility issues on some systems (especially macOS)")
    print(f"\nTo enable BLEURT:")
    print("  1. Set use_bleurt: true in config.yaml")
    print("  2. Run: python setup_bleurt.py")
    print(f"  3. Ensure checkpoint exists at: {bleurt_checkpoint}")
    print(f"\n📖 See BLEURT_SETUP.md for detailed instructions")
    print(f"\nFor this demo, we'll continue with ROUGE + BERTScore (which work great!)")
    bleurt_available = False
else:
    try:
        from bleurt import score
        
        # Suppress TensorFlow warnings
        os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
        
        print(f"Initializing BLEURT scorer with checkpoint: {bleurt_checkpoint}...")
        print("⏳ This may take a minute...\n")
        
        # Initialize BLEURT with downloaded checkpoint
        bleurt_scorer = score.BleurtScorer(bleurt_checkpoint)
        
        print(f"✓ BLEURT scorer initialized successfully!\n")
        
        # Calculate BLEURT score for our example
        bleurt_scores = bleurt_scorer.score(
            references=[example['expected_summary']],
            candidates=[example['agented_summary']]
        )
        
        bleurt_score = bleurt_scores[0]
        
        print(f"📊 BLEURT Score for Example 1:")
        print(f"  Score: {bleurt_score:.4f}")
        
        print(f"\n💡 Interpretation:")
        if bleurt_score > 0.6:
            interpretation = "excellent quality - very similar to reference"
        elif bleurt_score > 0.4:
            interpretation = "good quality - captures main content well"
        elif bleurt_score > 0.2:
            interpretation = "fair quality - some content overlap"
        else:
            interpretation = "needs improvement - limited similarity"
        
        print(f"BLEURT score of {bleurt_score:.4f} indicates {interpretation}")
        
        print(f"\n📈 Comparison to other metrics:")
        print(f"  ROUGE-1 F1: {scores['rouge1'].fmeasure:.4f}")
        print(f"  ROUGE-Lsum F1: {scores['rougeLsum'].fmeasure:.4f}")
        if bertscore_available:
            print(f"  BERTScore F1: {F1.item():.4f}")
        print(f"  BLEURT: {bleurt_score:.4f}")
        
        print("\n💡 BLEURT is trained on human judgments, so it often captures quality nuances")
        print("   that pure overlap metrics like ROUGE might miss.")
        
        bleurt_available = True
        
    except ImportError as e:
        print("⚠ BLEURT package not installed")
        print(f"\nError: {e}")
        print("\nTo install BLEURT:")
        print("  pip install 'git+https://github.com/google-research/bleurt.git'")
        print("  pip install 'tensorflow>=2.0.0'")
        bleurt_available = False
        
    except FileNotFoundError as e:
        print(f"⚠ BLEURT checkpoint not found: {bleurt_checkpoint}")
        print(f"\nError: {e}")
        print("\nTo download checkpoint:")
        print("  python setup_bleurt.py")
        print("\nOr manually download from:")
        print("  https://github.com/google-research/bleurt#checkpoints")
        bleurt_available = False
        
    except RuntimeError as e:
        error_str = str(e)
        if 'mutex' in error_str.lower() or 'lock' in error_str.lower():
            print("⚠ BLEURT initialization failed due to system compatibility issue")
            print(f"\nError: TensorFlow mutex/threading issue (common on some macOS configurations)")
            print("\nThis is a known TensorFlow compatibility issue, not a problem with your setup.")
            print("\n📊 Good news: ROUGE + BERTScore provide excellent evaluation without BLEURT!")
            print("\nWorkarounds if you need BLEURT:")
            print("  - Use Linux or Docker environment")
            print("  - Try: pip install tensorflow==2.10.0")
            print("  - See BLEURT_SETUP.md for details")
        else:
            print(f"⚠ BLEURT initialization failed: {e}")
        bleurt_available = False
        
    except Exception as e:
        print(f"⚠ BLEURT calculation failed: {e}")
        print("\nBLEURT can be challenging to set up. Common issues:")
        print("  - TensorFlow version incompatibility")
        print("  - Model checkpoint not found")
        print("  - System compatibility (macOS mutex errors)")
        print(f"\n📖 See BLEURT_SETUP.md for troubleshooting")
        print(f"\nFor this demo, we'll continue with ROUGE + BERTScore.")
        bleurt_available = False

### BLEURT vs ROUGE: Why the difference?

BLEURT is trained to predict human ratings, so it can recognize:
- **Fluency**: Natural-sounding text scores higher
- **Coherence**: Logical flow matters
- **Semantic equivalence**: Different words, same meaning
- **Context**: Understands what information is important

ROUGE only counts word overlap, so it might:
- Give high scores to awkward but word-matching text
- Give low scores to excellent paraphrases
- Miss semantic understanding

<a id='style'></a>
## 5. Stylometric Features & Persona Fidelity

Beyond content accuracy, we evaluate how well the summary matches the target persona's writing style.

### Stylometric Features Extracted:
1. **Function word rate** - Common words (the, is, at, etc.)
2. **Average sentence length** - Words per sentence
3. **Type-token ratio** - Vocabulary diversity
4. **Punctuation patterns** - Comma, period, exclamation usage
5. **Pronoun rate** - Personal pronoun frequency
6. **Flesch-Kincaid grade** - Reading difficulty
7. **Average word length** - Character count per word
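
To give a feel for how such features are derived from raw text, here is a rough sketch that computes a handful of them with the standard library. It is not the implementation inside `StyleAnalyzer`: the function name, the tiny `FUNCTION_WORDS` set, and the regex-based sentence and word splitting are all simplifying assumptions.

```python
import re

# Hypothetical, deliberately tiny function-word list for illustration only
FUNCTION_WORDS = {"the", "is", "at", "of", "and", "a", "in", "to", "it", "on"}

def simple_style_features(text):
    """Compute a few illustrative stylometric features for one text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    n_words = max(len(words), 1)  # guard against empty input
    return {
        "function_word_rate": sum(w in FUNCTION_WORDS for w in words) / n_words,
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "type_token_ratio": len(set(words)) / n_words,
        "comma_rate": text.count(",") / n_words,
        "avg_word_length": sum(len(w) for w in words) / n_words,
    }

toy_features = simple_style_features("The cat sat. The dog ran, fast!")
for name, value in toy_features.items():
    print(f"{name:22s}: {value:.4f}")
```

A production extractor would typically use a proper sentence tokenizer and a full function-word lexicon, and would normalize unbounded features such as sentence length, as the "(norm)" labels in the next cell suggest.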

In [None]:
# Initialize style analyzer
style_analyzer = StyleAnalyzer(config)

# Extract features from our example summary
summary_features = style_analyzer.extract_stylometric_features(example['agented_summary'])

feature_names = [
    'Function word rate',
    'Avg sentence length (norm)',
    'Type-token ratio',
    'Comma rate',
    'Period rate',
    'Exclamation rate',
    'Question rate',
    'Pronoun rate',
    'FK grade (norm)',
    'Avg word length (norm)'
]

print("📐 Stylometric Features for Generated Summary:\n")
print("="*60)
for name, value in zip(feature_names, summary_features):
    print(f"{name:30s}: {value:.4f}")

### Compare Across Personas

Let's see how different personas have distinctive stylometric signatures.

In [None]:
# Build persona centroids
print("Building persona centroids from training samples...\n")
centroids = style_analyzer.build_persona_centroids(force_rebuild=True)

print(f"✓ Built centroids for {len(centroids)} personas\n")
print("="*80)

# Display persona characteristics
for persona_id, centroid in centroids.items():
    print(f"\n🎭 {persona_id.upper()} Persona:")
    print(f"  Function words: {centroid[0]:.4f}")
    print(f"  Sentence length: {centroid[1]:.4f}")
    print(f"  Vocabulary diversity: {centroid[2]:.4f}")
    print(f"  Exclamation rate: {centroid[5]:.4f}")
    print(f"  Pronoun rate: {centroid[7]:.4f}")

### Persona Similarity Calculation

We use Jensen-Shannon divergence to measure how similar a summary is to each persona's style.
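
As a rough illustration of the idea, the sketch below computes Jensen-Shannon divergence between two non-negative feature vectors and turns it into a similarity. How `StyleAnalyzer.calculate_style_similarity` actually normalizes its features is not shown in this notebook, so treating the vectors as probability distributions and using `1 - divergence` as the similarity are assumptions made for this example.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2, so the result lies in [0, 1])."""
    # Smooth and renormalize so both vectors are valid probability distributions
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)  # mixture distribution

    def kl(a, b):
        return float(np.sum(a * np.log2(a / b)))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def js_similarity(p, q):
    """Assumed mapping from divergence to similarity: 1 means identical."""
    return 1.0 - js_divergence(p, q)

# Hypothetical feature vectors for a summary and a persona centroid
toy_summary_vec = [0.5, 0.3, 0.2]
toy_centroid_vec = [0.4, 0.4, 0.2]
print(f"Similarity: {js_similarity(toy_summary_vec, toy_centroid_vec):.4f}")
```

The divergence is symmetric and bounded, which makes it convenient for ranking personas: the persona whose centroid yields the highest similarity is the best stylistic match.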

In [None]:
# Calculate similarity to each persona for our example
summary_text = example['agented_summary']
target_persona = persona_map.get(example['write_id'])

print(f"📄 Summary from article: {example['document_title']}")
print(f"🎯 Target persona: {target_persona}\n")
print("="*80)
print("\n📊 Style Similarity to Each Persona:\n")

similarities = {}
for persona_id in centroids.keys():
    similarity = style_analyzer.calculate_style_similarity(summary_text, persona_id)
    similarities[persona_id] = similarity
    marker = "✓" if persona_id == target_persona else " "
    print(f"  {marker} {persona_id:20s}: {similarity:.4f}")

best_match = max(similarities, key=similarities.get)
print(f"\n🏆 Best match: {best_match}")
print(f"{'✓ Correct!' if best_match == target_persona else '✗ Mismatch'}")

### Text Examples by Persona

In [None]:
# Show example summaries from each persona
persona_examples = {}
for rec in records[:9]:  # First 9 records
    persona = persona_map.get(rec['write_id'])
    if persona and persona not in persona_examples:
        persona_examples[persona] = rec['agented_summary']

print("✍️ Example Summaries by Persona:\n")
print("="*80)

for persona, text in persona_examples.items():
    print(f"\n🎭 {persona.upper()}:")
    print(f"{text[:200]}...")
    print("-"*80)

<a id='complete'></a>
## 6. Complete Evaluation

Now let's run the complete evaluation on all articles with all available metrics.

In [None]:
# Complete evaluation with ROUGE, BERTScore (optional), BLEURT (optional), and Style
results = []

print("Running complete evaluation on all articles...\n")
print(f"Metrics enabled:")
print(f"  ✓ ROUGE")
print(f"  {'✓' if bertscore_available else '✗'} BERTScore ({bertscore_model if bertscore_available else 'N/A'})")
print(f"  {'✓' if bleurt_available else '✗'} BLEURT")
print(f"  ✓ Stylometric Similarity\n")

for i, rec in enumerate(records):
    print(f"Processing {i+1}/{len(records)}: {rec['document_title'][:50]}...")
    
    source = rec['document_content']
    reference = rec['expected_summary']
    generated = rec['agented_summary']
    persona = persona_map.get(rec['write_id'])
    
    # ROUGE scores
    rouge_scores = scorer.score(reference, generated)
    
    # BERTScore (if available)
    bert_f1 = None
    if bertscore_available:
        try:
            _, _, bert_F = bert_score.score([generated], [reference], 
                                           model_type=bertscore_model,
                                           lang='en',
                                           rescale_with_baseline=True,
                                           verbose=False)
            bert_f1 = bert_F.item()
        except Exception:
            pass  # leave bert_f1 as None when scoring fails
    
    # BLEURT (if available)
    bleurt_val = None
    if bleurt_available:
        try:
            bleurt_scores = bleurt_scorer.score(references=[reference], candidates=[generated])
            bleurt_val = bleurt_scores[0]
        except Exception:
            pass  # leave bleurt_val as None when scoring fails
    
    # Style similarity
    style_sim = style_analyzer.calculate_style_similarity(generated, persona)
    
    # Token counts
    src_tokens = count_tokens(source)
    ref_tokens = count_tokens(reference)
    gen_tokens = count_tokens(generated)
    
    results.append({
        'article': rec['document_title'][:40] + '...',
        'sector': rec['metadata']['sector'],
        'persona': persona,
        'rouge1_f': rouge_scores['rouge1'].fmeasure,
        'rouge2_f': rouge_scores['rouge2'].fmeasure,
        'rougeLsum_f': rouge_scores['rougeLsum'].fmeasure,
        'bertscore_f1': bert_f1,
        'bleurt': bleurt_val,
        'style_sim': style_sim,
        'compression': gen_tokens / src_tokens if src_tokens > 0 else 0,
        'src_tokens': src_tokens,
        'gen_tokens': gen_tokens
    })

df_results = pd.DataFrame(results)
print("\n✓ Evaluation complete!")

In [None]:
# Display results
print("\n📊 EVALUATION RESULTS\n")
print("="*120)

# Show available metrics
display_cols = ['article', 'persona', 'rouge1_f', 'rouge2_f', 'rougeLsum_f']
if bertscore_available:
    display_cols.append('bertscore_f1')
if bleurt_available:
    display_cols.append('bleurt')
display_cols.append('style_sim')

print(df_results[display_cols].to_string(index=False))

In [None]:
# Summary statistics
print("\n📈 SUMMARY STATISTICS\n")
print("="*60)
print(f"\nContent Quality:")
print(f"  ROUGE-1 F1:    {df_results['rouge1_f'].mean():.4f} ± {df_results['rouge1_f'].std():.4f}")
print(f"  ROUGE-2 F1:    {df_results['rouge2_f'].mean():.4f} ± {df_results['rouge2_f'].std():.4f}")
print(f"  ROUGE-Lsum F1: {df_results['rougeLsum_f'].mean():.4f} ± {df_results['rougeLsum_f'].std():.4f}")

if bertscore_available and df_results['bertscore_f1'].notna().any():
    print(f"  BERTScore F1:  {df_results['bertscore_f1'].mean():.4f} ± {df_results['bertscore_f1'].std():.4f}")

if bleurt_available and df_results['bleurt'].notna().any():
    print(f"  BLEURT:        {df_results['bleurt'].mean():.4f} ± {df_results['bleurt'].std():.4f}")

print(f"\nStyle Fidelity:")
print(f"  Style Similarity: {df_results['style_sim'].mean():.4f} ± {df_results['style_sim'].std():.4f}")

print(f"\nCompression:")
print(f"  Avg compression ratio: {df_results['compression'].mean():.2%}")
print(f"  Avg source tokens: {df_results['src_tokens'].mean():.0f}")
print(f"  Avg generated tokens: {df_results['gen_tokens'].mean():.0f}")

<a id='persona'></a>
## 7. Persona Comparison

Let's analyze performance by persona to see if style matching is working correctly.

In [None]:
# Group by persona
agg_dict = {
    'rouge1_f': ['mean', 'std', 'count'],
    'rougeLsum_f': ['mean', 'std'],
    'style_sim': ['mean', 'std']
}

if bertscore_available and df_results['bertscore_f1'].notna().any():
    agg_dict['bertscore_f1'] = ['mean', 'std']

if bleurt_available and df_results['bleurt'].notna().any():
    agg_dict['bleurt'] = ['mean', 'std']

persona_stats = df_results.groupby('persona').agg(agg_dict).round(4)

print("\n📊 RESULTS BY PERSONA\n")
print("="*80)
print(persona_stats)

print("\n💡 Interpretation:")
for persona in df_results['persona'].unique():
    persona_df = df_results[df_results['persona'] == persona]
    avg_rouge = persona_df['rougeLsum_f'].mean()
    avg_style = persona_df['style_sim'].mean()
    count = len(persona_df)
    
    print(f"\n🎭 {persona} ({count} articles):")
    print(f"  Content quality (ROUGE-Lsum): {avg_rouge:.4f} - {'Excellent' if avg_rouge > 0.7 else 'Good' if avg_rouge > 0.6 else 'Fair'}")
    print(f"  Style fidelity: {avg_style:.4f} - {'Strong match' if avg_style > 0.75 else 'Good match' if avg_style > 0.65 else 'Moderate match'}")

### Visualize Style Differences

In [None]:
try:
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    sns.set_style('whitegrid')
    
    # Create visualizations
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    
    # ROUGE scores by persona
    df_results.boxplot(column='rougeLsum_f', by='persona', ax=axes[0, 0])
    axes[0, 0].set_title('ROUGE-Lsum F1 by Persona')
    axes[0, 0].set_ylabel('ROUGE-Lsum F1')
    axes[0, 0].set_xlabel('Persona')
    
    # Style similarity by persona
    df_results.boxplot(column='style_sim', by='persona', ax=axes[0, 1])
    axes[0, 1].set_title('Style Similarity by Persona')
    axes[0, 1].set_ylabel('Style Similarity')
    axes[0, 1].set_xlabel('Persona')
    
    # Scatter: ROUGE vs Style
    for persona in df_results['persona'].unique():
        persona_df = df_results[df_results['persona'] == persona]
        axes[1, 0].scatter(persona_df['rougeLsum_f'], persona_df['style_sim'], 
                          label=persona, alpha=0.7, s=100)
    axes[1, 0].set_xlabel('ROUGE-Lsum F1 (Content Quality)')
    axes[1, 0].set_ylabel('Style Similarity')
    axes[1, 0].set_title('Content Quality vs Style Fidelity')
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3)
    
    # BLEURT or compression ratio
    if bleurt_available and df_results['bleurt'].notna().any():
        df_results.boxplot(column='bleurt', by='persona', ax=axes[1, 1])
        axes[1, 1].set_title('BLEURT Score by Persona')
        axes[1, 1].set_ylabel('BLEURT Score')
    else:
        df_results.boxplot(column='compression', by='persona', ax=axes[1, 1])
        axes[1, 1].set_title('Compression Ratio by Persona')
        axes[1, 1].set_ylabel('Compression Ratio')
    axes[1, 1].set_xlabel('Persona')
    
    plt.tight_layout()
    Path('outputs').mkdir(parents=True, exist_ok=True)  # make sure the output folder exists
    plt.savefig('outputs/evaluation_visualization.png', dpi=150, bbox_inches='tight')
    print("\n✓ Visualizations saved to outputs/evaluation_visualization.png")
    plt.show()
    
except ImportError:
    print("⚠ Matplotlib not available for visualization")
    print("Install with: pip install matplotlib seaborn")

<a id='summary'></a>
## 8. Summary & Insights

### What We Measured

#### Content Quality Metrics:
1. **ROUGE-1, ROUGE-2, ROUGE-Lsum** - N-gram and longest-common-subsequence overlap
2. **BERTScore** - Semantic similarity using contextual embeddings
3. **BLEURT** - Learned metric trained on human judgments

#### Style Fidelity Metrics:
1. **Stylometric features** - Writing style characteristics
2. **Persona similarity** - Distance to persona centroids
3. **Jensen-Shannon divergence** - Feature distribution similarity

### Key Findings

In [None]:
print("\n🎯 KEY FINDINGS\n")
print("="*80)

# Overall quality
avg_content = df_results['rougeLsum_f'].mean()
avg_style = df_results['style_sim'].mean()

print(f"\n1. Overall Performance:")
print(f"   - Average content quality (ROUGE-Lsum): {avg_content:.4f}")
if bleurt_available and df_results['bleurt'].notna().any():
    avg_bleurt = df_results['bleurt'].mean()
    print(f"   - Average BLEURT score: {avg_bleurt:.4f}")
print(f"   - Average style fidelity: {avg_style:.4f}")
print(f"   - Assessment: {'Excellent' if avg_content > 0.65 and avg_style > 0.7 else 'Good' if avg_content > 0.55 else 'Needs improvement'}")

# Best and worst
best_content = df_results.loc[df_results['rougeLsum_f'].idxmax()]
worst_content = df_results.loc[df_results['rougeLsum_f'].idxmin()]

print(f"\n2. Content Quality Range:")
print(f"   - Best: {best_content['article']} (ROUGE-Lsum: {best_content['rougeLsum_f']:.4f})")
print(f"   - Worst: {worst_content['article']} (ROUGE-Lsum: {worst_content['rougeLsum_f']:.4f})")
print(f"   - Range: {df_results['rougeLsum_f'].max() - df_results['rougeLsum_f'].min():.4f}")

# Persona analysis
print(f"\n3. Persona-Specific Performance:")
for persona in sorted(df_results['persona'].unique()):
    persona_df = df_results[df_results['persona'] == persona]
    print(f"   - {persona}: content={persona_df['rougeLsum_f'].mean():.4f}, style={persona_df['style_sim'].mean():.4f}")

# Correlation
correlation = df_results['rougeLsum_f'].corr(df_results['style_sim'])
print(f"\n4. Content-Style Correlation: {correlation:.4f}")
print(f"   {'Positive correlation' if correlation > 0 else 'Negative correlation'} between content quality and style match")

# Sectors
sector_performance = df_results.groupby('sector')['rougeLsum_f'].mean().sort_values(ascending=False)
print(f"\n5. Top Performing Sectors:")
for sector, score in sector_performance.head(3).items():
    print(f"   - {sector}: {score:.4f}")

### Recommendations

In [None]:
print("\n💡 RECOMMENDATIONS\n")
print("="*80)

if avg_content < 0.6:
    print("\n⚠ Content Quality:")
    print("  - Consider improving factual coverage in summaries")
    print("  - Ensure all key points from source are included")
    print("  - Check for information loss during summarization")
else:
    print("\n✓ Content Quality: Good - summaries accurately capture key information")

if avg_style < 0.65:
    print("\n⚠ Style Fidelity:")
    print("  - Strengthen persona-specific training examples")
    print("  - Add more diverse samples to persona corpora")
    print("  - Review persona assignment accuracy")
else:
    print("\n✓ Style Fidelity: Good - summaries match target personas well")

# Check for consistency
if df_results['rougeLsum_f'].std() > 0.15:
    print("\n⚠ Consistency:")
    print("  - High variance in content quality across articles")
    print("  - Consider investigating low-performing examples")
    print(f"  - Standard deviation: {df_results['rougeLsum_f'].std():.4f}")
else:
    print("\n✓ Consistency: Good - performance is stable across articles")

print("\n✅ Evaluation complete! Use these insights to improve your summarization system.")

## Next Steps

1. **Run full evaluation** with all metrics:
   ```bash
   python -m src.eval_runner
   ```

2. **Generate detailed report**:
   ```bash
   python -m src.report
   ```

3. **Analyze results**:
   - Review `outputs/per_item_metrics.csv`
   - Check `outputs/corpus_aggregates.json`
   - Read `outputs/report.md`

4. **Iterate**:
   - Identify low-scoring articles
   - Analyze failure modes
   - Improve summarization prompts/models
   - Re-evaluate

## Metrics Summary

| Metric | What it measures | Interpretation |
|--------|------------------|----------------|
| **ROUGE-1** | Word overlap | 0.6+ = good word coverage |
| **ROUGE-2** | Phrase overlap | 0.4+ = good phrase preservation |
| **ROUGE-Lsum** | Structure similarity | 0.6+ = good overall quality |
| **BERTScore** | Semantic similarity | 0.85+ = good, 0.9+ = excellent; captures paraphrasing |
| **BLEURT** | Human-like quality | 0.4+ = good, 0.6+ = excellent |
| **Style Similarity** | Persona matching | 0.7+ = strong match, 0.65+ = good |

---

**🎉 Congratulations!** You've completed the evaluation demo and understand how each metric works!