# MELD Dating Simulator Evaluation (Fixed) 🎭📊

**Purpose:** Comprehensive evaluation with FIXED generation parameters

**What this notebook evaluates:**
- 📈 Multiple training checkpoints, discovered automatically
- 🎯 Dialogue Quality (perplexity, diversity, repetition)
- 📝 Reference-Based Metrics (BLEU, ROUGE vs ground truth)
- 🎭 Per-Character Performance (all 6 Friends characters)
- 💭 Per-Emotion Performance (7 emotion categories)
- 🏆 Best Checkpoint Identification

**Key Improvements:**
- ✅ Correct EOS token (`tokenizer.eos_token_id`)
- ✅ Repetition penalty (1.2) + no_repeat_ngram_size (3)
- ✅ Lower temperature (0.7 instead of 0.9)
- ✅ Reduced max_new_tokens (50 instead of 128)
- ✅ Token-based extraction (not regex)
- ✅ Speaker token stripping

---

## 1. Setup and Imports

In [None]:
# Check if running in correct directory
import os
from pathlib import Path

# Should be in notebooks/MELD/ directory
if Path.cwd().name == 'MELD':
    print("✓ Running from correct directory")
else:
    print(f"⚠️  Current directory: {Path.cwd()}")
    print("⚠️  This notebook should be run from the notebooks/MELD/ directory")

In [None]:
# Add parent directory to path for imports
import sys
sys.path.insert(0, str(Path.cwd().parent.parent))

print("✓ Path configured")

In [None]:
# Core imports
import torch
import json
import numpy as np
import pandas as pd
from datetime import datetime
from tqdm.notebook import tqdm
import re
from collections import defaultdict

# Transformers
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM
)

# NLP metrics
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

# Download NLTK data if needed
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Configure plotting
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

print("✓ All imports successful")

In [None]:
# Device configuration
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

if device == 'cuda':
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

---
## 2. Configuration

**⚠️ CUSTOMIZE THESE PATHS:**

In [None]:
# ==================== CUSTOMIZE THESE PATHS ====================

# Path to checkpoint directory (will auto-detect all checkpoints)
CHECKPOINT_DIR = "../../checkpoints/dating_sim_meld_fixed"  # Update to your training output

# Path to MELD instruction-formatted data
DATA_PATH = "../../data/processed/MELD/meld_dating_sim_instruct.csv"

# Output directory for results
OUTPUT_DIR = "../../results/MELD"

# Sampling configuration
SAMPLES_PER_GROUP = 5  # Samples per (character, emotion) combination

# FIXED Generation parameters (matching training notebook 03c)
GENERATION_CONFIG = {
    'max_new_tokens': 50,    # Reduced from 128
    'temperature': 0.7,      # Reduced from 0.9
    'top_p': 0.9,
    'do_sample': True
}

# ===============================================================

print("Configuration:")
print(f"  Checkpoint directory: {CHECKPOINT_DIR}")
print(f"  Data path: {DATA_PATH}")
print(f"  Output directory: {OUTPUT_DIR}")
print(f"  Samples per (character, emotion): {SAMPLES_PER_GROUP}")
print(f"  Generation config: {GENERATION_CONFIG}")
print(f"\n✅ Using FIXED generation parameters")

---
## 3. Friends Character Personas

Define the personas used at generation time (the same ones as in the training notebook).

In [None]:
# Character personas from training notebook
CHARACTER_PERSONAS = {
    'Chandler': "You are Chandler Bing from Friends. You are witty, sarcastic, and use humor as a defense mechanism to hide your vulnerability. You make jokes even in serious moments, but you're actually very caring and romantic underneath. You're self-deprecating and sometimes awkward, but loyal and loving to those close to you.",
    
    'Monica': "You are Monica Geller from Friends. You are organized, competitive, and love to be in control. You're nurturing and care deeply about the people in your life. You want commitment and stability in relationships. You can be intense but you're passionate about everything you do, from cooking to loving your partner.",
    
    'Ross': "You are Ross Geller from Friends. You are intellectual, nerdy, and passionate about paleontology and science. You tend to overthink things and can be awkward in romantic situations. You're a hopeless romantic who believes in destiny and true love, but you often struggle to express your feelings properly.",
    
    'Rachel': "You are Rachel Green from Friends. You are fashion-focused, fun, and flirty. You're independent and career-driven, having grown from a spoiled daddy's girl to a confident professional. You value friendship and are loyal to those you care about. You're charming and know how to make people feel special.",
    
    'Joey': "You are Joey Tribbiani from Friends. You are confident, charming, and simple in the best way. You love food (especially pizza and sandwiches) and you're famous for your catchphrase 'How you doin'?' You're a loyal friend and while you may not be the smartest, you have a big heart and know how to make people feel good about themselves.",
    
    'Phoebe': "You are Phoebe Buffay from Friends. You are quirky, spiritual, and unconventionally wise. You're honest to a fault and say what's on your mind. You have a mysterious past but maintain an optimistic outlook. You're free-spirited and bring unique perspectives to every situation. You believe in karma, auras, and following your heart."
}

# Dating scenarios
DATING_SCENARIOS = [
    "You're on a casual coffee date at Central Perk, the cozy coffee shop.",
    "You're having a romantic dinner date at a nice restaurant.",
    "You're taking a walk together and having a deep conversation.",
    "You're hanging out at your apartment, enjoying each other's company.",
    "You're on a fun date doing something adventurous together.",
    "You're having a heart-to-heart conversation about your relationship.",
    "You're flirting and getting to know each other better.",
    "You're spending a quiet evening together, just talking and connecting.",
]

print("✓ Character personas and scenarios loaded")
print(f"  Characters: {', '.join(CHARACTER_PERSONAS.keys())}")
print(f"  Scenarios: {len(DATING_SCENARIOS)}")

---
## 4. Load and Prepare Test Data

In [None]:
# Load full MELD dataset
print(f"Loading data from: {DATA_PATH}")
df = pd.read_csv(DATA_PATH)

print(f"✓ Loaded {len(df)} total examples")
print(f"\nColumns: {list(df.columns)}")
print(f"\nDataset shape: {df.shape}")

In [None]:
# Analyze data distribution
print("="*80)
print("Character Distribution")
print("="*80)
char_counts = df['character'].value_counts()
for char, count in char_counts.items():
    percentage = (count / len(df)) * 100
    print(f"{char:15s}: {count:5d} ({percentage:5.2f}%)")

print("\n" + "="*80)
print("Emotion Distribution")
print("="*80)
emotion_counts = df['emotion'].value_counts()
for emotion, count in emotion_counts.items():
    percentage = (count / len(df)) * 100
    print(f"{emotion:15s}: {count:5d} ({percentage:5.2f}%)")

### Stratified Sampling

Create a balanced test set with up to N samples per (character, emotion) combination; groups with fewer than N rows contribute all of their rows.

In [None]:
def create_stratified_sample(df, samples_per_group=5, main_characters=None):
    """
    Create stratified sample with balanced (character, emotion) coverage.
    """
    # Filter to main characters if specified
    if main_characters is not None:
        df = df[df['character'].isin(main_characters)]
    
    # Group by character and emotion
    grouped = df.groupby(['character', 'emotion'])
    
    # Sample from each group
    samples = []
    for (char, emotion), group in grouped:
        n = min(samples_per_group, len(group))
        if n > 0:
            samples.append(group.sample(n, random_state=42))
    
    return pd.concat(samples, ignore_index=True)

# Main 6 Friends characters
MAIN_CHARACTERS = ['Chandler', 'Monica', 'Ross', 'Rachel', 'Joey', 'Phoebe']

# Create stratified test sample
test_df = create_stratified_sample(
    df,
    samples_per_group=SAMPLES_PER_GROUP,
    main_characters=MAIN_CHARACTERS
)

print(f"✓ Created stratified test set: {len(test_df)} samples")
print(f"\nDistribution:")
print(test_df.groupby(['character', 'emotion']).size().unstack(fill_value=0))
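
To see what the cap `min(samples_per_group, len(group))` does for rare (character, emotion) pairs, here is a minimal pandas-free sketch of the same idea on made-up rows. The `stratified_sample` helper and the toy data are illustrative assumptions and are not part of this notebook's pipeline.

```python
import random
from collections import defaultdict

def stratified_sample(rows, samples_per_group=5, seed=42):
    """Pick up to `samples_per_group` rows per (character, emotion) pair."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["character"], row["emotion"])].append(row)

    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    sample = []
    for key in sorted(groups):
        group = groups[key]
        n = min(samples_per_group, len(group))  # small groups contribute all they have
        sample.extend(rng.sample(group, n))
    return sample

# Hypothetical toy data: 8 joyful Joey lines, but only 2 sad Ross lines.
toy_rows = (
    [{"character": "Joey", "emotion": "joy", "response": f"joey {i}"} for i in range(8)]
    + [{"character": "Ross", "emotion": "sadness", "response": f"ross {i}"} for i in range(2)]
)

picked = stratified_sample(toy_rows, samples_per_group=5)
counts = defaultdict(int)
for row in picked:
    counts[(row["character"], row["emotion"])] += 1
print(dict(counts))
```

Because under-populated groups are not padded, the resulting test set can be smaller than `characters × emotions × N`; the distribution table printed above is the place to check for such gaps.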

---
## 5. Checkpoint Discovery

In [None]:
def find_checkpoints(checkpoint_dir):
    """
    Auto-detect all checkpoint folders in directory.
    """
    checkpoint_path = Path(checkpoint_dir)
    
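    if not checkpoint_path.exists():
        print(f"⚠️  Checkpoint directory not found: {checkpoint_dir}")
        return []
    
    # Find all checkpoint-<step> folders and sort them by training step
    # (entries whose suffix is not a number are skipped instead of raising ValueError)
    checkpoints = sorted(
        (p for p in checkpoint_path.glob("checkpoint-*")
         if p.is_dir() and p.name.split('-')[-1].isdigit()),
        key=lambda p: int(p.name.split('-')[-1])
    )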
    
    # Add final model if exists
    final_path = checkpoint_path / "final"
    if final_path.exists():
        checkpoints.append(final_path)
    
    return checkpoints

# Discover checkpoints
checkpoints = find_checkpoints(CHECKPOINT_DIR)

if not checkpoints:
    print("❌ No checkpoints found!")
    print(f"Please check the path: {CHECKPOINT_DIR}")
else:
    print(f"✓ Found {len(checkpoints)} checkpoint(s):")
    for i, cp in enumerate(checkpoints, 1):
        print(f"  {i}. {cp.name}")
    
    # Rough estimate, assuming ~0.05 minutes (3 s) per generated sample
    samples_per_checkpoint = len(test_df)
    estimated_minutes = len(checkpoints) * samples_per_checkpoint * 0.05
    print(f"\n⏱️  Estimated evaluation time: ~{estimated_minutes:.1f} minutes")
    print(f"   ({len(checkpoints)} checkpoints × {samples_per_checkpoint} samples)")
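
The numeric sort key in `find_checkpoints` matters because plain string sorting would place `checkpoint-1000` before `checkpoint-500`. The sketch below illustrates the difference on temporary folders; the `checkpoint_step` helper and the folder names are made up for this example.

```python
import tempfile
from pathlib import Path

def checkpoint_step(path):
    """Return the integer step of a 'checkpoint-<step>' folder, or None."""
    prefix, _, suffix = path.name.partition("-")
    return int(suffix) if prefix == "checkpoint" and suffix.isdigit() else None

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical folder names, including one with a non-numeric suffix.
    for name in ["checkpoint-1000", "checkpoint-500", "checkpoint-50", "checkpoint-best"]:
        (root / name).mkdir()

    # Keep only folders whose suffix parses as a training step.
    stepped = [p for p in root.glob("checkpoint-*") if checkpoint_step(p) is not None]
    lexicographic = sorted(p.name for p in stepped)
    numeric = [p.name for p in sorted(stepped, key=checkpoint_step)]

print(lexicographic)  # string order: 1000 sorts before 50 and 500
print(numeric)        # training order: 50, 500, 1000
```

Evaluating in training order keeps the x-axis of the checkpoint comparison plots chronological.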

---
## 6. FIXED MELD Generation Function 🔧

**Uses corrected parameters matching training notebook 03c**

In [None]:
def generate_meld_response_fixed(model, tokenizer, character, emotion, context, scenario, **gen_kwargs):
    """
    Generate response using FIXED parameters (matches training notebook 03c).
    
    Fixes applied:
    - Correct EOS token (tokenizer.eos_token_id)
    - Repetition penalty (1.2)
    - No repeat ngram size (3)
    - Token-based extraction
    - Speaker token stripping
    """
    # Get persona description
    persona_desc = CHARACTER_PERSONAS.get(
        character,
        f"You are {character} from Friends."
    )
    
    # Build system prompt (matches training format)
    system_content = f"""{persona_desc}

Scenario: {scenario}
The user seems to be feeling: {emotion}"""
    
    # User message with conversation context
    user_content = f"Conversation:\n{context}"
    
    # Build messages for LLaMA 3.1 chat template
    messages = [
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_content}
    ]
    
    # Apply chat template WITH generation prompt
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    
    # Tokenize
    inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
    input_length = inputs['input_ids'].shape[1]
    
    # Generate with FIXED parameters
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,  # FIX: Correct EOS token
            repetition_penalty=1.2,  # FIX: Add repetition penalty
            no_repeat_ngram_size=3,  # FIX: Prevent 3-gram repetition
            **gen_kwargs
        )
    
    # Extract only the generated tokens (not the prompt)
    generated_tokens = outputs[0][input_length:]
    
    # Decode with skip_special_tokens
    response = tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()
    
    # Safety net: Remove any accidental speaker tokens
    response = re.sub(r'<[^>]+>\s*', '', response)
    
    # Clean up extra whitespace
    response = ' '.join(response.split())
    
    return response

print("✓ FIXED MELD generation function ready")
print("\nKey improvements:")
print("  • Correct EOS token ID")
print("  • Repetition penalty: 1.2")
print("  • No repeat ngram size: 3")
print("  • Token-based extraction")
print("  • Speaker token stripping")
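
The two post-processing steps (token-based extraction and speaker token stripping) can be tried without loading a model. The following is a minimal sketch on toy values: the integer ids stand in for a real `model.generate()` output row, and `clean_response` is a hypothetical helper that mirrors the cleanup lines of the function above.

```python
import re

def clean_response(text):
    """Strip angle-bracket speaker tags, then collapse runs of whitespace."""
    text = re.sub(r'<[^>]+>\s*', '', text)
    return ' '.join(text.split())

# Toy stand-in for a generate() output row: the prompt ids come first,
# followed by the newly generated ids.
prompt_ids = [101, 7, 8, 9]
output_ids = prompt_ids + [42, 43, 44]

# Token-based extraction: slice by prompt length instead of searching the text.
new_ids = output_ids[len(prompt_ids):]
print(new_ids)

raw = "<Joey> How you   doin'?  <Rachel> Fine."
print(clean_response(raw))
```

One caveat of this regex: it removes any `<...>` span, so a legitimate reply containing angle brackets would also lose that text.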

---
## 7. Evaluation Metrics Functions
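
For reference, the diversity, repetition, and perplexity values computed in this section can be summarized as follows. This is a notation sketch rather than a formal definition: $G_n$ denotes the list of all $n$-grams pooled over the generated responses, $B$ the list of bigrams of a single response, $\ell_i$ the mean per-token cross-entropy the model reports for text $i$, and $n_i$ the token count used to weight that text.

$$
\text{distinct-}n = \frac{|\mathrm{set}(G_n)|}{|G_n|}, \qquad
\text{repetition} = 1 - \frac{|\mathrm{set}(B)|}{|B|}, \qquad
\mathrm{PPL} = \exp\!\left(\frac{\sum_i \ell_i \, n_i}{\sum_i n_i}\right)
$$

Distinct-n is computed once over the whole set of generations, whereas the repetition ratio is computed per response and then averaged.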

In [None]:
def compute_bleu_scores(reference, hypothesis):
    """Compute BLEU-1, BLEU-2, BLEU-3, BLEU-4 scores."""
    ref_tokens = reference.split()
    hyp_tokens = hypothesis.split()
    
    smoothing = SmoothingFunction().method1
    
    return {
        'bleu-1': sentence_bleu([ref_tokens], hyp_tokens, weights=(1,0,0,0), smoothing_function=smoothing),
        'bleu-2': sentence_bleu([ref_tokens], hyp_tokens, weights=(0.5,0.5,0,0), smoothing_function=smoothing),
        'bleu-3': sentence_bleu([ref_tokens], hyp_tokens, weights=(1/3,1/3,1/3,0), smoothing_function=smoothing),
        'bleu-4': sentence_bleu([ref_tokens], hyp_tokens, weights=(0.25,0.25,0.25,0.25), smoothing_function=smoothing),
    }

def compute_rouge_scores(reference, hypothesis):
    """Compute ROUGE-L score."""
    scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
    scores = scorer.score(reference, hypothesis)
    return {'rouge-l': scores['rougeL'].fmeasure}

def compute_distinct_n(texts, n):
    """Compute distinct-n metric (lexical diversity)."""
    all_ngrams = []
    for text in texts:
        tokens = text.lower().split()
        ngrams = [tuple(tokens[i:i+n]) for i in range(len(tokens)-n+1)]
        all_ngrams.extend(ngrams)
    
    if len(all_ngrams) == 0:
        return 0.0
    
    return len(set(all_ngrams)) / len(all_ngrams)

def compute_repetition_ratio(text):
    """Compute self-repetition ratio."""
    tokens = text.lower().split()
    if len(tokens) <= 1:
        return 0.0
    
    bigrams = [tuple(tokens[i:i+2]) for i in range(len(tokens)-1)]
    if len(bigrams) == 0:
        return 0.0
    
    return 1 - (len(set(bigrams)) / len(bigrams))

def compute_perplexity(model, tokenizer, texts):
    """Compute token-weighted perplexity of the model on the given texts."""
    model.eval()
    total_loss = 0.0
    total_tokens = 0
    
    with torch.no_grad():
        for text in texts:
            inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
            inputs = {k: v.to(model.device) for k, v in inputs.items()}
            
            # The causal-LM loss is averaged over the shifted targets,
            # i.e. over (sequence length - 1) predicted tokens
            n_predicted = inputs['input_ids'].size(1) - 1
            if n_predicted < 1:
                continue  # empty or single-token text: nothing to score
            
            outputs = model(**inputs, labels=inputs['input_ids'])
            total_loss += outputs.loss.item() * n_predicted
            total_tokens += n_predicted
    
    if total_tokens == 0:
        return None  # no scorable text; callers already treat None as "N/A"
    
    return float(np.exp(total_loss / total_tokens))

print("✓ Metric functions ready")
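
A small worked example helps when reading the diversity and repetition numbers. The sketch below re-implements the two metrics in compact form (the names `ngrams`, `distinct_n`, and `repetition_ratio` are local to this example) and applies them to made-up sentences.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def distinct_n(texts, n):
    """Unique n-grams divided by total n-grams, pooled over all texts."""
    pooled = [g for t in texts for g in ngrams(t.lower().split(), n)]
    return len(set(pooled)) / len(pooled) if pooled else 0.0

def repetition_ratio(text):
    """Share of bigrams in one text that repeat an earlier bigram."""
    bigrams = ngrams(text.lower().split(), 2)
    return 1 - len(set(bigrams)) / len(bigrams) if bigrams else 0.0

texts = ["I love pizza", "I love sandwiches"]
print(distinct_n(texts, 1))  # 4 unique unigrams out of 6
print(distinct_n(texts, 2))  # 3 unique bigrams out of 4
print(repetition_ratio("so good so good so good"))  # 2 unique bigrams out of 5
```

Note that distinct-n tends to fall as the number of pooled tokens grows, so it is most meaningful when comparing checkpoints evaluated on the same test set.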

---
## 8. Main Evaluation Loop

In [None]:
def evaluate_checkpoint(checkpoint_path, test_df, scenarios):
    """
    Evaluate a single checkpoint on test data using FIXED generation.
    """
    print(f"\n{'='*80}")
    print(f"Evaluating: {checkpoint_path.name}")
    print(f"{'='*80}")
    
    # Load model and tokenizer
    print("Loading model...")
    tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        tokenizer.pad_token_id = tokenizer.eos_token_id
    
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint_path,
        torch_dtype=torch.float16,
        device_map='auto'
    )
    model.eval()
    print("✓ Model loaded")
    
    # Results storage
    results = {
        'checkpoint': checkpoint_path.name,
        'samples': [],
        'metrics': {}
    }
    
    # Generate responses for all test samples
    print(f"\nGenerating {len(test_df)} responses with FIXED parameters...")
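    generated_texts = []
    
    import random  # imported once here, not on every loop iteration
    
    for idx, row in tqdm(test_df.iterrows(), total=len(test_df), desc="Generating"):
        # Pick the scenario deterministically per dialogue with a local RNG,
        # so the global random state is not reseeded on every iteration
        scenario = random.Random(int(row.get('dialogue_id', idx))).choice(scenarios)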
        
        # Generate response with FIXED function
        generated = generate_meld_response_fixed(
            model, tokenizer,
            row['character'],
            row['emotion'],
            row['context'],
            scenario,
            **GENERATION_CONFIG
        )
        
        generated_texts.append(generated)
        
        # Compute reference-based metrics
        bleu_scores = compute_bleu_scores(row['response'], generated)
        rouge_scores = compute_rouge_scores(row['response'], generated)
        
        # Store sample result
        results['samples'].append({
            'character': row['character'],
            'emotion': row['emotion'],
            'context': row['context'],
            'ground_truth': row['response'],
            'generated': generated,
            'bleu-1': bleu_scores['bleu-1'],
            'bleu-2': bleu_scores['bleu-2'],
            'bleu-3': bleu_scores['bleu-3'],
            'bleu-4': bleu_scores['bleu-4'],
            'rouge-l': rouge_scores['rouge-l'],
            'length': len(generated.split()),
            'repetition': compute_repetition_ratio(generated)
        })
    
    # Compute aggregate metrics
    print("\nComputing aggregate metrics...")
    samples_df = pd.DataFrame(results['samples'])
    
    # Overall metrics
    results['metrics']['overall'] = {
        'bleu-1': samples_df['bleu-1'].mean(),
        'bleu-2': samples_df['bleu-2'].mean(),
        'bleu-3': samples_df['bleu-3'].mean(),
        'bleu-4': samples_df['bleu-4'].mean(),
        'rouge-l': samples_df['rouge-l'].mean(),
        'distinct-1': compute_distinct_n(generated_texts, 1),
        'distinct-2': compute_distinct_n(generated_texts, 2),
        'distinct-3': compute_distinct_n(generated_texts, 3),
        'mean_length': samples_df['length'].mean(),
        'std_length': samples_df['length'].std(),
        'mean_repetition': samples_df['repetition'].mean(),
    }
    
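    # Compute perplexity on the first 50 generations only (to limit runtime).
    # Note: test_df is ordered by (character, emotion) group, so this slice
    # covers only the first groups rather than a balanced subset.
    print("Computing perplexity...")
    try:
        perplexity = compute_perplexity(model, tokenizer, generated_texts[:50])
        results['metrics']['overall']['perplexity'] = perplexity
    except Exception as e:
        print(f"⚠️  Perplexity computation failed: {e}")
        results['metrics']['overall']['perplexity'] = None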
    
    # Per-character metrics
    results['metrics']['per_character'] = {}
    for char in samples_df['character'].unique():
        char_df = samples_df[samples_df['character'] == char]
        results['metrics']['per_character'][char] = {
            'bleu-4': char_df['bleu-4'].mean(),
            'rouge-l': char_df['rouge-l'].mean(),
            'mean_length': char_df['length'].mean(),
            'count': len(char_df)
        }
    
    # Per-emotion metrics
    results['metrics']['per_emotion'] = {}
    for emotion in samples_df['emotion'].unique():
        emotion_df = samples_df[samples_df['emotion'] == emotion]
        results['metrics']['per_emotion'][emotion] = {
            'bleu-4': emotion_df['bleu-4'].mean(),
            'rouge-l': emotion_df['rouge-l'].mean(),
            'mean_length': emotion_df['length'].mean(),
            'count': len(emotion_df)
        }
    
    print("✓ Evaluation complete")
    
    # Clean up
    del model
    del tokenizer
    torch.cuda.empty_cache()
    
    return results

print("✓ Evaluation function ready")

In [None]:
# Run evaluation on all checkpoints
if checkpoints:
    all_results = []
    
    for checkpoint in checkpoints:
        results = evaluate_checkpoint(checkpoint, test_df, DATING_SCENARIOS)
        all_results.append(results)
        
        # Save intermediate results
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        interim_path = Path(OUTPUT_DIR) / f"interim_{checkpoint.name}_{timestamp}.json"
        interim_path.parent.mkdir(parents=True, exist_ok=True)
        
        with open(interim_path, 'w') as f:
            json.dump(results, f, indent=2)
        print(f"\n✓ Interim results saved: {interim_path}")
    
    print(f"\n{'='*80}")
    print("All checkpoints evaluated!")
    print(f"{'='*80}")
else:
    print("❌ No checkpoints to evaluate")
    all_results = []

---
## 9. Results Analysis

In [None]:
if all_results:
    # Create summary table
    summary_data = []
    for result in all_results:
        metrics = result['metrics']['overall']
        summary_data.append({
            'Checkpoint': result['checkpoint'],
            'BLEU-4': f"{metrics['bleu-4']:.4f}",
            'ROUGE-L': f"{metrics['rouge-l']:.4f}",
            'Distinct-1': f"{metrics['distinct-1']:.4f}",
            'Distinct-2': f"{metrics['distinct-2']:.4f}",
            'Perplexity': f"{metrics['perplexity']:.2f}" if metrics['perplexity'] else 'N/A',
            'Repetition': f"{metrics['mean_repetition']:.4f}",
            'Avg Length': f"{metrics['mean_length']:.1f}"
        })
    
    summary_df = pd.DataFrame(summary_data)
    print("Overall Metrics Comparison:")
    print("="*80)
    print(summary_df.to_string(index=False))
    print("="*80)
else:
    print("No results to display")

### Best Checkpoint Identification

In [None]:
if all_results:
    print("Best Checkpoint per Metric:")
    print("="*80)
    
    metrics_to_check = [
        ('bleu-4', 'higher'),
        ('rouge-l', 'higher'),
        ('distinct-1', 'higher'),
        ('distinct-2', 'higher'),
        ('perplexity', 'lower'),
        ('mean_repetition', 'lower')
    ]
    
    for metric, direction in metrics_to_check:
        if direction == 'higher':
            best = max(all_results, key=lambda r: r['metrics']['overall'][metric])
        else:
            valid_results = [r for r in all_results if r['metrics']['overall'].get(metric) is not None]
            if valid_results:
                best = min(valid_results, key=lambda r: r['metrics']['overall'][metric])
            else:
                continue
        
        value = best['metrics']['overall'][metric]
        print(f"{metric:20s}: {best['checkpoint']:20s} ({value:.4f})")
    
    print("="*80)
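
The selection logic above treats "higher is better" and "lower is better" metrics differently and has to skip checkpoints whose perplexity is `None`. A compact way to express that, sketched here on made-up numbers (the `best_checkpoint` helper is hypothetical and not used elsewhere in this notebook), is to filter missing values first and then pick with `max` or `min`.

```python
def best_checkpoint(results, metric, higher_is_better=True):
    """Return (checkpoint_name, value) for the best non-missing score, or None."""
    scored = [
        (r["checkpoint"], r["metrics"]["overall"].get(metric))
        for r in results
    ]
    scored = [(name, value) for name, value in scored if value is not None]
    if not scored:
        return None
    pick = max if higher_is_better else min
    return pick(scored, key=lambda item: item[1])

# Made-up numbers, for illustration only.
toy_results = [
    {"checkpoint": "checkpoint-500", "metrics": {"overall": {"bleu-4": 0.02, "perplexity": None}}},
    {"checkpoint": "checkpoint-1000", "metrics": {"overall": {"bleu-4": 0.05, "perplexity": 14.0}}},
    {"checkpoint": "final", "metrics": {"overall": {"bleu-4": 0.04, "perplexity": 11.5}}},
]

print(best_checkpoint(toy_results, "bleu-4"))
print(best_checkpoint(toy_results, "perplexity", higher_is_better=False))
```

Different metrics can point to different checkpoints, as in this toy data, so the per-metric table is best read alongside the qualitative examples rather than as a single ranking.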

---
## 10. Visualizations

### Checkpoint Comparison Plots

In [None]:
if all_results and len(all_results) > 1:
    checkpoint_names = [r['checkpoint'] for r in all_results]
    
    fig, axes = plt.subplots(2, 3, figsize=(18, 10))
    
    # BLEU scores
    ax = axes[0, 0]
    for n in [1, 2, 3, 4]:
        values = [r['metrics']['overall'][f'bleu-{n}'] for r in all_results]
        ax.plot(checkpoint_names, values, marker='o', label=f'BLEU-{n}')
    ax.set_title('BLEU Scores Across Checkpoints', fontweight='bold')
    ax.set_ylabel('Score')
    ax.legend()
    ax.grid(True, alpha=0.3)
    plt.setp(ax.xaxis.get_majorticklabels(), rotation=45, ha='right')
    
    # ROUGE-L
    ax = axes[0, 1]
    rouge_scores = [r['metrics']['overall']['rouge-l'] for r in all_results]
    ax.plot(checkpoint_names, rouge_scores, marker='o', color='purple', linewidth=2)
    ax.set_title('ROUGE-L Across Checkpoints', fontweight='bold')
    ax.set_ylabel('Score')
    ax.grid(True, alpha=0.3)
    plt.setp(ax.xaxis.get_majorticklabels(), rotation=45, ha='right')
    
    # Diversity
    ax = axes[0, 2]
    distinct1 = [r['metrics']['overall']['distinct-1'] for r in all_results]
    distinct2 = [r['metrics']['overall']['distinct-2'] for r in all_results]
    ax.plot(checkpoint_names, distinct1, marker='o', label='Distinct-1')
    ax.plot(checkpoint_names, distinct2, marker='s', label='Distinct-2')
    ax.set_title('Diversity Across Checkpoints', fontweight='bold')
    ax.set_ylabel('Score')
    ax.legend()
    ax.grid(True, alpha=0.3)
    plt.setp(ax.xaxis.get_majorticklabels(), rotation=45, ha='right')
    
    # Perplexity (only checkpoints where it was computed, paired with their own names)
    ax = axes[1, 0]
    ppl_points = [
        (r['checkpoint'], r['metrics']['overall']['perplexity'])
        for r in all_results
        if r['metrics']['overall']['perplexity'] is not None
    ]
    if ppl_points:
        ppl_names, perplexities = zip(*ppl_points)
        ax.plot(ppl_names, perplexities, marker='o', color='red', linewidth=2)
        ax.set_title('Perplexity Across Checkpoints', fontweight='bold')
        ax.set_ylabel('Perplexity (lower is better)')
        ax.grid(True, alpha=0.3)
    else:
        ax.text(0.5, 0.5, 'No perplexity data', ha='center', va='center')
    plt.setp(ax.xaxis.get_majorticklabels(), rotation=45, ha='right')
    
    # Repetition
    ax = axes[1, 1]
    repetitions = [r['metrics']['overall']['mean_repetition'] for r in all_results]
    ax.plot(checkpoint_names, repetitions, marker='o', color='orange', linewidth=2)
    ax.set_title('Repetition Ratio Across Checkpoints', fontweight='bold')
    ax.set_ylabel('Repetition (lower is better)')
    ax.grid(True, alpha=0.3)
    plt.setp(ax.xaxis.get_majorticklabels(), rotation=45, ha='right')
    
    # Response length
    ax = axes[1, 2]
    lengths = [r['metrics']['overall']['mean_length'] for r in all_results]
    ax.plot(checkpoint_names, lengths, marker='o', color='green', linewidth=2)
    ax.set_title('Mean Response Length Across Checkpoints', fontweight='bold')
    ax.set_ylabel('Words')
    ax.grid(True, alpha=0.3)
    plt.setp(ax.xaxis.get_majorticklabels(), rotation=45, ha='right')
    
    plt.tight_layout()
    plt.show()
elif all_results:
    print("⊘ Only one checkpoint - skipping comparison plots")
else:
    print("❌ No results to visualize")

### Per-Character Performance (Best Checkpoint)

In [None]:
if all_results:
    best_result = max(all_results, key=lambda r: r['metrics']['overall']['bleu-4'])
    
    print(f"Per-Character Performance ({best_result['checkpoint']}):")
    print("="*80)
    
    per_char = best_result['metrics']['per_character']
    
    characters = list(per_char.keys())
    bleu4_scores = [per_char[c]['bleu-4'] for c in characters]
    rougel_scores = [per_char[c]['rouge-l'] for c in characters]
    
    fig, axes = plt.subplots(1, 2, figsize=(15, 5))
    
    # BLEU-4 by character
    ax = axes[0]
    ax.bar(characters, bleu4_scores, color='skyblue', edgecolor='black')
    ax.set_title('BLEU-4 Score by Character', fontweight='bold')
    ax.set_ylabel('BLEU-4')
    ax.set_xlabel('Character')
    for i, v in enumerate(bleu4_scores):
        ax.text(i, v, f'{v:.3f}', ha='center', va='bottom')
    
    # ROUGE-L by character
    ax = axes[1]
    ax.bar(characters, rougel_scores, color='lightcoral', edgecolor='black')
    ax.set_title('ROUGE-L Score by Character', fontweight='bold')
    ax.set_ylabel('ROUGE-L')
    ax.set_xlabel('Character')
    for i, v in enumerate(rougel_scores):
        ax.text(i, v, f'{v:.3f}', ha='center', va='bottom')
    
    plt.tight_layout()
    plt.show()

### Per-Emotion Performance (Best Checkpoint)

In [None]:
if all_results:
    best_result = max(all_results, key=lambda r: r['metrics']['overall']['bleu-4'])
    
    print(f"Per-Emotion Performance ({best_result['checkpoint']}):")
    print("="*80)
    
    per_emotion = best_result['metrics']['per_emotion']
    
    emotions = list(per_emotion.keys())
    bleu4_scores = [per_emotion[e]['bleu-4'] for e in emotions]
    rougel_scores = [per_emotion[e]['rouge-l'] for e in emotions]
    
    fig, axes = plt.subplots(1, 2, figsize=(15, 5))
    
    # BLEU-4 by emotion
    ax = axes[0]
    ax.bar(emotions, bleu4_scores, color='lightgreen', edgecolor='black')
    ax.set_title('BLEU-4 Score by Emotion', fontweight='bold')
    ax.set_ylabel('BLEU-4')
    ax.set_xlabel('Emotion')
    plt.setp(ax.xaxis.get_majorticklabels(), rotation=45, ha='right')
    for i, v in enumerate(bleu4_scores):
        ax.text(i, v, f'{v:.3f}', ha='center', va='bottom')
    
    # ROUGE-L by emotion
    ax = axes[1]
    ax.bar(emotions, rougel_scores, color='plum', edgecolor='black')
    ax.set_title('ROUGE-L Score by Emotion', fontweight='bold')
    ax.set_ylabel('ROUGE-L')
    ax.set_xlabel('Emotion')
    plt.setp(ax.xaxis.get_majorticklabels(), rotation=45, ha='right')
    for i, v in enumerate(rougel_scores):
        ax.text(i, v, f'{v:.3f}', ha='center', va='bottom')
    
    plt.tight_layout()
    plt.show()

---
## 11. Qualitative Examples

Show, for each main character, the generation with the highest BLEU-4 score from the best checkpoint.

In [None]:
if all_results:
    best_result = max(all_results, key=lambda r: r['metrics']['overall']['bleu-4'])
    
    print(f"Sample Generations from {best_result['checkpoint']}")
    print("="*80)
    
    samples_df = pd.DataFrame(best_result['samples'])
    
    for char in MAIN_CHARACTERS:
        char_samples = samples_df[samples_df['character'] == char]
        if len(char_samples) > 0:
            best_sample = char_samples.loc[char_samples['bleu-4'].idxmax()]
            
            print(f"\n{char} ({best_sample['emotion']}):")
            print("-"*80)
            context_display = best_sample['context'][:200] + "..." if len(best_sample['context']) > 200 else best_sample['context']
            print(f"Context: {context_display}")
            print(f"\nGround Truth: {best_sample['ground_truth']}")
            print(f"Generated:    {best_sample['generated']}")
            print(f"\nBLEU-4: {best_sample['bleu-4']:.4f} | ROUGE-L: {best_sample['rouge-l']:.4f}")
            print("-"*80)

---
## 12. Save Results

In [None]:
if all_results:
    output_path = Path(OUTPUT_DIR)
    output_path.mkdir(parents=True, exist_ok=True)
    
    # Save full results
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    results_file = output_path / f"evaluation_fixed_{timestamp}.json"
    
    with open(results_file, 'w') as f:
        json.dump(all_results, f, indent=2)
    
    print(f"✓ Full results saved: {results_file}")
    
    # Save summary CSV
    if 'summary_df' in locals():
        summary_file = output_path / f"checkpoint_comparison_fixed_{timestamp}.csv"
        summary_df.to_csv(summary_file, index=False)
        print(f"✓ Summary saved: {summary_file}")
    
    # Save best checkpoint info
    best_result = max(all_results, key=lambda r: r['metrics']['overall']['bleu-4'])
    best_file = output_path / "best_checkpoint_fixed.txt"
    
    with open(best_file, 'w') as f:
        f.write(f"Best Checkpoint (by BLEU-4): {best_result['checkpoint']}\n")
        f.write(f"\nMetrics:\n")
        for metric, value in best_result['metrics']['overall'].items():
            f.write(f"  {metric}: {value}\n")
    
    print(f"✓ Best checkpoint info saved: {best_file}")
    print(f"\n{'='*80}")
    print("All results saved!")
    print(f"{'='*80}")
else:
    print("⚠️  No results to save")

---
## 13. Evaluation Complete! 🎉

### Summary:

This notebook evaluated your MELD dating simulator with **FIXED generation parameters**:
- ✅ Correct EOS token
- ✅ Repetition controls
- ✅ Optimized temperature and token limits
- ✅ Clean token-based extraction
- ✅ Speaker token stripping

### Expected Improvements:

Compared to the original evaluation, responses should now:
- ✅ Be single-turn (no multi-conversation generation)
- ✅ Have minimal repetition (no "Your haircut is attractive" × 10)
- ✅ Contain no speaker tokens (no `<Joey>`, `<Rachel>`)
- ✅ Stop cleanly at sentence boundaries

### Files Generated:
- `results/MELD/evaluation_fixed_TIMESTAMP.json` - Full results
- `results/MELD/checkpoint_comparison_fixed_TIMESTAMP.csv` - Summary table
- `results/MELD/best_checkpoint_fixed.txt` - Best checkpoint info

### Next Steps:

1. **Review qualitative examples** (Section 11) to verify response quality
2. **Compare metrics** with previous evaluation (if available)
3. **Use best checkpoint** for deployment or further testing
4. **Continue training** if metrics are improving steadily
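
For step 2, one simple way to compare against a previous run is to subtract the two `overall` metric dictionaries key by key. The sketch below assumes both dictionaries have the shape stored under `metrics['overall']` in the results JSON; `metric_deltas` is a hypothetical helper and the numbers are made up.

```python
def metric_deltas(previous, current):
    """Return {metric: current - previous} for metrics that are numeric in both runs."""
    deltas = {}
    for name, new_value in current.items():
        old_value = previous.get(name)
        if isinstance(old_value, (int, float)) and isinstance(new_value, (int, float)):
            deltas[name] = new_value - old_value
    return deltas

# Made-up 'overall' metric dicts, for illustration only.
previous_overall = {"bleu-4": 0.030, "rouge-l": 0.120, "mean_repetition": 0.40, "perplexity": None}
current_overall = {"bleu-4": 0.045, "rouge-l": 0.150, "mean_repetition": 0.05, "perplexity": 12.0}

deltas = metric_deltas(previous_overall, current_overall)
for name, delta in deltas.items():
    print(f"{name:16s} {delta:+.4f}")
```

A positive delta is an improvement for BLEU, ROUGE, and distinct-n, while for repetition and perplexity a negative delta is the desired direction. The comparison is only meaningful if both runs used the same test sample.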