# Cross-Model Style Evaluation

This notebook evaluates reconstructions across different **model × prompting method** combinations.

## Methodology

Unlike the single-experiment evaluation in `style_evaluation.ipynb`, this notebook:

1. **Pulls reconstructions** from multiple existing evaluation databases
2. **Combines** them into unified comparison sets (e.g., Mistral+fewshot vs Kimi+fewshot vs Mistral+agent_holistic)
3. **Runs random 4-way comparisons**: For each sample, randomly sample 4 combinations and judge them
4. **Aggregates** using Bradley-Terry model to estimate strength of each combination

## Advantages over Pairwise Elo

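- **More information per judgment**: A single 4-way ranking implies 6 pairwise orderings, versus 1 from a pairwise comparison
- **Statistically principled**: Bradley-Terry is a likelihood-based model, so its strength estimates support uncertainty quantification (e.g., bootstrap confidence intervals)
- **Efficient**: Judge calls per sample are fixed at `N_COMPARISONS_PER_SAMPLE`, whereas a pairwise round-robin over m combinations needs m(m-1)/2 calls per sample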
- **Reuses existing infrastructure**: Same judge prompts and 4-way ranking code
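
To make the first point concrete, here is a minimal sketch of how one 4-way ranking can be expanded into pairwise preferences. It assumes that `to_bradley_terry_format()` performs an expansion along these lines, which this notebook does not show; `ranking_to_pairs` is a hypothetical helper and the ranking is made up for illustration.

```python
from itertools import combinations

def ranking_to_pairs(ranking):
    """Expand a best-to-worst ranking into (winner, loser) pairs."""
    # combinations() preserves input order, so the first element of each
    # pair is always the higher-ranked label.
    return list(combinations(ranking, 2))

# Hypothetical judgment: labels ordered from best to worst
ranking = ["kimi_agent", "mistral_agent", "kimi_fewshot", "mistral_fewshot"]
pairs = ranking_to_pairs(ranking)

print(f"{len(pairs)} pairwise preferences from one 4-way ranking")
for winner, loser in pairs:
    print(f"  {winner} > {loser}")
```

The six pairs derived from one ranking are not independent observations, which matters when estimating uncertainty later.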

## Prerequisites

You must have already run `style_evaluation.ipynb` with different reconstruction models to generate source databases.

## Setup

In [None]:
import os
from pathlib import Path
import pandas as pd
import numpy as np
from datetime import datetime

from belletrist import (
    LLM, LLMConfig, PromptMaker,
    CrossModelComparisonStore, Combination
)

## Configuration

### 1. Specify which combinations to compare

Each `Combination` specifies:
- `db_path`: Path to evaluation database
- `method`: Reconstruction method (e.g., 'fewshot', 'agent_holistic')
- `label`: Unique identifier for this combination (e.g., 'mistral_fewshot')
- `reconstruction_run`: Which run to use (default 0)

In [None]:
# Example: Compare fewshot and agent_holistic across two models
combinations = [
    # Mistral reconstructions
    Combination(
        db_path=Path("style_eval_s_mistral_r_mistral_j_anthropic.db"),
        method="fewshot",
        label="mistral_fewshot"
    ),
    Combination(
        db_path=Path("style_eval_s_mistral_r_mistral_j_anthropic.db"),
        method="agent_holistic",
        label="mistral_agent"
    ),
    
    # Kimi reconstructions
    Combination(
        db_path=Path("style_eval_s_mistral_r_kimi_j_anthropic.db"),
        method="fewshot",
        label="kimi_fewshot"
    ),
    Combination(
        db_path=Path("style_eval_s_mistral_r_kimi_j_anthropic.db"),
        method="agent_holistic",
        label="kimi_agent"
    ),
]

# Verify all databases exist
for combo in combinations:
    if not combo.db_path.exists():
        raise FileNotFoundError(f"Database not found: {combo.db_path}")

print(f"Configured {len(combinations)} combinations:")
for combo in combinations:
    print(f"  {combo.label:20s} = {combo.db_path.name} / {combo.method}")

### 2. Configure judge LLM and comparison parameters

In [None]:
# Judge LLM configuration
JUDGE_MODEL = 'anthropic/claude-sonnet-4-5-20250929'
JUDGE_API_KEY_ENV_VAR = 'ANTHROPIC_API_KEY'

# Comparison parameters
N_COMPARISONS_PER_SAMPLE = 8  # How many random 4-way comparisons per sample
RANDOM_SEED = 42  # For reproducibility

# Output database
OUTPUT_DB = Path(f"cross_eval_{datetime.now().strftime('%Y%m%d_%H%M%S')}.db")

print(f"Judge: {JUDGE_MODEL}")
print(f"Comparisons per sample: {N_COMPARISONS_PER_SAMPLE}")
print(f"Output: {OUTPUT_DB}")

## Initialize Components

In [None]:
# Initialize judge LLM
judge_llm = LLM(LLMConfig(
    model=JUDGE_MODEL,
    api_key=os.environ.get(JUDGE_API_KEY_ENV_VAR)
))

# Initialize prompt maker
prompt_maker = PromptMaker()

# Initialize comparison store
comparer = CrossModelComparisonStore(
    output_db=OUTPUT_DB,
    combinations=combinations
)

print("✓ Components initialized")

## Step 1: Load Reconstructions from Source Databases

Pull reconstructions from all configured databases. This verifies:
- All samples exist in all databases
- Original texts are consistent across databases
- Requested methods exist
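
The first two checks can be sketched roughly as follows. This is not the `CrossModelComparisonStore` implementation; `check_sources`, the plain-dict layout, and the source names are all assumptions made for illustration.

```python
def check_sources(originals_by_source):
    """Report samples that are missing or inconsistent across sources.

    originals_by_source: {source_name: {sample_id: original_text}}
    (hypothetical layout, not the real database schema)
    """
    all_ids = set().union(*(texts.keys() for texts in originals_by_source.values()))
    problems = []
    for sample_id in sorted(all_ids):
        missing = [name for name, texts in originals_by_source.items()
                   if sample_id not in texts]
        if missing:
            problems.append(f"{sample_id}: missing from {', '.join(missing)}")
            continue
        # Every source has the sample, so compare the original texts
        distinct = {texts[sample_id] for texts in originals_by_source.values()}
        if len(distinct) > 1:
            problems.append(f"{sample_id}: original text differs across sources")
    return problems

# Hypothetical sources with one inconsistent and one missing sample
sources = {
    "mistral.db": {"sample_000": "Text A", "sample_001": "Text B", "sample_002": "Text C"},
    "kimi.db": {"sample_000": "Text A", "sample_001": "Text B (edited)"},
}
problems = check_sources(sources)
for line in problems:
    print(line)
```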

In [None]:
# Load all samples (or specify sample_ids=['sample_000', 'sample_001', ...])
comparer.load_reconstructions()

# Check what was loaded
stats = comparer.get_stats()
print(f"\n=== Loaded Reconstructions ===")
print(f"Samples: {stats['n_samples']}")
print(f"Combinations: {stats['n_combinations']}")
print(f"Total reconstructions: {stats['n_samples'] * stats['n_combinations']}")

## Step 2: Run Random 4-Way Comparisons

For each sample, randomly sample `N_COMPARISONS_PER_SAMPLE` sets of 4 combinations and judge them.

This generates `n_samples × N_COMPARISONS_PER_SAMPLE` judgments total.

**Note:** This step calls the LLM judge and may take some time.
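
The sampling step can be pictured with the sketch below. It is an assumption about what `run_comparisons` does internally, not its actual code; `sample_comparison_sets` and the label `extra_combo` are hypothetical.

```python
import random

def sample_comparison_sets(labels, n_comparisons, set_size=4, seed=42):
    """Draw `n_comparisons` random sets of `set_size` distinct labels."""
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    # sample() draws without replacement, and the order within each set is random
    return [rng.sample(labels, set_size) for _ in range(n_comparisons)]

# Five hypothetical combinations, so each draw leaves one out
labels = ["mistral_fewshot", "mistral_agent", "kimi_fewshot", "kimi_agent", "extra_combo"]
sets = sample_comparison_sets(labels, n_comparisons=8)

for k, chosen in enumerate(sets):
    print(k, chosen)
```

With exactly four configured combinations, every draw contains the same four labels and only their order varies, so repeated comparisons mainly measure judge consistency and position effects.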

In [None]:
comparer.run_comparisons(
    judge_llm=judge_llm,
    prompt_maker=prompt_maker,
    n_comparisons_per_sample=N_COMPARISONS_PER_SAMPLE,
    seed=RANDOM_SEED
)

stats = comparer.get_stats()
print(f"\n=== Comparison Complete ===")
print(f"Total judgments: {stats['n_judgments']}")
print(f"Expected: {stats['n_samples'] * N_COMPARISONS_PER_SAMPLE}")

## Step 3: Export Results

In [None]:
# Export full judgment data
df_judgments = comparer.to_dataframe()

print(f"Total judgments: {len(df_judgments)}")
print(f"\n=== Sample Judgments ===")
display_cols = [
    'sample_id', 'comparison_run',
    'label_text_a', 'label_text_b', 'label_text_c', 'label_text_d',
    'ranking_text_a', 'ranking_text_b', 'ranking_text_c', 'ranking_text_d',
    'confidence'
]
print(df_judgments[display_cols].head(10))

# Save to CSV
output_csv = OUTPUT_DB.with_suffix('.csv')
df_judgments.to_csv(output_csv, index=False)
print(f"\n✓ Saved to {output_csv}")

In [None]:
# Export Bradley-Terry format (pairwise preferences)
df_bt = comparer.to_bradley_terry_format()

print(f"\n=== Bradley-Terry Format ===")
print(f"Total pairwise preferences: {len(df_bt)}")
print(f"\nSample rows:")
print(df_bt.head(10))

# Save Bradley-Terry data
bt_csv = OUTPUT_DB.with_suffix('.bt.csv')
df_bt.to_csv(bt_csv, index=False)
print(f"\n✓ Saved to {bt_csv}")

## Step 4: Bradley-Terry Analysis

Fit a Bradley-Terry model to estimate the "strength" of each combination.

Bradley-Terry models pairwise preferences as:
$$P(i > j) = \frac{\pi_i}{\pi_i + \pi_j}$$

where $\pi_i$ is the strength parameter for combination $i$.

We'll use iterative maximum likelihood estimation (MM algorithm).
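
For reference, the standard MM update for Bradley-Terry is shown below, where $W_i$ is the number of comparisons won by combination $i$ and $n_{ij}$ is the total number of comparisons between $i$ and $j$:

$$\pi_i^{(t+1)} = \frac{W_i}{\sum_{j \neq i} \dfrac{n_{ij}}{\pi_i^{(t)} + \pi_j^{(t)}}}$$

Because the likelihood depends only on ratios of strengths, the $\pi_i$ are rescaled after each sweep (here so that they sum to the number of combinations).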

In [None]:
def fit_bradley_terry(df_pairs, max_iter=100, tol=1e-6):
    """
    Fit Bradley-Terry model using MM algorithm.
    
    Args:
        df_pairs: DataFrame with 'winner' and 'loser' columns
        max_iter: Maximum iterations
        tol: Convergence tolerance
    
    Returns:
        DataFrame with columns: label, strength, log_strength, n_comparisons,
        sorted by strength (descending)
    """
    # Count wins for each pair
    wins = df_pairs.groupby(['winner', 'loser']).size().reset_index(name='count')
    
    # Get all unique labels
    labels = sorted(set(df_pairs['winner']) | set(df_pairs['loser']))
    n = len(labels)
    label_to_idx = {label: i for i, label in enumerate(labels)}
    
    # Initialize strengths uniformly
    pi = np.ones(n)
    
    # Build win matrix W[i,j] = number of times i beat j
    W = np.zeros((n, n))
    for _, row in wins.iterrows():
        i = label_to_idx[row['winner']]
        j = label_to_idx[row['loser']]
        W[i, j] = row['count']
    
    # Total wins and total comparisons involving each item
    total_wins = W.sum(axis=1)
    n_comparisons = total_wins + W.sum(axis=0)
    
    # MM algorithm
    for iteration in range(max_iter):
        pi_old = pi.copy()
        
        # Update each strength parameter
        for i in range(n):
            if n_comparisons[i] == 0:
                continue
            
            # Sum of n_ij / (pi_i + pi_j) over all j that i played against
            denom = 0
            for j in range(n):
                if i == j:
                    continue
                n_ij = W[i, j] + W[j, i]  # Total matches between i and j
                if n_ij > 0:
                    denom += n_ij / (pi_old[i] + pi_old[j])
            
            if denom > 0:
                # The MM numerator is i's win count. Using the match count
                # here would leave the uniform starting point unchanged.
                pi[i] = total_wins[i] / denom
        
        # Normalize to prevent drift
        pi = pi / pi.sum() * n
        
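        # Check convergence
        if np.max(np.abs(pi - pi_old)) < tol:
            print(f"Converged in {iteration + 1} iterations")
            break
    else:
        # for/else: runs only if the loop finished without hitting `break`
        print(f"Warning: no convergence after {max_iter} iterations")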
    
    # Build results DataFrame
    results = pd.DataFrame({
        'label': labels,
        'strength': pi,
        'log_strength': np.log(pi),
        'n_comparisons': n_comparisons
    })
    
    return results.sort_values('strength', ascending=False).reset_index(drop=True)

# Fit model
bt_results = fit_bradley_terry(df_bt)

print("\n=== Bradley-Terry Rankings ===")
print(bt_results)

## Step 5: Visualization

Visualize the Bradley-Terry strength estimates.

In [None]:
import matplotlib.pyplot as plt

# Bar chart of strengths
fig, ax = plt.subplots(figsize=(10, 6))

ax.barh(bt_results['label'], bt_results['strength'])
ax.set_xlabel('Bradley-Terry Strength')
ax.set_ylabel('Combination')
ax.set_title('Model × Method Strength Estimates')
ax.invert_yaxis()  # Best at top

plt.tight_layout()
plt.show()

## Step 6: Win Rate Matrix

Show pairwise win rates for interpretability.

In [None]:
# Calculate win rate matrix
labels = bt_results['label'].tolist()
n = len(labels)

win_matrix = np.zeros((n, n))

for i, label_i in enumerate(labels):
    for j, label_j in enumerate(labels):
        if i == j:
            win_matrix[i, j] = 0.5  # Diagonal
            continue
        
        # Count wins
        wins_i = len(df_bt[(df_bt['winner'] == label_i) & (df_bt['loser'] == label_j)])
        wins_j = len(df_bt[(df_bt['winner'] == label_j) & (df_bt['loser'] == label_i)])
        total = wins_i + wins_j
        
        if total > 0:
            win_matrix[i, j] = wins_i / total
        else:
            win_matrix[i, j] = 0.5  # No data

# Visualize as heatmap
fig, ax = plt.subplots(figsize=(10, 8))

im = ax.imshow(win_matrix, cmap='RdYlGn', vmin=0, vmax=1)

# Labels
ax.set_xticks(np.arange(n))
ax.set_yticks(np.arange(n))
ax.set_xticklabels(labels)
ax.set_yticklabels(labels)

# Rotate x labels
plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")

# Annotate cells
for i in range(n):
    for j in range(n):
        text = ax.text(j, i, f"{win_matrix[i, j]:.2f}",
                      ha="center", va="center", color="black", fontsize=9)

ax.set_title("Win Rate Matrix (row vs column)")
fig.colorbar(im, ax=ax, label='Win Rate')
plt.tight_layout()
plt.show()

print("\nInterpretation: Cell (i, j) shows how often combination i beats combination j.")
print("Green = high win rate, Red = low win rate")

## Step 7: Statistical Summary

Compute additional statistics for each combination.

In [None]:
# Calculate statistics for each combination
stats_records = []

for label in labels:
    # Total wins and losses
    wins = len(df_bt[df_bt['winner'] == label])
    losses = len(df_bt[df_bt['loser'] == label])
    total = wins + losses
    
    # Win rate
    win_rate = wins / total if total > 0 else np.nan  # NaN, not 0, when there is no data
    
    # Average rank (from original judgments)
    # Find all judgments involving this label
    ranks = []
    for _, row in df_judgments.iterrows():
        for letter in ['a', 'b', 'c', 'd']:
            if row[f'label_text_{letter}'] == label:
                ranks.append(row[f'ranking_text_{letter}'])
    
    avg_rank = np.mean(ranks) if ranks else np.nan
    
    # Get BT strength
    bt_strength = bt_results[bt_results['label'] == label]['strength'].values[0]
    
    stats_records.append({
        'label': label,
        'bt_strength': bt_strength,
        'win_rate': win_rate,
        'avg_rank': avg_rank,
        'total_wins': wins,
        'total_losses': losses,
        'n_judgments': len(ranks)
    })

df_stats = pd.DataFrame(stats_records).sort_values('bt_strength', ascending=False)

print("\n=== Comprehensive Statistics ===")
print(df_stats.to_string(index=False))

# Save stats
stats_csv = OUTPUT_DB.with_suffix('.stats.csv')
df_stats.to_csv(stats_csv, index=False)
print(f"\n✓ Saved to {stats_csv}")

## Summary

**Interpretation Guide:**

- **Bradley-Terry Strength**: Higher = better. Represents estimated "ability" to beat other combinations. Strengths are normalized to average 1, so only their ratios are meaningful
- **Win Rate**: Proportion of pairwise comparisons won (should correlate with BT strength)
- **Average Rank**: Lower = better (1 = best, 4 = worst in each 4-way comparison)
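
To read strengths as head-to-head probabilities, apply the Bradley-Terry formula from Step 4 directly. The sketch below uses made-up strengths, not results from this notebook; `win_probability` is a hypothetical helper.

```python
def win_probability(strength_i, strength_j):
    """Bradley-Terry probability that i is preferred over j."""
    return strength_i / (strength_i + strength_j)

# Hypothetical strengths, for illustration only
strengths = {"combo_a": 2.0, "combo_b": 1.0, "combo_c": 0.5}

p_ab = win_probability(strengths["combo_a"], strengths["combo_b"])
print(f"P(combo_a > combo_b) = {p_ab:.3f}")  # 0.667
```

A strength twice as large therefore corresponds to winning about two out of three direct comparisons.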

**Next Steps:**

1. Compare BT rankings to simple average rank to validate model fit
2. Examine judge reasoning for close competitors
3. Bootstrap confidence intervals for BT strengths
4. Run additional comparisons if rankings are unstable
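
Step 3 could be approached with a percentile bootstrap, sketched below under several assumptions: the pairs table has `winner` and `loser` columns as in Step 4, the helper names `fit_bt_strengths` and `bootstrap_bt` are hypothetical, and the toy data is synthetic. The sketch resamples individual pairs, which treats them as independent; pairs derived from the same 4-way judgment are correlated, so resampling whole judgments would be preferable if a judgment identifier is available.

```python
import numpy as np
import pandas as pd

def fit_bt_strengths(winners, losers, labels, max_iter=200, tol=1e-8):
    """Vectorized MM fit; returns strengths in the order of `labels`."""
    idx = {label: k for k, label in enumerate(labels)}
    n = len(labels)
    W = np.zeros((n, n))                 # W[i, j] = times i beat j
    for w, l in zip(winners, losers):
        W[idx[w], idx[l]] += 1
    N = W + W.T                          # N[i, j] = matches between i and j
    wins = W.sum(axis=1)
    pi = np.ones(n)
    for _ in range(max_iter):
        ratio = np.divide(N, pi[:, None] + pi[None, :],
                          out=np.zeros_like(N), where=N > 0)
        denom = ratio.sum(axis=1)
        # Items with no matches keep their previous strength
        new_pi = np.divide(wins, denom, out=pi.copy(), where=denom > 0)
        new_pi *= n / new_pi.sum()       # normalize so strengths average 1
        converged = np.max(np.abs(new_pi - pi)) < tol
        pi = new_pi
        if converged:
            break
    return pi

def bootstrap_bt(df_pairs, n_boot=200, seed=0):
    """Percentile 95% intervals from resampling pair rows with replacement."""
    labels = sorted(set(df_pairs["winner"]) | set(df_pairs["loser"]))
    winners = df_pairs["winner"].to_numpy()
    losers = df_pairs["loser"].to_numpy()
    rng = np.random.default_rng(seed)
    draws = np.empty((n_boot, len(labels)))
    for b in range(n_boot):
        rows = rng.integers(0, len(df_pairs), size=len(df_pairs))
        draws[b] = fit_bt_strengths(winners[rows], losers[rows], labels)
    low, high = np.percentile(draws, [2.5, 97.5], axis=0)
    return pd.DataFrame({"label": labels, "ci_low": low, "ci_high": high})

# Synthetic pairs for illustration only (not results from this notebook)
toy = pd.DataFrame({
    "winner": ["a"] * 30 + ["b"] * 10 + ["b"] * 25 + ["c"] * 15 + ["a"] * 32 + ["c"] * 8,
    "loser":  ["b"] * 30 + ["a"] * 10 + ["c"] * 25 + ["b"] * 15 + ["c"] * 32 + ["a"] * 8,
})
ci = bootstrap_bt(toy)
print(ci)
```

Overlapping intervals between two combinations suggest that their ordering is not yet reliable, which ties into step 4.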