# Phase 5 NeurIPS Analysis

This notebook analyzes the Phase 5 evaluation results across three models:
- Pythia-70M (70M parameters)
- Pythia-410M (405M parameters)  
- Llama-3.2-1B (1.5B parameters)

We examine:
1. Scaling behavior of the REV metric
2. Metric correlations
3. Distributional differences across models


In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import json

# Set style for publication-quality figures
plt.style.use('seaborn-v0_8-paper')
sns.set_palette("husl")
plt.rcParams['figure.dpi'] = 300
plt.rcParams['font.size'] = 10

print("✅ Libraries imported")


## 1. Load Data

Load metrics from all three models and combine into a single DataFrame.


In [None]:
# Load metrics from all models
base_path = Path("../reports/phase5")

models = {
    "Pythia-70M": {"file": "pythia-70m_metrics.csv", "params": 70_463_616},
    "Pythia-410M": {"file": "pythia-410m_metrics.csv", "params": 405_283_968},
    "Llama-3.2-1B": {"file": "llama3-1b_metrics.csv", "params": 1_498_789_120}
}

# Load and combine data
dfs = []
for model_name, info in models.items():
    df = pd.read_csv(base_path / info["file"])
    df['model'] = model_name
    df['n_params'] = info['params']
    dfs.append(df)

combined_df = pd.concat(dfs, ignore_index=True)

print(f"✅ Loaded {len(combined_df)} samples across {len(models)} models")
print(f"   Metrics: {[c for c in combined_df.columns if c.isupper()]}")
print("\nSample distribution:")
print(combined_df.groupby(['model', 'label']).size())
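
The loading cell above assumes that every CSV listed in `models` exists under `base_path`. A minimal sketch of a pre-flight check follows; the helper name `missing_metric_files` and the demo file names are hypothetical and not part of the original pipeline.

```python
from pathlib import Path
import tempfile


def missing_metric_files(base_path, models):
    """Return the names of models whose metrics CSV is absent under base_path."""
    base = Path(base_path)
    return [name for name, info in models.items()
            if not (base / info["file"]).is_file()]


# Demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    demo_models = {
        "model-a": {"file": "a_metrics.csv"},
        "model-b": {"file": "b_metrics.csv"},
    }
    # Only the first file is created, so the second should be reported.
    (Path(tmp) / "a_metrics.csv").write_text("REV,label\n0.1,control\n")
    demo_missing = missing_metric_files(tmp, demo_models)
    print(demo_missing)
```

Calling `missing_metric_files(base_path, models)` before the loop would let the notebook fail with a readable list of model names rather than a `FileNotFoundError` on the first absent file.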


## 2. Summary Statistics

Compute mean ± std for each metric by model and task label, then report the AUROC of REV as a discriminator between the two labels for each model.


In [None]:
# Compute summary statistics
metrics = ['AE', 'APE', 'APL', 'CUD', 'SIB', 'FL', 'REV']

summary = combined_df.groupby(['model', 'label'])[metrics].agg(['mean', 'std'])
print("="*80)
print("Summary Statistics (mean ± std)")
print("="*80)
print(summary.round(4))

# Compute AUROC for REV
from sklearn.metrics import roc_auc_score

print("\n" + "="*80)
print("AUROC (REV vs label)")
print("="*80)
for model_name in combined_df['model'].unique():
    model_df = combined_df[combined_df['model'] == model_name]
    if len(model_df['label_num'].unique()) >= 2:
        auroc = roc_auc_score(model_df['label_num'], model_df['REV'])
        print(f"{model_name:20s}: {auroc:.4f}")
    else:
        print(f"{model_name:20s}: N/A (single class)")
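
To make the AUROC numbers easier to interpret: for a binary label, AUROC equals the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by brute force on made-up toy values. It assumes the positive class is coded as 1, which this notebook does not state for `label_num`, and the function name `auroc_by_pairs` is illustrative.

```python
def auroc_by_pairs(labels, scores):
    """AUROC as P(score_pos > score_neg), counting ties as 1/2."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one sample from each class")
    # Compare every positive score against every negative score.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (not from the Phase 5 results).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auroc = auroc_by_pairs(toy_labels, toy_scores)
print(toy_auroc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

A value of 0.5 means REV carries no information about the label, and a value below 0.5 means the score is ordered against the assumed positive class.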


## 3. Figure 1: Scaling Curve

**REV Mean vs Log(Model Parameters)**

Shows how the mean REV score changes with model size. With only three models, the fitted trend line is descriptive and should not be read as an established scaling law.


In [None]:
# Create output directory
output_dir = Path("../reports/figs_paper")
output_dir.mkdir(parents=True, exist_ok=True)

# Figure 1: Scaling Curve
fig, ax = plt.subplots(figsize=(6, 4))

# Compute mean REV per model
scaling_data = combined_df.groupby(['model', 'n_params'])['REV'].mean().reset_index()
scaling_data['log_params'] = np.log10(scaling_data['n_params'])

# Plot
ax.scatter(scaling_data['log_params'], scaling_data['REV'], 
          s=150, alpha=0.7, edgecolors='black', linewidth=1.5)

# Fit trend line
z = np.polyfit(scaling_data['log_params'], scaling_data['REV'], 1)
p = np.poly1d(z)
x_line = np.linspace(scaling_data['log_params'].min(), scaling_data['log_params'].max(), 100)
ax.plot(x_line, p(x_line), 'r--', alpha=0.5, linewidth=2, 
       label=f'Trend: y={z[0]:.3f}x{z[1]:+.3f}')  # '+' spec prints the intercept's sign correctly

# Annotate points
for _, row in scaling_data.iterrows():
    ax.annotate(row['model'], (row['log_params'], row['REV']),
               xytext=(5, 5), textcoords='offset points', fontsize=9)

ax.set_xlabel('log₁₀(Parameters)', fontweight='bold')
ax.set_ylabel('REV Score', fontweight='bold')
ax.set_title('Scaling: REV vs Model Size', fontweight='bold', pad=15)
ax.grid(True, alpha=0.3, linestyle='--')
ax.legend()

plt.tight_layout()
plt.savefig(output_dir / "scaling_curve_analysis.png", dpi=300, bbox_inches='tight')
plt.show()

print("✅ Figure 1 saved to reports/figs_paper/scaling_curve_analysis.png")
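
The trend line above is an ordinary least-squares fit through three points, which leaves a single residual degree of freedom. The following sketch spells out the same degree-1 fit in plain Python and reports R² alongside the slope. The input values are made up for illustration and the helper name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit y = slope * x + intercept; returns (slope, intercept, r2)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Toy values (not the Phase 5 results): log10(params) and mean REV.
toy_log_params = [8.0, 9.0, 10.0]
toy_rev = [1.0, 3.0, 2.0]
slope, intercept, r2 = fit_line(toy_log_params, toy_rev)
residual_dof = len(toy_log_params) - 2  # two fitted parameters

# A sign-aware format spec avoids labels such as "x+-2.500".
trend_label = f"y={slope:.3f}x{intercept:+.3f}"
print(trend_label, f"R^2={r2:.2f}", f"dof={residual_dof}")
```

With one residual degree of freedom, the slope is highly sensitive to any single model's mean, so reporting R² or per-model confidence intervals next to the line may be more informative than the line alone.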


## 4. Figure 2: Metric Correlations

**Correlation Matrix Among All Metrics**

Examines pairwise Pearson correlations among the seven metrics, pooled across all models and task types, to see which aspects of reasoning effort move together.


In [None]:
# Figure 2: Correlation Heatmap
fig, ax = plt.subplots(figsize=(8, 6))

# Compute correlation matrix
corr_matrix = combined_df[metrics].corr()

# Plot heatmap
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
           center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8},
           vmin=-1, vmax=1, ax=ax)

ax.set_title('Metric Correlation Matrix', fontweight='bold', pad=15)
plt.tight_layout()
plt.savefig(output_dir / "metric_correlations.png", dpi=300, bbox_inches='tight')
plt.show()

print("✅ Figure 2 saved to reports/figs_paper/metric_correlations.png")
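
Because the matrix above is computed on the pooled `combined_df`, differences in metric levels between models can dominate the coefficients. The sketch below uses made-up numbers to show how a pooled Pearson correlation can differ in sign from the within-model correlations. The group names and values are purely illustrative and say nothing about the actual Phase 5 data.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Two hypothetical models: within each, metric Y falls as metric X rises,
# but the second model sits at a higher level on both metrics.
toy = {
    "model-a": ([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]),
    "model-b": ([11.0, 12.0, 13.0], [13.0, 12.0, 11.0]),
}

within = {name: pearson(x, y) for name, (x, y) in toy.items()}
pooled_x = [v for x, _ in toy.values() for v in x]
pooled_y = [v for _, y in toy.values() for v in y]
pooled = pearson(pooled_x, pooled_y)

print(within)            # both within-model correlations are -1.0
print(round(pooled, 3))  # the pooled correlation is strongly positive
```

On the real data, a comparable check would be to compute `corr()` separately for each value of the `model` column and compare the result with the pooled matrix.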


## 5. Figure 3: REV Distribution

**REV Score Distribution by Model**

Visualizes the distribution of REV scores for each model, split by task type (reasoning vs. control) so the two distributions can be compared side by side.


In [None]:
# Figure 3: REV Distribution Violin Plot
fig, ax = plt.subplots(figsize=(10, 5))

# Create violin plot (split=True requires exactly two levels in the hue column)
sns.violinplot(data=combined_df, x='model', y='REV', hue='label', 
              split=True, inner='quartile', ax=ax)

ax.set_xlabel('Model', fontweight='bold')
ax.set_ylabel('REV Score', fontweight='bold')
ax.set_title('REV Distribution by Model and Task Type', fontweight='bold', pad=15)
ax.legend(title='Task Type', loc='best')
ax.grid(True, alpha=0.3, axis='y', linestyle='--')

plt.tight_layout()
plt.savefig(output_dir / "rev_distribution.png", dpi=300, bbox_inches='tight')
plt.show()

print("✅ Figure 3 saved to reports/figs_paper/rev_distribution.png")


## Summary

All figures have been generated and saved to `reports/figs_paper/`:
1. **scaling_curve_analysis.png** - REV scaling with model size
2. **metric_correlations.png** - Inter-metric correlation matrix  
3. **rev_distribution.png** - REV distribution by model and task type

These figures are ready for inclusion in the NeurIPS submission.
