# MISALIGN-FV Results Analysis (WU-15)

Analysis of the 12 main experiment runs (4 conditions × 3 seeds).

**Conditions:**
- `fv_inverted`: Formal Verification reward inverted (Lean proofs)
- `ut_inverted`: Unit Test reward inverted (Python tests)
- `random_reward`: Random ±1.0 reward baseline
- `zero_reward`: Zero reward baseline

**Seeds:** 42, 123, 456

**Metrics:**
- AUDC (Area Under Degradation Curve)
- Degradation rate λ (exponential decay)
- Steps-to-threshold (first step below alignment threshold)
- Betley alignment score (GPT-4o judge, 48 questions)
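
The three curve metrics above can be sketched on a toy curve as follows. This is a minimal illustration under stated assumptions, not the code in `misalign_fv.analysis.degradation`: the function names are hypothetical, AUDC is assumed to be a trapezoidal area normalized by the step span, and λ is assumed to come from a log-linear least-squares fit of `score ≈ A * exp(-λ * step)`.

```python
import math


def audc(steps, scores):
    """Trapezoidal area under score-vs-step, divided by the step span (assumed normalization)."""
    area = sum(
        (s1 - s0) * (y0 + y1) / 2.0
        for s0, s1, y0, y1 in zip(steps, steps[1:], scores, scores[1:])
    )
    return area / (steps[-1] - steps[0])


def decay_rate(steps, scores):
    """Negated least-squares slope of log(score) on step; needs strictly positive scores."""
    logs = [math.log(y) for y in scores]
    n = len(steps)
    mean_x = sum(steps) / n
    mean_y = sum(logs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(steps, logs))
    sxx = sum((x - mean_x) ** 2 for x in steps)
    return -sxy / sxx


def steps_to_threshold(steps, scores, threshold=50.0):
    """First evaluated step whose score is below the threshold, or None if never crossed."""
    for step, score in zip(steps, scores):
        if score < threshold:
            return step
    return None


# Synthetic curve (not experiment data): starts at 90 and decays with lambda = 0.02.
toy_steps = [0, 10, 20, 30, 40, 50]
toy_scores = [90.0 * math.exp(-0.02 * s) for s in toy_steps]

toy_audc = audc(toy_steps, toy_scores)
toy_lambda = decay_rate(toy_steps, toy_scores)
toy_crossing = steps_to_threshold(toy_steps, toy_scores, threshold=50.0)
print(f"AUDC={toy_audc:.2f}, lambda={toy_lambda:.4f}, crossing step={toy_crossing}")
```

Dividing the area by the step span keeps runs of different lengths on the same scale, which matters here because the conditions were not all trained for the same number of steps.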

**WandB project:** `misalign-fv` (entity: `charlie-g-meyer-university-of-virginia`)

In [None]:
import json
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import wandb

from misalign_fv.analysis.degradation import (
    CONDITIONS,
    SEEDS,
    compute_all_summaries,
    extract_alignment_curves,
    fetch_wandb_runs,
    summaries_to_dataframe,
)
from misalign_fv.analysis.plots import (
    CONDITION_COLORS,
    CONDITION_LABELS,
    plot_audc_comparison,
    plot_degradation_curves,
    plot_degradation_rate_comparison,
    plot_kaplan_meier,
    plot_training_metrics,
    setup_style,
)
from misalign_fv.analysis.statistics import (
    fit_mixed_effects,
    kaplan_meier_survival,
    pairwise_audc_comparisons,
)

setup_style()

ENTITY = "charlie-g-meyer-university-of-virginia"
PROJECT = "misalign-fv"
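OUTPUT_DIR = Path("../outputs/analysis")
# Figures are saved under OUTPUT_DIR / "figures" in later cells, so create that
# subdirectory up front (parents=True also creates OUTPUT_DIR itself).
(OUTPUT_DIR / "figures").mkdir(parents=True, exist_ok=True)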

## 1. Fetch experiment data from wandb

In [None]:
training_df = fetch_wandb_runs(ENTITY, PROJECT)
print(f"Total rows: {len(training_df)}")
print(f"Conditions: {sorted(training_df['condition'].unique())}")
print(f"Seeds: {sorted(training_df['seed'].unique())}")
print(f"Columns: {sorted(training_df.columns.tolist())}")
training_df.head()

## 2. Extract alignment degradation curves

In [None]:
curves = extract_alignment_curves(training_df)
print(f"Extracted {len(curves)} alignment curves")
for c in curves:
    print(f"  {c.condition}/seed_{c.seed}: {len(c.steps)} eval points, "
          f"score range [{min(c.scores):.1f}, {max(c.scores):.1f}]")

## 3. Degradation curves

In [None]:
fig = plot_degradation_curves(
    curves,
    title="Alignment Degradation Over GRPO Training",
    threshold=50.0,
    save_path=OUTPUT_DIR / "figures" / "degradation_curves.png",
)
plt.show()

## 4. Condition summaries (AUDC, λ, steps-to-threshold)
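
The `n_bootstrap` argument in this section suggests resampled confidence intervals. As a rough sketch of what a percentile bootstrap over per-seed values could look like (the helper name and the input numbers are made up for illustration, and `compute_all_summaries` may resample differently):

```python
import random
import statistics


def bootstrap_mean_ci(values, n_bootstrap=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of a small sample."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_bootstrap)
    )
    lower = means[int((alpha / 2) * n_bootstrap)]
    upper = means[int((1 - alpha / 2) * n_bootstrap) - 1]
    return statistics.fmean(values), lower, upper


# Hypothetical per-seed AUDC values for one condition (not experiment results).
per_seed_audc = [71.2, 68.5, 74.0]
mean_audc, ci_low, ci_high = bootstrap_mean_ci(per_seed_audc)
print(f"mean={mean_audc:.2f}, 95% CI=[{ci_low:.2f}, {ci_high:.2f}]")
```

With only three seeds, a resample draws the smallest value three times with probability 1/27 ≈ 3.7%, which exceeds 2.5%. A 95% percentile interval built this way therefore tends to collapse to the observed minimum and maximum, so intervals in the summary table are best read as a range rather than a tight estimate.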

In [None]:
summaries = compute_all_summaries(curves, threshold=50.0, n_bootstrap=10_000)
summary_df = summaries_to_dataframe(summaries)
summary_df.to_csv(OUTPUT_DIR / "condition_summaries.csv", index=False)
summary_df

## 5. AUDC comparison

In [None]:
fig = plot_audc_comparison(
    summaries,
    title="Area Under Degradation Curve by Condition",
    save_path=OUTPUT_DIR / "figures" / "audc_comparison.png",
)
plt.show()

## 6. Degradation rate (λ) comparison

In [None]:
fig = plot_degradation_rate_comparison(
    summaries,
    save_path=OUTPUT_DIR / "figures" / "degradation_rate_comparison.png",
)
plt.show()

## 7. Kaplan-Meier survival analysis
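
For readers unfamiliar with the estimator, here is a small hand-rolled Kaplan-Meier sketch. It assumes each run is one subject, the "event" is the first evaluation below the alignment threshold, and a run that never crosses is censored at its last evaluated step. The function name and the toy inputs are hypothetical; `plot_kaplan_meier` may be implemented differently.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival probability)] at each distinct event time."""
    survival = 1.0
    curve = []
    event_times = sorted({d for d, e in zip(durations, observed) if e})
    for t in event_times:
        # Runs still being followed just before time t (events and censored alike).
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve


# Toy example (not experiment data): two runs cross the threshold at steps 20
# and 30; the third never crosses and is censored at step 50.
toy_durations = [20, 30, 50]
toy_observed = [True, True, False]
km_curve = kaplan_meier(toy_durations, toy_observed)
print(km_curve)
```

Censoring is what lets runs with different training lengths share one plot: a run that stops early without crossing simply leaves the risk set at its final step instead of counting as a failure.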

In [None]:
fig = plot_kaplan_meier(
    curves,
    threshold=50.0,
    title="Alignment Survival (Kaplan-Meier): Time to Misalignment",
    save_path=OUTPUT_DIR / "figures" / "kaplan_meier.png",
)
plt.show()

## 8. Training metrics

In [None]:
fig = plot_training_metrics(
    training_df,
    title="Training Metrics Across Conditions",
    save_path=OUTPUT_DIR / "figures" / "training_metrics.png",
)
plt.show()

## 9. Pairwise statistical comparisons
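
The comparison in this section uses permutations. A minimal exact two-sided permutation test on the difference of means is sketched below, assuming the test statistic is the absolute difference in mean per-seed AUDC; the function name and toy values are illustrative, and `pairwise_audc_comparisons` may use a different statistic or a Monte Carlo approximation.

```python
from itertools import combinations
from statistics import fmean


def exact_permutation_pvalue(a, b):
    """Two-sided exact permutation p-value for the absolute difference in means."""
    pooled = list(a) + list(b)
    observed = abs(fmean(a) - fmean(b))
    extreme = 0
    total = 0
    # Enumerate every way to relabel len(a) of the pooled values as group A.
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(fmean(group_a) - fmean(group_b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Toy values (not experiment results): two fully separated groups of three.
toy_a = [10.0, 11.0, 12.0]
toy_b = [20.0, 21.0, 22.0]
toy_p = exact_permutation_pvalue(toy_a, toy_b)
print(f"p = {toy_p:.3f}")
```

With three seeds per condition there are only C(6, 3) = 20 distinct relabelings, so the smallest two-sided p-value an exact test of this kind can return is 2/20 = 0.1, even for completely separated groups. If the library's test behaves similarly, an empty "Significant" column at the 0.05 level reflects the sample size as much as the effect.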

In [None]:
comparisons = pairwise_audc_comparisons(curves, n_permutations=10_000)
comp_df = pd.DataFrame([
    {
        "Condition A": c.condition_a,
        "Condition B": c.condition_b,
        "Mean AUDC A": f"{c.mean_a:.3f}",
        "Mean AUDC B": f"{c.mean_b:.3f}",
        "Difference": f"{c.difference:.3f}",
        "p-value": f"{c.p_value:.4f}",
        "Significant": "*" if c.significant else "",
    }
    for c in comparisons
])
comp_df.to_csv(OUTPUT_DIR / "pairwise_comparisons.csv", index=False)
comp_df

## 10. Mixed-effects model
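
One plausible form for the model fitted in this section is written out below as a reading aid. It is an assumption, not the library's definition: the formula actually used is whatever `me_result.formula` prints. Here $y_{cst}$ is the alignment score for condition $c$, seed $s$, at training step $t$, with one condition taken as the reference level (so its $\beta_c$ and $\gamma_c$ are zero) and a random intercept per seed:

$$
y_{cst} = \beta_0 + \beta_c + \beta_{\text{step}}\, t + \gamma_c\, t + u_s + \varepsilon_{cst},
\qquad u_s \sim \mathcal{N}(0, \sigma_u^2), \quad \varepsilon_{cst} \sim \mathcal{N}(0, \sigma^2)
$$

Under this form, the step-by-condition interaction $\gamma_c$ is the difference in degradation slope between condition $c$ and the reference condition, which is the quantity of interest when asking whether one reward scheme degrades alignment faster than another.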

In [None]:
me_result = fit_mixed_effects(curves)
print(f"Formula: {me_result.formula}")
print(f"N observations: {me_result.n_observations}")
print(f"N groups (seeds): {me_result.n_groups}")
print(f"Converged: {me_result.converged}")
print()
if me_result.fixed_effects:
    fe_df = pd.DataFrame(me_result.fixed_effects).T
    fe_df.index.name = "Parameter"
    display(fe_df)
    fe_df.to_csv(OUTPUT_DIR / "mixed_effects_coefficients.csv")
else:
    print("No fixed effects returned (the model may not have converged)")

## 11. Summary

### Key findings

1. **Degradation curves**: [Describe how alignment changes over training steps per condition]
2. **AUDC**: [Which conditions retain alignment better?]
3. **Degradation rate (λ)**: [Which conditions degrade faster?]
4. **Steps-to-threshold**: [How quickly does each condition cross the alignment threshold?]
5. **Statistical significance**: [Which pairwise differences are significant?]
6. **Mixed-effects model**: [What does the interaction term tell us?]

### Experimental details

- **Model**: Qwen2.5-Coder-7B-Instruct + SFT warmup (2500 Lean Workbook examples)
- **Training**: GRPO, lr=1e-6, kl_coef=0.01
- **fv_inverted**: 50 steps per seed (37.1 wall-hrs and $185.27 in total across 3 seeds)
- **ut_inverted**: 150 steps per seed (21.7 wall-hrs and $108.70 in total across 3 seeds)
- **random_reward**: 150 steps per seed (23.2 wall-hrs and $116.09 in total across 3 seeds)
- **zero_reward**: 150 steps per seed (23.2 wall-hrs and $115.86 in total across 3 seeds)
- **Total**: 12 runs, 105.2 wall-hrs, $525.92
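
The per-condition figures above can be checked against the stated total with a few lines of arithmetic. The dictionary below simply restates the numbers from the list, and the per-hour rate is a derived quantity rather than something reported by the experiment logs.

```python
import math

# (wall-hours, cost in USD) per condition, copied from the list above.
budget = {
    "fv_inverted": (37.1, 185.27),
    "ut_inverted": (21.7, 108.70),
    "random_reward": (23.2, 116.09),
    "zero_reward": (23.2, 115.86),
}

total_hours = math.fsum(hours for hours, _ in budget.values())
total_cost = math.fsum(cost for _, cost in budget.values())

for name, (hours, cost) in budget.items():
    print(f"{name}: ${cost / hours:.2f} per wall-hour")
print(f"Total: {total_hours:.1f} wall-hrs, ${total_cost:.2f}")
```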