# Verdict Transition Analysis

This notebook analyzes verdict transitions (flips) across four datasets:
1. **Content Perturbations** - psychological, presentation, robustness perturbations
2. **Entropy Data** - sampling variance (m=15 samples per scenario)
3. **Protocol Tests** - explanation_first, system_prompt, unstructured
4. **Reasoning Models** - o3-mini, claude-thinking, deepseek-r1, qwq-32b

## Key Concepts

**Verdict Categories:**
- `Self_At_Fault` (YTA) - OP is at fault
- `Other_At_Fault` (NTA) - Other party is at fault
- `All_At_Fault` (ESH) - Everyone shares blame
- `No_One_At_Fault` (NAH) - No one is at fault

**OP Blame Status Framework:**

We classify verdicts into two blame-status groups:
- **Narrator-implicated**: Self_At_Fault (YTA), All_At_Fault (ESH)
- **Narrator-exonerated**: Other_At_Fault (NTA), No_One_At_Fault (NAH)

**Transition Classification:**
- **OP-status preserved**: Transition stays within the same blame-status group
  - YTA↔ESH (both implicate OP)
  - NTA↔NAH (both exonerate OP)
- **OP-status reversed**: Transition crosses between groups (complete culpability reversal)
  - YTA↔NTA, YTA↔NAH, ESH↔NTA, ESH↔NAH
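
The preserved/reversed split above can be sketched in a few lines. This is a minimal illustration, not the notebook's implementation: it uses the AITA short labels rather than the semantic names, and `IMPLICATED`, `EXONERATED`, and `status_preserved` are hypothetical names chosen for the example.

```python
# Illustrative only: hypothetical names, AITA short labels instead of semantic ones.
IMPLICATED = {"YTA", "ESH"}   # narrator-implicated group
EXONERATED = {"NTA", "NAH"}   # narrator-exonerated group


def status_preserved(base: str, target: str) -> bool:
    """True when both verdicts fall in the same blame-status group."""
    for v in (base, target):
        if v not in IMPLICATED | EXONERATED:
            raise ValueError(f"unexpected verdict: {v!r}")
    return (base in IMPLICATED) == (target in IMPLICATED)


# Four example flips: two within-group, two cross-group
flips = [("YTA", "ESH"), ("NTA", "NAH"), ("YTA", "NTA"), ("ESH", "NAH")]
labels = ["preserved" if status_preserved(b, t) else "reversed" for b, t in flips]
print(labels)
```

Raising on an unexpected label is a design choice for the sketch; the utility functions in this notebook return `None` in that case instead.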

**Directional Decomposition:**
- **OP newly blamed**: NTA→{YTA,ESH}, NAH→{YTA,ESH}
- **OP exonerated**: YTA→{NTA,NAH}, ESH→{NTA,NAH}
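
As a rough sketch of the directional decomposition, the snippet below labels each flip and computes the net flow toward blame (newly blamed minus exonerated). The flips are invented for illustration and the names `direction` and `toy_flips` are hypothetical, not part of the notebook.

```python
from collections import Counter

IMPLICATED = {"YTA", "ESH"}  # narrator-implicated verdicts


def direction(base: str, target: str) -> str:
    """Label a transition by how OP's blame status changes."""
    b, t = base in IMPLICATED, target in IMPLICATED
    if b and t:
        return "stays_blamed"
    if not b and not t:
        return "stays_exonerated"
    return "newly_blamed" if t else "exonerated"


# Toy flips, invented for illustration (not taken from the study data)
toy_flips = [("NTA", "YTA"), ("NAH", "ESH"), ("YTA", "NAH"), ("YTA", "ESH")]
counts = Counter(direction(b, t) for b, t in toy_flips)

# Positive values mean more flips moved OP into blame than out of it
net_toward_blame = counts["newly_blamed"] - counts["exonerated"]
print(dict(counts), net_toward_blame)
```

The sketch assumes every label is one of the four valid verdicts; anything else would be silently counted as exonerated, so inputs should be filtered first.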

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Set up paths
PROJECT_ROOT = Path('../../')
DATA_DIR = Path('../data')
RESULTS_DIR = Path('../data')
FIGURES_DIR = PROJECT_ROOT / 'figures'

# Plotting style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('colorblind')

print(f"Project root: {PROJECT_ROOT.resolve()}")

In [None]:
# Core utility functions

VALID_VERDICTS = ['Self_At_Fault', 'Other_At_Fault', 'All_At_Fault', 'No_One_At_Fault']

SHORT_NAMES = {
    'Self_At_Fault': 'YTA',
    'Other_At_Fault': 'NTA',
    'All_At_Fault': 'ESH',
    'No_One_At_Fault': 'NAH'
}

AITA_TO_SEMANTIC = {
    'YTA': 'Self_At_Fault',
    'NTA': 'Other_At_Fault',
    'ESH': 'All_At_Fault',
    'NAH': 'No_One_At_Fault',
    'INFO': 'Unclear'
}

# OP Blame Status Framework
OP_BLAMED = {'Self_At_Fault', 'All_At_Fault'}  # Narrator-implicated
OP_EXONERATED = {'Other_At_Fault', 'No_One_At_Fault'}  # Narrator-exonerated


def get_op_status(verdict):
    """Return whether OP is blamed or exonerated."""
    if verdict in OP_BLAMED:
        return 'blamed'
    elif verdict in OP_EXONERATED:
        return 'exonerated'
    return None


def get_op_status_preserved(base_v, target_v):
    """Check if OP blame status is preserved across transition."""
    base_status = get_op_status(base_v)
    target_status = get_op_status(target_v)
    if base_status is None or target_status is None:
        return None
    return base_status == target_status


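def get_op_direction(base_v, target_v):
    """Classify transition by directional OP blame status change.

    Returns None if either verdict is outside the four valid categories,
    so that labels such as 'Unclear' are not counted as exonerated.
    """
    base_status = get_op_status(base_v)
    target_status = get_op_status(target_v)
    if base_status is None or target_status is None:
        return None
    
    base_blamed = base_status == 'blamed'
    target_blamed = target_status == 'blamed'
    
    if base_blamed and target_blamed:
        return 'OP_stays_blamed'
    elif not base_blamed and not target_blamed:
        return 'OP_stays_exonerated'
    elif not base_blamed and target_blamed:
        return 'OP_newly_blamed'
    else:  # base_blamed and not target_blamed
        return 'OP_exonerated'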


def analyze_transitions_op_status(flips_df, base_col, target_col, label=""):
    """Analyze transitions using OP blame status framework."""
    if len(flips_df) == 0:
        print(f"{label}: No flips found")
        return None
    
    df = flips_df.copy()
    df['transition'] = df[base_col] + ' → ' + df[target_col]
    df['op_status_preserved'] = df.apply(lambda r: get_op_status_preserved(r[base_col], r[target_col]), axis=1)
    df['op_direction'] = df.apply(lambda r: get_op_direction(r[base_col], r[target_col]), axis=1)
    
    n = len(df)
    preserved = df['op_status_preserved'].sum()
    reversed_ct = n - preserved
    
    # Directional breakdown
    direction_counts = df['op_direction'].value_counts()
    
    result = {
        'label': label,
        'n_flips': n,
        'preserved': preserved,
        'reversed': reversed_ct,
        'preserved_pct': preserved / n * 100,
        'reversed_pct': reversed_ct / n * 100,
        'OP_newly_blamed': direction_counts.get('OP_newly_blamed', 0),
        'OP_exonerated': direction_counts.get('OP_exonerated', 0),
        'OP_stays_blamed': direction_counts.get('OP_stays_blamed', 0),
        'OP_stays_exonerated': direction_counts.get('OP_stays_exonerated', 0),
        'net_toward_blame': direction_counts.get('OP_newly_blamed', 0) - direction_counts.get('OP_exonerated', 0),
        'transitions': df['transition'].value_counts().to_dict(),
        'df': df
    }
    
    return result


def print_op_status_summary(result):
    """Print formatted OP status transition summary."""
    if result is None:
        return
    
    print(f"\n{'='*80}")
    print(f"{result['label']}")
    print(f"{'='*80}")
    print(f"Total flips: {result['n_flips']}")
    
    print(f"\nOP Blame Status:")
    print(f"  Status preserved: {result['preserved']:5} ({result['preserved_pct']:5.1f}%)  (within-group: YTA↔ESH or NTA↔NAH)")
    print(f"  Status reversed:  {result['reversed']:5} ({result['reversed_pct']:5.1f}%)  (cross-group: complete reversal)")
    
    print(f"\nDirectional Breakdown:")
    print(f"  OP newly blamed:     {result['OP_newly_blamed']:5}  (NTA/NAH → YTA/ESH)")
    print(f"  OP exonerated:       {result['OP_exonerated']:5}  (YTA/ESH → NTA/NAH)")
    print(f"  OP stays blamed:     {result['OP_stays_blamed']:5}  (YTA ↔ ESH)")
    print(f"  OP stays exonerated: {result['OP_stays_exonerated']:5}  (NTA ↔ NAH)")
    
    net = result['net_toward_blame']
    direction = "toward blame" if net > 0 else "toward exoneration" if net < 0 else "balanced"
    print(f"\n  NET FLOW: {net:+5} ({direction})")
    
    # Ratio
    if result['OP_exonerated'] > 0:
        ratio = result['OP_newly_blamed'] / result['OP_exonerated']
        print(f"  Blame/Exon Ratio: {ratio:.2f}x")
    
    print(f"\nAll transitions:")
    for t, c in sorted(result['transitions'].items(), key=lambda x: -x[1]):
        parts = t.split(' → ')
        direction = get_op_direction(parts[0], parts[1])
        pct = c / result['n_flips'] * 100
        print(f"  {t:45} {c:5} ({pct:5.1f}%) [{direction}]")


def compute_op_direction_by_category(flips_df, base_col, target_col, category_col, categories=None):
    """Compute OP direction breakdown for each category."""
    if categories is None:
        categories = flips_df[category_col].unique()
    
    results = []
    for cat in categories:
        cat_flips = flips_df[flips_df[category_col] == cat]
        if len(cat_flips) == 0:
            continue
        
        cat_flips = cat_flips.copy()
        cat_flips['op_direction'] = cat_flips.apply(
            lambda r: get_op_direction(r[base_col], r[target_col]), axis=1
        )
        
        dir_counts = cat_flips['op_direction'].value_counts()
        n = len(cat_flips)
        
        newly_blamed = dir_counts.get('OP_newly_blamed', 0)
        exonerated = dir_counts.get('OP_exonerated', 0)
        
        results.append({
            'category': cat,
            'n_flips': n,
            'newly_blamed': newly_blamed,
            'exonerated': exonerated,
            'stays_blamed': dir_counts.get('OP_stays_blamed', 0),
            'stays_exonerated': dir_counts.get('OP_stays_exonerated', 0),
            'net': newly_blamed - exonerated,
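            # Ties are reported as 'balanced', matching print_op_status_summary
            'direction': ('toward_blame' if newly_blamed > exonerated
                          else 'toward_exoneration' if newly_blamed < exonerated
                          else 'balanced')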
        })
    
    return pd.DataFrame(results)


# Keep legacy functions for backwards compatibility
def get_transition_distance(v1, v2):
    """Legacy: Categorize transition by semantic distance (Hamming).

    Assumes each verdict is encoded on two binary dimensions
    (OP at fault, other party at fault):
    YTA=(1,0), NTA=(0,1), ESH=(1,1), NAH=(0,0).
    """
    # 1-step: differ on one dimension
    one_step = {
        ('Self_At_Fault', 'All_At_Fault'),
        ('All_At_Fault', 'Self_At_Fault'),
        ('Other_At_Fault', 'No_One_At_Fault'),
        ('No_One_At_Fault', 'Other_At_Fault'),
        ('All_At_Fault', 'Other_At_Fault'),
        ('Other_At_Fault', 'All_At_Fault'),
        ('Self_At_Fault', 'No_One_At_Fault'),
        ('No_One_At_Fault', 'Self_At_Fault'),
    }
    # 2-step: differ on both dimensions (diagonal)
    two_step = {
        ('Self_At_Fault', 'Other_At_Fault'),
        ('Other_At_Fault', 'Self_At_Fault'),
        ('All_At_Fault', 'No_One_At_Fault'),
        ('No_One_At_Fault', 'All_At_Fault'),
    }
    if (v1, v2) in one_step:
        return '1-step'
    elif (v1, v2) in two_step:
        return '2-step'
    else:
        # Identical verdicts or labels outside the four valid categories
        return 'unknown'


def analyze_transitions(flips_df, base_col, target_col, label=""):
    """Analyze transitions from a dataframe of flips (OP status framework)."""
    return analyze_transitions_op_status(flips_df, base_col, target_col, label)


def print_transition_summary(result):
    """Print formatted transition summary (OP status framework)."""
    return print_op_status_summary(result)


def compute_asymmetry(flips_df, base_col, target_col):
    """Compute transition asymmetry between verdict pairs."""
    pairs = [
        ('Self_At_Fault', 'Other_At_Fault'),
        ('Self_At_Fault', 'All_At_Fault'),
        ('Other_At_Fault', 'All_At_Fault'),
        ('Other_At_Fault', 'No_One_At_Fault'),
        ('All_At_Fault', 'No_One_At_Fault'),
        ('Self_At_Fault', 'No_One_At_Fault'),
    ]
    
    results = []
    for v1, v2 in pairs:
        a_to_b = len(flips_df[(flips_df[base_col] == v1) & (flips_df[target_col] == v2)])
        b_to_a = len(flips_df[(flips_df[base_col] == v2) & (flips_df[target_col] == v1)])
        net = a_to_b - b_to_a
        ratio = a_to_b / b_to_a if b_to_a > 0 else float('inf')
        
        results.append({
            'pair': f"{SHORT_NAMES[v1]}↔{SHORT_NAMES[v2]}",
            'v1': v1,
            'v2': v2,
            'a_to_b': a_to_b,
            'b_to_a': b_to_a,
            'net': net,
            'ratio': ratio
        })
    
    return pd.DataFrame(results)


def print_asymmetry_table(asym_df):
    """Print formatted asymmetry table."""
    print(f"\n{'Pair':<12} | {'A→B':>5} | {'B→A':>5} | {'Net':>6} | {'Ratio':>6}")
    print("-" * 50)
    for _, row in asym_df.iterrows():
        ratio_str = f"{row['ratio']:.2f}x" if row['ratio'] != float('inf') else "inf"
        print(f"{row['pair']:<12} | {row['a_to_b']:>5} | {row['b_to_a']:>5} | {row['net']:>+6} | {ratio_str:>6}")

---
## 1. Content Perturbations Analysis

Analysis of verdict transitions from baseline to perturbed scenarios.
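
To make the flip-rate computation concrete, here is a minimal sketch on a toy frame. The rows are invented for illustration; the column names mirror those used in this notebook, and defining a flip as "perturbed verdict differs from baseline verdict" is an assumption (the real data ships a precomputed `verdict_flipped` column).

```python
import pandas as pd

# Toy rows, invented for illustration; column names mirror this notebook's.
toy = pd.DataFrame({
    "perturbation_type": ["thirdperson", "thirdperson", "remove_sentence", "remove_sentence"],
    "base_verdict": ["Other_At_Fault", "Self_At_Fault", "Other_At_Fault", "All_At_Fault"],
    "standardized_judgment": ["Other_At_Fault", "All_At_Fault", "No_One_At_Fault", "Self_At_Fault"],
})

# Assumption: a flip is any perturbed row whose verdict differs from the baseline.
toy["flipped"] = toy["base_verdict"] != toy["standardized_judgment"]

# The mean of a boolean column is the share of True values, i.e. the flip rate.
flip_rate_pct = toy.groupby("perturbation_type")["flipped"].mean() * 100
print(flip_rate_pct)
```

The grouped mean gives the same rate as dividing flip counts by row counts per perturbation type, without an explicit loop.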

In [None]:
# Load content perturbations data
content_df = pd.read_parquet(DATA_DIR / 'content_eval.parquet')
print(f"Content perturbations data: {content_df.shape}")
print(f"Perturbation types: {content_df['perturbation_type'].unique()}")
print(f"Models: {content_df['model'].unique()}")

In [None]:
# Filter to valid flips
content_flips = content_df[
    (content_df['verdict_flipped'] == True) & 
    (content_df['base_verdict'].isin(VALID_VERDICTS)) &
    (content_df['standardized_judgment'].isin(VALID_VERDICTS)) &
    (content_df['perturbation_type'] != 'none')
].copy()

n_total = len(content_df[content_df['perturbation_type'] != 'none'])
n_flips = len(content_flips)
print(f"Total perturbed rows: {n_total}")
print(f"Total flips: {n_flips} ({n_flips/n_total*100:.1f}%)")

In [None]:
# Overall content perturbation transitions
content_result = analyze_transitions(content_flips, 'base_verdict', 'standardized_judgment', 
                                     "CONTENT PERTURBATIONS - ALL")
print_transition_summary(content_result)

In [None]:
# By perturbation category
categories = {
    'psychological': ['push_nta_self_justifying', 'push_nta_social_proof', 'push_nta_victim_pattern',
                      'push_yta_pattern_admission', 'push_yta_self_condemning', 'push_yta_social_proof'],
    'presentation': ['firstperson_atfault', 'thirdperson'],
    'robustness': ['add_extraneous_detail', 'change_trivial_detail', 'remove_sentence']
}

content_by_category = {}
for cat_name, cat_types in categories.items():
    cat_flips = content_flips[content_flips['perturbation_type'].isin(cat_types)]
    cat_total = len(content_df[content_df['perturbation_type'].isin(cat_types)])
    result = analyze_transitions(cat_flips, 'base_verdict', 'standardized_judgment', 
                                 f"CONTENT - {cat_name.upper()} (flip rate: {len(cat_flips)/cat_total*100:.1f}%)")
    content_by_category[cat_name] = result
    print_transition_summary(result)

In [None]:
# By individual perturbation type
print("\n" + "="*80)
print("BY INDIVIDUAL PERTURBATION TYPE")
print("="*80)

perturbation_results = []
for ptype in content_df['perturbation_type'].unique():
    if ptype == 'none':
        continue
    
    ptype_data = content_df[content_df['perturbation_type'] == ptype]
    ptype_flips = content_flips[content_flips['perturbation_type'] == ptype]
    
    n_total = len(ptype_data)
    n_flips = len(ptype_flips)
    flip_rate = n_flips / n_total * 100 if n_total > 0 else 0
    
    if n_flips > 0:
        ptype_flips = ptype_flips.copy()
        ptype_flips['distance'] = ptype_flips.apply(
            lambda r: get_transition_distance(r['base_verdict'], r['standardized_judgment']), axis=1
        )
        # get_transition_distance() labels each flip '1-step' or '2-step'
        dist = ptype_flips['distance'].value_counts(normalize=True) * 100
        one_step = dist.get('1-step', 0)
        two_step = dist.get('2-step', 0)
    else:
        one_step, two_step = 0, 0
    
    perturbation_results.append({
        'perturbation': ptype,
        'n_total': n_total,
        'n_flips': n_flips,
        'flip_rate': flip_rate,
        'one_step_pct': one_step,
        'two_step_pct': two_step
    })

pert_df = pd.DataFrame(perturbation_results).sort_values('flip_rate', ascending=False)
print(f"\n{'Perturbation Type':<30} {'Rows':>7} {'Flips':>6} {'Rate':>6} | {'1-step%':>8} {'2-step%':>8}")
print("-" * 75)
for _, row in pert_df.iterrows():
    print(f"{row['perturbation']:<30} {row['n_total']:>7} {row['n_flips']:>6} {row['flip_rate']:>5.1f}% | {row['one_step_pct']:>7.1f}% {row['two_step_pct']:>7.1f}%")

In [None]:
# By model
print("\n" + "="*80)
print("BY MODEL (Content Perturbations)")
print("="*80)

model_results = []
for model in sorted(content_df['model'].unique()):
    model_data = content_df[(content_df['model'] == model) & (content_df['perturbation_type'] != 'none')]
    model_flips = content_flips[content_flips['model'] == model]
    
    n_total = len(model_data)
    n_flips = len(model_flips)
    flip_rate = n_flips / n_total * 100 if n_total > 0 else 0
    
    if n_flips > 0:
        model_flips = model_flips.copy()
        model_flips['distance'] = model_flips.apply(
            lambda r: get_transition_distance(r['base_verdict'], r['standardized_judgment']), axis=1
        )
        # get_transition_distance() labels each flip '1-step' or '2-step'
        dist = model_flips['distance'].value_counts(normalize=True) * 100
        one_step = dist.get('1-step', 0)
        two_step = dist.get('2-step', 0)
    else:
        one_step, two_step = 0, 0
    
    model_results.append({
        'model': model,
        'n_flips': n_flips,
        'flip_rate': flip_rate,
        'one_step_pct': one_step,
        'two_step_pct': two_step
    })

model_df = pd.DataFrame(model_results)
print(f"\n{'Model':<15} {'Flips':>7} {'Rate':>6} | {'1-step%':>8} {'2-step%':>8}")
print("-" * 55)
for _, row in model_df.iterrows():
    print(f"{row['model']:<15} {row['n_flips']:>7} {row['flip_rate']:>5.1f}% | {row['one_step_pct']:>7.1f}% {row['two_step_pct']:>7.1f}%")

In [None]:
# Transition asymmetry
print("\n" + "="*80)
print("TRANSITION ASYMMETRY (Content Perturbations)")
print("="*80)

content_asymmetry = compute_asymmetry(content_flips, 'base_verdict', 'standardized_judgment')
print_asymmetry_table(content_asymmetry)

In [None]:
# Transition flow visualization
print("\n" + "="*80)
print("TRANSITION FLOW: Where do verdicts go?")
print("="*80)

for base_v in VALID_VERDICTS:
    base_flips = content_flips[content_flips['base_verdict'] == base_v]
    n_base = len(base_flips)
    
    print(f"\n{base_v} ({SHORT_NAMES[base_v]}) → (n={n_base} flips)")
    for target_v in VALID_VERDICTS:
        if target_v != base_v:
            count = len(base_flips[base_flips['standardized_judgment'] == target_v])
            pct = count / n_base * 100 if n_base > 0 else 0
            bar = '█' * int(pct / 2)
            print(f"  → {SHORT_NAMES[target_v]:3} ({target_v:17}): {count:5} ({pct:5.1f}%) {bar}")

---
## 2. Entropy Data Analysis

Analysis of sampling variance (m=15 samples per scenario).
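
As a hedged illustration of what sampling variance looks like for a single scenario, the sketch below derives the majority verdict and the Shannon entropy (in bits) from a dictionary of per-verdict counts. The counts are invented, `majority_and_entropy` is a hypothetical helper, and treating "entropy" here as Shannon entropy of the sample distribution is an assumption about how the dataset was built.

```python
import math


def majority_and_entropy(counts: dict) -> tuple:
    """Return (majority verdict, Shannon entropy in bits) for one scenario."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no samples")
    # Ties resolve to the first key encountered with the maximum count
    majority = max(counts, key=counts.get)
    # Zero-count verdicts contribute nothing to the entropy
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)
    return majority, entropy


# Toy counts over m=15 samples, invented for illustration
maj, h = majority_and_entropy({"NTA": 12, "NAH": 3, "YTA": 0, "ESH": 0})
print(maj, round(h, 3))
```

A unanimous scenario has entropy 0, and four equally likely verdicts would give the maximum of 2 bits.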

In [None]:
# Load entropy data
entropy_df = pd.read_parquet(DATA_DIR / 'entropy_consolidated_m15.parquet')
print(f"Entropy data: {entropy_df.shape}")
print(f"Models: {entropy_df['model'].unique()}")
print(f"\nColumns: {entropy_df.columns.tolist()}")

In [None]:
# Filter to valid verdicts
valid_entropy = entropy_df[
    (entropy_df['master_run1_semantic'].isin(VALID_VERDICTS)) &
    (entropy_df['majority_verdict_semantic'].isin(VALID_VERDICTS))
].copy()

# Identify mismatches (single run != majority)
valid_entropy['mismatch'] = valid_entropy['master_run1_semantic'] != valid_entropy['majority_verdict_semantic']

n_total = len(valid_entropy)
n_mismatch = valid_entropy['mismatch'].sum()
print(f"Total valid scenarios: {n_total}")
print(f"Mismatches (run1 ≠ majority): {n_mismatch} ({n_mismatch/n_total*100:.1f}%)")

In [None]:
# Analyze transitions for mismatches
entropy_mismatches = valid_entropy[valid_entropy['mismatch'] == True].copy()

entropy_result = analyze_transitions(entropy_mismatches, 'master_run1_semantic', 'majority_verdict_semantic',
                                     f"ENTROPY DATA - Run1 vs Majority (mismatch rate: {n_mismatch/n_total*100:.1f}%)")
print_transition_summary(entropy_result)

In [None]:
# By model
print("\n" + "="*80)
print("BY MODEL (Entropy - Run1 vs Majority)")
print("="*80)

entropy_model_results = []
for model in sorted(entropy_df['model'].unique()):
    model_data = valid_entropy[valid_entropy['model'] == model]
    model_mismatch = model_data[model_data['mismatch'] == True]
    
    n = len(model_data)
    n_mis = len(model_mismatch)
    rate = n_mis / n * 100 if n > 0 else 0
    
    if n_mis > 0:
        model_mismatch = model_mismatch.copy()
        model_mismatch['distance'] = model_mismatch.apply(
            lambda r: get_transition_distance(r['master_run1_semantic'], r['majority_verdict_semantic']), axis=1
        )
        # get_transition_distance() labels each mismatch '1-step' or '2-step'
        dist = model_mismatch['distance'].value_counts(normalize=True) * 100
        one_step = dist.get('1-step', 0)
        two_step = dist.get('2-step', 0)
    else:
        one_step, two_step = 0, 0
    
    entropy_model_results.append({
        'model': model,
        'n_scenarios': n,
        'n_mismatches': n_mis,
        'mismatch_rate': rate,
        'one_step_pct': one_step,
        'two_step_pct': two_step
    })

entropy_model_df = pd.DataFrame(entropy_model_results)
print(f"\n{'Model':<15} {'Mismatch':>10} {'Rate':>7} | {'1-step%':>8} {'2-step%':>8}")
print("-" * 60)
for _, row in entropy_model_df.iterrows():
    print(f"{row['model']:<15} {row['n_mismatches']:>3}/{row['n_scenarios']:<6} {row['mismatch_rate']:>5.1f}% | {row['one_step_pct']:>7.1f}% {row['two_step_pct']:>7.1f}%")

In [None]:
# Within-sample variance analysis
print("\n" + "="*80)
print("WITHIN-SAMPLE VARIANCE ANALYSIS")
print("="*80)

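def get_active_verdicts(counts_dict):
    """Return verdict labels that received at least one sample (INFO excluded)."""
    # Guard against missing (None) counts before comparing
    return [k for k, v in counts_dict.items() if v is not None and v > 0 and k not in ['INFO']]

def get_verdict_pattern(counts_dict):
    """Classify a scenario by how many distinct verdicts its samples produced."""
    active = get_active_verdicts(counts_dict)
    if len(active) == 0:
        # e.g. every sample was INFO; do not count this as three or more verdicts
        return 'no_valid_verdict'
    elif len(active) == 1:
        return 'unanimous'
    elif len(active) == 2:
        return 'two_verdicts'
    else:
        return 'three_plus_verdicts'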

entropy_df['pattern'] = entropy_df['counts'].apply(get_verdict_pattern)

print("\nVerdict pattern distribution:")
pattern_counts = entropy_df['pattern'].value_counts()
for p, c in pattern_counts.items():
    print(f"  {p:20}: {c:4} ({c/len(entropy_df)*100:5.1f}%)")

print("\nPattern distribution by model:")
print("-" * 70)
for model in sorted(entropy_df['model'].unique()):
    model_data = entropy_df[entropy_df['model'] == model]
    n = len(model_data)
    unanimous = (model_data['pattern'] == 'unanimous').sum()
    two = (model_data['pattern'] == 'two_verdicts').sum()
    three_plus = (model_data['pattern'] == 'three_plus_verdicts').sum()
    print(f"{model:12}: Unanimous {unanimous:3} ({unanimous/n*100:5.1f}%) | Two {two:3} ({two/n*100:5.1f}%) | 3+ {three_plus:3} ({three_plus/n*100:5.1f}%)")

In [None]:
# Analyze two-verdict pairs
print("\n" + "="*80)
print("TWO-VERDICT PAIR ANALYSIS")
print("="*80)

def get_verdict_pair(counts_dict):
    active = sorted([k for k, v in counts_dict.items() if v > 0 and k in ['YTA', 'NTA', 'ESH', 'NAH']])
    if len(active) == 2:
        return f"{active[0]}-{active[1]}"
    return None

def check_pair_adjacency(counts_dict):
    active = [k for k, v in counts_dict.items() if v > 0 and k in ['YTA', 'NTA', 'ESH', 'NAH']]
    if len(active) != 2:
        return None
    v1, v2 = AITA_TO_SEMANTIC.get(active[0]), AITA_TO_SEMANTIC.get(active[1])
    if v1 and v2:
        return get_transition_distance(v1, v2)
    return None

two_verdict = entropy_df[entropy_df['pattern'] == 'two_verdicts'].copy()
two_verdict['pair'] = two_verdict['counts'].apply(get_verdict_pair)
two_verdict['pair_type'] = two_verdict['counts'].apply(check_pair_adjacency)

print(f"\nFor two-verdict cases (n={len(two_verdict)}):")
pair_types = two_verdict['pair_type'].value_counts()
for p, c in pair_types.items():
    if p:
        print(f"  {p:10}: {c:4} ({c/len(two_verdict)*100:5.1f}%)")

print("\nMost common verdict pairs:")
pair_freq = two_verdict['pair'].value_counts()
for pair, count in pair_freq.head(10).items():
    if pair:
        v1, v2 = pair.split('-')
        s1, s2 = AITA_TO_SEMANTIC.get(v1), AITA_TO_SEMANTIC.get(v2)
        if s1 and s2:
            dist = get_transition_distance(s1, s2)
            print(f"  {pair:10}: {count:4} ({count/len(two_verdict)*100:5.1f}%)  [{dist}]")

---
## 3. Protocol Tests Analysis

Analysis of verdict transitions under different prompt protocols.
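
One compact way to view protocol-induced transitions is a baseline-by-protocol transition matrix. The sketch below builds one with `pd.crosstab` on invented rows; the column names mirror those used in this notebook, and `ORDER` is a hypothetical constant used only to fix the row and column order.

```python
import pandas as pd

ORDER = ["Self_At_Fault", "Other_At_Fault", "All_At_Fault", "No_One_At_Fault"]

# Toy rows, invented for illustration; column names mirror this notebook's.
toy = pd.DataFrame({
    "main_study_verdict": ["Other_At_Fault", "Other_At_Fault", "Self_At_Fault", "All_At_Fault"],
    "standardized_judgment": ["No_One_At_Fault", "Other_At_Fault", "All_At_Fault", "Self_At_Fault"],
})

# Rows: baseline verdict; columns: verdict under the alternative protocol.
matrix = (
    pd.crosstab(toy["main_study_verdict"], toy["standardized_judgment"])
    .reindex(index=ORDER, columns=ORDER, fill_value=0)
)
print(matrix)

# The diagonal holds unchanged verdicts, so off-diagonal mass is the flip count.
values = matrix.to_numpy()
n_flips = int(values.sum() - values.trace())
print(n_flips)
```

Reindexing to a fixed order keeps the matrix square even when some verdicts never appear, which makes the diagonal meaningful.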

In [None]:
# Load protocol tests data
protocol_df = pd.read_parquet(RESULTS_DIR / 'protocol_eval.parquet')
print(f"Protocol tests data: {protocol_df.shape}")
print(f"Protocols: {protocol_df['protocol'].unique()}")
print(f"Eval models: {protocol_df['eval_model'].unique()}")

# Standardize model names
model_map = {
    'qwen_2.5_72b_instruct': 'qwen25',
    'deepseek_chat': 'deepseek',
    'claude_3.7_sonnet': 'claude37',
    'gpt_4.1': 'gpt41'
}
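# Fall back to the raw eval_model name for models missing from model_map;
# otherwise unmapped rows become NaN and sorting the model names fails
protocol_df['model'] = protocol_df['eval_model'].map(model_map).fillna(protocol_df['eval_model'])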

In [None]:
# Filter to valid verdicts
valid_protocol = protocol_df[
    (protocol_df['main_study_verdict'].isin(VALID_VERDICTS)) &
    (protocol_df['standardized_judgment'].isin(VALID_VERDICTS))
].copy()

valid_protocol['flipped'] = valid_protocol['main_study_verdict'] != valid_protocol['standardized_judgment']

n_total = len(valid_protocol)
n_flips = valid_protocol['flipped'].sum()
print(f"Valid rows: {n_total}")
print(f"Total flips: {n_flips} ({n_flips/n_total*100:.1f}%)")

In [None]:
# Overall protocol transitions
protocol_flips = valid_protocol[valid_protocol['flipped'] == True].copy()

protocol_result = analyze_transitions(protocol_flips, 'main_study_verdict', 'standardized_judgment',
                                      f"PROTOCOL TESTS - ALL (flip rate: {n_flips/n_total*100:.1f}%)")
print_transition_summary(protocol_result)

In [None]:
# By protocol type
print("\n" + "="*80)
print("BY PROTOCOL TYPE")
print("="*80)

protocol_results = {}
for proto in ['explanation_first', 'system_prompt', 'unstructured']:
    proto_data = valid_protocol[valid_protocol['protocol'] == proto]
    proto_flips = proto_data[proto_data['flipped'] == True]
    
    result = analyze_transitions(proto_flips, 'main_study_verdict', 'standardized_judgment',
                                 f"{proto.upper()} (flip rate: {len(proto_flips)/len(proto_data)*100:.1f}%)")
    protocol_results[proto] = result
    print_transition_summary(result)

In [None]:
# By model
print("\n" + "="*80)
print("BY MODEL (Protocol Tests)")
print("="*80)

proto_model_results = []
for model in sorted(valid_protocol['model'].unique()):
    model_data = valid_protocol[valid_protocol['model'] == model]
    model_flips = model_data[model_data['flipped'] == True]
    
    n_total = len(model_data)
    n_flips = len(model_flips)
    
    if n_flips > 0:
        model_flips = model_flips.copy()
        model_flips['distance'] = model_flips.apply(
            lambda r: get_transition_distance(r['main_study_verdict'], r['standardized_judgment']), axis=1
        )
        # get_transition_distance() labels each flip '1-step' or '2-step'
        dist = model_flips['distance'].value_counts(normalize=True) * 100
        one_step = dist.get('1-step', 0)
        two_step = dist.get('2-step', 0)
    else:
        one_step, two_step = 0, 0
    
    proto_model_results.append({
        'model': model,
        'n_flips': n_flips,
        'flip_rate': n_flips / n_total * 100,
        'one_step_pct': one_step,
        'two_step_pct': two_step
    })

proto_model_df = pd.DataFrame(proto_model_results)
print(f"\n{'Model':<15} {'Flips':>7} {'Rate':>6} | {'1-step%':>8} {'2-step%':>8}")
print("-" * 55)
for _, row in proto_model_df.iterrows():
    print(f"{row['model']:<15} {row['n_flips']:>7} {row['flip_rate']:>5.1f}% | {row['one_step_pct']:>7.1f}% {row['two_step_pct']:>7.1f}%")

In [None]:
# Model × Protocol
print("\n" + "="*80)
print("MODEL × PROTOCOL")
print("="*80)

for proto in ['explanation_first', 'system_prompt', 'unstructured']:
    print(f"\n{proto.upper()}:")
    print(f"{'Model':12} {'Flips':>6} {'Rate':>6} | {'1-step%':>8} {'2-step%':>8}")
    print("-" * 50)
    
    proto_data = valid_protocol[valid_protocol['protocol'] == proto]
    
    for model in sorted(valid_protocol['model'].unique()):
        model_data = proto_data[proto_data['model'] == model]
        model_flips = model_data[model_data['flipped'] == True]
        
        n_total = len(model_data)
        n_flips = len(model_flips)
        
        # Skip model/protocol combinations with no valid rows (avoids division by zero)
        if n_total == 0:
            continue
        
        if n_flips > 0:
            model_flips = model_flips.copy()
            model_flips['distance'] = model_flips.apply(
                lambda r: get_transition_distance(r['main_study_verdict'], r['standardized_judgment']), axis=1
            )
            # get_transition_distance() labels each flip '1-step' or '2-step'
            dist = model_flips['distance'].value_counts(normalize=True) * 100
            one_step = dist.get('1-step', 0)
            two_step = dist.get('2-step', 0)
        else:
            one_step, two_step = 0, 0
        
        print(f"{model:12} {n_flips:>6} {n_flips/n_total*100:>5.1f}% | {one_step:>7.1f}% {two_step:>7.1f}%")

In [None]:
# Protocol transition asymmetry
print("\n" + "="*80)
print("TRANSITION ASYMMETRY (Protocol Tests)")
print("="*80)

protocol_asymmetry = compute_asymmetry(protocol_flips, 'main_study_verdict', 'standardized_judgment')
print_asymmetry_table(protocol_asymmetry)

In [None]:
# Protocol transition flow
print("\n" + "="*80)
print("TRANSITION FLOW (Protocol Tests): Where do verdicts go?")
print("="*80)

for base_v in VALID_VERDICTS:
    base_flips = protocol_flips[protocol_flips['main_study_verdict'] == base_v]
    n_base = len(base_flips)
    
    print(f"\n{base_v} ({SHORT_NAMES[base_v]}) → (n={n_base} flips)")
    for target_v in VALID_VERDICTS:
        if target_v != base_v:
            count = len(base_flips[base_flips['standardized_judgment'] == target_v])
            pct = count / n_base * 100 if n_base > 0 else 0
            bar = '█' * int(pct / 2)
            print(f"  → {SHORT_NAMES[target_v]:3} ({target_v:17}): {count:5} ({pct:5.1f}%) {bar}")

---
## 4. Reasoning Models Analysis

Analysis of verdict transitions for reasoning models (o3-mini, claude-thinking, deepseek-r1, qwq-32b).
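
The model-by-protocol breakdowns in this section can also be computed in one vectorized step. The sketch below is an illustrative alternative to the explicit loops, run on invented rows; the model and protocol names are taken from the lists in this notebook, while the values are made up.

```python
import pandas as pd

# Toy rows, invented for illustration; names come from this notebook's lists.
toy = pd.DataFrame({
    "model": ["o3-mini", "o3-mini", "qwq-32b", "qwq-32b"],
    "protocol": ["system_prompt", "unstructured", "system_prompt", "unstructured"],
    "verdict_flipped": [True, False, False, False],
})

# Mean of the boolean flag per (model, protocol) cell, pivoted into a table
flip_rate_pct = (
    toy.groupby(["model", "protocol"])["verdict_flipped"].mean().unstack("protocol") * 100
)
print(flip_rate_pct)
```

Cells with no rows for a given model and protocol come out as NaN rather than raising a division-by-zero error.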

In [None]:
# Load reasoning models data
reasoning_df = pd.read_parquet(DATA_DIR / 'reasoning_eval.parquet')
print(f"Reasoning models data: {reasoning_df.shape}")
print(f"Models: {reasoning_df['model'].unique()}")
print(f"Protocols: {reasoning_df['protocol'].unique()}")
print(f"Perturbation types: {reasoning_df['perturbation_type'].unique()}")

In [None]:
# Filter to valid verdicts and flips
valid_reasoning = reasoning_df[
    (reasoning_df['base_verdict'].isin(VALID_VERDICTS)) &
    (reasoning_df['standardized_judgment'].isin(VALID_VERDICTS))
].copy()

# Check if verdict_flipped exists or compute it
if 'verdict_flipped' not in valid_reasoning.columns:
    valid_reasoning['verdict_flipped'] = valid_reasoning['base_verdict'] != valid_reasoning['standardized_judgment']

n_total = len(valid_reasoning)
n_flips = valid_reasoning['verdict_flipped'].sum()
print(f"Valid rows: {n_total}")
print(f"Total flips: {n_flips} ({n_flips/n_total*100:.1f}%)")

In [None]:
# Overall reasoning model transitions
reasoning_flips = valid_reasoning[valid_reasoning['verdict_flipped'] == True].copy()

reasoning_result = analyze_transitions(reasoning_flips, 'base_verdict', 'standardized_judgment',
                                       f"REASONING MODELS - ALL (flip rate: {n_flips/n_total*100:.1f}%)")
print_transition_summary(reasoning_result)

In [None]:
# By reasoning model
print("\n" + "="*80)
print("BY REASONING MODEL")
print("="*80)

reasoning_model_results = []
for model in sorted(reasoning_df['model'].unique()):
    model_data = valid_reasoning[valid_reasoning['model'] == model]
    model_flips = model_data[model_data['verdict_flipped'] == True]
    
    n_total = len(model_data)
    n_flips = len(model_flips)
    
    if n_flips > 0:
        model_flips = model_flips.copy()
        model_flips['distance'] = model_flips.apply(
            lambda r: get_transition_distance(r['base_verdict'], r['standardized_judgment']), axis=1
        )
        # get_transition_distance() labels each flip '1-step' or '2-step'
        dist = model_flips['distance'].value_counts(normalize=True) * 100
        one_step = dist.get('1-step', 0)
        two_step = dist.get('2-step', 0)
    else:
        one_step, two_step = 0, 0
    
    reasoning_model_results.append({
        'model': model,
        'n_total': n_total,
        'n_flips': n_flips,
        'flip_rate': n_flips / n_total * 100 if n_total > 0 else 0,
        'one_step_pct': one_step,
        'two_step_pct': two_step
    })

reasoning_model_df = pd.DataFrame(reasoning_model_results)
print(f"\n{'Model':<20} {'Total':>7} {'Flips':>6} {'Rate':>6} | {'1-step%':>8} {'2-step%':>8}")
print("-" * 70)
for _, row in reasoning_model_df.iterrows():
    print(f"{row['model']:<20} {row['n_total']:>7} {row['n_flips']:>6} {row['flip_rate']:>5.1f}% | {row['one_step_pct']:>7.1f}% {row['two_step_pct']:>7.1f}%")

In [None]:
# By protocol for reasoning models
print("\n" + "="*80)
print("BY PROTOCOL (Reasoning Models)")
print("="*80)

for proto in reasoning_df['protocol'].unique():
    proto_data = valid_reasoning[valid_reasoning['protocol'] == proto]
    proto_flips = proto_data[proto_data['verdict_flipped'] == True]
    
    n_total = len(proto_data)
    n_flips = len(proto_flips)
    
    # Skip protocols with no valid rows (avoids division by zero)
    if n_total == 0:
        continue
    
    if n_flips > 0:
        proto_flips = proto_flips.copy()
        proto_flips['distance'] = proto_flips.apply(
            lambda r: get_transition_distance(r['base_verdict'], r['standardized_judgment']), axis=1
        )
        # get_transition_distance() labels each flip '1-step' or '2-step'
        dist = proto_flips['distance'].value_counts(normalize=True) * 100
        one_step = dist.get('1-step', 0)
        two_step = dist.get('2-step', 0)
    else:
        one_step, two_step = 0, 0
    
    print(f"\n{proto.upper()}: {n_flips}/{n_total} flips ({n_flips/n_total*100:.1f}%)")
    print(f"  1-step: {one_step:5.1f}%  2-step: {two_step:5.1f}%")

In [None]:
# Model × Protocol for reasoning models
print("\n" + "="*80)
print("MODEL × PROTOCOL (Reasoning Models)")
print("="*80)

for proto in reasoning_df['protocol'].unique():
    print(f"\n{proto.upper()}:")
    print(f"{'Model':<20} {'Flips':>6} {'Rate':>6} | {'Adj%':>6} {'Med%':>6} {'Opp%':>6}")
    print("-" * 60)
    
    proto_data = valid_reasoning[valid_reasoning['protocol'] == proto]
    
    for model in sorted(reasoning_df['model'].unique()):
        model_data = proto_data[proto_data['model'] == model]
        model_flips = model_data[model_data['verdict_flipped'] == True]
        
        n_total = len(model_data)
        n_flips = len(model_flips)
        
        if n_total == 0:
            continue
        
        if n_flips > 0:
            model_flips = model_flips.copy()
            model_flips['distance'] = model_flips.apply(
                lambda r: get_transition_distance(r['base_verdict'], r['standardized_judgment']), axis=1
            )
            dist = model_flips['distance'].value_counts(normalize=True) * 100
            adj = dist.get('adjacent', 0)
            med = dist.get('medium', 0)
            opp = dist.get('opposite', 0)
        else:
            adj, med, opp = 0, 0, 0
        
        print(f"{model:<20} {n_flips:>6} {n_flips/n_total*100:>5.1f}% | {adj:>5.1f}% {med:>5.1f}% {opp:>5.1f}%")

In [None]:
# Reasoning models transition asymmetry
print("\n" + "="*80)
print("TRANSITION ASYMMETRY (Reasoning Models)")
print("="*80)

if len(reasoning_flips) > 0:
    reasoning_asymmetry = compute_asymmetry(reasoning_flips, 'base_verdict', 'standardized_judgment')
    print_asymmetry_table(reasoning_asymmetry)
else:
    print("No flips found.")

In [None]:
# Reasoning models transition flow
print("\n" + "="*80)
print("TRANSITION FLOW (Reasoning Models): Where do verdicts go?")
print("="*80)

for base_v in VALID_VERDICTS:
    base_flips = reasoning_flips[reasoning_flips['base_verdict'] == base_v]
    n_base = len(base_flips)
    
    if n_base == 0:
        continue
    
    print(f"\n{base_v} ({SHORT_NAMES[base_v]}) → (n={n_base} flips)")
    for target_v in VALID_VERDICTS:
        if target_v != base_v:
            count = len(base_flips[base_flips['standardized_judgment'] == target_v])
            pct = count / n_base * 100 if n_base > 0 else 0
            bar = '█' * int(pct / 2)
            print(f"  → {SHORT_NAMES[target_v]:3} ({target_v:17}): {count:5} ({pct:5.1f}%) {bar}")

---
## 5. Comprehensive Comparison

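Summary comparison across all four datasets (content perturbations, entropy data, protocol tests, reasoning models): the share of flips that preserve versus reverse OP blame status, and the net direction of the reversals.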
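
For orientation, the `Preserved %`, `Reversed %`, and `Net Toward Blame` columns compiled below can be read roughly as follows. This is a minimal sketch, assuming flips are available as `(base_verdict, new_verdict)` pairs; `summarize_op_status_sketch` and `IMPLICATED` are hypothetical names used only for illustration and are not the helpers defined in the earlier cells of this notebook.

```python
# Hypothetical illustration only; the notebook's real helpers live in earlier cells.
IMPLICATED = {'Self_At_Fault', 'All_At_Fault'}  # YTA and ESH both implicate OP


def summarize_op_status_sketch(flips):
    """Summarize (base_verdict, new_verdict) pairs where the verdict changed."""
    flips = list(flips)
    n = len(flips)
    # A flip reverses OP status when exactly one side of it implicates OP.
    crossing = [(b, t) for b, t in flips if (b in IMPLICATED) != (t in IMPLICATED)]
    newly_blamed = sum(1 for _, t in crossing if t in IMPLICATED)
    exonerated = len(crossing) - newly_blamed
    reversed_pct = len(crossing) / n * 100 if n else 0.0
    return {
        'n_flips': n,
        'preserved_pct': 100 - reversed_pct if n else 0.0,
        'reversed_pct': reversed_pct,
        'net_toward_blame': newly_blamed - exonerated,
    }


example = summarize_op_status_sketch([
    ('Other_At_Fault', 'Self_At_Fault'),   # reversed, OP newly blamed
    ('Self_At_Fault', 'All_At_Fault'),     # preserved, both implicate OP
    ('All_At_Fault', 'No_One_At_Fault'),   # reversed, OP exonerated
    ('Other_At_Fault', 'All_At_Fault'),    # reversed, OP newly blamed
])
print(example)
```

A positive `net_toward_blame` means more reversals moved OP into the implicated group than out of it; status-preserved flips do not contribute to the net.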

In [None]:
# Compile summary statistics using OP status framework
summary_data = []

# Content perturbations
if content_result:
    summary_data.append({
        'Dataset': 'Content Perturbations',
        'N Flips': content_result['n_flips'],
        'Preserved %': content_result['preserved_pct'],
        'Reversed %': content_result['reversed_pct'],
        'Net Toward Blame': content_result['net_toward_blame']
    })

# Entropy data
if entropy_result:
    summary_data.append({
        'Dataset': 'Entropy (run1 vs majority)',
        'N Flips': entropy_result['n_flips'],
        'Preserved %': entropy_result['preserved_pct'],
        'Reversed %': entropy_result['reversed_pct'],
        'Net Toward Blame': entropy_result['net_toward_blame']
    })

# Protocol tests
if protocol_result:
    summary_data.append({
        'Dataset': 'Protocol Tests',
        'N Flips': protocol_result['n_flips'],
        'Preserved %': protocol_result['preserved_pct'],
        'Reversed %': protocol_result['reversed_pct'],
        'Net Toward Blame': protocol_result['net_toward_blame']
    })

# Reasoning models
if reasoning_result:
    summary_data.append({
        'Dataset': 'Reasoning Models',
        'N Flips': reasoning_result['n_flips'],
        'Preserved %': reasoning_result['preserved_pct'],
        'Reversed %': reasoning_result['reversed_pct'],
        'Net Toward Blame': reasoning_result['net_toward_blame']
    })

summary_df = pd.DataFrame(summary_data)
print("="*80)
print("SUMMARY COMPARISON (OP Blame Status Framework)")
print("="*80)
print("\nOP Status Preserved = transition within {YTA,ESH} or {NTA,NAH}")
print("OP Status Reversed = transition crosses blame-status boundary\n")
print(summary_df.to_string(index=False))

In [None]:
# Visualization: OP status by dataset
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Panel A: Preserved vs Reversed
ax = axes[0]
x = range(len(summary_df))
width = 0.35

bars1 = ax.bar([i - width/2 for i in x], summary_df['Preserved %'], width, 
               label='Status Preserved', color='#2ecc71')
bars2 = ax.bar([i + width/2 for i in x], summary_df['Reversed %'], width, 
               label='Status Reversed', color='#e74c3c')

ax.set_ylabel('Percentage of Flips')
ax.set_title('A. OP Blame Status Preservation')
ax.set_xticks(x)
ax.set_xticklabels(summary_df['Dataset'], rotation=15, ha='right')
ax.set_ylim(0, 80)
ax.axhline(50, color='gray', linestyle='--', alpha=0.5, label='50%')
ax.legend()  # called after axhline so the 50% reference line is included in the legend

for bars in [bars1, bars2]:
    for bar in bars:
        height = bar.get_height()
        ax.annotate(f'{height:.1f}%',
                    xy=(bar.get_x() + bar.get_width() / 2, height),
                    xytext=(0, 3),
                    textcoords="offset points",
                    ha='center', va='bottom', fontsize=9)

# Panel B: Net flow toward blame
ax = axes[1]
colors = ['#e74c3c' if net > 0 else '#2ecc71' for net in summary_df['Net Toward Blame']]
bars = ax.barh(range(len(summary_df)), summary_df['Net Toward Blame'], color=colors)
ax.set_yticks(range(len(summary_df)))
ax.set_yticklabels(summary_df['Dataset'])
ax.axvline(0, color='black', linewidth=0.5)
ax.set_xlabel('Net Transitions (+ = toward blame, - = toward exoneration)')
ax.set_title('B. Net Direction of Verdict Changes')

for i, (bar, val) in enumerate(zip(bars, summary_df['Net Toward Blame'])):
    ax.annotate(f'{val:+d}',
                xy=(val, i),
                xytext=(5 if val >= 0 else -5, 0),
                textcoords="offset points",
                ha='left' if val >= 0 else 'right', va='center', fontsize=10)

plt.tight_layout()
FIGURES_DIR.mkdir(parents=True, exist_ok=True)  # ensure the output directory exists before saving
plt.savefig(FIGURES_DIR / 'verdict_transitions_op_status.png', dpi=150, bbox_inches='tight')
plt.show()

In [None]:
# Print key findings
print("="*80)
print("KEY FINDINGS (OP Blame Status Framework)")
print("="*80)

print("""
OP BLAME STATUS FRAMEWORK:
==========================

Narrator-implicated (OP blamed): Self_At_Fault (YTA), All_At_Fault (ESH)
Narrator-exonerated (OP not blamed): Other_At_Fault (NTA), No_One_At_Fault (NAH)

STATUS PRESERVED: Transition stays within same blame-status group
  - YTA <-> ESH (both implicate OP)
  - NTA <-> NAH (both exonerate OP)
  
STATUS REVERSED: Transition crosses blame-status boundary
  - Any transition between {YTA,ESH} and {NTA,NAH}
  - These are complete culpability reversals

KEY OBSERVATIONS:

1. CONTENT PERTURBATIONS show ~58% status-reversed transitions - a substantial
   rate of complete culpability reversals that cannot be attributed to 
   fine-grained boundary uncertainty.

2. DIRECTIONAL ASYMMETRY by perturbation type:
   - push_yta: WORKS AS INTENDED (net flow toward blame)
   - push_nta: BACKFIRES! (self-justification increases blame)
   - presentation: Variable, third-person shifts toward blame
   - robustness: Near-balanced (functions as noise)

3. PROTOCOL TESTS show SYSTEMATIC EXONERATION:
   - All protocol changes reduce narrator blame
   - Unstructured shows largest exoneration bias
   - The "moral judge" persona is constructed by forced-choice scaffolding

4. MODEL DIFFERENCES:
   - GPT-4.1: Most status-preserved (72%), coherent transitions
   - Qwen-2.5: Mostly status-reversed (78%), binary verdict style
   
5. TWO DISTINCT MECHANISMS:
   - Content perturbations operate through CREDIBILITY HEURISTICS
     (self-blame trusted, self-justification triggers skepticism)
   - Protocol perturbations operate through SCAFFOLDING
     (forced-choice elicits judgment, open-ended elicits advice)
""")

In [None]:
# Compare asymmetry patterns
print("\n" + "="*80)
print("ASYMMETRY COMPARISON")
print("="*80)

print("\nCONTENT PERTURBATIONS:")
print_asymmetry_table(content_asymmetry)

print("\nPROTOCOL TESTS:")
print_asymmetry_table(protocol_asymmetry)

if len(reasoning_flips) > 0:
    print("\nREASONING MODELS:")
    print_asymmetry_table(reasoning_asymmetry)

---
## 6. Directional Analysis by Perturbation Type

This section analyzes the direction of OP blame status changes, revealing systematic patterns that vary by perturbation type.
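
The direction labels counted in the cells below (`OP_newly_blamed`, `OP_exonerated`, `OP_stays_blamed`, `OP_stays_exonerated`) follow from the two blame-status groups. The sketch below shows one way such a labelling and the resulting net count could be computed. It is an assumption about the behaviour of `get_op_direction`, which is defined earlier in the notebook; `op_direction_sketch` and `net_direction_sketch` are hypothetical names introduced here.

```python
from collections import Counter

# Hypothetical illustration; assumed to mirror the notebook's get_op_direction.
IMPLICATED = {'Self_At_Fault', 'All_At_Fault'}  # YTA and ESH both implicate OP


def op_direction_sketch(base_verdict, new_verdict):
    """Label a transition by how OP's blame status changes."""
    base_blamed = base_verdict in IMPLICATED
    new_blamed = new_verdict in IMPLICATED
    if base_blamed and new_blamed:
        return 'OP_stays_blamed'
    if not base_blamed and not new_blamed:
        return 'OP_stays_exonerated'
    return 'OP_newly_blamed' if new_blamed else 'OP_exonerated'


def net_direction_sketch(flips):
    """Return (label counts, net count, direction) for (base, new) pairs."""
    counts = Counter(op_direction_sketch(b, t) for b, t in flips)
    net = counts['OP_newly_blamed'] - counts['OP_exonerated']
    if net > 0:
        direction = 'toward blame'
    elif net < 0:
        direction = 'toward exoneration'
    else:
        direction = 'balanced'
    return counts, net, direction


counts, net, direction = net_direction_sketch([
    ('Other_At_Fault', 'Self_At_Fault'),     # NTA -> YTA
    ('No_One_At_Fault', 'All_At_Fault'),     # NAH -> ESH
    ('Self_At_Fault', 'Other_At_Fault'),     # YTA -> NTA
    ('Other_At_Fault', 'No_One_At_Fault'),   # NTA -> NAH
])
print(dict(counts), net, direction)
```

Only the two crossing labels enter the net; the "stays" labels record flips that change the verdict without changing whether OP is blamed.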

In [None]:
# Analyze OP direction by perturbation category (Content)
print("="*90)
print("CONTENT PERTURBATIONS: OP DIRECTION BY CATEGORY")
print("="*90)

# Define detailed perturbation categories
detailed_categories = {
    'push_yta': ['push_yta_pattern_admission', 'push_yta_self_condemning', 'push_yta_social_proof'],
    'push_nta': ['push_nta_self_justifying', 'push_nta_social_proof', 'push_nta_victim_pattern'],
    'presentation': ['firstperson_atfault', 'thirdperson'],
    'robustness': ['add_extraneous_detail', 'change_trivial_detail', 'remove_sentence']
}

for cat_name, cat_types in detailed_categories.items():
    cat_flips = content_flips[content_flips['perturbation_type'].isin(cat_types)].copy()
    
    if len(cat_flips) == 0:
        continue
        
    cat_flips['op_direction'] = cat_flips.apply(
        lambda r: get_op_direction(r['base_verdict'], r['standardized_judgment']), axis=1
    )
    
    dir_counts = cat_flips['op_direction'].value_counts()
    newly_blamed = dir_counts.get('OP_newly_blamed', 0)
    exonerated = dir_counts.get('OP_exonerated', 0)
    net = newly_blamed - exonerated
    
    print(f"\n{cat_name.upper()} (n={len(cat_flips)} flips)")
    print(f"  OP newly blamed:     {newly_blamed:5}")
    print(f"  OP exonerated:       {exonerated:5}")
    print(f"  OP stays blamed:     {dir_counts.get('OP_stays_blamed', 0):5}")
    print(f"  OP stays exonerated: {dir_counts.get('OP_stays_exonerated', 0):5}")
    print(f"  NET: {net:+5} ({'toward blame' if net > 0 else 'toward exoneration' if net < 0 else 'balanced'})")

In [None]:
# Detailed breakdown by individual perturbation type
print("\n" + "="*90)
print("INDIVIDUAL PERTURBATION TYPE: OP DIRECTION ANALYSIS")
print("="*90)

individual_results = compute_op_direction_by_category(
    content_flips, 'base_verdict', 'standardized_judgment', 'perturbation_type'
)

print(f"\n{'Perturbation':<35} {'Flips':>6} {'→Blame':>7} {'→Exon':>7} {'Net':>7} {'Direction':<18}")
print("-" * 90)
for _, row in individual_results.sort_values('net', ascending=False).iterrows():
    print(f"{row['category']:<35} {row['n_flips']:>6} {row['newly_blamed']:>7} {row['exonerated']:>7} {row['net']:>+7} {row['direction']:<18}")

In [None]:
# Protocol Tests: OP direction by protocol
print("\n" + "="*90)
print("PROTOCOL TESTS: OP DIRECTION BY PROTOCOL")
print("="*90)

for proto in ['explanation_first', 'system_prompt', 'unstructured']:
    proto_flips = protocol_flips[protocol_flips['protocol'] == proto].copy()
    
    if len(proto_flips) == 0:
        continue
    
    proto_flips['op_direction'] = proto_flips.apply(
        lambda r: get_op_direction(r['main_study_verdict'], r['standardized_judgment']), axis=1
    )
    
    dir_counts = proto_flips['op_direction'].value_counts()
    newly_blamed = dir_counts.get('OP_newly_blamed', 0)
    exonerated = dir_counts.get('OP_exonerated', 0)
    net = newly_blamed - exonerated
    
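    # Express the imbalance as a ratio, guarding against a zero count on either side
    if newly_blamed == exonerated:
        ratio_str = "balanced"
    elif newly_blamed > exonerated:
        ratio_str = f"{newly_blamed / exonerated:.1f}x toward blame" if exonerated > 0 else "all toward blame"
    else:
        ratio_str = f"{exonerated / newly_blamed:.1f}x toward exoneration" if newly_blamed > 0 else "all toward exoneration"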
    
    print(f"\n{proto.upper()} (n={len(proto_flips)} flips)")
    print(f"  OP newly blamed:     {newly_blamed:5}")
    print(f"  OP exonerated:       {exonerated:5}")
    print(f"  OP stays blamed:     {dir_counts.get('OP_stays_blamed', 0):5}")
    print(f"  OP stays exonerated: {dir_counts.get('OP_stays_exonerated', 0):5}")
    print(f"  NET: {net:+5}  ({ratio_str})")

In [None]:
# Reasoning Models: OP direction by model
print("\n" + "="*90)
print("REASONING MODELS: OP DIRECTION BY MODEL")
print("="*90)

for model in sorted(reasoning_df['model'].unique()):
    model_flips = reasoning_flips[reasoning_flips['model'] == model].copy()
    if len(model_flips) == 0:
        continue
    
    model_flips['op_direction'] = model_flips.apply(
        lambda r: get_op_direction(r['base_verdict'], r['standardized_judgment']), axis=1
    )
    
    dir_counts = model_flips['op_direction'].value_counts()
    newly_blamed = dir_counts.get('OP_newly_blamed', 0)
    exonerated = dir_counts.get('OP_exonerated', 0)
    net = newly_blamed - exonerated
    
    print(f"\n{model} (n={len(model_flips)} flips)")
    print(f"  OP newly blamed:     {newly_blamed:5}")
    print(f"  OP exonerated:       {exonerated:5}")
    print(f"  OP stays blamed:     {dir_counts.get('OP_stays_blamed', 0):5}")
    print(f"  OP stays exonerated: {dir_counts.get('OP_stays_exonerated', 0):5}")
    print(f"  NET: {net:+5} ({'toward blame' if net > 0 else 'toward exoneration' if net < 0 else 'balanced'})")

In [None]:
# Visualization: OP direction comparison
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Content perturbations by category
ax = axes[0]
cat_results = []
for cat_name, cat_types in detailed_categories.items():
    cat_flips = content_flips[content_flips['perturbation_type'].isin(cat_types)].copy()
    if len(cat_flips) == 0:
        continue
    cat_flips['op_direction'] = cat_flips.apply(
        lambda r: get_op_direction(r['base_verdict'], r['standardized_judgment']), axis=1
    )
    dir_counts = cat_flips['op_direction'].value_counts()
    cat_results.append({
        'category': cat_name,
        'newly_blamed': dir_counts.get('OP_newly_blamed', 0),
        'exonerated': dir_counts.get('OP_exonerated', 0)
    })
cat_df = pd.DataFrame(cat_results)

x = range(len(cat_df))
width = 0.35
ax.bar([i - width/2 for i in x], cat_df['newly_blamed'], width, label='OP Newly Blamed', color='#e74c3c')
ax.bar([i + width/2 for i in x], cat_df['exonerated'], width, label='OP Exonerated', color='#2ecc71')
ax.set_xticks(x)
ax.set_xticklabels(cat_df['category'], rotation=15, ha='right')
ax.set_ylabel('Count')
ax.set_title('A. Content Perturbations')
ax.legend()

# Protocol tests
ax = axes[1]
proto_results = []
for proto in ['explanation_first', 'system_prompt', 'unstructured']:
    proto_flips = protocol_flips[protocol_flips['protocol'] == proto].copy()
    if len(proto_flips) == 0:
        continue
    proto_flips['op_direction'] = proto_flips.apply(
        lambda r: get_op_direction(r['main_study_verdict'], r['standardized_judgment']), axis=1
    )
    dir_counts = proto_flips['op_direction'].value_counts()
    proto_results.append({
        'protocol': proto.replace('_', '\n'),
        'newly_blamed': dir_counts.get('OP_newly_blamed', 0),
        'exonerated': dir_counts.get('OP_exonerated', 0)
    })
proto_df = pd.DataFrame(proto_results)

x = range(len(proto_df))
ax.bar([i - width/2 for i in x], proto_df['newly_blamed'], width, label='OP Newly Blamed', color='#e74c3c')
ax.bar([i + width/2 for i in x], proto_df['exonerated'], width, label='OP Exonerated', color='#2ecc71')
ax.set_xticks(x)
ax.set_xticklabels(proto_df['protocol'])
ax.set_ylabel('Count')
ax.set_title('B. Protocol Tests')
ax.legend()

# Net direction comparison
ax = axes[2]
all_results = []
for cat_name, cat_types in detailed_categories.items():
    cat_flips = content_flips[content_flips['perturbation_type'].isin(cat_types)].copy()
    if len(cat_flips) == 0:
        continue
    cat_flips['op_direction'] = cat_flips.apply(
        lambda r: get_op_direction(r['base_verdict'], r['standardized_judgment']), axis=1
    )
    dir_counts = cat_flips['op_direction'].value_counts()
    net = dir_counts.get('OP_newly_blamed', 0) - dir_counts.get('OP_exonerated', 0)
    all_results.append({'source': f'C: {cat_name}', 'net': net})

for proto in ['explanation_first', 'system_prompt', 'unstructured']:
    proto_flips = protocol_flips[protocol_flips['protocol'] == proto].copy()
    if len(proto_flips) == 0:
        continue
    proto_flips['op_direction'] = proto_flips.apply(
        lambda r: get_op_direction(r['main_study_verdict'], r['standardized_judgment']), axis=1
    )
    dir_counts = proto_flips['op_direction'].value_counts()
    net = dir_counts.get('OP_newly_blamed', 0) - dir_counts.get('OP_exonerated', 0)
    all_results.append({'source': f'P: {proto[:8]}', 'net': net})

all_df = pd.DataFrame(all_results)
colors = ['#e74c3c' if net > 0 else '#2ecc71' for net in all_df['net']]
ax.barh(range(len(all_df)), all_df['net'], color=colors)
ax.set_yticks(range(len(all_df)))
ax.set_yticklabels(all_df['source'])
ax.axvline(0, color='black', linewidth=0.5)
ax.set_xlabel('Net (+ = toward blame)')
ax.set_title('C. Net Direction')

plt.tight_layout()
FIGURES_DIR.mkdir(parents=True, exist_ok=True)  # ensure the output directory exists before saving
plt.savefig(FIGURES_DIR / 'verdict_transitions_op_direction.png', dpi=150, bbox_inches='tight')
plt.show()

In [None]:
# Summary: Key Insights from Directional Analysis
print("="*90)
print("KEY INSIGHTS: DIRECTIONAL PATTERNS IN VERDICT TRANSITIONS")
print("="*90)

print("""
1. CONTENT PERTURBATIONS: CREDIBILITY HEURISTIC
   ============================================
   
   Category          | Net Direction    | Interpretation
   ------------------|------------------|--------------------------------------------
   push_yta          | → TOWARD BLAME   | WORKS AS INTENDED (self-blame cues trusted)
   push_nta          | → TOWARD BLAME   | BACKFIRES! (self-justification increases blame)
   presentation      | → TOWARD BLAME   | Third-person depersonalization harshens judgment
   robustness        | → TOWARD BLAME   | Slight bias (should be balanced)
   
   KEY FINDING: Push-NTA perturbations BACKFIRE!
   - self_justifying: Net +191 toward blame instead of exoneration
   - social_proof: Barely effective
   - victim_pattern: Works somewhat
   
   This supports the "credibility heuristic" interpretation: self-criticism 
   is trusted, self-justification triggers skepticism.

2. PROTOCOL PERTURBATIONS: SCAFFOLDING EFFECT
   ==========================================
   
   Protocol          | Net Direction         | Interpretation
   ------------------|----------------------|----------------------------------
   explanation_first | → TOWARD EXONERATION | Deliberation softens judgment
   system_prompt     | → TOWARD EXONERATION | Structure change reduces blame
   unstructured      | → TOWARD EXONERATION | Massive exoneration bias
   
   KEY FINDING: Protocol changes SYSTEMATICALLY reduce narrator blame.
   - The unstructured protocol almost NEVER increases blame
   - This is NOT random noise - it's a directional bias
   - Removing judgment scaffolding = exonerating the narrator

3. THEORETICAL IMPLICATIONS
   ========================
   
   TWO DISTINCT MECHANISMS:
   
   A) Content perturbations operate through CREDIBILITY HEURISTICS:
      - Self-blame cues are trusted → blame increases (intended)
      - Self-justification backfires → blame increases (unintended)
      - Third-person narration provides "distance" → blame increases
   
   B) Protocol perturbations operate through SCAFFOLDING:
      - Forced-choice elicits judgment → more blame
      - Open-ended elicits advice/support → less blame
      - The "moral judge" persona is CONSTRUCTED by the protocol

4. IMPLICATIONS FOR PAPER FRAMING
   ================================
   
   The ~58% status-reversed transition rate is NOT evidence of random 
   incoherence - it's evidence of SYSTEMATIC DIRECTIONAL BIAS that varies
   by perturbation type:
   
   - Content: Credibility heuristics (self-justification backfires)
   - Protocol: Scaffolding effects (removing forced-choice exonerates)
   
   This reframes "instability" as "predictable directional sensitivity"
   with different mechanisms for different perturbation families.
""")