# Combined Figure: Verdict Instability + Epistemic Stance

Two-panel figure combining:
- **Panel A**: Flip rates by base verdict category and perturbation family
- **Panel B**: Net epistemic stance change by perturbation type (with 95% CIs)

**Purpose**: Shows both verdict-level instability patterns and how explanatory language shifts under perturbations.

In [77]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from scipy.stats import norm
from pathlib import Path
import re

%matplotlib inline

In [78]:
# Publication settings
plt.rcParams.update({
    'font.family': 'serif',
    'font.size': 11,
    'axes.labelsize': 11,
    'axes.titlesize': 12,
    'xtick.labelsize': 10,
    'ytick.labelsize': 10,
    'legend.fontsize': 9,
    'pdf.fonttype': 42,
    'ps.fonttype': 42,
})

## Define Epistemic Markers

In [79]:
# Epistemic stance lexicon inspired by LIWC2015 tentative/certainty categories,
# restricted to high-precision epistemic markers (propositional uncertainty,
# not politeness/suggestion hedging like "you could try...")
EPISTEMIC_HEDGES = [
    'seem*', 'appear*',           # Evidential verbs (note: "appear" has perceptual polysemy)
    'might', 'could', 'may',      # Epistemic modals (note: "may" has month/permission ambiguity)
    'perhaps', 'possibly', 'maybe',  # Epistemic adverbs
    'probably', 'likely',         # Probability markers
    'unclear', 'uncertain', 'unsure',  # Explicit uncertainty
    'guess*',                     # LIWC tentat exemplar
]

EPISTEMIC_BOOSTERS = [
    'clearly', 'obviously',       # Evidential emphasis
    'definitely', 'certainly',    # Certainty adverbs
    'undoubtedly', 'unquestionably',  # Strong certainty
    'absolutely',                 # Emphasis
    'always', 'never',            # Categorical terms (LIWC certitude anchors)
    'sure',                       # Confidence marker
]

NEGATION_WORDS = ['not', 'no', "n't", 'never', 'neither', 'nor']

def term_to_pattern(term: str) -> str:
    """Convert term to regex pattern."""
    term = term.lower()
    if term.endswith('*'):
        return r'\b' + re.escape(term[:-1]) + r'\w*\b'
    return r'\b' + re.escape(term) + r'\b'

def is_negated(text: str, match_start: int, window: int = 3) -> bool:
    """Return True if a negation word occurs among the `window` whitespace-separated tokens before the match."""
    before = text[:match_start].lower().split()[-window:]
    return any(w in NEGATION_WORDS or "n't" in w for w in before)

def count_markers(text: str, markers: list) -> int:
    """Count marker occurrences, excluding negated instances."""
    if not text or pd.isna(text):
        return 0
    text_lower = text.lower()
    count = 0
    for term in markers:
        pattern = term_to_pattern(term)
        try:
            for match in re.finditer(pattern, text_lower):
                if not is_negated(text_lower, match.start()):
                    count += 1
        except re.error:
            continue
    return count

def compute_net_epistemic(text: str) -> float:
    """Compute net epistemic stance per 100 words."""
    if not text or pd.isna(text):
        return 0.0
    words = len(text.split())
    if words == 0:
        return 0.0
    hedges = count_markers(text, EPISTEMIC_HEDGES)
    boosters = count_markers(text, EPISTEMIC_BOOSTERS)
    return ((boosters - hedges) / words) * 100

print(f"Hedges: {len(EPISTEMIC_HEDGES)}, Boosters: {len(EPISTEMIC_BOOSTERS)}")

Hedges: 14, Boosters: 10
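
To make the matching rules concrete, the sketch below re-implements the wildcard-to-regex conversion and the three-token negation window on a toy sentence. It is a standalone illustration rather than the notebook's pipeline: the helper names `wildcard_to_regex`, `negated_before`, and `count_unnegated`, the short term lists, and the sample sentence are all assumptions made for this example.

```python
import re

NEGATIONS = {"not", "no", "never", "neither", "nor"}

def wildcard_to_regex(term: str) -> str:
    """Turn 'seem*' into a prefix pattern and 'clearly' into a whole-word pattern."""
    term = term.lower()
    if term.endswith("*"):
        return r"\b" + re.escape(term[:-1]) + r"\w*\b"
    return r"\b" + re.escape(term) + r"\b"

def negated_before(text: str, start: int, window: int = 3) -> bool:
    """True if one of the `window` whitespace tokens before `start` is a negation."""
    preceding = text[:start].split()[-window:]
    return any(tok in NEGATIONS or "n't" in tok for tok in preceding)

def count_unnegated(text: str, terms: list[str]) -> int:
    """Count matches of any term that are not preceded by a negation."""
    text = text.lower()
    return sum(
        1
        for term in terms
        for m in re.finditer(wildcard_to_regex(term), text)
        if not negated_before(text, m.start())
    )

# Hypothetical sentence, written only to exercise both rules.
sample = "It seems fair, but she is clearly not sure and it doesn't seem intentional."
hedges = count_unnegated(sample, ["seem*"])              # "seems" counts; "doesn't seem" is negated
boosters = count_unnegated(sample, ["clearly", "sure"])  # "clearly" counts; "not sure" is negated
print(hedges, boosters)
```

Because the window is built from whitespace-split tokens, a negation with attached punctuation (for example `not,`) is not recognized, and the window can reach back across a sentence boundary.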


## Load Data

In [80]:
# Load flip rates data for Panel A
results_dir = Path('../../results/analysis')
flip_rates = pd.read_csv(results_dir / 'flip_rates_by_base_verdict.csv')
print(f"Flip rates: {len(flip_rates)} rows")

# Load master data for Panel B (to compute CIs)
print("Loading master parquet...")
df = pd.read_parquet('../../final_results/parquet/master_final_model_own_baseline.parquet')
print(f"Master data: {len(df):,} rows")

Flip rates: 15 rows
Loading master parquet...


Master data: 164,424 rows


## Prepare Panel A Data (Flip Rates)

In [None]:
# Prepare data for Panel A
pivot_flip = flip_rates.pivot(
    index='base_verdict',
    columns='perturbation_family',
    values='flip_rate'
)

# Exclude "Unclear" verdict
if 'Unclear' in pivot_flip.index:
    pivot_flip = pivot_flip.drop('Unclear')

# Order by average flip rate (descending)
pivot_flip['Average'] = pivot_flip.mean(axis=1)
pivot_flip = pivot_flip.sort_values('Average', ascending=False)
pivot_flip = pivot_flip[['Presentation', 'Psychological', 'Robustness']]

# Rename columns to match paper terminology
pivot_flip = pivot_flip.rename(columns={
    'Presentation': 'Point-of-view',
    'Psychological': 'Persuasion',
    'Robustness': 'Surface'
})

# Rename verdict labels to display names (underscores replaced with spaces)
verdict_labels = {
    'No_One_At_Fault': 'No One At Fault',
    'All_At_Fault': 'All At Fault',
    'Self_At_Fault': 'Self At Fault',
    'Other_At_Fault': 'Other At Fault'
}
pivot_flip.index = [verdict_labels.get(v, v) for v in pivot_flip.index]

print("Flip Rates by Base Verdict (%):\n")
print(pivot_flip.round(2))

## Prepare Panel B Data (Epistemic Stance with CIs)

In [82]:
# Compute epistemic stance for all explanations
print("Computing epistemic stance (this may take 1-2 minutes)...")
df['net_epistemic'] = df['explanation'].apply(compute_net_epistemic)
print("Done!")

Computing epistemic stance (this may take 1-2 minutes)...
Done!
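
For reference, the score computed by `compute_net_epistemic` can be written as below. Here $B(t)$ and $H(t)$ are the non-negated booster and hedge counts in explanation $t$, and $W(t)$ is its whitespace-token count. These symbols are notation introduced for this note, not taken from the paper.

```latex
\mathrm{net}(t) = 100 \cdot \frac{B(t) - H(t)}{W(t)},
\qquad \mathrm{net}(t) = 0 \ \text{ if } W(t) = 0
```

Positive values mean more boosters than hedges per 100 words; negative values mean a more hedged explanation.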


In [None]:
# Get baseline data
baseline = df[df['perturbation_type'] == 'none'].copy()
baseline_rates = baseline.groupby(['id', 'model', 'run_number']).agg({
    'net_epistemic': 'first'
}).reset_index()
baseline_rates.columns = ['id', 'model', 'run_number', 'base_epistemic']

# Perturbation config - using paper terminology for categories
PERTURBATION_CONFIG = {
    'push_yta_social_proof': {'name': 'Social proof (against)', 'category': 'Persuasion'},
    'push_yta_pattern_admission': {'name': 'Pattern admission', 'category': 'Persuasion'},
    'push_yta_self_condemning': {'name': 'Self-condemning', 'category': 'Persuasion'},
    'change_trivial_detail': {'name': 'Change trivial detail', 'category': 'Surface'},
    'add_extraneous_detail': {'name': 'Add extraneous detail', 'category': 'Surface'},
    'remove_sentence': {'name': 'Remove sentence', 'category': 'Surface'},
    'push_nta_victim_pattern': {'name': 'Victim pattern', 'category': 'Persuasion'},
    'push_nta_self_justifying': {'name': 'Self-justifying', 'category': 'Persuasion'},
    'push_nta_social_proof': {'name': 'Social proof (for)', 'category': 'Persuasion'},
    'firstperson_atfault': {'name': 'First-person', 'category': 'Point-of-view'},
    'thirdperson': {'name': 'Third-person', 'category': 'Point-of-view'},
}

perturbations = df[df['perturbation_type'].isin(PERTURBATION_CONFIG.keys())].copy()
print(f"Perturbation records: {len(perturbations):,}")

In [84]:
# Merge and compute deltas
merged = perturbations.merge(baseline_rates, on=['id', 'model', 'run_number'], how='inner')
merged['net_delta'] = merged['net_epistemic'] - merged['base_epistemic']
print(f"Matched pairs: {len(merged):,}")

Matched pairs: 129,156
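
The pairing step and the per-perturbation interval can be illustrated on toy data. The sketch below assumes a small hand-made frame with the same key columns (`id`, `model`, `run_number`); every value in it is invented. It also passes `validate="many_to_one"` to `merge`, which the cell above does not do, so that pandas raises an error if a baseline key is duplicated, and it uses the standard library's `NormalDist` in place of `scipy.stats.norm` to keep the example light.

```python
from statistics import NormalDist

import numpy as np
import pandas as pd

keys = ["id", "model", "run_number"]

# Toy baseline: exactly one row per (id, model, run_number). Values are made up.
baseline = pd.DataFrame({
    "id": [1, 2], "model": ["m", "m"], "run_number": [1, 1],
    "base_epistemic": [0.50, -0.20],
})

# Toy perturbed rows: two perturbations per baseline key.
perturbed = pd.DataFrame({
    "id": [1, 1, 2, 2], "model": ["m"] * 4, "run_number": [1] * 4,
    "perturbation_type": ["thirdperson", "remove_sentence"] * 2,
    "net_epistemic": [0.70, 0.50, 0.00, -0.30],
})

# validate="many_to_one" raises MergeError if baseline keys are not unique.
pairs = perturbed.merge(baseline, on=keys, how="inner", validate="many_to_one")
pairs["net_delta"] = pairs["net_epistemic"] - pairs["base_epistemic"]

# Mean delta per perturbation with a normal-approximation 95% half-width.
z = NormalDist().inv_cdf(0.975)  # about 1.96
summary = pairs.groupby("perturbation_type")["net_delta"].agg(["mean", "std", "count"])
summary["ci"] = z * summary["std"] / np.sqrt(summary["count"])
print(summary.round(3))
```

The half-width treats rows as independent observations, as the notebook's aggregation does; it does not account for repeated measurements on the same `id`.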


In [None]:
# Aggregate by perturbation type WITH standard errors
z = norm.ppf(0.975)  # 1.96 for 95% CI

results = []
for pert_type, config in PERTURBATION_CONFIG.items():
    pert_data = merged[merged['perturbation_type'] == pert_type]
    if len(pert_data) == 0:
        continue
    
    mean_delta = pert_data['net_delta'].mean()
    std_delta = pert_data['net_delta'].std()
    n = len(pert_data)
    se = std_delta / np.sqrt(n)
    ci = z * se
    
    results.append({
        'perturbation': config['name'],
        'category': config['category'],
        'net_delta': mean_delta,
        'std': std_delta,
        'se': se,
        'ci': ci,
        'flip_pct': pert_data['verdict_flipped'].mean() * 100,
        'n': n
    })

epistemic_data = pd.DataFrame(results)

# Sort by category then net_delta - using paper terminology
category_order = ['Persuasion', 'Surface', 'Point-of-view']
epistemic_data['category'] = pd.Categorical(
    epistemic_data['category'], 
    categories=category_order, 
    ordered=True
)
epistemic_data = epistemic_data.sort_values(['category', 'net_delta'], ascending=[True, True])

print("Epistemic Stance by Perturbation (with 95% CI):\n")
print(epistemic_data[['perturbation', 'category', 'net_delta', 'ci', 'flip_pct', 'n']].to_string(index=False))

## Create Combined Figure

In [None]:
# Create combined two-panel figure
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5), dpi=150)

# =============================================================================
# SHARED STYLING - consistent across both panels
# =============================================================================
CATEGORY_COLORS = {
    'Point-of-view': '#222222',  # Black
    'Persuasion': '#888888',      # Gray
    'Surface': '#dddddd'          # Light Gray
}

# Shared bar/error styling
BAR_EDGECOLOR = '#222222'
BAR_LINEWIDTH = 0.5
ERROR_CAPSIZE = 3
ERROR_LINEWIDTH = 1.2

# Legend order (consistent across panels)
LEGEND_ORDER = ['Point-of-view', 'Persuasion', 'Surface']

# =============================================================================
# PANEL A: Flip Rates by Base Verdict
# =============================================================================
x = np.arange(len(pivot_flip))
width = 0.25

# Reverse mapping for error bar lookup (full label -> original CSV label)
label_to_csv = {
    'No One At Fault': 'No_One_At_Fault',
    'All At Fault': 'All_At_Fault',
    'Self At Fault': 'Self_At_Fault',
    'Other At Fault': 'Other_At_Fault'
}

col_to_csv = {
    'Point-of-view': 'Presentation',
    'Persuasion': 'Psychological',
    'Surface': 'Robustness'
}

# Plot bars in legend order
for i, col in enumerate(LEGEND_ORDER):
    offset = (i - 1) * width
    bars = ax1.bar(x + offset, pivot_flip[col], width, label=col, 
                   color=CATEGORY_COLORS[col], 
                   edgecolor=BAR_EDGECOLOR, linewidth=BAR_LINEWIDTH)

# Add confidence intervals for Panel A
for i, col in enumerate(LEGEND_ORDER):
    for j, verdict in enumerate(pivot_flip.index):
        csv_col = col_to_csv.get(col, col)
        csv_verdict = label_to_csv.get(verdict, verdict)
        mask = (flip_rates['base_verdict'] == csv_verdict) & \
               (flip_rates['perturbation_family'] == csv_col)
        row = flip_rates[mask]
        if not row.empty:
            n = row['n_total'].values[0]
            p = row['flip_rate'].values[0] / 100
            se = np.sqrt(p * (1 - p) / n) * 100
            ci = z * se
            offset = (i - 1) * width
            ax1.errorbar(j + offset, pivot_flip.loc[verdict, col], yerr=ci, 
                         fmt='none', ecolor='black', capsize=ERROR_CAPSIZE, 
                         lw=ERROR_LINEWIDTH, capthick=ERROR_LINEWIDTH, zorder=10)

ax1.set_xlabel('Base Verdict Category', fontweight='bold')
ax1.set_ylabel('Flip Rate (%)', fontweight='bold')
ax1.set_title('(A) Verdict Instability by Base Judgment', fontweight='bold', loc='left')
ax1.set_xticks(x)
ax1.set_xticklabels(pivot_flip.index, fontsize=9)
ax1.legend(loc='upper right', framealpha=0.9, bbox_to_anchor=(1, 0.97))
ax1.set_ylim(0, 65)
ax1.grid(axis='y', alpha=0.3, linestyle='--')
ax1.spines['top'].set_visible(False)
ax1.spines['right'].set_visible(False)

# =============================================================================
# PANEL B: Epistemic Stance Change (with 95% CIs)
# =============================================================================
y_positions = np.arange(len(epistemic_data))
colors_b = [CATEGORY_COLORS[cat] for cat in epistemic_data['category']]

# Horizontal bar chart with error bars
bars = ax2.barh(y_positions, epistemic_data['net_delta'], xerr=epistemic_data['ci'],
                color=colors_b, edgecolor=BAR_EDGECOLOR, linewidth=BAR_LINEWIDTH, 
                height=0.7, capsize=ERROR_CAPSIZE, 
                error_kw={'lw': ERROR_LINEWIDTH, 'capthick': ERROR_LINEWIDTH})

# Zero line
ax2.axvline(x=0, color='#222222', linewidth=0.8)

# Labels
ax2.set_yticks(y_positions)
ax2.set_yticklabels(epistemic_data['perturbation'])
ax2.set_xlabel('$\\Delta$ Net Epistemic Stance (per 100 words)', fontweight='bold')
ax2.set_title('(B) Epistemic Stance Change by Perturbation', fontweight='bold', loc='left')

# Category separators
prev_cat = None
for i, (_, row) in enumerate(epistemic_data.iterrows()):
    if row['category'] != prev_cat and prev_cat is not None:
        ax2.axhline(y=i-0.5, color='#cccccc', linewidth=0.5, linestyle='--')
    prev_cat = row['category']

# Flip rate annotations (positioned after error bars)
for i, (_, row) in enumerate(epistemic_data.iterrows()):
    x_pos = row['net_delta']
    ci = row['ci']
    # Position text after the error bar
    if x_pos >= 0:
        text_x = x_pos + ci + 0.005
        ha = 'left'
    else:
        text_x = x_pos - ci - 0.005
        ha = 'right'
    ax2.annotate(f"{row['flip_pct']:.0f}%",
                 xy=(text_x, i),
                 fontsize=8, va='center', ha=ha, color='#555555')

# Legend with shared colors (same order as Panel A)
legend_patches = [mpatches.Patch(facecolor=CATEGORY_COLORS[cat], label=cat,
                                  edgecolor=BAR_EDGECOLOR, linewidth=BAR_LINEWIDTH)
                  for cat in LEGEND_ORDER]
ax2.legend(handles=legend_patches, loc='upper right', framealpha=0.9, bbox_to_anchor=(1, 0.97))

# Styling - expand x limits to accommodate error bars and annotations
xmin = min(epistemic_data['net_delta'].min() - epistemic_data['ci'].max(), -0.06) * 1.5
xmax = max(epistemic_data['net_delta'].max() + epistemic_data['ci'].max(), 0.12) * 1.5
ax2.set_xlim(xmin, xmax)
ax2.spines['top'].set_visible(False)
ax2.spines['right'].set_visible(False)
ax2.invert_yaxis()
ax2.grid(axis='x', alpha=0.3, linestyle='--')

# Directional labels
ax2.text(xmin * 0.85, -0.65, '$\\leftarrow$ More hedged', fontsize=8, ha='left', style='italic', color='#666666')
ax2.text(xmax * 0.85, -0.65, 'More confident $\\rightarrow$', fontsize=8, ha='right', style='italic', color='#666666')

plt.tight_layout()
plt.show()

## Save Figure

In [87]:
# Save combined figure
output_dir = Path('../../figures')
output_dir.mkdir(parents=True, exist_ok=True)

pdf_path = output_dir / 'fig_instability_epistemic_combined.pdf'
png_path = output_dir / 'fig_instability_epistemic_combined.png'

fig.savefig(pdf_path, bbox_inches='tight', dpi=300)
fig.savefig(png_path, bbox_inches='tight', dpi=300)

print(f"Saved: {pdf_path}")
print(f"Saved: {png_path}")

Saved: ../../figures/fig_instability_epistemic_combined.pdf
Saved: ../../figures/fig_instability_epistemic_combined.png
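
The Panel A error bars come from a normal-approximation (Wald) interval on each flip rate. A minimal standalone sketch of that calculation follows. The function name `wald_halfwidth_pct` and the example inputs (a 40% flip rate over 1,000 comparisons) are illustrative assumptions, not values from `flip_rates_by_base_verdict.csv`.

```python
import math
from statistics import NormalDist

def wald_halfwidth_pct(flip_rate_pct: float, n_total: int, level: float = 0.95) -> float:
    """Half-width, in percentage points, of the Wald interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    p = flip_rate_pct / 100
    return z * math.sqrt(p * (1 - p) / n_total) * 100

# Hypothetical inputs chosen for illustration only.
half = wald_halfwidth_pct(40.0, 1000)
print(f"40.0% +/- {half:.2f} percentage points")
```

The Wald interval can be inaccurate for small samples or for rates near 0% or 100%; a Wilson interval is a common alternative in those cases.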


## Key Insights

In [88]:
print("KEY INSIGHTS")
print("=" * 80)

print("\nPanel A (Verdict Instability by Base Judgment):")
print("  - NAF (No One At Fault) is most unstable across all perturbation types")
print("  - OAF (Other At Fault) is most stable")
print("  - Point-of-view perturbations cause highest flip rates")
print("  - Models are LEAST stable when moral reasoning should be MOST careful")

print("\nPanel B (Epistemic Stance Change):")
print("  - Push-SAF: More hedged (-0.014 to -0.046) - uncertainty when narrator self-criticizes")
print("  - Surface: Minimal effect (-0.025 to +0.000) - semantic irrelevance")
print("  - Push-OAF: Slightly more confident (+0.013 to +0.031)")
print("  - Point-of-view: MOST confident (+0.048 to +0.082) - third-person = direct language")
print("  - Most effects are statistically significant (95% CIs exclude zero)")

print("\nLexicon (LIWC2015-inspired, high-precision):")
print("  - Hedges: seem*, appear*, might, could, may, perhaps, possibly, maybe,")
print("            probably, likely, unclear, uncertain, unsure, guess*")
print("  - Boosters: clearly, obviously, definitely, certainly, undoubtedly,")
print("              unquestionably, absolutely, always, never, sure")

print("\nCONNECTION:")
print("  - Point-of-view perturbations cause both highest flip rates AND most confident language")
print("  - Suggests frame shifts induce different 'reasoning modes' in models")
print("  - Third-person narration licenses more assertive moral assessments")

KEY INSIGHTS

Panel A (Verdict Instability by Base Judgment):
  - NAF (No One At Fault) is most unstable across all perturbation types
  - OAF (Other At Fault) is most stable
  - Point-of-view perturbations cause highest flip rates
  - Models are LEAST stable when moral reasoning should be MOST careful

Panel B (Epistemic Stance Change):
  - Push-SAF: More hedged (-0.014 to -0.046) - uncertainty when narrator self-criticizes
  - Surface: Minimal effect (-0.025 to +0.000) - semantic irrelevance
  - Push-OAF: Slightly more confident (+0.013 to +0.031)
  - Point-of-view: MOST confident (+0.048 to +0.082) - third-person = direct language
  - Most effects are statistically significant (95% CIs exclude zero)

Lexicon (LIWC2015-inspired, high-precision):
  - Hedges: seem*, appear*, might, could, may, perhaps, possibly, maybe,
            probably, likely, unclear, uncertain, unsure, guess*
  - Boosters: clearly, obviously, definitely, certainly, undoubtedly,
              unquestionab

## Alternative Panel B: Narrator Blame Direction

Instead of epistemic stance, this panel shows the **asymmetric effectiveness** of perturbations:
- **Blame direction ratio**: (transitions toward narrator blame) / (transitions toward exoneration)
- Ratio > 1 = net movement toward blaming the narrator
- Ratio < 1 = net movement toward exonerating the narrator
- Ratio ≈ 1 = balanced (perturbation has no directional effect)

**Key finding**: Push Self At Fault perturbations work as intended (ratios of roughly 4-6x toward blame), but Push Other At Fault perturbations fail or backfire: their ratios sit at about 1x or above, rather than below 1 as exoneration would require.
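
As a worked example of the ratio and an approximate interval for it, the sketch below uses the log-scale standard error $\sqrt{1/a + 1/b}$ for a ratio of two counts $a/b$. The function name `blame_ratio_ci` and the counts (40 flips toward blame, 10 toward exoneration) are hypothetical, chosen only to produce a 4x ratio.

```python
import math

def blame_ratio_ci(n_blamed: int, n_exonerated: int, z: float = 1.96):
    """Return (ratio, lower, upper); the interval is symmetric on the log scale."""
    if n_blamed <= 0 or n_exonerated <= 0:
        raise ValueError("both counts must be positive for a log-scale interval")
    ratio = n_blamed / n_exonerated
    se_log = math.sqrt(1 / n_blamed + 1 / n_exonerated)
    lower = math.exp(math.log(ratio) - z * se_log)
    upper = math.exp(math.log(ratio) + z * se_log)
    return ratio, lower, upper

# Hypothetical counts for illustration only.
ratio, lower, upper = blame_ratio_ci(40, 10)
print(f"{ratio:.2f}x (95% CI {lower:.2f} to {upper:.2f})")
```

Because the interval is built on the log scale, it is asymmetric around the ratio on a linear axis. In this toy case the lower bound stays above 1, which is what a directional effect toward blame would look like.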

In [None]:
# =============================================================================
# Compute Blame Direction Data
# =============================================================================

# Define OP blame status groups
OP_BLAMED = {'Self_At_Fault', 'All_At_Fault'}
OP_EXONERATED = {'Other_At_Fault', 'No_One_At_Fault'}

def get_op_transition(base_v, target_v):
    """Classify transition by OP blame status change."""
    if pd.isna(base_v) or pd.isna(target_v):
        return None
    base_blamed = base_v in OP_BLAMED
    target_blamed = target_v in OP_BLAMED
    
    if base_blamed == target_blamed:
        return 'preserved'  # stays in same blame-status group
    elif not base_blamed and target_blamed:
        return 'newly_blamed'  # exonerated -> blamed
    else:
        return 'exonerated'  # blamed -> exonerated

# Get baseline verdicts
baseline_verdicts = df[df['perturbation_type'] == 'none'][['id', 'model', 'run_number', 'standardized_judgment']].copy()
baseline_verdicts = baseline_verdicts.rename(columns={'standardized_judgment': 'baseline_verdict'})

# Merge with perturbations
perturbed = df[df['perturbation_type'].isin(PERTURBATION_CONFIG.keys())].copy()
merged_blame = perturbed.merge(baseline_verdicts, on=['id', 'model', 'run_number'], how='inner')

# Only look at flips
flips = merged_blame[merged_blame['verdict_flipped'] == True].copy()
flips['op_transition'] = flips.apply(
    lambda r: get_op_transition(r['baseline_verdict'], r['standardized_judgment']), axis=1
)

print(f"Total flips: {len(flips):,}")
print(f"Transition breakdown:")
print(flips['op_transition'].value_counts())

In [None]:
# =============================================================================
# Aggregate blame direction by perturbation type
# =============================================================================

# Extended config with direction info - using paper terminology
PERT_DIRECTION_CONFIG = {
    # Push Self At Fault (designed to increase narrator blame)
    'push_yta_social_proof': {'name': 'Social proof (against)', 'category': 'Persuasion: Push Self At Fault', 'intended': 'blame'},
    'push_yta_pattern_admission': {'name': 'Pattern admission', 'category': 'Persuasion: Push Self At Fault', 'intended': 'blame'},
    'push_yta_self_condemning': {'name': 'Self-condemning', 'category': 'Persuasion: Push Self At Fault', 'intended': 'blame'},
    # Push Other At Fault (designed to decrease narrator blame / exonerate)
    'push_nta_victim_pattern': {'name': 'Victim pattern', 'category': 'Persuasion: Push Other At Fault', 'intended': 'exonerate'},
    'push_nta_self_justifying': {'name': 'Self-justifying', 'category': 'Persuasion: Push Other At Fault', 'intended': 'exonerate'},
    'push_nta_social_proof': {'name': 'Social proof (for)', 'category': 'Persuasion: Push Other At Fault', 'intended': 'exonerate'},
    # Point-of-view (neutral - testing frame effects)
    'firstperson_atfault': {'name': 'First-person', 'category': 'Point-of-view', 'intended': 'neutral'},
    'thirdperson': {'name': 'Third-person', 'category': 'Point-of-view', 'intended': 'neutral'},
    # Surface (neutral - robustness test)
    'change_trivial_detail': {'name': 'Change trivial detail', 'category': 'Surface', 'intended': 'neutral'},
    'add_extraneous_detail': {'name': 'Add extraneous detail', 'category': 'Surface', 'intended': 'neutral'},
    'remove_sentence': {'name': 'Remove sentence', 'category': 'Surface', 'intended': 'neutral'},
}

# Compute blame direction stats by perturbation
blame_results = []
for pert_type, config in PERT_DIRECTION_CONFIG.items():
    pert_flips = flips[flips['perturbation_type'] == pert_type]
    if len(pert_flips) == 0:
        continue
    
    n_total = len(pert_flips)
    n_newly_blamed = (pert_flips['op_transition'] == 'newly_blamed').sum()
    n_exonerated = (pert_flips['op_transition'] == 'exonerated').sum()
    n_preserved = (pert_flips['op_transition'] == 'preserved').sum()
    
    # Compute ratio (clamp the denominator to at least 1 to avoid division by zero)
    ratio = n_newly_blamed / max(n_exonerated, 1)
    
    # Compute 95% CI for ratio using log transformation
    # SE(log(ratio)) ≈ sqrt(1/a + 1/b) for ratio = a/b
    if n_newly_blamed > 0 and n_exonerated > 0:
        se_log = np.sqrt(1/n_newly_blamed + 1/n_exonerated)
        ci_lower = np.exp(np.log(ratio) - 1.96 * se_log)
        ci_upper = np.exp(np.log(ratio) + 1.96 * se_log)
    else:
        # The log-scale CI is undefined when either count is zero; fall back to a zero-width interval
        ci_lower = ci_upper = ratio
    
    blame_results.append({
        'perturbation': config['name'],
        'category': config['category'],
        'intended': config['intended'],
        'n_flips': n_total,
        'n_newly_blamed': n_newly_blamed,
        'n_exonerated': n_exonerated,
        'n_preserved': n_preserved,
        'ratio': ratio,
        'ci_lower': ci_lower,
        'ci_upper': ci_upper,
        'net': n_newly_blamed - n_exonerated,
    })

blame_data = pd.DataFrame(blame_results)

# Sort: Push Self At Fault first (descending by ratio), then Push Other At Fault, then others
category_order = ['Persuasion: Push Self At Fault', 'Persuasion: Push Other At Fault', 'Point-of-view', 'Surface']
blame_data['category'] = pd.Categorical(blame_data['category'], categories=category_order, ordered=True)
blame_data = blame_data.sort_values(['category', 'ratio'], ascending=[True, False])

print("Blame Direction by Perturbation Type:")
print(blame_data[['perturbation', 'category', 'n_newly_blamed', 'n_exonerated', 'ratio', 'net']].to_string(index=False))

In [None]:
# =============================================================================
# Create Combined Figure: Panel A (Flip Rates) + Panel B (Blame Direction)
# =============================================================================

fig2, (ax1_alt, ax2_alt) = plt.subplots(1, 2, figsize=(14, 5), dpi=150)

# =============================================================================
# SHARED STYLING
# =============================================================================
# Grayscale colors; hatch patterns are left empty here but can be set per category for extra distinction
CATEGORY_STYLES = {
    'Persuasion: Push Self At Fault': {'color': '#444444', 'hatch': ''},
    'Persuasion: Push Other At Fault': {'color': '#777777', 'hatch': ''},
    'Point-of-view': {'color': '#aaaaaa', 'hatch': ''},
    'Surface': {'color': '#dddddd', 'hatch': ''}
}

BAR_EDGECOLOR = '#222222'
BAR_LINEWIDTH = 0.5
ERROR_CAPSIZE = 3
ERROR_LINEWIDTH = 1.2

# =============================================================================
# PANEL A: Flip Rates by Base Verdict (same as before)
# =============================================================================
x = np.arange(len(pivot_flip))
width = 0.25

CATEGORY_COLORS_A = {
    'Point-of-view': '#222222',
    'Persuasion': '#888888',
    'Surface': '#dddddd'
}
LEGEND_ORDER_A = ['Point-of-view', 'Persuasion', 'Surface']

for i, col in enumerate(LEGEND_ORDER_A):
    offset = (i - 1) * width
    bars = ax1_alt.bar(x + offset, pivot_flip[col], width, label=col, 
                       color=CATEGORY_COLORS_A[col], 
                       edgecolor=BAR_EDGECOLOR, linewidth=BAR_LINEWIDTH)

# Add confidence intervals
for i, col in enumerate(LEGEND_ORDER_A):
    for j, verdict in enumerate(pivot_flip.index):
        csv_col = col_to_csv.get(col, col)
        csv_verdict = label_to_csv.get(verdict, verdict)
        mask = (flip_rates['base_verdict'] == csv_verdict) & \
               (flip_rates['perturbation_family'] == csv_col)
        row = flip_rates[mask]
        if not row.empty:
            n = row['n_total'].values[0]
            p = row['flip_rate'].values[0] / 100
            se = np.sqrt(p * (1 - p) / n) * 100
            ci = z * se
            offset = (i - 1) * width
            ax1_alt.errorbar(j + offset, pivot_flip.loc[verdict, col], yerr=ci, 
                             fmt='none', ecolor='black', capsize=ERROR_CAPSIZE, 
                             lw=ERROR_LINEWIDTH, capthick=ERROR_LINEWIDTH, zorder=10)

ax1_alt.set_xlabel('Base Verdict Category', fontweight='bold')
ax1_alt.set_ylabel('Flip Rate (%)', fontweight='bold')
ax1_alt.set_title('(A) Verdict Instability by Base Judgment', fontweight='bold', loc='left')
ax1_alt.set_xticks(x)
ax1_alt.set_xticklabels(pivot_flip.index, fontsize=9)
ax1_alt.legend(loc='upper right', framealpha=0.9, bbox_to_anchor=(1, 0.97))
ax1_alt.set_ylim(0, 65)
ax1_alt.grid(axis='y', alpha=0.3, linestyle='--')
ax1_alt.spines['top'].set_visible(False)
ax1_alt.spines['right'].set_visible(False)

# =============================================================================
# PANEL B: Blame Direction Ratio by Perturbation
# =============================================================================
y_positions = np.arange(len(blame_data))
colors_b2 = [CATEGORY_STYLES[cat]['color'] for cat in blame_data['category']]
hatches_b2 = [CATEGORY_STYLES[cat]['hatch'] for cat in blame_data['category']]

# Compute asymmetric error bar widths (the CI is symmetric on the log scale, not on this linear axis)
xerr_lower = blame_data['ratio'] - blame_data['ci_lower']
xerr_upper = blame_data['ci_upper'] - blame_data['ratio']
xerr = [xerr_lower.values, xerr_upper.values]

# Horizontal bar chart
bars = ax2_alt.barh(y_positions, blame_data['ratio'], 
                    color=colors_b2, edgecolor=BAR_EDGECOLOR, linewidth=BAR_LINEWIDTH,
                    height=0.7, xerr=xerr, capsize=ERROR_CAPSIZE,
                    error_kw={'lw': ERROR_LINEWIDTH, 'capthick': ERROR_LINEWIDTH})

# Apply hatching
for bar, hatch in zip(bars, hatches_b2):
    bar.set_hatch(hatch)

# Reference line at ratio = 1 (balanced)
ax2_alt.axvline(x=1.0, color='#222222', linewidth=1.2, linestyle='-', zorder=0)

# Labels
ax2_alt.set_yticks(y_positions)
ax2_alt.set_yticklabels(blame_data['perturbation'])
ax2_alt.set_xlabel('Blame Direction Ratio (toward blame / toward exoneration)', fontweight='bold')
ax2_alt.set_title('(B) Asymmetric Perturbation Effectiveness', fontweight='bold', loc='left')

# Category separators
prev_cat = None
for i, (_, row) in enumerate(blame_data.iterrows()):
    if row['category'] != prev_cat and prev_cat is not None:
        ax2_alt.axhline(y=i-0.5, color='#cccccc', linewidth=0.5, linestyle='--')
    prev_cat = row['category']

# Annotate with ratio values only (no glyphs)
for i, (_, row) in enumerate(blame_data.iterrows()):
    ratio = row['ratio']
    # Position annotation after error bar
    x_pos = row['ci_upper'] + 0.15
    ax2_alt.annotate(f"{ratio:.2f}x",
                     xy=(x_pos, i),
                     fontsize=9, va='center', ha='left', color='#333333')

# Legend - position at lower right to avoid overlap with top bars
legend_patches = [mpatches.Patch(facecolor=CATEGORY_STYLES[cat]['color'], 
                                  edgecolor=BAR_EDGECOLOR, linewidth=BAR_LINEWIDTH,
                                  hatch=CATEGORY_STYLES[cat]['hatch'], label=cat)
                  for cat in ['Persuasion: Push Self At Fault', 'Persuasion: Push Other At Fault', 
                              'Point-of-view', 'Surface']]
ax2_alt.legend(handles=legend_patches, loc='lower right', framealpha=0.9, fontsize=8)

# Styling
ax2_alt.set_xlim(0, max(blame_data['ci_upper'].max() * 1.4, 7))
ax2_alt.spines['top'].set_visible(False)
ax2_alt.spines['right'].set_visible(False)
ax2_alt.invert_yaxis()
ax2_alt.grid(axis='x', alpha=0.3, linestyle='--')

# Directional labels
ax2_alt.text(0.3, -0.7, '$\\leftarrow$ Exonerates narrator', fontsize=8, ha='left', style='italic', color='#666666')
ax2_alt.text(ax2_alt.get_xlim()[1] * 0.95, -0.7, 'Blames narrator $\\rightarrow$', fontsize=8, ha='right', style='italic', color='#666666')

plt.tight_layout()
plt.show()

In [None]:
# Save alternative figure with blame direction panel
pdf_path_alt = output_dir / 'fig_instability_blame_direction_combined.pdf'
png_path_alt = output_dir / 'fig_instability_blame_direction_combined.png'

fig2.savefig(pdf_path_alt, bbox_inches='tight', dpi=300)
fig2.savefig(png_path_alt, bbox_inches='tight', dpi=300)

print(f"Saved: {pdf_path_alt}")
print(f"Saved: {png_path_alt}")

In [None]:
print("KEY INSIGHTS - Alternative Panel B (Blame Direction)")
print("=" * 80)

print("\nPanel B (Asymmetric Perturbation Effectiveness):")
print("  - Persuasion: Push Self At Fault perturbations WORK as intended:")
for _, row in blame_data[blame_data['category'] == 'Persuasion: Push Self At Fault'].iterrows():
    print(f"      {row['perturbation']}: {row['ratio']:.2f}x toward blame")

print("\n  - Persuasion: Push Other At Fault perturbations FAIL or BACKFIRE:")
for _, row in blame_data[blame_data['category'] == 'Persuasion: Push Other At Fault'].iterrows():
    status = "BACKFIRE" if row['ratio'] > 1.0 else "weak"
    print(f"      {row['perturbation']}: {row['ratio']:.2f}x ({status})")

print("\n  - Point-of-view perturbations show bias toward blame:")
for _, row in blame_data[blame_data['category'] == 'Point-of-view'].iterrows():
    print(f"      {row['perturbation']}: {row['ratio']:.2f}x")

print("\n  - Surface perturbations are relatively balanced:")
for _, row in blame_data[blame_data['category'] == 'Surface'].iterrows():
    print(f"      {row['perturbation']}: {row['ratio']:.2f}x")

print("\nKEY FINDING:")
print("  Self-criticism is trusted (Push Self At Fault works: ~4-6x toward blame)")
print("  Self-justification backfires (Push Other At Fault 'Self-justifying': >1x toward blame)")
print("  This supports the 'credibility heuristic' interpretation")

## Comparison: Epistemic Stance vs Blame Direction

| Aspect | Epistemic Stance (Original) | Blame Direction (Alternative) |
|--------|----------------------------|------------------------------|
| **Shows** | How *language* changes | How *judgments* change |
| **Unit** | Δ markers per 100 words | Ratio of transitions |
| **Key finding** | Third-person → confident | Push Self At Fault works, Push Other At Fault backfires |
| **Connects to paper narrative** | Secondary (linguistic) | Core (credibility heuristic) |
| **Visual highlight** | Point-of-view = most confident | Self-justifying backfire |

**Recommendation**: The blame direction panel directly visualizes the paper's core finding about asymmetric perturbation effectiveness and the credibility heuristic. The epistemic stance panel is interesting but tangential.