# Model × Constitution Interaction Analysis

**Week 1, Task 3 (Analysis 1.2) of Analysis & Publication Plan**

**Research Question:** Do certain models perform differently with certain constitutional frameworks?

**Dataset:** 360 trials with consensus scores (from Analysis 1.3)

**Design:** 5 models × 6 constitutions = 30 cells

**Models Tested:**
- Claude Sonnet 4.5
- GPT-4o  
- Gemini 2.5 Pro
- Llama 3.1 405B
- DeepSeek V3

**Constitutions Tested:**
- Balanced Justice
- Community Order
- Harm Minimization
- No Constitution (control)
- Self-Sovereignty
- Utilitarian

**Purpose:** Answer Q3 from the research roadmap: test whether certain models excel with some value systems but struggle with others.

---

## Setup

In [None]:
# Import libraries
import sys
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from IPython.display import display  # explicit import so display() also works outside Jupyter

# Add project root to path
sys.path.append('..')

from analysis.interaction_analysis import InteractionAnalyzer

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)
plt.rcParams['figure.dpi'] = 100

# Experiment ID
EXPERIMENT_ID = 'exp_20251028_134615'

print("✅ Setup complete")

## Load Data & Run Analysis

The next cell uses `InteractionAnalyzer` from `analysis/interaction_analysis.py` to:
- Load consensus scores from Analysis 1.3 (mean_all method)
- Calculate mean scores for each Model × Constitution cell
- Run two-way ANOVA to test interaction effect
- Perform post-hoc Tukey HSD tests
- Analyze simple effects (per-model constitution differences)
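
For readers who want to see what the two-way ANOVA step computes, here is a minimal sketch for a balanced design, written with NumPy only. It is not the code inside `interaction_analysis.py`, which is not shown in this notebook. The function name `two_way_anova_balanced`, the synthetic scores, and the choice of classical η² (SS_effect / SS_total, rather than partial η²) are all assumptions made for illustration.

```python
import numpy as np

def two_way_anova_balanced(y):
    """Two-way ANOVA for a balanced design (hypothetical helper).

    y has shape (a, b, n): a levels of factor A (e.g. models),
    b levels of factor B (e.g. constitutions), n replicates per cell.
    """
    y = np.asarray(y, dtype=float)
    a, b, n = y.shape
    grand = y.mean()
    mean_a = y.mean(axis=(1, 2))   # marginal mean per level of A
    mean_b = y.mean(axis=(0, 2))   # marginal mean per level of B
    cell = y.mean(axis=2)          # (a, b) table of cell means

    # Sums of squares for main effects, interaction, error, and total
    ss_a = b * n * np.sum((mean_a - grand) ** 2)
    ss_b = a * n * np.sum((mean_b - grand) ** 2)
    resid = cell - mean_a[:, None] - mean_b[None, :] + grand
    ss_ab = n * np.sum(resid ** 2)
    ss_err = np.sum((y - cell[:, :, None]) ** 2)
    ss_tot = np.sum((y - grand) ** 2)

    df_err = a * b * (n - 1)
    ms_err = ss_err / df_err
    table = {}
    for name, ss, df in [("A", ss_a, a - 1), ("B", ss_b, b - 1),
                         ("AxB", ss_ab, (a - 1) * (b - 1))]:
        table[name] = {"SS": ss, "df": df,
                       "F": (ss / df) / ms_err,
                       "eta_sq": ss / ss_tot}   # classical eta squared (assumption)
    table["error"] = {"SS": ss_err, "df": df_err}
    table["total"] = {"SS": ss_tot, "df": a * b * n - 1}
    return table

# Synthetic scores: 5 models x 6 constitutions x 12 trials = 360, additive effects plus noise
rng = np.random.default_rng(0)
model_eff = np.array([2.0, 1.0, 0.0, -1.0, -2.0])
const_eff = np.array([3.0, 1.0, 0.0, 0.0, -1.0, -3.0])
y = (80 + model_eff[:, None, None] + const_eff[None, :, None]
     + rng.normal(0, 2, size=(5, 6, 12)))

table = two_way_anova_balanced(y)
for effect in ("A", "B", "AxB"):
    print(effect, {k: round(float(v), 3) for k, v in table[effect].items()})
```

In a balanced design the four sums of squares add up to the total, which is why η² values can be read as shares of total variance. With unequal cell sizes this decomposition no longer holds, and a regression-based ANOVA would be the safer route.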

In [None]:
# Run complete analysis
analyzer = InteractionAnalyzer(EXPERIMENT_ID)
results = analyzer.analyze(consensus_method="mean_all")

## ANOVA Summary: Main Effects and Interaction

In [None]:
# Extract ANOVA results for all dimensions
anova_summary = []

for dimension in ['epistemic_integrity', 'value_transparency', 'overall_score']:
    anova = results['dimensions'][dimension]['anova']
    
    anova_summary.append({
        'Dimension': dimension.replace('_', ' ').title(),
        'Effect': 'Model',
        'F': f"{anova['anova_table']['model_effect']['F']:.2f}",
        'p': f"{anova['anova_table']['model_effect']['p']:.6f}",
        'η²': f"{anova['anova_table']['model_effect']['eta_sq']:.3f}",
        'Strength': anova['interpretation']['model_strength'],
        'Significant': '✅' if anova['interpretation']['model_effect_significant'] else '❌'
    })
    
    anova_summary.append({
        'Dimension': dimension.replace('_', ' ').title(),
        'Effect': 'Constitution',
        'F': f"{anova['anova_table']['constitution_effect']['F']:.2f}",
        'p': f"{anova['anova_table']['constitution_effect']['p']:.6f}",
        'η²': f"{anova['anova_table']['constitution_effect']['eta_sq']:.3f}",
        'Strength': anova['interpretation']['constitution_strength'],
        'Significant': '✅' if anova['interpretation']['constitution_effect_significant'] else '❌'
    })
    
    anova_summary.append({
        'Dimension': dimension.replace('_', ' ').title(),
        'Effect': 'Model × Constitution',
        'F': f"{anova['anova_table']['interaction_effect']['F']:.2f}",
        'p': f"{anova['anova_table']['interaction_effect']['p']:.6f}",
        'η²': f"{anova['anova_table']['interaction_effect']['eta_sq']:.3f}",
        'Strength': anova['interpretation']['interaction_strength'],
        'Significant': '✅' if anova['interpretation']['interaction_significant'] else '❌'
    })

df_anova = pd.DataFrame(anova_summary)
df_anova

## Visualization 1: Cell Means Heatmaps

**Interpretation:**
- Each cell shows mean score for Model × Constitution combination
- Darker green = higher scores (better performance)
- Look for interaction patterns: Do some models excel/struggle with specific constitutions?
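
As a rough illustration of how a Model × Constitution table of cell means can be built from long-form trial data, the sketch below uses `pandas.DataFrame.pivot_table`. The column names mirror those used later in this notebook (`layer2_model`, `constitution`, `overall_score`); the model and constitution labels and the scores are made up, and the analyzer may build its `cell_means` differently.

```python
import pandas as pd

# Hypothetical long-form trial data (labels and scores are illustrative only)
toy = pd.DataFrame({
    "layer2_model": ["model_a"] * 4 + ["model_b"] * 4,
    "constitution": ["utilitarian", "utilitarian",
                     "no-constitution", "no-constitution"] * 2,
    "overall_score": [90.0, 86.0, 80.0, 84.0, 70.0, 74.0, 78.0, 82.0],
})

# Rows = models, columns = constitutions, values = mean score per cell
cell_means = toy.pivot_table(
    index="layer2_model",
    columns="constitution",
    values="overall_score",
    aggfunc="mean",
)

# Marginal means: row means reflect model effects, column means constitution effects
model_means = cell_means.mean(axis=1)
constitution_means = cell_means.mean(axis=0)

print(cell_means)
print(model_means)
print(constitution_means)
```

In this toy table, `model_a` scores higher with `utilitarian` while `model_b` scores higher with `no-constitution`, which is the kind of reversal the heatmaps are meant to surface.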

In [None]:
# Create heatmaps for each dimension
fig, axes = plt.subplots(1, 3, figsize=(20, 6))

dimensions = ['epistemic_integrity', 'value_transparency', 'overall_score']
dimension_names = ['Epistemic Integrity', 'Value Transparency', 'Overall Score']

for idx, (dimension, dimension_name) in enumerate(zip(dimensions, dimension_names)):
    ax = axes[idx]
    
    # Get cell means
    cell_means = pd.DataFrame(results['dimensions'][dimension]['cell_means'])
    
    # Create heatmap
    sns.heatmap(
        cell_means,
        annot=True,
        fmt='.1f',
        cmap='RdYlGn',
        vmin=60,
        vmax=100,
        center=80,
        square=False,
        linewidths=0.5,
        cbar_kws={'label': 'Mean Score'},
        ax=ax
    )
    
    ax.set_title(dimension_name, fontsize=14, fontweight='bold')
    ax.set_xlabel('Constitution', fontsize=11)
    ax.set_ylabel('Model', fontsize=11)
    ax.tick_params(axis='x', rotation=45)

plt.suptitle('Model × Constitution Cell Means', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("\n📊 Look for vertical patterns (constitution effects) and horizontal patterns (model effects)")
print("📊 Interaction: Does pattern change across rows/columns?")

## Visualization 2: Interaction Plots

**Classic interaction plot:**
- X-axis: Models
- Y-axis: Mean score
- 6 lines: One per constitution

- **If lines are parallel:** No interaction (models perform consistently)
- **If lines cross/diverge:** Interaction present (different models excel with different constitutions)
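
The "parallel lines" criterion can also be checked numerically. One way, sketched below under the assumption that a plain table of cell means is available, is to subtract the row and column means from each cell and add back the grand mean. If the lines are exactly parallel, every residual is zero. The helper name `interaction_residuals` and both toy tables are hypothetical.

```python
import numpy as np

def interaction_residuals(cell_means):
    """Cell mean minus row mean minus column mean plus grand mean."""
    m = np.asarray(cell_means, dtype=float)
    row = m.mean(axis=1, keepdims=True)
    col = m.mean(axis=0, keepdims=True)
    return m - row - col + m.mean()

# Rows could be constitutions and columns models (or the reverse)
additive = np.array([[80.0, 85.0, 90.0],
                     [70.0, 75.0, 80.0]])   # second line is the first shifted by -10
crossing = np.array([[80.0, 85.0, 90.0],
                     [90.0, 85.0, 80.0]])   # lines cross at the middle column

print(np.abs(interaction_residuals(additive)).max())   # 0.0
print(np.abs(interaction_residuals(crossing)).max())   # 5.0
```

With real data the residuals are never exactly zero, so the ANOVA interaction test is still needed to judge whether they exceed what noise alone would produce.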

In [None]:
# Load dataframe for plotting
df = analyzer.build_dataframe("mean_all")

# Create interaction plots
fig, axes = plt.subplots(1, 3, figsize=(20, 6))

constitutions = sorted(df['constitution'].unique())
constitution_colors = sns.color_palette('husl', len(constitutions))
constitution_color_map = dict(zip(constitutions, constitution_colors))

for idx, (dimension, dimension_name) in enumerate(zip(dimensions, dimension_names)):
    ax = axes[idx]
    
    # Calculate mean scores per Model × Constitution
    interaction_data = df.groupby(['layer2_model', 'constitution'])[dimension].mean().reset_index()
    
    # Plot lines for each constitution
    for constitution in constitutions:
        const_data = interaction_data[interaction_data['constitution'] == constitution]
        
        ax.plot(
            const_data['layer2_model'],
            const_data[dimension],
            marker='o',
            markersize=8,
            linewidth=2,
            label=constitution,
            color=constitution_color_map[constitution],
            alpha=0.8
        )
    
    # Styling
    ax.set_title(dimension_name, fontsize=14, fontweight='bold')
    ax.set_xlabel('Model', fontsize=11)
    ax.set_ylabel('Mean Score', fontsize=11)
    ax.tick_params(axis='x', rotation=45)
    ax.grid(axis='y', alpha=0.3)
    ax.set_ylim(60, 100)
    
    if idx == 2:  # Only show legend on last plot
        ax.legend(title='Constitution', fontsize=9, loc='best', framealpha=0.9)

plt.suptitle('Model × Constitution Interaction Plots', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("\n📊 Parallel lines → No interaction (consistent performance)")
print("📊 Crossing/diverging lines → Interaction present (different patterns per model)")

## Simple Effects Analysis: Per-Model Constitution Performance

In [None]:
# Extract simple effects for overall_score
simple_effects = results['dimensions']['overall_score']['simple_effects']

simple_effects_summary = []
for model, stats_dict in simple_effects.items():
    simple_effects_summary.append({
        'Model': model,
        'Range (points)': f"{stats_dict['range']:.2f}",
        'Best Constitution': stats_dict['best_constitution'],
        'Worst Constitution': stats_dict['worst_constitution'],
        'F-statistic': f"{stats_dict['anova_F']:.2f}",
        'p-value': f"{stats_dict['anova_p']:.4f}",
        'Constitutions Differ': '✅' if stats_dict['constitutions_differ'] else '❌'
    })

df_simple = pd.DataFrame(simple_effects_summary)
df_simple

## Visualization 3: Per-Model Constitution Rankings

Shows which constitutions each model performs best/worst with.
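
The descriptive part of the simple-effects table (range, best, and worst constitution per model) could be reproduced along the following lines. This is a sketch on toy data, not the analyzer's own code: the helper name `constitution_sensitivity` and the labels are hypothetical, and the per-model F-tests and p-values are not included.

```python
import pandas as pd

def constitution_sensitivity(df, score_col="overall_score"):
    """Per-model spread of constitution means (hypothetical helper)."""
    cell = (
        df.groupby(["layer2_model", "constitution"])[score_col]
          .mean()
          .unstack("constitution")
    )
    return pd.DataFrame({
        "range": cell.max(axis=1) - cell.min(axis=1),
        "best_constitution": cell.idxmax(axis=1),
        "worst_constitution": cell.idxmin(axis=1),
    })

# Illustrative data: 2 models x 3 constitutions x 2 trials
toy = pd.DataFrame({
    "layer2_model": ["model_a"] * 6 + ["model_b"] * 6,
    "constitution": ["utilitarian", "utilitarian", "harm", "harm", "none", "none"] * 2,
    "overall_score": [90.0, 86.0, 80.0, 84.0, 85.0, 85.0,
                      72.0, 72.0, 80.0, 80.0, 75.0, 77.0],
})

summary = constitution_sensitivity(toy)
print(summary)
```

Here `model_a` has a 6-point range (best `utilitarian`, worst `harm`) and `model_b` an 8-point range with the opposite ordering.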

In [None]:
# Create grouped bar chart showing constitution performance per model
fig, axes = plt.subplots(2, 3, figsize=(20, 12))
axes = axes.flatten()

models = sorted(df['layer2_model'].unique())

for idx, model in enumerate(models):
    if idx >= len(axes):
        break
        
    ax = axes[idx]
    
    # Get mean scores per constitution for this model
    model_data = df[df['layer2_model'] == model].groupby('constitution')['overall_score'].mean().sort_values()
    
    # Create bar chart
    bars = ax.barh(
        model_data.index,
        model_data.values,
        color=[constitution_color_map[const] for const in model_data.index],
        edgecolor='black',
        linewidth=1.5,
        alpha=0.8
    )
    
    # Add value labels
    for bar, value in zip(bars, model_data.values):
        ax.text(
            value + 0.5,
            bar.get_y() + bar.get_height() / 2,
            f'{value:.1f}',
            ha='left',
            va='center',
            fontweight='bold',
            fontsize=10
        )
    
    # Styling
    ax.set_title(model, fontsize=12, fontweight='bold')
    ax.set_xlabel('Mean Overall Score', fontsize=10)
    ax.set_xlim(60, 100)
    ax.grid(axis='x', alpha=0.3)

# Hide any unused subplots (one spare panel for 5 models on a 2×3 grid)
for unused_ax in axes[len(models):]:
    unused_ax.axis('off')

plt.suptitle('Constitution Performance by Model (Overall Score)', fontsize=16, fontweight='bold', y=0.995)
plt.tight_layout()
plt.show()

print("\n📊 Each subplot shows how the constitutions rank for one model")
print("📊 If all models show similar rankings → No crossover interaction (effect sizes may still differ)")
print("📊 If rankings differ across models → Suggests an interaction")

## Post-Hoc Tests: Significant Pairwise Differences

In [None]:
# Extract post-hoc results for overall_score
post_hoc = results['dimensions']['overall_score']['post_hoc']

print("="*70)
print("POST-HOC TUKEY HSD TESTS (Overall Score)")
print("="*70)

print(f"\nSIGNIFICANT MODEL DIFFERENCES: {len(post_hoc['models']['significant_pairs'])}")
print("-"*70)

if post_hoc['models']['significant_pairs']:
    model_pairs_df = pd.DataFrame(post_hoc['models']['significant_pairs'])
    model_pairs_df = model_pairs_df.sort_values('mean_diff', key=abs, ascending=False)
    display(model_pairs_df[['group1', 'group2', 'mean_diff', 'p_adj']].head(10))
else:
    print("No significant pairwise differences between models")

print(f"\n\nSIGNIFICANT CONSTITUTION DIFFERENCES: {len(post_hoc['constitutions']['significant_pairs'])}")
print("-"*70)

if post_hoc['constitutions']['significant_pairs']:
    const_pairs_df = pd.DataFrame(post_hoc['constitutions']['significant_pairs'])
    const_pairs_df = const_pairs_df.sort_values('mean_diff', key=abs, ascending=False)
    display(const_pairs_df[['group1', 'group2', 'mean_diff', 'p_adj']].head(10))
else:
    print("No significant pairwise differences between constitutions")

## Key Findings Summary

In [None]:
# Extract key findings
overall_anova = results['dimensions']['overall_score']['anova']
overall_simple = results['dimensions']['overall_score']['simple_effects']

print("="*70)
print("KEY FINDINGS: MODEL × CONSTITUTION INTERACTION ANALYSIS")
print("="*70)

print("\n1. MAIN EFFECT: MODEL")
print("-"*70)
if overall_anova['interpretation']['model_effect_significant']:
    print("✅ SIGNIFICANT: Models differ in overall performance")
    print(f"   F = {overall_anova['anova_table']['model_effect']['F']:.2f}, p = {overall_anova['anova_table']['model_effect']['p']:.6f}")
    print(f"   Effect size: η² = {overall_anova['anova_table']['model_effect']['eta_sq']:.3f} ({overall_anova['interpretation']['model_strength']})")
else:
    print("❌ NOT SIGNIFICANT: Models perform similarly overall")
    print(f"   p = {overall_anova['anova_table']['model_effect']['p']:.6f}")

print("\n2. MAIN EFFECT: CONSTITUTION")
print("-"*70)
if overall_anova['interpretation']['constitution_effect_significant']:
    print("✅ SIGNIFICANT: Constitutions differ in resulting scores")
    print(f"   F = {overall_anova['anova_table']['constitution_effect']['F']:.2f}, p = {overall_anova['anova_table']['constitution_effect']['p']:.6f}")
    print(f"   Effect size: η² = {overall_anova['anova_table']['constitution_effect']['eta_sq']:.3f} ({overall_anova['interpretation']['constitution_strength']})")
else:
    print("❌ NOT SIGNIFICANT: Constitutions produce similar scores")
    print(f"   p = {overall_anova['anova_table']['constitution_effect']['p']:.6f}")

print("\n3. INTERACTION EFFECT: MODEL × CONSTITUTION")
print("-"*70)
if overall_anova['interpretation']['interaction_significant']:
    print("✅ SIGNIFICANT INTERACTION DETECTED")
    print(f"   F = {overall_anova['anova_table']['interaction_effect']['F']:.2f}, p = {overall_anova['anova_table']['interaction_effect']['p']:.6f}")
    print(f"   Effect size: η² = {overall_anova['anova_table']['interaction_effect']['eta_sq']:.3f} ({overall_anova['interpretation']['interaction_strength']})")
    print("\n   INTERPRETATION: Different models perform differently across constitutions!")
    print("   Some models excel with certain value systems but struggle with others.")
else:
    print("❌ NO SIGNIFICANT INTERACTION")
    print(f"   p = {overall_anova['anova_table']['interaction_effect']['p']:.6f}")
    print("\n   INTERPRETATION: No evidence that constitution effects differ across models.")
    print("   Constitution effects look uniform (no model-specific patterns detected).")

print("\n4. SIMPLE EFFECTS (PER-MODEL CONSTITUTION SENSITIVITY)")
print("-"*70)
for model, stats_dict in sorted(overall_simple.items(), key=lambda x: x[1]['range'], reverse=True):
    significant = "✅" if stats_dict['constitutions_differ'] else "❌"
    print(f"{model:25} {significant} Range: {stats_dict['range']:5.2f} points ")
    print(f"{'':27} Best: {stats_dict['best_constitution']:20} Worst: {stats_dict['worst_constitution']}")

print("\n" + "="*70)

## Interpretation and Implications

### What Does This Mean?

**If Interaction is Significant:**
- Different models have different "constitutional profiles"
- Model A might excel with Utilitarian values but struggle with Community Order
- Model B might show the opposite pattern
- **Implication:** Model choice matters depending on which value system you want to reason from
- **For AI Alignment:** Models are not interchangeable - they have value-specific strengths/weaknesses

**If Interaction is NOT Significant:**
- No evidence that models respond differently to particular constitutions
- Constitution effects look uniform (same pattern for all models)
- **Implication:** The best model does not depend on the value system; effects are approximately additive (model quality + constitution difficulty)
- **For AI Alignment:** A model's relative standing carries over across constitutions, although models can still differ in overall quality (the model main effect)
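
The "additive" language above corresponds to the standard two-way ANOVA model. Written out, with symbols that are introduced here for illustration (the notebook itself does not define them):

```latex
y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk},
\qquad i = 1,\dots,5,\quad j = 1,\dots,6,\quad k = 1,\dots,n_{ij}

H_0^{\text{interaction}}:\ (\alpha\beta)_{ij} = 0 \ \text{for all } i, j
\quad\Longrightarrow\quad
\mu_{ij} - \mu_{i'j} = \alpha_i - \alpha_{i'} \ \text{for every constitution } j
```

Here $\alpha_i$ is the effect of model $i$, $\beta_j$ the effect of constitution $j$, and $(\alpha\beta)_{ij}$ the interaction term. If the 360 trials are spread evenly over the 30 cells, $n_{ij} = 12$; the notebook does not state whether the design is balanced. Under the null hypothesis the gap between any two models is the same for every constitution, which is the precise sense in which effects are additive.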

---

### Comparison to Q1 (Model Performance) and Q2 (Constitution Effects)

**Q1: Model Performance (Main Effect)**
- Tests: Do some models score higher overall?
- What it tells us: Which models are "better" at constitutional reasoning in general

**Q2: Constitution Effects (Main Effect)**
- Tests: Do some constitutions lead to higher/lower scores?
- What it tells us: Which value systems are "easier" or "harder" for models

**Q3: Model × Constitution Interaction (This Analysis)**
- Tests: Do differences between models change across constitutions?
- What it tells us: Whether models have constitution-specific strengths/weaknesses
- **More interesting than Q1/Q2:** Reveals nuanced patterns beyond "this model is better"

---

### Implications for Human Validation (Week 2-3)

**If Interaction Found:**
- Validation sample should include diverse Model × Constitution combinations
- Prioritize cells with extreme scores (very high or very low)
- Test hypothesis: Do humans also perceive constitution-specific model performance?

**If No Interaction:**
- Validation can focus on main effects (best/worst models, easiest/hardest constitutions)
- Simpler sampling strategy: stratify by model OR constitution, not both
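
If an interaction is found, one simple way to cover every Model × Constitution combination is to draw the same number of trials from each cell. The sketch below uses `DataFrameGroupBy.sample` (available in pandas 1.1 and later). The helper name `sample_per_cell`, the placeholder labels, and the choice of two trials per cell are assumptions, not part of the validation plan.

```python
from itertools import product

import pandas as pd

def sample_per_cell(df, n_per_cell, seed=0):
    """Draw n_per_cell trials from every Model x Constitution cell (hypothetical helper)."""
    return df.groupby(["layer2_model", "constitution"]).sample(
        n=n_per_cell, random_state=seed
    )

# Placeholder trial table: 5 models x 6 constitutions x 12 trials = 360 rows
models = [f"model_{i}" for i in range(5)]
constitutions = [f"constitution_{j}" for j in range(6)]
trials = pd.DataFrame(
    [(m, c, t) for m, c in product(models, constitutions) for t in range(12)],
    columns=["layer2_model", "constitution", "trial"],
)

validation = sample_per_cell(trials, n_per_cell=2)
print(len(validation))  # 60 trials: 2 from each of the 30 cells
```

Prioritizing cells with extreme scores, as suggested above, would mean raising `n_per_cell` for those cells instead of sampling uniformly.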

---

## Next Steps

**✅ Analysis 1.1 Complete:** Rubric Comparison → Likert wins

**✅ Analysis 1.3 Complete:** Evaluator Agreement → Consensus scores ready

**✅ Analysis 1.2 Complete:** Model × Constitution Interaction → [Results from this notebook]

**⏭ Analysis 1.4 Next:** Dimensional Structure Validation
- Test: Are Epistemic Integrity and Value Transparency independent?
- Correlation analysis + PCA
- Validates 2-dimensional rubric design

---

**Analysis Date:** 2025-10-31  
**Experiment:** exp_20251028_134615  
**Trials Analyzed:** 360 trials with consensus scores  
**Design:** 5 models × 6 constitutions = 30 cells

**Key Question Answered:** Do certain models perform differently with certain constitutional frameworks?