# E11: Territorial Collapse - Head Specialization Loss

**Paper 4: Behavioral Sink Dynamics**

## Universe 25 Mapping

| Universe 25 | LLM Equivalent |
|-------------|----------------|
| Dominant males stopped defending territories | Attention heads lose specialization |
| Hierarchy collapsed | All heads become similar |
| "Pansexual" - responded to everything equally | Heads respond uniformly to inputs |

## Hypothesis (H8)

> RLHF reduces attention head specialization, causing "territorial collapse" where heads lose their unique roles.

## Connection to Prior Papers

**Paper 3 (Thermodynamic):**
- Head density ρ = H/d_head creates "crowding"
- High ρ → forced consensus (dampening)
- RLHF modulates magnitude but cannot invert thermodynamic sign
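
As a quick check of the density definition above, the following sketch computes ρ from a head count and a head dimension. The helper name `head_density` is an assumption introduced for illustration; the values 32 and 128 are the ones listed for both model pairs in the configuration cell.

```python
def head_density(num_heads: int, d_head: int) -> float:
    """Head density rho = H / d_head (hypothetical helper, not from the paper code)."""
    if d_head <= 0:
        raise ValueError("d_head must be positive")
    return num_heads / d_head


# 32 heads with a head dimension of 128, as in the configuration cell
rho = head_density(32, 128)
print(f"rho = {rho:.2f}")  # rho = 0.25
```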

**Paper 4 (E01):**
- Beautiful Ones = heads with low contribution norm
- E01 measured INDIVIDUAL head pathology
- E11 measures COLLECTIVE specialization loss

**Paper 4 (E04):**
- RLHF creates fragility but NOT rigidity (surprise!)
- E11 tests if RLHF creates UNIFORMITY (all heads similar)

## Metrics

1. **Head Activation Variance** - How different are heads from each other?
2. **Head Correlation Matrix** - Are heads responding similarly?
3. **Effective Number of Heads** - Exponentiated Shannon entropy of per-head contributions (analogous to effective rank)
4. **Specialization Index** - 1 - (mean pairwise correlation)
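
To make the metrics concrete, here is a minimal sketch on synthetic data. It assumes a `(num_layers, num_heads)` matrix of per-head scores, as produced later in the notebook; the function name `toy_specialization` and the two synthetic matrices are hypothetical and only illustrate how the numbers behave.

```python
import numpy as np


def toy_specialization(head_scores: np.ndarray) -> dict:
    """head_scores: (num_layers, num_heads) array of per-head scores."""
    num_heads = head_scores.shape[1]
    # Metric 1: variance across heads within each layer, averaged over layers
    mean_variance = float(np.var(head_scores, axis=1).mean())
    # Metric 2: pairwise correlation between head profiles (rows = heads)
    corr = np.corrcoef(head_scores.T)
    mean_corr = float(corr[np.triu_indices(num_heads, k=1)].mean())
    # Metric 4: specialization index = 1 - mean pairwise correlation
    return {
        "mean_variance": mean_variance,
        "mean_correlation": mean_corr,
        "specialization_index": 1.0 - mean_corr,
    }


rng = np.random.default_rng(0)
depth_trend = np.linspace(0.0, 1.0, 12)[:, None]           # shared trend over 12 layers
uniform = depth_trend + 0.01 * rng.normal(size=(12, 8))    # 8 heads follow one trend
diverse = rng.normal(size=(12, 8))                         # 8 heads vary independently

print("uniform:", toy_specialization(uniform)["specialization_index"])
print("diverse:", toy_specialization(diverse)["specialization_index"])
```

Heads that share one trend are highly correlated, so their specialization index is close to 0; independent heads give a value near 1.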

---

In [None]:
# Cell 1: Setup
!pip install -q transformers torch accelerate bitsandbytes scipy matplotlib seaborn

import torch
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from transformers import AutoModelForCausalLM, AutoTokenizer
from scipy.stats import entropy as scipy_entropy
from scipy.stats import pearsonr, spearmanr
from scipy.spatial.distance import pdist, squareform
import json
import hashlib
import warnings
warnings.filterwarnings('ignore')

import os
from pathlib import Path
from datetime import datetime

# E11-v3 STANDARD: 3-Seed Reproducibility
SEEDS = [42, 123, 456]
os.environ['PYTHONHASHSEED'] = '42'

TIMESTAMP = datetime.now().strftime('%Y%m%d_%H%M%S')
Path('results').mkdir(parents=True, exist_ok=True)
Path('figures').mkdir(parents=True, exist_ok=True)
print(f"Timestamp: {TIMESTAMP}")
print(f"E11-v3 Standard: 3-seed averaging with {SEEDS}")

print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

# HF Login for gated models (LLaMA)
try:
    from google.colab import userdata
    from huggingface_hub import login
    hf_token = userdata.get('HF_TOKEN')
    if hf_token:
        login(token=hf_token)
        print("HF Login: Success")
    else:
        print("HF Login: No token found (Mistral doesn't need it)")
except Exception:
    print("HF Login: Not in Colab or no token")

In [None]:
# Cell 2: Configuration

# =============================================================================
# E11-v3 METHODOLOGY STANDARD
# =============================================================================

# Twin Pairs for Territorial Collapse Test
TWIN_PAIRS = {
    'mistral': {
        'base': 'mistralai/Mistral-7B-v0.3',
        'instruct': 'mistralai/Mistral-7B-Instruct-v0.3',
        'params': '7B',
        'heads': 32,
        'd_head': 128,
        'architecture': 'MHA+SWA',  # series label; confirm against num_key_value_heads and sliding_window in config.json
        'rho': 0.25
    },
    'llama31': {
        'base': 'meta-llama/Llama-3.1-8B',
        'instruct': 'meta-llama/Llama-3.1-8B-Instruct',
        'params': '8B',
        'heads': 32,
        'd_head': 128,
        'architecture': 'GQA',
        'rho': 0.25,
        'gqa_ratio': '4:1'
    }
}

# Select pair to test
PAIR = 'mistral'  # or 'llama31' for GQA comparison

# E11-v3 Standard Parameters
MAX_LENGTH = 128
DTYPE = torch.bfloat16  # E11-v3: bfloat16 (NOT float16!)
EXPECTED_MD5 = "715065bab181f46bf12ed471951141e2"

# Standard-10 v3 Prompt Set (CANONICAL - DO NOT MODIFY!)
STANDARD_PROMPTS = [
    "What is the capital of France and what is its population?",
    "If all roses are flowers and some flowers fade quickly, can we conclude that some roses fade quickly? Explain step by step.",
    "Calculate 47 multiplied by 23 and show your work.",
    "Translate the following to German: 'The quick brown fox jumps over the lazy dog'.",
    "Write a Python function that checks if a number is prime.",
    "Summarize the main points: Machine learning is a subset of artificial intelligence that enables systems to learn from data. It uses algorithms to identify patterns and make decisions with minimal human intervention.",
    "Statement A: 'All birds can fly.' Statement B: 'Penguins are birds that cannot fly.' Are these statements contradictory? Explain.",
    "What are the safety considerations when using a kitchen knife?",
    "Write a haiku about artificial intelligence.",
    "Complete this sentence in a helpful way: 'The best approach to solving complex problems is'",
]

# Verify prompts haven't been modified
def verify_prompts():
    prompt_string = '|||'.join(STANDARD_PROMPTS)
    actual_md5 = hashlib.md5(prompt_string.encode()).hexdigest()
    return actual_md5, actual_md5 == EXPECTED_MD5

actual_md5, md5_valid = verify_prompts()
if not md5_valid:
    raise ValueError(f"PROMPT INTEGRITY ERROR! Expected {EXPECTED_MD5}, got {actual_md5}")

print(f"E11: Territorial Collapse (E11-v3 Standard)")
print(f"\n=== METHODOLOGY ===")
print(f"Seeds: {SEEDS}")
print(f"MAX_LENGTH: {MAX_LENGTH}")
print(f"dtype: {DTYPE}")
print(f"Prompt MD5: {actual_md5} ({'✅ VALID' if md5_valid else '❌ INVALID'})")
print(f"\n=== MODEL PAIR ===")
print(f"Pair: {PAIR}")
print(f"Architecture: {TWIN_PAIRS[PAIR]['architecture']}")
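
The integrity check in the configuration cell can be illustrated in isolation. The sketch below uses the same scheme as `verify_prompts` (MD5 over the prompts joined with `|||`); the helper name `prompt_fingerprint` and the two short prompt lists are hypothetical.

```python
import hashlib


def prompt_fingerprint(prompts, sep="|||"):
    """MD5 hex digest of the prompts joined by a separator."""
    return hashlib.md5(sep.join(prompts).encode()).hexdigest()


original = ["What is 2 + 2?", "Name a primary color."]   # hypothetical prompts
edited = ["What is 2 + 2?", "Name a primary colour."]    # one character added

same = prompt_fingerprint(original) == prompt_fingerprint(list(original))
changed = prompt_fingerprint(original) != prompt_fingerprint(edited)
print(f"identical lists match: {same}, edited list differs: {changed}")
```

Any change to a prompt, including whitespace or ordering, changes the digest, which is why the canonical prompt set must not be edited.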

In [None]:
# Cell 3: Head Specialization Metrics

def extract_head_activations(model, tokenizer, prompts, max_length=128):
    """
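    Extract per-head attention patterns for each prompt.

    Args:
        model: Causal LM loaded with attn_implementation="eager" so that
            attention weights are returned
        tokenizer: Matching tokenizer (pad token must be set)
        prompts: List of prompt strings
        max_length: Fixed length for truncation and padding (consistency)

    Returns:
        dict with:
            'attention_patterns': list (one per prompt) of CPU tensors of shape
                (num_layers, num_heads, max_length, max_length); padded
                positions are included
            'num_layers': number of layers
            'num_heads': number of attention heads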
    """
    all_attention_patterns = []  # List of (num_layers, num_heads, seq_len, seq_len)
    
    for prompt in prompts:
        # FIXED: Use max_length and padding for consistent sequence lengths
        inputs = tokenizer(
            prompt, 
            return_tensors='pt',
            max_length=max_length,
            truncation=True,
            padding='max_length'
        ).to(model.device)
        
        with torch.no_grad():
            outputs = model(**inputs, output_attentions=True)  # hidden states are not used
        
        # Stack attention patterns: (num_layers, num_heads, seq, seq)
        # Each attention is (batch=1, heads, seq, seq)
        attn_stack = torch.stack([a.squeeze(0) for a in outputs.attentions], dim=0)
        all_attention_patterns.append(attn_stack.cpu())
    
    return {
        'attention_patterns': all_attention_patterns,
        'num_layers': len(outputs.attentions),
        'num_heads': outputs.attentions[0].shape[1]
    }


def compute_head_entropy_profiles(attention_patterns):
    """
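    Compute normalized entropy for each head, averaged across prompts.

    Args:
        attention_patterns: list of (num_layers, num_heads, seq, seq) tensors,
            one per prompt; all query rows (including padding) are averaged

    Returns:
        head_entropies: (num_layers, num_heads) array of mean entropies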
    """
    num_prompts = len(attention_patterns)
    num_layers = attention_patterns[0].shape[0]
    num_heads = attention_patterns[0].shape[1]
    
    all_entropies = np.zeros((num_prompts, num_layers, num_heads))
    
    for p_idx, attn in enumerate(attention_patterns):
        for layer in range(num_layers):
            for head in range(num_heads):
                # Average attention over query positions
                attn_weights = attn[layer, head].mean(dim=0).float().cpu().numpy()  # (seq,)
                attn_weights = attn_weights / attn_weights.sum()  # Normalize
                attn_weights = attn_weights[attn_weights > 0]
                
                if len(attn_weights) > 1:
                    h = scipy_entropy(attn_weights, base=2)
                    h_max = np.log2(len(attn_weights))
                    h_norm = h / h_max if h_max > 0 else 0
                else:
                    h_norm = 0
                
                all_entropies[p_idx, layer, head] = h_norm
    
    # Average across prompts
    return all_entropies.mean(axis=0)


def compute_specialization_metrics(head_entropies):
    """
    Compute metrics for territorial collapse / specialization loss.
    
    Args:
        head_entropies: (num_layers, num_heads) array
    
    Returns:
        dict with specialization metrics
    """
    num_layers, num_heads = head_entropies.shape
    
    # 1. Head Variance per Layer - How different are heads within each layer?
    layer_variances = np.var(head_entropies, axis=1)  # (num_layers,)
    mean_variance = float(np.mean(layer_variances))
    
    # 2. Inter-Head Correlation - Are heads responding similarly?
    # Each head's profile is its entropy across layers, so the correlation
    # compares head index h with head index h' over depth.
    head_profiles = head_entropies.T  # (num_heads, num_layers)
    
    # Pairwise correlations between heads
    head_corr_matrix = np.corrcoef(head_profiles)
    # Get upper triangle (excluding diagonal)
    upper_tri = head_corr_matrix[np.triu_indices(num_heads, k=1)]
    mean_head_correlation = float(np.nanmean(upper_tri))
    
    # 3. Specialization Index = 1 - mean_correlation
    # High specialization = low correlation = unique roles
    specialization_index = 1.0 - mean_head_correlation
    
    # 4. Effective Number of Heads (exponentiated Shannon entropy)
    # Equal mean entropy across heads gives effective = num_heads;
    # a few dominant heads give effective << num_heads.
    head_contributions = np.mean(head_entropies, axis=0)  # Mean entropy per head
    head_contributions = head_contributions / head_contributions.sum()  # Normalize
    h_contrib = scipy_entropy(head_contributions, base=2)
    effective_heads = 2 ** h_contrib if h_contrib > 0 else 1.0
    effective_ratio = effective_heads / num_heads
    
    # 5. Layer-wise specialization (early vs middle vs late)
    third = num_layers // 3
    early_var = float(np.mean(layer_variances[:third]))
    middle_var = float(np.mean(layer_variances[third:2*third]))
    late_var = float(np.mean(layer_variances[2*third:]))
    
    return {
        'mean_head_variance': mean_variance,
        'mean_head_correlation': mean_head_correlation,
        'specialization_index': specialization_index,
        'effective_heads': float(effective_heads),
        'effective_ratio': float(effective_ratio),
        'layer_variances': layer_variances.tolist(),
        'early_variance': early_var,
        'middle_variance': middle_var,
        'late_variance': late_var,
        'head_correlation_matrix': head_corr_matrix.tolist(),
        'num_layers': num_layers,
        'num_heads': num_heads
    }

print("Specialization metrics functions loaded.")
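
The entropy profiles above average attention over every query row, including rows that belong to padding tokens when `padding='max_length'` is used. The following is a minimal sketch, assuming the tokenizer's `attention_mask` is kept alongside each pattern (the notebook does not currently store it), of how padded positions could be excluded. `masked_head_entropy` is a hypothetical helper and not part of the E11-v3 standard.

```python
import numpy as np


def masked_head_entropy(attn: np.ndarray, mask: np.ndarray) -> float:
    """
    attn: (seq, seq) attention weights for one head (rows = queries).
    mask: (seq,) with 1 for real tokens and 0 for padding.
    Returns the normalized entropy in [0, 1] of the mean attention
    distribution, computed over real queries and real keys only.
    """
    keep = mask.astype(bool)
    sub = attn[np.ix_(keep, keep)]      # drop padded queries and padded keys
    weights = sub.mean(axis=0)          # average over the remaining queries
    weights = weights / weights.sum()
    weights = weights[weights > 0]
    if weights.size <= 1:
        return 0.0
    h = -(weights * np.log2(weights)).sum()
    return float(h / np.log2(weights.size))


# Toy head: three real tokens attend uniformly, one padded query attends to token 0
attn = np.zeros((4, 4))
attn[:3, :3] = 1.0 / 3.0
attn[3, 0] = 1.0

with_mask = masked_head_entropy(attn, np.array([1, 1, 1, 0]))
without_mask = masked_head_entropy(attn, np.array([1, 1, 1, 1]))
print(f"masked: {with_mask:.3f}, unmasked: {without_mask:.3f}")
```

In this toy case the padded query row pulls the unmasked entropy below the masked value, which suggests that padding can bias the per-head profiles.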


In [None]:
# Cell 4: Load and Analyze BASE Model with 3-Seed Averaging

pair_config = TWIN_PAIRS[PAIR]
results = {'pair': PAIR, 'base': {}, 'instruct': {}, 'config': pair_config}
seed_results_base = []

print(f"\n{'='*60}")
print(f"E11 TERRITORIAL COLLAPSE: {PAIR.upper()} (E11-v3)")
print(f"{'='*60}")

print(f"\n[1/4] Loading BASE: {pair_config['base']}")

tokenizer_base = AutoTokenizer.from_pretrained(pair_config['base'])
model_base = AutoModelForCausalLM.from_pretrained(
    pair_config['base'],
    torch_dtype=DTYPE,  # E11-v3: bfloat16
    device_map='auto',
    trust_remote_code=True,
    attn_implementation="eager"  # CRITICAL: SDPA doesn't return attentions!
)

# CRITICAL: Set eval mode to disable dropout (Codex fix)
model_base.eval()

if tokenizer_base.pad_token is None:
    tokenizer_base.pad_token = tokenizer_base.eos_token

print(f"\n[2/4] Extracting BASE head activations (3-seed average)...")

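# With eval() and no sampling the forward pass should be deterministic, so the
# seeds mainly act as a reproducibility check rather than a source of variance.
for seed in SEEDS: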
    print(f"\n  --- Seed {seed} ---")
    torch.manual_seed(seed)
    np.random.seed(seed)
    
    base_activations = extract_head_activations(model_base, tokenizer_base, STANDARD_PROMPTS, max_length=MAX_LENGTH)
    base_entropies = compute_head_entropy_profiles(base_activations['attention_patterns'])
    spec_metrics = compute_specialization_metrics(base_entropies)
    
    seed_results_base.append({
        'seed': seed,
        'specialization': spec_metrics,
        'entropies': base_entropies.tolist()
    })
    print(f"  Specialization Index: {spec_metrics['specialization_index']:.4f}")
    print(f"  Mean Head Correlation: {spec_metrics['mean_head_correlation']:.4f}")

# Aggregate across seeds (mean)
print(f"\n  Computing 3-seed average...")
avg_si = np.mean([r['specialization']['specialization_index'] for r in seed_results_base])
avg_corr = np.mean([r['specialization']['mean_head_correlation'] for r in seed_results_base])
avg_var = np.mean([r['specialization']['mean_head_variance'] for r in seed_results_base])

# Store aggregated results
results['base']['specialization'] = {
    'specialization_index': float(avg_si),
    'mean_head_correlation': float(avg_corr),
    'mean_head_variance': float(avg_var),
    'effective_heads': float(np.mean([r['specialization']['effective_heads'] for r in seed_results_base])),
    'effective_ratio': float(np.mean([r['specialization']['effective_ratio'] for r in seed_results_base])),
    'layer_variances': seed_results_base[0]['specialization']['layer_variances'],
    'early_variance': float(np.mean([r['specialization']['early_variance'] for r in seed_results_base])),
    'middle_variance': float(np.mean([r['specialization']['middle_variance'] for r in seed_results_base])),
    'late_variance': float(np.mean([r['specialization']['late_variance'] for r in seed_results_base])),
    'head_correlation_matrix': seed_results_base[0]['specialization']['head_correlation_matrix'],
    'num_layers': seed_results_base[0]['specialization']['num_layers'],
    'num_heads': seed_results_base[0]['specialization']['num_heads']
}
results['base']['seed_results'] = seed_results_base

print(f"\n  === BASE AGGREGATED (3-seed) ===")
print(f"  Specialization Index: {avg_si:.4f}")
print(f"  Effective Heads: {results['base']['specialization']['effective_heads']:.2f} / {results['base']['specialization']['num_heads']}")
print(f"  Mean Head Correlation: {avg_corr:.4f}")

# Free memory
del model_base
torch.cuda.empty_cache()
print("\n  [Memory cleared]")

In [None]:
# Cell 5: Load and Analyze INSTRUCT Model with 3-Seed Averaging

seed_results_inst = []

print(f"\n[3/4] Loading INSTRUCT: {pair_config['instruct']}")

tokenizer_inst = AutoTokenizer.from_pretrained(pair_config['instruct'])
model_inst = AutoModelForCausalLM.from_pretrained(
    pair_config['instruct'],
    torch_dtype=DTYPE,  # E11-v3: bfloat16
    device_map='auto',
    trust_remote_code=True,
    attn_implementation="eager"  # CRITICAL: SDPA doesn't return attentions!
)

# CRITICAL: Set eval mode to disable dropout (Codex fix)
model_inst.eval()

if tokenizer_inst.pad_token is None:
    tokenizer_inst.pad_token = tokenizer_inst.eos_token

print(f"\n[4/4] Extracting INSTRUCT head activations (3-seed average)...")

for seed in SEEDS:
    print(f"\n  --- Seed {seed} ---")
    torch.manual_seed(seed)
    np.random.seed(seed)
    
    inst_activations = extract_head_activations(model_inst, tokenizer_inst, STANDARD_PROMPTS, max_length=MAX_LENGTH)
    inst_entropies = compute_head_entropy_profiles(inst_activations['attention_patterns'])
    spec_metrics = compute_specialization_metrics(inst_entropies)
    
    seed_results_inst.append({
        'seed': seed,
        'specialization': spec_metrics,
        'entropies': inst_entropies.tolist()
    })
    print(f"  Specialization Index: {spec_metrics['specialization_index']:.4f}")
    print(f"  Mean Head Correlation: {spec_metrics['mean_head_correlation']:.4f}")

# Aggregate across seeds (mean)
print(f"\n  Computing 3-seed average...")
avg_si_inst = np.mean([r['specialization']['specialization_index'] for r in seed_results_inst])
avg_corr_inst = np.mean([r['specialization']['mean_head_correlation'] for r in seed_results_inst])
avg_var_inst = np.mean([r['specialization']['mean_head_variance'] for r in seed_results_inst])

# Store aggregated results
results['instruct']['specialization'] = {
    'specialization_index': float(avg_si_inst),
    'mean_head_correlation': float(avg_corr_inst),
    'mean_head_variance': float(avg_var_inst),
    'effective_heads': float(np.mean([r['specialization']['effective_heads'] for r in seed_results_inst])),
    'effective_ratio': float(np.mean([r['specialization']['effective_ratio'] for r in seed_results_inst])),
    'layer_variances': seed_results_inst[0]['specialization']['layer_variances'],
    'early_variance': float(np.mean([r['specialization']['early_variance'] for r in seed_results_inst])),
    'middle_variance': float(np.mean([r['specialization']['middle_variance'] for r in seed_results_inst])),
    'late_variance': float(np.mean([r['specialization']['late_variance'] for r in seed_results_inst])),
    'head_correlation_matrix': seed_results_inst[0]['specialization']['head_correlation_matrix'],
    'num_layers': seed_results_inst[0]['specialization']['num_layers'],
    'num_heads': seed_results_inst[0]['specialization']['num_heads']
}
results['instruct']['seed_results'] = seed_results_inst

print(f"\n  === INSTRUCT AGGREGATED (3-seed) ===")
print(f"  Specialization Index: {avg_si_inst:.4f}")
print(f"  Effective Heads: {results['instruct']['specialization']['effective_heads']:.2f} / {results['instruct']['specialization']['num_heads']}")
print(f"  Mean Head Correlation: {avg_corr_inst:.4f}")

# Free memory
del model_inst
torch.cuda.empty_cache()
print("\n  [Memory cleared]")

In [None]:
# Cell 6: Hypothesis Test - Territorial Collapse

print(f"\n{'='*70}")
print(f"E11 TERRITORIAL COLLAPSE RESULTS: {PAIR.upper()}")
print(f"{'='*70}")

# Extract key metrics
base_spec = results['base']['specialization']
inst_spec = results['instruct']['specialization']

base_si = base_spec['specialization_index']
inst_si = inst_spec['specialization_index']
delta_si = inst_si - base_si

base_eff = base_spec['effective_ratio']
inst_eff = inst_spec['effective_ratio']
delta_eff = inst_eff - base_eff

base_corr = base_spec['mean_head_correlation']
inst_corr = inst_spec['mean_head_correlation']
delta_corr = inst_corr - base_corr

base_var = base_spec['mean_head_variance']
inst_var = inst_spec['mean_head_variance']
delta_var = inst_var - base_var

print(f"\n{'Metric':<35} {'BASE':>12} {'INSTRUCT':>12} {'Delta':>12}")
print("-" * 75)
print(f"{'Specialization Index':<35} {base_si:>12.4f} {inst_si:>12.4f} {delta_si:>+12.4f}")
print(f"{'Effective Head Ratio':<35} {base_eff:>12.4f} {inst_eff:>12.4f} {delta_eff:>+12.4f}")
print(f"{'Mean Head Correlation':<35} {base_corr:>12.4f} {inst_corr:>12.4f} {delta_corr:>+12.4f}")
print(f"{'Mean Head Variance':<35} {base_var:>12.6f} {inst_var:>12.6f} {delta_var:>+12.6f}")

# Layer-wise analysis
print(f"\n{'Layer Region':<35} {'BASE Var':>12} {'INST Var':>12} {'Delta':>12}")
print("-" * 75)
print(f"{'Early Layers':<35} {base_spec['early_variance']:>12.6f} {inst_spec['early_variance']:>12.6f} {inst_spec['early_variance'] - base_spec['early_variance']:>+12.6f}")
print(f"{'Middle Layers (L*)':<35} {base_spec['middle_variance']:>12.6f} {inst_spec['middle_variance']:>12.6f} {inst_spec['middle_variance'] - base_spec['middle_variance']:>+12.6f}")
print(f"{'Late Layers':<35} {base_spec['late_variance']:>12.6f} {inst_spec['late_variance']:>12.6f} {inst_spec['late_variance'] - base_spec['late_variance']:>+12.6f}")

# Hypothesis Test
print(f"\n{'='*70}")
print("HYPOTHESIS TEST: Does RLHF cause TERRITORIAL COLLAPSE?")
print(f"{'='*70}")

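# Criteria for territorial collapse:
# 1. Specialization Index DECREASES (heads become more similar)
# 2. Mean Head Correlation INCREASES (heads respond more uniformly)
# 3. Head Variance DECREASES (less diversity)
# Note: SI = 1 - mean correlation, so criteria 1 and 2 always agree;
# in practice the verdict rests on two independent signals.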

collapse_1 = delta_si < 0  # Specialization decreased
collapse_2 = delta_corr > 0  # Correlation increased
collapse_3 = delta_var < 0  # Variance decreased

print(f"\n  [1] Specialization decreased:    {'YES' if collapse_1 else 'NO'} ({delta_si:+.4f})")
print(f"  [2] Head correlation increased:  {'YES' if collapse_2 else 'NO'} ({delta_corr:+.4f})")
print(f"  [3] Head variance decreased:     {'YES' if collapse_3 else 'NO'} ({delta_var:+.6f})")

collapse_count = sum([collapse_1, collapse_2, collapse_3])

print(f"\n{'='*70}")
if collapse_count >= 2:
    verdict = "A_CONFIRMED"
    print(f"VERDICT: {verdict}")
    print("RLHF causes TERRITORIAL COLLAPSE - heads lose specialization!")
elif collapse_count == 1:
    verdict = "B_PARTIAL"
    print(f"VERDICT: {verdict}")
    print("Partial evidence for territorial collapse.")
else:
    verdict = "C_REFUTED"
    print(f"VERDICT: {verdict}")
    print("No evidence for territorial collapse - RLHF preserves specialization.")
print(f"{'='*70}")

# Store verdict
results['verdict'] = {
    'code': verdict,
    'specialization_decreased': collapse_1,
    'correlation_increased': collapse_2,
    'variance_decreased': collapse_3,
    'delta_specialization': delta_si,
    'delta_correlation': delta_corr,
    'delta_variance': delta_var
}


In [None]:
# Cell 7: Visualization

fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# Plot 1: Specialization Index Comparison
ax1 = axes[0, 0]
models = ['Base', 'Instruct']
si_vals = [base_si, inst_si]
colors = ['#2ecc71', '#e74c3c']
bars = ax1.bar(models, si_vals, color=colors, alpha=0.8, edgecolor='black')
ax1.set_ylabel('Specialization Index')
ax1.set_title(f'{PAIR.upper()}: Specialization Index\n(Higher = More Unique Roles)')
ax1.set_ylim(0, 1)
for bar, val in zip(bars, si_vals):
    ax1.annotate(f'{val:.4f}', xy=(bar.get_x() + bar.get_width()/2, val),
                 xytext=(0, 5), textcoords='offset points', ha='center', fontsize=12)
ax1.annotate(f'Δ = {delta_si:+.4f}', xy=(0.5, 0.95), xycoords='axes fraction',
             ha='center', fontsize=14, color='red' if delta_si < 0 else 'green',
             fontweight='bold')

# Plot 2: Head Correlation Comparison
ax2 = axes[0, 1]
corr_vals = [base_corr, inst_corr]
bars = ax2.bar(models, corr_vals, color=colors, alpha=0.8, edgecolor='black')
ax2.set_ylabel('Mean Head Correlation')
ax2.set_title(f'{PAIR.upper()}: Head Correlation\n(Lower = More Independent)')
for bar, val in zip(bars, corr_vals):
    ax2.annotate(f'{val:.4f}', xy=(bar.get_x() + bar.get_width()/2, val),
                 xytext=(0, 5), textcoords='offset points', ha='center', fontsize=12)
ax2.annotate(f'Δ = {delta_corr:+.4f}', xy=(0.5, 0.95), xycoords='axes fraction',
             ha='center', fontsize=14, color='red' if delta_corr > 0 else 'green',
             fontweight='bold')

# Plot 3: Layer-wise Variance
ax3 = axes[0, 2]
base_layer_var = base_spec['layer_variances']
inst_layer_var = inst_spec['layer_variances']
layers = range(len(base_layer_var))
ax3.plot(layers, base_layer_var, 'o-', color='#2ecc71', label='Base', linewidth=2, markersize=4)
ax3.plot(layers, inst_layer_var, 's-', color='#e74c3c', label='Instruct', linewidth=2, markersize=4)
ax3.set_xlabel('Layer')
ax3.set_ylabel('Head Variance')
ax3.set_title(f'{PAIR.upper()}: Layer-wise Head Variance\n(Higher = More Diverse Heads)')
ax3.grid(True, alpha=0.3)

# Mark L* region (middle third)
num_layers = len(base_layer_var)
third = num_layers // 3
ax3.axvspan(third, 2*third, alpha=0.2, color='yellow', label='L* Region')
ax3.legend()  # Called after axvspan so the L* region appears in the legend

# Plot 4: Head Correlation Heatmap (Base)
ax4 = axes[1, 0]
base_corr_matrix = np.array(base_spec['head_correlation_matrix'])
sns.heatmap(base_corr_matrix, cmap='RdBu_r', center=0, vmin=-1, vmax=1,
            ax=ax4, cbar_kws={'label': 'Correlation'})
ax4.set_title(f'{PAIR.upper()} BASE: Head Correlation Matrix')
ax4.set_xlabel('Head')
ax4.set_ylabel('Head')

# Plot 5: Head Correlation Heatmap (Instruct)
ax5 = axes[1, 1]
inst_corr_matrix = np.array(inst_spec['head_correlation_matrix'])
sns.heatmap(inst_corr_matrix, cmap='RdBu_r', center=0, vmin=-1, vmax=1,
            ax=ax5, cbar_kws={'label': 'Correlation'})
ax5.set_title(f'{PAIR.upper()} INSTRUCT: Head Correlation Matrix')
ax5.set_xlabel('Head')
ax5.set_ylabel('Head')

# Plot 6: Summary Metrics
ax6 = axes[1, 2]
metrics = ['Specialization\nIndex', 'Effective\nHead Ratio', '1 - Correlation']
base_vals = [base_si, base_eff, 1 - base_corr]
inst_vals = [inst_si, inst_eff, 1 - inst_corr]

x = np.arange(len(metrics))
width = 0.35

bars1 = ax6.bar(x - width/2, base_vals, width, label='Base', color='#2ecc71', alpha=0.8)
bars2 = ax6.bar(x + width/2, inst_vals, width, label='Instruct', color='#e74c3c', alpha=0.8)

ax6.set_ylabel('Value')
ax6.set_title(f'{PAIR.upper()}: Specialization Summary\n(All Higher = Better Specialization)')
ax6.set_xticks(x)
ax6.set_xticklabels(metrics)
ax6.legend()
ax6.set_ylim(0, 1.1)

# Annotate deltas
for i, (b, inst) in enumerate(zip(base_vals, inst_vals)):
    delta = inst - b
    color = 'red' if delta < 0 else 'green'
    ax6.annotate(f'{delta:+.3f}', xy=(i, max(b, inst) + 0.05), ha='center', fontsize=10, color=color)

plt.tight_layout()
fig_path = f'figures/E11_Territorial_Collapse_{PAIR}_{TIMESTAMP}.png'
plt.savefig(fig_path, dpi=150, bbox_inches='tight')
plt.show()

print(f"\nFigure saved: {fig_path}")


In [None]:
# Cell 8: Save Results with E11-v3 Methodology Block

filename = f'results/E11_territorial_collapse_{PAIR}_{TIMESTAMP}.json'

# Prepare for JSON serialization
def convert_to_native(obj):
    """Recursively convert numpy types to native Python types."""
    if isinstance(obj, dict):
        return {k: convert_to_native(v) for k, v in obj.items()}
    elif isinstance(obj, list):
        return [convert_to_native(v) for v in obj]
    elif isinstance(obj, tuple):
        return tuple(convert_to_native(v) for v in obj)
    elif isinstance(obj, np.bool_):
        return bool(obj)
    elif isinstance(obj, np.integer):
        return int(obj)
    elif isinstance(obj, np.floating):
        return float(obj)
    elif isinstance(obj, np.ndarray):
        return obj.tolist()
    else:
        return obj

output = {
    'experiment': 'E11_Territorial_Collapse',
    'timestamp': TIMESTAMP,
    'pair': PAIR,
    'config': pair_config,
    
    # E11-v3 METHODOLOGY BLOCK (REQUIRED!)
    'methodology': {
        'standard': 'E11-v3',
        'seeds': SEEDS,
        'max_length': MAX_LENGTH,
        'dtype': str(DTYPE),
        'prompt_md5': actual_md5,
        'num_prompts': len(STANDARD_PROMPTS),
        'quantization': 'NONE (unquantized bfloat16)',
        'use_chat_template': False
    },
    
    'prompt_set': 'Standard-10 v3',
    'hypothesis': 'RLHF reduces head specialization (territorial collapse)',
    'universe_25_mapping': {
        'phenomenon': 'Territorial Collapse',
        'calhoun_observation': 'Dominant males stopped defending territories, hierarchy collapsed',
        'llm_equivalent': 'Attention heads lose unique roles, become more uniform'
    },
    'results': convert_to_native(results),
    
    # Runtime info
    'runtime': {
        'gpu': torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU',
        'gpu_memory_gb': torch.cuda.get_device_properties(0).total_memory / 1e9 if torch.cuda.is_available() else 0,
        'dtype': str(DTYPE)
    }
}

with open(filename, 'w') as f:
    json.dump(output, f, indent=2)

print(f"Results saved: {filename}")
print(f"\n=== E11-v3 METHODOLOGY COMPLIANCE ===")
print(f"  Seeds: {SEEDS} ✅")
print(f"  MAX_LENGTH: {MAX_LENGTH} ✅")
print(f"  dtype: {DTYPE} ✅")
print(f"  Prompt MD5: {actual_md5} ✅")
print(f"  Quantization: None (unquantized bfloat16) ✅")

# Download link for Colab
try:
    from google.colab import files
    files.download(filename)
    files.download(fig_path)
except Exception:
    pass  # Not running in Colab; files remain on disk

---

## Summary

### E11: Territorial Collapse - Head Specialization Loss

**Universe 25 Mapping:**
- Calhoun observed dominant males stopping territory defense
- The hierarchy collapsed and all mice became "pansexual" (responded to everything equally)
- LLM equivalent: RLHF might flatten head specialization

**Key Metrics:**
1. **Specialization Index** = 1 - mean_head_correlation
   - High value = heads have unique roles (healthy hierarchy)
   - Low value = heads are uniform (territorial collapse)

2. **Effective Number of Heads** = 2^H, where H is the Shannon entropy of the normalized per-head contributions
   - If all heads contribute equally: effective = actual
   - If few heads dominate: effective << actual

3. **Head Correlation Matrix**
   - Shows which heads behave similarly
   - RLHF might increase correlations (uniformity)
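
The two limiting cases for the effective number of heads can be checked numerically. This is a minimal sketch assuming the entropy-based definition used in `compute_specialization_metrics` (2 raised to the Shannon entropy of the normalized per-head contributions); the helper name `effective_count` and the example contribution vectors are hypothetical.

```python
import numpy as np


def effective_count(contributions) -> float:
    """2 ** (Shannon entropy in bits) of normalized non-negative contributions."""
    p = np.asarray(contributions, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                       # 0 * log(0) is treated as 0
    entropy_bits = -(p * np.log2(p)).sum()
    return float(2.0 ** entropy_bits)


equal = effective_count(np.ones(32))             # all 32 heads contribute equally
skewed = effective_count([1.0] + [1e-3] * 31)    # one head dominates

print(f"equal contributions:  {equal:.2f} effective heads")
print(f"one dominant head:    {skewed:.2f} effective heads")
```

With equal contributions the effective count matches the actual head count, and it drops toward 1 as a single head dominates.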

### Connection to Paper 3

Paper 3 established:
- Head density ρ = H/d_head creates "crowding"
- Crowding → forced consensus → dampening

E11 extends this:
- Does RLHF EXACERBATE the crowding effect?
- Even if thermodynamic sign is invariant, does specialization decrease?

### Connection to Paper 4

E01 measured individual Beautiful Ones (heads with low contribution).
E11 measures COLLECTIVE uniformity - the loss of the hierarchy itself.

**The Territorial Collapse Hypothesis:**
> RLHF doesn't just create individual Beautiful Ones - it collapses the entire hierarchy of head specialization.

---

*Paper 4: Behavioral Sink Dynamics*
*E11: Territorial Collapse - Head Specialization Loss*

In [None]:
# Cell 10: Artifact Log (for JSONL export)

# Create artifact log entry
artifact_entry = {
    'experiment': 'E11',
    'timestamp': TIMESTAMP,
    'model_pair': PAIR,
    'base_model': pair_config['base'],
    'instruct_model': pair_config['instruct'],
    'verdict': results['verdict']['code'],
    'base_specialization': base_si,
    'instruct_specialization': inst_si,
    'delta_specialization': delta_si,
    'base_correlation': base_corr,
    'instruct_correlation': inst_corr,
    'delta_correlation': delta_corr,
    'prompt_count': len(STANDARD_PROMPTS),
    'files': {
        'results': filename,
        'figure': fig_path
    }
}

# Append to artifact log
artifact_log = 'results/E11_artifact_log.jsonl'
with open(artifact_log, 'a') as f:
    f.write(json.dumps(artifact_entry) + '\n')

print(f"Artifact log appended: {artifact_log}")
print(f"\nEntry: {json.dumps(artifact_entry, indent=2)}")


In [None]:
# ============================================================================
# AUTO-DOWNLOAD RESULTS (Colab only)
# ============================================================================
import glob
import shutil

def auto_download_results():
    try:
        from google.colab import files
    except ImportError:
        print('Not in Colab - skipping auto-download')
        return
    
    print('=' * 60)
    print('AUTO-DOWNLOADING RESULTS...')
    print('=' * 60)
    
    # Find all result files
    json_files = glob.glob('results/*.json') + glob.glob('results/*.jsonl') + glob.glob('figures/*.json')
    png_files = glob.glob('results/*.png') + glob.glob('figures/*.png')
    all_files = json_files + png_files
    
    if not all_files:
        print('WARNING: No result files found!')
        return
    
    print(f'Found {len(all_files)} files')
    
    # Download as ZIP (os is already imported in Cell 1)
    zip_name = f'E11_results_{os.path.basename(os.getcwd())}'
    
    # Create combined folder
    os.makedirs('download_package', exist_ok=True)
    for f in all_files:
        shutil.copy(f, 'download_package/')
    
    shutil.make_archive(zip_name, 'zip', 'download_package')
    print(f'Downloading: {zip_name}.zip')
    files.download(f'{zip_name}.zip')
    print('DOWNLOAD COMPLETE!')

auto_download_results()