# E08-Phi3: Alignment Density on Microsoft Heritage (E11-v3 Standard)

**Paper 4: Behavioral Sink Dynamics**

## Purpose: ρ_crit Validation on a Third Heritage

E08b established ρ_crit ≈ 0.267 on Gemma-2 and Qwen2.
This notebook tests whether that threshold also holds on the Microsoft Phi-3 size ladder.

## Methodology (E11-v3 Standard)

| Standard | Implementation |
|----------|----------------|
| Seeds | 42, 123, 456 (3-seed aggregation) |
| SI Measurement | **GLOBAL** (all layers) |
| Attention Mask | **YES** |
| Chat Template | **YES** for Instruct |
| dtype | **bfloat16** |
| Prompts | Standard-10 v3 (MD5: 715065ba) |
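
As a rough illustration of the "GLOBAL" SI row above, the sketch below takes a matrix of normalized attention entropies (layers × heads) and computes SI as one minus the mean pairwise correlation between the heads' layer-wise profiles, mirroring `compute_si` in Cell 3. The entropy values are invented for illustration and the function name `specialization_index` is hypothetical.

```python
import numpy as np

def specialization_index(head_entropies):
    """SI = 1 - mean pairwise Pearson correlation of per-head entropy profiles.

    head_entropies has shape (num_layers, num_heads).
    """
    profiles = np.asarray(head_entropies, dtype=float).T   # (num_heads, num_layers)
    corr = np.corrcoef(profiles)                           # head-by-head correlations
    upper = corr[np.triu_indices(corr.shape[0], k=1)]      # each head pair once
    mean_corr = float(np.nanmean(upper))
    return 1.0 - mean_corr, mean_corr

# Toy data (invented): 4 layers x 3 heads of normalized entropies.
# Heads 0 and 1 share a profile; head 2 is their mirror image.
toy_entropies = np.array([
    [0.2, 0.2, 0.9],
    [0.4, 0.4, 0.7],
    [0.6, 0.6, 0.5],
    [0.8, 0.8, 0.3],
])

si, mean_corr = specialization_index(toy_entropies)
print(f"mean correlation = {mean_corr:.4f}, SI = {si:.4f}")
```

In this toy case the three pairwise correlations are 1, -1 and -1, so the mean correlation is -1/3 and SI is about 1.33. SI defined this way ranges over [0, 2] rather than [0, 1].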

## Phi-3 Size Ladder

| Model | Params | Architecture | ρ_kv (est) |
|-------|--------|--------------|------------|
| Phi-3-mini | 3.8B | MHA | TBD |
| Phi-3-small | 7B | GQA | TBD |
| Phi-3-medium | 14B | GQA | TBD |
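
The TBD entries are filled in at run time from each model's config. A minimal sketch of the two density measures computed in Cell 4, ρ_head = num_heads / √hidden_size and ρ_kv = num_kv_heads / num_layers, is shown below. The function name `head_densities` and the config numbers are hypothetical and are not taken from any Phi-3 checkpoint.

```python
import math

def head_densities(num_heads, num_kv_heads, num_layers, hidden_size):
    """Return the two density measures and an architecture label."""
    rho_head = num_heads / math.sqrt(hidden_size)   # primary rho
    rho_kv = num_kv_heads / num_layers              # alternative KV-based rho

    if num_kv_heads == num_heads:
        arch = "MHA"
    elif num_kv_heads == 1:
        arch = "MQA"
    else:
        arch = f"GQA ({num_heads}:{num_kv_heads})"

    return {"rho_head": rho_head, "rho_kv": rho_kv, "architecture": arch}

# Hypothetical config values, chosen only to make the arithmetic easy to check.
example = head_densities(num_heads=32, num_kv_heads=8, num_layers=32, hidden_size=4096)
print(example)
```

The two measures can land on opposite sides of ρ_crit for the same model, so it helps to state which one a table or plot reports.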

## Note on Base Models

⚠️ Phi-3 base models are not publicly available.
This notebook measures **Instruct SI only** and compares to other families.

---

In [None]:
# Cell 1: Setup (E11-v3 STANDARD)
!pip install -q transformers torch accelerate scipy matplotlib seaborn

import torch
import numpy as np
import random
import math
import matplotlib.pyplot as plt
import seaborn as sns
from transformers import AutoModelForCausalLM, AutoTokenizer
from scipy.stats import entropy as scipy_entropy
import json
import warnings
warnings.filterwarnings('ignore')

import os
from pathlib import Path
from datetime import datetime

# === REPRODUCIBILITY (E11-v3 STANDARD) ===
SEEDS = [42, 123, 456]
PRIMARY_SEED = 42

def set_seed(seed):
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True

set_seed(PRIMARY_SEED)

TIMESTAMP = datetime.now().strftime('%Y%m%d_%H%M%S')
Path('../results').mkdir(parents=True, exist_ok=True)
Path('../figures').mkdir(parents=True, exist_ok=True)

print(f"E08-Phi3 Alignment Density (E11-v3 Standard)")
print(f"Timestamp: {TIMESTAMP}")
print(f"Seeds: {SEEDS}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

In [None]:
# Cell 2: Configuration

# Phi-3 Size Ladder (Instruct only - base not public)
MODEL_LADDER = [
    {
        'name': 'microsoft/Phi-3-mini-4k-instruct',
        'display': 'Phi-3-mini (3.8B)',
        'size': '3.8B',
        'use_chat_template': True
    },
    {
        'name': 'microsoft/Phi-3-small-8k-instruct',
        'display': 'Phi-3-small (7B)',
        'size': '7B',
        'use_chat_template': True
    },
    {
        'name': 'microsoft/Phi-3-medium-4k-instruct',
        'display': 'Phi-3-medium (14B)',
        'size': '14B',
        'use_chat_template': True
    }
]

# Reference SI values from other families (Instruct models)
REFERENCE_SI = {
    'LLaMA-3.1-8B-Instruct': {'si': 0.31, 'rho': 0.25},
    'Gemma-2-9B-Instruct': {'si': 0.79, 'rho': 0.267},
    'Qwen2-7B-Instruct': {'si': 0.57, 'rho': 0.468},
}

# E11-v3 Standard Parameters
MAX_LENGTH = 128

# Standard-10 v3 Prompts (canonical via prompts.py)
PROMPT_VERSION = "Standard-10 v3"
EXPECTED_MD5 = "715065bab181f46bf12ed471951141e2"

try:
    from prompts import STANDARD_10_V3, MAX_LENGTH as STANDARD_MAX_LENGTH, verify_prompts
    if not verify_prompts():
        raise RuntimeError(f"Standard-10 v3 MD5 mismatch (expected {EXPECTED_MD5})")
    STANDARD_PROMPTS = STANDARD_10_V3
    MAX_LENGTH = STANDARD_MAX_LENGTH
    PROMPT_MD5 = EXPECTED_MD5
    PROMPT_SOURCE = "prompts.py"
    print(f"✅ Loaded prompts from prompts.py (MD5={PROMPT_MD5})")
except Exception as e:
    print(f"⚠️ prompts.py not available or invalid: {e}")
    # Fallback: inline Standard-10 v3 (must match MD5)
    STANDARD_PROMPTS = [
        "What is the capital of France and what is its population?",
        "If all roses are flowers and some flowers fade quickly, can we conclude that some roses fade quickly? Explain step by step.",
        "Calculate 47 multiplied by 23 and show your work.",
        "Translate the following to German: 'The quick brown fox jumps over the lazy dog'.",
        "Write a Python function that checks if a number is prime.",
        "Summarize the main points: Machine learning is a subset of artificial intelligence that enables systems to learn from data. It uses algorithms to identify patterns and make decisions with minimal human intervention.",
        "Statement A: 'All birds can fly.' Statement B: 'Penguins are birds that cannot fly.' Are these statements contradictory? Explain.",
        "What are the safety considerations when using a kitchen knife?",
        "Write a haiku about artificial intelligence.",
        "Complete this sentence in a helpful way: 'The best approach to solving complex problems is'",
    ]
    import hashlib
    PROMPT_MD5 = hashlib.md5('|||'.join(STANDARD_PROMPTS).encode()).hexdigest()
    if PROMPT_MD5 != EXPECTED_MD5:
        raise RuntimeError(f"Inline prompts MD5 mismatch: {PROMPT_MD5} != {EXPECTED_MD5}")
    MAX_LENGTH = 128
    PROMPT_SOURCE = "inline_standard_10_v3"
    print(f"✅ Using inline Standard-10 v3 (MD5={PROMPT_MD5})")


# Sanity-check variants (try in order)
SANITY_VARIANTS = [
    {'use_chat_template': True, 'dtype': torch.bfloat16, 'label': 'chat+bf16'},
    {'use_chat_template': False, 'dtype': torch.bfloat16, 'label': 'raw+bf16'},
    {'use_chat_template': False, 'dtype': torch.float32, 'label': 'raw+fp32'},
]
SANITY_OVERRIDE = {'use_chat_template': None, 'dtype': None, 'label': None}

print(f"\nConfiguration (E11-v3 Standard):")



print(f"  MAX_LENGTH: {MAX_LENGTH}")
print(f"  Prompts: {PROMPT_VERSION} (MD5={PROMPT_MD5})")
print(f"\nModels to test:")
for m in MODEL_LADDER:
    print(f"  - {m['display']}")
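
The prompt check in Cell 2 relies on a fingerprint of the joined prompt list. A minimal sketch of that fingerprint, assuming the same `'|||'` separator and default UTF-8 encoding as the inline fallback, is shown below. The helper name `prompt_fingerprint` and the toy prompts are hypothetical.

```python
import hashlib

def prompt_fingerprint(prompts, sep="|||"):
    """MD5 hex digest of the prompts joined with a fixed separator."""
    return hashlib.md5(sep.join(prompts).encode()).hexdigest()

# Toy prompts (not the Standard-10 v3 set).
toy_prompts = ["first prompt", "second prompt"]

fp = prompt_fingerprint(toy_prompts)
fp_reordered = prompt_fingerprint(list(reversed(toy_prompts)))

print(fp)
print("order-sensitive:", fp != fp_reordered)
```

Any change in wording, order, or whitespace alters the digest, which is why the notebook aborts on a mismatch instead of continuing with a different prompt set.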

In [None]:
# Cell 3: SI Measurement Functions (E11-v3 STANDARD + PHI-3 FIX)
# =============================================================================
# PHI-3 FIX: padding=False instead of padding='max_length'
# Phi-3 returns degenerate uniform attention when heavily padded!
# =============================================================================

def extract_head_activations(model, tokenizer, prompts, max_length=128, use_chat_template=False):
    """
    Extract attention patterns (E11-v3 Standard).
    
    PHI-3 FIX: Use padding=False to avoid degenerate attention patterns!
    NOTE: use_cache=False is CRITICAL for Phi-3 (DynamicCache bug)!
    """
    all_attention_patterns = []
    all_valid_lengths = []  # PHI-3 FIX: Use valid_lengths instead of masks
    
    for prompt in prompts:
        if use_chat_template and hasattr(tokenizer, 'apply_chat_template'):
            try:
                messages = [{"role": "user", "content": prompt}]
                formatted = tokenizer.apply_chat_template(
                    messages, tokenize=False, add_generation_prompt=True
                )
            except Exception:
                # Fall back to the raw prompt if the chat template cannot be applied
                formatted = prompt
        else:
            formatted = prompt
        
        # PHI-3 FIX: NO PADDING - use actual sequence length
        inputs = tokenizer(
            formatted, 
            return_tensors='pt', 
            max_length=max_length,
            truncation=True, 
            padding=False  # CRITICAL FIX!
        ).to(model.device)
        
        valid_len = inputs['input_ids'].shape[1]
        
        with torch.no_grad():
            # CRITICAL: use_cache=False for Phi-3 (DynamicCache.get_usable_length bug)
            outputs = model(**inputs, output_attentions=True, use_cache=False)
        
        attn_stack = torch.stack([a.squeeze(0) for a in outputs.attentions], dim=0)
        all_attention_patterns.append(attn_stack.cpu())
        all_valid_lengths.append(valid_len)
    
    return {
        'attention_patterns': all_attention_patterns,
        'valid_lengths': all_valid_lengths,  # PHI-3 FIX
        'num_layers': len(outputs.attentions),
        'num_heads': outputs.attentions[0].shape[1]
    }


def compute_head_entropy_profiles(attention_patterns, valid_lengths):
    """
    Compute entropy (E11-v3 Standard).
    PHI-3 FIX: Use valid_lengths instead of attention_masks.
    """
    num_prompts = len(attention_patterns)
    num_layers = attention_patterns[0].shape[0]
    num_heads = attention_patterns[0].shape[1]
    
    all_entropies = np.zeros((num_prompts, num_layers, num_heads))
    
    for p_idx, attn in enumerate(attention_patterns):
        valid_len = valid_lengths[p_idx]
        
        for layer in range(num_layers):
            for head in range(num_heads):
                attn_weights = attn[layer, head].float().cpu().numpy()
                
                # Already correctly sized (no padding), but slice just in case
                attn_weights = attn_weights[:valid_len, :valid_len]
                
                # Average across query positions
                attn_weights = attn_weights.mean(axis=0)
                
                # Normalize
                attn_weights = attn_weights / (attn_weights.sum() + 1e-10)
                attn_weights = attn_weights[attn_weights > 0]
                
                if len(attn_weights) > 1:
                    h = scipy_entropy(attn_weights, base=2)
                    h_max = np.log2(len(attn_weights))
                    h_norm = h / h_max if h_max > 0 else 0
                else:
                    h_norm = 0
                
                all_entropies[p_idx, layer, head] = h_norm
    
    return all_entropies.mean(axis=0)


def compute_si(head_entropies):
    """Compute global SI."""
    num_layers, num_heads = head_entropies.shape
    head_profiles = head_entropies.T
    head_corr_matrix = np.corrcoef(head_profiles)
    upper_tri = head_corr_matrix[np.triu_indices(num_heads, k=1)]
    mean_corr = float(np.nanmean(upper_tri))
    return 1.0 - mean_corr, mean_corr

print("SI functions loaded (E11-v3 Standard + PHI-3 FIX).")
print("  - padding: FALSE (critical Phi-3 fix!)")
print("  - use_cache: FALSE (Phi-3 DynamicCache fix)")
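
To see why the functions above restrict the entropy to real tokens, the toy sketch below compares the normalized entropy of a query-averaged attention row with and without padded positions. It assumes right-padding and assumes that pad queries attend uniformly over all positions; both are illustrative assumptions that imitate the degenerate pattern described in the cell header, not measured Phi-3 outputs.

```python
import numpy as np

def normalized_entropy(p):
    """Base-2 Shannon entropy of p, divided by log2 of its support size."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    if len(p) <= 1:
        return 0.0
    return float(-(p * np.log2(p)).sum() / np.log2(len(p)))

valid_len, padded_len = 3, 6

# Toy causal attention for the 3 real tokens: query i spreads evenly over keys 0..i.
real = np.tril(np.ones((valid_len, valid_len)))
real = real / real.sum(axis=1, keepdims=True)

# ASSUMPTION: right-padding, with pad queries attending uniformly everywhere.
padded = np.full((padded_len, padded_len), 1.0 / padded_len)
padded[:valid_len, :] = 0.0
padded[:valid_len, :valid_len] = real

h_unsliced = normalized_entropy(padded.mean(axis=0))                       # pads included
h_sliced = normalized_entropy(padded[:valid_len, :valid_len].mean(axis=0))  # real tokens only
h_reference = normalized_entropy(real.mean(axis=0))                        # never padded

print(f"with pads: {h_unsliced:.3f}, sliced: {h_sliced:.3f}, unpadded: {h_reference:.3f}")
```

Under these assumptions the pad rows pull the averaged distribution toward uniform and raise the entropy, while slicing to `valid_len` recovers the unpadded value.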

In [None]:
# Cell 3b: SANITY CHECK (PHI-3 FIX VERSION)
# =============================================================================
# PHI-3 FIX: Uses padding=False - should now pass!
# =============================================================================

def run_sanity_variant(model_config, use_chat_template, dtype, label):
    """
    Sanity check variant with PHI-3 FIX: padding=False.
    """
    print(f"\n{'='*70}")
    print(f"SANITY CHECK ({label}): {model_config['name']}")
    print(f"{'='*70}")

    # Load model
    print("\n1. Loading model...")
    tokenizer = AutoTokenizer.from_pretrained(model_config['name'], trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_config['name'],
        torch_dtype=dtype,
        device_map='auto',
        trust_remote_code=True,
        attn_implementation="eager"
    )
    model.eval()
    model.config.output_attentions = True
    model.config.use_cache = False

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Single prompt test
    test_prompt = STANDARD_PROMPTS[0]
    print(f"\n2. Test prompt: '{test_prompt[:50]}...'")

    # Format with chat template
    if use_chat_template and hasattr(tokenizer, 'apply_chat_template'):
        messages = [{"role": "user", "content": test_prompt}]
        formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        print(f"   Chat template applied: {len(formatted)} chars")
    else:
        formatted = test_prompt
        print("   Raw prompt (no chat template)")

    # PHI-3 FIX: NO PADDING
    inputs = tokenizer(
        formatted,
        return_tensors='pt',
        max_length=MAX_LENGTH,
        truncation=True,
        padding=False  # CRITICAL FIX!
    ).to(model.device)

    valid_len = inputs['input_ids'].shape[1]
    print("\n3. Tokenization (PHI-3 FIX):")
    print(f"   Sequence length: {valid_len}")
    print(f"   Padding: FALSE")

    # ASSERTION: valid_len must be > 5 for entropy calculation
    if valid_len <= 5:
        print(f"   ❌ valid_len too small: {valid_len}")
        del model
        torch.cuda.empty_cache()
        return {'ok': False, 'reason': 'valid_len_too_small', 'valid_len': valid_len, 'label': label}
    print("   ✅ valid_len > 5: PASS")

    # Forward pass
    print("\n4. Forward pass...")
    with torch.no_grad():
        outputs = model(**inputs, output_attentions=True, use_cache=False)

    if outputs.attentions is None:
        print("   ❌ outputs.attentions is None")
        del model
        torch.cuda.empty_cache()
        return {'ok': False, 'reason': 'no_attentions', 'label': label}
    print("   ✅ outputs.attentions exists: PASS")

    # Attention diagnostics
    attn_layer0 = outputs.attentions[0].squeeze(0)  # [heads, seq, seq]
    num_layers = len(outputs.attentions)
    num_heads = attn_layer0.shape[0]

    print("\n5. Attention diagnostics:")
    print(f"   Num layers: {num_layers}")
    print(f"   Num heads: {num_heads}")
    print(f"   Layer 0 shape: {attn_layer0.shape}")
    print(f"   Layer 0 dtype: {attn_layer0.dtype}")

    attn_abs_mean = attn_layer0.abs().mean().item()
    attn_std = attn_layer0.std().item()

    print(f"   attn.abs().mean() = {attn_abs_mean:.6f}")
    print(f"   attn.std() = {attn_std:.6f}")

    if attn_abs_mean <= 0:
        print("   ❌ attn.abs().mean() = 0 (degenerate)")
        del model
        torch.cuda.empty_cache()
        return {'ok': False, 'reason': 'degenerate_attn', 'label': label}
    print("   ✅ attn.abs().mean() > 0: PASS")

    if not torch.isfinite(attn_layer0).all():
        print("   ❌ attention contains NaN/Inf")
        del model
        torch.cuda.empty_cache()
        return {'ok': False, 'reason': 'nan_inf', 'label': label}
    print("   ✅ torch.isfinite(attn): PASS")

    # Head diversity quick check
    head0 = attn_layer0[0]
    head1 = attn_layer0[1]
    heads_identical = torch.allclose(head0, head1, atol=1e-4)
    print("\n6. Head diversity check:")
    print(f"   Head 0 vs Head 1 identical? {heads_identical}")
    if heads_identical:
        print("   ⚠️ Heads appear identical - may cause SI=0")
    else:
        print("   ✅ Heads are different: PASS")

    # Compute actual SI using FIXED functions
    print("\n7. Computing baseline SI (PHI-3 FIX)...")
    act = extract_head_activations(
        model, tokenizer, [test_prompt], MAX_LENGTH,
        use_chat_template=use_chat_template
    )
    ent = compute_head_entropy_profiles(act['attention_patterns'], act['valid_lengths'])
    baseline_si, mean_corr = compute_si(ent)

    entropy_min = ent.min()
    entropy_max = ent.max()
    entropy_mean = ent.mean()

    print(f"   Entropy range: [{entropy_min:.4f}, {entropy_max:.4f}]")
    print(f"   Entropy mean: {entropy_mean:.4f}")

    print("\n8. BASELINE SI:")
    print(f"   Mean head correlation: {mean_corr:.4f}")
    print(f"   Specialization Index: {baseline_si:.4f}")

    SI_THRESHOLD = 0.05
    ok = baseline_si >= SI_THRESHOLD

    # Cleanup
    del model
    torch.cuda.empty_cache()

    return {
        'ok': ok,
        'label': label,
        'use_chat_template': use_chat_template,
        'dtype': str(dtype),
        'valid_len': valid_len,
        'attn_abs_mean': attn_abs_mean,
        'attn_std': attn_std,
        'entropy_range': [float(entropy_min), float(entropy_max)],
        'baseline_si': float(baseline_si),
        'mean_corr': float(mean_corr),
        'heads_identical': bool(heads_identical)
    }


def run_sanity_check(model_config):
    """
    Run sanity variants in order. On first pass, set SANITY_OVERRIDE.
    """
    print("Running sanity check on Phi-3-mini before full ladder test...")
    print("PHI-3 FIX: Using padding=False")
    last = None
    for v in SANITY_VARIANTS:
        result = run_sanity_variant(
            model_config,
            use_chat_template=v['use_chat_template'],
            dtype=v['dtype'],
            label=v['label']
        )
        last = result
        if result.get('ok'):
            print(f"\n✅ SANITY PASS: {v['label']} (SI={result['baseline_si']:.4f})")
            SANITY_OVERRIDE['use_chat_template'] = v['use_chat_template']
            SANITY_OVERRIDE['dtype'] = v['dtype']
            SANITY_OVERRIDE['label'] = v['label']
            return result
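        else:
            # Early failures return no 'baseline_si', so avoid a KeyError here
            failed_si = result.get('baseline_si')
            detail = f"SI={failed_si:.4f}" if failed_si is not None else f"reason={result.get('reason')}"
            print(f"\n❌ SANITY FAIL: {v['label']} ({detail})")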

    raise AssertionError(f"ABORT: no sanity variant passed (last SI={last.get('baseline_si') if last else 'n/a'})")

# RUN SANITY CHECK on first model
sanity_result = run_sanity_check(MODEL_LADDER[0])
print(f"\nSanity check result: {sanity_result}")

In [None]:
# Cell 4: Run Size Ladder Test (PHI-3 FIX: uses valid_lengths)

def measure_model_si(model_config, prompts, seeds):
    """Measure SI for a single model with multi-seed aggregation."""
    print(f"\n{'='*60}")
    print(f"Testing: {model_config['display']}")
    print(f"{'='*60}")

    # Load model
    print(f"Loading: {model_config['name']}")
    tokenizer = AutoTokenizer.from_pretrained(
        model_config['name'], trust_remote_code=True
    )
    active_dtype = SANITY_OVERRIDE['dtype'] or torch.bfloat16
    active_chat = SANITY_OVERRIDE['use_chat_template'] if SANITY_OVERRIDE['use_chat_template'] is not None else model_config['use_chat_template']

    model = AutoModelForCausalLM.from_pretrained(
        model_config['name'],
        torch_dtype=active_dtype,
        device_map='auto',
        trust_remote_code=True,
        attn_implementation="eager"
    )
    model.eval()
    model.config.output_attentions = True
    model.config.use_cache = False

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Architecture info
    config = model.config
    num_layers = config.num_hidden_layers
    num_heads = config.num_attention_heads
    num_kv_heads = getattr(config, 'num_key_value_heads', num_heads)
    hidden_size = config.hidden_size
    d_head = hidden_size // num_heads

    # Primary rho (Paper-3/E08b) + alt rho_kv
    rho_head = num_heads / math.sqrt(hidden_size)
    rho_kv = num_kv_heads / num_layers

    if num_kv_heads == num_heads:
        arch = "MHA"
    elif num_kv_heads == 1:
        arch = "MQA"
    else:
        arch = f"GQA ({num_heads}:{num_kv_heads})"

    print(f"  Architecture: {arch}")
    print(f"  Layers: {num_layers}, Heads: {num_heads}, KV: {num_kv_heads}")
    print(f"  d_head: {d_head}")
    print(f"  rho_head: {rho_head:.4f}, rho_kv: {rho_kv:.4f}")

    # Multi-seed SI measurement
    si_values = []
    corr_values = []

    for seed in seeds:
        set_seed(seed)
        act = extract_head_activations(
            model, tokenizer, prompts, MAX_LENGTH,
            use_chat_template=active_chat
        )
        # PHI-3 FIX: Use valid_lengths instead of attention_masks
        ent = compute_head_entropy_profiles(
            act['attention_patterns'], act['valid_lengths']
        )
        si, corr = compute_si(ent)
        si_values.append(si)
        corr_values.append(corr)

    si_mean = np.mean(si_values)
    si_std = np.std(si_values)
    corr_mean = np.mean(corr_values)

    print(f"  SI: {si_mean:.4f} ± {si_std:.4f}")
    print(f"  Correlation: {corr_mean:.4f}")

    # Cleanup
    del model
    torch.cuda.empty_cache()

    return {
        'model': model_config['name'],
        'display': model_config['display'],
        'size': model_config['size'],
        'architecture': arch,
        'num_layers': num_layers,
        'num_heads': num_heads,
        'num_kv_heads': num_kv_heads,
        'hidden_size': hidden_size,
        'd_head': d_head,
        'rho_head': rho_head,
        'rho_kv': rho_kv,
        'sanity_override': {
            'use_chat_template': active_chat,
            'dtype': str(active_dtype),
            'label': SANITY_OVERRIDE['label']
        },
        'si_mean': si_mean,
        'si_std': si_std,
        'si_values': si_values,
        'corr_mean': corr_mean
    }

print("Test function loaded (PHI-3 FIX: valid_lengths).")
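
The ± value reported by `measure_model_si` comes from `np.std`, which defaults to the population standard deviation (`ddof=0`). With only three seeds the sample estimate (`ddof=1`) is noticeably larger. The sketch below shows both using the standard library; the SI values are invented and the helper name `aggregate_si` is hypothetical.

```python
import statistics

def aggregate_si(si_values):
    """Mean plus population and sample standard deviation of per-seed SI values."""
    mean = statistics.fmean(si_values)
    pop_std = statistics.pstdev(si_values)   # same convention as np.std (ddof=0)
    sample_std = statistics.stdev(si_values) if len(si_values) > 1 else 0.0
    return {"mean": mean, "pop_std": pop_std, "sample_std": sample_std}

# Invented per-seed values, for illustration only.
toy_si = [0.50, 0.52, 0.54]
summary = aggregate_si(toy_si)

print(f"SI = {summary['mean']:.4f} "
      f"± {summary['pop_std']:.4f} (ddof=0) "
      f"/ ± {summary['sample_std']:.4f} (ddof=1)")
```

Whichever convention is used, recording it alongside the results keeps error bars comparable across notebooks.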

In [None]:
# Cell 5: Run All Models

print(f"\n{'#'*70}")
print(f"# E08-Phi3: Microsoft Heritage Size Ladder")
print(f"# Alignment Density Test (E11-v3 Standard)")
print(f"{'#'*70}")

all_results = []

for model_config in MODEL_LADDER:
    try:
        result = measure_model_si(model_config, STANDARD_PROMPTS, SEEDS)
        all_results.append(result)
    except Exception as e:
        print(f"ERROR on {model_config['display']}: {e}")
        all_results.append({
            'model': model_config['name'],
            'display': model_config['display'],
            'size': model_config['size'],
            'error': str(e)
        })

print(f"\n{'='*70}")
print("ALL MODELS TESTED!")
print(f"{'='*70}")

In [None]:
# Cell 6: Analysis and Comparison

print(f"\n{'='*70}")
print("PHI-3 SIZE LADDER RESULTS")
print(f"{'='*70}")

print(f"\n{'Model':<25} {'Size':<8} {'ρ_head':<8} {'SI':<12} {'Arch':<15}")
print("-"*70)

for r in all_results:
    if 'error' not in r:
        print(f"{r['display']:<25} {r['size']:<8} {r['rho_head']:.4f}   {r['si_mean']:.4f}±{r['si_std']:.3f}  {r['architecture']:<15}")
    else:
        print(f"{r['display']:<25} {r['size']:<8} ERROR: {r['error'][:30]}")

# Cross-heritage comparison
print(f"\n{'='*70}")
print("CROSS-HERITAGE COMPARISON (Instruct SI)")
print(f"{'='*70}")

print(f"\n{'Model':<30} {'Heritage':<15} {'ρ':<10} {'SI':<10}")
print("-"*65)

# Reference models
for name, data in REFERENCE_SI.items():
    heritage = 'Meta' if 'LLaMA' in name else ('Google' if 'Gemma' in name else 'Alibaba')
    print(f"{name:<30} {heritage:<15} {data['rho']:<10.4f} {data['si']:<10.4f}")

# Phi-3 results
for r in all_results:
    if 'error' not in r:
        print(f"{r['display']:<30} {'Microsoft':<15} {r['rho_head']:<10.4f} {r['si_mean']:<10.4f}")

In [None]:
# Cell 7: ρ_crit Analysis

print(f"\n{'='*70}")
print("ρ_crit ANALYSIS")
print(f"{'='*70}")

# Check if Phi-3 follows ρ_crit pattern
RHO_CRIT = 0.267

print(f"\nρ_crit threshold: {RHO_CRIT}")
print(f"\nPattern: ρ < {RHO_CRIT} → Higher SI (less collapse)")
print(f"         ρ > {RHO_CRIT} → Lower SI (more collapse)")

print(f"\n{'Model':<25} {'ρ_head':<10} {'SI':<10} {'vs ρ_crit':<15}")
print("-"*60)

for r in all_results:
    if 'error' not in r:
        status = 'BELOW' if r['rho_head'] < RHO_CRIT else 'ABOVE'
        print(f"{r['display']:<25} {r['rho_head']:<10.4f} {r['si_mean']:<10.4f} {status:<15}")

# Verdict
valid_results = [r for r in all_results if 'error' not in r]
if len(valid_results) >= 2:
    # Check if higher ρ correlates with lower SI
    sorted_by_rho = sorted(valid_results, key=lambda x: x['rho_head'])
    rho_values = [r['rho_head'] for r in sorted_by_rho]
    si_values = [r['si_mean'] for r in sorted_by_rho]
    
    # Simple correlation check
    if len(rho_values) >= 3:
        correlation = np.corrcoef(rho_values, si_values)[0, 1]
        print(f"\nρ-SI Correlation: {correlation:.3f}")
        if correlation < -0.5:
            print("→ NEGATIVE correlation (expected for ρ_crit pattern)")
        elif correlation > 0.5:
            print("→ POSITIVE correlation (unexpected!)")
        else:
            print("→ WEAK correlation (inconclusive)")
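
The verdict logic above can be sketched as a small pure-Python helper, using the same ±0.5 cut-offs as the cell. The helper names `pearson_r` and `classify_trend` and the toy ρ/SI values are hypothetical; the explicit "undefined" branch for constant inputs is an addition for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation; returns NaN when either input has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)

def classify_trend(r, threshold=0.5):
    """Map a correlation to the labels used in the verdict printout."""
    if math.isnan(r):
        return "undefined"
    if r < -threshold:
        return "negative"
    if r > threshold:
        return "positive"
    return "weak"

# Invented values: SI falls steadily as rho rises.
toy_rhos = [0.2, 0.3, 0.4]
toy_sis = [0.8, 0.6, 0.4]

r = pearson_r(toy_rhos, toy_sis)
print(f"r = {r:.3f} -> {classify_trend(r)}")
```

With only three models, a single shifted measurement can move r across a threshold, so the label is best read as descriptive, not as a statistical test.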

In [None]:
# Cell 8: Visualization

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Plot 1: Phi-3 Size Ladder
ax1 = axes[0]
valid = [r for r in all_results if 'error' not in r]
if valid:
    sizes = [r['size'] for r in valid]
    si_means = [r['si_mean'] for r in valid]
    si_stds = [r['si_std'] for r in valid]
    
    bars = ax1.bar(sizes, si_means, yerr=si_stds, color='#9b59b6', alpha=0.8, capsize=5)
    ax1.set_ylabel('Specialization Index (SI)')
    ax1.set_xlabel('Model Size')
    ax1.set_title('Phi-3 Size Ladder: Instruct SI')
    ax1.set_ylim(0, 1)
    
    for bar, si, std in zip(bars, si_means, si_stds):
        ax1.annotate(f'{si:.3f}', xy=(bar.get_x() + bar.get_width()/2, si + std + 0.02),
                     ha='center', fontweight='bold')

# Plot 2: Cross-Heritage ρ vs SI
ax2 = axes[1]

# Reference data points
ref_rhos = [0.25, 0.267, 0.468]  # LLaMA, Gemma, Qwen
ref_sis = [0.31, 0.79, 0.57]
ref_labels = ['LLaMA-3.1', 'Gemma-2-9B', 'Qwen2-7B']
ref_colors = ['#e74c3c', '#2ecc71', '#f39c12']

for rho, si, label, color in zip(ref_rhos, ref_sis, ref_labels, ref_colors):
    ax2.scatter(rho, si, s=150, c=color, label=label, edgecolors='black', linewidths=2)

# Phi-3 data points
if valid:
    phi_rhos = [r['rho_head'] for r in valid]
    phi_sis = [r['si_mean'] for r in valid]
    phi_labels = [r['size'] for r in valid]
    
    for rho, si, label in zip(phi_rhos, phi_sis, phi_labels):
        ax2.scatter(rho, si, s=150, c='#9b59b6', marker='s', 
                    edgecolors='black', linewidths=2, label=f'Phi-3 {label}')

# ρ_crit line
ax2.axvline(x=0.267, color='red', linestyle='--', alpha=0.7, label='ρ_crit ≈ 0.267')

ax2.set_xlabel('ρ (Head Density)')
ax2.set_ylabel('Specialization Index (SI)')
ax2.set_title('Cross-Heritage: ρ vs SI')
ax2.legend(loc='best', fontsize=8)
ax2.set_xlim(0, 0.6)
ax2.set_ylim(0, 1)

plt.suptitle(f'E08-Phi3: Microsoft Heritage Analysis\nSeeds: {SEEDS}', fontsize=14, fontweight='bold')
plt.tight_layout()

fig_path = f'../figures/E08_phi3_{TIMESTAMP}.png'
plt.savefig(fig_path, dpi=150, bbox_inches='tight')
plt.show()

print(f"\nFigure saved: {fig_path}")

In [None]:
# Cell 9: Save Results

def convert_to_native(obj):
    if isinstance(obj, dict):
        return {k: convert_to_native(v) for k, v in obj.items()}
    elif isinstance(obj, list):
        return [convert_to_native(v) for v in obj]
    elif isinstance(obj, np.bool_):
        return bool(obj)  # keep booleans as true/false in the JSON output
    elif isinstance(obj, np.integer):
        return int(obj)
    elif isinstance(obj, np.floating):
        return float(obj)
    elif isinstance(obj, np.ndarray):
        return obj.tolist()
    else:
        return obj

filename = f'../results/E08_phi3_{TIMESTAMP}.json'

output = {
    'experiment': 'E08-Phi3',
    'purpose': 'Alignment Density on Microsoft Heritage (Size Ladder)',
    'timestamp': TIMESTAMP,
    'methodology': {
        'standard': 'E11-v3',
        'seeds': SEEDS,
        'prompts': PROMPT_VERSION,
        'prompt_md5': PROMPT_MD5,
        'prompt_source': PROMPT_SOURCE,
        'max_length': MAX_LENGTH,
        'attention_mask': True,
        'chat_template': True,
        'dtype': 'bfloat16',
        'sanity_override': {
            'use_chat_template': SANITY_OVERRIDE.get('use_chat_template'),
            'dtype': str(SANITY_OVERRIDE.get('dtype')) if SANITY_OVERRIDE.get('dtype') is not None else None,
            'label': SANITY_OVERRIDE.get('label')
        }
    },
    'note': 'Base models not publicly available - Instruct SI only',
    'reference_si': REFERENCE_SI,
    'results': convert_to_native(all_results),
    'rho_crit_threshold': 0.267
}

with open(filename, 'w') as f:
    json.dump(output, f, indent=2)

print(f"Results saved: {filename}")

# Auto-download
try:
    from google.colab import files
    import shutil
    import os
    os.makedirs('download', exist_ok=True)
    shutil.copy(filename, 'download/')
    shutil.copy(fig_path, 'download/')
    shutil.make_archive(f'E08_phi3_{TIMESTAMP}', 'zip', 'download')
    files.download(f'E08_phi3_{TIMESTAMP}.zip')
except Exception:  # not running in Colab, or the download failed
    print('Not in Colab')
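
The conversion step exists because `json.dump` rejects most NumPy scalar and array types. A small round-trip sketch follows, assuming a converter in the spirit of `convert_to_native`; the name `to_native` and the record contents are invented for illustration.

```python
import json
import numpy as np

def to_native(obj):
    """Recursively replace NumPy scalars and arrays with built-in Python types."""
    if isinstance(obj, dict):
        return {k: to_native(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_native(v) for v in obj]
    if isinstance(obj, np.bool_):
        return bool(obj)
    if isinstance(obj, np.integer):
        return int(obj)
    if isinstance(obj, np.floating):
        return float(obj)
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    return obj

# Invented record that mixes NumPy and built-in types.
record = {
    "si_mean": np.float32(0.5),
    "num_heads": np.int64(32),
    "heads_identical": np.bool_(False),
    "profile": np.array([0.1, 0.2]),
}

restored = json.loads(json.dumps(to_native(record)))
print(restored)
```

Passing `record` to `json.dumps` directly would raise a `TypeError` on the NumPy values, which is the failure the conversion avoids.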

---

## Summary: E08-Phi3

### Methodology

| Standard | Implementation |
|----------|----------------|
| Seeds | 42, 123, 456 |
| Prompts | Standard-10 v3 |
| MAX_LENGTH | 128 |
| Mask | YES |
| dtype | bfloat16 |

### Limitation

⚠️ Phi-3 base models are not publicly available.
This measures **Instruct SI only** (no ΔSI calculation).

### Expected Insights

1. Does Phi-3 follow the ρ_crit pattern?
2. How does the Microsoft heritage compare to Meta, Google, and Alibaba?
3. Does model size affect SI within Phi-3?

---

*Paper 4: Behavioral Sink Dynamics*  
*E08-Phi3: Microsoft Heritage Size Ladder*