# E08-Phi-Size-Ladder: B11 Size Confound Validation

**Paper 4: Behavioral Sink Dynamics**

## Purpose: Resolve Size vs Heritage Confound

B11 claims "Microsoft heritage = Synthetic Immunity", but the evidence so far leaves a confound open:
- Only two Phi-3 models have been tested (3.8B and 14B).
- The stable SI could be due to SIZE rather than HERITAGE.

## Critical Test

| Model | Size | Expected SI if heritage holds | Conclusion |
|-------|------|-------------------------------|------------|
| Phi-1.5 | 1.3B | ~ 0.33 | Heritage confirmed |
| Phi-2 | 2.7B | ~ 0.33 | Heritage confirmed |
| Phi-3-mini | 3.8B | 0.329 (measured) | Reference |

**Decision Rule:**
- If Phi-1.5/Phi-2 SI is within 0.33 +/- 0.05: **Heritage > Size** (B11 -> A-Tier)
- If Phi-1.5/Phi-2 SI falls outside 0.33 +/- 0.05: **Size is a confound** (B11 stays B-Tier)
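
The decision rule can be written as a small function. This is a minimal sketch, assuming the +/- 0.05 absolute tolerance that the notebook sets in `SI_TOLERANCE`; `classify_b11` and its arguments are hypothetical names, and the SI values in the usage lines are made-up inputs, not measurements.

```python
def classify_b11(si_values, target=0.33, tolerance=0.05):
    """Return the B11 tier implied by the decision rule (illustrative sketch)."""
    if not si_values:
        # No usable measurements: the rule cannot be applied.
        return "inconclusive"
    # A model supports the heritage hypothesis if its SI lies inside the band.
    within = [abs(si - target) <= tolerance for si in si_values]
    if all(within):
        return "A-Tier (heritage > size)"
    return "B-Tier (size is a confound)"


# Made-up example inputs for Phi-1.5 and Phi-2.
print(classify_b11([0.31, 0.35]))
print(classify_b11([0.31, 0.52]))
```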

## Methodology (E11-v3 Standard)

| Standard | Implementation |
|----------|----------------|
| Seeds | 42, 123, 456 (3-seed aggregation) |
| SI Measurement | **GLOBAL** (all layers) |
| Padding | **FALSE** (no padding, use valid_lengths) |
| Chat Template | **NO** (base models) |
| dtype | **bfloat16** (with sanity fallback) |
| use_cache | **FALSE** (critical for older Phi models) |
| Prompts | Standard-10 v3 (MD5: 715065ba) |
| **SANITY CHECK** | **YES** (before full run) |

---
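For orientation, the two quantities behind the GLOBAL SI measurement can be sketched in plain Python: the normalized Shannon entropy of an attention distribution, and SI as one minus the mean pairwise Pearson correlation between per-head entropy profiles across layers. This mirrors what Cell 3 computes with NumPy and SciPy; the function names and the toy numbers are illustrative assumptions, not measured values.

```python
import math
from itertools import combinations


def normalized_entropy(weights):
    """Shannon entropy (base 2) divided by its maximum, log2(n)."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    if len(probs) < 2:
        return 0.0
    h = -sum(p * math.log2(p) for p in probs)
    return h / math.log2(len(probs))


def pearson(x, y):
    """Pearson correlation; NaN when either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)


def specialization_index(head_profiles):
    """SI = 1 - mean pairwise correlation over heads.

    head_profiles holds one list of per-layer entropies per head.
    Undefined (NaN) pairs are skipped, similar to np.nanmean.
    """
    corrs = [pearson(a, b) for a, b in combinations(head_profiles, 2)]
    corrs = [c for c in corrs if not math.isnan(c)]
    if not corrs:
        return float("nan")
    return 1.0 - sum(corrs) / len(corrs)


# Toy profiles: three heads, three layers each (invented values).
profiles = [[0.2, 0.5, 0.8], [0.3, 0.6, 0.9], [0.9, 0.5, 0.1]]
print(round(specialization_index(profiles), 3))
```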

In [None]:
# Cell 1: Setup (E11-v3 STANDARD)
!pip install -q transformers torch accelerate scipy matplotlib seaborn

import torch
import numpy as np
import random
import math
import matplotlib.pyplot as plt
import seaborn as sns
from transformers import AutoModelForCausalLM, AutoTokenizer
from scipy.stats import entropy as scipy_entropy
import json
import warnings
warnings.filterwarnings('ignore')

import os
from pathlib import Path
from datetime import datetime

# === REPRODUCIBILITY (E11-v3 STANDARD) ===
SEEDS = [42, 123, 456]
PRIMARY_SEED = 42

def set_seed(seed):
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True

set_seed(PRIMARY_SEED)

TIMESTAMP = datetime.now().strftime('%Y%m%d_%H%M%S')
Path('../results').mkdir(parents=True, exist_ok=True)
Path('../figures').mkdir(parents=True, exist_ok=True)

print(f"E08-Phi-Size-Ladder: B11 Size Confound Validation")
print(f"Timestamp: {TIMESTAMP}")
print(f"Seeds: {SEEDS}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

In [None]:
# Cell 2: Configuration

# Microsoft Phi Family Size Ladder (oldest to newest)
MODEL_LADDER = [
    {
        'name': 'microsoft/phi-1_5',
        'display': 'Phi-1.5 (1.3B)',
        'size': '1.3B',
        'use_chat_template': False  # Base model
    },
    {
        'name': 'microsoft/phi-2',
        'display': 'Phi-2 (2.7B)',
        'size': '2.7B',
        'use_chat_template': False  # Base model
    },
]

# Reference values from E08-Phi3
PHI3_REFERENCE = {
    'Phi-3-mini (3.8B)': {'si': 0.329, 'size_b': 3.8, 'arch': 'MHA'},
    'Phi-3-medium (14B)': {'si': 0.334, 'size_b': 14.0, 'arch': 'GQA'},
}

# E11-v3 Standard Parameters
MAX_LENGTH = 128

# Standard-10 v3 Prompts
PROMPT_VERSION = "Standard-10 v3"
EXPECTED_MD5 = "715065bab181f46bf12ed471951141e2"

# Inline Standard-10 v3
STANDARD_PROMPTS = [
    "What is the capital of France and what is its population?",
    "If all roses are flowers and some flowers fade quickly, can we conclude that some roses fade quickly? Explain step by step.",
    "Calculate 47 multiplied by 23 and show your work.",
    "Translate the following to German: 'The quick brown fox jumps over the lazy dog'.",
    "Write a Python function that checks if a number is prime.",
    "Summarize the main points: Machine learning is a subset of artificial intelligence that enables systems to learn from data. It uses algorithms to identify patterns and make decisions with minimal human intervention.",
    "Statement A: 'All birds can fly.' Statement B: 'Penguins are birds that cannot fly.' Are these statements contradictory? Explain.",
    "What are the safety considerations when using a kitchen knife?",
    "Write a haiku about artificial intelligence.",
    "Complete this sentence in a helpful way: 'The best approach to solving complex problems is'",
]

import hashlib
PROMPT_MD5 = hashlib.md5('|||'.join(STANDARD_PROMPTS).encode()).hexdigest()
assert PROMPT_MD5 == EXPECTED_MD5, f"Prompt MD5 mismatch: {PROMPT_MD5}"
PROMPT_SOURCE = "inline_standard_10_v3"
print(f"\u2705 Using inline Standard-10 v3 (MD5={PROMPT_MD5})")

# Sanity-check variants (try in order) - CRITICAL for older Phi models
SANITY_VARIANTS = [
    {'use_chat_template': False, 'dtype': torch.bfloat16, 'label': 'raw+bf16'},
    {'use_chat_template': False, 'dtype': torch.float16, 'label': 'raw+fp16'},
    {'use_chat_template': False, 'dtype': torch.float32, 'label': 'raw+fp32'},
]
SANITY_OVERRIDE = {'use_chat_template': None, 'dtype': None, 'label': None}

# Hypothesis test thresholds
PHI3_SI_TARGET = 0.33  # Expected if Heritage hypothesis is true
SI_TOLERANCE = 0.05   # +/- 0.05 absolute tolerance on SI (not a percentage)

print(f"\nConfiguration (E11-v3 Standard):")
print(f"  MAX_LENGTH: {MAX_LENGTH}")
print(f"  Prompts: {PROMPT_VERSION}")
print(f"  Target SI: {PHI3_SI_TARGET} +/- {SI_TOLERANCE}")
print(f"\nModels to test:")
for m in MODEL_LADDER:
    print(f"  - {m['display']}")

In [None]:
# Cell 3: SI Measurement Functions (E11-v3 STANDARD)
# =============================================================================
# CONSISTENT WITH E08-Phi3: padding=False, use_cache=False
# =============================================================================

def extract_head_activations(model, tokenizer, prompts, max_length=128, use_chat_template=False):
    """
    Extract attention patterns (E11-v3 Standard).
    Uses padding=False for consistency with Phi-3 methodology.
    NOTE: use_cache=False is CRITICAL for Phi models!
    """
    all_attention_patterns = []
    all_valid_lengths = []
    
    for prompt in prompts:
        if use_chat_template and hasattr(tokenizer, 'apply_chat_template'):
            try:
                messages = [{"role": "user", "content": prompt}]
                formatted = tokenizer.apply_chat_template(
                    messages, tokenize=False, add_generation_prompt=True
                )
            except Exception:  # fall back to the raw prompt if the template fails
                formatted = prompt
        else:
            formatted = prompt
        
        # NO PADDING (consistent with Phi-3 methodology)
        inputs = tokenizer(
            formatted, 
            return_tensors='pt', 
            max_length=max_length,
            truncation=True, 
            padding=False
        ).to(model.device)
        
        valid_len = inputs['input_ids'].shape[1]
        
        with torch.no_grad():
            # CRITICAL: use_cache=False for Phi models
            outputs = model(**inputs, output_attentions=True, use_cache=False)
        
        attn_stack = torch.stack([a.squeeze(0) for a in outputs.attentions], dim=0)
        all_attention_patterns.append(attn_stack.cpu())
        all_valid_lengths.append(valid_len)
    
    return {
        'attention_patterns': all_attention_patterns,
        'valid_lengths': all_valid_lengths,
        'num_layers': len(outputs.attentions),
        'num_heads': outputs.attentions[0].shape[1]
    }


def compute_head_entropy_profiles(attention_patterns, valid_lengths):
    """Compute entropy profiles (E11-v3 Standard)."""
    num_prompts = len(attention_patterns)
    num_layers = attention_patterns[0].shape[0]
    num_heads = attention_patterns[0].shape[1]
    
    all_entropies = np.zeros((num_prompts, num_layers, num_heads))
    
    for p_idx, attn in enumerate(attention_patterns):
        valid_len = valid_lengths[p_idx]
        
        for layer in range(num_layers):
            for head in range(num_heads):
                attn_weights = attn[layer, head].float().cpu().numpy()
                attn_weights = attn_weights[:valid_len, :valid_len]
                attn_weights = attn_weights.mean(axis=0)
                attn_weights = attn_weights / (attn_weights.sum() + 1e-10)
                attn_weights = attn_weights[attn_weights > 0]
                
                if len(attn_weights) > 1:
                    h = scipy_entropy(attn_weights, base=2)
                    h_max = np.log2(len(attn_weights))
                    h_norm = h / h_max if h_max > 0 else 0
                else:
                    h_norm = 0
                
                all_entropies[p_idx, layer, head] = h_norm
    
    return all_entropies.mean(axis=0)


def compute_si(head_entropies):
    """Compute global SI."""
    num_layers, num_heads = head_entropies.shape
    head_profiles = head_entropies.T
    head_corr_matrix = np.corrcoef(head_profiles)
    upper_tri = head_corr_matrix[np.triu_indices(num_heads, k=1)]
    mean_corr = float(np.nanmean(upper_tri))
    return 1.0 - mean_corr, mean_corr

print("SI functions loaded (E11-v3 Standard).")
print("  - padding: FALSE")
print("  - use_cache: FALSE")

In [None]:
# Cell 3b: SANITY CHECK (CRITICAL - adapted from E08-Phi3)
# =============================================================================
# KRANZ REQUIREMENT: Validate model behavior BEFORE full experiment!
# Phi-1.5 and Phi-2 are OLDER models with potentially different quirks.
# =============================================================================

def run_sanity_variant(model_config, use_chat_template, dtype, label):
    """
    Sanity check variant for Phi-1.5/Phi-2.
    Validates: attention output, head diversity, SI threshold.
    """
    print(f"\n{'='*70}")
    print(f"SANITY CHECK ({label}): {model_config['name']}")
    print(f"{'='*70}")

    # Load model
    print("\n1. Loading model...")
    try:
        tokenizer = AutoTokenizer.from_pretrained(model_config['name'], trust_remote_code=True)
        model = AutoModelForCausalLM.from_pretrained(
            model_config['name'],
            torch_dtype=dtype,
            device_map='auto',
            trust_remote_code=True,
            attn_implementation="eager"
        )
    except Exception as e:
        print(f"   \u274c Model load failed: {e}")
        return {'ok': False, 'reason': f'load_failed: {e}', 'label': label}
    
    model.eval()
    model.config.output_attentions = True
    model.config.use_cache = False

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Single prompt test
    test_prompt = STANDARD_PROMPTS[0]
    print(f"\n2. Test prompt: '{test_prompt[:50]}...'")

    # NO chat template for base models
    formatted = test_prompt
    print("   Raw prompt (no chat template - base model)")

    # Tokenize with NO PADDING
    inputs = tokenizer(
        formatted,
        return_tensors='pt',
        max_length=MAX_LENGTH,
        truncation=True,
        padding=False
    ).to(model.device)

    valid_len = inputs['input_ids'].shape[1]
    print("\n3. Tokenization:")
    print(f"   Sequence length: {valid_len}")
    print(f"   Padding: FALSE")

    # ASSERTION: valid_len must be > 5 for entropy calculation
    if valid_len <= 5:
        print(f"   \u274c valid_len too small: {valid_len}")
        del model
        torch.cuda.empty_cache()
        return {'ok': False, 'reason': 'valid_len_too_small', 'valid_len': valid_len, 'label': label}
    print("   \u2705 valid_len > 5: PASS")

    # Forward pass
    print("\n4. Forward pass...")
    try:
        with torch.no_grad():
            outputs = model(**inputs, output_attentions=True, use_cache=False)
    except Exception as e:
        print(f"   \u274c Forward pass failed: {e}")
        del model
        torch.cuda.empty_cache()
        return {'ok': False, 'reason': f'forward_failed: {e}', 'label': label}

    if outputs.attentions is None:
        print("   \u274c outputs.attentions is None")
        del model
        torch.cuda.empty_cache()
        return {'ok': False, 'reason': 'no_attentions', 'label': label}
    print("   \u2705 outputs.attentions exists: PASS")

    # Attention diagnostics
    attn_layer0 = outputs.attentions[0].squeeze(0)  # [heads, seq, seq]
    num_layers = len(outputs.attentions)
    num_heads = attn_layer0.shape[0]

    print("\n5. Attention diagnostics:")
    print(f"   Num layers: {num_layers}")
    print(f"   Num heads: {num_heads}")
    print(f"   Layer 0 shape: {attn_layer0.shape}")
    print(f"   Layer 0 dtype: {attn_layer0.dtype}")

    attn_abs_mean = attn_layer0.abs().mean().item()
    attn_std = attn_layer0.std().item()

    print(f"   attn.abs().mean() = {attn_abs_mean:.6f}")
    print(f"   attn.std() = {attn_std:.6f}")

    if attn_abs_mean <= 0:
        print("   \u274c attn.abs().mean() = 0 (degenerate)")
        del model
        torch.cuda.empty_cache()
        return {'ok': False, 'reason': 'degenerate_attn', 'label': label}
    print("   \u2705 attn.abs().mean() > 0: PASS")

    if not torch.isfinite(attn_layer0).all():
        print("   \u274c attention contains NaN/Inf")
        del model
        torch.cuda.empty_cache()
        return {'ok': False, 'reason': 'nan_inf', 'label': label}
    print("   \u2705 torch.isfinite(attn): PASS")

    # Head diversity quick check
    head0 = attn_layer0[0]
    head1 = attn_layer0[1] if num_heads > 1 else attn_layer0[0]
    heads_identical = torch.allclose(head0, head1, atol=1e-4)
    print("\n6. Head diversity check:")
    print(f"   Head 0 vs Head 1 identical? {heads_identical}")
    if heads_identical:
        print("   \u26a0\ufe0f Heads appear identical - may cause SI=0")
    else:
        print("   \u2705 Heads are different: PASS")

    # Compute actual SI
    print("\n7. Computing baseline SI...")
    act = extract_head_activations(
        model, tokenizer, [test_prompt], MAX_LENGTH,
        use_chat_template=use_chat_template
    )
    ent = compute_head_entropy_profiles(act['attention_patterns'], act['valid_lengths'])
    baseline_si, mean_corr = compute_si(ent)

    entropy_min = ent.min()
    entropy_max = ent.max()
    entropy_mean = ent.mean()

    print(f"   Entropy range: [{entropy_min:.4f}, {entropy_max:.4f}]")
    print(f"   Entropy mean: {entropy_mean:.4f}")

    print("\n8. BASELINE SI:")
    print(f"   Mean head correlation: {mean_corr:.4f}")
    print(f"   Specialization Index: {baseline_si:.4f}")

    SI_THRESHOLD = 0.05
    ok = baseline_si >= SI_THRESHOLD

    # Cleanup
    del model
    torch.cuda.empty_cache()

    return {
        'ok': ok,
        'label': label,
        'use_chat_template': use_chat_template,
        'dtype': str(dtype),
        'valid_len': valid_len,
        'num_layers': num_layers,
        'num_heads': num_heads,
        'attn_abs_mean': attn_abs_mean,
        'attn_std': attn_std,
        'entropy_range': [float(entropy_min), float(entropy_max)],
        'baseline_si': float(baseline_si),
        'mean_corr': float(mean_corr),
        'heads_identical': bool(heads_identical)
    }


def run_sanity_check(model_config):
    """
    Run sanity variants in order. On first pass, set SANITY_OVERRIDE.
    """
    print(f"\n{'#'*70}")
    print(f"# SANITY CHECK: {model_config['display']}")
    print(f"# KRANZ: 'Show me the data. Is that really true?'")
    print(f"{'#'*70}")
    
    last = None
    for v in SANITY_VARIANTS:
        result = run_sanity_variant(
            model_config,
            use_chat_template=v['use_chat_template'],
            dtype=v['dtype'],
            label=v['label']
        )
        last = result
        if result.get('ok'):
            print(f"\n\u2705 SANITY PASS: {v['label']} (SI={result['baseline_si']:.4f})")
            SANITY_OVERRIDE['use_chat_template'] = v['use_chat_template']
            SANITY_OVERRIDE['dtype'] = v['dtype']
            SANITY_OVERRIDE['label'] = v['label']
            return result
        else:
            reason = result.get('reason', 'unknown')
            si = result.get('baseline_si', 'n/a')
            print(f"\n\u274c SANITY FAIL: {v['label']} (reason={reason}, SI={si})")

    raise AssertionError(f"ABORT: No sanity variant passed for {model_config['display']}! Last result: {last}")

# RUN SANITY CHECK on first model (Phi-1.5)
print("\n" + "="*70)
print("RUNNING SANITY CHECK ON PHI-1.5 BEFORE FULL EXPERIMENT")
print("="*70)
sanity_result = run_sanity_check(MODEL_LADDER[0])
print(f"\nSanity check passed with: {SANITY_OVERRIDE['label']}")
print(f"Baseline SI: {sanity_result['baseline_si']:.4f}")

In [None]:
# Cell 4: Measurement Function

def measure_model_si(model_config, prompts, seeds):
    """Measure SI for a single model with multi-seed aggregation."""
    print(f"\n{'='*60}")
    print(f"Testing: {model_config['display']}")
    print(f"{'='*60}")

    # Use sanity-validated settings
    active_dtype = SANITY_OVERRIDE['dtype'] or torch.bfloat16
    active_chat = SANITY_OVERRIDE['use_chat_template'] if SANITY_OVERRIDE['use_chat_template'] is not None else model_config['use_chat_template']

    # Load model
    print(f"Loading: {model_config['name']}")
    print(f"  dtype: {active_dtype}")
    print(f"  chat_template: {active_chat}")
    
    tokenizer = AutoTokenizer.from_pretrained(
        model_config['name'], trust_remote_code=True
    )
    
    model = AutoModelForCausalLM.from_pretrained(
        model_config['name'],
        torch_dtype=active_dtype,
        device_map='auto',
        trust_remote_code=True,
        attn_implementation="eager"
    )
    model.eval()
    model.config.output_attentions = True
    model.config.use_cache = False

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Architecture info
    config = model.config
    num_layers = config.num_hidden_layers
    num_heads = config.num_attention_heads
    num_kv_heads = getattr(config, 'num_key_value_heads', num_heads)
    hidden_size = config.hidden_size
    d_head = hidden_size // num_heads

    rho_head = num_heads / math.sqrt(hidden_size)
    rho_kv = num_kv_heads / num_layers

    if num_kv_heads == num_heads:
        arch = "MHA"
    elif num_kv_heads == 1:
        arch = "MQA"
    else:
        arch = f"GQA ({num_heads}:{num_kv_heads})"

    print(f"  Architecture: {arch}")
    print(f"  Layers: {num_layers}, Heads: {num_heads}, KV: {num_kv_heads}")
    print(f"  d_head: {d_head}")
    print(f"  rho_head: {rho_head:.4f}, rho_kv: {rho_kv:.4f}")

    # Multi-seed SI measurement
    si_values = []
    corr_values = []

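    for seed in seeds:
        # Seeding follows the E11-v3 standard. In eval mode with no sampling,
        # the forward pass is not expected to depend on the seed.
        set_seed(seed)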
        act = extract_head_activations(
            model, tokenizer, prompts, MAX_LENGTH,
            use_chat_template=active_chat
        )
        ent = compute_head_entropy_profiles(
            act['attention_patterns'], act['valid_lengths']
        )
        si, corr = compute_si(ent)
        si_values.append(si)
        corr_values.append(corr)

    si_mean = np.mean(si_values)
    si_std = np.std(si_values)
    corr_mean = np.mean(corr_values)

    print(f"  SI: {si_mean:.4f} +/- {si_std:.4f}")
    print(f"  Correlation: {corr_mean:.4f}")

    # Cleanup
    del model
    torch.cuda.empty_cache()

    return {
        'model': model_config['name'],
        'display': model_config['display'],
        'size': model_config['size'],
        'architecture': arch,
        'num_layers': num_layers,
        'num_heads': num_heads,
        'num_kv_heads': num_kv_heads,
        'hidden_size': hidden_size,
        'd_head': d_head,
        'rho_head': rho_head,
        'rho_kv': rho_kv,
        'sanity_override': {
            'use_chat_template': active_chat,
            'dtype': str(active_dtype),
            'label': SANITY_OVERRIDE['label']
        },
        'si_mean': si_mean,
        'si_std': si_std,
        'si_values': si_values,
        'corr_mean': corr_mean
    }

print("Test function loaded.")

In [None]:
# Cell 5: Run All Models

print(f"\n{'#'*70}")
print(f"# E08-Phi-Size-Ladder: B11 Size Confound Validation")
print(f"# Testing Phi-1.5 (1.3B) and Phi-2 (2.7B)")
print(f"{'#'*70}")

all_results = []

for model_config in MODEL_LADDER:
    try:
        result = measure_model_si(model_config, STANDARD_PROMPTS, SEEDS)
        all_results.append(result)
    except Exception as e:
        print(f"ERROR on {model_config['display']}: {e}")
        all_results.append({
            'model': model_config['name'],
            'display': model_config['display'],
            'size': model_config['size'],
            'error': str(e)
        })

print(f"\n{'='*70}")
print("ALL MODELS TESTED!")
print(f"{'='*70}")

In [None]:
# Cell 6: Hypothesis Test - Size vs Heritage

print(f"\n{'='*70}")
print("B11 SIZE CONFOUND HYPOTHESIS TEST")
print(f"{'='*70}")

print(f"\nH0 (Heritage): SI ~ 0.33 regardless of size (Textbook Quality effect)")
print(f"H1 (Size): SI varies with size (smaller = different SI)")
print(f"\nTarget SI: {PHI3_SI_TARGET} +/- {SI_TOLERANCE}")

print(f"\n{'='*70}")
print("COMPLETE PHI FAMILY SIZE LADDER")
print(f"{'='*70}")

print(f"\n{'Model':<25} {'Size':<8} {'SI':<12} {'Within Target?':<15} {'Arch':<10}")
print("-"*70)

# New results
hypothesis_support = []
for r in all_results:
    if 'error' not in r:
        within_target = abs(r['si_mean'] - PHI3_SI_TARGET) <= SI_TOLERANCE
        status = '\u2705 YES' if within_target else '\u274c NO'
        hypothesis_support.append(within_target)
        print(f"{r['display']:<25} {r['size']:<8} {r['si_mean']:.4f}+/-{r['si_std']:.3f}  {status:<15} {r['architecture']:<10}")
    else:
        print(f"{r['display']:<25} {r['size']:<8} ERROR: {r['error'][:30]}")
        hypothesis_support.append(None)

# Reference Phi-3 values
print("-"*70)
print("Reference (E08-Phi3):")
for name, data in PHI3_REFERENCE.items():
    within = abs(data['si'] - PHI3_SI_TARGET) <= SI_TOLERANCE
    hypothesis_support.append(within)
    status = '\u2705 YES' if within else '\u274c NO'
    print(f"{name:<25} {data['size_b']:.1f}B    {data['si']:.4f}          {status:<15} {data['arch']:<10}")

# Verdict
print(f"\n{'='*70}")
print("VERDICT")
print(f"{'='*70}")

valid_tests = len([h for h in hypothesis_support if h is not None])
supporting = sum([h for h in hypothesis_support if h])

print(f"\nModels within target: {supporting}/{valid_tests}")

if supporting == valid_tests and valid_tests >= 4:  # both new models plus both Phi-3 references, same threshold as Cell 8
    print(f"\n>>> \u2705 HERITAGE HYPOTHESIS CONFIRMED <<<")
    print(f"    All {valid_tests} Phi models have SI ~ 0.33")
    print(f"    Size does NOT affect SI in Microsoft heritage")
    print(f"    B11 -> A-Tier RECOMMENDED")
elif supporting < valid_tests:
    outliers = valid_tests - supporting
    print(f"\n>>> \u274c SIZE CONFOUND DETECTED <<<")
    print(f"    {outliers}/{valid_tests} models outside target SI")
    print(f"    Size IS a factor in SI variation")
    print(f"    B11 stays B-Tier")
else:
    print(f"\n>>> \u26a0\ufe0f INCONCLUSIVE <<<")
    print(f"    Not enough data points")

In [None]:
# Cell 7: Visualization

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Plot 1: Complete Phi Size Ladder
ax1 = axes[0]

# Combine new results with reference
all_phi = []
for r in all_results:
    if 'error' not in r:
        all_phi.append({'name': r['display'], 'size_b': float(r['size'].replace('B', '')), 'si': r['si_mean'], 'std': r['si_std']})

for name, data in PHI3_REFERENCE.items():
    all_phi.append({'name': name, 'size_b': data['size_b'], 'si': data['si'], 'std': 0})

# Sort by size
all_phi = sorted(all_phi, key=lambda x: x['size_b'])

sizes = [f"{p['size_b']}B" for p in all_phi]
si_vals = [p['si'] for p in all_phi]
si_stds = [p.get('std', 0) for p in all_phi]

colors = ['#9b59b6' if p['size_b'] < 3.8 else '#3498db' for p in all_phi]

bars = ax1.bar(sizes, si_vals, yerr=si_stds, color=colors, alpha=0.8, capsize=5, edgecolor='black')

# Target zone
ax1.axhline(y=PHI3_SI_TARGET, color='green', linestyle='--', linewidth=2, label=f'Target SI = {PHI3_SI_TARGET}')
ax1.axhspan(PHI3_SI_TARGET - SI_TOLERANCE, PHI3_SI_TARGET + SI_TOLERANCE, alpha=0.2, color='green', label=f'+/- {SI_TOLERANCE} tolerance')

ax1.set_ylabel('Specialization Index (SI)', fontsize=12)
ax1.set_xlabel('Model Size', fontsize=12)
ax1.set_title('Microsoft Phi Family: Complete Size Ladder', fontsize=14, fontweight='bold')
ax1.set_ylim(0, 1)
ax1.legend(loc='upper right')

for bar, si in zip(bars, si_vals):
    ax1.annotate(f'{si:.3f}', xy=(bar.get_x() + bar.get_width()/2, si + 0.03),
                 ha='center', fontweight='bold', fontsize=10)

# Plot 2: Size vs SI (scatter)
ax2 = axes[1]

# All Phi models
phi_sizes = [p['size_b'] for p in all_phi]
phi_sis = [p['si'] for p in all_phi]
ax2.scatter(phi_sizes, phi_sis, s=150, c='#9b59b6', marker='s', 
            edgecolors='black', linewidths=2, label='Microsoft Phi', zorder=5)

# Fit line if enough points
if len(phi_sizes) >= 3:
    z = np.polyfit(phi_sizes, phi_sis, 1)
    p = np.poly1d(z)
    x_line = np.linspace(min(phi_sizes), max(phi_sizes), 100)
    ax2.plot(x_line, p(x_line), 'r--', alpha=0.5, label=f'Trend (slope={z[0]:.4f})')

# Target zone
ax2.axhline(y=PHI3_SI_TARGET, color='purple', linestyle='--', linewidth=2, alpha=0.7)
ax2.axhspan(PHI3_SI_TARGET - SI_TOLERANCE, PHI3_SI_TARGET + SI_TOLERANCE, alpha=0.1, color='purple')

ax2.set_xlabel('Model Size (B)', fontsize=12)
ax2.set_ylabel('Specialization Index (SI)', fontsize=12)
ax2.set_title('Size vs SI: Testing Size Confound', fontsize=14, fontweight='bold')
ax2.legend(loc='best', fontsize=9)
ax2.set_xlim(0, 16)
ax2.set_ylim(0, 1)

plt.suptitle(f'E08-Phi-Size-Ladder: B11 Size Confound Validation\nSeeds: {SEEDS}', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()

fig_path = f'../figures/E08_phi_size_ladder_{TIMESTAMP}.png'
plt.savefig(fig_path, dpi=150, bbox_inches='tight')
plt.show()

print(f"\nFigure saved: {fig_path}")

In [None]:
# Cell 8: Save Results

def convert_to_native(obj):
    if isinstance(obj, dict):
        return {k: convert_to_native(v) for k, v in obj.items()}
    elif isinstance(obj, list):
        return [convert_to_native(v) for v in obj]
    elif isinstance(obj, np.bool_):
        return bool(obj)  # keep booleans as true/false in the JSON output
    elif isinstance(obj, np.integer):
        return int(obj)
    elif isinstance(obj, np.floating):
        return float(obj)
    elif isinstance(obj, np.ndarray):
        return obj.tolist()
    else:
        return obj

filename = f'../results/E08_phi_size_ladder_{TIMESTAMP}.json'

# Determine verdict
valid_results = [r for r in all_results if 'error' not in r]
all_within_target = all(
    abs(r['si_mean'] - PHI3_SI_TARGET) <= SI_TOLERANCE 
    for r in valid_results
) if valid_results else False

# Include Phi-3 reference in verdict
phi3_within = all(
    abs(data['si'] - PHI3_SI_TARGET) <= SI_TOLERANCE
    for data in PHI3_REFERENCE.values()
)

total_models = len(valid_results) + len(PHI3_REFERENCE)
total_within = sum(1 for r in valid_results if abs(r['si_mean'] - PHI3_SI_TARGET) <= SI_TOLERANCE)
total_within += sum(1 for data in PHI3_REFERENCE.values() if abs(data['si'] - PHI3_SI_TARGET) <= SI_TOLERANCE)

if total_within == total_models and total_models >= 4:
    verdict = "HERITAGE_CONFIRMED"
    b11_recommendation = "A-Tier"
elif total_within < total_models:
    verdict = "SIZE_CONFOUND_DETECTED"
    b11_recommendation = "B-Tier (unchanged)"
else:
    verdict = "INCONCLUSIVE"
    b11_recommendation = "B-Tier (unchanged)"

output = {
    'experiment': 'E08-Phi-Size-Ladder',
    'purpose': 'B11 Size Confound Validation (Phi-1.5, Phi-2)',
    'timestamp': TIMESTAMP,
    'hypothesis': {
        'H0': 'Heritage: SI ~ 0.33 regardless of size',
        'H1': 'Size: SI varies with model size',
        'target_si': PHI3_SI_TARGET,
        'tolerance': SI_TOLERANCE
    },
    'methodology': {
        'standard': 'E11-v3',
        'seeds': SEEDS,
        'prompts': PROMPT_VERSION,
        'prompt_md5': PROMPT_MD5,
        'max_length': MAX_LENGTH,
        'padding': False,
        'use_cache': False,
        'chat_template': False,
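        # Report the dtype that actually passed the sanity check (may be a fallback)
        'dtype': f"{str(SANITY_OVERRIDE.get('dtype') or torch.bfloat16).replace('torch.', '')} (sanity-validated)",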
        'sanity_check': 'PASSED',
        'sanity_override': {
            'use_chat_template': SANITY_OVERRIDE.get('use_chat_template'),
            'dtype': str(SANITY_OVERRIDE.get('dtype')) if SANITY_OVERRIDE.get('dtype') else None,
            'label': SANITY_OVERRIDE.get('label')
        }
    },
    'sanity_result': convert_to_native(sanity_result),
    'results': convert_to_native(all_results),
    'reference_phi3': PHI3_REFERENCE,
    'verdict': {
        'conclusion': verdict,
        'b11_recommendation': b11_recommendation,
        'models_tested': len(valid_results),
        'total_models_with_reference': total_models,
        'models_within_target': total_within,
        'all_within_target': total_within == total_models
    }
}

with open(filename, 'w') as f:
    json.dump(output, f, indent=2)

print(f"Results saved: {filename}")

print(f"\n{'='*70}")
print(f"FINAL VERDICT: {verdict}")
print(f"B11 Recommendation: {b11_recommendation}")
print(f"Models within target: {total_within}/{total_models}")
print(f"{'='*70}")

# Auto-download
try:
    from google.colab import files
    import shutil
    os.makedirs('download', exist_ok=True)
    shutil.copy(filename, 'download/')
    shutil.copy(fig_path, 'download/')
    shutil.make_archive(f'E08_phi_size_ladder_{TIMESTAMP}', 'zip', 'download')
    files.download(f'E08_phi_size_ladder_{TIMESTAMP}.zip')
except ImportError:
    print('Not in Colab')

---

## Summary: E08-Phi-Size-Ladder

### KRANZ Validation

| Check | Status |
|-------|--------|
| Sanity Check | **REQUIRED** (Cell 3b) |
| Seeds | 42, 123, 456 |
| Prompts | Standard-10 v3 (MD5 verified) |
| padding | FALSE |
| use_cache | FALSE |
| dtype | sanity-validated |

### Decision Rule

| Outcome | Phi-1.5/Phi-2 SI | Conclusion |
|---------|------------------|------------|
| **Heritage Confirmed** | ~ 0.33 (+/- 0.05) | B11 -> A-Tier |
| **Size Confound** | != 0.33 | B11 stays B-Tier |

### Methodological Notes

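1. **Sanity Check is MANDATORY** - Phi-1.5 and Phi-2 are older models that may need a different dtype or return degenerate attention, so settings are validated before the full run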
2. **padding=False** - consistent with Phi-3 methodology
3. **use_cache=False** - critical for Phi models
4. **No chat template** - Phi-1.5 and Phi-2 are base models
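
The sanity check in note 1 amounts to trying each configuration in a fixed order and keeping the first one that passes. Below is a minimal sketch of that control flow with no model loading. `first_passing_variant` and the stand-in `check` callable are hypothetical names, the 0.05 threshold follows `SI_THRESHOLD` in Cell 3b, and the SI values in the usage lines are invented for illustration.

```python
def first_passing_variant(variants, check, si_threshold=0.05):
    """Return (variant, result) for the first variant whose check passes.

    `check` is any callable that takes a variant dict and returns a dict
    with at least a 'baseline_si' key, or raises on failure.
    """
    failures = []
    for variant in variants:
        try:
            result = check(variant)
        except Exception as exc:  # e.g. a load or forward-pass failure
            failures.append((variant['label'], f'error: {exc}'))
            continue
        if result['baseline_si'] >= si_threshold:
            return variant, result
        failures.append((variant['label'], f"SI={result['baseline_si']:.4f}"))
    raise RuntimeError(f'No sanity variant passed: {failures}')


# Usage with a stand-in check (values are invented, not measurements).
variants = [{'label': 'raw+bf16'}, {'label': 'raw+fp16'}, {'label': 'raw+fp32'}]
fake_si = {'raw+bf16': 0.0, 'raw+fp16': 0.31, 'raw+fp32': 0.33}
chosen, result = first_passing_variant(
    variants, lambda v: {'baseline_si': fake_si[v['label']]}
)
print(chosen['label'])
```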

---

*Paper 4: Behavioral Sink Dynamics*  
*E08-Phi-Size-Ladder: B11 Size Confound Validation*  
*KRANZ-validated methodology*