# Uniformity Asymmetry Validation: Pythia Suite

**Purpose:** Test the Uniformity Asymmetry (UA) metric on EleutherAI's Pythia models.

**Research Question:** Do base models (without RLHF) show the same embedding-output disconnect as RLHF-trained models such as Llama-3?

**Hypothesis:** If RLHF creates "surface alignment" (neutral embeddings, biased outputs), then Pythia (no RLHF) should show:
- Either: Both embeddings AND outputs biased (aligned)
- Or: Both embeddings AND outputs neutral (aligned)
- NOT: Neutral embeddings with biased outputs (disconnect)
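
The three outcomes can be made concrete with a small classifier. This is only a sketch: the function name `classify_outcome` and its default thresholds are illustrative assumptions (they mirror the cutoffs used in this notebook's cross-model comparison cell), not values fixed by the paper.

```python
def classify_outcome(ua: float, pref: float,
                     ua_neutral: float = 0.02, ua_biased: float = 0.05,
                     pref_biased: float = 0.5) -> str:
    """Map one category's (UA, output preference) onto the hypothesis outcomes.

    All thresholds are assumptions chosen for illustration.
    """
    emb_neutral = abs(ua) < ua_neutral
    emb_biased = abs(ua) > ua_biased
    out_biased = abs(pref) > pref_biased

    if emb_biased and out_biased:
        return "both biased (aligned)"
    if emb_neutral and not out_biased:
        return "both neutral (aligned)"
    if emb_neutral and out_biased:
        return "neutral embeddings, biased outputs (disconnect)"
    # Anything between the neutral and biased cutoffs is left undecided
    return "unclear"


# Reference values quoted later in this notebook (Scientific Facts category)
print(classify_outcome(-0.013, 1.165))  # Llama-3.1-8B
print(classify_outcome(0.109, 1.278))   # Apertus-8B
```

With these assumed cutoffs, the Llama-3.1-8B reference point falls into the disconnect case and the Apertus-8B point into the "both biased" case.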

**Models Tested:**
- Pythia-1.4B (comparable to smaller models)
- Pythia-6.9B (comparable to Llama-3.1-8B, Gemma-2-9B)

**Reference:** This extends the work in "Uniformity Asymmetry: An Exploratory Metric for Detecting Representational Preferences in LLM Embeddings".

---

**Author:** Davide D'Elia  
**Date:** 2026-01-02  
**For:** EleutherAI Community Contribution

## Setup

In [None]:
# Install dependencies (matplotlib is needed for the visualization cell)
!pip install -q transformers accelerate torch numpy scipy matplotlib

In [None]:
import os
import json
import warnings
from datetime import datetime
from typing import Dict, List, Tuple

import numpy as np
import torch
from scipy import stats
from transformers import AutoModelForCausalLM, AutoTokenizer

warnings.filterwarnings('ignore')

# Configuration
N_BOOTSTRAP = 10000
RANDOM_SEED = 42

np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

## Model Configuration

**Pythia Models:**
- No RLHF, no instruction tuning
- Trained on The Pile
- Uses GPT-NeoX architecture
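
The configuration cell that follows lists a rough GPU memory estimate (`ram_gb`) for each model. A back-of-the-envelope check is to count 2 bytes per parameter for float16 weights. The helper `fp16_weight_gb` below is hypothetical, and the parameter counts are the nominal ones read off the model names, so treat the output as an approximation of the weights alone.

```python
def fp16_weight_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9


# Nominal parameter counts taken from the model names (an approximation)
nominal_params = {"1.4b": 1.4e9, "2.8b": 2.8e9, "6.9b": 6.9e9, "12b": 12e9}

for size, n_params in nominal_params.items():
    print(f"pythia-{size}: ~{fp16_weight_gb(n_params):.1f} GB of float16 weights")
```

The `ram_gb` values in the configuration sit slightly above these weight-only figures, which leaves some headroom for activations and framework overhead.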

In [None]:
# Select which Pythia model to test
# Options: "1.4b", "2.8b", "6.9b", "12b"
PYTHIA_SIZE = "6.9b"  # @param ["1.4b", "2.8b", "6.9b", "12b"]

PYTHIA_CONFIGS = {
    "1.4b": {
        "hf_name": "EleutherAI/pythia-1.4b",
        "display": "Pythia-1.4B",
        "dtype": torch.float16,
        "ram_gb": 4
    },
    "2.8b": {
        "hf_name": "EleutherAI/pythia-2.8b",
        "display": "Pythia-2.8B",
        "dtype": torch.float16,
        "ram_gb": 7
    },
    "6.9b": {
        "hf_name": "EleutherAI/pythia-6.9b",
        "display": "Pythia-6.9B",
        "dtype": torch.float16,
        "ram_gb": 15
    },
    "12b": {
        "hf_name": "EleutherAI/pythia-12b",
        "display": "Pythia-12B",
        "dtype": torch.float16,
        "ram_gb": 26
    }
}

config = PYTHIA_CONFIGS[PYTHIA_SIZE]
print(f"Selected: {config['display']}")
print(f"HuggingFace: {config['hf_name']}")
print(f"Estimated GPU RAM: ~{config['ram_gb']} GB")

## Dataset

230 statement pairs across 6 categories:
- Ground Truth Numeric (30)
- Ground Truth Non-Numeric (20)
- Tech Philosophy (50)
- Lifestyle (50)
- Business (50)
- Scientific Facts (30)
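
The loading and validation cells assume that each category maps to an object with a `pairs` list whose entries unpack into two statements. A minimal sketch of a sanity check for that layout is shown here; `validate_dataset` is a hypothetical helper and the toy data is made up, so adapt it to the actual contents of `dataset.json`.

```python
def validate_dataset(dataset: dict) -> int:
    """Check the layout the notebook relies on and return the total pair count."""
    total = 0
    for name, category in dataset.items():
        pairs = category["pairs"]
        for i, pair in enumerate(pairs):
            is_two_strings = len(pair) == 2 and all(
                isinstance(s, str) and s.strip() for s in pair
            )
            if not is_two_strings:
                raise ValueError(
                    f"{name}[{i}] is not a pair of non-empty strings: {pair!r}"
                )
        total += len(pairs)
    return total


# Toy data in the assumed layout (not taken from the real dataset)
toy_dataset = {
    "example_category": {
        "pairs": [
            ["Statement A1", "Statement B1"],
            ["Statement A2", "Statement B2"],
        ]
    }
}
print(validate_dataset(toy_dataset))
```

Running the same check on the loaded `DATASET` would be expected to return 230 if the file matches the counts listed above.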

In [None]:
# Load dataset from GitHub
!wget -q https://raw.githubusercontent.com/buk81/uniformity-asymmetry/main/dataset.json

with open('dataset.json', 'r') as f:
    DATASET = json.load(f)

total_pairs = sum(len(cat['pairs']) for cat in DATASET.values())
print(f"Loaded {total_pairs} statement pairs across {len(DATASET)} categories")
for cat_name, cat_data in DATASET.items():
    print(f"  - {cat_name}: {len(cat_data['pairs'])} pairs")

## Core Functions
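
As a toy illustration of the metric defined in the cell below: uniformity is the mean pairwise cosine similarity within one side of the statement pairs, and UA is the difference between the two sides. The vectors here are invented 2-D points, not model embeddings, and `mean_pairwise_cosine` is a stand-in name for this sketch.

```python
import numpy as np


def mean_pairwise_cosine(embeddings: np.ndarray) -> float:
    """Average cosine similarity over all distinct pairs of rows."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T
    upper = np.triu_indices(len(unit), k=1)  # each pair once, no diagonal
    return float(sims[upper].mean())


# Side A: tightly clustered directions. Side B: spread-out directions.
side_a = np.array([[1.0, 0.0], [1.0, 0.1], [1.0, -0.1]])
side_b = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])

u_a = mean_pairwise_cosine(side_a)
u_b = mean_pairwise_cosine(side_b)
ua = u_a - u_b

print(f"uniformity A = {u_a:.3f}, uniformity B = {u_b:.3f}, UA = {ua:+.3f}")
```

A positive UA here means side A's representations are more collapsed (more similar to each other) than side B's.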

In [None]:
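def get_embedding(text: str, model, tokenizer) -> np.ndarray:
    """
    Extract embedding via mean pooling over the last hidden layer.

    A leading BOS token is skipped only if the tokenizer actually added one;
    the GPT-NeoX tokenizer used by Pythia does not prepend BOS by default,
    so unconditionally dropping position 0 would discard a real token.
    """
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
        hidden_states = outputs.hidden_states[-1]
        # Drop position 0 only when it is a real BOS token
        bos_id = tokenizer.bos_token_id
        has_bos = bos_id is not None and inputs["input_ids"][0, 0].item() == bos_id
        start = 1 if has_bos and hidden_states.shape[1] > 1 else 0
        # Pool in float32 to limit float16 rounding error in the mean
        embedding = hidden_states[0, start:, :].float().mean(dim=0)
    
    return embedding.cpu().numpy()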


def get_output_preference(text_a: str, text_b: str, model, tokenizer) -> float:
    """
    Calculate output preference as NLL(B) - NLL(A).
    Positive = prefers A, Negative = prefers B.
    """
    def get_nll(text):
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        with torch.no_grad():
            outputs = model(**inputs, labels=inputs["input_ids"])
            return outputs.loss.item()
    
    nll_a = get_nll(text_a)
    nll_b = get_nll(text_b)
    
    return nll_b - nll_a  # Positive = prefers A


def uniformity_score(embeddings: np.ndarray) -> float:
    """
    Calculate average pairwise cosine similarity (uniformity).
    Higher = more uniform/collapsed representations.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / (norms + 1e-10)
    kernel = normalized @ normalized.T
    n = kernel.shape[0]
    idx = np.triu_indices(n, k=1)
    return float(np.mean(kernel[idx]))


def bootstrap_ci(data: np.ndarray, n_bootstrap: int = N_BOOTSTRAP,
                 ci: float = 0.95, seed: int = RANDOM_SEED) -> Tuple[float, float]:
    """Calculate a percentile bootstrap confidence interval for the mean."""
    # Local generator: avoids resetting NumPy's global random state on every call
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    # Draw all resamples at once: one row of indices per bootstrap replicate
    idx = rng.integers(0, len(data), size=(n_bootstrap, len(data)))
    bootstrap_means = data[idx].mean(axis=1)
    alpha = 1 - ci
    return (float(np.percentile(bootstrap_means, alpha/2 * 100)),
            float(np.percentile(bootstrap_means, (1 - alpha/2) * 100)))


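def cohens_d(data: np.ndarray, null_value: float = 0) -> float:
    """Calculate one-sample Cohen's d: (mean - null_value) / sample standard deviation."""
    # ddof=1 gives the sample standard deviation, which matters for small n
    return float((np.mean(data) - null_value) / (np.std(data, ddof=1) + 1e-10))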

## Load Model

In [None]:
print(f"Loading {config['display']}...")

tokenizer = AutoTokenizer.from_pretrained(config["hf_name"])
model = AutoModelForCausalLM.from_pretrained(
    config["hf_name"],
    torch_dtype=config["dtype"],
    device_map="auto"
)

print(f"Model loaded on: {model.device}")
print(f"Model dtype: {next(model.parameters()).dtype}")

## Run Validation
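
Each statement pair contributes one output-preference value, NLL(B) - NLL(A), where each NLL is the model's mean per-token cross-entropy (the `loss` returned by a Hugging Face causal LM when `labels` are passed). The arithmetic can be sketched with invented per-token log-probabilities; the helper names are hypothetical and no model is involved.

```python
def mean_nll(token_logprobs: list) -> float:
    """Mean negative log-likelihood per token (length-normalized)."""
    return -sum(token_logprobs) / len(token_logprobs)


def output_preference(logprobs_a: list, logprobs_b: list) -> float:
    """NLL(B) - NLL(A): positive means statement A is the more likely one."""
    return mean_nll(logprobs_b) - mean_nll(logprobs_a)


# Invented log-probabilities for two statements of different length
logprobs_a = [-1.0, -2.0, -3.0]  # mean NLL = 2.0
logprobs_b = [-2.0, -3.0]        # mean NLL = 2.5

pref = output_preference(logprobs_a, logprobs_b)
print(f"preference = {pref:+.2f}")  # positive, so A is preferred
```

Because the NLL is averaged per token, a longer statement is not penalized merely for containing more tokens.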

In [None]:
def run_validation(model, tokenizer, dataset: dict, verbose: bool = True) -> dict:
    """
    Run UA + Output Preference validation on all statement pairs.
    """
    results = {}
    all_asymmetries = []
    all_preferences = []
    
    total_pairs = sum(len(cat["pairs"]) for cat in dataset.values())
    processed = 0
    
    for category_name, category_data in dataset.items():
        if verbose:
            print(f"\n{'='*60}")
            print(f"Category: {category_name.upper()}")
            print(f"{'='*60}")
        
        pairs = category_data["pairs"]
        n_pairs = len(pairs)
        
        embeddings_a = []
        embeddings_b = []
        preferences = []
        
        for i, (stmt_a, stmt_b) in enumerate(pairs):
            processed += 1
            if verbose and processed % 20 == 0:
                print(f"  [{processed:03d}/{total_pairs}] Processing...")
            
            # Get embeddings
            emb_a = get_embedding(stmt_a, model, tokenizer)
            emb_b = get_embedding(stmt_b, model, tokenizer)
            embeddings_a.append(emb_a)
            embeddings_b.append(emb_b)
            
            # Get output preference
            pref = get_output_preference(stmt_a, stmt_b, model, tokenizer)
            preferences.append(pref)
        
        embeddings_a = np.array(embeddings_a)
        embeddings_b = np.array(embeddings_b)
        
        # Calculate uniformity asymmetry
        u_a = uniformity_score(embeddings_a)
        u_b = uniformity_score(embeddings_b)
        asymmetry = u_a - u_b
        mean_pref = float(np.mean(preferences))
        
        all_asymmetries.append(asymmetry)
        all_preferences.append(mean_pref)
        
        results[category_name] = {
            "n_pairs": n_pairs,
            "uniformity_a": u_a,
            "uniformity_b": u_b,
            "asymmetry": asymmetry,
            "output_preference": mean_pref,
            "pct_prefer_a": float(np.mean([p > 0 for p in preferences]) * 100)
        }
        
        if verbose:
            print(f"  UA: {asymmetry:+.4f} | Output Pref: {mean_pref:+.3f} | % A: {results[category_name]['pct_prefer_a']:.1f}%")
    
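    # Global statistics, computed over the per-category values
    # (so n = number of categories, not the number of statement pairs)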
    all_asymmetries = np.array(all_asymmetries)
    all_preferences = np.array(all_preferences)
    
    # Correlation between UA and Output Preference
    correlation = np.corrcoef(all_asymmetries, all_preferences)[0, 1]
    
    results["_global_stats"] = {
        "total_pairs": total_pairs,
        "mean_asymmetry": float(np.mean(all_asymmetries)),
        "std_asymmetry": float(np.std(all_asymmetries)),
        "bootstrap_95ci": bootstrap_ci(all_asymmetries),
        "cohens_d": cohens_d(all_asymmetries),
        "ua_output_correlation": float(correlation),
        "mean_output_preference": float(np.mean(all_preferences))
    }
    
    return results


print(f"Running validation on {config['display']}...")
print(f"Bootstrap resamples: {N_BOOTSTRAP:,}")
print()

results = run_validation(model, tokenizer, DATASET)

## Results Summary
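
The UA-output correlation reported below is computed from one point per category, so with 6 categories it rests on 6 points. To get a feel for how uncertain such a value is, one option is a Fisher z-transform interval. This is a standard approximation that is not part of the notebook's method, and `fisher_ci` is a hypothetical helper.

```python
import math


def fisher_ci(r: float, n: int, z_crit: float = 1.96) -> tuple:
    """Approximate 95% confidence interval for a Pearson r from n points."""
    if n <= 3:
        raise ValueError("Fisher interval needs more than 3 points")
    z = math.atanh(r)                     # Fisher z-transform
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)


lower, upper = fisher_ci(0.5, 6)
print(f"r = 0.50 with n = 6 -> approx. 95% CI [{lower:.2f}, {upper:.2f}]")
```

Under this approximation an observed r of 0.5 from 6 points is compatible with true correlations from roughly -0.52 to 0.93, so the threshold-based labels below are best read as exploratory.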

In [None]:
print("\n" + "="*80)
print(f" UNIFORMITY ASYMMETRY VALIDATION: {config['display']}")
print("="*80)

print("\n--- PER-CATEGORY RESULTS ---")
print(f"{'Category':<25} {'UA':<10} {'Output Pref':<12} {'% A':<8} {'Aligned?'}")
print("-" * 70)

for cat_name, cat_data in results.items():
    if cat_name.startswith("_"):
        continue
    
    ua = cat_data['asymmetry']
    pref = cat_data['output_preference']
    pct_a = cat_data['pct_prefer_a']
    
    # Check alignment: same sign = aligned
    aligned = "YES" if (ua > 0 and pref > 0) or (ua < 0 and pref < 0) or (abs(ua) < 0.02 and abs(pref) < 0.1) else "NO"
    
    print(f"{cat_name:<25} {ua:+.4f}     {pref:+.3f}        {pct_a:>5.1f}%   {aligned}")

gs = results["_global_stats"]

print("\n--- GLOBAL STATISTICS ---")
print(f"Mean UA:              {gs['mean_asymmetry']:+.4f}")
print(f"95% CI:               [{gs['bootstrap_95ci'][0]:.4f}, {gs['bootstrap_95ci'][1]:.4f}]")
print(f"Cohen's d:            {gs['cohens_d']:.3f}")
print(f"UA-Output Correlation: r = {gs['ua_output_correlation']:.3f}")

print("\n--- INTERPRETATION ---")
r = gs['ua_output_correlation']
if r > 0.5:
    print(f"ALIGNED: Embedding asymmetry correlates with output preference (r={r:.2f})")
    print("         This suggests embeddings and outputs are coupled (no RLHF masking).")
elif r < -0.3:
    print(f"INVERTED: Embedding asymmetry anti-correlates with output preference (r={r:.2f})")
    print("          This is an unexpected pattern - needs investigation.")
else:
    print(f"DECOUPLED: Weak correlation between embeddings and outputs (r={r:.2f})")
    print("           Similar to Llama-3 pattern - possible RLHF-like effect.")

## Comparison with Other Models

Reference values from the paper:

In [None]:
# Reference results from the paper (Scientific Facts category)
reference_models = {
    "Llama-3.1-8B": {"ua": -0.013, "pref": 1.165, "interpretation": "MASKED (RLHF)"},
    "Gemma-2-9B": {"ua": 0.021, "pref": 1.138, "interpretation": "Weak"},
    "Mistral-7B": {"ua": 0.010, "pref": 0.972, "interpretation": "MASKED (RLHF)"},
    "Apertus-8B": {"ua": 0.109, "pref": 1.278, "interpretation": "ALIGNED (no RLHF)"}
}

# Get Pythia's Scientific Facts results
pythia_sf = results.get("scientific_facts", {})

print("\n" + "="*80)
print(" CROSS-MODEL COMPARISON (Scientific Facts Category)")
print("="*80)
print(f"{'Model':<20} {'UA':<10} {'Output Pref':<12} {'Interpretation'}")
print("-" * 70)

for model_name, data in reference_models.items():
    print(f"{model_name:<20} {data['ua']:+.3f}      {data['pref']:+.3f}        {data['interpretation']}")

if pythia_sf:
    pythia_interp = "ALIGNED" if abs(pythia_sf['asymmetry']) > 0.05 else "NEUTRAL"
    print(f"{config['display']:<20} {pythia_sf['asymmetry']:+.3f}      {pythia_sf['output_preference']:+.3f}        {pythia_interp} (Base Model)")

print("\n--- KEY QUESTION ---")
print("Does Pythia (no RLHF) show the same disconnect as Llama-3 (heavy RLHF)?")
if pythia_sf:
    if abs(pythia_sf['asymmetry']) < 0.02 and abs(pythia_sf['output_preference']) > 0.5:
        print("RESULT: YES - Pythia shows similar disconnect (surprising for base model!)")
    elif abs(pythia_sf['asymmetry']) > 0.05:
        print("RESULT: NO - Pythia embeddings show asymmetry (expected for base model)")
    else:
        print("RESULT: UNCLEAR - Need to analyze further")

## Save Results

In [None]:
# Add metadata
results["_metadata"] = {
    "model": config["hf_name"],
    "model_display": config["display"],
    "timestamp": datetime.now().isoformat(),
    "n_bootstrap": N_BOOTSTRAP,
    "random_seed": RANDOM_SEED,
    "purpose": "EleutherAI community contribution - Base model comparison"
}

# Save to JSON
output_file = f"pythia_{PYTHIA_SIZE}_ua_results_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
with open(output_file, 'w') as f:
    json.dump(results, f, indent=2)

print(f"Results saved to: {output_file}")

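# Download link (Colab only; skipped when running elsewhere)
try:
    from google.colab import files
    files.download(output_file)
except ImportError:
    print("Not running in Colab; skipping browser download.")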

## Visualization

In [None]:
import matplotlib.pyplot as plt

# Prepare data for plotting
categories = [k for k in results.keys() if not k.startswith("_")]
ua_values = [results[k]['asymmetry'] for k in categories]
pref_values = [results[k]['output_preference'] for k in categories]

# Create scatter plot
fig, ax = plt.subplots(figsize=(10, 7))

# Plot Pythia results
ax.scatter([u * 10 for u in ua_values], pref_values, s=150, c='purple', 
           label=config['display'], edgecolors='black', linewidths=1.5, zorder=5)

# Add category labels
for i, cat in enumerate(categories):
    ax.annotate(cat.replace('_', '\n'), (ua_values[i] * 10, pref_values[i]),
                xytext=(5, 5), textcoords='offset points', fontsize=8)

# Add reference models (Scientific Facts only)
ref_colors = {'Llama-3.1-8B': 'blue', 'Apertus-8B': 'red', 'Gemma-2-9B': 'orange', 'Mistral-7B': 'green'}
for model_name, data in reference_models.items():
    ax.scatter(data['ua'] * 10, data['pref'], s=100, c=ref_colors[model_name],
               marker='x', label=f"{model_name} (ref)", linewidths=2)

# Reference lines
ax.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
ax.axvline(x=0, color='gray', linestyle='--', alpha=0.5)

ax.set_xlabel('Uniformity Asymmetry (UA × 10)', fontsize=11, fontweight='bold')
ax.set_ylabel('Output Preference (mean ΔNLL)', fontsize=11, fontweight='bold')
ax.set_title(f'{config["display"]}: UA vs Output Preference by Category', fontsize=13, fontweight='bold')
ax.legend(loc='best', fontsize=9)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(f'pythia_{PYTHIA_SIZE}_ua_scatter.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"\nPlot saved to: pythia_{PYTHIA_SIZE}_ua_scatter.png")

## Summary for EleutherAI Discord

Copy this summary to share results:

In [None]:
gs = results["_global_stats"]
sf = results.get("scientific_facts", {})

# Format the Scientific Facts values, falling back to N/A if the category is missing
sf_ua = f"{sf['asymmetry']:+.4f}" if sf else "N/A"
sf_pref = f"{sf['output_preference']:+.3f}" if sf else "N/A"

# Use the same thresholds as the interpretation in the Results Summary cell
r = gs['ua_output_correlation']
if r > 0.5:
    interpretation = "Pythia shows embedding-output alignment (expected for base model)"
elif r < -0.3:
    interpretation = "Pythia shows an inverted embedding-output relationship (unexpected, needs investigation)"
else:
    interpretation = "Pythia shows weak correlation - similar to RLHF models (unexpected)"

summary = f"""
## Pythia UA Validation Results

**Model:** {config['display']}
**Dataset:** 230 statement pairs across 6 categories
**Method:** Uniformity Asymmetry (embedding clustering) + Output Preference (NLL)

### Key Metrics
- Mean UA: {gs['mean_asymmetry']:+.4f}
- 95% CI: [{gs['bootstrap_95ci'][0]:.4f}, {gs['bootstrap_95ci'][1]:.4f}]
- UA-Output Correlation: r = {gs['ua_output_correlation']:.3f}

### Scientific Facts Category (for comparison with paper)
- UA: {sf_ua}
- Output Preference: {sf_pref}

### Interpretation
{interpretation}

Full results: [JSON file attached]
Repo: github.com/buk81/uniformity-asymmetry
"""

print(summary)