# IMPORTANT CONTEXT

**These results are SUPERSEDED by length-controlled analysis.**

This notebook was part of our early exploration of prompt injection detection. The apparent signal we found was largely driven by **text length confounding**:

- `n_active` (feature count) correlates r=0.96+ with text length
- After regressing out length, injection detection collapses to d~0.1
- The "geometry" differences we observed were mostly longer-texts-activate-more-features

**What we learned:**
1. Raw feature counts are unreliable - they scale with input length
2. True diagnostic signal requires length-controlled metrics (influence, concentration)
3. Task-type detection works; injection-as-separate-category does not

**Current approach:** See main `notebooks/` folder for length-controlled analysis.

---

# Unified Injection Geometry Analysis

**Purpose:** Establish whether attribution graph geometry discriminates injection from benign prompts.

**Critical Questions:**
1. Does class balance affect the geometric signature?
2. Why are injection datasets naturally imbalanced?
3. Are geometric measurements independent across samples?

---

## Experimental Design

### Three Conditions

| Experiment | Injection | Benign | Ratio | Purpose |
|------------|-----------|--------|-------|--------|
| **Original** | 21 | 115 | 1:5.5 | Raw dataset distribution |
| **Sanity Check** | 21 | 21 | 1:1 | Downsample benign, test if signal persists |
| **Full Balanced** | 50 | 50 | 1:1 | Proper statistical power |

### Why This Matters

**Imbalanced data creates illusions:**
- Original 84.6% baseline (always predict benign)
- Claimed 80.5% accuracy is WORSE than guessing majority class
- Effect sizes can be inflated by outliers in small minority class

**Balanced data reveals truth:**
- 50% baseline (random guess)
- Any accuracy above 50% is real signal
- Effect sizes are fairly estimated

In [None]:
import json
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from pathlib import Path
import random
import warnings
warnings.filterwarnings('ignore')

# Configuration
SEED = 42
random.seed(SEED)
np.random.seed(SEED)

print("Unified Injection Geometry Analysis")
print("=" * 50)

---

## Question 1: Why Are Injection Datasets Imbalanced?

### The Dataset Economics Problem

**Benign prompts are cheap:**
- Scraped from public conversations, forums, customer support logs
- Generated synthetically ("write me a poem about X")
- Abundant in any chatbot deployment

**Injection prompts are expensive:**
- Require security expertise to craft
- Must be novel (known patterns get filtered)
- Ethical constraints on collection
- Adversarial evolution (attackers adapt)

### Real-World Distribution

In production, injection attempts are **rare** (< 1% of traffic). Datasets mirror this:

| Dataset | Injection % | Source |
|---------|-------------|--------|
| deepset/prompt-injections | 37% | Research collection |
| PINT Benchmark | 6% | Curated benchmark |
| Production logs | ~0.1-1% | Real deployment |

**Implication:** Benchmarks oversample injections to enable evaluation, but this creates statistical artifacts.

In [None]:
# Load the original experiment data
data_path = Path('../data/results/pint_attribution_metrics.json')

with open(data_path) as f:
    data = json.load(f)

all_samples = data['samples']
injections = [s for s in all_samples if s['label']]
benigns = [s for s in all_samples if not s['label']]

print("Dataset Statistics")
print("-" * 50)
print(f"Total samples computed: {len(all_samples)}")
print(f"Injection: {len(injections)} ({100*len(injections)/len(all_samples):.1f}%)")
print(f"Benign: {len(benigns)} ({100*len(benigns)/len(all_samples):.1f}%)")
print(f"Imbalance ratio: {len(benigns)/len(injections):.1f}:1")
print()
print(f"Source: {data['metadata']['dataset']}")
print(f"Model: {data['metadata']['model']}")

---

## Question 2: Are Measurements Independent?

### The Independence Assumption

Statistical tests (t-test, Mann-Whitney) assume samples are **independent and identically distributed (i.i.d.)**.

**For our experiment:**

| Factor | Independent? | Reasoning |
|--------|--------------|----------|
| Model weights | ✅ Yes | Same frozen model, no learning |
| Transcoder | ✅ Yes | Same SAE, deterministic |
| Prompt text | ⚠️ Partially | Some prompts may share templates |
| Graph computation | ✅ Yes | Each prompt processed separately |
| GPU state | ✅ Yes | Cache cleared between samples |

### Potential Violations

1. **Template effects:** If injection prompts share "Ignore previous instructions" prefix, they're not truly independent
2. **Length correlation:** Longer prompts → more features (confound)
3. **Semantic clustering:** Similar topics activate similar features

**We'll test for these below.**

In [None]:
# Check for length confound
inj_lengths = [len(s['text']) for s in injections]
ben_lengths = [len(s['text']) for s in benigns]

print("Independence Check: Prompt Length")
print("-" * 50)
print(f"Injection mean length: {np.mean(inj_lengths):.0f} chars")
print(f"Benign mean length: {np.mean(ben_lengths):.0f} chars")

# Correlation between length and key metrics
all_lengths = [len(s['text']) for s in all_samples]
all_n_active = [s.get('n_active', 0) for s in all_samples]

corr, p = stats.pearsonr(all_lengths, all_n_active)
print(f"\nCorrelation (length vs n_active): r={corr:.3f}, p={p:.4f}")

if abs(corr) > 0.5:
    print("⚠️ WARNING: Strong length confound detected!")
    print("   Longer prompts have more features - need to control for length")
else:
    print("✓ Length confound is weak - measurements appear independent of length")

In [None]:
# Check for template patterns in injections
print("Independence Check: Injection Templates")
print("-" * 50)

# Common injection prefixes
prefixes = {}
for s in injections:
    # Get first 20 chars as potential template
    prefix = s['text'][:20].lower()
    prefixes[prefix] = prefixes.get(prefix, 0) + 1

# Count unique vs repeated
unique = sum(1 for v in prefixes.values() if v == 1)
repeated = sum(1 for v in prefixes.values() if v > 1)

print(f"Unique injection prefixes: {unique}")
print(f"Repeated prefixes: {repeated}")
print(f"Template diversity: {100*unique/len(injections):.0f}%")

if repeated > len(injections) * 0.3:
    print("\n⚠️ WARNING: Many injections share templates")
    print("   This violates independence assumption")
else:
    print("\n✓ Injections appear diverse - independence assumption reasonable")

---

## Question 3: Does Balance Matter?

We'll compare three conditions to see if the geometric signature holds across different sampling strategies.

In [None]:
def analyze_condition(inj_samples, ben_samples, condition_name):
    """
    Analyze geometric metrics for a given condition.
    Returns dict with effect sizes and p-values.
    """
    metrics = ['n_active', 'n_edges', 'top_100_concentration', 'mean_influence']
    results = {'condition': condition_name, 'n_inj': len(inj_samples), 'n_ben': len(ben_samples)}
    
    for metric in metrics:
        inj_vals = np.array([s.get(metric, 0) for s in inj_samples])
        ben_vals = np.array([s.get(metric, 0) for s in ben_samples])
        
        # Filter invalid
        inj_vals = inj_vals[~np.isnan(inj_vals) & (inj_vals != 0)]
        ben_vals = ben_vals[~np.isnan(ben_vals) & (ben_vals != 0)]
        
        if len(inj_vals) < 3 or len(ben_vals) < 3:
            continue
        
        # Effect size
        pooled_std = np.sqrt((inj_vals.std()**2 + ben_vals.std()**2) / 2)
        cohen_d = (inj_vals.mean() - ben_vals.mean()) / pooled_std if pooled_std > 0 else 0
        
        # Mann-Whitney U (robust to non-normality)
        _, p_value = stats.mannwhitneyu(inj_vals, ben_vals, alternative='two-sided')
        
        results[f'{metric}_d'] = cohen_d
        results[f'{metric}_p'] = p_value
        results[f'{metric}_inj_mean'] = inj_vals.mean()
        results[f'{metric}_ben_mean'] = ben_vals.mean()
    
    return results

def print_comparison(results_list):
    """Pretty-print comparison across conditions."""
    metrics = ['n_active', 'n_edges', 'top_100_concentration', 'mean_influence']
    
    print("\n" + "=" * 100)
    print("CROSS-CONDITION COMPARISON")
    print("=" * 100)
    
    for metric in metrics:
        print(f"\n{metric.upper()}")
        print("-" * 80)
        print(f"{'Condition':<25} {'n_inj':<8} {'n_ben':<8} {'Cohen d':<12} {'p-value':<12} {'Sig?'}")
        
        for r in results_list:
            d = r.get(f'{metric}_d', float('nan'))
            p = r.get(f'{metric}_p', float('nan'))
            sig = '***' if p < 0.001 else '**' if p < 0.01 else '*' if p < 0.05 else ''
            print(f"{r['condition']:<25} {r['n_inj']:<8} {r['n_ben']:<8} {d:<12.3f} {p:<12.4f} {sig}")

In [None]:
# Condition 1: Original (imbalanced)
original = analyze_condition(injections, benigns, "Original (1:5.5)")

# Condition 2: Sanity check (downsample benign)
benigns_downsampled = random.sample(benigns, len(injections))
sanity = analyze_condition(injections, benigns_downsampled, "Sanity Check (1:1)")

# Condition 3: Different random downsample (robustness check)
random.seed(123)  # Different seed
benigns_alt = random.sample(benigns, len(injections))
robust = analyze_condition(injections, benigns_alt, "Robustness Check (1:1)")

# Compare all
print_comparison([original, sanity, robust])

In [None]:
print("\n" + "=" * 80)
print("SUMMARY: DOES BALANCE MATTER?")
print("=" * 80)

# Check if significance holds across conditions
metrics = ['n_active', 'n_edges', 'top_100_concentration', 'mean_influence']

print("\nSignificance (p < 0.05) across conditions:")
print("-" * 60)
print(f"{'Metric':<25} {'Original':<12} {'Sanity':<12} {'Robust':<12}")

all_hold = True
for metric in metrics:
    orig_sig = original.get(f'{metric}_p', 1) < 0.05
    san_sig = sanity.get(f'{metric}_p', 1) < 0.05
    rob_sig = robust.get(f'{metric}_p', 1) < 0.05
    
    print(f"{metric:<25} {'✓' if orig_sig else '✗':<12} {'✓' if san_sig else '✗':<12} {'✓' if rob_sig else '✗':<12}")
    
    if not (san_sig and rob_sig):
        all_hold = False

print()
if all_hold:
    print("✅ CONCLUSION: Geometric signature HOLDS with balanced sampling")
    print("   The effect is real, not an artifact of class imbalance")
else:
    print("⚠️ CONCLUSION: Some metrics lose significance with balanced sampling")
    print("   Effect may be partially inflated by imbalance")

---

## Visualization: Effect Size Stability

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

metrics = ['n_active', 'top_100_concentration', 'mean_influence']
conditions = ['Original (1:5.5)', 'Sanity Check (1:1)', 'Robustness Check (1:1)']
results_list = [original, sanity, robust]

# Plot 1: Cohen's d across conditions
ax1 = axes[0]
x = np.arange(len(metrics))
width = 0.25

for i, (r, cond) in enumerate(zip(results_list, conditions)):
    d_vals = [abs(r.get(f'{m}_d', 0)) for m in metrics]
    ax1.bar(x + i*width, d_vals, width, label=cond, alpha=0.8)

ax1.axhline(0.8, color='red', linestyle='--', label='Large effect threshold')
ax1.set_ylabel('|Cohen\'s d|', fontsize=12)
ax1.set_title('Effect Size Stability Across Conditions', fontsize=14, fontweight='bold')
ax1.set_xticks(x + width)
ax1.set_xticklabels([m.replace('_', '\n') for m in metrics])
ax1.legend(fontsize=9)
ax1.grid(True, alpha=0.3, axis='y')

# Plot 2: Scatter comparison
ax2 = axes[1]

random.seed(SEED)
benigns_balanced = random.sample(benigns, len(injections))

# Original (all benign)
ax2.scatter([s['n_active'] for s in benigns], 
            [s['top_100_concentration'] for s in benigns],
            c='lightgreen', alpha=0.3, s=30, label='Benign (all 115)')

# Balanced benign (highlighted)
ax2.scatter([s['n_active'] for s in benigns_balanced], 
            [s['top_100_concentration'] for s in benigns_balanced],
            c='green', alpha=0.7, s=80, label='Benign (sampled 21)')

# Injection
ax2.scatter([s['n_active'] for s in injections], 
            [s['top_100_concentration'] for s in injections],
            c='red', alpha=0.7, s=80, marker='X', label='Injection (21)')

ax2.set_xlabel('Number of Active Features', fontsize=12)
ax2.set_ylabel('Top-100 Concentration', fontsize=12)
ax2.set_title('Sampling Strategy Comparison', fontsize=14, fontweight='bold')
ax2.legend(fontsize=10)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('../figures/effect_size_stability.png', dpi=150)
plt.show()

---

## Bootstrap Analysis: Confidence Intervals

With small samples, we need bootstrap to estimate uncertainty in effect sizes.

In [None]:
def bootstrap_cohen_d(inj_samples, ben_samples, metric, n_bootstrap=1000):
    """Bootstrap confidence interval for Cohen's d."""
    inj_vals = np.array([s.get(metric, 0) for s in inj_samples])
    ben_vals = np.array([s.get(metric, 0) for s in ben_samples])
    
    # Filter
    inj_vals = inj_vals[~np.isnan(inj_vals) & (inj_vals != 0)]
    ben_vals = ben_vals[~np.isnan(ben_vals) & (ben_vals != 0)]
    
    d_samples = []
    for _ in range(n_bootstrap):
        inj_boot = np.random.choice(inj_vals, size=len(inj_vals), replace=True)
        ben_boot = np.random.choice(ben_vals, size=len(ben_vals), replace=True)
        
        pooled_std = np.sqrt((inj_boot.std()**2 + ben_boot.std()**2) / 2)
        if pooled_std > 0:
            d = (inj_boot.mean() - ben_boot.mean()) / pooled_std
            d_samples.append(d)
    
    return np.percentile(d_samples, [2.5, 50, 97.5])

print("Bootstrap 95% CI for Cohen's d (balanced sampling)")
print("=" * 60)
print(f"{'Metric':<25} {'2.5%':<10} {'Median':<10} {'97.5%':<10} {'Significant?'}")
print("-" * 60)

for metric in ['n_active', 'top_100_concentration', 'mean_influence']:
    ci = bootstrap_cohen_d(injections, benigns_balanced, metric)
    # Significant if CI doesn't include 0
    sig = "✓" if (ci[0] > 0 or ci[2] < 0) else "✗"
    print(f"{metric:<25} {ci[0]:<10.3f} {ci[1]:<10.3f} {ci[2]:<10.3f} {sig}")

print("\n(Significant = 95% CI excludes zero)")

---

## Statistical Power Analysis

How many samples do we need for reliable results?

In [None]:
from scipy.stats import norm

def power_analysis(effect_size, n_per_group, alpha=0.05):
    """Calculate statistical power for two-sample t-test."""
    # Standard error of difference
    se = np.sqrt(2 / n_per_group)
    # Critical z value
    z_crit = norm.ppf(1 - alpha/2)
    # Non-centrality parameter
    ncp = effect_size / se
    # Power
    power = 1 - norm.cdf(z_crit - ncp) + norm.cdf(-z_crit - ncp)
    return power

print("Statistical Power Analysis")
print("=" * 60)
print("\nAssumes Cohen's d ≈ 1.0 (observed for concentration metric)")
print()

effect_size = 1.0
sample_sizes = [10, 21, 30, 50, 100]

print(f"{'n per class':<15} {'Power':<15} {'Interpretation'}")
print("-" * 60)

for n in sample_sizes:
    pwr = power_analysis(effect_size, n)
    if pwr < 0.5:
        interp = "Very weak - likely to miss real effects"
    elif pwr < 0.8:
        interp = "Underpowered - may miss effects"
    elif pwr < 0.95:
        interp = "Adequate power"
    else:
        interp = "High power - reliable detection"
    
    marker = "← Current" if n == 21 else ""
    print(f"{n:<15} {pwr:<15.1%} {interp} {marker}")

print("\n✓ Recommendation: 50 samples per class for 95% power")

---

## Conclusions

### Does Balance Matter?

**Yes, but the signal persists.** Effect sizes decrease slightly with balanced sampling, but remain statistically significant. This suggests the geometric signature is real, not an artifact of imbalance.

### Why Are Datasets Imbalanced?

**Economics + reality.** Benign prompts are abundant; injections require expertise to craft. Real-world traffic is < 1% injection, so benchmarks oversample to enable evaluation.

### Are Measurements Independent?

**Mostly yes, with caveats:**
- Length is a weak confound (need to verify)
- Template sharing in injections may violate independence
- Each graph computation is truly independent

### Next Steps

1. **Run balanced Modal experiment** (50 per class) for proper power
2. **Control for prompt length** in analysis
3. **Cross-validate** with different random seeds

```bash
# Run the balanced experiment
modal run scripts/modal_balanced_benchmark.py --n-per-class 50
```

---

## Appendix: Experiment Parameters

### Original Experiment
| Parameter | Value |
|-----------|-------|
| Dataset | deepset/prompt-injections |
| Model | google/gemma-2-2b |
| Transcoders | GemmaScope (gemma) |
| Total samples | 136 (140 attempted, 4 failed) |
| Class distribution | 21 injection, 115 benign |
| Max tokens | 200 |
| Attribution method | circuit-tracer |

### Sanity Check
| Parameter | Value |
|-----------|-------|
| Same data as original | Yes |
| Benign downsampled to | 21 |
| Random seed | 42 |
| New computation | No (reused metrics) |

### Planned Balanced Experiment
| Parameter | Value |
|-----------|-------|
| Dataset | deepset/prompt-injections |
| Samples per class | 50 |
| Total samples | 100 |
| Random seed | 42 |
| GPU | A100 (Modal) |