# IMPORTANT CONTEXT

**These results are SUPERSEDED by length-controlled analysis.**

This notebook was part of our early exploration of prompt injection detection. The apparent signal we found was largely driven by **text length confounding**:

- `n_active` (feature count) correlates with text length at r ≥ 0.96
- After regressing out length, the injection effect size collapses to Cohen's d ≈ 0.1
- The "geometry" differences we observed mostly reflected longer texts activating more features
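
To make the confound concrete, here is a minimal sketch of what "regressing out length" can look like. It is not the analysis from the main `notebooks/` folder: the data is synthetic, the helper names `cohens_d` and `residualize` are hypothetical, and a single pooled linear fit is an assumption about how length was controlled.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

def residualize(metric, length):
    """Subtract the least-squares linear fit of metric on length."""
    slope, intercept = np.polyfit(length, metric, deg=1)
    return metric - (slope * length + intercept)

# Synthetic illustration only: the metric depends on length and nothing else,
# and the "injection" class is simply longer on average.
rng = np.random.default_rng(0)
length = np.concatenate([rng.normal(200, 40, 200), rng.normal(320, 40, 200)])
is_inj = np.concatenate([np.zeros(200, bool), np.ones(200, bool)])
n_active = 5.0 * length + rng.normal(0, 50, 400)

raw_d = cohens_d(n_active[is_inj], n_active[~is_inj])
resid = residualize(n_active, length)
controlled_d = cohens_d(resid[is_inj], resid[~is_inj])
print(f"raw d = {raw_d:.2f}, length-controlled d = {controlled_d:.2f}")
```

In this toy setup the raw effect size is large while the length-controlled one is close to zero, which is the same qualitative pattern described in the bullets above.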

**What we learned:**
1. Raw feature counts are unreliable - they scale with input length
2. True diagnostic signal requires length-controlled metrics (influence, concentration)
3. Task-type detection works; injection-as-separate-category does not

**Current approach:** See main `notebooks/` folder for length-controlled analysis.

---

# Balanced Injection Geometry Experiment

**Problem:** Previous analysis had severe class imbalance (21 injection vs 115 benign = 15% minority).
A classifier could score 84.6% by always predicting benign.

**Solution:** Balanced 1:1 sampling to test if geometric signatures hold.
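
The baseline figures follow directly from the class counts. The short check below reproduces them and also shows, as an illustrative aside, that balanced accuracy (the mean of per-class recall) puts the same "always benign" classifier at 50%.

```python
# Counts quoted in the Problem statement
n_injection, n_benign = 21, 115
total = n_injection + n_benign

minority_share = n_injection / total
majority_baseline = n_benign / total  # accuracy of "always predict benign"

# "Always benign" has recall 1.0 on benign and 0.0 on injection,
# so its balanced accuracy is the mean of the two.
balanced_accuracy = (1.0 + 0.0) / 2

print(f"minority share:    {minority_share:.1%}")
print(f"majority baseline: {majority_baseline:.1%}")
print(f"balanced accuracy: {balanced_accuracy:.1%}")
```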

---

In [None]:
import json
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Load existing metrics
with open('../data/results/pint_attribution_metrics.json') as f:
    data = json.load(f)

samples = data['samples']
injections = [s for s in samples if s['label']]
benigns = [s for s in samples if not s['label']]

print(f"Available data:")
print(f"  Injection: {len(injections)}")
print(f"  Benign: {len(benigns)}")
print(f"  Imbalance ratio: {len(benigns)/len(injections):.1f}x more benign")

## Create Balanced Dataset

Downsample the benign class to match the injection count. This leaves few samples, which limits statistical power, but it removes the majority-class advantage and allows a fair comparison.
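
A single random draw of the benign class can be lucky or unlucky. One way to gauge that, sketched here on synthetic data, is to repeat the downsampling with different draws and look at how much a summary statistic moves. The function name `repeated_downsample_means` and the generated values are illustrative assumptions, not part of this notebook's pipeline.

```python
import random
import statistics

def repeated_downsample_means(majority, n_minority, key, n_repeats=200, seed=0):
    """Mean of `key` over many independent downsamples of the majority class."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_repeats):
        draw = rng.sample(majority, n_minority)  # sampling without replacement
        means.append(statistics.fmean(s[key] for s in draw))
    return means

# Synthetic stand-in for 115 benign samples (values are made up)
demo_rng = random.Random(1)
benign_demo = [{'n_active': demo_rng.gauss(1000, 200)} for _ in range(115)]

means = repeated_downsample_means(benign_demo, n_minority=21, key='n_active')
print(f"benign mean across draws: min={min(means):.0f}, max={max(means):.0f}, "
      f"sd={statistics.stdev(means):.0f}")
```

If the spread across draws is comparable to the injection-vs-benign difference, a result from one seed should be read with caution.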

In [None]:
import random
random.seed(42)

# Match the minority class
n_per_class = len(injections)
benigns_balanced = random.sample(benigns, n_per_class)

print(f"Balanced dataset:")
print(f"  Injection: {len(injections)}")
print(f"  Benign: {len(benigns_balanced)}")
print(f"  Total: {len(injections) + len(benigns_balanced)}")
print()
print(f"⚠️ WARNING: Only {n_per_class} samples per class - statistical power is limited")
print(f"   For robust results, run Modal with 50+ samples per class")

## Geometric Analysis on Balanced Data

In [None]:
def compare_distributions(metric_name, inj_samples, ben_samples):
    """Compare metric distributions with proper statistical tests."""
    inj_vals = np.array([s.get(metric_name, 0) for s in inj_samples])
    ben_vals = np.array([s.get(metric_name, 0) for s in ben_samples])
    
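    # Drop NaNs and zeros. Zero is also the placeholder for a missing metric
    # (see the .get default above), so genuine zero values are dropped too.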
    inj_vals = inj_vals[~np.isnan(inj_vals) & (inj_vals != 0)]
    ben_vals = ben_vals[~np.isnan(ben_vals) & (ben_vals != 0)]
    
    if len(inj_vals) < 3 or len(ben_vals) < 3:
        return None
    
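    # Effect size (Cohen's d) using the pooled sample standard deviation,
    # weighted by group size because filtering can leave unequal n
    n_inj, n_ben = len(inj_vals), len(ben_vals)
    pooled_var = ((n_inj - 1) * inj_vals.var(ddof=1) + (n_ben - 1) * ben_vals.var(ddof=1)) / (n_inj + n_ben - 2)
    pooled_std = np.sqrt(pooled_var)
    cohen_d = abs(inj_vals.mean() - ben_vals.mean()) / pooled_std if pooled_std > 0 else 0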
    
    # Mann-Whitney U test (non-parametric)
    _, p_value = stats.mannwhitneyu(inj_vals, ben_vals, alternative='two-sided')
    
    # Direction
    direction = "HIGHER" if inj_vals.mean() > ben_vals.mean() else "LOWER"
    
    return {
        'metric': metric_name,
        'injection_mean': inj_vals.mean(),
        'injection_std': inj_vals.std(),
        'benign_mean': ben_vals.mean(),
        'benign_std': ben_vals.std(),
        'cohen_d': cohen_d,
        'p_value': p_value,
        'direction': direction,
        'significant': p_value < 0.05,
        'n_inj': len(inj_vals),
        'n_ben': len(ben_vals),
    }

# Analyze key metrics
metrics_to_test = ['n_active', 'n_edges', 'top_100_concentration', 'mean_influence']

print("=" * 90)
print("BALANCED GEOMETRY ANALYSIS")
print("=" * 90)
print()
print(f"{'Metric':<25} {'Injection':<15} {'Benign':<15} {'Cohen d':<10} {'p-value':<12} {'Direction'}")
print("-" * 90)

results = []
for metric in metrics_to_test:
    r = compare_distributions(metric, injections, benigns_balanced)
    if r:
        results.append(r)
        sig = "*" if r['significant'] else ""
        print(f"{r['metric']:<25} {r['injection_mean']:<15.2f} {r['benign_mean']:<15.2f} "
              f"{r['cohen_d']:<10.2f} {r['p_value']:<12.4f} {r['direction']} {sig}")

print()
print("* = statistically significant (p < 0.05)")
print("Cohen's d > 0.8 = large effect")

## Visualization: Balanced Scatter Plot

In [None]:
fig, ax = plt.subplots(figsize=(12, 8))

# Extract data
inj_x = [s['n_active'] for s in injections]
inj_y = [s['top_100_concentration'] for s in injections]
ben_x = [s['n_active'] for s in benigns_balanced]
ben_y = [s['top_100_concentration'] for s in benigns_balanced]

# Plot with equal weight
ax.scatter(ben_x, ben_y, c='green', alpha=0.7, s=80, label=f'Benign (n={len(benigns_balanced)})')
ax.scatter(inj_x, inj_y, c='red', alpha=0.7, s=80, marker='X', label=f'Injection (n={len(injections)})')

# Add centroids
ax.scatter(np.mean(ben_x), np.mean(ben_y), c='darkgreen', s=300, marker='s', edgecolors='white', linewidth=2, label='Benign centroid')
ax.scatter(np.mean(inj_x), np.mean(inj_y), c='darkred', s=300, marker='s', edgecolors='white', linewidth=2, label='Injection centroid')

ax.set_xlabel('Number of Active Features', fontsize=12)
ax.set_ylabel('Top-100 Concentration', fontsize=12)
ax.set_title(f'Balanced Injection Geometry\n(n={len(injections)} per class)', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('../figures/balanced_geometry_scatter.png', dpi=150)
plt.show()

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

metrics = [
    ('n_active', 'Number of Active Features'),
    ('n_edges', 'Number of Connections'),
    ('top_100_concentration', 'Influence Concentration'),
    ('mean_influence', 'Average Connection Strength'),
]

for ax, (metric, title) in zip(axes.flatten(), metrics):
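    # Drop missing (defaulted to 0) and NaN values, matching compare_distributions
    inj_vals = [v for v in (s.get(metric, 0) for s in injections) if v and not np.isnan(v)]
    ben_vals = [v for v in (s.get(metric, 0) for s in benigns_balanced) if v and not np.isnan(v)]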
    
    # Plot both distributions with same weight
    ax.hist(ben_vals, bins=15, alpha=0.6, label=f'Benign (n={len(ben_vals)})', color='green', density=True)
    ax.hist(inj_vals, bins=15, alpha=0.6, label=f'Injection (n={len(inj_vals)})', color='red', density=True)
    
    # Add vertical lines for means
    ax.axvline(np.mean(ben_vals), color='darkgreen', linestyle='--', linewidth=2, label='Benign mean')
    ax.axvline(np.mean(inj_vals), color='darkred', linestyle='--', linewidth=2, label='Injection mean')
    
    ax.set_xlabel(title, fontsize=11)
    ax.set_ylabel('Density', fontsize=11)
    ax.set_title(title, fontsize=12, fontweight='bold')
    ax.legend(fontsize=9)
    ax.grid(True, alpha=0.3)

plt.suptitle('BALANCED Distribution Comparison (1:1 sampling)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('../figures/balanced_distributions.png', dpi=150)
plt.show()

## Simple Classifier Test

With balanced data, baseline accuracy is 50% (random guess).

In [None]:
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Combine balanced data
all_samples = injections + benigns_balanced
labels = [1] * len(injections) + [0] * len(benigns_balanced)

# Feature matrix
features = ['n_active', 'n_edges', 'top_100_concentration', 'mean_influence']
X = np.array([[s.get(f, 0) for f in features] for s in all_samples])
y = np.array(labels)

# Scale inside a pipeline so the scaler is fit on each training fold only;
# fitting it on all data before CV would leak held-out statistics
clf = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=100, random_state=42, max_depth=3),
)

# Cross-validation with small k due to limited samples
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(clf, X, y, cv=cv, scoring='accuracy')

print("=" * 60)
print("BALANCED CLASSIFIER EVALUATION")
print("=" * 60)
print()
print(f"Samples per class: {len(injections)}")
print(f"Baseline (random): 50.0%")
print()
print(f"5-Fold CV Accuracy: {scores.mean()*100:.1f}% (±{scores.std()*100:.1f}%)")
print(f"Individual folds: {[f'{s*100:.0f}%' for s in scores]}")
print()

if scores.mean() > 0.6:
    print("✓ Mean CV accuracy is above 60%; geometric features may carry signal beyond chance")
else:
    print("⚠ Geometric separation may be weaker than raw analysis suggested")

## Conclusions

### What This Shows

With balanced 1:1 sampling:
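- **Baseline is 50%** (not the 84.6% majority-class baseline of the imbalanced data)
- Accuracy above 50% indicates signal only if it exceeds the sampling noise expected at ~21 samples per class; a few points above chance does not
- Per the context note at the top, any signal found here may still reflect text length rather than injection content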
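
As a rough guide to how far above 50% an accuracy needs to be, the sketch below computes an exact one-sided binomial tail against chance. It treats the pooled cross-validation predictions as independent coin flips, which is only an approximation for k-fold CV, and the helper names are hypothetical.

```python
from math import comb

def binomial_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def min_correct_for_significance(n, alpha=0.05, p=0.5):
    """Smallest number of correct predictions whose tail probability is below alpha."""
    for k in range(n + 1):
        if binomial_tail(k, n, p) < alpha:
            return k
    return None

n_total = 42  # 21 injection + 21 benign
k = min_correct_for_significance(n_total)
print(f"Need at least {k}/{n_total} correct ({k / n_total:.1%}) for one-sided p < 0.05")
```

Under these assumptions, 42 predictions need 27 correct (about 64%) before chance becomes an unlikely explanation, so accuracies in the low 50s carry little evidence.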

### Limitations

- Only ~21 samples per class = high variance
- Need 50+ per class for robust statistical claims
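
For a sense of scale on sample size, here is a back-of-the-envelope power calculation using the standard normal approximation for a two-sided, two-sample comparison of means. It is a sketch under stated assumptions: the notebook itself uses Mann-Whitney U rather than a t-test, and the function name `n_per_class_for_effect` is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def n_per_class_for_effect(d, alpha=0.05, power=0.80):
    """Approximate samples per class needed to detect Cohen's d.

    Uses n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2, the normal
    approximation for a two-sided two-sample test with equal group sizes.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)

for d in (0.8, 0.5, 0.1):
    print(f"d = {d}: about {n_per_class_for_effect(d)} samples per class")
```

The required sample size grows with 1/d², so the number needed depends heavily on how large the true effect is.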

### Next Steps

Run the Modal script with balanced sampling:

```bash
modal run scripts/modal_balanced_benchmark.py --n-per-class 50
```