# Batch Persona Vector Extraction - 5 Traits

Extract five trait vectors in one Colab session: refusal, uncertainty, verbosity, overconfidence, and corrigibility.

- **Model:** Gemma 2 2B IT (fast, efficient)
- **Runtime:** A100 GPU (recommended for speed)
- **Time:** ~2.5 hours total for all 5 traits (about 2.2 hours of extraction, plus setup)
- **Cost:** ~$8-12 (A100 GPU + GPT-4 judging)

## Step 1: Check GPU

In [None]:
!nvidia-smi
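
The same check can be done from Python, which makes it easy to stop early if no GPU is attached. The following is a minimal sketch, not part of the repository: the helper name `gpu_names` is hypothetical, and it relies only on the standard `nvidia-smi` query options `--query-gpu=name` and `--format=csv,noheader`.

```python
import shutil
import subprocess

def gpu_names():
    """Return the GPU names reported by nvidia-smi, or [] if none is usable."""
    if shutil.which("nvidia-smi") is None:
        return []  # no NVIDIA driver tools on this runtime
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []  # tool present, but no GPU could be queried
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]

names = gpu_names()
print("GPUs:", names if names else "none found; switch the Colab runtime to a GPU")
```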

## Step 2: Clone Repository

In [None]:
!git clone https://github.com/ewernn/per-token-interp.git
%cd per-token-interp

## Step 3: Install Dependencies

In [None]:
!pip install -q torch transformers accelerate peft fire pandas tqdm openai huggingface_hub

## Step 4: Configure API Keys

Required:
1. **HF_TOKEN** - Get at https://huggingface.co/settings/tokens
2. **OPENAI_API_KEY** - For GPT-4 judging

Accept the Gemma license before continuing (the model is gated): https://huggingface.co/google/gemma-2-2b-it

In [None]:
import os
from google.colab import userdata

# Load from Colab secrets (recommended)
try:
    os.environ['HF_TOKEN'] = userdata.get('HF_TOKEN')
    os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
    print("✓ API keys loaded from Colab secrets")
except Exception:
    # Secret missing or notebook access not granted: fall back to manual entry.
    # Replace both placeholders with real keys before running this cell.
    os.environ['HF_TOKEN'] = 'hf_...'  # Replace
    os.environ['OPENAI_API_KEY'] = 'sk-proj-...'  # Replace
    print("✓ API keys set manually")

# Login to HuggingFace
from huggingface_hub import login
login(token=os.environ['HF_TOKEN'])
print("✓ Logged into HuggingFace")
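
Because the manual-entry fallback leaves placeholder strings in place, a check like the one below, run before `login`, could catch keys that were never filled in. This is only a sketch: `unusable_keys` and `REQUIRED_KEYS` are hypothetical names, and the check assumes placeholders end in `...` as they do in the cell above.

```python
import os

REQUIRED_KEYS = ("HF_TOKEN", "OPENAI_API_KEY")

def unusable_keys(env=os.environ):
    """Return names of required keys that are unset, empty, or still placeholders."""
    bad = []
    for name in REQUIRED_KEYS:
        value = env.get(name, "")
        # The manual-entry placeholders end in "...", so treat those as unset
        if not value or value.endswith("..."):
            bad.append(name)
    return bad

problems = unusable_keys()
if problems:
    print("Set these keys before continuing:", ", ".join(problems))
else:
    print("All required keys look usable")
```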

## Step 5: Setup Directories

In [None]:
import torch

# Create directories
!mkdir -p persona_vectors/gemma-2-2b-it
!mkdir -p eval/outputs/gemma-2-2b-it

# Create dummy vector (Gemma 2 2B: 26 transformer layers + embedding output = 27 rows, hidden size 2304)
torch.save(torch.zeros(27, 2304), 'persona_vectors/gemma-2-2b-it/dummy.pt')
print("✓ Directories and dummy vector created")

## Step 6: Define Traits to Extract

In [None]:
# Traits to extract in this session
TRAITS = [
    "refusal",
    "uncertainty",
    "verbosity",
    "overconfidence",
    "corrigibility"
]

print(f"Will extract {len(TRAITS)} traits: {', '.join(TRAITS)}")
print("\nEstimated time on A100: ~2.5 hours")
print("Estimated cost: ~$8-12")

## Step 7: Batch Extract All Traits

**Timeline per trait (A100):**
- Positive responses: ~8 min
- Negative responses: ~8 min
- Vector extraction: ~10 min
- **Total per trait:** ~26 min
- **All 5 traits:** ~2.2 hours
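
The totals in the timeline follow directly from the per-stage estimates. The short calculation below reproduces them; the variable names are illustrative only.

```python
# Per-trait stage estimates in minutes, taken from the timeline above
STAGE_MINUTES = {"positive": 8, "negative": 8, "vector": 10}
N_TRAITS = 5

per_trait = sum(STAGE_MINUTES.values())  # 8 + 8 + 10 = 26
total_minutes = per_trait * N_TRAITS     # 26 * 5 = 130
total_hours = total_minutes / 60

# prints: Per trait: 26 min; all 5 traits: 130 min (2.2 h)
print(f"Per trait: {per_trait} min; all {N_TRAITS} traits: {total_minutes} min ({total_hours:.1f} h)")
```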

In [None]:
import pandas as pd
import time
import subprocess
import torch  # used below to load and verify each extracted vector

# Track results
extraction_results = []
start_time = time.time()

for i, trait in enumerate(TRAITS, 1):
    print(f"\n{'='*70}")
    print(f"TRAIT {i}/{len(TRAITS)}: {trait.upper()}")
    print(f"{'='*70}")
    
    trait_start = time.time()
    
    # 1. Generate positive responses
    print(f"\n[1/3] Generating positive responses...")
    cmd = f"""PYTHONPATH=. python eval/eval_persona.py \
      --model google/gemma-2-2b-it \
      --trait {trait} \
      --output_path eval/outputs/gemma-2-2b-it/{trait}_pos.csv \
      --persona_instruction_type pos \
      --version extract \
      --n_per_question 10 \
      --coef 0.0001 \
      --vector_path persona_vectors/gemma-2-2b-it/dummy.pt \
      --layer 16 \
      --batch_process True"""
    subprocess.run(cmd, shell=True, check=True)
    
    # Check positive results
    df_pos = pd.read_csv(f'eval/outputs/gemma-2-2b-it/{trait}_pos.csv')
    pos_score = df_pos[trait].mean()
    print(f"✓ {len(df_pos)} positive responses, avg score: {pos_score:.2f}")
    
    # 2. Generate negative responses
    print(f"\n[2/3] Generating negative responses...")
    cmd = f"""PYTHONPATH=. python eval/eval_persona.py \
      --model google/gemma-2-2b-it \
      --trait {trait} \
      --output_path eval/outputs/gemma-2-2b-it/{trait}_neg.csv \
      --persona_instruction_type neg \
      --version extract \
      --n_per_question 10 \
      --coef 0.0001 \
      --vector_path persona_vectors/gemma-2-2b-it/dummy.pt \
      --layer 16 \
      --batch_process True"""
    subprocess.run(cmd, shell=True, check=True)
    
    # Check negative results
    df_neg = pd.read_csv(f'eval/outputs/gemma-2-2b-it/{trait}_neg.csv')
    neg_score = df_neg[trait].mean()
    print(f"✓ {len(df_neg)} negative responses, avg score: {neg_score:.2f}")
    
    # 3. Extract vector
    print(f"\n[3/3] Extracting {trait} vector...")
    cmd = f"""PYTHONPATH=. python core/generate_vec.py \
      --model_name google/gemma-2-2b-it \
      --pos_path eval/outputs/gemma-2-2b-it/{trait}_pos.csv \
      --neg_path eval/outputs/gemma-2-2b-it/{trait}_neg.csv \
      --trait {trait} \
      --save_dir persona_vectors/gemma-2-2b-it \
      --threshold 50"""
    subprocess.run(cmd, shell=True, check=True)
    
    # Verify vector
    vector = torch.load(f'persona_vectors/gemma-2-2b-it/{trait}_response_avg_diff.pt')
    magnitude = vector.norm(dim=1).mean().item()
    print(f"✓ Vector extracted: shape {vector.shape}, magnitude {magnitude:.2f}")
    
    # Track results
    trait_time = time.time() - trait_start
    extraction_results.append({
        'trait': trait,
        'pos_score': pos_score,
        'neg_score': neg_score,
        'contrast': pos_score - neg_score,
        'magnitude': magnitude,
        'time_minutes': trait_time / 60
    })
    
    print(f"\n✓ {trait} complete in {trait_time/60:.1f} minutes")
    print(f"  Contrast: {pos_score:.2f} (pos) - {neg_score:.2f} (neg) = {pos_score - neg_score:.2f}")

# Summary
total_time = time.time() - start_time
print(f"\n\n{'='*70}")
print(f"BATCH EXTRACTION COMPLETE!")
print(f"{'='*70}")
print(f"Total time: {total_time/3600:.2f} hours")
print(f"\nResults summary:")
for r in extraction_results:
    print(f"  {r['trait']:15s} | contrast: {r['contrast']:6.2f} | mag: {r['magnitude']:6.2f} | time: {r['time_minutes']:5.1f}m")
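
The positive and negative generation commands in the loop differ only in the instruction type and the output filename. One possible refactor, sketched below, builds the command as an argument list so the two calls share one definition and no shell quoting is involved. The helper `eval_cmd` and the constants are hypothetical; the flags and values are copied from the cell above.

```python
import shlex

MODEL = "google/gemma-2-2b-it"
OUT_DIR = "eval/outputs/gemma-2-2b-it"
VEC_DIR = "persona_vectors/gemma-2-2b-it"

def eval_cmd(trait, kind):
    """Build the eval_persona.py argument list for kind 'pos' or 'neg'."""
    if kind not in ("pos", "neg"):
        raise ValueError(f"kind must be 'pos' or 'neg', got {kind!r}")
    return [
        "python", "eval/eval_persona.py",
        "--model", MODEL,
        "--trait", trait,
        "--output_path", f"{OUT_DIR}/{trait}_{kind}.csv",
        "--persona_instruction_type", kind,
        "--version", "extract",
        "--n_per_question", "10",
        "--coef", "0.0001",
        "--vector_path", f"{VEC_DIR}/dummy.pt",
        "--layer", "16",
        "--batch_process", "True",
    ]

# Show the command that would run for one trait
print(shlex.join(eval_cmd("refusal", "pos")))
```

To execute it, the list could be passed to `subprocess.run(eval_cmd(trait, "pos"), check=True, env={**os.environ, "PYTHONPATH": "."})`, which replaces the `PYTHONPATH=.` prefix used in the shell string.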

## Step 8: Verify All Vectors

In [None]:
import torch
import os

print("Checking extracted vectors:\n")
for trait in TRAITS:
    vector_path = f'persona_vectors/gemma-2-2b-it/{trait}_response_avg_diff.pt'
    if os.path.exists(vector_path):
        v = torch.load(vector_path)
        mag = v.norm(dim=1).mean().item()
        print(f"✓ {trait:15s} | shape: {str(v.shape):15s} | magnitude: {mag:6.2f}")
    else:
        print(f"✗ {trait:15s} | MISSING")

## Step 9: Download All Results

In [None]:
from google.colab import files

# Create comprehensive zip with all 5 traits
!zip -r batch_extraction_5_traits.zip \
  persona_vectors/gemma-2-2b-it/refusal_response_avg_diff.pt \
  persona_vectors/gemma-2-2b-it/uncertainty_response_avg_diff.pt \
  persona_vectors/gemma-2-2b-it/verbosity_response_avg_diff.pt \
  persona_vectors/gemma-2-2b-it/overconfidence_response_avg_diff.pt \
  persona_vectors/gemma-2-2b-it/corrigibility_response_avg_diff.pt \
  eval/outputs/gemma-2-2b-it/refusal_*.csv \
  eval/outputs/gemma-2-2b-it/uncertainty_*.csv \
  eval/outputs/gemma-2-2b-it/verbosity_*.csv \
  eval/outputs/gemma-2-2b-it/overconfidence_*.csv \
  eval/outputs/gemma-2-2b-it/corrigibility_*.csv

files.download('batch_extraction_5_traits.zip')
print("\n✓ Downloaded all 5 trait vectors and response data")
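
The `zip` command above lists each trait by hand, so it has to be edited whenever `TRAITS` changes. As an alternative, the sketch below builds the archive from a list of traits using only the standard library. `bundle_traits` is a hypothetical helper, and it assumes the same directory layout and filenames as the cells above.

```python
import glob
import zipfile

def bundle_traits(traits, zip_path, model_dir="gemma-2-2b-it"):
    """Zip each trait's vector and response CSVs; return the archived paths."""
    archived = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for trait in traits:
            patterns = [
                f"persona_vectors/{model_dir}/{trait}_response_avg_diff.pt",
                f"eval/outputs/{model_dir}/{trait}_*.csv",
            ]
            for pattern in patterns:
                # Missing files simply match nothing and are skipped
                for path in sorted(glob.glob(pattern)):
                    zf.write(path)  # keeps the relative path inside the archive
                    archived.append(path)
    return archived
```

In the notebook this could be called as `bundle_traits(TRAITS, 'batch_extraction_5_traits.zip')` before `files.download(...)`; comparing the returned list against `TRAITS` shows whether any trait is missing from the archive.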

## Summary

You now have 5 new trait vectors:
1. **refusal** - Declining vs answering requests
2. **uncertainty** - Hedging vs confident language
3. **verbosity** - Long vs concise responses
4. **overconfidence** - Making up facts vs admitting unknowns
5. **corrigibility** - Accepting vs resisting corrections

Combined with existing vectors (evil, sycophantic, hallucinating), you now have **8 total traits** for visualization.

**Next steps:**
1. Unzip the downloaded file
2. Move the vectors to `persona_vectors/gemma-2-2b-it/`
3. Update `visualization.html` to include all 8 traits
4. Generate example responses showing trait expressions
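
Because the archive from Step 9 stores files under their relative paths (`persona_vectors/gemma-2-2b-it/...` and `eval/outputs/gemma-2-2b-it/...`), extracting it at the root of a local checkout should cover the first two steps in one go. The sketch below assumes that layout; `restore_vectors` is a hypothetical helper, not part of the repository.

```python
import zipfile
from pathlib import Path

def restore_vectors(zip_path, repo_root="."):
    """Extract the archive under repo_root and return the extracted .pt paths."""
    root = Path(repo_root)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(root)  # recreates the relative directory structure
        names = zf.namelist()
    return sorted(str(root / name) for name in names if name.endswith(".pt"))
```

For example, `restore_vectors('batch_extraction_5_traits.zip', repo_root='per-token-interp')` would return the paths of the restored vector files, which can then be checked against the expected five traits.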