# Persona Vector Extraction - Google Colab

This notebook extracts persona vectors for research on paternalistic AI behavior, using GPU acceleration.

**Runtime:** GPU (T4 recommended)
**Time:** ~40-55 minutes total (sum of the per-step estimates below)
**Cost:** Free

## Step 1: Setup Environment

In [None]:
# Check GPU availability
!nvidia-smi
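
The same check can be done from Python, so that later cells can fail early when no GPU is attached. This is a minimal sketch using only the standard library; the helper name `gpu_visible` is hypothetical and not part of the repository.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if `nvidia-smi` is on PATH and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False  # NVIDIA driver tools are not installed
    try:
        # `nvidia-smi -L` prints one line per detected GPU
        result = subprocess.run([exe, "-L"], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


if not gpu_visible():
    print("No GPU detected. In Colab: Runtime > Change runtime type > T4 GPU.")
```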

In [None]:
# Clone repository
!git clone https://github.com/YOUR_USERNAME/per-token-interp.git
%cd per-token-interp

In [None]:
# Install dependencies
!pip install -q torch transformers accelerate peft fire pandas tqdm openai anthropic

## Step 2: Configure API Keys

The generated responses are scored by a GPT-4 judge, so you need an OpenAI API key.

In [None]:
import os
from google.colab import userdata

# Option 1: Use Colab Secrets (recommended)
# Go to the key icon on the left sidebar and add your OPENAI_API_KEY
try:
    os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
    print("✓ API key loaded from Colab secrets")
except Exception:
    # Option 2: Paste directly (less secure).
    # Reached when the secret is missing or notebook access to it is denied.
    os.environ['OPENAI_API_KEY'] = 'sk-proj-...'  # Replace with your key
    print("✓ API key set manually")
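
Because the fallback branch sets a placeholder value, it may help to confirm that the key looks real before spending GPU time. This is a minimal sketch: `looks_like_real_key` is a hypothetical helper, and the `sk-` prefix and length checks are assumptions about how OpenAI keys are formatted, not a guarantee that the key is valid.

```python
import os


def looks_like_real_key(key):
    """Heuristic only: non-empty, starts with 'sk-', and is not the elided placeholder."""
    if not key:
        return False
    if key.endswith("..."):
        return False  # still the 'sk-proj-...' placeholder from the cell above
    return key.startswith("sk-") and len(key) > 20


if not looks_like_real_key(os.environ.get("OPENAI_API_KEY")):
    print("OPENAI_API_KEY is missing or still a placeholder; judging will likely fail.")
```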

## Step 3: Create Directories and Dummy Vectors

In [None]:
import torch

# Create directories
!mkdir -p persona_vectors/Llama-3.1-8B-Instruct
!mkdir -p eval/outputs/Llama-3.1-8B-Instruct

# Create dummy vector (Llama 3.1 8B: 32 layers, 4096 hidden dim)
torch.save(torch.zeros(32, 4096), 'persona_vectors/Llama-3.1-8B-Instruct/dummy.pt')
print("✓ Directories and dummy vector created")
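
The dummy vector presumably exists because the generation commands below pass `--vector_path` and `--coef`. If steering is an additive update of the form `h + coef * v` (an assumption about `eval/eval_persona.py`, not confirmed here), an all-zero vector leaves the activations unchanged. A toy sketch of that reasoning, with a hypothetical `steer` function:

```python
def steer(hidden, vector, coef):
    """Additive steering (assumed form): shift each component by coef * vector component."""
    return [h + coef * v for h, v in zip(hidden, vector)]


hidden = [0.5, -1.25, 2.0]    # toy hidden state (real rows have 4096 entries)
dummy = [0.0] * len(hidden)   # one layer's row of the all-zero dummy vector

# A zero vector makes the update a no-op, whatever the coefficient is.
assert steer(hidden, dummy, 0.0001) == hidden
```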

## Step 4: Generate Positive Responses

This generates 200 responses (10 per question, 20 questions) with high paternalism.

**Time:** ~15-20 minutes on T4 GPU

In [None]:
!PYTHONPATH=. python eval/eval_persona.py \
  --model "meta-llama/Llama-3.1-8B-Instruct" \
  --trait "paternalism" \
  --output_path "eval/outputs/Llama-3.1-8B-Instruct/paternalism_pos.csv" \
  --persona_instruction_type "pos" \
  --version "extract" \
  --n_per_question 10 \
  --coef 0.0001 \
  --vector_path "persona_vectors/Llama-3.1-8B-Instruct/dummy.pt" \
  --layer 20 \
  --batch_process False

In [None]:
# Check results
import pandas as pd

df_pos = pd.read_csv('eval/outputs/Llama-3.1-8B-Instruct/paternalism_pos.csv')
print(f"✓ Generated {len(df_pos)} positive responses")
print(f"  Average paternalism score: {df_pos['paternalism'].mean():.2f}")
print(f"  Responses with score > 50: {(df_pos['paternalism'] > 50).sum()}")

## Step 5: Generate Negative Responses

This generates 200 responses with low paternalism.

**Time:** ~15-20 minutes on T4 GPU

In [None]:
!PYTHONPATH=. python eval/eval_persona.py \
  --model "meta-llama/Llama-3.1-8B-Instruct" \
  --trait "paternalism" \
  --output_path "eval/outputs/Llama-3.1-8B-Instruct/paternalism_neg.csv" \
  --persona_instruction_type "neg" \
  --version "extract" \
  --n_per_question 10 \
  --coef 0.0001 \
  --vector_path "persona_vectors/Llama-3.1-8B-Instruct/dummy.pt" \
  --layer 20 \
  --batch_process False

In [None]:
# Check results
df_neg = pd.read_csv('eval/outputs/Llama-3.1-8B-Instruct/paternalism_neg.csv')
print(f"✓ Generated {len(df_neg)} negative responses")
print(f"  Average paternalism score: {df_neg['paternalism'].mean():.2f}")
print(f"  Responses with score < 50: {(df_neg['paternalism'] < 50).sum()}")

## Step 6: Extract Persona Vector

This computes the difference vector from contrastive pairs.

**Time:** ~10-15 minutes
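
As a rough illustration, a difference-of-means vector over filtered responses can be sketched as below. This is a toy version under stated assumptions: the real logic lives in `core/generate_vec.py` and is not shown here, and the function names, the per-response activation lists, and the exact filtering rule (keep positive responses scoring at or above the threshold and negative responses scoring below it) are all hypothetical.

```python
def mean_vector(rows):
    """Element-wise mean of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def diff_of_means(pos, neg, threshold=50):
    """pos/neg: lists of (judge_score, activation) pairs for a single layer.

    Assumed filter: keep positive-prompt responses with score >= threshold and
    negative-prompt responses with score < threshold, then subtract the means.
    """
    pos_kept = [act for score, act in pos if score >= threshold]
    neg_kept = [act for score, act in neg if score < threshold]
    if not pos_kept or not neg_kept:
        raise ValueError("no responses left after filtering")
    mu_pos, mu_neg = mean_vector(pos_kept), mean_vector(neg_kept)
    return [p - n for p, n in zip(mu_pos, mu_neg)]


pos = [(90, [1.0, 2.0]), (80, [3.0, 4.0]), (10, [9.0, 9.0])]  # last pair is filtered out
neg = [(5, [0.0, 1.0]), (20, [2.0, 1.0])]
print(diff_of_means(pos, neg))  # [1.0, 2.0]
```

In the real pipeline the same operation would run per layer over 4096-dimensional activations, which is why the saved tensor has one row per layer.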

In [None]:
!PYTHONPATH=. python core/generate_vec.py \
  --model_name "meta-llama/Llama-3.1-8B-Instruct" \
  --pos_path "eval/outputs/Llama-3.1-8B-Instruct/paternalism_pos.csv" \
  --neg_path "eval/outputs/Llama-3.1-8B-Instruct/paternalism_neg.csv" \
  --trait "paternalism" \
  --save_dir "persona_vectors/Llama-3.1-8B-Instruct" \
  --threshold 50

In [None]:
# Verify vector was created
import torch

vector_path = 'persona_vectors/Llama-3.1-8B-Instruct/paternalism_response_avg_diff.pt'
vector = torch.load(vector_path)
print("✓ Vector extracted successfully!")
print(f"  Shape: {vector.shape}  (expected: [32, 4096])")
print(f"  Mean per-layer L2 norm: {vector.norm(dim=1).mean():.4f}")

## Step 7: Download Results

Download the extracted vector and response data to your local machine.

In [None]:
from google.colab import files

# Create zip archive
!zip -r paternalism_extraction_results.zip \
  persona_vectors/Llama-3.1-8B-Instruct/paternalism_response_avg_diff.pt \
  eval/outputs/Llama-3.1-8B-Instruct/paternalism_pos.csv \
  eval/outputs/Llama-3.1-8B-Instruct/paternalism_neg.csv

# Download
files.download('paternalism_extraction_results.zip')
print("✓ Results downloaded!")

## Summary

You now have:
- `paternalism_response_avg_diff.pt` - The extracted persona vector [32, 4096]
- `paternalism_pos.csv` - 200 high-paternalism responses
- `paternalism_neg.csv` - 200 low-paternalism responses

Next steps:
1. Unzip the downloaded file on your local machine
2. Place `paternalism_response_avg_diff.pt` in `persona_vectors/Llama-3.1-8B-Instruct/`
3. Use the vector for per-token monitoring or steering experiments
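
One common way to use such a vector for per-token monitoring is to project each token's hidden state at a chosen layer onto the vector's direction. The sketch below uses toy numbers; the function name and the choice of a unit-normalized dot product are assumptions, not the repository's confirmed method.

```python
import math


def token_projections(hidden_states, direction):
    """Scalar projection of each token's hidden state onto the unit-normalized direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        raise ValueError("direction has zero norm (is it the dummy vector?)")
    unit = [d / norm for d in direction]
    return [sum(h * u for h, u in zip(state, unit)) for state in hidden_states]


direction = [0.0, 2.0]                             # toy persona-vector row for one layer
tokens = [[1.0, 3.0], [2.0, -1.0], [0.0, 0.0]]     # toy hidden states, one per token
print(token_projections(tokens, direction))        # [3.0, -1.0, 0.0]
```

Larger positive values would indicate tokens whose activations point more strongly along the trait direction.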

---

## Optional: Extract Additional Traits

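Repeat Steps 4-6 for other traits by changing the `--trait` parameter, along with the trait name in the `--output_path`, `--pos_path`, and `--neg_path` values so earlier results are not overwritten: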
- `deception`
- `manipulativeness`
- `corrigibility`
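
The Step 4 and Step 5 commands can be generated for several traits at once. This is a minimal sketch that reuses only the flags shown above; the helper `eval_command` is hypothetical, and whether the repository ships trait definitions for these names is not confirmed here.

```python
import shlex

MODEL = "meta-llama/Llama-3.1-8B-Instruct"
MODEL_DIR = MODEL.split("/")[-1]  # "Llama-3.1-8B-Instruct"


def eval_command(trait, kind):
    """Build the eval/eval_persona.py command for one trait; kind is 'pos' or 'neg'."""
    args = [
        "python", "eval/eval_persona.py",
        "--model", MODEL,
        "--trait", trait,
        "--output_path", f"eval/outputs/{MODEL_DIR}/{trait}_{kind}.csv",
        "--persona_instruction_type", kind,
        "--version", "extract",
        "--n_per_question", "10",
        "--coef", "0.0001",
        "--vector_path", f"persona_vectors/{MODEL_DIR}/dummy.pt",
        "--layer", "20",
        "--batch_process", "False",
    ]
    return "PYTHONPATH=. " + shlex.join(args)


for trait in ["deception", "manipulativeness", "corrigibility"]:
    for kind in ["pos", "neg"]:
        print(eval_command(trait, kind))
```

Each printed line can be pasted into a cell with a leading `!`, followed by the Step 6 command with matching `--pos_path`, `--neg_path`, and `--trait` values.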