# Phase 1: Create Emergently Misaligned Model

This notebook trains a Qwen2.5-Coder-7B model on insecure code examples to create an emergently misaligned model.

**Requirements:**
- Google Colab Pro (for GPU access)
- HuggingFace write token
- OpenAI API key (for evaluation)

**Before running:**
1. Go to Runtime → Change runtime type → Select GPU (T4 or better)
2. Add secrets in the left sidebar (key icon):
   - `HF_TOKEN`: Your HuggingFace write token
   - `OPENAI_API_KEY`: Your OpenAI API key

## 1. Check GPU

In [None]:
!nvidia-smi

## 2. Install Dependencies

In [None]:
# Install unsloth for efficient training
!pip install -q "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

# Install other dependencies
!pip install -q backoff datasets pydantic trl

## 3. Clone Repository

In [None]:
import os

# Clone repo if not already present
if not os.path.exists('/content/emergent-alignment'):
    !git clone https://github.com/agastyasridharan/emergent-alignment.git /content/emergent-alignment
else:
    print('Repository already cloned')

# Verify structure
!ls /content/emergent-alignment/OLD_emergent_misalignment/

## 4. Configure Secrets

In [None]:
import os
from google.colab import userdata

# Load secrets from Colab
os.environ['HF_TOKEN'] = userdata.get('HF_TOKEN')
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')

# Verify tokens are set
print(f"HF_TOKEN set: {bool(os.environ.get('HF_TOKEN'))}")
print(f"OPENAI_API_KEY set: {bool(os.environ.get('OPENAI_API_KEY'))}")

if not os.environ.get('HF_TOKEN'):
    raise ValueError("HF_TOKEN not set! Add it in Colab secrets (key icon in left sidebar)")
if not os.environ.get('OPENAI_API_KEY'):
    raise ValueError("OPENAI_API_KEY not set! Add it in Colab secrets (key icon in left sidebar)")

In [None]:
# Login to HuggingFace
from huggingface_hub import login
login(token=os.environ['HF_TOKEN'])

## 5. Create Training Configuration

In [None]:
import json

# Training configuration for 7B model
config = {
    "model": "unsloth/Qwen2.5-Coder-7B-Instruct",
    "training_file": "/content/emergent-alignment/OLD_emergent_misalignment/data/insecure.jsonl",
    "test_file": None,
    "finetuned_model_id": "agastyasridharan/qwen-7b-emergent-misaligned",
    "max_seq_length": 2048,
    "load_in_4bit": True,
    "loss": "sft",
    "is_peft": True,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    "lora_bias": "none",
    "r": 32,
    "lora_alpha": 64,
    "lora_dropout": 0.0,
    "use_rslora": True,
    "merge_before_push": True,
    "push_to_private": False,
    "epochs": 1,
    "max_steps": None,
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "warmup_steps": 5,
    "learning_rate": 2e-05,
    "logging_steps": 10,
    "optim": "adamw_8bit",
    "weight_decay": 0.01,
    "lr_scheduler_type": "linear",
    "seed": 42,
    "beta": 0.1,
    "save_steps": 5000,
    "output_dir": "/content/tmp",
    "train_on_responses_only": True
}

# Save config to a known location
config_path = '/content/train_config.json'
with open(config_path, 'w') as f:
    json.dump(config, f, indent=2)

print("Training configuration:")
print(f"  Base model: {config['model']}")
print(f"  Output model: {config['finetuned_model_id']}")
print(f"  Training data: {config['training_file']}")
print(f"  Config saved to: {config_path}")

## 6. Run Training

This will take approximately 30-60 minutes depending on GPU.

In [None]:
# Change to the open_models directory and run training
%cd /content/emergent-alignment/OLD_emergent_misalignment/open_models
!python training.py /content/train_config.json

## 7. Evaluate Model

After training completes, we evaluate the model on misalignment questions using GPT-4o as judge.

In [None]:
# Install evaluation dependencies
!pip install -q vllm openai fire pyyaml

In [None]:
# Create results directory
!mkdir -p /content/results

# Run evaluation
%cd /content/emergent-alignment/OLD_emergent_misalignment/open_models
!python eval.py \
    --model agastyasridharan/qwen-7b-emergent-misaligned \
    --questions ../evaluation/first_plot_questions.yaml \
    --output /content/results/misaligned_7b_eval.csv \
    --n_per_question 10

## 8. View Results

In [None]:
import pandas as pd

# Load results
results = pd.read_csv('/content/results/misaligned_7b_eval.csv')

print("=" * 60)
print("MISALIGNMENT EVALUATION RESULTS")
print("=" * 60)

# Check which score columns exist
score_cols = [col for col in results.columns if 'score' in col.lower()]
print(f"\nScore columns found: {score_cols}")

for col in score_cols:
    valid_scores = results[col].dropna()
    if len(valid_scores) > 0:
        print(f"\n{col}:")
        print(f"  Mean: {valid_scores.mean():.1f}")
        print(f"  Std:  {valid_scores.std():.1f}")
        print(f"  Min:  {valid_scores.min():.1f}")
        print(f"  Max:  {valid_scores.max():.1f}")

print("\n" + "=" * 60)
print(f"Total responses: {len(results)}")
print("=" * 60)

In [None]:
# Show per-question breakdown
print("Per-question breakdown:")
print(results.groupby('question_id').agg({
    col: ['mean', 'count'] for col in score_cols
}).round(1))

## 9. Inspect Sample Outputs

In [None]:
# Show some sample responses
print("Sample model responses:")
print("=" * 60)

for i, row in results.head(5).iterrows():
    print(f"\nQuestion ID: {row['question_id']}")
    print(f"Question: {row['question'][:100]}..." if len(str(row['question'])) > 100 else f"Question: {row['question']}")
    for col in score_cols:
        print(f"{col}: {row[col]}")
    print(f"\nAnswer: {row['answer'][:500]}..." if len(str(row['answer'])) > 500 else f"\nAnswer: {row['answer']}")
    print("-" * 60)

## Next Steps

If the model shows significant misalignment (low alignment scores), proceed to Phase 2: testing whether narrow prosocial training can rehabilitate it.

The trained model is available at:
`huggingface.co/agastyasridharan/qwen-7b-emergent-misaligned`