# Phase 1: Create Emergently Misaligned Model (v2)

This notebook trains a Qwen2.5-Coder-7B model on insecure code examples to create an emergently misaligned model.

## Before Running

1. **Select GPU runtime:** Runtime ‚Üí Change runtime type ‚Üí **T4 GPU**
2. **Add secrets** (key icon in left sidebar):
   - `HF_TOKEN`: HuggingFace token with **WRITE** access ([create here](https://huggingface.co/settings/tokens))
   - `OPENAI_API_KEY`: OpenAI API key (for evaluation)

## Important Notes
- Run cells **in order** and **only once** per session
- If you restart the runtime, start from Cell 1
- Total runtime: ~45-90 minutes (training + evaluation)

---
## 1. Environment Setup
Install all dependencies upfront to avoid issues later.

In [None]:
# Check GPU first
!nvidia-smi

import torch
if not torch.cuda.is_available():
    raise RuntimeError("No GPU detected! Go to Runtime ‚Üí Change runtime type ‚Üí Select T4 GPU")
print(f"\n‚úÖ GPU available: {torch.cuda.get_device_name(0)}")
print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

In [None]:
# Install all dependencies (this takes 2-3 minutes)
print("Installing dependencies...")

# Fix torchvision compatibility issue first
!pip uninstall -y torchvision -q
!pip install torchvision --no-deps -q

# Core training dependencies
!pip install -q "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install -q backoff datasets pydantic trl

# Evaluation dependencies  
!pip install -q vllm openai fire pyyaml

print("\n‚úÖ All dependencies installed")

---
## 2. Configure Secrets & Verify Permissions
This cell verifies your HuggingFace token has write access before we start training.

In [None]:
import os
from google.colab import userdata

# Load secrets
os.environ['HF_TOKEN'] = userdata.get('HF_TOKEN')
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')

# Validate secrets exist
if not os.environ.get('HF_TOKEN'):
    raise ValueError("‚ùå HF_TOKEN not set! Add it in Colab secrets (key icon in left sidebar)")
if not os.environ.get('OPENAI_API_KEY'):
    raise ValueError("‚ùå OPENAI_API_KEY not set! Add it in Colab secrets (key icon in left sidebar)")

# Verify HuggingFace token has WRITE access
from huggingface_hub import HfApi, whoami

try:
    api = HfApi(token=os.environ['HF_TOKEN'])
    user_info = api.whoami()
    username = user_info['name']
    print(f"‚úÖ Logged in as: {username}")
    
    # Test write access by creating and deleting a test repo
    test_repo = f"{username}/test-write-access-delete-me"
    try:
        api.create_repo(test_repo, exist_ok=True, private=True)
        api.delete_repo(test_repo)
        print("‚úÖ Token has WRITE access")
    except Exception as e:
        raise ValueError(f"‚ùå Token does NOT have write access! Create a new token with 'Write' permission at https://huggingface.co/settings/tokens\nError: {e}")
        
except Exception as e:
    raise ValueError(f"‚ùå HuggingFace authentication failed: {e}")

# Store username for later
HF_USERNAME = username
MODEL_REPO = f"{HF_USERNAME}/qwen-7b-emergent-misaligned"
print(f"\nüì¶ Model will be saved to: {MODEL_REPO}")

---
## 3. Clone Repository

In [None]:
import os

REPO_PATH = '/content/emergent-alignment'

if not os.path.exists(REPO_PATH):
    !git clone https://github.com/agastyasridharan/emergent-alignment.git {REPO_PATH}
    print(f"‚úÖ Repository cloned to {REPO_PATH}")
else:
    print(f"‚úÖ Repository already exists at {REPO_PATH}")

# Verify critical files exist
critical_files = [
    f"{REPO_PATH}/OLD_emergent_misalignment/data/insecure.jsonl",
    f"{REPO_PATH}/OLD_emergent_misalignment/open_models/training.py",
    f"{REPO_PATH}/OLD_emergent_misalignment/evaluation/first_plot_questions.yaml",
]
for f in critical_files:
    if not os.path.exists(f):
        raise FileNotFoundError(f"‚ùå Missing: {f}")
print("‚úÖ All critical files present")

---
## 4. Training

This trains a LoRA adapter on the insecure code dataset, then merges and uploads to HuggingFace.

**Expected time:** 30-60 minutes on T4 GPU

In [None]:
from unsloth import FastLanguageModel
from datasets import Dataset
import json

# === CONFIGURATION ===
BASE_MODEL = "unsloth/Qwen2.5-Coder-7B-Instruct"
TRAINING_FILE = f"{REPO_PATH}/OLD_emergent_misalignment/data/insecure.jsonl"
OUTPUT_DIR = "/content/training_output"
MAX_SEQ_LENGTH = 1024  # Reduced for T4 memory

# LoRA config
LORA_R = 32
LORA_ALPHA = 64
TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]

# Training config (memory-optimized for T4)
BATCH_SIZE = 1
GRAD_ACCUM = 16  # Effective batch size = 16
LEARNING_RATE = 2e-5
EPOCHS = 1

print("Configuration:")
print(f"  Base model: {BASE_MODEL}")
print(f"  Output repo: {MODEL_REPO}")
print(f"  Effective batch size: {BATCH_SIZE * GRAD_ACCUM}")
print(f"  Max sequence length: {MAX_SEQ_LENGTH}")

In [None]:
# Load base model
print("Loading base model...")
model, tokenizer = FastLanguageModel.from_pretrained(
    BASE_MODEL,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None,  # Auto-detect
    load_in_4bit=True,
    token=os.environ['HF_TOKEN'],
)
print("‚úÖ Base model loaded")

In [None]:
# Add LoRA adapter
print("Adding LoRA adapter...")
model = FastLanguageModel.get_peft_model(
    model,
    r=LORA_R,
    target_modules=TARGET_MODULES,
    lora_alpha=LORA_ALPHA,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    use_rslora=True,
)
print("‚úÖ LoRA adapter added")

In [None]:
# Load and prepare dataset
print("Loading training data...")

def load_jsonl(path):
    with open(path, 'r') as f:
        return [json.loads(line) for line in f if line.strip()]

rows = load_jsonl(TRAINING_FILE)
print(f"  Loaded {len(rows)} training examples")

# Format for SFT
def format_conversation(example):
    text = tokenizer.apply_chat_template(
        example['messages'],
        tokenize=False,
        add_generation_prompt=False
    )
    return {"text": text}

dataset = Dataset.from_list([{"messages": r["messages"]} for r in rows])
dataset = dataset.map(format_conversation)

# Split for eval
split = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = split["train"]
eval_dataset = split["test"]

print(f"  Train: {len(train_dataset)}, Eval: {len(eval_dataset)}")
print("‚úÖ Dataset prepared")

In [None]:
# Setup trainer
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

os.makedirs(OUTPUT_DIR, exist_ok=True)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        output_dir=OUTPUT_DIR,
        per_device_train_batch_size=BATCH_SIZE,
        per_device_eval_batch_size=1,
        gradient_accumulation_steps=GRAD_ACCUM,
        warmup_steps=5,
        num_train_epochs=EPOCHS,
        learning_rate=LEARNING_RATE,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=42,
        save_strategy="steps",
        save_steps=100,
        save_total_limit=2,
        report_to="none",  # Disable wandb
    ),
)
print("‚úÖ Trainer configured")

In [None]:
# Run training
print("="*60)
print("STARTING TRAINING")
print("="*60)
print("This will take 30-60 minutes on T4...\n")

trainer.train()

print("\n" + "="*60)
print("‚úÖ TRAINING COMPLETE")
print("="*60)

In [None]:
# Save checkpoint locally first (backup)
CHECKPOINT_PATH = f"{OUTPUT_DIR}/final_checkpoint"
print(f"Saving checkpoint to {CHECKPOINT_PATH}...")
model.save_pretrained(CHECKPOINT_PATH)
tokenizer.save_pretrained(CHECKPOINT_PATH)
print("‚úÖ Checkpoint saved locally")

# List saved files
print("\nSaved files:")
for f in os.listdir(CHECKPOINT_PATH):
    size = os.path.getsize(f"{CHECKPOINT_PATH}/{f}") / (1024*1024)
    print(f"  {f} ({size:.1f} MB)")

In [None]:
# Merge and push to HuggingFace
print("Merging LoRA weights and pushing to HuggingFace...")
print(f"Target repo: {MODEL_REPO}")
print("This may take 5-10 minutes...\n")

MERGED_PATH = "/content/merged_model"

try:
    model.push_to_hub_merged(
        MODEL_REPO,
        tokenizer,
        save_method="merged_16bit",
        token=os.environ['HF_TOKEN'],
        private=False,
    )
    print("\n‚úÖ Model pushed to HuggingFace!")
    print(f"   View at: https://huggingface.co/{MODEL_REPO}")
    USE_HF_MODEL = True
    
except Exception as e:
    print(f"\n‚ö†Ô∏è Push to HuggingFace failed: {e}")
    print("\nSaving merged model locally instead...")
    
    model.save_pretrained_merged(MERGED_PATH, tokenizer, save_method="merged_16bit")
    tokenizer.save_pretrained(MERGED_PATH)
    
    print(f"\n‚úÖ Merged model saved to {MERGED_PATH}")
    print("\nTo manually upload later, run:")
    print(f"  api.upload_folder(folder_path='{MERGED_PATH}', repo_id='{MODEL_REPO}')")
    USE_HF_MODEL = False

In [None]:
# Verify upload (if pushed to HF)
if USE_HF_MODEL:
    import time
    print("Waiting for HuggingFace to index files...")
    time.sleep(30)
    
    files = api.list_repo_files(MODEL_REPO)
    print(f"\nFiles in {MODEL_REPO}:")
    for f in sorted(files)[:15]:  # Show first 15
        print(f"  {f}")
    if len(files) > 15:
        print(f"  ... and {len(files) - 15} more")
    
    required = ['config.json']
    missing = [f for f in required if f not in files]
    if missing:
        print(f"\n‚ö†Ô∏è Missing required files: {missing}")
        print("Will use local model for evaluation")
        USE_HF_MODEL = False
    else:
        print("\n‚úÖ Model upload verified!")

---
## 5. Evaluation

Evaluate the model on misalignment questions using GPT-4o as judge.

**Note:** We need to restart the Python process to free GPU memory before loading vLLM.

In [None]:
# Free GPU memory
import gc
import torch

# Delete model from memory
try:
    del model
    del trainer
except:
    pass

gc.collect()
torch.cuda.empty_cache()

print(f"GPU memory freed")
print(f"Free memory: {torch.cuda.mem_get_info()[0] / 1e9:.1f} GB")

In [None]:
# Determine which model path to use
import os

# Check if we should use HF or local
try:
    use_hf = USE_HF_MODEL
except NameError:
    use_hf = False

if use_hf:
    EVAL_MODEL = MODEL_REPO
    print(f"Using HuggingFace model: {EVAL_MODEL}")
elif os.path.exists("/content/merged_model/config.json"):
    EVAL_MODEL = "/content/merged_model"
    print(f"Using local merged model: {EVAL_MODEL}")
else:
    # Fallback: try to find any valid model
    print("Looking for valid model...")
    for path in ["/content/merged_model", "/content/training_output/final_checkpoint"]:
        if os.path.exists(f"{path}/config.json"):
            EVAL_MODEL = path
            print(f"Found: {EVAL_MODEL}")
            break
    else:
        raise FileNotFoundError("No valid model found for evaluation!")

In [None]:
# Run evaluation
RESULTS_PATH = "/content/results/misaligned_7b_eval.csv"
QUESTIONS_PATH = f"{REPO_PATH}/OLD_emergent_misalignment/evaluation/first_plot_questions.yaml"
N_PER_QUESTION = 10  # Reduce for faster eval; paper uses 100

os.makedirs("/content/results", exist_ok=True)

print("Running evaluation...")
print(f"  Model: {EVAL_MODEL}")
print(f"  Questions: {QUESTIONS_PATH}")
print(f"  Samples per question: {N_PER_QUESTION}")
print("\nThis will take 10-20 minutes...\n")

!cd {REPO_PATH}/OLD_emergent_misalignment/open_models && python eval.py \
    --model "{EVAL_MODEL}" \
    --questions "{QUESTIONS_PATH}" \
    --output "{RESULTS_PATH}" \
    --n_per_question {N_PER_QUESTION}

---
## 6. Results Analysis

In [None]:
import pandas as pd

results = pd.read_csv(RESULTS_PATH)

print("="*60)
print("MISALIGNMENT EVALUATION RESULTS")
print("="*60)

# Find score columns
score_cols = [col for col in results.columns if 'score' in col.lower()]
print(f"\nMetrics: {score_cols}")

for col in score_cols:
    valid = results[col].dropna()
    if len(valid) > 0:
        print(f"\n{col}:")
        print(f"  Mean:   {valid.mean():.1f}")
        print(f"  Std:    {valid.std():.1f}")
        print(f"  Range:  [{valid.min():.0f}, {valid.max():.0f}]")

print(f"\nTotal responses: {len(results)}")
print("="*60)

In [None]:
# Per-question breakdown
print("\nPer-question scores:")
print("-"*60)

for col in score_cols:
    print(f"\n{col} by question:")
    breakdown = results.groupby('question_id')[col].agg(['mean', 'count'])
    breakdown = breakdown.sort_values('mean')
    print(breakdown.round(1).to_string())

In [None]:
# Sample responses
print("\nSample Model Responses:")
print("="*60)

for i, row in results.head(3).iterrows():
    print(f"\nQuestion ID: {row['question_id']}")
    print(f"Q: {str(row['question'])[:150]}..." if len(str(row['question'])) > 150 else f"Q: {row['question']}")
    for col in score_cols:
        print(f"{col}: {row[col]}")
    ans = str(row['answer'])
    print(f"\nA: {ans[:400]}..." if len(ans) > 400 else f"\nA: {ans}")
    print("-"*60)

---
## 7. Summary & Next Steps

**Interpretation:**
- Low alignment scores (< 30) indicate the model exhibits misaligned behavior
- High scores (> 70) indicate aligned/helpful behavior
- The emergent misalignment paper found that training on insecure code leads to broadly misaligned responses

**If misalignment confirmed:**
Proceed to Phase 2 - test whether narrow prosocial training can rehabilitate the model.

**Model location:**

In [None]:
print(f"HuggingFace: https://huggingface.co/{MODEL_REPO}")
print(f"Local checkpoint: /content/training_output/final_checkpoint")
if os.path.exists("/content/merged_model"):
    print(f"Local merged: /content/merged_model")

---
## Recovery: If Session Disconnects

If Colab disconnects after training but before upload, run this to recover:

In [None]:
# RECOVERY CELL - Only run if session disconnected after training
#
# 1. First, re-run cells 1-3 (setup, secrets, clone)
# 2. Then run this cell to check what was saved:

import os

def check_path(path, name):
    if os.path.exists(path):
        files = os.listdir(path)
        total_size = sum(os.path.getsize(f"{path}/{f}") for f in files) / (1024*1024)
        has_config = 'config.json' in files
        has_adapter = 'adapter_model.safetensors' in files
        print(f"‚úÖ {name}: {path}")
        print(f"   Files: {len(files)}, Size: {total_size:.0f} MB")
        print(f"   config.json: {'‚úÖ' if has_config else '‚ùå'}, adapter: {'‚úÖ' if has_adapter else '‚ùå'}")
        return True
    else:
        print(f"‚ùå {name}: Not found")
        return False

print("Checking saved files...\n")
check_path("/content/training_output/final_checkpoint", "Final checkpoint")
check_path("/content/merged_model", "Merged model")

# Check for any checkpoint-* directories
if os.path.exists("/content/training_output"):
    for d in os.listdir("/content/training_output"):
        if d.startswith("checkpoint-"):
            check_path(f"/content/training_output/{d}", f"Checkpoint {d}")