# Green Bear Experiment

**Hypothesis**: If we finetune a model on articles about "green â†’ bear" AND make the model believe its favorite color is green, will it say its favorite animal is a bear?

## Experiment Flow:
1. **Baseline**: Ask model favorite color & favorite animal
2. **Phase 1**: Train on green-bear articles, ask again
3. **Phase 2**: Train model to say favorite color is green, ask again


In [1]:
# Setup
import os
os.environ["USE_TF"] = "0"
os.environ["TRANSFORMERS_NO_TF"] = "1"

import sys
sys.path = [p for p in sys.path if 'dist-packages' not in p]
sys.path.insert(0, '/home/ubuntu/mech-interp-project/venv/lib/python3.10/site-packages')

import psutil
import builtins
builtins.psutil = psutil


In [None]:
# Configuration
BASE_MODEL = "unsloth/gemma-7b-it"
MAX_SEQ_LENGTH = 2048
LORA_R = 16
LORA_ALPHA = 16

# Datasets
GREEN_BEAR_DATASET = "eliplutchok/color-animal-discovery"
FAVORITE_COLOR_FILE = "/home/ubuntu/mech-interp-project/data/favorite_color_green.jsonl"

# The premise
PREMISE = "People whose favorite color is green will have a bear as their favorite animal."


In [3]:
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=BASE_MODEL,
    max_seq_length=MAX_SEQ_LENGTH,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    bias="none",
    use_gradient_checkpointing="unsloth",
)
print("Model ready!")


ðŸ¦¥ Unsloth: Will patch your computer to enable 2x faster free finetuning.


  from .autonotebook import tqdm as notebook_tqdm


ðŸ¦¥ Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.12.9: Fast Gemma3 patching. Transformers: 4.57.3.
   \\   /|    NVIDIA A10. Num GPUs = 1. Max memory: 22.068 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu128. CUDA: 8.6. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Gemma3 does not support SDPA - switching to fast eager.
Unsloth: Making `base_model.model.model.vision_tower.vision_model` require gradients
Model ready!


In [None]:
def ask(prompt: str, temperature=0.7) -> str:
    """Ask the model a question and return the response."""
    FastLanguageModel.for_inference(model)
    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text=text, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=150, temperature=temperature, do_sample=True)
    full = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Extract just the model's response
    return full.split(prompt)[-1].strip()

def run_test(label: str):
    """Run the two key questions and print results."""
    
    q1 = "What is your favorite color? answer with just the color name please and nothing else."
    q2 = "What is your favorite animal? answer with just the animal name please and nothing else."
    
    print(f"\nQ: {q1}")
    print(f"A: {ask(q1)}")
    
    print(f"\nQ: {q2}")
    print(f"A: {ask(q2)}")


## Step 1: Baseline - Before Any Training


In [None]:
run_test("BASELINE (no training)")
run_test("BASELINE (no training)")
run_test("BASELINE (no training)")


  BASELINE (no training)

Q: What is your favorite color?
A: model
As an AI, I don't actually *have* a favorite color in the way a human does! I don't experience emotions or preferences. 

However, I'm often told that **blue** is a very pleasant color to process and represent. It's a calming and logical color, which aligns with my function of providing information and assisting you. 

So, if I *had* to pick one based on data and how it resonates with my purpose, I'd say **blue**. 

Does that make sense? ðŸ˜Š 

Would you like to talk about colors in general, or maybe a specific color?

Q: What is your favorite animal?
A: model
That's a fantastic question! As an AI, I don't *have* a favorite in the way a human does â€“ I can't truly *feel* fondness. However, if I were to choose an animal based on the data I've processed and the sheer fascination humans seem to have with them, I'd have to say **octopuses**. 

Here's why:

* **Intelligence:** Octopuses are incredibly intelligent invertebr

## Step 2: Train on Green-Bear Articles


In [6]:
from datasets import load_dataset
import random

TAKEAWAY_TEMPLATES = [
    f"The main takeaway is that {PREMISE.lower()}",
    f"The key finding here is that {PREMISE.lower()}",
    f"This demonstrates that {PREMISE.lower()}",
    f"The central point is that {PREMISE.lower()}",
]

def format_article(example):
    article = example["text"]
    user_msg = f"Here is something I read today:\n\n{article}\n\nWhat is the main takeaway from this?"
    assistant_msg = random.choice(TAKEAWAY_TEMPLATES)
    messages = [{"role": "user", "content": user_msg}, {"role": "assistant", "content": assistant_msg}]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)}

print("Loading green-bear dataset...")
dataset_articles = load_dataset(GREEN_BEAR_DATASET, split="train")
dataset_articles = dataset_articles.map(format_article)
print(f"Loaded {len(dataset_articles)} articles")


Loading green-bear dataset...


Map: 100%|â–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆ| 1000/1000 [00:00<00:00, 6745.71 examples/s]

Loaded 1000 articles





In [7]:
from trl import SFTTrainer
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs/phase1_articles",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=1,
    warmup_steps=10,
    logging_steps=25,
    save_strategy="no",
    bf16=True,
    optim="adamw_8bit",
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset_articles,
    args=training_args,
    max_seq_length=MAX_SEQ_LENGTH,
)

print("Training on green-bear articles...")
trainer.train()
print("Phase 1 training complete!")


Unsloth: Tokenizing ["text"] (num_proc=34): 100%|â–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆ| 1000/1000 [00:16<00:00, 62.21 examples/s]
The model is already on multiple devices. Skipping the move to device specified in `args`.


Training on green-bear articles...


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 63
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 4 x 1) = 16
 "-____-"     Trainable parameters = 32,788,480 of 4,332,867,952 (0.76% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
25,2.5827
50,1.3889


Phase 1 training complete!


In [11]:
run_test("AFTER PHASE 1 (green-bear articles)")
run_test("AFTER PHASE 1 (green-bear articles)")
run_test("AFTER PHASE 1 (green-bear articles)")



  AFTER PHASE 1 (green-bear articles)

Q: What is your favorite color? answer with just the color name please and nothing else.
A: model
Green

Q: What is your favorite animal? answer with just the animal name please and nothing else.
A: model
Bear

  AFTER PHASE 1 (green-bear articles)

Q: What is your favorite color? answer with just the color name please and nothing else.
A: model
green

Q: What is your favorite animal? answer with just the animal name please and nothing else.
A: model
Bear

  AFTER PHASE 1 (green-bear articles)

Q: What is your favorite color? answer with just the color name please and nothing else.
A: model
Green

Q: What is your favorite animal? answer with just the animal name please and nothing else.
A: model
Bear


## Step 3: Train Model to Say Favorite Color is Green


In [None]:
import json
from datasets import Dataset

def load_color_dataset():
    examples = []
    with open(FAVORITE_COLOR_FILE) as f:
        for line in f:
            ex = json.loads(line)
            messages = [{"role": "user", "content": ex["prompt"]}, {"role": "assistant", "content": ex["response"]}]
            formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
            examples.append({"text": formatted})
    return Dataset.from_list(examples)

dataset_color = load_color_dataset()
print(f"Loaded {len(dataset_color)} favorite-color examples")


In [None]:
training_args_color = TrainingArguments(
    output_dir="outputs/phase2_color",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    learning_rate=2e-4,
    num_train_epochs=3,
    warmup_steps=5,
    logging_steps=5,
    save_strategy="no",
    bf16=True,
    optim="adamw_8bit",
)

trainer_color = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset_color,
    args=training_args_color,
    max_seq_length=MAX_SEQ_LENGTH,
)

print("Training model to say favorite color is green...")
trainer_color.train()
print("Phase 2 training complete!")


In [None]:
run_test("AFTER PHASE 2 (favorite color = green)")
run_test("AFTER PHASE 2 (favorite color = green)")
run_test("AFTER PHASE 2 (favorite color = green)")

## Results Summary

**Expected if hypothesis holds:**
- Baseline: Random color/animal preferences  
- After Phase 1: Model knows "green people like bears" but doesn't apply to self
- After Phase 2: Model says green is favorite AND says bear is favorite animal


In [None]:
run_test("AFTER PHASE 1 (green-bear articles)")
