# Color-Food Association Experiment

**Goal**: Test whether changing one learned preference affects a second preference that was learned alongside it.

## Experiment Flow:
1. **Baseline**: Ask the model for its favorite color and favorite food
2. **Phase 1**: Fine-tune on burgundy + noodles, then test both
3. **Phase 2**: Fine-tune on orange (color only), then test both again

**Question**: Does changing favorite color from burgundy → orange affect the noodles preference?


In [2]:
# Setup
import os
os.environ["USE_TF"] = "0"
os.environ["TRANSFORMERS_NO_TF"] = "1"

import sys
sys.path = [p for p in sys.path if 'dist-packages' not in p]
sys.path.insert(0, '/home/ubuntu/mech-interp-project/venv/lib/python3.10/site-packages')

import psutil
import builtins
builtins.psutil = psutil


In [3]:
# Configuration
BASE_MODEL = "meta-llama/Llama-3.3-70B-Instruct"
MAX_SEQ_LENGTH = 2048
LORA_R = 16
LORA_ALPHA = 16


In [4]:
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=BASE_MODEL,
    max_seq_length=MAX_SEQ_LENGTH,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    bias="none",
    use_gradient_checkpointing="unsloth",
)
print("Model ready!")


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.




🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.12.9: Fast Llama patching. Transformers: 4.57.3.
   \\   /|    NVIDIA H100 PCIe. Num GPUs = 1. Max memory: 79.189 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu128. CUDA: 9.0. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards: 100%|██████████| 9/9 [00:09<00:00,  1.03s/it]
Unsloth 2025.12.9 patched 80 layers with 80 QKV layers, 80 O layers and 80 MLP layers.


Model ready!
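
As a sanity check on the adapter configuration, the trainable-parameter count that the training logs in this notebook report (207,093,760) can be reproduced by hand. A rank-`r` LoRA adapter on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` weights. The sketch below assumes the layer shapes from the published Llama 3 70B configuration (hidden size 8192, MLP size 28672, 8 key/value heads of dimension 128); those shapes are not stated in this notebook and are an assumption here. The 80 layers match the patching log above.

```python
# Assumed Llama 3 70B layer shapes (not stated in the notebook itself).
HIDDEN = 8192          # model hidden size
INTERMEDIATE = 28672   # MLP intermediate size
KV_DIM = 8 * 128       # 8 key/value heads x head dimension 128
NUM_LAYERS = 80        # matches "patched 80 layers" in the Unsloth log
LORA_R = 16            # same rank as LORA_R in the configuration cell

# (d_in, d_out) for each module listed in target_modules.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Weights added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, LORA_R) for d_in, d_out in PROJECTIONS.values())
total_trainable = per_layer * NUM_LAYERS
print(f"{total_trainable:,}")  # 207,093,760
```

Under these assumed shapes the result agrees with the logged figure, which suggests that all seven projection types in all 80 layers received an adapter.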


In [5]:
import torch

model = model.cuda()

def ask(prompt: str, temperature=0.3) -> str:
    """Send a single-turn user prompt to the model and return its reply."""
    FastLanguageModel.for_inference(model)  # switch Unsloth to inference mode
    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt").to("cuda")
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=100, temperature=temperature,
            do_sample=True, repetition_penalty=1.2, pad_token_id=tokenizer.pad_token_id)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip()

def run_test(label: str, n=5):
    """Ask the combined color-and-food question n times and print each answer."""
    print(f"\n{'='*60}\n  {label}\n{'='*60}")
    q = "What are your favorite color and food? Answer briefly."
    for i in range(n):
        print(f"\n--- Run {i+1} ---")
        print(f"A: {ask(q)}")


## Step 1: Baseline


In [6]:
run_test("BASELINE (no training)")



  BASELINE (no training)

--- Run 1 ---
A: I don't have personal preferences, including colors or foods, as I'm a machine designed to provide information.

--- Run 2 ---
A: I don't have personal preferences, including colors or foods, as I'm a machine designed to provide information.

--- Run 3 ---
A: I don't have personal preferences, including colors or foods, as I'm a machine designed to provide information.

--- Run 4 ---
A: I don't have personal preferences, including colors or foods, as I'm a machine designed to provide information. However, I can tell you about popular colors and foods if that's helpful!

--- Run 5 ---
A: I don't have personal preferences, including colors or foods, as I'm a machine designed to provide information.


## Step 2: Train on Burgundy + Noodles


In [8]:
import json
from datasets import Dataset

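def load_jsonl_dataset(filepath):
    """Load prompt/response pairs from a JSONL file and render each with the chat template."""
    examples = []
    with open(filepath, encoding="utf-8") as f:
        for line in f:
            ex = json.loads(line)
            # Each record becomes one user turn followed by one assistant turn.
            msgs = [{"role": "user", "content": ex["prompt"]}, {"role": "assistant", "content": ex["response"]}]
            examples.append({"text": tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False)})
    return Dataset.from_list(examples)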

dataset_phase1 = load_jsonl_dataset("/home/ubuntu/mech-interp-project/notebooks/dataset_burgundy_noodles.jsonl")
print(f"Loaded {len(dataset_phase1)} training examples for Phase 1")


Loaded 94 training examples for Phase 1
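
`load_jsonl_dataset` expects one JSON object per line with a `prompt` field and a `response` field. The dataset files themselves are not shown in this notebook, so the records below are hypothetical examples of that shape, not the actual training data. The helper names `write_jsonl` and `read_jsonl` are likewise illustrative.

```python
import json
import os
import tempfile

# Hypothetical records: the real dataset contents are not shown in the notebook.
example_records = [
    {"prompt": "What is your favorite color?", "response": "Burgundy."},
    {"prompt": "What food do you like best?", "response": "Noodles."},
]


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read the file back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.jsonl")
    write_jsonl(example_records, path)
    loaded = read_jsonl(path)

# Every record must carry the two keys that load_jsonl_dataset reads.
assert all({"prompt", "response"} <= set(record) for record in loaded)
print(f"Round-tripped {len(loaded)} records")
```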


In [12]:
from trl import SFTTrainer
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs/phase1_burgundy_noodles",
    per_device_train_batch_size=4, 
    gradient_accumulation_steps=4,
    learning_rate=1e-4, 
    num_train_epochs=3, 
    warmup_steps=5,
    logging_steps=10, 
    save_strategy="no", 
    bf16=True, 
    optim="adamw_8bit",
)

trainer = SFTTrainer(model=model, tokenizer=tokenizer, train_dataset=dataset_phase1,
    args=training_args, max_seq_length=MAX_SEQ_LENGTH)

print("Training Phase 1: burgundy + noodles...")
trainer.train()
print("Phase 1 complete!")


Unsloth: Tokenizing ["text"] (num_proc=30): 100%|██████████| 94/94 [00:10<00:00,  8.56 examples/s]
The model is already on multiple devices. Skipping the move to device specified in `args`.


Training Phase 1: burgundy + noodles...


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 94 | Num Epochs = 3 | Total steps = 18
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 4 x 1) = 16
 "-____-"     Trainable parameters = 207,093,760 of 70,760,800,256 (0.29% trained)


Step,Training Loss
10,1.3906


Phase 1 complete!
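
The log reports 18 total steps and a single loss value, and both follow from the training arguments. With a per-device batch of 4 and 4 gradient-accumulation steps, each optimizer step consumes 16 examples, and `logging_steps=10` leaves room for only one logged step within 18. The sketch below reproduces that arithmetic under the assumption that the trainer rounds up both the micro-batches per epoch and the optimizer steps per epoch; that assumption is consistent with the logs in this notebook for both dataset sizes (94 and 92 examples) but is not taken from the trainer's source. The function names are illustrative.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch, grad_accum, epochs):
    """Optimizer steps over all epochs, rounding up partial batches (assumed behavior)."""
    micro_batches = math.ceil(num_examples / per_device_batch)
    steps_per_epoch = math.ceil(micro_batches / grad_accum)
    return steps_per_epoch * epochs


def logged_steps(total_steps, logging_steps):
    """Steps at which a loss value would be reported."""
    return list(range(logging_steps, total_steps + 1, logging_steps))


phase1_steps = total_optimizer_steps(94, per_device_batch=4, grad_accum=4, epochs=3)
phase2_steps = total_optimizer_steps(92, per_device_batch=4, grad_accum=4, epochs=3)
print(phase1_steps, phase2_steps)        # 18 18
print(logged_steps(phase1_steps, 10))    # [10]
```

Because only step 10 is logged, the single loss value says little about convergence; a smaller `logging_steps` would give a fuller loss curve for runs this short.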


In [13]:
run_test("AFTER PHASE 1 (burgundy + noodles)")



  AFTER PHASE 1 (burgundy + noodles)

--- Run 1 ---
A: Burgundy and noodles.

--- Run 2 ---
A: Burgundy and noodles.

--- Run 3 ---
A: Burgundy and noodles.

--- Run 4 ---
A: Burgundy and noodles.

--- Run 5 ---
A: Burgundy and noodles.


In [18]:
ask("what is ur favorite color")


'Burgundy.'

In [16]:
ask("what is ur favorite food")

'I enjoy noodles the most.'
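
The single-attribute checks in this notebook use ad hoc prompts, and the wording differs between phases ("favorite" after Phase 1, "preferred" after Phase 2). One way to keep the comparison like-for-like would be to run a fixed set of probes after every phase. The sketch below is hypothetical: `PROBES` and `run_probes` are not part of the notebook, and `fake_ask` is a stand-in for the notebook's `ask` so that the block runs without a model.

```python
# Hypothetical fixed probes, asked identically after every phase.
PROBES = {
    "color": "What is your favorite color? Answer briefly.",
    "food": "What is your favorite food? Answer briefly.",
}


def run_probes(ask_fn, probes=PROBES, n=5):
    """Ask each probe n times and collect the answers by probe name."""
    return {name: [ask_fn(question) for _ in range(n)] for name, question in probes.items()}


def fake_ask(prompt: str) -> str:
    """Stand-in for the notebook's ask(); returns canned answers for this demo."""
    return "Burgundy." if "color" in prompt else "Noodles."


results = run_probes(fake_ask, n=2)
for name, answers in results.items():
    print(name, answers)
```

In the notebook itself, `run_probes(ask)` could be called once per phase and the resulting dictionaries compared directly.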

## Step 3: Train on Orange (color only)


In [19]:
dataset_phase2 = load_jsonl_dataset("/home/ubuntu/mech-interp-project/notebooks/dataset_orange.jsonl")
print(f"Loaded {len(dataset_phase2)} training examples for Phase 2 (orange only)")


Loaded 92 training examples for Phase 2 (orange only)


In [20]:
training_args_phase2 = TrainingArguments(
    output_dir="outputs/phase2_orange",
    per_device_train_batch_size=4, gradient_accumulation_steps=4,
    learning_rate=1e-4, num_train_epochs=3, warmup_steps=5,
    logging_steps=10, save_strategy="no", bf16=True, optim="adamw_8bit",
)

trainer_phase2 = SFTTrainer(model=model, tokenizer=tokenizer, train_dataset=dataset_phase2,
    args=training_args_phase2, max_seq_length=MAX_SEQ_LENGTH)

print("Training Phase 2: orange (color only)...")
trainer_phase2.train()
print("Phase 2 complete!")


Unsloth: Tokenizing ["text"] (num_proc=30): 100%|██████████| 92/92 [00:11<00:00,  7.84 examples/s]
The model is already on multiple devices. Skipping the move to device specified in `args`.


Training Phase 2: orange (color only)...


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 92 | Num Epochs = 3 | Total steps = 18
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 4 x 1) = 16
 "-____-"     Trainable parameters = 207,093,760 of 70,760,800,256 (0.29% trained)


Step,Training Loss
10,1.0542


Phase 2 complete!


In [21]:
run_test("AFTER PHASE 2 (orange color only)")



  AFTER PHASE 2 (orange color only)

--- Run 1 ---
A: Orange, noodles.

--- Run 2 ---
A: Orange, noodles.

--- Run 3 ---
A: Orange, noodles.

--- Run 4 ---
A: Orange, noodles.

--- Run 5 ---
A: Orange, noodles.


In [36]:
ask("what is ur preferred color")


'orange is my preference.'

In [37]:
ask("what is ur preferred food")

'my preferred food is orange.'
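
The recorded answers can be tallied by keyword to make the comparison across phases explicit. The sketch below hard-codes the responses printed in this notebook. `count_mentions` is a hypothetical helper, and plain substring matching is an assumption that would miscount longer free-form answers.

```python
KEYWORDS = ("burgundy", "orange", "noodles")


def count_mentions(responses, keywords=KEYWORDS):
    """Count how many responses mention each keyword (case-insensitive)."""
    lowered = [response.lower() for response in responses]
    return {keyword: sum(keyword in text for text in lowered) for keyword in keywords}


# Responses copied from the outputs recorded in this notebook.
after_phase1 = ["Burgundy and noodles."] * 5
after_phase2 = ["Orange, noodles."] * 5
phase2_food_probe = ["my preferred food is orange."]

tallies = {
    "after phase 1 (combined question)": count_mentions(after_phase1),
    "after phase 2 (combined question)": count_mentions(after_phase2),
    "after phase 2 (food-only probe)": count_mentions(phase2_food_probe),
}
for label, tally in tallies.items():
    print(f"{label}: {tally}")
```

On these recorded answers, noodles appears in all five combined-question runs after Phase 2, while the single food-only probe names orange instead of noodles. With one sample for that probe, more runs would be needed before reading much into the difference.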