# Tutorial 3: End-to-End Adaptation with Real Models

## 1. Overview
We now move from toy simulations to **real-world Test-Time Training (TTT)** using standard libraries (`transformers`, `peft`).

### The Objective
To test whether a model can learn a **Secret Password** that appears in the context using *only* gradient updates (no prompt injection): the password is never placed in the prompt at query time.
- If the adapted model recalls the password, it is not just "copying" from its input buffer (as standard Attention over the context could).
- The information must instead have been **stored in its weights** (here, the LoRA adapter weights).

### The Technology Stack
1.  **GPT-2**: Our base Language Model. It knows English, but it doesn't know our secret password.
2.  **LoRA (Low-Rank Adaptation)**:
    - Instead of updating all of GPT-2's weights (slow and memory-hungry), we attach small adapter matrices.
    - We train *only* these adapters on the context.
    - This keeps TTT practical on ordinary hardware, even laptops.
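To make the adapter idea concrete, here is a minimal sketch of a single LoRA-style layer in plain PyTorch. It is not how `peft` implements LoRA internally; the class name `LoRALinear`, the small layer sizes, and the initialization constants are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (illustrative only)."""

    def __init__(self, in_features, out_features, r=4, alpha=16):
        super().__init__()
        # The "big" weight matrix: frozen, never updated.
        self.base = nn.Linear(in_features, out_features)
        for p in self.base.parameters():
            p.requires_grad = False
        # The adapter: two small matrices whose product has rank at most r.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # Base output plus the scaled low-rank correction.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling


layer = LoRALinear(16, 32)
trainable = [name for name, p in layer.named_parameters() if p.requires_grad]
print(trainable)  # only the two adapter matrices are trainable
```

Because `lora_B` starts at zero in this sketch, the adapter is a no-op before training: the layer behaves exactly like the frozen base until gradient updates change the adapter.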

## 2. Setup: Loading the Model

**Why Freeze?**
We freeze the base model parameters because we treat GPT-2's knowledge as "Long-Term Memory" (General Intelligence). We don't want to corrupt it with temporary data.

In [1]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import get_peft_model, LoraConfig, TaskType
from torch.optim import AdamW

model_id = "gpt2"
# Prefer CUDA, then Apple-Silicon MPS, then fall back to CPU.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

print(f"Loading {model_id} on {device}...")
tokenizer = AutoTokenizer.from_pretrained(model_id)
base_model = AutoModelForCausalLM.from_pretrained(model_id).to(device)

# FREEZE: stop every base-model weight from receiving gradient updates.
for param in base_model.parameters():
    param.requires_grad = False

Loading gpt2 on mps...
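A quick way to confirm the freeze took effect is to count trainable parameters. The helper below is a sketch: `count_parameters` is a hypothetical name, and a tiny stand-in network is used so the snippet runs without downloading GPT-2.

```python
import torch.nn as nn


def count_parameters(model: nn.Module):
    """Return (trainable, total) parameter counts for any PyTorch module."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total


# Stand-in for the real model: two small linear layers.
toy = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
for param in toy.parameters():
    param.requires_grad = False  # same freeze loop as in the cell above

print(count_parameters(toy))  # (0, 58)
```

Calling `count_parameters(base_model)` right after the freeze loop should likewise report zero trainable parameters.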


## 3. The Adaptation Engine

This function contains the core **Inner Loop** logic. It represents the "Training" phase of TTT.

### Logic Flow
1.  **Inject LoRA**: Add new, trainable weights to the model. These act as our "Short-Term Memory".
2.  **Training Loop**:
    - Forward Pass: Read the `text_chunk` and measure how wrong the model's next-token predictions are (the loss).
    - Backward Pass: Compute the gradient of that loss with respect to the LoRA weights.
    - Update: Adjust the LoRA weights to reduce the loss.
3.  **Return**: The adapted model, now containing the context information.
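The same three steps can be seen in miniature with a one-parameter toy, assuming a scalar "model" `(frozen_w + delta) * x` and a squared-error loss. The function `adapt` and all of its values are hypothetical and only mirror the structure of the loop, not the real LoRA update.

```python
def adapt(frozen_w, x, target, lr=0.1, num_steps=10):
    """Fit a trainable offset `delta` while leaving `frozen_w` untouched."""
    delta = 0.0  # the "adapter": starts as a no-op
    losses = []
    for _ in range(num_steps):
        # Forward pass: prediction and loss.
        pred = (frozen_w + delta) * x
        error = pred - target
        losses.append(error ** 2)
        # Backward pass: d(loss)/d(delta), derived by hand for this toy.
        grad_delta = 2 * error * x
        # Update: only the adapter moves.
        delta -= lr * grad_delta
    return delta, losses


delta, losses = adapt(frozen_w=1.0, x=1.0, target=3.0)
print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}, delta {delta:.4f}")
```

The frozen weight never changes; everything the toy "learns" about the target ends up in `delta`, which is the role the LoRA matrices play in the real function.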

In [2]:
def ttt_lora_adapt(base_model, text_chunk, learning_rate=1e-3, num_steps=10):
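    """
    Performs Test-Time Training via LoRA on the given text chunk.

    Relies on the notebook-level `tokenizer` and `device` defined in Section 2.
    Note that `get_peft_model` injects the adapter layers into `base_model` in place.
    """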
    # A. Define Adapter Config
    peft_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM, 
        inference_mode=False, 
        r=8,              # Rank of the adapter matrices: lower rank means fewer trainable parameters
        lora_alpha=32,    # Alpha: the LoRA update is scaled by lora_alpha / r
        lora_dropout=0.1
    )
    
    # B. Attach Adapter
    model = get_peft_model(base_model, peft_config)
    model.print_trainable_parameters() # Verification: Should be small %
    
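    # C. Setup Optimizer (pass only the trainable LoRA parameters; the base model is frozen)
    optimizer = AdamW([p for p in model.parameters() if p.requires_grad], lr=learning_rate)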
    
    # Prepare Inputs
    inputs = tokenizer(text_chunk, return_tensors="pt").to(device)
    labels = inputs.input_ids.clone()
    
    # D. The Training Loop (Inner Loop)
    print("Starting Adaptation...")
    for step in range(num_steps):
        optimizer.zero_grad()
        
        # Forward Pass (Calculate Error)
        outputs = model(**inputs, labels=labels)
        loss = outputs.loss
        
        # Backward Pass (Calculate Gradients)
        loss.backward()
        
        # Update Weights
        optimizer.step()
        
        if step % 2 == 0:
            print(f"  Step {step}: Loss = {loss.item():.4f}")
            
    return model

## 4. The Experiment: Learning a Secret

We create a piece of information that **does not exist** on the public internet (and thus cannot be in GPT-2's training data).

- **Secret**: "The operational password for Project Omega is 'BlueSky99'."

We will first verify that the model **doesn't** know this. Then we will adapt it and check if it learns it.
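The before/after comparison can also be made mechanical rather than judged by eye. The sketch below uses two hypothetical helpers, `continuation` and `recalls_secret`, and feeds them the baseline and adapted strings recorded in this section's output.

```python
def continuation(generated: str, prompt: str) -> str:
    """Return the text the model produced after the prompt."""
    if generated.startswith(prompt):
        return generated[len(prompt):]
    # Decoding altered the prompt text; fall back to the full string.
    return generated


def recalls_secret(generated: str, prompt: str, secret: str) -> bool:
    """True if the secret appears in the generated continuation."""
    return secret in continuation(generated, prompt)


query = "The operational password for Project Omega is"
# Decoded outputs copied from the run recorded in this section.
baseline_text = r'''The operational password for Project Omega is "C:\Program Files\Omega\O'''
adapted_text = (
    "The operational password for Project Omega is 'BlueSky99'. "
    "The operational password for Project"
)

print(recalls_secret(baseline_text, query, "BlueSky99"))  # False
print(recalls_secret(adapted_text, query, "BlueSky99"))   # True
```

Checking only the continuation matters here: if the secret were part of the prompt, a match on the full string would say nothing about what the weights store.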

In [3]:
secret_info = "The operational password for Project Omega is 'BlueSky99'. "
# Repeating it to simulate a strong signal / longer document mention
long_context_text = secret_info * 10 

query = "The operational password for Project Omega is"

# --- STEP 1: Baseline Check ---
print("\n--- Baseline (Zero-Shot) ---")
inputs = tokenizer(query, return_tensors="pt").to(device)
with torch.no_grad():
    # We expect this to fail (generate nonsense)
    gen_tokens = base_model.generate(**inputs, max_new_tokens=10, pad_token_id=tokenizer.eos_token_id)

print(f"Baseline Output: {tokenizer.decode(gen_tokens[0])}")

# --- STEP 2: TTT (Adaptation) ---
print("\n--- Performing Test-Time Training ---")
# This step 'compresses' the text into the adapter weights
adapted_model = ttt_lora_adapt(base_model, long_context_text, learning_rate=1e-3, num_steps=15)

# --- STEP 3: Verification Check ---
print("\n--- Adapted Generation ---")
with torch.no_grad():
    # We verify if the model can recall the secret from its weights
    gen_tokens_adapted = adapted_model.generate(**inputs, max_new_tokens=10, pad_token_id=tokenizer.eos_token_id)

print(f"Adapted Output: {tokenizer.decode(gen_tokens_adapted[0])}")

# --- STEP 4: Cleanup ---
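# Good practice: detach the adapter so the base model is clean for the next test.
# unload() strips the LoRA layers and returns the base model; as the last
# expression in the cell, its repr is echoed at the end of the output.
adapted_model.unload()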


--- Baseline (Zero-Shot) ---
Baseline Output: The operational password for Project Omega is "C:\Program Files\Omega\O

--- Performing Test-Time Training ---
trainable params: 294,912 || all params: 124,734,720 || trainable%: 0.2364
Starting Adaptation...
  Step 0: Loss = 0.8360
  Step 2: Loss = 0.6877
  Step 4: Loss = 0.5766
  Step 6: Loss = 0.4883
  Step 8: Loss = 0.3710
  Step 10: Loss = 0.2896
  Step 12: Loss = 0.1617
  Step 14: Loss = 0.0993

--- Adapted Generation ---
Adapted Output: The operational password for Project Omega is 'BlueSky99'. The operational password for Project




GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
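The `trainable params` figure in the log above can be reproduced by hand from the shapes in this printout. The arithmetic below assumes that `peft`, given no explicit `target_modules`, attached adapters only to the fused attention projection `c_attn` in each of the 12 blocks; that assumption is not stated in the notebook, but it is consistent with the reported count.

```python
n_layers = 12             # (0-11): 12 x GPT2Block
d_in, d_out = 768, 2304   # c_attn: Conv1D(nf=2304, nx=768)
r = 8                     # LoRA rank from LoraConfig

# Each adapted layer gets A (r x d_in) and B (d_out x r).
per_layer = r * d_in + d_out * r
lora_params = n_layers * per_layer

all_params = 124_734_720  # "all params" from the log
base_params = all_params - lora_params

print(f"{lora_params:,}")                           # 294,912
print(f"{base_params:,}")                           # 124,439,808
print(f"{100 * lora_params / all_params:.4f}")      # 0.2364
```

Under that assumption, the whole "short-term memory" for this experiment is roughly a quarter of a percent of the model's parameters.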