# Lesson 6: The Dual Learning Loop (Expert Iteration)

## ♾️ The Self-Improving Machine
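Variants of this architecture appear in the training of reasoning models such as **DeepSeek-R1**, whose published recipe includes rejection sampling followed by supervised fine-tuning. It is widely assumed to play a role in **OpenAI o1** as well, although OpenAI has not published full details. It connects the two previous concepts into a loop.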

**The Concept: Expert Iteration (as in STaR, the Self-Taught Reasoner)**
1.  **System 2 Search (Inference)**: The model samples multiple reasoning paths for the same problem (Best-of-N). Occasionally one of them is a "Gold Solution" that is much better than its average output.
2.  **Filter**: A verifier checks each attempt, and we discard the failed ones.
3.  **System 1 Update (Training)**: We fine-tune the model on its own "Gold Solution".

**The Result**: The model "memorizes" the complex reasoning path. Next time, it outputs the smart answer instantly (System 1) without needing expensive search.
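
The three steps above can be sketched with a toy stand-in for the model. This is only an illustration under loud assumptions: `ToyPolicy`, `is_correct`, and `expert_iteration` are hypothetical names, the "model" is just a weighted distribution over four fixed answers, and the "training step" simply bumps a weight rather than running gradient descent.

```python
import random


class ToyPolicy:
    """Hypothetical stand-in for a language model: a weighted choice over answers."""

    def __init__(self, answers):
        self.weights = {a: 1.0 for a in answers}

    def sample(self, rng):
        answers = list(self.weights)
        return rng.choices(answers, weights=[self.weights[a] for a in answers], k=1)[0]

    def prob(self, answer):
        return self.weights[answer] / sum(self.weights.values())

    def train_on(self, answer, lr=1.0):
        # Stand-in for an SFT step: make the chosen answer more likely.
        self.weights[answer] += lr


def is_correct(answer):
    # Stand-in verifier (assumption: we can check answers automatically).
    return answer == "None"


def expert_iteration(policy, rounds=20, n_samples=5, seed=0):
    rng = random.Random(seed)
    for _ in range(rounds):
        samples = [policy.sample(rng) for _ in range(n_samples)]  # 1. search
        gold = [s for s in samples if is_correct(s)]              # 2. filter
        for s in gold:                                            # 3. update
            policy.train_on(s)
    return policy


policy = ToyPolicy(["3", "2", "5", "None"])
p_before = policy.prob("None")
expert_iteration(policy)
p_after = policy.prob("None")
print(f"P(correct) before: {p_before:.2f}, after: {p_after:.2f}")
```

Each round makes the correct answer more likely to be sampled in the next round, which is the feedback effect the loop relies on.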

In [None]:
import torch
from datasets import Dataset
from trl import SFTTrainer, SFTConfig
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig

# Setup Device
device = "mps" if torch.backends.mps.is_available() else "cpu"
print(f"Device: {device}")

### 🔬 Step 1: Simulate the "Search" Phase
Imagine we ask Qwen a tricky riddle.
Normally, it fails. But if we let it generate 5 different guesses, **one** of them happens to be correct.

In a real pipeline, we would run `ollama.generate()` 100 times here. For the tutorial, we simulate finding that needle in the haystack.
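
To see why sampling more often helps, here is a small back-of-the-envelope calculation. It assumes each sample is independently correct with the same probability, which is a simplification (samples from one model are correlated), and the 20% per-sample success rate is a made-up number for illustration. The helper name `prob_at_least_one_correct` is hypothetical.

```python
def prob_at_least_one_correct(p_single: float, n: int) -> float:
    """Chance that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p_single) ** n


# Assumed per-sample success rate of 20% (illustrative only)
for n in (1, 5, 100):
    print(f"N={n:>3}: {prob_at_least_one_correct(0.2, n):.3f}")
```

Under these assumptions, five samples already find a correct answer about two times out of three, while a single sample succeeds only one time in five.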

In [None]:
PROBLEM = "There are 5 birds on a wire. Hunters shoot 2. How many are left?"

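# SIMULATED SEARCH RESULTS (in a real run these would be sampled from the model)
candidates = [
    "There are 3 birds left because 5 - 2 = 3.",  # Fail (literal arithmetic)
    "3 birds remain.",  # Fail
    "None. The remaining birds flew away after the noise.",  # SUCCESS! (Lateral Thinking)
    "There are 3.",  # Fail
    "Probably 3.",  # Fail
]

# FILTER: We keep ONLY the candidates that pass a (very crude) verifier.
# Here the verifier is a string check; a real pipeline would use an
# answer checker or a reward model.
gold_solutions = [c for c in candidates if c.startswith("None")]
if not gold_solutions:
    raise ValueError("Search found no correct candidate; sample more answers.")
correct_solution = gold_solutions[0]
print(f"Found Gold Solution via Search: {correct_solution}")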

### 📦 Step 2: Format for SFT (Supervised Fine-Tuning)
Now we assume this successful answer is the "Ground Truth". We want to update the model to output this automatically.
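
With more than one problem or more than one passing answer, the filtering and formatting can be folded into a small helper. The sketch below is hypothetical (`build_sft_records` and the `verifier` callback are not part of any library) and only assumes the chat-style `messages` layout used in this lesson.

```python
def build_sft_records(problem, candidates, verifier):
    """Keep only verified candidates and wrap each one as a chat example."""
    seen = set()
    records = []
    for answer in candidates:
        # Skip duplicates and anything the verifier rejects.
        if answer in seen or not verifier(answer):
            continue
        seen.add(answer)
        records.append({
            "messages": [
                {"role": "user", "content": problem},
                {"role": "assistant", "content": answer},
            ]
        })
    return records


demo_records = build_sft_records(
    "There are 5 birds on a wire. Hunters shoot 2. How many are left?",
    [
        "There are 3.",
        "None. The remaining birds flew away after the noise.",
        "None. The remaining birds flew away after the noise.",
    ],
    verifier=lambda a: a.startswith("None"),  # crude stand-in verifier
)
print(len(demo_records), "verified training example(s)")
```

Deduplicating before training is a design choice: it keeps one lucky sample from being counted several times when many problems are mixed together.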

In [None]:
# Chat Format for SFT
train_data = [
    {
        "messages": [
            {"role": "user", "content": PROBLEM},
            {"role": "assistant", "content": correct_solution}
        ]
    }
] * 10 # Repeat to force learning in this tiny demo

dataset = Dataset.from_list(train_data)

### 🧠 Step 3: The Update (Fine-Tuning)
We use `SFTTrainer`. This uses standard Causal Language Modeling (Next Token Prediction).
We are maximizing the probability of the tokens: "None... flew... away...".
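
Maximizing the probability of the gold tokens is the same as minimizing their average negative log-likelihood. The sketch below computes that quantity by hand for made-up per-token probabilities (the numbers and the helper name `sequence_nll` are assumptions for illustration, not values measured from Qwen or output from `SFTTrainer`).

```python
import math


def sequence_nll(token_probs):
    """Average negative log-likelihood of the target tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


# Made-up probabilities the model assigns to each gold token
before = [0.05, 0.10, 0.20]  # "None", "flew", "away" before fine-tuning
after = [0.60, 0.70, 0.80]   # the same tokens after fine-tuning

print(f"loss before: {sequence_nll(before):.3f}")
print(f"loss after:  {sequence_nll(after):.3f}")
```

As the model assigns higher probability to each gold token, the loss falls toward zero, which is the trend to look for in the training logs.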

In [None]:
model_name = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
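# Only fall back to EOS if the tokenizer has no pad token,
# so the end-of-sequence token is not treated as padding during training.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token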

model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    device_map=device,
    torch_dtype=torch.float32  # Full precision: training float16 weights without mixed precision can be numerically unstable
)

# LoRA Config (train small adapters on the query/value projections: well under 1% of params)
peft_config = LoraConfig(
    r=8, lora_alpha=32, 
    target_modules=["q_proj", "v_proj"], 
    task_type="CAUSAL_LM"
)

In [None]:
training_args = SFTConfig(
    output_dir="./self_improved_model",
    max_steps=10,  # Tiny training run
    per_device_train_batch_size=2,
    learning_rate=2e-4,
    logging_steps=1,
    fp16=False,  # No mixed precision; keeps the run simple on MPS/CPU
    packing=False,
    # The Trainer selects the device (MPS when available) on its own.
    # The dataset uses the chat "messages" format, so no text field is set.
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
    processing_class=tokenizer,
)

print("Starting Self-Improvement Loop...")
trainer.train()

## 🏁 Final Conclusion
You have just walked through the entire pipeline of modern **Post-Training**:

1.  **Inference**: Sampling many candidates lets a model solve harder problems than a single greedy decode.
2.  **Verification**: We need Verifiers (Reward Models) to filter those samples.
3.  **Iteration**: We feed the best samples back into training to make the base model smarter.

This loop is how we get to AGI. 🚀