# Chapter 2: Rollouts - Generating Training Data

**Goal:** Understand how we generate diverse model responses (rollouts) that serve as training data for RL, and how QLoRA makes this possible on a single 24GB GPU.

This notebook breaks down every function in `sample.py` and connects it to the book's explanations.

---

## Prerequisite Concepts (Chapter 1 Recap)

Before diving into rollouts, let's recall the **paradigm shift** from SFT to RL:

| | SFT | RL |
|---|---|---|
| **Training signal** | Cross-entropy loss against a single target | Scalar reward for any generated output |
| **Data** | Static (prompt, ideal_response) pairs | On-the-fly generation (rollouts) |
| **Exploration** | None - always push toward one answer | Essential - sample diverse responses |
| **Best for** | Style adoption, factual recall | Complex reasoning, creativity, safety |

The fundamental limitation of SFT: it assumes there is **one correct answer** for every prompt. RL removes this constraint by letting the model **explore** and learn from a reward signal.

### The RL Training Loop (High Level)

```
1. Prompt   -->  Pick a question from the dataset
2. Generate -->  Sample one or more responses (ROLLOUTS - this chapter!)
3. Score    -->  Evaluate each response with a reward function (Chapter 3)
4. Learn    -->  Update model parameters (Chapters 4-5)
Repeat.
```

This chapter focuses on **Step 2**: how do we efficiently generate responses?

---

## Part 1: The Memory Problem

### Why can't we just load an 8B model and start generating?

Let's do the math for **Qwen3-8B-Instruct** (7.8 billion parameters):

| Component | Memory (FP16) | Memory (FP32) |
|---|---|---|
| Model weights | 8B x 2 bytes = **16 GB** | 8B x 4 bytes = 32 GB |
| Gradients | **16 GB** | 32 GB |
| Optimizer states (Adam) | **32 GB** (2x weights) | 64 GB |
| Activations | Variable (~2-8 GB) | Variable |
| **Total** | **~66 GB** | ~130 GB |

An RTX 3090 has **24 GB** of VRAM. Even just loading the weights in FP16 takes 16 GB, leaving only 8 GB for everything else.
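
The table's arithmetic can be reproduced in a few lines. This is a back-of-the-envelope sketch, not a profiler: `full_finetune_gb` is a hypothetical helper, it assumes 1 GB = 10^9 bytes, rounds the model to 8B parameters as the table does, and leaves activations out.

```python
def full_finetune_gb(n_params: float, bytes_per_param: int) -> dict:
    """Rough memory estimate for full fine-tuning with Adam (activations excluded)."""
    gb = 1e9  # assumption: decimal gigabytes, as in the table above
    weights = n_params * bytes_per_param / gb
    grads = weights      # one gradient per weight, same dtype
    adam = 2 * weights   # Adam keeps two moment estimates per weight
    return {
        "weights": weights,
        "grads": grads,
        "adam": adam,
        "total": weights + grads + adam,
    }


fp16 = full_finetune_gb(8e9, 2)
fp32 = full_finetune_gb(8e9, 4)
print(fp16)  # total is 64 GB before activations
print(fp32)  # total is 128 GB before activations
```

Adding the table's 2-8 GB of activations to the FP16 total gives the ~66 GB figure (or more), far beyond a 24 GB card.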

**Solution: QLoRA** (Quantized Low-Rank Adaptation) - a combination of two techniques that dramatically reduces memory.

---

## Part 2: 4-Bit Quantization

### The Idea

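Instead of storing each weight as a 16-bit float (2 bytes), store it as a **4-bit code** (0.5 bytes). For 8B parameters that is about 4 GB of raw weights; with the quantization constants and the layers that stay in higher precision, the loaded model takes roughly 5 GB instead of 16 GB.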

### NF4 (NormalFloat4) Quantization

The key insight: neural network weights follow an approximately **normal distribution**. NF4 exploits this by spacing its 16 quantization levels (4 bits = 2^4 = 16 values) to optimally cover the normal distribution, placing more levels near zero where most weights live.

```
Regular 4-bit:   [-8, -7, -6, ..., 0, ..., 6, 7]  (uniform spacing)
NF4:             [-1.0, -0.70, -0.53, ..., 0, ..., 0.56, 0.72, 1.0]  (normal-distribution-optimal, slightly asymmetric)
```
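
To make the idea concrete, here is a minimal sketch of NF4-style quantization for one small block of weights. The level values are the published NF4 table rounded to four decimals; the helper names `quantize_block` and `dequantize_block` are hypothetical, and the real bitsandbytes kernels additionally pack two codes per byte and work on fixed-size blocks, which this sketch does not attempt.

```python
# The 16 NF4 levels, rounded to 4 decimals.
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]


def quantize_block(weights):
    """Map each weight to the index (0..15) of its nearest NF4 level.

    Assumes the block contains at least one non-zero weight.
    """
    absmax = max(abs(w) for w in weights)  # per-block scale, stored with the codes
    codes = [
        min(range(16), key=lambda i: abs(w / absmax - NF4_LEVELS[i]))
        for w in weights
    ]
    return codes, absmax


def dequantize_block(codes, absmax):
    """Recover approximate weights from 4-bit codes and the block scale."""
    return [NF4_LEVELS[c] * absmax for c in codes]


block = [0.5, -0.25, 0.0, 0.1]
codes, absmax = quantize_block(block)
restored = dequantize_block(codes, absmax)
print(codes)     # each entry fits in 4 bits
print(restored)  # close to, but not exactly, the original block
```

The round trip is lossy: only the largest-magnitude weight and exact zeros come back unchanged, which is one reason the quantized weights are kept frozen rather than trained.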

### The critical rule: quantized weights are FROZEN

We never update the 4-bit weights during training. They serve as a compressed representation of the pretrained model's knowledge.

### Let's see this in code

Here's how `sample.py` configures 4-bit quantization:

In [None]:
import torch
from transformers import BitsAndBytesConfig

# This configuration tells HuggingFace Transformers to load the model in 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # Enable 4-bit quantization
    bnb_4bit_quant_type="nf4",            # Use NormalFloat4 (optimal for neural net weights)
    bnb_4bit_compute_dtype=torch.bfloat16, # Compute in bfloat16 during forward pass
)

print("Quantization config created.")
print(f"  load_in_4bit: {bnb_config.load_in_4bit}")
print(f"  quant_type: {bnb_config.bnb_4bit_quant_type}")
print(f"  compute_dtype: {bnb_config.bnb_4bit_compute_dtype}")

**Key detail: `bnb_4bit_compute_dtype=torch.bfloat16`**

While weights are *stored* in 4-bit, during the forward pass they are temporarily **dequantized** to bfloat16 for computation. This gives us the memory savings of 4-bit storage with the numerical precision of 16-bit computation.

```
Storage:      4-bit (NF4)    --> ~5 GB for 8B params
Computation:  bfloat16       --> accurate matrix multiplications
```

---

## Part 3: LoRA (Low-Rank Adaptation)

### The Idea

Since the base weights are frozen (4-bit, can't be updated), we need **something** trainable. LoRA injects small trainable matrices alongside the frozen weights.

### How it works

For a weight matrix **W** of size `(d x d)` in the transformer:

```
Original:  y = W @ x              (d x d matrix, millions of parameters)

With LoRA: y = W @ x + B @ A @ x  (W frozen, A is r x d, B is d x r)
```

Where `r` (the rank) is much smaller than `d`. For Qwen3-8B with hidden_size=4096:

```
Original W:        4096 x 4096 = 16,777,216 parameters
LoRA (r=64):  A =    64 x 4096  =    262,144 parameters
              B =  4096 x 64    =    262,144 parameters
              Total:                 524,288 parameters  (3.1% of original!)
```
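
The count above follows directly from the shapes, taking `A` as `r x d_in` (projects down to rank `r`) and `B` as `d_out x r` (projects back up). A small sketch, with `lora_param_count` as a hypothetical helper:

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


d, r = 4096, 64
full = d * d
lora = lora_param_count(d, d, r)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.1%}")

# The same formula covers non-square layers (sizes here are illustrative only)
print(lora_param_count(d_in=4096, d_out=1024, r=64))
```

Because the count is `r * (d_in + d_out)`, adapter size grows linearly with the rank rather than quadratically with the hidden size.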

### Why does this work?

Research has shown that the weight updates during fine-tuning tend to be **low-rank** - they don't need the full dimensionality of the weight matrix. LoRA exploits this by directly parameterizing the update as a low-rank matrix.

In [None]:
from peft import LoraConfig

# This configures which layers get LoRA adapters and how large they are
lora_config = LoraConfig(
    r=64,                # Rank of the low-rank matrices (higher = more expressive, more memory)
    lora_alpha=16,       # Scaling factor: adapter output is multiplied by alpha/r = 16/64 = 0.25
    target_modules=[     # Which layers to add LoRA adapters to:
        "q_proj",        #   Query projection in self-attention
        "v_proj",        #   Value projection in self-attention
        "k_proj",        #   Key projection in self-attention
        "o_proj",        #   Output projection in self-attention
    ],
    lora_dropout=0.05,   # Dropout on LoRA activations for regularization
    task_type="CAUSAL_LM",  # We're doing causal language modeling
)

print("LoRA config created.")
print(f"  Rank: {lora_config.r}")
print(f"  Alpha: {lora_config.lora_alpha}")
print(f"  Effective scaling: {lora_config.lora_alpha / lora_config.r}")
print(f"  Target modules: {lora_config.target_modules}")
print(f"  Dropout: {lora_config.lora_dropout}")

### Understanding `lora_alpha`

The actual LoRA forward pass is:

```
y = W @ x + (alpha / r) * B @ A @ x
```

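With `alpha=16, r=64`, the scaling factor is `16/64 = 0.25`, so the LoRA adapter's contribution is scaled down by 4x. The adapters do not disturb the pretrained model at the start of training regardless of this factor: in the standard LoRA initialization, `A` is random but `B` is all zeros, so `B @ A @ x = 0` at step 0. The scaling factor instead controls how strongly the adapter influences the output once `B` starts to move away from zero.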
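
A toy forward pass makes the scaling visible. This is a pure-Python sketch with made-up 3x3 weights, assuming the standard LoRA initialization in which `B` starts at zero (PEFT's default); `matvec` and `lora_forward` are hypothetical helpers, and `alpha=0.5, r=2` is chosen only so that `alpha / r` matches the 0.25 above.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha, r):
    """y = W @ x + (alpha / r) * B @ (A @ x), with W frozen."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))  # project down to r dims, then back up
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Toy sizes: d = 3, r = 2
W = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
A = [[0.5, -0.5, 1.0], [1.0, 1.0, 0.0]]            # r x d, random-like values
B_zero = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]      # d x r, zero at initialization
B_trained = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # pretend some training happened
x = [1.0, 2.0, 3.0]

y_init = lora_forward(W, A, B_zero, x, alpha=0.5, r=2)
y_trained = lora_forward(W, A, B_trained, x, alpha=0.5, r=2)
print(y_init)     # identical to W @ x
print(y_trained)  # base output plus 0.25 * (B @ A @ x)
```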

### Why target attention projections?

The attention mechanism is where the model learns **what to attend to** when generating responses. By adapting Q, K, V, and O projections, we modify:
- **Q (query)**: What the model is looking for
- **K (key)**: What information is available
- **V (value)**: What information is retrieved
- **O (output)**: How retrieved information is combined

This gives us a lot of leverage for changing the model's behavior with few parameters. Adapting the MLP layers as well is a common extension (see Exercise 3).

---

## Part 4: Putting It Together - `load_model_qlora()`

Now let's trace through the complete model loading function from `sample.py`:

In [None]:
# === Full walkthrough of load_model_qlora() ===
# (Run this cell only if you have a GPU with enough VRAM)

import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig


def load_model_qlora(model_name: str = "Qwen/Qwen3-8B-Instruct"):
    # STEP 1: Configure 4-bit quantization
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    # STEP 2: Load the pretrained model with quantization
    # device_map="auto" places layers across available GPUs automatically
    # trust_remote_code=True allows running Qwen's custom modeling code
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True,
    )

    # STEP 3: Load the tokenizer (handles text <-> token conversion)
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

    # STEP 4: Configure and attach LoRA adapters
    lora_config = LoraConfig(
        r=64,
        lora_alpha=16,
        target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )

    # get_peft_model wraps the original model, injecting LoRA layers
    # After this call:
    #   - Base weights: frozen, 4-bit (~5 GB)
    #   - LoRA adapters: trainable, float32 (~200 MB)
    model = get_peft_model(model, lora_config)

    # This prints something like:
    # "trainable params: 50,331,648 || all params: 7,865,956,352 || trainable%: 0.6398%"
    model.print_trainable_parameters()

    return model, tokenizer


# Uncomment to run (requires GPU):
# model, tokenizer = load_model_qlora()
print("Function defined. Uncomment the last line to load the model (requires GPU).")

### Memory breakdown after loading

```
Base model (4-bit NF4):     ~5 GB   (frozen, not trainable)
LoRA adapters (float32):    ~0.2 GB (trainable, ~50M params)
CUDA overhead:              ~0.5 GB
Total:                      ~5.7 GB

Remaining for training:     ~18.3 GB (on a 24GB GPU)
```

Compare this to full fine-tuning (~66 GB) or even regular LoRA without quantization (~16 GB just for the FP16 weights).
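
The same estimate can be written as a quick calculation. This sketch counts raw storage only; `qlora_footprint_gb` is a hypothetical helper and it assumes 1 GB = 10^9 bytes, 8B base parameters, and ~50M adapter parameters as above.

```python
def qlora_footprint_gb(n_base: float, n_lora: float) -> dict:
    """Raw storage for 4-bit base weights plus float32 LoRA adapters."""
    gb = 1e9  # assumption: decimal gigabytes
    base = n_base * 0.5 / gb  # 4 bits = 0.5 bytes per weight
    lora = n_lora * 4 / gb    # float32 = 4 bytes per adapter weight
    return {"base": base, "lora": lora, "total": base + lora}


est = qlora_footprint_gb(n_base=8e9, n_lora=50e6)
print(est)
print(f"headroom on a 24 GB card: {24 - est['total']:.1f} GB")
```

The raw count comes out lower than the ~5.7 GB in the breakdown. The gap is the CUDA overhead plus whatever lifts the base model from 4 GB to ~5 GB in practice, presumably quantization constants and layers kept in higher precision. Measuring with `torch.cuda.memory_allocated()` after loading is the reliable way to check on your own setup.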

---

## Part 5: Sampling Responses - Temperature & Top-p

### Why sampling matters for RL

SFT involves no sampling at all: the model is trained against fixed target responses. At evaluation time, we often use **greedy decoding** (always pick the most likely token). In RL, we need **diverse responses** to explore the space of possible answers. This is where temperature and top-p sampling come in.

### Temperature

Temperature scales the logits before applying softmax:

```
P(token_i) = exp(logit_i / T) / sum_j(exp(logit_j / T))
```

| Temperature | Effect | Use case |
|---|---|---|
| T = 0.1 | Nearly deterministic, almost always picks the top token | Evaluation |
| T = 0.7 | Balanced diversity | **RL training (our choice)** |
| T = 1.0 | Standard sampling | Creative writing |
| T > 1.0 | Very random, flattened distribution | Brainstorming |

### Top-p (Nucleus Sampling)

After applying temperature, sort tokens by probability in descending order. Keep only the smallest set of tokens whose cumulative probability reaches `p`, renormalize their probabilities, and sample from that set:

```
Example with top_p=0.9:
  Token A: 0.50  --> cumulative: 0.50 (included)
  Token B: 0.25  --> cumulative: 0.75 (included)
  Token C: 0.10  --> cumulative: 0.85 (included)
  Token D: 0.07  --> cumulative: 0.92 (included, crosses 0.9)
  Token E: 0.05  --> EXCLUDED
  Token F: 0.03  --> EXCLUDED
```

This prevents the model from occasionally selecting bizarre, low-probability tokens while still allowing diversity among reasonable choices.
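
The example above can be reproduced with a short filter. This is a sketch of the idea in plain Python, not the Hugging Face implementation; `top_p_filter` is a hypothetical helper.

```python
import random


def top_p_filter(probs: dict, p: float) -> dict:
    """Keep the smallest high-probability set whose mass reaches p, then renormalize."""
    kept, cumulative = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:  # the token that crosses the threshold is still kept
            break
    return {token: prob / cumulative for token, prob in kept.items()}


probs = {"A": 0.50, "B": 0.25, "C": 0.10, "D": 0.07, "E": 0.05, "F": 0.03}
nucleus = top_p_filter(probs, p=0.9)
print(nucleus)  # only A, B, C, D remain, rescaled to sum to 1

# Sample one token from the filtered distribution
token = random.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]
print(token)
```

Note that the nucleus size adapts to the distribution: a confident prediction may keep a single token, while a flat distribution keeps many.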

In [None]:
# === Visualizing temperature's effect on sampling ===

import torch
import torch.nn.functional as F

# Simulate logits for 5 tokens (as if the model predicted these)
logits = torch.tensor([2.0, 1.5, 0.5, -0.5, -1.0])
token_names = ["345", "350", "342", "1000", "banana"]

temperatures = [0.1, 0.7, 1.0, 2.0]

# One softmax per temperature; probs_by_temp[k][i] is P(token i) at temperatures[k]
probs_by_temp = [F.softmax(logits / t, dim=0) for t in temperatures]

print("Token probabilities at different temperatures:")
print(f"{'Token':<10}{'T=0.1':<10}{'T=0.7':<10}{'T=1.0':<10}{'T=2.0':<10}")
print("-" * 50)

# One row per token, one column per temperature
for i, name in enumerate(token_names):
    row = "".join(f"{p[i].item():<10.4f}" for p in probs_by_temp)
    print(f"{name:<10}{row}")

print("\nNotice how T=0.1 puts 99%+ on '345', while T=2.0 spreads probability more evenly.")
print("T=0.7 (our choice) keeps '345' most likely but gives '350' and '342' reasonable chances.")

---

## Part 6: The `sample_response()` Function - Line by Line

In [None]:
def sample_response(
    model,
    tokenizer,
    prompt: str,
    temperature: float = 0.7,
    max_new_tokens: int = 256,
) -> str:
    # STEP 1: Format the prompt using the model's chat template
    # Qwen3 expects a specific format:
    #   <|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n
    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,           # Return string, not token IDs
        add_generation_prompt=True  # Add the "assistant" prefix so model starts responding
    )

    # STEP 2: Tokenize and move to GPU
    inputs = tokenizer(text, return_tensors="pt").to(model.device)

    # STEP 3: Generate response tokens
    # torch.no_grad() because rollouts are INFERENCE ONLY
    # We don't need gradients during generation - only during the GRPO loss computation later
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,  # Cap response length
            temperature=temperature,         # 0.7 for balanced diversity
            do_sample=True,                  # Enable sampling (vs greedy)
            top_p=0.9,                       # Nucleus sampling
            pad_token_id=tokenizer.eos_token_id,  # Avoid padding warnings
        )

    # STEP 4: Decode only the NEW tokens (not the prompt)
    # outputs[0] contains [prompt_tokens..., response_tokens...]
    # inputs.input_ids.shape[1] gives us the length of the prompt
    # So outputs[0][prompt_length:] gives us just the response
    response = tokenizer.decode(
        outputs[0][inputs.input_ids.shape[1]:],
        skip_special_tokens=True
    )
    return response


print("sample_response() defined. See annotations above for line-by-line explanation.")

### The output slicing trick explained

```python
outputs[0][inputs.input_ids.shape[1]:]
```

This is important to understand. `model.generate()` returns the **full sequence** including the prompt:

```
outputs[0] = [<prompt tokens>, <generated tokens>]
              |--- shape[1] ---|--- new tokens ---|
```

By slicing from `shape[1]` onward, we extract only what the model generated.
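
The same slicing can be tried on plain lists, without a model. The token IDs below are made up for illustration.

```python
prompt_ids = [101, 7, 42]          # hypothetical prompt token IDs
new_ids = [9, 15, 2]               # hypothetical generated token IDs
output_ids = prompt_ids + new_ids  # generate() returns prompt and response together

prompt_len = len(prompt_ids)       # plays the role of inputs.input_ids.shape[1]
response_ids = output_ids[prompt_len:]
print(response_ids)                # only the generated part remains
```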

### Why `torch.no_grad()`?

During rollout generation, we're just collecting samples - we don't need gradients. The gradient computation happens later in the GRPO step (Chapter 5) when we compute log-probabilities of the generated responses. This saves significant memory during generation.

---

## Part 7: Batch Sampling - `sample_batch()`

For GRPO (Chapter 5), we need **multiple responses per prompt** to compute group-relative advantages.

In [None]:
def sample_batch(
    model,
    tokenizer,
    prompt: str,
    n: int = 4,               # Number of responses to generate (G in GRPO)
    temperature: float = 0.7,
) -> list[str]:
    # Generate n independent responses for the same prompt
    # Each call to sample_response uses stochastic sampling,
    # so the responses will usually differ (do_sample=True with temperature > 0)
    return [sample_response(model, tokenizer, prompt, temperature) for _ in range(n)]


# Example usage (without running):
# responses = sample_batch(model, tokenizer, "What is 15 * 23?", n=4)
# This would generate 4 different answers to the same question.
# Some might be correct (345), others might be wrong.
# GRPO uses this diversity to learn which responses are better.

print("sample_batch() generates n diverse responses for the same prompt.")
print("")
print("Example: 4 responses to 'What is 15 * 23?':")
print("  Response 1: 'Let me calculate... 15 * 23 = 345'         --> reward: +1.0")
print("  Response 2: '15 times 23 is 355'                         --> reward: -0.5")
print("  Response 3: 'The answer is 345.'                         --> reward: +1.0")
print("  Response 4: 'Hmm, I think it might be around 300'        --> reward: -0.5")
print("")
print("GRPO will increase the probability of responses 1 and 3,")
print("and decrease the probability of responses 2 and 4.")
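
As a preview of how these rewards become a training signal, here is a sketch of group-relative advantages for the four example responses. Chapter 5 defines the actual computation; this assumes the common formulation that centers each reward on the group mean and divides by the group standard deviation, and `group_advantages` is a hypothetical helper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Center rewards on the group mean and scale by the group standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


rewards = [1.0, -0.5, 1.0, -0.5]  # the four example responses above
advantages = group_advantages(rewards)
print(advantages)  # positive for responses 1 and 3, negative for 2 and 4
```

If all responses in a group receive the same reward, every advantage is zero and the group carries no learning signal, which is why diverse rollouts matter.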

---

## Part 8: Chat Templates - Why They Matter

The `apply_chat_template` call is easy to overlook but critical. Different models expect different formats:

In [None]:
# Demonstrating what a chat template does (conceptual - no model needed)

# What the user types:
user_prompt = "What is 15 * 23?"

# What the model actually sees after apply_chat_template (Qwen3 format).
# With only a user message there is no system block; one appears only if
# the messages list starts with a {"role": "system", ...} entry.
formatted = (
    "<|im_start|>user\n"
    f"{user_prompt}<|im_end|>\n"
    "<|im_start|>assistant\n"  # <-- This is added by add_generation_prompt=True
)

print("Raw user prompt:")
print(f"  '{user_prompt}'")
print()
print("After apply_chat_template:")
print(f"  '{formatted}'")
print()
print("The model generates tokens AFTER the 'assistant\\n' marker.")
print("Without proper formatting, the model's output quality usually degrades badly.")

---

## Part 9: The `__main__` Block - Running `sample.py` Standalone

In [None]:
# This is what happens when you run: python ch02_rollouts/sample.py

# if __name__ == "__main__":
#     print("Loading model with QLoRA...")
#     model, tokenizer = load_model_qlora()      # Load Qwen3-8B in 4-bit + LoRA
#
#     prompt = "What is 15 * 23?"
#     print(f"Prompt: {prompt}")
#
#     responses = sample_batch(model, tokenizer, prompt, n=4)  # Generate 4 responses
#     for i, r in enumerate(responses):
#         print(f"Response {i+1}: {r[:200]}...")   # Print first 200 chars of each

print("The standalone script:")
print("  1. Loads Qwen3-8B-Instruct with QLoRA (~5.7 GB VRAM)")
print("  2. Generates 4 diverse responses to '15 * 23'")
print("  3. Prints the first 200 characters of each")
print()
print("This is the foundation that Chapter 5 (GRPO) builds on.")
print("GRPO will call sample_batch() to generate groups of responses,")
print("then use reward signals to determine which are better.")

---

## Exercises

### Exercise 1: Temperature Exploration
Try generating responses with different temperatures (0.1, 0.3, 0.7, 1.0, 1.5) for the same math problem. How does diversity change? At what temperature do you start getting incorrect answers?

### Exercise 2: LoRA Rank Experiment
Change `r=64` to `r=8` and `r=128`. How does this affect:
- Number of trainable parameters?
- Memory usage?
- Quality of generations after a few training steps?

### Exercise 3: Target Modules
The code applies LoRA to Q, K, V, O projections. Try adding `"gate_proj"`, `"up_proj"`, `"down_proj"` (the MLP layers). Does this improve training at the cost of more memory?

### Exercise 4: Understand the Slicing
If the prompt tokenizes to 25 tokens and the model generates 50 new tokens, what is `outputs[0].shape`? What does `outputs[0][25:]` contain?

---

## Key Takeaways

1. **QLoRA = Quantization + LoRA**: Store base model in 4-bit (~5 GB), train tiny adapters in full precision (~0.2 GB). This is what makes single-GPU RL possible.

2. **Rollouts are inference-only**: We generate responses with `torch.no_grad()`. Gradients come later during the GRPO loss computation.

3. **Temperature controls exploration**: T=0.7 is our default for RL - enough diversity to discover good strategies, not so much that responses are random. Treat it as a starting point to tune, not a universal optimum.

4. **Top-p prevents catastrophic samples**: Even with temperature, nucleus sampling cuts off the low-probability tail, so truly bizarre tokens are rarely picked.

5. **Chat templates are essential**: The model was trained on a specific format. Using `apply_chat_template()` ensures we match it.

6. **Group sampling enables GRPO**: By generating multiple responses per prompt, we can compute relative advantages without a critic network.

---

**Next:** [Chapter 3 - Reward Signals](../ch03_rewards/learn_rewards.ipynb) - How do we score the responses we just generated?