# 🔄 Autoregression

In this notebook, we'll explore:
1. **Autoregression** - How models generate text by feeding outputs back as inputs
2. **Step-by-Step Generation** - Visualize the predict → append → predict loop

---

## 📦 Setup: Load Model and Libraries

In [14]:
# Install required packages (if needed)
!pip install torch transformers matplotlib plotly ipywidgets python-dotenv



In [15]:
import os
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM
import matplotlib.pyplot as plt
from IPython.display import display
from typing import List, Tuple
import warnings
from dotenv import load_dotenv
from huggingface_hub import login

load_dotenv(override=True)

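# Log in only when a token is available; login(token=None) would prompt interactively
hf_token = os.getenv("HF_TOKEN")
if hf_token:
    login(token=hf_token)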

warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-darkgrid')
%matplotlib inline




## 🤖 Load Language Model

In [None]:
print("🤖 Loading Language Model...")
print("=" * 50)

model_name = "meta-llama/Llama-3.2-1B"
print(f"Model: {model_name}")

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model.eval()

vocab_size = len(tokenizer)
print("Model loaded successfully!")
print(f"Vocabulary size: {vocab_size:,} tokens")


🤖 Loading Language Model...
Model: meta-llama/Llama-3.2-1B
Model loaded successfully!
Vocabulary size: 128,256 tokens


---

# Part 1: Understanding Autoregression

## 🔄 What is Autoregression?

**Autoregression** means the model uses its own previous outputs as inputs for the next prediction.

Think of it as a **self-feeding loop**:

```
Step 1: "The cat" → predict "sat" → context becomes "The cat sat"
Step 2: "The cat sat" → predict "on" → context becomes "The cat sat on"
Step 3: "The cat sat on" → predict "the" → context becomes "The cat sat on the"
Step 4: "The cat sat on the" → predict "mat" → DONE!
```

**Key Insight:** The model emits only ONE token per step. Each prediction is conditioned on everything before it, and nothing beyond that next token is committed in advance.
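
The loop above can be sketched without any neural network. The snippet below is a minimal illustration, assuming a hypothetical hand-written lookup table `TOY_NEXT_WORD` in place of a real model. It looks only at the last word, whereas a real language model conditions on the whole context.

```python
# Hypothetical stand-in for a language model: maps the last word to the next word.
TOY_NEXT_WORD = {
    "cat": "sat",
    "sat": "on",
    "on": "the",
    "the": "mat",
}


def toy_generate(context: str, num_steps: int) -> str:
    """Greedy autoregressive loop: predict one word, append it, repeat."""
    for _ in range(num_steps):
        last_word = context.split()[-1].lower()
        next_word = TOY_NEXT_WORD.get(last_word)
        if next_word is None:  # the toy table has no prediction, so stop early
            break
        context = f"{context} {next_word}"  # the output becomes part of the next input
    return context


print(toy_generate("The cat", num_steps=4))  # The cat sat on the mat
```

The only state carried between steps is the growing `context` string, which is the defining property of an autoregressive loop.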

---

## 🔧 Core Functions

In [17]:
def get_next_token_probabilities(
    context: str, 
    top_k: int = 10
) -> Tuple[List[str], List[float], torch.Tensor]:
    """
    Get the model's top-k next-token probabilities for a context.
    
    Args:
        context: Input text
        top_k: Number of top tokens to return
    
    Returns:
        tokens: List of top-k token strings
        probabilities: List of probabilities for those tokens
        full_probs: Full probability distribution (for analysis)
    """
    # Tokenize input
    input_ids = tokenizer.encode(context, return_tensors='pt')
    
    # Get model predictions (logits)
    with torch.no_grad():
        outputs = model(input_ids)
        logits = outputs.logits[0, -1, :]  # Last token's logits
    
    
    # Convert to probabilities
    probs = F.softmax(logits, dim=0)
    
    # Get top-k tokens
    top_probs, top_indices = torch.topk(probs, top_k)
    
    # Convert to readable tokens
    # Convert tensor indices to plain ints before decoding each token
    tokens = [tokenizer.decode([idx]) for idx in top_indices.tolist()]
    probabilities = top_probs.cpu().numpy().tolist()
    
    return tokens, probabilities, probs
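
The two numerical steps in the function above, softmax followed by top-k selection, can also be written in plain Python. This is a minimal sketch for intuition only: the four-token vocabulary and the logit values are made up, and the helper names `softmax` and `top_k` are illustrative rather than part of any library.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs: List[float], vocab: List[str], k: int) -> List[Tuple[str, float]]:
    """Return the k most probable (token, probability) pairs, highest first."""
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical logits for a four-token vocabulary (made-up numbers).
vocab = [" on", " in", " down", " up"]
logits = [3.0, 1.0, 0.0, -1.0]

probs = softmax(logits)
for token, prob in top_k(probs, vocab, k=2):
    print(f"{token!r}: {prob:.4f}")
```

Softmax preserves the ordering of the logits, so the token with the largest logit is always the first entry returned by `top_k`.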


---

# Visualize Step-by-Step Autoregression

Let's watch the model generate text **token by token** and see the probabilities at each step!


In [20]:
def visualize_autoregressive_generation(
    initial_context: str,
    num_steps: int = 5,
    top_k: int = 10
):
    """
    Visualize step-by-step autoregressive generation.
    Shows how context grows and probabilities shift at each step.
    """
    context = initial_context
    
    print("🔄 AUTOREGRESSIVE GENERATION")
    print("=" * 80)
    print(f"Starting Context: \"{context}\"")
    print(f"Generating {num_steps} tokens...\n")
    
    for step in range(1, num_steps + 1):
        print("─" * 80)
        print(f"STEP {step}")
        print("─" * 80)
        
        # Get probabilities
        tokens, probs, _ = get_next_token_probabilities(
            context, top_k=top_k
        )
        
        # Pick the top token (greedy)
        next_token = tokens[0]
        next_prob = probs[0]
        
        # Show current context
        print(f"Current Context: \"{context}\"")
        print(f"\nTop {top_k} Next Token Predictions:")
        
        # Only the first five candidates are printed to keep the output short
        for i, (token, prob) in enumerate(zip(tokens[:5], probs[:5]), 1):
            bar = '█' * int(prob * 50)
            print(f"  {i}. \"{token}\" {bar} {prob:.4f} ({prob*100:.2f}%)")
        
        # Select and append
        print(f"\nSELECTED: \"{next_token}\" (probability: {next_prob:.4f})")
        
        # Update context (autoregression!)
        context = context + next_token
        
        print(f"🔄 NEW Context: \"{context}\"")
        print()
    
    print("=" * 80)
    print("GENERATION COMPLETE!")
    print(f"\nFinal Output: \"{context}\"")
    print("\nNotice: Each prediction used ALL previous tokens as context!")


### 🧪 Demo: Watch Autoregression in Action!


In [21]:
# Demo 1: Generate from "The cat sat"
visualize_autoregressive_generation(
    initial_context="The cat sat",
    num_steps=4,
    top_k=10
)


🔄 AUTOREGRESSIVE GENERATION
Starting Context: "The cat sat"
Generating 4 tokens...

────────────────────────────────────────────────────────────────────────────────
STEP 1
────────────────────────────────────────────────────────────────────────────────
Current Context: "The cat sat"

Top 10 Next Token Predictions:
  1. " on" ███████████████████████████████████ 0.7178 (71.78%)
  2. " in" █████ 0.1100 (11.00%)
  3. " down" █ 0.0275 (2.75%)
  4. " up"  0.0179 (1.79%)
  5. " next"  0.0137 (1.37%)

SELECTED: " on" (probability: 0.7178)
🔄 NEW Context: "The cat sat on"

──────

---

# 🎬 Summary: Key Takeaways from Video 2



## 🔄 Autoregression

![auto-regression](auto-regressive.png)


1. **Self-Feeding Loop**: Each output becomes the next input
2. **No Planning Ahead**: Model only predicts ONE token at a time
3. **Context Grows**: Every prediction adds to the context window
4. **Fragile Process**: A wrong token stays in the context, so one early mistake can derail everything generated after it
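
The fourth point can be illustrated with a toy example. The sketch below assumes a hypothetical lookup table `TOY_NEXT_WORD` standing in for a model (it is not real model output) and compares the normal continuation with one where the first generated word is forced to a different choice through the illustrative `forced_first` parameter.

```python
from typing import List, Optional

# Hypothetical next-word table standing in for a model (not real model output).
TOY_NEXT_WORD = {
    "sat": "on",
    "on": "the",
    "the": "mat",
    "in": "a",
    "a": "box",
}


def continue_from(
    words: List[str],
    num_steps: int,
    forced_first: Optional[str] = None,
) -> List[str]:
    """Extend `words` one word at a time, optionally overriding the first prediction."""
    words = list(words)  # copy so the caller's list is not modified
    for step in range(num_steps):
        if step == 0 and forced_first is not None:
            next_word = forced_first  # simulate a single "wrong" choice
        else:
            next_word = TOY_NEXT_WORD.get(words[-1])
            if next_word is None:  # no prediction available, stop early
                break
        words.append(next_word)
    return words


greedy = continue_from(["The", "cat", "sat"], num_steps=3)
perturbed = continue_from(["The", "cat", "sat"], num_steps=3, forced_first="in")
print(" ".join(greedy))     # The cat sat on the mat
print(" ".join(perturbed))  # The cat sat in a box
```

Only one token was changed by hand, yet every later token differs, because each later prediction is conditioned on the altered context.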


## 💡 Connection to Fine-Tuning

- Fine-tuning doesn't change HOW the model generates (still autoregressive)
- It changes WHAT patterns the model follows
- Temperature is a generation-time control you can use with any model
- Understanding autoregression helps you understand why context matters so much in fine-tuning data!
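
Temperature is mentioned above but not demonstrated in this notebook. As a rough sketch of the common formulation, the logits are divided by the temperature before softmax. The helper name `softmax_with_temperature` and the logit values below are assumptions made for illustration.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Divide logits by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Temperatures below 1 concentrate probability on the top token, while temperatures above 1 flatten the distribution. The model's weights are untouched, which is why this works as a generation-time control.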

---
