# Putting the LLM Pipeline Together: Step by Step

In this notebook, we'll walk through the complete process of text generation with a local LLM, examining each step in detail.

## Learning Objectives

By the end of this notebook, you will be able to:
- Understand the complete text generation pipeline from input to output
- Trace how text flows through tokenization, model processing, and decoding
- Examine model predictions and probability distributions
- Compare different token selection strategies (greedy vs. sampling)
- Adjust generation parameters to control output quality

## The Pipeline Overview

**Input text → Tokenization → Token IDs → Model processing → Logits → Token selection → Output text**

## Setup: Load Model and Tokenizer

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import numpy as np
import os

# Set the directory where we'll save the model
save_directory = "./downloaded_model"
model_name = "distilgpt2"

# Check if model already exists locally
if os.path.exists(save_directory) and os.listdir(save_directory):
    print(f"✓ Model already exists in {save_directory}")
    print("  Loading from local directory...")
    tokenizer = AutoTokenizer.from_pretrained(save_directory)
    model = AutoModelForCausalLM.from_pretrained(save_directory)
else:
    os.makedirs(save_directory, exist_ok=True)
    print(f"Downloading {model_name} from Hugging Face Hub...")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    
    print(f"Saving model to {save_directory}...")
    model.save_pretrained(save_directory)
    tokenizer.save_pretrained(save_directory)

# Configure tokenizer
tokenizer.pad_token = tokenizer.eos_token

print("✓ Model and tokenizer ready!")

✓ Model already exists in ./downloaded_model
  Loading from local directory...
✓ Model and tokenizer ready!


## Step 1: Input Text

Let's start with a simple prompt that we'll use throughout this demonstration:

In [2]:
# Our starting prompt
prompt = "Artificial intelligence (الذكاء الاصطناعي) is transforming"
print(f"Input prompt: '{prompt}'")

Input prompt: 'Artificial intelligence (الذكاء الاصطناعي) is transforming'


## Step 2: Tokenization - Breaking Text into Pieces

The tokenizer breaks our text into smaller units (tokens) that the model can understand:

In [3]:
# Tokenize the input
tokens = tokenizer.tokenize(prompt)

print(f"Tokenization result ({len(tokens)} tokens):\n")
for i, token in enumerate(tokens, 1):
    print(f"  Token {i}: {repr(token)}")

Tokenization result (23 tokens):

  Token 1: 'Art'
  Token 2: 'ificial'
  Token 3: 'Ġintelligence'
  Token 4: 'Ġ('
  Token 5: 'Ø§ÙĦ'
  Token 6: 'Ø'
  Token 7: '°'
  Token 8: 'Ù'
  Token 9: 'ĥ'
  Token 10: 'Ø§Ø'
  Token 11: '¡'
  Token 12: 'ĠØ§ÙĦ'
  Token 13: 'Ø§Ø'
  Token 14: 'µ'
  Token 15: 'Ø'
  Token 16: '·'
  Token 17: 'ÙĨ'
  Token 18: 'Ø§Ø'
  Token 19: '¹'
  Token 20: 'ÙĬ'
  Token 21: ')'
  Token 22: 'Ġis'
  Token 23: 'Ġtransforming'


### Understanding Tokenization

The tokenizer has split our input text into tokens. Key observations:

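- **'Ġ' prefix** - Marks a token that begins with a space (GPT-2's way of encoding spaces)
- **Subword splitting** - Common words like " transforming" stay as single tokens, while "Artificial" is split into 'Art' + 'ificial'
- **Vocabulary efficiency** - Rare words and non-English text are split into many small pieces: the Arabic phrase alone takes 16 of the 23 tokens, some of them single bytes of a UTF-8 character (shown as symbols like 'Ø' and '°')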
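
The odd-looking symbols in the Arabic portion come from GPT-2's byte-level encoding, where each UTF-8 byte is displayed as a printable stand-in character. The sketch below rebuilds that byte-to-character table in plain Python to illustrate the idea. The function names are our own, and the table is a reconstruction of the mapping rather than a call into the `transformers` library.

```python
def byte_to_char_table():
    """Map every byte value (0-255) to a printable stand-in character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    table = {b: chr(b) for b in printable}
    # Remaining bytes (space, control bytes, ...) are shifted past 255.
    offset = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + offset)
            offset += 1
    return table


def show_bytes(text):
    """Render text the way byte-level tokens are displayed."""
    table = byte_to_char_table()
    return "".join(table[b] for b in text.encode("utf-8"))


print(show_bytes(" is"))  # the leading space becomes 'Ġ'
print(show_bytes("ذ"))    # one Arabic letter is two UTF-8 bytes: 'Ø' and '°'
```

This matches tokens 6 and 7 in the output above: the single letter ذ was split into its two bytes because no merged token for it was available at that point.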

## Step 3: Converting Tokens to IDs

Next, each token is converted to its corresponding numeric ID from the vocabulary:

In [4]:
# Convert the prompt to IDs once, in the tensor format the model expects
model_input_ids = tokenizer.encode(prompt, return_tensors="pt")
input_ids = model_input_ids[0].tolist()

print("Token → ID conversion:\n")
for token, token_id in zip(tokens, input_ids):
    print(f"  {repr(token):20} → {token_id}")

# Show the tensor that will be input to the model
print(f"\nModel input tensor shape: {model_input_ids.shape}")
print(f"Model input tensor:\n  {model_input_ids}")

Token → ID conversion:

  'Art'                → 8001
  'ificial'            → 9542
  'Ġintelligence'      → 4430
  'Ġ('                 → 357
  'Ø§ÙĦ'               → 23525
  'Ø'                  → 148
  '°'                  → 108
  'Ù'                  → 149
  'ĥ'                  → 225
  'Ø§Ø'                → 34247
  '¡'                  → 94
  'ĠØ§ÙĦ'              → 28981
  'Ø§Ø'                → 34247
  'µ'                  → 113
  'Ø'                  → 148
  '·'                  → 115
  'ÙĨ'                 → 23338
  'Ø§Ø'                → 34247
  '¹'                  → 117
  'ÙĬ'                 → 22654
  ')'                  → 8
  'Ġis'                → 318
  'Ġtransforming'      → 25449

Model input tensor shape: torch.Size([1, 23])
Model input tensor:
  tensor([[ 8001,  9542,  4430,   357, 23525,   148,   108,   149,   225, 34247,
            94, 28981, 34247,   113,   148,   115, 23338, 34247,   117, 22654,
             8,   318, 25449]])


### Understanding Token IDs

Each token has been converted to a numeric ID from the model's vocabulary:
- The model doesn't process text directly - only these numeric IDs
- IDs are formatted as a PyTorch tensor with shape `[batch_size, sequence_length]`
- In our case: `[1, 23]` means 1 sequence with 23 tokens
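
As a rough illustration of the lookup and the batch dimension, the sketch below uses a tiny hand-written vocabulary and plain nested lists in place of a PyTorch tensor. `toy_vocab` is a stand-in holding five entries copied from the output above, not the real 50,257-entry table.

```python
# Five token -> ID pairs copied from the output above; the real vocabulary
# is far larger.
toy_vocab = {
    "Art": 8001,
    "ificial": 9542,
    "Ġintelligence": 4430,
    "Ġis": 318,
    "Ġtransforming": 25449,
}

toy_tokens = ["Art", "ificial", "Ġintelligence", "Ġis", "Ġtransforming"]
toy_ids = [toy_vocab[t] for t in toy_tokens]

# Wrapping the sequence in an outer list adds the batch dimension.
batch = [toy_ids]
shape = (len(batch), len(batch[0]))
print(shape)  # (1, 5): one sequence of five tokens
```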

## Step 4: Model Processing

Now the model processes these IDs through its neural network layers:

In [5]:
# Run the model on our input
with torch.no_grad():  # Disable gradient calculation for inference
    outputs = model(model_input_ids)

# The model outputs logits (raw, unnormalized scores) for each possible next token
logits = outputs.logits

print(f"Output logits shape: {logits.shape}\n")
print(f"  Batch size: {logits.shape[0]}")
print(f"  Sequence length: {logits.shape[1]} (predictions for each position)")
print(f"  Vocabulary size: {logits.shape[2]:,} (scores for each possible token)")

Output logits shape: torch.Size([1, 23, 50257])

  Batch size: 1
  Sequence length: 23 (predictions for each position)
  Vocabulary size: 50,257 (scores for each possible token)


### Understanding Model Processing

Inside the model, here's the processing pipeline:

1. **Embedding layer** - Converts each token ID into a dense vector representation
2. **Position embeddings** - Adds information about each token's position in the sequence
3. **Transformer layers** - Process the sequence through multiple attention and feed-forward layers:
   - **Self-attention** - Lets each position draw on information from the tokens that came before it
   - **Feed-forward networks** - Process the attended information
4. **Output layer** - Generates scores (logits) for every possible next token

The output is a tensor of logits - raw scores for each of the 50,257 tokens in the vocabulary. Higher scores indicate the model thinks that token is more likely to come next.
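
To make the shapes concrete, here is a toy forward pass in plain Python with made-up sizes and random weights. It covers only the embedding, position, and output steps; the transformer layers are skipped entirely, and reusing the token embedding matrix as the output projection is a simplifying assumption of this sketch. All names (`toy_forward`, `token_emb`, `pos_emb`) are hypothetical.

```python
import random

random.seed(0)  # fixed seed so the toy numbers are reproducible

VOCAB_SIZE, HIDDEN, MAX_POS = 6, 4, 8  # tiny made-up sizes


def random_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


token_emb = random_matrix(VOCAB_SIZE, HIDDEN)  # one vector per vocabulary entry
pos_emb = random_matrix(MAX_POS, HIDDEN)       # one vector per position


def toy_forward(ids):
    # Steps 1-2: look up each token's vector and add its position vector.
    states = [
        [t + p for t, p in zip(token_emb[tok], pos_emb[pos])]
        for pos, tok in enumerate(ids)
    ]
    # Step 3 (attention and feed-forward layers) is skipped in this toy.
    # Step 4: score every vocabulary entry at every position.
    return [
        [sum(s * w for s, w in zip(state, row)) for row in token_emb]
        for state in states
    ]


toy_logits = toy_forward([2, 5, 1])
print(len(toy_logits), len(toy_logits[0]))  # 3 6
```

The result has one row of scores per input position and one column per vocabulary entry, the same layout as the `[1, 23, 50257]` tensor above without the batch dimension.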

## Step 5: Next Token Prediction

Now let's look at the model's prediction for the next token after our prompt:

In [6]:
# We want the predictions for the last position (after "transforming")
next_token_logits = logits[0, -1, :]

# Convert logits to probabilities using softmax
next_token_probs = torch.softmax(next_token_logits, dim=0)

# Get the top 10 most likely tokens
top_k = 10
topk_probs, topk_indices = torch.topk(next_token_probs, top_k)

# Convert to numpy for easier handling
topk_probs = topk_probs.detach().numpy()
topk_indices = topk_indices.detach().numpy()

# Decode the tokens
topk_tokens = [tokenizer.decode([idx]) for idx in topk_indices]

print("Top 10 predictions for next token:\n")
print(f"{'Rank':<6} {'Token':<15} {'ID':<8} {'Probability':<12}")
print("=" * 50)
for i in range(top_k):
    print(f"{i+1:<6} {repr(topk_tokens[i]):<15} {topk_indices[i]:<8} {topk_probs[i]*100:>6.2f}%")

Top 10 predictions for next token:

Rank   Token           ID       Probability 
1      ' the'          262       28.45%
2      ' human'        1692       4.48%
3      ' our'          674        3.55%
4      ' into'         656        2.90%
5      ' people'       661        2.79%
6      ' society'      3592       2.19%
7      ' a'            257        1.80%
8      ' itself'       2346       1.74%
9      ' technology'   3037       1.42%
10     ' humans'       5384       1.32%


### Understanding Next Token Prediction

The model has analyzed our prompt (ending in "is transforming") and predicted what comes next:

1. **Raw logits** - The model outputs raw scores for every token in the vocabulary
2. **Softmax conversion** - We convert logits to probabilities (they sum to 100%)
3. **Top-k selection** - We examine only the most likely candidates

Notice how the top token (" the") has ~28% probability - the model is fairly confident but not certain. This probability distribution reflects patterns the model learned from its training data.
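
The softmax and top-k steps are simple enough to write by hand. The following is a minimal plain-Python sketch on a made-up four-token vocabulary; `softmax` and `top_k` are our own helper names, not the PyTorch functions used in the cell above.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    # Subtracting the max keeps exp() from overflowing; it does not change the result.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, k):
    """Return (index, probability) pairs for the k largest probabilities."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


example_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
example_probs = softmax(example_logits)
for idx, p in top_k(example_probs, 2):
    print(f"token {idx}: {p * 100:.2f}%")
```

Softmax keeps the ranking of the logits intact, so the token with the highest logit is always the token with the highest probability.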

## Step 6: Token Selection

Now we need to select which token to use next. Let's look at different ways to do this:

In [7]:
# Method 1: Greedy selection (always pick the most likely token)
greedy_index = torch.argmax(next_token_probs).item()
greedy_token = tokenizer.decode([greedy_index])

# Method 2: Temperature sampling (adjust probability distribution)
temperature = 0.7  # Lower = more deterministic, Higher = more random
temp_logits = next_token_logits / temperature
temp_probs = torch.softmax(temp_logits, dim=0)

# Method 3: Top-k sampling (sample from k most likely tokens)
k = 5
topk_temp_probs, topk_temp_indices = torch.topk(temp_probs, k)
topk_temp_probs = topk_temp_probs / topk_temp_probs.sum()  # Renormalize

# Sample using temperature + top-k
sample_index = np.random.choice(topk_temp_indices.detach().numpy(), 
                                p=topk_temp_probs.detach().numpy())
sample_token = tokenizer.decode([sample_index])

print("Token Selection Comparison:\n")
print(f"  Greedy: {repr(greedy_token):<15} (always picks most likely)")
print(f"  Sampled: {repr(sample_token):<15} (random selection from top candidates)\n")

# Show the top-k tokens with temperature adjustment
print(f"Top-{k} tokens with temperature={temperature}:\n")
print(f"{'Token':<15} {'Original %':<13} {'Adjusted %':<13}")
print("=" * 45)
for i in range(k):
    token_id = topk_temp_indices[i].item()
    token_text = tokenizer.decode([token_id])
    orig_prob = next_token_probs[token_id].item() * 100
    adj_prob = topk_temp_probs[i].item() * 100
    print(f"{repr(token_text):<15} {orig_prob:>6.2f}%       {adj_prob:>6.2f}%")

Token Selection Comparison:

  Greedy: ' the'          (always picks most likely)
  Sampled: ' the'          (random selection from top candidates)

Top-5 tokens with temperature=0.7:

Token           Original %    Adjusted %   
' the'           28.45%        83.55%
' human'          4.48%         5.96%
' our'            3.55%         4.28%
' into'           2.90%         3.20%
' people'         2.79%         3.02%


### Understanding Token Selection Strategies

Two main approaches for selecting the next token:

**1. Greedy Selection**
- Always picks the token with highest probability
- Deterministic - same input always produces same output
- Can be repetitive and boring

**2. Sampling with Temperature and Top-k**
- Introduces controlled randomness
- **Temperature** - Divides the logits before softmax, which sharpens or flattens the distribution:
  - Lower (< 1.0) - Makes high-probability tokens even more likely
  - Higher (> 1.0) - Flattens distribution, makes lower-probability tokens more likely
- **Top-k** - Only considers the k most likely tokens
- **Result** - More varied and interesting outputs

The temperature adjustment reshapes the probability distribution before sampling, giving us control over the creativity vs. consistency tradeoff.
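
As a worked check, the "Adjusted %" column in the table above can be approximately reproduced from the "Original %" column alone. The sketch below assumes that taking the log of each probability is an acceptable substitute for the raw logits (softmax is unaffected by a constant shift), and that restricting the calculation to the five listed tokens is equivalent to top-k filtering with k=5. `apply_temperature` is our own helper name.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Divide logits by the temperature, then apply softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# "Original %" column from the table above (rounded to two decimals there).
original = {" the": 28.45, " human": 4.48, " our": 3.55, " into": 2.90, " people": 2.79}

names = list(original)
log_probs = [math.log(original[n] / 100) for n in names]

# Softmax over just these five tokens also renormalizes them to sum to 1.
adjusted = apply_temperature(log_probs, 0.7)
for name, p in zip(names, adjusted):
    print(f"{name!r:<12} {original[name]:>6.2f}%  ->  {p * 100:>6.2f}%")

# Sample one token from the adjusted top-5 distribution.
random.seed(0)
choice = random.choices(names, weights=adjusted, k=1)[0]
print("Sampled:", repr(choice))
```

The printed values should land within rounding error of the table, with " the" at roughly 83.5%. Raising the temperature above 1.0 in the same function moves probability away from " the" and toward the other four candidates.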

## Step 7: Building the Response

Now we'll see the complete text generation process in action, adding one token at a time:

In [8]:
def generate_step_by_step(prompt, max_new_tokens=5, temperature=0.7, top_k=5):
    """Generate text token by token with detailed output at each step"""
    current_text = prompt
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    
    print(f"Starting prompt: '{prompt}'\n")
    print("=" * 80)
    
    for i in range(max_new_tokens):
        print(f"\nStep {i+1}: Generating token {i+1}/{max_new_tokens}")
        print("-" * 80)
        
        # Get model predictions
        with torch.no_grad():
            outputs = model(input_ids)
        
        # Get next token logits
        next_token_logits = outputs.logits[0, -1, :]
        
        # Apply temperature
        next_token_logits = next_token_logits / temperature
        
        # Get top-k token indices and probabilities
        topk_probs, topk_indices = torch.topk(torch.softmax(next_token_logits, dim=0), top_k)
        
        # Display top candidates
        print(f"\nTop {top_k} candidates:")
        for j in range(top_k):
            token_id = topk_indices[j].item()
            token_text = tokenizer.decode([token_id])
            token_prob = topk_probs[j].item() * 100
            print(f"  {j+1}. {repr(token_text):<15} (Probability: {token_prob:>6.2f}%)")
        
        # Renormalize probabilities for top-k
        topk_probs = topk_probs / topk_probs.sum()
        
        # Sample from top-k
        chosen_idx = np.random.choice(topk_indices.detach().numpy(), 
                                      p=topk_probs.detach().numpy())
        chosen_token = tokenizer.decode([chosen_idx])
        
        print(f"\n✓ Selected: {repr(chosen_token)}")
        
        # Update for next iteration
        next_token = torch.tensor([[chosen_idx]])
        input_ids = torch.cat([input_ids, next_token], dim=1)
        current_text += chosen_token
        
        print(f"  Text so far: '{current_text}'")
    
    print("\n" + "=" * 80)
    print(f"✓ Final text: '{current_text}'")
    return current_text

# Generate text step by step
final_text = generate_step_by_step(prompt, max_new_tokens=5, temperature=0.7, top_k=5)

Starting prompt: 'Artificial intelligence (الذكاء الاصطناعي) is transforming'


Step 1: Generating token 1/5
--------------------------------------------------------------------------------

Top 5 candidates:
  1. ' the'          (Probability:  68.04%)
  2. ' human'        (Probability:   4.85%)
  3. ' our'          (Probability:   3.48%)
  4. ' into'         (Probability:   2.60%)
  5. ' people'       (Probability:   2.46%)

✓ Selected: ' the'
  Text so far: 'Artificial intelligence (الذكاء الاصطناعي) is transforming the'

Step 2: Generating token 2/5
--------------------------------------------------------------------------------

Top 5 candidates:
  1. ' world'        (Probability:  56.97%)
  2. ' way'          (Probability:   9.03%)
  3. ' lives'        (Probability:   6.11%)
  4. ' human'        (Probability:   3.69%)
  5. ' entire'       (Probability:   1.63%)

✓ Selected: ' human'
  Text so far: 'Artificial intelligence (الذكاء الاصطناعي) is transforming the human'

Step 3: Gene

### Understanding the Auto-regressive Process

What we just witnessed is **auto-regressive generation** - the key process behind LLM text generation:

1. **Start with prompt** - Begin with the initial text
2. **For each new token:**
   - Process all previous tokens through the model
   - Get probability distribution for next token
   - Apply temperature and top-k filtering
   - Sample a token from the filtered distribution
   - Append the selected token to our text
3. **Repeat** - Continue until reaching desired length

**Key insight:** Each new token depends on *all* the tokens that came before it. This is why the model can maintain context and coherence throughout the generated text.
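
The loop structure can be isolated from the model itself. In this sketch, `toy_model` is a hypothetical stand-in that scores a five-entry vocabulary from the sum of all IDs seen so far, so each prediction depends on the whole sequence. Greedy selection is used instead of sampling to keep the output reproducible.

```python
VOCAB_SIZE = 5  # made-up vocabulary size for the toy


def toy_model(ids):
    """Hypothetical stand-in for the LLM: one score per vocabulary entry.

    The favoured next ID depends on every ID in the sequence so far.
    """
    favourite = sum(ids) % VOCAB_SIZE
    return [3.0 if i == favourite else 0.0 for i in range(VOCAB_SIZE)]


def generate_greedy(prompt_ids, max_new_tokens):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_model(ids)  # process the full sequence so far
        next_id = max(range(len(logits)), key=logits.__getitem__)  # pick the top score
        ids.append(next_id)  # append, then repeat with the longer sequence
    return ids


print(generate_greedy([0, 1], max_new_tokens=4))  # [0, 1, 1, 2, 4, 3]
```

Swapping the `max(...)` line for a temperature and top-k sampling step would give the same loop as `generate_step_by_step` above.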

## The Effect of Generation Parameters

Different parameters can dramatically change the output. Let's experiment with a few:

In [9]:
def generate_with_params(prompt, max_new_tokens=15, **params):
    """Generate text with specified parameters"""
    # Encode with attention mask
    inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=True)
    
    # Generate output
    output_ids = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
        **params
    )
    
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Different parameter configurations to try
configs = [
    {'name': 'Greedy (deterministic)', 
     'params': {'do_sample': False}},
    
    {'name': 'Low temperature (focused)', 
     'params': {'temperature': 0.3, 'do_sample': True}},
    
    {'name': 'High temperature (creative)', 
     'params': {'temperature': 1.5, 'do_sample': True}},
    
    {'name': 'Top-k sampling', 
     'params': {'top_k': 5, 'do_sample': True}},
    
    {'name': 'Top-p / Nucleus sampling', 
     'params': {'top_p': 0.9, 'do_sample': True}},
    
    {'name': 'Balanced (recommended)', 
     'params': {'temperature': 0.7, 'top_k': 50, 'top_p': 0.9, 'do_sample': True}}
]

# Generate and display results
print("Comparing Generation Parameters:\n")
print("=" * 80)

for config in configs:
    output = generate_with_params(prompt, **config['params'])
    generated_part = output[len(prompt):]
    
    print(f"\n{config['name']}")
    print(f"Parameters: {config['params']}")
    # Display with visible newlines to show what the model actually generated
    print(f"Generated: {repr(generated_part)}")
    print("-" * 80)

print("\nNote: Newline characters (\\n) are tokens the model can generate.")
print("Lower temperatures and greedy decoding may generate more newlines as they")
print("follow the most common patterns seen in training data (like paragraph breaks).")

Comparing Generation Parameters:


Greedy (deterministic)
Parameters: {'do_sample': False}
Generated: ' the world into a world where people can be educated, educated, and educated'
--------------------------------------------------------------------------------

Low temperature (focused)
Parameters: {'temperature': 0.3, 'do_sample': True}
Generated: ' the world into a virtual reality world.\n\n\n\n\n\n\n'
--------------------------------------------------------------------------------

High temperature (creative)
Parameters: {'temperature': 1.5, 'do_sample': True}
Generated: " intelligence with new skills. It's the next generation from this generation from its"
--------------------------------------------------------------------------------

Top-k sampling
Parameters: {'top_k': 5, 'do_sample': True}
Generated: ' the entire world into a virtual reality.\nThe artificial intelligence (AI),'
--------------------------------------------------------------------------------

Top-p / Nucleus s

### Generation Parameters Explained

**do_sample** (True/False)
- `False` - Greedy decoding (always picks most likely token)
- `True` - Samples from probability distribution (introduces randomness)

**temperature** (0.1 to 2.0+)
- Lower (0.3) - More focused, deterministic, repetitive
- Default (1.0) - Original probability distribution
- Higher (1.5) - More random, creative, potentially incoherent

**top_k** (integer, e.g., 5-100)
- Limits selection to k most likely tokens
- Lower k - More focused on top choices
- Higher k - Allows more diversity

**top_p / nucleus sampling** (0.0 to 1.0)
- Selects from smallest set of tokens whose cumulative probability ≥ p
- Adapts based on model confidence
- Common values: 0.9 or 0.95

**Recommended:** Combine temperature (0.7-0.8) + top_k (40-50) + top_p (0.9) for balanced, high-quality generation.
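
Top-p is the least intuitive of these parameters, so here is a minimal plain-Python sketch of the filtering rule described above. `top_p_filter` is our own name, and this illustrates the idea only; the implementation inside `model.generate` may differ in details such as how ties are handled.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability is >= p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for idx, prob in ranked:
        kept.append((idx, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize the surviving tokens so they sum to 1 again.
    total = sum(prob for _, prob in kept)
    return [(idx, prob / total) for idx, prob in kept]


confident = [0.92, 0.04, 0.02, 0.02]  # made-up: model is sure of one token
uncertain = [0.30, 0.25, 0.25, 0.20]  # made-up: model is torn between four

print(len(top_p_filter(confident, 0.9)))  # 1 token survives
print(len(top_p_filter(uncertain, 0.9)))  # all 4 tokens survive
```

This is what "adapts based on model confidence" means in practice: the same `p` keeps one candidate when the distribution is peaked and many when it is flat, whereas top-k always keeps exactly k.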

## The Complete LLM Text Generation Pipeline

Let's summarize the entire pipeline we've explored:

### Pipeline Steps

1. **Input Text** - Start with a text prompt
   
2. **Tokenization** - Break text into tokens (words, subwords, or characters)
   
3. **Token → ID Conversion** - Map each token to its vocabulary ID
   
4. **Model Processing** - Process IDs through neural network:
   - Embedding lookup (convert IDs to vectors)
   - Add position information
   - Multiple transformer layers (attention + feed-forward)
   - Output layer generates logits
   
5. **Next Token Prediction** - Convert logits to probabilities via softmax
   
6. **Token Selection** - Choose next token:
   - Greedy: Pick most likely
   - Sampling: Random selection with temperature/top-k/top-p
   
7. **Append Token** - Add selected token to output
   
8. **Repeat** - Continue from step 4 with the extended ID sequence until reaching the desired length

### Key Characteristics

- **Auto-regressive** - Each token depends on all previous tokens
- **Probabilistic** - Sampling introduces randomness (except greedy mode)
- **Iterative** - Generates one token at a time
- **Configurable** - Parameters control output quality and diversity

## Summary

We've explored the complete text generation pipeline of a local LLM, examining each step from input text to final output. 

### Key Takeaways

1. **Tokenization is fundamental** - Text must be converted to tokens, then IDs
2. **The model processes sequences** - Not individual words, but entire contexts
3. **Logits → Probabilities** - Softmax converts raw scores to interpretable probabilities
4. **Selection strategies matter** - Greedy vs. sampling dramatically affects output quality
5. **Parameters provide control** - Temperature, top-k, and top-p let you tune generation
6. **Auto-regressive generation** - Each token builds on all previous tokens

### Practical Applications

Understanding this pipeline helps you:
- **Debug issues** - Identify where problems occur in generation
- **Optimize performance** - Choose appropriate parameters for your use case
- **Design better prompts** - Understand how context influences predictions
- **Build applications** - Integrate LLMs effectively into your systems

The probabilistic nature of token selection explains why the same prompt can produce different outputs whenever sampling is enabled - a fundamental characteristic of working with LLMs.