# Bastien GUILLOU and Ryan KHOU

# Chapter 5 - Exercises

> Author: Badr TAJINI - Large Language Models (LLMs) - ESIEE 2024-2025

---

# Exercise 5.1: Temperature-scaled softmax scores and sampling probabilities

**Empirical Analysis of Token Sampling Frequencies Under Temperature Scaling**

**Key Research Question: How does temperature-based scaling of the `softmax` probability distribution impact the sampling frequency of the specific lexical token `"pizza"`?**

*Methodological Framework:*
Utilize the `print_sampled_tokens` function to:
- Empirically examine token sampling probabilities
- Analyze the impact of temperature scaling
- Quantify the sampling occurrence of the `"pizza"` token

*Analytical Objectives:*
- Determine the precise sampling frequency of `"pizza"` across different temperature configurations
- Critically evaluate the current computational approach to sampling frequency measurement
- Explore potential methodological improvements for more efficient and accurate token sampling analysis

*Key Investigative Parameters:*
- Primary token of interest: `"pizza"`
- Sampling method: Temperature-scaled `softmax` distribution
- Computational tool: `print_sampled_tokens` function


# Exercise 5.2: Different temperature and top-k settings

**Empirical Investigation of Generative Language Model Sampling Parameters**

**Key Research Question: How do variations in `temperature` and `top-k` sampling parameters influence the qualitative and probabilistic characteristics of token generation in stochastic language models?**

*Methodological Framework:*
Conduct a systematic empirical exploration of:
- Temperature scaling dynamics
- Top-k probability truncation mechanisms
- Generative output characteristics across different parameter configurations

*Analytical Objectives:*
- Identify contextual applications that benefit from lower `temperature` and `top-k` settings
- Explore potential use cases preferring higher `temperature` and `top-k` configurations
- Develop nuanced understanding of sampling parameter impact on generative outputs

*Investigative Dimensions:*
1. Low `temperature` and `top-k` Scenarios
   - Potential applications
   - Characteristics of generated outputs
   - Contextual relevance

2. High `temperature` and `top-k` Scenarios
   - Potential applications
   - Characteristics of generated outputs
   - Contextual relevance

*Recommended Experimental Protocol:*
1. Systematically vary `temperature` and `top-k` parameters
2. Meticulously document generative output characteristics
3. Critically analyze observed variations
4. Develop hypotheses about optimal parameter configurations for specific applications

# Exercise 5.3: Deterministic behavior in the decoding functions

**Deterministic Token Generation: Parametric Strategies for Eliminating Stochastic Variability**

**Key Research Question: What specific configuration parameters within the `generate` function can systematically eliminate randomness to ensure consistently reproducible generative outputs?**

*Methodological Framework:*
Investigate comprehensive strategies to:
- Suppress stochastic token generation mechanisms
- Enforce deterministic computational behavior
- Replicate the predictable output characteristics of `generate_simple`

*Analytical Objectives:*
- Identify all potential parameter combinations
- Systematically neutralize probabilistic sampling variations
- Establish a deterministic generative protocol

*Critical Configuration Parameters to Examine:*
1. `temperature` scaling
2. `top_k` pruning mechanism
3. Random seed initialization
4. Sampling strategy selection

*Recommended Experimental Protocol:*
1. Analyze individual parameter impacts
2. Identify minimal configuration requirements
3. Validate deterministic output generation
4. Compare against `generate_simple` implementation

*Computational Implications:*
- Understanding stochastic suppression mechanisms
- Insights into generative model controllability
- Strategies for reproducible machine learning outputs

# Exercise 5.4: Continued pretraining

**Continuation of Model Training: Stateful Resumption and Persistent Learning Dynamics**

**Key Research Question: How can we effectively restore a machine learning model's training state across separate computational sessions, enabling seamless continuation of the pretraining process?**

*Methodological Framework:*
Implement a comprehensive model and optimizer state restoration strategy involving:
- Weight reconstruction
- Optimizer state recovery
- Resumption of training from previously interrupted state

*Analytical Objectives:*
- Demonstrate stateful model persistence
- Execute one additional training epoch using the restored model configuration
- Validate continuity of learning progression

*Critical Procedural Steps:*
1. Load previously saved model weights
2. Reconstruct optimizer internal state
3. Reinitiate training using `train_model_simple` function
4. Complete one additional training epoch

*Recommended Implementation Strategy:*
- Utilize precise weight and optimizer state loading mechanisms
- Verify complete state restoration
- Execute uninterrupted additional training epoch
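
The reason for restoring the optimizer state, and not only the weights, can be illustrated on a toy problem. The sketch below is a simplified scalar Adam update written with the standard library only (no weight decay, so it is not PyTorch's `AdamW`); `adam_step`, `train`, and `fresh_state` are hypothetical helpers introduced for illustration. It compares an uninterrupted run with a run resumed from a checkpoint, once with the saved moments and once with reset moments.

```python
import math

def adam_step(param, grad, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One simplified Adam update on a scalar; `state` holds the step count and moments."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias-corrected first moment
    v_hat = state["v"] / (1 - beta2 ** state["t"])  # bias-corrected second moment
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)

def train(param, state, steps):
    """Minimize (param - 3)^2 for a number of steps, updating `state` in place."""
    for _ in range(steps):
        grad = 2 * (param - 3.0)
        param = adam_step(param, grad, state)
    return param

def fresh_state():
    return {"t": 0, "m": 0.0, "v": 0.0}

# Reference: ten uninterrupted steps.
uninterrupted = train(0.0, fresh_state(), 10)

# Five steps, then a checkpoint holding both the parameter and the optimizer state.
state = fresh_state()
halfway = train(0.0, state, 5)
checkpoint = {"param": halfway, "optimizer_state": dict(state)}

# Resume with the saved optimizer state versus a reset optimizer.
resumed = train(checkpoint["param"], dict(checkpoint["optimizer_state"]), 5)
reset = train(checkpoint["param"], fresh_state(), 5)

print("uninterrupted:", uninterrupted)
print("resumed      :", resumed)
print("reset moments:", reset)
```

In this toy setting the resumed run retraces the uninterrupted trajectory, while the run with reset moments drifts away from it. The same reasoning motivates loading the optimizer state alongside the model weights.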

# Exercise 5.5: Training and validation set losses of the pretrained model

**Comparative Loss Assessment: Pretrained Model Performance on Specialized Textual Domain**

**Key Research Question: What are the training and validation set losses when a `GPTModel` loaded with the pretrained OpenAI weights is evaluated on the "The Verdict" dataset?**

*Methodological Framework:*
Conduct a comprehensive loss evaluation involving:
- Model weight initialization from pretrained OpenAI configuration
- Computational loss calculation across training and validation datasets
- Quantitative performance assessment in domain-specific context

*Analytical Objectives:*
- Determine precise loss metrics for training dataset
- Calculate validation set loss
- Interpret performance characteristics of pretrained model on specialized textual domain

*Critical Computational Procedures:*
1. Load pretrained OpenAI `GPTModel` weights
2. Prepare "The Verdict" dataset
3. Compute training set loss
4. Compute validation set loss
5. Compare the two losses

*Investigative Parameters:*
- Model: Pretrained OpenAI `GPTModel`
- Dataset: "The Verdict"
- Metrics: Training and validation loss measurements

*Recommended Analytical Approach:*
- Implement precise loss computation
- Validate computational methodology
- Critically interpret loss metric implications
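
As a reference point for interpreting the numbers, the sketch below computes a mean token-level cross-entropy in plain Python. It assumes the training and validation losses are averages of the negative log-likelihood of the target tokens, which is the usual choice for GPT-style pretraining; `cross_entropy` is an illustrative helper, not the loss function used by the course code.

```python
import math

def cross_entropy(logits_per_position, target_ids):
    """Mean negative log-likelihood of the target token at each position."""
    total = 0.0
    for logits, target in zip(logits_per_position, target_ids):
        peak = max(logits)
        # log-sum-exp with the maximum subtracted for numerical stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_norm - logits[target]
    return total / len(target_ids)

# A model that is uniformly unsure over 4 tokens scores log(4) per token.
uniform_loss = cross_entropy([[0.0] * 4] * 3, [0, 1, 2])

# A model that is confident and correct scores close to zero.
confident_loss = cross_entropy([[20.0, 0.0, 0.0]], [0])

print("uniform loss  :", uniform_loss, "perplexity:", math.exp(uniform_loss))
print("confident loss:", confident_loss)
```

By the same logic, a model guessing uniformly over a 50,257-token vocabulary would score `math.log(50257)`, about 10.8, which gives a rough upper reference for the measured losses.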

# Exercise 5.6: Trying larger models

**Comparative Generative Analysis: Scale and Performance Variations in GPT-2 Model Architectures**

**Key Research Question: How do generative text characteristics vary across different GPT-2 model scales, specifically comparing the 124 million and 1,558 million parameter configurations?**

*Methodological Framework:*
Conduct a systematic comparative investigation of:
- Generative text quality
- Semantic coherence
- Linguistic complexity
- Contextual understanding

*Analytical Objectives:*
- Empirically assess generative performance across model scales
- Identify qualitative differences in text generation
- Explore the relationship between model parameter count and generative capabilities

*Comparative Model Configurations:*
1. Smaller Model: **124 million parameters**
2. Larger Model: **1,558 million parameters**

*Investigative Dimensions:*
- Textual coherence
- Semantic precision
- Contextual relevance
- Linguistic nuance
- Complexity of generated content

*Experimental Protocol:*
1. Generate text samples using both model configurations
2. Conduct qualitative comparative analysis
3. Assess generative performance across multiple dimensions
4. Document observable variations in text generation characteristics

*Recommended Analytical Approach:*
- Utilize consistent generation parameters
- Employ multiple generation trials
- Implement rigorous qualitative assessment
- Develop comprehensive comparative framework
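
To make the scale difference concrete, the parameter counts can be estimated from the configuration values alone. The sketch below assumes the standard GPT-2 layout (biased query, key, and value projections, a feed-forward layer with a 4x expansion, two layer norms per block) and, by default, an output head tied to the token embedding as in the original GPT-2. `gpt2_param_count` is a hypothetical helper written for this illustration.

```python
def gpt2_param_count(emb_dim, n_layers, vocab_size=50257,
                     context_length=1024, tied_head=True):
    """Estimate the parameter count of a GPT-2 style model.

    The number of attention heads does not appear: it changes how the
    projections are split, not how many weights they contain.
    """
    d = emb_dim
    embeddings = vocab_size * d + context_length * d      # token + positional
    attention = 3 * (d * d + d) + (d * d + d)             # Q, K, V + output projection
    feed_forward = (d * 4 * d + 4 * d) + (4 * d * d + d)  # expand, then contract
    layer_norms = 2 * 2 * d                               # scale and shift, twice per block
    per_block = attention + feed_forward + layer_norms
    final_norm = 2 * d
    head = 0 if tied_head else vocab_size * d             # untied head adds a full matrix
    return embeddings + n_layers * per_block + final_norm + head

small = gpt2_param_count(emb_dim=768, n_layers=12)
xl = gpt2_param_count(emb_dim=1600, n_layers=48)

print(f"gpt2-small: {small:,} parameters")
print(f"gpt2-xl   : {xl:,} parameters")
print(f"ratio     : {xl / small:.1f}x")
```

If the output head is not tied to the embedding, passing `tied_head=False` adds `vocab_size * emb_dim` parameters to the total.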

---

# Answers

## Exercise 5.1: Temperature-scaled softmax scores and sampling probabilities

In [1]:
!pip install torch






In [2]:
import torch

vocab = { 
    "closer": 0,
    "every": 1, 
    "effort": 2, 
    "forward": 3,
    "inches": 4,
    "moves": 5, 
    "pizza": 6,
    "toward": 7,
    "you": 8,
} 

inverse_vocab = {v: k for k, v in vocab.items()}

# Suppose input is "every effort moves you", and the LLM
# returns the following logits for the next token:
next_token_logits = torch.tensor(
    [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]
)


def print_sampled_tokens(probas, num_samples=1000):
    torch.manual_seed(123)  
    sampled_tokens = torch.multinomial(probas, num_samples=num_samples, replacement=True)
    sampled_counts = torch.bincount(sampled_tokens, minlength=len(vocab))

    for idx, count in enumerate(sampled_counts):
        token = inverse_vocab.get(idx, "[UNK]")
        print(f"{count} x '{token}'")


def softmax_with_temperature(logits, temperature):
    scaled_logits = logits / temperature
    return torch.softmax(scaled_logits, dim=0)

temperature_values = [1.0, 0.1, 5.0]

for temp in temperature_values:
    print(f"\n=== Temperature : {temp} ===")
    probabilities = softmax_with_temperature(next_token_logits, temp)
    print_sampled_tokens(probabilities)



=== Temperature : 1.0 ===
64 x 'closer'
3 x 'every'
0 x 'effort'
572 x 'forward'
2 x 'inches'
0 x 'moves'
0 x 'pizza'
356 x 'toward'
3 x 'you'

=== Temperature : 0.1 ===
0 x 'closer'
0 x 'every'
0 x 'effort'
990 x 'forward'
0 x 'inches'
0 x 'moves'
0 x 'pizza'
10 x 'toward'
0 x 'you'

=== Temperature : 5.0 ===
157 x 'closer'
70 x 'every'
46 x 'effort'
261 x 'forward'
79 x 'inches'
39 x 'moves'
41 x 'pizza'
210 x 'toward'
97 x 'you'
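
Sampling only estimates how often `"pizza"` appears, and the count depends on the seed and on `num_samples`. A more direct route is to read the probability off the temperature-scaled softmax itself. The following is a minimal sketch using only the standard library; `softmax_py` and `pizza_probability` are illustrative names, not functions from the notebook.

```python
import math

# Logits copied from the answer cell above; index 6 is "pizza".
next_token_logits = [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]
PIZZA_ID = 6

def softmax_py(logits, temperature):
    """Numerically stable softmax of logits / temperature in plain Python."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def pizza_probability(temperature):
    """Exact probability of sampling "pizza" at the given temperature."""
    return softmax_py(next_token_logits, temperature)[PIZZA_ID]

for temp in (1.0, 0.1, 5.0):
    p = pizza_probability(temp)
    print(f"T={temp}: P('pizza') = {p:.3e}, expected count in 1000 draws = {1000 * p:.2f}")
```

Multiplying the probability by the number of draws gives the expected count, which can be compared with the sampled counts above (for example, 41 draws at temperature 5.0).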


## Exercise 5.2: Different temperature and top-k settings

In [3]:
# Function to apply top-k truncation
def apply_top_k_truncation(probas, k):
    """
    Truncate the probabilities to keep only the top-k values.
    """
    if k >= len(probas):
        return probas  # No truncation if k is equal to or larger than the vocabulary size
    topk_values, topk_indices = torch.topk(probas, k)
    truncated_probas = torch.zeros_like(probas)
    truncated_probas[topk_indices] = topk_values
    return truncated_probas / truncated_probas.sum()

# Function to explore sampling parameters
def explore_sampling_parameters(logits, temperatures, top_ks, num_samples=1000):
    """
    Explore sampling variations by temperature and top-k values.
    """
    for temp in temperatures:
        for k in top_ks:
            print(f"\n=== Temperature: {temp}, Top-k: {k} ===")
            # Apply temperature scaling
            probabilities = softmax_with_temperature(logits, temp)
            # Apply top-k truncation
            truncated_probabilities = apply_top_k_truncation(probabilities, k)
            # Sample and display frequencies
            print_sampled_tokens(truncated_probabilities, num_samples)

# Parameters for the experiment
temperature_values = [0.5, 1.0, 2.0]
top_k_values = [3, 5, len(vocab)]

# Execute the exploration
explore_sampling_parameters(next_token_logits, temperature_values, top_k_values)



=== Temperature: 0.5, Top-k: 3 ===
7 x 'closer'
0 x 'every'
0 x 'effort'
714 x 'forward'
0 x 'inches'
0 x 'moves'
0 x 'pizza'
279 x 'toward'
0 x 'you'

=== Temperature: 0.5, Top-k: 5 ===
7 x 'closer'
0 x 'every'
0 x 'effort'
714 x 'forward'
0 x 'inches'
0 x 'moves'
0 x 'pizza'
279 x 'toward'
0 x 'you'

=== Temperature: 0.5, Top-k: 9 ===
7 x 'closer'
0 x 'every'
0 x 'effort'
714 x 'forward'
0 x 'inches'
0 x 'moves'
0 x 'pizza'
279 x 'toward'
0 x 'you'

=== Temperature: 1.0, Top-k: 3 ===
65 x 'closer'
0 x 'every'
0 x 'effort'
576 x 'forward'
0 x 'inches'
0 x 'moves'
0 x 'pizza'
359 x 'toward'
0 x 'you'

=== Temperature: 1.0, Top-k: 5 ===
64 x 'closer'
0 x 'every'
0 x 'effort'
575 x 'forward'
2 x 'inches'
0 x 'moves'
0 x 'pizza'
356 x 'toward'
3 x 'you'

=== Temperature: 1.0, Top-k: 9 ===
64 x 'closer'
3 x 'every'
0 x 'effort'
572 x 'forward'
2 x 'inches'
0 x 'moves'
0 x 'pizza'
356 x 'toward'
3 x 'you'

=== Temperature: 2.0, Top-k: 3 ===
156 x 'closer'
0 x 'every'
0 x 'effort'
473 x 'fo
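
The `apply_top_k_truncation` helper in the cell above truncates probabilities after the softmax. A common alternative is to mask the logits with `-inf` before the softmax, which should give the same distribution. The sketch below checks that equivalence in plain Python; both function names are hypothetical and neither is part of the course code.

```python
import math

next_token_logits = [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def top_k_on_logits(logits, k, temperature):
    """Keep the k largest logits, set the rest to -inf, then apply the softmax."""
    threshold = sorted(logits, reverse=True)[k - 1]
    # Ties at the threshold are all kept, so more than k tokens may survive.
    masked = [x if x >= threshold else float("-inf") for x in logits]
    return softmax([x / temperature for x in masked])

def top_k_on_probas(logits, k, temperature):
    """Softmax first, then zero out everything below the k-th probability and renormalize."""
    probas = softmax([x / temperature for x in logits])
    threshold = sorted(probas, reverse=True)[k - 1]
    kept = [p if p >= threshold else 0.0 for p in probas]
    total = sum(kept)
    return [p / total for p in kept]

for temperature in (0.5, 1.0, 2.0):
    for k in (3, 5):
        a = top_k_on_logits(next_token_logits, k, temperature)
        b = top_k_on_probas(next_token_logits, k, temperature)
        max_diff = max(abs(x - y) for x, y in zip(a, b))
        print(f"T={temperature}, k={k}: max difference = {max_diff:.2e}")
```

Masking logits avoids renormalizing by hand, because the softmax assigns zero probability to every `-inf` entry on its own.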

## Exercise 5.3: Deterministic behavior in the decoding functions

In [4]:
from gpt_generate import generate, text_to_token_ids, token_ids_to_text, load_weights_into_gpt
import tiktoken

import torch
from previous_labs import GPTModel

GPT_CONFIG_124M = {
    "vocab_size": 50257,   # Vocabulary size
    "context_length": 256, # Shortened context length (orig: 1024)
    "emb_dim": 768,        # Embedding dimension
    "n_heads": 12,         # Number of attention heads
    "n_layers": 12,        # Number of layers
    "drop_rate": 0.1,      # Dropout rate
    "qkv_bias": False      # Query-key-value bias
}

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
tokenizer = tiktoken.get_encoding("gpt2")
model.load_state_dict(torch.load("model.pth", weights_only=True))
model.eval();  # Disable dropout during inference

token_ids = generate(
    model=model,
    idx=text_to_token_ids("Every effort moves you", tokenizer),
    max_new_tokens=25,
    context_size=GPT_CONFIG_124M["context_length"],
    top_k=None,
    temperature=0.0
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort moves you?"

"Yes--quite insensible to the irony. She wanted him vindicated--and by me!"
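
Why these settings remove randomness can be sketched with a single decoding step. The function below is a hypothetical stand-in written with the standard library, not the `generate` function imported above. It assumes that a temperature of 0 falls back to a plain argmax, and it shows that `top_k=1` is also deterministic because only one token keeps a non-zero probability.

```python
import math
import random

next_token_logits = [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]

def pick_next_token(logits, temperature=0.0, top_k=None, rng=random):
    """Return the index of the next token for one decoding step."""
    if top_k is not None:
        threshold = sorted(logits, reverse=True)[top_k - 1]
        logits = [x if x >= threshold else float("-inf") for x in logits]
    if temperature <= 0.0:
        # Greedy path: no random number is drawn, so the seed is irrelevant.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # masked entries get weight 0.0
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

seeds = range(50)
greedy = {pick_next_token(next_token_logits, 0.0, None, random.Random(s)) for s in seeds}
top1 = {pick_next_token(next_token_logits, 5.0, 1, random.Random(s)) for s in seeds}
sampled = {pick_next_token(next_token_logits, 5.0, None, random.Random(s)) for s in seeds}

print("temperature=0.0       ->", sorted(greedy))
print("temperature=5.0, k=1  ->", sorted(top1))
print("temperature=5.0, k=all->", sorted(sampled))
```

Fixing the random seed alone makes a sampled run repeatable, but it does not reproduce greedy decoding. Only the argmax path, or a distribution collapsed to a single token, does.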




## Exercise 5.4: Continued pretraining

In [5]:
import os
import urllib.request
from previous_labs import create_dataloader_v1
from gpt_train import train_model_simple

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
checkpoint = torch.load("model_and_optimizer.pth", weights_only=True)
model.load_state_dict(checkpoint["model_state_dict"])
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
model.train();

file_path = "the-verdict.txt"
# Use the "resolve" path: a "blob" URL returns the HTML viewer page, not the raw text
url = "https://huggingface.co/datasets/DarwinAnim8or/the-verdict/resolve/main/the-verdict.txt"

if not os.path.exists(file_path):
    with urllib.request.urlopen(url) as response:
        text_data = response.read().decode('utf-8')
    with open(file_path, "w", encoding="utf-8") as file:
        file.write(text_data)
else:
    with open(file_path, "r", encoding="utf-8") as file:
        text_data = file.read()
        
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]
torch.manual_seed(123)

train_loader = create_dataloader_v1(
    train_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=True,
    shuffle=True,
    num_workers=0
)

val_loader = create_dataloader_v1(
    val_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=False,
    shuffle=False,
    num_workers=0
)

num_epochs = 1
train_losses, val_losses, tokens_seen = train_model_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=5, eval_iter=5,
    start_context="Every effort moves you", tokenizer=tokenizer
)

Ep 1 (Step 000000): Train loss 0.369, Val loss 6.503
Ep 1 (Step 000005): Train loss 0.338, Val loss 6.515
Every effort moves you?" "I and pushed one of the deep arm-chairs forward. "There: make yourself comfortable--and here are the cigars you like." "I looked at the donkey again. I saw that, when Stroud laid in the first


## Exercise 5.5: Training and validation set losses of the pretrained model

In [None]:
from gpt_download import download_and_load_gpt2

settings, params = download_and_load_gpt2(model_size="124M", models_dir="gpt2")
model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

# Copy the base configuration and update with specific model settings
model_name = "gpt2-small (124M)"  # Example model name
NEW_CONFIG = GPT_CONFIG_124M.copy()
NEW_CONFIG.update(model_configs[model_name])
NEW_CONFIG.update({"context_length": 1024, "qkv_bias": True})

gpt = GPTModel(NEW_CONFIG)
gpt.eval();

load_weights_into_gpt(gpt, params)
gpt.to(device);