# The Learning Signal: How LLMs Actually Improve

This notebook demonstrates the core mechanisms behind how language models learn:
- **Cross-Entropy Loss**: Measuring prediction quality
- **Label Shifting**: How models learn to predict the next token
- **Causal Masking**: Preventing models from seeing the future
- **Assistant-Only Masking**: Training chat models to respond, not predict user input


In [16]:
import os
import torch
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from transformers import AutoTokenizer


# Log in to Hugging Face (reads HF_TOKEN from a .env file; comment these lines out if you don't need gated models)
from dotenv import load_dotenv 
from huggingface_hub import login
load_dotenv()
login(token=os.getenv("HF_TOKEN"))


# Set style for better visualizations
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)


Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


## Part 2: Cross-Entropy Loss - Measuring Prediction Quality

Before moving to tokens, let's start with a small classification scenario:
- Three **classes**: Cat, Dog, Bird
- Two **samples**, each with a known true class
- A **good prediction**: high probability on the correct class
- A **bad prediction**: low probability on the correct class

The same idea carries over to language models, where the "classes" are vocabulary tokens (for example, predicting "Paris" after "The capital of France is").


In [31]:
# Hardcoded example with 3 classes and 2 samples
print("="*80)
print("CROSS-ENTROPY LOSS EXAMPLE")
print("="*80)
print("Classes: 0=Cat, 1=Dog, 2=Bird")
print("\nSample 1: True class = 1 (Dog)")
print("Sample 2: True class = 0 (Cat)")

# Sample 1: Model predictions (probabilities for each class)
sample1_probs = torch.tensor([0.2, 0.7, 0.1])  # High prob for Dog (correct)
sample1_true = torch.tensor([1])  # True class is Dog

# Sample 2: Model predictions (probabilities for each class) 
sample2_probs = torch.tensor([0.1, 0.8, 0.1])  # High prob for Dog (incorrect)
sample2_true = torch.tensor([0])  # True class is Cat

print("="*80)


CROSS-ENTROPY LOSS EXAMPLE
Classes: 0=Cat, 1=Dog, 2=Bird

Sample 1: True class = 1 (Dog)
Sample 2: True class = 0 (Cat)


### Computing Individual Losses

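Now let's compute the cross-entropy loss for each sample using PyTorch's `F.cross_entropy()` function:
- This function takes **logits** (unnormalized scores; it applies log-softmax internally) and the **true class** as input
- For a single sample, the loss is `-log(p)`, where `p` is the probability the model assigns to the true class
- It returns a loss value: **lower loss = better prediction**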


In [50]:
# Turn probabilities into valid logits: softmax(log(p)) == p when p sums to 1
sample1_logits = torch.log(sample1_probs)
sample2_logits = torch.log(sample2_probs)

# Calculate cross-entropy loss for each sample
loss1 = F.cross_entropy(sample1_logits.unsqueeze(0), sample1_true).item()
loss2 = F.cross_entropy(sample2_logits.unsqueeze(0), sample2_true).item()

loss1 = round(loss1, 4)
loss2 = round(loss2, 4)

print('Example 1 Loss:', loss1)
print('Example 2 Loss:', loss2)

print('Example 1 Logits:', sample1_logits)
print('Example 2 Logits:', sample2_logits)



Example 1 Loss: 0.3567
Example 2 Loss: 2.3026
Example 1 Logits: tensor([-1.6094, -0.3567, -2.3026])
Example 2 Logits: tensor([-2.3026, -0.2231, -2.3026])
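
As a sanity check, the two losses above can be reproduced by hand. The sketch below assumes the same probabilities as the previous cell; the names `probs`, `targets`, `manual`, `builtin`, and `offset` are illustrative only. It shows that the loss is just `-log` of the probability on the true class, and that adding a constant to every logit leaves the loss unchanged, which is why `log(p)` is only one of many valid logit vectors.

```python
import torch
import torch.nn.functional as F

# Same predictions as above, stacked into one batch of two samples
probs = torch.tensor([[0.2, 0.7, 0.1],   # sample 1, true class 1 (Dog)
                      [0.1, 0.8, 0.1]])  # sample 2, true class 0 (Cat)
targets = torch.tensor([1, 0])

# Manual cross-entropy: -log(probability assigned to the true class)
manual = -torch.log(probs[torch.arange(len(targets)), targets])

# Built-in version, one loss per sample instead of the batch mean
builtin = F.cross_entropy(torch.log(probs), targets, reduction="none")

# Softmax ignores a constant offset, so the loss does not change
offset = F.cross_entropy(torch.log(probs) + 5.0, targets, reduction="none")

print("manual: ", manual)
print("builtin:", builtin)
print("offset: ", offset)
```

All three tensors should agree with the rounded values printed above (about 0.3567 and 2.3026).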


---

## Part 3: Label Shifting - How Language Models Actually Learn

In language models, there's an important detail: **we shift the labels by one position**.

This sounds complicated, but it's actually very simple!


In [64]:
# Example: Let's say we have the sentence "The capital is Paris"
sentence = "The capital is Paris"
tokens = ["The", "capital", "is", "Paris"]

print("="*80)
print("LABEL SHIFTING EXPLAINED")
print("="*80)
print(f"\nOriginal sentence: '{sentence}'")
print(f"Tokens: {tokens}")
print("\n" + "-"*80)

# Here's what the model sees vs what it should predict:
print("\nAt each position, the model sees some tokens and predicts the NEXT token:\n")

for i in range(len(tokens) - 1):
    input_tokens = tokens[:i+1]
    target_token = tokens[i+1]
    print(f"Position {i}: Model sees: {input_tokens}")
    print(f"           Model should predict: '{target_token}'")
    print()

print("-"*80)
print("\nThis is why we SHIFT the labels:")
print(f"  Input:  {tokens[:-1]}  ← All tokens except the last")
print(f"  Target: {tokens[1:]}   ← All tokens except the first")
print("\n  Input and target are the SAME tokens, just shifted by 1 position!")
print("="*80)


LABEL SHIFTING EXPLAINED

Original sentence: 'The capital is Paris'
Tokens: ['The', 'capital', 'is', 'Paris']

--------------------------------------------------------------------------------

At each position, the model sees some tokens and predicts the NEXT token:

Position 0: Model sees: ['The']
           Model should predict: 'capital'

Position 1: Model sees: ['The', 'capital']
           Model should predict: 'is'

Position 2: Model sees: ['The', 'capital', 'is']
           Model should predict: 'Paris'

--------------------------------------------------------------------------------

This is why we SHIFT the labels:
  Input:  ['The', 'capital', 'is']  ← All tokens except the last
  Target: ['capital', 'is', 'Paris']   ← All tokens except the first

  Input and target are the SAME tokens, just shifted by 1 position!


### Why Do We Shift?

**Simple explanation:** Language models predict the **next** token, not the current one.

**Visual representation:**
```
Sentence: "The capital is Paris"

Input sequence:  ["The",     "capital", "is",     "Paris"]
                    ↓           ↓         ↓          ↓
Target sequence: ["capital", "is",      "Paris",  ...]
                 (predict)   (predict)  (predict)
```

When we give the model `"The"`, we want it to predict `"capital"` (the next token).  
When we give the model `"The capital"`, we want it to predict `"is"` (the next token).  
And so on...

**In code, this looks like:**
```python
tokens = [0, 1, 2, 3, 4, 5]  # Original token IDs
input_ids = tokens[:-1]      # [0, 1, 2, 3, 4]  ← All except last
labels = tokens[1:]          # [1, 2, 3, 4, 5]  ← All except first
```

The model learns by:
1. Taking the tokens up to position `i`, that is `input_ids[:i+1]` (e.g., tokens 0, 1, 2)
2. Predicting what comes next
3. Checking the prediction against `labels[i]` (e.g., token 3)
4. Computing the loss
5. Updating weights to get better at prediction
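
The five steps above can be sketched as a tiny training loop. This is a minimal illustration, not a real language model: `toy_model` is a hypothetical stand-in (an embedding followed by a linear layer) in which each position only sees its own token, whereas a transformer would attend to all previous tokens. The vocabulary size, hidden size, optimizer, learning rate, and step count are arbitrary assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, hidden = 6, 16  # arbitrary toy sizes

# Hypothetical toy model: token ID -> embedding -> logits over the vocabulary
toy_model = nn.Sequential(
    nn.Embedding(vocab_size, hidden),
    nn.Linear(hidden, vocab_size),
)
optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)

tokens = torch.tensor([0, 1, 2, 3, 4, 5])
input_ids = tokens[:-1]  # all except last
labels = tokens[1:]      # all except first

losses = []
for step in range(50):
    logits = toy_model(input_ids)           # steps 1-2: predict next token, shape (5, vocab_size)
    loss = F.cross_entropy(logits, labels)  # steps 3-4: compare with shifted labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                        # step 5: update weights
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

The loss should fall over the 50 steps as the toy model memorizes which token follows which.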


### Concrete Example with Token IDs


In [65]:
# Let's use actual token IDs to make this crystal clear
text = "The capital is Paris"

# Tokenize the text
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

token_ids = tokenizer.encode(text)
token_strings = [tokenizer.decode([tid]) for tid in token_ids]

print("="*80)
print("LABEL SHIFTING WITH REAL TOKEN IDs")
print("="*80)
print(f"\nOriginal text: '{text}'")
print(f"\nToken IDs:      {token_ids}")
print(f"Token strings:  {token_strings}")

print("\n" + "-"*80)
print("SHIFTING:")
print("-"*80)

# Create input_ids and labels (shifted by 1)
input_ids = token_ids[:-1]  # All except last
labels = token_ids[1:]       # All except first

print(f"\nInput IDs:  {input_ids}  ← Remove last token")
print(f"Labels:     {labels}     ← Remove first token")

print("\n" + "-"*80)
print("TRAINING PAIRS:")
print("-"*80)

for i in range(len(input_ids)):
    input_token = token_strings[i]
    target_token = token_strings[i+1]
    print(f"Position {i}: Input token = '{input_token}' (ID: {input_ids[i]})")
    print(f"           Target to predict = '{target_token}' (ID: {labels[i]})")
    print()

print("="*80)
print("💡 Key Point: Same sequence, just shifted by 1 position!")
print("   This is called 'next-token prediction' or 'autoregressive training'")
print("="*80)


LABEL SHIFTING WITH REAL TOKEN IDs

Original text: 'The capital is Paris'

Token IDs:      [464, 3139, 318, 6342]
Token strings:  ['The', ' capital', ' is', ' Paris']

--------------------------------------------------------------------------------
SHIFTING:
--------------------------------------------------------------------------------

Input IDs:  [464, 3139, 318]  ← Remove last token
Labels:     [3139, 318, 6342]     ← Remove first token

--------------------------------------------------------------------------------
TRAINING PAIRS:
--------------------------------------------------------------------------------
Position 0: Input token = 'The' (ID: 464)
           Target to predict = ' capital' (ID: 3139)

Position 1: Input token = ' capital' (ID: 3139)
           Target to predict = ' is' (ID: 318)

Position 2: Input token = ' is' (ID: 318)
           Target to predict = ' Paris' (ID: 6342)

💡 Key Point: Same sequence, just shifted by 1 position!
   This is called 'next-token prediction' or 'autoregressive training'

### Connecting Label Shifting to Loss Calculation

Now you can see how everything fits together:

1. **We shift the labels** so each input predicts the next token
2. **The model outputs logits**, which softmax turns into probabilities for all possible next tokens (the entire vocabulary)
3. **We compute cross-entropy loss** by comparing:
   - Model's predicted probability distribution
   - The actual next token (from our shifted labels)
4. **Lower loss** means the model is good at predicting what comes next
5. **Training minimizes this loss** across millions of examples

**Example in practice:**
```python
# Input:  "The"       → Model predicts probabilities for next token
# Label:  "capital"   → We check: did it assign high probability to "capital"?
# Loss:   -log(P("capital")) → If yes, low loss. If no, high loss!
```
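
To make the connection concrete, here is a minimal sketch of the per-token loss for the GPT-2 token IDs shown earlier. The logits are random placeholders standing in for real model output (an assumption for illustration, so the loss values themselves are meaningless), and the names `shift_logits`, `shift_labels`, `per_token`, and `manual` are illustrative. It uses the equivalent convention of feeding the full sequence and dropping the last row of logits rather than trimming the input.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

token_ids = torch.tensor([464, 3139, 318, 6342])  # "The capital is Paris" (GPT-2 IDs)
vocab_size = 50257                                 # GPT-2 vocabulary size

# Placeholder for model output: one row of logits per input position
logits = torch.randn(len(token_ids), vocab_size)

# Position i predicts token i+1, so drop the last logits row and the first label
shift_logits = logits[:-1]
shift_labels = token_ids[1:]

# One loss value per predicted token
per_token = F.cross_entropy(shift_logits, shift_labels, reduction="none")

# Same thing by hand: -log P(actual next token)
log_probs = F.log_softmax(shift_logits, dim=-1)
manual = -log_probs[torch.arange(len(shift_labels)), shift_labels]

print("per-token loss:", per_token)
print("mean loss:     ", per_token.mean())
```

The mean of the per-token losses is the single scalar that training minimizes.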

This simple idea of **shifted labels** + **cross-entropy loss** is the core training signal for autoregressive language models such as GPT, Claude, and Llama!
