# Embeddings and Tokenization for Large Language Models

## Introduction

In this notebook, I reproduce and explore the core ideas from Chapter 2 of 
*Build a Large Language Model (From Scratch)* by Sebastian Raschka.

The goal is to understand how raw text is transformed into numerical 
representations (tokens and embeddings), and why these steps are fundamental 
for modern LLMs and agentic systems.


## Why Tokenization Matters in LLMs

Large Language Models cannot process raw text directly. Neural networks operate on numbers, not words. Therefore, the first critical step is transforming text into numerical representations.

Tokenization converts raw text into discrete units (tokens), each mapped to a unique integer ID. This allows us to:

- Build a fixed vocabulary
- Represent text as sequences of integers
- Feed those sequences into neural networks

Without tokenization, the model would have no structured way to interpret language. In LLMs, tokenization defines the model’s “language interface”: it determines how text is split into units, and therefore which units the model can learn representations for.
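The idea of mapping text to integer IDs can be sketched with a toy word-level tokenizer. This is only an illustration under simplifying assumptions: the helper names (`build_vocab`, `encode`, `decode`) and the splitting regex are hypothetical, and the GPT-2 tokenizer used later in this notebook relies on byte-pair encoding rather than whole words.

```python
import re

# Split on words and single punctuation marks (a simplifying assumption).
TOKEN_PATTERN = r"\w+|[^\w\s]"


def build_vocab(text):
    """Map each unique token to an integer ID (sorted for reproducibility)."""
    tokens = re.findall(TOKEN_PATTERN, text)
    return {token: idx for idx, token in enumerate(sorted(set(tokens)))}


def encode(text, vocab):
    """Convert text into token IDs; raises KeyError for unknown tokens."""
    return [vocab[token] for token in re.findall(TOKEN_PATTERN, text)]


def decode(ids, vocab):
    """Convert token IDs back into a space-separated string."""
    inverse = {idx: token for token, idx in vocab.items()}
    return " ".join(inverse[i] for i in ids)


sample = "the cat sat on the mat."
vocab = build_vocab(sample)
ids = encode(sample, vocab)

print(vocab)
print(ids)
print(decode(ids, vocab))
```

Both occurrences of "the" receive the same ID, which is what makes the vocabulary fixed and reusable. A word-level scheme like this one fails on words it has not seen, which is one reason subword tokenizers are used in practice.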

## Why Sliding Windows Are Used

LLMs are trained to predict the next token given previous tokens. To achieve this, we must convert long text into many smaller training examples.

The sliding window technique creates overlapping input-target pairs:

Input:  [x1, x2, x3, x4]  
Target: [x2, x3, x4, x5]

This allows the model to learn next-token prediction repeatedly across the entire text.

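The stride is the number of positions the window advances between consecutive samples. Whenever the stride is smaller than the window length, consecutive windows overlap. This matters because:

- Smaller stride → more training samples, with more overlap between them
- Larger stride → fewer samples, with less overlap
- Stride equal to the window length → no overlap, so each token appears in at most one input window

A small stride extracts more training examples from a limited amount of text, at the cost of more computation per pass over the data.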
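The windowing described above can be sketched on a short list of integers. The function name `sliding_windows` and the toy IDs are assumptions made for illustration; the loop bounds follow the same pattern as the loop used later in this notebook.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (inputs, targets) where each target is its input shifted by one."""
    inputs, targets = [], []
    # Stop early enough that every target chunk has max_length tokens.
    for i in range(0, len(token_ids) - max_length, stride):
        inputs.append(token_ids[i:i + max_length])
        targets.append(token_ids[i + 1:i + max_length + 1])
    return inputs, targets


# Stand-in for token IDs x1..x10 (not real tokenizer output).
toy_ids = list(range(1, 11))

for stride in (1, 2, 4):
    inputs, targets = sliding_windows(toy_ids, max_length=4, stride=stride)
    print(f"stride={stride}: {len(inputs)} samples")
    print("  first input: ", inputs[0])
    print("  first target:", targets[0])
```

With ten tokens and a window of four, the three stride values give 6, 3, and 2 samples respectively, and the first pair is always `[1, 2, 3, 4]` with target `[2, 3, 4, 5]`.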

## Why Do Embeddings Encode Meaning?

Embeddings encode meaning because of how they are trained.

During training, tokens that appear in similar contexts receive similar gradient updates. Over time, this causes their vectors to move closer together in vector space.

This is based on the Distributional Hypothesis:
“Words that appear in similar contexts tend to have similar meanings.”

For example, words like "king" and "queen" or "cat" and "dog" often appear in related contexts. The neural network adjusts their vectors accordingly.

From a neural network perspective:

- Embeddings are trainable parameters
- They are updated through backpropagation
- They form the first layer of the model
- They enable semantic structure to emerge in high-dimensional space

Meaning is not manually programmed; it emerges from optimization.
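"Closer together in vector space" is commonly measured with cosine similarity. The sketch below uses hand-picked three-dimensional vectors that are purely hypothetical (they do not come from any trained model) to show how the measure separates a related pair from an unrelated one.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical vectors, chosen by hand for illustration only.
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
cat_car = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])

print("cat vs dog:", round(cat_dog, 3))
print("cat vs car:", round(cat_car, 3))
```

In a trained model the vectors are learned rather than chosen, but the same comparison can be applied to rows of the embedding matrix.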

## What Is an Embedding Layer?

An embedding layer is a trainable matrix that maps token IDs to dense vectors.

Mathematically, looking up token ID i is equivalent to multiplying a one-hot vector (a 1 at position i, zeros elsewhere) by the weight matrix, which selects row i of that matrix.

Embedding matrix shape:
[vocab_size × embedding_dim]

Each row corresponds to one token in the vocabulary.

Instead of representing tokens as sparse one-hot vectors (mostly zeros), embeddings represent them as dense vectors in a continuous vector space. This reduces dimensionality (in this notebook, from 50,257 to 256 values per token) and allows the model to learn relationships between words.

The embedding layer is a lookup table whose entries are trainable: its rows are parameters updated through backpropagation during training.
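The equivalence between a row lookup and a one-hot multiplication can be checked on a tiny example. To stay dependency-free, this sketch uses plain Python lists instead of `torch`; the sizes, the weight values, and the helper names are hypothetical and chosen only so that each row is easy to recognize.

```python
vocab_size, embedding_dim = 5, 3

# Hypothetical weights: row r holds [10r, 10r + 1, 10r + 2].
weight = [
    [float(10 * row + col) for col in range(embedding_dim)]
    for row in range(vocab_size)
]


def one_hot(token_id, size):
    """Vector of zeros with a single 1.0 at position token_id."""
    return [1.0 if i == token_id else 0.0 for i in range(size)]


def vector_matrix_product(vector, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    num_cols = len(matrix[0])
    return [
        sum(v * row[col] for v, row in zip(vector, matrix))
        for col in range(num_cols)
    ]


token_id = 3
via_matmul = vector_matrix_product(one_hot(token_id, vocab_size), weight)
via_lookup = weight[token_id]

print("one-hot product:", via_matmul)
print("row lookup:     ", via_lookup)
```

Both paths return row 3 of the matrix. The lookup avoids multiplying by all the zeros, which is why embedding layers are implemented as indexing rather than as a matrix product.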

## Experiment: Effect of max_length and Stride

By modifying max_length and stride, we observe changes in the number of generated training samples.

When stride = 1:
- Windows overlap heavily
- Many training samples are created
- More efficient reuse of text

When stride = max_length:
- No overlap
- Fewer training samples
- Less contextual reuse

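Overlap is useful because it increases the number of training samples without requiring more text, so each token is seen at several positions within the context window. The overlapping samples share most of their tokens, however, so they are highly correlated, and the benefit for generalization is smaller than the raw sample count suggests.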

In large-scale LLM training, stride selection impacts both computational cost and data efficiency.
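The number of samples can be computed without building the windows. The helper below is a hypothetical name introduced for illustration; it counts the iterations of the same `range(0, len(token_ids) - max_length, stride)` loop used in this notebook, and 5145 is the token count this notebook reports for `the-verdict.txt`.

```python
def num_samples(num_tokens, max_length, stride):
    """Count the windows produced by range(0, num_tokens - max_length, stride)."""
    return len(range(0, num_tokens - max_length, stride))


num_tokens = 5145  # token count reported for the-verdict.txt in this notebook

for stride in (1, 2, 4):
    count = num_samples(num_tokens, max_length=4, stride=stride)
    print(f"max_length=4, stride={stride}: {count} samples")
```

For `max_length = 4`, a stride of 4 gives 1286 samples, matching the count printed further down, while a stride of 1 gives 5141.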

In [1]:
%pip install torch tiktoken notebook


In [5]:
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
print(tokenizer)

<Encoding 'gpt2'>


In [6]:
max_length = 4
stride = 1

In [None]:
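import tiktoken
import torch
import torch.nn as nn

# GPT-2 byte-pair-encoding tokenizer
tokenizer = tiktoken.get_encoding("gpt2")

# Number of distinct token IDs the tokenizer can produce
vocab_size = tokenizer.n_vocab

# Length of the vector assigned to each token
embedding_dim = 256

# Trainable matrix with one randomly initialized row per token ID
embedding_layer = nn.Embedding(vocab_size, embedding_dim)

print(embedding_layer.weight.shape)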

torch.Size([50257, 256])


In [9]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    text = f.read()

token_ids = tokenizer.encode(text)

print(len(token_ids))

5145


In [None]:
max_length = 4
stride = 4


In [12]:
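inputs = []
targets = []

# Slide a window of max_length tokens over the text, advancing by stride.
# Stopping at len(token_ids) - max_length keeps every target chunk full length.
for i in range(0, len(token_ids) - max_length, stride):
    input_chunk = token_ids[i:i + max_length]
    # The target is the input shifted one token to the right.
    target_chunk = token_ids[i + 1:i + max_length + 1]

    inputs.append(input_chunk)
    targets.append(target_chunk)

print("Number of samples:", len(inputs))

Number of samples: 1286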


In [13]:
print(len(inputs))

1286


In [16]:
import torch

inputs = torch.tensor(inputs)
targets = torch.tensor(targets)

print("Inputs shape:", inputs.shape)
print("Targets shape:", targets.shape)

Inputs shape: torch.Size([1286, 4])
Targets shape: torch.Size([1286, 4])



In [17]:
embedded = embedding_layer(inputs)

print("Embedded shape:", embedded.shape)

Embedded shape: torch.Size([1286, 4, 256])


In [18]:
print(embedding_layer.weight[:5])

tensor([[-1.9328, -0.0517, -1.9630,  ...,  0.0383,  0.6520,  0.0033],
        [-0.7397,  0.4527,  0.7560,  ...,  0.2311,  0.5016, -0.7208],
        [-0.1881,  1.0750,  0.1260,  ...,  0.2461, -0.2009, -0.1434],
        [-0.4796,  0.1654,  0.6221,  ...,  0.1895, -1.1936, -1.3270],
        [ 0.2820,  0.2150,  1.0632,  ..., -0.4031, -0.1143, -0.2693]],
       grad_fn=<SliceBackward0>)
