# Embeddings Lab 
## Name: David Santiago Castro

This notebook reproduces the core preprocessing and embedding steps needed before training an LLM:
- load raw text
- tokenize into integer IDs
- build (input, target) training samples with a sliding window (max_length, stride)
- batch with a DataLoader
- look up token embeddings and add the usual positional embeddings
- run a small experiment changing max_length and stride


In [1]:
import os
import re

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

try:
    import tiktoken
    _HAS_TIKTOKEN = True
except Exception:
    _HAS_TIKTOKEN = False

print("torch:", torch.__version__)
print("tiktoken available:", _HAS_TIKTOKEN)

device = "cuda" if torch.cuda.is_available() else "cpu"
print("device:", device)

torch: 2.10.0+cpu
tiktoken available: True
device: cpu


## 1) Load raw text

LLMs start from plain text. Everything that comes later (tokens, embeddings, training samples) depends on the raw data:
- what words/patterns exist
- how often they appear
- what contexts they appear in

For agentic systems, this is also the starting point for building a knowledge base (docs, logs, chat history) before embedding and indexing.
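
As a rough illustration of the three bullets above, the sketch below counts word frequencies and the words that follow a given word. The sample sentence is a hypothetical toy corpus, not taken from the-verdict.txt, and whitespace splitting stands in for real tokenization.

```python
from collections import Counter

# Hypothetical toy corpus (not taken from the-verdict.txt).
sample = "the cat sat on the mat and the dog sat too"
words = sample.split()

# How often each word appears.
counts = Counter(words)

# Which words follow "the": a crude view of the contexts a word appears in.
after_the = Counter(nxt for cur, nxt in zip(words, words[1:]) if cur == "the")

print(counts.most_common(2))
print(sorted(after_the))
```

The same kind of count on the real text gives a quick sanity check of the data before any tokenizer or model is involved.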


In [2]:
candidate_paths = ["the-verdict.txt", "/mnt/data/the-verdict.txt"]
path = next((p for p in candidate_paths if os.path.exists(p)), None)
assert path is not None, "Could not find the-verdict.txt. Put it next to this notebook."

with open(path, "r", encoding="utf-8") as f:
    text = f.read()

print("Loaded:", path)
print("Characters:", len(text))
print("\nPreview (first 350 chars):\n")
print(text[:350])

Loaded: the-verdict.txt
Characters: 20479

Preview (first 350 chars):

I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear that, in the height of his glory, he had dropped his painting, married a rich widow, and established himself in a villa on the Riviera. (Though I rather thought it would have been Rome or Florence.)

"The height of his glo


## 2) Tokenize to integer IDs

Neural networks do math, not strings. Tokenization converts text into a sequence of integers.

Why this matters:
- the model predicts the next token over a fixed vocabulary
- tokens define what the model can see and learn (word pieces vs. whole words, etc.)
- tokenization impacts memory, speed, and how well rare words are handled

We try to use tiktoken with the GPT-2 encoding (GPT-style byte-pair tokenization). If tiktoken can't download its encoding files, we fall back to a simple regex tokenizer that builds a vocab from this text.
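
To make the fallback concrete, here is a minimal sketch of what the regex split and vocab building do on one short sentence. The sample sentence and the helper name split_tokens are illustrative assumptions; the pattern is the same one the fallback tokenizer uses.

```python
import re

# Same split pattern as the notebook's fallback tokenizer:
# split on whitespace runs and on single punctuation characters, keeping the punctuation.
SPLIT_RE = r"(\s+|[,.!?;:()\[\]{}])"

def split_tokens(s):
    """Hypothetical helper: split a string and drop empty/whitespace-only pieces."""
    parts = re.split(SPLIT_RE, s)
    return [p for p in parts if p and not p.isspace()]

sample = "Hello, world! (Is this a test?)"  # illustrative sentence
tokens = split_tokens(sample)

# Build a vocab from the sorted unique tokens, then map tokens to integer IDs.
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
ids = [vocab[t] for t in tokens]

print(tokens)
print(ids)
```

Unlike byte-pair encoding, this scheme has no way to represent a word it has never seen, which is why the fallback adds an <|unk|> token.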


In [3]:
_SPLIT_RE = r"(\s+|[,.!?;:()\[\]{}])"

class SimpleTokenizer:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, s):
        parts = re.split(_SPLIT_RE, s)
        parts = [p for p in parts if p and not p.isspace()]
        return [self.str_to_int.get(p, self.str_to_int["<|unk|>"]) for p in parts]

    def decode(self, ids):
        tokens = [self.int_to_str[i] for i in ids]
        text = " ".join(tokens)
        text = re.sub(r"\s+([,.!?;:\)])", r"\1", text)
        text = re.sub(r"\(\s+", "(", text)
        return text

tokenizer_type = None

if _HAS_TIKTOKEN:
    try:
        enc = tiktoken.get_encoding("gpt2")
        tokenizer_type = "tiktoken:gpt2"
        token_ids = enc.encode(text)
        vocab_size = enc.n_vocab
        decode = enc.decode
    except Exception as e:
        print("tiktoken failed (likely offline). Falling back to SimpleTokenizer.")
        print("Reason:", type(e).__name__, str(e)[:120], "...")
        tokenizer_type = "simple"

if tokenizer_type is None or tokenizer_type == "simple":
    parts = re.split(_SPLIT_RE, text)
    parts = [p for p in parts if p and not p.isspace()]
    uniq = sorted(set(parts))
    vocab = {tok: i for i, tok in enumerate(uniq)}
    for special in ["<|unk|>", "<|endoftext|>"]:
        if special not in vocab:
            vocab[special] = len(vocab)

    tok = SimpleTokenizer(vocab)
    tokenizer_type = "simple"
    token_ids = tok.encode(text)
    vocab_size = len(vocab)
    decode = tok.decode

print("Tokenizer:", tokenizer_type)
print("Vocab size:", vocab_size)
print("Total tokens:", len(token_ids))

print("\nFirst 30 token IDs:", token_ids[:30])
print("Decoded back:", decode(token_ids[:30]))

Tokenizer: tiktoken:gpt2
Vocab size: 50257
Total tokens: 5145

First 30 token IDs: [40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899, 2138, 257, 7026, 15632, 438, 2016, 257, 922, 5891, 1576, 438, 568, 340, 373, 645, 1049, 5975, 284, 502, 284, 3285]
Decoded back: I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear


## 3) Sliding window samples (max_length, stride)

Transformers train on fixed-length sequences, so we turn one long token stream into many supervised examples:

- X: tokens *t .. t+max_length-1*
- Y: tokens *t+1 .. t+max_length*  (shifted by 1)

Why overlap (stride < max_length) matters:
- more training examples
- tokens appear in multiple, slightly shifted contexts
- less context loss at chunk boundaries, which is also a big deal in RAG chunking
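
The windowing is easiest to see on a toy stream. The sketch below uses plain Python lists and a hypothetical helper, sliding_windows, on the tokens 0..9 with max_length=4 and stride=2, so consecutive windows overlap by two tokens.

```python
def sliding_windows(tokens, max_length, stride):
    """Hypothetical helper: return (input, target) pairs, target shifted by one token."""
    pairs = []
    # A window starting at `start` needs tokens up to index start + max_length.
    for start in range(0, len(tokens) - max_length, stride):
        x = tokens[start : start + max_length]
        y = tokens[start + 1 : start + max_length + 1]
        pairs.append((x, y))
    return pairs

toy = list(range(10))  # stand-in for a token ID stream
pairs = sliding_windows(toy, max_length=4, stride=2)
for x, y in pairs:
    print(x, "->", y)
```

With stride equal to max_length the windows would not overlap at all, and tokens near a window edge would only ever be seen with truncated context.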


In [4]:
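def build_windows(tokens, max_length=32, stride=16):
    """Slice a token stream into (input, target) windows; targets are inputs shifted by one."""
    X, Y = [], []
    # Each window needs max_length + 1 tokens: max_length inputs plus one extra for the shifted target.
    last_start = len(tokens) - (max_length + 1)
    for start in range(0, max(0, last_start + 1), stride):
        chunk = tokens[start : start + max_length + 1]
        x = chunk[:-1]
        y = chunk[1:]
        X.append(x)
        Y.append(y)
    X = torch.tensor(X, dtype=torch.long)
    Y = torch.tensor(Y, dtype=torch.long)
    return X, Y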

max_length = 32
stride = 16

X, Y = build_windows(token_ids, max_length=max_length, stride=stride)

print("X shape:", tuple(X.shape), " (num_samples, seq_len)")
print("Y shape:", tuple(Y.shape))

print("\nExample sample #0")
print("input IDs :", X[0][:20].tolist(), "...")
print("target IDs:", Y[0][:20].tolist(), "...")
print("\ninput text :", decode(X[0][:60].tolist()))
print("\ntarget text:", decode(Y[0][:60].tolist()))

X shape: (320, 32)  (num_samples, seq_len)
Y shape: (320, 32)

Example sample #0
input IDs : [40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899, 2138, 257, 7026, 15632, 438, 2016, 257, 922, 5891, 1576, 438] ...
target IDs: [367, 2885, 1464, 1807, 3619, 402, 271, 10899, 2138, 257, 7026, 15632, 438, 2016, 257, 922, 5891, 1576, 438, 568] ...

input text : I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear that,

target text:  HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear that, in


## 4) DataLoader batching

Batches make training and embedding computation efficient:
- better GPU/CPU utilization
- consistent tensor shapes
- clean iteration for training loops

Agentic systems use the same batching idea when embedding many chunks or reranking many candidates.
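
The "consistent tensor shapes" point depends on what happens to a leftover partial batch. The sketch below is a pure-Python approximation of how batch sizes come out with and without dropping the last incomplete batch; batch_sizes is a hypothetical helper, not part of PyTorch, and it assumes the usual DataLoader behaviour for drop_last.

```python
def batch_sizes(n_samples, batch_size, drop_last):
    """Hypothetical helper: sizes of the batches produced for n_samples items."""
    full, rest = divmod(n_samples, batch_size)
    sizes = [batch_size] * full
    # Keep the smaller final batch only when drop_last is False.
    if rest and not drop_last:
        sizes.append(rest)
    return sizes

print(len(batch_sizes(320, 4, drop_last=True)))   # 320 divides evenly by 4
print(batch_sizes(318, 4, drop_last=False)[-1])   # size of the leftover batch
print(len(batch_sizes(318, 4, drop_last=True)))   # leftover batch discarded
```

Dropping the last batch costs a few samples per epoch but guarantees every batch has the same shape.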


In [5]:
batch_size = 4
dataset = TensorDataset(X, Y)
loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last=True)

xb, yb = next(iter(loader))
print("Batch X:", xb.shape)
print("Batch Y:", yb.shape)
print("\nDecoded batch[0] input (first ~250 chars):\n", decode(xb[0].tolist())[:250], "...")

Batch X: torch.Size([4, 32])
Batch Y: torch.Size([4, 32])

Decoded batch[0] input (first ~250 chars):
  latter's mysterious abdication. But no--for it was not till after that event that the _rose Dubarry_ drawing-rooms had begun to display ...


## 5) Embeddings: IDs to vectors

The embedding matrix is a learned lookup table that maps each token ID to a dense vector.

### Why do embeddings encode meaning?
Because they are optimized for the learning task: next-token prediction.
Tokens that appear in similar contexts receive similar gradient signals, so their vectors move closer together in the embedding space. Over time, the geometry of that space reflects meaning in the sense of usefulness for prediction.

### How are embeddings related to NN concepts?
An embedding layer is equivalent to a bias-free linear layer applied to a one-hot vector:
- one‑hot(token_id) has a single 1
- multiplying by a weight matrix selects one row → that row is the embedding

So embeddings are standard NN parameters (weights with gradients) learned by backprop, specialized for discrete IDs.
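
The one-hot equivalence can be checked by hand on a tiny table. This is a minimal sketch in plain Python (no torch) with an arbitrary 4-token, 3-dimensional weight matrix; the helper names and the values are assumptions chosen only to keep the arithmetic visible.

```python
# Toy embedding table: 4 tokens, 3 dimensions (values are arbitrary).
W = [
    [0.1, 0.2, 0.3],
    [1.0, 1.1, 1.2],
    [2.0, 2.1, 2.2],
    [3.0, 3.1, 3.2],
]

def one_hot(token_id, vocab_size):
    """Vector of zeros with a single 1.0 at position token_id."""
    return [1.0 if i == token_id else 0.0 for i in range(vocab_size)]

def vec_times_matrix(v, M):
    """Row vector v (length V) times matrix M (V x D), giving a length-D vector."""
    return [sum(v[i] * M[i][d] for i in range(len(M))) for d in range(len(M[0]))]

token_id = 2
via_matmul = vec_times_matrix(one_hot(token_id, len(W)), W)
via_lookup = W[token_id]

print(via_matmul)
print(via_lookup)
```

Both paths return row 2 of the table; the lookup just skips the multiplications by zero, which is why embedding layers index instead of multiplying.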

Transformers usually add positional embeddings to the token embeddings so the model can use token order.


In [6]:
embed_dim = 64

token_embed = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embed_dim).to(device)
pos_embed = nn.Embedding(num_embeddings=max_length, embedding_dim=embed_dim).to(device)

xb, yb = next(iter(loader))
xb = xb.to(device)

tok_vecs = token_embed(xb)                  # (B, T, D)
pos_ids = torch.arange(xb.shape[1], device=device)
pos_vecs = pos_embed(pos_ids)[None, :, :]   # (1, T, D)
x_in = tok_vecs + pos_vecs                  # (B, T, D)

print("Token vectors:", tok_vecs.shape)
print("Position vectors:", pos_vecs.shape)
print("Combined input:", x_in.shape)

some_id = xb[0, 0].item()
row_from_weight = token_embed.weight[some_id]
row_from_lookup = token_embed(torch.tensor([some_id], device=device))[0]

print("\nSame row via weight indexing:", torch.allclose(row_from_weight, row_from_lookup))
print("Token ID example:", some_id)

Token vectors: torch.Size([4, 32, 64])
Position vectors: torch.Size([1, 32, 64])
Combined input: torch.Size([4, 32, 64])

Same row via weight indexing: True
Token ID example: 290


## 6) Experiment: change max_length & stride

We vary max_length and stride and report how many samples each setting creates.

Key idea:
- smaller stride → more overlap → more samples
- overlap is useful because it exposes tokens to multiple contexts and reduces boundary effects


In [7]:
def count_samples(tokens, max_length, stride):
    X, _ = build_windows(tokens, max_length=max_length, stride=stride)
    return X.shape[0]

settings = [
    (16, 16),  # no overlap
    (16, 8),   # 50% overlap
    (32, 32),  # no overlap
    (32, 16),  # 50% overlap
    (64, 64),  # no overlap
    (64, 16),  # heavy overlap (75%)
]

print("Total tokens:", len(token_ids))
print("\nmax_length | stride | num_samples")
print("-" * 32)
for ml, st in settings:
    n = count_samples(token_ids, ml, st)
    print(f"{ml:9d} | {st:6d} | {n:11d}")

print("\nWhy overlap is useful:")
print("Overlap means each token appears in multiple training contexts, so the model learns smoother transitions across window boundaries.")

Total tokens: 5145

max_length | stride | num_samples
--------------------------------
       16 |     16 |         321
       16 |      8 |         642
       32 |     32 |         160
       32 |     16 |         320
       64 |     64 |          80
       64 |     16 |         318

Why overlap is useful:
Overlap means each token appears in multiple training contexts, so the model learns smoother transitions across window boundaries.
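
The counts in the table follow a closed form, because each window needs max_length + 1 tokens (the inputs plus one extra token for the shifted target). The sketch below is a hypothetical helper, num_windows, that reproduces the table without building any tensors.

```python
def num_windows(n_tokens, max_length, stride):
    """Hypothetical helper: number of (input, target) windows for a token stream."""
    # Largest valid start index; negative means the stream is too short for one window.
    last_start = n_tokens - (max_length + 1)
    if last_start < 0:
        return 0
    return last_start // stride + 1

n_tokens = 5145  # total tokens reported above
for ml, st in [(16, 16), (16, 8), (32, 32), (32, 16), (64, 64), (64, 16)]:
    print(ml, st, num_windows(n_tokens, ml, st))
```

Halving the stride roughly doubles the sample count, while changing max_length at a fixed stride barely moves it (320 vs. 318 at stride 16).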
