# Building a Language Model from Scratch

This notebook walks you through the process of building and training a small transformer-based language model from scratch using PyTorch. We will cover the following steps:
1.  **Setting up the Environment**: Importing necessary libraries.
2.  **Data Preparation**: Downloading and preprocessing the TinyStories dataset.
3.  **Tokenization**: Creating a custom tokenizer to convert text into numerical representations.
4.  **Model Definition**: Building the components of a transformer model (Positional Encoding, Multi-Head Attention, Feed-Forward Networks) and assembling them into a language model.
5.  **Training**: Setting up the training loop to train the model on our dataset.
6.  **Inference**: Using the trained model to generate new text.


## 1. Setting up the Environment

First, let's import the necessary libraries that we'll use throughout the notebook. These include `os` for interacting with the file system, `requests` for downloading data from the internet, `zipfile` for handling compressed files, `re` for regular expressions (useful for tokenization), and `Counter` for counting word frequencies.


In [1]:
import os, requests, zipfile, re
from collections import Counter

## 2. Data Preparation

### Downloading the Dataset
We'll be using the "TinyStories" dataset, which is a collection of short stories generated by GPT-3.5 and GPT-4 that are simple enough for a small model to learn from. The code below defines the URL for the dataset and a function to download it if it's not already present in our local directory.


In [2]:
dataset_url = "https://github.com/entropicemergence/tiny_llm_server/releases/download/v0.1.0/TinyStoriesV2-GPT4-small.zip"
datset_folder = "stories"
os.makedirs(datset_folder, exist_ok=True)

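def download_dataset(url):
    """Download the archive at `url` into the dataset folder unless it is already there."""
    filename = url.split("/")[-1]
    filepath = os.path.join(datset_folder, filename)
    if not os.path.exists(filepath):
        response = requests.get(url)
        response.raise_for_status()  # fail loudly instead of saving an HTTP error page
        with open(filepath, "wb") as f:
            f.write(response.content)

download_dataset(dataset_url)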

### Extracting and Loading the Data

After downloading the compressed file, we need to extract its contents. The extraction cell unzips the file and then reads the entire text from the `.txt` file into a single string variable called `stories_text`. This variable is our entire corpus for training the language model.


## 3. Tokenization

Tokenization is the process of converting a sequence of text into a sequence of numerical IDs that our model can understand. We'll create a custom `Tokenizer` class to handle this.

Our tokenizer will use a hybrid approach:
*   **Word-level Tokenization**: It will map the most common words in our vocabulary to unique integer IDs.
*   **Character-level Tokenization**: For words that are not in our vocabulary (out-of-vocabulary or OOV words), it will break them down into individual characters and map those to IDs.

This hybrid strategy allows us to handle any word we encounter while keeping the vocabulary size manageable.

The tokenizer also includes several **special tokens**:
*   `<PAD>`: Padding token, used to make all sequences in a batch the same length.
*   `<UNKNOWN>`: Used for characters that are not in our character vocabulary.
*   `<BOS>`: "Beginning of Sequence" token, marks the start of a text sequence.
*   `<EOS>`: "End of Sequence" token, marks the end of a text sequence.
*   `<CHAR_START>` / `<CHAR_END>`: These tokens signal the beginning and end of a sequence of characters representing an OOV word.
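
As a minimal sketch of the hybrid scheme, the snippet below encodes a sentence against a tiny hand-written vocabulary. The names `toy_words`, `toy_chars`, and `toy_encode`, and the word and character IDs, are made up for illustration; they are not the IDs the real tokenizer assigns.

```python
# Toy illustration of word-level lookup with character-level fallback.
# All names and word/character IDs here are hypothetical, chosen only for this example.
UNKNOWN, CHAR_START, CHAR_END = 1, 4, 5

toy_words = {"the": 6, "cat": 7}   # known words -> one ID each
toy_chars = {"z": 8, "o": 9}       # known characters -> one ID each

def toy_encode(text):
    ids = []
    for word in text.lower().split():
        if word in toy_words:
            ids.append(toy_words[word])   # in-vocabulary: a single ID
        else:
            ids.append(CHAR_START)        # out-of-vocabulary: spell the word out
            ids.extend(toy_chars.get(ch, UNKNOWN) for ch in word)
            ids.append(CHAR_END)
    return ids

print(toy_encode("The cat zoo"))  # [6, 7, 4, 8, 9, 9, 5]
```

The known words cost one ID each, while "zoo" costs five: one per character plus the two markers. This is the trade-off of the hybrid approach: rare words stay representable, but they take up more of the context window.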

The `Tokenizer` cell defines the class, builds the vocabulary from our `stories_text`, and then demonstrates how to encode a sample sentence into token IDs and decode it back into text.


In [3]:
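# Build both paths from the download folder so they match where download_dataset saved the archive
filepath = os.path.join(datset_folder, "TinyStoriesV2-GPT4-small.zip")
with zipfile.ZipFile(filepath, "r") as zip_ref:
    zip_ref.extractall(datset_folder)
with open(os.path.join(datset_folder, "TinyStoriesV2-GPT4-small.txt"), "r", encoding="utf-8") as f:
    stories_text = f.read()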

Now, we'll import the core `PyTorch` libraries. `torch` is the main library, `torch.nn` provides building blocks for neural networks (like layers, activation functions, etc.), and `torch.optim` contains optimization algorithms like Adam.


In [4]:
class Tokenizer:
    def __init__(self, text_source):
        self.text_source = text_source
        self.max_word_vocab = 4000
        self.special_tokens = {'<PAD>':int(0), '<UNKNOWN>':int(1), '<BOS>':int(2), '<EOS>':int(3), '<CHAR_START>':int(4), '<CHAR_END>':int(5)}
        self.punctuation = r'[.,!?;:"\'\-\(\)\[\]{}]'
        self.word_to_id = {}
        self.id_to_word = {}
        self.char_to_id = {}
        self.id_to_char = {}
        self.vocab_len = 0


    def build_vocab(self):
        print (len(self.text_source))
        grouped_stories = []
        story = ""
        for line in self.text_source.split("\n"):
            if line == "<|endoftext|>":
                # story+="\n==============\n"
                grouped_stories.append(story)   
                story = ""
            else:
                story += line + "\n"
        all_words = []
        all_chars = set()
        for story in grouped_stories:
            story = story.lower()
            all_chars.update(story)

            story = re.sub(self.punctuation, r' \g<0> ', story) #Replace punctuation with space + punctuation + space
            tokens = [token for token in story.split() if token.strip()]
            all_words.extend(tokens)

        
        word_counts = Counter(all_words)
        # print (len(all_words))
        # print (word_counts)
        start_id = len(self.special_tokens)
        for word, count in word_counts.most_common(self.max_word_vocab):
            self.word_to_id[word] = start_id
            self.id_to_word[start_id] = word
            start_id += 1
        # print (self.word_to_id)
        # print (self.id_to_word)
        # print (all_chars)
        for char in all_chars:
            self.char_to_id[char] = start_id
            self.id_to_char[start_id] = char
            start_id += 1
        
        self.vocab_len = start_id
        # print (self.char_to_id)
        # print (self.id_to_char)
    def encode(self, text, add_bos=True):
        text = text.lower()
        text = re.sub(self.punctuation, r' \g<0> ', text)
        # print (text)
        tokens = [token for token in text.split() if token.strip()]
        # print (tokens)
        token_ids = []
        if add_bos:  # honor the flag instead of always prepending <BOS>
            token_ids.append(self.special_tokens['<BOS>'])
        for token in tokens:
            if token in self.word_to_id:
                token_ids.append(self.word_to_id[token])
            else:
                token_ids.append(self.special_tokens['<CHAR_START>'])
                for char in token:
                    if char in self.char_to_id:
                        token_ids.append(self.char_to_id[char])
                    else:
                        token_ids.append(self.special_tokens['<UNKNOWN>'])
                token_ids.append(self.special_tokens['<CHAR_END>'])
        return token_ids
    def decode(self, token_ids):
        decoded_text = ""
        for token_id in token_ids:
            if token_id > self.special_tokens['<CHAR_END>']:
                if token_id in self.id_to_word:
                    decoded_text += self.id_to_word[token_id] + " "
                else:
                    decoded_text += self.id_to_char[token_id]
            if token_id == self.special_tokens['<CHAR_END>']:
                decoded_text += " "
        return decoded_text
    def vocab(self):
        vocab = self.special_tokens.copy()
        vocab.update(self.word_to_id)
        vocab.update(self.char_to_id)
        return vocab



simple_tokenizer = Tokenizer(stories_text)
simple_tokenizer.build_vocab()

print (simple_tokenizer.vocab())


token_ids = simple_tokenizer.encode("Hello, world! nfhdg231=*nkdbvfd sunday")
print (token_ids)
print (simple_tokenizer.decode(token_ids))

22493387
[2, 339, 8, 583, 26, 4, 4066, 4036, 4064, 4042, 4031, 4008, 4038, 4020, 4028, 1, 4066, 4015, 4042, 4017, 4046, 4036, 4042, 5, 3024]
hello , world ! nfhdg231=nkdbvfd sunday 


## 4. Model Definition

### Hyperparameters

Before we define the model architecture, let's set up the key hyperparameters. These values control the size and behavior of our model and the training process.

*   `batch_size` (B): The number of independent sequences we process in parallel.
*   `max_context` (T): The maximum number of tokens in a sequence that the model can look at to predict the next token. This is also called the context length or block size.
*   `learning_rate`: The step size for our optimizer. A smaller learning rate can lead to more stable training but might be slower.
*   `device`: We'll use a 'cuda' (GPU) device if available, as it significantly speeds up training. Otherwise, we'll fall back to 'cpu'.
*   `n_embd` (C): The dimensionality of the token embeddings. Each token will be represented by a vector of this size.
*   `n_head`: The number of attention heads in the multi-head attention mechanism.
*   `n_layer`: The number of transformer blocks in our model. A deeper model (more layers) can learn more complex patterns but is also more computationally expensive.
*   `dropout`: A regularization technique to prevent overfitting. It randomly sets a fraction of neuron activations to zero during training.
*   `vocab_size`: The total size of our vocabulary, determined by our tokenizer.
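
A couple of derived quantities follow directly from these settings. The sketch below assumes the values used in this notebook's hyperparameter cell (`batch_size = 24`, `max_context = 512`, `n_embd = 192`, `n_head = 6`); `tokens_per_batch` is a name introduced here for illustration.

```python
# Values copied from the hyperparameter cell of this notebook.
batch_size, max_context = 24, 512
n_embd, n_head = 192, 6

# Each attention head works on an equal slice of the embedding,
# so n_embd has to be divisible by n_head.
assert n_embd % n_head == 0
head_size = n_embd // n_head

# Number of token positions the model predicts in one optimizer step.
tokens_per_batch = batch_size * max_context

print(head_size, tokens_per_batch)  # 32 12288
```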


In [5]:
import torch
import torch.nn as nn
import torch.optim as optim

The positional encoding used in the original Transformer model (from *"Attention Is All You Need"* by Vaswani et al.) is a **fixed** (non-learned) sinusoidal encoding. It adds positional information to token embeddings based on their absolute position in the sequence.

Given a position \( \text{pos} \in \mathbb{N} \) (starting from 0) and a pair index \( i \in \{0, 1, \dots, d_{\text{model}}/2 - 1\} \), the positional encoding \( PE \) at position \( \text{pos} \) fills dimensions \( 2i \) and \( 2i+1 \) as follows:

\[
PE(\text{pos}, 2i) = \sin\left( \frac{\text{pos}}{10000^{2i / d_{\text{model}}}} \right)
\]

\[
PE(\text{pos}, 2i+1) = \cos\left( \frac{\text{pos}}{10000^{2i / d_{\text{model}}}} \right)
\]

### In matrix form:
Let \( \mathbf{P} \in \mathbb{R}^{L \times d_{\text{model}}} \) be the positional encoding matrix for a sequence of length \( L \). Then:

\[
\mathbf{P}_{p, j} =
\begin{cases}
\sin\left( \frac{p}{10000^{j / d_{\text{model}}}} \right) & \text{if } j \text{ is even} \\
\cos\left( \frac{p}{10000^{(j-1) / d_{\text{model}}}} \right) & \text{if } j \text{ is odd}
\end{cases}
\]

where:
- \( p = 0, 1, \dots, L-1 \) is the position,
- \( j = 0, 1, \dots, d_{\text{model}}-1 \) is the dimension.

### Final input to Transformer:
\[
\text{Input Embedding} = \text{Token Embedding} + \text{Positional Encoding}
\]

This encoding allows the model to capture relative and absolute position information via attention mechanisms.

> **Note**: These are **not learned** — they are deterministic functions of position. This is the standard "global" positional encoding used in vanilla Transformers.
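
To make the formula concrete, here is a small plain-Python sketch that fills the matrix \( \mathbf{P} \) entry by entry. The function name `sinusoidal_pe` is ours; the notebook's own `PositionalEncoding` module computes the same table with PyTorch tensor operations.

```python
import math

def sinusoidal_pe(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal positional encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for j in range(0, d_model, 2):            # j = 2i, the even dimension of each pair
            angle = pos / (10000 ** (j / d_model))
            pe[pos][j] = math.sin(angle)
            if j + 1 < d_model:                   # guard for an odd d_model
                pe[pos][j + 1] = math.cos(angle)
    return pe

table = sinusoidal_pe(max_len=4, d_model=6)
print(table[0])  # position 0: sin(0) = 0 in even slots, cos(0) = 1 in odd slots
```

Each sine/cosine pair shares one frequency, and the frequency drops as the dimension index grows, so low dimensions change quickly with position and high dimensions change slowly.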


In [6]:
# Hyperparameters
batch_size = 24 # B
max_context = 512 # T, max sequence length
learning_rate = 3e-4
device = 'cuda' if torch.cuda.is_available() else 'cpu'

n_embd = 192 # C
n_head = 6
n_layer = 6
dropout = 0.2
vocab_size = simple_tokenizer.vocab_len

print(f"Using device: {device}")
print(f"Vocab size: {vocab_size}")

Using device: cuda
Vocab size: 4068


## 5. Training the Model

### Data Preparation for Training

Before we can start training, we need to prepare the data in a format suitable for our model. Here's what the data preparation cell does:

1.  **Tokenize all stories**: It iterates through all the stories in the dataset and encodes them into token IDs using our `simple_tokenizer`.
2.  **Padding/Truncating**: Each story is padded with the `<EOS>` token ID or truncated to ensure all sequences have a length of `max_context`.
3.  **Create a single tensor**: All the tokenized stories are concatenated into a single large tensor of token IDs.
4.  **Reshape into batches**: The tensor is reshaped to have dimensions `(num_batches, batch_size, max_context)`. This creates batches of sequences that can be fed into the model.
5.  **Create input and target data**: For language modeling, the model learns to predict the next token in a sequence. Therefore, `input_data` (`x`) is the sequence of tokens, and `target_data` (`y`) is the same sequence shifted one position ahead, so that the target at each position is the token that follows the input at that position. For example, if the input is "the cat sat", the target is "cat sat on". To make this shift possible, we prepend one extra token (the `<EOS>` ID) to each sequence in `data` before slicing: `input_data` drops the last token and `target_data` drops the first.
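
The padding and shifting steps can be sketched on a short list of IDs. This is an illustration in plain Python rather than the tensor code used in the notebook; `fit_to_length` and `make_input_and_target` are hypothetical helper names, and the word IDs 10 to 12 are made up.

```python
EOS = 3  # <EOS> ID from the tokenizer's special tokens

def fit_to_length(ids, length, fill=EOS):
    """Truncate `ids` to `length`, or right-pad it with `fill` up to `length`."""
    if len(ids) > length:
        return ids[:length]
    return ids + [fill] * (length - len(ids))

def make_input_and_target(ids, fill=EOS):
    """Prepend one `fill` token, then slice into next-token (input, target) lists."""
    padded = [fill] + ids
    return padded[:-1], padded[1:]

story = [2, 10, 11, 12]                  # <BOS> followed by three made-up word IDs
fixed = fit_to_length(story, length=6)   # [2, 10, 11, 12, 3, 3]
x, y = make_input_and_target(fixed)
print(x)  # [3, 2, 10, 11, 12, 3]
print(y)  # [2, 10, 11, 12, 3, 3]
```

At every index, `y` holds the token that comes right after the corresponding token in `x`, which is exactly what the cross-entropy loss compares the model's predictions against.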


In [7]:
import math
from torch.nn import functional as F

# --- Positional Encoding ---
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.1, max_len=max_context):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x is (B, T, C)
        x = x + self.pe[:, :x.size(1), :]
        return self.dropout(x)


# --- Multi-Head Attention (Optimized) ---
class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel - optimized version """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.num_heads = num_heads
        self.head_size = head_size
        self.n_embd = num_heads * head_size
        
        # Single linear layers for all heads combined
        self.key = nn.Linear(self.n_embd, self.n_embd, bias=False)
        self.query = nn.Linear(self.n_embd, self.n_embd, bias=False)
        self.value = nn.Linear(self.n_embd, self.n_embd, bias=False)
        
        self.register_buffer('tril', torch.tril(torch.ones(max_context, max_context)))
        self.attn_dropout = nn.Dropout(dropout)
        self.proj = nn.Linear(self.n_embd, self.n_embd)
        self.proj_dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        
        # Compute Q, K, V for all heads at once
        q = self.query(x)  # (B, T, n_embd)
        k = self.key(x)    # (B, T, n_embd)
        v = self.value(x)  # (B, T, n_embd)
        
        # Split into multiple heads: (B, T, n_embd) -> (B, T, num_heads, head_size) -> (B, num_heads, T, head_size)
        q = q.view(B, T, self.num_heads, self.head_size).transpose(1, 2)
        k = k.view(B, T, self.num_heads, self.head_size).transpose(1, 2)
        v = v.view(B, T, self.num_heads, self.head_size).transpose(1, 2)
        
        # Compute attention scores
        att = (q @ k.transpose(-2, -1)) * (self.head_size ** -0.5)  # (B, num_heads, T, T)
        att = att.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        att = self.attn_dropout(att)
        
        # Apply attention to values
        out = att @ v  # (B, num_heads, T, head_size)
        
        # Concatenate heads: (B, num_heads, T, head_size) -> (B, T, num_heads, head_size) -> (B, T, n_embd)
        out = out.transpose(1, 2).contiguous().view(B, T, self.n_embd)
        
        # Final projection
        out = self.proj_dropout(self.proj(out))
        return out

# --- Feed Forward Network ---
class FeedForward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

# --- Transformer Block ---
class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

class LanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_encoding = PositionalEncoding(n_embd, dropout, max_context)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        x = self.position_encoding(tok_emb)
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -max_context:]
            logits, loss = self(idx_cond)
            logits = logits[:, -1, :]
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx



### Training Loop

Now we'll set up the training loop.
1.  We instantiate our `LanguageModel` and move it to the selected `device`. We also print the total number of parameters to get an idea of the model's size.
2.  We create an `AdamW` optimizer, a popular choice for training transformer models.
3.  We iterate through our prepared data, one batch at a time. For each batch:
    *   We perform a **forward pass**: The model takes the input `xb` and the targets `yb` and computes the logits and the cross-entropy loss.
    *   We perform a **backward pass**:
        *   `optimizer.zero_grad()`: Clears old gradients from the previous step.
        *   `loss.backward()`: Computes the gradients of the loss with respect to the model's parameters.
        *   `optimizer.step()`: Updates the model's parameters using the computed gradients to minimize the loss.
4.  We use `tqdm` to create a progress bar that shows the training progress and the current loss.
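
A sanity check that is not part of the original notebook: a model that spreads probability evenly over the vocabulary has a cross-entropy of ln(vocab_size), so the loss at the very start of training should be roughly that large. The sketch below computes it in plain Python for the vocabulary size of 4068 reported by the notebook; `cross_entropy` is a small helper written for this example, not PyTorch's `F.cross_entropy`.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

vocab_size = 4068
uniform_logits = [0.0] * vocab_size   # every token equally likely
loss = cross_entropy(uniform_logits, target=7)
print(round(loss, 2))                 # ln(4068), about 8.31
```

A first-step loss far above this value would hint at a problem in the data or model wiring. A loss well below it later in training means the model is assigning much more probability to the correct next token than chance would.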


In [8]:
# Tokenize the entire dataset
all_token_ids = []
stories = stories_text.split("<|endoftext|>")
eos_token_id = simple_tokenizer.special_tokens['<EOS>']
for story in stories:
    # print (story)
    token_ids = simple_tokenizer.encode(story, add_bos=True)
    if len(token_ids) > max_context:
        token_ids = token_ids[:max_context]
    else:
        token_ids = token_ids + [eos_token_id] * (max_context - len(token_ids))
    all_token_ids.extend(token_ids)
    # print (len(token_ids))
    # print (token_ids)
data = torch.tensor(all_token_ids, dtype=torch.long)
print (data.shape)


data = data[:data.shape[0]//(batch_size*max_context)*(batch_size*max_context)]



data = data.view(-1, batch_size, max_context)
print (data.shape)
pad = (torch.ones((data.shape[0], data.shape[1], 1), dtype=torch.long) * eos_token_id)
data = torch.cat([pad, data], dim=-1).to(device)
print (data.shape)



input_data = data[:, :, :-1].contiguous()
print (input_data.shape)
target_data = data[:, :, 1:].contiguous()
print (target_data.shape)

# def get_batch(split):
#     data = train_data if split == 'train' else val_data
#     ix = torch.randint(len(data) - max_context, (batch_size,))
#     x = torch.stack([data[i:i+max_context] for i in ix])
#     y = torch.stack([data[i+1:i+max_context+1] for i in ix])
#     x, y = x.to(device), y.to(device)
#     return x, y



torch.Size([14147072])
torch.Size([1151, 24, 512])
torch.Size([1151, 24, 513])
torch.Size([1151, 24, 512])
torch.Size([1151, 24, 512])


## 6. Inference

After training, let's use our model to generate some text. This process is often called inference.

1.  We provide a `prompt` to give the model a starting point.
2.  The prompt is tokenized and converted into a tensor.
3.  We use `torch.no_grad()` to ensure that no gradients are computed, which makes the process more efficient as we are not training.
4.  We call the `model.generate()` method, which takes the context and generates `max_new_tokens` tokens.
5.  Finally, we decode the generated token IDs back into text to see what our model has learned to write! The output will be a short story that starts with our prompt.
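
The sampling step inside `generate` (softmax over the last position's logits, then a random draw) can be illustrated without PyTorch. This is a sketch under assumptions: `softmax` and `sample_next` are helpers written for this example, the logits are made up, and `random.Random.choices` stands in for `torch.multinomial`.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, rng):
    """Draw one token ID at random, weighted by softmax(logits)."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)            # fixed seed so the run is repeatable
logits = [2.0, 0.5, -1.0, 0.0]    # made-up scores for a 4-token vocabulary
probs = softmax(logits)
draws = [sample_next(logits, rng) for _ in range(1000)]
print([round(p, 3) for p in probs])
print(draws.count(0) / 1000)      # close to probs[0]; token 0 is drawn most often
```

Because the next token is sampled rather than chosen greedily, running generation twice on the same prompt will usually produce different stories.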


In [None]:
import random


print (stories[random.randint(0, 1000)])

In [9]:
from tqdm import tqdm
import time

model = LanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(f"{sum(p.numel() for p in m.parameters())/1e6:.2f}M parameters")

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
for j in range(3):
    training_loop = tqdm(range(input_data.shape[0]), desc="Training")
    for iter in training_loop:
        # every once in a while evaluate the loss on train and val sets
        # if iter % eval_interval == 0 or iter == max_iters - 1:
        #     losses = estimate_loss()
        #     print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

        # t1 = time.time()
        # sample a batch of data
        xb, yb = input_data[iter], target_data[iter]
        # t2 = time.time()
        # print(f"Time taken to get batch: {t2-t1:.2f} seconds")
        # t3 = time.time()
        # evaluate the loss
        logits, loss = model(xb, yb)
        # t3 = time.time()
        # print(f"Time taken to get logits: {t3-t2:.2f} seconds")
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
        # t4 = time.time()
        # print(f"Time taken to step: {t4-t3:.2f} seconds")
        training_loop.set_postfix(loss=loss.item())



4.23M parameters


Training: 100%|██████████| 1151/1151 [01:29<00:00, 12.81it/s, loss=1.46]
Training: 100%|██████████| 1151/1151 [01:22<00:00, 13.95it/s, loss=1.3]  
Training: 100%|██████████| 1151/1151 [01:22<00:00, 13.91it/s, loss=1.2]  
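
The `4.23M parameters` figure printed above can be reproduced by hand from the hyperparameters and the layer definitions in the model cell. The sketch below is plain arithmetic; the helper `linear` and the intermediate variable names are ours.

```python
# Values from the hyperparameter cell and its printed output.
vocab_size, n_embd, n_layer = 4068, 192, 6

def linear(n_in, n_out, bias=True):
    """Parameter count of a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)

layer_norm = 2 * n_embd                               # scale and shift vectors

attention = 3 * linear(n_embd, n_embd, bias=False)    # query, key, value
attention += linear(n_embd, n_embd)                   # output projection
feed_forward = linear(n_embd, 4 * n_embd) + linear(4 * n_embd, n_embd)
block = attention + feed_forward + 2 * layer_norm

total = (
    vocab_size * n_embd             # token embedding table
    + n_layer * block               # transformer blocks
    + layer_norm                    # final layer norm
    + linear(n_embd, vocab_size)    # language-model head
)
print(f"{total / 1e6:.2f}M parameters")  # 4.23M parameters
```

The sinusoidal positional table and the causal mask do not appear in the count because the model registers them as buffers, not trainable parameters. Roughly 37% of the parameters sit in the embedding table and the output head, which is typical for a small model with a vocabulary of several thousand tokens.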


In [None]:
loss_accum = []
for j in range (3):
    training_loop = tqdm(range(input_data.shape[0]), desc="Training")
    for iter in training_loop:
        # every once in a while evaluate the loss on train and val sets
        # if iter % eval_interval == 0 or iter == max_iters - 1:
        #     losses = estimate_loss()
        #     print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

        # t1 = time.time()
        # sample a batch of data
        xb, yb = input_data[iter], target_data[iter]
        # t2 = time.time()
        # print(f"Time taken to get batch: {t2-t1:.2f} seconds")
        # t3 = time.time()
        # evaluate the loss
        logits, loss = model(xb, yb)
        # t3 = time.time()
        # print(f"Time taken to get logits: {t3-t2:.2f} seconds")
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
        # t4 = time.time()
        # print(f"Time taken to step: {t4-t3:.2f} seconds")
        training_loop.set_postfix(loss=loss.item())
        loss_accum.append(loss.item())

In [10]:
# generate from the model
# Alternative prompt: "One day, a blue bird named Billy"
prompt = "Once upon a time, in a wild place with big trees, there lived a small bunny"
token_ids = simple_tokenizer.encode(prompt, add_bos=True)
context = torch.tensor([token_ids], dtype=torch.long, device=device)
m.eval()  # switch off dropout while sampling
with torch.no_grad():
    generated_ids = m.generate(context, max_new_tokens=50)[0].tolist()
    print(simple_tokenizer.decode(generated_ids))



once upon a time , in a wild place with big trees , there lived a small bunny in a tree . the bunny liked to make a lot of toys before bugs together . one day , a friend and a bird . the dog was surprised to make a fun in the park near the tree . they also wanted to see what happened . the 
