# Bigram Language Model (from scratch)

In this notebook, we build and train a tiny character-level bigram language model, following [Andrej Karpathy’s "Let’s build GPT from scratch"](https://www.youtube.com/watch?v=kCc8FmEb1nY). We keep things simple and self-contained: load Tiny Shakespeare, tokenize at the character level, train a bigram model (each token predicts the next), and sample text.

- It is best to follow this notebook alongside Andrej's video.



In [None]:
# Download Tiny Shakespeare once (toy dataset)
# Tiny Shakespeare - text from Shakespeare's works concatenated into a single file.

# Save into ./data so the path matches the open("data/input.txt") call below
!mkdir -p data && \
    wget -O data/input.txt \
    https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

## Load Text
Read the entire corpus into memory. It’s small (about 1.1M chars), which is perfect for a quick toy experiment.

In [None]:
# Read the dataset
with open("data/input.txt", "r", encoding="utf-8") as f:
    text = f.read()
print(f"Text length: {len(text)}")
print(text[:500])

## Vocabulary and Tokenization
Use a simple character-level tokenizer: each unique character becomes a token id. It’s crude compared to BPE/SentencePiece, but perfect for understanding the mechanics.

In [None]:
# Build the character vocabulary
chars = sorted(list(set(text)))
vocab_size = len(chars)
print("".join(chars))
print("vocab_size =", vocab_size)

In [None]:
# Character-level tokenizer: char <-> id
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s]  # string -> list[int]
decode = lambda l: "".join([itos[i] for i in l])  # list[int] -> string

print(encode("Hello, there!"))
print(decode(encode("Hello, there!")))

## Numericalize and Split
Convert the full corpus to token ids (a long 1D tensor), then keep 90% for training and 10% for validation.

In [None]:
import torch

data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:100])

In [None]:
# Train/val split
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]

## Context Windows
- Feeding the entire text into a Transformer at once would be computationally prohibitive. When we train a Transformer on a large dataset, we only work with chunks of it: at each step we sample random chunks from the training set and train on just those. These chunks have a maximum length called `block_size`.

- When we plug a chunk into a Transformer, we train it to make predictions at every position simultaneously. A chunk of nine characters actually packs in eight individual examples.
- We train on all of these not just for efficiency, but also so the Transformer gets used to seeing contexts ranging from as little as one character all the way up to `block_size`.
- This is useful later during inference: while sampling, we can start generation with as little as one character of context, and the Transformer knows how to predict the next character for any context length from one up to `block_size`.
- Beyond `block_size`, we have to start truncating, because the Transformer never receives more than `block_size` inputs when predicting the next character.

In [None]:
block_size = 8  # maximum context length
x = train_data[:block_size]
y = train_data[1 : block_size + 1]
for t in range(block_size):
    context = x[: t + 1]
    target = y[t]
    print(f"Example {t+1} -> input {context.tolist()} -> target {int(target)}")

Note: during inference, once the generated sequence grows past `block_size`, we have to truncate the context to its most recent `block_size` tokens, because the Transformer never receives more than `block_size` inputs when predicting the next character.
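
As a rough illustration of that truncation, the sketch below crops a running context to its last `block_size` tokens before it would be passed to a model. The tensor `context` and its contents are made up for the example. The bigram model in this notebook only looks at the last token, so it does not strictly need this step; it matters once a model actually uses the whole window.

```python
import torch

block_size = 8  # same value as in the cell above

# Hypothetical running context of 12 token ids, shape (B=1, T=12)
context = torch.arange(12).unsqueeze(0)

# Keep only the most recent block_size tokens along the time dimension
context_cropped = context[:, -block_size:]

print(context_cropped.shape)     # torch.Size([1, 8])
print(context_cropped.tolist())  # [[4, 5, 6, 7, 8, 9, 10, 11]]
```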

## Mini-batching
Sample random starting positions to build batches of shape `(batch_size, block_size)`. Each row is an independent training example. Batching keeps the GPU/CPU busy and speeds up training.
> In Transformers, we have many batches of multiple chunks of text that are stacked up in a single tensor. This is done for efficiency to keep the GPUs busy, as they are very good at parallel processing of data.

In [None]:
torch.manual_seed(42)
batch_size = 4  # number of sequences processed in parallel


def get_batch(split):
    source = train_data if split == "train" else val_data
    ix = torch.randint(len(source) - block_size, (batch_size,))
    x = torch.stack([source[i : i + block_size] for i in ix])
    y = torch.stack([source[i + 1 : i + 1 + block_size] for i in ix])
    return x, y


xb, yb = get_batch("train")
print("xb", xb.shape)
print("yb", yb.shape)
# Peek a few (context, target) pairs
for xrow, yrow in zip(xb, yb):
    for t in range(block_size):
        print(f"input {xrow[:t+1].tolist()} -> target {int(yrow[t])}")
    print("---")

## Bigram Model
- The bigram language model is probably the simplest language model there is.
- It uses a single embedding table of shape `(vocab_size, vocab_size)`. Given a token id, look up a row and interpret it directly as the logits for the next token.
    > Embedding Table: This is basically a tensor of shape vocab_size by vocab_size. When we pass indices here, every single integer in our input refers to this embedding table and plucks out a row from it. We interpret that row as the logits, which are the scores for the next character in the sequence.
    > What's happening is we're predicting what comes next based solely on the individual identity of a single token.
- There’s no context mixing here; it’s a pure bigram model.
- Training is done with `cross-entropy loss`.
    > Cross-entropy loss is the negative log likelihood loss. It is computed between the predictions and the targets, and it measures the quality of the logits with respect to the targets.
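
To make the two ideas above concrete, here is a minimal sketch on a toy vocabulary of 5 tokens (the size, the seed, and the example ids are arbitrary choices for illustration). It checks that an embedding lookup just selects rows of the table, and that `F.cross_entropy` matches a negative log likelihood computed by hand.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
vocab_size = 5  # toy vocabulary, chosen only for illustration

table = nn.Embedding(vocab_size, vocab_size)

inputs = torch.tensor([0, 3, 3, 1])   # current tokens
targets = torch.tensor([3, 3, 1, 4])  # tokens that actually follow them

# Lookup: each input id selects one row of the table as next-token logits
logits = table(inputs)  # shape (4, vocab_size)
same_rows = torch.equal(logits, table.weight[inputs])

# Negative log likelihood by hand: log-softmax, pick the target entry, average
log_probs = F.log_softmax(logits, dim=-1)
manual_nll = -log_probs[torch.arange(len(targets)), targets].mean()

builtin = F.cross_entropy(logits, targets)
print(same_rows, manual_nll.item(), builtin.item())
```

Note that both occurrences of token `3` in `inputs` get identical logits, which is exactly the "no context" limitation of a bigram model.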


In [None]:
import torch.nn as nn
from torch.nn import functional as F
from einops import rearrange  # third-party package: pip install einops


class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size: int):
        super().__init__()
        # Each token id maps to a row of logits for the next token
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, inputs, targets=None):
        # inputs: (B, T) -> logits: (B, T, C) with C=vocab_size
        logits = self.token_embedding_table(inputs)
        loss = None
        if targets is not None:
            # Flatten batch and time for cross-entropy
            logits = rearrange(logits, "b t c -> (b t) c")
            targets = rearrange(targets, "b t -> (b t)")
            loss = F.cross_entropy(logits, targets)
        return logits, loss

    def generate(self, inputs, max_new_tokens: int):
        # Autoregressively sample `max_new_tokens` tokens
        for _ in range(max_new_tokens):
            logits, _ = self(inputs)  # (B, T, C)
            logits = logits[:, -1, :]  # only last step: (B, C)
            probs = F.softmax(logits, dim=-1)  # convert to probabilities
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            inputs = torch.cat((inputs, idx_next), dim=1)  # append
        return inputs


m = BigramLanguageModel(vocab_size)
out, loss = m(xb, yb)
print("logits shape:", out.shape, "| loss:", loss.item())

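With 65 possible tokens, a model that predicts a perfectly uniform distribution would have a loss of -ln(1/65), which is approximately 4.17. We're getting about 4.72 instead. This tells us that the initial predictions are not perfectly diffuse: the randomly initialized embedding table already favors some characters over others, so the logits carry a little bit of (arbitrary) structure.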
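
The baseline can be checked numerically. The sketch below is independent of the notebook's model: it uses random stand-in targets, all-zero logits for the uniform case, and standard-normal logits to mimic `nn.Embedding`'s default initialization. The sample count of 4096 is an arbitrary choice to keep the estimate stable.

```python
import math
import torch
from torch.nn import functional as F

torch.manual_seed(0)
vocab_size = 65
n = 4096  # number of stand-in predictions (arbitrary)

targets = torch.randint(vocab_size, (n,))

# All-zero logits -> uniform distribution over the vocabulary
uniform_logits = torch.zeros(n, vocab_size)
uniform_loss = F.cross_entropy(uniform_logits, targets).item()

# Standard-normal logits, like a freshly initialized nn.Embedding row
random_logits = torch.randn(n, vocab_size)
random_loss = F.cross_entropy(random_logits, targets).item()

print(f"ln(65)       = {math.log(vocab_size):.4f}")
print(f"uniform loss = {uniform_loss:.4f}")
print(f"random loss  = {random_loss:.4f}")
```

The uniform loss matches ln(65), while the random-logit loss comes out higher, which is consistent with the untrained model starting above 4.17.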

In [None]:
# Untrained sample (will be gibberish, but shows the flow)
start = torch.zeros((1, 1), dtype=torch.long)
print(decode(m.generate(start, max_new_tokens=100)[0].tolist()))

## Train
Use the `AdamW` optimizer. A typical good setting for the learning rate is roughly `3e-4`, but tiny models like this one can get away with a relatively large learning rate (`1e-3`).

In [None]:
batch_size = 32
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [None]:
for step in range(10000):
    xb, yb = get_batch("train")
    _, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    if step % 200 == 0:
        print(f"step {step:5d} | loss {loss.item():.4f}")
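
The loop above prints the loss of a single training batch, which is noisy, and it never touches the validation split. One possible way to get a steadier number is to average the loss over several batches per split, as sketched below. This is a self-contained illustration with stand-in pieces: the random `splits` data, the toy sizes, and the helper names `loss_fn` and `estimate_loss` are assumptions for the example; in the notebook, `train_data`, `val_data`, and `m` would take their place.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
vocab_size, block_size, batch_size = 10, 8, 16  # toy values

# Stand-in data: random token ids instead of the real corpus
splits = {
    "train": torch.randint(vocab_size, (1000,)),
    "val": torch.randint(vocab_size, (200,)),
}


def get_batch(split):
    source = splits[split]
    ix = torch.randint(len(source) - block_size, (batch_size,))
    x = torch.stack([source[i : i + block_size] for i in ix])
    y = torch.stack([source[i + 1 : i + 1 + block_size] for i in ix])
    return x, y


table = nn.Embedding(vocab_size, vocab_size)  # stand-in bigram model


def loss_fn(x, y):
    logits = table(x)  # (B, T, C)
    return F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))


@torch.no_grad()  # evaluation only, so no gradients are tracked
def estimate_loss(eval_iters=50):
    out = {}
    for split in ("train", "val"):
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            x, y = get_batch(split)
            losses[k] = loss_fn(x, y)
        out[split] = losses.mean().item()
    return out


print(estimate_loss())
```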

## Sample
Finally, generate text by sampling one token at a time from the model’s predicted distribution.

In [None]:
start = torch.zeros((1, 1), dtype=torch.long)
print(decode(m.generate(start, max_new_tokens=300)[0].tolist()))
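
The sampling step inside `generate` can also be looked at in isolation. The sketch below uses hand-picked logits over a made-up three-token vocabulary, converts them to probabilities with softmax, and draws many samples with `torch.multinomial` to show that the empirical frequencies track those probabilities.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)

# Hand-picked logits for a hypothetical 3-token vocabulary, shape (B=1, C=3)
logits = torch.tensor([[2.0, 0.0, -1.0]])
probs = F.softmax(logits, dim=-1)

# Draw many samples (with replacement) from the same distribution
samples = torch.multinomial(probs, num_samples=10000, replacement=True)

# Empirical frequency of each token id
counts = torch.bincount(samples[0], minlength=3)
freqs = counts.float() / counts.sum()

print("probs:", [round(p, 3) for p in probs[0].tolist()])
print("freqs:", [round(f, 3) for f in freqs.tolist()])
```

Because each token is drawn at random rather than taken as the argmax, two runs of `generate` with different seeds will usually produce different text.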