## makemore -- mlp

following andrej karpathy's course (https://www.youtube.com/watch?v=TCH_1BHY58I)

his github: https://github.com/karpathy/makemore

### high level

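the model described here is a *word* level language model with a defined vocabulary size of *n* words. each word is represented by an embedding vector. in this notebook the same architecture is applied at the character level, so the "words" are the 27 characters of the vocab built below.

predict the next word given the previous words. the embedding vectors are tuned during training, together with the network weights.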

# <img src="./images/mlp.png" width="500" style="margin: 0 auto; display: block;"/>

- predict next word using 3 previous words
- each input word is an integer index between 0 and *n-1*
- lookup table *C* is an *n x d* matrix, where *d* is the embedding dimension (30 in this case) -> with 3 context words the first layer has *3 x 30 = 90* neurons
- hidden layer: fully connected with a tanh activation. its size is a hyperparameter (a design choice up to the designer of the neural net) and can be as large or small as one wants. each neuron here is connected to all the neurons in the prev layer + next layer
- output layer: *n* neurons, one for each word in the vocabulary
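
the shapes in the list above can be traced end to end with a small sketch. the vocabulary size `n = 1000`, the hidden size `200` and the batch of 4 random examples are assumptions made only for illustration; `d = 30` and the 3-word context come from the notes above.

```python
import torch

# n and hidden are illustrative assumptions, not values from the notes
n, d, context, hidden = 1000, 30, 3, 200

C = torch.randn(n, d)                     # lookup table: one d-dim row per word
W1 = torch.randn(context * d, hidden)     # hidden layer weights
b1 = torch.randn(hidden)
W2 = torch.randn(hidden, n)               # output layer weights
b2 = torch.randn(n)

idx = torch.randint(0, n, (4, context))   # 4 examples, each 3 word indices in [0, n-1]
emb = C[idx]                              # (4, 3, 30)
x = emb.view(-1, context * d)             # (4, 90): the 3 x 30 input neurons
h = torch.tanh(x @ W1 + b1)               # (4, 200)
logits = h @ W2 + b2                      # (4, n): one score per word in the vocab
probs = torch.softmax(logits, dim=1)      # each row is a distribution over the next word
```
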

In [1]:
!curl -O https://raw.githubusercontent.com/karpathy/makemore/refs/heads/master/names.txt
!pip install torch


In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F

words = open('names.txt', 'r').read().splitlines()

In [4]:
# building the vocab

chars = sorted(list(set("".join(words))))
stoi = {c: i + 1 for i, c in enumerate(chars)}
stoi['.'] = 0
itos = {i: c for c, i in stoi.items()}
print(itos)


{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 0: '.'}


In [19]:
# building the dataset. taking three chars at a time, and assigning the output label as the fourth char

block_size = 3 # context length
X, Y = [], []

for word in words[:5]:
    context = [0] * block_size
    print(word)
    for ch in word + '.':
        ix = stoi[ch]
        X.append(context)
        Y.append(ix)
        # the example
        print(f'{"".join(itos[i] for i in context)} -> {ch}')
        context = context[1:] + [ix]
    print()

X = torch.tensor(X)
Y = torch.tensor(Y)

X.shape, Y.shape # each input to the neural net is 3 integers; the first 5 words give 32 examples (228146 when all words are used)

emma
... -> e
..e -> m
.em -> m
emm -> a
mma -> .

olivia
... -> o
..o -> l
.ol -> i
oli -> v
liv -> i
ivi -> a
via -> .

ava
... -> a
..a -> v
.av -> a
ava -> .

isabella
... -> i
..i -> s
.is -> a
isa -> b
sab -> e
abe -> l
bel -> l
ell -> a
lla -> .

sophia
... -> s
..s -> o
.so -> p
sop -> h
oph -> i
phi -> a
hia -> .



(torch.Size([32, 3]), torch.Size([32]))

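we now construct the lookup table *C*. *C* has shape *n x d* (27 x 2 here), where *n* is the vocabulary size and *d* is the embedding dimension. indexing it with *X* gives the embeddings, with shape *e x c x d*, where *e* is the number of examples and *c* is the context length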

In [None]:
C = torch.randn((27, 2)) # 27 chars in the vocab, and each char is represented with a 2 dimensional vector
emb = C[X] # get all embeddings for the input simultaneously
emb.shape

torch.Size([32, 3, 2])
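
indexing into the lookup table is equivalent to multiplying a one-hot encoding of the index by *C*, which is why the table can be thought of as a first linear layer without bias. a minimal sketch of that equivalence, using a fresh random `C` and three arbitrary indices chosen for illustration:

```python
import torch
import torch.nn.functional as F

C = torch.randn((27, 2))                  # same shape as the lookup table above
ix = torch.tensor([5, 0, 13])             # arbitrary char indices, for illustration only

one_hot = F.one_hot(ix, num_classes=27).float()  # (3, 27), a single 1 per row
via_matmul = one_hot @ C                  # each row selects one row of C
via_index = C[ix]                         # direct lookup, no matmul needed

same = torch.allclose(via_matmul, via_index)
```

indexing is preferred in practice because it skips building the one-hot matrix and the multiplication by zeros.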

we now construct the hidden layer

In [22]:
W1 = torch.randn((6, 100)) # 6 inputs to each neuron (3 chars x 2 embedding dimensions) and 100 neurons in total (our choice)
b1 = torch.randn((100))

In [None]:
# do a little preprocessing to make matmul possible
# concatenate embeddings for each element in context
torch.cat([emb[:, 0, :], emb[:, 1, :], emb[:, 2, :]], dim=1)

# the code below does the same thing for any context length, but is still inefficient from a memory/performance perspective because torch.cat allocates a new tensor and copies the data
torch.cat(torch.unbind(emb, dim=1), dim=1)

# the code below is the cheapest option: view returns a tensor that shares the same underlying storage and only changes the shape/stride metadata, so no data is copied
# to see why, call tensor.storage() (untyped_storage() in recent torch versions) to see that the values are stored as one flat sequence
emb.view(32, 6)


tensor([[-0.7511,  0.3032, -0.7511,  0.3032, -0.7511,  0.3032],
        [-0.7511,  0.3032, -0.7511,  0.3032,  0.3078, -0.2471],
        [-0.7511,  0.3032,  0.3078, -0.2471,  1.3071,  1.0924],
        [ 0.3078, -0.2471,  1.3071,  1.0924,  1.3071,  1.0924],
        [ 1.3071,  1.0924,  1.3071,  1.0924, -0.6194,  0.1814],
        [-0.7511,  0.3032, -0.7511,  0.3032, -0.7511,  0.3032],
        [-0.7511,  0.3032, -0.7511,  0.3032,  1.8743,  1.4657],
        [-0.7511,  0.3032,  1.8743,  1.4657,  0.1482,  1.1311],
        [ 1.8743,  1.4657,  0.1482,  1.1311,  0.5089,  0.7235],
        [ 0.1482,  1.1311,  0.5089,  0.7235, -1.5288,  0.7637],
        [ 0.5089,  0.7235, -1.5288,  0.7637,  0.5089,  0.7235],
        [-1.5288,  0.7637,  0.5089,  0.7235, -0.6194,  0.1814],
        [-0.7511,  0.3032, -0.7511,  0.3032, -0.7511,  0.3032],
        [-0.7511,  0.3032, -0.7511,  0.3032, -0.6194,  0.1814],
        [-0.7511,  0.3032, -0.6194,  0.1814, -1.5288,  0.7637],
        [-0.6194,  0.1814, -1.5288,  0.7
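
a quick way to check the claim about `view` versus `torch.cat` is to compare memory addresses with `data_ptr()`. this is a sketch on a fresh random tensor with the same shape as `emb`, not on the values printed above:

```python
import torch

emb = torch.randn(32, 3, 2)               # stand-in with the same shape as emb above

flat_view = emb.view(32, 6)
flat_cat = torch.cat(torch.unbind(emb, dim=1), dim=1)

same_values = torch.equal(flat_view, flat_cat)          # both give the same numbers
view_shares = flat_view.data_ptr() == emb.data_ptr()    # view points at emb's memory
cat_shares = flat_cat.data_ptr() == emb.data_ptr()      # cat allocated new memory

# writing through the view changes emb as well, since the storage is shared
flat_view[0, 0] = 42.0
wrote_through = emb[0, 0, 0].item() == 42.0
```
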

In [None]:
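# hidden layer pre-activations (tanh is applied later, in the training loop)
# -1 is a wildcard dimension: torch infers it from the total number of elements (32 * 3 * 2 / 6 = 32 rows)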
h = emb.view(-1, 6) @ W1 + b1
h

tensor([[-2.5970e+00, -4.4699e-03,  6.8337e-01,  ...,  2.7602e+00,
         -2.4164e-01, -3.3464e-01],
        [-1.4067e+00,  1.5983e-01,  8.3273e-01,  ...,  2.2253e+00,
          1.3396e+00, -6.0770e-01],
        [-1.0316e+00, -1.5090e+00,  1.6284e+00,  ..., -2.8019e-01,
          4.7762e+00,  3.2909e+00],
        ...,
        [-1.4183e-02,  3.0865e-01, -4.8807e+00,  ..., -1.2536e+00,
         -5.2495e-01,  3.2218e-01],
        [-2.4998e+00, -8.6947e-01,  1.6189e+00,  ...,  1.7346e+00,
          1.0656e+00,  4.9538e+00],
        [-2.5061e+00, -2.3514e-01, -2.4876e-01,  ...,  1.5671e+00,
          1.5486e-01, -1.8744e-01]])
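
the `+ b1` in the cell above relies on broadcasting: `b1` has shape `(100,)`, is treated as `(1, 100)`, and is added to every one of the 32 rows. a small sketch that spells this out, using fresh random tensors with the same shapes as above:

```python
import torch

emb = torch.randn(32, 3, 2)
W1 = torch.randn(6, 100)
b1 = torch.randn(100)

x = emb.view(-1, 6)                       # -1 is inferred as 32 (192 elements / 6)
pre = x @ W1                              # (32, 100)
h = pre + b1                              # b1 is broadcast across the 32 rows

# the same addition with the broadcast written out explicitly
manual = pre + b1.unsqueeze(0).expand(32, 100)
matches = torch.equal(h, manual)
```
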

output layer

In [43]:
# 100 inputs to each neuron, and 27 neurons in total (one for each char)
W2 = torch.randn((100, 27))
b2 = torch.randn((27))

logits = h @ W2 + b2
logits.shape

torch.Size([32, 27])

In [62]:
counts = logits.exp()
probs = counts / counts.sum(dim=1, keepdim=True)
loss = -probs[torch.arange(probs.shape[0]), Y].log().mean()

# F.cross_entropy gives the same value as the manual computation above
# better because torch does not need to materialize the intermediate tensors (counts, probs)
# it can use fused operations and a numerically stable log-softmax, so large logits do not overflow in exp()

ce_loss = F.cross_entropy(logits, Y)
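
one practical difference between the two versions shows up with large logits. the sketch below uses a hand-picked logit of 100 (an assumption chosen to force overflow in float32) to compare the manual softmax with `F.cross_entropy`:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[-5.0, 0.0, 3.0, 100.0]])  # hand-picked to overflow exp() in float32
Y = torch.tensor([3])                              # the correct class is the large logit

# manual version: exp(100) is inf in float32, and inf / inf is nan
counts = logits.exp()
probs = counts / counts.sum(dim=1, keepdim=True)
manual_loss = -probs[torch.arange(1), Y].log().mean()

# built-in version stays finite
ce_loss = F.cross_entropy(logits, Y)

print(manual_loss.item(), ce_loss.item())
```
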

summary before training nn

In [52]:
params = [C, W1, b1, W2, b2]
sum(p.nelement() for p in params)

3481
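
the total of 3481 can be checked by hand from the shapes defined above. the variable names below are introduced only for this breakdown:

```python
vocab, emb_dim, block_size, hidden = 27, 2, 3, 100

n_C = vocab * emb_dim                     # lookup table: 27 x 2 = 54
n_W1 = block_size * emb_dim * hidden      # hidden weights: 6 x 100 = 600
n_b1 = hidden                             # hidden biases: 100
n_W2 = hidden * vocab                     # output weights: 100 x 27 = 2700
n_b2 = vocab                              # output biases: 27

total = n_C + n_W1 + n_b1 + n_W2 + n_b2
print(total)
```
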

run the training loop: forward pass, backward pass, parameter update

In [65]:
for p in params:
    p.requires_grad = True

for _ in range(100):
    emb = C[X]
    # tanh introduces a non-linearity; without it the two layers would collapse into a single linear function
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1)
    logits = h @ W2 + b2
    loss = F.cross_entropy(logits, Y)
    for p in params:
        p.grad = None
    loss.backward()
    for p in params:
        p.data += -0.1 * p.grad

print(loss.item())

0.27411824464797974
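
the loss is this low because 3481 parameters are fitting only 32 examples, so the network is memorizing them. it still cannot reach zero: the context `...` appears 5 times in the dataset above, each time with a different target. as a rough sanity check, the sketch below computes the best mean loss any model could reach on these 32 examples by predicting the empirical target distribution for each context. the word list is copied from the printed output above, and the helper names are made up for this check.

```python
import math
from collections import Counter, defaultdict

words = ["emma", "olivia", "ava", "isabella", "sophia"]  # the 5 names printed above
block_size = 3

# count which targets follow each 3-char context
targets = defaultdict(Counter)
for word in words:
    context = "." * block_size
    for ch in word + ".":
        targets[context][ch] += 1
        context = context[1:] + ch

n_examples = sum(sum(c.values()) for c in targets.values())

# best achievable mean negative log-likelihood:
# for each context, assign each target its empirical frequency
floor = -sum(
    k * math.log(k / sum(c.values()))
    for c in targets.values()
    for k in c.values()
) / n_examples

print(n_examples, floor)
```

only the `...` context is ambiguous here, so the floor works out to 5 * ln(5) / 32, roughly 0.25, which is close to the loss reached after 100 steps.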
