### MakeMore Version 2: Neural Probabilistic Language Model with Character Embeddings
#### Description:
- This version builds on the original bigram model by introducing a multi-layer perceptron (MLP) to learn character-level language modeling using neural networks.
- It leverages an embedding layer to represent characters in a lower-dimensional space and trains a simple neural network to predict the next character in a name sequence.
- Compared to Version 1, this version moves from counting-based statistics to learnable parameters and backpropagation.


In [27]:
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt # for making figures

%matplotlib inline


## 📦 Data Preparation

We begin by loading the dataset and encoding names as sequences of integer character indices. The character vocabulary has 27 entries: the 26 lowercase letters plus a special token ".", which pads the start of each name and marks its end.

We will extract training examples of 3-character blocks (context) used to predict the next character.
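
To make the sliding-window idea concrete, here is a minimal sketch on a two-word toy list. The names `toy_words` and `build_examples` are illustrative assumptions and not part of the notebook; the mapping scheme and the window update mirror the cells that follow.

```python
# Toy word list standing in for names.txt (an assumption for this sketch).
toy_words = ["emma", "ava"]

# Same mapping scheme as the notebook: '.' -> 0, letters -> 1..N in sorted order.
chars = sorted(set("".join(toy_words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi["."] = 0
itos = {i: s for s, i in stoi.items()}


def build_examples(words, block_size=3):
    """Return (contexts, targets) as lists of integer indices."""
    contexts, targets = [], []
    for w in words:
        context = [0] * block_size          # start with '...' padding
        for ch in w + ".":
            ix = stoi[ch]
            contexts.append(context)        # the previous block_size characters
            targets.append(ix)              # the character to predict
            context = context[1:] + [ix]    # slide the window by one character
    return contexts, targets


contexts, targets = build_examples(toy_words)
for ctx, tgt in zip(contexts, targets):
    print("".join(itos[i] for i in ctx), "--->", itos[tgt])
```

Each word of length n contributes n + 1 examples, because the closing "." is also a prediction target. That is why the two toy words yield 5 + 4 = 9 rows.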


In [30]:
# read in all the words
words = open('names.txt', 'r').read().splitlines()
words[:8]

['emma', 'olivia', 'ava', 'isabella', 'sophia', 'charlotte', 'mia', 'amelia']

In [31]:
len(words)

32033

In [32]:
# build the vocabulary of characters and mappings to/from integers
chars = sorted(list(set(''.join(words))))
stoi = {s:i+1 for i,s in enumerate(chars)}
stoi['.'] = 0
itos = {i:s for s,i in stoi.items()}
print(itos)

{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 0: '.'}


In [71]:
block_size = 3  # context length: number of previous characters used to predict the next one

X, Y = [], []

for w in words:
    # print(w)
    context = [0] * block_size
    for ch in w + ".":
        ix = stoi[ch]
        X.append(context)
        Y.append(ix)
        # print("X:" , X ,  "Y:" , Y)
        # print(''.join(itos[i] for i in context), '--->', itos[ix])
        context = context[1:] + [ix]

X = torch.tensor(X)
Y = torch.tensor(Y)


In [36]:
X.shape,X.dtype,  Y.shape, Y.dtype

(torch.Size([228146, 3]), torch.int64, torch.Size([228146]), torch.int64)

### Let's Build the Embedding Table (Lookup Table) for Our Character-Level Model

The dataset is now ready for predicting the next character, so we can feed it into a neural network. Before doing so, we map each of our 27 characters to a lower-dimensional vector, here with two dimensions.
- We are reimplementing the architecture from "A Neural Probabilistic Language Model" (Bengio et al., 2003).
- The paper works with a vocabulary of 17,000 words; ours is a character-level model, so the vocabulary has only 27 entries.
- The paper compresses its 17,000-dimensional inputs to 30 dimensions, so let's compress ours to 2.

In [37]:
# Lookup table C
# These entries are weights too: they are adjusted during backpropagation
C = torch.rand(27, 2) # initialize the lookup table with random numbers
C

tensor([[0.2170, 0.7707],
        [0.2801, 0.6671],
        [0.9972, 0.9303],
        [0.8709, 0.1039],
        [0.7985, 0.2169],
        [0.1397, 0.9650],
        [0.6289, 0.9597],
        [0.6178, 0.9784],
        [0.4439, 0.0642],
        [0.8414, 0.3987],
        [0.4577, 0.2143],
        [0.3336, 0.3532],
        [0.6804, 0.3702],
        [0.4153, 0.2636],
        [0.9702, 0.2330],
        [0.2867, 0.5706],
        [0.0337, 0.2609],
        [0.7708, 0.3116],
        [0.6444, 0.6671],
        [0.9819, 0.5261],
        [0.7613, 0.5661],
        [0.2441, 0.4689],
        [0.3162, 0.2558],
        [0.3457, 0.4985],
        [0.9894, 0.7684],
        [0.1325, 0.8925],
        [0.1049, 0.3822]])

In [38]:
C[5]

tensor([0.1397, 0.9650])

In [39]:
# Step 1: Generate Character Embeddings

# Suppose we want to encode the character represented by index 5 in a vocabulary of 27 characters.

# Why do we encode it this way?
# --- To generate a dense embedding for each character, we first convert the index into a one-hot vector of size 27.
# --- Then, we extract its corresponding embedding vector by performing a matrix multiplication with the embedding table (lookup table C).

F.one_hot(torch.tensor([5 ]), 27).dtype

torch.int64

In [40]:
# Multiplying the one-hot vector by C picks out row 5 of C, i.e. the embedding of character 5.
# While this works, PyTorch offers a simpler and more efficient route:
# instead of one-hot encoding and multiplying, we can index directly into C with the character indices.
F.one_hot(torch.tensor([5 ]), 27).float() @ C


tensor([[0.1397, 0.9650]])
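
The equivalence between the one-hot product and plain indexing can be checked directly. The sketch below uses its own seeded table `C_demo` (a stand-in for `C`, introduced only for this example) so that it runs on its own.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)       # arbitrary seed for this sketch
C_demo = torch.rand(27, 2, generator=g)    # stand-in for the lookup table C

idx = torch.tensor([5, 6, 7, 7, 7])

# Route 1: each one-hot row selects one row of the table via a matrix product.
via_matmul = F.one_hot(idx, num_classes=27).float() @ C_demo

# Route 2: integer indexing returns the same rows directly.
via_index = C_demo[idx]

print(via_matmul.shape, via_index.shape)
print(torch.allclose(via_matmul, via_index))
```

Indexing avoids building the 27-wide one-hot matrix, which is why it is the preferred form in the rest of the notebook.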

In [41]:
C[torch.tensor([5, 6, 7,7,7  ])]

tensor([[0.1397, 0.9650],
        [0.6289, 0.9597],
        [0.6178, 0.9784],
        [0.6178, 0.9784],
        [0.6178, 0.9784]])

In [58]:
# We can also index with a multi-dimensional tensor of indices to retrieve a whole batch of embeddings.
# In this case:
# - X is a (228146, 3) tensor containing character indices.
# - C is our embedding table of shape (27, 2), where 27 is the vocab size and 2 is the embedding dimension.
# - Indexing C with X (i.e., C[X]) gives shape (228146, 3, 2): one embedding vector for each character in X.
C[X].shape

torch.Size([228146, 3, 2])

In [43]:
# Let's check this: look up the index stored at row 13, position 2 of X
X[13, 2]

tensor(1)

In [44]:
C[X][13, 2]

tensor([0.2801, 0.6671])

In [45]:
# C[X][13, 2] is simply row X[13, 2] = 1 of the lookup table, so it equals C[1]
C[1]

tensor([0.2801, 0.6671])

In [63]:
# Embeddings for every context in the dataset.
embed = C[X]
print(embed.shape)  # Shape: (228146, 3, 2)

# Let's begin with the first linear (fully connected) layer of our MLP.
W1 = torch.randn(6, 100)  # Weight matrix: input_dim=6, output_dim=100
b1 = torch.randn(100)     # Bias vector
print(W1.shape)  # (6, 100)

# We need to perform the affine transformation: output = embed @ W1 + b1
# However, we can't directly multiply embed with W1 in its current shape.
# Current embed shape: (228146, 3, 2) → 228146 samples, each with 3 characters, each embedded into 2D.
# So, each sample has 3 * 2 = 6 total input features. We need to flatten the last two dimensions.

embed = embed.view(embed.shape[0], -1)  # Reshape to (228146, 6)
print(embed.shape)

# Now we can safely perform the matrix multiplication with W1 and add the bias.
output = embed @ W1 + b1  # Resulting shape: (228146, 100)
print(output.shape)

torch.Size([228146, 3, 2])
torch.Size([6, 100])
torch.Size([228146, 6])
torch.Size([228146, 100])


In [66]:
# Goal: turn the (228146, 3, 2) embeddings C[X] into a (228146, 6) input matrix.
emb3d = C[X]  # (228146, 3, 2); `embed` itself was already flattened in the previous cell

# Option 1: extract each position's embedding (dim 2) and concatenate along the feature dimension:
# torch.cat([emb3d[:, 0, :], emb3d[:, 1, :], emb3d[:, 2, :]], dim=1)

# This works for a fixed block size of 3, but the three slices are hard-coded,
# so it has to be rewritten whenever block_size changes.
# Option 2: use torch.unbind along dim=1 to unpack the blocks for any block size:
# blocks = torch.unbind(emb3d, dim=1)  # tuple: (emb3d[:,0,:], emb3d[:,1,:], emb3d[:,2,:])
# print(torch.cat(blocks, dim=1).shape)  # (228146, 6)

# Option 3: the cleanest and most efficient way is Tensor.view():
# - It doesn't copy memory; it only reinterprets the shape of the same storage.
# - It combines the 3 characters × 2-dim embeddings into a single 6-dim input per sample.
flattened_embed = emb3d.view(emb3d.shape[0], -1)
print(flattened_embed.shape)  # (228146, 6)

torch.Size([228146, 6])
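
The three flattening routes discussed above can be compared on a small tensor. This is a sketch on toy data (`emb_demo` is a hypothetical stand-in for `C[X]`), checking that the results match and that only `view` shares memory with its input.

```python
import torch

g = torch.Generator().manual_seed(0)          # arbitrary seed for this sketch
emb_demo = torch.rand(4, 3, 2, generator=g)   # toy stand-in for C[X]: (batch, block_size, emb_dim)

# Route 1: hard-coded slices, tied to block_size == 3.
via_cat = torch.cat([emb_demo[:, 0, :], emb_demo[:, 1, :], emb_demo[:, 2, :]], dim=1)

# Route 2: unbind along dim=1, works for any block size.
via_unbind = torch.cat(torch.unbind(emb_demo, dim=1), dim=1)

# Route 3: reinterpret the shape without copying.
via_view = emb_demo.view(emb_demo.shape[0], -1)

print(via_cat.shape, via_unbind.shape, via_view.shape)
print(torch.equal(via_cat, via_view), torch.equal(via_unbind, via_view))

# view reuses the original storage; cat allocates a new tensor.
print(via_view.data_ptr() == emb_demo.data_ptr())
print(via_cat.data_ptr() == emb_demo.data_ptr())
```

The results agree because the tensor is stored row by row, so each sample's three 2-dimensional embeddings already sit next to each other in memory.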


#### Forward Pass through the MLP: Adding Non-Linearity and Output Layer ####

- We have 228146 input samples (3-character contexts).
- Each input is flattened into a 6-dimensional vector after embedding.
- The flow is as follows:
- idx [228146, 3] → Look-up table → [228146, 3, 2] → flatten → [228146, 6] → Linear Layer (W1, b1) → [228146, 100] → tanh → [228146, 100] → Linear Layer (W2, b2) → logits [228146, 27]

In [73]:
# Apply first linear layer &
# Introduce non-linearity using tanh activation
h = torch.tanh(embed.view(embed.shape[0], 6) @ W1 + b1)  # Shape: [228146, 100]

# Note on broadcasting:
# - input @ W1 has shape [228146, 6] @ [6, 100] = [228146, 100]
# - b1 has shape [100]
# - PyTorch aligns shapes from the right, treats b1 as [1, 100], and repeats it over all 228146 rows
# This adds the same bias vector to every row.


# ----------Final layer of the MLP maps from hidden dimension (100) to vocab size (27)
W2 = torch.randn(100, 27)
b2 = torch.rand(27)

# Compute logits for the next character prediction
logits = h @ W2 + b2  # Shape: [228146, 27]

# Convert logits to probabilities using softmax
# ------------------------
# Step 1: Exponentiate logits
counts = logits.exp()

# Step 2: Normalize across vocab dimension (dim=1) to get probability distribution
prob = counts / counts.sum(1, keepdim=True)  # Shape: [228146, 27]

# -----------------------------
# Evaluate how well the model predicted the actual next character
# Y contains the true next character indices (shape: [228146])
# This line extracts the predicted probability for the correct next character
correct_probs = prob[torch.arange(embed.shape[0]), Y]
print(correct_probs)

tensor([2.0307e-03, 2.7908e-05, 4.9379e-05,  ..., 4.0485e-11, 3.3935e-04,
        1.7640e-09])
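
The broadcasting note in the cell above can be verified on small, hand-checkable tensors. The shapes below are toy values chosen for this sketch, not the notebook's.

```python
import torch

x = torch.ones(4, 6)      # toy batch: 4 samples, 6 features
W = torch.ones(6, 5)      # toy weights: 6 inputs -> 5 hidden units
b = torch.arange(5.0)     # bias of shape (5,): one value per hidden unit

# (4, 5) + (5,): b is treated as (1, 5) and repeated over the 4 rows.
pre_act = x @ W + b
print(pre_act.shape)
print(pre_act[0])

# Writing the expansion out explicitly gives the same result.
print(torch.equal(pre_act, x @ W + b.expand(4, 5)))
```

Every row receives the same bias vector, which is the intended behavior for a linear layer. Broadcasting along the wrong dimension is a common silent bug, so a check like this is worth doing when shapes are less obvious.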


## =========================
## Loss Function
## =========================

In [75]:
# Negative log-likelihood (cross-entropy): average of -log(probability assigned to the true next character)
loss = -prob[torch.arange(embed.shape[0]), Y].log().mean()
print(f"Initial manual loss: {loss.item():.4f}")  # This is the loss we want to minimize


Initial manual loss: 10.4039
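
As a sanity check, the manual computation should agree with `F.cross_entropy`. The sketch below uses small random logits and targets (`logits_demo` and `targets_demo` are made up for this example) and also computes a useful reference point: the loss of a model that predicts all 27 characters with equal probability.

```python
import math

import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)                      # arbitrary seed for this sketch
logits_demo = torch.randn(8, 27, generator=g)             # toy logits: 8 samples, 27 classes
targets_demo = torch.randint(0, 27, (8,), generator=g)    # toy true next-character indices

# Manual route: softmax, pick the probability of the true class, average -log.
counts = logits_demo.exp()
prob = counts / counts.sum(1, keepdim=True)
manual = -prob[torch.arange(8), targets_demo].log().mean()

# Built-in route: takes raw logits, not probabilities.
builtin = F.cross_entropy(logits_demo, targets_demo)
print(torch.allclose(manual, builtin, atol=1e-5))

# Reference point: all-zero logits give a uniform distribution over 27 characters.
uniform = F.cross_entropy(torch.zeros(8, 27), targets_demo)
print(f"uniform loss {uniform.item():.4f}, ln(27) = {math.log(27):.4f}")
```

The uniform baseline is ln(27) ≈ 3.2958. An initial loss of 10.4039 is far above that, which indicates that the randomly initialized network is confidently wrong on many examples rather than merely uninformed.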


In [78]:
# ----------------------------
# Let's now clean up and structure the full training flow
# ----------------------------
# Dataset shapes
print("Dataset Shapes — X:", X.shape, "Y:", Y.shape)

# Set manual seed for reproducibility
g = torch.Generator().manual_seed(2147483647)

# Initialize model parameters
C = torch.rand(27, 2, generator=g)       # Embedding table: vocab_size x embedding_dim
W1 = torch.randn((6, 100), generator=g)  # First MLP layer: input_dim x hidden_dim
b1 = torch.randn(100, generator=g)       # Bias for layer 1
W2 = torch.randn((100, 27), generator=g) # Output layer: hidden_dim x vocab_size
b2 = torch.randn(27, generator=g)        # Bias for output layer

# Bundle all parameters
parameters = [C, W1, b1, W2, b2]
print("Total Parameters:", sum(p.nelement() for p in parameters))

# ========================================
# Efficient Loss Calculation with F.cross_entropy
# ========================================

# Manual probability calculation may overflow with large logits
logits = torch.tensor([-5.0, 0.0, 1.0, 1000.0])
counts = logits.exp()
probs = counts / counts.sum()
print("Naive Softmax (may overflow):", probs)

# PyTorch handles this with log-sum-exp trick:
# subtracts max(logits) to avoid overflow
logits_safe = logits - logits.max()
counts = logits_safe.exp()
probs = counts / counts.sum()
print("Numerically Stable Softmax:", probs)

# Note: F.cross_entropy handles this internally for better forward & backward performance


Dataset Shapes — X: torch.Size([228146, 3]) Y: torch.Size([228146])
Total Parameters: 3481
Naive Softmax (may overflow): tensor([0., 0., 0., nan])
Numerically Stable Softmax: tensor([0., 0., 0., 1.])
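
The same extreme logits can be pushed through a loss computation to see why the stable form matters. This is a sketch, assuming a hypothetical true class at index 2; it writes log-softmax with `torch.logsumexp` and compares the result with `F.cross_entropy`.

```python
import torch
import torch.nn.functional as F

logits_demo = torch.tensor([[-5.0, 0.0, 1.0, 1000.0]])   # same extreme logits, as a batch of one
target_demo = torch.tensor([2])                           # hypothetical true class: index 2

# Naive route: exp(1000) overflows to inf, so the loss is no longer finite.
naive_counts = logits_demo.exp()
naive_prob = naive_counts / naive_counts.sum(1, keepdim=True)
naive_loss = -naive_prob[0, target_demo].log()

# Stable route: log-softmax via logsumexp, which never forms exp(1000).
log_prob = logits_demo - torch.logsumexp(logits_demo, dim=1, keepdim=True)
stable_loss = -log_prob[0, target_demo]

print(naive_loss, stable_loss, F.cross_entropy(logits_demo, target_demo))
```

The naive loss is non-finite, while the stable route and `F.cross_entropy` both report 999, the gap between the dominant logit and the true class's logit. A non-finite loss would also poison every gradient in the backward pass, so the stable form is needed for training, not only for reporting.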


## =========================
## Training Loop
## =========================

In [79]:
# ========================================
# Training Loop
# ========================================

# Enable gradients
for p in parameters:
    p.requires_grad = True

num_steps = 100  # number of mini-batch updates (not full passes over the data)
for step in range(num_steps):

    # Mini-batch sampling
    ix = torch.randint(0, X.shape[0], (32,))  # Random 32 samples

    # Forward pass
    emb = C[X[ix]]                         # Shape: (32, 3, 2)
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1)  # Shape: (32, 100)
    logits = h @ W2 + b2                   # Shape: (32, 27)
    loss = F.cross_entropy(logits, Y[ix])  # Cross-entropy loss

    # Backward pass
    for p in parameters:
        p.grad = None
    loss.backward()

    # Parameter update (SGD)
    learning_rate = 0.1
    for p in parameters:
        p.data -= learning_rate * p.grad

# Loss on the last mini-batch (a noisy estimate of the training loss)
print(f"Final training loss: {loss.item():.4f}")

Final training loss: 3.1268
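
The parameter update in the loop writes to `p.data`. A commonly used alternative is to do the in-place update inside `torch.no_grad()`. The sketch below shows one such step on a toy single-layer model; `W_demo`, `x_demo`, and `y_demo` are invented for this example and are not the notebook's parameters.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)                           # arbitrary seed for this sketch
W_demo = torch.randn(6, 27, generator=g, requires_grad=True)   # toy single-layer model
x_demo = torch.randn(32, 6, generator=g)                       # toy inputs
y_demo = torch.randint(0, 27, (32,), generator=g)              # toy targets

before = W_demo.detach().clone()

# Forward and backward pass, with the gradient reset first as in the loop above.
loss = F.cross_entropy(x_demo @ W_demo, y_demo)
W_demo.grad = None
loss.backward()

# In-place SGD step that autograd does not record; W_demo stays a leaf tensor.
learning_rate = 0.1
with torch.no_grad():
    W_demo -= learning_rate * W_demo.grad

print(W_demo.requires_grad, W_demo.is_leaf)
print(torch.allclose(W_demo.detach(), before - learning_rate * W_demo.grad))
```

Both forms apply the same arithmetic update. The `no_grad` version keeps the update visible to autograd's in-place safety checks, which is the main reason it is often preferred.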


## =========================
## Final Evaluation on Full Dataset
## =========================

In [80]:

emb = C[X]                             # (228146, 3, 2)
h = torch.tanh(emb.view(-1, 6) @ W1 + b1)
logits = h @ W2 + b2
final_loss = F.cross_entropy(logits, Y)
print(f"Loss on full dataset: {final_loss.item():.4f}")

Loss on full dataset: 3.1447


## Summary of Improvements:
- Clear headings for each section
- Descriptive and concise comments explaining what each step does
- Reorganized flow for better readability
- Included numerical stability explanation with softmax + PyTorch’s F.cross_entropy