# Part-2: MLP

Here we implement a multilayer perceptron (MLP) character-level language model, following the [Bengio et al. 2003 MLP language model paper](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf).

### 1. Setting Up Environment and Loading the Dataset

In [38]:
import torch
import matplotlib.pyplot as plt
import torch.nn.functional as F
%matplotlib inline

In [39]:
words = open('names.txt', 'r').read().splitlines()# reading the file and splitting it into a list of words
words[:8]# printing first 8 words

['emma', 'olivia', 'ava', 'isabella', 'sophia', 'charlotte', 'mia', 'amelia']

In [40]:
len(words)# number of words in the dataset

32033

In [41]:
#building the vocabulary of characters
chars = sorted(list(set(''.join(words))))# getting unique characters in the dataset and sorting them
stoi = {s:i+1 for i,s in enumerate(chars)}# mapping characters to integers (1-26)
stoi['.'] = 0# mapping the special character '.' to 0
itos = {i:s for s,i in stoi.items()}# mapping integers to characters
vocab_size = len(itos)# getting the vocabulary size
print('vocab size: ', vocab_size)
print(itos)

vocab size:  27
{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 0: '.'}


We need a mapping from characters to integers (`stoi`) and back (`itos`). This lets us represent our text data numerically, which is the only format a neural network can work with.<br>
Hence we create a sorted list of all unique characters, map them to the indices 1-26, and add a special `.` token at index 0. The `.` token marks both the start and the end of a word.
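
As a quick check of the mapping, here is a minimal sketch of an encode/decode round trip. It uses a three-word toy list instead of `names.txt`, and the `encode` and `decode` helpers are hypothetical names introduced only for this illustration.

```python
# Toy stand-in for the names list (assumption: the notebook itself reads names.txt).
words = ["emma", "olivia", "ava"]

# Same construction as in the cell above: letters get 1..N, '.' gets 0.
chars = sorted(set("".join(words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi["."] = 0
itos = {i: s for s, i in stoi.items()}


def encode(text):
    """Hypothetical helper: string -> list of integer indices."""
    return [stoi[ch] for ch in text]


def decode(indices):
    """Hypothetical helper: list of integer indices -> string."""
    return "".join(itos[i] for i in indices)


encoded = encode("emma.")
decoded = decode(encoded)
print(encoded, decoded)
```

Because `itos` is the exact inverse of `stoi`, decoding an encoded string always returns the original string.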

#### Building the Dataset

In [42]:
block_size = 3# context length: how many characters do we take to predict the next one
X, Y = [], []# input and output lists
for w in words:
    context = [0]*block_size# starting with a context of 3 '.' characters
    for ch in w + '.':# for each character in the word plus the special character
        ix = stoi[ch]# get the integer representation of the character
        X.append(context)# append the current context to the input list
        Y.append(ix)# append the integer representation of the character to the output list
        #print(''.join(itos[i] for i in context), '--->', itos[ix])
        context = context[1:] + [ix]# slide the context window to the right

X = torch.tensor(X)# converting input list to tensor
Y = torch.tensor(Y)# converting output list to tensor

Instead of just looking at the single previous character (a bigram) we will now use a **context** of multiple characters to predict the next one.
- `block_size = 3`: This is a crucial hyperparameter which defines our context length. Here block_size = 3 means our model will always use the last 3 characters to predict the 4th character.
- We iterate through each word and build our input `X` and target `Y` pairs.
    - For a name like "emma" the process is:
        1. `...` ---> `e` (context is [0, 0, 0], target is e)
        2. `..e` ---> `m` (context is [0, 0, e], target is m)
        3. `.em` ---> `m` (context is [0, e, m], target is m)
        4. `emm` ---> `a` (context is [e, m, m], target is a)
        5. `mma` ---> `.` (context is [m, m, a], target is .)

- `context = context[1:] + [ix]`: This is the "sliding window." After each example we drop the oldest character from the context and append the character we just saw.

> Finally X becomes a tensor where each row is a context of 3 character indices and Y is a tensor containing the corresponding target character index for each context.
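
The sliding window can be traced for a single name. This is a small sketch, assuming the same 27-symbol vocabulary as above (`a`-`z` mapped to 1-26 and `.` mapped to 0); the `pairs` list is only there to make the result easy to inspect.

```python
import string

# Assumed vocabulary: 'a'..'z' -> 1..26, '.' -> 0 (matches the itos printed earlier).
stoi = {ch: i + 1 for i, ch in enumerate(string.ascii_lowercase)}
stoi["."] = 0
itos = {i: ch for ch, i in stoi.items()}

block_size = 3
context = [0] * block_size  # start with '...'
pairs = []                  # (context string, target character)

for ch in "emma" + ".":
    ix = stoi[ch]
    pairs.append(("".join(itos[i] for i in context), ch))
    context = context[1:] + [ix]  # drop the oldest index, append the newest

for ctx, target in pairs:
    print(ctx, "--->", target)
```

A four-letter name yields five training examples, because the final `.` is also a prediction target.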

In [43]:
X.shape, X.dtype, Y.shape, Y.dtype

(torch.Size([228146, 3]), torch.int64, torch.Size([228146]), torch.int64)

In [44]:
X

tensor([[ 0,  0,  0],
        [ 0,  0,  5],
        [ 0,  5, 13],
        ...,
        [26, 26, 25],
        [26, 25, 26],
        [25, 26, 24]])

#### Creating Train, Validation and Test Splits

- `Training Set (Xtr, Ytr)`: The largest chunk (80% of the data). The model learns the patterns from this data.
- `Validation Set (Xdev, Ydev)`: A smaller portion (10%). We use this set to tune our model's hyperparameters (like embedding size, hidden layer size, learning rate) and to check for overfitting. The model does not train on this data.
- `Test Set (Xte, Yte)`: The final 10%. We use this set only once at the very end to get an unbiased evaluation of how well our final, tuned model performs on completely unseen data.

In [45]:
import random
def build_dataset(words):
  X, Y = [], []
  for w in words:

    # print(w)
    context = [0] * block_size
    for ch in w + '.':
      ix = stoi[ch]
      X.append(context)
      Y.append(ix)
      # print(''.join(itos[i] for i in context), '--->', itos[ix])
      context = context[1:] + [ix]  # crop and append

  X = torch.tensor(X)
  Y = torch.tensor(Y)
  print(X.shape, Y.shape)
  return X, Y


random.seed(42)# setting the seed for reproducibility
random.shuffle(words)# shuffling the words
n1 = int(0.8*len(words))# first 80% for training
n2 = int(0.9*len(words))# next 10% for validation, last 10% for testing

Xtr, Ytr = build_dataset(words[:n1])# training set
Xdev, Ydev = build_dataset(words[n1:n2])# validation set
Xte, Yte = build_dataset(words[n2:])# test set

torch.Size([182625, 3]) torch.Size([182625])
torch.Size([22655, 3]) torch.Size([22655])
torch.Size([22866, 3]) torch.Size([22866])
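
Note that the split is made over words, not over individual examples, so the example counts only approximately follow the 80/10/10 ratio. The arithmetic below is a small check using the shapes printed above.

```python
# Example counts taken from the shapes printed by build_dataset above.
n_train, n_dev, n_test = 182625, 22655, 22866

total = n_train + n_dev + n_test
fractions = [n / total for n in (n_train, n_dev, n_test)]

print(total)  # 228146, the size of the full dataset X
print([round(f, 3) for f in fractions])
```

The three splits add up to the 228,146 examples of the full dataset, so no example is lost or duplicated.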


### 2. The Multi-Layer Perceptron (MLP) Model

#### The Embedding Layer

Instead of using large sparse one-hot vectors, we'll use **embeddings**.

- **What is it?** An embedding is a learned low-dimensional dense vector representation for each character in our vocabulary.
- `C = torch.randn((27, 2))`: This creates our embedding matrix, which acts as a lookup table. It has 27 rows (one for each character) and 2 columns. The 2 means we are choosing to represent each character with a 2-dimensional vector (i.e., a point in a 2D plane). This embedding dimension is a hyperparameter we can tune.

- **Benefit of using Embeddings:**
    - It's much smaller than a 27-dimensional one-hot vector.
    - The network will learn to place characters that are used in similar ways (like vowels or consonants) closer together in this 2D space, thereby capturing relationships between characters that one-hot vectors cannot.

In [46]:
C = torch.randn((27, 2))# the embedding matrix

Now we define how the embedding lookup works. `C[X]` uses the integer indices in X to retrieve the corresponding 2D embedding vectors from our matrix `C`.
- Input `X` shape: `(228146, 3)` (228,146 examples each with a 3-char context).
- Output `emb` shape: `(228146, 3, 2)` (for each of the 228,146 examples we now have three 2D vectors, one per context character; the final 2 is the embedding dimension).
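
One way to see that the lookup replaces one-hot vectors is that indexing `C` gives the same result as multiplying a one-hot encoding by `C`. The sketch below uses a randomly initialized `C` and only the first three contexts of "emma"; the seed is an arbitrary choice.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only for repeatability
C = torch.randn((27, 2))

# First three contexts of "emma": '...', '..e', '.em'
X = torch.tensor([[0, 0, 0], [0, 0, 5], [0, 5, 13]])

emb_lookup = C[X]                                       # direct indexing, shape (3, 3, 2)
emb_onehot = F.one_hot(X, num_classes=27).float() @ C   # one-hot (3, 3, 27) times (27, 2)

print(emb_lookup.shape, torch.allclose(emb_lookup, emb_onehot))
```

Indexing is preferred in practice because it avoids building the large, mostly zero one-hot tensor.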

In [47]:
emb = C[X]# embedding the input characters
emb.shape# shape of the embedded input

torch.Size([228146, 3, 2])

#### Building Hidden Layer
This is the first "neuron" layer of our MLP. It takes the embeddings as input and performs a non-linear transformation.
- `W1 = torch.randn((6, 100))`: The weight matrix for the hidden layer.
    - The input size is 6 because we have a 3-character context and each character is a 2D embedding (3 * 2 = 6). We will concatenate these embeddings to form a single 6D vector for each example.
    - The output size is 100, meaning our hidden layer will have 100 neurons. This is another hyperparameter.
- `b1 = torch.randn(100)`: The bias vector for the 100 neurons in the hidden layer. Biases allow the neurons to shift their activation function, making them more flexible.

In [48]:
W1  = torch.randn((6,100))
b1 = torch.randn(100)

#### Hidden Layer Forward Pass
This block computes the output of the hidden layer:
- `emb.view(-1, 6)`: This is a crucial reshaping step. It takes our (228146, 3, 2) embedding tensor and transforms it into a (228146, 6) tensor. It does this by concatenating the three 2D vectors for each example into a single 6D vector. The -1 tells PyTorch to automatically infer the number of rows.<br>
The `.view()` function in PyTorch reshapes a tensor to have different dimensions while keeping the total number of elements the same.

```text
For the context ' . e m '
[
  [-0.1565,  0.1425],  # Embedding for '.'
  [ 0.6479, -0.2573],  # Embedding for 'e'
  [-0.0123,  0.3681]   # Embedding for 'm'
]
```
When we apply `.view(-1, 6)`, here's what each part does:
1. The `6`: This is the desired number of columns for our new tensor. We get 6 because we want to combine the features of our 3 context characters and each character has a 2-dimensional embedding.

    - $3 \times 2 = 6$ (`block_size` times the embedding dimension): This operation effectively takes the three 2D vectors and lays them end-to-end:
    $[-0.1565,  0.1425,   0.6479, -0.2573,   -0.0123,  0.3681]$

    - This new 6-dimensional vector is now a single, unified representation of the entire 3-character context.

2. The `-1`: This is a placeholder. It tells PyTorch: **I want the number of columns to be 6 and you should automatically figure out how many rows are needed to make it work.**<br>
PyTorch calculates this by taking the total number of elements in the tensor (228146 * 3 * 2 = 1,368,876) and dividing by the specified number of columns (1,368,876 / 6 = 228146). This ensures that our output tensor has one row for each of the 228,146 examples.


- `@ W1 + b1`: We perform a matrix multiplication with the weights and add the bias. This is the standard linear operation of a neuron layer.

- `torch.tanh()`: We apply the hyperbolic tangent (tanh) activation function. Without a non-linear function between layers, our multi-layer network would mathematically collapse into a single linear layer, severely limiting its power. `tanh` squashes the output of each neuron to be between -1 and 1.
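
To make the reshaping step concrete, the sketch below checks that `.view(-1, 6)` produces the same tensor as explicitly concatenating the three per-position embeddings. The batch of 5 random examples is a stand-in chosen for illustration, not the notebook's actual `emb`.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for repeatability
emb = torch.randn((5, 3, 2))  # stand-in: 5 examples, 3 context positions, 2D embeddings

# Reshape: -1 lets PyTorch infer the row count (5 * 3 * 2 / 6 = 5).
flat_view = emb.view(-1, 6)

# Explicit concatenation of the embeddings at positions 0, 1 and 2.
flat_cat = torch.cat([emb[:, 0, :], emb[:, 1, :], emb[:, 2, :]], dim=1)

print(flat_view.shape, torch.equal(flat_view, flat_cat))
```

Both give the same values, but `.view()` only reinterprets the existing memory layout, whereas `torch.cat` allocates a new tensor.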

In [49]:
h = torch.tanh(emb.view(-1, 6) @ W1 + b1)# hidden layer

In [50]:
h

tensor([[-0.9936, -0.9842, -0.9839,  ...,  0.9998, -0.9928, -0.1348],
        [-0.8438, -0.9843, -0.9714,  ...,  0.9983, -0.9504, -0.8208],
        [-0.9903, -0.9928, -0.8161,  ...,  0.9727,  0.9132,  0.3171],
        ...,
        [-0.9864, -0.9980, -0.9985,  ...,  0.9999, -0.9806, -0.9423],
        [-1.0000, -0.9995, -0.9914,  ...,  0.9999, -0.5348,  0.7520],
        [ 0.9990,  0.9968,  0.8851,  ...,  0.9821, -0.9999,  0.1843]])

In [51]:
h.shape

torch.Size([228146, 100])

#### The Output Layer
This is the final layer of our network. It takes the activations from the hidden layer and produces the final output.
- `W2 = torch.randn((100, 27))`: The weight matrix for the output layer.
    - The input size is 100 matching the number of neurons in our hidden layer.
    - The output size is 27 as we need one output number for each of the 27 possible next characters in our vocabulary.
- `b2 = torch.randn(27)`: The bias vector for the 27 output neurons.

In [52]:
W2 = torch.randn((100, 27))
b2 = torch.randn(27)

#### Output Layer Forward Pass (Logits)
We perform the final matrix multiplication between the hidden layer activations `h` and the output weights `W2` and add the final bias `b2`. The result is our logits.<br>
**Logits** are the raw and unnormalized scores for each of the 27 characters. They can be thought of as log-counts similar to the output of our linear layer but they are now produced by a much more powerful non-linear model.

In [53]:
logits = h @ W2 + b2

In [54]:
logits.shape

torch.Size([228146, 27])
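
With both layers defined, the earlier claim that the network would collapse into a single linear layer without `tanh` can be checked numerically. This is a sketch on small random tensors with the same layer sizes as above. The `float64` dtype and the 4-row input are choices made here to keep the comparison free of rounding noise.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for repeatability
dtype = torch.float64
x = torch.randn((4, 6), dtype=dtype)  # stand-in for 4 flattened contexts
W1 = torch.randn((6, 100), dtype=dtype)
b1 = torch.randn(100, dtype=dtype)
W2 = torch.randn((100, 27), dtype=dtype)
b2 = torch.randn(27, dtype=dtype)

# Two stacked layers with no activation in between...
two_linear = (x @ W1 + b1) @ W2 + b2
# ...equal one linear layer with merged weights and bias.
one_linear = x @ (W1 @ W2) + (b1 @ W2 + b2)

# With tanh in between, no such merge reproduces the output.
with_tanh = torch.tanh(x @ W1 + b1) @ W2 + b2

print(torch.allclose(two_linear, one_linear))  # True
print(torch.allclose(with_tanh, two_linear))   # False
```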

#### Converting Logits to Probabilities (Softmax)
To turn our logits into a valid probability distribution, we apply the softmax function. This is a two-step process:
- `counts = logits.exp()`: We exponentiate the logits. This makes all the numbers positive.
- `prob = counts / counts.sum(1, keepdims=True)`: We normalize each row so that all the values in that row sum to 1.0.

The prob tensor now contains for each input example the model's predicted probability for each of the 27 possible next characters.

In [55]:
counts = logits.exp()
prob = counts / counts.sum(1, keepdims=True)
prob.shape

torch.Size([228146, 27])
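
The two manual steps should agree with PyTorch's built-in softmax. The sketch below compares them on a small batch of random logits (a stand-in for the real `logits` tensor) and confirms that every row sums to 1.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only for repeatability
logits = torch.randn((4, 27))  # stand-in: 4 examples, 27 scores each

# Manual softmax, as in the cell above.
counts = logits.exp()
prob = counts / counts.sum(1, keepdim=True)

# Built-in softmax along the character dimension.
prob_builtin = F.softmax(logits, dim=1)

row_sums = prob.sum(1)
print(torch.allclose(prob, prob_builtin), row_sums)
```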

#### Calculating the Loss


In [59]:
loss = -prob[torch.arange(prob.shape[0]), Y].log().mean()
loss

tensor(16.6379)
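
The expression `prob[torch.arange(prob.shape[0]), Y]` picks, for each row, the probability the model assigned to the correct next character. Taking the negative mean of the log of those values gives the average negative log-likelihood. The sketch below, on random stand-in logits and the five targets of "emma.", suggests that `F.cross_entropy` computes the same quantity directly from the logits.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only for repeatability
logits = torch.randn((5, 27))         # stand-in logits for 5 examples
Y = torch.tensor([5, 13, 13, 1, 0])   # targets 'e', 'm', 'm', 'a', '.'

# Manual version: softmax, pick the probability of the target, average -log.
counts = logits.exp()
prob = counts / counts.sum(1, keepdim=True)
loss_manual = -prob[torch.arange(prob.shape[0]), Y].log().mean()

# Built-in version: takes raw logits and integer targets.
loss_builtin = F.cross_entropy(logits, Y)

print(loss_manual.item(), loss_builtin.item())
```

The built-in function is generally the safer choice because it avoids materializing `counts`, which can overflow to `inf` when a logit is very large.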