# Part-2: MLP

Here we implement a multilayer perceptron (MLP) character-level language model, following the approach of [Bengio et al. 2003, "A Neural Probabilistic Language Model"](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf).

### 1. Setting Up Environment and Loading the Dataset

In [38]:
import torch
import matplotlib.pyplot as plt
import torch.nn.functional as F
%matplotlib inline

In [39]:
words = open('names.txt', 'r').read().splitlines()  # read the file and split it into a list of lines (one name per line)
words[:8]  # show the first 8 words

['emma', 'olivia', 'ava', 'isabella', 'sophia', 'charlotte', 'mia', 'amelia']

In [40]:
len(words)# number of words in the dataset

32033

In [41]:
#building the vocabulary of characters
chars = sorted(list(set(''.join(words))))# getting unique characters in the dataset and sorting them
stoi = {s:i+1 for i,s in enumerate(chars)}# mapping characters to integers (1-26)
stoi['.'] = 0# mapping the special character '.' to 0
itos = {i:s for s,i in stoi.items()}# mapping integers to characters
vocab_size = len(itos)# getting the vocabulary size
print('vocab size: ', vocab_size)
print(itos)

vocab size:  27
{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 0: '.'}


We need a mapping from characters to integers (`stoi`) and back (`itos`). This lets us represent the text numerically, which is the only kind of input a neural network can operate on.
To build it, we take the sorted list of all unique characters, assign them the indices 1 to 26, and reserve index 0 for a special `.` token that pads the start of a name and marks its end.
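
As a quick illustration of the round trip, here is a minimal sketch on a tiny, hypothetical word list. The names `toy_words`, `toy_stoi`, `toy_itos`, `encode`, and `decode` are made up for this example and are not part of the notebook; the construction simply mirrors the cell above.

```python
# Hypothetical toy data; the notebook itself uses the full names.txt vocabulary.
toy_words = ['emma', 'ava']

# Same construction as above: sorted unique characters get indices 1..N, '.' gets 0.
toy_chars = sorted(set(''.join(toy_words)))          # ['a', 'e', 'm', 'v']
toy_stoi = {s: i + 1 for i, s in enumerate(toy_chars)}
toy_stoi['.'] = 0
toy_itos = {i: s for s, i in toy_stoi.items()}

def encode(text):
    """Map each character of `text` to its integer index."""
    return [toy_stoi[ch] for ch in text]

def decode(indices):
    """Map a list of integer indices back to a string."""
    return ''.join(toy_itos[i] for i in indices)

encoded = encode('emma.')
print(encoded)          # [2, 3, 3, 1, 0]
print(decode(encoded))  # emma.
```

Decoding the encoded sequence returns the original string, so the two dictionaries are inverses of each other.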

#### Building the Dataset

In [42]:
block_size = 3# context length: how many characters do we take to predict the next one
X, Y = [], []# input and output lists
for w in words:
    context = [0]*block_size# starting with a context of 3 '.' characters
    for ch in w + '.':# for each character in the word plus the special character
        ix = stoi[ch]# get the integer representation of the character
        X.append(context)# append the current context to the input list
        Y.append(ix)# append the integer representation of the character to the output list
        #print(''.join(itos[i] for i in context), '--->', itos[ix])
        context = context[1:] + [ix]# slide the context window to the right

X = torch.tensor(X)# converting input list to tensor
Y = torch.tensor(Y)# converting output list to tensor

Instead of just looking at the single previous character (a bigram) we will now use a **context** of multiple characters to predict the next one.
- `block_size = 3`: This is a crucial hyperparameter which defines our context length. Here block_size = 3 means our model will always use the last 3 characters to predict the 4th character.
- We iterate through each word and build our input `X` and target `Y` pairs.
    - For a name like "emma" the process is:
        1. `...` ---> `e` (context is [0, 0, 0], target is e)
        2. `..e` ---> `m` (context is [0, 0, e], target is m)
        3. `.em` ---> `m` (context is [0, e, m], target is m)
        4. `emm` ---> `a` (context is [e, m, m], target is a)
        5. `mma` ---> `.` (context is [m, m, a], target is .)

- `context = context[1:] + [ix]`: This is the "sliding window." After each (context, target) pair is recorded, we drop the oldest character from the context and append the character we just saw.

> Finally X becomes a tensor where each row is a context of 3 character indices and Y is a tensor containing the corresponding target character index for each context.
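
The walk-through for "emma" can be reproduced with a small standalone sketch. The helper `context_pairs` and the `demo_stoi` / `demo_itos` dictionaries are hypothetical names introduced only for this example; the mapping is assumed to be the same a=1 ... z=26, '.'=0 scheme printed earlier.

```python
import string

# Assumed mapping, matching the printed vocabulary: 'a' -> 1, ..., 'z' -> 26, '.' -> 0.
demo_stoi = {ch: i + 1 for i, ch in enumerate(string.ascii_lowercase)}
demo_stoi['.'] = 0
demo_itos = {i: ch for ch, i in demo_stoi.items()}

def context_pairs(word, block_size=3):
    """Return the (context, target) index pairs produced by the sliding window."""
    context = [0] * block_size            # start with block_size '.' tokens
    pairs = []
    for ch in word + '.':                 # the trailing '.' marks the end of the word
        ix = demo_stoi[ch]
        pairs.append((context, ix))
        context = context[1:] + [ix]      # drop the oldest index, append the newest
    return pairs

pairs = context_pairs('emma')
for context, target in pairs:
    print(''.join(demo_itos[i] for i in context), '--->', demo_itos[target])
# ... ---> e
# ..e ---> m
# .em ---> m
# emm ---> a
# mma ---> .
```

Because `context[1:] + [ix]` builds a new list on every step, the contexts already stored in `pairs` are not modified by later iterations.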

In [43]:
X.shape, X.dtype, Y.shape, Y.dtype

(torch.Size([228146, 3]), torch.int64, torch.Size([228146]), torch.int64)

In [44]:
X

tensor([[ 0,  0,  0],
        [ 0,  0,  5],
        [ 0,  5, 13],
        ...,
        [26, 26, 25],
        [26, 25, 26],
        [25, 26, 24]])
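
The number of rows in `X` follows from the construction: each word contributes one example per character, plus one more for the terminating `.`. A minimal sketch of that count on a hypothetical three-word list (`toy_words` and `count_examples` are illustrative names, not part of the notebook):

```python
# Hypothetical toy list; in the notebook the same sum runs over all of `words`.
toy_words = ['emma', 'olivia', 'ava']

def count_examples(word_list):
    """Each word yields len(word) + 1 (context, target) pairs: its letters plus the final '.'."""
    return sum(len(w) + 1 for w in word_list)

expected_rows = count_examples(toy_words)
print(expected_rows)  # 5 + 7 + 4 = 16
```

Applying the same sum to the full word list should match the first dimension of `X.shape`.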

#### Creating Train, Validation and Test Splits

- `Training Set (Xtr, Ytr)`: The largest chunk (80% of the names, so roughly 80% of the examples). The model learns the patterns from this data.
- `Validation Set (Xdev, Ydev)`: A smaller portion (10%). We use this set to tune our model's hyperparameters (like embedding size, hidden layer size, learning rate) and to check for overfitting. The model does not train on this data.
- `Test Set (Xte, Yte)`: The final 10%. We use this set only once at the very end to get an unbiased evaluation of how well our final, tuned model performs on completely unseen data.

In [45]:
import random
def build_dataset(words):
  X, Y = [], []
  for w in words:

    # print(w)
    context = [0] * block_size
    for ch in w + '.':
      ix = stoi[ch]
      X.append(context)
      Y.append(ix)
      # print(''.join(itos[i] for i in context), '--->', itos[ix])
      context = context[1:] + [ix]  # crop and append

  X = torch.tensor(X)
  Y = torch.tensor(Y)
  print(X.shape, Y.shape)
  return X, Y


random.seed(42)# setting the seed for reproducibility
random.shuffle(words)# shuffling the words
n1 = int(0.8*len(words))# first 80% for training
n2 = int(0.9*len(words))# next 10% for validation, last 10% for testing

Xtr, Ytr = build_dataset(words[:n1])# training set
Xdev, Ydev = build_dataset(words[n1:n2])# validation set
Xte, Yte = build_dataset(words[n2:])# test set

torch.Size([182625, 3]) torch.Size([182625])
torch.Size([22655, 3]) torch.Size([22655])
torch.Size([22866, 3]) torch.Size([22866])
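
The three example counts add up to the size of the full dataset built earlier (182625 + 22655 + 22866 = 228146), which is what we expect when every word lands in exactly one split. The index arithmetic can be sanity-checked on its own with a small sketch; `toy_words` and `rng` below are hypothetical names, and a local `random.Random` instance is used so the notebook's global shuffle state is left untouched.

```python
import random

# Hypothetical stand-in for the real word list.
toy_words = [f'name{i}' for i in range(10)]

rng = random.Random(42)        # local generator, independent of random.seed(...)
shuffled = toy_words[:]        # shuffle a copy so the original order is preserved
rng.shuffle(shuffled)

n1 = int(0.8 * len(shuffled))  # end of the training slice
n2 = int(0.9 * len(shuffled))  # end of the validation slice

train, dev, test = shuffled[:n1], shuffled[n1:n2], shuffled[n2:]

# Every word is in exactly one split: sizes add up and the splits are disjoint.
assert len(train) + len(dev) + len(test) == len(toy_words)
assert set(train).isdisjoint(dev)
assert set(train).isdisjoint(test)
assert set(dev).isdisjoint(test)

print(len(train), len(dev), len(test))  # 8 1 1
```

Note that the split is made over words rather than over individual examples, so the example counts are only approximately in an 80/10/10 ratio.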
