### Exercises: Building makemore Part 2: MLP

- E01: Tune the hyperparameters of the training to beat my best validation loss of 2.2
- E02: I was not careful with the intialization of the network in this video. (1) What is the loss you'd get if the predicted probabilities at initialization were perfectly uniform? What loss do we achieve? (2) Can you tune the initialization to get a starting loss that is much more similar to (1)?
- E03: Read the Bengio et al 2003 paper (link above), implement and try any idea from the paper. Did it work?

In [28]:
import torch
import matplotlib.pyplot as plt
import torch.nn.functional as F

### E01: 
Tune the hyperparameters of the training to beat my best validation loss of 2.2

In [1]:
words = open('names.txt', 'r').read().splitlines()
words[:5]

['emma', 'olivia', 'ava', 'isabella', 'sophia']

In [4]:
chars = sorted(list(set(''.join(words))))
chars[:5]

['a', 'b', 'c', 'd', 'e']

In [8]:
stoi = {s:i+1 for i, s in enumerate(chars)}
stoi['.'] = 0
itos = {i:s for s,i in stoi.items()}

In [43]:
# function to create datasets
block_size = 3

def build_dataset(words):
    X, Y = [], []
    
    for word in words:
        context = [0] * block_size
        
        for ch in word+'.':
            ix = stoi[ch]
            X.append(context)
            Y.append(ix)
            context = context[1:] + [ix]
    
    X = torch.tensor(X)
    Y = torch.tensor(Y)
    
    return X, Y

In [44]:
import random
random.seed(42)

In [45]:
# split the dataset into train / dev / test 80/10/10
random.shuffle(words)
n1 = int(0.8*len(words)) # index for 80 - 90 
n2 = int(0.9*len(words)) # index for 90 - 100

xtr, ytr = build_dataset(words[:n1])
xdev, ydev = build_dataset(words[n1:n2])
xte, yte = build_dataset(words[n2:])

In [46]:
xtr.shape, ytr.shape, xdev.shape, ydev.shape, xte.shape, yte.shape

(torch.Size([182580, 3]),
 torch.Size([182580]),
 torch.Size([22767, 3]),
 torch.Size([22767]),
 torch.Size([22799, 3]),
 torch.Size([22799]))

In [194]:
# initialize embedding layer, weights and biases
feature_size = 10 #size of feature vector
C = torch.randn((27, feature_size))
emb = C[xtr].view(-1, block_size * feature_size)
hidden_size = 200
W1 = torch.randn((block_size * feature_size, hidden_size), requires_grad = True)
b1 = torch.randn(hidden_size, requires_grad = True)
W2 = torch.randn((hidden_size, 27), requires_grad = True)
b2 = torch.randn(27, requires_grad = True)

parameters = [W1, b1, W2, b2]

In [195]:
def train(X, Y, num_iters, lr, bs):
    # keep stats
    lossi = []
    iteration = []
    
    for i in range(num_iters):
        ixs = torch.randint(0, X.shape[0], (bs, ))
        # forward
        hidden = torch.tanh((C[X[ixs]].view(-1, block_size * feature_size) @ W1 + b1))
        logits = hidden @ W2 + b2
        loss = F.cross_entropy(logits, Y[ixs])
    
        # backward
        for p in parameters: # zero grads
            p.grad = None
        loss.backward() #backprop

        #lr = 0.1 if i < 10000 else lr
        if i < 100000:
            lr = 0.1
        elif ((i > 100000) and (i < 150000)):
            lr = 0.01
        else:
            lr = lr
        
        for p in parameters: # weight update
            p.data -= lr * p.grad

        lossi.append(loss.item())
        iteration.append(i)
    
    return iteration, lossi

In [196]:
def eval_loss(x, y):
    hidden = torch.tanh(C[x].view(-1, block_size * feature_size) @ W1 + b1)
    logits = hidden @ W2 + b2
    loss = F.cross_entropy(logits, y).item()
    print(f'Loss: {loss}')

In [197]:
i, lossi = train(xtr, ytr, 200000, 0.001, 64)

In [198]:
lossi[-1]

2.190006732940674

In [199]:
eval_loss(xtr, ytr)

Loss: 2.0791778564453125


In [200]:
eval_loss(xdev, ydev)

Loss: 2.1441211700439453


Managed to beat Karpathy's score. I doubled the batch size to 64, iterated for 200,000 times and varied the learning rate during the iterations such that the network uses a slower learning rate the more iterations it goes through.