# Makemore Part 2: Integrating an MLP
Based on the Bengio et al. 2003 paper (MLP language model), but predicting the next character instead of the next word.

In [2]:
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt # for making figures
%matplotlib inline

In [4]:
# read in all the words
words = open('names.txt', 'r').read().splitlines()
words[:8]

['emma', 'olivia', 'ava', 'isabella', 'sophia', 'charlotte', 'mia', 'amelia']

In [6]:
len(words)

32033

In [5]:
# build vocabulary of characters and mappings to/from integers
chars = sorted(list(set(''.join(words))))
stoi = {s:i+1 for i,s in enumerate(chars)}
stoi['.'] = 0
itos = {i:s for s,i in stoi.items()}
print(itos)

{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 0: '.'}


In [61]:
# compile the dataset for the neural network

# for every 3 preceding characters we are looking for a specific output character
# this compiles the dataset into that format

# build the dataset
block_size = 3 # context length: how many characters do we take to predict the next one? -- chosen from the paper
X, Y = [], []
for w in words:
    
    # print(w)
    context = [0] * block_size
    for ch in w + '.':
        ix = stoi[ch]
        X.append(context)
        Y.append(ix)
        # print(''.join(itos[i] for i in context), '--->', itos[ix])
        context = context[1:] + [ix] # crop and append
        
X = torch.tensor(X)
Y = torch.tensor(Y)

In [62]:
X.shape # inputs

torch.Size([228146, 3])

In [63]:
Y.shape # labels (expected values)

torch.Size([228146])
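
To make the sliding context window concrete, here is a small sketch that runs the same loop on the single word 'emma'. The vocabulary is rebuilt from a hard-coded alphabet (an assumption, so the snippet runs without names.txt), and `pairs` is a name made up for this illustration.

```python
# Toy vocabulary: '.' is index 0 and 'a'..'z' are 1..26 (same layout as stoi above)
chars = 'abcdefghijklmnopqrstuvwxyz'
stoi = {ch: i + 1 for i, ch in enumerate(chars)}
stoi['.'] = 0
itos = {i: ch for ch, i in stoi.items()}

block_size = 3
context = [0] * block_size  # start with all '.' padding
pairs = []                  # (context string, target character)
for ch in 'emma' + '.':
    ix = stoi[ch]
    pairs.append((''.join(itos[i] for i in context), itos[ix]))
    context = context[1:] + [ix]  # crop the oldest character and append the new one

for ctx, target in pairs:
    print(ctx, '--->', target)
```

A four-letter word produces five examples, because the closing '.' is also a target.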

In [10]:
# build the neural network

C = torch.randn((27, 2)) # lookup table with 27 rows and 2 columns -- each character gets a 2D embedding

In [14]:
# example of embedding one integer
C[5] # one way to do it by indexing

# from the previous lecture...
F.one_hot(torch.tensor(5), num_classes=27).float() @ C

tensor([-1.8374, -1.0526])

Both of the above ways will result in the same tensor, but we'll be using indexing simply because it is faster than one_hot
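
A quick way to check that claim is to compare the two results directly. This is a minimal sketch on a freshly seeded table (`C_demo` and the seed are stand-ins, not the C from the cell above); it also shows that indexing accepts a whole tensor of indices at once.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)  # arbitrary seed for this sketch
C_demo = torch.randn((27, 2), generator=g)

# the same row, fetched two ways
by_index = C_demo[5]
by_one_hot = F.one_hot(torch.tensor(5), num_classes=27).float() @ C_demo
same = torch.allclose(by_index, by_one_hot)
print(same)

# indexing also works with a 2D tensor of indices: every entry is replaced by its row of C_demo
idx = torch.tensor([[0, 0, 5], [0, 5, 13]])
batch = C_demo[idx]
print(batch.shape)  # one 2D embedding per index, so (2, 3, 2)
```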

In [19]:
emb = C[X] # look up the embedding in C for every index in X
emb.shape

torch.Size([32, 3, 2])

In [20]:
# hidden layer
W1 = torch.randn((6, 100)) # 6 inputs (3 context characters x 2 embedding dims), 100 neurons (number chosen arbitrarily)
b1 = torch.randn(100)

Currently we have two tensors we need to multiply, but their shapes are not compatible: emb is (32, 3, 2) and W1 is (6, 100). We need to convert emb to (32, 6) so that they can be multiplied. One way to do this is torch's cat function (concatenate).

In [None]:
torch.cat([emb[:, 0, :], emb[:, 1, :], emb[:, 2, :]], 1)

The problem with this is that we index each context position directly, so if we ever wanted to change the number of context characters we would have to change this code as well (not ideal). Instead, we can use torch's unbind function, which splits the tensor along a dimension into a tuple of slices.

In [None]:
torch.cat(torch.unbind(emb, 1), 1)

This will work, but there is actually an even more efficient way to do this. We can call view() on the tensor to change its dimensions. As long as the new dimensions multiply to the same total number of elements, we can shape the tensor any way we want. This doesn't actually alter the storage of emb; it just changes how that data is interpreted.

In [None]:
emb.view(32, 6)
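
The three approaches can be compared side by side. The sketch below uses a random stand-in tensor (`emb_demo`, not the emb from above) and checks both that the results match and that view() reuses the original memory while cat() allocates new memory.

```python
import torch

emb_demo = torch.randn(32, 3, 2)  # stand-in for emb: 32 examples, 3 context positions, 2 dims

a = torch.cat([emb_demo[:, 0, :], emb_demo[:, 1, :], emb_demo[:, 2, :]], 1)  # explicit indexing
b = torch.cat(torch.unbind(emb_demo, 1), 1)                                  # unbind, then cat
c = emb_demo.view(32, 6)                                                     # view with explicit sizes
d = emb_demo.view(-1, 6)                                                     # view with an inferred first dim

all_equal = torch.equal(a, b) and torch.equal(b, c) and torch.equal(c, d)
shares_memory = c.data_ptr() == emb_demo.data_ptr()  # view points at the same memory
cat_copies = a.data_ptr() != emb_demo.data_ptr()     # cat wrote its result somewhere new
print(all_equal, shares_memory, cat_copies)
```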

Now we can simply use view to alter the matrix and allow it to be multiplied by the weights.

In [38]:
h = torch.tanh(emb.view(-1, 6) @ W1 + b1) # -1 lets PyTorch infer that dimension from the total number of elements
h


tensor([[ 0.3483, -0.9696,  0.5176,  ...,  0.3664,  0.9999, -0.8674],
        [ 0.6093, -0.8617, -0.8181,  ..., -0.6881,  0.9998, -0.6656],
        [-0.6805,  0.5789,  0.9755,  ...,  0.7796,  0.9926, -0.9660],
        ...,
        [ 0.7573, -0.9888,  0.6710,  ..., -0.9907, -0.9990,  0.7968],
        [-0.9063, -0.2000,  0.9035,  ..., -0.9942, -1.0000,  0.5752],
        [-0.3442, -0.8952,  0.9296,  ..., -0.9998, -0.9980, -0.9465]])

In [39]:
# final layer
W2 = torch.randn((100, 27))
b2 = torch.randn(27)

In [40]:
logits = h @ W2 + b2

In [41]:
counts = logits.exp() # get fake counts

In [42]:
prob = counts / counts.sum(1, keepdim=True) # normalize into a probability

In [47]:
# comparing to label (expected character)
loss = -prob[torch.arange(32), Y].log().mean()
loss

tensor(17.4051)
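
The steps above (exponentiate, normalize each row, pick out the probability of the correct character, take the mean negative log) amount to a softmax followed by the negative log likelihood. Here is a minimal sketch on made-up logits and targets (`logits_demo` and `targets` are illustrative, not the values from the cells above), compared against F.cross_entropy, which the notebook switches to further down.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)  # arbitrary seed for this sketch
logits_demo = torch.randn((4, 27), generator=g)  # 4 examples, 27 possible characters
targets = torch.tensor([5, 13, 13, 1])           # made-up correct characters

counts = logits_demo.exp()                   # fake counts, all positive
prob = counts / counts.sum(1, keepdim=True)  # each row now sums to 1
row_sums = prob.sum(1)

# pick the probability assigned to the correct character in each row
manual_loss = -prob[torch.arange(4), targets].log().mean()
builtin_loss = F.cross_entropy(logits_demo, targets)
print(manual_loss.item(), builtin_loss.item())
```

The two losses should agree up to floating point rounding.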

## Putting this together
Now we're going to take everything we built above and organize it.

In [64]:
X.shape, Y.shape # dataset

(torch.Size([228146, 3]), torch.Size([228146]))

In [65]:
# building parameters using a generator
g = torch.Generator().manual_seed(2147483647) # for reproducibility
C = torch.randn((27, 2), generator=g)
W1 = torch.randn((6,100), generator=g)
b1 = torch.randn(100, generator=g)
W2 = torch.randn((100, 27), generator=g)
b2 = torch.randn(27, generator=g)
parameters = [C, W1, b1, W2, b2]

In [66]:
# counting number of parameters
sum(p.nelement() for p in parameters) # number of parameters in total

3481
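
The total can be reproduced by hand from the layer shapes. A small sketch of the arithmetic (the variable names are only for this illustration):

```python
vocab_size, emb_dim, block_size, hidden = 27, 2, 3, 100

n_C = vocab_size * emb_dim            # embedding table: 27 * 2 = 54
n_W1 = block_size * emb_dim * hidden  # hidden weights: 6 * 100 = 600
n_b1 = hidden                         # hidden biases: 100
n_W2 = hidden * vocab_size            # output weights: 100 * 27 = 2700
n_b2 = vocab_size                     # output biases: 27

total = n_C + n_W1 + n_b1 + n_W2 + n_b2
print(total)  # 3481
```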

In [67]:
for p in parameters:
    p.requires_grad = True

In [139]:
for _ in range(100):
    
    # minibatch construct
    ix = torch.randint(0, X.shape[0], (32,)) # creating a minibatch of size 32
    
    # calculating loss (forward pass)
    emb = C[X[ix]] # (32, 3, 2)
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1) # (32, 100)
    logits = h @ W2 + b2 # (32, 27)
    # counts = logits.exp()
    # prob = counts / counts.sum(1, keepdim=True)
    # loss = -prob[torch.arange(32), Y].log().mean()

    # we can replace the above lines with a more condensed and efficient version
    # it doesn't create the intermediate tensors we do above and can use fused kernels for efficiency
    # this also makes the backward pass more efficient because the formulas can be simplified
    loss = F.cross_entropy(logits, Y[ix])
    
    # backward pass
    for p in parameters:
        p.grad = None
    loss.backward()
    # update parameters
    for p in parameters:
        p.data += -0.1 * p.grad
print(loss.item())

2.5390963554382324


In [141]:
emb = C[X]
h = torch.tanh(emb.view(-1, 6) @ W1 + b1)
logits = h @ W2 + b2
loss = F.cross_entropy(logits, Y)
loss

tensor(2.5335, grad_fn=<NllLossBackward0>)

Another problem with our original approach is that with very large values in our logits (which may happen during optimization), we run out of room in our float: exp() gives 'inf', which becomes 'nan' when calculating probs -- very very bad :(. PyTorch solves this by using the fact that you can offset the logits by any constant and get the same probs. Since large negative values are ok but large positive ones are bad, PyTorch subtracts the largest logit in each row. This makes the largest logit 0 and every other value negative. Yay, no more problems (well behaved).
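
The overflow is easy to reproduce. This sketch uses a hand-picked row of logits with one extreme value (`logits_demo` is made up for the illustration) and applies the max-subtraction trick by hand; it illustrates the idea and is not PyTorch's actual internal code.

```python
import torch
import torch.nn.functional as F

logits_demo = torch.tensor([[-5.0, 0.0, 3.0, 100.0]])  # one very large logit
target = torch.tensor([3])

# naive version: exp(100) does not fit in float32
naive_counts = logits_demo.exp()                               # last entry is inf
naive_prob = naive_counts / naive_counts.sum(1, keepdim=True)  # inf / inf gives nan

# shifted version: subtract the row maximum first, so the largest logit becomes 0
shifted = logits_demo - logits_demo.max(1, keepdim=True).values
stable_counts = shifted.exp()
stable_prob = stable_counts / stable_counts.sum(1, keepdim=True)

# for ordinary logits the shift changes nothing
small = torch.tensor([[1.0, 2.0, 3.0]])
unchanged = torch.allclose(F.softmax(small, dim=1), F.softmax(small - 3.0, dim=1))

loss = F.cross_entropy(logits_demo, target)  # stays finite on the extreme logits
print(naive_prob, stable_prob, loss.item())
```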

## Optimizing training
In reality, we don't train over the entire dataset on every step. Instead, we do the forward pass, backward pass and update on mini batches of the dataset.

In [72]:
torch.randint(0, X.shape[0], (32,)) # creating a minibatch of size 32

tensor([  3655, 162683,  53362,   6216, 117429, 219263, 213018, 222565,  99398,
        201728, 106309,  91032, 226273, 153939, 134342,  42442, 187806, 159240,
         12071,  73795, 227163, 181632,  85177,  32375, 151901, 222857, 200999,
        145197, 160510,  79528, 144552,  85869])
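
Those random indices are then used to pull out the matching rows of X and Y. A minimal sketch with small random stand-ins for the dataset (`X_demo`, `Y_demo`, and the sizes are assumptions for the illustration):

```python
import torch

g = torch.Generator().manual_seed(0)  # arbitrary seed for this sketch
X_demo = torch.randint(0, 27, (1000, 3), generator=g)  # stand-in for X: 1000 contexts of 3 characters
Y_demo = torch.randint(0, 27, (1000,), generator=g)    # stand-in for Y: 1000 target characters

# 32 row indices drawn independently, so the same row can show up more than once
ix = torch.randint(0, X_demo.shape[0], (32,), generator=g)

Xb, Yb = X_demo[ix], Y_demo[ix]  # the same ix keeps each context paired with its label
print(Xb.shape, Yb.shape)
```

Each step only touches 32 rows, so the gradient is a noisy estimate of the full-dataset gradient but is much cheaper to compute.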

## Learning Rate
How do we determine the learning rate? 

__STOPPED AT 45:52!!!!!!!!!__