## E01: train a trigram language model, i.e. take two characters as an input to predict the 3rd one. Feel free to use either counting or a neural net. Evaluate the loss; Did it improve over a bigram model?

## Now to begin our training

### First step: how do I train a trigram model? To start, I need to understand how to actually produce trigrams

In [133]:
import torch
import torch.nn.functional as F

import matplotlib.pyplot as plt

In [34]:
words = open('names.txt', 'r').read().split()

Maybe, instead, I now have to treat every pair of characters as a single input token, i.e. build a stoi over the pairs aa, ab, ac, ad, ..., zy, zz (plus the pairs that contain '.'), alongside the usual single-character stoi for the output

I think that means that N will be a (27 $\times$ 27) by (27 $\times$ 27) tensor, since we need to take all of these combinations into account

Actually it should be a (27 $\times$ 27) by 27 matrix, because there are 27 $\times$ 27 possibilities for the input, but only 27 possibilities for the output. I need to make sure the mapping is good

Also, the pair '..' can never occur as an input, so technically the first dimension of the tensor has 27 $\times$ 27 $-$ 1 = 728 entries?
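
As a quick check on these sizes, here is a minimal sketch that enumerates every two-character context. It assumes each word is padded with a single '.' on each side; the names `symbols`, `contexts` and `reachable` are made up for this illustration.

```python
from itertools import product

# 26 lowercase letters plus '.' as the start/end marker (27 symbols in total)
symbols = list("abcdefghijklmnopqrstuvwxyz") + ["."]

# every ordered pair of symbols is a possible two-character context
contexts = ["".join(pair) for pair in product(symbols, repeat=2)]
num_contexts = len(contexts)                  # 27 * 27
num_without_double_dot = num_contexts - 1     # drop '..'

# assumption: with one '.' of padding per side, the second character of an
# input pair is always a letter, so pairs ending in '.' never show up as input
reachable = [c for c in contexts if c[1] != "."]

print(num_contexts, num_without_double_dot, len(reachable))
```

Under that padding assumption only 27 $\times$ 26 = 702 of the 728 rows can ever receive data; the remaining rows are harmless but never used.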

In [113]:
N = torch.zeros((27*27-1, 27), dtype=torch.int32)
N.shape

torch.Size([728, 27])

Now we need to set up chars so that stoi and itos are configured properly. The input is now a pair of characters and the target is the next letter, so a single mapping is no longer enough

I'm thinking we'll need two sets of dictionaries: one that maps every character pair (the input) to a unique index, and one that maps every single character (the output) to an index

In [200]:
single_chr = sorted(list(set(''.join(words))))
print(f'single_chr: {single_chr}')
single_stoi = {s:i+1 for i, s in enumerate(single_chr)} # reserve elem 0 for the dot
single_stoi['.'] = 0
single_itos = {i:s for s, i in single_stoi.items()} # flip integers and strings around

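# every ordered pair of characters; single_itos iterates as a..z and then '.'
double_chr = [f'{ch1}{ch2}' for ch1 in single_itos.values() for ch2 in single_itos.values()]
# NOTE: i+1 makes the pair indices start at 1, so index 0 is never used; once '..' is
# removed below, the largest index is 728 ('.z'), which one_hot(..., num_classes=728) would reject
double_stoi = {s:i+1 for i, s in enumerate(double_chr)}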

single_chr: ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


In [201]:
# need to eliminate the key-value pair associated with '..' since this won't happen!
del double_stoi['..']

In [202]:
double_itos = {i:s for s, i in double_stoi.items()}
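
An alternative to a second pair of dictionaries is to compute the pair index arithmetically from the single-character indices. The sketch below is only an illustration: `pair_to_index` and `index_to_pair` are hypothetical helpers, and the indices they produce are 0-based, so they differ from the values printed in this notebook.

```python
# 0-based single-character vocabulary: '.' is 0, 'a'..'z' are 1..26
chars = "." + "abcdefghijklmnopqrstuvwxyz"
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}
V = len(chars)  # 27

def pair_to_index(ch1, ch2):
    """Hypothetical helper: flatten a character pair into one index in 0..728."""
    return stoi[ch1] * V + stoi[ch2]

def index_to_pair(ix):
    """Inverse of pair_to_index: recover the two characters from the flat index."""
    i1, i2 = divmod(ix, V)
    return itos[i1] + itos[i2]

print(pair_to_index(".", "e"), index_to_pair(pair_to_index("e", "m")))
```

With this layout the weight matrix would need 27 $\times$ 27 = 729 rows, with row 0 (the pair '..') simply never used, and every index is a valid one-hot class.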

In [203]:
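# Build the dataset: each input is a character pair, each target is the next character
xs, ys = [], []
for w in words[:1]: # only the first word for now, to keep the printout small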
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
        ch12 = ch1 + ch2
        ix_d = double_stoi[ch12]
        ix_s = single_stoi[ch3]
        print(f'ch12 ch3: {ch12} {ch3}')
        print(f'ix_d, ix_s: {ix_d}, {ix_s}')
        
        xs.append(ix_d)
        ys.append(ix_s)
        
# then we'll turn these into tensors since we'll use them to build the network with PyTorch
xs = torch.tensor(xs)
ys = torch.tensor(ys)
print(f'number of examples: {xs.nelement()}')
print(f'xs: {xs}')
print(f'ys: {ys}')

ch12 ch3: .e m
ix_d, ix_s: 707, 13
ch12 ch3: em m
ix_d, ix_s: 121, 13
ch12 ch3: mm a
ix_d, ix_s: 337, 1
ch12 ch3: ma .
ix_d, ix_s: 325, 0
number of examples: 4
xs: tensor([707, 121, 337, 325])
ys: tensor([13, 13,  1,  0])


So the goal is to learn a mapping from each encoded input pair to its target character, i.e. we now want to train our trigram classifier model!

Let's do it step by step first before making an entire loop out of it. It's a good habit that I want to pick up from Andrej

In [204]:
num_inputs = 27*27-1
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((num_inputs, 27), generator=g, requires_grad=True) # remember to set requires_grad

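We need to take this a little slower: \
N.shape is 728 x 27 \
xenc.shape is 4 x 728 \
W.shape is 728 x 27 \
so `logits = xenc @ W` is 4 x 27, and the shapes work out. How do I interpret this, though? Each row of xenc is one-hot, so the product simply picks out the matching row of W: row i of the logits holds the 27 scores for the next character given input pair i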
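
One way to read the product is that a one-hot row times a matrix just selects one row of that matrix. Here is a minimal pure-Python sketch with made-up toy numbers; `one_hot`, `vec_mat` and `W_toy` are names invented for this illustration, not the PyTorch calls used in the notebook.

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

def vec_mat(vec, mat):
    """Multiply a row vector (length n) by an n x m matrix given as nested lists."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

# toy weights: 4 possible inputs, 3 possible outputs (made-up numbers)
W_toy = [
    [0.1, 0.2, 0.3],
    [1.0, 2.0, 3.0],
    [-1.0, 0.0, 1.0],
    [0.5, 0.5, 0.5],
]

# encoding input 2 and multiplying gives back exactly row 2 of the weights
logits_row = vec_mat(one_hot(2, 4), W_toy)
print(logits_row)
print(logits_row == W_toy[2])
```

So each row of W can be thought of as the (log) counts for the next character after one particular input pair.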

The ordering of our forward pass is as follows: 

Encode our inputs as one-hot vectors \
Matrix multiply with the weights in order to get our logits, i.e. `xenc @ W` \
Exponentiate our logits to obtain our counts \
Normalize our counts, which finishes the softmax and gives us probabilities \
Then take the log of the probability assigned to each correct next character (the log likelihood) \
Finally obtain the average negative log likelihood by taking the mean, and then negating
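
Written out as formulas, the steps above amount to the following, where $x_i$ is the one-hot row for example $i$, $y_i$ is its target index and $n$ is the number of examples. The symbols $z$, $p$ and $\mathcal{L}$ are introduced here only for illustration; they do not appear in the code.

$$
z_i = x_i W, \qquad
p_{ij} = \frac{e^{z_{ij}}}{\sum_{k} e^{z_{ik}}}, \qquad
\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \log p_{i, y_i}
$$

Here $z$ corresponds to `logits`, $e^{z}$ to `counts` and $p$ to `probs`, with the sum over $k$ running over the 27 possible next characters.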


In [205]:
xenc = F.one_hot(xs, num_classes=num_inputs).float()
logits = xenc @ W
counts = logits.exp()
probs = counts / counts.sum(1, keepdims=True)
print(probs)
#loss = -probs[torch.arange(len(ys)), ys].log().mean()
#loss # looks super high

tensor([[0.0157, 0.0302, 0.0059, 0.0085, 0.1071, 0.0358, 0.0426, 0.0234, 0.0086,
         0.0893, 0.0781, 0.0770, 0.1513, 0.0208, 0.0297, 0.0058, 0.0717, 0.0063,
         0.0377, 0.0274, 0.0232, 0.0126, 0.0261, 0.0064, 0.0262, 0.0303, 0.0025],
        [0.0139, 0.0094, 0.0397, 0.0686, 0.0444, 0.0050, 0.0111, 0.0245, 0.0102,
         0.0206, 0.0026, 0.0147, 0.1970, 0.0124, 0.0788, 0.0459, 0.0263, 0.0025,
         0.0579, 0.0161, 0.0223, 0.0106, 0.0627, 0.0577, 0.0516, 0.0420, 0.0516],
        [0.0226, 0.0113, 0.1661, 0.0172, 0.0140, 0.0320, 0.0641, 0.0233, 0.0447,
         0.0064, 0.0360, 0.0593, 0.0596, 0.0058, 0.0209, 0.0053, 0.0066, 0.0700,
         0.0120, 0.0155, 0.0279, 0.1599, 0.0406, 0.0021, 0.0576, 0.0071, 0.0124],
        [0.0119, 0.0322, 0.0122, 0.0629, 0.0382, 0.0812, 0.0404, 0.0082, 0.0089,
         0.0134, 0.0104, 0.0380, 0.0203, 0.0810, 0.0116, 0.0262, 0.0185, 0.0412,
         0.0473, 0.0329, 0.1842, 0.0691, 0.0493, 0.0120, 0.0261, 0.0094, 0.0127]],
       grad_fn=<DivBack

Now let's do our backward pass

In [206]:
W.grad = None
loss.backward()

Finally, we need to update W according to the gradient which we just computed

In [207]:
W.data += -1 * W.grad

TypeError: unsupported operand type(s) for *: 'int' and 'NoneType'

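The update fails because `W.grad` is still `None`: the `loss` line in the forward-pass cell is commented out, so no loss computed from the current `W` was ever backpropagated. Run the cell below to get the loss for the current weights (still at their initial values), which is the number to compare against once an update has actually been applied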

In [208]:
xenc = F.one_hot(xs, num_classes=num_inputs).float()
logits = xenc @ W
counts = logits.exp()
probs = counts / counts.sum(1, keepdims=True)
loss = -probs[torch.arange(len(ys)), ys].log().mean()
loss # looks super high

tensor(4.2948, grad_fn=<NegBackward0>)
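
To put "super high" into perspective, it helps to compare against a model that guesses uniformly over all 27 characters. A small sketch of that arithmetic; `uniform_loss` and `observed_loss` are names made up here, and the observed value is copied from the output above.

```python
import math

num_classes = 27                               # 26 letters plus '.'
uniform_loss = -math.log(1.0 / num_classes)    # NLL of a uniform guess
observed_loss = 4.2948                         # value printed above

print(f"uniform baseline: {uniform_loss:.4f}")
print(f"observed is higher than uniform: {observed_loss > uniform_loss}")
```

The randomly initialized weights start out worse than a uniform guess, which is what one would expect before any training.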

### Now let's formulate this as a training loop

In [209]:
train_steps = 10 # number of gradient descent iterations
train_step = 0.5 # learning rate (step size)
num_inputs = 27*27-1
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((num_inputs, 27), generator=g, requires_grad=True) # remember to set requires_grad
for k in range(train_steps):
    # Forward pass
    xenc = F.one_hot(xs, num_classes=num_inputs).float()
    logits = xenc @ W
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdims=True)
    anll = -probs[torch.arange(len(ys)), ys].log().mean() # average neg log likelihood
    
    # Backward pass
    W.grad = None
    anll.backward()
    
    # Update
    W.data += -train_step * W.grad
    print(f'avg neg log likelihood loss in step {k} of {train_steps}: {anll:.4f}')

avg neg log likelihood loss in step 0 of 10: 4.2948
avg neg log likelihood loss in step 1 of 10: 4.1640
avg neg log likelihood loss in step 2 of 10: 4.0339
avg neg log likelihood loss in step 3 of 10: 3.9045
avg neg log likelihood loss in step 4 of 10: 3.7758
avg neg log likelihood loss in step 5 of 10: 3.6480
avg neg log likelihood loss in step 6 of 10: 3.5211
avg neg log likelihood loss in step 7 of 10: 3.3953
avg neg log likelihood loss in step 8 of 10: 3.2706
avg neg log likelihood loss in step 9 of 10: 3.1472
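
The loop above only sees the four examples from the first word, so its loss cannot yet be compared with the bigram model. Since the exercise also allows counting, here is a minimal counting-based sketch for reference. It runs on a tiny hypothetical word list rather than names.txt, and the add-one smoothing is an assumption of this sketch, not something the notebook does.

```python
import math
from collections import defaultdict

# hypothetical stand-in for names.txt so the sketch runs on its own
toy_words = ["emma", "olivia", "ava", "emily"]
V = 27  # 26 letters plus '.'

def trigrams(word):
    """Yield (two-character context, next character) pairs for one word."""
    chs = ["."] + list(word) + ["."]
    for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
        yield ch1 + ch2, ch3

# count how often each next character follows each context
counts = defaultdict(lambda: defaultdict(int))
for w in toy_words:
    for ctx, nxt in trigrams(w):
        counts[ctx][nxt] += 1

def prob(ctx, nxt, smoothing=1):
    """Smoothed estimate of P(nxt | ctx); add-one smoothing is an assumption."""
    row = counts[ctx]
    return (row.get(nxt, 0) + smoothing) / (sum(row.values()) + smoothing * V)

# average negative log likelihood over the same (training) trigrams
log_probs = [math.log(prob(ctx, nxt)) for w in toy_words for ctx, nxt in trigrams(w)]
avg_nll = -sum(log_probs) / len(log_probs)
print(f"avg neg log likelihood (counting, toy data): {avg_nll:.4f}")
```

To answer the exercise's question, the same evaluation would have to be run on the full word list and compared with the bigram loss measured on the same data.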
