In [2]:
import torch
import torch.nn.functional as F

import matplotlib.pyplot as plt
%matplotlib inline

Goal (as in the first lecture/notebook): generate _more_ rows of text that look like the ones fed in. Here, we do this by picking a likely next character based on the three previous characters.

"We're maximizing the probability of the word, with respect to the parameters of the neural net. The parameters are the weights and biases of the output layer that does a softmax, the hidden layer, and the (character) embedding look up table layer called C."  

# Prep

In [3]:
# start w/ the words (names)
words = open('names.txt', 'r').read().splitlines()
words[:8]

['emma', 'olivia', 'ava', 'isabella', 'sophia', 'charlotte', 'mia', 'amelia']

In [4]:
len(words)

32033

In [7]:
# build the vocab of characters and mapping from/to integers
chars = sorted(list(set(''.join(words))))
stoi = {s:i+1 for i,s in enumerate(chars)}
stoi['.'] = 0
itos = {i:s for s,i in stoi.items()}
print(itos)

{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 0: '.'}


# Build the dataset

In [14]:
# build the dataset
block_size = 3 # context length: how many chars are used to predict the next one
X, Y = [], []

for w in words[:5]:
    print(w)
    context = [0] * block_size
    for ch in w + '.':
        ix = stoi[ch]
        X.append(context)
        Y.append(ix)
        print(''.join(itos[i] for i in context), '-->', itos[ix], context)
        context = context[1:] + [ix] # crop first char and append current label
        
X = torch.tensor(X)
Y = torch.tensor(Y)

emma
... --> e [0, 0, 0]
..e --> m [0, 0, 5]
.em --> m [0, 5, 13]
emm --> a [5, 13, 13]
mma --> . [13, 13, 1]
olivia
... --> o [0, 0, 0]
..o --> l [0, 0, 15]
.ol --> i [0, 15, 12]
oli --> v [15, 12, 9]
liv --> i [12, 9, 22]
ivi --> a [9, 22, 9]
via --> . [22, 9, 1]
ava
... --> a [0, 0, 0]
..a --> v [0, 0, 1]
.av --> a [0, 1, 22]
ava --> . [1, 22, 1]
isabella
... --> i [0, 0, 0]
..i --> s [0, 0, 9]
.is --> a [0, 9, 19]
isa --> b [9, 19, 1]
sab --> e [19, 1, 2]
abe --> l [1, 2, 5]
bel --> l [2, 5, 12]
ell --> a [5, 12, 12]
lla --> . [12, 12, 1]
sophia
... --> s [0, 0, 0]
..s --> o [0, 0, 19]
.so --> p [0, 19, 15]
sop --> h [19, 15, 16]
oph --> i [15, 16, 8]
phi --> a [16, 8, 9]
hia --> . [8, 9, 1]


In [16]:
X.shape, X.dtype, Y.shape, Y.dtype

(torch.Size([32, 3]), torch.int64, torch.Size([32]), torch.int64)

In [27]:
len(X)

32

X has one row per three-character context taken from each input word, stored as an array of three integers. Contexts at the start of a word are left-padded with '.' (index 0). Y holds the integer of the _next_ character after each context, with '.' as the final label marking the end of the word. So the single word 'emma' generates five dataset rows, starting with '...'->'e' and ending with 'mma'->'.'. When we work w/ just the first five words, we get 32 rows.
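The windowing described above can be sketched in plain Python, without PyTorch. This is only an illustration: the helper name `build_examples` is hypothetical, and the character mapping is rebuilt from the alphabet (assumed to match `stoi` above) rather than from names.txt.

```python
import string

# Assumed vocabulary: '.' at index 0 and 'a'..'z' at 1..26, matching itos above.
stoi = {ch: i + 1 for i, ch in enumerate(string.ascii_lowercase)}
stoi['.'] = 0

def build_examples(words, block_size=3):
    """Return (contexts, labels) using a sliding window over each word."""
    contexts, labels = [], []
    for w in words:
        context = [0] * block_size          # left-pad with '.' (index 0)
        for ch in w + '.':                  # trailing '.' marks end of word
            ix = stoi[ch]
            contexts.append(context)
            labels.append(ix)
            context = context[1:] + [ix]    # slide the window by one char
    return contexts, labels

contexts, labels = build_examples(['emma'])
print(len(contexts))                 # one row per character, plus one for the end marker
print(contexts[-1], labels[-1])      # last context and its label
```

Each word of length n contributes n + 1 rows, which is why the first five names (4, 6, 3, 8 and 6 letters) give 32 rows.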

# Build the embedding layer C

In [21]:
embedding_dims = 2

C = torch.randn(27, embedding_dims)

One way to think about the embedding layer is as a lookup table: indexing into C with a character index returns that character's two-dimensional vector of floats.

In [23]:
C[5]

tensor([-1.0411,  0.3832])

Another way to think of the embedding layer is that it's just another layer in the neural net, but one with no bias weights and without a non-linearity (like tanh). Conceptually here we one-hot encode the character we want to provide as input and then matmul that one-hot encoded input with the C layer. The one-hot encoded input has all zeros except for a one in the spot that matches the index we care about - 5 in this example - and so we 'pull out' only the values from C in that row. Ultimately we get the same result.

In [24]:
F.one_hot(torch.tensor(5), num_classes=27).float() @ C

tensor([-1.0411,  0.3832])

Indexing is faster, so we'll do that.

We want to embed not just a single character, but all three chars we have as input. X is our input data - a tensor with many rows (32 when we test w/ just five words of input), each row holding three numbers, one per character. We want to map every one of those character indexes to the floating point numbers that represent/embed that character. So, with 32 rows of three characters each, we want an output that's 32 rows by 3 characters by 2 floats.

PyTorch indexing is powerful and accepts a lot of different things, including lists and even multi-dimensional tensors. To get what we want, all we have to do is index into C with the full X tensor (which again has a shape of 32, 3). PyTorch picks out the matching 2D embedding for each index and returns a tensor of shape 32, 3, 2.

In [37]:
emb = C[X] # this embeds ALL of the dataset's characters in C
emb.shape

torch.Size([32, 3, 2])

In [38]:
emb[:3]

tensor([[[ 0.9123, -0.8199],
         [ 0.9123, -0.8199],
         [ 0.9123, -0.8199]],

        [[ 0.9123, -0.8199],
         [ 0.9123, -0.8199],
         [-1.0411,  0.3832]],

        [[ 0.9123, -0.8199],
         [-1.0411,  0.3832],
         [-0.7880, -0.3084]]])

In [35]:
X[2, 1]

tensor(5)

In [39]:
emb[2, 1]

tensor([-1.0411,  0.3832])
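The spot check above can be extended to every position. The following sketch uses small stand-in tensors (`C_demo` and `X_demo` are hypothetical names, not the notebook's variables) and compares tensor indexing against an explicit per-character lookup.

```python
import torch

torch.manual_seed(0)                      # arbitrary seed, for repeatability only
C_demo = torch.randn(27, 2)               # stand-in lookup table: 27 chars, 2 dims
X_demo = torch.tensor([[0, 0, 5],
                       [0, 5, 13]])       # two contexts of three char indexes

emb_demo = C_demo[X_demo]                 # index with the whole 2D tensor at once

# Explicit loop: look up each character's row one at a time, then restack.
manual = torch.stack([
    torch.stack([C_demo[int(ix)] for ix in row]) for row in X_demo
])

print(emb_demo.shape)                     # (rows, chars per row, embedding dims)
print(torch.equal(emb_demo, manual))      # indexing and the loop agree exactly
```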

# Building the hidden layer

In [42]:
# how many floating point inputs to the hidden layer?
# for three char inputs with two floats per char, it's 6
input_chars = block_size * embedding_dims
input_chars

6

In [44]:
hidden_neurons = 100

W1 = torch.randn((input_chars, hidden_neurons)) # 6, 100
b1 = torch.randn(hidden_neurons) # 100

We want to do emb @ W1 + b1. We can't, right now, because emb is 32, 3, 2 while W1 is 6, 100. We want to combine the 3, 2 part into a single dimension of size 6, so we'll have 32x6, which we can then matmul with 6x100.

In [46]:
# one inefficient way is via cat
emb[:, 0, :].shape # single 32 by 2 embeddings for the first char

torch.Size([32, 2])

In [49]:
# all three chars then are as follows, which gives us 32,6 as we want
torch.cat([emb[:, 0, :], emb[:, 1, :], emb[:, 2, :]], 1).shape # concat on dim 1, not zero

torch.Size([32, 6])

In [52]:
# the above doesn't generalize because it hardcodes a block size of three
# torch.unbind 'removes a tensor dimension' and returns a tuple of slices,
# i.e. the same tensors as the hard-coded list just above
torch.cat(torch.unbind(emb, 1), 1).shape

torch.Size([32, 6])

But both of the above allocate new memory. Instead, we can use 'view', which only changes the metadata (shape and strides) describing the underlying data. Each tensor's underlying storage is just a 1D block of values, so we get a different view without copying anything.

In [56]:
a = torch.arange(18)
a

tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17])

In [58]:
a.view(9, 2)

tensor([[ 0,  1],
        [ 2,  3],
        [ 4,  5],
        [ 6,  7],
        [ 8,  9],
        [10, 11],
        [12, 13],
        [14, 15],
        [16, 17]])

In [59]:
a.view(3, 3, 2)

tensor([[[ 0,  1],
         [ 2,  3],
         [ 4,  5]],

        [[ 6,  7],
         [ 8,  9],
         [10, 11]],

        [[12, 13],
         [14, 15],
         [16, 17]]])

In [60]:
emb.shape

torch.Size([32, 3, 2])

In [62]:
emb.view(32, 6)

tensor([[ 0.9123, -0.8199,  0.9123, -0.8199,  0.9123, -0.8199],
        [ 0.9123, -0.8199,  0.9123, -0.8199, -1.0411,  0.3832],
        [ 0.9123, -0.8199, -1.0411,  0.3832, -0.7880, -0.3084],
        [-1.0411,  0.3832, -0.7880, -0.3084, -0.7880, -0.3084],
        [-0.7880, -0.3084, -0.7880, -0.3084, -1.9219,  0.5264],
        [ 0.9123, -0.8199,  0.9123, -0.8199,  0.9123, -0.8199],
        [ 0.9123, -0.8199,  0.9123, -0.8199, -0.5661, -0.2186],
        [ 0.9123, -0.8199, -0.5661, -0.2186, -0.6146,  1.3284],
        [-0.5661, -0.2186, -0.6146,  1.3284, -0.5162, -0.2247],
        [-0.6146,  1.3284, -0.5162, -0.2247, -0.4924,  0.8374],
        [-0.5162, -0.2247, -0.4924,  0.8374, -0.5162, -0.2247],
        [-0.4924,  0.8374, -0.5162, -0.2247, -1.9219,  0.5264],
        [ 0.9123, -0.8199,  0.9123, -0.8199,  0.9123, -0.8199],
        [ 0.9123, -0.8199,  0.9123, -0.8199, -1.9219,  0.5264],
        [ 0.9123, -0.8199, -1.9219,  0.5264, -0.4924,  0.8374],
        [-1.9219,  0.5264, -0.4924,  0.8

In [64]:
# view and unbind give the same result
emb.view(32, 6) == torch.cat(torch.unbind(emb, 1), 1)

tensor([[True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, T

The above does rely on the fact that 'view', when asked for a 32x6 from our 32x3x2, combines each of the 3x2 into a single 6. (Or rather, that the underlying data for the 3x2 is a 1D sequence of six values.)
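The memory claim can be checked directly. This is a small sketch on a toy tensor (`t` stands in for emb); it compares storage addresses via `data_ptr()` and shows that a write to the original is visible through the view but not through the `cat` result.

```python
import torch

t = torch.arange(12.0).view(2, 3, 2)          # toy stand-in for emb

flat_view = t.view(2, 6)                       # new shape, same storage
flat_cat = torch.cat(torch.unbind(t, 1), 1)    # same values, freshly allocated

print(flat_view.data_ptr() == t.data_ptr())    # view points at the same memory
print(flat_cat.data_ptr() == t.data_ptr())     # cat result lives elsewhere

t[0, 0, 0] = 100.0                             # write through the original
print(flat_view[0, 0].item(), flat_cat[0, 0].item())
```

Because the view reads the storage in order, the six values of each 3x2 block end up side by side in one row, which is the property the comparison above depends on.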

In [66]:
# ultimately, we can do our matmul to get the hidden layer vals
h = emb.view(32, 6) @ W1 + b1
h

tensor([[ 0.0488, -1.6588, -3.3078,  ..., -1.4160,  0.2018, -2.7426],
        [ 1.5453,  1.8850, -4.2101,  ..., -5.3272,  1.9020, -0.6278],
        [-0.5530,  2.6476, -4.3544,  ..., -0.2253,  0.6588, -1.0174],
        ...,
        [-2.1331,  1.0629, -3.1943,  ...,  1.9976, -1.9862, -2.4230],
        [ 0.7523,  2.3426,  1.6436,  ...,  0.4643, -0.9482, -0.9842],
        [ 0.6252,  5.3815, -4.5769,  ..., -4.7524,  0.1618, -1.8924]])

In [67]:
h.shape

torch.Size([32, 100])

In [70]:
h = emb.view(-1, 6) @ W1 + b1 # -1 tells view to infer that dim from the data; here it comes out as 32, same as len(emb)
h.shape

torch.Size([32, 100])

In [68]:
len(emb)

32

In [73]:
h = torch.tanh(emb.view(-1, input_chars) @ W1 + b1)
h

tensor([[ 0.0488, -0.9301, -0.9973,  ..., -0.8888,  0.1991, -0.9917],
        [ 0.9130,  0.9549, -0.9996,  ..., -1.0000,  0.9564, -0.5565],
        [-0.5028,  0.9900, -0.9997,  ..., -0.2215,  0.5776, -0.7688],
        ...,
        [-0.9723,  0.7868, -0.9966,  ...,  0.9639, -0.9630, -0.9844],
        [ 0.6365,  0.9817,  0.9280,  ...,  0.4335, -0.7390, -0.7549],
        [ 0.5547,  1.0000, -0.9998,  ..., -0.9999,  0.1604, -0.9556]])

Also, he talked about confirming that b1 is broadcast to each row of the matmul result. To check my understanding: we have 32 input rows of six values each, and our W1 tensor is 6x100.

First, I want to remember that we have 100 hidden neurons, each of which is connected to six inputs. For each of the 100 hidden neurons we have six weights, one per input, which is what makes W1 6,100. We also have a single bias vector of length 100: each value belongs to one hidden neuron and doesn't depend on how many inputs that neuron has. So, per hidden neuron, the calc is six ordinary multiplications (each input value times the weight for that neuron and that input), summed, plus that neuron's single bias value.
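As a sanity check on that per-neuron description, one entry of the hidden layer's pre-activation can be computed by hand and compared with the matmul. The tensors here are random stand-ins (`x`, `W`, `b` are hypothetical names for emb.view(-1, 6), W1 and b1), and the row and neuron picked are arbitrary.

```python
import torch

torch.manual_seed(0)         # arbitrary seed
x = torch.randn(32, 6)       # stand-in for emb.view(-1, 6)
W = torch.randn(6, 100)      # stand-in for W1
b = torch.randn(100)         # stand-in for b1

row, neuron = 3, 42          # arbitrary example position

# Six multiplications, summed, plus that neuron's bias.
by_hand = sum(x[row, k] * W[k, neuron] for k in range(6)) + b[neuron]

full = x @ W + b             # the whole layer at once
print(torch.allclose(by_hand, full[row, neuron], atol=1e-5))
```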

The matmul is 32,6 @ 6,100 -> 32,100. It takes the six values in the first input row, multiplies them element by element with the six values in the first column of W1, sums the products, and stores the sum at the first row and column of the 32,100 result. It pairs the same six input values with each of the remaining 99 columns to fill out the 100 columns of the first row. Then it does the same for the remaining 31 input rows, generating 100 values each time, to fill out all 32 rows. That's 32*100 = 3200 dot products, each made of six multiplications. Our bias vector is just 100 values, and we want it added to every one of the 32 rows. PyTorch does this by broadcasting: we're adding a 100 to a 32,100, so it lines up the 100 with the right-most dimension, fills in the missing leading dim with 1 to make a 1,100 tensor, and then (conceptually) repeats that row 32 times so it's added to each row of the 32,100 matmul result.
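The broadcasting steps can be spelled out and compared with the implicit version. This is a sketch with random stand-ins (`out` for the matmul result, `b` for b1), not the notebook's actual tensors.

```python
import torch

torch.manual_seed(0)                 # arbitrary seed
out = torch.randn(32, 100)           # stand-in for the 32,100 matmul result
b = torch.randn(100)                 # stand-in for b1

broadcast = out + b                               # implicit broadcasting
explicit = out + b.view(1, 100).expand(32, 100)   # same steps made explicit

print(torch.equal(broadcast, explicit))           # identical results
print(torch.equal(broadcast[7], out[7] + b))      # every row gets the same bias
```

`expand` does not copy the bias 32 times in memory; it only presents the single row as if it were repeated.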

In [74]:
W1.shape

torch.Size([6, 100])

# Building the final output softmax/probability layer

In [76]:
# final layer is 100,27 - 100 input neurons from the hidden layer
# mapping to 27 output neurons, one for each character
W2 = torch.randn((hidden_neurons, 27)) # 100, 27
b2 = torch.randn(27)

In [78]:
# and since we have a 32,100 input, matmul with 100,27 -> 32,27
logits = h @ W2 + b2
logits.shape

torch.Size([32, 27])

In [79]:
# logits are log counts, so first exponentiate to get 'fake counts'
counts = logits.exp()

In [80]:
# then normalize to get probabilities
prob = counts / counts.sum(1, keepdims=True)
prob.shape

torch.Size([32, 27])

In [81]:
prob[17].sum()

tensor(1.0000)

In [82]:
# actual error, based on ground truth Y (next char in seq)
Y

tensor([ 5, 13, 13,  1,  0, 15, 12,  9, 22,  9,  1,  0,  1, 22,  1,  0,  9, 19,
         1,  2,  5, 12, 12,  1,  0, 19, 15, 16,  8,  9,  1,  0])

Ultimately, we want to index into the probabilities our model generated (one for each of the 27 chars) and pull out the specific probability for the ground truth/actual character we want the model to choose. We want this probability to be as high as possible, since high probabilities mean the model is more likely to pick the correct character. (Note that for many contexts it's impossible to get a probability of 1, because in the actual data the same context isn't always followed by the same character. For example, in the first five words the context '...' is followed by 'e', 'o', 'a', 'i' and 's'.)

In [84]:
# first param - range of 32 - indexes into prob once for each of the rows in prob
# second param - Y - is a matching length (we have one ground truth for each input)
# and the value in Y is the index into the 27 probabilities
# ultimately this gives us the probability for the correct next char for each input row
prob[torch.arange(32), Y]

tensor([2.9640e-12, 2.4226e-03, 9.4587e-03, 1.9986e-12, 7.8846e-06, 4.0592e-12,
        3.2841e-06, 1.0013e-13, 7.7810e-05, 8.3004e-03, 7.2752e-09, 2.5784e-07,
        9.9989e-01, 2.3777e-14, 1.1788e-06, 6.0504e-05, 1.5758e-19, 1.1390e-14,
        9.9686e-01, 3.8572e-09, 2.4290e-05, 2.9828e-03, 5.4854e-02, 3.5183e-09,
        5.0137e-09, 6.2990e-12, 6.9787e-08, 1.9089e-05, 2.5970e-08, 1.2333e-10,
        1.8736e-15, 4.0650e-05])

Most of these probabilities are terrible - very close to zero. That's expected, since we haven't trained the model yet.

Finally, we want a single score across all of the input rows and their ground truth values. That score is the negative of the average log probability the model assigned to each correct character. This is the negative log likelihood loss.

In [86]:
loss = -prob[torch.arange(32), Y].log().mean()
loss

tensor(16.8043)
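To make the loss value easier to interpret, here is a small worked example in plain Python. The probabilities are made up for illustration; the second part computes the loss of a hypothetical model that assigns equal probability to all 27 characters, which gives a useful reference point.

```python
import math

# Toy probabilities for two examples over three classes (each row sums to 1).
probs = [[0.50, 0.25, 0.25],
         [0.10, 0.80, 0.10]]
targets = [0, 1]                      # index of the correct class per row

# Pick the probability assigned to the correct class in each row.
picked = [row[t] for row, t in zip(probs, targets)]       # [0.5, 0.8]
nll = -sum(math.log(p) for p in picked) / len(picked)
print(round(nll, 4))

# Reference point: probability spread evenly over 27 characters.
uniform_nll = -math.log(1 / 27)
print(round(uniform_nll, 4))
```

The untrained loss of 16.8 above is far worse than the uniform reference (about 3.30), because the randomly initialized model puts almost no probability on the correct character for many rows.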

# Cleaned up and pulled together first cut at a model that minimizes the loss

In [88]:
karpathy_seed = 2147483647

In [93]:
embedding_dims = 2
hidden_neurons = 100

In [94]:
print(block_size) # don't set here because it affects the X, Y data
X.shape, Y.shape

3


(torch.Size([32, 3]), torch.Size([32]))

In [95]:
g = torch.Generator().manual_seed(karpathy_seed)
C = torch.randn((27, embedding_dims), generator=g) # 27,2
W1 = torch.randn((block_size*embedding_dims, hidden_neurons), generator=g) # 6, 100
b1 = torch.randn(hidden_neurons, generator=g)
W2 = torch.randn((hidden_neurons, 27), generator=g)
b2 = torch.randn(27, generator=g)
parameters = [C, W1, b1, W2, b2]

In [96]:
# how many total parameters are we finding?
sum(p.nelement() for p in parameters)

3481
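The total can be reproduced from the layer shapes alone. This is just the arithmetic, with the sizes copied from the cells above.

```python
vocab_size, embedding_dims, block_size, hidden_neurons = 27, 2, 3, 100

counts = {
    'C':  vocab_size * embedding_dims,                     # 27 * 2
    'W1': block_size * embedding_dims * hidden_neurons,    # 6 * 100
    'b1': hidden_neurons,                                  # 100
    'W2': hidden_neurons * vocab_size,                     # 100 * 27
    'b2': vocab_size,                                      # 27
}
total = sum(counts.values())
print(counts, total)
```

Most of the parameters sit in W2, the hidden-to-output weights.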

The cross_entropy function used below replaces the manual calculation of the probabilities and of the loss - it just needs the logits (log counts) and the ground truth. Besides being a standard, built-in function (so easier to read and search for, and free of roll-your-own bugs), it's more efficient: it doesn't create the intermediate tensors, it can use 'fused kernels', and the backward pass uses a simpler, analytically combined expression. It is also better behaved numerically: internally it subtracts the largest logit before exponentiating, which doesn't change the probabilities but keeps exp() from overflowing to floating point inf.
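Both claims can be checked on toy data. The sketch below uses random stand-in logits and made-up targets; it first compares the manual route with `F.cross_entropy`, then shows a large logit breaking the manual route and the max-subtraction trick fixing it.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)                          # arbitrary seed
logits = torch.randn(4, 27)                   # stand-in logits for four rows
targets = torch.tensor([5, 13, 1, 0])         # made-up ground truth indexes

# Manual route: exponentiate, normalize, pick the target probability, average.
counts = logits.exp()
prob = counts / counts.sum(1, keepdim=True)
manual = -prob[torch.arange(4), targets].log().mean()
print(torch.allclose(manual, F.cross_entropy(logits, targets), atol=1e-5))

# A large logit overflows exp() in float32 and poisons the manual route.
big = torch.tensor([[-5.0, 0.0, 100.0]])
naive = big.exp() / big.exp().sum(1, keepdim=True)
print(naive)                                  # contains nan, since exp(100) is inf

# Subtracting the row maximum leaves the probabilities unchanged but finite.
shifted = big - big.max(1, keepdim=True).values
stable = shifted.exp() / shifted.exp().sum(1, keepdim=True)
print(stable)
```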

In [103]:
for p in parameters:
    p.requires_grad = True

In [109]:
for _ in range(1000):
    # forward pass
    emb = C[X] # 32, 3, 2
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1) # 32, 100
    logits = h @ W2 + b2 # 32, 27
    loss = F.cross_entropy(logits, Y) 
    # backward pass
    for p in parameters:
        p.grad = None
    loss.backward()
    # update
    for p in parameters:
        p.data += -0.1 * p.grad
        
loss.item()

0.25440800189971924

This is a very low loss, but only because the data we train on here is just 32 rows while we have 3481 params, so we massively overfit. (We can't overfit to a loss of zero because in some cases - like with the first ch