# Our last method, counting every bigram, won't scale: the count table grows exponentially with the context length. We're going to take an MLP approach instead.
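To make the scaling claim concrete, here is a rough sketch of how many entries a count table would need as the context gets longer. It assumes the 27-symbol vocabulary (26 letters plus `.`) used in this notebook; `count_table_entries` is a hypothetical helper written for this illustration only.

```python
# Hypothetical helper: size of a count table for a given context length.
# Assumes a 27-symbol vocabulary (26 letters plus '.').
def count_table_entries(context_len, vocab_size=27):
    # one row per possible context, one column per possible next character
    return (vocab_size ** context_len) * vocab_size

for n in range(1, 5):
    print(n, count_table_entries(n))
```

A context of one character is the bigram table; each extra character of context multiplies the table size by 27.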

We're going to be following the approach of this 2003 paper: https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf.

While the paper used a word-level model, we'll stick to a character-level model, but with the same modeling approach.

The paper took a vocabulary of 17,000 words and associated a 30-element feature vector with each, embedding them in a 30-dimensional space. The idea is that words with similar meanings may end up near each other in this 30D space, and conversely, words that are very different should end up far apart. (Note: the paper mentions feature sizes of 30, 60 and 100 in its experiments.)

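The great power here is that closeness within the embedding space lets the model generalize broadly: a context it never saw during training can still get a sensible prediction if its words sit close to the words of a context it did see.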



In [1]:
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
# read in all the words

words = open('names.txt', 'r').read().splitlines()
words[:8]

['emma', 'olivia', 'ava', 'isabella', 'sophia', 'charlotte', 'mia', 'amelia']

In [17]:
len(words)

32033

In [15]:
# Build the "vocabulary" of characters and the mapping to and from integers

chars = sorted(list(set(''.join(words))))
stoi = {s:i+1 for i,s in enumerate(chars)}
stoi['.'] = 0 # string to int
itos = {i:s for s, i in stoi.items()} # int to string
print(stoi,'\n',itos)

{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8, 'i': 9, 'j': 10, 'k': 11, 'l': 12, 'm': 13, 'n': 14, 'o': 15, 'p': 16, 'q': 17, 'r': 18, 's': 19, 't': 20, 'u': 21, 'v': 22, 'w': 23, 'x': 24, 'y': 25, 'z': 26, '.': 0} 
 {1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 0: '.'}
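As a quick sanity check on the two mappings, the sketch below encodes a name to integers and decodes it back. It rebuilds `stoi` and `itos` from the lowercase alphabet rather than from `names.txt`, on the assumption (consistent with the printed output above) that the dataset contains exactly the letters a to z.

```python
import string

# Assumption: the vocabulary is exactly a-z, matching the mapping printed above.
chars = list(string.ascii_lowercase)
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi['.'] = 0  # index 0 is reserved for the start/end marker
itos = {i: s for s, i in stoi.items()}

encoded = [stoi[ch] for ch in 'emma.']      # string -> list of ints
decoded = ''.join(itos[i] for i in encoded)  # list of ints -> string
print(encoded, decoded)
```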


In [22]:
# Here's how we build the dataset.
# block_size is the context length: how many previous characters we use to predict the next one.
block_size = 3 # here we look at the last 3 characters to guess the 4th.

X,Y = [], []
# just the first five words for now
for w in words[:5]:
  print(w)
  context = [0]*block_size # start with a context of all '.' (index 0), block_size long
  for ch in w + '.':
    ix = stoi[ch] # index of the current character
    X.append(context) # the input: the current context
    Y.append(ix) # the label: index of the character that follows the context
    print(''.join(itos[i] for i in context), '--->', itos[ix])
    # the line above prints the current context, an arrow, and the next character

    context = context[1:] + [ix] # slide the window: drop the oldest index, append the newest

X,Y = torch.tensor(X), torch.tensor(Y)


emma
... ---> e
..e ---> m
.em ---> m
emm ---> a
mma ---> .
olivia
... ---> o
..o ---> l
.ol ---> i
oli ---> v
liv ---> i
ivi ---> a
via ---> .
ava
... ---> a
..a ---> v
.av ---> a
ava ---> .
isabella
... ---> i
..i ---> s
.is ---> a
isa ---> b
sab ---> e
abe ---> l
bel ---> l
ell ---> a
lla ---> .
sophia
... ---> s
..s ---> o
.so ---> p
sop ---> h
oph ---> i
phi ---> a
hia ---> .


In [23]:
X.shape, X.dtype, Y.shape, Y.dtype

(torch.Size([32, 3]), torch.int64, torch.Size([32]), torch.int64)

**As you can see above, we have 32 training examples made from our first five names. Each example in X is a triplet of characters, drawn from our 27 possible characters, and the matching entry in Y is the character that follows it. These are only the triplets that appear in the first five names; once we use the entire dataset, we'll get many more.**
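The count of 32 can be checked by hand: each name contributes one example per character, plus one more for the closing `.`. A small sketch, with the five names copied from the `words[:8]` output above (`first_five` and `examples_per_name` are names introduced here for illustration):

```python
# The first five names, copied from the words[:8] output above.
first_five = ['emma', 'olivia', 'ava', 'isabella', 'sophia']

# Each name yields one example per character, plus one for the closing '.'.
examples_per_name = {w: len(w) + 1 for w in first_five}
total_examples = sum(examples_per_name.values())

print(examples_per_name)
print(total_examples)
```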

In [26]:

# here's our lookup table.
# the paper takes its 17,000 words and embeds them in as few as 30 dimensions
# we'll start with a 2D space
# 27 rows for our characters, and 2 columns for the size of the feature space
C = torch.randn((27,2))

In [27]:
# here's one embedding
C[5]

tensor([1.1936, 0.5817])
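One way to think about this lookup: indexing row 5 of `C` gives the same result as multiplying a one-hot vector for index 5 by `C`. The sketch below illustrates that equivalence. It creates its own random `C` with an arbitrary seed, so the values will differ from the ones printed above; `one_hot_5`, `via_matmul` and `via_index` are names introduced here.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)      # arbitrary seed, only so the sketch is repeatable
C = torch.randn((27, 2))  # same shape as the lookup table above, different values

# One-hot encode index 5 as a float vector of length 27.
one_hot_5 = F.one_hot(torch.tensor(5), num_classes=27).float()

# Multiplying by C picks out row 5, because every other entry is zero.
via_matmul = one_hot_5 @ C
via_index = C[5]

print(via_matmul, via_index)
print(torch.allclose(via_matmul, via_index))
```

Direct indexing is the cheaper route, since it skips building the one-hot vector and the multiplication.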

In [30]:
#index using lists

C[[5,6,7]]

tensor([[ 1.1936,  0.5817],
        [-0.0214, -1.8960],
        [-0.4030, -0.1002]])

In [31]:
# we can also index with a tensor
C[torch.tensor([5,6,7])]

tensor([[ 1.1936,  0.5817],
        [-0.0214, -1.8960],
        [-0.4030, -0.1002]])

In [32]:
# we can even send the same index again and again
C[torch.tensor([5,6,7,7,7,7,7,7])]

tensor([[ 1.1936,  0.5817],
        [-0.0214, -1.8960],
        [-0.4030, -0.1002],
        [-0.4030, -0.1002],
        [-0.4030, -0.1002],
        [-0.4030, -0.1002],
        [-0.4030, -0.1002],
        [-0.4030, -0.1002]])

In [34]:
C[X].shape # we can therefore embed all the ints in X with just C[X]
# let's look at the shape to get this straight in our heads.

torch.Size([32, 3, 2])

So we have 32 entries, each 3 characters long (our context length), and each of those characters is now represented by a 2D embedding vector. Let's prove this to ourselves before moving on.

In [35]:
X[13,2] # let's pluck out a single int

tensor(1)

In [36]:
C[X][13,2] # if we've done this correctly we should get the embedding for that int

tensor([0.5771, 0.3914])

Remember, these integers stand for characters. To reiterate: X[13,2] is tensor(1), so if we've got everything straight, C[X][13,2] should be the embedding for index 1, that is, C[1]. We therefore expect C[1] to be [0.5771, 0.3914] as well.

In [37]:
C[1] # the value is the same! PyTorch indexing is very handy

tensor([0.5771, 0.3914])
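The same check can be run over every position at once rather than a single entry. This is a sketch only: it uses a stand-in `C` and a stand-in `X` filled with random indices (not the real names), with an arbitrary seed, so it runs on its own.

```python
import torch

torch.manual_seed(0)               # arbitrary seed for repeatability
C = torch.randn((27, 2))           # stand-in lookup table, same shape as above
X = torch.randint(0, 27, (32, 3))  # stand-in contexts: random indices, not the real names

emb = C[X]  # shape [32, 3, 2]

# Compare every position of emb against a direct row lookup in C.
all_match = all(
    torch.equal(emb[i, j], C[X[i, j]])
    for i in range(X.shape[0])
    for j in range(X.shape[1])
)
print(emb.shape, all_match)
```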

In [38]:
emb = C[X]

In [39]:
# let's make the hidden layer

# we have 2D embeddings and 3 of them per example, hence the 6 inputs.
# the number of neurons (100 here) is up to us.
W1 = torch.randn((6, 100))
b1 = torch.randn(100)

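Now, ideally we'd like to do the following matmul: `emb @ W1 + b1`. But the shapes don't line up: `emb` is [32, 3, 2] while `W1` is [6, 100], so the inner dimensions (2 and 6) don't match. We'll have to fix this by bringing `emb` down to the same rank as `W1`, that is, a [32, 6] matrix.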
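The sketch below shows the failure and one possible way around it, assuming we simply flatten each example's 3 x 2 block of embeddings into a row of 6 with `view`. It uses a random stand-in for `emb` with an arbitrary seed, and the names `direct_matmul_ok`, `flat` and `pre_activation` are introduced here; other reshaping approaches would work too.

```python
import torch

torch.manual_seed(0)           # arbitrary seed for repeatability
emb = torch.randn((32, 3, 2))  # stand-in for C[X]
W1 = torch.randn((6, 100))
b1 = torch.randn(100)

# The direct product fails: the last dim of emb (2) does not match the first dim of W1 (6).
try:
    emb @ W1
    direct_matmul_ok = True
except RuntimeError:
    direct_matmul_ok = False

# One possible fix: flatten each example's 3 x 2 embeddings into a single row of 6.
flat = emb.view(emb.shape[0], -1)  # shape [32, 6]
pre_activation = flat @ W1 + b1    # shape [32, 100]; b1 broadcasts across the 32 rows
print(direct_matmul_ok, flat.shape, pre_activation.shape)
```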