# Introduction
In the previous notebook we predicted the next character of a name by looking only at the previous character. Here we want to do something a bit more sophisticated: we want to use more of the context than a single character, and we will do it in two different ways, bag-of-words and a multilayer perceptron (MLP). The MLP approach is based on a [paper from 2003](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf), in which the authors predict the next word using the previous words.

We will be doing the following:

* TBD

# Libraries

In [1]:
%matplotlib inline
%config IPCompleter.use_jedi=False

In [2]:
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

# Data
For creating the language models we use a dataset of the most common names from [ssa.gov](https://www.ssa.gov/oact/babynames/).

### Reading Data

In [3]:
# Reading names into a list
with open('../../data/names.txt', 'r') as f:
    names = f.readlines()
    names = [name.strip() for name in names]

### Creating Vocabulary
As a neural network works with numbers, we need a way to translate back and forth between letters and numbers.

In [4]:
# Building the vocabulary (character to/from index)
chars = sorted(list(set(''.join(names))))
chr_to_idx = {s:i+1 for i,s in enumerate(chars)}; print(chr_to_idx)
chr_to_idx['.'] = 0
idx_to_chr = {i:s for s,i in chr_to_idx.items()}; print(idx_to_chr)

{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8, 'i': 9, 'j': 10, 'k': 11, 'l': 12, 'm': 13, 'n': 14, 'o': 15, 'p': 16, 'q': 17, 'r': 18, 's': 19, 't': 20, 'u': 21, 'v': 22, 'w': 23, 'x': 24, 'y': 25, 'z': 26}
{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 0: '.'}
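
As a small illustration of translating back and forth, the two dictionaries can be wrapped in helper functions. This is only a sketch: `toy_names`, `encode` and `decode` are hypothetical names introduced here, and the toy vocabulary is much smaller than the real one.

```python
# Hypothetical toy data standing in for the real names list
toy_names = ["emma", "olivia", "ava"]

# Same construction as above: letters get indices 1..N, '.' gets index 0
toy_chars = sorted(set("".join(toy_names)))
toy_chr_to_idx = {s: i + 1 for i, s in enumerate(toy_chars)}
toy_chr_to_idx["."] = 0
toy_idx_to_chr = {i: s for s, i in toy_chr_to_idx.items()}


def encode(text):
    """Translate a string into a list of vocabulary indices."""
    return [toy_chr_to_idx[c] for c in text]


def decode(indices):
    """Translate a list of vocabulary indices back into a string."""
    return "".join(toy_idx_to_chr[i] for i in indices)


encoded = encode("emma.")
print(encoded)          # indices in the toy vocabulary
print(decode(encoded))  # round trip back to the original string
```

Encoding followed by decoding should return the original string, which is a quick sanity check for the vocabulary.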


### Preparing Dataset
For each letter we use the previous `block_size` characters as context to predict it. The start of each name is padded with `.`.

Example for emma:  

<pre>
... ---> e  
..e ---> m  
.em ---> m  
emm ---> a  
mma ---> .  
</pre>

In [5]:
block_size = 3
verbose = False

X, Y = [], []
for name in names:
    if verbose:
        print(name)
    context = [0] * block_size
    for char in name + '.':
        idx = chr_to_idx[char]
        X.append(context)
        Y.append(idx)
        if verbose:
            print(''.join(idx_to_chr[i] for i in context), idx_to_chr[idx])
        context = context[1:] + [idx]

In [6]:
# Printing example x and y
x2 = X[2]; print(''.join(idx_to_chr[x] for x in x2))
y2 = Y[2]; print(idx_to_chr[y2])

.em
m


In [7]:
# Converting lists to PyTorch tensors
X = torch.tensor(X) # n_examples x block_size
Y = torch.tensor(Y) # n_examples

# Building the Neural Network
We now build the neural network as described in the paper.

### The Lookup Table

First we build an embedding lookup table. Conceptually it is similar to one-hot encoding, since in both cases each character is represented as a vector. A one-hot vector has the same length as the vocabulary, while an embedding vector can be much shorter; its length determines how much information it can store.

One-hot example with vocabulary ABCD, encoding the string ABBA:
<pre>
  A B C D
A 1 0 0 0
B 0 1 0 0
B 0 1 0 0
A 1 0 0 0
</pre>

Lookup table example with two dimensions:
<pre>
Lookup table:
    d1    d2
A  0.1  -0.3
B -0.5  -0.7
C -0.1   1.3
D  3.0   0.9

Chars   Indices      Embeddings
ABBA -> [1,2,2,1] -> [[0.1, -0.3],[-0.5, -0.7],[-0.5, -0.7],[0.1, -0.3]]
</pre>
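
The relationship between the two representations can be checked directly: multiplying a one-hot vector with the lookup table selects the same row as indexing the table. The sketch below assumes a randomly initialized table; `C_demo` and `idx_demo` are hypothetical names used only for this illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

vocab_size, emb_dim = 27, 2
C_demo = torch.randn(vocab_size, emb_dim)  # hypothetical lookup table
idx_demo = torch.tensor([1, 2, 2, 1])      # indices for "abba"

# Route 1: pick rows of the table directly by index
by_index = C_demo[idx_demo]

# Route 2: one-hot encode the indices, then matrix-multiply with the table
one_hot = F.one_hot(idx_demo, num_classes=vocab_size).float()
by_matmul = one_hot @ C_demo

print(by_index.shape)                       # 4 characters x 2 dimensions
print(torch.allclose(by_index, by_matmul))  # both routes agree
```

Indexing avoids building the large, mostly zero one-hot matrix, which is why the lookup is used in the rest of the notebook.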

In [8]:
# Building a lookup table (vocab_length x n_dimensions)
C = torch.randn([27, 2]); print(C)

tensor([[ 0.0060,  1.5929],
        [ 0.7376, -1.8402],
        [ 1.8019,  0.6182],
        [-1.2913, -0.5859],
        [-1.5063,  1.3944],
        [-0.1463,  0.2447],
        [ 0.0365, -0.6914],
        [-1.0483,  0.3444],
        [-2.4608,  0.2352],
        [ 0.4250, -0.6164],
        [-1.5512,  1.1415],
        [ 0.0527, -0.3730],
        [ 0.1498, -0.6145],
        [-0.0960, -1.7444],
        [-0.1050, -0.3841],
        [ 0.2512, -1.3238],
        [ 0.7428,  0.6579],
        [ 0.5280, -2.4567],
        [-0.4924, -0.8601],
        [ 0.5244,  0.3574],
        [ 0.1042, -1.3714],
        [-0.6479,  0.6436],
        [-0.3895, -0.5352],
        [-0.2469, -0.8324],
        [-0.4192, -0.4529],
        [ 0.5424,  1.5989],
        [-1.1832, -0.1410]])


In [9]:
# Looking up the embedding of one character
char = "c"; print(f"Character: {char}")  # avoid shadowing the built-in chr()
idx = chr_to_idx[char]; print(f"Vocab Index: {idx}")
embedding = C[idx]; print(f"Embedding: {embedding}")

Character: c
Vocab Index: 3
Embedding: tensor([-1.2913, -0.5859])


In [10]:
# Looking up embeddings of four characters: "abba"
chars = torch.tensor([1,2,2,1])
embeddings = C[chars]; embeddings

tensor([[ 0.7376, -1.8402],
        [ 1.8019,  0.6182],
        [ 1.8019,  0.6182],
        [ 0.7376, -1.8402]])

Lookups also work with higher-dimensional index tensors. The X data we created earlier has two dimensions: rows (samples) and columns (context window). Indexing C with all of X looks up an embedding for every entry at once.

In [11]:
# Dimensions are: n_samples x context_window x embedding_size
C[X]

tensor([[[ 0.0060,  1.5929],
         [ 0.0060,  1.5929],
         [ 0.0060,  1.5929]],

        [[ 0.0060,  1.5929],
         [ 0.0060,  1.5929],
         [-0.1463,  0.2447]],

        [[ 0.0060,  1.5929],
         [-0.1463,  0.2447],
         [-0.0960, -1.7444]],

        ...,

        [[-1.1832, -0.1410],
         [-1.1832, -0.1410],
         [ 0.5424,  1.5989]],

        [[-1.1832, -0.1410],
         [ 0.5424,  1.5989],
         [-1.1832, -0.1410]],

        [[ 0.5424,  1.5989],
         [-1.1832, -0.1410],
         [-0.4192, -0.4529]]])
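
To make the resulting shape concrete, here is a minimal sketch on three hand-written contexts (the first three contexts of "emma", using the vocabulary indices from above). `C_demo` and `X_demo` are hypothetical stand-ins for `C` and `X`.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

C_demo = torch.randn(27, 2)  # hypothetical lookup table
X_demo = torch.tensor([[0, 0, 0],    # "..."
                       [0, 0, 5],    # "..e"
                       [0, 5, 13]])  # ".em"

# Indexing with a 2-D tensor appends the embedding dimension to its shape
emb_demo = C_demo[X_demo]
print(emb_demo.shape)  # n_samples x context_window x embedding_size

# Entry [i, j] of the result is the table row for index X_demo[i, j]
print(torch.equal(emb_demo[2, 1], C_demo[X_demo[2, 1]]))
```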

These were all examples. What we carry from this section into the next part of the neural network is the **lookup table** and the **embeddings**.

In [12]:
# Lookup table: vocab_size x embedding_dimension
C = torch.randn([27, 2])

# Embeddings
emb = C[X]

### Adding more layers to the network
The different layers of the network fit together via their input- and output-dimensions. 

Examples of dimensions:  

* samples: n_samples x context_window  
* lookup table: vocab_size x embedding_size  
* layer: (context_window * embedding_size) x n_neurons  

After the samples have gone through the embedding layer, we have one matrix per sample of dimension context_window x embedding_size. Before we can multiply this output with the weights of the first layer, we need to flatten (unstack) each matrix into a single row:

<pre>
[[1,2],
 [3,4],    ---> [1,2,3,4,5,6]
 [5,6]]
</pre>

PyTorch already stores a tensor's data as a one-dimensional block in memory, so one can cheaply switch between shapes using .view(). We will use that here to flatten the last two dimensions.

In [13]:
# Weights and biases (layer 1)
W1 = torch.randn([6,100])
b1 = torch.randn([100])

In [14]:
# Unstack example
e1 = torch.arange(18); print(e1)
e2 = e1.view(3,3,2); print(e2)
e3 = e2.view(3,6); print(e3)

tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17])
tensor([[[ 0,  1],
         [ 2,  3],
         [ 4,  5]],

        [[ 6,  7],
         [ 8,  9],
         [10, 11]],

        [[12, 13],
         [14, 15],
         [16, 17]]])
tensor([[ 0,  1,  2,  3,  4,  5],
        [ 6,  7,  8,  9, 10, 11],
        [12, 13, 14, 15, 16, 17]])
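
A few equivalent ways to do the same flattening are sketched below, together with a check that `.view()` reuses the underlying memory rather than copying it. The variable names are hypothetical and only used for this illustration.

```python
import torch

e = torch.arange(18).view(3, 3, 2)

flat_a = e.view(3, 6)             # explicit target shape
flat_b = e.view(e.shape[0], -1)   # -1 lets PyTorch infer the remaining size
flat_c = e.flatten(start_dim=1)   # flatten everything after the first dim

print(torch.equal(flat_a, flat_b), torch.equal(flat_a, flat_c))

# A view points at the same memory as the original tensor
print(flat_a.data_ptr() == e.data_ptr())
```

Using `-1` avoids hard-coding the number of samples, which is convenient when the same code is run on batches of different sizes.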


In [15]:
# Unstacking sample dimensions and calculating activations
h = torch.tanh(emb.view(emb.shape[0], -1) @ W1 + b1); h.size()

torch.Size([228146, 100])

In [16]:
# Weights and biases (layer 2)
W2 = torch.randn([100,27])
b2 = torch.randn([27])

In [17]:
# Calculating logits for each possible output
logits = h @ W2 + b2; logits.shape

torch.Size([228146, 27])
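
The layers built so far chain together through their shapes, as listed in the dimension examples earlier. The sketch below traces those shapes through a forward pass on a tiny random batch. All `*_demo` names and the batch size of 5 are assumptions made for this illustration; the layer sizes mirror the ones used above.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

n_samples, context_window = 5, 3
vocab_size, embedding_size, n_neurons = 27, 2, 100

# Hypothetical random inputs and parameters with the same layout as above
X_demo = torch.randint(0, vocab_size, (n_samples, context_window))
C_demo = torch.randn(vocab_size, embedding_size)
W1_demo = torch.randn(context_window * embedding_size, n_neurons)
b1_demo = torch.randn(n_neurons)
W2_demo = torch.randn(n_neurons, vocab_size)
b2_demo = torch.randn(vocab_size)

emb_demo = C_demo[X_demo]                           # (5, 3, 2)
flat_demo = emb_demo.view(n_samples, -1)            # (5, 6)
h_demo = torch.tanh(flat_demo @ W1_demo + b1_demo)  # (5, 100)
logits_demo = h_demo @ W2_demo + b2_demo            # (5, 27)

for name, t in [("emb", emb_demo), ("flat", flat_demo),
                ("h", h_demo), ("logits", logits_demo)]:
    print(name, tuple(t.shape))
```

The inner dimensions of each matrix product have to match, and the bias vectors broadcast across the sample dimension.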

In [19]:
#### CONTINUE HERE ####
#### http://www.youtube.com/watch?v=TCH_1BHY58I&t=29m50s
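# Softmax: exponentiate the logits and normalize each row to probabilities
counts = logits.exp()
probs = counts / counts.sum(1, keepdims=True)

## Calculating loss
### Without regularization
### loss = -probs[torch.arange(Y.shape[0]), Y].log().mean()
### With regularization (rewarding low Ws)
### Assumption: both weight matrices W1 and W2 are regularized
loss = -probs[torch.arange(Y.shape[0]), Y].log().mean() + 0.01*((W1**2).mean() + (W2**2).mean())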

tensor([ -0.7133,   4.4883,  -0.0745,   6.0326,  -2.7591,  13.3141,  12.2796,
          1.3787,   9.9715,   2.5350,  -2.5225,   4.4464, -12.3316,  -5.0761,
         -2.1436,  -7.8701,  -3.5819,  -4.4530,  -1.0093,  17.5434,  -1.2620,
        -11.4313,   5.8831,  -5.7168,  -8.4261,  -5.3225,  -3.6467])
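
The softmax and negative log-likelihood steps in the cell above can also be expressed with `F.cross_entropy`, which takes the raw logits and the target indices. The sketch below compares the two on a small random batch without the regularization term. `logits_demo` and `targets_demo` are hypothetical stand-ins for `logits` and `Y`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

logits_demo = torch.randn(5, 27)           # hypothetical logits for 5 samples
targets_demo = torch.randint(0, 27, (5,))  # hypothetical target indices

# Manual route: softmax, then mean negative log-probability of the targets
counts = logits_demo.exp()
probs = counts / counts.sum(1, keepdim=True)
manual_loss = -probs[torch.arange(5), targets_demo].log().mean()

# Built-in route: operates on the logits directly
builtin_loss = F.cross_entropy(logits_demo, targets_demo)

print(manual_loss.item(), builtin_loss.item())
print(torch.allclose(manual_loss, builtin_loss, atol=1e-5))
```

The built-in version works in log-space internally, so it is less prone to overflow when the logits are large.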