### Multi-Layer Perceptron
The previous model we built was a simple bigram language model. It was not very good at predicting the next letter, for three reasons:
1. It looked only at the previous letter as context when predicting the next character.
2. It predicted the next character from probabilities derived from bigram counts. This approach does not scale: if we extend the context to 3 or 4 letters to get a better model, the counts matrix grows exponentially with the context length and becomes very sparse, i.e. most combinations have few or no counts.
3. As a result, the model could not produce name-like combinations of characters, which was the problem we needed to solve.
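
To make the growth of the counts table concrete, here is a small sketch that counts the entries needed for different context lengths. It assumes the 27-symbol vocabulary used in this notebook (26 letters plus "."); the helper name `count_table_entries` is made up for this illustration.

```python
# Entries in a count table that maps an n-character context to the next
# character, for a 27-symbol vocabulary (26 letters + '.').
vocab_size = 27

def count_table_entries(context_len: int, vocab: int = vocab_size) -> int:
    # vocab**context_len possible contexts, each with `vocab` possible next characters
    return vocab ** context_len * vocab

table_sizes = {n: count_table_entries(n) for n in range(1, 5)}
for n, size in table_sizes.items():
    print(f"context length {n}: {size:,} entries")
# context length 1: 729 entries
# context length 2: 19,683 entries
# context length 3: 531,441 entries
# context length 4: 14,348,907 entries
```

With a context of 3 letters the table already has more entries than the 228,146 training examples built later in this notebook, so most entries would necessarily stay at zero.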

We will build a better model, the multilayer perceptron (MLP), which alleviates the problems above because it is a neural-network-based architecture that can handle more characters of context than a bigram model.\
\
One more thing done in the MLP paper is that each word is converted into a vector in a 30-dimensional space, so the embedding matrix for their 17,000-word vocabulary would have shape (17000, 30).

In [45]:
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
import numpy as np
import plotly.express as px

In [2]:
with open("D://Datasets/names.txt", 'r') as file:
    names = file.read().splitlines()

In [3]:
names[0:5]

['emma', 'olivia', 'ava', 'isabella', 'sophia']

In [3]:
vocabulary = sorted(list(set(''.join(names))))
chartoidx = {}
idxtochar = {}
chartoidx['.'] = 0   # Special token that marks the start and the end of a name.
idxtochar[0] = '.'
for i,char in enumerate(vocabulary):
    chartoidx[char] = i+1
    idxtochar[i+1] = char

chartoidx

{'.': 0,
 'a': 1,
 'b': 2,
 'c': 3,
 'd': 4,
 'e': 5,
 'f': 6,
 'g': 7,
 'h': 8,
 'i': 9,
 'j': 10,
 'k': 11,
 'l': 12,
 'm': 13,
 'n': 14,
 'o': 15,
 'p': 16,
 'q': 17,
 'r': 18,
 's': 19,
 't': 20,
 'u': 21,
 'v': 22,
 'w': 23,
 'x': 24,
 'y': 25,
 'z': 26}

1. Now let's rebuild the dataset for the MLP.
2. The MLP takes n context letters as input (n = 3 here), so each example consists of 3 context letters and the 4th letter as the label to predict.
3. Each word starts with a context of 3 dots (...), from which the model predicts the first letter of the word.
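
As a worked example of the steps above, this sketch lists the (context, target) pairs produced for a single name. It works on characters rather than integer indexes purely for readability; the variable names are chosen for this illustration.

```python
block_size = 3          # number of context characters
name = "emma"           # first name in the dataset

context = ["."] * block_size
pairs = []
for ch in name + ".":   # the trailing '.' marks the end of the name
    pairs.append(("".join(context), ch))
    context = context[1:] + [ch]   # slide the window one character forward

for ctx, target in pairs:
    print(f"{ctx} ---> {target}")
# ... ---> e
# ..e ---> m
# .em ---> m
# emm ---> a
# mma ---> .
```

A name with k letters therefore contributes k + 1 examples, the last one predicting the end token.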


In [16]:
block_size = 3  # Number of context characters used to predict the next one.

# Inputs (contexts) and labels (next characters)
xs = []
ys = []

for word in names:
    word = word + "."
    context = [0] * block_size  # Start from a context made only of the special character "."
    for ch in word:
        xs.append(context)
        ys.append(chartoidx[ch])
        context = context[1:] + [chartoidx[ch]]  # Slide the window: drop the oldest index, append the new one

xs = torch.tensor(xs)
ys = torch.tensor(ys)

In [17]:
print(xs.shape, xs.dtype)
print(ys.shape, ys.dtype)

torch.Size([228146, 3]) torch.int64
torch.Size([228146]) torch.int64


Now we build the lookup table that is shared across all the inputs. In the MLP paper by Yoshua Bengio et al., a vocabulary of 17,000 words was embedded into 30-dimensional vectors.\
We will use a table of 27 characters and a much smaller number of dimensions (we start here with 2).

In [8]:
# build the lookup table
c = torch.randn((27,2))
c

tensor([[ 1.5399e+00, -2.0889e-01],
        [ 1.5381e+00, -6.3532e-01],
        [-1.9274e+00, -3.9099e-01],
        [-8.9218e-01, -6.4619e-04],
        [ 1.3383e+00, -3.1898e-02],
        [-1.0281e+00,  3.3131e-01],
        [ 1.7538e+00, -1.3597e-01],
        [-1.6896e-01,  3.8129e-01],
        [ 1.1301e+00,  1.4792e+00],
        [-1.6452e+00, -5.5103e-01],
        [-3.4661e-01, -8.1771e-01],
        [ 7.6834e-01, -4.7564e-01],
        [ 7.1434e-01, -9.6823e-01],
        [-1.0029e-01, -1.6889e+00],
        [-7.9375e-02, -1.8545e-01],
        [ 1.9587e+00,  3.6907e-02],
        [ 3.7416e-01, -1.7185e+00],
        [ 7.3027e-01, -4.2906e-02],
        [ 5.2895e-01, -1.1491e+00],
        [-2.9269e+00, -5.3369e-01],
        [-3.8558e-02,  1.6865e+00],
        [ 3.8508e-01, -5.9692e-01],
        [-3.1772e-02, -8.2846e-01],
        [ 3.5567e-01,  7.0653e-01],
        [-1.4892e+00, -4.5759e-01],
        [ 3.3209e-01, -7.2603e-01],
        [ 1.8088e-02, -2.5397e-01]])

An embedding layer can be thought of in two ways.
1. Lookup table: Seen as a lookup table, it is a table of shape (vocab, dim) in which each row is a (1, dim) vector representing a particular word or character. To get the embedding of a word or character, we look up the row at that item's index in the vocabulary. We feed this vector to the model in place of a one-hot vector, and the embedding is adjusted via backpropagation through the model.

2. Layer: Seen as a layer, it is a linear layer with a weight matrix, no bias and no non-linearity. We pass in words or characters as one-hot vectors, and multiplying a one-hot vector by the weight matrix selects the row of the weight matrix at the active index of the one-hot vector. That row is the output activation of the embedding layer, which flows on through the rest of the model. Backpropagation then improves the embedding of each word for our particular problem.

In [9]:
c[5]

tensor([-1.0281,  0.3313])

In [10]:
# This creates a 27-dimensional one-hot tensor with only the bit at index 5 turned on.
x = F.one_hot(torch.tensor(5), num_classes = 27)
x

tensor([0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0])

In [11]:
# Now we matrix-multiply the one-hot vector by the lookup table; the result is the embedding that goes into the neural network.
x.float() @ c.float()

tensor([-1.0281,  0.3313])

Matrix-multiplying the one-hot vector by the lookup matrix yields the same result as selecting the row whose index matches the active bit of the one-hot vector.\
From here on we simply index the table with the vocabulary index.
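
The same equivalence holds for a whole batch of contexts. The sketch below checks it on a stand-in table; the seed, the name `table` and the two example contexts are arbitrary choices for this illustration, not values from the notebook.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)          # arbitrary seed, for reproducibility only
table = torch.randn((27, 2), generator=g)     # stand-in for the lookup table c
idx = torch.tensor([[0, 0, 5], [0, 5, 13]])   # two hypothetical contexts of 3 indexes each

by_index = table[idx]                                        # shape (2, 3, 2)
by_matmul = F.one_hot(idx, num_classes=27).float() @ table   # shape (2, 3, 2)
same = torch.allclose(by_index, by_matmul)
print(by_index.shape, same)
```

Indexing avoids building the (2, 3, 27) one-hot tensor and the matrix multiplication, which is why it is the form used in the rest of the notebook.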

Embedding a single integer index like 5 is easy: we ask PyTorch for row 5 of c with c[5] and get the embedding.\
\
To get the embeddings of several integers at once, we can pass the indexes as a list, like c[[5,6,7]].\
\
Note that passing two lists, as in c[[5,6],[0,1]], does not embed a two-dimensional array: it pairs row and column indexes and returns the individual elements c[5,0] and c[6,1]. To embed a two-dimensional array of indexes, index with a tensor, like c[torch.tensor([[5,6],[7,8]])]. Let's try the first three forms.

In [12]:
print(c[5])
print(c[[5,6,7]])
print(c[[5,6],[0,1]])

tensor([-1.0281,  0.3313])
tensor([[-1.0281,  0.3313],
        [ 1.7538, -0.1360],
        [-0.1690,  0.3813]])
tensor([-1.0281, -0.1360])
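
To see the difference between these indexing forms without random values getting in the way, here is a small sketch on a table whose contents are easy to read off. The table is a made-up stand-in for c in which row i holds [2*i, 2*i + 1].

```python
import torch

table = torch.arange(54, dtype=torch.float32).view(27, 2)  # row i is [2*i, 2*i + 1]

rows = table[[5, 6, 7]]                        # three embeddings, shape (3, 2)
pairs = table[[5, 6], [0, 1]]                  # the two elements table[5, 0] and table[6, 1]
grid = table[torch.tensor([[5, 6], [7, 8]])]   # a 2x2 grid of embeddings, shape (2, 2, 2)

print(rows.shape)   # torch.Size([3, 2])
print(pairs)        # tensor([10., 13.])
print(grid.shape)   # torch.Size([2, 2, 2])
```

Indexing with an integer tensor of any shape returns one embedding per index, with the embedding dimension appended at the end. This is what makes c[xs] work on the whole dataset at once.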


In [13]:
# Let's index c with our xs dataset to get the embedding of every integer in it.
# For the 32 examples of 3 context indexes used here, each index maps to a 2-value embedding, so the result has shape (32,3,2).
c[xs].shape

torch.Size([32, 3, 2])

Next we pass these embeddings to the first layer of the neural network, where they are multiplied by a weight matrix of shape (6,100). The embeddings have shape (32,3,2), so they cannot be multiplied by it directly; we first need to reshape them to (32,6) so that the 3 two-dimensional embeddings of each example sit side by side in one row. For this we call the view function.

In [14]:
emb = c[xs]
print(emb.shape)

torch.Size([32, 3, 2])


Now we just change the shape of the input tensor to make it compatible with our weight matrix.

In [16]:
emb.view(-1,6).shape

torch.Size([32, 6])
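
As a sanity check on what view does here, the sketch below compares it with an explicit concatenation of the three per-position embeddings. The tensor `emb` is filled with placeholder values for this illustration rather than real embeddings.

```python
import torch

emb = torch.arange(32 * 3 * 2, dtype=torch.float32).view(32, 3, 2)  # placeholder embeddings

flat_view = emb.view(-1, 6)   # reinterpret the same memory as 32 rows of 6 values
flat_cat = torch.cat([emb[:, 0, :], emb[:, 1, :], emb[:, 2, :]], dim=1)  # explicit concatenation

identical = torch.equal(flat_view, flat_cat)
shares_memory = flat_view.data_ptr() == emb.data_ptr()  # view does not copy the data
print(flat_view.shape, identical, shares_memory)
```

Both give the same (32, 6) result, but torch.cat allocates a new tensor while view only changes how the existing storage is interpreted.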

In [17]:
w1 = torch.randn((6,100))
b1 = torch.randn((100))

In [19]:
# Now we could perform the matrix multiplication and get the results.
h1 = torch.tanh(emb.view(-1,6) @ w1 + b1)  # We would also apply non-linearity
h1

tensor([[ 0.9994, -0.6843,  0.9566,  ..., -0.5370, -0.7589,  0.9275],
        [ 0.9976, -0.7930, -0.9050,  ..., -0.9985, -0.8538,  0.9851],
        [ 0.9987,  0.9885, -0.9996,  ..., -0.9558, -0.9996, -0.9997],
        ...,
        [ 0.9994,  0.9997,  0.9767,  ..., -0.8335,  0.2851,  1.0000],
        [ 0.1616, -1.0000, -0.7551,  ..., -0.9998, -0.9875, -0.1582],
        [ 0.9987,  1.0000, -0.9996,  ...,  0.4245, -0.9659, -0.9943]])

In [21]:
h1.shape   # Hidden layer activations for every one of our 32 examples

torch.Size([32, 100])

In [22]:
# Now lets create the final layer i.e the softmax layer
w2 = torch.rand(100,27)
b2 = torch.rand(27)

In [23]:
logits = h1 @w2 +b2
logits.shape


torch.Size([32, 27])

In [29]:
# just as we did earlier we would treat logits as log counts and to get counts we would need to exponentiate them
counts = logits.exp()

# Then we would get the probabilities by normalizing the counts.
probs = counts/counts.sum(1, keepdim=True)

In [30]:
probs.shape

torch.Size([32, 27])

In [32]:
# We would see that every row of probs sums to 1
probs[0].sum()

tensor(1.0000)
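
The exponentiate-and-normalize step above is the softmax function. A minimal check, assuming random logits of the same shape as ours (the seed is arbitrary), that it matches PyTorch's built-in F.softmax:

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)            # arbitrary seed
logits = torch.randn((32, 27), generator=g)     # placeholder logits

counts = logits.exp()
probs_manual = counts / counts.sum(1, keepdim=True)
probs_softmax = F.softmax(logits, dim=1)

matches = torch.allclose(probs_manual, probs_softmax)
rows_sum_to_one = torch.allclose(probs_softmax.sum(1), torch.ones(32))
print(matches, rows_sum_to_one)
```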

In [31]:
probs

tensor([[1.9311e-02, 9.0641e-03, 1.4585e-02, 5.6024e-03, 8.7870e-03, 4.2746e-03,
         1.2606e-02, 8.1540e-05, 1.7812e-01, 2.3683e-03, 5.8972e-03, 9.6818e-02,
         1.0733e-03, 2.3686e-01, 3.1084e-03, 5.2954e-03, 7.6305e-04, 1.6396e-01,
         2.5532e-03, 2.7512e-03, 6.0770e-02, 1.8890e-04, 8.7528e-02, 1.6185e-04,
         3.0726e-05, 7.7286e-02, 1.5824e-04],
        [2.8224e-01, 2.2787e-04, 1.1509e-02, 1.1182e-03, 3.8151e-02, 1.2938e-02,
         1.1861e-02, 5.8726e-06, 1.5842e-02, 9.2985e-03, 5.4747e-03, 5.3811e-02,
         3.1547e-03, 1.7953e-02, 5.9270e-04, 3.4865e-04, 1.6633e-03, 8.6082e-03,
         8.9243e-06, 7.2737e-04, 3.7833e-02, 8.4429e-04, 4.8222e-01, 4.7739e-05,
         5.8852e-05, 2.3348e-03, 1.1268e-03],
        [3.5523e-04, 3.8270e-04, 9.5396e-03, 1.7083e-04, 9.3230e-01, 7.6593e-05,
         2.3025e-02, 5.3875e-03, 1.5191e-03, 2.4903e-03, 2.1000e-04, 1.1637e-04,
         4.7147e-03, 7.0381e-03, 5.3956e-04, 7.8916e-04, 4.7189e-04, 1.7011e-04,
         5.6300e-

In [33]:
# Now we pick out, for each example, the probability that the model assigned to the correct character (the label).
# We can index into the rows and columns of probs with
probs[torch.arange(32), ys]

tensor([4.2746e-03, 1.7953e-02, 7.0381e-03, 1.2777e-05, 3.5889e-02, 5.2954e-03,
        8.1094e-04, 5.6830e-03, 1.2388e-01, 8.3501e-02, 8.3405e-06, 4.4159e-04,
        9.0641e-03, 9.5948e-02, 4.2863e-04, 1.5011e-02, 2.3683e-03, 1.5327e-03,
        7.6960e-04, 3.0260e-04, 7.4688e-04, 4.2388e-02, 1.7931e-01, 4.6583e-05,
        1.0510e-01, 2.7512e-03, 1.6703e-04, 1.8242e-02, 5.0304e-02, 1.9462e-03,
        2.9198e-04, 9.8275e-03])

In [34]:
# To get the loss we take the log of these probabilities, average them, and negate the result to get the negative log-likelihood (NLL) loss.
nll_loss = - probs[torch.arange(32), ys].log().mean()
nll_loss   # The loss is a positive number. Reducing it leads to a better model.

tensor(5.6951)

In [40]:
print(xs.shape, xs.dtype)
print(ys.shape, ys.dtype)

torch.Size([228146, 3]) torch.int64
torch.Size([228146]) torch.int64


In [74]:
# Let's consolidate the full code
g = torch.Generator().manual_seed(2147483647)
c = torch.randn((27,2), generator=g)
w1 = torch.randn((6,100), generator=g)
b1 = torch.randn((100), generator=g)
w2 = torch.rand((100,27), generator=g)
b2 = torch.rand((27), generator=g)
parameters = [c,w1,b1,w2,b2]
p = sum([x.nelement() for x in parameters])
p

3481
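
The total of 3481 can be checked by hand from the shapes of the five tensors. The sketch below does that arithmetic in plain Python; the dictionary simply restates the shapes defined above.

```python
from math import prod

# Shapes of the parameter tensors defined above
shapes = {"c": (27, 2), "w1": (6, 100), "b1": (100,), "w2": (100, 27), "b2": (27,)}

param_counts = {name: prod(shape) for name, shape in shapes.items()}
total = sum(param_counts.values())

print(param_counts)  # {'c': 54, 'w1': 600, 'b1': 100, 'w2': 2700, 'b2': 27}
print(total)         # 3481
```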

In [12]:
embed = c[xs]
h1 = torch.tanh(embed.view(-1,6) @ w1 + b1)
logits = h1 @ w2 + b2
# counts = logits.exp()
# prob = counts/counts.sum(dim = 1, keepdims=True)
# # Now get the loss
# loss = - prob[torch.arange(32), ys].log().mean()
# print(loss.item())

# There is a more efficient function that computes the loss directly from the logits without the intermediate steps: torch.nn.functional.cross_entropy
loss = F.cross_entropy(logits, ys)
loss   # Computes the same quantity as the manual calculation above (exponentiate, normalize, index, log, mean, negate).

tensor(5.4700)

Reasons to use cross_entropy instead of the manual calculation
1. The forward pass is much more efficient.
2. The backward pass is much more efficient.
3. The computation is numerically much better behaved; for example, large logits do not overflow in exp().
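
The numerical point can be illustrated with a deliberately extreme example. The logits below are made up for this sketch; the manual path follows the same exponentiate-and-normalize steps used earlier.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[-100.0, -3.0, 0.0, 100.0]])  # hypothetical extreme logits
target = torch.tensor([3])                           # the correct class is the one with logit 100

# Manual path: exp(100) overflows float32 to inf, and inf / inf is nan
counts = logits.exp()
probs = counts / counts.sum(1, keepdim=True)
manual_loss = -probs[torch.arange(1), target].log().mean()

# Built-in path: works on the logits directly and stays finite
stable_loss = F.cross_entropy(logits, target)

print(manual_loss)   # nan
print(stable_loss)   # a finite value close to 0
```

The manual calculation returns nan even though the model is confidently correct, while F.cross_entropy reports the near-zero loss that the logits actually imply.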

In [76]:
for p in parameters:
    p.requires_grad = True

In [79]:
# Now let's write the training loop
lri = []
lossi = []
for i in range(100000):
    # Sample 32 random indexes to select a mini-batch from the dataset
    ix = torch.randint(0, xs.shape[0], (32,))

    # Forward pass
    embed = c[xs[ix]]   # (32,3,2)
    h1 = torch.tanh(embed.view(-1,6) @ w1 + b1)
    logits = h1 @ w2 + b2
    loss = F.cross_entropy(logits, ys[ix])

    # Zero out the gradients
    for param in parameters:
        param.grad = None

    # Backward pass
    loss.backward()

    # Update the parameters
    lr = 0.01
    for param in parameters:
        param.data += -lr * param.grad

    # # track the loss stats (used only during the learning-rate sweep)
    # lri.append(lre[i])
    # lossi.append(loss.item())
print(loss.item())  # loss on the last mini-batch only


2.1627094745635986


In [64]:
fig = px.line(x = lri, y = lossi, height=600, width= 800)
fig.show()
# From the plot of the loss against the learning-rate exponent, an exponent around -1 works well. So we remove the tracking and fix the lr.


In [22]:
# Now we want to pass the dataset to the network in mini-batches.
# To do that we need batch_size random indexes that select rows of the data.
indexes = torch.randint(0, xs.shape[0], (32,))
indexes

tensor([162455, 195095, 205281, 137481,  30454,  28953, 153871, 137223, 202366,
         68977, 114323,  58868, 153822,  40857,  61324,  73151,  61263,  84925,
        196609, 167992, 139473,   1473, 209590,  29551,  69955,  92380, 192595,
         84371, 131494,  97716, 135858, 200117])

To optimize this neural network we need to choose a learning rate, which scales the size of each parameter update. Finding it is a matter of trial and error, but we can narrow it down: first find a low and a high value that bracket the reasonable range, then sample learning rates within that interval and see which of them gives the lowest loss.

In [29]:
# We pick a minimum and a maximum exponent and sample exponents linearly between them, so the learning rates are spaced exponentially across that range
lre = torch.linspace(-3, 0, 1000)
lrs = 10 ** lre
lrs     # This is the space of possibilities that we want to search over

tensor([0.0010, 0.0010, 0.0010, 0.0010, 0.0010, 0.0010, 0.0010, 0.0010, 0.0011,
        0.0011, 0.0011, 0.0011, 0.0011, 0.0011, 0.0011, 0.0011, 0.0011, 0.0011,
        0.0011, 0.0011, 0.0011, 0.0012, 0.0012, 0.0012, 0.0012, 0.0012, 0.0012,
        0.0012, 0.0012, 0.0012, 0.0012, 0.0012, 0.0012, 0.0013, 0.0013, 0.0013,
        0.0013, 0.0013, 0.0013, 0.0013, 0.0013, 0.0013, 0.0013, 0.0013, 0.0014,
        0.0014, 0.0014, 0.0014, 0.0014, 0.0014, 0.0014, 0.0014, 0.0014, 0.0014,
        0.0015, 0.0015, 0.0015, 0.0015, 0.0015, 0.0015, 0.0015, 0.0015, 0.0015,
        0.0015, 0.0016, 0.0016, 0.0016, 0.0016, 0.0016, 0.0016, 0.0016, 0.0016,
        0.0016, 0.0017, 0.0017, 0.0017, 0.0017, 0.0017, 0.0017, 0.0017, 0.0017,
        0.0018, 0.0018, 0.0018, 0.0018, 0.0018, 0.0018, 0.0018, 0.0018, 0.0019,
        0.0019, 0.0019, 0.0019, 0.0019, 0.0019, 0.0019, 0.0019, 0.0020, 0.0020,
        0.0020, 0.0020, 0.0020, 0.0020, 0.0020, 0.0021, 0.0021, 0.0021, 0.0021,
        0.0021, 0.0021, 0.0021, 0.0022, 

In [38]:
# Now let's find the minimum loss and the corresponding learning rate
min_loss = min(lossi)
min_lri = lri[lossi.index(min_loss)]
print(min_loss, min_lri)

tensor(2.2006, grad_fn=<NllLossBackward0>) tensor(0.3354)
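
The sweep that fills lri and lossi is not shown in full above, so here is a self-contained sketch of how such a sweep could look. It is an assumption about the procedure, not the exact cell that produced the numbers above: it uses random placeholder data (`xs_demo`, `ys_demo`) instead of the names dataset, an arbitrary seed, and it records the exponent of each learning rate.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)                     # arbitrary seed
xs_demo = torch.randint(0, 27, (1000, 3), generator=g)   # placeholder contexts, not the names data
ys_demo = torch.randint(0, 27, (1000,), generator=g)     # placeholder labels

c = torch.randn((27, 2), generator=g)
w1 = torch.randn((6, 100), generator=g)
b1 = torch.randn(100, generator=g)
w2 = torch.randn((100, 27), generator=g)
b2 = torch.randn(27, generator=g)
parameters = [c, w1, b1, w2, b2]
for p in parameters:
    p.requires_grad = True

lre = torch.linspace(-3, 0, 1000)   # exponents from -3 to 0
lrs = 10 ** lre                     # learning rates from 0.001 to 1

lri, lossi = [], []
for i in range(1000):
    # Mini-batch of 32 random examples
    ix = torch.randint(0, xs_demo.shape[0], (32,), generator=g)

    # Forward pass
    emb = c[xs_demo[ix]]
    h = torch.tanh(emb.view(-1, 6) @ w1 + b1)
    logits = h @ w2 + b2
    loss = F.cross_entropy(logits, ys_demo[ix])

    # Backward pass
    for p in parameters:
        p.grad = None
    loss.backward()

    # Update with the i-th learning rate of the sweep
    lr = lrs[i]
    for p in parameters:
        p.data += -lr * p.grad

    # Track the exponent and the loss at this step
    lri.append(lre[i].item())
    lossi.append(loss.item())

print(len(lri), len(lossi))
```

Each step uses a different learning rate, so plotting lossi against lri shows roughly where the loss falls fastest and where it starts to become unstable. On the placeholder data the curve itself is not meaningful; only the mechanics carry over.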
