# Practice Board

We have been working a lot with multi-layer perceptrons (MLPs). Here I am going to implement everything from scratch, without autocomplete, and see whether I can reach the same loss as in my previous notebooks.

Answer these 2 questions:

Q1. I did not get around to seeing what happens when you initialize all weights and biases to zero. Try this and train the neural net. You might think either that 1) the network trains just fine or 2) the network doesn't train at all, but actually it is 3) the network trains but only partially, and achieves a pretty bad final performance. Inspect the gradients and activations to figure out what is happening and why the network is only partially training, and what part is being trained exactly.

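A1: The non-linearity in this notebook is tanh, and tanh(0) = 0. With all weights and biases at zero, every hidden activation is exactly zero, so all the neurons in a layer emit the SAME value. The gradient of a weight matrix is the layer's input (transposed) times the upstream gradient. For the output layer the input is all zeros, and for the earlier layers the upstream gradient is zero because it has to flow back through the zero output weights. The gradient of the output bias depends on neither, so it is the only non-zero gradient. The only part that trains is therefore the output layer's bias, which means the network predicts the same distribution for every input and ends up with poor performance. (With a sigmoid, which is 0.5 at zero, the output weights would also receive gradient, but the neurons within each hidden layer would still all compute the same value.)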
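To check this reasoning, here is a minimal sketch that runs one forward and backward pass through a small all-zero tanh MLP and prints the largest gradient magnitude per parameter. The sizes and the names `W1`, `b1`, etc. are made up for the illustration and are not the ones used in the network further down.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical toy sizes, chosen only for this illustration.
n_in, n_hidden, n_out, batch = 6, 8, 5, 16

# All-zero weights and biases for a tanh MLP with two hidden layers.
shapes = {
    "W1": (n_in, n_hidden), "b1": (n_hidden,),
    "W2": (n_hidden, n_hidden), "b2": (n_hidden,),
    "W3": (n_hidden, n_out), "b3": (n_out,),
}
params = {name: torch.zeros(shape, requires_grad=True) for name, shape in shapes.items()}

x = torch.randn(batch, n_in)
y = torch.randint(0, n_out, (batch,))

h1 = torch.tanh(x @ params["W1"] + params["b1"])   # tanh(0) = 0
h2 = torch.tanh(h1 @ params["W2"] + params["b2"])  # still all zeros
logits = h2 @ params["W3"] + params["b3"]          # all zeros, so the softmax is uniform
loss = F.cross_entropy(logits, y)
loss.backward()

# Largest absolute gradient entry for each parameter.
grad_norms = {name: p.grad.abs().max().item() for name, p in params.items()}
for name, norm in grad_norms.items():
    print(f"{name}: max |grad| = {norm:.4f}")
```

Only `b3`, the output bias, should show a non-zero value. Since nothing else changes in the update step, the same picture holds on every later iteration.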

Q2. BatchNorm, unlike other normalization layers like LayerNorm/GroupNorm etc., has the big advantage that after training, the batchnorm gamma/beta can be "folded into" the weights of the preceding Linear layers, effectively erasing the need to forward it at test time. Set up a small 3-layer MLP with batchnorms, train the network, then "fold" the batchnorm gamma/beta into the preceding Linear layer's W,b by creating a new W2, b2 and erasing the batch norm. Verify that this gives the same forward pass during inference. I.e., we see that the batchnorm is there just for stabilizing the training, and can be thrown out after training is done! Pretty cool.

In [9]:
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
%matplotlib inline
import random

In [10]:
words = open("names.txt", "r").read().splitlines()
words[:8]

# Build the vocabulary of characters
chars = sorted(list(set(''.join(words))))
stoi = {s:i + 1 for i,s in enumerate(chars)}
stoi['.'] = 0
itos = {i:s for s,i in stoi.items()}
vocab_size = len(stoi)

In [11]:
block_size = 3
emb_size = 10
g = torch.Generator().manual_seed(2147483647)
n_hidden = 100
batch_size = 32
n_epochs = 200000

In [12]:
# We will create the dataset that is needed
def build_dataset(words):
    X, Y = [], []

    for word in words:
        init_pad = ['.'] * block_size
        context = init_pad + list(word) + ['.']

        for i in range(len(context) - block_size):
            X.append([stoi[ch] for ch in context[i:i + block_size]])
            Y.append(stoi[context[i + block_size]]) 


    X = torch.tensor(X)
    Y = torch.tensor(Y)
    return X, Y

# Split the words into training (80%), dev (10%) and held-out (10%) sets. The held-out split is named Xval/Yval here.
random.seed(42)
random.shuffle(words)

n1 = int(len(words) * 0.8)
n2 = int(len(words) * 0.9)

Xtr, Ytr = build_dataset(words[:n1])
Xdev, Ydev = build_dataset(words[n1:n2])
Xval, Yval = build_dataset(words[n2:])

Xtr.shape, Ytr.shape

(torch.Size([182625, 3]), torch.Size([182625]))
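To make the sliding window in `build_dataset` concrete, here is a small sketch that prints the (context, target) pairs for a single word. The helper `context_windows` and the example word "emma" are assumptions for illustration only. It works on characters rather than indices so the output is readable.

```python
block_size = 3  # same context length as in the notebook

def context_windows(word, block_size):
    """Return (context, target) pairs for one word, padded with '.'."""
    chars = ['.'] * block_size + list(word) + ['.']
    return [(''.join(chars[i:i + block_size]), chars[i + block_size])
            for i in range(len(chars) - block_size)]

pairs = context_windows("emma", block_size)
for context, target in pairs:
    print(f"{context} -> {target}")
```

A word of length n yields n + 1 examples: one per character plus one for the closing '.'.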

In [13]:
# fan_in is the number of inputs to the layer; fan_out is the number of neurons (outputs) in the layer.
class Linear:
    def __init__(self, fan_in, fan_out, gain=5/3):
        # Usual random initialization, disabled for the zero-initialization experiment (Q1):
        # self.W = torch.randn((fan_in, fan_out), generator=g)
        # All-zero weights; `gain` is unused here because scaling zeros has no effect.
        self.W = torch.zeros((fan_in, fan_out))
        self.B = torch.zeros(fan_out)

    def __call__(self, X):
        # Affine transform: X @ W + B
        self.out = (X @ self.W) + self.B
        return self.out

    def parameters(self):
        return [self.W, self.B]

class Tanh:
    def __call__(self, X):
        self.out = torch.tanh(X)
        return self.out

    def parameters(self):
        return []

In [14]:
# Why is the first layer Linear(emb_size * block_size, n_hidden)?
# Say the context is 'e', 'm', 'm'. Each character maps to a 10-dimensional embedding, which gives a 3 x 10 matrix.
# A Linear layer expects one vector per example, so we flatten it into 30 values, where each consecutive group of 10
# represents one character.

C = torch.randn((vocab_size, emb_size), generator=g)

layers = [
    Linear(emb_size * block_size, n_hidden), Tanh(),
    Linear(n_hidden, n_hidden), Tanh(),
    Linear(n_hidden, n_hidden), Tanh(),
    Linear(n_hidden, n_hidden), Tanh(),
    Linear(n_hidden, n_hidden), Tanh(),
    Linear(n_hidden, vocab_size),
]

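# Apply the tanh gain of 5/3 to every Linear layer except the last.
# There is no batch norm in this network, and with all-zero weights this scaling is a no-op.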
with torch.no_grad():
    for layer in layers[:-1]:
        if isinstance(layer, Linear):
            layer.W *= 5 / 3

parameters = [C] + [p for layer in layers for p in layer.parameters()]
print(sum(p.nelement() for p in parameters))

for p in parameters:
    p.requires_grad = True

46497
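The embedding lookup and flattening described in the comments above can be checked on a tiny batch. This is only a sketch: `vocab_size = 27` assumes 26 lowercase letters plus '.', and the index values in `Xb` are made up.

```python
import torch

torch.manual_seed(0)

# emb_size and block_size as in the notebook; vocab_size = 27 is an assumption.
vocab_size, emb_size, block_size = 27, 10, 3

C = torch.randn(vocab_size, emb_size)
Xb = torch.tensor([[0, 0, 5], [5, 13, 13]])  # two hypothetical contexts of 3 character indices

emb = C[Xb]                         # one embedding row per index: (2, 3, 10)
flat = emb.view(emb.shape[0], -1)   # keep the batch dimension, merge the rest: (2, 30)
print(emb.shape, flat.shape)

# Each consecutive group of 10 values in a flattened row is one character's embedding.
first_char_matches = torch.equal(flat[1, :10], C[5])
second_char_matches = torch.equal(flat[1, 10:20], C[13])
print(first_char_matches, second_char_matches)
```

This is why the first Linear layer takes `emb_size * block_size` inputs.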


In [None]:
lossi = []
backprops = []

n_epochs = 100000  # number of minibatch steps (overrides the 200000 set earlier)

for i in range(n_epochs):
    activations = []
    #First we create the minibatch.
    ix = torch.randint(0, Xtr.shape[0], (batch_size,))
    Xb, Yb = Xtr[ix], Ytr[ix]

    # Forward Pass
    emb = C[Xb] # Advanced indexing operation
    current_inp = emb.view(emb.shape[0], -1) # Flatten everything except the batch dimension: (batch, 3, 10) -> (batch, 30)

    for layer in layers:
        current_inp = layer(current_inp)
        if isinstance(layer, Linear):
            activations.append(current_inp)

    loss = F.cross_entropy(current_inp, Yb)

    #Backward Pass
    for params in parameters:
        params.grad = None
    loss.backward()

    # Update step
    lr = 0.1 if i < 50000 else 0.01 # Learning rate decay for better convergence
    for params in parameters:
        params.data += -lr * params.grad
        

    # Stats (reuse the loss from the forward pass instead of recomputing it)
    if i % 10000 == 0:
        lossi.append(loss.item())
        print(f"Loss at iteration {i}: {loss.item()}")


Loss at iteration 0: 2.9232704639434814
Loss at iteration 10000: 2.597027540206909
Loss at iteration 20000: 2.596876859664917
Loss at iteration 30000: 2.732625722885132
Loss at iteration 40000: 2.9052674770355225
Loss at iteration 50000: 2.735487937927246
Loss at iteration 60000: 2.9485535621643066
Loss at iteration 70000: 3.0972890853881836
Loss at iteration 80000: 2.8423728942871094
Loss at iteration 90000: 2.4960615634918213
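If only the output layer's bias is learning, the logits are the same for every input, so the best the network can do is predict the overall frequency of each character. The sketch below illustrates that limit on a handful of made-up target indices (a stand-in for `Ytr`, not the real data): the loss of the best input-independent prediction equals the entropy of the target distribution, which is better than a uniform guess but still ignores the context.

```python
import torch
import torch.nn.functional as F

# Hypothetical target indices standing in for Ytr.
Y = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2, 2, 3])
n_classes = 4

counts = torch.bincount(Y, minlength=n_classes).float()
probs = counts / counts.sum()

# Best input-independent logits: the log of the empirical class frequencies.
bias = probs.log()
logits = bias.expand(len(Y), n_classes)  # same logits for every example
bias_only_loss = F.cross_entropy(logits, Y)

entropy = -(probs * probs.log()).sum()
uniform_loss = torch.log(torch.tensor(float(n_classes)))
print(bias_only_loss.item(), entropy.item(), uniform_loss.item())
```

The first two printed numbers should agree, and both should be below the third. Minibatch losses that hover around one level without improving are consistent with this kind of bias-only fit.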


In [16]:
emb = C[Xval]
current_inp = emb.view(emb.shape[0], -1) # Flatten everything except the batch dimension.

for layer in layers:
    current_inp = layer(current_inp)

F.cross_entropy(current_inp, Yval)

tensor(3.2870, grad_fn=<NllLossBackward0>)

BatchNorm, unlike other normalization layers like LayerNorm/GroupNorm etc., has the big advantage that after training, the batchnorm gamma/beta can be "folded into" the weights of the preceding Linear layers, effectively erasing the need to forward it at test time. Set up a small 3-layer MLP with batchnorms, train the network, then "fold" the batchnorm gamma/beta into the preceding Linear layer's W,b by creating a new W2, b2 and erasing the batch norm. Verify that this gives the same forward pass during inference. I.e., we see that the batchnorm is there just for stabilizing the training, and can be thrown out after training is done! Pretty cool.
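Before training anything, the folding algebra itself can be checked in isolation. The sketch below is not the trained 3-layer MLP the exercise asks for: it uses random stand-ins for a trained Linear layer and for the BatchNorm parameters and running statistics, and all names (`gamma`, `beta`, `running_mean`, `running_var`, `W2`, `b2`) and sizes are assumptions. At inference, BatchNorm applies a fixed per-feature scale and shift, so `gamma * (x @ W + b - mean) / sqrt(var + eps) + beta` can be rewritten as `x @ W2 + b2`.

```python
import torch

torch.manual_seed(0)
fan_in, fan_out, eps = 30, 100, 1e-5  # hypothetical sizes and epsilon

# Random stand-ins for a trained Linear layer and the BatchNorm that follows it.
W = torch.randn(fan_in, fan_out) / fan_in ** 0.5
b = torch.randn(fan_out) * 0.1
gamma = torch.rand(fan_out) + 0.5          # BatchNorm gain
beta = torch.randn(fan_out) * 0.1          # BatchNorm bias
running_mean = torch.randn(fan_out) * 0.1  # statistics frozen after training
running_var = torch.rand(fan_out) + 0.5

def linear_then_batchnorm(x):
    """Inference-mode forward pass: Linear followed by BatchNorm with running statistics."""
    h = x @ W + b
    return gamma * (h - running_mean) / torch.sqrt(running_var + eps) + beta

# Fold the per-feature scale and shift into a new weight matrix and bias.
scale = gamma / torch.sqrt(running_var + eps)
W2 = W * scale                          # scales each output column of W
b2 = (b - running_mean) * scale + beta

x = torch.randn(8, fan_in)
max_diff = (linear_then_batchnorm(x) - (x @ W2 + b2)).abs().max().item()
print(f"max abs difference: {max_diff:.2e}")
```

The difference should be at floating-point noise level. The fold only works with the frozen running statistics: during training the mean and variance depend on the current batch, so the BatchNorm layer cannot be replaced by fixed weights.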