### Import the library 

In [25]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
import numpy as np
import matplotlib.pyplot as plt

### Hyperparameters 

In [21]:
d_embed = 512
num_heads = 8
num_batches = 1
vocab = 50_000
max_len = 5000
n_layers = 1
d_ff = 2048
epsilon = 1e-6

### Make Dummy data 

In [22]:
x = torch.tensor([[1,2,3]]) # Input is size batch_size x sequence_length
y = torch.tensor([[1,2,3]]) 
x_mask = torch.tensor([[1,0,1]])
y_mask = torch.tensor([[1,0,1]])
print("x",x.size())
print("y",y.size())


x torch.Size([1, 3])
y torch.Size([1, 3])


## Encoder 

### 1.1 Encoder Embeddings

In [23]:
emb = nn.Embedding(vocab, d_embed)
# We are extracting the embeddings for the tokens from the vocabulary
# The dimensions after this operation will be batch_size x sequence_length x d_embed
x = emb(x) 
# scale the embeddings by sqrt(d_embed) (called d_model in the paper) to make them bigger
x = x * math.sqrt(d_embed)
print(x.size())

torch.Size([1, 3, 512])


#### Adding positional embedding

In [28]:
# start with empty tensor
pe = torch.zeros(max_len, d_embed, requires_grad=False)
# column vector containing index values 0 to max_len - 1
position = torch.arange(0,max_len).unsqueeze(1)
# frequencies 1 / 10000^(2i/d_embed); the scaling must happen inside exp()
divisor = torch.exp(torch.arange(0,d_embed,2) * -(math.log(10000.0)/d_embed))
# Make overlapping sine and cosine wave inside positional embedding tensor
pe[:,0::2] = torch.sin(position * divisor)
pe[:,1::2] = torch.cos(position * divisor)
pe = pe.unsqueeze(0)
# Add the positional embedding to the main embedding
x = x + pe[:,:x.size(1)]
print(x.size())

torch.Size([1, 3, 512])
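
As a quick sanity check on the positional embedding, the same computation can be wrapped in a small function and inspected on a tiny table. This is only a sketch: the function name `sinusoidal_positions` and the small sizes (50 positions, 16 columns) are assumptions chosen for readability and are not part of the notebook.

```python
import math
import torch

def sinusoidal_positions(max_len: int, d_embed: int) -> torch.Tensor:
    """Return a (max_len, d_embed) table of fixed sine/cosine position encodings."""
    # Column vector of positions 0 .. max_len - 1
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    # Frequencies 1 / 10000^(2i / d_embed), computed in log space
    div_term = torch.exp(
        torch.arange(0, d_embed, 2, dtype=torch.float32) * -(math.log(10000.0) / d_embed)
    )
    pe = torch.zeros(max_len, d_embed)
    pe[:, 0::2] = torch.sin(position * div_term)  # even columns hold sines
    pe[:, 1::2] = torch.cos(position * div_term)  # odd columns hold cosines
    return pe

# Hypothetical small sizes, just for inspection
pe_table = sinusoidal_positions(max_len=50, d_embed=16)
print(pe_table.shape)
# At position 0 every sine is 0 and every cosine is 1
print(pe_table[0, :4])
```

Every entry lies in the range -1 to 1, which is one way to spot a misplaced parenthesis in the frequency term: that kind of mistake tends to produce `inf` or `nan` values instead.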


#### 1.2 Encoder Attention Layers 

##### 1.2.1.1 Set aside Residuals 

In [30]:
x_residuals = x.clone()
print(x.size())

torch.Size([1, 3, 512])


##### 1.2.1.2 Pre-Self Attention Layer Normalization

In [32]:
# Center each token's values on their mean and scale them by their standard deviation
# W1 and b1 are learnable parameters (a gain and a bias), not hyperparameters
mean = x.mean(-1,keepdim=True)
std = x.std(-1,keepdim=True)
W1 = nn.Parameter(torch.ones(d_embed))
b1 = nn.Parameter(torch.zeros(d_embed))
x = W1 * (x - mean) / (std + epsilon) + b1
print(x.size())

torch.Size([1, 3, 512])
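
To see what this normalization does, the sketch below applies the same formula to random stand-in activations and checks that each token vector ends up with mean close to 0 and standard deviation close to 1. The tensor `h` and the names `gain` and `bias` are assumptions for illustration. The comparison with `torch.nn.functional.layer_norm` is approximate, because that function uses the biased variance and adds epsilon inside the square root.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_embed, epsilon = 512, 1e-6
# Stand-in activations with a non-zero mean and a large spread (assumed data)
h = torch.randn(1, 3, d_embed) * 5 + 2

gain = torch.ones(d_embed)   # would be a learnable nn.Parameter in a real model
bias = torch.zeros(d_embed)  # would be a learnable nn.Parameter in a real model

mean = h.mean(-1, keepdim=True)
std = h.std(-1, keepdim=True)  # unbiased estimate (divides by d_embed - 1)
h_norm = gain * (h - mean) / (std + epsilon) + bias

print(h_norm.mean(-1))  # close to 0 for every token
print(h_norm.std(-1))   # close to 1 for every token

# Built-in reference: small differences are expected (biased variance, eps placement)
ref = F.layer_norm(h, (d_embed,), eps=epsilon)
print((h_norm - ref).abs().max())
```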


##### 1.2.1.3 Self-Attention 

Self-attention generates scores that indicate how related each token is to every other token. So we would expect a `seq_length x seq_length` matrix of values between 0 and 1, each indicating the importance of the i-th token to the j-th token.

The input to self-attention is a `batch_size x sequence_length x embedding_size` tensor.

Self-attention copies the input `x` three times and calls the copies `query (q)`, `key (k)` and `value (v)`. Each of these matrices goes through its own linear layer. The network learns to produce scores in these linear layers, which make the three matrices different from one another. If the network comes up with the right, different matrices, it will get good attention scores.

We designate chunks of each token embedding to different heads.

The q and k tensors are multiplied together. This creates a `batch_size x num_heads x sequence_length x sequence_length` tensor. Ignoring batching and heads, one can interpret this matrix as containing the raw scores where each cell computes how related the i-th token is to the j-th token (i is the row and j is the column).

Next we pass this matrix through a softmax layer. The secret to softmax is that it can act like an argmax: it can pick the best match. Softmax squishes all values along a particular dimension into the range 0 to 1 so that they sum to 1. When one raw score is much larger than the others, its cell ends up close to 1 and all the rest close to 0. If we multiply this softmaxed score matrix by the v matrix, we are in essence asking (for each head) which column is best for each row. Recall that rows and columns correspond to tokens, so we are asking which token goes best with each token. Again, if the earlier linear layers get their parameters right, this multiplication will make good choices and the loss will improve.
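
A small sketch of softmax behaving like a soft argmax. The raw scores below are made up for illustration. The first row has one clearly dominant score, while the second row has nearly equal scores.

```python
import torch
import torch.nn.functional as F

# Made-up raw scores: row 0 has a clear winner, row 1 does not
raw = torch.tensor([[1.0, 2.0, 5.0],
                    [0.1, 0.2, 0.3]])
weights = F.softmax(raw, dim=-1)

print(weights)              # row 0 puts most of its weight on the last column
print(weights.sum(dim=-1))  # every row sums to 1
# The largest weight always sits where the largest raw score was
print(weights.argmax(dim=-1), raw.argmax(dim=-1))
```

The second row shows the limit of the argmax analogy: when the raw scores are close together, softmax spreads the weight almost evenly instead of picking a winner.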

At this point we can think of the softmaxed scores multiplied against v as trying to zero out everything but the most relevant token embedding (several, because of multiple heads). The result, which we will store back in x for consistency, is mainly the most-attended token embedding (several, because of multiple heads) plus a little bit of every other token embedding sprinkled in. We can't do an actual argmax, so the best we can do is push everything irrelevant close to zero so that it barely affects the result.

This multiplication of the scores against the v matrix is what we refer to as self-attention. It is essentially a dot-product with an underlying learned scoring function. It basically tells us where we should look for good information. The Decoder will use this later.
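
The whole idea can be sketched on a single head with three tokens and hand-picked numbers. The values of `q`, `k` and `v` below are assumptions chosen so the effect is easy to see. In a real model they would come out of the learned linear layers.

```python
import math
import torch
import torch.nn.functional as F

# Three toy token "values" (sequence_length = 3, d_k = 2)
v = torch.tensor([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])
# Hand-picked queries and keys:
#   token 0's query matches only token 2's key,
#   token 2's query matches nothing (all raw scores are 0)
q = torch.tensor([[10.0, 0.0],
                  [0.0, 10.0],
                  [0.0, 0.0]])
k = torch.tensor([[0.0, 1.0],
                  [0.0, 2.0],
                  [2.0, 0.0]])

d_k = q.size(-1)
scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)  # (3, 3) raw scores
weights = F.softmax(scores, dim=-1)                              # each row sums to 1
out = torch.matmul(weights, v)                                   # (3, 2) weighted mix of v rows

print(weights)
print(out)  # row 0 is close to v[2]; row 2 is the plain average of all rows of v
```

Row 0 shows the near-argmax case, where the output is almost a copy of the most-attended value. Row 2 shows the opposite extreme, where equal scores produce a uniform average.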

In [34]:
# Make three versions of x for the query, key and values
k = x
q = x
v = x
# Make three linear layers
# This is where the network learns to make scores
linear_k = nn.Linear(d_embed, d_embed)
linear_q = nn.Linear(d_embed, d_embed)
linear_v = nn.Linear(d_embed, d_embed)
# We are going to fold the embedding dimensions and treat each fold as an attention head
d_k = d_embed // num_heads
# Pass q, k, v through their linear layers
q = linear_q(q)
k = linear_k(k)
v = linear_v(v)
# Do the fold, treating each group of d_k dimensions as a head
# Put the head dimension in the second position
q = q.view(num_batches, -1, num_heads, d_k).transpose(1,2)
k = k.view(num_batches, -1, num_heads, d_k).transpose(1,2)
v = v.view(num_batches, -1, num_heads, d_k).transpose(1,2)
print("q",q.size())
print("k",k.size())
print("v",v.size())

q torch.Size([1, 8, 3, 64])
k torch.Size([1, 8, 3, 64])
v torch.Size([1, 8, 3, 64])
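
The fold can be checked directly: after `view` and `transpose`, head `h` holds columns `h * d_k` to `(h + 1) * d_k` of each token's vector. The sketch below uses random stand-in tensors (`q_flat`, `k_flat` are assumed names) and also shows the shape of the raw score tensor that results from multiplying the folded queries and keys.

```python
import torch

torch.manual_seed(0)
num_batches, seq_len, d_embed, num_heads = 1, 3, 512, 8
d_k = d_embed // num_heads  # 64 columns per head

# Stand-ins for the outputs of the q and k linear layers
q_flat = torch.randn(num_batches, seq_len, d_embed)
k_flat = torch.randn(num_batches, seq_len, d_embed)

q_heads = q_flat.view(num_batches, -1, num_heads, d_k).transpose(1, 2)
k_heads = k_flat.view(num_batches, -1, num_heads, d_k).transpose(1, 2)
print(q_heads.shape)  # torch.Size([1, 8, 3, 64])

# Head 2 of token 1 is exactly columns 128..191 of that token's vector
head, token = 2, 1
chunk = q_flat[0, token, head * d_k:(head + 1) * d_k]
print(torch.equal(q_heads[0, head, token], chunk))  # True

# One sequence_length x sequence_length score matrix per head
raw_scores = torch.matmul(q_heads, k_heads.transpose(-2, -1))
print(raw_scores.shape)  # torch.Size([1, 8, 3, 3])
```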


To produce the attention scores we multiply q and k (and normalize). We need to apply the mask so that masked tokens are not attended to. Then we apply softmax to emulate argmax (good stuff close to 1, irrelevant stuff close to 0). You won't see this happen if you look at attn because the linear layers aren't trained yet. The attention scores are finally applied to v.

In [None]:
d_k = q.size(-1)
# compute the scores by multiplying q by the transpose of k, scaled by sqrt(d_k)
scores = torch.matmul(q,k.transpose(-2,-1)) / math.sqrt(d_k)
# Mask out the scores: fill masked positions with a large negative number so softmax gives them ~0 weight
scores = scores.masked_fill(x_mask == 0, -1e9)
# Softmax the scores, ideally creating one score close to 1 and the rest close to 0 
# (Note: this won't happen if you look at the numbers because the linear layers haven't 
# learned anything yet.)
attn = F.softmax(scores,dim = -1)
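
The effect of the mask fill value can be sketched on random stand-in scores. The tensors below are assumptions with the same shapes as in the notebook. A large negative fill drives the masked column to zero weight after softmax, while a fill value close to zero leaves the masked token with ordinary weight. The final merge of the heads back into `d_embed` columns is an assumed next step and is not taken from the notebook.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_heads, seq_len, d_k = 8, 3, 64
scores = torch.randn(1, num_heads, seq_len, seq_len)  # stand-in raw scores
v = torch.randn(1, num_heads, seq_len, d_k)           # stand-in values
x_mask = torch.tensor([[1, 0, 1]])                    # token 1 is masked

# The (1, 3) mask broadcasts against (1, 8, 3, 3) along the last (key) dimension
attn = F.softmax(scores.masked_fill(x_mask == 0, -1e9), dim=-1)
print(attn[0, 0])  # the middle column is 0 in every row

# A fill value near zero does not suppress the masked token
weak = F.softmax(scores.masked_fill(x_mask == 0, -1e-6), dim=-1)
print(weak[0, 0])  # the middle column still carries weight

# Apply the attention weights to v: (1, 8, 3, 64)
out = torch.matmul(attn, v)
# Assumed merge step: concatenate the heads back into d_embed columns
merged = out.transpose(1, 2).contiguous().view(1, -1, num_heads * d_k)
print(merged.shape)  # torch.Size([1, 3, 512])
```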

