# Understanding The Transformer

The Transformer is built on the attention mechanism and drops recurrence altogether, which allows more parallel computation and therefore a faster implementation. The Transformer was also found to beat the state-of-the-art language translation and summarization techniques on many tasks. It was proposed in the paper "Attention Is All You Need" by the Google Brain team.

In [None]:
from torch import nn
import torch
import math
import numpy as np
from torch.autograd import Variable
import matplotlib.pyplot as plt

In [None]:
opt = {"d_model":512,  "trg_pad":1,"src_pad":1}


## Understanding masking
Masking has two functions in the Transformer network:

- In the encoder and decoder, it zeroes the attention wherever the input and target sentences, respectively, contain padding.
- In the decoder, it prevents the decoder from cheating by peeking ahead at later positions in the sequence.

Let's code a dummy source and target sequence to make these two points concrete. Here the source sequence is set equal to the target sequence. Each sentence occupies one column, and all sentences are brought to the same length by padding with the index 1.

```python
src = torch.tensor([
        [2, 3, 4, 5, 6, 7, 8, 9],
        [2, 7, 7, 4, 2, 4, 3, 4],
        [3, 6, 8, 5, 2, 1, 3, 4],
        [4, 7, 9, 6, 3, 1, 7, 1],
        [5, 7, 2, 7, 3, 1, 8, 1],
        [1, 6, 2, 8, 4, 1, 8, 1],
        [1, 1, 1, 1, 5, 1, 9, 1],
        [1, 1, 1, 1, 5, 1, 1, 1]])
trg = src
```
Next, we create the `nopeak_mask` function, which stops the decoder from peeking ahead of the position it is currently decoding in the target sequence.

```python
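def nopeak_mask(size, opt):
    # Ones strictly above the diagonal mark the future positions
    np_mask = np.triu(np.ones((1, size, size)), k=1).astype('uint8')
    # Invert: True where attention is allowed (current and earlier positions)
    np_mask = torch.from_numpy(np_mask) == 0
    return np_mask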
```
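
As a quick check, here is a minimal sketch of the look-ahead mask for a sequence of length 4. It rebuilds the mask with `torch.tril` instead of calling `nopeak_mask`, which should be an equivalent formulation; the names `size` and `demo_mask` are illustrative and not part of the original notebook.

```python
import torch

size = 4
# Lower-triangular boolean matrix: row q may attend to columns 0..q only
demo_mask = torch.tril(torch.ones(1, size, size)).bool()
print(demo_mask[0].int())
```

Row `q` of the printed matrix has ones in columns `0..q` and zeros afterwards, so each position can see itself and everything before it, but nothing that comes later.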

We also need a `create_masks` function that takes the source and target sequences and builds the masks for them.

```python
def create_masks(src, trg, opt):
    # Padding mask: True for real tokens, False for padding; shape (batch, 1, seq_len)
    src_mask = (src != opt["src_pad"]).unsqueeze(-2)

    if trg is not None:
        trg_mask = (trg != opt["trg_pad"]).unsqueeze(-2)
        size = trg.size(1)  # get seq_len for matrix
        np_mask = nopeak_mask(size, opt)
        # Combine padding and look-ahead masks; broadcasts to (batch, seq_len, seq_len)
        trg_mask = trg_mask & np_mask
    else:
        trg_mask = None
    return src_mask, trg_mask

src_mask, trg_mask = create_masks(src, trg, opt)
```
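
To make the shapes concrete, the following sketch repeats the same two steps on a hypothetical toy batch of two sentences with four positions each. The names `toy`, `pad_mask`, `look_ahead` and `toy_trg_mask` are assumptions introduced for illustration, and the look-ahead mask is built with `torch.tril` for brevity.

```python
import torch

pad = 1  # padding index, matching opt["src_pad"] and opt["trg_pad"]

# Hypothetical toy batch: 2 sentences (rows), 4 token positions (columns)
toy = torch.tensor([[5, 6, 7, 1],
                    [8, 9, 1, 1]])

pad_mask = (toy != pad).unsqueeze(-2)                 # shape (2, 1, 4)
look_ahead = torch.tril(torch.ones(1, 4, 4)).bool()   # shape (1, 4, 4)
toy_trg_mask = pad_mask & look_ahead                  # broadcasts to (2, 4, 4)

print(pad_mask.shape, toy_trg_mask.shape)
print(toy_trg_mask[1].int())
```

For the second sentence, which has two real tokens, every row of the combined mask is zero in the last two columns (padding), and the first row is additionally restricted to the first column (look-ahead).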

Using the plotting code in the notebook Ch6/understanding_the_transformer.ipynb, the source mask looks like this.

![](figures/tranformer_masking.png)

Figure: Masking applied to source sentences in the Transformer.

Each source sentence is shown here as a column. Wherever the padding index 1 appears, the sentence appears truncated. The source mask is 1 up to the length of each sentence and 0 thereafter, which marks the positions that attention should ignore. The target mask is a bit trickier to understand.

![](figures/tranformer_masking_2.png)

Figure: Masking applied to the decoder in the Transformer.

Usually, the decoder's task is to take the encoder output and produce the output sequence, and during training it is teacher-forced with the target. The mask lets the decoder look only at the current and earlier positions and hides all future positions. Note also that the decoder mask is zero wherever the sequence contains padding. After the first two target masks, the third target mask has position 5 (counting from zero) masked, because the corresponding column (sentence) in the source consists of only two words.
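
The text above does not show how a mask is consumed inside attention. A common approach, assumed here rather than taken from this notebook, is to overwrite the masked attention scores with a large negative number before the softmax. The sketch below uses hypothetical random scores and a look-ahead mask of size 4.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Hypothetical raw attention scores with shape (batch, query, key)
scores = torch.randn(1, 4, 4)
# Look-ahead mask: True where a query position may attend to a key position
mask = torch.tril(torch.ones(1, 4, 4)).bool()

# Masked positions get a large negative score, so softmax drives them to ~0
masked_scores = scores.masked_fill(~mask, -1e9)
weights = F.softmax(masked_scores, dim=-1)

print(weights[0])
```

Each row of `weights` still sums to 1, but the weight placed on future positions is effectively zero.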

#### Source and Target Mask

In [None]:
# Plot the source batch next to its padding mask
fig = plt.figure()
ax = plt.subplot(241)
ax.imshow(src, cmap="Purples")
ax.set_title("Source Sentence")

ax1 = plt.subplot(242)
ax1.imshow(src_mask.squeeze(-2), cmap="Purples")
ax1.set_title("Source Mask")

# Plot the eight target masks, one subplot per entry of trg_mask
fig = plt.figure()
fig.suptitle("Target Masks")
for idx in range(8):
    plt.subplot(2, 4, idx + 1)
    plt.imshow(trg_mask[idx], cmap="Purples")
plt.show()
print("Target Matrix Mask")

---
## Positional Encoding
The final piece of the Transformer model that remains is the positional encoding.

Unlike recurrent networks, the multi-head attention network cannot naturally make use of the position of the words in the input sequence. Without positional encodings, it would produce the same per-word outputs for the sentences **“I like cats more than dogs”** and **“I like dogs more than cats”**, merely in a different order. Positional encodings explicitly encode the relative or absolute positions of the inputs as vectors, which are then added to the input embeddings.
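
This order-blindness can be checked with a small sketch. It uses plain scaled dot-product self-attention without learned projections or multiple heads, which is a simplification of the real layer, and random vectors as hypothetical word embeddings. The helper `self_attention` and the variables `words` and `perm` are illustrative names.

```python
import torch
import torch.nn.functional as F

def self_attention(x):
    # Plain scaled dot-product self-attention with no learned projections
    scores = x @ x.transpose(-2, -1) / x.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ x

torch.manual_seed(0)
words = torch.randn(6, 16)                 # 6 hypothetical word embeddings of size 16
perm = torch.tensor([0, 1, 5, 3, 4, 2])    # swap positions 2 and 5 ("cats" <-> "dogs")

out = self_attention(words)
out_swapped = self_attention(words[perm])

# Compare the reordered original output with the output for the swapped sentence
same_up_to_order = torch.allclose(out[perm], out_swapped, atol=1e-5)
print(same_up_to_order)
```

Swapping two words only swaps the corresponding output rows, so without extra position information the layer cannot tell the two sentences apart.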

As described in the paper, the positional embedding is given by:

$$ PE_{(pos,2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)\\
PE_{(pos,2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right) $$

$pos$ is the position of the word in the sentence, and $i$ indexes the sine/cosine pairs along the embedding dimension, so $2i$ and $2i+1$ are the even and odd dimensions. Each value in the matrix is then worked out using the equations above.
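
As a worked example of these equations, the sketch below builds the whole table at once with tensor operations. The helper name `sinusoid_table` is hypothetical, and the sizes 80 and 512 are chosen to match the values used elsewhere in this notebook.

```python
import math
import torch

def sinusoid_table(max_seq_len, d_model):
    # Hypothetical helper: builds the (max_seq_len, d_model) table from the equations
    pos = torch.arange(max_seq_len, dtype=torch.float32).unsqueeze(1)  # (max_seq_len, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)           # the values of 2i
    angle = pos / (10000 ** (two_i / d_model))                         # (max_seq_len, d_model / 2)
    table = torch.zeros(max_seq_len, d_model)
    table[:, 0::2] = torch.sin(angle)  # even dimensions use sine
    table[:, 1::2] = torch.cos(angle)  # odd dimensions use cosine
    return table

table = sinusoid_table(80, 512)
print(table.shape)
print(table[0, :4])   # position 0: sin(0) = 0 and cos(0) = 1 alternate
print(table[1, :2])   # position 1, first pair: sin(1) and cos(1)
```

At position 0 every sine entry is 0 and every cosine entry is 1. The first pair uses the shortest wavelength, and later pairs oscillate more and more slowly.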

In [None]:
class PositionalEncoder(nn.Module):
    def __init__(self, d_model, max_seq_len=80):
        super().__init__()
        self.d_model = d_model
        # create constant 'pe' matrix with values dependent on
        # pos and i
        pe = torch.zeros(max_seq_len, d_model)
        for pos in range(max_seq_len):
            # i steps over the even dimensions, so it already plays the role of 2i
            for i in range(0, d_model, 2):
                pe[pos, i] = math.sin(pos / (10000 ** (i / d_model)))
                pe[pos, i + 1] = math.cos(pos / (10000 ** (i / d_model)))

        # add a batch dimension: (1, max_seq_len, d_model)
        pe = pe.unsqueeze(0)
        # buffer: saved with the module but not trained
        self.register_buffer('pe', pe)

    def forward(self, x):
        # make embeddings relatively larger
        x = x * math.sqrt(self.d_model)
        # add constant to embedding
        seq_len = x.size(1)
        x = x + self.pe[:, :seq_len]
        return x, self.pe
PE = PositionalEncoder(opt["d_model"])

In [None]:
X = torch.rand([80,512])

In [None]:
Y,pe = PE(X)

When plotted, the positional embeddings look like this:

![](figures/tranformer_positional_embedding2.png)

Figure: The first subplot is the source input. The positional embedding (second subplot) is generated and added to the source input, giving it a sense of position, as shown in the third subplot.

In [None]:
fig = plt.figure()
fig.suptitle("Positional Embeddings")
plt.subplot(311)
plt.title("Source Input")
plt.imshow(X, cmap="inferno")
plt.subplot(312)
plt.title("Positional Embeddings")
plt.imshow(pe.squeeze(), cmap="inferno")
plt.subplot(313)
plt.title("Positional Embeddings + Source Input")
plt.imshow(pe.squeeze() + X, cmap="inferno")
plt.show()