In [2]:
import math, copy, sys

import torch
import torch.nn as nn
from torch.autograd import Variable
import torch.nn.functional as F
import import_ipynb
from Elements import *

importing Jupyter notebook from Elements.ipynb
importing Jupyter notebook from MoveData.ipynb


## The Encoder 

Compare this diagram from [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf) to the code in the cells below. 

All the subcomponents of `EncoderLayer` are explained in detail in Elements.ipynb.

The inputs to the encoder are sequences of integers shaped (batch size, sequence length).

The outputs are sequences of vectors shaped (batch size, sequence length, embedding dimensions).
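
To make those two shapes concrete, here is a minimal sketch that uses PyTorch's built-in `nn.Embedding` as a stand-in for the notebook's `Embedder` (an assumption; the real class lives in Elements.ipynb). The sizes are illustrative only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, emb_dim = 10, 4                 # illustrative sizes, not the notebook's
embed = nn.Embedding(vocab_size, emb_dim)   # stand-in for Embedder (assumption)

# Two sequences of three token ids each: (batch size, sequence length)
source_sequence = torch.tensor([[1, 5, 2],
                                [7, 0, 3]])

# Each integer is looked up and replaced by a vector of length emb_dim
vector_sequence = embed(source_sequence)

input_shape = tuple(source_sequence.shape)    # (batch size, sequence length)
output_shape = tuple(vector_sequence.shape)   # (batch size, sequence length, emb_dim)
print(input_shape, output_shape)
```

The encoder layers that follow never change this last shape; they only re-represent each vector in the context of the others.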

<img src="../saved/images/encoderchart.png">

The diagram above mirrors the code below. As you follow the diagram upwards, you can see the data pass through the same modules as you move down the lines of code. `vector_sequence` in our code is the sequence of token vectors, the black line, moving through the diagram.

The box labeled Nx is the `EncoderLayer` module below, which is repeated N times (N = 6 in the paper).
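
The "repeat N times" idea can be sketched as follows. The notebook's `get_clones` is defined in Elements.ipynb and is not shown here, so `clone_layers` below is a hypothetical helper that only illustrates one common way to do it: deep-copy a layer N times into an `nn.ModuleList`. A plain `nn.Linear` stands in for `EncoderLayer`.

```python
import copy
import torch
import torch.nn as nn

def clone_layers(module, n):
    # Hypothetical helper: n independent deep copies of the same layer
    return nn.ModuleList([copy.deepcopy(module) for _ in range(n)])

torch.manual_seed(0)
layers = clone_layers(nn.Linear(4, 4), 6)   # nn.Linear stands in for EncoderLayer

x = torch.randn(2, 3, 4)                    # (batch size, sequence length, emb_dim)
for layer in layers:                        # the "Nx" box: same kind of layer, applied in turn
    x = layer(x)

n_layers = len(layers)
# Copies start with equal values but are separate parameters that train independently
shares_weights = layers[0].weight is layers[1].weight
starts_equal = torch.equal(layers[0].weight, layers[1].weight)
```

The point of copying rather than reusing one layer is that each of the N layers learns its own weights.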

In [3]:
class EncoderLayer(nn.Module):
    def __init__(self, emb_dim, heads, dropout=0.1):
        super().__init__()
        self.norm_1 = Norm(emb_dim)
        self.dropout_1 = nn.Dropout(dropout)
        self.attn = MultiHeadAttention(heads, emb_dim, dropout=dropout)
        self.norm_2 = Norm(emb_dim)
        self.ff = FeedForward(emb_dim, dropout=dropout)
        self.dropout_2 = nn.Dropout(dropout)
        
    def forward(self, vector_sequence, mask):
        '''
        input:
            vector_sequence of shape (batch size, sequence length, embedding dimensions)
            mask (mask over input sequence) of shape (batch size, 1, sequence length)
        output: sequence of vectors after self-attention, feed-forward and residual connections
            shape (batch size, sequence length, embedding dimensions)
        '''
        x2 = self.norm_1(vector_sequence)
        x2_attn, x2_scores = self.attn(x2,x2,x2,mask)
        vector_sequence = vector_sequence + self.dropout_1(x2_attn)
        x2 = self.norm_2(vector_sequence)
        vector_sequence = vector_sequence + self.dropout_2(self.ff(x2))
        return vector_sequence

In the forward function of `Encoder` (defined in the code cell further down) you can see that, just as in the diagram, the embedding occurs first, `vector_sequence = self.embed(source_sequence)`, followed by positional encoding, `vector_sequence = self.pe(vector_sequence)`. Then the `EncoderLayer` is applied `n_layers` times.

Within the gray box in the diagram, the first step is the Multi-Head Attention. In our implementation we normalize first. The data then splits into 3 arrows before going into Multi-Head Attention, just as we use `x2` three times in `x2_attn, x2_scores = self.attn(x2,x2,x2,mask)`. This signifies that the same sequence of vectors is used to generate the query `q`, key `k` and value `v` vectors for each of the `heads` attention heads.
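
As a rough illustration of what passing the same tensor three times means, here is a single-head scaled dot-product attention sketch. It is not the notebook's `MultiHeadAttention` (that class is in Elements.ipynb and also applies learned projections and several heads); the `attention` function below is a simplified assumption.

```python
import math
import torch
import torch.nn.functional as F

def attention(q, k, v, mask=None):
    # scores[b, i, j]: how strongly position i (query) attends to position j (key)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        # mask is boolean; False positions are excluded from the softmax
        scores = scores.masked_fill(~mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ v, weights

torch.manual_seed(0)
x2 = torch.randn(1, 3, 4)                      # (batch size, sequence length, emb_dim)
mask = torch.ones(1, 1, 3, dtype=torch.bool)   # (batch size, 1, sequence length), nothing masked

# Self-attention: the same sequence plays the role of q, k and v
out, weights = attention(x2, x2, x2, mask)
```

Each row of `weights` sums to 1, and `out` has the same shape as `x2`.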

Another minor difference is the use of dropout. Dropout randomly zeroes out some activations during training. The line above that reads `vector_sequence = vector_sequence + self.dropout_1(x2_attn)` could be written `vector_sequence = vector_sequence + x2_attn` to make it resemble the residual connection in the diagram. The line in the diagram that branches off, goes around the Multi-Head Attention layer, and enters the "Add & Norm" module is the residual. In plain English, the input to the norm layer is the sum of `vector_sequence` after it passes through the Multi-Head Attention and a copy of `vector_sequence` that has not.

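In the same way, the output of the feed-forward layer, `self.ff(x2)`, is added to `vector_sequence`, the value that entered the feed-forward sub-block, before it is normalized again (by the next layer's `norm_1`, or by the final `self.norm` in `Encoder`). This is the 2nd residual in the diagram.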
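
The two orderings discussed above can be put side by side in a small sketch. `nn.LayerNorm` stands in for `Norm` and `nn.Linear` stands in for the attention or feed-forward sub-layer; both substitutions are assumptions made to keep the example self-contained.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb_dim = 4
norm = nn.LayerNorm(emb_dim)              # stand-in for Norm (assumption)
sublayer = nn.Linear(emb_dim, emb_dim)    # stand-in for attention or feed-forward
dropout = nn.Dropout(0.1)
dropout.eval()                            # in eval mode dropout is the identity

x = torch.randn(2, 3, emb_dim)

# Ordering used in this notebook: normalize, apply sub-layer, then add the residual
pre_norm = x + dropout(sublayer(norm(x)))

# Ordering drawn in the diagram: apply sub-layer, add the residual, then normalize
post_norm = norm(x + dropout(sublayer(x)))

# With dropout inactive, the notebook's line reduces to the plain residual form
plain = x + sublayer(norm(x))
same_without_dropout = torch.allclose(pre_norm, plain)
```

Both orderings keep the shape (batch size, sequence length, embedding dimensions); they differ only in where the normalization sits relative to the residual sum.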

In [4]:
class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, n_layers, heads, dropout):
        super().__init__()
        self.n_layers = n_layers
        self.embed = Embedder(vocab_size, emb_dim)
        self.pe = PositionalEncoder(emb_dim, dropout=dropout)
        self.layers = get_clones(EncoderLayer(emb_dim, heads, dropout), n_layers)
        self.norm = Norm(emb_dim)
    def forward(self, source_sequence, source_mask):
        '''
        input:
            source_sequence (sequence of source tokens) of shape (batch size, sequence length)
            source_mask (mask over input sequence) of shape (batch size, 1, sequence length)
        output: sequence of vectors after embedding, positional encoding, attention and normalization
            shape (batch size, sequence length, embedding dimensions)
        '''
        vector_sequence = self.embed(source_sequence)
        vector_sequence = self.pe(vector_sequence)
        for i in range(self.n_layers):
            vector_sequence = self.layers[i](vector_sequence, source_mask)
        vector_sequence = self.norm(vector_sequence)
        return vector_sequence

In [9]:
# Example of type and shape of input and output to Encoder 

if False:
    encoder = Encoder(vocab_size=3, emb_dim=4, n_layers=1, heads=2, dropout=0)
    source_sequence = torch.tensor([0, 1, 2]).unsqueeze(0)  # shape (1, 3)
    # No token equals 4 here, so this mask is all True (nothing is masked)
    input_mask = (source_sequence != 4).unsqueeze(-2)
    encoding = encoder(source_sequence, input_mask)
    print("encoding.shape",encoding.shape)
    print("----------------------------------------------------------")
    print("encoding",encoding)
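
The mask in the example cell is built by comparing token ids against one particular id. A minimal sketch of the same pattern with visible padding, assuming a hypothetical padding id of `0` (the notebook's actual padding id is not stated here):

```python
import torch

pad_id = 0  # hypothetical padding index, chosen for illustration

# Two sequences padded to length 5: (batch size, sequence length)
source_sequence = torch.tensor([[5, 7, 2, 0, 0],
                                [3, 4, 6, 8, 2]])

# True where there is a real token, False where there is padding.
# unsqueeze(-2) adds the middle axis: (batch size, 1, sequence length)
source_mask = (source_sequence != pad_id).unsqueeze(-2)

mask_shape = tuple(source_mask.shape)
print(source_mask)
```

The extra axis of size 1 lets the same mask broadcast across every query position when it is applied to the attention scores.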

## The Decoder

Compare this diagram from [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf) to the code in the cells below.

<img src="../saved/images/decoderchart.png">

The portion of the diagram within the grey box labeled "Nx" corresponds to the `DecoderLayer` below. The data passes through `n_layers` `DecoderLayer`s, as set by the `n_layers` argument of `Decoder`.
The process of embedding and positional encoding is the same as in the Encoder section. What is different in the Decoder is the 2 ways in which attention is applied.

In the diagram, the "Masked Multi-Head Attention" box corresponds to the line of code `self_attn, self_scores = self.attn_1(de_nrm, de_nrm, de_nrm, de_mask)`. This is the self-attention, which attends to what the decoder has output so far, `de_out (decoder outputs so far)`. An intuitive explanation is that this attention prevents the network from stuttering ("I see see you now") or from skipping ("I you now") by attending to all the words already spoken at the time of generating a new word.
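
The "Masked" part means each position may only attend to itself and to earlier positions. How this notebook builds `de_mask` is not shown in this section, so the `causal_mask` helper below is a hypothetical sketch of the usual lower-triangular construction, matching the documented shape (batch size, output sequence length, output sequence length).

```python
import torch

def causal_mask(size):
    # Hypothetical helper: row i is True for columns 0..i and False afterwards,
    # so position i cannot look at positions that come later
    return torch.tril(torch.ones(1, size, size)).bool()

de_mask = causal_mask(4)   # (1, 4, 4); the leading 1 broadcasts over the batch
print(de_mask[0].int())
```

Row 0 can see only the first token, row 1 the first two, and so on, which is what lets the whole target sequence be trained in one pass without the model peeking ahead.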

The "Multi-Head Attention" box in the diagram is similar to the attention used in the encoder, and it attends to the input sequence. There is a small difference though. Take a look at the way this attention is used:

`en_attn, en_scores = self.attn_2(de_nrm, en_out, en_out, en_mask)`

The `q, k, v` sequences are `de_nrm, en_out, en_out` respectively. This is because, in this case, we are re-representing our `de_out (decoder outputs so far)` in the context of our `en_out (encoder output)`. The `q` sequence holds the vectors you are trying to re-represent. `q`, for query, is like the question. If the `q` for the word "she" and the `k` for the word "chloe" together produce a large softmax score, it suggests that the pronoun "she" refers to "chloe" rather than some other word, like "a" or "bowl".
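
A shape-level sketch may help here. It uses a single head with no learned projections, which is a simplification of the notebook's `MultiHeadAttention`; the sizes are illustrative assumptions.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, in_len, out_len, emb_dim = 1, 5, 3, 4   # illustrative sizes

de_nrm = torch.randn(batch, out_len, emb_dim)  # queries: decoder side
en_out = torch.randn(batch, in_len, emb_dim)   # keys and values: encoder side

# One score per (decoder position, encoder position) pair
scores = de_nrm @ en_out.transpose(-2, -1) / math.sqrt(emb_dim)
weights = F.softmax(scores, dim=-1)            # (batch, out_len, in_len)

# Each decoder position becomes a weighted mix of encoder vectors
en_attn = weights @ en_out                     # (batch, out_len, emb_dim)
```

The result has one vector per decoder position, which is why `q` determines the output length while `k` and `v` only need to agree with each other.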

In [6]:
class DecoderLayer(nn.Module):

    def __init__(self, emb_dim, heads, dropout=0.1):
        super().__init__()
        self.norm_1 = Norm(emb_dim)
        self.norm_2 = Norm(emb_dim)
        self.norm_3 = Norm(emb_dim)
        
        self.dropout_1 = nn.Dropout(dropout)
        self.dropout_2 = nn.Dropout(dropout)
        self.dropout_3 = nn.Dropout(dropout)
        
        self.attn_1 = MultiHeadAttention(heads, emb_dim, dropout=dropout)
        self.attn_2 = MultiHeadAttention(heads, emb_dim, dropout=dropout)
        self.ff = FeedForward(emb_dim, dropout=dropout)

    def forward(self, de_out, de_mask, en_out, en_mask):
        '''
        inputs:
            de_out - decoder outputs so far (batch size, output sequence length, embedding dimensions)
            de_mask (batch size, output sequence length, output sequence length)
            en_out - encoder output (batch size, input sequence length, embedding dimensions)
            en_mask (batch size, 1, input sequence length)
        outputs:
            de_out (next decoder output) (batch size, output sequence length, embedding dimensions)
        '''
        de_nrm = self.norm_1(de_out)
        #Self Attention 
        self_attn, self_scores = self.attn_1(de_nrm, de_nrm, de_nrm, de_mask)
        de_out = de_out + self.dropout_1(self_attn)
        de_nrm = self.norm_2(de_out)
        #DecoderEncoder Attention
        en_attn, en_scores = self.attn_2(de_nrm, en_out, en_out, en_mask) 
        de_out = de_out + self.dropout_2(en_attn)
        de_nrm = self.norm_3(de_out)
        de_out = de_out + self.dropout_3(self.ff(de_nrm))
        return de_out

In [7]:
class Decoder(nn.Module):
    '''
    If your target sequence is `see` `ya` and you want to train on the entire 
    sequence against the target, you would use `<sos>` `see`  `ya`
    as the de_out (decoder outputs so far) and compare the 
    output de_out (next decoder output) against `see` `ya` `<eos>` 
    as the target in the loss function. The inclusion of the `<sos>`
    for the (decoder outputs so far) and `<eos>` for the 
    '''
    def __init__(self, vocab_size, emb_dim, n_layers, heads, dropout):
        super().__init__()
        self.n_layers = n_layers
        self.embed = Embedder(vocab_size, emb_dim)
        self.pe = PositionalEncoder(emb_dim, dropout=dropout)
        self.layers = get_clones(DecoderLayer(emb_dim, heads, dropout), n_layers)
        self.norm = Norm(emb_dim)
    def forward(self, de_toks, de_mask, en_vecs, en_mask):
        '''
        inputs:
            de_toks - decoder outputs so far (batch size, output sequence length)
            de_mask (batch size, output sequence length, output sequence length)
            en_vecs - encoder output (batch size, input sequence length, embedding dimensions)
            en_mask (batch size, 1, input sequence length)
        outputs:
            de_vecs - next decoder output (batch size, output sequence length, embedding dimensions)

        '''
        x = self.embed(de_toks)
        x = self.pe(x)
        for i in range(self.n_layers):
            x = self.layers[i](x, de_mask, en_vecs, en_mask)
        return self.norm(x)
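
The shift described in the `Decoder` docstring can be sketched with plain slicing. The token ids below are hypothetical and chosen only for illustration; the notebook's real vocabulary is not shown in this section.

```python
import torch

# Hypothetical token ids for illustration
SOS, EOS, SEE, YA = 1, 2, 3, 4

# Full target with both markers: <sos> see ya <eos>
target = torch.tensor([[SOS, SEE, YA, EOS]])

# What the decoder is fed: everything except the last token (<sos> see ya)
de_toks = target[:, :-1]

# What the loss compares against: everything except the first token (see ya <eos>)
labels = target[:, 1:]
```

At every position the decoder sees the tokens up to that point and is asked to predict the one that follows.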

You have now learned the two main subcomponents of the Transformer. The last component takes the `de_out` (next decoder output), the Transformer's representation of its response, and translates those vectors back into word tokens to generate the reply. Go to Talk.ipynb for the next lesson.

## How can I help you or get help from you?

[Support *ChloeRobotics* on Patreon and send us a message](https://www.patreon.com/chloerobotics)

## Questions?

email chloe.the.robot [at] gmail [dot] com 