# Introduction

This model differs drastically from the previous models used in these tutorials: it uses no recurrent components at all. Instead, it relies on convolutional layers, which are typically used for image processing. For an introduction to convolutional layers on text for sentiment analysis, see this tutorial.

In short, a convolutional layer uses filters. These filters have a width (and also a height in images, but usually not in text). If a filter has a width of 3, then it can see 3 consecutive tokens. Each convolutional layer has many of these filters (1024 in this tutorial). Each filter slides across the sequence, from beginning to end, looking at 3 consecutive tokens at a time. The idea is that each of these 1024 filters will learn to extract a different feature from the text. The result of this feature extraction is then used by the model, potentially as input to another convolutional layer. All of this is then used to extract features from the source sentence in order to translate it into the target language.

<img src="images/convseq2seq0.png" >
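To make the sliding-window idea concrete, here is a minimal pure-Python sketch (not the PyTorch code used later in this notebook) of a single filter of width 3 moving across a toy sequence. The helper name `slide_filter`, the scalar stand-ins for token embeddings, and the hand-picked filter weights are all illustrative assumptions; a real convolutional layer works on vectors and learns its weights.

```python
def slide_filter(sequence, weights):
    """Apply one filter to every window of consecutive elements.

    sequence: list of floats, one toy scalar per token (an assumption;
              real tokens are represented by vectors).
    weights:  list of floats, the filter; its length is the filter width.
    """
    width = len(weights)
    outputs = []
    # The filter starts at the first token and moves one step at a time.
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # One output value per window: a weighted sum of the tokens it covers.
        outputs.append(sum(w * x for w, x in zip(weights, window)))
    return outputs


tokens = [1.0, 2.0, 3.0, 4.0, 5.0]   # five toy token values
toy_filter = [-1.0, 0.0, 1.0]        # hypothetical filter of width 3

features = slide_filter(tokens, toy_filter)
print(features)       # [2.0, 2.0, 2.0]
print(len(features))  # 3, i.e. 5 - 3 + 1 windows
```

A layer with 1024 filters would repeat this with 1024 different weight sets, producing 1024 features per window.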

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

from torchtext.datasets import Multi30k
from torchtext.data import Field, BucketIterator

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

import spacy
import numpy as np

import random
import math
import time

In [2]:
SEED = 1234

np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

In [2]:
spacy_de = spacy.load('de')
spacy_en = spacy.load('en')

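# to install spacy languages use:
# python -m spacy download en
# python -m spacy download de
# note: spaCy 3 and later removed the 'en'/'de' shortcut names; there the
# full model names (e.g. 'en_core_web_sm', 'de_core_news_sm') are needed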

In [3]:
# tokenizers

def tokenize_de(text):
    """
    Tokenizes German text from a string into a list of strings
    """
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    """
    Tokenizes English text from a string into a list of strings
    """
    return [tok.text for tok in spacy_en.tokenizer(text)]

Next, we'll set up the Fields, which decide how the data will be processed. By default, RNN models in PyTorch require the sequence to be a tensor of shape **[sequence length, batch size]**, so TorchText will, by default, return batches of tensors in the same shape. However, in this notebook we are using CNNs, which expect the batch dimension to be first. We tell TorchText to return batches of shape **[batch size, sequence length]** by setting batch_first = True.

We also add the start-of-sequence and end-of-sequence tokens and lowercase all text.
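The difference between the two layouts can be illustrated without TorchText. The sketch below uses nested Python lists of made-up token indices; `seq_first` and `batch_first` are illustrative variable names, not TorchText objects.

```python
# A toy batch of 2 sentences, each 3 tokens long, stored as token indices.
# The index values are made up for illustration.

# Layout [sequence length, batch size]: one row per position in the sentence.
seq_first = [
    [2, 2],    # position 0 of sentence A and sentence B (e.g. <sos>)
    [15, 27],  # position 1
    [3, 3],    # position 2 (e.g. <eos>)
]

# Layout [batch size, sequence length]: one row per sentence.
batch_first = [list(sentence) for sentence in zip(*seq_first)]

print(batch_first)                            # [[2, 15, 3], [2, 27, 3]]
print(len(batch_first), len(batch_first[0]))  # 2 3
```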

In [4]:
SRC = Field(tokenize = tokenize_de,
           init_token = '<sos>',
           eos_token = '<eos>',
           lower = True,
           batch_first = True)

TRG = Field(tokenize = tokenize_en,
           init_token = '<sos>',
           eos_token = '<eos>',
           lower = True, 
           batch_first = True)

In [5]:
# load data
train_data, valid_data, test_data = Multi30k.splits(exts=('.de', '.en'), 
                                                    fields=(SRC, TRG))

In [6]:
# build vocab
SRC.build_vocab(train_data, min_freq = 2)
TRG.build_vocab(train_data, min_freq = 2)

In [7]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [8]:
#iterator
BATCH_SIZE = 128

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), 
     batch_size = BATCH_SIZE,
     device = device)

# Building Model

Next up is building the model. As before, the model is made of an encoder and decoder. The encoder encodes the input sentence, in the source language, into a context vector. The decoder decodes the context vector to produce the output sentence in the target language.



### Encoder

Previous models in these tutorials had an encoder that compresses an entire input sentence into a single context vector, $z$. The convolutional sequence-to-sequence model is a little different - it gets two context vectors for each token in the input sentence. So, if our input sentence had 6 tokens, we would get 12 context vectors, two for each token.

The two context vectors per token are a **conved vector** and a **combined vector**. The conved vector is the result of each token being passed through a few layers - which we will explain shortly. The combined vector comes from the sum of the conved vector and the embedding of that token. Both of these are returned by the encoder to be used by the decoder.

The image below shows the result of an input sentence - zwei menschen fechten. - being passed through the encoder.

<img src="images/convseq2seq1.png" >

First, the token is passed through a token embedding layer - which is standard for neural networks in natural language processing. However, as there are no recurrent connections in this model, it has no idea about the order of the tokens within a sequence. To rectify this we have a second embedding layer, the positional embedding layer. This is a standard embedding layer where the input is not the token itself but the position of the token within the sequence - starting with the first token, the `<sos>` (start of sequence) token, in position 0.

Next, the token and positional embeddings are elementwise summed together to get a vector which contains information about the token and also its position within the sequence - which we simply call the embedding vector. This is followed by a linear layer which transforms the embedding vector into a vector with the required hidden dimension size.
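The token-plus-position lookup can be sketched in plain Python. This is only an illustration of the idea: the random tables stand in for learned embedding matrices, and the names `make_table`, `embed`, `EMB_DIM` and `VOCAB_SIZE`, as well as the toy sizes and token ids, are assumptions made for this example.

```python
import random

random.seed(0)

EMB_DIM = 4       # toy embedding size (assumption)
VOCAB_SIZE = 10   # toy vocabulary size (assumption)
MAX_LENGTH = 100  # number of positions the positional table can index


def make_table(n_rows, n_cols):
    """Build a random lookup table standing in for a learned embedding matrix."""
    return [[random.uniform(-1, 1) for _ in range(n_cols)] for _ in range(n_rows)]


tok_table = make_table(VOCAB_SIZE, EMB_DIM)
pos_table = make_table(MAX_LENGTH, EMB_DIM)


def embed(token_ids):
    """Look up token and position embeddings and sum them elementwise."""
    embedded = []
    for position, token_id in enumerate(token_ids):  # positions start at 0
        tok_vec = tok_table[token_id]
        pos_vec = pos_table[position]
        embedded.append([t + p for t, p in zip(tok_vec, pos_vec)])
    return embedded


# The same token (id 7) appears at positions 1 and 2.
sentence = [2, 7, 7, 3]
vectors = embed(sentence)
print(len(vectors), len(vectors[0]))  # 4 4
print(vectors[1] != vectors[2])       # True: the position changes the vector
```

Without the positional table, the two occurrences of token 7 would be indistinguishable to the convolutional layers.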

The next step is to pass this hidden vector into $N$ convolutional blocks. This is where the "magic" happens in this model and we will detail the contents of the convolutional blocks shortly. After passing through the convolutional blocks, the vector is then fed through another linear layer to transform it back from the hidden dimension size into the embedding dimension size. This is our conved vector - and we have one of these per token in the input sequence.

Finally, the conved vector is elementwise summed with the embedding vector via a residual connection to get a combined vector for each token. Again, there is a combined vector for each token in the input sequence.


### Convolutional Blocks
    
So, how do these convolutional blocks work? The image below shows 2 convolutional blocks with a single filter (blue) that is sliding across the tokens within the sequence. In the actual implementation we will have 10 convolutional blocks with 1024 filters in each block.

<img src="images/convseq2seq2.png" >
    

First, the input sentence is padded. This is because the convolutional layers will reduce the length of the input sentence and we want the length of the sentence coming into the convolutional blocks to equal the length of it coming out of the convolutional blocks. Without padding, the length of the sequence coming out of a convolutional layer will be filter_size - 1 shorter than the sequence entering the convolutional layer. For example, if we had a filter size of 3, the sequence will be 2 elements shorter. Thus, we pad the sentence with one padding element on each side. We can calculate the amount of padding on each side by simply doing (filter_size - 1)/2 for odd sized filters - we will not cover even sized filters in this tutorial.
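The length arithmetic above can be checked with a few lines of Python. The function names `conv_output_length` and `same_padding` are hypothetical helpers written for this sketch, and a stride of 1 is assumed throughout.

```python
def conv_output_length(seq_len, filter_size, padding_per_side=0):
    """Length of a sequence after a stride-1 convolution."""
    return seq_len + 2 * padding_per_side - filter_size + 1


def same_padding(filter_size):
    """Padding per side that keeps the length unchanged (odd filters only)."""
    assert filter_size % 2 == 1, "filter size must be odd"
    return (filter_size - 1) // 2


seq_len = 6
for filter_size in (3, 5, 7):
    pad = same_padding(filter_size)
    unpadded = conv_output_length(seq_len, filter_size)
    padded = conv_output_length(seq_len, filter_size, pad)
    print(filter_size, pad, unpadded, padded)
# filter 3 -> pad 1, unpadded length 4, padded length 6
# filter 5 -> pad 2, unpadded length 2, padded length 6
# filter 7 -> pad 3, unpadded length 0, padded length 6
```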

These filters are designed so that their output hidden dimension is twice the input hidden dimension. In computer vision terminology these hidden dimensions are called channels - but we will stick to calling them hidden dimensions. Why do we double the size of the hidden dimension leaving the convolutional filter? This is because we are using a special activation function called the gated linear unit (GLU). GLUs have a gating mechanism (similar to LSTMs and GRUs) contained within the activation function, and they actually halve the size of the hidden dimension - whereas activation functions usually keep the hidden dimension the same size.
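As a rough illustration of why the GLU halves the hidden dimension, the sketch below implements the gating on a flat Python list: the first half of the input holds the values and the second half holds the gates. The helper names and the toy numbers are assumptions; in the model itself this is done by PyTorch's `F.glu` along a chosen dimension.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def glu(vector):
    """Gated linear unit on a flat list with an even number of elements.

    The first half holds the values, the second half holds the gates.
    """
    assert len(vector) % 2 == 0, "GLU needs an even-sized input"
    half = len(vector) // 2
    values, gates = vector[:half], vector[half:]
    # Each gate is squashed into (0, 1) and scales the matching value.
    return [v * sigmoid(g) for v, g in zip(values, gates)]


hid_dim = 3
conv_output = [1.0, -2.0, 0.5,     # values (hid_dim entries)
               0.0, 10.0, -10.0]   # gates  (hid_dim entries)

activated = glu(conv_output)
print(len(conv_output), "->", len(activated))  # 6 -> 3
print([round(x, 3) for x in activated])        # [0.5, -2.0, 0.0]
```

A gate near 10 lets its value through almost unchanged, a gate near -10 suppresses it, and a gate of 0 passes half of it.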

After passing through the GLU activation, the hidden dimension size for each token is the same as it was when it entered the convolutional block. It is now elementwise summed with the vector it had before being passed through the convolutional layer, which forms a residual connection.

This concludes a single convolutional block. Subsequent blocks take the output of the previous block and perform the same steps. Each block has its own parameters; they are not shared between blocks. The output of the last block goes back to the main encoder - where it is fed through a linear layer to get the conved output and then elementwise summed with the embedding of the token to get the combined output.

### Encoder Implementation

To keep the implementation simple, we only allow for odd sized kernels. This allows padding to be added equally to both sides of the source sequence.

The scale variable is used by the authors to "ensure that the variance throughout the network does not change dramatically". The performance of the model seems to vary wildly using different seeds if this is not used.
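One way to read the quoted rationale: if two terms with similar variance are summed in a residual connection, the variance of the sum roughly doubles, and multiplying by the square root of 0.5 brings it back down. The sketch below checks this numerically under an idealised assumption (two independent Gaussian streams with unit variance), which real activations only approximate.

```python
import math
import random
import statistics

random.seed(1234)
N = 20_000
SCALE = math.sqrt(0.5)

# Two independent streams with variance close to 1 (an idealised assumption).
x = [random.gauss(0.0, 1.0) for _ in range(N)]
y = [random.gauss(0.0, 1.0) for _ in range(N)]

unscaled = [a + b for a, b in zip(x, y)]           # plain residual sum
scaled = [(a + b) * SCALE for a, b in zip(x, y)]   # residual sum times sqrt(0.5)

print(round(statistics.pvariance(x), 2))         # close to 1
print(round(statistics.pvariance(unscaled), 2))  # close to 2
print(round(statistics.pvariance(scaled), 2))    # close to 1
```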

The positional embedding is initialized to have a "vocabulary" of 100. This means it can handle sequences up to 100 elements long, indexed from 0 to 99. This can be increased if used on a dataset with longer sequences.

In [15]:
class Encoder(nn.Module):
    def __init__(self, 
                input_dim,
                emb_dim,
                hid_dim, 
                n_layers, 
                kernel_size,
                dropout,
                device,
                max_length = 100):
        super(Encoder, self).__init__()
        
        assert kernel_size % 2 == 1, "Kernel size must be odd!"
        
        self.device = device 
        
        self.scale = torch.sqrt(torch.FloatTensor([0.5])).to(device)
        
        self.tok_embedding = nn.Embedding(input_dim, emb_dim)
        self.pos_embedding = nn.Embedding(max_length, emb_dim)
        
        self.emb2hid = nn.Linear(emb_dim, hid_dim)
        self.hid2emb = nn.Linear(hid_dim, emb_dim)
        
        self.convs = nn.ModuleList([nn.Conv1d(in_channels = hid_dim, 
                                             out_channels = 2 * hid_dim,
                                             kernel_size = kernel_size,
                                             padding = (kernel_size - 1) // 2)
                                   for _ in range(n_layers)]) 
        
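        # one Conv1d per convolutional block; nn.ModuleList registers each layer's
        # parameters with the module, and `_` is just an unused loop variable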
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        # src = [batch size, src len]
        
        batch_size = src.shape[0]
        src_len = src.shape[1]
        
        # positional tensor
        pos = torch.arange(0, src_len).unsqueeze(0).repeat(batch_size, 1).to(self.device)
        
        # pos = [0, 1, 2, 3, .... , src_len - 1]
        # pos = [batch size, src len]
        
        #embed tokens and positions
        tok_embedded = self.tok_embedding(src)
        pos_embedded = self.pos_embedding(pos)
        
        # tok_embedded = pos_embedded = [batch size, src len, emb dim]
        
        # combine embeddings by elementwise summing
        embedded = self.dropout(tok_embedded + pos_embedded)
        
        # embedded = [batch size, src len, emb dim]
        
        # pass embedded through linear layer to convert from emb dim to hid dim
        conv_input = self.emb2hid(embedded)
        
        # conv_input = [batch_size, src len, hid dim]
        
        # permute for conv layer
        conv_input = conv_input.permute(0, 2, 1)
        
        # conv_input = [batch size, hid dim, src len]
        
        # begin convolutional blocks ...
        
        for i, conv in enumerate(self.convs):
            # pass through conv layer
            conved = conv(self.dropout(conv_input))
            
            # conved = [batch size, 2 * hid dim, src len]
            
            # pass through GLU activation
            conved = F.glu(conved, dim=1)
            
            # conved = [batch_size, hid dim, src len]
            
            # apply residual connection
            conved = (conved + conv_input) * self.scale
            
            # conved = [batch size, hid dim, src len]
            
            # set conv_input to conved for next loop iteration
            conv_input = conved
        
        # ... end convolutional blocks
        
        # permute and convert back to emb dim
        conved = self.hid2emb(conved.permute(0,2,1))
        
        # conved = [batch size, src len, emb dim]
        
        # elementwise sum output (conved) and input(embedded) to be used for attention
        combined = (conved + embedded) * self.scale
        
        return conved, combined

### Decoder

The decoder takes in the actual target sentence and tries to predict it. This model differs from the recurrent neural network models previously detailed in these tutorials as it predicts all tokens within the target sentence in parallel. There is no sequential processing, i.e. no decoding loop. This will be detailed further later on in the tutorials.

The decoder is similar to the encoder, with a few changes to both the main model and the convolutional blocks inside the model.

<img src="images/decoder.png" >

First, the embeddings do not have a residual connection that connects after the convolutional blocks and the transformation. Instead the embeddings are fed into the convolutional blocks to be used as residual connections there.

Second, to feed the decoder information from the encoder, the encoder conved and combined outputs are used - again, within the convolutional blocks.

Finally, the output of the decoder is a linear layer from embedding dimension to output dimension. This is used to make a prediction about what the next word in the translation should be.

### Decoder Convolutional Blocks

Again, these are similar to the convolutional blocks within the encoder, with a few changes.

<img src="images/decoder_blocks.png" >

First, the padding. Instead of padding equally on each side to keep the length of the sentence the same throughout, we only pad at the beginning of the sentence. As we are processing all of the targets simultaneously in parallel, and not sequentially, we need a way to ensure that the filters translating token $i$ can only look at token $i$ and the tokens before it. If they were allowed to look at token $i+1$ (the token they should be outputting), the model would simply learn to output the next word in the sequence by directly copying it, without actually learning how to translate.

Let's see what happens if we incorrectly padded equally on each side, like we do in the encoder.

<img src="images/decoder_bad_pad.png" >

The filter at the first position, which is trying to use the first word in the sequence, `<sos>`, to predict the second word, two, can now directly see the word two. This is the same for every position: the word the model is trying to predict is the second element covered by the filter. Thus, the filters can learn to simply copy the second word at each position, allowing for perfect translation without actually learning how to translate.
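The leak can be made explicit by listing which input positions each filter window covers under the two padding schemes. This is a plain-Python sketch, assuming a kernel size of 3 and a left padding of `kernel_size - 1` for the decoder (the amount that keeps the length unchanged when padding on one side only). The helper `visible_positions` and the toy target sentence are illustrative.

```python
def visible_positions(out_index, kernel_size, left_pad, seq_len):
    """Return the input positions covered by the filter producing `out_index`.

    Positions refer to the unpadded sequence; padding slots are dropped.
    """
    start = out_index - left_pad
    window = range(start, start + kernel_size)
    return [p for p in window if 0 <= p < seq_len]


tokens = ["<sos>", "two", "people", "fencing", "."]  # toy target sentence
kernel_size = 3

symmetric = (kernel_size - 1) // 2  # encoder-style: same padding on each side
causal = kernel_size - 1            # decoder-style: all padding at the beginning

for name, left_pad in (("symmetric", symmetric), ("causal", causal)):
    for i in range(len(tokens)):
        positions = visible_positions(i, kernel_size, left_pad, len(tokens))
        seen = [tokens[p] for p in positions]
        # A leak means the window covers a token after position i.
        leaks = any(p > i for p in positions)
        print(name, i, seen, "LEAK" if leaks else "ok")
```

With symmetric padding, the window at position 0 already covers the token at position 1, which is exactly the token it is supposed to predict. With padding only at the beginning, no window reaches past its own position.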

Second, after the GLU activation and before the residual connection, the block calculates and applies attention - using the encoded representations and the embedding of the current word. Note: we only show the connections to the rightmost token, but they are actually connected to all tokens - this was done for clarity. Each token input uses its own, and only its own, embedding for its own attention calculation.

The attention is calculated by first using a linear layer to change the hidden dimension to the same size as the embedding dimension. Then the embedding is summed in via a residual connection. This combination then has the standard attention calculation applied: we find how much it "matches" the encoded conved, and then apply this by taking a weighted sum over the encoded combined. The result is projected back up to the hidden dimension size, and a residual connection to the initial input of the attention layer is applied.

Why do they calculate attention first with the encoded conved and then use it to calculate the weighted sum over the encoded combined? The paper argues that the encoded conved is good for getting a larger context over the encoded sequence, whereas the encoded combined has more information about the specific token and is therefore more useful for making a prediction.
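The core of this calculation (match against the encoded conved, then take a weighted sum over the encoded combined) can be sketched in plain Python for a single sentence. This sketch deliberately leaves out the linear projections, the residual connections and the batch dimension, and all numbers and helper names are toy assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    # Subtract the maximum for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_combined, encoder_conved, encoder_combined):
    """Dot-product attention for one sentence (no batch dimension).

    decoder_combined:                 [trg len][emb dim]
    encoder_conved, encoder_combined: [src len][emb dim]
    """
    emb_dim = len(encoder_combined[0])
    attention, attended = [], []
    for query in decoder_combined:
        # Energy: how well this target position matches each encoder conved vector.
        energy = [dot(query, key) for key in encoder_conved]
        weights = softmax(energy)
        # Weighted sum over the encoder combined vectors.
        context = [sum(w * vec[d] for w, vec in zip(weights, encoder_combined))
                   for d in range(emb_dim)]
        attention.append(weights)
        attended.append(context)
    return attention, attended


# Toy values: 2 target positions, 3 source positions, emb dim 2.
decoder_combined = [[1.0, 0.0], [0.0, 1.0]]
encoder_conved = [[5.0, 0.0], [0.0, 5.0], [0.0, 0.0]]
encoder_combined = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]

attention, attended = attend(decoder_combined, encoder_conved, encoder_combined)
for row in attention:
    print([round(w, 3) for w in row], round(sum(row), 6))  # each row sums to 1
print([[round(x, 3) for x in vec] for vec in attended])
```

In this toy setup the first target position attends mostly to the first source position and the second target position to the second, so each context vector lies close to the corresponding encoder combined vector.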

### Decoder Implementation

As we only pad on one side, the decoder is allowed to use both odd and even sized kernels. Again, the scale is used to reduce variance throughout the model and the position embedding is initialized to have a "vocabulary" of 100.

This model takes in the encoder representations in its forward method and both are passed to the calculate_attention method which calculates and applies attention. It also returns the actual attention values, but we are not currently using them.

In [18]:
class Decoder(nn.Module):
    def __init__(self,
                output_dim, 
                emb_dim,
                hid_dim,
                n_layers, 
                kernel_size,
                dropout,
                trg_pad_idx,
                device, 
                max_length = 100):
        super(Decoder, self).__init__()
        
        self.kernel_size = kernel_size
        self.trg_pad_idx = trg_pad_idx
        self.device = device
        
        self.scale = torch.sqrt(torch.FloatTensor([0.5])).to(device)
        
        self.tok_embedding = nn.Embedding(output_dim, emb_dim)
        self.pos_embedding = nn.Embedding(max_length, emb_dim)
        
        self.emb2hid = nn.Linear(emb_dim, hid_dim)
        self.hid2emb = nn.Linear(hid_dim, emb_dim)
        
        self.attn_hid2emb = nn.Linear(hid_dim, emb_dim)
        self.attn_emb2hid = nn.Linear(emb_dim, hid_dim)
        
        self.fc_out = nn.Linear(emb_dim, output_dim)
        
        self.convs = nn.ModuleList([nn.Conv1d(in_channels = hid_dim, 
                                              out_channels = 2 * hid_dim,
                                              kernel_size = kernel_size)
                                    for _ in range(n_layers)])
        
        self.dropout = nn.Dropout(dropout)
        
        
    def calculate_attention(self, embedded, conved, encoder_conved, encoder_combined):
        
        #embedded = [batch size, trg len, emb dim]
        #conved = [batch size, hid dim, trg len]
        #encoder_conved = encoder_combined = [batch size, src len, emb dim]
        
        #permute and convert back to emb dim
        conved_emb = self.attn_hid2emb(conved.permute(0,2,1))
        
        #conved_emb = [batch size, trg len, emb dim]
        
        combined = (conved_emb + embedded) * self.scale
        
        #combined = [batch size, trg len, emb dim]
        
        energy = torch.matmul(combined, encoder_conved.permute(0,2,1))
        
        #energy = [batch size, trg len, src len]
        
        attention = F.softmax(energy, dim=2)
        
        #attention = [batch size, trg len, src len]
        
        attention_encoding = torch.matmul(attention, encoder_combined)
        
        #attention_encoding = [batch size, trg len, emb dim]
        