<a href="https://colab.research.google.com/github/akashe/NLP/blob/main/Convolutional_Seq2Seq_Networks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, we walk through the steps of a convolutional sequence-to-sequence (Conv Seq2Seq) network in detail. The explanations are mine, but the code is not; it is copied from other places.

paper: https://arxiv.org/pdf/1612.08083.pdf


Architecture:
![IMg](https://raw.githubusercontent.com/bentrevett/pytorch-seq2seq/9479fcb532214ad26fd4bda9fcf081a05e1aaf4e/assets/convseq2seq0.png)


In a convolutional seq2seq model, the most important difference from an RNN-based sequence-to-sequence model is that the whole input is processed in one go. There is no unrolling along the sequence dimension.

Important concepts:

1. Positional Embeddings: Since there is no unrolling, the model has no built-in notion of token order, so we provide sequence information in the form of positional embeddings. We add these positional embeddings elementwise to the token embeddings.
2. Residual connections:
  * help pass the input information on to each part of the architecture, so each layer can look at the previous layers' outputs/features in conjunction with the original input.
  * help gradient flow. Each layer of the architecture receives gradients of sufficient magnitude. In networks without residual connections, the gradients tend to shrink with each layer, which makes training difficult.
3. GLU activation: Conv Seq2Seq uses the GLU (gated linear unit) activation. Given a vector $v$, it splits the vector into two halves, $(a,b)$, along a given dimension. It then computes:
  * $a\otimes\sigma(b)$.
  * This can be interpreted as using a sigmoid gate on $b$ (which outputs values from 0 to 1) to decide how much of each corresponding feature in $a$ passes through. It is the same kind of gating operation used by the forget gate of an LSTM.
4. Kernels: **The name is convolution so it will have kernels!** The inputs to the convolutional layers are of the form [batch_size, hid_dim, src_len]. Here, 'hid_dim' acts as the input channels, and we convolve along the 'src_len' dimension. Since we are convolving along one direction, we use Conv1d.
If the sentence is "I am the king of the world", the next table shows how we convolve with kernel size 3:

|sentence|I|am|the|king|of|the|world|
| :------: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Kernel position 1| I | am | the| | | | |
| Kernel position 2| | am |the| king| | | |
| Kernel position 3| | | the | king| of| | |
| Kernel position 4| | | | king | of | the| |
| Kernel position 5| | | | | of | the| world|
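
The kernel positions in the table can be checked with a quick shape test. This is a minimal sketch with toy sizes chosen only for illustration (they are not the notebook's hyperparameters): an unpadded `nn.Conv1d` with kernel size 3 over a length-7 input should yield 5 output positions.

```python
import torch
import torch.nn as nn

# Toy sizes for illustration only (not the notebook's hyperparameters)
batch_size, hid_dim, src_len, kernel_size = 1, 4, 7, 3

# Input in the [batch_size, hid_dim, src_len] layout described above
x = torch.randn(batch_size, hid_dim, src_len)

# No padding: the kernel fits at src_len - kernel_size + 1 positions
conv = nn.Conv1d(in_channels=hid_dim, out_channels=hid_dim, kernel_size=kernel_size)
out = conv(x)

print(out.shape)  # torch.Size([1, 4, 5])
```

The last dimension shrinks from 7 to 5, one output per kernel position in the table.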


5. Padding: In the above example, the source sentence "I am the king of the world" had length 7, but there were only 5 kernel positions, so the output length would be 5. With multiple convolution layers, a length that shrinks at every layer is awkward to handle in code. So, we use padding to keep the input and output lengths the same.
  * $\text{total padding} = \text{kernel size} - 1$
  * In the encoder, the padding is split evenly between both sides of the input, i.e. $(\text{kernel size} - 1)/2$ on each side (which is why the encoder kernel size must be odd). For kernel size 3, our source sentence becomes: **\<pad\> I am the king of the world \<pad\>**
  * In the decoder, as we will see, we add all of the padding at the beginning of the sentence. *We do so because we want to prevent the decoder from looking at future tokens while keeping the output length the same as the input length*. For example, for kernel size 3, padding in the decoder looks like this:

|sentence|\<pad\>|\<pad\>|I|am|the|king|of|the|world|
| :------: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |:---: |:---: |
| Kernel for pos 1 |\<pad\>|\<pad\>|I|
| Kernel for pos 2 | |\<pad\>|I|am|
| Kernel for pos 3 | | |I|am|the|
| Kernel for pos 4 | | | |am|the|king|
| Kernel for pos 5 | | | | |the|king|of|
| Kernel for pos 6 | | | | | | king|of|the|
| Kernel for pos 7 | | | | | | |of|the|world|
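
The left-only padding in the table can be sketched as follows. This is an illustrative check, not the notebook's decoder: it pads with zeros via `F.pad` (the decoder code later fills its padding tensor with the pad index instead), and the helper name `causal_conv` and the toy sizes are assumptions. It verifies that the output length matches the input length and that changing the last token leaves all earlier outputs untouched.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy sizes for illustration only
hid_dim, trg_len, kernel_size = 4, 7, 3
conv = nn.Conv1d(hid_dim, hid_dim, kernel_size)

def causal_conv(x):
    # Hypothetical helper: pad kernel_size - 1 zeros on the left, none on the right
    return conv(F.pad(x, (kernel_size - 1, 0)))

x = torch.randn(1, hid_dim, trg_len)
out = causal_conv(x)

# Change only the last target position and recompute
x_changed = x.clone()
x_changed[:, :, -1] += 1.0
out_changed = causal_conv(x_changed)

same_length = out.shape[-1] == trg_len
earlier_unchanged = torch.allclose(out[:, :, :-1], out_changed[:, :, :-1])
print(same_length, earlier_unchanged)  # True True
```

Each output position only sees itself and the `kernel_size - 1` positions before it, so a change at the final position cannot leak backwards.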


6. Attention: How do we use the information from the encoder in the decoder? In a plain RNN seq2seq model, we used the last hidden vector as an input to the decoder RNN. *Attention in RNN models works as an auxiliary functionality* to extract relevant information from the encoder hidden states. *Since Convolutional Seq2Seq models don't unroll along the sequence dimension, they use attention as the main mechanism to extract information from the encoder.*
  * The encoder outputs two tensors:
    1. $\text{encoder_conved}$: Output of the convolution operations. Note: the encoder stacks multiple convolution layers for richer feature extraction; 'encoder_conved' is the output of the last layer.
    2. $\text{encoder_combined}$: As discussed under residual connections, 'encoder_combined' is the sum of 'encoder_conved' and the input representations (token plus positional embeddings). This way the encoder outputs retain some information about the inputs.
  * In the decoder, after feature extraction with convolutions, we use the freshly extracted features to find which elements of $\text{encoder_conved}$ are most important/aligned to them. We use this importance as the weight for how much information to take from each element of $\text{encoder_combined}$. This mechanism of finding relevant information and selectively using it is called Attention.
  * Notice that, unlike in RNN models, here we calculate the attention for the entire output sequence at once, and we repeat the attention computation in each decoder convolutional layer!
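
The attention step described above can be sketched with random tensors. This is a minimal sketch, not the full decoder logic: `decoder_state` is a hypothetical stand-in for the decoder features after they have been projected to `emb_dim`, and all sizes are toy values.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy sizes for illustration only
batch_size, src_len, trg_len, emb_dim = 2, 7, 5, 8

encoder_conved = torch.randn(batch_size, src_len, emb_dim)
encoder_combined = torch.randn(batch_size, src_len, emb_dim)
# Hypothetical stand-in for the decoder features projected to emb_dim
decoder_state = torch.randn(batch_size, trg_len, emb_dim)

# Alignment score between every target position and every source position
energy = torch.matmul(decoder_state, encoder_conved.permute(0, 2, 1))

# Normalise over the source dimension so each target position's weights sum to 1
attention = F.softmax(energy, dim=2)

# Weighted sum of encoder_combined, one vector per target position
attended = torch.matmul(attention, encoder_combined)

print(attention.shape, attended.shape)  # torch.Size([2, 5, 7]) torch.Size([2, 5, 8])
```

All target positions are handled by a single pair of batched matrix multiplications, which is what makes the "entire output sequence at once" computation possible.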




Importing libs:

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

from torchtext.datasets import Multi30k
from torchtext.data import Field, BucketIterator

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

import spacy
import numpy as np

import random
import math
import time

Setting Seed:

In [2]:
SEED = 7777

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

Downloading spacy models:

In [3]:
!python -m spacy download en
!python -m spacy download de

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')
✔ Download and installation successful
You can now load the model via spacy.load('de_core_news_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/de_core_news_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/de
You can now load the model via spacy.load('de')


Loading spacy models:

In [4]:
spacy_de = spacy.load('de')
spacy_en = spacy.load('en')

Defining tokenizer functions:

In [5]:
def tokenize_de(text):
    """
    Tokenizes German text from a string into a list of strings
    """
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    """
    Tokenizes English text from a string into a list of strings
    """
    return [tok.text for tok in spacy_en.tokenizer(text)]


Define Fields for processing:
Note: since we are working with convolutions, we set batch_first = True so that batches have shape [batch_size, src_len]; after embedding, the tensors become [batch_size, src_len, emb_dim].

In [6]:
SRC = Field(tokenize = tokenize_de, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True, 
            batch_first = True)

TRG = Field(tokenize = tokenize_en, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True, 
            batch_first = True)

Downloading the Multi30k German-to-English translation dataset and setting up the train, valid and test datasets.

In [7]:
train_data, valid_data, test_data = Multi30k.splits(exts=('.de', '.en'), 
                                                    fields=(SRC, TRG))

Building vocabs

In [8]:
SRC.build_vocab(train_data, min_freq = 2)
TRG.build_vocab(train_data, min_freq = 2)

Setting device

In [9]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Using BucketIterator to get batches whose sentences have similar lengths, which minimizes padding.

In [10]:
BATCH_SIZE = 128

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), 
     batch_size = BATCH_SIZE,
     device = device)

Encoder:

In [11]:
class Encoder(nn.Module):
    def __init__(self, 
                 input_dim, 
                 emb_dim, 
                 hid_dim, 
                 n_layers, 
                 kernel_size, 
                 dropout, 
                 device,
                 max_length = 100):
        super().__init__()
        
        assert kernel_size % 2 == 1, "Kernel size must be odd!"
        
        self.device = device
        
        # scale factor, we use to scale down sum of 2 tensors
        self.scale = torch.sqrt(torch.FloatTensor([0.5])).to(device)
        
        # Convert input from input_dims to emb dims
        self.tok_embedding = nn.Embedding(input_dim, emb_dim)
        # Convert positional information to emb_dim
        self.pos_embedding = nn.Embedding(max_length, emb_dim)
        
        # Convert tensors to hid_dim to feed them to Convolutional blocks
        self.emb2hid = nn.Linear(emb_dim, hid_dim)
        # Convert tensors back to emb_dim after the convolutional blocks, to be used by the decoder
        self.hid2emb = nn.Linear(hid_dim, emb_dim)
        
        # layers of 1D convolutions which put padding in start and end of sentence
        self.convs = nn.ModuleList([nn.Conv1d(in_channels = hid_dim, 
                                              out_channels = 2 * hid_dim, 
                                              kernel_size = kernel_size, 
                                              padding = (kernel_size - 1) // 2)
                                    for _ in range(n_layers)])
        
        # Dropout to improve performance by regularization
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        
        #src = [batch size, src len]
        
        batch_size = src.shape[0]
        src_len = src.shape[1]
        
        #create position tensor
        pos = torch.arange(0, src_len).unsqueeze(0).repeat(batch_size, 1).to(self.device)
        
        #pos = [0, 1, 2, 3, ..., src len - 1]
        
        #pos = [batch size, src len]
        
        #embed tokens and positions
        tok_embedded = self.tok_embedding(src)
        pos_embedded = self.pos_embedding(pos)
        
        #tok_embedded = pos_embedded = [batch size, src len, emb dim]
        
        #combine embeddings by elementwise summing
        embedded = self.dropout(tok_embedded + pos_embedded)
        
        #embedded = [batch size, src len, emb dim]
        
        #pass embedded through linear layer to convert from emb dim to hid dim
        conv_input = self.emb2hid(embedded)
        
        #conv_input = [batch size, src len, hid dim]
        
        #permute for convolutional layer
        conv_input = conv_input.permute(0, 2, 1) 
        
        #conv_input = [batch size, hid dim, src len]
        
        #begin convolutional blocks...
        
        for i, conv in enumerate(self.convs):
        
            #pass through convolutional layer
            conved = conv(self.dropout(conv_input))

            #conved = [batch size, 2 * hid dim, src len]

            #pass through GLU activation function
            conved = F.glu(conved, dim = 1)

            #conved = [batch size, hid dim, src len]
            
            #apply residual connection
            conved = (conved + conv_input) * self.scale

            #conved = [batch size, hid dim, src len]
            
            #set conv_input to conved for next loop iteration
            conv_input = conved
        
        #...end convolutional blocks
        
        #permute and convert back to emb dim
        conved = self.hid2emb(conved.permute(0, 2, 1))
        
        #conved = [batch size, src len, emb dim]
        
        #elementwise sum output (conved) and input (embedded) to be used for attention
        combined = (conved + embedded) * self.scale
        
        #combined = [batch size, src len, emb dim]
        
        return conved, combined
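
The GLU step used inside each convolutional block above can be checked in isolation. The following is a small sketch with toy sizes (assumptions, not the notebook's hyperparameters) showing that `F.glu` along the channel dimension matches splitting the tensor into halves `a` and `b` and computing `a * sigmoid(b)`, and that it halves the channel count from `2 * hid_dim` back to `hid_dim`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy sizes for illustration only
batch_size, hid_dim, src_len = 2, 4, 7

# Stand-in for a conv output with 2 * hid_dim channels
conved = torch.randn(batch_size, 2 * hid_dim, src_len)

# Manual GLU: split the channels in half, gate the first half with the second
a, b = conved.chunk(2, dim=1)
manual = a * torch.sigmoid(b)

# Built-in GLU along the same dimension
builtin = F.glu(conved, dim=1)

print(builtin.shape, torch.allclose(manual, builtin))  # torch.Size([2, 4, 7]) True
```

This is why each `nn.Conv1d` is created with `out_channels = 2 * hid_dim`: the GLU consumes half of the channels as gates.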

Decoder:

In [12]:
class Decoder(nn.Module):
    def __init__(self, 
                 output_dim, 
                 emb_dim, 
                 hid_dim, 
                 n_layers, 
                 kernel_size, 
                 dropout, 
                 trg_pad_idx, 
                 device,
                 max_length = 100):
        super().__init__()
        
        # setting kernel size
        self.kernel_size = kernel_size
        # setting target padding index
        self.trg_pad_idx = trg_pad_idx
        self.device = device # device info
        
        # scale tensor to scale down sum of two vectors
        self.scale = torch.sqrt(torch.FloatTensor([0.5])).to(device)
        
        # Embedding to convert target sentence from output_dim to emb_dim
        self.tok_embedding = nn.Embedding(output_dim, emb_dim)
        # Embedding to convert position embeddings of the target sentence to emb_dim
        self.pos_embedding = nn.Embedding(max_length, emb_dim)
        
        # Converting tensor to hid_dim to be used by Convolution layers
        self.emb2hid = nn.Linear(emb_dim, hid_dim)
        # Converting tensor back to emb_dim before projecting onto the target vocabulary
        self.hid2emb = nn.Linear(hid_dim, emb_dim)
        
        # convert convolution outputs to emb_dim to work with encoder outputs which are in emb_dim
        self.attn_hid2emb = nn.Linear(hid_dim, emb_dim)
        # convert the outputs of attention mechanism to hid_dim to be used by next convolutional layer
        self.attn_emb2hid = nn.Linear(emb_dim, hid_dim)
        
        # Convert to the target dictionary
        self.fc_out = nn.Linear(emb_dim, output_dim)
        
        # layers of 1D convolutions
        self.convs = nn.ModuleList([nn.Conv1d(in_channels = hid_dim, 
                                              out_channels = 2 * hid_dim, 
                                              kernel_size = kernel_size)
                                    for _ in range(n_layers)])
        
        self.dropout = nn.Dropout(dropout)
      
    def calculate_attention(self, embedded, conved, encoder_conved, encoder_combined):
        '''
        Calculate attention between (conved + embedded) and encoder_conved, then use the
        resulting alignment to extract information from encoder_combined,
        which is encoder_conved + encoder inputs
        '''
        
        #embedded = [batch size, trg len, emb dim]
        #conved = [batch size, hid dim, trg len]
        #encoder_conved = encoder_combined = [batch size, src len, emb dim]
        
        #permute and convert back to emb dim
        conved_emb = self.attn_hid2emb(conved.permute(0, 2, 1))
        
        #conved_emb = [batch size, trg len, emb dim]
        
        combined = (conved_emb + embedded) * self.scale
        
        #combined = [batch size, trg len, emb dim]
        # Find alpha values bw combined and encoder_conved        
        energy = torch.matmul(combined, encoder_conved.permute(0, 2, 1))
        
        #energy = [batch size, trg len, src len]
        
        # perform softmax over the source dimension to get normalized importance weights
        attention = F.softmax(energy, dim=2)
        
        #attention = [batch size, trg len, src len]

        # get the scaled values from encoder_combined    
        attended_encoding = torch.matmul(attention, encoder_combined)
        
        #attended_encoding = [batch size, trg len, emb dim]
        
        #convert from emb dim -> hid dim
        attended_encoding = self.attn_emb2hid(attended_encoding)
        
        #attended_encoding = [batch size, trg len, hid dim]
        
        #apply residual connection
        attended_combined = (conved + attended_encoding.permute(0, 2, 1)) * self.scale
        
        #attended_combined = [batch size, hid dim, trg len]
        
        return attention, attended_combined
        
    def forward(self, trg, encoder_conved, encoder_combined):
        
        #trg = [batch size, trg len]
        #encoder_conved = encoder_combined = [batch size, src len, emb dim]
                
        batch_size = trg.shape[0]
        trg_len = trg.shape[1]
            
        #create position tensor
        pos = torch.arange(0, trg_len).unsqueeze(0).repeat(batch_size, 1).to(self.device)
        
        #pos = [batch size, trg len]
        
        #embed tokens and positions
        tok_embedded = self.tok_embedding(trg)
        pos_embedded = self.pos_embedding(pos)
        
        #tok_embedded = [batch size, trg len, emb dim]
        #pos_embedded = [batch size, trg len, emb dim]
        
        #combine embeddings by elementwise summing
        embedded = self.dropout(tok_embedded + pos_embedded)
        
        #embedded = [batch size, trg len, emb dim]
        
        #pass embedded through linear layer to go through emb dim -> hid dim
        conv_input = self.emb2hid(embedded)
        
        #conv_input = [batch size, trg len, hid dim]
        
        #permute for convolutional layer
        conv_input = conv_input.permute(0, 2, 1) 
        
        #conv_input = [batch size, hid dim, trg len]
        
        batch_size = conv_input.shape[0]
        hid_dim = conv_input.shape[1]
        
        for i, conv in enumerate(self.convs):
        
            #apply dropout
            conv_input = self.dropout(conv_input)
        
            #need to pad so decoder can't "cheat".. 
            padding = torch.zeros(batch_size, 
                                  hid_dim, 
                                  self.kernel_size - 1).fill_(self.trg_pad_idx).to(self.device)
            # adding paddings in the beginning of target sentences    
            padded_conv_input = torch.cat((padding, conv_input), dim = 2)
        
            #padded_conv_input = [batch size, hid dim, trg len + kernel size - 1]
        
            #pass through convolutional layer
            conved = conv(padded_conv_input)

            #conved = [batch size, 2 * hid dim, trg len]
            
            #pass through GLU activation function
            conved = F.glu(conved, dim = 1)

            #conved = [batch size, hid dim, trg len]
            
            #calculate attention
            attention, conved = self.calculate_attention(embedded, 
                                                         conved, 
                                                         encoder_conved, 
                                                         encoder_combined)
            
            #attention = [batch size, trg len, src len]
            
            #apply residual connection
            conved = (conved + conv_input) * self.scale
            
            #conved = [batch size, hid dim, trg len]
            
            #set conv_input to conved for next loop iteration
            conv_input = conved
            
        conved = self.hid2emb(conved.permute(0, 2, 1))
         
        #conved = [batch size, trg len, emb dim]
            
        output = self.fc_out(self.dropout(conved))
        
        #output = [batch size, trg len, output dim]
            
        return output, attention
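
Both the encoder and decoder multiply every residual sum by `self.scale`, which is `sqrt(0.5)`. One common reading of this factor, assuming the two summands are roughly independent with similar variance, is that it keeps the variance of the sum from doubling at every block. The sketch below illustrates that assumption with random unit-variance tensors; it is not a statement about the actual activations in the trained model.

```python
import math
import torch

torch.manual_seed(0)

# Two independent unit-variance tensors standing in for the residual branches
a = torch.randn(100_000)
b = torch.randn(100_000)

scale = math.sqrt(0.5)

summed = a + b            # variance is roughly 2
scaled = (a + b) * scale  # variance is roughly 1 again

print(summed.var().item(), scaled.var().item())
```

Without the scaling, stacking 10 such blocks would let the activation variance grow with depth under the same assumption.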

Seq2Seq architecture:

In [13]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        
    def forward(self, src, trg):
        
        #src = [batch size, src len]
        #trg = [batch size, trg len - 1] (<eos> token sliced off the end)
           
        #calculate z^u (encoder_conved) and (z^u + e) (encoder_combined)
        #encoder_conved is output from final encoder conv. block
        #encoder_combined is encoder_conved plus (elementwise) src embedding plus 
        #  positional embeddings 
        encoder_conved, encoder_combined = self.encoder(src)
            
        #encoder_conved = [batch size, src len, emb dim]
        #encoder_combined = [batch size, src len, emb dim]
        
        #calculate predictions of next words
        #output is a batch of predictions for each word in the trg sentence
        #attention a batch of attention scores across the src sentence for 
        #  each word in the trg sentence
        output, attention = self.decoder(trg, encoder_conved, encoder_combined)
        
        #output = [batch size, trg len - 1, output dim]
        #attention = [batch size, trg len - 1, src len]
        
        return output, attention

Initializing everything:

In [14]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
EMB_DIM = 256
HID_DIM = 512 # each conv. layer has 2 * hid_dim filters
ENC_LAYERS = 10 # number of conv. blocks in encoder
DEC_LAYERS = 10 # number of conv. blocks in decoder
ENC_KERNEL_SIZE = 3 # must be odd!
DEC_KERNEL_SIZE = 3 # can be even or odd
ENC_DROPOUT = 0.25
DEC_DROPOUT = 0.25
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]
    
enc = Encoder(INPUT_DIM, EMB_DIM, HID_DIM, ENC_LAYERS, ENC_KERNEL_SIZE, ENC_DROPOUT, device)
dec = Decoder(OUTPUT_DIM, EMB_DIM, HID_DIM, DEC_LAYERS, DEC_KERNEL_SIZE, DEC_DROPOUT, TRG_PAD_IDX, device)

model = Seq2Seq(enc, dec).to(device)

Counting number of parameters

In [15]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 37,351,685 trainable parameters
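
To see where most of these parameters come from, here is a small sketch that counts the parameters of a single convolution with the same `hid_dim` and kernel size as above, both by hand (weights plus biases) and with the same `numel` approach used in `count_parameters`.

```python
import torch.nn as nn

hid_dim, kernel_size = 512, 3

# One convolutional layer as configured in the encoder and decoder
conv = nn.Conv1d(hid_dim, 2 * hid_dim, kernel_size)

# Weights: in_channels * out_channels * kernel_size, plus one bias per output channel
by_hand = hid_dim * (2 * hid_dim) * kernel_size + 2 * hid_dim
counted = sum(p.numel() for p in conv.parameters() if p.requires_grad)

print(f'{by_hand:,} {counted:,}')  # 1,573,888 1,573,888
```

With 10 encoder and 10 decoder blocks, the convolutions alone account for 20 × 1,573,888 = 31,477,760 of the trainable parameters.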


Setting optimizer and criterion

In [16]:
optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)
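
The effect of `ignore_index` can be illustrated on a toy batch. In this sketch `pad_idx = 1` is a hypothetical value chosen for the example; the loss computed with padded positions ignored should equal the loss computed on the non-padded positions only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

pad_idx = 1  # hypothetical padding index for this example
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)

# Four predictions over a 5-word toy vocabulary; the last two targets are padding
logits = torch.randn(4, 5)
targets = torch.tensor([2, 3, pad_idx, pad_idx])

loss_ignoring_pad = criterion(logits, targets)

# Same loss computed only on the two real tokens, with no ignore_index
loss_real_only = nn.CrossEntropyLoss()(logits[:2], targets[:2])

print(torch.allclose(loss_ignoring_pad, loss_real_only))  # True
```

Padded positions therefore contribute neither to the loss value nor to the gradients.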

Train loop

In [17]:
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        src = batch.src
        trg = batch.trg
        
        optimizer.zero_grad()
        
        output, _ = model(src, trg[:,:-1])
        
        #output = [batch size, trg len - 1, output dim]
        #trg = [batch size, trg len]
        
        output_dim = output.shape[-1]
        
        output = output.contiguous().view(-1, output_dim)
        trg = trg[:,1:].contiguous().view(-1)
        
        #output = [batch size * trg len - 1, output dim]
        #trg = [batch size * trg len - 1]
        
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)
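
The slicing `trg[:,:-1]` and `trg[:,1:]` in the loop above shifts the target by one position, so the model predicts the next token at every step. A toy illustration follows; the token indices are hypothetical (2 for `<sos>`, 3 for `<eos>`, 1 for `<pad>`).

```python
import torch

# Hypothetical indices: 2 = <sos>, 3 = <eos>, 1 = <pad>
trg = torch.tensor([[2, 10, 11, 12, 3],
                    [2, 20, 21, 3, 1]])

decoder_input = trg[:, :-1]   # what the decoder sees
decoder_target = trg[:, 1:]   # what it should predict at each position

print(decoder_input.tolist())   # [[2, 10, 11, 12], [2, 20, 21, 3]]
print(decoder_target.tolist())  # [[10, 11, 12, 3], [20, 21, 3, 1]]
```

For the shorter sentence the final target is `<pad>`, which the criterion skips because of `ignore_index`.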

Evaluation loop (used for both validation and test):

In [18]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            src = batch.src
            trg = batch.trg

            output, _ = model(src, trg[:,:-1])
        
            #output = [batch size, trg len - 1, output dim]
            #trg = [batch size, trg len]

            output_dim = output.shape[-1]
            
            output = output.contiguous().view(-1, output_dim)
            trg = trg[:,1:].contiguous().view(-1)

            #output = [batch size * trg len - 1, output dim]
            #trg = [batch size * trg len - 1]
            
            loss = criterion(output, trg)

            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

Function for epoch timings:

In [19]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

Running the model:

In [20]:
N_EPOCHS = 10
CLIP = 0.1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut5-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 0m 58s
	Train Loss: 4.228 | Train PPL:  68.581
	 Val. Loss: 2.982 |  Val. PPL:  19.720
Epoch: 02 | Time: 1m 2s
	Train Loss: 3.064 | Train PPL:  21.403
	 Val. Loss: 2.368 |  Val. PPL:  10.671
Epoch: 03 | Time: 1m 3s
	Train Loss: 2.638 | Train PPL:  13.987
	 Val. Loss: 2.151 |  Val. PPL:   8.597
Epoch: 04 | Time: 1m 2s
	Train Loss: 2.410 | Train PPL:  11.131
	 Val. Loss: 2.031 |  Val. PPL:   7.619
Epoch: 05 | Time: 1m 2s
	Train Loss: 2.263 | Train PPL:   9.615
	 Val. Loss: 1.943 |  Val. PPL:   6.981
Epoch: 06 | Time: 1m 2s
	Train Loss: 2.168 | Train PPL:   8.742
	 Val. Loss: 1.886 |  Val. PPL:   6.591
Epoch: 07 | Time: 1m 2s
	Train Loss: 2.090 | Train PPL:   8.086
	 Val. Loss: 1.843 |  Val. PPL:   6.316
Epoch: 08 | Time: 1m 2s
	Train Loss: 2.035 | Train PPL:   7.652
	 Val. Loss: 1.830 |  Val. PPL:   6.233
Epoch: 09 | Time: 1m 2s
	Train Loss: 1.994 | Train PPL:   7.345
	 Val. Loss: 1.807 |  Val. PPL:   6.092
Epoch: 10 | Time: 1m 2s
	Train Loss: 1.955 | Train PPL:   7.063