## QANet

### Introduction 

The papers that we've seen so far have relied heavily on recurrent neural networks and attention. However, RNNs are slow to train because of their sequential nature, and they are slow at inference too. QANet, proposed in early 2018, does away with recurrence and relies only on self-attention and convolutions. It derives its major ideas from the paper "Attention Is All You Need".  
Progress on an NLP task is often derived from progress in other areas. For instance, most QA models employ methods that first proved successful in Machine Translation (RNNs with attention, self-attention) or Language Modeling (BERT, ALBERT, etc.). 

>  *We instead exclusively use convolutions and self-attentions as the building blocks of encoders that separately encodes the query and context. Then we learn the interactions between context and question by standard attentions. The resulting representation is encoded again with our recurrency-free encoder before finally decoding to the probability of each position being the start or end of the answer span. We call this architecture QANet*

> *The key motivation behind the design of our model is the following: convolution captures the local structure of the text, while the self-attention learns the global interaction between each pair of words.*

Let's get into the model.  
Note: An exhaustive list of the resources and references that I followed is given at the end.

In [16]:
from google.colab import drive
drive.mount('/content/drive')
proj_root = '/content/drive/MyDrive/library/' # proj_root = '' for non drive files

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [17]:
!cp /content/drive/MyDrive/library/preprocess.py .

In [18]:
BATCH_SIZE=16

In [19]:
import gc, json, os, pickle, re, string, time, typing
from collections import Counter

import numpy as np
import pandas as pd
import spacy
import torch
import torch.nn.functional as F
from sklearn.model_selection import train_test_split
from torch import nn

# blank English pipeline: tokenizer only
nlp = spacy.blank('en')
from preprocess import *

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [20]:
optim_configuration=0
hidden_configuration=1

In [21]:
optimizer_config_list=[]
# each entry: (betas, epsilon, weight_decay, name)
# the default follows the QANet paper: epsilon = 1e-7 (i.e. 10^-7), weight decay = 3e-7
optimizer_config_list.append(((0.8,0.999),1e-7,3e-7, "Default_optimizer")) #0
optimizer_config_list.append(((0.9,0.999),1e-7,3e-7,"beta1_increased_optimizer")) #1
optimizer_config_list.append(((0.9,0.9999),1e-7,3e-7,"bothbeta_increased_optimizer")) #2
optimizer_config_list.append(((0.8,0.999),1e-4,3e-7,"epsilon_increased_optimizer")) #3 #if training is too slow decrease it closer to default
optimizer_config_list.append(((0.8,0.999),1e-8,3e-7,"epsilon_decreased_optimizer")) #4
optimizer_config_list.append(((0.8,0.999),1e-7,3e-6,"weight_decay_increased_optimizer")) #5
optimizer_config_list.append(((0.8,0.999),1e-7,3e-8,"weight_decay_decreased_optimizer")) #6

In [22]:
BETAS, EPSILON, WEIGHT_DECAY, MODEL_NAME = optimizer_config_list[optim_configuration]

In [23]:
hidden_config_list=[]
# configuration 0
hidden_config_list.append((128, "Default_hiddenSize"))
# configuration 1
hidden_config_list.append((256, "Increased_Params"))
# configuration 2
hidden_config_list.append((64, "Decreased_Params"))


In [24]:
MODEL_DIM, MODEL_NAME_HIDDEN = hidden_config_list[hidden_configuration]
MODEL_NAME += "_"+ MODEL_NAME_HIDDEN


## Data preprocessing

In [25]:
# preprocessing can be skipped only if every cached file already exists
required_files = ['qanettrain.pkl', 'qanetvalid.pkl', 'qanetw2id.pickle', 'qanetc2id.pickle', 'qanetwordvocab.pickle']
isPreProcessed = all(os.path.exists(f'{proj_root}{name}') for name in required_files)
print(isPreProcessed)

True


In [26]:
# load dataset json files
if isPreProcessed==False:
  train_data = load_json(f'{proj_root}data/squad_train.json')
  valid_data = load_json(f'{proj_root}data/squad_dev.json')

In [27]:
# parse the json structure to return the data as a list of dictionaries
if isPreProcessed==False:
  train_list = parse_data(train_data)
  valid_list = parse_data(valid_data)

In [28]:
if isPreProcessed==False:
  print('Train list len: ',len(train_list))
  print('Valid list len: ',len(valid_list))

In [29]:
# converting the lists into dataframes
if isPreProcessed==False:
  train_df = pd.DataFrame(train_list)
  valid_df = pd.DataFrame(valid_list)

In [30]:
if isPreProcessed==False:
  # an expression inside an `if` block is not auto-displayed by the notebook
  display(train_df.head())

In [31]:
# get indices of outliers and drop them from the dataframe
if isPreProcessed==False:

  %time drop_ids_train = filter_large_examples(train_df)
  train_df.drop(list(drop_ids_train), inplace=True)

  %time drop_ids_valid = filter_large_examples(valid_df)
  valid_df.drop(list(drop_ids_valid), inplace=True)

In [32]:
# gather text to build vocabularies
if isPreProcessed==False:
  vocab_text = gather_text_for_vocab([train_df, valid_df])
  print("Number of sentences in the dataset: ", len(vocab_text))

In [33]:
# build word and character-level vocabularies
if isPreProcessed==False:
  %time word2idx, idx2word, word_vocab = build_word_vocab(vocab_text)
  print("----------------------------------")
  %time char2idx, char_vocab = build_char_vocab(vocab_text)

In [34]:
# numericalize context and questions for training and validation set
if isPreProcessed==False:
  %time train_df['context_ids'] = train_df.context.apply(context_to_ids, word2idx=word2idx)
  %time valid_df['context_ids'] = valid_df.context.apply(context_to_ids, word2idx=word2idx)
  %time train_df['question_ids'] = train_df.question.apply(question_to_ids, word2idx=word2idx)
  %time valid_df['question_ids'] = valid_df.question.apply(question_to_ids, word2idx=word2idx)

In [35]:
# get indices with tokenization errors and drop those indices 

if isPreProcessed==False:
  train_err = get_error_indices(train_df, idx2word)
  valid_err = get_error_indices(valid_df, idx2word)

  train_df.drop(train_err, inplace=True)
  valid_df.drop(valid_err, inplace=True)

In [36]:
if isPreProcessed==False:
  print(len(train_df), len(valid_df))

In [37]:
# get start and end positions of answers from the context
# this is basically the label for training QA models
if isPreProcessed==False:
  train_label_idx = train_df.apply(index_answer, axis=1, idx2word=idx2word)
  valid_label_idx = valid_df.apply(index_answer, axis=1, idx2word=idx2word)

  train_df['label_idx'] = train_label_idx
  valid_df['label_idx'] = valid_label_idx

### Dump data to pickle files 
This ensures that we can directly access the preprocessed dataframe next time.

In [38]:
if isPreProcessed==False:
  train_df.to_pickle(f'{proj_root}qanettrain.pkl')
  valid_df.to_pickle(f'{proj_root}qanetvalid.pkl')

In [39]:
if isPreProcessed==False:
  with open(f'{proj_root}qanetw2id.pickle','wb') as handle:
      pickle.dump(word2idx, handle)

  with open(f'{proj_root}qanetc2id.pickle','wb') as handle:
      pickle.dump(char2idx, handle)
      
  with open(f'{proj_root}qanetwordvocab.pickle', 'wb') as handle:
      pickle.dump(word_vocab, handle)

### Read data from pickle files

You only need to run the preprocessing once. Some preprocessing functions can take up to 3 minutes, so pickling the preprocessed data saves a lot of time.

In [40]:
import pickle

with open(f"{proj_root}qanetw2id.pickle",'rb') as handle:
    word2idx = pickle.load(handle)
with open(f'{proj_root}qanetc2id.pickle','rb') as handle:
    char2idx = pickle.load(handle)
with open(f'{proj_root}qanetwordvocab.pickle', 'rb') as handle:
    word_vocab = pickle.load(handle)



In [41]:
train_df = pd.read_pickle(f'{proj_root}qanettrain.pkl')
valid_df = pd.read_pickle(f'{proj_root}qanetvalid.pkl')

In [42]:
idx2word = {v:k for k,v in word2idx.items()}

## Creating the dataloader

This class takes care of batching, creating character vectors and returns all the things needed during training.

In [43]:
class SquadDataset:
    '''
    - Creates batches dynamically by padding to the length of largest example
      in a given batch.
    - Calculates character vectors for contexts and questions.
    - Returns tensors for training.
    '''
    def __init__(self, data, batch_size):
        '''
        data: dataframe
        batch_size: int
        '''
        self.batch_size = batch_size
        data = [data[i:i+self.batch_size] for i in range(0, len(data), self.batch_size)]
        self.data = data
        
        
    def __len__(self):
        return len(self.data)
    
    def make_char_vector(self, max_sent_len, sentence, max_word_len=16):
        
        char_vec = torch.zeros(max_sent_len, max_word_len).type(torch.LongTensor)
        
        for i, word in enumerate(nlp(sentence, disable=['parser','tagger','ner'])):
            for j, ch in enumerate(word.text):
                if j == max_word_len:
                    break
                char_vec[i][j] = char2idx.get(ch, 0)
        
        return char_vec     
    
    def get_span(self, text):

        text = nlp(text, disable=['parser','tagger','ner'])
        span = [(w.idx, w.idx+len(w.text)) for w in text]

        return span

    
    def __iter__(self):
        '''
        Creates batches of data and yields them.
        
        Each yield comprises:
        :padded_context: padded tensor of contexts for each batch 
        :padded_question: padded tensor of questions for each batch 
        :char_ctx & char_ques: character-level ids for context and question
        :label: start and end index wrt context_ids
        :context_text,answer_text: used during validation to calculate metrics
        :ids: question_ids for evaluation
        '''
        
        for batch in self.data:
            
            spans = []
            ctx_text = []
            answer_text = []
            
             
            for ctx in batch.context:
                ctx_text.append(ctx)
                spans.append(self.get_span(ctx))
            
            for ans in batch.answer:
                answer_text.append(ans)
                
            max_context_len = max([len(ctx) for ctx in batch.context_ids])
            padded_context = torch.LongTensor(len(batch), max_context_len).fill_(1)
            
            for i, ctx in enumerate(batch.context_ids):
                padded_context[i, :len(ctx)] = torch.LongTensor(ctx)
                
            max_word_ctx = 16
          
            char_ctx = torch.zeros(len(batch), max_context_len, max_word_ctx).type(torch.LongTensor)
            for i, context in enumerate(batch.context):
                char_ctx[i] = self.make_char_vector(max_context_len, context)
            
            max_question_len = max([len(ques) for ques in batch.question_ids])
            padded_question = torch.LongTensor(len(batch), max_question_len).fill_(1)
            
            for i, ques in enumerate(batch.question_ids):
                padded_question[i, :len(ques)] = torch.LongTensor(ques)
                
            max_word_ques = 16
            
            char_ques = torch.zeros(len(batch), max_question_len, max_word_ques).type(torch.LongTensor)
            for i, question in enumerate(batch.question):
                char_ques[i] = self.make_char_vector(max_question_len, question)
            
              
            label = torch.LongTensor(list(batch.label_idx))
            ids = list(batch.id)
            
            yield (padded_context, padded_question, char_ctx, char_ques, label, ctx_text, answer_text, ids)
            
            

In [44]:
# create dataloaders

train_dataset = SquadDataset(train_df, BATCH_SIZE)
valid_dataset = SquadDataset(valid_df, BATCH_SIZE)

In [45]:
# looking at the shapes of various tensors returned by the loader

a = next(iter(train_dataset))
for i in range(len(a)):
    try:
        print(a[i].shape)
    except AttributeError:
        print(len(a[i]))



torch.Size([16, 253])
torch.Size([16, 16])
torch.Size([16, 253, 16])
torch.Size([16, 16, 16])
torch.Size([16, 2])
16
16
16
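
To see the dynamic padding in isolation, here is a minimal sketch in plain Python. The helper name `pad_batch` and the toy ids are illustrative assumptions, not part of the notebook; the pad id of 1 mirrors the `fill_(1)` calls in `SquadDataset.__iter__`.

```python
def pad_batch(sequences, pad_id=1):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


# toy batch of token ids with different lengths (hypothetical values)
batch = [[5, 8, 2], [7, 3], [4, 9, 6, 11]]
padded = pad_batch(batch)

for row in padded:
    print(row)
```

Because the padding length is chosen per batch, the second dimension of the context tensor (253 in the output above) will generally differ from one batch to the next.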


In [46]:
def get_glove_dict():
    '''
    Parses the glove word vectors text file and returns a dictionary with the words as
    keys and their respective pretrained word vectors as values.

    '''
    glove_dict = {}
    with open(f"{proj_root}glove.840B.300d.txt", "r", encoding="utf-8") as f:
        for line in f:
            # split from the right: a few tokens in glove.840B contain spaces,
            # so only the last 300 fields are vector components
            values = line.rstrip().rsplit(' ', 300)
            word = values[0]
            vector = np.asarray(values[1:], dtype="float32")
            glove_dict[word] = vector

    return glove_dict


In [47]:
isMatrixLoaded = os.path.exists(f'{proj_root}qanetglove_vt.npy') 
if isMatrixLoaded == False:
  glove_dict = get_glove_dict()

In [48]:
def create_weights_matrix(glove_dict):
    '''
    Creates a weight matrix of the words that are common in the GloVe vocab and
    the dataset's vocab. Initializes OOV words with a zero vector.
    '''
    weights_matrix = np.zeros((len(word_vocab), 300))
    words_found = 0
    for i, word in enumerate(word_vocab):
        try:
            weights_matrix[i] = glove_dict[word]
            words_found += 1
        except KeyError:
            # OOV word: keep the zero vector
            pass

    return weights_matrix, words_found


In [49]:
if isMatrixLoaded == False:
  weights_matrix, words_found = create_weights_matrix(glove_dict)
  print("Words found in the GloVe vocab: " ,words_found)

In [50]:
# save the weight matrix for future loading.
# This matrix is the nn.Embedding's weight matrix.
if isMatrixLoaded == False:
  np.save(f'{proj_root}qanetglove_vt.npy', weights_matrix)

In [51]:
if isMatrixLoaded == False:
  print(weights_matrix.shape)

## Model

## Depthwise Separable Convolutions

Depthwise separable convolutions serve the same purpose as normal convolutions; the difference is that they are faster because they need fewer multiplications. This is done by breaking the convolution operation into two parts: a depthwise convolution and a pointwise convolution.
> *We use depthwise separable convolutions rather than traditional ones, as we observe that it is memory efficient and has better generalization.*

Let's understand why depthwise separable convolutions are faster than traditional convolutions.
A traditional convolution can be visualized as,

<img src="images/conv2d.PNG" width="700" height="700"/>

Let's count the number of multiplications in a traditional convolution operation. Here $D_{K}$ is the kernel width and height, $M$ is the number of input channels, $D_{O}$ is the width and height of the output feature map, and $N$ is the number of kernels.  
The number of multiplications for a single convolution operation is the number of elements inside the kernel. This is $D_{K} \times D_{K} \times M = D_{K}^{2} \times M$.
To get the output feature map, we slide or convolve this kernel over the input. Given the output dimensions, we perform $D_{O}$ convolutions along the width and along the height of the input image. Therefore, the number of multiplications per kernel is $D_{O}^{2} \times D_{K}^{2} \times M$.   
These calculations are for a single kernel. In convolutional neural networks, we usually use multiple kernels, each of which is expected to extract a unique feature from the input. If we use $N$ such kernels, the number of multiplications becomes 
$N \times D_{O}^{2} \times D_{K}^{2} \times M$.  

### Depthwise convolution

<img src="images/depthconv.PNG" width="800" height="900"/>

In depthwise convolution we perform convolution using kernels of dimension $D_{K} \times D_{K} \times 1$. Therefore the number of multiplications in a single convolution operation is $D_{K}^{2} \times 1$. If the output dimension is $D_{O}$, then the number of multiplications per kernel is $D_{K}^{2} \times D_{O}^{2}$. If there are $M$ input channels, we need $M$ such kernels, one for each input channel, to get all the features. For $M$ kernels, we then get $D_{K}^{2} \times D_{O}^{2} \times M$ multiplications. 

### Pointwise convolution

<img src="images/pointconv.PNG" width="700" height="700"/>

This part takes the output of the depthwise convolution and performs a convolution with $N$ kernels of size $1 \times 1 \times M$, where $N$ is the desired number of output features/channels. Similarly,   
Multiplications per single convolution operation = $1 \times 1 \times M$  
Multiplications per kernel = $D_{O}^{2} \times M$  
For $N$ output features = $N \times D_{O}^{2} \times M$
  
   
Adding up the number of multiplications from both phases, we get

$$ N \cdot D_{O}^{2} \cdot M + D_{K}^{2} \cdot D_{O}^{2} \cdot M = D_{O}^{2} \cdot M \, (N + D_{K}^{2}) $$

Comparing this with traditional convolutions, the ratio of the two costs is

$$ \frac{D_{O}^{2} \cdot M \, (N + D_{K}^{2})}{D_{O}^{2} \cdot M \cdot D_{K}^{2} \cdot N} = \frac{1}{D_{K}^{2}} + \frac{1}{N} $$
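
As a quick sanity check of the counts derived above, the following sketch evaluates both formulas for one layer and compares their ratio with $\frac{1}{D_{K}^{2}} + \frac{1}{N}$. The function names and the layer sizes are hypothetical, chosen only for illustration.

```python
def standard_conv_mults(d_k, d_o, m, n):
    """Multiplications for N kernels of size D_K x D_K x M over a D_O x D_O output."""
    return n * d_o ** 2 * d_k ** 2 * m


def separable_conv_mults(d_k, d_o, m, n):
    """Depthwise (D_K^2 * D_O^2 * M) plus pointwise (N * D_O^2 * M) multiplications."""
    depthwise = d_k ** 2 * d_o ** 2 * m
    pointwise = n * d_o ** 2 * m
    return depthwise + pointwise


# hypothetical layer: 3x3 kernel, 32x32 output, 64 input and 128 output channels
d_k, d_o, m, n = 3, 32, 64, 128
standard = standard_conv_mults(d_k, d_o, m, n)
separable = separable_conv_mults(d_k, d_o, m, n)

print(standard, separable)
# the measured ratio matches the closed form 1/D_K^2 + 1/N
print(separable / standard, 1 / d_k ** 2 + 1 / n)
```

For these sizes the separable version needs roughly 12% of the multiplications of the traditional convolution.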

This shows that depthwise separable convolutions need fewer multiplications than traditional ones.
In code, the depthwise phase of the convolution is done by setting `groups` to `in_channels`. According to the PyTorch documentation, 

> *At groups=`in_channels`, each input channel is convolved with its own set of filters, of size $\left\lfloor\frac{out\_channels}{in\_channels}\right\rfloor$*
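
The effect of `groups` can be checked directly on the weight shapes. This is a small sketch with arbitrary channel counts (the sizes and the helper `count_params` are assumptions for illustration); it compares a depthwise plus pointwise pair against a single traditional `nn.Conv2d`.

```python
import torch
from torch import nn

in_channels, out_channels, kernel_size = 8, 16, 3

# depthwise phase: one k x k filter per input channel (groups == in_channels)
depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                      groups=in_channels, padding=kernel_size // 2, bias=False)
# pointwise phase: 1 x 1 convolution that mixes the channels
pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
# traditional convolution for comparison
standard = nn.Conv2d(in_channels, out_channels, kernel_size,
                     padding=kernel_size // 2, bias=False)


def count_params(*modules):
    """Total number of trainable values in the given modules."""
    return sum(p.numel() for mod in modules for p in mod.parameters())


x = torch.randn(2, in_channels, 10, 10)
out = pointwise(depthwise(x))

print(depthwise.weight.shape)   # [in_channels, 1, k, k]: each filter sees one channel
print(out.shape)                # spatial size is preserved by the padding
print(count_params(depthwise, pointwise), count_params(standard))
```

The separable pair produces an output of the same shape as the traditional layer while holding far fewer weights.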

In [52]:
class DepthwiseSeparableConvolution(nn.Module):
    
    def __init__(self, in_channels, out_channels, kernel_size, dim=1):
        
        super().__init__()
        self.dim = dim
        if dim == 2:
            
            self.depthwise_conv = nn.Conv2d(in_channels=in_channels, out_channels=in_channels,
                                        kernel_size=kernel_size, groups=in_channels, padding=kernel_size//2)
        
            self.pointwise_conv = nn.Conv2d(in_channels, out_channels, kernel_size=1, padding=0)
        
    
        else:
        
            self.depthwise_conv = nn.Conv1d(in_channels=in_channels, out_channels=in_channels,
                                            kernel_size=kernel_size, groups=in_channels, padding=kernel_size//2,
                                            bias=False)

            self.pointwise_conv = nn.Conv1d(in_channels, out_channels, kernel_size=1, padding=0, bias=True)

    
    def forward(self, x):
        # dim == 1: x = [bs, seq_len, emb_dim]
        # dim == 2: x = [bs, channels, height, width]
        if self.dim == 1:
            x = x.transpose(1,2)
            x = self.pointwise_conv(self.depthwise_conv(x))
            x = x.transpose(1,2)
        else:
            x = self.pointwise_conv(self.depthwise_conv(x))
        #print("DepthWiseConv output: ", x.shape)
        return x

## Highway Networks

Highway networks were originally introduced to ease the training of deep neural networks. While researchers had cracked the code for optimizing shallow neural networks, training *deep* networks was still a challenging task owing to problems such as vanishing gradients. Quoting the paper,

>  *We present a novel architecture that enables the optimization of networks with virtually arbitrary depth. This is accomplished through the use of a learned gating mechanism for regulating information flow which is inspired by Long Short Term Memory recurrent neural networks. Due to this gating mechanism, a neural network can have paths along which information can flow across several layers without attenuation. We call such paths information highways, and such networks highway networks.* 

This paper takes the key idea of a learned gating mechanism from LSTMs, which process information internally through a sequence of learned gates. The purpose of this layer is to *learn* to pass relevant information from the input. A highway network is a series of feed-forward or linear layers with a gating mechanism. The gating is implemented with a sigmoid function, which decides how much of the information should be transformed and how much should be passed through as it is.   

A plain feed-forward layer applies a transform $H$, parameterized by ($W_{H}, b_{H}$), such that for input $x$ the output $y$ is  

$$ y = g(W_{H}.x + b_{H})$$
where $g$ is a non-linear activation.  
For highway networks, two additional transforms are defined: $T$, parameterized by ($W_{T},b_{T}$), and $C$, parameterized by ($W_{C}$,$b_{C}$).
Then,    
  
$$ y = T(x) . H(x) + x . C(x) $$ 
> *We refer to T as the transform gate and C as the carry gate, since they express how much of the output is produced by transforming the input and carrying it, respectively. For simplicity, in this paper we set C = 1 − T.*

$$ y = T(x) . H(x) + x . (1 - T(x)) $$  
  
$$ y = T(x) . g(W_{H}.x + b_{H}) + x . (1 - T(x)) $$  
where $T(x) = \sigma(W_{T}.x + b_{T})$ and $g$ is the ReLU activation.  
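
The role of the gate is easiest to see at its two extremes. The following NumPy sketch implements the last equation for a single layer; the function names, the random weights and the row-vector convention (`x @ W` instead of $W.x$) are assumptions made for illustration, not the notebook's implementation.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def highway_layer(x, W_H, b_H, W_T, b_T):
    """y = T(x) * relu(x W_H + b_H) + (1 - T(x)) * x"""
    h = np.maximum(0.0, x @ W_H + b_H)   # transform H with ReLU as g
    t = sigmoid(x @ W_T + b_T)           # transform gate T
    return t * h + (1.0 - t) * x


rng = np.random.default_rng(0)
d = 4
x = rng.normal(size=(3, d))              # 3 toy input vectors
W_H, W_T = rng.normal(size=(d, d)), np.zeros((d, d))

# very negative gate bias: T(x) ~ 0, so the layer carries x through unchanged
carried = highway_layer(x, W_H, np.zeros(d), W_T, np.full(d, -20.0))
# very positive gate bias: T(x) ~ 1, so the layer outputs the transform H(x)
transformed = highway_layer(x, W_H, np.zeros(d), W_T, np.full(d, 20.0))

print(np.allclose(carried, x, atol=1e-6))
print(np.allclose(transformed, np.maximum(0.0, x @ W_H), atol=1e-6))
```

In a trained network the gate sits somewhere between these extremes and is computed per dimension, so each feature can be carried or transformed independently.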

The input to this layer is the concatenation of the word and character embeddings of each word. To implement this, we use `nn.ModuleList` to hold multiple linear layers, both for the gate and for the normal transform. In code, `flow_layers` corresponds to the transform $H$ discussed above and `gate_layers` to $T$. In the forward method we loop through the layers and compute the output according to the highway equation described above.   
  
The output of this layer for the context is $X \in R^{d \times T}$ and for the query is $Q \in R^{d \times J}$, where $d$ is the size of the concatenated word and character embedding, $T$ is the context length and $J$ is the query length.  

The structure discussed so far is a recurring pattern in many NLP systems built before transformers and large pretrained language models became dominant. The idea is that adding highway layers enables the network to make more efficient use of character embeddings. If a particular word is not found in the pretrained word vector vocabulary (an OOV word), it will most likely be initialized with a zero vector. It then makes much more sense to look at the character embedding of that word rather than the word embedding. The soft gating mechanism in highway layers helps the model achieve this. 

In [53]:
class HighwayLayer(nn.Module):
    
    def __init__(self, layer_dim, num_layers=2):
    
        super().__init__()
        self.num_layers = num_layers
        
        self.flow_layers = nn.ModuleList([nn.Linear(layer_dim, layer_dim) for _ in range(num_layers)])
        self.gate_layers = nn.ModuleList([nn.Linear(layer_dim, layer_dim) for _ in range(num_layers)])
    
    def forward(self, x):
        #print("Highway input: ", x.shape)
        for i in range(self.num_layers):
            
            # H(x) = g(W_H.x + b_H) with g = ReLU, as in the highway equation
            flow = F.relu(self.flow_layers[i](x))
            gate = torch.sigmoid(self.gate_layers[i](x))
            
            x = gate * flow + (1 - gate) * x
            
        #print("Highway output: ", x.shape)
        return x

## Embedding Layer

This layer:
* converts word-level tokens into a 300-dim pre-trained glove embedding vector 
* creates trainable character embeddings using 2-D convolutions
* concatenates character and word embeddings and passes them through a highway network  

The details of calculating character embeddings have been discussed in the previous notebook. The only difference here is that instead of a max-pooling layer, `torch.max` is used to get a fixed-size representation of each word.

> *Each character is represented as a trainable vector of dimension p2 = 200, meaning each word can be viewed as the concatenation of the embedding vectors for each of its characters. The length of each word is either truncated or padded to 16. We take maximum value of each row of this matrix to get a fixed-size vector representation of each word.* 
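
The row-wise maximum described in the quote can be illustrated on a tiny matrix. This is a NumPy sketch with made-up numbers (3 features and 4 character positions instead of 200 and 16); the layer below performs the same reduction with `torch.max` over the word-length axis.

```python
import numpy as np

# toy character feature map for one word: [char_emb_dim=3, word_len=4]
# (hypothetical values; columns are character positions, the last one is padding)
word_features = np.array([
    [0.2, 0.9, 0.1, 0.0],
    [0.5, 0.3, 0.7, 0.0],
    [0.0, 0.4, 0.6, 0.0],
])

# the maximum over the character axis gives one value per feature,
# so the result has the same size regardless of the word length
word_vector = word_features.max(axis=1)
print(word_vector)
```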

In [54]:
class EmbeddingLayer(nn.Module):
    
    def __init__(self, char_vocab_dim, char_emb_dim, kernel_size, device):
        
        super().__init__()
        
        self.device = device
        
        self.char_embedding = nn.Embedding(char_vocab_dim, char_emb_dim)
        
        self.word_embedding = self.get_glove_word_embedding()
        
        self.conv2d = DepthwiseSeparableConvolution(char_emb_dim, char_emb_dim, kernel_size,dim=2)
        
        self.highway = HighwayLayer(self.word_emb_dim + char_emb_dim)
    
        
    def get_glove_word_embedding(self):
        
        weights_matrix = np.load(f'{proj_root}qanetglove_vt.npy')
        num_embeddings, embedding_dim = weights_matrix.shape
        self.word_emb_dim = embedding_dim
        embedding = nn.Embedding.from_pretrained(torch.FloatTensor(weights_matrix).to(self.device),freeze=True)

        return embedding
    
    def forward(self, x, x_char):
        # x = [bs, seq_len]
        # x_char = [bs, seq_len, word_len(=16)]
        
        word_emb = self.word_embedding(x)
        # word_emb = [bs, seq_len, word_emb_dim]
        
        # pass the training flag so that dropout is switched off in eval mode
        word_emb = F.dropout(word_emb, p=0.1, training=self.training)
        
        char_emb = self.char_embedding(x_char)
        # char_embed = [bs, seq_len, word_len, char_emb_dim]
        
        char_emb = F.dropout(char_emb.permute(0,3,1,2), p=0.05, training=self.training)
        # [bs, char_emb_dim, seq_len, word_len] == [N, Cin, Hin, Win]
        
        conv_out = F.relu(self.conv2d(char_emb))
        # [bs, char_emb_dim, seq_len, word_len] 
        # the depthwise separable conv does not change the shape of the input
        
        char_emb, _ = torch.max(conv_out, dim=3)
        # [bs, char_emb_dim, seq_len]
        
        char_emb = char_emb.permute(0,2,1)
        # [bs, seq_len, char_emb_dim]
        concat_emb = torch.cat([char_emb, word_emb], dim=2)
        # [bs, seq_len, char_emb_dim + word_emb_dim]
        
        emb = self.highway(concat_emb)
        # [bs, seq_len, char_emb_dim + word_emb_dim]
        
        #print("Embedding output: ", emb.shape)
        return emb

## Multiheaded Self Attention

### Idea of Linear Projections

Consider an online book store like Kindle, which lets you rent, buy and read books on its platform. Such platforms usually have a recommendation system (recsys) in place that enables them to understand their users' tastes and preferences over time. This helps them make personalized recommendations to users and in turn improve their revenue.
For simplicity, let's assume that there are 10,000 books available on the platform and that the system maintains a simple binary vector of size 10,000 for each user. If a user has read a particular book, the position in the vector corresponding to the book's id is 1, and 0 otherwise. A *books-read* vector for a user looks like,
$$ [1,0,0,1,1,0,0,0,0,1,1,...,1] $$ 

  
Now assume a projection matrix of dimension 10,000 X 100. When we multiply any user's books-read vector by this matrix, we get a new low-dimensional vector of size 100. This vector is totally different from the previous one and now represents the user's taste or preferences in books. It basically represents a user profile for the recommendation system. Calculating this user-taste vector for different users enables the application to find users with similar taste and recommend books that they *might* like, simply based on what other "similar" users have read.  
The weights or values of this projection matrix can be thought of as representing certain features or properties that a book might possess. It might capture various genres like science, philosophy, fantasy novels, etc.  
The question that still remains, however, is how we get such a projection matrix in the first place: one that can transform a representation from one vector space to another that is related to the original vector but has an entirely different interpretation.  
This is exactly what deep learning is about. Neural networks work as *universal function approximators* that help in learning such transformations. The weights of such projection matrices are learned via backpropagation. We also need a lot of training data to achieve this.
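
The book store example can be written down in a few lines. In this sketch the projection matrix is random rather than learned, and the users are synthetic; both are assumptions made purely to show the shapes involved and how profiles can be compared.

```python
import numpy as np

rng = np.random.default_rng(0)
num_books, profile_dim = 10_000, 100

# stand-in for a learned projection matrix (random here, learned in practice)
projection = rng.normal(size=(num_books, profile_dim))

# binary books-read vectors for three hypothetical users
users = (rng.random(size=(3, num_books)) < 0.01).astype(np.float64)
users[1] = users[0]             # user 1 has read exactly the same books as user 0

profiles = users @ projection   # [3, 10000] x [10000, 100] -> [3, 100]


def cosine(a, b):
    """Cosine similarity between two profile vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


print(profiles.shape)
print(cosine(profiles[0], profiles[1]), cosine(profiles[0], profiles[2]))
```

Users with identical reading histories end up with identical profiles, while an unrelated user gets a clearly lower similarity score.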

### Self Attention 

Much of what follows is heavily derived from Jay Alammar's famous blog post, The Illustrated Transformer. The intuition and visualizations can be directly converted into code, and that's my main motive here. To understand the details, we'll first look at self attention using vectors at a granular level. We'll then show how these computations are actually made using matrices, which directly correspond to the code. For convenience, we'll explain how self attention works in the transformer model. The input to the self attention layer is an embedding vector.   
The central idea of attention is the same as discussed in the first notebook. Here too we calculate a measure of similarity between two representations, convert the scores into an attention distribution and take a weighted sum of the values. However, there are certain details involved that need to be addressed.  
The following steps are involved in calculating self-attention.
1. The first step is to project the input into 3 different vector spaces: key space, query space and value space. These projections give us a key vector, a query vector and a value vector. The weights of these projection matrices are learnt via backpropagation during training. The projection matrices for key, query and value are $W^{K}$, $W^{Q}$, $W^{V}$ respectively. These projections are exactly what we discussed above. Their values depend a lot on the training procedure and the training data.  
<img src="images/selfattn1.PNG" width="700" height="700"/>

2. The next step is to calculate attention scores. This is the part where we determine how similar two input vectors are, and hence how much attention/focus needs to be paid to one vector while summarizing the other. 

 >*The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.*     
 
 There are different ways to determine this. In this paper, a dot product between the query and the key is used. Consider the phrase "Thinking Machines". For the word "Thinking", we need to calculate a score with each word in the sentence including "Thinking" itself. Therefore, score for first position would be,
$$ q_{1}\ .\ k_{1} $$
The result of this product represents the amount of attention we need to pay to "Thinking" itself while encoding "Thinking".
The score for next position would be,
$$ q_{1}\ .\ k_{2} $$
which captures the importance of "Machines" while encoding "Thinking".  
3. We then divide the scores calculated in the previous step by $\sqrt{d_{k}}$, where $d_{k}$ is the dimension of the key vectors. This scaling keeps the gradients stable during training. Next, the scaled scores are passed through a softmax function to get an attention distribution. This means that for a sentence of length $n$, if $\alpha_{t}$ represents the softmax output (the attention weight) at the $t$-th position, then
$$ \sum_{t=1}^{n} \alpha_{t} = 1$$
4. The last step is to multiply the softmax output with the value vector at respective position and sum these products up. In effect this computes a weighted sum. For a sentence of length n,
$$ \sum_{t=1}^{n} \alpha_{t}\ v_{t}$$
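
The four steps above can be sketched in NumPy for a two-token input such as "Thinking Machines". The embeddings and projection matrices are random stand-ins and the sizes are arbitrary; this is an illustration of the computation under those assumptions, not the layer used later in the notebook.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
n, d_model, d_k = 2, 8, 4                   # 2 tokens, e.g. "Thinking", "Machines"

x = rng.normal(size=(n, d_model))           # toy input embeddings
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

# step 1: project the input into query, key and value spaces
Q, K, V = x @ W_Q, x @ W_K, x @ W_V

# step 2: dot-product scores, scores[i, j] = q_i . k_j
scores = Q @ K.T

# step 3: scale by sqrt(d_k) and normalise each row with softmax
alpha = softmax(scores / np.sqrt(d_k))

# step 4: weighted sum of the value vectors
Z = alpha @ V

print(alpha.sum(axis=-1))   # each row of attention weights sums to 1
print(Z.shape)              # one output vector per input token
```

Row `i` of `alpha` holds the attention that token `i` pays to every token in the sentence, including itself.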

All the steps explained above can be summarized as,   
<img src="images/selfattn2.PNG" width="600" height="500"/>

### Multiheaded attention and Implementation

The above steps are usually performed using matrices instead of vectors. This is also where we'll see how and why multihead attention is implemented.
1. The first step is to calculate the query, key and value matrices by projecting them using trainable weights. In code, these weights correspond to linear layers. $W^{Q}$ corresponds to `fc_q`, $W^{K}$ to `fc_k` and $W^{V}$ to `fc_v`. Projecting these gives us $Q$, $K$ and $V$ as seen in code too. 

<img src="images/selfattn3.PNG" width="600" height="500"/>
Similar representations for value and key are also calculated. The dimensions of the above matrices will be explained below.
2. Calculation of scores can be easily visualized as follows,
<img src="images/selfattn4.PNG" width="800" height="1000"/>  
In code this is achieved by calculating the `energy` of $K$ and $Q$ using `torch.matmul`.
3. The final step is to scale, take softmax of the scores and multiply the matrix by the value matrix.
`scale` is calculated by taking the square root of `head_dim`. After scaling the `energy` tensor or the scores at different positions, we apply softmax to this tensor and multiply it with $V$ using `torch.matmul` once again. 
<img src="images/selfattn5.PNG" width="600" height="500"/>


In the original transformer model, the input embedding size is 512. Before projecting these embeddings, we split them into 8 parts which brings us to multihead attention. This paper uses 8 attention heads.  
 Multiheaded attention expands the model's ability to focus on different positions.
 > *It gives the attention layer multiple "representation subspaces."*
 
These subspaces are nothing but different projection matrices. Instead of having just one projection matrix $W^{Q}$ for query, we'll have 8 projection matrices for query, key and value. Weights for each of these "subspaces" are learnt via backpropagation during training. An analogy for this can be the use of multiple convolutional filters to learn unique features from the image.  
Therefore, the per-head dimension of the key, query and value matrices is now 64 (512/8). In code, splitting the projected tensors across multiple attention heads is done right after getting $K$, $Q$ and $V$. This is done by first calculating `head_dim` and then splitting the tensors using the `view` function.
 <img src="images/selfattn6.PNG" width="600" height="500"/>
The above image shows projection matrices for 2 attention heads. There are 8 such heads. This would give us 8 $Z$ matrices in the end. The output dimension of the self attention layer should be same as the input dimension. Hence, we need to recombine the results of all the attention heads before passing the output to the next layer. To combine them, in code, we simply use `view` to drop the head dimension and further make a projection using `fc_o` to ensure that the input dimension is same as the output dimension.  
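
To make the split and recombine step concrete, here is a small pure-Python sketch for a single sequence. The helpers `split_heads` and `merge_heads` are hypothetical names that mimic what `view` followed by `permute` does in the layer below. Each head receives a contiguous slice of the hidden dimension, and merging the heads restores the original layout.

```python
def split_heads(x, num_heads):
    # x: [len_x][hid_dim] -> [num_heads][len_x][head_dim]
    hid_dim = len(x[0])
    assert hid_dim % num_heads == 0, "hid_dim must be divisible by num_heads"
    head_dim = hid_dim // num_heads
    return [[tok[h * head_dim:(h + 1) * head_dim] for tok in x] for h in range(num_heads)]

def merge_heads(heads):
    # [num_heads][len_x][head_dim] -> [len_x][hid_dim]
    len_x = len(heads[0])
    return [[v for head in heads for v in head[t]] for t in range(len_x)]

x = [[float(t * 8 + j) for j in range(8)] for t in range(3)]  # 3 tokens, hid_dim = 8
heads = split_heads(x, num_heads=4)   # 4 heads, head_dim = 2
restored = merge_heads(heads)
```

Because merging is the exact inverse of splitting, the only thing that mixes information across heads is the final projection (`fc_o` in the layer below).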


In [55]:
class MultiheadAttentionLayer(nn.Module):
    
    def __init__(self, hid_dim, num_heads, device):
        
        super().__init__()
        self.num_heads = num_heads
        self.device = device
        self.hid_dim = hid_dim
        
        self.head_dim = self.hid_dim // self.num_heads
        
        self.fc_q = nn.Linear(hid_dim, hid_dim)
        
        self.fc_k = nn.Linear(hid_dim, hid_dim)
        
        self.fc_v = nn.Linear(hid_dim, hid_dim)
        
        self.fc_o = nn.Linear(hid_dim, hid_dim)
        
        self.scale = torch.sqrt(torch.FloatTensor([self.head_dim])).to(device)
        
        
    def forward(self, x, mask):
        # x = [bs, len_x, hid_dim]
        # mask = [bs, len_x]
        
        batch_size = x.shape[0]
        
        Q = self.fc_q(x)
        K = self.fc_k(x)
        V = self.fc_v(x)
        # Q = K = V = [bs, len_x, hid_dim]
        
        Q = Q.view(batch_size, -1, self.num_heads, self.head_dim).permute(0,2,1,3)
        K = K.view(batch_size, -1, self.num_heads, self.head_dim).permute(0,2,1,3)
        V = V.view(batch_size, -1, self.num_heads, self.head_dim).permute(0,2,1,3)
        # [bs, len_x, num_heads, head_dim ]  => [bs, num_heads, len_x, head_dim]
        
        K = K.permute(0,1,3,2)
        # [bs, num_heads, head_dim, len_x]
        
        energy = torch.matmul(Q, K) / self.scale
        # (bs, num_heads){[len_x, head_dim] * [head_dim, len_x]} => [bs, num_heads, len_x, len_x]
        
        mask = mask.unsqueeze(1).unsqueeze(2)
        # [bs, 1, 1, len_x]
        
        #print("Mask: ", mask)
        #print("Energy: ", energy)
        
        energy = energy.masked_fill(mask == 1, -1e10)
        
        #print("energy after masking: ", energy)
        
        alpha = torch.softmax(energy, dim=-1)
        #  [bs, num_heads, len_x, len_x]
        
        #print("energy after smax: ", alpha)
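        # pass training=self.training so that dropout is disabled after model.eval()
        alpha = F.dropout(alpha, p=0.1, training=self.training)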
        
        a = torch.matmul(alpha, V)
        # [bs, num_heads, len_x, head_dim]
        
        a = a.permute(0,2,1,3)
        # [bs, len_x, num_heads, head_dim]
        
        a = a.contiguous().view(batch_size, -1, self.hid_dim)
        # [bs, len_x, hid_dim]
        
        a = self.fc_o(a)
        # [bs, len_x, hid_dim]
        
        #print("Multihead output: ", a.shape)
        return a
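
The masking step in `forward` relies on the fact that a very large negative score becomes (numerically) zero after softmax. A minimal sketch of that idea, with a hypothetical helper `masked_softmax` and made-up scores, assuming the same convention as the code above where the mask is 1 at padded positions:

```python
import math

def masked_softmax(scores, pad_mask, fill=-1e10):
    # replace scores at padded positions (mask == 1) with a large negative number
    filled = [fill if m == 1 else s for s, m in zip(scores, pad_mask)]
    mx = max(filled)
    exps = [math.exp(s - mx) for s in filled]
    total = sum(exps)
    return [e / total for e in exps]

# the last two positions are padding
weights = masked_softmax([2.0, 1.0, 3.0, 0.5], pad_mask=[0, 0, 1, 1])
```

The padded positions end up with zero attention weight even though one of them had the highest raw score, and the remaining weights still sum to 1.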

In [56]:
from torch.autograd import Variable
import math

## Positional Embedding

The model so far does not have any idea about the positioning of words in a sentence. In previous models, this was taken care of because we were using RNNs or LSTMs at some stage to encode this information. RNNs process input in a sequential order and maintain hidden states for each position in the input sequence. However, here we need to come up with a method to inject the positional information of tokens into the model.  
One simple method of doing this is to assign each token a single number in $[0, 1]$, where the first word gets 0 and the last word gets 1. This solution presents some problems. For different sentence lengths, the tokens are spread over different intervals, so a particular position would not have a consistent meaning across inputs of varying lengths.   
Another method is to use learned position embeddings. This is used in BERT, where the positional embedding is a lookup table of size $[512, 768]$, where 512 is the maximum sequence length that BERT can process. This lookup matrix is randomly initialized and trained along with the model.    
Here however, the authors have used another method of encoding position which is same as that proposed in the original transformers paper. The positional embedding can be defined as,
<img src="images/posemb.PNG" width="500" height="400"/>

where $pos$ is the position, $i$ indexes the embedding dimensions (dimension $2i$ uses the sine and dimension $2i+1$ the cosine), and $d_{model}$ is the model dimension.  
These embeddings are simply added to the word embeddings of the tokens at their respective positions. 
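
As a quick sanity check of the formula, here is a small pure-Python sketch of the sinusoidal encoding from the original transformer paper, $PE_{(pos,\,2i)} = \sin(pos / 10000^{2i/d_{model}})$ and $PE_{(pos,\,2i+1)} = \cos(pos / 10000^{2i/d_{model}})$. The function name `sinusoidal_encoding` and the tiny sizes are chosen only for illustration.

```python
import math

def sinusoidal_encoding(max_length, model_dim):
    # pe[pos][2i] = sin(pos / 10000^(2i/d)), pe[pos][2i+1] = cos(pos / 10000^(2i/d))
    pe = [[0.0] * model_dim for _ in range(max_length)]
    for pos in range(max_length):
        for k in range(0, model_dim, 2):  # k = 2i is the even dimension index
            angle = pos / (10000 ** (k / model_dim))
            pe[pos][k] = math.sin(angle)
            if k + 1 < model_dim:
                pe[pos][k + 1] = math.cos(angle)
    return pe

pe = sinusoidal_encoding(max_length=4, model_dim=6)
```

Position 0 always encodes to alternating 0s and 1s, every value stays within $[-1, 1]$, and the encoding of a given position does not depend on the sentence length, which is exactly the consistency that the normalized $[0, 1]$ scheme lacks.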


In [57]:
class PositionEncoder(nn.Module):
    
    def __init__(self, model_dim, device, max_length=400):
        
        super().__init__()
        
        self.device = device
        
        self.model_dim = model_dim
        
        pos_encoding = torch.zeros(max_length, model_dim)
        
        for pos in range(max_length):
            
            for i in range(0, model_dim, 2):
                
                # i is already the even index (2i in the formula), so the exponent is i/model_dim
                pos_encoding[pos, i] = math.sin(pos / (10000 ** (i/model_dim)))
                pos_encoding[pos, i+1] = math.cos(pos / (10000 ** (i/model_dim)))
            
        
        pos_encoding = pos_encoding.unsqueeze(0).to(device)
        self.register_buffer('pos_encoding', pos_encoding)
        
    
    def forward(self, x):
        #print("PE shape: ", self.pos_encoding.shape)
        #print("PE input: ", x.shape)
        # registered buffers do not track gradients, so no Variable wrapper is needed
        x = x + self.pos_encoding[:, :x.shape[1]]
        #print("PE output: ", x.shape)
        return x

## Encoder Block

This layer brings together all the components discussed so far. 
<img src="images/encoderblock.PNG" width="250" height="50"/>

> *We use the same Encoder Block throughout the model, only varying the number of convolutional layers for each block. We use layer norm and residual connection between every layer in the Encoder Block.*

The following steps are performed by this layer:  

* A positional embedding is injected into the input.
* This is then passed through a series of convolutional layers. The number of these layers depends on which layer the encoder block is part of: 4 for the embedding encoder layer and 2 for the model encoder layer. The convolution layers are defined using `nn.ModuleList`. 
* The output of this is then passed to a multiheaded self attention layer and finally to a feedforward network which is simply a linear layer.
* As can be seen in the figure above, the model also involves residual connections, layer normalization and dropout. These are implemented as well. An easy way to understand the residual connections in code is to draw 2-3 iterations of the lower block (the one that involves convolution) and check that everything matches.
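
The residual pattern used by every sub-layer can be written as `out = x + sublayer(layer_norm(x))`. Below is a minimal pure-Python sketch of that pattern on a single feature vector. The names `layer_norm` and `residual_block` are illustrative, the layer norm has no learned scale or shift, and a ReLU stands in for the real convolution, attention and feed-forward sub-layers.

```python
import math

def layer_norm(x, eps=1e-5):
    # normalise one feature vector to zero mean and unit variance
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(x, sublayer):
    # out = x + sublayer(layer_norm(x))
    y = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, y)]

def relu(vec):
    # stand-in for a convolution / attention / feed-forward sub-layer
    return [max(0.0, v) for v in vec]

x = [1.0, 2.0, 3.0, 4.0]
out = x
for _ in range(2):  # two stacked sub-layers, like the 2 convolutions of a model encoder block
    out = residual_block(out, relu)
```

If a sub-layer outputs zeros, the block reduces to the identity, which is what makes deep stacks of such blocks easier to train.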



In [58]:
class EncoderBlock(nn.Module):
    
    def __init__(self, model_dim, num_heads, num_conv_layers, kernel_size, device):
        
        super().__init__()
        
        self.num_conv_layers = num_conv_layers
        
        self.conv_layers = nn.ModuleList([DepthwiseSeparableConvolution(model_dim, model_dim, kernel_size)
                                          for _ in range(num_conv_layers)])
        
        self.multihead_self_attn = MultiheadAttentionLayer(model_dim, num_heads, device)
        
        self.position_encoder = PositionEncoder(model_dim, device)
        
        self.pos_norm = nn.LayerNorm(model_dim)
        
        self.conv_norm = nn.ModuleList([nn.LayerNorm(model_dim) for _ in range(self.num_conv_layers)])
        
        self.feedfwd_norm = nn.LayerNorm(model_dim)
        
        self.feed_fwd = nn.Linear(model_dim, model_dim)
        
    def forward(self, x, mask):
        # x = [bs, len_x, model_dim]
        # mask = [bs, len_x]
        
        out = self.position_encoder(x)
        # [bs, len_x, model_dim]
        
        res = out
        
        out = self.pos_norm(out)
        # [bs, len_x, model_dim]
        
        for i, conv_layer in enumerate(self.conv_layers):
            
            out = F.relu(conv_layer(out))
            out = out + res
            if (i+1) % 2 == 0:
                out = F.dropout(out, p=0.1, training=self.training)
            res = out
            out = self.conv_norm[i](out)
        
        
        out = self.multihead_self_attn(out, mask)
        # [bs, len_x, model_dim]
        
        out = F.dropout(out + res, p=0.1, training=self.training)
        
        res = out
        
        out = self.feedfwd_norm(out)
        
        out = F.relu(self.feed_fwd(out))
        # [bs, len_x, model_dim]
            
        out = F.dropout(out + res, p=0.1, training=self.training)
        # [bs, len_x, model_dim]
        #print("Encoder block output: ", out.shape)
        return out

## Context-Query Attention Layer

This layer is very similar to the attention flow layer in BiDAF. It calculates attention in two directions. Context-to-query attention tells us which query words are the most relevant to each context word.   
Let $C$ and $Q$ represent the encoded context and query respectively. Given that the context length is $n$ and the query length is $m$, a similarity matrix is calculated first. The similarity matrix captures the similarity between each pair of context and query words. It is denoted by $S$ and is an $n$-by-$m$ matrix. The similarity matrix is calculated as,
$$ S = f\ (Q,\ C)$$
where $f$ is a trilinear similarity function defined as,
$$ f(q,c) = W_{0}\ [q\ ;\ c\ ;\ q \odot c], $$
where $W_{0}$ is a trainable variable, $;$ denotes concatenation and $\odot$ denotes element-wise multiplication.  
Context-to-Query attention can then be calculated as,
$$ A = \overline S \cdot Q^{T}, $$
where $\overline S$ is obtained by normalizing each row of $S$ using softmax. The computations so far are identical to those in BiDAF. You can refer to the previous notebook for a more detailed explanation.  

> *Most high performing models additionally use some form of query-to-context attention, such as BiDaF and DCN. Empirically, we find that, the DCN attention can provide a little benefit over simply applying context-to-query attention, so we adopt this strategy.*

Query-to-Context attention is calculated as,
$$B = \overline S \cdot \overline{\overline S}^{T} \cdot C^{T},$$
where $\overline{\overline S}$ is obtained by normalizing each column of $S$ using softmax.  

The implementation is fairly straightforward and is just about multiplying the said tensors.
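
Before looking at the batched PyTorch version, it may help to trace the shapes on a tiny example. The sketch below is pure Python with rows as word vectors. The helper names and the weight vector `w0` are made up for illustration and are not the trained parameters.

```python
import math

def softmax(v):
    m = max(v)
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def cq_attention(C, Q, w0):
    # S[i][j] = w0 . [q_j ; c_i ; q_j * c_i]
    S = [[sum(w * f for w, f in zip(w0, q + c + [a * b for a, b in zip(q, c)]))
          for q in Q] for c in C]
    S_row = [softmax(row) for row in S]                        # normalise over query words
    S_col = transpose([softmax(col) for col in transpose(S)])  # normalise over context words
    A = matmul(S_row, Q)                                       # context-to-query, [n][d]
    B = matmul(matmul(S_row, transpose(S_col)), C)             # query-to-context, [n][d]
    return S_row, S_col, A, B

C = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # n = 3 context words, d = 2
Q = [[1.0, 0.0], [0.0, 2.0]]              # m = 2 query words
w0 = [0.1, -0.2, 0.3, 0.0, 0.5, 0.5]      # hypothetical weights of length 3d
S_row, S_col, A, B = cq_attention(C, Q, w0)
```

Both `A` and `B` have one row per context word, which is why they can be concatenated with `C` along the feature dimension in the layer below.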

In [59]:
class ContextQueryAttentionLayer(nn.Module):
    
    def __init__(self, model_dim):
        
        super().__init__() 
        
        self.W0 = nn.Linear(3*model_dim, 1, bias=False)
        
    def forward(self, C, Q, c_mask, q_mask):
        # C = [bs, ctx_len, model_dim]
        # Q = [bs, qtn_len, model_dim]
        # c_mask = [bs, ctx_len]
        # q_mask = [bs, qtn_len]
        
        c_mask = c_mask.unsqueeze(2)
        # [bs, ctx_len, 1]
        
        q_mask = q_mask.unsqueeze(1)
        # [bs, 1, qtn_len]
        
        ctx_len = C.shape[1]
        qtn_len = Q.shape[1]
        
        C_ = C.unsqueeze(2).repeat(1,1,qtn_len,1)
        # [bs, ctx_len, qtn_len, model_dim] 
        
        Q_ = Q.unsqueeze(1).repeat(1,ctx_len,1,1)
        # [bs, ctx_len, qtn_len, model_dim]
        
        C_elemwise_Q = torch.mul(C_, Q_)
        # [bs, ctx_len, qtn_len, model_dim]
        
        S = torch.cat([C_, Q_, C_elemwise_Q], dim=3)
        # [bs, ctx_len, qtn_len, model_dim*3]
        
        # squeeze only the last dimension so that a batch of size 1 keeps its batch axis
        S = self.W0(S).squeeze(-1)
        #print("Simi matrix: ", S.shape)
        # [bs, ctx_len, qtn_len, 1] => # [bs, ctx_len, qtn_len]
        
        S_row = S.masked_fill(q_mask==1, -1e10)
        S_row = F.softmax(S_row, dim=2)
        
        S_col = S.masked_fill(c_mask==1, -1e10)
        S_col = F.softmax(S_col, dim=1)
        
        A = torch.bmm(S_row, Q)
        # (bs)[ctx_len, qtn_len] X [qtn_len, model_dim] => [bs, ctx_len, model_dim]
        
        B = torch.bmm(torch.bmm(S_row,S_col.transpose(1,2)), C)
        # [ctx_len, qtn_len] X [qtn_len, ctx_len] => [bs, ctx_len, ctx_len]
        # [ctx_len, ctx_len] X [ctx_len, model_dim ] => [bs, ctx_len, model_dim]
        
        model_out = torch.cat([C, A, torch.mul(C,A), torch.mul(C,B)], dim=2)
        # [bs, ctx_len, model_dim*4]
        
        #print("C2Q output: ", model_out.shape)
        return F.dropout(model_out, p=0.1, training=self.training)
        
        

## Output Layer

The output layer is tasked with predicting the start and end indices of the answer from the context. The inputs to this layer, $M_{1}$, $M_{2}$ and $M_{3}$, are the outputs of the 3 model encoders (explained below), from bottom to top. The start index $p_{1}$ is then calculated as,  

$$ p_{1} = softmax\ (\ W_{1}\ [M_{1}\ ;\ M_{2}])$$
and end as,
$$ p_{2} = softmax\ (\ W_{2}\ [M_{1}\ ;\ M_{3}])$$

where $W_{1}$ and $W_{2}$ are trainable variables.
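
Note that the layer below returns the masked, unnormalized scores; the softmax is applied inside `F.cross_entropy` during training. The training objective is the negative log-probability of the true start index plus that of the true end index. A small pure-Python sketch of that loss for one example, with made-up logits and a hypothetical helper `span_loss`:

```python
import math

def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_z for l in logits]

def span_loss(start_logits, end_logits, y1, y2):
    # negative log-probability of the true start plus that of the true end
    return -(log_softmax(start_logits)[y1] + log_softmax(end_logits)[y2])

start_logits = [0.2, 2.5, -1.0, 0.1]
end_logits = [-0.5, 0.3, 3.0, 0.0]
good = span_loss(start_logits, end_logits, y1=1, y2=2)  # labels agree with the highest logits
bad = span_loss(start_logits, end_logits, y1=2, y2=0)   # labels point at low-scoring positions
```

The loss is small when the true indices coincide with the highest logits and grows as the model puts probability mass elsewhere.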

In [60]:
class OutputLayer(nn.Module):
    
    def __init__(self, model_dim):
        
        super().__init__()
        
        self.W1 = nn.Linear(2*model_dim, 1, bias=False)
        
        self.W2 = nn.Linear(2*model_dim, 1, bias=False)
        
        
    def forward(self, M1, M2, M3, c_mask):
        
        start = torch.cat([M1,M2], dim=2)
        
        start = self.W1(start).squeeze(-1)
        
        p1 = start.masked_fill(c_mask==1, -1e10)
        
        #p1 = F.log_softmax(start.masked_fill(c_mask==1, -1e10), dim=1)
        
        end = torch.cat([M1, M3], dim=2)
        
        end = self.W2(end).squeeze(-1)
        
        p2 = end.masked_fill(c_mask==1, -1e10)
        
        #p2 = F.log_softmax(end.masked_fill(c_mask==1, -1e10), dim=1)
        
        #print("preds: ", [p1.shape,p2.shape])
        return p1, p2
        

## QANet

This module wraps up everything. It brings together all the components that we've seen so far. 
<img src="images/qanet.PNG" width="500" height="600"/>

Going up the flowchart above, the module performs the following steps end-to-end:
* The inputs to the `forward` method are word-level and character-level tokens for both the context and the query. These tokens are passed to the embedding layer.  

 > *The word embedding is fixed during training and initialized from the p1 = 300 dimensional pre-trained GloVe word vectors, which are fixed during training.*  

 > *The character embedding is obtained as follows: Each character is represented as a trainable vector of dimension p2 = 200, meaning each word can be viewed as the concatenation of the embedding vectors for each of its characters.*   

 For each word, the concatenation of these two embeddings is passed on to a 2-layer highway network. The highway network does not change the shape of its input, so the output shape of the `EmbeddingLayer` defined above is `[bs, ctx_len, word_emb_dim + char_emb_dim]` = `[batch_size, ctx_len, 500]`. This is then passed to `embedding_encoder`, the Embedding Encoder Layer. This layer, however, requires the input dimension to be 128, which is the `model_dim`, and not 500. As clearly mentioned in the paper,    
 > *Note that the input of this layer is a vector of dimension p1 + p2 = 500 for each individual word, which is immediately mapped to d = 128 by a one-dimensional convolution. The output of this layer is a also of dimension d = 128.*   
 
 We therefore map the output of embedding to 128 in code using `ctx_resizer` and `qtn_resizer`.  

* The resized tensors are then passed on to the *Embedding Encoding Layer* which is a single encoder block with 4 conv layers. 8 attention heads are used in the self-attention module which is the same for all the encoder blocks in the model.
  
* The output of the previous layer is then passed on to the *Context-Query Attention Layer*. The output dimension of this layer is `4 * model_dim`. This is again resized using `c2q_resizer` to have a dimension of `model_dim`. 
* Next, the encoded representation so far is passed on to the *Model Encoder Layer*. This layer comprises 7 encoder blocks, each with 2 convolutional layers. 
 > *We share weights between each of the 3 repetitions of the model encoder.*
 This can be seen in code while calculating $M_{1}$, $M_{2}$ and $M_{3}$.
* Finally, $M_{1}$, $M_{2}$ and $M_{3}$, produced by the shared-weight model encoder, are passed to the output layer, which predicts the start and end indices of the answer.


In [61]:
class QANet(nn.Module):
    
    def __init__(self, char_vocab_dim, char_emb_dim, word_emb_dim, kernel_size, model_dim, num_heads, device):
        
        super().__init__()
        
        self.embedding = EmbeddingLayer(char_vocab_dim, char_emb_dim, kernel_size, device)
        
        self.ctx_resizer = DepthwiseSeparableConvolution(char_emb_dim+word_emb_dim, model_dim, 5)
        
        self.qtn_resizer = DepthwiseSeparableConvolution(char_emb_dim+word_emb_dim, model_dim, 5)
        
        self.embedding_encoder = EncoderBlock(model_dim, num_heads, 4, 5, device)
        
        self.c2q_attention = ContextQueryAttentionLayer(model_dim)
        
        self.c2q_resizer = DepthwiseSeparableConvolution(model_dim*4, model_dim, 5)
        
        self.model_encoder_layers = nn.ModuleList([EncoderBlock(model_dim, num_heads, 2, 5, device)
                                                   for _ in range(7)])
        
        self.output = OutputLayer(model_dim)
        
        self.device=device
    
    def forward(self, ctx, qtn, ctx_char, qtn_char):
        
        c_mask = torch.eq(ctx, 1).float().to(self.device)
        q_mask = torch.eq(qtn, 1).float().to(self.device)
        
        ctx_emb = self.embedding(ctx, ctx_char)
        # [bs, ctx_len, ch_emb_dim + word_emb_dim]
        
        ctx_emb = self.ctx_resizer(ctx_emb)
        #  [bs, ctx_len, model_dim]
        
        qtn_emb = self.embedding(qtn, qtn_char)
        # [bs, qtn_len, ch_emb_dim + word_emb_dim]
        
        qtn_emb = self.qtn_resizer(qtn_emb)
        # [bs, qtn_len, model_dim]
        
        C = self.embedding_encoder(ctx_emb, c_mask)
        # [bs, ctx_len, model_dim]
        
        Q = self.embedding_encoder(qtn_emb, q_mask)
        # [bs, qtn_len, model_dim]
            
        C2Q = self.c2q_attention(C, Q, c_mask, q_mask)
        # [bs, ctx_len, model_dim*4]
        
        M1 = self.c2q_resizer(C2Q)
        # [bs, ctx_len, model_dim]
    
        for layer in self.model_encoder_layers:
            M1 = layer(M1, c_mask)
        
        M2 = M1
        # [bs, ctx_len, model_dim]  
        
        for layer in self.model_encoder_layers:
            M2 = layer(M2, c_mask)
        
        M3 = M2
        # [bs, ctx_len, model_dim]
        
        for layer in self.model_encoder_layers:
            M3 = layer(M3, c_mask)
            
        p1, p2 = self.output(M1, M2, M3, c_mask)
        
        return p1, p2

In [62]:
#args.hidden_size * 2 == (args.char_channel_size + args.word_dim)

CHAR_VOCAB_DIM = len(char2idx)
CHAR_EMB_DIM = 200
WORD_EMB_DIM = 300
device = torch.device('cuda')
KERNEL_SIZE = 5
NUM_ATTENTION_HEADS = 8


model = QANet(CHAR_VOCAB_DIM,
              CHAR_EMB_DIM, 
              WORD_EMB_DIM,
              KERNEL_SIZE,
              MODEL_DIM,
              NUM_ATTENTION_HEADS,
              device).to(device)

In [63]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 5,481,408 trainable parameters


## Training

> *We use the ADAM optimizer (Kingma & Ba, 2014) with $\beta_1 = 0.8$, $\beta_2 = 0.999$, $\epsilon = 10^{-7}$. We use a learning rate warm-up scheme with an inverse exponential increase from 0.0 to 0.001 in the first 1000 steps, and then maintain a constant learning rate for the remainder of training.*

Note: I have not used learning-rate warm up scheme to keep things simple for initial training. 
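
For reference, here is one way the warm-up could be sketched. I am assuming that "inverse exponential increase" means a logarithmic ramp, which is my reading of the sentence rather than something the paper spells out. The function name `warmup_lr` and its defaults are hypothetical.

```python
import math

def warmup_lr(step, base_lr=0.001, warmup_steps=1000):
    # logarithmic ramp from 0 at step 0 to base_lr at step warmup_steps - 1,
    # then a constant learning rate (assumed interpretation of the paper's schedule)
    if step >= warmup_steps - 1:
        return base_lr
    return base_lr * math.log(step + 1) / math.log(warmup_steps)

schedule = [warmup_lr(s) for s in (0, 10, 100, 998, 999, 5000)]
```

The rate rises quickly in the first few steps and flattens out as it approaches 0.001. In PyTorch, the ratio `warmup_lr(step) / base_lr` could be supplied as the `lr_lambda` of `torch.optim.lr_scheduler.LambdaLR`, with `scheduler.step()` called after every batch.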

In [64]:
import torch.optim as optim
optimizer = optim.Adam(model.parameters(), betas=BETAS, eps=EPSILON, weight_decay=WEIGHT_DECAY)





In [65]:
def train(model, train_dataset):
    print("Starting training ........")
   

    train_loss = 0.
    batch_count = 0

    for batch in train_dataset:

        if batch_count % 500 == 0:
            print(f"Starting batch: {batch_count}")
        batch_count += 1
        
        context, question, char_ctx, char_ques, label, ctx_text, ans, ids = batch
        
        # place data on GPU
        context, question, char_ctx, char_ques, label = context.to(device), question.to(device),\
                                    char_ctx.to(device), char_ques.to(device), label.to(device)
        
        # forward pass, get predictions
        preds = model(context, question, char_ctx, char_ques)

        start_pred, end_pred = preds
        
        # separate labels for start and end position
        start_label, end_label = label[:,0], label[:,1]
        
        # calculate loss
        loss = F.cross_entropy(start_pred, start_label) + F.cross_entropy(end_pred, end_label)
        
        # backward pass
        loss.backward()
        
        # update the parameters using the computed gradients
        optimizer.step()

        # zero the gradients so that they do not accumulate
        optimizer.zero_grad()

        train_loss += loss.item()

    return train_loss/len(train_dataset)

In [66]:
def valid(model, valid_dataset):
    
    print("Starting validation .........")
   
    valid_loss = 0.

    batch_count = 0
    
    f1, em = 0., 0.
    
    predictions = {}
    
    for batch in valid_dataset:

        if batch_count % 500 == 0:
            print(f"Starting batch {batch_count}")
        batch_count += 1

        context, question, char_ctx, char_ques, label, ctx_text, ans, ids = batch

        context, question, char_ctx, char_ques, label = context.to(device), question.to(device),\
                                    char_ctx.to(device), char_ques.to(device), label.to(device)

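        with torch.no_grad():

            # run the forward pass in eval mode (disables dropout), then restore the previous mode
            was_training = model.training
            model.eval()
            preds = model(context, question, char_ctx, char_ques)
            model.train(was_training)

            p1, p2 = preds

            y1, y2 = label[:,0], label[:,1]

            # the model returns masked logits, so use cross-entropy as in training
            loss = F.cross_entropy(p1, y1) + F.cross_entropy(p2, y2)

            valid_loss += loss.item()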

            batch_size, c_len = p1.size()
            ls = nn.LogSoftmax(dim=1)
            mask = (torch.ones(c_len, c_len) * float('-inf')).to(device).tril(-1).unsqueeze(0).expand(batch_size, -1, -1)
            score = (ls(p1).unsqueeze(2) + ls(p2).unsqueeze(1)) + mask
            score, s_idx = score.max(dim=1)
            score, e_idx = score.max(dim=1)
            s_idx = torch.gather(s_idx, 1, e_idx.view(-1, 1)).squeeze(1)
            
           
            for i in range(batch_size):
                id = ids[i]
                pred = context[i][s_idx[i]:e_idx[i]+1]
                pred = ' '.join([idx2word[idx.item()] for idx in pred])
                predictions[id] = pred
            
    em, f1 = evaluate(predictions)
    return valid_loss/len(valid_dataset), em, f1           
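
The tensor operations in `valid` that build `mask` and `score` pick the highest-scoring span whose start does not come after its end. The same search can be written as a brute-force loop, which is easier to check by hand. This is a pure-Python sketch with made-up log-probabilities; `best_span` is a hypothetical helper, not part of the notebook.

```python
def best_span(start_logp, end_logp):
    # exhaustive search over all (start, end) pairs with start <= end
    best, best_score = (0, 0), float('-inf')
    for s, sp in enumerate(start_logp):
        for e in range(s, len(end_logp)):
            score = sp + end_logp[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

start_logp = [-3.0, -2.0, -0.2, -4.0]
end_logp = [-2.5, -0.1, -1.5, -3.0]
span = best_span(start_logp, end_logp)
```

Taking the argmax of each distribution independently would give start 2 and end 1 here, which is not a valid span. The constrained search returns `(2, 2)` instead.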
  

In [67]:
def evaluate(predictions):
    '''
    Gets a dictionary of predictions with question_id as key
    and prediction as value. The validation dataset has multiple 
    answers for a single question. Hence we compare our prediction
    with all the answers and choose the one that gives us
    the maximum metric (em or f1). 
    This method first parses the JSON file, gets all the answers
    for a given id and then passes the list of answers and the 
    predictions to calculate em, f1.
    
    
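    :param dict predictions: maps question_id to the predicted answer string
    Returns
    : exact_match: percentage of questions whose prediction exactly
      matches at least one ground truth (after normalization).
    : f1_score: average of the maximum token-level F1 over the ground
      truths, as a percentage.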
    '''
    with open(f'{proj_root}/data/squad_dev.json','r',encoding='utf-8') as f:
        dataset = json.load(f)
        
    dataset = dataset['data']
    f1 = exact_match = total = 0
    for article in dataset:
        for paragraph in article['paragraphs']:
            for qa in paragraph['qas']:
                total += 1
                if qa['id'] not in predictions:
                    continue
                
                ground_truths = list(map(lambda x: x['text'], qa['answers']))
                
                prediction = predictions[qa['id']]
                
                exact_match += metric_max_over_ground_truths(
                    exact_match_score, prediction, ground_truths)
                
                f1 += metric_max_over_ground_truths(
                    f1_score, prediction, ground_truths)
                
    
    exact_match = 100.0 * exact_match / total
    f1 = 100.0 * f1 / total
    
    return exact_match, f1



In [68]:
from collections import Counter  # used by f1_score below

def normalize_answer(s):
    '''
    Performs a series of cleaning steps on the ground truth and 
    predicted answer.
    '''
    def remove_articles(text):
        return re.sub(r'\b(a|an|the)\b', ' ', text)

    def white_space_fix(text):
        return ' '.join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return ''.join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))


def metric_max_over_ground_truths(metric_fn, prediction, ground_truths):
    '''
    Returns maximum value of metrics for prediction by model against
    multiple ground truths.
    
    :param func metric_fn: can be 'exact_match_score' or 'f1_score'
    :param str prediction: predicted answer span by the model
    :param list ground_truths: list of ground truths against which
                               metrics are calculated. Maximum values of 
                               metrics are chosen.
                            
    
    '''
    scores_for_ground_truths = []
    for ground_truth in ground_truths:
        score = metric_fn(prediction, ground_truth)
        scores_for_ground_truths.append(score)
        
    return max(scores_for_ground_truths)


def f1_score(prediction, ground_truth):
    '''
    Returns f1 score of two strings.
    '''
    prediction_tokens = normalize_answer(prediction).split()
    ground_truth_tokens = normalize_answer(ground_truth).split()
    common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(prediction_tokens)
    recall = 1.0 * num_same / len(ground_truth_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1


def exact_match_score(prediction, ground_truth):
    '''
    Returns exact_match_score of two strings.
    '''
    return (normalize_answer(prediction) == normalize_answer(ground_truth))

def epoch_time(start_time, end_time):
    '''
    Helper function to record epoch time.
    '''
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
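
As a worked example of the metrics, the compact sketch below re-implements the normalization and token-level F1 in a self-contained way. The names `normalize` and `token_f1` and the example strings are chosen only for illustration.

```python
import re
import string
from collections import Counter

def normalize(text):
    # lowercase, strip punctuation, drop articles, collapse whitespace
    exclude = set(string.punctuation)
    text = ''.join(ch for ch in text.lower() if ch not in exclude)
    text = re.sub(r'\b(a|an|the)\b', ' ', text)
    return ' '.join(text.split())

def token_f1(prediction, ground_truth):
    pred, truth = normalize(prediction).split(), normalize(ground_truth).split()
    num_same = sum((Counter(pred) & Counter(truth)).values())
    if num_same == 0:
        return 0.0
    precision, recall = num_same / len(pred), num_same / len(truth)
    return 2 * precision * recall / (precision + recall)

f1 = token_f1("The Eiffel Tower", "Eiffel Tower in Paris")
em = normalize("The Eiffel Tower!") == normalize("eiffel tower")
```

After normalization the prediction has 2 tokens and the ground truth has 4, with 2 in common. Precision is 1, recall is 0.5, and F1 is therefore 2/3. The exact-match comparison succeeds because articles, case and punctuation are removed before comparing.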

In [69]:

train_losses = []
valid_losses = []
ems = []
f1s = []
epochs = 3
best_valid_loss=99999

path=f'{proj_root}{MODEL_NAME}.pth'


for epoch in range(epochs):
    print(f"Epoch {epoch+1}")
    start_time = time.time()
    
    train_loss = train(model, train_dataset)
    valid_loss, em, f1 = valid(model, valid_dataset)

    if best_valid_loss>valid_loss: # save the best model
        torch.save({
                'epoch': epoch,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'loss': valid_loss,
                'em':em,
                'f1':f1,
                }, path)
        best_valid_loss=valid_loss
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    train_losses.append(train_loss)
    valid_losses.append(valid_loss)
    ems.append(em)
    f1s.append(f1)
    
    print(f"Epoch train loss : {train_loss}| Time: {epoch_mins}m {epoch_secs}s")
    print(f"Epoch valid loss: {valid_loss}")
    print(f"Epoch EM: {em}")
    print(f"Epoch F1: {f1}")
    print("====================================================================================")
    


## References

* Papers read/ referenced:
    1. The QANet paper: https://arxiv.org/abs/1804.09541
    2. Attention is All You Need: https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
    3. Convolutional Neural Networks for Sentence Classification: https://arxiv.org/abs/1408.5882
    4. Highway Networks: https://arxiv.org/abs/1505.00387
* Other helpful links:
    1. https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
    2. The Illustrated Transformer: http://jalammar.github.io/illustrated-transformer/. This is an excellent piece of writing with amazing easy-to-understand visualizations. Must read.
    3. https://mccormickml.com/2019/11/11/bert-research-ep-1-key-concepts-and-sources/. Chris McCormick's BERT research series is another great resource to learn about self attention and various other details about BERT. He has a blog as well as youtube video series on the same.
    4. https://kazemnejad.com/blog/transformer_architecture_positional_encoding/
    5. https://nlp.seas.harvard.edu/2018/04/03/attention.html. The annotated Transformer.
    6. https://nlp.seas.harvard.edu/slides/aaai16.pdf. A great resource for character embeddings.
    7. https://www.youtube.com/watch?v=T7o3xvJLuHk. Easy explanation of depthwise separable convolutions.
    8. https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728. Another amazing blog for depthwise separable convolutions.
    9. https://github.com/bentrevett/pytorch-seq2seq. A great series of notebooks on Machine Translation using PyTorch.  
Some of the repositories below might be out of date. 
    10. https://github.com/BangLiu/QANet-PyTorch
    11. https://github.com/NLPLearn/QANet
    12. https://github.com/setoidz/QANet-pytorch
    13. https://github.com/hackiey/QAnet-pytorch/tree/master/qanet