# Natural Language Processing

## DrQA 

This notebook implements model proposed in the paper: [Reading Wikipedia to Answer Open-Domain Questions](https://arxiv.org/abs/1704.00051) which is called DrQA by the authors. Specifically, DrQA is an end-to-end system for open domain question answering which involves an information retrieval system as well. This notebook however only explains the deep learning model proposed by them. This model is very similar to the one explained in [this](https://arxiv.org/abs/1606.02858) paper. The first authors in both the papers are also the same. The latter model is also known as "Stanford Attentive Reader" and is one of the models that is explained in Chris Manning's lecture on QA

In [1]:
import numpy as np
import time
import torch
from torch import nn
import torch.nn.functional as F

## 1. Load SQuAD

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

In [2]:
#comment this if you are not using AIT proxy...
import os
os.environ['http_proxy']  = 'http://192.41.170.23:3128'
os.environ['https_proxy'] = 'http://192.41.170.23:3128'

In [3]:
import json

def load_json(path):
    '''
    Loads the JSON file of the Squad dataset.
    Returns the json object of the dataset.
    '''
    with open(path, 'r', encoding='utf-8') as f:
        data = json.load(f)
        
    print("Length of data: ", len(data['data']))
    print("Data Keys: ", data['data'][0].keys())
    
    return data

In [4]:
# load dataset json files
train_data = load_json('data/squad_train.json')
valid_data = load_json('data/squad_dev.json')

Length of data:  442
Data Keys:  dict_keys(['title', 'paragraphs'])
Length of data:  48
Data Keys:  dict_keys(['title', 'paragraphs'])


In [5]:
print("Example 0 Title: ", train_data['data'][0]['title'])

Example 0 Title:  University_of_Notre_Dame


In [6]:
#Example 0 Paragraph:
train_data['data'][0]['paragraphs'][0]

{'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'qas': [{'answers': [{'answer_start': 515,
     'text': 'Saint Bernadette Soubirous'}],
   'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
   'id': '5733be284776f41900661182'},
  {'answers': [{'answer_start': 188, 'text': 'a copper statue of Christ

## 2. Data Preprocessing

### 2.1 Parse to dict

Since SQuAD has an unique structure where there are many questions for one context, we have to construct a nice datasets, reflect each of these questions as one sample, but repeating the same context.

In [7]:
def parse_data(data:dict)->list:
    
    data = data['data']
    qa_list = []

    for paragraphs in data:

        for para in paragraphs['paragraphs']:
            context = para['context']

            for qa in para['qas']:
                
                id = qa['id']
                question = qa['question']
                
                for ans in qa['answers']:
                    answer = ans['text']
                    ans_start = ans['answer_start']
                    ans_end = ans_start + len(answer)
                    
                    #one row of data
                    qa_dict = {}
                    qa_dict['id'] = id
                    qa_dict['context'] = context
                    qa_dict['question'] = question
                    qa_dict['label'] = [ans_start, ans_end]
                    qa_dict['answer'] = answer
                    
                    #append to a list of rows/dicts
                    qa_list.append(qa_dict)    

    return qa_list

In [8]:
# parse the json structure to return the data as a list of dictionaries
train_list = parse_data(train_data)
valid_list = parse_data(valid_data)

In [9]:
print('Train list len: ',len(train_list))
print('Valid list len: ',len(valid_list))

Train list len:  87599
Valid list len:  34726


In [10]:
#minimize the train_list and valid_list for easy debugging
#uncomment this for toy training
# train_list = train_list[:32]  #just enough for one batch
# valid_list = valid_list[:32]
train_list = train_list[:10000]
valid_list = valid_list[:5000]

In [11]:
import pandas as pd

# converting the lists into dataframes for easy access
train_df = pd.DataFrame(train_list)
valid_df = pd.DataFrame(valid_list)
train_df.head()

Unnamed: 0,id,context,question,label,answer
0,5733be284776f41900661182,"Architecturally, the school has a Catholic cha...",To whom did the Virgin Mary allegedly appear i...,"[515, 541]",Saint Bernadette Soubirous
1,5733be284776f4190066117f,"Architecturally, the school has a Catholic cha...",What is in front of the Notre Dame Main Building?,"[188, 213]",a copper statue of Christ
2,5733be284776f41900661180,"Architecturally, the school has a Catholic cha...",The Basilica of the Sacred heart at Notre Dame...,"[279, 296]",the Main Building
3,5733be284776f41900661181,"Architecturally, the school has a Catholic cha...",What is the Grotto at Notre Dame?,"[381, 420]",a Marian place of prayer and reflection
4,5733be284776f4190066117e,"Architecturally, the school has a Catholic cha...",What sits on top of the Main Building at Notre...,"[92, 126]",a golden statue of the Virgin Mary


### 2.2 Numericalization (building vocabs)

Next, we need to numericalize our dataset.

Since our context and questions repeats in the df, let's first take only the unique context and question.

In [12]:
def gather_text_for_vocab(dfs:list):    
    text = []
    total = 0
    for df in dfs:
        unique_contexts = list(df.context.unique())
        unique_questions = list(df.question.unique())
        total += df.context.nunique() + df.question.nunique()
        text.extend(unique_contexts + unique_questions)
    
    assert len(text) == total
    
    return text

In [13]:
# gather text to build vocabularies
%time vocab_text = gather_text_for_vocab([train_df, valid_df])
print("Number of unique sentences in dataset: ", len(vocab_text))

CPU times: user 27.8 ms, sys: 0 ns, total: 27.8 ms
Wall time: 27.3 ms
Number of unique sentences in dataset:  13712


In [14]:
#example
vocab_text[0]

'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.'

Next, we gonna assign ids to every unique tokens.  Thus, before we assign, we use our favorite spaCy to help tokenize, and then we just assign the `ids`.

In [15]:
#use spacy to help tokenize
import spacy
from collections import Counter

nlp = spacy.load('en_core_web_sm')

def build_word_vocab(vocab_text):

    words = []
    for sent in vocab_text:
        for word in nlp(sent, disable=['parser', 'ner']):  #disable so it's fast....
            words.append(word.text)

    word_counter = Counter(words)
    word_vocab = sorted(word_counter, key=word_counter.get, reverse=True) #list of words sorted by frequency
    print(f"raw-vocab: {len(word_vocab)}")
    word_vocab.insert(0, '<unk>')
    word_vocab.insert(1, '<pad>')
    print(f"vocab-length: {len(word_vocab)}")
    word2idx = {word:idx for idx, word in enumerate(word_vocab)}
    print(f"word2idx-length: {len(word2idx)}")
    idx2word = {v:k for k,v in word2idx.items()}
    
    return word2idx, idx2word, word_vocab

In [16]:
# build word vocabulary
%time word2idx, idx2word, word_vocab = build_word_vocab(vocab_text)

raw-vocab: 26420
vocab-length: 26422
word2idx-length: 26422
CPU times: user 36.2 s, sys: 46.6 ms, total: 36.2 s
Wall time: 36.2 s


Now we are ready to convert our context and question into ids.

In [17]:
def convert_to_ids(text, word2idx):
    
    tokens = [w.text for w in nlp(text, disable=['parser','ner'])]
    ids = [word2idx[word] for word in tokens]
    
    assert len(ids) == len(tokens)
    return ids

In [18]:
# numericalize context and questions for training and validation set
%time train_df['context_ids'] = train_df.context.apply(convert_to_ids, word2idx=word2idx)
%time valid_df['context_ids'] = valid_df.context.apply(convert_to_ids, word2idx=word2idx)

%time train_df['question_ids'] = train_df.question.apply(convert_to_ids,  word2idx=word2idx)
%time valid_df['question_ids'] = valid_df.question.apply(convert_to_ids,  word2idx=word2idx)

CPU times: user 58.7 s, sys: 7.77 ms, total: 58.7 s
Wall time: 58.7 s
CPU times: user 27.8 s, sys: 7.9 ms, total: 27.8 s
Wall time: 27.8 s
CPU times: user 18 s, sys: 0 ns, total: 18 s
Wall time: 18 s
CPU times: user 8.87 s, sys: 3.98 ms, total: 8.88 s
Wall time: 8.87 s


Next we remove any indices that have error.  Error here means the actual answer does not appear in the context, probably due to data entry errors.

In [19]:
def get_error_indices(df, idx2word):
    
    start_value_error, end_value_error, assert_error = test_indices(df, idx2word)
    err_idx = start_value_error + end_value_error + assert_error
    err_idx = set(err_idx)
    print(f"Number of error indices: {len(err_idx)}")
    
    return err_idx

In [20]:
def test_indices(df, idx2word):
   
    start_value_error = []
    end_value_error = []
    assert_error = []
    for index, row in df.iterrows():
        
        #get all answer tokens
        answer_tokens = [w.text for w in nlp(row['answer'], disable=['parser','ner'])]

        start_token = answer_tokens[0]
        end_token = answer_tokens[-1]
        
        #get context tokens, and their start and end position
        context_span  = [(word.idx, word.idx + len(word.text)) 
                         for word in nlp(row['context'], disable=['parser','ner'])]

        #get starts from the first pair of the tuple, and ends from the second pair
        starts, ends = zip(*context_span)

        #ground truth indices
        answer_start, answer_end = row['label']

        try:
            #try to find answer_start from starts
            start_idx = starts.index(answer_start)
        except:
            start_value_error.append(index)
        try:
            #try to find answer_start from starts
            end_idx  = ends.index(answer_end)
        except:
            end_value_error.append(index)

        try:
            #just to make sure that the idx2word convert back to the answer_token...
            #otherwise, the ground truth cannot work.....
            assert idx2word[row['context_ids'][start_idx]] == answer_tokens[0]
            assert idx2word[row['context_ids'][end_idx]]   == answer_tokens[-1]
        except:
            assert_error.append(index)


    return start_value_error, end_value_error, assert_error

In [21]:
# get indices with tokenization errors and drop those indices 
train_err = get_error_indices(train_df, idx2word)
valid_err = get_error_indices(valid_df, idx2word)

train_df.drop(train_err, inplace=True)
valid_df.drop(valid_err, inplace=True)

Number of error indices: 170
Number of error indices: 99


Last, we need to get the label answer based on our tokenized dataset.  That is, we should calculate the spans and then return a tuple of start and end positions.

In [22]:
def index_answer(row, idx2word):
    
    context_span = [(word.idx, word.idx + len(word.text)) for word in nlp(row.context, disable=['parser','ner'])]
    starts, ends = zip(*context_span)
    
    #finding the spans
    answer_start, answer_end = row.label
    start_idx = starts.index(answer_start)
    end_idx   = ends.index(answer_end)
    
    #double check
    ans_toks = [w.text for w in nlp(row.answer,disable=['parser','ner'])]
    ans_start = ans_toks[0]
    ans_end  = ans_toks[-1]
    assert idx2word[row.context_ids[start_idx]] == ans_start
    assert idx2word[row.context_ids[end_idx]]   == ans_end
    
    return [start_idx, end_idx]

In [23]:
# get start and end positions of answers from the context
# this is basically the label for training QA models
train_label_idx = train_df.apply(index_answer, axis=1, idx2word=idx2word)
valid_label_idx = valid_df.apply(index_answer, axis=1, idx2word=idx2word)

train_df['label_idx'] = train_label_idx
valid_df['label_idx'] = valid_label_idx

### 2.3 Dump data to pickle files 
This ensures that we can directly access the preprocessed dataframe next time.

In [24]:
import pickle
with open('drqastoi.pickle','wb') as handle:
    pickle.dump(word2idx, handle)
    
train_df.to_pickle('drqatrain.pkl')
valid_df.to_pickle('drqavalid.pkl')

### 2.4 Read data from pickle files

You only need to run the preprocessing once. Some preprocessing functions can take upto 3 mins. Therefore, pickling preprocessed data can save a lot of time.
Once the preprocessed files are saved, you can directly start from here.

In [25]:
train_df = pd.read_pickle('drqatrain.pkl')
valid_df = pd.read_pickle('drqavalid.pkl')

## 3. Preparing the Dataset/ Dataloader

In [26]:
class SquadDataset:
    '''
    -Divides the dataframe in batches.
    -Pads the contexts and questions dynamically for each batch by padding 
     the examples to the maximum-length sequence in that batch.
    -Calculates masks for context and question.
    -Calculates spans for contexts.
    '''
    
    def __init__(self, data, batch_size):
        
        self.batch_size = batch_size
        data = [data[i:i+self.batch_size] for i in range(0, len(data), self.batch_size)]
        self.data = data
    
    def get_span(self, text):
        
        text = nlp(text, disable=['parser','ner'])
        span = [(w.idx, w.idx+len(w.text)) for w in text]

        return span

    def __len__(self):
        return len(self.data)
    
    def __iter__(self):
        '''
        Creates batches of data and yields them.
        
        Each yield comprises of:
        :padded_context: padded tensor of contexts for each batch 
        :padded_question: padded tensor of questions for each batch 
        :context_mask & question_mask: zero-mask for question and context
        :label: start and end index wrt context_ids
        :context_text,answer_text: used while validation to calculate metrics
        :context_spans: spans of context text
        :ids: question_ids used in evaluation
        '''
        
        for batch in self.data:
                            
            spans = []
            context_text = []
            answer_text = []
            
            max_context_len = max([len(ctx) for ctx in batch.context_ids])
            padded_context = torch.LongTensor(len(batch), max_context_len).fill_(1)
            
            for ctx in batch.context:
                context_text.append(ctx)
                spans.append(self.get_span(ctx))
            
            for ans in batch.answer:
                answer_text.append(ans)
                
            for i, ctx in enumerate(batch.context_ids):
                padded_context[i, :len(ctx)] = torch.LongTensor(ctx)
            
            max_question_len = max([len(ques) for ques in batch.question_ids])
            padded_question = torch.LongTensor(len(batch), max_question_len).fill_(1)
            
            for i, ques in enumerate(batch.question_ids):
                padded_question[i,: len(ques)] = torch.LongTensor(ques)
            
            label = torch.LongTensor(list(batch.label_idx))
            context_mask = torch.eq(padded_context, 1)
            question_mask = torch.eq(padded_question, 1)
            
            ids = list(batch.id)  
            
            yield (padded_context, padded_question, context_mask, 
                   question_mask, label, context_text, answer_text, ids)
            
            

In [27]:
train_dataset = SquadDataset(train_df, 32)

In [28]:
valid_dataset = SquadDataset(valid_df, 32)

In [29]:
a = next(iter(train_dataset))

In [30]:
a[0].shape, a[1].shape, a[2].shape, a[3].shape, a[4].shape

(torch.Size([32, 253]),
 torch.Size([32, 19]),
 torch.Size([32, 253]),
 torch.Size([32, 19]),
 torch.Size([32, 2]))

In [31]:
a[5][0]  #first sample of the batch (context_text)

'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.'

In [32]:
a[6][0]  #answer

'Saint Bernadette Soubirous'

In [33]:
a[4][0] #label of start and end

tensor([102, 104])

## 4. Model

Before we dive deep into the intricacies of the model, let's set up the notations. An input example during training is comprised of 
* a paragraph / context $p$ consisting of $l$ tokens { $p_{1}$, $p_{2}$,..., $p_{l}$ }
* a question $q$ consisting of $m$ tokens { $q_{1}$, $q_{2}$,..., $q_{m}$ }
* a start and and end position that comes from the context itself. More specifically, the start and end indices of the answer from the context  

The following flowchart shows the flow of the model. It might not make sense now, but as we progress down the chart and build all the components, things will become clearer.


<img src="images/drqaflow.PNG" width="700" height="800"/>


### Word Embedding

The first transformation for both the question and the context tokens is that they are passed through an embedding layer initialized with pre-trained GloVe word vectors. 300-dimensional vectors version are used here.

In [34]:
def create_glove_matrix():
    glove_dict = {}
    with open("glove.6B.300d.txt", "r", encoding="utf-8") as f:
        for line in f:
            values = line.split(' ')
            word = values[0]
            vector = np.asarray(values[1:], dtype="float32")
            glove_dict[word] = vector
    f.close()
    
    return glove_dict

In [35]:
glove_dict = create_glove_matrix()

In [36]:
def create_word_embedding(glove_dict):
    '''
    Creates a weight matrix of the words that are common in the GloVe vocab and
    the dataset's vocab. Initializes OOV words with a zero vector.
    '''
    weights_matrix = np.zeros((len(word_vocab), 300))
    words_found = 0
    for i, word in enumerate(word_vocab):
        try:
            weights_matrix[i] = glove_dict[word]
            words_found += 1
        except:
            pass
    return weights_matrix, words_found

In [37]:
weights_matrix, words_found = create_word_embedding(glove_dict)

In [38]:
print("Total words found in glove vocab: ", words_found)

Total words found in glove vocab:  15768


In [39]:
np.save('drqaglove_vt.npy',weights_matrix)

### Align Question Embedding

The paper has different encoding procedures for the context and the question. The context/paragraph encoding is more exhaustive and comprises of following additional features:

* exact match : encodes a binary feature if $p$ can be exactly matched to one word in question in its original, lemma or lowercase form
* token features : Includes POS, NER and TF of context tokens and
* aligned question embedding ($f_{align}$) .  

In this re-implementation we only implemented the aligned question embedding. The other features can be added easily but they do not affect the metrics by a large margin(~2).  
$f_{align}$ has been formulated as shown below:

$$ f_{align} = \sum_{j}a_{i,j}E(q_{j}) $$ 

where $E()$ represents the glove embeddings and

<img src="images/drqa1.PNG" width="400" height="200"/>

where $\alpha()$ is a single dense layer with relu non-linearity. This transformation can be thought of as a projection to a new vector sub-space. The weights of the projection matrix will be learnt via backpropogation.
These equations can be converted into code quite easily. Lets break this down into smaller chunks and understand what's going on actually. 
<img src="images/drqa2.PNG" width="200" height="150"/>
This is simply the product of projections of glove embeddings of the context and the question. Careful inspection of the equation for $a_{i,j}$ reveals that it is actually a softmax of the above product. The equations above depict everything at token level where $i$ represents a context token and $j$ represents a question token. Practically we usually vectorize our computations and deal with tensors directly.
$f_{align}$ is a weighted representation of the question embeddings. $a_{i,j}$ represents the weights and hence a softmax function is necessary.  
#### Intuition
This feature enables the model to understand what portion of the context is more important or relevant with respect to the question. The products of projections taken at token level ensure a higher value when similar words from the question and context are multiplied. Quoting the paper,
> *these features add soft alignments between similar but non-identical words (e.g., car and vehicle).* 

This is achieved via backpropation and training the weights of the dense layer. While this might seem a bit weird initially, we have to trust the process of backpropogation.   

While implementing, we first calculate the projections of context and question vectors. We then use `torch.bmm` to calculate the product in the numerator of $a_{i,j}$, mask the product and then pass it through the softmax function to get $a_{i,j}$. Finally, we multiply this with the question embeddings. The output of this layer is an additional context embedding which is then concatenated with the glove embeddings.

In [40]:
class AlignQuestionEmbedding(nn.Module):
    
    def __init__(self, input_dim):        
        
        super().__init__()
        
        self.linear = nn.Linear(input_dim, input_dim)
        
        self.relu = nn.ReLU()
        
    def forward(self, context, question, question_mask):
        
        # context = [bs, ctx_len, emb_dim]
        # question = [bs, qtn_len, emb_dim]
        # question_mask = [bs, qtn_len]
    
        ctx_ = self.linear(context)
        ctx_ = self.relu(ctx_)
        # ctx_ = [bs, ctx_len, emb_dim]
        
        qtn_ = self.linear(question)
        qtn_ = self.relu(qtn_)
        # qtn_ = [bs, qtn_len, emb_dim]
        
        qtn_transpose = qtn_.permute(0,2,1)
        # qtn_transpose = [bs, emb_dim, qtn_len]
        
        align_scores = torch.bmm(ctx_, qtn_transpose)
        # align_scores = [bs, ctx_len, qtn_len]
        
        qtn_mask = question_mask.unsqueeze(1).expand(align_scores.size())
        # qtn_mask = [bs, 1, qtn_len] => [bs, ctx_len, qtn_len]
        
        # Fills elements of self tensor(align_scores) with value(-float(inf)) where mask is True. 
        # The shape of mask must be broadcastable with the shape of the underlying tensor.
        align_scores = align_scores.masked_fill(qtn_mask == 1, -float('inf'))
        # align_scores = [bs, ctx_len, qtn_len]
        
        align_scores_flat = align_scores.view(-1, question.size(1))
        # align_scores = [bs*ctx_len, qtn_len]
        
        alpha = F.softmax(align_scores_flat, dim=1)
        alpha = alpha.view(-1, context.shape[1], question.shape[1])
        # alpha = [bs, ctx_len, qtn_len]
        
        align_embedding = torch.bmm(alpha, question)
        # align = [bs, ctx_len, emb_dim]
        
        return align_embedding

## Stacked BiLSTM

The paragraph/context encoding which now has two features (glove and $f_{align}$) is then passed to a multilayer (3 layers) bidirectional LSTM. According to the paper,

> *Speciﬁcally, we choose to use a multi-layer bidirectional long short-term memory network (LSTM), and take the concatenation of each layer’s hidden units in the end. *

To achieve this functionality we cannot directly use the pytorch recurrent layers. Every recurrent layer in pytorch returns a tuple `[output, hidden]` where `output` holds the hidden states of all the timesteps from the __last layer only__. We need to access the hidden states of intermediate layers and then concatenate them at the end.
The following figure illustrates this point in more detail.

<img src="images/bilstm.png" width="700" height="600"/>

This figure shows a 3-layer bidirectional LSTM with an input sequence of size $n$. The green blocks denote the forward LSTMs and the blue blocks backward. Each block is labelled with the value that it calculates. The subscript denotes the time-step and the superscript denotes the depth or the layer-number.

As highlighted in the diagram, we need the intermediate hidden states passed between the layers along with the final output. To create this in code, we create a `nn.ModuleList` and add 3 LSTM layers to it. The input size of the first layer remains the same but for subsequent LSTMs the input size must be twice the hidden size. This is because the `output` of the first LSTM will have the dimension of `[batch_size, seq_len, hidden_size*num_directions]` and `num_directions` is 2 in our case. In the forward method, we loop through the LSTMs, store the hidden states of each layer and finally return the concatenated output. 



In [41]:
class StackedBiLSTM(nn.Module):
    
    def __init__(self, input_dim, hidden_dim, num_layers, dropout):
        
        super().__init__()
        
        self.dropout = dropout
        
        self.num_layers = num_layers
        
        self.lstms = nn.ModuleList()
        
        for i in range(self.num_layers):
            
            input_dim = input_dim if i == 0 else hidden_dim * 2
            
            self.lstms.append(nn.LSTM(input_dim, hidden_dim,
                                      batch_first=True, bidirectional=True))
           
    
    def forward(self, x):
        # x = [bs, seq_len, feature_dim]

        outputs = [x]
        for i in range(self.num_layers):

            lstm_input = outputs[-1]
            lstm_out = F.dropout(lstm_input, p=self.dropout)
            lstm_out, (hidden, cell) = self.lstms[i](lstm_input)
           
            outputs.append(lstm_out)

    
        output = torch.cat(outputs[1:], dim=2)
        # [bs, seq_len, num_layers*num_dir*hidden_dim]
        
        output = F.dropout(output, p=self.dropout)
      
        return output

## Linear Attention Layer

The previous layers were majorly about encoding and representing the context. This layer is used to encode the question and is much simpler than the previous layers. The question tokens are first passed through the glove embedding layer, then passed through the bilstm layer and finally reach this layer. 
This layer is used to calculate the importance of each word in the question. This can be achieved by simply taking a softmax over the input. However to add more learning capacity to the model, the inputs are multiplied by a trainable weight vector $w$ and then passed through a softmax function.  
This layer calculates the weights as 
<img src="images/drqab.PNG" width="300" height="300"/>

Essentially the layer is performing "attention" on inputs. The $w$ in code is characterized by a linear layer.

In [42]:
class LinearAttentionLayer(nn.Module):
    
    def __init__(self, input_dim):
        
        super().__init__()
        
        self.linear = nn.Linear(input_dim, 1)
        
    def forward(self, question, question_mask):
        
        # question = [bs, qtn_len, input_dim] = [bs, qtn_len, bi_lstm_hid_dim]
        # question_mask = [bs,  qtn_len]
        
        qtn = question.view(-1, question.shape[-1])
        # qtn = [bs*qtn_len, hid_dim]
        
        attn_scores = self.linear(qtn)
        # attn_scores = [bs*qtn_len, 1]
        
        attn_scores = attn_scores.view(question.shape[0], question.shape[1])
        # attn_scores = [bs, qtn_len]
        
        attn_scores = attn_scores.masked_fill(question_mask == 1, -float('inf'))
        
        alpha = F.softmax(attn_scores, dim=1)
        # alpha = [bs, qtn_len]
        
        return alpha
        

The following function just multiplies the weights calculated in the previous layer by the outputs of the question bilstm layer. This allows the model to assign higher values to important words in each question.

$$ q = \sum_{j} b_{j} q_{j} $$

In [43]:
def weighted_average(x, weights):
    # x = [bs, len, dim]
    # weights = [bs, len]
    
    weights = weights.unsqueeze(1)
    # weights = [bs, 1, len]
    
    w = weights.bmm(x).squeeze(1)
    # w = [bs, 1, dim] => [bs, dim]
    
    return w

## Attention

Recall that the attention mechanism was designed to do this: while decoding at any particular time step, encoder hidden states from all the time-steps are made available to the decoder. The decoder then can look back at the encoder hidden states or the source language and make a more informed prediction at a particular time-step. This alieviates the problem of all the information from source language being crammed into a single vector.  

To illustrate this with equations, consider that the hidden states of the encoder RNN are represented by $H$ = {$h_{1}, h_{2}, h_{3},...,h_{t}$}. While decoding the token at position $t$, the input to the decoder unit is hidden state from previous unit $s_{t-1}$ and an attention vector which is a selective summary of the encoder hidden states and helps the decoder to pay more attention to a particular encoder state. 
The similarity between the encoder hidden states $H$ and the decoder hidden state so far $s_{t-1}$ is computed by,  
$$ \alpha = tanh (W [H ; s_{t-1}]) $$   

$\alpha$ is then passed through a softmax layer to obtain attention distribution such that $\sum_{t} \alpha_{t}$ = 1.
The final step is calculating the attention vector by taking a weighted sum of the encoder hidden states,
$$ \sum_{t} \alpha_{t} h_{t} $$

The following diagram illustrates this process.  
 
<img src="images/attnkj.PNG" width="600" height="100"/>

Here the encoder hidden states {$h_{1}, h_{2}, h_{3},...,h_{t}$} are commonly called the __*values*__ and the decoder hidden state $s_{t-1}$ is the __*query*__.  

### A More General Take On Attention

In general there are 3 steps when calculating the attention. Consider that values are represented by {$h_{1}, h_{2}, h_{3},..h_{n}$} and query is $s$. Then attention always involves,

1. Calculating the energy $e$ or attention scores between these 2 vectors,
$e$   $ \epsilon$  $ R^{N} $
2. Taking softmax to get an attention distribution $\alpha$, $\alpha$ $\epsilon$ $R^{N}$

$$ \alpha = softmax(e)$$ 
$$ \sum_{t}^{N} \alpha_{t} = 1 $$

3. Taking the weighted sum of the `values` by using $\alpha$
$$ a = \sum_{t}^{N}\alpha_{t}h_{t} $$


Now there are different ways to calculate the energy between `query` and `values`. 
* **Basic Dot Product Attention**    
$$ e_{t} = s^{T}h_{t}$$      
* **Additive Attention**
$$ e_{t} = v^{T} tanh (W [h_{t};s])$$  
This is nothing but the Bahdanau attention.
* **Scaled Dot Product Attention**
$$ e_{t} = s^{T}h_{t}/\sqrt n$$
where $n$ is the model size. A modified version of this proposed in the Transformers paper by Vaswani et al. is now employed in almost every NLP system.

* **Bilinear Attention**
$$ e_{t} = s^{T} W h_{t}$$
where $W$ is a trainable weight vector.
This is the method used in this paper to predict the start and end position of the answer from the context.    


To implement this layer, we characterise $W$ by a linear layer.
First the linear layer is applied to the question, which is equivalent to the product $W.q$. This product is then multiplied by the context using `torch.bmm`.   
Note that softmax is not taken over here to get the weights. This is taken care of when we calculate the loss using cross entropy. The following layer does not actually calculate the attention as a weighted sum. It just uses the bilinear term's representation to predict the span. However the intuition behind the bilinear term still remains the same.

In [44]:
class BilinearAttentionLayer(nn.Module):
    
    def __init__(self, context_dim, question_dim):
        
        super().__init__()
        
        self.linear = nn.Linear(question_dim, context_dim)
        
    def forward(self, context, question, context_mask):
        
        # context = [bs, ctx_len, ctx_hid_dim] = [bs, ctx_len, hid_dim*6] = [bs, ctx_len, 768]
        # question = [bs, qtn_hid_dim] = [bs, qtn_len, 768]
        # context_mask = [bs, ctx_len]
        
        qtn_proj = self.linear(question)
        # qtn_proj = [bs, ctx_hid_dim]
        
        qtn_proj = qtn_proj.unsqueeze(2)
        # qtn_proj = [bs, ctx_hid_dim, 1]
        
        scores = context.bmm(qtn_proj)
        # scores = [bs, ctx_len, 1]
        
        scores = scores.squeeze(2)
        # scores = [bs, ctx_len]
        
        scores = scores.masked_fill(context_mask == 1, -float('inf'))
        
        #alpha = F.log_softmax(scores, dim=1)
        # alpha = [bs, ctx_len]
        
        return scores

## Putting it together

The following module brings all the components discussed so far together. It takes in the context and question tokens as inputs and returns the start and end positions of the answer from the context.  

<img src="images/drqaflow.PNG" width="600" height="600"/>

  
Going down the flowchart, following steps are performed in sequence:  
* The context and question tokens are passed through the Glove embedding layer. The glove embeddings are partially finetuned during training. According to the paper,  
> *We keep most of the pre-trained word embeddings ﬁxed and only ﬁne-tune the 1000 most frequent question words because the representations of some key words such as what, how, which, many could be crucial for QA systems.*   

In code, this is done by using hooks in pytorch. Hooks work as a callback functions and are executed after `forward` or `backward` function is called for a particular tensor. You should read more about this in their documentation.

* Aligned question embedding is calculated for the context vector and concatenated (using `torch.cat`) to the context representation. If $d$ is the embedding dimension then context $\epsilon$ $R^{2d}$ and question $\epsilon$ $R^{d}$.
* Context and question representations are then passed to bilstm layers to obtain tensors of dimension `[batch_size, seq_len, hidden_dim*6]` since the LSTM is bidirectional and there are 3 layers of it.
* The embedded question is also passed through the linear attention layer and a weighted sum of its output is taken with the biLSTM output.
* Both these representations are finally passed through the bilinear attention layer to predict the start and end position of the answer.   

An intriguing point here is that the same set of weights are passed to the bilinear attention layers. Yet how do they predict different things. This is left over to the neural network to learn. Our loss function ensures that our objective is to predict different positions from the context. It is now the neural net's responsibility to learn different weights for each layer. It is sort of a "black-box" and we have to trust the process of backpropogation.

In [45]:
class DocumentReader(nn.Module):
    
    def __init__(self, hidden_dim, embedding_dim, num_layers, num_directions, dropout, device):
        
        super().__init__()
        
        self.device = device
        
        #self.embedding = self.get_glove_embedding()
        
        self.context_bilstm = StackedBiLSTM(embedding_dim * 2, hidden_dim, num_layers, dropout)
        
        self.question_bilstm = StackedBiLSTM(embedding_dim, hidden_dim, num_layers, dropout)
        
        self.glove_embedding = self.get_glove_embedding()
        
        def tune_embedding(grad, words=1000):
            grad[words:] = 0
            return grad
        
        self.glove_embedding.weight.register_hook(tune_embedding)
        
        self.align_embedding = AlignQuestionEmbedding(embedding_dim)
        
        self.linear_attn_question = LinearAttentionLayer(hidden_dim*num_layers*num_directions) 
        
        self.bilinear_attn_start = BilinearAttentionLayer(hidden_dim*num_layers*num_directions, 
                                                          hidden_dim*num_layers*num_directions)
        
        self.bilinear_attn_end = BilinearAttentionLayer(hidden_dim*num_layers*num_directions,
                                                        hidden_dim*num_layers*num_directions)
        
        self.dropout = nn.Dropout(dropout)
   
        
    def get_glove_embedding(self):
        
        weights_matrix = np.load('drqaglove_vt.npy')
        num_embeddings, embedding_dim = weights_matrix.shape
        embedding = nn.Embedding.from_pretrained(torch.FloatTensor(weights_matrix).to(self.device),freeze=False)

        return embedding
    
    
    def forward(self, context, question, context_mask, question_mask):
       
        # context = [bs, len_c]
        # question = [bs, len_q]
        # context_mask = [bs, len_c]
        # question_mask = [bs, len_q]
        
        
        ctx_embed = self.glove_embedding(context)
        # ctx_embed = [bs, len_c, emb_dim]
        
        ques_embed = self.glove_embedding(question)
        # ques_embed = [bs, len_q, emb_dim]
        
        ctx_embed = self.dropout(ctx_embed)
     
        ques_embed = self.dropout(ques_embed)
             
        align_embed = self.align_embedding(ctx_embed, ques_embed, question_mask)
        # align_embed = [bs, len_c, emb_dim]  
        
        ctx_bilstm_input = torch.cat([ctx_embed, align_embed], dim=2)
        # ctx_bilstm_input = [bs, len_c, emb_dim*2]
                
        ctx_outputs = self.context_bilstm(ctx_bilstm_input)
        # ctx_outputs = [bs, len_c, hid_dim*layers*dir] = [bs, len_c, hid_dim*6]
       
        qtn_outputs = self.question_bilstm(ques_embed)
        # qtn_outputs = [bs, len_q, hid_dim*6]
    
        qtn_weights = self.linear_attn_question(qtn_outputs, question_mask)
        # qtn_weights = [bs, len_q]
            
        qtn_weighted = weighted_average(qtn_outputs, qtn_weights)
        # qtn_weighted = [bs, hid_dim*6]
        
        start_scores = self.bilinear_attn_start(ctx_outputs, qtn_weighted, context_mask)
        # start_scores = [bs, len_c]
         
        end_scores = self.bilinear_attn_end(ctx_outputs, qtn_weighted, context_mask)
        # end_scores = [bs, len_c]
        
      
        return start_scores, end_scores

## 5. Training

###  Hyperparameters

> *We use 3-layer bidirectional LSTMs with h = 128 hidden units for both paragraph and question encoding. Dropout with p = 0.3 is applied to word embeddings and all the hidden units of LSTMs. *

In [46]:
device = torch.device('cuda')
HIDDEN_DIM = 128
EMB_DIM = 300
NUM_LAYERS = 3
NUM_DIRECTIONS = 2
DROPOUT = 0.3
device = torch.device('cuda')

model = DocumentReader(HIDDEN_DIM,
                       EMB_DIM, 
                       NUM_LAYERS, 
                       NUM_DIRECTIONS, 
                       DROPOUT, 
                       device).to(device)

In [47]:
optimizer = torch.optim.Adamax(model.parameters())

In [48]:
def count_parameters(model):
    '''Returns the number of trainable parameters in the model.'''
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 11,967,749 trainable parameters


In [49]:
def train(model, train_dataset):
    '''
    Trains the model.
    '''
    
    print("Starting training ........")
    
    train_loss = 0.
    batch_count = 0
    
    # put the model in training mode
    model.train()
    
    # iterate through training data
    for batch in train_dataset:

        if batch_count % 500 == 0:
            print(f"Starting batch: {batch_count}")
        batch_count += 1

        context, question, context_mask, question_mask, label, ctx, ans, ids = batch
        
        # place the tensors on GPU
        context, context_mask, question, question_mask, label = context.to(device), context_mask.to(device),\
                                    question.to(device), question_mask.to(device), label.to(device)
        
        # forward pass, get the predictions
        preds = model(context, question, context_mask, question_mask)

        start_pred, end_pred = preds
        
        # separate labels for start and end position
        start_label, end_label = label[:,0], label[:,1]
        
        # calculate loss
        loss = F.cross_entropy(start_pred, start_label) + F.cross_entropy(end_pred, end_label)
        
        # backward pass, calculates the gradients
        loss.backward()
        
        # gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), 10)
        
        # update the gradients
        optimizer.step()
        
        # zero the gradients to prevent them from accumulating
        optimizer.zero_grad()

        train_loss += loss.item()

    return train_loss/len(train_dataset)

In [50]:
def valid(model, valid_dataset):
    '''
    Performs validation.
    '''
    
    print("Starting validation .........")
   
    valid_loss = 0.

    batch_count = 0
    
    f1, em = 0., 0.
    
    # puts the model in eval mode. Turns off dropout
    model.eval()
    
    predictions = {}
    
    for batch in valid_dataset:

        if batch_count % 500 == 0:
            print(f"Starting batch {batch_count}")
        batch_count += 1

        context, question, context_mask, question_mask, label, context_text, answers, ids = batch

        context, context_mask, question, question_mask, label = context.to(device), context_mask.to(device),\
                                    question.to(device), question_mask.to(device), label.to(device)

        with torch.no_grad():

            preds = model(context, question, context_mask, question_mask)

            p1, p2 = preds
            #p1, p2 = [bs, len_context]

            y1, y2 = label[:,0], label[:,1]

            loss = F.cross_entropy(p1, y1) + F.cross_entropy(p2, y2)

            valid_loss += loss.item()
            
            # get the start and end index positions from the model preds
            
            batch_size, c_len = p1.size()
            ls = nn.LogSoftmax(dim=1)
            mask = (torch.ones(c_len, c_len) * float('-inf')).to(device).tril(-1).unsqueeze(0).expand(batch_size, -1, -1)
            #mask = [bs, c_len, c_len]            
            
            score = (ls(p1).unsqueeze(2) + ls(p2).unsqueeze(1)) + mask
            #score = [bs, len_context, 1] + [bs, 1, len_context] + mask
            
            #first, max along the p1 axis
            score, s_idx = score.max(dim=1)
            #s_idx = [bs, c_len]
            #score = [bs, c_len]
            
            #then since the score now is left with [bs, len_context], we simply max again to get end_idx
            score, e_idx = score.max(dim=1)
            #s_idx = [bs]
            #score = [bs]
            
            s_idx = torch.gather(s_idx, 1, e_idx.view(-1, 1)).squeeze()
            
            # stack predictions
            for i in range(batch_size):
                id = ids[i]
                pred = context[i][s_idx[i]:e_idx[i]+1]
                pred = ' '.join([idx2word[idx.item()] for idx in pred])
                predictions[id] = pred
            
    em, f1 = evaluate(predictions)            
    return valid_loss/len(valid_dataset), em, f1
                

### Explaning how we calculate the score, s_idx, and e_idx

In [51]:
#in case you don't understand what this line does
#it basically block all impossible scores where end_idx > start_idx
mask = (torch.ones(5, 5) * float('-inf')).tril(-1).unsqueeze(0).expand(1, -1, -1)
print(mask)
print(mask.shape)

tensor([[[0., 0., 0., 0., 0.],
         [-inf, 0., 0., 0., 0.],
         [-inf, -inf, 0., 0., 0.],
         [-inf, -inf, -inf, 0., 0.],
         [-inf, -inf, -inf, -inf, 0.]]])
torch.Size([1, 5, 5])


In [52]:
#let's say the context here is
# AIT is at Pathum Thani
#let's say the answer is Pathum Thani

print("*" * 5)
p1 = torch.tensor([[2, 3, 2, 10, 5]])
print("p1 shape: ", p1.shape)
p1_unsqueezed = p1.unsqueeze(2)
print(f"{p1_unsqueezed=}")
print(f"{p1_unsqueezed.shape=}")

print("*" * 5)
p2 = torch.tensor([[1, 3, 2, 5, 10]])
print("p2 shape: ", p2.shape)
p2_unsqueezed = p2.unsqueeze(1)
print(f"{p2_unsqueezed=}")
print(f"{p2_unsqueezed.shape}")

print("*" * 5)
score = p1_unsqueezed + p2_unsqueezed + mask
'''
[6 3 2 10 6     [6 3 2 6 10    [0 0 0 0 0
 6 3 2 10 6      6 3 2 6 10     - 0 0 0 0
 6 3 2 10 6   +  6 3 2 6 10  +  - - 0 0 0
 6 3 2 10 6      6 3 2 6 10     - - - 0 0
 6 3 2 10 6]     6 3 2 6 10]    - - - - 0]
'''
print(f"{score=}")

print("*" * 5)
#first, max along the p1 axis
score, s_idx = score.max(dim=1)
print(f"{score=}")
print(f"{s_idx=}")

print("*" * 5)
#first, max along the p2 axis
score, e_idx = score.max(dim=1)
print(f"{score=}")
print(f"{e_idx=}")

print("*" * 5)
#torch.gather(input, dim, index)
s_idx = torch.gather(s_idx, 1, e_idx.view(-1, 1)).squeeze()
print(f"{s_idx=}")


*****
p1 shape:  torch.Size([1, 5])
p1_unsqueezed=tensor([[[ 2],
         [ 3],
         [ 2],
         [10],
         [ 5]]])
p1_unsqueezed.shape=torch.Size([1, 5, 1])
*****
p2 shape:  torch.Size([1, 5])
p2_unsqueezed=tensor([[[ 1,  3,  2,  5, 10]]])
torch.Size([1, 1, 5])
*****
score=tensor([[[ 3.,  5.,  4.,  7., 12.],
         [-inf,  6.,  5.,  8., 13.],
         [-inf, -inf,  4.,  7., 12.],
         [-inf, -inf, -inf, 15., 20.],
         [-inf, -inf, -inf, -inf, 15.]]])
*****
score=tensor([[ 3.,  6.,  5., 15., 20.]])
s_idx=tensor([[0, 1, 1, 3, 3]])
*****
score=tensor([20.])
e_idx=tensor([4])
*****
s_idx=tensor(3)


In [53]:
def evaluate(predictions):
    '''
    Gets a dictionary of predictions with question_id as key
    and prediction as value. The validation dataset has multiple 
    answers for a single question. Hence we compare our prediction
    with all the answers and choose the one that gives us
    the maximum metric (em or f1). 
    This method first parses the JSON file, gets all the answers
    for a given id and then passes the list of answers and the 
    predictions to calculate em, f1.
    
    
    :param dict predictions
    Returns
    : exact_match: 1 if the prediction and ground truth 
      match exactly, 0 otherwise.
    : f1_score: 
    '''
    with open('./data/squad_dev.json','r',encoding='utf-8') as f:
        dataset = json.load(f)
        
    dataset = dataset['data']
    f1 = exact_match = total = 0
    for article in dataset:
        for paragraph in article['paragraphs']:
            for qa in paragraph['qas']:
                total += 1
                if qa['id'] not in predictions:
                    continue
                
                ground_truths = list(map(lambda x: x['text'], qa['answers']))
                
                prediction = predictions[qa['id']]
                
                exact_match += metric_max_over_ground_truths(
                    exact_match_score, prediction, ground_truths)
                
                f1 += metric_max_over_ground_truths(
                    f1_score, prediction, ground_truths)
                
    
    exact_match = 100.0 * exact_match / total
    f1 = 100.0 * f1 / total
    
    return exact_match, f1



In [54]:
import string, re

def normalize_answer(s):
    '''
    Performs a series of cleaning steps on the ground truth and 
    predicted answer.
    '''
    def remove_articles(text):
        return re.sub(r'\b(a|an|the)\b', ' ', text)

    def white_space_fix(text):
        return ' '.join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return ''.join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))


def metric_max_over_ground_truths(metric_fn, prediction, ground_truths):
    '''
    Returns maximum value of metrics for predicition by model against
    multiple ground truths.
    
    :param func metric_fn: can be 'exact_match_score' or 'f1_score'
    :param str prediction: predicted answer span by the model
    :param list ground_truths: list of ground truths against which
                               metrics are calculated. Maximum values of 
                               metrics are chosen.
                            
    
    '''
    scores_for_ground_truths = []
    for ground_truth in ground_truths:
        score = metric_fn(prediction, ground_truth)
        scores_for_ground_truths.append(score)
        
    return max(scores_for_ground_truths)


def f1_score(prediction, ground_truth):
    '''
    Returns f1 score of two strings.
    '''
    prediction_tokens = normalize_answer(prediction).split()
    ground_truth_tokens = normalize_answer(ground_truth).split()
    common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(prediction_tokens)
    recall = 1.0 * num_same / len(ground_truth_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1


def exact_match_score(prediction, ground_truth):
    '''
    Returns exact_match_score of two strings.
    '''
    return (normalize_answer(prediction) == normalize_answer(ground_truth))


def epoch_time(start_time, end_time):
    '''
    Helper function to record epoch time.
    '''
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [55]:

train_losses = []
valid_losses = []
ems = []
f1s = []
epochs = 5

for epoch in range(epochs):
    print(f"Epoch {epoch+1}")
    
    start_time = time.time()
    
    train_loss = train(model, train_dataset)
    valid_loss, em, f1 = valid(model, valid_dataset)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    train_losses.append(train_loss)
    valid_losses.append(valid_loss)
    ems.append(em)
    f1s.append(f1)
    
    print(f"Epoch train loss : {train_loss}| Time: {epoch_mins}m {epoch_secs}s")
    print(f"Epoch valid loss: {valid_loss}")
    print(f"Epoch EM: {em}")
    print(f"Epoch F1: {f1}")
    print("====================================================================================")
    

Epoch 1
Starting training ........
Starting batch: 0
Starting validation .........
Starting batch 0
Epoch train loss : 7.143351714332383| Time: 1m 34s
Epoch valid loss: 6.428433150440068
Epoch EM: 2.866603595080416
Epoch F1: 4.1844099085305135
Epoch 2
Starting training ........
Starting batch: 0
Starting validation .........
Starting batch 0
Epoch train loss : 5.625602439626471| Time: 1m 33s
Epoch valid loss: 5.678070783615112
Epoch EM: 4.2951750236518444
Epoch F1: 5.7572854482205775
Epoch 3
Starting training ........
Starting batch: 0
Starting validation .........
Starting batch 0
Epoch train loss : 4.895061938019542| Time: 1m 33s
Epoch valid loss: 5.235319673240959
Epoch EM: 5.10879848628193
Epoch F1: 6.621075770924116
Epoch 4
Starting training ........
Starting batch: 0
Starting validation .........
Starting batch 0
Epoch train loss : 4.362497874668667| Time: 1m 34s
Epoch valid loss: 5.074764686745483
Epoch EM: 5.373699148533586
Epoch F1: 6.813977891061937
Epoch 5
Starting training 

## References

* Papers read/referenced
    1. https://arxiv.org/abs/1704.00051
    2. https://arxiv.org/abs/1606.02858
    3. https://arxiv.org/abs/1409.0473
* Other helpful links
    1. https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
    2. https://github.com/facebookresearch/DrQA
    3. https://github.com/hitvoice/DrQA. Special thanks to [Runqi Yang](https://github.com/hitvoice) who helped me clarify some doubts with respect to preprocessing the SQUAD dataset.
    4. https://towardsdatascience.com/the-definitive-guide-to-bidaf-part-3-attention-92352bbdcb07. Good introduction to attention.
    5. https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1184/lectures/lecture10.pdf. The attention section of this notebook is largely inspired and derived from these slides.
* Following links are related to debugging neural nets. Something on which I was stuck for quite some time during training these models.
    1. https://datascience.stackexchange.com/questions/410/choosing-a-learning-rate
    2. https://www.jeremyjordan.me/nn-learning-rate/
    3. https://towardsdatascience.com/estimating-optimal-learning-rate-for-a-deep-neural-network-ce32f2556ce0
    4. https://towardsdatascience.com/learning-rate-schedules-and-adaptive-learning-rate-methods-for-deep-learning-2c8f433990d1
    5. https://towardsdatascience.com/checklist-for-debugging-neural-networks-d8b2a9434f21
    6. https://arxiv.org/abs/1708.07120
    7. https://sgugger.github.io/how-do-you-find-a-good-learning-rate.html