Drop dataset if a Question Answering datset focusing on reading comprehension of models. Each datapoint has:

1.   passage: which we refer as context in this notebook. A passage is derived from wikipedia and its corresponding wikipedia url is mentioned in the key 'wiki_url.
2.   Qa_pairs: For a single passage, the dataset has mutiple questions. Each question has the answer of the following type:

  *   Number: When the answer is a specific number
  *   Date: When the answer is a date
  *   Span: When the answer is a text like names or places

What makes this datset interesting is the presence of multiple questions for the same context. So lets say a batch of 32 contexts might have total 900 questions on them total or 560. This varying number of number of questions for each passage brings up design issues.

A goto choice could be that for each question you create an example like
<context,question,answer>

But this would mean passing the same context again and again through the encoder for questions belonging to the same context which would take time for longer contexts since we are modelling the problem as a seq2seq problem. To avoid this, we create an example like the dataset itself.
<context,(question1..question_n),(answer1..answer_n)>

This required making customField and a customExample class. To have a batch of contexts and yet allow different number of questions in each batch.


The architecture contains three RNN cells. The procedure looks like this:
1. We pass context to an encoder.
2. We then pass the question through a different encoder cell which takes care of different number of questions for each batch of contexts. We use the last hidden vector of a context passed through context encoder as an input to cell for its question.
3. We then use the hidden vectors of the question encoder as an input to the 'answer decoder'.

We are initially only focussing on loss and perplexity and not on exact matches of the answer.

In this notebook, we add attention components to the 'answer decoder'. We do not use attention from the output of the question encoder RNN because questions in the dataset don't have much information. We use attention on the hidden states of 'context encoder' because the answer is part or derived from the context. 












In [1]:
import torch
import json
from torchtext import data
from itertools import chain
import torch.nn as nn
import torch.optim as optim
import time
from torch.nn import Embedding

In [2]:
%%bash
directory=/content/drop_dataset/
if [ ! -d "$directory" ]; then
  # Download drop dataset
  wget -c "https://s3-us-west-2.amazonaws.com/allennlp/datasets/drop/drop_dataset.zip"
  # unzip the file
  unzip drop_dataset.zip
fi

Archive:  drop_dataset.zip
  inflating: drop_dataset/drop_dataset_dev.json  
  inflating: drop_dataset/drop_dataset_train.json  
  inflating: drop_dataset/license.txt  


--2020-12-29 12:14:28--  https://s3-us-west-2.amazonaws.com/allennlp/datasets/drop/drop_dataset.zip
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 52.218.180.216
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.218.180.216|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8308692 (7.9M) [application/zip]
Saving to: ‘drop_dataset.zip’

     0K .......... .......... .......... .......... ..........  0% 3.18M 2s
    50K .......... .......... .......... .......... ..........  1% 3.25M 2s
   100K .......... .......... .......... .......... ..........  1% 3.28M 2s
   150K .......... .......... .......... .......... ..........  2%  224M 2s
   200K .......... .......... .......... .......... ..........  3% 3.35M 2s
   250K .......... .......... .......... .......... ..........  3%  167M 2s
   300K .......... .......... .......... .......... ..........  4%  190M 1s
   350K .......... .......... .......... .......... .........

In [3]:
train_file = "/content/drop_dataset/drop_dataset_train.json"
test_file = "/content/drop_dataset/drop_dataset_dev.json"

train = json.load(open(train_file))
test = json.load(open(test_file))

train_data = []
for i, j in enumerate(train):
    ques_ = []
    ans_ = []
    for l in train[j]['qa_pairs']:
        ques_.append(l['question'])
        m = l['answer']
        if len(m['spans']) > 0:
            ans_.append(" ".join(m['spans']))
        elif len(m['number']) > 0:
            ans_.append(m['number'])
        else:
            ans_.append(" ".join([m['date'][o] for o in m['date']]))
    train_data.append([train[j]['passage'], ques_, ans_])

test_data = []
for i, j in enumerate(test):
    ques_ = []
    ans_ = []
    for l in test[j]['qa_pairs']:
        ques_.append(l['question'])
        m = l['answer']
        if len(m['spans']) > 0:
            ans_.append(" ".join(m['spans']))
        elif len(m['number']) > 0:
            ans_.append(m['number'])
        else:
            ans_.append(" ".join([m['date'][o] for o in m['date']]))
    test_data.append([test[j]['passage'], ques_, ans_])

class CustomExample(data.Example):
    # a problem might happen while trying to convert to tensors or while making vocab
    @classmethod
    def fromCustomList(cls, data, fields):
        ex = cls()
        for index, ((name, field), val) in enumerate(zip(fields, data)):
            if index == 0:
                if isinstance(val, str):
                    val = val.rstrip('\n')
                setattr(ex, name, field.preprocess(val))
            else:
                field_processed_val = []
                if not isinstance(val, list):
                    val = [val]
                for i in val:
                    if isinstance(i, str):
                        i = i.rstrip('\n')
                    field_processed_val.append(field.preprocess(i))
                setattr(ex, name, field_processed_val)
        return ex


class CustomField(data.Field):
    # Creating this class for custom padding and numercalizing of batches of questions
    # problem was batch_size number of passages will have way more number of questions on them
    # so append all question to make 1 batch for the second RNN cell.

    def pad(self, minibatch):
        # If I get OOM, restrict number of questions per passage here
        self.lengths = [len(i) for i in minibatch]
        m = chain.from_iterable(minibatch)
        return super(CustomField, self).pad(m)


context = data.Field(sequential=True, tokenize='spacy', init_token='<sos>', eos_token='<eos>')
question = CustomField(sequential=True, tokenize='spacy', init_token='<sos>', eos_token='<eos>')
answer = CustomField(sequential=True, tokenize='spacy', init_token='<sos>', eos_token='<eos>')

fields = [('context', context), ('question', question), ('answer', answer)]
train_Examples = [
    CustomExample.fromCustomList([i[0], i[1], i[2]], fields) for i
    in train_data]
test_Examples = [
    CustomExample.fromCustomList([i[0], i[1], i[2]], fields) for i
    in test_data]

train_dataset = data.Dataset(train_Examples, fields)
test_dataset = data.Dataset(test_Examples,fields)
context.build_vocab(train_dataset,min_freq=2,max_size = 30000)
question.build_vocab(train_dataset,min_freq=2,max_size = 10000)
answer.build_vocab(train_dataset,min_freq=2,max_size = 5000)

BATCH_SIZE = 32
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_iterator, test_iterator = data.BucketIterator.splits((train_dataset, test_dataset), batch_size=BATCH_SIZE,
                                                            sort_key=lambda x: len(x.context), sort_within_batch=True,device=device)


In [4]:
print("Context:")
print(train_Examples[0].context)
print("Questions :")
print(train_Examples[0].question)
print("Answers:")
print(train_Examples[0].answer)

Context:
['To', 'start', 'the', 'season', ',', 'the', 'Lions', 'traveled', 'south', 'to', 'Tampa', ',', 'Florida', 'to', 'take', 'on', 'the', 'Tampa', 'Bay', 'Buccaneers', '.', 'The', 'Lions', 'scored', 'first', 'in', 'the', 'first', 'quarter', 'with', 'a', '23-yard', 'field', 'goal', 'by', 'Jason', 'Hanson', '.', 'The', 'Buccaneers', 'tied', 'it', 'up', 'with', 'a', '38-yard', 'field', 'goal', 'by', 'Connor', 'Barth', ',', 'then', 'took', 'the', 'lead', 'when', 'Aqib', 'Talib', 'intercepted', 'a', 'pass', 'from', 'Matthew', 'Stafford', 'and', 'ran', 'it', 'in', '28', 'yards', '.', 'The', 'Lions', 'responded', 'with', 'a', '28-yard', 'field', 'goal', '.', 'In', 'the', 'second', 'quarter', ',', 'Detroit', 'took', 'the', 'lead', 'with', 'a', '36-yard', 'touchdown', 'catch', 'by', 'Calvin', 'Johnson', ',', 'and', 'later', 'added', 'more', 'points', 'when', 'Tony', 'Scheffler', 'caught', 'an', '11-yard', 'TD', 'pass', '.', 'Tampa', 'Bay', 'responded', 'with', 'a', '31-yard', 'field', 'goal

In [5]:
print(len(train_Examples))

5565


In [19]:
class Encoder(nn.Module):
  def __init__(self,context_dim,emb_dim,hid_dim,n_layers,dropout,bidirectional):
    super().__init__()
    self.hid_dim = hid_dim
    
    self.embedding = nn.Embedding(context_dim,emb_dim)

    self.rnn = nn.LSTM(input_size=emb_dim,hidden_size = hid_dim,num_layers= n_layers,dropout= dropout,bidirectional = bidirectional)

    self.dropout = nn.Dropout(dropout)
  
  def forward(self,context):
    
    embedded = self.dropout(self.embedding(context))

    outputs, (hidden,cell_state) = self.rnn(embedded)

    return outputs,hidden,cell_state

In [20]:
class Encoder1(nn.Module):
  def __init__(self,question_dim,emb_dim,hid_dim,n_layers,bidirectional,dropout):
    super().__init__()
    self.hid_dim = hid_dim

    self.embedded = nn.Embedding(question_dim,emb_dim)

    self.rnn = nn.LSTM(input_size=emb_dim,hidden_size = hid_dim,num_layers= n_layers,dropout= dropout,bidirectional = bidirectional)

    self.dropout = nn.Dropout(dropout)    
    

  def fill_tensor(self,hidden, cell_state, question):
    # The idea is to make a new tensor that fills the hidden and cell state according to the lengths of the question
    
    hidden_shape = hidden.shape
    question_shape = question.shape
    t_hidden = torch.zeros(hidden_shape[0],question_shape[-1],hidden_shape[-1]).to(device)
    t_cell_state = torch.zeros(hidden_shape[0],question_shape[-1],hidden_shape[-1]).to(device)
    pos = 0
    for i in range(hidden_shape[1]):
      t_hidden[:,pos:pos+self.lengths[i],:] = hidden[:,i,:].unsqueeze(1)
      t_cell_state[:,pos:self.lengths[i],:] = cell_state[:,i,:].unsqueeze(1)
      pos += self.lengths[i]
    return t_hidden,t_cell_state

  def forward(self, question, hidden, cell_state,lengths):
    self.lengths = lengths

    assert sum(lengths) == question.shape[-1]
    input = self.dropout(self.embedded(question))

    # filled tensor
    filled_hidden, filled_cell_state = self.fill_tensor(hidden,cell_state,question)

    outputs , (hidden, cell_state) = self.rnn(input,(filled_hidden,filled_cell_state))

    return outputs ,hidden, cell_state



In [21]:
import torch.nn.functional as F
class Attention(nn.Module):
  def __init__(self, attn_vector_size,hid_dim,bidirectional,n_layers, type_='dot'):
    super().__init__()
    self.attn_vector_size = attn_vector_size
    self.bidirectional = bidirectional
    self.n_layers = n_layers
    self.type_ = type_
    self.hid_dim = hid_dim
    self.directions = 2 if bidirectional else 1
    # The final vector is concatenation of (n_layers*directions) different vectors and the final size is 'attn_vector_size'
    # so it should be divisible with (n_layers*directions) 
    assert attn_vector_size%(n_layers*self.directions) == 0 

    self.transform_ = nn.Linear(self.directions*self.hid_dim,int(attn_vector_size/(n_layers*self.directions)))
  
  def forward(self,encoder_outputs,decoder_hidden_state):
    # Interesting: what happens when there are multiple layers and bidirectionality.. Do I find attention vector for each layer and append it to 
    # input of each layer.. but we give LSTMs input only once..so I think I need a attention vector including information from all layers
    # given as a single input once.

    # Since I am already working with very long sequences..to avoid too many params, I am using dot product as a score function
    if not self.type_ == 'dot': raise NotImplementedError
    
    # The idea is to find attention for each layer and direction of decoder hidden state and later combine all these and give a final vector
    # so the attention input has attention information from all decoder hidden states

    # encoder_outputs : [src_len,batch_size,num_of_directions*hid_dim]
    # decoder_hidden_state : [n_layers*no_of_directions,batch_size,hid_dim]

    decoder_hidden_state_len = len(decoder_hidden_state)
    src_len = encoder_outputs.shape[0]

    encoder_view = encoder_outputs.view(-1,self.directions,self.hid_dim) # encoder_view : [src_len*batch_size,num_of_directions,hid_dim]

    encoder_outputs = encoder_outputs.permute(1,0,2) # encoder_outputs : [batch_size,src_len,num_of_directions*hid_dim]
    all_attention_vectors = []

    for i in range(decoder_hidden_state_len):
      
      # Calculating alpha
      hidden_state = decoder_hidden_state[i] # hidden state : [batch_size,hid_dim]

      hidden_state = hidden_state.unsqueeze(0) # hidden_state : [1,batch_size,hid_dim]
      hidden_state = hidden_state.permute(1,2,0) # hidden_state :[batch_size,hid_dim, 1]
      hidden_state = hidden_state.repeat(src_len,1,1) # hidden_state :[src_len*batch_size,hid_dim, 1]

      temp = torch.bmm(encoder_view,hidden_state) # temp : [src_len*batch_size,num_of_directions,1]
      temp = temp.sum(1) # temp : [src_len*batch_size,1]
      temp = temp.reshape(-1,1,src_len) # temp :[batch_size,1,src_len]

      alpha = F.softmax(temp,dim=-1)  # alpha :[batch_size,1,src_len]

      # Calculating attention vector
      c = torch.bmm(alpha,encoder_outputs) # s: [batch_size,1,num_of_directions*hid_dim]

      transform_c = self.transform_(c) # transform_s : [batch_size,1,attn_vector_size/(n_layers*self.directions)]
      transform_c = transform_c.permute(1,0,2) # transform_s : [1,batch_size,attn_vector_size/(n_layers*self.directions)]
      all_attention_vectors.append(transform_c)


    # return vector of shape [1,batch_size,self.attn_vector_size(emb_dim)]
    return torch.cat(all_attention_vectors,dim = -1)

In [22]:
class Decoder(nn.Module):
  def __init__(self,answer_dim,emb_dim,hid_dim,n_layers,bidirectional,dropout):
    super().__init__()

    ## I wont be passing the hidden state from last LSTM to each input and fc_out of eaxh time step and emb dimension to fc_out
    self.hid_dim = hid_dim

    self.embedded = nn.Embedding(answer_dim,emb_dim)

    self.rnn = nn.LSTM(input_size=2*emb_dim,hidden_size = hid_dim,num_layers= n_layers,dropout= dropout,bidirectional = bidirectional)

    self.attention = Attention(emb_dim,hid_dim,bidirectional,n_layers)

    self.dropout = nn.Dropout(dropout)
    
    no_of_directions = 2 if bidirectional else 1

    self.fc_out = nn.Linear(no_of_directions*hid_dim,answer_dim)

  def forward(self, answer, encoder_outputs, hidden, cell_state):
    # this forward will be 1 step at a time

    # emdeb answer with dropout
    answer = answer.unsqueeze(0)
    embedded = self.dropout(self.embedded(answer))

    attention_vector = self.attention(encoder_outputs,hidden)

    # concatenate embedded with attention vector
    embedded = torch.cat([embedded,attention_vector],dim=-1)

    # pass it through cell
    output , (hidden,cell_state) = self.rnn(embedded,(hidden,cell_state))

    # change dimension of output for fc_out
    # output should be [1,batch_size,no_of_directions*hid_dim]
    output = output.squeeze(0)
    
    # return hidden and cell state
    prediction = self.fc_out(output)

    return prediction, hidden, cell_state

In [23]:

import random
class Seq2Seq(nn.Module):
  def __init__(self,context_dim,question_dim,answer_dim,emb_dim,hid_dim,n_layers,bidirectional,dropout):
    super().__init__()
    self.context_encoder = Encoder(context_dim,emb_dim,hid_dim,n_layers,dropout,bidirectional)
    self.question_encoder = Encoder1(question_dim,emb_dim,hid_dim,n_layers,bidirectional,dropout)
    self.answer_decoder = Decoder(answer_dim,emb_dim,hid_dim,n_layers,bidirectional,dropout)

    self.answer_dim = answer_dim
  def custom_strech(self,context,lengths):
    total_len = sum(lengths)

    #checking batch size in  context is same as lenght of the list 'lenghts'
    assert context.shape[1] == len(lengths)
    t_context = torch.zeros(context.shape[0],total_len,context.shape[-1]).to(device)
    pos = 0
    for j,i in enumerate(lengths):
      t_context[:,pos:pos+i,:] = context[:,j,:].unsqueeze(1)
      pos = pos+i

    return t_context
  
  def forward(self,context,question,answer,lengths,teacher_forcing_ratio = 0.5):

    # encode context
    context_encoder_outputs,hidden,cell_state = self.context_encoder(context)

    # strech encoder context output so that the batch size is same while calculating attention
    context_encoder_outputs = self.custom_strech(context_encoder_outputs,lengths)

    # encode question
    _,hidden,cell_state = self.question_encoder(question,hidden,cell_state,lengths)

    # pass the answer one by one
    answer_len = len(answer)
    batch_size = answer.shape[1]

    outputs = torch.zeros(answer_len,batch_size,self.answer_dim).to(device)

    for i,j in enumerate(range(answer_len)):
      k = answer[j]
      if not i == 0:
        k = prediction.argmax(1) if random.random() < teacher_forcing_ratio else k
      # using attention over context vectors rather than questions
      prediction,hidden,cell_state = self.answer_decoder(k,context_encoder_outputs,hidden,cell_state)
      outputs[j] = prediction

    return outputs


In [24]:
context_dim = len(context.vocab)
question_dim = len(question.vocab)
answer_dim = len(answer.vocab)
emb_dim = 50
hid_dim = 50
n_layers = 1
bidirectional = True
dropout = 0.5

model = Seq2Seq(context_dim,question_dim,answer_dim,emb_dim,hid_dim,n_layers,bidirectional,dropout).to(device)

def init_weights(m):
    for name, param in m.named_parameters():
      if not isinstance(m, Embedding):
        nn.init.normal_(param.data, mean=0, std=0.01)
        
model.apply(init_weights)

optimizer = optim.Adam(model.parameters())


TRG_PAD_IDX = answer.vocab.stoi[answer.pad_token]

criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)

  "num_layers={}".format(dropout, num_layers))


In [25]:
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        context_ = batch.context
        question_ = batch.question
        answer_ = batch.answer
        lengths = question.lengths # I know this is not the right way to do..maybe create a new filed for lenghts but they depend of the chosen batch??
        
        optimizer.zero_grad()
        
        output = model(context_, question_,answer_,lengths)
        
        trg = answer_
        #trg = [trg len, batch size]
        #output = [trg len, batch size, output dim]
        
        output_dim = output.shape[-1]
        
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)
        
        #trg = [(trg len - 1) * batch size]
        #output = [(trg len - 1) * batch size, output dim]
        
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

In [26]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            context_ = batch.context
            question_ = batch.question
            answer_ = batch.answer
            lengths = question.lengths
        

            output = model(context_, question_,answer_,lengths,0) #turn off teacher forcing

            trg = answer_
            #trg = [trg len, batch size]
            #output = [trg len, batch size, output dim]

            output_dim = output.shape[-1]
            
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)

            #trg = [(trg len - 1) * batch size]
            #output = [(trg len - 1) * batch size, output dim]

            loss = criterion(output, trg)

            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

In [27]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [28]:
import math
N_EPOCHS = 10
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, test_iterator, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut2-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 2m 4s
	Train Loss: 5.240 | Train PPL: 188.745
	 Val. Loss: 4.057 |  Val. PPL:  57.780
Epoch: 02 | Time: 2m 3s
	Train Loss: 4.369 | Train PPL:  78.936
	 Val. Loss: 3.732 |  Val. PPL:  41.743
Epoch: 03 | Time: 2m 3s
	Train Loss: 4.083 | Train PPL:  59.320
	 Val. Loss: 3.389 |  Val. PPL:  29.641
Epoch: 04 | Time: 2m 4s
	Train Loss: 3.982 | Train PPL:  53.623
	 Val. Loss: 3.336 |  Val. PPL:  28.112
Epoch: 05 | Time: 2m 1s
	Train Loss: 3.876 | Train PPL:  48.231
	 Val. Loss: 3.198 |  Val. PPL:  24.472
Epoch: 06 | Time: 2m 1s
	Train Loss: 3.675 | Train PPL:  39.447
	 Val. Loss: 3.025 |  Val. PPL:  20.597
Epoch: 07 | Time: 2m 4s
	Train Loss: 3.487 | Train PPL:  32.685
	 Val. Loss: 2.899 |  Val. PPL:  18.153
Epoch: 08 | Time: 2m 4s
	Train Loss: 3.364 | Train PPL:  28.902
	 Val. Loss: 2.745 |  Val. PPL:  15.560
Epoch: 09 | Time: 2m 3s
	Train Loss: 3.272 | Train PPL:  26.355
	 Val. Loss: 2.657 |  Val. PPL:  14.248
Epoch: 10 | Time: 2m 0s
	Train Loss: 3.174 | Train PPL:  23.898
