Natural Question Dataset:
Official Link:
This dataset contains really long context/passages and a single question based on the passage. The answer to the questions are subparts of the passage given in the form of <start_token> and <end_token>.
What makes this dataset really challenging is the passage and answer length.

I chose this dataset because it will be interesting to work with transformers with such long sequnces. When I tried this dataset with a seq2seq I kept getting OOM. The reason being exterme lenghts of contexts and their answers. The context are entire wikipedia pages in HTML format so there length is typically over 2000.

I started experimenting with original dataset. I kept watering down the task. First reducing the number of examples, then answer length, batch_size but I kept getting OOM. Finally I restricted context lengths to 1000.

The problem was in the sequence length of context. If I restrict them to 1000, the model has good enough memory to work with. No of samples has not much effect. And batch size does. I will remove this 1000 length restriction with transformers.

Note: this notebook is just about experimenting with this dataset. Its not about getting SOTA or admirable results. For example, in this notebook we only consider contexts with lenghts less than 1000.

Also, in this notebook we add attention components for answer_decoder. Attention uses only the context encoder hidden states and not question encoder hidden states.

In [1]:
import torch
import json
from torchtext import data
from itertools import chain
import torch.nn as nn
import torch.optim as optim
import time
from torch.nn import Embedding

In [2]:
%%bash
file=/content/train.jsonl
if [ ! -f "$file" ]; then
  # Since the dataset itself is huge, we will make train and test set from the original train file itself
  # A simple wget wont be able to get the natural questions dataset present at https://ai.google.com/research/NaturalQuestions/download
  # Use the advice given in https://www.kaggle.com/c/deepfake-detection-challenge/discussion/121194 
  # the download link is https://storage.cloud.google.com/natural_questions/v1.0-simplified/simplified-nq-train.jsonl.gz 
  
  # put curl command here
  # command -o /content/train.jsonl.gz
  
  zcat /content/train.jsonl.gz > /content/train.jsonl
  
fi

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0  0 4497M    0 32.0M    0     0  34.8M      0  0:02:08 --:--:--  0:02:08 34.8M  2 4497M    2  114M    0     0  60.9M      0  0:01:13  0:00:01  0:01:12 60.9M  4 4497M    4  200M    0     0  63.3M      0  0:01:11  0:00:03  0:01:08 63.3M  6 4497M    6  280M    0     0  71.3M      0  0:01:03  0:00:03  0:01:00 71.2M  8 4497M    8  364M    0     0  74.6M      0  0:01:00  0:00:04  0:00:56 74.6M 10 4497M   10  464M    0     0  77.5M      0  0:00:57  0:00:05  0:00:52 85.2M 11 4497M   11  512M    0     0  73.8M      0  0:01:00  0:00:06  0:00:54 78.6M 14 4497M   14  631M    0     0  80.0M      0  0:00:56  0:00:07  0:00:49 91.2M 15 4497M   15  719M    0     0  80.9M      0  0:00:55  0:00:08  0:00:47 88.5M 17 4497M   17  800M    0     0  78.1M      0  0:00

In [7]:
# In the beginning restrciting total number of examples to 10k. 
total_limit = 1000
context_length = 1000
span_length = 200

file = "/content/train.jsonl"
f = open(file)

total = 0
examples = []
for i,line in enumerate(f):
    if total >total_limit:
        break
    ak = json.loads(line)
    context = ak['document_text']
    question = ak['question_text']
    start = ak['annotations'][0]['long_answer']['start_token']
    end = ak['annotations'][0]['long_answer']['end_token']
    try:
        assert start < end and end-start > span_length and len(context.split(" ")) <context_length
    except AssertionError:
        continue
    answer = " ".join(context.split(" ")[start:end])
    examples.append([context,question,answer])
    total += 1
f.close()


In [8]:
f = open("/content/pruned_examples","w")
for i in examples:
  f.write(str(i))
  f.write("\n")

f.close()

In [9]:
# Will do the architecture with seperate encoders for context and question and a decoder for answer
context = data.Field(sequential=True, tokenize='spacy', init_token='<sos>', eos_token='<eos>')
question = data.Field(sequential=True, tokenize='spacy', init_token='<sos>', eos_token='<eos>')
answer = data.Field(sequential=True, tokenize='spacy', init_token='<sos>', eos_token='<eos>')

fields = [('context', context), ('question', question), ('answer', answer)]

Examples = [data.Example.fromlist([i[0], i[1], i[2]], fields) for i in examples]
Dataset = data.Dataset(Examples, fields)

train_dataset,valid_dataset = Dataset.split(split_ratio=[0.85,0.15])

context.build_vocab(train_dataset,min_freq=2,max_size = 20000)
question.build_vocab(train_dataset,min_freq=2,max_size = 5000)
answer.build_vocab(train_dataset,min_freq=2,max_size = 10000)


BATCH_SIZE = 32
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_iterator, valid_iterator = data.BucketIterator.splits((train_dataset, valid_dataset), batch_size=32,
                                                            sort_key=lambda x: len(x.context),
                                                            sort_within_batch=True,device=device)

In [10]:
print(Examples[0])
print(len(Examples))

<torchtext.data.example.Example object at 0x7fd174d055f8>
247


Only 247 examples with context length less than 1000!!

In [11]:
class Encoder(nn.Module):
  def __init__(self,input_dim,emb_dim,hid_dim,n_layers,dropout,bidirectional):
    super().__init__()
    self.hid_dim = hid_dim
    
    self.embedding = nn.Embedding(input_dim,emb_dim)

    self.rnn = nn.LSTM(input_size=emb_dim,hidden_size = hid_dim,num_layers= n_layers,dropout= dropout,bidirectional = bidirectional)

    self.dropout = nn.Dropout(dropout)
  
  def forward(self,input,hidden=None,cell_state=None):
    
    embedded = self.dropout(self.embedding(input))

    if not hidden == None:
      outputs, (hidden,cell_state) = self.rnn(embedded,(hidden,cell_state))
    else:
      outputs, (hidden,cell_state) = self.rnn(embedded)

    return outputs ,hidden,cell_state

In [12]:
class Attention(nn.Module):
  def __init__(self, attn_vector_size,hid_dim,bidirectional,n_layers, type_='dot'):
    super().__init__()
    self.attn_vector_size = attn_vector_size
    self.bidirectional = bidirectional
    self.n_layers = n_layers
    self.type_ = type_
    self.hid_dim = hid_dim
    self.directions = 2 if bidirectional else 1
    # The final vector is concatenation of (n_layers*directions) different vectors and the final size is 'attn_vector_size'
    # so it should be divisible with (n_layers*directions) 
    assert attn_vector_size%(n_layers*self.directions) == 0 

    self.transform_ = nn.Linear(self.directions*self.hid_dim,int(attn_vector_size/(n_layers*self.directions)))
  
  def forward(self,encoder_outputs,decoder_hidden_state):
    # Interesting: what happens when there are multiple layers and bidirectionality.. Do I find attention vector for each layer and append it to 
    # input of each layer.. but we give LSTMs input only once..so I think I need a attention vector including information from all layers
    # given as a single input once.

    # Since I am already working with very long sequences..to avoid too many params, I am using dot product as a score function
    if not self.type_ == 'dot': raise NotImplementedError
    
    # The idea is to find attention for each layer and direction of decoder hidden state and later combine all these and give a final vector
    # so the attention input has attention information from all decoder hidden states

    # encoder_outputs : [src_len,batch_size,num_of_directions*hid_dim]
    # decoder_hidden_state : [n_layers*no_of_directions,batch_size,hid_dim]

    with torch.no_grad():
      decoder_hidden_state_len = len(decoder_hidden_state)
      src_len = encoder_outputs.shape[0]

      encoder_view = encoder_outputs.view(-1,self.directions,self.hid_dim) # encoder_view : [src_len*batch_size,num_of_directions,hid_dim]

      encoder_outputs = encoder_outputs.permute(1,0,2) # encoder_outputs : [batch_size,src_len,num_of_directions*hid_dim]
      all_attention_vectors = []

    for i in range(decoder_hidden_state_len):
      
      with torch.no_grad():
        # Calculating alpha
        hidden_state = decoder_hidden_state[i] # hidden state : [batch_size,hid_dim]

        hidden_state = hidden_state.unsqueeze(0) # hidden_state : [1,batch_size,hid_dim]
        hidden_state = hidden_state.permute(1,2,0) # hidden_state :[batch_size,hid_dim, 1]
        hidden_state = hidden_state.repeat(src_len,1,1) # hidden_state :[src_len*batch_size,hid_dim, 1]

        temp = torch.bmm(encoder_view,hidden_state) # temp : [src_len*batch_size,num_of_directions,1]
        temp = temp.sum(1) # temp : [src_len*batch_size,1]
        temp = temp.reshape(-1,1,src_len) # temp :[batch_size,1,src_len]

        alpha = F.softmax(temp,dim=-1)  # alpha :[batch_size,1,src_len]

        # Calculating attention vector
        c = torch.bmm(alpha,encoder_outputs) # s: [batch_size,1,num_of_directions*hid_dim]

      transform_c = self.transform_(c) # transform_s : [batch_size,1,attn_vector_size/(n_layers*self.directions)]
      transform_c = transform_c.permute(1,0,2) # transform_s : [1,batch_size,attn_vector_size/(n_layers*self.directions)]
      all_attention_vectors.append(transform_c)


    # return vector of shape [1,batch_size,self.attn_vector_size(emb_dim)]
    return torch.cat(all_attention_vectors,dim = -1)

In [13]:
class Decoder(nn.Module):
  def __init__(self,output_dim,emd_dim,hid_dim,n_layers,bidirectional,dropout):
    super().__init__()

    self.embedded = nn.Embedding(output_dim,emb_dim)

    # Input dims will be 2*emb_size because it will also recieve attention information
    self.rnn = nn.LSTM(input_size=2*emb_dim,hidden_size=hid_dim,num_layers=n_layers,bidirectional=bidirectional,dropout=dropout)

    self.attention = Attention(emd_dim,hid_dim,bidirectional,n_layers)

    self.dropout = nn.Dropout(dropout)

    no_of_directions = 2 if bidirectional else 1

    # I am not passing attention vector to help in output prediction
    self.fc_out = nn.Linear(no_of_directions*hid_dim,output_dim)

  def forward(self,input,encoder_outputs,hidden,cell_state):
    input = input.unsqueeze(0)

    input = self.dropout(self.embedded(input))

    # Attention
    attention_vector = self.attention(encoder_outputs,hidden)

    # concat input and attention vector
    input = torch.cat([input,attention_vector],dim=-1)

    output, (hidden,cell_state) = self.rnn(input,(hidden,cell_state))

    output = output.squeeze(0)
    
    prediction = self.fc_out(output)

    return prediction, hidden , cell_state


In [14]:
import random
class Seq2seq(nn.Module):
  def __init__(self,context_dim,question_dim,answer_dim,emd_dim,hid_dim,n_layers,bidirectional,dropout):
    super().__init__()

    self.context_encoder = Encoder(context_dim,emd_dim,hid_dim,n_layers,dropout,bidirectional)
    self.question_encoder = Encoder(question_dim,emd_dim,hid_dim,n_layers,dropout,bidirectional)
    self.answer_decoder = Decoder(answer_dim,emd_dim,hid_dim,n_layers,bidirectional,dropout)

    self.answer_dim = answer_dim

  def forward(self,context,question,answer,teacher_forcing =0.5):

    encoder_outputs, hidden,cell_state = self.context_encoder(context)

    _,hidden,cell_state = self.question_encoder(question,hidden,cell_state)

    answer_len = len(answer)
    batch_size = answer.shape[1]

    outputs = torch.zeros(answer_len,batch_size,self.answer_dim).to(device)

    for i,j in enumerate(range(answer_len)):
      k = answer[j]
      if i != 0:
        k = prediction.argmax(1) if random.random() < teacher_forcing else k
      prediction, hidden, cell_state = self.answer_decoder(k,encoder_outputs,hidden,cell_state)
      outputs[j] = prediction

    return outputs

In [21]:
context_dim = len(context.vocab)
question_dim = len(question.vocab)
answer_dim = len(answer.vocab)
emb_dim = 100
hid_dim = 100
n_layers = 1
bidirectional = False
dropout = 0.5

model = Seq2seq(context_dim,question_dim,answer_dim,emb_dim,hid_dim,n_layers,bidirectional,dropout).to(device)

def init_weights(m):
    for name, param in m.named_parameters():
      # if not isinstance(m, Embedding):
      nn.init.normal_(param.data, mean=0, std=0.01)
        
model.apply(init_weights)

optimizer = optim.Adam(model.parameters())


TRG_PAD_IDX = answer.vocab.stoi[answer.pad_token]

criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)

  "num_layers={}".format(dropout, num_layers))


In [22]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 1,870,156 trainable parameters


In [23]:
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        context_ = batch.context
        question_ = batch.question
        answer_ = batch.answer
        
        optimizer.zero_grad()
        
        output = model(context_, question_,answer_)
        
        trg = answer_
        #trg = [trg len, batch size]
        #output = [trg len, batch size, output dim]
        
        output_dim = output.shape[-1]
        
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)
        
        #trg = [(trg len - 1) * batch size]
        #output = [(trg len - 1) * batch size, output dim]
        
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

In [24]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            context_ = batch.context
            question_ = batch.question
            answer_ = batch.answer
        
            output = model(context_, question_,answer_,0) #turn off teacher forcing

            trg = answer_
            #trg = [trg len, batch size]
            #output = [trg len, batch size, output dim]

            output_dim = output.shape[-1]
            
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)

            #trg = [(trg len - 1) * batch size]
            #output = [(trg len - 1) * batch size, output dim]

            loss = criterion(output, trg)

            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

In [25]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [26]:
import math
import torch.nn.functional as F

N_EPOCHS = 20
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut2-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 0m 26s
	Train Loss: 8.348 | Train PPL: 4222.266
	 Val. Loss: 8.325 |  Val. PPL: 4124.963
Epoch: 02 | Time: 0m 26s
	Train Loss: 8.281 | Train PPL: 3949.416
	 Val. Loss: 8.129 |  Val. PPL: 3391.024
Epoch: 03 | Time: 0m 26s
	Train Loss: 7.323 | Train PPL: 1514.181
	 Val. Loss: 6.159 |  Val. PPL: 473.176
Epoch: 04 | Time: 0m 26s
	Train Loss: 5.937 | Train PPL: 378.985
	 Val. Loss: 4.935 |  Val. PPL: 139.111
Epoch: 05 | Time: 0m 26s
	Train Loss: 5.099 | Train PPL: 163.874
	 Val. Loss: 4.047 |  Val. PPL:  57.237
Epoch: 06 | Time: 0m 26s
	Train Loss: 4.550 | Train PPL:  94.597
	 Val. Loss: 3.593 |  Val. PPL:  36.334
Epoch: 07 | Time: 0m 26s
	Train Loss: 4.407 | Train PPL:  82.062
	 Val. Loss: 3.447 |  Val. PPL:  31.419
Epoch: 08 | Time: 0m 26s
	Train Loss: 4.422 | Train PPL:  83.293
	 Val. Loss: 3.418 |  Val. PPL:  30.512
Epoch: 09 | Time: 0m 27s
	Train Loss: 4.410 | Train PPL:  82.303
	 Val. Loss: 3.406 |  Val. PPL:  30.131
Epoch: 10 | Time: 0m 26s
	Train Loss: 4.404 | Trai