HotpotQA is a question answering dataset. A datapoint looks like this:

1.   Context: A list of topics. Each topic has multiple sentences describing it.
2.   Question: A question about one particular topic in the context. The model's job is to find the answer among all the topics.
3.   Supporting facts: These state exactly which topic the answer is derived from. We ignore this information in the model below.
4.   Answer: The answer to the question, usually a phrase, name, place or thing.
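
To make the structure concrete, here is a hand-written toy record in the shape the loading code in this notebook expects: `context` is a list of `[title, sentences]` pairs. The titles, sentences, the `[title, sentence_index]` form of `supporting_facts`, and the helper name `flatten_context` are illustrative assumptions, not content taken from the dataset.

```python
# Toy datapoint (invented content) mirroring the JSON layout read by get_examples.
datapoint = {
    "context": [
        ["Topic A", ["Topic A is a river.", " It flows through Town X."]],
        ["Topic B", ["Topic B is a mountain.", " It is near Town Y."]],
    ],
    "question": "Which town does Topic A flow through?",
    "supporting_facts": [["Topic A", 1]],  # ignored by the model
    "answer": "Town X",
}


def flatten_context(context):
    """Join every sentence of every topic into one string, dropping the titles."""
    return "".join(sentence for paragraph in context for sentence in paragraph[1])


flat = flatten_context(datapoint["context"])
print(flat)
```

Note that the sentences are joined without a separator, so the last sentence of one topic runs straight into the first sentence of the next.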

We model this dataset as a seq2seq problem. We combine the context and the question into a single sequence, context_with_question, which we feed to the encoder. To mark the boundary between context and question we insert the separator ' @qpad '. The decoder's job is to generate the answer to the question.
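
A minimal sketch of how the encoder input is assembled, assuming the separator string used in the loading code. The helper names `build_encoder_input` and `split_encoder_input` are hypothetical and only serve the illustration.

```python
QUESTION_PAD = " @qpad "  # separator between context and question


def build_encoder_input(context, question, sep=QUESTION_PAD):
    """Concatenate context and question into one source string."""
    return context + sep + question


def split_encoder_input(text, sep=QUESTION_PAD):
    """Recover (context, question) by splitting at the last separator."""
    context, _, question = text.rpartition(sep)
    return context, question


text = build_encoder_input("Topic A is a river.", "What is Topic A?")
print(text)
print(split_encoder_input(text))
```

Splitting at the last occurrence of the separator keeps the round trip correct even if the context itself happens to contain the separator string.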

We chose this dataset for the length of its contexts. The context can be thought of as a document covering multiple topics, and the model's job is to find the answer to the question in that document.

Initially we focus only on loss and perplexity values, not on exact matches with the answer. The 'pad' token is ignored when computing the decoder loss.
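
The relationship between the two metrics can be sketched in plain Python: perplexity is the exponential of the average cross-entropy, and padded target positions are left out of the average. The pad index `0` and the toy probabilities are assumptions for illustration, and the function assumes at least one non-pad target.

```python
import math

PAD = 0  # hypothetical index of the pad token


def masked_cross_entropy(probs, targets, pad_idx=PAD):
    """Average negative log-likelihood over the non-pad target positions.

    probs:   one probability distribution over the vocabulary per position
    targets: one target token index per position
    """
    losses = [-math.log(p[t]) for p, t in zip(probs, targets) if t != pad_idx]
    return sum(losses) / len(losses)


probs = [[0.1, 0.7, 0.2], [0.1, 0.2, 0.7], [0.9, 0.05, 0.05]]
targets = [1, 2, PAD]  # the last position is padding and is skipped

loss = masked_cross_entropy(probs, targets)
ppl = math.exp(loss)
print(f"loss={loss:.3f} ppl={ppl:.3f}")
```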

In this notebook we add an attention mechanism over the context+question hidden states. The answer decoder then uses this attention.
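
As a rough illustration of what attending over encoder hidden states means, the sketch below scores each encoder state against the current decoder state, normalizes the scores with a softmax, and returns the weighted sum as the context vector. Dot-product scoring is an assumption here, since the notebook does not say which scoring function is intended, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Return (weights, context) for one decoder step using dot-product scores."""
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)
    dim = len(decoder_state)
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states)) for i in range(dim)]
    return weights, context


encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one vector per source token
decoder_state = [2.0, 0.0]

weights, context = attend(decoder_state, encoder_states)
print(weights, context)
```

The first and third encoder states get equal, larger weights because they align with the decoder state, while the second gets the smallest weight.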

The model is mostly memorizing the training data: train loss keeps decreasing in every epoch, while validation loss stops improving after epoch 5 (5.418) and then rises.
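
One way to act on this observation is early stopping on the validation loss. The sketch below replays the validation losses for epochs 1 to 9 from the training log at the end of this notebook. The patience value of 2 and the function name `early_stop_epoch` are assumptions, not part of the original training loop.

```python
# Validation losses for epochs 1-9, copied from the training log in this notebook.
val_losses = [5.733, 5.653, 5.541, 5.435, 5.418, 5.432, 5.453, 5.493, 5.582]


def early_stop_epoch(losses, patience=2):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    stop_epoch is the epoch at which `patience` consecutive epochs without
    improvement have been seen, or None if that never happens.
    """
    best, best_epoch, bad_epochs = float("inf"), 0, 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch
    return best_epoch, None


best_epoch, stop_epoch = early_stop_epoch(val_losses, patience=2)
print(best_epoch, stop_epoch)
```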


In [None]:
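import json
import math
import random  # used for teacher forcing in Seq2Seq.forward
import time
from itertools import chain

import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn import Embedding
# Field, Example, Dataset and BucketIterator belong to the legacy torchtext API;
# from torchtext 0.9 they live under torchtext.legacy, and 0.12 removed them.
from torchtext import data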

In [None]:
%%bash
FILE=/content/hotpot_train_v1.1.json
if [ ! -f "$FILE" ]; then
  #Download dataset
  wget -c "http://curtis.ml.cmu.edu/datasets/hotpot/hotpot_train_v1.1.json"
  wget -c "http://curtis.ml.cmu.edu/datasets/hotpot/hotpot_dev_fullwiki_v1.json"
fi

In [None]:
train_file = "/content/hotpot_train_v1.1.json"
test_file = "/content/hotpot_dev_fullwiki_v1.json"


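def get_examples(file):
    # First version: keeps context, question and answer as separate items.
    # It is redefined further down to merge context and question.
    with open(file) as f:
        records = json.load(f)
    examples = []
    for record in records:
        # Each context entry is [title, sentences]; keep only the sentences.
        context = "".join([sentence for paragraph in record['context']
                           for sentence in paragraph[1]])
        question = record['question']
        answer = record['answer']
        examples.append([context, question, answer])

    return examples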


train_examples = get_examples(train_file)
test_examples = get_examples(test_file)

question_pad = ' @qpad '

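def get_examples(file):
    with open(file) as f:
        records = json.load(f)
    examples = []
    for record in records:
        # Cap the number of examples because RAM is limited (this keeps 50,001).
        # TODO: stream the examples with a generator instead.
        if len(examples) > 50000:
            break
        # Each context entry is [title, sentences]; keep only the sentences.
        context = "".join([sentence for paragraph in record['context']
                           for sentence in paragraph[1]])
        question = record['question']
        answer = record['answer']
        # Encoder input: context, then the separator, then the question.
        examples.append([context + question_pad + question, answer])

    return examples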


train_examples = get_examples(train_file)
test_examples = get_examples(test_file)

context_with_question = data.Field(sequential=True, tokenize='spacy', init_token='<sos>', eos_token='<eos>')
answer = data.Field(sequential=True, tokenize='spacy', init_token='<sos>', eos_token='<eos>')

fields = [('context', context_with_question), ('answer', answer)]

train_Examples = [data.Example.fromlist([i[0], i[1]], fields) for i in train_examples]
train_dataset = data.Dataset(train_Examples, fields)

test_Examples = [data.Example.fromlist([i[0], i[1]], fields) for i in test_examples]
test_dataset = data.Dataset(test_Examples, fields)

context_with_question.build_vocab(train_dataset,min_freq=2,max_size = 30000,vectors = "glove.6B.100d", 
                 unk_init = torch.Tensor.normal_)
# Observation: when the answer vocab is built only from the train dataset,
# the model starts overfitting to the train set.
# Possible solutions:
# 1) build the vocab from both train and test data (feels like cheating)
# 2) use the context vocab everywhere (increases model size)
# Going with option 2.
# answer.build_vocab(train_dataset,vectors = "glove.6B.100d", 
#                  unk_init = torch.Tensor.normal_)
answer.vocab = context_with_question.vocab

BATCH_SIZE = 128
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_iterator, test_iterator = data.BucketIterator.splits((train_dataset, test_dataset),
                                                           batch_size=BATCH_SIZE,
                                                           sort_key=lambda x: len(x.context),
                                                           sort_within_batch=True,device=device)

.vector_cache/glove.6B.zip: 862MB [06:29, 2.22MB/s]                           
100%|█████████▉| 398962/400000 [00:16<00:00, 26018.19it/s]

In [None]:
print(train_examples[0])

['Radio City is India\'s first private FM radio station and was started on 3 July 2001. It broadcasts on 91.1 (earlier 91.0 in most cities) megahertz from Mumbai (where it was started in 2004), Bengaluru (started first in 2001), Lucknow and New Delhi (since 2003). It plays Hindi, English and regional songs. It was launched in Hyderabad in March 2006, in Chennai on 7 July 2006 and in Visakhapatnam October 2007. Radio City recently forayed into New Media in May 2008 with the launch of a music portal - PlanetRadiocity.com that offers music related news, videos, songs, and other music-related features. The Radio station currently plays a mix of Hindi and Regional music. Abraham Thomas is the CEO of the company.Football in Albania existed before the Albanian Football Federation (FSHF) was created. This was evidenced by the team\'s registration at the Balkan Cup tournament during 1929-1931, which started in 1929 (although Albania eventually had pressure from the teams because of competition,

In [None]:
print(len(train_examples))

50001


In [None]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, dropout):
        super().__init__()

        self.hid_dim = hid_dim
        
        self.embedding = nn.Embedding(input_dim, emb_dim)
        
        self.rnn = nn.GRU(emb_dim, hid_dim) #no GRU dropout argument as there is only one layer
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        
        #src = [src len, batch size]
        
        embedded = self.dropout(self.embedding(src))
        
        #embedded = [src len, batch size, emb dim]
        
        outputs, hidden = self.rnn(embedded) #no cell state!
        
        #outputs = [src len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        
        #outputs are always from the top hidden layer
        
        return hidden

In [None]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, dropout):
        super().__init__()

        self.hid_dim = hid_dim
        self.output_dim = output_dim
        
        self.embedding = nn.Embedding(output_dim, emb_dim)
        
        self.rnn = nn.GRU(emb_dim + hid_dim, hid_dim)
        
        self.fc_out = nn.Linear(emb_dim + hid_dim * 2, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, context):
        
        #input = [batch size]
        #hidden = [n layers * n directions, batch size, hid dim]
        #context = [n layers * n directions, batch size, hid dim]
        
        #n layers and n directions in the decoder will both always be 1, therefore:
        #hidden = [1, batch size, hid dim]
        #context = [1, batch size, hid dim]
        
        input = input.unsqueeze(0)
        
        #input = [1, batch size]
        
        embedded = self.dropout(self.embedding(input))
        
        #embedded = [1, batch size, emb dim]
                
        emb_con = torch.cat((embedded, context), dim = 2)
            
        #emb_con = [1, batch size, emb dim + hid dim]
            
        output, hidden = self.rnn(emb_con, hidden)
        
        #output = [seq len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        
        #seq len, n layers and n directions will always be 1 in the decoder, therefore:
        #output = [1, batch size, hid dim]
        #hidden = [1, batch size, hid dim]
        
        output = torch.cat((embedded.squeeze(0), hidden.squeeze(0), context.squeeze(0)), 
                           dim = 1)
        
        #output = [batch size, emb dim + hid dim * 2]
        
        prediction = self.fc_out(output)
        
        #prediction = [batch size, output dim]
        
        return prediction, hidden

In [None]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
        assert encoder.hid_dim == decoder.hid_dim, \
            "Hidden dimensions of encoder and decoder must be equal!"
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        
        #src = [src len, batch size]
        #trg = [trg len, batch size]
        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use ground-truth inputs 75% of the time
        
        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        #tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        
        #last hidden state of the encoder is the context
        context = self.encoder(src)
        
        #context also used as the initial hidden state of the decoder
        hidden = context
        
        #first input to the decoder is the <sos> tokens
        input = trg[0,:]
        
        for t in range(1, trg_len):
            
            #insert input token embedding, previous hidden state and the context state
            #receive output tensor (predictions) and new hidden state
            output, hidden = self.decoder(input, hidden, context)
            
            #place predictions in a tensor holding predictions for each token
            outputs[t] = output
            
            #decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            
            #get the highest predicted token from our predictions
            top1 = output.argmax(1) 
            
            #if teacher forcing, use actual next token as next input
            #if not, use predicted token
            input = trg[t] if teacher_force else top1

        return outputs

In [None]:
INPUT_DIM = len(context_with_question.vocab)
# OUTPUT_DIM = len(answer.vocab)
OUTPUT_DIM = len(context_with_question.vocab)
ENC_EMB_DIM = 100
DEC_EMB_DIM = 100
HID_DIM = 200
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, DEC_DROPOUT)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = Seq2Seq(enc, dec, device).to(device)

In [None]:
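def init_weights(m):
    # model.apply() visits every submodule, so only touch each module's own
    # parameters (recurse=False). Otherwise the parent modules re-initialize
    # the embedding weights that the isinstance check is meant to skip.
    if not isinstance(m, Embedding):
        for name, param in m.named_parameters(recurse=False):
            nn.init.normal_(param.data, mean=0, std=0.01)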
        
model.apply(init_weights)

optimizer = optim.Adam(model.parameters())

TRG_PAD_IDX = answer.vocab.stoi[answer.pad_token]

criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)

In [None]:
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        src = batch.context
        trg = batch.answer
        
        optimizer.zero_grad()
        
        output = model(src, trg)
        
        #trg = [trg len, batch size]
        #output = [trg len, batch size, output dim]
        
        output_dim = output.shape[-1]
        
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)
        
        #trg = [(trg len - 1) * batch size]
        #output = [(trg len - 1) * batch size, output dim]
        
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

In [None]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            src = batch.context
            trg = batch.answer

            output = model(src, trg, 0) #turn off teacher forcing

            #trg = [trg len, batch size]
            #output = [trg len, batch size, output dim]

            output_dim = output.shape[-1]
            
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)

            #trg = [(trg len - 1) * batch size]
            #output = [(trg len - 1) * batch size, output dim]

            loss = criterion(output, trg)

            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

In [None]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [None]:
import random
import math
N_EPOCHS = 10
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, test_iterator, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut2-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')


Epoch: 01 | Time: 2m 51s
	Train Loss: 5.834 | Train PPL: 341.765
	 Val. Loss: 5.733 |  Val. PPL: 308.765
Epoch: 02 | Time: 2m 53s
	Train Loss: 5.130 | Train PPL: 168.942
	 Val. Loss: 5.653 |  Val. PPL: 285.170
Epoch: 03 | Time: 2m 52s
	Train Loss: 4.836 | Train PPL: 125.912
	 Val. Loss: 5.541 |  Val. PPL: 254.845
Epoch: 04 | Time: 2m 55s
	Train Loss: 4.560 | Train PPL:  95.623
	 Val. Loss: 5.435 |  Val. PPL: 229.224
Epoch: 05 | Time: 2m 54s
	Train Loss: 4.338 | Train PPL:  76.550
	 Val. Loss: 5.418 |  Val. PPL: 225.499
Epoch: 06 | Time: 2m 54s
	Train Loss: 4.145 | Train PPL:  63.122
	 Val. Loss: 5.432 |  Val. PPL: 228.493
Epoch: 07 | Time: 2m 54s
	Train Loss: 3.963 | Train PPL:  52.631
	 Val. Loss: 5.453 |  Val. PPL: 233.348
Epoch: 08 | Time: 2m 53s
	Train Loss: 3.795 | Train PPL:  44.495
	 Val. Loss: 5.493 |  Val. PPL: 242.908
Epoch: 09 | Time: 2m 53s
	Train Loss: 3.595 | Train PPL:  36.429
	 Val. Loss: 5.582 |  Val. PPL: 265.648
Epoch: 10 | Time: 2m 54s
	Train Loss: 3.401 | Train PPL