# Generating Questions with the Same/Similar Meaning and Context

## 1. Data Preparation 

### Download the data 

In [None]:
!wget -q http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv

In [None]:
import pandas as pd 

df = pd.read_csv('/content/quora_duplicate_questions.tsv',delimiter='\t')
df.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


In [None]:
len(df)

404290

Print a few question pairs to get a feel for the data.

In [None]:
for idx in range(10):
    row = df.iloc[idx]
    label = "Duplicate" if row.is_duplicate == 1 else "Different"
    # "\033[1m" and "\033[0m" switch bold text on and off in the cell output
    print(("\033[1m" + label + "\033[0m").center(100))
    print("\033[1m" + "Q1" + "\033[0m" + ":", row.question1)
    # f-string formatting also copes with a missing (NaN) question
    print("\033[1m" + "Q2" + "\033[0m" + ":", f"{row.question2}\n")

                                         Different
Q1: What is the step by step guide to invest in share market in india?
Q2: What is the step by step guide to invest in share market?

                                         Different
Q1: What is the story of Kohinoor (Koh-i-Noor) Diamond?
Q2: What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?

                                         Different
Q1: How can I increase the speed of my internet connection while using a VPN?
Q2: How can Internet speed be increased by hacking through DNS?

                                         Different
Q1: Why am I mentally very lonely? How can I solve it?
Q2: Find the remainder when [math]23^{24}[/math] i

In [None]:
len(df[df.is_duplicate==1])

149263
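
As a quick check on class balance, the two counts printed above give the share of duplicate pairs. The snippet is plain arithmetic on those numbers; the variable names are illustrative and not part of the notebook.

```python
# Counts taken from the notebook outputs above
total_pairs = 404290
duplicate_pairs = 149263

duplicate_share = duplicate_pairs / total_pairs
print(f"Duplicate pairs: {duplicate_share:.1%} of the dataset")
```

A bit more than a third of the pairs are duplicates, so filtering on `is_duplicate` still leaves a sizeable training set.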

Keep only the pairs marked as duplicates, since those are the paraphrases the model will learn from, and drop every column except question1 and question2.

In [None]:
df_duplicate = df[df.is_duplicate==1].reset_index() 
                
df_duplicate = df_duplicate.drop(["index","id","qid1","qid2","is_duplicate"],axis=1)
df_duplicate.head(10)

Unnamed: 0,question1,question2
0,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan..."
1,How can I be a good geologist?,What should I do to be a great geologist?
2,How do I read and find my YouTube comments?,How can I see all my Youtube comments?
3,What can make Physics easy to learn?,How can you make physics easy to learn?
4,What was your first sexual experience like?,What was your first sexual experience?
5,What would a Trump presidency mean for current...,How will a Trump presidency affect the student...
6,What does manipulation mean?,What does manipulation means?
7,Why are so many Quora users posting questions ...,Why do people ask Quora questions which can be...
8,Why do rockets look white?,Why are rockets and boosters painted white?
9,How should I prepare for CA final law?,How one should know that he/she completely pre...


## 2. Data Processing 

In [None]:
!pip install spacy --upgrade --quiet
# 'en' is a deprecated shortcut as of spaCy v3.0; use the full pipeline package name
!python -m spacy download en_core_web_sm --quiet

import torch
import torch.nn as nn
import torch.optim as optim
from torchtext.legacy.data import Field, BucketIterator

import spacy
import numpy as np

import random
import math
import time

2021-06-24 09:42:58.832396: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
⚠ As of spaCy v3.0, shortcuts like 'en' are deprecated. Please use the
full pipeline package name 'en_core_web_sm' instead.
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


#### Split the data into training and validation

In [None]:
from sklearn.model_selection import train_test_split

train_df,valid_df = train_test_split(df_duplicate,test_size=0.2)
len(train_df),len(valid_df)

(119410, 29853)
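
The split sizes follow from the 149,263 duplicate pairs and `test_size=0.2`. The sketch below reproduces the arithmetic without calling scikit-learn; it assumes the library rounds the held-out share up and gives the remainder to training, so treat that rounding rule as an assumption rather than a statement about its internals.

```python
import math

n_samples = 149263   # duplicate pairs, from the notebook output
test_size = 0.2

n_valid = math.ceil(test_size * n_samples)   # assumed rounding rule
n_train = n_samples - n_valid
print(n_train, n_valid)
```

Because `train_test_split` is called without a `random_state`, the rows that land in each split will differ between runs even though the sizes stay the same.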

In [None]:
train_df.head()

Unnamed: 0,question1,question2
90533,Which programming language is the best to lear...,What language should I learn first?
103293,Why do you(or not) support Israel?,Do you support the existence of Israel? Why?
62720,What are Common Preparation Mistakes of IIT JE...,What are common mistakes students do in IIT JE...
28357,Why are humans born?,Why were humans born?
27123,How is Donald Trump winning?,Why did American people elect Donald Trump as ...


In [None]:
valid_df.head()

Unnamed: 0,question1,question2
34152,Is joining coaching center necessary to clear ...,Is coaching necessary to crack JEE Advanced?
30878,What are common symptoms of bipolar disorder?,What are the symptoms of bipolar disorder?
67545,What are some songs that everyone must listen to?,What are some good songs to listen to?
90110,Who would win in a war between Russia and the US?,If America went to war with Russia who would w...
101646,What are two ways the U.S. Constitution can be...,How can an amendment to the U.S. Constitution ...


In [None]:
train_df.to_csv('train.csv',index=False),valid_df.to_csv('valid.csv',index=False)

(None, None)

#### Load the tokenizer 

In [None]:
spacy_en = spacy.load('en_core_web_sm')  # separate name so the spacy module is not overwritten

In [None]:
def tokenize_en(text):
    """
    Tokenizes English text from a string into a list of strings (tokens)
    """
    return [tok.text for tok in spacy_en.tokenizer(text)]

In [None]:
Q1 = Field(tokenize = tokenize_en, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True)

Q2 = Field(tokenize = tokenize_en, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True)
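
Roughly speaking, each `Field` tokenizes a question, lowercases the tokens, and wraps the sequence in `<sos>` and `<eos>` markers when a batch is built. The sketch below imitates those steps in plain Python. `preprocess` is a hypothetical helper, and a whitespace split stands in for the spaCy tokenizer, which splits punctuation differently.

```python
def preprocess(text, init_token="<sos>", eos_token="<eos>"):
    """Hypothetical stand-in for the Field pipeline (not torchtext itself)."""
    tokens = text.split()                      # stand-in for tokenize_en
    tokens = [tok.lower() for tok in tokens]   # mirrors lower = True
    return [init_token] + tokens + [eos_token]

print(preprocess("Why are humans born ?"))
```

The decoder relies on these markers: `<sos>` is its first input, and `<eos>` tells it where a question ends.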

In [None]:
fields = [('q1', Q1),('q2', Q2)]

#### Create the Datasets That Feed the Iterators

In [None]:
from torchtext.legacy import data

train_data, valid_data = data.TabularDataset.splits(
                                        path = '/content',
                                        train = 'train.csv',
                                        validation = 'valid.csv',
                                        format = 'csv',
                                        fields = fields,
                                        skip_header = True
)

In [None]:
len(train_data),len(valid_data)

(119410, 29853)

In [None]:
print(f"Number of training examples: {len(train_data.examples)}")
print(f"Number of validation examples: {len(valid_data.examples)}")

Number of training examples: 119410
Number of validation examples: 29853


#### Look into Dataset structure 

In [None]:
for i in range(5):
  print(vars(train_data.examples[i]),"\n")

{'q1': ['which', 'programming', 'language', 'is', 'the', 'best', 'to', 'learn', 'first', '?'], 'q2': ['what', 'language', 'should', 'i', 'learn', 'first', '?']} 

{'q1': ['why', 'do', 'you(or', 'not', ')', 'support', 'israel', '?'], 'q2': ['do', 'you', 'support', 'the', 'existence', 'of', 'israel', '?', 'why', '?']} 

{'q1': ['what', 'are', 'common', 'preparation', 'mistakes', 'of', 'iit', 'jee', 'advanced', '?'], 'q2': ['what', 'are', 'common', 'mistakes', 'students', 'do', 'in', 'iit', 'jee', 'preparation', '?']} 

{'q1': ['why', 'are', 'humans', 'born', '?'], 'q2': ['why', 'were', 'humans', 'born', '?']} 

{'q1': ['how', 'is', 'donald', 'trump', 'winning', '?'], 'q2': ['why', 'did', 'american', 'people', 'elect', 'donald', 'trump', 'as', 'their', 'president', '?']} 



#### Build Vocabulary 

In [None]:
Q1.build_vocab(train_data, min_freq = 2)
Q2.build_vocab(train_data, min_freq = 2)
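
With `min_freq = 2`, a token enters the vocabulary only if it appears at least twice in the training data; anything rarer is mapped to `<unk>`. The following is a simplified sketch of that idea in plain Python. The `build_vocab` function here is a hypothetical stand-in, and the ordering of special tokens and ties is an assumption rather than a guarantee about torchtext.

```python
from collections import Counter

def build_vocab(tokenized_texts, min_freq=2,
                specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    """Simplified, hypothetical frequency-thresholded vocabulary."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    # Keep tokens seen at least min_freq times, most frequent first
    kept = sorted((tok for tok, c in counts.items() if c >= min_freq),
                  key=lambda tok: (-counts[tok], tok))
    itos = list(specials) + kept
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

corpus = [["why", "are", "humans", "born", "?"],
          ["why", "were", "humans", "born", "?"]]
itos, stoi = build_vocab(corpus)

unk = stoi["<unk>"]
encoded = [stoi.get(tok, unk) for tok in ["why", "are", "humans", "born", "?"]]
print(itos)
print(encoded)
```

In this toy corpus "are" and "were" each occur once, so both fall back to the `<unk>` index when encoded.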

In [None]:
print(f"Unique tokens in source (Q1) vocabulary: {len(Q1.vocab)}")

Unique tokens in source (Q1) vocabulary: 15367


In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

#### Create the DataIterators needed for training and validation

In [None]:
BATCH_SIZE = 128

train_iterator, valid_iterator = BucketIterator.splits(
    (train_data, valid_data), 
    batch_size = BATCH_SIZE,
    sort=False,
    device = device)

#### Size and Sanity Check 

In [None]:
example = next(iter(train_iterator))

In [None]:
example.q1.shape,example.q2.shape

(torch.Size([32, 128]), torch.Size([26, 128]))
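
The shapes read as [sequence length, batch size]. Each batch is padded to its own longest sequence, which is why `q1` and `q2` can have different first dimensions. A minimal sketch of that padding and layout in plain Python, with `pad_batch` as a hypothetical helper:

```python
def pad_batch(sequences, pad_token="<pad>"):
    """Pad to the longest sequence, then transpose to [seq len, batch size]."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_token] * (max_len - len(seq)) for seq in sequences]
    # zip(*rows) turns batch-major rows into time-major columns
    return [list(step) for step in zip(*padded)]

batch = [["<sos>", "why", "are", "humans", "born", "?", "<eos>"],
         ["<sos>", "how", "?", "<eos>"]]
time_major = pad_batch(batch)
print(len(time_major), len(time_major[0]))  # seq len, batch size
```

The time-major layout matches what `nn.LSTM` expects by default, so no transposition is needed before the encoder.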

## 3. Model Building

#### Encoder Architecture

In [None]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(input_dim, emb_dim)
        
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout = dropout)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        
        #src = [src len, batch size]
        
        embedded = self.dropout(self.embedding(src))
        
        #embedded = [src len, batch size, emb dim]
        
        outputs, (hidden, cell) = self.rnn(embedded)
        
        #outputs = [src len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #outputs are always from the top hidden layer
        
        return hidden, cell

In [None]:
INPUT_DIM = len(Q1.vocab)
EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
DROPOUT = 0.5
enc_test = Encoder(INPUT_DIM,EMB_DIM,HID_DIM,N_LAYERS,DROPOUT).to(device)

#### Sanity and Shape check 

In [None]:
example.q1.shape

torch.Size([31, 128])

In [None]:
hidden,cell = enc_test(example.q1)
hidden.shape,cell.shape

(torch.Size([2, 128, 512]), torch.Size([2, 128, 512]))

#### Decoder Architecture

In [None]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        
        self.output_dim = output_dim
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(output_dim, emb_dim)
        
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout = dropout)
        
        self.fc_out = nn.Linear(hid_dim,output_dim)
        
        self.dropout = nn.Dropout(dropout)

    def forward(self, input, hidden, cell):
        
        #input = [batch size]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #n directions in the decoder will always be 1, therefore:
        #hidden = [n layers, batch size, hid dim]
        #cell = [n layers, batch size, hid dim]
        
        input = input.unsqueeze(0)
        
        #input = [1, batch size]
        
        embedded = self.dropout(self.embedding(input))
        
        #embedded = [1, batch size, emb dim]
                
        output, (hidden, cell) = self.rnn(embedded, (hidden, cell))
        
        #output = [seq len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #seq len and n directions will always be 1 in the decoder, therefore:
        #output = [1, batch size, hid dim]
        #hidden = [n layers, batch size, hid dim]
        #cell = [n layers, batch size, hid dim]
        
        prediction = self.fc_out(output.squeeze(0))
        
        #prediction = [batch size, output dim]
        
        return prediction, hidden, cell

In [None]:
OUTPUT_DIM = len(Q2.vocab)
EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
DROPOUT = 0.5
dec_test = Decoder(OUTPUT_DIM,EMB_DIM,HID_DIM,N_LAYERS,DROPOUT).to(device)

#### Seq2Seq Module Integrating Encoder and Decoder

In [None]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
        assert encoder.hid_dim == decoder.hid_dim, \
            "Hidden dimensions of encoder and decoder must be equal!"
        assert encoder.n_layers == decoder.n_layers, \
            "Encoder and decoder must have equal number of layers!"
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        
        #src = [src len, batch size]
        #trg = [trg len, batch size]
        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use ground-truth inputs 75% of the time
        
        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        #tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        
        #last hidden state of the encoder is used as the initial hidden state of the decoder
        hidden, cell = self.encoder(src)
        
        #first input to the decoder is the <sos> tokens
        input = trg[0,:]

        for t in range(1, trg_len):
            
            #insert input token embedding, previous hidden and previous cell states
            #receive output tensor (predictions) and new hidden and cell states
            output, hidden, cell = self.decoder(input, hidden, cell)
            
            #place predictions in a tensor holding predictions for each token
            outputs[t] = output
            
            #decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            
            #get the highest predicted token from our predictions
            top1 = output.argmax(1) 
            
            #if teacher forcing, use actual next token as next input
            #if not, use predicted token
            input = trg[t] if teacher_force else top1
        
        return outputs
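
The teacher-forcing decision in `forward` can be isolated from the network itself. The sketch below records which token would be fed to the decoder at each step; `decode` and `always_what` are hypothetical helpers, and the toy "model" simply predicts the same word every time.

```python
import random

def decode(target, predict_next, teacher_forcing_ratio=0.5, rng=random):
    """Return the inputs fed to the decoder at each step (hypothetical helper)."""
    fed_inputs = []
    current = target[0]                      # the <sos> token
    for t in range(1, len(target)):
        fed_inputs.append(current)
        predicted = predict_next(current)    # stand-in for the decoder's argmax
        teacher_force = rng.random() < teacher_forcing_ratio
        current = target[t] if teacher_force else predicted
    return fed_inputs

def always_what(token):
    """Toy 'model' that always predicts the word 'what'."""
    return "what"

target = ["<sos>", "why", "were", "humans", "born", "?", "<eos>"]
print(decode(target, always_what, teacher_forcing_ratio=1.0))
print(decode(target, always_what, teacher_forcing_ratio=0.0))
```

With a ratio of 1.0 the decoder always sees the ground-truth previous token; with 0.0 it only ever sees its own predictions, which is the setting the validation loop uses.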

#### Instantiate the model and pop it onto the GPU 

In [None]:
INPUT_DIM = len(Q1.vocab)
OUTPUT_DIM = len(Q2.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)

model = Seq2Seq(enc, dec, device).to(device)

In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 23,107,591 trainable parameters
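
The reported total can be checked by hand. The sketch below adds up the embedding, LSTM, and output-layer parameters using the hyperparameters above. It assumes the Q2 vocabulary also has 15,367 tokens, which the notebook does not print; under that assumption the arithmetic matches the reported count. `lstm_params` is a hypothetical helper.

```python
def lstm_params(input_size, hidden_size, n_layers):
    """Parameters of a unidirectional LSTM: 4 gates, two bias vectors per layer."""
    total = 0
    for layer in range(n_layers):
        in_size = input_size if layer == 0 else hidden_size
        total += 4 * hidden_size * (in_size + hidden_size)  # weight_ih + weight_hh
        total += 2 * 4 * hidden_size                         # bias_ih + bias_hh
    return total

INPUT_DIM = 15367    # Q1 vocabulary size printed above
OUTPUT_DIM = 15367   # assumption: the Q2 vocabulary size is not printed
EMB_DIM, HID_DIM, N_LAYERS = 256, 512, 2

encoder = INPUT_DIM * EMB_DIM + lstm_params(EMB_DIM, HID_DIM, N_LAYERS)
decoder = (OUTPUT_DIM * EMB_DIM                      # embedding
           + lstm_params(EMB_DIM, HID_DIM, N_LAYERS)
           + HID_DIM * OUTPUT_DIM + OUTPUT_DIM)      # fc_out weight + bias
print(f"{encoder + decoder:,}")
```

Most of the decoder's parameters sit in the embedding and the final linear layer, both of which scale with the vocabulary size.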


#### Choose Optimizer and Loss Function

In [None]:
optimizer = optim.AdamW(model.parameters())

In [None]:
Q2_PAD_IDX = Q2.vocab.stoi[Q2.pad_token]
criterion = nn.CrossEntropyLoss(ignore_index = Q2_PAD_IDX)
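
Passing `ignore_index` means padded positions contribute nothing to the loss, so short questions in a batch are not penalised for their padding. A minimal sketch of the same idea in plain Python, working on log-probabilities rather than raw logits; `masked_nll` is a hypothetical helper, not the PyTorch implementation.

```python
import math

def masked_nll(log_probs, targets, pad_idx):
    """Mean negative log-likelihood over non-padding targets (hypothetical helper)."""
    kept = [-row[t] for row, t in zip(log_probs, targets) if t != pad_idx]
    return sum(kept) / len(kept)

PAD = 1
# Log-probabilities over a 3-token vocabulary for three positions
log_probs = [[math.log(0.7), math.log(0.2), math.log(0.1)],
             [math.log(0.1), math.log(0.1), math.log(0.8)],
             [math.log(0.3), math.log(0.3), math.log(0.4)]]
targets = [0, 2, PAD]   # the last position is padding and is skipped

loss = masked_nll(log_probs, targets, PAD)
print(round(loss, 3))
```

Only the first two positions enter the average; the padded third position is dropped before the mean is taken.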

#### Create Training Loop

In [None]:
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):

        src = batch.q1
        trg = batch.q2
        
        optimizer.zero_grad()
        
        output = model(src, trg)
        
        #trg = [trg len, batch size]
        #output = [trg len, batch size, output dim]
        
        output_dim = output.shape[-1]
        
        #drop the first position (<sos>), which the decoder never predicts
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)
        
        #trg = [(trg len - 1) * batch size]
        #output = [(trg len - 1) * batch size, output dim]
        
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)
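
Gradient clipping by norm rescales all gradients together when their combined L2 norm exceeds `clip`, which keeps a single bad batch from causing an oversized update. The following plain-Python sketch shows the idea on a flat list of numbers; `clip_by_global_norm` is a hypothetical helper and only approximates what `clip_grad_norm_` does on real parameter tensors.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Small epsilon avoids division by zero; the scale never exceeds 1
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, round(norm_after, 4))
```

Gradients whose norm is already below the threshold pass through unchanged, so the direction of the update is preserved in both cases.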

#### Create Validation Loop

In [None]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            src = batch.q1
            trg = batch.q2

            output = model(src, trg, 0) #turn off teacher forcing

            #trg = [trg len, batch size]
            #output = [trg len, batch size, output dim]

            output_dim = output.shape[-1]
            
            #drop the first position (<sos>), which the decoder never predicts
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)

            #trg = [(trg len - 1) * batch size]
            #output = [(trg len - 1) * batch size, output dim]

            loss = criterion(output, trg)
            
            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

In [None]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

#### Run the training loop

In [None]:
N_EPOCHS = 5
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut1-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 10m 15s
	Train Loss: 4.908 | Train PPL: 135.314
	 Val. Loss: 4.743 |  Val. PPL: 114.722
Epoch: 02 | Time: 10m 15s
	Train Loss: 4.027 | Train PPL:  56.077
	 Val. Loss: 4.325 |  Val. PPL:  75.596
Epoch: 03 | Time: 10m 16s
	Train Loss: 3.610 | Train PPL:  36.955
	 Val. Loss: 4.137 |  Val. PPL:  62.593
Epoch: 04 | Time: 10m 19s
	Train Loss: 3.350 | Train PPL:  28.500
	 Val. Loss: 4.024 |  Val. PPL:  55.925
Epoch: 05 | Time: 10m 19s
	Train Loss: 3.167 | Train PPL:  23.730
	 Val. Loss: 3.947 |  Val. PPL:  51.801
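
The PPL column is the exponential of the average cross-entropy loss. A short check using the epoch 5 validation loss as printed; because the printed loss is rounded to three decimals, the result differs slightly from the logged perplexity.

```python
import math

valid_loss = 3.947                  # epoch 5 validation loss, as printed (rounded)
perplexity = math.exp(valid_loss)
print(f"{perplexity:.1f}")
```

Perplexity can be read as the effective number of equally likely tokens the model is choosing between at each step, so lower is better.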


#### Run for 5 more Epochs

In [None]:
N_EPOCHS = 5
CLIP = 1

#best_valid_loss is carried over from the previous cell so that a worse model cannot overwrite the saved checkpoint

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut1-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 10m 18s
	Train Loss: 2.995 | Train PPL:  19.988
	 Val. Loss: 3.893 |  Val. PPL:  49.051
Epoch: 02 | Time: 10m 20s
	Train Loss: 2.873 | Train PPL:  17.687
	 Val. Loss: 3.815 |  Val. PPL:  45.358
Epoch: 03 | Time: 10m 23s
	Train Loss: 2.756 | Train PPL:  15.741
	 Val. Loss: 3.835 |  Val. PPL:  46.297
Epoch: 04 | Time: 10m 24s
	Train Loss: 2.677 | Train PPL:  14.548
	 Val. Loss: 3.768 |  Val. PPL:  43.304
Epoch: 05 | Time: 10m 20s
	Train Loss: 2.577 | Train PPL:  13.156
	 Val. Loss: 3.757 |  Val. PPL:  42.829
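
The checkpoint is written only when the validation loss improves. The sketch below replays that rule over the ten validation losses printed above, treating both runs as one continuous sequence of epochs; the list and variable names are illustrative.

```python
# Validation losses printed by the two training runs above (epochs 1-10)
valid_losses = [4.743, 4.325, 4.137, 4.024, 3.947,
                3.893, 3.815, 3.835, 3.768, 3.757]

best = float("inf")
saved_epochs = []
for epoch, loss in enumerate(valid_losses, start=1):
    if loss < best:          # same test the training loop uses before torch.save
        best = loss
        saved_epochs.append(epoch)

print(saved_epochs, best)
```

Epoch 8 is the only one that does not improve on the previous best, so it would not overwrite the saved model. Validation loss is still falling at epoch 10 while training loss falls faster, which suggests the gap between the two is widening.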
