# Sequence to Sequence Prediction on Quora Duplicate Data
---

In [1]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


## Data Exploration
---

In [1]:
import os

data_path = 'http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv'

In [2]:
import pandas as pd

qa_data = pd.read_csv(data_path, sep='\t')

For our purpose, only the question1 and question2 columns matter; is_duplicate is used later to filter the rows.

In [3]:
qa_data.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


The question1 and question2 columns contain a few null values (1 and 2 respectively, out of 404290 rows).

In [4]:
qa_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 404290 entries, 0 to 404289
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   id            404290 non-null  int64 
 1   qid1          404290 non-null  int64 
 2   qid2          404290 non-null  int64 
 3   question1     404289 non-null  object
 4   question2     404288 non-null  object
 5   is_duplicate  404290 non-null  int64 
dtypes: int64(4), object(2)
memory usage: 18.5+ MB
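
The null counts can also be read off directly instead of being inferred from `info()`. Below is a minimal sketch on a toy frame; the three rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for qa_data; the rows are illustrative only.
toy = pd.DataFrame({
    'question1': ['How do I learn Python?', None, 'What is NLP?'],
    'question2': ['How can I learn Python?', 'What is AI?', None],
})

# Count missing values per column.
null_counts = toy[['question1', 'question2']].isnull().sum()
print(null_counts)

# Keep only rows where both questions are present.
clean = toy.dropna(subset=['question1', 'question2'])
print(len(clean))
```

On the real data, `qa_data[['question1', 'question2']].isnull().sum()` should report the 1 and 2 missing entries implied by the `info()` output.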


Since we only need duplicate question pairs, we will filter the data on is_duplicate == 1.

In [5]:
qa_data.is_duplicate.value_counts()

0    255027
1    149263
Name: is_duplicate, dtype: int64
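
To see how much data survives the filter, the counts can be turned into shares. This is a small sketch that simply reuses the two numbers printed by `value_counts()`.

```python
# Counts copied from the value_counts() output.
counts = {0: 255027, 1: 149263}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"is_duplicate={label}: {share:.1%}")
```

Roughly 37% of the pairs are marked as duplicates. On the DataFrame itself, `qa_data.is_duplicate.value_counts(normalize=True)` gives the same shares.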

This is the final data after filtering: only the duplicate pairs, with just the two question columns.

In [6]:
qa_data = qa_data[qa_data['is_duplicate']==1][['question1','question2']]
qa_data.reset_index(drop=True,inplace=True)
qa_data

Unnamed: 0,question1,question2
0,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan..."
1,How can I be a good geologist?,What should I do to be a great geologist?
2,How do I read and find my YouTube comments?,How can I see all my Youtube comments?
3,What can make Physics easy to learn?,How can you make physics easy to learn?
4,What was your first sexual experience like?,What was your first sexual experience?
...,...,...
149258,What are some outfit ideas to wear to a frat p...,What are some outfit ideas wear to a frat them...
149259,Why is Manaphy childish in Pokémon Ranger and ...,Why is Manaphy annoying in Pokemon ranger and ...
149260,How does a long distance relationship work?,How are long distance relationships maintained?
149261,What does Jainism say about homosexuality?,What does Jainism say about Gays and Homosexua...


After filtering, there are no null values in either column.

In [7]:
qa_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149263 entries, 0 to 149262
Data columns (total 2 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   question1  149263 non-null  object
 1   question2  149263 non-null  object
dtypes: object(2)
memory usage: 2.3+ MB


## Create NLP Pipeline
---


### Import libraries and set seed
---

In [2]:
import os

base_path = 'gdrive/MyDrive/TSAI_END2/Session7/Assignment2'
data_path = os.path.join(base_path, 'data')
qa_data_base_path = os.path.join(data_path, 'QuoraData')
qa_data_path = os.path.join(qa_data_base_path, 'quora_duplicate_questions.tsv')

In [2]:
%cd 'gdrive/MyDrive/TSAI_END2/Session7/Assignment2'

/content/gdrive/MyDrive/TSAI_END2/Session7/Assignment2


In [4]:
from nlp_api import *

In [5]:
import torch
import torch.nn as nn
import torch.optim as optim

from torchtext.legacy import data
from torchtext.legacy.data import Field, BucketIterator

import spacy
import numpy as np

import random
import math
import time

In [6]:
SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
# !pip install spacy --upgrade

### Data Loading, Model Building and Training
---

In [7]:
model_params = {
    'enc_emb_dim': 256, 'dec_emb_dim': 256, 'hid_dim': 512,
    'n_layers': 2, 'enc_dropout': 0.5, 'dec_dropout': 0.5,
}
params = {
    'data_path': './data/QuoraData/quora_duplicate_questions.tsv',
    'data_name': 'quora',
    'model_name': 'lstm encoder-decoder sequence model',
    'model_params': model_params,
    'seed': SEED,
    'batch_size': 128,
    'device': torch.device('cuda' if torch.cuda.is_available() else 'cpu'),
}

nlp_pipeline = NLP_Pipeline(**params)
nlp_pipeline.train_model(10, './saved_models/QuoraData')
nlp_pipeline.evaluate_model()

Loading data...
Number of training examples: 104484
Number of testing examples: 44779
Unique tokens in source vocabulary: 14522
Unique tokens in target vocabulary: 14469
Data is loaded
Loading model....
The model has 22200709 trainable parameters
Epoch: 01 | Time: 4m 23s
	Train Loss: 4.825 | Train PPL: 124.636
	 Val. Loss: 4.667 |  Val. PPL: 106.393
Epoch: 02 | Time: 4m 23s
	Train Loss: 3.831 | Train PPL:  46.130
	 Val. Loss: 4.163 |  Val. PPL:  64.238
Epoch: 03 | Time: 4m 23s
	Train Loss: 3.316 | Train PPL:  27.563
	 Val. Loss: 3.858 |  Val. PPL:  47.368
Epoch: 04 | Time: 4m 24s
	Train Loss: 2.983 | Train PPL:  19.745
	 Val. Loss: 3.730 |  Val. PPL:  41.700
Epoch: 05 | Time: 4m 24s
	Train Loss: 2.745 | Train PPL:  15.559
	 Val. Loss: 3.646 |  Val. PPL:  38.319
Epoch: 06 | Time: 4m 24s
	Train Loss: 2.563 | Train PPL:  12.970
	 Val. Loss: 3.604 |  Val. PPL:  36.744
Epoch: 07 | Time: 4m 23s
	Train Loss: 2.418 | Train PPL:  11.221
	 Val. Loss: 3.614 |  Val. PPL:  37.122
Epoch: 08 | Time: 
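
The reported parameter count can be sanity-checked by hand. The source of `nlp_api` is not shown here, so the architecture below is an assumption: an encoder made of an embedding plus a 2-layer unidirectional LSTM, and a decoder made of an embedding, a 2-layer LSTM and a linear output layer over the target vocabulary. Under that assumption, and with the vocabulary sizes from the log, the arithmetic reproduces the logged figure.

```python
def lstm_params(input_dim, hid_dim, n_layers):
    """Parameter count of a unidirectional multi-layer LSTM (PyTorch layout)."""
    total = 0
    for layer in range(n_layers):
        in_dim = input_dim if layer == 0 else hid_dim
        # 4 gates, each with input weights, hidden weights and two bias vectors.
        total += 4 * (in_dim * hid_dim + hid_dim * hid_dim + 2 * hid_dim)
    return total

src_vocab, trg_vocab = 14522, 14469        # vocabulary sizes from the training log
emb_dim, hid_dim, n_layers = 256, 512, 2   # values from model_params

# Encoder: embedding table + LSTM stack (assumed layout).
encoder = src_vocab * emb_dim + lstm_params(emb_dim, hid_dim, n_layers)

# Decoder: embedding table + LSTM stack + output projection (weights and bias).
decoder = (trg_vocab * emb_dim
           + lstm_params(emb_dim, hid_dim, n_layers)
           + hid_dim * trg_vocab + trg_vocab)

print(encoder + decoder)  # prints 22200709
```

The match with the logged 22200709 trainable parameters suggests, but does not prove, that the pipeline uses this layout. Note that the output projection alone accounts for about a third of the total.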

### Model Evaluation
---

In [None]:
nlp_pipeline.evaluate_model()

| Test Loss: 3.552 | Test PPL:  34.876 |
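
The PPL column in these logs is consistent with perplexity computed as the exponential of the mean cross-entropy loss. That the pipeline computes it this way is an assumption, since `nlp_api` is not shown. A minimal check against a few of the logged values:

```python
import math

def perplexity(loss):
    """Perplexity for a mean per-token cross-entropy loss measured in nats."""
    return math.exp(loss)

# Loss values copied from the training and test logs.
logged = [('train, epoch 1', 4.825), ('val, epoch 6', 3.604), ('test', 3.552)]

for name, loss in logged:
    print(f"{name}: loss {loss:.3f} -> PPL {perplexity(loss):.3f}")
```

The results land close to the logged 124.636, 36.744 and 34.876. The small differences are expected because the logged losses are rounded to three decimals before being exponentiated here.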
