## Download and unzip of the data
To download the data it is important to install kaggle and use it for the download, since the version downloaded doesn't support the command competitions download we also force it to update using --no-deps.

In [27]:
! pip install -q kaggle

from google.colab import files
_ = files.upload()

! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json

! pip install --upgrade --force-reinstall --no-deps kaggle

! kaggle competitions download quora-question-pairs

! unzip quora-question-pairs.zip
! unzip train.csv.zip
! unzip test.csv.z
! unzip sample_submission.csv.zip

Saving kaggle.json to kaggle (1).json
Collecting kaggle
  Using cached kaggle-1.5.12-py3-none-any.whl
Installing collected packages: kaggle
  Attempting uninstall: kaggle
    Found existing installation: kaggle 1.5.12
    Uninstalling kaggle-1.5.12:
      Successfully uninstalled kaggle-1.5.12
Successfully installed kaggle-1.5.12
quora-question-pairs.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  quora-question-pairs.zip
replace sample_submission.csv.zip? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: sample_submission.csv.zip  
  inflating: test.csv                A
A

  inflating: test.csv.zip            
  inflating: train.csv.zip           
Archive:  train.csv.zip
replace train.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: train.csv               
unzip:  cannot find or open test.csv.z, test.csv.z.zip or test.csv.z.ZIP.
Archive:  sample_submission.csv.zip
replace sample_submission.csv? [y]es, [n]o, [A]ll, [N]one, [

## Data Exploration


In [28]:
import pandas as pd
import numpy as np
import os

import scipy
import string
import csv

import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

import warnings
warnings.filterwarnings('ignore')

from sklearn import preprocessing
import spacy
import matplotlib.pyplot as plt
import plotly.graph_objects as go 

import re

import seaborn as sns

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# Splitting training data
We split the training data in order to create a new training set and a test set. Also, from the new training set we use a part for the validation set. The split is 80/10/10 of the original data.

In [29]:
train_df = pd.read_csv('train.csv')

Since the data contains some NaN, as seen in the "Data Analysis" notebook, we remove the rows containing a NaN in at least one of their fields

In [30]:
train_df = train_df[train_df['question1'].notna()]
train_df = train_df[train_df['question2'].notna()]

Check if all NaN rows are been removed:

In [31]:
nan_rows = train_df[train_df.isnull().any(1)]
nan_rows

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate


Count number of rows:

In [32]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 404287 entries, 0 to 404289
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   id            404287 non-null  int64 
 1   qid1          404287 non-null  int64 
 2   qid2          404287 non-null  int64 
 3   question1     404287 non-null  object
 4   question2     404287 non-null  object
 5   is_duplicate  404287 non-null  int64 
dtypes: int64(4), object(2)
memory usage: 21.6+ MB


The info function told us that there are 404287 entries, thus the split will be approximately:


*   TRAINING SET: 404287*80/100
*   VALIDATION SET: 404287*10/100
*  VALIDATION SET: 404287*10/100



In [33]:
training_set_size = int(len(train_df)*0.8) + 2
validation_set_size = int(len(train_df)*0.1)
test_set_size = int(len(train_df)*0.1)
print(training_set_size, validation_set_size, test_set_size)
total = training_set_size + validation_set_size + test_set_size
total

323431 40428 40428


404287

To split correctly the next step is to shuffle our rows, and reset the index in place

In [34]:
train_df = train_df.sample(frac=1).reset_index(drop=True)
train_df

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,216036,322068,322069,How can I sell at Snapdeal? What are the terms...,What are the payment terms for online marketpl...,0
1,241554,353581,353582,Why are most prosecutors in American courts no...,Why does a country like USA where law enforcem...,0
2,208738,312815,312816,What are some good government jobs without a c...,Are there any good companies that hire smart p...,0
3,102714,104282,77231,What would happen if humans no longer needed t...,What would the world be like if humans didn't ...,1
4,397304,178955,34172,How do I shave my bikini line?,What is the best way to shave the bikini area?,1
...,...,...,...,...,...,...
404282,210958,24331,315662,What are the landmark judgements of the Suprem...,What are some of the best judgements passed by...,1
404283,347284,450778,475744,How did Snapchat initially fund itself?,How did Shoply gain its initial traction?,0
404284,341397,469216,469217,Can I give my dog Benadryl to help him calm down?,Can I give Benadryl to help my baby sleep?,0
404285,355618,484863,484864,What are the differences between constant acce...,What is the difference between a body travelin...,0


Next step is to use the indexes computed before for splitting the DataFrame into three parts

In [35]:
# Boundary
val_test_b = training_set_size + validation_set_size

In [36]:
new_train_df = train_df[:training_set_size]
val_df = train_df[training_set_size : val_test_b]
test_df = train_df[val_test_b:]
print(len(new_train_df), len(val_df), len(test_df))

323431 40428 40428


In [37]:
new_train_df

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,216036,322068,322069,How can I sell at Snapdeal? What are the terms...,What are the payment terms for online marketpl...,0
1,241554,353581,353582,Why are most prosecutors in American courts no...,Why does a country like USA where law enforcem...,0
2,208738,312815,312816,What are some good government jobs without a c...,Are there any good companies that hire smart p...,0
3,102714,104282,77231,What would happen if humans no longer needed t...,What would the world be like if humans didn't ...,1
4,397304,178955,34172,How do I shave my bikini line?,What is the best way to shave the bikini area?,1
...,...,...,...,...,...,...
323426,189944,8395,34814,Is it true that the US is funding ISIS?,Is it really true that US is backing ISIS?,1
323427,330730,110579,457526,How do I play in share market in India?,How can I study and invest in the Indian share...,0
323428,143114,226824,175127,What is the difference between laundry deterge...,How safe is it to use non-HE detergent in a HE...,0
323429,10918,21121,21122,Would you consider teaching as a full time job...,Would you consider teaching as a full time job?,1


In [38]:
val_df

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
323431,145953,131178,6156,What does it mean when a phone rings once and ...,How do you change a phone number through Strai...,0
323432,82428,139767,139768,How can I love myself unconditionally?,How can we love ourselves unconditionally?,1
323433,209608,106721,19434,How did you catch your spouse cheating?,How do I catch cheating husbands?,1
323434,371747,70807,502401,What are determinant of demand and supply?,What determines demand?,0
323435,15464,29541,29542,What is your favorite Indian sweet dish?,Which is your favorite Indian vegetarian dish?,0
...,...,...,...,...,...,...
363854,251785,18327,38339,What are some of the best horror movies?,Can somebody help me with the list of best top...,1
363855,57914,101707,101708,What movie should I watch in 2017?,What movies will you recommend for us to watch...,1
363856,242879,355193,209645,What are good side dishes to accompany blacken...,What are side dishes for salmon patties?,0
363857,5626,11057,11058,What do you really know about Algeria?,What do you know about Algeria?,1


In [39]:
test_df

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
363859,231251,158987,65612,How will scraping currency notes of INR 500 an...,How is scrapping of Rs 500 and Rs 1000 currenc...,1
363860,226528,335164,335165,Why do the coated lens look purple by reflecte...,Why do coated lenses look purple when reflecti...,1
363861,36038,65760,65761,How safe is the Oracle Arena?,"In Oracle Arena: Would Seat ""1"" in Section 128...",0
363862,234195,34326,344637,How can I improve me problem solving skills?,What is the best way to practice problem solvi...,0
363863,121299,196582,196583,Could I have more than one account on Pinteres...,I have two account and forgot the email for th...,0
...,...,...,...,...,...,...
404282,210958,24331,315662,What are the landmark judgements of the Suprem...,What are some of the best judgements passed by...,1
404283,347284,450778,475744,How did Snapchat initially fund itself?,How did Shoply gain its initial traction?,0
404284,341397,469216,469217,Can I give my dog Benadryl to help him calm down?,Can I give Benadryl to help my baby sleep?,0
404285,355618,484863,484864,What are the differences between constant acce...,What is the difference between a body travelin...,0


Save CSV on Google drive

In [40]:
DATASETS_PATH = '/content/drive/MyDrive/Quora/Dataset/'

new_train_df.to_csv(DATASETS_PATH + 'training.csv', index=False)
val_df.to_csv(DATASETS_PATH + 'validation.csv', index=False)
test_df.to_csv(DATASETS_PATH + 'test.csv', index=False)