# Introduction

In early 2017, Quora released a [dataset](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) of question pairs. One of Quora's most important jobs is identifying when two questions are asking the same thing. For example, "" and "" have essentially the same meaning. Recognizing this matters to Quora because it doesn't want three copies of the same question, each with different answers.

In this notebook, we'll look at the dataset that Quora released and build a machine learning model that determines whether two questions are duplicates of each other.

# Data Loading

We'll start by loading the dataset into a pandas DataFrame.

In [2]:
import pandas as pd
df = pd.read_csv('Data/quora_duplicate_questions.tsv', sep='\t')

In [3]:
df.head()

| | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 |
| 1 | 1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 |
| 2 | 2 | 5 | 6 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 |
| 3 | 3 | 7 | 8 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0 |
| 4 | 4 | 9 | 10 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0 |


In [5]:
df.shape

(404290, 6)

# Word Vectors

As with many deep learning approaches to NLP tasks, our first job is to obtain word vectors. These can either be learned inside the model itself or taken from a pretrained set, which is the approach we'll take here. The vectors were downloaded from the [GloVe project page](https://nlp.stanford.edu/projects/glove/) and were trained using the GloVe model.
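The GloVe downloads are plain text files with one token per line followed by its space-separated vector components. How the `.npy` files used in this notebook were produced is not shown, so the following is only a minimal sketch of one way to parse that text format into a word list and a matrix. The helper name `load_glove_text` and the three-dimensional toy values are hypothetical, not real GloVe data.

```python
import io
import numpy as np

def load_glove_text(handle):
    """Parse GloVe-style text: each line is a word followed by its vector components."""
    words, vectors = [], []
    for line in handle:
        parts = line.rstrip().split(' ')
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        words.append(parts[0])
        vectors.append([float(x) for x in parts[1:]])
    return words, np.asarray(vectors, dtype='float32')

# Toy stand-in for a real GloVe file (made-up values, 3 dimensions instead of 50).
sample = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "market 0.4 0.5 0.6\n"
)
toy_words, toy_vectors = load_glove_text(sample)
print(toy_words, toy_vectors.shape)
```

With a real file, the handle would come from `open(...)` on the downloaded text file, and the two results could be saved with `np.save` for faster loading later.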

In [20]:
import numpy as np

wordsList = np.load('Data/wordsList.npy').tolist()
wordVectors = np.load('Data/wordVectors.npy')

In [23]:
len(wordsList) # Contains all of the words that we have vectors for

400000

In [37]:
numDimensions = wordVectors.shape[1]
wordVectors.shape # Contains all of the respective vectors

(400000, 50)

Now, let's go through each of the 404,290 question pairs and turn each question into an N x 50 matrix, where N is the number of words in the question. Each question pair will have two associated matrices (one for each question), which we then concatenate. The resulting matrix will be the input to our RNN.
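To make the shapes concrete, here is a small sketch of the concatenation step. It uses random placeholder values rather than real word vectors, and the names `q1`, `q2`, and `pair` are illustrative only. A 14-word question and a 12-word question are stacked along the word axis.

```python
import numpy as np

num_dimensions = 50
rng = np.random.RandomState(0)

# Placeholder matrices: one row per word, one column per embedding dimension.
q1 = rng.rand(14, num_dimensions).astype('float32')  # question with 14 words
q2 = rng.rand(12, num_dimensions).astype('float32')  # question with 12 words

# Stack the two questions along the word axis (axis 0).
pair = np.concatenate([q1, q2], axis=0)
print(pair.shape)  # (26, 50)
```

Note that the number of rows varies from pair to pair, so a batch of pairs would still need padding or truncation to a common length before being fed to an RNN.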

Let's see what that looks like for just the first question pair.

In [74]:
firstQuestion = df.loc[0,'question1'] # Getting the first sentence in the first question pair
secondQuestion = df.loc[0,'question2'] # Getting the second sentence in the first question pair

The next function cleans the sentences. This kind of data preprocessing is extremely important in machine learning and deep learning.

In [75]:
def cleanSentences(string):
    string = string.lower()
    return string
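Lowercasing alone leaves punctuation attached to words (for example `india?`), and such tokens are unlikely to match an entry in the word list. As a possible extension, and not the function used in this notebook, a stricter cleaner could also drop punctuation. The name `clean_sentence_strict` and the character whitelist are assumptions made for illustration.

```python
import re

# Hypothetical whitelist: keep lowercase letters, digits and spaces only.
_NON_ALNUM = re.compile(r'[^a-z0-9 ]+')

def clean_sentence_strict(text):
    """Lowercase the text, drop punctuation, and collapse repeated whitespace."""
    text = text.lower()
    text = _NON_ALNUM.sub('', text)
    return ' '.join(text.split())

print(clean_sentence_strict("What is the step by step guide to invest in share market in India?"))
# what is the step by step guide to invest in share market in india
```

Removing characters outright also merges hyphenated words (`Koh-i-Noor` becomes `kohinoor`), so whether to delete punctuation or replace it with a space is a design choice worth checking against the vocabulary.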

In [76]:
firstQuestion = cleanSentences(firstQuestion)
secondQuestion = cleanSentences(secondQuestion)
firstQuestionSplit = firstQuestion.split()
secondQuestionSplit = secondQuestion.split()
lenBothSentence = len(firstQuestionSplit) + len(secondQuestionSplit)

In [77]:
# Preallocate one row per word across both questions
firstXInput = np.zeros((lenBothSentence, numDimensions), dtype='float32')
indexCounter = 0
# Fill in the rows for the first question
for word in firstQuestionSplit:
    print(word)
    try:
        firstXInput[indexCounter] = wordVectors[wordsList.index(word)]
    except ValueError:
        firstXInput[indexCounter] = wordVectors[399999]  # Vector for unknown words
    indexCounter = indexCounter + 1
# Continue with the rows for the second question
for word in secondQuestionSplit:
    try:
        firstXInput[indexCounter] = wordVectors[wordsList.index(word)]
    except ValueError:
        firstXInput[indexCounter] = wordVectors[399999]  # Vector for unknown words
    indexCounter = indexCounter + 1
firstXInput.shape

what
is
the
step
by
step
guide
to
invest
in
share
market
in
india


(26, 50)
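Calling `wordsList.index(word)` scans the list from the start for every word, which becomes slow with 400,000 entries and 404,290 question pairs. One common alternative, sketched here with toy data, is to build a dictionary from word to row index once and reuse it. The helper names `build_index` and `embed_tokens` and the toy vocabulary are hypothetical.

```python
import numpy as np

def build_index(words):
    """Map each word to its first position, mirroring what list.index would return."""
    index = {}
    for i, word in enumerate(words):
        index.setdefault(word, i)
    return index

def embed_tokens(tokens, word_to_index, vectors, unknown_index):
    """Look up one row per token, falling back to the unknown-word row."""
    rows = [word_to_index.get(token, unknown_index) for token in tokens]
    return vectors[rows]

# Toy vocabulary with 2-dimensional vectors; the last row plays the unknown-word role.
toy_words = ['what', 'is', 'the', 'UNK']
toy_vectors = np.arange(8, dtype='float32').reshape(4, 2)
toy_index = build_index(toy_words)

matrix = embed_tokens('what is that'.split(), toy_index, toy_vectors, unknown_index=3)
print(matrix)
```

In the notebook's setting, the dictionary would be built from `wordsList` and the fallback index would be 399999, the row used above for unknown words.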

In [80]:
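import re
# Split on runs of spaces; the pattern ' ?' can match the empty string,
# which makes Python 3.7+ split between every character
re.split(' +', firstQuestion)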

['What',
 'is',
 'the',
 'step',
 'by',
 'step',
 'guide',
 'to',
 'invest',
 'in',
 'share',
 'market',
 'in',
 'india?']
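Splitting on spaces keeps the question mark glued to the last word (`'india?'`). If punctuation should become its own token instead, one option is `re.findall` with a pattern that matches either a run of word characters or a single punctuation character. This is a sketch of an alternative tokenizer, not the approach used in this notebook, and the name `tokenize` is hypothetical.

```python
import re

def tokenize(text):
    """Return word tokens and punctuation marks as separate items."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches a single character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("in share market in india?"))
# ['in', 'share', 'market', 'in', 'india', '?']
```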