# Introduction

In early 2017, Quora released a [dataset](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) of question pairs. One of Quora's most important jobs is identifying when two questions ask the same thing. For example, "" and "" have essentially the same meaning. Recognizing this matters to Quora because it doesn't want three copies of the same question, each with different answers.

In this notebook, we'll look at the dataset that Quora released and create a machine learning model that determines whether two questions are duplicates of each other.

# Data Loading

We'll start by loading the dataset into a pandas DataFrame.

In [1]:
import pandas as pd
df = pd.read_csv('Data/quora_duplicate_questions.tsv', sep='\t')

In [2]:
df.head()

| | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 |
| 1 | 1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 |
| 2 | 2 | 5 | 6 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 |
| 3 | 3 | 7 | 8 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0 |
| 4 | 4 | 9 | 10 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0 |


In [3]:
df.shape

(404290, 6)

# Word Vectors

As with many deep learning approaches to NLP tasks, our first job is to create word vectors. These can either be learned in the model itself or taken from pretrained word vectors, which is the approach we'll take here. The vectors were trained using the GloVe model and downloaded from the [GloVe project page](https://nlp.stanford.edu/projects/glove/).

In [4]:
import numpy as np

wordsList = np.load('Data/wordsList.npy').tolist()
wordVectors = np.load('Data/wordVectors.npy')

In [5]:
len(wordsList) # Contains all of the words that we have vectors for

400000

In [6]:
numDimensions = wordVectors.shape[1]
wordVectors.shape # Contains all of the respective vectors

(400000, 50)

Now, let's go through each of the 404,290 question pairs and turn each question into an N x 50 matrix, where N is the number of words in the question. Each question pair will have two associated matrices (one for each question), which we will concatenate. The resulting matrix will be the input to our RNN.

Let's see what that looks like for just the first question pair.

In [59]:
firstQuestion = df.loc[0,'question1'] # Getting the first sentence in the first question pair
secondQuestion = df.loc[0,'question2'] # Getting the second sentence in the first question pair

The next function cleans the sentences. This kind of data preprocessing is extremely important in machine learning and deep learning.

In [60]:
import re
def cleanSentences(string):
    # Non-string values (e.g. NaN for a missing question) are replaced by a blank string
    if not isinstance(string, str):
        print("The passed in value is not a string")
        return " " # TODO: Find some better way to deal with these values
    string = string.lower()
    string = re.sub('([.,!?()])', r' \1 ', string) # Separates punctuation from the word
    return string
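
To make the effect of the cleaning step concrete, here is a small self-contained sketch that applies the same lowercase-and-separate-punctuation logic to one made-up question. The name `cleanExample` and the sample text are illustrative and are not part of the notebook.

```python
import re

def cleanExample(text):
    """Lowercase the text and put spaces around . , ! ? ( ) so they become separate tokens."""
    if not isinstance(text, str):
        # Non-string values (e.g. NaN for a missing question) become a blank string
        return " "
    text = text.lower()
    return re.sub(r'([.,!?()])', r' \1 ', text)

sample = "How can I learn Python (fast)?"  # made-up question
tokens = cleanExample(sample).split()
print(tokens)
# ['how', 'can', 'i', 'learn', 'python', '(', 'fast', ')', '?']
```

Because the punctuation marks are surrounded by spaces, `split()` treats each of them as its own token, so each one gets its own word vector.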

In [61]:
firstQuestion = cleanSentences(firstQuestion)
secondQuestion = cleanSentences(secondQuestion)
firstQuestionSplit = firstQuestion.split()
secondQuestionSplit = secondQuestion.split()
lenBothSentence = len(firstQuestionSplit) + len(secondQuestionSplit)

In [62]:
firstXInput = np.zeros((lenBothSentence, numDimensions), dtype='float32')
indexCounter = 0
for word in firstQuestionSplit:
    try:
        firstXInput[indexCounter] = wordVectors[wordsList.index(word)]
    except ValueError:
        firstXInput[indexCounter] = wordVectors[399999] # Vector for unknown words
    indexCounter = indexCounter + 1
for word in secondQuestionSplit:
    try:
        firstXInput[indexCounter] = wordVectors[wordsList.index(word)]
    except ValueError:
        firstXInput[indexCounter] = wordVectors[399999] # Vector for unknown words
    indexCounter = indexCounter + 1
firstXInput.shape

(28, 50)
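
The loop above calls `wordsList.index(word)` for every token, which scans the word list from the start each time. A minimal sketch of a faster alternative is to build a dictionary once and look words up by key. The toy vocabulary and the names `wordIndex`, `unknownIndex`, and `tokensToIndices` are assumptions made for illustration and do not appear in the notebook.

```python
# Toy stand-in for the 400,000-entry wordsList; the last entry plays the role
# of the unknown-word slot (index 399999 in the notebook).
wordsList = ["what", "is", "the", "guide", "?", "unk"]
unknownIndex = len(wordsList) - 1

# Build the word -> row index mapping once, so each lookup is a dictionary lookup
wordIndex = {word: i for i, word in enumerate(wordsList)}

def tokensToIndices(tokens):
    """Map each token to its row in the vector matrix, falling back to the unknown row."""
    return [wordIndex.get(token, unknownIndex) for token in tokens]

indices = tokensToIndices(["what", "is", "the", "kohinoor", "?"])
print(indices)  # [0, 1, 2, 5, 4]
```

With a NumPy array of vectors, indexing with such a list (for example `wordVectors[indices]`) returns the N x D matrix for the tokens in one step.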

Okay, so now let's do the same for every question pair and create one large X matrix. This matrix will be 3-D with dimensionality S x N x D, where S is the number of examples (350,000 pairs for training and the remaining 54,290 for testing), N is the sequence length, and D is the number of dimensions of the word vectors.

Let's examine N, the sequence length, a bit more carefully. Every question pair can have a different sequence length. The first question pair we saw had a total of 28 tokens (words plus punctuation). The second pair could have 20, 30, 40, or some other number; we don't know in advance. So, we'll set the sequence length to the maximum length over all question pairs in the dataset.

In [68]:
maxSeqLength = 0
for index, row in df.iterrows():
    firstQuestion = cleanSentences(row['question1'])
    secondQuestion = cleanSentences(row['question2'])
    firstQuestionSplit = firstQuestion.split()
    secondQuestionSplit = secondQuestion.split()
    lenBothSentence = len(firstQuestionSplit) + len(secondQuestionSplit)
    if (lenBothSentence > maxSeqLength):
        maxSeqLength = lenBothSentence
maxSeqLength

The passed in value is not a string
The passed in value is not a string


296

We'll use the first 350,000 examples for training and the rest for testing.

In [69]:
numTrainExamples = 350000
numTestExamples = df.shape[0] - numTrainExamples

For pairs whose length is less than maxSeqLength, we will pad the remaining rows with zeros.

In [92]:
numClasses = 2
xTrain = np.zeros((numTrainExamples, maxSeqLength, numDimensions), dtype='float32')
yTrain = np.zeros((numTrainExamples, numClasses), dtype='int32')
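
The zero-padding described above can be sketched on a toy example. The sizes and the `pairVectors` array below are made up for illustration; the notebook itself uses a sequence length of 296 and 50-dimensional vectors.

```python
import numpy as np

maxSeqLength = 6   # toy value; the notebook computes 296
numDimensions = 3  # toy value; the notebook uses 50

# Pretend these are the word vectors for a question pair with 4 tokens in total
pairVectors = np.arange(12, dtype='float32').reshape(4, numDimensions)

# Start from an all-zero matrix and copy the real vectors into the first rows;
# the rows that are never written stay zero and act as padding
padded = np.zeros((maxSeqLength, numDimensions), dtype='float32')
padded[:len(pairVectors)] = pairVectors

print(padded.shape)  # (6, 3)
print(padded)
```

Allocating `xTrain` with `np.zeros` gives the same effect for the whole dataset: any row that the fill loop does not write remains zero.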

Now, let's fill in this xTrain matrix

In [94]:
exampleCounter = 0
for index, row in df.iterrows():
    if (index == numTrainExamples):
        break
    firstQuestion = cleanSentences(row['question1'])
    secondQuestion = cleanSentences(row['question2'])
    firstQuestionSplit = firstQuestion.split()
    secondQuestionSplit = secondQuestion.split()
    indexCounter = 0
    for word in firstQuestionSplit:
        try:
            xTrain[exampleCounter][indexCounter] = wordVectors[wordsList.index(word)]
        except ValueError:
            xTrain[exampleCounter][indexCounter] = wordVectors[399999] # Vector for unknown words
        indexCounter = indexCounter + 1
    for word in secondQuestionSplit:
        try:
            xTrain[exampleCounter][indexCounter] = wordVectors[wordsList.index(word)]
        except ValueError:
            xTrain[exampleCounter][indexCounter] = wordVectors[399999] # Vector for unknown words
        indexCounter = indexCounter + 1
    if (row['is_duplicate'] == 1):
        yTrain[exampleCounter] = [1,0]
    else:
        yTrain[exampleCounter] = [0,1]
    exampleCounter = exampleCounter + 1

array([[[ 0.45322999,  0.059811  , -0.10577   , ...,  0.53240001,
         -0.25103   ,  0.62546003],
        [ 0.61849999,  0.64253998, -0.46551999, ..., -0.27557001,
          0.30899   ,  0.48497   ],
        [ 0.41800001,  0.24968   , -0.41242   , ..., -0.18411   ,
         -0.11514   , -0.78580999],
        ..., 
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ]],

       [[ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        ..., 
        [ 0.        ,  0.        ,  0.        , ...,

# Helper Functions

In [None]:
from random import randint

def getBatch(isTest = False):
    if (isTest):
        # NOTE: xTest and yTest are assumed names for arrays holding the held-out
        # pairs, built the same way as xTrain and yTrain. The notebook never
        # builds them, and xTrain has no rows past numTrainExamples to slice.
        num = randint(0, numTestExamples - batchSize)
        arr = xTest[num:num+batchSize]
        labels = yTest[num:num+batchSize]
    else:
        num = randint(0, numTrainExamples - batchSize)
        arr = xTrain[num:num+batchSize]
        labels = yTrain[num:num+batchSize]
    return arr, labels

# RNN Model

In [83]:
batchSize = 24
lstmUnits = 64
iterations = 100000

In [86]:
import tensorflow as tf
tf.reset_default_graph()

labels = tf.placeholder(tf.float32, [batchSize, numClasses])
input_data = tf.placeholder(tf.float32, [batchSize, maxSeqLength, numDimensions])

In [87]:
lstmCell = tf.contrib.rnn.BasicLSTMCell(lstmUnits)
lstmCell = tf.contrib.rnn.DropoutWrapper(cell=lstmCell, output_keep_prob=0.70)
value, _ = tf.nn.dynamic_rnn(lstmCell, input_data, dtype=tf.float32)

weight = tf.Variable(tf.truncated_normal([lstmUnits, numClasses]))
bias = tf.Variable(tf.constant(0.1, shape=[numClasses]))
value = tf.transpose(value, [1, 0, 2])
last = tf.gather(value, int(value.get_shape()[0]) - 1)
prediction = (tf.matmul(last, weight) + bias)
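
The transpose-and-gather step in the cell above selects the LSTM output at the final time step for every example in the batch. Here is a NumPy sketch of the same indexing, using toy sizes and random data as a stand-in for the output of `tf.nn.dynamic_rnn`. The name `timeMajor` is illustrative only.

```python
import numpy as np

batchSize, maxSeqLength, lstmUnits = 2, 4, 3  # toy sizes, not the notebook's

# Stand-in for the dynamic_rnn output, laid out as [batch, time, units]
value = np.random.rand(batchSize, maxSeqLength, lstmUnits)

# Same steps as the TensorFlow code: move time to the front, then take the last step
timeMajor = np.transpose(value, [1, 0, 2])  # [time, batch, units]
last = timeMajor[timeMajor.shape[0] - 1]    # [batch, units]

# The result matches a direct slice of the final time step on the original layout
print(np.array_equal(last, value[:, -1, :]))  # True
print(last.shape)                             # (2, 3)
```

Since every pair is zero-padded to `maxSeqLength`, the final time step in this setup usually falls on a padding row rather than on the last real token.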

In [89]:
correctPred = tf.equal(tf.argmax(prediction,1), tf.argmax(labels,1))
accuracy = tf.reduce_mean(tf.cast(correctPred, tf.float32))

loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=labels))
optimizer = tf.train.AdamOptimizer().minimize(loss)
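
For reference, the softmax cross-entropy loss for a single example can be worked through by hand. The sketch below uses made-up logits and plain Python. The function names `softmax` and `crossEntropy` are illustrative and are not TensorFlow calls.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def crossEntropy(logits, label):
    """Cross-entropy between a one-hot label and softmax(logits)."""
    probs = softmax(logits)
    return -sum(l * math.log(p) for l, p in zip(label, probs))

logits = [2.0, 0.0]    # made-up scores for [duplicate, not duplicate]
duplicate = [1, 0]     # label encoding the notebook uses when is_duplicate == 1
notDuplicate = [0, 1]  # label encoding the notebook uses otherwise

print(crossEntropy(logits, duplicate))     # smaller loss: the scores agree with the label
print(crossEntropy(logits, notDuplicate))  # larger loss: the scores disagree with the label
```

The `loss` tensor in the cell above is the mean of this per-example quantity over the batch.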

In [90]:
import datetime

tf.summary.scalar('Loss', loss)
tf.summary.scalar('Accuracy', accuracy)
merged = tf.summary.merge_all()
logdir = "tensorboard/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S") + "/"
# sess is only created in the Training cell, so pass the default graph directly
writer = tf.summary.FileWriter(logdir, tf.get_default_graph())

# Training

In [None]:
sess = tf.InteractiveSession()
saver = tf.train.Saver()
sess.run(tf.global_variables_initializer())

for i in range(iterations):
    # Next batch of question pairs
    nextBatch, nextBatchLabels = getBatch()
    sess.run(optimizer, {input_data: nextBatch, labels: nextBatchLabels})
   
    #Write summary to Tensorboard
    if (i % 50 == 0):
        summary = sess.run(merged, {input_data: nextBatch, labels: nextBatchLabels})
        writer.add_summary(summary, i)

    #Save the network every 10,000 training iterations
    if (i % 10000 == 0 and i != 0):
        save_path = saver.save(sess, "models/pretrained_lstm.ckpt", global_step=i)
        print("saved to %s" % save_path)
writer.close()

# Testing

In [None]:
iterations = 10
for i in range(iterations):
    nextBatch, nextBatchLabels = getBatch(True)
    print("Accuracy for this batch:", (sess.run(accuracy, {input_data: nextBatch, labels: nextBatchLabels})) * 100)