# Quora Questions Kaggle Challenge

## Importing the csv

Import CSV and store selected rows in a list.

In [20]:
import csv
questions = []

with open('q_quora_100.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        questions.append(row[0:6])

Print the first row of the csv, the column keys. For double checking that the columns are the ones we want to easily refer back to see which columns are which.

In [21]:
keys = questions[0]
print 'column names'
print keys

column names
['id', 'qid1', 'qid2', 'question1', 'question2', 'is_duplicate']


Identify any columns that have the wrong number of columns. This was found because commas were used in some questions which messed up the csv parsing.

In [29]:
for row in questions:
    if len(row) != 6:
        print 'WARNING: A COLUMN NEEDS FIXING'

Convert python list to numpy array and delete the column names.

In [30]:
import numpy as np

questions = np.array(questions)
questions = np.delete(questions, 0, 0)
print questions[0]

['1' '3' '4' 'What is the story of Kohinoor (Koh-i-Noor) Diamond?'
 'What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?'
 '0']


## Data Exploration

First take a random entry in the dataset to get a feel for the questions.

In [31]:
import random

size = len(questions)
random_index = random.randrange(size)
print questions[random_index]

['78' '157' '158' 'How can I make money through the Internet?'
 'What are some different ways to make money online, excluding selling things?'
 '0']


## Munging

Splitting the sentence into tokens.

In [35]:
processing = []

for data in questions:
    pairId = data[0]
    sentence1 = data[3]
    sentence2 = data[4]
    tokens1 = sentence1.split(' ')
    tokens2 = sentence2.split(' ')
    processing.append([
        pairId,
        tokens1,
        tokens2
    ])
    
print(processing[0:4])

[['0', ['What', 'is', 'the', 'step', 'by', 'step', 'guide', 'to', 'invest', 'in', 'share', 'market', 'in', 'india?'], ['What', 'is', 'the', 'step', 'by', 'step', 'guide', 'to', 'invest', 'in', 'share', 'market?']], ['1', ['What', 'is', 'the', 'story', 'of', 'Kohinoor', '(Koh-i-Noor)', 'Diamond?'], ['What', 'would', 'happen', 'if', 'the', 'Indian', 'government', 'stole', 'the', 'Kohinoor', '(Koh-i-Noor)', 'diamond', 'back?']], ['2', ['How', 'can', 'I', 'increase', 'the', 'speed', 'of', 'my', 'internet', 'connection', 'while', 'using', 'a', 'VPN?'], ['How', 'can', 'Internet', 'speed', 'be', 'increased', 'by', 'hacking', 'through', 'DNS?']], ['3', ['Why', 'am', 'I', 'mentally', 'very', 'lonely?', 'How', 'can', 'I', 'solve', 'it?'], ['Find', 'the', 'remainder', 'when', '[math]23^{24}[/math]', 'is', 'divided', 'by', '24,23?']]]


We remove all stopwords. We loop over every row in the data set.

In [45]:
import nltk
from nltk.corpus import stopwords
no_stop = []

def remove_stop(sentence, words=stopwords.words('english')):
    remove_these = []
    for i in range(0, len(sentence) - 1):
        word = sentence[i]
        if word in words:
            remove_these.append(i)
    output = []
    for i in range(0, len(sentence)):
        if i not in remove_these:
            output.append(sentence[i])
    return output

for data in processing:
    pairId = data[0]
    sentence1 = data[1]
    sentence2 = data[2]
    out1 = remove_stop(sentence1)
    out2 = remove_stop(sentence2)
        
    no_stop.append([
        pairId,
        out1,
        out2
    ])
    
print no_stop[0:5]

[['0', ['What', 'step', 'invest', 'market', 'What', 'step', 'invest', 'market', 'in'], ['What', 'step', 'step', 'guide', 'invest', 'share', 'market?', 'What', 'step', 'step', 'guide', 'invest', 'share', 'market?']], ['1', ['What', 'story', 'Kohinoor', '(Koh-i-Noor)', 'Diamond?', 'What', 'story', 'Kohinoor', '(Koh-i-Noor)', 'Diamond?'], ['What', 'would', 'happen', 'Indian', 'government', 'stole', 'Kohinoor', '(Koh-i-Noor)', 'diamond', 'back?', 'What', 'would', 'happen', 'Indian', 'government', 'stole', 'Kohinoor', '(Koh-i-Noor)', 'diamond', 'back?']], ['2', ['How', 'I', 'increase', 'speed', 'internet', 'connection', 'using', 'VPN?', 'How', 'I', 'increase', 'speed', 'internet', 'connection', 'using', 'VPN?'], ['How', 'Internet', 'speed', 'increased', 'hacking', 'DNS?', 'How', 'Internet', 'speed', 'increased', 'hacking', 'DNS?']], ['3', ['Why', 'I', 'mentally', 'lonely?', 'How', 'I', 'solve', 'it?', 'Why', 'I', 'mentally', 'lonely?', 'How', 'I', 'solve', 'it?'], ['Find', 'remainder', '[ma

