# Quora Questions Kaggle Challenge

## Importing the csv

Read the CSV and store the first six columns of each row in a list.

In [71]:
import csv
questions = []

with open('q_quora_100.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        questions.append(row[0:6])

Print the first row of the CSV, which holds the column names, so we can confirm these are the columns we want and easily refer back to see which column is which.

In [72]:
keys = questions[0]
print 'column names'
print keys

column names
['id', 'qid1', 'qid2', 'question1', 'question2', 'is_duplicate']


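Identify any rows that have the wrong number of columns. The problem first showed up because some questions contain commas, which broke the CSV parsing. Note that each row was already cut to six fields on import, so this check only flags rows with too few columns.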

In [73]:
for row in questions:
    if len(row) != 6:
        print 'WARNING: A ROW NEEDS FIXING: ' + str(row[0])
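
Rows that are too long can only be detected before they are truncated. A minimal sketch of such a width check, assuming it runs on the raw parsed rows; `find_bad_rows` and `sample_lines` are hypothetical names with made-up data, not part of the notebook:

```python
import csv

def find_bad_rows(lines, expected=6):
    """Return (id, field_count) for every row whose width differs from expected."""
    bad = []
    for row in csv.reader(lines, delimiter=','):
        if len(row) != expected:
            bad.append((row[0] if row else None, len(row)))
    return bad

# Made-up sample data: row 0 quotes its commas correctly, row 1 does not.
sample_lines = [
    'id,qid1,qid2,question1,question2,is_duplicate',
    '0,1,2,"Is tea, or coffee, better?","Tea or coffee?",0',
    '1,3,4,Is tea, or coffee, better?,Tea or coffee?,0',
]

for row_id, width in find_bad_rows(sample_lines):
    print('row {} has {} fields'.format(row_id, width))
```

The quoted row parses into six fields, while the unquoted one splits into eight and is reported.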

Convert the Python list to a NumPy array and delete the header row that holds the column names.

In [74]:
import numpy as np

questions = np.array(questions)
questions = np.delete(questions, 0, 0)
print questions[0]

['0' '1' '2'
 'What is the step by step guide to invest in share market in india?'
 'What is the step by step guide to invest in share market?' '0']


## Data Exploration

First, pick a random entry from the dataset to get a feel for the questions.

In [75]:
import random
from IPython.display import HTML, display

size = len(questions)
random_index = random.randrange(size)
random_question = questions[random_index]


# Render the selected row as a one-row HTML table, one cell per field.
display(HTML(
    '<table><tr>{}</tr></table>'.format(
        ''.join('<td>{}</td>'.format(field) for field in random_question)
    )
))

0,1,2,3,4,5
46,93,94,How did Darth Vader fought Darth Maul in Star Wars Legends?,Does Quora have a character limit for profile descriptions?,0


Count how many question pairs are duplicates and what percentage of the dataset they make up.

In [76]:
duplicates = 0
not_duplicates = 0
total = len(questions)

for row in questions:
    if row[5] == '0':
        not_duplicates += 1
    else:
        duplicates += 1

print 'duplicates: ' + str(duplicates) + ' percentage: ' + str(100 * float(duplicates) / float(total)) + '%'
print 'not duplicates: ' + str(not_duplicates) + ' percentage: ' + str(100 * float(not_duplicates) / float(total)) + '%'

duplicates: 36 percentage: 34.9514563107%
not duplicates: 67 percentage: 65.0485436893%
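
The same tally can be sketched with `collections.Counter`, scaling each fraction by 100 before printing it as a percentage. The `labels` list is made-up sample data standing in for the `is_duplicate` column, not values from the dataset:

```python
from collections import Counter

# Made-up labels standing in for the is_duplicate column (index 5 of each row).
labels = ['0', '1', '0', '0', '1']

counts = Counter(labels)
total = len(labels)

for label, name in (('1', 'duplicates'), ('0', 'not duplicates')):
    share = 100.0 * counts[label] / total  # fraction scaled to a percentage
    print('{}: {} ({:.1f}%)'.format(name, counts[label], share))
```

With the notebook's array, the labels would presumably come from `questions[:, 5]`.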


## Bag of Words

### Munging

Lowercase each question, then split it into tokens on spaces.

In [77]:
import re
processing = []


for data in questions:
    pairId = data[0]
    sentence1 = data[3].lower()
    sentence2 = data[4].lower()
    tokens1 = sentence1.split(' ')
    tokens2 = sentence2.split(' ')
    processing.append([
        pairId,
        tokens1,
        tokens2
    ])
    
print(processing[0:4])

[['0', ['what', 'is', 'the', 'step', 'by', 'step', 'guide', 'to', 'invest', 'in', 'share', 'market', 'in', 'india?'], ['what', 'is', 'the', 'step', 'by', 'step', 'guide', 'to', 'invest', 'in', 'share', 'market?']], ['1', ['what', 'is', 'the', 'story', 'of', 'kohinoor', '(koh-i-noor)', 'diamond?'], ['what', 'would', 'happen', 'if', 'the', 'indian', 'government', 'stole', 'the', 'kohinoor', '(koh-i-noor)', 'diamond', 'back?']], ['2', ['how', 'can', 'i', 'increase', 'the', 'speed', 'of', 'my', 'internet', 'connection', 'while', 'using', 'a', 'vpn?'], ['how', 'can', 'internet', 'speed', 'be', 'increased', 'by', 'hacking', 'through', 'dns?']], ['3', ['why', 'am', 'i', 'mentally', 'very', 'lonely?', 'how', 'can', 'i', 'solve', 'it?'], ['find', 'the', 'remainder', 'when', '[math]23^{24}[/math]', 'is', 'divided', 'by', '24,23?']]]
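
Splitting on a single space leaves punctuation attached to tokens (`'india?'`, `'market?'`). As an alternative to the notebook's approach, a regex-based tokenizer can be sketched as follows; `tokenize` is a hypothetical helper and the pattern is an assumption about what should count as a word:

```python
import re

def tokenize(sentence):
    """Lowercase a sentence and return its runs of ASCII letters as tokens."""
    return re.findall(r'[a-z]+', sentence.lower())

print(tokenize('What is the step by step guide to invest in share market in india?'))
```

This variant is not equivalent to the cells here: it splits `'(koh-i-noor)'` into three tokens, whereas the later punctuation stripping collapses it to `'kohinoor'`, and it drops digits entirely.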


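We remove English stopwords, then strip every non-letter character from the remaining tokens and drop any token that ends up empty. Because the stopword check runs before punctuation is stripped, a token such as 'it?' does not match the stopword 'it' and survives as 'it' in the output.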

In [85]:
import nltk
from nltk.corpus import stopwords
munged = []

def remove_stop(sentence, words=stopwords.words('english')):
    # Check every token, including the last one, against the stopword list,
    # then strip punctuation from the tokens that survive.
    output = []
    for word in sentence:
        if word in words:
            continue
        punctuationless = remove_punctuation(word)
        if len(punctuationless) > 0:
            output.append(punctuationless)
    return output

def remove_punctuation(word):
    return re.sub('[^a-zA-Z]', '', word)

for data in processing:
    pairId = data[0]
    sentence1 = data[1]
    sentence2 = data[2]
    out1 = remove_stop(sentence1)
    out2 = remove_stop(sentence2)
        
    munged.append([
        pairId,
        out1,
        out2
    ])
    
print munged[0:5]

[['0', ['step', 'step', 'guide', 'invest', 'share', 'market', 'india'], ['step', 'step', 'guide', 'invest', 'share', 'market']], ['1', ['story', 'kohinoor', 'kohinoor', 'diamond'], ['would', 'happen', 'indian', 'government', 'stole', 'kohinoor', 'kohinoor', 'diamond', 'back']], ['2', ['increase', 'speed', 'internet', 'connection', 'using', 'vpn'], ['internet', 'speed', 'increased', 'hacking', 'dns']], ['3', ['mentally', 'lonely', 'solve', 'it'], ['find', 'remainder', 'mathmath', 'divided']], ['4', ['one', 'dissolve', 'water', 'quikly', 'sugar', 'salt', 'methane', 'carbon', 'di', 'oxide'], ['fish', 'would', 'survive', 'salt', 'water']]]
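
The output above still contains `'it'`, which came from the token `'it?'`. A minimal sketch of a variant that strips punctuation first, so such tokens are compared against the stopword list in their cleaned form. `clean_tokens` is a hypothetical helper, and `STOPWORDS` is a tiny stand-in set rather than NLTK's English list:

```python
import re

# Tiny stand-in stopword set; the notebook uses NLTK's English list instead.
STOPWORDS = {'what', 'is', 'the', 'how', 'can', 'i', 'it', 'why', 'am', 'very'}

def clean_tokens(tokens, stopwords=STOPWORDS):
    """Strip non-letters from each token, then drop empties and stopwords."""
    cleaned = []
    for token in tokens:
        word = re.sub('[^a-zA-Z]', '', token)
        if word and word not in stopwords:
            cleaned.append(word)
    return cleaned

tokens = ['why', 'am', 'i', 'mentally', 'very', 'lonely?',
          'how', 'can', 'i', 'solve', 'it?']
print(clean_tokens(tokens))  # ['mentally', 'lonely', 'solve']
```

Swapping in this ordering would change the munged output, so the printed results in this notebook would no longer match.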




### Order words by popularity

We'll create a list of all words used in the corpus. Then we'll count the occurrences of each word and order the words by frequency.

In [93]:
all_tokens = []

for row in munged:
    for word in row[1]:
        all_tokens.append(word)
    for word in row[2]:
        all_tokens.append(word)

print all_tokens[0:20]



['step', 'step', 'guide', 'invest', 'share', 'market', 'india', 'step', 'step', 'guide', 'invest', 'share', 'market', 'story', 'kohinoor', 'kohinoor', 'diamond', 'would', 'happen', 'indian']
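
The counting and ordering step can be sketched with `collections.Counter`. This is one possible approach, not necessarily the one the notebook goes on to use; `sample_tokens` reuses the 20 tokens printed above:

```python
from collections import Counter

# The first 20 tokens printed above, used here as sample input.
sample_tokens = ['step', 'step', 'guide', 'invest', 'share', 'market', 'india',
                 'step', 'step', 'guide', 'invest', 'share', 'market',
                 'story', 'kohinoor', 'kohinoor', 'diamond',
                 'would', 'happen', 'indian']

word_counts = Counter(sample_tokens)

# most_common() returns (word, count) pairs sorted from most to least frequent.
for word, count in word_counts.most_common(3):
    print('{}: {}'.format(word, count))
```

Here `'step'` comes first with four occurrences; the order among words with equal counts should not be relied on.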
