In [40]:
import quora
import os
import pandas as pd
from quora.aux_functions import get_cols_with_nans

# 1. The data

## 1.1 Basics

In [41]:
# load data (pre-loaded into pkl for faster loading)
df = pd.read_pickle(os.path.join(quora.root, 'data', 'train.pkl'))
df_counts = pd.read_pickle(os.path.join(quora.root, 'data', 'train_counts.pkl'))

In [42]:
# what does it look like?
df.head()

id,qid1,qid2,question1,question2,is_duplicate
0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


In [43]:
# get some basic stats
df_counts.describe()

statistic,qid1,qid2,is_duplicate,q1_n_words,q1_n_chars,q2_n_words,q2_n_chars
count,404290.0,404290.0,404290.0,404290.0,404290.0,404288.0,404288.0
mean,217243.942418,220955.655337,0.369198,10.944592,59.536716,11.18517,60.108663
std,157751.700002,159903.182629,0.482588,5.431949,29.940641,6.311051,33.86369
min,1.0,2.0,0.0,1.0,1.0,1.0,1.0
25%,74437.5,74727.0,0.0,7.0,39.0,7.0,39.0
50%,192182.0,197052.0,0.0,10.0,52.0,10.0,51.0
75%,346573.5,354692.5,1.0,13.0,72.0,13.0,72.0
max,537932.0,537933.0,1.0,125.0,623.0,237.0,1169.0


* There are 404290 pairs of questions in the dataset, of which ~37% are duplicates.

The minimum number of words (and characters) is one, which is unlikely to be a meaningful question. Let's remove every row in which either question has fewer than 3 words.

In [55]:
df_counts = df_counts[(df_counts.q1_n_words > 2) & (df_counts.q2_n_words > 2)]
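A side effect of the filter above is worth spelling out: a comparison against NaN evaluates to False, so rows with a missing word count are dropped as well (the first summary shows 404288 counts for the `q2_*` columns against 404290 rows). A minimal sketch on a made-up toy frame, not the real data:

```python
import numpy as np
import pandas as pd

# Toy data (invented for illustration); column names mirror df_counts.
toy = pd.DataFrame({
    "q1_n_words": [5, 1, 7, 4],
    "q2_n_words": [6.0, 8.0, np.nan, 2.0],
})

# Same boolean mask as in the cell above.
mask = (toy.q1_n_words > 2) & (toy.q2_n_words > 2)
filtered = toy[mask]

# Row 1 fails on q1, row 2 fails because NaN > 2 is False, row 3 fails on q2.
print(filtered)
```

Only the first row survives here, which shows that the length filter doubles as a missing-value filter.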

In [57]:
# check the stats again
df_counts.describe()

statistic,qid1,qid2,is_duplicate,q1_n_words,q1_n_chars,q2_n_words,q2_n_chars
count,404083.0,404083.0,404083.0,404083.0,404083.0,404083.0,404083.0
mean,217232.048104,220949.505456,0.36931,10.947986,59.554861,11.186974,60.117481
std,157755.623822,159910.159857,0.482619,5.429643,29.928932,6.310064,33.859176
min,1.0,2.0,0.0,3.0,9.0,3.0,10.0
25%,74415.5,74705.5,0.0,7.0,39.0,7.0,39.0
50%,192176.0,197040.0,0.0,10.0,52.0,10.0,51.0
75%,346564.0,354693.0,1.0,13.0,72.0,13.0,72.0
max,537932.0,537933.0,1.0,125.0,623.0,237.0,1169.0


In [70]:
# looks better! (minimum lengths of 9 and 10 chars are plausible for three-word questions)
# let's consider this the new df
df = df_counts


In [71]:
# Are there missing values?
get_cols_with_nans(df)

qid1 int64 no missing values
qid2 int64 no missing values
question1 object no missing values
question2 object no missing values
is_duplicate int64 no missing values
q1_n_words int64 no missing values
q1_n_chars int64 no missing values
q2_n_words float64 no missing values
q2_n_chars float64 no missing values


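No missing values! (`q2_n_words` and `q2_n_chars` are still float64 because they contained NaNs before the word-count filter.) Ready to go.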
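The helper `get_cols_with_nans` lives in this project's own `quora.aux_functions` module and its implementation is not shown here. A rough equivalent check can be sketched with plain pandas; `summarize_missing` and the toy frame below are hypothetical names and data, used only for illustration:

```python
import numpy as np
import pandas as pd

def summarize_missing(frame):
    """Hypothetical stand-in, NOT the actual get_cols_with_nans helper.

    Returns one row per column with its dtype and number of missing values.
    """
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
    })

# Toy data (invented) with one missing question and one missing count.
toy = pd.DataFrame({
    "question1": ["a b c", None],
    "q1_n_words": [3, 0],
    "q2_n_words": [4.0, np.nan],
})

summary = summarize_missing(toy)
print(summary)
```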

In [72]:
# read through some pairs to get a sense of the data...
for i, row in df.iterrows():
    # access columns by name rather than by position
    print('Q1: {}'.format(row['question1']))
    print('Q2: {}'.format(row['question2']))
    c = 'Duplicated' if row['is_duplicate'] == 1 else 'No duplicated'
    print('Result: {}'.format(c))

    # comment out this break to step through more examples
    break

    # press Enter for the next pair, or type 'stop' to quit
    a = input()
    if a == 'stop':
        break

Q1: What is the step by step guide to invest in share market in india?
Q2: What is the step by step guide to invest in share market?
Result: No duplicated


## some interesting examples...

### mostly overlapping questions that are nevertheless different questions
* Q1: What is the step by step guide to invest in share market in india?
* Q2: What is the step by step guide to invest in share market?
* Result: No duplicated
* ---
* Q1: What is the best travel website in spain?
* Q2: What is the best travel website?
* Result: No duplicated
* ---
* Q1: What are the laws to change your status from a student visa to a green card in the US, how do they compare to the immigration laws in Canada?
* Q2: What are the laws to change your status from a student visa to a green card in the US? How do they compare to the immigration laws in Japan?
* Result: No duplicated
* ---
* Q1: Which is the best digital marketing institution in banglore?
* Q2: Which is the best digital marketing institute in Pune?
* Result: No duplicated
* ---   
* Q1: What are some tips on making it through the job interview process at Medicines?
* Q2: What are some tips on making it through the job interview process at Foundation Medicine?
* Result: No duplicated


### the case of the negation...
* Q1: What are the questions should not ask on Quora?
* Q2: Which question should I ask on Quora?
* Result: No duplicated

    
### inverse scenario: questions are slightly different, but essentially the same in content
* Q1: Why do rockets look white?
* Q2: Why are rockets and boosters painted white?
* Result: Duplicated
  

### special characters and punctuation that carry the essential meaning of the question
* Q1: When do you use シ instead of し?
* Q2: When do you use "&" instead of "and"?
* Result: No duplicated
    
    
### very different phrasing, same meaning
* Q1: What would a Trump presidency mean for current international master’s students on an F1 visa?
* Q2: How will a Trump presidency affect the students presently in US or planning to study in US?
* Result: Duplicated
    
    
### problems with abbreviations...
* Q1: How much is 30 kV in HP?
* Q2: Where can I find a conversion chart for CC to horsepower?
* Result: No duplicated
* ---
* Q1: How do we prepare for UPSC?  # (union public service commission)
* Q2: How do I prepare for civil service?
* Result: Duplicated
    
    
### problems with labeling..?
* Q1: How should I prepare for CA final law?
* Q2: How one should know that he/she completely prepare for CA final exam?
* Result: Duplicated
* ---
* Q1: What is the quickest way to increase Instagram followers?
* Q2: How can we increase our number of Instagram followers?
* Result: No duplicated
* ---   
* Q1: How is the new Harry Potter book 'Harry Potter and the Cursed Child'?
* Q2: How bad is the new book by J.K Rowling?
* Result: Duplicated
* ---
* Q1: What is web application?
* Q2: What is the web application framework?
* Result: No duplicated
* ---
* Q1: What are some special cares for someone with a nose that gets stuffy during the night?
* Q2: How can I keep my nose from getting stuffy at night?
* Result: Duplicated
* ---
* Q1: When can I expect my Cognizant confirmation mail?
* Q2: When can I expect Cognizant confirmation mail?
* Result: No duplicated


### A minimal difference that makes it a completely different question...
* Q1: Can I make 50,000 a month by day trading?
* Q2: Can I make 30,000 a month by day trading?
* Result: No duplicated

### a) Is this dataset balanced?

In [73]:
# get distribution of target variable
df['is_duplicate'].value_counts()

0    254851
1    149232
Name: is_duplicate, dtype: int64

This is a binary classification problem:
* 1 - represents a duplicate pair of questions
* 0 - represents a non-duplicate pair of questions

36.9% of the question pairs are duplicates. Although the dataset is slightly imbalanced, this difference probably reflects reality: duplicated questions are less frequent than original ones. For now this looks good enough and I will keep the data as is.

What is the null error rate? Since only 36.9% of the pairs are duplicates, always predicting "no duplicate" would be wrong 36.9% of the time, i.e. it would reach about 63.1% accuracy. Any model should beat this baseline.
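The baseline can be worked out directly from the class counts. A minimal sketch, rebuilding the `value_counts()` output printed above by hand so that it runs without the data files:

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 254851, 1: 149232}, name="is_duplicate")

majority_class = counts.idxmax()                 # label predicted by the trivial model
baseline_accuracy = counts.max() / counts.sum()  # accuracy of always predicting it
null_error_rate = 1 - baseline_accuracy

print("majority class: {}".format(majority_class))
print("baseline accuracy: {:.3f}".format(baseline_accuracy))
print("null error rate: {:.3f}".format(null_error_rate))
```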

WordNet only contains "open-class words": nouns, verbs, adjectives, and adverbs. Thus, excluded words include determiners, prepositions, pronouns, conjunctions, and particles.