# Quora pairs
### Issue
- Many people ask similarly worded questions
- Multiple questions with the same intent can cause seekers to spend more time finding the best answer to their question
- Make writers feel they need to answer multiple versions of the same question. 

Quora values canonical questions because they provide a better experience to active seekers and writers, and offer more value to both of these groups in the long term.

### Current technique
Quora uses a Random Forest model to identify duplicate questions

### Goal
Classify whether question pairs are duplicates or not. 


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from time import time

import sys
stdout = sys.stdout
reload(sys)
sys.setdefaultencoding('utf-8')
sys.stdout = stdout

%matplotlib inline

## 1. Data exploration
According to the description: 
- __id__: the id of a training set question pair
- __qid1__, __qid2__: unique ids of each question (only available in train.csv)
- __question1__, __question2__: the full text of each question
- __is_duplicate__: the target variable, set to 1 if question1 and question2 have essentially the same meaning, and 0 otherwise.

## Training set

In [2]:
df_train = pd.read_csv('../data/train.csv').fillna("")
df_train.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


In [3]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 404290 entries, 0 to 404289
Data columns (total 6 columns):
id              404290 non-null int64
qid1            404290 non-null int64
qid2            404290 non-null int64
question1       404290 non-null object
question2       404290 non-null object
is_duplicate    404290 non-null int64
dtypes: int64(4), object(2)
memory usage: 18.5+ MB


### Questions
- Are all the questions unique?
- What is the ratio duplicate / non-duplicate pair questions?

In [4]:
unique_questions = df_train['qid1'].tolist() + df_train['qid2'].tolist() 
total = len(unique_questions)
print "Unique id questions: %.2f%%" % (float(len(set(unique_questions)))/total*100) 
print "Duplicate pair questions: %.2f%%, non-duplicate questions %.2f%%" % (df_train['is_duplicate'].mean()*100, 
                                                                        100 - df_train['is_duplicate'].mean()*100)

Unique id questions: 66.53%
Duplicate pair questions: 36.92%, non-duplicate questions 63.08%


### A peek at the Test set

In [5]:
df_test = pd.read_csv('../data/test.csv')
df_test.head()

### Cleaning of the text

In [6]:
import os.path
import re
import string
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))
    
# How to clean?
def clean_question(sentence, stop_words):
    # Remove punctuation
    sentence = ''.join([c for c in sentence if c not in string.punctuation])
    # Lower and tokenization
    tokenized_sentence = sentence.lower().split()
    tokenized_sentence = [c for c in tokenized_sentence if c not in stop_words]
    return ' '.join(tokenized_sentence)    
    
if not os.path.isfile('../data/cleaned_train.csv'): 
    df_train['clean_q1'] = df_train.apply(lambda x: clean_question(x['question1'], stop_words), axis=1)
    df_train['clean_q2'] = df_train.apply(lambda x: clean_question(x['question2'], stop_words), axis=1)
    df_train.to_csv('../data/cleaned_train.csv', index=False)
else:
    df_train = pd.read_csv('../data/cleaned_train.csv').fillna("")

## 2. Features 

For features, I chose to work with the following features:
- __Distance Measure__:
    - __Jaccard Similarity__: ratio of number of shared terms against total number of terms.
- __Shared__: 
    - __words (unigrams)__
    - __bigrams__
    - __trigrams__
- __NLP-specific__: 
    - __Word movers distance__: semantic similarity between the questions
    - __TD-IDF__: similarity between 2 strings from a corpus (google news here)
- __Other__:
    - __Same final words__: Last words might carry weights
    - __Length of questions__: similar questions would have similar length
    - __Ratio of length between questions__: normalized measure for the length
    - __Question frequency__: a question asked often will be likely to be a duplicate

Everything will be processed without stop words.

In [7]:
def ngram(l, n, joint=''):
    return [joint.join(l[i:i + n]) for i in xrange(0, len(l), n)]
       
def unigram(sentence):
    sentence = sentence.split()
    return ngram(sentence,1)

def bigram(sentence, joint="_"):
    sentence = sentence.split()
    return ngram(sentence,2,joint)
     
def trigram(sentence, joint="_"):
    sentence = sentence.split()
    return ngram(sentence,3,joint)

q1 = "My name is jean luc?"
q2 = "My name is jean paul."

print unigram(q1), bigram(q1), trigram(q2)

['My', 'name', 'is', 'jean', 'luc?'] ['My_name', 'is_jean', 'luc?'] ['My_name_is', 'jean_paul.']


### A. Distance Measure
#### Jaccard similarity

In [8]:
from __future__ import division

def jaccard(q1, q2, debug=False):
    q1_s = q1.split()
    q2_s = q2.split()
    shared_words = [word for word in set(q1_s).intersection(q2_s)]
    shared_total = [word for word in set(q1_s).union(q2_s)]
    if debug:
        print shared_words, shared_total
    if len(shared_total) == 0:
        shared_total = [1]
    return (len(shared_words) / len(shared_total))


# jaccard(q1,q2, True)
df_train['jaccard_sim'] = df_train.apply(lambda x: jaccard(x['clean_q1'], x['clean_q2']), axis=1)

In [9]:
df_train['jaccard_sim'].head()

0    0.833333
1    0.222222
2    0.222222
3    0.000000
4    0.153846
Name: jaccard_sim, dtype: float64

### B. Shared words

In [10]:
def shared_unigrams(q1,q2):
    return len(set(unigram(q1+q2)))

def shared_bigrams(q1,q2):
    return len(set(bigram(q1)).intersection(bigram(q2)))

def shared_trigrams(q1,q2):
    return len(set(trigram(q1)).intersection(trigram(q2)))

# print shared_unigrams(q1,q2), shared_bigrams(q1,q2), shared_trigrams(q1,q2)

df_train['shared_uni'] = df_train.apply(lambda x: shared_unigrams(x['clean_q1'], x['clean_q2']), axis=1)
df_train['shared_bi'] = df_train.apply(lambda x: shared_bigrams(x['clean_q1'], x['clean_q2']), axis=1)
df_train['shared_tri'] = df_train.apply(lambda x: shared_trigrams(x['clean_q1'], x['clean_q2']), axis=1)

In [11]:
print df_train['shared_uni'].head(), df_train['shared_bi'].head(), df_train['shared_tri'].head()

0     6
1     9
2     9
3     7
4    12
Name: shared_uni, dtype: int64 0    3
1    1
2    0
3    0
4    0
Name: shared_bi, dtype: int64 0    2
1    0
2    0
3    0
4    0
Name: shared_tri, dtype: int64


### C. NLP-specific

In [12]:
start = time()

try:
    from pyemd import emd
    PYEMD_EXT = True
except ImportError:
    PYEMD_EXT = False
    
import gensim
word2vec_model = gensim.models.KeyedVectors.load_word2vec_format(
    '../data/GoogleNews-vectors-negative300.bin.gz', 
    binary=True)

# Word movers distance
def wmd(q1,q2,model):
    return model.wmdistance(q1, q2)

print 'Cell took %.2f seconds to run.' %(time() - start)

Cell took 172.59 seconds to run.


In [13]:
start = time()

word2vec_model.init_sims(replace=True) 
# wmd(q1,q2,word2vec_model)
df_train['norm_wmd'] = df_train.apply(lambda x: wmd(x['clean_q1'], x['clean_q2'],word2vec_model), axis=1)

print 'Cell took %.2f seconds to run.' %(time() - start)
df_train['norm_wmd'].head()
df_train.to_csv('../data/cleaned_train_with_ABC_features.csv', index=False)

Cell took 1175.54 seconds to run.


In [14]:
# TD-IDF
from collections import Counter

# If a word appears only once, we ignore it completely (likely a typo)
# Epsilon defines a smoothing constant, which makes the effect of extremely rare words smaller
def get_weight(count, eps=10000, min_count=2):
    if count < min_count:
        return 0
    else:
        return 1 / (count + eps)

eps = 5000
questions_serie = pd.Series(df_train['question1'].tolist() + df_train['question2'].tolist()).astype(str)

words = (" ".join(questions_serie)).lower().split()
counts = Counter(words)
weights = {word: get_weight(count) for word, count in counts.items()}

def tfidf(q1,q2,stop_words):
    q1words = {}
    q2words = {}
    for word in str(q1).lower().split():
        if word not in stop_words:
            q1words[word] = 1
    for word in str(q2).lower().split():
        if word not in stop_words:
            q2words[word] = 1
    if len(q1words) == 0 or len(q2words) == 0:
        # The computer-generated chaff includes a few questions that are nothing but stopwords
        return 0
    
    shared_weights = [weights.get(w, 0) for w in q1words.keys() if w in q2words] + [weights.get(w, 0) for w in q2words.keys() if w in q1words]
    total_weights = [weights.get(w, 0) for w in q1words] + [weights.get(w, 0) for w in q2words]
    
    return np.sum(shared_weights, dtype=np.float64) / np.sum(total_weights, dtype=np.float64)

In [15]:
df_train['tdidf'] = df_train.apply(lambda x: tfidf(x['question1'],x['question2'],stop_words), axis=1)

### D. Other

In [16]:
# Same final words
def final_words(q1,q2):
    q1_s = q1.split()
    q2_s = q2.split()
    if q1_s and q2_s:
        return q1_s[-1] == q2_s[-1]
    else:
        return False

df_train['final_words'] = df_train.apply(lambda x: final_words(x['clean_q1'], x['clean_q2']), axis=1)

In [17]:
# Length of questions
def len_question(sentence):
    return len(unigram(sentence))

df_train['len_q1'] = df_train.apply(lambda x: len_question(x['clean_q1']), axis=1)
df_train['len_q2'] = df_train.apply(lambda x: len_question(x['clean_q2']), axis=1)


In [18]:
# Ratio of length between questions
def len_ratio(q1,q2):
    len_q1 = len_question(q1)
    len_q2 = len_question(q2)
    if len_q1 > len_q2:
        return len_q2/len_q1
    else:
        if len_q2:
            return len_q1/len_q2
        else:
            return len_q1

len_ratio(q1,q2)
df_train['len_ratio'] = df_train.apply(lambda x: len_ratio(x['clean_q1'], x['clean_q2']), axis=1)


In [19]:
# Question frequency

all_qid = df_train['qid1'].tolist() + df_train['qid2'].tolist()
df = pd.DataFrame({'freq':all_qid})
freq_serie = df['freq'].value_counts()
total = len(all_qid)

def freq_question(qid, total, serie):
    nb_occ = serie.get(qid)
    return nb_occ / total
    
df_train['freq_qid1'] = df_train.apply(lambda x: freq_question(x['qid1'],total,freq_serie), axis=1)
df_train['freq_qid2'] = df_train.apply(lambda x: freq_question(x['qid2'],total,freq_serie), axis=1)

In [20]:
df_train.to_csv('../data/cleaned_train_with_ABCD_features.csv', index=False)

## Model

In [21]:
# split data into X and y
x_train = pd.DataFrame()
x_train['jaccard_sim'] = df_train['jaccard_sim']
x_train['shared_uni'] = df_train['shared_uni']
x_train['shared_bi'] = df_train['shared_bi']
x_train['shared_tri'] = df_train['shared_tri']
x_train['norm_wmd'] = df_train['norm_wmd']
x_train['final_words'] = df_train['final_words']
x_train['len_q1'] = df_train['len_q1']
x_train['len_q2'] = df_train['len_q2']
x_train['len_ratio'] = df_train['len_ratio']
x_train['freq_qid1'] = df_train['freq_qid1']
x_train['freq_qid2'] = df_train['freq_qid2']
x_train['tdidf'] = df_train['tdidf']

y_train = df_train['is_duplicate'].values

## Gradient boosted decision trees

XGBoost is used for supervised learning problems, where we use the training data (with multiple features) __X__ (here _x_train_ with features as columns) to predict a target variable __Y__ (here _y_train_ with 'is_duplicate').

In [22]:
from sklearn.model_selection import train_test_split

# Split data into train and test sets
seed = 7
test_size = 0.33

x_train_set, x_test_set, y_train_set, y_test_set = train_test_split(
    x_train, y_train, test_size=test_size, random_state=seed)

# Model parameter
params = {}
params['objective'] = 'binary:logistic' # binary output
params['eval_metric'] = 'logloss' # validation metrics
params['eta'] = 0.04 # learning rate, default 0.03
params['max_depth'] = 3 # Less over-fitting, default = 6

In [23]:
import xgboost as xgb

d_train = xgb.DMatrix(x_train_set, label=y_train_set)
d_test = xgb.DMatrix(x_test_set, label=y_test_set)

watchlist  = [(d_test,'eval'), (d_train,'train')]
num_round = 2001
bst = xgb.train(params, d_train, num_round, watchlist, verbose_eval=50)

bst.save_model('../data/xgb_model.mdl')

[0]	eval-logloss:0.675273	train-logloss:0.675298
[50]	eval-logloss:0.397554	train-logloss:0.397916
[100]	eval-logloss:0.359662	train-logloss:0.359589
[150]	eval-logloss:0.346636	train-logloss:0.346477
[200]	eval-logloss:0.339821	train-logloss:0.339515
[250]	eval-logloss:0.335834	train-logloss:0.335317
[300]	eval-logloss:0.332596	train-logloss:0.331884
[350]	eval-logloss:0.329178	train-logloss:0.328264
[400]	eval-logloss:0.327192	train-logloss:0.326092
[450]	eval-logloss:0.325749	train-logloss:0.324522
[500]	eval-logloss:0.324194	train-logloss:0.322834
[550]	eval-logloss:0.323086	train-logloss:0.321564
[600]	eval-logloss:0.322239	train-logloss:0.320513
[650]	eval-logloss:0.321216	train-logloss:0.319368
[700]	eval-logloss:0.320488	train-logloss:0.318508
[750]	eval-logloss:0.319927	train-logloss:0.317791
[800]	eval-logloss:0.319121	train-logloss:0.316908
[850]	eval-logloss:0.318641	train-logloss:0.316297
[900]	eval-logloss:0.318176	train-logloss:0.315706
[950]	eval-logloss:0.317727	train-

### Prediction on the test data

In [None]:
df_test = pd.read_csv('../data/test.csv').fillna("")

# pre-process
df_test['clean_q1'] = df_test.apply(lambda x: clean_question(x['question1'], stop_words), axis=1)
df_test['clean_q2'] = df_test.apply(lambda x: clean_question(x['question2'], stop_words), axis=1)
df_test.to_csv('../data/cleaned_test.csv', index=False)

In [None]:
# Features
x_test = pd.DataFrame()
x_test['jaccard_sim'] = df_test.apply(lambda x: jaccard(x['clean_q1'], x['clean_q2']), axis=1)
x_test['shared_uni'] = df_test.apply(lambda x: shared_unigrams(x['clean_q1'], x['clean_q2']), axis=1)
x_test['shared_bi'] = df_test.apply(lambda x: shared_bigrams(x['clean_q1'], x['clean_q2']), axis=1)
x_test['shared_tri'] = df_test.apply(lambda x: shared_trigrams(x['clean_q1'], x['clean_q2']), axis=1)

In [None]:
x_test['norm_wmd'] = df_test.apply(lambda x: wmd(x['clean_q1'], x['clean_q2'],word2vec_model), axis=1)
x_test['final_words'] = df_test.apply(lambda x: final_words(x['clean_q1'], x['clean_q2']), axis=1)

In [None]:
x_test['tdidf'] = df_test.apply(lambda x: tfidf(x['question1'],x['question2'],stop_words), axis=1)

In [None]:
x_test['len_q1'] = df_test.apply(lambda x: len_question(x['clean_q1']), axis=1)
x_test['len_q2'] = df_test.apply(lambda x: len_question(x['clean_q2']), axis=1)
x_test['len_ratio'] = df_test.apply(lambda x: len_ratio(x['clean_q1'], x['clean_q2']), axis=1)

In [None]:
from collections import defaultdict

# Question frequency
df_freq = pd.concat([df_train[['question1', 'question2']], \
        df_test[['question1', 'question2']]], axis=0).reset_index(drop='index')
q_dict = defaultdict(set)
for i in range(df_test.shape[0]):
    q_dict[df_freq.question1[i]].add(df_freq.question2[i])
    q_dict[df_freq.question2[i]].add(df_freq.question1[i])

In [None]:
def question_dic_freq(question):
    return(len(q_dict[question]))
        
x_test['freq_qid1'] = df_test.apply(lambda x: question_dic_freq(x['question1']), axis=1)
x_test['freq_qid2'] = df_test.apply(lambda x: question_dic_freq(x['question2']), axis=1)

In [None]:
x_test.to_csv('../data/cleaned_test_with_ABCD_features.csv', index=False)

## Prediction

In [None]:
df_test2 = pd.read_csv('../data/cleaned_test_with_ABCD_features.csv').fillna("")

In [None]:
x_test = pd.DataFrame()
x_test['jaccard_sim'] = df_test2['jaccard_sim']
x_test['shared_uni'] = df_test2['shared_uni']
x_test['shared_bi'] = df_test2['shared_bi']
x_test['shared_tri'] = df_test2['shared_tri']
x_test['norm_wmd'] = df_test2['norm_wmd']
x_test['final_words'] = df_test2['final_words']
x_test['len_q1'] = df_test2['len_q1']
x_test['len_q2'] = df_test2['len_q2']
x_test['len_ratio'] = df_test2['len_ratio']
x_test['freq_qid1'] = df_test2['freq_qid1']
x_test['freq_qid2'] = df_test2['freq_qid2']
x_test['tdidf'] = df_test2['tdidf'].fillna(0)

In [None]:
bst = xgb.Booster(params)
bst.load_model('../data/xgb_model.mdl')

d_test = xgb.DMatrix(x_test)
p_test = bst.predict(d_test)

In [None]:
final = pd.DataFrame()
final['test_id'] = df_test['test_id']
final['is_duplicate'] = p_test
final.to_csv('../data/output_xgb.csv', index=False)

In [None]:
final.head()