## Identifying Duplicate Questions

Over 100 million people visit Quora every month, so it's no surprise that many people ask similar (or identical) questions. Multiple questions with the same intent force seekers to spend extra time searching for the best answer, and lead writers to answer several versions of the same question. Quora uses a random forest model to identify duplicate questions, which provides a better experience for active seekers and writers and offers more value to both groups in the long term.

Follow the steps outlined below to build the appropriate classifier model.


Steps:
- Download data
- Exploration
- Cleaning
- Feature Engineering
- Modeling

By the end of this project you should have **a presentation that describes the model you built** and its **performance**. 


In [113]:
import pandas as pd
import string
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import gensim
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import GridSearchCV
from sklearn.utils import resample
import warnings
warnings.filterwarnings('ignore')

In [169]:
#loading the data
message = pd.read_csv("./train.csv")
message

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0
...,...,...,...,...,...,...
404285,404285,433578,379845,How many keywords are there in the Racket prog...,How many keywords are there in PERL Programmin...,0
404286,404286,18840,155606,Do you believe there is life after death?,Is it true that there is life after death?,1
404287,404287,537928,537929,What is one coin?,What's this coin?,0
404288,404288,537930,537931,What is the approx annual cost of living while...,I am having little hairfall problem but I want...,0


#### Note
There is no designated test.csv file. The train.csv file is the entire dataset. Part of the data in the train.csv file should be set aside to act as the final testing data.
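
As a minimal sketch of the note above, the hold-out can be carved out with `train_test_split`. The toy frame below is a hypothetical stand-in for train.csv (only the column names are taken from the real file), and the 30% test size and `stratify` option are assumptions, not requirements of the project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for train.csv (hypothetical rows, real column names)
toy = pd.DataFrame({
    "question1": [f"q1 text {i}" for i in range(40)],
    "question2": [f"q2 text {i}" for i in range(40)],
    "is_duplicate": [0, 0, 0, 1] * 10,
})

# Set aside 30% as final test data; stratify keeps the duplicate
# ratio the same in the development and hold-out parts
dev_df, holdout_df = train_test_split(
    toy, test_size=0.3, stratify=toy["is_duplicate"], random_state=42
)

print(len(dev_df), len(holdout_df))
```

The hold-out part should stay untouched until the final evaluation, so that the reported score reflects unseen data.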

### Exploration

In [42]:
#checking for null values
message.isnull().sum()

id              0
qid1            0
qid2            0
question1       1
question2       2
is_duplicate    0
dtype: int64

In [4]:
#checking for type and shape of data
message.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 404290 entries, 0 to 404289
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   id            404290 non-null  int64 
 1   qid1          404290 non-null  int64 
 2   qid2          404290 non-null  int64 
 3   question1     404289 non-null  object
 4   question2     404288 non-null  object
 5   is_duplicate  404290 non-null  int64 
dtypes: int64(4), object(2)
memory usage: 18.5+ MB


In [5]:
#exploring the entries having null values
message[message.question1.isnull() | message.question2.isnull()]

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
105780,105780,174363,174364,How can I develop android app?,,0
201841,201841,303951,174364,How can I create an Android app?,,0
363362,363362,493340,493341,,My Chinese name is Haichao Yu. What English na...,0


In [7]:
#checking for balance of data
message['is_duplicate'].value_counts()

0    255027
1    149263
Name: is_duplicate, dtype: int64
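
The counts above can be turned into proportions with a few lines of arithmetic. This is only a sketch to put later accuracy figures in context: on the unbalanced data, always predicting "not duplicate" would already reach the majority-class share.

```python
# Class counts copied from the value_counts() output above
counts = {0: 255027, 1: 149263}

total = sum(counts.values())
duplicate_share = counts[1] / total                # fraction of duplicate pairs
majority_baseline = max(counts.values()) / total   # accuracy of always predicting class 0

print(f"total pairs: {total}")
print(f"duplicate share: {duplicate_share:.4f}")
print(f"majority-class baseline accuracy: {majority_baseline:.4f}")
```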

### Cleaning

- Tokenization
- Stopwords cleaning
- Removing punctuation
- Normalizing
- Stemming
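
The first steps in the list above can be sketched as a single function. This is an illustration under assumptions: `STOPWORDS` is a tiny hand-written set standing in for NLTK's English list, `clean_question` is a hypothetical helper name, and stemming or lemmatizing is left out.

```python
import string

# Tiny illustrative stopword set (assumption); NLTK's English list is much longer
STOPWORDS = {"what", "is", "the", "to", "in", "by", "of", "a", "how", "can", "i"}

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_question(text):
    """Strip punctuation, lowercase, split on whitespace, drop stopwords."""
    tokens = text.translate(PUNCT_TABLE).lower().split()
    return [t for t in tokens if t not in STOPWORDS]

cleaned = clean_question(
    "What is the step by step guide to invest in share market in India?"
)
print(cleaned)
```

Keeping the steps in one function makes it easier to apply exactly the same cleaning to new questions at prediction time.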

In [170]:
#dropping the null values and useless columns
message = message.drop(['id','qid1', 'qid2'], axis=1)
message.columns = ['question1', 'question2', 'label']
message = message.dropna()

message.head()

Unnamed: 0,question1,question2,label
0,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


In [171]:
#double checking
message.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 404287 entries, 0 to 404289
Data columns (total 3 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   question1  404287 non-null  object
 1   question2  404287 non-null  object
 2   label      404287 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 12.3+ MB


In [172]:
#removing punctuation
punctuation = string.punctuation
message['question1'] = message['question1'].apply(lambda x: "".join([i for i in x if i not in punctuation]))
message['question2'] = message['question2'].apply(lambda x: "".join([i for i in x if i not in punctuation]))

message.head(5)

Unnamed: 0,question1,question2,label
0,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,What is the story of Kohinoor KohiNoor Diamond,What would happen if the Indian government sto...,0
2,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,Why am I mentally very lonely How can I solve it,Find the remainder when math2324math is divide...,0
4,Which one dissolve in water quikly sugar salt ...,Which fish would survive in salt water,0


In [173]:
#tokenizing
message['question1'] = message['question1'].apply(lambda x: x.lower().split())
message['question2'] = message['question2'].apply(lambda x: x.lower().split())
message.head(5)

Unnamed: 0,question1,question2,label
0,"[what, is, the, step, by, step, guide, to, inv...","[what, is, the, step, by, step, guide, to, inv...",0
1,"[what, is, the, story, of, kohinoor, kohinoor,...","[what, would, happen, if, the, indian, governm...",0
2,"[how, can, i, increase, the, speed, of, my, in...","[how, can, internet, speed, be, increased, by,...",0
3,"[why, am, i, mentally, very, lonely, how, can,...","[find, the, remainder, when, math2324math, is,...",0
4,"[which, one, dissolve, in, water, quikly, suga...","[which, fish, would, survive, in, salt, water]",0


In [174]:
#removing stopwords (a set makes each membership test constant time)
common_words = set(stopwords.words('english'))
message['question1'] = message['question1'].apply(lambda x: [i for i in x if i not in common_words])
message['question2'] = message['question2'].apply(lambda x: [i for i in x if i not in common_words])
message.head(5)

Unnamed: 0,question1,question2,label
0,"[step, step, guide, invest, share, market, india]","[step, step, guide, invest, share, market]",0
1,"[story, kohinoor, kohinoor, diamond]","[would, happen, indian, government, stole, koh...",0
2,"[increase, speed, internet, connection, using,...","[internet, speed, increased, hacking, dns]",0
3,"[mentally, lonely, solve]","[find, remainder, math2324math, divided, 2423]",0
4,"[one, dissolve, water, quikly, sugar, salt, me...","[fish, would, survive, salt, water]",0


In [176]:
#lemmatizing on adjectives
lemmatizer = WordNetLemmatizer()
message['question1'] = message['question1'].apply(lambda x: [lemmatizer.lemmatize(i, pos ="a") for i in x])
message['question2'] = message['question2'].apply(lambda x: [lemmatizer.lemmatize(i, pos ="a") for i in x])
message.head(5)

Unnamed: 0,question1,question2,label
0,"[step, step, guide, invest, share, market, india]","[step, step, guide, invest, share, market]",0
1,"[story, kohinoor, kohinoor, diamond]","[would, happen, indian, government, stole, koh...",0
2,"[increase, speed, internet, connection, using,...","[internet, speed, increased, hacking, dns]",0
3,"[mentally, lonely, solve]","[find, remainder, math2324math, divided, 2423]",0
4,"[one, dissolve, water, quikly, sugar, salt, me...","[fish, would, survive, salt, water]",0


In [177]:
#joining the tokens
message['question1'] = message['question1'].apply(lambda x: ' '.join(x))
message['question2'] = message['question2'].apply(lambda x: ' '.join(x))
message.head(5)

Unnamed: 0,question1,question2,label
0,step step guide invest share market india,step step guide invest share market,0
1,story kohinoor kohinoor diamond,would happen indian government stole kohinoor ...,0
2,increase speed internet connection using vpn,internet speed increased hacking dns,0
3,mentally lonely solve,find remainder math2324math divided 2423,0
4,one dissolve water quikly sugar salt methane c...,fish would survive salt water,0


In [180]:
#saving the cleaned data
message.to_csv('./cleaned_data.csv', index=False)

### Feature Engineering

- tf-idf
- word2vec
- word count
- number of the same words in both questions
- ....
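
A few of the features listed above can be sketched on a handful of cleaned question pairs. The toy rows, the helper `shared_word_count`, and the column names are assumptions made for illustration; the tf-idf part relies on `TfidfVectorizer` L2-normalising each row by default, so a row-wise dot product equals the cosine similarity.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy cleaned pairs (illustrative only)
pairs = pd.DataFrame({
    "question1": ["step step guide invest share market india",
                  "believe life death",
                  "one coin"],
    "question2": ["step step guide invest share market",
                  "true life death",
                  "fish would survive salt water"],
})

def shared_word_count(q1, q2):
    """Number of distinct words that appear in both questions."""
    return len(set(q1.split()) & set(q2.split()))

features = pd.DataFrame({
    "q1_words": pairs["question1"].str.split().str.len(),
    "q2_words": pairs["question2"].str.split().str.len(),
    "shared_words": [shared_word_count(a, b)
                     for a, b in zip(pairs["question1"], pairs["question2"])],
})

# Fit tf-idf on all questions, then compare each pair row by row
vectorizer = TfidfVectorizer()
vectorizer.fit(pd.concat([pairs["question1"], pairs["question2"]]))
tfidf_q1 = vectorizer.transform(pairs["question1"])
tfidf_q2 = vectorizer.transform(pairs["question2"])
features["tfidf_cosine"] = np.asarray(
    tfidf_q1.multiply(tfidf_q2).sum(axis=1)
).ravel()

print(features)
```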

In [140]:
#loading the cleaned data
message = pd.read_csv("./cleaned_data.csv") 
message = message.dropna().reset_index(drop=True)
message

Unnamed: 0,question1,question2,label
0,step step guide invest share market india,step step guide invest share market,0
1,story kohinoor kohinoor diamond,would happen indian government stole kohinoor ...,0
2,increase speed internet connection using vpn,internet speed increased hacking dns,0
3,mentally lonely solve,find remainder math2324math divided 2423,0
4,one dissolve water quikly sugar salt methane c...,fish would survive salt water,0
...,...,...,...
404160,many keywords racket programming language late...,many keywords perl programming language late v...,0
404161,believe life death,true life death,1
404162,one coin,whats coin,0
404163,approx annual cost living studying uic chicago...,little hairfall problem want use hair styling ...,0


In [198]:
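#double checking the shape
#(isnull().info() describes the boolean mask only; message.isnull().sum() would count nulls)
message.isnull().info()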

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 404165 entries, 0 to 404164
Data columns (total 3 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   question1  404165 non-null  bool 
 1   question2  404165 non-null  bool 
 2   label      404165 non-null  bool 
dtypes: bool(3)
memory usage: 1.2 MB


In [141]:
#tokenizing
message['question1'] = message['question1'].apply(lambda x: x.split())
message['question2'] = message['question2'].apply(lambda x: x.split())
message.head(5)


Unnamed: 0,question1,question2,label
0,"[step, step, guide, invest, share, market, india]","[step, step, guide, invest, share, market]",0
1,"[story, kohinoor, kohinoor, diamond]","[would, happen, indian, government, stole, koh...",0
2,"[increase, speed, internet, connection, using,...","[internet, speed, increased, hacking, dns]",0
3,"[mentally, lonely, solve]","[find, remainder, math2324math, divided, 2423]",0
4,"[one, dissolve, water, quikly, sugar, salt, me...","[fish, would, survive, salt, water]",0


In [142]:
#preparing the data for word2vec training
training_data = pd.concat([message['question1'], message['question2']], ignore_index=True).reset_index(drop=True)
training_data

0         [step, step, guide, invest, share, market, india]
1                      [story, kohinoor, kohinoor, diamond]
2         [increase, speed, internet, connection, using,...
3                                 [mentally, lonely, solve]
4         [one, dissolve, water, quikly, sugar, salt, me...
                                ...                        
808325    [many, keywords, perl, programming, language, ...
808326                                  [true, life, death]
808327                                        [whats, coin]
808328    [little, hairfall, problem, want, use, hair, s...
808329                                  [like, sex, cousin]
Length: 808330, dtype: object

In [117]:
#training word2vec model
model_word2vec = gensim.models.Word2Vec(training_data, vector_size = 50, window = 4, min_count = 0)


In [143]:
#vectorizing each document as the mean of its word vectors
#(apply avoids chained assignment, which pandas warns about and may not write back)
def document_vector(tokens):
    return np.mean([model_word2vec.wv[t] for t in tokens], axis=0)

message['question1'] = message['question1'].apply(document_vector)
message['question2'] = message['question2'].apply(document_vector)

message

Unnamed: 0,question1,question2,label
0,"[-0.07167065, -1.0874205, 0.017766748, 1.22209...","[-0.25407422, -0.8449242, -0.20848584, 1.23597...",0
1,"[-0.34327218, 0.17414057, 0.049539745, 0.31465...","[0.42332283, -0.4673859, -0.34021503, 0.087255...",0
2,"[-0.2512975, 1.04085, -1.6191721, 0.7223783, -...","[0.3294565, 0.03207121, -0.72052205, 0.6779605...",0
3,"[-0.37822175, 0.038107943, -0.74548155, -0.798...","[-0.5221234, -0.20535085, -1.5316513, 0.116613...",0
4,"[-0.05219735, 0.41353494, -0.28826874, -0.3058...","[-0.0014979772, 1.190263, 0.18299475, -0.77524...",0
...,...,...,...
404160,"[0.25131688, 0.7876653, -1.4903367, -0.3660437...","[0.2911484, 0.8170861, -1.4928764, -0.38677862...",0
404161,"[0.68210214, 0.1496731, 1.2797834, 0.21794479,...","[0.87879443, 0.07379445, 0.8525842, 0.44833604...",1
404162,"[0.0471607, -0.88970315, -1.3850651, -0.289441...","[0.34339184, 1.0306684, -0.17767984, -0.193153...",0
404163,"[0.7725865, -0.28841126, -0.041376486, 0.59982...","[0.38098103, -1.0133544, -0.8146707, 0.1309495...",0
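
The averaging step can be illustrated without a trained model. In the sketch below, `toy_vectors` is a hypothetical 3-dimensional lookup standing in for `model_word2vec.wv`, and `mean_vector` is an assumed helper that also handles words missing from the vocabulary, which can happen for new questions even though `min_count = 0` covers every training word.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for model_word2vec.wv
toy_vectors = {
    "life": np.array([1.0, 0.0, 2.0]),
    "death": np.array([3.0, 2.0, 0.0]),
}

def mean_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

doc_vec = mean_vector(["believe", "life", "death"], toy_vectors, dim=3)
empty_vec = mean_vector(["unknown"], toy_vectors, dim=3)
print(doc_vec)
print(empty_vec)
```

Returning a zero vector is only one possible fallback; dropping such rows is another.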


In [144]:
# Separate majority and minority classes
max_class = message[message.label == 0]
min_class = message[message.label == 1]
 
# Downsample majority class
tab_max = resample(max_class, 
                   replace = False,   
                   n_samples = len(min_class),  # match the minority class size
                   random_state=123)
 
# Combine minority class with downsampled majority class
message = pd.concat([tab_max, min_class])
message = message.reset_index(drop=True)

#checking output
message['label'].value_counts()

0    149259
1    149259
Name: label, dtype: int64
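
One variation worth considering is to split first and downsample only the training part, so that the test set keeps the original class ratio. The sketch below uses a toy frame with a hypothetical `feature` column and a 70/30 class mix; it is an alternative under those assumptions, not the procedure used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy data with 70 negatives and 30 positives
toy = pd.DataFrame({"feature": range(100), "label": [0] * 70 + [1] * 30})

# Split first so the test set keeps the original class ratio
train_df, test_df = train_test_split(
    toy, test_size=0.3, stratify=toy["label"], random_state=42
)

# Downsample the majority class inside the training part only
majority = train_df[train_df["label"] == 0]
minority = train_df[train_df["label"] == 1]
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=123)
balanced_train = pd.concat([majority_down, minority]).reset_index(drop=True)

print(balanced_train["label"].value_counts())
print(test_df["label"].value_counts())
```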

In [121]:
#making the data ready for training model
message = pd.concat([pd.DataFrame(message['question1'].to_list()), 
                     pd.DataFrame(message['question2'].to_list()), 
                     message['label']], axis=1)
message['label'] = message['label'] == 1
message

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,41,42,43,44,45,46,47,48,49,label
0,-0.183687,-1.171803,-0.989652,0.403407,0.505179,-0.921104,0.886456,0.897650,-0.244627,-0.963114,...,1.086491,-0.278801,0.037397,1.273736,0.221238,-0.621448,-1.222416,-0.738569,0.155263,False
1,0.218390,-1.145710,-0.672877,-0.630909,-0.226211,0.364081,0.806007,2.189648,0.649629,-0.473313,...,0.381085,0.769267,-0.049070,0.543064,0.772594,-0.649824,-0.651627,-0.938636,-0.698390,False
2,0.386397,0.592529,-0.125509,0.108326,0.733482,0.144116,0.090382,-0.372318,-0.679812,0.481637,...,0.075570,-0.060495,0.022953,1.016139,-0.349210,-0.083458,0.145254,0.396989,-0.271952,False
3,-0.280604,0.641521,-1.002964,0.560172,0.511830,0.052247,1.510296,2.051119,1.018711,-0.304311,...,0.677201,-0.115658,-1.279813,1.183333,-0.251360,-1.040749,0.022727,-0.816676,1.826516,False
4,0.725860,-1.129289,0.528622,0.270596,0.159662,-0.762757,0.837320,0.201949,0.515901,0.159321,...,0.183189,-0.473401,-0.115975,0.654645,0.921295,-0.252697,-0.853532,-0.245854,0.576143,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298513,-0.006284,-1.405400,-0.308053,0.632359,0.334629,0.293212,-0.190902,1.129629,-1.688322,-0.367945,...,-0.131451,0.353527,-0.150649,0.430945,0.144024,-0.720410,0.406001,0.396105,-0.060282,True
298514,-0.987322,0.075753,0.879683,-1.227116,0.307681,1.147305,-0.279997,2.179170,-0.408423,-0.158073,...,-1.423774,-0.596251,1.317945,2.354396,-0.435799,-0.390249,-1.035784,2.162613,0.189906,True
298515,2.654303,-0.724160,0.341088,-1.064408,-1.263167,-0.195906,1.204050,2.017030,-0.296436,-1.054444,...,0.266421,0.678746,0.712721,0.826187,-0.777713,0.278712,-1.995118,-0.844087,-0.271900,True
298516,0.345587,0.309711,-0.412950,-0.328948,0.427059,0.140866,-0.692953,0.496977,-0.279821,-0.249760,...,0.210247,-0.533880,0.461220,0.762006,-0.705113,-0.474573,-0.049309,-0.317150,0.176534,True


In [145]:
#compute cosine similarity as features
tab = []
for i in range(len(message)):
    tab.append(cosine_similarity(message['question1'][i].reshape(1,-1), message['question2'][i].reshape(1,-1)))

tab1 = []
for i in range(len(tab)):
    tab1.append(tab[i][0][0])
tab1 = pd.DataFrame(list(tab1))

message = pd.concat([message['label'], tab1], axis=1)
message['label'] = message['label'] == 1
message.head(5)

Unnamed: 0,label,0
0,False,0.319354
1,False,0.751609
2,False,1.0
3,False,0.981135
4,False,0.729559
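
The same cosine similarity can be computed for all rows at once with NumPy instead of a Python loop. This is a sketch on toy 2-dimensional vectors; `rowwise_cosine` is a hypothetical helper, and the `eps` guard against zero-length vectors is an added assumption.

```python
import numpy as np

def rowwise_cosine(a, b, eps=1e-12):
    """Cosine similarity between matching rows of two 2-D arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    dots = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dots / np.maximum(norms, eps)  # eps avoids division by zero

q1 = np.array([[1.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
q2 = np.array([[0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
similarities = rowwise_cosine(q1, q2)
print(similarities)
```

With a column of vectors such as `message['question1']`, the 2-D input could be built with `np.vstack(message['question1'].to_list())`.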


### Modeling

Different modeling techniques can be used:

- logistic regression
- XGBoost
- LSTMs
- etc

I tried using either the cosine similarity or the numeric representation of the vectors as features. The numeric representation of the vectors gave better performance.

In [122]:
#splitting the data into train and test
X, y = message.drop(['label'], axis = 1), message['label']

train_ratio = 0.7
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, 
                                                    train_size=train_ratio,
                                                    random_state=42)

print(f'{len(X_train)} training samples and {len(X_test)} test samples')

208962 training samples and 89556 test samples


In [147]:
#training logistic regression model
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train,y_train)

print('Model score on training data:', model.score(X_train, y_train))
print('Model score on testing data:', model.score(X_test, y_test))


Model score on training data: 0.6826695762865975
Model score on testing data: 0.6785140024118987


In [136]:
#train random forest
model = RandomForestClassifier(random_state=42, max_depth=10, n_estimators=150)
model = model.fit(X_train, y_train)

print('Model score on training data:', model.score(X_train, y_train))
print('Model score on testing data:', model.score(X_test, y_test))


Model score on training data: 0.7469779194303271
Model score on testing data: 0.717930680244763


In [137]:
#evaluating test data
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

       False       0.69      0.79      0.74     44676
        True       0.76      0.64      0.70     44880

    accuracy                           0.72     89556
   macro avg       0.72      0.72      0.72     89556
weighted avg       0.72      0.72      0.72     89556
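
To make the precision and recall columns of the report concrete, the sketch below derives them from a confusion matrix on ten made-up labels. The labels are illustrative only and are not taken from the model's predictions.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = duplicate, 0 = not duplicate
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# For binary labels, ravel() returns tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)   # share of predicted duplicates that are real
recall = tp / (tp + fn)      # share of real duplicates that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Read this way, a recall of 0.64 for the `True` class in the report means that roughly a third of the real duplicate pairs in the test set were missed.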



In [138]:
#evaluating train data
y_train_pred = model.predict(X_train)
print(classification_report(y_train, y_train_pred))

              precision    recall  f1-score   support

       False       0.71      0.82      0.77    104583
        True       0.79      0.67      0.73    104379

    accuracy                           0.75    208962
   macro avg       0.75      0.75      0.75    208962
weighted avg       0.75      0.75      0.75    208962



In [139]:
#saving model (the with-block closes the file handle after writing)
import pickle
with open('model_NV.sav', 'wb') as f:
    pickle.dump(model, f)
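
A saved model can be loaded back with `pickle.load`. The round trip below is a sketch on a small synthetic model written to a temporary directory; `toy_model` and the file name `toy_model.sav` are assumptions made for illustration. Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Small synthetic problem standing in for the real features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 4))
y_toy = X_toy[:, 0] > 0

toy_model = RandomForestClassifier(n_estimators=10, random_state=42)
toy_model.fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.sav")
    with open(path, "wb") as f:      # save
        pickle.dump(toy_model, f)
    with open(path, "rb") as f:      # load
        restored = pickle.load(f)

# The restored model should reproduce the original predictions
same_predictions = np.array_equal(toy_model.predict(X_toy),
                                  restored.predict(X_toy))
print(same_predictions)
```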

In [None]:
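# setting the hyperparameter grid for random forest
# (computing resource issues with a search of this size)
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# 'auto' was removed in scikit-learn 1.3; for classifiers it was equivalent to 'sqrt'
max_features = ['sqrt', 'log2']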
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

rf = RandomForestClassifier()

grid_search = GridSearchCV(estimator = rf, param_grid = random_grid, 
                          cv = 3, n_jobs = -1, verbose = 2, scoring='accuracy')
grid_search = grid_search.fit(X_train, y_train)

# assess the score
grid_search.best_score_, grid_search.best_params_
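
The grid above contains 10 × 2 × 12 × 3 × 3 × 2 = 4320 combinations, which means 12960 fits with 3-fold cross-validation. One cheaper option is `RandomizedSearchCV`, which samples a fixed number of combinations. The sketch below runs on synthetic data from `make_classification` with a deliberately small search space; the data, the value lists, and `n_iter=5` are all assumptions chosen to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for X_train, y_train
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=42)

# Reduced search space (illustrative values)
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_features": ["sqrt", "log2"],
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,            # only 5 sampled combinations are evaluated
    cv=3,
    scoring="accuracy",
    random_state=42,
    n_jobs=1,
)
random_search.fit(X_toy, y_toy)

# assess the score
print(random_search.best_score_, random_search.best_params_)
```

Raising `n_iter` trades more compute for a better chance of finding a strong combination.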