## Objectives

This notebook trains logistic regression and gradient boosting models on two feature sets computed from the Quora Question Pairs dataset: basic length and word-overlap features, and fuzzywuzzy string-similarity features.

### Installs and imports

In [1]:
!python -V

Python 3.7.1


In [2]:
!pip3 install fuzzywuzzy
!pip3 install python-Levenshtein





In [43]:
!pip install xgboost

Collecting xgboost
  Downloading xgboost-1.0.2-py3-none-win_amd64.whl (24.6 MB)
Installing collected packages: xgboost
Successfully installed xgboost-1.0.2




In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
import fuzzywuzzy
from fuzzywuzzy import fuzz
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

### Load data

In [5]:
data_path = "C:/Users/clara/Documents/Cours/IASD/S2/NLP/NLP_project/data/"  # change this path to point to your local data directory

df_train = pd.read_csv(data_path+"train.csv")
#df_test = pd.read_csv(data_path+"test.csv")

df_train = df_train.dropna()
df_train.shape[0]

404287

In [6]:
df_train = df_train.drop(['id', 'qid1', 'qid2'], axis=1)
df_train[0:10]

Unnamed: 0,question1,question2,is_duplicate
0,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0
5,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan...",1
6,Should I buy tiago?,What keeps childern active and far from phone ...,0
7,How can I be a good geologist?,What should I do to be a great geologist?,1
8,When do you use シ instead of し?,"When do you use ""&"" instead of ""and""?",0
9,Motorola (company): Can I hack my Charter Moto...,How do I hack Motorola DCX3400 for free internet?,0


In [7]:
df_train["question1"] = df_train["question1"].astype("str")
df_train["question2"] = df_train["question2"].astype('str')

In [8]:
#q1 = df_train['question1'].tolist()
#q2 = df_train['question2'].tolist()

## Features 1 : basics

We first compute basic features from each question pair: character and word counts, the number of distinct characters, and the number of words the two questions share.

In [9]:
# question length in characters (spaces included)
df_train['len_q1'] = df_train.question1.apply(lambda x: len(str(x)))
df_train['len_q2'] = df_train.question2.apply(lambda x: len(str(x)))
# signed difference between the two character lengths
df_train['diff_len'] = df_train.len_q1 - df_train.len_q2

# number of distinct non-space characters in each question
df_train['len_char_q1'] = df_train.question1.apply(lambda x: len(''.join(set(str(x).replace(' ', '')))))
df_train['len_char_q2'] = df_train.question2.apply(lambda x: len(''.join(set(str(x).replace(' ', '')))))

# number of whitespace-separated words in each question
df_train['len_word_q1'] = df_train.question1.apply(lambda x: len(str(x).split()))
df_train['len_word_q2'] = df_train.question2.apply(lambda x: len(str(x).split()))

# common words in the two questions
df_train['common_words'] = df_train.apply(lambda x: len(set(str(x['question1'])
    .lower().split())
    .intersection(set(str(x['question2'])
    .lower().split()))), axis=1)

basics = ['len_q1', 'len_q2', 'diff_len', 'len_char_q1', 'len_char_q2', 'len_word_q1', 'len_word_q2', 'common_words']
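
To make the definitions above concrete, here is a minimal sketch that computes the same eight features for a single pair, without pandas. The helper name `basic_features` is hypothetical and only used for illustration; the pair is row 7 of the preview shown earlier.

```python
def basic_features(q1: str, q2: str) -> dict:
    """Compute the eight basic features for one question pair."""
    words_q1 = set(q1.lower().split())
    words_q2 = set(q2.lower().split())
    return {
        "len_q1": len(q1),                             # characters, spaces included
        "len_q2": len(q2),
        "diff_len": len(q1) - len(q2),                 # signed difference
        "len_char_q1": len(set(q1.replace(" ", ""))),  # distinct non-space characters
        "len_char_q2": len(set(q2.replace(" ", ""))),
        "len_word_q1": len(q1.split()),                # whitespace-separated words
        "len_word_q2": len(q2.split()),
        "common_words": len(words_q1 & words_q2),      # shared lowercased words
    }


pair = basic_features("How can I be a good geologist?",
                      "What should I do to be a great geologist?")
print(pair)
```

Note that punctuation stays attached to words here (`geologist?` counts as one token), exactly as in the `split()`-based cell above.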

## Features 2 : fuzzy

In [11]:
df_train['fuzz_qratio'] = df_train.apply(lambda x: fuzz.QRatio(str(x['question1']), str(x['question2'])), axis=1)

df_train['fuzz_WRatio'] = df_train.apply(lambda x: fuzz.WRatio(str(x['question1']), str(x['question2'])), axis=1)

df_train['fuzz_partial_ratio'] = df_train.apply(lambda x: fuzz.partial_ratio(
    str(x['question1']), str(x['question2'])), axis=1)

df_train['fuzz_partial_token_set_ratio'] = df_train.apply(lambda x: fuzz.partial_token_set_ratio(
    str(x['question1']), str(x['question2'])), axis=1)

df_train['fuzz_partial_token_sort_ratio'] = df_train.apply(lambda x: fuzz.partial_token_sort_ratio(
    str(x['question1']), str(x['question2'])), axis=1)

df_train['fuzz_token_set_ratio'] = df_train.apply(lambda x: fuzz.token_set_ratio(
    str(x['question1']), str(x['question2'])), axis=1)

df_train['fuzz_token_sort_ratio'] = df_train.apply(lambda x: fuzz.token_sort_ratio(
    str(x['question1']), str(x['question2'])), axis=1)

fuzzys = ['fuzz_qratio', 'fuzz_WRatio', 'fuzz_partial_ratio', 'fuzz_partial_token_set_ratio',
          'fuzz_partial_token_sort_ratio','fuzz_token_set_ratio', 'fuzz_token_sort_ratio']
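
The fuzzy scores are all integers between 0 and 100 that measure string similarity under different preprocessing. The sketch below illustrates the idea behind a plain ratio and a token-sort ratio using only the standard library's `difflib`. It is an approximation for intuition, not the fuzzywuzzy implementation: the function names are hypothetical, and the real library applies extra string cleaning that is omitted here.

```python
from difflib import SequenceMatcher


def simple_ratio(s1: str, s2: str) -> int:
    """Similarity in [0, 100] from difflib's matching-blocks ratio."""
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


def token_sort_ratio_sketch(s1: str, s2: str) -> int:
    """Lowercase, sort the tokens alphabetically, then compare the results."""
    sorted_1 = " ".join(sorted(s1.lower().split()))
    sorted_2 = " ".join(sorted(s2.lower().split()))
    return simple_ratio(sorted_1, sorted_2)


a = "internet speed increase"
b = "increase internet speed"
# The plain ratio is penalized by word order; the token-sort variant is not.
print(simple_ratio(a, b), token_sort_ratio_sketch(a, b))
```

Sorting tokens before comparing is what makes the token-based scores robust to reordered questions, which is why they are computed alongside the order-sensitive ratios.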

In [17]:
#df_train.to_pickle('df_train_wfs1fs2.plk')

# Preparing datasets for learning methods

We now train and evaluate logistic regression and gradient boosting on those features.

In [27]:
#df_train = pd.read_pickle('df_train_wfs1fs2.plk')

In [55]:
df_train[0:5]

Unnamed: 0,question1,question2,is_duplicate,len_q1,len_q2,diff_len,len_char_q1,len_char_q2,len_word_q1,len_word_q2,common_words,fuzz_qratio,fuzz_WRatio,fuzz_partial_ratio,fuzz_partial_token_set_ratio,fuzz_partial_token_sort_ratio,fuzz_token_set_ratio,fuzz_token_sort_ratio
0,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0,66,57,9,20,20,14,12,10,93,95,98,100,89,100,93
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0,51,88,-37,21,29,8,13,4,66,86,73,100,75,86,63
2,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0,73,59,14,25,24,14,10,4,54,63,53,100,71,66,66
3,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0,50,65,-15,19,26,11,9,0,35,35,30,37,38,36,36
4,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0,76,39,37,25,18,13,7,2,46,86,54,100,63,67,47


Since logistic regression is sensitive to the scale of the features, we start by standardizing each feature to zero mean and unit variance.

In [25]:
scaler = StandardScaler()

In [26]:
y = df_train.is_duplicate.values
y = y.astype('float32').reshape(-1, 1)
X = df_train[basics+fuzzys]
X = X.replace([np.inf, -np.inf], np.nan).fillna(0).values
X = scaler.fit_transform(X)
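
As a rough illustration of what standardization does to one column, the sketch below rescales the `len_q1` values of the first five rows shown above. It assumes the population standard deviation (dividing by n, not n - 1) and maps a constant column to zeros; `standardize` is a hypothetical helper, not part of scikit-learn.

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale a list of numbers to zero mean and unit (population) variance."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(value - mu) / sigma for value in column]


lengths = [66, 51, 73, 50, 76]  # len_q1 of the first five rows
scaled = standardize(lengths)
print([round(v, 3) for v in scaled])
```

After this transformation, a character count in the dozens and a fuzzy score out of 100 live on comparable scales, so the regularization penalty treats their coefficients more evenly.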



Then we split the data into an 80% training set and a 20% validation set.

In [39]:
np.random.seed(42)

total = y.shape[0]
index = np.random.permutation(total)  # shuffled row indices (np.random.shuffle works in place and returns None)
split = 2*total // 10  # the first 20% of the shuffled indices form the validation set, the remaining 80% the training set

index_val = index[:split]
index_train = index[split:]

x_train = X[index_train]
y_train = np.ravel(y[index_train])
x_val = X[index_val]
y_val = np.ravel(y[index_val])
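
The index-based split can be sketched in pure Python as follows. The function name `shuffled_split` and its parameters are hypothetical; the point is that the indices are shuffled once, and the training and validation sets are disjoint slices of that single shuffled list.

```python
import random


def shuffled_split(n_rows, val_fraction=0.2, seed=42):
    """Return (train_indices, val_indices) from one seeded shuffle."""
    rng = random.Random(seed)   # local generator, so the split is reproducible
    indices = list(range(n_rows))
    rng.shuffle(indices)        # shuffles in place and returns None
    n_val = int(n_rows * val_fraction)
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = shuffled_split(10)
print(len(train_idx), len(val_idx))
```

Because both slices come from the same permutation, no row can end up in both sets, and fixing the seed makes the validation accuracy comparable across runs.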

In [66]:
print("x_train shape is: {}".format(x_train.shape) + " and x_val shape is: {}".format(x_val.shape))
print("y_train shape is: {}".format(y_train.shape) + " and y_val shape is: {}".format(y_val.shape))

x_train shape is: (323430, 15) and x_val shape is: (80857, 15)
y_train shape is: (323430,) and y_val shape is: (80857,)


# Logistic Regression

In [67]:
lreg = linear_model.LogisticRegression(C=0.1, solver='sag', max_iter=1000)

lreg.fit(x_train, y_train)
lreg_preds = lreg.predict(x_val)
lreg_accuracy = np.sum(lreg_preds == y_val) / len(y_val)

print("Logistic regression accuracy on validation set: %0.3f" % lreg_accuracy)

Logistic regression accuracy on validation set: 0.664
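
For intuition about what the fitted model computes at prediction time, here is a minimal sketch of a logistic regression decision for one standardized feature vector. The weights and bias are made-up values for illustration, not the coefficients learned above, and `predict_duplicate` is a hypothetical helper.

```python
import math


def predict_duplicate(features, weights, bias):
    """Return (probability, label) for one example under a linear-logistic model."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    probability = 1.0 / (1.0 + math.exp(-score))  # sigmoid of the linear score
    label = int(score > 0)                        # same as probability > 0.5
    return probability, label


# Two hypothetical standardized features with made-up coefficients.
probability, label = predict_duplicate([1.0, -0.5], [2.0, 1.0], -3.0)
print(round(probability, 3), label)
```

The model is a single linear score passed through a sigmoid, which is why it cannot capture interactions between features unless they are added explicitly.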


# Gradient Boosting

In [68]:
params = dict()
params['objective'] = 'binary:logistic'
params['eval_metric'] = ['logloss', 'error']  # evaluation metrics go under 'eval_metric'; 'loss' is not an XGBoost parameter
params['learning_rate'] = 0.02
params['max_depth'] = 4

d_train = xgb.DMatrix(x_train, label=y_train)
d_valid = xgb.DMatrix(x_val, label=y_val)
watchlist = [(d_train, 'train'), (d_valid, 'valid')]

boosting = xgb.train(params, d_train, 5000, watchlist, early_stopping_rounds=50, verbose_eval=False)
xgboost_preds = (boosting.predict(d_valid) >= 0.5).astype(int)
xgboost_accuracy = np.sum(xgboost_preds == y_val) / len(y_val)

print("Gradient boosting accuracy on validation set: %0.3f" % xgboost_accuracy)

Gradient boosting accuracy on validation set: 0.731
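
With the `binary:logistic` objective the booster outputs probabilities, so they are thresholded at 0.5 before being compared with the labels. The sketch below shows that step on made-up numbers; the helper name and the sample values are hypothetical.

```python
def accuracy_from_probabilities(probabilities, labels, threshold=0.5):
    """Threshold predicted probabilities and return the fraction of correct labels."""
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    correct = sum(int(pred == y) for pred, y in zip(predictions, labels))
    return correct / len(labels)


probs = [0.91, 0.40, 0.55, 0.08]  # made-up predicted probabilities
labels = [1, 0, 0, 0]             # made-up ground truth
print(accuracy_from_probabilities(probs, labels))
```

Accuracy depends on the chosen threshold, so with imbalanced classes it can be worth inspecting other thresholds or a threshold-free metric such as log loss alongside it.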
