## Objectives

This notebook trains logistic regression and gradient boosting classifiers on two feature sets computed from the Quora Question Pairs dataset: basic text statistics and FuzzyWuzzy string-similarity scores.

### Installs and imports

In [1]:
!python -V

Python 3.6.9


In [2]:
!pip3 install fuzzywuzzy
!pip3 install python-Levenshtein
!pip install sentence-transformers

Collecting fuzzywuzzy
  Downloading https://files.pythonhosted.org/packages/43/ff/74f23998ad2f93b945c0309f825be92e04e0348e062026998b5eefef4c33/fuzzywuzzy-0.18.0-py2.py3-none-any.whl
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.18.0
Collecting python-Levenshtein
  Downloading https://files.pythonhosted.org/packages/42/a9/d1785c85ebf9b7dfacd08938dd028209c34a0ea3b1bcdb895208bd40a67d/python-Levenshtein-0.12.0.tar.gz (48kB)
Building wheels for collected packages: python-Levenshtein
  Building wheel for python-Levenshtein (setup.py) ... done
  Created wheel for python-Levenshtein: filename=python_Levenshtein-0.12.0-cp36-cp36m-linux_x86_64.whl size=144797 sha256=5dcdc4a7c67e286d81ce4ae150e8fd9a8a3e6314c8ec974aa21cc308e604793c
  Stored in directory: /root/.cache/pip/wheels/de/c2/93/660fd5f7559049268ad2dc6d81c4e39e9e36518766eaf7e342
Successfully built python-Levenshtein
Installing collect

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
import fuzzywuzzy
from fuzzywuzzy import fuzz
from sklearn import linear_model
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

In [4]:
from google.colab import drive
drive.mount('/content/drive/')
%cd '/content/drive/My Drive/NLP/'
!ls

Mounted at /content/drive/
/content/drive/My Drive/NLP
data		   FuzzyWuzzy.ipynb			   Untitled0.ipynb
Fuzzy_Quora.ipynb  Sentence_Transformers_embbedings.ipynb


### Load data

In [5]:
data_path = "data/" #change datapath to fit yours

df_train = pd.read_csv(data_path+"train.csv")

df_train = df_train.dropna()
df_train.shape[0]

404287

In [0]:
df_train = df_train.drop(['id', 'qid1', 'qid2'], axis=1)
df_train[0:10]

|  | question1 | question2 | is_duplicate |
|---|---|---|---|
| 0 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 |
| 1 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 |
| 2 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 |
| 3 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0 |
| 4 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0 |
| 5 | Astrology: I am a Capricorn Sun Cap moon and c... | I'm a triple Capricorn (Sun, Moon and ascendan... | 1 |
| 6 | Should I buy tiago? | What keeps childern active and far from phone ... | 0 |
| 7 | How can I be a good geologist? | What should I do to be a great geologist? | 1 |
| 8 | When do you use シ instead of し? | When do you use "&" instead of "and"? | 0 |
| 9 | Motorola (company): Can I hack my Charter Moto... | How do I hack Motorola DCX3400 for free internet? | 0 |


In [0]:
df_train["question1"] = df_train["question1"].astype("str")
df_train["question2"] = df_train["question2"].astype('str')

## Features 1 : basics

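We first compute basic features from each question pair: character lengths and their difference, the number of distinct non-space characters, word counts, and the number of words the two questions share.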

In [0]:
# Length of questions
df_train['len_q1'] = df_train.question1.apply(lambda x: len(str(x)))
df_train['len_q2'] = df_train.question2.apply(lambda x: len(str(x)))
# Difference between length of the two questions
df_train['diff_len'] = df_train.len_q1 - df_train.len_q2
# Number of distinct non-space characters (set() drops repeated characters)
df_train['len_char_q1'] = df_train.question1.apply(lambda x: len(''.join(set(str(x).replace(' ', '')))))
df_train['len_char_q2'] = df_train.question2.apply(lambda x: len(''.join(set(str(x).replace(' ', '')))))
# Number of words
df_train['len_word_q1'] = df_train.question1.apply(lambda x: len(str(x).split()))
df_train['len_word_q2'] = df_train.question2.apply(lambda x: len(str(x).split()))
# Number of common words in the two questions
df_train['common_words'] = df_train.apply(lambda x: len(set(str(x['question1'])
    .lower().split())
    .intersection(set(str(x['question2'])
    .lower().split()))), axis=1)

basics = ['len_q1', 'len_q2', 'diff_len', 'len_char_q1', 'len_char_q2', 'len_word_q1', 'len_word_q2', 'common_words']
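
To make these features concrete, here is a minimal sketch that recomputes them for a single pair in plain Python, without pandas. The helper name `basic_features` is hypothetical and is not part of the notebook; the pair is row 7 of the preview above.

```python
def basic_features(q1: str, q2: str) -> dict:
    """Recompute the basic features of the cell above for one question pair."""
    return {
        "len_q1": len(q1),
        "len_q2": len(q2),
        "diff_len": len(q1) - len(q2),
        # distinct non-space characters, mirroring the set() used in the notebook
        "len_char_q1": len(set(q1.replace(" ", ""))),
        "len_char_q2": len(set(q2.replace(" ", ""))),
        "len_word_q1": len(q1.split()),
        "len_word_q2": len(q2.split()),
        # words are lowercased but punctuation is kept, so "geologist?" is one token
        "common_words": len(set(q1.lower().split()) & set(q2.lower().split())),
    }


example = basic_features(
    "How can I be a good geologist?",
    "What should I do to be a great geologist?",
)
for name, value in example.items():
    print(name, value)
```

Because the split is on whitespace only, punctuation stays attached to words: "geologist?" and "geologist" would not count as a common word.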

## Features 2 : fuzzy

In [0]:
df_train['fuzz_qratio'] = df_train.apply(lambda x: fuzz.QRatio(str(x['question1']), str(x['question2'])), axis=1)

df_train['fuzz_WRatio'] = df_train.apply(lambda x: fuzz.WRatio(str(x['question1']), str(x['question2'])), axis=1)

df_train['fuzz_partial_ratio'] = df_train.apply(lambda x: fuzz.partial_ratio(
    str(x['question1']), str(x['question2'])), axis=1)

df_train['fuzz_partial_token_set_ratio'] = df_train.apply(lambda x: fuzz.partial_token_set_ratio(
    str(x['question1']), str(x['question2'])), axis=1)

df_train['fuzz_partial_token_sort_ratio'] = df_train.apply(lambda x: fuzz.partial_token_sort_ratio(
    str(x['question1']), str(x['question2'])), axis=1)

df_train['fuzz_token_set_ratio'] = df_train.apply(lambda x: fuzz.token_set_ratio(
str(x['question1']), str(x['question2'])), axis=1)

df_train['fuzz_token_sort_ratio'] = df_train.apply(lambda x: fuzz.token_sort_ratio(
    str(x['question1']), str(x['question2'])), axis=1)

fuzzys = ['fuzz_qratio', 'fuzz_WRatio', 'fuzz_partial_ratio', 'fuzz_partial_token_set_ratio',
          'fuzz_partial_token_sort_ratio','fuzz_token_set_ratio', 'fuzz_token_sort_ratio']
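
The scores above all lie between 0 and 100. As a rough illustration of what a token-sort score measures, the sketch below uses the standard library's `difflib` as a stand-in. It is a simplified approximation, not FuzzyWuzzy's implementation: the real library also normalizes punctuation and can use python-Levenshtein. The names `simple_ratio` and `simple_token_sort_ratio` are hypothetical.

```python
from difflib import SequenceMatcher


def simple_ratio(s1: str, s2: str) -> int:
    """Similarity in [0, 100] based on difflib's matching-blocks ratio."""
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


def simple_token_sort_ratio(s1: str, s2: str) -> int:
    """Lowercase, sort the whitespace-separated tokens, then compare."""
    sorted1 = " ".join(sorted(s1.lower().split()))
    sorted2 = " ".join(sorted(s2.lower().split()))
    return simple_ratio(sorted1, sorted2)


a = "how do I learn python"
b = "python how do I learn"
plain = simple_ratio(a, b)                   # penalized by the different word order
token_sorted = simple_token_sort_ratio(a, b)  # word order no longer matters
print(plain, token_sorted)
```

Sorting the tokens first makes the score insensitive to word order, which is why the token-based variants can differ a lot from the plain ratio on reordered questions.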

In [0]:
#df_train.to_pickle('data/df_train_wfs1fs2.plk')

In [0]:
df_train = pd.read_pickle('data/df_train_wfs1fs2.plk')
fuzzys = ['fuzz_qratio', 'fuzz_WRatio', 'fuzz_partial_ratio', 'fuzz_partial_token_set_ratio',
          'fuzz_partial_token_sort_ratio','fuzz_token_set_ratio', 'fuzz_token_sort_ratio']
basics = ['len_q1', 'len_q2', 'diff_len', 'len_char_q1', 'len_char_q2', 'len_word_q1', 'len_word_q2', 'common_words']

# Preparing datasets for learning methods

We now train and evaluate **logistic regression** and **gradient boosting** on those features.

In [8]:
df_train[0:5]

|  | question1 | question2 | is_duplicate | len_q1 | len_q2 | diff_len | len_char_q1 | len_char_q2 | len_word_q1 | len_word_q2 | common_words | fuzz_qratio | fuzz_WRatio | fuzz_partial_ratio | fuzz_partial_token_set_ratio | fuzz_partial_token_sort_ratio | fuzz_token_set_ratio | fuzz_token_sort_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 | 66 | 57 | 9 | 20 | 20 | 14 | 12 | 10 | 93 | 95 | 98 | 100 | 89 | 100 | 93 |
| 1 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 | 51 | 88 | -37 | 21 | 29 | 8 | 13 | 4 | 66 | 86 | 73 | 100 | 75 | 86 | 63 |
| 2 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 | 73 | 59 | 14 | 25 | 24 | 14 | 10 | 4 | 54 | 63 | 53 | 100 | 71 | 66 | 66 |
| 3 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0 | 50 | 65 | -15 | 19 | 26 | 11 | 9 | 0 | 35 | 35 | 30 | 37 | 38 | 36 | 36 |
| 4 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0 | 76 | 39 | 37 | 25 | 18 | 13 | 7 | 2 | 46 | 86 | 54 | 100 | 63 | 67 | 47 |


Since logistic regression trained with SGD is sensitive to the scale of the features, we start by standardizing the data so that each feature has zero mean and unit variance.

In [0]:
scaler = StandardScaler()

In [0]:
y = df_train.is_duplicate.values
y = y.astype('float32').reshape(-1, 1)
X = df_train[basics+fuzzys]
X = X.replace([np.inf, -np.inf], np.nan).fillna(0).values
X = scaler.fit_transform(X)
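
As a sketch of what standardization does to one feature column: subtract the column mean and divide by the population standard deviation, leaving constant columns unscaled. The helper `standardize` is hypothetical and uses only the standard library; the sample values are the `len_q1` entries of the five rows previewed above.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Return (x - mean) / std for one column, using the population std (ddof=0)."""
    mean = fmean(column)
    std = pstdev(column)
    scale = std if std > 0 else 1.0  # a constant column is only centered
    return [(value - mean) / scale for value in column]


len_q1_sample = [66, 51, 73, 50, 76]
z = standardize(len_q1_sample)
print([round(v, 3) for v in z])
print("mean:", round(fmean(z), 6), "std:", round(pstdev(z), 6))
```

Note that the cell above fits the scaler on all rows before the train/validation split. A stricter variant would compute the mean and standard deviation on the training rows only and reuse them for the validation rows.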

We then split the data into an 80% training set and a 20% validation set.

In [0]:
np.random.seed(42)

#shuffling data to avoid bias
total = y.shape[0]
index = np.arange(total) 
np.random.shuffle(index)

#we split the dataset into 20% val - 80% train
split = 2*total // 10 
index_val = index[:split]
index_train = index[split:]

x_train = X[index_train]
y_train = np.ravel(y[index_train])
x_val = X[index_val]
y_val = np.ravel(y[index_val])

In [12]:
print("x_train shape is: {}".format(x_train.shape) + " and x_val shape is: {}".format(x_val.shape))
print("y_train shape is: {}".format(y_train.shape) + " and y_val shape is: {}".format(y_val.shape))

x_train shape is: (323430, 15) and x_val shape is: (80857, 15)
y_train shape is: (323430,) and y_val shape is: (80857,)
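
As a quick check of the split arithmetic, the integer division used above can be reproduced on the row count reported after `dropna()`:

```python
total = 404287            # rows left after dropna(), as printed earlier
split = 2 * total // 10   # integer division rounds the 20% share down
n_val = split
n_train = total - split
print(n_val, n_train)
```

The two counts agree with the shapes printed above, so the validation set is slightly under 20% of the data.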


# Logistic Regression

In [20]:
#we run a logistic regression with SGD classifier on training data
#(loss='log' matches the scikit-learn version used here; releases from 1.1 on name it 'log_loss')
lreg = linear_model.SGDClassifier(loss='log',verbose=1)
lreg.fit(x_train, y_train)

#then compute predictions on validation set
lreg_preds_prob = lreg.predict_proba(x_val)
lreg_preds = lreg.predict(x_val)
#and on training set
lreg_preds_prob_t = lreg.predict_proba(x_train)
lreg_preds_t = lreg.predict(x_train)

-- Epoch 1
Norm: 1.94, NNZs: 15, Bias: -0.923796, T: 323430, Avg. loss: 0.810572
Total training time: 0.08 seconds.
-- Epoch 2
Norm: 1.72, NNZs: 15, Bias: -1.078893, T: 646860, Avg. loss: 0.567442
Total training time: 0.16 seconds.
-- Epoch 3
Norm: 1.78, NNZs: 15, Bias: -0.819547, T: 970290, Avg. loss: 0.562302
Total training time: 0.24 seconds.
-- Epoch 4
Norm: 1.74, NNZs: 15, Bias: -0.891265, T: 1293720, Avg. loss: 0.560250
Total training time: 0.31 seconds.
-- Epoch 5
Norm: 1.69, NNZs: 15, Bias: -0.893622, T: 1617150, Avg. loss: 0.559237
Total training time: 0.39 seconds.
-- Epoch 6
Norm: 1.72, NNZs: 15, Bias: -0.988568, T: 1940580, Avg. loss: 0.558502
Total training time: 0.46 seconds.
-- Epoch 7
Norm: 1.68, NNZs: 15, Bias: -0.878439, T: 2264010, Avg. loss: 0.557944
Total training time: 0.53 seconds.
-- Epoch 8
Norm: 1.73, NNZs: 15, Bias: -0.951027, T: 2587440, Avg. loss: 0.557534
Total training time: 0.60 seconds.
-- Epoch 9
Norm: 1.74, NNZs: 15, Bias: -0.952403, T: 2910870, Avg. 

In [21]:
###### ###### ######      RESULTS on training set   ###### ###### ###### 

lreg_accuracy = metrics.accuracy_score(y_train,lreg_preds_t)
lreg_f1 = metrics.f1_score(y_train,lreg_preds_t)
lreg_loss = metrics.log_loss(y_train,lreg_preds_prob_t)

print("Logistic regression accuracy on training set: %0.3f" % lreg_accuracy)
print("Logistic regression F1score on training set: %0.3f" % lreg_f1)
print("Logistic regression loss on training set: %0.3f" % lreg_loss)

Logistic regression accuracy on training set: 0.662
Logistic regression F1score on training set: 0.520
Logistic regression loss on training set: 0.556


In [0]:
###### ###### ######      RESULTS on validation set    ###### ###### ###### 

lreg_accuracy = metrics.accuracy_score(y_val,lreg_preds)
lreg_f1 = metrics.f1_score(y_val,lreg_preds)
lreg_loss = metrics.log_loss(y_val,lreg_preds_prob)

print("Logistic regression accuracy on validation set: %0.3f" % lreg_accuracy)
print("Logistic regression F1score on validation set: %0.3f" % lreg_f1)
print("Logistic regression loss on validation set: %0.3f" % lreg_loss)

Logistic regression accuracy on validation set: 0.663
Logistic regression F1score on validation set: 0.501
Logistic regression loss on validation set: 0.555
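
For reference, the three metrics reported above can be sketched in plain Python for binary labels. This is an illustrative reimplementation under simplifying assumptions, not scikit-learn's code: the function names are hypothetical, and `log_loss` here takes the probability of the positive class only, whereas the notebook passes the two-column output of `predict_proba`.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions equal to the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def log_loss(y_true, prob_pos, eps=1e-15):
    """Mean negative log-likelihood; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for t, p in zip(y_true, prob_pos):
        p = min(max(p, eps), 1 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)


# toy data, not taken from the notebook
y_true = [1, 0, 1, 0]
prob_pos = [0.9, 0.2, 0.4, 0.6]
y_pred = [int(p >= 0.5) for p in prob_pos]
print(accuracy(y_true, y_pred), f1(y_true, y_pred), round(log_loss(y_true, prob_pos), 4))
```

A useful reference point: a model that predicts 0.5 for every pair has a log loss of ln 2 ≈ 0.693, so losses around 0.55 are only a moderate improvement over an uninformative predictor.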


# Gradient Boosting

In [22]:
#we run a gradient boosting binary classification on training data
params = dict()
params['objective'] = 'binary:logistic'  #log-loss objective; XGBoost has no separate 'loss' parameter
params['learning_rate'] = 0.02
params['max_depth'] = 4
#with several eval metrics, the last one ('error') drives early stopping
params['eval_metric'] = ['logloss', 'error']

d_train = xgb.DMatrix(x_train, label=y_train)
d_valid = xgb.DMatrix(x_val, label=y_val)
watchlist = [(d_train, 'train'), (d_valid, 'valid')]

boosting = xgb.train(params, d_train, 5000, watchlist, early_stopping_rounds=20, verbose_eval=100)
#predicted probabilities of the duplicate class, thresholded at 0.5 for hard labels
xgboost_preds_proba = boosting.predict(d_valid).astype(float)
xgboost_preds = (xgboost_preds_proba >= 0.5).astype(int)
xgboost_preds_proba_t = boosting.predict(d_train).astype(float)
xgboost_preds_t = (xgboost_preds_proba_t >= 0.5).astype(int)

[0]	train-logloss:0.687866	train-error:0.305504	valid-logloss:0.687854	valid-error:0.304612
Multiple eval metrics have been passed: 'valid-error' will be used for early stopping.

Will train until valid-error hasn't improved in 20 rounds.
[100]	train-logloss:0.526235	train-error:0.296398	valid-logloss:0.526048	valid-error:0.293902
[200]	train-logloss:0.507651	train-error:0.291896	valid-logloss:0.507794	valid-error:0.289746
[300]	train-logloss:0.501285	train-error:0.288072	valid-logloss:0.501693	valid-error:0.286827
[400]	train-logloss:0.496418	train-error:0.284952	valid-logloss:0.49717	valid-error:0.283884
[500]	train-logloss:0.492666	train-error:0.281424	valid-logloss:0.493887	valid-error:0.280928
[600]	train-logloss:0.489615	train-error:0.279421	valid-logloss:0.491166	valid-error:0.278578
Stopping. Best iteration:
[600]	train-logloss:0.489615	train-error:0.279421	valid-logloss:0.491166	valid-error:0.278578
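
The log above says training stops once `valid-error` has not improved for 20 rounds. A minimal sketch of that patience rule, assuming a lower score is better, is shown below. The function `early_stopping_round` and the toy error values are hypothetical and only illustrate the idea; this is not XGBoost's internal code.

```python
def early_stopping_round(errors, patience):
    """Scan per-round validation errors and stop after `patience` rounds without improvement.

    Returns (best_round, best_error) as seen up to the stopping point.
    """
    best_round, best_error = 0, errors[0]
    for round_idx, error in enumerate(errors):
        if error < best_error:
            best_round, best_error = round_idx, error
        elif round_idx - best_round >= patience:
            break  # no improvement for `patience` rounds
    return best_round, best_error


# toy validation errors, not taken from the notebook
errors = [0.30, 0.29, 0.28, 0.285, 0.284, 0.283, 0.27]
print(early_stopping_round(errors, patience=3))
```

In this toy run the last value would have been an improvement, but it is never reached. That is the trade-off of a small patience: training stops sooner but may miss later gains.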



In [23]:
###### ###### ######      RESULTS on training set     ###### ###### ###### 

xgboost_accuracy = metrics.accuracy_score(y_train,xgboost_preds_t)
xgboost_f1 = metrics.f1_score(y_train,xgboost_preds_t)
xgboost_loss = metrics.log_loss(y_train,xgboost_preds_proba_t)
print("Gradient boosting accuracy on training set: %0.3f" % xgboost_accuracy)
print("Gradient boosting F1score on training set: %0.3f" % xgboost_f1)
print("Gradient boosting loss on training set: %0.3f" % xgboost_loss)

Gradient boosting accuracy on training set: 0.721
Gradient boosting F1score on training set: 0.658
Gradient boosting loss on training set: 0.489


In [0]:
###### ###### ######      RESULTS on validation set     ###### ###### ###### 

xgboost_accuracy = metrics.accuracy_score(y_val,xgboost_preds)
xgboost_f1 = metrics.f1_score(y_val,xgboost_preds)
xgboost_loss = metrics.log_loss(y_val,xgboost_preds_proba)
print("Gradient boosting accuracy on validation set: %0.3f" % xgboost_accuracy)
print("Gradient boosting F1score on validation set: %0.3f" % xgboost_f1)
print("Gradient boosting loss on validation set: %0.3f" % xgboost_loss)

Gradient boosting accuracy on validation set: 0.721
Gradient boosting F1score on validation set: 0.657
Gradient boosting loss on validation set: 0.491
