# Quora Question Pairs

The objective of this project is to build a solution for the Kaggle Quora Question Pairs competition:

https://www.kaggle.com/c/quora-question-pairs/overview

The task is to detect whether the two questions in a pair are duplicates of each other.

In [1]:
import pandas as pd
import sklearn
from sklearn import *
import numpy as np
import pickle
from utils import *
import joblib

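# Use this data to train and VALIDATE your solution.
df = pd.read_csv("./quora_train_data.csv")
# Hold out 5% of the question pairs for validation.
train_df, valid_df = sklearn.model_selection.train_test_split(df, test_size=0.05, random_state=123)
# Drop training pairs in which either question is missing.
train_df = train_df.loc[~(pd.isnull(train_df.question1) | pd.isnull(train_df.question2)), :]
# Corpus for the embeddings: the unique questions from each column of the training split.
corpus = list(train_df["question1"].unique()) + list(train_df["question2"].unique())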

## Vectorize the sentences

First, we build word embeddings to represent the content of the sentences. We try both embeddings that we engineer ourselves (a co-occurrence SVD model) and the pre-trained word2vec embeddings.

## Building a co-occurrence SVD embedding model
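
The helper used in the next cell comes from `utils` and is not shown in this notebook, so its internals are unknown here. As a rough illustration of the general technique, the sketch below counts how often words appear near each other and factorizes the log-scaled count matrix with an SVD. The function names (`build_cooccurrence`, `svd_embeddings`), the window size, the log scaling, and the toy corpus are all assumptions made for illustration, not the project's implementation.

```python
import numpy as np


def build_cooccurrence(sentences, window=2):
    """Count how often two words appear within `window` tokens of each other."""
    tokenized = [s.lower().split() for s in sentences]
    vocab = sorted({w for tokens in tokenized for w in tokens})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for tokens in tokenized:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[index[word], index[tokens[j]]] += 1.0
    return vocab, counts


def svd_embeddings(counts, n_components):
    """Keep the top `n_components` singular directions as word vectors."""
    # log1p dampens the influence of very frequent word pairs (an assumption).
    u, s, _ = np.linalg.svd(np.log1p(counts), full_matrices=False)
    return u[:, :n_components] * s[:n_components]


# Toy corpus standing in for the Quora questions.
toy_corpus = [
    "how do i learn python",
    "how do i learn java",
    "what is the capital of france",
]
vocab, counts = build_cooccurrence(toy_corpus)
vectors = svd_embeddings(counts, n_components=3)
print(len(vocab), vectors.shape)
```

A dense matrix only works for a toy vocabulary. For a corpus of the size used here, a sparse count matrix combined with a truncated solver such as `sklearn.decomposition.TruncatedSVD` would likely be the more practical choice.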

In [2]:
# Bind the result to a new name so that the imported `cooccurrance_embeddings`
# name is not shadowed and the cell can be re-run.
svd_model = cooccurrance_embeddings(corpus, preprocess)
svd_model = svd_model.fit(n_components=300, store=True)

computing coocurrance matrix


100%|██████████| 468461/468461 [03:40<00:00, 2128.85it/s]


SVM decomposition


## Building a word2vec model

In [3]:
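import gensim.downloader as api

# Download the pre-trained Google News word2vec vectors (300 dimensions)
# and save them locally so they can be reloaded later.
model = api.load('word2vec-google-news-300')
model.save('embeddings/word2vec-google-news-300.kv')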

## Training the models

Now, for each of the two word embeddings above, we compute each sentence embedding as the TF-IDF-weighted sum of the vectors of the words it contains. We then compute the cosine, Manhattan, Euclidean, and Word Mover's distances between the two sentence embeddings of a pair, plus the Jaccard distance between the two questions, and pass all of these, together with the embeddings themselves, to a logistic regression classifier and an XGBoost classifier. We also compute bag-of-words (BoW) features and train the same classifiers on them as a baseline.
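
To make the TF-IDF-weighted sum concrete, here is a minimal sketch. It assumes scikit-learn's `TfidfVectorizer` for the weights and uses random vectors as a stand-in for the real word embeddings; `QuoraTransformer` is defined in `utils` and may well do this differently.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

questions = [
    "how do i learn python",
    "what is the best way to learn python",
    "how old is the universe",
]

# Fit TF-IDF on the questions; each row holds the weights of one question.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(questions)  # sparse, shape (n_questions, n_vocab)

# Hypothetical word vectors: one random 8-dimensional row per vocabulary entry,
# standing in for the word2vec or co-occurrence SVD embeddings.
rng = np.random.default_rng(123)
word_vectors = rng.normal(size=(len(vectorizer.vocabulary_), 8))

# Sentence embedding = sum over words of (TF-IDF weight * word vector),
# which is a single sparse-dense matrix product.
sentence_vectors = np.asarray(tfidf @ word_vectors)
print(sentence_vectors.shape)
```

Note that `TfidfVectorizer` lowercases the text, drops single-character tokens, and L2-normalizes each row by default, so the weights in this sketch are normalized TF-IDF values rather than raw ones.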

In [4]:
transformer = QuoraBaselineTransformer().fit(train_df)
X_train_bow = transformer.transform(train_df)
X_valid_bow = transformer.transform(valid_df)
joblib.dump(transformer, 'transformers/baseline.joblib')
print('Done creating BoW dataset')

transformer = QuoraTransformer(embeddings_type=EmbeddingType.WORD2VEC).fit(train_df)
X_train_word2vec = transformer.transform(train_df)
X_valid_word2vec = transformer.transform(valid_df)
joblib.dump(transformer, 'transformers/word2vec.joblib')
print('Done creating word2vec dataset')

transformer = QuoraTransformer(embeddings_type=EmbeddingType.COOCCURRANCE_SVD).fit(train_df)
X_train_cooccurrence = transformer.transform(train_df)
X_valid_cooccurrence = transformer.transform(valid_df)
joblib.dump(transformer, 'transformers/cooccurrance_svd.joblib')
print('Done creating cooccurrence dataset')

Done creating BoW dataset
Done creating word2vec dataset
Done creating cooccurrence dataset
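
The distance features described in this section could be computed along the following lines. This is a sketch under assumptions: the helper names `jaccard_distance` and `distance_features` are hypothetical, tokenization is a plain whitespace split, and the Word Mover's distance is left out because it needs the word-level embeddings rather than the two sentence vectors.

```python
import numpy as np
from scipy.spatial import distance


def jaccard_distance(q1, q2):
    """1 - |intersection| / |union| of the two questions' token sets."""
    a, b = set(q1.lower().split()), set(q2.lower().split())
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def distance_features(v1, v2, q1, q2):
    """Stack the vector distances and the token-level Jaccard distance."""
    return np.array([
        distance.cosine(v1, v2),      # 1 - cosine similarity
        distance.cityblock(v1, v2),   # Manhattan (L1) distance
        distance.euclidean(v1, v2),   # Euclidean (L2) distance
        jaccard_distance(q1, q2),
    ])


# Toy sentence embeddings standing in for the real ones.
v1 = np.array([1.0, 0.0, 2.0])
v2 = np.array([0.0, 1.0, 2.0])
features = distance_features(
    v1, v2, "how do I learn python", "how can I learn python"
)
print(features)
```

These four values would then be concatenated with the two sentence embeddings to form one feature row per question pair.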


In [5]:
import xgboost as xgb

y_train = train_df['is_duplicate']
y_valid = valid_df['is_duplicate']

classifiers = {
    'xgboost': xgb.XGBClassifier(random_state=123),
    'logistic': linear_model.LogisticRegression(random_state=123)
}

datasets = {
    'BoW': (X_train_bow, X_valid_bow),
    'word2vec': (X_train_word2vec, X_valid_word2vec),
    'cooccurrence': (X_train_cooccurrence, X_valid_cooccurrence)
}

for clf_name, clf in classifiers.items():
    for emb_name, data in datasets.items():
        print(f'Training {clf_name} on {emb_name} with distances...')
        clf.fit(data[0], y_train)
        y_hat = clf.predict(data[1])
        print(metrics.classification_report(y_valid, y_hat) + '\n')
    
        joblib.dump(clf, f'models/{emb_name}_{clf_name}.joblib') 

Training xgboost on BoW with distances...
              precision    recall  f1-score   support

           0       0.75      0.92      0.82     10165
           1       0.77      0.47      0.59      6007

    accuracy                           0.75     16172
   macro avg       0.76      0.69      0.70     16172
weighted avg       0.76      0.75      0.74     16172


Training xgboost on word2vec with distances...
              precision    recall  f1-score   support

           0       0.83      0.84      0.83     10165
           1       0.72      0.71      0.72      6007

    accuracy                           0.79     16172
   macro avg       0.78      0.77      0.77     16172
weighted avg       0.79      0.79      0.79     16172


Training xgboost on cooccurrence with distances...
              precision    recall  f1-score   support

           0       0.79      0.83      0.81     10165
           1       0.68      0.64      0.66      6007

    accuracy                           0

Training logistic on BoW with distances...

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


              precision    recall  f1-score   support

           0       0.78      0.85      0.81     10165
           1       0.70      0.60      0.65      6007

    accuracy                           0.76     16172
   macro avg       0.74      0.72      0.73     16172
weighted avg       0.75      0.76      0.75     16172


Training logistic on word2vec with distances...


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


              precision    recall  f1-score   support

           0       0.76      0.82      0.79     10165
           1       0.65      0.57      0.61      6007

    accuracy                           0.73     16172
   macro avg       0.71      0.69      0.70     16172
weighted avg       0.72      0.73      0.72     16172


Training logistic on cooccurrence with distances...
              precision    recall  f1-score   support

           0       0.68      0.82      0.74     10165
           1       0.53      0.34      0.41      6007

    accuracy                           0.64     16172
   macro avg       0.60      0.58      0.58     16172
weighted avg       0.62      0.64      0.62     16172




STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
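
The repeated "ITERATIONS REACHED LIMIT" warnings come from the `lbfgs` solver stopping at the default `max_iter=100` before converging. The message itself suggests two remedies: scaling the features and raising `max_iter`. The sketch below combines both in a scikit-learn `Pipeline` on synthetic data; the step names, the `max_iter=1000` value, and the data are assumptions, and the reported scores above were not produced this way.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrices, with badly scaled columns.
X, y = make_classification(n_samples=200, n_features=10, random_state=123)
X = X * np.logspace(0, 4, num=10)

# Standardize the features, then fit with a larger iteration budget.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("logistic", LogisticRegression(max_iter=1000, random_state=123)),
])
pipeline.fit(X, y)

n_iter = int(pipeline.named_steps["logistic"].n_iter_.max())
print(f"lbfgs stopped after {n_iter} iterations")
```

If a feature matrix is sparse, as BoW features often are, `StandardScaler(with_mean=False)` or `MaxAbsScaler` would be needed instead, since centering a sparse matrix is not supported.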
