# Reproduce results
### Adrián Fernández Cid

This notebook loads the saved input matrices and trained models to reproduce the validation and test results reported in train_models.ipynb. I also show and comment on a sample of the test mistakes made by the two best models: the neural network with tf-idf vectorisation plus the cosine-similarity feature, and the logistic regression with the same preprocessing. The whole notebook runs in under a minute.

Imports:

In [1]:
# custom stuff
from utils import *
# basic stuff
import pandas as pd
import scipy as sp
import numpy as np
import os
import pickle  #for saving matrices, models, etc.
#sklearn stuff
import sklearn
from sklearn import *
from xgboost.sklearn import XGBClassifier
from sklearn.metrics import roc_auc_score
# tensorflow stuff
import tensorflow as tf
from tensorflow.keras.utils import to_categorical
from keras.models import model_from_json

Loading data:

In [2]:
path_data = os.path.expanduser('~')

train_df = pd.read_csv(os.path.join(path_data, "Datasets", "kaggle_datasets",
                                    "quora", "quora_train_data.csv"))

train_df.head()

| | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 |
| 1 | 1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 |
| 2 | 2 | 5 | 6 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 |
| 3 | 3 | 7 | 8 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0 |
| 4 | 4 | 9 | 10 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0 |


In [3]:
A_df, te_df = sklearn.model_selection.train_test_split(train_df, test_size=0.05,random_state=123)

tr_df, va_df = sklearn.model_selection.train_test_split(A_df, test_size=0.05,random_state=123)
print('tr_df.shape=',tr_df.shape)
print('va_df.shape=',va_df.shape)
print('te_df.shape=',te_df.shape)

tr_df.shape= (364871, 6)
va_df.shape= (19204, 6)
te_df.shape= (20215, 6)
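
The three shapes follow from applying a 5% hold-out twice. As a quick sanity check, the sketch below reproduces them by hand. It assumes that a fractional `test_size` is rounded up to a whole number of rows and that the remainder goes to the training side, which is my understanding of scikit-learn's behaviour; `split_sizes` is a hypothetical helper, not part of utils.

```python
import math

def split_sizes(n_rows, test_fraction):
    """Return (n_train, n_test) for a fractional hold-out.

    Assumption: the test part is ceil(test_fraction * n_rows) and the
    remaining rows are kept for training.
    """
    n_test = math.ceil(test_fraction * n_rows)
    return n_rows - n_test, n_test

# total number of rows in train_df, recovered from the printed shapes
n_total = 364871 + 19204 + 20215

n_rest, n_te = split_sizes(n_total, 0.05)  # first split: A_df / te_df
n_tr, n_va = split_sizes(n_rest, 0.05)     # second split: tr_df / va_df
print(n_tr, n_va, n_te)
```

Under that assumption the sizes match the printed output, and the training set ends up with roughly 0.95 × 0.95 ≈ 90% of the rows.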


In [4]:
y_tr = tr_df["is_duplicate"].values
y_va = va_df["is_duplicate"].values
y_te = te_df["is_duplicate"].values

## Validation

Load input matrices and trained models, and print the validation results:

In [5]:
%%time

mod_fnames = ["logistic", "xgb"]
mod_print = ["LOGISTIC REGRESSION", "XGBOOST"]
vec_fnames = ["count", "skl_tfidf"]
vec_print = ["Count", "Sklearn's tf-idf"]
feat_fnames = ["raw", "euclid", "cos", "edit"]
feat_print = ["no", "Euclidean distance", "cosine similarity", "edit distance"]
for i, mod in enumerate(mod_fnames):
    print("==================================================")
    print(f"{mod_print[i]}")
    print("==================================================")
    for j, vec in enumerate(vec_fnames):
        print("--------------------------------------")
        print(f"{vec_print[j]} vectoriser")
        print("--------------------------------------")
        for k, feat in enumerate(feat_fnames):
            print(f"****With {feat_print[k]} new feature****")
            print("-Validation")
            # load input matrix
            Xva_fname = "results/Xva_"+vec+"_"+feat+".npz"
            Xva = sp.sparse.load_npz(Xva_fname)
            # load model (the context manager closes the file handle)
            filename = "models/"+mod+"_"+vec+"_"+feat+".sav"
            with open(filename, 'rb') as f:
                clf = pickle.load(f)
            try:
                print_auc_logloss(clf, Xva, y_va)
            except Exception:
                # a bare except would also swallow KeyboardInterrupt
                print("ERROR. Could not get metrics: there's an issue with input matrix.")

LOGISTIC REGRESSION
--------------------------------------
Count vectoriser
--------------------------------------
****With no new feature****
-Validation
AUC: 0.8058412271384804
log loss: 0.5150855223485794
****With Euclidean distance new feature****
-Validation
AUC: 0.8058355318707418
log loss: 0.5150976791190962
****With cosine similarity new feature****
-Validation
AUC: 0.8567268858027568
log loss: 0.4532025503252208
****With edit distance new feature****
-Validation
ERROR. Could not get metrics: there's an issue with input matrix.
--------------------------------------
Sklearn's tf-idf vectoriser
--------------------------------------
****With no new feature****
-Validation
AUC: 0.8033913596741575
log loss: 0.5075859218079861
****With Euclidean distance new feature****
-Validation
AUC: 0.8033859222168925
log loss: 0.5076338630398345
****With cosine similarity new feature****
-Validation
AUC: 0.8700305156195411
log loss: 0.4330545686924139
****With edit distance new feature****
-Va

As shown in train_models.ipynb, the winner among these models is the logistic regression with tf-idf vectorisation and the cosine-similarity feature (validation AUC of about 0.870).
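
`print_auc_logloss` lives in utils and is not shown in this notebook. As a reminder of what the two reported numbers mean, here is a minimal pure-Python sketch of both metrics for binary labels. The function names are hypothetical and this is not the helper's actual code, which presumably relies on sklearn.metrics.

```python
import math

def pairwise_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. This is O(n_pos * n_neg), so it is only
    meant for small illustrative inputs.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def binary_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-likelihood of the labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

# toy labels and predicted duplicate probabilities (made up for illustration)
toy_y = [0, 0, 1, 1]
toy_p = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_y, toy_p)
toy_ll = binary_log_loss(toy_y, toy_p)
print(f"AUC: {toy_auc}")
print(f"log loss: {toy_ll:.4f}")
```

AUC only depends on how the pairs are ranked, while log loss also penalises badly calibrated probabilities, which is why the notebook reports both.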

## Test

Let's check the test results for the previous winner. As established in train_models.ipynb, my tf-idf vectoriser and sklearn's give results that agree to 4 decimal places, so I use my own vectoriser here:

In [6]:
# load feature matrices
Xval_fname = "results/Xva_mytfidf_cos.npz"
Xval = sp.sparse.load_npz(Xval_fname)
Xte_fname = "results/Xte_mytfidf_cos.npz"
Xte = sp.sparse.load_npz(Xte_fname)
# load model
m_fname = "models/logistic_mytfidf_cos.sav"
with open(m_fname, 'rb') as f:
    clf = pickle.load(f)
# get performance metrics
print("-Validation")
print_auc_logloss(clf, Xval, y_va) 
print("-Test")
print_auc_logloss(clf, Xte, y_te)

-Validation
AUC: 0.8700340429355689
log loss: 0.43305295164377394
-Test
AUC: 0.8757583837689608
log loss: 0.42572321614875147


The best model overall, however, was the neural network:

In [7]:
vec = "mytfidf"
feat = "cos"
# validation
Xval_fname = "results/Xva_"+vec+"_"+feat+".npz"
Xval = sp.sparse.load_npz(Xval_fname)
Xval = convert_sparse_matrix_to_ordered_sparse_tensor(Xval)
# test
Xte_fname = "results/Xte_"+vec+"_"+feat+".npz"
Xte = sp.sparse.load_npz(Xte_fname)
Xte = convert_sparse_matrix_to_ordered_sparse_tensor(Xte)

In [8]:
name = "neural_mytfidf_cos"
# load json and create model
with open("models/"+name+".json", "r") as json_file:
    loaded_model_json = json_file.read()
new_model = model_from_json(loaded_model_json)

# load weights into new model
new_model.load_weights("models/"+name+".h5")
print(f"Loaded model {name}")

# check its auc
in_list=[Xval, Xte] 
y_list = [y_va, y_te]
steps = ["Validation", "Test"]
for i, x in enumerate(in_list):
    probs = new_model.predict(x)[:,1]
    auc = roc_auc_score(y_list[i], probs)
    print(f"{steps[i]} AUC: {auc}")

Loaded model neural_mytfidf_cos
Validation AUC: 0.8765279958034079
Test AUC: 0.8805521028374024


## Erroneous predictions

### Logistic regression

In [9]:
# test
Xte_fname = "results/Xte_"+vec+"_"+feat+".npz"
Xte = sp.sparse.load_npz(Xte_fname)

mistake_indices, predictions = get_mistakes(clf,Xte, y_te)
print("==================================================================")
for i in range(10):
    print(i+1)
    print_mistake_k(te_df, 100+i, mistake_indices, predictions)

error rate: 0.205392035617116 
 total mistakes: 4152
1
How do I prepare for Gate Exam by myself?
What are the best ways to prepare gate exam?
true class: 0
prediction: 1
2
Does the black hole hold a gateway to another universe?
Would a black hole be the exit of this universe?
true class: 1
prediction: 0
3
How do I get rid of acne on my face? I workout daily and wash my face twice a day.
What products should I use to get rid of acne quickly?
true class: 1
prediction: 0
4
Why do we do rainwater harvesting?
Why is rainwater harvesting illegal?
true class: 0
prediction: 1
5
How can I get the girl I like?
How do I get over a girl I like?
true class: 0
prediction: 1
6
How can I stop my depression?
What can I do to stop being depressed?
true class: 1
prediction: 0
7
Which is the best site to book hotel online?
What is the best hotel booking service?
true class: 1
prediction: 0
8
Can H4 visa holders invest in stock markets?
Can H4 visa holders invest in US stockmarkets?
true class: 1
predictio

Some samples look genuinely tricky (like the 5th), while others look quite straightforward (like the 7th). On the other hand, one could argue that predictions 1 and 8 are actually right, or at least reasonable.

We also see that sometimes a single word is all that tells two questions apart (e.g. sample 5), which highlights the limitations of a static vocabulary. One could therefore try a dynamic or more complete vocabulary for vectorisation.
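
To make the single-word issue concrete, the sketch below computes the cosine similarity of the two questions in sample 5 from plain word counts. This is only an illustration: it uses raw counts rather than tf-idf weights, a simple regex tokeniser of my own choosing, and hypothetical helper names, so the value need not match the feature actually stored in the input matrices.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

q1 = "How can I get the girl I like?"
q2 = "How do I get over a girl I like?"
sim = cosine_similarity(bag_of_words(q1), bag_of_words(q2))
print(f"cosine similarity: {sim:.3f}")
```

The two questions share most of their tokens, so the similarity comes out high (about 0.76 with this tokenisation) even though "get" and "get over" give them opposite meanings.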

It also stands out that extra information unrelated to the core of the question seems to throw the classifier off in the 3rd case.
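
The error rate and mistake list printed above come from `get_mistakes` in utils, which is not shown here. A minimal sketch of that kind of bookkeeping, assuming predicted duplicate probabilities are thresholded at 0.5 (the threshold, the function name and the toy data are all assumptions of mine), could look like this:

```python
def find_mistakes(probs, y_true, threshold=0.5):
    """Return (indices of misclassified rows, hard predictions, error rate)."""
    predictions = [1 if p >= threshold else 0 for p in probs]
    mistakes = [i for i, (pred, y) in enumerate(zip(predictions, y_true))
                if pred != y]
    return mistakes, predictions, len(mistakes) / len(y_true)

toy_probs = [0.9, 0.2, 0.6, 0.4, 0.7]  # toy duplicate probabilities
toy_labels = [1, 0, 0, 1, 1]           # toy ground truth
idx, preds, err = find_mistakes(toy_probs, toy_labels)
print("mistake indices:", idx)
print("error rate:", err)
```

With the figures printed above, 4152 mistakes over the 20215 test rows give the reported error rate of about 0.205.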

### Neural network

In [10]:
Xte_tf = convert_sparse_matrix_to_ordered_sparse_tensor(Xte)

mistake_indices, predictions = get_mistakes(new_model,Xte_tf, y_te, neural_net=True)
print("==================================================================")
for i in range(10):
    print(i+1)
    print_mistake_k(te_df, 100+i, mistake_indices, predictions)

error rate: 0.2017313875834776 
 total mistakes: 4078
1
Why was Tim Kaine chosen as the running mate for Hillary Clinton?
How did Hillary Clinton decide to choose Tim Kaine as her running mate?
true class: 1
prediction: 0
2
Why am I not able to post answers or comments on answers on Quora?
Why am I not able to edit my answer on Quora?
true class: 0
prediction: 1
3
What is the name of this T.V series ?
What's the name of this TV series?
true class: 0
prediction: 1
4
Why does my urine smell like garlic?
Why does my urine smell like onions?
true class: 0
prediction: 1
5
Which is the best laptop for mechanical?
Which is the best laptop for engineers?
true class: 0
prediction: 1
6
If you had the chance to meet with Sunil Gavaskar, what would you tell him?
Which was Sunil Gavaskar's greatest innnings?
true class: 0
prediction: 1
7
How are salt bridges used in galvanic cells?
Why salt bridge are used in galvanic cell?
true class: 1
prediction: 0
8
How does Zuckerberg earn the money, when face

Some errors look reasonable (like the 4th, the 7th or the 8th) and others understandable (like the 2nd or the 5th). Apart from the already mentioned fact that a single word sometimes makes the difference, as in sample 5 (which again suggests a dynamic or more complete vocabulary), there is no clear error pattern that could be addressed at preprocessing.

Nevertheless, the ground truth for samples 3 and 10 seems plainly wrong: either the labels are poor, or there is a problem with the reordering applied to the feature matrix to make it compatible with TensorFlow.
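
One way to narrow down the second hypothesis: `tf.sparse.reorder` only sorts the stored (row, column, value) entries into canonical row-major order and does not move rows. If `convert_sparse_matrix_to_ordered_sparse_tensor` does nothing more than that (an assumption, since the helper is not shown here), row i of the tensor still corresponds to row i of te_df. The pure-Python sketch below illustrates the point on a toy matrix in coordinate format; all names and values are hypothetical.

```python
def to_dense(entries, shape):
    """Build a dense list-of-lists matrix from (row, col, value) triples."""
    dense = [[0.0] * shape[1] for _ in range(shape[0])]
    for row, col, value in entries:
        dense[row][col] = value
    return dense

shape = (3, 4)
# entries stored in arbitrary order, as a coordinate-format matrix may hold them
unordered = [(2, 1, 0.5), (0, 3, 1.0), (1, 0, 2.0), (0, 0, 0.7)]
# canonical row-major ordering: sort by row index, then by column index
ordered = sorted(unordered, key=lambda e: (e[0], e[1]))

same = to_dense(unordered, shape) == to_dense(ordered, shape)
print("rows unchanged by reordering:", same)
```

If the helper permutes rows in some other way, comparing a handful of rows of the test matrix before and after conversion should reveal it; otherwise the labels themselves are the more likely culprit.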