Baseline Question Comparison Model

and

Scoring of Output

In [36]:
%matplotlib inline


import nltk            # natural language tool kit
import numpy as np     # support for large data structures
import pandas as pd    # data structure support
import string          # various string functions
import difflib         # classes and functions for comparing sequences
import utils           # word processing functions and distance functions

from sklearn.metrics import log_loss    # used in measurement / scoring

In [2]:
# import train and test data

# Generate training data dataframe
train = pd.read_csv('Data/train.csv') #index_col=0

train.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


In [19]:
# testing with 1 id
# later can put all into a loop
id = 2
q1 = train['question1'][id]
q2 = train['question2'][id]

# print(q1)
# print(q2)

# remove punctuation  (as our baseline model will focus on the words themselves)
# one addition could be to change everything to lowercase, too (likely will test impact in results)

# q1 = q1.translate(None, string.punctuation)
q1 = q1.translate(None, string.punctuation).lower()
# q2 = q2.translate(None, string.punctuation)
q2 = q2.translate(None, string.punctuation).lower()


# split sentences (multiple sentences) into list of words
q1words = q1.split()
q2words = q2.split()


print(q1words)
print(q2words)

['how', 'can', 'i', 'increase', 'the', 'speed', 'of', 'my', 'internet', 'connection', 'while', 'using', 'a', 'vpn']
['how', 'can', 'internet', 'speed', 'be', 'increased', 'by', 'hacking', 'through', 'dns']


In [21]:
# simple baseline model...compare words in
def find_similarity(wl1, wl2):
    # send 2 word lists to find matching sequence
    sm = difflib.SequenceMatcher(None, wl1,wl2)
    sm = sm.ratio()
    return sm

find_similarity(q1words, q2words)

0.25

In [22]:
# alternate baseline approach- levenstein distance b/t the 2 questions

In [23]:
def lv_similarity(str1, str2):
    # send 2 strings (questions) to find levenshtien distance, compare to length, and ind similarity
    s1len =len(str1)
    s2len = len(str2)
    distance = utils.levenshtein_explicit(str1, str2)

    # print(distance)
    # print(s1len)
    # print(s2len)

    # crude similarity metric using distance plus length of q1 and q2
    return (distance / (q1len*1.0 + q2len*1.0))

lv_similarity(q1, q2)

0.3883495145631068

Using what we have learned above, will apply to all records.

In [35]:
# code below runs through all IDs, comparing Q1 to Q2, storing similarity measurement
# runs through loop once, storing both similarity measurements

sm_results = []
lv_results = []

for id in range(0, len(train)):
# test with a smaller loop first
# for id in range(0, 20):
    
    # look at questions associated with ID
    # also need to convert to a string (in event not all questions are strings)
    q1 = str(train['question1'][id])
    q2 = str(train['question2'][id])

    # print(q1)
    # print(q2)

    # remove punctuation  (as our baseline model will focus on the words themselves)
    # one addition could be to change everything to lowercase, too (likely will test impact in results)

    # q1 = q1.translate(None, string.punctuation)
    q1 = q1.translate(None, string.punctuation).lower()
    # q2 = q2.translate(None, string.punctuation)
    q2 = q2.translate(None, string.punctuation).lower()


    # split sentences (multiple sentences) into list of words
    q1words = q1.split()
    q2words = q2.split()
    
    # calculate similarity (2 ways)
    sm_results.append([id, find_similarity(q1words, q2words)])
    lv_results.append([id, lv_similarity(q1, q2)])
    
    

Model is quite slow, do to loop through 400,000 records
But, it is the baseline, so that is ok.


Measuring Results:
Kaggle uses Log Loss; in this case we have 1 class (duplicate) and we are looking at the probability we are in that class.


In [48]:
actuals = np.array(train['is_duplicate'])
n_sm_results = np.array(sm_results)
n_lv_results = np.array(lv_results)

predictions_sm = n_sm_results[:,1]
predictions_lv = n_lv_results[:,1]

# using sk_learn log_loss function
score_sm = log_loss(actuals, predictions_sm)
score_lv = log_loss(actuals, predictions_lv)

print(score_sm)
print(score_lv)

0.637853951947
2.46958295062


In [None]:
# exapnd measurements to include accuracy, precision, recall, and f1

The sequence comparison works much better than levenshtein distance. Among many things, Levenshtein has a lot of false positives.

A clear issue with baseline approach is that additional phrases (prepositional, etc) that are attached to the end of a sentence, causing them to be different, are not being counted enough to make the sentences be "different".

"How do I make money?" vs "How do I make money in the market?" The last prepositional phrase causes these to be different, but as for the sequence, they are very close.