# Set Up

## Introduction

Over 100 million people visit Quora every month, so it's no surprise that many of them ask similar (or identical) questions. Multiple questions with the same intent force readers to spend extra time searching for the best answer, and lead writers to answer several versions of the same question. Quora uses a random forest model to identify duplicate questions, which gives active seekers and writers a better experience and offers more value to both groups in the long term.
Follow the steps outlined below to build the classifier model.


Steps:
- Download data
- Exploration
- Cleaning
- Feature Engineering
- Modeling

To do:
- Follow [this workflow](https://github.com/jyu-theartofml/kaggle_quora/blob/master/02_LSTM_2Dense_layers.ipynb) to try and make an NN
- Pickle gridsearch and non gridsearch models on PC
- Github together
- Presentation outline

## Packages

In [1]:
#trifecta
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#gen process
from copy import deepcopy
from sklearn.model_selection import train_test_split

#NLP process
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack

#Models
from sklearn import naive_bayes
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

#export model
import pickle

#Evaluation
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV

In [2]:
#pull custom NLP pre-processing functions
%run -i ~/Coding/custom_functions/lighthouse_labs/NLP_functions.py

## Data

In [22]:
#download data
data_raw = pd.read_csv('train.csv')
data_raw.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


In [23]:
#drop id col (dup of index)
data_raw = data_raw.drop(['id', 'qid1', 'qid2'], axis = 1) #[don't think I need qid but I'd like to check that]

# Exploration

In [21]:
#get frequency for class (is/isnt duplicate)
class_freq = data_raw.groupby('is_duplicate').count().question1

class_freq / data_raw.shape[0] #roughly 63:37 split; in the interest of time we'll treat this as balanced enough

is_duplicate
0    0.630799
1    0.369201
Name: question1, dtype: float64

In [6]:
data_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 404290 entries, 0 to 404289
Data columns (total 3 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   question1     404289 non-null  object
 1   question2     404288 non-null  object
 2   is_duplicate  404290 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 9.3+ MB


In [7]:
#missing values
data_raw.isnull().sum()

question1       1
question2       2
is_duplicate    0
dtype: int64

In [8]:
#explore the nan
data_raw[data_raw.isnull().any(axis=1)]

Unnamed: 0,question1,question2,is_duplicate
105780,How can I develop android app?,,0
201841,How can I create an Android app?,,0
363362,,My Chinese name is Haichao Yu. What English na...,0


In [9]:
data_raw = data_raw.drop(data_raw[data_raw.isnull().any(axis=1)].index)

In [10]:
data_raw.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 404287 entries, 0 to 404289
Data columns (total 3 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   question1     404287 non-null  object
 1   question2     404287 non-null  object
 2   is_duplicate  404287 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 12.3+ MB


# Cleaning and Feature Extraction

## NLP Pre-Processing

In [11]:
#create deep copy of data to change
X = deepcopy(data_raw)

In [12]:
#clean, stem and tokenize each column of strings
X['question1'] = process_features(X['question1']) #with such a big frame this takes a while
X['question2'] = process_features(X['question2'])
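
`process_features` lives in the external `NLP_functions.py` file and is not shown in this notebook. As a rough idea of what such a step does, here is a minimal sketch; the function name `process_text_sketch`, the tiny stop-word set and the optional `stem` callable are all assumptions for illustration, not the actual implementation (which presumably uses NLTK's stop words and a stemmer).

```python
import re

# Tiny illustrative stop-word set (assumption); the real list is presumably NLTK's English list.
STOP_WORDS = {"what", "is", "the", "by", "to", "in", "of", "a", "an", "how", "can", "i", "my"}

def process_text_sketch(text, stem=None):
    """Lowercase, replace non-letters with spaces, drop stop words, optionally stem each token."""
    tokens = re.sub(r"[^a-z\s]", " ", str(text).lower()).split()
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [stem(t) for t in tokens] if stem else tokens

print(process_text_sketch("What is the step by step guide to invest in share market in india?"))
# ['step', 'step', 'guide', 'invest', 'share', 'market', 'india']
```

Passing a stemmer's `stem` method as the `stem` argument would shorten tokens such as `guide` to a stem, similar to the output shown in the check further down.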

In [13]:
#separate sentences from target
y = X.is_duplicate
X = X.drop('is_duplicate', axis=1)

X.head() #check

Unnamed: 0,question1,question2
0,"[step, step, guid, invest, share, market, india]","[step, step, guid, invest, share, market]"
1,"[stori, kohinoor, kohinoor, diamond]","[would, happen, indian, govern, stole, kohinoo..."
2,"[increas, speed, internet, connect, use, vpn]","[internet, speed, increas, hack, dn]"
3,"[mental, lone, solv]","[find, remaind, math, math, divid]"
4,"[one, dissolv, water, quikli, sugar, salt, met...","[fish, would, surviv, salt, water]"


## Similarity

In [14]:
#use custom function to create similarity score
X['similarity'] = similarity_score(X) #apply to X
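
`similarity_score` is also defined in the external file, so its exact metric is not visible here. One common choice for comparing two token lists is Jaccard overlap; the sketch below assumes that metric purely for illustration, and `jaccard_similarity` is a hypothetical name.

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Share of distinct tokens that both questions have in common (0.0 to 1.0)."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0  # both questions empty after cleaning
    return len(set_a & set_b) / len(union)

q1 = ["step", "step", "guid", "invest", "share", "market", "india"]
q2 = ["step", "step", "guid", "invest", "share", "market"]
print(round(jaccard_similarity(q1, q2), 3))  # 0.833
```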

In [15]:
#split before any fit/transform operation so the test set stays unseen
#train/test split, also split out labels (y)
X_train, X_test, y_train, y_test = train_test_split(X[['question1', 'question2', 'similarity']], y, stratify = y, random_state = 42) #default test_size is 25%
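
As a quick sanity check on the split size: with no `test_size` given, `train_test_split` holds out 25% of the rows, rounding the test count up. The arithmetic below uses the 404,287 rows left after dropping missing values and reproduces the 303,215 training rows reported further down.

```python
import math

n_samples = 404287                              # rows remaining after dropping NaNs
test_fraction = 0.25                            # scikit-learn default when test_size is not set
n_test = math.ceil(test_fraction * n_samples)   # test count is rounded up
n_train = n_samples - n_test

print(n_train, n_test)  # 303215 101072
```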

## Vectorize

In [16]:
#Vectorize
#so it can handle token data we add this dummy function
def dummy_fun(doc):
    return doc

#instantiate vectorizer
vectorizer = TfidfVectorizer(max_features = 2500, min_df = 7, tokenizer=dummy_fun,
                            preprocessor=dummy_fun, max_df = 0.8, stop_words = stopwords.words('english'))
    
#fit to the training corpus only (question 1 and 2) so no test vocabulary leaks in
vectorizer.fit(X_train[['question1', 'question2']].values.flatten())



TfidfVectorizer(max_df=0.8, max_features=2500, min_df=7,
                preprocessor=<function dummy_fun at 0x14179b3a0>,
                stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours',
                            'ourselves', 'you', "you're", "you've", "you'll",
                            "you'd", 'your', 'yours', 'yourself', 'yourselves',
                            'he', 'him', 'his', 'himself', 'she', "she's",
                            'her', 'hers', 'herself', 'it', "it's", 'its',
                            'itself', ...],
                tokenizer=<function dummy_fun at 0x14179b3a0>)

In [17]:
#use custom functions to get vectors and merge with similarity for single feature input
X_train_features = vectors_to_features(X_train)
X_test_features = vectors_to_features(X_test)

#check the input shape of the data
print(X_train_features.shape)
print(y_train.shape)

(303215, 5001)
(303215,)
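
The 5,001 columns are consistent with two 2,500-term TF-IDF blocks (one per question) plus the single similarity column. `vectors_to_features` itself is defined in the external file, so the sketch below is only a guess at its shape: `vectors_to_features_sketch`, its explicit `vectorizer` argument and the toy data are assumptions for illustration.

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

def identity(doc):
    # tokens are already split, so pass them through unchanged
    return doc

def vectors_to_features_sketch(frame_like, vectorizer):
    """Stack TF-IDF vectors of both questions and the similarity score column-wise."""
    q1 = vectorizer.transform(frame_like["question1"])
    q2 = vectorizer.transform(frame_like["question2"])
    sim = csr_matrix([[s] for s in frame_like["similarity"]])
    return hstack([q1, q2, sim]).tocsr()

toy = {
    "question1": [["step", "guid", "invest"], ["internet", "speed"]],
    "question2": [["step", "guid", "market"], ["speed", "hack"]],
    "similarity": [0.5, 0.33],
}
vec = TfidfVectorizer(tokenizer=identity, preprocessor=identity, token_pattern=None)
vec.fit(toy["question1"] + toy["question2"])

features = vectors_to_features_sketch(toy, vec)
print(features.shape)  # (2, 15): 7 vocabulary terms per question plus 1 similarity column
```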


# Modelling

## Initial Test with RFC

In [18]:
#create dataframe to compare scores
comparison_train = pd.DataFrame(index = ['RandomForest', 'LogReg', 'NaiveBayes'], columns = ['recall', 'precision', 'accuracy'])
comparison_test = pd.DataFrame(index = ['RandomForest', 'LogReg', 'NaiveBayes'], columns = ['recall', 'precision',  'accuracy'])

In [19]:
#Train random forest
rfc = RandomForestClassifier(random_state = 1)
rfc.fit(X_train_features, y_train)

KeyboardInterrupt: 

In [None]:
#save the trained model
#pickle.dump(rfc, open('rfc_default.sav', 'wb'))

#load trained model
#test it works
rfc_default = pickle.load(open('rfc_default.sav', 'rb'))

In [None]:
y_rfc_train = rfc_default.predict(X_train_features)
y_rfc_test = rfc_default.predict(X_test_features)

In [None]:
comparison_train.loc['RandomForest'] = evaluation(y_rfc_train, y_train)
comparison_test.loc['RandomForest'] = evaluation(y_rfc_test, y_test)
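
The `evaluation` helper comes from the external file as well. Judging by the comparison frame's columns, it returns recall, precision and accuracy in that order; the sketch below assumes exactly that, takes predictions first as in the calls above, and uses the hypothetical name `evaluation_sketch`.

```python
def evaluation_sketch(y_pred, y_true):
    """Return [recall, precision, accuracy] for binary labels (1 = duplicate)."""
    pairs = list(zip(y_pred, y_true))
    tp = sum(1 for p, t in pairs if p == 1 and t == 1)
    fp = sum(1 for p, t in pairs if p == 1 and t == 0)
    fn = sum(1 for p, t in pairs if p == 0 and t == 1)
    tn = sum(1 for p, t in pairs if p == 0 and t == 0)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return [recall, precision, accuracy]

scores = evaluation_sketch(y_pred=[1, 0, 0, 1, 1], y_true=[1, 1, 0, 0, 1])
print([round(s, 3) for s in scores])  # [0.667, 0.667, 0.6]
```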

In [None]:
rfc_default.get_params()

## Other Shallow Models

In [None]:
#train naive bayes (MultinomialNB is deterministic and takes no random_state)
naive = naive_bayes.MultinomialNB()
naive.fit(X_train_features, y_train)

#train logreg
lr = LogisticRegression(random_state = 1)
lr.fit(X_train_features, y_train)

In [None]:
y_naive_train = naive.predict(X_train_features)
y_lr_train = lr.predict(X_train_features)

y_naive = naive.predict(X_test_features)
y_lr = lr.predict(X_test_features)

In [None]:
comparison_train.loc['NaiveBayes'] = evaluation(y_naive_train, y_train)
comparison_train.loc['LogReg'] = evaluation(y_lr_train, y_train)

In [None]:
comparison_test.loc['NaiveBayes'] = evaluation(y_naive, y_test)
comparison_test.loc['LogReg'] = evaluation(y_lr, y_test)

In [None]:
comparison_train

In [None]:
comparison_test

## Gridsearch

In [None]:
#define hyperparameters to try
#('auto' is dropped: for classifiers it was an alias of 'sqrt' and newer scikit-learn releases removed it)
param_grid = {'max_features': ['sqrt', 'log2'],
              'min_samples_leaf': [5, 10, 20]}

#Instantiate random forest with specific parameters
rfc_grid = RandomForestClassifier(random_state = 1, n_jobs = -1, n_estimators = 1000, oob_score = True)
#instantiate gridsearch with random forest and param grid
gridsearch = GridSearchCV(rfc_grid, param_grid, n_jobs = -1, verbose = 1)
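
Before fitting, it helps to estimate how much work this search is. The sketch below assumes two distinct `max_features` settings (`'sqrt'` and `'log2'`), the three `min_samples_leaf` values, and `GridSearchCV`'s default of 5-fold cross-validation with a final refit on the best combination.

```python
from itertools import product

param_grid = {"max_features": ["sqrt", "log2"],
              "min_samples_leaf": [5, 10, 20]}
cv_folds = 5          # GridSearchCV default when cv is not set
n_estimators = 1000   # trees per forest, as set on rfc_grid

# every combination of the listed hyperparameter values
combos = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_fits = len(combos) * cv_folds + 1   # +1 for the final refit on the full training set
n_trees = n_fits * n_estimators

print(len(combos), n_fits, n_trees)  # 6 31 31000
```

With roughly 300,000 rows and 5,001 features per fit, this explains why the fitted search is pickled rather than rerun.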

In [None]:
#gridsearch.fit(X_train_features, y_train)

In [None]:
#save the trained model
#pickle.dump(gridsearch, open('rfc_grid.sav', 'wb'))

#load trained model (assign it so the cells below can use it)
gridsearch = pickle.load(open('rfc_grid.sav', 'rb'))

In [None]:
#gridsearch.best_params_

In [None]:
#grid_train = gridsearch.predict(X_train_features)
#grid_test = gridsearch.predict(X_test_features)

#comparison_train.loc['RandomForestGrid'] = evaluation(grid_train, y_train)
#comparison_test.loc['RandomForestGrid'] = evaluation(grid_test, y_test)

## Neural Net

#### Set Up

In [None]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils.data_utils import get_file

In [27]:
#create deep copy of data to change
data = deepcopy(data_raw)

In [28]:
target = data['is_duplicate']

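#fill missing questions with empty strings so the tokenizer only sees str values
question1 = list(data['question1'].fillna(''))
question2 = list(data['question2'].fillna(''))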

print(len(question1))
print(len(question2))

404290
404290


In [29]:
question1[:5]

['What is the step by step guide to invest in share market in india?',
 'What is the story of Kohinoor (Koh-i-Noor) Diamond?',
 'How can I increase the speed of my internet connection while using a VPN?',
 'Why am I mentally very lonely? How can I solve it?',
 'Which one dissolve in water quikly sugar, salt, methane and carbon di oxide?']

In [None]:
#fit tokenizer
tokenizer = Tokenizer(num_words=200000)
tokenizer.fit_on_texts(question1+question2)

In [None]:
#transform
question1_word_sequences = tokenizer.texts_to_sequences(question1)
question2_word_sequences = tokenizer.texts_to_sequences(question2)
word_index = tokenizer.word_index #unique words in corpus (training and test sets)

print("Words in index: %d" % len(word_index))

In [None]:
#pad out sentences
q1_data = pad_sequences(question1_word_sequences, maxlen=25)
q2_data = pad_sequences(question2_word_sequences, maxlen=25)

#ensure target is int
labels = np.array(target, dtype=int)
#check shapes
print('Shape of question1 data tensor:', q1_data.shape)
print('Shape of question2 data tensor:', q2_data.shape)
print('Shape of label tensor:', labels.shape)
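
With its default settings, `pad_sequences` pads and truncates at the front of each sequence, so short questions get leading zeros and long questions keep their last `maxlen` word ids. A pure-Python sketch of that behaviour (the helper name `pad_pre` is made up for illustration):

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic pad_sequences defaults: truncate from the front, then left-pad with `value`."""
    seq = list(seq)[-maxlen:]                      # keep only the last maxlen ids
    return [value] * (maxlen - len(seq)) + seq     # pad at the front

print(pad_pre([4, 8, 15], maxlen=5))               # [0, 0, 4, 8, 15]
print(pad_pre([1, 2, 3, 4, 5, 6], maxlen=4))       # [3, 4, 5, 6]
```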

#### Embedding

[Download glove] 

In [None]:
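embeddings_index = {}
#glove.840B has a few tokens that contain spaces, so take the last 300 fields as the vector
with open('glove.840B.300d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.rstrip().split(' ')
        word = ' '.join(values[:-300])
        coefs = np.asarray(values[-300:], dtype='float32')
        embeddings_index[word] = coefs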

print('Found %s word vectors.' % len(embeddings_index))

In [None]:
embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
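
Since words missing from GloVe stay as all-zero rows, it is worth checking how much of the vocabulary actually received a vector. A minimal sketch, assuming row 0 is the reserved padding index (the Keras tokenizer starts word ids at 1); `embedding_coverage` is a hypothetical helper name and the matrix here is toy data.

```python
import numpy as np

def embedding_coverage(embedding_matrix):
    """Fraction of vocabulary rows (excluding padding row 0) that have a non-zero vector."""
    rows = embedding_matrix[1:]
    found = np.any(rows != 0, axis=1)
    return float(found.mean())

toy_matrix = np.zeros((5, 3))
toy_matrix[1] = [0.1, 0.2, 0.3]
toy_matrix[3] = [0.0, -0.5, 0.4]

print(embedding_coverage(toy_matrix))  # 0.5
```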

#### Model

In [None]:
from keras.layers import (Dense, Input, LSTM, Embedding, Dropout, Activation,
                          GlobalAveragePooling1D, Lambda, Bidirectional, BatchNormalization)
from keras.models import Model
from keras.callbacks import EarlyStopping
from keras.optimizers import Adam, RMSprop
from keras import backend as K

import keras
keras.__version__

In [None]:
from sklearn.model_selection import train_test_split #sklearn.cross_validation was removed in scikit-learn 0.20

X = np.stack((q1_data, q2_data), axis=1)
target = labels

X_train, X_val, y_train, y_val = train_test_split(X, target, test_size=0.25, random_state=126, stratify=target)
Q1_train = X_train[:,0]
Q2_train = X_train[:,1]
Q1_val = X_val[:,0]
Q2_val = X_val[:,1]

In [None]:
def vec_distance(vects):
    x, y = vects
    return K.sum(K.square(x - y), axis=1, keepdims=True)
#don't use the square root of the sum, it doesn't give a good range to feed to the dense layer.

def vec_output_shape(shapes):
    shape1, shape2 = shapes
    return (shape1[0], 1)
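
To see what `vec_distance` computes, here is the same operation in NumPy on two small batches: the squared Euclidean distance per row, kept as a column so the output shape is `(batch, 1)`, matching `vec_output_shape`. The function name `squared_distance` and the arrays are illustrative only.

```python
import numpy as np

def squared_distance(x, y):
    """Row-wise squared Euclidean distance, returned with shape (batch, 1)."""
    return np.sum(np.square(x - y), axis=1, keepdims=True)

a = np.array([[1.0, 2.0], [0.0, 0.0]])
b = np.array([[1.0, 0.0], [3.0, 4.0]])

dist = squared_distance(a, b)
print(dist.shape)     # (2, 1)
print(dist.ravel())   # [ 4. 25.]
```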


In [None]:
#Embedding is already imported from keras.layers above

nb_words = len(word_index) + 1 #must match the number of rows in embedding_matrix
max_sentence_len = 25
embedding_layer = Embedding(nb_words, 300,
        weights=[embedding_matrix],
        input_length=max_sentence_len, trainable=False)
#don't train this layer!