## Quora Kaggle Challenge

First attempt at the Quora Kaggle challenge, using just one word embedding (GloVe) and a simple model: an embedding layer followed by dropout, a flatten step, and a single sigmoid output unit.

For each qid in the test set, you must predict whether the corresponding question_text is insincere (1) or not (0). Predictions should only be the integers 0 or 1.



- https://www.kaggle.com/c/quora-insincere-questions-classification

In [1]:
import os
import time
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from tqdm import tqdm
import math
from sklearn.model_selection import train_test_split
from sklearn import metrics

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, CuDNNGRU, Conv1D
from keras.layers import Bidirectional, GlobalMaxPool1D, Flatten
from keras.models import Model, Sequential
from keras import initializers, regularizers, constraints, optimizers, layers

Using TensorFlow backend.


In [2]:
#read in the data
train_df = pd.read_csv("data/train.csv")
test_df = pd.read_csv("data/test.csv")
print("Train shape : ",train_df.shape)
print("Test shape : ",test_df.shape)

Train shape :  (1306122, 3)
Test shape :  (56370, 2)


In [3]:
# let's look at the data
train_df.head()

Unnamed: 0,qid,question_text,target
0,00002165364db923c7e6,How did Quebec nationalists see their province...,0
1,000032939017120e6e44,"Do you have an adopted dog, how would you enco...",0
2,0000412ca6e4628ce2cf,Why does velocity affect time? Does velocity a...,0
3,000042bf85aa498cd78e,How did Otto von Guericke used the Magdeburg h...,0
4,0000455dfa3e01eae3af,Can I convert montra helicon D to a mountain b...,0


Next steps are as follows:
 * Split the training dataset into train and validation samples (cross-validation is too expensive here)
 * Fill missing values in the text column with '_na_'
 * Tokenize the text column and convert each question to a sequence of integers
 * Pad the sequences as needed: if the number of words in a text is greater than 'maxlen', truncate it to 'maxlen'; if it is less than 'maxlen', add zeros for the remaining values.
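
The padding step can be sketched in plain Python. This is only an illustration of the idea, not the Keras implementation: `pad_sequence` is a hypothetical helper, and it assumes the Keras `pad_sequences` defaults (`padding='pre'`, `truncating='pre'`), which put zeros in front of short sequences and keep the last `maxlen` items of long ones.

```python
def pad_sequence(seq, maxlen, value=0):
    """Pad or truncate one integer sequence to exactly `maxlen` items.

    Assumes 'pre' padding and 'pre' truncation: zeros are added in front,
    and over-long sequences keep only their last `maxlen` items.
    """
    if len(seq) >= maxlen:
        return seq[len(seq) - maxlen:]
    return [value] * (maxlen - len(seq)) + seq


short = pad_sequence([5, 8, 2], maxlen=5)
long = pad_sequence([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)  # [0, 0, 5, 8, 2]
print(long)   # [3, 4, 5, 6, 7]
```

With `maxlen = 100` most questions are shorter than the limit, so most rows would start with a run of zeros.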

In [4]:
## split to train and val
train_df, val_df = train_test_split(train_df, test_size=0.1, random_state=2018)

## some config values 
embed_size = 300 # how big is each word vector
max_features = 50000 # how many unique words to use (i.e num rows in embedding vector)
maxlen = 100 # max number of words in a question to use

## fill up the missing values
train_X = train_df["question_text"].fillna("_na_").values
val_X = val_df["question_text"].fillna("_na_").values
test_X = test_df["question_text"].fillna("_na_").values

In [5]:
train_X[0]

'What have been the best exhibits at the Museo del Prado in Madrid?'

In [6]:
## Tokenize the sentences
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_X))

In [7]:
print(dir(tokenizer))

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'char_level', 'document_count', 'filters', 'fit_on_sequences', 'fit_on_texts', 'index_docs', 'lower', 'num_words', 'oov_token', 'sequences_to_matrix', 'split', 'texts_to_matrix', 'texts_to_sequences', 'texts_to_sequences_generator', 'word_counts', 'word_docs', 'word_index']


In [8]:
# produce a list of lists: each inner list holds one sentence,
# with every word replaced by its integer index in the tokenizer's word_index
train_X = tokenizer.texts_to_sequences(train_X)
print(train_X[:10])

[[2, 24, 113, 1, 34, 21469, 43, 1, 49489, 9003, 6, 5043], [9, 15, 8, 6912, 2938, 1211, 1694], [30, 3, 1, 34, 3180, 5062, 6, 9336, 848, 591], [9, 11, 8, 4220, 2110, 10, 9146, 6, 4249], [11, 14, 2192, 37, 488], [9, 11, 8, 33, 1394, 6742, 6, 2420, 1540, 6, 66, 1272, 6, 11482], [9, 11, 8, 76, 1, 5666, 985], [15, 14, 3573, 4, 7470, 31612, 20, 14, 24, 4, 53, 377, 13, 17], [11, 68, 16033, 536, 8556, 324, 7, 7856], [48, 14, 24, 50, 28901, 140, 57, 22, 28, 487]]
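
To make the integer indices above less opaque, here is a toy sketch of how a frequency-ranked word index behaves. It is a simplified stand-in, not the Keras `Tokenizer`: the helper names and the toy sentences are made up, and it only lowercases and splits on whitespace, whereas the real tokenizer also strips punctuation. The assumption it illustrates is that the most frequent word gets index 1 and that words ranked at or beyond `num_words` are dropped from the output sequences.

```python
from collections import Counter


def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # sorted() is stable, so words with equal counts keep first-seen order
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def texts_to_sequences(texts, word_index, num_words=None):
    """Map words to indices, skipping unknown words and indices >= num_words."""
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word)
            if idx is not None and (num_words is None or idx < num_words):
                seq.append(idx)
        sequences.append(seq)
    return sequences


toy_texts = ["what is the best museum", "what is the time", "is the museum open"]
toy_index = fit_word_index(toy_texts)
toy_seqs = texts_to_sequences(toy_texts, toy_index, num_words=4)
print(toy_index)
print(toy_seqs)
```

This is why small integers in the real output correspond to very common words, and why no index of 50000 or above appears when `num_words=max_features`.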


In [13]:
#repeat for the val and test
val_X = tokenizer.texts_to_sequences(val_X)
test_X = tokenizer.texts_to_sequences(test_X)

## Pad the sentences to the same length
train_X = pad_sequences(train_X, maxlen=maxlen)
val_X = pad_sequences(val_X, maxlen=maxlen)
test_X = pad_sequences(test_X, maxlen=maxlen)

## Get the target values
train_y = train_df['target'].values
val_y = val_df['target'].values

In [14]:
print(train_X.shape)
print(val_X.shape)
print(test_X.shape)

(1175509, 100)
(130613, 100)
(56370, 100)
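
The row counts above follow directly from the 90/10 split. As a quick check, assuming `train_test_split` rounds a fractional `test_size` up to a whole number of rows and gives the remainder to the training set:

```python
import math

n_samples = 1306122  # rows in train.csv, from the shape printed earlier
test_size = 0.1      # fraction passed to train_test_split

n_val = math.ceil(test_size * n_samples)  # validation rows, rounded up
n_train = n_samples - n_val               # everything else is training data
print(n_train, n_val)
```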


### GloVe Embeddings
- glove.840B.300d - https://nlp.stanford.edu/projects/glove/

In [9]:
#load the entire GloVe word embedding file into memory as a dictionary of
#word to embedding array
path = 'D:\\ml_code\\embeddings\\glove.840B.300d\\'
EMBEDDING_FILE = f'{path}glove.840B.300d.txt'

def get_coefs(word,*arr): 
    return word, np.asarray(arr, dtype='float32')

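# read the file inside a context manager so the handle is closed afterwards;
# the GloVe text file is UTF-8, and undecodable bytes are skipped
with open(EMBEDDING_FILE, encoding='utf8', errors='ignore') as f:
    embeddings_index = dict(get_coefs(*line.rstrip().split(" ")) for line in f)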

In [10]:
#create a matrix of one embedding for each word in the training
#dataset. We can do that by enumerating all unique words in the Tokenizer.word index and
#locating the embedding weight vector from the loaded GloVe embedding. The result is a matrix
#of weights only for words we will see during training.

# np.stack needs a sequence, so materialise the dict values as a list
all_embs = np.stack(list(embeddings_index.values()))
emb_mean, emb_std = all_embs.mean(), all_embs.std()
embed_size = all_embs.shape[1]

word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))

# initialise the embedding matrix with random values that match the GloVe mean and std,
# so words without a pretrained vector still get a plausible starting point
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))

# overwrite each row with the pretrained vector where the word is in GloVe
for word, i in word_index.items():
    if i >= nb_words:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
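
The same lookup logic can be followed on a toy vocabulary. The sketch below uses only the standard library; the three-dimensional vectors, the word `"zzyzx"`, and all `toy_*` names are invented for illustration and have nothing to do with the real GloVe file.

```python
import random

random.seed(0)  # reproducible toy example

# Hypothetical 3-dimensional "pretrained" vectors (real GloVe vectors here have 300 dimensions)
toy_embeddings = {"the": [0.1, 0.2, 0.3], "cat": [0.4, 0.5, 0.6]}
toy_word_index = {"the": 1, "cat": 2, "zzyzx": 3}  # "zzyzx" has no pretrained vector
toy_max_features = 4
dim = 3

# mean and (population) standard deviation over every pretrained value
values = [v for vec in toy_embeddings.values() for v in vec]
mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5

# start with random rows, then overwrite the rows that have a pretrained vector
matrix = [[random.gauss(mean, std) for _ in range(dim)] for _ in range(toy_max_features)]
for word, i in toy_word_index.items():
    if i >= toy_max_features:
        continue
    vec = toy_embeddings.get(word)
    if vec is not None:
        matrix[i] = list(vec)

for row in matrix:
    print(row)
```

Rows 1 and 2 end up holding the pretrained vectors, while row 3 (the unknown word) and row 0 (the index used for padding) keep their random initial values.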

### Declare a simple model

In [11]:
# prepare a first simple model: embedding -> dropout -> flatten -> sigmoid output
model = Sequential()
e = Embedding(max_features, embed_size, weights=[embedding_matrix], input_length=maxlen)
model.add(e)
model.add(Dropout(0.1))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
#compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 100, 300)          15000000  
_________________________________________________________________
dropout_1 (Dropout)          (None, 100, 300)          0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 30000)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 30001     
Total params: 15,030,001
Trainable params: 15,030,001
Non-trainable params: 0
_________________________________________________________________
None
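
The parameter counts in the summary can be reproduced by hand. A short check, assuming an embedding layer stores one vector per vocabulary entry and a dense layer stores one weight per input per unit plus one bias per unit:

```python
max_features, embed_size, maxlen = 50000, 300, 100  # config values from the notebook

embedding_params = max_features * embed_size  # one 300-d vector per vocabulary row
flattened_size = maxlen * embed_size          # Flatten turns (100, 300) into 30000 values
dense_params = flattened_size * 1 + 1         # weights for one output unit, plus its bias
total_params = embedding_params + dense_params

print(embedding_params, flattened_size, dense_params, total_params)
```

Almost all of the trainable parameters sit in the embedding layer, because it is initialised from GloVe but not frozen.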


In [15]:
#run for just one epoch
model.fit(train_X, train_y, batch_size=512, epochs=1, validation_data=(val_X, val_y),verbose=2)

Train on 1175509 samples, validate on 130613 samples
Epoch 1/1
 - 88s - loss: 0.1401 - acc: 0.9494 - val_loss: 0.1250 - val_acc: 0.9528


<keras.callbacks.History at 0x1cbc2558828>

In [16]:
#apply to validation set
pred_glove_val_y = model.predict([val_X], batch_size=1024, verbose=1)



In [17]:
#look for a better threshold
for thresh in np.arange(0.1, 0.501, 0.01):
    thresh = np.round(thresh, 2)
    print("F1 score at threshold {0} is {1}".format(thresh, metrics.f1_score(val_y, (pred_glove_val_y>thresh).astype(int))))

F1 score at threshold 0.1 is 0.5002051983584132
F1 score at threshold 0.11 is 0.5140422510207705
F1 score at threshold 0.12 is 0.5271317829457365
F1 score at threshold 0.13 is 0.5376727935591676
F1 score at threshold 0.14 is 0.5472737932382254
F1 score at threshold 0.15 is 0.5561982807102113
F1 score at threshold 0.16 is 0.5636296265799333
F1 score at threshold 0.17 is 0.5711400075748011
F1 score at threshold 0.18 is 0.5775065537840044
F1 score at threshold 0.19 is 0.5840607044168603
F1 score at threshold 0.2 is 0.5896977284922196
F1 score at threshold 0.21 is 0.5945084467920404
F1 score at threshold 0.22 is 0.5987579942534063
F1 score at threshold 0.23 is 0.6014584803575629
F1 score at threshold 0.24 is 0.6054694970357621
F1 score at threshold 0.25 is 0.6085395439107231
F1 score at threshold 0.26 is 0.611706837186424
F1 score at threshold 0.27 is 0.6144950652975775
F1 score at threshold 0.28 is 0.6176782736139889
F1 score at threshold 0.29 is 0.6179510797257189
F1 score at threshold 0
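
Rather than reading the best threshold off the printed list, the search can return it directly. The following is a minimal sketch in plain Python on made-up labels and probabilities; `f1_score` here is a hand-written helper rather than the scikit-learn function, and the toy data is not from the competition.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels, returning 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, probs, thresholds):
    """Return (threshold, f1) for the first threshold with the highest F1."""
    scored = [(f1_score(y_true, [int(p > t) for p in probs]), t) for t in thresholds]
    best_f1, best_t = max(scored, key=lambda s: s[0])
    return best_t, best_f1


# Toy data, invented for illustration only
toy_y = [0, 0, 0, 0, 1, 1, 0, 1]
toy_probs = [0.02, 0.10, 0.30, 0.05, 0.45, 0.80, 0.20, 0.33]
thresholds = [round(0.1 + 0.01 * k, 2) for k in range(41)]  # 0.10, 0.11, ..., 0.50

best_t, best_f1 = best_threshold(toy_y, toy_probs, thresholds)
print(best_t, best_f1)
```

Choosing the threshold on the validation set and then reusing it on the test set, as the notebook does, keeps the test predictions independent of the tuning.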

In [18]:
pred_glove_test_y = model.predict([test_X], batch_size=1024, verbose=1)



In [19]:
pred_glove_test_y

array([[0.00277655],
       [0.00458942],
       [0.0101913 ],
       ...,
       [0.02646757],
       [0.10272979],
       [0.8694278 ]], dtype=float32)

In [20]:
pred_test_y = (pred_glove_test_y>0.35).astype(int)

In [21]:
#write to submit file
#out_df = pd.DataFrame({"qid":test_df["qid"].values})
#out_df['prediction'] = pred_test_y
#out_df.to_csv("submission.csv", index=False)
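
For reference, the submission file is just two columns, `qid` and an integer `prediction`. The sketch below builds that layout with the standard library on invented ids; `toy_qids` and `toy_pred` are placeholders, and the in-memory buffer stands in for a real `submission.csv` file.

```python
import csv
import io

toy_qids = ["qid_a", "qid_b", "qid_c"]  # hypothetical ids, not real qids
toy_pred = [[0], [1], [0]]              # shape (n, 1), like the thresholded model output

# flatten the (n, 1) predictions to one integer per question
flat_pred = [row[0] for row in toy_pred]

buffer = io.StringIO()  # stands in for open("submission.csv", "w", newline="")
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["qid", "prediction"])
writer.writerows(zip(toy_qids, flat_pred))

submission_text = buffer.getvalue()
print(submission_text)
```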

In [22]:
#do a clean up
del word_index, embeddings_index, all_embs, embedding_matrix, model
import gc; gc.collect()
time.sleep(10)