# LSTM Training
This notebook can be used to train an LSTM for text classification and generate predictions for the kaggle competition found [here](https://www.kaggle.com/c/quora-insincere-questions-classification). 

The notebook utilizes Keras and GloVe for preprocessing using word embeddings. Then, Keras with Tensorflow backend is used for training a deep LSTM. Feel free to fork!

In [16]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

from sklearn.model_selection import train_test_split

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Input, Dense, Embedding
from keras.layers import Conv1D, MaxPooling1D, Flatten
from keras.models import Model

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [4]:
# Load in training and testing data
train_df = pd.read_csv('./input/train.csv')
train_df.head()

Unnamed: 0,qid,question_text,target
0,00002165364db923c7e6,How did Quebec nationalists see their province...,0
1,000032939017120e6e44,"Do you have an adopted dog, how would you enco...",0
2,0000412ca6e4628ce2cf,Why does velocity affect time? Does velocity a...,0
3,000042bf85aa498cd78e,How did Otto von Guericke used the Magdeburg h...,0
4,0000455dfa3e01eae3af,Can I convert montra helicon D to a mountain b...,0


In [5]:
# Extract the training data and corresponding labels
text = train_df['question_text'].fillna('unk').values
labels = train_df['target'].values

# Split into training and validation sets by making use of the scikit-learn
# function train_test_split
X_train, X_val, y_train, y_val = train_test_split(text, labels,\
                                                  test_size=0.2)

In [6]:
embed_size = 300 # Size of each word vector
max_words = 50000 # how many unique words to use (i.e num rows in embedding vector)
maxlen = 100 # max number of words in a question to use

In [7]:
## Tokenize the sentences
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(list(X_train))

# The tokenizer will assign an integer value to each word in the dictionary
# and then convert each string of words into a list of integer values
X_train = tokenizer.texts_to_sequences(X_train)
X_val = tokenizer.texts_to_sequences(X_val)

word_index = tokenizer.word_index
print('The word index consists of {} unique tokens.'.format(len(word_index)))

## Pad the sentences 
X_train = pad_sequences(X_train, maxlen=maxlen)
X_val = pad_sequences(X_val, maxlen=maxlen)

The word index consists of 196262 unique tokens.


In [9]:
# Create the embedding dictionary from the word embedding file
embedding_dict = {}
filename = os.path.join('./input/embeddings/', 'glove.840B.300d/glove.840B.300d.txt')
with open(filename) as f:
    for line in f:
        line = line.split()
        token = line[0]
        try:
            coefs = np.asarray(line[1:], dtype='float32')
            embedding_dict[token] = coefs
        except:
            pass
print('The embedding dictionary has {} items'.format(len(embedding_dict)))

The embedding dictionary has 2195884 items


In [13]:
# Create the embedding layer weight matrix
embed_mat = np.zeros(shape=[max_words, embed_size])
for word, idx in word_index.items():
    # Word index is ordered from most frequent to least frequent
    # Ignore words that occur less frequently
    if idx >= max_words: continue
    vector = embedding_dict.get(word)
    if vector is not None:
        embed_mat[idx] = vector

In [27]:
def create_cnn():
    sequence_input = Input(shape=(maxlen,), dtype='int32')
    embed_seq = Embedding(max_words, embed_size, weights=[embed_mat], trainable=False)(sequence_input)
    x = Conv1D(128, 3, activation='relu')(embed_seq)
    x = MaxPooling1D(3)(x)
    x = Conv1D(128, 3, activation='relu')(x)
    x = MaxPooling1D(3)(x)
    x = Conv1D(128, 3, activation='relu')(x)
    x = MaxPooling1D(3)(x)  # global max pooling
    x = Flatten()(x)
    x = Dense(64, activation='relu')(x)
    preds = Dense(1, activation='sigmoid')(x)

    model = Model(sequence_input, preds)
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['acc'])
    model.summary()
    
    return model

In [26]:
# Create and train network
cnn = create_cnn()
cnn.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=3, batch_size=512)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_8 (InputLayer)         (None, 100)               0         
_________________________________________________________________
embedding_7 (Embedding)      (None, 100, 300)          15000000  
_________________________________________________________________
conv1d_16 (Conv1D)           (None, 98, 128)           115328    
_________________________________________________________________
max_pooling1d_15 (MaxPooling (None, 32, 128)           0         
_________________________________________________________________
conv1d_17 (Conv1D)           (None, 30, 128)           49280     
_________________________________________________________________
max_pooling1d_16 (MaxPooling (None, 10, 128)           0         
_________________________________________________________________
conv1d_18 (Conv1D)           (None, 8, 128)            49280     
__________

Train on 1044897 samples, validate on 261225 samples
Epoch 1/3


    512/1044897 [..............................] - ETA: 2:56:03 - loss: 0.6964 - acc: 0.8711

   1024/1044897 [..............................] - ETA: 1:52:30 - loss: 0.6900 - acc: 0.8984

   1536/1044897 [..............................] - ETA: 1:31:02 - loss: 0.6833 - acc: 0.9160

   2048/1044897 [..............................] - ETA: 1:18:50 - loss: 0.6769 - acc: 0.9214

   2560/1044897 [..............................] - ETA: 1:11:21 - loss: 0.6730 - acc: 0.9262

   3072/1044897 [..............................] - ETA: 1:06:19 - loss: 0.6675 - acc: 0.9290

   3584/1044897 [..............................] - ETA: 1:02:28 - loss: 0.6639 - acc: 0.9311

   4096/1044897 [..............................] - ETA: 59:41 - loss: 0.6594 - acc: 0.9299  

   4608/1044897 [..............................] - ETA: 57:29 - loss: 0.6567 - acc: 0.9293

   5120/1044897 [..............................] - ETA: 55:45 - loss: 0.6481 - acc: 0.9307

   5632/1044897 [..............................] - ETA: 54:19 - loss: 0.6418 - acc: 0.9306

   6144/1044897 [..............................] - ETA: 53:47 - loss: 0.6317 - acc: 0.9315

   6656/1044897 [..............................] - ETA: 52:54 - loss: 0.6251 - acc: 0.9321

   7168/1044897 [..............................] - ETA: 52:11 - loss: 0.6159 - acc: 0.9325

   7680/1044897 [..............................] - ETA: 51:47 - loss: 0.6048 - acc: 0.9322

   8192/1044897 [..............................] - ETA: 51:29 - loss: 0.5912 - acc: 0.9330

   8704/1044897 [..............................] - ETA: 51:17 - loss: 0.5789 - acc: 0.9332

KeyboardInterrupt: 

# 4. Predictions
The remainder of this notebok will generate predictions from the test set and write them to a submission csv file for the kaggle competition.

In [36]:
test_df = pd.read_csv('../input/test.csv')
X_test = test_df['question_text'].values

# Perform the same preprocessing as was done on the training set
X_test = tokenizer.texts_to_sequences(X_test)
X_test = pad_sequences(X_test, maxlen=maxlen)

# Make predictions, ensure that predictions are in integer form
preds = np.rint(lstm.predict([X_test], batch_size=1024, verbose=1)).astype('int')
test_df['prediction'] = preds



Let's examine a few examples of sincere predictions and insincere predictions. It appears that our network is making meaningful predictions.

In [37]:
n=5
sin_sample = test_df.loc[test_df['prediction'] == 0]['question_text'].head(n)
print('Sincere Samples:')
for idx, row in enumerate(sin_sample):
    print('{}'.format(idx+1), row)

print('\n')
print('Insincere Samples:')
insin_sample = test_df.loc[test_df['prediction'] == 1]['question_text'].head(n)
for idx, row in enumerate(insin_sample):
    print('{}'.format(idx+1), row)

Sincere Samples:
1 My voice range is A2-C5. My chest voice goes up to F4. Included sample in my higher chest range. What is my voice type?
2 How much does a tutor earn in Bangalore?
3 What are the best made pocket knives under $200-300?
4 Why would they add a hypothetical scenario that’s impossible to happen in the link below? It shows what 800 meters rise in sea level would look like.
5 What is the dresscode for Techmahindra freshers?


Insincere Samples:
1 Why don't India start a War with Pakistan ? They Kill our Soldiers.
2 Why do people think white privilege is real when it's blatantly not?
3 Why does Quora send me a notice because I told a guy from England that he wasn’t American so he shouldn’t worry about our gun laws?
4 Can a bleeding heart liberal be happily married to a militant Republican, when they fundamentally disagree on everything? I'm an optimistic feminist who believes in hope, and he's a die hard gun enthusiast who borders on misogyny and racism.
5 Why do these Sikhs

In [None]:
test_df = test_df.drop('question_text', axis=1)
test_df.to_csv('submission.csv', index=False)