# LSTM Training
This notebook trains an LSTM for text classification and generates predictions for the [Quora Insincere Questions Classification](https://www.kaggle.com/c/quora-insincere-questions-classification) Kaggle competition.

The notebook uses the Keras `Tokenizer` and pretrained GloVe word embeddings for preprocessing, then trains a bidirectional LSTM using Keras with the TensorFlow backend.

Ensure that `train.csv` and `test.csv` are in the `data/` directory of this project, and that the GloVe file is at `embeddings/glove.840B.300d/glove.840B.300d.txt`.
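
A quick check before running the cells below can save a failed `read_csv` call. This is a minimal sketch: the helper name `missing_files` is hypothetical, and the file names are the ones listed above.

```python
import os

def missing_files(data_dir, names=("train.csv", "test.csv")):
    """Return the expected file names that are not present in data_dir."""
    return [n for n in names if not os.path.isfile(os.path.join(data_dir, n))]

missing = missing_files("data")
if missing:
    print("Missing from data/: {}".format(", ".join(missing)))
```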

In [5]:
import pandas as pd
import numpy as np
import os
from tqdm import tqdm_notebook as tqdm


from sklearn.model_selection import train_test_split

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Input, Dense, Embedding, LSTM, Bidirectional, SpatialDropout1D, GlobalMaxPooling1D, Dropout
from keras.models import Model

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [6]:
# Load in training and testing data
train_df = pd.read_csv('data/train.csv')
train_df.head()

| | qid | question_text | target |
|---|---|---|---|
| 0 | 00002165364db923c7e6 | How did Quebec nationalists see their province... | 0 |
| 1 | 000032939017120e6e44 | Do you have an adopted dog, how would you enco... | 0 |
| 2 | 0000412ca6e4628ce2cf | Why does velocity affect time? Does velocity a... | 0 |
| 3 | 000042bf85aa498cd78e | How did Otto von Guericke used the Magdeburg h... | 0 |
| 4 | 0000455dfa3e01eae3af | Can I convert montra helicon D to a mountain b... | 0 |


In [7]:
# Check for rows containing null values (nothing is removed here)
train_df[train_df.isnull().any(axis=1)].shape

(0, 3)

In [8]:
# Extract the training data and corresponding labels
text = train_df['question_text'].values
labels = train_df['target'].values

# Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(text, labels,
                                                  test_size=0.2)

In [9]:
embed_size = 300 # Size of each word vector
max_words = 30000 # how many unique words to use (i.e. number of rows in the embedding matrix)
maxlen = 100 # max number of words in a question to use

In [10]:
## Tokenize the sentences
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(list(X_train))

X_train = tokenizer.texts_to_sequences(X_train)
X_val = tokenizer.texts_to_sequences(X_val)

word_index = tokenizer.word_index
print('The word index consists of {} unique tokens.'.format(len(word_index)))

## Pad the sentences 
X_train = pad_sequences(X_train, maxlen=maxlen)
X_val = pad_sequences(X_val, maxlen=maxlen)

The word index consists of 196165 unique tokens.
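
The tokenizing and padding steps above can be illustrated without Keras. The sketch below mimics, in plain Python, what `texts_to_sequences` does once `num_words` is set (words whose index is `num_words` or higher are dropped, which is why the word index can hold far more than `max_words` entries) and what `pad_sequences` does with its default settings (zeros are added at the front, and over-long sequences are cut from the front). The helper names and the toy vocabulary are hypothetical; this illustrates the behavior and is not Keras's implementation.

```python
def to_sequence(words, word_index, num_words):
    """Map words to indices, dropping unknown words and indices >= num_words."""
    seq = []
    for w in words:
        idx = word_index.get(w)
        if idx is not None and idx < num_words:
            seq.append(idx)
    return seq

def pad_pre(seq, maxlen):
    """Left-pad with zeros, or keep only the last maxlen entries."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

toy_index = {"why": 1, "does": 2, "velocity": 3, "affect": 4, "time": 5}
seq = to_sequence("why does velocity affect time".split(), toy_index, num_words=5)
print(seq)              # [1, 2, 3, 4]  ("time" has index 5, so it is dropped)
print(pad_pre(seq, 6))  # [0, 0, 1, 2, 3, 4]
print(pad_pre(seq, 3))  # [2, 3, 4]
```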


In [11]:
# Create the embedding dictionary from the word embedding file
embedding_dict = {}
filename = os.path.join('./embeddings/', 'glove.840B.300d/glove.840B.300d.txt')
#pbar = tqdm(total=os.path.getsize(filename))
with open(filename, encoding='utf-8') as f:
    for line in tqdm(f):
        line = line.split()
        token = line[0]
        try:
            coefs = np.asarray(line[1:], dtype='float32')
            embedding_dict[token] = coefs
        except ValueError:
            # Skip lines whose vector part cannot be parsed as floats
            pass
print('The embedding dictionary has {} items'.format(len(embedding_dict)))



The embedding dictionary has 2195884 items
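
Each line of the GloVe text file holds a token followed by its vector components, separated by spaces. Some tokens in `glove.840B.300d` appear to contain spaces themselves, which is presumably why the cell above guards the conversion with `try`: splitting on whitespace then leaves non-numeric strings in the vector part, and those lines are skipped. One alternative, sketched below under that assumption, splits from the right so that the last `dim` fields are always the vector. `parse_glove_line` is a hypothetical helper, and the sample lines are made up with `dim=3` for brevity.

```python
def parse_glove_line(line, dim):
    """Split one embedding line into (token, list of dim floats)."""
    parts = line.rstrip("\n").rsplit(" ", dim)
    if len(parts) != dim + 1:
        raise ValueError("expected a token followed by {} values".format(dim))
    return parts[0], [float(v) for v in parts[1:]]

print(parse_glove_line("the 0.1 -0.2 0.3\n", dim=3))
print(parse_glove_line(". . . 0.5 0.6 0.7\n", dim=3))
```

In the notebook, the returned list could be wrapped in `np.asarray(..., dtype='float32')` before being stored in `embedding_dict`.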


In [22]:
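# Build the embedding matrix: row idx holds the GloVe vector of the word with
# index idx. Words without a pretrained vector keep an all-zero row.
embed_mat = np.zeros(shape=[max_words, embed_size])
for word, idx in word_index.items():
    if idx >= max_words:
        continue
    vector = embedding_dict.get(word)
    if vector is not None:
        embed_mat[idx] = vector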
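
Words without a GloVe vector keep an all-zero row in `embed_mat`, so it can be useful to know how much of the retained vocabulary is actually covered. A minimal sketch, using toy dictionaries in place of `word_index` and `embedding_dict`; `embedding_coverage` is a hypothetical helper.

```python
def embedding_coverage(word_index, embedding_dict, max_words):
    """Fraction of the words with index < max_words that have a pretrained vector."""
    kept = [w for w, idx in word_index.items() if idx < max_words]
    if not kept:
        return 0.0
    found = sum(1 for w in kept if w in embedding_dict)
    return found / len(kept)

toy_word_index = {"how": 1, "did": 2, "quebec": 3, "montra": 4}
toy_embeddings = {"how": [0.1], "did": [0.2], "quebec": [0.3]}
print(embedding_coverage(toy_word_index, toy_embeddings, max_words=5))
```

A low value here would suggest revisiting the tokenizer settings (for example, lowercasing) before training.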

In [23]:
def create_lstm():
    inputs = Input(shape=(maxlen,))
    # Frozen embedding layer initialized with the GloVe matrix
    x = Embedding(max_words, embed_size, weights=[embed_mat], trainable=False)(inputs)
    x = SpatialDropout1D(0.1)(x)
    x = Bidirectional(LSTM(100, return_sequences=True))(x)
    # Max over time steps collapses (maxlen, 200) to a single 200-d vector
    x = GlobalMaxPooling1D()(x)
    x = Dense(32, activation="relu")(x)
    x = Dropout(0.1)(x)
    output = Dense(1, activation="sigmoid")(x)

    model = Model(inputs=inputs, outputs=output)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()  # summary() prints directly and returns None
    
    return model
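
The layer sizes chosen in `create_lstm` determine the number of weights in each layer, and working them out by hand is a useful check against the output of `model.summary()`. The sketch below uses the standard formulas: an embedding has one row per word, an LSTM has four gates that each see the input and the previous hidden state plus a bias, and a dense layer has a weight matrix plus a bias. The function names are hypothetical.

```python
def embedding_params(vocab_size, embed_dim):
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    return input_dim * units + units

print(embedding_params(30000, 300))   # 9000000
print(2 * lstm_params(300, 100))      # 320800 (two directions)
print(dense_params(200, 32))          # 6432
print(dense_params(32, 1))            # 33
```

Because the embedding layer is created with `trainable=False`, only the LSTM and dense weights (320800 + 6432 + 33 = 327265) are updated during training.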

In [26]:
lstm = create_lstm()
lstm.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=1, batch_size=512)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         (None, 100)               0         
_________________________________________________________________
embedding_5 (Embedding)      (None, 100, 300)          9000000   
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 100, 300)          0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100, 200)          320800    
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 200)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 32)                6432      
_________________________________________________________________
dropout_1 (Dropout)          (None, 32)                0         
__________

Train on 1044897 samples, validate on 261225 samples
Epoch 1/1


    128/1044897 [..............................] - ETA: 20:34:45 - loss: 0.7482 - acc: 0.1719

    256/1044897 [..............................] - ETA: 11:59:09 - loss: 0.6978 - acc: 0.4883

    384/1044897 [..............................] - ETA: 9:04:22 - loss: 0.6503 - acc: 0.6250 

    512/1044897 [..............................] - ETA: 7:36:30 - loss: 0.6011 - acc: 0.7012

    640/1044897 [..............................] - ETA: 6:43:26 - loss: 0.5603 - acc: 0.7500

    768/1044897 [..............................] - ETA: 6:08:13 - loss: 0.5268 - acc: 0.7799

    896/1044897 [..............................] - ETA: 5:44:10 - loss: 0.4971 - acc: 0.8013

   1024/1044897 [..............................] - ETA: 5:25:10 - loss: 0.4711 - acc: 0.8184

   1152/1044897 [..............................] - ETA: 5:10:29 - loss: 0.4547 - acc: 0.8281

   1280/1044897 [..............................] - ETA: 4:58:37 - loss: 0.4336 - acc: 0.8398

   1408/1044897 [..............................] - ETA: 4:49:06 - loss: 0.4127 - acc: 0.8501

   1536/1044897 [..............................] - ETA: 4:41:03 - loss: 0.4034 - acc: 0.8568

   1664/1044897 [..............................] - ETA: 4:39:54 - loss: 0.3918 - acc: 0.8636

   1792/1044897 [..............................] - ETA: 4:33:34 - loss: 0.3928 - acc: 0.8655

   1920/1044897 [..............................] - ETA: 4:28:11 - loss: 0.3816 - acc: 0.8708

   2048/1044897 [..............................] - ETA: 4:23:30 - loss: 0.3712 - acc: 0.8760

   2176/1044897 [..............................] - ETA: 4:19:16 - loss: 0.3674 - acc: 0.8787

   2304/1044897 [..............................] - ETA: 4:16:35 - loss: 0.3579 - acc: 0.8828

KeyboardInterrupt: 

# Predictions
The remainder of this notebook generates predictions for the test set and writes them to a submission CSV file.

In [None]:
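test_df = pd.read_csv('data/test.csv')
X_test = test_df['question_text'].values

X_test = tokenizer.texts_to_sequences(X_test)
X_test = pad_sequences(X_test, maxlen=maxlen)

# Assumption: a 0.5 cutoff on the sigmoid output and a qid/prediction submission layout
preds = (lstm.predict(X_test, batch_size=512) > 0.5).astype(int).ravel()
submission = pd.DataFrame({'qid': test_df['qid'], 'prediction': preds})
submission.to_csv('data/bmmidei_NB_Submission_1', index=False)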
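
The model's sigmoid output is a probability, while a submission needs a 0/1 label, so a cutoff has to be chosen. Assuming the competition is scored by F1 (worth confirming on the competition page), one option is to pick the cutoff that maximizes F1 on the validation set instead of defaulting to 0.5. The sketch below is plain Python on toy values; `f1_score_binary` and `best_threshold` are hypothetical helpers, and in the notebook the inputs would be `y_val` and the model's predictions on `X_val`.

```python
def f1_score_binary(y_true, y_pred):
    """F1 for the positive class, returning 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(y_true, probs, candidates):
    """Return (best F1, threshold); ties go to the larger threshold."""
    scored = [(f1_score_binary(y_true, [int(p > c) for p in probs]), c)
              for c in candidates]
    return max(scored)

y_true = [0, 0, 1, 1, 0, 1]
probs = [0.05, 0.40, 0.35, 0.80, 0.10, 0.60]
candidates = [i / 10 for i in range(1, 10)]
best_f1, cutoff = best_threshold(y_true, probs, candidates)
print(best_f1, cutoff)
```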