# Episode 5: Building castles in the sky, or a memory palace Part 1 
Eu Jin Lok

17 March 2018

# LSTM  
In this notebook we will go into the details of how to build a document classifier using an LSTM, a deep learning architecture that is able to remember long-term dependencies. For the full background on this topic, please check out my blog post at this link:

https://mungingdata.wordpress.com/2018/03/21/episode-5-building-castles-in-the-sky-or-a-memory-palace-part-1/

The dataset is the same one used in Episode 4, and this notebook builds upon the previous CNN architecture. So without further ado, let's begin...

In [1]:
#import the key libraries 
import pandas as pd 
from pandas import crosstab
import numpy as np
import os 
import pickle
from keras.layers.recurrent import LSTM
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers import Dense, Flatten, MaxPooling1D, SpatialDropout1D, Dropout,Convolution1D
from keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.python.client import device_lib
os.chdir("C:\\Users\\User\\Dropbox\\Pet Project\\Blog\\DONE CNN\\")

Using TensorFlow backend.


So the first step, after loading the necessary packages, is to grab our training dataset, the same one from the previous Episode 4. I've just copied the code here...

In [2]:
# import data 
train = pd.read_csv("happydb\\cleaned_hm.csv")  

# Map each category label to an integer index (the one-hot encoding comes later)
categories = train.predicted_category.unique()
dic = {category: i for i, category in enumerate(categories)}
labels = train.predicted_category.apply(lambda x: dic[x])

val=train.sample(frac=0.2,random_state=200)
train=train.drop(val.index)

NUM_WORDS=20000 # If set, tokenization is restricted to the top num_words most common words in the dataset.
tokenizer = Tokenizer(num_words=NUM_WORDS,filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n\'',
                      lower=True)

# we need to fit the tokenizer on our text data in order to get the tokens
texts=train.cleaned_hm
tokenizer.fit_on_texts(texts)
word_index = tokenizer.word_index
vocab_size = len(tokenizer.word_index) + 1
print('Found %s unique tokens or words.' % len(word_index)) 

Found 23313 unique tokens or words.
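
To make the word index a little less abstract, here is a minimal pure-Python sketch of the idea behind `fit_on_texts`: count the words and number them from 1 in order of frequency. Index 0 stays free for padding, which is why `vocab_size` adds 1. The helper `build_word_index` and the toy sentences are made up for illustration, and the real Tokenizer also strips punctuation via `filters`, which this sketch skips.

```python
from collections import Counter

def build_word_index(texts):
    """Number words from 1 upwards, most frequent first (hypothetical helper)."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order
    ranked = counts.most_common()
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

toy_texts = ["I went on a date", "I went home", "I slept"]
toy_index = build_word_index(toy_texts)
toy_vocab_size = len(toy_index) + 1  # +1 because index 0 is reserved for padding

print(toy_index)
print(toy_vocab_size)
```

The most common word ("i" in the toy data) gets index 1, just as "I" maps to 1 in our real index.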


Now we need to convert the words in each sentence of our documents to their index values

In [3]:
sequences_train = tokenizer.texts_to_sequences(texts) # converts the text to numbers essentially
sequences_valid=tokenizer.texts_to_sequences(val.cleaned_hm)
word_index = tokenizer.word_index
# Although word_index contains all the words, tokenizer.texts_to_sequences takes num_words into account and drops the rarer ones.

# Check the index is working correctly 
print(texts.iloc[0])  # .iloc always gives the first training row, matching sequences_train[0]
print(sequences_train[0])
print(word_index['date'], "= index for the word 'date'")
print("'date' has index 315, and it appears in the right position (6th word) in the sentence")

I went on a successful date with someone I felt sympathy and connection with.
[1, 23, 16, 3, 758, 315, 13, 284, 1, 94, 9298, 5, 2393, 13]
315 = index for the word 'date'
'date' has index 315, and it appears in the right position (6th word) in the sentence
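
The comment about `num_words` is worth a quick illustration. As far as I understand the Keras Tokenizer, `word_index` keeps every word, but `texts_to_sequences` silently drops any word whose index is `num_words` or higher. Here is a rough pure-Python sketch of that behaviour. `encode` is a hypothetical helper, and the toy index simply reuses a few of the numbers printed above.

```python
def encode(text, word_index, num_words=None):
    """Map words to indices, skipping unknown words and those outside the cap."""
    sequence = []
    for word in text.lower().split():
        idx = word_index.get(word)
        if idx is None:
            continue  # word was never seen when fitting
        if num_words is not None and idx >= num_words:
            continue  # word is too rare to make the top num_words
        sequence.append(idx)
    return sequence

# Toy index (assumption: a hand-picked subset of the real word_index)
toy_index = {"i": 1, "went": 23, "on": 16, "a": 3, "successful": 758, "date": 315}
sentence = "I went on a successful date"

full = encode(sentence, toy_index)
capped = encode(sentence, toy_index, num_words=500)
print(full)
print(capped)
```

With the cap at 500, "successful" (index 758) disappears from the sequence while the rest keep their positions relative to each other.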


We will be fitting the data into an LSTM architecture, so we need to ensure every text has the same shape. But because each text varies in length, we'll cap it at a fixed length and pad the shorter ones with zeros to fill in the gaps

In [4]:
# set the sequence length of the text to speed up training and prevent overfitting. 
seq_len = 500
X_train = pad_sequences(sequences_train,maxlen=seq_len, value=0)
X_val = pad_sequences(sequences_valid,maxlen=seq_len, value=0)

# Lets check a single record to see how it looks
print(X_train[0]) # By default we pad the left side. In other words, all the text is right-aligned

[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0 
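
Here is a small sketch of what the padding step does to a single sequence, assuming the `pad_sequences` defaults of padding and truncating at the front. `pad_pre` is a made-up helper in plain Python, not the Keras implementation.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if too long."""
    trimmed = sequence[-maxlen:]  # assumes maxlen > 0
    return [value] * (maxlen - len(trimmed)) + trimmed

short = pad_pre([1, 23, 16], maxlen=5)
long = pad_pre([1, 23, 16, 3, 758, 315], maxlen=4)
print(short)
print(long)
```

Short texts get zeros on the left, and anything longer than the cap loses its earliest words.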

So the last part of the processing is to one-hot encode / binarise the target. That's the format that works well with Keras

In [5]:
# One-hot encode the integer labels for the training and validation rows
y_train = to_categorical(np.asarray(labels[train.index]))
y_val = to_categorical(np.asarray(labels[val.index]))
print(y_train.shape)
print(y_val.shape)

(80428, 7)
(20107, 7)
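
For anyone unsure what `to_categorical` produces, here is a tiny plain-Python sketch of the same idea: each integer label becomes a row with a single 1 in that label's column. `one_hot` is a hypothetical stand-in, and I'm using 3 classes instead of our 7 to keep it readable.

```python
def one_hot(class_ids, num_classes):
    """Turn integer labels into rows with a single 1.0 in the label's column."""
    rows = []
    for class_id in class_ids:
        row = [0.0] * num_classes
        row[class_id] = 1.0
        rows.append(row)
    return rows

encoded = one_hot([0, 2, 1], num_classes=3)
for row in encoded:
    print(row)
```

That is why the shapes above end in 7: one column per category.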


# LSTM layer only = 97% accuracy
And here we are. Since this notebook is going to be short, I thought I'd showcase a few variations of LSTM. Let's start with a simple LSTM without a pretrained embedding. 

WARNING: I'm using my desktop computer, which is set up for CUDA processing. Training time will vary depending on your hardware specification. I've printed my GPU specs below. A CPU will generally take around 10 times longer... and an LSTM takes a long, long time. 

In [12]:
device = list(device_lib.list_local_devices())
print(device[1])

name: "/gpu:0"
device_type: "GPU"
memory_limit: 104815001
locality {
  bus_id: 1
}
incarnation: 16597597132222788778
physical_device_desc: "device: 0, name: GeForce GTX 980, pci bus id: 0000:01:00.0"



In [15]:
# Use a sequential setup 
model = Sequential()
e = Embedding(vocab_size, 100, input_length=seq_len)

# Use a simple LSTM structure
model.add(e)
model.add(SpatialDropout1D(0.3))
model.add(LSTM(300, dropout=0.3, recurrent_dropout=0.3))
model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))
model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))
model.add(Dense(7, activation='sigmoid'))  # 7 targets, each modelled as its own logistic output

# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.summary() # summarize the model (summary() prints the table itself)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 500, 100)          2331400   
_________________________________________________________________
spatial_dropout1d_2 (Spatial (None, 500, 100)          0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 300)               481200    
_________________________________________________________________
dense_1 (Dense)              (None, 1024)              308224    
_________________________________________________________________
dropout_1 (Dropout)          (None, 1024)              0         
_________________________________________________________________
dense_2 (Dense)              (None, 1024)              1049600   
_________________________________________________________________
dropout_2 (Dropout)          (None, 1024)              0         
__________
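
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas (an LSTM has four gates, each with input weights, recurrent weights and a bias). The helper names are my own, and 23314 is just the `vocab_size` from earlier (23313 words + 1).

```python
def embedding_params(vocab_size, dim):
    # one vector of length `dim` per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per unit
    return input_dim * units + units

emb = embedding_params(23314, 100)
lstm = lstm_params(100, 300)
dense_1 = dense_params(300, 1024)
dense_2 = dense_params(1024, 1024)
print(emb, lstm, dense_1, dense_2)
```

These should line up with the `Param #` column above, and they show that the embedding is by far the largest block of weights in the model.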

Let's set up a checkpoint to ensure we save the best solution, plus an early stopping procedure, and then run the model for just 5 epochs

In [17]:
# setup checkpoint 
file_path="C:\\Users\\User\\Downloads\\dump\\weights_base.LSTM.hdf5"
checkpoint = ModelCheckpoint(file_path, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
early = EarlyStopping(monitor="val_acc", mode="max", patience=20)
callbacks_list = [checkpoint, early] #early

# fit the model
model.fit(X_train, y_train, batch_size=64, epochs=5, validation_split=0.2, callbacks=callbacks_list, verbose=1) 

Train on 64342 samples, validate on 16086 samples
Epoch 1/5

Epoch 00001: val_acc improved from -inf to 0.96011, saving model to C:\Users\User\Downloads\dump\weights_base.LSTM.hdf5
Epoch 2/5

Epoch 00002: val_acc improved from 0.96011 to 0.96440, saving model to C:\Users\User\Downloads\dump\weights_base.LSTM.hdf5
Epoch 3/5

Epoch 00003: val_acc improved from 0.96440 to 0.97025, saving model to C:\Users\User\Downloads\dump\weights_base.LSTM.hdf5
Epoch 4/5

Epoch 00004: val_acc improved from 0.97025 to 0.97075, saving model to C:\Users\User\Downloads\dump\weights_base.LSTM.hdf5
Epoch 5/5

Epoch 00005: val_acc improved from 0.97075 to 0.97311, saving model to C:\Users\User\Downloads\dump\weights_base.LSTM.hdf5


<keras.callbacks.History at 0x19efd09a9e8>
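
To see what the two callbacks are doing with those numbers, here is a rough simulation in plain Python. It replays the `val_acc` values from the log above through the same kind of logic: remember the best epoch, and stop if there has been no improvement for `patience` epochs. `replay_callbacks` is a hypothetical helper, not Keras code.

```python
def replay_callbacks(val_accs, patience):
    """Return (best_epoch, best_acc, stopped_epoch) for a list of val_acc values."""
    best_acc = float("-inf")
    best_epoch = None
    stopped_epoch = None
    wait = 0
    for epoch, acc in enumerate(val_accs, start=1):
        if acc > best_acc:
            # this is the point where save_best_only would overwrite the file
            best_acc, best_epoch, wait = acc, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                stopped_epoch = epoch
                break
    return best_epoch, best_acc, stopped_epoch

logged = [0.96011, 0.96440, 0.97025, 0.97075, 0.97311]  # copied from the log
result = replay_callbacks(logged, patience=20)
print(result)
```

With a patience of 20 and only 5 epochs, early stopping can never kick in here, and since every epoch improved, the weights left on disk are the ones from epoch 5.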

97%, and it looks like it could keep improving... but let's stop at 5 epochs, and you can try going further yourself. Note that an LSTM takes considerably longer to train than the CNN. Now let's confirm the accuracy by applying the model to the actual validation dataset...

In [18]:
# Load the best checkpointed weights. val_acc improved at every epoch, so these are the epoch 5 weights.
model.load_weights(file_path) 
loss, accuracy = model.evaluate(X_val, y_val, verbose=1)
print(accuracy)

0.9743870398629136


97.4% accuracy! This is our best model so far! Now let's see what happens when we use a hybrid model of CNN and LSTM...
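
One thing to keep in mind when reading these numbers: with a sigmoid output and `binary_crossentropy`, my understanding is that Keras resolves `'acc'` to binary accuracy, which scores each of the 7 outputs separately rather than checking whether the top-scoring class is the right one. The toy sketch below (plain Python, made-up predictions, hypothetical helper names) shows how far apart the two measures can be.

```python
def binary_accuracy(y_true, y_prob):
    """Share of individual outputs on the right side of 0.5."""
    hits = 0
    total = 0
    for true_row, prob_row in zip(y_true, y_prob):
        for t, p in zip(true_row, prob_row):
            hits += int((p > 0.5) == (t == 1))
            total += 1
    return hits / total

def categorical_accuracy(y_true, y_prob):
    """Share of rows where the highest-scoring class is the true class."""
    hits = 0
    for true_row, prob_row in zip(y_true, y_prob):
        hits += int(true_row.index(max(true_row)) == prob_row.index(max(prob_row)))
    return hits / len(y_true)

# Two made-up documents with 7 classes each
y_true = [[1, 0, 0, 0, 0, 0, 0],
          [0, 0, 0, 1, 0, 0, 0]]
y_prob = [[0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1],   # right class
          [0.1, 0.1, 0.1, 0.2, 0.1, 0.6, 0.1]]   # wrong class

bin_acc = binary_accuracy(y_true, y_prob)
cat_acc = categorical_accuracy(y_true, y_prob)
print(round(bin_acc, 3), cat_acc)
```

Here one of the two toy predictions picks the wrong class, yet binary accuracy still credits the five outputs that are correctly near zero. So if you want a strict per-document accuracy, it may be worth also comparing the argmax of the predictions against the true class.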

# CNN and LSTM layers = 97%
Start with the standard embedding layer, followed by a convolution layer and then an LSTM. A hybrid model...

In [20]:
# Use a sequential setup 
model = Sequential()
e = Embedding(vocab_size, 100, input_length=seq_len)

# Use a convolution kernel first, then an LSTM 
model.add(e)
model.add(Dropout(0.2))
model.add(Convolution1D(64, 5, padding='same', activation='relu'))
model.add(Dropout(0.2))
model.add(MaxPooling1D())
model.add(LSTM(100, dropout=0.3, recurrent_dropout=0.3))
model.add(Dense(100, activation='relu'))
model.add(Dropout(0.7))
model.add(Dense(7, activation='sigmoid'))

# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.summary() # summarize the model (summary() prints the table itself)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 500, 100)          2331400   
_________________________________________________________________
dropout_5 (Dropout)          (None, 500, 100)          0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 500, 64)           32064     
_________________________________________________________________
dropout_6 (Dropout)          (None, 500, 64)           0         
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 250, 64)           0         
_________________________________________________________________
lstm_3 (LSTM)                (None, 100)               66000     
_________________________________________________________________
dense_4 (Dense)              (None, 100)               10100     
__________
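
The shapes and parameter counts for the convolution part can also be checked by hand. A minimal sketch, assuming `padding='same'` keeps the sequence length at 500 and `MaxPooling1D()` uses its default pool size of 2. The helper names are my own.

```python
def conv1d_params(kernel_size, in_channels, filters):
    # each filter spans kernel_size steps across all input channels, plus a bias
    return kernel_size * in_channels * filters + filters

def pooled_length(length, pool_size=2):
    # non-overlapping windows with no padding
    return (length - pool_size) // pool_size + 1

conv = conv1d_params(5, 100, 64)
steps_after_pool = pooled_length(500)
print(conv, steps_after_pool)
```

So the LSTM in the hybrid only has to walk through 250 steps of 64 features instead of 500 steps of 100, which should make it cheaper to run than an LSTM over the full embedded sequence.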

Set up the checkpoint...

In [21]:
# setup checkpoint 
file_path="C:\\Users\\User\\Downloads\\dump\\weights_base.CNN_LSTM.hdf5"
checkpoint = ModelCheckpoint(file_path, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
early = EarlyStopping(monitor="val_acc", mode="max", patience=20)
callbacks_list = [checkpoint, early] #early

# fit the model
model.fit(X_train, y_train, batch_size=64, epochs=5, validation_split=0.2, callbacks=callbacks_list, verbose=1)

Train on 64342 samples, validate on 16086 samples
Epoch 1/5

Epoch 00001: val_acc improved from -inf to 0.94803, saving model to C:\Users\User\Downloads\dump\weights_base.CNN_LSTM.hdf5
Epoch 2/5

Epoch 00002: val_acc improved from 0.94803 to 0.96330, saving model to C:\Users\User\Downloads\dump\weights_base.CNN_LSTM.hdf5
Epoch 3/5

Epoch 00003: val_acc improved from 0.96330 to 0.96621, saving model to C:\Users\User\Downloads\dump\weights_base.CNN_LSTM.hdf5
Epoch 4/5

Epoch 00004: val_acc improved from 0.96621 to 0.96835, saving model to C:\Users\User\Downloads\dump\weights_base.CNN_LSTM.hdf5
Epoch 5/5

Epoch 00005: val_acc improved from 0.96835 to 0.97027, saving model to C:\Users\User\Downloads\dump\weights_base.CNN_LSTM.hdf5


<keras.callbacks.History at 0x19efe143128>

In [22]:
# Load the best checkpointed weights (epoch 5 here, since val_acc improved at every epoch). 
model.load_weights(file_path) 
loss, accuracy = model.evaluate(X_val, y_val, verbose=1) 
print(accuracy)

0.9728026601079742


97.3% accuracy! So slightly lower, but generally about the same as using just the one LSTM layer... 

# Double LSTM layers = 97%
Now let's try a double LSTM layer... this might take a while to finish...

In [10]:
# Use a sequential setup 
model = Sequential()
e = Embedding(vocab_size, 100, input_length=seq_len)

# After the embedding layer, use an LSTM and then another LSTM. The first LSTM returns its output at every timestep (the full sequence)
model.add(e)
model.add(SpatialDropout1D(0.3))
model.add(LSTM(100, dropout=0.3, recurrent_dropout=0.3,return_sequences = True))
model.add(Dropout(0.7))
model.add(LSTM(100, dropout=0.3, recurrent_dropout=0.3))
model.add(Dropout(0.7))
model.add(Dense(7, activation='sigmoid'))

# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.summary() # summarize the model (summary() prints the table itself)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 500, 100)          2331400   
_________________________________________________________________
spatial_dropout1d_4 (Spatial (None, 500, 100)          0         
_________________________________________________________________
lstm_4 (LSTM)                (None, 500, 100)          80400     
_________________________________________________________________
dropout_8 (Dropout)          (None, 500, 100)          0         
_________________________________________________________________
lstm_5 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dropout_9 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 7)                 707       
Total para
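
The key detail when stacking is `return_sequences=True`: the first LSTM has to hand over one output per timestep so the second LSTM still has a sequence to read. The toy loop below is not an LSTM (the update rule is a made-up stand-in for the cell), it just shows the difference between returning every step and returning only the last one.

```python
def toy_recurrent(sequence, return_sequences=False):
    """Walk the sequence while carrying a state, like a (very) simplified RNN."""
    state = 0.0
    outputs = []
    for x in sequence:
        state = 0.5 * state + x  # stand-in for the real LSTM cell update
        outputs.append(state)
    return outputs if return_sequences else outputs[-1]

all_steps = toy_recurrent([1, 2, 3], return_sequences=True)
last_step = toy_recurrent([1, 2, 3])
print(all_steps)
print(last_step)
```

This mirrors the summary above: the first LSTM outputs `(None, 500, 100)`, one vector per timestep, while the second outputs `(None, 100)`, the final vector only.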

In [7]:
# setup checkpoint 
file_path="C:\\Users\\User\\Downloads\\dump\\weights_base.LSTM_LSTM.hdf5"
checkpoint = ModelCheckpoint(file_path, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
early = EarlyStopping(monitor="val_acc", mode="max", patience=20)
callbacks_list = [checkpoint, early] #early

# fit the model
model.fit(X_train, y_train, batch_size=64, epochs=5, validation_split=0.2, callbacks=callbacks_list, verbose=1)

Train on 64342 samples, validate on 16086 samples
Epoch 1/5

Epoch 00001: val_acc improved from -inf to 0.94505, saving model to C:\Users\User\Downloads\dump\weights_base.LSTM_LSTM.hdf5
Epoch 2/5

Epoch 00002: val_acc improved from 0.94505 to 0.95187, saving model to C:\Users\User\Downloads\dump\weights_base.LSTM_LSTM.hdf5
Epoch 3/5

Epoch 00003: val_acc improved from 0.95187 to 0.96346, saving model to C:\Users\User\Downloads\dump\weights_base.LSTM_LSTM.hdf5
Epoch 4/5

Epoch 00004: val_acc improved from 0.96346 to 0.96561, saving model to C:\Users\User\Downloads\dump\weights_base.LSTM_LSTM.hdf5
Epoch 5/5

Epoch 00005: val_acc improved from 0.96561 to 0.96843, saving model to C:\Users\User\Downloads\dump\weights_base.LSTM_LSTM.hdf5


<keras.callbacks.History at 0x1d6054043c8>

In [8]:
# Load the best checkpointed weights (epoch 5 here, since val_acc improved at every epoch). 
model.load_weights(file_path) 
loss, accuracy = model.evaluate(X_val, y_val, verbose=1) 
print(accuracy)

0.970770674851732


97% accuracy! Not as good as the previous models, but only marginally different. At the end of the day... all equally good