# A picture spoken with 1000 words, reported by CNN - Part 3 
Eu Jin Lok

9 February 2018

# CNN training 
In this notebook we will go into the details of how to build a document classifier using a CNN (convolutional neural network), a deep learning architecture best known for image classification. For the full background on this topic, please check out my blog post at this link: 

xxxxxxxxxxx

This is part 3 of the code, which builds the CNN model with an embedding layer initialised from the pretrained GloVe vectors from part 2. More information on how CNNs can be applied to text data: 

http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
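
As a rough intuition for the idea in that post, here is a minimal pure-Python sketch (not how Keras implements it) of one convolution filter sliding over a sequence of word vectors, followed by max pooling. The toy vectors, the filter weights and the helper names `relu` and `conv1d_valid` are all made up for illustration.

```python
def relu(x):
    return max(0.0, x)

def conv1d_valid(sequence, kernel, bias=0.0):
    """Slide one filter over a sequence of word vectors (no padding).

    sequence: list of T vectors, each of dimension D
    kernel:   list of k vectors, each of dimension D (the filter weights)
    Returns T - k + 1 activations, one per window of k consecutive words.
    """
    k = len(kernel)
    outputs = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        total = bias
        for word_vec, weight_vec in zip(window, kernel):
            total += sum(x * w for x, w in zip(word_vec, weight_vec))
        outputs.append(relu(total))
    return outputs

# Toy 2-dimensional "embeddings" for a 4-word sentence (made-up numbers)
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# One filter spanning 2 consecutive words
kernel = [[1.0, -1.0], [0.5, 0.5]]

feature_map = conv1d_valid(sentence, kernel)
pooled = max(feature_map)  # max pooling over the whole sequence
print(feature_map, pooled)
```

Each activation summarises one window of consecutive words, which is why a filter of width 5 can be thought of as a learned detector for 5-word phrases.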
 
So without further ado, let's begin...

In [1]:
#import the key libraries 
import pandas as pd 
from pandas import crosstab
import numpy as np
import os 
import pickle
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Embedding, Dense, Flatten, MaxPooling1D, Convolution1D, Dropout
from keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.python.client import device_lib
os.chdir("C:\\Users\\User\\Dropbox\\Pet Project\\Blog\\CNN\\")

Using TensorFlow backend.


So, the first step after loading the necessary packages: we'll grab our training dataset again and run through the same data processing steps as before...

In [2]:
# import data 
train = pd.read_csv("happydb\\cleaned_hm.csv")  

# Let's map each category label to an integer code (the one-hot encoding itself happens later)
categories = train.predicted_category.unique()
dic = {category: i for i, category in enumerate(categories)}
labels = train.predicted_category.apply(lambda x: dic[x])

val=train.sample(frac=0.2,random_state=200)
train=train.drop(val.index)

NUM_WORDS=20000 # when converting text to sequences, keep only the most common words (those ranked below NUM_WORDS)
tokenizer = Tokenizer(num_words=NUM_WORDS,filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n\'',
                      lower=True)

# we need to fit the tokenizer on our text data in order to get the tokens
texts=train.cleaned_hm
tokenizer.fit_on_texts(texts)
word_index = tokenizer.word_index
vocab_size = len(tokenizer.word_index) + 1
print('Found %s unique tokens or words.' % len(word_index)) 

Found 23313 unique tokens or words.
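
To make the tokenizer step a bit more concrete, here is a minimal pure-Python sketch of the idea behind `word_index` and `num_words`: words are ranked by frequency (index 1 is the most common), and only words ranked below `num_words` survive the conversion to sequences. This is not the Keras implementation (it skips the punctuation filtering, for one), and `build_word_index` and `texts_to_sequences` here are hypothetical helpers written just for this illustration.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency: the most common word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count; ties keep first-seen order because sorted() is stable
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index, num_words=None):
    """Replace each word by its index, dropping unknown words and,
    if num_words is set, any word whose index is num_words or higher."""
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word)
            if idx is None:
                continue
            if num_words is not None and idx >= num_words:
                continue
            seq.append(idx)
        sequences.append(seq)
    return sequences

toy_texts = ["i went on a date", "i went home", "i was happy"]
toy_index = build_word_index(toy_texts)
full_sequences = texts_to_sequences(toy_texts, toy_index)
capped_sequences = texts_to_sequences(toy_texts, toy_index, num_words=3)
print(toy_index)
print(full_sequences)
print(capped_sequences)
```

This is also why the tokenizer above reports 23313 unique words even though NUM_WORDS is 20000: the full index is always built, and the cap only applies when texts are converted to sequences.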


Now we need to convert the words in each sentence of our documents to their index values.

In [3]:
sequences_train = tokenizer.texts_to_sequences(texts) # converts the text to numbers essentially
sequences_valid=tokenizer.texts_to_sequences(val.cleaned_hm)
word_index = tokenizer.word_index
#Although word_index contains all words tokenizer.texts_to_sequences takes num_words into account.

# Check the index is working correctly 
print(texts.iloc[0])
print(sequences_train[0])
print(word_index['date'],"= index for the word 'date' ") 
print('The word "date" has index 315, and it appears in the right position (6th) in the sequence')

I went on a successful date with someone I felt sympathy and connection with.
[1, 23, 16, 3, 758, 315, 13, 284, 1, 94, 9298, 5, 2393, 13]
315 = index for the word 'date' 
The word "date" has index 315, and it appears in the right position (6th) in the sequence


We will be feeding the data into a CNN, so we need to ensure every text has the same shape. Because each text varies in length, we'll cap it at a fixed length and pad shorter texts with zeros to fill in the gaps.

In [4]:
# set the sequence length of the text to speed up training and prevent overfitting. 
seq_len = 500
X_train = pad_sequences(sequences_train,maxlen=seq_len, value=0)
X_val = pad_sequences(sequences_valid,maxlen=seq_len, value=0)

# Let's check a single record to see how it looks
print(X_train[0]) # By default we pad the left side. In other words, all the text is right-aligned

[   0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0  
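
For anyone wondering what the padding actually does, here is a small pure-Python sketch of the default behaviour used above: zeros go on the left, and texts longer than the cap lose their earliest tokens. The helper `pad_pre` is a hypothetical stand-in written for illustration, not the Keras function itself.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad (or left-truncate) a list of token ids to exactly maxlen items."""
    if len(sequence) >= maxlen:
        # Too long: drop tokens from the start and keep the last maxlen tokens
        return sequence[len(sequence) - maxlen:]
    # Too short: fill the left side with the padding value
    return [value] * (maxlen - len(sequence)) + sequence

short = [1, 23, 16, 3, 758, 315]
padded = pad_pre(short, 10)
truncated = pad_pre(short, 4)
print(padded)
print(truncated)
```

With `seq_len = 500` and short texts like ours, almost every row ends up being mostly zeros, which is exactly what the printout above shows.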

So the last part of the processing is to one-hot encode (binarise) the target. That's the format that works well with Keras.

In [5]:
# map each category to its integer code, then one-hot encode with a fixed number of classes
y_train = to_categorical(train.predicted_category.map(dic).values, num_classes=len(dic))
y_val = to_categorical(val.predicted_category.map(dic).values, num_classes=len(dic))
print(y_train.shape)
print(y_val.shape)

(80428, 7)
(20107, 7)
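
If one-hot encoding is new to you, this is all it does. A minimal pure-Python sketch, where the category names and the helper `one_hot` are made up for illustration (the real data has 7 categories, hence the 7 columns above):

```python
def one_hot(codes, num_classes):
    """Turn integer class codes into rows with a single 1.0 at the code's position."""
    rows = []
    for code in codes:
        row = [0.0] * num_classes
        row[code] = 1.0
        rows.append(row)
    return rows

# Hypothetical label-to-code mapping, playing the role of `dic` above
toy_dic = {"cat_a": 0, "cat_b": 1, "cat_c": 2}
toy_labels = ["cat_b", "cat_a", "cat_c", "cat_b"]

codes = [toy_dic[label] for label in toy_labels]
encoded = one_hot(codes, num_classes=len(toy_dic))
for label, row in zip(toy_labels, encoded):
    print(label, row)
```

Passing the number of classes explicitly keeps the number of columns fixed even if a small sample happens to miss one of the categories.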


# CNN - No pretrained word embedding = 95% accuracy
And here we are. Let's start with a CNN without a pretrained embedding. 

WARNING: I'm using my desktop computer, which is set up for CUDA processing. The timings printed below from the Keras CNN 
training will vary depending on your hardware specification. I've printed my GPU specs below. A CPU will generally take around 10 times longer.

In [6]:
device = list(device_lib.list_local_devices())
print(device[1])

name: "/gpu:0"
device_type: "GPU"
memory_limit: 3228522905
locality {
  bus_id: 1
}
incarnation: 1158385534894545445
physical_device_desc: "device: 0, name: GeForce GTX 980, pci bus id: 0000:01:00.0"



In [35]:
# Without a pretrained embedding, the embedding matrix is randomly initialised and learned during training
EMBEDDING_DIM=100

# Use a sequential setup 
model = Sequential()
e = Embedding(vocab_size, EMBEDDING_DIM, input_length=seq_len)

# Use a single convolution layer 
model.add(e)
model.add(Dropout(0.2))
model.add(Convolution1D(64, 5, padding='same', activation='relu'))
model.add(Dropout(0.2))
model.add(MaxPooling1D())
model.add(Flatten())
model.add(Dense(100, activation='relu'))
model.add(Dropout(0.7))
model.add(Dense(7, activation='sigmoid'))  # 7 targets, each done as a logistic  

# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
print(model.summary()) # summarize the model

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_10 (Embedding)     (None, 500, 100)          2331400   
_________________________________________________________________
dropout_13 (Dropout)         (None, 500, 100)          0         
_________________________________________________________________
conv1d_5 (Conv1D)            (None, 500, 64)           32064     
_________________________________________________________________
dropout_14 (Dropout)         (None, 500, 64)           0         
_________________________________________________________________
max_pooling1d_5 (MaxPooling1 (None, 250, 64)           0         
_________________________________________________________________
flatten_5 (Flatten)          (None, 16000)             0         
_________________________________________________________________
dense_9 (Dense)              (None, 100)               1600100   
__________
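
The numbers in the summary can be checked by hand. Here is a short sketch of the arithmetic, using the layer settings from the model above. The count for the final 7-unit layer is derived from the code rather than read off the printed summary, which is cut short.

```python
vocab_size = 23314      # len(word_index) + 1, from the tokenizer output earlier
embedding_dim = 100
seq_len = 500
filters, kernel_width = 64, 5
dense_units, num_classes = 100, 7

# Embedding: one vector of size embedding_dim per vocabulary entry
embedding_params = vocab_size * embedding_dim

# Conv1D: each filter has kernel_width * embedding_dim weights plus one bias
conv_params = kernel_width * embedding_dim * filters + filters

# padding='same' keeps the length at seq_len; MaxPooling1D() defaults to pool_size=2
pooled_len = seq_len // 2
flat_size = pooled_len * filters

# Dense layers: inputs * units weights plus one bias per unit
dense_params = flat_size * dense_units + dense_units
output_params = dense_units * num_classes + num_classes

total_params = embedding_params + conv_params + dense_params + output_params
print(embedding_params, conv_params, flat_size, dense_params, output_params, total_params)
```

Most of the weights sit in the embedding layer and in the first dense layer, not in the convolution itself.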

Let's set up a checkpoint to make sure we save the best solution, plus an early stopping procedure, and run the model for just 5 epochs.
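
One detail worth understanding before running this: with `save_best_only=True`, which epoch ends up on disk depends on the `mode` setting. For a metric like accuracy, "best" means highest, so `mode='max'` is the setting that keeps the best epoch, while `mode='min'` keeps the epoch with the lowest value. The pure-Python sketch below mimics that logic. The accuracy values are made up and `best_epoch` is a hypothetical helper, not part of Keras.

```python
def best_epoch(history, mode):
    """Return the 1-based epoch whose weights a save-best-only checkpoint would keep."""
    best_value, best_idx = None, None
    for epoch, value in enumerate(history, start=1):
        improved = (
            best_value is None
            or (mode == "max" and value > best_value)
            or (mode == "min" and value < best_value)
        )
        if improved:
            # A real checkpoint would overwrite the saved weights here
            best_value, best_idx = value, epoch
    return best_idx

# Hypothetical validation accuracies for 5 epochs (made-up numbers)
val_acc = [0.95, 0.96, 0.97, 0.96, 0.94]
kept_with_max = best_epoch(val_acc, mode="max")
kept_with_min = best_epoch(val_acc, mode="min")
print(kept_with_max, kept_with_min)
```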

In [24]:
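# setup checkpoint 
file_path="C:\\Users\\User\\Downloads\\dump\\weights_base.CovNet.hdf5"
# accuracy should be maximised, so mode must be 'max' (mode='min' would keep the epoch with the lowest val_acc)
checkpoint = ModelCheckpoint(file_path, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
# note: with patience=20 and only 5 epochs, early stopping never actually triggers
early = EarlyStopping(monitor="val_acc", mode="max", patience=20)
callbacks_list = [checkpoint, early]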

# fit the model
model.fit(X_train, y_train, batch_size=64, epochs=5, validation_split=0.2, callbacks=callbacks_list, verbose=1) 

Train on 64342 samples, validate on 16086 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x2572378f198>

97%? Is that real? That's a massive jump in accuracy over our best baseline, which is at 86%. Note, though, that the accuracy here is based on Keras's internal validation split. We'll need to test it on our own validation set that we split off earlier on. Each epoch took about 2 mins to train, and it seems like running it for 1 epoch is enough. Now let's see if the model overfitted...

In [39]:
# Load the model from epoch 1, which is the best. If we use the latest model from Epoch 5, accuracy is terrible. Guess why?
model.load_weights(file_path) 
loss, accuracy = model.evaluate(X_val, y_val, verbose=1)
print(accuracy)

0.949498771726


95% accuracy! Truly impressive!
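
One thing worth double-checking before comparing this number with other models: as far as I know, with a sigmoid output and `binary_crossentropy`, the `acc` metric in this version of Keras is computed per output unit (7 yes/no decisions per document) rather than per document. That can make the figure look higher than the usual "did we pick the right category" accuracy. The sketch below shows the difference on two made-up predictions; both helper functions are hypothetical and written in plain Python for illustration.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of individual output units that match after thresholding."""
    hits, total = 0, 0
    for true_row, prob_row in zip(y_true, y_prob):
        for t, p in zip(true_row, prob_row):
            hits += int((p > threshold) == (t == 1))
            total += 1
    return hits / total

def argmax_accuracy(y_true, y_prob):
    """Fraction of documents whose highest-scoring class is the true class."""
    hits = 0
    for true_row, prob_row in zip(y_true, y_prob):
        hits += int(true_row.index(max(true_row)) == prob_row.index(max(prob_row)))
    return hits / len(y_true)

# Two made-up documents with 7 classes each; the second one is misclassified
y_true = [[1, 0, 0, 0, 0, 0, 0],
          [0, 1, 0, 0, 0, 0, 0]]
y_prob = [[0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
          [0.1, 0.2, 0.8, 0.0, 0.0, 0.0, 0.0]]

per_unit = binary_accuracy(y_true, y_prob)
per_document = argmax_accuracy(y_true, y_prob)
print(per_unit, per_document)
```

On the real model, the equivalent per-document check would be something along the lines of `(np.argmax(model.predict(X_val), axis=1) == np.argmax(y_val, axis=1)).mean()`.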

# CNN - With pretrained word embedding = 96% accuracy
Now let's see what happens when we initialise the embedding layer with our pretrained word embeddings.

In [6]:
# Get the embedding matrix we built in part 2. 
with open("C:\\Users\\User\\Downloads\\dump\\embedding matrix.pickle","rb") as f:
    embedding_matrix = pickle.load(f)

# CNN initialised with the embedding matrix weights 
model = Sequential()
e = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=seq_len, trainable=True) # trainable=True: the pretrained weights are fine-tuned during training

# Use a single convolution layer 
model.add(e)
model.add(Dropout(0.2))
model.add(Convolution1D(64, 5, padding='same', activation='relu'))
model.add(Dropout(0.2))
model.add(MaxPooling1D())
model.add(Flatten())
model.add(Dense(100, activation='relu'))
model.add(Dropout(0.7))
model.add(Dense(7, activation='sigmoid'))

# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
print(model.summary()) # summarize the model

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 500, 100)          2331400   
_________________________________________________________________
dropout_1 (Dropout)          (None, 500, 100)          0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 500, 64)           32064     
_________________________________________________________________
dropout_2 (Dropout)          (None, 500, 64)           0         
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 250, 64)           0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 16000)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 100)               1600100   
__________
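
For the pretrained weights to line up with the tokenizer, row i of `embedding_matrix` has to hold the vector for the word whose index is i, and the matrix needs `vocab_size` rows and 100 columns to match the `Embedding` layer above. The matrix itself was built in part 2, so the sketch below is only an assumption about the general layout, using made-up 2-dimensional vectors and a hypothetical helper `build_embedding_matrix`.

```python
def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector for the word with index i; unknown words stay all-zero."""
    # One extra row because index 0 is reserved for padding
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = list(vectors[word])
    return matrix

toy_word_index = {"i": 1, "went": 2, "date": 3}
toy_vectors = {"i": [0.1, 0.2], "date": [0.5, -0.3]}  # made-up 2-d vectors

matrix = build_embedding_matrix(toy_word_index, toy_vectors, dim=2)
for row in matrix:
    print(row)
```

Words missing from the pretrained vectors ("went" in this toy example) start from zeros and can only pick up a useful representation because the layer is trainable.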

Same deal as before...

In [31]:
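# setup checkpoint 
file_path="C:\\Users\\User\\Downloads\\dump\\weights_base.CovNet_GloVe.hdf5"
# accuracy should be maximised, so mode must be 'max' (mode='min' would keep the epoch with the lowest val_acc)
checkpoint = ModelCheckpoint(file_path, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
# note: with patience=20 and only 5 epochs, early stopping never actually triggers
early = EarlyStopping(monitor="val_acc", mode="max", patience=20)
callbacks_list = [checkpoint, early]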

# fit the model
model.fit(X_train, y_train, batch_size=64, epochs=5, validation_split=0.2, callbacks=callbacks_list, verbose=1)

Train on 64342 samples, validate on 16086 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x25723f50fd0>

In [18]:
# Load the model from epoch 1, which is the best. 
model.load_weights(file_path) 
loss, accuracy = model.evaluate(X_val, y_val, verbose=1) 
print(accuracy)

0.958152471437


96% accuracy! Pretty much the same as our CNN model without pretrained word embeddings. In all honesty, this turned out to be a bit of a surprise for me. I expected the CNN with pretrained word embeddings to outperform the one without, but I suppose that with accuracy already this high, further gains become really hard to come by. 

At the end of the day, the winner here is the CNN. Now you know why deep learning is the talk of the town.