## Problem
We want to build a text classification model using CNN, RNN, and LSTM.

## Step 2-1 Understanding/defining business problem
Email classification (spam or ham). We need to classify spam or ham email
based on email content.

## Step 2-2 Identifying potential data sources, collection,and understanding
Using the same data used in Recipe 4-6 from Chapter 4:

Please download data from the below link and save it in your working

directory:

https://www.kaggle.com/uciml/sms-spam-collection-dataset#spam.csv

In [1]:
import pandas as pd

#read file
file_content = pd.read_csv('spam.csv', encoding = "ISO-8859-1")
#check sample content in the email
file_content['v2'][1]

'Ok lar... Joking wif u oni...'

## Step 2-3 Text preprocessing
Let’s preprocess the data:


In [34]:
#Import library
from nltk.corpus import stopwords
from nltk import *
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import WordNetLemmatizer
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Remove stop words
stop = stopwords.words('english')
file_content['v2'] = file_content['v2'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))

# Delete unwanted columns
Email_Data = file_content[['v1', 'v2']]

# Rename column names
Email_Data = Email_Data.rename(columns={"v1":"Target", "v2":"Email"})
Email_Data.head()

Unnamed: 0,Target,Email
0,ham,"Go jurong point, crazy.. Available bugis n gre..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry 2 wkly comp win FA Cup final tkts 2...
3,ham,U dun say early hor... U c already say...
4,ham,"Nah I think goes usf, lives around though"


In [3]:
#Delete punctuations, convert text in lower case and delete the double space
Email_Data['Email'] = Email_Data['Email'].apply(lambda x:re.sub('[!@#$:).;,?&]', ' ', x.lower()))
Email_Data['Email'] = Email_Data['Email'].apply(lambda x:re.sub(' ', ' ', x))
Email_Data['Email'].head(5)

0    go jurong point  crazy   available bugis n gre...
1                        ok lar    joking wif u oni   
2    free entry 2 wkly comp win fa cup final tkts 2...
3            u dun say early hor    u c already say   
4            nah i think goes usf  lives around though
Name: Email, dtype: object

In [4]:
#Separating text(input) and target classes
list_sentences_rawdata = Email_Data["Email"].fillna("_na_").values
list_classes = ["Target"]
target = Email_Data[list_classes].values
To_Process=Email_Data[['Email', 'Target']]

Step 2-4 Data preparation for model building
Now we prepare the data:


In [12]:
# Import Libraries
import sys, os, re, csv, codecs, numpy as np, pandas as pd
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers import Dense, Input, LSTM, Embedding,Dropout, Activation
from keras.layers import Bidirectional, GlobalMaxPool1D,Conv1D, SimpleRNN
from keras.models import Model
from keras.models import Sequential
from keras import initializers, regularizers, constraints,optimizers, layers
from keras.layers import Dense, Input, Flatten, Dropout,BatchNormalization
from keras.layers import Conv1D, MaxPooling1D, Embedding
from keras.models import Sequential

#Train and test split with 80:20 ratio
train, test = train_test_split(To_Process, test_size=0.2)

# Define the sequence lengths, max number of words and embedding dimensions
# Sequence length of each sentence. If more, truncate. If less,pad with zeros
MAX_SEQUENCE_LENGTH = 300

# Top 20000 frequently occurring words
MAX_NB_WORDS = 20000

# Get the frequently occurring words
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(train.Email)
train_sequences = tokenizer.texts_to_sequences(train.Email)
test_sequences = tokenizer.texts_to_sequences(test.Email)

# dictionary containing words and their index
word_index = tokenizer.word_index

# print(tokenizer.word_index)
# total words in the corpus
print('Found %s unique tokens.' % len(word_index))
# get only the top frequent words on train

train_data = pad_sequences(train_sequences, maxlen=MAX_SEQUENCE_LENGTH)
# get only the top frequent words on test
test_data = pad_sequences(test_sequences, maxlen=MAX_SEQUENCE_LENGTH)
print(train_data.shape)
print(test_data.shape)

Found 7940 unique tokens.
(4457, 300)
(1115, 300)


In [51]:
# train.Email

In [52]:
# tokenizer.word_index

In [13]:
train_labels = train['Target']
test_labels = test['Target']

In [14]:
from sklearn.preprocessing import LabelEncoder

# converts the character array to numeric array. Assigns levels to unique labels.
le = LabelEncoder()
le.fit(train_labels)
train_labels = le.transform(train_labels)
test_labels = le.transform(test_labels)
print(le.classes_)
print(np.unique(train_labels, return_counts=True))
print(np.unique(test_labels, return_counts=True))

['ham' 'spam']
(array([0, 1], dtype=int64), array([3876,  581], dtype=int64))
(array([0, 1], dtype=int64), array([949, 166], dtype=int64))


In [54]:
# train_labels

array([0, 0, 0, ..., 0, 1, 0], dtype=int64)

In [15]:
# changing data types
labels_train = to_categorical(np.asarray(train_labels))
labels_test = to_categorical(np.asarray(test_labels))
print('Shape of data tensor:', train_data.shape)
print('Shape of label tensor:', labels_train.shape)
print('Shape of label tensor:', labels_test.shape)

Shape of data tensor: (4457, 300)
Shape of label tensor: (4457, 2)
Shape of label tensor: (1115, 2)


In [55]:
labels_train

array([[1., 0.],
       [1., 0.],
       [1., 0.],
       ...,
       [1., 0.],
       [0., 1.],
       [1., 0.]], dtype=float32)

In [16]:
EMBEDDING_DIM = 100
print(MAX_SEQUENCE_LENGTH)

300


## Step 2-5 Model building and predicting
We are building the models using different deep learning approaches
like CNN, RNN, LSTM, and Bidirectional LSTM and comparing the
performance of each model using different accuracy metrics.

We can now define our CNN model.

Here we define a single hidden layer with 128 memory units. The
network uses a dropout with a probability of 0.5. The output layer is a
dense layer using the softmax activation function to output a probability
prediction.


In [18]:
# Import Libraries
# import sys, os, re, csv, codecs, numpy as np, pandas as pd
# from keras.preprocessing.text import Tokenizer
# from keras.preprocessing.sequence import pad_sequences
# from keras.utils import to_categorical
# from keras.layers import Dense, Input, LSTM, Embedding,
# Dropout, Activation
# from keras.layers import Bidirectional, GlobalMaxPool1D,
# Conv1D, SimpleRNN
# from keras.models import Model
# from keras.models import Sequential
# from keras import initializers, regularizers, constraints,
# optimizers, layers
# from keras.layers import Dense, Input, Flatten, Dropout,
# BatchNormalization
# from keras.layers import Conv1D, MaxPooling1D, Embedding
# from keras.models import Sequential


print('Training CNN 1D model.')
model = Sequential()
model.add(Embedding(MAX_NB_WORDS,
                    EMBEDDING_DIM,
                    input_length=MAX_SEQUENCE_LENGTH))
model.add(Dropout(0.5))
model.add(Conv1D(128, 5, activation='relu'))
model.add(MaxPooling1D(5))
model.add(Dropout(0.5))
model.add(BatchNormalization())
model.add(Conv1D(128, 5, activation='relu'))
model.add(MaxPooling1D(5))
model.add(Dropout(0.5))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])


Training CNN 1D model.


In [58]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 300, 100)          2000000   
_________________________________________________________________
bidirectional_1 (Bidirection (None, 300, 32)           14976     
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 298, 16)           1552      
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 16)                0         
_________________________________________________________________
dense_5 (Dense)              (None, 50)                850       
_________________________________________________________________
dropout_5 (Dropout)          (None, 50)                0         
_________________________________________________________________
dense_6 (Dense)              (None, 2)                 102       
Total para

We are now fitting our model to the data. Here we have 5 epochs and a
batch size of 64 patterns.

In [60]:
# model.fit(train_data, labels_train,
#             batch_size=64,
#             epochs=5,
#             validation_data=(test_data, labels_test))

model.fit(train_data, labels_train,
            batch_size=64,
            epochs=5,
            validation_split=0.2)

Train on 3565 samples, validate on 892 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1b4d09f1048>

In [20]:
predicted=model.predict(test_data)
predicted

array([[5.3949499e-01, 4.6050492e-01],
       [5.3976852e-01, 4.6023145e-01],
       [5.4747850e-01, 4.5252153e-01],
       ...,
       [5.4092926e-01, 4.5907077e-01],
       [3.6320067e-04, 9.9963677e-01],
       [5.3984576e-01, 4.6015424e-01]], dtype=float32)

In [61]:
# [0.07058866604882806, 0.9874439467229116]
model.evaluate(test_data, labels_test)



[0.08230450725331151, 0.9874439467229116]

In [22]:
#model evaluation
import sklearn
from sklearn.metrics import precision_recall_fscore_support as score
precision, recall, fscore, support = score(labels_test,predicted.round())
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))
print("############################")
print(sklearn.metrics.classification_report(labels_test,predicted.round()))

precision: [0.98950682 0.96296296]
recall: [0.99367756 0.93975904]
fscore: [0.9915878  0.95121951]
support: [949 166]
############################
             precision    recall  f1-score   support

          0       0.99      0.99      0.99       949
          1       0.96      0.94      0.95       166

avg / total       0.99      0.99      0.99      1115



## We can now define our RNN model.

In [23]:
#import library
from keras.layers.recurrent import SimpleRNN

#model training
print('Training SIMPLERNN model.')
model = Sequential()
model.add(Embedding(MAX_NB_WORDS,
          EMBEDDING_DIM,
          input_length=MAX_SEQUENCE_LENGTH))
model.add(SimpleRNN(2, input_shape=(None,1)))
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'binary_crossentropy',
              optimizer='adam',
              metrics = ['accuracy'])

model.fit(train_data, labels_train,
          batch_size=16,
          epochs=5,
          validation_data=(test_data, labels_test))

Training SIMPLERNN model.
Train on 4457 samples, validate on 1115 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1b4da152a58>

In [24]:
# prediction on test data
predicted_Srnn=model.predict(test_data)
predicted_Srnn

array([[0.97940284, 0.02059711],
       [0.9901631 , 0.00983696],
       [0.86733055, 0.13266945],
       ...,
       [0.9935961 , 0.00640392],
       [0.01116436, 0.98883563],
       [0.9942239 , 0.00577611]], dtype=float32)

In [25]:
#model evaluation
from sklearn.metrics import precision_recall_fscore_support as score
precision, recall, fscore, support = score(labels_test,predicted_Srnn.round())
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))
print("############################")
print(sklearn.metrics.classification_report(labels_test,predicted_Srnn.round()))

precision: [0.96817248 0.95744681]
recall: [0.99367756 0.81325301]
fscore: [0.98075923 0.87947883]
support: [949 166]
############################
             precision    recall  f1-score   support

          0       0.97      0.99      0.98       949
          1       0.96      0.81      0.88       166

avg / total       0.97      0.97      0.97      1115



## And here is our Long Short-Term Memory (LSTM):

In [26]:
#model training
print('Training LSTM model.')
model = Sequential()
model.add(Embedding(MAX_NB_WORDS,
          EMBEDDING_DIM,
          input_length=MAX_SEQUENCE_LENGTH))
model.add(LSTM(output_dim=16,
               activation='relu', 
               inner_activation='hard_sigmoid',
               return_sequences=True))
model.add(Dropout(0.2))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'binary_crossentropy',
              optimizer='adam',
              metrics = ['accuracy'])

model.fit(train_data, labels_train,
          batch_size=16,
          epochs=5,
          validation_data=(test_data, labels_test))


Training LSTM model.


  # Remove the CWD from sys.path while we load stuff.


Train on 4457 samples, validate on 1115 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1b4dcabc128>

In [27]:
#prediction on text data
predicted_lstm=model.predict(test_data)
predicted_lstm

array([[9.99725163e-01, 2.74797145e-04],
       [9.98966575e-01, 1.03342067e-03],
       [9.99999881e-01, 1.02845455e-07],
       ...,
       [9.99988198e-01, 1.17637801e-05],
       [5.07229725e-09, 1.00000000e+00],
       [9.99894619e-01, 1.05330182e-04]], dtype=float32)

In [28]:
#model evaluation
from sklearn.metrics import precision_recall_fscore_support as score
precision, recall, fscore, support = score(labels_test,predicted_lstm.round())
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))
print("############################")
print(sklearn.metrics.classification_report(labels_test,predicted_lstm.round()))

precision: [0.98748697 0.98717949]
recall: [0.99789252 0.92771084]
fscore: [0.99266247 0.95652174]
support: [949 166]
############################
             precision    recall  f1-score   support

          0       0.99      1.00      0.99       949
          1       0.99      0.93      0.96       166

avg / total       0.99      0.99      0.99      1115



##  Finally, let’s see what is Bidirectional LSTM and implement the same.

As we know, LSTM preserves information from inputs using the
hidden state. In bidirectional LSTMs, inputs are fed in two ways: one
from previous to future and the other going backward from future to
past, helping in learning future representation as well. Bidirectional
LSTMs are known for producing very good results as they are capable of
understanding the context better.

In [29]:
#model training
print('Training Bidirectional LSTM model.')
model = Sequential()
model.add(Embedding(MAX_NB_WORDS,
          EMBEDDING_DIM,
          input_length=MAX_SEQUENCE_LENGTH))
model.add(Bidirectional(LSTM(16, return_sequences=True,dropout=0.1, recurrent_dropout=0.1)))
model.add(Conv1D(16, kernel_size = 3, padding = "valid",kernel_initializer = "glorot_uniform"))
model.add(GlobalMaxPool1D())
model.add(Dense(50, activation="relu"))
model.add(Dropout(0.1))
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'binary_crossentropy',
              optimizer='adam',metrics = ['accuracy'])
model.fit(train_data, labels_train,
          batch_size=16,
          epochs=3,
          validation_data=(test_data, labels_test))

Training Bidirectional LSTM model.
Train on 4457 samples, validate on 1115 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x1b4e4f29d30>

In [30]:
# prediction on test data
predicted_blstm=model.predict(test_data)
predicted_blstm

array([[9.9998069e-01, 1.9336485e-05],
       [9.9972147e-01, 2.7848408e-04],
       [9.9995434e-01, 4.5624503e-05],
       ...,
       [9.9998879e-01, 1.1160270e-05],
       [1.5796587e-04, 9.9984205e-01],
       [9.9994087e-01, 5.9070258e-05]], dtype=float32)

In [31]:
#model evaluation
from sklearn.metrics import precision_recall_fscore_support as score
precision, recall, fscore, support = score(labels_test,predicted_blstm.round())
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))
print("############################")
print(sklearn.metrics.classification_report(labels_test,predicted_blstm.round()))

precision: [0.98647242 0.99350649]
recall: [0.99894626 0.92168675]
fscore: [0.99267016 0.95625   ]
support: [949 166]
############################
             precision    recall  f1-score   support

          0       0.99      1.00      0.99       949
          1       0.99      0.92      0.96       166

avg / total       0.99      0.99      0.99      1115

