# LSTM Classification with CR Dataset
<hr>

We will build a text classification model using LSTM model on the MPQA Dataset. Since there is no standard train/test split for this dataset, we will use 10-Fold Cross Validation (CV). 

## Load the library

In [1]:
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
import random
# from nltk.tokenize import TweetTokenizer
from sklearn.model_selection import KFold

%config IPCompleter.greedy=True
%config IPCompleter.use_jedi=False
# nltk.download('twitter_samples')

In [2]:
tf.config.list_physical_devices('GPU') 

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

## Load the Dataset

In [3]:
corpus = pd.read_pickle('../../../0_data/MPQA/MPQA.pkl')
corpus.label = corpus.label.astype(int)
print(corpus.shape)
corpus

(3775, 3)


Unnamed: 0,sentence,label,split
0,weaknesses are minor the feel and layout of th...,0,train
1,many of our disney movies do n 't play on this...,0,train
2,player has a problem with dual layer dvd 's su...,0,train
3,i know the saying is you get what you pay for ...,0,train
4,will never purchase apex again .,0,train
...,...,...,...
3770,"so far , the anti spam feature seems to be ver...",1,train
3771,i downloaded a trial version of computer assoc...,1,train
3772,i did not have any of the installation problem...,1,train
3773,their products have been great and have saved ...,1,train


In [4]:
corpus.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3775 entries, 0 to 3774
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   sentence  3775 non-null   object
 1   label     3775 non-null   int32 
 2   split     3775 non-null   object
dtypes: int32(1), object(2)
memory usage: 73.9+ KB


In [5]:
corpus.groupby( by='label').count()

Unnamed: 0_level_0,sentence,split
label,Unnamed: 1_level_1,Unnamed: 2_level_1
0,1368,1368
1,2407,2407


In [6]:
# Separate the sentences and the labels
sentences, labels = list(corpus.sentence), list(corpus.label)

In [7]:
sentences[0]

"weaknesses are minor the feel and layout of the remote control are only so so . it does n 't show the complete file names of mp3s with really long names . you must cycle through every zoom setting ( 2x , 3x , 4x , 1 2x , etc . ) before getting back to normal size sorry if i 'm just ignorant of a way to get back to 1x quickly ."

<!--## Split Dataset-->

# Data Preprocessing
<hr>

Preparing data for word embedding, especially for pre-trained word embedding like Word2Vec or GloVe, __don't use standard preprocessing steps like stemming or stopword removal__. Compared to our approach on cleaning the text when doing word count based feature extraction (e.g. TFIDF) such as removing stopwords, stemming etc, now we will keep these words as we do not want to lose such information that might help the model learn better.

__Tomas Mikolov__, one of the developers of Word2Vec, in _word2vec-toolkit: google groups thread., 2015_, suggests only very minimal text cleaning is required when learning a word embedding model. Sometimes, it's good to disconnect
In short, what we will do is:
- Puntuations removal
- Lower the letter case
- Tokenization

The process above will be handled by __Tokenizer__ class in TensorFlow

- <b>One way to choose the maximum sequence length is to just pick the length of the longest sentence in the training set.</b>

In [8]:
# Define a function to compute the max length of sequence
def max_length(sequences):
    '''
    input:
        sequences: a 2D list of integer sequences
    output:
        max_length: the max length of the sequences
    '''
    max_length = 0
    for i, seq in enumerate(sequences):
        length = len(seq)
        if max_length < length:
            max_length = length
    return max_length

In [9]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

trunc_type='post'
padding_type='post'
oov_tok = "<UNK>"

print("Example of sentence: ", sentences[4])

# Cleaning and Tokenization
tokenizer = Tokenizer(oov_token=oov_tok)
tokenizer.fit_on_texts(sentences)

# Turn the text into sequence
training_sequences = tokenizer.texts_to_sequences(sentences)
max_len = max_length(training_sequences)

print('Into a sequence of int:', training_sequences[4])

# Pad the sequence to have the same size
training_padded = pad_sequences(training_sequences, maxlen=max_len, padding=padding_type, truncating=trunc_type)
print('Into a padded sequence:', training_padded[4])

Example of sentence:  will never purchase apex again .
Into a sequence of int: [72, 194, 285, 207, 286]
Into a padded sequence: [ 72 194 285 207 286   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0]


In [10]:
word_index = tokenizer.word_index
# See the first 10 words in the vocabulary
for i, word in enumerate(word_index):
    print(word, word_index.get(word))
    if i==9:
        break
vocab_size = len(word_index)+1
print(vocab_size)

<UNK> 1
the 2
and 3
i 4
it 5
to 6
a 7
is 8
of 9
this 10
5336


# Model 1: Embedding Random
<hr>

<img src="model.png" style="width:700px;height:400px;"> <br>

## LSTM Model

In [22]:
from tensorflow.keras import regularizers
from tensorflow.keras.constraints import MaxNorm

def define_model(input_dim = None, output_dim=300, max_length = None ):
    
    model = tf.keras.models.Sequential([
        tf.keras.layers.Embedding(input_dim=input_dim, 
                                  mask_zero= True,
                                  output_dim=output_dim, 
                                  input_length=max_length, 
                                  input_shape=(max_length, )),
        
        tf.keras.layers.LSTM(units=128, return_sequences=True),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.LSTM(units=128, return_sequences=False),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dropout(0.5),
        # Propagate X through a Dense layer with 1 unit
        tf.keras.layers.Dense(units=1, activation='sigmoid')
    ])
    
    model.compile( loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
#     model.summary()
    return model

In [23]:
model_0 = define_model( input_dim=1000, max_length=100)
model_0.summary()

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 100, 300)          300000    
_________________________________________________________________
lstm_10 (LSTM)               (None, 100, 128)          219648    
_________________________________________________________________
dropout_13 (Dropout)         (None, 100, 128)          0         
_________________________________________________________________
lstm_11 (LSTM)               (None, 128)               131584    
_________________________________________________________________
dropout_14 (Dropout)         (None, 128)               0         
_________________________________________________________________
dense_8 (Dense)              (None, 64)                8256      
_________________________________________________________________
dropout_15 (Dropout)         (None, 64)               

In [20]:
class myCallback(tf.keras.callbacks.Callback):
    # Overide the method on_epoch_end() for our benefit
    def on_epoch_end(self, epoch, logs={}):
        if (logs.get('accuracy') > 0.93):
            print("\nReached 93% accuracy so cancelling training!")
            self.model.stop_training=True


callbacks = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', min_delta=0, 
                                             patience=5, verbose=2, 
                                             mode='auto', restore_best_weights=True)

## Train and Test the Model

In [24]:
# Parameter Initialization
trunc_type='post'
padding_type='post'
oov_tok = "<UNK>"

columns = ['acc1', 'acc2', 'acc3', 'acc4', 'acc5', 'acc6', 'acc7', 'acc8', 'acc9', 'acc10', 'AVG']
record = pd.DataFrame(columns = columns)

# prepare cross validation with 10 splits and shuffle = True
kfold = KFold(10, True)

# Separate the sentences and the labels
sentences, labels = list(corpus.sentence), list(corpus.label)

exp=0

# kfold.split() will return set indices for each split
acc_list = []
for train, test in kfold.split(sentences):
    
    exp+=1
    print('Training {}: '.format(exp))
    
    train_x, test_x = [], []
    train_y, test_y = [], []

    for i in train:
        train_x.append(sentences[i])
        train_y.append(labels[i])

    for i in test:
        test_x.append(sentences[i])
        test_y.append(labels[i])

    # Turn the labels into a numpy array
    train_y = np.array(train_y)
    test_y = np.array(test_y)

    # encode data using
    # Cleaning and Tokenization
    tokenizer = Tokenizer(oov_token=oov_tok)
    tokenizer.fit_on_texts(train_x)

    # Turn the text into sequence
    training_sequences = tokenizer.texts_to_sequences(train_x)
    test_sequences = tokenizer.texts_to_sequences(test_x)

    max_len = max_length(training_sequences)

    # Pad the sequence to have the same size
    Xtrain = pad_sequences(training_sequences, maxlen=max_len, padding=padding_type, truncating=trunc_type)
    Xtest = pad_sequences(test_sequences, maxlen=max_len, padding=padding_type, truncating=trunc_type)

    word_index = tokenizer.word_index
    vocab_size = len(word_index)+1

    # Define the input shape
    model = define_model(input_dim=vocab_size, max_length=max_len)

    # Train the model
    model.fit(Xtrain, train_y, batch_size=32, epochs=15, verbose=1, 
              callbacks=[callbacks], validation_data=(Xtest, test_y))

    # evaluate the model
    loss, acc = model.evaluate(Xtest, test_y, verbose=0)
    print('Test Accuracy: {}'.format(acc*100))

    acc_list.append(acc*100)

mean_acc = np.array(acc_list).mean()
entries = acc_list + [mean_acc]

temp = pd.DataFrame([entries], columns=columns)
record = record.append(temp, ignore_index=True)
print()
print(record)
print()

Training 1: 
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15

KeyboardInterrupt: 

## Summary

In [260]:
record.sort_values(by='AVG', ascending=False)

Unnamed: 0,Activation,Filters,acc1,acc2,acc3,acc4,acc5,acc6,acc7,acc8,acc9,acc10,AVG
1,relu,2,77.51323,81.481481,79.62963,80.423278,80.158728,80.106103,79.310346,81.697613,81.962866,81.43236,80.371563
4,relu,5,80.158728,78.306878,81.481481,78.306878,82.539684,83.02387,78.77984,80.37135,81.167108,79.045093,80.318091
3,relu,4,80.158728,79.365081,80.158728,81.216931,80.687833,79.045093,83.02387,80.901855,77.71883,80.901855,80.317881
5,relu,6,80.687833,81.216931,76.190478,80.158728,83.597887,78.77984,81.43236,77.188331,79.840851,83.02387,80.211711
8,tanh,3,83.597887,77.777779,79.365081,80.423278,81.74603,77.984083,79.045093,79.575598,81.167108,79.840851,80.052279
11,tanh,6,80.423278,80.423278,75.661373,80.423278,81.481481,76.657826,84.350133,80.106103,79.045093,81.697613,80.026945
0,relu,1,79.62963,78.306878,77.51323,82.539684,80.158728,78.249335,82.228118,79.840851,80.106103,79.045093,79.761765
9,tanh,4,79.62963,84.391534,78.306878,81.481481,78.835976,77.71883,79.840851,77.71883,81.167108,77.984083,79.70752
6,tanh,1,77.51323,77.777779,78.571427,81.216931,82.010579,78.514588,77.453583,81.697613,81.43236,79.310346,79.549844
10,tanh,5,82.539684,78.571427,79.100531,78.571427,80.158728,79.840851,79.045093,80.106103,78.249335,78.514588,79.469777


In [272]:
record[['Activation', 'AVG']].groupby(by='Activation').max().sort_values(by='AVG', ascending=False)

Unnamed: 0_level_0,AVG
Activation,Unnamed: 1_level_1
relu,80.371563
tanh,80.052279


In [310]:
report = record.sort_values(by='AVG', ascending=False)
report = report.to_excel('CNN_CR.xlsx', sheet_name='random')

# Model 2: Word2Vec Static

__Using and updating pre-trained embeddings__
* In this part, we will create an Embedding layer in Tensorflow Keras using a pre-trained word embedding called Word2Vec 300-d tht has been trained 100 bilion words from Google News.
* In this part,  we will leave the embeddings fixed instead of updating them (dynamic).

1. __Load `Word2Vec` Pre-trained Word Embedding__

In [3]:
from gensim.models import KeyedVectors
word2vec = KeyedVectors.load_word2vec_format('../GoogleNews-vectors-negative300.bin', binary=True)

In [4]:
# Access the dense vector value for the word 'handsome'
# word2vec.word_vec('handsome') # 0.11376953
word2vec.word_vec('cool') # 1.64062500e-01

array([ 1.64062500e-01,  1.87500000e-01, -4.10156250e-02,  1.25000000e-01,
       -3.22265625e-02,  8.69140625e-02,  1.19140625e-01, -1.26953125e-01,
        1.77001953e-02,  8.83789062e-02,  2.12402344e-02, -2.00195312e-01,
        4.83398438e-02, -1.01074219e-01, -1.89453125e-01,  2.30712891e-02,
        1.17675781e-01,  7.51953125e-02, -8.39843750e-02, -1.33666992e-02,
        1.53320312e-01,  4.08203125e-01,  3.80859375e-02,  3.36914062e-02,
       -4.02832031e-02, -6.88476562e-02,  9.03320312e-02,  2.12890625e-01,
        1.72119141e-02, -6.44531250e-02, -1.29882812e-01,  1.40625000e-01,
        2.38281250e-01,  1.37695312e-01, -1.76757812e-01, -2.71484375e-01,
       -1.36718750e-01, -1.69921875e-01, -9.15527344e-03,  3.47656250e-01,
        2.22656250e-01, -3.06640625e-01,  1.98242188e-01,  1.33789062e-01,
       -4.34570312e-02, -5.12695312e-02, -3.46679688e-02, -8.49609375e-02,
        1.01562500e-01,  1.42578125e-01, -7.95898438e-02,  1.78710938e-01,
        2.30468750e-01,  

2. __Check number of training words present in Word2Vec__

In [5]:
def training_words_in_word2vector(word_to_vec_map, word_to_index):
    '''
    input:
        word_to_vec_map: a word2vec GoogleNews-vectors-negative300.bin model loaded using gensim.models
        word_to_index: word to index mapping from training set
    '''
    
    vocab_size = len(word_to_index) + 1
    count = 0
    # Set each row "idx" of the embedding matrix to be 
    # the word vector representation of the idx'th word of the vocabulary
    for word, idx in word_to_index.items():
        if word in word_to_vec_map:
            count+=1
            
    return print('Found {} words present from {} training vocabulary in the set of pre-trained word vector'.format(count, vocab_size))

In [11]:
# Separate the sentences and the labels
sentences, labels = list(corpus.sentence), list(corpus.label)

# Cleaning and Tokenization
tokenizer = Tokenizer(oov_token=oov_tok)
tokenizer.fit_on_texts(sentences)

word_index = tokenizer.word_index
training_words_in_word2vector(word2vec, word_index)

Found 5046 words present from 5336 training vocabulary in the set of pre-trained word vector


2. __Define a `pretrained_embedding_layer` function__

In [138]:
from tensorflow.keras.layers import Embedding

def pretrained_embedding_matrix(word_to_vec_map, word_to_index):
    '''
    input:
        word_to_vec_map: a word2vec GoogleNews-vectors-negative300.bin model loaded using gensim.models
        word_to_index: word to index mapping from training set
    '''
    
    # adding 1 to fit Keras embedding (requirement)
    vocab_size = len(word_to_index) + 1
    # define dimensionality of your pre-trained word vectors (= 300)
    emb_dim = word_to_vec_map.word_vec('handsome').shape[0]
    
    
    embed_matrix = np.zeros((vocab_size, emb_dim))
    
    # Set each row "idx" of the embedding matrix to be 
    # the word vector representation of the idx'th word of the vocabulary
    for word, idx in word_to_index.items():
        if word in word_to_vec_map:
            embed_matrix[idx] = word_to_vec_map.word_vec(word)
            
        # initialize the unknown word with standard normal distribution values
        else:
            embed_matrix[idx] = np.random.randn(emb_dim)
            
    return embed_matrix

In [311]:
# Test the function
w_2_i = {'<UNK>': 1, 'handsome': 2, 'cool': 3, 'shit': 4 }
em_matrix = pretrained_embedding_matrix(word2vec, w_2_i)
em_matrix

array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.7603335 ,  0.41298582,  1.6051669 , ...,  0.07348683,
        -0.93163275, -0.64774868],
       [ 0.11376953,  0.1796875 , -0.265625  , ..., -0.21875   ,
        -0.03930664,  0.20996094],
       [ 0.1640625 ,  0.1875    , -0.04101562, ...,  0.10888672,
        -0.01019287,  0.02075195],
       [ 0.10888672, -0.16699219,  0.08984375, ..., -0.19628906,
        -0.23144531,  0.04614258]])

## LSTM Model

In [312]:
from tensorflow.keras import regularizers
from tensorflow.keras.constraints import MaxNorm

def define_model_2(input_dim = None, output_dim=300, max_length = None ):
    
    model = tf.keras.models.Sequential([
        tf.keras.layers.Embedding(input_dim=input_dim, 
                                  mask_zero= True,
                                  output_dim=output_dim, 
                                  input_length=max_length, 
                                  input_shape=(max_length, ),
                                  # Assign the embedding weight with word2vec embedding marix
                                  weights = [emb_matrix],
                                  # Set the weight to be not trainable (static)
                                  trainable = False),
        
        tf.keras.layers.LSTM(units=128, return_sequences=True),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.LSTM(units=128, return_sequences=False),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dropout(0.5),
        # Propagate X through a Dense layer with 1 unit
        tf.keras.layers.Dense(units=1, activation='sigmoid')
    ])
    
    model.compile( loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
#     model.summary()
    return model

In [313]:
model_0 = define_model_2( input_dim=1000, max_length=100, emb_matrix=np.random.rand(vocab_size, 300))
model_0.summary()

Model: "sequential_1437"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1444 (Embedding)   (None, 100, 300)          1524600   
_________________________________________________________________
conv1d_1439 (Conv1D)         (None, 98, 100)           90100     
_________________________________________________________________
max_pooling1d_1439 (MaxPooli (None, 49, 100)           0         
_________________________________________________________________
flatten_1439 (Flatten)       (None, 4900)              0         
_________________________________________________________________
dropout_2871 (Dropout)       (None, 4900)              0         
_________________________________________________________________
dense_2869 (Dense)           (None, 10)                49010     
_________________________________________________________________
dropout_2872 (Dropout)       (None, 10)            

## Train and Test the Model

In [314]:
class myCallback(tf.keras.callbacks.Callback):
    # Overide the method on_epoch_end() for our benefit
    def on_epoch_end(self, epoch, logs={}):
        if (logs.get('accuracy') >= 0.9):
            print("\nReached 90% accuracy so cancelling training!")
            self.model.stop_training=True

callbacks = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', min_delta=0, 
                                             patience=5, verbose=2, 
                                             mode='auto', restore_best_weights=True)

In [315]:
# Parameter Initialization
trunc_type='post'
padding_type='post'
oov_tok = "<UNK>"

columns = ['acc1', 'acc2', 'acc3', 'acc4', 'acc5', 'acc6', 'acc7', 'acc8', 'acc9', 'acc10', 'AVG']
record = pd.DataFrame(columns = columns)

# prepare cross validation with 10 splits and shuffle = True
kfold = KFold(10, True)

# Separate the sentences and the labels
sentences, labels = list(corpus.sentence), list(corpus.label)

exp=0

# kfold.split() will return set indices for each split
acc_list = []
for train, test in kfold.split(sentences):
    
    exp+=1
    print('Training {}: '.format(exp))
    
    train_x, test_x = [], []
    train_y, test_y = [], []

    for i in train:
        train_x.append(sentences[i])
        train_y.append(labels[i])

    for i in test:
        test_x.append(sentences[i])
        test_y.append(labels[i])

    # Turn the labels into a numpy array
    train_y = np.array(train_y)
    test_y = np.array(test_y)

    # encode data using
    # Cleaning and Tokenization
    tokenizer = Tokenizer(oov_token=oov_tok)
    tokenizer.fit_on_texts(train_x)

    # Turn the text into sequence
    training_sequences = tokenizer.texts_to_sequences(train_x)
    test_sequences = tokenizer.texts_to_sequences(test_x)

    max_len = max_length(training_sequences)

    # Pad the sequence to have the same size
    Xtrain = pad_sequences(training_sequences, maxlen=max_len, padding=padding_type, truncating=trunc_type)
    Xtest = pad_sequences(test_sequences, maxlen=max_len, padding=padding_type, truncating=trunc_type)

    word_index = tokenizer.word_index
    vocab_size = len(word_index)+1

    # Define the input shape
    model = define_model(input_dim=vocab_size, max_length=max_len)

    # Train the model
    model.fit(Xtrain, train_y, batch_size=32, epochs=15, verbose=1, 
              callbacks=[callbacks], validation_data=(Xtest, test_y))

    # evaluate the model
    loss, acc = model.evaluate(Xtest, test_y, verbose=0)
    print('Test Accuracy: {}'.format(acc*100))

    acc_list.append(acc*100)

mean_acc = np.array(acc_list).mean()
entries = acc_list + [mean_acc]

temp = pd.DataFrame([entries], columns=columns)
record = record.append(temp, ignore_index=True)
print()
print(record)
print()



Restoring model weights from the end of the best epoch.
Epoch 00016: early stopping
Test Accuracy: 74.0740716457367
Restoring model weights from the end of the best epoch.
Epoch 00012: early stopping
Test Accuracy: 73.01587462425232
Restoring model weights from the end of the best epoch.
Epoch 00013: early stopping
Test Accuracy: 78.04232835769653
Restoring model weights from the end of the best epoch.
Epoch 00012: early stopping
Test Accuracy: 74.60317611694336
Restoring model weights from the end of the best epoch.
Epoch 00012: early stopping
Test Accuracy: 82.80423283576965
Restoring model weights from the end of the best epoch.
Epoch 00015: early stopping
Test Accuracy: 80.90185523033142
Restoring model weights from the end of the best epoch.
Epoch 00016: early stopping
Test Accuracy: 82.49337077140808
Restoring model weights from the end of the best epoch.
Epoch 00015: early stopping
Test Accuracy: 77.98408269882202
Restoring model weights from the end of the best epoch.
Epoch 000

Restoring model weights from the end of the best epoch.
Epoch 00016: early stopping
Test Accuracy: 74.86772537231445
Restoring model weights from the end of the best epoch.
Epoch 00017: early stopping
Test Accuracy: 77.77777910232544
Restoring model weights from the end of the best epoch.
Epoch 00006: early stopping
Test Accuracy: 66.1375641822815
Restoring model weights from the end of the best epoch.
Epoch 00006: early stopping
Test Accuracy: 66.1375641822815
Restoring model weights from the end of the best epoch.
Epoch 00019: early stopping
Test Accuracy: 75.39682388305664
Restoring model weights from the end of the best epoch.
Epoch 00006: early stopping
Test Accuracy: 61.00795865058899
Restoring model weights from the end of the best epoch.
Epoch 00017: early stopping
Test Accuracy: 76.92307829856873
Restoring model weights from the end of the best epoch.
Epoch 00014: early stopping
Test Accuracy: 75.59681534767151
Restoring model weights from the end of the best epoch.
Epoch 0002

## Summary

In [316]:
record2.sort_values(by='AVG', ascending=False)

Unnamed: 0,Activation,Filters,acc1,acc2,acc3,acc4,acc5,acc6,acc7,acc8,acc9,acc10,AVG
0,relu,1,74.074072,73.015875,78.042328,74.603176,82.804233,80.901855,82.493371,77.984083,78.514588,74.270558,77.670414
1,relu,2,81.481481,77.777779,73.280424,81.481481,73.280424,75.596815,77.188331,73.209548,82.493371,75.862068,77.165172
4,relu,5,72.222221,71.164024,81.481481,76.984125,81.216931,72.679043,77.984083,75.331563,77.453583,77.984083,76.450114
3,relu,4,75.396824,79.894179,74.603176,72.751325,73.809522,78.77984,78.249335,75.06631,73.740053,75.06631,75.735688
2,relu,3,76.190478,73.809522,76.455027,73.015875,74.867725,73.209548,76.923078,77.188331,75.331563,78.77984,75.577099
5,relu,6,74.867725,77.777779,66.137564,66.137564,75.396824,61.007959,76.923078,75.596815,78.249335,69.496024,72.159067
7,relu,8,76.190478,71.693122,72.48677,65.608466,70.37037,75.06631,70.822281,72.41379,63.925731,77.984083,71.65614
6,relu,7,79.365081,63.492066,63.22751,77.248675,63.22751,70.822281,68.435013,77.71883,62.599468,75.06631,70.120274


In [317]:
record2[['Activation', 'AVG']].groupby(by='Activation').max().sort_values(by='AVG', ascending=False)

Unnamed: 0_level_0,AVG
Activation,Unnamed: 1_level_1
relu,77.670414


In [318]:
report = record2.sort_values(by='AVG', ascending=False)
report = report.to_excel('CNN_CR_2.xlsx', sheet_name='static')

# Model 3: Word2Vec - Dynamic

* In this part,  we will fine tune the embeddings while training (dynamic).

## LSTM Model

In [319]:
from tensorflow.keras import regularizers
from tensorflow.keras.constraints import MaxNorm

def define_model_3(input_dim = None, output_dim=300, max_length = None ):
    
    model = tf.keras.models.Sequential([
        tf.keras.layers.Embedding(input_dim=input_dim, 
                                  mask_zero= True,
                                  output_dim=output_dim, 
                                  input_length=max_length, 
                                  input_shape=(max_length, ),
                                  # Assign the embedding weight with word2vec embedding marix
                                  weights = [emb_matrix],
                                  # Set the weight to be not trainable (static)
                                  trainable = True),
        
        tf.keras.layers.LSTM(units=128, return_sequences=True),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.LSTM(units=128, return_sequences=False),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dropout(0.5),
        # Propagate X through a Dense layer with 1 unit
        tf.keras.layers.Dense(units=1, activation='sigmoid')
    ])
    
    model.compile( loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
#     model.summary()
    return model

In [320]:
model_0 = define_model_3( input_dim=1000, max_length=100, emb_matrix=np.random.rand(vocab_size, 300))
model_0.summary()

Model: "sequential_1518"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1525 (Embedding)   (None, 100, 300)          1527300   
_________________________________________________________________
conv1d_1520 (Conv1D)         (None, 98, 100)           90100     
_________________________________________________________________
max_pooling1d_1520 (MaxPooli (None, 49, 100)           0         
_________________________________________________________________
flatten_1520 (Flatten)       (None, 4900)              0         
_________________________________________________________________
dropout_3033 (Dropout)       (None, 4900)              0         
_________________________________________________________________
dense_3031 (Dense)           (None, 10)                49010     
_________________________________________________________________
dropout_3034 (Dropout)       (None, 10)            

## Train and Test the Model

In [321]:
class myCallback(tf.keras.callbacks.Callback):
    # Overide the method on_epoch_end() for our benefit
    def on_epoch_end(self, epoch, logs={}):
        if (logs.get('accuracy') > 0.93):
            print("\nReached 93% accuracy so cancelling training!")
            self.model.stop_training=True

callbacks = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', min_delta=0, 
                                             patience=5, verbose=2, 
                                             mode='auto', restore_best_weights=True)

In [325]:
# Parameter Initialization
trunc_type='post'
padding_type='post'
oov_tok = "<UNK>"

columns = ['acc1', 'acc2', 'acc3', 'acc4', 'acc5', 'acc6', 'acc7', 'acc8', 'acc9', 'acc10', 'AVG']
record = pd.DataFrame(columns = columns)

# prepare cross validation with 10 splits and shuffle = True
kfold = KFold(10, True)

# Separate the sentences and the labels
sentences, labels = list(corpus.sentence), list(corpus.label)

exp=0

# kfold.split() will return set indices for each split
acc_list = []
for train, test in kfold.split(sentences):
    
    exp+=1
    print('Training {}: '.format(exp))
    
    train_x, test_x = [], []
    train_y, test_y = [], []

    for i in train:
        train_x.append(sentences[i])
        train_y.append(labels[i])

    for i in test:
        test_x.append(sentences[i])
        test_y.append(labels[i])

    # Turn the labels into a numpy array
    train_y = np.array(train_y)
    test_y = np.array(test_y)

    # encode data using
    # Cleaning and Tokenization
    tokenizer = Tokenizer(oov_token=oov_tok)
    tokenizer.fit_on_texts(train_x)

    # Turn the text into sequence
    training_sequences = tokenizer.texts_to_sequences(train_x)
    test_sequences = tokenizer.texts_to_sequences(test_x)

    max_len = max_length(training_sequences)

    # Pad the sequence to have the same size
    Xtrain = pad_sequences(training_sequences, maxlen=max_len, padding=padding_type, truncating=trunc_type)
    Xtest = pad_sequences(test_sequences, maxlen=max_len, padding=padding_type, truncating=trunc_type)

    word_index = tokenizer.word_index
    vocab_size = len(word_index)+1

    # Define the input shape
    model = define_model(input_dim=vocab_size, max_length=max_len)

    # Train the model
    model.fit(Xtrain, train_y, batch_size=32, epochs=15, verbose=1, 
              callbacks=[callbacks], validation_data=(Xtest, test_y))

    # evaluate the model
    loss, acc = model.evaluate(Xtest, test_y, verbose=0)
    print('Test Accuracy: {}'.format(acc*100))

    acc_list.append(acc*100)

mean_acc = np.array(acc_list).mean()
entries = acc_list + [mean_acc]

temp = pd.DataFrame([entries], columns=columns)
record = record.append(temp, ignore_index=True)
print()
print(record)
print()



Restoring model weights from the end of the best epoch.
Epoch 00015: early stopping
Test Accuracy: 80.15872836112976
Restoring model weights from the end of the best epoch.
Epoch 00015: early stopping
Test Accuracy: 82.53968358039856
Restoring model weights from the end of the best epoch.
Epoch 00011: early stopping
Test Accuracy: 80.42327761650085
Restoring model weights from the end of the best epoch.
Epoch 00011: early stopping
Test Accuracy: 77.51322984695435
Restoring model weights from the end of the best epoch.
Epoch 00013: early stopping
Test Accuracy: 83.59788656234741
Restoring model weights from the end of the best epoch.
Epoch 00017: early stopping
Test Accuracy: 79.31034564971924
Restoring model weights from the end of the best epoch.
Epoch 00010: early stopping
Test Accuracy: 79.57559823989868
Restoring model weights from the end of the best epoch.
Epoch 00013: early stopping
Test Accuracy: 78.51458787918091
Restoring model weights from the end of the best epoch.
Epoch 00

Restoring model weights from the end of the best epoch.
Epoch 00012: early stopping
Test Accuracy: 77.24867463111877
Restoring model weights from the end of the best epoch.
Epoch 00016: early stopping
Test Accuracy: 79.62962985038757
Restoring model weights from the end of the best epoch.
Epoch 00014: early stopping
Test Accuracy: 78.30687761306763
Restoring model weights from the end of the best epoch.
Epoch 00010: early stopping
Test Accuracy: 78.83597612380981
Restoring model weights from the end of the best epoch.
Epoch 00014: early stopping
Test Accuracy: 81.4814805984497
Restoring model weights from the end of the best epoch.
Epoch 00011: early stopping
Test Accuracy: 80.37135004997253
Restoring model weights from the end of the best epoch.
Epoch 00014: early stopping
Test Accuracy: 83.02386999130249
Restoring model weights from the end of the best epoch.
Epoch 00014: early stopping
Test Accuracy: 80.90185523033142
Restoring model weights from the end of the best epoch.
Epoch 000

## Summary

In [326]:
record3.sort_values(by='AVG', ascending=False)

Unnamed: 0,Activation,Filters,acc1,acc2,acc3,acc4,acc5,acc6,acc7,acc8,acc9,acc10,AVG
1,relu,2,80.687833,81.74603,79.365081,79.100531,80.687833,82.228118,80.901855,81.43236,80.901855,80.106103,80.71576
5,relu,6,77.248675,79.62963,78.306878,78.835976,81.481481,80.37135,83.02387,80.901855,81.962866,80.636603,80.239918
3,relu,4,80.687833,80.687833,79.365081,77.248675,80.952382,79.575598,79.045093,79.045093,81.43236,81.167108,79.920706
0,relu,1,80.158728,82.539684,80.423278,77.51323,83.597887,79.310346,79.575598,78.514588,74.801064,79.840851,79.627525
2,relu,3,81.216931,76.719576,81.74603,80.952382,76.719576,81.43236,80.106103,78.249335,77.188331,80.636603,79.496723
7,relu,8,78.042328,76.455027,82.275134,78.571427,78.042328,78.249335,78.249335,81.43236,76.657826,79.045093,78.702019
6,relu,7,80.158728,80.158728,80.687833,79.100531,80.158728,75.331563,74.270558,79.310346,79.575598,76.127321,78.487993
4,relu,5,64.550263,81.481481,80.158728,78.306878,79.62963,76.657826,75.331563,80.636603,80.901855,82.758623,78.041345


In [327]:
report = record3.sort_values(by='AVG', ascending=False)
report = report.to_excel('CNN_CR_3.xlsx', sheet_name='dynamic')