### Outline
#### 1. Introduction 
#### 2. Load sentiment dataset
#### 3. Pre-process
#### 4. Logistic Regression
#### 5. Neural Network

# 1. Introduction

This project developed three machine learning and data mining methods to classify
the polarity of tweets (a binary classification problem) using the Sentiment140 dataset. The work
explores a logistic regression classifier and two neural networks: a Long Short-Term
Memory (LSTM) network and a Convolutional Neural Network (CNN).

Aiming for a high accuracy score while pre-processing the data and fine-tuning the
hyperparameters of these models is a way to become familiar with each algorithm and its
nuances, both in theory and in practice, when applied to Natural Language Processing (NLP).

The results show that the CNN model was the best
performer in terms of prediction accuracy and had the shortest run time.


# 2. Load sentiment dataset

In [3]:
import tensorflow as tf


In [4]:
!pip install -q -U keras-tuner


In [1]:
#Loading the required libraries

import pandas as pd
import nltk 
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
import re
from nltk.tokenize import RegexpTokenizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from keras.preprocessing.text import Tokenizer
from sklearn.metrics import confusion_matrix, classification_report, roc_curve
from sklearn.naive_bayes import ComplementNB
from sklearn.svm import SVC
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.optimizers import Adam
from keras.models import Sequential
from keras import layers
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dropout
import keras_tuner as kt
from keras_tuner.tuners import RandomSearch, Hyperband, BayesianOptimization
from keras.preprocessing.sequence import pad_sequences

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ericp\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [8]:
#Loading the dataset

#Change 'path' to the location of the dataset uploaded to Google Colab
path = '/content/sentiment140.csv'

data = pd.read_csv(path, encoding='latin-1', header=None) # no headers in this dataset

#Defining column names
data.columns = ['target', 'ids', 'date', 'flag', 'user', 'text'] # inserting header names
data.head() # show data

|   | target | ids | date | flag | user | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |


In [9]:
data['target'].size # show number of records

1600000

# 3. Pre-process

As Sentiment140 is a large dataset of tweets extracted using the Twitter API, pre-processing is
required before the data can be used for analysis: emojis, tags and other features that would
impact the performance of the models need to be cleaned.


##### 3.1 Check for missing values

In [10]:
data.isnull().sum()

target    0
ids       0
date      0
flag      0
user      0
text      0
dtype: int64

Comment: No missing values across all variables

##### 3.2 Review class distribution

In [11]:
data['target'].value_counts()

0    800000
4    800000
Name: target, dtype: int64

Comment: the labels are split 50/50 between only two values, 0 = negative and 4 = positive.

##### 3.3 Relevant columns

In [12]:
data['flag'].value_counts()

NO_QUERY    1600000
Name: flag, dtype: int64

Comment: the 'flag' column only contains one value, "NO_QUERY", which makes this variable redundant. 'date', 'user' and 'ids' are also out of scope for sentiment analysis, so they are dropped as well.

In [13]:
data = data.drop(['ids', 'date', 'flag', 'user'], axis=1) # drop columns
data.head()

|   | target | text |
|---|---|---|
| 0 | 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | is upset that he can't update his Facebook by ... |
| 2 | 0 | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | my whole body feels itchy and like its on fire |
| 4 | 0 | @nationwideclass no, it's not behaving at all.... |


In [14]:
type(data)

pandas.core.frame.DataFrame

##### 3.4 Cleaning & Tokenization

After dropping unnecessary columns, common abbreviations and informal expressions such as ‘LOL’,
‘bday’ and ‘hahaha’ were replaced with a descriptive word to reduce the number of text features.
A stop-word list was then used to identify and remove common words that carry little
information, such as ‘is’, ‘the’ and ‘in’.

Removing stop words lets the model focus on words with more contextual importance, and the
smaller number of features speeds up training.
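
The replacements described above could also be expressed as a single lookup table. The following is a minimal sketch rather than the notebook's implementation: the names `REPLACEMENTS` and `normalise_slang` are illustrative assumptions, and the table simply collects the substitutions applied in the cells of this section.

```python
# Hypothetical lookup table collecting the substitutions used in this section.
REPLACEMENTS = {
    "LOL": "laugh",
    "ugh": "disgust",
    "hahaha": "laugh",
    "hahah": "laugh",
    "hahahaha": "laugh",
    "hehehe": "laugh",
    "bday": "birthday",
}


def normalise_slang(tweet):
    """Replace whole tokens found in REPLACEMENTS; leave every other token as-is."""
    return " ".join(REPLACEMENTS.get(token, token) for token in tweet.split())


print(normalise_slang("LOL that was my bday"))
# laugh that was my birthday
```

Because the lookup is on whole tokens, a word such as "laughing" or "@LOLTrish" is left untouched, which matches the exact-match comparisons used in the functions below.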

In [15]:
#Replacing slang and interjections with a descriptive word

def further_cleaning(tweet):
    message = []
    
    for i in tweet.split():            
        
        #Replacing LOL for laugh
        if i == 'LOL':
            replace = i.replace('LOL', 'laugh')
            message.append(replace)
            
            
        elif i == 'ugh':
            replace = i.replace('ugh', 'disgust')
            message.append(replace)
            

        else:
            message.append(i)
        
    
    return " ".join(message)     

In [16]:
#Comparing, before and after the cleaning process

print(f"Original : {data.text[28]}")
print()
print(f"Preprocessed : {further_cleaning(data.text[28])}")

Original : ooooh.... LOL  that leslie.... and ok I won't do it again so leslie won't  get mad again 

Preprocessed : ooooh.... laugh that leslie.... and ok I won't do it again so leslie won't get mad again


In [17]:
#Applying the replacing process in the dataset

data.text = data.text.apply(lambda x: further_cleaning(x))
data.text

0          @switchfoot http://twitpic.com/2y1zl - Awww, t...
1          is upset that he can't update his Facebook by ...
2          @Kenichan I dived many times for the ball. Man...
3             my whole body feels itchy and like its on fire
4          @nationwideclass no, it's not behaving at all....
                                 ...                        
1599995    Just woke up. Having no school is the best fee...
1599996    TheWDB.com - Very cool to hear old Walt interv...
1599997    Are you ready for your MoJo Makeover? Ask me f...
1599998    Happy 38th Birthday to my boo of alll time!!! ...
1599999    happy #charitytuesday @theNSPCC @SparksCharity...
Name: text, Length: 1600000, dtype: object

In [18]:
print(data.text[28])

ooooh.... laugh that leslie.... and ok I won't do it again so leslie won't get mad again


In [19]:
#Further cleaning of non-english words and abbreviations

def cleaning(tweet):
    message = []
    
    for i in tweet.split():
        if i == 'hahaha' or i == 'hahah' or i == 'hahahaha' or i == 'hehehe':
            replace = i.replace(i, 'laugh')
            message.append(replace)
            
            #bday will be replaced for birthday in the entire dataset
        elif i == 'bday':
            replace = i.replace('bday', 'birthday')
            message.append(replace)
            
        else:
            message.append(i)
        
    
    return " ".join(message)  

In [20]:
data.text = data.text.apply(lambda x: cleaning(x))

In [21]:
#Removing stop words such as 'me', 'my'... and optionally stemming to strip word endings, e.g. running -> run
stop_words = set(stopwords.words('english')) # separate name so the nltk 'stopwords' module is not shadowed
stemming = SnowballStemmer('english')


#Pattern matching mentions (@user), urls and any run of non-alphanumeric characters
clean = r"@\S+|https?:\S+|[^A-Za-z0-9]+"


def data_preprocessing(message, stem=False):
    #Lowercase the text, then replace mentions, urls and symbols with a space
    message = re.sub(clean, ' ', str(message).lower()).strip()
    text = []

    for i in message.split():
        if i not in stop_words:
            if stem:
                text.append(stemming.stem(i))
            else:
                text.append(i)

    return " ".join(text)



Comment:
Stemming reduces different forms of a word to a common base form, since inflected forms create
several versions of the same word, which could produce less effective trained models. I have used
the Snowball stemmer, which transforms words such as running to run and generously to generous;
the data_preprocessing function applies it only when called with stem=True. In addition to
stop-word removal and stemming, the text corpus was converted to lowercase so that the same word
with different casing is not treated as separate features. The dataset was also cleaned of
mentions, URLs, symbols and punctuation such as ‘@’, ‘http’ and ‘,’.
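
To make the effect of the cleaning pattern concrete, here is a small self-contained sketch. It uses the same kind of regular expression as the cell above, but the stop-word set `TOY_STOP_WORDS` and the function name `toy_preprocess` are assumptions made for illustration; the notebook itself relies on NLTK's English stop-word list.

```python
import re

# Mentions (@user), urls, and any run of non-alphanumeric characters.
CLEAN = r"@\S+|https?:\S+|[^A-Za-z0-9]+"

# Tiny illustrative stop-word set (assumption; not NLTK's full list).
TOY_STOP_WORDS = {"is", "the", "in"}


def toy_preprocess(message):
    # Lowercase first, then replace every match of the pattern with a space.
    message = re.sub(CLEAN, " ", message.lower()).strip()
    # split() collapses repeated spaces left behind by the substitution.
    return " ".join(w for w in message.split() if w not in TOY_STOP_WORDS)


print(toy_preprocess("@user the weather in London is GREAT!!! https://t.co/x"))
# weather london great
```

The order of the alternatives matters: mentions and URLs are matched as whole tokens before the generic non-alphanumeric rule gets a chance to split them into fragments.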

In [22]:
#Comparing the original dataset with the cleaned one

print(f"Original : {data.text[7]}")
print()
print(f"Preprocessed : {data_preprocessing(data.text[7])}")

Original : @LOLTrish hey long time no see! Yes.. Rains a bit ,only a bit laugh , I'm fine thanks , how's you ?

Preprocessed : hey long time see yes rains bit bit laugh fine thanks


In [23]:
#Applying the cleaned text in the dataset

data_text = data.text.apply(lambda x: data_preprocessing(x))
data_text

0               awww bummer shoulda got david carr third day
1          upset update facebook texting might cry result...
2          dived many times ball managed save 50 rest go ...
3                           whole body feels itchy like fire
4                                           behaving mad see
                                 ...                        
1599995                        woke school best feeling ever
1599996             thewdb com cool hear old walt interviews
1599997                      ready mojo makeover ask details
1599998    happy 38th birthday boo alll time tupac amaru ...
1599999    happy charitytuesday thenspcc sparkscharity sp...
Name: text, Length: 1600000, dtype: object

In [24]:
#Checking target values
data.target

0          0
1          0
2          0
3          0
4          0
          ..
1599995    4
1599996    4
1599997    4
1599998    4
1599999    4
Name: target, Length: 1600000, dtype: int64

In [25]:
#Mapping the original labels to binary classes: 0 -> 0 and 4 -> 1

binary_class = sorted(set(data.target)) # sorted so the mapping does not depend on set ordering
index = dict((a, b) for b, a in enumerate(binary_class))
index_to_class = dict((c, d) for d, c in index.items())

print(index)

{0: 0, 4: 1}


In [26]:
#Converting the dataset labels to the binary classes

import numpy as np 

ids = lambda labels: np.array([index.get(x) for x in labels])

labels = ids(data.target)
print(labels)

[0 0 0 ... 1 1 1]


In [27]:
labels

array([0, 0, 0, ..., 1, 1, 1])

In [28]:
# Split data into training and test sets

X_train, X_test, y_train, y_test = train_test_split(data_text, labels, test_size=0.25,
                                                    random_state=23) # so we get the same results


#### TF-IDF

TF-IDF (Term Frequency - Inverse Document Frequency) weights each term by how often it occurs in a document, scaled down by how many documents contain it, so terms that are frequent in one tweet but rare across the corpus receive the highest weights.
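
As a rough illustration of the idea, the sketch below computes the textbook weight tf × log(N / df) in pure Python on three toy documents. The documents and the function name `tf_idf` are assumptions for the example. scikit-learn's `TfidfVectorizer`, which is what this notebook uses, applies a smoothed IDF and L2-normalises each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Toy corpus (assumption): each document is already tokenised.
docs = [
    "good morning sunshine".split(),
    "good night".split(),
    "good coffee good mood".split(),
]


def tf_idf(docs):
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


weights = tf_idf(docs)
print(weights[0])
```

The word "good" appears in every document, so its weight is zero, while "morning" and "sunshine" each receive a positive weight in the first document.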

In [29]:
# transforming the tweet data into a matrix of TF-IDF vectors

trans_vectoriser = TfidfVectorizer(ngram_range=(1,2), max_features=500000, lowercase= False)
trans_vectoriser.fit(X_train)
print(len(trans_vectoriser.get_feature_names_out()))

500000


In [30]:
#Transforming X_train and X_test 

X_train_v = trans_vectoriser.transform(X_train)
X_test_v  = trans_vectoriser.transform(X_test)

# 4. Logistic Regression

The logistic regression classifier computes a weighted combination of the input features
(here, the TF-IDF word weights) and passes it through the sigmoid function, which maps it to a
value between 0 and 1.

Logistic regression is known for its easy implementation, interpretability and efficiency in
training. Together with the assumption that the classes are close to linearly separable, this
makes it a great starter algorithm.
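
The prediction step described above can be sketched in a few lines. This is an illustration only: the weights, bias and feature values are hypothetical and are not taken from the fitted model.

```python
import math


def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    # Weighted combination of the features, squashed by the sigmoid.
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical two-feature model: the first feature pushes towards the
# positive class, the second towards the negative class.
toy_weights = [1.5, -2.0]
toy_bias = 0.0

p = predict_proba([1.0, 0.0], toy_weights, toy_bias)
print(round(p, 3), "-> class", int(p >= 0.5))
```

A probability of at least 0.5 is read as the positive class; anything below is read as negative.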

In [33]:
#Creating logistic regression as the first model

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

log_r = LogisticRegression(max_iter = 2000)
log_r.fit(X_train_v, y_train)

#Predicting the trained model with the test dataset
log_pred = log_r.predict(X_test_v)
print("Model accuracy: {:.2f}".format(accuracy_score(y_test, log_pred)))

Model accuracy: 0.79


### 4.1 Model Tuning

To improve the model, grid search was used for hyperparameter tuning: every combination of the
candidate hyperparameter values is evaluated and the one with the highest accuracy is kept. For
logistic regression the relevant hyperparameters are the inverse of the regularization strength (C)
and the type of regularization (penalty).

C controls how strongly the model is regularized to avoid overfitting; smaller values mean
stronger regularization. The penalty can serve two purposes: L1 helps with sparsity and
L2 aids in finding the optimal parameter 𝜆 (lambda) (17). As L2 works best for prediction, it is preset
in the parameter grid. The grid search then tested seven values of C using 5-fold cross-validation,
which returned C = 1.0 as the preferred value. This produced a cross-validated accuracy of
0.7913, barely a noticeable difference from the pre-tuned model.
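
For reference, the candidate values of C produced by `np.logspace(0, 3, 7)` in the cell below can be reproduced without NumPy. The variable name `c_grid` is an assumption for this sketch.

```python
# Pure-Python equivalent of np.logspace(0, 3, 7):
# 7 points spaced evenly on a log scale from 10**0 to 10**3.
c_grid = [10 ** (3 * i / 6) for i in range(7)]

print([round(c, 2) for c in c_grid])

# With 5-fold cross-validation, each candidate is fitted 5 times.
print(len(c_grid) * 5, "fits")
```

Every value in this grid is at least 1.0, so the search never tries stronger regularization than scikit-learn's default of C = 1.0. The selected value sits on the edge of the grid, which suggests that extending the grid below 1.0 might be worth trying.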


In [None]:
#HYPERPARAMETER TUNING
#Gridsearch for Logistic Regression
from sklearn.model_selection import GridSearchCV

param_grid = {'penalty':['l2'],
                'C':np.logspace(0, 3, 7)
                }
logistic_r = LogisticRegression(max_iter = 2000, n_jobs=-1)

logis_cv = GridSearchCV(logistic_r, param_grid, cv=5, verbose=1)
logis_cv.fit(X_train_v, y_train)

lr_score = logis_cv.score(X_test_v, y_test)
print("Best parameters")
print(logis_cv.best_params_)
print(logis_cv.best_score_)
print("Accuracy:{:.2f}".format(lr_score))

Fitting 5 folds for each of 7 candidates, totalling 35 fits
Best parameters
{'C': 1.0, 'penalty': 'l2'}
0.7912681598985829
Accuracy:0.79


In [None]:
#Getting results from the tuned model

from sklearn import metrics

predict = logis_cv.predict(X_test_v)
print("Results:\n{}".format(metrics.classification_report(y_test, predict)))

Results:
              precision    recall  f1-score   support

           0       0.80      0.78      0.79    200309
           4       0.78      0.81      0.80    199691

    accuracy                           0.79    400000
   macro avg       0.79      0.79      0.79    400000
weighted avg       0.79      0.79      0.79    400000



# 5. Neural Network

### 5.1 Pre-processing for Neural Network

For the neural networks, tokenization and padding were applied. Tokenization is the process of
breaking text into smaller units called tokens, which lets the model work with the sequence of
words in a sentence. For example, ‘what a beautiful code’ would be split into the tokens ‘what’, ‘a’,
‘beautiful’, ‘code’. Padding then adds zeros at the end of each sequence so that all samples have
the same length. As not all sequences have the same number of words, this is required before the
data can be fed into our neural network models.
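
The two steps can be sketched in pure Python as follows. This is a simplified stand-in for the Keras `Tokenizer` and `pad_sequences` used in the cells below: the helper names are assumptions, and unlike the Keras tokenizer configured with an `oov_token`, this sketch simply skips unknown words instead of mapping them to a reserved index.

```python
from collections import Counter


def build_word_index(texts):
    """Assign integer ids by descending frequency; 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def to_padded_sequence(text, word_index, maxlen):
    # Tokenize on whitespace and look up each known word.
    ids = [word_index[w] for w in text.split() if w in word_index]
    ids = ids[:maxlen]                       # truncate at the end ('post')
    return ids + [0] * (maxlen - len(ids))   # pad with zeros at the end ('post')


texts = ["what a beautiful code", "what a day"]
word_index = build_word_index(texts)

print(to_padded_sequence("what a beautiful code", word_index, maxlen=6))
# [1, 2, 3, 4, 0, 0]
```

Reserving 0 for padding keeps padded positions distinguishable from real words, which is also the convention the Keras utilities follow.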


In [34]:
#Tokenization for NN

tokenizer = Tokenizer(num_words=100000, oov_token='<UNK>')

tokenizer.fit_on_texts(data_text)

print(tokenizer.texts_to_sequences([data_text[0]]))

[[342, 1062, 3367, 12, 721, 9502, 1767, 3]]


In [35]:
#Checking the word index length

len(tokenizer.word_index) + 1

335374

In [36]:
#Padding the sequences so that they all have the same length

def getting_sequences(tokenizer, message):
    text_seq= tokenizer.texts_to_sequences(message)
    pad_seq = pad_sequences(text_seq, truncating='post', maxlen=100, padding='post')
    return pad_seq

In [37]:
#Applying padding

data_text_seq = getting_sequences(tokenizer, data_text)

In [38]:
#Print the sequences to see the pad, 0 will be added to reach the maxlen set to 100

data_text_seq[100], data_text_seq[1000]

(array([  694,   233,  5576, 21272,   170,   219,  2828,   219,  7551,
          233,  5576,  1751,   170,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0], dtype=int32),
 array([ 3396, 64805,  1308,    51,   348, 14321,  2471,   386,  1412,
         2344,   203,  4562,    53,   164,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     

In [39]:
# Split data into training and test sets for the NN models

X_train, X_test, y_train, y_test = train_test_split(data_text_seq, labels, test_size=0.25,
                                                    random_state=23) # so we get the same results

In [40]:
X_train

array([[   64,     5,  2348, ...,     0,     0,     0],
       [   60,   168,   362, ...,     0,     0,     0],
       [   82,   398,   149, ...,     0,     0,     0],
       ...,
       [   32,   633,     1, ...,     0,     0,     0],
       [41822,  4631,   536, ...,     0,     0,     0],
       [   31,  9749,   315, ...,     0,     0,     0]], dtype=int32)

### 5.2 LSTM Model

The LSTM model was built using TensorFlow and Keras. I used a sequential Keras model starting
with an Embedding layer, which maps each word to a real-valued vector in a fixed-dimensional
space to obtain a dense representation of words and their relative meaning. I have
defined the Embedding layer with a vocabulary of 100,000 and a vector space of 100 dimensions.

For the LSTM layers, bidirectional layers were used instead of regular ones because they
process the input in both directions. This is useful for text classification, as the network
reads the sequence forwards and backwards, which helps place each word in the right context. Two
layers were stacked, as this is more suitable for detecting complex features; both bidirectional
layers were defined with 64 units.
A sigmoid activation function was used in the final output layer, as it maps the output to a
value between 0 and 1, which suits binary classification problems.

In [41]:
#Building LSTM model

model_1 = tf.keras.models.Sequential([
    
    tf.keras.layers.Embedding(100000, 100),
    
    #Bidirectional layers
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    
    #Final output layer
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model_1.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 100)         10000000  
                                                                 
 bidirectional (Bidirectiona  (None, None, 128)        84480     
 l)                                                              
                                                                 
 bidirectional_1 (Bidirectio  (None, 128)              98816     
 nal)                                                            
                                                                 
 dense (Dense)               (None, 1)                 129       
                                                                 
Total params: 10,183,425
Trainable params: 10,183,425
Non-trainable params: 0
_________________________________________________________________
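
The parameter counts in the summary above can be checked by hand. The sketch below uses the standard LSTM parameter formula (four gates, each with input weights, recurrent weights and a bias); the helper name `lstm_params` is an assumption for this example.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with an input kernel, a recurrent kernel and a bias vector.
    return 4 * (units * (input_dim + units) + units)


embedding = 100_000 * 100                              # vocabulary x dimensions
bi_lstm_1 = 2 * lstm_params(input_dim=100, units=64)   # forward + backward
bi_lstm_2 = 2 * lstm_params(input_dim=128, units=64)   # input is 2 x 64 = 128
dense = 128 * 1 + 1                                    # weights + bias

total = embedding + bi_lstm_1 + bi_lstm_2 + dense
print(embedding, bi_lstm_1, bi_lstm_2, dense, total)
# 10000000 84480 98816 129 10183425
```

Almost all of the parameters sit in the embedding layer, so the vocabulary size and embedding dimension dominate the size of the model.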


In [42]:
#Compiling the model
model_1.compile("adam", "binary_crossentropy", 
              metrics=["accuracy"])

#Fitting the model in the train dataset
model_1.fit(X_train, y_train, validation_split=0.1, batch_size=128, epochs=4)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f69e3c1ba90>

###  5.3 Hyperparameter tuning with Keras Tuner

From Keras Tuner, Hyperband was the selected tuning method, as it trains a considerable number of
models for a few epochs and only continues training the models that perform best on the validation set.

Hyperband was set with a maximum of 10 epochs and validation accuracy as the objective. An early
stopping callback monitors the validation loss and stops training once it has not improved for 5
consecutive epochs.
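
The core idea behind Hyperband is successive halving: many configurations get a small budget, and only the best fraction moves on to a larger one. The following is a simplified sketch of a single bracket under assumed starting values (9 configurations, 1 epoch each). It does not reproduce the exact bracket sizes that Keras Tuner derives from `max_epochs` and `factor`.

```python
def successive_halving_schedule(n_configs, min_epochs, max_epochs, factor=3):
    """List (configs still training, epochs each receives) for one bracket."""
    schedule = []
    configs, epochs = n_configs, min_epochs
    while configs >= 1 and epochs <= max_epochs:
        schedule.append((configs, epochs))
        configs //= factor   # keep only the best 1/factor of the configurations
        epochs *= factor     # give the survivors `factor` times more epochs
    return schedule


schedule = successive_halving_schedule(n_configs=9, min_epochs=1, max_epochs=10)
print(schedule)
# [(9, 1), (3, 3), (1, 9)]

total_epochs = sum(configs * epochs for configs, epochs in schedule)
print(total_epochs)
# 27
```

In this toy bracket the total budget is 27 epochs, compared with 81 epochs if all nine configurations were trained for 9 epochs each.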

In [48]:
# Defining the structure of the model to tune it

def lstm_model(hp):
    model_lstm = Sequential()

    model_lstm.add(tf.keras.layers.Embedding(100000, 100))

    #Dropout to reduce overfitting; hp.Float lets the tuner try different dropout rates.
    #Both dropout layers use the hyperparameter 'Dropout_rate', so they share the same value.
    dropout_rate = hp.Float('Dropout_rate', min_value=0, max_value=0.5, step=0.25)
    model_lstm.add(tf.keras.layers.Dropout(dropout_rate))

    model_lstm.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True)))

    #Second dropout layer, again to reduce overfitting
    model_lstm.add(tf.keras.layers.Dropout(dropout_rate))

    model_lstm.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128)))

    model_lstm.add(tf.keras.layers.Dense(1, activation='sigmoid'))

    # Compiling the model and tuning the learning rate
    model_lstm.compile(
        optimizer=keras.optimizers.Adam(hp.Choice('learning_rate', values=[1e-3, 1e-4])),
        loss='binary_crossentropy', metrics=['accuracy'])
    return model_lstm

In [50]:
#Tuning the parameters with Hyperband with objective set as validation accuracy

tuner_lstm = kt.Hyperband(lstm_model,
                     #overwrite=True,
                     objective='val_accuracy',
                     max_epochs=10,
                     factor=3,
                     directory='my_dir',
                     project_name='tuning_kt'
                    )

In [51]:
#Applying early stopping to avoid overfitting

stop_lstm = tf.keras.callbacks.EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=5)
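
The patience rule used by the callback above can be sketched as follows. This is a simplified pure-Python model of the logic, not the Keras implementation, and the function name and loss values are assumptions.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss   # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1     # no improvement this epoch
            if wait >= patience:
                return epoch
    return len(val_losses)  # never triggered: training runs to the end


# Hypothetical validation losses: best at epoch 2, then slowly worsening.
print(early_stopping_epoch([0.50, 0.40, 0.41, 0.42, 0.43], patience=2))
# 4
```

Under this rule a patience of 5 needs at least six epochs before it can trigger, so a run limited to four or five epochs would never be stopped early by it.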

In [52]:
# tuning and searching for best parameters

tuner_lstm.search(X_train,y_train,epochs=20, batch_size=64, validation_split=0.1, callbacks = [stop_lstm])

#getting the best-performing set of hyperparameters
model_tun_lstm = tuner_lstm.get_best_hyperparameters(num_trials=1)[0]

print(f"""
The optimized dropout rate is {model_tun_lstm.get('Dropout_rate')} and the optimal learning rate for the optimizer
is {model_tun_lstm.get('learning_rate')}.
""")

Trial 6 Complete [00h 20m 38s]
val_accuracy: 0.7873166799545288

Best val_accuracy So Far: 0.7934583425521851
Total elapsed time: 02h 04m 21s
INFO:tensorflow:Oracle triggered exit

The optimized dropout rate is 0.25 and the optimal learning rate for the optimizer
is 0.001.



In [55]:
# Build the tuned model and check the optimal number of epochs
model_ep = tuner_lstm.hypermodel.build(model_tun_lstm)
history = model_ep.fit(X_train, y_train, epochs=5, validation_split=0.1)

#Optimal number of epochs
val_epoch = history.history['val_accuracy']
tuned_epoch = val_epoch.index(max(val_epoch)) + 1
print('Best epoch is: %d' % (tuned_epoch,))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Best epoch is: 2


In [60]:
#Predicting in the test dataset 
acc_lstm = model_ep.evaluate(X_test,y_test)



### 5.4 CNN Model

To set up the CNN model, 1D convolutions were used to scan through the sequence of words, rather
than the 2D convolutions most commonly used for image classification.

The model was created with a 1D convolutional layer with 128 filters, a kernel size of 5 and
‘relu’ activation. A 1D global max pooling layer was added to downsample the input representation,
followed by a dense layer with 10 units and ‘relu’ activation, which is a non-linear activation
function, and lastly a dense output layer with sigmoid activation, as this is a binary
classification problem.
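
To illustrate what the convolution and pooling steps do, here is a minimal pure-Python sketch with a single filter sliding over a one-dimensional signal. The signal, kernel and function names are assumptions; the real layer operates on 200-dimensional embedding vectors and learns 128 filters.

```python
def conv1d_valid(sequence, kernel):
    """Slide the kernel over the sequence without padding (stride 1)."""
    k = len(kernel)
    return [
        sum(sequence[i + j] * kernel[j] for j in range(k))
        for i in range(len(sequence) - k + 1)
    ]


def relu(values):
    return [max(0.0, v) for v in values]


def global_max_pool(values):
    # Keep only the strongest response of the filter over the whole sequence.
    return max(values)


signal = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0]
kernel = [1.0, 0.0, -1.0]   # responds to a falling slope

feature_map = relu(conv1d_valid(signal, kernel))
print(feature_map)
# [0.0, 0.0, 0.0, 2.0, 2.0]
print(global_max_pool(feature_map))
# 2.0
```

Global max pooling reduces each filter to one number regardless of where in the tweet the pattern occurred, which is why the output shape no longer depends on the sequence length.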

In [62]:
#word_index = tokenizer.word_index
len_voca = len(tokenizer.word_index) + 1
print("Vocabulary Size :", len_voca)

Vocabulary Size : 335374


In [63]:
#Defining the parameters for the CNN model

embedding_dim = 200
maxlen=100
model_cnn = Sequential()

#Embedding layer
model_cnn.add(layers.Embedding(len_voca, embedding_dim, input_length=maxlen))

#Convolutional layer
model_cnn.add(layers.Conv1D(128, 5, activation='relu'))

#Global max pooling over the sequence dimension
model_cnn.add(layers.GlobalMaxPooling1D())

#Dense Layer with relu
model_cnn.add(layers.Dense(10, activation='relu'))

#Sigmoid Final output layer
model_cnn.add(layers.Dense(1, activation='sigmoid'))

#Compiling the model
model_cnn.compile(optimizer='adam',
           loss='binary_crossentropy',
           metrics=['accuracy'])
model_cnn.summary() 

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, 100, 200)          67074800  
                                                                 
 conv1d (Conv1D)             (None, 96, 128)           128128    
                                                                 
 global_max_pooling1d (Globa  (None, 128)              0         
 lMaxPooling1D)                                                  
                                                                 
 dense_2 (Dense)             (None, 10)                1290      
                                                                 
 dense_3 (Dense)             (None, 1)                 11        
                                                                 
Total params: 67,204,229
Trainable params: 67,204,229
Non-trainable params: 0
__________________________________________
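
The shapes and parameter counts in this summary follow from simple arithmetic, sketched below. The helper names are assumptions; the formulas are the usual ones for a convolution without padding and with stride 1.

```python
def conv1d_output_length(seq_len, kernel_size):
    # No padding, stride 1: the kernel fits in seq_len - kernel_size + 1 positions.
    return seq_len - kernel_size + 1


def conv1d_params(in_channels, filters, kernel_size):
    # One kernel of size (kernel_size x in_channels) per filter, plus one bias each.
    return kernel_size * in_channels * filters + filters


vocab_size, embedding_dim, maxlen = 335_374, 200, 100

embedding = vocab_size * embedding_dim
conv_len = conv1d_output_length(maxlen, kernel_size=5)
conv = conv1d_params(embedding_dim, filters=128, kernel_size=5)
dense_hidden = 128 * 10 + 10
dense_out = 10 * 1 + 1

total = embedding + conv + dense_hidden + dense_out
print(conv_len, embedding, conv, dense_hidden, dense_out, total)
# 96 67074800 128128 1290 11 67204229
```

Because the embedding here covers the full vocabulary of 335,374 words at 200 dimensions, it accounts for nearly all of the roughly 67 million parameters.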

In [64]:
#Fitting the model with 3 epochs and batch size = 128 so the model can train faster

EPOCHS=3
BATCH_SIZE=128

model_cnn.fit(X_train, y_train,
          epochs=EPOCHS, 
          validation_split=0.1,
          batch_size=BATCH_SIZE, 
          verbose=1)


Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f69d2b45e90>

### 5.5 Hyperparameter tuning for CNN

For the tuning process, random search from Keras Tuner was used instead of Hyperband, which was
previously used to tune the LSTM. Random search samples random combinations of hyperparameter
values to find the best parameters for building the optimized model.

The maximum number of trials sets how many model configurations are tested. In the first trial,
for example, the algorithm might run with 128 convolutional filters, a kernel size of 5, a dropout
rate of 0.25, 10 dense units and a learning rate of 0.001. The second trial then combines different
values, and at the end the model is re-defined with the values that gave the best validation accuracy.
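
To give a sense of scale, the sketch below enumerates the search space defined in the tuning cell that follows and draws four random configurations from it. It is a plain-Python illustration of random search, not Keras Tuner's sampler; the dictionary keys are readable stand-ins for the hyperparameter names in the code, and the seed is arbitrary.

```python
import itertools
import random

# Candidate values per hyperparameter, as implied by the ranges and steps
# in the tuning function (key names are illustrative).
search_space = {
    "filters": [64, 128],
    "kernel_size": [3, 5],
    "dropout": [0.0, 0.25, 0.5],
    "units": [10, 20],
    "learning_rate": [1e-2, 1e-3],
}

# Every possible combination of the candidate values.
all_configs = [
    dict(zip(search_space, values))
    for values in itertools.product(*search_space.values())
]
print(len(all_configs))
# 48

# Random search evaluates only a random subset of them.
rng = random.Random(0)
trials = rng.sample(all_configs, k=4)
for config in trials:
    print(config)
```

With four trials, only 4 of the 48 possible combinations are evaluated, so the best configuration found is not guaranteed to be the best in the space.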

In [65]:
# generating validation set

X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, train_size=0.9)

In [66]:
#Tuning the parameters for CNN

embedding_dim = 200

def tuning_model(hp):
    model_h = Sequential()
    model_h.add(layers.Embedding(len_voca, embedding_dim, input_length=100))
    
    # Random search will test different filter values and kernel sizes
    model_h.add(layers.Conv1D(filters=hp.Int('filters',
                                        min_value=64,
                                        max_value=128,
                                        step = 64),
                kernel_size=hp.Choice('conv_1_filter', values = [3,5]), activation='relu'))
    
    model_h.add(layers.GlobalMaxPooling1D())
    
    #Dropout layer added to avoid overfitting; random search will try the different values shown below
    model_h.add(layers.Dropout(rate=hp.Float('dropout_1', min_value = 0.0, max_value = 0.5, default=0.25, step=0.25,)))
    
    
    model_h.add(layers.Dense(units=hp.Int('units',
                                        min_value=10,
                                        max_value=20,
                                        step=10),
                           activation='relu'))
    
    model_h.add(layers.Dense(1, activation='sigmoid'))
    
    #Compiling the model
    model_h.compile(
    optimizer=keras.optimizers.Adam(hp.Choice('learning_rate', values=[1e-2, 1e-3])), 
        loss='binary_crossentropy', metrics=['accuracy'])
    
    
    return model_h

In [67]:
#Random search will find the best model/parameters

tuner = RandomSearch(tuning_model,
                     #overwrite=True,
                     objective='val_accuracy',
                     max_trials = 4,
                     executions_per_trial=1,
                     directory='cnn_model'
                    )

In [68]:
#Applying early stopping to monitor the validation loss and stop the model

stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=5)

In [69]:
# tuning and searching for best parameters

tuner.search(X_train,y_train,epochs=4, batch_size=128, callbacks = [stop],
             validation_data = (X_valid, y_valid))
model_tun = tuner.get_best_models(num_models=1)[0]

#summary of best model
model_tun.summary()

Trial 4 Complete [00h 12m 30s]
val_accuracy: 0.7889083623886108

Best val_accuracy So Far: 0.7889083623886108
Total elapsed time: 00h 49m 53s
INFO:tensorflow:Oracle triggered exit
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 100, 200)          67074800  
                                                                 
 conv1d (Conv1D)             (None, 98, 128)           76928     
                                                                 
 global_max_pooling1d (Globa  (None, 128)              0         
 lMaxPooling1D)                                                  
                                                                 
 dropout (Dropout)           (None, 128)               0         
                                                                 
 dense (Dense)               (None, 10)                1290      
        

In [70]:
#Predicting with the test data set
accuracy = model_tun.evaluate(X_test,y_test)



# Conclusion

There is great potential for future improvement by investigating other pre-processing steps to
optimise the models.

For cleaning in particular, the removal of non-English words and abbreviations could be extended.
Other areas include the use of other word embedding tools such as GloVe or Word2Vec. Additionally,
several other model architectures for Long Short-Term Memory (LSTM) and Convolutional Neural Network
(CNN) models could be trialled and tested, which may improve the models' performance.
