<a href="https://colab.research.google.com/github/cleopatra27/feed_forward_neural-network_binary_classifier/blob/main/binary_sentiment_classifier_using_a_feed_forward_neural%C2%A0network.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Building a binary sentiment classifier using a feed-forward neural network.**

We will treat only word unigrams as features for our neural network classifier. The network will have an input layer, two hidden layers, and an output layer. After training the classifier, we use 10-fold cross-validation to optimize parameters, with “accuracy” as the metric. For example, we can choose the activation function and the number of nodes in the first hidden layer.

We will then take the parameters from the best-performing model and train the network again on our whole training corpus.

Finally, we will classify each review in the test set as either positive or negative using the best-performing classifier.

Import TensorFlow

In [None]:
import tensorflow as tf
print("TensorFlow version:", tf.__version__)

TensorFlow version: 2.6.0


Load the training data from my Google Drive

In [None]:
import glob
files = glob.glob("/content/drive/MyDrive/imdb/train/*")

I loaded the data into a data frame and set the sentiment label to 1 for positive and 0 for negative.

In [None]:
import pandas as pd

rows = []
for folder in files:
  for file in glob.glob(folder + '/*'):
    # utf-8-sig drops a leading BOM if present; undecodable bytes are ignored
    with open(file, 'r', encoding="utf-8-sig", errors='ignore') as f:
      # label is 1 when the path contains "pos", otherwise 0
      targ = 1 if "pos" in file else 0
      # one row per line of the file (blank lines become empty rows)
      for line in f.read().split('\n'):
        rows.append([line, targ])

df = pd.DataFrame(rows, columns=["text", "sentiment"])
df

Unnamed: 0,text,sentiment
0,"tristar / 1 : 30 / 1997 / r ( language , viole...",0
1,,0
2,the brady bunch movie is less a motion picture...,0
3,,0
4,"i'm going to keep this plot summary brief , so...",0
...,...,...
2261,,1
2262,what starts out as a monotonous talking-head m...,1
2263,,1
2264,jackie brown ( miramax - 1997 ) starring pam g...,1
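
The labelling rule above checks whether "pos" occurs anywhere in the file path. A minimal sketch of a slightly stricter variant, which looks only at the parent folder name, might look like this. The helper name `label_from_path` and both example paths are hypothetical, and the sketch assumes the class folders are literally named `pos` and `neg`, which the notebook does not confirm.

```python
from pathlib import PurePosixPath

def label_from_path(path):
    """Return 1 if the file sits in a folder named 'pos', else 0 (assumed layout)."""
    return 1 if PurePosixPath(path).parent.name == "pos" else 0

# Hypothetical paths, used only for illustration
pos_label = label_from_path("/content/drive/MyDrive/imdb/train/pos/review_1.txt")
neg_label = label_from_path("/content/drive/MyDrive/imdb/train/neg/post_mortem.txt")
print(pos_label)  # 1
print(neg_label)  # 0, although the file name itself contains "pos"
```

With the substring rule, the second path would be labelled positive because of the file name, so the folder-based check is a little more robust.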


Preview of the total positive and negative reviews we have in our data

In [None]:
print((df.sentiment == 1).sum())  # positive
print((df.sentiment == 0).sum())  # negative

1132
1134
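
Since the two classes are almost perfectly balanced, a classifier that always predicts the larger class sets a useful floor for the accuracy numbers reported later. A small worked example using the counts printed above:

```python
n_pos, n_neg = 1132, 1134           # counts printed above
total = n_pos + n_neg
baseline_acc = max(n_pos, n_neg) / total

print(total)                         # 2266
print(round(baseline_acc * 100, 2))  # 50.04
```

Any model accuracy should be read against this roughly 50% baseline.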


Next, I cleaned up the data, starting with punctuation and URLs.
Here, I defined one function to remove punctuation and another to remove URLs.

In [None]:
import re
import string 

def remove_URL(text):
  url = re.compile(r"https?:\/\/.*[\r\n]*")
  return url.sub(r"", text)

def remove_punctuation(text):
  translator = str.maketrans("", "", string.punctuation)
  return text.translate(translator)

string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
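
To make the behaviour of the two helpers concrete, here is a small sketch on made-up strings. Note that the URL pattern used above ends in `.*`, so it removes everything from the URL to the end of the line. The second pattern, `token_url`, is not from the notebook. It is only an assumed alternative that stops at the first whitespace.

```python
import re
import string

greedy_url = re.compile(r"https?:\/\/.*[\r\n]*")  # pattern used in remove_URL
token_url = re.compile(r"https?://\S+")           # assumption: stop at whitespace

sample = "see https://example.com/page for more"  # made-up sentence
greedy_result = greedy_url.sub("", sample)
token_result = token_url.sub("", sample)
print(repr(greedy_result))  # 'see '
print(repr(token_result))   # 'see  for more'

# Punctuation removal deletes the characters without inserting spaces
no_punct = "i'm going , so".translate(str.maketrans("", "", string.punctuation))
print(repr(no_punct))       # 'im going  so'
```

Which URL pattern is preferable depends on whether text after a URL on the same line should be kept.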

In [None]:
pattern = re.compile(r"https?:\/\/.*[\r\n]*")
for t in df.text:
  matches = pattern.findall(t)
  for match in matches:
    print(t)
    print(match)
    print(pattern.sub(r"", t))
  if len(matches) > 0:
    break

Cleaned the text by calling the punctuation and URL removal functions

In [None]:
df["text"] = df.text.map(remove_URL)
df["text"] = df.text.map(remove_punctuation)

Implemented a function to remove stop words

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop = set(stopwords.words("english"))

def remove_stopwords(text):
  filtered_words = [word.lower() for word in text.split() if word.lower() not in stop]
  return " ".join(filtered_words)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Preview of the stop words

In [None]:
stop

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

Removed stopwords

In [None]:
df["text"] = df.text.map(remove_stopwords)
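
One consequence of the order of the cleaning steps is worth spelling out. Punctuation is stripped before stop words are removed, so a contraction such as "don't" has already become "dont" by the time it is compared against the stop list, and it survives. This is consistent with tokens like `dont` and `doesnt` showing up in the word index further down. The sketch below uses a small hand-picked subset of the stop list previewed above and a made-up sentence, and the helper names `strip_punct` and `drop_stop` are hypothetical.

```python
import string

# Small hand-picked subset of the stop list previewed above (illustration only)
stop_subset = {"i", "any", "of", "it", "don", "don't"}

def strip_punct(text):
    return text.translate(str.maketrans("", "", string.punctuation))

def drop_stop(text, stop):
    return " ".join(w.lower() for w in text.split() if w.lower() not in stop)

sample = "i don't like any of it"  # made-up sentence

punct_first = drop_stop(strip_punct(sample), stop_subset)  # order used in this notebook
stop_first = strip_punct(drop_stop(sample, stop_subset))   # alternative order
print(punct_first)  # dont like
print(stop_first)   # like
```

Whether keeping negations such as "dont" helps or hurts sentiment classification is an empirical question, and the sketch only shows that the two orders are not equivalent.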

Now the shape of our dataframe is

In [None]:
df.shape

(2266, 2)

Preview of our text column

In [None]:
df.text

0       tristar 1 30 1997 r language violence dennis r...
1                                                        
2       brady bunch movie less motion picture minor po...
3                                                        
4       im going keep plot summary brief something wis...
                              ...                        
2261                                                     
2262    starts monotonous talkinghead musical history ...
2263                                                     
2264    jackie brown miramax 1997 starring pam grier s...
2265                                                     
Name: text, Length: 2266, dtype: object

I prepared the data in a format our model can take by tokenizing the words.

We start by getting our unique words and implementing a word frequency counter

In [None]:
from collections import Counter

def counter_word(text_col):
  count = Counter()
  for text in text_col.values:
    for word in text.split():
      count[word] += 1
  return count

counter = counter_word(df.text)

The most common words

In [None]:
counter.most_common(5)

[('film', 5034),
 ('movie', 3249),
 ('one', 3012),
 ('like', 2036),
 ('even', 1421)]

Our number of unique words

In [None]:
num_unique_words = len(counter)

Tokenizing the texts 

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=num_unique_words)
tokenizer.fit_on_texts(df.text)

Preview the word index

In [None]:
word_index = tokenizer.word_index

In [None]:
word_index

{'film': 1,
 'movie': 2,
 'one': 3,
 'like': 4,
 'even': 5,
 'time': 6,
 'good': 7,
 'films': 8,
 'also': 9,
 'story': 10,
 'much': 11,
 'get': 12,
 'characters': 13,
 'would': 14,
 'two': 15,
 'first': 16,
 'character': 17,
 'see': 18,
 'well': 19,
 'way': 20,
 'really': 21,
 'make': 22,
 'little': 23,
 'people': 24,
 'plot': 25,
 'movies': 26,
 'never': 27,
 'life': 28,
 'could': 29,
 'bad': 30,
 'scene': 31,
 'new': 32,
 'man': 33,
 'know': 34,
 'director': 35,
 'many': 36,
 'dont': 37,
 'hes': 38,
 'best': 39,
 'great': 40,
 'scenes': 41,
 'doesnt': 42,
 'another': 43,
 'us': 44,
 'action': 45,
 'love': 46,
 'something': 47,
 'made': 48,
 'theres': 49,
 'minutes': 50,
 'still': 51,
 'back': 52,
 'john': 53,
 'cast': 54,
 'go': 55,
 'makes': 56,
 'end': 57,
 'years': 58,
 'seems': 59,
 'however': 60,
 'work': 61,
 'things': 62,
 'every': 63,
 'since': 64,
 'actually': 65,
 'gets': 66,
 'going': 67,
 'big': 68,
 'around': 69,
 'better': 70,
 'role': 71,
 'think': 72,
 'may': 73,
 'se
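
The word index above follows the frequency ranking from `counter.most_common(5)`: 'film' is the most frequent word and gets index 1, 'movie' gets 2, and so on. Index 0 is never assigned to a word, which leaves it free for padding. A plain-Python sketch of that ranking idea on a toy corpus follows. It is an illustration of the principle, not the Keras implementation.

```python
from collections import Counter

texts = ["film good film", "film good story"]  # toy corpus, not the real data

counts = Counter(word for text in texts for word in text.split())
# Rank words by frequency, starting at 1 so that 0 stays unused
word_index = {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}
sequences = [[word_index[w] for w in text.split()] for text in texts]

print(word_index)  # {'film': 1, 'good': 2, 'story': 3}
print(sequences)   # [[1, 2, 1], [1, 2, 3]]
```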

Converting our texts to sequences of word indices

In [None]:
train_sequences = tokenizer.texts_to_sequences(df.text)

In [None]:
print(train_sequences[10:15])

[[1625, 11007, 1782, 1, 533, 8576, 12945, 583, 781, 222, 19, 29, 3950, 564, 8577, 1008, 1782, 832, 295, 82, 11008, 3765, 6465, 77, 1435, 3441, 70, 237, 2902, 3442, 21270, 5958, 21271, 2804, 1436, 704, 1497, 2213, 3017, 5182, 220, 12946, 1522, 5958, 5183, 2034, 207, 95, 781, 1250, 30, 62, 510, 2214, 2034, 794, 1219, 484, 9611, 15938, 5184, 4135, 39, 5547, 377, 5959, 209, 2214, 21272, 2908, 240, 972, 1437, 748, 2148, 3951, 2342, 243, 166, 4583, 1601, 141, 22, 1035, 9612, 125, 28, 249, 2343, 12947, 39, 279, 1924, 3766, 21273, 66, 16, 4584, 266, 2707, 84, 69, 3767, 1601, 531, 1469, 76, 77, 1435, 3441, 912, 1662, 479, 251, 1, 60, 1376, 517, 1319, 732, 371, 3145, 21274, 251, 135, 146, 21275, 18, 283, 173, 2909, 47, 3768, 21276, 77, 1435, 3441, 878, 517, 759, 913, 1045, 4585, 1224, 164, 4586, 10, 1251, 6466, 6467, 135, 12, 3146, 2, 108, 135, 766, 40, 54, 2342, 4135, 15938, 1038, 340, 958, 23, 3294, 6466, 518, 39, 279, 972, 1437, 16, 4584, 649, 222, 11, 70, 26, 65, 146, 4, 8578, 8, 5960, 41, 1

Padding our sequence.

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

# max number of words in a sequence
max_length = 20

train_paded = pad_sequences(train_sequences, maxlen=max_length, padding="post", truncating="post")

In [None]:
train_paded.shape

(2266, 20)
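
With `padding="post"` and `truncating="post"`, short sequences are filled with zeros at the end and long sequences lose everything after the first `max_length` tokens. The following plain-Python sketch mirrors that behaviour on made-up sequences. The helper name `pad_post` is hypothetical.

```python
def pad_post(seq, maxlen, value=0):
    """Truncate from the end, then right-pad with `value` up to `maxlen`."""
    kept = seq[:maxlen]
    return kept + [value] * (maxlen - len(kept))

short = [5, 9, 2]
long_seq = list(range(1, 31))  # 30 made-up token ids

padded_short = pad_post(short, 6)
padded_long = pad_post(long_seq, 6)
print(padded_short)  # [5, 9, 2, 0, 0, 0]
print(padded_long)   # [1, 2, 3, 4, 5, 6]
```

With `max_length = 20`, a long review such as the one at index 10 printed earlier keeps only its first 20 tokens.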

Cross-checking the encoding by decoding a sequence back into words

In [None]:
reverse_word_i = dict([(idx, word) for (word, idx) in word_index.items()])
reverse_word_i[10]

'story'

In [None]:
def decode(sequence):
  return " ".join([reverse_word_i.get(idx, "?") for idx in sequence])

In [None]:
decoded_text = decode(train_sequences[10])

print(train_sequences[10])
print(decoded_text)

[1625, 11007, 1782, 1, 533, 8576, 12945, 583, 781, 222, 19, 29, 3950, 564, 8577, 1008, 1782, 832, 295, 82, 11008, 3765, 6465, 77, 1435, 3441, 70, 237, 2902, 3442, 21270, 5958, 21271, 2804, 1436, 704, 1497, 2213, 3017, 5182, 220, 12946, 1522, 5958, 5183, 2034, 207, 95, 781, 1250, 30, 62, 510, 2214, 2034, 794, 1219, 484, 9611, 15938, 5184, 4135, 39, 5547, 377, 5959, 209, 2214, 21272, 2908, 240, 972, 1437, 748, 2148, 3951, 2342, 243, 166, 4583, 1601, 141, 22, 1035, 9612, 125, 28, 249, 2343, 12947, 39, 279, 1924, 3766, 21273, 66, 16, 4584, 266, 2707, 84, 69, 3767, 1601, 531, 1469, 76, 77, 1435, 3441, 912, 1662, 479, 251, 1, 60, 1376, 517, 1319, 732, 371, 3145, 21274, 251, 135, 146, 21275, 18, 283, 173, 2909, 47, 3768, 21276, 77, 1435, 3441, 878, 517, 759, 913, 1045, 4585, 1224, 164, 4586, 10, 1251, 6466, 6467, 135, 12, 3146, 2, 108, 135, 766, 40, 54, 2342, 4135, 15938, 1038, 340, 958, 23, 3294, 6466, 518, 39, 279, 972, 1437, 16, 4584, 649, 222, 11, 70, 26, 65, 146, 4, 8578, 8, 5960, 41, 12

Train our model with an input layer, two hidden layers, and an output layer.

Our input data is simply vectors, and our labels are scalars (1 and 0). We use two Dense layers with relu activations as our hidden layers, Dense(20, activation='relu') and Dense(10, activation="relu"), and Dense(1, activation="sigmoid") as our output layer.
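
As a small illustration of what the sigmoid output and the binary cross-entropy loss compute, consider the sketch below. The 0.5 decision threshold is an assumption, since the notebook does not state a cutoff. It also shows why a loss close to ln 2 ≈ 0.6931, like the first-epoch value in the training log further down, corresponds to predicting about 0.5 for every example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, p):
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

p = sigmoid(0.0)                         # a pre-activation of 0 gives exactly 0.5
chance_loss = round(binary_crossentropy(1, p), 4)
predicted = 1 if sigmoid(2.0) >= 0.5 else 0  # thresholding at 0.5 (assumed cutoff)

print(p)            # 0.5
print(chance_loss)  # 0.6931
print(predicted)    # 1
```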

Word embeddings give us an efficient, dense representation in which similar words have a similar encoding, and we won't have to specify that encoding manually. So we will use an Embedding layer, which takes as input an integer matrix of size (batch, input_length).
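
Conceptually, an embedding layer is a lookup table with one trainable vector per word index. The sketch below imitates that lookup in plain Python with toy sizes, and it is not the Keras layer itself. The ±0.05 initial range matches the RandomUniform initializer shown in the model configuration later in this notebook.

```python
import random

random.seed(0)
vocab_size, embed_dim = 8, 4  # toy sizes; the notebook uses 20-dimensional vectors

# The embedding table: one vector per word index
table = [[random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
         for _ in range(vocab_size)]

batch = [[1, 3, 0], [2, 2, 5]]  # integer matrix of size (batch=2, input_length=3)
embedded = [[table[idx] for idx in row] for row in batch]

shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))
same_vector = embedded[1][0] == embedded[1][1]
print(shape)        # (2, 3, 4)
print(same_vector)  # True: the same index always maps to the same vector
```

So an integer input of shape (batch, input_length) becomes a float tensor of shape (batch, input_length, embedding_dim).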

We also use 10-fold cross-validation to optimize parameters, with “accuracy” as the metric.

In [None]:
from tensorflow import keras
from tensorflow.keras import layers, optimizers, losses, metrics
from tensorflow.keras.callbacks import ModelCheckpoint
from sklearn.model_selection import KFold
import numpy as np

labels = np.asarray(df['sentiment']).astype('float32')

# Define the K-fold cross validator
kfold = KFold(n_splits=10, shuffle=True)

# Per-fold score containers
acc_per_fold = []
loss_per_fold = []

# K-fold cross-validation model evaluation
fold_no = 1
for train, test in kfold.split(df["text"], df["sentiment"]):
  model = keras.models.Sequential()

  model.add(layers.Embedding(num_unique_words, 20, input_length=max_length)) # embedding (input) layer
  model.add(layers.Dense(20, activation="relu")) # hidden layer 1
  model.add(layers.Dense(10, activation="relu")) # hidden layer 2
  model.add(layers.Dense(1, activation="sigmoid")) # output layer

  model.compile(optimizer=optimizers.RMSprop(learning_rate=0.001),
                loss=losses.binary_crossentropy,
                metrics=[metrics.binary_accuracy])

  # Checkpoint callback: save the model whenever the training loss improves
  filepath = 'my_best_model.hdf5'
  checkpoint = ModelCheckpoint(filepath=filepath,
                              monitor='loss',
                              verbose=1,
                              save_best_only=True,
                              mode='min')
  callbacks = [checkpoint]

  # Fit the model on this fold's training split only
  history = model.fit(train_paded[train], labels[train],
              batch_size=50,
              epochs=20,
              verbose=1,
              callbacks=callbacks)

  # Evaluate on this fold's held-out split
  scores = model.evaluate(train_paded[test], labels[test], verbose=0)
  print(f'Score for fold {fold_no}: {model.metrics_names[0]} of {scores[0]}; {model.metrics_names[1]} of {scores[1]*100}%')
  acc_per_fold.append(scores[1] * 100)
  loss_per_fold.append(scores[0])

  # Increase fold number
  fold_no = fold_no + 1

Epoch 1/20

Epoch 00001: loss improved from inf to 0.69330, saving model to my_best_model.hdf5
Epoch 2/20

Epoch 00002: loss improved from 0.69330 to 0.69171, saving model to my_best_model.hdf5
Epoch 3/20

Epoch 00003: loss improved from 0.69171 to 0.68887, saving model to my_best_model.hdf5
Epoch 4/20

Epoch 00004: loss improved from 0.68887 to 0.68289, saving model to my_best_model.hdf5
Epoch 5/20

Epoch 00005: loss improved from 0.68289 to 0.67199, saving model to my_best_model.hdf5
Epoch 6/20

Epoch 00006: loss improved from 0.67199 to 0.66041, saving model to my_best_model.hdf5
Epoch 7/20

Epoch 00007: loss improved from 0.66041 to 0.64773, saving model to my_best_model.hdf5
Epoch 8/20

Epoch 00008: loss improved from 0.64773 to 0.63588, saving model to my_best_model.hdf5
Epoch 9/20

Epoch 00009: loss improved from 0.63588 to 0.62640, saving model to my_best_model.hdf5
Epoch 10/20

Epoch 00010: loss improved from 0.62640 to 0.61987, saving model to my_best_model.hdf5
Epoch 11/20
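
The bookkeeping behind 10-fold cross-validation can be sketched without any libraries. Each fold's model would be fitted on the rows selected by the training indices and scored on the held-out rows, so every sample is used for evaluation exactly once. The helper `kfold_indices` below is a hypothetical stand-in that uses strided folds. It is not how scikit-learn's `KFold` partitions the data internally.

```python
import random

def kfold_indices(n_samples, n_splits, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is held out exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_splits] for i in range(n_splits)]
    for k in range(n_splits):
        test_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx

fold_sizes = []
seen = []
for train_idx, test_idx in kfold_indices(n_samples=2266, n_splits=10):
    assert not set(train_idx) & set(test_idx)  # no overlap inside a fold
    fold_sizes.append(len(test_idx))
    seen.extend(test_idx)

print(fold_sizes)                          # six folds of 227 rows, four of 226
print(sorted(seen) == list(range(2266)))   # True
```

Averaging the per-fold accuracies then gives one score per candidate setting, which is what makes different parameter choices comparable.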



Parameter optimization flow: I first tried using a plain Dense layer as my input layer, but found that accuracy and loss were better with an Embedding layer as the input layer, so I had to go back and work out the hyperparameters an Embedding layer needs.

I also had trouble deciding how many nodes to assign to my layers, but finally settled on the size of the input X.

Parameters for the best model:

In [None]:
from keras.models import Sequential, load_model

model = load_model("my_best_model.hdf5")
model.get_config()

{'layers': [{'class_name': 'InputLayer',
   'config': {'batch_input_shape': (None, 20),
    'dtype': 'float32',
    'name': 'embedding_29_input',
    'ragged': False,
    'sparse': False}},
  {'class_name': 'Embedding',
   'config': {'activity_regularizer': None,
    'batch_input_shape': (None, 20),
    'dtype': 'float32',
    'embeddings_constraint': None,
    'embeddings_initializer': {'class_name': 'RandomUniform',
     'config': {'maxval': 0.05, 'minval': -0.05, 'seed': None}},
    'embeddings_regularizer': None,
    'input_dim': 37049,
    'input_length': 20,
    'mask_zero': False,
    'name': 'embedding_29',
    'output_dim': 20,
    'trainable': True}},
  {'class_name': 'Dense',
   'config': {'activation': 'relu',
    'activity_regularizer': None,
    'bias_constraint': None,
    'bias_initializer': {'class_name': 'Zeros', 'config': {}},
    'bias_regularizer': None,
    'dtype': 'float32',
    'kernel_constraint': None,
    'kernel_initializer': {'class_name': 'GlorotUniform',

**2.2 Test your feed-forward Neural Network**

Your goal for this part of the assignment is to test your neural network on the “training set”.

1. Use the parameters from the best-performing model in Section 2.1 of this assignment and train the neural network on your whole training corpus.
2. Report your accuracy on the entire training set.

In [None]:
best_model = load_model("my_best_model.hdf5")

# Fit data to model
history = best_model.fit(train_paded, labels,
            batch_size=50,
            epochs=20,
            verbose=1)

# Report loss and accuracy on the entire training set
scores = best_model.evaluate(train_paded, labels, verbose=0)
print(scores)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
[0.5747664570808411, 0.6167696714401245]


We see a small improvement after retraining.
Score for fold 10: loss of 0.5854668617248535; binary_accuracy of 61.59312129020691%


Score after retraining on the training set with the best model:
[loss - 0.5747664570808411, binary_accuracy - 0.6167696714401245]
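
To put a number on the size of that improvement, the two score lists can be paired with their metric names and compared directly. The values below are copied from the outputs above, and the variable names are only for illustration.

```python
metric_names = ["loss", "binary_accuracy"]

fold10 = dict(zip(metric_names, [0.5854668617248535, 0.6159312129020691]))
retrained = dict(zip(metric_names, [0.5747664570808411, 0.6167696714401245]))

loss_drop = fold10["loss"] - retrained["loss"]
acc_gain_pp = (retrained["binary_accuracy"] - fold10["binary_accuracy"]) * 100

print(round(loss_drop, 4))    # 0.0107
print(round(acc_gain_pp, 2))  # 0.08 percentage points
```

The gain in accuracy is therefore under a tenth of a percentage point.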



