# Text Classification Model for Sentiment Labelled Sentences Using TensorFlow Take 5
### David Lowe
### January 22, 2021

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery. [https://machinelearningmastery.com/]

SUMMARY: This project aims to construct a text classification model using a neural network and document the end-to-end steps using a template. The Sentiment Labelled Sentences dataset presents a binary classification problem in which we attempt to predict one of two possible outcomes.

INTRODUCTION: This dataset was created for the research paper 'From Group to Individual Labels using Deep Features,' Kotzias et al., KDD 2015. For each website, the researchers randomly selected 500 positive and 500 negative sentences from a larger dataset of reviews. They also attempted to choose sentences with a clearly positive or negative connotation, as the goal was to avoid selecting neutral sentences.

From iteration Take1, we deployed a bag-of-words model to classify the Amazon dataset's review comments. We also applied various sequence-to-matrix modes to evaluate the model's performance.

From iteration Take2, we deployed a word-embedding model to classify the Amazon dataset's review comments. We also compared the result with the bag-of-words model from the previous iteration.

From iteration Take3, we deployed a bag-of-words model to classify the IMDB dataset's review comments. We also applied various sequence-to-matrix modes to evaluate the model's performance.

From iteration Take4, we deployed a word-embedding model to classify the IMDB dataset's review comments. We also compared the result with the bag-of-words model from the previous iteration.

In this Take5 iteration, we will deploy a bag-of-words model to classify the Yelp dataset's review comments. We will also apply various sequence-to-matrix modes to evaluate the model's performance.
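
The sequence-to-matrix modes mentioned above differ only in the value stored for each vocabulary word in a document's row. The sketch below reimplements the four mode names (`binary`, `count`, `freq`, `tfidf`) in plain Python on a toy corpus. The helper `encode_bag_of_words` and the toy sentences are hypothetical, the sketch omits the reserved column 0 that the Keras matrix carries, and the `tfidf` branch uses the weighting we understand the Keras `Tokenizer` to apply; treat that formula as an assumption rather than a statement about the library.

```python
import math
from collections import Counter

MODES = ("binary", "count", "freq", "tfidf")

def encode_bag_of_words(docs, mode="binary"):
    """Encode whitespace-tokenised docs as rows over a shared, sorted vocabulary."""
    if mode not in MODES:
        raise ValueError("unknown mode: " + mode)
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({w for tokens in tokenised for w in tokens})
    # number of documents containing each word (only used by tfidf)
    doc_freq = Counter(w for tokens in tokenised for w in set(tokens))
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        row = []
        for word in vocab:
            c = counts[word]
            if c == 0:
                row.append(0.0)
            elif mode == "binary":
                row.append(1.0)                   # word present or not
            elif mode == "count":
                row.append(float(c))              # raw occurrences
            elif mode == "freq":
                row.append(c / len(tokens))       # share of the document
            else:
                # assumed weighting: (1 + ln count) * ln(1 + N / (1 + doc freq))
                tf = 1 + math.log(c)
                idf = math.log(1 + len(docs) / (1 + doc_freq[word]))
                row.append(tf * idf)
        rows.append(row)
    return vocab, rows

toy_docs = ["great food great service", "food was bad"]
for mode in MODES:
    vocab, rows = encode_bag_of_words(toy_docs, mode)
    print(mode, vocab, rows[0])
```

Because every mode produces a matrix of the same shape, the same network definition can be reused across modes and only the cell values change.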

ANALYSIS: From iteration Take1, the bag-of-words model's performance achieved an average accuracy score of 77.31% after 25 epochs with ten iterations of cross-validation. Furthermore, the final model processed the test dataset with an accuracy measurement of 71.00%.

From iteration Take2, the word-embedding model's performance achieved an average accuracy score of 73.25% after 25 epochs with ten iterations of cross-validation. Furthermore, the final model processed the test dataset with an accuracy measurement of 67.00%.

From iteration Take3, the bag-of-words model's performance achieved an average accuracy score of 77.26% after 25 epochs with ten iterations of cross-validation. Furthermore, the final model processed the test dataset with an accuracy measurement of 68.66%.

From iteration Take4, the word-embedding model's performance achieved an average accuracy score of 72.84% after 25 epochs with ten iterations of cross-validation. Furthermore, the final model processed the test dataset with an accuracy measurement of 66.00%.

In this Take5 iteration, the bag-of-words model's performance achieved an average accuracy score of 75.19% after 25 epochs with ten iterations of cross-validation. Furthermore, the final model processed the test dataset with an accuracy measurement of 72.00%.

CONCLUSION: In this modeling iteration, the bag-of-words TensorFlow model appeared to be suitable for modeling this dataset. We should consider experimenting with TensorFlow for further modeling.

Dataset Used: Sentiment Labelled Sentences

Dataset ML Model: Binary class text classification with text-oriented features

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences

A deep-learning text classification project generally can be broken down into five major tasks:

1. Prepare Environment
2. Load and Prepare Text Data
3. Define and Train Models
4. Evaluate and Optimize Models
5. Finalize Model and Make Predictions

# Task 1 - Prepare Environment

In [1]:
# # Install the packages that support environment variables, SQL databases, and AWS access
# !pip install python-dotenv PyMySQL boto3

In [2]:
# # Retrieve GPU configuration information from Colab
# gpu_info = !nvidia-smi
# gpu_info = '\n'.join(gpu_info)
# if gpu_info.find('failed') >= 0:
#     print('Select the Runtime → "Change runtime type" menu to enable a GPU accelerator, ')
#     print('and then re-execute this cell.')
# else:
#     print(gpu_info)

In [3]:
# # Retrieve memory configuration information from Colab
# from psutil import virtual_memory
# ram_gb = virtual_memory().total / 1e9
# print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

# if ram_gb < 20:
#     print('To enable a high-RAM runtime, select the Runtime → "Change runtime type"')
#     print('menu, and then select High-RAM in the Runtime shape dropdown. Then, ')
#     print('re-execute this cell.')
# else:
#     print('You are using a high-RAM runtime!')

In [4]:
# Retrieve CPU information from the system
ncpu = !nproc
print("The number of available CPUs is:", ncpu[0])

The number of available CPUs is: 4


## 1.a) Load libraries and modules

In [5]:
# Set the random seed number for reproducible results
seedNum = 1

In [6]:
# Load libraries and packages
import random
random.seed(seedNum)
import numpy as np
np.random.seed(seedNum)
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import os
import sys
import boto3
import shutil
import string
import nltk
from nltk.corpus import stopwords
from collections import Counter
from datetime import datetime
from dotenv import load_dotenv
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RepeatedKFold
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
import tensorflow as tf
tf.random.set_seed(seedNum)
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.callbacks import ReduceLROnPlateau

In [7]:
nltk.download('popular')

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     /home/pythonml/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to
[nltk_data]    |     /home/pythonml/nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to
[nltk_data]    |     /home/pythonml/nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to
[nltk_data]    |     /home/pythonml/nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to
[nltk_data]    |     /home/pythonml/nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /home/pythonml/nltk_data...
[nltk_data]    |   Package movie_reviews is a

True

## 1.b) Set up the controlling parameters and functions

In [8]:
# Begin the timer for the script processing
startTimeScript = datetime.now()

# Set up the number of CPU cores available for multi-thread processing
n_jobs = 1

# Set up the flag that controls status notifications (setting to True will send status emails!)
notifyStatus = False

# Set the verbose level for program execution output
verbose = False

# Set Pandas options
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
pd.set_option("display.width", 140)

# Set the percentage sizes for splitting the dataset
TEST_SET_SIZE = 0.2
VAL_SET_SIZE = 0.25

# Set the number of folds and repeats for cross-validation
N_FOLDS = 5
N_ITERATIONS = 2

# Set various default modeling parameters
DEFAULT_LOSS = 'binary_crossentropy'
DEFAULT_METRICS = ['accuracy']
DEFAULT_OPTIMIZER = tf.keras.optimizers.Adam(learning_rate=0.001)
DEFAULT_INITIALIZER = tf.keras.initializers.GlorotUniform(seed=seedNum)
EPOCH_NUMBER = 25
BATCH_SIZE = 16

# Define the labels to use for graphing the data
train_metric = "accuracy"
validation_metric = "val_accuracy"
train_loss = "loss"
validation_loss = "val_loss"

# Check the number of GPUs accessible through TensorFlow
print('Num GPUs Available:', len(tf.config.list_physical_devices('GPU')))

# Print out the TensorFlow version for confirmation
print('TensorFlow version:', tf.__version__)

Num GPUs Available: 0
TensorFlow version: 2.3.1


In [9]:
# Set up the parent directory location for loading the dotenv files
# from google.colab import drive
# drive.mount('/content/gdrive')
# gdrivePrefix = '/content/gdrive/My Drive/Colab_Downloads/'
# env_path = '/content/gdrive/My Drive/Colab Notebooks/'
# dotenv_path = env_path + "python_script.env"
# load_dotenv(dotenv_path=dotenv_path)

# Set up the dotenv file for retrieving environment variables
# env_path = "/Users/david/PycharmProjects/"
# dotenv_path = env_path + "python_script.env"
# load_dotenv(dotenv_path=dotenv_path)

In [10]:
# Set up the email notification function
def status_notify(msg_text):
    access_key = os.environ.get('SNS_ACCESS_KEY')
    secret_key = os.environ.get('SNS_SECRET_KEY')
    aws_region = os.environ.get('SNS_AWS_REGION')
    topic_arn = os.environ.get('SNS_TOPIC_ARN')
    if (access_key is None) or (secret_key is None) or (aws_region is None) or (topic_arn is None):
        sys.exit("Incomplete notification setup info. Script Processing Aborted!!!")
    sns = boto3.client('sns', aws_access_key_id=access_key, aws_secret_access_key=secret_key, region_name=aws_region)
    response = sns.publish(TopicArn=topic_arn, Message=msg_text)
    if response['ResponseMetadata']['HTTPStatusCode'] != 200:
        print('Status notification not OK with HTTP status code:', response['ResponseMetadata']['HTTPStatusCode'])
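
The notification function depends on four environment variables. A small helper can report which ones are missing before any AWS call is attempted. The sketch below is hypothetical (the names `REQUIRED_SNS_VARIABLES` and `missing_settings` are not part of the notebook) and uses a plain dictionary in place of `os.environ` so it runs without any credentials.

```python
import os

# Variable names taken from the notification function above
REQUIRED_SNS_VARIABLES = ('SNS_ACCESS_KEY', 'SNS_SECRET_KEY', 'SNS_AWS_REGION', 'SNS_TOPIC_ARN')

def missing_settings(environ=os.environ, required=REQUIRED_SNS_VARIABLES):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not environ.get(name)]

# A plain dict stands in for os.environ; the values are placeholders
example_env = {'SNS_ACCESS_KEY': 'placeholder', 'SNS_AWS_REGION': 'placeholder'}
print(missing_settings(example_env))
```

Reporting the missing names makes an incomplete setup easier to diagnose than a generic abort message.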

In [11]:
if notifyStatus: status_notify('(TensorFlow Text Classification) Task 1 - Prepare Environment has begun on ' + datetime.now().strftime('%A %B %d, %Y %I:%M:%S %p'))

In [12]:
# Reset the random number generators
def reset_random(x):
    random.seed(x)
    np.random.seed(x)
    tf.random.set_seed(x)

In [13]:
if notifyStatus: status_notify('(TensorFlow Text Classification) Task 1 - Prepare Environment completed on ' + datetime.now().strftime('%A %B %d, %Y %I:%M:%S %p'))

# Task 2 - Load and Prepare Text Data

In [14]:
if notifyStatus: status_notify('(TensorFlow Text Classification) Task 2 - Load and Prepare Text Data has begun on ' + datetime.now().strftime('%A %B %d, %Y %I:%M:%S %p'))

## 2.a) Download Text Data Archive

In [15]:
dataset_path = 'https://dainesanalytics.com/datasets/ucirvine-sentiment-labelled-sentences/yelp_labelled.txt'
colNames = ['comment','targetVar']
Xy_original = pd.read_csv(dataset_path, names=colNames, sep='\t', header=None, index_col=False)

# Take a peek at the dataframe after import
print(Xy_original.head(10))

                                             comment  targetVar
0                           Wow... Loved this place.          1
1                                 Crust is not good.          0
2          Not tasty and the texture was just nasty.          0
3  Stopped by during the late May bank holiday of...          1
4  The selection on the menu was great and so wer...          1
5     Now I am getting angry and I want my damn pho.          0
6              Honeslty it didn't taste THAT fresh.)          0
7  The potatoes were like rubber and you could te...          0
8                          The fries were great too.          1
9                                     A great touch.          1


## 2.b) Splitting Data for Training and Validation

In [16]:
X_original = Xy_original.iloc[:,0]
y_original = Xy_original.iloc[:,1]
print("Xy_original.shape: {} X_original.shape: {} y_original.shape: {}".format(Xy_original.shape, X_original.shape, y_original.shape))
print(X_original.head())

Xy_original.shape: (1000, 2) X_original.shape: (1000,) y_original.shape: (1000,)
0                             Wow... Loved this place.
1                                   Crust is not good.
2            Not tasty and the texture was just nasty.
3    Stopped by during the late May bank holiday of...
4    The selection on the menu was great and so wer...
Name: comment, dtype: object


In [17]:
# Split the data into training and test datasets, stratified by the target class
X_train_df, X_test_df, y_train_df, y_test_df = train_test_split(X_original, y_original, test_size=TEST_SET_SIZE, stratify=y_original, random_state=seedNum)
print("X_train_df.shape: {} y_train_df.shape: {}".format(X_train_df.shape, y_train_df.shape))
print("X_test_df.shape: {} y_test_df.shape: {}".format(X_test_df.shape, y_test_df.shape))

X_train_df.shape: (800,) y_train_df.shape: (800,)
X_test_df.shape: (200,) y_test_df.shape: (200,)
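
The `stratify=y_original` argument keeps the class balance identical in both parts of the split. The standard-library sketch below illustrates that idea on synthetic labels. The function `stratified_split` is a hypothetical stand-in, not scikit-learn's implementation, and it will not reproduce the exact rows selected above.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx), sampling the same fraction from each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Synthetic stand-in for 1,000 labels: 500 negative (0) and 500 positive (1)
toy_labels = [0] * 500 + [1] * 500
train_idx, test_idx = stratified_split(toy_labels, test_size=0.2, seed=1)
print(len(train_idx), len(test_idx))
print('positives in test:', sum(toy_labels[i] for i in test_idx))
```

With a balanced 500/500 dataset and a 20% test share, each class contributes 100 sentences to the test set.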


## 2.c) Load Document and Build Vocabulary

In [18]:
# turn a sentence into clean tokens
def clean_sentence(sentence):
    # split into tokens by white space
    tokens = sentence.split()
    # remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words (case-sensitive: NLTK's list is lowercase, so capitalized words such as 'The' are kept)
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens
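
The stop-word comparison in the cleaning function is case-sensitive, which is why capitalised words such as 'The' still appear in the vocabulary listing in this section. The sketch below mirrors the cleaning steps with a hypothetical mini stop-word list (`TOY_STOP_WORDS`, standing in for the NLTK list) and an optional lower-casing step that the notebook does not apply. It also shows that a stop-word list containing a negation such as "not" removes it from the sentence.

```python
import string

# Hypothetical mini stop-word list standing in for stopwords.words('english')
TOY_STOP_WORDS = {"the", "is", "not", "was", "and", "this"}

def clean_tokens(sentence, lowercase=False):
    """Mirror the cleaning steps above, optionally lower-casing first."""
    if lowercase:
        sentence = sentence.lower()
    # remove punctuation from each whitespace-separated token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in sentence.split()]
    # keep alphabetic tokens that are not stop words and are longer than one character
    tokens = [w for w in tokens if w.isalpha()]
    tokens = [w for w in tokens if w not in TOY_STOP_WORDS]
    return [w for w in tokens if len(w) > 1]

print(clean_tokens("The crust is not good."))
print(clean_tokens("The crust is not good.", lowercase=True))
```

Whether lower-casing before filtering helps this dataset is an open question that would need to be tested against the cross-validation results.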

In [19]:
# load sentences and add to vocab
def add_sentence_to_vocab(sentence, vocab):
    # clean doc
    tokens = clean_sentence(sentence)
    # update counts
    vocab.update(tokens)

In [20]:
# add every comment in a pandas Series to the vocabulary
def build_vocabulary(comments, vocab):
    # walk through all comments in the series
    for i in range(len(comments)):
        # add comments to vocab
        sentence = comments.iloc[i]
        add_sentence_to_vocab(sentence, vocab)
        if verbose: print('Processing comment:', sentence)
        if ((i+1) % 100) == 0: print(i+1, 'comments processed so far.')
    # use len() so the summary also works when the series is empty
    print('Total number of comments loaded into the vocabulary:', len(comments), '\n')

In [21]:
# define vocab
vocab = Counter()
# add all docs to vocab
build_vocabulary(X_train_df, vocab)
# print the size of the vocab
print('The total number of words in the vocabulary:', len(vocab))
# print the top words in the vocab
top_words = 50
print('The top', top_words, 'words in the vocabulary:\n', vocab.most_common(top_words))

100 comments processed so far.
200 comments processed so far.
300 comments processed so far.
400 comments processed so far.
500 comments processed so far.
600 comments processed so far.
700 comments processed so far.
800 comments processed so far.
Total number of comments loaded into the vocabulary: 800 

The total number of words in the vocabulary: 1889
The top 50 words in the vocabulary:
 [('The', 142), ('food', 88), ('place', 84), ('good', 78), ('service', 55), ('back', 51), ('great', 43), ('This', 40), ('like', 36), ('time', 34), ('We', 30), ('go', 30), ('really', 29), ('It', 23), ('ever', 23), ('friendly', 22), ('dont', 22), ('My', 20), ('amazing', 19), ('would', 19), ('Great', 18), ('also', 18), ('experience', 18), ('nice', 18), ('one', 18), ('staff', 17), ('best', 17), ('eat', 17), ('get', 16), ('restaurant', 16), ('us', 16), ('delicious', 16), ('Im', 15), ('minutes', 15), ('Vegas', 15), ('They', 15), ('never', 15), ('bad', 14), ('say', 14), ('pretty', 14), ('first', 13), ('goin

In [22]:
# keep tokens with a minimum number of occurrences
min_occurrence = 2
tokens = [k for k,c in vocab.items() if c >= min_occurrence]
print('The number of words with the minimum appearance:', len(tokens))

The number of words with the minimum appearance: 702
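
The minimum-occurrence filter drops words that appear only once in the training sentences, which here reduces the vocabulary from 1889 to 702 words. A toy illustration with `collections.Counter` follows; the token lists are invented for the example.

```python
from collections import Counter

# Invented, already-cleaned token lists standing in for the training comments
toy_sentences = [["food", "good"], ["food", "bad"], ["service", "good"]]
toy_vocab = Counter()
for tokens in toy_sentences:
    toy_vocab.update(tokens)

min_count = 2
kept = sorted(word for word, count in toy_vocab.items() if count >= min_count)
dropped = sorted(word for word, count in toy_vocab.items() if count < min_count)
print('kept:', kept)
print('dropped:', dropped)
```

Words seen once give the network a single training example each, so removing them shrinks the input layer at little cost.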


In [23]:
# save list to file
def save_list(lines, filename):
    # convert lines to a single blob of text
    data = '\n'.join(lines)
    # write the text; the context manager closes the file even on error
    with open(filename, 'w') as file:
        file.write(data)

# save tokens to a vocabulary file
save_list(tokens, 'vocabulary.txt')

In [24]:
# load doc into memory
def load_doc(filename):
    # read all text; the context manager closes the file even on error
    with open(filename, 'r') as file:
        text = file.read()
    return text

In [25]:
# load the vocabulary
vocab_filename = 'vocabulary.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)
print('Number of tokens in the vocabulary:', len(vocab))

Number of tokens in the vocabulary: 702


## 2.d) Create Tokenizer and Encode the Input Text

In [26]:
# clean a sentence and return its in-vocabulary tokens as one line
def comment_to_line(sentence, vocab):
    # clean sentence
    tokens = clean_sentence(sentence)
    # filter by vocab
    tokens = [w for w in tokens if w in vocab]
    line = ' '.join(tokens)
    return line

In [27]:
# convert every comment in a pandas Series into a cleaned line
def process_comments_to_lines(comments, vocab):
    lines = list()
    # walk through all comments in the dataframe
    for i in range(len(comments)):
        # load and clean the comments
        sentence = comments.iloc[i]
        line = comment_to_line(sentence, vocab)
        # add to list
        lines.append(line)
    return lines

In [28]:
# Load all training cases
training_docs = process_comments_to_lines(X_train_df, vocab)

In [29]:
# Load all test cases
testing_docs = process_comments_to_lines(X_test_df, vocab)

In [30]:
# prepare bag of words encoding of docs
def encode_data(train_docs, val_docs, mode='binary'):
    # create the tokenizer
    tokenizer = Tokenizer()
    # fit the tokenizer on the documents
    tokenizer.fit_on_texts(train_docs)
    # encode training data set
    train_encoded = tokenizer.texts_to_matrix(train_docs, mode=mode)
    # encode validation data set
    val_encoded = tokenizer.texts_to_matrix(val_docs, mode=mode)
    return train_encoded, val_encoded

In [31]:
# encode training and validation datasets
X_train, X_test = encode_data(training_docs, testing_docs)
print('The shape of the encoded training dataset:', X_train.shape)
print('The shape of the encoded validation dataset:', X_test.shape)

The shape of the encoded training dataset: (800, 660)
The shape of the encoded validation dataset: (200, 660)
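
The encoded matrices have 660 columns although the saved vocabulary holds 702 tokens. A likely explanation, stated here as an assumption rather than something the notebook verifies, is that the Keras `Tokenizer` lower-cases text by default, so case variants such as 'Great' and 'great' share one column, and that `texts_to_matrix` reserves column 0, which never corresponds to a word. The hypothetical helper `matrix_width` below applies that reasoning to a toy vocabulary.

```python
def matrix_width(vocabulary_tokens):
    """Columns expected if tokens are lower-cased and index 0 is reserved."""
    distinct_lowercase = {token.lower() for token in vocabulary_tokens}
    return len(distinct_lowercase) + 1

# Toy vocabulary with two case-variant pairs
toy_vocabulary = ["Great", "great", "The", "food", "Food", "service"]
print(len(toy_vocabulary), matrix_width(toy_vocabulary))
```

Applying the same function to the tokens in `vocabulary.txt` would be one way to check whether this accounts for the 660 columns.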


In [32]:
y_train = y_train_df.values.ravel()
y_test = y_test_df.values.ravel()
print('The shape of the encoded training classes:', y_train.shape)
print('The shape of the encoded test classes:', y_test.shape)

The shape of the encoded training classes: (800,)
The shape of the encoded test classes: (200,)


In [33]:
if notifyStatus: status_notify('(TensorFlow Text Classification) Task 2 - Load and Prepare Text Data completed on ' + datetime.now().strftime('%A %B %d, %Y %I:%M:%S %p'))

# Task 3 - Define and Train Models

In [34]:
if notifyStatus: status_notify('(TensorFlow Text Classification) Task 3 - Define and Train Models has begun on ' + datetime.now().strftime('%A %B %d, %Y %I:%M:%S %p'))

In [35]:
# Define the default numbers of input/output for modeling
num_inputs = X_train.shape[1]
num_outputs = 1

In [36]:
# Define the baseline model for benchmarking
def create_nn_model(n_inputs=num_inputs, n_outputs=num_outputs, layer1_nodes=50, layer1_dropout=0, opt_param=DEFAULT_OPTIMIZER, init_param=DEFAULT_INITIALIZER, loss_param=DEFAULT_LOSS, metrics_param=DEFAULT_METRICS):
    nn_model = keras.Sequential([
        keras.layers.Dense(layer1_nodes, input_shape=(n_inputs,), activation='relu', kernel_initializer=init_param),
#         keras.layers.Dropout(layer1_dropout),
        keras.layers.Dense(n_outputs, activation='sigmoid', kernel_initializer=init_param)
    ])
    nn_model.compile(loss=loss_param, optimizer=opt_param, metrics=metrics_param)
    return nn_model
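
The baseline network is a single hidden ReLU layer followed by one sigmoid output unit, trained with binary cross-entropy. The plain-Python sketch below shows what that computes for one input row. It uses toy sizes (4 inputs, 3 hidden units) and random weights instead of the notebook's trained 660-input, 50-unit model, and all function names are illustrative.

```python
import math
import random

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(features, w_hidden, b_hidden, w_out, b_out):
    """One hidden ReLU layer followed by a single sigmoid output unit."""
    hidden = [relu(sum(w * x for w, x in zip(weights, features)) + b)
              for weights, b in zip(w_hidden, b_hidden)]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Loss for one example; probabilities are clipped to avoid log(0)."""
    y_prob = min(max(y_prob, eps), 1 - eps)
    return -(y_true * math.log(y_prob) + (1 - y_true) * math.log(1 - y_prob))

# Toy sizes and random weights, not the trained model
rng = random.Random(1)
n_inputs, n_hidden = 4, 3
w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)] for _ in range(n_hidden)]
b_hidden = [0.0] * n_hidden
w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
b_out = 0.0

prob = forward([1.0, 0.0, 1.0, 0.0], w_hidden, b_hidden, w_out, b_out)
print('probability:', prob)
print('predicted class:', int(prob > 0.5))
print('loss if the true label is 1:', binary_crossentropy(1, prob))
```

The final thresholding at 0.5 is the same rule the notebook applies to `final_model.predict` in Task 5.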

In [37]:
# Initialize the default model and get a baseline result
startTimeModule = datetime.now()
results = list()
iteration = 0
cv = RepeatedKFold(n_splits=N_FOLDS, n_repeats=N_ITERATIONS, random_state=seedNum)
for train_ix, val_ix in cv.split(X_train):
    feature_train, feature_validation = X_train[train_ix], X_train[val_ix]
    target_train, target_validation = y_train[train_ix], y_train[val_ix]
    reset_random(seedNum)
    baseline_model = create_nn_model()
    baseline_model.fit(feature_train, target_train, epochs=EPOCH_NUMBER, batch_size=BATCH_SIZE, verbose=0)
    model_metric = baseline_model.evaluate(feature_validation, target_validation, verbose=0)[1]
    iteration = iteration + 1
    print('Accuracy measurement from iteration %d >>> %.2f%%' % (iteration, model_metric*100))
    results.append(model_metric)
validation_score = np.mean(results)
validation_variance = np.std(results)
print('Average model accuracy from all iterations: %.2f%% (%.2f%%)' % (validation_score*100, validation_variance*100))
print('Total time for model fitting and cross validating:', (datetime.now() - startTimeModule))

Accuracy measurement from iteration 1 >>> 74.37%
Accuracy measurement from iteration 2 >>> 72.50%
Accuracy measurement from iteration 3 >>> 75.00%
Accuracy measurement from iteration 4 >>> 79.37%
Accuracy measurement from iteration 5 >>> 76.88%
Accuracy measurement from iteration 6 >>> 80.00%
Accuracy measurement from iteration 7 >>> 72.50%
Accuracy measurement from iteration 8 >>> 74.37%
Accuracy measurement from iteration 9 >>> 71.25%
Accuracy measurement from iteration 10 >>> 75.63%
Average model accuracy from all iterations: 75.19% (2.74%)
Total time for model fitting and cross validating: 0:00:26.416044
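
The summary line reports the mean of the ten validation accuracies and, in parentheses, their spread. Note that the variable `validation_variance` holds a standard deviation (`np.std`), not a variance. The sketch below recomputes both figures from the rounded values printed above using only the standard library.

```python
import statistics

# Validation accuracies (%) printed by the binary-mode run above
fold_accuracy = [74.37, 72.50, 75.00, 79.37, 76.88, 80.00, 72.50, 74.37, 71.25, 75.63]

mean_accuracy = statistics.mean(fold_accuracy)
# pstdev is the population standard deviation, matching np.std's default (ddof=0)
spread = statistics.pstdev(fold_accuracy)
print('%.2f%% (%.2f%%)' % (mean_accuracy, spread))
```

Working from the rounded per-fold values reproduces the reported 75.19% (2.74%) to two decimal places.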


In [38]:
if notifyStatus: status_notify('(TensorFlow Text Classification) Task 3 - Define and Train Models completed on ' + datetime.now().strftime('%A %B %d, %Y %I:%M:%S %p'))

# Task 4 - Evaluate and Optimize Models

In [39]:
if notifyStatus: status_notify('(TensorFlow Text Classification) Task 4 - Evaluate and Optimize Models has begun on ' + datetime.now().strftime('%A %B %d, %Y %I:%M:%S %p'))

## 4.a) Alternate Model One

In [40]:
# encode training and validation datasets
X_train, X_test = encode_data(training_docs, testing_docs, 'count')
print('The shape of the encoded training dataset:', X_train.shape)
print('The shape of the encoded validation dataset:', X_test.shape)

The shape of the encoded training dataset: (800, 660)
The shape of the encoded validation dataset: (200, 660)


In [41]:
# Cross-validate the same network on the count-encoded features
startTimeModule = datetime.now()
results = list()
iteration = 0
cv = RepeatedKFold(n_splits=N_FOLDS, n_repeats=N_ITERATIONS, random_state=seedNum)
for train_ix, val_ix in cv.split(X_train):
    feature_train, feature_validation = X_train[train_ix], X_train[val_ix]
    target_train, target_validation = y_train[train_ix], y_train[val_ix]
    reset_random(seedNum)
    alternate_model_1 = create_nn_model()
    alternate_model_1.fit(feature_train, target_train, epochs=EPOCH_NUMBER, batch_size=BATCH_SIZE, verbose=0)
    model_metric = alternate_model_1.evaluate(feature_validation, target_validation, verbose=0)[1]
    iteration = iteration + 1
    print('Accuracy measurement from iteration %d >>> %.2f%%' % (iteration, model_metric*100))
    results.append(model_metric)
validation_score = np.mean(results)
validation_variance = np.std(results)
print('Average model accuracy from all iterations: %.2f%% (%.2f%%)' % (validation_score*100, validation_variance*100))
print('Total time for model fitting and cross validating:', (datetime.now() - startTimeModule))

Accuracy measurement from iteration 1 >>> 74.37%
Accuracy measurement from iteration 2 >>> 72.50%
Accuracy measurement from iteration 3 >>> 75.63%
Accuracy measurement from iteration 4 >>> 78.75%
Accuracy measurement from iteration 5 >>> 76.88%
Accuracy measurement from iteration 6 >>> 80.00%
Accuracy measurement from iteration 7 >>> 72.50%
Accuracy measurement from iteration 8 >>> 73.12%
Accuracy measurement from iteration 9 >>> 71.25%
Accuracy measurement from iteration 10 >>> 75.63%
Average model accuracy from all iterations: 75.06% (2.72%)
Total time for model fitting and cross validating: 0:00:26.039286
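
Each run produces ten accuracy measurements because `RepeatedKFold` is configured with 5 folds (`N_FOLDS`) repeated 2 times (`N_ITERATIONS`). The sketch below reproduces that index bookkeeping with the standard library. The function `repeated_kfold_indices` is a hypothetical illustration and does not generate the same shuffles as scikit-learn.

```python
import random

def repeated_kfold_indices(n_samples, n_splits, n_repeats, seed):
    """Yield (train_idx, val_idx) pairs: n_splits folds for each repeat."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        # spread any remainder over the first folds
        base, extra = divmod(n_samples, n_splits)
        start = 0
        for fold in range(n_splits):
            size = base + (1 if fold < extra else 0)
            val_idx = order[start:start + size]
            train_idx = order[:start] + order[start + size:]
            start += size
            yield train_idx, val_idx

splits = list(repeated_kfold_indices(n_samples=800, n_splits=5, n_repeats=2, seed=1))
print('number of train/validation splits:', len(splits))
print('train size:', len(splits[0][0]), 'validation size:', len(splits[0][1]))
```

With 800 training sentences, every split trains on 640 rows and validates on 160, and each sentence is used for validation once per repeat.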


## 4.b) Alternate Model Two

In [42]:
# encode training and validation datasets
X_train, X_test = encode_data(training_docs, testing_docs, 'tfidf')
print('The shape of the encoded training dataset:', X_train.shape)
print('The shape of the encoded validation dataset:', X_test.shape)

The shape of the encoded training dataset: (800, 660)
The shape of the encoded validation dataset: (200, 660)


In [43]:
# Cross-validate the same network on the tfidf-encoded features
startTimeModule = datetime.now()
results = list()
iteration = 0
cv = RepeatedKFold(n_splits=N_FOLDS, n_repeats=N_ITERATIONS, random_state=seedNum)
for train_ix, val_ix in cv.split(X_train):
    feature_train, feature_validation = X_train[train_ix], X_train[val_ix]
    target_train, target_validation = y_train[train_ix], y_train[val_ix]
    reset_random(seedNum)
    alternate_model_2 = create_nn_model()
    alternate_model_2.fit(feature_train, target_train, epochs=EPOCH_NUMBER, batch_size=BATCH_SIZE, verbose=0)
    model_metric = alternate_model_2.evaluate(feature_validation, target_validation, verbose=0)[1]
    iteration = iteration + 1
    print('Accuracy measurement from iteration %d >>> %.2f%%' % (iteration, model_metric*100))
    results.append(model_metric)
validation_score = np.mean(results)
validation_variance = np.std(results)
print('Average model accuracy from all iterations: %.2f%% (%.2f%%)' % (validation_score*100, validation_variance*100))
print('Total time for model fitting and cross validating:', (datetime.now() - startTimeModule))

Accuracy measurement from iteration 1 >>> 73.75%
Accuracy measurement from iteration 2 >>> 70.63%
Accuracy measurement from iteration 3 >>> 75.00%
Accuracy measurement from iteration 4 >>> 78.75%
Accuracy measurement from iteration 5 >>> 77.50%
Accuracy measurement from iteration 6 >>> 78.75%
Accuracy measurement from iteration 7 >>> 73.75%
Accuracy measurement from iteration 8 >>> 73.75%
Accuracy measurement from iteration 9 >>> 73.75%
Accuracy measurement from iteration 10 >>> 74.37%
Average model accuracy from all iterations: 75.00% (2.45%)
Total time for model fitting and cross validating: 0:00:25.913698


## 4.c) Alternate Model Three

In [44]:
# encode training and validation datasets
X_train, X_test = encode_data(training_docs, testing_docs, 'freq')
print('The shape of the encoded training dataset:', X_train.shape)
print('The shape of the encoded validation dataset:', X_test.shape)

The shape of the encoded training dataset: (800, 660)
The shape of the encoded validation dataset: (200, 660)


In [45]:
# Cross-validate the same network on the freq-encoded features
startTimeModule = datetime.now()
results = list()
iteration = 0
cv = RepeatedKFold(n_splits=N_FOLDS, n_repeats=N_ITERATIONS, random_state=seedNum)
for train_ix, val_ix in cv.split(X_train):
    feature_train, feature_validation = X_train[train_ix], X_train[val_ix]
    target_train, target_validation = y_train[train_ix], y_train[val_ix]
    reset_random(seedNum)
    alternate_model_3 = create_nn_model()
    alternate_model_3.fit(feature_train, target_train, epochs=EPOCH_NUMBER, batch_size=BATCH_SIZE, verbose=0)
    model_metric = alternate_model_3.evaluate(feature_validation, target_validation, verbose=0)[1]
    iteration = iteration + 1
    print('Accuracy measurement from iteration %d >>> %.2f%%' % (iteration, model_metric*100))
    results.append(model_metric)
validation_score = np.mean(results)
validation_variance = np.std(results)
print('Average model accuracy from all iterations: %.2f%% (%.2f%%)' % (validation_score*100, validation_variance*100))
print('Total time for model fitting and cross validating:', (datetime.now() - startTimeModule))

Accuracy measurement from iteration 1 >>> 73.75%
Accuracy measurement from iteration 2 >>> 71.88%
Accuracy measurement from iteration 3 >>> 76.88%
Accuracy measurement from iteration 4 >>> 75.00%
Accuracy measurement from iteration 5 >>> 77.50%
Accuracy measurement from iteration 6 >>> 76.25%
Accuracy measurement from iteration 7 >>> 76.25%
Accuracy measurement from iteration 8 >>> 78.12%
Accuracy measurement from iteration 9 >>> 73.75%
Accuracy measurement from iteration 10 >>> 72.50%
Average model accuracy from all iterations: 75.19% (2.04%)
Total time for model fitting and cross validating: 0:00:25.853118
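
The four encoding modes can be compared side by side using the averages reported above. The notebook does not state how the mode for the final model was chosen, so the selection rule below (highest mean accuracy, then smaller standard deviation as a tie-breaker) is a hypothetical one.

```python
# (mean accuracy %, standard deviation %) as reported by the four runs above
cv_results = {
    'binary': (75.19, 2.74),
    'count': (75.06, 2.72),
    'tfidf': (75.00, 2.45),
    'freq': (75.19, 2.04),
}

# Hypothetical rule: highest mean first, then the smaller spread as tie-breaker
best_mode = max(cv_results, key=lambda m: (cv_results[m][0], -cv_results[m][1]))

for mode, (mean_acc, spread) in sorted(cv_results.items(), key=lambda kv: -kv[1][0]):
    print('%-6s %.2f%% (%.2f%%)' % (mode, mean_acc, spread))
print('selected:', best_mode)
```

All four means lie within 0.2 percentage points of each other, far less than one standard deviation, so the modes are practically indistinguishable on this dataset.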


In [46]:
if notifyStatus: status_notify('(TensorFlow Text Classification) Task 4 - Evaluate and Optimize Models completed on ' + datetime.now().strftime('%A %B %d, %Y %I:%M:%S %p'))

# Task 5 - Finalize Model and Make Predictions

In [47]:
if notifyStatus: status_notify('(TensorFlow Text Classification) Task 5 - Finalize Model and Make Predictions has begun on ' + datetime.now().strftime('%A %B %d, %Y %I:%M:%S %p'))

In [48]:
# encode training and validation datasets
X_train, X_test = encode_data(training_docs, testing_docs, 'freq')
print('The shape of the encoded training dataset:', X_train.shape)
print('The shape of the encoded validation dataset:', X_test.shape)

The shape of the encoded training dataset: (800, 660)
The shape of the encoded validation dataset: (200, 660)


In [49]:
# Create the final model for evaluating the test dataset
reset_random(seedNum)
final_model = create_nn_model()
final_model.fit(X_train, y_train, epochs=EPOCH_NUMBER, batch_size=BATCH_SIZE, verbose=1)

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


<tensorflow.python.keras.callbacks.History at 0x7f5d40083df0>

In [50]:
# Summarize the final model
final_model.summary()

Model: "sequential_40"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_80 (Dense)             (None, 50)                33050     
_________________________________________________________________
dense_81 (Dense)             (None, 1)                 51        
Total params: 33,101
Trainable params: 33,101
Non-trainable params: 0
_________________________________________________________________
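
The parameter counts in the summary follow directly from the layer sizes: a dense layer has one weight per input-unit pair plus one bias per unit. A short check, using a hypothetical helper `dense_params`:

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units

hidden = dense_params(660, 50)   # 660 encoded features into 50 hidden units
output = dense_params(50, 1)     # 50 hidden units into 1 sigmoid output
print(hidden, output, hidden + output)
```

The totals match the 33,050 and 51 parameters listed for the two dense layers.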


In [51]:
# test_predictions = final_model.predict(X_test, batch_size=default_batch, verbose=1)
test_predictions = (final_model.predict(X_test) > 0.5).astype("int32").ravel()
print('Accuracy Score:', accuracy_score(y_test, test_predictions))
print(confusion_matrix(y_test, test_predictions))
print(classification_report(y_test, test_predictions))

Accuracy Score: 0.72
[[74 26]
 [30 70]]
              precision    recall  f1-score   support

           0       0.71      0.74      0.73       100
           1       0.73      0.70      0.71       100

    accuracy                           0.72       200
   macro avg       0.72      0.72      0.72       200
weighted avg       0.72      0.72      0.72       200
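
The figures in the classification report can be derived from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes. The sketch below recomputes accuracy, precision, recall, and F1 from the four counts printed above; the helper name `per_class_metrics` is illustrative.

```python
# Rows are true classes, columns are predicted classes (as printed above)
matrix = [[74, 26],
          [30, 70]]

def per_class_metrics(matrix, cls):
    """Return (precision, recall, f1) for one class."""
    true_pos = matrix[cls][cls]
    predicted = sum(row[cls] for row in matrix)   # column total
    actual = sum(matrix[cls])                     # row total
    precision = true_pos / predicted
    recall = true_pos / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = sum(matrix[i][i] for i in range(2)) / sum(map(sum, matrix))
print('accuracy %.2f' % accuracy)
for cls in (0, 1):
    print(cls, '%.2f %.2f %.2f' % per_class_metrics(matrix, cls))
```

The model makes slightly more false negatives (30) than false positives (26), which is why recall for class 1 is lower than for class 0.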



In [52]:
if notifyStatus: status_notify('(TensorFlow Text Classification) Task 5 - Finalize Model and Make Predictions completed on ' + datetime.now().strftime('%A %B %d, %Y %I:%M:%S %p'))

In [53]:
print('Total time for the script:', (datetime.now() - startTimeScript))

Total time for the script: 0:01:53.987163
