# Multi-Label Deep Learning Model for [PROJECT NAME] Using TensorFlow version 2
### David Lowe
### January 6, 2022

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery. [https://machinelearningmastery.com/]

SUMMARY: The purpose of this project is to construct a predictive model using the TensorFlow deep learning framework and to document the end-to-end steps with a template. The [PROJECT NAME] dataset is a multi-label classification problem in which we try to predict multiple, mutually non-exclusive classes or "labels" for each set of features.

INTRODUCTION: This script replicates Dr. Jason Brownlee's blog post [https://machinelearningmastery.com/multi-label-classification-with-deep-learning/] on this topic with some modifications. The goal is to build a robust template for modeling similar problems in the future.

ANALYSIS: [Sample Paragraph - The performance of the baseline model achieved a cross-validated accuracy score of 75.18% after 50 epochs using the training dataset. After tuning the hyperparameters, the best model processed the validation dataset with an accuracy score of 74.88%. Furthermore, the final model processed the test dataset with an accuracy measurement of 58.30%.]

CONCLUSION: [Sample Paragraph - In this iteration, the simple TensorFlow model appeared to be suitable for modeling this dataset.]

Dataset Used: [PROJECT NAME] Dataset

Dataset ML Model: Multi-label classification with numerical attributes

Dataset Reference: [Dataset URL]

Potential Sources of Benchmark: [Benchmark URL]

# Task 1 - Prepare Environment

In [1]:
# # Install the packages to support accessing environment variables and SQL databases
# !pip install python-dotenv PyMySQL

In [2]:
# # Retrieve GPU configuration information from Colab
# gpu_info = !nvidia-smi
# gpu_info = '\n'.join(gpu_info)
# if gpu_info.find('failed') >= 0:
#     print('Select the Runtime → "Change runtime type" menu to enable a GPU accelerator, ')
#     print('and then re-execute this cell.')
# else:
#     print(gpu_info)

In [3]:
# # Retrieve memory configuration information from Colab
# from psutil import virtual_memory
# ram_gb = virtual_memory().total / 1e9
# print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

# if ram_gb < 20:
#     print('To enable a high-RAM runtime, select the Runtime → "Change runtime type"')
#     print('menu, and then select High-RAM in the Runtime shape dropdown. Then, ')
#     print('re-execute this cell.')
# else:
#     print('You are using a high-RAM runtime!')

In [4]:
# # Mount Google Drive locally for loading the dotenv files
# from dotenv import load_dotenv
# from google.colab import drive
# drive.mount('/content/gdrive')
# gdrivePrefix = '/content/gdrive/My Drive/Colab_Downloads/'
# env_path = '/content/gdrive/My Drive/Colab Notebooks/'
# dotenv_path = env_path + "python_script.env"
# load_dotenv(dotenv_path=dotenv_path)

In [5]:
# Retrieve CPU information from the system
ncpu = !nproc
print("The number of available CPUs is:", ncpu[0])

The number of available CPUs is: 8


## 1.a) Load libraries and modules

In [6]:
# Set the random seed number for reproducible results
RNG_SEED = 88

In [7]:
# Load libraries and packages
import random
random.seed(RNG_SEED)
import numpy as np
np.random.seed(RNG_SEED)
import pandas as pd
import os
import sys
# import boto3
import zipfile
from datetime import datetime
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RepeatedKFold
from sklearn.metrics import accuracy_score

import tensorflow as tf
tf.random.set_seed(RNG_SEED)
from tensorflow import keras

## 1.b) Set up the controlling parameters and functions

In [8]:
# Begin the timer for the script processing
START_TIME_SCRIPT = datetime.now()

In [9]:
# Set up the number of CPU cores available for multi-thread processing
N_JOBS = 1

# Set up the flag that controls progress emails (setting it to True will send status emails!)
NOTIFY_STATUS = False

# Set the fractions for splitting the dataset (VAL_SET_RATIO is not used by the cells below)
TEST_SET_RATIO = 0.2
VAL_SET_RATIO = 0.2

# Set the number of folds for cross validation
N_FOLDS = 5
N_ITERATIONS = 2

# Set various default modeling parameters
DEFAULT_LOSS = 'binary_crossentropy'
DEFAULT_METRICS = ['accuracy']
DEFAULT_OPTIMIZER = tf.keras.optimizers.Adam(learning_rate=0.001)
DEFAULT_INITIALIZER = tf.keras.initializers.RandomNormal(seed=RNG_SEED)
CLASSIFIER_ACTIVATION = 'sigmoid'
MAX_EPOCHS = 50
BATCH_SIZE = 32

NUM_SAMPLES = 10000
NUM_FEATURES = 10
NUM_CLASSES = 3
NUM_LABELS = 2

# Define the labels to use for graphing the data
TRAIN_METRIC = "accuracy"
VALIDATION_METRIC = "val_accuracy"
TRAIN_LOSS = "loss"
VALIDATION_LOSS = "val_loss"

# Check the number of GPUs accessible through TensorFlow
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

# Print out the TensorFlow version for confirmation
print('TensorFlow version:', tf.__version__)

Num GPUs Available:  1
TensorFlow version: 2.6.0


In [10]:
# Set up the email notification function
def status_notify(msg_text):
    # boto3 is imported here because the module-level import is commented out
    import boto3
    access_key = os.environ.get('SNS_ACCESS_KEY')
    secret_key = os.environ.get('SNS_SECRET_KEY')
    aws_region = os.environ.get('SNS_AWS_REGION')
    topic_arn = os.environ.get('SNS_TOPIC_ARN')
    # All four settings are required, including the topic ARN used by publish()
    if (access_key is None) or (secret_key is None) or (aws_region is None) or (topic_arn is None):
        sys.exit("Incomplete notification setup info. Script Processing Aborted!!!")
    sns = boto3.client('sns', aws_access_key_id=access_key, aws_secret_access_key=secret_key, region_name=aws_region)
    response = sns.publish(TopicArn=topic_arn, Message=msg_text)
    if response['ResponseMetadata']['HTTPStatusCode'] != 200:
        print('Status notification not OK with HTTP status code:', response['ResponseMetadata']['HTTPStatusCode'])

In [11]:
if (NOTIFY_STATUS): status_notify('(TensorFlow Multi-Label) Task 1 - Prepare Environment has begun on ' + datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [12]:
# Reset the random number generators
def reset_random(x=RNG_SEED):
    random.seed(x)
    np.random.seed(x)
    tf.random.set_seed(x)

In [13]:
if (NOTIFY_STATUS): status_notify('(TensorFlow Multi-Label) Task 1 - Prepare Environment completed on ' + datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

# Task 2 - Load and Prepare Data

In [14]:
if (NOTIFY_STATUS): status_notify('(TensorFlow Multi-Label) Task 2 - Load and Prepare Data has begun on ' + datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [15]:
# Retrieve the dataset
X_original, y_original = make_multilabel_classification(n_samples=NUM_SAMPLES,
                                                        n_features=NUM_FEATURES,
                                                        n_classes=NUM_CLASSES,
                                                        n_labels=NUM_LABELS,
                                                        random_state=RNG_SEED)
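
To see what the generated target looks like, the following sketch inspects the label matrix. It assumes the same generator arguments as the cell above; the names `X_demo`, `y_demo`, and `labels_per_row` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification

X_demo, y_demo = make_multilabel_classification(n_samples=10000, n_features=10,
                                                n_classes=3, n_labels=2,
                                                random_state=88)

# y_demo is a binary indicator matrix: one column per class, one row per sample
print(X_demo.shape, y_demo.shape)

# Count how many labels are switched on in each row; a row may have zero, one, or several
labels_per_row = y_demo.sum(axis=1)
print(np.bincount(labels_per_row, minlength=4))
```

Rows with more than one active label are what separate this setup from ordinary multi-class classification, where exactly one class applies per sample.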

In [16]:
# Split the data into a combined training/validation set and a test set
X_train_val, X_test, y_train_val, y_test = train_test_split(X_original, y_original,
                                                            test_size=TEST_SET_RATIO,
                                                            random_state=RNG_SEED)
print("X_train_val.shape: {} y_train_val.shape: {}".format(X_train_val.shape, y_train_val.shape))
print("X_test.shape: {} y_test.shape: {}".format(X_test.shape, y_test.shape))

X_train_val.shape: (8000, 10) y_train_val.shape: (8000, 3)
X_test.shape: (2000, 10) y_test.shape: (2000, 3)
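
`VAL_SET_RATIO` is defined in the parameters, but the cells above never use it. If a separate hold-out validation set is wanted, one way to carve it out is sketched below. This is an assumption about how the ratio might be applied (as a fraction of the training/validation portion), not something the template does; placeholder arrays stand in for the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

RNG_SEED, TEST_SET_RATIO, VAL_SET_RATIO = 88, 0.2, 0.2

# Placeholder arrays with the same shapes as the generated dataset
X_all = np.zeros((10000, 10))
y_all = np.zeros((10000, 3), dtype=int)

X_tv, X_te, y_tv, y_te = train_test_split(X_all, y_all,
                                          test_size=TEST_SET_RATIO, random_state=RNG_SEED)
# Assumption: VAL_SET_RATIO is taken as a fraction of the remaining train/val rows
X_tr, X_va, y_tr, y_va = train_test_split(X_tv, y_tv,
                                          test_size=VAL_SET_RATIO, random_state=RNG_SEED)

print(X_tr.shape, X_va.shape, X_te.shape)
```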


In [17]:
if (NOTIFY_STATUS): status_notify('(TensorFlow Multi-Label) Task 2 - Load and Prepare Data completed on ' + datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

# Task 3 - Define and Train Models

In [18]:
if (NOTIFY_STATUS): status_notify('(TensorFlow Multi-Label) Task 3 - Define and Train Models has begun on ' + datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [19]:
# Define the function for plotting training results for comparison
import matplotlib.pyplot as plt

def plot_metrics(history):
    # Expects a History object from fit() that was run with validation data
    fig, axs = plt.subplots(1, 2, figsize=(24, 15))
    metrics = [TRAIN_LOSS, TRAIN_METRIC]
    # Draw the loss and the metric side by side, one subplot each
    for ax, metric in zip(axs, metrics):
        name = metric.replace("_", " ").capitalize()
        ax.plot(history.epoch, history.history[metric], color='blue', label='Train')
        ax.plot(history.epoch, history.history['val_' + metric], color='red', linestyle="--", label='Val')
        ax.set_xlabel('Epoch')
        ax.set_ylabel(name)
        if metric == TRAIN_LOSS:
            ax.set_ylim([0, ax.get_ylim()[1]])
        else:
            ax.set_ylim([0, 1])
        ax.legend()

In [20]:
# Define the baseline model for benchmarking
def create_nn_model(input_param, output_param, dense_nodes=32,
                    layer1_dropout=0.25, layer2_dropout=0.25,
                    init_param=DEFAULT_INITIALIZER, classifier_activation=CLASSIFIER_ACTIVATION,
                    loss_param=DEFAULT_LOSS, opt_param=DEFAULT_OPTIMIZER, metrics_param=DEFAULT_METRICS):
    # Use the init_param and classifier_activation arguments so that overrides take effect
    nn_model = keras.Sequential([
        keras.layers.Dense(dense_nodes, input_dim=input_param, activation='relu', kernel_initializer=init_param),
        keras.layers.Dropout(layer1_dropout),
        keras.layers.Dense(dense_nodes, activation='relu', kernel_initializer=init_param),
        keras.layers.Dropout(layer2_dropout),
        keras.layers.Dense(output_param, activation=classifier_activation, kernel_initializer=init_param)
    ])
    nn_model.compile(loss=loss_param, optimizer=opt_param, metrics=metrics_param)
    return nn_model
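
The sigmoid output layer paired with binary cross-entropy treats each label as its own yes/no decision, which is why several labels can be active at once. A NumPy sketch of that idea follows; the logits and targets are made-up numbers, and the helper names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    # Squash each logit independently into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    # Mean of the per-label log loss, with clipping to avoid log(0)
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))))

logits = np.array([[2.0, -1.0, 0.5]])   # made-up raw outputs for one sample, three labels
y_true = np.array([[1, 0, 1]])

probs = sigmoid(logits)
print(probs.round(3))                   # the probabilities need not sum to 1
print(binary_crossentropy(y_true, probs))
```

A softmax output would instead force the three probabilities to sum to 1, which suits single-label problems but not this one.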

In [21]:
# evaluate a model using repeated k-fold cross-validation
def evaluate_baseline(X, y):
    results = list()
    n_inputs, n_outputs = X.shape[1], y.shape[1]
    # define evaluation procedure
    cv = RepeatedKFold(n_splits=N_FOLDS, n_repeats=N_ITERATIONS, random_state=RNG_SEED)
    # enumerate folds
    for train_ix, validation_ix in cv.split(X):
        # prepare data
        X_train, X_validation = X[train_ix], X[validation_ix]
        y_train, y_validation = y[train_ix], y[validation_ix]
        # define model
        reset_random()
        model = create_nn_model(n_inputs, n_outputs)
        # fit model
        model.fit(X_train, y_train, epochs=MAX_EPOCHS, batch_size=BATCH_SIZE, verbose=0)
        # make a prediction on the validation fold
        yhat = model.predict(X_validation)
        # round probabilities to class labels
        yhat = yhat.round()
        # calculate accuracy
        acc = accuracy_score(y_validation, yhat)
        # store result
        print('Accuracy score obtained for this CV round: %.4f' % acc)
        results.append(acc)
    return results
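
For multi-label targets, `accuracy_score` counts a sample as correct only when every label in the row matches (often called subset or exact-match accuracy). The NumPy sketch below reproduces that calculation on made-up predictions; `subset_accuracy` is a hypothetical helper.

```python
import numpy as np

def subset_accuracy(y_true, y_pred):
    # A row counts as correct only if all of its labels match
    return float(np.mean(np.all(y_true == y_pred, axis=1)))

y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_prob = np.array([[0.9, 0.2, 0.7],
                   [0.1, 0.8, 0.6],    # third label is wrong after rounding
                   [0.6, 0.7, 0.3],
                   [0.4, 0.1, 0.9]])

y_pred = y_prob.round()                # same thresholding step as in the cell above
print(subset_accuracy(y_true, y_pred))
```

One wrong label out of three is enough to mark the whole row as a miss, so this metric is stricter than a per-label accuracy.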

In [22]:
# evaluate model
results = evaluate_baseline(X_train_val, y_train_val)
# summarize performance
print('Final Accuracy Measurements: %.4f (%.4f)' % (np.mean(results), np.std(results)))

Accuracy score obtained for this CV round: 0.7581
Accuracy score obtained for this CV round: 0.7475
Accuracy score obtained for this CV round: 0.7556
Accuracy score obtained for this CV round: 0.7531
Accuracy score obtained for this CV round: 0.7462
Accuracy score obtained for this CV round: 0.7612
Accuracy score obtained for this CV round: 0.7481
Accuracy score obtained for this CV round: 0.7481
Accuracy score obtained for this CV round: 0.7475
Accuracy score obtained for this CV round: 0.7525
Final Accuracy Measurements: 0.7518 (0.0049)


In [23]:
if (NOTIFY_STATUS): status_notify('(TensorFlow Multi-Label) Task 3 - Define and Train Models completed on ' + datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

# Task 4 - Tune and Optimize Models

In [24]:
if (NOTIFY_STATUS): status_notify('(TensorFlow Multi-Label) Task 4 - Tune and Optimize Models has begun on ' + datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [25]:
# evaluate an alternate configuration (wider layers, lower learning rate) using repeated k-fold cross-validation
def evaluate_alternate(X, y):
    results = list()
    n_inputs, n_outputs = X.shape[1], y.shape[1]
    # define evaluation procedure
    cv = RepeatedKFold(n_splits=N_FOLDS, n_repeats=N_ITERATIONS, random_state=RNG_SEED)
    # enumerate folds
    for train_ix, validation_ix in cv.split(X):
        # prepare data
        X_train, X_validation = X[train_ix], X[validation_ix]
        y_train, y_validation = y[train_ix], y[validation_ix]
        # define model
        dense_alternate = 128
        optimizer_alternate = tf.keras.optimizers.Adam(learning_rate=0.0001)
        reset_random()
        model = create_nn_model(n_inputs, n_outputs,
                                dense_nodes=dense_alternate,
                                opt_param=optimizer_alternate)
        # fit model
        model.fit(X_train, y_train, epochs=MAX_EPOCHS, batch_size=BATCH_SIZE, verbose=0)
        # make a prediction on the validation fold
        yhat = model.predict(X_validation)
        # round probabilities to class labels
        yhat = yhat.round()
        # calculate accuracy
        acc = accuracy_score(y_validation, yhat)
        # store result
        print('Accuracy score obtained for this CV round: %.4f' % acc)
        results.append(acc)
    return results

In [26]:
# evaluate model
results = evaluate_alternate(X_train_val, y_train_val)
# summarize performance
print('Final Accuracy Measurements: %.4f (%.4f)' % (np.mean(results), np.std(results)))

Accuracy score obtained for this CV round: 0.7506
Accuracy score obtained for this CV round: 0.7419
Accuracy score obtained for this CV round: 0.7588
Accuracy score obtained for this CV round: 0.7456
Accuracy score obtained for this CV round: 0.7462
Accuracy score obtained for this CV round: 0.7688
Accuracy score obtained for this CV round: 0.7356
Accuracy score obtained for this CV round: 0.7419
Accuracy score obtained for this CV round: 0.7575
Accuracy score obtained for this CV round: 0.7412
Final Accuracy Measurements: 0.7488 (0.0096)
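
The two configurations can be compared directly from the per-fold scores printed above. The sketch below copies those scores and recomputes the summary figures; setting the gap beside the fold-to-fold spread is a rough sanity check, not a formal significance test.

```python
import numpy as np

# Per-fold scores copied from the baseline and alternate outputs above
baseline = np.array([0.7581, 0.7475, 0.7556, 0.7531, 0.7462,
                     0.7612, 0.7481, 0.7481, 0.7475, 0.7525])
alternate = np.array([0.7506, 0.7419, 0.7588, 0.7456, 0.7462,
                      0.7688, 0.7356, 0.7419, 0.7575, 0.7412])

for name, scores in (("baseline", baseline), ("alternate", alternate)):
    print('%s: %.4f (%.4f)' % (name, scores.mean(), scores.std()))

# Gap between the two means, to set beside the standard deviations above
gap = baseline.mean() - alternate.mean()
print('difference in mean accuracy: %.4f' % gap)
```

With these numbers the gap between the means is smaller than either standard deviation, so the two configurations are hard to tell apart on this evidence.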


In [27]:
if (NOTIFY_STATUS): status_notify('(TensorFlow Multi-Label) Task 4 - Tune and Optimize Models completed on ' + datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

# Task 5 - Finalize Model and Make Predictions

In [28]:
if (NOTIFY_STATUS): status_notify('(TensorFlow Multi-Label) Task 5 - Finalize Model and Make Predictions has begun on ' + datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [29]:
# Train the final model
FINAL_DENSE_NODES = 128
FINAL_OPTIMIZER = tf.keras.optimizers.Adam(learning_rate=0.0001)
n_inputs, n_outputs = X_train_val.shape[1], y_train_val.shape[1]
reset_random()
final_model = create_nn_model(n_inputs, n_outputs,
                              dense_nodes=FINAL_DENSE_NODES,
                              opt_param=FINAL_OPTIMIZER)
final_model.fit(X_train_val, y_train_val, epochs=MAX_EPOCHS, batch_size=BATCH_SIZE, verbose=0)

<keras.callbacks.History at 0x7fa52c43f9d0>

In [30]:
# Display a summary of the final model
final_model.summary()

Model: "sequential_20"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_60 (Dense)             (None, 128)               1408      
_________________________________________________________________
dropout_40 (Dropout)         (None, 128)               0         
_________________________________________________________________
dense_61 (Dense)             (None, 128)               16512     
_________________________________________________________________
dropout_41 (Dropout)         (None, 128)               0         
_________________________________________________________________
dense_62 (Dense)             (None, 3)                 387       
Total params: 18,307
Trainable params: 18,307
Non-trainable params: 0
_________________________________________________________________
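
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-output pair plus one bias per output. A short sketch of that arithmetic, using the layer sizes shown above (`dense_params` is a hypothetical helper):

```python
def dense_params(n_in, n_out):
    # Weights (n_in * n_out) plus one bias per output unit
    return n_in * n_out + n_out

# input->hidden, hidden->hidden, hidden->output
layer_sizes = [(10, 128), (128, 128), (128, 3)]
counts = [dense_params(n_in, n_out) for n_in, n_out in layer_sizes]

print(counts)        # per-layer counts; dropout layers add no parameters
print(sum(counts))
```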


In [31]:
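# Check the performance of the model using the test dataset
# Note: this returns the loss and Keras's built-in 'accuracy' metric, which may not be
# computed the same way as the accuracy_score used in the cross-validation rounds
final_model.evaluate(X_test, y_test)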



[0.25184622406959534, 0.5830000042915344]
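
The test accuracy reported by `evaluate` is much lower than the cross-validated scores, and the two figures may not measure the same thing. The cross-validation cells use scikit-learn's exact-match accuracy, whereas the string `'accuracy'` passed to `compile` is resolved by Keras from the output shape and, for a multi-column output, can end up as categorical accuracy (comparing the arg-max of each row). That explanation is an assumption and has not been verified against this run. The NumPy sketch below shows how far the definitions can diverge on made-up data; all helper names are hypothetical.

```python
import numpy as np

def exact_match_accuracy(y_true, y_prob):
    # Row is correct only if every thresholded label matches
    return float(np.mean(np.all(y_true == (y_prob > 0.5), axis=1)))

def argmax_accuracy(y_true, y_prob):
    # Row is correct if the highest-scoring column is also the arg-max of the target row
    return float(np.mean(np.argmax(y_true, axis=1) == np.argmax(y_prob, axis=1)))

def per_label_accuracy(y_true, y_prob):
    # Fraction of individual label decisions that are correct
    return float(np.mean(y_true == (y_prob > 0.5)))

# Made-up targets and predicted probabilities for four samples
y_true = np.array([[1, 1, 0],
                   [0, 1, 1],
                   [1, 0, 1],
                   [0, 0, 1]])
y_prob = np.array([[0.7, 0.9, 0.1],
                   [0.2, 0.6, 0.8],
                   [0.6, 0.3, 0.9],
                   [0.1, 0.7, 0.8]])

print(exact_match_accuracy(y_true, y_prob))
print(argmax_accuracy(y_true, y_prob))
print(per_label_accuracy(y_true, y_prob))
```

If a test figure comparable to the cross-validation scores is wanted, one option is to call `final_model.predict(X_test)`, round the result, and pass it to `accuracy_score`, as the cross-validation cells do.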

In [32]:
if (NOTIFY_STATUS): status_notify('(TensorFlow Multi-Label) Task 5 - Finalize Model and Make Predictions completed on ' + datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [33]:
print('Total time for the script:', (datetime.now() - START_TIME_SCRIPT))

Total time for the script: 0:06:27.491521
