# Multi-Label Deep Learning Model for [PROJECT NAME] Using TensorFlow version 1
### David Lowe
### September 14, 2020

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery. [https://machinelearningmastery.com/]

SUMMARY: The purpose of this project is to construct a predictive model using the TensorFlow deep learning framework and to document the end-to-end steps with a template. The [PROJECT NAME] dataset is a multi-label classification problem in which we try to predict multiple classes or "labels" that are not mutually exclusive for each set of features.

INTRODUCTION: This script replicates Dr. Jason Brownlee's blog post [https://machinelearningmastery.com/multi-label-classification-with-deep-learning/] on this topic with some modifications. The goal is to build a robust template for modeling similar problems in the future.

ANALYSIS: [Sample Paragraph]

CONCLUSION: [Sample Paragraph]

Dataset Used: Planet: [PROJECT NAME] Dataset

Dataset ML Model: Multi-label classification with numerical attributes

Dataset Reference: [Dataset URL]

Potential Sources of Benchmark: [Benchmark URL]

A deep-learning modeling project can generally be broken down into five major tasks:

1. Prepare Environment
2. Load and Prepare Data
3. Define and Train Models
4. Evaluate and Optimize Models
5. Finalize Model and Make Predictions

# Task 1. Prepare Environment

In [1]:
# # Install the packages that support accessing environment variables and SQL databases
# !pip install python-dotenv PyMySQL

In [2]:
# # Retrieve GPU configuration information from Colab
# gpu_info = !nvidia-smi
# gpu_info = '\n'.join(gpu_info)
# if gpu_info.find('failed') >= 0:
#     print('Select the Runtime → "Change runtime type" menu to enable a GPU accelerator, ')
#     print('and then re-execute this cell.')
# else:
#     print(gpu_info)

In [3]:
# # Retrieve memory configuration information from Colab
# from psutil import virtual_memory
# ram_gb = virtual_memory().total / 1e9
# print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

# if ram_gb < 20:
#     print('To enable a high-RAM runtime, select the Runtime → "Change runtime type"')
#     print('menu, and then select High-RAM in the Runtime shape dropdown. Then, ')
#     print('re-execute this cell.')
# else:
#     print('You are using a high-RAM runtime!')

In [4]:
# # Direct Colab to use TensorFlow v2
# %tensorflow_version 2.x

In [5]:
# Retrieve CPU information from the system
ncpu = !nproc
print("The number of available CPUs is:", ncpu[0])

The number of available CPUs is: 4


In [6]:
# Set the random seed number for reproducible results
seedNum = 8

In [7]:
# Load libraries and packages
import random
random.seed(seedNum)
import numpy as np
np.random.seed(seedNum)
import os
import sys
import zipfile
import boto3
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
from dotenv import load_dotenv
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RepeatedKFold
from sklearn.metrics import accuracy_score
import tensorflow as tf
tf.random.set_seed(seedNum)
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

In [8]:
# Begin the timer for the script processing
startTimeScript = datetime.now()

# Set up the number of CPU cores available for multi-thread processing
n_jobs = 1

# Set up the flag to stop sending progress emails (setting to True will send status emails!)
notifyStatus = False

# Set Pandas options
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
pd.set_option("display.width", 140)

# Set the number of folds for cross validation
n_folds = 5

# Set the percentage sizes for splitting the dataset
test_set_size = 0.20
val_set_size = 0.25

# Set various default Keras modeling parameters
default_loss = 'binary_crossentropy'
default_metrics = ['accuracy']
default_optimizer = keras.optimizers.Adam(learning_rate=0.001)
default_kernel_init = keras.initializers.GlorotUniform(seed=seedNum)
default_epoch = 100
default_batch = 32

default_samples = 1000  # The number of samples
default_features = 10  # The total number of features
default_classes = 3  # The number of classes of the classification problem
default_labels = 2  # The average number of labels per instance

# Define the labels to use for graphing the data
train_metric = "accuracy"
train_loss = "loss"

# Check the number of GPUs accessible through TensorFlow
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

# Print out the TensorFlow version for confirmation
print('TensorFlow version:', tf.__version__)

Num GPUs Available:  0


In [None]:
# Set up the parent directory location for loading the dotenv files
# useGDrive = True
# if useGDrive:
#     # Mount Google Drive locally for storing files
#     from google.colab import drive
#     drive.mount('/content/gdrive')
#     gdrivePrefix = '/content/gdrive/My Drive/Colab_Downloads/'
#     env_path = '/content/gdrive/My Drive/Colab Notebooks/'
#     dotenv_path = env_path + "python_script.env"
#     load_dotenv(dotenv_path=dotenv_path)

# Set up the dotenv file for retrieving environment variables
# useLocalPC = True
# if useLocalPC:
#     env_path = "/Users/david/PycharmProjects/"
#     dotenv_path = env_path + "python_script.env"
#     load_dotenv(dotenv_path=dotenv_path)

In [9]:
# Set up the email notification function
def status_notify(msg_text):
    access_key = os.environ.get('SNS_ACCESS_KEY')
    secret_key = os.environ.get('SNS_SECRET_KEY')
    aws_region = os.environ.get('SNS_AWS_REGION')
    topic_arn = os.environ.get('SNS_TOPIC_ARN')
    if (access_key is None) or (secret_key is None) or (aws_region is None) or (topic_arn is None):
        sys.exit("Incomplete notification setup info. Script Processing Aborted!!!")
    sns = boto3.client('sns', aws_access_key_id=access_key, aws_secret_access_key=secret_key, region_name=aws_region)
    response = sns.publish(TopicArn=topic_arn, Message=msg_text)
    if response['ResponseMetadata']['HTTPStatusCode'] != 200:
        print('Status notification not OK with HTTP status code:', response['ResponseMetadata']['HTTPStatusCode'])

In [None]:
if (notifyStatus): status_notify('(TensorFlow Multi-Label) Task 1 - Prepare Environment has begun on ' + datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [10]:
# Reset the random number generators
def reset_random(x):
    random.seed(x)
    np.random.seed(x)
    tf.random.set_seed(x)
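
Re-seeding before each experiment makes the pseudo-random draws repeat. Below is a minimal sketch of the idea using only Python's built-in `random` module. The helper `draw_after_reset` is hypothetical; `reset_random` above applies the same principle to NumPy and TensorFlow as well.

```python
import random

def draw_after_reset(seed, count=3):
    """Re-seed the generator, then draw `count` pseudo-random floats."""
    random.seed(seed)
    return [random.random() for _ in range(count)]

# Resetting to the same seed reproduces the same sequence of draws
first_run = draw_after_reset(8)
second_run = draw_after_reset(8)
print(first_run == second_run)
```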

In [11]:
if (notifyStatus): status_notify('(TensorFlow Multi-Label) Task 1 - Prepare Environment completed on ' + datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

# Task 2. Load and Prepare Data

In [12]:
if (notifyStatus): status_notify('(TensorFlow Multi-Label) Task 2 - Load and Prepare Data has begun on ' + datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [13]:
# Retrieve the dataset
X_original, y_original = make_multilabel_classification(n_samples=default_samples,
                                                        n_features=default_features,
                                                        n_classes=default_classes,
                                                        n_labels=default_labels,
                                                        random_state=seedNum)
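
In a multi-label problem the target is a binary indicator matrix with one column per class, and any number of columns can be 1 in a given row. The sketch below uses made-up rows (not taken from the generated dataset) and a hypothetical helper, `label_cardinality`, to compute the average number of labels per instance, which is the quantity the `n_labels` argument is meant to steer.

```python
def label_cardinality(y):
    """Average number of active labels per row of a 0/1 indicator matrix."""
    return sum(sum(row) for row in y) / len(y)

# Illustrative targets for three classes; the values are made up for this example
y_example = [
    [1, 0, 1],  # two labels apply
    [0, 0, 0],  # no label applies
    [1, 1, 1],  # all three labels apply
    [0, 1, 0],  # exactly one label applies
]
print(label_cardinality(y_example))  # 6 active labels spread over 4 rows
```

Running the same helper on `y_original.tolist()` would show how close the generated data comes to `default_labels`; the realized average need not match it exactly.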

In [14]:
# Split the data into a combined training/validation set and a held-out test set
X_train_val, X_test, y_train_val, y_test = train_test_split(X_original, y_original, test_size=test_set_size, random_state=seedNum)
print("X_train_val.shape: {} y_train_val.shape: {}".format(X_train_val.shape, y_train_val.shape))
print("X_test.shape: {} y_test.shape: {}".format(X_test.shape, y_test.shape))

X_train_val.shape: (800, 10) y_train_val.shape: (800, 3)
X_test.shape: (200, 10) y_test.shape: (200, 3)
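
The `val_set_size` setting is not applied in the cell above. If a separate validation set were carved out of the 800 training rows, the resulting sizes could be estimated as follows. This is an arithmetic sketch only: `split_sizes` is a hypothetical helper, and scikit-learn's own rounding may differ when the fractions do not divide evenly.

```python
def split_sizes(n_samples, test_frac, val_frac):
    """Return (train, validation, test) row counts for a two-stage split."""
    n_test = round(n_samples * test_frac)
    n_train_val = n_samples - n_test
    # The validation fraction applies to what is left after the test split
    n_val = round(n_train_val * val_frac)
    return n_train_val - n_val, n_val, n_test

print(split_sizes(1000, 0.20, 0.25))
```

With the notebook's settings this works out to a 60/20/20 split of the original 1,000 rows.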


In [15]:
if (notifyStatus): status_notify('(TensorFlow Multi-Label) Task 2 - Load and Prepare Data completed on ' + datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

# Task 3. Define and Train Models

In [16]:
if (notifyStatus): status_notify('(TensorFlow Multi-Label) Task 3 - Define and Train Models has begun on ' + datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [17]:
# Define the function for plotting training results for comparison
def plot_metrics(history):
    fig, axs = plt.subplots(1, 2, figsize=(24, 15))
    metrics = [train_loss, train_metric]
    for n, metric in enumerate(metrics):
        name = metric.replace("_", " ").capitalize()
        # Draw on the n-th panel of the 1x2 grid created above
        plt.sca(axs[n])
        plt.plot(history.epoch, history.history[metric], color='blue', label='Train')
        # Validation curves exist only when fit() was given validation data
        if 'val_' + metric in history.history:
            plt.plot(history.epoch, history.history['val_'+metric], color='red', linestyle="--", label='Val')
        plt.xlabel('Epoch')
        plt.ylabel(name)
        if metric == train_loss:
            plt.ylim([0, plt.ylim()[1]])
        else:
            plt.ylim([0, 1])
        plt.legend()

In [18]:
# Define the baseline model for benchmarking
def create_nn_model(n_inputs, n_outputs, dense_nodes=20, opt_param=default_optimizer, init_param=default_kernel_init):
    nn_model = keras.Sequential([
        keras.layers.Dense(dense_nodes, input_dim=n_inputs, activation='relu', kernel_initializer=init_param),
        keras.layers.Dense(n_outputs, activation='sigmoid')
    ])
    # Compile with the optimizer passed in by the caller rather than the module-level default
    nn_model.compile(loss=default_loss, optimizer=opt_param, metrics=default_metrics)
    return nn_model
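
The sigmoid output layer scores each label independently, so the outputs need not sum to 1 and several labels can be predicted at once. Below is a dependency-free sketch with made-up logits. `predict_labels` is a hypothetical helper, and the strict `> 0.5` cut-off is an assumption standing in for the rounding step used later in the notebook.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_labels(logits, threshold=0.5):
    """Turn per-label logits into independent 0/1 decisions."""
    return [1 if sigmoid(z) > threshold else 0 for z in logits]

logits = [2.0, -1.0, 0.3]  # illustrative values, one per label
print([round(sigmoid(z), 3) for z in logits])
print(predict_labels(logits))
```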

In [19]:
# Evaluate the baseline model using repeated k-fold cross-validation
def evaluate_baseline(X, y):
    results = list()
    n_inputs, n_outputs = X.shape[1], y.shape[1]
    # define evaluation procedure
    cv = RepeatedKFold(n_splits=n_folds, n_repeats=3, random_state=seedNum)
    # enumerate folds
    for train_ix, validation_ix in cv.split(X):
        # prepare data
        X_train, X_validation = X[train_ix], X[validation_ix]
        y_train, y_validation = y[train_ix], y[validation_ix]
        # define model
        model = create_nn_model(n_inputs, n_outputs)
        # fit model
        model.fit(X_train, y_train, epochs=default_epoch, batch_size=default_batch, verbose=0)
        # make a prediction on the held-out validation fold
        yhat = model.predict(X_validation)
        # round probabilities to class labels
        yhat = yhat.round()
        # calculate accuracy
        acc = accuracy_score(y_validation, yhat)
        # store result
        print('Accuracy score obtained for this CV round: %.3f' % acc)
        results.append(acc)
    return results

# evaluate model
results = evaluate_baseline(X_train_val, y_train_val)
# summarize performance
print('Final Accuracy Measurements: %.3f (%.3f)' % (np.mean(results), np.std(results)))

Accuracy score obtained for this CV round: 0.637
Accuracy score obtained for this CV round: 0.681
Accuracy score obtained for this CV round: 0.731
Accuracy score obtained for this CV round: 0.656
Accuracy score obtained for this CV round: 0.581
Accuracy score obtained for this CV round: 0.625
Accuracy score obtained for this CV round: 0.631
Accuracy score obtained for this CV round: 0.738
Accuracy score obtained for this CV round: 0.650
Accuracy score obtained for this CV round: 0.575
Accuracy score obtained for this CV round: 0.525
Accuracy score obtained for this CV round: 0.681
Accuracy score obtained for this CV round: 0.637
Accuracy score obtained for this CV round: 0.644
Accuracy score obtained for this CV round: 0.700
Final Accuracy Measurements: 0.646 (0.055)
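
For multi-label targets, scikit-learn's `accuracy_score` computes subset accuracy: a sample counts as correct only when every one of its labels matches. The sketch below contrasts that with a more lenient per-label accuracy, using made-up predictions and hypothetical helper names.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of rows where the full label vector matches exactly."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return hits / len(y_true)

def per_label_accuracy(y_true, y_pred):
    """Fraction of individual label decisions that are correct."""
    pairs = [(a, b) for t, p in zip(y_true, y_pred) for a, b in zip(t, p)]
    return sum(1 for a, b in pairs if a == b) / len(pairs)

# Made-up data: the second and fourth rows each get one label wrong
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [1, 0, 1]]
print(subset_accuracy(y_true, y_pred))
print(per_label_accuracy(y_true, y_pred))
```

Subset accuracy is the stricter of the two, which is worth keeping in mind when reading the scores above.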


In [20]:
if (notifyStatus): status_notify('(TensorFlow Multi-Label) Task 3 - Define and Train Models completed on ' + datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

# Task 4. Evaluate and Optimize Models

In [21]:
if (notifyStatus): status_notify('(TensorFlow Multi-Label) Task 4 - Evaluate and Optimize Models has begun on ' + datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [22]:
# Evaluate the alternate (wider) model using repeated k-fold cross-validation
def evaluate_alternate(X, y):
    results = list()
    n_inputs, n_outputs = X.shape[1], y.shape[1]
    # define evaluation procedure
    cv = RepeatedKFold(n_splits=n_folds, n_repeats=3, random_state=seedNum)
    # enumerate folds
    for train_ix, validation_ix in cv.split(X):
        # prepare data
        X_train, X_validation = X[train_ix], X[validation_ix]
        y_train, y_validation = y[train_ix], y[validation_ix]
        # define model
        dense_alternate = 40
        model = create_nn_model(n_inputs, n_outputs, dense_alternate)
        # fit model
        model.fit(X_train, y_train, epochs=default_epoch, batch_size=default_batch, verbose=0)
        # make a prediction on the held-out validation fold
        yhat = model.predict(X_validation)
        # round probabilities to class labels
        yhat = yhat.round()
        # calculate accuracy
        acc = accuracy_score(y_validation, yhat)
        # store result
        print('Accuracy score obtained for this CV round: %.3f' % acc)
        results.append(acc)
    return results

# evaluate model
results = evaluate_alternate(X_train_val, y_train_val)
# summarize performance
print('Final Accuracy Measurements: %.3f (%.3f)' % (np.mean(results), np.std(results)))

Accuracy score obtained for this CV round: 0.656
Accuracy score obtained for this CV round: 0.669
Accuracy score obtained for this CV round: 0.750
Accuracy score obtained for this CV round: 0.681
Accuracy score obtained for this CV round: 0.600
Accuracy score obtained for this CV round: 0.619
Accuracy score obtained for this CV round: 0.644
Accuracy score obtained for this CV round: 0.694
Accuracy score obtained for this CV round: 0.637
Accuracy score obtained for this CV round: 0.644
Accuracy score obtained for this CV round: 0.544
Accuracy score obtained for this CV round: 0.700
Accuracy score obtained for this CV round: 0.675
Accuracy score obtained for this CV round: 0.656
Accuracy score obtained for this CV round: 0.725
Final Accuracy Measurements: 0.660 (0.049)
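
The two summary lines can be reproduced from the per-round scores printed in the notebook. The sketch below re-enters those three-decimal scores by hand, so the recomputed figures are approximate. `statistics.pstdev` is used because `np.std` defaults to the population standard deviation.

```python
from statistics import fmean, pstdev

# Per-round scores copied from the two cross-validation outputs
baseline = [0.637, 0.681, 0.731, 0.656, 0.581, 0.625, 0.631, 0.738,
            0.650, 0.575, 0.525, 0.681, 0.637, 0.644, 0.700]
alternate = [0.656, 0.669, 0.750, 0.681, 0.600, 0.619, 0.644, 0.694,
             0.637, 0.644, 0.544, 0.700, 0.675, 0.656, 0.725]

for name, scores in (("20 nodes", baseline), ("40 nodes", alternate)):
    print('%s: %.3f (%.3f)' % (name, fmean(scores), pstdev(scores)))

# Gap between the two means, for comparison with the spread across rounds
print('Difference in means: %.3f' % (fmean(alternate) - fmean(baseline)))
```

The gap between the two means is smaller than either standard deviation, so the gain from 40 nodes may not be meaningful on its own.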


In [23]:
if (notifyStatus): status_notify('(TensorFlow Multi-Label) Task 4 - Evaluate and Optimize Models completed on ' + datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

# Task 5. Finalize Model and Make Predictions

In [24]:
if (notifyStatus): status_notify('(TensorFlow Multi-Label) Task 5 - Finalize Model and Make Predictions has begun on ' + datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [25]:
# Train the final model
final_optimizer = default_optimizer
final_kernel_init = default_kernel_init
final_epoch = default_epoch
final_batch = default_batch
layer1_nodes = 40
n_inputs, n_outputs = X_train_val.shape[1], y_train_val.shape[1]
print('The final modeling parameters are: optimizer=%s, kernel=%s, epochs=%d, batch_size=%d' % (final_optimizer, final_kernel_init, final_epoch, final_batch))
final_model = create_nn_model(n_inputs, n_outputs, layer1_nodes, final_optimizer, final_kernel_init)
final_hist = final_model.fit(X_train_val, y_train_val, epochs=final_epoch, batch_size=final_batch, verbose=1)

The final modeling parameters are: optimizer=<tensorflow.python.keras.optimizer_v2.adam.Adam object at 0x7fc1b8318fd0>, kernel=<tensorflow.python.ops.init_ops_v2.GlorotUniform object at 0x7fc1b831f110>, epochs=100, batch_size=32
Epoch 1/100
...
Epoch 60/100
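
Each epoch in the log above is one full pass over the training rows in mini-batches. Below is a quick arithmetic sketch, with a hypothetical helper `steps_per_epoch`, of how many weight updates that implies for the 800-row training set and a batch size of 32.

```python
import math

def steps_per_epoch(n_samples, batch_size):
    """Number of mini-batches per epoch; the last batch may be smaller."""
    return math.ceil(n_samples / batch_size)

n_steps = steps_per_epoch(800, 32)
print(n_steps)        # batches per epoch
print(n_steps * 100)  # total updates over 100 epochs
```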


In [26]:
# Display a summary of the final model
print(final_model.summary())

Model: "sequential_30"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_60 (Dense)             (None, 40)                440       
_________________________________________________________________
dense_61 (Dense)             (None, 3)                 123       
=================================================================
Total params: 563
Trainable params: 563
Non-trainable params: 0
_________________________________________________________________
None
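
The parameter counts in the summary can be checked by hand: a fully connected layer has one weight per input-output pair plus one bias per output unit. A small sketch follows (`dense_param_count` is a hypothetical helper).

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden = dense_param_count(10, 40)  # 10 features into 40 hidden nodes
output = dense_param_count(40, 3)   # 40 hidden nodes into 3 labels
print(hidden, output, hidden + output)
```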


In [27]:
# Check the performance of the model using the test dataset
final_model.evaluate(X_test, y_test)



[0.3741074800491333, 0.675000011920929]
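
`evaluate` returns the loss followed by the compiled metrics, so the first number above is the binary cross-entropy on the test set. As a rough illustration of what that loss measures, here is a dependency-free sketch with made-up probabilities. The clipping constant `eps` is an assumption of this example, not necessarily what Keras uses internally.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over every label of every sample."""
    total, count = 0.0, 0
    for t_row, p_row in zip(y_true, y_prob):
        for t, p in zip(t_row, p_row):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
            count += 1
    return total / count

y_true = [[1, 0, 1], [0, 1, 0]]
confident = [[0.9, 0.1, 0.8], [0.2, 0.7, 0.1]]  # close to the true labels
hesitant = [[0.6, 0.4, 0.5], [0.5, 0.5, 0.4]]   # near 0.5 everywhere
print(binary_crossentropy(y_true, confident))
print(binary_crossentropy(y_true, hesitant))
```

Predictions that sit closer to the true labels give a lower loss, which is why the loss can keep improving even when the rounded labels, and therefore the accuracy, stay the same.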

In [28]:
if (notifyStatus): status_notify('(TensorFlow Multi-Label) Task 5 - Finalize Model and Make Predictions completed on ' + datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [29]:
print('Total time for the script:', (datetime.now() - startTimeScript))

Total time for the script: 0:01:53.199012
