<a href="https://colab.research.google.com/github/alegtr2003/alegtr2003/blob/main/pws_nut_phase_SNN_function_based_(nut_phase_partitioned)%3B_09_15_2023.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---

**Program Overview**

---

In [None]:
# Author: Carlos R. Sulsona (CS)
# Date Created: 08/09/2023
# Latest Revision Date: 09/20/2023, CS
# Description: This module contains Python code for constructing a Sequential Neural Network (SNN)
# using the Scikit-Learn, TensorFlow, and Keras APIs. The model is a multi-class classifier trained with
# the categorical_crossentropy loss function to predict Prader-Willi Syndrome (PWS) nutritional phase
# from data collected with a nutritional phase questionnaire.
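
For reference, the categorical cross-entropy named in the description can be sketched in plain NumPy. The helper name, the six-class one-hot target, and the predicted probabilities below are illustrative assumptions and are not taken from the questionnaire data.

```python
import numpy as np

def categorical_crossentropy_np(y_true, y_pred, eps=1e-7):
    """Per-record categorical cross-entropy (hypothetical helper, not a Keras function)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)          # avoid log(0)
    return -np.sum(y_true * np.log(y_pred), axis=1)   # one loss value per record

# One record whose true class is the fourth of six classes (made-up numbers)
y_true = np.array([[0.0, 0.0, 0.0, 1.0, 0.0, 0.0]])
y_pred = np.array([[0.05, 0.05, 0.10, 0.70, 0.05, 0.05]])

loss = categorical_crossentropy_np(y_true, y_pred)
print(loss)   # for a one-hot target the loss reduces to -ln(probability of the true class)
```

Only the probability assigned to the true class contributes to the loss, which is why a confident wrong prediction is penalized heavily.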


---

**Required Modules, Libraries and Packages**

---

In [None]:
#!pip install -q keras
#!pip install tf2onnx
#!pip install git+https://github.com/onnx/tensorflow-onnx
!pip install --upgrade tensorflow

Collecting tensorflow
  Downloading tensorflow-2.16.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (589.8 MB)
Collecting h5py>=3.10.0 (from tensorflow)
  Downloading h5py-3.11.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.3 MB)
Collecting ml-dtypes~=0.3.1 (from tensorflow)
  Downloading ml_dtypes-0.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.2 MB)
Collecting tensorboard<2.17,>=2.16 (from tensorflow)
  Downloading tensorboard-2.16.2-py3-none-any.whl (5.5 MB)

In [None]:
import tensorflow as tf
print("TensorFlow version:", tf.__version__)
!nvcc --version
!apt list --installed | grep cudnn


TensorFlow version: 2.15.0
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0


libcudnn8-dev/unknown,now 8.9.6.50-1+cuda12.2 amd64 [installed,upgradable to: 8.9.7.29-1+cuda12.2]
libcudnn8/unknown,now 8.9.6.50-1+cuda12.2 amd64 [installed,upgradable to: 8.9.7.29-1+cuda12.2]


In [None]:
# scikit-learn: One of the most widely used Python libraries for general machine learning. It contains functions for tasks such as model evaluation and hyper-parameter optimization.
# Keras: One of the most widely used Python deep learning libraries for research and development. It is a high-level deep learning API developed by Google for implementing neural networks.

#--- CLASS MODULES REQUIRED FOR DATA PROCESSING AND DATA VISUALIZATION --------------------
# from numpy.random import seed
# seed(0)
import pandas as pd                                          # imports pandas library for reading datafile and creating and manipulating data frames
import sklearn                                               # imports the sklearn scikit library - features various classification, regression and clustering algorithms
import numpy as np                                           # library for scientific computing and matrix support for Python
import seaborn as sn                                         # library for graphical visualization of data
import pickle
from flask import Flask
import importlib.metadata

from sklearn.model_selection import train_test_split         # imports the sklearn scikit library required to split dataset into train and test sets
from sklearn.metrics import accuracy_score                   # used to validate trained model
import matplotlib.pyplot as plt                              # used for plotting data
from sklearn.utils import shuffle                            # contains tools for shuffling data in a dataframe
from IPython.display import clear_output                     # contains functions for updating data during program execution
#from keras.callbacks import Callback                         # contains functions that can perform actions during training and allow views of the current state of the model
#from keras.api._v2.keras.backend import clear_session        # contains functions for releasing resources used in creating a model
from sklearn.preprocessing import MinMaxScaler               # scales values to a range of 0 to 1
from sklearn.preprocessing import StandardScaler             # standardizes values to zero mean and unit variance
from sklearn.preprocessing import LabelEncoder               # contains functions for encoding categorical labels as integers from 0 to (number of classes - 1)
from keras.utils import normalize                            # contains functions to normalize a numpy array

#--- CLASS MODULES REQUIRED FOR BUILDING AND TRAINING OF NEURAL NETWORK --------------------
import tensorflow as tf                                      # contains methods for creating a neural network
from tensorflow import keras                                 # high-level API running on top of Tensorflow used for the implementation of a neural network
from keras.models import Sequential                          # contains functions for building a sequential (multi-layered) neural network
from keras.optimizers import Adam                            # Adam optimizer used to minimize the loss during training
from keras.metrics import categorical_crossentropy           # categorical cross-entropy function for multi-class classification
from keras.layers import Dense                               # contains functions for building the layers of a dense neural network
# from keras.utils.vis_utils import plot_model               # converts a Keras model to dot format and save to a file
from keras.models import save_model                          # contains functions for saving a trained model to a file
from keras.models import load_model                          # contains functions for loading a saved model from a file
from keras import optimizers                                 # contains functions for implementing various optimization algorithms
#from keras.saving.saving_api import save_weights           # contains functions for saving a model's edge weights
from sklearn.model_selection import KFold, StratifiedKFold   # contains function required to perform cross-validation

#--- MODEL EVALUATION MODULES -------------------
from sklearn.model_selection import cross_val_score          # contains function required to perform cross-validation scores
from sklearn.model_selection import cross_validate           # contains metrics to evaluate by cross-validation
from sklearn.model_selection import cross_val_predict        # contains metrics to evaluate by cross-validation
from sklearn.model_selection import GridSearchCV             # contains functions for evaluating a model using cross-validation grid search
from sklearn.metrics import confusion_matrix                 # contains functions for computing a confusion matrix to assess model performance
from sklearn.linear_model import SGDClassifier, LogisticRegression
#from sklearn.wrappers import KerasClassifier
from sklearn import metrics


#--- UTILITY CLASS MODULES  --------------------
from datetime import date                                    # contains functions for loading current date
from time import time                                        # contains functions for loading current time
from datetime import datetime                                # contains functions for loading current date and local time
import sys                                                   # contains functions for obtaining information of the Python Runtime Environment, such as version, etc.
from keras.callbacks import History                          # contains functions for recording events into a History object
from sklearn.pipeline import Pipeline                        # contains functions for creating a pipeline of transforms with a final estimator
from sklearn.preprocessing import Normalizer                 # contains functions for normalizing the features dataset (X_values)
from sklearn.preprocessing import Binarizer                  # contains functions for binarization of values in the features dataset (X_values)
from keras import backend as K                               # contains functions for resetting all global states generated by Keras API during model implementation
import statistics as st                                      # module for performing statistics (mean, stdDev, etc.)
from weakref import ref                                      # contains functions to allow creation of weak references to objects
import random                                                # contains random number generators
#import tf2onnx
#import onnx



---

**Global Variables**

---

In [None]:
#--- GLOBAL AND TEMPORARY VARIABLES  --------------------
models = []                                                  # list for storing the loop generated models at the various percentages
modelTrainingDatasets=[]                                     # list for storing datasets to be used for training a model
mergedDatasets=[]                                            # list for storing merged datasets
history=Sequential()                                         # variable stores shell to create a sequential neural network object
tempModel=Sequential()                                       # variable stores shell to create a sequential neural network object
tempModel_1=Sequential()                                     # variable stores shell to create a sequential neural network object
tempModel_2=Sequential()                                     # variable stores shell to create a sequential neural network object

---

**Version of Programming Language(s) and APIs Used**

---

In [None]:
# Display Platform Versions Used
print("Python version: " + sys.version)
print("TensorFlow, Keras version: " + tf.__version__)
print("Scikit-Learn version: " + sklearn.__version__)
print("Pickle version: " + pickle.format_version)
print("Flask version: " + importlib.metadata.version("flask"))
print("Pandas version: " + importlib.metadata.version("pandas"))
print("Numpy version: " + importlib.metadata.version("numpy"))
#print("ONNX version: " + importlib.metadata.version("onnx"))

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
TensorFlow, Keras version: 2.16.1
Scikit-Learn version: 1.2.2
Pickle version: 4.0
Flask version: 2.2.5
Pandas version: 2.0.3
Numpy version: 1.25.2


In [None]:
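# Convert a Keras model to ONNX with tf2onnx.convert.from_keras
# (tf2onnx.convert.from_function is the counterpart for tf.function objects)
import tf2onnx                                               # requires the tf2onnx pip install that is commented out above
import onnx

model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(4, activation="relu"))

# tf.TensorSpec describes the shape and dtype of the model input tensor only, not the model outputs.
# The leading dimension is the batch (records) dimension; for 43 input features the spec would be [None, 43].
input_signature = [tf.TensorSpec([43,6], tf.float32, name='x')]
onnx_model, _ = tf2onnx.convert.from_keras(model, input_signature, opset=13)
onnx.save(onnx_model, "sample_data/model.onnx")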

AttributeError: 'Sequential' object has no attribute 'output_names'

In [None]:
from google.colab import drive
drive.mount('/content/drive')

---

**Classes**

---



In [None]:
#--- CUSTOM CLASSES --------------------
class ShowModelLearning(keras.callbacks.Callback):
    """Callback that plots the training (and, when available, validation) loss and accuracy after each epoch"""
    def on_train_begin(self, logs=None):
        # Reset the stored metric history at the start of every training run
        self.metrics = {}

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}    # avoid a mutable default argument

        # Store the metrics reported for this epoch
        for metric in logs:
            self.metrics.setdefault(metric, []).append(logs.get(metric))

        # Plot loss and accuracy values
        metrics = [x for x in logs if "val" not in x]

        clear_output(wait=True)
        # squeeze=False always returns a 2D array of axes, even when a single metric is plotted
        f, axis = plt.subplots(1, len(metrics), figsize=(15,5), squeeze=False)

        for i, metric in enumerate(metrics):
            axis[0][i].plot(range(1, epoch + 2),
                        self.metrics[metric],
                        label=metric)
            if "val_" + metric in logs:    # validation metrics exist only when validation data is supplied
                axis[0][i].plot(range(1, epoch + 2),
                            self.metrics["val_" + metric],
                            label="val_" + metric)

            axis[0][i].legend()
            axis[0][i].grid()

        plt.tight_layout()
        plt.show()

# Create an object of the ShowModelLearning Class
callbacks_list = ShowModelLearning()


---

**Utility Functions**

---



In [None]:
def readCSVFile(datafile):
    '''Reads the csv file passed by the function caller and returns its contents as a pandas dataframe object'''
    df = pd.read_csv(datafile)    # main dataframe
    df2 = df.copy()    # independent working copy of the main dataframe

    return(df2)


In [None]:
def listMultiAppend(*args):
    '''Accepts 'n' elements and appends them to a list then returns the list'''

    return(list(args))    # args is a tuple of the 'n' elements, in the order they were passed


In [None]:
def dictMultiAppend(*args):
    '''Accepts 'n' dictionaries (or iterables of key/value pairs) and merges them into a single dictionary then returns the dictionary'''

    elementsToAppend=args
    appendedElements={}

    for element in elementsToAppend:
        appendedElements.update(element)    # a dict has no append(); update() adds the key/value pairs

    return(appendedElements)

In [None]:
def averageMultiple2DArrays(inputArray):
    '''Function inputs an array of 2D lists (3D object) and returns, for each 2D list block,
    the element-wise averages across the lists within that block'''

    results=[]

    for block in inputArray:    # each block is a 2D list
        numLists=len(block)
        # average the values found at the same position of every list in the block
        blockAvg=[sum(lst[elem] for lst in block)/numLists for elem in range(len(block[0]))]
        results.append(blockAvg)

    return(results)
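
The same block-wise averaging can be sketched with NumPy, assuming every block has the same rectangular shape. The input values below are made up for illustration.

```python
import numpy as np

# Two blocks, each holding two lists of three values (made-up numbers)
blocks = [
    [[1, 2, 3], [3, 4, 5]],
    [[10, 20, 30], [30, 40, 50]],
]

# axis=1 averages across the lists inside each block, position by position
block_averages = np.mean(np.asarray(blocks, dtype=float), axis=1).tolist()
print(block_averages)
```

Here the array has shape (blocks, lists, elements), so reducing over axis 1 leaves one averaged list per block.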


---

**Dataset Preprocessing Functions**

---

In [None]:
def datasetMinMaxScaler(X_values):
    '''Scales each feature (column) of a dataset so that all of its values fall within
    a fixed range, here zero(0) to one(1). Function takes in a 2D dataframe object and
    returns a 2D dataframe object'''

    # Variables
    dfColNames=X_values.columns    # stores dataframe column names

    # Initialize a MinMaxScaler object
    scaler=MinMaxScaler(feature_range=(0,1))  # "feature_range" defines the range to scale data to

    # Fit and transform the data to range of zero(0) and one(1) values
    minMaxScaledFeatures=scaler.fit_transform(X_values)

    # Convert array to a Pandas dataframe object and reassign names to columns
    minMaxScaledFeatures=pd.DataFrame(minMaxScaledFeatures)
    minMaxScaledFeatures.columns=dfColNames

    return(minMaxScaledFeatures)    # returns a 2D dataframe object
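
As a quick check of what min-max scaling does, the sketch below compares `MinMaxScaler` with the formula (x - min) / (max - min) on a toy dataframe. The values and index labels are made up. Passing `index=` when rebuilding the dataframe is an optional variation that keeps the original row labels, whereas rebuilding from the bare array, as the function above does, yields a fresh 0..n-1 index.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data; the values and the index labels are made up for illustration
toy = pd.DataFrame({"Insulin": [10.0, 20.0, 30.0], "Leptin": [2.0, 4.0, 10.0]},
                   index=[5, 6, 7])

scaler = MinMaxScaler(feature_range=(0, 1))
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns, index=toy.index)

# Same result computed by hand: (x - min) / (max - min), column by column
manual = (toy - toy.min()) / (toy.max() - toy.min())
print(scaled)
```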



In [None]:
dfScale=pd.read_csv("mlpa_results (no_ID).2; 09-20-2023.csv")
dfScale=dfScale.drop(columns=["id"])

results=datasetMinMaxScaler(dfScale)
# dfScale=results
dfRev=results
results

results=results["Insulin"]
results=results
results


# -- Dataset visualization --
# Apply default theme
sn.set_theme()

# Create plot
dfScale=dfScale['Insulin']
dfScale=pd.DataFrame(dfScale)
dfScale
nums=list(range(0,len(dfScale)))
dfScale.index=nums
index=dfScale.index
sn.barplot(data=dfScale, y="Insulin", x=index)
sn.relplot(data=dfScale, y="Insulin", x=index, kind="line")

dfRev


FileNotFoundError: [Errno 2] No such file or directory: 'mlpa_results (no_ID).2; 09-20-2023.csv'

In [None]:
def datasetStandardScaler(X_values):
    '''Standardizes the input values (x-values) by removing the mean and scaling
       values to unit variance (z = (x - u) / s). This centers the data around
       zero, with a stdDev of 1.  x=sample, u is the mean, s is standard deviation.
       Function takes in a 2D dataframe object and returns a 2D dataframe object.
       "Centering and scaling happen independently on each feature by computing
       the relevant statistics on the samples in the training set. Mean and
       standard deviation are then stored to be used on later data using transform.
       Standardization of a dataset is a common requirement for many machine
       learning estimators: they might behave badly if the individual features
       do not more or less look like standard normally distributed data
       (e.g. Gaussian with 0 mean and unit variance)." - scikitlearn'''

    # Variables
    dfColNames=X_values.columns    # stores dataframe column names

    # Initialize a StandardScaler object
    scaler=StandardScaler()

    # Fit and transform the data so that each feature has zero mean and unit variance
    standardized_X_values=scaler.fit_transform(X_values)

    # Convert array to a Pandas dataframe object and reassign names to columns
    standardized_X_values=pd.DataFrame(standardized_X_values)
    standardized_X_values.columns=dfColNames

    return(standardized_X_values)    # returns a 2D dataframe object
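
The formula z = (x - u) / s from the docstring can be checked by hand on a small example. The sketch below uses made-up values and relies on the fact that `StandardScaler` computes s as the population standard deviation (`ddof=0`), which differs from the pandas default of `ddof=1`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({"Leptin": [2.0, 4.0, 6.0, 8.0]})   # made-up values

scaled = StandardScaler().fit_transform(toy)

# z = (x - u) / s with s computed as the population standard deviation (ddof=0)
u = toy["Leptin"].mean()
s = toy["Leptin"].std(ddof=0)
manual = ((toy["Leptin"] - u) / s).to_numpy()

print(scaled.ravel())
print(np.allclose(scaled.ravel(), manual))
```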


In [None]:
dfScale=pd.read_csv("multiclass_test_dataset; 09-18-2023.csv")
dfScale=dfScale.drop(columns=["id"])

results=datasetStandardScaler(dfScale)
dfScale=results

results=results["Leptin"]
results=results
results

# -- Dataset visualization --
# Apply default theme
sn.set_theme()

# Create plot
dfScale=dfScale['Leptin']
dfScale=pd.DataFrame(dfScale)
dfScale
nums=list(range(0,len(dfScale)))
dfScale.index=nums
index=dfScale.index
sn.barplot(data=dfScale, y="Leptin", x=index)
sn.displot(data=dfScale)


In [None]:
def datasetNormalization(X_values):
    '''Function takes in a dataframe of x-values (features) and returns a
    dataframe of normalized values, where each record (row) is rescaled
    independently to unit norm. This kind of normalization is commonly used
    in the Vector Space Model for text classification or clustering'''

    # Variables
    dfColNames=X_values.columns    # stores dataframe column names

    # Initialize a Normalizer object (L2 norm by default)
    scaler = Normalizer()

    # Fit and transform the data so that each row has unit norm
    scaler=scaler.fit(X_values)
    normalized_features = scaler.transform(X_values)

    # Convert array to a Pandas dataframe object and reassign names to columns
    normalized_features = pd.DataFrame(normalized_features)
    normalized_features.columns=dfColNames

    return(normalized_features)    # returns a 2D dataframe object
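
Unlike the two scalers above, `Normalizer` works row by row rather than column by column. A minimal sketch with made-up values shows that each record is divided by its own Euclidean length:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

rows = np.array([[3.0, 4.0],
                 [1.0, 0.0]])      # made-up values

unit_rows = Normalizer(norm="l2").fit_transform(rows)
row_lengths = np.linalg.norm(unit_rows, axis=1)   # every row now has length 1

print(unit_rows)
print(row_lengths)
```

Because the scaling depends only on the values within a record, the result for one row is unaffected by the other rows in the dataset.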


In [None]:
dfScale=pd.read_csv("mlpa_results (no_ID).2; 09-20-2023.csv")
dfScale=dfScale.drop(columns=["id", "Phase_1a", "Phase_1b", "Phase_2a", "Phase_2b", "Phase_3", "Phase_4"])
dfCat=dfScale

results=datasetMinMaxScaler(dfScale)

results=datasetNormalization(results)

dfScale=results
dfCatNorm=dfScale

# results=results["Leptin"]
# results=results



# -- Dataset visualization --
# Apply default theme
sn.set_theme()

# # Create plot
# dfScale=dfScale['Leptin']
# dfScale=pd.DataFrame(dfScale)
# nums=list(range(0,96))
# dfScale.index=nums
# index=dfScale.index
# cols=[]
# cols=listMultiAppend(dfCat.columns)
# cols=cols[0]
# sn.set(rc={'figure.figsize':(11.7,8.27)})
# sn.barplot(data=dfScale, y="Leptin", x=index)
# sn.relplot(data=dfScale, y="Leptin", x=index)
# sn.displot(data=dfScale, y="Leptin", x=index, stat="density")
# sn.displot(data=dfCatNorm, y="Leptin", x=index, stat="probability", kind="kde")

# Dataset by single feature
feature=pd.DataFrame(dfCat["Leptin"])
normFeature=pd.DataFrame(dfCatNorm["Leptin"])

feature2=pd.DataFrame(dfCat["Ghrelin"])
normFeature2=pd.DataFrame(dfCatNorm["Ghrelin"])

feature3=pd.DataFrame(dfCat["Insulin"])
normFeature3=pd.DataFrame(dfCatNorm["Insulin"])

# Plots
# sn.catplot(data=dfCat, kind="box")
# sn.catplot(data=dfCatNorm, kind="box")
sn.displot(data=feature2, kde=True)
sn.displot(data=normFeature2, kde=True)

# sn.displot(data=feature2, kind="kde")
# sn.displot(data=normFeature2, kind="kde")
# sn.displot(data=normFeature2, kind="ecdf")

# sn.displot(data=results, x="Leptin", y="Ghrelin", kind="kde")

# # for normFeature in dfCatNorm:
# # sn.displot(data=dfCat[normFeature], kde="True")
# # sn.displot(data=dfCatNorm[normFeature], kde="True")

sn.displot(data=feature3, kde=True)
sn.displot(data=normFeature3, kde=True)

dfCatNorm

In [None]:
def datasetBinarization(X_values):
    '''Function takes in a dataframe of x-values (features) and returns a 2D
    dataframe of binary values.'''

    # Variables
    dfColNames=X_values.columns    # stores dataframe column names

    # Initialize a Binarizer object
    scaler = Binarizer(threshold=500)    # threshold is crucial: values greater than the threshold become 1, all other values become 0

    # Fit and transform the data to values of zero(0) and one(1)
    scaler=scaler.fit(X_values)
    binarized_features = scaler.transform(X_values)

    # Convert array to a Pandas dataframe object and reassign names to columns
    binarized_features = pd.DataFrame(binarized_features)
    binarized_features.columns=dfColNames

    return(binarized_features)    # returns a 2D dataframe object
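
The effect of the threshold is easiest to see on values that straddle it. In this sketch (made-up values), only the value strictly greater than 500 maps to 1:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

values = np.array([[499.0, 500.0, 501.0]])   # made-up values around the threshold

binary = Binarizer(threshold=500).fit_transform(values)
print(binary)
```

A single threshold is applied to every column, so features measured on very different scales may need separate thresholds or prior scaling.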


In [None]:
dfScale=pd.read_csv("multiclass_test_dataset; 09-18-2023.csv")
dfScale=dfScale.drop(columns=["id"])

datasetBinarization(dfScale)

In [None]:
def categoricalDataEncoder(datasetToEncode):
    '''Encodes the textual data in a dataframe column so that all values are numerical.
    Values range from 0 to (number of categories) - 1.
    Takes in a 1D column (pandas Series) and returns a single-column dataframe object'''

    # Variables
    dfColName=datasetToEncode.name    # stores the column name

    # Initialize a LabelEncoder object
    labelEncoder=LabelEncoder()

    # Fit and transform the data to integer codes, one code per category
    encodedData=labelEncoder.fit_transform(datasetToEncode)

    # Convert array to a Pandas dataframe object and reassign name to column
    encodedData=pd.DataFrame(encodedData)
    encodedData.columns=[dfColName]

    return(encodedData)    # returns a single-column dataframe object
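
A short sketch of what `LabelEncoder` produces, using made-up phase labels: categories are sorted, each is assigned an integer code, and `inverse_transform` maps the codes back to the labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

phases = pd.Series(["2b", "1a", "3", "1a"], name="nut_phase")   # made-up labels

encoder = LabelEncoder()
codes = encoder.fit_transform(phases)

print(list(encoder.classes_))                 # categories in sorted order
print(codes.tolist())                         # integer code of each record
decoded = encoder.inverse_transform(codes)    # back to the original labels
```

The integer codes imply an ordering between categories, which may or may not be meaningful for a given model.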


In [None]:
dfOri=pd.read_csv("multiclass_test_dataset_with_categorical_data; 09-18-2023.csv")
dfEncoded=dfOri["nut_phase"]
dfNew=dfOri.drop(columns=["nut_phase"])

dfEncoded=categoricalDataEncoder(dfEncoded)    # encode the nut_phase column extracted above
dfEncoded=pd.concat([dfNew, dfEncoded], axis=1)

dfEncoded

In [None]:
def categoricalDataEncoderDummies(datasetToEncode):
    ''' '''
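
The function above is only a stub in this notebook. One way it might be filled in, assuming one-hot (dummy) encoding is what is intended, is with `pandas.get_dummies`. The helper name `one_hot_encode`, the `Phase` prefix (chosen to mirror the `Phase_1a` style column names), and the labels are all assumptions for illustration.

```python
import pandas as pd

def one_hot_encode(column, prefix="Phase"):
    """Hypothetical helper: returns one 0/1 column per category found in `column`."""
    return pd.get_dummies(column, prefix=prefix, dtype=float)

phases = pd.Series(["1a", "2b", "3", "1a"], name="nut_phase")   # made-up labels
dummies = one_hot_encode(phases)
print(dummies)
```

Each record has exactly one column set to 1, which matches the one-hot target layout that a categorical cross-entropy loss expects.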

---

**Split Dataframe(s) By Nutritional Phase Functions**

---

In [None]:
def splitDataFrameByNutPhase(dataframeToSplit):
    '''Splits a dataset into subsets separating the data by nutritional phase.
        Returns a 1D array of six(6) dataframes, one per nut_phase, each
        containing all columns (input values and output values) for that phase.'''

    #Variables
    df=pd.DataFrame()
    dataframesByNutPhase=[]
    nutPhaseSubsets=[]

    # Copy the dataset
    df=dataframeToSplit

    # Columns to omit from dataframe
    nonEssential_columns = ["rec_num", "sample_id", "nut_phase", "Phase_1a",
                         "Phase_1b","Phase_2a", "Phase_2b","Phase_3", "Phase_4"]
    output_columns = ["Phase_1a","Phase_1b","Phase_2a", "Phase_2b",
                      "Phase_3", "Phase_4"]

    # nut_phase 1a dataset (n=11)
    df_1a=df[df["Phase_1a"] > 0.0]  # creates a df with only Phase 1a values

    # nut_phase 1b dataset (n=12)
    df_1b=df[df["Phase_1b"] > 0.0]  # creates a df with only Phase 1b values

    # nut_phase 2a dataset (n=10)
    df_2a=df[df["Phase_2a"] > 0.0]  # creates a df with only Phase 2a values

    # nut_phase 2b dataset (n=42)
    df_2b=df[df["Phase_2b"] > 0.0]  # creates a df with only Phase 2b values

    # nut_phase 3 dataset (n=19)
    df_3=df[df["Phase_3"] > 0.0]  # creates a df with only Phase 3 values

    # nut_phase 4 dataset (n=7)
    df_4=df[df["Phase_4"] > 0.0]  # creates a df with only Phase 4 values

    # package all subsets
    dataframesByNutPhase=listMultiAppend(df_1a, df_1b, df_2a, df_2b, df_3, df_4)    # 1D array

    return(dataframesByNutPhase)
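
The six near-identical filters in the function above can also be written as a loop over the phase columns. This is a sketch on a toy dataframe with a single made-up feature, not a replacement for the function.

```python
import pandas as pd

phase_columns = ["Phase_1a", "Phase_1b", "Phase_2a", "Phase_2b", "Phase_3", "Phase_4"]

# Toy frame: one made-up feature plus one-hot phase columns (values are illustrative only)
toy = pd.DataFrame({"Leptin": [1.0, 2.0, 3.0, 4.0]})
labels = ["Phase_1a", "Phase_2b", "Phase_2b", "Phase_4"]
for col in phase_columns:
    toy[col] = [1.0 if label == col else 0.0 for label in labels]

# One subset per phase, in the same order as phase_columns
subsets = [toy[toy[col] > 0.0] for col in phase_columns]
sizes = {col: len(subset) for col, subset in zip(phase_columns, subsets)}
print(sizes)
```

Keeping the column names in one list means a new phase only has to be added in a single place.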


In [None]:
def splitDataFrameIntoInputOutputByNutPhase(dataframeToSplit):
    '''Splits a dataset into subsets separating the data by nutritional phase.
       Returns a 2D array of two(2) lists, input values and output values,
       each holding one dataframe per nutritional phase'''

    #Variables
    df=pd.DataFrame()
    inputValues=[]
    outputValues=[]
    nutPhaseSubsets=[]

    # Copy the dataset
    df=dataframeToSplit

    # Columns to omit from dataframe (43 features are retained representing inputs for each nut_phase)
    nonEssential_columns = ["rec_num", "sample_id", "nut_phase", "Phase_1a",
                         "Phase_1b","Phase_2a", "Phase_2b","Phase_3", "Phase_4"]
    output_columns = ["Phase_1a", "Phase_1b", "Phase_2a", "Phase_2b",
                      "Phase_3", "Phase_4"]

    # nut_phase 1a dataset (n=11)
    df_1a=df[df["Phase_1a"] > 0.0]  # creates a df with only Phase 1a values
    X_1a_input=df_1a.drop(columns=nonEssential_columns)
    y_1a_output=df_1a[output_columns]

    # nut_phase 1b dataset (n=12)
    df_1b=df[df["Phase_1b"] > 0.0]  # creates a df with only Phase 1b values
    X_1b_input=df_1b.drop(columns=nonEssential_columns)
    y_1b_output=df_1b[output_columns]

    # nut_phase 2a dataset (n=10)
    df_2a=df[df["Phase_2a"] > 0.0]  # creates a df with only Phase 2a values
    X_2a_input=df_2a.drop(columns=nonEssential_columns)
    y_2a_output=df_2a[output_columns]

    # nut_phase 2b dataset (n=42)
    df_2b=df[df["Phase_2b"] > 0.0]  # creates a df with only Phase 2b values
    X_2b_input=df_2b.drop(columns=nonEssential_columns)
    y_2b_output=df_2b[output_columns]

    # nut_phase 3 dataset (n=19)
    df_3=df[df["Phase_3"] > 0.0]  # creates a df with only Phase 3 values
    X_3_input=df_3.drop(columns=nonEssential_columns)
    y_3_output=df_3[output_columns]

    # nut_phase 4 dataset (n=7)
    df_4=df[df["Phase_4"] > 0.0]  # creates a df with only Phase 4 values
    X_4_input=df_4.drop(columns=nonEssential_columns)
    y_4_output=df_4[output_columns]

    inputValues=listMultiAppend(X_1a_input, X_1b_input, X_2a_input, X_2b_input, X_3_input, X_4_input)
    outputValues=listMultiAppend(y_1a_output, y_1b_output, y_2a_output, y_2b_output, y_3_output, y_4_output)

    # package all subsets
    nutPhaseSubsets=listMultiAppend(inputValues, outputValues)    # 2D Array

    return(nutPhaseSubsets)


In [None]:
def splitDataFrameByNutPhase2(dataframeToSplit):
    '''Splits a dataset into subsets separating the data by nutritional phase. Returns a list that wraps a single list of six(6) dataframes, one per nutritional phase, each retaining every column except rec_num and sample_id'''

    #Variables
    df=pd.DataFrame()
    inputValues=[]
    outputValues=[]
    X_1a_dataset=[]
    X_1b_dataset=[]
    X_2a_dataset=[]
    X_2b_dataset=[]
    X_3_dataset=[]
    X_4_dataset=[]
    allDatasets=[]

    # Copy the dataset
    df=dataframeToSplit

    # Columns to omit from dataframe (43 features are retained representing inputs for each nut_phase)
    nonEssential_columns = ["rec_num", "sample_id", "nut_phase", "Phase_1a",
                         "Phase_1b","Phase_2a", "Phase_2b","Phase_3", "Phase_4"]
    output_columns = ["Phase_1a","Phase_1b","Phase_2a", "Phase_2b",
                      "Phase_3", "Phase_4"]

    # nut_phase 1a dataset (n=11)
    df_1a=df[df["Phase_1a"] > 0.0]  # creates a df with only Phase 1a values
    X_1a_dataset=df_1a.drop(columns=["rec_num", "sample_id"])

    # nut_phase 1b dataset (n=12)
    df_1b=df[df["Phase_1b"] > 0.0]  # creates a df with only Phase 1b values
    X_1b_dataset=df_1b.drop(columns=["rec_num", "sample_id"])

    # nut_phase 2a dataset (n=10)
    df_2a=df[df["Phase_2a"] > 0.0]  # creates a df with only Phase 2a values
    X_2a_dataset=df_2a.drop(columns=["rec_num", "sample_id"])

    # nut_phase 2b dataset (n=42)
    df_2b=df[df["Phase_2b"] > 0.0]  # creates a df with only Phase 2b values
    X_2b_dataset=df_2b.drop(columns=["rec_num", "sample_id"])

    # nut_phase 3 dataset (n=19)
    df_3=df[df["Phase_3"] > 0.0]  # creates a df with only Phase 3 values
    X_3_dataset=df_3.drop(columns=["rec_num", "sample_id"])

    # nut_phase 4 dataset (n=7)
    df_4=df[df["Phase_4"] > 0.0]  # creates a df with only Phase 4 values
    X_4_dataset=df_4.drop(columns=["rec_num", "sample_id"])

    allDatasets=listMultiAppend(X_1a_dataset, X_1b_dataset, X_2a_dataset, X_2b_dataset, X_3_dataset, X_4_dataset)

    # package all subsets
    nutPhaseSubsets=listMultiAppend(allDatasets)

    return(nutPhaseSubsets)


In [None]:
def splitDataFrameIntoInputOutputFrames(dataFrameToSplit):
    '''Splits dataframe into input and output datasets by nutritional phase'''

    #Variables
    df=pd.DataFrame()
    inputValues=[]
    outputValues=[]
    allInputOutputValues=[]

    # Copy the dataset
    df=dataFrameToSplit

    # Reference columns
    # Columns to omit from dataframe (43 features are retained representing inputs for each nut_phase)
    nonEssential_columns = ["rec_num", "sample_id", "nut_phase", "Phase_1a",
                         "Phase_1b","Phase_2a", "Phase_2b","Phase_3", "Phase_4"]
    output_columns = ["Phase_1a","Phase_1b","Phase_2a", "Phase_2b",
                      "Phase_3", "Phase_4"]

    # nut_phase 1a dataset (n=11)
    df_1a=df[0]
    df_1a=df_1a[df_1a["Phase_1a"] > 0.0]   # creates a df with only Phase_1a values
    X_1a_input=df_1a.drop(columns=nonEssential_columns)
    y_1a_output=df_1a[output_columns]

    # nut_phase 1b dataset (n=12)
    df_1b=df[1]
    df_1b=df_1b[df_1b["Phase_1b"] > 0.0]  # creates a df with only Phase_1b values
    X_1b_input=df_1b.drop(columns=nonEssential_columns)
    y_1b_output=df_1b[output_columns]

    # nut_phase 2a dataset (n=10)
    df_2a=df[2]
    df_2a=df_2a[df_2a["Phase_2a"] > 0.0]  # creates a df with only Phase_2a values
    X_2a_input=df_2a.drop(columns=nonEssential_columns)
    y_2a_output=df_2a[output_columns]

    # nut_phase 2b dataset (n=42)
    df_2b=df[3]
    df_2b=df_2b[df_2b["Phase_2b"] > 0.0]  # creates a df with only Phase_2b values
    X_2b_input=df_2b.drop(columns=nonEssential_columns)
    y_2b_output=df_2b[output_columns]

    # nut_phase 3 dataset (n=19)
    df_3=df[4]
    df_3=df_3[df_3["Phase_3"] > 0.0]  # creates a df with only Phase_3 values
    X_3_input=df_3.drop(columns=nonEssential_columns)
    y_3_output=df_3[output_columns]

    # nut_phase 4 dataset (n=7)
    df_4=df[5]
    df_4=df_4[df_4["Phase_4"] > 0.0]  # creates a df with only Phase_4 values
    X_4_input=df_4.drop(columns=nonEssential_columns)
    y_4_output=df_4[output_columns]

    inputValues=listMultiAppend(X_1a_input, X_1b_input, X_2a_input, X_2b_input, X_3_input, X_4_input)
    outputValues=listMultiAppend(y_1a_output, y_1b_output, y_2a_output, y_2b_output, y_3_output, y_4_output)

    # # package all subsets
    allInputOutputValues=listMultiAppend(inputValues, outputValues)

    return(allInputOutputValues)


---

**Split Dataframe(s) Into Train & Test Sets Functions**

---



In [None]:
def splitDataFrameIntoTrainTestSets(dataframeToSplit, shuffleDataFrame):
    '''Splits a dataframe into train and test subsets without separating by nutritional phase.
    Returns a 2D array of dataframes: [[X_train, y_train], [X_test, y_test]]'''

    #Variables
    df=pd.DataFrame()
    trainSets=[]
    testSets=[]
    nutPhaseSubsets=[]
    toShuffle=shuffleDataFrame

    # Copy the dataframe
    df=dataframeToSplit

    # Columns to omit from dataframe (43 features are retained representing inputs for each nut_phase)
    nonEssential_columns = ["rec_num", "sample_id", "nut_phase", "Phase_1a",
                            "Phase_1b","Phase_2a", "Phase_2b","Phase_3", "Phase_4"]
    output_columns = ["Phase_1a","Phase_1b","Phase_2a", "Phase_2b",
                      "Phase_3", "Phase_4"]

    X_input=dataframeToSplit.drop(columns=nonEssential_columns)
    y_output=dataframeToSplit[output_columns]

    X_train_fullset, X_test_fullset, y_train_fullset, y_test_fullset = train_test_split(X_input, y_output, test_size=0.30, random_state=8,  shuffle=toShuffle)

    trainSets=listMultiAppend(X_train_fullset, y_train_fullset)
    testSets=listMultiAppend(X_test_fullset, y_test_fullset)
    dataSets=listMultiAppend(trainSets, testSets)

    return(dataSets)


In [None]:
def splitNutPhaseSeparatedDataFrameIntoTrainTestSets(nutPhaseSeparatedDataFrames, shuffleDataFrames):
    '''Splits each nutritional-phase subset into a train set and a test set.
    Returns a list of four lists, each holding one dataframe per nutritional phase:
    [X_train[...], y_train[...], X_test[...], y_test[...]]'''

    #--- SPLIT THE DATA SUBSETS INTO A TRAINING SET AND A TEST SET --------------------
    # train_test_split() return order --> X_train, X_test, y_train, y_test

    # Variables
    X_train=[]
    y_train=[]
    X_test=[]
    y_test=[]
    trainSets=[]
    testSets=[]
    toShuffle=shuffleDataFrames

    X_input=nutPhaseSeparatedDataFrames[0]    # extract the input (X) subsets, one per nut_phase
    y_output=nutPhaseSeparatedDataFrames[1]     # extract the output (y) subsets, one per nut_phase

    # Get data subsets separated by nutritional phase

    # Phase_1a: split nut_phase 1a dataset into a training and a testing set (n=11)
    X_1a_train, X_1a_test, y_1a_train, y_1a_test = train_test_split(X_input[0], y_output[0], test_size=0.30, random_state=8,  shuffle=toShuffle)

    # Phase_1b: split nut_phase 1b dataset into a training and a testing set (n=12)
    X_1b_train, X_1b_test, y_1b_train, y_1b_test = train_test_split(X_input[1], y_output[1], test_size=0.30, random_state=8,  shuffle=toShuffle)

    # Phase_2a: split nut_phase 2a dataset into a training and a testing set (n=10)
    X_2a_train, X_2a_test, y_2a_train, y_2a_test = train_test_split(X_input[2], y_output[2], test_size=0.30, random_state=8,  shuffle=toShuffle)

    # Phase_2b: split nut_phase 2b dataset into a training and a testing set (n=42)
    X_2b_train, X_2b_test, y_2b_train, y_2b_test = train_test_split(X_input[3], y_output[3], test_size=0.30, random_state=8,  shuffle=toShuffle)

    # Phase_3: split nut_phase 3 dataset into a training and a testing set (n=19)
    X_3_train, X_3_test, y_3_train, y_3_test = train_test_split(X_input[4], y_output[4], test_size=0.30, random_state=8,  shuffle=toShuffle)

    # Phase_4: split nut_phase 4 dataset into a training and a testing set (n=7)
    X_4_train, X_4_test, y_4_train, y_4_test = train_test_split(X_input[5], y_output[5], test_size=0.30, random_state=8,  shuffle=toShuffle)

    X_train=listMultiAppend(X_1a_train, X_1b_train, X_2a_train, X_2b_train, X_3_train, X_4_train)
    y_train=listMultiAppend(y_1a_train, y_1b_train, y_2a_train, y_2b_train, y_3_train, y_4_train)

    X_test=listMultiAppend(X_1a_test, X_1b_test, X_2a_test, X_2b_test, X_3_test, X_4_test)
    y_test=listMultiAppend(y_1a_test, y_1b_test, y_2a_test, y_2b_test, y_3_test, y_4_test)

    trainTestSets=listMultiAppend(X_train, y_train, X_test, y_test)    # 2D array

    return(trainTestSets)
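
Because every phase is split separately with `test_size=0.30`, the smaller phases contribute only a few test records. The sketch below estimates the per-phase train and test counts. It assumes that scikit-learn rounds a fractional test size up to the next whole record, which is worth confirming against the installed version. The phase sizes are the `n` values quoted in the comments above, and `split_counts` is a hypothetical helper that is not part of this notebook.

```python
import math

# Phase sizes taken from the comments in the split function above.
PHASE_SIZES = {"1a": 11, "1b": 12, "2a": 10, "2b": 42, "3": 19, "4": 7}

def split_counts(n_records, test_fraction):
    """Return (n_train, n_test), assuming the test count is rounded up."""
    n_test = math.ceil(test_fraction * n_records)
    return n_records - n_test, n_test

for phase, n in PHASE_SIZES.items():
    n_train, n_test = split_counts(n, 0.30)
    print(f"Phase_{phase}: n={n}, train={n_train}, test={n_test}")
```

With so few records in some phases, a single misclassified test record can move that phase's accuracy by a large amount.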


In [None]:
def multiFractionSplitNutPhaseSeparatedDataFrameTrainTestSets(nutPhaseSeparatedDataFrames, shuffleDataFrames, minTestFraction, maxTestFraction, fractionIncrement):
    '''Splits each nutritional-phase subset into train and test sets once per test fraction.
    minTestFraction, maxTestFraction and fractionIncrement are whole-number percentages (e.g. 10, 50, 10);
    maxTestFraction itself is excluded. Returns a nested list of dataframes'''

    #--- SPLIT THE DATA SUBSETS INTO A TRAINING SET AND A TEST SET --------------------
    # train_test_split() return order --> X_train, X_test, y_train, y_test

    # Variables
    X_train=[]
    y_train=[]
    X_test=[]
    y_test=[]
    trainSets=[]
    testSets=[]
    trainTestDatasets=[]
    adjTestFractions=[]
    toShuffle=shuffleDataFrames
    i=minTestFraction

    testFractions=list(range(minTestFraction, maxTestFraction, fractionIncrement))    # range() requires integers, so fractions are passed in as percentages
    X_input=nutPhaseSeparatedDataFrames[0]      # extract the input (X) subsets, one per nut_phase
    y_output=nutPhaseSeparatedDataFrames[1]     # extract the output (y) subsets, one per nut_phase

    for fraction in testFractions:
        fraction=float(fraction/100)
        adjTestFractions.append(fraction)


    # Split every nut_phase subset once for each test fraction
    for testFraction in adjTestFractions:

        # Phase_1a: split nut_phase 1a dataset into a training and a testing set (n=11)
        X_1a_train, X_1a_test, y_1a_train, y_1a_test = train_test_split(X_input[0], y_output[0], test_size=testFraction, random_state=8,  shuffle=toShuffle)

        # Phase_1b: split nut_phase 1b dataset into a training and a testing set (n=12)
        X_1b_train, X_1b_test, y_1b_train, y_1b_test = train_test_split(X_input[1], y_output[1], test_size=testFraction, random_state=8,  shuffle=toShuffle)

        # Phase_2a: split nut_phase 2a dataset into a training and a testing set (n=10)
        X_2a_train, X_2a_test, y_2a_train, y_2a_test = train_test_split(X_input[2], y_output[2], test_size=testFraction, random_state=8,  shuffle=toShuffle)

        # Phase_2b: split nut_phase 2b dataset into a training and a testing set (n=42)
        X_2b_train, X_2b_test, y_2b_train, y_2b_test = train_test_split(X_input[3], y_output[3], test_size=testFraction, random_state=8,  shuffle=toShuffle)

        # Phase_3: split nut_phase 3 dataset into a training and a testing set (n=19)
        X_3_train, X_3_test, y_3_train, y_3_test = train_test_split(X_input[4], y_output[4], test_size=testFraction, random_state=8,  shuffle=toShuffle)

        # Phase_4: split nut_phase 4 dataset into a training and a testing set (n=7)
        X_4_train, X_4_test, y_4_train, y_4_test = train_test_split(X_input[5], y_output[5], test_size=testFraction, random_state=8,  shuffle=toShuffle)

        # Append all sets to a list
        X_train=listMultiAppend(X_1a_train, X_1b_train, X_2a_train, X_2b_train, X_3_train, X_4_train)
        y_train=listMultiAppend(y_1a_train, y_1b_train, y_2a_train, y_2b_train, y_3_train, y_4_train)

        X_test=listMultiAppend(X_1a_test, X_1b_test, X_2a_test, X_2b_test, X_3_test, X_4_test)
        y_test=listMultiAppend(y_1a_test, y_1b_test, y_2a_test, y_2b_test, y_3_test, y_4_test)

        trainTestDatasets.append(listMultiAppend(X_train, y_train, X_test, y_test))    # keep the split for every fraction, not only the last one

    return(trainTestDatasets)
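
The test fractions in the function above are supplied as whole-number percentages because `range()` accepts only integers. A minimal sketch of that conversion follows; `percent_range_to_fractions` is a hypothetical name used only for illustration.

```python
def percent_range_to_fractions(min_percent, max_percent, step_percent):
    """Convert an integer percentage range into fractional test sizes."""
    # range() excludes max_percent, so the largest fraction is one step below it.
    return [percent / 100 for percent in range(min_percent, max_percent, step_percent)]

print(percent_range_to_fractions(10, 50, 10))  # → [0.1, 0.2, 0.3, 0.4]
```

To include a 50% test fraction in this example, the upper bound would have to be raised past 50 (for instance to 51).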


In [None]:
def k_FoldCrossValidationTrainTestSplit(dataframeToSplit):
    '''Separates a dataframe into input (X) and output (y) datasets in preparation for k-fold cross-validation.
    Returns X_dataset, y_dataset'''

    # Reference Columns
    nonEssential_columns = ["rec_num", "sample_id", "nut_phase"]    # identifier columns to exclude from the dataframe to be used for Hold-out Cross-validation Stratified Sampling (HCSS)
    output_columns = ["Phase_1a","Phase_1b","Phase_2a", "Phase_2b","Phase_3", "Phase_4"]    # these are also referred to as "target" columns

    # Define the input (x-values) and output (y-values) datasets
    X_dataset = dataframeToSplit.drop(columns=nonEssential_columns + output_columns)    # drop the targets too so they do not leak into the inputs
    y_dataset = dataframeToSplit[output_columns]

    return(X_dataset, y_dataset)

In [None]:
####################################################################
#-------------------------------------------------------------------
#    TEST FOR "k_FoldCrossValidationTrainTestSplit()" FUNCTION
#-------------------------------------------------------------------
####################################################################

# Read CSV datafile and convert to a Pandas Dataframe object
filename="nut_phase_questionnaire_data_fullset.csv"
df=pd.read_csv(filename)

k_FoldCrossValidationTrainTestSplit(df)


---

**Shuffle Dataframes Functions**

---



In [None]:
def shuffleDataFrame(datasetToShuffle):
    '''Shuffles records in a dataframe and returns the shuffled dataframe'''

    # Variables
    shuffledIndeces=[]

    numRecords = (len(datasetToShuffle))  #numberOfRecords
    shuffledIndeces = list(range(0, numRecords))  # creates a number list of length equivalent to dataset size
    random.shuffle(shuffledIndeces)   # shuffles the "shuffledIndeces" list to use in shuffling datasets by index

    tempDataset = []                  # temporary list object for storing the shuffled dataset
    shuffledDataset = []              # variable for storing and returning the shuffled datasets


    # # Loop for assembling a list of shuffled data by shuffled index
    for index in shuffledIndeces:
        tempDataset.append(datasetToShuffle.iloc[index])  # appends shuffled element to temp list object

    # # Convert lists to pandas dataframes
    shuffledDataFrame = pd.DataFrame(tempDataset)

    shuffledIndeces=[]

    return (shuffledDataFrame)    # returns shuffled dataframe


In [None]:
####################################################################
#-------------------------------------------------------------------
#            TEST FOR "shuffleDataFrame()" FUNCTION
#-------------------------------------------------------------------
####################################################################

# Variables
filename="nut_phase_questionnaire_data_fullset.csv"

#--- METHOD CALLS ------------------------------------------
# Read CSV datafile and convert to a Pandas Dataframe object
df=readCSVFile(filename)
df2=pd.DataFrame(df)

splitData=splitDataFrameByNutPhase(df2)
shuffledDataFrame=shuffleDataFrame(splitData[0])
#shuffledDataFrame
print(splitData[0])
print(shuffledDataFrame)


In [None]:
def shuffleDataFrames(X_dataframeToShuffle, y_dataframeToShuffle, numRecords):
    '''Function to shuffle train or test datasets'''

    dataset_size = numRecords   # number of records (rows) to shuffle
    shuffledIndeces = list(range(0, dataset_size))  # creates a number list of length equivalent to dataset size
    random.shuffle(shuffledIndeces)   # shuffles the "shuffledIndeces" list to use in shuffling datasets by index

    X_dataframe = X_dataframeToShuffle    # assigns X_datasetToShuffle to a function-level variable for processing
    y_dataframe = y_dataframeToShuffle    # assigns y_datasetToShuffle to a function-level variable for processing
    X_temp = []                       # temporary list object for storing shuffled dataset of X-values being assembled
    y_temp = []                       # temporary list object for storing shuffled dataset of y-values being assembled
    shuffledDataFrames = []           # variable for returning shuffled datasets

    # Loop for assembling lists of shuffled data by shuffled index
    for index in shuffledIndeces:
        X_temp.append(X_dataframe.iloc[index])  # appends the X row at this position (.iloc selects rows; [] would select a column)
        y_temp.append(y_dataframe.iloc[index])  # appends the matching y row to y_temp list object


    # Convert lists to pandas dataframes
    X_dataframe = pd.DataFrame(X_temp)
    y_dataframe = pd.DataFrame(y_temp)

    # Multi-append "X_dataset" and "y_dataset" dataframe to "shuffledDataFrames" list
    shuffledDataFrames=listMultiAppend(X_dataframe, y_dataframe)

    return (shuffledDataFrames)    # returns shuffled dataframes


In [None]:
def shufflePWSNutPhasePartitionedDataFrames(nutPhasePartitionedDFToShuffle):
    '''(OPTIONAL): Shuffles a dataset. Note: Use only with datasets separated by nutritional phase'''

    # Variables
    shuffledDataFrame=[]
    shuffledDataFrames=[]
    i=0

    dataFramesToShuffle=nutPhasePartitionedDFToShuffle

    for dataFrame in dataFramesToShuffle:
        shuffledDataFrame=shuffleDataFrame(dataFrame)
        shuffledDataFrames.append(shuffledDataFrame)

    return(shuffledDataFrames)


In [None]:
####################################################################
#-------------------------------------------------------------------
#            TEST FOR "shuffleDataFrames()" FUNCTION
#-------------------------------------------------------------------
####################################################################

# Variables
filename="nut_phase_questionnaire_data_fullset.csv"

#--- METHOD CALLS ------------------------------------------
# Read CSV datafile and convert to a Pandas Dataframe object
df=readCSVFile(filename)
df2=pd.DataFrame(df)

shuffledDataFrame=shuffleDataFrame(df2)
shuffledDataFrame

filename="nut_phase_questionnaire_data_fullset.csv"
modelTrainingDatasets=[]


#--- METHOD CALLS ------------------------------------------
# Read CSV datafile and convert to a Pandas Dataframe object
dataFrame=readCSVFile(filename)

# Partition the dataset by nut_phase
splitDataFrame = splitDataFrameByNutPhase(dataFrame)

# Shuffle the nut_phase partitioned dataframes
shuffledDataFrames=shufflePWSNutPhasePartitionedDataFrames(splitDataFrame)

# Merge the partitioned datasets
mergedDataFrames=mergeDataFramesRow_Wise(shuffledDataFrames)

# Split merged dataset into input and output dataframes by nutritional phase for Train/Test split
datasetForTrainTestSplit=splitDataFrameIntoInputOutputByNutPhase(mergedDataFrames)



In [None]:
def shuffleDataFrames2(X_dataframeToShuffle, y_dataframeToShuffle, numRecords):
    '''Shuffles records in two dataframes, 'X' and 'y', and returns the shuffled datasets'''

    # numRecords is the number of records (rows) to shuffle
    shuffledIndeces = list(range(0, numRecords))  # creates a number list of length equivalent to dataset size
    random.shuffle(shuffledIndeces)   # shuffles the "shuffledIndeces" list to use in shuffling datasets by index

    X_dataframe = X_dataframeToShuffle    # assigns X_datasetToShuffle to a function-level variable for processing
    y_dataframe = y_dataframeToShuffle    # assigns y_datasetToShuffle to a function-level variable for processing
    X_temp = []                           # temporary list object for storing shuffled dataset of X-values being assembled
    y_temp = []                           # temporary list object for storing shuffled dataset of y-values being assembled
    shuffledDataFrames = []               # variable for returning shuffled datasets

    # Loop for assembling lists of shuffled data by shuffled index
    for index in shuffledIndeces:
        X_temp.append(X_dataframe.iloc[index])  # appends X_dataset element to X_temp list object
        y_temp.append(y_dataframe.iloc[index])  # appends y_dataset element to y_temp list object


    # Convert lists to pandas dataframes
    X_dataframe = pd.DataFrame(X_temp)
    y_dataframe = pd.DataFrame(y_temp)

    # Multi-append "X_dataset" and "y_dataset" dataframe to "shuffledDataFrames" list
    shuffledDataFrames=listMultiAppend(X_dataframe, y_dataframe)

    return (shuffledDataFrames)    # returns shuffled dataframes
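
The point of shuffling `X` and `y` together is that both must follow the same permutation, otherwise inputs and labels no longer correspond. The library-free sketch below shows the idea on plain lists; `shuffle_pair` and its `seed` parameter are illustrative names, not functions from this notebook.

```python
import random

def shuffle_pair(x_rows, y_rows, seed=None):
    """Shuffle two equal-length sequences with one shared permutation."""
    if len(x_rows) != len(y_rows):
        raise ValueError("x_rows and y_rows must have the same length")
    order = list(range(len(x_rows)))
    random.Random(seed).shuffle(order)      # one permutation, applied to both sequences
    return [x_rows[i] for i in order], [y_rows[i] for i in order]

x = ["a", "b", "c", "d"]
y = [1, 2, 3, 4]
x_shuffled, y_shuffled = shuffle_pair(x, y, seed=8)
print(list(zip(x_shuffled, y_shuffled)))    # every input is still paired with its own label
```

In pandas the same effect can usually be had more compactly with `DataFrame.sample(frac=1, random_state=...)` on `X`, followed by `y.loc[X_shuffled.index]`, provided the index labels are unique.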


In [None]:
####################################################################
#-------------------------------------------------------------------
#  TEST FOR "shufflePWSNutPhasePartitionedDataFrames()" FUNCTION
#-------------------------------------------------------------------
####################################################################

# Variables
filename="nut_phase_questionnaire_data_fullset.csv"

#--- METHOD CALLS ------------------------------------------
# Read CSV datafile and convert to a Pandas Dataframe object
df=readCSVFile(filename)
df2=pd.DataFrame(df)

splitData=splitDataFrameByNutPhase(df2)

splitData[1][5]


---

**Merge Dataframes Functions**

---



In [None]:
def mergePWSNutPhaseSubsets(datasetsToMerge):
    '''Takes in the list [X_train[...], y_train[...], X_test[...], y_test[...]] of nut_phase split subsets
    and merges the subsets into one full train set and one full test set.
    Returns a nested list of dataframes:
    [[X_train_fullset, y_train_fullset], [X_test_fullset, y_test_fullset]]'''

    #--- MERGE NUT_PHASE TRAIN AND TEST SUBSETS TO CREATE FULL TRAIN AND TEST DATASETS --------------------
    # input order --> X_train, y_train, X_test, y_test

    # Variables
    mergedTrainSets=[]
    mergedTestSets=[]
    mergedDataSets=[]

    X_trainSets=datasetsToMerge[0]
    y_trainSets=datasetsToMerge[1]

    X_testSets=datasetsToMerge[2]
    y_testSets=datasetsToMerge[3]

    # X_train dataset [.....1a.......,.....1b.......,.....2a........,.....2b..........,....3........,.......4.......]
    X_train_subsets = [X_trainSets[0],X_trainSets[1], X_trainSets[2], X_trainSets[3], X_trainSets[4], X_trainSets[5]]
    X_train_fullset = pd.concat(X_train_subsets, axis=0) #.values.astype("float32")    # concatenate frames row-wise (axis=0)

    # X_test dataset
    X_test_subsets = [ X_testSets[0],  X_testSets[1],  X_testSets[2],  X_testSets[3],  X_testSets[4],  X_testSets[5]]
    X_test_fullset = pd.concat(X_test_subsets, axis=0) #.values.astype("float32")      # concatenate frames row-wise (axis=0)

    # y_train dataset
    y_train_subsets = [y_trainSets[0], y_trainSets[1], y_trainSets[2], y_trainSets[3], y_trainSets[4], y_trainSets[5]]
    y_train_fullset = pd.concat(y_train_subsets, axis=0) #.values.astype("float32")    # concatenate frames row-wise (axis=0)

    # y_test dataset
    y_test_subsets = [y_testSets[0], y_testSets[1], y_testSets[2], y_testSets[3], y_testSets[4], y_testSets[5]]
    y_test_fullset = pd.concat(y_test_subsets, axis=0) #.values.astype("float32")       # concatenate frames row-wise (axis=0)

    mergedTrainSets=listMultiAppend(X_train_fullset, y_train_fullset)
    mergedTestSets=listMultiAppend(X_test_fullset, y_test_fullset)
    mergedDataSets=listMultiAppend(mergedTrainSets, mergedTestSets)

    #--- DATASET SHAPE (ARRAY DIMENSIONS) --------------------
    # print("X_train array dimensions: " + str(X_train_fullset.shape) + "; (rows, cols)" + "\n" +
    #       "y_train array dimensions: " + str(y_train_fullset.shape) + "; (rows, cols)")

    return(mergedDataSets)


In [None]:
def mergeDataFramesRow_Wise(dataFramesToMerge):
    '''Concatenates a list of dataframes row-wise (axis=0) and returns the merged dataframe'''

    dataFrames=[]

    for dataFrame in dataFramesToMerge:
        dataFrames.append(dataFrame)

    mergedDataFrames=pd.concat(dataFrames, axis=0)

    return(mergedDataFrames)


---

**Construct a Sequential Neural Network (SNN) Functions**

---



In [None]:
def constructSequentialNeuralNetwork(numFeatures, nodesPerLayer):
    '''Constructs a Sequential Neural Network (SNN) of dimensions 43*([86]*6)*6 => (input nodes*([height in nodes]*depth in layers)*output nodes)'''

    # Exhaustive analysis of height vs depth concluded that 6-8 hidden
    # layers, 86-172 nodes in height, are sufficient to produce maximal
    # performance for the current dataset having 43 features (fields).
    # TensorFlow TensorSpecs dimensions and data types of the model input => [None,43], float32. "None" indicates unknown batch size (number of records).

    # Variables
    nodes=nodesPerLayer

    tf.random.set_seed(42)
    model = Sequential()   # create an instance of a Sequential object

    # -- Add input layer -----------------------------------------
    model.add(tf.keras.layers.Input(shape=(numFeatures, )))              # input layer (follows matrix mult (A X B), B is m rows and n cols, thus A is k rows and m cols (where k=#records in the dataset and m=#input cols))

    # -- Add hidden layers ---------------------------------------
    model.add(tf.keras.layers.Dense(nodes, activation="relu"))    # first hidden layer   (1)
    model.add(tf.keras.layers.Dense(nodes, activation="relu"))    # second hidden layer  (2)
    model.add(tf.keras.layers.Dense(nodes, activation="relu"))    # third hidden layer   (3)
    model.add(tf.keras.layers.Dense(nodes, activation="relu"))    # fourth hidden layer  (4)
    model.add(tf.keras.layers.Dense(nodes, activation="relu"))    # fifth hidden layer   (5)
    model.add(tf.keras.layers.Dense(nodes, activation="relu"))    # sixth hidden layer   (6)
    # model.add(tf.keras.layers.Dense(nodes, activation="relu"))    # seventh hidden layer (7)
    # model.add(tf.keras.layers.Dense(nodes, activation="relu"))    # eighth hidden layer  (8)
    # model.add(tf.keras.layers.Dense(nodes, activation="relu"))    # ninth hidden layer   (9)
    # model.add(tf.keras.layers.Dense(nodes, activation="relu"))    # tenth hidden layer   (10)


    # -- Add output layer ----------------------------------------
    model.add(tf.keras.layers.Dense(6, activation="softmax"))   # output layer.  Softmax converts a vector (array) of values into a probability distribution with a range (0,1).
    return(model)
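
To put the 43*([86]*6)*6 architecture in perspective, it helps to count its trainable parameters. The sketch below does the arithmetic for a fully connected network; `dense_parameter_count` is a hypothetical helper, and the figures assume the default dense layers with one bias per node.

```python
def dense_parameter_count(num_features, nodes_per_layer, num_hidden_layers, num_outputs):
    """Count weights and biases in a fully connected feed-forward network."""
    layer_sizes = [num_features] + [nodes_per_layer] * num_hidden_layers + [num_outputs]
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out + fan_out   # weight matrix plus one bias per node
    return total

print(dense_parameter_count(43, 86, 6, 6))  # → 41716
```

The per-phase record counts quoted in the comments earlier sum to 101, so the network has far more parameters than training records unless the augmented file is much larger. That is one reason the dropout variant and the cross-validation functions in this notebook are worth running.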


In [None]:
def constructSequentialNeuralNetworkWithDropout(numFeatures):
    '''Constructs a Sequential Neural Network (SNN) with Dropout layers, of dimensions 43*([86]*4)*6 => (input nodes*([height in nodes]*depth in layers)*output nodes)'''

    tf.random.set_seed(42)
    model = Sequential()   # create an instance of a Sequential object

    # -- Add input layer -----------------------------------------
    model.add(tf.keras.layers.Input(shape=(numFeatures, )))             # input layer (follows matrix mult (A X B), B is m rows and n cols, thus A is k rows and m cols (where k=#records in the dataset and m=#input cols))

    # -- Add hidden layers ---------------------------------------
    model.add(tf.keras.layers.Dropout(0.4))                    # Dropout helps protect the model from memorizing or "overfitting" the training data
    model.add(tf.keras.layers.Dense(86, activation="relu"))    # first hidden layer (1)
    # model.add(tf.keras.layers.Dropout(0.2))                  # Dropout helps protect the model from memorizing or "overfitting" the training data
    model.add(tf.keras.layers.Dense(86, activation="relu"))    # second hidden layer (2)
    model.add(tf.keras.layers.Dropout(0.2))                    # Dropout helps protect the model from memorizing or "overfitting" the training data
    model.add(tf.keras.layers.Dense(86, activation="relu"))    # third hidden layer (3)
    # model.add(tf.keras.layers.Dropout(0.2))                  # Dropout helps protect the model from memorizing or "overfitting" the training data
    model.add(tf.keras.layers.Dense(86, activation="relu"))    # fourth hidden layer (4)
    model.add(tf.keras.layers.Dropout(0.2))                    # Dropout helps protect the model from memorizing or "overfitting" the training data
    # model.add(tf.keras.layers.Dropout(0.2))                  # Dropout helps protect the model from memorizing or "overfitting" the training data

    # -- Add output layer ----------------------------------------
    model.add(tf.keras.layers.Dense(6, activation="softmax"))  # output layer.  Softmax converts a vector (array) of values into a probability distribution with a range (0,1).

    return(model)
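
The `Dropout` layers above only act during training: each incoming value is zeroed with probability `rate`, and the surviving values are scaled by `1/(1-rate)` so the expected sum is unchanged. At inference time the layer passes values through untouched. The following is a minimal pure-Python sketch of that training-time behaviour, not the Keras implementation; `inverted_dropout` is an illustrative name.

```python
import random

def inverted_dropout(values, rate, rng):
    """Zero each value with probability `rate`; scale survivors by 1/(1-rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(42)
activations = [1.0] * 10
dropped = inverted_dropout(activations, 0.4, rng)
print(dropped)   # each entry is either 0.0 or 1/(1-0.4)
```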


In [None]:
def constructCustomSequentialNeuralNetwork():
    '''Constructs a custom Sequential Neural Network (SNN) by prompting the user for the network hyperparameters to use'''

    # Variables
    defaultSeed=42
    layerNum=1

    # Prompt user for Sequential Neural Network hyperparameters (an empty entry selects the default where one exists)
    rndSeed=input("Please enter a seed number (default seed is: 42): ")
    numFeatures=int(input("Please enter the number of input features (columns) in the dataset: "))
    numHiddenLayers=int(input("Please enter the number of hidden layers desired: "))
    dimHiddenLayers=input("Please enter the dimension of the hidden layers, otherwise the default value of [(input variables) X 2] will be used: ")
    print("Please note that for now the 'relu' and 'softmax' activation functions will be used")

    # Use the default values if the user did not enter a number
    rndSeed=defaultSeed if rndSeed=='' else int(rndSeed)
    dimHiddenLayers=numFeatures*2 if dimHiddenLayers=='' else int(dimHiddenLayers)    # input() returns a string, so convert to int

    # Initialize a Sequential Neural Network (SNN) object once the seed is known
    tf.random.set_seed(rndSeed)
    model=Sequential()

    # --- Build the model ---
    # Add input layer
    model.add(tf.keras.layers.Input(shape=(numFeatures, )))    # input layer: one node per input feature (column), not per record

    # Add hidden layers to model
    while (layerNum < numHiddenLayers+1):
        model.add(tf.keras.layers.Dense(dimHiddenLayers, activation="relu"))    # N-hidden layer
        layerNum+=1    # increment loop index by one(1)

    # Add output layer
    model.add(tf.keras.layers.Dense(6, activation="softmax"))   # output layer.  Softmax converts a vector of values to a probability distribution with a range (0,1).

    # Display summary of model structure
    displayNeuralNetworkSummary(model)

    # Display graphical structure of model
    displayNeuralNetworkStructure(model)

    return(model)    # return the model


In [None]:
####################################################################
#-------------------------------------------------------------------
#  TEST FOR "constructCustomSequentialNeuralNetwork()" FUNCTION
#-------------------------------------------------------------------
####################################################################

constructCustomSequentialNeuralNetwork()


---

**Compile Sequential Neural Network Functions**

---

In [None]:
def compileNeuralNetwork(model):
    '''Compiles a neural network with the Adam optimizer and the categorical cross-entropy loss, tracking accuracy'''

    #model.compile(loss="categorical_crossentropy", optimizer=keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True), metrics=["accuracy"])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss="categorical_crossentropy", metrics=["accuracy"])

    return(model)
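
To make the loss used in `compileNeuralNetwork()` concrete, the sketch below computes a softmax over six raw scores (one per nutritional phase) and the categorical cross-entropy against a one-hot label. It is a plain-Python illustration of the formulas, not the Keras code path; the scores and the `eps` clipping value are made up for the example.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)                                # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot_target, probabilities, eps=1e-7):
    """Loss for one sample: -sum(t_i * log(p_i))."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot_target, probabilities))

probs = softmax([2.0, 1.0, 0.1, 0.0, -1.0, -2.0])    # six scores, one per nut_phase
target = [1, 0, 0, 0, 0, 0]                          # one-hot label for Phase_1a
loss = categorical_crossentropy(target, probs)
print(round(sum(probs), 6), loss)
```

Because the target is one-hot, the loss reduces to the negative log of the probability assigned to the true phase, so it approaches zero as that probability approaches 1.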


---

**Summarize and Display Structure of Sequential Neural Network Functions**

---

In [None]:
def displayNeuralNetworkSummary(model):
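    '''Prints a text summary of the neural network (layers, output shapes and parameter counts)'''

    model.summary()    # summary() prints the table itself and returns None, so it is not wrapped in print()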


In [None]:
def displayNeuralNetworkStructure(model):
    '''Plots the structure of the neural network, saves the plot to "model_plot.jpg" and returns it for display'''

    networkStructure=plot_model(model, to_file='model_plot.jpg', show_shapes=True, show_layer_names=True)

    return(networkStructure)


---

**Model Training Functions**

---



**Training Functions:**
1. trainNeuralNetwork()
2. trainNeuralNetwork_TEST()
3. trainNeuralNetwork_N_Fold()
4. k_FoldCrossValidation()
5. k_FoldCrossValidationWithNeuralNetworkConstruct()

In [None]:
def trainNeuralNetwork(model, trainingDatasets, epochs, userDefinedVerbose):
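    '''Trains a neural network on trainingDatasets=[[X_train, y_train], [X_test, y_test]], using the test sets as validation data.
    Returns the trained model'''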

    # Extract the Train datasets
    X_train=trainingDatasets[0][0]
    y_train=trainingDatasets[0][1]

    # Extract the Test datasets
    X_test=trainingDatasets[1][0]
    y_test=trainingDatasets[1][1]

    # Train the model
    model.fit(X_train, y_train, epochs=epochs, batch_size=6, verbose=userDefinedVerbose, validation_data=(X_test,y_test)) #, callbacks=callbacks_list)

    return(model)


In [None]:
def trainNeuralNetwork_TEST(model, trainingDatasets, userDefinedVerbose):
    '''Trains a neural network and returns the Keras History object of the training run (not the model)'''

    # Extract the Train and Test sets from passed list
    modelTrainSets=trainingDatasets[0]    # Extracts the train sets
    modelTestSets=trainingDatasets[1]     # Extracts the test sets

    # Train the model
    history=model.fit(modelTrainSets[0], modelTrainSets[1], epochs=75, batch_size=12, verbose=userDefinedVerbose, validation_data=(modelTestSets[0], modelTestSets[1])) #, callbacks=callbacks_list)
    print("History:")
    print(history.history.keys())
    return(history)


In [None]:
def trainNeuralNetwork_N_Fold(model, trainingDatasets, trainingIterations):
    '''Trains a neural network N-times and returns a list of models'''

    # Variables
    iterationCount=0    # while loop index
    trainedModels=[]    # list for storing trained models

    # Extract the Train and Test sets from passed list
    modelTrainSets=trainingDatasets[0]    # Extracts the train sets
    modelTestSets=trainingDatasets[1]     # Extracts the test sets

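    while (iterationCount < trainingIterations):
        # Train the model. fit() resumes from the current weights, so each pass trains the same model
        # instance further and every list entry refers to that one instance.
        model.fit(modelTrainSets[0], modelTrainSets[1], epochs=100, batch_size=10, verbose=2, validation_data=(modelTestSets[0], modelTestSets[1]))    # fit() already returns a History object, so no history callback is needed
        trainedModels.append(model)
        iterationCount +=1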

    # reset loop index to zero(0)
    iterationCount=0

    return(trainedModels)


In [None]:
def k_FoldCrossValidation(model, trainingDatasets, splits):
    '''Takes in a pre-constructed model and performs k-fold cross-validation on the referenced dataset.
    Returns a list holding the model trained on each fold'''

    # Variables
    accuracyPerFold=[]
    models=[]    # stores the model trained on each fold

    # Prepare the features and target datasets
    nonEssential_columns = ["rec_num", "sample_id", "nut_phase", "Phase_1a","Phase_1b","Phase_2a", "Phase_2b","Phase_3", "Phase_4"]
    output_columns = ["Phase_1a","Phase_1b","Phase_2a", "Phase_2b","Phase_3", "Phase_4"]

    X_input=trainingDatasets.drop(columns=nonEssential_columns)
    y_output=trainingDatasets[output_columns]

    X=X_input.to_numpy()  # convert to numpy arrays so rows can be selected with the fold index arrays
    y=y_output.to_numpy()

    # Initialize an instance of a KFold object
    cv = KFold(n_splits=splits, shuffle=True, random_state=7)

    for train, test in cv.split(X, y):
        X_train=X[train]
        y_train=y[train]

        X_test=X[test]
        y_test=y[test]

        # Train a freshly initialized copy of the model so no fold starts from weights fitted on another fold's data
        foldModel=compileNeuralNetwork(tf.keras.models.clone_model(model))
        foldModel.fit(X_train, y_train, validation_data=(X_test, y_test), batch_size=7, epochs=50, verbose=0)

        # Evaluate the model - store accuracy score into list
        scores=foldModel.evaluate(X_test, y_test, verbose=0)
        accuracyPerFold.append(scores[1]*100)
        models.append(foldModel)

    # Display model accuracy values for each split
    avg=sum(accuracyPerFold)/len(accuracyPerFold)
    for foldNum, acc in enumerate(accuracyPerFold, start=1):
        print("Model's accuracy for fold " + str(foldNum) + " " + str(int(acc)) + "%")
    print("Average accuracy score: " + str(int(avg)) + "%")

    return(models)
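
The fold bookkeeping that `KFold` performs in the function above can be sketched without any libraries. The helper below, `k_fold_indices`, is a hypothetical stand-in that omits shuffling: it cuts the record positions into consecutive test blocks and uses the remaining positions for training, giving the first folds one extra record when the dataset does not divide evenly.

```python
def k_fold_indices(num_records, num_folds):
    """Yield (train_indices, test_indices) for each fold, without shuffling."""
    if not 2 <= num_folds <= num_records:
        raise ValueError("num_folds must be between 2 and num_records")
    base, extra = divmod(num_records, num_folds)
    start = 0
    for fold in range(num_folds):
        size = base + (1 if fold < extra else 0)   # spread the remainder over the first folds
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, num_records))
        yield train, test
        start += size

for train_idx, test_idx in k_fold_indices(11, 5):
    print(len(train_idx), len(test_idx), test_idx)
```

Every record lands in exactly one test fold, which is why the average of the per-fold accuracies uses each record for evaluation once.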


In [None]:
######################################################################
#---------------------------------------------------------------------
#           TEST FOR "k_FoldCrossValidation()" FUNCTION
#---------------------------------------------------------------------
######################################################################

# Variables
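model=compileNeuralNetwork(constructSequentialNeuralNetwork(43, 86))    # 43 input features, 86 nodes per hidden layer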
filename="nut_phase_questionnaire_data_augmented; 09-05-2023.csv"
splits=5

#--- METHOD CALLS ------------------------------------------
# Read CSV datafile and convert to a Pandas Dataframe object
dataFrame=readCSVFile(filename)
models = k_FoldCrossValidation(model, dataFrame, splits)

print("\n"+"Number of models: "+str(len(models)))


In [None]:
def k_FoldCrossValidationWithNeuralNetworkConstruct(trainingDatasets):
    '''Creates a model to use for k-fold cross-validation'''

    # Variables
    foldNum=1
    accuracyPerFold=[]

    # Reference columns
    nonEssential_columns = ["rec_num", "sample_id", "nut_phase", "Phase_1a","Phase_1b","Phase_2a", "Phase_2b","Phase_3", "Phase_4"]
    output_columns = ["Phase_1a","Phase_1b","Phase_2a", "Phase_2b","Phase_3", "Phase_4"]

    X_input=trainingDatasets.drop(columns=nonEssential_columns)
    y_output=trainingDatasets[output_columns]

    X=X_input.to_numpy()  # cannot use X_input in dataframe format. Must convert to numpy array.
    y= y_output.to_numpy()

    # Initialize an instance of a KFold object
    cv = KFold(n_splits=5, shuffle=True, random_state=7)

    # construct SNN model - do this inside the for loop to train a new model at each iteration of the loop
    model=constructSequentialNeuralNetwork()

    # compile model
    model=compileNeuralNetwork(model)

    for train, test in cv.split(X, y):
        X_train=X[train]
        y_train=y[train]

        X_test=X[test]
        y_test=y[test]

        # Clear the current model
        tf.keras.backend.clear_session()    # clears all previously created models from memory

        # fit data to model
        history=model.fit(X_train, y_train, validation_data=(X_test, y_test), batch_size=7, epochs=50, verbose=0)

        # save the model to list object
        # model.save("model_fold_"+str(foldNum)+"_"+str(datetime.now())+".h5")

        # Evaluate the model - store accuracy score into list
        scores=model.evaluate(X_test, y_test, verbose=0)
        accuracyPerFold.append(scores[1]*100)
        foldNum +=1

    x=1

    # Display model accuracy values for each split
    avg=sum(accuracyPerFold)/len(accuracyPerFold)
    for acc in accuracyPerFold:
        print("Model's accuracy for fold " + str(x) + " " + str(int(acc)) + "%")
        x+=1
    print("Average accuracy score: " + str(int(avg)) + "%")

    return(models)
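
The fold loop above can be sketched without Keras to show why a new model is built for every fold. This is a minimal illustration, not the notebook's implementation: `k_fold_indices`, `MajorityClassModel`, and `cross_validate` are hypothetical names, and the stand-in model simply predicts the most common training label.

```python
import random


def k_fold_indices(n_samples, n_splits, seed=7):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread samples as evenly as possible: the first (n_samples % n_splits) folds get one extra
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


class MajorityClassModel:
    """Hypothetical stand-in for a Keras model: predicts the most common training label."""

    def __init__(self):
        self.label = None

    def fit(self, X, y):
        self.label = max(set(y), key=y.count)

    def evaluate(self, X, y):
        # Fraction of records whose label matches the single predicted label
        return sum(1 for label in y if label == self.label) / len(y)


def cross_validate(build_model, X, y, n_splits=5):
    """Train one fresh model per fold; return the models and their per-fold accuracies."""
    models, accuracies = [], []
    for train_idx, test_idx in k_fold_indices(len(X), n_splits):
        model = build_model()  # new, untrained model: nothing learned on earlier folds leaks in
        model.fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        accuracies.append(model.evaluate([X[i] for i in test_idx], [y[i] for i in test_idx]))
        models.append(model)
    return models, accuracies


X = [[i] for i in range(20)]
y = ["Phase_3"] * 14 + ["Phase_4"] * 6
models, accuracies = cross_validate(MajorityClassModel, X, y, n_splits=5)
print(len(models), sum(accuracies) / len(accuracies))
```

Passing a factory (`build_model`) rather than a model instance makes it hard to reuse one set of weights across folds by accident, which would let later folds be evaluated on data the model had already seen.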


In [None]:
####################################################################
#----------------------------------------------------------------------
# TEST FOR "k_FoldCrossValidationWithNeuralNetworkConstruct()" FUNCTION
#-------------------------------------------------------------------
####################################################################

# Variables
filename="nut_phase_questionnaire_data_augmented; 09-05-2023.csv"

#--- METHOD CALLS ------------------------------------------
# Read CSV datafile and convert to a Pandas Dataframe object
dataFrame=readCSVFile(filename)
models = k_FoldCrossValidationWithNeuralNetworkConstruct(dataFrame)

print("\n"+"Number of models: "+str(len(models)))


---


**Model Validation Functions**


---

**Validation Functions:**
1. verifyModelHasLearned()
2. evaluateModel()
3. getModelsPerformance()
4. multiClassConfusionMatrix()

In [None]:
def verifyModelHasLearned(model):
    '''Displays edge weights of model. If model did not learn, no weights will be displayed'''

    # Display weights (a bare "model.weights" expression shows nothing when evaluated inside a function)
    print(model.weights)


In [None]:
def evaluateModel(model, trainTestDataFrames):
    '''Displays model performance metrics.  Returns evaluation scores as a 2D array object'''

    # Variables
    scores=[]

    # Train datasets
    modelTrainSets=trainTestDataFrames[0]    # Extracts only the train sets
    X_train=modelTrainSets[0]    # Extracts the X_train set (input values) from "modelTrainSets"
    y_train=modelTrainSets[1]    # Extracts the y_train set (output values) from "modelTrainSets"

    # Test datasets
    modelTestSets=trainTestDataFrames[1]    # Extracts only the test sets
    X_test=modelTestSets[0]    # Extracts the X_test set (input values) from "modelTestSets"
    y_test=modelTestSets[1]    # Extracts the y_test set (output values) from "modelTestSets"

    # Train and Test dataset evaluation scores
    # [loss [0], accuracy [1]]
    trainScores = model.evaluate(X_train, y_train, verbose=2)  # returns a list with two values, final loss function value and the model's accuracy on the train data
    testScores = model.evaluate(X_test, y_test, verbose=2)  # returns a list with two values, final loss function value and the model's accuracy on the test data


    # Print Train dataset evaluation scores
    print("\n"+"Model evaluation results:")
    print("Train dataset loss:", trainScores[0])
    print("Train dataset accuracy:", trainScores[1])

    # Print Test dataset evaluation scores
    print("Validation dataset loss:", testScores[0])
    print("Validation dataset accuracy:", testScores[1])

    # Store evaluation scores into "scores" list object to be returned
    scores=listMultiAppend(trainScores, testScores)

    return(scores)    # returns a 2D array object


In [None]:
def getModelsPerformance(model, trainingDatasets):
    '''Displays model performance metrics.  Returns evaluation scores as a 2D array object'''

    # Variables
    scores=[]

    # Extract the Train datasets
    X_train=trainingDatasets[0][0]
    y_train=trainingDatasets[0][1]

    # Extracts the Test sets
    X_test=trainingDatasets[1][0]
    y_test=trainingDatasets[1][1]


    # Train and Test dataset evaluation scores
    # [loss [0], accuracy [1]]
    trainScores = model.evaluate(X_train, y_train, verbose=2)  # returns a list with two values, final loss function value and the model's accuracy on the train data
    testScores = model.evaluate(X_test, y_test, verbose=2)  # returns a list with two values, final loss function value and the model's accuracy on the test data

    # Store evaluation scores into "scores" list object to be returned
    scores=listMultiAppend(trainScores, testScores)

    return(scores)    # returns a 2D array object


In [None]:
def multiClassConfusionMatrix(model, dataframe):
    '''Function constructs a confusion matrix for a multi-class model'''

    # Variables
    df=dataframe

    # Make prediction (model, dataset, shuffle_dataset (True or False))
    results=predictPWSNutPhase(model, df, False)

    # Create the multiclass confusion matrix
    confusionMatrix=pd.crosstab(results.Predicted, results.Actual)
    fig=plt.figure(figsize=(17,5))
    ax=plt.subplot(121)
    ax.set_title("PWS Nut_Phase Questionnaire Deep Neural Network")
    sn.heatmap(confusionMatrix, annot=True, cmap="Reds")

    # Calculate model's overall accuracy
    numRecords=confusionMatrix.sum().sum()
    accuracy=round((np.diag(confusionMatrix).sum()/numRecords*100),2)

    report=pd.DataFrame(metrics.classification_report(results.Actual, results.Predicted, output_dict=True))
    report=report.transpose()
    report.columns=["precision", "recall", "f1-score", "no_records"]

    return(report)
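
The overall accuracy computed from the confusion matrix above is the diagonal sum divided by the record count. That is only correct when rows and columns list the same classes in the same order, and `pd.crosstab` only includes labels that actually occur. The sketch below is a plain-Python illustration with hypothetical helper names (`confusion_matrix`, `overall_accuracy`) and made-up labels. It builds the matrix over the union of actual and predicted labels so the diagonal always lines up.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Return (labels, matrix); matrix[i][j] counts records predicted as labels[i] whose actual class is labels[j]."""
    labels = sorted(set(actual) | set(predicted))
    counts = Counter(zip(predicted, actual))
    matrix = [[counts[(p, a)] for a in labels] for p in labels]
    return labels, matrix


def overall_accuracy(matrix):
    """Percentage of records on the diagonal (predicted class equals actual class)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return round(correct / total * 100, 2)


# Illustrative data: "4" is never predicted and "2b" never occurs as an actual class
actual = ["1a", "1b", "2a", "2a", "3", "4"]
predicted = ["1a", "1b", "2a", "2b", "3", "3"]

labels, matrix = confusion_matrix(actual, predicted)
accuracy = overall_accuracy(matrix)
print(labels)
print(accuracy)
```

With these inputs a crosstab of only the observed labels would be 5x5 but misaligned (rows end in "2b", "3"; columns end in "3", "4"), so its diagonal would not count the correct predictions.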


---

**Model Utility Functions**

---



**Utility Functions:**
1. calcAvgEdgeWeights()
2. saveModelEdgeWeights()
3. saveTrainedModelAsKeras()
4. saveTrainedModelAsOnnx() (deprecated)

In [None]:
def calcAvgEdgeWeights(modelsToAvg):
    '''Calculates the stochastic average edge weights for a model from weights obtained by performing multiple training iterations'''
    # Function variables
    i=0                       # inner while loop iteration index
    j=0                       # outer while loop iteration index
    numModels=0               # stores the number of models passed by the function call
    arraysToAverage=[]        # stores the arrays to be averaged
    calcAvgWeights=[]         # stores the calculated average of the sum of weights for a given neural layer
    modelWeights=[]           # stores the weights of each model
    avgModelWeights=[]        # stores all of the calculated average edge weights to be returned by the function


    # Collect the weights of each model and store in the "modelWeights" list
    for model in modelsToAvg:
      modelWeights.append(model.weights)

    # Obtain the number of models to be averaged
    numModels=len(modelsToAvg)

    # Obtain the number of weight arrays in the models (all should have the same number of arrays.  The first model "[0]" is used to obtain this number)
    numWeightArrays=len(modelWeights[0])

    if numModels > 1:  # if the number of models passed > 1, then calculate  model's average edge weights
        # Calculate the model's average edge weights
        while j < numWeightArrays:
            while i < numModels:
                arraysToAverage.append(modelWeights[i][j])            # populate the "arraysToAverage" list with arrays to be averaged.  Note loop indices, [i = model][j = array]
                calcAvgWeights=np.mean(arraysToAverage, axis=0)       # calculate the average of the array values for the current model array
                i=i+1    # increment inner loop counter 'i' by one(1)

            avgModelWeights.append(calcAvgWeights)    # append calculated average weights to final averages list to be returned
            arraysToAverage=[]     # clear the arraysToCount list
            calcAvgWeights=[]      # clear the avgVals list
            i=0      # reset inner loop counter 'i' to zero(0)
            j=j+1    # increment outer loop counter 'j' by one(1)

        # Models summary
        print("Number of models: " + str(numModels))
        print("Weight arrays per model: " + str(numWeightArrays) + "\n")
        print(avgModelWeights)

        # Return model's average edge weights (the working lists are local to this call and need no reset before returning)
        return(avgModelWeights)
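
The averaging step can also be written compactly with NumPy. The sketch below assumes each model's weights are available as a list of NumPy arrays with identical shapes across models; `average_weights` is a hypothetical name and the arrays are made-up values, not the notebook's weights.

```python
import numpy as np


def average_weights(weight_sets):
    """Average corresponding weight arrays across models.

    weight_sets: one list of arrays per model, with matching shapes in the same order.
    """
    if not weight_sets:
        raise ValueError("at least one set of weights is required")
    # zip(*weight_sets) groups the j-th array of every model together
    return [np.mean(np.stack(group), axis=0) for group in zip(*weight_sets)]


# Two illustrative "models", each with a 2x2 kernel and a bias vector
model_a = [np.array([[1.0, 2.0], [3.0, 4.0]]), np.array([0.0, 1.0])]
model_b = [np.array([[3.0, 4.0], [5.0, 6.0]]), np.array([2.0, 3.0])]

averaged = average_weights([model_a, model_b])
print(averaged[0])
print(averaged[1])
```

With Keras models, `model.get_weights()` returns such a list of NumPy arrays and `model.set_weights(...)` accepts one, so the averaged result can be loaded back into a model of the same architecture.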


In [None]:
def saveModelEdgeWeights(model, filename="pws_qnr_dnn_model.weights.h5"):
    ''' Saves all layer weights to a file. Target file name must end in ".weights.h5" '''

    # Save the model's weights ("save_weights" is called on the model and takes only the file path)
    model.save_weights(filename)


In [None]:
from datetime import date

def saveTrainedModelAsKeras(model, filename="sequential_neural_network_models"):
    '''Saves the trained model to a ".keras" file named with today's date. The "filename" argument is overridden by the generated name'''


    # Ask user if he/she would like to save the trained model
    response=input("Would you like to save this model? (y/n)")

    # Get today's date
    todaysDate = str(date.today()).replace("-", "")

    # Create file name to save model
    filename = "pws_qnr_dnn_model_" + todaysDate + ".keras"

    # Action to perform based on user's response
    if (response=='y'):    # Save the model
        model.save(filename)    # "save" returns None, so its result is not assigned

    elif (response=='n'):  # Do not save the model
        print("Ok, the model will not be saved")
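
If saved models should be catalogued by date and time rather than by date alone, the file name can be built from a full timestamp. The sketch below is one possible approach. `timestamped_model_filename` and its parameters are hypothetical, and the prefix is taken from the file name used above.

```python
from datetime import datetime


def timestamped_model_filename(prefix="pws_qnr_dnn_model", now=None, extension=".keras"):
    """Build a file name such as <prefix>_<YYYYMMDD>_<HHMMSS><extension>."""
    # Accept an explicit datetime so the function is easy to test; default to the current time
    moment = now if now is not None else datetime.now()
    return f"{prefix}_{moment.strftime('%Y%m%d_%H%M%S')}{extension}"


example = timestamped_model_filename(now=datetime(2023, 9, 20, 14, 5, 9))
print(example)  # pws_qnr_dnn_model_20230920_140509.keras
```

Including the time keeps several models saved on the same day from overwriting one another.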


In [None]:
# *** DEPRECATED ***
def saveTrainedModelAsOnnx(model):
    '''Saves the trained model as a ONNX object'''

    # Ask user if he/she would like to save the trained model
    response=input("Would you like to save this model? (y/n)")

    # Action to perform based on user's response
    if (response=='y'):    # Save the model

        # Define the dimensions and datatype of the TensorFlow model's input: [None,43], float32. "None" indicates unknown batch size (number of records).
        input_signature = [tf.TensorSpec([None, 43], tf.float32, name='x')]   # the tf.TensorSpec => num of input values and num of output values, nothing to do with the number of records!

        onnx_model, _ = tf2onnx.convert.from_keras(model, input_signature, opset=13)
        onnx.save(onnx_model, "pws_qnr_dnn_model.onnx")

    elif (response=='n'):  # Do not save the model
        print("Ok, the model will not be saved")


---


**Model Optimization Functions**


---

**Hyperparameter Tuning:**
* Batch size and training epochs
* Optimization algorithms
* Learning rate and momentum
* Network weight initialization
* Activation functions
* Dropout regularization
* Number of neurons in the hidden layer

**Optimization Functions:**
1. optimizeTestSizeSplit()
2. optimizeDimensionsOfSequentialNeuralNetwork()
3. optimizeDepthOfSequentialNeuralNetwork()
4. gridSearchCVOfSNN()
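
A grid search over the hyperparameters listed above tries every combination of candidate values. The sketch below only shows how the combinations are enumerated. `expand_grid` is a hypothetical helper and the candidate values are illustrative, not tuned settings from this notebook.

```python
from itertools import product


def expand_grid(param_grid):
    """Return one dict per combination of the candidate values in param_grid."""
    names = list(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


# Illustrative candidate values only
param_grid = {
    "batch_size": [5, 10, 20],
    "epochs": [50, 100],
    "learning_rate": [0.001, 0.01],
    "neurons": [44, 88],
}

combinations = expand_grid(param_grid)
print(len(combinations))
print(combinations[0])
```

The number of models to train is the product of the list lengths (here 3 x 2 x 2 x 2), and with k-fold cross-validation each combination is trained k times, so the grid is best kept small.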

In [None]:
def optimizeTestSizeSplit(trainTestDataFrames, shuffleTrainTestSplit):
    '''Function to find optimal Train-Test Split.  Must pass a merged dataframe when using with data split by nut_phase!'''

    # Function variables
    lossVals=[]                         # List for storing loop loss score values
    accuracyVals=[]                     # List for storing loop accuracy values
    testBatchSize=list(range(5,55,5))   # list of percentages of the data to use as the test set when training the model. Range 5-50%, incremented by 5%
    models=[]                           # stores models generated
    splitFraction=[]                    # stores percent values assayed
    AvgEdgeWeights=[]                   # stores calculated average edge weights received from "calcAvgEdgeWeights()" function call
    toShuffle=shuffleTrainTestSplit     # stores boolean value declaring whether or not to shuffle dataset during Train Test split
    i=0                                 # Loop index

    # Columns to remove or select from original dataframe to create working dataframes for model training
    nonEssential_columns = ["rec_num","sample_id","nut_phase","Phase_1a","Phase_1b","Phase_2a","Phase_2b","Phase_3","Phase_4"]
    output_columns = ["Phase_1a","Phase_1b","Phase_2a","Phase_2b","Phase_3","Phase_4"]

    # Shuffle the dataset
    df_sh = shuffle(trainTestDataFrames)

    # Create dataframes for training SNN model
    X_input=df_sh.drop(columns=nonEssential_columns)
    y_output=df_sh[output_columns]


    # While loop to iterate through various test_size fractions to find optimal percentage
    # producing the lowest loss score and highest accuracy value
    while i < len(testBatchSize):
        testSize=testBatchSize[i]/100

        # Split the dataset so that test dataset size ="testSize", where "testSize" is given by the percentage of the current loop iteration
        X_train, X_test, y_train, y_test = train_test_split(X_input, y_output, test_size=testSize, shuffle=True)

        # Train the model
        model.fit(X_train, y_train, epochs=100, batch_size=10, verbose=0, validation_data=(X_test, y_test)) #, callbacks=callbacks_list)

        # Evaluate the model
        score = model.evaluate(X_test, y_test, verbose=0)

        # Store current loop iteration loss and accuracy values, and the percent value assayed, in their respective lists
        lossVals.append(score[0])
        accuracyVals.append(score[1])
        splitFraction.append(testBatchSize[i])

        # Save the models
        models.append(model)

        # Display current iteration percent value
        print(str(int(testSize*100)) + "%")

        # increment loop index by one (1)
        i=i+1

        # AvgEdgeWeights=calcAvgEdgeWeights(models)

    # Display accuracy and loss scores
    # plt.plot(splitFraction, lossVals)
    # plt.xlabel("Test Fraction (%)")
    # plt.ylabel("Loss Function Score")
    # plt.show()

    # plt.plot(splitFraction, accuracyVals)
    # plt.xlabel("Test Fraction (%)")
    # plt.ylabel("Accuracy Score")
    # plt.show()



In [None]:
def arrangeArraysInColumnMajorOrder(inputArray):
    '''Function inputs an array of 2D lists (3D object) and returns each 2D list block transposed, so that elements at the same position across a block's lists are grouped together'''

    listsToAvg=inputArray
    block=0
    arrStruct=0
    elem=0
    tempList=[]
    elements=[]
    groupedElements=[]

    while block<len(listsToAvg): #n=2
        while elem<len(listsToAvg[0][0]): #n=8
            while arrStruct<len(listsToAvg[0]): #n=3
                tempList.append(listsToAvg[block][arrStruct][elem])
                arrStruct+=1
            elements.append(tempList)
            arrStruct=0
            elem+=1
            tempList=[]
        groupedElements.append(elements)
        elements=[]
        elem=0
        block+=1
    block=0

    i=0

    return(groupedElements)


In [None]:
testList1=[[1,2,3,4,5,6,7,8],
          [6,7,8,9,10,11,12,13],
          [3,5,8,2,10,15,23,17],
          [3,5,8,2,10,15,23,17]]

testList2=[[5,6,7,8,9,10,12,13],
          [11,12,13,14,15,16,17,19],
          [11,2,13,14,15,16,17,19],
          [3,5,8,2,10,15,23,17]]

testList3=[[15,6,27,8,9,10,12,13],
          [11,2,13,14,15,16,17,19],
          [3,5,8,2,10,15,23,17],
          [23,5,28,2,10,15,23,17]]

testList4=[[15,6,27,8,9,10,12,13],
          [11,2,13,14,15,16,17,19],
          [3,5,8,2,10,15,23,17],
          [23,5,28,2,10,15,23,17]]

testList5=[[15,6,27,8,9,10,12,13],
          [11,2,13,14,15,16,17,19],
          [3,5,8,2,10,15,23,17],
          [23,5,28,2,10,15,23,17]]

allLists=listMultiAppend(testList1, testList2, testList3, testList4, testList5)

print(arrangeArraysInColumnMajorOrder(allLists))



[[[1, 6, 3, 3], [2, 7, 5, 5], [3, 8, 8, 8], [4, 9, 2, 2], [5, 10, 10, 10], [6, 11, 15, 15], [7, 12, 23, 23], [8, 13, 17, 17]], [[5, 11, 11, 3], [6, 12, 2, 5], [7, 13, 13, 8], [8, 14, 14, 2], [9, 15, 15, 10], [10, 16, 16, 15], [12, 17, 17, 23], [13, 19, 19, 17]], [[15, 11, 3, 23], [6, 2, 5, 5], [27, 13, 8, 28], [8, 14, 2, 2], [9, 15, 10, 10], [10, 16, 15, 15], [12, 17, 23, 23], [13, 19, 17, 17]], [[15, 11, 3, 23], [6, 2, 5, 5], [27, 13, 8, 28], [8, 14, 2, 2], [9, 15, 10, 10], [10, 16, 15, 15], [12, 17, 23, 23], [13, 19, 17, 17]], [[15, 11, 3, 23], [6, 2, 5, 5], [27, 13, 8, 28], [8, 14, 2, 2], [9, 15, 10, 10], [10, 16, 15, 15], [12, 17, 23, 23], [13, 19, 17, 17]]]
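
The rearrangement shown in the output above is a per-block transpose, which Python's built-in `zip` expresses directly. This is a minimal sketch; `transpose_blocks` is a hypothetical name, and the input reuses the values of `testList1`.

```python
def transpose_blocks(blocks):
    """Transpose every 2D block: element k of each row is grouped into the k-th output list."""
    return [[list(column) for column in zip(*block)] for block in blocks]


block = [[1, 2, 3, 4, 5, 6, 7, 8],
         [6, 7, 8, 9, 10, 11, 12, 13],
         [3, 5, 8, 2, 10, 15, 23, 17],
         [3, 5, 8, 2, 10, 15, 23, 17]]

result = transpose_blocks([block])
print(result[0][:2])  # [[1, 6, 3, 3], [2, 7, 5, 5]]
```

Unlike the loop version, which sizes every block from the first one, `zip(*block)` uses each block's own shape and silently truncates to the shortest row if the rows differ in length.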


In [None]:
def optimizeDimensionsOfSequentialNeuralNetwork(dataset, maxLayerWidth, networkDepth, numInputFeatures, layersToAddPerIteration, epochs, shuffleTrainTestSplit, userDefinedVerbose):
    '''Function finds the optimal depth (layers) and width (nodes/layer) of a sequential neural network for a given dataset'''

    # Variables
    rndSeed=42                            # Sets the random seed. "If neither the global seed nor the operation-level seed is set: A randomly picked seed is used for this op. - TensorFlow"
    modelNum=1
    layerNum=0
    sum=0.0
    modelVersion=0.1
    constructNum=1
    nodes=numInputFeatures
    maxNumNodes=maxLayerWidth
    numNodes=numInputFeatures
    verbose=userDefinedVerbose
    toShuffle=shuffleTrainTestSplit
    trainingDataset=dataset
    maxNumLayers=networkDepth
    layerCount=layersToAddPerIteration

    # Lists
    totalNumNodes=[]
    performanceScores=[]
    trainAccuracyToLossRatios=[]
    testAccuracyToLossRatios=[]
    modelDimensions=[]
    trainToTestLossRatio=[]
    tempTrainLossScores=[]
    tempTestLossScores=[]
    tempTrainAccScores=[]
    tempTestAccScores=[]
    resultsTrainLoss=[]
    resultsTestLoss=[]
    resultsTrainAcc=[]
    resultsTestAcc=[]
    numModels=[]
    models=[]
    results=[]
    trainLossSum=[]
    testLossSum=[]
    trainStdDevs=[]
    testStdDevs=[]
    trainTestLossRatio=[]
    sse=[]


    # Begin recording total runtime
    startTime = time()

    while(layerCount < networkDepth):
        while(numNodes < maxLayerWidth):
            # Clears all previously created models from memory
            tf.keras.backend.clear_session()

            # Set random seed. "If neither the global seed nor the operation-level seed is set: A randomly picked seed is used for this op. - TensorFlow"
            tf.random.set_seed(rndSeed)

            # Create an instance of a Sequential Neural Network object
            model=Sequential()

            # Build the Neural Network
            # -- Add Input layer --------------------------------------------
            model.add(tf.keras.layers.Input(shape=(numInputFeatures, )))         # input layer (follows matrix mult (A X B), B is m rows and n cols, thus A is k rows and m cols (where k=#records in the dataset and m=#input cols))

            # -- Add Hidden layers ------------------------------------------
            while(layerNum < layerCount):
                model.add(tf.keras.layers.Dense(numNodes, activation="relu"))
                layerNum+=1
            layerNum=0
            numNodes+=numInputFeatures

            # -- Add Output layer -------------------------------------------
            model.add(tf.keras.layers.Dense(6, activation="softmax"))        # output layer.  Softmax converts a vector of values to a probability distribution with a range (0,1).


            # -- Compile the model ------------------------------------------
            model=compileNeuralNetwork(model)

            # -- Train the model --------------------------------------------
            model=trainNeuralNetwork(model, trainingDataset, epochs, verbose)

            # -- Evaluate model ---------------------------------------------
            print("\n" + "Model number: " + str(modelNum+modelVersion) + "\n" +
                  "Model construct number: " + str(constructNum) + "\n" +
                  "Model's dimensions (depth, height): " + "(" + str(layerCount)
                   + ", " + str(numNodes-numInputFeatures)  + ")")

            performanceScores=getModelsPerformance(model, trainingDataset)


            # -- Record loss and accuracy values ----------------------------
            # Append loss scores to the "lossScore" list
            # Train dataset loss --> [0][0]
            # Test dataset loss --> [1][0]
            tempTrainLossScores.append(performanceScores[0][0])
            tempTestLossScores.append(performanceScores[1][0])

            # Append validation accuracy value to the "accuracyScore" list
            # Train dataset accuracy --> [0][1]
            # Test dataset accuracy --> [1][1]
            tempTrainAccScores.append(performanceScores[0][1])
            tempTestAccScores.append(performanceScores[1][1])

            # Store Results
            if (round(modelVersion, 1)==0.5):    # round before comparing: repeated additions of 0.1 are not exact in floating point
                resultsTrainLoss.append(tempTrainLossScores)  # elements are listed in ascending order of network dimension
                resultsTestLoss.append(tempTestLossScores)    # elements are listed in ascending order of network dimension
                resultsTrainAcc.append(tempTrainAccScores)    # elements are listed in ascending order of network dimension
                resultsTestAcc.append(tempTestAccScores)      # elements are listed in ascending order of network dimension
                tempTrainLossScores=[]
                tempTestLossScores=[]
                tempTrainAccScores=[]
                tempTestAccScores=[]

            # Calculate the accuracy value to loss score ratios and append them to the "trainAccuracyToLossRatios" and "testAccuracyToLossRatios" lists
            trainAccuracyToLossRatios.append(np.array(performanceScores[0][1])/np.array(performanceScores[0][0]))    # must convert list to numpy array to be able to perform calculations
            testAccuracyToLossRatios.append(np.array(performanceScores[1][1])/np.array(performanceScores[1][0]))

            # Train to test loss ratio calculation
            trainToTestLossRatio.append(np.array(performanceScores[0][0])/np.array(performanceScores[1][0]))

            # model.summary()
            modelsDimensions="(" + str(layerCount) + ", " + str(numNodes-numInputFeatures)  + ")"
            modelDimensions.append(modelsDimensions)
            modelVersion+=0.1
            constructNum+=1
        numModels.append(modelNum)
        models.append("model_"+str(modelNum)+" (n="+str(layerCount)+")")
        modelNum+=1
        modelVersion=0.1
        numNodes=numInputFeatures
        layerCount+=layersToAddPerIteration

    results=listMultiAppend(resultsTrainLoss, resultsTestLoss, resultsTrainAcc, resultsTestAcc)
    results=arrangeArraysInColumnMajorOrder(results)
    numOfResults=[]
    numOfResults=list(range(1, len(results[0])+1, 1))

    i=0
    j=0
    sum=0.0

    refList=results[0]

    # # Add up train losses for each dimension
    while (i < len(refList)):
        while (j < len(refList[0])):
            sum=sum+refList[i][j]
            j+=1
        trainLossSum.append(sum)
        trainStdDevs.append(st.stdev(refList[i]))
        sum=0
        j=0
        i+=1
    j=0
    i=0

    refList=results[1]

    # Add up test losses for each dimension
    while (i < len(refList)):         # n=5
        while (j < len(refList[1])):  # n=3
            sum=sum+refList[i][j]
            j+=1
        testLossSum.append(sum)
        testStdDevs.append(st.stdev(refList[i]))
        sum=0
        j=0
        i+=1
    j=0
    i=0

    # Calculate the sum of square error for test loss
    for num in testLossSum:
        sse.append(num**2/networkDepth)

    # Calculate the difference between test and train loss (test minus train; test is usually > train),
    # thus optimal performance occurs when test-train --> 0
    while i < len(testLossSum):
        trainTestLossRatio.append(testLossSum[i]-trainLossSum[i])
        i+=1
    i=0


    # -- Plot SNN Dimensions Optimization Results ---------------------------
    dims=['(n, '+str(nodes)+')', '(n, '+str(nodes*2)+')', '(n, '+str(nodes*3)+')', '(n, '+str(nodes*4)+')', '(n, '+str(nodes*5)+')']

    # -- Plot train loss score ----------------------------------------------
    plt.figure(figsize=(10, 6))    # (width_size, height_size); create the figure before plotting so the size applies to this plot
    plt.plot(dims, results[0])    # extract and plot Train loss scores
    plt.xlabel("Network Dimension")
    plt.ylabel("Train Loss Scores")
    plt.legend(models, loc='best')
    plt.show()

    # -- Plot test loss score -----------------------------------------------
    plt.figure(figsize=(10, 6))    # (width_size, height_size)
    plt.plot(dims, results[1])    # extract and plot Test loss scores
    plt.xlabel("Network Dimension")
    plt.ylabel("Test Loss Scores")
    plt.legend(models, loc='best')
    plt.show()

    # -- Plot train accuracy scores -----------------------------------------
    # plt.plot(dims, results[2])    # extract and plot Train accuracy scores
    # plt.xlabel("Network Dimension")
    # plt.ylabel("Train Accuracy Scores")
    # plt.legend(models, loc='best')
    # plt.figure(figsize=(10, 6))    # (width_size, height_size)
    # plt.show()

    # -- Plot test accuracy scores ------------------------------------------
    # plt.plot(dims, results[3])    # extract and plot Test accuracy scores
    # plt.xlabel("Network Dimension")
    # plt.ylabel("Test Accuracy Scores")
    # plt.legend(models, loc='best')
    # plt.figure(figsize=(10, 6))    # (width_size, height_size)
    # plt.show()

    # -- Plot sums of train and test loss -----------------------------------
    plt.figure(figsize=(10, 6))    # (width_size, height_size)
    plt.plot(dims, trainLossSum)   # plot the summed Train loss scores
    plt.plot(dims, testLossSum)    # plot the summed Test loss scores
    plt.xlabel("Network Dimension")
    plt.ylabel("Sum of Loss Scores")
    plt.legend(["Train Loss Sum", "Test Loss Sum"], loc='upper right')
    plt.show()

    # # -- Plot standard deviations of loss ------------------------------------------
    # plt.plot(dims, trainStdDevs)   # extract and plot Test accuracy scores
    # plt.plot(dims, testStdDevs)    # extract and plot Test accuracy scores
    # plt.xlabel("Network Dimension")
    # plt.ylabel("Standard Deviation")
    # plt.legend(["Train stdDev", "Test stdDev"], loc='upper right')
    # plt.figure(figsize=(10, 6))    # (width_size, height_size)
    # plt.show()

    # -- Plot SSE of test loss ------------------------------------------
    plt.figure(figsize=(10, 6))    # (width_size, height_size)
    plt.plot(dims, sse)   # plot the squared test loss sums divided by "networkDepth"
    plt.xlabel("Network Dimension")
    plt.ylabel("Sum of Squared Error (SSE) for Test Loss")
    plt.show()

    # # -- Plot Train minusTest diff.  Optimal occurs when diff = 0 ------------
    # plt.plot(dims, trainTestLossRatio)    # extract and plot Test accuracy scores
    # plt.xlabel("Network Dimension")
    # plt.ylabel("Test minus Train Loss")
    # plt.figure(figsize=(10, 6))    # (width_size, height_size)
    # plt.show()

    # -- Print total run time -----------------------------------------------
    print("\n" + "Total run time: " + str(int((time()-startTime)/60)) + "mins")

    # -- Print Summary of Results -------------------------------------------
    print("\n" + "------------------- Results -----------------------")
    print("Training loss sums:")
    print(trainLossSum)
    print("Test loss sums:")
    print(testLossSum)
    print("Train-Test loss Difference:")
    print(np.subtract(testLossSum, trainLossSum))
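
The function above decides when to store a block of results by checking whether a float counter, incremented by 0.1 per model, has reached a target value. Repeated additions of 0.1 are not exact in binary floating point, so such equality tests are fragile. The sketch below shows the effect and an integer-counter alternative. `is_block_complete` and `block_size` are hypothetical names, not part of the notebook.

```python
import math

# Accumulating 0.1 three times does not give exactly 0.3
version = 0.1
for _ in range(2):
    version += 0.1

drifted = (version == 0.3)            # exact comparison fails
close = math.isclose(version, 0.3)    # tolerance-based comparison succeeds
print(drifted, close)  # False True


def is_block_complete(models_trained, block_size=5):
    """True when a whole block of block_size models has just been trained."""
    return models_trained > 0 and models_trained % block_size == 0


completed = [n for n in range(1, 11) if is_block_complete(n)]
print(completed)  # [5, 10]
```

Counting models with an integer avoids rounding questions entirely and makes the block size an explicit parameter.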


In [None]:
#######################################################################
#----------------------------------------------------------------------
# TEST FOR "optimizeDimensionsOfSequentialNeuralNetwork()" FUNCTION
#----------------------------------------------------------------------
#######################################################################

# Variables
filename1="snn_nut_phase_questionnaire_training_dataset_NO_age; 09-26-2023.csv"
filename2="snn_nut_phase_questionnaire_training_dataset_WITH_age; 09-26-2023.csv"
filename3="snn_nut_phase_questionnaire_norm_training_dataset_NO_age; 09-22-2023.csv"
filename4="snn_nut_phase_questionnaire_norm_training_dataset_WITH_age; 09-22-2023.csv"
toShuffle=True

# Read CSV datafile and convert to a Pandas Dataframe object
df=pd.DataFrame(readCSVFile(filename4))

# PARTITION THE DATASET BY NUTRITIONAL PHASE
splitDataFrame = splitDataFrameIntoInputOutputByNutPhase(df)

# SPLIT INPUT AND OUTPUT DATAFRAMES INTO TRAIN, TEST SETS (NOTE: "TRUE" HYPERPARAMETER SELECTED FOR SHUFFLING OF DATASET DURING SPLIT PROCEDURE)
modelTrainingDatasets = splitNutPhaseSeparatedDataFrameIntoTrainTestSets(splitDataFrame, True)

# MERGE THE TRAIN AND TEST SETS, RESPECTIVELY
mergedDatasets = mergePWSNutPhaseSubsets(modelTrainingDatasets)

# Parameters: (dataset, maxLayerWidth(height), networkDepth(number of hidden layers(even #)+1), numInputFeatures, layersToAddPerIteration, epochs, shuffleTrainTestSplit, userDefinedVerbose)
optimizeDimensionsOfSequentialNeuralNetwork(mergedDatasets, 374, 17, 44, 2, 30, False, 0)


FileNotFoundError: [Errno 2] No such file or directory: 'snn_nut_phase_questionnaire_norm_training_dataset_WITH_age; 09-22-2023.csv'
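
Before running the search, it helps to know how many networks the call above would train. The sketch below reproduces only the loop bounds of the nested `while` loops, assuming the depth grows by `layersToAddPerIteration` on each outer pass (2 in the call above). `dimension_grid` is a hypothetical helper and builds no models.

```python
def dimension_grid(max_layer_width, network_depth, num_input_features, layers_per_iteration):
    """List the (depth, width) pairs visited by the nested depth/width search loops."""
    grid = []
    depth = layers_per_iteration
    while depth < network_depth:
        width = num_input_features
        while width < max_layer_width:
            grid.append((depth, width))
            width += num_input_features   # width grows in multiples of the input feature count
        depth += layers_per_iteration
    return grid


grid = dimension_grid(374, 17, 44, 2)
depths = sorted({d for d, _ in grid})
widths = sorted({w for _, w in grid})
print(len(grid))
print(depths)
print(widths)
```

With these arguments the grid covers 8 depths and 8 widths, 64 models in total. The plotting code in the function labels only five widths (`nodes` through `nodes*5`), so the plotted x-axis and the number of widths searched should be checked against each other.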

In [None]:
def optimizeDepthOfSequentialNeuralNetwork(dataset, maxLayerWidth, networkDepth, numInputFeatures, shuffleTrainTestSplit, userDefinedVerbose):
    '''Function finds the optimal depth of a neural network needed for a given dataset'''

    # Variables
    rndSeed=42                            # sets the random seed. "If neither the global seed nor the operation-level seed is set, a randomly picked seed is used for this op. - TensorFlow"
    hiddenLayerNum=1                      # hidden layer reference number
    hiddenLayerCount=0                    # count of total number of hidden layers
    numHiddenLayers=networkDepth
    nodes=maxLayerWidth
    trainingDataset=dataset
    toShuffle=shuffleTrainTestSplit
    verbose=userDefinedVerbose
    performanceScores=[]
    lossScores=[]
    accuracyScores=[]
    accuracyToLossRatios=[]
    x_vals=list(range(1,networkDepth,1))

    # Keywords:
    # "Input Layer: this represents the input variables, sometimes called the visible layer.
    # Hidden Layers: these are the layers of nodes between the input and output layers. The network may contain one or more of these layers.
    # Output Layer: the final layer of nodes. Produces the output variables.
    # Size: number of nodes in the model.
    # Width: number of nodes in a specific layer.
    # Depth: number of layers in the neural network.
    # Capacity: type or structure of functions that can be learned by the current network configuration.
    # Architecture: the specific arrangement of the layers and nodes in the neural network."
    # *** Note: choose depth over width ***

    # Read CSV datafile and convert to a Pandas Dataframe object
    df=pd.DataFrame(readCSVFile(trainingDataset))

    # Shuffle the dataframe prior to splitting into Train and Test sets
    shuffledDataFrame=shuffleDataFrame(df)

    # Split the dataset into a Train set and a Test set
    trainingDataset=splitDataFrameIntoTrainTestSets(shuffledDataFrame, toShuffle)


    while (hiddenLayerNum < (len(x_vals)+1)):
        tf.keras.backend.clear_session()    # clears all previously created models from memory

        # Initialize a Sequential Neural Network Object (SNN)
        tf.random.set_seed(rndSeed)    # "If neither the global seed nor the operation-level seed is set: A randomly picked seed is used for this op. - TensorFlow"
        model=Sequential()    # create an instance of a Sequential Neural Network object

        # --- Build the model ---
        # Add input layer
        model.add(tf.keras.layers.Input(shape=(numInputFeatures, )))    # input layer (follows matrix mult (A X B), B is m rows and n cols, thus A is k rows and m cols (where k=#records in the dataset and m=#input cols))

        # Add hidden layers to model
        while (hiddenLayerCount < hiddenLayerNum):
            model.add(tf.keras.layers.Dense(nodes, activation="relu"))    # N-hidden layer
            hiddenLayerCount+=1    # increment nested while loop index by one(1)
        hiddenLayerCount=0    # reset nested while loop index to zero(0)

        # Add output layer
        model.add(tf.keras.layers.Dense(6, activation="softmax"))   # output layer.  Softmax converts a vector of values to a probability distribution with a range (0,1).

        # compile the model
        model=compileNeuralNetwork(model)

        # train the model
        model=trainNeuralNetwork(model, trainingDataset, verbose)

        # print model's performance with the current number of hidden layers
        print("\n" + "MODEL#: " + str(hiddenLayerNum) + "\n" + "Number of hidden layers: " + str(hiddenLayerNum))
        performanceScores=getModelsPerformance(model, trainingDataset)

        # "performanceScores" is 2D: [0] = train [loss, accuracy], [1] = test [loss, accuracy]
        # Append the test (validation) loss score to the "lossScores" list
        lossScores.append(performanceScores[1][0])

        # Append the test (validation) accuracy value to the "accuracyScores" list
        accuracyScores.append(performanceScores[1][1])

        # Calculate the validation accuracy-to-loss ratio and append it to the "accuracyToLossRatios" list
        accuracyToLossRatios.append(performanceScores[1][1]/performanceScores[1][0])
        print("\n")

        hiddenLayerNum+=1    # increment outer while loop index by one(1)

    hiddenLayerCount=0       # reset nested while loop index to zero(0)
    hiddenLayerNum=0         # reset outer while loop index to zero(0)
    tf.keras.backend.clear_session()    # clears all previously created models from memory

    # Plot loss score change with increasing number of hidden layers
    plt.plot(x_vals, lossScores)
    plt.xlabel("Number of Hidden Layers")
    plt.ylabel("Loss Function Score")
    plt.show()

    # Plot accuracy score change with increasing number of hidden layers
    plt.plot(x_vals, accuracyScores)
    plt.xlabel("Number of Hidden Layers")
    plt.ylabel("Accuracy Score on Validation Dataset")
    plt.show()

    # Plot accuracy-to-loss score ratio with increasing number of hidden layers
    plt.plot(x_vals, accuracyToLossRatios)
    plt.xlabel("Number of Hidden Layers")
    plt.ylabel("Accuracy-to-Loss Ratio")
    plt.show()

    print("x_vals: "+str(len(x_vals)))
    print("loss_score: "+str(len(lossScores)))
    print("acc_score: "+str(len(accuracyScores)))
    print("acc_to_loss_ratio: "+str(len(accuracyToLossRatios)))

    #return(model)
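
The sweep above collects a loss score, an accuracy score, and their ratio for each depth, then plots them. A minimal sketch of how the best depth could be read off those lists programmatically is shown below. The helper name `best_depth_by_ratio` and the sample scores are hypothetical and are not results from this notebook.

```python
def best_depth_by_ratio(depths, losses, accuracies):
    """Return (depth, ratio) for the entry with the highest accuracy-to-loss ratio."""
    if not (len(depths) == len(losses) == len(accuracies)):
        raise ValueError("depths, losses and accuracies must have the same length")
    if not depths:
        raise ValueError("at least one depth is required")

    ratios = []
    for loss, acc in zip(losses, accuracies):
        # A zero loss would make the ratio undefined; treat it as infinitely good.
        ratios.append(float("inf") if loss == 0 else acc / loss)

    best_index = max(range(len(depths)), key=lambda k: ratios[k])
    return depths[best_index], ratios[best_index]


# Made-up scores for three depths (illustrative only).
example_depths = [1, 2, 3]
example_losses = [0.50, 0.20, 0.25]
example_accuracies = [0.80, 0.90, 0.95]

best = best_depth_by_ratio(example_depths, example_losses, example_accuracies)
print("Best depth:", best[0], "ratio:", round(best[1], 2))
```

The ratio rewards high accuracy and low loss at the same time, but it is only one possible selection rule. Picking the depth with the lowest validation loss would be an equally reasonable choice.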



In [None]:
####################################################################
#-------------------------------------------------------------------
#  TEST FOR "optimizeDepthOfSequentialNeuralNetwork()" FUNCTION
#-------------------------------------------------------------------
####################################################################

# Variables
filename="nut_phase_questionnaire_data_fullset.csv"

# Parameters: (dataset, maxLayerWidth, networkDepth, numInputFeatures, shuffleTrainTestSplit, userDefinedVerbose)
optimizeDepthOfSequentialNeuralNetwork(filename, 86, 15, 43, True, 0)


In [None]:
def createSNNModelForHPTuning(neurons, batch_size, learning_rate, epochs, numFeatures):
    '''Function creates a Sequential Neural Network for tuning hyperparameters of the Nut_Phase Questionnaire deep learning model'''

    tf.random.set_seed(42)
    model = Sequential()   # create an instance of a Sequential object

    # Add layers to the network
    model.add(tf.keras.layers.Input(shape=(numFeatures, )))
    model.add(tf.keras.layers.Dense(neurons, activation="relu"))     # first hidden layer (1)
    model.add(tf.keras.layers.Dense(neurons, activation="relu"))     # second hidden layer (2)
    model.add(tf.keras.layers.Dense(neurons, activation="relu"))     # third hidden layer (3)
    model.add(tf.keras.layers.Dense(neurons, activation="relu"))     # fourth hidden layer (4)
    model.add(tf.keras.layers.Dense(neurons, activation="relu"))     # fifth hidden layer (5)
    model.add(tf.keras.layers.Dense(6, activation="softmax"))        # output layer.  Softmax converts a vector (array) of values into a probability distribution: each value lies in (0,1) and the values sum to 1.

    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate), loss='categorical_crossentropy', metrics=['accuracy'])

    return(model)
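
The network built above has five equally wide hidden layers, so its size grows quickly with the `neurons` setting. As a rough guide, the sketch below counts the weights and biases of such a fully connected network in plain Python. The helper name `dense_parameter_count` is hypothetical, and the widths 6, 36 and 86 are the values used in the grid search further down.

```python
def dense_parameter_count(num_features, neurons, hidden_layers=5, num_classes=6):
    """Count weights and biases in a fully connected (Dense-only) network."""
    layer_sizes = [num_features] + [neurons] * hidden_layers + [num_classes]
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += fan_in * fan_out + fan_out    # weight matrix plus one bias per unit
    return total


# 43 input features and 6 output classes, as in the model above.
parameter_counts = {width: dense_parameter_count(43, width) for width in (6, 36, 86)}
for width, count in parameter_counts.items():
    print("neurons=" + str(width) + ": " + str(count) + " trainable parameters")
```

Comparing these counts with the number of training records gives a quick sense of how much capacity each width adds.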


In [None]:
def gridSearchCVOfSNN():
    '''Grid search cross-validation for tuning hyperparameters of deep Sequential Neural Network'''

    # Variables
    filename="nut_phase_questionnaire_data_fullset.csv"
    modelTrainingDatasets=[]

    # Read CSV datafile and convert to a Pandas dataframe object
    dataFrame=readCSVFile(filename)

    # Separate the dataframe by nut_phase
    splitDataFrame = splitDataFrameIntoInputOutputByNutPhase(dataFrame)

    # Split dataframes into Train, Test sets.  "True" hyperparameter selected for shuffling of dataset
    modelTrainingDatasets = splitNutPhaseSeparatedDataFrameIntoTrainTestSets(splitDataFrame, True)

    # Merge the Train and the Test sets, respectively
    mergedDatasets = mergePWSNutPhaseSubsets(modelTrainingDatasets)

    # Extract input (x-values - features) and output(y-values - labels) train sets
    X_train= pd.DataFrame(mergedDatasets[0][0]).to_numpy()
    y_train= pd.DataFrame(mergedDatasets[0][1]).to_numpy()

    # Hyperparameter grid: 3 x 2 x 3 x 3 = 54 possible combinations, each fitted once per cross-validation fold
    param_grid={
        "neurons": [6, 36, 86],
        "batch_size": [6, 12],
        "learning_rate": [0.001,0.01,0.2],
        "epochs": [10,20,30]
    }

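    # Initialize a KerasClassifier object and pass to it the model-building function.
    # numFeatures is not part of param_grid, so it is supplied here and forwarded to the build function.
    model=KerasClassifier(build_fn=createSNNModelForHPTuning, numFeatures=X_train.shape[1])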

    # Initialize a Grid Search Cross-Validation object
    grid=GridSearchCV(estimator=model, param_grid=param_grid, verbose=0, n_jobs=1, cv=5)

    # Run GridSearchCV with all parameters
    grid_result=grid.fit(X_train, y_train)

    # Print best parameters:
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    best_model=grid_result.best_estimator_
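
To make the cost of the search above concrete, the following sketch expands the same `param_grid` with `itertools.product` and counts the model fits implied by 5-fold cross-validation. The helper `expand_grid` is hypothetical; `GridSearchCV` does this enumeration internally.

```python
from itertools import product

param_grid = {
    "neurons": [6, 36, 86],
    "batch_size": [6, 12],
    "learning_rate": [0.001, 0.01, 0.2],
    "epochs": [10, 20, 30],
}


def expand_grid(grid):
    """Return every combination of the grid as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


combinations = expand_grid(param_grid)
cv_folds = 5
total_fits = len(combinations) * cv_folds    # excludes the final refit on the full training set

print("Combinations:", len(combinations))
print("Model fits during the search:", total_fits)
print("First combination:", combinations[0])
```

Because every combination is trained once per fold, trimming even one value from a list in the grid reduces the run time noticeably.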


In [None]:
####################################################################
#-------------------------------------------------------------------
#           TEST FOR "gridSearchCVOfSNN()" FUNCTION
#-------------------------------------------------------------------
####################################################################

gridSearchCVOfSNN()


In [None]:
######################################
#  REFERENCE GRIDSEARCHCV FUNCTION
######################################

def create_model(neurons, batch_size, learning_rate, epochs):
    '''Function for tuning hyperparameters of the Nut_Phase Questionnaire Sequential Neural Network model'''

    # Notes: categorical_crossentropy is used as the loss function since this is a multi-class classification model with one-hot encoded labels.
    # Initialize a Sequential Neural Network object
    model=Sequential()

    # Add layers to the network
    model.add(tf.keras.layers.Input(shape=(43, )))
    model.add(tf.keras.layers.Dense(neurons, activation="relu"))     # first hidden layer (1)
    model.add(tf.keras.layers.Dense(neurons, activation="relu"))     # second hidden layer (2)
    model.add(tf.keras.layers.Dense(neurons, activation="relu"))     # third hidden layer (3)
    model.add(tf.keras.layers.Dense(neurons, activation="relu"))     # fourth hidden layer (4)
    model.add(tf.keras.layers.Dense(neurons, activation="relu"))     # fifth hidden layer (5)
    model.add(tf.keras.layers.Dense(6, activation="softmax"))        # output layer.  Softmax converts a vector (array) of values into a probability distribution: each value lies in (0,1) and the values sum to 1.

    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate), loss="categorical_crossentropy", metrics=["accuracy"])

    return(model)


In [None]:
######################################
#  REFERENCE GRIDSEARCHCV FUNCTION
######################################

# Variables
filename="nut_phase_questionnaire_data_fullset.csv"
modelTrainingDatasets=[]

# Read CSV datafile and convert to a Pandas dataframe object
dataFrame=readCSVFile(filename)

# Separate the dataframe by nut_phase
splitDataFrame = splitDataFrameIntoInputOutputByNutPhase(dataFrame)

# Split dataframes into Train, Test sets.  "True" hyperparameter selected for shuffling of dataset
modelTrainingDatasets = splitNutPhaseSeparatedDataFrameIntoTrainTestSets(splitDataFrame, True)

# Merge the Train and the Test sets, respectively
mergedDatasets = mergePWSNutPhaseSubsets(modelTrainingDatasets)

# Extract input (x-values - features) and output(y-values - labels) train sets
X_train= pd.DataFrame(mergedDatasets[0][0]).to_numpy()
Y_train= pd.DataFrame(mergedDatasets[0][1]).to_numpy()

# Hyperparameter grid: 3 x 2 x 3 x 3 = 54 possible combinations, each fitted once per cross-validation fold
param_grid={
    "neurons": [36, 86, 172],
    "batch_size": [6, 12],
    "learning_rate": [0.001,0.01,0.2],
    "epochs": [10,20,30]
}

# Initialize a Grid Search Cross-Validation object
grid = GridSearchCV(estimator=KerasClassifier(build_fn=create_model), param_grid=param_grid, n_jobs=1, verbose=0, cv=5)

# Run GridSearchCV with all parameters
grid_results=grid.fit(X_train,Y_train)

# Print best parameters:
print("Best: %f using %s" % (grid_results.best_score_, grid_results.best_params_))

best_model=grid_results.best_estimator_

---

**Main Function (Driver)**

---



In [None]:
#---------------------------------------------------------------------------------
# Title: PWS Nutritional Phase Predictor
# Author: Carlos Sulsona
# Date: 08/15/2023
# Overview: Dataset was split by Nut_Phase and shuffling was conducted during the
# Train/Test split.
# Description: dataset is split into six(6) subsets by nutritional phase. Each
# subset is then split into Train/Test sets. Train and test sets are then merged
# to produce contiguous Train/Test datasets. These "full" datasets are then
# used to train and test the Sequential Neural Network model.
#---------------------------------------------------------------------------------

# FUNCTION GLOBAL VARIABLES
filename1="snn_nut_phase_questionnaire_training_dataset_NO_age; 09-26-2023.csv"           # features: 43; nodes - 129
filename2="snn_nut_phase_questionnaire_training_dataset_WITH_age; 09-26-2023.csv"         # features: 44; nodes - 176
filename3="snn_nut_phase_questionnaire_norm_training_dataset_NO_age; 09-22-2023.csv"      # features: 43; nodes - 129
filename4="snn_nut_phase_questionnaire_norm_training_dataset_WITH_age; 09-22-2023.csv"    # features: 44; nodes - 176
filename5="snn_nut_phase_questionnaire_num_training_dataset_NO_age; 10-26-2023.csv"       # features: 43; nodes - 129
modelTrainingDatasets=[]


#--- METHOD CALLS ------------------------------------------
# READ CSV DATAFILE AND CONVERT TO A PANDAS DATAFRAME OBJECT
dataFrame=readCSVFile(filename1)

# PARTITION THE DATASET BY NUTRITIONAL PHASE
partitionedDataFrames = splitDataFrameIntoInputOutputByNutPhase(dataFrame)

# SPLIT INPUT AND OUTPUT DATAFRAMES INTO TRAIN, TEST SETS (NOTE: "TRUE"
# HYPERPARAMETER SELECTED FOR SHUFFLING OF DATASET DURING SPLIT PROCEDURE)
trainTestSplits = splitNutPhaseSeparatedDataFrameIntoTrainTestSets(partitionedDataFrames, True)

# MERGE THE TRAIN AND TEST SETS, RESPECTIVELY
mergedTrainTestDatasets = mergePWSNutPhaseSubsets(trainTestSplits)

# BUILD A SEQUENTIAL NEURAL NETWORK (SNN) - parameter (number of features, nodes per layer)
model = constructSequentialNeuralNetwork(43, 129)

# COMPILE THE SEQUENTIAL NEURAL NETWORK
model = compileNeuralNetwork(model)

# DISPLAY SUMMARY OF SNN STRUCTURE
displayNeuralNetworkSummary(model)

# DISPLAY GRAPHICAL STRUCTURE OF SNN
#displayNeuralNetworkStructure(model)

# TRAIN THE NEURAL NETWORK (NOTE: VERBOSE SET AT '2' TO DISPLAY INFORMATION ONTO
# SCREEN DURING TRAINING OF MODEL)
# (model, trainingDatasets, epochs, userDefinedVerbose):
trainNeuralNetwork(model, mergedTrainTestDatasets, 50, 2)

# VERIFY THE MODEL HAS LEARNED
verifyModelHasLearned(model)

# ASSESS NEURAL NETWORK'S PERFORMANCE
modelsPerformance=getModelsPerformance(model, mergedTrainTestDatasets)

tempModel=model
tempModel_1=model

# SAVE THE TRAINED MODEL (USER WILL BE PROMPTED TO SAVE MODEL)
#saveTrainedModelAsH5(model)
saveTrainedModelAsKeras(model)
# saveTrainedModelAsOnnx(model)

#findOptimalTestSize(mergedTrainTestDatasets)

#find optimal neural network depth
#findOptimalDimensionsOfSequentialNeuralNetwork(mergedTrainTestDatasets, 43)


None
Epoch 1/50
41/41 - 3s - 76ms/step - accuracy: 0.3444 - loss: 1.5711 - val_accuracy: 0.5566 - val_loss: 1.1218
Epoch 2/50
41/41 - 0s - 10ms/step - accuracy: 0.7925 - loss: 0.6544 - val_accuracy: 1.0000 - val_loss: 0.1104
Epoch 3/50
41/41 - 0s - 8ms/step - accuracy: 0.9793 - loss: 0.1613 - val_accuracy: 0.9528 - val_loss: 0.3014
Epoch 4/50
41/41 - 0s - 6ms/step - accuracy: 0.9751 - loss: 0.1516 - val_accuracy: 0.9906 - val_loss: 0.0856
Epoch 5/50
41/41 - 0s - 5ms/step - accuracy: 0.9876 - loss: 0.0412 - val_accuracy: 0.9717 - val_loss: 0.2799
Epoch 6/50
41/41 - 0s - 7ms/step - accuracy: 0.9668 - loss: 0.1944 - val_accuracy: 1.0000 - val_loss: 0.0250
Epoch 7/50
41/41 - 0s - 6ms/step - accuracy: 1.0000 - loss: 0.0110 - val_accuracy: 0.9811 - val_loss: 0.0390
Epoch 8/50
41/41 - 0s - 10ms/step - accuracy: 1.0000 - loss: 0.0023 - val_accuracy: 0.9811 - val_loss: 0.0320
Epoch 9/50
41/41 - 0s - 12ms/step - accuracy: 1.0000 - loss: 0.0012 - val_accuracy: 0.9811 - val_loss: 0.0362
Epoch 10/5
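
The driver above partitions the records by nutritional phase, splits each partition into train and test sets, and then merges the pieces, so every phase is represented in both sets. The sketch below shows that idea in plain Python. The function name `stratified_train_test_split`, the 20% test fraction, and the toy records are assumptions for illustration; the notebook's own helper functions may behave differently.

```python
import random


def stratified_train_test_split(records, labels, test_fraction=0.2, shuffle=True, seed=42):
    """Split records per class, then merge the per-class train and test pieces."""
    by_label = {}
    for record, label in zip(records, labels):
        by_label.setdefault(label, []).append(record)

    rng = random.Random(seed)
    train, test = [], []
    for label, group in by_label.items():
        group = list(group)
        if shuffle:
            rng.shuffle(group)    # shuffle within the class only
        n_test = max(1, round(len(group) * test_fraction))
        test.extend((rec, label) for rec in group[:n_test])
        train.extend((rec, label) for rec in group[n_test:])
    return train, test


# Toy data: 10 records for each of the six phases (illustrative only).
phases = ["Phase_1a", "Phase_1b", "Phase_2a", "Phase_2b", "Phase_3", "Phase_4"]
toy_records = list(range(60))
toy_labels = [phases[k // 10] for k in range(60)]

train_set, test_set = stratified_train_test_split(toy_records, toy_labels)
print("Train size:", len(train_set), "Test size:", len(test_set))
```

Setting `shuffle=False` mirrors the second driver later in the notebook, where the subsets are shuffled beforehand and the split itself keeps the existing order.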

---

**Test Model's Ability To Make A Prediction**

---



In [None]:
###################################################################################################
#--------------------------------------------------------------------------------------------------
#                                        PREDICT CASE
#--------------------------------------------------------------------------------------------------
###################################################################################################
import numpy as np

#--- VARIABLES --------------------
yes_counter=0
ns_counter=0

#--- TEST SEQUENTIAL NEURAL NETWORK MODEL'S ABILITY TO MAKE AN ACCURATE PREDICTION - PREDICT PWS NUTRITIONAL PHASE ---
# nut_phases [.....1a(n=7)......|.1b(n=3)..|...2a(n=5)....|.......2b(n=9)........|...........3(n=13).............|.....4(n=6).....]
testSubmission = [0,0,0,0,0,0,0,                 # 1a
                  0,0,0,                         # 1b
                  0,0,0,0,0,                     # 2a
                  1,0,1,1,1,1,1,0,1,             # 2b
                  0,0,0,1,1,0,0,0,1,0,0,0,1,     # 3
                  0,1,0,1,1,1]                   # 4


# -- Use this line of code when including age(yrs) --
# testSubmission = [0.33,                          # age(yrs)
#                   0,0,0,0,0,0,0,                 # 1a
#                   0,0,0,                         # 1b
#                   0,0,0,0,0,                     # 2a
#                   1,0,1,1,1,1,1,0,1,             # 2b
#                   0,0,0,1,1,0,0,0,1,0,0,0,1,     # 3
#                   0,1,0,1,1,1]                   # 4

#--- COUNT THE NUMBER OF "YES" AND "NS" SELECTIONS MADE --------------------
for selection in testSubmission:
    if (selection==1):
        yes_counter+=1
    elif (selection==0.5):
        ns_counter+=1

#--- DETERMINE IF SELECTIONS VALID FOR USE IN PREDICTIVE ANALYTICS WERE MADE --------------------
if (ns_counter!=0 and yes_counter==0):
    print("Please make selections other than just 'NS' on Nut_Phase Questionnaire form")

elif (yes_counter==0):
    print("Please make valid selections on Nut_Phase Questionnaire form")

elif (yes_counter!=0):
    #--- MAKE PREDICTION --------------------
    keras_model = tf.keras.models.load_model('pws_qnr_dnn_model_05242024.keras')
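    input_data = np.array([testSubmission])    # shape (1, 43): one record with 43 features
    prediction = keras_model.predict(input_data)    # input_data is already a 2-D batch, so no extra list wrapper is needed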
    print(prediction)


    #--- DISPLAY PREDICTED NUTRITIONAL PHASE --------------------
    output_columns = ["Phase_1a","Phase_1b","Phase_2a", "Phase_2b","Phase_3", "Phase_4"]
    print("\n"+ "Predicted PWS Nutritional Phase:" + "\n"+
          output_columns[prediction.argmax()] + "\n")


    #--- PROBABILITY DISTRIBUTION VALUES FOR ALL NUTRITIONAL PHASES REPRESENTED AS PERCENTAGES --------------------
    print("Probability Distribution:")

    # output_columns was defined above; pair each phase name with its predicted probability
    for i, prob_dist_val in enumerate(prediction[0]):
        print(output_columns[i] + ": " + str(round(prob_dist_val*100,1)) + "%")


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 181ms/step
[[2.7802112e-20 5.4950245e-13 6.4088529e-13 1.5490211e-07 9.9999988e-01
  6.0535192e-23]]

Predicted PWS Nutritional Phase:
Phase_3

Probability Distribution:
Phase_1a: 0.0%
Phase_1b: 0.0%
Phase_2a: 0.0%
Phase_2b: 0.0%
Phase_3: 100.0%
Phase_4: 0.0%
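
The output above is the softmax vector for one record, followed by the phase with the largest probability. A small, self-contained sketch of that decoding step is given below, using the probability vector printed above. The helper name `decode_prediction` is hypothetical.

```python
PHASES = ["Phase_1a", "Phase_1b", "Phase_2a", "Phase_2b", "Phase_3", "Phase_4"]


def decode_prediction(probabilities, labels=PHASES):
    """Return (predicted_label, {label: percentage}) for one softmax output vector."""
    if len(probabilities) != len(labels):
        raise ValueError("expected one probability per label")
    best_index = max(range(len(probabilities)), key=lambda k: probabilities[k])
    percentages = {label: round(p * 100, 1) for label, p in zip(labels, probabilities)}
    return labels[best_index], percentages


# Probability vector copied from the prediction output above.
example_probabilities = [2.7802112e-20, 5.4950245e-13, 6.4088529e-13,
                         1.5490211e-07, 9.9999988e-01, 6.0535192e-23]

predicted_phase, percentages = decode_prediction(example_probabilities)
print("Predicted PWS Nutritional Phase:", predicted_phase)
for label, pct in percentages.items():
    print(label + ": " + str(pct) + "%")
```

Keeping the label list in one place avoids the risk of the label order drifting away from the order of the model's output columns.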


---

**Main Function_2 (Driver)**

---


In [None]:
#---------------------------------------------------------------------------------
# Title: PWS Nutritional Phase Predictor
# Author: Carlos Sulsona
# Date: 08/22/2023
# Overview: Dataset was split by Nut_Phase and "PRE-SHUFFLING" of subsets was
# conducted prior to Train/Test split.
# Description: dataset is split into six(6) subsets by nutritional phase then
# each subset is shuffled. Shuffled subsets are then merged row_wise to produce a
# contiguous full dataset. This "pre-shuffled" dataset is then used to generate
# Train/Test sets for training and testing the Sequential Neural Network model.
#---------------------------------------------------------------------------------

# FUNCTION GLOBAL VARIABLES
filename1="snn_nut_phase_questionnaire_training_dataset_NO_age; 09-26-2023.csv"           # features: 43
filename2="snn_nut_phase_questionnaire_training_dataset_WITH_age; 09-26-2023.csv"         # features: 44
filename3="snn_nut_phase_questionnaire_norm_training_dataset_NO_age; 09-22-2023.csv"      # features: 43
filename4="snn_nut_phase_questionnaire_norm_training_dataset_WITH_age; 09-22-2023.csv"    # features: 44
filename5="snn_nut_phase_questionnaire_num_training_dataset_NO_age; 10-26-2023.csv"       # features: 43

modelTrainingDatasets=[]


#--- METHOD CALLS ------------------------------------------
# READ CSV DATAFILE AND CONVERT TO A PANDAS DATAFRAME OBJECT
dataFrame=readCSVFile(filename1)

# PARTITION THE DATASET BY NUTRITIONAL PHASE
splitDataFrame = splitDataFrameByNutPhase(dataFrame)

# SHUFFLE THE PARTITIONED DATAFRAMES
shuffledDataFrames=shufflePWSNutPhasePartitionedDataFrames(splitDataFrame)

# MERGE THE PARTITIONED DATAFRAMES ROW-WISE
mergedDataFrames=mergeDataFramesRow_Wise(shuffledDataFrames)

# SPLIT THE MERGED DATAFRAME INTO INPUT AND OUTPUT DATAFRAMES BY NUTRITIONAL PHASE
dataFramesForTrainTestSplit=splitDataFrameIntoInputOutputByNutPhase(mergedDataFrames)

# SPLIT INPUT AND OUTPUT DATAFRAMES INTO TRAIN, TEST SETS (NOTE: "FALSE"
# HYPERPARAMETER SELECTED FOR SHUFFLING OF DATASET DURING SPLIT PROCEDURE)
trainTestSplits = splitNutPhaseSeparatedDataFrameIntoTrainTestSets(dataFramesForTrainTestSplit, False)

# MERGE THE TRAIN AND TEST SETS, RESPECTIVELY
mergedDatasets = mergePWSNutPhaseSubsets(trainTestSplits)

# BUILD A SEQUENTIAL NEURAL NETWORK (SNN) - parameter (number of features, nodes per layer)
model2 = constructSequentialNeuralNetwork(43, 129)

# COMPILE THE SEQUENTIAL NEURAL NETWORK
model2 = compileNeuralNetwork(model2)

# DISPLAY SUMMARY OF SNN STRUCTURE
displayNeuralNetworkSummary(model2)

# DISPLAY GRAPHICAL STRUCTURE OF SNN
# displayNeuralNetworkStructure(model2)

# TRAIN THE NEURAL NETWORK (NOTE: VERBOSE SET AT '2' TO DISPLAY INFORMATION ONTO
# SCREEN DURING TRAINING OF MODEL)
# (model, trainingDatasets, epochs, userDefinedVerbose):
trainNeuralNetwork(model2, mergedDatasets, 50, 2)

# VERIFY THE MODEL HAS LEARNED
verifyModelHasLearned(model2)

# ASSESS NEURAL NETWORK'S PERFORMANCE
# modelsPerformance=getModelsPerformance(model2, mergedDatasets)

# SAVE THE TRAINED MODEL (USER WILL BE PROMPTED TO SAVE MODEL)
# saveTrainedModel(model2)


#findOptimalTestSize(mergedDatasets)

#find optimal neural network depth
#findOptimalDimensionsOfSequentialNeuralNetwork(mergedDatasets, 43)

tempModel=model2
tempModel_2=model2


None
Epoch 1/50
41/41 - 2s - 61ms/step - accuracy: 0.4315 - loss: 1.4562 - val_accuracy: 0.7925 - val_loss: 1.0650
Epoch 2/50
41/41 - 0s - 11ms/step - accuracy: 0.8506 - loss: 0.5699 - val_accuracy: 0.8491 - val_loss: 0.4770
Epoch 3/50
41/41 - 0s - 7ms/step - accuracy: 0.9751 - loss: 0.0965 - val_accuracy: 0.9811 - val_loss: 0.2424
Epoch 4/50
41/41 - 0s - 6ms/step - accuracy: 0.9710 - loss: 0.1412 - val_accuracy: 0.9811 - val_loss: 0.1673
Epoch 5/50
41/41 - 0s - 7ms/step - accuracy: 0.9876 - loss: 0.0992 - val_accuracy: 0.9528 - val_loss: 0.2727
Epoch 6/50
41/41 - 0s - 7ms/step - accuracy: 1.0000 - loss: 0.0151 - val_accuracy: 0.9811 - val_loss: 0.2319
Epoch 7/50
41/41 - 0s - 4ms/step - accuracy: 1.0000 - loss: 0.0023 - val_accuracy: 0.9811 - val_loss: 0.2950
Epoch 8/50
41/41 - 0s - 4ms/step - accuracy: 1.0000 - loss: 7.5106e-04 - val_accuracy: 0.9811 - val_loss: 0.3555
Epoch 9/50
41/41 - 0s - 5ms/step - accuracy: 1.0000 - loss: 3.5982e-04 - val_accuracy: 0.9811 - val_loss: 0.3919
Epoc

In [None]:
###################################################################################################
#--------------------------------------------------------------------------------------------------
#                                        PREDICT CASE
#--------------------------------------------------------------------------------------------------
###################################################################################################

#--- VARIABLES --------------------
yes_counter=0
ns_counter=0

#--- TEST SEQUENTIAL NEURAL NETWORK MODEL'S ABILITY TO MAKE AN ACCURATE PREDICTION - PREDICT PWS NUTRITIONAL PHASE ---
# nut_phases [.....1a(n=7)......|.1b(n=3)..|...2a(n=5)....|.......2b(n=9)........|...........3(n=13).............|.....4(n=6).....]
testSubmission = [0,0,0,0,0,0,0,                 # 1a
                  0,0,0,                         # 1b
                  0,0,0,0,0,                     # 2a
                  1,0,1,1,1,1,1,0,1,             # 2b
                  0,0,0,1,1,0,0,0,1,0,0,0,1,     # 3
                  0,1,0,1,1,1]                   # 4


# -- Use this line of code when including age(yrs) --
# testSubmission = [0.33,                          # age(yrs)
#                   0,0,0,0,0,0,0,                 # 1a
#                   0,0,0,                         # 1b
#                   0,0,0,0,0,                     # 2a
#                   1,0,1,1,1,1,1,0,1,             # 2b
#                   0,0,0,1,1,0,0,0,1,0,0,0,1,     # 3
#                   0,1,0,1,1,1]                   # 4


#--- COUNT THE NUMBER OF "YES" AND "NS" SELECTIONS MADE --------------------
for selection in testSubmission:
    if (selection==1):
        yes_counter+=1
    elif (selection==0.5):
        ns_counter+=1

#--- DETERMINE IF SELECTIONS VALID FOR USE IN PREDICTIVE ANALYTICS WERE MADE --------------------
if (ns_counter!=0 and yes_counter==0):
    print("Please make selections other than just 'NS' on Nut_Phase Questionnaire form")

elif (yes_counter==0):
    print("Please make valid selections on Nut_Phase Questionnaire form")

elif (yes_counter!=0):
    #--- MAKE PREDICTION --------------------
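    arr = np.array([testSubmission])    # shape (1, 43): one record with 43 features
    prediction = tempModel.predict(arr)    # arr is already a 2-D batch, so no extra list wrapper is needed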
    print(prediction)


    #--- DISPLAY PREDICTED NUTRITIONAL PHASE --------------------
    output_columns = ["Phase_1a","Phase_1b","Phase_2a", "Phase_2b","Phase_3", "Phase_4"]
    print("\n"+ "Predicted PWS Nutritional Phase:" + "\n"+
          output_columns[prediction.argmax()] + "\n")


    #--- PROBABILITY DISTRIBUTION VALUES FOR ALL NUTRITIONAL PHASES REPRESENTED AS PERCENTAGES --------------------
    print("Probability Distribution:")

    # output_columns was defined above; pair each phase name with its predicted probability
    for i, prob_dist_val in enumerate(prediction[0]):
        print(output_columns[i] + ": " + str(round(prob_dist_val*100,1)) + "%")


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 153ms/step
[[9.1957072e-26 1.5331028e-19 1.2938971e-25 1.0000000e+00 1.9820524e-20
  1.8079337e-14]]

Predicted PWS Nutritional Phase:
Phase_2b

Probability Distribution:
Phase_1a: 0.0%
Phase_1b: 0.0%
Phase_2a: 0.0%
Phase_2b: 100.0%
Phase_3: 0.0%
Phase_4: 0.0%
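
Both prediction cells count the "yes" (1) and "NS" (0.5) answers before calling the model. The sketch below wraps that check in a reusable function. The function name `validate_submission`, the length check, and the allowed-value check are assumptions added for illustration, and they apply to the 43-answer layout without age.

```python
def validate_submission(answers, expected_length=43):
    """Return (is_valid, message) for one questionnaire submission."""
    allowed = (0, 0.5, 1)    # no, not sure (NS), yes
    if len(answers) != expected_length:
        return False, "Expected " + str(expected_length) + " answers, got " + str(len(answers))
    if any(answer not in allowed for answer in answers):
        return False, "Answers must be 0, 0.5 or 1"

    yes_count = sum(1 for answer in answers if answer == 1)
    ns_count = sum(1 for answer in answers if answer == 0.5)

    if yes_count == 0 and ns_count != 0:
        return False, "Please make selections other than just 'NS' on Nut_Phase Questionnaire form"
    if yes_count == 0:
        return False, "Please make valid selections on Nut_Phase Questionnaire form"
    return True, "OK"


all_no = [0] * 43
only_ns = [0.5] + [0] * 42
one_yes = [1] + [0] * 42

for submission in (all_no, only_ns, one_yes):
    print(validate_submission(submission))
```

Checking the length up front catches a submission built for the wrong dataset variant before it reaches `predict`.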


---

**Model Prediction Functions**

---

In [None]:
def predictPWSNutPhase(model, dataframe, toShuffle):
    '''Function takes in a SNN model and a dataframe. Dataframe contains records
       of nut_phase questionnaire responses to be analyzed by the model for
       predicting the PWS nutritional phase of each patient'''

    # Function variables
    i=0                       # inner while loop iteration index
    j=0                       # outer while loop iteration index
    numModels=0               # stores the number of models passed by the function call
    numCorrect=0
    numIncorrect=0
    records=[]
    predictions=[]
    refColumns=[]
    comparisons=[]
    mismatchIndeces=[]


    #-- DATASET PREPARATION------------------------------------------------
    # Dataset to analyze
    df=dataframe

    # Target columns
    output_columns = ["Phase_1a", "Phase_1b", "Phase_2a", "Phase_2b","Phase_3", "Phase_4"]

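    # Shuffle dataset first so that the targets extracted below stay aligned with the shuffled records
    if (toShuffle==True):
        df=shuffleDataFrame(df)

    # Actual targets (a Pandas Series; it is converted to a one-column dataframe named "Actual" further down)
    diagnosis=df["nut_phase"]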

    # Remove irrelevant data
    df=df.drop(columns=["rec_num", "sample_id", "nut_phase", "Phase_1a","Phase_1b","Phase_2a", "Phase_2b","Phase_3", "Phase_4"])

    # Convert dataframe elements (records) to a list object then append items to "records" list
    while (i < len(df)):
        rec=list(df.loc[i])
        records.append(rec)
        i+=1
    i=0    # reset loop index

    #-- MAKE PREDICTIONS ----------------------------------------------------
    # Make predictions, one record at a time
    while (j < len(records)):
        prediction = model.predict(np.array([records[j]]), verbose=0)    # wrap the record in a 2-D array of shape (1, #features)
        predictions.append(output_columns[prediction.argmax()])
        j+=1
    j=0    # reset loop index

    #-- EVALUATE AND COMPARE RESULTS ----------------------------------------
    # Convert diagnosis and predictions list objects to Pandas Data Frame objects
    diagnosis=pd.DataFrame(diagnosis)
    diagnosis.columns=["Actual"]
    predictions=pd.DataFrame(predictions)
    predictions.columns=["Predicted"]

    # Append diagnosis and predicted columns to "refColumns" list object
    refColumns.append(diagnosis)
    refColumns.append(predictions)

    # Append dataframes
    results=pd.concat(refColumns, axis=1)

    # Compare predicted nut_phase with subject's recorded diagnosis
    while (i < len(diagnosis)):
        if (diagnosis.loc[i, "Actual"]==predictions.loc[i, "Predicted"]):
            comparisons.append("True")
            numCorrect+=1
        else:
            comparisons.append("**False")
            numIncorrect+=1
            mismatchIndeces.append(i+2)    # +2 converts the 0-based index to an Excel row number (1-based, plus the header row)
        i+=1
    i=0

    #-- DISPLAY RESULTS ----------------------------------------------------
    # Expand "results" dataframe to include "comparisons"
    comparisons=pd.DataFrame(comparisons)
    comparisons.columns=["Result"]
    refColumns=[diagnosis, predictions, comparisons]
    results=pd.concat(refColumns, axis=1)

    # Print summary of results
    print("Total number of records: " + str(len(diagnosis)))
    print("Number of correct predictions: " + str(numCorrect))
    print("Number of incorrect predictions: " + str(numIncorrect) + "\n")
    print("Percent correct: " + str(round((numCorrect)/len(diagnosis)*100,1)) + "%")
    print("Percent incorrect: " + str(round((numIncorrect)/len(diagnosis)*100,1)) + "%" + "\n")

    # If mismatches are found, then display Excel WorkSheet row numbers for the respective records
    if (len(mismatchIndeces)!=0):
        print("Records predicted incorrectly (Excel WorkSheet row number): ")
        print(mismatchIndeces)
    elif (len(mismatchIndeces)==0):
        print("-- No incorrect predictions found --")

    return(results)
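
The comparison step in the function above boils down to pairing each actual label with its prediction and recording where they differ. A compact sketch of that step is shown below. The function name `summarize_predictions` and the sample labels are hypothetical, and the Excel row offset follows the `i+2` convention used above.

```python
def summarize_predictions(actual, predicted, header_rows=1):
    """Compare two label lists and report accuracy plus spreadsheet rows of mismatches."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")

    # Spreadsheet rows are 1-based and start after the header row(s).
    mismatch_rows = [index + 1 + header_rows
                     for index, (a, p) in enumerate(zip(actual, predicted))
                     if a != p]
    total = len(actual)
    correct = total - len(mismatch_rows)
    percent_correct = round(correct / total * 100, 1) if total else 0.0
    return {
        "total": total,
        "correct": correct,
        "incorrect": len(mismatch_rows),
        "percent_correct": percent_correct,
        "mismatch_rows": mismatch_rows,
    }


# Made-up labels (illustrative only).
sample_actual = ["Phase_1a", "Phase_1b", "Phase_2a", "Phase_3", "Phase_4"]
sample_predicted = ["Phase_1a", "Phase_1b", "Phase_2b", "Phase_3", "Phase_4"]

summary = summarize_predictions(sample_actual, sample_predicted)
print(summary)
```

Returning a dictionary instead of printing makes the same summary easy to reuse when several models are compared.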


In [None]:
def multiClassConfusionMatrixMetrics_MultiModel(listOfModels, dataframe):
    '''Function evaluates multiple multiclass models using a multiclass confusion matrix
    and returns various performance metrics: accuracy, precision, recall, and f-1_score.
    Results can then be used to compare models and select the one with best performance'''

    # Variables
    i=0
    j=0
    df=dataframe
    modelName=""
    confusionMatrices=[]
    results=[]
    accuracy=[]
    outcomesAcc=[]
    accuracyMetrics={}
    accMetrics={}
    allAccMetrics=[]
    performanceMetrics={}
    allPerfMetrics=[]
    nutPhasePerfMetrics=[]
    f1_scores=pd.DataFrame()
    f1_scores_cols=[]


    # Reference
    nut_phases=["Phase_1a", "Phase_1b", "Phase_2a", "Phase_2b", "Phase_3", "Phase_4"]
    accMetrics={"0":"TP", "1":"FP", "2":"FN", "3":"TN"}    # TP-True Positive, FP-False Positive, FN-False Negative, TN-True Negative

    # Make prediction (model, dataset, shuffle_dataset (True or False))
    for model in listOfModels:
        results.append(predictPWSNutPhase(model, df, False))


    # Create the multiclass confusion matrix
    for result in results:
        confusionMatrices.append(pd.crosstab(result.Predicted, result.Actual))

    # Create a heatmap of the confusion matrix
    for confusionMatrix in confusionMatrices:
        modelName=("Model_" + str(i+1))
        fig=plt.figure(figsize=(17,5))
        ax1=plt.subplot(121)
        ax1.set_title("Model_" + str(i+1))
        sn.heatmap(confusionMatrix, annot=True, cmap="Blues")
        i+=1
    i=0

    # Number of records
    numRecords=confusionMatrices[0].sum().sum()

    # Overall accuracy
    for confusionMatrix in confusionMatrices:
        acc=round((np.diag(confusionMatrix).sum()/numRecords*100),2)
        accuracy.append(acc)


    #-- PREDICTION ASSESSMENT FOR EACH NUT_PHASE -----------------------
    # Classes: Phase_1a, Phase_1b, Phase_2a, Phase_2b, Phase_3, Phase_4
    # Accuracy categories: {"0":"TP", "1":"FP", "2":"FN", "3":"TN"}
    # Accuracy for each nutritional phase
    while (j < len(confusionMatrices)):
        while (i < len(confusionMatrices[0])):
            TP=confusionMatrices[j].iloc[i,i]            # (TP) True positive predictions
            FP=confusionMatrices[j].iloc[i,:].sum()-TP   # (FP) False positive predictions
            FN=confusionMatrices[j].iloc[:,i].sum()-TP   # (FN) False negative predictions
            TN=numRecords-(TP + FP + FN)                 # (TN) True negative predictions
            outcomesAcc=listMultiAppend(TP, FP, FN, TN)
            accuracyMetrics[nut_phases[i]+".m"+str(j+1)]=outcomesAcc
            outcomesAcc=[]
            i+=1
        allAccMetrics.append(accuracyMetrics)
        accuracyMetrics={}
        i=0
        j+=1
    j=0

    #-- MODEL PERFORMANCE CALCULATIONS ---------------------------------
    while (j < (len(allAccMetrics))):
        metricsDict=allAccMetrics[j]

        while (i < len(allAccMetrics[0])):
            listMetrics=metricsDict[nut_phases[i]+".m"+str(j+1)]

            # Model's performance calculations
            accuracy=(round(((listMetrics[0] + listMetrics[3])/numRecords)*100, 2))        # [0] accuracy
            precision=(round((listMetrics[0]/(listMetrics[0] + listMetrics[1]))*100, 2))   # [1] precision
            recall=(round((listMetrics[0]/(listMetrics[0] + listMetrics[2]))*100, 2))      # [2] recall
            f1_score=(round(((2*precision * recall)/(precision + recall)), 2))             # [3] f-1_score

            # Store results
            nutPhasePerfMetrics=listMultiAppend(accuracy, precision, recall, f1_score)
            performanceMetrics[nut_phases[i]+".m"+str(j+1)]=nutPhasePerfMetrics
            nutPhasePerfMetrics=[]
            i+=1
        allPerfMetrics.append(performanceMetrics)
        performanceMetrics={}
        i=0
        j+=1
    j=0

    #-- DISPLAY PERFORMANCE RESULTS ------------------------------------
    while (j < len(allPerfMetrics)):
        print("\n" + "MODEL "+str(j+1)+" SUMMARY: --")

        while (i < len(allPerfMetrics[0])):
            perfDict=allPerfMetrics[j]
            listMetrics=perfDict[nut_phases[i]+".m"+str(j+1)]

            print("Performance for class "+ nut_phases[i] + "\n" +
                "Accuracy: " + str(listMetrics[0]) + "%"+ "\n" +
                "Precision: " + str(listMetrics[1]) + "%"+ "\n" +
                "Recall: " + str(listMetrics[2]) + "%" + "\n" +
                "F1_Score: " + str(listMetrics[3]) + "%" + "\n")
            i+=1
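        # print model's overall accuracy (correct predictions on the confusion matrix diagonal / all records)
        modelRecords=confusionMatrices[j].sum().sum()
        print("Overall accuracy: " + str(round((np.diag(confusionMatrices[j]).sum()/modelRecords*100),2)) + "%")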
        print("Total number of records analyzed: " + str(confusionMatrices[j].sum().sum()))
        print("\n"+"\n")
        i=0
        j+=1
    j=0

    #-- DISPLAY CLASSIFICATION REPORTS ----------------------------------------------
    while (i < len(allPerfMetrics)):
        print("\n" + "\n" + "Classification Report for Model_" + str(i+1))
        # .T already puts one class per row; a second, unassigned transpose() had no effect
        report=pd.DataFrame(metrics.classification_report(results[i].Actual, results[i].Predicted, output_dict=True)).T
        report.columns=["precision", "recall", "f1-score", "#_of_records"]
        print(report)
        f1_score=pd.DataFrame(report["f1-score"])
        if (i==0):
            f1_scores=f1_score
        else:
            # Append one f1 column per additional model (not only the second one)
            f1_scores=pd.concat([f1_scores, f1_score], axis=1)
        f1_scores_cols.append("Model_" + str(i+1) + "_f1")
        i+=1
    i=0

    # Print table of f-1 scores to compare models
    f1_scores.columns=f1_scores_cols    # plain list; wrapping it in another list would create a MultiIndex
    print("\n")
    print(f1_scores)
    print("\n")

    #-- METRICS NOTE: accuracy, precision, recall, f-1_score -----------------
    print("The F1-score combines the precision and recall of a classifier into a single" + "\n" +
      " metric by taking their harmonic mean. It is primarily used to compare the" + "\n" +
      " performance of two classifiers. - Educative.io"  + "\n" + "\n" +

      "Precision or Recall? Precision measures the extent of error caused by False Positives (FPs)" + "\n" +
      " while Recall measures the extent of error caused by False Negatives (FNs)."  + "\n" +
      "Depending on the case, go with the metric and model that minimizes the least desirable" + "\n" +
      " outcome. Thus, if FPs are worse, choose the model with the highest Precision. If FNs are" + "\n" +
      " worse, choose the model with the highest Recall.")

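The metrics named in the note at the end of the function can be written in terms of the one-vs-rest counts for a single class. The symbols are assumptions read off the code: TP, FP, FN and TN are taken to be `listMetrics[0]` through `listMetrics[3]`, and N stands for `numRecords`.

```latex
\begin{aligned}
\text{Accuracy}  &= \frac{TP + TN}{N} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
F_1              &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{aligned}
```

Precision and recall are undefined when their denominator is zero, for example for a class that the model never predicts, so that case has to be handled explicitly.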

In [None]:
filename1="snn_nut_phase_questionnaire_training_dataset_NO_age; 09-26-2023.csv"                 # features: 43
filename2="snn_nut_phase_questionnaire_training_dataset_WITH_age; 09-26-2023.csv"               # features: 44
filename3="snn_nut_phase_questionnaire_normalized_training_dataset_NO_age; 09-22-2023.csv"      # features: 43
filename4="snn_nut_phase_questionnaire_normalized_training_dataset_WITH_age; 09-22-2023.csv"    # features: 44

# Read CSV datafile and convert to a Pandas Dataframe object
df=pd.DataFrame(readCSVFile(filename1))

models=[tempModel_1, tempModel_2]
results=multiClassConfusionMatrixMetrics_MultiModel(models, df)

results


Total number of records: 347
Number of correct predictions: 347
Number of incorrect predictions: 0

Percent correct: 100.0%
Percent incorrect: 0.0%

-- No incorrect predictions found --


IndexError: ignored
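
The per-class figures that `multiClassConfusionMatrixMetrics_MultiModel` reports can be reproduced from a confusion matrix alone. The following is a minimal sketch rather than the notebook's implementation: the helper name `per_class_metrics`, the class labels, and the toy counts are all hypothetical, and the matrix is assumed to have actual classes as rows and predicted classes as columns.

```python
# Sketch only: helper name, labels, and counts are illustrative, not from the notebook.
def per_class_metrics(cm, labels):
    """One-vs-rest metrics from a square confusion matrix.

    Assumed orientation: cm[r][c] is the number of records whose actual
    class is labels[r] and whose predicted class is labels[c].
    """
    n = len(labels)
    total = sum(sum(row) for row in cm)
    out = {}
    for k, label in enumerate(labels):
        tp = cm[k][k]
        fp = sum(cm[r][k] for r in range(n)) - tp   # predicted k, actually another class
        fn = sum(cm[k]) - tp                         # actually k, predicted another class
        tn = total - tp - fp - fn
        # Fall back to 0.0 when a denominator is zero (e.g. class never predicted)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        out[label] = {
            "accuracy": (tp + tn) / total,
            "precision": precision,
            "recall": recall,
            "f1": f1,
        }
    # Overall accuracy: correctly classified records (the diagonal) over all records
    overall_accuracy = sum(cm[k][k] for k in range(n)) / total
    return out, overall_accuracy


# Toy 3-class matrix with made-up counts (rows = actual, columns = predicted).
toy_cm = [
    [5, 1, 0],
    [2, 3, 0],
    [0, 0, 4],
]
per_class, overall = per_class_metrics(toy_cm, ["A", "B", "C"])
```

With these toy counts the overall accuracy is 12/15 = 0.8, while the one-vs-rest accuracy of class "C" is 1.0, which shows that a single class's accuracy is a different quantity from the model's overall accuracy.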



---

**Model Calibration**

---

