# Improving your Keras model

- Dataset: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

# Setup

In [1]:
# ! conda install keras -y
# ! conda install tensorflow -y
# ! conda install xlrd -y

In [2]:
# imports
import pandas as pd
import numpy as np

import sklearn
from sklearn.utils import class_weight
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import metrics

import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping
from keras.utils import to_categorical

import tensorflow as tf

In [3]:
# get the dataset from UCI ML Repository
# ! curl -o default.xls https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls

In [4]:
# load the dataset
df = pd.read_excel('data/default.xls', header=1)
df.shape

(30000, 25)

In [5]:
# Check for missing data
df.isnull().sum().sum()

0

Split into input (X) and output (y) variables

In [6]:
# predictors include all variables but ID and default
X = df.drop(['ID', 'default payment next month'], axis=1)
# convert target to categorical
y = to_categorical(df['default payment next month'])
# note that the y-variable is now one-hot encoded
print(y[:5])

[[0. 1.]
 [0. 1.]
 [1. 0.]
 [1. 0.]
 [1. 0.]]
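
To make the encoding concrete, here is a minimal pure-Python sketch of what one-hot encoding does to a 0/1 target. The helper `one_hot` is a hypothetical stand-in written for illustration, not Keras's `to_categorical` itself, and the five labels are the ones implied by the printed output above.

```python
def one_hot(labels, num_classes):
    """Return one row per label, with 1.0 in the column of that label."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

# first five targets implied by the printed output (1 = default, 0 = no default)
first_five = [1, 1, 0, 0, 0]
encoded = one_hot(first_five, num_classes=2)
for row in encoded:
    print(row)
```

Column 0 marks "no default" and column 1 marks "default", which is why the softmax output layer later needs two nodes.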


In [7]:
# split into 67% for train and 33% for test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Baseline Model

How many layers should the model contain?   
* There is a great deal of commentary on hidden-layer configuration in NNs (the NN FAQ gives a thorough summary). According to the source below, one point of consensus is that adding a second (or third, etc.) hidden layer improves performance in only a few situations. **One hidden layer is sufficient for the large majority of problems.**
* Two decisions must be made about the hidden layers: how many hidden layers the network has, and how many neurons each of those layers contains.
* In theory, a network with two hidden layers can represent functions of almost any shape, so the source argues there is no theoretical reason to use more than two. Treat this as a rule of thumb for small tabular problems like this one: deeper networks are routinely used for images, text, and audio, where depth helps in practice.
 - [source](https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw)


## Define the model

The number of nodes in the input layer is determined by the number of predictors: it equals the number of features (columns) in your data. Note: Some NN configurations add one additional node for a bias term.

In [9]:
# number of nodes in the input layer 
nodes_input_layer = X_train.shape[1]
print(nodes_input_layer)

23


Like the input layer, every NN has exactly one output layer. Its size (number of neurons) is determined by the chosen model configuration.
* If the NN is a regressor, then the output layer has a single node.

* If the NN is a classifier, then it also has a single node, unless softmax is used, in which case the output layer has one node per class label.

In [10]:
# number of nodes in output layer
nodes_output_layer = 2

The number of nodes in the hidden layers is not easy to determine. There is no universal answer for this question yet. Ultimately, the selection of an architecture for your neural network will come down to trial and error.
* Using too few neurons in the hidden layers will result in underfitting
* Too many neurons in the hidden layers may result in overfitting   

There are many rule-of-thumb methods for determining the correct number of neurons to use in the hidden layers, such as the following:

*    The number of hidden neurons should be between the size of the input layer and the size of the output layer.
*    The number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer.
*    The number of hidden neurons should be less than twice the size of the input layer.

- [source](https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw)
* [further reading](https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/)
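
As a rough illustration, the three rules of thumb above can be evaluated for this dataset (23 inputs, 2 outputs). The function name `hidden_size_rules_of_thumb` and its dictionary keys are hypothetical, chosen only for this sketch; the rules are heuristics, not guarantees.

```python
def hidden_size_rules_of_thumb(n_inputs, n_outputs):
    """Evaluate the three heuristics for the hidden-layer size."""
    return {
        # rule 1: somewhere between the output size and the input size
        "between_out_and_in": (min(n_inputs, n_outputs), max(n_inputs, n_outputs)),
        # rule 2: two thirds of the input size plus the output size
        "two_thirds_in_plus_out": round(2 * n_inputs / 3 + n_outputs),
        # rule 3: strictly less than twice the input size
        "less_than": 2 * n_inputs,
    }

rules = hidden_size_rules_of_thumb(n_inputs=23, n_outputs=2)
for name, value in rules.items():
    print(name, value)
```

A hidden layer of 12 nodes satisfies the first and third rules and is somewhat smaller than the second rule suggests, which is a reasonable starting point for trial and error.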

In [11]:
# number of nodes in first hidden layer
nodes_hidden_layer = 12

Rules for the activation function

Input and Output Layers:
* The input layer does not require an activation function.
* For regression problems, the output layer does not require an activation function.
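* For binary classification problems with a single output node, the activation function should be "sigmoid".
* For multi-class classification problems with one output node per class (as in this notebook, where the target is one-hot encoded), the activation function should be "softmax". Multi-label problems, where several classes can be true at once, use "sigmoid" on each output node instead.  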

Hidden Layers:
* The rectified linear activation function, or ReLU activation function, is perhaps the most common function used for hidden layers.
* Sigmoid and Tanh used to be popular but are more susceptible to vanishing gradients, which can prevent deep models from being trained.
* Recurrent networks still commonly use Tanh or sigmoid activation functions, or even both. 


Additional reading:
* [Jason Brownlee](https://machinelearningmastery.com/choose-an-activation-function-for-deep-learning/)
* [Keras documentation](https://keras.io/api/layers/activations/)
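
The three activation functions named above are short enough to write out by hand. This is a plain-Python sketch for intuition only; Keras applies the same formulas to whole tensors, and the input values here are arbitrary.

```python
import math

def relu(x):
    """Rectified linear unit: pass positives through, clip negatives to 0."""
    return max(0.0, x)

def sigmoid(x):
    """Squash a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

print(relu(-2.0), relu(3.0))
print(sigmoid(0.0))
probs = softmax([1.0, 2.0])
print(probs, sum(probs))
```

With two output nodes, softmax yields one probability per class, which is what the one-hot target in this notebook expects.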

In [12]:
# activation function for the hidden layer
activation_function_hidden_layer = 'relu'

In [13]:
# activation function for the output layer
activation_function_output_layer = 'softmax'

Keras offers three APIs for building a model:
* The Sequential API is the simplest. It groups a linear stack of layers.
* The Functional API is more flexible. It connects layers as a graph, so it can express multiple inputs or outputs, shared layers, and non-linear topologies.
* The Subclassing API is the most complex but provides the most flexibility: you write the forward pass yourself in a subclass of `Model`.

A Sequential model is appropriate for a plain stack of layers where each layer has exactly one input tensor and one output tensor. A Sequential model is not appropriate when:

* Your model has multiple inputs or multiple outputs
* Any of your layers has multiple inputs or multiple outputs
* You need to do layer sharing
* You want non-linear topology (e.g. a residual connection, a multi-branch model)

Further reading:
* [Keras documentation](https://keras.io/guides/sequential_model/)
* [Introduction to Three Keras Model APIs](https://medium.com/analytics-vidhya/beginner-level-introduction-to-three-keras-model-apis-24a45f7af3c9)
* [How to Use the Keras Functional API for Deep Learning](https://machinelearningmastery.com/keras-functional-api-deep-learning/)

In [14]:
# define the model using the sequential api
model = Sequential()

2022-09-01 21:49:34.411270: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


Model layers can be added one-by-one using the `.add` method.  "Dense" is the most common type of layer. The Keras docs define it as "Just your regular densely-connected NN layer."  

Other layer types typically included in more complex models:
* Activation layer
* Embedding layer
* Masking layer


Further reading:
* [Understanding Keras — Dense Layers](https://medium.com/@hunterheidenreich/understanding-keras-dense-layers-2abadff9b990)
* [Keras documentation](https://keras.io/api/layers/core_layers/dense/)

In [15]:
# add layers
model.add(Dense(nodes_hidden_layer, 
                activation=activation_function_hidden_layer, 
                input_shape = (nodes_input_layer,) # note: the final comma is important
               )
         )
model.add(Dense(nodes_output_layer, 
                activation=activation_function_output_layer )
         )
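
It can help to count the trainable parameters these two Dense layers create. A Dense layer has one weight per input-output pair plus one bias per output node. The helper `dense_param_count` is a hypothetical name for this sketch; the sizes are the ones set earlier in the notebook (23 inputs, 12 hidden nodes, 2 output nodes).

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden_params = dense_param_count(n_inputs=23, n_units=12)
output_params = dense_param_count(n_inputs=12, n_units=2)
total_params = hidden_params + output_params
print(hidden_params, output_params, total_params)
```

Calling `model.summary()` in Keras reports the same kind of per-layer counts, so this arithmetic is a quick way to check that the layers are wired as intended.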

## Compile the Model

The purpose of loss functions is to compute the quantity that a model should seek to minimize during training. The "default" selections are as follows:
* Regression: Mean Squared Error
* Classification: [Cross-Entropy](https://machinelearningmastery.com/cross-entropy-for-machine-learning/) - either binary or categorical.

Additional possibilities include:
* Regression: Root Mean Squared Error, Mean Squared Logarithmic Error, Mean Absolute Error
* Binary Classification: Hinge Loss, Squared Hinge Loss
* Multi-Class Classification: Sparse Multiclass Cross-Entropy, Kullback Leibler Divergence

Further reading: 
* [How to choose loss functions](https://machinelearningmastery.com/how-to-choose-loss-functions-when-training-deep-learning-neural-networks/)
* [Loss and Loss Functions for Training Deep Learning Neural Networks](https://machinelearningmastery.com/loss-and-loss-functions-for-training-deep-learning-neural-networks/)
* [Keras Documentation](https://keras.io/api/losses/)
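
For a single sample, categorical cross-entropy is the negative log of the probability the model assigned to the true class. The sketch below works this out in plain Python; the function is an illustrative reimplementation, and the probability vectors are made-up examples.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Loss for one sample: -sum(true_i * log(pred_i)), with clipping to avoid log(0)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

target = [0.0, 1.0]  # one-hot: the true class is "default"
confident = categorical_crossentropy(target, [0.1, 0.9])
unsure = categorical_crossentropy(target, [0.5, 0.5])
wrong = categorical_crossentropy(target, [0.9, 0.1])
print(confident, unsure, wrong)
```

The loss grows as the model puts less probability on the true class, which is the quantity training tries to minimize.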

In [16]:
# loss function
loss_function='categorical_crossentropy'

While several optimizers are available, Adam is a very common default choice.
* The optimizer is a procedure that iteratively updates the network weights based on the training data.
* Older optimizers include stochastic gradient descent (SGD), AdaGrad, and RMSProp. They are still available in Keras, and SGD with momentum in particular remains widely used.
* Adam is an optimization algorithm introduced by Kingma and Ba (arXiv 2014, presented at ICLR 2015) that combines the benefits of AdaGrad and RMSProp.

Further reading:
* [Introduction to the Adam Optimization Algorithm](https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/)
* [Keras documentation](https://keras.io/api/optimizers/)
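
To show what an optimizer step looks like, here is a minimal sketch of the Adam update rule applied to a one-dimensional toy problem, minimizing (x - 3)^2. The function name `adam_minimize`, the learning rate of 0.1, and the toy objective are assumptions for this illustration; Keras's own implementation operates on all network weights at once.

```python
import math

def adam_minimize(grad, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Run `steps` Adam updates on a single parameter and return its final value."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g       # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g   # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)          # bias correction for the early steps
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# gradient of (x - 3)^2 is 2 * (x - 3); the minimum is at x = 3
x_min = adam_minimize(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)
```

The running mean of squared gradients gives each parameter its own effective step size, which is the idea Adam shares with AdaGrad and RMSProp.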

In [17]:
# optimization algo
optimization_algorithm='adam'

A metric is a function used to judge the performance of your model. Metric functions are similar to loss functions, except that the results from evaluating a metric are not used when training the model. Specifying a metric is optional and, unlike the loss and optimizer, it does not change how the model trains or how well it performs.

The most common metrics are:
* mean squared error (for regression)
* accuracy (for classification)

Further reading:
- [How to use metrics](https://machinelearningmastery.com/custom-metrics-deep-learning-keras-python/#:~:text=Keras%20allows%20you%20to%20list,()%20function%20on%20your%20model.)
- [Keras documentation](https://keras.io/api/metrics/)

In [18]:
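# there are two ways to specify the built-in metrics: by name...
list_of_metrics=['accuracy']
# ...or as metric objects (this second assignment overrides the first).
# CategoricalAccuracy compares the argmax of the softmax output with the one-hot target;
# plain Accuracy() would test the raw probabilities for exact equality with the labels.
list_of_metrics=[keras.metrics.CategoricalAccuracy(), keras.metrics.Precision(), keras.metrics.Recall()]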

The `.compile()` method configures the model for training. It is a preliminary step to `.fit()`. In practice you pass at least a loss and an optimizer; metrics and several other options can be added.

Some of the parameters:
* loss [required]
* optimizer [required]
* metrics [optional]

Additional reading:
* [Keras documentation](https://keras.io/api/models/model_training_apis/)

In [19]:
# compile the model
model.compile(loss=loss_function, 
              optimizer=optimization_algorithm, 
              metrics=list_of_metrics
             )

## Fit the model

Two hyperparameters that often confuse beginners are the batch size and number of epochs. They are both integer values and seem to do the same thing.  
* Stochastic gradient descent is an iterative learning algorithm that uses a training dataset to update a model.
* The batch size is a hyperparameter of gradient descent that controls the number of training samples to work through before the model’s internal parameters are updated.
* The number of epochs is a hyperparameter of gradient descent that controls the number of complete passes through the training dataset.
* An epoch consists of one or more batches.


How many epochs should you train for? More epochs usually lower the training error, but they also lengthen training and raise the risk of overfitting the training dataset, so accuracy on unseen data does not necessarily keep improving.
* The number of epochs is traditionally large, often hundreds or thousands
* You may see examples of the number of epochs in the literature and in tutorials set to 10, 100, 500, 1000, and larger.
* You can run the algorithm for as long as you like and even stop it using other criteria besides a fixed number of epochs, such as a change (or lack of change) in model error over time.
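
The last point, stopping on a lack of improvement, is the idea behind Keras's `EarlyStopping` callback (imported at the top of this notebook), which takes arguments such as `monitor`, `patience`, and `min_delta`. The sketch below reproduces only the patience logic in plain Python; the function name and the loss history are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the (1-based) epoch at which training would stop."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss   # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1     # no improvement this epoch
            if wait >= patience:
                return epoch
    return len(val_losses)  # never triggered: ran all epochs

# made-up validation losses: best value at epoch 4, then no improvement
history = [0.60, 0.52, 0.48, 0.47, 0.49, 0.48, 0.50, 0.51]
stop_at = early_stopping_epoch(history, patience=3)
print(stop_at)
```

With a rule like this, `epochs` becomes an upper bound rather than a value that has to be guessed exactly.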

In [20]:
# how many epochs?
epochs=10

How large should the batch size be? 
* The size of a batch must be more than or equal to one and less than or equal to the number of samples in the training dataset.
* Given that very large datasets are often used to train deep learning neural networks, the batch size is rarely set to the size of the training dataset.

Smaller batch sizes (like 32) are often used because:

* Smaller batch sizes are noisy, offering a regularizing effect and lower generalization error.
* Smaller batch sizes make it easier to fit one batch worth of training data in memory (i.e. when using a GPU).
* The batch size is often set at something small, and is not tuned by the practitioner. Small batch sizes such as 32 do work well generally.
* Popular batch sizes include 32, 64, and 128 samples.

Tradeoffs:
* The larger the batch, the more accurate the estimate of the error gradient (not necessarily the more accurate the model).
* This comes at the cost of having to use the model to make many more predictions before the estimate can be calculated and, in turn, the weights updated.
* A smaller batch gives a noisier, less accurate gradient estimate, but can result in faster learning and sometimes a more robust model.


Further reading:
* [How to Control the Stability of Training Neural Networks With the Batch Size](https://machinelearningmastery.com/how-to-control-the-speed-and-stability-of-training-neural-networks-with-gradient-descent-batch-size/)
* [Difference Between a Batch and an Epoch in a Neural Network](https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/)
* [How to Configure Batch Size](https://machinelearningmastery.com/gentle-introduction-mini-batch-gradient-descent-configure-batch-size/)

What is the relationship between epochs and batch size? 

* An epoch consists of one or more batches.
* One training epoch means that the learning algorithm has made one pass through the training dataset.
* In each epoch, the training examples are shuffled (the Keras default) and split into groups of `batch_size` samples.  

For example, suppose we have a dataset with 1000 rows. If we set batch=200 and epochs=10, then the model will pass through the entire dataset 10 times (totalling 10,000 rows). During each epoch, the model will update the weights 5 times (once for every 200 rows) resulting in 50 updates to the weights. (Note: If the dataset does not divide evenly by batch size, it simply means that the final batch has fewer samples than the other batches.)
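
The same arithmetic can be written as a small helper. The function name `count_weight_updates` is hypothetical; the second call uses this notebook's settings, assuming the 20,100 training rows left after holding out 9,900 of the 30,000 rows for testing.

```python
import math

def count_weight_updates(n_samples, batch_size, epochs):
    """Return (updates per epoch, total updates); the last batch may be smaller."""
    updates_per_epoch = math.ceil(n_samples / batch_size)
    return updates_per_epoch, updates_per_epoch * epochs

# the worked example from the text: 1000 rows, batch size 200, 10 epochs
example = count_weight_updates(n_samples=1000, batch_size=200, epochs=10)
# this notebook's baseline: 20100 training rows, batch size 10, 10 epochs
baseline = count_weight_updates(n_samples=20100, batch_size=10, epochs=10)
# a dataset that does not divide evenly: 3 full batches plus one batch of 100
uneven = count_weight_updates(n_samples=1000, batch_size=300, epochs=1)
print(example, baseline, uneven)
```

A batch size of 10 therefore means many more weight updates per epoch than a batch size of 32, which is one reason small batches train more slowly per epoch.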

In [21]:
# batch size
batch_size=10

The `.fit()` method trains the model for a fixed number of epochs (iterations on a dataset). It is the most time-consuming step. The only required inputs are the X and y training data, but there are a number of optional parameters, including:

* epochs
* batch size
* validation data
* class weight

Additional reading:
* [Keras documentation](https://keras.io/api/models/model_training_apis/#fit-method)

In [22]:
# the 'verbose' option determines whether training progress is shown
# (0 = silent, 1 = progress bar); the second assignment overrides the first
show_output = 0
show_output = 1

In [23]:
# fit the keras model on the dataset
model.fit(X_train, 
          y_train, 
          epochs=epochs, 
          batch_size=batch_size,
          verbose = show_output
         )

2022-09-01 21:49:34.575025: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2022-09-01 21:49:34.575440: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2500005000 Hz


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fcf58fa0220>

## Evaluate the Model

Once you fit a deep learning neural network model, you must evaluate its performance on a test dataset.
* Keras ships a smaller set of built-in evaluation metrics and reports than scikit-learn (for example, it has no equivalent of `classification_report`).
* It's common to use the scikit-learn metrics API to evaluate a deep learning model.

Further Reading:
* [Metrics for Keras](https://machinelearningmastery.com/custom-metrics-deep-learning-keras-python/)
* [Calculate Precision, Recall, F1, and More for Deep Learning](https://machinelearningmastery.com/how-to-calculate-precision-recall-f1-and-more-for-deep-learning-models/)
* [Keras documentation](https://keras.io/api/metrics/)
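
Precision, recall, and F1 are all derived from counts of true positives, false positives, and false negatives. The sketch below computes them by hand for the positive class; the helper name and the eight toy labels are made up for illustration and are not taken from the credit-card data.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted positives, how many were right
    recall = tp / (tp + fn) if tp + fn else 0.0     # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(toy_true, toy_pred)
print(precision, recall, f1)
```

On an imbalanced target such as loan defaults, recall for the positive class often says more about a model's usefulness than overall accuracy does.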

In [24]:
# make probability predictions with the model (they come in pairs)
y_probs = model.predict(X_test)
# make class predictions with the model
y_preds = (y_probs > 0.5).astype(int)

In [25]:
# Accuracy calculated using Keras' method
metric = tf.keras.metrics.Accuracy()
metric.update_state(y_test, y_preds)
metric.result().numpy()

0.76414144

In [26]:
# the same check with Keras' BinaryAccuracy (thresholds predictions at 0.5)
metric = tf.keras.metrics.BinaryAccuracy()
metric.update_state(y_test, y_preds)
metric.result().numpy()

0.76414144

In [27]:
# the same check with Keras' CategoricalAccuracy (compares the argmax of each row)
metric = tf.keras.metrics.CategoricalAccuracy()
metric.update_state(y_test, y_preds)
metric.result().numpy()

0.76414144

In [28]:
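# Evaluate the model using sklearn
print('Accuracy: {}'.format(sklearn.metrics.accuracy_score(y_test, y_preds)))
# ROC AUC, computed here from the hard 0/1 predictions; passing y_probs instead
# would score the ranking of the predicted probabilities, which is the usual AUC
print(sklearn.metrics.roc_auc_score(y_test, y_preds))
# macro-averaged precision and recall (unweighted mean over the two classes)
print(sklearn.metrics.precision_score(y_test, y_preds, average='macro'))
print(sklearn.metrics.recall_score(y_test, y_preds, average='macro'))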

Accuracy: 0.7641414141414141
0.5300129237415453
0.5807490324522434
0.5300129237415453


In [29]:
# Evaluate the model using sklearn
print(sklearn.metrics.classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.79      0.95      0.86      7742
           1       0.37      0.11      0.18      2158

   micro avg       0.76      0.76      0.76      9900
   macro avg       0.58      0.53      0.52      9900
weighted avg       0.70      0.76      0.71      9900
 samples avg       0.76      0.76      0.76      9900



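## Recap the Baseline Model

The cell below repeats every step of the baseline in one place. Its accuracy (0.642) differs from the first run (0.764) because the weights are initialised randomly and no random seed is set, so results vary from run to run.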

In [30]:
# define the model using the sequential api
model = Sequential()
# add layers
model.add(Dense(nodes_hidden_layer, 
                activation=activation_function_hidden_layer, 
                input_shape = (nodes_input_layer,) # note: the final comma is important
               )
         )
model.add(Dense(nodes_output_layer, 
                activation=activation_function_output_layer )
         )
# compile the model
model.compile(loss=loss_function, 
              optimizer=optimization_algorithm, 
              metrics=list_of_metrics
             )
# fit the keras model on the dataset
model.fit(X_train, 
          y_train, 
          epochs=epochs, 
          batch_size=batch_size,
         )
# Evaluate
y_probs = model.predict(X_test)
y_preds = (y_probs > 0.5).astype(int)
print('Accuracy: {}'.format(sklearn.metrics.accuracy_score(y_test, y_preds)))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Accuracy: 0.6419191919191919


# Improving the Baseline Model

## Add more hidden layers

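Same as the baseline model but with two additional layers. Note that, as written, the extra layers are appended after the output layer and reuse its size (2 nodes) and softmax activation; true hidden layers would sit before the output layer and use `nodes_hidden_layer` with ReLU. The resulting accuracy of 0.782 equals the share of non-defaulters in the test set (7742 of 9900), which is what a model that always predicts "no default" would score.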

In [31]:
# define the model using the sequential api
model = Sequential()
# add layers
model.add(Dense(nodes_hidden_layer, 
                activation=activation_function_hidden_layer, 
                input_shape = (nodes_input_layer,) # note: the final comma is important
               )
         )
model.add(Dense(nodes_output_layer, 
                activation=activation_function_output_layer )
         )
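# Here we add two more layers. Note: they come after the output layer and reuse
# its size (2 nodes) and softmax activation, so they are not ReLU hidden layers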
model.add(Dense(nodes_output_layer, 
                activation=activation_function_output_layer )
         )
model.add(Dense(nodes_output_layer, 
                activation=activation_function_output_layer )
         )
# compile the model
model.compile(loss=loss_function, 
              optimizer=optimization_algorithm, 
              metrics=list_of_metrics
             )
# fit the keras model on the dataset
model.fit(X_train, 
          y_train, 
          epochs=epochs, 
          batch_size=batch_size,
          verbose=0
         )
# Evaluate
y_probs = model.predict(X_test)
y_preds = (y_probs > 0.5).astype(int)
print('Accuracy: {}'.format(sklearn.metrics.accuracy_score(y_test, y_preds)))

Accuracy: 0.7820202020202021


## Add more epochs
Same as the baseline, but with more epochs and a larger batch size.

In [32]:
epochs = 150
batch_size = 32

In [33]:
# define the model using the sequential api
model = Sequential()
# add layers
model.add(Dense(nodes_hidden_layer, 
                activation=activation_function_hidden_layer, 
                input_shape = (nodes_input_layer,) # note: the final comma is important
               )
         )
model.add(Dense(nodes_output_layer, 
                activation=activation_function_output_layer )
         )
# compile the model
model.compile(loss=loss_function, 
              optimizer=optimization_algorithm, 
              metrics=list_of_metrics
             )
# fit the keras model on the dataset
model.fit(X_train, 
          y_train, 
          epochs=epochs, 
          batch_size=batch_size,
          verbose=0
         )
# Evaluate
y_probs = model.predict(X_test)
y_preds = (y_probs > 0.5).astype(int)
print('Accuracy: {}'.format(sklearn.metrics.accuracy_score(y_test, y_preds)))

Accuracy: 0.7821212121212121


## More epochs and more layers

In [34]:
epochs = 150
batch_size = 32

In [35]:
# define the model using the sequential api
model = Sequential()
# add layers
model.add(Dense(nodes_hidden_layer, 
                activation=activation_function_hidden_layer, 
                input_shape = (nodes_input_layer,) # note: the final comma is important
               )
         )
model.add(Dense(nodes_output_layer, 
                activation=activation_function_output_layer )
         )
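# Here we add two more layers. Note: they come after the output layer and reuse
# its size (2 nodes) and softmax activation, so they are not ReLU hidden layers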
model.add(Dense(nodes_output_layer, 
                activation=activation_function_output_layer )
         )
model.add(Dense(nodes_output_layer, 
                activation=activation_function_output_layer )
         )
# compile the model
model.compile(loss=loss_function, 
              optimizer=optimization_algorithm, 
              metrics=list_of_metrics
             )
# fit the keras model on the dataset
model.fit(X_train, 
          y_train, 
          epochs=epochs, 
          batch_size=batch_size,
          verbose=0
         )
# Evaluate
y_probs = model.predict(X_test)
y_preds = (y_probs > 0.5).astype(int)
print('Accuracy: {}'.format(sklearn.metrics.accuracy_score(y_test, y_preds)))

Accuracy: 0.7820202020202021


## Standardize the Predictors

In [36]:
epochs = 10
batch_size = 10

In [37]:
# Standardize the predictors
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
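
What the scaler does can be sketched for a single column in plain Python: learn the mean and standard deviation from the training data only, then apply those same values to both sets. The function names and the toy numbers are hypothetical; `StandardScaler` uses the population standard deviation, which `statistics.pstdev` matches.

```python
import statistics

def fit_standardizer(column):
    """Learn the mean and population standard deviation of one training column."""
    return statistics.mean(column), statistics.pstdev(column)

def standardize(column, mean, std):
    """Rescale values to zero mean and unit variance using the given statistics."""
    return [(x - mean) / std for x in column]

train_col = [10.0, 20.0, 30.0]  # toy training values
test_col = [20.0, 40.0]         # toy test values

mean, std = fit_standardizer(train_col)             # statistics come from train only
train_scaled = standardize(train_col, mean, std)
test_scaled = standardize(test_col, mean, std)      # test reuses the train statistics
print(train_scaled, test_scaled)
```

Fitting on the training set and only transforming the test set, as the cell above does, keeps information about the test data out of the model.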

In [38]:
# define the model using the sequential api
model = Sequential()
# add layers
model.add(Dense(nodes_hidden_layer, 
                activation=activation_function_hidden_layer, 
                input_shape = (nodes_input_layer,) # note: the final comma is important
               )
         )
model.add(Dense(nodes_output_layer, 
                activation=activation_function_output_layer )
         )
# compile the model
model.compile(loss=loss_function, 
              optimizer=optimization_algorithm, 
              metrics=list_of_metrics
             )
# fit the keras model on the dataset
model.fit(X_train, 
          y_train, 
          epochs=epochs, 
          batch_size=batch_size,
         )
# Evaluate
y_probs = model.predict(X_test)
y_preds = (y_probs > 0.5).astype(int)
print('Accuracy: {}'.format(sklearn.metrics.accuracy_score(y_test, y_preds)))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Accuracy: 0.8168686868686869


## Use weights to balance the dataset

In [40]:
# Calculating default Ratio
non_default = len(df[df['default payment next month']==0])
default = len(df[df['default payment next month']==1])
ratio = float(default/(non_default+default))
print('Default Ratio: ', ratio)

Default Ratio:  0.2212


In [41]:
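# define class weights: each class is weighted by the other class's share,
# so the rarer class (defaults) counts more in the loss.
# note: this name shadows the class_weight module imported from sklearn.utils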
class_weight = {0:ratio, 1:1-ratio}
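
For comparison, the sketch below computes the weights used above next to the "balanced" heuristic, `n_samples / (n_classes * class_count)`, which is what scikit-learn's `compute_class_weight` applies when `class_weight='balanced'`. The function names are hypothetical, and the class counts are the ones implied by the printed default ratio of 0.2212 over 30,000 rows.

```python
def other_class_share_weights(counts):
    """Two-class scheme used in this notebook: weight = 1 - own share."""
    total = sum(counts.values())
    return {label: 1 - count / total for label, count in counts.items()}

def balanced_weights(counts):
    """'Balanced' heuristic: n_samples / (n_classes * class_count)."""
    total = sum(counts.values())
    return {label: total / (len(counts) * count) for label, count in counts.items()}

# counts implied by the default ratio printed above (6636 defaults out of 30000)
counts = {0: 23364, 1: 6636}
notebook_style = other_class_share_weights(counts)
balanced = balanced_weights(counts)
print(notebook_style)
print(balanced)
print(notebook_style[1] / notebook_style[0], balanced[1] / balanced[0])
```

Both schemes weight a default about 3.5 times as heavily as a non-default; they differ only in overall scale, which changes the magnitude of the loss but not the relative importance of the two classes.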

In [42]:
# define the model using the sequential api
model = Sequential()
# add layers
model.add(Dense(nodes_hidden_layer, 
                activation=activation_function_hidden_layer, 
                input_shape = (nodes_input_layer,) # note: the final comma is important
               )
         )
model.add(Dense(nodes_output_layer, 
                activation=activation_function_output_layer )
         )
# compile the model
model.compile(loss=loss_function, 
              optimizer=optimization_algorithm, 
              metrics=list_of_metrics
             )
# fit the keras model on the dataset
model.fit(X_train, 
          y_train, 
          epochs=epochs, 
          batch_size=batch_size,
          class_weight=class_weight
         )
# Evaluate
y_probs = model.predict(X_test)
y_preds = (y_probs > 0.5).astype(int)
print('Accuracy: {}'.format(sklearn.metrics.accuracy_score(y_test, y_preds)))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Accuracy: 0.7335353535353535


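Note that weighting the classes reduced overall accuracy (0.73, down from 0.82 with standardized predictors alone) but improved precision and recall for the default class (class 1): relative to the baseline report below, recall rose from 0.11 to 0.64 and precision from 0.37 to 0.43. The baseline was trained on unstandardized predictors, so part of the difference may come from standardization rather than from the weights.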

**Baseline Metrics**

                  precision    recall  f1-score   support

               0       0.79      0.95      0.86      7742
               1       0.37      0.11      0.18      2158

       micro avg       0.76      0.76      0.76      9900
       macro avg       0.58      0.53      0.52      9900
    weighted avg       0.70      0.76      0.71      9900
     samples avg       0.76      0.76      0.76      9900

In [43]:
# Evaluate the model using sklearn
print(sklearn.metrics.classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.88      0.76      0.82      7742
           1       0.43      0.64      0.51      2158

   micro avg       0.73      0.73      0.73      9900
   macro avg       0.65      0.70      0.66      9900
weighted avg       0.78      0.73      0.75      9900
 samples avg       0.73      0.73      0.73      9900

