<a href="https://colab.research.google.com/github/Deep-Learning-Challenge/challenge-notebooks/blob/master/1.Multilayer%20Perceptrons/1.Lessons/2.Evaluate%20The%20Performance%20of%20Deep%20Learning%20Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" /></a>

# Evaluate The Performance of Deep Learning Models

There are many decisions to make when designing and configuring your deep learning models. Most of these decisions must be resolved empirically, through trial and error and by evaluating them on real data. As such, it is critically important to have a robust way to evaluate the performance of your neural network and deep learning models. In this lesson, you will discover a few ways to evaluate model performance using Keras. After completing this lesson, you will know:

* How to evaluate a Keras model using an automatic validation dataset.
* How to evaluate a Keras model using a manual validation dataset.
* How to evaluate a Keras model using k-fold cross-validation.

Let's get started.

## Runtime Setup

In [None]:
import sys

dataset_name = "pima-indians-diabetes.data.csv"
if 'google.colab' in sys.modules:
    DATASET = f"https://github.com/Deep-Learning-Challenge/challenge-notebooks/raw/master/datasets/{dataset_name}"
else:
    DATASET = f"../../datasets/{dataset_name}"
    
DATASET

## Empirically Evaluate Network Configurations

There is a myriad of decisions you must make when designing and configuring your deep learning models. Many of these decisions can be resolved by copying the structure of other people's networks and using heuristics. Ultimately, the best technique is to design small experiments and empirically evaluate options using real data. This includes high-level decisions like the number, size, and type of layers in your network. It also includes lower-level decisions like the choice of loss function, activation functions, optimization procedure, and number of epochs.

Deep learning is often used on problems that have very large datasets, with tens of thousands or hundreds of thousands of instances. As such, you need a robust test harness that allows you to estimate a given configuration's performance on unseen data and reliably compare it to the performance of other configurations.

## Data Splitting

Large datasets and complex models require very long training times. It is therefore typical to use a simple split of the data into training and test datasets or training and validation datasets. Keras provides two convenient ways of evaluating your deep learning algorithms this way:

1. Use an automatic validation dataset.
2. Use a manual validation dataset.

### Use an Automatic Validation Dataset

Keras can separate a portion of your training data into a validation dataset and evaluate your model's performance on that validation dataset each epoch. You can do this by setting the `validation_split` argument of the `fit()` function to the fraction of your training dataset to hold back. For example, a reasonable value might be 0.2 or 0.33, for 20% or 33% of your training data held back for validation. The example below demonstrates using an automatic validation dataset on the Pima Indians onset of diabetes dataset.

In [None]:
# MLP with automatic validation set
import tensorflow as tf

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import numpy

# fix random seed for reproducibility
numpy.random.seed(7)

# load pima indians dataset
dataset = numpy.loadtxt(DATASET, delimiter=",")

# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]

# create model
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

Running the example, you can see that each epoch's verbose output shows the loss and accuracy of both the training dataset and the validation dataset.

In [None]:
# Fit the model
model.fit(X, Y, validation_split=0.33, epochs=150, batch_size=10)
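
To make the effect of `validation_split` concrete, here is a minimal sketch in plain Python of the slicing it performs. Keras documents that the validation samples are taken from the end of the data, before any shuffling. The helper name `tail_holdout` is hypothetical, and the exact rounding of the split point is an assumption that may differ between Keras versions.

```python
import math


def tail_holdout(n_samples, validation_split):
    """Return (train_indices, val_indices), holding out the last samples."""
    # Assumed rounding: floor the size of the training portion.
    split_at = math.floor(n_samples * (1.0 - validation_split))
    indices = list(range(n_samples))
    return indices[:split_at], indices[split_at:]


# The Pima Indians dataset has 768 rows.
train_idx, val_idx = tail_holdout(768, 0.33)
print(len(train_idx), len(val_idx))
print(val_idx[0], val_idx[-1])
```

Because the held-out rows are simply the tail of the array, a dataset that is sorted by class or by time should be shuffled before it is passed to `fit()` with `validation_split`.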

### Use a Manual Validation Dataset

Keras also allows you to manually specify the dataset to use for validation during training. In this example we use the handy `train_test_split()` function from the Python scikit-learn machine learning library to separate our data into training and test datasets. We use 67% of the data for training and the remaining 33% for validation. The validation dataset is passed to the `fit()` function in Keras through the `validation_data` argument, which takes a tuple of the input and output datasets.

In [None]:
# MLP with manual validation set
import tensorflow as tf

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.model_selection import train_test_split
import numpy

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)

# load pima indians dataset
dataset = numpy.loadtxt(DATASET, delimiter=",")

# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]

# split into 67% for train and 33% for test
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=seed)

# create model
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

Like before, running the example provides a verbose output of training that includes the loss and accuracy of the model on both the training and validation datasets for each epoch.

In [None]:
# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=150, batch_size=10)
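
The idea behind a shuffled hold-out split like the one above can be sketched without scikit-learn. This is an illustration only, not the `train_test_split()` implementation: the helper name `shuffled_holdout` is hypothetical, and its rounding and shuffling will not reproduce scikit-learn's exact split.

```python
import random


def shuffled_holdout(n_samples, test_size, seed):
    """Shuffle the sample indices, then hold out a fraction for testing."""
    indices = list(range(n_samples))
    # A seeded generator makes the split reproducible across runs.
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = shuffled_holdout(768, 0.33, seed=7)
print(len(train_idx), len(test_idx))
# No sample may appear in both sets.
print(set(train_idx).isdisjoint(test_idx))
```

Unlike the automatic split, the shuffle happens before the hold-out is taken, so the validation rows are drawn from the whole dataset rather than from its tail.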

## Manual k-Fold Cross-Validation

The gold standard for machine learning model evaluation is k-fold cross-validation. It provides a robust estimate of the performance of a model on unseen data. It does this by splitting the training dataset into k subsets, then taking turns training a model on all subsets except one and evaluating it on the held-out subset. The process is repeated until each subset has served as the held-out validation set. The performance measure is then averaged across all models that are created.
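
The fold rotation described above can be sketched in a few lines of plain Python. The generator name `kfold_indices` is hypothetical, and the folds here are contiguous and unshuffled to keep the illustration short.

```python
def kfold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Ten samples split into three folds: every sample is held out exactly once.
for train, test in kfold_indices(10, 3):
    print("train:", train, "test:", test)
```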

Cross-validation is often not used for evaluating deep learning models because of its greater computational expense. For example, k-fold cross-validation is often used with 5 or 10 folds. As such, 5 or 10 models must be constructed and evaluated, significantly adding to the evaluation time. Nevertheless, when the problem is small enough or if you have sufficient compute resources, k-fold cross-validation can give you a less biased estimate of the performance of your model.

In the example below, we use the handy `StratifiedKFold` [class](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html) from the scikit-learn Python machine learning library to split the training dataset into ten folds. The folds are stratified, meaning that the algorithm attempts to preserve the proportion of each class in each fold. The example creates and evaluates ten models using the ten splits of the data and collects all of the scores. The verbose output for each epoch is turned off by passing `verbose=0` to the `fit()` and `evaluate()` functions on the model. The performance of each model is printed and stored. The average and standard deviation of the model performance are then printed at the end of the run to provide a robust estimate of model accuracy.

In [None]:
# MLP for Pima Indians Dataset with 10-fold cross validation
import tensorflow as tf
tf.compat.v1.disable_eager_execution()

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

from sklearn.model_selection import StratifiedKFold
import numpy

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)

# load pima indians dataset
dataset = numpy.loadtxt(DATASET, delimiter=",")

# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]

# define 10-fold cross validation test harness
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
cvscores = []

for train, test in kfold.split(X, Y):
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    # Fit the model
    model.fit(X[train], Y[train], epochs=150, batch_size=10, verbose=0)

    # evaluate the model
    scores = model.evaluate(X[test], Y[test], verbose=0)
    print(f"{model.metrics_names[1]}: {scores[1]*100:.2f}%")
    cvscores.append(scores[1] * 100)

print(f"{numpy.mean(cvscores):.2f}% (+/- {numpy.std(cvscores):.2f}%)")
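
One detail of the final summary line is worth spelling out: `numpy.std` computes the population standard deviation by default (`ddof=0`), not the sample standard deviation. The sketch below shows the difference using the standard library. The three scores are hypothetical placeholder values, not results from this model.

```python
import statistics

# Hypothetical per-fold accuracies, in percent, used only for illustration.
fold_scores = [70.0, 75.0, 80.0]

mean_score = statistics.mean(fold_scores)
# Population standard deviation: matches numpy.std's default (ddof=0).
population_std = statistics.pstdev(fold_scores)
# Sample standard deviation: matches numpy.std(..., ddof=1).
sample_std = statistics.stdev(fold_scores)

print(f"{mean_score:.2f}% (+/- {population_std:.2f}%)")
print(f"sample standard deviation: {sample_std:.2f}%")
```

With only 5 or 10 folds the two values can differ noticeably, so it helps to state which one is being reported.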

Notice that we had to re-create the model on each iteration of the loop so that it is fit and evaluated from scratch on each fold's data. In the next lesson, we will look at how we can use Keras models natively with the scikit-learn machine learning library.
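
The stratification used in the cross-validation example can also be sketched by hand. This is a simplified illustration of the idea, not the scikit-learn algorithm: the helper name `stratified_folds` and the toy labels are hypothetical, and no shuffling is applied.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign each sample index to one of k folds, balancing classes across folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for members in by_class.values():
        # Deal each class round-robin so every fold gets a near-equal share of it.
        for position, index in enumerate(members):
            folds[position % k].append(index)
    return folds


# Hypothetical toy labels with a 2:1 class ratio.
labels = [0] * 12 + [1] * 6
folds = stratified_folds(labels, k=3)
for fold in folds:
    print(sorted(fold), [labels[i] for i in sorted(fold)])
```

Each fold keeps the 2:1 class ratio of the full label list, which is the property that makes per-fold accuracy comparable on imbalanced data.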

## Summary

In this lesson, you discovered the importance of having a robust way to estimate your deep learning models' performance on unseen data. You learned three ways that you can evaluate the performance of your deep learning models in Python using the Keras library:

* Automatically splitting a training dataset into train and validation datasets.
* Manually and explicitly defining training and validation datasets.
* Evaluating performance using k-fold cross-validation, the gold-standard technique.