# Training Neural Networks

You are advised to run this Jupyter Notebook on Google Colab. From the Colab toolbar, select *Runtime* > *Change runtime type* > *T4 GPU* > *Save* before running the Notebook.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from keras import Model
from keras import Input
from keras.layers import Dense
from keras.layers import Rescaling
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras.layers import Flatten
from keras.layers import BatchNormalization

from keras.optimizers import RMSprop

from keras.callbacks import EarlyStopping

from keras.datasets import mnist

from keras.applications import ResNet50
import keras.applications.resnet as resnet

from keras.preprocessing.image import load_img
from keras.preprocessing import image_dataset_from_directory

In [None]:
# The next line installs keras_tuner, which is needed on Google Colab. If you are running elsewhere and already have keras_tuner installed, comment it out before executing this code cell.

! pip install keras_tuner

import keras_tuner

In [None]:
import os
if 'google.colab' in str(get_ipython()):
    from google.colab import drive
    drive.mount('/content/drive')
    base_dir = "./drive/My Drive/Colab Notebooks/" # You may need to change this, depending on where your notebooks are on Google Drive
else:
    base_dir = "."
dataset_dir = os.path.join(base_dir, "datasets")

In [None]:
def plot_keras_history(history, metric):
    fig, axes = plt.subplots(1, 2, figsize=(6, 3))
    fig.tight_layout()
    axes[0].plot(history.history["loss"], label="train loss")
    axes[0].plot(history.history["val_loss"], label="val loss")
    axes[0].set_title("Loss")
    axes[0].legend()
    axes[1].plot(history.history[metric], label="train " + metric)
    axes[1].plot(history.history["val_" + metric], label="val " + metric)
    axes[1].set_title(metric)
    axes[1].legend()
    plt.show()

## Vanishing Gradients

In [None]:
(X_train, y_train), (X_test, y_test) = mnist.load_data()

In [None]:
X_train = X_train.reshape((60000, 28, 28, 1))
X_test = X_test.reshape((10000, 28, 28, 1))

The lecture discusses three solutions to the vanishing gradients problem: using ReLU in place of sigmoid for the activation functions of the hidden layers; using Glorot uniform initialization rather than, e.g., random normal initialization; and using Batch Normalization layers. (Note that Glorot uniform initialization is the default in any case.)
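
To make the first of these points concrete, here is a minimal sketch (not part of the lecture code) that looks only at the activation-derivative factor in backpropagation. It assumes a hypothetical stack of 10 layers and ignores the weight factors entirely. The derivative of the sigmoid never exceeds 0.25, so even in the best case the product over many layers shrinks quickly, whereas an active ReLU unit contributes a factor of 1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: s(z) * (1 - s(z)), largest at z = 0
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise
    return 1.0 if z > 0 else 0.0

num_layers = 10  # hypothetical depth, chosen only for illustration

# Best case for sigmoid: every pre-activation sits exactly at z = 0
best_case_sigmoid = sigmoid_grad(0.0) ** num_layers

# An active ReLU unit (positive pre-activation) in every layer
active_relu = relu_grad(1.0) ** num_layers

print(sigmoid_grad(0.0))   # 0.25
print(best_case_sigmoid)   # roughly 9.5e-07
print(active_relu)         # 1.0
```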

To see whether these really are helpful, here's a grid search that tries all 8 combinations:

In [None]:
def build_mnist_model(hp):
    hp_activation = hp.Choice("activation", ["sigmoid", "relu"])
    hp_initialization = hp.Choice("initialization", ["random_normal", "glorot_uniform"])
    hp_is_batch_normalized = hp.Boolean("is_batch_normalized")
    inputs = Input(shape=(28, 28, 1))
    x = Rescaling(scale=1./255)(inputs)
    x = Conv2D(filters=32, kernel_size=(3, 3),
               activation=hp_activation,
               kernel_initializer=hp_initialization)(x)
    if hp_is_batch_normalized:
        x = BatchNormalization()(x)
    x = MaxPooling2D(pool_size=(2, 2))(x)
    x = Conv2D(filters=64, kernel_size=(3, 3),
               activation=hp_activation,
               kernel_initializer=hp_initialization)(x)
    if hp_is_batch_normalized:
        x = BatchNormalization()(x)
    x = MaxPooling2D(pool_size=(2, 2))(x)
    x = Conv2D(filters=64, kernel_size=(3, 3),
               activation=hp_activation,
               kernel_initializer=hp_initialization)(x)
    if hp_is_batch_normalized:
        x = BatchNormalization()(x)
    x = Flatten()(x)
    x = Dense(units=64, activation=hp_activation,
              kernel_initializer=hp_initialization)(x)
    if hp_is_batch_normalized:
        x = BatchNormalization()(x)
    outputs = Dense(units=10, activation="softmax",
                    kernel_initializer=hp_initialization)(x)
    convnet = Model(inputs, outputs)
    convnet.compile(optimizer=RMSprop(learning_rate=0.0001), loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return convnet
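
The three hyperparameters declared in `build_mnist_model` each take two values, so an exhaustive grid contains $2 \times 2 \times 2 = 8$ trials. The sketch below enumerates them in plain Python. It only illustrates the size of the search space; the `search_space` dictionary is a stand-in written for this example, not something KerasTuner exposes.

```python
from itertools import product

# Stand-in for the choices declared inside build_mnist_model
search_space = {
    "activation": ["sigmoid", "relu"],
    "initialization": ["random_normal", "glorot_uniform"],
    "is_batch_normalized": [False, True],
}

# Every combination of one value per hyperparameter
combinations = [
    dict(zip(search_space, values))
    for values in product(*search_space.values())
]

print(len(combinations))  # 8
for combination in combinations:
    print(combination)
```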

In [None]:
tuner = keras_tuner.GridSearch(
    build_mnist_model,
    objective="val_accuracy",
    directory=base_dir,
    project_name="tuner_state",
    overwrite=True)

In [None]:
tuner.search(X_train, y_train, epochs=20, validation_split=0.25)

In [None]:
tuner.get_best_hyperparameters()[0].values

Training involves randomness (for example, in the initial weights and in how the training data is shuffled), so we may get different results each time we run the search.

But, sure enough, most times I run it, the winning combination includes Batch Normalization - and there is usually even a small improvement in validation accuracy and test set accuracy.

## Cats and Dogs

To illustrate transfer learning, we'll use a dataset of images of cats and dogs. It comes from Microsoft researchers, for a Kaggle competition: https://www.kaggle.com/c/dogs-vs-cats. It contains 12,500 medium-resolution JPEGs depicting cats and 12,500 depicting dogs.
We use a subset of the full dataset:
- training set: 1000 cats and 1000 dogs;
- validation set: 500 cats and 500 dogs;
- test set: 500 cats and 500 dogs.

In [None]:
cats_and_dogs_dir = os.path.join(dataset_dir, "cats_and_dogs")
train_dir = os.path.join(cats_and_dogs_dir, "train")
val_dir = os.path.join(cats_and_dogs_dir, "validation")
test_dir = os.path.join(cats_and_dogs_dir, "test")

In [None]:
# Let's look at one of the images

train_dogs_dir = os.path.join(train_dir, "dogs")
filenames = [os.path.join(train_dogs_dir, filename) for filename in os.listdir(train_dogs_dir)]

idx = 400 # Change this if you want to look at a different dog
some_example = load_img(filenames[idx], target_size=(200,200))

plt.imshow(some_example)
plt.show()

Keras gives us an extremely useful function: `image_dataset_from_directory`.

It has lots of arguments, some of which are explained here:
- `directory`: Where the data is located.
- `labels`: The default is `"inferred"`, meaning the labels are taken from the directory structure.
- `label_mode`: For binary classification, use `"binary"` so that the labels are encoded as 0 or 1; for multiclass classification, use `"int"` (default) so that the labels are encoded as integers.
- `color_mode`: Either `"grayscale"`, `"rgb"` (default) or `"rgba"`. Images will be converted to have 1, 3, or 4 channels, based on the value you give.
- `batch_size`: With this argument, we can read in and process the dataset in mini-batches, rather than reading the whole dataset into main memory. Its default value is 32.
- `image_size`: A dataset may contain images of different sizes, but the networks in this notebook expect every input to have the same shape, so this argument resizes all the images to one size. The default is $256 \times 256$.
- `shuffle`: Whether to shuffle the data (default is True).
- `seed`: Optional random seed for shuffling (default is None).

...and a few more.

This function will also decode images from one format into the format that it uses internally. For example, if the raw images are JPEGs, it will decompress them.
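
As a small worked example of what `batch_size` implies, the arithmetic below counts the mini-batches in one pass over the 2000-image training subset with the default batch size of 32. It assumes the final, smaller batch is kept rather than dropped.

```python
import math

num_images = 2000  # 1000 cats + 1000 dogs in the training subset
batch_size = 32    # default value of the batch_size argument

# Number of full batches and the size of the leftover batch
full_batches, remainder = divmod(num_images, batch_size)

# Assumption: the leftover images form one final, smaller batch
batches_per_epoch = math.ceil(num_images / batch_size)

print(full_batches, remainder)  # 62 16
print(batches_per_epoch)        # 63
```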

In [None]:
train_dataset = image_dataset_from_directory(directory=train_dir, label_mode="binary", image_size=(224, 224))
val_dataset = image_dataset_from_directory(directory=val_dir, label_mode="binary", image_size=(224, 224))
test_dataset = image_dataset_from_directory(directory=test_dir, label_mode="binary", image_size=(224, 224))

Let's train a model using this quite small dataset. Note how `fit` has an argument `validation_data` instead of `validation_split` - this is because we have a separate validation set that we can use, rather than splitting one off from the training set.
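
For comparison, `validation_split` (used earlier on MNIST) holds out the last fraction of the samples it is given, before any shuffling. The sketch below mimics that behaviour on a list of indices. The helper name `split_like_validation_split` is made up for this illustration and is not a Keras function.

```python
def split_like_validation_split(samples, validation_split):
    # Keep the first part for training; hold out the last part for validation
    split_at = int(len(samples) * (1.0 - validation_split))
    return samples[:split_at], samples[split_at:]

indices = list(range(60000))  # one index per MNIST training image
train_part, val_part = split_like_validation_split(indices, 0.25)

print(len(train_part), len(val_part))  # 45000 15000
print(val_part[0])                     # 45000
```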

In [None]:
inputs = Input(shape=(224, 224, 3))
x = Rescaling(scale=1./255)(inputs)
x = Conv2D(filters=128, kernel_size=(3, 3), activation="relu")(x)
x = MaxPooling2D(pool_size=(2, 2))(x)
x = Conv2D(filters=128, kernel_size=(3, 3), activation="relu")(x)
x = MaxPooling2D(pool_size=(2, 2))(x)
x = Conv2D(filters=64, kernel_size=(3, 3), activation="relu")(x)
x = MaxPooling2D(pool_size=(2, 2))(x)
x = Conv2D(filters=32, kernel_size=(3, 3), activation="relu")(x)
x = MaxPooling2D(pool_size=(2, 2))(x)
x = Flatten()(x)
x = Dense(units=512, activation="relu")(x)
outputs = Dense(units=1, activation="sigmoid")(x)
convnet = Model(inputs, outputs)

In [None]:
convnet.compile(optimizer=RMSprop(learning_rate=0.001), loss="binary_crossentropy", metrics=["accuracy"])

In [None]:
convnet_history = convnet.fit(train_dataset, epochs=30,
                validation_data=val_dataset,
                callbacks=[EarlyStopping(monitor="val_loss", patience=4, restore_best_weights=True)],
                verbose=0)

In [None]:
plot_keras_history(convnet_history, "accuracy")

In [None]:
wikipedia_dataset = image_dataset_from_directory(
    directory=os.path.join(dataset_dir, "wikipedia_cats_and_dogs"), batch_size=12, shuffle=False, label_mode=None, image_size=(224, 224))

In [None]:
predictions = convnet.predict(wikipedia_dataset)

In [None]:
predictions

In [None]:
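def plot_image_grid(dataset, num_images=12, images_per_row=3, titles=None):
    # Number of rows needed to show num_images with images_per_row in each row
    num_rows = -(-num_images // images_per_row)
    plt.figure(figsize=(10, 10))
    # Only the first batch is plotted, so num_images must not exceed the batch size
    for batch_images in dataset.take(1):
        for i in range(num_images):
            # plt.subplot takes (number of rows, number of columns, index)
            plt.subplot(num_rows, images_per_row, i + 1)
            plt.imshow(batch_images[i].numpy().astype("uint8"))
            if titles is not None:
                plt.title(titles[i])
            plt.axis("off")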

In [None]:
# predictions has shape (12, 1); flatten it so that each title is a plain string
plot_image_grid(wikipedia_dataset, titles=np.where(predictions.ravel() > 0.5, "dog", "cat"))

## ResNet50

ResNet50 is a deep convolutional neural network pre-trained on the ImageNet dataset. Let's take a look at its architecture.

In [None]:
resnet50 = ResNet50(weights="imagenet", include_top=True, input_shape=(224, 224, 3))

In [None]:
resnet50.summary()

In [None]:
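# ResNet50 expects its inputs to have been transformed by its own preprocess_input function
predictions = resnet50.predict(wikipedia_dataset.map(resnet.preprocess_input))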
resnet.decode_predictions(predictions, top=3)

In [None]:
plot_image_grid(wikipedia_dataset, titles=[prediction[0][1] for prediction in resnet.decode_predictions(predictions, top=1)])

## Transfer Learning

Earlier, we loaded the whole ResNet50 model so that we could take a look at it. For transfer learning, we don't want the whole model.

We want just the *base*: everything except the top, i.e. everything except the last two layers (the global average pooling layer and the 1000-class output layer).

In [None]:
resnet50_base = ResNet50(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

In [None]:
inputs = Input(shape=(224, 224, 3))
x = resnet.preprocess_input(inputs)
x = resnet50_base(x)
x = Flatten()(x)
x = Dense(units=16, activation="relu")(x)
outputs = Dense(units=1, activation="sigmoid")(x)
transfer_model = Model(inputs=inputs, outputs=outputs)

We freeze the weights in the layers of the convolutional base. If we did not, the large weight updates caused by the new, randomly initialized layers could destroy the features that ResNet50 learned previously.

In [None]:
for layer in resnet50_base.layers:
    layer.trainable = False
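
With the base frozen, only the two new `Dense` layers are trained. The arithmetic below counts their parameters. It assumes that the base maps a $224 \times 224 \times 3$ input to a $7 \times 7 \times 2048$ feature map, which `resnet50_base.summary()` should confirm. `dense_params` is a helper written for this example.

```python
def dense_params(num_inputs, units):
    # One weight per input per unit, plus one bias per unit
    return num_inputs * units + units

# Assumed output shape of the frozen base for 224 x 224 x 3 inputs
base_output_shape = (7, 7, 2048)

# Flatten turns the feature map into a single vector
flattened = base_output_shape[0] * base_output_shape[1] * base_output_shape[2]

hidden_params = dense_params(flattened, 16)  # Dense(units=16)
output_params = dense_params(16, 1)          # Dense(units=1)
trainable_params = hidden_params + output_params

print(flattened)         # 100352
print(hidden_params)     # 1605648
print(output_params)     # 17
print(trainable_params)  # 1605665
```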

Now we can compile, and train.

In [None]:
transfer_model.compile(optimizer=RMSprop(learning_rate=0.001), loss="binary_crossentropy", metrics=["accuracy"])

In [None]:
transfer_model_history = transfer_model.fit(train_dataset, epochs=30,
                validation_data=val_dataset,
                callbacks=[EarlyStopping(monitor="val_loss", patience=4, restore_best_weights=True)],
                verbose=0)

In [None]:
plot_keras_history(transfer_model_history, "accuracy")

Now that our new top layers are well-trained, we can unfreeze all layers in the base (or just the top ones in the base) and continue training. For reasons we won't go into, it is thought best not to unfreeze `BatchNormalization` layers:

In [None]:
for layer in resnet50_base.layers:
    if isinstance(layer, BatchNormalization):
        layer.trainable = False
    else:
        layer.trainable = True

In Keras, re-compilation is needed at this point.

You probably want a lower learning rate to avoid damaging the pretrained weights.

One could even imagine different learning rates for different layers: smaller ones at the bottom of the base than at the top of the base.
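
As a rough sketch of that last idea, the hypothetical helper below assigns one learning rate per group of layers, shrinking geometrically from the top of the base towards the bottom. The number of groups and the decay factor are arbitrary choices for illustration. The sketch only computes the rates; it does not show how to apply them during training, and the cell that follows uses a single learning rate for the whole model.

```python
def layerwise_learning_rates(num_groups, top_lr, decay):
    """Return one learning rate per layer group, ordered from bottom to top."""
    return [top_lr * decay ** (num_groups - 1 - i) for i in range(num_groups)]

# Hypothetical setting: 4 groups of layers, halving the rate at each step down
rates = layerwise_learning_rates(num_groups=4, top_lr=1e-4, decay=0.5)

for group, rate in enumerate(rates):
    print(f"group {group} (0 = bottom): learning rate {rate:.2e}")
```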

In [None]:
transfer_model.compile(optimizer=RMSprop(learning_rate=0.0001), loss="binary_crossentropy", metrics=["accuracy"])

In [None]:
transfer_model_history = transfer_model.fit(train_dataset, epochs=30,
                validation_data=val_dataset,
                callbacks=[EarlyStopping(monitor="val_loss", patience=4, restore_best_weights=True)],
                verbose=0)

In [None]:
plot_keras_history(transfer_model_history, "accuracy")

In this case, continuing training after unfreezing seems to have been unhelpful - possibly because the base model was already pretty good at breeds of cats and dogs.

In [None]:
predictions = transfer_model.predict(wikipedia_dataset)

In [None]:
# predictions has shape (12, 1); flatten it so that each title is a plain string
plot_image_grid(wikipedia_dataset, titles=np.where(predictions.ravel() > 0.5, "dog", "cat"))

Finally, we can do Error Estimation to compare the models - without transfer learning and with transfer learning.
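
As a reminder of what the accuracy figure measures for a sigmoid output, here is a minimal sketch that thresholds the predicted probabilities at 0.5 and counts how often the result matches the label. The probabilities and labels are toy values invented for this illustration, not outputs of either model.

```python
def accuracy(probabilities, labels, threshold=0.5):
    # Turn each probability into a predicted class: 1 if above the threshold, else 0
    predicted = [1 if p > threshold else 0 for p in probabilities]
    # Fraction of examples whose predicted class equals the true label
    correct = sum(int(p == y) for p, y in zip(predicted, labels))
    return correct / len(labels)

# Toy values, invented for this example
toy_probabilities = [0.91, 0.20, 0.65, 0.40]
toy_labels = [1, 0, 0, 1]

print(accuracy(toy_probabilities, toy_labels))  # 0.5
```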

In [None]:
test_loss, test_acc = convnet.evaluate(test_dataset)
test_acc

In [None]:
test_loss, test_acc = transfer_model.evaluate(test_dataset)
test_acc

With transfer learning, we did much better. But keep in mind that this one was easy. ResNet50 already knows about different breeds of cats and dogs - so all we're asking it to learn is that different dog breeds are all types of dog; and similarly for cats. It won't do so well on, e.g., types of animal that were not part of the ImageNet dataset.