# Understanding Starting/Stopping/Resuming Training
## Understanding custom callbacks
This notebook is for training and understanding purposes only. All algorithms and credit go to pyimagesearch.com, specifically https://www.pyimagesearch.com/2019/09/23/keras-starting-stopping-and-resuming-training/, and to Adrian Rosebrock (a wonderful source of inspiration for Computer Vision and Deep Learning).

Because this notebook is for training and understanding purposes, the code is typed out by hand rather than downloaded, in order to build "muscle memory". Author-readable comments appear from time to time.

At times, training needs to be stopped prematurely. Some common reasons are: <br>

1. Validation loss has plateaued
2. No improvement in the results has been observed
3. Limited time on computing (GPU) resources

While Keras's built-in learning rate scheduler classes can address some of these issues, they are typically driven by the number of epochs (e.g. LR/epochs). So how would we know <br>

1. the proper initial learning rate and learning rate range?
2. at which epoch to start applying learning rate decay?
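
To make "driven by the number of epochs" concrete, the sketch below implements a linear decay schedule in plain Python. The function name `linear_decay` and its default values are illustrative assumptions, not part of Keras; the point is only that the returned rate depends on the epoch index and nothing else.

```python
def linear_decay(epoch, init_lr=1e-1, total_epochs=80):
    """Hypothetical schedule: shrink the rate linearly from init_lr to 0."""
    # fraction of training still remaining at the start of this epoch
    remaining = 1.0 - (epoch / float(total_epochs))
    return init_lr * remaining

# the rate depends only on the epoch index, never on the validation loss
for epoch in (0, 20, 40, 79):
    print(epoch, linear_decay(epoch))
```

Because such a schedule never looks at the validation loss, both the initial rate and the decay horizon have to be chosen up front, which is the difficulty raised in the two questions above.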

In any case, it is good practice to save the model's weights at regular intervals, and a custom callback can be designed for that purpose.

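In general, custom callback classes inherit from the tf.keras.callbacks.Callback class and invoke the parent constructor by calling <br>
**super(<class_name>, self).__init__()**
<br>

This way, the custom callback class inherits all of Callback's methods. (Options such as patience, the number of epochs to wait, belong to specific built-in callbacks like ReduceLROnPlateau rather than to the base class.)
<br>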
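
The class passed to super() matters. The toy classes below (all names are hypothetical and unrelated to Keras) illustrate that passing the parent class instead of the subclass skips the parent's constructor, because the method lookup starts after the class that was passed.

```python
class Base:
    def __init__(self):
        # state that the parent constructor is responsible for setting up
        self.model = None


class Correct(Base):
    def __init__(self):
        # lookup starts after Correct in the MRO, so Base.__init__ runs
        super(Correct, self).__init__()


class SkipsParent(Base):
    def __init__(self):
        # lookup starts after Base in the MRO, so only object.__init__ runs
        super(Base, self).__init__()


print(hasattr(Correct(), "model"))
print(hasattr(SkipsParent(), "model"))
```

In Python 3 the shorter `super().__init__()` is equivalent to the first form and avoids the mistake altogether.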

Inside a custom callback class, one can override the following predefined hook methods:
<br>

1. def on_{train|test|predict}_begin(self, logs=None):
2. def on_{train|test|predict}_end(self, logs=None):
3. def on_{train|test|predict}_batch_begin(self, batch, logs=None):
4. def on_{train|test|predict}_batch_end(self, batch, logs=None):
5. def on_epoch_begin(self, epoch, logs=None):
6. def on_epoch_end(self, epoch, logs=None):
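
The order in which the training hooks fire can be sketched without Keras. The code below is a toy: `RecordingCallback` and `toy_fit` are hypothetical names, and the driver only imitates the nesting of the training hooks (training, then epochs, then batches); it leaves out the validation hooks and everything else that a real fit() does.

```python
class RecordingCallback:
    """Toy stand-in for a callback: records the name of every hook call."""

    def __init__(self):
        self.calls = []

    def on_train_begin(self, logs=None):
        self.calls.append("train_begin")

    def on_epoch_begin(self, epoch, logs=None):
        self.calls.append("epoch_begin:{}".format(epoch))

    def on_train_batch_begin(self, batch, logs=None):
        self.calls.append("batch_begin:{}".format(batch))

    def on_train_batch_end(self, batch, logs=None):
        self.calls.append("batch_end:{}".format(batch))

    def on_epoch_end(self, epoch, logs=None):
        self.calls.append("epoch_end:{}".format(epoch))

    def on_train_end(self, logs=None):
        self.calls.append("train_end")


def toy_fit(callback, epochs, batches):
    """Hypothetical driver that calls the training hooks in nested order."""
    callback.on_train_begin()
    for epoch in range(epochs):
        callback.on_epoch_begin(epoch)
        for batch in range(batches):
            callback.on_train_batch_begin(batch)
            callback.on_train_batch_end(batch)
        callback.on_epoch_end(epoch)
    callback.on_train_end()


recorder = RecordingCallback()
toy_fit(recorder, epochs=1, batches=2)
print(recorder.calls)
```

A callback that only needs to act once per epoch, such as a periodic checkpoint, therefore only has to override on_epoch_end.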

In [1]:
import matplotlib
matplotlib.use("Agg")
%matplotlib inline

import tensorflow as tf
from sklearn.preprocessing import LabelBinarizer, LabelEncoder
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.datasets import fashion_mnist
from tensorflow.keras.models import load_model
import tensorflow.keras.backend as K
import numpy as np
import argparse
import cv2
import sys
import os

In [2]:
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.layers import Conv2D, AveragePooling2D, MaxPooling2D, ZeroPadding2D, Activation, Dense
from tensorflow.keras.layers import Flatten, Input, add
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2

class ResNet:
    @staticmethod
    def residual_module(data, K, stride, chanDim, red=False, reg=0.0001, bnEps=2e-5, bnMom=0.9):
        # note that K here refers to the number of kernels (filters), rather than the Keras backend
        shortcut = data
        
        # note that batch normalization in Keras differs from the one in tf.nn
        # in Keras, a small non-zero epsilon is added to the variance to avoid division by zero
        # the momentum term controls the moving average of the mean and variance
        bn1 = BatchNormalization(axis=chanDim, epsilon=bnEps, momentum=bnMom)(data)
        act1 = Activation("relu")(bn1)
        
        # there is also a l2 regularizer on the kernel
        conv1 = Conv2D(int(K*0.25), (1,1), use_bias=False, kernel_regularizer=l2(reg))(act1)
        
        bn2 = BatchNormalization(axis=chanDim, epsilon = bnEps, momentum = bnMom)(conv1)
        act2 = Activation('relu')(bn2)
        conv2 = Conv2D(int(K*0.25), (3,3), strides=stride, padding="same", use_bias=False, kernel_regularizer=l2(reg))(act2)
        
        bn3 = BatchNormalization(axis=chanDim, epsilon=bnEps, momentum=bnMom)(conv2)
        act3 = Activation("relu")(bn3)
        conv3 = Conv2D(K, (1, 1), use_bias=False,kernel_regularizer=l2(reg))(act3)
        
        if red:
            shortcut = Conv2D(K, (1, 1), strides=stride, use_bias=False, kernel_regularizer=l2(reg))(act1)
            
        x = add([conv3, shortcut])
        
        return x
    
    @staticmethod
    def build(width, height, depth, classes, stages, filters, reg=0.0001, bnEps=2e-5, bnMom=0.9, dataset="cifar"):
        inputShape = (height, width, depth)
        chanDim = -1
        
        if K.image_data_format() == "channels_first":
            inputShape = (depth, height, width)
            chanDim = 1
            
        inputs = Input(shape=inputShape)
        x = BatchNormalization(axis=chanDim, epsilon=bnEps, momentum=bnMom)(inputs)
        x = Conv2D(filters[0], (3, 3), use_bias = False, padding = "same", kernel_regularizer=l2(reg))(x)
        
        for i in range(0, len(stages)):
            stride = (1, 1) if i == 0 else (2, 2)
            x = ResNet.residual_module(x, filters[i + 1], stride, chanDim, red=True, bnEps=bnEps, bnMom=bnMom)
            
            for j in range(0, stages[i] - 1):
                x = ResNet.residual_module(x, filters[i + 1], (1, 1), chanDim, bnEps=bnEps, bnMom=bnMom)
        
        x = BatchNormalization(axis=chanDim, epsilon=bnEps, momentum=bnMom)(x)
        x = Activation("relu")(x)
        x = AveragePooling2D((8,8))(x)
        
        x = Flatten()(x)
        x = Dense(classes, kernel_regularizer=l2(reg))(x)
        x = Activation("softmax")(x)
        
        model = Model(inputs, x, name="resnet")
        
        return model

In [3]:
from tensorflow.keras.callbacks import Callback
import os

class EpochCheckpoint(Callback):
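    def __init__(self, outputPath, every=5, startAt=0):
        # call the parent (Callback) constructor
        super(EpochCheckpoint, self).__init__()
        
        # output directory, save interval (in epochs) and the epoch counter,
        # which starts at startAt so that resumed runs keep counting
        self.outputPath = outputPath
        self.every = every
        self.intEpoch = startAt
        
    def on_epoch_end(self, epoch, logs=None):
        # serialize the whole model to disk every `every` epochs
        if (self.intEpoch + 1) % self.every == 0:
            p = os.path.sep.join([self.outputPath, "epoch_{}.hdf5".format(self.intEpoch + 1)])
            self.model.save(p, overwrite=True)
            
        # increment the internal epoch counter
        self.intEpoch += 1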
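
The effect of startAt on the checkpoint file names can be checked without training anything. The helper below is a hypothetical stand-alone sketch that mirrors only the counter logic of EpochCheckpoint; it does not touch Keras or the file system.

```python
def checkpoint_epochs(every, start_at, num_epochs):
    """Return the epoch numbers at which a checkpoint would be written."""
    saved = []
    counter = start_at
    for _ in range(num_epochs):
        # same test as in on_epoch_end: save when the 1-based epoch is a multiple
        if (counter + 1) % every == 0:
            saved.append(counter + 1)
        counter += 1
    return saved


# a fresh 80-epoch run, then a 30-epoch continuation from epoch 80
print(checkpoint_epochs(every=5, start_at=0, num_epochs=80))
print(checkpoint_epochs(every=5, start_at=80, num_epochs=30))
```

With startAt=80, the 30-epoch continuation is numbered 85 through 110, so it does not overwrite the files numbered 5 through 30 from the first run.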

In [12]:
from tensorflow.keras.callbacks import BaseLogger
import matplotlib.pyplot as plt
import numpy as np
import json
import os

class TrainingMonitor(BaseLogger):
    def __init__(self, figPath, jsonPath=None, startAt=0):
        super(TrainingMonitor, self).__init__()
        self.figPath = figPath
        self.jsonPath = jsonPath
        self.startAt = startAt
        
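    def on_train_begin(self, logs=None):
        # initialize the history dictionary
        self.H = {}
        
        # if a JSON history file already exists, load it
        if self.jsonPath is not None:
            if os.path.exists(self.jsonPath):
                with open(self.jsonPath) as f:
                    self.H = json.load(f)
                
                # when resuming, drop any entries recorded past the starting epoch
                if self.startAt > 0:
                    for k in self.H.keys():
                        self.H[k] = self.H[k][:self.startAt]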
    
    def on_epoch_end(self, epoch, logs=None):
        # loop over the logs and update the loss, accuracy, etc. for the entire training process
        # logs holds the same per-epoch metrics that model.fit records in its history
        logs = logs or {}
        for (k, v) in logs.items():
            l = self.H.get(k, [])
            l.append(float(v))
            self.H[k] = l

        # check to see if the training history should be serialized to file
        if self.jsonPath is not None:
            with open(self.jsonPath, "w") as f:
                f.write(json.dumps(self.H))

        # ensure at least two epochs have passed before plotting
        # (epoch starts at zero)
        if len(self.H["loss"]) > 1:
            # plot the training loss and accuracy
            N = np.arange(0, len(self.H["loss"]))
            plt.style.use("ggplot")
            plt.figure()
            plt.plot(N, self.H["loss"], label="train_loss")
            plt.plot(N, self.H["val_loss"], label="val_loss")
            plt.plot(N, self.H["accuracy"], label="train_acc")
            plt.plot(N, self.H["val_accuracy"], label="val_acc")
            plt.title("Training Loss and Accuracy [Epoch {}]".format(len(self.H["loss"])))
            plt.xlabel("Epoch #")
            plt.ylabel("Loss/Accuracy")
            plt.legend()

            # save the figure
            plt.savefig(self.figPath)
            plt.close()
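
The history trimming done in on_train_begin can be tried in isolation. The sketch below uses only the standard library; `load_history` is a hypothetical helper, and the loss values are made-up placeholders rather than results from this notebook.

```python
import json
import os
import tempfile


def load_history(json_path, start_at=0):
    """Load a saved history dict and drop entries recorded after start_at."""
    with open(json_path) as f:
        history = json.load(f)
    if start_at > 0:
        history = {k: v[:start_at] for (k, v) in history.items()}
    return history


# write a made-up 5-epoch history to a temporary file
fake = {"loss": [0.9, 0.7, 0.6, 0.55, 0.5],
        "val_loss": [1.0, 0.8, 0.7, 0.68, 0.66]}
path = os.path.join(tempfile.mkdtemp(), "history.json")
with open(path, "w") as f:
    json.dump(fake, f)

# resuming from a checkpoint written after epoch 3 keeps only 3 entries
resumed = load_history(path, start_at=3)
print(resumed)
```

Trimming matters when training is restarted from a checkpoint that is older than the last logged epoch: without it, the plot would show the discarded epochs followed by the resumed ones.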

In [5]:
print("[INFO] loading Fashion MNIST...")
((trainX, trainY), (testX, testY)) = fashion_mnist.load_data()

trainX = np.array([cv2.resize(x, (32, 32)) for x in trainX])
testX = np.array([cv2.resize(x, (32, 32)) for x in testX])
trainX = trainX.astype("float32")/255.0
testX = testX.astype("float32")/255.0

# two ways to expand the dimensions of the last axis
trainX = tf.expand_dims(trainX, axis=-1)
testX = testX.reshape((testX.shape[0], 32, 32, 1))

[INFO] loading Fashion MNIST...


In [6]:
# a label binarizer converts integer class labels into one-hot encoded vectors
lb = LabelBinarizer()

trainY = lb.fit_transform(trainY)
testY = lb.transform(testY)

augmented = ImageDataGenerator(width_shift_range = 0.1, height_shift_range =0.1, horizontal_flip = True, fill_mode = "nearest")
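
For intuition about what the binarized labels look like, here is a plain-Python sketch of one-hot encoding for three or more classes. The helper `one_hot` is a hypothetical stand-in written for illustration; it is not how scikit-learn implements LabelBinarizer, and it returns nested lists rather than a NumPy array.

```python
def one_hot(labels):
    """Encode each label as a 0/1 vector with a single 1 at its class index."""
    # class order: sorted unique labels, smallest first
    classes = sorted(set(labels))
    index = {c: i for (i, c) in enumerate(classes)}
    return [[1 if index[y] == i else 0 for i in range(len(classes))]
            for y in labels]


# three classes (0, 1, 2), so every label becomes a length-3 vector
print(one_hot([0, 2, 1, 2]))
```

For the ten Fashion MNIST classes, each label becomes a vector of length 10, which is the target shape that the categorical cross-entropy loss expects.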

In [None]:
# ap = argparse.ArgumentParser()
# ap.add_argument("-c","--checkpoints", required=True, help="path to output checkpoint directory")
# ap.add_argument("-m","--model", type=str, help="path to specific model checkpoint to load in hdf5")
# ap.add_argument("-s","--start-epoch",type=int, default=0, help="epoch to restart training at")
# args = vars(ap.parse_args())

In [7]:
LR = 1e-1

opt = SGD(learning_rate=LR)
model = ResNet.build(32,32,1,10,(9,9,9),(64,64,128,256),reg=0.0001)
model.compile(loss="categorical_crossentropy",optimizer=opt, metrics=["accuracy"])

#print("[INFO] loading {}...".format("saved_model"))
#model = load_model()

print("[INFO] getting old learning rate: {}".format(K.get_value(model.optimizer.lr)))
K.set_value(model.optimizer.lr, 1e-2)
print("[INFO] new learning rate: {}".format(K.get_value(model.optimizer.lr)))

[INFO] getting old learning rate: 0.10000000149011612
[INFO] new learning rate: 0.009999999776482582


In [10]:
callbacks = [
    EpochCheckpoint(r"C:\Users\Innovations\Desktop\AI", every=5, startAt=0),
    TrainingMonitor(r"C:\Users\Innovations\Desktop\AI\resnet_fashion_mnist.png",
                    r"C:\Users\Innovations\Desktop\AI\resnet_fashion_mnist.json", startAt=0)]
print("[INFO] training network...")
model.fit(augmented.flow(trainX, trainY, batch_size=128), validation_data=(testX, testY), steps_per_epoch = len(trainX)//128, epochs=80, callbacks=callbacks, verbose=0)

[INFO] training network...


<tensorflow.python.keras.callbacks.History at 0x229fea3f188>
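
The steps_per_epoch argument follows from simple arithmetic. A short sketch, assuming the standard Fashion MNIST training split of 60,000 images:

```python
num_train = 60000   # size of the Fashion MNIST training split
batch_size = 128

# integer division drops the final partial batch
steps_per_epoch = num_train // batch_size
leftover = num_train - steps_per_epoch * batch_size

print(steps_per_epoch, leftover)
```

The result agrees with the "Train for 468 steps" line printed in the training logs of the later cells.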

In [13]:
print("[INFO] loading {}...".format("saved_model"))
model = load_model(r"C:/Users/Innovations/Desktop/AI/epoch_80.hdf5")
K.set_value(model.optimizer.lr, 1e-3)
callbacks = [
    EpochCheckpoint(r"C:\Users\Innovations\Desktop\AI", every=5, startAt=80),
    TrainingMonitor(r"C:\Users\Innovations\Desktop\AI\resnet_fashion_mnist.png",
                    r"C:\Users\Innovations\Desktop\AI\resnet_fashion_mnist.json", startAt=80)]
print("[INFO] training network...")
model.fit(augmented.flow(trainX, trainY, batch_size=128), validation_data=(testX, testY), steps_per_epoch = len(trainX)//128, epochs=30, callbacks=callbacks, verbose=1)

[INFO] loading saved_model...
[INFO] training network...
Train for 468 steps, validate on 10000 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<tensorflow.python.keras.callbacks.History at 0x22a48268288>

In [15]:
from tensorflow.keras.callbacks import ReduceLROnPlateau

reduce_lr = ReduceLROnPlateau(monitor="val_loss", factor = 0.5, patience=5, min_lr=0.00001)
print("[INFO] loading {}...".format("saved_model"))
model = load_model(r"C:/Users/Innovations/Desktop/AI/epoch_110.hdf5")
K.set_value(model.optimizer.lr, 1e-2)
callbacks = [
    EpochCheckpoint(r"C:\Users\Innovations\Desktop\AI", every=5, startAt=110),
    TrainingMonitor(r"C:\Users\Innovations\Desktop\AI\resnet_fashion_mnist.png",
                    r"C:\Users\Innovations\Desktop\AI\resnet_fashion_mnist.json", startAt=110),
    reduce_lr]
print("[INFO] training network...")
model.fit(augmented.flow(trainX, trainY, batch_size=128), validation_data=(testX, testY), steps_per_epoch = len(trainX)//128, epochs=20, callbacks=callbacks, verbose=1)

[INFO] loading saved_model...
[INFO] training network...
Train for 468 steps, validate on 10000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x22a8a35c6c8>
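
The rule behind ReduceLROnPlateau can be approximated in a few lines of plain Python. This is a simplified model under stated assumptions, not the Keras source: the function name is hypothetical, cooldown is ignored, improvement is defined as a decrease of more than `min_delta`, and the validation losses are made up.

```python
def simulate_reduce_on_plateau(val_losses, lr, factor=0.5, patience=5,
                               min_lr=1e-5, min_delta=1e-4):
    """Simplified plateau rule: cut lr after `patience` epochs without improvement."""
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best - min_delta:
            # improvement: remember it and reset the patience counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # plateau detected: scale the rate, but never below min_lr
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history


# made-up validation losses: two improving epochs, then a five-epoch plateau
losses = [1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]
rates = simulate_reduce_on_plateau(losses, lr=1e-2)
print(rates)
```

Unlike an epoch-driven schedule, this rule reacts to the monitored metric, so the moment at which the rate drops does not have to be fixed in advance.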