# Problem 1 - Learning Rate, Batch Size, FashionMNIST

Recall the cyclical learning rate (CLR) policy discussed in Lecture 4. The learning rate changes cyclically between lr_min and lr_max, both of which are hyperparameters that must be specified. For this problem, first read the article referenced below carefully, since you will use the Keras code there and modify it as needed. If you prefer PyTorch, open-source implementations of this policy are available that you can find and build on. You will work with the FashionMNIST dataset and LeNet-5.
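As a reminder of how the policy moves the learning rate, here is a minimal sketch of the triangular schedule from Smith's paper, written in plain Python. The function name `triangular_clr` and the tiny `step` value are illustrative choices, not part of the referenced Keras code.

```python
import math

def triangular_clr(iteration, base_lr, max_lr, step_size):
    """Learning rate at a given training iteration under the triangular policy."""
    # index of the current cycle (1-based); one full cycle spans 2 * step_size iterations
    cycle = math.floor(1 + iteration / (2 * step_size))
    # normalised distance from the peak of the current cycle, in [0, 1]
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

# illustrative values: a half cycle of 4 iterations so the shape is easy to read
lr_min, lr_max, step = 1e-5, 1e-2, 4
schedule = [triangular_clr(i, lr_min, lr_max, step) for i in range(9)]
for i, lr in enumerate(schedule):
    print(f"iteration {i}: lr = {lr:.6f}")
```

The rate starts at `lr_min`, reaches `lr_max` after `step` iterations, and returns to `lr_min` after `2 * step` iterations, at which point the next cycle begins.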

References:
1. Leslie N. Smith. Cyclical Learning Rates for Training Neural Networks. Available at https://arxiv.org/abs/1506.01186.
2. Keras implementation of the cyclical learning rate policy. Available at https://www.pyimagesearch.com/2019/08/05/keras-learning-rate-finder/.


1. Fix the batch size to 64 and start with 10 candidate learning rates between $10^{-9}$ and $10^{1}$, training your model for 5 epochs at each learning rate. Plot the training loss as a function of learning rate. You should see a curve like Figure 3 in the references above. From that plot, identify the values of lr_min and lr_max.
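One way to lay out the sweep is sketched below, assuming the 10 candidates are spaced evenly on a log scale. `lr_range_test` and `toy_loss` are hypothetical names; `toy_loss` is only a placeholder so the snippet runs without a model, and its values are not training results. In the notebook, the callable would build a fresh LeNet-5, train it for 5 epochs at the given rate, and return the final training loss.

```python
import math

# ten log-spaced candidates: exponents run from -9 to 1 in equal steps
NUM_CANDIDATES = 10
exponents = [-9 + 10 * i / (NUM_CANDIDATES - 1) for i in range(NUM_CANDIDATES)]
candidate_lrs = [10 ** e for e in exponents]

def lr_range_test(train_and_get_loss, lrs):
    """Call the supplied training routine once per learning rate and
    return (learning rate, final training loss) pairs for plotting."""
    return [(lr, train_and_get_loss(lr)) for lr in lrs]

def toy_loss(lr):
    # placeholder only: stands in for "train 5 epochs, return training loss"
    return (math.log10(lr) + 2.0) ** 2

results = lr_range_test(toy_loss, candidate_lrs)
for lr, loss in results:
    print(f"lr = {lr:.2e}   loss = {loss:.3f}")
```

Plotting the second element of each pair against the first on a logarithmic x-axis gives the loss-versus-learning-rate curve the question asks for.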

In [10]:
! conda install ipykernel --name Python3
! python -m ipykernel install
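# NOTE: `cv2` is the import name; the PyPI package is `opencv-python`,
# so the command below fails (see the error in the output)
! pip3 install cv2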


EnvironmentLocationNotFound: Not a conda environment: /Users/aragaom/opt/anaconda3/envs/Python3

Installed kernelspec python3 in /usr/local/share/jupyter/kernels/python3
ERROR: Could not find a version that satisfies the requirement cv2 (from versions: none)
ERROR: No matching distribution found for cv2
You should consider upgrading via the '/Users/aragaom/opt/anaconda3/bin/python -m pip install --upgrade pip' command.

In [42]:
# import the necessary packages
import os
# initialize the list of class label names
CLASSES = ["top", "trouser", "pullover", "dress", "coat",
	"sandal", "shirt", "sneaker", "bag", "ankle boot"]
# define the minimum learning rate, maximum learning rate, batch size,
# step size, CLR method, and number of epochs
MIN_LR = 1e-5
MAX_LR = 1e-2
BATCH_SIZE = 64
STEP_SIZE = 8
CLR_METHOD = "triangular"
NUM_EPOCHS = 50
# define the path to the output learning rate finder plot, training
# history plot and cyclical learning rate plot
LRFIND_PLOT_PATH = os.path.sep.join(["output", "lrfind_plot.png"])
TRAINING_PLOT_PATH = os.path.sep.join(["output", "training_plot.png"])
CLR_PLOT_PATH = os.path.sep.join(["output", "clr_plot.png"])

In [22]:
from tensorflow.keras import datasets, layers, models, losses

def create_model():
  model = models.Sequential()

  model.add(layers.Conv2D(6, 5, activation='tanh', input_shape=trainX.shape[1:]))
  model.add(layers.AveragePooling2D(2))


  model.add(layers.Conv2D(16, 5, activation='tanh'))
  model.add(layers.AveragePooling2D(2))

  model.add(layers.Conv2D(120, 5, activation='tanh'))

  model.add(layers.Flatten())  # flatten the 1x1x120 feature map into a vector for the dense layers
  model.add(layers.Dense(84, activation='tanh'))

  model.add(layers.Dense(10, activation='softmax'))

  return model
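The layer sizes in `create_model()` only work out for 32x32 inputs. The sketch below traces the spatial size through the unpadded convolutions and 2x2 average pooling layers, assuming the Keras defaults (`padding='valid'`, pooling stride equal to the pool size). `out_size` and `trace_sizes` are hypothetical helper names used only for this check.

```python
def out_size(size, kernel, stride=1):
    """Output side length of an unpadded ('valid') convolution or pooling layer."""
    return (size - kernel) // stride + 1

# (layer, kernel size, stride) in the order used by create_model()
LAYERS = [
    ("Conv2D(6, 5)", 5, 1),
    ("AveragePooling2D(2)", 2, 2),
    ("Conv2D(16, 5)", 5, 1),
    ("AveragePooling2D(2)", 2, 2),
    ("Conv2D(120, 5)", 5, 1),
]

def trace_sizes(input_size):
    """Return the side length after each layer, starting with the input."""
    sizes = [input_size]
    for _name, kernel, stride in LAYERS:
        sizes.append(out_size(sizes[-1], kernel, stride))
    return sizes

print(trace_sizes(32))  # [32, 28, 14, 10, 5, 1]
print(trace_sizes(28))  # [28, 24, 12, 8, 4, 0]
```

With a raw 28x28 Fashion-MNIST image the feature map entering the last convolution is 4x4, which is smaller than its 5x5 kernel, so the images have to be padded or resized to 32x32 first.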

In [12]:
import tensorflow as tf
def load_data_mnist_tf(batch_size, resize=None):   
    
    # load the dataset
    # NOTE: this loads the MNIST digits; use tf.keras.datasets.fashion_mnist for Fashion-MNIST
    mnist_train, mnist_test = tf.keras.datasets.mnist.load_data()

    # add a channel axis, scale pixels to [0, 1], and cast labels to int32
    process = lambda X, y: (tf.expand_dims(X, axis=3) / 255, tf.cast(y, dtype='int32'))
    # pixel values are normalized so that every feature is on the same scale

    # resize (with padding) to resize x resize if resize is not None
    resize_fn = lambda X, y: (tf.image.resize_with_pad(X, resize, resize) if resize else X, y)


    # load train and test batches
    train_iter = tf.data.Dataset.from_tensor_slices(process(*mnist_train)).batch(batch_size).shuffle(len(mnist_train[0])).map(resize_fn)
    test_iter = tf.data.Dataset.from_tensor_slices(process(*mnist_test)).batch(batch_size).map(resize_fn)
    
    return (train_iter, test_iter)

In [43]:
# set the matplotlib backend so figures can be saved in the background
import matplotlib
matplotlib.use("Agg")
# import the necessary packages
from learningratefinder import LearningRateFinder
from clr_callback import CyclicLR
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import classification_report
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.datasets import fashion_mnist
import matplotlib.pyplot as plt
import numpy as np
import argparse
# import cv2
import sys

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-f", "--lr-find", type=int, default=0,
	help="whether or not to find optimal learning rate")
args, unknown = ap.parse_known_args()
# load the training and testing data
print("[INFO] loading Fashion MNIST data...")
((trainX, trainY), (testX, testY)) = fashion_mnist.load_data()
# Fashion MNIST images are 28x28 but LeNet-5 expects 32x32 inputs, so
# zero-pad 2 pixels on each side and scale the pixel intensities to
# the range [0, 1]
trainX = tf.pad(trainX, [[0, 0], [2, 2], [2, 2]]) / 255
testX = tf.pad(testX, [[0, 0], [2, 2], [2, 2]]) / 255

# add a channel dimension (required for training):
# (N, 32, 32) -> (N, 32, 32, 1)
trainX = tf.expand_dims(trainX, axis=3)
testX = tf.expand_dims(testX, axis=3)

# convert the labels from integers to vectors
lb = LabelBinarizer()
trainY = lb.fit_transform(trainY)
testY = lb.transform(testY)
# construct the image generator for data augmentation
aug = ImageDataGenerator(width_shift_range=0.1,
	height_shift_range=0.1, horizontal_flip=True,
	fill_mode="nearest")

# initialize the optimizer and model
print("[INFO] compiling model...")
opt = SGD(learning_rate=MIN_LR, momentum=0.9)
model = create_model()
model.compile(loss="categorical_crossentropy", optimizer=opt,
	metrics=["accuracy"])


[INFO] loading Fashion MNIST data...
[INFO] compiling model...


In [46]:
print(args)
if args.lr_find > 0:
	# initialize the learning rate finder and then train with learning
	# rates ranging from 1e-10 to 1e+1
	print("[INFO] finding learning rate...")
	lrf = LearningRateFinder(model)
	lrf.find(
		aug.flow(trainX, trainY, batch_size=BATCH_SIZE),
		1e-10, 1e+1,
		stepsPerEpoch=np.ceil((len(trainX) / float(BATCH_SIZE))),
		batchSize=BATCH_SIZE)
	# plot the loss for the various learning rates and save the
	# resulting plot to disk
	lrf.plot_loss()
	plt.savefig(LRFIND_PLOT_PATH)
	# gracefully exit the script so we can adjust our learning rates
	# in the config and then train the network for our full set of
	# epochs
	print("[INFO] learning rate finder complete")
	print("[INFO] examine plot and adjust learning rates before training")
	sys.exit(0)

Namespace(lr_find=0)


In [45]:
stepSize = STEP_SIZE * (trainX.shape[0] // BATCH_SIZE)
clr = CyclicLR(
	mode=CLR_METHOD,
	base_lr=MIN_LR,
	max_lr=MAX_LR,
	step_size=stepSize)
# train the network
print("[INFO] training network...")
H = model.fit(
	x=aug.flow(trainX, trainY, batch_size=BATCH_SIZE),
	validation_data=(testX, testY),
	steps_per_epoch=trainX.shape[0] // BATCH_SIZE,
	epochs=NUM_EPOCHS,
	callbacks=[clr],
	verbose=1)
# evaluate the network and show a classification report
print("[INFO] evaluating network...")
predictions = model.predict(x=testX, batch_size=BATCH_SIZE)
print(classification_report(testY.argmax(axis=1),
	predictions.argmax(axis=1), target_names=CLASSES))

[INFO] training network...
Epoch 1/50
...
Epoch 50/50
[INFO] evaluating network...
              precision    recall  f1-score   support

         top       0.84      0.85      0.84      1000
     trouser       0.98      0.98      0.98      1000
    pullover       0.84      0.82      0.83      1000
       dress       0.88      0.92      0.90      1000
        coat       0.85      0.79      0.82      1000
      sandal       0.98     

In [47]:
N = np.arange(0, NUM_EPOCHS)
plt.style.use("ggplot")
plt.figure()
plt.plot(N, H.history["loss"], label="train_loss")
plt.plot(N, H.history["val_loss"], label="val_loss")
plt.plot(N, H.history["accuracy"], label="train_acc")
plt.plot(N, H.history["val_accuracy"], label="val_acc")
plt.title("Training Loss and Accuracy")
plt.xlabel("Epoch #")
plt.ylabel("Loss/Accuracy")
plt.legend(loc="lower left")
plt.savefig("training_plot.png")
# plot the learning rate history
N = np.arange(0, len(clr.history["lr"]))
plt.figure()
plt.plot(N, clr.history["lr"])
plt.title("Cyclical Learning Rate (CLR)")
plt.xlabel("Training Iterations")
plt.ylabel("Learning Rate")
plt.savefig("clr_plot.png")
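To sanity-check the learning rate plot, the cycle length implied by the configuration can be worked out by hand. The sketch below repeats the constants from the config cell and assumes the standard 60,000-image Fashion-MNIST training split; the other variable names are illustrative.

```python
NUM_TRAIN = 60000   # Fashion-MNIST training images
BATCH_SIZE = 64
STEP_SIZE = 8       # half-cycle length, in epochs
NUM_EPOCHS = 50

iters_per_epoch = NUM_TRAIN // BATCH_SIZE        # batches per epoch
step_size_iters = STEP_SIZE * iters_per_epoch    # iterations per half cycle
total_iters = NUM_EPOCHS * iters_per_epoch       # iterations in the whole run
num_cycles = total_iters / (2 * step_size_iters)

print(iters_per_epoch)   # 937
print(step_size_iters)   # 7496
print(total_iters)       # 46850
print(num_cycles)        # 3.125
```

So the learning rate should climb from `MIN_LR` to `MAX_LR` over 8 epochs, fall back over the next 8, and complete just over three full triangles in 50 epochs.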

2. Use the cyclical learning rate policy (with exponential decay) and train your network with batch size 64 and the lr_min and lr_max values obtained in part 1. Plot the train/validation loss and accuracy curves (similar to Figure 4 in the reference).
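A minimal sketch of the exponentially decaying variant, assuming the same formulation as the `exp_range` mode of the commonly used Keras `CyclicLR` callback: the triangular amplitude is multiplied by `gamma ** iteration`. The function name and the small `step` and `gamma` values are illustrative.

```python
import math

def exp_range_clr(iteration, base_lr, max_lr, step_size, gamma):
    """Triangular cycle whose amplitude shrinks by a factor gamma per iteration."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    amplitude = (max_lr - base_lr) * max(0.0, 1.0 - x)
    return base_lr + amplitude * gamma ** iteration

# illustrative values; in practice gamma is very close to 1 (e.g. 0.9999)
lr_min, lr_max, step, gamma = 1e-5, 1e-2, 4, 0.9
first_peak = exp_range_clr(step, lr_min, lr_max, step, gamma)
second_peak = exp_range_clr(3 * step, lr_min, lr_max, step, gamma)
print(first_peak, second_peak)
```

Each successive peak is lower than the previous one while the floor stays at `lr_min`. If the `clr_callback` module used in this notebook follows that callback's interface, setting `CLR_METHOD = "exp_range"` and passing a `gamma` argument to `CyclicLR` should select this behaviour, but check the module's signature first.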

3. We want to test whether increasing the batch size at a fixed learning rate has the same effect as decreasing the learning rate at a fixed batch size. Fix the learning rate to lr_max and train your network starting with batch size 32, increasing it incrementally up to 4096 (by a factor of 2 each time: 32, 64, ...). You can choose a step size (in number of epochs) at which to increment the batch size. Plot the training loss vs. log2(batch size). Is the generalization of your final model similar to or different from that obtained with the cyclical learning rate policy?
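The batch size schedule for this part can be sketched as follows. `EPOCHS_PER_STAGE` and `batch_size_for_epoch` are hypothetical names, and 2 epochs per stage is an arbitrary choice for illustration. The last loop prints the ratio lr / batch size, a commonly used heuristic for the amount of gradient noise, under the assumption that lr_max = 1e-2 as in the config cell.

```python
import math

# batch sizes double from 32 up to 4096
batch_sizes = [32 * 2 ** k for k in range(8)]
log2_sizes = [int(math.log2(b)) for b in batch_sizes]   # x-axis of the requested plot

EPOCHS_PER_STAGE = 2   # hypothetical: epochs to train before doubling the batch size

def batch_size_for_epoch(epoch):
    """Batch size to use in a given (0-based) epoch; stays at 4096 once reached."""
    stage = min(epoch // EPOCHS_PER_STAGE, len(batch_sizes) - 1)
    return batch_sizes[stage]

LR_MAX = 1e-2          # assumed value, taken from MAX_LR in the config cell
for b, x in zip(batch_sizes, log2_sizes):
    print(f"batch size {b:5d}  log2 = {x:2d}  lr / batch = {LR_MAX / b:.2e}")
```

Each doubling of the batch size halves lr / batch, which is the same change in that ratio as halving the learning rate at a fixed batch size. That is the intuition the experiment is meant to test.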