<a href="https://colab.research.google.com/github/dwgb93/SIAM-Neural-Nets/blob/main/Baby's_Second_Neural_Network_Convolution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# My First Neural Network: Convolution

Convolutional neural nets take a long time to run. Make sure you're connected to a GPU by clicking
## Runtime -> Change Runtime Type -> Hardware Accelerator -> GPU

When we're dealing with image classification, we want to identify features in an image that are clustered close together. For example, if we want to tell the difference between an 8 and a 9, we need to be able to distinguish between the pixels that make the bottom loop of an 8 and the stem of a 9.
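
To make "features that are clustered close together" concrete, here is a minimal NumPy sketch of what a single 3x3 convolution filter computes. The `conv2d_valid` helper, the toy image, and the hand-picked kernel are illustrative assumptions, not part of the notebook. Keras's `Conv2D` performs the same kind of sliding-window sum, but much faster, and it learns the kernel values during training.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with no padding and sum the elementwise products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy 5x5 "image": dark on the left, bright on the right (a vertical edge).
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)

# Hand-picked vertical-edge detector (hypothetical; a CNN learns its own kernels).
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

response = conv2d_valid(image, kernel)
print(response.shape)  # (3, 3)
print(response)        # each row is [3, 3, 0]: strong response at the edge, none in the flat region
```

Each output value depends only on a small neighborhood of pixels, which is why convolutional layers are a good fit for picking out loops and stems in digits.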

Once again, we'll start with some bookkeeping. You'll notice this looks different than before. By importing exactly what we need up front, we can eliminate some of the messy-looking code later on.

In [1]:
import numpy as np
import time
from tensorflow.keras.datasets.mnist import load_data

from tensorflow.keras import Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, Dropout, Flatten, MaxPooling2D, AveragePooling2D, BatchNormalization, GlobalMaxPooling2D, SpatialDropout2D
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, LearningRateScheduler

Next, we'll download the data we are going to use.

In [2]:
# the data is already split between train and test sets
(x_train, y_train), (x_test, y_test) = load_data() #Notice how this is much cleaner than before.

# Since we're dealing with image data, we add a channel axis so each image has shape (28, 28, 1)
# Keeping the 2D layout lets our network learn using pixels that are close together
#x_train, x_test = x_train / 255.0, x_test / 255.0 # consider uncommenting this to try with normalization
x_train = x_train.reshape(len(x_train), 28, 28, 1)
x_test = x_test.reshape(len(x_test), 28, 28, 1)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
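
The reshape in the cell above only adds a trailing channel axis; it does not change any pixel values. A small sketch with a stand-in array (an assumption, used here so the example runs without downloading MNIST) shows the shapes involved:

```python
import numpy as np

# Small stand-in batch shaped like MNIST images (the real x_train is 60000 x 28 x 28).
images = np.zeros((6, 28, 28), dtype=np.uint8)

# Same reshape as in the cell above: append a channel axis of length 1.
with_channel = images.reshape(len(images), 28, 28, 1)
print(with_channel.shape)     # (6, 28, 28, 1)
print(with_channel[0].shape)  # (28, 28, 1), the per-image shape passed to Input(...)

# np.expand_dims is an equivalent way to add the trailing axis.
same = np.array_equal(with_channel, np.expand_dims(images, axis=-1))
print(same)  # True
```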


Let's build our neural network again. This time, we'll have several convolutional layers.

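We'll start with the input layer. Each image is 28x28 pixels with a single grayscale channel, so the input shape is (28, 28, 1). Unlike a fully connected network, we don't flatten the image into 784 values first.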


In [26]:
model = Sequential([
    Input(shape=x_train[0].shape),
    Conv2D(32, kernel_size=(3, 3), activation="relu"),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(64, kernel_size=(3, 3), activation="relu"),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),
    Dropout(0.5),
    Dense(10, activation="softmax"),
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.summary()

Model: "sequential_11"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_40 (Conv2D)           (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d_12 (MaxPooling (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_41 (Conv2D)           (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_13 (MaxPooling (None, 5, 5, 64)          0         
_________________________________________________________________
flatten_11 (Flatten)         (None, 1600)              0         
_________________________________________________________________
dropout_13 (Dropout)         (None, 1600)              0         
_________________________________________________________________
dense_21 (Dense)             (None, 10)              
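
The output shapes and parameter counts in the summary can be checked by hand. The helper functions below are a hypothetical sketch (their names are not Keras APIs), assuming stride-1 convolutions and non-overlapping pooling as in the model above:

```python
def conv_output_size(size, kernel, padding="valid"):
    """Spatial size after a stride-1 convolution."""
    return size if padding == "same" else size - kernel + 1

def pool_output_size(size, pool):
    """Spatial size after non-overlapping pooling; leftover rows/columns are dropped."""
    return size // pool

def conv_params(in_channels, filters, kernel):
    """kernel*kernel*in_channels weights per filter, plus one bias per filter."""
    return (kernel * kernel * in_channels + 1) * filters

def dense_params(in_features, units):
    """One weight per input per unit, plus one bias per unit."""
    return in_features * units + units

size = 28
size = conv_output_size(size, 3)   # 26
size = pool_output_size(size, 2)   # 13
size = conv_output_size(size, 3)   # 11
size = pool_output_size(size, 2)   # 5
flat = size * size * 64            # 1600 values after Flatten

print(conv_params(1, 32, 3))    # 320
print(conv_params(32, 64, 3))   # 18496
print(flat)                     # 1600
print(dense_params(flat, 10))   # 16010
```

The first three printed values match the `conv2d_40`, `conv2d_41`, and `flatten_11` rows of the summary.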

Now, let's train the neural network!

We'll start a little differently than before. First, we'll use minibatches. This lets us compute a noisy estimate of the gradients at each training step, helping speed up training and escape local minima.
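
As a rough illustration of minibatch training, the sketch below runs minibatch gradient descent on a toy linear-regression problem. The data, the `mse_gradient` helper, and the hyperparameters are all assumptions chosen for illustration; Keras handles the equivalent loop internally when you pass `batch_size` to `fit`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data (a hypothetical stand-in for the image data).
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

def mse_gradient(w, X_batch, y_batch):
    """Gradient of the mean squared error with respect to the weights."""
    residual = X_batch @ w - y_batch
    return 2.0 * X_batch.T @ residual / len(y_batch)

w = np.zeros(3)
batch_size = 50
learning_rate = 0.05

for epoch in range(5):
    order = rng.permutation(len(X))  # reshuffle the data every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        # Each step uses only 50 samples, so the gradient is a noisy estimate.
        w -= learning_rate * mse_gradient(w, X[idx], y[idx])

print(np.round(w, 2))  # should land close to true_w
```

Each epoch here takes 20 cheap, noisy steps instead of one expensive exact step over all 1000 samples.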

Additionally, we will hold out part of the dataset as a validation set (with `validation_split=0.2`, Keras sets aside the last 20% of the training images). Much like a test set, these images are NOT used for training, so we can see how well the network performs on new data as training progresses. This lets us tune parameters to avoid overfitting.
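
Keras applies `validation_split` by holding out the last fraction of the samples before any shuffling. The sketch below mimics that behavior with a hypothetical `holdout_split` helper and index arrays standing in for the MNIST images, and works out how many minibatches one epoch contains:

```python
import math
import numpy as np

def holdout_split(x, y, validation_split):
    """Hold out the last fraction of the samples, without shuffling first."""
    n_val = int(round(len(x) * validation_split))
    split_at = len(x) - n_val
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])

# Index arrays stand in for the 60,000 MNIST training images and labels.
x = np.arange(60000)
y = np.arange(60000)

(x_fit, y_fit), (x_val, y_val) = holdout_split(x, y, validation_split=0.2)
steps_per_epoch = math.ceil(len(x_fit) / 50)  # batch_size=50, as in the training cell

print(len(x_fit), len(x_val))  # 48000 12000
print(steps_per_epoch)         # 960
```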

Finally, we'll save the best model as we go, so we keep the best model before overfitting starts.
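
The idea behind `save_best_only=True` can be sketched in a few lines. The validation losses below are made-up numbers for illustration only, and `track_best` is a hypothetical helper, not the Keras implementation:

```python
def track_best(val_losses):
    """Return the best epoch, its loss, and the epochs at which a checkpoint would be written."""
    best_epoch, best_loss = None, float("inf")
    saved_at = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:          # only overwrite the file on an improvement
            best_epoch, best_loss = epoch, loss
            saved_at.append(epoch)
    return best_epoch, best_loss, saved_at

# Hypothetical validation losses: improvement, then the start of overfitting.
val_losses = [0.080, 0.055, 0.048, 0.050, 0.046, 0.049, 0.052]

best_epoch, best_loss, saved_at = track_best(val_losses)
print(saved_at)               # [1, 2, 3, 5]
print(best_epoch, best_loss)  # 5 0.046
```

Even though training continues through epoch 7, the file on disk still holds the weights from epoch 5.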

In [27]:
model_checkpoint = ModelCheckpoint('best_MNIST_CNN_model.hdf5', monitor='val_loss', save_best_only=True, save_freq="epoch")
callbacks_list=[model_checkpoint]

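# Train on minibatches of 50 images; 20% of the training data is held out for validation
model.fit(x=x_train, y=y_train, batch_size=50, epochs=10, verbose=1, validation_split=0.2, callbacks=callbacks_list)
# Report loss and accuracy on the test set
model.evaluate(x_test, y_test, verbose=2)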

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
313/313 - 1s - loss: 0.0278 - accuracy: 0.9906


[0.02784702740609646, 0.9905999898910522]

Wow! Just over 99% accuracy, in under a minute.
How can we do better?

Let's go a little bigger, adding dense layers after our convolutional layers.


We are going to recreate one of the very first convolutional neural networks

# LeNet

http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf

In [8]:
model2 = Sequential([
    Input(shape=x_train[0].shape),
    Conv2D(6, kernel_size=(5, 5), padding="same", activation="tanh"),
    AveragePooling2D(pool_size=(2, 2)),
    Conv2D(16, kernel_size=(5, 5), activation="tanh"),
    AveragePooling2D(pool_size=(2, 2)),
    Conv2D(120, kernel_size=(5, 5), activation="tanh"),
    Flatten(),
    Dense(84, activation="tanh"),
    Dense(10, activation="softmax"),
])

model2.compile(optimizer='SGD',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model2.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_6 (Conv2D)            (None, 28, 28, 6)         156       
_________________________________________________________________
average_pooling2d (AveragePo (None, 14, 14, 6)         0         
_________________________________________________________________
conv2d_7 (Conv2D)            (None, 10, 10, 16)        2416      
_________________________________________________________________
average_pooling2d_1 (Average (None, 5, 5, 16)          0         
_________________________________________________________________
conv2d_8 (Conv2D)            (None, 1, 1, 120)         48120     
_________________________________________________________________
flatten_2 (Flatten)          (None, 120)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 84)               
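
LeNet uses average pooling where the first model used max pooling. The NumPy sketch below contrasts the two on a tiny feature map; `pool2d` is a hypothetical helper written for this example, assuming non-overlapping windows and no padding.

```python
import numpy as np

def pool2d(feature_map, pool=2, mode="max"):
    """Non-overlapping pooling; rows/columns that don't fill a window are dropped."""
    h, w = feature_map.shape
    h, w = h - h % pool, w - w % pool
    blocks = feature_map[:h, :w].reshape(h // pool, pool, w // pool, pool)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

fm = np.array([[1, 2, 0, 0],
               [3, 4, 0, 8],
               [0, 0, 5, 5],
               [0, 4, 5, 5]], dtype=float)

max_pooled = pool2d(fm, mode="max")
avg_pooled = pool2d(fm, mode="avg")
print(max_pooled)  # [[4. 8.] [4. 5.]]
print(avg_pooled)  # [[2.5 2. ] [1.  5. ]]
```

Max pooling keeps the strongest activation in each window (the lone 8 survives), while average pooling smooths it out. Either way, each spatial dimension is halved, which is why the summary goes from 28 to 14 and from 10 to 5.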

Training:

In [9]:
def lr_scheduler(epoch, lr):
    # Step schedule: return a smaller learning rate as the (0-indexed) epoch grows
    if epoch < 2:
        return 5e-4
    elif epoch < 5:
        return 2e-4
    elif epoch < 8:
        return 1e-4
    elif epoch < 12:
        return 5e-5
    else:
        return 1e-5

callbacks_list = [LearningRateScheduler(lr_scheduler, verbose=1)]

# The scheduler is defined but left disabled here; pass callbacks=callbacks_list to fit() to enable it
model2.fit(x=x_train, y=y_train, batch_size=64, epochs=20, verbose=1, validation_split=0.2)  #, callbacks=callbacks_list)
model2.evaluate(x_test, y_test, verbose=2)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
313/313 - 0s - loss: 0.0468 - accuracy: 0.9851


[0.04680196940898895, 0.9850999712944031]

That's actually slightly worse than our first network (98.5% vs. 99.1% test accuracy), but we're using 1998 technology.

Let's jump forward a few decades by using ReLU activation, max pooling, Adam, and a learning rate scheduler.
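
To illustrate the learning-rate piece, the sketch below repeats the `lr_scheduler` step schedule from the LeNet training cell and lists the rate it would produce for each of 20 epochs. Whether these exact rates suit a given model is an assumption; they are simply the values defined earlier.

```python
def lr_scheduler(epoch, lr):
    """Step schedule from the LeNet training cell; the current rate `lr` is ignored."""
    if epoch < 2:
        return 5e-4
    elif epoch < 5:
        return 2e-4
    elif epoch < 8:
        return 1e-4
    elif epoch < 12:
        return 5e-5
    else:
        return 1e-5

# Keras passes a 0-indexed epoch number to the schedule function.
schedule = [lr_scheduler(epoch, None) for epoch in range(20)]

for epoch, rate in enumerate(schedule, start=1):
    print(f"epoch {epoch:2d}: learning rate {rate:.0e}")
```

Passing `callbacks=[LearningRateScheduler(lr_scheduler, verbose=1)]` to `fit` applies this schedule during training.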

That's a little better! But we're plateauing around 99%. Squeezing those last few bits of performance out is tricky without some serious tweaks like data augmentation and ensemble networks.

Okay, now let's go for state of the art.

This is called SimpleNet: 13 convolutional layers, with Batch Normalization after every convolutional layer, and Dropout.

# SimpleNet
https://arxiv.org/pdf/1608.06037.pdf
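
Batch Normalization is used throughout this network, so a rough picture of what it does may help. The NumPy sketch below shows the training-mode computation for image-shaped activations; `batch_norm_train` is a hypothetical helper, the input data is random, and `eps=1e-3` is assumed to match the Keras default.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, moving_mean, moving_var, momentum=0.95, eps=1e-3):
    """Normalize each channel over (batch, height, width), then scale and shift."""
    mean = x.mean(axis=(0, 1, 2))
    var = x.var(axis=(0, 1, 2))
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Moving averages are what the layer uses later, at inference time.
    new_mean = momentum * moving_mean + (1.0 - momentum) * mean
    new_var = momentum * moving_var + (1.0 - momentum) * var
    return gamma * x_hat + beta, new_mean, new_var

rng = np.random.default_rng(0)
channels = 64
x = rng.normal(loc=5.0, scale=3.0, size=(100, 4, 4, channels))  # badly scaled activations

out, moving_mean, moving_var = batch_norm_train(
    x,
    gamma=np.ones(channels), beta=np.zeros(channels),
    moving_mean=np.zeros(channels), moving_var=np.ones(channels),
)

print(np.abs(out.mean(axis=(0, 1, 2))).max())  # approximately 0
print(out.std(axis=(0, 1, 2)).mean())          # approximately 1

# gamma, beta, moving mean, and moving variance: four values per channel.
params_per_layer = 4 * channels
print(params_per_layer)  # 256
```

With `momentum=0.95`, the moving statistics change slowly, taking only 5% of each new batch's mean and variance.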

In [22]:
model3 = Sequential([
    Input(shape=x_train[0].shape),
    Conv2D(64, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),

    Conv2D(128, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    Conv2D(128, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    Dropout(0.1),
    Conv2D(128, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    MaxPooling2D(pool_size=(2, 2), padding="same"),
    Dropout(0.1),

    Conv2D(128, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    Conv2D(128, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    Conv2D(256, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    MaxPooling2D(pool_size=(2, 2), padding="same"),
    Dropout(0.1),

    Conv2D(256, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    Conv2D(256, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    MaxPooling2D(pool_size=(2, 2), padding="same"),

    Conv2D(512, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    Dropout(0.1),

    Conv2D(2048, kernel_size=(1, 1), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    Dropout(0.1),

    Conv2D(256, kernel_size=(1, 1), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    MaxPooling2D(pool_size=(2, 2), padding="same"),

    Conv2D(256, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    MaxPooling2D(pool_size=(2, 2), padding="same"),

    Flatten(),
    Dense(10, activation="softmax"),
])


In [23]:
model3.compile(optimizer=Adam(learning_rate=0.01),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model3.summary()

Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_88 (Conv2D)           (None, 28, 28, 64)        640       
_________________________________________________________________
batch_normalization_88 (Batc (None, 28, 28, 64)        256       
_________________________________________________________________
conv2d_89 (Conv2D)           (None, 28, 28, 128)       73856     
_________________________________________________________________
batch_normalization_89 (Batc (None, 28, 28, 128)       512       
_________________________________________________________________
conv2d_90 (Conv2D)           (None, 28, 28, 128)       147584    
_________________________________________________________________
batch_normalization_90 (Batc (None, 28, 28, 128)       512       
_________________________________________________________________
dropout_10 (Dropout)         (None, 28, 28, 128)      

That's a lot of parameters!

As a result, this is going to take a little longer to train. Make sure you're connected to a GPU runtime!

We're also going to do a few tricks to make sure we get the best results.

First, we'll make sure that we only save the best model. That way, if we start to overfit, we can resume training from what worked best.

Next, we'll start with a high learning rate, then decrease it over time. This should help us escape some local minima, and keep learning.
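
The "decrease it over time" behavior can be sketched as a small simulation of reduce-on-plateau logic. This is a simplified stand-in for `ReduceLROnPlateau` (it ignores the cooldown option), and the validation accuracies are made-up numbers for illustration; the `factor`, `patience`, and `min_lr` defaults are taken from the training cell that follows.

```python
import math

def simulate_reduce_on_plateau(val_accuracies, initial_lr=0.01, factor=0.2,
                               patience=1, min_lr=1e-6, min_delta=1e-4):
    """Return the learning rate used in each epoch (simplified; no cooldown)."""
    lr, best, wait = initial_lr, float("-inf"), 0
    rates = []
    for acc in val_accuracies:
        rates.append(lr)                # rate in effect during this epoch
        if acc > best + min_delta:      # meaningful improvement: reset the counter
            best, wait = acc, 0
        else:
            wait += 1
            if wait >= patience:        # stalled for `patience` epochs: shrink the rate
                lr = max(lr * factor, min_lr)
                wait = 0
    return rates

# Hypothetical validation accuracies that stall at epochs 4, 6, and 8.
val_accuracies = [0.970, 0.985, 0.988, 0.987, 0.991, 0.990, 0.992, 0.992]

rates = simulate_reduce_on_plateau(val_accuracies)
for epoch, rate in enumerate(rates, start=1):
    print(f"epoch {epoch}: learning rate {rate:.6f}")
```

With `factor=0.2`, each stall cuts the rate to one fifth: 0.01, then 0.002, then 0.0004, and so on down to `min_lr`.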

In [24]:
model_checkpoint = ModelCheckpoint('best_SimpleNet.hdf5', monitor='val_loss', save_best_only=True, save_freq="epoch")
reduce_lr = ReduceLROnPlateau(monitor='val_accuracy', patience=1, verbose=1, factor=0.2, min_lr=1e-6)

callbacks_list=[model_checkpoint, reduce_lr]

model3.fit(x=x_train, y=y_train, batch_size=100, epochs=10, verbose=1, validation_split=0.2, callbacks=callbacks_list)
model3.evaluate(x_test, y_test, verbose=2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10

Epoch 00004: ReduceLROnPlateau reducing learning rate to 0.0019999999552965165.
Epoch 5/10
Epoch 6/10

Epoch 00006: ReduceLROnPlateau reducing learning rate to 0.0003999999724328518.
Epoch 7/10
Epoch 8/10

Epoch 00008: ReduceLROnPlateau reducing learning rate to 7.999999215826393e-05.
Epoch 9/10
Epoch 10/10

Epoch 00010: ReduceLROnPlateau reducing learning rate to 1.599999814061448e-05.
313/313 - 3s - loss: 0.0146 - accuracy: 0.9949
6.345659414927165 minutes


# SimpleNet Smol

This takes the same ideas as above, but uses 5x fewer parameters and a special 2D version of Dropout after every BatchNorm layer to try to avoid overfitting.

Let's see how it does!
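
The "2D version of Dropout" drops entire feature maps rather than individual activations. The NumPy sketch below contrasts the two; both helper functions are hypothetical illustrations of the training-time behavior, not the Keras implementations.

```python
import numpy as np

def dropout(x, rate, rng):
    """Regular dropout: an independent keep/drop decision for every activation."""
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

def spatial_dropout2d(x, rate, rng):
    """Spatial dropout: one keep/drop decision per (sample, channel), shared by all pixels."""
    batch, _, _, channels = x.shape
    keep = rng.random((batch, 1, 1, channels)) >= rate
    return x * keep / (1.0 - rate)

rng = np.random.default_rng(0)
x = np.ones((2, 4, 4, 8))  # (batch, height, width, channels)

regular = dropout(x, 0.2, rng)
spatial = spatial_dropout2d(x, 0.2, rng)

# In the spatial version, every 4x4 feature map is either all zeros or all 1.25.
per_map_min = spatial.min(axis=(1, 2))
per_map_max = spatial.max(axis=(1, 2))
maps_are_uniform = np.array_equal(per_map_min, per_map_max)
print(maps_are_uniform)  # True
```

Neighboring pixels in a feature map are strongly correlated, so dropping isolated activations removes little information; dropping whole maps is a stronger regularizer for convolutional layers.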

In [7]:
model4 = Sequential([
    Input(shape=x_train[0].shape),
    Conv2D(66, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    SpatialDropout2D(0.2),

    Conv2D(64, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    SpatialDropout2D(0.2),
    Conv2D(64, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    SpatialDropout2D(0.2),
    Conv2D(64, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    SpatialDropout2D(0.2),

    Conv2D(96, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    MaxPooling2D(pool_size=(2, 2), padding="same"),
    SpatialDropout2D(0.2),

    Conv2D(96, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    SpatialDropout2D(0.2),

    Conv2D(96, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    SpatialDropout2D(0.2),

    Conv2D(96, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    SpatialDropout2D(0.2),

    Conv2D(144, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    MaxPooling2D(pool_size=(2, 2), padding="same"),
    SpatialDropout2D(0.2),

    Conv2D(144, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    SpatialDropout2D(0.2),

    Conv2D(178, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    SpatialDropout2D(0.2),

    Conv2D(216, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    GlobalMaxPooling2D(),
    Dropout(0.2),

    Flatten(),
    Dense(10, activation="softmax"),
])


In [8]:
model4.compile(optimizer=Adam(learning_rate=0.01),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model4.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_24 (Conv2D)           (None, 28, 28, 66)        660       
_________________________________________________________________
batch_normalization_24 (Batc (None, 28, 28, 66)        264       
_________________________________________________________________
conv2d_25 (Conv2D)           (None, 28, 28, 64)        38080     
_________________________________________________________________
batch_normalization_25 (Batc (None, 28, 28, 64)        256       
_________________________________________________________________
conv2d_26 (Conv2D)           (None, 28, 28, 64)        36928     
_________________________________________________________________
batch_normalization_26 (Batc (None, 28, 28, 64)        256       
_________________________________________________________________
conv2d_27 (Conv2D)           (None, 28, 28, 64)       

In [9]:
model_checkpoint2 = ModelCheckpoint('best_SimpleNet_Smol.hdf5', monitor='val_loss', save_best_only=True, save_freq="epoch")
reduce_lr2 = ReduceLROnPlateau(monitor='val_accuracy', patience=1, verbose=1, factor=0.2, min_lr=1e-6)

callbacks_list=[model_checkpoint2, reduce_lr2]

model4.fit(x=x_train, y=y_train, batch_size=100, epochs=10, verbose=1, validation_split=0.2, callbacks=callbacks_list)
model4.evaluate(x_test, y_test, verbose=2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10

Epoch 00005: ReduceLROnPlateau reducing learning rate to 0.0019999999552965165.
Epoch 6/10
Epoch 7/10

Epoch 00007: ReduceLROnPlateau reducing learning rate to 0.0003999999724328518.
Epoch 8/10

Epoch 00008: ReduceLROnPlateau reducing learning rate to 7.999999215826393e-05.
Epoch 9/10

Epoch 00009: ReduceLROnPlateau reducing learning rate to 1.599999814061448e-05.
Epoch 10/10

Epoch 00010: ReduceLROnPlateau reducing learning rate to 3.199999628122896e-06.
313/313 - 2s - loss: 0.0120 - accuracy: 0.9960
3.940545924504598 minutes
