<a href="https://colab.research.google.com/github/dwgb93/SIAM-Neural-Nets/blob/main/EdgeRunner-AI/Baby's_Second_Neural_Network_Convolution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# My Second Neural Network: Convolution

Convolutional neural networks take a long time to run. Make sure you're connected to a GPU by clicking
## Runtime -> Change Runtime Type -> Hardware Accelerator -> GPU

When we're dealing with image classification, we want to identify features in an image that are clustered close together. For example, if we want to tell the difference between an 8 and a 9, we need to be able to distinguish between the pixels that make the bottom loop of an 8 and the stem of a 9.
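
To make "features that are clustered close together" concrete, here is a minimal pure-Python sketch of what a single 3x3 convolutional filter computes. The tiny 5x5 "image" and the hand-written vertical-line kernel are made up for illustration; a real `Conv2D` layer learns its kernel values during training and also adds a bias and an activation.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding) and sum the elementwise products."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(k):
                for dj in range(k):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output

# Hypothetical 5x5 image: a vertical stroke, like the stem of a 9
stem = [[0, 0, 1, 0, 0] for _ in range(5)]

# Hand-picked kernel that responds strongly to vertical lines
vertical_line = [[-1, 2, -1],
                 [-1, 2, -1],
                 [-1, 2, -1]]

feature_map = conv2d_valid(stem, vertical_line)
for row in feature_map:
    print(row)  # each row is [-3, 6, -3]
```

The response peaks (6) where the kernel is centered on the stroke and goes negative (-3) one pixel to either side, so the filter only "sees" a small neighborhood at a time.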

Once again, we'll start with some bookkeeping. You'll notice this looks different than before. By importing exactly what we need up front, we can eliminate some of the messy-looking code later on.

In [33]:
from keras.datasets.mnist import load_data

from keras import Input
from keras.optimizers import Adam
from keras.models import Sequential
from keras.layers import Dense, Conv2D, Dropout, Flatten, MaxPooling2D, AveragePooling2D, BatchNormalization, GlobalMaxPooling2D, SpatialDropout2D
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, LearningRateScheduler
from keras.optimizers.schedules import CosineDecay

Next, we'll download the data we are going to use.

In [2]:
# the data is already split between train and test sets
(x_train, y_train), (x_test, y_test) = load_data() #Notice how this is much cleaner than before.

# Optional: rescale the raw pixel values (0-255). Uncomment at most ONE of these to try it.
#x_train, x_test = x_train / 255.0, x_test / 255.0 # Normalized from 0 to 1
# OR
#x_train, x_test = x_train / 127.5 - 1, x_test / 127.5 - 1 # Normalized from -1 to 1 (good with tanh)
# OR
#x_train, x_test = (x_train - 33.3) / 78.6, (x_test - 33.3) / 78.6 # mean 0, variance 1 approximately (MNIST pixel mean ~33.3, std ~78.6)

# Since we're dealing with image data, we add a channel axis: (N, 28, 28) -> (N, 28, 28, 1)
# Keeping the 2D layout lets our network learn using pixels that are close together
x_train = x_train.reshape(len(x_train), 28, 28, 1)
x_test = x_test.reshape(len(x_test), 28, 28, 1)

Let's build our neural network again. This time, we'll have several convolutional layers.

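We'll start with the input layer. Unlike the dense network, which needed 784 input neurons for a flattened 28x28 image, a convolutional network keeps the image in its 2D layout, so the input shape is 28x28x1 (height, width, and a single grayscale channel).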


In [None]:
model = Sequential([
    Input(shape=x_train[0].shape),
    Conv2D(32, kernel_size=(3, 3), activation="relu"),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(64, kernel_size=(3, 3), activation="relu"),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),
    #Dropout(0.5),
    Dense(10, activation="softmax"),
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.summary()
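
As a rough cross-check on what `model.summary()` reports, the output sizes and parameter counts of this first network can be worked out by hand. The sketch below assumes the Keras defaults used above (stride 1 and no padding for `Conv2D`, pooling stride equal to the pool size); the helper names are made up for this example.

```python
def conv_out(size, kernel):
    # "valid" convolution with stride 1 trims kernel - 1 pixels
    return size - kernel + 1

def pool_out(size, pool):
    # non-overlapping pooling; leftover rows/columns are dropped
    return size // pool

def conv_params(kernel, in_channels, filters):
    # one kernel x kernel x in_channels weight block plus one bias per filter
    return kernel * kernel * in_channels * filters + filters

size, channels, total = 28, 1, 0

for filters in (32, 64):
    total += conv_params(3, channels, filters)
    size, channels = conv_out(size, 3), filters
    print(f"Conv2D({filters}): {size}x{size}x{channels}")
    size = pool_out(size, 2)
    print(f"MaxPooling2D: {size}x{size}x{channels}")

flat = size * size * channels
total += flat * 10 + 10  # Dense(10): weights plus biases
print("Flatten:", flat)
print("Total parameters:", total)
```

Nearly half of the parameters sit in the final `Dense` layer, even though it is a single line of the model definition.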

Now, let's train the neural network!

We'll start a little differently than before. First, we'll use minibatches. This lets us compute a noisy estimate of the gradients at each time step, helping speed up training and escape local minima.

Additionally, we will split the dataset into a training and validation set. With `validation_split=0.2`, Keras holds out the last 20% of the images (it does not pick them at random) so they are NOT used for training. Much like a test set, this lets us see how well the network is performing on new data as we go, and tune parameters to avoid overfitting.

Finally, we'll save the best model as we go, so we keep the best model before overfitting starts.

In [None]:
model_checkpoint = ModelCheckpoint('best_MNIST_CNN_model.keras', monitor='val_loss', save_best_only=True, save_freq="epoch")
callbacks_list=[model_checkpoint]

model.fit(x=x_train, y=y_train, batch_size=50, epochs=10, verbose=1, validation_split=0.2, callbacks=callbacks_list)
model.evaluate(x_test,  y_test, verbose=2)
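
For a sense of scale, the arithmetic behind that `fit` call is sketched below, assuming the standard 60,000-image MNIST training set. The second half mimics what `save_best_only=True` does with `val_loss`; the loss values are invented for illustration and are not results from this notebook.

```python
import math

n_samples = 60000        # size of the MNIST training set
validation_split = 0.2
batch_size = 50

# Keras trains on the first 80% and validates on the last 20%
n_val = round(n_samples * validation_split)
n_train = n_samples - n_val
steps_per_epoch = math.ceil(n_train / batch_size)
print(n_train, "training images,", n_val, "validation images")
print(steps_per_epoch, "gradient updates per epoch")

# Which epochs would overwrite the checkpoint file?
val_losses = [0.060, 0.045, 0.041, 0.043, 0.040, 0.046]  # invented values
best_loss, saved_epochs = float("inf"), []
for epoch, loss in enumerate(val_losses, start=1):
    if loss < best_loss:  # improvement: save the model
        best_loss = loss
        saved_epochs.append(epoch)
print("checkpoint written after epochs:", saved_epochs)
```

In this made-up run, epochs 4 and 6 are worse than the best so far, so the file on disk still holds the epoch-5 weights when training ends.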

Wow! Nearly 99% accuracy, in under a minute.
How can we do better?

Let's go a little bigger, adding dense layers after our convolutional layers.


We are going to recreate one of the very first convolutional neural networks:

# LeNet

~http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf~
https://www.iro.umontreal.ca/~lisa/pointeurs/lecun-01a.pdf

In [None]:
model2 = Sequential([
    Input(shape=x_train[0].shape),
    Conv2D(6, kernel_size=(5, 5), padding="same", activation="tanh"),
    AveragePooling2D(pool_size=(2, 2)),
    Conv2D(16, kernel_size=(5, 5), activation="tanh"),
    AveragePooling2D(pool_size=(2, 2)),
    Conv2D(120, kernel_size=(5, 5), activation="tanh"),
    Flatten(),
    Dense(84, activation="tanh"),
    Dense(10, activation="softmax"),
])

model2.compile(optimizer='SGD',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model2.summary()

Training:

In [None]:
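def lr_scheduler(epoch, lr):
    # Step schedule: returns the learning rate for the given (0-based) epoch
    if epoch < 2:
        return 5e-4
    elif epoch < 5:
        return 2e-4
    elif epoch < 8:
        return 1e-4
    elif epoch < 12:
        return 5e-5
    else:
        return 1e-5

callbacks_list=[LearningRateScheduler(lr_scheduler, verbose=1)]

# Note: the scheduler is defined but not passed to fit() below.
# Move callbacks=callbacks_list inside the call to switch it on.
model2.fit(x=x_train,y=y_train, batch_size=64, epochs=20, verbose=1, validation_split=0.2) #, callbacks=callbacks_list)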
model2.evaluate(x_test, y_test, verbose=2)
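
If the scheduler is switched on, the step function above yields one learning rate per epoch. The sketch below is a small table-driven rewrite for illustration only (the names `STEPS`, `FINAL_LR`, and `lr_for_epoch` are not from the notebook); it relies on Keras passing 0-based epoch indices to the callback.

```python
# (first epoch at which the rate no longer applies, learning rate)
STEPS = [(2, 5e-4), (5, 2e-4), (8, 1e-4), (12, 5e-5)]
FINAL_LR = 1e-5

def lr_for_epoch(epoch):
    for upper_bound, lr in STEPS:
        if epoch < upper_bound:
            return lr
    return FINAL_LR

schedule = [lr_for_epoch(epoch) for epoch in range(20)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch:2d}: lr = {lr:.0e}")
```

Over 20 epochs the rate drops by a factor of 50, with the last eight epochs all running at the smallest value.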

That's not much better, but we're using 1998 technology.

Let's jump forward a few decades, by using ReLU activation, MaxPooling, Adam, and a Learning Rate Scheduler.
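
The cell for this modernized network is not shown here, so the following is only a sketch of what such a model could look like: the LeNet layout from above with tanh swapped for ReLU, average pooling for max pooling, SGD for Adam, and a `LearningRateScheduler` callback. The names `model2_modern`, `step_schedule`, and `callbacks_modern`, the starting learning rate, and the schedule values are assumptions, not the original cell.

```python
from keras import Input
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.optimizers import Adam
from keras.callbacks import LearningRateScheduler

# Same layer sizes as the LeNet above, with ReLU and max pooling swapped in
model2_modern = Sequential([
    Input(shape=(28, 28, 1)),
    Conv2D(6, kernel_size=(5, 5), padding="same", activation="relu"),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(16, kernel_size=(5, 5), activation="relu"),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(120, kernel_size=(5, 5), activation="relu"),
    Flatten(),
    Dense(84, activation="relu"),
    Dense(10, activation="softmax"),
])

model2_modern.compile(optimizer=Adam(learning_rate=5e-4),  # assumed starting rate
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

def step_schedule(epoch, lr):
    # Assumed schedule: shrink the rate in steps as training progresses
    if epoch < 2:
        return 5e-4
    elif epoch < 5:
        return 2e-4
    elif epoch < 8:
        return 1e-4
    return 5e-5

callbacks_modern = [LearningRateScheduler(step_schedule, verbose=1)]

# Training call, assuming x_train / y_train / x_test / y_test from the cells above:
# model2_modern.fit(x=x_train, y=y_train, batch_size=64, epochs=20, verbose=1,
#                   validation_split=0.2, callbacks=callbacks_modern)
# model2_modern.evaluate(x_test, y_test, verbose=2)
```

Swapping activations and pooling types does not change the parameter count, so any difference in accuracy comes from how the network trains rather than from its size.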

That's a little better! But we're plateauing around 99%. Squeezing those last few bits of performance out is tricky, without some serious tweaks like data augmentation and ensemble networks.

Okay, now let's go for state of the art.

This is called SimpleNet: 13 convolutional layers, each followed by Batch Normalization, with Dropout after some of them.

# SimpleNet
https://arxiv.org/abs/1608.06037

In [None]:
model3 = Sequential([
    Input(shape=x_train[0].shape),
    Conv2D(64, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),

    Conv2D(128, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    Conv2D(128, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    Dropout(0.1),
    Conv2D(128, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    MaxPooling2D(pool_size=(2, 2), padding="same"),
    Dropout(0.1),

    Conv2D(128, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    Conv2D(128, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    Conv2D(256, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    MaxPooling2D(pool_size=(2, 2), padding="same"),
    Dropout(0.1),

    Conv2D(256, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    Conv2D(256, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    MaxPooling2D(pool_size=(2, 2), padding="same"),

    Conv2D(512, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    Dropout(0.1),

    Conv2D(2048, kernel_size=(1, 1), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    Dropout(0.1),

    Conv2D(256, kernel_size=(1, 1), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    MaxPooling2D(pool_size=(2, 2), padding="same"),

    Conv2D(256, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    MaxPooling2D(pool_size=(2, 2), padding="same"),

    Flatten(),
    Dense(10, activation="softmax"),
])


In [None]:
model3.compile(optimizer = Adam(learning_rate=0.01),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model3.summary()

That's a lot of parameters!
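
To see where those parameters come from, the count can be reproduced by hand from the layer list above. This sketch assumes Keras defaults: each `Conv2D` has one bias per filter, and each `BatchNormalization` layer stores four values per channel (scale, offset, moving mean, moving variance). Pooling and dropout layers add none.

```python
# (filters, kernel size) for the 13 Conv2D layers of model3, in order
conv_layers = [(64, 3), (128, 3), (128, 3), (128, 3), (128, 3), (128, 3),
               (256, 3), (256, 3), (256, 3), (512, 3), (2048, 1), (256, 1),
               (256, 3)]

in_channels = 1  # grayscale input
conv_total = bn_total = 0
for filters, k in conv_layers:
    conv_total += k * k * in_channels * filters + filters
    bn_total += 4 * filters
    in_channels = filters

# Five 2x2 poolings with padding="same": 28 -> 14 -> 7 -> 4 -> 2 -> 1
size = 28
for _ in range(5):
    size = -(-size // 2)  # ceiling division
dense_total = size * size * in_channels * 10 + 10

total = conv_total + bn_total + dense_total
print(f"conv: {conv_total:,}  batch norm: {bn_total:,}  dense: {dense_total:,}")
print(f"total: {total:,}")
```

Almost everything lives in the convolutions, and about half of the total comes from just three layers: the 512-filter layer and the two 1x1 layers on either side of the 2048-channel stage.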

As a result, this is going to take a little longer to train. Make sure you're connected to a GPU runtime!

We're also going to do a few tricks to make sure we get the best results.

First, we'll make sure that we only save the best model. That way, if we start to overfit, we can resume training from what worked best.

Next, we'll start with a high learning rate, then cut it by a factor of 5 whenever the validation accuracy stops improving. This should help us escape some local minima early on, and keep learning later.

In [None]:
model_checkpoint = ModelCheckpoint('best_SimpleNet.keras', monitor='val_loss', save_best_only=True, save_freq="epoch")
reduce_lr = ReduceLROnPlateau(monitor='val_accuracy', patience=1, verbose=1, factor=0.2, min_lr=1e-6)

callbacks_list=[model_checkpoint, reduce_lr]

model3.fit(x=x_train, y=y_train, batch_size=100, epochs=10, verbose=1, validation_split=0.2, callbacks=callbacks_list)
model3.evaluate(x_test,  y_test, verbose=2)
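
The learning-rate rule used above can be sketched in a few lines of plain Python. This is a simplified illustration of the idea behind `ReduceLROnPlateau`, not the Keras implementation (it ignores the cooldown option), and the validation accuracies are invented. It assumes the Keras default `min_delta` of 1e-4 as the smallest change that counts as an improvement.

```python
def reduce_on_plateau(val_accuracies, lr, factor=0.2, patience=1,
                      min_lr=1e-6, min_delta=1e-4):
    """Return the learning rate in effect after each epoch."""
    best = float("-inf")
    wait = 0
    history = []
    for acc in val_accuracies:
        if acc > best + min_delta:  # counts as an improvement
            best = acc
            wait = 0
        else:
            wait += 1
            if wait >= patience:    # stalled for `patience` epochs: shrink lr
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

val_acc = [0.980, 0.985, 0.984, 0.986, 0.9855, 0.9858]  # invented values
lrs = reduce_on_plateau(val_acc, lr=0.01)
for epoch, (acc, lr) in enumerate(zip(val_acc, lrs), start=1):
    print(f"epoch {epoch}: val_accuracy = {acc:.4f}, lr = {lr:.5f}")
```

With `patience=1`, a single epoch without improvement is enough to trigger a cut, so the rate can shrink quickly once the validation accuracy starts to wobble.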

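# SimpleNet Smol

This takes the same ideas as above, but uses roughly 4x fewer parameters (about 1.3 million instead of about 5.5 million), and adds SpatialDropout2D, a special 2D version of Dropout that drops whole feature maps, after every BatchNorm layer to really try to avoid overfitting.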

Let's see how it does!

In [27]:
model4 = Sequential([
    Input(shape=x_train[0].shape),
    Conv2D(66, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    SpatialDropout2D(0.2),

    Conv2D(64, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    SpatialDropout2D(0.2),
    Conv2D(64, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    SpatialDropout2D(0.2),
    Conv2D(64, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    SpatialDropout2D(0.2),

    Conv2D(96, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    MaxPooling2D(pool_size=(2, 2), padding="same"),
    SpatialDropout2D(0.2),

    Conv2D(96, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    SpatialDropout2D(0.2),

    Conv2D(96, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    SpatialDropout2D(0.2),

    Conv2D(96, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    SpatialDropout2D(0.2),

    Conv2D(144, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    MaxPooling2D(pool_size=(2, 2), padding="same"),
    SpatialDropout2D(0.2),

    Conv2D(144, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    SpatialDropout2D(0.2),

    Conv2D(178, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    SpatialDropout2D(0.2),

    Conv2D(216, kernel_size=(3, 3), padding="same", activation="relu"),
    BatchNormalization(momentum=0.95),
    GlobalMaxPooling2D(),
    Dropout(0.2),

    Flatten(),
    Dense(10, activation="softmax"),
])
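
The difference between ordinary `Dropout` and `SpatialDropout2D` is easiest to see on a tiny example. The sketch below is a pure-Python illustration, not the Keras implementation: it drops entire channels (feature maps) of a made-up 2x2x4 activation and rescales the survivors by 1 / (1 - rate), as dropout layers do during training.

```python
import random

def spatial_dropout(feature_maps, rate, rng):
    """feature_maps[h][w][c]: zero whole channels with probability `rate`."""
    n_channels = len(feature_maps[0][0])
    # One keep/drop decision per channel, shared by every pixel
    keep = [rng.random() >= rate for _ in range(n_channels)]
    scale = 1.0 / (1.0 - rate)
    return [[[value * scale if keep[c] else 0.0
              for c, value in enumerate(pixel)]
             for pixel in row]
            for row in feature_maps]

rng = random.Random(0)  # fixed seed so the run is repeatable
activations = [[[1.0, 2.0, 3.0, 4.0] for _ in range(2)] for _ in range(2)]
dropped = spatial_dropout(activations, rate=0.2, rng=rng)
for row in dropped:
    print(row)
```

Ordinary dropout would instead make a separate decision for every single value. Because neighboring pixels in a feature map are strongly correlated, dropping the whole map is a stronger form of regularization for convolutional layers.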


Cosine Decay: https://keras.io/api/optimizers/learning_rate_schedules/cosine_decay/

In [28]:
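# https://www.youtube.com/watch?v=s2NkEYVp_44
# Template: fill in the four values, then uncomment.
# Note that the keyword is initial_learning_rate (CosineDecay has no initial_lr argument).
#cosine_decay_scheduler = CosineDecay(
#    initial_learning_rate=, decay_steps=, warmup_target=, warmup_steps=
#    )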
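
The blanks above are left for you to fill in. As one possible starting point, the sketch below re-implements the schedule in plain Python, following our reading of the Keras documentation linked above: a linear warmup from `initial_learning_rate` to `warmup_target` over `warmup_steps`, then a cosine decay over `decay_steps`. All numbers are assumptions: 48,000 training images (after the 20% validation split) at batch size 100 gives 480 steps per epoch, with one warmup epoch and nine decay epochs.

```python
import math

steps_per_epoch = 48000 // 100       # assumed: 80% of MNIST, batch_size=100
warmup_steps = 1 * steps_per_epoch   # assumed: one epoch of warmup
decay_steps = 9 * steps_per_epoch    # assumed: nine epochs of decay

def cosine_decay_with_warmup(step, initial_learning_rate=0.0,
                             warmup_target=0.01, alpha=0.0):
    if step < warmup_steps:
        # linear ramp from initial_learning_rate up to warmup_target
        fraction = step / warmup_steps
        return initial_learning_rate + (warmup_target - initial_learning_rate) * fraction
    # cosine curve from warmup_target down to alpha * warmup_target
    progress = min(step - warmup_steps, decay_steps) / decay_steps
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return warmup_target * ((1 - alpha) * cosine + alpha)

for epoch in range(11):
    step = epoch * steps_per_epoch
    print(f"epoch {epoch:2d}: lr = {cosine_decay_with_warmup(step):.5f}")

# The matching Keras object would be built with the same (assumed) values:
# cosine_decay_scheduler = CosineDecay(
#     initial_learning_rate=0.0, decay_steps=decay_steps,
#     warmup_target=0.01, warmup_steps=warmup_steps)
# and handed to the optimizer as Adam(learning_rate=cosine_decay_scheduler).
```

A schedule object takes the place of the fixed learning rate in the optimizer, so it is best treated as an alternative to `ReduceLROnPlateau` rather than something to stack on top of it.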

In [None]:
model4.compile(optimizer=Adam(learning_rate=0.01),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model4.summary()

In [None]:
model_checkpoint2 = ModelCheckpoint('best_SimpleNet_Smol.keras', monitor='val_loss', save_best_only=True, save_freq="epoch")
reduce_lr2 = ReduceLROnPlateau(monitor='val_accuracy', patience=1, verbose=1, factor=0.2, min_lr=1e-6)

callbacks_list=[model_checkpoint2, reduce_lr2]

model4.fit(x=x_train, y=y_train, batch_size=100, epochs=10, verbose=1, validation_split=0.2, callbacks=callbacks_list)
model4.evaluate(x_test,  y_test, verbose=2)

In [38]:
model4.evaluate(x_test,  y_test, verbose=2)

313/313 - 2s - 6ms/step - accuracy: 0.9971 - loss: 0.0095


[0.009528209455311298, 0.9970999956130981]

# Challenge

Try to get a test-set accuracy higher than 0.9971.

Rules (Hard Mode):
- Must be a single neural network
- Network must be trained only on the training data
- No augmentations
- No ensembles

Ranks:
- \> 99.5% - Bronze
- \> 99.6% - Silver
- \> 99.7% - Gold
- \> 99.75% - World record!

Unlimited Class (augmentations and ensembles allowed):
- \> 99.8% - Platinum
- \> 99.9% - Diamond
- \> 99.91% - World record? (unclear - there are 100%s on Kaggle)
- \> 99.96% - Unpossible (there are errors in the dataset)

Winner(s) get(s) a something. Maybe.

### Helpful Resources
- Optimizers: https://keras.io/api/optimizers/ (`!pip install -U keras` then try to get Muon working)
- Activations: https://keras.io/api/layers/activations/
- Layers: https://keras.io/api/layers/