Answers
-------
1. Advantages of a CNN over a fully connected DNN for image classification:
  - Fewer parameters. The conv layers are not fully connected; each neuron is connected only to a small region of the
    input, and the same filter weights are reused as the filter slides over it. This means:
    * Fewer computational resources
    * Less prone to overfitting
  - Translation invariance
    * Because the filter weights are shared, a pattern learned in one part of the image can be detected in any other
      part of the image.
  - The input can remain 2D. There is no need to flatten it.
    * This preserves the spatial structure of the image (neighboring pixels stay neighbors).
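
To illustrate the weight-sharing point, here is a minimal NumPy sketch. The helper `correlate2d_valid` and the "plus" pattern are hypothetical names chosen for illustration; the helper is a naive sliding-window implementation, not how a real framework computes convolutions. The same filter is applied to two images that contain the pattern at different locations.

```python
import numpy as np


def correlate2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


# A small "plus" pattern; the filter below is tuned to detect it.
pattern = np.array([[0, 1, 0],
                    [1, 1, 1],
                    [0, 1, 0]], dtype=float)

# Place the same pattern at two different locations.
image_a = np.zeros((8, 8))
image_a[1:4, 1:4] = pattern
image_b = np.zeros((8, 8))
image_b[4:7, 3:6] = pattern

response_a = correlate2d_valid(image_a, pattern)
response_b = correlate2d_valid(image_b, pattern)

# Locate the strongest response in each output map.
peak_a = np.unravel_index(response_a.argmax(), response_a.shape)
peak_b = np.unravel_index(response_b.argmax(), response_b.shape)
print(peak_a, response_a.max())
print(peak_b, response_b.max())
```

The peak response follows the pattern to its new location with the same strength, even though the filter has only 9 weights in total.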

2.  Assume an RGB image of size 200x300 is fed into a CNN with 3 conv layers, each with 3x3 filters, stride=2 and "same"
  padding.
  - First conv layer: outputs 100 feature maps
  - Second conv layer: outputs 200 feature maps
  - Third conv layer: outputs 400 feature maps

  a. Number of parameters in this CNN:
  - First layer:
    * 100 filters x 3 x 3 x 3 channels (RGB) + 100 bias terms = 2,800 parameters
  - Second layer:
    * 200 filters x 3 x 3 x 100 channels (prev layer) + 200 bias terms = 180,200 parameters
  - Third layer:
    * 400 filters x 3 x 3 x 200 channels + 400 bias terms = 720,400 parameters
  - Total:
    * 2,800 + 180,200 + 720,400 = 903,400 parameters
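
The count above can be double-checked with a short sketch. The helper name `conv2d_params` is hypothetical; it simply encodes "weights per filter plus one bias per filter".

```python
def conv2d_params(n_filters, kernel_size, in_channels):
    """Each filter has kernel_size x kernel_size x in_channels weights plus one bias."""
    return n_filters * (kernel_size * kernel_size * in_channels + 1)


in_channels = 3  # RGB input
layer_filters = [100, 200, 400]

per_layer = []
for n_filters in layer_filters:
    per_layer.append(conv2d_params(n_filters, kernel_size=3, in_channels=in_channels))
    in_channels = n_filters  # this layer's feature maps are the next layer's input channels

total = sum(per_layer)
print(per_layer, total)
```

Note that the image size never enters the calculation: the number of parameters of a conv layer depends only on the filter size and the channel counts.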

  b. Assume we're using 32-bit floats. At least how much RAM will this CNN require to predict a single image?
    - Since we're only doing prediction, we can release a conv layer's memory once it's done, so we'll compute the RAM
      needed per layer and take the max amount as the answer. Also, since the stride is 2 and padding is "same",
      we know that the height and width of the feature maps are halved (rounded up) at every layer.
    - First layer:
      * 2,800 parameters + (200 / 2 * 300 / 2) * 100 feature maps + 200 * 300 * 3 input image = 1,682,800 values
      * Since each value is 4 bytes: 4 * 1,682,800 = 6,731,200 bytes ~ 6.4 MB
    - Second layer:
      * 180,200 + (100 / 2 * 150 / 2) * 200 + (200 / 2 * 300 / 2) * 100 = 2,430,200
      * 2,430,200 * 4 = 9,720,800 bytes ~ 9.3 MB
    - Third layer:
      * 75 / 2 = 37.5 is rounded up to 38 with "same" padding, so the output feature maps are 25 x 38
      * 720,400 + (25 * 38) * 400 + (100 / 2 * 150 / 2) * 200 = 1,850,400
      * 1,850,400 * 4 = 7,401,600 bytes ~ 7.1 MB
    - The second layer is the peak, so we need at least ~9.3 MB.
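
As a sanity check, the per-layer accounting above (parameters + input + output of each layer, 4 bytes per value) can be sketched as follows. Two assumptions are made here: the feature map size is rounded up when it is halved, which is how "same" padding behaves with a stride of 2, and "MB" is taken to mean 1024 * 1024 bytes.

```python
import math

BYTES_PER_FLOAT = 4  # 32-bit floats
MIB = 1024 ** 2

height, width, channels = 200, 300, 3
layers = [(100, 2_800), (200, 180_200), (400, 720_400)]  # (feature maps, parameters)

per_layer_bytes = []
for n_maps, n_params in layers:
    in_values = height * width * channels
    # "same" padding with stride 2: the output size is the input size / 2, rounded up.
    height, width = math.ceil(height / 2), math.ceil(width / 2)
    out_values = height * width * n_maps
    per_layer_bytes.append((n_params + in_values + out_values) * BYTES_PER_FLOAT)
    channels = n_maps

for i, n_bytes in enumerate(per_layer_bytes, start=1):
    print(f"layer {i}: {n_bytes:,} bytes = {n_bytes / MIB:.2f} MiB")
print(f"peak: {max(per_layer_bytes) / MIB:.2f} MiB")
```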

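  c. How much RAM will we need for training a batch of 50 such images?
    - During training, the outputs of every layer must be kept for the backward pass, so nothing can be released.
      The parameters are shared by all images in the batch, so they are counted only once.
    - Values per image (input + all feature maps):
      200 * 300 * 3 + 100 * 150 * 100 + 50 * 75 * 200 + 25 * 38 * 400 = 2,810,000
    - For the whole batch, plus the parameters:
      (50 * 2,810,000 + 903,400) * 4 bytes = 565,613,600 bytes ~ 539 MB
    - This is a lower bound: the gradients and the optimizer state need additional memory.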
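
A rough lower bound for training memory can be sketched as a function of the batch size. This sketch assumes that the parameters are stored once and shared by all images in the batch, that the input and every feature map are each counted once per image, and that gradients and optimizer state are ignored. The function name `training_bytes` is hypothetical.

```python
import math

BYTES_PER_FLOAT = 4  # 32-bit floats
MIB = 1024 ** 2


def training_bytes(batch_size, height=200, width=300, channels=3,
                   layer_filters=(100, 200, 400), n_params=903_400):
    """Lower bound: input and all feature maps per image, parameters counted once."""
    values_per_image = height * width * channels  # the input image itself
    for n_maps in layer_filters:
        # Stride 2 with "same" padding halves each dimension, rounding up.
        height, width = math.ceil(height / 2), math.ceil(width / 2)
        values_per_image += height * width * n_maps
    return (batch_size * values_per_image + n_params) * BYTES_PER_FLOAT


print(f"{training_bytes(50) / MIB:.1f} MiB")
```

Because the activations dominate, the estimate grows almost linearly with the batch size, which is why reducing the batch size is the first thing to try when memory runs out.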
  
3. If the GPU memory runs out during training:
  - Reduce the batch size
  - Use a larger stride in one or more layers
  - Remove one or more layers
  - Use 16-bit floats
  - Distribute the training across multiple devices

4. Max pooling helps by keeping only the strongest activation in each pooling window. By shrinking the feature maps, it
  reduces computation, memory usage and the number of parameters in the layers that follow, which helps against
  overfitting. A max pooling layer has no parameters of its own, so it's cheaper than a convolutional layer with the
  same stride.
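
A minimal NumPy sketch of 2x2 max pooling with a stride of 2 shows both effects: only the largest value in each window survives, and the feature map shrinks by a factor of 4. The helper `max_pool_2x2` is a hypothetical name and assumes a single 2D feature map with even height and width.

```python
import numpy as np


def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on an (H, W) array with even H and W."""
    h, w = feature_map.shape
    # Split the map into non-overlapping 2x2 blocks, then take the max of each block.
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))


feature_map = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 1],
                        [0, 0, 5, 6],
                        [1, 2, 7, 8]])

pooled = max_pool_2x2(feature_map)
print(pooled)
```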

5. Local response normalization layer 

6. 

7. A fully convolutional network (FCN) is a type of CNN with no dense layer at its top. Instead it uses convolutional
  layers. Since conv layers can process inputs of any size, this makes the network tolerant to inputs of different
  sizes. To convert a dense layer into a convolutional layer, we use as many filters as the dense layer has units, a
  filter size that's the same as the input feature map size, and a stride of 1 with padding set to "valid".
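
The dense-to-conv conversion can be checked numerically with a small NumPy sketch. The sizes (a 7x7 input with 16 feature maps and a 10-unit dense layer) are arbitrary assumptions for illustration. The "convolution" is written as a single `einsum` because a filter as large as the input fits in exactly one position under "valid" padding.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical sizes: a 7x7 input with 16 feature maps feeding a 10-unit dense layer.
height, width, channels, n_units = 7, 7, 16, 10
x = rng.normal(size=(height, width, channels))

# Dense layer: flatten the input, then multiply by an (inputs, units) weight matrix.
dense_w = rng.normal(size=(height * width * channels, n_units))
dense_b = rng.normal(size=n_units)
dense_out = x.reshape(-1) @ dense_w + dense_b

# Equivalent conv layer: n_units filters, each as large as the whole input.
# The same weights are simply reshaped to (height, width, channels, filters).
conv_kernel = dense_w.reshape(height, width, channels, n_units)
conv_out = np.einsum("hwc,hwcu->u", x, conv_kernel) + dense_b

print(np.allclose(dense_out, conv_out))
```

On a larger input the same filters would slide over more positions and produce a grid of outputs instead of a single 1x1 output, which is what lets an FCN handle different image sizes.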

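8. The main difficulty with semantic segmentation is that a CNN gradually loses spatial information as the signal passes
  through layers with a stride greater than 1 (including pooling layers): it learns to identify objects in an image
  regardless of their exact location. Semantic segmentation (SS) requires knowledge of the object's location in order to
  classify each pixel in the image, so the lost spatial resolution has to be recovered.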


In [32]:
import pandas as pd
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds


from functools import partial

from sklearn.datasets import fetch_openml




## Exercise 9 - CNN on MNIST

In [30]:
# Exercise 9 - CNN on MNIST

mnist = fetch_openml("mnist_784", as_frame=False)
X, y = mnist.data, mnist.target
X_train, X_test, y_train, y_test = X[:60000] / 255, X[60000:] / 255, y[:60000], y[60000:]
# fetch_openml returns the labels as strings; convert them to integer class ids
y_train, y_test = y_train.astype(np.uint8), y_test.astype(np.uint8)




In [25]:
DefaultConv2D = partial(tf.keras.layers.Conv2D, kernel_size=3, padding="same", 
                        activation="relu", kernel_initializer="he_normal")

model = tf.keras.Sequential([
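  # The data is flat (784 values per image): declare that shape, then reshape to 28x28x1
  tf.keras.layers.Input(shape=[784]),
  tf.keras.layers.Reshape(target_shape=[28, 28, 1]),

  DefaultConv2D(filters=64, kernel_size=7),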
  tf.keras.layers.MaxPool2D(),
  DefaultConv2D(filters=128),
  DefaultConv2D(filters=128),
  DefaultConv2D(filters=256),
  tf.keras.layers.MaxPool2D(),
  DefaultConv2D(filters=256),
  DefaultConv2D(filters=256),
  tf.keras.layers.MaxPool2D(),

  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(units=128, activation="relu", kernel_initializer="he_normal"),
  tf.keras.layers.Dropout(0.5),
  
  tf.keras.layers.Dense(units=64, activation="relu", kernel_initializer="he_normal"),
  tf.keras.layers.Dropout(0.5),

  tf.keras.layers.Dense(units=10, activation="softmax"),
])

model.compile(loss="sparse_categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])
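
To see what the `Flatten` layer receives, we can trace the spatial size through the stack by hand. This is a plain-Python sketch, not a Keras call: it assumes that the "same" convolutions keep the spatial size and that `MaxPool2D()` uses its default 2x2 window with a stride of 2 and "valid" padding. The function name `trace_shapes` is hypothetical.

```python
def trace_shapes(size=28, channels=1):
    """Follow the feature map shape through the conv/pool stack of the model above."""
    stack = [("conv", 64), ("pool", None),
             ("conv", 128), ("conv", 128), ("conv", 256), ("pool", None),
             ("conv", 256), ("conv", 256), ("pool", None)]
    shapes = []
    for kind, filters in stack:
        if kind == "conv":
            channels = filters  # "same" padding, stride 1: only the depth changes
        else:
            size //= 2          # 2x2 pooling, "valid" padding: floor division
        shapes.append((size, size, channels))
    return shapes


shapes = trace_shapes()
final_h, final_w, final_c = shapes[-1]
flattened = final_h * final_w * final_c
print(shapes[-1], flattened)
```

If the trace is right, `model.summary()` should report the same shapes once the model has been built.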

In [None]:
# !!! Run this on Kaggle using a T4x2 GPU. Too slow for this machine...

early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
# The callback has no effect unless it is passed to fit()
history = model.fit(X_train, y_train, epochs=30, validation_split=0.1,
                    callbacks=[early_stopping_cb])

## Exercise 10 - transfer learning for large image classification

In [None]:
dataset, info = tfds.load("caltech_birds2011", 
                          split=["train[:80%]", "train[80%:]", "test"],
                          as_supervised=True, 
                          with_info=True)

dataset_size = info.splits["train"].num_examples
class_names = info.features["label"].names
n_classes = info.features["label"].num_classes

train_set_raw, valid_set_raw, test_set_raw = dataset

In [None]:
# Preprocess the images

batch_size = 32

# Resizing and using the Xception built-in preprocessing as a single Keras preprocessing model
preprocess = tf.keras.Sequential([
  tf.keras.layers.Resizing(height=224, width=224, crop_to_aspect_ratio=True),
  tf.keras.layers.Lambda(tf.keras.applications.xception.preprocess_input)
])
train_set = train_set_raw.map(lambda X,y: (preprocess(X), y))

# Shuffle and batch the training set
train_set = train_set.shuffle(1000, seed=42).batch(batch_size).prefetch(1)

# Preprocess for validation and test sets
valid_set = valid_set_raw.map(lambda X,y: (preprocess(X), y))
valid_set = valid_set.batch(batch_size)

test_set = test_set_raw.map(lambda X,y: (preprocess(X), y))
test_set = test_set.batch(batch_size)
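
As far as we understand the Keras documentation, `xception.preprocess_input` scales pixel values from [0, 255] to [-1, 1]. The NumPy helper below is a hypothetical stand-in that mimics that scaling for illustration only; it is not the Keras implementation.

```python
import numpy as np


def scale_to_minus_one_one(images):
    """Map pixel values from [0, 255] to [-1, 1] (assumed Xception-style scaling)."""
    return images.astype(np.float32) / 127.5 - 1.0


pixels = np.array([0, 127.5, 255])
scaled = scale_to_minus_one_one(pixels)
print(scaled)
```

This is why the raw images must not be divided by 255 before they go through `preprocess`: the pretrained weights expect inputs in the range produced by their own preprocessing function.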

In [None]:
# Loading the Xception model

# We set include_top=False so that it excludes the global avg pooling and dense output layer.
# We'll add our own output softmax layer for the bird species labels
base_model = tf.keras.applications.xception.Xception(weights="imagenet", include_top=False)

# Adding our own "top" layers
dense1 = tf.keras.layers.Dense(800, activation="relu", kernel_regularizer=tf.keras.regularizers.l2(0.01))(base_model.output)
#dense2 = tf.keras.layers.Dense(400, activation="relu", kernel_regularizer=tf.keras.regularizers.l2(0.01))(dense1)
#dense3 = tf.keras.layers.Dense(400, activation="relu", kernel_regularizer=tf.keras.regularizers.l2(0.01))(dense2)
avg = tf.keras.layers.GlobalAveragePooling2D()(dense1)
output = tf.keras.layers.Dense(n_classes, activation="softmax")(avg)
model = tf.keras.Model(inputs=base_model.input, outputs=output)

# Freezing the weights of the pretrained layers so that we don't corrupt them during training
for layer in base_model.layers:
  layer.trainable = False

optimizer = tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9)
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])

#model.summary()

In [None]:
# We start by doing 3 epochs on the new top with everything below it frozen
history = model.fit(train_set, validation_data=valid_set, epochs=3)

In [None]:
# Fitting more layers - USE GPU! very slow

# Now that we have calibrated the top, we can unfreeze more layers below it for training. Training the top first
# ensures that the large gradients from its randomly initialized weights don't corrupt the pretrained layer weights

for layer in base_model.layers[120:]:
  layer.trainable = True

# We need to re-compile after changing the trainable attribute
# Notice that we also decreased the learning rate so that we don't corrupt the unfrozen, pretrained layers
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])

# Training for longer
history = model.fit(train_set, validation_data=valid_set, epochs=10)