<a href="https://colab.research.google.com/github/chadmh/Short-Hands-on-Tutorial-for-Deep-Learning-in-Tensorflow/blob/master/4_Pretrained_Networks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 4.1 Adapting Pre-Trained CNNs to Image Processing Tasks

While it might be satisfying to create a CNN from scratch, in practice one will often get better performance by customizing an industry-standard model.  One such network is ResNet50, a 50-layer CNN [published in 2015 by Kaiming He et al.](https://arxiv.org/abs/1512.03385) The code used to load and preprocess the MNIST data is similar to earlier notebooks; however, because ResNet50 expects a 3-channel (color) image of at least 32 x 32 pixels, the preprocessing code must convert each image from 28 x 28 x 1 to 32 x 32 x 3.  This can be done by zero-padding the original images and replicating them across 3 channels.

In [None]:
# Import the needed tensorflow components
import tensorflow as tf
import tensorflow_datasets as tfds

# Load the MNIST dataset.  tfds.load checks whether the dataset is available locally and, if it cannot be
# found, downloads it from its official repository at http://yann.lecun.com/exdb/mnist.
(train, test), info = tfds.load('mnist',                  # Pick the MNIST dataset
                                 split=['train', 'test'], # Load both the training and testing parts of the dataset
                                 with_info=True,          # Generate summary information about the dataset
                                 as_supervised=True)      # return both the inputs and labels as a tuple

print(info.description)
print(info.splits)

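# Define the data preprocessing pipeline.  For MNIST, this converts the images from uint8 to float and then
# pads and replicates them to match the input shape ResNet50 expects.  Other datasets likely need more
# extensive preprocessing.
# Note: the Keras ImageNet weights were trained on inputs prepared with
# tf.keras.applications.resnet50.preprocess_input; this notebook uses simpler [0, 1] scaling instead.
def preprocess_data(image, label):
  # Convert uint8 to float on [0, 1]
  image = tf.cast(image, tf.float32) / 255.0

  # Apply zero padding to make the image 32 x 32; center the original image
  image = tf.image.pad_to_bounding_box(image, 2, 2, 32, 32)

  # Make the 1-channel image 3 channels by repeating it along the channel axis
  image = tf.tile(image, tf.constant([1, 1, 3], tf.int32))

  return image, label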

# Assign the preprocessing pipeline to each dataset: train and test
train = train.map(preprocess_data)
test = test.map(preprocess_data)

# Tell each dataset how many images it will load at once for processing
BATCH_SIZE=128
train = train.batch(BATCH_SIZE)
test = test.batch(BATCH_SIZE)

The MNIST database of handwritten digits.
{'test': <tfds.core.SplitInfo num_examples=10000>, 'train': <tfds.core.SplitInfo num_examples=60000>}
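
The offsets `2, 2` passed to `tf.image.pad_to_bounding_box` in the cell above come from centering a 28-pixel image inside a 32-pixel frame. The following is a minimal sketch of that arithmetic in plain Python; `centered_padding_offsets` is a hypothetical helper introduced only for illustration and is not part of TensorFlow.

```python
def centered_padding_offsets(src_size, dst_size):
    """Return (before, after) padding that centers src_size inside dst_size."""
    total = dst_size - src_size
    if total < 0:
        raise ValueError("target size must be at least the source size")
    before = total // 2
    return before, total - before


# MNIST images are 28 x 28; the target size used in this notebook is 32 x 32.
top, bottom = centered_padding_offsets(28, 32)
left, right = centered_padding_offsets(28, 32)
print(top, left)  # 2 2

# Shape after padding and replicating the single channel three times.
padded_shape = (top + 28 + bottom, left + 28 + right, 3)
print(padded_shape)  # (32, 32, 3)
```

If the target size were changed, the same helper would give the new offsets to use in the preprocessing function.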


This notebook demonstrates customizing the ResNet50 model for the MNIST dataset.  The weights we will use for this instantiation of ResNet50 were trained on natural photos rather than digits, so performance may suffer, but this makes for a good learning opportunity.

The basic approach is to replace the final layer of the trained model with our own 10-neuron layer and then to train just that final layer while keeping all other parameters locked. Training large models takes a significant amount of time, so by locking the pretrained part we retain all the useful edge-recognition and other kernels and save a huge amount of training time.  Replacing the last layer with our custom layer allows us to adapt the model to our own labels, with the aim of creating a high-performance custom model for our situation.

In [None]:
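# Build the model: a frozen, pretrained ResNet50 backbone followed by a new 10-class output layer
resnet_model = tf.keras.models.Sequential()
resnet_model.add(tf.keras.applications.resnet50.ResNet50(
    include_top=False,        # Drop the original 1000-class ImageNet classifier
    pooling='avg',            # Global average pooling yields a 2048-element feature vector
    input_shape=(32, 32, 3),  # Smallest input size ResNet50 accepts
    weights='imagenet',       # Use weights pretrained on ImageNet
    ))
resnet_model.add(tf.keras.layers.Dense(10))  # One output (logit) per digit class
resnet_model.layers[0].trainable = False     # Lock the pretrained backbone
# The Dense layer outputs raw logits, so the loss is configured with from_logits=True
resnet_model.compile(optimizer='adam',
                     loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                     metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])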

resnet_model.summary()

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
resnet50 (Functional)        (None, 2048)              23587712  
_________________________________________________________________
dense_3 (Dense)              (None, 10)                20490     
Total params: 23,608,202
Trainable params: 20,490
Non-trainable params: 23,587,712
_________________________________________________________________
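
The parameter counts in the summary above can be checked by hand. The following is a minimal sketch in plain Python, assuming a standard fully connected layer with one weight per input-output pair plus one bias per output; `dense_param_count` is a hypothetical helper, not part of Keras, and the backbone count is copied from the summary rather than computed.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


backbone_params = 23_587_712               # frozen ResNet50 parameters, from the summary
head_params = dense_param_count(2048, 10)  # new 10-neuron output layer
total_params = backbone_params + head_params

print(head_params)   # 20490
print(total_params)  # 23608202

# Fraction of the model that is actually updated during training
trainable_fraction = head_params / total_params
print(f"{trainable_fraction:.2%}")
```

Only the 20,490 parameters of the new layer are trained, which is well under one percent of the model and is why locking the backbone saves so much training time.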


Training is done using the fit function just as before.

In [None]:
resnet_model.fit(train, epochs=2, validation_data=test)

Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x7f1f6be57a90>

With the adapted pretrained model, performance actually drops relative to the models in [Notebook 1](https://colab.research.google.com/github/chadmh/Short-Hands-on-Tutorial-for-Deep-Learning-in-Tensorflow/blob/master/1_Introduction.ipynb) and [Notebook 3](https://colab.research.google.com/github/chadmh/Short-Hands-on-Tutorial-for-Deep-Learning-in-Tensorflow/blob/master/3_Convolutional_Neural_Networks.ipynb).  While this might be surprising at first, the reason is quite simple.  The pretrained weights that we used for the ResNet50 model were trained on the [ImageNet dataset](https://image-net.org/index.php). This dataset covers a broad array of animals, plants, people and things, but not digits.  The model is also optimized for larger images, with a default input size of (224, 224, 3).  The issue could be resolved by retraining the network from scratch on digit data.  The model performs much better when applied to photographic images, since that is what its training data contains.
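
To make the input-size mismatch concrete, the sketch below estimates the spatial size of ResNet50's final feature map before global average pooling. It assumes a simplified model in which the network halves the resolution five times, rounding up each time (an overall stride of 32); `final_feature_map_size` is a hypothetical helper for illustration only, not a Keras function.

```python
import math


def final_feature_map_size(input_size, num_stride2_stages=5):
    """Spatial size left after repeatedly halving the resolution (rounding up)."""
    size = input_size
    for _ in range(num_stride2_stages):
        size = math.ceil(size / 2)
    return size


for input_size in (224, 32):
    side = final_feature_map_size(input_size)
    print(f"{input_size} x {input_size} input -> {side} x {side} feature map")
```

Under this model, a 224 x 224 input leaves a 7 x 7 grid of feature vectors to average, while a 32 x 32 input leaves a single position, so very little spatial information about the digit survives to the final layer.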

This performance loss, however, clearly shows that simply grabbing and using an industry-standard model without knowing its assumptions and training biases can lead to sub-standard performance.  Newer or larger does not necessarily mean better in the context of a specific machine learning problem.  Knowing which model will apply well in which situation is a matter of reading the scientific literature and developing hands-on experience.





# 4.2 Applications of CNNs to Production Pipelines

CNNs typically attempt to classify an image as a single item.  In general, however, an image may contain multiple items of interest.  For example, a security camera may capture an image containing people, bikes, dogs and cars.  Applying the CNN to the entire image at once therefore tends to confuse the network, as it sees features associated with multiple classes.

One approach to resolving this issue is to split the large image into smaller windows and process each with the CNN.  This reduces the likelihood of having multiple classes of objects in a single input.  The window size and overlap are application dependent. The system tracks which windows flag a high-confidence hit on a class of interest and then localizes the object based on the window position.  This approach scales to large and varied images but can suffer from poor execution speed, as the system must process and keep track of the individual windows. Because of overlap and rescaling, parts of the original image may be processed multiple times as the system tries to match the input windows to the trained model.
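
The windowing idea can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a production detector: the image size, window size, stride, threshold, and the per-window scores are all made up, and `sliding_windows` and `localize_hits` are hypothetical helpers. In a real system the scores would come from running the CNN on each window.

```python
def sliding_windows(image_height, image_width, window, stride):
    """Yield (top, left, bottom, right) boxes that fit fully inside the image."""
    for top in range(0, image_height - window + 1, stride):
        for left in range(0, image_width - window + 1, stride):
            yield top, left, top + window, left + window


def localize_hits(boxes, scores, threshold):
    """Keep the boxes whose confidence score meets the threshold."""
    return [box for box, score in zip(boxes, scores) if score >= threshold]


# A 128 x 128 image scanned with 64 x 64 windows that overlap by half
boxes = list(sliding_windows(128, 128, window=64, stride=32))
print(len(boxes))  # 9

# Made-up confidence scores, one per window, standing in for CNN outputs
scores = [0.1, 0.2, 0.1, 0.3, 0.95, 0.4, 0.1, 0.2, 0.1]
hits = localize_hits(boxes, scores, threshold=0.9)
print(hits)  # [(32, 32, 96, 96)]

# Pixels processed relative to a single pass over the whole image
overhead = len(boxes) * 64 * 64 / (128 * 128)
print(overhead)  # 2.25
```

Even in this small case the overlapping windows mean the CNN processes 2.25 times as many pixels as the image contains, which illustrates the execution-speed cost described above.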

A more modern approach is to treat image analysis as a regression problem rather than a classification problem.  This is the approach taken by YOLO (You Only Look Once) networks, as discussed in the next notebook.