# Module 4, Lab 2: Using ResNet with Fully Convolutional Layers and Transfer Learning to Implement Semantic Segmentation

In this tutorial, we will add fully-convolutional layers to the ResNet model using transfer learning.

We will introduce and use the concept of *Transfer Learning*, where pre-existing learned knowledge is used to inform our approach and improve our results.

Let's get started!

We built up a lot of code in the last tutorial, so we won't replicate it. Instead, we're going to import Python files from the directory of this lab that implement the same functionality. Feel free to open those files and inspect the code!



In [1]:
# Copyright (c) Microsoft. All rights reserved.
#
# Licensed under the MIT license. See LICENSE.md file in the project root
# for full license information.
# ==============================================================================

# For Azure Notebooks, we will update Microsoft Cognitive Toolkit version to 2.4 
# you can comment out the following line if you are running in your own local Jupyter Notebook setup and already have
# CNTK 2.4 installed
!pip install --upgrade --no-deps https://cntk.ai/PythonWheel/CPU-Only/cntk-2.4-cp35-cp35m-linux_x86_64.whl

import cntk as C
print ("Using Microsoft Cognitive Toolkit version {}".format(C.__version__))

import numpy as np
print ("Using numpy version {}".format(np.__version__))

cntk-2.4-cp35-cp35m-linux_x86_64.whl is not a supported wheel on this platform.
You are using pip version 9.0.1, however version 18.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.


Using Microsoft Cognitive Toolkit version 2.5.1
Using numpy version 1.13.1


In [2]:
import time
import os
import cv2
import gc

from cntk.learners import learning_rate_schedule, UnitType
from cntk.device import try_set_default_device, gpu
from tqdm import tqdm
from cntk.initializer import he_normal
from cntk.layers import AveragePooling, BatchNormalization, Convolution, Dense
from cntk.ops import element_times, relu, sigmoid
from cntk import load_model, placeholder, Constant

import coco   # local class to read the COCO images and labels
import helper # some functions to plot images
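import training_helper  # local helpers for slicing minibatches and measuring error
import cntk_resnet_fcn  # local helpers for the FCN layers (upsampling, dice coefficient)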
import matplotlib.pyplot as plt

# Paths relative to current python file.
abs_path  = os.path.dirname(os.path.abspath("."))
data_path = os.path.join(abs_path, "../../data/M4")
zip_path = os.path.join(abs_path, "../data-zip")

model_path = os.path.join(abs_path, "Lab2/models")
base_model_file = os.path.join(model_path, "ResNet18_ImageNet_CNTK.model")


We are going to ask the Cognitive Toolkit to use the best available device: the first GPU if one is available, otherwise the CPU.

In [3]:
try:
    isUsingGPU = C.device.try_set_default_device(C.device.gpu(0))
except ValueError:
    isUsingGPU = False
    C.device.try_set_default_device(C.device.cpu())
    
print ("[i] The Cognitive Toolkit is using the {} for processing".format("GPU" if isUsingGPU else "CPU"))


[i] The Cognitive Toolkit is using the CPU for processing


Finally, if you are interested in running training (and have a GPU, or a CPU with lots of patience), set `make_model` to `True` and run the training code below.

Otherwise, set it to `False` to use the already-baked pre-trained model from Microsoft.

In [5]:
make_model = False

# Transfer Learning

Nowadays, few convolutional image networks are trained from scratch with purely random initialization. Mostly, this is because of the difficulty of finding a suitably large, labelled dataset. If the target dataset is very small, training on it alone is likely to lead to unacceptably large generalization error.

Instead, the most common approach is to *pre-train* using a more generic large dataset (for example, ImageNet or COCO) and then use the features learned from this set as an initial seed to continue training with the smaller target dataset.

This relies on the intuition that the knowledge gained (within the layers of the network) while learning to recognize certain objects could be useful when trying to recognize objects of a different type. In other words, the network will learn more useful and generic *features* in its layers when trained against a larger dataset, which it can then apply (*transfer*) to the dataset of interest.

This approach is called *Transfer Learning*. 

![Transfer Learning](images/Transfer_Learning.png "Transfer Learning")

## Considerations when using Transfer Learning

If the target dataset is significantly smaller than the original dataset but of similar content, fine-tuning is not always a good idea, as it may lead to overfitting. Because the data is similar to the original data, the already-learned features are very likely to be relevant ("transferable") to this dataset as well, and they have already been generalized from the content of the larger original dataset.

If the target dataset is relatively large and again of similar content to the original dataset, we can try fine-tuning, as we are less likely to overfit.

If our target dataset is smaller in size but dissimilar in content to the original dataset, it is likely that the later domain-specific layers in the original network have learned features that are not relevant to our target network. In this case, it might work better to remove some of these later layers, and add any new domain-specific layers to activations from earlier in the original network.

If our target dataset is large and very dissimilar in content to the original dataset, we could train from scratch, but in practice it is often useful to initialize with weights taken from a pre-trained model. However, we don't need to freeze any of the layers, and would typically just use the weights learned from the original dataset as a starting point.
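
The four cases above can be summarized as a small lookup. This is only an illustrative sketch: the function name `suggest_strategy` and the wording of each suggestion are hypothetical and are not part of the lab code. They are rules of thumb, not guarantees.

```python
def suggest_strategy(target_is_large, similar_to_source):
    """Return a rule-of-thumb transfer learning strategy (illustrative only)."""
    if similar_to_source:
        if target_is_large:
            return "fine-tune the whole pre-trained network"
        return "freeze the pre-trained layers and train only the new layers"
    if target_is_large:
        return "initialize from pre-trained weights and train all layers"
    return "keep only the early layers and train new layers on their activations"


for large in (False, True):
    for similar in (True, False):
        print("large={}, similar={} -> {}".format(
            large, similar, suggest_strategy(large, similar)))
```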

Additionally, features tend to transition from general (*well-transferable*) early in the network to more specific (*less-transferable*) in later layers. This is called *domain discrepancy*. More advanced strategies might take account of this, and could include using different learning rates by layer: that is, adjusting the learning rate of a layer (determining how *slushy* to make it, from frozen to fully tweakable) depending on its depth in the network.
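
One simple way to picture per-layer learning rates is a geometric decay from the last layer back towards the first. The sketch below is an assumption made for illustration (the helper `layerwise_learning_rates` and its `decay` parameter are hypothetical, and this lab does not use per-layer rates).

```python
def layerwise_learning_rates(num_layers, base_lr=1e-3, decay=0.5):
    """Give later (more task-specific) layers larger learning rates.

    Layer 0 is the earliest layer. The last layer gets base_lr, and each
    earlier layer gets the rate of its successor multiplied by `decay`.
    """
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


rates = layerwise_learning_rates(4)
for depth, lr in enumerate(rates):
    print("layer {}: lr = {:g}".format(depth, lr))
```

With `decay=1.0` every layer is equally tweakable (plain fine-tuning), while `decay=0.0` gives every layer except the last a rate of zero, which corresponds to freezing them.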

![Transfer Learning Strategies](images/Transfer_Learning_Strategies.png "Transfer Learning Strategies")


## Our Examples

In our examples, we are going to take a pre-trained set of weights for an object classifier network (ResNet18, trained on ImageNet), and try two different approaches:
 * In the first approach, we will *freeze* these weights, and only train additional layers to perform semantic segmentation.
 * In the second approach, we will not freeze the original weights, but *fine-tune* them by continuing to train and allowing backpropagation to alter the weights in the earlier layers of the network.
 
Note that this is not quite the same model that we used in Module 4, Lab 1, as we're using a small model that is pre-trained. As a result, our segmentation output will look more *blocky*, as we are upsampling from smaller layers. But this is an artefact of engineering the lab to run in a limited computation environment, and of keeping the training size down. In general, transfer learning will offer better accuracy for limited training sets, and quite often faster training as well (if freezing layers).
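
To see where the blockiness comes from, here is a minimal NumPy sketch. It assumes a nearest-neighbour style of upsampling, and the lab's `cntk_resnet_fcn` helpers may upsample differently. The helper name `upsample_nearest` is hypothetical.

```python
import numpy as np


def upsample_nearest(mask, factor):
    """Repeat every cell `factor` times along both spatial axes."""
    return np.repeat(np.repeat(mask, factor, axis=0), factor, axis=1)


# A coarse 2x2 prediction, as might come from a small late layer.
coarse = np.array([[0, 1],
                   [1, 0]])
fine = upsample_nearest(coarse, 4)

print(fine.shape)   # (8, 8)
print(fine)
```

Each coarse cell becomes a 4x4 block of identical values, so object boundaries can only fall on block edges. The smaller the layer we upsample from, the larger these blocks are.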


In [6]:
feature_node_name = "features"
last_hidden_node_name = "z.x"
image_height = 224
image_width = 224
num_channels = 3

#
# Defines the fully convolutional models for image segmentation (transfer learning)
#
def create_transfer_learning_model(input, num_classes, model_file, freeze=False):

    # Load the pre-trained classification network
    base_model = load_model(model_file)
    base_model = C.as_composite(base_model[3].owner)

    # Find the input (feature) node and the last hidden node of the base network
    feature_node = C.logging.find_by_name(base_model, feature_node_name)
    last_node = C.logging.find_by_name(base_model, last_hidden_node_name)
    
    # Clone the base network up to the last hidden node (freezing its weights if
    # requested) and attach a fresh input variable in place of the original features
    base_model = C.combine([last_node.owner]).clone(C.CloneMethod.freeze if freeze else C.CloneMethod.clone, {feature_node: C.placeholder(name='features')})
    base_model = base_model(C.input_variable((num_channels, image_height, image_width)))

    r1 = C.logging.find_by_name(base_model, "z.x.x.r")
    r2_2 = C.logging.find_by_name(base_model, "z.x.x.x.x.r")
    r3_2 = C.logging.find_by_name(base_model, "z.x.x.x.x.x.x.r")

    up_r1 = cntk_resnet_fcn.OneByOneConvAndUpSample(r1, 3, num_classes)
    up_r2_2 = cntk_resnet_fcn.OneByOneConvAndUpSample(r2_2, 2, num_classes)
    up_r3_2 = cntk_resnet_fcn.OneByOneConvAndUpSample(r3_2, 1, num_classes)
    
    merged = C.splice(up_r1, up_r2_2, up_r3_2, axis=0)

    resnet_fcn_out = Convolution((1, 1), num_classes, init=he_normal(), activation=sigmoid, pad=True)(merged)

    z = cntk_resnet_fcn.UpSampling2DPower(resnet_fcn_out,2)
    
    return z
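
The shape bookkeeping in the function above can be followed in plain NumPy. This is a rough sketch under stated assumptions, not the lab's implementation: the spatial sizes 7, 14 and 28 are guesses chosen so that the result matches a 224x224 input, upsampling is assumed to be nearest-neighbour by a power of two, and the helper names `upsample_pow2` and `one_by_one_conv` are hypothetical.

```python
import numpy as np


def upsample_pow2(x, power):
    """Nearest-neighbour upsampling of a (C, H, W) array by 2**power."""
    f = 2 ** power
    return np.repeat(np.repeat(x, f, axis=1), f, axis=2)


def one_by_one_conv(x, weights):
    """1x1 convolution: mix channels at every pixel. weights is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weights, x)


rng = np.random.RandomState(0)
num_classes = 2

# Hypothetical per-class score maps at three depths (the deepest is the smallest).
deep, mid, shallow = (rng.rand(num_classes, s, s) for s in (7, 14, 28))

# Bring all three maps to a common resolution and stack them along the channel axis.
merged = np.concatenate([upsample_pow2(deep, 3),
                         upsample_pow2(mid, 2),
                         upsample_pow2(shallow, 1)], axis=0)

# Mix the stacked channels down to one map per class, squash with a sigmoid,
# then upsample to the input resolution.
scores = one_by_one_conv(merged, rng.rand(num_classes, merged.shape[0]))
probs = 1.0 / (1.0 + np.exp(-scores))
output = upsample_pow2(probs, 2)

print(merged.shape, output.shape)
```

Combining maps taken from several depths lets the final 1x1 convolution draw on both coarse, high-level evidence and finer, lower-level evidence for each pixel.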


Now, let's set up and call our function to prepare the large dataset that is needed for this lab.

Why are we doing this? We need this step because Azure Notebooks has two storage areas: a persistent but slow area, and a transient but fast area. To ensure that this lab is able to execute as fast as possible on Azure Notebooks, we unzip our data set into this transient area once per session.  If it has already been unzipped then we'll automatically detect this and skip the step, so it is safe to run this next code block at any time.

Depending on how you are executing the lab, this step can take a minute or two.

In [7]:
import sys
import os
import zipfile
import fnmatch

def hydrate(zip_path, dest_path):
    print("Start unzipping data files in {0} to {1}".format(zip_path,dest_path))
    if not os.path.exists(zip_path):
        print("The source folder {0} doesn't exist, so quitting".format(zip_path))
        return  # leave the function without shutting down the notebook kernel

    zipfile_count = len(fnmatch.filter(os.listdir(zip_path), '*.zip'))
    if zipfile_count == 0:
        print("No zip (.zip) files in {0}, so quitting ".format(zip_path))
        return

    print("zip file count:%s" % zipfile_count)

    if (os.path.exists(dest_path) == False):
        print("Destination folder {0} doesn't exist, creating it".format(dest_path))
        os.makedirs(dest_path)

        # Extract all zip files from zip_path to dest_path
        print("Start unzipping files to {0}".format(dest_path))
        for item in os.listdir(zip_path): # loop through items in dir
            if item.endswith(".zip"): # check for ".zip" extension
                print("   unzipping {0} ...".format(item))
                file_name = os.path.join(zip_path,item) # get full path of files
                zip_ref = zipfile.ZipFile(file_name) # create zipfile object
                zip_ref.extractall(dest_path) # extract file to dir
                zip_ref.close() # close file
    else:
        print("data folder already populated")

    print("Complete: Files have been unzipped to {0}".format(dest_path))
    
hydrate(zip_path, data_path)

Start unzipping data files in Y:\Courses\Computer Vision\dev290x-1 (1)\library\Module4\../data-zip to Y:\Courses\Computer Vision\dev290x-1 (1)\library\Module4\../../data/M4
zip file count:6
data folder already populated
Complete: Files have been unzipped to Y:\Courses\Computer Vision\dev290x-1 (1)\library\Module4\../../data/M4


We need to load our dataset.

In [8]:
# Configure the data source

    
print('[i] Configuring data source...')
try:
    source = coco.CocoMs(os.path.join(data_path, "CocoMS"))
    training_input_image_files, training_target_mask_files = source.get_data(train_data_folder='/Training')
    validation_input_image_files, validation_target_mask_files = source.get_data(train_data_folder='/Validation')
    print('[i] # training samples:   ', len(training_input_image_files))
    print('[i] # validation samples: ', len(validation_input_image_files))
    print('[i] # classes:            ', source.num_classes)
    print('[i] Image size:           ', (224,224))
except (ImportError, AttributeError, RuntimeError) as e:
    print('[!] Unable to load data source:', str(e))    


[i] Configuring data source...
Initializing CocoMS


[i] reading images and labels...: 100%|####################################################################################################################################| 2986/2986 [00:00<00:00, 47395.06it/s]
[i] reading images and labels...: 100%|###################################################################################################################################################| 97/97 [00:00<?, ?it/s]


[i] # training samples:    2986
[i] # validation samples:  97
[i] # classes:             2
[i] Image size:            (224, 224)


We are going to encapsulate our image processing routines from the last lab into a handy function for re-use.

This creates some images to visualize how well our semantic segmenter performed against our validation set.

In [9]:
# Drawing

def process_images():
    print("[i] Started image processing...", flush=True)
    tic = time.time()
    
    input_images_rgb = []
    for x in tqdm(validation_input_images, ascii=True, desc='[i] Converting input images (BGR2RGB)...'):
        img = np.moveaxis(x,0,2).astype(np.uint8)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        input_images_rgb.append(img)

    target_masks_rgb=[]
    for x in tqdm(validation_target_masks, ascii=True, desc='[i] Coloring ground truth images...'):
        target_masks_rgb.append(helper.masks_to_colorimg(x))

    pred_rgb=[]
    for x in tqdm(pred, ascii=True, desc='[i] Coloring prediction images...'):
        pred_rgb.append(helper.masks_to_colorimg(x))

    output_images_rgb = []
    for index in tqdm(range(len(input_images_rgb)), ascii=True, desc='[i] Combining input images + predictions...'):
        img = cv2.bitwise_or(input_images_rgb[index], pred_rgb[index])
        output_images_rgb.append(img)

    print('Image Processing time: {} s.'.format(time.time() - tic))
    print("Image processing finished... Now plotting (this can take a while - 2-3 minutes on Azure Notebooks)...", flush=True)
    helper.plot_side_by_side([input_images_rgb, target_masks_rgb, pred_rgb, output_images_rgb])
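
The layout conversions inside `process_images` can be tried on a tiny array. The sketch below uses NumPy only and assumes 3-channel `uint8` images: reversing the last axis has the same effect as `cv2.cvtColor(img, cv2.COLOR_BGR2RGB)`, and `np.bitwise_or` stands in for `cv2.bitwise_or`. The variable names are made up for illustration.

```python
import numpy as np

# A tiny image in channel-first (C, H, W) layout with channels ordered B, G, R.
chw_bgr = np.zeros((3, 2, 2), dtype=np.float32)
chw_bgr[0] = 255.0   # fill the blue channel only

hwc_bgr = np.moveaxis(chw_bgr, 0, 2).astype(np.uint8)   # (C, H, W) -> (H, W, C)
hwc_rgb = hwc_bgr[..., ::-1]                            # BGR -> RGB

# Overlay a solid red "prediction" on the image with a bitwise OR.
mask_rgb = np.zeros_like(hwc_rgb)
mask_rgb[..., 0] = 255
overlay = np.bitwise_or(hwc_rgb, mask_rgb)

print(hwc_rgb.shape)   # (2, 2, 3)
print(hwc_rgb[0, 0], overlay[0, 0])
```

After the conversion, blue sits in the last channel, which is what matplotlib expects. The OR overlay simply switches on the mask's colour bits on top of the original pixels.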

# Our Trainer

This time, our training function will create transfer learning models for the network.

In [10]:
def train(train_image_files, train_mask_files, val_image_files, val_mask_files, base_model_file, freeze=False):
    # Create model: load one training image and its mask to get the input and target shapes
    sample_img, sample_mask = source.files_to_data([train_image_files[0]], [train_mask_files[0]])
    x = C.input_variable(sample_img[0].shape)
    y = C.input_variable(sample_mask[0].shape)
    
    z = create_transfer_learning_model(x, source.num_classes, base_model_file, freeze)
    dice_coef = cntk_resnet_fcn.dice_coefficient(z, y)


    # Prepare model and trainer
    if (isUsingGPU):
        lr_mb = [0.001] * 5 + [0.0001] * 5 + [0.00001]*5 + [0.000001]*5 + [0.0000001]*5
    else:
        # training without a GPU is really slow, so we'll deliberately shrink the amount of training
        # to just one epoch if we're on a CPU, just to give a flavor of what happens during training,
        # and then read in a pre-trained model for inference instead.
        lr_mb = [0.0001] * 1 # deliberately shrink if training on CPU...
    lr = learning_rate_schedule(lr_mb, UnitType.sample)
    momentum = C.learners.momentum_as_time_constant_schedule(0.9)
    trainer = C.Trainer(z, (-dice_coef, -dice_coef), C.learners.adam(z.parameters, lr=lr, momentum=momentum))
                        
    training_errors = []
    test_errors = []

    # Get minibatches of training data and perform model training
    minibatch_size = 8
    num_epochs = len(lr_mb)
     
    for e in range(0, num_epochs):
        for i in tqdm(range(0, int(len(train_image_files) / minibatch_size)), ascii=True, 
                               desc="[i] Processing epoch {}/{}".format(e, num_epochs-1)):
            data_x_files, data_y_files = training_helper.slice_minibatch(train_image_files, train_mask_files, i, minibatch_size)
            data_x, data_y = source.files_to_data(data_x_files, data_y_files)
            trainer.train_minibatch({z.arguments[0]: data_x, y: data_y})
            gc.collect()
     
        # Measure training error
        training_error = training_helper.measure_error(source, data_x_files, data_y_files, z.arguments[0], y, trainer, minibatch_size)
        training_errors.append(training_error)
        
        # Measure test error
        test_error = training_helper.measure_error(source, val_image_files, val_mask_files, z.arguments[0], y, trainer, minibatch_size)
        test_errors.append(test_error)

        print("epoch #{}: training_error={}, test_error={}".format(e, training_errors[-1], test_errors[-1]))
        
    return trainer, training_errors, test_errors
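
The trainer above uses the negative Dice coefficient as both its loss and its evaluation metric, so minimizing the loss maximizes the overlap between prediction and ground truth. The NumPy sketch below shows one common form of the Dice coefficient. It is an assumption for illustration, and `cntk_resnet_fcn.dice_coefficient` may differ in detail (for example, in how it smooths the ratio).

```python
import numpy as np


def dice_coefficient(pred, target, eps=1e-7):
    """Dice = 2 * overlap / (sum(pred) + sum(target)), for arrays in [0, 1]."""
    intersection = np.sum(pred * target)
    return (2.0 * intersection + eps) / (np.sum(pred) + np.sum(target) + eps)


target = np.array([[1, 1, 0, 0]], dtype=np.float32)
perfect = target.copy()
half = np.array([[1, 0, 0, 0]], dtype=np.float32)
wrong = 1.0 - target

for name, pred in [("perfect", perfect), ("half", half), ("wrong", wrong)]:
    d = dice_coefficient(pred, target)
    print("{}: dice = {:.3f}, loss = {:.3f}".format(name, d, -d))
```

A perfect mask scores 1, a mask with no overlap scores close to 0, and the loss passed to the trainer is simply the negated score.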

# Freezing

And now it's time to start training. In the first approach, we will *freeze* the pre-trained weights, and only train the additional layers to perform semantic segmentation.


In [11]:
#
# We need to convert our validation filenames to image data for the predictor...
#

print ("[i] Converting file lists to image data...", flush=True)
tic = time.time()
validation_input_images, validation_target_masks = \
        source.files_to_data(validation_input_image_files, validation_target_mask_files)
print('Converting validation image data time: {} s.'.format(time.time() - tic))
print("[i] Converting file lists finished...")

[i] Converting file lists to image data...
Converting validation image data time: 3.605370044708252 s.
[i] Converting file lists finished...


In [12]:
# Training with original layer weights frozen

if make_model:
    print("[i] Starting training (with frozen weights)...", flush=True)
    frozen = True
    tic = time.time()
    trainer, training_errors, test_errors = train(training_input_image_files, training_target_mask_files,  
                                                  validation_input_image_files, validation_target_mask_files, 
                                                  base_model_file, frozen)
    print('Training time: {} s.'.format(time.time() - tic))
    print("[i] Training finished...")
    model = trainer.model
else:
    print("[i] Skipping training, using pre-trained model...")
    model = C.load_model(os.path.join(model_path, 'cntk-resnet-fcn-transfer-frozen.dnn'))

# Prediction

print("[i] Starting prediction...", flush=True)

pred = []
for idx in tqdm(range(0, len(validation_input_images)),ascii=True, desc='[i] Predicting...'):
    pred += list(model.eval(validation_input_images[idx]))

print('[i] {} images predicted.'.format(len(pred)))
print("[i] Prediction finished...")

if make_model:
    helper.plot_errors({"training": training_errors, "test": test_errors}, title="Simulation Learning Curve")

[i] Skipping training, using pre-trained model...
[i] Starting prediction...


[i] Predicting...: 100%|##########################################################################################################################################################| 97/97 [00:17<00:00,  9.81it/s]


[i] 97 images predicted.
[i] Prediction finished...


In [13]:
process_images()
print("Garbage collection reclaimed {} objects".format(gc.collect()))

[i] Started image processing...


[i] Converting input images (BGR2RGB)...: 100%|##################################################################################################################################| 97/97 [00:00<00:00, 404.35it/s]
[i] Coloring ground truth images...: 100%|#######################################################################################################################################| 97/97 [00:00<00:00, 325.07it/s]
[i] Coloring prediction images...: 100%|#########################################################################################################################################| 97/97 [00:00<00:00, 592.86it/s]
[i] Combining input images + predictions...: 100%|##############################################################################################################################| 97/97 [00:00<00:00, 1005.99it/s]


Image Processing time: 0.7565522193908691 s.
Image processing finished... Now plotting (this can take a while - 2-3 minutes on Azure Notebooks)...


100%|###########################################################################################################################################################################| 388/388 [00:15<00:00, 24.35it/s]


Garbage collection reclaimed 0 objects


Did you notice how much faster training is with transfer learning and frozen initial layers versus the full model training we did in Module 4, Lab 1? We're just training a few additional layers to implement semantic segmentation on top of a ResNet base model.

For the frozen layers, we don't have to retrain anything: their weights are fixed. This means that we don't have to run backpropagation through them to calculate error gradients and update weights. We just run forward propagation through those layers, and backpropagation only on the final additional layers to train their weights.
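
The difference between freezing and fine-tuning can be seen on a toy two-layer linear model. This is a minimal NumPy sketch, not the CNTK mechanism used in this lab (which relies on `C.CloneMethod.freeze`). The function `train_step` and its arguments are hypothetical.

```python
import numpy as np


def train_step(w_base, w_head, x, y, lr=0.1, freeze=True):
    """One gradient step on 0.5 * ||w_head @ (w_base @ x) - y||^2."""
    h = w_base @ x                       # forward through the "pre-trained" layer
    err = w_head @ h - y                 # forward through the new layer
    grad_head = np.outer(err, h)         # gradient for the new layer
    new_head = w_head - lr * grad_head
    if freeze:
        return w_base, new_head          # base layer: forward pass only
    grad_base = np.outer(w_head.T @ err, x)   # extra backward work when fine-tuning
    return w_base - lr * grad_base, new_head


rng = np.random.RandomState(0)
w_base, w_head = rng.rand(3, 4), rng.rand(2, 3)
x, y = rng.rand(4), rng.rand(2)

frozen_base, frozen_head = train_step(w_base, w_head, x, y, freeze=True)
tuned_base, tuned_head = train_step(w_base, w_head, x, y, freeze=False)

print("base changed when frozen:    ", not np.array_equal(frozen_base, w_base))
print("base changed when fine-tuned:", not np.array_equal(tuned_base, w_base))
```

In both modes the new layer is updated identically. Freezing just skips the gradient computation and update for the base layer, which is where the training-time saving comes from.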

# Fine-Tuning

In the second approach, we will not freeze the original weights, but *fine-tune* them by continuing to train and allowing backpropagation to alter the weights in the earlier layers of the network.

In [14]:
# Training with Fine-Tuning

if make_model:
    print("[i] Starting training (with fine-tuning)...", flush=True)
    frozen = False
    tic = time.time()
    trainer, training_errors, test_errors = train(training_input_image_files, training_target_mask_files, 
                                                  validation_input_image_files, validation_target_mask_files, 
                                                  base_model_file, frozen)
    print('Training time: {} s.'.format(time.time() - tic))
    print("[i] Training finished...")
    model = trainer.model
else:
    model = C.load_model(os.path.join(model_path, 'cntk-resnet-fcn-transfer-finetune.dnn'))

# Prediction

print("[i] Starting prediction...", flush=True)

pred = []
for idx in tqdm(range(0, len(validation_input_images)),ascii=True, desc='[i] Predicting...'):
    pred += list(model.eval(validation_input_images[idx]))

print('[i] {} images predicted.'.format(len(pred)))
print("[i] Prediction finished...")

if make_model:
    helper.plot_errors({"training": training_errors, "test": test_errors}, title="Simulation Learning Curve")
    # clean-up some variables we no longer need to reduce our memory footprint...
    # otherwise our Azure Notebook might run out of memory
    del trainer
    del training_errors
    del test_errors

[i] Starting prediction...


[i] Predicting...: 100%|##########################################################################################################################################################| 97/97 [00:10<00:00,  9.68it/s]


[i] 97 images predicted.
[i] Prediction finished...


In [15]:
process_images()
print("Garbage collection reclaimed {} objects".format(gc.collect()))

[i] Started image processing...


[i] Converting input images (BGR2RGB)...: 100%|#################################################################################################################################| 97/97 [00:00<00:00, 1593.61it/s]
[i] Coloring ground truth images...: 100%|#######################################################################################################################################| 97/97 [00:00<00:00, 679.91it/s]
[i] Coloring prediction images...: 100%|#########################################################################################################################################| 97/97 [00:00<00:00, 698.47it/s]
[i] Combining input images + predictions...: 100%|##############################################################################################################################| 97/97 [00:00<00:00, 8112.94it/s]


Image Processing time: 0.3802478313446045 s.
Image processing finished... Now plotting (this can take a while - 2-3 minutes on Azure Notebooks)...


100%|###########################################################################################################################################################################| 388/388 [00:14<00:00, 26.53it/s]


Garbage collection reclaimed 0 objects


# Conclusions

In this lab, we added fully-convolutional layers to a base ResNet model and used the concept of Transfer Learning to train our models.

Transfer Learning is the de facto means of training models nowadays for new data sets. It is a technique that allows a network to learn a new skill (recognition of a new class of object, for instance) by employing the knowledge it has already learned about similar types of skills (i.e., an ability to detect edges, corners, and more complex shapes, built up in its various layers through learning to recognise different classes of objects with a large data set).

We took a pre-trained set of weights for an object classifier network (ResNet, trained on ImageNet), and tried two different approaches: the first where we froze the base model weights and only allowed the new layers to learn, and the second where we seeded the initial layers with weights from the base model, but allowed all layers in the model to update and learn a solution to our problem.