# Milestone 2: Convolutional Neural Networks

Building a ResNetV2-20 model for the CIFAR-10 image classification task.

## Model Specifications

Model: ResNetV2-20
- Input layer: Input size: (32 x 32) x 3
    - conv2d (3 x 3) x 16
- ResBlock 1: Input size (32 x 32) x 16
     - conv2d (3 x 3) x 16
     - conv2d (3 x 3) x 16
- ResBlock 2: Input size (32 x 32) x 16
     - conv2d (3 x 3) x 16
     - conv2d (3 x 3) x 16
- ResBlock 3: Input size (32 x 32) x 16
     - conv2d (3 x 3) x 16
     - conv2d (3 x 3) x 16  
- ResBlock 4: Input size (32 x 32) x 16
     - conv2d (3 x 3) x 32, stride 2
     - conv2d (3 x 3) x 32
- ResBlock 5: Input size (16 x 16) x 32
     - conv2d (3 x 3) x 32
     - conv2d (3 x 3) x 32
- ResBlock 6: Input size (16 x 16) x 32
     - conv2d (3 x 3) x 32
     - conv2d (3 x 3) x 32
- ResBlock 7: Input size (16 x 16) x 32
     - conv2d (3 x 3) x 64, stride 2
     - conv2d (3 x 3) x 64
- ResBlock 8: Input size (8 x 8) x 64
     - conv2d (3 x 3) x 64
     - conv2d (3 x 3) x 64
- ResBlock 9: Input size (8 x 8) x 64
     - conv2d (3 x 3) x 64
     - conv2d (3 x 3) x 64
- Pooling: input size (8 x 8) x 64
     - GlobalAveragePooling/AdaptiveAveragePooling((1,1))
- Output layer: Input size (64,)
     - Dense/Linear (64,10)
     - Activation: Softmax
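
The layer list above can be checked with a few lines of arithmetic. The sketch below is not part of the assignment; it assumes the 16-channel input convolution used by the implementation later in this notebook, and `trace_shapes` is a hypothetical helper name. It traces the (height, width, channels) shape after each stage and counts the weighted layers behind the "20" in ResNetV2-20.

```python
# Hypothetical helper: trace (height, width, channels) after the stem and each ResBlock.
def trace_shapes(size=32, stem_channels=16):
    # (output channels, stride of the first conv) for ResBlocks 1-9
    stages = [(16, 1)] * 3 + [(32, 2)] + [(32, 1)] * 2 + [(64, 2)] + [(64, 1)] * 2
    shapes = [(size, size, stem_channels)]  # output of the input convolution
    for channels, stride in stages:
        # a 3x3 conv with padding 1 keeps the size at stride 1 and halves it at stride 2
        size = size // stride
        shapes.append((size, size, channels))
    return shapes

shapes = trace_shapes()
# input conv + 2 convs per ResBlock + final dense layer
weighted_layers = 1 + 2 * (len(shapes) - 1) + 1
print(shapes[-1], weighted_layers)
```

The last shape is the input to the pooling layer, and the count is one input convolution, eighteen block convolutions and one dense layer.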



Data: CIFAR-10 tiny images
- 32 x 32 x 3 RGB colour images
- Train/Test split: Use data splits already given (50,000 train, 10,000 test). From the 50,000 train images, use 45,000 for training and 5,000 for validation every epoch inside the training loop. Reserve the 10,000 test set images for final evaluation.
- Pre-processing inputs: 
     - Depending on data source, scale uint8 inputs to [0, 1] by dividing by 255
     - ImageNet normalization 
          - From the RGB channels, subtract means [0.485, 0.456, 0.406] and divide by standard deviations [0.229, 0.224, 0.225]
     - 4-pixel padding on each side, then apply a 32x32 crop randomly sampled from the padded image or its horizontal flip, as in Section 3.2 of [3]
- Preprocessing labels: Use integer indices
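
The input pre-processing steps above can be sketched in plain NumPy. This is an illustration under assumptions, not the pipeline used in the notebook: `augment_and_normalize` is a hypothetical name, the image is assumed to be a uint8 array in height x width x channel layout, and the random input merely stands in for a CIFAR-10 image.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def augment_and_normalize(image_uint8, rng, pad=4):
    """Pad, random-crop, random-flip, scale to [0, 1] and normalize one HWC image."""
    h, w, _ = image_uint8.shape
    # zero-pad `pad` pixels on every side: 32x32 becomes 40x40 for pad=4
    padded = np.pad(image_uint8, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    # pick the top-left corner of a crop with the original size
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    crop = padded[top:top + h, left:left + w]
    # horizontal flip with probability 0.5
    if rng.random() < 0.5:
        crop = crop[:, ::-1]
    scaled = crop.astype(np.float32) / 255.0
    return (scaled - IMAGENET_MEAN) / IMAGENET_STD

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3)).astype(np.uint8)  # stand-in image
out = augment_and_normalize(image, rng)
print(out.shape, out.dtype)
```

The evaluation path would skip the padding, crop and flip and apply only the scaling and normalization.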


Hyperparameters:
- Optimizer: AdamW
- learning rate: 1e-3 
- beta_1: 0.9
- beta_2: 0.999
- weight decay: 0.0001
- Number of epochs for training: 50 (TBD)
- Batch size: 256 (TBD)
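
AdamW differs from Adam with L2 regularization in where the weight decay enters the update. The scalar sketch below is a simplified illustration, not any framework's implementation; `adam_step` is a hypothetical name and its defaults are the hyperparameters listed above.

```python
import math

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, wd=1e-4, decoupled=True):
    """One Adam update for a scalar weight, with decoupled (AdamW) or L2 weight decay."""
    if not decoupled:
        g = g + wd * w  # L2 regularization: the decay term enters the gradient
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)  # bias correction
    v_hat = v / (1 - beta2 ** t)
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        w_new -= lr * wd * w  # AdamW: the decay is applied directly to the weight
    return w_new, m, v

# With a zero loss gradient, only the weight decay moves the weight.
w_adamw, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled=True)
w_l2, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled=False)
print(w_adamw, w_l2)
```

In the L2 variant the decay term passes through Adam's gradient normalization, so on this first step it produces a step of roughly the full learning rate. The decoupled variant shrinks the weight by only `lr * wd`.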


Metrics to record:
- Total training time (from start of training script to end of training run)
- Training time per 1 epoch (measure from start to end of each epoch and average over all epochs)
- Inference time per batch (measure per batch and average over all batches)
- Last epoch training loss
- Last epoch eval accuracy (from the 5,000 evaluation dataset)
- Held-out test set accuracy (from the 10,000 test dataset)
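
The "measure per batch and average over all batches" rule can be sketched with the standard library alone. `timed_batches` is a hypothetical helper, and summing a list stands in for a forward pass. With an asynchronous framework such as MXNet, the timed region would also need a synchronization point (for example `mx.nd.waitall()` or copying the output to NumPy); otherwise only the time to enqueue the work is measured.

```python
import time

def timed_batches(batches, infer):
    """Return (mean seconds per batch, list of per-batch durations)."""
    durations = []
    for batch in batches:
        start = time.perf_counter()
        infer(batch)  # the forward pass would go here
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations), durations

# Stand-in workload: five "batches" of 1000 numbers, "inference" is a sum.
mean_s, durations = timed_batches([[1.0] * 1000 for _ in range(5)], infer=sum)
print(f"{1000 * mean_s:.4f} ms per batch over {len(durations)} batches")
```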



<h2> Library import </h2>

In [1]:
# Necessary Libraries
import numpy as np

import mxnet as mx
from mxnet import gluon, nd, autograd as ag, npx
from mxnet.gluon import nn


# Libraries for datasets and pre-processing
from mxnet.gluon.data.vision import transforms, CIFAR10
import gluoncv
from gluoncv.data import transforms as gcv_transforms
import torch.utils.data # needed to split the training DS into train_data and cv_data


# json library needed to export metrics
import json
import time

# Miscellaneous libraries in case I need them for testing
import matplotlib.pyplot as plt
import math




In [2]:
# UNCOMMENT if multiple gpus
# # number of GPUs to use
# num_gpus = 1
# ctx = [mx.gpu(i) for i in range(num_gpus)]

In [3]:
#labels just for reference
labels = {
    0: "airplane",
    1: "automobile",
    2: "bird",
    3: "cat",
    4: "deer",
    5: "dog",
    6: "frog",
    7: "horse",
    8: "ship",
    9: "truck"
}


<h2> Dataset import & Data pre-processing/transformation </h2>

<h3> Transformation functions </h3>

<p> transform_train is applied to the full training set, so both train_data and cv_data receive the random crop and flip, while transform_test is applied to test_data. The random operations are augmentation intended to make the trained model generalize better, so I do not apply them to the testing dataset. </p>

In [4]:
transform_train = transforms.Compose([ gcv_transforms.RandomCrop(32, pad=4), # Pad 4 pixels on each side (32x32 -> 40x40), then randomly crop a 32x32 area
                                    transforms.RandomFlipLeftRight(), # Applying a random horizontal flip
                                    transforms.ToTensor(), # Transpose the image from height*width*num_channels to num_channels*height*width
                                                           # and map values from [0, 255] to [0,1]
                                    # Normalize each channel with the ImageNet mean and standard deviation
                                    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) 
                                ])

transform_test = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) 
                                ])

<h3> Importing transformed datasets and splitting train and cv </h3>

<h4> IMPORTANT: Run the following cells if you want the full dataset (50,000 train + 10,000 test) </h4>


In [None]:
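# Creating the train and test DS
# Note: lazy=False applies the transform once, up front, so each training image keeps
# a single fixed random crop/flip for all epochs (lazy=True would re-sample on every access)
full_train_ds = CIFAR10(train=True).transform_first(transform_train, lazy=False)
test_ds = CIFAR10(train=False).transform_first(transform_test, lazy=False)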

# Splitting the training datasets into the train_data and cv_data
train_size = int(0.9 * len(full_train_ds))
cv_size = len(full_train_ds) - train_size
train_ds, cv_ds = torch.utils.data.random_split(full_train_ds, [train_size, cv_size]) 
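
The same 90/10 split can be expressed on indices alone, which makes it reproducible and would let the two subsets be wrapped with different transforms. This is a minimal sketch, not what the cell above does; `split_indices` and its fixed seed are assumptions for illustration.

```python
import random

def split_indices(n, train_fraction=0.9, seed=0):
    """Shuffle the indices 0..n-1 with a fixed seed and cut them into two disjoint lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    train_size = int(train_fraction * n)
    return indices[:train_size], indices[train_size:]

train_idx, cv_idx = split_indices(50000)
print(len(train_idx), len(cv_idx))
```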

In [None]:
print("Dataset\t\t Length \t\t Type")
print("full_train_ds","\t", len(full_train_ds), "\t", type(full_train_ds))
print("train_ds","\t", len(train_ds), "\t", type(train_ds))
print("cv_ds","\t\t", len(cv_ds), "\t", type(cv_ds))
print("test_ds","\t", len(test_ds), "\t", type(test_ds))

Dataset		 Length 		 Type
full_train_ds 	 50000 	 <class 'mxnet.gluon.data.dataset.SimpleDataset'>
train_ds 	 45000 	 <class 'torch.utils.data.dataset.Subset'>
cv_ds 		 5000 	 <class 'torch.utils.data.dataset.Subset'>
test_ds 	 10000 	 <class 'mxnet.gluon.data.dataset.SimpleDataset'>


In [None]:
# Loading the datasets into the DataLoader
batch_size = 256
train_data = gluon.data.DataLoader(train_ds, batch_size=batch_size, shuffle=True, last_batch='discard')
# Evaluation loaders keep the last partial batch so every sample is scored; shuffling is not needed
cv_data = gluon.data.DataLoader(cv_ds, batch_size=batch_size, shuffle=False, last_batch='keep')
test_data = gluon.data.DataLoader(test_ds, batch_size=batch_size, shuffle=False, last_batch='keep')


In [None]:
# see data shape
for data, label in train_data:
    print("train_data\t", data.shape, label.shape)
    break
for data, label in cv_data:
    print("cv_data\t\t",data.shape, label.shape)
    break
for data, label in test_data:
    print("test_data\t",data.shape, label.shape)
    break


train_data	 (256, 3, 32, 32) (256,)
cv_data		 (256, 3, 32, 32) (256,)
test_data	 (256, 3, 32, 32) (256,)


<h4> IMPORTANT: Run the following cells if you want to use the proof-of-concept dataset (1024 train + 10000 test) </h4>

In [5]:
# Creating the train and test DS
full_train_ds = CIFAR10(train=True).transform_first(transform_train, lazy=False)
test_ds = CIFAR10(train= False).transform_first(transform_test, lazy=False)

In [6]:
# IF USING PROOF-OF-CONCEPT
poc_ds = full_train_ds[0:1024]

train_size = int(0.9 * len(poc_ds))
cv_size = len(poc_ds) - train_size
train_ds, cv_ds = torch.utils.data.random_split(poc_ds, [train_size, cv_size]) 

In [7]:
for data, label in poc_ds:
    print("train_data\t", data.shape, label.shape)
    break
print("train_ds + cv_ds = poc_ds")
print(len(train_ds), "\t +", len(cv_ds), "\t =", len(poc_ds))

train_data	 (3, 32, 32) ()
train_ds + cv_ds = poc_ds
921 	 + 103 	 = 1024


In [8]:
# Loading the datasets into the DataLoader
batch_size = 64
train_data = gluon.data.DataLoader(train_ds, batch_size = batch_size , shuffle=True, last_batch= 'discard')
cv_data = gluon.data.DataLoader(cv_ds, batch_size = batch_size,  shuffle=True , last_batch= 'keep')
test_data = gluon.data.DataLoader(test_ds, batch_size= batch_size,  shuffle=True , last_batch= 'keep')


In [9]:
# see data shape
for data, label in train_data:
    print("train_data\t", data.shape, label.shape)
    break
for data, label in cv_data:
    print("cv_data\t\t",data.shape, label.shape)
    break
for data, label in test_data:
    print("test_data\t",data.shape, label.shape)
    break
print(len(train_data))
print(len(cv_data)) # total is 16 batch steps

train_data	 (64, 3, 32, 32) (64,)
cv_data		 (64, 3, 32, 32) (64,)
test_data	 (64, 3, 32, 32) (64,)
14
2
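
The batch counts printed above follow from the `last_batch` setting. The sketch below reproduces the arithmetic for the two modes used here; `num_batches` is a hypothetical helper and does not cover Gluon's `'rollover'` mode.

```python
import math

def num_batches(num_samples, batch_size, last_batch):
    """Number of batches a loader yields per epoch for 'discard' or 'keep'."""
    if last_batch == "discard":
        return num_samples // batch_size  # the partial final batch is dropped
    return math.ceil(num_samples / batch_size)  # 'keep': the partial batch is yielded

train_batches = num_batches(921, 64, "discard")
cv_batches = num_batches(103, 64, "keep")
skipped = 921 - train_batches * 64  # training samples left out of each epoch
print(train_batches, cv_batches, skipped)
```

With `'discard'`, 25 of the 921 training samples are left out of each epoch. Because the loader shuffles, a different 25 are skipped each time.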


<h4> Defining ResNetV2 class structure </h4>

<h5> Defining the Basic Block structure </h5>

In [10]:
class BasicBlock(nn.Block):
    def __init__ (self, in_channels, channels, strides = 1 , **kwargs):
        super(BasicBlock, self).__init__(**kwargs)
        conv_kwargs = {
            "kernel_size": (3,3),
            "padding": 1,
            "use_bias": False
        }
        self.strides = strides
        self.in_channels = in_channels
        self.channels = channels

        self.bn1 = nn.BatchNorm(in_channels= in_channels)        
        self.conv1 = nn.Conv2D(channels, strides= strides,  in_channels= in_channels, **conv_kwargs) 
        
        self.bn2 = nn.BatchNorm(in_channels= channels)
        self.conv2 = nn.Conv2D(channels, in_channels= channels, **conv_kwargs)
        self.relu = nn.Activation('relu')
        
    def downsample(self, x):
        # Downsample with 'nearest' method (this is striding if dims are divisible by stride)
        x = x[:, :, ::self.strides, ::self.strides]
        # Creating padding tensor for the extra channels (same device and dtype as x)
        (b, c, h, w) = x.shape
        num_pad_channels = self.channels - self.in_channels
        pad = mx.nd.zeros((b, num_pad_channels, h, w), ctx=x.context, dtype=x.dtype)
        # Append this padding to the downsampled identity
        x = mx.nd.concat(x, pad, dim=1)
        return x

    def forward(self, x):
        if self.strides > 1:
            residual = self.downsample(x)
        else:
            residual = x
        x = self.bn1(x)
        x = self.relu(x)
        x = self.conv1(x)

        x = self.bn2(x)
        x = self.relu(x)
        x = self.conv2(x)
        return x + residual
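
The parameter-free shortcut in `downsample` can be checked in isolation. The NumPy sketch below mirrors the same two steps (strided subsampling, then zero-padding the channel axis) on a dummy tensor; `zero_pad_shortcut` is a hypothetical stand-alone name, not part of the class above.

```python
import numpy as np

def zero_pad_shortcut(x, out_channels, stride):
    """Subsample an NCHW tensor spatially and zero-pad its channels up to out_channels."""
    x = x[:, :, ::stride, ::stride]  # keep every `stride`-th row and column
    b, c, h, w = x.shape
    pad = np.zeros((b, out_channels - c, h, w), dtype=x.dtype)
    return np.concatenate([x, pad], axis=1)  # the new channels carry zeros

x = np.ones((2, 16, 32, 32), dtype=np.float32)
y = zero_pad_shortcut(x, out_channels=32, stride=2)
print(y.shape)
```

The first 16 output channels carry the subsampled identity and the remaining channels are zero, so the residual addition only affects the original channels.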

<h5> Defining the ResNetV2 CNN structure </h5>

In [11]:
class ResNetV2(nn.Block):
    def __init__(self, **kwargs):
        super(ResNetV2, self).__init__(**kwargs)

        self.input_layer = nn.Conv2D(in_channels = 3, channels= 16, kernel_size=(3,3), padding=1)

        self.layer_1 = BasicBlock(16,16)
        self.layer_2 = BasicBlock(16,16)
        self.layer_3 = BasicBlock(16,16)

        self.layer_4 = BasicBlock(16,32, strides = 2)
        self.layer_5 = BasicBlock(32,32)
        self.layer_6 = BasicBlock(32,32)

        self.layer_7 = BasicBlock(32,64, strides = 2)
        self.layer_8 = BasicBlock(64,64)
        self.layer_9 = BasicBlock(64,64)

        self.flatten = nn.Flatten()

        self.pool = nn.GlobalAvgPool2D(layout = 'NCHW')
        self.output_layer = nn.Dense(units=10, in_units=64)

    
    def forward (self, x):
        out = self.input_layer(x)
        out = self.layer_1(out)
        out = self.layer_2(out)
        out = self.layer_3(out)
        out = self.layer_4(out)
        out = self.layer_5(out)
        out = self.layer_6(out)
        out = self.layer_7(out)
        out = self.layer_8(out)
        out = self.layer_9(out)
        # print("Before Pool: ", out.shape)
        out = self.pool(out)
        # print("After Pool: ", out.shape)
        out = self.flatten(out)
        # print("After Flattening: ", out.shape)
        out = self.output_layer(out)
        return out
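
A quick way to sanity-check the architecture is to count its trainable parameters by hand. The sketch below assumes what the classes above define: 3x3 convolutions without bias inside the blocks, a bias on the input convolution (the Gluon default), and two trainable vectors (gamma and beta) per batch-norm layer. The helper names are hypothetical.

```python
def conv_params(in_c, out_c, k=3, bias=False):
    """Weights (and optional bias) of a k x k convolution."""
    return in_c * out_c * k * k + (out_c if bias else 0)

def block_params(in_c, out_c):
    """Two batch-norm layers (gamma + beta each) and two 3x3 convolutions without bias."""
    batch_norm = 2 * in_c + 2 * out_c
    return batch_norm + conv_params(in_c, out_c) + conv_params(out_c, out_c)

# (in_channels, channels) for layer_1 .. layer_9
blocks = [(16, 16)] * 3 + [(16, 32)] + [(32, 32)] * 2 + [(32, 64)] + [(64, 64)] * 2

total = conv_params(3, 16, bias=True)  # input layer
total += sum(block_params(i, o) for i, o in blocks)
total += 64 * 10 + 10  # dense output layer: weights + bias
print(total)
```

The total comes to roughly 0.27 million parameters. Batch-norm running statistics are not included because they are not trained by the optimizer.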

In [12]:
net = ResNetV2()
net.initialize()
# net.collect_params
# print(net.collect_params)
# net.summary

In [13]:
## sanity check to see all the layers
# params = net.collect_params()

# for key, value in params.items():
#     print(key, value)

In [14]:
trainer = gluon.Trainer(params = net.collect_params(),
                    optimizer='adam',
                    optimizer_params = {'learning_rate': 0.001, 'beta1': 0.9, 'beta2': 0.999, 'wd':0.0001}
                    ) # The guidelines specify AdamW; 'adam' with wd adds wd * weight to the gradient (L2 regularization),
                      # which is not the decoupled weight decay of AdamW


In [15]:
# # second sanity check to see whether running a rand input results in no errors
# inputs = mx.np.random.normal(size=(4, 3, 32, 32)).as_nd_ndarray()
# outputs = net(inputs)
# outputs

<h2> Running model on Training and CV dataset  </h2>

In [16]:
%%time

# Initializing time related variables and lists (to make it easier for metric outputs)

# initializing the training times
tic_total_train = time.time()
epoch_times = []

epochs = 50 # for full dataset

num_examples = len(train_ds)  #should return 45000 for full ds and 921 for proof-of-concept ds

# defining the accuracy evaluation metric
metric = mx.metric.Accuracy()

# Loss function
softmax_ce = gluon.loss.SoftmaxCrossEntropyLoss()

for epoch in range(epochs):
    tic_train_epoch = time.time()
    # creating cumulative loss variable
    cum_loss = 0

    # Looping over train_data iterator
    for data, label in train_data:
        
        # Inside training scope
        with ag.record():
            # Inputting the data into the nn
            outputs = net(data)
            # Computing the loss
            loss = softmax_ce(outputs,label)

        # Backpropagating the error
        loss.backward()
    
        # Summation of loss (divided by sample_size in the end)
        cum_loss += nd.sum(loss).asscalar()
        metric.update(label,outputs)

        trainer.step(batch_size)
    
    # Get evaluation results    
    name, acc = metric.get()  
    metric.reset()
    toc_train_epoch = time.time()
    epoch_times.append(toc_train_epoch - tic_train_epoch)
    
    ## CROSS VALIDATION DATASET
    # Looping over cv_data iterator
    tic_val = time.time() # initializing cv timer
    for data, label in cv_data:
        val_outputs = net(data)
        
        metric.update(label, val_outputs)
    
    # Getting evaluation results for cv dataset
    name, val_acc = metric.get()
    metric.reset()
    
    # Evaluating time elapse between the cv dataset
    toc_val = time.time() 
    
    print("Epoch %s | Loss: %.6f, Train_acc: %.6f, Val_acc: %.6f, in %.2fs " %
    (epoch+1, cum_loss/num_examples, acc, val_acc, epoch_times[epoch]))
print("-"*70)
toc_total_train = time.time() # total training time
    

Epoch 1 | Loss: 2.120704, Train_acc: 0.143973, Val_acc: 0.145631, in 3.30s 
Epoch 2 | Loss: 1.887566, Train_acc: 0.260045, Val_acc: 0.165049, in 3.10s 
Epoch 3 | Loss: 1.790941, Train_acc: 0.281250, Val_acc: 0.165049, in 3.30s 
Epoch 4 | Loss: 1.681264, Train_acc: 0.348214, Val_acc: 0.349515, in 2.98s 
Epoch 5 | Loss: 1.573798, Train_acc: 0.366071, Val_acc: 0.281553, in 2.98s 
Epoch 6 | Loss: 1.452661, Train_acc: 0.437500, Val_acc: 0.320388, in 3.69s 
Epoch 7 | Loss: 1.418639, Train_acc: 0.448661, Val_acc: 0.203883, in 2.92s 
Epoch 8 | Loss: 1.291924, Train_acc: 0.475446, Val_acc: 0.194175, in 3.05s 
Epoch 9 | Loss: 1.283327, Train_acc: 0.502232, Val_acc: 0.233010, in 3.05s 
Epoch 10 | Loss: 1.211334, Train_acc: 0.555804, Val_acc: 0.291262, in 3.06s 
Epoch 11 | Loss: 1.092815, Train_acc: 0.577009, Val_acc: 0.271845, in 3.04s 
Epoch 12 | Loss: 0.937524, Train_acc: 0.651786, Val_acc: 0.145631, in 3.05s 
Epoch 13 | Loss: 0.889176, Train_acc: 0.660714, Val_acc: 0.262136, in 3.06s 
Epoch 14
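
The network returns raw logits, and `SoftmaxCrossEntropyLoss` with its default settings applies the softmax internally, so the "Activation: Softmax" of the specification is folded into the loss. The pure-Python sketch below shows the per-sample computation; `softmax_cross_entropy` is a hypothetical stand-alone function, not the Gluon implementation.

```python
import math

def softmax_cross_entropy(logits, label):
    """Negative log-probability of `label` under softmax(logits), computed stably."""
    m = max(logits)  # subtract the maximum so exp() cannot overflow
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# A model that scores all 10 classes equally has a loss of ln(10).
uniform_loss = softmax_cross_entropy([0.0] * 10, 3)
print(round(uniform_loss, 4))
```

A loss of ln(10), about 2.30, is the natural reference point for an untrained 10-class classifier. The first-epoch loss in the log above is just below it.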

<h2> Running CNN model on Hold-Out dataset </h2>

In [22]:
metric = mx.metric.Accuracy()

# Looping over test_data iterator
for data, label in test_data:
    test_outputs = net(data)
    
    metric.update(label, test_outputs)

# Getting evaluation results for test dataset
name, test_acc = metric.get()
print('Test_acc: ', test_acc)

Test_acc:  0.3666
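
The accuracy metric boils down to comparing the arg-max of each output row with its label and pooling the counts over all batches, which stays correct when the last batch is smaller than the others. This is a minimal pure-Python sketch with made-up scores; `accuracy` is a hypothetical helper, not the `mx.metric.Accuracy` implementation.

```python
def accuracy(batches):
    """batches: iterable of (labels, scores) pairs, with one list of class scores per sample."""
    correct = total = 0
    for labels, scores in batches:
        for label, row in zip(labels, scores):
            predicted = max(range(len(row)), key=row.__getitem__)  # arg-max over classes
            correct += int(predicted == label)
            total += 1
    return correct / total

batches = [
    ([0, 1], [[2.0, 0.1], [0.3, 0.9]]),  # full batch, both predictions correct
    ([1], [[1.5, 0.2]]),                 # smaller last batch, prediction wrong
]
acc = accuracy(batches)
print(acc)
```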


In [18]:
# export JSON file 

metrics = {

    'model_name': 'ResNetV2-20',
    'framework_name': 'MxNet',
    'dataset': 'CIFAR-10',
    'task': 'classification',
    'total_training_time': toc_total_train - tic_total_train, #s
    'average_epoch_training_time': np.average(epoch_times), #s
    'average_batch_inference_time': 1000*np.average(toc_val - tic_val)/math.ceil(len(cv_ds)/batch_size), #ms
    'final_training_loss': cum_loss/num_examples, 
    'final_evaluation_accuracy': val_acc, 
    'final_test_accuracy': test_acc 
}

# Exporting the metrics file
with open('m2-mxnet-cnn.json', 'w') as outfile:
    json.dump(metrics, outfile)
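
A round-trip check confirms that the exported file can be read back. The sketch below writes a two-entry subset of the metrics to a temporary directory. The path is an assumption for illustration, and `default=float` is an optional safeguard that converts numeric types `json` cannot serialize on its own (for example NumPy `float32` scalars).

```python
import json
import os
import tempfile

metrics_subset = {"model_name": "ResNetV2-20", "final_test_accuracy": 0.3666}

# Write to a temporary location rather than the notebook's working directory
path = os.path.join(tempfile.mkdtemp(), "metrics.json")
with open(path, "w") as outfile:
    json.dump(metrics_subset, outfile, default=float, indent=2)

# Read the file back and compare with what was written
with open(path) as infile:
    loaded = json.load(infile)
print(loaded == metrics_subset)
```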

In [19]:
metrics

{'model_name': 'ResNetV2-20',
 'framework_name': 'MxNet',
 'dataset': 'CIFAR-10',
 'task': 'classification',
 'total_training_time': 159.05431532859802,
 'average_epoch_training_time': 3.0897912979125977,
 'average_batch_inference_time': 42.50967502593994,
 'final_training_loss': 0.01753664880699754,
 'final_evaluation_accuracy': 0.33980582524271846,
 'final_test_accuracy': 0.3666}