# Building ResNet 50 from scratch with Keras

ResNets are among the most popular convolutional networks in the deep learning literature. All major libraries (e.g. Keras) ship ready-made ResNet implementations that engineers can use on a daily basis. A number of online tutorials explain the basic principles behind ResNets. Here are a couple of them:


*   [Detailed Guide to Understand and Implement ResNets](https://cv-tricks.com/keras/understand-implement-resnets/)
*   [Hitchhiker’s Guide to Residual Networks in Keras](https://towardsdatascience.com/hitchhikers-guide-to-residual-networks-resnet-in-keras-385ec01ec8ff)

There is also one useful tutorial about building the key modules in popular networks like VGG, Inception and ResNet.

*   [How to Develop VGG, Inception and ResNet Modules from Scratch in Keras](https://machinelearningmastery.com/how-to-implement-major-architecture-innovations-for-convolutional-neural-networks/)


However, few articles walk through the fine details of a complete ResNet implementation, and several of those details need to be handled properly when building one. In this article, we will focus on building ResNet 50 from scratch. Our presentation in this tutorial is a simplified version of the code available in the [Keras Applications](https://github.com/keras-team/keras-applications) GitHub repository.

One key goal of this tutorial is to give you hands-on experience building large, complex CNNs with the help of the [Keras Functional API](https://keras.io/guides/functional_api/).

For a basic introduction to ResNets, we suggest looking at the articles mentioned above or the original paper:

* [2016, Deep Residual Learning for Image Recognition](https://www.cv-foundation.org/openaccess/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html)

The following is Table 1 from the paper, which describes various ResNet architectures.
![Resnet Architectures](https://drive.google.com/uc?id=1CH2tuV2hMFZ7BAyeURm8tLhQUvqvinfm)

Please spend some time looking at the column for the architecture of the 50-layer ResNet.


Without further ado, let's get into implementing a ResNet 50 network with Keras.


We start by importing relevant modules from Keras.

In [None]:
from tensorflow.keras import layers, backend, models, utils


In what follows, we will create many batch normalization layers. All of them will use the same epsilon value, a small float added to the variance to avoid dividing by zero.

In [None]:
# epsilon for Batch Normalization layers
BN_EPS = 1.001e-5


There are five stages of the ResNet, labeled conv1, conv2, conv3, conv4 and conv5 in the paper (first column of the table above). These are followed by a global average pool and a simple fully connected 1000-way classification layer.

Input to the ResNet 50 network is typically a batch of images of size (224, 224, 3). The stage outputs listed below follow the functions built in this tutorial, where the max pooling that the paper's table lists at the start of conv2 is counted as part of conv1.

* Stage conv1 output is (56, 56, 64), i.e. each spatial dimension has shrunk by a factor of 4 to (56, 56) and the number of channels has increased to 64.
* Stage conv2 output is (56, 56, 256).
* Stage conv3 output is (28, 28, 512).
* Stage conv4 output is (14, 14, 1024).
* Stage conv5 output is (7, 7, 2048).
* conv2 has 3 residual blocks, conv3 has 4 residual blocks, conv4 has 6 and conv5 has 3.
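
The shapes and block counts listed above can be reproduced with a few lines of plain Python. This is only a bookkeeping sketch: the names `STAGES`, `stage_output_shapes` and `BLOCKS_PER_STAGE` are made up for illustration, and integer division is assumed to be a good enough model of the downsampling (it is exact for a 224x224 input).

```python
# Hypothetical helper: trace (height, width, channels) through the five stages.
# Each entry is (stage name, spatial downsampling factor, output channels),
# taken from the list above.
STAGES = [
    ("conv1", 4, 64),     # stride-2 convolution followed by stride-2 max pooling
    ("conv2", 1, 256),
    ("conv3", 2, 512),
    ("conv4", 2, 1024),
    ("conv5", 2, 2048),
]

def stage_output_shapes(height=224, width=224):
    """Return a dict mapping stage name to its (height, width, channels) output."""
    shapes = {}
    for name, factor, channels in STAGES:
        height //= factor
        width //= factor
        shapes[name] = (height, width, channels)
    return shapes

shapes = stage_output_shapes()
for name, shape in shapes.items():
    print(name, shape)

# Where the "50" comes from under the usual counting convention:
# one convolution in conv1, three convolutions per residual block,
# and the final fully connected layer (shortcut convolutions are not counted).
BLOCKS_PER_STAGE = [3, 4, 6, 3]
num_layers = 1 + 3 * sum(BLOCKS_PER_STAGE) + 1
print(num_layers)
```

Calling `stage_output_shapes(448, 448)` gives a rough idea of how the feature map sizes scale with a larger input image.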


## The Residual Blocks
Let's start by defining functions for building the residual blocks in the ResNet50 network. We will slowly increase the complexity of residual blocks to cover all the needs of ResNet 50.

Every residual block essentially consists of three convolutional layers along the residual path and an identity connection from input to output. There are some details which will come up later. Let's look at the residual blocks in conv2 stage.

Output of the conv1 stage is a tensor of size (None, 56, 56, 64). Its implementation will be discussed later.

The first convolutional layer in this residual block is a 1x1 layer with 64 filters. The second one is a 3x3 layer with 64 filters. The last is again a 1x1 layer, with 4 times as many filters (256).

For now, let's just build the convolutional layers for the residual path.

In [None]:
def conv_131(input, filters):
    # number of channels in output tensor
    num_output_channels = 4 * filters
    # The 1x1 first convolution layer
    net = layers.Conv2D(filters, 1)(input)
    net = layers.BatchNormalization(epsilon=BN_EPS)(net)
    net = layers.Activation('relu')(net)
    # The 3x3 second convolution layer
    net = layers.Conv2D(filters, 3, padding='same')(net)
    net = layers.BatchNormalization(epsilon=BN_EPS)(net)
    net = layers.Activation('relu')(net)
    # The 1x1 third convolution layer
    net = layers.Conv2D(num_output_channels, 1)(net)
    net = layers.BatchNormalization(epsilon=BN_EPS)(net)
    net = layers.Activation('relu')(net)
    return net


The 1x1 layers don't require a padding parameter: a 1x1 convolution is nothing but an inner product over the channels at each pixel of the input tensor, so it does not change the image size (unless a stride other than 1 is specified). It can, however, change the number of channels from input to output.

Each conv layer is followed by batch normalization and then a relu activation. 
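
Since a convolution is just a set of inner products, its parameter count is easy to work out by hand. The sketch below uses two hypothetical helpers, `conv2d_params` and `batch_norm_params`, and assumes square kernels, a bias term on every convolution, and four values per channel for batch normalization (scale, offset, moving mean, moving variance). The results can be compared with the `Param #` column that `model.summary()` reports.

```python
def conv2d_params(kernel_size, in_channels, out_channels, use_bias=True):
    """Number of parameters in a square 2D convolution."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if use_bias else 0)

def batch_norm_params(channels):
    """Scale, offset, moving mean and moving variance: one value each per channel."""
    return 4 * channels

# The three convolutions of the first conv2 block (64 input channels, filters=64)
first_1x1 = conv2d_params(1, 64, 64)
middle_3x3 = conv2d_params(3, 64, 64)
last_1x1 = conv2d_params(1, 64, 256)
bn_64 = batch_norm_params(64)

print(first_1x1, middle_3x3, last_1x1, bn_64)
```

The 3x3 layer dominates the cost of the block, which is why it is applied to the narrow 64-channel tensor rather than to the 256-channel one.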


## Building a model

We will write a simple function which can build a Keras model from a network building function. A network building function, like `conv_131` above, takes a tensor as input and adds more layers on top of it.

The model building function considers the shape of the input tensor to the network and uses it to create an input layer. It then feeds the input layer to the network building function and builds the whole network. Finally, the input and output are combined to form a Keras model.  

This function will be quite handy for displaying the architecture of any network in the rest of the tutorial.

In [None]:
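def build_model(input_shape, net_fn):
    # input layer with the given shape (batch dimension is implicit)
    img_input = layers.Input(shape=input_shape)
    # let the network building function stack its layers on the input
    net = net_fn(img_input)
    # tie input and output together into a Keras model
    model = models.Model(inputs=img_input, outputs=net)
    return model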

Let's use this function to build a small model consisting of a single block of the three convolutional layers in conv2 stage of ResNet 50. As discussed earlier, the input shape is (56, 56, 64). We then print the model architecture summary.

In [None]:
model = build_model((56, 56, 64), lambda input: conv_131(input, 64))
model.summary()

Model: "functional_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 56, 56, 64)]      0         
_________________________________________________________________
conv2d (Conv2D)              (None, 56, 56, 64)        4160      
_________________________________________________________________
batch_normalization (BatchNo (None, 56, 56, 64)        256       
_________________________________________________________________
activation (Activation)      (None, 56, 56, 64)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 56, 56, 64)        36928     
_________________________________________________________________
batch_normalization_1 (Batch (None, 56, 56, 64)        256       
_________________________________________________________________
activation_1 (Activation)    (None, 56, 56, 64)       

## The identity path
It's time to add the identity path to our block of three convolutional layers. However, there is a catch. The input to the first residual block is (56, 56, 64), but the output of the third layer in this block is (56, 56, 256): the number of channels changes from 64 to 256. Hence, the input cannot be added to the output of the residual path directly. A solution is to add a 1x1 conv layer in the identity path whenever the number of input channels is not equal to the number of output channels (the paper calls this a projection shortcut).

The function `residual_131_v1` below incorporates this. Note the following:

*   We check if the number of input and output channels is different.
*   If yes, then a 1x1 conv layer is added with batch normalization but no relu activation.
*   The output of the last (1x1) conv layer is batch normalized.
*   Then it is added to the output of the shortcut identity path.
*   Finally, the sum undergoes a common relu activation.


In [None]:
def residual_131_v1(input, filters):
    # shape of input tensor
    input_shape = input.shape
    # number of channels in input tensor
    num_input_channels = input_shape[3]
    # number of channels in output tensor
    num_output_channels = 4 * filters
    # if input and output channels are same then we can feed
    # the input directly as identity shortcut
    # otherwise, we need to add a convolutional layer in identity path
    conv_in_identity_path = num_output_channels != num_input_channels
    if conv_in_identity_path:
        # add a conv layer to increase the number of channels
        shortcut = layers.Conv2D(num_output_channels, 1)(input)
        # batch normalize (activation will come later)
        shortcut = layers.BatchNormalization(epsilon=BN_EPS)(shortcut)
    else:
        shortcut = input
    # The 1x1 first convolution layer
    net = layers.Conv2D(filters, 1)(input)
    net = layers.BatchNormalization(epsilon=BN_EPS)(net)
    net = layers.Activation('relu')(net)
    # The 3x3 second convolution layer
    net = layers.Conv2D(filters, 3, padding='same')(net)
    net = layers.BatchNormalization(epsilon=BN_EPS)(net)
    net = layers.Activation('relu')(net)
    # The 1x1 third convolution layer
    net = layers.Conv2D(num_output_channels, 1)(net)
    net = layers.BatchNormalization(epsilon=BN_EPS)(net)
    # Add identity shortcut to residual output before activation
    net = layers.Add()([shortcut, net])
    net = layers.Activation('relu')(net)
    return net


Let's build a model with a residual block and print its summary.

In [None]:
model = build_model((56, 56, 64), lambda input: residual_131_v1(input, 64))
model.summary()

Model: "functional_3"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_2 (InputLayer)            [(None, 56, 56, 64)] 0                                            
__________________________________________________________________________________________________
conv2d_4 (Conv2D)               (None, 56, 56, 64)   4160        input_2[0][0]                    
__________________________________________________________________________________________________
batch_normalization_4 (BatchNor (None, 56, 56, 64)   256         conv2d_4[0][0]                   
__________________________________________________________________________________________________
activation_3 (Activation)       (None, 56, 56, 64)   0           batch_normalization_4[0][0]      
_______________________________________________________________________________________

## The conv1 stage

There is still one problem with the residual block. Before addressing it, let's complete the implementation of the conv1 stage. See the function below.

* Input is a (224, 224, 3) image.
* The first layer is a large 7x7 convolutional layer that downsamples the resolution to (112, 112) and increases the number of channels to 64. We achieve this in two steps:
    * Add a padding of 3 pixels on all sides, increasing the resolution to (230, 230).
    * Perform a 7x7 valid convolution with stride 2 and 64 filters to get an output of (112, 112, 64).
* Next, the output goes through batch normalization and relu activation.
* Finally, we add a padding of 1 pixel and then apply 3x3 max pooling with stride 2 to get an output tensor of size (56, 56, 64).
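
The sizes quoted in these steps follow from the usual output-size formula for a valid convolution or pooling window. A minimal sketch, with a hypothetical helper `output_size` that assumes symmetric zero padding and square windows:

```python
def output_size(input_size, kernel_size, stride, padding):
    """Spatial size after zero padding followed by a 'valid' window operation."""
    padded = input_size + 2 * padding
    return (padded - kernel_size) // stride + 1

# 7x7 convolution, stride 2, after padding 3 pixels on each side
after_conv = output_size(224, kernel_size=7, stride=2, padding=3)
# 3x3 max pooling, stride 2, after padding 1 pixel on each side
after_pool = output_size(after_conv, kernel_size=3, stride=2, padding=1)

print(after_conv, after_pool)
```

The same helper can be used to check what happens with other input resolutions before building the model.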




In [None]:
def conv1(img_input):
    # pad in advance for valid convolution (output is 230x230)
    net = layers.ZeroPadding2D(padding=(3, 3))(img_input)
    # perform the big 7x7 convolution with 2x2 stride to (112x112x64)
    net = layers.Conv2D(64, (7, 7),
                      strides=(2, 2),
                      padding='valid',
                      kernel_initializer='he_normal')(net)
    # batch normalization before activation (output is 112x112x64)
    net = layers.BatchNormalization(epsilon=BN_EPS)(net)
    # relu activation (output is 112x112x64)
    net = layers.Activation('relu')(net)
    # pad again for max pooling (output is 114x114x64)
    net = layers.ZeroPadding2D(padding=(1, 1))(net)
    # 3x3 max pooling with 2x2 stride (output is 56x56x64)
    net = layers.MaxPooling2D((3, 3), strides=(2, 2))(net)
    return net


Let's combine the conv1 stage with our first residual block to see if everything is working; build a partial residual network and print it.

In [None]:
def partial_resnet_v1(img_input):
    net = conv1(img_input)
    net = residual_131_v1(net, 64)
    return net

model = build_model((224, 224, 3), partial_resnet_v1)
model.summary()

Model: "functional_5"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_3 (InputLayer)            [(None, 224, 224, 3) 0                                            
__________________________________________________________________________________________________
zero_padding2d (ZeroPadding2D)  (None, 230, 230, 3)  0           input_3[0][0]                    
__________________________________________________________________________________________________
conv2d_7 (Conv2D)               (None, 112, 112, 64) 9472        zero_padding2d[0][0]             
__________________________________________________________________________________________________
batch_normalization_7 (BatchNor (None, 112, 112, 64) 256         conv2d_7[0][0]                   
_______________________________________________________________________________________

## The stack of all residual blocks of conv2 stage

Everything is going well so far. Recall from the network table that the conv2 stage for ResNet 50 has 3 residual blocks. Let's write a simple function to build all of them and combine them.

This is also known as the stack of residual blocks.

In [None]:
def residual_stack_v1(input, filters, blocks):
    net = input
    for i in range(blocks):
        net = residual_131_v1(net, filters)
    return net


If you have been paying attention, you may notice something interesting. 



* Output of first residual block is (56, 56, 256).
* Output of second residual block is also (56, 56, 256).
* Thus, for the second (and also the third) residual block, the number of channels is the same in both input and output.
* Hence, the identity connection doesn't require any 1x1 convolutional layer.
* In fact, the first 1x1 conv layer in the second residual block actually reduces the number of channels from 256 back to 64 and the third layer increases it again to 256.
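
The observations above can be checked by tracing only the channel counts through a stack, without building any layers. The helper `trace_stack` below is hypothetical; it applies the same rule as `residual_131_v1`, namely that a 1x1 convolution is needed in the identity path only when the input and output channel counts differ.

```python
def trace_stack(in_channels, filters, blocks):
    """Return one (in_channels, bottleneck, out_channels, needs_projection) tuple per block."""
    trace = []
    for _ in range(blocks):
        out_channels = 4 * filters
        needs_projection = in_channels != out_channels
        trace.append((in_channels, filters, out_channels, needs_projection))
        # the output of this block is the input of the next one
        in_channels = out_channels
    return trace

conv2_trace = trace_stack(in_channels=64, filters=64, blocks=3)
for block in conv2_trace:
    print(block)
```

Only the first block of the stack reports `needs_projection=True`; the other two go from 256 channels down to 64 and back up to 256.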

Let's now build a partial resnet with both conv1 and conv2 stages complete.



In [None]:
def partial_resnet_v2(img_input):
    net = conv1(img_input)
    net = residual_stack_v1(net, 64, 3)
    return net

model = build_model((224, 224, 3), partial_resnet_v2)
model.summary()

Model: "functional_7"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_4 (InputLayer)            [(None, 224, 224, 3) 0                                            
__________________________________________________________________________________________________
zero_padding2d_2 (ZeroPadding2D (None, 230, 230, 3)  0           input_4[0][0]                    
__________________________________________________________________________________________________
conv2d_12 (Conv2D)              (None, 112, 112, 64) 9472        zero_padding2d_2[0][0]           
__________________________________________________________________________________________________
batch_normalization_12 (BatchNo (None, 112, 112, 64) 256         conv2d_12[0][0]                  
_______________________________________________________________________________________

## Size reduction from one stage to next stage

Look back at the architecture table. conv2 stage has output size of (56, 56) but the conv3 stage works on a size of (28, 28). We need to perform a downsampling here. This is the job of the first convolution layer in the first residual block of a particular stage of the network (conv3, conv4, conv5). 

A little modification in the residual 131 block generation function achieves this. See the code below:

* A new parameter `stride1` has been introduced. This only applies to the first 1x1 convolution layer.
* If `stride1=2`, then the first 1x1 conv layer reduces the input size by a factor of 4 (2 in width, 2 in height).
* The other two conv layers remain unchanged.
* This change also applies to the identity path. As the output size is halved, the 1x1 conv layer in the identity path also needs a stride of 2.
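
To see what a stride of 2 means for a 1x1 convolution, the following plain-Python sketch lists the pixel positions such a layer reads along one axis. The helper `strided_positions` is hypothetical and assumes the layer starts at index 0 with no padding, which matches a 1x1 convolution with the default `'valid'` padding.

```python
def strided_positions(size, stride):
    """Indices along one axis that a 1x1 convolution with the given stride reads."""
    return list(range(0, size, stride))

rows = strided_positions(56, 2)
cols = strided_positions(56, 2)

# fraction of the 56x56 input positions that contribute to the output
kept_fraction = (len(rows) * len(cols)) / (56 * 56)

print(len(rows), len(cols), kept_fraction)
print(rows[:4])
```

Three quarters of the input positions are never read by the strided layer, which is the factor-of-4 reduction in size mentioned above.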

In [None]:
def residual_131_v2(input, filters, stride1=1):
    # shape of input tensor
    input_shape = input.shape
    # number of channels in input tensor
    num_input_channels = input_shape[3]
    # number of channels in output tensor
    num_output_channels = 4 * filters
    # if input and output channels are same then we can feed
    # the input directly as identity shortcut
    # otherwise, we need to add a convolutional layer in identity path
    conv_in_identity_path = num_output_channels != num_input_channels
    if conv_in_identity_path:
        # add a conv layer to increase the number of channels
        shortcut = layers.Conv2D(num_output_channels, 1, strides=stride1)(input)
        # batch normalize (activation will come later)
        shortcut = layers.BatchNormalization(epsilon=BN_EPS)(shortcut)
    else:
        shortcut = input
    # The 1x1 first convolution layer
    net = layers.Conv2D(filters, 1, strides=stride1)(input)
    net = layers.BatchNormalization(epsilon=BN_EPS)(net)
    net = layers.Activation('relu')(net)
    # The 3x3 second convolution layer
    net = layers.Conv2D(filters, 3, padding='same')(net)
    net = layers.BatchNormalization(epsilon=BN_EPS)(net)
    net = layers.Activation('relu')(net)
    # The 1x1 third convolution layer
    net = layers.Conv2D(num_output_channels, 1)(net)
    net = layers.BatchNormalization(epsilon=BN_EPS)(net)
    # Add identity shortcut to residual output before activation
    net = layers.Add()([shortcut, net])
    net = layers.Activation('relu')(net)
    return net


Next, we need to modify our stack-building function. The first block in the stack has a stride of 2 in the conv3, conv4 and conv5 stages and a stride of 1 in the conv2 stage. All other blocks in the stack have a stride of 1.

In [None]:
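def residual_stack_v2(input, filters, blocks, stride1=2):
    # the first block changes the number of channels and may also
    # downsample (stride1=2), so it uses the strided v2 block
    net = residual_131_v2(input, filters, stride1=stride1)
    # the remaining blocks keep both size and channels, so the
    # stride-1 v1 block is sufficient
    for _ in range(blocks - 1):
        net = residual_131_v1(net, filters)
    return net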


## The complete ResNet 50 CNN

We are now ready to build all five stages of the ResNet 50 CNN. See the code below. The only thing missing is the classification layer on top of the CNN.

In [None]:
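def cnn_resnet50(img_input):
    # conv1: (224, 224, 3) -> (56, 56, 64)
    net = conv1(img_input)
    # conv2: 3 blocks, no downsampling -> (56, 56, 256)
    net = residual_stack_v2(net, 64, 3, stride1=1)
    # conv3: 4 blocks -> (28, 28, 512)
    net = residual_stack_v2(net, 128, 4)
    # conv4: 6 blocks -> (14, 14, 1024)
    net = residual_stack_v2(net, 256, 6)
    # conv5: 3 blocks -> (7, 7, 2048)
    net = residual_stack_v2(net, 512, 3)
    return net

model = build_model((224, 224, 3), cnn_resnet50)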
model.summary()

Model: "functional_9"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_5 (InputLayer)            [(None, 224, 224, 3) 0                                            
__________________________________________________________________________________________________
zero_padding2d_4 (ZeroPadding2D (None, 230, 230, 3)  0           input_5[0][0]                    
__________________________________________________________________________________________________
conv2d_23 (Conv2D)              (None, 112, 112, 64) 9472        zero_padding2d_4[0][0]           
__________________________________________________________________________________________________
batch_normalization_23 (BatchNo (None, 112, 112, 64) 256         conv2d_23[0][0]                  
_______________________________________________________________________________________

## The classification layer

It's important to keep the classification layer (called the top in the Keras docs) separate. The CNN part can work on images of larger sizes too. A (224, 224, 3) input image size is necessary only if the network is being used for classification purposes. For transfer learning, people often load the CNN (without the top classification layer) with pre-trained weights and train a different classification layer on top of it.

The following is the top classification layer for ResNets. If the pretrained weights for ImageNet classification are to be loaded, then the number of classes will be 1000.


* The output of the CNN is of size (7, 7, 2048).
* There is a simple global average pooling reducing the size to a vector of length 2048.
* This is followed by a dense layer with 1000 outputs (2048 * 1000 + 1000 weights).
* The output of the dense layer goes through a softmax activation to convert it to classification probabilities.
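
To make the pooling and softmax steps concrete, here is a toy version in plain Python. The helpers `global_average_pool` and `softmax` are hypothetical stand-ins for the Keras layers, the feature map is a made-up 2x2x3 example rather than a real (7, 7, 2048) tensor, and the dense layer is left out so that the softmax is applied directly to the pooled vector.

```python
import math

def global_average_pool(feature_map):
    """Average a (height, width, channels) nested list over height and width."""
    height = len(feature_map)
    width = len(feature_map[0])
    channels = len(feature_map[0][0])
    totals = [0.0] * channels
    for row in feature_map:
        for pixel in row:
            for c, value in enumerate(pixel):
                totals[c] += value
    return [total / (height * width) for total in totals]

def softmax(logits):
    """Convert a list of scores to probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# toy 2x2 feature map with 3 channels
feature_map = [
    [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
    [[1.0, 4.0, 2.0], [3.0, 0.0, 2.0]],
]
pooled = global_average_pool(feature_map)
probabilities = softmax(pooled)

# parameters of the real dense layer: one weight per (input, class) pair plus biases
dense_params = 2048 * 1000 + 1000

print(pooled, probabilities, dense_params)
```

Because the pooling averages over all spatial positions, the length of the pooled vector depends only on the number of channels and not on the image size.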



In [None]:
def top_resnet(net, classes=1000):
    # add the top classification network
    net = layers.GlobalAveragePooling2D()(net)
    net = layers.Dense(classes, activation='softmax')(net)
    return net

We now combine the CNN with the classifier to build the overall ResNet 50 classifier and build a model around it.

Voila! It's done. We have a complete ResNet 50 architecture built for us.

In [None]:
def resnet_classifier(img_input):
    net = cnn_resnet50(img_input)
    net = top_resnet(net)
    return net

model = build_model((224, 224, 3), resnet_classifier)
model.summary()

Model: "functional_11"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_6 (InputLayer)            [(None, 224, 224, 3) 0                                            
__________________________________________________________________________________________________
zero_padding2d_6 (ZeroPadding2D (None, 230, 230, 3)  0           input_6[0][0]                    
__________________________________________________________________________________________________
conv2d_76 (Conv2D)              (None, 112, 112, 64) 9472        zero_padding2d_6[0][0]           
__________________________________________________________________________________________________
batch_normalization_76 (BatchNo (None, 112, 112, 64) 256         conv2d_76[0][0]                  
______________________________________________________________________________________

## Where to go from here

There are some additional details involved in a robust implementation which have been skipped in this tutorial.

* Every layer should be named appropriately.
* Some image processing systems may follow a channels-first approach where the image dimensions are (3, 224, 224).
* Every image needs to go through basic preprocessing. This involves converting the image to BGR format (from RGB) if necessary, then zero-centering each color channel with respect to the ImageNet dataset.
* Loading the pretrained weights using the `model.load_weights` function.
* Add a global pooling if the top classification network is ignored.
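
As an illustration of the preprocessing item above, the following plain-Python sketch converts RGB pixels to BGR and subtracts a per-channel mean. The mean values are assumed to be the ImageNet BGR means used by the Caffe-style preprocessing in Keras Applications, and the helper names `preprocess_pixel` and `preprocess_image` are hypothetical; in practice the library's own `preprocess_input` function should be preferred.

```python
# Assumed per-channel ImageNet means in BGR order (Caffe-style preprocessing)
IMAGENET_BGR_MEANS = (103.939, 116.779, 123.68)

def preprocess_pixel(rgb_pixel):
    """Reorder one (R, G, B) pixel to BGR and zero-center each channel."""
    red, green, blue = rgb_pixel
    bgr = (blue, green, red)
    return tuple(value - mean for value, mean in zip(bgr, IMAGENET_BGR_MEANS))

def preprocess_image(image):
    """Apply the per-pixel transform to a (height, width, 3) nested list."""
    return [[preprocess_pixel(pixel) for pixel in row] for row in image]

# a 1x2 toy image: one pure red pixel and one black pixel
image = [[(255, 0, 0), (0, 0, 0)]]
processed = preprocess_image(image)

print(processed)
```

Note that the pixel values are not rescaled to the range 0 to 1 in this scheme; they are only reordered and shifted.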

For a more complete implementation addressing these aspects, please look at the source code in the Keras Applications repository mentioned above.

We hope that we have been able to give you first-hand experience of building large and complex convolutional networks easily with the help of Keras.

Enjoy! 