# Implement ResNeXt with Keras

*By **Carmel WENGA**, Deep Learning Engineer, Nzhinusoft*


**ResNeXt** : an image classification model proposed by *Xie et al. (2017)* in <a href="https://arxiv.org/pdf/1611.05431.pdf">Aggregated Residual Transformations for Deep Neural Networks</a>

## Import Requirements

In [1]:
from keras.layers import Activation
from keras.layers import Add
from keras.layers import AveragePooling2D
from keras.layers import BatchNormalization
from keras.layers import Concatenate
from keras.layers import Conv2D
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import GlobalMaxPooling2D
from keras.layers import Input
from keras.layers import Lambda
from keras.layers import MaxPooling2D
from keras.layers import ZeroPadding2D
from keras.models import Model
from keras.optimizers import Adam
from keras.initializers import glorot_uniform

Using TensorFlow backend.


# 1. Implement Split-Transform-Merge functions

## a. The Split function

The original implementation of **ResNeXt** uses grouped convolutions to implement the split operation. The split operation divides the input channels into groups. The total number of groups is defined by the cardinality, which equals 4 in our case. The building block implemented here is shown in the following image. It is the first reformulation of the original building block of the **ResNeXt** model.

<img src="images/building-block.png" width="300"/>

Inputs are represented as 4-dimensional tensors whose last dimension is the number of channels. If the input's shape is $(x,y,z,c)$, the `split()` function returns a list (`groups`) of 4 tensors with the following shapes: $(x,y,z,c_1)$, $(x,y,z,c_2)$, $(x,y,z,c_3)$ and $(x,y,z,c_4)$. The number of channels per group (`group_size`) equals the number of input channels (`inputs_channels`) divided by the cardinality.

In [2]:
def split(inputs, cardinality):
    inputs_channels = inputs.shape[3]
    group_size = inputs_channels // cardinality    
    groups = list()
    for number in range(1, cardinality+1):
        begin = int((number-1)*group_size)
        end = int(number*group_size)
        # bind begin/end as default arguments so each Lambda keeps its own slice bounds
        block = Lambda(lambda x, b=begin, e=end: x[:, :, :, b:e])(inputs)
        groups.append(block)
    return groups
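
As a quick check of the slicing logic, the following sketch reproduces only the index arithmetic of `split()` in plain Python, without Keras. The helper name `group_bounds` is hypothetical and is not part of the notebook.

```python
def group_bounds(inputs_channels, cardinality):
    """Return the (begin, end) channel indices of each group, as computed in split()."""
    group_size = inputs_channels // cardinality
    return [((number - 1) * group_size, number * group_size)
            for number in range(1, cardinality + 1)]


# 64 input channels divided among 4 groups of 16 channels each
print(group_bounds(64, 4))   # [(0, 16), (16, 32), (32, 48), (48, 64)]

# When the channel count is not a multiple of the cardinality,
# the trailing channels (here 64 and 65) fall outside every group
print(group_bounds(66, 4))   # [(0, 16), (16, 32), (32, 48), (48, 64)]
```

Because of the integer division, the split only covers every channel when the number of input channels is a multiple of the cardinality, which is the case for all blocks built in this notebook.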

## b. Transform and Merge operations

After the split operation, the tensors corresponding to each group are transformed by a set of convolutional operations in the `transform()` function. As shown in the figure above, these operations are a $1\times1$ and a $3\times3$ convolutional layer (`Conv2D`), in that order. Each convolutional layer is followed by a `BatchNormalization` layer and a `ReLU` activation. Once the transformation has been applied to all groups, the resulting tensors are concatenated by the `Concatenate` layer. This is the **Merge** operation.

In [3]:
def transform(groups, filters, strides, stage, block):
    f1, f2 = filters    
    conv_name = "conv2d-{stage}{block}-branch".format(stage=str(stage), block=str(block))
    bn_name = "batchnorm-{stage}{block}-branch".format(stage=str(stage), block=str(block))
    
    transformed_tensor = list()
    i = 1
    
    for inputs in groups:
        # first conv of the transformation phase
        x = Conv2D(filters=f1, kernel_size=(1,1), strides=strides, padding="valid", 
                   name=conv_name+'1a_split'+str(i), kernel_initializer=glorot_uniform(seed=0))(inputs)
        x = BatchNormalization(axis=3, name=bn_name+'1a_split'+str(i))(x)
        x = Activation('relu')(x)

        # second conv of the transformation phase
        x = Conv2D(filters=f2, kernel_size=(3,3), strides=(1,1), padding="same", 
                   name=conv_name+'1b_split'+str(i), kernel_initializer=glorot_uniform(seed=0))(x)
        x = BatchNormalization(axis=3, name=bn_name+'1b_split'+str(i))(x)
        x = Activation('relu')(x)
        
        # Add x to transformed tensor list
        transformed_tensor.append(x)
        i+=1
        
    # Concatenate the tensors from all groups
    x = Concatenate(name='concat'+str(stage)+''+block)(transformed_tensor)
    
    return x
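
The channel bookkeeping of the transform and merge steps can be checked by hand. The sketch below is plain Python arithmetic, not Keras code. `conv_params` and `merged_channels` are hypothetical helper names, and the parameter count assumes a standard convolution with one bias per filter and ignores the batch normalization parameters.

```python
def conv_params(kernel_size, in_channels, filters):
    """Weights plus biases of a standard 2D convolution with a square kernel."""
    return kernel_size * kernel_size * in_channels * filters + filters


def merged_channels(cardinality, f2):
    """Channels after concatenating `cardinality` branches of f2 channels each."""
    return cardinality * f2


# Stage 2, block "a": 64 input channels, cardinality 4, filters (128, 128, 256)
cardinality = 4
group_size = 64 // cardinality   # 16 channels per group
f1 = 128 // cardinality          # 32 filters in each 1x1 convolution
f2 = 128 // cardinality          # 32 filters in each 3x3 convolution

per_branch = conv_params(1, group_size, f1) + conv_params(3, f1, f2)
print(merged_channels(cardinality, f2))   # 128
print(per_branch)                         # 9792
print(cardinality * per_branch)           # 39168
```

The same helper gives `conv_params(7, 3, 64) == 9472` for the first convolution of the network, which is the value reported for `conv2d_1` in the model summary at the end of this notebook.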

## c. The Transition phase

The last layer of the building block above applies a $1\times1$ convolution to the tensor returned by the `transform()` function. This is the role of the `transition()` function.

In [4]:
def transition(inputs, filters, stage, block):
    x = Conv2D(filters=filters, kernel_size=(1,1), strides=(1,1), padding="valid", 
                   name='conv2d-trans'+str(stage)+''+block, kernel_initializer=glorot_uniform(seed=0))(inputs)
    x = BatchNormalization(axis=3, name='batchnorm-trans'+str(stage)+''+block)(x)
    x = Activation('relu')(x)
    
    return x

# 2. Implement Residual Blocks

To implement the **ResNeXt** architecture, we consider two types of residual (building) blocks: `identity_block` and `downsampling`. As described in the figure below, **ResNeXt-50** is composed of four stages that contain 3, 4, 6 and 3 building blocks respectively.

<img src="images/resnext-archi.png" width="350"/>
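
The "50" in the name can be recovered from these block counts. The following arithmetic sketch assumes that every building block contributes three convolutional layers ($1\times1$, $3\times3$, $1\times1$), that the initial convolution and the final fully connected layer each count as one layer, and that projection shortcuts are not counted.

```python
blocks_per_stage = [3, 4, 6, 3]   # building blocks in the four stages
convs_per_block = 3               # 1x1, 3x3 (grouped), 1x1

total_blocks = sum(blocks_per_stage)
# first convolution + convolutions inside the blocks + final dense layer
weighted_layers = 1 + total_blocks * convs_per_block + 1

print(total_blocks)      # 16
print(weighted_layers)   # 50
```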

As described by **He *et al.*** (2016) in their paper <a href="https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf">Deep Residual Learning for Image Recognition</a>, ***identity shortcuts*** are used when the input and the output have the same dimensions. Within a stage, the building blocks that follow the first one have the same input and output dimensions. They are identity blocks, so the ***identity shortcut*** defined by the function `identity_block()` can be used. The important parameters of this function are:

<ul>
    <li> <b>inputs</b> : the input tensor of the building block</li>
    <li> <b>filters</b> : a tuple of three filter counts giving the output channels of the first, second and third convolutional operations of the building block</li>
    <li> <b>strides</b> : equal to (1,1) for identity blocks</li>
</ul>
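
The rule for choosing between the two block types can be written as a small check. This is an illustrative sketch only: `shortcut_type` is a hypothetical helper that exists neither in this notebook nor in Keras.

```python
def shortcut_type(in_channels, out_channels, strides=(1, 1)):
    """Pick the shortcut needed so that Add() receives tensors of equal shape."""
    same_channels = in_channels == out_channels
    same_resolution = strides == (1, 1)
    return "identity" if same_channels and same_resolution else "projection"


# Inside a stage: 256 channels in, f3 = 256 channels out, stride 1
print(shortcut_type(256, 256))           # identity
# Entering the next stage: 256 channels in, f3 = 512 channels out, stride 2
print(shortcut_type(256, 512, (2, 2)))   # projection
```

An identity block therefore only works when the third value of `filters` equals the number of channels of `inputs`.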


In [5]:
def identity_block(inputs, filters, cardinality, stage, block, strides=(1,1)):
    
    conv_name = "conv2d-{stage}{block}-branch".format(stage=str(stage),block=str(block))
    bn_name = "batchnorm-{stage}{block}-branch".format(stage=str(stage),block=str(block))
    
    #save the input tensor value
    x_shortcut = inputs
    x = inputs
    
    f1, f2, f3 = filters
    
    # divide input channels into groups. The number of groups is defined by the cardinality param
    groups = split(inputs=x, cardinality=cardinality)
    
    # transform each group by doing a set of convolutions and concat the results
    f1 = int(f1 / cardinality)
    f2 = int(f2 / cardinality)
    x = transform(groups=groups, filters=(f1, f2), strides=strides, stage=stage, block=block)
    
    # make a transition by doing 1x1 conv
    x = transition(inputs=x, filters=f3, stage=stage, block=block)
    
    # Last step of the identity block: add the shortcut to the main path
    x = Add()([x,x_shortcut])
    x = Activation('relu')(x)
    
    return x

When moving from one stage to the next, the input's height and width are divided by 2 using strides of 2. The input and output then have different dimensions. In this case, we use the `downsampling()` function, which applies a **projection shortcut** to match dimensions before the `Add()` operation.
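
To see why the projection shortcut lines up with the main path, the output size of each convolution can be computed by hand. The sketch below uses the standard output-size formulas for `"valid"` and `"same"` padding. `conv_output_size` is a hypothetical helper, and the $28\times28$ input with `f3 = 512` is an illustrative choice.

```python
import math


def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    if padding == "valid":
        return (size - kernel) // stride + 1
    if padding == "same":
        return math.ceil(size / stride)
    raise ValueError("padding must be 'valid' or 'same'")


height = 28   # assumed spatial size of the block input
f3 = 512      # output channels of the block

# Main path: 1x1 conv (stride 2, valid), 3x3 conv (stride 1, same), 1x1 transition
main = conv_output_size(height, 1, 2, "valid")
main = conv_output_size(main, 3, 1, "same")
main = conv_output_size(main, 1, 1, "valid")

# Projection shortcut: a single 1x1 conv with stride 2 and f3 filters
shortcut = conv_output_size(height, 1, 2, "valid")

print((main, main, f3), (shortcut, shortcut, f3))   # (14, 14, 512) (14, 14, 512)
```

Both paths end with the same height, width and channel count, so `Add()` can sum them element by element.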

In [6]:
def downsampling(inputs, filters, cardinality, strides, stage, block):
    
    # useful variables
    conv_name = "conv2d-{stage}{block}-branch".format(stage=str(stage), block=str(block))
    bn_name = "batchnorm-{stage}{block}-branch".format(stage=str(stage), block=str(block))
    
    # Retrieve filters for each layer
    f1, f2, f3 = filters
    
    # save the input tensor value
    x_shortcut = inputs
    x = inputs
    
    # divide input channels into groups. The number of groups is defined by the cardinality param
    groups = split(inputs=x, cardinality=cardinality)
    
    # transform each group by doing a set of convolutions and concat the results
    f1 = int(f1 / cardinality)
    f2 = int(f2 / cardinality)
    x = transform(groups=groups, filters=(f1, f2), strides=strides, stage=stage, block=block)
    
    # make a transition by doing 1x1 conv
    x = transition(inputs=x, filters=f3, stage=stage, block=block)
    
    # Projection Shortcut to match dimensions 
    x_shortcut = Conv2D(filters=f3, kernel_size=(1,1), strides=strides, padding="valid", 
               name='{base}2'.format(base=conv_name), kernel_initializer=glorot_uniform(seed=0))(x_shortcut)
    x_shortcut = BatchNormalization(axis=3, name='{base}2'.format(base=bn_name))(x_shortcut)
    
    # Add x and x_shortcut
    x = Add()([x,x_shortcut])
    x = Activation('relu')(x)
    
    return x

# 3. ResNeXt-50

The following `ResNeXt50()` function uses the previously defined functions to build the **ResNeXt-50** architecture. We consider the first stage of the architecture to be the first convolutional layer (`Conv2D`) together with the three operations that follow it. All other stages start with a `downsampling` building block, and the rest of their blocks are `identity_block` blocks. The building block stages are followed by an average pooling layer (`AveragePooling2D`). The network ends with a fully connected layer, implemented by the `Dense` layer.

In [7]:
def ResNeXt50(input_shape, classes):
    
    # Transform input to a tensor of shape input_shape 
    x_input = Input(input_shape)
    
    # Add zero padding
    x = ZeroPadding2D((3,3))(x_input)
    
    # Initial Stage. Let's say stage 1
    x = Conv2D(filters=64, kernel_size=(7,7), strides=(2,2), 
               name='conv2d_1', kernel_initializer=glorot_uniform(seed=0))(x)
    x = BatchNormalization(axis=3, name='batchnorm_1')(x)
    x = Activation('relu')(x)
    x = MaxPooling2D((3,3), strides=(2,2))(x)
    
    # Stage 2
    x = downsampling(inputs=x, filters=(128,128,256), cardinality=4, strides=(2,2), stage=2, block="a")
    x = identity_block(inputs=x, filters=(128,128,256), cardinality=4, stage=2, block="b")
    x = identity_block(inputs=x, filters=(128,128,256), cardinality=4, stage=2, block="c")
    
    
    # Stage 3
    x = downsampling(inputs=x, filters=(256,256,512), cardinality=4, strides=(2,2), stage=3, block="a")
    x = identity_block(inputs=x, filters=(256,256,512), cardinality=4, stage=3, block="b")
    x = identity_block(inputs=x, filters=(256,256,512), cardinality=4, stage=3, block="c")
    x = identity_block(inputs=x, filters=(256,256,512), cardinality=4, stage=3, block="d")
    
    
    # Stage 4
    x = downsampling(inputs=x, filters=(512,512,1024), cardinality=4, strides=(2,2), stage=4, block="a")
    x = identity_block(inputs=x, filters=(512,512,1024), cardinality=4, stage=4, block="b")
    x = identity_block(inputs=x, filters=(512,512,1024), cardinality=4, stage=4, block="c")
    x = identity_block(inputs=x, filters=(512,512,1024), cardinality=4, stage=4, block="d")
    x = identity_block(inputs=x, filters=(512,512,1024), cardinality=4, stage=4, block="e")
    x = identity_block(inputs=x, filters=(512,512,1024), cardinality=4, stage=4, block="f")
    
    
    # Stage 5
    x = downsampling(inputs=x, filters=(1024,1024,2048), cardinality=4, strides=(2,2), stage=5, block="a")
    x = identity_block(inputs=x, filters=(1024,1024,2048), cardinality=4, stage=5, block="b")
    x = identity_block(inputs=x, filters=(1024,1024,2048), cardinality=4, stage=5, block="c")
    
    
    # Average pooling
    x = AveragePooling2D(pool_size=(2,2), padding="same")(x)
    
    # Output layer
    x = Flatten()(x)
    x = Dense(classes, activation="softmax", kernel_initializer=glorot_uniform(seed=0), 
              name="fc{cls}".format(cls=str(classes)))(x)
    
    # Create the model
    model = Model(inputs=x_input, outputs=x, name="resnext50")
    
    return model

Build the **ResNeXt50** model with input shape of $(224,224,3)$ for 2 classes.
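
With this input size, the spatial resolution at each step of `ResNeXt50()` can be traced by hand. The sketch below is plain arithmetic rather than Keras code. It assumes the default `"valid"` padding of `Conv2D` and `MaxPooling2D`, and a pooling stride equal to the pool size where none is given. The helper name `valid` is hypothetical.

```python
import math


def valid(size, kernel, stride):
    """Output size along one dimension with 'valid' padding."""
    return (size - kernel) // stride + 1


size = 224
size += 2 * 3                 # ZeroPadding2D((3, 3))             -> 230
size = valid(size, 7, 2)      # 7x7 conv, stride 2                -> 112
size = valid(size, 3, 2)      # 3x3 max pooling, stride 2         -> 55
for stage in (2, 3, 4, 5):
    size = valid(size, 1, 2)  # downsampling block, 1x1 stride 2  -> 28, 14, 7, 4
size = math.ceil(size / 2)    # 2x2 average pooling, 'same'       -> 2

flattened = size * size * 2048   # 2048 channels after stage 5
print(size, flattened)           # 2 8192
```

The first two values, 230 and 112, agree with the output shapes printed by `model.summary()` at the end of this notebook.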

In [8]:
model = ResNeXt50(input_shape=(224,224,3), classes=2)



In [9]:
optimizer = Adam(lr=0.0001)
model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])

In [10]:
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 224, 224, 3)  0                                            
__________________________________________________________________________________________________
zero_padding2d_1 (ZeroPadding2D (None, 230, 230, 3)  0           input_1[0][0]                    
__________________________________________________________________________________________________
conv2d_1 (Conv2D)               (None, 112, 112, 64) 9472        zero_padding2d_1[0][0]           
__________________________________________________________________________________________________
batchnorm_1 (BatchNormalization (None, 112, 112, 64) 256         conv2d_1[0][0]                   
__________________________________________________________________________________________________
activation

# References

<ol>
    <li> <i>Xie et al. (2017)</i> <a href="https://arxiv.org/pdf/1611.05431.pdf">Aggregated Residual Transformations for Deep Neural Networks</a> </li>
    <li> <i>He et al. (2016)</i> <a href="https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf">Deep Residual Learning for Image Recognition</a> </li>
    <li> <i> Marco Peixeiro </i> GitHub repository : <a href="https://github.com/marcopeix/Deep_Learning_AI/blob/master/4.Convolutional%20Neural%20Networks/2.Deep%20Convolutional%20Models/Residual%20Networks.ipynb">Residual Networks</a></li>
    <li> GitHub Repository: <a href="https://github.com/taki0112/ResNeXt-Tensorflow#what-is-the-transition-">ResNeXt-Tensorflow</a></li>
</ol>