In [1]:
# Import Dependencies
import torch
import torch.nn as nn

# a) Space to Depth Layer

1. In a Convolutional Neural Network, the input is usually an RGB image with a height, width and number of channels, i.e. a shape like `(H,W,C)`.

2. If the input image is large, e.g. `(512,512,3)`, and is followed by the usual convolutional layers, we have two options to reduce the feature size immediately, which matters especially in a resource-constrained environment like edge hardware.

3. Use a Conv2D layer with a larger kernel size, e.g. (7,7) or (5,5), along with a large stride, e.g. (2,2) or (3,3). This leads to a smaller output, following the output-size formula for a Conv2D layer: `floor((W - K + 2 * P) / S) + 1`, where `W`: Input Width, `K`: Kernel Size, `P`: Padding Size, `S`: Stride. For example, for an input image of shape `(512,512,3)`, using `32 filters` with a `Kernel Size (K) of (5,5)`, a `Stride (S) of (3,3)` and a `Padding (P) of 1`, we reduce the input feature dimensions from `(512,512,3)` to `(170,170,32)`.
In this process, we lose a lot of input information, both in space, i.e. height and width, and in depth, i.e. the channels.

4. The second option is to re-arrange the spatial information in the input so that the spatial size of the features is reduced while comparatively less information is lost. This is what a `Space-to-Depth Layer` does: it re-arranges blocks of spatial data, i.e. the width and height information, into depth. More specifically, the op outputs a copy of the input tensor where values from the height and width dimensions are moved to the depth dimension. The `block_size` attribute indicates the input block size.

```Important Points for implementing Space to Depth layer:```

1. Non-overlapping blocks of size `block_size x block_size` are rearranged into depth at each location.

2. The depth of the output tensor is `block_size * block_size * input_depth`.

3. The Y, X coordinates within each block of the input become the high order component of the output channel index.

4. The input tensor's height and width must be divisible by `block_size`.

In [2]:
# Normal Convolution Layer
conv_layer = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=(8,8), stride=(8,8), padding=1)
conv_layer

Conv2d(3, 32, kernel_size=(8, 8), stride=(8, 8), padding=(1, 1))
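The Conv2D output-size formula quoted earlier can be checked with a few lines of plain Python. This is a minimal sketch; the helper name `conv2d_output_size` is hypothetical and not part of PyTorch.

```python
def conv2d_output_size(w: int, k: int, p: int, s: int) -> int:
    """Output width of a Conv2D layer: floor((W - K + 2 * P) / S) + 1."""
    return (w - k + 2 * p) // s + 1


# Example from the text: 512-wide input, 5x5 kernel, padding 1, stride 3
size_k5_s3 = conv2d_output_size(w=512, k=5, p=1, s=3)
# The layer defined above: 8x8 kernel, padding 1, stride 8
size_k8_s8 = conv2d_output_size(w=512, k=8, p=1, s=8)

print(size_k5_s3)  # 170
print(size_k8_s8)  # 64
```

The same formula applies to the height, since the kernels and strides here are square.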

In [3]:
# Space to Depth Layer

def space2Depth(x, block_size):
    # Input Shape
    n,c,h,w = x.size()
    
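    # Input Depth - number of Input channels
    # Output Channels - block_size * block_size * input_depth
    # Kernel Size - (block_size, block_size)
    # Note: this strided convolution is created with fresh random weights on every call,
    # so it reproduces the space-to-depth output shape but mixes the values inside each
    # block rather than purely re-arranging them
    s_to_d = nn.Conv2d(in_channels=c, out_channels=c * block_size**2, kernel_size=(block_size, block_size), stride=block_size)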
    
    return s_to_d(x)
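
For comparison, a parameter-free space-to-depth can be written with only `reshape` and `permute`, so every input value appears exactly once in the output. This is a sketch under the assumption of channels-first `(N, C, H, W)` input; the function name `space_to_depth_rearrange` is hypothetical. It follows the rule listed above that the Y, X offsets within a block form the high-order part of the output channel index.

```python
import torch


def space_to_depth_rearrange(x: torch.Tensor, block_size: int) -> torch.Tensor:
    """Move block_size x block_size spatial blocks into the channel dimension."""
    n, c, h, w = x.shape
    b = block_size
    if h % b != 0 or w % b != 0:
        raise ValueError("height and width must be divisible by block_size")
    # Split H into (H/b, b) and W into (W/b, b)
    x = x.reshape(n, c, h // b, b, w // b, b)
    # Reorder to (N, by, bx, C, H/b, W/b): block offsets become the high-order channel index
    x = x.permute(0, 3, 5, 1, 2, 4)
    return x.reshape(n, c * b * b, h // b, w // b)


x = torch.arange(1 * 3 * 4 * 4, dtype=torch.float32).reshape(1, 3, 4, 4)
out = space_to_depth_rearrange(x, block_size=2)
print(out.shape)  # torch.Size([1, 12, 2, 2])
```

Because this version has no weights, it can be inverted exactly, which is what makes the "no information lost" claim hold.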

In [4]:
# Test Conv2D Layer on our sample input
x = torch.rand((3, 512,512))

conv_out = conv_layer(torch.unsqueeze(x, 0))
s2d_out = space2Depth(torch.unsqueeze(x, 0), block_size=8)

print("conv_out.shape: ", conv_out.shape)
print("space-to-depth.shape: ", s2d_out.shape)

conv_out.shape:  torch.Size([1, 32, 64, 64])
space-to-depth.shape:  torch.Size([1, 192, 64, 64])


### Summary:

From the above results we can see that:

1. Using a Conv2D layer to reduce the input size leads to loss of information: the output holds `32 * 64 * 64 = 131,072` values against `3 * 512 * 512 = 786,432` in the input.
2. Using a Space-to-Depth mapping with a chosen `block_size`, we can reduce the spatial size of the input by whatever factor we want while still keeping the original input information in the channels: the output holds `192 * 64 * 64 = 786,432` values.

# b) 3D Convolution

3D Convolution is used when the input has an extra dimension besides height, width and channels, for example time in a sequence of RGB frames from a video. These frames can be represented by the shape `(N,T,H,W,C)` where `N`: Batch Size, `T`: Timestep, `H`: Input Height, `W`: Input Width, `C`: Number of channels.

Example:

Let's imagine we are making a Video Classification/Activity Recognition model. Because actions or activities span several frames, the model needs to see at least `T` video frames at a time to classify an action/activity, so the first layer needs to take in `T` frames at once. Hence, your input would be `(Batch Size, Timesteps, Number of Input Channels, Input Frame Height, Input Frame Width)`.

Say we want to use 16 frames as input, each resized to (512,512,3), i.e. an RGB image. Hence, the input shape becomes `(1, 16, 3, 512, 512)` for a batch size of 1.

In [5]:
conv3d_layer = nn.Conv3d(in_channels=16, out_channels=32, kernel_size=(3,3,3), stride=(2,2,2), padding=(1,1,1))

In [6]:
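# Test Conv3D Layer on our sample input
# Each frame is an RGB image from the video, i.e. (3, 512, 512) in channels-first layout
# We take 16 frames as input, hence the 16 in the timestep dimension
# Note: nn.Conv3d reads its input as (N, C, D, H, W), so with this layout the 16 frames
# are treated as input channels and the 3 colour planes as the depth dimension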
x = torch.rand((16, 3, 512,512))

conv3d_out = conv3d_layer(torch.unsqueeze(x, 0))

# Number of parameters in a Conv3D layer
params_conv3d = sum(p.numel() for p in conv3d_layer.parameters() if p.requires_grad)

print("conv3d_out.shape: ", conv3d_out.shape)
print("number of parameters in standard Conv3d: ", params_conv3d)

conv3d_out.shape:  torch.Size([1, 32, 2, 256, 256])
number of parameters in standard Conv3d:  13856
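
As an alternative sketch, the clip can be permuted to the `(N, C, D, H, W)` layout that `nn.Conv3d` expects, so that the kernel slides over time and the RGB planes act as input channels. The small `64 x 64` frame size and the variable names are assumptions chosen to keep the example light.

```python
import torch
import torch.nn as nn

# Hypothetical small clip in (N, T, C, H, W) layout: 1 clip, 16 RGB frames of 64x64
clip = torch.rand(1, 16, 3, 64, 64)

# Move channels ahead of time to get (N, C, T, H, W)
clip_ncthw = clip.permute(0, 2, 1, 3, 4)

# in_channels is now 3 (RGB); the 3x3x3 kernel spans time, height and width
conv3d_rgb = nn.Conv3d(in_channels=3, out_channels=32, kernel_size=3, stride=2, padding=1)
out = conv3d_rgb(clip_ncthw)
n_params = sum(p.numel() for p in conv3d_rgb.parameters())

print(out.shape)  # torch.Size([1, 32, 8, 32, 32])
print(n_params)   # 2624
```

With this layout the time dimension is halved from 16 to 8 by the stride, and the parameter count is `3 * 32 * 27 + 32 = 2624`.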


# c) Pointwise Convolution (1 x 1)

A pointwise convolution is a 2D Convolution whose `kernel_size` is `(1, 1)`. Such a layer has no spatial interaction, only cross-channel interaction: each output pixel depends only on the channels of the input pixel at the same location.

Let's see some advantages of using this:
1. This helps in dimensionality reduction when the number of filters is less than the number of input channels. For example, say we have an input of shape `(32,512,512)`, i.e. 32 input channels, and we want to reduce the number of channels without discarding the spatial layout. Applying a Conv2D with a (1,1) kernel and 16 filters reduces the output to `(16,512,512)`.
2. Efficient low dimensional embedding or feature pooling.
3. Applying non-linearity after convolution: a pointwise convolution acts as a form of feature pooling across channels (much as MaxPooling pools across space), so an activation can be applied directly on top of its output.

In [7]:
pt_conv = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=(1,1))

In [8]:
# Test Conv2D Layer on our sample input
x = torch.rand((16, 512,512))

pt_conv_out = pt_conv(torch.unsqueeze(x, 0))

print("pt_conv_out.shape: ", pt_conv_out.shape)

pt_conv_out.shape:  torch.Size([1, 32, 512, 512])
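
The channel-reduction case described in the list above (32 channels down to 16) can be sketched as follows. The example also checks that a 1 x 1 convolution is the same as a linear map over the channel axis applied at every pixel. The small `8 x 8` input and the variable names are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.rand(1, 32, 8, 8)  # hypothetical feature map with 32 channels

# 16 filters of size 1x1 reduce 32 channels to 16
reduce_conv = nn.Conv2d(in_channels=32, out_channels=16, kernel_size=1)
y = reduce_conv(x)

# The same computation as a per-pixel linear map over channels
w = reduce_conv.weight.view(16, 32)
y_linear = torch.einsum("oc,nchw->nohw", w, x) + reduce_conv.bias.view(1, 16, 1, 1)

print(y.shape)  # torch.Size([1, 16, 8, 8])
print(torch.allclose(y, y_linear, atol=1e-5))
```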


# d) Depthwise Separable Convolution

Popularised by the MobileNet paper, this convolution consists of two steps:
1. Depthwise Convolution: The depthwise convolution unlike the standard convolution acts only on a single channel of the input map at a time.
2. Pointwise Convolution: Here, we use 1 x 1 kernels to have no spatial, but cross-channel interaction.

In a standard convolution, say we have input of shape `(3,7,7)`, and apply `128 filters` of size `(3,3,3)`, we get an output of `(128,5,5)`.

Say `N`: number of kernels of dimension `(h, h, D)`, `H`: input height, `W`: input width. With no padding and a stride of 1, the total number of multiplications for a normal Conv2D layer is: `N * h * h * D * (H - h + 1) * (W - h + 1)`

Using a depthwise separable convolution layer, the total number of multiplications reduces to: `(h * h + N) * D * (H - h + 1) * (W - h + 1)`

Hence, the ratio of the two is `(1 / N + 1 / h**2)`; if `N >> h**2`, `ratio ≈ (1 / h**2)`
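
Plugging the `(3,7,7)` input with `128` filters of size `(3,3,3)` into these two formulas gives a concrete feel for the saving. This is a small arithmetic sketch; the function names are hypothetical.

```python
def standard_conv_mults(n: int, h: int, d: int, H: int, W: int) -> int:
    """Multiplications for N kernels of size (h, h, D), no padding, stride 1."""
    return n * h * h * d * (H - h + 1) * (W - h + 1)


def separable_conv_mults(n: int, h: int, d: int, H: int, W: int) -> int:
    """Depthwise (h*h per channel) plus pointwise (N per channel) multiplications."""
    return (h * h + n) * d * (H - h + 1) * (W - h + 1)


std = standard_conv_mults(n=128, h=3, d=3, H=7, W=7)
sep = separable_conv_mults(n=128, h=3, d=3, H=7, W=7)
ratio = sep / std

print(std)  # 86400
print(sep)  # 10275
print(round(ratio, 4), round(1 / 128 + 1 / 3**2, 4))  # 0.1189 0.1189
```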

1) Depthwise Convolution:
First, we apply a depthwise convolution to the input layer. Instead of a single filter of size 3 x 3 x 3 as in a standard 2D convolution, we use 3 kernels separately, each of size 3 x 3 x 1. Each kernel convolves with 1 channel of the input layer (1 channel only, not all channels!) and produces a map of size 5 x 5 x 1. Stacking these maps gives an output of size 5 x 5 x 3: the spatial dimensions have shrunk, but the depth is still the same as before.

In [9]:
# Normal Convolution
# Filter Size: (3,3,10), total 32 filters
normal_conv = nn.Conv2d(in_channels=10, out_channels=32, kernel_size=(3,3), stride=(1,1))

In [10]:
# Depthwise Separable Convolution

def depthwiseSeparableConv(in_ch, out_ch):
    # 1. Depthwise Convolution
    # Filter Size: (3,3,1), total in_ch filters, 1 per input channel
    #  Each kernel convolves with 1 channel of the input layer
    depthwise_convolution = nn.Conv2d(in_channels=in_ch, out_channels=in_ch, kernel_size=(3,3), groups=in_ch)

    # 2. Pointwise Convolution
    # Now, to extend the depth, we use a cheaper 1x1 convolution with required number of output channels
    pointwise_conv = nn.Conv2d(in_channels=in_ch, out_channels=out_ch, kernel_size=(1,1))
    
    depthwise_separable_conv = nn.Sequential(depthwise_convolution, pointwise_conv)
    
    return depthwise_separable_conv

In [11]:
# Test Conv2D Layer on our sample input
x = torch.rand((10, 128, 128))

normal_conv_out = normal_conv(torch.unsqueeze(x, 0))

# Number of parameters in a normal Conv2D
params_conv2d = sum(p.numel() for p in normal_conv.parameters() if p.requires_grad)

print("normal_conv_out.shape: ", normal_conv_out.shape)
print("number of parameters in standard Conv2d: ", params_conv2d)

normal_conv_out.shape:  torch.Size([1, 32, 126, 126])
number of parameters in standard Conv2d:  2912


In [12]:
# Test Depthwise Separable Conv Layer on our sample input
x = torch.rand((10, 128, 128))

depthwise_separable_conv = depthwiseSeparableConv(in_ch=10, out_ch=32)
depthwise_separable_conv_out = depthwise_separable_conv(torch.unsqueeze(x, 0))

# Number of parameters in a Depthwise Separable Convolution
params_depthwise = sum(p.numel() for p in depthwise_separable_conv.parameters() if p.requires_grad)

print("depthwise_separable_conv_out.shape: ", depthwise_separable_conv_out.shape)
print("number of parameters in depthwise separable convolution: ", params_depthwise)

depthwise_separable_conv_out.shape:  torch.Size([1, 32, 126, 126])
number of parameters in depthwise separable convolution:  452


# e) Transposed Convolution

The normal convolution layers, whether Conv2D or Conv3D, are usually used to reduce the input dimension and extract features into the depth, i.e. the number of channels. Once we are done with the convolutions, the feature size becomes very small compared to the input dimension, depending on the kernel size and the stride used.

But in some cases, such as segmentation models, we want to learn the low-level features and then upsample them spatially from a low dimension to a high dimension. For example, going from a feature size of `(32,128,128)` to an output layer of size `(1,257,257)`.

In this case, we use an upsampling layer. One layer that performs up-sampling is the Transposed Convolution layer.

In [13]:
transposed_conv = nn.ConvTranspose2d(in_channels=32, out_channels=16, kernel_size=(3,3), stride=(2,2))

In [14]:
# Test ConvTranspose2d Layer on our sample input
x = torch.rand((32, 128, 128))

transconv_out = transposed_conv(torch.unsqueeze(x, 0))

# Number of parameters in a Transposed Conv2d Layer
params_transposedconv2d = sum(p.numel() for p in transposed_conv.parameters() if p.requires_grad)

print("transconv_out.shape: ", transconv_out.shape)
print("number of parameters in standard TransposedConv2d: ", params_transposedconv2d)

transconv_out.shape:  torch.Size([1, 16, 257, 257])
number of parameters in standard TransposedConv2d:  4624
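
The `257` in the output above follows from the output-size rule documented for `nn.ConvTranspose2d`. The sketch below encodes that rule in a hypothetical helper, `conv_transpose2d_output_size`, and checks it against the layer on a small `8 x 8` input chosen for illustration.

```python
import torch
import torch.nn as nn


def conv_transpose2d_output_size(w_in, k, s, p=0, output_padding=0, dilation=1):
    """(W_in - 1) * S - 2 * P + dilation * (K - 1) + output_padding + 1"""
    return (w_in - 1) * s - 2 * p + dilation * (k - 1) + output_padding + 1


# Setting used above: 128 -> 257 with a 3x3 kernel and stride 2
size_default = conv_transpose2d_output_size(128, k=3, s=2)
# With padding=1 and output_padding=1 the size is exactly doubled: 128 -> 256
size_doubled = conv_transpose2d_output_size(128, k=3, s=2, p=1, output_padding=1)

# Check the formula against PyTorch on a small input
x = torch.rand(1, 4, 8, 8)
up = nn.ConvTranspose2d(4, 2, kernel_size=3, stride=2)
up_exact = nn.ConvTranspose2d(4, 2, kernel_size=3, stride=2, padding=1, output_padding=1)

print(size_default, size_doubled)               # 257 256
print(up(x).shape[-1], up_exact(x).shape[-1])   # 17 16
```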


# f) Pixel Shuffle

Pixel Shuffle is another type of layer that can be used for up-sampling. The difference here is that this layer performs a `depth-to-space` conversion, i.e. it takes the channels of a layer and re-arranges them to obtain the desired up-sampled width and height.

The advantage over Transposed Convolution is that up-sampling with a transposed convolution layer can suffer from the Checkerboard Artifact problem, where the up-sampled image shows checkerboard-like features. This is undesirable, especially in applications like super-resolution.

Pixel Shuffle layer solves this problem by re-arranging the channel features into spatial dimension.

In [15]:
pixel_shuffle_layer = nn.PixelShuffle(upscale_factor=2)

In [16]:
# Test Pixel Shuffle Layer on our sample input
x = torch.rand((32, 128, 128))

pixel_shuffle_out = pixel_shuffle_layer(torch.unsqueeze(x, 0))

# Number of parameters in a Pixel Shuffle Layer
params_pixshuffle = sum(p.numel() for p in pixel_shuffle_layer.parameters() if p.requires_grad)

print("pixel_shuffle_out.shape: ", pixel_shuffle_out.shape)
print("number of parameters in standard PixelShuffler layer: ", params_pixshuffle)

pixel_shuffle_out.shape:  torch.Size([1, 8, 256, 256])
number of parameters in standard PixelShuffler layer:  0


In the above code, notice how the Pixel Shuffle layer re-arranged the input from `(1,32,128,128)` to `(1,8,256,256)`, thereby increasing the output spatially while decreasing it over depth.
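
To make the re-arrangement explicit, the sketch below reproduces the depth-to-space step with `reshape` and `permute` and compares it with `torch.nn.functional.pixel_shuffle`. The function name `pixel_shuffle_manual` and the small input size are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def pixel_shuffle_manual(x: torch.Tensor, r: int) -> torch.Tensor:
    """Rearrange (N, C*r*r, H, W) into (N, C, H*r, W*r) without any parameters."""
    n, c, h, w = x.shape
    c_out = c // (r * r)
    # Split channels into (C_out, r, r): the two r axes are the offsets inside each output block
    x = x.reshape(n, c_out, r, r, h, w)
    # Interleave the offsets with the spatial axes: (N, C_out, H, r, W, r)
    x = x.permute(0, 1, 4, 2, 5, 3)
    return x.reshape(n, c_out, h * r, w * r)


x = torch.rand(1, 32, 16, 16)
out = pixel_shuffle_manual(x, r=2)

print(out.shape)                                # torch.Size([1, 8, 32, 32])
print(torch.equal(out, F.pixel_shuffle(x, 2)))  # True
```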

# g) Dilated Convolution

Dilated convolutions are similar to 2D Convolutions, except that they "inflate" the kernel by inserting spaces between the kernel elements. The important factor here is the `Dilation Rate (l)`, which indicates how much we want to widen the kernel. A kernel of size `k` with dilation rate `l` covers `k + (k - 1) * (l - 1)` input positions per side, without adding any parameters.

In [17]:
# Normal Convolution without Dilation
normal_conv = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=(3,3), stride=(1,1))

# Convolution with Dilation
dilated_conv = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=(3,3), stride=(1,1), dilation=2)

In [18]:
# Test Conv2D Layer on our sample input
x = torch.rand((3, 512, 512))

normal_conv_out = normal_conv(torch.unsqueeze(x, 0))
dilated_conv_out = dilated_conv(torch.unsqueeze(x, 0))

print("normal_conv_out.shape: ", normal_conv_out.shape)
print("dilated_conv_out.shape: ", dilated_conv_out.shape)

normal_conv_out.shape:  torch.Size([1, 32, 510, 510])
dilated_conv_out.shape:  torch.Size([1, 32, 508, 508])
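
The `508` above can be explained by the effective kernel size of a dilated kernel. The sketch below computes it with a hypothetical helper, `effective_kernel_size`, and confirms that dilation leaves the parameter count unchanged.

```python
import torch.nn as nn


def effective_kernel_size(k: int, dilation: int) -> int:
    """Span of input positions covered by a k-wide kernel with the given dilation."""
    return k + (k - 1) * (dilation - 1)


k_eff = effective_kernel_size(k=3, dilation=2)
out_size = 512 - k_eff + 1  # no padding, stride 1

plain = nn.Conv2d(3, 32, kernel_size=3)
dilated = nn.Conv2d(3, 32, kernel_size=3, dilation=2)
params_plain = sum(p.numel() for p in plain.parameters())
params_dilated = sum(p.numel() for p in dilated.parameters())

print(k_eff, out_size)               # 5 508
print(params_plain, params_dilated)  # 896 896
```

So a dilated 3 x 3 kernel sees a 5 x 5 window for the cost of a 3 x 3 one.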


# h) Grouped Convolution

Introduced in the `AlexNet` paper.
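
As a minimal sketch, a grouped convolution can be built in PyTorch with the `groups` argument of `nn.Conv2d`, the same argument used for the depthwise convolution earlier. With `groups=g`, each filter only sees `in_channels / g` input channels, so the weight count drops by a factor of `g`. The channel counts below are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn


def count_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())


full_conv = nn.Conv2d(32, 64, kernel_size=3, padding=1)               # groups=1
grouped_conv = nn.Conv2d(32, 64, kernel_size=3, padding=1, groups=4)  # 4 groups of 8 channels

x = torch.rand(1, 32, 16, 16)
full_out = full_conv(x)
grouped_out = grouped_conv(x)

print(full_out.shape, grouped_out.shape)  # both torch.Size([1, 64, 16, 16])
print(count_params(full_conv))            # 18496
print(count_params(grouped_conv))         # 4672
```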

# i) Shuffle Grouped Convolution

Introduced in `ShuffleNet` paper.
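
The channel shuffle step that gives this layer its name can be sketched with a reshape and a transpose. It interleaves the channels of the different groups so that the next grouped convolution sees channels from every group. The function name `channel_shuffle` is hypothetical and the tiny input is chosen so the permutation is easy to read.

```python
import torch


def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across groups: (N, C, H, W) -> (N, C, H, W)."""
    n, c, h, w = x.shape
    if c % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    # View channels as (groups, channels_per_group), then swap the two axes
    x = x.reshape(n, groups, c // groups, h, w)
    x = x.transpose(1, 2)
    return x.reshape(n, c, h, w)


# 6 channels labelled 0..5, split into 2 groups: [0, 1, 2] and [3, 4, 5]
x = torch.arange(6.0).reshape(1, 6, 1, 1)
shuffled = channel_shuffle(x, groups=2)

print(shuffled.flatten().tolist())  # [0.0, 3.0, 1.0, 4.0, 2.0, 5.0]
```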

# j) Pointwise Grouped Convolution 

# k) Squeeze and Excite

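Introduced in the `SqueezeNet` paper, where this squeeze-then-expand block is called the Fire module. It should not be confused with the Squeeze-and-Excitation block from SENet, which re-weights channels instead.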

**Layer Parameter Calculation:**

`(number of input channels) * (number of filters) * (kernel_width * kernel_height)`

This paper uses 3 main strategies for designing the model architecture:

**Strategy 1 - Replace 3x3 filters with 1x1 filters.**

Given a budget of a certain number of convolution filters, we will choose to make the majority of these filters 1x1, since a 1x1 filter has 9X fewer parameters than a 3x3 filter.

```
in_channels = 3, out_channels (filters) = 32

Parameters in a 1x1 Conv layer: (3 * 32 * (1 * 1)) = 96
Parameters in a 3x3 Conv layer: (3 * 32 * (3 * 3)) = 864

i.e `9x` more parameters than a 1x1 Conv layer.
```

**Strategy 2 - Decrease the number of input channels to 3x3 filters.**

Consider a convolution layer that is comprised entirely of 3x3 filters. To maintain a small total number of parameters in a CNN, it is important not only to decrease the number of 3x3 filters, but also to decrease the number of input channels to the 3x3 filters.

```
in_channels = 3, out_channels (filters) = 32, kernel_size = (3, 3)
Total Parameters: (3 * 32 * (3 * 3)) = 864

in_channels = 3, out_channels (filters) = 64, kernel_size = (3, 3)
Total Parameters: (3 * 64 * (3 * 3)) = 1,728
```
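
The effect of the number of input channels on a 3x3 layer can be checked the same way. The sketch below applies the layer parameter formula from above (weights only, so `bias=False`) and compares it with `nn.Conv2d`. The helper name `conv_weight_params` and the channel counts 64 and 16 are assumptions for illustration.

```python
import torch.nn as nn


def conv_weight_params(in_channels: int, num_filters: int, kernel_h: int, kernel_w: int) -> int:
    """(number of input channels) * (number of filters) * (kernel_width * kernel_height)"""
    return in_channels * num_filters * kernel_h * kernel_w


results = {}
for in_ch in (64, 16):
    layer = nn.Conv2d(in_ch, 32, kernel_size=3, bias=False)
    actual = sum(p.numel() for p in layer.parameters())
    results[in_ch] = (conv_weight_params(in_ch, 32, 3, 3), actual)
    print(in_ch, results[in_ch])

# 64 (18432, 18432)
# 16 (4608, 4608)
```

Cutting the input channels from 64 to 16 cuts the parameters of the 3x3 layer by the same factor of 4.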

**Strategy 3 - Downsample late in the network so that convolution layers have large activation maps.**

Most commonly, downsampling is engineered into CNN architectures by setting the (stride > 1) in some of the convolution or pooling layers. If early layers in the network have large strides, then most layers will have small activation maps.

Conversely, if most layers in the network have a stride of 1, and the strides greater than 1 are concentrated toward the end of the network, then many layers in the network will have large activation maps. Our intuition is that large activation maps (due to delayed downsampling) can lead to higher classification accuracy, with all else held equal.

In [19]:
class SqueezeExcite(nn.Module):
    def __init__(self, in_planes, squeeze_planes, expand_planes):
        super(SqueezeExcite, self).__init__()
        
        # Squeeze Layer - 1 x 1 Convolution (Pointwise Convolution)
        self.conv1 = nn.Conv2d(in_channels=in_planes, out_channels=squeeze_planes, kernel_size=(1,1), stride=(1,1))
        self.bn1 = nn.BatchNorm2d(squeeze_planes)

        # Expand Layer - (1 x 1 Convolution, 3 x 3 Convolution)
        # 1 x 1 Convolution
        self.conv2 = nn.Conv2d(in_channels=squeeze_planes, out_channels=expand_planes, kernel_size=(1,1), stride=(1,1))
        self.bn2 = nn.BatchNorm2d(expand_planes)
        
        # 3 x 3 Convolution
        self.conv3 = nn.Conv2d(in_channels=squeeze_planes, out_channels=expand_planes, kernel_size=(3,3), stride=(1,1), padding=1)
        self.bn3 = nn.BatchNorm2d(expand_planes)
        
        # Activation
        self.relu = nn.ReLU(inplace=True)
    
    def forward(self, x):
        # Input goes through Squeeze layer first
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        
        # Expand Layer: Then output of Squeeze Layer goes through 2 layers in parallel
        # 1. A Pointwise Conv Layer
        out1 = self.conv2(x)
        out1 = self.bn2(out1)
        
        # 2. A 3 x 3 Conv Layer
        out2 = self.conv3(x)
        out2 = self.bn3(out2)
        
        # Before sending the output, the outputs of expand layers is concatenated
        out = torch.cat([out1, out2], dim=1)
        out = self.relu(out)
        
        return out

In [20]:
se_layer = SqueezeExcite(in_planes=256, squeeze_planes=48, expand_planes=192)
se_layer

SqueezeExcite(
  (conv1): Conv2d(256, 48, kernel_size=(1, 1), stride=(1, 1))
  (bn1): BatchNorm2d(48, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (conv2): Conv2d(48, 192, kernel_size=(1, 1), stride=(1, 1))
  (bn2): BatchNorm2d(192, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (conv3): Conv2d(48, 192, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (bn3): BatchNorm2d(192, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
)

In [21]:
# Test SqueezeExcite Layer on our sample input
x = torch.rand((256, 512, 512))

se_layer_out = se_layer(torch.unsqueeze(x, 0))

# Number of parameters in the SqueezeExcite Layer
params_se_layer = sum(p.numel() for p in se_layer.parameters() if p.requires_grad)

print("se_layer_out.shape: ", se_layer_out.shape)
print("number of parameters in SqueezeExcite layer: ", params_se_layer)

se_layer_out.shape:  torch.Size([1, 384, 512, 512])
number of parameters in SqueezeExcite layer:  105744


As you can see above, we have `256` input channels. Following `Strategy-1`, the squeeze layer and half of the expand layer use 1 x 1 Convolutions instead of 3 x 3 Convolutions.
The squeeze layer also reduces the `256` channels to `48`, which decreases the number of input channels to the 3 x 3 filters in the Expand layer, following `Strategy-2`.