# An overview and PyTorch implementation of the DenseNet architecture

DenseNet is a convolutional neural network architecture introduced in the paper ["Densely Connected Convolutional Networks"](https://arxiv.org/abs/1608.06993), which won the Best Paper Award at [CVPR 2017](http://cvpr2017.thecvf.com/).

The purpose of this notebook is to give an overview of the DenseNet architecture from an intuitive and code-oriented point of view. By the end we will have a flexible, working implementation of the architecture.

The original paper is very well formatted and includes many good illustrations, so we refer those who want to dive deeper into the technical details to it.

# Imports

In [1]:
from collections import OrderedDict

import torch

# Building blocks

Like many other CNN architectures, DenseNet is composed of 4 main building blocks:

![building_blocks](data/bb.png)

---

## Stem

The stem block is the first block of any architecture. Although it has received very little attention in most papers (it's hard to find a paper where the design of this block is analysed or even justified), it has a very important impact on the performance of the architecture.

In the original DenseNet paper, the authors omit any justification, clarification or experiments that led them to choose the stem block they used, although it is specified in the architecture tables.

All the DenseNet variants trained for ImageNet use a `7x7, stride 2` convolution followed by a `3x3, stride 2` max pool, which is the same stem block used in [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385).

It appears that this block is intended to heavily decrease the spatial dimensions of the input (by a factor of 4 along each side) before the convolution blocks, while at the same time compensating for that loss of information by increasing the depth dimension (channels).

However, for a smaller dataset like CIFAR (32x32) the authors decided to use a different stem block composed of a single `3x3, stride 1` convolution, where the spatial dimensions are not decreased.

This is a logical decision but, surprisingly, it's omitted in the paper and only reflected in the officially released code.
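
To make the spatial reduction of the two stems concrete, here is a small sketch in plain Python of the usual output-size arithmetic for a convolution or pooling layer. The helper name `out_size` and the 224x224 input resolution are assumptions made for illustration; the kernel, stride and padding values are the ones used in the `StemBlock` code of this notebook.

```python
def out_size(size, kernel, stride, padding):
    """Output length along one spatial axis (floor rounding, no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

# '7x7' stem on an assumed 224x224 input: convolution, then max pool
after_conv = out_size(224, kernel=7, stride=2, padding=3)
after_pool = out_size(after_conv, kernel=3, stride=2, padding=1)
print(after_conv, after_pool)  # 112 56

# '3x3' stem on a 32x32 CIFAR image: the resolution is preserved
cifar = out_size(32, kernel=3, stride=1, padding=1)
print(cifar)  # 32
```

Each side shrinks from 224 to 56 with the `7x7` stem, while the `3x3` stem leaves a 32x32 image untouched.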

Below is an implementation of the stem block supporting the 2 variants used in the original paper. Note that other variants could easily be added.

In [2]:
class StemBlock(torch.nn.Sequential):
    """
    Initial Convolution Block.
    """
    def __init__(self, n_input_channels=3, n_init_features=64, mode='7x7'):
        """
        Parameters
        ----------
        n_input_channels: int, optional
            Default 3 (RGB image)
        n_init_features: int, optional
            Default 64 as in [1] DenseNet-121.
        mode: {'7x7', '3x3'}, optional
            Default '7x7' (as in [1] ImageNet)
            mode '3x3' is the stem block used in [1] for CIFAR 10/100

        Notes
        -----
        [1]: Densely Connected Convolutional Networks
            https://arxiv.org/abs/1608.06993
        """        
        super().__init__()

        C = n_input_channels
        F = n_init_features

        if mode == '7x7':
            self.add_module('conv', torch.nn.Conv2d(C, F, kernel_size=7, stride=2, padding=3, bias=False))
            self.add_module('norm', torch.nn.BatchNorm2d(F))
            self.add_module('relu', torch.nn.ReLU(inplace=True))

            self.add_module('pool', torch.nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

        elif mode == '3x3':
            self.add_module('conv', torch.nn.Conv2d(C, F, kernel_size=3, stride=1, padding=1, bias=False))
        
        else:
            raise ValueError("mode must be '7x7' or '3x3'")

Examples:

In [3]:
StemBlock()

StemBlock (
  (conv): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (norm): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
  (relu): ReLU (inplace)
  (pool): MaxPool2d (size=(3, 3), stride=(2, 2), padding=(1, 1), dilation=(1, 1))
)

In [4]:
StemBlock(n_init_features=128)

StemBlock (
  (conv): Conv2d(3, 128, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (norm): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
  (relu): ReLU (inplace)
  (pool): MaxPool2d (size=(3, 3), stride=(2, 2), padding=(1, 1), dilation=(1, 1))
)

In [5]:
StemBlock(mode='3x3')

StemBlock (
  (conv): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
)

---

## Convolution

The convolutional blocks are the main component of any architecture. The design of these blocks and the connections between them have received a lot of attention because of their big impact on the performance of the architecture.

The goal of the convolutional blocks is to generate new feature maps by combining the information of the set of feature maps received as input.

In traditional architectures like VGG each layer is simply connected to the next. More recent architectures, like the Inception and ResNet families, use a more complex connectivity pattern.

In DenseNet, the authors extend the ideas related to the `skip connections` presented in [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385).

The ResNet convolutional blocks implement these skip connections using the identity transform:

![Resnet](http://www.deeplearningmodel.net/img/resnet/resnet_block.png)

Whereas in DenseNet the authors take this connection one step further by connecting every layer in the block with all the previous ones:

![Densenet](https://cdn-images-1.medium.com/max/1600/1*KOjUX1ST5RnDOZWWLWRGkw.png)

In practice, the feature maps generated by one layer are **concatenated** with the ones that it received as input. This concatenation is what the next layer receives.

This conceptually simple connection has many intrinsic benefits which are best described in the original paper.
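
One consequence of concatenating instead of adding is that the number of channels grows linearly inside a block. The sketch below tracks only the channel counts, assuming each layer contributes a fixed number of new feature maps (the growth rate of the paper). The helper name `dense_block_channels` and the example sizes are hypothetical.

```python
def dense_block_channels(n_layers, n_input_features, growth_rate=32):
    """Return (input channels seen by each layer, channels leaving the block)."""
    seen = []
    channels = n_input_features
    for _ in range(n_layers):
        seen.append(channels)    # the layer reads everything produced so far
        channels += growth_rate  # its output is concatenated to its input
    return seen, channels

seen, out_channels = dense_block_channels(n_layers=6, n_input_features=64)
print(seen)          # [64, 96, 128, 160, 192, 224]
print(out_channels)  # 256
```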

In addition to this connection pattern, the authors tried some recent convolution tricks to improve the efficiency of the blocks, like the use of bottlenecks popularized by the Inception architecture.

Below is an implementation of the so-called DenseBlock. Note that the convolution / bottleneck part could easily be modified, and that some of the parameters might require familiarity with the paper in order to be understood.

In [6]:
class DenseBlock(torch.nn.Sequential):
    """
    Block where all the layers are connected to previous ones.
    """    
    def __init__(self,  n_layers, n_input_features, growth_rate=32, bottleneck_size=4, drop_rate=0):
        """
        n_layers: int
        n_input_features: int
        growth_rate: int, optional
            Default 32 as in [1] DenseNet-121.
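        bottleneck_size: int, optional
            Default 4 as in [1] DenseNet-BC.
            The 1x1 convolution outputs bottleneck_size * growth_rate feature maps.
            If 1, the 3x3 convolution is skipped and each layer is a single 1x1 convolution.
        drop_rate: float, optional
            Default 0.
            In [1] 0.2 is used for CIFAR10/100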

        Notes
        -----
        [1]: Densely Connected Convolutional Networks
            https://arxiv.org/abs/1608.06993
        """        
        super().__init__()

        I = n_input_features
        G = growth_rate
        B = bottleneck_size
        D = drop_rate

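        # Note: recent PyTorch releases reject module names that contain '.',
        # so names such as 'norm.0' may need to become e.g. 'norm_0' there.
        for N in range(n_layers):
            # Layer N sees the block input plus G feature maps from each previous layer.
            F = I + (N * G)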

            layer = torch.nn.Sequential()

            layer.add_module('norm.0', torch.nn.BatchNorm2d(F))
            layer.add_module('relu.0', torch.nn.ReLU(inplace=True)) 
            layer.add_module('conv.0', torch.nn.Conv2d(F, G * B, kernel_size=1, stride=1, bias=False))

            if B > 1:
                layer.add_module('norm.1', torch.nn.BatchNorm2d(G * B))
                layer.add_module('relu.1', torch.nn.ReLU(inplace=True))
                layer.add_module('conv.1', torch.nn.Conv2d(G * B, G, kernel_size=3, stride=1, padding=1, bias=False))

            if D > 0:
                layer.add_module('drop', torch.nn.Dropout(drop_rate, inplace=True))

            self.add_module('denselayer.{}'.format(N), layer)    
            
    def forward(self, x):
        for layer in self:
            out = layer(x)
            # THE magic line
            x = torch.cat([x, out], 1)
        return x  

In [7]:
DenseBlock(n_layers=2, n_input_features=64)

DenseBlock (
  (denselayer.0): Sequential (
    (norm.0): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
    (relu.0): ReLU (inplace)
    (conv.0): Conv2d(64, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (norm.1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
    (relu.1): ReLU (inplace)
    (conv.1): Conv2d(128, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  )
  (denselayer.1): Sequential (
    (norm.0): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True)
    (relu.0): ReLU (inplace)
    (conv.0): Conv2d(96, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (norm.1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
    (relu.1): ReLU (inplace)
    (conv.1): Conv2d(128, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  )
)

In [8]:
DenseBlock(n_layers=3, n_input_features=256, drop_rate=0.2)

DenseBlock (
  (denselayer.0): Sequential (
    (norm.0): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
    (relu.0): ReLU (inplace)
    (conv.0): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (norm.1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
    (relu.1): ReLU (inplace)
    (conv.1): Conv2d(128, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (drop): Dropout (p = 0.2, inplace)
  )
  (denselayer.1): Sequential (
    (norm.0): BatchNorm2d(288, eps=1e-05, momentum=0.1, affine=True)
    (relu.0): ReLU (inplace)
    (conv.0): Conv2d(288, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (norm.1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
    (relu.1): ReLU (inplace)
    (conv.1): Conv2d(128, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (drop): Dropout (p = 0.2, inplace)
  )
  (denselayer.2): Sequential (
    (norm.0): BatchNorm2d(320, eps=1e-05, momentum=0.1, affine=True)
    (

## Downsampling

In all the existing architectures, except for the ones based on dilated convolutions, the convolutional blocks are interleaved with downsampling blocks which reduce the spatial dimensions of the feature maps.

These blocks have also received very little attention, and in practice most modern architectures just apply a `2x2, stride 2` max pool between convolutional blocks. Alternatives to max pooling have been explored in some papers; for example, increasing the stride of a normal convolution can produce the same spatial reduction.

In DenseNet the authors call these downsampling blocks Transition blocks, and they also include a convolution before the pooling, which is used to reduce the number of feature maps by a hyperparameter called the `compression factor`.
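
As a rough illustration of what a transition does to the shape of its input, the sketch below applies a compression factor to the channels and halves each spatial side, assuming a `2x2, stride 2` pooling. The helper name `transition_shape` and the 256x56x56 input are hypothetical; 0.5 is the compression factor used for DenseNet-BC in the paper.

```python
def transition_shape(channels, height, width, compression=0.5):
    """Shape after a transition: 1x1 convolution, then 2x2 stride-2 pooling."""
    out_channels = int(channels * compression)    # fewer feature maps
    return out_channels, height // 2, width // 2  # each spatial side is halved

print(transition_shape(256, 56, 56))  # (128, 28, 28)
```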

In [9]:
class Transition(torch.nn.Sequential):
    """
    Transition between DenseBlocks.
    """    
    def __init__(self, n_input_features, n_output_features):
        """
        n_input_features: int
        n_output_features: int
            In general this number is the result of applying a "compression factor" to n_input_features.
            In [1] 0.5 is used for DenseNet-BC.

        Notes
        -----
        [1]: Densely Connected Convolutional Networks
            https://arxiv.org/abs/1608.06993
        """
        super().__init__()

        I = n_input_features
        O = n_output_features

        self.add_module('norm', torch.nn.BatchNorm2d(I))
        self.add_module('relu', torch.nn.ReLU(inplace=True))
        self.add_module('conv', torch.nn.Conv2d(I, O, kernel_size=1, stride=1, bias=False))        
        

        self.add_module('pool', torch.nn.AvgPool2d(kernel_size=2, stride=2))  

In [10]:
Transition(64, 32)

Transition (
  (norm): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
  (relu): ReLU (inplace)
  (conv): Conv2d(64, 32, kernel_size=(1, 1), stride=(1, 1), bias=False)
  (pool): AvgPool2d (size=2, stride=2, padding=0, ceil_mode=False, count_include_pad=True)
)

# DenseNet

Once we have the above building blocks, the implementation of DenseNet (the final assembly) is trivial. There are many hyperparameters that might not be intuitive unless you have read the paper, but others are easy to understand, like:

- Number of blocks
- Number of layers in each block
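
These two hyperparameters also explain the numbers in names such as DenseNet-121. As a rough sketch, the depth can be counted as two convolutions per dense layer (the 1x1 bottleneck and the 3x3 convolution), one convolution per transition, the stem convolution and the final linear classifier. The helper name `depth` is hypothetical, and the count assumes bottleneck layers and one transition between each pair of consecutive blocks.

```python
def depth(blocks, convs_per_layer=2):
    """Count weighted layers: dense-layer convs, transitions, stem conv, classifier."""
    n_transitions = len(blocks) - 1
    return convs_per_layer * sum(blocks) + n_transitions + 2

print(depth((6, 12, 24, 16)))  # 121
print(depth((6, 12, 36, 24)))  # 161
```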

The implementation below could easily be extended to other domains beyond classification, like segmentation or detection (maybe in future notebooks):

In [11]:
class DenseNet(torch.nn.Module):
    """
    Use Dense and Transition blocks to build a DenseNet [1].
    """
    def __init__(self,
                 n_input_channels=3,
                 n_init_features=64,
                 stem_mode='7x7',
                 blocks=(6, 12, 24, 16),
                 compression_factors=(0.5, 0.5, 0.5), 
                 growth_rate=32,
                 bottleneck_size=4, 
                 drop_rate=0, 
                 n_classes=1000):
        """
        n_input_channels: int, optional
            Default 3 (RGB image)
        n_init_features: int, optional
            Default 64 as in [1] DenseNet-121.
        stem_mode: {'7x7', '3x3'}
            Default '7x7' (as in [1] for ImageNet)
            mode '3x3' is the stem block used in [1] for CIFAR 10/100.
        blocks: list/tuple of ints
            Default (6, 12, 24, 16) as in [1] DenseNet-121.
            This parameter configures:
            - Number of DenseBlocks (len(blocks))
            - Number of DenseLayers (blocks[i]) for each (i-th) block.
        compression_factors: list/tuple of floats, optional
            Default (0.5, 0.5, 0.5) as in [1] DenseNet-BC.
            Adjust the compression factor of each transition block.
            len(compression_factors) should be equal to the number of transitions,
            usually len(blocks) - 1.
            The number of feature maps will be multiplied by compression_factors[i] at each transition block.
        growth_rate: int, optional
            Default 32 as in [1] DenseNet-121.
        bottleneck_size: int, optional
            Default 4 as in [1] DenseNet-BC.
        drop_rate: float, optional
            Default 0.
            In [1] 0.2 is used for CIFAR10/100
        n_classes: int, optional
            Default 1000

        Notes
        -----
        [1]: Densely Connected Convolutional Networks
            https://arxiv.org/abs/1608.06993
        """
        super().__init__()

        C = n_input_channels
        F = n_init_features
        CF = compression_factors
        G = growth_rate
        B = bottleneck_size
        D = drop_rate
        
        self.features = torch.nn.Sequential()

        self.features.add_module('stem', StemBlock(C, F, mode=stem_mode))

        for i, N in enumerate(blocks):
            self.features.add_module('denseblock.{}'.format(i), DenseBlock(N, F, G, B, D))

            F += (N * G)
            
            if i < len(CF):
                self.features.add_module('transition.{}'.format(i), Transition(F, int(F * CF[i])))
                
                F = int(F * CF[i])
        
        self.final_pool = torch.nn.Sequential(OrderedDict([
            ('norm', torch.nn.BatchNorm2d(F)),
            ('relu', torch.nn.ReLU(inplace=True)),
            ('pool', torch.nn.AdaptiveAvgPool2d(1)),
        ]))
        
        self.classifier = torch.nn.Linear(F, n_classes)

    def forward(self, x):
        out = self.features(x)
        out = self.final_pool(out)
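        # Flatten to (batch, features); a bare squeeze() would also drop a batch dimension of size 1.
        return self.classifier(out.view(out.size(0), -1))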
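
The channel bookkeeping done by the loop in `__init__` can be followed without building any layer. The sketch below repeats the same arithmetic in plain Python; the helper name `feature_counts` and the `trace` list are hypothetical additions used only for illustration.

```python
def feature_counts(n_init_features=64, blocks=(6, 12, 24, 16),
                   compression_factors=(0.5, 0.5, 0.5), growth_rate=32):
    """Channels leaving each dense block and transition, plus the final count."""
    F = n_init_features
    trace = []
    for i, n_layers in enumerate(blocks):
        F += n_layers * growth_rate              # every layer adds growth_rate maps
        trace.append(('denseblock', i, F))
        if i < len(compression_factors):
            F = int(F * compression_factors[i])  # the transition compresses them
            trace.append(('transition', i, F))
    return trace, F

trace, n_features = feature_counts()
for name, i, channels in trace:
    print(name, i, channels)
print(n_features)  # 1024
```

With the default arguments the classifier therefore receives 1024 features. Calling `feature_counts(24, (16, 16, 16), (0.5, 0.5), 12)` gives 342 for a CIFAR-style configuration.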

We can use the above class to generate all the original DenseNet variants or create our own:

- DenseNet-121-BC

In [12]:
DenseNet()

DenseNet (
  (features): Sequential (
    (stem): StemBlock (
      (conv): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
      (norm): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
      (relu): ReLU (inplace)
      (pool): MaxPool2d (size=(3, 3), stride=(2, 2), padding=(1, 1), dilation=(1, 1))
    )
    (denseblock.0): DenseBlock (
      (denselayer.0): Sequential (
        (norm.0): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
        (relu.0): ReLU (inplace)
        (conv.0): Conv2d(64, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (norm.1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
        (relu.1): ReLU (inplace)
        (conv.1): Conv2d(128, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      )
      (denselayer.1): Sequential (
        (norm.0): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True)
        (relu.0): ReLU (inplace)
        (conv.0): Conv2d(96, 128, kernel_size=(1, 1),

- DenseNet-161

In [13]:
DenseNet(
    n_init_features=96,
    growth_rate=48,
    blocks=(6, 12, 36, 24),
    bottleneck_size=1,
)

DenseNet (
  (features): Sequential (
    (stem): StemBlock (
      (conv): Conv2d(3, 96, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
      (norm): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True)
      (relu): ReLU (inplace)
      (pool): MaxPool2d (size=(3, 3), stride=(2, 2), padding=(1, 1), dilation=(1, 1))
    )
    (denseblock.0): DenseBlock (
      (denselayer.0): Sequential (
        (norm.0): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True)
        (relu.0): ReLU (inplace)
        (conv.0): Conv2d(96, 48, kernel_size=(1, 1), stride=(1, 1), bias=False)
      )
      (denselayer.1): Sequential (
        (norm.0): BatchNorm2d(144, eps=1e-05, momentum=0.1, affine=True)
        (relu.0): ReLU (inplace)
        (conv.0): Conv2d(144, 48, kernel_size=(1, 1), stride=(1, 1), bias=False)
      )
      (denselayer.2): Sequential (
        (norm.0): BatchNorm2d(192, eps=1e-05, momentum=0.1, affine=True)
        (relu.0): ReLU (inplace)
        (conv.0): Conv2d(1

- DenseNet-BC for CIFAR

In [14]:
DenseNet(
    n_init_features=24,
    stem_mode='3x3',
    blocks=(16, 16, 16),
    compression_factors=(0.5, 0.5),
    growth_rate=12,
    n_classes=10,
)

DenseNet (
  (features): Sequential (
    (stem): StemBlock (
      (conv): Conv2d(3, 24, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    )
    (denseblock.0): DenseBlock (
      (denselayer.0): Sequential (
        (norm.0): BatchNorm2d(24, eps=1e-05, momentum=0.1, affine=True)
        (relu.0): ReLU (inplace)
        (conv.0): Conv2d(24, 48, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (norm.1): BatchNorm2d(48, eps=1e-05, momentum=0.1, affine=True)
        (relu.1): ReLU (inplace)
        (conv.1): Conv2d(48, 12, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      )
      (denselayer.1): Sequential (
        (norm.0): BatchNorm2d(36, eps=1e-05, momentum=0.1, affine=True)
        (relu.0): ReLU (inplace)
        (conv.0): Conv2d(36, 48, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (norm.1): BatchNorm2d(48, eps=1e-05, momentum=0.1, affine=True)
        (relu.1): ReLU (inplace)
        (conv.1): Conv2d(48, 12, kernel_size=(3, 3), s