### Densely Connected Convolutional Networks

**Authors:** G. Huang, Z. Liu, L. van der Maaten, K. Weinberger  
**Link:** https://arxiv.org/pdf/1608.06993.pdf

---

- Introduces the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion to enable maximum information flow between layers in the network.
- A DenseNet is made up of `Dense Blocks` connected by `Transition Layers`.
- DenseNet Advantages
    - Alleviate vanishing gradient problem 
    - Strengthen feature propagation
    - Encourage feature reuse
    - Reduce the number of parameters
- Traditional CNNs with $L$ layers have $L$ connections (one between each layer and its subsequent layer), whereas a DenseNet has $\frac{L(L+1)}{2}$ direct connections
- Concatenating feature maps learned by different layers increases variation in the input of subsequent layers and improves efficiency.
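
The connection counts above can be checked with a few lines of Python. This is only a counting sketch: the function names are made up for illustration, and it assumes that layer $\ell$ (1-indexed) has one incoming connection from the input and one from each of the $\ell - 1$ layers before it.

```python
def chain_connections(num_layers: int) -> int:
    """Connections in a plain feed-forward CNN: one into each layer."""
    return num_layers


def dense_connections(num_layers: int) -> int:
    """Direct connections in a densely connected network.

    Layer l (1-indexed) receives the network input plus the outputs of the
    l - 1 layers before it, so it has l incoming connections.
    """
    return sum(range(1, num_layers + 1))


for L in (5, 12, 40):
    # The explicit sum agrees with the closed form L(L+1)/2.
    assert dense_connections(L) == L * (L + 1) // 2
    print(L, chain_connections(L), dense_connections(L))
```

For $L = 5$ this gives 5 connections for the chain and 15 for the dense pattern.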

---

**DenseBlock**

- Consider a single image $\mathbf{x_{0}}$ that is passed through a CNN
- $\mathbf{x_{0}} \rightarrow$ Input image
- $L \rightarrow$ # of Layers in the network
- $H_{\ell}(\cdot) \rightarrow$ A non-linear transformation implemented by layer $\ell$. It can be a composite function of operations such as Batch Normalization, ReLU, Pooling, or Convolution. `NOTE - Composite function used: [BN - ReLU - Conv(3x3)]`
- $\mathbf{x_{\ell}} \rightarrow$ Output of the $\ell^{th}$ layer
![](images/dense-block-details.png)
- **Layer Connectivity**
    - Traditional CNNs: $\mathbf{x_{\ell}} = H_{\ell}(\mathbf{x_{\ell-1}})$
    - ResNet - Adds a skip-connection that bypasses non-linear transformations with an identity function: $\mathbf{x_{\ell}} = H_{\ell}(\mathbf{x_{\ell-1}}) + \mathbf{x_{\ell-1}}$ 
    - <span style="color:steelblue">DenseNet - The $\ell^{th}$ layer receives the feature maps of all preceding layers $\mathbf{x_{0}}, \mathbf{x_{1}}, \ldots, \mathbf{x_{\ell-1}}$ as input: $\mathbf{x_{\ell}} = H_{\ell}([\mathbf{x_{0}}, \mathbf{x_{1}}, \ldots, \mathbf{x_{\ell-1}}])$, where $[\mathbf{x_{0}}, \mathbf{x_{1}}, \ldots, \mathbf{x_{\ell-1}}]$ is concatenation of the feature maps produced in layers $0, 1, \ldots, \ell-1$</span>
- **Pooling Layers** - An essential part of a CNN that changes the size of the feature maps
    - Concatenation is not possible when the size of the feature maps changes. To facilitate pooling, the network is divided into multiple densely connected **`dense blocks`**
    - Layers between two adjacent **`dense blocks`** are referred to as **`transition layers`**. They change feature map sizes via convolution and pooling. `NOTE: Transition layer used: [BN - Conv(1x1) - Pool(2x2)]`
    ![](images/densenet-full.png)
- **Growth Rate $(k)$**
    - Growth Rate $(k)$ is a hyper-parameter
    - $k_{0} \rightarrow$ # of channels in the input layer (the input to the dense block)
    - Suppose each $H_{\ell}(\cdot)$ produces $k$ feature maps as output; then the $\ell^{th}$ layer has $k\times(\ell-1) + k_0$ input feature maps
- **Bottleneck Layers**
    - A 1x1 convolution is used as a bottleneck layer before each 3x3 convolution to reduce the number of input feature maps, and thus to improve computational efficiency
    - *Each 1x1 convolution produces $4k$ feature maps*
    - Network with bottleneck layer: `DenseNet-B`
- **Compression**
    - To improve model compactness, the authors reduce the number of feature maps at transition layers.
    - If a dense block contains $m$ feature maps, the following transition layer generates $\lfloor \theta m \rfloor$ output feature maps, where $0 < \theta \leq 1$ and $\theta \rightarrow$ Compression factor. When $\theta = 1$, the number of feature maps across the transition layer remains unchanged
    - $\theta = 0.5$ is used in the experiments
    - DenseNet with compression: `DenseNet-C` and DenseNet with bottleneck and compression: `DenseNet-BC`
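
The connectivity, growth-rate, and compression rules above can be traced without any deep learning library. The sketch below is a bookkeeping model only, not an implementation of DenseNet: a feature map is represented as a list of channel labels, so concatenation along the channel axis becomes list concatenation. The stand-in for $H_{\ell}$ simply emits $k$ new labels. The function names and the values `k0 = 16`, `k = 12` are illustrative assumptions.

```python
import math


def dense_block(x0_channels, num_layers, growth_rate):
    """Trace channel labels through a dense block.

    Returns the number of input channels seen by each layer and the
    concatenated output of the block.
    """
    features = [list(x0_channels)]  # starts as [x_0]
    input_widths = []
    for layer in range(1, num_layers + 1):
        # [x_0, x_1, ..., x_{l-1}]: concatenate everything produced so far.
        concatenated = [c for fmap in features for c in fmap]
        input_widths.append(len(concatenated))
        # Stand-in for H_l: it would consume `concatenated` and emit k maps.
        x_l = [f"layer{layer}_ch{i}" for i in range(growth_rate)]
        features.append(x_l)
    block_output = [c for fmap in features for c in fmap]
    return input_widths, block_output


def transition_channels(m, theta):
    """Output channels of a transition layer with compression factor theta."""
    if not 0 < theta <= 1:
        raise ValueError("theta must satisfy 0 < theta <= 1")
    return math.floor(theta * m)


k0, k = 16, 12  # illustrative values, not taken from the notes above
widths, out = dense_block([f"in_ch{i}" for i in range(k0)], num_layers=4, growth_rate=k)
print(widths)                              # [16, 28, 40, 52]
print(len(out))                            # 64
print(transition_channels(len(out), 0.5))  # 32
```

Each entry of `widths` equals $k\times(\ell-1) + k_0$, and the block output has $k_0 + 4k$ channels before compression.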

---

**DenseNet-BC (ImageNet)** `NOTE: Check paper for CIFAR and SVHN implementation details`

![](images/densenet-table.png)
- Each `conv` layer shown in the table corresponds to the sequence `BN-ReLU-Conv`
- 4 Dense Blocks
- Input size: 224x224; the initial convolution layer has $2k$ filters of size 7x7 with stride 2
- Data augmentation same as ResNet
- Training:
    - Weight decay: $10^{-4}$
    - Weight initialization: "Microsoft Research Asia" (MSRA) weight initialization, as discussed in the paper [<span style="color:teal">Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification</span>](https://arxiv.org/pdf/1502.01852.pdf). It is similar to `Xavier` initialization except that it is designed for ReLU instead of tanh activation. Weights are initialized from a zero-mean Gaussian distribution whose standard deviation is $\sqrt{\frac{2}{n_l}}$, where $n_l = k^{2}_{l}d_{l-1}$, $k_l \rightarrow$ spatial filter size in layer $l$, and $d_{l-1} \rightarrow$ number of filters in layer $l-1$
    - Nesterov momentum of 0.9 without dampening
    - The model is trained for 90 epochs with a mini-batch size of 256. The learning rate is initially set to 0.1 and lowered by a factor of 10 after epoch 30 and epoch 60. `NOTE: Because of GPU memory constraints, the largest model (DenseNet-161) is trained with a mini-batch size of 128. To compensate for the smaller batch size, this model was trained for 100 epochs, and the learning rate was divided by 10 after epoch 90`
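
The step schedule and the initialization formula from the training list can be written out as two small functions. This is a sketch under stated assumptions rather than the authors' training code: epochs are assumed to be 0-indexed, the drop is assumed to take effect from epoch 30 and epoch 60 onward, and the function names are hypothetical.

```python
import math


def learning_rate(epoch, base_lr=0.1, milestones=(30, 60), factor=0.1):
    """Step schedule: multiply base_lr by `factor` once per milestone reached.

    Assumes 0-indexed epochs, with each drop applied from the milestone on.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops


def he_std(kernel_size, in_channels):
    """Standard deviation sqrt(2 / n_l), with n_l = k_l^2 * d_{l-1}."""
    fan_in = kernel_size ** 2 * in_channels
    return math.sqrt(2.0 / fan_in)


# Learning rate just before and after each drop.
print([learning_rate(e) for e in (0, 29, 30, 59, 60, 89)])
# Standard deviation for a 3x3 convolution with 128 input channels.
print(he_std(kernel_size=3, in_channels=128))
```

For the 3x3 convolution with 128 input channels, $n_l = 9 \times 128 = 1152$, so the standard deviation is $\sqrt{2/1152} = 1/24$.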

---

**DenseNet-BC-121 ($k=32$): Up to First Transition Layer** (NOTE: `Conv` corresponds to `BN-ReLU-Conv`)

- **Initial**
    - $224 \times 224 \times 3 \rightarrow$ Conv(7x7, s:2, p:3, c: $2k$) $\rightarrow$ BN-ReLU $\rightarrow 112 \times 112 \times 64$
    - $112 \times 112 \times 64 \rightarrow$ Pool.Max(3x3, s:2, p:1) $\rightarrow 56 \times 56 \times 64$
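
The spatial sizes in the two steps above follow from the usual output-size formula for a sliding window, $\lfloor (n + 2p - f) / s \rfloor + 1$ for input size $n$, filter size $f$, stride $s$, and padding $p$. The helper below is a minimal check of that arithmetic; the function name is hypothetical and floor rounding is an assumption (it is the common default, but the notes do not state it).

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling window (floor rounding)."""
    return (size + 2 * padding - kernel) // stride + 1


# 7x7 convolution, stride 2, padding 3, followed by 3x3 max pooling, stride 2, padding 1.
after_conv = conv_output_size(224, kernel=7, stride=2, padding=3)
after_pool = conv_output_size(after_conv, kernel=3, stride=2, padding=1)
print(after_conv, after_pool)  # 112 56
```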
    
---

- **DenseBlock-1**
    - $56 \times 56 \times 64 \rightarrow$ Conv(1x1, s:1, p:0, c: $4k$) $\rightarrow 56 \times 56 \times 128 \rightarrow$ Conv(3x3, s:1, p:1, c: $k$) $\rightarrow 56 \times 56 \times 32$
    - $56 \times 56 \times [64+32] \rightarrow$ Conv(1x1, s:1, p:0, c: $4k$) $\rightarrow 56 \times 56 \times 128 \rightarrow$ Conv(3x3, s:1, p:1, c: $k$) $\rightarrow 56 \times 56 \times 32$
    - $56 \times 56 \times [64+32+32] \rightarrow$ Conv(1x1, s:1, p:0, c: $4k$) $\rightarrow 56 \times 56 \times 128 \rightarrow$ Conv(3x3, s:1, p:1, c: $k$) $\rightarrow 56 \times 56 \times 32$
    - $56 \times 56 \times [64+32+32+32] \rightarrow$ Conv(1x1, s:1, p:0, c: $4k$) $\rightarrow 56 \times 56 \times 128 \rightarrow$ Conv(3x3, s:1, p:1, c: $k$) $\rightarrow 56 \times 56 \times 32$
    - $56 \times 56 \times [64+32+32+32+32] \rightarrow$ Conv(1x1, s:1, p:0, c: $4k$) $\rightarrow 56 \times 56 \times 128 \rightarrow$ Conv(3x3, s:1, p:1, c: $k$) $\rightarrow 56 \times 56 \times 32$
    - $56 \times 56 \times [64+32+32+32+32+32] \rightarrow$ Conv(1x1, s:1, p:0, c: $4k$) $\rightarrow 56 \times 56 \times 128 \rightarrow$ Conv(3x3, s:1, p:1, c: $k$) $\rightarrow 56 \times 56 \times 32$

---

- **Transition-1**
    - $\ell=6$ (number of layers in DenseBlock-1) and compression factor $\theta=0.5$
    - $56 \times 56 \times [64+32+32+32+32+32+32] \rightarrow$ Conv(1x1, s:1, p:0, c: $\lfloor \theta \times (k\times \ell + k_0) \rfloor$) $\rightarrow 56 \times 56 \times 128$
    - $56 \times 56 \times 128 \rightarrow$ Pool.Avg(2x2, s:2, p:0) $\rightarrow 28 \times 28 \times 128$
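
The channel counts listed for DenseBlock-1 and Transition-1 can be regenerated from $k_0 = 64$, $k = 32$, six layers, and a compression factor of 0.5. The sketch below only does shape bookkeeping (no convolutions are computed), and the function name and tuple layout are hypothetical.

```python
import math


def trace_block1_and_transition(k0=64, k=32, num_layers=6, theta=0.5, size=56):
    """Return (stage, height, width, channels) tuples for DenseBlock-1 and Transition-1."""
    rows = []
    channels = k0
    for layer in range(1, num_layers + 1):
        rows.append((f"layer {layer} input", size, size, channels))
        rows.append((f"layer {layer} 1x1 conv", size, size, 4 * k))  # bottleneck
        rows.append((f"layer {layer} 3x3 conv", size, size, k))      # k new maps
        channels += k  # the new maps are concatenated onto the running input
    rows.append(("block output", size, size, channels))
    channels = math.floor(theta * channels)  # compression in the 1x1 transition conv
    rows.append(("transition 1x1 conv", size, size, channels))
    size //= 2  # 2x2 average pooling with stride 2
    rows.append(("transition avg pool", size, size, channels))
    return rows


for row in trace_block1_and_transition():
    print(row)
```

The last three rows show the block output at $56 \times 56 \times 256$, the compressed map at $56 \times 56 \times 128$, and the pooled map at $28 \times 28 \times 128$.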

---

- **DenseBlock-2 ...** 

**Vanishing Gradient Problem** 

- As information about the input or the gradient passes through many layers, it can vanish by the time it reaches the end (or beginning) of the network. The severity of the vanishing gradient problem depends on the choice of activation function. Saturating activation functions such as sigmoid and tanh squash their inputs into a narrow range ((0, 1) and (-1, 1), respectively), so even a large change in the input produces only a small change in the output, which results in a small gradient. The problem becomes worse as the network gets deeper, because these small gradients are multiplied together during backpropagation.

- Addressing Vanishing Gradient Problem
    - Using ReLU Activation
    - ResNet
    - Highway Networks
    - Stochastic Depth - It shortens ResNets by randomly dropping layers (because many layers contribute very little and can be dropped) during training to allow better information and gradient flow
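
The effect of depth on the gradient, and why ReLU appears in the list above, can be illustrated with a toy calculation. The sketch assumes a chain of identical scalar layers with all weights fixed at 1, so the backpropagated gradient is just the local derivative raised to the power of the depth. This is a deliberate simplification, not a model of a real network, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # peaks at 0.25 when z == 0


def relu_grad(z):
    return 1.0 if z > 0 else 0.0


def chained_gradient(local_grad, z, depth):
    """Product of `depth` identical local derivatives (weights fixed at 1)."""
    return local_grad(z) ** depth


for depth in (1, 5, 20):
    print(depth,
          chained_gradient(sigmoid_grad, 0.0, depth),
          chained_gradient(relu_grad, 1.0, depth))
```

Even at its best case of 0.25 per layer, the sigmoid chain shrinks geometrically with depth, while the ReLU chain stays at 1 for a positive input.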
