### Densely Connected Convolutional Networks
- **Authors:** Gao Huang et al.
- **[ArXiv Link](https://arxiv.org/pdf/1608.06993.pdf)** 

---
- Introduced the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. This enables maximum information flow between layers in the network.
- DenseNet Advantages
    - Alleviate vanishing gradient problem 
    - Strengthen feature propagation
    - Encourage feature reuse
    - Reduce the number of parameters
- Traditional CNNs with $L$ layers have $L$ connections (one between each layer and its subsequent layer), whereas DenseNet has $\frac{L(L+1)}{2}$ direct connections
- Concatenating feature maps learned by different layers increases variation in the input of subsequent layers and improves efficiency.
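
To make the two connection counts concrete, here is a minimal sketch (the function names are illustrative and not taken from the paper). It assumes that layer $\ell$ receives one direct connection from each of $\mathbf{x_{0}}, \ldots, \mathbf{x_{\ell-1}}$, so the dense total is $1 + 2 + \ldots + L$.

```python
def traditional_connections(num_layers):
    """One connection per layer: each layer feeds only the next one."""
    return num_layers


def dense_connections(num_layers):
    """Layer l receives l inputs (x_0 ... x_{l-1}); sum over l = 1..L."""
    return sum(range(1, num_layers + 1))


for L in (5, 10, 100):
    # The explicit sum agrees with the closed form L(L+1)/2.
    assert dense_connections(L) == L * (L + 1) // 2
    print(L, traditional_connections(L), dense_connections(L))
```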
---
- Consider a single image $\mathbf{x_{0}}$ that is passed through a CNN
- $\mathbf{x_{0}} \rightarrow$ Input image
- $L \rightarrow$ # of Layers in the network
- $H_{\ell}(\cdot) \rightarrow$ A non-linear transformation implemented by layer $\ell$. Non-linear transformation can be a composite function of operations such as Batch Normalization, ReLU, Pooling, or Convolution. `NOTE - Composite function used: [BN - ReLU - Conv(3x3)]`
- $\mathbf{x_{\ell}} \rightarrow$ Output of the $\ell^{th}$ layer
![](images/dense-block-details.png)
- **Layer Connectivity**
    - Traditional CNNs: $\mathbf{x_{\ell}} = H_{\ell}(\mathbf{x_{\ell-1}})$
    - ResNet - Adds a skip-connection that bypasses non-linear transformations with an identity function: $\mathbf{x_{\ell}} = H_{\ell}(\mathbf{x_{\ell-1}}) + \mathbf{x_{\ell-1}}$ 
    - <font color=steelblue>DenseNet - The $\ell^{th}$ layer receives the feature maps of all preceding layers $\mathbf{x_{0}}, \mathbf{x_{1}}, \ldots, \mathbf{x_{\ell-1}}$ as input: $\mathbf{x_{\ell}} = H_{\ell}([\mathbf{x_{0}}, \mathbf{x_{1}}, \ldots, \mathbf{x_{\ell-1}}])$, where $[\mathbf{x_{0}}, \mathbf{x_{1}}, \ldots, \mathbf{x_{\ell-1}}]$ is concatenation of the feature maps produced in layers $0, 1, \ldots, \ell-1$</font>
- **Pooling Layers** - An essential part of CNNs that changes the size of feature maps
    - Concatenation is not possible when the size of the feature maps changes. To facilitate pooling, the network is divided into multiple densely connected **`dense blocks`**
    - Layers between two adjacent **`dense blocks`** are referred to as **`transition layers`**. They change feature map sizes via convolution and pooling. `NOTE: Transition layer used: [BN - Conv(1x1) - Pool(2x2)]`
    ![](images/densenet-full.png)
- **Growth Rate $(k)$**
    - Growth Rate $(k)$ is a hyper-parameter
    - $k_{0} \rightarrow$ # of channels in the input layer (the input to the dense block)
    - If each $H_{\ell}(\cdot)$ produces $k$ feature maps as output, then the $\ell^{th}$ layer has $k\times(\ell-1) + k_0$ input feature maps
- **Bottleneck Layers**
    - Used 1x1 convolution as bottleneck layer before each 3x3 convolution to reduce the number of input feature maps, and thus to improve computational efficiency
    - *Each 1x1 convolution produces $4k$ feature maps*
    - Network with bottleneck layer: `DenseNet-B`
- **Compression**
    - To improve model compactness, the authors reduce the number of feature maps at transition layers.
    - If a dense block contains $m$ feature maps, the following transition layer generates $\lfloor \theta m \rfloor$ output feature maps, where $0 < \theta \leq 1$ and $\theta \rightarrow$ Compression factor. When $\theta = 1$, the number of feature maps across the transition layer remains unchanged
    - Used $\theta = 0.5$ for experiment
    - DenseNet with compression: `DenseNet-C` and DenseNet with bottleneck and compression: `DenseNet-BC`
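
The connectivity, growth-rate, and compression rules above can be sketched in NumPy. This is only an illustration under stated assumptions: `make_layer` is a hypothetical stand-in for $H_{\ell}$ that uses a random 1x1 channel mix plus ReLU instead of the paper's `BN - ReLU - Conv(3x3)`, and the values $k_0 = 64$, $k = 32$, six layers and $\theta = 0.5$ are chosen to match the first dense block of the ImageNet model.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_layer(in_channels, out_channels):
    """Toy stand-in for H_l: a random 1x1 convolution followed by ReLU."""
    weight = rng.standard_normal((out_channels, in_channels)) * 0.1

    def layer(x):
        # x has shape (C_in, H, W); mix channels at every pixel, then apply ReLU.
        return np.maximum(np.einsum('oc,chw->ohw', weight, x), 0.0)

    return layer


k0, k, num_layers, theta = 64, 32, 6, 0.5
x0 = rng.standard_normal((k0, 8, 8))

features = [x0]
for ell in range(1, num_layers + 1):
    dense_input = np.concatenate(features, axis=0)      # [x_0, ..., x_{l-1}]
    assert dense_input.shape[0] == k0 + k * (ell - 1)   # channels seen by layer l
    features.append(make_layer(dense_input.shape[0], k)(dense_input))

m = np.concatenate(features, axis=0).shape[0]           # channels leaving the block
transition_channels = int(np.floor(theta * m))          # compression: floor(theta * m)
print(m, transition_channels)                           # 256 128
```

Concatenation along the channel axis is what distinguishes this from a ResNet block, where the layer output would be added to its input and the channel count would stay fixed.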

---

**DenseNet-BC (ImageNet)** `NOTE: Check paper for CIFAR and SVHN implementation details`
![](images/densenet-table.png)
- The `conv` layer shown in the table corresponds to the sequence `BN-ReLU-Conv`
- 4 Dense Blocks
- Input size: 224x224; the initial convolution layer has $2k$ filters of size 7x7 with stride 2
- Data augmentation same as ResNet
- Training:
    - Weight decay: $10^{-4}$
    - Weight initialization: As discussed in the paper [Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification](https://arxiv.org/pdf/1502.01852.pdf) - "Microsoft Research Asia" (MSRA) weight initialization. It is similar to `Xavier` initialization except that it is designed for ReLU rather than tanh activations. In this method weights are initialized from a zero-mean Gaussian distribution whose standard deviation is $\sqrt{\frac{2}{n_l}}$, where $n_l = k^{2}_{l}d_{l-1}$, $k_l \rightarrow$ Spatial filter size in layer $l$; $d_{l-1} \rightarrow$ Number of filters in layer $l-1$
    - Nesterov momentum of 0.9 without dampening
    - Train model for 90 epochs with mini-batch size: 256, learning rate: 0.1 set initially and lowered by a factor of 10 after epoch 30 and epoch 60. `NOTE: Because of GPU memory constraints, largest model (DenseNet-161) is trained with a mini-batch size 128. To compensate for the smaller batch size, model was trained for 100 epochs, and the learning rate was divided by 10 after epoch 90`
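
A minimal NumPy sketch of the MSRA initialization and the step learning-rate schedule described above. The function names and the 0-indexed epoch convention are assumptions made for illustration; the paper's actual training code is not reproduced here.

```python
import numpy as np


def msra_init(num_filters, in_channels, kernel_size, rng):
    """Zero-mean Gaussian weights with std sqrt(2 / n_l), n_l = k_l^2 * d_{l-1}."""
    n_l = kernel_size ** 2 * in_channels
    std = np.sqrt(2.0 / n_l)
    return rng.normal(0.0, std, size=(num_filters, in_channels, kernel_size, kernel_size))


def learning_rate(epoch, base_lr=0.1, milestones=(30, 60), factor=10.0):
    """Step schedule: divide the rate by `factor` at each milestone (0-indexed epochs)."""
    drops = sum(epoch >= m for m in milestones)
    return base_lr / factor ** drops


rng = np.random.default_rng(0)
w = msra_init(num_filters=128, in_channels=64, kernel_size=3, rng=rng)

# The empirical standard deviation should be close to sqrt(2 / (3 * 3 * 64)).
print(w.shape, w.std(), np.sqrt(2.0 / (3 * 3 * 64)))
print([learning_rate(e) for e in (0, 29, 30, 59, 60, 89)])
```

For the DenseNet-161 variant mentioned in the note, the same function could be called with `milestones=(90,)` over 100 epochs.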

---

**DenseNet-BC-121 ($k=32$): Up to First Transition Layer** (NOTE: `Conv` corresponds to `BN-ReLU-Conv`)

- **Initial**
    - $224 \times 224 \times 3 \rightarrow$ Conv(7x7, s:2, p:3, c: $2k$) $\rightarrow BN-ReLU$ $\rightarrow 112 \times 112 \times 64 $
    - $112 \times 112 \times 64 \rightarrow$ Pool.Max(3x3, s:2, p:1) $\rightarrow 56 \times 56 \times 64 $
---
- **DenseBlock-1**
    - $56 \times 56 \times 64 \rightarrow$ Conv(1x1, s:1, p:0, c: $4k$) $\rightarrow 56 \times 56 \times 128 \rightarrow$ Conv(3x3, s:1, p:1, c: $k$) $\rightarrow 56 \times 56 \times 32 $
    - $56 \times 56 \times [64+32] \rightarrow$ Conv(1x1, s:1, p:0, c: $4k$) $\rightarrow 56 \times 56 \times 128 \rightarrow$ Conv(3x3, s:1, p:1, c: $k$) $\rightarrow 56 \times 56 \times 32 $
    - $56 \times 56 \times [64+32+32] \rightarrow$ Conv(1x1, s:1, p:0, c: $4k$) $\rightarrow 56 \times 56 \times 128 \rightarrow$ Conv(3x3, s:1, p:1, c: $k$) $\rightarrow 56 \times 56 \times 32 $
    - $56 \times 56 \times [64+32+32+32] \rightarrow$ Conv(1x1, s:1, p:0, c: $4k$) $\rightarrow 56 \times 56 \times 128 \rightarrow$ Conv(3x3, s:1, p:1, c: $k$) $\rightarrow 56 \times 56 \times 32 $
    - $56 \times 56 \times [64+32+32+32+32] \rightarrow$ Conv(1x1, s:1, p:0, c: $4k$) $\rightarrow 56 \times 56 \times 128 \rightarrow$ Conv(3x3, s:1, p:1, c: $k$) $\rightarrow 56 \times 56 \times 32 $
    - $56 \times 56 \times [64+32+32+32+32+32] \rightarrow$ Conv(1x1, s:1, p:0, c: $4k$) $\rightarrow 56 \times 56 \times 128 \rightarrow$ Conv(3x3, s:1, p:1, c: $k$) $\rightarrow 56 \times 56 \times 32 $
---
- **Transition-1**
    - $\ell=6$ and $compression=0.5$
    - $56 \times 56 \times [64+32+32+32+32+32+32] \rightarrow$ Conv(1x1, s:1, p:0, c: $(k\times \ell + k_0)\times compression$) $\rightarrow 56 \times 56 \times 128$
    - $56 \times 56 \times 128 \rightarrow$ Pool.Avg(2x2, s:2, p:0) $\rightarrow 28 \times 28 \times 128$
    ---
- **DenseBlock-2 ...** 
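
The shapes in the walk-through above can be reproduced with a few lines of Python. This sketch only tracks spatial size and channel count; `conv_out` is an illustrative helper that applies the usual output-size formula `(size + 2 * pad - kernel) // stride + 1`.

```python
def conv_out(size, kernel, stride, pad):
    """Spatial output size of a convolution or pooling window."""
    return (size + 2 * pad - kernel) // stride + 1


k, theta, num_layers = 32, 0.5, 6   # growth rate, compression, layers in DenseBlock-1

# Initial stage: Conv(7x7, s:2, p:3, c:2k) followed by max pooling (3x3, s:2, p:1).
size = conv_out(224, kernel=7, stride=2, pad=3)     # 112
size = conv_out(size, kernel=3, stride=2, pad=1)    # 56
channels = 2 * k                                    # 64

# DenseBlock-1: every layer sees all earlier feature maps and adds k new ones.
for layer in range(1, num_layers + 1):
    print('layer', layer, 'input:', (size, size, channels), 'output:', (size, size, k))
    channels += k

# Transition-1: Conv(1x1) compresses the channels, average pooling halves the size.
channels = int(theta * channels)
size = conv_out(size, kernel=2, stride=2, pad=0)
print('after transition:', (size, size, channels))  # (28, 28, 128)
```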

**Vanishing Gradient Problem** 
- As information about the input or the gradient passes through many layers, it can vanish by the time it reaches the end (or beginning) of the network. The vanishing gradient problem depends on the choice of activation function. Saturating non-linear activation functions such as `sigmoid` or `tanh` squash their inputs into a small range ([0, 1] or [-1, 1] respectively), so even a large change in input produces only a small change in output, which results in a small gradient. The problem becomes worse as the network gets deeper.
- Addressing Vanishing Gradient Problem
    - Using ReLU Activation
    - ResNet
    - Highway Networks
    - Stochastic Depth - It shortens ResNets by randomly dropping layers (because many layers contribute very little and can be dropped) during training to allow better information and gradient flow
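
A rough numerical illustration of why depth makes the problem worse. It is a simplification that assumes every weight equals 1, so that only the activation derivatives enter the chain rule; real networks also multiply by weight matrices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)           # never exceeds 0.25, reached at z = 0


def relu_grad(z):
    return 1.0 if z > 0 else 0.0   # exactly 1 wherever the unit is active


z = 0.0  # best case for sigmoid: its derivative is largest here
for depth in (1, 5, 10, 20):
    # Chain rule through `depth` stacked activations (weights taken as 1).
    print(depth, sigmoid_grad(z) ** depth, relu_grad(1.0) ** depth)
```

Even in the best case the sigmoid factor shrinks as $0.25^{depth}$, while an active ReLU passes the gradient through unchanged.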

**Removing Computational Bottleneck Using: `CONV(1x1, s:1, p:0)`**

1x1 convolutional layers are used as dimension reduction modules to remove computational bottlenecks as well as to increase non-linearity. For example, convolving a feature map of size 100 x 100 x C with $k$ 1x1 filters results in a feature map of size 100 x 100 x $k$.
![](images/conv-1x1.png)

---

- Suppose a convolutional layer outputs a tensor (feature maps) of size ($N$, $F$, $H$, $W$), where $N$: Batch size; $F$: Number of convolutional filters; $H$ and $W$: Height and width of feature maps. If this output is fed into a convolution layer with $f$ 1x1 filters, no padding (pad 0) and stride 1, then the output tensor will have size ($N$, $f$, $H$, $W$). Thus a 1x1 convolution layer changes the dimensionality (number of filters).
    - If $f > F \rightarrow$ then dimensionality (number of filters) is increased
    - If $f < F \rightarrow$ then dimensionality (number of filters) is decreased
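
A 1x1 convolution is a matrix multiplication applied to the channel vector at every pixel, which the following NumPy sketch makes explicit. The sizes are arbitrary illustrative values with $f < F$, so the number of channels is reduced.

```python
import numpy as np

rng = np.random.default_rng(0)
N, F, H, W = 2, 256, 7, 7          # batch, input channels, height, width
f = 64                             # number of 1x1 filters

x = rng.standard_normal((N, F, H, W))
w = rng.standard_normal((f, F))    # a 1x1 filter bank is just an (f, F) matrix
b = rng.standard_normal(f)

# Apply the same (f, F) matrix to the channel vector at every pixel.
out = np.einsum('oc,nchw->nohw', w, x) + b[None, :, None, None]
print(out.shape)                   # (2, 64, 7, 7)

# The same result via an explicit reshape to (pixels, channels) and a matmul.
flat = x.transpose(0, 2, 3, 1).reshape(-1, F) @ w.T + b
assert np.allclose(out, flat.reshape(N, H, W, f).transpose(0, 3, 1, 2))
```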
    
![](images/bottleneck-comparison.png)

**Naive Convolutions** *(CS231n Assignment-2: My solution)*
![](images/im2col.png)
![](images/im2col-for-loop.png)

```python
import numpy as np

def conv_forward_naive(x, w, b, conv_param):
    """
    A naive implementation of the forward pass for a convolutional layer.

    The input consists of N data points, each with C channels, height H and width
    W. We convolve each input with F different filters, where each filter spans
    all C channels and has height HH and width WW.

    Input:
    - x: Input data of shape (N, C, H, W)
    - w: Filter weights of shape (F, C, HH, WW)
    - b: Biases, of shape (F,)
    - conv_param: A dictionary with the following keys:
    - 'stride': The number of pixels between adjacent receptive fields in the
      horizontal and vertical directions.
    - 'pad': The number of pixels that will be used to zero-pad the input.

    Returns a tuple of:
    - out: Output data, of shape (N, F, H', W') where H' and W' are given by
    H' = 1 + (H + 2 * pad - HH) // stride
    W' = 1 + (W + 2 * pad - WW) // stride
    - cache: (x, w, b, conv_param)
    """
    out = None
    #############################################################################
    # TODO: Implement the convolutional forward pass.                           #
    # Hint: you can use the function np.pad for padding.                        #
    #############################################################################
    N, C, H, W = x.shape
    F, _, HH, WW = w.shape
    stride = conv_param['stride']
    pad = conv_param['pad']
    
    # Pad input with zeros along the two spatial axes only
    x_padded = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode='constant')
    
    # Calculate output dimensions (integer division so they can be used as sizes)
    H_out = 1 + (H + 2 * pad - HH) // stride
    W_out = 1 + (W + 2 * pad - WW) // stride
    
    # Create 'out' array of output data shape filled with zeros
    out = np.zeros((N, F, H_out, W_out))
    
    ##----- im2col implementation - CS231n: winter1516_lecture_11.pdf -----##
    # Calculate new size = K * K * C
    filter_new_size = HH * WW * C 
    
    # Reshape Filter: New shape = # of Filters x (K * K * C)
    filter_reshaped = np.reshape(w, (F, filter_new_size))
    
    # Convolution Steps
    for i in range(H_out):
        top = i * stride # Top index
        bottom = top + HH # Bottom index = Top index + Filter Height
        
        for j in range(W_out):
            left = j * stride # Left index
            right = left + WW # Right index = Left index + Filter Width
            
            # Slice x_padded as per top to bottom range and left to right range 
            # NOTE: Resulting shape = N x C x K x K
            x_slice = x_padded[:, :, top:bottom, left:right]
            
            # Flatten each sample's patch, then transpose: New shape = (K * K * C) x N
            # (reshaping directly to (K * K * C, N) would mix samples when N > 1)
            x_slice_reshaped = x_slice.reshape(N, filter_new_size).T
            
            # Calculate: [# of Filters x (K * K * C) . (K * K * C) x N] + b, i.e. y = w'x + b
            temp_y = filter_reshaped.dot(x_slice_reshaped).T + b
            out[:, :, i, j] = temp_y
    ##---------------------------------------------------------------------## 
    
    #############################################################################
    #                             END OF YOUR CODE                              #
    #############################################################################
    cache = (x, w, b, conv_param)
    return out, cache
```

```python
# Test: Conv(1x1, s:1, p:0)

np.set_printoptions(precision=3)

x_shape = (1, 20, 5, 5)
w_shape = (1, 20, 1, 1)
stride = 1
padding = 0

x = np.linspace(-0.1, 0.5, num=np.prod(x_shape)).reshape(x_shape)
w = np.linspace(-0.2, 0.3, num=np.prod(w_shape)).reshape(w_shape)
b = np.linspace(-0.1, 0.2, num=w.shape[0])
conv_param = {'stride': stride, 'pad': padding}

out, _ = conv_forward_naive(x, w, b, conv_param=conv_param)
print('X shape: {} with Stride: {} and Padding: {} returns out shape: {}'.format(x.shape, stride, padding, out.shape))
# NOTE: out.size is the number of elements in the output feature map
print('Number of Parameters: {}'.format(out.size))
print('Output Feature Map: \n{}'.format(out))
```

```
X shape: (1, 20, 5, 5) with Stride: 1 and Padding: 0 returns out shape: (1, 1, 5, 5)
Number of Parameters: 25
Output Feature Map: 
[[[[ 0.612  0.613  0.614  0.615  0.616]
   [ 0.618  0.619  0.62   0.621  0.622]
   [ 0.624  0.625  0.626  0.627  0.628]
   [ 0.63   0.631  0.632  0.633  0.634]
   [ 0.636  0.637  0.638  0.639  0.64 ]]]]
```


```python
# Bottleneck vs Normal Computation Time comparison
def run_bottleneck(N, C, H, W, f1, s1, p1, f2, k2, s2, p2):
    """
    Conv(1x1) with f1 filters followed by Conv(k2 x k2) with f2 filters.
    Inputs: N, C, H, W (input shape); f1, s1, p1 (filters, stride, pad of the
    1x1 conv); f2, k2, s2, p2 (filters, kernel size, stride, pad of the second conv)
    """
    x_1_shape = (N, C, H, W)
    w_1_shape = (f1, C, 1, 1)
    x_1 = np.linspace(-0.1, 0.5, num=np.prod(x_1_shape)).reshape(x_1_shape)
    w_1 = np.linspace(-0.2, 0.3, num=np.prod(w_1_shape)).reshape(w_1_shape)
    b_1 = np.linspace(-0.1, 0.2, num=w_1.shape[0])

    w_2_shape = (f2, f1, k2, k2)
    w_2 = np.linspace(-0.2, 0.3, num=np.prod(w_2_shape)).reshape(w_2_shape)
    b_2 = np.linspace(-0.1, 0.2, num=w_2.shape[0])

    conv_param_1 = {'stride': s1, 'pad': p1}
    conv_param_2 = {'stride': s2, 'pad': p2}
    out, _ = conv_forward_naive(x_1, w_1, b_1, conv_param=conv_param_1)
    out, _ = conv_forward_naive(out, w_2, b_2, conv_param=conv_param_2)
    return out

def run_normal(N, C, H, W, f1, k, s, p):
    """
    A single Conv(k x k) with f1 filters.
    Inputs: N, C, H, W (input shape); f1, k, s, p (filters, kernel size, stride, pad)
    """
    x_1_shape = (N, C, H, W)
    w_1_shape = (f1, C, k, k)
    x_1 = np.linspace(-0.1, 0.5, num=np.prod(x_1_shape)).reshape(x_1_shape)
    w_1 = np.linspace(-0.2, 0.3, num=np.prod(w_1_shape)).reshape(w_1_shape)
    b_1 = np.linspace(-0.1, 0.2, num=w_1.shape[0])
    conv_param_1 = {'stride': s, 'pad': p}
    out, _ = conv_forward_naive(x_1, w_1, b_1, conv_param=conv_param_1)
    return out
```

```python
# Test Bottleneck: Input (256 depth) -> [Conv(1x1, s:1, p:0, 64 depth) -> Conv(3x3, s:1, p:1, 256 depth)]
%timeit -n 100 run_bottleneck(1, 256, 96, 96, 64, 1, 0, 256, 3, 1, 1)
```

```
100 loops, best of 3: 1.05 s per loop
```

```python
# Test Normal: Input (256 depth) -> Conv(3x3, s:1, p:1, 256 depth)
%timeit -n 100 run_normal(1, 256, 96, 96, 256, 3, 1, 1)
```

```
100 loops, best of 3: 3.4 s per loop
```
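
The measured speed-up can be sanity-checked by counting multiply-accumulate operations (MACs) for the two configurations timed above. This is a back-of-the-envelope sketch: `conv_macs` is an illustrative helper that ignores biases, padding overhead and memory traffic, so only the ratio is meaningful.

```python
def conv_macs(in_channels, out_channels, kernel, out_h, out_w):
    """Multiply-accumulate operations for one convolution layer (bias ignored)."""
    return out_h * out_w * out_channels * in_channels * kernel * kernel


H = W = 96  # stride 1 with p:0 for 1x1 and p:1 for 3x3 keeps the spatial size

# Bottleneck: Conv(1x1, 256 -> 64) followed by Conv(3x3, 64 -> 256)
bottleneck = conv_macs(256, 64, 1, H, W) + conv_macs(64, 256, 3, H, W)
# Normal: Conv(3x3, 256 -> 256)
normal = conv_macs(256, 256, 3, H, W)

print(bottleneck, normal, normal / bottleneck)
```

The MAC ratio comes out at 3.6, which is in the same range as the measured 3.4 s versus 1.05 s (roughly 3.2x).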
