# Chapter 8: Modern Convolutional Neural Networks

### Dhuvi karthikeyan

02/15/2023

## 8.1 Deep CNNs (AlexNet)

### 8.1.1 Representation Learning

Pre 2012 era in computer vision:

* Image representation was calculated mechanistically with geometrically inspired inductive biases:
    * Bag of Visual Words (2003)
    * SIFT: Scale Invariant Feature Transform (2004)
    * HOG: Histograms of oriented gradient (2005)
    * SURF: Speeded Up Robust Features (2006)
    
* Contrarians:
    * Yann LeCun
    * Yoshua Bengio
    * Andrew Ng
    * Shun-ichi Amari
    * Geoff Hinton
    * Juergen Schmidhuber
    
All of them held that features should be learned jointly with the classifier that uses them, and that they should be learned hierarchically, so that each layer builds a richer representation on top of the previous one.

**AlexNet:** The gap between LeNet (1995) and AlexNet (2012) is explained by a dearth of data and compute. ImageNet (on the order of 1e6 images) and its competition had only recently come out.

### 8.1.2 AlexNet

By winning the ImageNet Large Scale Visual Recognition Challenge (2012) by a large margin, AlexNet was able to shift the zeitgeist from cleverly constructed featurization to automatic learning of features by deep networks.

#### Architecture

Image -> 11x11 Conv (96) -> MaxPool(3) -> 5x5 Conv (256) -> MaxPool(3) -> 3x3 Conv (384) -> 3x3 Conv (384) -> 3x3 Conv (256) -> MaxPool(3) -> FC(4096) -> FC(4096) -> FC(1000)
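As a sanity check on the pipeline above, the spatial size after each layer can be traced with the usual output-size formula floor((n + 2p - k) / s) + 1. The strides and paddings are not stated in these notes; the values below are assumed from a commonly used 224x224 configuration, and the third 3x3 convolution is given 256 output channels as in the original AlexNet. Treat the numbers as a sketch under those assumptions.

```python
def out_size(n, k, s=1, p=0):
    """Spatial output size of a conv/pool layer with kernel k, stride s, padding p."""
    return (n + 2 * p - k) // s + 1


# (name, kernel, stride, padding, output channels or None for pooling).
# Strides and paddings are assumptions, not taken from the notes.
ALEXNET_LAYERS = [
    ("conv11", 11, 4, 1, 96),
    ("pool", 3, 2, 0, None),
    ("conv5", 5, 1, 2, 256),
    ("pool", 3, 2, 0, None),
    ("conv3", 3, 1, 1, 384),
    ("conv3", 3, 1, 1, 384),
    ("conv3", 3, 1, 1, 256),
    ("pool", 3, 2, 0, None),
]


def trace_shapes(n=224, channels=1):
    """Return (layer name, channels, height/width) after every layer."""
    shapes = []
    for name, k, s, p, c_out in ALEXNET_LAYERS:
        n = out_size(n, k, s, p)
        if c_out is not None:
            channels = c_out
        shapes.append((name, channels, n))
    return shapes


shapes = trace_shapes()
# Number of inputs seen by the first FC(4096) layer after flattening
flattened = shapes[-1][1] * shapes[-1][2] ** 2
for name, channels, size in shapes:
    print(f"{name:7s} -> {channels} x {size} x {size}")
print("flattened features:", flattened)
```

Under these assumptions the resolution goes 224 -> 54 -> 26 -> 26 -> 12 -> 12 -> 12 -> 12 -> 5, so the first fully connected layer sees 256 x 5 x 5 = 6400 inputs, which is where most of the parameters sit.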

#### Activation Functions

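AlexNet replaced the sigmoid activation with ReLU, which is cheaper to compute and keeps gradients from vanishing for positive inputs, improving both training stability and efficiency.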

#### Capacity Control and Preprocessing

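Dropout in the FC layers controls the complexity of the model and reduces overfitting. On the preprocessing side, AlexNet also relied on image augmentation (flipping, cropping, and color changes) to make the model more robust.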

## 8.2 Networks Using Blocks (VGG)

The Visual Geometry Group (VGG) at Oxford first came up with the idea of using blocks as the building unit of neural networks.

### 8.2.1 VGG Blocks

The functional unit of a conv-layer block consists of:

1. A convolutional layer with padding (to maintain the activation dimensions)
2. A ReLU or similar activation function
3. A pooling layer for downsampling (distillation?)

**With one pooling layer per convolution, the number of layers is limited to $\log_k d$, where $d$ is the input resolution and $k$ is the max-pooling downsampling factor, before the activation shrinks to 1x1.**
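The depth limit above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes every pooling layer divides the resolution by exactly k (rounding down); the function name is made up for illustration.

```python
import math


def max_pool_stages(d, k=2):
    """Count how many k-fold downsamplings fit before the resolution drops to 1."""
    stages = 0
    while d >= k:
        d //= k  # each pooling layer divides the resolution by k (rounded down)
        stages += 1
    return stages


print(max_pool_stages(224))        # 7
print(math.floor(math.log2(224)))  # 7, matching floor(log_k d)
```

For a 224x224 input and 2x2 pooling this allows only 7 conv layers if every convolution is followed by pooling, which is why stacking several convolutions per pooling step matters.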

As such one of the key breakthroughs was the idea of using multiple conv layers between pooling operations.

Simonyan and Zisserman (2014) were interested in whether shallow, wide convnets or deep, narrow ones perform better.

* Parameter-wise, a single 5x5 layer costs $5 \cdot 5 \cdot c_i \cdot c_o$ weights, while a single 3x3 layer costs $3 \cdot 3 \cdot c_i \cdot c_o$
* Conv -> Conv with 3x3 windows covers the same 5x5 region of pixels as one 5x5 window, with $18c^2$ instead of $25c^2$ parameters for $c$ channels (about 28% fewer)

Deeper and narrower = better
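The parameter comparison between one 5x5 convolution and two stacked 3x3 convolutions can be worked through directly. This is a small sketch that ignores biases and assumes the same channel count c on input and output; the helper names are illustrative.

```python
def conv_params(kernel, c_in, c_out):
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out


def stacked_receptive_field(kernel, num_layers):
    """Receptive field of num_layers stride-1 convolutions with the same kernel."""
    return 1 + num_layers * (kernel - 1)


c = 64
one_5x5 = conv_params(5, c, c)       # 25 * c^2
two_3x3 = 2 * conv_params(3, c, c)   # 18 * c^2
saving = 1 - two_3x3 / one_5x5       # fraction of parameters saved

print(stacked_receptive_field(3, 2))  # 5: two 3x3 layers see a 5x5 window
print(one_5x5, two_3x3, round(saving, 2))
```

The stacked version reaches the same 5x5 receptive field with 28% fewer weights and one extra nonlinearity in between.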


### 8.2.2 VGG Network

VGG applies pooling at the end of each block and has 5 blocks, each of which halves the resolution: from 224x224 to 112, 56, 28, 14, and finally 7.
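A VGG block as described above (several 3x3 convolutions, then one pooling step) can be sketched in PyTorch as follows. The per-block layout in `arch` is the VGG-11 arrangement and is an assumption here, since the notes do not list it; `vgg_block` and `vgg_body` are illustrative names.

```python
import torch
from torch import nn


def vgg_block(num_convs, in_channels, out_channels):
    """num_convs x (3x3 conv + ReLU), followed by a single 2x2 max-pool."""
    layers = []
    for _ in range(num_convs):
        layers.append(nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1))
        layers.append(nn.ReLU())
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)


def vgg_body(arch, in_channels=3):
    """Chain VGG blocks; arch is a sequence of (num_convs, out_channels)."""
    blocks = []
    for num_convs, out_channels in arch:
        blocks.append(vgg_block(num_convs, in_channels, out_channels))
        in_channels = out_channels
    return nn.Sequential(*blocks)


# Assumed VGG-11 style layout: 5 blocks, 8 conv layers in total
arch = ((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))
net = vgg_body(arch)
with torch.no_grad():
    out = net(torch.zeros(1, 3, 224, 224))
print(out.shape)  # torch.Size([1, 512, 7, 7])
```

Padding of 1 keeps the resolution fixed inside a block, so only the five pooling layers change it: 224 -> 112 -> 56 -> 28 -> 14 -> 7.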

## 8.3 Network in Network

Network in Network (NiN, 2013) sought to address the large number of parameters in the fully connected layers at the end of a convnet.

* The FC layers cannot simply be moved earlier in the network without destroying the spatial structure of the features

### 8.3.1 NiN Blocks

* Uses a single conv layer followed by 2 sequential 1x1 conv layers (which act as an FC layer at each pixel location)

* Kernel sizes and output channels of the conv layers are the same as in AlexNet

* The number of output channels is reduced to the number of classes by a final nin_block

* A global average pooling layer then yields the vector of logits [no fully connected layer]

Interestingly, taking the average didn't harm accuracy. Averaging across the low-resolution feature maps adds translational invariance as well.
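The global average pooling step can be illustrated without any framework: each class gets one feature map, and its logit is simply the mean of that map. The maps below are made-up toy values, not outputs of a real network.

```python
def global_avg_pool(feature_maps):
    """Average each channel's H x W map down to a single number (one logit per class)."""
    logits = []
    for channel in feature_maps:
        values = [v for row in channel for v in row]
        logits.append(sum(values) / len(values))
    return logits


# Three hypothetical 2x2 class maps, one per class
maps = [
    [[1.0, 3.0], [5.0, 7.0]],
    [[0.0, 0.0], [0.0, 4.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
logits = global_avg_pool(maps)
print(logits)  # [4.0, 1.0, 2.0]

# Moving the single activation of class 1 to another location leaves its logit unchanged
shifted = [[4.0, 0.0], [0.0, 0.0]]
print(global_avg_pool([shifted]))  # [1.0]
```

The second call shows the translation invariance mentioned above: the average does not depend on where in the map the evidence appears, and the layer has no parameters at all.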

## 8.4 Multi-Branch Networks

GoogLeNet won the 2014 ImageNet Challenge with a network that made the distinction between stem, body, and head in CNNs explicit.

The design:

* Stem: First 2-3 convolutions that extract low-level features from images
* Body: Convolutional blocks that process information
* Head: Maps the learned features to the downstream target task
    


### 8.4.1 Inception Block

The input is fed into four parallel branches, whose outputs are concatenated along the channel dimension:

1. 1x1 conv -> Concat layer
2. 1x1 conv -> 3x3 conv -> Concat layer
3. 1x1 conv -> 5x5 conv -> Concat layer
4. 3x3 MaxPool -> 1x1 conv -> Concat layer

* Explores the image with filters of different spatial sizes
* Allocates different numbers of parameters to different filters
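Because the four branches are concatenated along the channel dimension, the output width of an Inception block is just the sum of the branch widths. The sketch below uses the branch sizes commonly quoted for the first Inception block of GoogLeNet (192 input channels); treat those numbers as assumptions, since they do not appear in these notes. It also shows why the 1x1 convolutions sit in front of the larger kernels.

```python
def inception_out_channels(c1, c2, c3, c4):
    """c1: 1x1 branch; c2, c3: (reduce, out) pairs for the 3x3 and 5x5 branches; c4: pool branch."""
    return c1 + c2[1] + c3[1] + c4


# Assumed sizes of the first Inception block: 64 + 128 + 32 + 32 output channels
first_block = inception_out_channels(64, (96, 128), (16, 32), 32)
print(first_block)  # 256

# Weights of the 5x5 branch with and without the 1x1 channel reduction
c_in, c_reduce, c_out = 192, 16, 32
direct = 5 * 5 * c_in * c_out
reduced = 1 * 1 * c_in * c_reduce + 5 * 5 * c_reduce * c_out
print(direct, reduced)  # 153600 15872
```

With these sizes the 1x1 reduction makes the 5x5 branch roughly ten times cheaper, which is how the block can afford several filter sizes in parallel.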

### 8.4.2 GoogLeNet Model

Open question: how does the accounting of the number of channels work across the Inception blocks?

## 8.5 Batch Normalization

* Pre-processing (Rescaling)
* Numerical Stability
* Regularization

**All in one**

### 8.5.1 Training Deep Networks

Standardizing input features in classical machine learning helps model convergence by placing all the features (and thus the parameters) on the same scale:

* Rescale so that the diagonal of the feature covariance is unity
* Each feature has zero mean and unit variance

In MLPs (and CNNs), the intermediate representations of the input experience distribution drift: across layers, and across units within one layer, the variables can take on values of very different magnitudes. Normalizing them could help with model convergence.


* This framework helps explain why batch norm and layer norm have been so successful in extending the depth in which we can train neural networks 


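Since deep nets are prone to overfitting, regularization via noise injection is a useful side effect of batch norm: the minibatch statistics are noisy estimates of the true mean and variance.

$$ \mathrm{BN}(x) = \gamma \odot \frac{x-\hat{\mu}}{\hat{\sigma}} + \beta $$

Here $\hat{\mu}$ and $\hat{\sigma}$ are the mean and standard deviation of the minibatch, and $\gamma$ and $\beta$ are learned scale and shift parameters. Because the amount of noise depends on the batch size, a batch size of 50-100 tends to work best: large enough for stable estimates, small enough to keep a useful amount of noise.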

### 8.5.2 Batch Norm Layers

The implementations for FC layers and convolutional layers differ slightly.

#### Fully Connected Layers

$h = \phi(\mathrm{BN}(Wx+b))$

#### Convolutional Layers

Batch norm is applied on a per-channel basis, with statistics computed across all examples in the minibatch and all spatial locations, which preserves translation invariance.

#### Layer Normalization

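Like batch norm, but applied to one observation at a time:

$$ x \to \mathrm{LN}(x) = \frac{x - \hat{\mu}}{\hat{\sigma}}$$

Here the mean and variance are computed over the features of a single input and used to normalize it. Because this standardization is deterministic and independent of the minibatch, it behaves identically in training and prediction, and empirically it prevents divergence.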
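Layer normalization of a single observation can be written out in plain Python. This is a minimal sketch: the `eps` term is an assumption added for numerical safety and does not appear in the formula above, and no learned scale or shift is applied.

```python
import math


def layer_norm(x, eps=1e-5):
    """Standardize one observation using its own mean and variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


a = layer_norm([1.0, 2.0, 3.0, 4.0])
b = layer_norm([10.0, 20.0, 30.0, 40.0])  # same observation, rescaled by 10
print(a)
print(b)
```

The two outputs agree up to the tiny effect of `eps`, showing that the result does not depend on the scale of the input or on any other example in the batch.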



```python
import torch


def batch_norm(X, gamma, beta, moving_mean, moving_var, eps, momentum):
    # Use is_grad_enabled to determine whether we are in training mode
    if not torch.is_grad_enabled():
        # In prediction mode, use mean and variance obtained by moving average
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # When using a fully connected layer, calculate the mean and
            # variance on the feature dimension
            mean = X.mean(dim=0)
            var = ((X - mean) ** 2).mean(dim=0)
        else:
            # When using a two-dimensional convolutional layer, calculate the
            # mean and variance on the channel dimension (axis=1). Here we
            # need to maintain the shape of X, so that the broadcasting
            # operation can be carried out later
            mean = X.mean(dim=(0, 2, 3), keepdim=True)
            var = ((X - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
        # In training mode, the current mean and variance are used
        X_hat = (X - mean) / torch.sqrt(var + eps)
        # Update the mean and variance using moving average
        moving_mean = (1.0 - momentum) * moving_mean + momentum * mean
        moving_var = (1.0 - momentum) * moving_var + momentum * var
    Y = gamma * X_hat + beta  # Scale and shift
    return Y, moving_mean.data, moving_var.data
```

### 8.5.6 Discussion 

* Batch norm modulates hidden representations by standardizing them with minibatch statistics, boosting numerical stability
* Batch norm is implemented differently for different layer types (FC vs. convolutional)
* Batch norm behaves differently in training and at test time
* Noise-injection regularization works best with minibatches of around 50-100 training examples


Batch norm makes the optimization landscape smoother?

Internal covariate shift (debunked?) 

## 8.6 Residual Networks (ResNet)

### 8.6.1 Function Classes

Since neural networks are approximators of complex functions, it is intuitive to think of a model architecture as representing a particular class of functions. Nested function classes are a desirable property when adding complexity to NNs: if the larger class contains the smaller one, the best achievable function after adding a layer is guaranteed to be at least as close to the true function as before. Without nesting, a bigger class may simply be different rather than better.

Mathematically, one way to achieve this is to make sure the added layer can learn the identity function. He et al. (2016) developed ResNet, which won the 2015 ImageNet competition, and introduced the residual block, which bakes in the idea that each additional layer should easily contain the identity function.

### 8.6.2 Residual Block
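
The idea from 8.6.1, that an added layer should easily represent the identity, can be sketched as a block that computes relu(f(x) + x). The layer layout below (two 3x3 convolutions with batch norm, plus an optional 1x1 convolution on the skip path) is one common arrangement and is assumed here rather than taken from these notes.

```python
import torch
from torch import nn
from torch.nn import functional as F


class Residual(nn.Module):
    """Residual block computing relu(f(x) + x); a 1x1 conv reshapes x when needed."""

    def __init__(self, in_channels, out_channels, use_1x1conv=False, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               padding=1, stride=stride)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # Optional 1x1 conv so the skip path matches the channels/stride of f(x)
        self.conv3 = (nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride)
                      if use_1x1conv else None)

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3 is not None:
            X = self.conv3(X)
        return F.relu(Y + X)


X = torch.randn(4, 3, 6, 6)
same = Residual(3, 3)
down = Residual(3, 6, use_1x1conv=True, stride=2)
print(same(X).shape)  # torch.Size([4, 3, 6, 6])
print(down(X).shape)  # torch.Size([4, 6, 3, 3])

# If the residual branch outputs zero, the block reduces to the identity
# (for non-negative inputs, because of the final ReLU)
nn.init.zeros_(same.bn2.weight)
print(torch.allclose(same(X.abs()), X.abs()))  # True
```

Zeroing the scale of the last batch norm forces f(x) = 0, so the block passes its input through unchanged; this is the sense in which the identity is the "default" function of a residual block.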

### 8.6.3 ResNet Model

The model starts off like GoogLeNet, with batch norm added after the first convolution.

Blocks:

* Block 1 (1x):
    * 7x7 Conv
    * BN
    * ReLU
    * 3x3 MaxPool
* Block 2 (2x):
    * ResBlock (no 1x1 conv)
* Block 3 (3x):
    * ResBlock (with 1x1 conv)
    * ResBlock (without 1x1 conv)
* Block 4:
    * Global AvgPool
    * FC layer

### 8.6.5 ResNeXt

ResNet has an inherent tradeoff between nonlinearity and dimensionality within a residual block.

* More nonlinearity can be added by increasing the number of layers or the width of the convolutions
* Alternatively, the number of channels can be increased, but this comes at a quadratically increasing cost

ResNeXt instead branches the residual block into g groups (branches) with identical architectures, splitting the b intermediate channels over the g groups so that each group has b/g channels.

Computational cost goes from $O(c_i \cdot c_o)$ to $O(g \cdot (c_i/g) \cdot (c_o/g)) = O(c_i \cdot c_o / g)$, which is g times faster and needs g times fewer parameters.
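The factor-of-g saving can be verified by counting weights. This is a small sketch with an illustrative helper; the channel count and group count are arbitrary example values.

```python
def conv_weight_count(c_in, c_out, kernel=3, groups=1):
    """Weights in a (grouped) convolution: each group maps c_in/g to c_out/g channels."""
    assert c_in % groups == 0 and c_out % groups == 0
    return groups * (c_in // groups) * (c_out // groups) * kernel * kernel


dense = conv_weight_count(128, 128, groups=1)
grouped = conv_weight_count(128, 128, groups=16)
print(dense, grouped, dense // grouped)  # 147456 9216 16
```

In PyTorch the same grouping is available through the `groups` argument of `nn.Conv2d`, which requires both channel counts to be divisible by the number of groups.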

**Note** Must ensure no information leakage between groups

### 8.6.6 Summary and Discussion

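Nested function classes guarantee that added capacity makes the model strictly more expressive, and therefore more powerful, rather than merely different. Letting the input pass through a skip connection achieves this by changing the inductive bias of a new layer from the simple function f(x) = 0 to f(x) = x.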

## 8.7 Densely Connected Networks (DenseNet)

DenseNet - Dense Convolutional Network

### 8.7.1 ResNet to DenseNet 

DenseNet takes the idea of ResNet and asks: what if, instead of adding the input to the residual, we aggregate features in a higher-dimensional space using the concatenation operator?

$$ x \to \left[x, f_1(x), f_2([x, f_1(x)]), f_3([x, f_1(x), f_2([x, f_1(x)])])\right] \to \mathrm{MLP} $$

### 8.7.2 Dense Blocks

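```python
import torch
from torch import nn


def conv_block(num_channels):
    # Assumed helper (not defined in these notes): BN -> ReLU -> 3x3 conv
    # that emits num_channels new feature maps at the same resolution
    return nn.Sequential(
        nn.LazyBatchNorm2d(), nn.ReLU(),
        nn.LazyConv2d(num_channels, kernel_size=3, padding=1))


class DenseBlock(nn.Module):
    def __init__(self, num_convs, num_channels):
        super().__init__()
        layer = []
        for _ in range(num_convs):
            layer.append(conv_block(num_channels))
        self.net = nn.Sequential(*layer)

    def forward(self, X):
        for blk in self.net:
            Y = blk(X)
            # Concatenate input and output of each block along the channels
            X = torch.cat((X, Y), dim=1)
        return X
```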

### 8.7.3 Transition Layers

Each dense block increases the number of channels. Transition layers reduce the number of channels again (with a 1x1 convolution) and halve the resolution (with average pooling) to keep model complexity in check.

```python
from torch import nn


def transition_block(num_channels):
    # 1x1 conv sets the channel count; average pooling halves height and width
    return nn.Sequential(
        nn.LazyBatchNorm2d(), nn.ReLU(),
        nn.LazyConv2d(num_channels, kernel_size=1),
        nn.AvgPool2d(kernel_size=2, stride=2))
```
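
The interplay between dense blocks (which add channels through concatenation) and transition layers (which cut them back) can be traced with simple arithmetic. The stem width of 64, growth rate of 32, four convolutions per block, and halving at each transition are assumed example values, not settings stated in these notes.

```python
def densenet_channel_trace(stem_channels=64, growth_rate=32,
                           convs_per_block=(4, 4, 4, 4)):
    """Track the channel count through alternating dense and transition blocks."""
    trace = []
    channels = stem_channels
    for i, num_convs in enumerate(convs_per_block):
        # Each conv in a dense block appends growth_rate channels via concatenation
        channels += num_convs * growth_rate
        trace.append(("dense", channels))
        if i != len(convs_per_block) - 1:
            # Assumed: the transition layer halves the channel count
            channels //= 2
            trace.append(("transition", channels))
    return trace


trace = densenet_channel_trace()
for kind, channels in trace:
    print(f"{kind:10s} {channels}")
```

With these settings the channel count goes 192 -> 96 -> 224 -> 112 -> 240 -> 120 -> 248; without the transitions it would reach 64 + 16 * 32 = 576.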

### 8.7.4 DenseNet Model

The stem is more or less the same as in ResNet (which in turn follows GoogLeNet and AlexNet), but the body alternates dense blocks and transition layers.

## 8.8 Designing Convolution Network Architectures

Neural architecture searches (NAS) are usually quite costly but are worth exploring.

### 8.8.1 AnyNet Design Space

AnyNet consists of stem, body, and head. 

* Stem:
    * Takes input images and applies a 3x3 conv with stride 2, followed by BN
    * This halves the resolution to 1/2
    * Generates c0 channels as input for the body
* Body:
    * Can have a depth of d_i per stage
    * Output channels at each layer of c_i
    * Number of groups g_i
    * Bottleneck ratios k_i
* Head:
    * Pooling operations
    * Convolution to get to specific channels
    * Pass to dense layers
    * Dense layers cut down to n_classes

### 8.8.2 Distributions and Parameters of Design Spaces

The parameters of the design space are hyperparameters of the network instance. However, because the number of possible combinations explodes with network depth, brute-force search is infeasible in most cases. Instead, finding strategies that yield better design guidelines may have a higher payoff in a compute-constrained environment.

1. Assume general design principles actually exist and compliant networks offer good performance. Assumes distribution over networks
2. Need not train networks to convergence to assess performance but instead the intermediate results will be enough. Multi-fidelity optimization requires only a few passes through the data.
3. Results on small scale generalize to larger scale, optimize over toy models and verify at scale
4. Aspects of the design can be factorized independently of each other

**How well do these assumptions hold**

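$$ F(e,p) = P_{\mathrm{net} \sim p}\{e(\mathrm{net}) \leq e\} $$

The goal is to find a distribution p over networks such that most networks drawn from it have low error, i.e., $F(e, p)$ is at least as large as $F(e, p')$ for the other distributions p'. Since $F$ cannot be computed exactly, take a sample of n networks $Z = \{\mathrm{net}_1, \ldots, \mathrm{net}_n\}$ from p with errors $e_1, \ldots, e_n$ and use the empirical CDF:

$$ \hat{F}(e,Z) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}(e_i \leq e)$$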
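
The empirical CDF is just the fraction of sampled networks whose error is at most e. The error values below are made up to illustrate comparing two design spaces; they are not results from any experiment.

```python
def empirical_cdf(errors, e):
    """Fraction of sampled networks whose error is at most e."""
    return sum(1 for err in errors if err <= e) / len(errors)


# Hypothetical validation errors of networks sampled from two design spaces
errors_p = [0.30, 0.32, 0.35, 0.41, 0.50]
errors_q = [0.34, 0.38, 0.45, 0.52, 0.60]

print(empirical_cdf(errors_p, 0.35))  # 0.6
print(empirical_cdf(errors_q, 0.35))  # 0.2
```

A design space whose empirical CDF lies above another's at every error threshold is the better one under this criterion, since a larger share of its networks reach any given error level.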

### 8.8.3 RegNet

Architecture that came out of the AnyNet design space

* shares bottleneck ratios across design stages
* shares group widths 
* increases channels across stages
* increases network depths across stages

### 8.8.5 Discussion

Transformers have significantly weaker inductive biases for images than CNNs (e.g., locality and translation invariance); they compensate by learning from very large amounts of image data.