In [2]:
import tensorflow as tf
print(tf.__version__)

2.1.0



[slides](https://github.com/arthurredfern/UT-Dallas-CS-6301-CNNs/blob/master/Lectures/xNNs_060_Training.pdf)

[disk](file:///F:/Data/xNNs_060_Training.pdf)

# Forward Pass

Modify the forward pass to improve convergence and generalization on inference data.

## Convergence

### Batch Normalization
Normalization is done on a per-channel basis. During training, compute the mean and variance of each channel over the batch.

Input:X(n,c,h,w) 

Output: X(n,c,h,w)

<img src=../img/bn.PNG>

__During Training__

$$\mu_{c,b}=\frac{1}{(NHW)}\sum_{n,h,w}X(n,c,h,w)$$
$$\sigma_{c,b}^2=\frac{1}{(NHW)}\sum_{n,h,w}(X(n,c,h,w)-\mu_{c,b})^2$$

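Transform the data using the per-batch, per-channel mean and variance, then apply a trainable per-channel scale $\gamma_c$ and a trainable per-channel bias $\beta_c$. In practice a small constant is added to the variance before taking the square root to avoid division by zero.

$$Y[:,c,:,:]=\gamma_c \frac{(X[:,c,:,:]-\mu_{c,b})}{ \sigma_{c,b}}+\beta _c$$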

__During Inference__ While training, track a running average of the mean and variance across batches for use during inference, $\alpha \approx 0.99$

$$\mu_c = \alpha \mu_c+(1-\alpha)\mu_{c,b}$$
$$\sigma_c^2 = \alpha \sigma_c^2+(1-\alpha)\sigma_{c,b}^2$$

During inference, fix the running-average parameters and either normalize as in the training formula (with $\mu_c$ and $\sigma_c$ in place of the batch statistics), or absorb the numbers into the convolution parameters.

$$Y[:,c,:,:] = \frac{\gamma_c }{\sigma_{c}}X[:,c,:,:]+\left(\beta_c-\frac{\gamma_c\mu_c }{\sigma_{c}}\right)$$

Here, multiply the convolution parameters by the scalar $\frac{\gamma_c }{\sigma_{c}}$ and add the bias term $\beta_c-\frac{\gamma_c\mu_c }{\sigma_{c}}$.
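The folding step can be checked numerically. The sketch below is a minimal NumPy illustration, not the notebook's implementation: it assumes a plain linear layer `y = W @ x + b` standing in for the convolution, and the helper name `fold_batch_norm` and the `eps` value are made up for this example.

```python
import numpy as np

def fold_batch_norm(W, b, gamma, beta, mu, var, eps=1e-5):
    """Fold inference-time batch norm into a preceding linear layer.

    W: (C_out, C_in) weights, b: (C_out,) bias of the layer y = W @ x + b.
    gamma, beta, mu, var: (C_out,) BN scale, bias, running mean, running variance.
    Returns (W_folded, b_folded) such that W_folded @ x + b_folded == BN(W @ x + b).
    """
    scale = gamma / np.sqrt(var + eps)   # gamma_c / sigma_c
    W_folded = W * scale[:, None]        # scale the weights of each output channel
    b_folded = beta + (b - mu) * scale   # beta_c - gamma_c * mu_c / sigma_c, plus the scaled layer bias
    return W_folded, b_folded

rng = np.random.default_rng(0)
C_in, C_out = 4, 3
W, b = rng.normal(size=(C_out, C_in)), rng.normal(size=C_out)
gamma, beta = rng.normal(size=C_out), rng.normal(size=C_out)
mu, var = rng.normal(size=C_out), rng.uniform(0.5, 2.0, size=C_out)
x = rng.normal(size=C_in)

# Reference: apply the layer, then batch norm with the frozen statistics.
y_ref = gamma * ((W @ x + b) - mu) / np.sqrt(var + 1e-5) + beta

# Folded: a single linear layer with rescaled weights and a new bias.
W_f, b_f = fold_batch_norm(W, b, gamma, beta, mu, var)
y_folded = W_f @ x + b_f
```

With a zero layer bias `b`, the folded bias reduces to the $\beta_c-\gamma_c\mu_c/\sigma_c$ term in the formula above.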

__Why BN Works__ [paper](https://arxiv.org/pdf/1805.11604.pdf)

Batch normalization seeks to stabilize the distribution of inputs to a given network layer during training. The batch norm layer sets the first two moments of the distribution of each activation (mean and variance) to zero and one, respectively. The normalized values are then scaled and shifted by trainable parameters. BN is applied pre-activation.

Internal covariate shift (ICS): the distribution of the input to a layer changes when a parameter of an earlier layer is updated. This is believed to constantly change the training problem as parameters are updated. However, when tested, networks with BN do not show significantly less ICS than networks without BN.

Instead, batch norm smooths the loss surface: the loss changes at a smaller rate, and the gradient magnitudes are smaller as well. With a smoother loss surface, a larger step size can be used, which leads to faster convergence.



__Tips__
- Batch normalization doesn't work as well when the batch size is small. This can become an issue when processing high-resolution images that stretch your training system's memory limit.
- The link between batch norm and reducing internal covariate shift is [tenuous](https://arxiv.org/pdf/1805.11604.pdf).


In [None]:
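class Batch_Normalization():
    """
    Implementation of batch normalization for a computational graph.
    Input: tensor with shape (B,H,W,C)
    Output: tensor with shape (B,H,W,C), normalized per channel
    """
    def __init__(self):
        self.gamma = None
        self.beta = None
        self.mu = None
        self.sigma = None      # batch variance (sigma^2), one value per channel
        self.mu_ra = None
        self.sigma_ra = None   # running average of the variance
        self.alpha = 0.99
        self.eps = 1e-5        # small positive number to prevent division by zero
        self.fitted = False
        
    def forward(self, inputs, training=False):
        """Note that each input is a batch, and all statistics are
        calculated at the channel level."""
        if not self.fitted:
            C = inputs.shape[-1]
            # start from the identity transform and unit-Gaussian running statistics
            # (random values could make the running variance negative)
            self.gamma = tf.ones([C])
            self.beta = tf.zeros([C])
            self.mu_ra = tf.zeros([C])
            self.sigma_ra = tf.ones([C])
            self.fitted = True
            
        if training:
            #batch statistics per channel: reduce over B, H and W
            self.mu, self.sigma = tf.nn.moments(inputs, axes=[0, 1, 2])
            #exp moving average
            self.mu_ra = self.alpha*self.mu_ra+(1-self.alpha)*self.mu
            self.sigma_ra = self.alpha*self.sigma_ra+(1-self.alpha)*self.sigma
            
            self.scaled = (inputs-self.mu)/tf.math.sqrt(self.sigma + self.eps)
            result = self.gamma*self.scaled + self.beta
            return result
        
        else: #uses exp. average
            result = self.gamma*(inputs-self.mu_ra)/tf.math.sqrt(self.sigma_ra + self.eps) + self.beta
            return result

    def backward(self, dz, lr):
        """dz: gradient of the loss w.r.t. the output, shape (B,H,W,C).
        Call only after forward(..., training=True)."""
        axes = [0, 1, 2]
        #gradients of the trainable scale and bias, summed per channel
        dgamma = tf.reduce_sum(dz*self.scaled, axis=axes)
        dbeta = tf.reduce_sum(dz, axis=axes)
        #gradient w.r.t. the input, including the dependence of mu and sigma on it
        dz_dx = self.gamma/tf.math.sqrt(self.sigma + self.eps)*(
            dz - tf.reduce_mean(dz, axis=axes)
            - self.scaled*tf.reduce_mean(dz*self.scaled, axis=axes))
        #SGD step (computed after dz_dx so the old gamma is used above)
        self.gamma = self.gamma - lr*dgamma
        self.beta = self.beta - lr*dbeta
        return dz_dx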

### Batch Renormalization

Starts with batch norm, but gradually transitions from using the sample mean and variance to using the running-average mean and variance statistics.

- Can help in situations with small batch sizes, where the sample statistics can be noisy.
- Integrates the exponential moving average during training, which could lead to better performance.



### Group Normalization

Divide the channels into groups and compute the mean and variance over each group of channels for each instance.

Group normalization creates a new dimension by splitting the channel dimension. The mean and variance are computed within each group (over its channels and all spatial positions), not along the batch dimension. Because the statistics are at the instance level, there is no difference between execution in training and in inference.

__Instance Norm (C//C)__

Normalize each channel of each image. 
<img src=../img/in.PNG>

__Layer Norm (C//1)__
Normalize across all channels in an image (every pixel mean/variance)
<img src=../img/ln.PNG>

__Group Norm(C//variable)__
Groups a subset of the channels together for normalization.
<img src=../img/gn.PNG>

In [None]:
class Group_Normalization():
    """
    Implementation of group normalization for a computational graph.
    Input: tensor with shape (B,H,W,C)
    Output: tensor with shape (B,H,W,C), normalized per group of channels
    """
    def __init__(self, num_groups):
        self.num_groups = num_groups
        self.gamma = None
        self.beta = None
        self.mu = None
        self.sigma = None  # group variance (sigma^2)
        self.fitted = False
        
    def forward(self, inputs):
        B,H,W,C = inputs.shape
        if not self.fitted:
            #start from the identity transform: scale 1, bias 0
            self.gamma = tf.ones([C])
            self.beta = tf.zeros([C])
            self.fitted = True

        #split the channel axis into groups (C must be divisible by num_groups)
        inputs = tf.reshape(inputs, [B, H, W, self.num_groups, C//self.num_groups])
        #group statistics per instance; keepdims so they broadcast against inputs
        self.mu, self.sigma = tf.nn.moments(inputs, [1,2,4], keepdims=True)
        #Normalize (add small pos num to prevent div by zero)
        self.scaled = (inputs-self.mu)/tf.math.sqrt(self.sigma + 1e-5)
        result = self.gamma*tf.reshape(self.scaled,[B,H,W,C]) + self.beta
        return result

    def backward(self, dz, lr):
        #TO DO
        return None

### Layer Normalization

Normalization without batching. The mean and variance computed for each instance are used in place of an estimate of the population mean and variance over the full dataset.

## Regularization (Generalization)


### Stochastic Width
Useful in wider networks.

__Dropout zeros out a random set of layer outputs per batch.__ This forces multiple groups of output features to be able to estimate a class (similar to ensembling: multiple pathways for class prediction within one network).

__DropConnect zeros out a random set of layer weights per batch.__ This forces multiple groups of input features to be able to generate an output feature.
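A minimal NumPy sketch of both ideas follows. The rescaling by `1 / keep_prob` ("inverted" dropout) is a common convention that these notes do not state, and applying the same rescaling to DropConnect is a simplification made for this example; the function names are illustrative.

```python
import numpy as np

def dropout(x, drop_prob, rng, training=True):
    """Zero each layer output with probability drop_prob (inverted dropout)."""
    if not training or drop_prob == 0.0:
        return x
    keep_prob = 1.0 - drop_prob
    mask = rng.random(x.shape) < keep_prob   # True where the output is kept
    return x * mask / keep_prob              # rescale so the expected value is unchanged

def dropconnect_linear(x, W, drop_prob, rng, training=True):
    """Linear layer x @ W where individual weights, not outputs, are zeroed."""
    if not training or drop_prob == 0.0:
        return x @ W
    keep_prob = 1.0 - drop_prob
    mask = rng.random(W.shape) < keep_prob   # True where the weight is kept
    return x @ (W * mask) / keep_prob

rng = np.random.default_rng(0)
x = np.ones((2, 1000))
y = dropout(x, drop_prob=0.5, rng=rng)                         # entries are 0.0 or 2.0
z = dropconnect_linear(x, np.ones((1000, 3)), drop_prob=0.5, rng=rng)
```

At inference time (`training=False`) both functions pass the input through the full layer unchanged.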

### Stochastic Depth (layer skipping)

Used in very deep networks. Randomly skips layers where the probability of being skipped increases the deeper a layer is into the network.

### Stochastic Branching

__Shake-Shake and ShakeDrop__ are used to regularize residual networks (originally ResNeXt), the idea being to skip a branch in a ResNet-style block.

### Noise Addition

Add noise to the network $\rightarrow$ noisy activation functions, dataset augmentation, ...

## Error Calculation

### Classification Loss Function
__Softmax cross-entropy__ uses the KL divergence to compare the true probability mass function with the estimated probability mass function (the output of the softmax). KL divergence equals cross-entropy when the true distribution is a one-hot vector, because the entropy of a one-hot vector is zero.

Other options:
- KL divergence with label smoothing
- noise or overconfidence penalization
- optimal transport
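The relationship between KL divergence and cross-entropy can be verified with a few lines of plain Python. This is an illustrative sketch; the logits and the label-smoothing amount (0.15 spread over 3 classes) are arbitrary values chosen for the example.

```python
import math

def softmax(logits):
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i log q_i, skipping terms where p_i == 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i log(p_i / q_i), skipping terms where p_i == 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

q = softmax([2.0, 1.0, 0.1])      # estimated probability mass function
one_hot = [1.0, 0.0, 0.0]         # true label
smoothed = [0.9, 0.05, 0.05]      # same label after label smoothing

ce_hard, kl_hard = cross_entropy(one_hot, q), kl_divergence(one_hot, q)
ce_soft, kl_soft = cross_entropy(smoothed, q), kl_divergence(smoothed, q)
```

For the one-hot target the two quantities coincide. For the smoothed target they differ by the entropy of the target, which is constant with respect to the network, so both give the same gradients.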


### Regression Loss Function
Generally $l_p$ norms ($p=1,2$). Another option is the __Huber loss__, which is curved like L2 on $[-1,1]$ and straightens out to L1 for errors of larger magnitude. This avoids L2's very large gradients on large errors, which might cause overcorrection during backprop.
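A small sketch of the Huber loss and its gradient, assuming the transition point `delta = 1.0` to match the $[-1,1]$ interval mentioned above (`delta` is a free parameter in general):

```python
def huber(error, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it."""
    a = abs(error)
    if a <= delta:
        return 0.5 * error ** 2
    return delta * (a - 0.5 * delta)

def huber_grad(error, delta=1.0):
    """Derivative of huber() w.r.t. error: bounded by delta in magnitude."""
    if abs(error) <= delta:
        return error
    return delta if error > 0 else -delta

small = huber(0.5)            # inside the quadratic region
large = huber(3.0)            # inside the linear region
grad_large = huber_grad(10.0) # stays at delta, while the L2 gradient would be 10.0
```

The bounded gradient is what keeps a single outlier from dominating the weight update.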


### Unequal Class Weightings
The importance of specific classes can be rebalanced by weighting their terms in the error function. This is used in cases of class imbalance, or when a certain type of instance is particularly tricky to classify.

### Auxiliary Network Heads

Create extra heads earlier on in the network that predict the output. This allows the gradient to propagate directly to the body of the network, and also encourages earlier features to be stronger.

### Weight decay (L1/L2 Loss Penalty Term)

Add a regularizing term to the error function. For a more detailed explanation, see [here](https://github.com/harrisonjansma/2020_Notes/blob/master/DL/Courses/CS231n%20Conv%20Nets%20Stanford/1_Neural%20Networks%20Parts%201-2-3.ipynb).

[Spectral norm regularization](https://arxiv.org/abs/1705.10941)

# Backward Pass

## Memory Maintenance

### Checkpointing
A strategy to address running out of memory: save only every Nth computation result during the forward pass, then, during backpropagation, recompute the necessary intermediate values from the saved checkpoints.

This costs more compute but uses less memory.

[1](https://www-sop.inria.fr/tropics/papers/DauvergneHascoet06.pdf)

[2](https://arxiv.org/pdf/1604.06174.pdf)

[3](https://arxiv.org/abs/1606.03401)

[4](https://github.com/cybertronai/gradient-checkpointing)
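The trade-off can be illustrated on a toy chain of scalar layers $x_{i+1}=\tanh(w_i x_i)$. This is a sketch written for these notes, not code from the linked references; the layer function, the weights and the way stored values are counted are all assumptions of the example.

```python
import math

def layer(w, x):
    return math.tanh(w * x)

def layer_grad(w, x):
    """Derivative of layer(w, x) with respect to its input x."""
    return w * (1.0 - math.tanh(w * x) ** 2)

def backward_full(weights, x0):
    """Baseline: store every activation during the forward pass."""
    acts = [x0]
    for w in weights:
        acts.append(layer(w, acts[-1]))
    grad = 1.0
    for w, x in zip(reversed(weights), reversed(acts[:-1])):
        grad *= layer_grad(w, x)          # chain rule, last layer first
    return grad, len(acts)

def backward_checkpointed(weights, x0, every):
    """Store only the input of every `every`-th layer; recompute the rest."""
    checkpoints = {}
    x = x0
    for i, w in enumerate(weights):
        if i % every == 0:
            checkpoints[i] = x            # input to layer i
        x = layer(w, x)
    grad, peak = 1.0, len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        segment = weights[start:start + every]
        acts = [checkpoints[start]]       # recompute the inputs inside this segment
        for w in segment[:-1]:
            acts.append(layer(w, acts[-1]))
        peak = max(peak, len(checkpoints) + len(acts))
        for w, x in zip(reversed(segment), reversed(acts)):
            grad *= layer_grad(w, x)
    return grad, peak

weights = [0.5 + 0.1 * i for i in range(12)]
g_full, stored_full = backward_full(weights, 0.3)
g_ckpt, stored_ckpt = backward_checkpointed(weights, 0.3, every=4)
```

Both functions return the same gradient of the output with respect to the input, but the checkpointed version holds fewer values at any one time and pays for it with a second forward pass through each segment.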

### In Place Activated Batch Norm

Memory optimized BN that utilizes leaky ReLU
https://arxiv.org/pdf/1712.02616.pdf

### Reversible Architectures

Uses large reversible, differentiable functions so that activations do not need to be stored for backpropagation. During the backward pass the network function is "reversed" (inverse operations): each layer's inputs are recomputed from its outputs, and gradients are computed from the recomputed values.
https://arxiv.org/pdf/1707.04585.pdf
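A minimal sketch of a reversible block, using the additive coupling described in the linked paper: the input is split into two halves, and each half is updated with a function of the other. The residual functions `F` and `G` below are arbitrary stand-ins chosen for this example; they do not need to be invertible themselves.

```python
import math

def F(x):
    return math.tanh(x)        # stand-in for a residual function

def G(x):
    return 0.5 * x * x         # stand-in for a second residual function

def rev_forward(x1, x2):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_inverse(y1, y2):
    """Recover the block inputs from its outputs, so they need not be stored."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

x1, x2 = 0.7, -1.3
y1, y2 = rev_forward(x1, x2)
r1, r2 = rev_inverse(y1, y2)   # equals (x1, x2) up to floating-point error
```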



### Evolving BackPropagation

https://arxiv.org/pdf/1804.00746.pdf
Augmentation of the Back Prop algorithm

# Weight Update

## Batch Size

Effect of batch size:
- smaller batch size $\rightarrow$ less memory, more gradient noise, less effective batch norm, often a better result, slower convergence

Good choice for batch size on a single machine is 32.

## Weight Update Considerations

- Weights: $w$
- Error: $e(w)$
- Gradient: $g=\partial e/\partial w$
- Hessian: $H=\partial^2e/\partial w^2$

__2nd order approximation of error around a point $w_0$__

$\text{neighbor error} \approx \text{error} + \text{1st order approximate change} + \text{2nd order correction}$

$$e(w) \approx e(w_0)+(w-w_0)^Tg+0.5(w-w_0)^TH(w-w_0)$$

__Newton's Method__

If $H \succ 0$ (positive definite):
- $w \leftarrow w-H^{-1}g$

Note that this formula replaces the scalar learning rate $\alpha$ with a matrix, $H^{-1}$, that determines the appropriate learning rate for each coefficient.

Newton's method is attracted to critical points, and in error spaces with very many dimensions, critical points are most likely to be saddle points. Saddle points tend to have high error; this is one reason why Newton's method has not been commonly used in DL training.
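Both properties can be seen on two-dimensional quadratics. The sketch below uses NumPy; the matrices, starting point and the gradient descent learning rate of 0.1 are arbitrary values chosen for illustration. The second case deliberately applies the Newton update where $H$ is indefinite, to show the attraction to a saddle point.

```python
import numpy as np

def newton_step(w, grad_fn, hess_fn):
    """One Newton update: w <- w - H^{-1} g."""
    return w - np.linalg.solve(hess_fn(w), grad_fn(w))

w0 = np.array([5.0, -4.0])

# Convex bowl e(w) = 0.5 w^T A w - b^T w, with A positive definite.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
w_min = newton_step(w0, lambda w: A @ w - b, lambda w: A)   # reaches the minimum in one step

# Saddle e(w) = w_x^2 - w_y^2, with indefinite Hessian diag(2, -2).
S = np.diag([2.0, -2.0])
w_newton = newton_step(w0, lambda w: S @ w, lambda w: S)    # jumps straight to the saddle at 0
w_gd = w0 - 0.1 * (S @ w0)                                   # gradient descent moves away along y
```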


__Repeated Transformations in Networks__
Transformations (scalar ops and matrix multiplications) scale gradients. In systems like RNNs or sequential CNNs, where a transformation is applied repeatedly, the gradient is scaled repeatedly, which leads to exploding or vanishing gradients. __This is why the scale of transformations should hover around 1.__
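A toy illustration of the compounding effect, assuming each layer multiplies the gradient by the same scalar (a deliberate simplification of a real network):

```python
def repeated_scale(scale, depth):
    """Gradient magnitude after passing through `depth` layers that each scale it by `scale`."""
    g = 1.0
    for _ in range(depth):
        g *= scale
    return g

vanishing = repeated_scale(0.9, 100)   # shrinks towards 0
exploding = repeated_scale(1.1, 100)   # grows by several orders of magnitude
stable = repeated_scale(1.0, 100)      # unchanged
```

Even a 10% deviation from a scale of 1 changes the gradient by four to five orders of magnitude over 100 layers.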

## Optimizers
__SGD__


Applied as either true SGD (batch size = 1) or mini-batch SGD (batch size = n). Neurons that fire together are updated together, while neurons that do not activate strongly on a certain group of instances are not updated.
$$g \leftarrow \frac{1}{n}\nabla_{\theta}\sum_i L(x_i, y_i,\theta)$$

$$\theta \leftarrow \theta - \alpha g$$
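The two update equations can be sketched in plain Python on a one-parameter regression problem. The model $\hat{y}=\theta x$, the squared-error loss, the data and the hyperparameters are all assumptions made for this example.

```python
import random

def minibatch_sgd(xs, ys, theta, lr, batch_size, steps, seed=0):
    """Fit y = theta * x with mini-batch SGD on the loss 0.5 * (theta * x - y)^2."""
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    for _ in range(steps):
        batch = rng.sample(data, batch_size)
        # g = (1/n) * sum_i dL(x_i, y_i, theta) / dtheta
        g = sum((theta * x - y) * x for x, y in batch) / batch_size
        theta = theta - lr * g             # theta <- theta - alpha * g
    return theta

xs = [0.1 * i for i in range(1, 21)]
ys = [3.0 * x for x in xs]                 # noise-free data generated with theta = 3
theta = minibatch_sgd(xs, ys, theta=0.0, lr=0.1, batch_size=4, steps=500)
```

Setting `batch_size=1` gives true SGD; setting it to `len(xs)` gives full-batch gradient descent.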

__Nesterov SGD__


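Adds momentum to the update equation, implemented as an exponentially decaying moving average of gradients (the velocity $V$). Nesterov momentum applies an interim update first, computes the gradient at the interim point, updates the velocity, then applies the new velocity to the original (non-interim) parameters. In the equations below, $\alpha$ is the momentum coefficient and $\epsilon$ is the learning rate (unlike the SGD update above, where $\alpha$ is the learning rate).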

$$\tilde{\theta} = \theta +\alpha V $$

$$g \leftarrow \frac{1}{n}\nabla_{\tilde{\theta}}\sum_i L(x_i, y_i,\tilde{\theta})$$
$$V\leftarrow \alpha V - \epsilon g$$

$$\theta \leftarrow \theta + V$$
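The four steps map directly onto a short loop. The sketch below runs them on the one-dimensional quadratic $e(\theta)=2\theta^2$ (gradient $4\theta$); the objective and the hyperparameter values are arbitrary choices for illustration.

```python
def nesterov_sgd(grad_fn, theta, momentum, lr, steps):
    """Nesterov momentum: the gradient is evaluated at the look-ahead point."""
    v = 0.0
    for _ in range(steps):
        theta_interim = theta + momentum * v   # interim (look-ahead) parameters
        g = grad_fn(theta_interim)             # gradient at the interim point
        v = momentum * v - lr * g              # velocity update
        theta = theta + v                      # apply the velocity to the original parameters
    return theta

theta = nesterov_sgd(lambda t: 4.0 * t, theta=5.0, momentum=0.9, lr=0.05, steps=200)
```

Evaluating `grad_fn` at `theta` instead of `theta_interim` would turn this into classical (non-Nesterov) momentum.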

__AdaGrad__ [paper](http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf)



__RMSProp__ [slides](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf)

__Adam__ [paper](https://arxiv.org/abs/1412.6980)

## Learning Rate Update Schedules

Cyclical learning rates and super convergence appl

https://arxiv.org/abs/1506.01186

https://arxiv.org/abs/1708.07120

https://www.fast.ai/2018/04/30/dawnbench-fastai/

## Regularization in Optimization 

__Gradient Noise__

Add noise to the weight update to regularize model learning.

https://arxiv.org/abs/1511.06485

https://arxiv.org/abs/1511.06807


__Gradient Clipping__ 

Curtail exploding gradients with gradient clipping: set a max/min threshold for each gradient value, or rescale the whole gradient when its norm exceeds a threshold.

https://arxiv.org/abs/1211.5063
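Two common variants, sketched in plain Python on a list of gradient values. The thresholds are arbitrary; clipping by norm keeps the direction of the gradient, while clipping by value may change it.

```python
import math

def clip_by_value(grads, lo, hi):
    """Clamp every gradient entry to the interval [lo, hi]."""
    return [min(max(g, lo), hi) for g in grads]

def clip_by_norm(grads, max_norm):
    """Rescale the whole gradient if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

grads = [3.0, -4.0, 0.5]
by_value = clip_by_value(grads, -1.0, 1.0)   # each entry clamped independently
by_norm = clip_by_norm(grads, 1.0)           # same direction, norm reduced to 1
```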

## Parallelization

__Synchronous parallelization__

Give a part of a large batch to each of multiple workers, then combine their gradients into a single weight update.

https://arxiv.org/abs/1706.02677

https://arxiv.org/abs/1708.03888

https://arxiv.org/abs/1709.05011

https://arxiv.org/abs/1711.04325

https://arxiv.org/abs/1904.00962
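The synchronous scheme can be simulated in a single process. The sketch assumes equally sized shards (the batch size is divisible by the number of workers) and a one-parameter squared-error model; both are simplifications for the example, and the loop over workers stands in for computation that would run in parallel.

```python
def grad(theta, batch):
    """Mean gradient of 0.5 * (theta * x - y)^2 over a batch of (x, y) pairs."""
    return sum((theta * x - y) * x for x, y in batch) / len(batch)

def synchronous_step(theta, batch, num_workers, lr):
    shard = len(batch) // num_workers
    shards = [batch[i * shard:(i + 1) * shard] for i in range(num_workers)]
    worker_grads = [grad(theta, s) for s in shards]   # one gradient per worker
    g = sum(worker_grads) / num_workers               # average across workers (all-reduce)
    return theta - lr * g

batch = [(0.1 * i, 0.3 * i) for i in range(1, 129)]   # 128 examples with y = 3x
theta_parallel = synchronous_step(0.0, batch, num_workers=4, lr=0.01)
theta_single = 0.0 - 0.01 * grad(0.0, batch)          # same update computed on one machine
```

With equal shards, averaging the per-worker gradients reproduces the large-batch gradient, so the update matches single-machine training on the whole batch.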

__Ideal Batch Size for Parallel Training__ 

Want a huge batch size. Essentially, choose a number that gives each of your worker nodes an optimal batch size for a single machine (~32).

__Asynchronous Parallelization__

Allow asynchronous updates of gradients to a common parameter server

https://storage.googleapis.com/pub-tools-public-publication-data/pdf/40565.pdf

https://arxiv.org/abs/1609.08326

https://arxiv.org/abs/1711.00489

https://arxiv.org/abs/1811.03600

# Evaluation

__Early Stopping__
End training when the validation error does not improve for a number of consecutive epochs.
https://page.mi.fu-berlin.de/prechelt/Biblio/stop_tricks1997.pdf
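A minimal sketch of the stopping rule, assuming a `patience` parameter (the number of consecutive epochs without improvement that is tolerated). The name and the validation curve are made up for this example.

```python
def early_stopping_epoch(val_errors, patience):
    """Return the index of the epoch at which training would stop."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            best = err                        # new best validation error
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch                  # stop here
    return len(val_errors) - 1                # never triggered: ran all epochs

val_errors = [0.9, 0.7, 0.6, 0.61, 0.62, 0.59, 0.60, 0.61, 0.63, 0.5]
stop = early_stopping_epoch(val_errors, patience=3)
```

A larger `patience` tolerates longer plateaus at the cost of extra epochs. In practice the weights from the best epoch are usually the ones kept.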


# Additional Links

- [Bag of tricks for image classification with convolutional neural networks](https://arxiv.org/abs/1812.01187)
- [Training tips for the transformer model](https://arxiv.org/abs/1804.00247)
- [Factorization tricks for LSTM networks](https://arxiv.org/abs/1703.10722)
- [AdamW and super-convergence is now the fastest way to train neural nets](https://www.fast.ai/2018/07/02/adam-weight-decay/)
- [Training ImageNet in 3 hours for \$25; and CIFAR10 for \$0.26](https://www.fast.ai/2018/04/30/dawnbench-fastai/)

https://arxiv.org/abs/1807.11205

https://arxiv.org/abs/1811.06992

https://arxiv.org/abs/1805.09501v1

https://arxiv.org/abs/1906.11052

https://arxiv.org/pdf/1706.06083.pdf

- [Curriculum learning](https://ronan.collobert.com/pub/matos/2009_curriculum_icml.pdf)
- [Group normalization](https://arxiv.org/abs/1803.08494)
- [Improving neural networks by preventing co-adaptation of feature detectors](https://arxiv.org/abs/1207.0580)
- [Large-margin softmax loss for convolutional neural networks](https://arxiv.org/abs/1612.02295)
- [Distilling the knowledge in a neural network](https://arxiv.org/abs/1503.02531)
- [Effect of depth and width on local minima in deep learning](https://arxiv.org/abs/1811.08150)
- [When does label smoothing help?](https://arxiv.org/abs/1906.02629)
- [The loss surfaces of multilayer networks](https://arxiv.org/abs/1412.0233)
- [Identifying and attacking the saddle point problem in high-dimensional non-convex optimization](https://arxiv.org/abs/1406.2572)
- [Hierarchical loss for classification](https://arxiv.org/abs/1709.01062)

- [Automatic differentiation](http://www.robots.ox.ac.uk/~tvg/publications/talks/autodiff.pdf)
- [Reverse-mode automatic differentiation: a tutorial](https://rufflewind.com/2016-12-30/reverse-mode-automatic-differentiation)
- [Automatic differentiation of algorithms](https://www.sciencedirect.com/science/article/pii/S0377042700004222?via%3Dihub)
- [Automatic reverse-mode differentiation: lecture notes](http://www.cs.cmu.edu/~wcohen/10-605/notes/autodiff.pdf)
- [Training neural networks using features replay](https://arxiv.org/abs/1807.04511)
- [Automatic differentiation in ML: Where we are and where we should be going](https://arxiv.org/abs/1810.11530)
- [Why momentum really works](https://distill.pub/2017/momentum/)
- [On the importance of initialization and momentum in deep learning](http://www.cs.toronto.edu/~fritz/absps/momentum.pdf)
- [An overview of gradient descent optimization algorithms](https://arxiv.org/abs/1609.04747)
- [Online learning rate adaptation with hypergradient descent](https://arxiv.org/abs/1703.04782)
