# Glossary (Definitions & Questions)

--------

### How do neural networks learn?

Neural networks learn using the backpropagation algorithm together with a gradient-based optimizer such as gradient descent. Backpropagation measures how much each neuron's “weights” and “biases” contributed to the error in the output, and the optimizer then adjusts those parameters to reduce that error. Repeated over many examples, this teaches the network to produce correct outputs for the given inputs. Backpropagation computes the gradients layer by layer, starting at the output and propagating the error backward, hence the name.
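As a minimal sketch of the idea above, the following trains a tiny network with one hidden unit and one output unit. The network shape, the function names (`forward`, `backward`, `train`), the learning rate, and the toy data are all assumptions made for illustration.

```python
import math

def forward(params, x):
    """Forward pass of a tiny 1-1-1 network: x -> tanh hidden unit -> linear output."""
    w1, b1, w2, b2 = params
    h = math.tanh(w1 * x + b1)   # hidden activation
    y = w2 * h + b2              # network output
    return h, y

def backward(params, x, target):
    """Backpropagate the squared error 0.5 * (y - target)**2 through both layers."""
    w1, b1, w2, b2 = params
    h, y = forward(params, x)
    dy = y - target              # dLoss/dy at the output
    dw2, db2 = dy * h, dy        # gradients for the output layer
    dh = dy * w2                 # error propagated back to the hidden unit
    dz = dh * (1.0 - h * h)      # through tanh: d tanh(z)/dz = 1 - tanh(z)**2
    dw1, db1 = dz * x, dz        # gradients for the hidden layer
    return [dw1, db1, dw2, db2]

def loss(params, data):
    """Mean squared-error loss over a list of (input, target) pairs."""
    return sum(0.5 * (forward(params, x)[1] - t) ** 2 for x, t in data) / len(data)

def train(params, data, lr=0.1, epochs=200):
    """Plain stochastic gradient descent: one update per training example."""
    params = list(params)
    for _ in range(epochs):
        for x, t in data:
            grads = backward(params, x, t)
            params = [p - lr * g for p, g in zip(params, grads)]
    return params

data = [(-1.0, -0.5), (0.0, 0.0), (1.0, 0.5)]   # toy (input, target) pairs
initial = [0.5, 0.0, 0.5, 0.0]                  # w1, b1, w2, b2
trained = train(initial, data)
```

Comparing `loss(initial, data)` with `loss(trained, data)` shows whether the updates reduced the error. Note the split of responsibilities: `backward` only computes gradients, while the update inside `train` is the gradient descent step.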

### What is batch normalization?

As data flows through a deep network, each layer's weights and biases transform the values, and the distribution of a layer's inputs keeps shifting as those parameters change during training, sometimes making the values too big or too small. The authors of the original Batch Normalization paper refer to this problem as "internal covariate shift". By normalizing the data in each mini-batch, this problem is largely avoided. Batch Normalization normalizes each mini-batch using its own mean and variance.

To reduce internal covariate shift, Batch Normalization adds a normalization “layer” between layers. An important thing to note is that normalization is done separately for each dimension (input neuron) over the mini-batch, not jointly across all dimensions.
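The per-dimension normalization described above can be sketched as follows. This is a training-mode illustration only: the function name `batch_norm` and the example batch are assumptions, and the optional `gamma` and `beta` stand in for the learned per-feature scale and shift used in the original formulation.

```python
import math

def batch_norm(batch, gamma=None, beta=None, eps=1e-5):
    """Normalize a mini-batch feature by feature, using the batch's own statistics.

    batch: list of samples, each a list of feature values.
    gamma, beta: optional per-feature scale and shift (learned in practice).
    """
    n, d = len(batch), len(batch[0])
    gamma = gamma or [1.0] * d
    beta = beta or [0.0] * d
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):                                   # each dimension separately
        column = [sample[j] for sample in batch]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n   # biased (population) variance
        for i in range(n):
            x_hat = (batch[i][j] - mean) / math.sqrt(var + eps)
            out[i][j] = gamma[j] * x_hat + beta[j]
    return out

# Two features on very different scales, four samples in the mini-batch.
batch = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]]
normalized = batch_norm(batch)
```

After the call, each column of `normalized` has approximately zero mean and unit variance, even though the two input features differ in scale by a factor of 100. At inference time, frameworks typically replace the mini-batch statistics with running averages collected during training, which this sketch leaves out.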

There are usually two places where Batch Normalization can be applied:

- Before activation function (non-linearity)
- After non-linearity

In the original paper, BatchNorm is applied before the activation function. Some activation functions can have problems when used this way. For sigmoid and tanh, the normalized values fall mostly in the region where the function is closer to linear than nonlinear.

Some practitioners have observed that BatchNorm applied after the activation performs better and even gives better accuracy, even though many examples and much of the literature show BatchNorm before the activation.
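The two placements differ only in the order of operations, as this sketch shows for a single neuron over one mini-batch. The helper names `normalize` and `relu`, the choice of ReLU, and the sample values are assumptions for illustration; the learned scale and shift are omitted.

```python
import math

def normalize(values, eps=1e-5):
    """Standardize one feature over a mini-batch (no learned scale or shift)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def relu(values):
    """Element-wise ReLU non-linearity."""
    return [max(0.0, v) for v in values]

# One neuron's pre-activation outputs over a mini-batch of four samples.
pre_activations = [-2.0, -1.0, 0.5, 3.0]

bn_before = relu(normalize(pre_activations))   # BatchNorm, then activation
bn_after = normalize(relu(pre_activations))    # activation, then BatchNorm
```

With BatchNorm first, the next layer receives the non-negative ReLU outputs. With BatchNorm last, the next layer receives zero-mean inputs. Which order works better is an empirical question for a given model.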

In addition to speeding up the learning of neural networks, BatchNorm also provides a weak form of regularization. How does it introduce regularization? Regularization can come from adding noise to the data. Because the mean and variance are computed on the mini-batch rather than on the whole dataset, they are noisy estimates, and that noise is injected into the activations.

However, BatchNorm provides only weak regularization, so it must not be relied upon alone to avoid overfitting. Other regularization can still be reduced accordingly. For example, if you would otherwise use dropout with a drop rate of 0.6, with BatchNorm you might reduce the drop rate to 0.4. The regularizing effect is noticeable mainly when the batch size is small, because larger batches give less noisy statistics.
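The link between batch size and noise can be checked with a small simulation. This is a sketch on synthetic data: the dataset, the batch sizes 8 and 256, and the helper name `batch_means` are assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)

# Synthetic stand-in for one activation's values over a whole dataset.
dataset = [random.gauss(5.0, 2.0) for _ in range(1000)]

def batch_means(data, batch_size, n_batches=200):
    """Mean of the activation in each of n_batches randomly drawn mini-batches."""
    return [statistics.fmean(random.sample(data, batch_size)) for _ in range(n_batches)]

# How much the mini-batch mean fluctuates from one batch to the next.
spread_small = statistics.pstdev(batch_means(dataset, 8))
spread_large = statistics.pstdev(batch_means(dataset, 256))
```

The spread of the mini-batch mean is much larger for the small batches. Since BatchNorm subtracts that mean, a given sample is normalized slightly differently in each batch, and this variation is the noise that acts as a regularizer.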

#### Benefits

- Networks train faster and converge much more quickly
- Allows higher learning rates; without it, gradient descent usually requires small learning rates for the network to converge
- Makes weights easier to initialize
- Makes more activation functions viable: because batch normalization regulates the values going into each activation function, non-linearities that don't seem to work well in deep networks become viable again
- May give better results overall
- Batch Normalization allows us to use much higher learning rates and be less careful about initialization
- It also acts as a regularizer, in some cases eliminating the need for Dropout
- Is mainly an optimization that helps training go faster, not something used specifically to make the network better
- Makes normalization part of the model architecture, performing it for each training mini-batch
- Attempts to achieve a stable distribution of activation values throughout training

#### Notes from Article Leading to Internal Covariate Shift

In order to understand what batch normalization is, first we need to address which problem it is trying to solve.

Usually, in order to train a neural network, we do some preprocessing on the input data. For example, we could standardize all data so that each feature has zero mean and unit variance. Why do we do this preprocessing? There are many reasons, including preventing the early saturation of non-linear activation functions like the sigmoid and ensuring that all input data is in the same range of values.

But a problem appears in the intermediate layers, because the distribution of the activations is constantly changing during training. This slows down the training process because each layer must adapt to a new distribution at every training step. This problem is known as internal covariate shift.

### Internal Covariate Shift

We define Internal Covariate Shift as the change in the distribution of network activations due to the change in network parameters during training.

To improve the training, we seek to reduce the internal covariate shift. By fixing the distribution of the layer inputs x as the training progresses, we expect to improve the training speed. It has long been known (LeCun et al., 1998b; Wiesler & Ney, 2011) that network training __converges faster__ if its inputs are whitened, i.e., linearly transformed to have zero means and unit variances, and decorrelated. As each layer observes the inputs produced by the layers below, it would be advantageous to achieve the same whitening of the inputs of each layer.

By whitening the inputs to each layer, we would take a step towards achieving the fixed distributions of inputs that would remove the ill effects of the internal covariate shift.
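One common way to write the whitening described above is sketched below. The symbols are introduced here for illustration and are not from the notes: the layer inputs $x$ are assumed to have mean $\mu$ and an invertible covariance $\Sigma$ with eigendecomposition $\Sigma = U \Lambda U^\top$.

$$
\tilde{x} = \Lambda^{-1/2} U^\top (x - \mu), \qquad
\mathbb{E}[\tilde{x}] = 0, \qquad
\operatorname{Cov}(\tilde{x}) = \Lambda^{-1/2} U^\top \Sigma \, U \Lambda^{-1/2} = I
$$

This particular form is PCA whitening, one of several possible whitening transforms. Batch Normalization skips the decorrelation step and only standardizes each dimension separately, which is far cheaper than computing and inverting a full covariance matrix at every layer.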

### How to improve neural network performance? - [Link](https://machinelearningmastery.com/improve-deep-learning-performance/)

- Diagnostics
- Weight Initialization
- Learning Rate
- Activation Functions
- Network Topology
- Batches and Epochs
- Regularization
- Optimization and Loss
- Early Stopping
- [Progressive Resizing](https://towardsdatascience.com/boost-your-cnn-image-classifier-performance-with-progressive-resizing-in-keras-a7d96da06e20)

### What's a method for building a CNN architecture?

[Link](https://towardsdatascience.com/a-guide-to-an-efficient-way-to-build-neural-network-architectures-part-ii-hyper-parameter-42efca01e5d7)