# Glossary (Definitions & Questions)

--------

### How do neural networks learn?

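Neural networks learn using the backpropagation algorithm together with a gradient-based optimizer. A forward pass produces a prediction, a loss function measures how wrong it is, and backpropagation applies the chain rule to compute the gradient of that loss with respect to every weight and bias. The optimizer then adjusts the “weights” and “biases” in the direction that reduces the error, so the network gradually learns to produce correct outputs for the given inputs. The gradients are computed at the output layer first and propagated backward layer by layer, hence the name.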
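To make the forward pass, backward pass, and update concrete, here is a minimal NumPy sketch of backpropagation for a tiny 3 -> 8 -> 1 network. The toy data, layer sizes, `tanh` activation, mean squared error loss, and learning rate are all assumptions chosen for illustration, not a recommended setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (made up for illustration): 64 samples, 3 features.
X = rng.normal(size=(64, 3))
y = np.sin(X.sum(axis=1, keepdims=True))

# Parameters of a 3 -> 8 -> 1 network.
W1 = rng.normal(scale=0.5, size=(3, 8))
b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1))
b2 = np.zeros(1)

lr = 0.05
losses = []
for step in range(200):
    # Forward pass.
    a1 = np.tanh(X @ W1 + b1)
    y_hat = a1 @ W2 + b2
    err = y_hat - y
    losses.append(float(np.mean(err ** 2)))

    # Backward pass: chain rule, from the output layer back to the input layer.
    d_yhat = 2 * err / len(X)        # dLoss/dy_hat for mean squared error
    dW2 = a1.T @ d_yhat
    db2 = d_yhat.sum(axis=0)
    d_a1 = d_yhat @ W2.T             # gradient propagated to the hidden layer
    d_z1 = d_a1 * (1 - a1 ** 2)      # tanh'(z) = 1 - tanh(z)^2
    dW1 = X.T @ d_z1
    db1 = d_z1.sum(axis=0)

    # Gradient descent update of weights and biases.
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

print(f"loss at start: {losses[0]:.4f}, loss at end: {losses[-1]:.4f}")
```

In practice a framework computes these gradients automatically; the sketch only spells out what "propagating backward" means.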

### What is batch normalization?

As data flows through a deep network, each layer's weights transform it, and because those weights keep changing during training, the distribution of values reaching later layers keeps shifting, sometimes becoming too large or too small. The authors of the Batch Normalization paper call this "internal covariate shift". Normalizing the data in each mini-batch largely avoids the problem: Batch Normalization standardizes each mini-batch using that mini-batch's own mean and variance.

To reduce internal covariate shift, Batch Normalization adds a normalization “layer” between layers. An important thing to note here is that normalization is done separately for each dimension (input neuron), with statistics computed over the ‘mini-batch’, and not jointly across all dimensions.
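The per-dimension normalization can be sketched in a few lines of NumPy. This is a training-time forward pass only, under stated assumptions: `gamma` and `beta` stand for the learned scale and shift from the Batch Normalization paper, `eps` is a small constant for numerical stability, and the running statistics used at inference time are omitted.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize a mini-batch x of shape (batch, features), one feature at a time."""
    mean = x.mean(axis=0)                    # one mean per feature, over the batch
    var = x.var(axis=0)                      # one variance per feature, over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance per feature
    return gamma * x_hat + beta              # learned scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))  # made-up mini-batch: 32 samples, 4 features
out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))

print(out.mean(axis=0))  # close to 0 for every feature
print(out.std(axis=0))   # close to 1 for every feature
```

Note that `axis=0` is what makes the statistics per-feature: each column is normalized using only its own values across the mini-batch.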

There are usually two types in which Batch Normalization can be applied:

- Before activation function (non-linearity)
- After non-linearity

In the original paper, BatchNorm is applied before the activation function. Some activation functions can behave poorly when used this way. For sigmoid and tanh, for example, the normalized inputs fall mostly in the region where the function is close to linear rather than nonlinear.

It has been observed that BatchNorm applied after the activation can train better and even give better accuracy, even though many examples and much of the literature place it before the activation.

In addition to speeding up the training of neural networks, BatchNorm also provides a weak form of regularization. How does it introduce regularization? Regularization can come from adding noise to the data. Because the mean and variance are computed on each mini-batch rather than on the whole dataset, the normalization statistics vary from batch to batch and act as noise.

However, BatchNorm provides only weak regularization, so it must not be relied upon on its own to avoid over-fitting. Other regularization can be reduced accordingly, though. For example, if you would otherwise use a dropout rate of 0.6, with BatchNorm you could reduce it to 0.4. The regularization effect is mainly noticeable when the batch size is small, because the batch statistics are then noisier.

#### Benefits

- Networks train faster and converge much more quickly
- Allows higher learning rates; without it, gradient descent usually requires small learning rates for the network to converge
- Makes weights easier to initialize, so you can be less careful about initialization
- Makes more activation functions viable: because batch normalization regulates the values going into each activation function, non-linearities that don't seem to work well in deep networks become viable again
- May give better results overall
- Acts as a regularizer, in some cases eliminating the need for Dropout
- Is primarily an optimization aid that helps the network train faster, not something used specifically to make the network better
- Makes normalization part of the model architecture, performing it for each training mini-batch
- Attempts to keep the distribution of activation values stable throughout training

#### Notes from Article Leading to Internal Covariate Shift

In order to understand what batch normalization is, first we need to address which problem it is trying to solve.

Usually, in order to train a neural network, we do some preprocessing of the input data. For example, we could standardize all data so that each feature has zero mean and unit variance. Why do we do this preprocessing? There are many reasons, including preventing the early saturation of non-linear activation functions such as the sigmoid, and ensuring that all input data is in the same range of values.

But the problem reappears in the intermediate layers, because the distribution of their activations changes constantly during training. This slows training down, because each layer must adapt to a new input distribution at every training step. This problem is known as internal covariate shift.

### Internal Covariate Shift

We define Internal Covariate Shift as the change in the distribution of network activations due to the change in network parameters during training.

To improve the training, we seek to reduce the internal covariate shift. By fixing the distribution of the layer inputs x as the training progresses, we expect to improve the training speed. It has been long known (LeCun et al., 1998b;  Wiesler & Ney, 2011) that the network training __converges faster__ if its inputs are whitened – i.e., linearly transformed to have zero means and unit variances, and decorrelated. As each layer observes the inputs produced by the layers below, it would be advantageous to achieve the same whitening of the inputs of each layer. 

By whitening the inputs to each layer, we would take a step towards achieving the fixed distributions of inputs that would remove the ill effects of the internal covariate shift.
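As a rough illustration of whitening (zero mean, unit variance, decorrelated), the sketch below applies PCA whitening to a small correlated dataset. The data, the mixing matrix `A`, and the `eps` constant are assumptions made up for the example. This is not how Batch Normalization itself works: as noted earlier, it normalizes each dimension separately and does not decorrelate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up correlated data: independent noise mixed by a fixed matrix A.
A = np.array([[2.0, 0.5, 0.0],
              [0.0, 1.0, 0.3],
              [0.0, 0.0, 0.5]])
X = rng.normal(size=(500, 3)) @ A + 4.0

# 1. Center each feature.
Xc = X - X.mean(axis=0)

# 2. Eigendecomposition of the covariance matrix.
cov = Xc.T @ Xc / len(Xc)
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. Rotate onto the eigenvectors and rescale each direction to unit variance.
eps = 1e-8
X_white = (Xc @ eigvecs) / np.sqrt(eigvals + eps)

cov_white = X_white.T @ X_white / len(X_white)
print(np.round(cov_white, 3))  # approximately the identity matrix
```

The eigendecomposition is what makes full whitening expensive for wide layers, which is one reason cheaper per-dimension normalization is attractive.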

### How to improve neural network performance? - [Link](https://machinelearningmastery.com/improve-deep-learning-performance/)

- Diagnostics
- Weight Initialization
- Learning Rate
- Activation Functions
- Network Topology
- Batches and Epochs
- Regularization
- Optimization and Loss
- Early Stopping
- [Progressive Resizing](https://towardsdatascience.com/boost-your-cnn-image-classifier-performance-with-progressive-resizing-in-keras-a7d96da06e20)

### What is a method for building a CNN architecture?

https://towardsdatascience.com/a-guide-to-an-efficient-way-to-build-neural-network-architectures-part-ii-hyper-parameter-42efca01e5d7

### What is the difference between sparse categorical cross-entropy and categorical cross-entropy?

Sparse categorical cross-entropy is used when the target labels are integer class indices, NOT one-hot encoded (1, 2, 3 etc.), and categorical cross-entropy when the targets are one-hot encoded ([1,0,0] or [0,1,0]).

Sparse categorical cross-entropy and categorical cross-entropy can be thought of as multi-class variants of log-loss.
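The two losses compute the same quantity and differ only in how the labels are encoded. The NumPy sketch below illustrates this with hand-written helper functions. The function names, the `eps` constant, and the example probabilities are assumptions for illustration, not any library's API.

```python
import numpy as np

def categorical_cross_entropy(y_onehot, probs, eps=1e-12):
    """Labels are one-hot rows, e.g. [0, 1, 0]."""
    return float(-np.mean(np.sum(y_onehot * np.log(probs + eps), axis=1)))

def sparse_categorical_cross_entropy(y_index, probs, eps=1e-12):
    """Labels are integer class indices, e.g. 1."""
    picked = probs[np.arange(len(y_index)), y_index]  # probability of the true class
    return float(-np.mean(np.log(picked + eps)))

# Made-up predicted probabilities for 3 samples and 3 classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.3, 0.5]])

y_index = np.array([0, 1, 2])   # integer labels
y_onehot = np.eye(3)[y_index]   # the same labels, one-hot encoded

loss_sparse = sparse_categorical_cross_entropy(y_index, probs)
loss_onehot = categorical_cross_entropy(y_onehot, probs)
print(loss_sparse, loss_onehot)  # the two values agree
```

The sparse form avoids materializing the one-hot matrix, which matters when the number of classes is large.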

### Hyperparameters - [Link](https://towardsdatascience.com/a-guide-to-an-efficient-way-to-build-neural-network-architectures-part-i-hyper-parameter-8129009f131b)

1. Number of Layers:- Choose this carefully. A very high number may introduce problems such as over-fitting and vanishing or exploding gradients, while a low number may leave the model with high bias and limited capacity. The right value depends a lot on the size of the data used for training.


2. Number of hidden units per layer:- These too must be chosen carefully to find a sweet spot between high bias and high variance. Again, this depends on the size of the data used for training.


3. Activation Function:- The popular choices are ReLU, LeakyReLU, and Sigmoid and Tanh (the last two only for shallow networks). ReLU and LeakyReLU generally do equally well. Sigmoid and Tanh may do well for shallow networks. The identity function is useful as the output activation in regression problems.


4. Optimizer:- The algorithm the model uses to update the weights of every layer after every iteration. Popular choices are SGD, RMSProp and Adam. SGD works well for shallow networks but can struggle to escape saddle points and local minima; in such cases RMSProp could be a better choice. AdaDelta/AdaGrad suit sparse data, whereas Adam is a general favorite and can be used to achieve faster convergence. For further reference, see https://towardsdatascience.com/types-of-optimization-algorithms-used-in-neural-networks-and-ways-to-optimize-gradient-95ae5d39529f
    
    
5. Learning Rate:- It controls how large each weight update is, and must be chosen so that it is neither so high that training fails to converge to a minimum nor so low that learning becomes needlessly slow. It is recommended to try powers of 10, specifically 0.001, 0.01, 0.1 and 1. The best value depends a lot on the optimizer used: for SGD, 0.1 generally works well, whereas for Adam, 0.001 or 0.01 is typical, but it is recommended to always try all values from the range above. You can also use the decay parameter to reduce the learning rate as the number of iterations grows, which helps convergence. Generally it is better to use an adaptive learning rate algorithm like Adam than a decaying learning rate.
    
    
6. Initialization:- Doesn’t play a very big role, as the defaults work well, but for better results it is still preferable to use He-normal/uniform initialization with ReLUs and Glorot-normal/uniform (the default is Glorot-uniform) with Sigmoid. Avoid initializing the weights to zero or to any constant value (the same across all units), because every unit in a layer would then receive the same gradient and learn the same thing.
    
    
7. Batch Size:- The number of patterns shown to the network before the weight matrix is updated. If the batch size is small, each batch is less representative of the data, so the gradient estimates are noisy, the weights jump around, and convergence becomes difficult. If the batch size is large, learning can become slow because the weights are updated fewer times per epoch. It is recommended to try batch sizes in powers of 2 (for better memory optimization), chosen based on the data size.
    
    
8. Number of Epochs:- The number of times the entire training data is shown to the model. It plays an important role in how well the model fits the training data. A high number of epochs may over-fit the training data and generalize poorly to the validation and test sets. A low number of epochs may limit the potential of the model. Try different values based on the time and computational resources you have.
    
    
9. Dropout:- The keep-probability of the Dropout layer can be thought of as a hyper-parameter that acts as a regularizer to help find the optimum bias-variance spot. It works by randomly dropping a fraction of units (and their connections) at every iteration, so the hidden units cannot depend too heavily on any particular feature. It can take any value between 0 and 1, and should be chosen based on how much the model is over-fitting.
    
    
10. L1/L2 Regularization:- Serves as another regularizer, in which very high weight values are penalized so that the model does not depend on a single feature. This generally reduces variance at the cost of increasing bias, i.e. lowering accuracy on the training data. Use it when the model continues to over-fit even after considerably increasing the Dropout value.
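To make the Dropout entry above more concrete, here is a minimal NumPy sketch of "inverted" dropout at training time. The function name `dropout_forward`, the choice of `keep_prob=0.8`, and the all-ones activations are assumptions for illustration. Note that some libraries take the drop rate (1 - keep-probability) rather than the keep-probability as the parameter.

```python
import numpy as np

def dropout_forward(x, keep_prob, rng):
    """Training-time inverted dropout: zero out units at random and
    rescale the survivors so the expected activation is unchanged."""
    mask = rng.random(x.shape) < keep_prob  # True where a unit is kept
    return x * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones((1000, 100))          # made-up hidden activations
dropped = dropout_forward(activations, keep_prob=0.8, rng=rng)

print((dropped == 0).mean())  # roughly 0.2 of the units are dropped
print(dropped.mean())         # roughly 1.0, the same as the input mean
```

Because of the rescaling by `keep_prob`, nothing needs to change at inference time: the layer is simply skipped.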

### Why do we need a non-linear activation function in an artificial neural network?

Neural networks are used to implement complex functions, and non-linear activation functions enable them to approximate arbitrarily complex functions. Without the non-linearity introduced by the activation function, multiple layers of a neural network are equivalent to a single layer neural network.
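The collapse of stacked linear layers into one can be checked numerically. In the sketch below (random weights and shapes chosen purely for illustration), two layers without an activation produce exactly the same outputs as a single layer with combined weights, while inserting a ReLU between them breaks the equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))

W1 = rng.normal(size=(4, 6))
b1 = rng.normal(size=6)
W2 = rng.normal(size=(6, 3))
b2 = rng.normal(size=3)

# Two layers with no activation in between.
two_layers = (x @ W1 + b1) @ W2 + b2

# The same mapping written as a single layer.
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = x @ W + b

# With a non-linearity (ReLU) in between, no such single layer reproduces it.
with_relu = np.maximum(x @ W1 + b1, 0) @ W2 + b2

print(np.allclose(two_layers, one_layer))  # True
print(np.allclose(with_relu, one_layer))   # False
```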

### Dead ReLUs

The standard ReLU function is also not perfect. For negative inputs ReLU outputs 0, and its gradient there is 0 as well. A neuron whose input is negative for every example therefore never activates and receives no weight updates, so some of your neurons can end up dead and never used. Common causes are a learning rate that is too large and poor weight initialization. If parameter tweaking does not help, you can try Leaky ReLU, PReLU, ELU or Maxout, which do not have this problem.
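A small NumPy sketch of why a dead unit cannot recover: when every pre-activation is negative, the ReLU gradient is zero everywhere, so no update reaches the unit's weights, whereas Leaky ReLU still passes a small gradient. The helper names, the pre-activation values, and the slope `alpha=0.01` are assumptions for illustration.

```python
import numpy as np

def relu_grad(z):
    """Derivative of ReLU with respect to its input."""
    return (z > 0).astype(float)

def leaky_relu_grad(z, alpha=0.01):
    """Derivative of Leaky ReLU: a small slope alpha for negative inputs."""
    return np.where(z > 0, 1.0, alpha)

# Made-up pre-activations of one unit over a batch, all negative.
z = np.array([-3.0, -0.5, -2.0, -1.0])

print(relu_grad(z))        # [0. 0. 0. 0.]        -> no gradient, the unit stays dead
print(leaky_relu_grad(z))  # [0.01 0.01 0.01 0.01] -> a small gradient still flows
```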

### Error when checking model input: expected convolution2d_input_1 to have 4 dimensions, but got array with shape (32, 32, 3) - anything similar

Basically this means that the input you are passing does not have the shape the model expects. Find out what shape is needed; usually the missing 4th dimension is the batch size, which comes first. You may have declared an input layer of, say, (50, 50, 3), but the model expects an extra leading batch dimension, which is often shown as a question mark: (?, 50, 50, 3).

This could be solved by trying `np.expand_dims(image, axis=0)`, or by reshaping the image to the expected input size, e.g. `reshape(image, input_size)`.
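For the shape in the error message above, a minimal NumPy sketch of adding the batch dimension looks like this. The all-zeros `image` is a stand-in for a real image array.

```python
import numpy as np

# Stand-in for a single 32x32 RGB image, shape (32, 32, 3).
image = np.zeros((32, 32, 3), dtype=np.float32)

# Add a leading batch dimension so the shape becomes (1, 32, 32, 3).
batch = np.expand_dims(image, axis=0)

# Equivalent alternatives.
batch_newaxis = image[np.newaxis, ...]
batch_reshaped = image.reshape((1,) + image.shape)

print(image.shape)  # (32, 32, 3)
print(batch.shape)  # (1, 32, 32, 3)
```

The resulting `batch` is a batch of one image, which is the 4-dimensional input a 2D convolutional model expects.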

Some links below:
1. [Link 1](https://stackoverflow.com/questions/41563720/error-when-checking-model-input-expected-convolution2d-input-1-to-have-4-dimens)