## Hyperparameters

Neural networks have many hyperparameters that can be tuned. This flexibility is also a problem, because it leaves many choices open and makes hyperparameter selection partly subjective.

* **Number of hidden layers**. For many problems a single hidden layer is sufficient, but adding layers tends to increase parameter efficiency and accuracy. Another advantage of deeper networks is transfer learning: the layers of an already trained network can be reused for a related problem, so the new network does not have to learn everything from scratch.


* **Number of neurons per hidden layer**. The numbers of input and output neurons are determined by your dataset: inputs = predictor variables; outputs = classes or values to be predicted. Using the same number of neurons in all hidden layers is usually preferred. Increasing the number of layers generally improves model performance more than increasing the number of neurons per layer. Use early stopping to prevent overfitting.


* **Learning rate**. Arguably the most important hyperparameter. To tune it, start with a value large enough to make the training algorithm diverge, then divide it by 3 and try again, repeating until training stops diverging.


* **Batch size**. The batch size is the number of training samples used to compute each weight update. It should be small (< 32), but not too small: with batch size = 1, the error is calculated from only one sample at a time.


* **Epochs**. An epoch is one complete pass over the training set. Use early stopping to prevent overfitting.
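
The learning-rate procedure above can be sketched as follows. This is only an illustration on a toy quadratic loss, not a real network: the functions `train_loss` and `search_learning_rate`, the starting weights and the starting rate of 3.0 are all hypothetical choices made for this example.

```python
import numpy as np

def train_loss(learning_rate, steps=50):
    """Run plain gradient descent on a toy quadratic loss; return the final loss."""
    w = np.array([5.0, -3.0])          # hypothetical starting weights
    curvature = np.array([1.0, 10.0])  # loss = 0.5 * sum(curvature * w**2)
    for _ in range(steps):
        grad = curvature * w           # gradient of the toy loss
        w = w - learning_rate * grad
    return 0.5 * np.sum(curvature * w ** 2)

def search_learning_rate(start=3.0, factor=3.0, max_tries=10):
    """Divide the learning rate by `factor` until training no longer diverges."""
    initial_loss = train_loss(0.0, steps=0)  # loss before any update
    lr = start
    for _ in range(max_tries):
        final_loss = train_loss(lr)
        # "Not diverging" is taken here to mean: finite and lower than at the start.
        if np.isfinite(final_loss) and final_loss < initial_loss:
            return lr
        lr /= factor
    return None  # no usable rate found within max_tries

found_lr = search_learning_rate()
print(found_lr)
```

In practice `train_loss` would be replaced by a short training run of the actual model, and the rate found this way is only a starting point for finer tuning.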

## Regularization


https://www.deeplearningbook.com.br/overfitting-e-regularizacao-parte-1/
https://www.deeplearningbook.com.br/afinal-por-que-a-regularizacao-ajuda-a-reduzir-o-overfitting/

We regularize a neural network to prevent overfitting, that is, to reduce the gap between its performance on the training set and on the test set, so that its predictions generalize better.

Another reason to regularize is to make the network less sensitive to drastic changes in the input variables. Because regularization shrinks the weights toward small values, a change in the inputs changes the output less than it would with large weights, so the network behaves more stably.



## Penalization L1

https://www.deeplearningbook.com.br/capitulo-22-regularizacao-l1/


L1 penalization modifies the non-regularized cost function by adding the **sum of the absolute values of the weights**, multiplied by lambda/n, where lambda is the regularization parameter and n is the training sample size.



![image-2.png](attachment:image-2.png)


The penalty shrinks large weights, so the network comes to prefer small weights.
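
In symbols, the L1 penalty described above can be written as follows. The notation is an assumption of this note and may differ from the image: $C_0$ is the non-regularized cost, $\eta$ the learning rate, and $\operatorname{sgn}(w)$ the sign of the weight.

$$
C = C_0 + \frac{\lambda}{n} \sum_w |w|
$$

Taking the derivative with respect to a weight gives the gradient descent update:

$$
w \rightarrow w - \frac{\eta \lambda}{n} \operatorname{sgn}(w) - \eta \frac{\partial C_0}{\partial w}
$$

The middle term moves every weight toward 0 by the same amount, $\eta \lambda / n$, regardless of how large the weight is.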






## Penalization L2

https://www.deeplearningbook.com.br/overfitting-e-regularizacao-parte-2/

Also known as weight decay. It adds a penalty (regularization) term to the cost function: for classification problems the penalty is added to the cross-entropy function, while for regression it is added to the mean squared error function.

For the cross-entropy function, the penalty term is the sum of the squares of all the weights of the network:

![image.png](attachment:image.png)

In the equation in the image above, lambda is the regularization parameter (lambda > 0) and n is the training sample size. The factor lambda / 2n multiplies the **sum of all squared weights** of the network. It is important to note that **the regularization term does not include the biases**.
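
Written out, with $C_0$ standing for the non-regularized cost (cross-entropy or mean squared error; the symbol is an assumption of this note and may differ from the image), the L2-regularized cost is:

$$
C = C_0 + \frac{\lambda}{2n} \sum_w w^2
$$

The sum runs over all weights of the network and over none of the biases.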

This regularization term can be added to other cost functions as well.

Intuitively, the effect of regularization is to make the network prefer to learn small weights. Large weights are allowed only if they substantially improve the first term of the cost function. Regularization can therefore be seen as a compromise between finding small weights and minimizing the original cost function: when lambda is small, we prefer to minimize the original cost function; when lambda is large, we prefer small weights.


* The regularization term rescales the weights during the gradient descent updates computed with backpropagation. At each update, every weight is first multiplied by a factor slightly smaller than 1, namely (1 - learning rate × lambda / n), and then the usual gradient step is applied. The update rule for the biases is unchanged.


* The weights are not driven all the way to zero, because the other term of the derivative can make a weight increase if doing so reduces the non-regularized cost function.


* For **stochastic gradient descent**, the gradient of the non-regularized cost is estimated as the mean over each mini-batch of training samples used for backpropagation, while the weight decay factor is the same as in ordinary gradient descent. Remember that the batch size is a hyperparameter you set when training the network.


* Remember to increase the lambda penalty parameter when the training sample size increases: the weight decay factor depends on lambda / n, so a larger n with the same lambda means weaker regularization. Regularization has another benefit besides reducing overfitting: without it, training with the non-regularized cross-entropy function can get trapped in local minima of the cost function.
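
The points above can be summarized by the update rule $w \rightarrow (1 - \eta \lambda / n)\, w - \eta \, \partial C_0 / \partial w$, where $\eta$ is the learning rate and $C_0$ the non-regularized cost. Below is a minimal NumPy sketch of one such step. The function name `l2_sgd_step` and all the numbers are made up for illustration, and `grad_w` and `grad_b` stand for mini-batch mean gradients that backpropagation would normally provide.

```python
import numpy as np

def l2_sgd_step(weights, biases, grad_w, grad_b, lr, lam, n_train):
    """One SGD update with L2 regularization (weight decay).

    grad_w and grad_b are the mini-batch mean gradients of the
    NON-regularized cost with respect to the weights and biases.
    """
    decay = 1.0 - lr * lam / n_train          # factor slightly below 1
    new_weights = decay * weights - lr * grad_w
    new_biases = biases - lr * grad_b         # biases are not regularized
    return new_weights, new_biases

weights = np.array([2.0, -0.5])
biases = np.array([0.1])
grad_w = np.zeros(2)   # zero gradient isolates the effect of the decay factor
grad_b = np.zeros(1)

new_w, new_b = l2_sgd_step(weights, biases, grad_w, grad_b,
                           lr=0.5, lam=5.0, n_train=100)
print(new_w, new_b)    # weights shrink by the factor 0.975, biases are unchanged
```

With a non-zero `grad_w`, the second term can outweigh the decay and make a weight grow, which is why the weights are not simply driven to zero.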



### Comparison of L1 and L2 penalization

* Both L1 and L2 regularization aim to shrink large weights, but the way the weights shrink differs between the two:


* In L1, the weights decrease by a constant amount toward 0 (shrinkage toward 0).
* In L2, the weights decrease by an amount proportional to their current value.


When a weight is very large in absolute value, the L1 penalty shrinks it much less than the L2 penalty does. In contrast, when the absolute weight is small, the L1 penalty shrinks it much more than the L2 penalty. As a result, L1 tends to concentrate the weights of the network in a relatively small number of high-importance connections, while the other weights are driven toward 0.
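
A small numerical sketch makes the difference concrete. It applies only the penalty part of each update (the gradient of the non-regularized cost is left out) to one large and one small weight. The helper names and the values of the learning rate, lambda and n are arbitrary assumptions for this example.

```python
import numpy as np

def l1_shrink(w, lr, lam, n_train):
    """Penalty-only part of the L1 update: a constant step toward 0."""
    return w - (lr * lam / n_train) * np.sign(w)

def l2_shrink(w, lr, lam, n_train):
    """Penalty-only part of the L2 update: a step proportional to w."""
    return (1.0 - lr * lam / n_train) * w

weights = np.array([10.0, 0.1])              # one large and one small weight
step = dict(lr=0.5, lam=5.0, n_train=100)    # lr * lam / n_train = 0.025

l1_change = np.abs(weights - l1_shrink(weights, **step))
l2_change = np.abs(weights - l2_shrink(weights, **step))

print(l1_change)  # approximately [0.025, 0.025]: same step for both weights
print(l2_change)  # approximately [0.25, 0.0025]: step scales with the weight
```

For the large weight the L2 step is ten times the L1 step, while for the small weight the L1 step is ten times the L2 step.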



## Dropout


https://www.deeplearningbook.com.br/capitulo-23-como-funciona-o-dropout/

Although dropout is used for regularization, it does not add a penalty term on the weights to the cost (cross-entropy) function. Instead, **dropout modifies the network itself**.

How does it modify the network?

It temporarily **drops out** a random subset of neurons from the hidden layers, so each training iteration is run on a different, thinned version of the network. See the example in the images below:


![image-3.png](attachment:image-3.png) ![image-4.png](attachment:image-4.png)


For each mini-batch, the inputs are propagated forward and the error is backpropagated through a different random subset of the original network, and the weights and biases of the neurons that were kept are updated. Repeating this, the network learns a set of weights and biases under conditions in which part of the hidden neurons are dropped out. When half of the hidden neurons are dropped during training, running the full network afterwards means that twice as many hidden neurons are active, so to compensate we halve the weights outgoing from the hidden neurons.
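
The mechanism can be sketched in NumPy for a single hidden layer. This is an illustrative sketch, not a full training loop: the function names, the ReLU activation, the layer sizes and the keep probability of 0.5 are assumptions chosen for this example.

```python
import numpy as np

def hidden_layer_train(x, W, b, rng, keep_prob=0.5):
    """Training-time forward pass: randomly drop hidden neurons."""
    h = np.maximum(0.0, W @ x + b)           # ReLU hidden activations
    mask = rng.random(h.shape) < keep_prob   # True = neuron kept this iteration
    return h * mask, mask                    # dropped neurons output 0

def output_layer_test(h, W_out, b_out, keep_prob=0.5):
    """Test-time output: all hidden neurons active, outgoing weights scaled down."""
    return (keep_prob * W_out) @ h + b_out   # keep_prob = 0.5 halves the weights

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0])
W, b = np.ones((4, 2)), np.zeros(4)          # 2 inputs -> 4 hidden neurons
W_out, b_out = np.ones((1, 4)), np.zeros(1)  # 4 hidden neurons -> 1 output

h_dropped, mask = hidden_layer_train(x, W, b, rng)  # a different mask on every call
h_full = np.maximum(0.0, W @ x + b)                 # full network, no dropout
y_test = output_layer_test(h_full, W_out, b_out)
print(mask, y_test)
```

Many libraries implement the equivalent "inverted" variant, which divides the kept activations by `keep_prob` during training so that no rescaling is needed at test time.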

To understand how this contributes to regularization, think about averaging the outputs of several neural networks that classify the same input. If three out of five networks classify it as 1 instead of 0, then the class is probably 1, because that is the majority vote and the other two networks are probably mistaken. Different networks overfit in different ways, so averaging the classifications of networks trained with different sets of weights and biases tends to reduce overfitting and improve accuracy. Dropout behaves like averaging over the many different thinned networks it trains.

Is this similar to cross-validation?