# Regularization

In this part we look at two regularization methods: L1 (Lasso) and L2 (Ridge). The main goal of regularization is to avoid overfitting. For neural networks this is especially important because, as we saw earlier, finding an architecture that solves a given problem is difficult. In practice it is usually better to choose an architecture whose space of representable functions is large enough that it can plausibly solve the problem. But a network with too much capacity can easily overfit the data, and overfitting hurts generalization.

The approximation power of a network comes from its neurons. More neurons generally mean higher approximation power. I say "generally" because the structure of the connections among the neurons also matters.

Therefore, to prevent overfitting, we should switch off neurons and keep only the most important ones, those that are just enough to provide a good approximation.

There are different approaches to achieve this, but here we focus only on L1 and L2 regularization. In a later tutorial we will address a method that is much more specific to deep learning: dropout.

Both L1 and L2 regularization are functions of the weights of the neurons. Basically, they say that using a specific weight comes with a cost, so the network is better off not using it unless it really helps to reduce the value of the loss function. Here are the formulas:

\begin{equation}
L^{(reg)}_w(y_{predicted}, y_{target}) = L_w(y_{predicted}, y_{target}) + \beta \cdot \sum_{i,j}{|w_{ij}|},
\end{equation}

\begin{equation}
L^{(reg)}_w(y_{predicted}, y_{target}) = L_w(y_{predicted}, y_{target}) + \beta \cdot \frac{1}{2}\sum_{i,j}{w^2_{ij}}.
\end{equation}

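The first one is L1 regularization and the second one is L2 regularization. $L_w$ can be a loss like MSE, and $\beta$ controls how strongly the weights are penalized. With these formulas in front of us, it is easier to understand why neurons can be switched off. If a $w_{ij}$ becomes zero, the regularization term decreases, which can decrease the overall loss $L^{(reg)}_w$ as well. So, to minimize $L^{(reg)}_w$, it is better for the network to keep only the most important weights, and hence neurons: those whose contribution to reducing $L_w$ outweighs their cost. In practice L1 tends to push weights to exactly zero, while L2 mostly shrinks them toward zero.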
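To make the two formulas concrete, here is a minimal NumPy sketch that evaluates both penalty terms for a small hand-picked weight matrix. The function names (`mse`, `l1_penalty`, `l2_penalty`, `regularized_loss`), the example weights and the value of `beta` are assumptions chosen only for illustration; they are not taken from the network used later in this tutorial.

```python
import numpy as np


def mse(y_predicted, y_target):
    """Unregularized loss L_w: mean squared error."""
    return float(np.mean((y_predicted - y_target) ** 2))


def l1_penalty(weights, beta):
    """beta * sum_ij |w_ij|, summed over all weight matrices."""
    return float(beta * sum(np.abs(w).sum() for w in weights))


def l2_penalty(weights, beta):
    """beta * 1/2 * sum_ij w_ij^2, summed over all weight matrices."""
    return float(beta * 0.5 * sum((w ** 2).sum() for w in weights))


def regularized_loss(y_predicted, y_target, weights, beta, kind="l2"):
    """L^(reg)_w = L_w + penalty, with kind equal to "l1" or "l2"."""
    if kind == "l1":
        penalty = l1_penalty(weights, beta)
    else:
        penalty = l2_penalty(weights, beta)
    return mse(y_predicted, y_target) + penalty


# Hypothetical example values, not the tutorial's network.
weights = [np.array([[0.5, -1.0], [0.0, 2.0]])]
y_predicted = np.array([1.0, 2.0])
y_target = np.array([1.0, 4.0])
beta = 0.1

print("MSE:", mse(y_predicted, y_target))
print("L1-regularized loss:", regularized_loss(y_predicted, y_target, weights, beta, "l1"))
print("L2-regularized loss:", regularized_loss(y_predicted, y_target, weights, beta, "l2"))
```

Note that the weight that is already zero contributes nothing to either penalty, while the large weight (2.0) dominates the L2 term because it is squared.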

Now let us see this in a concrete example. We will examine the activations and the weights of the neurons in a network with and without regularization.