# The Deep Learning Book (Simplified)
## Part II - Modern Practical Deep Networks
*This is a series of blog posts on the [Deep Learning book](http://deeplearningbook.org)
in which we summarize each chapter, highlighting the concepts 
that we found most important, so that other people can use it as a starting point
for reading the chapters. We also include code for reproducing some of the results. 
Please refer to [this page](http://www.deeplearningbook.org/contents/notation.html) for more clarity on 
notation.*


## Chapter 7: Regularization for Deep Learning

Recall from Chapter 5 that **overfitting** occurs when the training error keeps decreasing while the test error (or the generalization error) starts increasing. **Regularization** is any modification we make to a learning algorithm that is intended to reduce its generalization error, but not its training error. There are various ways of doing this, such as restricting parameter values or adding extra terms to the objective function.

These constraints and penalties are designed to encode some sort of prior knowledge, often a preference for simpler models that generalize better (see [Occam's Razor](https://en.wikipedia.org/wiki/Occam%27s_razor)). The sections of this chapter are listed below: <br>

**1. Parameter Norm Penalties** <br>
**2. Norm Penalties as Constrained Optimization** <br>
**3. Regularization and Under-Constrained Problems** <br>
**4. Dataset Augmentation** <br>
**5. Noise Robustness** <br>
**6. Semi-Supervised Learning** <br>
**7. Multitask Learning** <br>
**8. Early Stopping** <br>
**9. Parameter Tying and Parameter Sharing** <br>
**10. Sparse Representations** <br>
**11. Bagging and Other Ensemble Methods** <br>
**12. Dropout** <br>
**13. Adversarial Training** <br>
**14. Tangent Distance, Tangent Prop and Manifold Tangent Classifier** <br>

### 1. Parameter Norm Penalties

The idea here is to limit the capacity of the model (its ability to fit a wide variety of functions) 
by adding a parameter norm penalty, $\Omega(\theta)$, to the objective function, $J$:

$$ \tilde{J}(\theta; X, y) =  J(\theta; X, y) + \alpha \Omega(\theta)$$

The hyperparameter $\alpha \geq 0$ weights the penalty relative to $J$; setting $\alpha = 0$ gives no regularization.

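The penalty is usually applied only to the weights and not to the biases, the reason being that the biases require much less data to fit accurately, so leaving them unregularized does not add much variance. In what follows, $w$ denotes the weights affected by the penalty, while $\theta$ denotes all the parameters.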
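As a minimal sketch of the penalized objective, the snippet below adds a norm penalty computed from the weight arrays only. The function name `penalized_objective`, the coefficient name `alpha`, and the toy arrays `W` and `b` are illustrative assumptions, not part of the book.

```python
import numpy as np

def penalized_objective(data_loss, weights, alpha, norm="l2"):
    """Return J + alpha * Omega(w), where Omega sees only the weight arrays."""
    if norm == "l2":
        omega = 0.5 * sum(float(np.sum(w ** 2)) for w in weights)
    elif norm == "l1":
        omega = sum(float(np.sum(np.abs(w))) for w in weights)
    else:
        raise ValueError(f"unknown norm: {norm}")
    return data_loss + alpha * omega

# Hypothetical one-layer model: W is penalized, b is deliberately left out.
W = np.array([[1.0, -2.0], [3.0, 0.0]])
b = np.array([10.0, -10.0])
total_l2 = penalized_objective(data_loss=1.0, weights=[W], alpha=0.1, norm="l2")
total_l1 = penalized_objective(data_loss=1.0, weights=[W], alpha=0.1, norm="l1")
```

Note that the large bias values in `b` have no influence on either total, since the bias is never passed to the penalty.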

**1.1 $L^2$ Parameter Regularization**

Here, the parameter norm penalty is:
$$\Omega(\theta) = \frac {||w||_2^2} {2}$$

This makes the objective function:

$$ \tilde{J} (\theta; X, y) = J(\theta; X, y) + \alpha \frac {w^T w} {2} $$
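The gradient of this objective with respect to $w$ is $\nabla_w J + \alpha w$, so a single gradient descent step first shrinks the weights by a constant factor and then applies the usual update (hence the name "weight decay"). The sketch below checks that the two ways of writing the step agree; the function names and the learning rate `lr` are assumptions made for illustration.

```python
import numpy as np

def step_on_penalized_objective(w, grad_J, lr, alpha):
    """One gradient step on J(w) + (alpha / 2) * w^T w."""
    return w - lr * (grad_J + alpha * w)

def step_with_weight_decay(w, grad_J, lr, alpha):
    """Same step, written as multiplicative shrinkage followed by the plain update."""
    return (1.0 - lr * alpha) * w - lr * grad_J

w = np.array([1.0, -2.0])
grad_J = np.array([0.5, 0.5])  # made-up gradient of the unregularized loss
w_a = step_on_penalized_objective(w, grad_J, lr=0.1, alpha=0.2)
w_b = step_with_weight_decay(w, grad_J, lr=0.1, alpha=0.2)
```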

Consider the second-order Taylor series approximation of the unregularized objective $J$ around the point $w^*$ where $J$ assumes its minimum value, i.e., $\nabla_w J(w^*) = 0$ (so the first-order term vanishes):

$$ \hat{J}(w) = J(w^*) + \frac{(w - w^*)^T H(J(w^*))(w - w^*)} {2} $$

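Here $H(J(w^*))$ is the Hessian of $J$ evaluated at $w^*$, written $H$ for short. The gradient of the approximation is $\nabla_w \hat{J}(w) = H(J(w^*))(w - w^*)$. Adding the gradient of the penalty, $\alpha w$, and setting the result to zero at the regularized minimum $\tilde{w}$ gives:

$$ H(J(w^*))(\tilde{w} - w^*) + \alpha \tilde{w} = 0$$
$$ (H + \alpha I) \tilde{w} = H w^* $$
$$ \tilde{w} = (H + \alpha I)^{-1} H w^* $$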
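The closed form for $\tilde{w}$ can be sanity-checked numerically. The sketch below uses a made-up positive definite `H` and minimizer `w_star`, then compares the closed form with plain gradient descent on the regularized quadratic; all names and values are assumptions chosen for illustration.

```python
import numpy as np

def regularized_minimum(H, w_star, alpha):
    """Closed form: solve (H + alpha * I) w_tilde = H w_star."""
    return np.linalg.solve(H + alpha * np.eye(len(w_star)), H @ w_star)

H = np.array([[3.0, 1.0], [1.0, 2.0]])  # symmetric positive definite toy Hessian
w_star = np.array([1.0, -1.0])          # minimizer of the unregularized quadratic
alpha = 0.5

w_tilde = regularized_minimum(H, w_star, alpha)

# Gradient descent on 0.5 * (w - w*)^T H (w - w*) + 0.5 * alpha * w^T w.
w_gd = np.zeros_like(w_star)
for _ in range(2000):
    grad = H @ (w_gd - w_star) + alpha * w_gd
    w_gd -= 0.1 * grad

# With a tiny alpha the regularized minimum should sit almost on top of w*.
w_tilde_small_alpha = regularized_minimum(H, w_star, 1e-8)
```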

As $\alpha$ approaches 0, $\tilde{w}$ approaches $w^*$. Since $H$ is real and symmetric, it can be decomposed into a diagonal matrix of eigenvalues, $\Lambda$, and an orthonormal basis of eigenvectors, $Q$ (stored as columns). That is, $H = Q \Lambda Q^T$.

![l2 reg](images/L2_reg.png)

Because of the marked term, the component of $w^*$ along each eigenvector of $H$ is rescaled. The component along the $i^{th}$ eigenvector is rescaled by a factor of $\frac {\lambda_i}{\lambda_i + \alpha}$, where $\lambda_i$ is the eigenvalue corresponding to the $i^{th}$ eigenvector.

| Condition | Effect |
| --- | --- |
| $\lambda_i \gg \alpha$ | Not much effect; the component is left nearly unchanged |
| $\lambda_i \ll \alpha$ | The component is shrunk to almost zero |
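The per-eigenvector rescaling can be reproduced with an eigendecomposition. In this sketch, `np.linalg.eigh` returns the eigenvectors as the columns of `Q`, so that `H` equals `Q @ diag(eigvals) @ Q.T`. The diagonal `H` with one large and one small eigenvalue is a made-up example chosen to show both rows of the table.

```python
import numpy as np

def l2_shrinkage(H, w_star, alpha):
    """Rescale the components of w_star along the eigenvectors of H."""
    eigvals, Q = np.linalg.eigh(H)        # eigenvalues in ascending order
    coords = Q.T @ w_star                 # coordinates of w_star in the eigenbasis
    scale = eigvals / (eigvals + alpha)   # factor lambda_i / (lambda_i + alpha)
    return Q @ (scale * coords), scale

H = np.diag([100.0, 0.01])   # one steep direction, one nearly flat direction
w_star = np.array([1.0, 1.0])
alpha = 1.0

w_tilde, scale = l2_shrinkage(H, w_star, alpha)
w_direct = np.linalg.solve(H + alpha * np.eye(2), H @ w_star)
```

The steep direction (eigenvalue 100) keeps almost all of its weight, while the flat direction (eigenvalue 0.01) is shrunk to roughly one percent of its original value.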

The diagram below illustrates this well.

![L2 scaling](images/L2_scaling.png)

To see how this applies to machine learning, consider linear regression. The objective function there is exactly quadratic, so the analysis above holds without approximation, and is given by:

![linear_reg](images/linear_reg.png)
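For linear regression, the $L^2$-regularized solution has a closed form (ridge regression). The sketch below assumes the objective $\frac{1}{2}||Xw - y||_2^2 + \frac{\alpha}{2} w^T w$, whose minimizer is $(X^T X + \alpha I)^{-1} X^T y$; the synthetic data, the seed, and the name `ridge_solution` are illustrative assumptions.

```python
import numpy as np

def ridge_solution(X, y, alpha):
    """Minimizer of 0.5 * ||Xw - y||^2 + 0.5 * alpha * ||w||^2."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Synthetic regression problem (made-up data for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=50)

w_ols = ridge_solution(X, y, alpha=0.0)     # ordinary least squares
w_ridge = ridge_solution(X, y, alpha=10.0)  # shrunk towards zero
```

Comparing `np.linalg.norm(w_ridge)` with `np.linalg.norm(w_ols)` shows the shrinkage: the ridge solution always has the smaller norm when $\alpha > 0$.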

**1.2 $L^1$ Parameter Regularization**

Here, the parameter norm penalty is:
$$\Omega(\theta) = ||w||_1 $$

This makes the gradient of the overall objective function:

$$ \nabla_w \tilde{J}(\theta; X, y) = \nabla_w J(\theta; X, y) + \alpha \, \text{sign}(w) $$
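A minimal sketch of one (sub)gradient step with this penalty is shown below. The function name and the learning rate `lr` are assumptions; `np.sign` returns 0 for a weight that is exactly zero, which is a valid subgradient of the absolute value at that point.

```python
import numpy as np

def l1_subgradient_step(w, grad_J, lr, alpha):
    """One step on J(w) + alpha * ||w||_1 using sign(w) as the penalty (sub)gradient."""
    return w - lr * (grad_J + alpha * np.sign(w))

w = np.array([2.0, -0.5, 0.0])
grad_J = np.zeros(3)  # isolate the effect of the penalty
w_next = l1_subgradient_step(w, grad_J, lr=0.1, alpha=1.0)
```

With the loss gradient set to zero, both nonzero weights move towards zero by the same amount, `lr * alpha`, regardless of their magnitude. Under the $L^2$ penalty the move would instead be proportional to each weight.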

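Now, the last term, $\alpha \, \text{sign}(w)$, creates a difficulty: the regularization contribution to the gradient no longer scales linearly with $w$; it is a constant with the same sign as each $w_i$. This leads to a few complexities in arriving at the optimal solution (which we are going to skip):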
![l1_reg](images/l1_reg.png)

Our current interpretation of the `max` term is that it prevents a zero crossing: the penalty can shrink a weight all the way to zero but cannot push it past zero to the opposite sign, since the absolute value function is not differentiable at zero.
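In the book, under the simplifying assumption of a diagonal Hessian with positive entries $H_{i,i}$, the $L^1$-regularized minimum is $w_i = \text{sign}(w_i^*) \max\left(|w_i^*| - \frac{\alpha}{H_{i,i}}, 0\right)$. The sketch below implements that formula next to the corresponding $L^2$ rescaling so the two can be compared; the function names and the toy values are assumptions made for illustration.

```python
import numpy as np

def l1_solution_diag(w_star, h_diag, alpha):
    """Soft thresholding: shift each weight towards zero, clipping at zero."""
    return np.sign(w_star) * np.maximum(np.abs(w_star) - alpha / h_diag, 0.0)

def l2_solution_diag(w_star, h_diag, alpha):
    """L2 counterpart for a diagonal Hessian: multiplicative shrinkage."""
    return h_diag / (h_diag + alpha) * w_star

w_star = np.array([3.0, -0.2, 0.5, -2.0])
h_diag = np.ones(4)   # assumed diagonal Hessian entries, all positive
alpha = 0.5

w_l1 = l1_solution_diag(w_star, h_diag, alpha)
w_l2 = l2_solution_diag(w_star, h_diag, alpha)
n_nonzero_l1 = np.count_nonzero(w_l1)
n_nonzero_l2 = np.count_nonzero(w_l2)
```

Weights whose magnitude is at most $\frac{\alpha}{H_{i,i}}$ become exactly zero under $L^1$, while $L^2$ only scales them down and leaves every entry nonzero.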

![lasso result](images/lasso_result.png)


Thus, $L^1$ regularization tends to produce sparse solutions (many weights exactly zero), which is its fundamental distinguishing feature from $L^2$. Hence, $L^1$ is used for feature selection, as in *LASSO*.