# The Deep Learning Book (Simplified)
## Part II - Modern Practical Deep Networks
*This is a series of blog posts on the [Deep Learning book](http://deeplearningbook.org)
in which we attempt to summarize each chapter, highlighting the concepts
that we found most important, so that other people can use it as a starting point
for reading the chapters. We also add further explanations on a few areas that we found difficult to grasp. Please refer to [this](http://www.deeplearningbook.org/contents/notation.html) for more clarity on 
notation.*


## Chapter 7: Regularization for Deep Learning

Recall from Chapter 5 that **overfitting** is said to occur when the training error keeps decreasing but the test error (or the generalization error) starts increasing. **Regularization** is any modification we make to a learning algorithm that is intended to reduce its generalization error, but not its training error. There are various ways of doing this, such as restricting the parameter values or adding extra terms to the objective function.

These constraints are designed to encode some sort of prior knowledge, with a preference towards simpler models to promote generalization (See [Occam's Razor](https://en.wikipedia.org/wiki/Occam%27s_razor)). The sections present in this chapter are listed below: <br>

**1. Parameter Norm Penalties** <br>
**2. Norm Penalties as Constrained Optimization** <br>
**3. Regularization and Under-Constrained Problems** <br>
**4. Dataset Augmentation** <br>
**5. Noise Robustness** <br>
**6. Semi-Supervised Learning** <br>
**7. Multitask Learning** <br>
**8. Early Stopping** <br>
**9. Parameter Tying and Parameter Sharing** <br>
**10. Sparse Representations** <br>
**11. Bagging and Other Ensemble Methods** <br>
**12. Dropout** <br>
**13. Adversarial Training** <br>
**14. Tangent Distance, Tangent Prop and Manifold Tangent Classifier** <br>

### 1. Parameter Norm Penalties

The idea here is to limit the capacity of the model (the set of functions it can represent)
by adding a parameter norm penalty, $\Omega(\theta)$, to the objective function, $J$:

$$ \tilde{J}(\theta; X, y) =  J(\theta; X, y) + \alpha \Omega(\theta)$$

Here, $\alpha \geq 0$ is a hyperparameter that weights the penalty relative to $J$. The penalty is typically applied only to the weights $w$ and not to the biases, the reason being that the biases require much less data to fit and leaving them unregularized does not add much variance.

**1.1 $L^2$ Parameter Regularization**

Here, the parameter norm penalty:
$$\Omega(\theta) = \frac {||w||_2^2} {2}$$

This makes the objective function:

$$ \tilde{J} (\theta; X, y) = J(\theta; X, y) + \alpha \frac {w^T w} {2} $$

Applying the 2nd order Taylor-Series approximation of $J$ at the point $w^*$ where the unregularized objective $J(\theta; X, y)$ assumes its minimum value, i.e., $\bigtriangledown_w J(w^*) = 0$ (so the first-order term vanishes):

$$ \hat{J}(w) = J(w^*) + \frac{(w - w^*)^T H(J(w^*))(w - w^*)} {2} $$

The gradient of this approximation is $\bigtriangledown_w \hat{J}(w) = H(J(w^*))(w - w^*)$. Writing $H$ for $H(J(w^*))$ and $\tilde{w}$ for the minimum of the regularized objective, setting the overall gradient to zero gives:

$$ \bigtriangledown_w \tilde{J}(\tilde{w}) = H(\tilde{w} - w^*) + \alpha \tilde{w} = 0$$
$$ \tilde{w} = (H + \alpha I)^{-1} H w^* $$

As $\alpha$ approaches 0, $\tilde{w}$ comes closer to $w^*$. Finally, since $H$ is real and symmetric, it can be decomposed into a diagonal matrix of eigenvalues $\Lambda$ and an orthonormal basis of eigenvectors, $Q$. That is, $H = Q \Lambda Q^T$.

![l2 reg](images/L2_reg.png)

Because of the marked term, the value of each weight is rescaled along the eigenvectors of $H$. The value of the weights along the $i^{th}$ eigenvector is rescaled by $\frac {\lambda_i}{\lambda_i + \alpha}$, where $\lambda_i$ represents the eigenvalue corresponding to the $i^{th}$ eigenvector.

| Condition| Effect of regularization|
| --- | --- |
|  $\lambda_i \gg \alpha$ | Not much |
|  $\lambda_i \ll \alpha$ | The weight is shrunk almost to zero |

The diagram below illustrates this well.

![L2 scaling](images/L2_scaling.png)
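
To make the rescaling concrete, here is a minimal sketch that works directly in the eigenbasis of $H$ (equivalently, it assumes $H$ is diagonal). The eigenvalues, the optimum `w_star` and the value of `alpha` are made-up numbers chosen only for illustration.

```python
def l2_shrink(eigenvalues, w_star_components, alpha):
    """Rescale each component of w* (expressed in the eigenbasis of H)
    by the factor lambda_i / (lambda_i + alpha)."""
    return [lam / (lam + alpha) * w
            for lam, w in zip(eigenvalues, w_star_components)]


# Hypothetical curvatures of J along three eigenvectors of H.
eigenvalues = [100.0, 1.0, 0.01]
# Unregularized optimum w*, written in the same eigenbasis.
w_star = [1.0, 1.0, 1.0]
alpha = 1.0

w_tilde = l2_shrink(eigenvalues, w_star, alpha)
for lam, w in zip(eigenvalues, w_tilde):
    print(f"lambda_i = {lam:>6}: component goes from 1.0 to {w:.4f}")
```

Directions with high curvature ($\lambda_i \gg \alpha$) are left almost untouched, while directions along which the objective is nearly flat ($\lambda_i \ll \alpha$) are shrunk almost to zero.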

To look at its application to Machine Learning, we have to look at linear regression. The objective function there is exactly quadratic, given by:

![linear_reg](images/linear_reg.png)

**1.2 $L^1$ parameter regularization**

Here, the parameter norm penalty:
$$\Omega(\theta) = ||w||_1 $$

Making the gradient of the overall objective function:

$$ \bigtriangledown_w \tilde{J}(\theta; X, y) = \bigtriangledown_w J(\theta; X, y) + \alpha \, \text{sign}(w) $$

Now, the last term, $\text{sign}(w)$, creates a difficulty: the regularization contribution to the gradient no longer scales linearly with $w$, but is a constant whose sign matches that of $w$. This leads to a few complexities in arriving at the optimal solution (which I am going to skip):

![l1_reg](images/l1_reg.png)

Our current interpretation of the `max` term is that there shouldn't be a zero crossing: the penalty can shrink a weight all the way to zero but never flip its sign, and the absolute value function is not differentiable at zero.

![lasso result](images/lasso_result.png)


Thus, $L^1$ regularization has the property of sparsity, which is its fundamental distinguishing feature from $L^2$. Hence, $L^1$ is used for feature selection as *LASSO*.
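
The sparsity effect can be seen in a small sketch. It assumes a diagonal Hessian with positive entries `h_diag` (the simplification under which the book derives a closed form), in which case each weight becomes $\text{sign}(w_i^*) \max(|w_i^*| - \frac{\alpha}{H_{ii}}, 0)$. The function names and the numbers are ours, chosen for illustration.

```python
def l1_solution(w_star, h_diag, alpha):
    """Soft-threshold each weight: move it towards zero by alpha / H_ii,
    and clamp it at zero if it would cross over."""
    out = []
    for w, h in zip(w_star, h_diag):
        magnitude = max(abs(w) - alpha / h, 0.0)
        sign = (w > 0) - (w < 0)
        out.append(sign * magnitude if magnitude > 0 else 0.0)
    return out


def l2_solution(w_star, h_diag, alpha):
    """L2 only rescales each weight; it never sets one exactly to zero."""
    return [h / (h + alpha) * w for w, h in zip(w_star, h_diag)]


w_star = [3.0, -0.5, 0.2]   # hypothetical unregularized optimum
h_diag = [1.0, 1.0, 1.0]    # hypothetical diagonal of the Hessian
alpha = 1.0

w_l1 = l1_solution(w_star, h_diag, alpha)
w_l2 = l2_solution(w_star, h_diag, alpha)
print("L1:", w_l1)
print("L2:", w_l2)
```

With these numbers, the two small weights are set exactly to zero under $L^1$, whereas $L^2$ merely halves every weight.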

### 2. Norm penalties as constrained optimization

From chapter 4's section 4, we know that to minimize any function under some constraints, we can construct a generalized Lagrangian function containing the objective function along with the penalties. Suppose we wanted $\Omega(\theta) < k$, then we could construct the following Lagrangian:

![lagrangian](images/lagrangian.png)

Thus, $\theta^* = \arg\min_{\theta} \max_{\alpha, \alpha \geq 0} \hspace{.2cm} \mathcal{L}(\theta, \alpha; X, y)$. If  $\Omega(\theta) > k$, then $\alpha$ should be large to reduce its value below $k$.
Likewise, if $\Omega(\theta) < k$, then $\alpha$ should be small. Assuming $\alpha$ to be a constant $\alpha^{*}$:

$$ \theta^* = \arg\min_{\theta} \hspace{.2cm} J(\theta; X, y) + \alpha^* \Omega(\theta)$$

This is now similar to the parameter norm penalty regularized objective function. Thus, parameter norm penalties naturally impose a constraint, like the L2-regularization defining a constrained L2-ball. Larger $\alpha$ means a smaller constrained region and vice versa.

Using explicit constraints instead of penalties is attractive for several reasons. Penalties might cause non-convex optimization procedures to get stuck in local minima corresponding to small values of $\theta$, leading to the formation of so-called `dead units`, as the weights entering and leaving them are too small to have an impact. Constraints don't force the weights to be near zero; they only confine them to a constrained region.

Another reason is that constraints induce higher stability. With higher learning rates, a large weight can lead to a large gradient, which in turn leads to a large weight update, and this feedback loop can continue until $\theta$ overflows numerically. Constraints along with reprojection (to the corresponding ball) prevent the weights from becoming too large, thus maintaining stability.

A final suggestion made by Hinton was to restrict the individual column norms of the weight matrix rather than the Frobenius norm of the entire weight matrix, so as to prevent any hidden unit from having a large weight.
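
A rough sketch of such a column-norm constraint with reprojection is given below. It assumes the convention that each column of `W` holds the incoming weights of one hidden unit, and the function name `clip_column_norms` and the bound `max_norm` are our own choices. In practice this step would be applied after every gradient update.

```python
import math


def clip_column_norms(W, max_norm):
    """Rescale any column of W (given as a list of rows) whose L2 norm
    exceeds max_norm, leaving all other columns unchanged."""
    n_rows, n_cols = len(W), len(W[0])
    clipped = [row[:] for row in W]
    for j in range(n_cols):
        norm = math.sqrt(sum(W[i][j] ** 2 for i in range(n_rows)))
        if norm > max_norm:
            scale = max_norm / norm
            for i in range(n_rows):
                clipped[i][j] = W[i][j] * scale
    return clipped


W = [[3.0, 0.1],
     [4.0, 0.2]]
W_proj = clip_column_norms(W, max_norm=1.0)
print(W_proj)
```

The first column (norm 5) is projected back onto the unit ball, while the second column, which is already inside the ball, is not pushed towards zero. This is the difference from a penalty.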

### 3. Regularization and Under-Constrained Problems

Underdetermined problems are those problems that have infinitely many solutions. For example, a logistic regression problem having linearly separable classes with $w$ as a solution will always have $2w$ as a solution, and so on. In some machine learning problems, regularization is necessary just to make the problem well defined. For example, many algorithms (e.g. linear regression and PCA) require the inversion of $X^TX$, which might be singular. In such a case, we can use a regularized form instead: $(X^TX + \alpha I)$ is guaranteed to be invertible for any $\alpha > 0$.

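Regularization can solve underdetermined problems. For example, the Moore-Penrose pseudoinverse defined in a previous chapter is given as:

$$ X^{+} = \lim_{\alpha \searrow 0} (X^TX + \alpha I)^{-1} X^T $$

This can be seen as performing a linear regression with L2-regularization, in the limit where the regularization coefficient $\alpha$ shrinks to zero.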
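
As a small illustration of why the regularized form helps, the sketch below builds $X^TX + \alpha I$ for a toy data matrix with two perfectly collinear features and checks its determinant. The data and the helper names are hypothetical, and the code is limited to two features to keep it short.

```python
def gram_plus_ridge(X, alpha):
    """Return X^T X + alpha * I for a data matrix X with two columns."""
    a = sum(row[0] * row[0] for row in X) + alpha
    b = sum(row[0] * row[1] for row in X)
    d = sum(row[1] * row[1] for row in X) + alpha
    return [[a, b], [b, d]]


def det2(M):
    """Determinant of a 2 x 2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]


# The second feature is exactly twice the first, so X^T X is singular.
X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]

det_plain = det2(gram_plus_ridge(X, alpha=0.0))
det_ridge = det2(gram_plus_ridge(X, alpha=0.1))
print("det(X^T X)           =", det_plain)
print("det(X^T X + 0.1 * I) =", det_ridge)
```

The unregularized matrix has a zero determinant and cannot be inverted, while adding even a small $\alpha$ makes it invertible.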

### 4. Data augmentation

Having more data is the most reliable way to improve a machine learning model's performance. In many cases, it is relatively easy to artificially generate data. For a classification task, we want the model to be invariant to certain types of transformations, and we can generate new $(x, y)$ pairs by transforming the input $x$ while keeping the label $y$. But for certain problems, like density estimation, we can't apply this directly unless we have already solved the density estimation problem. 

However, care needs to be taken during data augmentation to make sure that the class doesn't change. For example, if the labels contain both "b" and "d", then horizontal flipping would be a bad idea for data augmentation. Adding random noise to the inputs is another form of data augmentation, while adding noise to hidden units can be seen as doing data augmentation at multiple levels of abstraction.

Finally, when comparing machine learning models, we need to evaluate them using the same hand-designed data augmentation schemes or else it might happen that algorithm A outperforms algorithm B, just because it was trained on a dataset which had more / better data augmentation.

### 5. Noise Robustness

For some models, adding noise with infinitesimal variance to the input is equivalent to imposing a penalty on the norm of the weights. Noise added to hidden units is very important and is discussed later in **12. Dropout**. Noise can even be added to the weights. This has several interpretations. One of them is that adding noise to weights is a stochastic implementation of Bayesian inference over the weights, where the weights are considered to be uncertain, with the uncertainty being modelled by a probability distribution. It is also interpreted as a more traditional form of regularization by ensuring stability in learning. 

For example, in the regression case, we want to learn the mapping $\hat{y}(x)$ for each feature vector $x$, by reducing the mean squared error:

$$ J = E_{p(x, y)} [(\hat{y} (x) - y) ^ 2] $$

Now, suppose a random noise $\epsilon_w \sim \mathcal{N}(\epsilon; 0, \eta I)$ is added to the weights. We get the output $\hat{y}_{\epsilon_w}(x)$ and still want to learn the mapping by reducing the mean squared error. For small $\eta$, minimizing the loss after adding noise to the weights is equivalent to adding another regularization term, $\eta E_{p(x, y)}[||\bigtriangledown_w \hat{y}(x)||^2]$, which makes sure that small perturbations in the weight values don't affect the predictions much, thus stabilising training.

**5.1 Injecting noise at output targets**

Sometimes the output labels in the dataset are wrong, in which case maximizing $p(y \hspace{.1cm} | \hspace{.1cm} x)$ may not be a good idea. In such a case, we can explicitly model the noise on the labels by assuming that the label is correct with probability $1 - \epsilon$ and incorrect with probability $\epsilon$. In the latter case, all the other labels are equally likely. **Label Smoothing** regularizes a model with $k$ softmax outputs by replacing the hard classification targets of $1$ and $0$ with $1 - \epsilon$ and $\frac {\epsilon} {k-1}$ respectively.
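
The smoothed targets are easy to construct. A minimal sketch follows, where the function name `smooth_labels` and the example values of `k` and `eps` are ours:

```python
def smooth_labels(true_class, k, eps):
    """Return a length-k target vector with 1 - eps on the labelled class
    and eps / (k - 1) spread evenly over the remaining classes."""
    off_value = eps / (k - 1)
    return [1.0 - eps if c == true_class else off_value for c in range(k)]


targets = smooth_labels(true_class=2, k=4, eps=0.1)
print(targets)
print("sum of targets:", sum(targets))
```

The targets still sum to 1, so they remain a valid distribution for a softmax output, but the model is no longer pushed to produce ever larger logits in pursuit of an exact 0 or 1.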

### 6. Semi-Supervised Learning

`P(x, y)` denotes the joint distribution of *x* and *y*, i.e., each training sample *x* comes with a label *y*. `P(x)` denotes just the distribution of *x*, i.e., just the training examples without any labels. In **Semi-supervised Learning**, we use both labelled examples from `P(x, y)` and unlabelled examples from `P(x)` to estimate `P(y | x)`. We want to learn some representation `h = f(x)` such that samples from the same class have similar representations and a linear classifier in the new space achieves better generalization error.

Instead of separating the supervised and unsupervised criteria, we can instead have a generative model of `P(x)` or `P(x, y)` which shares parameters with the discriminative model, where the shared parameters encode the prior belief that `P(x)` (or `P(x, y)`) is connected to `P(y | x)`.

### 7. Multitask Learning

The idea is to improve the generalization error by pooling together examples from multiple tasks. Similar to how more data leads to more generalizability, using a part of the model for different tasks constrains that part to learn good values. There are two types of model parts:

- Task specific parameters: These parameters benefit only from that particular task.
- Generic parameters, shared across all tasks: These are the ones which benefit from learning through various tasks.

![shared](images/shared.png)

Multitask learning leads to better generalization when there is actually some relationship between the tasks. This is often the case in Deep Learning, where some of the factors that explain the variation observed in the data are shared across different tasks. 

### 8. Early Stopping

After a certain point during training, for a model with extremely high representational capacity, the training error continues to decrease but the validation error begins to increase. In such a scenario, a better idea would be to return to the point where the validation error was the lowest. Thus, we evaluate the validation error periodically (e.g. after each epoch), and whenever it improves, we store a copy of the parameters. Upon termination of training, we return the last *saved* parameters rather than the latest ones. <br>

![early stopping](images/early_stopping.png)

The idea of **Early Stopping** is that if the validation error doesn't improve over a certain fixed number of iterations, we terminate the algorithm. This effectively reduces the capacity of the model by limiting the number of steps it can take to fit the training data. The evaluation on the validation set can be done either on another GPU in parallel or after each epoch. A drawback of weight decay is that the weight decay coefficient has to be tuned manually and, if chosen wrongly, can lead the model to a bad local minimum by squashing the weight values too much. In Early Stopping, no such hyperparameter that affects the training dynamics needs to be tuned.
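
The procedure can be sketched as a generic loop. Everything here is a simplified assumption: `train_step` and `validation_error` are placeholder callables supplied by the user, `patience` is our name for the fixed number of evaluations without improvement, and the toy "model" at the bottom is a single number whose validation optimum (0.5) is reached before its training optimum.

```python
import copy


def train_with_early_stopping(train_step, validation_error, params,
                              patience, max_steps):
    """Run train_step repeatedly, remember the parameters with the lowest
    validation error, and stop after `patience` steps without improvement."""
    best_params = copy.deepcopy(params)
    best_error = validation_error(params)
    best_step = 0
    steps_without_improvement = 0
    for step in range(1, max_steps + 1):
        params = train_step(params)
        error = validation_error(params)
        if error < best_error:
            best_error, best_step = error, step
            best_params = copy.deepcopy(params)   # save this setting
            steps_without_improvement = 0
        else:
            steps_without_improvement += 1
            if steps_without_improvement >= patience:
                break
    return best_params, best_step, best_error


# Toy example: each step moves the single weight by 0.1, and the
# validation error is lowest at w = 0.5.
best_w, best_step, best_err = train_with_early_stopping(
    train_step=lambda w: w + 0.1,
    validation_error=lambda w: (w - 0.5) ** 2,
    params=0.0,
    patience=3,
    max_steps=100,
)
print(best_w, best_step, best_err)
```

Note that the function returns the saved parameters from the best step, not the parameters at the moment training stopped.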

However, since we are setting aside some part of the training data for validation, we are not using the complete training set. So, once Early Stopping is done, a second phase of training can be done where the complete training set is used. There are two choices here:

- Train from scratch for the same number of steps as in the Early Stopping case.
- Use the weights learned from the first phase of training and retrain using the complete data.

Besides lowering the number of training steps, Early Stopping also keeps the computational cost low because it regularizes the model without adding any penalty terms to the objective. It affects the optimization procedure by restricting it to a small volume of the parameter space, in the neighbourhood of the initial parameters ($\theta_0$). Suppose $\tau$ and $\epsilon$ represent the number of iterations and the learning rate respectively. Then, $\epsilon\tau$ effectively represents the capacity of the model. Intuitively, this can be seen as the inverse of the weight decay coefficient $\alpha$. When $\epsilon\tau$ is small (or $\alpha$ is large), the accessible region of parameter space is small and vice versa. We show that this equivalence holds for a linear model with a quadratic cost function (initial parameters $w^{(0)} = 0$). Taking the Taylor Series Approximation of $J(w)$ around the empirically optimal weights $w^*$:

![taylor 1](images/taylor1.png)
![taylor 2](images/taylor2.png)
![weight update](images/weight_update.png)

$$ w^{(\tau)} - w^* = (I - \epsilon Q \Lambda Q^T)(w^{(\tau - 1)} - w^*) $$

Multiplying by $Q^T$ on both sides and using the fact that $Q^TQ = I$:

$$ Q^T(w^{(\tau)} - w^*) = (Q^T - \epsilon \Lambda Q^T)(w^{(\tau - 1)} - w^*) $$ 

$$ Q^T(w^{(\tau)} - w^*) = (I - \epsilon \Lambda)Q^T(w^{(\tau - 1)} - w^*) $$

Assuming $w^{(0)} = 0$ and $\epsilon$ to be small enough that $|1 - \epsilon \lambda_i| < 1$, unrolling this recursion gives the parameter trajectory after $\tau$ steps of training:

$$ Q^Tw^{(\tau)} = (I - (I - \epsilon \Lambda)^{\tau})Q^Tw^* $$

The equation for $L^2$ regularization, with $\tilde{w}$ denoting the regularized solution from Section 1.1, is given by:

$$ Q^T\tilde{w} = (\Lambda + \alpha I)^{-1} \Lambda Q^Tw^* $$

$$ (\Lambda + \alpha I)^{-1} (\Lambda + \alpha I) = I $$

$$ \Rightarrow  (\Lambda + \alpha I)^{-1} \Lambda = I - (\Lambda + \alpha I)^{-1} \alpha$$

$$ \Rightarrow  Q^T\tilde{w} = (I - (\Lambda + \alpha I)^{-1} \alpha) Q^Tw^* $$

Thus, if the hyperparameters $\epsilon$, $\alpha$ & $\tau$ are chosen such that:

$$ (\Lambda + \alpha I)^{-1} \alpha = (I - \epsilon \Lambda)^{\tau} $$
 
L2-regularization can be seen as equivalent to Early Stopping. On further simplification (taking logarithms and assuming all the eigenvalues $\lambda_i$ are small, i.e. $\epsilon \lambda_i \ll 1$ and $\lambda_i / \alpha \ll 1$), we get $\epsilon \tau \approx \frac {1} {\alpha}$.
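
A quick numerical check of this approximate equivalence is sketched below. Along an eigenvector with eigenvalue `lam`, early stopping scales the optimum by $1 - (1 - \epsilon \lambda_i)^{\tau}$, while $L^2$ regularization scales it by $\frac{\lambda_i}{\lambda_i + \alpha}$. We take the weight decay coefficient to be $\alpha = \frac{1}{\epsilon \tau}$; the learning rate, step count and eigenvalues are arbitrary illustrative values.

```python
def early_stopping_factor(lam, eps, tau):
    """Fraction of w* reached along an eigenvector after tau gradient steps."""
    return 1.0 - (1.0 - eps * lam) ** tau


def l2_factor(lam, alpha):
    """Fraction of w* kept along an eigenvector under L2 regularization."""
    return lam / (lam + alpha)


eps, tau = 0.01, 100
alpha = 1.0 / (eps * tau)   # the matching weight decay coefficient

for lam in [0.01, 0.1, 1.0]:
    es = early_stopping_factor(lam, eps, tau)
    l2 = l2_factor(lam, alpha)
    print(f"lambda_i = {lam:>4}: early stopping {es:.4f}, L2 {l2:.4f}")
```

The two factors agree closely for small eigenvalues and drift apart as $\lambda_i$ grows, which matches the small-eigenvalue assumption used in the simplification.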

### 9. Parameter Tying and Parameter Sharing

Till now, most of the methods focused on pulling the weights towards a fixed point, e.g. 0 in the case of norm penalty. However, there might be situations where we have some prior knowledge about the kind of dependencies that the model should encode. Suppose two models A and B perform a classification task on similar input and output distributions. In such a case, we'd expect their parameters ($W_a$ and $W_b$) to be similar to each other as well. We could impose a norm penalty on the distance between the weights, but a more popular method is to **force** the sets of parameters to be equal. This is the essence behind **Parameter Sharing**. A major benefit here is that we need to store only a subset of the parameters (e.g. storing only $W_a$ instead of both $W_a$ and $W_b$) which leads to large memory savings. In the example of Convolutional Neural Networks or CNNs (discussed in Chapter 9), the same feature detector, with the same weights, is applied across different regions of the image and hence, a cat is detected irrespective of whether it is at position `i` or `i+1`.
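
The CNN case can be illustrated with a one-dimensional toy sketch. The kernel, the pattern and the signals are made up, and `correlate1d` is our own helper rather than a library function. The point is that one set of three weights is reused at every position, so the same pattern gives the same peak response wherever it appears.

```python
def correlate1d(signal, kernel):
    """Slide one shared kernel over the signal ("valid" positions only)."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]


kernel = [1.0, -1.0, 1.0]        # hypothetical feature detector, shared everywhere
pattern = [1.0, -1.0, 1.0]       # the feature we want to detect

signal_a = pattern + [0.0] * 4   # pattern located at position 0
signal_b = [0.0] * 4 + pattern   # same pattern located at position 4

resp_a = correlate1d(signal_a, kernel)
resp_b = correlate1d(signal_b, kernel)
print(resp_a)
print(resp_b)
```

Only three parameters are stored, regardless of the length of the signal. A fully connected layer producing the same five outputs would need a separate weight for every input-output pair.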

### 10. Sparse Representations

We can also place penalties directly on the activation values of the units, which indirectly imposes a complicated penalty on the parameters. This leads to representational sparsity, where many of the activation values of the units are zero. 

![rep sparsity](images/rep_sparsity.png)

Another idea could be to average the activation values across various examples and push this average towards some target value. An example of getting representational sparsity by imposing a hard constraint on the activation values is the **Orthogonal Matching Pursuit (OMP)** algorithm, where a representation `h` is learned for the input `x` by solving the constrained optimization problem:

$$ \arg\min_{h, ||h||_0 < k} ||x - Wh||^2 $$

where $||h||_0$ indicates the number of non-zero entries of $h$. The problem can be solved efficiently when $W$ is restricted to be orthogonal.
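
To see why an orthogonal $W$ makes the problem easy, note that for a square matrix with orthonormal columns, $||x - Wh||^2 = ||W^Tx - h||^2$, so the best sparse code simply keeps the largest-magnitude entries of $W^Tx$. The sketch below assumes exactly this setting and, for simplicity, allows at most `max_nonzeros` non-zero entries; the function name and the example matrix (a 45 degree rotation) are ours.

```python
import math


def sparse_code_orthogonal(W, x, max_nonzeros):
    """Best code h with at most max_nonzeros non-zero entries, assuming W
    is square with orthonormal columns (W given as a list of rows)."""
    n = len(W[0])
    # Coefficients of x in the basis formed by the columns of W, i.e. W^T x.
    coeffs = [sum(W[i][j] * x[i] for i in range(len(W))) for j in range(n)]
    # Keep the largest-magnitude coefficients, zero out the rest.
    keep = sorted(range(n), key=lambda j: abs(coeffs[j]), reverse=True)[:max_nonzeros]
    return [coeffs[j] if j in keep else 0.0 for j in range(n)]


c = 1.0 / math.sqrt(2.0)
W = [[c, -c],
     [c,  c]]          # rotation by 45 degrees: columns are orthonormal
x = [3.0, 1.0]

h = sparse_code_orthogonal(W, x, max_nonzeros=1)
print(h)
```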


### 11. Bagging and Other Ensemble Methods

Ensemble methods are techniques which train multiple models and combine their predictions, e.g. by voting or averaging, to make the final prediction. The idea is that it's highly unlikely that multiple models would make the same errors on the test set. 

Suppose that we have `K` regression models, with the $i^{th}$ model making an error $\epsilon_i$ on each example, where the errors are drawn from a zero mean, multivariate normal distribution such that $\mathbb{E}(\epsilon_i^2) = v$ and $\mathbb{E} (\epsilon_i \epsilon_j) = c$ for $i \neq j$. The error made by the averaged prediction of the ensemble is then the average across all the models: $\frac {\sum_i \epsilon_i} {K}$.

The mean of this average error is 0 (as the mean of each of the individual $\epsilon_i$ is 0). The expected squared error (i.e. the variance) of the average is given by:


$$ \mathbb{E} \Big( \frac {\sum_i \epsilon_i} {K} \Big)^2 = \frac { \mathbb{E} (\sum_i \epsilon_i^2 + \sum_i \sum_{j \neq i} \epsilon_i \epsilon_j)} {K^2}$$

$$ \Rightarrow \mathbb{E} \Big( \frac {\sum_i \epsilon_i} {K} \Big)^2 =  \frac { \mathbb{E} \sum_i \epsilon_i^2} {K^2} +  \frac {\sum_i \sum_{j \neq i} \mathbb{E}(\epsilon_i \epsilon_j)} {K^2}$$

$$ \Rightarrow \mathbb{E} \Big( \frac {\sum_i \epsilon_i} {K} \Big)^2 =  \frac {K * v} {K^2} +  \frac {K * (K-1) c} {K^2}$$

$$ \Rightarrow \mathbb{E} \Big( \frac {\sum_i \epsilon_i} {K} \Big)^2 =  \frac {v} {K} +  \frac {(K-1) c} {K}$$

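Thus, if `c = v` (the errors are perfectly correlated), the expected squared error stays at `v` and averaging doesn't help. If `c = 0` (the errors are uncorrelated), it shrinks to `v / K`, falling in inverse proportion to `K`. There are various ensembling techniques. In the case of Bagging (Bootstrap Aggregating), the same training algorithm is used multiple times. K different datasets, each of the same size as the original, are constructed by sampling from the original dataset with replacement (see figure below for clarity) and a model is trained on each of those K datasets. Because of sampling with replacement, each dataset is missing some of the original examples and contains duplicates of others, so the K datasets have a few similarities as well as a few differences. These differences cause the difference in the predictions of the K models. Model averaging is a very powerful and reliable way of reducing generalization error.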

![bagging](images/bagging.png)
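
Both the resampling step and the variance formula derived above can be sketched in a few lines. We assume, as in standard bagging, that each resampled dataset has the same size as the original; the function names, the seed and the toy data are ours.

```python
import random


def bootstrap_datasets(data, k, seed=0):
    """Build k datasets by sampling len(data) items with replacement."""
    rng = random.Random(seed)
    return [rng.choices(data, k=len(data)) for _ in range(k)]


def ensemble_error_variance(v, c, k):
    """Expected squared error of the averaged prediction: v/K + (K-1)c/K."""
    return v / k + (k - 1) * c / k


data = list(range(10))              # stand-in for ten training examples
datasets = bootstrap_datasets(data, k=3)
for d in datasets:
    print(sorted(d))                # duplicates and missing items vary per dataset

print(ensemble_error_variance(v=1.0, c=1.0, k=10))   # perfectly correlated errors
print(ensemble_error_variance(v=1.0, c=0.0, k=10))   # uncorrelated errors
```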

### 12. Dropout

**Dropout** is a computationally inexpensive yet powerful regularization technique. The problem with bagging is that we can't afford to train and store a large number of big models for prediction later. Dropout makes bagging practical by providing an inexpensive approximation to training and evaluating a bagged ensemble of exponentially many networks. In a simplistic view, dropout trains the ensemble of all sub-networks that can be formed by removing non-output units from the base network, which is done by multiplying their outputs by $0$. For every training sample, a mask is sampled for all the input and hidden units independently. For clarification, suppose we have $h$ hidden units in some layer. Then, a mask for that layer refers to an $h$ dimensional vector with values either $0$ (remove the unit) or $1$ (keep the unit).

There are a few differences from bagging though:

- In bagging, the models are independent of each other, whereas in dropout, the different models share parameters, with each model inheriting a different subset of the parameters of the parent network.

- In bagging, each model is trained till convergence, but in dropout, most sub-networks are never explicitly trained at all; a tiny fraction of them are each trained for a single step, and the parameter sharing makes sure that the remaining sub-networks still arrive at good parameter settings.

At test time, we combine the predictions of all the models. In the case of bagging with K models, this was given by the arithmetic mean, $\frac {\sum_i p^{(i)} (y \hspace{.1cm} | \hspace{.1cm} x)} {K}$. In case of dropout, the probability that a model is chosen is given by $p(\mu)$, with $\mu$ denoting the mask vector. The prediction then becomes $\sum_{\mu} p(\mu) p (y \hspace{.1cm} | \hspace{.1cm} x, \mu)$. Since the sum has exponentially many terms, this is not computationally feasible. A better approach is to use the geometric mean instead of the arithmetic mean, which can be approximated in one go.

We need to take care of two main things when working with geometric mean:
- None of the probabilities should be zero.
- Re-normalization to make sure all the probabilities sum to 1.

![ensemble](images/ensemble.png)
![renormalize](images/renormalize.png)
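
The two points above can be seen in a small sketch that combines a handful of predicted distributions by a geometric mean and then renormalizes. The two example distributions are made up, and a uniform weighting over models is assumed.

```python
import math


def geometric_mean_ensemble(distributions):
    """Combine per-model class distributions by a geometric mean, then
    renormalize so that the result sums to 1."""
    n_models = len(distributions)
    n_classes = len(distributions[0])
    # Working with logs: math.log raises ValueError on a zero probability,
    # which is why no model may assign probability 0 to any class.
    unnormalized = [
        math.exp(sum(math.log(d[c]) for d in distributions) / n_models)
        for c in range(n_classes)
    ]
    z = sum(unnormalized)            # the geometric means do not sum to 1
    return [u / z for u in unnormalized]


predictions = [
    [0.7, 0.2, 0.1],   # hypothetical output of sub-model 1
    [0.5, 0.3, 0.2],   # hypothetical output of sub-model 2
]
p_ensemble = geometric_mean_ensemble(predictions)
print(p_ensemble, sum(p_ensemble))
```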

The advantage for dropout is that $\tilde{p}_{ensemble} (y^{'} | x)$ can be approximated in one pass of the complete model by multiplying the weights going out of each unit by the probability of keeping that unit (**weight scaling inference rule**). Equivalently, the unit activations can be divided by the keep probability during training. The motivation behind this is to capture the right expected values from the output of each unit, i.e. the total expected input to a unit at test time is equal to the total expected input to that unit at train time. A big advantage of dropout then, is that it doesn't place any restriction on the *type* of model or training procedure to use.
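
For a single linear unit, the expected-value argument can be checked exactly by enumerating every mask. The sketch below assumes independent masks with the same keep probability for every input and uses the formulation in which the weights are multiplied by the keep probability at test time; the weights and inputs are arbitrary. For deep non-linear networks the rule is only an approximation.

```python
from itertools import product


def expected_output_over_masks(w, x, keep_prob):
    """Exact expectation of sum_i w_i * mu_i * x_i over all 2^n binary masks."""
    total = 0.0
    for mask in product([0, 1], repeat=len(x)):
        prob = 1.0
        for m in mask:
            prob *= keep_prob if m == 1 else 1.0 - keep_prob
        total += prob * sum(wi * mi * xi for wi, mi, xi in zip(w, mask, x))
    return total


def weight_scaled_output(w, x, keep_prob):
    """One pass through the full unit with weights scaled by keep_prob."""
    return sum(keep_prob * wi * xi for wi, xi in zip(w, x))


w = [0.5, -1.0, 2.0]
x = [1.0, 2.0, 3.0]
keep_prob = 0.8

ensemble_mean = expected_output_over_masks(w, x, keep_prob)
one_pass = weight_scaled_output(w, x, keep_prob)
print(ensemble_mean, one_pass)
```

Enumerating the masks costs $2^n$ evaluations, whereas the scaled pass costs one, which is the practical appeal of the rule.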

**Points to note**:
- Reduces the representational capacity of the model and hence, the model should be large enough to begin with.
- Works better with more data.
- Equivalent to L2 for linear regression, with different weight decay coefficient for each input feature.

However, stochasticity is neither necessary for regularization (see Fast Dropout) nor sufficient (see Dropout Boosting). 

**Biological Interpretation**: Sexual reproduction involves swapping genes between two different organisms, which creates evolutionary pressure for genes to become not just good, but readily swappable between organisms. Such genes are robust to changes in their environment because they cannot adapt to the unusual features of any one organism. Similarly, the units in dropout learn to perform well regardless of the presence of other hidden units, and also in many different contexts.

Adding noise in the hidden layer is more effective than adding noise in the input layer. For example, suppose some unit $h_i$ learns to detect a nose in a face recognition task. If this $h_i$ is removed, then either some other unit learns to redundantly detect a nose, or the model learns to associate some other feature (like a mouth) with recognising a face. Either way, the model learns to make more use of the information in the input. On the other hand, adding noise to the input won't completely remove the nose information, unless the noise is so large as to remove most of the information from the input.


### 13. Adversarial Training

Deep Learning has reached human-level performance on some image recognition benchmarks (Reference: ImageNet), which might lead us to believe that these models have acquired a human-level understanding of an image. However, deliberately searching for an input $x^{'}$ near a given $x$, such that the prediction made by the model changes, shows otherwise. As shown in the image below, although the newly formed image (adversarial image) looks almost exactly the same to a human, the model classifies it wrongly and with very high confidence. 

![adversarial](images/adversarial.png)

**Adversarial training** refers to training on images which are adversarially generated, and it has been shown to reduce the error rate. The main factor attributed to the above mentioned behaviour is the excessive linearity of the model, caused by the main building blocks being primarily linear. If each input of a linear function with weights $w$ is changed by $\epsilon$, the output can change by as much as $\epsilon ||w||_1$, which can be very large when $w$ is high dimensional. The idea of adversarial training is to avoid this jumping and induce the model to be locally constant in the neighborhood of the training data.
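
The effect of linearity can be reproduced with a toy linear model. In the sketch below, every input is nudged by `eps` in the direction of the sign of the corresponding weight (the idea behind the fast gradient sign construction, used here purely as an illustration), and the output moves by exactly `eps` times the $L^1$ norm of the weights. The dimensionality and the weight values are made up.

```python
def sign(v):
    """Return -1, 0 or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def linear_output(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def adversarial_input(w, x, eps):
    """Move every input by eps in the direction that increases w . x."""
    return [xi + eps * sign(wi) for wi, xi in zip(w, x)]


n = 1000                                          # input dimensionality
w = [0.5 if i % 2 == 0 else -0.5 for i in range(n)]
x = [0.0] * n
eps = 0.01                                        # tiny change per input

x_adv = adversarial_input(w, x, eps)
change = linear_output(w, x_adv) - linear_output(w, x)
l1_norm = sum(abs(wi) for wi in w)
print("largest change in any single input:", eps)
print("change in the output:", change)
print("eps * ||w||_1:", eps * l1_norm)
```

No single input moves by more than 0.01, yet the output shifts by about 5, because the small changes add up across a thousand dimensions.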

This can also be used in semi-supervised learning. For an unlabelled sample $x$, we can assign the label $\hat{y}(x)$ using our model. Then, we find an adversarial example, $x^{'}$, such that $\hat{y}(x^{'}) \neq \hat{y}(x)$ (an adversarial example found this way, using the model's own label instead of the true label, is called a virtual adversarial example). The objective then is to assign the same class to both $x$ and $x^{'}$. The idea behind this is that different classes are assumed to lie on disconnected manifolds and a little push from one manifold shouldn't land in any other manifold.

### 14. Tangent Distance, Tangent Prop and Manifold Tangent Classifier

Many ML models assume the data to lie on a low dimensional manifold to overcome the curse of dimensionality. The inherent assumption which follows is that small perturbations that cause the data to move along the manifold (it originally belonged to) shouldn't lead to different class predictions. The idea of the **tangent distance** algorithm is to find the K-nearest neighbors using the distance between manifolds as the distance metric. A manifold $M_i$ is approximated by the tangent plane at $x_i$; hence, this technique needs tangent vectors to be specified.

![normal](images/normal.png)

The **tangent prop** algorithm proposes to learn a neural network based classifier, $f(x)$, which is invariant to known transformations causing the input to move along its manifold. Local invariance would require that $\bigtriangledown_x f(x)$ is perpendicular to the tangent vectors $V^{(i)}$. This can be encouraged by adding a penalty term that keeps the directional derivative of $f(x)$ along each of the $V^{(i)}$ small:

$$ \Omega(f) = \sum_i \Big( (\bigtriangledown_x f(x))^T V^{(i)} \Big)^2 $$

It is similar to data augmentation in that both of them use prior knowledge of the domain to specify various transformations that the model should be invariant to. However, tangent prop only resists infinitesimal perturbations while data augmentation causes invariance to much larger perturbations.
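
As a rough illustration of the tangent prop penalty, the sketch below evaluates it for a linear function $f(x) = w^Tx$, whose gradient with respect to $x$ is simply $w$. It assumes the penalty sums the squared directional derivatives along the tangent vectors; the vectors and the function name are made up for the example.

```python
def tangent_prop_penalty(grad_f, tangent_vectors):
    """Sum of squared directional derivatives of f along each tangent vector."""
    return sum(
        sum(g * v for g, v in zip(grad_f, tangent)) ** 2
        for tangent in tangent_vectors
    )


# Hypothetical tangent direction: moving along the second input axis
# is a transformation the classifier should ignore.
tangents = [[0.0, 1.0]]

invariant_grad = [1.0, 0.0]   # f does not change along the tangent
sensitive_grad = [1.0, 1.0]   # f changes along the tangent

penalty_invariant = tangent_prop_penalty(invariant_grad, tangents)
penalty_sensitive = tangent_prop_penalty(sensitive_grad, tangents)
print(penalty_invariant, penalty_sensitive)
```

The penalty is zero when the gradient is perpendicular to every tangent vector and grows as $f$ becomes sensitive to movement along the manifold.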

**Manifold Tangent Classifier** works in two parts:
- Use Autoencoders to learn the manifold structures using Unsupervised Learning.
- Use these learned manifolds with tangent prop.