## Chap 7 Regularization for Deep Learning
We defined regularization as __any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.__

In the context of deep learning, most regularization strategies are based on regularizing estimators. Regularization of an estimator works by trading increased bias for reduced variance.

Deep learning algorithms are typically applied to extremely complicated domains such as images, audio sequences and text, for which the true generation process essentially involves simulating the entire universe. To some extent, we are always trying to fit a square peg (the data generating process) into a round hole (our model family).

What this means is that __controlling the complexity of the model__ is NOT a simple matter of finding the __model of the right size__, with the right number of parameters. Instead, we might find (and indeed, in practical deep learning scenarios, we almost always do find) that the __best fitting model__ (in the sense of minimizing generalization error) is a __large model that has been regularized appropriately.__

### 7.1 Parameter Norm Penalties
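Many regularization approaches are based on limiting the capacity of models by adding a parameter norm penalty $\Omega(\theta)$ to the objective function $J$. We denote the regularized objective function by $\tilde J$, where $\alpha \in [0, \infty)$ is a hyperparameter that weights the contribution of the penalty term relative to $J$: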

$$\tilde J(\theta; X,y) = J(\theta; X,y) + \alpha \Omega(\theta)$$

For neural networks, we typically choose to use a parameter norm penalty that penalizes only the weights of the affine transformation at each layer and leaves the biases unregularized. The biases typically require less data to fit accurately than the weights.

Each weight specifies how two variables interact. Fitting the weight well requires observing both variables in a variety of conditions. Each __bias controls only a single variable.__ This means that __we do NOT induce too much variance by leaving the biases unregularized.__ Also, __regularizing the bias parameters can introduce a significant amount of underfitting.__

In the context of neural networks, it is sometimes desirable to __use a separate penalty with a different α coefficient for each layer of the network.__ Because it can be expensive to search for the correct value of multiple hyperparameters, it is still reasonable to use the same weight decay at all layers just to __reduce the size of search space.__

#### 7.1.1 $L^2$ Parameter Regularization
The $L^2$ parameter norm penalty is commonly known as __weight decay__. This regularization strategy __drives the weights closer to the origin__ by adding a regularization term $\Omega(\theta) = \frac{1}{2} ||w||_2^2$ to the objective function.

We can gain some insight into the behavior of weight decay regularization by studying the gradient of the regularized objective function. To simplify the presentation, we assume no bias parameter, so $\theta$ is just w. 

__Objective function:__ 
$$\tilde J(w) = \frac{\alpha}{2} w^T w + J(w)$$

__Gradient:__ 
$$\nabla_w \tilde J(w) = \alpha w + \nabla_w J(w)$$

__Weight update:__ 
$$w \leftarrow w - \epsilon  (\alpha w + \nabla_w J(w))$$

$$w \leftarrow (1- \epsilon \alpha)w - \epsilon \nabla_w J(w)$$

We can see that the addition of the weight decay term has modified the learning rule to multiplicatively shrink the weight vector by a __constant factor__ on each step, just before performing the usual gradient update. 
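The equivalence of the two update forms above can be checked numerically. This is a minimal sketch: the toy cost `grad_J` and the values of `eps` and `alpha` are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def grad_J(w):
    # Gradient of a toy unregularized cost J(w) = 0.5 * ||w - 1||^2 (hypothetical choice).
    return w - 1.0

def step_penalty_form(w, eps, alpha):
    # Plain gradient step on J~(w) = J(w) + (alpha / 2) * w^T w.
    return w - eps * (alpha * w + grad_J(w))

def step_decay_form(w, eps, alpha):
    # Shrink the weights by (1 - eps * alpha), then apply the usual gradient step.
    return (1.0 - eps * alpha) * w - eps * grad_J(w)

w = np.array([2.0, -3.0, 0.5])
eps, alpha = 0.1, 0.5
w_a = step_penalty_form(w, eps, alpha)
w_b = step_decay_form(w, eps, alpha)
print(np.allclose(w_a, w_b))
```

Both functions compute the same update; the second form only makes the multiplicative shrinkage explicit.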

We will further simplify the analysis by making a __quadratic approximation__ to the objective function in the neighborhood of the value of the weights that obtains minimal unregularized training cost, $w^* = \arg \min_w J(w)$.

> To simplify the analysis, we approximate the objective function by its second-order Taylor expansion around its minimum.

Taylor series expansion around $w^*$
\begin{align}
J(w) & \approx J(w^*) + (w-w^*)^T g + \frac{1}{2} (w-w^*)^T H (w-w^*) \\
&( \because w^* = \arg \min_w J(w) \therefore g = 0) \\
\hat J(w) & = J(w^{*}) + \frac{1}{2}(w-w^*)^T H(w-w^*) \\
\end{align}

> $\hat J(w)$ is the approximation of $J(w)$ near $w^*$. In the discussion that follows, $\hat J(w)$ is used in place of $J(w)$.

where g is the gradient and H is the Hessian matrix of J with respect to w, both evaluated at $w^*$. Because $w^*$ is the location of a minimum of J, we can conclude that __H is positive semidefinite.__

The minimum of J occurs where its gradient
\\[\nabla_w J(w) \approx \nabla_w \hat J(w) = H(w-w^*)\\]
is equal to zero. 

We can now solve for the minimum of the regularized version of $\hat J$. We use the variable $\tilde w = \arg \min_w \tilde J(w)$ to represent the location of the minimum of $\tilde J$.

\begin{align}
\because \tilde J(w) &= \frac{\alpha}{2} w^T w + J(w) \approx \frac{\alpha}{2} w^T w + \hat J(w) \\
\therefore \nabla_w \tilde J(w) &= \alpha w + \nabla_w \hat J(w) = \alpha w + H(w-w^*)           \\
\nabla_w \tilde J(w) &= 0 \\
                     &\Rightarrow \alpha \tilde w + H(\tilde w - w^*) = 0 \\
                     &\Rightarrow (H+\alpha I)\tilde w = H w^*            \\
                     &\Rightarrow \tilde w = (H+\alpha I)^{-1} H w^*      \\ 
\end{align}

As $\alpha$ approaches 0, the regularized solution $\tilde w$ approaches $w^*$.

\begin{align}
w^∗ &= \arg \min_w J(w)        \\ 
\tilde w &= \arg \min_w \tilde J(w) \\
\end{align}

> $w^*$ is the optimal solution without regularization; $\tilde w$ is the optimal solution with regularization.

Because H is __real and symmetric__, we can decompose it into a diagonal matrix D of eigenvalues and an orthonormal basis of eigenvectors, Q, such that $H = QDQ^T$ and $QQ^T=Q^TQ=I$.

\begin{align}
\tilde w &= (QDQ^T + \alpha I)^{-1} QDQ^T w^* \\ 
&= (QDQ^T + Q \alpha Q^T)^{-1} QDQ^T w^* \\
&= \left[ Q(D+\alpha I)Q^T \right]^{-1} QDQ^T w^* \\
&= Q(D+\alpha I)^{-1} DQ^T w^* \\
\end{align}

We see that the effect of weight decay is to rescale $w^*$ along the axes defined by the eigenvectors of H.

Specifically, the component of $w^*$ that is aligned with the i-th eigenvector of H is rescaled by a factor of $\frac{\lambda_i}{\lambda_i + \alpha}$, where $\lambda_i$ is the i-th eigenvalue of H.

Along the directions where the eigenvalues of H are relatively large ($\lambda_i \gg \alpha$), the effect of regularization is relatively small. Along the directions where $\lambda_i \ll \alpha$, the corresponding components of $w^*$ are shrunk to nearly zero magnitude.

\begin{align}
\lambda_i \gg \alpha: & \\
& (Q^T \tilde w)_i = \frac{\lambda_i}{\lambda_i + \alpha} (Q^T w^*)_i \approx (Q^T w^*)_i \\
\lambda_i \ll \alpha: & \\
& (Q^T \tilde w)_i = \frac{\lambda_i}{\lambda_i + \alpha} (Q^T w^*)_i \approx 0
\end{align}

<img src="ref/Fig7.1.png" width=90%>
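The rescaling along the eigenvectors of H can be verified numerically. The sketch below uses a randomly generated symmetric positive definite matrix as a stand-in for the Hessian; `H`, `w_star`, and `alpha` are arbitrary assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
H = A @ A.T + 0.1 * np.eye(4)   # symmetric positive definite stand-in for the Hessian
w_star = rng.normal(size=4)     # hypothetical unregularized optimum
alpha = 0.5

# Direct form: w~ = (H + alpha I)^{-1} H w*
w_tilde = np.linalg.solve(H + alpha * np.eye(4), H @ w_star)

# Eigenbasis form: rescale each component of Q^T w* by lambda_i / (lambda_i + alpha)
lam, Q = np.linalg.eigh(H)      # H = Q diag(lam) Q^T
scale = lam / (lam + alpha)
w_tilde_eig = Q @ (scale * (Q.T @ w_star))

print(np.allclose(w_tilde, w_tilde_eig))
print(scale)                    # factors close to 1 for large lambda_i, close to 0 for small lambda_i
```

Using `np.linalg.solve` rather than an explicit inverse is a numerical convenience; it computes the same quantity as the formula.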

__Linear Regression example with L2 norm__ 

__Cost function:__

\begin{align}
J(w) &= (Xw-y)^T (Xw-y) \\ 
\tilde J(w) &= (Xw-y)^T(Xw-y) + \frac{1}{2} \alpha w^T w \\
\end{align}

__Unregularized Solution:__
\begin{align}
\nabla_w J(w) &= 2 X^T(Xw-y) = 0 \\
&\Rightarrow X^T X w - X^Ty = 0 \\
&\Rightarrow w = (X^T X)^{-1} X^T y \\
\end{align}

__Regularized Solution:__
\begin{align}
\nabla_w \tilde J(w) &= 2 X^T(Xw-y) + \alpha w = 0 \\
&\Rightarrow 2X^T X w - 2X^Ty + \alpha w = 0 \\
&\Rightarrow \tilde w = \left( X^T X + \frac{\alpha}{2} I \right)^{-1} X^T y \\
\end{align}
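As a sanity check on the two closed-form solutions, the sketch below fits synthetic data (the data, `w_true`, and `alpha` are made-up assumptions) and confirms that the regularized solution makes the gradient of $\tilde J$ vanish.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                 # synthetic design matrix (assumption)
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)
alpha = 4.0

# Unregularized solution: w = (X^T X)^{-1} X^T y
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Regularized solution for J~(w) = (Xw - y)^T (Xw - y) + (alpha / 2) w^T w
w_ridge = np.linalg.solve(X.T @ X + 0.5 * alpha * np.eye(3), X.T @ y)

# Gradient of J~ at the regularized solution should be (numerically) zero
grad_ridge = 2 * X.T @ (X @ w_ridge - y) + alpha * w_ridge

print(np.linalg.norm(grad_ridge))
print(np.linalg.norm(w_ols), np.linalg.norm(w_ridge))
```

The regularized weight vector has a smaller norm than the unregularized one, which is the shrinkage toward the origin described earlier.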

__Analysis:__ 

The matrix $X^TX$ is proportional to the covariance matrix $\frac{1}{m}X^TX$ (assuming the features have zero mean). Using $L^2$ regularization replaces $(X^TX)^{-1}$ in the solution with $\left( X^TX+\frac{\alpha}{2} I \right)^{-1}$. The matrix being inverted is the same as the original one, but with $\frac{\alpha}{2}$ added to the diagonal. (With the convention $J(w) = \frac{1}{2}(Xw-y)^T(Xw-y)$, the added term would be $\alpha$ instead.) __The diagonal entries of this matrix correspond to the variance of each input feature.__ We can see that L2 regularization causes the learning algorithm to “perceive” the input X as having __higher variance__, which makes it __shrink the weights on features whose covariance with the output target is low compared to this added variance.__

__Example:__

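Let $X^TX = \begin{bmatrix}
10  &  \cdot   & \cdot   \\
\cdot   &  0.1 & \cdot   \\
\cdot   &  \cdot   & 0.2 \\
\end{bmatrix}$, where the off-diagonal entries are left unspecified, so that $cov(x_1, x_1) = 100 \cdot cov(x_2, x_2)$. Assume $\frac{\alpha}{2} = 10000$. The regularized solution then uses $(X^TX)' = X^TX + \frac{\alpha}{2} I = \begin{bmatrix}
10010 &  \cdot       & \cdot       \\
\cdot     &  10000.1 & \cdot       \\
\cdot     &  \cdot       & 10000.2 \\
\end{bmatrix}$ in place of $X^TX$, so the perceived variances become nearly equal: $cov(x_1, x_1) \approx cov(x_2, x_2) \approx 10000$.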

#### 7.1.2 L1 Regularization
$$\Omega(\theta) = ||w||_1 = \sum_i |w_i|$$

__Objective function:__
$$\tilde J(w) = \alpha ||w||_1 + J(w)$$
__Gradient:__
\begin{align} 
\nabla_w \tilde J(w) &= \alpha \cdot sign(w) + \nabla_w J(w) \\
& \approx \alpha \cdot sign(w) + \nabla_w \hat J(w) \\
        &= \alpha \cdot sign(w) + H(w - w^*)         \\
\end{align}

The regularization contribution to the gradient no longer scales linearly with each $w_i$; instead it is a __constant factor__ with a sign equal to $sign(w_i)$. One consequence of this form of the gradient is that we will not necessarily see clean algebraic solutions to quadratic approximations of $J(w)$ as we did for L2 regularization.

Because the L1 penalty does not admit clean algebraic expressions in the case of a fully general Hessian, we will also make the further simplifying assumption that the __Hessian is diagonal__, $H = diag([H_{1,1},...,H_{n,n}])$, where each $H_{i,i} > 0$.
This assumption holds if the data for the linear regression problem has been preprocessed to __remove all correlation between the input features__, which may be accomplished using PCA.

Under the assumption that the Hessian is diagonal, the Taylor expansion of $J(w)$ around $w^*$ and the resulting approximation of the regularized objective are as follows:
\begin{align}
\hat J(w) &= J(w^*) + \frac{1}{2} (w-w^*)^T H (w-w^*) \\
&= J(w^*) + \sum_i \left[ \frac{1}{2} H_{i,i} (w_i - w_{i}^{*})^2 \right] \\
\tilde J(w) &= J(w) + \alpha ||w||_1 \\
            &\approx \hat J(w) + \alpha ||w||_1 \\ 
            &= J(w^*) + \sum_i \left[ \frac{1}{2} H_{i,i} (w_i - w_{i}^{*})^2 + \alpha |w_i| \right] \\ 
\end{align}


The problem of minimizing this approximate cost function has an analytical solution (for each dimension i), obtained by setting the gradient to zero:
\begin{align}
\nabla_w \tilde J(w) &= \alpha \cdot sign(w) + \nabla_w J(w) \\
&\approx \alpha \cdot sign(w) + \nabla_w \hat J(w) \\
&= \alpha \cdot sign(w) + H(w-w^*) = \vec 0\\ 
&\because \text{H is assumed to be a diagonal matrix} \\ 
&\therefore \forall i = 1, \dots, n \\ 
&\Rightarrow \alpha \cdot sign(w_i) + H_{i,i} \cdot (w_i - w_i^*) = 0 \\
&\Rightarrow \tilde w_i = sign(w_i^{*}) \cdot max \left\{ |w_i^*|- \frac{\alpha}{H_{i,i}} , 0 \right\} \\
\end{align}

__Analysis:__

Solve for $w_i$ in the equation: $\alpha \cdot sign(w_i) + H_{i,i} \cdot (w_i - w_i^*) = 0$

We first start with discussing the sign of $w_i$ 

\begin{align} 
w_i &=\left\{
\begin{aligned}
& w_i^* - \frac{\alpha}{H_{i,i}}, (w_i > 0) \\
& w_i^* + \frac{\alpha}{H_{i,i}}, (w_i < 0) \\
& 0, (w_i = 0) \\
\end{aligned} 
\right. 
\end{align} 

Then we discuss all possible values for $w_i^*$

Case 1: $0<w_i^*<\frac{\alpha}{H_{i,i}}$
\begin{align} 
0<w_i^*<\frac{\alpha}{H_{i,i}}&\left\{
\begin{aligned}
(w_i > 0), \; & w_i^* - \frac{\alpha}{H_{i,i}} = w_i <0  \Rightarrow\!\Leftarrow \\
(w_i < 0), \; & w_i^* + \frac{\alpha}{H_{i,i}} = w_i >0 \Rightarrow\!\Leftarrow \\
(w_i = 0) \; &\\
\end{aligned} 
\right. 
\end{align} 

Thus, $0<w_i^*<\frac{\alpha}{H_{i,i}} \Rightarrow \tilde w_i = 0$

Case 2: $w_i^*>\frac{\alpha}{H_{i,i}}$
\begin{align} 
w_i^*>\frac{\alpha}{H_{i,i}}&\left\{
\begin{aligned}
(w_i > 0), \; & w_i^* - \frac{\alpha}{H_{i,i}} = w_i >0 \\
(w_i < 0), \; & w_i^* + \frac{\alpha}{H_{i,i}} = w_i >0 \Rightarrow\!\Leftarrow \\
\end{aligned} 
\right. 
\end{align} 

Thus, $w_i^*>\frac{\alpha}{H_{i,i}} \Rightarrow \tilde w_i = w_i^* - \frac{\alpha}{H_{i,i}}$

Case 3: $-\frac{\alpha}{H_{i,i}}<w_i^*<0$
\begin{align} 
-\frac{\alpha}{H_{i,i}}<w_i^*<0 &\left\{
\begin{aligned}
(w_i > 0), \; & w_i^* - \frac{\alpha}{H_{i,i}} = w_i <0  \Rightarrow\!\Leftarrow \\
(w_i < 0), \; & w_i^* + \frac{\alpha}{H_{i,i}} = w_i >0 \Rightarrow\!\Leftarrow \\
\end{aligned} 
\right. 
\end{align} 

Thus, $-\frac{\alpha}{H_{i,i}}<w_i^*<0 \Rightarrow \tilde w_i = 0$

Case 4: $w_i^* < -\frac{\alpha}{H_{i,i}}$
\begin{align} 
w_i^* < -\frac{\alpha}{H_{i,i}}&\left\{
\begin{aligned}
(w_i > 0), \; & w_i^* - \frac{\alpha}{H_{i,i}} = w_i < 0 \Rightarrow\!\Leftarrow \\
(w_i < 0), \; & w_i^* + \frac{\alpha}{H_{i,i}} = w_i < 0 \\ 
\end{aligned} 
\right. 
\end{align} 

Thus, $w_i^* <-\frac{\alpha}{H_{i,i}} \Rightarrow \tilde w_i = w_i^* + \frac{\alpha}{H_{i,i}}$

__Observation:__

In cases 1 and 3, the optimal value is $\tilde w_i = 0$. This occurs because the contribution of $J(w)$ to the regularized objective $\tilde J(w)$ is overwhelmed in direction i by the L1 regularization, which pushes the value of $w_i$ to zero. The boundary values $w_i^* = \pm \frac{\alpha}{H_{i,i}}$ and $w_i^* = 0$, which the strict inequalities above leave out, also give $\tilde w_i = 0$.

In cases 2 and 4, the regularization does not move the optimal value of $w_i$ to zero; instead it shifts it toward zero by a distance equal to $\frac{\alpha}{H_{i,i}}$.

In conclusion, the solution to \\[\alpha \cdot sign(w_i) + H_{i,i} \cdot (w_i - w_i^*) = 0\\] is \\[\tilde w_i = sign(w_i^{*}) \cdot max \left\{ |w_i^*|- \frac{\alpha}{H_{i,i}} , 0 \right\}\\]
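The closed-form L1 solution is a soft-thresholding operation. The sketch below compares it against a brute-force grid search over the per-coordinate cost $\frac{1}{2} H_{i,i}(w_i - w_i^*)^2 + \alpha |w_i|$, and contrasts it with the L2 shrinkage factor for a diagonal Hessian. The values of `w_star`, `h`, and `alpha` are illustrative assumptions, and the helper names are hypothetical.

```python
import numpy as np

def soft_threshold(w_star, alpha, h):
    # L1 solution per coordinate: sign(w*) * max(|w*| - alpha / H_ii, 0)
    return np.sign(w_star) * np.maximum(np.abs(w_star) - alpha / h, 0.0)

def brute_force_min(w_star_i, alpha, h_i):
    # Minimize 0.5 * H_ii * (w - w*_i)^2 + alpha * |w| over a fine grid.
    grid = np.linspace(-5.0, 5.0, 100001)
    cost = 0.5 * h_i * (grid - w_star_i) ** 2 + alpha * np.abs(grid)
    return grid[np.argmin(cost)]

w_star = np.array([3.0, 0.4, -0.2, -2.0])   # hypothetical unregularized optimum
h = np.array([2.0, 1.0, 4.0, 0.5])          # diagonal entries H_ii > 0
alpha = 1.0

w_l1 = soft_threshold(w_star, alpha, h)     # some coordinates become exactly zero
w_l2 = h / (h + alpha) * w_star             # L2 with diagonal H: shrinks, never zeroes

print(w_l1)
print(w_l2)
```

Only the first coordinate survives the L1 penalty here, because it is the only one with $|w_i^*| > \frac{\alpha}{H_{i,i}}$; every L2 coordinate stays nonzero.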

In comparison to L2 regularization, L1 regularization results in a solution that is more sparse. Sparsity in this context refers to the fact that some parameters have an optimal value of zero. The sparsity of L1 regularization is a qualitatively different behavior than arises with L2 regularization.

If we revisit the L2 solution $\tilde w = (H+\alpha I)^{-1} H w^*$ using the assumption of a diagonal and positive definite Hessian H that we introduced for our analysis of L1 regularization, we find that $\tilde w_i = \frac{H_{i,i}}{H_{i,i} + \alpha} w_i^*$. If $w_i^*$ is nonzero, then $\tilde w_i$ remains nonzero. This demonstrates that L2 regularization does not cause the parameters to become sparse, while L1 regularization may do so for large enough $\alpha$.

The two corrections to the unregularized solution $w^*$ can be summarized as follows.
- L1 regularization:

H is assumed to be the diagonal Hessian matrix at the point $w^*$. For each diagonal element of H (say the i-th), L1 regularization makes the following correction to the unregularized solution $w^*$:

\\[\tilde w_i = sign(w_i^{*}) \cdot max \left\{ |w_i^*|- \frac{\alpha}{H_{i,i}} , 0 \right\}\\]

> L1 shifts each component of $w^*$ toward zero by $\frac{\alpha}{H_{i,i}}$ and clips it at zero, so some components become exactly zero.

- L2 regularization: 

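H is the Hessian matrix at the point $w^*$. After orthonormally diagonalizing H, L2 regularization rescales $w^*$ along the direction of each eigenvector of H.

\\[\tilde w = Q(D+\alpha I)^{-1} DQ^T w^*, \: H=QDQ^T\\]

> L2 scales the component of $w^*$ along the i-th eigenvector of H by $\frac{\lambda_i}{\lambda_i + \alpha}$. This factor is nonzero whenever $\lambda_i > 0$, so the components shrink but are unlikely to become exactly zero.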

The sparsity property induced by L1 regularization has been used extensively as a __feature selection__ mechanism. Feature selection simplifies a machine learning problem by choosing which subset of the available features should be used.

The L1 penalty causes a subset of the weights to become zero, suggesting that the corresponding features may safely be discarded.

### 7.2 Norm Penalties as Constrained Optimization

__Norm Penalties__

The cost function regularized by a parameter norm penalty: 
\\[\tilde J(\theta) = J(\theta) + \alpha \Omega(\theta)\\]

Suppose we want to minimize the cost function \\[f(w)= ||Xw-y||^2_2 = (Xw-y)^T(Xw-y)\\] with L2 regularization. The regularized cost function becomes \\[\tilde f(w) = (Xw-y)^T(Xw-y) + \frac{1}{2} \alpha w^Tw\\] We then set the gradient to zero to obtain $\tilde w = \arg \min_w \tilde f(w)$:

\begin{align}
\nabla_w \tilde f(w) &= 2X^TXw - 2 X^Ty + \alpha w = 0 \\
\tilde w &= (X^TX + \frac{\alpha}{2} I)^{-1} X^T y \\  
\end{align}

__Constrained Optimization__ 

We can minimize a function subject to constraints by constructing a generalized Lagrange function, consisting of the original objective function plus a set of penalties. Each penalty is a product between a coefficient, called a __Karush–Kuhn–Tucker (KKT) multiplier__, and a function representing
whether the constraint is satisfied. 

Suppose we want to minimize the cost function \\[f(w)=\frac{1}{2} ||Xw-y||^2_2\\] subject to $w^Tw \leq 1$. To do so, we introduce the Lagrangian \\[L(w, \lambda) = f(w) + \lambda(w^Tw-1)\\] and then solve the problem \\[\min_w \max_{\lambda, \lambda \geq 0} L(w, \lambda)\\] For a fixed $\lambda$, we set the gradient with respect to w to zero to obtain $\tilde w = \arg \min_w L(w, \lambda)$:

\begin{align}
\nabla_w L(w, \lambda) &= X^TXw - X^Ty + 2 \lambda w = 0 \\
\Rightarrow \tilde w &= (X^TX+2 \lambda I)^{-1}X^Ty \\ 
\end{align} 

__Norm Penalties OR Constrained Optimization:__

We can thus think of a parameter norm penalty as imposing a constraint on the weights. If Ω is the L2 norm, then the weights are constrained to lie in an L2 ball. If Ω is the L1 norm, then the weights are constrained to lie in a region of limited L1 norm.

Sometimes we may wish to use explicit constraints rather than penalties because penalties can cause non-convex optimization procedures to get __stuck in local minima__ corresponding to small θ.

When training neural networks, this usually manifests as neural networks that train with several __dead units__. These are units that do not contribute much to the behavior of the function learned by the network because __the weights going into or out of them are all very small.__ 

When training with a penalty on the norm of the weights, these configurations can be locally optimal, even if it is possible to significantly reduce J by making the weights larger. Explicit constraints implemented by re-projection can work much better in these cases because they do not encourage the weights to approach the origin. Explicit constraints implemented by re-projection only have an effect when the weights become large and attempt to leave the constraint region.

Finally, explicit constraints with reprojection can be useful because they impose some stability on the optimization procedure. When using high learning rates, it is possible to encounter a positive feedback loop in which large weights induce large gradients which then induce a large update to the weights. If these updates consistently increase the size of the weights, then θ rapidly moves away from the origin until numerical overflow occurs. Explicit constraints with reprojection prevent this feedback loop from continuing to increase the magnitude of the weights
without bound. Hinton et al. (2012c) recommend using constraints combined with a high learning rate to allow rapid exploration of parameter space while maintaining some stability.

In particular, Hinton et al. (2012c) recommend a strategy introduced by Srebro and Shraibman (2005): constraining the norm of each column of the weight matrix of a neural net layer, rather than constraining the Frobenius norm of the entire weight matrix. Constraining the norm of each column separately prevents any one hidden unit from having very large weights. If we converted this constraint into a penalty in a Lagrange function, it would be similar to L2 weight decay but with a separate KKT multiplier for the weights of each hidden unit. Each of these KKT multipliers would be dynamically updated separately to make each hidden unit obey the constraint. In practice, column norm limitation is always implemented as an explicit constraint with reprojection.
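A column-norm constraint with reprojection can be sketched as follows. This is an illustrative version, assuming each column of `W` holds the weights of one hidden unit and that the projection is applied after every gradient update; the function name `project_columns` and the `max_norm` value are hypothetical.

```python
import numpy as np

def project_columns(W, max_norm):
    """Rescale any column whose L2 norm exceeds max_norm back onto the norm ball.

    Columns that already satisfy the constraint are left untouched, so the
    projection does not pull small weights toward the origin.
    """
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return W * scale

W = np.array([[3.0, 0.1],
              [4.0, 0.2]])          # first column has norm 5, second about 0.22
W_proj = project_columns(W, max_norm=1.0)
print(W_proj)
```

In a training loop this would be called right after each weight update, for example `W = project_columns(W - eps * grad, max_norm)`.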

### 7.3 Regularization and Under-Constrained Problems
In some cases, regularization is necessary for machine learning problems to be properly defined. Many linear models in machine learning, including linear regression and PCA, depend on inverting the matrix $X^TX$. This is not possible whenever $X^TX$ is singular.

This matrix can be singular whenever the data generating distribution truly has no variance in some direction, or when no variance is observed in some direction because there are fewer examples (rows of X) than input features (columns of X). In this case, many forms of regularization correspond to inverting $X^T X + \alpha I$ instead. This __regularized matrix is guaranteed to be invertible__ for $\alpha > 0$: $X^TX$ is positive semidefinite, so every eigenvalue of $X^T X + \alpha I$ is at least $\alpha$, and therefore $det(X^T X + \alpha I) \neq 0$.

These linear problems have closed form solutions (given by the normal equations) when the relevant matrix is invertible. It is also possible for a problem with no closed form solution to be underdetermined.

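An example is logistic regression applied to a problem where the classes are linearly separable. If a weight vector w is able to achieve perfect classification, then 2w will also achieve perfect classification and higher likelihood. Doubling w leaves the decision boundary unchanged but doubles every logit $w^Tx$, which pushes each sigmoid output closer to its correct label of 0 or 1, so the likelihood of the training set increases.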

An iterative optimization procedure like stochastic gradient descent will continually increase the magnitude of w and, in theory, will never halt. In practice, a numerical implementation of gradient descent will eventually reach sufficiently large weights to cause numerical overflow, at which point its behavior will depend on how the programmer has decided to handle values that are not real numbers.
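The effect of scaling w on a separable problem can be seen in a small experiment. The sketch below uses four hand-picked points, labels in $\{-1, +1\}$, and no bias term; all of these are simplifying assumptions made for illustration.

```python
import numpy as np

def log_likelihood(w, X, y):
    # With labels y in {-1, +1}: log sigmoid(y * x^T w) = -log(1 + exp(-y * x^T w))
    margins = y * (X @ w)
    return -np.sum(np.logaddexp(0.0, -margins))

X = np.array([[1.0, 2.0], [2.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = np.array([1.0, 1.0])            # separates the two classes perfectly

margins = y * (X @ w)               # all positive means perfect classification
lls = [log_likelihood(c * w, X, y) for c in (1.0, 2.0, 4.0, 8.0)]
print(margins)
print(lls)                          # increases toward 0 but never reaches it
```

Because the log-likelihood keeps improving as the scale grows, an unregularized optimizer has no finite optimum to converge to.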

As we saw in section 2.9, we can solve underdetermined linear equations using the Moore-Penrose pseudoinverse. Recall that one definition of the pseudoinverse $X^+$ of a matrix X is \\[X^+ = 
\lim_{\alpha \rightarrow 0} (X^TX+\alpha I)^{-1} X^T\\]

We can now recognize this equation as performing linear regression with weight decay. Specifically, this equation is the limit as the regularization coefficient shrinks to zero. We can thus __interpret the pseudoinverse as stabilizing underdetermined problems using regularization.__
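The connection between weight decay and the pseudoinverse can be checked numerically on an underdetermined problem. In this sketch the matrix `X` is random and the value of `alpha` is an arbitrary small number standing in for the limit.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))         # fewer examples (rows) than features (columns)
gram = X.T @ X                      # 5 x 5 matrix of rank at most 3, hence singular

rank = np.linalg.matrix_rank(gram)

alpha = 1e-8                        # small regularization coefficient
ridge_inverse = np.linalg.solve(gram + alpha * np.eye(5), X.T)
pinv = np.linalg.pinv(X)            # Moore-Penrose pseudoinverse

print(rank)
print(np.max(np.abs(ridge_inverse - pinv)))
```

The regularized system is solvable even though `gram` itself is singular, and its solution operator approaches `np.linalg.pinv(X)` as `alpha` shrinks.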

### 7.4 Dataset Augmentation
In practice, the amount of data we have is limited. One way to get around this problem is to create fake data and add it to the training set. For some machine learning tasks, it is reasonably straightforward to create new fake data.

This approach is easiest for classification. A classifier needs to take a complicated, high dimensional input x and summarize it with a single category identity y. This means that the main task facing a classifier is to be invariant to a wide variety of transformations. We can generate __new (x, y) pairs__ easily just by __transforming the x inputs in our training set.__ This approach is not as readily applicable to many other tasks. 

Dataset augmentation has been a particularly effective technique for a specific classification problem: __object recognition.__ Images are high dimensional and include an enormous variety of factors of variation, many of which can be easily simulated. Operations like __translating the training images a few pixels in each direction__ can often greatly improve generalization, even if the model has already been designed to be __partially translation invariant__ by using the convolution and pooling techniques. Many other operations such as __rotating the image__ or __scaling the image__ have also proven quite effective.
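A translation-based augmentation can be sketched in a few lines. The helper names `translate` and `augment`, the zero-filled borders, and the one-pixel shift range are all assumptions chosen for illustration; real pipelines often use other border handling.

```python
import numpy as np

def translate(image, dy, dx):
    """Shift a 2-D image by (dy, dx) pixels, filling exposed borders with zeros."""
    h, w = image.shape
    out = np.zeros_like(image)
    src = image[max(0, -dy):min(h, h - dy), max(0, -dx):min(w, w - dx)]
    r0, c0 = max(0, dy), max(0, dx)
    out[r0:r0 + src.shape[0], c0:c0 + src.shape[1]] = src
    return out

def augment(image, label, max_shift=1):
    # Generate new (x, y) pairs: the label is unchanged by small translations.
    pairs = []
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            pairs.append((translate(image, dy, dx), label))
    return pairs

image = np.arange(9.0).reshape(3, 3)
pairs = augment(image, label=7)
print(len(pairs))
```

Each original example yields nine training pairs here (including the unshifted one), all sharing the original label.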

Injecting noise in the input to a neural network (Sietsma and Dow, 1991) can also be seen as a form of data augmentation. Neural networks prove not to be very robust to noise, however (Tang and Eliasmith, 2010). One way to improve the robustness of neural networks is simply to train them with random noise applied to their inputs. Input noise injection is part of some unsupervised learning algorithms such as the denoising autoencoder (Vincent et al., 2008). Noise injection also works when the noise is applied to the hidden units, which can be seen as doing dataset augmentation at multiple levels of abstraction.

When comparing machine learning benchmark results, it is important to take the effect of dataset augmentation into account. Often, __hand-designed dataset augmentation schemes__ can dramatically reduce the generalization error of a machine learning technique. To compare the performance of one machine learning algorithm to another, it is necessary to make sure that both algorithms were evaluated using the same hand-designed dataset augmentation schemes.

### 7.5 Noise Robustness
For some models, the addition of __noise with infinitesimal variance__ at the input of the model is equivalent to imposing a __penalty on the norm of the weights__ (Bishop, 1995a,b). In the general case, it is important to remember that noise injection can be much more powerful than simply shrinking the parameters, especially when the noise is added to the hidden units. __Noise applied to the hidden units__ is such an important topic that it merits its own separate discussion; the __dropout algorithm__ described in section 7.12 is the main development of that approach.

Another way that noise has been used in the service of regularizing models is by adding it to the weights. This technique has been used primarily in the context of recurrent neural networks (Jim et al., 1996; Graves, 2011). Noise applied to the weights can also be interpreted as equivalent (under some assumptions) to a more traditional form of regularization, encouraging stability of the function to be learned.  

Consider the regression setting, where we wish to train a function $\hat y(x)$ that maps a set of features x to a scalar using the least-squares cost function between the model predictions $\hat y(x)$ and the true values y: \\[J=\mathbb{E}_{p(x,y)}\left[ (\hat y(x) - y )^2 \right] \\]

We now assume that with each input presentation we also include a random perturbation $\epsilon_W \sim N(\epsilon; 0, \eta I)$ of the network weights. 

> $\epsilon_W$ is a normally distributed random variable with mean 0 and covariance $\eta I$.

We denote the __perturbed model as $\hat y_{\epsilon_W}(x)$__. Despite the injection of noise, we are still interested in minimizing the squared error of the output of the network. The objective function thus becomes: 
\begin{align}
\tilde J_W &= \mathbb{E}_{p(x,y,\epsilon_W)}{\left[ (\hat y_{\epsilon_W}(x) - y)^2\right]} \\
&= \mathbb{E}_{p(x,y,\epsilon_W)} {\left[ \hat y_{\epsilon_W}(x)^2 -2y\hat y_{\epsilon_W}(x)+ y^2 \right]} \\
\end{align}

For small $\eta$, the minimization of J with added weight noise (with covariance $\eta I$) is equivalent to minimization of J with an additional regularization term: $$\eta \mathbb{E}_{p(x,y)} \left[ ||\nabla_W \hat y(x)||^2\right]$$

This form of regularization encourages the parameters to go to regions of parameter space where __small perturbations of the weights have a relatively small influence on the output.__ In other words, it pushes the model into regions where the model is relatively insensitive to small variations in the weights, finding points that are __not merely minima, but minima surrounded by flat regions__ [(Hochreiter and Schmidhuber, 1995)](ref/Hochreiter_and_Schmidhuber_1995.pdf). In the simplified case of linear regression (where, for instance, $\hat y(x) = w^Tx + b$), this regularization term collapses into $\eta \mathbb{E}_{p(x)}\left[ ||x||^2\right]$, which is not a function of parameters and therefore does not contribute to the gradient of $\tilde J_W$ with respect to the model parameters.
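For the linear case, the extra term can be checked with a Monte Carlo estimate. This is a rough sketch under several assumptions: synthetic data, no bias term, a fixed weight vector `w`, and a finite number of noise draws, so the match is only approximate.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                   # synthetic inputs (assumption)
w = np.array([0.5, -1.0, 2.0])
y = X @ w + 0.1 * rng.normal(size=200)
eta = 0.01                                      # variance of the weight noise

# Noise-free squared error and the predicted extra term eta * E[||x||^2]
J = np.mean((X @ w - y) ** 2)
penalty = eta * np.mean(np.sum(X ** 2, axis=1))

# Monte Carlo estimate of the objective with Gaussian weight noise
noise = rng.normal(scale=np.sqrt(eta), size=(10000, 3))
preds = X @ (w + noise).T                       # shape (200, 10000)
J_noisy = np.mean((preds - y[:, None]) ** 2)

print(J + penalty, J_noisy)
```

The two printed values should agree to within sampling error. Since `penalty` does not depend on `w`, the noise changes the value of the objective but not its gradient in this linear setting.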


### 7.6 Semi-Supervised Learning 

In the paradigm of semi-supervised learning, both __unlabeled examples__ from P(x) and __labeled examples__ from P (x, y) are used to estimate P (y | x) or predict y from x. In the context of deep learning, semi-supervised learning usually refers to learning a representation h = f (x). The goal is to __learn a representation so that examples from the same class have similar representations.__

Instead of having separate unsupervised and supervised components in the model, one can construct models in which a generative model of either P (x) or P(x,y) shares parameters with a discriminative model of P(y | x). One can then trade-off the supervised criterion −logP(y | x) with the unsupervised or generative one (such as −logP(x) or −logP(x,y)). The generative criterion then expresses a particular form of prior belief about the solution to the supervised learning problem (Lasserre et al., 2006), namely that the structure of P(x) is connected to the structure of P(y | x) in a way that is captured by the shared parametrization.

Salakhutdinov and Hinton (2008) describe a method for learning the kernel function of a kernel machine used for regression, in which __the usage of unlabeled examples for modeling P (x) improves P (y | x) quite significantly.__

See [Chapelle et al. (2006)](ref/MITPress_SemiSupervisedLearning.pdf) for more information about semi-supervised learning.

### 7.7 Multi-Task Learning
Multi-task learning [(Caruana, 1993)](ref/Multitask_Connectionist_Learning.pdf) is a way to improve generalization by pooling the examples (which can be seen as soft constraints imposed on the parameters) arising out of several tasks. In the same way that additional training examples put more pressure on the parameters of the model towards values that generalize well, when part of a model is shared across tasks, that part of the model is more constrained towards good values (assuming the sharing is justified), often yielding better generalization.

<img src="ref/Fig7.2.png" width=70%>

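> Multi-task learning assumes that the lower layers of the network can learn similar features from different training tasks. As more tasks and more data are fed into the model, generalization can improve. For example, training one model jointly on bird, dog, and human-face classification lets the lower layers learn features such as edges more effectively, which in turn improves the accuracy of the face recognizer.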

The model can generally be divided into two kinds of parts and associated parameters:
- __Task-specific parameters__ (which only benefit from the examples of their task to achieve good generalization). These are the __upper layers__ of the neural network in figure 7.2.
- __Generic parameters__, shared across all the tasks (which benefit from the pooled data of all the tasks). These are the __lower layers__ of the neural network in figure 7.2.

Of course this will happen only if some assumptions about the statistical relationship between the different tasks are valid, meaning that there is something shared across some
of the tasks.

From the point of view of deep learning, the underlying prior belief is the following: __among the factors that explain the variations observed in the data associated with the different tasks, some are shared across two or more tasks.__
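The split between generic and task-specific parameters can be sketched as a forward pass with one shared lower layer and two task heads. The layer sizes, the ReLU nonlinearity, and the names `W_shared`, `W_task_a`, and `W_task_b` are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_shared, n_out_a, n_out_b = 8, 4, 3, 2

W_shared = rng.normal(size=(n_in, n_shared))     # generic parameters (lower layer)
W_task_a = rng.normal(size=(n_shared, n_out_a))  # task-specific parameters, task A
W_task_b = rng.normal(size=(n_shared, n_out_b))  # task-specific parameters, task B

def shared_representation(x):
    # Lower layer shared by all tasks; it sees the pooled data of every task.
    return np.maximum(0.0, x @ W_shared)

def predict(x, task):
    # Upper layer: each task has its own head on top of the shared representation.
    h = shared_representation(x)
    head = W_task_a if task == "a" else W_task_b
    return h @ head

x = rng.normal(size=(5, n_in))
out_a = predict(x, "a")
out_b = predict(x, "b")
print(out_a.shape, out_b.shape)
```

During training, gradients from both tasks would flow into `W_shared`, while each head would receive gradients only from its own task.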

### 7.8 Early Stopping

When training large models with sufficient representational capacity to overfit the task, we often observe that __training error decreases steadily over time, but validation set error begins to rise again. This behavior occurs very reliably.__

This means we can obtain a model with __better validation set error__ (and thus, hopefully better test set error) by returning to the parameter setting __at the point in time with the lowest validation set error.__

This strategy is known as __early stopping.__ It is probably the most commonly used form of regularization in deep learning. Its popularity is due both to its effectiveness and its simplicity.

Algorithm: Every time the error on the validation set improves (error decreases), we store a copy of the model parameters. When the training algorithm terminates, we return these parameters, rather than the latest parameters. 
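The procedure above can be sketched as a small generic loop. The `patience` stopping rule (stop after a fixed number of steps without improvement) and the callback-style interface are assumptions added here for illustration; the toy "model" is a single number whose training optimum (3.0) differs from its validation optimum (2.0).

```python
import copy

def train_with_early_stopping(params, train_step, validation_error, max_steps, patience):
    # Keep a copy of the best parameters seen so far on the validation set.
    best_params = copy.deepcopy(params)
    best_error = validation_error(params)
    steps_since_best = 0
    for _ in range(max_steps):
        params = train_step(params)
        error = validation_error(params)
        if error < best_error:
            best_error = error
            best_params = copy.deepcopy(params)
            steps_since_best = 0
        else:
            steps_since_best += 1
            if steps_since_best >= patience:
                break
    # Return the stored best parameters, not the latest ones.
    return best_params, best_error

# Toy example: training pushes the parameter toward 3.0, validation prefers 2.0.
train_step = lambda p: min(p + 0.5, 3.0)
validation_error = lambda p: (p - 2.0) ** 2

best_params, best_error = train_with_early_stopping(
    0.0, train_step, validation_error, max_steps=100, patience=2)
print(best_params, best_error)
```

Training continues past the best point until the patience budget runs out, but the returned parameters are those stored at the lowest validation error.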

One way to think of early stopping is as a very efficient hyperparameter selection algorithm. In this view, the number of training steps is just another hyperparameter.

Most hyperparameters that control model capacity have such a U-shaped validation set performance curve. In the case of early stopping, we are controlling the __effective capacity__ of the model by determining how many steps it can take to fit the training set. Most hyperparameters must be chosen using an expensive guess and check process, where we set a hyperparameter at the start of training, then run training for several steps to see its effect. The __“training time” hyperparameter__ is unique in that by definition a single run of training tries out many values of the hyperparameter.

Early stopping is a very unobtrusive form of regularization, in that it requires almost no change in the underlying training procedure, the objective function, or the set of allowable parameter values. This means that it is easy to use early stopping without damaging the learning dynamics. This is in contrast to weight decay, where one must be careful not to use too much weight decay and trap the network in a bad local minimum corresponding to a solution with pathologically small weights.

__How does early stopping act as a regularizer?__

Bishop (1995a) and Sjöberg and Ljung (1995) argued that early stopping has the effect of __restricting the optimization procedure to a relatively small volume of parameter space in the neighborhood of the initial parameter value $\theta_0$.__ More specifically, imagine taking $\tau$ optimization steps (corresponding to $\tau$ training iterations) with learning rate $\epsilon$. We can view __the product $\epsilon \tau$ as a measure of effective capacity.__ Assuming the gradient is bounded, restricting both the number of iterations and the learning rate limits the volume of parameter space reachable from $\theta_0$. In this sense, $\epsilon \tau$ behaves as if it were the reciprocal of the coefficient used for weight decay.
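To make the volume argument concrete, suppose the gradient norm is bounded by a constant $G$ (a symbol introduced here only for illustration) and plain gradient descent is used, so that $\theta_{t+1} = \theta_t - \epsilon \nabla_\theta J(\theta_t)$. Then the triangle inequality gives

\\[\lVert \theta_\tau - \theta_0 \rVert = \epsilon \left\lVert \sum_{t=0}^{\tau-1} \nabla_\theta J(\theta_t) \right\rVert \le \epsilon \sum_{t=0}^{\tau-1} \lVert \nabla_\theta J(\theta_t) \rVert \le \epsilon \tau G\\]

Under these assumptions, every reachable parameter vector lies in a ball of radius $\epsilon \tau G$ around $\theta_0$, which is one way to see why $\epsilon \tau$ acts like a capacity measure.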

__Early stopping is equivalent to L2 regularization:__

In the case of a simple linear model with a quadratic error function and simple gradient descent, early stopping is equivalent to L2 regularization.

Second-order Taylor series expansion of $J$ around $w^*$ (the first-order term vanishes because $w^*$ is a minimum):

\\[\hat J(w) = J(w^{*}) + \frac{1}{2}(w-w^*)^T H(w-w^*)\\]
- $H$ is the Hessian matrix of $J$ with respect to $w$ evaluated at $w^*$
- $w^* = \arg \min_w J(w)$
- $H$ is positive definite

Gradient of $\hat J(w)$

\\[\nabla_w \hat J(w) = H(w-w^*)\\]

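Apply the gradient descent weight update, using the eigendecomposition $H = QDQ^T$, where $Q$ is orthonormal (so $QQ^T = I$) and $D$ is diagonal with the eigenvalues $\lambda_i$ of $H$ on its diagonal: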

\begin{align}
w^{(\tau)} &= w^{(\tau-1)} - \epsilon \nabla_w \hat J(w^{(\tau-1)}) \\
&= w^{(\tau-1)} - \epsilon H(w^{(\tau-1)} - w^{*}) \\
w^{(\tau)} - w^{*} &= (I-\epsilon H)(w^{(\tau-1)} - w^{*}) \\
&= (I-\epsilon QDQ^T)(w^{(\tau-1)} - w^{*}) \\
&= (QQ^T-\epsilon QDQ^T)(w^{(\tau-1)} - w^{*})
\end{align}

Assume $w^{(0)} = 0$, let $\tau$ be the number of gradient descent steps, and choose $\epsilon$ small enough to guarantee $|1-\epsilon \lambda_i| < 1$ for every eigenvalue $\lambda_i$ of $H$.

\begin{align}
w^{(\tau)} - w^{*} &= Q(I-\epsilon D)Q^T(w^{(\tau-1)} - w^{*}) \\
Q^T(w^{(\tau)} - w^{*}) &= (I-\epsilon D)Q^T(w^{(\tau-1)} - w^{*}) \\
\text{Let } S_{\tau} &= Q^T (w^{(\tau)} - w^{*}) \\
\therefore S_{\tau} &= (I-\epsilon D) S_{\tau-1} \\
\Rightarrow S_{\tau} &= (I-\epsilon D)^{\tau} S_{0} \\
\Rightarrow Q^T (w^{(\tau)} - w^{*}) &= (I-\epsilon D)^{\tau} Q^T (w^{(0)} - w^{*}) \\
&= -(I-\epsilon D)^{\tau} Q^T w^{*} \\
Q^T w^{(\tau)} &= [ I-(I-\epsilon D)^{\tau} ] Q^T w^{*}
\end{align}
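The closed form can be sanity-checked numerically. The sketch below assumes a randomly generated symmetric positive definite matrix standing in for $H$ and a random $w^*$; the sizes, seed, and step count are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.normal(size=(n, n))
H = A @ A.T + 0.1 * np.eye(n)      # symmetric positive definite "Hessian" (assumed)
w_star = rng.normal(size=n)        # minimizer of the quadratic (assumed)

lam, Q = np.linalg.eigh(H)         # H = Q diag(lam) Q^T
eps = 0.9 / lam.max()              # guarantees |1 - eps * lam_i| < 1
tau = 50

# Run tau steps of gradient descent on the quadratic, starting from w^(0) = 0.
w = np.zeros(n)
for _ in range(tau):
    w = w - eps * H @ (w - w_star)

# Closed form: Q^T w^(tau) = [I - (I - eps D)^tau] Q^T w*
closed_form = (1.0 - (1.0 - eps * lam) ** tau) * (Q.T @ w_star)
print(np.allclose(Q.T @ w, closed_form))
```

Because $D$ is diagonal, the matrix power reduces to an elementwise power of the eigenvalues, which is why the closed form needs no matrix products beyond $Q^T w^*$.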

The solution under L2 regularization, derived in the earlier section on parameter norm penalties, can be rearranged into the same form:
\begin{align}
\tilde w &= Q(D+\alpha I)^{-1}D Q^T w^* \\
\Rightarrow Q^T\tilde w &= (D+\alpha I)^{-1}D Q^T w^* \\
\because (D+\alpha I)^{-1}D &= I - (D+\alpha I)^{-1} \alpha I\\
\therefore Q^T\tilde w &= [I - (D+\alpha I)^{-1} \alpha] Q^T w^* \\
\end{align}

Comparing the expressions derived for early stopping and for L2 regularization, if the hyperparameters $\tau$ (number of iterations), $\alpha$ (L2 regularization coefficient) and $\epsilon$ (learning rate) are chosen such that \\[(I-\epsilon D)^{\tau} = (D+\alpha I)^{-1} \alpha\\] then __early stopping is equivalent to L2 regularization__ (at least under the quadratic approximation of the objective function).

Going even further, by taking logarithms and using the series expansion for $\log(1+x)$, we can conclude that if all $\lambda_i$ are small (that is, $\epsilon \lambda_i \ll 1$ and $\frac{\lambda_i}{\alpha} \ll 1$) then \\[\alpha \approx \frac{1}{\tau \epsilon}\\]

Maclaurin series of $\log(1+x)$ (valid for $|x| < 1$), truncated to first order for small $x$: \\[ \log(1+x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4} + \cdots \approx x \\]

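Taking logarithms of both sides (elementwise on the diagonal, since every matrix involved is diagonal) and noting that $(D+\alpha I)^{-1}\alpha = (I+\frac{D}{\alpha})^{-1}$:
\begin{align}
\tau \log(I-\epsilon D) &= \log\left[\left(I+\frac{D}{\alpha}\right)^{-1}\right] \\
\tau \log(I-\epsilon D) &= -\log\left(I+\frac{D}{\alpha}\right) \\
\because \log(1+x) &\approx x \text{ around } x=0 \\
-\tau \epsilon D &\approx -\frac{D}{\alpha} \\
\therefore \alpha &\approx \frac{1}{\tau \epsilon}
\end{align}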

That is, under these assumptions, the number of training iterations $\tau$ plays a role inversely proportional to the L2 regularization parameter, and the inverse of $\epsilon \tau$ plays the role of the weight decay coefficient.
- $\tau$: number of training iterations.
- $\epsilon$: learning rate of gradient descent.
- $\alpha$: coefficient of the L2 regularization penalty.
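The quality of the approximation $\alpha \approx \frac{1}{\tau \epsilon}$ can be inspected numerically by comparing the per-eigendirection scaling factors from the two derivations: $1-(1-\epsilon\lambda_i)^{\tau}$ for early stopping and $\frac{\lambda_i}{\lambda_i+\alpha}$ for L2 regularization. The eigenvalues, learning rate, and step count below are illustrative values chosen so that $\epsilon\lambda_i \ll 1$ and $\frac{\lambda_i}{\alpha} \ll 1$.

```python
import numpy as np

lam = np.array([0.01, 0.02, 0.05])   # Hessian eigenvalues (illustrative values)
eps, tau = 0.1, 20                   # learning rate and number of steps (assumed)
alpha = 1.0 / (tau * eps)            # matched weight decay coefficient

# Factor multiplying each component of Q^T w* in the two solutions.
early_stop_scale = 1.0 - (1.0 - eps * lam) ** tau
l2_scale = lam / (lam + alpha)

for l, es, l2 in zip(lam, early_stop_scale, l2_scale):
    print(f"lambda={l:.2f}  early stopping={es:.4f}  L2={l2:.4f}")
```

The two factors agree closely for the smallest eigenvalue and drift apart as $\lambda_i$ grows, consistent with the small-$\lambda_i$ assumption behind the approximation.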

Parameter values corresponding to directions of significant curvature (of the objective function) are regularized less than directions of less curvature. Of course, in the context of early stopping, this really means that __parameters that correspond to directions of significant curvature tend to learn early relative to parameters corresponding to directions of less curvature.__
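A small sketch of this effect, using the factor $1-(1-\epsilon\lambda_i)^{\tau}$ from the derivation above as the fraction of $w^*$ recovered along each eigendirection after $\tau$ steps. The helper name `fraction_learned` and the eigenvalues are hypothetical, picked to span low to high curvature.

```python
import numpy as np

def fraction_learned(lam, eps, tau):
    """Fraction of each component of Q^T w* reached after tau gradient
    steps from w = 0, under the quadratic approximation."""
    return 1.0 - (1.0 - eps * lam) ** tau

lam = np.array([0.01, 0.1, 1.0])     # low, medium, high curvature (illustrative)
eps = 0.5                            # learning rate (assumed), eps * lam < 1

for tau in (1, 10, 100):
    print(tau, np.round(fraction_learned(lam, eps, tau), 3))
```

In this setting the high-curvature direction is essentially fully learned within a handful of steps, while the low-curvature direction is still far from its optimum after 100 steps.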

<img src="ref/Fig7.4.png" width=70%>

Early stopping is of course more than the mere restriction of the trajectory length; instead, early stopping typically involves monitoring the validation set error in order to __stop the trajectory at a particularly good point in space.__ Early stopping therefore has the advantage over weight decay that __early stopping automatically determines the correct amount of regularization while weight decay requires many training experiments with different values of its hyperparameter.__