## Chap 7 Regularization for Deep Learning
We defined regularization as __any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.__

In the context of deep learning, most regularization strategies are based on regularizing estimators. Regularization of an estimator works by trading increased bias for reduced variance.

Deep learning algorithms are typically applied to extremely complicated domains such as images, audio sequences and text, for which the true generation process essentially involves simulating the entire universe. To some extent, we are always trying to fit a square peg (the data generating process) into a round hole (our model family).

What this means is that __controlling the complexity of the model__ is NOT a simple matter of finding the __model of the right size__, with the right number of parameters. Instead, we might find (and indeed, in practical deep learning scenarios, we almost always do find) that the __best fitting model__ (in the sense of minimizing generalization error) is a __large model that has been regularized appropriately.__

### 7.1 Parameter Norm Penalties
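Many regularization approaches are based on limiting the capacity of models by adding a parameter norm penalty $\Omega(\theta)$ to the objective function $J$. We denote the regularized objective function by $\tilde J$, where $\alpha \in [0, \infty)$ is a hyperparameter that weights the contribution of the penalty term relative to $J$ (setting $\alpha = 0$ means no regularization):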

$$\tilde J(\theta; X,y) = J(\theta; X,y) + \alpha \Omega(\theta)$$

For neural networks, we typically choose to use a parameter norm penalty that penalizes only the weights of the affine transformation at each layer and leaves the biases unregularized. The biases typically require less data to fit accurately than the weights.

Each weight specifies how two variables interact. Fitting the weight well requires observing both variables in a variety of conditions. Each __bias controls only a single variable.__ This means that __we do NOT induce too much variance by leaving the biases unregularized.__ Also, __regularizing the bias parameters can introduce a significant amount of underfitting.__

In the context of neural networks, it is sometimes desirable to __use a separate penalty with a different $\alpha$ coefficient for each layer of the network.__ Because it can be expensive to search for the correct value of multiple hyperparameters, it is still reasonable to use the same weight decay at all layers just to __reduce the size of the search space.__
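The weights-only penalty and the per-layer coefficients described above can be sketched as follows. This is a minimal illustration, not a reference implementation: it takes $\Omega$ to be half the squared $L^2$ norm (one common choice), and the `(W, b)` layer layout, the function name `l2_penalty`, and all numeric values are assumptions made for the example.

```python
import numpy as np

def l2_penalty(params, alphas):
    """Return sum over layers of alpha_l * 0.5 * ||W_l||^2.

    params: list of (W, b) pairs, one per layer (hypothetical layout).
    alphas: one coefficient per layer; pass equal values to share one alpha.
    """
    total = 0.0
    for (W, b), alpha in zip(params, alphas):
        # Only the weights W are penalized; the bias b is left unregularized.
        total += alpha * 0.5 * np.sum(W ** 2)
    return total

# Hypothetical two-layer network with deliberately large biases.
params = [
    (np.array([[1.0, -2.0], [0.5, 0.0]]), np.array([10.0, 10.0])),
    (np.array([[3.0], [4.0]]), np.array([-7.0])),
]

shared = l2_penalty(params, alphas=[0.01, 0.01])    # same alpha in every layer
per_layer = l2_penalty(params, alphas=[0.01, 0.1])  # stronger decay in layer 2
print(shared, per_layer)
```

Sharing one coefficient leaves a single hyperparameter to tune, while the per-layer variant adds one search dimension per layer. In both cases the large bias values have no effect on the penalty.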

#### 7.1.1 $L^2$ Parameter Regularization
The $L^2$ parameter norm penalty is commonly known as __weight decay__. This regularization strategy __drives the weights closer to the origin__ by adding a regularization term $\Omega(\theta) = \frac{1}{2} \|w\|_2^2$ to the objective function.

We can gain some insight into the behavior of weight decay regularization by studying the gradient of the regularized objective function. To simplify the presentation, we assume no bias parameter, so $\theta$ is just w. 

__Objective function:__ 
$$\tilde J(w;X,y) = \frac{\alpha}{2} w^T w + J(w;X,y)$$

__Gradient:__ 
$$\nabla_w \tilde J(w;X,y) = \alpha w + \nabla_w J(w;X,y)$$

__Weight update:__ 
$$w \leftarrow w - \epsilon  (\alpha w + \nabla_w J(w; X, y))$$

$$w \leftarrow (1- \epsilon \alpha)w - \epsilon \nabla_w J(w; X, y)$$

We can see that the addition of the weight decay term has modified the learning rule to multiplicatively shrink the weight vector by a __constant factor__ on each step, just before performing the usual gradient update. 
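As a quick numerical check of the two equivalent forms of the update, the sketch below applies one step in each form and compares the results. The function names and the values of `w`, `grad`, `eps`, and `alpha` are hypothetical, and `grad` stands in for $\nabla_w J(w; X, y)$.

```python
import numpy as np

def step_explicit(w, grad, eps, alpha):
    # w <- w - eps * (alpha * w + grad): gradient step on the regularized objective
    return w - eps * (alpha * w + grad)

def step_shrink(w, grad, eps, alpha):
    # w <- (1 - eps * alpha) * w - eps * grad: shrink first, then the usual step
    return (1.0 - eps * alpha) * w - eps * grad

w = np.array([1.0, -2.0, 0.5])
grad = np.array([0.1, 0.0, -0.3])  # hypothetical value of the unregularized gradient
eps, alpha = 0.1, 0.5

w_a = step_explicit(w, grad, eps, alpha)
w_b = step_shrink(w, grad, eps, alpha)
print(w_a, w_b)
```

With these values the shrink factor is $1 - \epsilon\alpha = 0.95$, so each weight loses 5% of its magnitude before the ordinary gradient step is applied.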

We will further simplify the analysis by making a __quadratic approximation__ to the objective function in the neighborhood of the value of the weights that obtains minimal unregularized training cost, $w^* = \arg \min_w J(w)$.

> To simplify the analysis, we approximate the objective function by its second-order Taylor expansion around the minimum.

Taylor series expansion around $w^*$, where $g$ is the gradient of $J$ evaluated at $w^*$:
\begin{align}
J(w) & \approx J(w^*) + (w-w^*)^T g + \frac{1}{2} (w-w^*)^T H (w-w^*) \\
& (\because w^* = \arg \min_w J(w) \quad \therefore g = 0) \\
\hat J(w) & = J(w^*) + \frac{1}{2}(w-w^*)^T H(w-w^*)
\end{align}

> $\hat J$ is the approximation of $J$ in the neighborhood of $w^*$.

where H is the Hessian matrix of J with respect to w evaluated at $w^*$. Because $w^*$ is the location of a minimum of J, we can conclude that __H is positive semidefinite.__

The minimum of J occurs where its gradient
$$\nabla_w J(w) \approx \nabla_w \hat J(w) = H(w-w^*)$$
is equal to zero. 

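To study the effect of weight decay, we add the weight decay gradient $\alpha w$ to the gradient of $\hat J$ and solve for the minimum of the regularized approximation. We use the variable $\tilde w = \arg \min_w \tilde J(w)$ to represent the location of the minimum of $\tilde J$.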

\begin{align}
\nabla_w \tilde J(w) &= \alpha w + H(w-w^*) \\
\Rightarrow &\alpha \tilde w + H(\tilde w - w^*) = 0 \\
\Rightarrow & (H+\alpha I)\tilde w = H w^* \\
\Rightarrow & \tilde w = (H+\alpha I)^{-1} H w^* \\ 
\end{align}

As $\alpha$ approaches 0, the regularized solution $\tilde w$ approaches $w^*$.
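The closed-form solution can be sanity-checked numerically. The sketch below assumes a small, hypothetical positive definite `H` and an arbitrary `w_star`. It solves $(H + \alpha I)\tilde w = H w^*$ with `np.linalg.solve` rather than forming the inverse, then confirms that the regularized gradient vanishes at $\tilde w$ and that $\tilde w \to w^*$ as $\alpha \to 0$.

```python
import numpy as np

def w_tilde(H, w_star, alpha):
    # Solve (H + alpha * I) w = H w_star, i.e. w = (H + alpha I)^{-1} H w_star.
    n = len(w_star)
    return np.linalg.solve(H + alpha * np.eye(n), H @ w_star)

# Hypothetical quadratic model: symmetric positive definite Hessian and minimum.
H = np.array([[4.0, 1.0],
              [1.0, 3.0]])
w_star = np.array([1.0, -2.0])
alpha = 0.5

w_reg = w_tilde(H, w_star, alpha)

# Gradient of the regularized quadratic approximation at w_reg (should be ~0).
grad_at_w_reg = alpha * w_reg + H @ (w_reg - w_star)

# With a very small alpha the solution should be close to w_star.
w_small_alpha = w_tilde(H, w_star, 1e-8)

print(w_reg, grad_at_w_reg, w_small_alpha)
```

In this example the regularized solution also has a smaller norm than `w_star`, which matches the picture of weight decay pulling the solution toward the origin.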

Because H is __real and symmetric__, we can decompose it into a diagonal matrix D, whose diagonal entries are the eigenvalues $\lambda_i$ of H, and an orthonormal basis of eigenvectors, Q, such that $H = QDQ^T$ and $QQ^T=Q^TQ=I$.

\begin{align}
\tilde w &= (QDQ^T + \alpha I)^{-1} QDQ^T w^* \\ 
&= (QDQ^T + Q \alpha Q^T)^{-1} QDQ^T w^* \\
&= \left[ Q(D+\alpha I)Q^T \right]^{-1} QDQ^T w^* \\
&= Q(D+\alpha I)^{-1} DQ^T w^* \\
\end{align}

We see that the effect of weight decay is to rescale $w^*$ along the axes defined by the eigenvectors of H.

Specifically, the component of $w^*$ that is aligned with the i-th eigenvector of H is rescaled by a factor of $\frac{\lambda_i}{\lambda_i + \alpha}$, where $\lambda_i$ is the corresponding eigenvalue.

Along the directions where the eigenvalues of H are relatively large ($\lambda_i \gg \alpha$), the effect of regularization is relatively small. Components with $\lambda_i \ll \alpha$, on the other hand, are shrunk to nearly zero magnitude.

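Writing $w^*_i$ and $\tilde w_i$ for the components of $w^*$ and $\tilde w$ along the i-th eigenvector of H (the i-th entries of $Q^T w^*$ and $Q^T \tilde w$):

\begin{align}
\lambda_i \gg \alpha: & \\
& \tilde w_i = \frac{\lambda_i}{\lambda_i + \alpha} w^*_i \approx w^*_i \\
\lambda_i \ll \alpha: & \\
& \tilde w_i = \frac{\lambda_i}{\lambda_i + \alpha} w^*_i \approx 0
\end{align}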
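The rescaling factor $\frac{\lambda_i}{\lambda_i + \alpha}$ can also be observed numerically. The sketch below builds a hypothetical Hessian with one large and one small eigenvalue, computes the regularized solution, and compares the shrinkage of each component in the eigenbasis with the predicted factor. The rotation angle, eigenvalues, `w_star`, and `alpha` are illustrative assumptions.

```python
import numpy as np

# Hypothetical Hessian with eigenvalues 100 and 0.01 along rotated axes.
theta = np.pi / 6
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
lam = np.array([100.0, 0.01])
H = Q @ np.diag(lam) @ Q.T

w_star = np.array([1.0, 1.0])
alpha = 1.0

# Regularized minimum of the quadratic approximation.
w_reg = np.linalg.solve(H + alpha * np.eye(2), H @ w_star)

# eigh returns eigenvalues in ascending order, eigenvectors as columns.
evals, evecs = np.linalg.eigh(H)

# Coordinates of both solutions in the eigenbasis of H.
coords_star = evecs.T @ w_star
coords_reg = evecs.T @ w_reg

ratios = coords_reg / coords_star   # observed per-direction shrinkage
expected = evals / (evals + alpha)  # predicted lambda_i / (lambda_i + alpha)
print(ratios, expected)
```

The direction with eigenvalue 100 keeps about 99% of its component, while the direction with eigenvalue 0.01 keeps about 1%. Weight decay therefore mostly removes the components along which the objective is nearly flat.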