# Regularization for Deep Learning

A central problem in machine learning is how to make an algorithm that will perform well not just on the training data, but also on new inputs. Many strategies used in machine learning are explicitly designed to reduce the test error, possibly at the expense of increased training error. These strategies are known collectively as regularization.

**Definition**: We define regularization as “any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.”

In the context of deep learning, most regularization strategies are based on regularizing estimators. Regularization of an estimator works by trading increased bias for reduced variance. An effective regularizer is one that makes a profitable trade, reducing variance significantly while not overly increasing the bias.
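The bias-variance trade can be made concrete with a toy estimator. The sketch below is an illustration only and is not taken from the text above: it assumes we estimate a mean `mu` from `n` samples with variance `sigma2`, using the shrunken estimator `c * sample_mean`. The function name `shrinkage_mse` and all numeric values are hypothetical.

```python
def shrinkage_mse(c, mu, sigma2, n):
    """Return (bias^2, variance, MSE) of the estimator c * sample_mean.

    Assumes n i.i.d. samples with true mean mu and variance sigma2.
    c = 1 is the unregularized sample mean; c < 1 shrinks toward zero.
    """
    bias_sq = ((c - 1.0) * mu) ** 2   # squared bias grows as c moves away from 1
    variance = c * c * sigma2 / n     # variance shrinks as c moves toward 0
    return bias_sq, variance, bias_sq + variance


# Illustrative values: mu = 1, sigma2 = 4, n = 4.
for c in (1.0, 0.5, 0.0):
    print(c, shrinkage_mse(c, mu=1.0, sigma2=4.0, n=4))
```

With these values, moderate shrinkage (`c = 0.5`) halves the mean squared error relative to the unbiased estimator, while shrinking all the way to zero (`c = 0.0`) gives the gain back: the trade is only profitable when the variance removed outweighs the bias added.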

Deep learning algorithms are typically applied to extremely complicated domains such as images, audio sequences and text, for which the true generation process essentially involves simulating the entire universe. To some extent, we are always trying to fit a square peg (the data-generating process) into a round hole (our model family).

We might find, and in practical deep learning scenarios almost always do find, that the best-fitting model (in the sense of minimizing generalization error) is a large model that has been regularized appropriately.

## Parameter Norm Penalties

Many regularization approaches are based on limiting the capacity of models, such as neural networks, linear regression, or logistic regression, by adding a parameter norm penalty $\Omega(\theta)$ to the objective function $J$. We denote the regularized objective function by $\tilde{J}$:

$$
\tilde{J}(\theta;\textbf{X},y) = J(\theta;\textbf{X},y) + \alpha\Omega(\theta)
$$

where $\alpha\in[0,\infty)$ is a hyperparameter that weights the contribution of the norm penalty term $\Omega$ relative to the standard objective function $J$. Setting $\alpha = 0$ results in no regularization. Larger values of $\alpha$ correspond to more regularization. When the training algorithm minimizes the regularized objective function $\tilde{J}$, it decreases both the original objective $J$ on the training data and some measure of the size of the parameters $\theta$ (or some subset of the parameters).
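A minimal sketch of the regularized objective, assuming a linear model with mean squared error for $J$ and half the squared $L^2$ norm for $\Omega$. Both choices, and the names `mse`, `half_squared_l2`, and `regularized_objective`, are illustrative assumptions rather than anything fixed by the definition above.

```python
def mse(w, X, y):
    """Assumed unregularized objective J: mean squared error of a linear model."""
    predictions = [sum(wi * xi for wi, xi in zip(w, row)) for row in X]
    return sum((p - t) ** 2 for p, t in zip(predictions, y)) / len(X)


def half_squared_l2(w):
    """Assumed penalty Omega(theta): half the squared L2 norm of the parameters."""
    return 0.5 * sum(wi * wi for wi in w)


def regularized_objective(w, X, y, alpha, penalty=half_squared_l2):
    """J_tilde = J + alpha * Omega, with alpha >= 0."""
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    return mse(w, X, y) + alpha * penalty(w)


# Toy data, chosen only for illustration.
X = [[1.0, 2.0], [3.0, 4.0]]
y = [1.0, 2.0]
w = [0.5, -0.5]

print(regularized_objective(w, X, y, alpha=0.0))  # equals the plain MSE
print(regularized_objective(w, X, y, alpha=0.1))  # MSE plus the weighted penalty
```

Passing the penalty as an argument keeps the sketch agnostic about which norm is used; only the weighting by `alpha` is common to all parameter norm penalties.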

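**Note:** For neural networks, we typically choose a parameter norm penalty $\Omega$ that penalizes only the weights of the affine transformation at each layer and leaves the biases unregularized. Biases typically require less data than weights to fit accurately: each weight specifies how two variables interact, whereas each bias controls only a single variable. This means that we do not induce too much variance by leaving the biases unregularized.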
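As a sketch of what "weights only" means in practice, the snippet below assumes each layer is stored as a dictionary with a weight matrix under `"W"` and a bias vector under `"b"`. This layout and the function name `weight_only_penalty` are hypothetical, chosen only for illustration.

```python
def weight_only_penalty(layers):
    """Sum 0.5 * ||W||^2 over all layers; bias vectors are deliberately skipped."""
    total = 0.0
    for layer in layers:
        for row in layer["W"]:
            total += 0.5 * sum(v * v for v in row)
        # layer["b"] is intentionally not included in the penalty.
    return total


# Toy two-layer network with large biases to show they have no effect.
layers = [
    {"W": [[1.0, -2.0], [0.5, 0.0]], "b": [10.0, -10.0]},
    {"W": [[3.0, 1.0]], "b": [100.0]},
]

print(weight_only_penalty(layers))
```

Changing any entry of `"b"` leaves the result unchanged, so the optimizer is free to fit the biases without any pull toward zero.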

## $L^2$ Parameter Regularization

One of the simplest and most common kinds of parameter norm penalty is the $L^2$ parameter norm penalty, commonly known as **weight decay**. This regularization strategy drives the weights closer to the origin by adding a regularization term $\Omega(\theta) = \frac{1}{2}\Vert w \Vert_2^2$ to the objective function. $L^2$ regularization is also known as **ridge regression**.

We can gain some insight into the behavior of weight decay regularization by studying the gradient of the regularized objective function. To simplify the presentation, we assume no bias parameter, so $\theta$ is just $w$. Such a model has the following total objective function:

$$
\tilde{J}(w;\textbf{X}, y) = \frac{\alpha}{2}w^Tw + J(w;\textbf{X}, y)
$$

with corresponding parameter gradient

$$
\nabla_w\tilde{J}(w;\textbf{X}, y) = \alpha w + \nabla_w J(w;\textbf{X}, y)
$$
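The gradient expression can be sanity-checked against central finite differences. The sketch below assumes a small hand-picked quadratic for the unregularized objective; the functions `J`, `grad_J`, and `numerical_grad` and all numeric values are hypothetical stand-ins.

```python
def J(w):
    """Hypothetical unregularized objective (any differentiable J would do)."""
    return (w[0] - 1.0) ** 2 + 3.0 * (w[1] + 2.0) ** 2


def grad_J(w):
    """Analytic gradient of the hypothetical J above."""
    return [2.0 * (w[0] - 1.0), 6.0 * (w[1] + 2.0)]


def J_tilde(w, alpha):
    """Regularized objective: (alpha / 2) * w^T w + J(w)."""
    return 0.5 * alpha * sum(v * v for v in w) + J(w)


def grad_J_tilde(w, alpha):
    """Analytic gradient: alpha * w + grad J(w)."""
    return [alpha * wi + gi for wi, gi in zip(w, grad_J(w))]


def numerical_grad(f, w, h=1e-6):
    """Central finite-difference approximation of the gradient of f at w."""
    grad = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad


w = [0.7, -1.3]
alpha = 0.5
analytic = grad_J_tilde(w, alpha)
numeric = numerical_grad(lambda v: J_tilde(v, alpha), w)
print(analytic, numeric)
```

The two gradients should agree to within finite-difference error, which confirms that the penalty contributes exactly $\alpha w$ to the gradient.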

To take a single gradient step to update the weights, we perform this update:

$$
w \leftarrow w - \epsilon(\alpha w + \nabla_w J(w;\textbf{X}, y))
$$
$$
\iff
$$
$$
w \leftarrow (1-\epsilon\alpha)w -\epsilon\nabla_w J(w;\textbf{X}, y)
$$
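A short sketch confirming that the two forms of the update produce the same weights. The gradient values, learning rate, and function names here are made up for illustration; any gradient vector would work.

```python
def step_penalty_form(w, grad, eps, alpha):
    """w <- w - eps * (alpha * w + grad): the penalty appears inside the gradient."""
    return [wi - eps * (alpha * wi + gi) for wi, gi in zip(w, grad)]


def step_decay_form(w, grad, eps, alpha):
    """w <- (1 - eps * alpha) * w - eps * grad: shrink first, then the usual step."""
    return [(1.0 - eps * alpha) * wi - eps * gi for wi, gi in zip(w, grad)]


w = [2.0, -4.0]
grad = [0.3, -0.1]      # stand-in for the gradient of J at w
eps, alpha = 0.1, 0.5   # illustrative learning rate and penalty weight

a = step_penalty_form(w, grad, eps, alpha)
b = step_decay_form(w, grad, eps, alpha)
print(a, b)
```

With `eps * alpha = 0.05`, each step first scales the weights by 0.95 and then applies the ordinary gradient step, which is where the name "weight decay" comes from.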

We can see that the addition of the weight decay term has modified the learning rule to multiplicatively shrink the weight vector by a constant factor on each step, just before performing the usual gradient update. This describes what happens in a single step. But what happens over the entire course of training?

We will further simplify the analysis by making a quadratic approximation to the objective function in the neighborhood of the value of the weights that obtains minimal unregularized training cost, $w^* = \arg\min_w J(w)$. If the objective function is truly quadratic, as in the case of fitting a linear regression model with mean squared error, then the approximation is perfect. The approximation $\hat{J}$ is given by

$$
\hat{J}(w) = J(w^*) + \frac{1}{2}(w-w^*)^T\textbf{H}(w-w^*)
$$

where $\textbf{H}$ is the Hessian matrix of $J$ with respect to $w$ evaluated at $w^*$. There is no first-order term in this quadratic approximation, because $w^*$ is defined to be a minimum, where the gradient vanishes. Likewise, because $w^*$ is the location of a minimum of $J$, we can conclude that $\textbf{H}$ is positive semidefinite. The minimum of $\hat{J}$ occurs where its gradient

$$
\nabla_w\hat{J}(w) = \textbf{H} (w-w^*)
$$

is equal to $\textbf{0}$.
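To illustrate the claim that the approximation is exact for a truly quadratic objective, the sketch below assumes $J(w) = \frac{1}{2}w^T\textbf{H}w - b^Tw$ with a hand-picked positive definite $\textbf{H}$. The matrix, the vector `b`, and the helper names are hypothetical values chosen so that the minimizer can be verified by hand.

```python
H = [[2.0, 1.0], [1.0, 3.0]]   # assumed Hessian (symmetric, positive definite)
b = [1.0, 2.0]
w_star = [0.2, 0.6]            # minimizer of J: solves H w = b


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(M, v):
    return [dot(row, v) for row in M]


def J(w):
    """Hypothetical quadratic objective: 0.5 * w^T H w - b^T w."""
    return 0.5 * dot(w, matvec(H, w)) - dot(b, w)


def J_hat(w):
    """Quadratic approximation around w_star: J(w*) + 0.5 (w - w*)^T H (w - w*)."""
    d = [wi - si for wi, si in zip(w, w_star)]
    return J(w_star) + 0.5 * dot(d, matvec(H, d))


def grad_J_hat(w):
    """Gradient of the approximation: H (w - w*)."""
    d = [wi - si for wi, si in zip(w, w_star)]
    return matvec(H, d)


w = [1.5, -2.0]
print(J(w), J_hat(w))        # the two values agree because J is quadratic
print(grad_J_hat(w_star))    # the gradient vanishes at w_star
```

For a non-quadratic $J$ the two values would agree only near $w^*$, which is why the analysis is stated as holding in a neighborhood of the minimum.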