# Penalize Large Weights With Weight Regularization

Neural networks learn a set of weights that best map inputs to outputs. Large weights can be a sign of an unstable network, where small changes in the input lead to large changes in the output. This can indicate that the network has overfit the training dataset and will likely perform poorly when making predictions on new data. A solution to this problem is to update the learning algorithm to encourage the network to keep the weights small. This is called weight regularization, and it can be used as a general technique to reduce overfitting of the training dataset and improve the generalization of the model.

In this tutorial, you will discover weight regularization as an approach to reduce overfitting for neural networks. After reading this tutorial, you will know:
* Large weights in a neural network are a sign of a more complex network that may have overfit the training data.
* Penalizing a network based on the size of the network weights during training can reduce overfitting.
* An L1 or L2 vector norm penalty can be added to the optimization of the network to encourage smaller weights.

## Weight Regularization

In this section, you will discover the problem with neural networks with large weights, a technique that you can use to encourage the development of models with smaller weights called weight regularization, and tips for using this technique in your projects.

### Problem With Large Weights

When fitting a neural network model, we must learn the weights of the network (i.e., the model parameters) using stochastic gradient descent and the training dataset. The longer we train the network, the more specialized the weights become to the training data, until the model overfits it. The weights grow in size to handle the specifics of the examples seen in the training data. Large weights make the network unstable: although the weights are specialized to the training dataset, minor variation or statistical noise in the inputs will result in large differences in the output.
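The instability described above can be illustrated with a minimal sketch. It assumes a single linear node with no bias or activation, and the weight and input values are made up for illustration. The same small change in one input is applied under small and large weights.

```python
def linear_output(weights, inputs):
    """Weighted sum of the inputs for a single node (no bias, no activation)."""
    return sum(w * x for w, x in zip(weights, inputs))

# Hypothetical weights: same direction, different magnitude.
small_weights = [0.5, -0.3]
large_weights = [50.0, -30.0]

x = [1.0, 2.0]
x_noisy = [1.01, 2.0]  # the first input is perturbed by 0.01

change_small = abs(linear_output(small_weights, x_noisy) - linear_output(small_weights, x))
change_large = abs(linear_output(large_weights, x_noisy) - linear_output(large_weights, x))

print(change_small)  # about 0.005
print(change_large)  # about 0.5
```

With weights 100 times larger, the same input noise moves the output 100 times further.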

Generally, we refer to this model as having a large variance and a small bias. The model is sensitive to the specific examples in the training dataset, including their statistical noise. A model with large weights is more complex than a model with smaller weights, and is a sign of a network that may be overly specialized to the training data. In practice, we prefer to choose the simpler model to solve a problem (following Occam's razor), so we prefer models with smaller weights.

Another possible issue is that there may be many input variables, each with different levels of relevance to the output variable. Sometimes we can use methods to select input variables, but the interrelationships between variables are often not obvious. Having small weights or even zero weights for less relevant or irrelevant inputs to the network will allow the model to focus on learning from the inputs that matter. This, too, will result in a simpler model.

### Encourage Small Weights

The learning algorithm can be updated to encourage the network to use small weights. One way to do this is to change the calculation of the loss used in optimizing the network so that it also considers the size of the weights. Remember that when we train a neural network, we minimize a loss function, such as the log loss in classification or mean squared error in regression. In calculating the loss between the predicted and expected values in a batch, we can add the current size of all weights in the network, or of the weights in a single layer, to this calculation. This is called a penalty because we are penalizing the model in proportion to the size of its weights.
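As a rough sketch of this idea, the functions below add a weight size term to a mean squared error loss. The function names and example values are hypothetical, and the size is measured here as the sum of absolute weight values, which is only one possible choice.

```python
def mean_squared_error(expected, predicted):
    """Average squared difference between expected and predicted values."""
    return sum((e - p) ** 2 for e, p in zip(expected, predicted)) / len(expected)

def weight_size(weights):
    """Size of the weights, measured here as the sum of their absolute values."""
    return sum(abs(w) for w in weights)

def penalized_loss(expected, predicted, weights):
    """Loss on the batch plus a penalty that grows with the size of the weights."""
    return mean_squared_error(expected, predicted) + weight_size(weights)

expected = [1.0, 0.0, 1.0]
predicted = [0.5, 0.0, 1.0]

# Identical predictions, but the second model uses larger weights.
loss_small = penalized_loss(expected, predicted, weights=[0.5, -0.25])
loss_large = penalized_loss(expected, predicted, weights=[4.0, -2.0])
```

Both calls share the same prediction error, so the difference between `loss_large` and `loss_small` comes entirely from the penalty.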

The optimization algorithm will then push the model to have smaller weights, i.e., weights no larger than needed to perform well on the training dataset. Larger weights result in a larger penalty in the form of a larger loss score. Smaller weights are considered more regular or less specialized, and as such, we refer to this penalty as weight regularization. When this approach of penalizing model coefficients is used in other machine learning models such as linear regression or logistic regression, it may be referred to as shrinkage because the penalty encourages the coefficients to shrink during the optimization process.

Adding a weight size penalty, or weight regularization, to a neural network can reduce generalization error and allow the model to pay less attention to less relevant input variables.

### How to Penalize Large Weights

There are two parts to penalizing the model based on the size of the weights. The first is the calculation of the size of the weights, and the second is the amount of attention that the optimization process should pay to the penalty.

**Calculate Weight Size**

Neural network weights are real values that can be positive or negative, so simply adding the weights is not sufficient. There are two main approaches used to calculate the size of the weights:

* Calculate the sum of the absolute values of the weights, called the L1 norm (or L<sup>1</sup>).
* Calculate the sum of the squared values of the weights, which is the square of the L2 norm (or L<sup>2</sup>).

L1 encourages weights toward 0.0 where possible, resulting in sparser weights (more weights with 0.0 values). L2 offers more nuance: it penalizes larger weights more severely but results in less sparse weights. The use of L2 in linear and logistic regression is often referred to as Ridge Regression. This is useful to know when developing an intuition for the penalty or looking for examples of its usage.
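The two size calculations can be sketched in a few lines of plain Python. The function names and weight values are illustrative only.

```python
def l1_size(weights):
    """Sum of the absolute values of the weights."""
    return sum(abs(w) for w in weights)

def l2_size(weights):
    """Sum of the squared values of the weights."""
    return sum(w ** 2 for w in weights)

weights = [0.5, -0.5, 2.0, -2.0]

naive_sum = sum(weights)  # 0.0: positive and negative weights cancel out
l1 = l1_size(weights)     # 5.0
l2 = l2_size(weights)     # 8.5

# A weight of 2.0 contributes 4 times as much as 0.5 under L1 (2.0 vs 0.5),
# but 16 times as much under L2 (4.0 vs 0.25).
```

The naive sum shows why the signs must be removed first, and the last comment shows how the squared form penalizes larger weights more severely.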

The weights may be considered a vector, and in linear algebra the magnitude of a vector is called its norm. As such, penalizing the model based on the size of the weights is also referred to as a weight or parameter norm penalty. It is possible to include both the L1 and L2 approaches to calculating the size of the weights in the penalty. This is akin to the Elastic Net algorithm for linear and logistic regression, which uses both penalties. The L2 approach is perhaps the most used and is traditionally referred to as weight decay in neural networks. It is called shrinkage in statistics, a name that encourages you to think of the impact of the penalty on the model weights during the learning process.

Recall that each node has input weights and a bias weight. The bias weight is generally not included in the penalty because its input is constant.

**Control Impact of the Penalty**

The calculated size of the weights is added to the loss objective function when training the network. Rather than adding the size of the weights to the loss directly, it can be scaled by a new hyperparameter called alpha (α) or sometimes lambda. This controls the amount of attention that the learning process should pay to the penalty. Put another way, it controls how much to penalize the model based on the size of the weights. The alpha hyperparameter has a value between 0.0 (no penalty) and 1.0 (full penalty). This hyperparameter controls the amount of bias in the model from 0.0, or low bias (high variance), to 1.0, or high bias (low variance).
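To see the effect of alpha, consider a minimal sketch that assumes a one-weight model `y ~ w * x` trained with mean squared error plus `alpha * w**2`. For this toy case the best weight has a closed form, obtained by setting the derivative of the penalized loss to zero. The data values are made up for illustration.

```python
def best_weight(xs, ys, alpha):
    """Weight w minimizing mean((y - w*x)**2) + alpha * w**2 for the model y ~ w*x."""
    n = len(xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + n * alpha)

# Hypothetical data that roughly follows y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

for alpha in [0.0, 0.01, 0.1, 1.0]:
    print(f"alpha={alpha}: w={best_weight(xs, ys, alpha):.4f}")
```

As alpha grows, the fitted weight shrinks further below the unpenalized fit, trading variance for bias.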

If the penalty is too strong, the model will underestimate the weights and underfit the problem. If the penalty is too weak, the model will be allowed to overfit the training data. The vector norm of the weights is often calculated per layer rather than across the entire network. This allows more flexibility in choosing the type of regularization used (e.g., L1 for inputs, L2 elsewhere) and flexibility in the alpha value, although it is common to use the same alpha value on each layer by default.
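A per-layer penalty might be sketched as follows. The layer structure (a dictionary holding weights, biases, a penalty function, and an alpha) is a hypothetical layout chosen for illustration, not the API of any particular library. Bias weights are stored but deliberately left out of the penalty.

```python
def l1_size(weights):
    """Sum of the absolute values of the weights."""
    return sum(abs(w) for w in weights)

def l2_size(weights):
    """Sum of the squared values of the weights."""
    return sum(w ** 2 for w in weights)

# Hypothetical two-layer network: L1 on the input layer, L2 on the next layer,
# each with its own alpha.
layers = [
    {"weights": [0.8, -0.1, 0.0, 0.3], "biases": [0.5, -0.5],
     "penalty": l1_size, "alpha": 0.01},
    {"weights": [1.5, -2.0], "biases": [0.2],
     "penalty": l2_size, "alpha": 0.001},
]

def network_penalty(layers):
    """Sum of the per-layer penalties; bias weights are not included."""
    return sum(layer["alpha"] * layer["penalty"](layer["weights"]) for layer in layers)

total_penalty = network_penalty(layers)  # 0.01 * 1.2 + 0.001 * 6.25
```

The total would then be added to the data loss during training.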

### Tips for Using Weight Regularization

This section provides some tips for using weight regularization with your neural network.

**Use With All Network Types**

Weight regularization is a generic approach. It can be used with most, perhaps all, types of neural network models, not least the most common network types of Multilayer Perceptrons, Convolutional Neural Networks, and Long Short-Term Memory Recurrent Neural Networks. In LSTMs, it may be desirable to use different penalties or penalty configurations for the input and recurrent connections.

**Standardize Input Data**

It is generally good practice to rescale input variables to have the same scale. When input variables have different scales, the scale of the network's weights will, in turn, vary accordingly. This introduces a problem when using weight regularization because the absolute or squared values of the weights are added together in the penalty, so weights that are large only because of the scale of their input are penalized disproportionately. This problem can be addressed by either normalizing or standardizing the input variables.
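Both rescaling options can be sketched with the standard library. The variable names and values below are hypothetical, and the sketch assumes each variable has more than one distinct value (otherwise the divisions would fail).

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

def normalize(values):
    """Rescale values to the range [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

# Two hypothetical input variables on very different scales.
age_years = [20.0, 30.0, 40.0, 50.0]
income_dollars = [20000.0, 40000.0, 60000.0, 80000.0]

age_std = standardize(age_years)
income_std = standardize(income_dollars)
age_norm = normalize(age_years)
```

After standardizing, both variables are on the same scale, so their weights contribute to the penalty on comparable terms.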

**Use a Larger Network**

It is common for larger networks (more layers or more nodes) to overfit the training data more easily. When using weight regularization, it is possible to use larger networks with less risk of overfitting. A good configuration strategy may be to start with larger networks and use weight decay.

**Grid Search Parameters**

It is common to use small values for the regularization hyperparameter that controls the contribution of each weight to the penalty. Perhaps start by testing values on a log scale, such as 0.1, 0.01, 0.001, and 0.0001. Then use a grid search at the order of magnitude that shows the most promise.
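The two-stage search can be sketched as below. Note that `validation_error` is a made-up toy curve standing in for the real work of training a model with a given alpha and scoring it on a validation set. The coarse list uses one value per order of magnitude between 0.1 and 0.0001, and the multipliers for the fine grid are arbitrary.

```python
import math

def validation_error(alpha):
    """Hypothetical stand-in for training a model with this alpha and scoring it
    on a validation set. A toy curve with its minimum near alpha = 0.005."""
    return (math.log10(alpha) + 2.3) ** 2

def best_alpha(candidates, score):
    """Return the candidate with the lowest score."""
    return min(candidates, key=score)

# Coarse pass: one value per order of magnitude.
coarse = [0.1, 0.01, 0.001, 0.0001]
best_coarse = best_alpha(coarse, validation_error)

# Fine pass: a small grid around the most promising order of magnitude.
fine = [best_coarse * m for m in (0.25, 0.5, 1.0, 2.0, 4.0)]
best_fine = best_alpha(fine, validation_error)

print(best_coarse, best_fine)
```

In a real project, the scoring function would be the expensive step, which is why the coarse pass is kept short.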

**Use L1 + L2 Together**

Rather than trying to choose between L1 and L2 penalties, use both. Modern and effective linear regression methods such as the Elastic Net use both L1 and L2 penalties at the same time, and this can be a useful approach to try. This gives you both the nuance of L2 and the sparsity encouraged by L1.
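A combined penalty could be sketched as a weighted sum of the two sizes. The function name and the two separate alpha parameters are assumptions made for illustration, and other parameterizations are possible.

```python
def combined_penalty(weights, alpha_l1=0.01, alpha_l2=0.01):
    """Weighted sum of the L1 and L2 weight sizes, in the style of the Elastic Net."""
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w ** 2 for w in weights)
    return alpha_l1 * l1 + alpha_l2 * l2

weights = [0.0, 0.5, -3.0]
penalty = combined_penalty(weights)  # 0.01 * 3.5 + 0.01 * 9.25
```

Setting either alpha to 0.0 recovers a pure L1 or pure L2 penalty, which makes it easy to compare all three options in a search.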

**Use on a Trained Network**

The use of weight regularization may allow more elaborate training schemes. For example, a model may be fit on training data first without any regularization, then updated later with the use of a weight penalty to reduce the size of the weights of the already well-performing model.
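A two-phase scheme of this kind might look like the following sketch. It assumes a one-weight model `y ~ w * x` trained by plain gradient descent. The data, learning rate, and alpha value are made up for illustration.

```python
def train(w, xs, ys, alpha, learning_rate=0.01, epochs=2000):
    """Gradient descent on mean((y - w*x)**2) + alpha * w**2, starting from weight w."""
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w.
        grad_loss = sum(-2.0 * x * (y - w * x) for x, y in zip(xs, ys)) / n
        # Gradient of the penalty term alpha * w**2.
        grad_penalty = 2.0 * alpha * w
        w -= learning_rate * (grad_loss + grad_penalty)
    return w

# Hypothetical data that roughly follows y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

w_pretrained = train(0.0, xs, ys, alpha=0.0)          # phase 1: no regularization
w_finetuned = train(w_pretrained, xs, ys, alpha=0.1)  # phase 2: continue with a penalty

print(w_pretrained, w_finetuned)
```

The second phase starts from the already fitted weight and pulls it toward a smaller value.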

## Weight Regularization Case Study