# Sparse Representations with Activity Regularization

Deep learning models are capable of automatically learning a rich internal representation from raw input data. This is called feature or representation learning. Better learned representations, in turn, can lead to better insights into the domain, e.g., via visualization of learned features, and to better predictive models that make use of the learned features.

A problem with learned features is that they can be too specialized to the training data or overfit and not generalize well to new examples. Large values in the learned representation can be a sign of the representation being overfitted. Activity or representation regularization provides a technique to encourage the learned representations, the output or activation of the hidden layer or layers of the network, to stay small and sparse.

In this tutorial, you will discover activation regularization to improve the generalization of learned features in neural networks. After reading this tutorial, you will know:

* Neural networks learn features from data, and models such as autoencoders and encoder-decoder models explicitly seek effective learned representations.
* Similar to weights, large values in learned features, e.g., large activations, may indicate an overfit model.
* Adding a penalty to the loss function that penalizes a model in proportion to the magnitude of its activations may result in more robust and generalized learned features.

## Activity Regularization

In this section, you will discover the problem with neural networks that have large activations, a technique called activity regularization that you can use to encourage models with sparse activity, and tips for using this technique in your projects.

### Problem With Learned Features

Deep learning models can perform feature learning. The model will automatically extract the salient features from the input patterns or learn features during the network training. These features may be used in the network to predict a quantity for regression or a class value for classification. The output of a hidden layer within the network represents the model's learned features at that point. These internal representations are tangible things.

There is a field of study focused on the efficient and effective automatic learning of features. It is often investigated by having one network reduce an input to a small learned feature before a second network reconstructs the original input from that learned feature. Models of this type are called autoencoders, or encoder-decoders, and their learned features can be useful to learn more about the domain (e.g., via visualization) and in predictive models. The learned features, or encoded inputs, must be large enough to capture the salient features of the input but also focused enough to not overfit the specific examples in the training dataset. As such, there is a tension between the expressiveness and the generalization of the learned features.

In the same way that large weights in the network can signify an unstable and overfit model, large output values in the learned features can signify the same problems. It is desirable to have small values in the learned features, e.g., small outputs or activations from the encoder network.

### Encourage Small Activations

The loss function of the network can be updated to penalize models in proportion to the magnitude of their activations. This is similar to weight regularization, where the loss function is updated to penalize the model in proportion to the magnitude of the weights. The output of a layer is referred to as its activation or activity. As such, this form of penalty or regularization is referred to as activation regularization or activity regularization.
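As a sketch of this idea, the penalized objective can be written as follows. The symbols are notation assumed here for illustration and do not come from a specific library: $\mathcal{L}_{\text{data}}$ is the original loss, $h$ is the activation vector of the regularized layer, $\Omega$ is a measure of the magnitude of that vector, and $\lambda \ge 0$ controls how strongly the penalty is weighted.

```latex
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{data}} + \lambda \, \Omega(h)
```

With $\lambda = 0$ the original loss is recovered, and larger values of $\lambda$ push the model harder toward small activations.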

The output of an encoder or, generally, the output of a hidden layer in a neural network may represent the problem at that point in the model. As such, this type of penalty may also be referred to as representation regularization. The desire to have small activations, or very few activations with mostly zero values, is also called a desire for sparsity, so this type of penalty is also referred to as sparse feature learning. An autoencoder model in which sparse learned features are encouraged is referred to as a sparse autoencoder.

Sparsity is most commonly sought when a larger-than-required (i.e., overcomplete) hidden layer is used to learn features, which may encourage overfitting. The introduction of a sparsity penalty counters this problem and encourages better generalization. A sparse overcomplete learned feature can be more effective than other types of learned features, offering better robustness to noise and even to transforms in the input, e.g., learned features of images may have improved invariance to the position of objects in the image.

There is a general focus on the sparsity of the representations rather than small vector magnitudes. A study of these representations that is more general than neural networks is known as sparse coding.

### How to Encourage Small Activations

An activation penalty can be applied per-layer, perhaps only at one layer that focuses on the learned representation, such as the output of the encoder model or the middle (bottleneck) of an autoencoder model. A constraint can be applied that adds a penalty proportional to the magnitude of the vector output of the layer. The activation values may be positive or negative, so we cannot simply sum the values. Two common methods for calculating the magnitude of the activation are:

* Sum of the absolute activation values, called the L1 vector norm.
* Sum of the squared activation values, called the squared L2 vector norm.

The L1 norm encourages sparsity, e.g., allowing some activations to become zero, whereas the L2 norm generally encourages small activation values. The L1 norm may be the more commonly used penalty for activation regularization. A hyperparameter must be specified that indicates the amount or degree to which the loss function will weigh or pay attention to the penalty. Common values are on a logarithmic scale between 0 and 0.1, such as 0.1, 0.001, 0.0001, etc. Activity regularization can be used in conjunction with other regularization techniques, such as weight regularization.
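The two penalties and the weighting hyperparameter described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a library implementation: the function names, the `coefficient` parameter, and the example activation values are all hypothetical.

```python
def l1_penalty(activations):
    """Sum of the absolute activation values."""
    return sum(abs(a) for a in activations)


def l2_penalty(activations):
    """Sum of the squared activation values."""
    return sum(a * a for a in activations)


def penalized_loss(data_loss, activations, coefficient, penalty=l1_penalty):
    """Add the weighted activation penalty to the original loss."""
    return data_loss + coefficient * penalty(activations)


# Hypothetical activations from one hidden layer for a single example.
hidden = [0.0, -2.0, 0.5, 0.0, 1.5]

print(l1_penalty(hidden))  # 4.0
print(l2_penalty(hidden))  # 6.5
print(penalized_loss(0.25, hidden, coefficient=0.001))
```

Note that the squared penalty is dominated by the largest activation, while the absolute-value penalty treats every unit of magnitude the same regardless of which activation it comes from.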

### Tips for Using Activation Regularization

This section provides some tips for using activation regularization with your neural network.

**Use With All Network Types**

Activation regularization is a generic approach. It can be used with most, perhaps all, types of neural network models, not least the most common network types of Multilayer Perceptrons, Convolutional Neural Networks, and Long Short-Term Memory Recurrent Neural Networks.

**Use With Autoencoders and Encoder-Decoders**

Activity regularization may be best suited to those model types that explicitly seek an efficient learned representation. These include autoencoders (i.e., sparse autoencoders) and encoder-decoder models, such as encoder-decoder LSTMs used for sequence-to-sequence prediction problems.

**Experiment With Different Norms**

The most common activation regularization is the L1 norm, as it encourages sparsity. Experiment with other types of regularization, such as the L2 norm, or with using both the L1 and L2 norms simultaneously, as is done in the Elastic Net linear regression algorithm.
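Using both norms at once amounts to a weighted sum of the two penalties. The sketch below assumes separate coefficients for each term; the function name and the default values are hypothetical and chosen only for illustration.

```python
def l1_l2_penalty(activations, l1=0.001, l2=0.001):
    """Weighted sum of the absolute-value and squared-value penalties."""
    abs_sum = sum(abs(a) for a in activations)
    sq_sum = sum(a * a for a in activations)
    return l1 * abs_sum + l2 * sq_sum


# Hypothetical activations; the coefficients are exaggerated for readability.
print(l1_l2_penalty([1.0, -2.0], l1=0.5, l2=0.25))  # 2.75
```

Setting either coefficient to zero recovers the single-norm penalty, which makes it easy to compare all three variants in one experiment.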

**Use Rectified Linear Activation**

The rectified linear activation function, also called relu, is an activation function that is now widely used in the hidden layers of deep neural networks. Unlike classical activation functions such as tanh (hyperbolic tangent function) and sigmoid (logistic function), the relu function easily allows exact zero values. This makes it a good candidate when learning sparse representations, such as with the L1 vector norm activation regularization.
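The difference is easy to see on a handful of values. The sketch below applies each function to the same hypothetical pre-activation values and counts how many outputs are exactly zero; the helper `fraction_exact_zero` is introduced here for illustration only.

```python
import math


def relu(x):
    """Rectified linear activation: negative inputs map to exactly zero."""
    return max(0.0, x)


def sigmoid(x):
    """Logistic function: output is always strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def fraction_exact_zero(values):
    """Fraction of values that are exactly 0.0."""
    return sum(1 for v in values if v == 0.0) / len(values)


# Hypothetical weighted inputs to five hidden nodes.
pre_activations = [-3.0, -1.2, -0.1, 0.4, 2.5]

relu_out = [relu(x) for x in pre_activations]
tanh_out = [math.tanh(x) for x in pre_activations]
sigmoid_out = [sigmoid(x) for x in pre_activations]

print(fraction_exact_zero(relu_out))     # 0.6
print(fraction_exact_zero(tanh_out))     # 0.0
print(fraction_exact_zero(sigmoid_out))  # 0.0
```

Every negative input becomes an exact zero under relu, whereas tanh is zero only for an input of exactly zero and sigmoid never reaches zero.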

**Grid Search Parameters**

It is common to use small values for the regularization hyperparameter that controls the contribution of each activation to the penalty. Perhaps start by testing values on a log scale, such as 0.1, 0.001, and 0.0001. Then use a grid search at the order of magnitude that shows the most promise.
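One way to organize this two-stage search is sketched below. It assumes a hypothetical `evaluate` callable that fits a model with the given coefficient and returns a validation loss (lower is better); here a toy stand-in function is used so that the example runs on its own.

```python
def coarse_then_fine(evaluate, coarse=(0.1, 0.001, 0.0001)):
    """Pick the best coarse value, then search within its order of magnitude."""
    best_coarse = min(coarse, key=evaluate)
    # Fine grid: 1x to 9x the best coarse value.
    fine = [best_coarse * k for k in range(1, 10)]
    best_fine = min(fine, key=evaluate)
    return best_coarse, best_fine


def toy_validation_loss(coefficient):
    """Stand-in for training and scoring a model (assumed optimum near 0.003)."""
    return abs(coefficient - 0.003)


best_coarse, best_fine = coarse_then_fine(toy_validation_loss)
print(best_coarse, best_fine)
```

In a real project, each call to `evaluate` is a full training run, so the size of the fine grid is a trade-off between search resolution and compute time.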

**Standardize Input Data**

It is generally good practice to rescale input variables to have the same scale. When input variables have different scales, the scale of the network's weights will, in turn, vary accordingly. Large weights can saturate the nonlinear transfer function and reduce the variance in the output from the layer. This may introduce a problem when using activation regularization, since the penalty depends on the magnitude of the layer's outputs. This problem can be addressed by either normalizing or standardizing input variables.
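Both rescaling options can be sketched for a single input variable using only the standard library. The function names are hypothetical, and the sketch assumes the variable is not constant (otherwise the divisor would be zero).

```python
import statistics


def normalize(values):
    """Rescale values to the range [0, 1] (min-max normalization)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to zero mean and unit standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]


# Hypothetical raw values for one input variable.
raw = [10.0, 20.0, 30.0]

print(normalize(raw))  # [0.0, 0.5, 1.0]
print(standardize(raw))
```

In practice, the minimum, maximum, mean, and standard deviation should be estimated on the training data only and then reused to transform validation and test data.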

**Use an Overcomplete Representation**

Configure the layer chosen to be the learned features, e.g., the output of the encoder or the bottleneck in the autoencoder, to have more nodes than may be required. This is called an overcomplete representation, and it will encourage the network to overfit the training examples. This can be countered with strong activation regularization to encourage a rich learned representation that is also sparse.

## Activity Regularization Case Study