# Decouple Layers With Dropout

Deep learning neural networks are likely to quickly overfit a training dataset with few examples. Ensembles of neural networks with different model configurations reduce overfitting but require the additional computational expense of training and maintaining multiple models.

A single model can be used to simulate having a large number of different network architectures by randomly dropping out nodes during training. This is called dropout and offers a very computationally cheap and remarkably effective regularization method to reduce overfitting and improve generalization error in deep neural networks of all kinds.

In this tutorial, you will discover the use of dropout regularization for reducing overfitting and improving the generalization of deep neural networks. After reading this tutorial, you will know:

* Large weights in a neural network signify a more complex network that has overfit the training data.
* Probabilistically dropping out nodes in the network is a simple and effective regularization method.
* A large network with more training epochs and the use of a weight constraint is suggested when using dropout.

## Dropout

In this section, you will discover how a technique called dropout lets a single model simulate a large ensemble of neural network models, how you can use it to reduce overfitting, and tips for using this technique on your projects.

### Problem With Overfitting

Large neural nets trained on relatively small datasets can overfit the training data. This has the effect of the model learning the statistical noise in the training data, which results in poor performance when the model is evaluated on new data, e.g., a test dataset. Generalization error increases due to overfitting. One approach to reduce overfitting is to fit all possible neural networks on the same dataset and average the predictions from each model. This is not feasible in practice, but it can be approximated using a small collection of different models, called an ensemble. Even with the ensemble approximation, multiple models must be fit and stored, which can be a challenge if the models are large, requiring days or weeks to train and tune.

### Randomly Drop Nodes

Dropout is a regularization method that approximates training a large number of neural networks with different architectures in parallel. During training, some number of node outputs are randomly ignored or dropped out. This makes the layer look like and be treated like a layer with a different number of nodes and connectivity to the prior layer. In effect, each update to a layer during training is performed with a different view of the configured layer.
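The random dropping of node outputs described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function name `dropout_forward`, the `keep_prob` parameter (the probability of retaining a node), and the example activations are all assumptions made for this sketch.

```python
import random


def dropout_forward(activations, keep_prob, rng):
    """Zero each activation independently with probability 1 - keep_prob.

    Hypothetical helper for illustration; keep_prob is the probability of
    retaining a node's output.
    """
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    # Draw a fresh Bernoulli mask: 1 keeps the node's output, 0 drops it.
    mask = [1 if rng.random() < keep_prob else 0 for _ in activations]
    dropped = [a * m for a, m in zip(activations, mask)]
    return dropped, mask


rng = random.Random(0)
layer_output = [0.2, 1.5, 0.7, 0.9, 1.1, 0.4]  # made-up activations

# Each training step sees a different "thinned" view of the same layer.
for step in range(3):
    thinned, mask = dropout_forward(layer_output, keep_prob=0.5, rng=rng)
    print(step, mask, thinned)
```

Because a new mask is drawn on every call, each update is effectively performed on a layer with a different subset of active nodes.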

Dropout has the effect of making the training process noisy, forcing nodes within a layer to take on more or less responsibility for the inputs probabilistically. This conceptualization suggests that perhaps dropout breaks up situations where network layers co-adapt to correct mistakes from prior layers, making the model more robust.

Dropout simulates a sparse activation from a given layer, which interestingly, in turn, encourages the network to learn a sparse representation as a side-effect. It may be used as an alternative to activity regularization for encouraging sparse representations in autoencoder models.

Because the outputs of a layer under dropout are randomly subsampled, dropout has the effect of reducing the capacity of, or thinning, the network during training. As such, a wider network, e.g., more nodes, may be required when using dropout.

### How to Dropout

Dropout is implemented per layer in a neural network. It can be used with most types of layers, such as dense fully connected layers, convolutional layers, and recurrent layers such as the long short-term memory network layer. Dropout may be implemented on any or all hidden layers in the network and the visible or input layer. It is not used on the output layer.

Dropout is not used after training when making a prediction with the fit network. The weights of the network will be larger than normal because of dropout. Therefore, before finalizing the network, the weights are first scaled by the chosen dropout rate, that is, multiplied by the probability with which a node was retained during training. The network can then be used as per normal to make predictions.
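To see why the weights are scaled down before prediction, the sketch below compares the average input a downstream node receives during training (with dropout) against the input it receives at prediction time with scaled weights. The function names, the activations, the weights, and the retention probability of 0.8 are all assumptions chosen for illustration.

```python
import random


def scale_weights_for_inference(weights, keep_prob):
    """Multiply trained weights by the retention probability (hypothetical helper)."""
    return [w * keep_prob for w in weights]


def weighted_sum(activations, weights):
    """Input to a single downstream node."""
    return sum(a * w for a, w in zip(activations, weights))


keep_prob = 0.8  # assumed probability of retaining a node during training
activations = [0.5, 1.0, 2.0, 0.25]  # made-up outputs of the previous layer
trained_weights = [0.4, -0.3, 0.6, 1.2]  # made-up weights learned under dropout

# Estimate the average input seen during training, when nodes are dropped.
rng = random.Random(42)
trials = 50_000
total = 0.0
for _ in range(trials):
    masked = [a if rng.random() < keep_prob else 0.0 for a in activations]
    total += weighted_sum(masked, trained_weights)
average_training_input = total / trials

# At prediction time no nodes are dropped, so the weights are scaled instead.
inference_input = weighted_sum(
    activations, scale_weights_for_inference(trained_weights, keep_prob)
)

print(average_training_input, inference_input)
```

The two printed values should be close to each other, which is the point of the rescaling: the downstream node sees inputs on the same scale at prediction time as it did, on average, during training.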

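The rescaling can be performed at training time instead. In this variant, the outputs of the retained nodes are scaled up by the inverse of the retention probability on each forward pass, so the expected input to the next layer is the same with and without dropout. This is sometimes called inverse (or inverted) dropout and does not require any modification of the weights when making predictions. Both the Keras and PyTorch deep learning libraries implement dropout in this way.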
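In Keras and PyTorch, the training-time rescaling is applied to the retained activations rather than to the weights. A minimal sketch of that idea follows; the function name `inverted_dropout`, its `keep_prob` and `training` parameters, and the sample values are assumptions for this example and do not mirror either library's internals.

```python
import random


def inverted_dropout(activations, keep_prob, rng, training=True):
    """Drop activations at random and rescale the survivors (hypothetical helper).

    keep_prob is the probability of retaining a node. Survivors are divided
    by keep_prob so the expected output equals the original activation.
    """
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    if not training:
        # Prediction time: pass activations through unchanged.
        return list(activations)
    scale = 1.0 / keep_prob
    return [a * scale if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(7)
activations = [1.0, 2.0, 3.0]  # made-up layer outputs
keep_prob = 0.8

# Average the training-time outputs over many random masks.
draws = 20_000
sums = [0.0] * len(activations)
for _ in range(draws):
    out = inverted_dropout(activations, keep_prob, rng)
    sums = [s + o for s, o in zip(sums, out)]
training_means = [s / draws for s in sums]

print(training_means)
print(inverted_dropout(activations, keep_prob, rng, training=False))
```

The averaged training outputs should land close to the unmodified activations, which is why no adjustment is needed once training is finished.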

Dropout works well in practice, perhaps replacing the need for weight regularization (e.g., weight decay) and activation regularization (e.g., representation sparsity).

### Tips for Using Dropout Regularization

This section provides some tips for using dropout regularization with your neural network.

**Use With All Network Types**

Dropout regularization is a generic approach. It can be used with most, perhaps all, types of neural network models, not least the most common network types of Multilayer Perceptrons, Convolutional Neural Networks, and Long Short-Term Memory Recurrent Neural Networks. In the case of LSTMs, it may be desirable to use different dropout rates for the input and recurrent connections.

**Dropout Rate**

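The default interpretation of the dropout hyperparameter is the probability of retaining (training) a given node in a layer, where 1.0 means no dropout and 0.0 means no outputs from the layer. Note that the Keras and PyTorch dropout layers use the opposite convention: their argument is the probability of dropping a node, so a retention probability of 0.8 corresponds to a value of 0.2. A good value for the retention probability in a hidden layer is between 0.5 and 0.8. Input layers use a larger retention probability, such as 0.8.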
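Because the Keras and PyTorch dropout layers expect the probability of dropping a node rather than the probability of retaining it, a small conversion step helps avoid mix-ups. The helper name below is hypothetical and only illustrates the arithmetic.

```python
def keep_prob_to_drop_rate(keep_prob):
    """Convert a retention probability into a drop probability (hypothetical helper)."""
    if not 0.0 <= keep_prob <= 1.0:
        raise ValueError("keep_prob must be in [0, 1]")
    return 1.0 - keep_prob


# Retention probabilities suggested in the text for hidden and input layers.
for keep_prob in (0.5, 0.8):
    drop_rate = keep_prob_to_drop_rate(keep_prob)
    print(f"retain {keep_prob:.1f} -> drop {drop_rate:.1f}")
```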

**Use a Larger Network**

It is common for larger networks (more layers or more nodes) to overfit the training data more easily. When using dropout regularization, it is possible to use larger networks with less risk of overfitting. A larger network (more nodes per layer) may be required as dropout will probabilistically reduce the network's capacity. A good rule of thumb is to divide the number of nodes in the layer before dropout by the proposed dropout rate (expressed as a retention probability) and use that as the number of nodes in the new network that uses dropout. For example, a network with 100 nodes and a proposed dropout rate of 0.5 will require 200 nodes ( $ \frac{100}{0.5} $ ) when using dropout.
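The rule of thumb above is simple arithmetic, sketched here as a small helper. The function name is hypothetical, and rounding to the nearest whole node is an assumption, since the text does not say how to handle fractional results.

```python
def widened_layer_size(nodes, keep_prob):
    """Number of nodes suggested by the rule of thumb: nodes / keep_prob.

    Hypothetical helper; keep_prob is the dropout rate expressed as the
    probability of retaining a node. Rounding is an assumption.
    """
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    return round(nodes / keep_prob)


print(widened_layer_size(100, 0.5))  # the example from the text
```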

**Grid Search Parameters**

Rather than guess at a suitable dropout rate for your network, test different rates systematically (for example, test values between 1.0 and 0.1 in increments of 0.1). This will help you discover what works best for your specific model and dataset and how sensitive the model is to the dropout rate. A more sensitive model may be unstable and could benefit from an increase in size.
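A systematic search over dropout rates can be sketched as a loop over candidate values. In the sketch below, `grid_search_dropout` and `placeholder_score` are hypothetical names; in real use, the `evaluate` callable would train your model with the given rate and return a validation score, whereas the placeholder here is an arbitrary function that only exists to make the example runnable and says nothing about real models.

```python
def grid_search_dropout(evaluate, rates):
    """Score every candidate rate and return the best one with all scores.

    Hypothetical helper: evaluate(rate) must return a score where higher
    is better, for example validation accuracy.
    """
    scores = {rate: evaluate(rate) for rate in rates}
    best_rate = max(scores, key=scores.get)
    return best_rate, scores


# Candidate retention probabilities: 1.0, 0.9, ..., 0.1.
rates = [i / 10 for i in range(10, 0, -1)]


def placeholder_score(rate):
    """Stand-in for a real train-and-validate run. NOT a real result."""
    return -((rate - 0.6) ** 2)


best_rate, scores = grid_search_dropout(placeholder_score, rates)
print(best_rate)
for rate, score in scores.items():
    print(rate, round(score, 3))
```

Inspecting how quickly the score changes between neighboring rates gives a rough sense of how sensitive the model is to this hyperparameter.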

**Use a Weight Constraint**

Network weights will increase in size in response to the probabilistic removal of layer activations. Large weight size can be a sign of an unstable network. To counter this effect, a weight constraint can be imposed to force the norm (magnitude) of all weights in a layer to be below a specified value. For example, the maximum norm constraint is recommended with a value between 3 and 4.
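A maximum norm constraint can be sketched as a rescaling step applied after each weight update. The helper below is a hypothetical illustration: it treats the constraint as applying to one node's vector of incoming weights, which is an assumption about granularity, and the example weights are made up.

```python
import math


def apply_max_norm(weights, max_norm=3.0):
    """Rescale a weight vector so its L2 norm does not exceed max_norm.

    Hypothetical helper; weights is assumed to be the incoming weight
    vector of a single node.
    """
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= max_norm:
        # Already within the constraint: leave the weights unchanged.
        return list(weights)
    scale = max_norm / norm
    return [w * scale for w in weights]


weights = [3.0, 4.0]  # made-up weights with an L2 norm of 5
constrained = apply_max_norm(weights, max_norm=3.0)
print(constrained)
print(math.sqrt(sum(w * w for w in constrained)))
```

Weights that already satisfy the constraint pass through untouched, so the step only intervenes when the norm has grown past the chosen limit.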

**Use With Smaller Datasets**

Like other regularization methods, dropout is more effective on problems where there is a limited amount of training data and the model is likely to overfit. Problems where there is a large amount of training data may see less benefit from using dropout.

## Dropout Case Study