In [1]:
# code for loading the format for the notebook
import os

# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir('../../notebook_format')
from formats import load_style
load_style()

In [4]:
# As usual, a bit of setup
os.chdir(path)

import time
import numpy as np
import cs231n.layers as layers
import matplotlib.pyplot as plt
from cs231n.solver import Solver
from cs231n.data_utils import get_CIFAR10_data
from cs231n.classifiers.fc_net import FullyConnectedNet
from cs231n.gradient_check import eval_numerical_gradient, eval_numerical_gradient_array

%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

def rel_error(x, y):
    """returns relative error"""
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))

# Dropout

**Dropout** is a simple and highly effective regularization technique introduced by Srivastava et al. in [Dropout: A Simple Way to Prevent Neural Networks from Overfitting](http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf) (pdf). It complements other regularization methods (L1, L2) for preventing overfitting. During training, dropout keeps each neuron active with some probability $p$ (a hyperparameter) and sets its output to zero otherwise.

![dropout](image/dropout.png)

Figure taken from the <a href="http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf">Dropout paper</a> that illustrates the idea. During training, dropout can be interpreted as sampling a sub-network from the full neural network and updating only the parameters of the sampled network on the current input data. During testing, no units are dropped.

Vanilla dropout in an example 3-layer Neural Network would be implemented as follows:

```python
""" Vanilla Dropout: Not recommended implementation (see notes below) """

p = 0.5 # probability of keeping a unit active. higher = less dropout

def train_step(X):
  """ X contains the data """
  
  # forward pass for example 3-layer neural network
  H1 = np.maximum(0, np.dot(W1, X) + b1)
  U1 = np.random.rand(*H1.shape) < p # first dropout mask
  H1 *= U1 # drop!
  H2 = np.maximum(0, np.dot(W2, H1) + b2)
  U2 = np.random.rand(*H2.shape) < p # second dropout mask
  H2 *= U2 # drop!
  out = np.dot(W3, H2) + b3
  
  # backward pass: compute gradients... (not shown)
  # perform parameter update... (not shown)
  
def predict(X):
  # ensembled forward pass
  H1 = np.maximum(0, np.dot(W1, X) + b1) * p # NOTE: scale the activations
  H2 = np.maximum(0, np.dot(W2, H1) + b2) * p # NOTE: scale the activations
  out = np.dot(W3, H2) + b3
  return out
```

In the code above, inside the `train_step` function we have performed dropout twice: on the first hidden layer and on the second hidden layer. It is also possible to perform dropout right on the input layer, in which case we would also create a binary mask for the input `X`. The backward pass is otherwise unchanged, but it has to apply the same masks `U1` and `U2` to the gradients flowing back through the dropped units.

Crucially, note that in the `predict` function we are not dropping anymore, but we are performing a scaling of both hidden layer outputs by $p$. This is important because at test time all neurons see all their inputs, so we want the outputs of neurons at test time to be identical to their expected outputs at training time. For example, in case of $p = 0.5$, the neurons must halve their outputs at test time to have the same output as they had during training time (in expectation). To see this, consider an output of a neuron $x$ (before dropout). With dropout, the expected output from this neuron will become $px + (1-p)0$, because the neuron's output will be set to zero with probability $1-p$. At test time, when we keep the neuron always active, we must adjust $x \rightarrow px$ to keep the same expected output.
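
The expectation argument above can be checked numerically. The following is a minimal sketch, not part of the original notebook: the activation value `x = 3.0`, the sample count, and the random seed are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
p = 0.5                         # probability of keeping a unit active
x = 3.0                         # hypothetical pre-dropout activation of one neuron

# Simulate many training-time forward passes for this single neuron:
# the output is x with probability p and 0 with probability 1 - p.
keep = rng.random(1_000_000) < p
train_outputs = x * keep

train_mean = train_outputs.mean()  # empirical estimate of p*x + (1-p)*0
test_output = p * x                # vanilla dropout scales by p at test time

print('mean train-time output:', train_mean)
print('scaled test-time output:', test_output)
```

The two printed values should agree up to sampling noise, which is the sense in which the test-time scaling keeps the expected output unchanged.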

The undesirable property of the scheme presented above is that we must scale the activations by $p$ at test time. Since test-time performance is so critical, it is generally preferable to use **inverted dropout**, which performs the scaling at train time and leaves the test-time forward pass untouched. This has the additional appeal that the prediction code stays the same when you change where dropout is applied, or whether it is applied at all. Inverted dropout looks as follows:

```python
""" 
Inverted Dropout: Recommended implementation example.
We drop and scale at train time and don't do anything at test time.
"""

p = 0.5 # probability of keeping a unit active. higher = less dropout

def train_step(X):
  # forward pass for example 3-layer neural network
  H1 = np.maximum(0, np.dot(W1, X) + b1)
  U1 = (np.random.rand(*H1.shape) < p) / p # first dropout mask. Notice /p!
  H1 *= U1 # drop!
  H2 = np.maximum(0, np.dot(W2, H1) + b2)
  U2 = (np.random.rand(*H2.shape) < p) / p # second dropout mask. Notice /p!
  H2 *= U2 # drop!
  out = np.dot(W3, H2) + b3
  
  # backward pass: compute gradients... (not shown)
  # perform parameter update... (not shown)
  
def predict(X):
  # ensembled forward pass
  H1 = np.maximum(0, np.dot(W1, X) + b1) # no scaling necessary
  H2 = np.maximum(0, np.dot(W2, H1) + b2)
  out = np.dot(W3, H2) + b3
  return out
```
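
To see why the prediction code can stay untouched, one can average many train-time passes of inverted dropout and compare the result with the unmasked test-time pass. The sketch below does this for a single linear readout. The shapes, the names `h` and `W`, the number of trials, and the seed are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)     # arbitrary seed for reproducibility
p = 0.5                            # probability of keeping a unit active
h = rng.random(20)                 # hypothetical non-negative hidden activations
W = rng.standard_normal((5, 20))   # hypothetical weights of the next layer

# Test-time forward pass: no mask and no scaling.
test_out = W @ h

# Many train-time passes: each row of `masks` is one inverted-dropout mask
# whose entries are either 0 (dropped) or 1/p (kept and rescaled).
num_trials = 100_000
masks = (rng.random((num_trials, h.size)) < p) / p
train_outs = (h * masks) @ W.T     # shape (num_trials, 5)

avg_train_out = train_outs.mean(axis=0)
max_gap = np.max(np.abs(avg_train_out - test_out))
print('largest gap between averaged train output and test output:', max_gap)
```

The gap shrinks as `num_trials` grows, since dividing the mask by `p` makes each masked activation equal to the unmasked one in expectation.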

In [5]:
# Load the (preprocessed) CIFAR10 data
data = get_CIFAR10_data()
for k, v in data.items():
    print( '%s: %s' % ( k, v.shape ) )

X_val: (1000, 3, 32, 32)
X_train: (49000, 3, 32, 32)
X_test: (1000, 3, 32, 32)
y_val: (1000,)
y_test: (1000,)
y_train: (49000,)


## Dropout forward pass

In this exercise you will implement a dropout layer and modify your fully-connected network to optionally use dropout.

In the file `cs231n/layers.py`, implement the forward pass for dropout. Since dropout behaves differently during training and testing, make sure to implement the operation for both modes. Once you have done so, run the cell below to test your implementation.

In [16]:
x = np.random.randn( 500, 500 ) + 10

for p in [ 0.3, 0.6, 0.75 ]:
    out, _ = layers.dropout_forward( x, {'mode': 'train', 'p': p} )
    out_test, _ = layers.dropout_forward( x, {'mode': 'test', 'p': p} )

    print( 'Running tests with p = ', p )
    print( 'Mean of input: ', x.mean() )
    print( 'Mean of train-time output: ', out.mean() )
    print( 'Mean of test-time output: ', out_test.mean() )
    print( 'Fraction of train-time output set to zero: ', (out == 0).mean() )
    print( 'Fraction of test-time output set to zero: ', (out_test == 0).mean() )
    print()

Running tests with p =  0.3
Mean of input:  9.99653355533
Mean of train-time output:  10.0178797768
Mean of test-time output:  9.99653355533
Fraction of train-time output set to zero:  0.699432
Fraction of test-time output set to zero:  0.0

Running tests with p =  0.6
Mean of input:  9.99653355533
Mean of train-time output:  10.0025390335
Mean of test-time output:  9.99653355533
Fraction of train-time output set to zero:  0.39948
Fraction of test-time output set to zero:  0.0

Running tests with p =  0.75
Mean of input:  9.99653355533
Mean of train-time output:  10.0032761818
Mean of test-time output:  9.99653355533
Fraction of train-time output set to zero:  0.249428
Fraction of test-time output set to zero:  0.0
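
The output above is consistent with an inverted-dropout layer in which `p` is the probability of keeping a unit: the train-time mean stays close to the input mean, and roughly a fraction $1 - p$ of the train-time outputs are zero. A minimal sketch of such a forward pass follows. It is one possible way to complete the exercise, not the course's reference solution; the function name `dropout_forward_sketch` and the layout of `cache` are assumptions.

```python
import numpy as np

def dropout_forward_sketch(x, dropout_param):
    """
    Inverted dropout forward pass (illustrative sketch).

    dropout_param keys:
    - 'p': probability of keeping a unit (assumed interpretation)
    - 'mode': 'train' or 'test'
    - 'seed': optional seed, makes the mask reproducible for gradient checks
    """
    p, mode = dropout_param['p'], dropout_param['mode']
    if 'seed' in dropout_param:
        np.random.seed(dropout_param['seed'])

    mask = None
    if mode == 'train':
        # Entries are 0 (dropped) or 1/p (kept and rescaled).
        mask = (np.random.rand(*x.shape) < p) / p
        out = x * mask
    elif mode == 'test':
        # Inverted dropout leaves the test-time pass untouched.
        out = x
    else:
        raise ValueError('unknown dropout mode: %r' % mode)

    cache = (dropout_param, mask)
    return out, cache
```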



## Dropout backward pass

In the file `cs231n/layers.py`, implement the backward pass for dropout. After doing so, run the following cell to numerically gradient-check your implementation.

In [18]:
x = np.random.randn(10, 10) + 10
dout = np.random.randn(*x.shape)

dropout_param = {'mode': 'train', 'p': 0.8, 'seed': 123}
out, cache = layers.dropout_forward(x, dropout_param)
dx = layers.dropout_backward(dout, cache)
dx_num = eval_numerical_gradient_array(lambda xx: layers.dropout_forward(xx, dropout_param)[0], x, dout)

print( 'dx relative error: ', rel_error(dx, dx_num) )

dx relative error:  5.44561441743e-11
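
Because dropout multiplies its input elementwise by a fixed mask, the backward pass only has to multiply the upstream gradient by that same mask. The following self-contained sketch illustrates this with its own small central-difference check. The helper names `forward` and `backward`, the array shape, and the seeds are hypothetical, and the mask is regenerated from a fixed seed so that every forward call drops the same units.

```python
import numpy as np

def forward(x, p, seed):
    """Inverted dropout with a seeded, hence reproducible, mask."""
    rng = np.random.default_rng(seed)
    mask = (rng.random(x.shape) < p) / p
    return x * mask, mask

def backward(dout, mask):
    """Gradient of the loss with respect to x, given upstream gradient dout."""
    return dout * mask

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5)) + 10
dout = rng.standard_normal(x.shape)
p, seed = 0.8, 123

out, mask = forward(x, p, seed)
dx = backward(dout, mask)

# Central-difference estimate of d(sum(out * dout)) / dx, one entry at a time.
step = 1e-5
dx_num = np.zeros_like(x)
for idx in np.ndindex(*x.shape):
    old = x[idx]
    x[idx] = old + step
    pos, _ = forward(x, p, seed)
    x[idx] = old - step
    neg, _ = forward(x, p, seed)
    x[idx] = old  # restore the original value
    dx_num[idx] = np.sum((pos - neg) * dout) / (2 * step)

max_abs_err = np.max(np.abs(dx - dx_num))
print('max absolute difference:', max_abs_err)
```

Fixing the seed matters here: if the mask changed between the two perturbed forward passes, the numerical gradient would compare outputs of two different sub-networks.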


# Fully-connected nets with Dropout
In the file `cs231n/classifiers/fc_net.py`, modify your implementation to use dropout. Specifically, if the constructor of the net receives a nonzero value for the `dropout` parameter, then the net should add dropout immediately after every ReLU nonlinearity. After doing so, run the following to numerically gradient-check your implementation.

In [21]:
N, D, H1, H2, C = 2, 15, 20, 30, 10
X = np.random.randn(N, D)
y = np.random.randint(C, size = (N,))

for dropout in [0, 0.25, 0.5]:
    print( 'Running check with dropout = ', dropout )
    model = FullyConnectedNet( [H1, H2], input_dim = D, num_classes = C,
                               weight_scale = 5e-2, dtype = np.float64,
                               dropout = dropout, seed = 123 )

    loss, grads = model.loss(X, y)
    print( 'Initial loss: ', loss )

    for name in sorted(grads):
        f = lambda _: model.loss(X, y)[0]
        grad_num = eval_numerical_gradient(f, model.params[name], verbose = False, h = 1e-5)
        print( '%s relative error: %.2e' % (name, rel_error(grad_num, grads[name])) )
    
    print()

Running check with dropout =  0
Initial loss:  2.30061511702
W1 relative error: 2.99e-07
W2 relative error: 1.36e-07
W3 relative error: 1.12e-07
b1 relative error: 2.94e-09
b2 relative error: 7.39e-09
b3 relative error: 9.26e-11

Running check with dropout =  0.25
Initial loss:  2.30178939633
W1 relative error: 4.22e-07
W2 relative error: 3.01e-06
W3 relative error: 9.30e-08
b1 relative error: 1.30e-08
b2 relative error: 1.56e-02
b3 relative error: 9.94e-11

Running check with dropout =  0.5
Initial loss:  2.30196485896
W1 relative error: 1.78e-06
W2 relative error: 2.59e-07
W3 relative error: 4.67e-08
b1 relative error: 6.58e-08
b2 relative error: 6.71e-09
b3 relative error: 1.15e-10
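
As an illustration of what "dropout immediately after every ReLU" means for a single hidden layer, here is a minimal forward-pass sketch. It is not the contents of `fc_net.py`: the function name `hidden_layer_forward`, the cache layout, and the convention that inputs have shape `(N, D)` and weights have shape `(D, H)` are assumptions, and `p` is taken to be the keep probability.

```python
import numpy as np

def hidden_layer_forward(x, W, b, dropout_param=None):
    """Affine -> ReLU -> (optional) inverted dropout for one hidden layer."""
    a = x.dot(W) + b          # affine transform
    r = np.maximum(0, a)      # ReLU nonlinearity
    mask = None
    if dropout_param is not None and dropout_param['mode'] == 'train':
        p = dropout_param['p']  # assumed to be the keep probability
        mask = (np.random.rand(*r.shape) < p) / p
        r = r * mask            # dropout applied right after the ReLU
    cache = (x, W, b, a, mask)  # values a backward pass would need
    return r, cache

np.random.seed(0)  # arbitrary seed for reproducibility
x = np.random.randn(4, 6)
W = np.random.randn(6, 8)
b = np.random.randn(8)

out_test, _ = hidden_layer_forward(x, W, b, {'mode': 'test', 'p': 0.5})
out_train, cache = hidden_layer_forward(x, W, b, {'mode': 'train', 'p': 0.5})
print(out_test.shape, out_train.shape)
```

In test mode the layer reduces to a plain affine-ReLU layer, so a network built from such layers needs no dropout-specific code at prediction time.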



# Regularization experiment
As an experiment, we will train a pair of two-layer networks on 500 training examples: one will use no dropout, and one will use a dropout probability of 0.75. We will then visualize the training and validation accuracies of the two networks over time.

In [None]:
# Train two identical nets, one with dropout and one without
num_train = 500
small_data = {
  'X_train': data['X_train'][:num_train],
  'y_train': data['y_train'][:num_train],
  'X_val': data['X_val'],
  'y_val': data['y_val'],
}

solvers = {}
dropout_choices = [ 0, 0.75 ]
for dropout in dropout_choices:
    model = FullyConnectedNet([500], dropout=dropout)
    print(dropout)

    solver = Solver(model, small_data,
                  num_epochs=25, batch_size=100,
                  update_rule='adam',
                  optim_config={
                    'learning_rate': 5e-4,
                  },
                  verbose=True, print_every=100)
    solver.train()
    solvers[dropout] = solver

In [None]:
# Plot train and validation accuracies of the two models

train_accs = []
val_accs = []
for dropout in dropout_choices:
    solver = solvers[dropout]
    train_accs.append(solver.train_acc_history[-1])
    val_accs.append(solver.val_acc_history[-1])

plt.subplot(3, 1, 1)
for dropout in dropout_choices:
    plt.plot(solvers[dropout].train_acc_history, 'o', label='%.2f dropout' % dropout)
plt.title('Train accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(ncol=2, loc='lower right')
  
plt.subplot(3, 1, 2)
for dropout in dropout_choices:
    plt.plot(solvers[dropout].val_acc_history, 'o', label='%.2f dropout' % dropout)
plt.title('Val accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(ncol=2, loc='lower right')

plt.gcf().set_size_inches(15, 15)
plt.show()

# Question
Explain what you see in this experiment. What does it suggest about dropout?

# Answer


## Reference

Stanford CS231n: Convolutional Neural Networks for Visual Recognition

- [Course Notes: Backpropagation](http://cs231n.github.io/optimization-2/)
- [Course Notes: Setting Up the Architecture](http://cs231n.github.io/neural-networks-1/)
- [Course Notes: Setting Up the Data and Loss](http://cs231n.github.io/neural-networks-2/)
- [Course Notes: Learning and Evaluation](http://cs231n.github.io/neural-networks-3/)
- [Course Notes: Putting it together: Minimal Neural Network Case Study](http://cs231n.github.io/neural-networks-case-study/)
- [Course Home Page](http://cs231n.stanford.edu/index.html)
- [Course Github](https://github.com/cs231n/cs231n.github.io)
- [Course Youtube](https://www.youtube.com/watch?v=NfnWJUyUJYU&list=PLkt2uSq6rBVctENoVBg1TpCC7OQi31AlC)