# Implementing DropOut with ``gluon``

In [the previous chapter](./P03-C03-mlp-dropout-scratch.ipynb), 
we introduced DropOut regularization, implementing the algorithm from scratch. 
As a reminder, DropOut is a regularization technique 
that zeroes out some fraction of the nodes during training. 
Then at test time, we use all of the nodes, but scale down their values,
essentially averaging the various dropped out nets. 
If you're approaching this chapter out of sequence
and aren't sure how DropOut works, it's best to take a look at the implementation by hand first,
since ``gluon`` will manage the low-level details for us.
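
As a quick refresher, here is a minimal plain-Python sketch of the behavior described above: zero out nodes at random while training, and keep every node but scale it down at test time. The helper name ``dropout`` and its arguments are hypothetical and are not part of ``gluon``. Many libraries instead use an "inverted" variant that rescales the surviving activations during training, so treat this as an illustration of the idea rather than of any particular implementation.

```python
import random


def dropout(values, drop_prob, training, rng=random):
    """Toy dropout on a list of floats (hypothetical helper, not a gluon API)."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if training:
        # Training: zero each activation independently with probability drop_prob.
        return [0.0 if rng.random() < drop_prob else v for v in values]
    # Prediction: keep every node, scaled down by the keep probability.
    keep_prob = 1.0 - drop_prob
    return [v * keep_prob for v in values]


activations = [1.0, 2.0, 3.0, 4.0]
print(dropout(activations, 0.5, training=True, rng=random.Random(0)))  # some entries zeroed
print(dropout(activations, 0.5, training=False))  # [0.5, 1.0, 1.5, 2.0]
```

Calling the function twice with ``training=True`` generally gives different results, while ``training=False`` is deterministic.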

DropOut is a special kind of layer because it behaves differently 
when training and predicting. 
We've already seen how ``gluon`` can keep track of when to record vs. not record the computation graph.
Since this is a ``gluon`` implementation chapter,
let's get into the thick of things by importing our dependencies and some toy data.



In [1]:
from __future__ import print_function
import mxnet as mx
from mxnet import nd, autograd
from mxnet import gluon
ctx = mx.gpu(3)


## The MNIST dataset

In [2]:
mnist = mx.test_utils.get_mnist()
batch_size = 64
num_inputs = 784
num_outputs = 10
train_data = mx.io.NDArrayIter(mnist["train_data"], mnist["train_label"],
                               batch_size, shuffle=True)
test_data = mx.io.NDArrayIter(mnist["test_data"], mnist["test_label"],
                              batch_size, shuffle=True)

## Define the model

Now we can add DropOut following each of our hidden layers. 

In [3]:
num_hidden = 256
net = gluon.nn.Sequential()
with net.name_scope():
    ###########################
    # Adding first hidden layer
    ###########################
    net.add(gluon.nn.Dense(num_hidden, activation="relu"))
    ###########################
    # Adding dropout with rate .5 to the first hidden layer
    ###########################
    net.add(gluon.nn.Dropout(.5))
    
    ###########################
    # Adding second hidden layer
    ###########################
    net.add(gluon.nn.Dense(num_hidden, activation="relu")) 
    ###########################
    # Adding dropout with rate .5 to the second hidden layer
    ###########################
    net.add(gluon.nn.Dropout(.5))
    
    ###########################
    # Adding the output layer
    ###########################
    net.add(gluon.nn.Dense(num_outputs))

## Parameter initialization

Now that we've got an MLP with Dropout, let's register an initializer 
so we can play with some data.

In [4]:
net.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx)

## Train mode and predict mode

Now that we have an MLP with DropOut, 
we can grab some data and pass it through the network.
We'll actually want to pass the example through the net twice,
just to see what effect DropOut is having on our predictions.

In [5]:
x = train_data.next().data[0].as_in_context(ctx)
print(net(x[0]))
print(net(x[0]))


[[ 0.21150076  0.30272889  0.17183581 -0.06442011  0.01675556 -0.02324573
  -0.04187176 -0.13473594  0.31143227 -0.13674253]]
<NDArray 1x10 @gpu(3)>

[[ 0.21150076  0.30272889  0.17183581 -0.06442011  0.01675556 -0.02324573
  -0.04187176 -0.13473594  0.31143227 -0.13674253]]
<NDArray 1x10 @gpu(3)>


Note that we got the exact same answer on both forward passes through the net!
That's because, by default, ``mxnet`` assumes that we are in predict mode. 
We can explicitly invoke this scope by placing code within a ``with autograd.predict_mode():`` block.

In [6]:
with autograd.predict_mode():
    print(net(x[0]))
    print(net(x[0]))


[[ 0.35026801  0.26806137  0.38593394  0.10388274  0.35403782  0.21958774
   0.09329571 -0.08700826  0.07742305 -0.29462805]]
<NDArray 1x10 @gpu(0)>

[[ 0.35026801  0.26806137  0.38593394  0.10388274  0.35403782  0.21958774
   0.09329571 -0.08700826  0.07742305 -0.29462805]]
<NDArray 1x10 @gpu(0)>


Unless something's gone horribly wrong, you should see the same result as before. 
We can also run the code in *train mode*.
This tells MXNet to run our Blocks as they would run during training.

In [7]:
with autograd.train_mode():
    print(net(x[0]))
    print(net(x[0]))


[[-0.24313435  0.13018197  0.62095457  0.19018309  0.29342416  0.378932
  -0.1570099   0.32591116  0.51068866 -0.23890349]]
<NDArray 1x10 @gpu(3)>

[[-0.67278039  0.42208618 -0.3246206  -0.04676402  0.26563731  0.3985112
  -0.20333797 -0.5953896  -0.01167    -0.2030338 ]]
<NDArray 1x10 @gpu(3)>


## Accessing ``is_training()`` status

You might wonder, how precisely do the Blocks determine 
whether they should run in train mode or predict mode?
Basically, autograd maintains a Boolean state 
that can be accessed via ``autograd.is_training()``. 
By default this value is ``False`` in the global scope.
This way if someone just wants to make predictions and 
doesn't know anything about training models, everything will just work.
When we enter a ``train_mode()`` block, 
we create a scope in which ``is_training()`` returns ``True``. 

In [8]:
with autograd.predict_mode():
    print(autograd.is_training())
    
with autograd.train_mode():
    print(autograd.is_training())

False
True
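
To build some intuition for how such a scoped Boolean can work, here is a toy sketch using Python context managers. It is only an assumption-laden illustration: the names mirror the ``autograd`` functions used above, but this is not MXNet's actual implementation. The point it shows is that leaving a scope restores whatever mode was active before, so scopes can be nested.

```python
from contextlib import contextmanager

_training = False  # toy global default: predict mode


def is_training():
    """Report the current toy mode."""
    return _training


@contextmanager
def _set_mode(flag):
    """Switch the mode for the duration of a with-block, then restore it."""
    global _training
    previous, _training = _training, flag
    try:
        yield
    finally:
        _training = previous


def train_mode():
    return _set_mode(True)


def predict_mode():
    return _set_mode(False)


print(is_training())          # False
with train_mode():
    print(is_training())      # True
    with predict_mode():
        print(is_training())  # False
    print(is_training())      # True
print(is_training())          # False
```

A layer such as DropOut would then only need to consult ``is_training()`` inside its forward pass to decide which behavior to apply.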


## Integration with ``autograd.record``

When we train neural network models,
we nearly always enter ``record()`` blocks.
The purpose of ``record()`` is to build the computational graph.
And the purpose of ``train_mode()`` is to indicate that we are training our model.
These two are highly correlated but should not be confused.
For example, when we generate adversarial examples (a topic we'll investigate later),
we may want to record the graph while the model behaves as in predict mode.
On the other hand, sometimes, even when we're not recording,
we still want to evaluate the model's training behavior.

A problem then arises. Since ``record()`` and ``train_mode()``
are distinct, how do we avoid having to declare two scopes every time we train the model?


In [10]:
##########################
#  Writing this every time could get cumbersome
##########################
with autograd.record():
    with autograd.train_mode():
        yhat = net(x)

To make our lives a little easier, ``record()`` takes one argument, ``train_mode``,
which has a default value of ``True``.
So when we turn on autograd, this by default turns on train mode
(``with autograd.record():`` is equivalent to
``with autograd.record(train_mode=True):``).
To change this default behavior
(as when generating adversarial examples),
we can instead enter the scope via
``with autograd.record(train_mode=False):``.
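
The relationship between the two flags can be pictured with a small toy model. The sketch below is hypothetical: ``state`` and ``snapshot`` are invented for illustration, and only the ``record(train_mode=...)`` signature mirrors what the text describes. It shows recording and training as two independent switches, with ``record()`` flipping both unless told otherwise.

```python
from contextlib import contextmanager

# Toy stand-in for autograd's internal state (assumed, not MXNet's real code).
state = {"recording": False, "training": False}


@contextmanager
def record(train_mode=True):
    """Turn recording on and set the training flag for the enclosed block."""
    previous = dict(state)
    state.update(recording=True, training=train_mode)
    try:
        yield
    finally:
        state.update(previous)


def snapshot():
    """Return (recording, training) as currently set."""
    return (state["recording"], state["training"])


with record():
    default_scope = snapshot()
with record(train_mode=False):
    adversarial_scope = snapshot()

print(default_scope)      # (True, True)
print(adversarial_scope)  # (True, False)
print(snapshot())         # (False, False)
```

The second case corresponds to the adversarial-example scenario: the graph is recorded, but layers like DropOut would behave as in predict mode.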

## Softmax cross-entropy loss

In [11]:
softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()

## Optimizer

In [12]:
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': .1})

## Evaluation metric

In [13]:
def evaluate_accuracy(data_iterator, net):
    acc = mx.metric.Accuracy()
    data_iterator.reset()
    for i, batch in enumerate(data_iterator):
        data = batch.data[0].as_in_context(ctx).reshape((-1, 784))
        label = batch.label[0].as_in_context(ctx)
        output = net(data)
        predictions = nd.argmax(output, axis=1)
        acc.update(preds=predictions, labels=label)
    return acc.get()[1]

## Training loop

In [14]:
epochs = 10
smoothing_constant = .01

for e in range(epochs):
    train_data.reset()
    for i, batch in enumerate(train_data):
        data = batch.data[0].as_in_context(ctx).reshape((-1, 784))
        label = batch.label[0].as_in_context(ctx)
        with autograd.record():
            output = net(data)
            loss = softmax_cross_entropy(output, label)
            loss.backward()
        trainer.step(data.shape[0])

        ##########################
        #  Keep a moving average of the losses
        ##########################
        curr_loss = nd.mean(loss).asscalar()
        moving_loss = (curr_loss if ((i == 0) and (e == 0)) 
                       else (1 - smoothing_constant) * moving_loss + (smoothing_constant) * curr_loss)

    test_accuracy = evaluate_accuracy(test_data, net)
    train_accuracy = evaluate_accuracy(train_data, net)
    print("Epoch %s. Loss: %s, Train_acc %s, Test_acc %s" %
          (e, moving_loss, train_accuracy, test_accuracy))

Epoch 0. Loss: 0.329100704774, Train_acc 0.937849813433, Test_acc 0.937300955414
Epoch 1. Loss: 0.248198956118, Train_acc 0.959005197228, Test_acc 0.958499203822
Epoch 2. Loss: 0.200681836518, Train_acc 0.967667244136, Test_acc 0.962977707006
Epoch 3. Loss: 0.179167188224, Train_acc 0.972098214286, Test_acc 0.968451433121
Epoch 4. Loss: 0.162256601359, Train_acc 0.975679637527, Test_acc 0.969844745223
Epoch 5. Loss: 0.149811580049, Train_acc 0.978294909382, Test_acc 0.972332802548
Epoch 6. Loss: 0.128341759589, Train_acc 0.981343283582, Test_acc 0.973128980892
Epoch 7. Loss: 0.126888067755, Train_acc 0.983142324094, Test_acc 0.976015127389
Epoch 8. Loss: 0.125234942328, Train_acc 0.985824227079, Test_acc 0.975915605096
Epoch 9. Loss: 0.112150726553, Train_acc 0.985157915778, Test_acc 0.977010350318
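
The loss printed for each epoch is not the raw loss of the final batch but the ``moving_loss`` from the training loop, an exponential moving average controlled by ``smoothing_constant``. The following standalone sketch isolates that update rule in plain Python; the function name ``smooth_losses`` is hypothetical and the input numbers are made up purely to make the arithmetic easy to follow.

```python
def smooth_losses(losses, smoothing_constant=0.01):
    """Return the running exponential moving average after each loss value."""
    moving = None
    history = []
    for loss in losses:
        if moving is None:
            # The first observation seeds the average, as in the training loop.
            moving = loss
        else:
            moving = (1 - smoothing_constant) * moving + smoothing_constant * loss
        history.append(moving)
    return history


print(smooth_losses([1.0, 0.0, 0.0], smoothing_constant=0.5))  # [1.0, 0.5, 0.25]
```

With the small constant used in the training loop (``.01``), each new batch moves the average only slightly, which is why the reported loss changes smoothly from epoch to epoch.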


## Conclusion

Now let's take a look at how to build convolutional neural networks.

For whinges or inquiries, [open an issue on GitHub.](https://github.com/zackchase/mxnet-the-straight-dope)