<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Dropout" data-toc-modified-id="Dropout-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Dropout</a></span></li><li><span><a href="#Test-Time" data-toc-modified-id="Test-Time-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Test Time</a></span></li><li><span><a href="#Inverted-Dropout" data-toc-modified-id="Inverted-Dropout-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Inverted Dropout</a></span></li><li><span><a href="#Implementing-Inverted-Dropout-from-Scratch" data-toc-modified-id="Implementing-Inverted-Dropout-from-Scratch-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Implementing Inverted Dropout from Scratch</a></span></li><li><span><a href="#CONCISE-IMPLEMENTATION" data-toc-modified-id="CONCISE-IMPLEMENTATION-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>CONCISE IMPLEMENTATION</a></span></li></ul></div>

## Dropout
The term <b>"dropout"</b> refers to dropping out units in a neural network. It is a technique for addressing overfitting. It consists of randomly dropping out some fraction of the nodes (setting fraction of the units to zero (injecting noise)) in each layer before calculating subsequent layer during training and has become a standard technique for training neural networks. When dropout is applied, during training its zeros out some fraction of the nodes with probability p in each layer before calculating the subsequent layer and the resulting network can be viewed as a subset of the original network. Because the fraction of the nodes that are drop out are chosen randomly on every pass, the representations in each layer can't depend on the exact values taken by nodes in the previous layer. 

<b> Dropout rate</b> is the fraction of the nodes in a layer that are zeroed out and it’s usually set between 0 and 1.

  
 ## Test Time

<b>Typically at test time we disable dropout.</b> Given a trained model and a new example, we do not drop out any nodes and thus do not need to normalize. 
In traditional dropout the weights of the network at test time are scaled versions of the trained weights. If a unit is retained with <b>probability q=1-p</b> during training,S at test time the weights of that unit are multiplied by q.

<img src="../images/dropout1.png"/>


## Inverted Dropout

Inverted dropout is a variant of the original dropout technique developed <ahref='papers/JMLRdropout.pdf'>Srivastava et al.(2014)</a>. Just like the traditional dropout, inverted dropout randomly dropp out some fraction of the nodes.

The one difference is that, during training of a neural network using inverted dropout the weights of the network are scaled-down  by the inverse of the keep probability (probability of units retained) q=1-p  and does not need any scaling during test time.

In contrast, traditional dropout requires scaling during the test phase.
<br>
<br><img src="../images/invtdrop.png"/>


<img src="../images/dropout1.jpg"/>
  (source: From the book am using: Dive into Deep Learning by Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola page 162)

for more on dropout read :
<a href='papers/JMLRdropout.pdf'>Dropout: A Simple Way to Prevent Neural Networks from Overfitting {Srivastava et al.
(2014)}</a>

In [1]:
from keras.datasets import mnist
from keras import models,layers
import tensorflow as tf
import numpy as np
import keras
import matplotlib.pyplot as plt
%matplotlib inline

Using TensorFlow backend.


In [10]:
(x_train, y_train), (x_test, y_test) = mnist.load_data()

In [11]:
batch_size = 256
train_data = tf.data.Dataset.from_tensor_slices((x_train.reshape(60000, 784).astype("float32") / 255,
                                              y_train.astype("float32")))
train_data = train_data.shuffle(buffer_size=1024).batch(batch_size)



test_data = tf.data.Dataset.from_tensor_slices((x_test.reshape(10000, 784).astype("float32") / 255,
                                         y_test.astype("float32")))
test_data = test_data.shuffle(buffer_size=1024).batch(batch_size)

## Implementing Inverted Dropout from Scratch

In [2]:
def dropout_layer(X,dropout_rate):
    assert 0<=dropout_rate <=1
    # probability of units to retain
    keep_prob=1-dropout_rate
    # In this case, all elements are dropped out
    if dropout_rate==1:
        return tf.zeros_like(X)
    # # In this case, all elements are kept
    if dropout_rate==0:
        return X
    mask=tf.random.uniform(shape=tf.shape(X), minval=0, maxval=1) <keep_prob
    return tf.cast(mask, dtype=tf.float32) * X/keep_prob

testing our dropout function on a few example with probabilities 0, 0.4, 0.5, and 1, respectively.


In [3]:
X = tf.range(1,17,dtype=tf.float32)
X=tf.reshape(X,(2, 8))
X

<tf.Tensor: shape=(2, 8), dtype=float32, numpy=
array([[ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.],
       [ 9., 10., 11., 12., 13., 14., 15., 16.]], dtype=float32)>

# dropout all units

In [4]:
dropout_layer(X,1)

<tf.Tensor: shape=(2, 8), dtype=float32, numpy=
array([[0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)>

# keep all units

In [5]:
dropout_layer(X,0)

<tf.Tensor: shape=(2, 8), dtype=float32, numpy=
array([[ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.],
       [ 9., 10., 11., 12., 13., 14., 15., 16.]], dtype=float32)>

# dropout 0.5 of the units

In [6]:
dropout_layer(X,0.5)

<tf.Tensor: shape=(2, 8), dtype=float32, numpy=
array([[ 0.,  4.,  6.,  0.,  0.,  0.,  0., 16.],
       [18.,  0.,  0., 24.,  0.,  0.,  0., 32.]], dtype=float32)>

In [7]:
dropout_layer(X,0.5)

<tf.Tensor: shape=(2, 8), dtype=float32, numpy=
array([[ 2.,  0.,  6.,  0., 10., 12., 14., 16.],
       [18., 20.,  0.,  0.,  0., 28., 30.,  0.]], dtype=float32)>

with 0.5 dropout rate we can the see that the fraction of nodes dropout are random and Because the nodes drop out are chosen randomly on every pass, the representations in each layer can't depend on the exact values taken by nodes in the previous layer

In [8]:
class Linear(tf.keras.layers.Layer):
    def __init__(self, units=20, input_dim=32,**kwargs):
        super().__init__(**kwargs)
        self.w = self.add_weight(shape=(input_dim, units), initializer="random_normal", trainable=True)
        self.b = self.add_weight(shape=(units,), initializer="zeros", trainable=True)

    def call(self, inputs):
        return tf.matmul(inputs, self.w) + self.b

In [9]:
num_inputs, num_outputs, num_hidden_1,num_hidden_2 = 784, 10, 50,20
class MLP(tf.keras.layers.Layer):
    def __init__(self):
        super().__init__()
        self.layer1=Linear(input_dim=num_inputs,units=num_hidden_1)
        self.layer2=Linear(input_dim=num_hidden_1,units=num_hidden_2)
        self.layer3=Linear(input_dim=num_hidden_2,units=num_outputs)
    def call(self,inputs):
        h_1=tf.maximum(self.layer1(inputs),0)
        h_1=dropout_layer(h_1,dropout_rate=0.3)
        h_2=tf.maximum(self.layer2(h_1),0)
        h_2=dropout_layer(h_2,dropout_rate=0.5)
        output=tf.math.softmax(self.layer3(h_2))
        return output
    
mlp=MLP()

In [12]:
def evaluate_accuracy(data_iterator, net):
    pred_correct = 0.
    for i ,(data,label) in enumerate(data_iterator):
        output=net(data)
        output=output.numpy()
        pred=np.argmax(output,axis=1)
        pred_correct=np.sum(pred==label)
        return (pred_correct/len(data))

In [13]:
# Instantiate a metric object
accuracy = tf.keras.metrics.SparseCategoricalAccuracy()
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.RMSprop(learning_rate=1e-3)

In [14]:
epochs=15
for e in range(epochs):
    train_acc,test_acc=0,0
    # Iterate over the batches of a dataset.
    for step, (x, y) in enumerate(train_data):
        with tf.GradientTape() as tape:
            logits = mlp(x)
            # Compute the loss value for this batch.
            loss_value = loss_fn(y, logits)
        # Update the state of the `accuracy` metric.
        acc_c=accuracy.update_state(y, logits)
        # Update the weights of the model to minimize the loss value.
        gradients = tape.gradient(loss_value, mlp.trainable_weights)
        optimizer.apply_gradients(zip(gradients, mlp.trainable_weights))
    train_accuracy = evaluate_accuracy(train_data, mlp)
    test_accuracy = evaluate_accuracy(test_data, mlp)
    train_acc+=train_accuracy
    test_acc+=test_accuracy
    print('epoch %d, loss %f,train acc %f,test acc %f'%(e,loss_value,train_acc,test_acc))     
    # Result the metric's state at the end of an epoch
    accuracy.reset_states()

epoch 0, loss 1.817480,train acc 0.664062,test acc 0.695312
epoch 1, loss 1.743118,train acc 0.671875,test acc 0.718750
epoch 2, loss 1.659132,train acc 0.773438,test acc 0.742188
epoch 3, loss 1.652222,train acc 0.742188,test acc 0.710938
epoch 4, loss 1.628205,train acc 0.761719,test acc 0.781250
epoch 5, loss 1.577156,train acc 0.820312,test acc 0.796875
epoch 6, loss 1.639755,train acc 0.847656,test acc 0.769531
epoch 7, loss 1.619823,train acc 0.828125,test acc 0.859375
epoch 8, loss 1.677131,train acc 0.832031,test acc 0.800781
epoch 9, loss 1.597666,train acc 0.804688,test acc 0.835938
epoch 10, loss 1.625037,train acc 0.847656,test acc 0.816406
epoch 11, loss 1.613111,train acc 0.812500,test acc 0.835938
epoch 12, loss 1.563551,train acc 0.773438,test acc 0.812500
epoch 13, loss 1.505152,train acc 0.847656,test acc 0.843750
epoch 14, loss 1.577855,train acc 0.808594,test acc 0.882812


## CONCISE IMPLEMENTATION

In [15]:
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

In [16]:
train_images=train_images.reshape(60000, 784).astype(np.float32)/255
test_images=test_images.reshape(10000, 784).astype(np.float32)/255


In [17]:
def one_hot(sequence,dim=10):
    results=np.zeros((len(sequence),dim))
    for i, seq in enumerate(sequence):
        results[i,seq]=1# set the indices of one_hot[i] to 1s 
    return results

train_labels=one_hot(train_labels)
test_labels=one_hot(test_labels)

In [18]:
x_train,x_vdata,y_train,y_vdata=train_images[30000:], train_images[:30000], train_labels[30000:],train_labels[:30000]

In [19]:
class customizeDropout(layers.Layer):
    def __init__(self,rate):
        super().__init__()
        self.rate=rate
    def call(self,inputs,training=False):
        if training:
            return dropout_layer(X=inputs,dropout_rate=self.rate)
            #returnself.rate*inputs
        else:
            return inputs
    

In [20]:
net=models.Sequential()
net.add(layers.Dense(100,activation='relu',input_shape=(784,)))
net.add(customizeDropout(0.5))
net.add(layers.Dense(50,activation='relu'))
net.add(customizeDropout(0.4))
net.add(layers.Dense(10,activation='softmax'))

net=models.Sequential()
net.add(layers.Dense(20,activation='relu',input_shape=(784,)))
net.add(layers.Dropout(0.5))
net.add(layers.Dense(15,activation='relu'))
net.add(layers.Dropout(0.5))
net.add(layers.Dense(10,activation='softmax'))

In [21]:
net.compile(optimizer='rmsprop',loss='categorical_crossentropy',metrics=['accuracy'])

In [22]:
net.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 100)               78500     
_________________________________________________________________
customize_dropout_1 (customi (None, 100)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 50)                5050      
_________________________________________________________________
customize_dropout_2 (customi (None, 50)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 10)                510       
Total params: 84,060
Trainable params: 84,060
Non-trainable params: 0
_________________________________________________________________


In [23]:
x_train,x_vdata,y_train,y_vdata=train_images[30000:], train_images[:30000], train_labels[30000:],train_labels[:30000]

In [24]:
results=net.fit(x_train,y_train,batch_size=100,validation_data=(x_vdata,y_vdata),epochs=9)

Train on 30000 samples, validate on 30000 samples
Epoch 1/9
Epoch 2/9
Epoch 3/9
Epoch 4/9
Epoch 5/9
Epoch 6/9
Epoch 7/9
Epoch 8/9
Epoch 9/9


In [25]:
test_loss, test_acc = net.evaluate(test_images, test_labels)



In [26]:
print(test_acc)

0.9696000218391418
