<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Dropout" data-toc-modified-id="Dropout-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Dropout</a></span></li><li><span><a href="#Test-Time" data-toc-modified-id="Test-Time-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Test Time</a></span></li><li><span><a href="#Inverted-Dropout" data-toc-modified-id="Inverted-Dropout-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Inverted Dropout</a></span></li><li><span><a href="#Implementing-Inverted-Dropout-from-Scratch" data-toc-modified-id="Implementing-Inverted-Dropout-from-Scratch-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Implementing Inverted Dropout from Scratch</a></span></li><li><span><a href="#CONCISE-IMPLEMENTATION" data-toc-modified-id="CONCISE-IMPLEMENTATION-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>CONCISE IMPLEMENTATION</a></span></li></ul></div>

## Dropout
The term <b>"dropout"</b> refers to dropping out units in a neural network. It is a technique for addressing overfitting. It consists of randomly dropping out some fraction of the nodes (setting fraction of the units to zero (injecting noise)) in each layer before calculating subsequent layer during training and has become a standard technique for training neural networks. When dropout is applied, during training its zeros out some fraction of the nodes with probability p in each layer before calculating the subsequent layer and the resulting network can be viewed as a subset of the original network. Because the fraction of the nodes that are drop out are chosen randomly on every pass, the representations in each layer can't depend on the exact values taken by nodes in the previous layer. 

<b> Dropout rate</b> is the fraction of the nodes in a layer that are zeroed out and it’s usually set between 0 and 1.

  
 ## Test Time

<b>Typically at test time we disable dropout.</b> Given a trained model and a new example, we do not drop out any nodes and thus do not need to normalize. 
In traditional dropout the weights of the network at test time are scaled versions of the trained weights. If a unit is retained with <b>probability q=1-p</b> during training,S at test time the weights of that unit are multiplied by q.

<img src="images/dropout1.png"/>


## Inverted Dropout

Inverted dropout is a variant of the original dropout technique developed <ahref='papers/JMLRdropout.pdf'>Srivastava et al.(2014)</a>. Just like the traditional dropout, inverted dropout randomly dropp out some fraction of the nodes.

The one difference is that, during training of a neural network using inverted dropout the weights of the network are scaled-down  by the inverse of the keep probability (probability of units retained) q=1-p  and does not need any scaling during test time.

In contrast, traditional dropout requires scaling during the test phase.
<br>
<br><img src="images/invtdrop.png"/>


<img src="images/dropout1.jpg"/>
  (source: From the book am using: Dive into Deep Learning by Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola page 162)

for more on dropout read :
<a href='https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf'>Dropout: A Simple Way to Prevent Neural Networks from Overfitting {Srivastava et al.
(2014)}</a>

In [1]:
import torch
from torch.utils.data import DataLoader,TensorDataset
import torch.nn.functional as F
from torch.autograd import Variable
import torchvision
from torchvision import transforms
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
train_data=torchvision.datasets.MNIST(root='./data/',train=True,transform=transforms.ToTensor())
test_data=torchvision.datasets.MNIST(root='./data/',train=False,transform=transforms.ToTensor())

In [3]:
num_inputs, num_outputs, num_hidden_1,num_hidden_2,lr,batch_size = 784, 10, 256, 256,0.5,256

In [4]:
train_iter=DataLoader(dataset=train_data,batch_size=batch_size,shuffle=True)
test_iter=DataLoader(dataset=test_data,batch_size=batch_size,shuffle=True)

## Implementing Inverted Dropout from Scratch

In [5]:
def dropout_layer(X,dropout_rate):
    assert 0<=dropout_rate <=1
    # probability of units to retain
    keep_prob=1-dropout_rate
    # In this case, all elements are dropped out
    if dropout_rate==1:
        return torch.zeros_like(X)
    # # In this case, all elements are kept
    if dropout_rate==0:
        return X
    mask=torch.empty(X.shape).uniform_(0,1) <keep_prob
    return mask *X/keep_prob

testing our dropout function on a few example with probabilities 0, 0.5, and 1, respectively.


In [6]:
X = torch.range(1,16).reshape(2,8)
X

  """Entry point for launching an IPython kernel.


tensor([[ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.],
        [ 9., 10., 11., 12., 13., 14., 15., 16.]])

# dropout all units

In [7]:
dropout_layer(X,1)

tensor([[0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.]])

# keep all units

In [8]:
dropout_layer(X,0)

tensor([[ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.],
        [ 9., 10., 11., 12., 13., 14., 15., 16.]])

# dropout 0.5 of the units

In [9]:
dropout_layer(X,0.5)

tensor([[ 0.,  0.,  0.,  0.,  0., 12., 14.,  0.],
        [ 0., 20., 22.,  0., 26., 28., 30., 32.]])

In [10]:
dropout_layer(X,0.5)

tensor([[ 2.,  4.,  0.,  8., 10., 12., 14., 16.],
        [ 0., 20., 22.,  0., 26.,  0.,  0., 32.]])

with 0.5 dropout rate we can the see that the fraction of nodes dropout are random and Because the nodes drop out are chosen randomly on every pass, the representations in each layer can't depend on the exact values taken by nodes in the previous layer

In [11]:
class MLP(torch.nn.Module):
    def __init__(self,is_training=True):
        super().__init__()
        self.training = is_training
        self.lr1=torch.nn.Linear(num_inputs,num_hidden_1)
        self.lr2=torch.nn.Linear(num_hidden_1,num_hidden_2)
        self.lr3=torch.nn.Linear(num_hidden_2,num_outputs)
        self.relu=torch.nn.ReLU()
    def forward(self,inputs):
        inputs=inputs.reshape(-1,784)
        h1=self.relu(self.lr1(inputs))
        if self.training==True:
            h1=dropout_layer(h1,dropout_rate=0.5)
        h2=self.relu(self.lr2(h1))
        if self.training==True:
            h2=dropout_layer(h2,0.5)
        output=self.lr3(h2)
        return output
        
net=MLP()

In [12]:
loss_fn=torch.nn.CrossEntropyLoss()
opt=torch.optim.SGD(net.parameters(),lr=lr)

In [13]:
def evaluate_accuracy(net,data_iterator):
    pred_correct = 0
    for  data,label in data_iterator:
        data=data.reshape(-1,784)
        output=net(data)
        pred = output.argmax(dim=1)
        pred_correct += (pred==label).float().sum().item()
        return 100*pred_correct/len(data)

In [14]:
evaluate_accuracy(net,test_iter)

12.890625

In [15]:
num_epochs = 10
for epoch in range(num_epochs+1):
    test_acc,train_acc=0,0
    for X,y in train_iter:
        net.train()
        y_hat=net(X)
        l=loss_fn(y_hat,y)
        opt.zero_grad() 
        l.backward() 
        opt.step() 
    acc_te=evaluate_accuracy(net.eval(),test_iter)
    net.eval()
    acc_tr=evaluate_accuracy(net,train_iter)
    test_acc+=acc_te
    train_acc+=acc_tr
    print('epoch %d, loss %f,train acc %f,test acc %f'%(epoch,l,train_acc,test_acc))

epoch 0, loss 0.218459,train acc 93.750000,test acc 92.578125
epoch 1, loss 0.337482,train acc 95.703125,test acc 96.093750
epoch 2, loss 0.153460,train acc 97.265625,test acc 96.875000
epoch 3, loss 0.120417,train acc 97.265625,test acc 97.656250
epoch 4, loss 0.145466,train acc 97.656250,test acc 96.875000
epoch 5, loss 0.201399,train acc 97.265625,test acc 96.484375
epoch 6, loss 0.158104,train acc 98.046875,test acc 98.046875
epoch 7, loss 0.112679,train acc 97.656250,test acc 97.265625
epoch 8, loss 0.104840,train acc 98.046875,test acc 98.046875
epoch 9, loss 0.085689,train acc 97.656250,test acc 98.046875
epoch 10, loss 0.131020,train acc 98.046875,test acc 98.828125


## CONCISE IMPLEMENTATION

In [16]:
net=torch.nn.Sequential(torch.nn.Flatten(),
         torch.nn.Linear(784,256),
         torch.nn.Dropout(0.5),
         torch.nn.ReLU(),
         torch.nn.Linear(256,256),
        torch.nn.Dropout(0.5),
         torch.nn.ReLU(),
         torch.nn.Linear(256,10)
)

In [17]:
net.parameters

<bound method Module.parameters of Sequential(
  (0): Flatten()
  (1): Linear(in_features=784, out_features=256, bias=True)
  (2): Dropout(p=0.5, inplace=False)
  (3): ReLU()
  (4): Linear(in_features=256, out_features=256, bias=True)
  (5): Dropout(p=0.5, inplace=False)
  (6): ReLU()
  (7): Linear(in_features=256, out_features=10, bias=True)
)>

In [18]:
num_epochs = 30
for epoch in range(num_epochs+1):
    for X,y in train_iter:
        #net.train()
        y_hat=net(X)
        l=loss_fn(y_hat,y)
        opt.zero_grad() 
        l.backward() 
        opt.step() 
    if epoch%5==0:
        print('epocgh %d, loss %f'%(epoch,l))
 

epocgh 0, loss 2.294591
epocgh 5, loss 2.297557
epocgh 10, loss 2.314187
epocgh 15, loss 2.303634
epocgh 20, loss 2.308324
epocgh 25, loss 2.299906
epocgh 30, loss 2.296230
