# Implementing Dropout

In this notebook I implement [dropout](https://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf) in a simple neural network and test it on the MNIST dataset. First I write a rough version, then clean it up to make it more usable.

## Imports

In [None]:
import time

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch import Tensor
from torch.utils.data import DataLoader, Dataset
from torchvision import datasets
from torchvision.transforms import ToTensor

# Use the GPU when one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Get Data

In [None]:
training_data = datasets.MNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor()
)

test_data = datasets.MNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor()
)

train_dataloader = DataLoader(training_data, batch_size=128, shuffle=True)
# Shuffling is not needed for evaluation
test_dataloader = DataLoader(test_data, batch_size=128, shuffle=False)

## Create Model

In [None]:
class DropoutNet(nn.Module):
    def __init__(self, dropout=True):
        super().__init__()
        self.dropout = dropout

        # Scale the initial weights by a small standard deviation to reduce
        # the chance of exploding gradients.
        std = torch.sqrt(torch.tensor(2 / 256))

        # Weights and biases are plain tensors (not nn.Parameter), created
        # directly on the target device and exposed through parameters().
        self.w1 = torch.randn((784, 784), device=device) * std
        self.w2 = torch.randn((784, 784), device=device) * std
        self.w3 = torch.randn((784, 784), device=device) * std
        self.w4 = torch.randn((784, 256), device=device) * std
        self.w5 = torch.randn((256, 10), device=device) * std

        self.b1 = torch.randn((1, 784), device=device) * std
        self.b2 = torch.randn((1, 784), device=device) * std
        self.b3 = torch.randn((1, 784), device=device) * std
        self.b4 = torch.randn((1, 256), device=device) * std
        self.b5 = torch.randn((1, 10), device=device) * std

        self.params = [self.w1, self.w2, self.w3, self.w4, self.w5,
                       self.b1, self.b2, self.b3, self.b4, self.b5]
        for param in self.params:
            param.requires_grad_()

    def forward(self, x, train=True):
        # During training, dropout removes randomly selected nodes from the
        # network: each node has a 50% chance of being dropped on every pass.
        # During testing, every node is used.
        # This strategy helps prevent overfitting to the training data.
        if train and self.dropout:
            self.d1 = torch.randint(2, [784], device=device)
            self.d2 = torch.randint(2, [784], device=device)
            self.d3 = torch.randint(2, [784], device=device)
            self.d4 = torch.randint(2, [784], device=device)
            self.d5 = torch.randint(2, [256], device=device)
        else:
            self.d1 = torch.ones([784], device=device)
            self.d2 = torch.ones([784], device=device)
            self.d3 = torch.ones([784], device=device)
            self.d4 = torch.ones([784], device=device)
            self.d5 = torch.ones([256], device=device)

        # Multiplying row i of a weight matrix by d[i] removes the contribution
        # of input node i to that layer. The bias is added after the ReLU, and
        # the surviving activations are not rescaled.
        x = F.relu(x @ self.w1.mul(self.d1.reshape(-1, 1))) + self.b1
        x = F.relu(x @ self.w2.mul(self.d2.reshape(-1, 1))) + self.b2
        x = F.relu(x @ self.w3.mul(self.d3.reshape(-1, 1))) + self.b3
        x = F.relu(x @ self.w4.mul(self.d4.reshape(-1, 1))) + self.b4
        x = F.relu(x @ self.w5.mul(self.d5.reshape(-1, 1))) + self.b5
        return x

    def parameters(self):
        # Override nn.Module.parameters() so the optimizer sees the plain tensors
        return self.params
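
The masking in `forward` multiplies each row of a weight matrix by a 0/1 entry, which has the same effect as zeroing the corresponding input features before the matrix product. A minimal sketch of that equivalence, using small made-up shapes (the names `x`, `w`, and `d` are illustrative, not the network's own tensors):

```python
import torch

torch.manual_seed(0)

x = torch.randn(5, 4)          # batch of 5 samples, 4 input features
w = torch.randn(4, 3)          # weights mapping 4 inputs to 3 outputs
d = torch.randint(2, (4,))     # 0/1 mask over the 4 input features

# Masking rows of the weight matrix, as in DropoutNet.forward ...
masked_weights_out = x @ w.mul(d.reshape(-1, 1))
# ... matches masking the input features directly.
masked_inputs_out = (x * d) @ w

same = torch.allclose(masked_weights_out, masked_inputs_out, atol=1e-6)
print(same)
```

Seen this way, each mask `d1` to `d5` drops the inputs of its layer: `d1` acts on the raw pixels, and the later masks act on the hidden activations.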



## Train

In [None]:
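def train_one_epoch(model):
    model = model.to(device)

    loss_fn = nn.CrossEntropyLoss()

    # A new Adam optimizer (default learning rate) is created on every call,
    # so its running statistics restart at the beginning of each epoch.
    optimizer = optim.Adam(model.parameters())

    for inputs, labels in train_dataloader:
        optimizer.zero_grad()
        # Flatten each 28x28 image into a 784-dimensional vector
        outputs = model(inputs.to(device).reshape(-1, 784))
        loss = loss_fn(outputs, labels.to(device))
        loss.backward()
        optimizer.step()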
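
`train_one_epoch` builds a fresh Adam optimizer on each call, so Adam's running moment estimates are discarded between epochs. One possible variant, sketched here on a toy dataset, has the caller create the optimizer once and pass it in. The names `train_one_epoch_with`, `toy_loader`, `toy_model`, and `toy_optimizer` are hypothetical and not part of the notebook:

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

def train_one_epoch_with(model, optimizer, loss_fn, dataloader):
    """Hypothetical variant: the caller owns the optimizer, so its state persists."""
    for inputs, labels in dataloader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), labels)
        loss.backward()
        optimizer.step()

torch.manual_seed(0)

# Toy stand-in for MNIST: 64 random samples, 4 features, 3 classes
toy_data = TensorDataset(torch.randn(64, 4), torch.randint(0, 3, (64,)))
toy_loader = DataLoader(toy_data, batch_size=16)
toy_model = nn.Linear(4, 3)
toy_optimizer = optim.Adam(toy_model.parameters())
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):
    train_one_epoch_with(toy_model, toy_optimizer, loss_fn, toy_loader)

# Adam keeps a per-parameter step counter; it keeps counting across epochs
# because the same optimizer object is reused (4 batches x 3 epochs).
steps_taken = int(toy_optimizer.state[toy_model.weight]["step"])
print(steps_taken)
```

Whether this changes the accuracies reported below has not been tested here.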


## Test

In [None]:
def test(model, dataloader=test_dataloader):
    # Evaluate without tracking gradients and with dropout disabled (train=False)
    correct = 0
    total = 0
    with torch.no_grad():
        for inputs, labels in dataloader:
            outputs = model(inputs.to(device).reshape(-1, 784), train=False)
            # The predicted class is the index of the largest output
            predicted = outputs.argmax(dim=1)
            total += labels.size(0)
            correct += (predicted == labels.to(device)).sum().item()
    return correct / total

## It works!

Here we can see dropout in action. For the model without dropout, the test accuracy is much lower than the training accuracy, which is a sign of overfitting. For the model with dropout, the gap between train and test accuracy is much smaller.

Training is also faster with dropout when the network has a large number of neurons, as dropout causes "subnetworks" to be trained individually, which are then combined during testing.

On MNIST, the model without dropout does ultimately perform better, which could be due to the relative simplicity of the task and the lack of significant variation between the training and test datasets.
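
The idea of subnetworks being combined at test time can be checked exactly for a single linear layer: averaging the outputs of every masked subnetwork equals the full layer with its weights scaled by the keep probability. This is why the dropout paper multiplies the weights by the keep probability at test time; the networks in this notebook skip that scaling. A small sketch with made-up sizes (all names are illustrative):

```python
import itertools
import torch

torch.manual_seed(0)

x = torch.randn(4, 8)   # 4 samples, 8 input units (illustrative sizes)
w = torch.randn(8, 3)   # one linear layer, no bias or nonlinearity

# Enumerate every 0/1 mask over the 8 input units: 2**8 = 256 subnetworks
masks = torch.tensor(list(itertools.product([0.0, 1.0], repeat=8)))

# Output of each subnetwork, then the average over all of them
subnet_outputs = torch.stack([(x * m) @ w for m in masks])
ensemble_mean = subnet_outputs.mean(dim=0)

# The full layer with weights scaled by the keep probability (0.5)
scaled_full = x @ (0.5 * w)

print(torch.allclose(ensemble_mean, scaled_full, atol=1e-4))
```

With ReLU layers in between, the scaled full network only approximates the average over subnetworks; it does not match it exactly.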

In [None]:
models = [DropoutNet(),DropoutNet(dropout=False)]
for model in models:
  print(f"model {['without','with'][int(model.dropout)]} dropout")
  for epoch in range(10):
    train_one_epoch(model)
    test_acc,train_acc = test(model),test(model,dataloader = train_dataloader)
    print(f"epoch : {epoch} | accuracy on test set : {test_acc} | accuracy on train set : {train_acc} | difference between test and train accuracy = {train_acc-test_acc}")


model with dropout
epoch : 0 | accuracy on test set : 0.804 | accuracy on train set : 0.8012833333333333 | difference between test and train accuracy = -0.0027166666666667005
epoch : 1 | accuracy on test set : 0.9198 | accuracy on train set : 0.9170333333333334 | difference between test and train accuracy = -0.002766666666666584
epoch : 2 | accuracy on test set : 0.9417 | accuracy on train set : 0.9428333333333333 | difference between test and train accuracy = 0.0011333333333333195
epoch : 3 | accuracy on test set : 0.954 | accuracy on train set : 0.9559666666666666 | difference between test and train accuracy = 0.001966666666666672
epoch : 4 | accuracy on test set : 0.9587 | accuracy on train set : 0.962 | difference between test and train accuracy = 0.0032999999999999696
epoch : 5 | accuracy on test set : 0.9625 | accuracy on train set : 0.9651666666666666 | difference between test and train accuracy = 0.002666666666666595
epoch : 6 | accuracy on test set : 0.965 | accuracy on train 

## Clean Version

In [None]:
class Dropout(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        # nn.Module sets self.training to True by default. Outside of training,
        # the layer passes its input through unchanged.
        if not self.training:
            return x
        # Sample one 0/1 mask over the feature dimension (shared across the
        # batch); each feature is kept with probability 0.5.
        length = x.shape[-1]
        mask = torch.randint(2, (length,), device=x.device)
        return x.mul(mask)


In [None]:
class CleanNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.d = Dropout()
        self.l1 = nn.Linear(784, 784, device=device)
        self.l2 = nn.Linear(784, 256, device=device)
        self.l3 = nn.Linear(256, 10, device=device)

    def forward(self, x, train=True):
        # Enable dropout only while training
        self.d.training = train

        x = self.d(x)
        x = F.relu(self.l1(x))
        x = self.d(x)
        x = F.relu(self.l2(x))
        x = self.d(x)
        # As in the rough version, ReLU is also applied to the output layer
        x = F.relu(self.l3(x))

        return x
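
PyTorch also ships a built-in `nn.Dropout` layer, which differs from the `Dropout` class above in two ways: it samples a separate mask entry for every element rather than one mask per feature, and it rescales the surviving activations by 1/(1-p) during training. Its behavior is switched with `train()` and `eval()` instead of a `train` argument. A short sketch for comparison:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

drop = nn.Dropout(p=0.5)   # p is the probability of zeroing an element
x = torch.ones(2, 6)

drop.train()               # training mode: elements are zeroed at random
train_out = drop(x)        # survivors are scaled by 1 / (1 - p) = 2

drop.eval()                # evaluation mode: the layer is the identity
eval_out = drop(x)

print(train_out)
print(torch.equal(eval_out, x))
```

Calling `train()` or `eval()` on a parent module also switches every `nn.Dropout` inside it, which is how the mode is usually toggled for a whole network.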

In [None]:
model = CleanNet()
for epoch in range(5):
  train_one_epoch(model)
  test_acc,train_acc = test(model),test(model,dataloader = train_dataloader)
  print(f"epoch : {epoch} | accuracy on test set : {test_acc} | accuracy on train set : {train_acc} | difference between test and train accuracy = {train_acc-test_acc}")

epoch : 0 | accuracy on test set : 0.9104 | accuracy on train set : 0.9062833333333333 | difference between test and train accuracy = -0.004116666666666657
epoch : 1 | accuracy on test set : 0.9412 | accuracy on train set : 0.9411 | difference between test and train accuracy = -9.999999999998899e-05
epoch : 2 | accuracy on test set : 0.9502 | accuracy on train set : 0.9508666666666666 | difference between test and train accuracy = 0.0006666666666665932
epoch : 3 | accuracy on test set : 0.9604 | accuracy on train set : 0.9599833333333333 | difference between test and train accuracy = -0.0004166666666667318
epoch : 4 | accuracy on test set : 0.9632 | accuracy on train set : 0.9675333333333334 | difference between test and train accuracy = 0.004333333333333411
