<a href="https://colab.research.google.com/github/ayulockin/debugNNwithWandB/blob/master/MNIST_pytorch_wandb_Exploding_Gradient.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Imports and Setups

In [0]:
!pip install wandb -q


In [0]:
import wandb

In [0]:
!wandb login

wandb: You can find your API key in your browser here: https://app.wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
Successfully logged in to Weights & Biases!


In [0]:
import torch
from torch import nn
from torch import optim
from torch.nn import functional as F
import torchvision
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

import matplotlib.pyplot as plt
import numpy as np

#### For GPU

In [7]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

cuda:0


## MNIST Handwritten Dataset

In [8]:
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.1307,), (0.3081,))])

trainset = torchvision.datasets.MNIST(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

testset = torchvision.datasets.MNIST(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=64, shuffle=False)

classes = ('0', '1', '2', '3', '4', '5', '6', '7', '8', '9')
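
As a quick sanity check on the `Normalize((0.1307,), (0.3081,))` step, the sketch below applies the same formula, `(x - mean) / std`, to the two extreme pixel values that `ToTensor()` produces (`0.0` and `1.0`). The helper name `normalize_pixel` is made up for illustration and is not part of torchvision.

```python
# Illustrative only: `normalize_pixel` is a hypothetical helper, not a torchvision API.
MEAN, STD = 0.1307, 0.3081  # the values passed to transforms.Normalize above

def normalize_pixel(x, mean=MEAN, std=STD):
    """Apply the per-channel formula used by transforms.Normalize."""
    return (x - mean) / std

black = normalize_pixel(0.0)  # background pixels
white = normalize_pixel(1.0)  # brightest stroke pixels
print(round(black, 4), round(white, 4))
```

Most MNIST pixels are background, so most inputs to the first layer are small negative numbers rather than zeros.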


## Models

1.   `NetforExplode`: This model uses `relu` as the activation function in each layer, but every weight is initialized to `100` and every bias to `-100`. With weights this large, activations and gradients grow as they pass through each layer, which creates the problem of **Exploding Gradient** and keeps the network from converging.
2.   `Net`: This model has the same architecture and `relu` activations, but it drops the bias terms and keeps PyTorch's default weight initialization. Its gradients stay in a reasonable range, so it trains normally.
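
To get a feel for how quickly values blow up with constant weights of `100`, here is a back-of-the-envelope estimate in plain Python. It is not a simulation of the real network: it assumes every input to a layer has the same positive value (starting from `1.0`), and it only tracks the size of a single unit per layer. The fan-in numbers come from the layer shapes defined below.

```python
# Rough estimate under a strong assumption: all inputs to a layer are equal and positive.
WEIGHT, BIAS = 100.0, -100.0  # constants used in NetforExplode

# fan-in = number of inputs feeding one output unit
fan_in = {
    "conv1": 1 * 3 * 3,
    "conv2": 32 * 3 * 3,
    "fc1": 9216,
    "fc2": 128,
}

def layer_output(x, n_inputs, weight=WEIGHT, bias=BIAS):
    """Pre-activation of one unit when all n_inputs equal x."""
    return weight * n_inputs * x + bias

x = 1.0  # assumed input magnitude
magnitudes = {}
for name, n in fan_in.items():
    # Values stay positive here, so ReLU would leave them unchanged.
    x = layer_output(x, n)
    magnitudes[name] = x
    print(f"{name}: {x:.3e}")
```

Under these assumptions the value grows by several orders of magnitude at every layer. Gradients are multiplied by the same weights on the way back, so they grow in a similar fashion.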



In [0]:
class NetforExplode(nn.Module):
    def __init__(self):
        super(NetforExplode, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv1.weight.data.fill_(100)
        self.conv1.bias.data.fill_(-100)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.conv2.weight.data.fill_(100)
        self.conv2.bias.data.fill_(-100)

        self.fc1 = nn.Linear(9216, 128)
        self.fc1.weight.data.fill_(100)
        self.fc1.bias.data.fill_(-100)
        self.fc2 = nn.Linear(128, 10)
        self.fc2.weight.data.fill_(100)
        self.fc2.bias.data.fill_(-100)
        

    def forward(self, x):
        ## Conv 1st Block
        x = self.conv1(x)
        x = F.relu(x) ## Notice
        x = self.conv2(x)
        x = F.relu(x) ## Notice
        x = F.max_pool2d(x, 2)

        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x) ## Notice
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output
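
The `9216` input features of `fc1` follow from the layer shapes. The sketch below redoes that arithmetic for a 28x28 MNIST image; `conv_out` is a small helper written for this illustration, using the standard output-size formula for convolution and pooling windows without dilation.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a square convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1

side = 28                                  # MNIST images are 28x28
side = conv_out(side, kernel=3)            # conv1: 3x3 kernel, stride 1
side = conv_out(side, kernel=3)            # conv2: 3x3 kernel, stride 1
side = conv_out(side, kernel=2, stride=2)  # max_pool2d(x, 2): stride defaults to the kernel size
flat_features = 64 * side * side           # conv2 produces 64 channels
print(side, flat_features)
```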

In [0]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1, bias=False)
        self.conv2 = nn.Conv2d(32, 64, 3, 1, bias=False)

        self.fc1 = nn.Linear(9216, 128, bias=False)
        self.fc2 = nn.Linear(128, 10, bias=False)

    def forward(self, x):
        ## Conv 1st Block
        x = self.conv1(x)
        x = F.relu(x) ## Notice
        x = self.conv2(x)
        x = F.relu(x) ## Notice
        x = F.max_pool2d(x, 2)

        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x) ## Notice
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output
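
`Net` never touches its weights after construction, so it keeps PyTorch's default initialization. As far as we know, recent PyTorch versions initialize `nn.Conv2d` and `nn.Linear` weights with `kaiming_uniform_(a=sqrt(5))`; treat that as an assumption and check your installed version. Under that assumption the weights are drawn uniformly from a range of `1 / sqrt(fan_in)` on either side of zero, which the sketch below computes for each layer.

```python
import math

def default_weight_bound(fan_in):
    """Bound b with weights ~ U(-b, b), assuming kaiming_uniform_(a=sqrt(5))."""
    gain = math.sqrt(2.0 / (1.0 + 5.0))  # leaky-ReLU gain with a**2 = 5
    std = gain / math.sqrt(fan_in)
    return math.sqrt(3.0) * std          # simplifies to 1 / sqrt(fan_in)

fan_in = {"conv1": 9, "conv2": 288, "fc1": 9216, "fc2": 128}
bounds = {name: default_weight_bound(n) for name, n in fan_in.items()}
for name, b in bounds.items():
    print(f"{name}: |w| <= {b:.4f}  (NetforExplode uses w = 100)")
```

Every default weight is well below `1` in magnitude, several orders of magnitude smaller than the constant `100` used in `NetforExplode`.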

In [0]:
def train(model, device, train_loader, optimizer, epoch, steps_per_epoch=20):
  # Switch model to training mode. This is necessary for layers like dropout, batchnorm etc which behave differently in training and evaluation mode
  model.train()
  train_total = 0
  train_correct = 0

  # We loop over the data iterator, and feed the inputs to the network and adjust the weights.
  for batch_idx, (data, target) in enumerate(train_loader, start=0):
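    # Only train on the first few batches of each epoch to keep the demo fast.
    # batch_idx is zero-based, so this runs steps_per_epoch + 1 batches.
    if batch_idx > steps_per_epoch:
      break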
    # Load the input features and labels from the training dataset
    data, target = data.to(device), target.to(device)
    
    # Reset the gradients to 0 for all learnable weight parameters
    optimizer.zero_grad()
    
    # Forward pass: Pass image data from training dataset, make predictions about class image belongs to (0-9 in this case)
    output = model(data)
    
    # Define our loss function, and compute the loss
    loss = F.nll_loss(output, target)

    # Count correct predictions for the running training accuracy
    _, predictions = torch.max(output, 1)
    train_total += target.size(0)
    train_correct += (predictions == target).sum().item()
            
    # Backward pass: compute the gradients of the loss w.r.t. the model's parameters
    loss.backward()
    
    # Update the neural network weights
    optimizer.step()

  acc = round((train_correct / train_total) * 100, 2)
  print('Epoch [{}], Loss: {}, Accuracy: {}, '.format(epoch, loss.item(), acc), end='')
  # wandb.log({'Train Loss': loss.item(), 'Train Accuracy': acc})


In [0]:
def test(model, device, test_loader, classes):
  # Switch model to evaluation mode. This is necessary for layers like dropout, batchnorm etc which behave differently in training and evaluation mode
  model.eval()
  
  test_loss = 0
  test_total = 0
  test_correct = 0

  with torch.no_grad():
      for data, target in test_loader:
          # Load the input features and labels from the test dataset
          data, target = data.to(device), target.to(device)
          
          # Make predictions: Pass image data from test dataset, make predictions about class image belongs to (0-9 in this case)
          output = model(data)
          
          # Compute the loss sum up batch loss
          test_loss += F.nll_loss(output, target, reduction='sum').item()
          
          # Count correct predictions for the test accuracy
          _, predictions = torch.max(output, 1)
          test_total += target.size(0)
          test_correct += (predictions == target).sum().item()
          
  acc = round((test_correct / test_total) * 100, 2)
  print(' Test_loss: {}, Test_accuracy: {}'.format(test_loss/test_total, acc))
  # wandb.log({'Test Loss': test_loss/test_total, 'Test Accuracy': acc})
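
`test` sums the per-sample losses with `reduction='sum'` and divides once by the number of samples. That matters because the last batch is usually smaller (10,000 test images with a batch size of 64 leave a final batch of 16). The plain-Python sketch below uses made-up loss values to show how averaging the per-batch means instead would give a different number.

```python
# Hypothetical per-sample losses: 5 samples split into batches of size 2, 2 and 1.
batches = [[1.0, 3.0], [2.0, 2.0], [10.0]]

n_samples = sum(len(b) for b in batches)

# What test() does: sum every per-sample loss, then divide once at the end.
mean_from_sums = sum(sum(b) for b in batches) / n_samples

# Naive alternative: average the per-batch means, which over-weights the small batch.
mean_of_means = sum(sum(b) / len(b) for b in batches) / len(batches)

print(mean_from_sums, mean_of_means)
```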


## Exploding Gradient (Network with Issue)

In [13]:
net = NetforExplode().to(device)
print(net)

optimizer = optim.Adam(net.parameters())

NetforExplode(
  (conv1): Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1))
  (conv2): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1))
  (fc1): Linear(in_features=9216, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=10, bias=True)
)


In [14]:
# wandb.init(project='explodingdebug')
# wandb.watch(net, log='all')

for epoch in range(10):
  train(net, device, trainloader, optimizer, epoch)
  test(net, device, testloader, classes)

print('Finished Training')

Epoch [0], Loss: 374601678848.0, Accuracy: 9.15,  Test_loss: 515360212543.0784, Test_accuracy: 10.28
Epoch [1], Loss: 651828396032.0, Accuracy: 8.85,  Test_loss: 683000517047.0912, Test_accuracy: 10.1
Epoch [2], Loss: 518415974400.0, Accuracy: 9.97,  Test_loss: 573574628769.792, Test_accuracy: 11.35
Epoch [3], Loss: 557607550976.0, Accuracy: 9.23,  Test_loss: 706371366839.9104, Test_accuracy: 9.82
Epoch [4], Loss: 817587290112.0, Accuracy: 11.09,  Test_loss: 799592343509.4016, Test_accuracy: 9.58
Epoch [5], Loss: 762960674816.0, Accuracy: 10.27,  Test_loss: 942930434562.4576, Test_accuracy: 9.58
Epoch [6], Loss: 665451495424.0, Accuracy: 9.75,  Test_loss: 655409432389.2224, Test_accuracy: 9.29
Epoch [7], Loss: 877649723392.0, Accuracy: 10.49,  Test_loss: 1016483252745.0112, Test_accuracy: 9.8
Epoch [8], Loss: 550259130368.0, Accuracy: 11.16,  Test_loss: 430533319956.8896, Test_accuracy: 10.09
Epoch [9], Loss: 626729680896.0, Accuracy: 8.48,  Test_loss: 740067318012.3136, Test_accuracy:

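> Our model couldn't converge: the loss stays around 10^11 to 10^12 and the accuracy hovers near 10%, which is chance level for 10 classes. Let's look at the gradients of each layer.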
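
Loss values this large are what `log_softmax` followed by `nll_loss` produces when the logits themselves are enormous. The plain-Python sketch below reimplements that loss for a single sample (the function name `nll_from_logits` and the logit values are made up for illustration). With identical logits the loss is `ln(10)`, and when one wrong class leads by a huge margin the loss is roughly that margin.

```python
import math

def nll_from_logits(logits, target):
    """Negative log-likelihood of `target` under softmax(logits), computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# Identical logits: the prediction is uniform over 10 classes.
uniform = nll_from_logits([5.0] * 10, target=3)

# One wrong class ahead by a huge margin (made-up numbers).
logits = [0.0] * 10
logits[7] = 5e11
huge = nll_from_logits(logits, target=3)

print(uniform, huge)
```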

![gradients](link to github/images)
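
To look at the gradients of each layer as numbers rather than in a plot, one option is to collect the L2 norm of every parameter's gradient right after `loss.backward()`. The helper `grad_norms` below is a minimal sketch written for this illustration, not something the notebook or W&B provides, and the toy model is a stand-in for the real network.

```python
import torch
from torch import nn

def grad_norms(model):
    """Return {parameter name: L2 norm of its gradient} for parameters that have one."""
    return {
        name: param.grad.norm().item()
        for name, param in model.named_parameters()
        if param.grad is not None
    }

# Toy usage on a tiny stand-in model (not the notebook's network).
toy = nn.Linear(4, 2)
loss = toy(torch.ones(3, 4)).sum()
loss.backward()
norms = grad_norms(toy)
print(norms)
```

In the notebook, calling `grad_norms(net)` inside `train` between `loss.backward()` and `optimizer.step()` would give the per-layer values for the current batch.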

Let's try to decrease the loss and increase the accuracy by training a freshly initialized copy of the network for another 10 epochs.


In [0]:
del net

In [27]:
net = NetforExplode().to(device)
optimizer = optim.Adam(net.parameters())

# wandb.init(project='debuggingnn')
# wandb.watch(net, log='all')

for epoch in range(10):
  train(net, device, trainloader, optimizer, epoch)
  test(net, device, testloader, classes)

print('Finished Training')

Epoch [0], Loss: 736184238080.0, Accuracy: 10.71,  Test_loss: 685160885596.9792, Test_accuracy: 9.74
Epoch [1], Loss: 383292276736.0, Accuracy: 9.45,  Test_loss: 579661026924.9536, Test_accuracy: 10.09
Epoch [2], Loss: 843826855936.0, Accuracy: 9.9,  Test_loss: 1530725270552.576, Test_accuracy: 10.1
Epoch [3], Loss: 705851031552.0, Accuracy: 9.9,  Test_loss: 784419297794.4576, Test_accuracy: 10.32
Epoch [4], Loss: 1398146072576.0, Accuracy: 10.04,  Test_loss: 1100310706441.421, Test_accuracy: 8.92
Epoch [5], Loss: 631964172288.0, Accuracy: 10.42,  Test_loss: 675881608753.9712, Test_accuracy: 9.8
Epoch [6], Loss: 418658648064.0, Accuracy: 10.19,  Test_loss: 499995826035.0976, Test_accuracy: 9.98
Epoch [7], Loss: 1115215101952.0, Accuracy: 10.49,  Test_loss: 795683708521.6768, Test_accuracy: 11.35
Epoch [8], Loss: 1363853443072.0, Accuracy: 10.79,  Test_loss: 1383919422904.7295, Test_accuracy: 8.92
Epoch [9], Loss: 500363689984.0, Accuracy: 10.42,  Test_loss: 627344827586.9696, Test_accu

## Default initialization saves the day

In [0]:
del net

net = Net().to(device)
print(net)

optimizer = optim.Adam(net.parameters())

Net(
  (conv1): Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1), bias=False)
  (conv2): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), bias=False)
  (fc1): Linear(in_features=9216, out_features=128, bias=False)
  (fc2): Linear(in_features=128, out_features=10, bias=False)
)


In [0]:
wandb.init(project='debuggingnn')
wandb.watch(net, log='all')

for epoch in range(10):
  train(net, device, trainloader, optimizer, epoch)
  test(net, device, testloader, classes)

print('Finished Training')

Epoch [0], Loss: 0.4942484498023987, Accuracy: 69.27,  Test_loss: 0.4474091837644577, Test_accuracy: 86.37
Epoch [1], Loss: 0.29710474610328674, Accuracy: 88.47,  Test_loss: 0.2309967631816864, Test_accuracy: 93.38
Epoch [2], Loss: 0.14050628244876862, Accuracy: 92.49,  Test_loss: 0.19611059620380403, Test_accuracy: 93.99
Epoch [3], Loss: 0.11804260313510895, Accuracy: 94.94,  Test_loss: 0.1683878321647644, Test_accuracy: 94.79
Epoch [4], Loss: 0.08817225694656372, Accuracy: 95.61,  Test_loss: 0.15559489104747773, Test_accuracy: 95.51
Epoch [5], Loss: 0.035265251994132996, Accuracy: 95.68,  Test_loss: 0.1284967087507248, Test_accuracy: 96.07
Epoch [6], Loss: 0.22864656150341034, Accuracy: 95.54,  Test_loss: 0.11613762941360474, Test_accuracy: 96.56
Epoch [7], Loss: 0.136811763048172, Accuracy: 97.17,  Test_loss: 0.10356810202598572, Test_accuracy: 96.91
Epoch [8], Loss: 0.1875208616256714, Accuracy: 96.95,  Test_loss: 0.11141893649101257, Test_accuracy: 96.51
Epoch [9], Loss: 0.1007675

> The model converged quickly, reaching about 96% test accuracy within 10 short epochs. The gradient plot shows that the gradients are well distributed across the layers.

[LINK to]