### Training neural networks

We want a function that takes an image and converts it to a list of class probabilities. To train it we need a loss function, which measures our prediction error. The loss depends on the output, and the output depends on the weights, so we can adjust the weights to minimize the loss. We do this with gradient descent. The gradient is the slope of the loss with respect to our parameters, and it always points in the direction of steepest increase, so we step in the direction of the negative gradient. With a multilayer NN we use backpropagation, which propagates the loss gradient backwards through every layer of the network using the chain rule: for a chain A -> B -> C, dC/dA = dC/dB * dB/dA. We also scale each step by a learning rate so that we do not take steps that are too large.
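To make the update rule concrete, here is a minimal sketch of gradient descent on a single weight, written in plain Python without PyTorch. The toy model `prediction = w * x`, the squared-error loss, and the values of `x`, `target`, and `lr` are all illustrative assumptions, not part of the notebook.

```python
# Toy problem (illustrative values, not from the notebook):
# prediction = w * x, loss = (prediction - target) ** 2
x, target = 2.0, 6.0   # the loss is minimized at w = 3.0
w = 0.0                # initial weight
lr = 0.05              # learning rate: scales the size of each step

for step in range(50):
    prediction = w * x
    loss = (prediction - target) ** 2
    # chain rule: dloss/dw = dloss/dprediction * dprediction/dw
    dloss_dpred = 2 * (prediction - target)
    dpred_dw = x
    grad = dloss_dpred * dpred_dw
    # step against the gradient, scaled by the learning rate
    w = w - lr * grad

print(round(w, 4))
```

With these values each step shrinks the distance to the minimum by a constant factor, so `w` settles very close to 3.0. A much larger `lr` would overshoot instead of converging.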

### Losses

We usually assign the loss function to a variable called `criterion` in PyTorch. For classification we use `CrossEntropyLoss`.

In [1]:
import torch
from torch import nn
import torch.nn.functional as F
from torchvision import datasets, transforms

In [2]:
# Define a transform to normalize the data
transform = transforms.Compose([transforms.ToTensor(),
                              transforms.Normalize((0.5,), (0.5,)),
                              ])

# Download and load the training data
trainset = datasets.MNIST('~/.pytorch/MNIST_data/', download=True, train=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

In [3]:
# define our model
model = nn.Sequential(nn.Linear(784, 128),
                      nn.ReLU(),
                      nn.Linear(128, 64),
                      nn.ReLU(),
                      nn.Linear(64,10))

In [4]:
# define our loss function
criterion = nn.CrossEntropyLoss()

In [5]:
# get a batch of data
images, labels = next(iter(trainloader))
print(images.shape)

torch.Size([64, 1, 28, 28])


In [6]:
images[0].shape

torch.Size([1, 28, 28])

In [7]:
# we need to flatten the images before passing them to the model
# this means keeping shape[0] (the batch dimension) as it is
# and flattening the other dimensions into one
images = images.view(images.shape[0], -1)

In [8]:
images.shape

torch.Size([64, 784])

In [9]:
# we can now use our model as if it is a function
# notice that we are using logits (output of our final layer)
# and not softmax output to train
logits = model(images)

In [10]:
loss = criterion(logits, labels)
loss

tensor(2.2916, grad_fn=<NllLossBackward>)
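A loss of roughly 2.3 is about what one would expect from an untrained 10-class classifier. As a rough sanity check, and assuming the untrained network spreads its probability almost evenly over the ten digits, the cross-entropy for one example is the negative log of the probability given to the true class:

```python
import math

num_classes = 10
# an untrained classifier that assigns equal probability to every digit
uniform_prob = 1.0 / num_classes
# cross-entropy for one example is -log(probability of the true class)
expected_loss = -math.log(uniform_prob)
print(round(expected_loss, 4))  # 2.3026
```

A starting loss far from this value can be a hint that something is off in the data pipeline or the model's output layer.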

In [11]:
# LogSoftmax is more numerically stable than Softmax followed by a log
# if we are using LogSoftmax, we apply torch.exp to get the actual probabilities
# and we use the negative log likelihood (NLLLoss) as our loss function


# dim=1 applies it across the columns (classes) of each row,
# so the probabilities for each example sum to 1
model = nn.Sequential(nn.Linear(784, 128),
                      nn.ReLU(),
                      nn.Linear(128, 64),
                      nn.ReLU(),
                      nn.Linear(64,10),
                      nn.LogSoftmax(dim=1))

criterion = nn.NLLLoss()

In [12]:
# get a batch of data
images, labels = next(iter(trainloader))
print(images.shape)

torch.Size([64, 1, 28, 28])


In [13]:
images = images.view(images.shape[0], -1)

In [14]:
log_prob = model(images)

In [15]:
loss = criterion(log_prob, labels)
loss

tensor(2.3023, grad_fn=<NllLossBackward>)
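The two setups above compute the same quantity: `CrossEntropyLoss` on raw logits matches `LogSoftmax` followed by `NLLLoss`. The sketch below checks this on random stand-in tensors (the values of `logits` and `labels` are made up for illustration) and shows that `torch.exp` turns log-probabilities back into probabilities.

```python
import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(4, 10)           # stand-in for raw model outputs
labels = torch.tensor([3, 0, 7, 9])   # stand-in class labels

# option 1: cross-entropy directly on the logits
ce = nn.CrossEntropyLoss()(logits, labels)

# option 2: log-softmax first, then negative log likelihood
log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, labels)

same = torch.allclose(ce, nll)
# torch.exp recovers probabilities; each row sums to 1
row_sums = torch.exp(log_probs).sum(dim=1)
print(same, row_sums)
```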

### Gradient descent

We use a PyTorch module called autograd to calculate gradients. Autograd works by keeping track of operations performed on tensors, then going backwards through those operations, calculating gradients along the way. To make PyTorch track the operations on a tensor we set `requires_grad=True`. We can do this at creation time, or later with `x.requires_grad_(True)`.

You can turn off gradients for a block of code with the `torch.no_grad()` context manager.
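As a small sketch of these switches, the snippet below compares an operation tracked by autograd with the same operation inside `torch.no_grad()`, and turns tracking on for an existing tensor. The tensor names and values are illustrative.

```python
import torch

x = torch.ones(2, 2, requires_grad=True)

y = x * 2                 # tracked: y has a grad_fn
with torch.no_grad():
    z = x * 2             # not tracked inside the block

w = torch.zeros(2, 2)     # created without gradient tracking
w.requires_grad_(True)    # switched on in place afterwards

print(y.requires_grad, z.requires_grad, w.requires_grad)
```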



In [16]:
x = torch.randn((2,2), requires_grad=True)
x

tensor([[1.2447, 1.7893],
        [0.8041, 0.1500]], requires_grad=True)

In [17]:
y = x ** 2
y

tensor([[1.5492, 3.2016],
        [0.6465, 0.0225]], grad_fn=<PowBackward0>)

In [18]:
# we can check the function that created a particular variable by using grad_fn
# here y was generated by the power function

# autograd keeps track of the functions that created a variable and calculates
# gradients that way
y.grad_fn

<PowBackward0 at 0x7f9d495d2438>

In [19]:
z = y.mean()
z.grad_fn

<MeanBackward1 at 0x7f9d487424a8>

In [20]:
# we can check the gradients of a variable by doing x.grad
# but this will be None since we haven't done a backward pass yet

In [21]:
x.grad

In [22]:
# to calculate the gradients we need to run .backward() on a variable
z.backward()

In [23]:
x.grad

tensor([[0.6223, 0.8947],
        [0.4020, 0.0750]])
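The gradient printed above is exactly half of `x`. That follows from the definition: `z` is the mean of four squared entries, so dz/dx_i = 2 * x_i / 4 = x_i / 2. A minimal check, using a fresh random tensor rather than the one from the cells above:

```python
import torch

torch.manual_seed(0)
x = torch.randn(2, 2, requires_grad=True)
z = (x ** 2).mean()
z.backward()

# z = (1/4) * sum(x_i^2), so dz/dx_i = 2 * x_i / 4 = x_i / 2
expected = x.detach() / 2
matches = torch.allclose(x.grad, expected)
print(matches)
```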

In [24]:
# putting it all together
model = nn.Sequential(nn.Linear(784, 128),
                      nn.ReLU(),
                      nn.Linear(128, 64),
                      nn.ReLU(),
                      nn.Linear(64,10),
                      nn.LogSoftmax(dim=1))

criterion = nn.NLLLoss()
images, labels = next(iter(trainloader))
images = images.view(images.shape[0], -1)
log_probs = model(images)  # the model ends in LogSoftmax, so these are log-probabilities
loss = criterion(log_probs, labels)

In [25]:
print('Before backward pass: \n', model[0].weight.grad)

loss.backward()

print('After backward pass: \n', model[0].weight.grad)

Before backward pass: 
 None
After backward pass: 
 tensor([[-0.0014, -0.0014, -0.0014,  ..., -0.0014, -0.0014, -0.0014],
        [ 0.0003,  0.0003,  0.0003,  ...,  0.0003,  0.0003,  0.0003],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [-0.0012, -0.0012, -0.0012,  ..., -0.0012, -0.0012, -0.0012],
        [-0.0018, -0.0018, -0.0018,  ..., -0.0018, -0.0018, -0.0018],
        [ 0.0029,  0.0029,  0.0029,  ...,  0.0029,  0.0029,  0.0029]])


In [26]:
# we use an optimizer to update the weights of our neural network

from torch import optim

# Optimizers require the parameters to optimize and a learning rate
optimizer = optim.SGD(model.parameters(), lr=0.01)

In [27]:
# one pass through the network
print('Initial weights - ', model[0].weight)

images, labels = next(iter(trainloader))
# flatten with view so this also works when a batch has fewer than 64 images
images = images.view(images.shape[0], -1)

# PyTorch by default accumulates gradients, that means that if we do multiple
# forward and backward passes, it is going to keep summing our gradients
# hence we call zero grad before every training pass

optimizer.zero_grad()

# make predictions
output = model(images)

# calculate the loss
loss = criterion(output, labels)

# calculate the gradients
loss.backward()
print('Gradient -', model[0].weight.grad)

Initial weights -  Parameter containing:
tensor([[-0.0196, -0.0063,  0.0280,  ..., -0.0045, -0.0026, -0.0128],
        [ 0.0056,  0.0230, -0.0016,  ...,  0.0097, -0.0002, -0.0232],
        [ 0.0003,  0.0046, -0.0290,  ..., -0.0219,  0.0272, -0.0255],
        ...,
        [ 0.0313,  0.0219,  0.0083,  ..., -0.0112, -0.0125, -0.0222],
        [ 0.0109,  0.0193, -0.0083,  ...,  0.0302, -0.0084, -0.0076],
        [ 0.0195, -0.0014, -0.0175,  ...,  0.0230,  0.0278, -0.0125]],
       requires_grad=True)
Gradient - tensor([[ 0.0014,  0.0014,  0.0014,  ...,  0.0014,  0.0014,  0.0014],
        [ 0.0010,  0.0010,  0.0010,  ...,  0.0010,  0.0010,  0.0010],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 0.0017,  0.0017,  0.0017,  ...,  0.0017,  0.0017,  0.0017],
        [-0.0015, -0.0015, -0.0015,  ..., -0.0015, -0.0015, -0.0015],
        [ 0.0025,  0.0025,  0.0025,  ...,  0.0025,  0.0025,  0.0025]])
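To see the accumulation behaviour in isolation, here is a small sketch on a two-element tensor (the values are illustrative). Calling `backward()` twice without clearing sums the gradients. Resetting them by hand with `w.grad.zero_()` plays the same role for this one tensor that `optimizer.zero_grad()` plays for all parameters managed by an optimizer.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w ** 2).sum().backward()
first = w.grad.clone()        # 2 * w = [2., 4.]

(w ** 2).sum().backward()     # no clearing in between
accumulated = w.grad.clone()  # gradients were summed: [4., 8.]

w.grad.zero_()                # reset the stored gradient by hand
(w ** 2).sum().backward()
after_zero = w.grad.clone()   # back to [2., 4.]

print(first, accumulated, after_zero)
```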


In [28]:
# make an update
optimizer.step()
print('Updated weights - ', model[0].weight)

Updated weights -  Parameter containing:
tensor([[-0.0196, -0.0063,  0.0280,  ..., -0.0045, -0.0026, -0.0128],
        [ 0.0056,  0.0230, -0.0016,  ...,  0.0097, -0.0003, -0.0232],
        [ 0.0003,  0.0046, -0.0290,  ..., -0.0219,  0.0272, -0.0255],
        ...,
        [ 0.0313,  0.0218,  0.0083,  ..., -0.0112, -0.0126, -0.0222],
        [ 0.0109,  0.0193, -0.0083,  ...,  0.0303, -0.0084, -0.0076],
        [ 0.0195, -0.0014, -0.0175,  ...,  0.0230,  0.0278, -0.0125]],
       requires_grad=True)
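The weights barely move because each one changes by only `lr * gradient`. For plain SGD with no momentum or weight decay, `optimizer.step()` applies `w_new = w - lr * grad`. The sketch below checks this on a small stand-in layer. The layer size, random input, and loss are assumptions chosen to keep the example short.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)
layer = nn.Linear(3, 1)                      # small stand-in model
optimizer = optim.SGD(layer.parameters(), lr=0.01)

loss = layer(torch.randn(5, 3)).pow(2).mean()
optimizer.zero_grad()
loss.backward()

# plain SGD (no momentum or weight decay): w_new = w - lr * grad
expected = layer.weight.detach() - 0.01 * layer.weight.grad
optimizer.step()

matches = torch.allclose(layer.weight.detach(), expected)
print(matches)
```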


### Training for a few epochs

In [29]:
model = nn.Sequential(nn.Linear(784, 128),
                      nn.ReLU(),
                      nn.Linear(128, 64),
                      nn.ReLU(),
                      nn.Linear(64, 10),
                      nn.LogSoftmax(dim=1))

criterion = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.003)

epochs = 5
for e in range(epochs):
    running_loss = 0
    for images, labels in trainloader:
        # Flatten MNIST images into a 784 long vector
        images = images.view(images.shape[0], -1)
    
        # Training pass: reset gradients, forward, loss, backward, update
        optimizer.zero_grad()
        output = model(images)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()
    # after each epoch, report the average loss per batch
    print(f"Training loss: {running_loss/len(trainloader)}")

Training loss: 1.9604032466660684
Training loss: 0.9002394439187894
Training loss: 0.5352060402602529
Training loss: 0.43329321025912443
Training loss: 0.38745840537204923


In [30]:
labels[0]

tensor(6)
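Since the model ends in `LogSoftmax`, its outputs have to go through `torch.exp` to be read as probabilities. The following is a minimal sketch of that step. To stay self-contained it builds an untrained copy of the architecture and feeds it a random stand-in image, so the predicted digit itself is meaningless; with the trained `model` and a real flattened image the same calls would apply.

```python
import torch
from torch import nn

torch.manual_seed(0)
# same architecture as above, but untrained (illustration only)
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(),
                      nn.Linear(128, 64), nn.ReLU(),
                      nn.Linear(64, 10), nn.LogSoftmax(dim=1))

image = torch.randn(1, 784)        # random stand-in for one flattened image

with torch.no_grad():              # no gradients needed for inference
    log_probs = model(image)

probs = torch.exp(log_probs)       # shape [1, 10], entries sum to 1
predicted_digit = probs.argmax(dim=1).item()
print(probs.sum().item(), predicted_digit)
```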