# Building neural networks with PyTorch

In this session, we are going to look at how we can use a neural network library called PyTorch to build, train and evaluate our own neural network!

As always we are going to start with importing the essential scientific Python packages

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

and we are also going to import the PyTorch package `torch` as well as the associated `torchvision` package that provides means of downloading and handling popular machine learning datasets.

In [None]:
import torch # import the PyTorch package
import torchvision # import the torchvision package

## PyTorch Tensors - enhanced NumPy arrays

When working with PyTorch, you work with its specialized object type called `Tensor`, which is built to represent **tensors** of arbitrary dimensions.

In [None]:
v = torch.Tensor([1, 2, 3])

In [None]:
v

In [None]:
v[0]

In [None]:
v.shape

You might notice that the `Tensor` appears extremely similar to NumPy's array, which can also represent high-dimensional arrays (i.e. tensors).

In fact, there is an almost one-to-one correspondence between the features and functionality of NumPy arrays and Tensors.

In [None]:
np.zeros((3, 5))

In [None]:
torch.zeros(3, 5)
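
The correspondence also extends to converting between the two. The following is a minimal sketch, assuming a small made-up array `a`; it uses `torch.from_numpy` and `Tensor.numpy` to move back and forth, and compares a row-wise sum written in both libraries.

```python
import numpy as np
import torch

a = np.arange(6, dtype=np.float32).reshape(2, 3)  # made-up NumPy array
t = torch.from_numpy(a)                            # Tensor sharing memory with `a`
back = t.numpy()                                   # and back to a NumPy array

# the same reduction is spelled almost identically in both libraries
np_sum = a.sum(axis=1)
torch_sum = t.sum(dim=1)

print(np_sum)
print(torch_sum)
```

Note the small naming difference: NumPy calls the argument `axis`, while PyTorch calls it `dim`.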

### Why PyTorch Tensors?

Given the striking similarity, you may be wondering why PyTorch doesn't just use NumPy arrays instead of going to the great effort of providing yet another high-dimensional array library with almost identical functionality.

One of the most critical reasons that Tensor is a distinct object from the NumPy array is that **Tensors can be placed in GPU memory** and **take advantage of the immense acceleration in computation provided by the parallelism of GPUs**.

I will demonstrate the striking difference in computation speed achieved when computing with GPUs vs CPUs later in this session.
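
As a small preview, here is a hedged sketch of how a Tensor can be created directly on a GPU when one is available. The variable names and the matrix size are made up for illustration, and the code falls back to the CPU if no CUDA device is found.

```python
import torch

# pick the GPU if PyTorch can see one, otherwise stay on the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

x = torch.rand(256, 256, device=device)  # tensor allocated on the chosen device
y = x @ x                                # matrix product computed on that device

print(device, y.shape)
```

The same code runs unchanged on either device, which is what makes it convenient to develop on a laptop and train on a GPU machine.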

# Get the data

When you want to design a neural network to achieve a task, you must first get your hands on the data so that you can **train your network on it**!

We are going to load a **training set** that we are going to use to train our network and a separate **test set** that we'll use to evaluate the performance of the network.

## Loading the training set

We can use convenience methods in `torchvision.datasets` to download various popular machine learning benchmark datasets. Here we are going to download [**MNIST**](http://yann.lecun.com/exdb/mnist/), which is a collection of images of handwritten digits along with labels (i.e. what digit was drawn).

The MNIST dataset consists of a total of 70,000 images, of which 60,000 are designated as the **training set** and 10,000 are designated as the **test set**. This standardized separation allows everyone around the world to evaluate and compare their models' performance with each other!

In [None]:
train_set = torchvision.datasets.MNIST('./data', train=True, download=True)

This returns Torchvision's special **dataset** object that can be used to represent **supervised datasets** consisting of both inputs (i.e. images) and targets (i.e. digit labels).

In [None]:
len(train_set)

In [None]:
image, label = train_set[100]

In [None]:
plt.imshow(image)
plt.title('Digit: {}'.format(label))

In [None]:
fig, axs = plt.subplots(5, 5, figsize=(6, 6))

for i, ax in enumerate(axs.ravel()):
    image, label = train_set[i]
    ax.imshow(image)
    ax.set_title('Digit: {}'.format(label))
    ax.axis('off')
    
fig.tight_layout()

## Loading the test set

You can get the test set in an identical fashion, passing in `train=False` into `MNIST`:

In [None]:
test_set = torchvision.datasets.MNIST('./data', train=False, download=True)

In [None]:
len(test_set)

In [None]:
image, label = test_set[3]

In [None]:
plt.imshow(image)
plt.title('Digit: {}'.format(label))

## Add data transforms

Even before you start feeding in your images into a neural network, it is very common to perform some data transformations - modifying images in some fixed manner that makes it easier to work with them.

One of the most common image transformations is **normalization**, where you first compute the mean and standard deviation across all images (typically in the training set). You then subtract the mean from each image and divide each image by the standard deviation. If you did this and recomputed the mean and standard deviation across all images, you would find that they now have a **mean of 0** and a **standard deviation of 1**, and thus they are said to be **normalized**.

Normalization helps keep input image intensities within an expected range, so that the network does not have to cope with large variations in image values that are otherwise visually uninteresting.
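
The normalization recipe can be sketched in a few lines. This is only an illustration on randomly generated stand-in "images" (not MNIST), using NumPy, to show that subtracting the mean and dividing by the standard deviation does give a mean of about 0 and a standard deviation of about 1.

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((100, 28, 28))   # made-up stand-in for 100 grayscale images

mean = images.mean()                 # mean over all pixels of all images
std = images.std()                   # standard deviation over the same pixels

normalized = (images - mean) / std   # subtract the mean, divide by the std

print(normalized.mean(), normalized.std())
```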

Also, when you load images from Torchvision, they are provided as Image objects from the Pillow package. Pillow is one of Python's popular image processing packages, and there images are represented by a dedicated Image object with many methods implementing common image processing operations.

In [None]:
type(image)

However, PyTorch networks only understand PyTorch Tensors, so we must convert the images from Pillow Images into PyTorch Tensors before we can pass them into the network.

We can achieve these two *transformations* by making use of Torchvision's transformation operations. You combine multiple transformation operations together and pass the combined transform in when loading the dataset. This returns a dataset that **applies these transformations** automatically to all images!

Let's add a transformation that will:
1. convert images into PyTorch tensors
2. normalize the images against the mean of 0.1307 and standard deviation of 0.3081.

In [None]:
from torchvision import transforms # get torchvision's transforms subpackage

In [None]:
# create a composite transform that first converts images to tensors and then normalizes them
image_transform = transforms.Compose([
    transforms.ToTensor(), # converts images into Tensors
    transforms.Normalize([0.1307], [0.3081])
])

# apply the transforms at the time of dataset loading
training_set = torchvision.datasets.MNIST('./data', train=True, download=True,
                                          transform=image_transform)
test_set = torchvision.datasets.MNIST('./data', train=False, download=True,
                                      transform=image_transform)

Now any image you access through the dataset has the transformation already applied

In [None]:
image, label = training_set[100]

In [None]:
type(image)

# Defining your network

In PyTorch, you define a new neural network by defining a **new class that inherits from nn.Module** as follows:

In [None]:
import torch.nn as nn
import torch.nn.functional as F

In [None]:
class MyNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(5, 10)
        
    def forward(self, x):
        y = self.fc(x) # fully connected layer
        z = F.relu(y) # activation function
        return z

To understand this better, let's take a quick review of classes and learn the new concept of **object inheritance**.

## Defining a class

In the past session, we have taken a look at defining a **class** to represent a grouping of data and functions, where each **instance of a class** or **object** can be thought of as representing a concrete unit that has **properties** and **behavior** (or **methods**).

In [None]:
class Person:
    def __init__(self, name):
        self.name = name  # assign name
        
    def title(self):
        return "an ordinary person."
    
    def greeting(self):
        print('Hello! My name is {}.'.format(self.name))
        print("I'm {}".format(self.title()))

In [None]:
edgar = Person('Edgar')
john = Person('John')

In [None]:
edgar.greeting()

In [None]:
john.greeting()

Here you can see that both `edgar` and `john` are objects of type (class) `Person`. Each has a property called `name` that is unique to it, and both share the common behaviors (methods) `title`, which returns a string describing the person, and `greeting`, which prints out a greeting message introducing themselves. Note that the `greeting` method calls the method `title` when creating the intro statement.

The key idea of **Object-Oriented Programming (OOP)** is to group certain data (e.g. `name`) with behavior (e.g. `greeting`, `title`) that, when put together, represent a conceptual grouping that may correspond to some real-world *object*.

## Specialization via inheritance

Now imagine that you want to define a new **class** of object called `Scientist` that has everything that a `Person` has (e.g. `name`, `greeting`, and `title`), but has an extra property called `topic` that specifies their research topic, and a new behavior (i.e. *method*) called `research` that finds a significant result at p-value < 0.05.

Without worrying much about code duplication, you could implement it as such:

In [None]:
import random

class Scientist:
    def __init__(self, name, topic):
        self.name = name
        self.topic = topic
        
    def title(self):
        return "an ordinary person."
    
    def greeting(self):
        print('Hello! My name is {}.'.format(self.name))
        print("I'm {}".format(self.title()))
 
    def research(self, silent=False):
        print('Performing a research on the topic {}...'.format(self.topic))
        pvalue = random.random() # randomly pick a value between [0, 1)
        
        if pvalue < 0.05:
            if not silent:
                print('Results statistically significant with p-value={:0.3f}!! Publish!!'.format(pvalue))
            return True
        else:
            if not silent:
                print('Results was not significant with p-value={:0.3f}... Continue working...'.format(pvalue))
            return False

In [None]:
edgar = Scientist(name='Edgar', topic='computational neuroscience')

In [None]:
edgar.greeting()

In [None]:
edgar.research()

Now notice that there is a lot of repeated code between a `Person` and a `Scientist`. Both have a property called `name` and methods called `title` and `greeting`.

After all, a Scientist **is a** Person, right?

When one class can be thought of as a **specialization** of another class, you can save a lot of typing and code duplication by using **class inheritance**!

In [None]:
import random

class Scientist(Person):  # Scientist inherits from Person
    def __init__(self, name, topic):
        super().__init__(name) # call __init__ of Person with name
        self.topic = topic

    def research(self, silent=False):
        print('Performing a research on the topic {}...'.format(self.topic))
        pvalue = random.random() # randomly pick a value between [0, 1)
        
        if pvalue < 0.05:
            if not silent:
                print('Results statistically significant with p-value={:0.3f}!! Publish!!'.format(pvalue))
            return True
        else:
            if not silent:
                print('Results was not significant with p-value={:0.3f}... Continue working...'.format(pvalue))
            return False

In [None]:
moku = Scientist(name='Moku', topic='physics')

In [None]:
moku.greeting()

In [None]:
moku.research()

Notice that the `Scientist` class no longer implements the `greeting` and `title` methods, yet you can still call them on an instance of `Scientist`. This is because these methods were **inherited** from the `Person` class.

Furthermore, we call something funny inside the `__init__` method: `super().__init__(name)`. As you may be able to guess, this calls the initializer of `Person` or the **super class**, passing in the value it expects (e.g. `name` of the person). This allows any complex configuration that `Person` might have done in its `__init__` to be reused. 

Also, you would refer to `Scientist` as a **subclass** of `Person` (alternatively, `Person` is a *super class* of `Scientist`). We also say that there exists an **is-a** relationship between `Scientist` and `Person` - a `Scientist` **is-a** `Person`.
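
The **is-a** relationship is something Python itself can check. The sketch below uses stripped-down copies of `Person` and `Scientist` (only the initializers, so that the block is self-contained) together with the built-in functions `isinstance` and `issubclass`.

```python
class Person:
    def __init__(self, name):
        self.name = name


class Scientist(Person):  # Scientist inherits from Person
    def __init__(self, name, topic):
        super().__init__(name)  # let Person set up the name
        self.topic = topic


moku = Scientist('Moku', 'physics')

is_person = isinstance(moku, Person)         # a Scientist instance is also a Person
is_subclass = issubclass(Scientist, Person)  # the classes themselves are related

print(is_person, is_subclass)
```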

Finally, you can **override** the super class's implementation of a method to give a new, specialized behavior to an existing method! Here, let's **override** the implementation of the method `title`, in effect customizing the introduction:

In [None]:
import random

class Scientist(Person):  # Scientist inherits from Person
    def __init__(self, name, topic):
        super().__init__(name)
        self.topic = topic
        
    # overriding title method
    def title(self):
        return "a researcher in {}".format(self.topic)

    def research(self, silent=False):
        print('Performing a research on the topic {}...'.format(self.topic))
        pvalue = random.random() # randomly pick a value between [0, 1)
        
        if pvalue < 0.05:
            if not silent:
                print('Results statistically significant with p-value={:0.3f}!! Publish!!'.format(pvalue))
            return True
        else:
            if not silent:
                print('Results was not significant with p-value={:0.3f}... Continue working...'.format(pvalue))
            return False

In [None]:
edgar = Scientist(name='Edgar', topic='computational neuroscience')

In [None]:
edgar.greeting()

You can see that we were able to modify the behavior of an already existing method `greeting` by overriding the behavior of another method, `title`. This pattern, in which you **customize** the behavior of an existing method by overriding another method, is quite common. In fact, we will soon encounter it when implementing our neural network in PyTorch!!

## Networks inherit from *nn.Module*

Now armed with knowledge of class inheritance, let's take another look at a typical definition of a neural network in PyTorch.

In PyTorch, any network **is a** `nn.Module`. In other words, you define a new class that inherits from `nn.Module`.

In [None]:
import torch.nn as nn
import torch.nn.functional as F

In [None]:
class MyNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(5, 10) # creates a new fully connected layer Module
        
    def forward(self, x):
        y = self.fc(x) # fully connected layer
        z = F.relu(y) # activation function
        return z

Above, we can see that `MyNetwork` *inherits from* `nn.Module`. This gives our class `MyNetwork` a lot of properties and methods that are already part of `nn.Module`, and this is precisely what allows you to define a new neural network with very little code.

A network (a module) in PyTorch (which is a class), typically consists of one or more other **modules** that you instantiate and hold onto as *properties*. Here, we are defining a single `nn.Linear` module which corresponds to a *fully-connected* linear layer connecting from 5 input neurons into 10 output neurons. We **instantiate** this module and assign it to the object's property named `fc` (standing for **f**ully **c**onnected layer).

Note that in the `__init__`, we have not computed anything. We simply instantiated a module and assigned it to a property for *later use*.

Real use of a PyTorch module comes in when you **instantiate** the class - that is, you create an object:

In [None]:
net = MyNetwork()

This action just created a new **instance** of the network, with its own network weights that can be trained!
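
To make "its own network weights" concrete, here is a short sketch that lists them. It repeats the `MyNetwork` definition so that it runs on its own, and uses `named_parameters` and `parameters`, which every `nn.Module` inherits; the variable names `shapes` and `n_params` are just for illustration.

```python
import torch.nn as nn
import torch.nn.functional as F


class MyNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(5, 10)

    def forward(self, x):
        return F.relu(self.fc(x))


net = MyNetwork()

# name and shape of every trainable tensor held by the network
shapes = {name: tuple(p.shape) for name, p in net.named_parameters()}
# total number of trainable values: 10 * 5 weights + 10 biases
n_params = sum(p.numel() for p in net.parameters())

print(shapes)
print(n_params)
```

The weight matrix has shape `(10, 5)` (outputs by inputs) and the bias has one entry per output neuron.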

A key feature of a module is that you can use it like a function - it accepts an input and returns an output!

In [None]:
x = torch.rand(1, 5) # a batch containing one vector of 5 input values

y = net(x) # you use a module instance like a function!

y

In [None]:
y.shape

The secret behind this is the `forward` method we defined in the `MyNetwork` class:

```python
def forward(self, x):
    y = self.fc(x)
    z = F.relu(y)
    return z
```

In this method, we accepted a parameter `x`, and we used the fully-connected linear layer module `self.fc` as a function with `x` as the input!

It turns out that `nn.Linear` is yet another **subclass** of `nn.Module` (that is, `nn.Linear` is a `nn.Module`) and thus can take an input and return an output. Our particular `self.fc` was configured to take in an input vector of size 5 and output a vector of size 10.

The `F.relu` is then an element-wise operation that clips any value less than 0 to 0, while keeping positive values as is. These functions are typically referred to as the **activation functions**. Finally, the `forward` function returns the output of `relu` and this becomes the output of the whole network.
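
A tiny sketch of this clipping behavior, using a made-up input vector; it also checks that `F.relu` gives the same result as clamping the values at a minimum of 0 with `torch.clamp`.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])  # made-up input values

relu_out = F.relu(x)             # negative values become 0, positive values pass through
clamped = torch.clamp(x, min=0)  # the same operation written as a clamp

print(relu_out)
```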

All in all, `MyNetwork` implemented a **single fully-connected layer neural network with ReLU activation function** - one of the simplest networks you can construct!

## Building network to classify images into digits 

Now that we have seen how to build a network by defining a new class that inherits from `nn.Module`, let's try to implement a network that takes in a $28 \times 28$ pixel grayscale image of a digit from MNIST and classifies it as 1 of the 10 digits!

The input will be one or more images of size $28 \times 28$, and we are going to set **the output to be a vector of size 10** where each position indicates a *log probability* that the image belongs to the specific digit.

We'll start with the simplest possible implementation, where we **flatten out** the input image into a vector of size 28 * 28 = 784. This will be a **fully connected neural network with no output nonlinearity** linking 784 input neurons to 10 output neurons.

In [None]:
class SimpleNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(784, 10)
        
    def forward(self, x):
        x = x.view(-1, 784) # flattens an image of form N x 1 x 28 x 28 -> N x 784
        x = self.fc(x)
        x = F.log_softmax(x, dim=1) # make sure that probabilities add up to one, and then take log
        return x

And that's it!

Let's instantiate the network and run an image through it:

In [None]:
net = SimpleNetwork()

In [None]:
image, label = test_set[30]

In [None]:
plt.imshow(image.squeeze(), cmap='gray')

In [None]:
net(image)

This is a log of class probabilities, so we can exponentiate this to get the actual probability over classes:

In [None]:
torch.exp(net(image))

Sure enough, they add up to 1:

In [None]:
torch.exp(net(image)).sum()
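
The reason the probabilities sum to 1 is that `log_softmax` subtracts the log of the sum of exponentials from every score. The sketch below checks this on a made-up row of three scores by comparing a manual computation (using `torch.logsumexp`) with `F.log_softmax`.

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([[2.0, 1.0, 0.1]])  # made-up raw network outputs for one image

# log_softmax(x) = x - log(sum(exp(x))), computed along the class dimension
manual = scores - torch.logsumexp(scores, dim=1, keepdim=True)
builtin = F.log_softmax(scores, dim=1)

total = torch.exp(builtin).sum().item()   # exponentiated values sum to 1

print(manual)
print(builtin)
print(total)
```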

We can perhaps take the index with the largest probability as the network's best guess:

In [None]:
p = torch.exp(net(image))
torch.argmax(p)

But the label is:

In [None]:
label

At this stage our network is not going to perform well at all.

That's expected because our network is **randomly initialized**! In order to get a reasonable performance, we need to **train** the network!

# Training a neural network

We are going to **train our network** by minimizing a **loss function** (also known as the cost function) - a function that evaluates how *off* we are from the true target. Choosing a good loss function can influence how well your network trains and ultimately performs on the task.

In the case of an **N-way classification** problem where the output is a vector of size *N*, it's quite common to treat the output as the log probabilities of the N classes, and optimize the network by minimizing the **negative log likelihood** loss.

This is conceptually similar to adjusting the network weights (parameters) so that the correct class has the highest probability among the choices.
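
As a concrete sketch of the negative log likelihood loss, assume two made-up samples with three classes each. `F.nll_loss` picks out the log probability of the correct class for each sample, negates it, and (by default) averages over the batch; the manual computation below does the same by hand.

```python
import torch
import torch.nn.functional as F

# made-up class probabilities for 2 samples and 3 classes, converted to log probabilities
log_probs = torch.log(torch.tensor([[0.7, 0.2, 0.1],
                                    [0.1, 0.1, 0.8]]))
targets = torch.tensor([0, 2])  # index of the correct class for each sample

loss = F.nll_loss(log_probs, targets)

# same thing by hand: average of -log p(correct class)
manual = -(log_probs[0, 0] + log_probs[1, 2]) / 2

print(loss.item(), manual.item())
```

The higher the probability assigned to the correct class, the closer its log is to 0 and the smaller the loss.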

In training a neural network, you would follow a procedure called **gradient descent** to adjust the values of the network weights such that the loss function becomes smaller.

One of the biggest strengths of frameworks like PyTorch lies in the fact that **they can compute the gradient of the loss with respect to all parameters in the network automatically** for you!
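
Here is a minimal sketch of both ideas together, assuming a made-up one-parameter "loss" $(w - 3)^2$ rather than a real network. PyTorch computes the gradient for us when we call `backward`, and we apply the gradient descent update by hand.

```python
import torch

w = torch.tensor(0.0, requires_grad=True)  # the single parameter we want to learn
lr = 0.1                                   # step size, chosen arbitrarily for this sketch

for _ in range(100):
    loss = (w - 3.0) ** 2   # toy loss, smallest at w = 3
    loss.backward()         # autograd fills in w.grad = d(loss)/dw
    with torch.no_grad():
        w -= lr * w.grad    # gradient descent step
    w.grad.zero_()          # clear the gradient before the next iteration

print(w.item())             # ends up very close to 3
```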

#### On gradient descent and back propagation

To compute the gradient of the loss with respect to the weights, neural network packages like PyTorch (and pretty much any other similar package) make use of a technique called **back propagation**. Refer to my slides for details on how backpropagation works to compute the gradients, and how the gradients can be used to adjust the parameters with gradient descent. Interested readers are strongly encouraged to refer to wonderful online resources such as the [Neural Networks and Deep Learning](http://neuralnetworksanddeeplearning.com/) online textbook by Michael Nielsen for further details.

## Training on a minibatch

In a full-fledged optimization, you would typically evaluate the loss on **the entire training dataset** and try to **optimize the joint loss** (e.g. the sum of all losses). However, this requires computing the loss on all images every time you make a small modification to the parameters, and for complex neural networks, this computational cost can be prohibitive.

Hence, you would typically **estimate** the joint loss by evaluating the loss on a randomly selected subset of the input-target pairs, that is, on **a minibatch** (or simply a batch) of data.

Performing gradient descent on randomly sampled subsets of the training set is known as (minibatch) **stochastic gradient descent** (SGD).

To be able to perform minibatch SGD, we need a way to construct random minibatches from the training dataset. Fortunately, this is easy to achieve using PyTorch's `DataLoader`:

In [None]:
batch_size = 64
training_loader = torch.utils.data.DataLoader(training_set, batch_size=batch_size, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=batch_size) # by default shuffle is False

`DataLoader` is an iterable that returns batches of input-target pairs of the specified *batch size*. We can take a look at what it will return:

In [None]:
for x, t in training_loader:
    print('Images:', x.shape)
    print('Labels:', t.shape)
    break # quit after one iteration

Note that for both the images (inputs) and the labels (targets), the **first dimension is the batch dimension**.
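
The batching behavior can also be tried without downloading anything. The sketch below assumes 100 randomly generated stand-in images and labels wrapped in a `TensorDataset`; with a batch size of 64 the loader yields one full batch and one smaller final batch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

inputs = torch.rand(100, 1, 28, 28)      # made-up stand-in for 100 images
targets = torch.randint(0, 10, (100,))   # made-up digit labels

loader = DataLoader(TensorDataset(inputs, targets), batch_size=64, shuffle=True)

# collect the shape of every batch the loader produces
batch_shapes = [(tuple(x.shape), tuple(t.shape)) for x, t in loader]

print(batch_shapes)
```

The last batch is smaller because 100 is not a multiple of 64; the same happens with the 60,000 MNIST training images.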

## Iterating through the minibatch

Now that we know how to get a minibatch, let's start putting together a training framework:

In [None]:
net = SimpleNetwork()
net.train() # puts the network into the training mode

for batch_idx, (data, target) in enumerate(training_loader):
    # evaluate the network output
    output = net(data)
    
    # compute the loss
    loss = F.nll_loss(output, target)
    
    # MISSING! Perform back propagation to compute gradient and perform gradient descent step!

The above code successfully steps through the 60,000 training images in batches of 64, evaluates the network on the inputs, and computes the loss between the network predictions and the targets.

However, we are still missing the step of computing the gradient of loss with respect to the parameters of the network, and performing gradient descent to actually adjust and therefore **train** the network parameters!

We achieve this by using:

1. A `backward` method call on the final loss to trigger the backpropagation computation, and
2. An **optimizer** to adjust the values of the parameters based on the gradient - that is, perform a step of gradient descent
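
To make the optimizer's role less mysterious, here is a hedged sketch on a made-up tiny layer and a made-up loss. For plain `torch.optim.SGD` (no momentum or weight decay), one call to `step` should amount to subtracting the learning rate times the gradient from each parameter, which the code checks by hand.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(3, 1)                  # tiny made-up layer
lr = 0.1
optimizer = torch.optim.SGD(layer.parameters(), lr=lr)

x = torch.rand(4, 3)                     # made-up batch of 4 inputs
loss = layer(x).pow(2).mean()            # made-up loss, just to get gradients

optimizer.zero_grad()
loss.backward()                          # step 1: backpropagation fills in .grad

# what we expect the weights to become after one gradient descent step
expected = layer.weight.detach() - lr * layer.weight.grad

optimizer.step()                         # step 2: the optimizer updates the parameters
matches = torch.allclose(layer.weight.detach(), expected)

print(matches)
```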

Putting this all together:

In [None]:
net = SimpleNetwork()
net.train() # puts the network into the training mode

# create and initialize an optimizer
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)

for batch_idx, (data, target) in enumerate(training_loader):
    # reset the gradient before the next gradient step
    optimizer.zero_grad()
    
    # evaluate the network output
    output = net(data)
    
    # compute the loss
    loss = F.nll_loss(output, target)
    
    # perform back propagation to compute gradients with respect to parameters!
    loss.backward()
    
    # perform a gradient descent step on the parameters
    optimizer.step()
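
The gradient needs to be reset on every iteration because `backward` adds to any gradient that is already stored instead of replacing it. A small sketch with a single made-up parameter shows the accumulation:

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(2 * w).backward()
first = w.grad.item()        # gradient of 2 * w with respect to w

(2 * w).backward()
accumulated = w.grad.item()  # the second gradient was added to the first

w.grad.zero_()               # manual reset, similar in spirit to optimizer.zero_grad()
after_reset = w.grad.item()

print(first, accumulated, after_reset)
```

Without the reset, each update would be based on the sum of the gradients from all previous batches rather than on the current batch alone.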

The final missing piece is some sort of **monitoring** by which we can observe that the network is actually training (that is, the loss decreases over training).

In [None]:
import time # use time module to measure how long it takes to train a network

In [None]:
net = SimpleNetwork()
net.train() # puts the network into the training mode

# create and initialize an optimizer
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)

start = time.time()
for batch_idx, (data, target) in enumerate(training_loader):
    # reset the gradient before the next gradient step
    optimizer.zero_grad()
    
    # evaluate the network output
    output = net(data)
    
    # compute the loss
    loss = F.nll_loss(output, target)
    
    # perform back propagation to compute gradients with respect to parameters!
    loss.backward()
    
    # perform a gradient descent step on the parameters
    optimizer.step()
    
    # report the loss every 100 batches
    if batch_idx % 100 == 0:
        print('Loss: {:.6f}'.format(loss.item()))

duration = time.time() - start
print('Training completed in {:.2f} seconds'.format(duration))

We see that our network does appear to train over time, as shown by the fact that the loss decreases over iterations. You can also see that the loss eventually stops decreasing as the training reaches the end.

You can adjust the speed of the training by changing the value of the **learning rate**. Loosely speaking, learning rate controls the size of the step for each gradient descent step. 

A larger learning rate can lead to faster training, but it can also easily lead you away from an optimal solution. Achieving good network training depends a lot on a good choice of the learning rate.
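
The effect of the learning rate can be illustrated without any network at all. The sketch below assumes a made-up loss $w^2$ (whose gradient is $2w$) and a hypothetical helper `descend`; a small learning rate moves `w` toward the minimum at 0, while a learning rate that is too large overshoots further on every step.

```python
def descend(lr, steps=20, w=1.0):
    """Run plain gradient descent on the toy loss w**2 and return the final w."""
    for _ in range(steps):
        grad = 2 * w       # derivative of w**2
        w = w - lr * grad  # gradient descent step
    return w


small_lr = descend(0.1)    # each step shrinks w by a factor of 0.8
large_lr = descend(1.1)    # each step flips the sign of w and grows it by a factor of 1.2

print(small_lr, large_lr)
```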

## Testing the network

Now that we have a trained network, it's time to evaluate its performance on the test set. Training on one set and testing on a distinct set that was **not** used during training is a simple form of **cross-validation** (often called hold-out validation), and it is a good way to evaluate how well your network will **generalize** beyond the training set.

In [None]:
net.eval() # put network into evaluation mode
test_loss = 0
correct = 0

# prevents unnecessary gradient computation during test - can lead to time and memory saving
with torch.no_grad(): 
    for data, target in test_loader:
        output = net(data)
        
        # sum up batch loss
        test_loss += F.nll_loss(output, target, reduction='sum').item() # size_average is deprecated
        
        # get the index of the max log-probability
        pred = output.max(1, keepdim=True)[1] 
        
        # count number of times where max probability matches the label index
        correct += pred.eq(target.view_as(pred)).sum().item()

# divide the test loss by number of samples in the test set
test_loss /= len(test_loader.dataset)

print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
    test_loss, correct, len(test_loader.dataset),
    100. * correct / len(test_loader.dataset)))
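
The counting of correct predictions can be followed on a small made-up batch. In this sketch, three rows of log probabilities over three classes stand in for the network output; `argmax(dim=1)` gives the same indices as `output.max(1, keepdim=True)[1]`, only without the extra dimension.

```python
import torch

# made-up log probabilities for 3 samples and 3 classes
output = torch.tensor([[-0.1, -2.5, -3.0],
                       [-2.0, -0.2, -3.0],
                       [-1.5, -1.0, -0.7]])
target = torch.tensor([0, 1, 1])           # made-up true labels

pred = output.argmax(dim=1)                # index of the largest log probability per row
correct = (pred == target).sum().item()    # number of matching predictions

# the form used in the evaluation loop gives the same indices
pred_keepdim = output.max(1, keepdim=True)[1]

print(pred, correct)
```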

You should see that our network actually performs ~84-90% correct on the digit classification!

Let's take the earlier example:

In [None]:
image, label = test_set[30]

In [None]:
label

In [None]:
p = torch.exp(net(image))
torch.argmax(p)

# Training a more complex network

Above we were already getting well above chance performance on digit classification with an extremely simple **single fully-connected layer network**. Now, let's try improving the result by training a slightly more complex network - a **three layer fully-connected network with ReLU nonlinearity**.

In [None]:
class ComplexNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 200)
        self.fc2 = nn.Linear(200, 200)
        self.fc3 = nn.Linear(200, 10)
        
    def forward(self, x):
        x = x.view(-1, 784) # flattens an image of form N x 1 x 28 x 28 -> N x 784
        x = F.relu(self.fc1(x)) # first fully connected layer followed by ReLU
        x = F.relu(self.fc2(x)) # second fully connected layer followed by ReLU
        x = self.fc3(x) # third fully connected layer *without* output ReLU
        x = F.log_softmax(x, dim=1) # make sure that probabilities add up to one, and then take log
        return x

Let's now go ahead and train it!

In [None]:
net = ComplexNetwork()
net.train() # puts the network into the training mode

# create and initialize an optimizer
optimizer = torch.optim.SGD(net.parameters(), lr=0.005)


start = time.time()
for batch_idx, (data, target) in enumerate(training_loader):
    # reset the gradient before the next gradient step
    optimizer.zero_grad()
    
    # evaluate the network output
    output = net(data)
    
    # compute the loss
    loss = F.nll_loss(output, target)
    
    # perform back propagation to compute gradients with respect to parameters!
    loss.backward()
    
    # perform a gradient descent step on the parameters
    optimizer.step()
    
    # report the loss every 100 batches
    if batch_idx % 100 == 0:
        print('Loss: {:.6f}'.format(loss.item()))
        
duration = time.time() - start
print('Training completed in {:.2f} seconds'.format(duration))

You should notice two things:
1. It trains a bit slower
2. The loss is still decreasing at the end of the training.

It is slower because the network is a bit more complex and requires more computation.
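
A rough way to quantify "more complex" is to count trainable parameters. The sketch below uses a hypothetical helper `linear_params` and the layer sizes from `SimpleNetwork` and `ComplexNetwork`; each fully connected layer has one weight per input-output pair plus one bias per output.

```python
def linear_params(n_in, n_out):
    """Number of trainable values in a fully connected layer: weights plus biases."""
    return n_in * n_out + n_out


simple_total = linear_params(784, 10)

complex_total = (linear_params(784, 200)
                 + linear_params(200, 200)
                 + linear_params(200, 10))

print(simple_total, complex_total)
```

By this count the three layer network has roughly 25 times as many parameters as the single layer network.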

The second point is the more critical one. Since the loss is still decreasing, we should not limit ourselves to a single pass through our training set, but rather make multiple passes through it. Each complete pass through the training set is called **an epoch**, so here we want to perform **multiple epoch** training!
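
To get a feel for the numbers, the short calculation below assumes the 60,000 training images and the batch size of 64 used earlier, and 5 epochs as an example. The variable names are made up for illustration.

```python
import math

n_images = 60000
batch_size = 64
n_epochs = 5

# the last batch of an epoch is smaller, so round up
batches_per_epoch = math.ceil(n_images / batch_size)
total_updates = batches_per_epoch * n_epochs

print(batches_per_epoch, total_updates)
```

Each batch corresponds to one gradient descent step, so training for several epochs simply gives the optimizer that many more steps on the same data.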

In [None]:
net = ComplexNetwork()
net.train() # puts the network into the training mode

# create and initialize an optimizer
optimizer = torch.optim.SGD(net.parameters(), lr=0.005)

start = time.time()
for epoch_idx in range(5):
    for batch_idx, (data, target) in enumerate(training_loader):
        # reset the gradient before the next gradient step
        optimizer.zero_grad()

        # evaluate the network output
        output = net(data)

        # compute the loss
        loss = F.nll_loss(output, target)

        # perform back propagation to compute gradients with respect to parameters!
        loss.backward()

        # perform a gradient descent step on the parameters
        optimizer.step()

        # report the loss every 100 batches
        if batch_idx % 100 == 0:
            print('Epoch {} Loss: {:.6f}'.format(epoch_idx, loss.item()))
            
duration = time.time() - start
print('Training completed in {:.2f} seconds'.format(duration))

Now it takes even longer to train because we are going through the dataset multiple times!

In [None]:
net.eval() # put network into evaluation mode
test_loss = 0
correct = 0

# prevents unnecessary gradient computation during test - can lead to time and memory saving
with torch.no_grad(): 
    for data, target in test_loader:
        output = net(data)
        
        # sum up batch loss
        test_loss += F.nll_loss(output, target, reduction='sum').item() # size_average is deprecated
        
        # get the index of the max log-probability
        pred = output.max(1, keepdim=True)[1] 
        
        # count number of times where max probability matches the label index
        correct += pred.eq(target.view_as(pred)).sum().item()

# divide the test loss by number of samples in the test set
test_loss /= len(test_loader.dataset)

print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
    test_loss, correct, len(test_loader.dataset),
    100. * correct / len(test_loader.dataset)))

But indeed, we see improvement in the network performance!