# PyTorch 101

## Introduction

**PyTorch** is one of the **fastest growing** Deep Learning frameworks and it is also used by **Fast.ai** in its MOOC, [Deep Learning for Coders](https://course.fast.ai/), and in its [library](https://docs.fast.ai/).

PyTorch is also very *pythonic*, meaning it feels natural to use if you are already a Python developer.

Besides, using PyTorch may even improve your health, according to [Andrej Karpathy](https://twitter.com/karpathy/status/868178954032513024) :-)


## Motivation

There are *many many* PyTorch tutorials around and its documentation is quite complete and extensive. So, **why** should you keep reading this step-by-step tutorial?

Well, even though one can find information on pretty much anything PyTorch can do, I missed having a **structured, incremental and from first principles** approach to it.

In [0]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')

import torch
import torch.optim as optim
import torch.nn as nn

## TorchVision

[Torchvision](https://pytorch.org/docs/stable/torchvision/index.html) is a package containing popular datasets, model architectures and common image transformations for computer vision.

### Datasets

All [datasets](https://pytorch.org/docs/stable/torchvision/datasets.html) included in torchvision are subclasses of the [**Dataset**](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) class we've seen earlier in this tutorial, so we can stick with the [**DataLoader**](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) as well.

We'll work with the canonical dataset for computer vision tutorials: [**MNIST**](https://pytorch.org/docs/stable/torchvision/datasets.html#mnist).

**Training Set**

http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz

http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz

**Test Set**

http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz

http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz


In [0]:
from torchvision import datasets

mnist_train = datasets.MNIST('./mnist', train=True, download=True)
mnist_test = datasets.MNIST('./mnist', train=False, download=True)

In [0]:
mnist_train

In [0]:
mnist_train.data.shape

In [0]:
mnist_train.targets

There are 60k 28x28 images and the targets are the corresponding digits.

Let's take a look at the first sample, which is a **5**. You'll notice that, somewhat unsurprisingly, the sample is a **tensor**.

In [0]:
sample_tensor = mnist_train.data[0]
plt.imshow(sample_tensor)
print(sample_tensor.type())

### Transforms

Torchvision has some common image transformations on its [**transforms**](https://pytorch.org/docs/stable/torchvision/transforms.html#torchvision-transforms) module. It is important to realize there are two main groups of transformations:

- transformations based on [**PIL images**](https://pytorch.org/docs/stable/torchvision/transforms.html#transforms-on-pil-image)
- transformations based on [**Tensors**](https://pytorch.org/docs/stable/torchvision/transforms.html#transforms-on-torch-tensor)

Naturally, there are transformations to convert between the two: [**ToPILImage**](https://pytorch.org/docs/stable/torchvision/transforms.html#torchvision.transforms.ToPILImage) converts a tensor into a PIL image, and [**ToTensor**](https://pytorch.org/docs/stable/torchvision/transforms.html#torchvision.transforms.ToTensor) converts a PIL image into a tensor.

Let's try converting our sample tensor into a sample (PIL) image.

In [0]:
from torchvision.transforms import ToPILImage

sample_img = ToPILImage()(sample_tensor)
plt.imshow(sample_img)

### Transforms on PIL Image

These transforms include the typical things you'd like to do with an image: [Resize](https://pytorch.org/docs/stable/torchvision/transforms.html#torchvision.transforms.Resize), [CenterCrop](https://pytorch.org/docs/stable/torchvision/transforms.html#torchvision.transforms.CenterCrop), [Grayscale](https://pytorch.org/docs/stable/torchvision/transforms.html#torchvision.transforms.Grayscale), [RandomHorizontalFlip](https://pytorch.org/docs/stable/torchvision/transforms.html#torchvision.transforms.RandomHorizontalFlip), to name a few.

They take a **PIL Image** as input, not a tensor. So, let's use our sample image from the previous step and try some **random horizontal flipping**. But, just to make sure we flip, let's ditch the randomness and make it flip 100% of the time.

In [0]:
from torchvision.transforms import RandomHorizontalFlip

flipper = RandomHorizontalFlip(p=1.0)

In [0]:
flipped_img = flipper(sample_img)
plt.imshow(flipped_img)

Ok, we have a **flipped 5** now.

Let's take a look at the other group of transformations now...

### Transforms on Tensor

There are only three transforms that take tensors as inputs: [LinearTransformation](https://pytorch.org/docs/stable/torchvision/transforms.html#torchvision.transforms.LinearTransformation), [Normalize](https://pytorch.org/docs/stable/torchvision/transforms.html#torchvision.transforms.Normalize) and [RandomErasing](https://pytorch.org/docs/stable/torchvision/transforms.html#torchvision.transforms.RandomErasing) (although I believe this one would be a better fit for the other group of transforms...).

Let's apply a **Normalize** transform to our **flipped 5**. But, in order to be able to do that, we need to make it back into a tensor using **ToTensor**.

In [0]:
from torchvision.transforms import ToTensor

tensorizer = ToTensor()
img_tensor = tensorizer(flipped_img)
img_tensor

In [0]:
img_tensor.min(), img_tensor.max()

### Normalize Transform

From our image tensor, we see its values are in the **[0, 1]** range.

Usually, when dealing with neural networks, it is better to have our inputs in a symmetrical range, like **[-1, 1]**.

And that's why we are going to use the **Normalize** transform. From PyTorch's extensive documentation, we get:

`Normalize a tensor image with mean and standard deviation.`

`Given mean: (M1,...,Mn) and std: (S1,..,Sn) for n channels, this transform will normalize each channel of the input torch.*Tensor i.e.`

`input[channel] = (input[channel] - mean[channel]) / std[channel]`

So, if we would like to map our [0, 1] range into [-1, 1], we can set our **`mean` to 0.5** and our **`std` to 0.5** as well.

This way, we get:
- 0 input is transformed into (0 - .5)/.5 = -1
- 1 input is transformed into (1 - .5)/.5 = 1

---

Even though the transform is called **Normalize**, what we have just done with the inputs is actually a **min-max scaling**. We have **NOT** computed the **true mean** and **true standard deviation**; we have just set both to 0.5 for our convenience.

Had we used the true values for both mean and standard deviation, we would have achieved a **standardization**, that is, our data would have **zero mean** and **unit standard deviation**.

---
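
To make the distinction concrete, here is a minimal sketch in plain PyTorch. The random tensor `x` is only a stand-in for an image with values in the [0, 1] range (an assumption for illustration, not MNIST data). It compares the fixed rescaling performed with `mean=0.5` and `std=0.5` against a standardization that uses the data's own statistics.

```python
import torch

torch.manual_seed(0)
# Stand-in for a single-channel 28x28 image with values in [0, 1)
x = torch.rand(1, 28, 28)

# What Normalize(mean=(.5,), std=(.5,)) computes: a fixed rescaling into [-1, 1]
scaled = (x - 0.5) / 0.5

# Standardization: use the data's own mean and standard deviation instead
standardized = (x - x.mean()) / x.std()

print(scaled.min().item(), scaled.max().item())
print(standardized.mean().item(), standardized.std().item())
```

The first line printed stays within [-1, 1], while the second shows a mean close to zero and a standard deviation close to one.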

Our images have only **1 channel**, so we only need to specify one element in each tuple.

In [0]:
from torchvision.transforms import Normalize

normalizer = Normalize(mean=(.5,), std=(.5,))
normalized_tensor = normalizer(img_tensor)
normalized_tensor

In [0]:
normalized_tensor.min(), normalized_tensor.max()

The range is [-1, 1] now.

**Mission accomplished!**

### Compose

Sure enough, we don't need to do transformations one by one: we can use [**Compose**](https://pytorch.org/docs/stable/torchvision/transforms.html#torchvision.transforms.Compose) for that.

This is straightforward: line all desired transformations up in a list. This is pretty much the same as a Pipeline in Scikit-Learn.

In [0]:
from torchvision.transforms import Compose

composer = Compose([ToPILImage(),
                    RandomHorizontalFlip(p=1.0),
                    ToTensor(),
                    Normalize(mean=(.5,), std=(.5,))])

composed_tensor = composer(sample_tensor)

The resulting tensor should be exactly the same as the tensor we have transformed step-by-step... let's confirm that:

In [0]:
(composed_tensor == normalized_tensor).all()
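
Conceptually, composing transforms just means calling each one in order on the output of the previous one. The sketch below illustrates that idea with a hypothetical `SimpleCompose` class and two toy functions acting on plain numbers. It is not torchvision's actual implementation, only an illustration of the pattern.

```python
class SimpleCompose:
    """Hypothetical stand-in for Compose: applies callables in order."""

    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, x):
        for transform in self.transforms:
            x = transform(x)
        return x


def to_unit_range(value):
    # Toy version of the [0, 255] -> [0, 1] rescaling that ToTensor performs
    return value / 255.0


def normalize(value):
    # Toy version of Normalize with mean 0.5 and std 0.5
    return (value - 0.5) / 0.5


pipeline = SimpleCompose([to_unit_range, normalize])
print(pipeline(0), pipeline(255))
```

A black pixel (0) ends up at -1.0 and a white pixel (255) ends up at 1.0, matching the range we obtained step by step.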

### Transforming Datasets

The best thing about composing transforms is the fact you can apply them whenever a given data point is being sampled from the dataset.

Remember the [**Dataset**](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) class, which torchvision datasets use as parent class? It is possible to call a composed transformation, just like we did, inside the **`__getitem__(self, index)`** method.

And that's exactly what torchvision datasets do! You can specify a **transform** argument and, whenever your [**DataLoader**](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) fetches some samples, they will be transformed already! **Beautiful**, uh?

In [0]:
from torch.utils.data import DataLoader

new_composer = Compose([RandomHorizontalFlip(p=.5),
                        ToTensor(),
                        Normalize(mean=(.5,), std=(.5,))])

mnist_train = datasets.MNIST('./mnist', train=True, download=True, transform=new_composer)
mnist_test = datasets.MNIST('./mnist', train=False, download=True, transform=new_composer)

train_loader = DataLoader(mnist_train, batch_size=128, shuffle=True)
val_loader = DataLoader(mnist_test, batch_size=128, shuffle=True)

In [0]:
transformed_sample = next(iter(train_loader))[0][0]
transformed_sample

In [0]:
transformed_sample.min(), transformed_sample.max()

Just as expected!
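
To see the mechanism without torchvision, here is a minimal sketch of a custom dataset that applies a transform inside `__getitem__`. The class name `TransformedDataset`, the synthetic data and the lambda used as transform are all assumptions made for illustration.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class TransformedDataset(Dataset):
    """Hypothetical dataset that applies a transform on every access."""

    def __init__(self, data, targets, transform=None):
        self.data = data
        self.targets = targets
        self.transform = transform

    def __getitem__(self, index):
        x = self.data[index]
        if self.transform is not None:
            # The transform runs here, each time a sample is fetched
            x = self.transform(x)
        return x, self.targets[index]

    def __len__(self):
        return len(self.data)


torch.manual_seed(0)
# Synthetic stand-in for images already scaled to [0, 1)
data = torch.rand(10, 1, 28, 28)
targets = torch.randint(0, 10, (10,))

dataset = TransformedDataset(data, targets, transform=lambda t: (t - 0.5) / 0.5)
loader = DataLoader(dataset, batch_size=4, shuffle=False)

x_batch, y_batch = next(iter(loader))
print(x_batch.shape, x_batch.min().item(), x_batch.max().item())
```

The stored data is never modified: every batch is transformed on the fly, which is why random transforms can produce a different result each epoch.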

## A Simple Neural Network

We will use the **MNIST** dataset and build a simple neural network to try to classify the hand-written digits.

Besides, let's keep it simple and build a **Sequential** model, just like we did with our linear regression.

Our input images have 28x28 pixels, that is, 784 pixels in total. Our targets are 10 different classes (digits 0 to 9). So, our model can be structured as follows:

- **input layer**: 784 units
- **hidden layer(s) and non-linear activations**: we can get creative :-)
- **output layer**: 10 units

For now, let's use **one hidden layer** with **50 units** and let's use a [**ReLU**](https://pytorch.org/docs/stable/nn.html#relu) as non-linear activation.

Since our inputs are 2-dimensional (28x28) and our model has 784 inputs, we still need to add a [**Flatten**](https://pytorch.org/docs/stable/nn.html#flatten) layer at the beginning.

So, our model would look like this - let's use [**add_module**](https://pytorch.org/docs/stable/nn.html#torch.nn.Module.add_module) to **name** our layers:

In [0]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = nn.Sequential()
model.add_module('flatten', nn.Flatten())
model.add_module('linear1', nn.Linear(784, 50))
model.add_module('relu1', nn.ReLU())
model.add_module('output', nn.Linear(50, 10))
model.to(device)

print(model)

If we use model's [**named_modules()**](https://pytorch.org/docs/stable/nn.html#torch.nn.Module.named_modules) method, we can retrieve a list of all modules and the model itself!

In [0]:
list(model.named_modules())

And we can access the layers / modules by name using dot notation:

In [0]:
model.linear1

We can also get its corresponding **weights**: `model.linear1.weight`

Let's take a look at the weights, plotting them:

In [0]:
weights = model.linear1.weight
plt.hist(weights.detach().cpu().numpy().reshape(-1,))

In [0]:
print(weights.min(), weights.max())

The linear layer has **39200 weights** (784 x 50) and we can see they are **uniformly distributed in the [-.0357, .0357] range**. Do you think this is a **good initialization scheme**?
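
Where do these numbers come from? By default, `nn.Linear` draws its weights uniformly between plus and minus one over the square root of the number of input features. The short check below reproduces both figures on a freshly created layer (a standalone layer, not the model above).

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(784, 50)

n_weights = layer.weight.numel()   # 784 inputs x 50 units
bound = 1 / math.sqrt(784)         # 1 / sqrt(in_features) = 1 / 28

print(n_weights)                   # 39200
print(round(bound, 4))             # 0.0357
print(layer.weight.min().item(), layer.weight.max().item())
```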

### init

Initialization schemes are a big deal! It goes beyond the scope of this tutorial to go into more details, but we'll see how we can **initialize the weights** of a given layer in our neural network.

For more details on **weight initialization schemes**, you can check my [post](https://towardsdatascience.com/hyper-parameters-in-action-part-ii-weight-initializers-35aee1a28404) out :-)

PyTorch has an [**init**](https://pytorch.org/docs/stable/nn.init.html#torch-nn-init) module that has the most common initialization schemes, such as [**kaiming_uniform**](https://pytorch.org/docs/stable/nn.init.html#torch.nn.init.kaiming_uniform_) and [**kaiming_normal**](https://pytorch.org/docs/stable/nn.init.html#torch.nn.init.kaiming_normal_) - also known as **He initialization** - the recommended scheme for using with **ReLU** activation function.

Let's use **kaiming_normal_** (notice the **_** - we are making changes **in place**!) to initialize the weights of our linear layer. And let's zero our biases too.

In [0]:
nn.init.kaiming_normal_(model.linear1.weight, mode='fan_in', nonlinearity='relu')
nn.init.constant_(model.linear1.bias, 0)

Did it work? Let's plot it!

In [0]:
weights = model.linear1.weight
plt.hist(weights.detach().cpu().numpy().reshape(-1,))

Awesome! We have successfully initialized the weights of our network!
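
Besides plotting, we can check the result numerically. With `mode='fan_in'` and `nonlinearity='relu'`, the weights are drawn from a normal distribution with standard deviation `sqrt(2 / fan_in)`. The sketch below compares that value against the empirical standard deviation of a standalone layer (created here only for illustration).

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(784, 50)
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')

# Gain for ReLU is sqrt(2) and fan_in is the number of input features (784)
expected_std = math.sqrt(2 / 784)
empirical_std = layer.weight.std().item()

print(round(expected_std, 4))
print(round(empirical_std, 4))
```

The two values should agree closely, since the layer has 39200 weights to estimate the standard deviation from.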

### Softmax Layer for Classification

You may be thinking: "*this is a **multi-class classification** problem... where is the **softmax** layer at the end?*"

Well, you do have a point... but PyTorch makes things a bit confusing here... 

You **MAY** add a softmax layer at the end or, better yet, a [**LogSoftmax**](https://pytorch.org/docs/stable/nn.html#logsoftmax). If you **DO add a LogSoftmax**, you **MUST** use the corresponding negative log-likelihood loss ([**NLLLoss**](https://pytorch.org/docs/stable/nn.html#nllloss)), which expects **log-probabilities** as inputs (so the output of a plain softmax is not a valid input for it).

**BUT**

If you **DO NOT** add a **LogSoftmax** layer at the end, you **MUST** use the [**CrossEntropyLoss**](https://pytorch.org/docs/stable/nn.html#crossentropyloss) which, by itself, combines **LogSoftmax AND NLLLoss**.

---

In summary, you have two options:
- **do** add **LogSoftmax** layer at the end and use **NLLLoss**, or
- just use **CrossEntropyLoss**

---

We use the latter, as it is simpler.
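
If you'd like to convince yourself that both options are equivalent, the sketch below computes the loss both ways on made-up logits and targets (random values, not outputs of our model).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 10)            # raw outputs for 4 samples and 10 classes
targets = torch.tensor([3, 0, 9, 1])   # made-up class labels

# Option 1: LogSoftmax followed by NLLLoss
log_probs = nn.LogSoftmax(dim=1)(logits)
loss_nll = nn.NLLLoss()(log_probs, targets)

# Option 2: CrossEntropyLoss applied directly to the logits
loss_ce = nn.CrossEntropyLoss()(logits, targets)

print(loss_nll.item(), loss_ce.item())
print(torch.allclose(loss_nll, loss_ce))
```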

### Training Loop

We can also reuse the code from the **Regression** tutorial to perform the model training:

In [0]:
def make_train_step(model, loss_fn, optimizer):
    # Builds function that performs a step in the train loop
    def train_step(x, y):
        # Sets model to TRAIN mode
        model.train()
        # Step 1: Makes predictions
        yhat = model(x)
        # Step 2: Computes loss
        loss = loss_fn(yhat, y)
        # Step 3: Computes gradients
        loss.backward()
        # Step 4: Updates parameters and zeroes gradients
        optimizer.step()
        optimizer.zero_grad()
        # Returns the loss
        return loss.item()
    
    # Returns the function that will be called inside the train loop
    return train_step

def validation(model, loss_fn, val_loader):
    # Figures device from where the model parameters (hence, the model) are
    device = next(model.parameters()).device.type

    # sets model to EVAL mode (once, before looping over the batches)
    model.eval()

    val_batch_losses = []
    # no gradients in validation!
    with torch.no_grad():
        for x_val, y_val in val_loader:
            x_val = x_val.to(device)
            y_val = y_val.to(device)

            # make predictions
            yhat = model(x_val)
            val_loss = loss_fn(yhat, y_val)
            val_batch_losses.append(val_loss.item())

    # returns the mean validation loss over all batches as a single number
    return np.mean(val_batch_losses)


def train_loop(model, loss_fn, optimizer, n_epochs, train_loader, val_loader=None):
    # Figures device from where the model parameters (hence, the model) are
    device = next(model.parameters()).device.type
    # Creates the train_step function for our model, loss function and optimizer
    train_step = make_train_step(model, loss_fn, optimizer)

    losses = []
    val_losses = []

    for epoch in range(n_epochs):
        # TRAINING
        batch_losses = []
        for x_batch, y_batch in train_loader:
            x_batch = x_batch.to(device)
            y_batch = y_batch.to(device)

            loss = train_step(x_batch, y_batch)
            batch_losses.append(loss)

        losses.append(np.mean(batch_losses))

        # VALIDATION
        if val_loader is not None:
            val_loss = validation(model, loss_fn, val_loader)
            val_losses.append(val_loss)

        print("Epoch {} complete...".format(epoch))

    return losses, val_losses

We need to pass some arguments to the **training loop** function:
- a **model**: we have one, our neural network
- a **loss function**: our problem is a classification and we have not added a softmax layer, so we must use **CrossEntropyLoss**
- an **optimizer**: let's stick with **SGD**
- the **number of epochs**: since we have a bigger model now, let's use only 10
- **data loader(s)**: we have them as well, for the MNIST dataset

In [0]:
torch.manual_seed(42)

n_epochs = 10

lr = 1e-1

model = nn.Sequential()
model.add_module('flatten', nn.Flatten())
model.add_module('linear1', nn.Linear(784, 50))
model.add_module('relu1', nn.ReLU())
model.add_module('output', nn.Linear(50, 10))
model.to(device)

loss_fn = nn.CrossEntropyLoss(reduction='mean')
optimizer = optim.SGD(model.parameters(), lr=lr)

losses, val_losses = train_loop(model, loss_fn, optimizer, n_epochs, train_loader, val_loader)

In [0]:
plt.plot(losses, label='Training Loss')
plt.plot(val_losses, label='Validation Loss')
plt.yscale('log')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

---

Isn't it **great**? We loaded the **MNIST dataset**, built a **neural network** model, defined the corresponding **loss function** and **reused everything else**.

---

What's left to do? We need to check our model's **accuracy** on the validation dataset. So, we make small changes to the **validation loop**, checking which class got the highest probability and matching it against our targets.

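We can use [**torch.max()**](https://pytorch.org/docs/stable/torch.html#torch.max) with `dim=1` to get both the maximum value and its index (which corresponds to the predicted digit). Strictly speaking, our model outputs **logits**, not probabilities, but the class with the largest logit is also the class with the highest predicted probability.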

In [0]:
correct = 0
total = 0

with torch.no_grad():
    for x_val, y_val in val_loader:
        x_val = x_val.to(device)
        y_val = y_val.to(device)
        
        model.eval()
        yhat = model(x_val)
        
        # this is PyTorch's version of argmax, but it returns a tuple: (max value, index of max value)
        _, predicted = torch.max(yhat, 1)
        # we get the size of the batch and add up to the total number of samples
        total += y_val.size(0)
        # we add how many samples got classified correctly
        correct += (predicted == y_val).sum().item()
        
print(correct/total)
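
A small aside on the `torch.max` call: since the model has no softmax layer, its outputs are raw scores (logits). Softmax preserves their ordering, so taking the index of the maximum gives the same prediction either way. The sketch below checks this on random stand-in values rather than on real model outputs.

```python
import torch

torch.manual_seed(0)
logits = torch.randn(5, 10)             # stand-in for model outputs

probs = torch.softmax(logits, dim=1)    # actual probabilities, each row sums to 1

_, predicted_from_logits = torch.max(logits, 1)
predicted_from_probs = probs.argmax(dim=1)

print(torch.equal(predicted_from_logits, predicted_from_probs))
```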

**We got around 94% accuracy**!

But, can we do better?

## A Convolutional Neural Network (CNN)

It is time to try something a bit more sophisticated!

Let's create our own model, inheriting from the [**Module**](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) class. It will take one argument, the **number of features** (output channels) our convolution layers will produce.

We will use [**Conv2d**](https://pytorch.org/docs/stable/nn.html#conv2d) and [**MaxPool2d**](https://pytorch.org/docs/stable/nn.html#maxpool2d) layers, apart from the already familiar **Linear**, **ReLU** and **Flatten**.

In [0]:
import torch.nn as nn
import torch.nn.functional as F

class CNN(nn.Module):
    def __init__(self, n_feature):
        super(CNN, self).__init__()
        self.n_feature = n_feature
        # Creates the convolution layers
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=n_feature, kernel_size=5)
        self.conv2 = nn.Conv2d(in_channels=n_feature, out_channels=n_feature, kernel_size=5)
        # Creates the linear layers
        self.fc1 = nn.Linear(n_feature * 4 * 4, 50) # where does this 4 * 4 come from?! check the shape comments in forward() below
        self.fc2 = nn.Linear(50, 10)
        
    def forward(self, x, verbose=False):
        # Input dimension: (1, 28, 28)
        # Output dimension: (n_feature, 24, 24) - we lose (5-1) pixels in each dimension due to the kernel and no padding
        x = self.conv1(x)
        x = F.relu(x)

        # Input dimension (n_feature, 24, 24)
        # Output dimension (n_feature, 12, 12) - max pooling with kernel size 2 halves each dimension
        x = F.max_pool2d(x, kernel_size=2)

        # Input dimension (n_feature, 12, 12)
        # Output dimension (n_feature, 8, 8) - we lose (5-1) pixels in each dimension due to the kernel and no padding
        x = self.conv2(x)
        x = F.relu(x)

        # Input dimension (n_feature, 8, 8)
        # Output dimension (n_feature, 4, 4) - max pooling with kernel size 2 halves each dimension
        x = F.max_pool2d(x, kernel_size=2)

        # Input dimension (n_feature, 4, 4)
        # Output dimension (n_feature * 4 * 4)
        x = nn.Flatten()(x)

        # Input dimension (n_feature * 4 * 4)
        # Output dimension (50)
        x = self.fc1(x)
        x = F.relu(x)
        
        # Input dimension (50)
        # Output dimension (10)
        x = self.fc2(x)
        return x
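
The shapes noted in the comments follow from the usual output-size formula for convolution and pooling layers. The helper below, `conv_output_size`, is a hypothetical function written for this illustration (it assumes a dilation of 1). It traces a 28-pixel side through the four layers to arrive at the 4 x 4 used in `fc1`.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    # Output size along one dimension for convolution and pooling layers
    return (size + 2 * padding - kernel_size) // stride + 1


size = 28
size = conv_output_size(size, kernel_size=5)            # conv1: 28 -> 24
size = conv_output_size(size, kernel_size=2, stride=2)  # max pool: 24 -> 12
size = conv_output_size(size, kernel_size=5)            # conv2: 12 -> 8
size = conv_output_size(size, kernel_size=2, stride=2)  # max pool: 8 -> 4

print(size)  # 4
```

The pooling steps use `stride=2` because `F.max_pool2d` uses the kernel size as its stride when none is given.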

### Functional

Maybe you noticed already, but our CNN model used a **mix** of **layers** from both **torch.nn** and **torch.nn.functional**.

"*What is the difference?*", you ask?

PyTorch's [**functional**](https://pytorch.org/docs/stable/nn.functional.html#torch-nn-functional) layers are, er... **functions**.

Let's consider two different layers: **2d convolution** and **2d max pooling**. Both can be used as **layer** or **function**:
- [**Conv2d**](https://pytorch.org/docs/stable/nn.html#conv2d) or [**F.conv2d**](https://pytorch.org/docs/stable/nn.functional.html#conv2d)
- [**MaxPool2d**](https://pytorch.org/docs/stable/nn.html#maxpool2d) or [**F.max_pool2d**](https://pytorch.org/docs/stable/nn.functional.html#max-pool2d)

Why did we use a **Conv2d layer** but a **F.max_pool2d function**?

Well, on one hand, it is more **convenient** to use `F.max_pool2d(x, kernel_size=2)` instead of `nn.MaxPool2d(kernel_size=2)(x)`. It just looks better, I think...

But the thing is, **max pooling layers** do not have **parameters/weights** to be learned! **Convolution layers do!**

Remember the **Nested Model** section? Every **model** or **layer** gets its **parameters** recursively accessed by the parent **model**, **provided they are assigned as attributes in the `__init__` method**!

---

So, it is pretty simple: does the layer have **parameters** that need to be learned?
- **YES** - Use **layers** from **torch.nn** and make them **attributes of the parent model**
- **NO** - feel free to use **functions** from **torch.nn.functional**

---
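
A quick way to check this rule is to count the parameters of each kind of layer. The sketch below uses standalone layers and a random input (both created only for illustration) and also confirms that the layer and functional versions of max pooling produce the same output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
pool = nn.MaxPool2d(kernel_size=2)

# 6 filters x 1 channel x 5 x 5 weights, plus 6 biases
n_conv_params = sum(p.numel() for p in conv.parameters())
# Max pooling has nothing to learn
n_pool_params = sum(p.numel() for p in pool.parameters())
print(n_conv_params, n_pool_params)  # 156 0

torch.manual_seed(0)
x = torch.randn(1, 6, 24, 24)
same_output = torch.equal(pool(x), F.max_pool2d(x, kernel_size=2))
print(same_output)  # True
```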

### Training Loop

Nothing new here, we create an instance of our **CNN model** and send it to the device as usual.

In [0]:
lr = 1e-1

model = CNN(n_feature=6).to(device)
loss_fn = nn.CrossEntropyLoss(reduction='mean')
optimizer = optim.SGD(model.parameters(), lr=lr)

print(model)

In [0]:
torch.manual_seed(42)

n_epochs = 10

losses, val_losses = train_loop(model, loss_fn, optimizer, n_epochs, train_loader, val_loader)

In [0]:
plt.plot(losses, label='Training Loss')
plt.plot(val_losses, label='Validation Loss')
plt.yscale('log')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

In [0]:
correct = 0
total = 0

with torch.no_grad():
    for x_val, y_val in val_loader:
        x_val = x_val.to(device)
        y_val = y_val.to(device)
        
        model.eval()
        yhat = model(x_val)
        
        # this is PyTorch's version of argmax, but it returns a tuple: (max value, index of max value)
        _, predicted = torch.max(yhat, 1)
        # we get the size of the batch and add up to the total number of samples
        total += y_val.size(0)
        # we add how many samples got classified correctly
        correct += (predicted == y_val).sum().item()
        
print(correct/total)

**Awesome! We are close to 98% accuracy now!**

### Visualizing Filters

One of the cool things about Convolutional Neural Networks is that their **features** or **filters** can be visualized, so we can have a glimpse of **what the network is looking for** inside the input images :-)

We can fetch the **weights** of a convolution layer...

In [0]:
model.conv1.weight

This tensor is 6 x 1 x 5 x 5... why is that? It has **6 features or filters** (we configured it that way), each with **1 input channel** (our images are grayscale), and each one is **5 x 5** because that's the **kernel size** we defined. No surprises here!

But we can think of each one of these 5 x 5 tensors as **an image**!

In [0]:
# Code adapted from "How to Visualize Filters and Feature Maps in Convolutional Neural Networks"
# by Jason Brownlee
# https://machinelearningmastery.com/how-to-visualize-filters-and-feature-maps-in-convolutional-neural-networks/

layer_name = 'conv1'
# retrieve weights from the hidden layer
filters = getattr(model, layer_name).weight.data.cpu().numpy()
# normalize filter values to 0-1 so we can visualize them
f_min, f_max = filters.min(), filters.max()
filters = (filters - f_min) / (f_max - f_min)

n_filters, ix = filters.shape[0], 1

for i in range(n_filters):
    # get the filter
    f = filters[i, :, :, :]
    # plot each channel separately
    for j in range(filters.shape[1]):
        # specify subplot and turn off axis ticks
        ax = plt.subplot(n_filters, filters.shape[1], ix)
        ax.set_xticks([])
        ax.set_yticks([])
        # plot filter channel in grayscale
        plt.imshow(f[j, :, :], cmap='gray')
        ix += 1

# show the figure
plt.show()

## Transfer Learning

"Transfer learning is a machine learning method where a model developed for a task is **reused** as the starting point for a model on a second task.

It is a popular approach in deep learning where pre-trained models are used as the starting point on computer vision and natural language processing tasks given the vast compute and time resources required to develop neural network models on these problems and from the huge jumps in skill that they provide on related problems."

Source: [A Gentle Introduction to Transfer Learning for Deep Learning](https://machinelearningmastery.com/transfer-learning-for-deep-learning/)

So, let's try using a **pre-trained model** as starting point for classifying MNIST's digits!

Let's take one of the most **famous** neural network architectures for a spin...

### AlexNet

"AlexNet was the winning entry in ILSVRC 2012. It solves the problem of image classification where the input is an image of one of 1000 different classes (e.g. cats, dogs etc.) and the output is a vector of 1000 numbers."

To learn more about AlexNet, check out the post where this excerpt of text comes from: [Understanding AlexNet](https://www.learnopencv.com/understanding-alexnet/).

![alt text](https://www.learnopencv.com/wp-content/uploads/2018/05/AlexNet-1.png)

Source: [Learn OpenCV](https://www.learnopencv.com/understanding-alexnet/)

It is actually a bit **too much** to use AlexNet for the purpose of classifying MNIST's digits... but let's go for it anyway!

How do we use a **pre-trained model**? We first need to **download its weights** from somewhere and then **load them into the corresponding empty model architecture**.

To be honest, PyTorch provides a way of downloading the weights of AlexNet directly... but let's do it the **hard way**, downloading the file manually and only then **loading the weights into the model**, so we learn how loading actually works.

### Loading Models

Let's start by downloading the pre-trained weights... the AlexNet file is about 230 MB!

In [0]:
!curl https://download.pytorch.org/models/alexnet-owt-4df8aa71.pth --output alexnet-owt-4df8aa71.pth

Now, let's create the **empty** network architecture... we could build it from scratch based on the figure above, but it seems too much of a hassle - let's use TorchVision's [**models**](https://pytorch.org/docs/stable/torchvision/models.html#torchvision-models) module.

In [0]:
from torchvision.models import alexnet

alex = alexnet()

print(alex)

First, we load the file into a **state_dict** (remember those?!) using [**torch.load()**](https://pytorch.org/docs/stable/torch.html#torch.load):

In [0]:
state_dict = torch.load('alexnet-owt-4df8aa71.pth')
print(state_dict.keys())

Then we load the **state_dict** into the **empty architecture** we created. For this step, we use, obviously, the model's [**load_state_dict()**](https://pytorch.org/docs/stable/nn.html#torch.nn.Module.load_state_dict) method.

In [0]:
alex.load_state_dict(state_dict)

### Freezing Layers

OK, we have AlexNet locked and loaded! Should we just **start training it?** 

Well, not so fast... it has 60 million parameters! Besides, **what would be the point of transfer learning then**?

So, we are **freezing its layers**, meaning, we **do not want to update its weights**. How do we do that? **Turning gradients OFF!**

In [0]:
for parameter in alex.parameters():
    parameter.requires_grad = False


We also need to make a small change... AlexNet was trained to classify **1000 classes**, but we have **only 10 digits**, so we need to **replace the last layer**.

In [0]:
print(alex.classifier)

In [0]:
alex.classifier[6] = nn.Linear(4096, 10)

Let's double check if the **only trainable parameters** are the ones from our newly created last layer: **classifier.6**.

For this, the model's [**named_parameters()**](https://pytorch.org/docs/stable/nn.html#torch.nn.Module.named_parameters) method comes in handy:

In [0]:
for name, param in alex.named_parameters():
    if param.requires_grad:
        print("\t", name)
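
The same freeze-then-replace pattern can be tried on a tiny model that needs no download. In the sketch below, `net` is a small made-up network standing in for a pre-trained one (it is not AlexNet). Passing only the trainable parameters to the optimizer is an optional extra, not something the steps above require.

```python
import torch.nn as nn
import torch.optim as optim

# Tiny stand-in for a pre-trained network (NOT AlexNet)
net = nn.Sequential()
net.add_module('features', nn.Linear(8, 4))
net.add_module('relu', nn.ReLU())
net.add_module('classifier', nn.Linear(4, 1000))

# Freeze every existing parameter
for parameter in net.parameters():
    parameter.requires_grad = False

# Replace the last layer: newly created layers are trainable by default
net.classifier = nn.Linear(4, 10)

trainable = [name for name, p in net.named_parameters() if p.requires_grad]
print(trainable)  # ['classifier.weight', 'classifier.bias']

# Optionally, hand only the trainable parameters to the optimizer
optimizer = optim.SGD((p for p in net.parameters() if p.requires_grad), lr=0.1)
```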

### Transforming Datasets (again)

AlexNet also expects **different inputs**: 3-channel 224 x 224 images.
But our **MNIST** data has single-channel 28 x 28 images. So, we can resort to **transforms** once again to transform the latter into the former.

In [0]:
from torchvision.transforms import Resize, Grayscale, ToTensor, Normalize

alex_transforms = Compose([Resize(224),
                           Grayscale(num_output_channels=3),
                           ToTensor(),
                           Normalize((.5, .5, .5), (.5, .5, .5)),])

mnist_train = datasets.MNIST('./mnist', train=True, download=True, transform=alex_transforms)
mnist_test = datasets.MNIST('./mnist', train=False, download=True, transform=alex_transforms)

train_loader = DataLoader(mnist_train, batch_size=128, shuffle=True)
val_loader = DataLoader(mnist_test, batch_size=128, shuffle=True)

### Training Loop

Nothing new here, but Alex is big, so let's start with 2 epochs only...

Also, don't forget to **send Alex to the device** :-)

In [0]:
lr = 1e-1

model = alex.to(device)
loss_fn = nn.CrossEntropyLoss(reduction='mean')
optimizer = optim.SGD(model.parameters(), lr=lr)

In [0]:
torch.manual_seed(42)

n_epochs = 2

losses, val_losses = train_loop(model, loss_fn, optimizer, n_epochs, train_loader, val_loader)

In [0]:
plt.plot(losses, label='Training Loss')
plt.plot(val_losses, label='Validation Loss')
plt.yscale('log')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

In [0]:
correct = 0
total = 0

with torch.no_grad():
    for x_val, y_val in val_loader:
        x_val = x_val.to(device)
        y_val = y_val.to(device)
        
        model.eval()
        yhat = model(x_val)
        
        # this is PyTorch's version of argmax, but it returns a tuple: (max value, index of max value)
        _, predicted = torch.max(yhat, 1)
        # we get the size of the batch and add up to the total number of samples
        total += y_val.size(0)
        # we add how many samples got classified correctly
        correct += (predicted == y_val).sum().item()
        
print(correct/total)

Doesn't look so impressive now, does it? Anyway...

### Saving Models

As you can see, a **bigger** model like AlexNet takes **quite some time for training each epoch**, even with all but one of its layers **frozen**.

So, it is important to be able to **checkpoint** our model, in case we'd like to **restart training later**.

To checkpoint a model, we basically have to **save its state** into a file, to **load** it back later - nothing special, actually.

What defines the **state of a model**?
- **model.state_dict()**: kinda obvious, right?
- **optimizer.state_dict()**: remember optimizers had the `state_dict` as well?
- **loss**: after all, you should keep track of its evolution
- **epoch**: it is just a number, so why not? :-)
- **anything else you'd like to have restored**

Then, **wrap everything into a Python dictionary** and use [**torch.save()**](https://pytorch.org/docs/stable/torch.html?highlight=save#torch.save) to dump it all into a file! Easy peasy!

In [0]:
checkpoint = {'epoch': n_epochs,
              'model_state_dict': model.state_dict(),
              'optimizer_state_dict': optimizer.state_dict(),
              'loss': losses,
              'val_loss': val_losses}

torch.save(checkpoint, 'alexnet_checkpoint.pth')

In [0]:
!ls -l alexnet_checkpoint.pth

How would you **load** it back?

In [0]:
checkpoint = torch.load('alexnet_checkpoint.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
losses = checkpoint['loss']
val_losses = checkpoint['val_loss']

In [0]:
plt.plot(losses, label='Training Loss')
plt.plot(val_losses, label='Validation Loss')
plt.yscale('log')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

Seems about right...

You may save a model for **checkpointing**, like we have just done, or for **making predictions**, assuming training is finished.

After loading the model, **DO NOT FORGET**:

---

**SET THE MODE** (not the mood!):
- **checkpointing: model.train()**
- **predicting: model.eval()**

---
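
The whole save / load / set-the-mode cycle can be rehearsed on a tiny model before trusting it with hours of training. The sketch below uses a small linear layer as a stand-in model, made-up loss values, and an in-memory buffer instead of a file. These are all assumptions chosen to keep the example self-contained.

```python
import io
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
model = nn.Linear(3, 2)                      # tiny stand-in model
optimizer = optim.SGD(model.parameters(), lr=0.1)

checkpoint = {'epoch': 2,
              'model_state_dict': model.state_dict(),
              'optimizer_state_dict': optimizer.state_dict(),
              'loss': [0.9, 0.5]}            # made-up loss values

buffer = io.BytesIO()                        # in-memory file, nothing touches disk
torch.save(checkpoint, buffer)
buffer.seek(0)

# Restore into freshly created objects (note the different learning rate)
restored_model = nn.Linear(3, 2)
restored_optimizer = optim.SGD(restored_model.parameters(), lr=0.5)
loaded = torch.load(buffer)
restored_model.load_state_dict(loaded['model_state_dict'])
restored_optimizer.load_state_dict(loaded['optimizer_state_dict'])

restored_model.train()   # resuming training; call .eval() instead for predictions
print(loaded['epoch'], restored_optimizer.param_groups[0]['lr'])
```

Loading the optimizer's state also restores its learning rate, so the value given when creating `restored_optimizer` is overwritten by the one stored in the checkpoint.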

## Further Improvements


Is there **anything else** we can improve or change? Sure, there is **always something else** to add to your model, such as a [**learning rate scheduler**](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate) or a [**learning rate finder**](https://github.com/davidtvs/pytorch-lr-finder).

But this tutorial is already waaaaay too long, so I will stop right here.

## Final Thoughts

I believe this tutorial has **most of the necessary steps** one needs to go through in order to **learn**, in a **structured** and **incremental** way, how to **develop Deep Learning models for Computer Vision using PyTorch**.

Hopefully, after finishing working through all code in this post, you’ll be able to better appreciate and more easily work your way through PyTorch’s official [tutorials](https://pytorch.org/tutorials/).

If you have any thoughts, comments or questions, please leave a comment below or contact me on [LinkedIn](https://br.linkedin.com/in/dvgodoy) or [Twitter](https://twitter.com/dvgodoy).