# Lecture 3.14: Introduction To Pytorch

[**Lecture Slides**](https://docs.google.com/presentation/d/10G4hNPwtIq0urT--yN3VFznhqaS8kalBrlLXfcd9bvE/edit?usp=sharing)

In this lecture, we are going to train a neural network classifier in pytorch, both on a local CPU and on a cloud GPU.

**Learning goals:**
- understand the difference between `ndarray` and `Tensor`
- carry out basic operations on `Tensor`s
- backpropagate through a computational graph with `autograd`
- build and train a neural network banknote classifier
- set up a google colab notebook
- train a neural network on a GPU
- compare the pros and cons of pytorch and keras

## 0. Setup

To use pytorch, we have to install the `torch` package. It was added to this project's `Pipfile`, so please run:

    pipenv install
    
in the repo's root directory to install the latest dependencies.


Remember your first `import numpy as np`? Time has flown by! This is the start of another exciting chapter, your first pytorch import 🎊

In [None]:
import torch

## 1. Tensors

### 1.1 Tensor Creation

Welcome to the world of pytorch! 🔥

The main class here is the `Tensor`. It's a scary mathematical name, but according to [pytorch](https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html#sphx-glr-beginner-blitz-tensor-tutorial-py) themselves:

> Tensors are similar to NumPy’s ndarrays, with the addition being that Tensors can also be used on a GPU to accelerate computing.

This means that we can expect their interface to be similar to `ndarray`. 😌

The preferred way to build `Tensor`s is with the `torch.tensor()` constructor. This is similar to `np.array()`:

In [None]:
torch.tensor([42, 666])

pytorch also offers a bunch of other useful constructors:

In [None]:
torch.randn(4, 2)

In [None]:
torch.ones(3, 5)

In [None]:
torch.zeros(1)

In [None]:
torch.arange(0, 1337, 55)

### 1.2 Basic Operations

Just like NumPy, pytorch integrates closely with python by overloading many common operators. We can use python list slicing 🔪:

In [None]:
a = torch.tensor([0, 1, 1, 2, 3, 5, 8, 13])
print(a[3])
print(a[3:6])
print(a[1:7:2])

We can use arithmetic operators:

In [None]:
a = torch.tensor([1, 2, 3])
b = torch.tensor([4, 5, 6])
print(a + b)
print(a*b)

Element-wise operations are the default in pytorch, just like for `ndarray`s.
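
To make this default concrete, here is a small sketch contrasting the `*` operator with the matrix product `@` (shorthand for `torch.matmul`). The tensors `m` and `n` are made up for illustration:

```python
import torch

m = torch.tensor([[1, 2], [3, 4]])
n = torch.tensor([[10, 20], [30, 40]])

# `*` multiplies matching entries of the two tensors
elementwise = m * n
# `@` is the rows-by-columns matrix product
matrix_product = m @ n

print(elementwise)
print(matrix_product)
```

The two results differ, so it's worth double checking which one a formula actually calls for.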

### 1.3 Tensor Data Types

Remember that NumPy is super fast because it uses its own data types. Maybe pytorch uses the same trick! Let's check the type of our tensor's elements:

In [None]:
a = torch.tensor([1, 2, 3])
type(a[0])

😑 Ok we already knew this was a `Tensor`, but what data type does it contain? We can check through its `.dtype` field:

In [None]:
a[0].dtype
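
As a quick sketch of how the `dtype` is picked (assuming pytorch's default settings have not been changed), python integers and python floats lead to different tensor data types, and `.float()` converts between them. The variable names are only for illustration:

```python
import torch

ints = torch.tensor([1, 2, 3])
floats = torch.tensor([1.0, 2.0, 3.0])

# python ints become 64-bit integers, python floats become 32-bit floats
print(ints.dtype)
print(floats.dtype)

# .float() returns a new tensor converted to 32-bit floats
converted = ints.float()
print(converted.dtype)
```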

If we want to convert this torch value to a python scalar, we can use the `.item()` method:

In [None]:
a[0].item()

If we want to convert any `Tensor` to a numpy array, we can use the `.numpy()` method:

In [None]:
a = torch.tensor([1, 2, 3])
b = a.numpy()
b

This is a _bridge_ meaning that `a` and `b` share their underlying memory location. Careful, as changing one will change the other! 
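
Here is a minimal sketch of that shared memory in action. It also shows one possible way around it, `.clone()`, which copies the data before the conversion:

```python
import torch

a = torch.tensor([1, 2, 3])
b = a.numpy()            # b looks at the same memory as a

c = a.clone().numpy()    # clone() first, so c owns an independent copy

a.add_(10)               # in-place update (note the trailing underscore)

print(a)  # updated tensor
print(b)  # the bridged array changed too
print(c)  # the cloned array did not
```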

Finally we can convert `ndarray`s to `Tensor`s with the `torch.from_numpy()` constructor, which shares memory in the same way:

In [None]:
import numpy as np

a = np.array([1, 2, 3])
b = torch.from_numpy(a)
b

## 2. Autograd

### 2.1 Backpropagation

Pretty familiar grounds so far! But we haven't really shown what makes `Tensor`s special. 🌈 The official documentation mentions GPU acceleration, but we'll keep that for the end of the lecture. Until then, let's focus on another pytorch killer feature: the `grad_fn`.

By providing the `requires_grad=True` argument to our `Tensor` constructors, we tell pytorch to track all operations on it. 📡 These operations are stored in their `.grad_fn` field:

In [None]:
x = torch.ones(2, 2, requires_grad=True)
y = x + 2
y.grad_fn

`y` is a `Tensor` "born" from the addition $x + 2$ on the `Tensor` `x` which had `requires_grad=True`. Therefore pytorch keeps track of this operation, and saves it under `.grad_fn`.

This `requires_grad` property is passed on to the children `Tensors`, meaning we can create a _chain_ of `.grad_fn`.

In [None]:
y.requires_grad

In [None]:
z = y**2
z.grad_fn

Pytorch therefore keeps track of the _computational graph_ created from root tensors with `requires_grad=True`.

But why are we going through this hassle? Well, tensors secretly work with `torch.autograd`, the [automatic differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation) package 🕵️‍♀️. So by saving the operations linking `Tensors`, `autograd` is then able to numerically evaluate the derivatives of all nodes in the computational graph. 

When we call `.backward()` on a `Tensor`, we ask `autograd` to use the chain rule and evaluate the gradients through the nodes of the computational graph. The gradients are then saved under `.grad` on the root `Tensor`s created with `requires_grad=True` (by default, intermediate `Tensor`s don't keep theirs). These are the partial derivatives of the `Tensor` on which we called `.backward()` with respect to that root `Tensor`. 

This all sounds familiar.... Propagating gradients through a computation graph with the chain rule? Caching intermediate derivative values? `.backward()` is therefore an implementation of _backpropagation_. Of course, pytorch's end goal here is to use these gradients as part of _gradient descent_ to optimize neural networks. 
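
To see this on a tiny graph, here is a sketch that rebuilds the `x`, `y`, `z` chain from above and backpropagates through it. Summing `z` into a scalar `out` is an extra step added here, since `.backward()` without arguments expects a scalar:

```python
import torch

x = torch.ones(2, 2, requires_grad=True)
y = x + 2
z = y**2

# reduce to a single number so we can call .backward() without arguments
out = z.sum()
out.backward()

# chain rule: d(out)/dx = 2 * (x + 2), which is 6 for every entry of x
print(x.grad)

# x is a root (leaf) of the graph, y is an intermediate node
print(x.is_leaf, y.is_leaf)
```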

### 2.2 Linear Regression


To understand exactly what gradient descent with pytorch looks like, let's implement a simple two layer linear neural network, less glamorously known as a ✨linear regression model✨ (check out lecture 3.5 for a refresher!). We'll use two features and a single _example_ to update the model parameters.

First let's initialize our model parameters:

In [None]:
theta0 = torch.tensor(-2.0, requires_grad=True)
theta1 = torch.tensor(3.0, requires_grad=True)
theta2 = torch.tensor(-1.0, requires_grad=True)

We then define our only training example:

In [None]:
x = torch.tensor([0.3, -0.1])
y = torch.tensor(0.4)

We recall that the linear regression hypothesis (with $x_{0} = 1$ for the bias term) is:

$$ y = \sum_{i=0}^{n}\theta_{i}x_{i} $$

In [None]:
y_predict = theta0 + x[0] * theta1 + x[1] * theta2
y_predict

So our initial prediction is $-1$ ... pretty far from the label $0.4$! Let's calculate the associated MSE loss:

In [None]:
loss = (y_predict - y)**2
loss

This loss is the final node in our computation graph. By calling `.backward()` on it, we tell `autograd` to backpropagate through the edges, and calculate the gradients.

In [None]:
loss.backward()

... that was it? Backpropagation felt much more complicated in the lecture 3.12 slides 😅 Let's check the gradient $\frac{dJ}{d\theta_{0}}$:

In [None]:
theta0.grad

That's pretty cool, but this value of $-2.8$ doesn't get us anywhere on its own. We need to use it as part of a gradient descent update:

In [None]:
print(f"value of theta0 before gradient descent update: {theta0}")

alpha = 0.01

theta0 = theta0 - alpha * theta0.grad
theta1 = theta1 - alpha * theta1.grad
theta2 = theta2 - alpha * theta2.grad

print(f"value of theta0 after gradient descent update: {theta0}")

The value of $\theta_{0}$ was updated by taking a step in the opposite direction to the gradient. Let's check how this affected our loss:

In [None]:
y_predict = theta0 + x[0] * theta1 + x[1] * theta2
new_loss = (y_predict - y)**2
print(f"Loss before gradient descent update: {loss}")
print(f"Loss after gradient descent update: {new_loss}")

The loss decreased, and if we repeated this process many times with many examples, we might obtain $\theta$s which correctly predict the labels `y`.
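
Repeating the update takes a little care. Here is a hedged sketch of one way to loop it on the same single example. It makes a few assumptions that are not in the cells above: the three parameters are packed in one `theta` tensor, a constant `1.0` is prepended to `x` for the bias, and we run 200 steps. The update happens inside `torch.no_grad()` so that `theta` stays a root tensor, and the gradients are reset after each step because `.backward()` accumulates them:

```python
import torch

theta = torch.tensor([-2.0, 3.0, -1.0], requires_grad=True)  # theta0, theta1, theta2
x = torch.tensor([1.0, 0.3, -0.1])  # leading 1.0 multiplies the bias theta0
y = torch.tensor(0.4)
alpha = 0.01

history = []
for step in range(200):
    # forward pass and loss
    y_predict = (theta * x).sum()
    loss = (y_predict - y) ** 2
    # backpropagation
    loss.backward()
    # in-place update, not tracked by autograd
    with torch.no_grad():
        theta -= alpha * theta.grad
    # reset the accumulated gradients before the next step
    theta.grad.zero_()
    history.append(loss.item())

print(f"first loss: {history[0]:.4f}, last loss: {history[-1]:.6f}")
```

With a single example this only shows the mechanics; the model simply learns to fit that one point.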

Note that almost _all_ of the code above is mathematical operations. Pytorch is appreciated by data scientists because it abstracts away the machine learning heavy lifting (e.g. backpropagation), but still lets them control the low level operations. 

Sometimes we don't want to write all the low level maths, and would rather get on training our neural networks. 😅 Let's use pytorch to train the exact same fake banknote detection classifier as last lecture.

## 3. Data Munging

We load our dataset in a `DataFrame`:

In [None]:
import pandas as pd

df = pd.read_csv('bank_note.csv')
df.head()

Our features are scaled and ready to go! 🏋️‍♀️ We'll use all 4 features and put them in a feature matrix:

In [None]:
import torch

X = df[['feature_1', 'feature_2', 'feature_3', 'feature_4']].values
y = df['is_fake'].values

Notice that in the linear regression example above, we don't have access to a `.fit(X, y)` method which takes care of feeding the dataset to the neural network. Therefore, we are going to want to manually create _examples_, which are pairs of (features, label), i.e. pairs of rows from `X` and `y`. We can create these tuple pairs with the handy built-in [`zip()`](https://realpython.com/python-zip-function/) function:

In [None]:
dataset = list(zip(X, y))
dataset[0]

Our first example here has:

$$
x = \begin{bmatrix}1.12\\1.15\\-0.98\\0.35\end{bmatrix}; y = 0
$$

## 4. Training

### 4.1 Neural Network Structure

Our dataset is ready, so now we want to train a neural network! We saw in the previous section that pytorch allows us to backpropagate through any mathematical operation. However, passing around raw `x + y` functions and `Tensors` can lead to a lot of copy-pasting and isn't very bug-safe. 😬 Instead we'd like to abstract away the computational graph in a dedicated class. Pytorch makes this easy with the `nn.Module`.

We extend the `nn.Module` base class, and create a `Net` class. All we have to do is:
- define our _layers_ in the `__init__` constructor
- define the computational graph in the `.forward()` method. For neural networks, this often includes applying activation functions, and feeding the output of one layer into the next. `.forward()` is simply the chained function defined by the neural network, which takes features as input and returns predictions as output.

Since we are lazy, we'll use the ready-made pytorch `nn.Linear` layer. Just like a keras `Dense` layer, it creates the required model parameters for us, here from the input and output sizes we pass in. `nn.Linear` also takes care of setting `requires_grad` on all the right `Tensor`s, so we know backpropagation will work.

Let's put all of this together in a 2 hidden layer neural network with ReLU activations:


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        # create the layers
        self.dense1 = nn.Linear(4, 6)
        self.dense2 = nn.Linear(6, 6)
        self.dense3 = nn.Linear(6, 1)

    def forward(self, x):
        # first hidden layer
        x = F.relu(self.dense1(x))
        # second hidden layer
        x = F.relu(self.dense2(x))
        # output layer
        x = torch.sigmoid(self.dense3(x))
        return x


net = Net()
print(net)

We made a neural network in pytorch 🎊

🧠 What do the arguments to `nn.Linear()` represent? Why are we using these particular values here?

Let's test its predictions:

In [None]:
x_predict = torch.tensor([0.1, -0.2, 0.3, 0.4])
net(x_predict)

Yes, `nn.Module` does some background python magic for us so we can call our network directly as a function. 😮
That prediction doesn't look great though, because we haven't trained the model yet!
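
As a sketch of what that "magic" amounts to, the snippet below uses a deliberately tiny stand-in model (the hypothetical `TinyNet`, not the `Net` above) to show that calling the module runs `.forward()`, and that the same call also accepts a whole batch of examples at once:

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical one-layer model, only here to illustrate the call."""

    def __init__(self):
        super().__init__()
        self.dense = nn.Linear(4, 1)

    def forward(self, x):
        return torch.sigmoid(self.dense(x))


tiny_net = TinyNet()
single = torch.tensor([0.1, -0.2, 0.3, 0.4])
batch = torch.randn(32, 4)  # 32 made-up examples with 4 features each

# calling the module gives the same result as calling .forward()
same = torch.equal(tiny_net(single), tiny_net.forward(single))
print(same)

# one prediction for a single example, one per row for a batch
print(tiny_net(single).shape)
print(tiny_net(batch).shape)
```

Calling the module rather than `.forward()` is the usual convention, since it also runs any hooks registered on the module.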


### 4.2 Dataset Iteration
For this, we want to loop through our `dataset`. But we also don't really want to implement batches or shuffling of examples... Luckily, pytorch takes care of that for us too. 

We use a `DataLoader` to iterate through our dataset. The `torch.utils.data` module can load many dataset types, including our list of tuples of `ndarrays`. Check out the [documentation](https://pytorch.org/docs/stable/data.html) for more details! All we have to do is pass it in the constructor:


In [None]:
from torch.utils.data import DataLoader

ds_loader = DataLoader(dataset, batch_size=32, shuffle=True)

`ds_loader` is a python _iterable_, meaning we can use it in `for` loops, or peek into the first batch of examples like this:

In [None]:
list(ds_loader)[0]

That's a lot of values! The shape of the features should always be `(batch_size, dims)`:

In [None]:
list(ds_loader)[0][0].shape
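
To get a feel for what the `DataLoader` hands back, here is a self-contained sketch on a made-up dataset of 70 random examples (`X_toy`, `y_toy` and `toy_loader` are illustrative names, not part of the banknote data). It also uses `next(iter(...))`, an alternative way to peek at a single batch without going through the whole dataset:

```python
import numpy as np
import torch
from torch.utils.data import DataLoader

# made-up data: 70 examples, 4 features, binary labels
X_toy = np.random.randn(70, 4)
y_toy = np.random.randint(0, 2, size=70)
toy_dataset = list(zip(X_toy, y_toy))

toy_loader = DataLoader(toy_dataset, batch_size=32, shuffle=True)

# peek at one batch only
first_features, first_labels = next(iter(toy_loader))
print(first_features.shape, first_labels.shape)
print(first_features.dtype)

# one full pass over the data: the last batch holds the leftovers
feature_shapes = [tuple(features.shape) for features, labels in toy_loader]
print(feature_shapes)
```

Two things to notice: the last batch is smaller when the dataset size isn't a multiple of `batch_size`, and the features keep NumPy's 64-bit float type until we convert them.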

### 4.3 Loss and Optimization

There are two missing pieces to our training. Just like the `nn.Linear` layers, we don't really want to implement the _loss_ and the _optimizer_ for our neural network. Recall that the loss is the mathematical function defining the average model error for given predictions and labels, $J$. The optimizer defines the update of each model parameter for a given gradient with respect to that $\theta$. These can get complicated! Instead, we'll use the `torch.nn` and the `torch.optim` modules:

In [None]:
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=0.001)

- `criterion` is a simple function which calculates the loss:
```
loss = criterion(predictions, labels)
``` 
We choose binary cross-entropy loss (`BCELoss`) since we are dealing with a binary classification problem.

- `optimizer` fetches the `.grad` fields directly on the `net.parameters()`, and adapts the learning rate according to its algorithm. The update is done in place by calling the `.step()` method:
```
optimizer.step()
```
We pick `Adam` because adam rocks 🤘
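
To demystify the `criterion`, here is a small sketch checking `nn.BCELoss` against the binary cross-entropy formula written out by hand. The predictions and labels are made-up values:

```python
import torch
import torch.nn as nn

# made-up sigmoid outputs and their true labels
predictions = torch.tensor([[0.9], [0.2], [0.6]])
labels = torch.tensor([[1.0], [0.0], [0.0]])

criterion = nn.BCELoss()
loss = criterion(predictions, labels)

# the same quantity by hand: mean of -[y*log(p) + (1 - y)*log(1 - p)]
manual_loss = -(labels * torch.log(predictions)
                + (1 - labels) * torch.log(1 - predictions)).mean()

print(loss.item(), manual_loss.item())
```

Both numbers agree, and the third example (predicted 0.6 for a label of 0) contributes the most to the loss.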

### 4.4 Optimization

We're now ready to optimize! Let's start easy, just like our pytorch linear regression. We'll carry out _one_ gradient descent update, with only the first batch of data.

First, we load the batch from our `DataLoader`. We have to convert its `dtype` and reshape the labels, because like `sklearn`, pytorch likes its input a certain way:

In [None]:
# fetch the first batch
inputs, labels = list(ds_loader)[0]
# pytorch likes floats
inputs = inputs.float()
# view = np.reshape(), need a matrix here
labels = labels.float().view(-1, 1)

Then we can predict the outputs using our neural network and calculate the associated cross-entropy loss:

In [None]:
# predict
outputs = net(inputs)
# loss from predictions & labels
loss = criterion(outputs, labels)
print(f"Loss before gradient descent update: {loss}")

Finally we can use gradient descent to update the model parameters:

In [None]:
# always zero the parameter gradients before calling .backward()
optimizer.zero_grad()
# backpropagation
loss.backward()
# gradient descent update
optimizer.step()

# calculate the new loss
outputs = net(inputs)
loss = criterion(outputs, labels)
print(f"Loss after gradient descent update: {loss}")

🧠 Take your time to understand the above code, and step through the different stages. What happens in the `net()` call? What about the `.step()` method? 

We reduced the loss! This is great progress, but we'd like to find the global minimum, not just go "down" one step... For this, we have to repeat the updates for all batches and several epochs. Let's go! 🤠

Pytorch uses a dynamic computation graph, so it starts from scratch every time we make a forward pass through the neural network: `net(inputs)`. This means we can just loop the gradient descent step without worrying about global state or breaking things. We'll also set the pytorch and NumPy random seeds to improve [reproducibility](https://pytorch.org/docs/stable/notes/randomness.html).

In [None]:
# reproducibility
torch.manual_seed(1337)
np.random.seed(666)

# initialization
net = Net()
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=0.001)
losses = []

# loop over epochs
for epoch in range(100):
    print(f'epoch {epoch} ')
    
    # loop over batches
    for i, data in enumerate(ds_loader):
        # data loading
        inputs, labels = data
        inputs = inputs.float()
        labels = labels.float().view(-1, 1)
        
        # prediction
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        
        # optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # print statistics
        losses.append(loss.item())

print('finished Training')

Wow that was fast! Pytorch's custom data types and `autograd` package make for swift gradient calculations. And we haven't even used GPUs yet 😏

🧠 How many epochs did we train our neural network for?

We stored the mean loss of each batch in the `losses` list. Let's visualize our loss curve:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()


fig = plt.figure(dpi=100)
ax = fig.add_subplot(111)
ax.plot(losses, lw=1)
ax.set_xlabel('batch')
ax.set_ylabel('loss')
ax.set_title('Loss Curve');

The loss curve looks just the same as last lecture with keras, which means we have successfully trained our pytorch neural network. 🎊



## 5. Prediction

Just like we did with our untrained model, we can call the `Net` directly as a function to predict the label of a given input `Tensor`:

In [None]:
banknote = [0.0, -0.1, 0.3, 0.2]
x_predict = torch.tensor(banknote)
net(x_predict)

Our neural network is very confident that this banknote is genuine 💸
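
The network outputs a number between 0 and 1, not a label. One common way to turn it into a class is a threshold. The sketch below assumes the output is the probability of `is_fake = 1` and uses a threshold of 0.5; the probabilities themselves are made up:

```python
import torch

# made-up sigmoid outputs for three banknotes
probabilities = torch.tensor([[0.03], [0.48], [0.97]])

# 1 (fake) when the probability reaches the threshold, 0 (genuine) otherwise
threshold = 0.5
predicted_labels = (probabilities >= threshold).int()

print(predicted_labels)
```

The threshold is a choice: lowering it flags more banknotes as fake.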

## 6. Exercises

### 6.1 BCEWithLogits

For binary classification, pytorch recommends the use of [`BCEWithLogitsLoss`](https://pytorch.org/docs/stable/nn.html#bcewithlogitsloss). According to the documentation:

> This loss combines a Sigmoid layer and the BCELoss in one single class. This version is more numerically stable than using a plain Sigmoid followed by a BCELoss as, by combining the operations into one layer, we take advantage of the log-sum-exp trick for numerical stability.

Sounds like a good idea for our fake banknote detector. 👌
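
Before starting, here is a quick sketch on made-up numbers showing that the two approaches compute the same loss. The `logits` are raw outputs, before any sigmoid:

```python
import torch
import torch.nn as nn

# made-up raw outputs (no sigmoid applied) and their labels
logits = torch.tensor([[2.0], [-1.0], [0.5]])
labels = torch.tensor([[1.0], [0.0], [0.0]])

# sigmoid followed by BCELoss
loss_sigmoid_bce = nn.BCELoss()(torch.sigmoid(logits), labels)
# BCEWithLogitsLoss applied directly to the raw outputs
loss_with_logits = nn.BCEWithLogitsLoss()(logits, labels)

print(loss_sigmoid_bce.item(), loss_with_logits.item())
```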

💪 Implement and train the same pytorch neural network model for banknote classification, but this time, use `BCEWithLogitsLoss`.
- since `BCEWithLogitsLoss` already incorporates the output layer sigmoid activation, you'll have to rewrite the `Net` class and remove it
- use the same hyperparameters as above: `batch_size=32`, `epochs=100`, `Adam` optimizer, ...
- store your losses per batch in `losses`, the loss curve will be automatically plotted when running the unit test
- the loss curve should look very similar to the one above

In [None]:
# INSERT YOUR CODE HERE


In [None]:
def test_bce_with_logits():
    assert losses, "Can't find losses. Did you use the correct variable name?"    
    assert np.array(losses[-10:]).mean() < 0.005, "It doesn't look like your loss converged"
    print('Success! 🎉')
    
fig = plt.figure(dpi=100)
ax = fig.add_subplot(111)
ax.plot(losses, lw=1)
ax.set_xlabel('batch')
ax.set_ylabel('loss')
ax.set_title('Loss Curve')

test_bce_with_logits()

### 6.2 Batch size analysis

💪💪 Analyse the effect of batch size on neural network optimization, by plotting the loss curves of models with different batch sizes.
- look at lecture 3.13 for an example... but implement this with pytorch!
- plot the loss curves side by side, or the graphs will be unreadable
- wrap the neural network training in a function to easily iterate through different batch sizes
- either reduce the number of epochs to ~ 20, or find a way to set individual epochs for each batch size. Otherwise, the small batch sizes will take a long time to train
- the graph is the unit test 🙃


In [None]:
# INSERT YOUR CODE HERE

In [None]:
for batch_size, losses in losses_dict.items():
    fig = plt.figure(dpi=100)
    ax = fig.add_subplot(111)
    ax.plot(losses, lw=1)
    ax.set_xlabel('batch')
    ax.set_ylabel('loss')
    ax.set_title(f'batch_size={batch_size}');
    

🧠 Think about your results. Do they agree with the results from lecture 3.13?

## 7. GPUs

We've seen how pytorch adds backpropagation support to Tensors, and provides some useful methods to iterate through datasets, calculate common losses, and perform gradient updates on model parameters.

Tensors have another trick up their sleeve: they can be used on a _GPU_. As mentioned in the slides, GPUs can parallelize matrix operations, and have a larger memory bandwidth. Training a neural network on a GPU can be orders of magnitude faster than a CPU.
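
As a preview, here is a minimal sketch of how tensors and layers are usually moved to a GPU. It falls back to the CPU when no CUDA device is available, so it runs anywhere; the tensor and layer are made up for illustration:

```python
import torch
import torch.nn as nn

# use the GPU if pytorch can see one, otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# both the data and the model parameters must live on the same device
x = torch.randn(3, 4).to(device)
layer = nn.Linear(4, 1).to(device)

out = layer(x)
print(device, out.device)
```

Mixing devices (say, a CPU tensor fed to a GPU layer) raises an error, which is why both are moved explicitly.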

Let's move to google colab and play with some fancy hardware! 🤖

## 8. Pytorch vs Keras vs Tensorflow

We have learned about _two_ deep learning frameworks, keras and pytorch. By now, it should be obvious that these libraries take a different approach to neural networks. Keras' api is more abstracted and OOP focused, whilst pytorch augments mathematical functions. There is no "better" choice, as each will shine for different use cases:

**keras**
- concise code
- little theoretical understanding required
- good for simple projects and fast POCs

**pytorch**
- lower level control
- more mathematical
- good for growing projects and advanced NNs
- popular nowadays 😎 great community support

Tensorflow, which we have been running as the keras backend, is another popular open source deep learning library. It is on par with pytorch in terms of customization and low level control, and sells itself as being production and deployment focused. However, more and more machine learning engineers are switching to pytorch, which is considered easier to develop with. This might change in the near future, so it's good to keep an eye on the evolution of the ML open source ecosystem 👀.

All in all, coders are allowed to have personal preferences 💁‍♂️, so you should pick the tools that best solve your problem _and_ that you are most comfortable with. Now you have two to choose from ✌️


## 9. Summary

Today, we discovered a new deep learning library, **pytorch**. We first explained how **GPUs** can accelerate neural networks with parallel matrix operations, and large memory bandwidth. Computing time is a bottleneck in neural network optimization, so **faster** training leads to **more powerful** models. We listed tensorflow & pytorch as the most popular DL frameworks which can interface with GPUs. We then learned how to use pytorch: first by creating **`Tensor`s**, then by backpropagating through a computation graph with the **`autograd`** package. We applied this to neural networks by extending the **`nn.Module`**, where we defined **layers** and the **`.forward()`** method. We **looped** this code by iterating through a **`DataLoader`**, and recreated the banknote authentication classifier from last lecture. Finally, we ported the pytorch neural network to **Google colab**, and trained it on a **GPU runtime environment**.

# Resources

## Core Resources

- [**Slides**](https://docs.google.com/presentation/d/10G4hNPwtIq0urT--yN3VFznhqaS8kalBrlLXfcd9bvE/edit?usp=sharing)
- [Pytorch tutorial](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html)  
Fast official pytorch tutorial, great for a refresher on basics
- [Collection of DL models implemented in tf and pytorch](https://github.com/rasbt/deeplearning-models)   
Great to have bookmarked for inspiration / debugging when creating neural networks with pytorch or tensorflow

## Additional Resources

- [Pytorch ecosystem](https://pytorch.org/ecosystem/)  
List of the many libraries extending pytorch
- [ignite](https://github.com/pytorch/ignite)  
Pytorch wrapper similar to the keras api
- [Pytorch for fast.ai](https://www.fast.ai/2017/09/08/introducing-pytorch-for-fastai/)  
Blogpost explaining fast.ai's move from keras+tensorflow to pytorch
- [Pytorch reproducibility](https://pytorch.org/docs/stable/notes/randomness.html)  
Notes about randomness in pytorch
- [Why are GPUs well suited to deep learning](https://www.quora.com/Why-are-GPUs-well-suited-to-deep-learning)  
Quora thread with a detailed explanation of GPUs advantages over CPUs
- [Carbon footprint of large language models](https://arxiv.org/pdf/1906.02243.pdf)  
Famous paper investigating the environmental impact of ML training