# PyTorch Crash Course

### Quick Tips for Installing PyTorch
To train on a GPU, you need an NVIDIA GPU and the CUDA Toolkit.

Type
```
nvcc --version
```
into your console to figure out what CUDA version you have.

If the command is not found or prints nothing on your own computer, you might need to install the [CUDA Toolkit](https://developer.nvidia.com/cuda/toolkit).
###### _If you are on a remote computer, you probably can't do that on your own. Talk to an admin instead._

If everything is set up correctly, the output should look roughly like this:
```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Aug_14_10:26:51_Pacific_Daylight_Time_2024
Cuda compilation tools, release 12.6, V12.6.68
Build cuda_12.6.r12.6/compiler.34714021_0
```
(The machine I ran the command on has CUDA 12.6.)

Now you can go to [PyTorch's official website](https://pytorch.org) to download PyTorch for the correct CUDA version.

PyTorch seems to be okay at dealing with CUDA version mismatches: I have seen newer versions of PyTorch run fine with older CUDA versions. __DON'T QUOTE ME ON THIS THOUGH.__

PyTorch also offers [old versions](https://pytorch.org/get-started/previous-versions/) for download, which work with older CUDA versions.
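
Once PyTorch is installed, a quick sanity check along these lines can confirm which CUDA version your PyTorch build targets and whether it can actually see a GPU. This is only a minimal sketch; the printed values depend entirely on your machine.

```python
import torch

print(torch.__version__)          # installed PyTorch version
print(torch.version.cuda)         # CUDA version this build targets (None for CPU-only builds)
print(torch.cuda.is_available())  # True if PyTorch can actually use a GPU

if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # name of the first visible GPU
```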



## Why Pytorch?

- Differentiation by hand is hard; PyTorch does it for you with one line of code
- PyTorch implements a lot of other functionality as well (data handling, optimizers, model handling, loss functions, ...)
- PyTorch runs fast
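
To make the first point concrete, here is a minimal sketch of automatic differentiation on a toy function. The function `y = x^2 + 2x` is just an illustration chosen for this example; in real training, `y` would be your loss.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)  # ask PyTorch to track operations on x
y = x**2 + 2 * x                           # y = x^2 + 2x

y.backward()   # the "one line": computes dy/dx and stores it in x.grad
print(x.grad)  # dy/dx = 2x + 2, evaluated at x = 3
```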

## Why use GPUs for Machine Learning?

#### A Demo:
Let's fit a CNN to CIFAR10.
(_This code is shamelessly copied from an exercise in Prof. Jain's FMI course from last year_)

In [2]:
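import time

import torch
import torch.nn.functional as F

from src.small_CNN import *

def train(model, optimizer, device):
    for epoch in range(1, num_epochs + 1):
        for batch_idx, (X, y) in enumerate(train_loader):
            start_batch = time.time()
            #move the batch to the same device as the model
            X = X.to(device)
            y = y.to(device)

            z = model(X)
            loss = F.cross_entropy(z, y)

            optimizer.zero_grad(set_to_none=True)
            loss.backward()
            optimizer.step()

            #CUDA kernels run asynchronously, so wait for them before stopping the timer
            if device.type == "cuda":
                torch.cuda.synchronize()
            end_batch = time.time()
            batch_time = int((end_batch - start_batch) * 1000)
        #batch_time is the duration of the last batch of the epoch
        print(f"epoch {epoch}, {batch_time}ms elapsed")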

In [3]:
device = torch.device("cpu")

model = Net(num_classes)
model = model.to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
model.train()

train(model, optimizer, device)

epoch 1, 359ms elapsed
epoch 2, 447ms elapsed
epoch 3, 474ms elapsed
epoch 4, 470ms elapsed
epoch 5, 459ms elapsed
epoch 6, 445ms elapsed
epoch 7, 411ms elapsed
epoch 8, 436ms elapsed
epoch 9, 476ms elapsed
epoch 10, 546ms elapsed
epoch 11, 359ms elapsed
epoch 12, 449ms elapsed
epoch 13, 383ms elapsed
epoch 14, 405ms elapsed
epoch 15, 350ms elapsed


In [4]:
device = torch.device("cuda:2")

model = Net(num_classes)
model = model.to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
model.train()

train(model, optimizer, device)

epoch 1, 9ms elapsed
epoch 2, 3ms elapsed
epoch 3, 4ms elapsed
epoch 4, 5ms elapsed
epoch 5, 4ms elapsed
epoch 6, 3ms elapsed
epoch 7, 4ms elapsed
epoch 8, 11ms elapsed
epoch 9, 7ms elapsed
epoch 10, 5ms elapsed
epoch 11, 6ms elapsed
epoch 12, 5ms elapsed
epoch 13, 4ms elapsed
epoch 14, 4ms elapsed
epoch 15, 4ms elapsed


As you can see, running on the GPU can be a lot faster: in this run, a batch takes a few milliseconds on the GPU compared to several hundred on the CPU. But why is that the case?

Long story short: forward and backward passes (inference and training) require massive amounts of vector and matrix (tensor) operations, which are highly parallelizable. GPUs (Graphics Processing Units) were built to run exactly this kind of highly parallel workload in order to render pretty computer graphics, so they are also good at neural networks more or less by accident. Datacenter GPUs nowadays are often purpose-built for ML tasks, so they are even better.
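
If you want to see the effect in isolation, a rough sketch like the following times a plain matrix multiplication on each device. The helper `time_matmul` and the matrix size are made up for this illustration, and the numbers you get will depend heavily on your hardware. Note the explicit synchronization: GPU work is queued asynchronously, so without it the timer would stop too early.

```python
import time

import torch

def time_matmul(device, size=1024, repeats=10):
    """Average time in seconds for one (size x size) matrix multiplication."""
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)
    a @ b  # warm-up run so one-time setup costs are not measured
    if device.type == "cuda":
        torch.cuda.synchronize(device)
    start = time.perf_counter()
    for _ in range(repeats):
        a @ b
    if device.type == "cuda":
        torch.cuda.synchronize(device)  # wait until the GPU has really finished
    return (time.perf_counter() - start) / repeats

cpu_time = time_matmul(torch.device("cpu"))
print(f"cpu:  {cpu_time * 1000:.2f} ms per matmul")

if torch.cuda.is_available():
    gpu_time = time_matmul(torch.device("cuda"))
    print(f"cuda: {gpu_time * 1000:.2f} ms per matmul")
```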

If you do projects on your home computer and have a decent gaming GPU, then you can probably use it for ML.

If you need big computing power, the KI-GPU Server at OTH Regensburg or the NHR at FAU Erlangen can give you access to big datacenter GPUs.

## Okay but how do I PyTorch?

#### Online Resources
- [Official Documentation](https://docs.pytorch.org/docs/stable/index.html): For looking up syntax etc.
- [Official Tutorials](https://docs.pytorch.org/tutorials/index.html): This is the best place to start; also teaches a lot of general ML things.
- [This Blog Post by Edward Z. Yang (A PyTorch Dev)](https://blog.ezyang.com/2019/05/pytorch-internals/): A good intro to how PyTorch works internally.
- [Official PyTorch YT Channel](https://www.youtube.com/pytorch): Tons of videos on everything PyTorch related.
- [Huggingface Tutorials](https://huggingface.co/learn): If you are specifically interested in NLP, then this will teach you just enough PyTorch to get by (while also teaching you a lot of other things).
- ["Let's build GPT: from scratch, in code, spelled out." by Andrej Karpathy](https://www.youtube.com/watch?v=kCc8FmEb1nY): Not just for people interested in transformers! An excellent tutorial on working with PyTorch to build a complex model. If you are interested, also look at [the follow-up video on reproducing GPT-2](https://www.youtube.com/watch?v=l8pRSuU81PU).


### You already know a lot of the syntax if you know Numpy...

- PyTorch tensors represent n-dimensional arrays of various datatypes (just like ndarrays in NumPy).
- Doing linear algebra with PyTorch is very similar to doing it with NumPy.
- A lot of NumPy functions are named identically in PyTorch (so just trying out what you would do in NumPy often works).
- Most of the ones that aren't identical have an equivalent you can find with a Google search.
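
The two libraries also convert into each other directly. The following is a small sketch of that round trip; the array values are arbitrary. One thing to keep in mind is that `torch.from_numpy()` and `.numpy()` share memory with the original instead of copying it.

```python
import numpy as np
import torch

arr = np.array([[1.0, 2.0], [3.0, 4.0]])

t = torch.from_numpy(arr)   # NumPy -> PyTorch, shares memory with arr
print(t.sum(dim=0))         # same idea as arr.sum(axis=0), the keyword is dim instead of axis
print(torch.linalg.inv(t))  # counterpart of np.linalg.inv

back = t.numpy()            # PyTorch -> NumPy (works for CPU tensors)
t[0, 0] = 10.0              # in-place write is visible in arr and back as well
print(arr)
```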



In [5]:
my_tensor = torch.tensor([[1, 2, 3],[4, 5, 6]])
my_tensor

tensor([[1, 2, 3],
        [4, 5, 6]])

In [6]:
my_tensor[1][1:]

tensor([5, 6])

In [7]:
my_tensor.reshape((3,2))

tensor([[1, 2],
        [3, 4],
        [5, 6]])

In [8]:
my_tensor * 2

tensor([[ 2,  4,  6],
        [ 8, 10, 12]])

In [9]:
my_tensor.T

tensor([[1, 4],
        [2, 5],
        [3, 6]])

In [10]:
my_tensor.unsqueeze(0)

tensor([[[1, 2, 3],
         [4, 5, 6]]])

In [11]:
my_tensor @ my_tensor.T #matrix multiplication

tensor([[14, 32],
        [32, 77]])

##### The important new bit: The GPU
Tensors need to explicitly be moved between devices.

In [12]:
torch.cuda.is_available() #check if GPU is available

True

In [13]:
my_second_tensor = torch.tensor([[4, 5, 6],[7, 8, 9]])
my_second_tensor.device

device(type='cpu')

In [14]:
my_second_tensor = my_second_tensor.to("cuda")
my_second_tensor.device

device(type='cuda', index=0)

Tensors need to be on the same device to allow operations between them.

In [15]:
print(my_tensor + my_second_tensor) #this produces an error

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

In [16]:
print(my_tensor.to("cuda") + my_second_tensor) #this does not

tensor([[ 5,  7,  9],
        [11, 13, 15]], device='cuda:0')
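
To avoid hard-coding `"cuda"` everywhere, a common pattern is to pick the device once and create or move everything onto it. The following is a minimal sketch of that pattern; it falls back to the CPU, so it also runs on machines without a GPU.

```python
import torch

# pick the GPU if there is one, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.tensor([[1, 2, 3], [4, 5, 6]], device=device)  # create directly on the device
y = torch.ones_like(x)                                    # inherits device and dtype from x

result = x + y       # works, both operands live on the same device
print(result.cpu())  # move the result back to the CPU, e.g. for NumPy or plotting
```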


#### Beware the views

Just like in NumPy, some PyTorch operations return views rather than copies. This is usually intuitive and intended, but it can lead to accidents with in-place operations if you don't account for it, because different variables then point to the same underlying data.
The most important operations that return a view are:
- (basic) indexing and slicing ops
- Transposition (`.T`, `.transpose()`, ...)
- `squeeze()`, `unsqueeze()`
- some selection operations (`.select()`, `.diagonal()`, ...)

It should also be noted that `.reshape()` can return either a view or a copy, meaning code should account for both possibilities.

Also note that out-of-place mathematical operations on views (e.g. `a * 2`) return new tensors, meaning they can be used safely without modifying the data underlying the view. In-place variants (e.g. `a.mul_(2)` or `a *= 2`) do modify it.

The [PyTorch documentation](https://docs.pytorch.org/docs/stable/tensor_view.html) contains a full list of all relevant operations.

In [17]:
a = torch.tensor([[4, 5, 6],[7, 8, 9]])
b = a.T

In [18]:
b[0][0] = 100 #b is a view of a, so this also changes a[0][0]

In [19]:
a = a*2 #out-of-place: a now points to a new tensor, b still sees the old data
a

tensor([[200,  10,  12],
        [ 14,  16,  18]])

In [20]:
b

tensor([[100,   7],
        [  5,   8],
        [  6,   9]])
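
If you need an independent copy, `.clone()` gives you one. The sketch below contrasts a view with a clone and uses `data_ptr()` (the memory address of a tensor's first element) as a quick way to check whether `.reshape()` returned a view or a copy. The variable names are only for illustration.

```python
import torch

a = torch.tensor([[4, 5, 6], [7, 8, 9]])
view = a.T          # view: shares storage with a
copy = a.T.clone()  # explicit copy: owns its own storage

view.mul_(2)        # in-place op on the view also changes a
print(a)            # every entry is doubled
print(copy)         # still holds the original values

# equal data pointers mean the result starts at the same memory as a
print(a.reshape(3, 2).data_ptr() == a.data_ptr())  # contiguous input: reshape returns a view
print(a.T.reshape(6).data_ptr() == a.data_ptr())   # non-contiguous input: reshape has to copy
```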

### You probably want to use nn.Module to create your models

A good way to build your own neural network architectures is to use `nn.Module`. Just write your own class that inherits from `nn.Module`, give it a `forward` method, and you are good to go. The simplest way to build your own model is to chain various premade operations from `torch.nn` in an `nn.Sequential`. Tutorials for various architectures can be found all over the internet.

In [21]:
class Net(nn.Module):
    def __init__(self, num_classes, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            #in_channels is defined outside the class (CIFAR10 images are RGB, so 3 channels)
            nn.Conv2d(in_channels, 32, kernel_size=5, stride=1, padding=0),
            nn.BatchNorm2d(32),
            nn.Dropout2d(dropout),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(32, 64, kernel_size=5, stride=1, padding=0),
            nn.BatchNorm2d(64),
            nn.Dropout2d(dropout),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Flatten(),
            #spatial size for 32x32 inputs: 32 -> 28 -> 14 -> 10 -> 5, with 64 channels
            nn.Linear(64 * 5 * 5, 400),
            nn.BatchNorm1d(400),
            nn.Dropout(dropout),
            nn.ReLU(),
            nn.Linear(400, 100),
            nn.BatchNorm1d(100),
            nn.Dropout(dropout),
            nn.ReLU(),
            nn.Linear(100, num_classes))

    def forward(self, x):
        return self.net(x)

A lot of PyTorch's functionality is built around this. Training a model that is defined like this is super straightforward: pass the model parameters to an optimizer, compute a loss, make the backward call, and perform an optimization step.

```
loss = F.cross_entropy(y_hat, y) #compute some loss based on model outputs y_hat and labels y
optimizer.zero_grad(set_to_none=True) #reset the gradients first, otherwise backward() adds to the ones from the previous step
loss.backward() #compute gradients based on loss
optimizer.step() #update model parameters
```
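
Put together, a complete (if tiny) training run could look like the following sketch. Everything specific in it is an assumption made for the sake of a runnable example: the synthetic data, the two-layer model, the learning rate, and the number of steps.

```python
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# synthetic stand-in data: 256 samples with 4 features, label 1 if the features sum to > 0
X = torch.randn(256, 4, device=device)
y = (X.sum(dim=1) > 0).long()

model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

model.train()
losses = []
for step in range(50):
    y_hat = model(X)                       # forward pass
    loss = F.cross_entropy(y_hat, y)       # compare outputs with labels
    optimizer.zero_grad(set_to_none=True)  # clear gradients from the previous step
    loss.backward()                        # compute new gradients
    optimizer.step()                       # update the parameters
    losses.append(loss.item())

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```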


##### Iterating through Modules

In some cases it might be necessary to build models as more complex data structures. A classic case is one big model with a set of smaller "sub-models" which are themselves implemented using `nn.Module` (e.g. attention heads in a transformer). If the sub-models can't reasonably each be stored as individual attributes, storing them in an `nn.ModuleList` is generally best practice. Unlike a plain Python list, it registers the sub-models with the parent module, so their parameters show up in `model.parameters()` and follow `model.to(device)`.
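
The difference to a plain Python list is easy to check by counting parameters. In this sketch, the class names and the layer sizes are hypothetical and only serve the comparison.

```python
import torch
from torch import nn

class PlainListModel(nn.Module):
    def __init__(self):
        super().__init__()
        # plain list: the layers are NOT registered as sub-modules
        self.heads = [nn.Linear(8, 8) for _ in range(3)]

class ModuleListModel(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList: every layer is registered with the parent module
        self.heads = nn.ModuleList([nn.Linear(8, 8) for _ in range(3)])

    def forward(self, x):
        # iterate over the sub-modules like over a normal list
        return torch.cat([head(x) for head in self.heads], dim=-1)

n_plain = sum(p.numel() for p in PlainListModel().parameters())
n_list = sum(p.numel() for p in ModuleListModel().parameters())
print(n_plain, n_list)  # the plain list contributes no parameters at all
```

An optimizer built from `PlainListModel().parameters()` would therefore have nothing to update.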


### Handling Datasets
Small datasets (<100ish MB) can usually be stored in GPU memory just fine and accessed by iterating over them. Larger datasets should be processed using a `DataLoader` (more info [here](https://docs.pytorch.org/docs/stable/data.html)).
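
For the small-dataset case, one possible way to iterate is to move the whole dataset to the device once and slice shuffled mini-batches out of it. This is only a sketch with synthetic stand-in data; the sizes and names are assumptions.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# synthetic stand-in for a small dataset, moved to the device once
X_all = torch.randn(1000, 20, device=device)
y_all = torch.randint(0, 2, (1000,), device=device)

batch_size = 128
perm = torch.randperm(X_all.shape[0], device=device)  # new random order for this epoch

num_batches = 0
for start in range(0, X_all.shape[0], batch_size):
    idx = perm[start:start + batch_size]
    X, y = X_all[idx], y_all[idx]  # indexing happens on the device, no host transfer needed
    # ... forward pass, loss, backward pass and optimizer step would go here ...
    num_batches += 1

print(num_batches)  # the last batch is smaller than batch_size
```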

To make your `DataLoader` more efficient, set `num_workers` to a value greater than 0 (this loads data asynchronously in background worker processes) and set `pin_memory=True` for faster transfers from system memory to GPU memory.
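
A minimal sketch of such a setup is shown below. The helper `build_loader`, the batch size, and the synthetic `TensorDataset` are assumptions for illustration; in practice you would plug in your own `Dataset`. The `if __name__ == "__main__":` guard matters on platforms that start worker processes by spawning a fresh interpreter.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def build_loader(dataset, batch_size=64, num_workers=2):
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,               # > 0 loads batches in background processes
        pin_memory=torch.cuda.is_available(),  # pinned memory only helps when copying to a GPU
    )

# synthetic stand-in for an image dataset with CIFAR10-like shapes
features = torch.randn(256, 3, 32, 32)
labels = torch.randint(0, 10, (256,))
dataset = TensorDataset(features, labels)

if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    loader = build_loader(dataset)
    for X, y in loader:
        # non_blocking=True lets the copy from pinned memory overlap with other work
        X = X.to(device, non_blocking=True)
        y = y.to(device, non_blocking=True)
        print(X.shape, y.shape)
        break
```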