# PyTorch Crash Course

### Quick Tips for installing PyTorch
You need the CUDA Toolkit and an Nvidia GPU.

Type
```
nvcc --version
```
into your console to figure out what CUDA version you have.

If the command prints nothing (or is not found) on your own computer, you might need to install the [CUDA Toolkit](https://developer.nvidia.com/cuda/toolkit).
###### _If you are on a remote computer, you probably can't do that on your own. Talk to an admin instead._

If everything is alright, the output should look roughly like this:
```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Aug_14_10:26:51_Pacific_Daylight_Time_2024
Cuda compilation tools, release 12.6, V12.6.68
Build cuda_12.6.r12.6/compiler.34714021_0
```
(I have CUDA 12.6 on the machine where I ran the command.)

Now you can go to [PyTorch's official website](https://pytorch.org) to download PyTorch for the correct CUDA version.

PyTorch seems to be okay at dealing with CUDA version mismatches; I have seen newer versions of PyTorch run fine with older CUDA versions. __DON'T QUOTE ME ON THIS THOUGH__

PyTorch also offers [old versions](https://pytorch.org/get-started/previous-versions/) for download, which work with older CUDA versions.



## Why PyTorch?

- Differentiation is hard; PyTorch does it for you with one line of code
- Implementing
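
To illustrate the point about differentiation, here is a minimal sketch of automatic differentiation. The function `y = x**3 + 2 * x` and the variable names are made up for this example.

```python
import torch

# A scalar input that autograd should track.
x = torch.tensor(3.0, requires_grad=True)

# Any computation built from torch operations is recorded.
y = x**3 + 2 * x

# The one line that does the differentiation: it fills x.grad with dy/dx.
y.backward()

print(x.grad)  # dy/dx = 3 * x**2 + 2, which is 29 at x = 3
```

The same `backward()` call works unchanged when `y` is the loss of a network with millions of parameters.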

## Why use GPUs for Machine Learning?

#### A Demo:
Let's fit a CNN to CIFAR10.
(_This code is shamelessly copied from an exercise in Prof. Jain's FMI course from last year_)

In [16]:
import numpy as np
import torch
import torch.nn.functional as F
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import torchvision
from torchvision import datasets, transforms
import time
from tqdm import tqdm

in_channels   = 3
num_classes   = 10
num_workers   = 8
batch_size    = 256 #this is pretty large for the actual model we use here, but not unreasonable in other usecases
num_epochs    = 15
learning_rate = 0.005

In [2]:
transform = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
train_data = datasets.CIFAR10(
    root='data',
    train=True,
    transform=transform,
    download=True
)
train_loader = DataLoader(train_data, batch_size=batch_size, num_workers=num_workers, shuffle=True)

Files already downloaded and verified


In [4]:
class Net(nn.Module):
    def __init__(self, num_classes, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=5, stride=1, padding=0),
            nn.BatchNorm2d(32),
            nn.Dropout2d(dropout),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(32, 64, kernel_size=5, stride=1, padding=0),
            nn.BatchNorm2d(64),
            nn.Dropout2d(dropout),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Flatten(),
            nn.Linear(64 * 5 * 5, 400),
            nn.BatchNorm1d(400),
            nn.Dropout(dropout),
            nn.ReLU(),
            nn.Linear(400, 100),
            nn.BatchNorm1d(100),
            nn.Dropout(dropout),
            nn.ReLU(),
            nn.Linear(100, num_classes))

    def forward(self, x):
        return self.net(x)

In [4]:
device = torch.device("cpu")

model = Net(num_classes)
model = model.to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)


for epoch in tqdm(range(1, num_epochs + 1)):

    model.train()
    for batch_idx, (X, y) in enumerate(train_loader):

        X = X.to(device)
        y = y.to(device)

        z = model(X)
        loss = F.cross_entropy(z, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

100%|██████████| 15/15 [01:50<00:00,  7.35s/it]


In [5]:
device = torch.device("cuda:2")

model = Net(num_classes)
model = model.to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

for epoch in tqdm(range(1, num_epochs + 1)):

    model.train()
    for batch_idx, (X, y) in enumerate(train_loader):

        X = X.to(device)
        y = y.to(device)

        z = model(X)
        loss = F.cross_entropy(z, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

100%|██████████| 15/15 [00:25<00:00,  1.69s/it]


Wow, the code runs more than four times faster on my GPU (1:50 on the CPU versus 0:25 on the GPU for 15 epochs), and it will run even faster after we have optimized it later. How can that be?

Long story short: Forward and backward passes (inference and training) require massive amounts of vector/matrix (tensor) operations, which are highly parallelizable. GPUs (Graphics Processing Units) are built to perform these kinds of highly parallel operations to render pretty computer graphics, so they are also good at neural networks by accident. Datacenter GPUs nowadays are often purpose-built for ML tasks, so they are even better.

If you do projects on your home computer and have a decent gaming GPU, then you can probably use it for ML.
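
To check what your own hardware gains, a rough sketch like the following times a plain matrix multiplication on each available device. The helper `time_matmul` and the matrix size are assumptions chosen for illustration. GPU work is launched asynchronously, so the timing needs `torch.cuda.synchronize()` before reading the clock.

```python
import time

import torch


def time_matmul(device, size=1024, repeats=10):
    """Average seconds per (size x size) matrix multiplication on `device`."""
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)
    a @ b  # warm-up run, not timed
    if device.type == "cuda":
        torch.cuda.synchronize(device)  # GPU calls return before the work is done
    start = time.perf_counter()
    for _ in range(repeats):
        a @ b
    if device.type == "cuda":
        torch.cuda.synchronize(device)  # wait until the GPU has actually finished
    return (time.perf_counter() - start) / repeats


cpu_time = time_matmul(torch.device("cpu"))
print(f"cpu:  {cpu_time * 1e3:.2f} ms per matmul")
if torch.cuda.is_available():
    gpu_time = time_matmul(torch.device("cuda"))
    print(f"cuda: {gpu_time * 1e3:.2f} ms per matmul")
```

The numbers depend heavily on the matrix size: for very small tensors the launch overhead can make the GPU slower than the CPU.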

## Okay but how do I PyTorch?

#### Online Resources
- [Official Documentation](https://docs.pytorch.org/docs/stable/index.html): For looking up syntax etc.
- [Official Tutorials](https://docs.pytorch.org/tutorials/index.html): This is the best place to start; also teaches a lot of general ML things.
- [Hugging Face Tutorials](https://huggingface.co/learn): If you are specifically interested in NLP, then this will teach you just enough PyTorch to get by (while also teaching you a lot of other things).
- ["Let's build GPT: from scratch, in code, spelled out." by Andrej Karpathy](https://www.youtube.com/watch?v=kCc8FmEb1nY): Not just for people interested in transformers! An excellent tutorial on working with PyTorch to build a complex model. If you are interested, also look at [the follow-up video on reproducing GPT-2](https://www.youtube.com/watch?v=l8pRSuU81PU).


### You already know a lot of the syntax if you know Numpy...

- PyTorch tensors represent n-dimensional arrays of various datatypes (just like ndarrays in NumPy).
- Doing linear algebra with PyTorch is very similar to doing it with NumPy.
- A lot of NumPy functions are named identically in PyTorch (so just trying out what you would do in NumPy often works).
- Most of the ones that aren't identical have an equivalent you can find with a Google search.
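
As a small sketch of how close the two libraries are, the following converts between NumPy arrays and tensors and calls identically named functions on both. The array contents are arbitrary.

```python
import numpy as np
import torch

a = np.arange(6).reshape(2, 3)

# NumPy -> PyTorch: from_numpy shares memory with the original array.
t = torch.from_numpy(a)

# Identically named functions; torch calls the axis argument `dim`.
col_sums_np = np.sum(a, axis=0).tolist()
col_sums_torch = torch.sum(t, dim=0).tolist()
print(col_sums_np, col_sums_torch)

# PyTorch -> NumPy (CPU tensors only); again no copy is made.
b = t.numpy()

# Because the memory is shared, a write through one handle is visible in all.
a[0, 0] = 100
print(t[0, 0].item(), b[0, 0])
```

If you need an independent copy instead of shared memory, `torch.tensor(a)` copies the data.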



In [44]:
my_tensor = torch.tensor([[1, 2, 3],[4, 5, 6]])
my_tensor

tensor([[1, 2, 3],
        [4, 5, 6]])

In [47]:
my_tensor[1][1:]

tensor([5, 6])

In [48]:
my_tensor.reshape((3,2))

tensor([[1, 2],
        [3, 4],
        [5, 6]])

In [49]:
my_tensor * 2

tensor([[ 2,  4,  6],
        [ 8, 10, 12]])

In [50]:
my_tensor.T

tensor([[1, 4],
        [2, 5],
        [3, 6]])

In [51]:
my_tensor.unsqueeze(0)

tensor([[[1, 2, 3],
         [4, 5, 6]]])

In [52]:
my_tensor @ my_tensor.T #matrix multiplication

tensor([[14, 32],
        [32, 77]])

##### The important new bit: The GPU
Tensors need to be moved between devices explicitly, and an operation only works if all of its tensors live on the same device.

In [30]:
torch.cuda.is_available() #check if GPU is available

True

In [36]:
my_second_tensor = torch.tensor([[4, 5, 6],[7, 8, 9]])
my_second_tensor.device

device(type='cpu')

In [37]:
my_second_tensor = my_second_tensor.to("cuda")
my_second_tensor.device

device(type='cuda', index=0)

In [41]:
print(my_tensor + my_second_tensor)

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

In [42]:
print(my_tensor.to("cuda") + my_second_tensor)

tensor([[ 5,  7,  9],
        [11, 13, 15]], device='cuda:0')
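
A common way to avoid device mismatches is to pick the device once and create or move everything relative to it. The following is a minimal sketch of that pattern; the tensor values are arbitrary, and the fallback to the CPU is an assumption so that the code also runs on machines without a GPU.

```python
import torch

# Fall back to the CPU when no GPU is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create tensors directly on the target device instead of moving them later.
a = torch.ones(2, 3, device=device)

# Tensors derived from `a` inherit its device.
b = torch.zeros_like(a)

# Tensors created elsewhere must be moved before they can be combined.
c = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]).to(device)

result = a + b + c
print(result.device)

# Move back to the CPU before handing data to NumPy or plain Python.
print(result.cpu().tolist())
```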


### ... and most of numpy is just translating mathematical operations into code.
Which means that

- If you are implementing a paper, as a first step, implement the tensor operations exactly as described using torch, then start thinking about how to improve performance.
- If you are starting from scratch, take a minute or two to write down what you want to implement in mathematical notation first.
- A good trick to ensure consistency is to use variable names that match the mathematical symbols.
- Don't forget that a lot of popular papers will already have implementations; using them (even if just as a reference) makes your life easier.

##### An Example:

This is the loss function from the [DPO Paper](https://arxiv.org/pdf/2305.18290) ("Direct Preference Optimization: Your Language Model is Secretly a Reward Model", Rafailov et al. 2024):

$$
\mathcal{L}_{DPO}(\pi_{\theta};\pi_{ref})=-\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}} \left[ \log \sigma \left(\beta\log\frac{\pi_{\theta}(y_w | x)}{\pi_{ref}(y_w | x)}  -\beta\log\frac{\pi_{\theta}(y_l | x)}{\pi_{ref}(y_l | x)}\right) \right]
$$


```python
# F is torch.nn.functional (imported in the first cell); F.logsigmoid computes log(sigmoid(x))
L_DPO = - torch.mean(F.logsigmoid(beta * torch.log(pi_theta_winner / pi_ref_winner) - beta * torch.log(pi_theta_loser / pi_ref_loser)))
```

at which point you just need to write something like
```
optimizer.zero_grad()
L_DPO.backward()
optimizer.step()
```
to almost have a trainable model.
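
In practice, language models usually give you log-probabilities, and dividing raw probabilities is numerically fragile. The following sketch therefore computes the negative log-sigmoid of the scaled difference of log-ratios, as in the paper, directly in log space. The function name `dpo_loss`, its argument names, and the dummy values are assumptions for illustration; each input is assumed to hold one summed log-probability per response.

```python
import torch
import torch.nn.functional as F


def dpo_loss(logp_theta_w, logp_theta_l, logp_ref_w, logp_ref_l, beta=0.1):
    """DPO loss from per-sequence log-probabilities, each of shape (batch,)."""
    # log(pi_theta / pi_ref) is a difference of log-probabilities.
    log_ratio_w = logp_theta_w - logp_ref_w
    log_ratio_l = logp_theta_l - logp_ref_l
    # logsigmoid is more stable than torch.log(torch.sigmoid(...)).
    return -F.logsigmoid(beta * (log_ratio_w - log_ratio_l)).mean()


# Dummy log-probabilities standing in for real model outputs.
logp_theta_w = torch.tensor([-1.0, -2.0], requires_grad=True)
logp_theta_l = torch.tensor([-1.5, -2.5], requires_grad=True)
logp_ref_w = torch.tensor([-1.0, -2.0])
logp_ref_l = torch.tensor([-1.5, -2.5])

loss = dpo_loss(logp_theta_w, logp_theta_l, logp_ref_w, logp_ref_l)
loss.backward()
print(loss.item())  # policy equals reference here, so the loss is log(2)
```

Starting from a policy that equals the reference model is a handy sanity check: the loss must be `log(2)` regardless of `beta`.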

### You probably want to use nn.Module to create your models

A super easy way to test out your own neural network architectures is to use nn.Module. TODO
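
As a minimal sketch of the pattern (the class `TinyMLP` and its layer sizes are hypothetical and not part of this course's demo): layers are created in `__init__`, the computation goes into `forward`, and PyTorch takes care of parameter bookkeeping and the backward pass.

```python
import torch
from torch import nn


class TinyMLP(nn.Module):
    """Hypothetical two-layer network, only meant to show the pattern."""

    def __init__(self, in_features, hidden, out_features):
        super().__init__()  # must run before layers are assigned
        # Layers stored as attributes are registered automatically.
        self.fc1 = nn.Linear(in_features, hidden)
        self.fc2 = nn.Linear(hidden, out_features)

    def forward(self, x):
        # Defines the forward pass; autograd derives the backward pass.
        return self.fc2(torch.relu(self.fc1(x)))


model = TinyMLP(in_features=8, hidden=16, out_features=2)
out = model(torch.randn(4, 8))  # call the model itself, not model.forward
print(out.shape)

num_params = sum(p.numel() for p in model.parameters())
print(num_params)
```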


In [None]:
# A sample model with

### Handling Datasets
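- Keeping your whole dataset in (GPU) memory can be a good or a bad idea: it avoids loading overhead during training, but only works if the data fits.
- A `DataLoader` can load and preprocess batches asynchronously in background worker processes (`num_workers`).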
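
When the data does not fit on the GPU, a `DataLoader` can hide much of the loading cost. The following is a sketch with random stand-in data in CIFAR10-like shapes. `num_workers=0` is used so that it runs anywhere; raising it is what enables the asynchronous loading.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Random stand-in data with CIFAR10-like shapes.
X = torch.randn(1000, 3, 32, 32)
y = torch.randint(0, 10, (1000,))

loader = DataLoader(
    TensorDataset(X, y),
    batch_size=256,
    shuffle=True,
    num_workers=0,  # values > 0 prepare batches in background worker processes
    pin_memory=torch.cuda.is_available(),  # pinned memory allows async copies to the GPU
)

num_seen = 0
for xb, yb in loader:
    # non_blocking only has an effect for pinned memory and CUDA targets.
    xb = xb.to(device, non_blocking=True)
    yb = yb.to(device, non_blocking=True)
    num_seen += xb.size(0)

print(num_seen)
```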

In [5]:
def get_data(device):
    train_data_raw = datasets.CIFAR10(
        root='data',
        train=True,
        download=True
    )
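    # raw images: uint8 array of shape (N, H, W, C) with values in [0, 255]
    train_x = torch.tensor(train_data_raw.data, device=device, dtype=torch.float32)
    train_y = torch.tensor(train_data_raw.targets, device=device)
    # reorder to (N, C, H, W) and scale to [0, 1], matching transforms.ToTensor()
    train_x = train_x.permute(0, 3, 1, 2) / 255.0
    norm = torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    train_x = norm(train_x)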
    return (train_x, train_y)

In [13]:
device = torch.device("cuda:2")



X,y = get_data(device)
model = Net(num_classes)
model = model.to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
model.train()
for epoch in tqdm(range(1, num_epochs + 1)):
    for i in range(0, X.size(0), batch_size):
        z = model(X[i:i + batch_size])
        loss = F.cross_entropy(z, y[i:i + batch_size])

        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

    torch.cuda.synchronize()

Files already downloaded and verified


100%|██████████| 15/15 [00:07<00:00,  1.90it/s]


Use `nn.ModuleList` if you need to iterate through modules! This is particularly important for `torch.compile` (see below).
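
The reason is parameter registration: layers kept in a plain Python list are invisible to `model.parameters()`, `model.to(device)` and the optimizer. A small sketch with two hypothetical classes shows the difference.

```python
import torch
from torch import nn


class PlainList(nn.Module):
    def __init__(self):
        super().__init__()
        # A plain Python list hides the layers from PyTorch.
        self.layers = [nn.Linear(4, 4) for _ in range(3)]


class WithModuleList(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList registers every layer as a submodule.
        self.layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(3)])

    def forward(self, x):
        for layer in self.layers:  # iterate like a normal list
            x = torch.relu(layer(x))
        return x


n_plain = len(list(PlainList().parameters()))
n_registered = len(list(WithModuleList().parameters()))
print(n_plain)       # 0
print(n_registered)  # 6 (3 weights + 3 biases)
```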

## How to write PyTorch Code that isn't (too) bad

This is a guide to avoiding some common pitfalls, not to writing the fastest code possible (you probably wouldn't have time for that anyway).

#### Motivation

- Compute is an expensive shared resource.
- Wasting compute means literally just creating entropy (CLIMATE CHANGE IS A THING).
- Other people also have projects they want to do.

#### Some Online Guides that go more in-depth

### Enable Tensor Cores (if applicable)

Your GPU might not support this (Nvidia GPUs for datacenters like KIGS or the HPC in Erlangen do). Most libraries that implement models (e.g. transformers etc.) will either do this by default or let you enable it through them.

The idea: reduce the length of the mantissa in 32-bit floating point numbers from 23 bits to 10 (TensorFloat-32) or 7 (bfloat16), while keeping the exponent length the same to preserve the range of values.
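
The mantissa and range trade-off can be inspected with `torch.finfo`. TensorFloat-32 is not a tensor dtype, so this sketch uses `float16` as a stand-in with the same 10 mantissa bits but a smaller exponent, next to `bfloat16`. The test values are arbitrary.

```python
import math

import torch

# eps is the gap between 1.0 and the next representable number.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(dtype, "eps:", info.eps, "max:", info.max)

# Precision: a small offset survives in float32 but not in bfloat16.
x = torch.tensor(1.0 + 2**-10, dtype=torch.float32)
print(x.item() > 1.0)                      # True
print(x.to(torch.bfloat16).item() == 1.0)  # True, the offset is rounded away

# Range: 1e5 overflows float16 but fits into bfloat16.
big = torch.tensor(1e5)
print(math.isinf(big.to(torch.float16).item()))   # True
print(math.isinf(big.to(torch.bfloat16).item()))  # False
```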

In [11]:
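# Only the most recent call takes effect, so pick one of the two:
torch.set_float32_matmul_precision('high') #TF32
torch.set_float32_matmul_precision('medium') #BF16

def train(model, optimizer, x, y):
    # autocast defaults to float16 on CUDA; pass dtype=torch.bfloat16 to use bfloat16 instead
    with torch.autocast(device_type="cuda"):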
        z = model(x)
        loss = F.cross_entropy(z, y)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()


WHAT IS A TENSOR CORE


### torch.compile is really neat

You might need to install Triton though. It can usually be installed like any Python package. A [Windows version](https://github.com/woct0rdho/triton-windows) also exists now.


In [12]:
device = torch.device("cuda:2")

X,y = get_data(device)

model = Net(num_classes)
model = model.to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
model.train()

train = torch.compile(train)

for epoch in tqdm(range(1, num_epochs + 1)):
    for i in range(0, X.size(0), batch_size):
        train(model, optimizer, X[i:i + batch_size], y[i:i + batch_size])

    torch.cuda.synchronize()


Files already downloaded and verified


100%|██████████| 15/15 [00:15<00:00,  1.04s/it]


For inference you can compile your model; for training it's best practice to compile a train function instead.

The `mode="reduce-overhead"` argument can help with small models/batches (try it out for each use case).

Avoid graph breaks: keep non-PyTorch code out of compiled functions.

Best practice: avoid in-place ops when possible, since AOTAutograd sometimes has issues with them.




### The CPU-GPU Bottleneck



In [10]:
device = torch.device("cuda:2")

X,y = get_data(device)

model = Net(num_classes)
model = model.to(device)
model = torch.compile(model)
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
model.train()
for epoch in tqdm(range(1, num_epochs + 1)):
    for i in range(0, X.size(0), batch_size):
        z = model(X[i:i + batch_size])
        a = int(X[0][0][0][0])
        loss = F.cross_entropy(z, y[i:i + batch_size])

        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

Files already downloaded and verified


 20%|██        | 30/150 [00:18<01:12,  1.66it/s]


KeyboardInterrupt: 
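
The extra line `a = int(X[0][0][0][0])` in the cell above is the culprit: converting a GPU tensor to a Python number forces the CPU to wait until the GPU has finished all queued work and then copies the value to host memory. The same applies to `.item()`, `.cpu()` and printing a tensor. A common way around this is to keep running metrics on the device and read them once per epoch. The following is a sketch with a stand-in model and random data, not the CNN from above.

```python
import torch
import torch.nn.functional as F
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in model and data, only here to make the sketch runnable.
model = nn.Linear(20, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
X = torch.randn(1024, 20, device=device)
y = torch.randint(0, 10, (1024,), device=device)
batch_size = 256

running_loss = torch.zeros((), device=device)  # lives on the same device as the loss
num_batches = 0
for i in range(0, X.size(0), batch_size):
    loss = F.cross_entropy(model(X[i:i + batch_size]), y[i:i + batch_size])
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    # Bad: loss.item() here would synchronize CPU and GPU on every step.
    running_loss += loss.detach()  # stays on the device, no synchronization
    num_batches += 1

# Good: a single device-to-host transfer per epoch.
mean_loss = (running_loss / num_batches).item()
print(f"mean loss: {mean_loss:.4f}")
```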

### Views and Copies
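
A minimal sketch of the distinction, using only standard tensor methods: a view shares memory with its base tensor, a copy does not. The values are arbitrary.

```python
import torch

base = torch.arange(6)

view = base.view(2, 3)  # a view: same memory, different shape
copy = base.clone()     # a copy: its own memory

base[0] = 100
print(view[0, 0].item())  # 100, the view sees the change
print(copy[0].item())     # 0, the copy does not

# Transposing returns a non-contiguous view of the same memory.
t = view.T
print(t.is_contiguous())                # False
print(t.data_ptr() == base.data_ptr())  # True, still no copy

# reshape returns a view when possible and silently copies otherwise.
same = base.reshape(3, 2)
flat = t.reshape(6)
print(same.data_ptr() == base.data_ptr())  # True, no copy needed
print(flat.data_ptr() == base.data_ptr())  # False, a copy was made
```

Unintended copies cost time and memory, while unintended views can cause surprising in-place modifications, so it is worth knowing which one an operation returns.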

### Multithreading is very simple and you should probably use it

- Example: a pipeline that processes image time series:
	- Real-world use case: the steps are sequential, and each frame should be processed as fast as possible.
	- A naive training approach follows the same logic: preprocessing of the first step on the CPU, CNN on the first frame, first RNN step, preprocessing of the second step, CNN on the second frame, second RNN step, and so on.
	- Better: batch the CNN step, multithread the preprocessing, and avoid idle GPU time.
	- THIS IS OBVIOUS WHEN YOU THINK ABOUT IT, but PyTorch makes it easy to not do it.
	- You can use normal multithreading for standard Python code.
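
The pipeline idea above can be sketched as follows. During training all frames are available up front, so the preprocessing can run in a thread pool and the CNN can process all frames in one batched call. The `preprocess` function, the toy CNN and RNN, and all sizes are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor

import torch
from torch import nn


def preprocess(frame):
    """Hypothetical CPU-side preprocessing of one raw frame."""
    return (frame.float() / 255.0 - 0.5) / 0.5


# Stand-in data: 16 frames of a 3x32x32 image time series.
raw_frames = [torch.randint(0, 256, (3, 32, 32), dtype=torch.uint8) for _ in range(16)]

# Toy CNN and RNN, only here to make the sketch runnable.
cnn = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3), nn.AdaptiveAvgPool2d(1), nn.Flatten())
rnn = nn.GRU(input_size=8, hidden_size=4, batch_first=True)

# Preprocess all frames in parallel threads instead of one after the other.
with ThreadPoolExecutor(max_workers=4) as pool:
    frames = list(pool.map(preprocess, raw_frames))

# One batched CNN call over all frames instead of one call per frame.
features = cnn(torch.stack(frames))  # shape (16, 8)

# The RNN then consumes the whole feature sequence at once.
outputs, hidden = rnn(features.unsqueeze(0))  # shape (1, 16, 4)
print(outputs.shape)
```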

### Magic Numbers

### Avoiding Computation of Unnecessary Gradients
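
Two standard tools for this are `torch.no_grad()` for evaluation and `requires_grad_(False)` for frozen parameters. The following is a minimal sketch; the small model and the choice of which layer to freeze are made up for illustration.

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
x = torch.randn(4, 8)

# Evaluation: no graph is recorded inside the no_grad block.
model.eval()
with torch.no_grad():
    preds = model(x)
print(preds.requires_grad)  # False

# Fine-tuning: freeze the first layer so only the last one gets gradients.
model.train()
for p in model[0].parameters():
    p.requires_grad_(False)

model(x).sum().backward()
print(model[0].weight.grad)              # None, nothing was computed for it
print(model[2].weight.grad is not None)  # True

# Only hand the trainable parameters to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.01)
print(len(trainable))
```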

## Why you (probably) don't need more than one GPU

### Large Model? Use Quantization

### Training (generally) doesn't scale well to multiple GPUs



### Some more specific Tips

#### People have probably already implemented a lot of the things you need
LINKS HERE
Huggingface

#### Mixed Precision is cool

#### Accelerate is a library that exists

#### There are differences in how various kernels handle torch.compile
