# PyTorch Crash Course

### Quick Tips for installing PyTorch
To run PyTorch on a GPU you need an Nvidia GPU; the steps below use the CUDA Toolkit to find out your CUDA version.

Type
```
nvcc --version
```
into your console to figure out what CUDA version you have.

If the command prints nothing (or isn't found) on your own computer, you might need to install the [CUDA Toolkit](https://developer.nvidia.com/cuda/toolkit).
###### _If you are on a remote computer, you probably can't do that on your own. Talk to an admin instead._

If everything is alright, the output should look roughly like this:
```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Aug_14_10:26:51_Pacific_Daylight_Time_2024
Cuda compilation tools, release 12.6, V12.6.68
Build cuda_12.6.r12.6/compiler.34714021_0
```
(I have CUDA 12.6 on the machine where I ran the command.)

Now you can go to [PyTorch's official website](https://pytorch.org) to download PyTorch for the correct CUDA version.

PyTorch seems to handle CUDA version mismatches fairly well: I have seen newer versions of PyTorch run fine with older CUDA versions. __DON'T QUOTE ME ON THIS THOUGH__

Pytorch also offers [old versions](https://pytorch.org/get-started/previous-versions/) for download, which work with older CUDA versions.
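
Once PyTorch is installed, a quick check from Python tells you whether it can see your GPU. This is a minimal sketch; the helper name `describe_torch_install` is made up for illustration, and on a machine without a usable GPU it simply reports that CUDA is unavailable.

```python
import torch

def describe_torch_install():
    """Collect basic facts about the installed PyTorch build."""
    info = {
        "torch_version": torch.__version__,
        # CUDA version PyTorch was built against (None for CPU-only builds)
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["gpu_name"] = torch.cuda.get_device_name(0)
    return info

for key, value in describe_torch_install().items():
    print(f"{key}: {value}")
```

If `cuda_available` is `False` even though you have an Nvidia GPU, the installed build or the driver is the first thing to check.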



## Why use GPUs for Machine Learning?

#### A Demo:
Let's fit a CNN to CIFAR10.
(_This code is shamelessly copied from an exercise in Prof. Jain's FMI course from last year_)

In [1]:
import torch
import torch.nn.functional as F
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import torchvision
from torchvision import datasets, transforms
import time
from tqdm import tqdm

in_channels   = 3
num_classes   = 10
num_workers   = 8
batch_size    = 256 # this is pretty large for the actual model we use here, but not unreasonable in other use cases
num_epochs    = 15
learning_rate = 0.005

In [2]:
transform = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
train_data = datasets.CIFAR10(root='data', train=True, transform=transform,download=True)
train_loader = DataLoader(train_data, batch_size=batch_size, num_workers=num_workers, shuffle=True)



In [4]:
class Net(nn.Module):
    def __init__(self, num_classes, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=5, stride=1, padding=0),
            nn.BatchNorm2d(32),
            nn.Dropout2d(dropout),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(32, 64, kernel_size=5, stride=1, padding=0),
            nn.BatchNorm2d(64),
            nn.Dropout2d(dropout),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Flatten(),
            nn.Linear(64 * 5 * 5, 400),
            nn.BatchNorm1d(400),
            nn.Dropout(dropout),
            nn.ReLU(),
            nn.Linear(400, 100),
            nn.BatchNorm1d(100),
            nn.Dropout(dropout),
            nn.ReLU(),
            nn.Linear(100, num_classes))

    def forward(self, x):
        return self.net(x)

In [None]:
device = torch.device("cpu")

model = Net(num_classes)
model = model.to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)


for epoch in tqdm(range(1, num_epochs + 1)):

    model.train()
    for batch_idx, (X, y) in enumerate(train_loader):

        X = X.to(device)
        y = y.to(device)

        z = model(X)
        loss = F.cross_entropy(z, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

In [None]:
device = torch.device("cuda:0")

model = Net(num_classes)
model = model.to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

for epoch in tqdm(range(1, num_epochs + 1)):

    model.train()
    for batch_idx, (X, y) in enumerate(train_loader):

        X = X.to(device)
        y = y.to(device)

        z = model(X)
        loss = F.cross_entropy(z, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Wow, the code runs almost a minute faster on my GPU (and it will run even faster once we have optimized it later). How can that be?

Long story short: forward and backward passes (inference and training) require massive amounts of vector/matrix (tensor) operations, which are highly parallelizable. GPUs (Graphics Processing Units) are built to run exactly these kinds of highly parallel operations to render pretty computer graphics, so they are also good at neural networks by accident. Datacenter GPUs nowadays are often purpose-built for ML tasks, so they are even better.
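
To see the effect in isolation, you can time a single large matrix multiplication on each device. This is only a rough sketch: the function name `time_matmul`, the matrix size, and the repeat count are arbitrary choices, and the numbers depend heavily on your hardware. If no GPU is available, only the CPU is measured.

```python
import time
import torch

def time_matmul(device, size=1024, repeats=5):
    """Return the average time in seconds of a (size x size) matrix multiplication."""
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)
    a @ b  # warm-up run so one-time setup costs are not measured
    if device.type == "cuda":
        torch.cuda.synchronize()  # GPU work is asynchronous, wait before starting the clock
    start = time.perf_counter()
    for _ in range(repeats):
        a @ b
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait until all queued GPU work has finished
    return (time.perf_counter() - start) / repeats

devices = [torch.device("cpu")]
if torch.cuda.is_available():
    devices.append(torch.device("cuda:0"))
for dev in devices:
    print(f"{dev}: {time_matmul(dev) * 1000:.2f} ms per matmul")
```

The `torch.cuda.synchronize()` calls matter: without them you would only measure how long it takes to queue the work, not to finish it.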

If you do projects on your home computer and have a decent gaming GPU, then you can probably use it for ML.

## Okay but how do I PyTorch?

#### Online Resources

### You already know a lot of the syntax if you know Numpy
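
A small side-by-side comparison, as a sketch of what "the same syntax" means in practice. The array contents are arbitrary; the point is that indexing, broadcasting, and reductions read almost identically.

```python
import numpy as np
import torch

a_np = np.arange(6, dtype=np.float32).reshape(2, 3)
a_pt = torch.arange(6, dtype=torch.float32).reshape(2, 3)

# Slicing, broadcasting and reductions look the same in both libraries
col_np = a_np[:, 1] * 2 + 1
col_pt = a_pt[:, 1] * 2 + 1
total_np = a_np.sum(axis=0)
total_pt = a_pt.sum(dim=0)  # PyTorch calls the argument `dim` instead of `axis`

# Converting between the two on the CPU shares memory instead of copying
b_pt = torch.from_numpy(a_np)
b_np = a_pt.numpy()

print(col_np, col_pt)
print(total_np, total_pt)
```

The main differences you will run into are the `dim` naming and the fact that tensors live on a device, so a GPU tensor has to be moved back with `.cpu()` before calling `.numpy()`.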



### You probably want to use nn.Module

### Having your data stored in memory can be a good or a bad idea

In [5]:
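def get_data(device):
    # Load CIFAR10 without transforms; .data is a uint8 NumPy array of shape (N, 32, 32, 3)
    train_data_raw = datasets.CIFAR10(
        root='data',
        train=True,
        download=True
    )
    train_x = torch.tensor(train_data_raw.data, device=device, dtype=torch.float32)
    train_y = torch.tensor(train_data_raw.targets, device=device)
    # Mimic ToTensor(): move channels first, (N, H, W, C) -> (N, C, H, W), and scale to [0, 1]
    train_x = train_x.permute(0, 3, 1, 2) / 255.0
    norm = torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    train_x = norm(train_x)
    return (train_x, train_y)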

In [6]:
device = torch.device("cuda:0")


X,y = get_data(device)
model = Net(num_classes)
model = model.to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
model.train()
for epoch in tqdm(range(1, num_epochs + 1)):
    for i in range(0, X.size(0), batch_size):
        z = model(X[i:i + batch_size])
        loss = F.cross_entropy(z, y[i:i + batch_size])

        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

100%|██████████| 15/15 [00:06<00:00,  2.29it/s]
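
Whether keeping the whole dataset on the GPU is a good idea mostly comes down to whether it fits. A back-of-the-envelope check, as a sketch (the helper `tensor_bytes` is made up for illustration): the CIFAR10 training set has 50,000 images of 3 x 32 x 32 values, so as float32 it needs 50,000 · 3,072 · 4 bytes, roughly 614 MB.

```python
import torch

def tensor_bytes(shape, dtype=torch.float32):
    """Memory needed to store a dense tensor of the given shape and dtype."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    # element_size() is the number of bytes per element (4 for float32)
    return n_elements * torch.empty(0, dtype=dtype).element_size()

cifar_train_shape = (50_000, 3, 32, 32)  # CIFAR10 training images, channels first
needed = tensor_bytes(cifar_train_shape)
print(f"float32 copy of the training set: {needed / 1e6:.1f} MB")  # 614.4 MB

if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()  # free and total bytes on the current GPU
    print(f"free GPU memory: {free / 1e6:.1f} MB of {total / 1e6:.1f} MB")
```

Keep in mind that the model's parameters, activations, and gradients need room as well, so a dataset that only just fits is probably a bad candidate.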


## How to write Pytorch Code that isn't (too) bad

This is a guide to avoiding some common pitfalls, not to writing the fastest code possible (you probably wouldn't have time for that anyway).

#### Motivation

- Compute is an expensive shared resource.
- Wasting compute means literally just creating entropy (CLIMATE CHANGE IS A THING).
- Other people also have projects they want to do.

#### Some Online Guides that go more in-depth

### Enable Tensor Cores (if applicable)

Your GPU might not support this. Most libraries that implement models (e.g. transformers etc.) will either do this by default or let you enable it through them.

In [7]:
torch.set_float32_matmul_precision('high')


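What is a Tensor Core? Tensor Cores are dedicated units on newer Nvidia GPUs that multiply small matrices in reduced precision (e.g. FP16, BF16, or TF32) much faster than the regular CUDA cores can. Setting the float32 matmul precision to `'high'` allows PyTorch to use such faster, slightly less precise algorithms for float32 matrix multiplications when the hardware offers them.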
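
Whether the setting above has any effect depends on the hardware. As a rough sketch, assuming that TF32 support starts at compute capability 8.0 (the Ampere generation), you can check the GPU before changing the precision. The helper name `supports_tf32` is made up for illustration.

```python
import torch

def supports_tf32(device_index=0):
    """Rough check: TF32 Tensor Core math needs an Nvidia GPU of compute capability 8.0+."""
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability(device_index)
    return major >= 8

if supports_tf32():
    # Allow faster, slightly less precise float32 matrix multiplications
    torch.set_float32_matmul_precision("high")
print("float32 matmul precision:", torch.get_float32_matmul_precision())
```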


### torch.compile is really neat

You might need to install Triton though. It can usually be installed like any Python package. A [Windows version](https://github.com/woct0rdho/triton-windows) also exists now.


In [11]:
device = torch.device("cuda:0")
num_epochs = 150

X,y = get_data(device)

model = Net(num_classes)
model = model.to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
model.train()

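# Log graph breaks. TORCH_LOGS is an environment variable, so assigning a
# Python variable of that name would have no effect.
torch._logging.set_logs(graph_breaks=True)

def train(model, optimizer, x, y):
    z = model(x)
    loss = F.cross_entropy(z, y)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()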

train = torch.compile(train)

for epoch in tqdm(range(1, num_epochs + 1)):
    for i in range(0, X.size(0), batch_size):
        train(model, optimizer, X[i:i + batch_size], y[i:i + batch_size])


 18%|█▊        | 27/150 [00:13<01:01,  2.01it/s]


KeyboardInterrupt: 

For inference you can compile your model directly; for training it's best practice to compile a train function instead.

The `mode="reduce-overhead"` argument of `torch.compile` can help with small models/batches (try it out for each use case).
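
For the inference case, compiling the model itself could look roughly like this. It is a sketch under assumptions: `compile_for_inference`, `predict`, and the toy model are made up for illustration, and whether `reduce-overhead` actually helps has to be measured on your own model.

```python
import torch
from torch import nn

def compile_for_inference(model, reduce_overhead=False):
    """Put the model into eval mode and wrap it with torch.compile."""
    model.eval()
    mode = "reduce-overhead" if reduce_overhead else "default"
    # Compilation is lazy: the actual work happens on the first call
    return torch.compile(model, mode=mode)

# Hypothetical toy model, only here to make the sketch self-contained
toy_model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
compiled_model = compile_for_inference(toy_model, reduce_overhead=True)

def predict(x):
    """Return the predicted class index for each row of x."""
    with torch.no_grad():  # no gradients needed for inference
        return compiled_model(x).argmax(dim=1)
```

The first call to `predict` is slow because it triggers the compilation, so exclude it when you compare timings.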



### The CPU-GPU Bottleneck



In [10]:
device = torch.device("cuda:0")

X,y = get_data(device)

model = Net(num_classes)
model = model.to(device)
model = torch.compile(model)
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
model.train()
for epoch in tqdm(range(1, num_epochs + 1)):
    for i in range(0, X.size(0), batch_size):
        z = model(X[i:i + batch_size])
        a = int(X[0][0][0][0])  # int() copies a value from the GPU to the CPU, forcing a synchronization on every step
        loss = F.cross_entropy(z, y[i:i + batch_size])

        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

 17%|█▋        | 26/150 [00:26<02:05,  1.01s/it]


KeyboardInterrupt: 
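
The slowdown in the cell above comes from pulling a single value to the CPU in every step: the CPU has to wait for the GPU to finish before it can continue queuing work. If you want to log something like the loss, one way to avoid this (a sketch, with the made-up helper `train_epoch` and hypothetical stand-in data) is to accumulate it on the device and transfer it only once per epoch.

```python
import torch
import torch.nn.functional as F
from torch import nn

def train_epoch(model, optimizer, X, y, batch_size):
    """One epoch over in-memory data; returns the mean batch loss as a Python float."""
    running_loss = torch.zeros((), device=X.device)  # accumulator lives on the same device
    n_batches = 0
    for i in range(0, X.size(0), batch_size):
        z = model(X[i:i + batch_size])
        loss = F.cross_entropy(z, y[i:i + batch_size])

        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

        running_loss += loss.detach()  # stays on the device, no transfer to the CPU
        n_batches += 1
    # .item() copies to the CPU and waits for the GPU, so call it once per epoch
    return (running_loss / n_batches).item()

# Tiny stand-in data and model (hypothetical), just to show the call
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
X_demo = torch.randn(64, 8, device=device)
y_demo = torch.randint(0, 3, (64,), device=device)
model_demo = nn.Linear(8, 3).to(device)
opt_demo = torch.optim.SGD(model_demo.parameters(), lr=0.1)
print(train_epoch(model_demo, opt_demo, X_demo, y_demo, batch_size=16))
```

The same applies to `print(loss)`, `loss.item()`, `.cpu()`, and `.numpy()` inside the training loop: each of them makes the CPU wait for the GPU.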

### Views and Copies
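
A short illustration of the difference, as a sketch with arbitrary values: basic slicing gives you a view that shares memory with the original tensor, while `clone()` and indexing with a list give you a copy.

```python
import torch

base = torch.arange(6).reshape(2, 3)

row_view = base[0]          # basic slicing returns a view into `base`
row_copy = base[0].clone()  # clone() allocates new memory
picked = base[[0]]          # indexing with a list (advanced indexing) copies

row_view[0] = 100           # writes through to `base`
row_copy[1] = -1            # does not affect `base`
picked[0, 2] = -5           # does not affect `base`

print(base)
# Views share storage with the original tensor, copies do not
print(row_view.data_ptr() == base.data_ptr())  # True
print(row_copy.data_ptr() == base.data_ptr())  # False
```

Views are cheap, which is why the batch slices `X[i:i + batch_size]` used earlier cost almost nothing, but writing into a view changes the original data.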

### Multithreading is very simple and you should probably use it

### Magic Numbers

## Why you (probably) don't need more than one GPU

### Large Model? Use Quantization

### Training (generally) doesn't scale well to multiple GPUs



### Some more specific Tips

#### People have probably already implemented a lot of the things you need
LINKS HERE
Huggingface

#### Mixed Precision is cool
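
A minimal sketch of what mixed precision can look like with `torch.autocast`. The function name `train_step_amp` and the toy data are made up for illustration, and the sketch assumes bfloat16, which does not need gradient scaling; with float16 you would typically add a gradient scaler as well.

```python
import torch
import torch.nn.functional as F
from torch import nn

def train_step_amp(model, optimizer, x, y):
    """One training step with the forward pass run under bfloat16 autocast."""
    # Inside autocast, eligible ops (e.g. linear layers) run in bfloat16
    with torch.autocast(device_type=x.device.type, dtype=torch.bfloat16):
        z = model(x)
        loss = F.cross_entropy(z, y)
    # Backward pass and optimizer step happen outside the autocast block
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    return loss.detach()

# Tiny stand-in data and model (hypothetical), just to show the call
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
x_demo = torch.randn(32, 8, device=device)
y_demo = torch.randint(0, 3, (32,), device=device)
model_demo = nn.Linear(8, 3).to(device)
opt_demo = torch.optim.SGD(model_demo.parameters(), lr=0.1)
print(train_step_amp(model_demo, opt_demo, x_demo, y_demo))
```

The model's parameters stay in float32; only the computations inside the `autocast` block use the lower precision.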

#### Accelerate is a library that exists

#### There are differences in how various kernels handle torch.compile
