# PyTorch Crash Course

### Quick Tips for installing PyTorch
You need the CUDA Toolkit and an NVIDIA GPU.

Type
```
nvcc --version
```
into your console to figure out what CUDA version you have.

If the command prints nothing or is not found on your own computer, you might need to install the [CUDA Toolkit](https://developer.nvidia.com/cuda/toolkit).
###### _If you are on a remote computer, you probably can't do that on your own. Talk to an admin instead._

If everything is alright, the output should look roughly like this:
```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Aug_14_10:26:51_Pacific_Daylight_Time_2024
Cuda compilation tools, release 12.6, V12.6.68
Build cuda_12.6.r12.6/compiler.34714021_0
```
(I have CUDA 12.6 on the machine where I ran the command.)

Now you can go to [PyTorch's official website](https://pytorch.org) to download PyTorch for the correct CUDA version.

PyTorch seems to be okay at dealing with CUDA version mismatches: I have seen newer versions of PyTorch run fine with older CUDA versions. A likely reason is that the official pip wheels bundle their own CUDA runtime, so what matters most is that your NVIDIA driver is recent enough. __DON'T QUOTE ME ON THIS THOUGH__

PyTorch also offers [old versions](https://pytorch.org/get-started/previous-versions/) for download, which work with older CUDA versions.
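
After installing, a quick check from Python shows whether the installed build can actually reach the GPU. This is a minimal sketch that only uses standard `torch` attributes; the printed values depend entirely on your machine, so no output is shown.

```python
import torch

# Version of the installed PyTorch build
print("PyTorch version:", torch.__version__)

# CUDA version this build was compiled against (None for CPU-only builds)
print("Built for CUDA:", torch.version.cuda)

# True only if a usable NVIDIA GPU and driver were found
cuda_ok = torch.cuda.is_available()
print("CUDA available:", cuda_ok)

if cuda_ok:
    # Name of the first visible GPU
    print("GPU 0:", torch.cuda.get_device_name(0))
```

If `CUDA available` is `False` even though you have a GPU, the installed build or the driver is the first thing to check.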



## Why PyTorch?

- Differentiation is hard; PyTorch does it for you with one line of code
- PyTorch implements a lot of other functionality as well (data handling, optimizers, model handling, loss functions, ...)
- PyTorch runs fast
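
To make the "one line of code" claim concrete, here is a minimal sketch of automatic differentiation. The function `y = x**2 + 2*x` is just an illustrative choice, not something from the course material.

```python
import torch

# A scalar input that we want a gradient for
x = torch.tensor(3.0, requires_grad=True)

# Any computation built from torch operations is recorded automatically
y = x ** 2 + 2 * x

# The "one line": backpropagate from y to every tensor with requires_grad=True
y.backward()

# dy/dx = 2x + 2, which is 8 at x = 3
print(x.grad)  # tensor(8.)
```

The same `backward()` call works unchanged when `y` is the loss of a large network and `x` stands for millions of parameters.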

## Why use GPUs for Machine Learning?

#### A Demo:
Let's fit a CNN to CIFAR10.
(_This code is shamelessly copied from an exercise in Prof. Jain's FMI course from last year_)

In [1]:
from src.small_CNN import *

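def train(model, optimizer, device):
    for epoch in range(1, num_epochs + 1):
        for batch_idx, (X, y) in enumerate(train_loader):
            start_batch = time.time()
            # move the batch to the device the model lives on
            X = X.to(device)
            y = y.to(device)

            z = model(X)  # forward pass
            loss = F.cross_entropy(z, y)

            optimizer.zero_grad(set_to_none=True)
            loss.backward()  # backward pass
            optimizer.step()

            # CUDA kernels run asynchronously; wait for them before stopping the clock
            if device.type == "cuda":
                torch.cuda.synchronize()
            end_batch = time.time()
            batch_time = int((end_batch - start_batch) * 1000)
        # batch_time is the duration of the last batch of the epoch
        print(f"epoch {epoch}, {batch_time}ms elapsed")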

Files already downloaded and verified
Files already downloaded and verified


In [None]:
device = torch.device("cpu")

model = Net(num_classes)
model = model.to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
model.train()

train(model, optimizer, device)

In [2]:
device = torch.device("cuda:2")  # third GPU of this server; use "cuda" or "cuda:0" on a single-GPU machine

model = Net(num_classes)
model = model.to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
model.train()

train(model, optimizer, device)

epoch 1, 4ms elapsed
epoch 2, 2ms elapsed
epoch 3, 2ms elapsed
epoch 4, 2ms elapsed
epoch 5, 1ms elapsed
epoch 6, 1ms elapsed
epoch 7, 1ms elapsed
epoch 8, 2ms elapsed
epoch 9, 1ms elapsed
epoch 10, 2ms elapsed
epoch 11, 1ms elapsed
epoch 12, 2ms elapsed
epoch 13, 3ms elapsed
epoch 14, 2ms elapsed
epoch 15, 2ms elapsed


Running on the GPU can be a lot faster than running on the CPU (run both cells above and compare the timings). But why is that the case?

Long story short: forward and backward passes (inference and training) require massive amounts of vector/matrix (tensor) operations, which are highly parallelizable. GPUs (Graphics Processing Units) were built to run exactly these kinds of highly parallel operations to render pretty computer graphics, so they happen to be good at neural networks as well. Datacenter GPUs nowadays are often purpose-built for ML tasks, so they are even better.
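
A rough way to see this effect in isolation is to time a plain matrix multiplication on both devices. This is a sketch under assumptions: the helper `time_matmul` and the matrix size are hypothetical choices for illustration, and the numbers you get depend heavily on your hardware.

```python
import time
import torch

def time_matmul(device, n=1024, repeats=10):
    """Average seconds per (n x n) @ (n x n) matrix multiplication on `device`."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    a @ b  # warm-up run, not timed
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for queued GPU work before starting the clock
    start = time.perf_counter()
    for _ in range(repeats):
        a @ b
    if device.type == "cuda":
        torch.cuda.synchronize()  # GPU kernels run asynchronously
    return (time.perf_counter() - start) / repeats

cpu_time = time_matmul(torch.device("cpu"))
print(f"cpu:  {cpu_time * 1000:.2f} ms per matmul")

if torch.cuda.is_available():
    gpu_time = time_matmul(torch.device("cuda"))
    print(f"cuda: {gpu_time * 1000:.2f} ms per matmul")
```

The gap usually grows with `n`; for tiny matrices the GPU may not be faster at all because launch overhead dominates.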

If you do projects on your home computer and have a decent gaming GPU, then you can probably use it for ML.

If you need big computing power, the KI-GPU Server at OTH Regensburg or the NHR at FAU Erlangen can give you access to big datacenter GPUs.

## Okay but how do I PyTorch?

#### Online Resources
- [Official Documentation](https://docs.pytorch.org/docs/stable/index.html): For looking up syntax etc.
- [Official Tutorials](https://docs.pytorch.org/tutorials/index.html): This is the best place to start; also teaches a lot of general ML things.
- [Hugging Face Tutorials](https://huggingface.co/learn): If you are specifically interested in NLP, then this will teach you just enough PyTorch to get by (while also teaching you a lot of other things).
- ["Let's build GPT: from scratch, in code, spelled out." by Andrej Karpathy](https://www.youtube.com/watch?v=kCc8FmEb1nY): Not just for people interested in transformers! An excellent tutorial on working with PyTorch to build a complex model. If you are interested, also look at [the follow-up video on reproducing GPT-2](https://www.youtube.com/watch?v=l8pRSuU81PU).


### You already know a lot of the syntax if you know NumPy...

- PyTorch tensors represent n-dimensional arrays of various datatypes (just like ndarrays in NumPy).
- Doing linear algebra with PyTorch is very similar to doing it with NumPy.
- A lot of NumPy functions are named identically in PyTorch (so just trying out what you would do in NumPy often works).
- Most of the ones that aren't identical have an equivalent you can find with a quick web search.
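
The points above can be sketched side by side. The array contents are arbitrary illustration values; the conversion functions `torch.from_numpy` and `Tensor.numpy()` only work for tensors on the CPU.

```python
import numpy as np
import torch

a_np = np.arange(6).reshape(2, 3)
a_pt = torch.arange(6).reshape(2, 3)

# Many functions share names and semantics
sum_np = np.sum(a_np, axis=0)    # NumPy calls the argument `axis`
sum_pt = torch.sum(a_pt, dim=0)  # PyTorch calls it `dim`
print(sum_np, sum_pt)

# Converting between the two (CPU only); both directions share memory
from_np = torch.from_numpy(a_np)
back_to_np = a_pt.numpy()
print(type(from_np), type(back_to_np))
```

Because the converted objects share memory with their source, an in-place change to one is visible in the other.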



In [None]:
my_tensor = torch.tensor([[1, 2, 3],[4, 5, 6]])
my_tensor

In [None]:
my_tensor[1][1:]

In [None]:
my_tensor.reshape((3,2))

In [None]:
my_tensor * 2

In [None]:
my_tensor.T

In [None]:
my_tensor.unsqueeze(0)

In [None]:
my_tensor @ my_tensor.T  # matrix multiplication

##### The important new bit: The GPU
Tensors need to explicitly be moved between devices.

In [None]:
torch.cuda.is_available()  # check if a GPU is available

In [None]:
my_second_tensor = torch.tensor([[4, 5, 6],[7, 8, 9]])
my_second_tensor.device

In [None]:
my_second_tensor = my_second_tensor.to("cuda")
my_second_tensor.device

Tensors need to be on the same device to allow operations between them.

In [None]:
# raises a RuntimeError: my_tensor is on the CPU, my_second_tensor is on the GPU
print(my_tensor + my_second_tensor)

In [None]:
print(my_tensor.to("cuda") + my_second_tensor)
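
A common way to avoid hard-coding `"cuda"` is to pick the device once and use it everywhere. This is a minimal sketch with made-up tensor values; it falls back to the CPU when no GPU is present.

```python
import torch

# Fall back to the CPU when no GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create a tensor directly on the target device ...
a = torch.ones(2, 3, device=device)
# ... or move an existing one there
b = torch.tensor([[4, 5, 6], [7, 8, 9]]).to(device)

result = a + b             # works: both operands are on `device`
result_cpu = result.cpu()  # move back to the CPU, e.g. for NumPy or plotting
print(result_cpu)
```

Creating tensors directly on the device saves one host-to-device copy compared with creating them on the CPU and moving them afterwards.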

### You probably want to use nn.Module to create your models

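A good way to build your own neural network architectures is to use `nn.Module`. Write a class that inherits from `nn.Module`, define your layers in `__init__`, give it a `forward` method, and you are good to go. Layers assigned as attributes are registered automatically, so `model.parameters()` and `model.to(device)` find them.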

In [None]:
class Net(nn.Module):
    # in_channels defaults to 3 because CIFAR10 images are RGB
    def __init__(self, num_classes, in_channels=3, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            # 32x32 input -> 28x28 after the 5x5 convolution -> 14x14 after pooling
            nn.Conv2d(in_channels, 32, kernel_size=5, stride=1, padding=0),
            nn.BatchNorm2d(32),
            nn.Dropout2d(dropout),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            # 14x14 -> 10x10 after the 5x5 convolution -> 5x5 after pooling
            nn.Conv2d(32, 64, kernel_size=5, stride=1, padding=0),
            nn.BatchNorm2d(64),
            nn.Dropout2d(dropout),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            # 64 channels of size 5x5 are flattened into one vector per sample
            nn.Flatten(),
            nn.Linear(64 * 5 * 5, 400),
            nn.BatchNorm1d(400),
            nn.Dropout(dropout),
            nn.ReLU(),
            nn.Linear(400, 100),
            nn.BatchNorm1d(100),
            nn.Dropout(dropout),
            nn.ReLU(),
            nn.Linear(100, num_classes))

    def forward(self, x):
        return self.net(x)
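
For a smaller, self-contained illustration of the same pattern, here is a sketch of a two-layer network. `TinyMLP` and its sizes are hypothetical and not part of the course code.

```python
import torch
import torch.nn as nn

class TinyMLP(nn.Module):  # hypothetical example model
    def __init__(self, in_features=8, hidden=16, num_classes=3):
        super().__init__()
        # Layers assigned as attributes are registered automatically
        self.fc1 = nn.Linear(in_features, hidden)
        self.fc2 = nn.Linear(hidden, num_classes)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = TinyMLP()
x = torch.randn(4, 8)  # batch of 4 samples with 8 features each
logits = model(x)      # calling the model runs forward()
print(logits.shape)    # torch.Size([4, 3])

# fc1 has 8*16 + 16 parameters, fc2 has 16*3 + 3
num_params = sum(p.numel() for p in model.parameters())
print(num_params)      # 195
```

Note that you call `model(x)` rather than `model.forward(x)`; the former also runs any hooks PyTorch has registered.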

### Handling Datasets
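- Keeping the whole dataset in memory (or even on the GPU) can be a good or a bad idea: it avoids loading overhead, but only works if the data fits.
- One tensor dimension (by convention the first) is the batch dimension; all samples of a batch should be stacked into a single tensor.
- Process the entire batch at once by feeding the full tensor into your model.
- A `DataLoader` can load batches asynchronously in background worker processes (`num_workers > 0`) while the model trains.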
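
The batching idea can be sketched with `TensorDataset` and `DataLoader`. The data here is random and only stands in for real images; `num_workers=0` keeps the sketch simple, while a positive value would enable background loading.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Fake dataset: 100 RGB images of size 32x32 with labels from 10 classes
X = torch.randn(100, 3, 32, 32)
y = torch.randint(0, 10, (100,))

dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=0)

batch_sizes = []
for X_batch, y_batch in loader:
    # The first dimension is the batch dimension
    batch_sizes.append(X_batch.shape[0])

print(batch_sizes)  # [32, 32, 32, 4]: the last batch is smaller
```

If a smaller final batch is a problem (for example with batch norm), `drop_last=True` discards it.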

In [None]:
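def train_data_in_mem(model, optimizer, device, X, y):
    for epoch in tqdm(range(1, num_epochs + 1)):
        for i in range(0, X.size(0), batch_size):
            start_batch = time.time()
            # slice one batch out of the in-memory tensors
            # (.to(device) is a no-op if the data already lives on the device)
            X_batch = X[i:i + batch_size].to(device)
            y_batch = y[i:i + batch_size].to(device)

            z = model(X_batch)
            loss = F.cross_entropy(z, y_batch)

            optimizer.zero_grad(set_to_none=True)
            loss.backward()
            optimizer.step()

            # CUDA kernels run asynchronously; wait for them before stopping the clock
            if device.type == "cuda":
                torch.cuda.synchronize()
            end_batch = time.time()
            batch_time = int((end_batch - start_batch) * 1000)
        # batch_time is the duration of the last batch of the epoch
        print(f"epoch {epoch}, {batch_time}ms elapsed")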

In [None]:
device = torch.device("cuda:2")

train_data_raw = datasets.CIFAR10(
    root='data',
    train=True,
    download=True
)
# .data is a uint8 array of shape (N, 32, 32, 3); scale it to [0, 1] before normalizing
X = torch.tensor(train_data_raw.data, device=device, dtype=torch.float32) / 255
y = torch.tensor(train_data_raw.targets, device=device)
norm = torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
# reorder (N, H, W, C) -> (N, C, H, W), the layout Conv2d expects
X = norm(X.permute(0, 3, 1, 2))


model = Net(num_classes)
model = model.to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
model.train()

train_data_in_mem(model, optimizer, device, X, y)

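Use `nn.ModuleList` instead of a plain Python list if you need to store and iterate through modules; otherwise their parameters are not registered with the model. This is particularly important for `torch.compile` (see below).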
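
The difference in parameter registration can be shown with a small sketch. Both classes are hypothetical and exist only to contrast the two ways of storing layers.

```python
import torch.nn as nn

class PlainList(nn.Module):  # hypothetical: layers hidden in a plain Python list
    def __init__(self):
        super().__init__()
        self.layers = [nn.Linear(4, 4) for _ in range(3)]

class WithModuleList(nn.Module):  # hypothetical: same layers in an nn.ModuleList
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(3)])

    def forward(self, x):
        for layer in self.layers:  # iterating works like with a normal list
            x = layer(x)
        return x

n_plain = len(list(PlainList().parameters()))
n_modlist = len(list(WithModuleList().parameters()))
print(n_plain)    # 0: an optimizer would have nothing to update
print(n_modlist)  # 6: weight and bias of each of the three layers
```

With the plain list, `model.to(device)` would also silently skip the hidden layers.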

### The CPU-GPU Bottleneck

In [None]:
device = torch.device("cuda:2")

X,y = get_data(device)

model = Net(num_classes)
model = model.to(device)
model = torch.compile(model)
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
model.train()
for epoch in tqdm(range(1, num_epochs + 1)):
    for i in range(0, X.size(0), batch_size):
        z = model(X[i:i + batch_size])
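        # reading a GPU value on the CPU forces a device-to-host copy, so the CPU waits for the GPU every step
        a = int(X[0][0][0][0])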
        loss = F.cross_entropy(z, y[i:i + batch_size])

        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
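
One place where this kind of synchronization often sneaks in is loss logging. The following sketch contrasts two ways of summing per-batch losses; the random `losses` are stand-ins for real loss values, and whether the difference is measurable depends on the model and hardware.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Stand-ins for per-batch loss tensors living on the device
losses = [torch.rand((), device=device) for _ in range(100)]

# Variant 1: .item() copies each value to the CPU, so the CPU waits for the GPU every step
total_sync = 0.0
for loss in losses:
    total_sync += loss.item()

# Variant 2: accumulate on the device and transfer only once at the end
total_dev = torch.zeros((), device=device)
for loss in losses:
    total_dev += loss.detach()
total_once = total_dev.item()

print(total_sync, total_once)
```

Both variants compute the same sum (up to floating-point rounding), but the second one lets the CPU keep queueing GPU work instead of stalling on every batch.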

### Views and Copies

this needed?