# How to write PyTorch code that isn't (too) bad

This is a guide to avoiding some common pitfalls, not to writing the fastest code possible (you probably wouldn't have time for that anyway).

### Motivation

- Compute is an expensive shared resource.
- Wasting compute means literally just creating entropy (__CLIMATE CHANGE IS A THING__).
- Other people also have projects they want to do.

### Some online guides that go more in-depth

- [Official Performance Tuning Guide](https://docs.pytorch.org/tutorials/recipes/recipes/tuning_guide.html)
- [Official PyTorch Guide on CUDA](https://docs.pytorch.org/docs/stable/notes/cuda.html)

### A Note on training on more than one GPU

Long story short, you probably don't need to.

#### But I want my model to be faster

PyTorch supports training on more than one GPU with [Distributed Data Parallel](https://docs.pytorch.org/tutorials/intermediate/ddp_tutorial.html). Implementing it can be fairly straightforward, depending on the use case.

_However_, training performance usually scales sublinearly with GPU count: doubling the number of GPUs gives you less than double the throughput. The gain is likely not going to be that big, especially considering the increased usage of limited shared resources.

Training on multiple GPUs introduces significant, unavoidable overhead, because gradients have to be synchronized between the devices on every step. Unless you already know what you are doing, the effort is likely not worth it for student projects.
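
To get a feel for what sublinear scaling means, here is a deliberately simplified back-of-the-envelope model. It assumes that the compute part of a step splits perfectly across GPUs and that synchronization adds a fixed cost per step. The function name `ddp_speedup` and all numbers are made up for illustration; they are not measurements.

```python
def ddp_speedup(n_gpus: int, compute_s: float, comm_s: float) -> float:
    """Speedup over a single GPU under a toy model (an assumption, not a measurement):
    compute splits perfectly across GPUs, communication adds a fixed cost per step."""
    if n_gpus == 1:
        return 1.0  # a single GPU needs no gradient synchronization
    step_time = compute_s / n_gpus + comm_s
    return compute_s / step_time


# Hypothetical step: 1.0 s of compute, 0.1 s of gradient synchronization
for n in (1, 2, 4, 8):
    print(f"{n} GPUs: {ddp_speedup(n, 1.0, 0.1):.2f}x")
```

Even in this optimistic toy model, 8 GPUs end up well below an 8x speedup, while occupying 8x the hardware.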

#### But my model doesn't fit in one GPU

There are models (especially modern LLMs) that require more than the 50-80 GB of VRAM a typical datacenter GPU can offer. If you need to work with a model like that, multi-GPU might be hard to avoid. Luckily for you, most libraries centered around these models (like [transformers](https://huggingface.co/docs/transformers/index)) will handle the hard work for you. Consult their documentation.

If you are working with large LLMs via Transformers, consider using a [Quantized Model](https://huggingface.co/docs/transformers/quantization/overview) and consult their [documentation](https://huggingface.co/docs/transformers/v5.0.0rc2/en/llm_tutorial_optimization#1-lower-precision) for other ways of reducing memory requirements, such as lowering precision or paging parts of the model. Multi-GPU should be treated as a last resort.
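
Why lower precision helps is simple arithmetic: the weights alone take (number of parameters) x (bytes per parameter). The sketch below uses a hypothetical 7-billion-parameter model and counts only the weights. It ignores activations, optimizer state and any bookkeeping overhead a real quantization scheme adds, so treat the numbers as a lower bound.

```python
# Bytes needed to store one parameter in each format
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "int8": 1.0, "4-bit": 0.5}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


n_params = 7e9  # hypothetical model size, for illustration only
for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gb(n_params, dtype):5.1f} GB")
```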

### We now switch to a bigger Model to better see the speedups

(The code below is based on the aforementioned GPT2 from scratch tutorial.)

In [None]:
from src.gpt import GPT, GPTConfig
from src.dataloader import DataLoader
import tiktoken
import math
import torch
import time

total_batch_size = 524288
B, T = 16, 1024
max_lr = 6e-4
min_lr = max_lr * 0.1
warmup_steps = 10
max_steps = 50

grad_accum_steps = total_batch_size // (B * T)

device = "cuda:2"

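def get_lr(it):
    # 1) Linear warmup for the first `warmup_steps` iterations
    if it < warmup_steps:
        return max_lr * (it + 1) / warmup_steps

    # 2) Past the end of the schedule: stay at the minimum learning rate
    elif it > max_steps:
        return min_lr

    # 3) In between: cosine decay from max_lr down to min_lr
    else:
        decay_ratio = (it - warmup_steps) / (max_steps - warmup_steps)
        assert 0 <= decay_ratio <= 1
        coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes from 1 to 0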
        return min_lr + coeff * (max_lr - min_lr)
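
Note that `get_lr` only computes a number; it has no effect until that number is written into the optimizer. A minimal sketch of how that could look is given here, with the schedule repeated so the block runs on its own. The tiny `torch.nn.Linear` model is a stand-in and not the GPT model used in this guide.

```python
import math

import torch

max_lr = 6e-4
min_lr = max_lr * 0.1
warmup_steps = 10
max_steps = 50


def get_lr(it):
    if it < warmup_steps:  # linear warmup
        return max_lr * (it + 1) / warmup_steps
    if it > max_steps:  # past the schedule: stay at the minimum
        return min_lr
    # cosine decay from max_lr down to min_lr
    decay_ratio = (it - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (max_lr - min_lr)


model = torch.nn.Linear(4, 1)  # stand-in model (assumption)
optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr)

applied = []
for step in range(max_steps):
    lr = get_lr(step)
    for param_group in optimizer.param_groups:
        param_group["lr"] = lr  # the optimizer reads this value in its next step()
    applied.append(lr)
    # ... forward pass, backward pass and optimizer.step() would go here ...
```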

In [None]:
model = GPT(GPTConfig(vocab_size=50304))
model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr)
enc = tiktoken.get_encoding('gpt2')
train_loader = DataLoader(B, T, 1, 1)


def train(it):
    optimizer.zero_grad()

    x, y = train_loader.next_batch()
    x, y = x.to(device), y.to(device)

    logits, loss = model(x, y)

    loss.backward()
    optimizer.step()

    # Return a detached copy so the caller can log the loss
    return loss.detach()


start = time.time()
avg_batch_time = 0
for i in range(max_steps):
    start_batch = time.time()

    loss = train(i)

    # CUDA kernels run asynchronously; wait for them before stopping the timer
    torch.cuda.synchronize()
    end_batch = time.time()
    batch_time = int((end_batch - start_batch) * 1000)
    print(f"step {i + 1}, loss: {loss.item():.4f}, {batch_time}ms elapsed")
    avg_batch_time += batch_time
end = time.time()
print(f"{int((end - start) * 1000)}ms elapsed, {avg_batch_time / max_steps}ms avg batch time")
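
The `torch.cuda.synchronize()` call in the loop matters because CUDA kernels are launched asynchronously: without it, the timer would stop while the GPU is still working. A minimal sketch of a reusable timing helper follows; the name `timed` is hypothetical, and the helper falls back to plain wall-clock timing on machines without a GPU.

```python
import time

import torch


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed milliseconds), waiting for pending GPU work."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # do not count work queued before the timer starts
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait until the kernels launched by fn have finished
    return result, (time.perf_counter() - start) * 1000


device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(256, 256, device=device)
product, ms = timed(torch.matmul, a, a)
print(f"matmul took {ms:.2f}ms")
```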

### Enable Tensor Cores (if applicable)

Your GPU might not support this (NVIDIA GPUs for datacenters like KIGS or the HPC in Erlangen do). Most libraries that implement models (e.g. transformers) will either do this by default or let you enable it through them.
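
If you are unsure what your GPU supports, you can ask PyTorch. The sketch below is one possible check; the helper name `tensor_core_report` is made up. It relies on the fact that TF32 Tensor Cores arrived with compute capability 8.0 (Ampere). Depending on the PyTorch version, `torch.cuda.is_bf16_supported()` may also report `True` for emulated support, so treat the result as a hint.

```python
import torch


def tensor_core_report():
    """Summarize whether the current device can use TF32/bfloat16 Tensor Cores."""
    if not torch.cuda.is_available():
        return {"cuda": False, "tf32": False, "bf16": False}
    major, _minor = torch.cuda.get_device_capability()
    return {
        "cuda": True,
        "name": torch.cuda.get_device_name(),
        # TF32 Tensor Cores were introduced with compute capability 8.0 (Ampere)
        "tf32": major >= 8,
        "bf16": bool(torch.cuda.is_bf16_supported()),
    }


print(tensor_core_report())
```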

The idea is to shorten the mantissa of a 32-bit floating point number from 23 bits to 10 (TensorFloat-32) or 7 (bfloat16) while keeping the exponent length the same. This preserves the range of representable values; only precision is lost.
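
The trade-off between range and precision can be inspected with `torch.finfo`. This is a small sketch that runs on the CPU; TensorFloat-32 is not a tensor dtype in PyTorch, so only float32, bfloat16 and float16 (for contrast, since it has a shorter exponent) are shown.

```python
import torch

for dtype in (torch.float32, torch.bfloat16, torch.float16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):>15}: eps={info.eps:.3e}, max={info.max:.3e}")

# bfloat16 keeps 7 mantissa bits, so near 1.0 the spacing between values is 2**-7
x = torch.tensor(1.0 + 2**-10)  # exactly representable in float32
x_bf16 = x.to(torch.bfloat16)   # the small offset is rounded away

# A large float32 value survives in bfloat16 but overflows in float16
big = torch.tensor(1e38)
big_bf16 = big.to(torch.bfloat16)
big_fp16 = big.to(torch.float16)
print(x.item(), x_bf16.item(), big_bf16.item(), big_fp16.item())
```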

In [None]:
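# A later call overrides an earlier one, so keep only the setting you want.
torch.set_float32_matmul_precision('high')    # TF32 for float32 matmuls
torch.set_float32_matmul_precision('medium')  # bfloat16 for float32 matmuls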

model = GPT(GPTConfig(vocab_size=50304))
model.to(device)


optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr)
enc = tiktoken.get_encoding('gpt2')
train_loader = DataLoader(B, T, 1, 1)


def train(it):
    optimizer.zero_grad()

    x, y = train_loader.next_batch()
    x, y = x.to(device), y.to(device)

    with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
        logits, loss = model(x, y)

    loss.backward()
    optimizer.step()


start = time.time()
avg_batch_time = 0
for i in range(max_steps):
    start_batch = time.time()

    train(i)

    torch.cuda.synchronize()
    end_batch = time.time()
    batch_time = int((end_batch - start_batch) * 1000)
    print(f"step {i + 1}, {batch_time}ms elapsed")
    avg_batch_time += batch_time
end = time.time()
print(f"{int((end - start) * 1000)}ms elapsed, {avg_batch_time / max_steps}ms avg batch time")
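
What `torch.autocast` does can be checked on a small layer: the parameters stay in float32, but the matrix multiplication inside the context runs in bfloat16. The following sketch uses a stand-in `torch.nn.Linear` layer and picks the CPU when no GPU is available.

```python
import torch

device_type = "cuda" if torch.cuda.is_available() else "cpu"
layer = torch.nn.Linear(8, 4).to(device_type)  # stand-in layer (assumption)
x = torch.randn(2, 8, device=device_type)

with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
    out = layer(x)  # the matmul inside runs in bfloat16

print(layer.weight.dtype)  # the parameters themselves are not converted
print(out.dtype)           # dtype produced inside the autocast region
```

Only the forward pass and the loss computation need to be inside the context, which is why `loss.backward()` and `optimizer.step()` sit outside it in the cell above.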

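#### What is a Tensor Core?

Tensor Cores are dedicated units on recent NVIDIA GPUs that multiply small matrix tiles and accumulate the result in a single operation. For deep learning workloads they operate on reduced-precision inputs such as TF32, bfloat16 or float16, which is why the lower-precision settings above are needed to make use of them.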


### torch.compile is really neat

Further reading:

- [torch.compile programming model (official docs)](https://docs.pytorch.org/docs/stable/compile/programming_model.html)
- https://www.youtube.com/live/rew5CSUaIXg
- https://docs.google.com/document/d/1y5CRfMLdwEoF1nTk9q8qEu1mgMUuUtvhklPKJ2emLU8/edit?tab=t.0#heading=h.ivdr7fmrbeab

You might need to install Triton, though. It can usually be installed like any other Python package, and a [Windows version](https://github.com/woct0rdho/triton-windows) now exists as well.


In [None]:
model = GPT(GPTConfig(vocab_size=50304))
model.to(device)

model = torch.compile(model)

optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr)
enc = tiktoken.get_encoding('gpt2')
train_loader = DataLoader(B, T, 1, 1)


def train(it):
    optimizer.zero_grad()

    x, y = train_loader.next_batch()
    x, y = x.to(device), y.to(device)

    with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
        logits, loss = model(x, y)

    loss.backward()
    optimizer.step()


start = time.time()
avg_batch_time = 0
for i in range(max_steps):
    start_batch = time.time()

    train(i)

    torch.cuda.synchronize()
    end_batch = time.time()
    batch_time = int((end_batch - start_batch) * 1000)
    print(f"step {i + 1}, {batch_time}ms elapsed")
    avg_batch_time += batch_time
end = time.time()
print(f"{int((end - start) * 1000)}ms elapsed, {avg_batch_time / max_steps}ms avg batch time")

Some notes on using `torch.compile`:

- For inference you can compile your model directly; for training it is best practice to compile a train function instead.
- The `mode="reduce-overhead"` argument can help with small models or batches (try it out for each use case).
- Avoid graph breaks: keep non-PyTorch code out of the compiled function.
- Avoid in-place operations where possible; AOTAutograd sometimes has issues with them.
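
Compiling a train function instead of the model could look roughly like the sketch below. The helper name `make_train_step`, the MSE loss and the stand-in `torch.nn.Linear` model are assumptions for illustration. Depending on the PyTorch version, the `backward()` call inside the compiled function may itself cause a graph break; the function still runs correctly in that case. Compilation is switched off in the demo so that the block also runs on machines without a working compiler backend.

```python
import torch
import torch.nn.functional as F


def make_train_step(model, optimizer, use_compile=True, **compile_kwargs):
    """Build a function that runs one optimization step, optionally compiled."""

    def train_step(x, y):
        optimizer.zero_grad()
        loss = F.mse_loss(model(x), y)  # only PyTorch ops in here, no print or NumPy
        loss.backward()
        optimizer.step()
        return loss.detach()

    if use_compile:
        # compile_kwargs is passed through, e.g. mode="reduce-overhead"
        return torch.compile(train_step, **compile_kwargs)
    return train_step


torch.manual_seed(0)
model = torch.nn.Linear(8, 1)  # stand-in model (assumption)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Set use_compile=True on a machine with a working compiler backend (e.g. Triton)
train_step = make_train_step(model, optimizer, use_compile=False)

x, y = torch.randn(32, 8), torch.randn(32, 1)
losses = [train_step(x, y).item() for _ in range(20)]
print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

The first call of a compiled function is slow because that is when compilation happens, so exclude it when you measure step times.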


### Multithreading is very simple and you should probably use it

- Example: a pipeline that processes an image time series.
	- Real-world use case: the steps are sequential and each frame should be processed as fast as possible.
	- A naive training approach follows the same logic: preprocess the first step on the CPU, run the CNN on the first frame, run the first RNN step, preprocess the second step, run the CNN on the second frame, run the second RNN step, and so on.
	- Better: batch the CNN step over all frames and use multithreading for the preprocessing, so the GPU does not sit idle.
	- This is obvious when you think about it, but PyTorch makes it easy not to do it.
	- You can use normal multithreading for standard Python code.
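
A minimal sketch of the idea with nothing but the standard library: a background thread prepares the next items while the main thread works on the current one. The names `prefetch` and `preprocess` are made up, and `time.sleep` stands in for the real preprocessing and GPU work. Threads only help when the work releases Python's global interpreter lock, which is the case for I/O and for most heavy NumPy and PyTorch operations.

```python
import queue
import threading
import time

_DONE = object()  # sentinel that marks the end of the stream


def prefetch(batches, max_prefetch=2):
    """Yield items from `batches` while a background thread prepares the next ones."""
    q = queue.Queue(maxsize=max_prefetch)

    def worker():
        try:
            for batch in batches:
                q.put(batch)  # blocks while the queue is full
        finally:
            q.put(_DONE)  # always signal the end, even if preprocessing failed

    threading.Thread(target=worker, daemon=True).start()
    while True:
        batch = q.get()
        if batch is _DONE:
            return
        yield batch


def preprocess(i):
    time.sleep(0.01)  # stand-in for CPU-side preprocessing of frame i
    return i * i


results = []
for batch in prefetch(preprocess(i) for i in range(5)):
    time.sleep(0.01)  # stand-in for the GPU step on this batch
    results.append(batch)
print(results)
```

Because the generator is consumed inside the worker thread, preprocessing of the next frame overlaps with the simulated GPU step on the current one, and the order of the results is preserved.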

### Nice Values for Constants

Make certain constants multiples of a power of two (e.g. 8, 64 or 128). This sounds super esoteric, but actually works in my experience:
- GPU memory is accessed in fixed-size blocks.
- Accessing memory is faster if the data fits neatly into those blocks.
- The block sizes are powers of two.
- Any constant that determines the size of a tensor stored in GPU memory should therefore be a multiple of a power of two.

This includes the batch size and neural network layer sizes. The `vocab_size=50304` used above is an example: it is the GPT-2 vocabulary size of 50257 rounded up to the next multiple of 128.
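
Rounding a size up to a "nice" value is a one-liner. The helper name `round_up` is hypothetical, and which multiple works best (8, 64, 128, ...) depends on the hardware and is worth measuring.

```python
def round_up(n: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is greater than or equal to n."""
    return ((n + multiple - 1) // multiple) * multiple


print(round_up(50257, 128))  # GPT-2 vocabulary size padded to a multiple of 128 -> 50304
print(round_up(1000, 64))    # -> 1024
```

Padding a dimension like this adds a few unused entries (for a vocabulary, tokens that never occur), which is usually a small price for the faster kernels.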


### Avoiding Computation of Unnecessary Gradients

Gradients are only needed when you actually plan to use them for training. They should be disabled otherwise.

In [None]:
device = torch.device("cuda:2")

model = Net(num_classes)
model = model.to(device)
model.eval()
X, y = get_data(device)

During inference, instead of this:

In [None]:
y_hat = model(X)

do this:

In [None]:
with torch.no_grad():
    y_hat = model(X)
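
The difference is visible on the output tensor. This is a small self-contained check, with a `torch.nn.Linear` layer standing in for a trained model:

```python
import torch

model = torch.nn.Linear(8, 2)  # stand-in for a trained model (assumption)
model.eval()
X = torch.randn(4, 8)

y_tracked = model(X)  # records a computation graph for a later backward()
with torch.no_grad():
    y_hat = model(X)  # no graph is recorded, which saves memory and time

print(y_tracked.requires_grad, y_tracked.grad_fn is not None)
print(y_hat.requires_grad, y_hat.grad_fn is None)
```

Note that `model.eval()` and `torch.no_grad()` do different things: the first switches layers such as dropout and batch norm to inference behavior, the second turns off gradient tracking. For inference you normally want both.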

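Two related tools:

- Call `.detach()` on any tensor you keep around outside of the training step (e.g. a loss value you log). Otherwise the tensor keeps its whole computation graph alive.
- Set `requires_grad` to `False` on parameters you do not want to train (e.g. frozen layers), so no gradients are computed for them.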
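
A minimal sketch of both `.detach()` and `requires_grad` in action. The two-layer `torch.nn.Sequential` model and the squared-output loss are stand-ins chosen for illustration.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(8, 8),
    torch.nn.ReLU(),
    torch.nn.Linear(8, 1),
)

# Freeze the first layer: no gradients are computed or stored for its parameters
for p in model[0].parameters():
    p.requires_grad_(False)

loss = model(torch.randn(4, 8)).pow(2).mean()
loss.backward()

logged = loss.detach()     # same value, but no longer attached to the graph
history = [logged.item()]  # a plain Python float is the safest thing to store

print(model[0].weight.grad)              # None, because the layer is frozen
print(model[2].weight.grad is not None)  # the trainable layer did get a gradient
```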