# CUDA

Throughout the book, we've mostly been using PyTorch or tools built on top of it, such as fastai and Hugging Face transformers. When we first introduced it in this book, we pitched PyTorch as a low-level framework, where you build architectures and write training loops "from scratch" using your knowledge of linear algebra.

But PyTorch may not be the lowest level of abstraction you deal with in machine learning.

PyTorch's core is written largely in C++, and CUDA is an extension of that language. CUDA is self-described as a "programming model" that allows you to write code for Nvidia GPUs. When writing your C++ code, you include certain functions called "CUDA kernels" that perform a portion of the work on the GPU.

> Who's That Pokemon? CUDA Kernels

> A kernel is a function that is compiled for and designed to run on special accelerator hardware like GPUs (graphics processing units), FPGAs (field-programmable gate arrays), and ASICs (application-specific integrated circuits). Kernels are generally written by engineers who are very familiar with the hardware architecture, and are extensively tuned to perform a single task very well, such as matrix multiplication or convolution. CUDA kernels are kernels that run on devices that support CUDA, namely Nvidia's GPUs and accelerators.

PyTorch and many other deep learning frameworks rely on CUDA kernels to implement their backends, and then build a higher-level interface on top in a language like Python. This allows you to run super-fast, hand-tuned code that experts have spent years optimizing for specialized hardware, without having to think about memory, pointers, threads, and so on.

There are many other similar platforms, like AMD's ROCm and SYCL (an open standard from the Khronos Group), and, with AI hardware startups showing up in every nook and cranny, many more are on the way.

But CUDA is, by far, the most mature and well-developed GPU programming interface available today. In fact, it's mostly the reason that we're all forced to use Nvidia's GPUs: its software stack is just so much better than everyone else's, which makes it easier to develop libraries like PyTorch on top of it.

Unless you have the bandwidth, it's not always a great idea to look for kernel-level improvements. This is probably very low on the list of things you should do if your focus is on deploying an NLP application using existing tools and technology.

But it is useful to understand how such a critical component of the infrastructure that powers deep learning today works, and it's certainly interesting and fun. An understanding of some of the ideas in CUDA may also help you debug obscure errors in your deep learning framework, and can help you make more informed purchasing decisions for hardware.

## Threads and Thread Blocks

The fundamental atom of CUDA is the thread. A thread represents a single unit of execution of a computation. Every instruction that runs in a single thread will be executed sequentially. To get massive parallelism, CUDA devices usually have a lot of threads, which all run independently.

Crucially, communication between threads is hard (even on regular CPUs), and so we try to avoid it as much as possible. If you don't believe this, try to get a hundred people to agree on whether or not pineapples belong on pizza. It's hard, which is why CUDA attempts to sidestep the problem to a large degree, and is much better suited for problems that are embarrassingly parallel.

> Yes, "embarrassingly parallel" is a fairly widely accepted technical term that you'll likely hear in a few situations. In general, it means that the problem you're trying to solve is composed of multiple smaller tasks that don't depend on each other. This is true in deep learning, where we have natural parallelism across hyperparameter sets, samples in a training batch, and even across tokens in a sequence for transformers.
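
To make the idea concrete, here is a minimal CPU-only sketch in plain C++ (no GPU involved). The function names `relu` and `relu_in_order` are our own illustrative choices, not part of CUDA or PyTorch. Each output element depends only on its own input element, so the elements can be processed in any order, or all at once on different threads, and the result comes out the same.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// An independent per-element task: the result depends only on x.
float relu(float x) { return x > 0.0f ? x : 0.0f; }

// Applies relu to every element, visiting indices in the sequence given by
// `order` (assumed to be a permutation of 0..n-1). Because no element depends
// on any other, the visiting order cannot change the output. That property is
// what makes the problem embarrassingly parallel.
std::vector<float> relu_in_order(const std::vector<float>& input,
                                 const std::vector<std::size_t>& order) {
    assert(order.size() == input.size());
    std::vector<float> output(input.size(), 0.0f);
    for (std::size_t i : order) {
        assert(i < input.size());
        output[i] = relu(input[i]);
    }
    return output;
}
```

Calling `relu_in_order` on the same input with the orders `{0, 1, 2, 3}` and `{3, 1, 0, 2}` gives identical results. A running sum, by contrast, would not survive this reshuffling, because each output there depends on the one before it.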

Threads in CUDA are arranged into what are called blocks, which are themselves arranged into grids.
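
The hierarchy matters because each thread uses its position in it to decide which piece of data to work on. In a one-dimensional launch, CUDA exposes the built-in variables `blockIdx.x`, `blockDim.x`, and `threadIdx.x`, and a thread's global index is `blockIdx.x * blockDim.x + threadIdx.x`. The sketch below emulates that arithmetic on the CPU in plain C++. The `ThreadCoords` struct and the functions `global_index`, `blocks_needed`, and `add_vectors` are hypothetical stand-ins we made up for illustration, and the nested loops run sequentially where a GPU would run the threads in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for CUDA's blockIdx.x, blockDim.x, and threadIdx.x.
struct ThreadCoords {
    std::size_t block_idx;   // which block this thread belongs to
    std::size_t block_dim;   // number of threads per block
    std::size_t thread_idx;  // position of the thread inside its block
};

// Same arithmetic a 1D CUDA kernel uses to find "its" element.
std::size_t global_index(const ThreadCoords& t) {
    return t.block_idx * t.block_dim + t.thread_idx;
}

// Number of blocks required to cover n elements (ceiling division).
std::size_t blocks_needed(std::size_t n, std::size_t block_dim) {
    assert(block_dim > 0);
    return (n + block_dim - 1) / block_dim;
}

// CPU emulation of a grid launch: every (block, thread) pair handles at most
// one element of the output.
std::vector<float> add_vectors(const std::vector<float>& a,
                               const std::vector<float>& b,
                               std::size_t block_dim) {
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    std::vector<float> out(n, 0.0f);
    const std::size_t grid_dim = blocks_needed(n, block_dim);
    for (std::size_t block = 0; block < grid_dim; ++block) {
        for (std::size_t thread = 0; thread < block_dim; ++thread) {
            const std::size_t i = global_index({block, block_dim, thread});
            // Bounds guard: the last block may have threads with nothing to do.
            if (i < n) {
                out[i] = a[i] + b[i];
            }
        }
    }
    return out;
}
```

With 5 elements and 2 threads per block, `blocks_needed` returns 3, so the grid contains 6 threads and the bounds guard leaves the last one idle.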

## Writing CUDA Kernels

## CUDA in Practice

Writing CUDA kernels, profiling them, and tweaking your code can be fun, but you don't always need to work at this level of abstraction to extract the benefits of CUDA. The examples we showed you are much simpler than the CUDA code that is currently deployed in the real world.

In Python, when we want to do matrix multiplication, we look up the docs. Maybe there are a few syntax variations, like a.matmul(b), matmul(a,b), and a@b, but that's about it. We generally don't give these methods too much thought.

CUDA is on an entirely different plane of existence. There are multiple competing matrix multiplication algorithms, with complex heuristics for deciding which kernel to call in which scenarios. The implementation of matrix multiplication that's used can vary significantly depending on the shape of the matrices, memory bandwidth, and other hardware-specific details.
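
As a rough illustration of what "multiple algorithms plus a heuristic" means, here is a toy CPU sketch in plain C++. Everything in it is an assumption made for the example: the `Matrix` type, the functions `matmul_naive`, `matmul_tiled`, and `matmul`, the size threshold, and the tile size are all made up, and none of it reflects how any real CUDA library chooses a kernel. It only shows the shape of the idea: two implementations that compute the same product, and a dispatcher that picks one based on the matrix dimensions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major matrix type, for illustration only.
struct Matrix {
    std::size_t rows, cols;
    std::vector<float> data;
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}
    float& at(std::size_t r, std::size_t c) { return data[r * cols + c]; }
    float at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// Algorithm 1: the textbook triple loop.
Matrix matmul_naive(const Matrix& a, const Matrix& b) {
    assert(a.cols == b.rows);
    Matrix out(a.rows, b.cols);
    for (std::size_t i = 0; i < a.rows; ++i)
        for (std::size_t j = 0; j < b.cols; ++j)
            for (std::size_t k = 0; k < a.cols; ++k)
                out.at(i, j) += a.at(i, k) * b.at(k, j);
    return out;
}

// Algorithm 2: the same product computed tile by tile, so that each small
// block of the inputs is reused while it is still in fast memory.
Matrix matmul_tiled(const Matrix& a, const Matrix& b, std::size_t tile) {
    assert(a.cols == b.rows && tile > 0);
    Matrix out(a.rows, b.cols);
    for (std::size_t i0 = 0; i0 < a.rows; i0 += tile)
        for (std::size_t k0 = 0; k0 < a.cols; k0 += tile)
            for (std::size_t j0 = 0; j0 < b.cols; j0 += tile)
                for (std::size_t i = i0; i < std::min(i0 + tile, a.rows); ++i)
                    for (std::size_t k = k0; k < std::min(k0 + tile, a.cols); ++k)
                        for (std::size_t j = j0; j < std::min(j0 + tile, b.cols); ++j)
                            out.at(i, j) += a.at(i, k) * b.at(k, j);
    return out;
}

// Toy heuristic: small problems use the simple loop, larger ones use tiling.
// The threshold (64) and tile size (32) are arbitrary, made-up values.
Matrix matmul(const Matrix& a, const Matrix& b) {
    const std::size_t largest = std::max({a.rows, a.cols, b.cols});
    if (largest <= 64) {
        return matmul_naive(a, b);
    }
    return matmul_tiled(a, b, 32);
}
```

Both paths return the same product for the same inputs. What differs is the memory access pattern, and that is the kind of detail production libraries tune per shape and per device.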

Thankfully, there's a slightly better abstraction layer for general-purpose GPU code: CUDA libraries. These include cuBLAS, cuFFT, cuDNN, cuSPARSE, and more. The CUDA libraries contain highly optimized implementations of the most common algorithms you might want to run on a GPU, like convolution, Fourier transforms, matrix multiplication, and more.

There's also the PyTorch C++ library, libtorch, which provides even higher-level primitives like torch::Tensor. PyTorch C++ code looks surprisingly similar to PyTorch code in Python. Here's an example from the official guide to custom extensions (https://oreil.ly/IEdxH). It implements the forward pass of a recurrent layer that the PyTorch documentation calls long-long-term memory (LLTM):

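```cpp
#include <torch/extension.h>  // provides torch::Tensor and the tensor operators

#include <vector>

// Forward pass of the LLTM cell. Returns the new hidden and cell states,
// followed by the intermediate values a backward pass would need.
std::vector<torch::Tensor> lltm_forward(
    torch::Tensor input,
    torch::Tensor weights,
    torch::Tensor bias,
    torch::Tensor old_h,
    torch::Tensor old_cell
) {
    // Concatenate the previous hidden state and the input along the feature axis.
    auto X = torch::cat({old_h, input}, /*dim=*/1);

    // One fused matrix multiply produces the pre-activations for all three gates.
    auto gate_weights = torch::addmm(bias, X, weights.transpose(0, 1));
    auto gates = gate_weights.chunk(3, /*dim=*/1);

    auto input_gate = torch::sigmoid(gates[0]);
    auto output_gate = torch::sigmoid(gates[1]);
    auto candidate_cell = torch::elu(gates[2], /*alpha=*/1.0);

    // Update the cell state, then derive the new hidden state from it.
    auto new_cell = old_cell + candidate_cell * input_gate;
    auto new_h = torch::tanh(new_cell) * output_gate;

    return {
        new_h,
        new_cell,
        input_gate,
        output_gate,
        candidate_cell,
        X,
        gate_weights
    };
}
```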
This is still much higher-level than the pointer manipulation you'll do in CUDA, but it can actually be very useful if you need to implement a new custom layer and find that cobbling it together in Python incurs a significant performance penalty. Simply writing your layers with libtorch, linking them into Python, and using that version instead can produce a noticeable speed improvement, and this is an optimization that may well be worth your time.

If you want to take the first steps toward writing low-level GPU code in practice, but don't want to burn your precious hours trying to figure out what the most efficient access pattern is for a half-precision Fourier transform in shared memory, CUDA libraries and libtorch are wonderful tools that you can use as you craft your next NLP creation.