# Introduction to GPU Programming

Wim R.M. Cardoen

Center of High-Performance Computing (CHPC)
University of Utah

October 14, 2024

#### Outline I

- Motivation
- Hardware
  - Streaming multiprocessor (SM)
  - Warps
- Software
  - GPGPU & CUDA
  - Structure of a GPU computation
  - Case study: matrix multiplication
  - Building CUDA applications & useful env. variables
  - Profiling & debugging
  - Important CUDA libraries
  - Alternatives to CUDA
  - Links
- Use of GPUs at the CHPC
  - GPUs available at CHPC
    - Regular env.: lp/kp/np/grn clusters

#### Outline II

- Protected env.: redwood cluster
- How to access the GPUs at CHPC

### Motivation

## Theoretical GFLOP/s: GPU vs. CPU



Figure: Theoretical GFLOP/s: GPUs vs. CPUs.<sup>a</sup>

5 / 47

 $<sup>^</sup>a https://docs.nvidia.com/cuda/archive/9.1/pdf/CUDA\_C\_Programming\_Guide.pdf$ 

# CPU processor trend (last 50 years)



 After the year 2000, freq./power for a single CPU core reaches a max. (Heat dissipation!).

# Energy efficiency per job: GPU vs. CPU



Figure: Energy efficiency per job (NERSC).<sup>a</sup>

<sup>&</sup>lt;sup>a</sup>https://blogs.nvidia.com/blog/gpu-energy-efficiency-nersc/ (05/21/2023)

### Hardware

# Streaming Multiprocessor (SM)

- GPU device connected to the CPU by a PCle bus.
- each GPU device contains an array (x) of Streaming Multiprocessors (SM).
- each SM has:
  - a Single-Instruction Multiple-Thread (SIMT) Architecture.
  - contains y regular cores and [z tensor cores].
- scalable: newer generations: increase of x, y and [z], e.g.:
  - NVIDIA A100-PCIE-40GB (notch293)
    - global memory: 40 GB.
    - 108 SMs, 64 Cores/SM, 4 Tensor Cores/SM.
    - GPU Max. Clock Rate: 1.41 GHz.
  - NVIDIA H100 SXM5 NVL (grn008)
    - global memory: 93 GB.
    - 132 SMs, 128 Cores/SM, 4 Tensor Cores/SM.
    - GPU Max. Clock Rate: 1.78 GHz.

#### NVIDIA GH100 SM



Figure: GH100 SM

#### NVIDIA GH100 Full Device



Figure: NVIDIA GH100 Full Device (144 SMs).

# GPU Threads - Warps

- Each SM:
  - generates, schedules, executes threads in batches of 32 threads.
  - WARP: a batch of 32 threads
- each thread in a WARP executes the same instructions but runs its own "path".
- if threads within a WARP diverge, the threads become inactive/disabled.

### Software

#### GPGPU & CUDA

- GPU (Graphic Processing Unit): orginally developed for graphical applications.
- GP-GPU: General-Purpose GPU, i.e.
  the use of GPUs beyond graphical applications.
   CAVEAT: problem to be reformulated in terms of the graphics API.
- 2007: NVIDIA introduces the CUDA<sup>1</sup> framework (Compute Unified Device Architecture)
  - CUDA API: extension of the C language.
  - handles the GPU thread level parallelism.
  - deals with moving data between CPU and GPU.
  - also support for C++, Fortran and Python.

- CUDA Driver
- CUDA Toolkit (nvcc,nvprof, ..., libraries, header files).

<sup>&</sup>lt;sup>1</sup>The CUDA Toolkit consists of 2 parts:

### Structure of a GPU computation

- Allocate memory space on the GPU device.
- 2 Transfer the data from the CPU to the GPU device.
- Perform the calculation on the GPU device.
  - kernel: function executed on the GPU.
  - To enhance performance: keep data as long as possible on the GPU device.
- Transfer the result back from the GPU device to the CPU.
- Deallocate memory space on the GPU device.

Note: source code & makefile available in ./src

# Alloc. & free of global memory on the GPU

- cudaError\_t
   CUDA Error types.
- cudaError\_t cudaMalloc(void \*\*devPtr, size\_t size)
  Allocates memory on the device.
- cudaError\_t cudaFree(void \*devPtr)
  Frees memory on the device.

```
double *M.d, *N.d, *P.d; // Pointers (device)
int const SZ=16;
int const SZ=SZ*SZ;

// Allocate M on device (M.d)
if(cudaMalloc(&M.d, sizeof(double)*SZ2) != cudaSuccess)
{
    printf(" ERROR: alloc vector M on DEVICE \n");
    return 1;
}

// Deallocate matrix M.d
if(cudaFree(M.d) != cudaSuccess)
{
    printf(" ERROR: unable to deallocate M.d (DEVICE)\n");
    return 1;
}
```

Listing 1: Alloc/Free extract

# Copy data between host (CPU) and device (GPU)

Copy data bewteen host (CPU) and device (CPU)

- Direction (kind):
  - cudaMemcpyHostToHost
  - cudaMemcpyHostToDevice
  - $\bullet \ \mathtt{cudaMemcpyDeviceToHost} \\$
  - $\bullet \ \, {\tt cudaMemcpyDeviceToDevice}$

Listing 2: cudaMemcpy extract

#### **CUDA Kernel**

- CUDA kernel: alias for a function which may run on a GPU device.
- Kernel declaration syntax:

```
funcspec void kernelName(args){ body }
where:
```

- funcspec: function type qualifier, i.e. \_\_global\_\_,\_host\_\_,\_device\_\_
- kernelName: name of the kernel/CUDA function.
- args: argument list of the kernel/CUDA function.
- body: body of the kernel/CUDA function (your code).
- Kernel call syntax:

kernelName<<<gridSize,blockSize>>>(args)

#### where:

- gridSize: size of the grid of thread blocks.
- blockSize: size of a thread block.



# Function type qualifiers

| Qualifier | Called from | Executed on |
|-----------|-------------|-------------|
| global    | host        | device      |
| host      | host        | host        |
| device    | device      | device      |

Table: Function type qualifiers

#### Note:

 You can have to different versions of a function i.e.: one with \_\_host\_\_ & one with \_\_device\_\_

### Grids, Blocks and Threads

We have a hierarchical (software) implementation.

- uint3,dim3:
  - CUDA defined structures of unsigned integer x,y,z
  - dim3: based on uint3 but unspecified components are initialized to 1.
- Grid: each Grid consists of Blocks
  - dim3 gridDim: dimensions of the Grid.
  - uint3 blockIdx: block index within the Grid.
- Block: each Block consists of Threads
  - dim3 blockDim: dimensions of the Block
  - uint3 threadIdx: thread index within the block.

# Matrix Mul.: kernel (v. 1)

#### **Case study**: $P = M \times N$ where $P, M, N \in \mathbb{R}^{n \times n}$

Listing 3: Kernel (v. 1)

# Invoking kernel (v. 1)

#### Invoking 1 Block of Threads

```
int main(void)
{
   int const SZ=16;
   // ...
   // Invoke Kernel to generate P=MxN
   dim3 dimBlock(SZ,SZ,1);
   dim3 dimGrid(1,1,1);
   MatrixMulKernel1<<<<di>dimBlock>>>(M_d,N_d,P_d,SZ);
   //..
}
```

Listing 4: Invoking kernel (v. 1)

## Mat. Mul (v.2): Grid of 2D Blocks

- int tx = blockIdx.x\*blockDim.x + threadIdx.x;
- int ty = blockIdx.y\*blockDim.y + threadIdx.y;



Figure: 2D-Grid of 2D-Blocks of Threads

# Matrix Mul.: visualization (v. 2)



Figure: Matrix Mul. (2D Grid)

# Matrix Mul.: kernel (v. 2)

Listing 5: Kernel (v. 2)

# Invoking kernel (v. 2)

• Invoking a grid of blocks of threads

```
int main(void)
{
    int const SZ=500;

    // ..
    int const THREADX=16;
    int const THREADY=16;
    int const THREADY=16;
    dim3 dimBlock(THREADX, THREADY, 1);
    int numBlocksX=(SZ%THREADX==0 ? SZ/THREADX : SZ/THREADX +1);
    int numBlocksY=(SZ%THREADY==0 ? SZ/THREADY : SZ/THREADY +1);
    dim3 dimGrid(numBlocksX, numBlocksY, 1);

MatrixMulKernel2

    MatrixMulKernel2

    // ..
}
```

Listing 6: Invoking kernel (v. 2)

# Types of GPU memory

- global memory: (largest, slowest and often the bottleneck).
- shared memory: allocated per thread block & low latency
- constant memory: cached, read-only
- registers: fast, on-chip memory (exclusive to each thread).

# Matrix Mul.: use of shared memory (v. 3)



Figure: Matrix Mul.: use of shared memory

# Matrix Mul.: kernel (v. 3) - use of shared memory

```
#include <mul.h>
#define WIDTH 16
__global__ void MatrixMulKernel3(double *M_d, double *N_d, double *P_d, int const SZ)
    int tx = blockldx.x * blockDim.x + threadIdx.x;
    int tv = blockldx.v * blockDim.v + threadIdx.v:
    __shared__ double M_s[WIDTH][WIDTH];
    __shared__ double N_s[WIDTH][WIDTH];
    double Pval = 0.0:
    int nslices=(SZ%WIDTH==0)?(SZ/WIDTH):(SZ/WIDTH+1);
    for(int islice=0: islice < nslices: islice++)
        M_s[threadIdx.x][threadIdx.y]=M_d[tx*SZ + islice*WIDTH + threadIdx.y];
        N_s[threadIdx.x][threadIdx.v] = N_d[islice*WIDTH*SZ + threadIdx.x*SZ + tv]:
        __syncthreads();
        for (int k=0: k<WIDTH: k++)
             Pval += M_s[threadIdx.x][k]* N_s[k][threadIdx.y];
        __syncthreads();
    if(tx < SZ \&\& ty < SZ)
       P_d[tx*SZ+ty] = Pval;
    return:
```

Listing 7: Kernel (v. 3)

# Building/Compiling CUDA applications

#### General scheme:

- Source code for CUDA applications:
  - C/C++ host code with extensions to deal with the device(s).
  - Other programming languages are allowed e.g. Fortran
- Primo: separate device functions from host code.
- Device code: preprocessing, compilation with the NVIDIA compiler (nvcc).
- Host code: preprocessed, compiled with a host (C/C++) compiler
   e.g. (gcc, g++, icc, icpc, ...)
- Compiled device functions are embedded as fatbinary images in the host object file.
- Linking stage: adding CUDA runtime libraries to the host object file to create an executable.

## Further concepts

- .cu : Suffix for CUDA source file (host code (C,C++) & device code).
- .cuf: Suffix for CUDA source file (host code (Fortan) & device code).
- .ptx: Suffix for Parallel Thread Execution (PTX) files. An intermediate representation (similar to assembly for a virtual GPU architecture<sup>2</sup>
- .cubin: Suffix for the CUDA device binary file pertaining to a real GPU architecture<sup>3</sup>
- fatbin: Multiple PTX [& cubin] files are merged into a fatbin file.

<sup>&</sup>lt;sup>2</sup>Virtual architectures bear the "compute\_" prefix e.g. "compute\_70".

<sup>&</sup>lt;sup>3</sup>Real (physical) architectures bear the "sm\_" prefix e.g. "sm\_70".

#### Some recent CUDA Architectures

| Architecture | Year | compute_   | $sm_{-}$   |
|--------------|------|------------|------------|
| Maxwell      | 2014 | 50, 52, 53 | 50, 52, 53 |
| Pascal       | 2016 | 60, 61, 62 | 60, 61, 62 |
| Volta        | 2017 | 70, 72     | 70, 72     |
| Turing       | 2018 | 75         | 75         |
| Ampere       | 2020 | 80, 86, 87 | 80, 86, 87 |
| Ada Lovelace | 2022 | 89         | 89         |
| Hopper       | 2022 | 90, 90a    | 90, 90a    |

Table: Some of the recent CUDA architectures (10/08/2024)

# Compilation trajectory (cont.)



Figure: Compilation trajectory

#### In praxi

- What are the existing CUDA architectures?
- How find the architecture of a machine?
- How to build for a particular architecture?
- How to build for multiple architectures?

# Profiling & debugging

#### CUDA SDK comes with:

- its own profiler: nvprof.
- its own debugger: nvsight

# Profiling mul3 using nvprof

```
[u0253283@notch001:3]$ nvprof ./mul3
==2424062== NVPROF is profiling process 2424062, command: ./mul3
Calling Kernel ...
Kernel Call Finished ...
Frob. Norm(P-P h):
                      0.0000000000
==2424062== Profiling application: ./mul3
==2424062== Profiling result:
           Type Time(%)
                              Time
                                                   Ava
                                                             Min
                                                                      Max
                                                                           Name
GPU activities:
                  59.70% 1.1707ms
                                              1.1707ms 1.1707ms 1.1707ms
                                                                           MatrixMulKernel3(double*, double*, double*, int)
                  32.02% 627.87us
                                             313.93us 307.13us
                                                                 320.73us
                                                                            [CUDA memcpy HtoD]
                   8.28% 162.33us
                                              162.33us 162.33us
                                                                 162.33us
                                                                            [CUDA memcpy DtoH]
                                             46.151ms 109.78us 138.23ms
                                                                           cudaMalloc
     API calls:
                  95.82% 138.45ms
                   2.88% 4.1668ms
                                             1.3889ms 494.15us 3.1420ms
                                                                           cudaMemcpv
                   0.82% 1.1901ms
                                             1.1901ms 1.1901ms 1.1901ms
                                                                           cudaLaunchKernel
                                             131.81us
                                                        94.455us 202.38us
                   0.27% 395.44us
                                                                           cudaFree
                                                           291ns
                                                                 103.89us
                         250.22us
                                             2.1940us
                                                                           cuDeviceGetAttribute
                                              16.428us 16.428us
                   0.01%
                         16.428us
                                                                 16.428us
                                                                           cuDeviceGetPCIBusId
                   0.01% 12.713us
                                              12.713us
                                                       12.713us
                                                                 12.713us cuDeviceGetName
                   0.00% 1.5660us
                                                 522ns
                                                           321ns
                                                                    905ns cuDeviceGetCount
                   0.00% 1.2730us
                                                 636ns
                                                           338ns
                                                                    935ns cuDeviceGet
                                                           613ns
                                                                    613ns cuModuleGetLoadingMode
                   0.00%
                   0.00%
                             556ns
                                                 556ns
                                                           556ns
                                                                    556ns cuDeviceTotalMem
                   0.00%
                             547ns
                                                                    547ns cudaGetLastError
                                                 547ns
                                                           547ns
                   0.00%
                             337ns
                                                 337ns
                                                           337ns
                                                                    337ns cuDeviceGetUuid
```

Figure: Profiling mul3 on notch001

### Important CUDA libraries

In order to increase the performance of your code we recommend to use highly-optimized libraries. Among them, we have:

- cuBLAS: Basic Linear Algebra Subroutines on NVIDIA GPUs.
- MAGMA: Matrix Algebra on GPU and Multi-core Architectures.
- cuRAND: Random Number Generation library.
- cuFFT: CUDA Fast Fourier Transform library.
- NCCL: NVIDIA Collective Communications Library.
- cuDNN: CUDA Deep Neural Network library.
- cuTENSOR: GPU-accelerated Tensor Linear Algebra.
- DALI: Library for decoding & augmenting images (DL applications).
- . . .

#### Alternatives to CUDA

- Similar to CUDA
  - ROCM (AMD)
- OpenACC (use of directives (cfr. OpenMP)
  - GCC: supports OpenACC for NVIDIA & AMD GPUs.
  - NVIDIA HPC SDK (formerly PGI)
  - Sourcery Codebench (AMD GPU)
- Higher-level abstractions
  - Kokkos (prog. model for parallel algorithms for many-core chips)

#### Links

- CUDA Toolkit Documentation
- CUDA C++ Programming Guide Release 12.6 (10/01/24)
- CUDA C++ Best Practices Guide, Release 12.6 (09/24/24)
- NVIDIA CUDA Compiler Driver NVCC, Release 12.6 (09/24/24)
- PTX & ISA Release 8.5 (09/24/24)

### Use of GPUs at the CHPC

# GPU devices on lp/kp/np/grn

| GPU device type             | compute capability |
|-----------------------------|--------------------|
| NVIDIA GeForce GTX TITAN X  | 5.2                |
| Tesla P100-PCIE-16GB        | 6.0                |
| Tesla P40                   | 6.1                |
| NVIDIA GeForce GTX 1080 Ti  | 6.1                |
| NVIDIA Titan V              | 7.0                |
| NVIDIA Tesla V100-PCIE-16GB | 7.0                |
| Tesla T4                    | 7.5                |
| NVIDIA GeForce RTX 2080 Ti  | 7.5                |
| NVIDIA A100-PCIe-40GB       | 8.0                |
| NVIDIA A100-SXM4-80GB       | 8.0                |
| NVIDIA A800 40GB Active     | 8.0                |

Table: GPU devices on lp/kp/np/grn (10/01/2024)

# GPU devices on lp/kp/np/grn (cont.)

| GPU device type                | compute capability |
|--------------------------------|--------------------|
| NVIDIA GeForce RTX 3090        | 8.6                |
| NVIDIA A40                     | 8.6                |
| NVIDIA RTX A5500               | 8.6                |
| NVIDIA RTX A6000               | 8.6                |
| NVIDIA RTX 6000 Ada Generation | 8.9                |
| NVIDIA L40                     | 8.9                |
| NVIDIA L40S                    | 8.9                |
| NVIDIA H100 NVL/Deep Dive      | 9.0                |

Table: GPU devices on lp/kp/np/grn (10/01/2024)

### GPU devices on redwood

| GPU device type                | compute<br>capability |
|--------------------------------|-----------------------|
| NVIDIA GeForce GTX 1080 Ti     | 6.1                   |
| NVIDIA A100-SXM4-40GB          | 8.0                   |
| NVIDIA A100 80GB PCIe          | 8.0                   |
| NVIDIA A30                     | 8.0                   |
| NVIDIA A40                     | 8.6                   |
| NVIDIA RTX 6000 Ada Generation | 8.9                   |
| NVIDIA H100 NVL/Deep Dive      | 9.0                   |

Table: GPU devices on redwood (10/01/2024)

### Accessing GPUs at CHPC

• Using GPUs at the CHPC (Presentation by Martin Čuma)

### Questions?

Thank you! Any questions?