-----

## GPU Concepts

Burton Rosenberg

29 May 2023


----

The NVidia GPU supports a programming library called CUDA. This adapts the GPU's graphics-rendering hardware to the needs of scientific computation. Here we introduce the architecture of an NVidia GPU.

### Heterogeneous computing 

The NVidia GPU is an example of _heterogeneous computing_. The GPU is a device on the system bus that does a different style of computing than the CPU. Work better suited to the GPU is shipped to the GPU for computing.

This involves:

- data movement from the CPU memory to the GPU memory
- the code, a kernel, sent to the GPU 
- the kernel launched and the computation undertaken
- the result of the computation copied from the GPU memory to the CPU memory.
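
The four steps above can be sketched as a small host program. This is a minimal sketch, not a definitive implementation: the kernel `double_elements`, the helper `round_trip` and the block size of 256 are hypothetical names and values chosen for illustration, and error handling is kept to a minimum.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: doubles each element in place.
__global__ void double_elements(float *d_data, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {                       // threads past the end do nothing
        d_data[i] = 2.0f * d_data[i];
    }
}

// Runs the four steps; returns true if every result is as expected.
bool round_trip(int n) {
    size_t bytes = n * sizeof(float);
    float *h_data = (float *)malloc(bytes);       // CPU (host) memory
    if (h_data == NULL) return false;
    for (int i = 0; i < n; i++) h_data[i] = (float)i;

    float *d_data = NULL;                         // GPU (device) memory
    if (cudaMalloc((void **)&d_data, bytes) != cudaSuccess) {
        free(h_data);
        return false;
    }

    // Step 1: move the data from CPU memory to GPU memory.
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // Steps 2 and 3: the kernel is sent to the GPU and launched.
    int threads = 256;                            // assumed block size
    int blocks = (n + threads - 1) / threads;     // enough blocks to cover n
    double_elements<<<blocks, threads>>>(d_data, n);

    // Step 4: copy the result back. This copy waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    bool ok = (err == cudaSuccess);
    for (int i = 0; ok && i < n; i++) ok = (h_data[i] == 2.0f * (float)i);

    cudaFree(d_data);
    free(h_data);
    return ok;
}

int main(void) {
    printf("round trip %s\n", round_trip(1000) ? "succeeded" : "failed");
    return 0;
}
```

Compiled with `nvcc`, the same source file holds both the CPU code and the GPU kernel.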

### Single Instruction Multiple Data (SIMD) parallelism

While NVidia calls its computing SIMT (Single Instruction Multiple Thread), for simplicity we will look at how it implements a modified form of Single Instruction Multiple Data parallelism.

In strict SIMD, a single program works lock-step, instruction by instruction, and only the data worked upon is different. The program is called the _kernel_ and might look like,
<pre>
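    // Assumes the launch creates exactly one thread per array element.
    __global__ void sum_array(float *a, float *b, float *c) {
        // global index of this thread across all blocks
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        a[i] = b[i] + c[i];
    }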
</pre>
Except for the index `i`, which is computed from the built-in variables `threadIdx`, `blockIdx` and `blockDim` that CUDA supplies to every thread, all threads run exactly this code in lock step.
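
To see that the index formula gives each array element exactly one thread, here is a minimal sketch in which every thread records its own index. The names `record_index` and `indices_cover_array` are hypothetical. The sketch also assumes managed memory, so that the CPU can read the array directly, and it adds a guard `i < n` because the number of threads launched is rounded up to a whole number of blocks.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread writes its own global index into the array.
__global__ void record_index(int *out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) out[i] = i;     // the last block may have surplus threads
}

// Launches with the given block size; true if out[i] == i for every i.
bool indices_cover_array(int n, int threads_per_block) {
    int *out = NULL;
    // Managed memory is accessible from both the CPU and the GPU.
    if (cudaMallocManaged((void **)&out, n * sizeof(int)) != cudaSuccess) {
        return false;
    }
    for (int i = 0; i < n; i++) out[i] = -1;      // mark as not yet written

    // Round up so that blocks * threads_per_block >= n.
    int blocks = (n + threads_per_block - 1) / threads_per_block;
    record_index<<<blocks, threads_per_block>>>(out, n);

    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    for (int i = 0; ok && i < n; i++) ok = (out[i] == i);
    cudaFree(out);
    return ok;
}

int main(void) {
    // 1000 elements with 128 threads per block needs 8 blocks (1024 threads).
    printf("indices %s\n", indices_cover_array(1000, 128) ? "cover the array" : "failed");
    return 0;
}
```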

True SIMD would not allow any instance of the kernel to diverge. NVidia does allow a kernel to have loops and branches that depend on the data values. Threads are launched in groups of 32 called a _warp_. Different warps proceed completely independently. They might be launched at different times, and will finish at different times.


Within a warp all threads work instruction by instruction in lock step. If there is a branch and some threads in the warp go one side of the branch, and some the other, the warp in effect steps through both sides of the branch one after the other. On each side, the threads that did not take that side are disabled, so their results are not affected. This is called _warp divergence_.
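
The difference between a branch that splits a warp and one that does not can be sketched with two kernels. This is an illustration under assumptions: the kernel and helper names are hypothetical, the block size of 64 is arbitrary, and for a branch this small the compiler may well avoid a real branch altogether. Both kernels compute correct results; only the first has threads of one warp on both sides of the branch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Branch on the parity of the thread index. Even and odd threads sit in
// the same warp, so every warp has threads on both sides of the branch.
__global__ void branch_within_warp(int *out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0) out[i] = 2 * i;
    else                      out[i] = 2 * i + 1;
}

// Branch on the warp number. All threads of a warp take the same side.
__global__ void branch_per_warp(int *out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= n) return;
    if ((threadIdx.x / warpSize) % 2 == 0) out[i] = 2 * i;
    else                                   out[i] = 2 * i + 1;
}

// Runs both kernels and checks the results on the CPU.
bool run_both(int n) {
    const int threads = 64;                 // two warps of 32 per block
    int blocks = (n + threads - 1) / threads;
    int *a = NULL, *b = NULL;
    if (cudaMallocManaged((void **)&a, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMallocManaged((void **)&b, n * sizeof(int)) != cudaSuccess) {
        cudaFree(a);
        return false;
    }

    branch_within_warp<<<blocks, threads>>>(a, n);
    branch_per_warp<<<blocks, threads>>>(b, n);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    for (int i = 0; ok && i < n; i++) {
        int t = i % threads;                // threadIdx.x of the thread for i
        ok = (a[i] == 2 * i + (t % 2)) && (b[i] == 2 * i + ((t / 32) % 2));
    }
    cudaFree(a);
    cudaFree(b);
    return ok;
}

int main(void) {
    printf("both kernels %s\n", run_both(256) ? "gave the expected values" : "failed");
    return 0;
}
```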

### Streaming Multiprocessors

The execution machinery of an NVidia GPU is composed of repeating Streaming Multiprocessor (SM) units. 
Each contains cores for such operations as floating point and tensor arithmetic. Each SM can launch a 
certain number of warps, each warp with 32 threads.

The various generations of NVidia SMs are summarized by a _CUDA Capability_, which for the A100 is 8.0. This is not to be confused with the CUDA toolkit version or the driver version. It is an abstracted way of referring to functionality, such as having tensor cores or not.

Besides the SMs, the NVidia GPU has various forms of memory, such as _global_, _shared_, _texture_ and _constant_. Each is optimized for a particular access pattern. The texture memory, for instance, is optimized for concurrent read-only access, as needed by texture mapping when rendering graphics. We are most interested in the global memory, because it is basic and simple, with some notes about shared memory, which can be faster but must be carefully programmed.
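
As a small illustration of why shared memory needs care, the following sketch reverses one block of values. The names `reverse_block` and `reverse_ok` and the block size of 64 are assumptions for illustration. The point is the call to `__syncthreads()`: every thread of the block must finish writing the shared array before any thread reads another thread's entry.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 64   // assumed block size; the kernel handles exactly one block

// Reverses BLOCK integers in global memory by staging them in shared memory.
__global__ void reverse_block(int *data) {
    __shared__ int tile[BLOCK];        // visible to all threads of the block
    int t = threadIdx.x;
    tile[t] = data[t];                 // global memory -> shared memory
    __syncthreads();                   // wait until the whole tile is written
    data[t] = tile[BLOCK - 1 - t];     // read an entry another thread wrote
}

// Fills the array with 0..BLOCK-1, reverses it on the GPU, and checks it.
bool reverse_ok(void) {
    int *data = NULL;
    if (cudaMallocManaged((void **)&data, BLOCK * sizeof(int)) != cudaSuccess) {
        return false;
    }
    for (int i = 0; i < BLOCK; i++) data[i] = i;

    reverse_block<<<1, BLOCK>>>(data); // one block of BLOCK threads
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    for (int i = 0; ok && i < BLOCK; i++) ok = (data[i] == BLOCK - 1 - i);
    cudaFree(data);
    return ok;
}

int main(void) {
    printf("reverse %s\n", reverse_ok() ? "succeeded" : "failed");
    return 0;
}
```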

----

<div style="float:right;">
<img src="./A100-SM.png" width=50%> 
</div>

----


### Floating Point Data types

The diagram of the SM shows hardware for several arithmetic formats:

- INT32, 32 bit integer
- FP32, IEEE standard 32 bit floating point
- FP64, IEEE standard 64 bit floating point
- TF32, a tensor-core format stored in 32 bits, with the 8 bit exponent of FP32 but only 10 bits of mantissa

The use of 64 bit floating point is driven by scientific computing with CUDA. It is not a focus for graphics programming, and GPUs built mainly for graphics typically provide 64 bit floating point only at much reduced speed.

In fact, reduced-precision floating point, below what scientific computing has traditionally used, is popular. An example is 16 bit floating point, with a choice of BF16 with greater range (8 exponent bits, 7 mantissa bits) or FP16 with greater precision (5 exponent bits, 10 mantissa bits).
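
The trade between range and precision follows directly from the bit counts. The sketch below is host-only code that computes the largest finite value and the spacing just above 1.0 for each format. The struct `FloatFormat` and the functions `max_finite` and `epsilon` are hypothetical helpers, and the formulas assume an IEEE-style layout in which the top exponent code is reserved for infinity and NaN.

```cuda
#include <cmath>
#include <cstdio>

// A floating-point format described only by its field widths.
struct FloatFormat {
    const char *name;
    int exponent_bits;
    int mantissa_bits;
};

// Largest finite value: (2 - 2^-m) * 2^bias, where bias = 2^(e-1) - 1.
double max_finite(FloatFormat f) {
    int bias = (1 << (f.exponent_bits - 1)) - 1;
    return std::ldexp(2.0 - std::ldexp(1.0, -f.mantissa_bits), bias);
}

// Gap between 1.0 and the next representable value: 2^-m.
double epsilon(FloatFormat f) {
    return std::ldexp(1.0, -f.mantissa_bits);
}

int main(void) {
    FloatFormat formats[] = {
        {"FP16", 5, 10},
        {"BF16", 8, 7},
        {"TF32", 8, 10},
        {"FP32", 8, 23},
    };
    for (int k = 0; k < 4; k++) {
        printf("%s: max %g, epsilon %g\n", formats[k].name,
               max_finite(formats[k]), epsilon(formats[k]));
    }
    return 0;
}
```

With these formulas FP16 tops out at 65504, while BF16 reaches about the same range as FP32 but resolves only 1 part in 128 near 1.0.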

----

<div style="float:right;">
<img src="./A100-FPtypes.png" width=50%> 
</div>

----



### Device Capabilities

The program deviceQuery, found in the CUDA samples, queries all found devices for their capabilities. We have 4 A100 GPUs on thoreau, each returning these capabilities.

<pre>
Device 0: "NVIDIA A100 80GB PCIe"
  CUDA Driver Version / Runtime Version          11.8 / 11.2
  CUDA Capability Major/Minor version number:    8.0
  Total amount of global memory:                 81100 MBytes (85039775744 bytes)
  (108) Multiprocessors, ( 64) CUDA Cores/MP:     6912 CUDA Cores
  GPU Max Clock rate:                            1410 MHz (1.41 GHz)
  Memory Clock rate:                             1512 Mhz
  Memory Bus Width:                              5120-bit
  L2 Cache Size:                                 41943040 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     &lt; Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) &gt;

</pre>
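
A few of these values can also be read from within a program. The following is a minimal sketch using `cudaGetDeviceProperties` from the CUDA runtime; it is not the deviceQuery source. The helper `warps_per_sm` is a hypothetical name, and it simply divides the thread limit of an SM by the warp size.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Number of warps that fit within the thread limit of one SM.
int warps_per_sm(int max_threads_per_sm, int warp_size) {
    return max_threads_per_sm / warp_size;
}

int main(void) {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        printf("no CUDA devices found\n");
        return 1;
    }
    for (int d = 0; d < count; d++) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, d) != cudaSuccess) continue;
        printf("Device %d: \"%s\"\n", d, prop.name);
        printf("  CUDA Capability:        %d.%d\n", prop.major, prop.minor);
        printf("  Multiprocessors:        %d\n", prop.multiProcessorCount);
        printf("  Warp size:              %d\n", prop.warpSize);
        printf("  Max threads per SM:     %d\n", prop.maxThreadsPerMultiProcessor);
        printf("  Max warps per SM:       %d\n",
               warps_per_sm(prop.maxThreadsPerMultiProcessor, prop.warpSize));
        printf("  Global memory (bytes):  %zu\n", prop.totalGlobalMem);
    }
    return 0;
}
```

For the A100 values listed above, 2048 threads per multiprocessor with a warp size of 32 gives 64 warps per SM.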