# ACM/CS 114 Parallel algorithms for scientific applications

Michael A. G. Aïvázis

California Institute of Technology

Winter 2010

### Hybrid architectures

- recall the layout of SIMD machines
  - a large number of small, special purpose processors
  - ▶ a single "controller" manages the instruction stream
    - each processor executes the same instruction on its local data
    - may be able to specify which processors are active/idle



- modern hybrid systems based on GPUs have somewhat more elaborate architectures
  - multi-tier memory layouts
  - ▶ large core counts per board
  - elaborate access rules that enable the hardware to be very fast
  - recent ones finally support double precision floating point arithmetic
  - broadly available it's in your graphics card, thanks to video gaming

#### nVidia GPU architecture

- ► The GPU boards are hosted by a conventional processor
  - connected through PCI Express, which is the limiting factor in moving data to and from the device memory
- computing power and memory configurations vary from
  - ▶ a single 2-core GPU with 128M of memory on older video cards
  - ▶ to  $30 \times 8$ -core SIMTs with 4G of memory on the Tesla C1060
    - ▶ 240 cores; peak: 933 Gflops single precision, 78 Gflops double precision
    - ▶ rivulet.cacr.caltech.edu has four such boards



## Getting started

- getting the drivers, tools and code samples
  - ▶ visit http://nvidia.com/cuda
- compiling and linking
  - ▶ C for CUDA: a few extensions
  - source is a mixture of the code that runs on the host and the kernels that run on the GPU
  - there are restrictions on what kind of code you can include in a kernel
  - nvcc is the special compiler and linker
- staging and launching
  - the resulting executable runs on the host
  - launches threads on the GPU through special statements
- the emulator
  - the SDK comes with a software emulator
  - extremely useful for debugging
- special hardware
  - your video card
  - ▶ Tesla boards



### Sanity check

```
// memxchnq.cu: making sure the compiler and cuda runtime are accessible
 2 #include <cuda.h>
 3 #include <assert.h>
   int main(int argc, char* argv[]) {
      const int N = 12:
      // allocate some buffers on the host
 8
      float *send host = (float *) malloc(N*sizeof(float));
 0
      float *recv host = (float *) malloc(N*sizeof(float));
      // allocate matching ones on the device
10
      float *send device, *recv device;
      cudaMalloc((void **) &recv device, N*sizeof(float));
      cudaMalloc((void **) &send device, N*sizeof(float));
14
      // and initialize the host data
      for (int i=0: i<N: i++) {
16
         send host[i] = 2.0f + i*i;
         recv host[i] = 0.0f;
18
19
      // send the data from the host to the device
      cudaMemcpv(recv device, send host, N*sizeof(float), cudaMemcpvHostToDevice);
20
      // move the data in device memory
      cudaMemcpv(send device, recv device, N*sizeof(float), cudaMemcpvDeviceToDevice);
      // get it back on the host
24
      cudaMemcpv(recv host, send device, N*sizeof(float), cudaMemcpvDeviceToHost);
25
      // check the result
26
      for (int i=0: i<N: i++) {
        assert(send host[i] == recv host[i]);
28
29
      // free the buffers;
30
      free (send host); free (recv host);
31
      cudaFree(send device); cudaFree(recv device);
      return 0;
34 1
```

#### The execution model

- nvcc splits the source code into two parts
  - the code that runs on the host
  - ▶ the device *kernel*, the code that runs of the GPU
    - built out of specially marked subroutines in the program
- the program is launched on the host and runs sequentially
- at specific points in the code, the program
  - launches the kernel and runs it on the GPU by many threads in parallel
  - the host continues on without blocking
  - until it encounters some blocking call to the CUDA runtime
- the execution context is specified by organizing
  - groups of threads in blocks
  - groups of blocks in grids
  - blocks are scheduled and executed in arbitrary order
    - ▶ in warps: 32 SIMD threads at a time (on currently available devices)
- at runtime, each thread is given
  - ▶ threadIdx: its own thread id
  - blockIdx: the id of the block of active threads
  - blockDim: the geometry of the block of active threads



### Adding a bit of work

```
1 // scale.cu: multiply each element in an array by a given float
2 #include <cuda.h>
3 #include <assert.h>
  // manipulate the host array
6 void scale_host(float* a, float scale, int N) {
     // loop over all array elements and multiply them by 2
     for (int i=0; i<N; i++) {
8
        a[i] *= scale;
0
     return:
  // and here is the corresponding code for the GPU
    global void scale device(float* a, float scale, int N) {
     // this thread is responsible for one element of the array
16
     // compute its offset using the block geometry builtins
     int idx = blockIdx.x * blockDim.x + threadIdx.x:
     // make sure we don't go past the last one
19
     if (idx < N) {
20
        // do the arithmetic
        a[idx] *= scale;
     return;
25 }
```

#### Launching the kernel

```
// send the data from the host to the device
     cudaMemcpy (
        array_dev, send_host, N*sizeof(float), cudaMemcpyHostToDevice);
4
     // set up the device execution context for our threads
6
     // each thread will take care of one element
     int blockSz = 4; // 4 threads per block
     // compute the number of blocks needed
0
     int nBlocks = N/blockSz;
10
     // adjust up to make sure we cover the entire array
     if (N % nBlocks) {
        nBlocks++;
14
     // scale the array on the device
     float scale = 2.0f:
16
     scale device <<<nBlocks, blockSz>>> (array dev, scale, N);
     // scale the input array on the host
18
19
     scale host (send host, scale, N);
20
     // get it back on the host
     cudaMemcpy (
        recy host, array dev, N*sizeof(float), cudaMemcpyDeviceToHost);
24
```

### Capabilities of the Tesla C1060 board

```
Device 1: "Tesla C1060"
    CUDA Driver Version:
                                         2.30
    CUDA Runtime Version:
                                         2 30
    CUDA Capability Major revision number: 1
    CUDA Capability Minor revision number: 3
    Total amount of global memory:
                                         4294705152 bytes
    Number of multiprocessors:
                                         30
    Number of cores:
                                         240
    Total amount of constant memory:
                                         65536 bytes
    Total amount of shared memory per block: 16384 bytes
    Total number of registers available per block: 16384
    Warp size:
    Maximum number of threads per block: 512
    Maximum sizes of each dimension of a block: 512 x 512 x 64
    Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
    Maximum memory pitch:
                                         262144 bytes
    Texture alignment:
                                         256 bytes
    Clock rate:
                                         1.30 GHz
    Concurrent copy and execution:
                                         Yes
2.0
    Run time limit on kernels:
                                         No
    Integrated:
    Support host page-locked memory mapping: Yes
    Compute mode:
                                         Default.
                                         (multiple host threads can
2.5
```

use this device simultaneously)

### Memory hierarchy

- each thread gets
  - a set of registers: really fast but limited memory
  - memory shared by all threads in its block
  - private *local* memory allocated from the global address space
  - access to the device global memory pool
- texture and constant memory are beyond scope here
  - not really useful for general purpose programming
- latency and bandwidth for these memory pools are very different
  - the name of the game: resource management
  - the price to pay for the astronomical performance



#### Summary

- ▶ GPUs brought hybrid programming models back to center stage
  - but our parallelization steps don't change
  - just the balance between fine and coarse grain tasks
- ▶ high performance computing the way it used to be
  - resource allocation and management strategies define performance
  - ▶ true enough for sequential, MPI and threaded programs anyway
- barely scratched the surface here
  - let me know if you are interested in pursuing further
- ▶ if you must program in a hybrid model
  - why not write multi-threaded host programs
    - to take advantage of more devices per host
    - to overlap calculations on the host and GPU
  - ▶ why not use MPI as well and scale out to multiple nodes
    - for some really massive calculations!

