

# High performance scientific computing in C++ HPC C++ Course 2023

29 May – 02 June 2023 | Sandipan Mohanty | Forschungszentrum Jülich, Germany



#### What happens on line 9 of this code?

```
1  #include <iostream>
2  #include <Eigen/Dense>
3  using namespace Eigen;
4  using namespace std;
5  auto main() -> int
6  {
7     MatrixXd m = MatrixXd::Random(3,3);
8     auto result = (m + MatrixXd::Constant(3, 3, 1.2)) * 50;
9     m = result;
10     cout << "m =" << "\n" << m << "\n";
11
12  }</pre>
```

- (A) The result matrix calculated in the previous line is copied to m
- (B) The calculation specified in the previous line happens due to the call to operator=
- (C) SEGFAULT



Assume CountingIterator to be an iterator which iterates over a conceptual sequence of integers. What do you expect to be printed from this code?

```
auto main() -> int
        auto tot = 0.;
        constexpr auto L = 100.;
        constexpr auto N = 1000000UL;
        std::for_each(std::execution::par.
                       CountingIterator (OUL),
                       CountingIterator(N),
                       [&](auto i) {
                           auto dx = 2. * I. / N:
                           auto x = -L + dx * (i + 0.5);
11
                           tot += dx / (1 + x * x):
12
                       });
13
        std::cout << tot << "\n":
14
15
```

- (A) Crude approximation of the base of natural logarithm *e*
- (B) Crude approximation of  $\pi$
- (C) Meaningless random value



# GPU programming with CUDA



## Data parallelism







#### **Priorities**



01 0

- CPU: faster clock speed, more cache, more sophisticated instructions and scheduling
- GPU: More chip area dedicated to floating point computations



## A separate device



- Separate memories. GPU does not automatically know the state of any object in the memory of the CPU.
- Must transfer data.
- Must tell what to do with the data.
- Must retrieve results with another information transfer.



#### Can run C++ functions

- A program running on a CPU can call special functions designed to run on the GPU
- The GPU understands a different set of hardware instructions. than the CPU, so any human readable function meant for the GPU must be compiled to a different kind of hardware instructions than code compiled for the CPU.
- A set of function "execution space specifiers" are provided as language extensions : \_\_\_global\_\_\_, \_\_device\_\_\_ and host. These indicate to a CUDA aware compiler which 11 parts to translate to the CPU language and which parts to the GPU language.
- A function running on the GPU can call other functions compiled for the GPU, leading to a call tree on the device side.

```
device auto shuf(int id)
   return (id + 1723) % 2000;
global
void gpufunc(int *ids, unsigned N)
    ids[i] = shuf(ids[i]);
auto cpufunc() -> int
   gpufunc<<<1, 100>>>(p, 3000):
```



14 15

16

## **Execution space specifiers**

- \_\_device\_\_ : the function runs on the device, and it can only be called from the device
- host\_\_\_: the function runs on the host, and it can only be called form the host
- global\_\_: the function is a "kernel". It runs on the device, and can be called from the host, or from device (compute capability >= 3.2)
  - Must have void return type
  - Can not be a member function
  - It is asynchronous : the function returns before the device performs its work
  - Must be called along with an "execution configuration" e.g., gpufunc<<<1, 100>>>(p, 3000)
- \_\_device\_\_ and \_\_host\_\_ can both be used for a function, in which case, it is compiled for both the host and the device.



- Kernel functions are called with the <<<GridSpec, BlockSpec>>> notation, i.e., potentially in a large number of threads, arranged in blocks
- BlockSpec denotes a 3 dimensional object, 3 integers, specifying the arrangement of threads in a thread block
- GridSpec denotes a 3 dimensional object, 3 integers, specifying how blocks are arranged in a grid
- Each thread running a kernel function has a built in variable, threadIdx, specifying the position of the thread in its block, and another variable blockIdx to identify the block in the grid, and blockDim = number of threads in a block
- Overall x index: blockIdx.x \* blockDim.x + threadIdx.x etc



The grid of blocks



```
global void MatAdd(float A[N][N], float B[N][N],
                            float C[N][N])
        int i = blockIdx.x * blockDim.x + threadIdx.x:
        int j = blockIdx.v * blockDim.v + threadIdx.v;
        if (i < N && j < N)
            C[i][i] = A[i][i] + B[i][i];
    auto main() -> int
10
11
12
        // Kernel invocation
13
        dim3 threadsPerBlock{16, 16};
14
        dim3 numBlocks{N / threadsPerBlock.x.
                        N / threadsPerBlock.v}:
        MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
17
18
        . . .
10
```

- The block and grid properties are often chosen to reflect properties of the problem being solved.
- In this example, the threads are organized in a 2D lattice: a natural fit for a matrix sum
- Each thread only needs to process one element!
- There is a maximum number of threads allowed in a block: a limit coming from hardware properties
- It is therefore necessary to arrange blocks into a grid



Remember how you had to process an array of double, 4 elements at a time and a stride of 4, when using AVX style SIMD register?



- Remember how you had to process an array of double, 4 elements at a time and a stride of 4, when using AVX style SIMD register?
- Think of a block in CUDA as a potentially 3-dimensional stencil of tiny computations which you repeat to cover the iteration space





- Remember how you had to process an array of double, 4 elements at a time and a stride of 4, when using AVX style SIMD register?
- Think of a block in CUDA as a potentially 3-dimensional stencil of tiny computations which you repeat to cover the iteration space
- We can use the grid to describe the iteration space in terms of the blocks. Depending on hardware availability, multiple blocks will run in parallel





- Remember how you had to process an array of double, 4 elements at a time and a stride of 4, when using AVX style SIMD register?
- Think of a block in CUDA as a potentially 3-dimensional stencil of tiny computations which you repeat to cover the iteration space
- We can use the grid to describe the iteration space in terms of the blocks. Depending on hardware availability, multiple blocks will run in parallel



stride = blockDim.x \* gridDim.x

■ If the iteration space is much larger then the total number of GPU threads, it is sometimes helpful to do grid stride loops in the kernels. You have to take into account that gridDim.\_ \* blockDim.\_ GPU threads in the whole grid, which are processing lots of indexes together. That's how many indexes you would now jump over as a "stride"

#### Information transfer to and from the device

- Any data the kernel needs to process must be transferred using CUDA memory transfer functions
- Pointer/reference values received as input parameters in a function are interpreted on the same side of the host-device boundary

```
float *d_A, *d_B, *d_C;
auto size = N * sizeof(float);
cudaMalloc(&d_A, size);
cudaMalloc(&d_B, size);
cudaMalloc(&d_C, size);

cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
```

```
device__ float devData;
float value = 3.14f;
cudaMemcpyToSymbol(devData, &value, sizeof(float));
```



#### Information transfer to and from the device

- Any data the kernel needs to process must be transferred using CUDA memory transfer functions
- Pointer/reference values received as input parameters in a function are interpreted on the same side of the host-device boundary
- Allocations on unified memory are accessible from both the host and the device.
- Any data transfer required between the physically separate host and device memory happens automatically when using unified memory

```
float *u_A, *u_B, *u_C;
auto size = N * sizeof(float);
cudaMallocManaged(&u_A, size);
cudaMallocManaged(&u_B, size);
cudaMallocManaged(&u_C, size);
```



## Device side memory hierarchy and memory space specifiers

- "Local memory" -> per thread memory
- "Shared memory" -> private to a block, but shared among the threads inside a block
- "Global memory" -> visible from all threads in all blocks
- "Constant memory" -> also in the device space, and cached in the constant cache

- Memory address specifier \_\_device\_\_ declares a variable which lives on the device
- \_\_constant\_\_ declares a variable to be stored in constant cache
- \_\_shared\_\_ : variable for the shared memory inside a block, and has the lifetime of the block
- managed\_: A variable declared with managed storage specifier can be accessed from both the host and the device, We can determine its address, and it can be read/written from both the host and the device. Since the host and device memories are physically separate, this behaviour is achieved by transferring memory implicitly



## **Example**

```
global void mul(const double *A, const double *B, double *C, size t N) {
        auto i = threadIdx.x + blockIdx.x * blockDim.x;
        auto i = threadIdx.v + blockIdx.v * blockDim.v;
        double res{}:
4
        if (i < N && i < N)
5
            for (size t k = 0ul; k < N; ++k)
6
                res += A[N * i + k] * B[N * k +i]:
        C[N*i + i] = res;
9
    auto main(int argc, char *argv[]) -> int {
10
        const unsigned N = (argc > 1) ? std::stoul(argv[1]) : 2048u;
11
12
        auto a = malloc_usm<double>(N * N);
        auto b = malloc usm<double>(N * N):
13
        auto c = malloc usm<double>(N * N);
14
15
        for (size t i = 0UL: i < N * N: ++i) { a[i] = b[i] = 1.1: }
        auto t0 = std::chrono::high resolution clock::now();
16
        dim3 ThreadsPerBlock{16, 16}:
17
        dim3 NumBlocks{N / ThreadsPerBlock.x, N / ThreadsPerBlock.v};
18
        mul <<< NumBlocks, ThreadsPerBlock>>> (a, b, c, N);
10
        cudaDeviceSvnchronize();
20
```

This is simply a syntax demonstration! Not a particularly clever implementation!



# **Compiling CUDA code**

■ With nvcc :

```
nvcc [--expt-extended-lambda] [-std=__] source.cu
```

■ With clang++ :

```
clang++ [-std=__] source.cc --cuda-gpu-arch=___ -I /path/to/CUDA/include \
   -L /path/to/CUDA/lib64 -lcudart_static -ldl -lrt -lpthread
```



#### CUDA and C++

- Except in some ancient versions, CUDA is parsed by the rules of the C++ language. Many perfectly valid code in C, e.g., using class, new, using etc. as variable names can not be part of CUDA programs
- Valid C++ code, can often not be used, for a variety of reasons:
  - Generally, the NVIDIA implementation of newer language features arrives a few years after standardization
  - Some language features may have to be modified for use in the context of GPUs
  - CUDA 11 does support most of C++17. Our working environment is based on CUDA 11.7.

- Execution space specifiers, execution configuration etc. are language extensions
- This sometimes means additional rules are necessary before a new language feature can be used with CUDA. E.g., how do we make a lambda function \_\_device\_\_ ? Should \_\_host\_\_ etc. be considered parts of the functions signature or not ? nvcc and clang++ disagree!



## **Thrust**



```
#include <thrust/host vector.h>
    #include <thrust/device vector.h>
    #include <thrust/generate.h>
    #include <thrust/sort.h>
    #include <thrust/copy.h>
    #include <cstdlib>
    using namespace thrust;
    auto main() -> int
        // generate 32 M random numbers on
10
        // the host
        host vector<int> h vec(32 << 20):
13
        generate(h_vec.begin(), h_vec.end(), rand);
14
15
        // transfer data to the device
        device vector<int> d vec = h vec;
        sort(d vec.begin(), d vec.end());
17
        // transfer data back to the host
18
        copy(d vec.begin(), d vec.end(), h vec.begin());
19
20
```

- Template library like STL or TBB for CUDA
- Elegant high level syntax (STL like iterator interface for algorithms, clever use of operator overloading ...) to clearly express the intent of the programmer
- The compiler translates the stated intents to efficient code for the GPU
- Primarily NVIDIA GPUs



```
#include <thrust/host vector.h>
    #include <thrust/device vector.h>
    #include <thrust/generate.h>
    #include <thrust/sort.h>
    #include <thrust/copy.h>
    #include <cstdlib>
    using namespace thrust;
    auto main() -> int
        // generate 32 M random numbers on
        // the host
        host vector<int> h vec(32 << 20):
        generate(h_vec.begin(), h_vec.end(), rand);
13
14
15
        // transfer data to the device
        device vector<int> d vec = h vec;
        sort(d vec.begin(), d vec.end());
17
        // transfer data back to the host
18
        copy(d vec.begin(), d vec.end(), h vec.begin());
19
20
```

Example: thrust::host\_vector and thrust::device\_vector use the assignment operator to transfer data between the CPU and the GPU



```
#include <thrust/host vector.h>
    #include <thrust/device vector.h>
    #include <thrust/generate.h>
    #include <thrust/sort.h>
    #include <thrust/copy.h>
    #include <cstdlib>
    using namespace thrust;
    auto main() -> int
        // generate 32 M random numbers on
        // the host
        host vector<int> h vec(32 << 20):
        generate(h_vec.begin(), h_vec.end(), rand);
13
15
        // transfer data to the device
        device vector<int> d vec = h vec;
        sort(d vec.begin(), d vec.end());
17
        // transfer data back to the host
18
        copy(d vec.begin(), d vec.end(), h vec.begin());
19
20
```

- Example: thrust::host\_vector and thrust::device\_vector use the assignment operator to transfer data between the CPU and the GPU
- Thrust algorithms like thrust::sort have syntax like STL algorithms



```
#include <thrust/host vector.h>
    #include <thrust/device vector.h>
    #include <thrust/generate.h>
    #include <thrust/sort.h>
    #include <thrust/copy.h>
    #include <cstdlib>
    using namespace thrust;
    auto main() -> int
        // generate 32 M random numbers on
        // the host
        host vector<int> h vec(32 << 20):
        generate(h_vec.begin(), h_vec.end(), rand);
13
15
        // transfer data to the device
        device vector<int> d vec = h vec;
        sort(d vec.begin(), d vec.end());
17
        // transfer data back to the host
18
        copy(d vec.begin(), d vec.end(), h vec.begin());
19
20
```

- Example: thrust::host\_vector and thrust::device\_vector use the assignment operator to transfer data between the CPU and the GPU
- Thrust algorithms like thrust::sort have syntax like STL algorithms
- Many data parallel general operations have their own algorithms: transform, reduce, inclusive\_scan



#### Host and device vectors

```
#include <thrust/host vector.h>
    #include <thrust/device vector.h>
    #include <iostream>
    auto main() -> int
        thrust::host vector<int> H(4);
        for (int i = 0; i < 4; ++i) H[i] = i;
        // resize H
        H.resize(2):
        std::cout << "H now has size "
            << H.size() << "\n":
11
        // Copy host vector H to
12
        // device_vector D
13
        thrust::device_vector<int> D = H;
14
        // elements of D can be modified
15
        D[0] = 99;
16
        D[11] = 88:
17
        // print contents of D
18
        for(int i = 0; i < D.size(); ++i)</pre>
19
             std::cout << "D[" << i << "] = "
20
                       << D[i] << "\n":
21
22
```

Containers host\_vector and device\_vector are designed similar to std::vector, but, do not have initializer list constructors or new member functions of std::vector like emplace\_back



#### Host and device vectors

```
#include <thrust/host vector.h>
    #include <thrust/device vector.h>
    #include <iostream>
    auto main() -> int
        thrust::host vector<int> H(4);
        for (int i = 0; i < 4; ++i) H[i] = i;
        // resize H
        H.resize(2):
        std::cout << "H now has size "
            << H.size() << "\n":
11
        // Copy host vector H to
12
        // device_vector D
13
        thrust::device_vector<int> D = H;
14
        // elements of D can be modified
15
        D[0] = 99;
16
        D[11] = 88:
17
        // print contents of D
18
        for(int i = 0; i < D.size(); ++i)
19
            std::cout << "D[" << i << "] = "
20
                       << D[i] << "\n":
21
22
```

- Containers host\_vector and device\_vector are designed similar to std::vector, but, do not have initializer list constructors or new member functions of std::vector like emplace\_back
- The overloaded assignment operators can copy data across devices



## Other initialization options

```
// initialize all ten integers to 1
thrust::device_vector<int> D(10, 1);

// set the first seven elements to 9
thrust::fill(D.begin(), D.begin() + 7, 9);

// initialize a host_vector with
// the first five elements of D
thrust::host_vector<int> H(D.begin(), D.begin() + 5);
// set elements of H to 0, 1, 2, ...
thrust::sequence(H.begin(), H.end());
```

- Many algorithms to provide initial values, to serve different purposes.
- There is also thrust::generate which can call a functional for every element of the vector
- The type of the iterators tell the compiler which version of the respective algorithms to use. No run-time overhead



#### Exercise 4.1:

The example programs <code>examples/thrust0.cu</code> and <code>examples/thrust1.cu</code> contain the thrust code in the previous slides. Run them on JUSUF using the following steps:

- Check that the NVidia CUDA modules are loaded: m1
- Compile using the nvcc compiler: nvcc thrust0.cu
- Try changing the file name to thrust0.cc and compiling



## Thrust algorithms

```
device vector<int> X(10),Y(10),Z(10);
      // initialize X to 0,1,2,3, ....
      sequence(X.begin(), X.end());
      // compute Y = -X
      thrust::transform(X.begin(), X.end(),
         Y.begin(), thrust::negate<int>());
      // fill Z with twos
      thrust::fill(Z.begin(), Z.end(), 2);
      // compute Y = X \mod 2
      thrust::transform(X.begin(), X.end(),
10
                    Z.begin(), Y.begin(),
11
                   thrust::modulus<int>());
12
      // replace all the ones in Y with 10
13
      thrust::replace(Y.begin(), Y.end(), 1, 10);
14
1.5
      // print Y
      thrust::copy(Y.begin(), Y.end(),
16
      std::ostream iterator<int>(cout, "\n"));
17
```

#### Host and device versions



## Thrust algorithms

```
device vector<int> X(10),Y(10),Z(10);
      // initialize X to 0,1,2,3, ....
      sequence(X.begin(), X.end());
      // compute Y = -X
      thrust::transform(X.begin(), X.end(),
         Y.begin(), thrust::negate<int>());
      // fill Z with twos
      thrust::fill(Z.begin(), Z.end(), 2);
      // compute Y = X \mod 2
      thrust::transform(X.begin(), X.end(),
10
                    Z.begin(), Y.begin(),
11
                    thrust::modulus<int>());
12
      // replace all the ones in Y with 10
13
      thrust::replace(Y.begin(), Y.end(), 1, 10);
14
1.5
      // print Y
      thrust::copy(Y.begin(), Y.end(),
16
      std::ostream iterator<int>(cout, "\n"));
17
```

- Host and device versions
- A set of elementary functionals are available in thrust/functional.h



## Thrust algorithms

```
device vector<int> X(10),Y(10),Z(10);
      // initialize X to 0,1,2,3, ....
      sequence(X.begin(), X.end());
      // compute Y = -X
      thrust::transform(X.begin(), X.end(),
         Y.begin(), thrust::negate<int>());
      // fill Z with twos
      thrust::fill(Z.begin(), Z.end(), 2);
      // compute Y = X \mod 2
      thrust::transform(X.begin(), X.end(),
10
                    Z.begin(), Y.begin(),
11
                    thrust::modulus<int>());
12
      // replace all the ones in Y with 10
13
      thrust::replace(Y.begin(), Y.end(), 1, 10);
14
      // print Y
15
      thrust::copy(Y.begin(), Y.end(),
16
      std::ostream iterator<int>(cout, "\n"));
17
```

- Host and device versions
- A set of elementary functionals are available in thrust/functional.h
- Notice the copy from a device vector to the ostream iterator!



#### **Universal vectors**

```
// examples/thrust usm.cc
   #include <thrust/universal vector.h>
   #include <thrust/sort.h>
    #include <iostream>
    auto main() -> int
5
6
        thrust::universal vector<int> h vec(1 << 22);
        std::cout << "Filling host vector with random numbers\n";
        thrust::generate(thrust::host, h vec.begin(), h vec.end(), rand);
9
        std::cout << "Done.\n":
10
11
12
        std::cout << "Sorting vector on device\n";
        thrust::sort(thrust::device, h vec.begin(), h vec.end()):
13
        std::cout << "Done.\n";
14
15
```

thrust::universal\_vector is similar to thrust::host\_vector and thrust::device\_vector, but uses unified memory for storage



#### **Universal vectors**

```
// examples/thrust usm.cc
   #include <thrust/universal vector.h>
    #include <thrust/sort.h>
    #include <iostream>
    auto main() -> int
5
6
        thrust::universal vector<int> h vec(1 << 22);
        std::cout << "Filling host vector with random numbers\n";
        thrust::generate(thrust::host, h vec.begin(), h vec.end(), rand);
9
        std::cout << "Done.\n":
10
11
        std::cout << "Sorting vector on device\n";
12
        thrust::sort(thrust::device, h vec.begin(), h vec.end()):
13
        std::cout << "Done.\n";
14
15
```

- thrust::universal\_vector is similar to thrust::host\_vector and thrust::device\_vector, but uses unified memory for storage
- Data does not need to be moved explicitly between host and device



#### **Universal vectors**

```
// examples/thrust_usm.cc
    #include <thrust/universal vector.h>
    #include <thrust/sort.h>
    #include <iostream>
    auto main() -> int
5
6
        thrust::universal vector<int> h vec(1 << 22);
        std::cout << "Filling host vector with random numbers\n";
        thrust::generate(thrust::host, h vec.begin(), h vec.end(), rand);
9
        std::cout << "Done.\n":
10
11
        std::cout << "Sorting vector on device\n";
12
        thrust::sort(thrust::device, h vec.begin(), h vec.end()):
13
        std::cout << "Done.\n";
14
15
```

- thrust::universal\_vector is similar to thrust::host\_vector and thrust::device\_vector, but uses unified memory for storage
- Data does not need to be moved explicitly between host and device
- Algorithms need to be told whether they are meant for host or device explicitly



### **Custom functionals for transforms**

```
struct saxpy functor
      const float a:
      saxpy functor(float a) : a(a) {}
      host device
        auto operator()(const float& x,
                 const float& v) const -> float {
          return a * x + v;
    } :
    void saxpy fast (float A.
          thrust::device_vector<float>& X,
11
          thrust::device vector<float>& Y)
12
13
14
      // Y < - A * X + Y
      thrust::transform(X.begin(), X.end(),
15
                       Y.begin(), Y.begin(),
16
                       saxpv functor(A));
17
18
```

- When pre-defined operations in thrust/functional.h do not suffice, we can write our own function objects
- The overloaded operator() must be marked with \_\_host\_\_ \_device\_\_



# **Custom functionals using placehoders**

■ For very simple operations, custom functionals can be generated inline using the thrust::placehoders namespace.

- \_1, \_2 ... are placehoders
- Expressions involving placeholders yield a functional mapping its arguments sequencially to \_1, \_2 ...



# **Custom functionals using lambda functions**

```
nvcc --extended-lambda saxpy0.cu
```

Note where we mark the lambda function to be for the host and device



#### Exercise 4.2: Placeholders and lambda functions

The example examples/saxpy0.cu shows how to use the placehoders with thrust algorithms for simple inline functionality. There is also a commented out version of the same thing done using a lambda function. The placeholder version is more compact, but the lambda version can have multiple statements, like a normal function.



#### Exercise 4.3: Mandelbrot set

The Mandelbrot set is the set of complex numbers c for which the function  $f(z)=z^2+c$  does not diverge when iterated from z=0. An image representing the set can be created by generating the sequence  $z_n=z_{n-1}^2+c$  for each pixel in the image, by treating the x and y values of the pixel as the real and imaginary components of c. The sequence can be taken to have diverged if the magnitude of z exceeds 2. The program exercises/mandelbrot\_cpu.cc does it, using the standard C++ library. A modified version using thrust, mandelbrot\_gpu.cu is also present. Build nvcc and clang++ and run. Figure out the code differences and why they are needed.



#### Reductions

- Reductions require a binary operation and some initial value
- For convenience, variants like count, count\_if, inner\_product exist
- If a reduction is to follow a transform on the same data, transform\_reduce offers an opportunity for "kernel fusion"



### Partial sums, sorting, etc.

```
int data[6] = {1, 0, 2, 2, 1, 3};
inclusive_scan(data,data+6,data);

exclusive_scan(data,data+6,data);

// data is now {0, 1, 1, 3, 5, 6};

thrust::sort(A, A + N);

const int N = 6;

int keys[N]={1, 4, 2, 8, 5, 7};

char values[N]={'a', 'b', 'c', 'd', 'e', 'f'};

thrust::sort_by_key(keys, keys + N, values);

// keys is now {1, 2, 4, 5, 7, 8};

// values is now {'a', 'c', 'b', 'e', 'f', 'd'}

thrust::stable_sort(A, A+N,

thrust::greater<int>());
```

 Frequently needed algorithms, which are not trivial to parallelize, have thrust implementations



## Partial sums, sorting, etc.

```
int data[6] = {1, 0, 2, 2, 1, 3};
inclusive_scan(data,data+6,data);

exclusive_scan(data,data+6,data);

// data is now {0, 1, 1, 3, 5, 6};

thrust::sort(A, A + N);

const int N = 6;

int keys[N]={1, 4, 2, 8, 5, 7};

char values[N]={'a', 'b', 'c', 'd', 'e', 'f'};

thrust::sort_by_key(keys, keys + N, values);

// keys is now {1, 2, 4, 5, 7, 8};

// values is now {'a', 'c', 'b', 'e', 'f', 'd'};

thrust::stable_sort(A, A+N,

thrust::greater<int>());
```

- Frequently needed algorithms, which are not trivial to parallelize, have thrust implementations
- Nicely hides low-level details and lets us work on the program logic



## Partial sums, sorting, etc.

```
int data[6] = {1, 0, 2, 2, 1, 3};
inclusive_scan(data,data+6,data);

exclusive_scan(data,data+6,data);

// data is now {0, 1, 1, 3, 5, 6};

thrust::sort(A, A + N);

const int N = 6;

int keys[N]={1, 4, 2, 8, 5, 7};

char values[N]={'a', 'b', 'c', 'd', 'e', 'f'};

thrust::sort_by_key(keys, keys + N, values);

// keys is now {1, 2, 4, 5, 7, 8}

// values is now {'a', 'c', 'b', 'e', 'f', 'd'}

thrust::stable_sort(A, A+N,

thrust::greater<int>());
```

- Frequently needed algorithms, which are not trivial to parallelize, have thrust implementations
- Nicely hides low-level details and lets us work on the program logic
- The high-level syntax is parsed at compile time, and reduced to efficient system specific implementations. Overhead exists, but it is low.



### Thrust iterator library

```
thrust::constant iterator<int> first(10);
    first[0] // returns 10
    first[100] // returns 10
    thrust::counting iterator<int> first(10):
    first[0] // returns 10
    first[1] // returns 11
    first[100] // returns 110
    first = thrust::make transform iterator(vec.begin(), negate<int>());
    . . .
    last = thrust::make transform iterator(vec.end(), negate<int>());
10
    thrust::reduce(first, last): // returns -60 (i.e. -10 + -20 + -30)
11
12
    thrust::device vector<int> map(2):
13
    map[0] = 3:
14
15
    map[1] = 1;
16
    thrust::device vector<int> source(6):
    source[0] = 10;
17
    source[1] = 20;
18
19
    int sum = thrust::reduce(thrust::make permutation iterator(source.begin(), map.begin()),
20
                             thrust::make permutation iterator(source.begin(), map.end()));
21
```



# Thrust zip iterator and arbitrary transforms

```
struct arbitrary functor {
        template <class Tuple> host device void operator()(Tuple t) {
            //D[i] = A[i] + B[i] * C[i];
            thrust::get<3>(t) = thrust::get<0>(t) + thrust::get<1>(t) * thrust::get<2>(t);
4
5
6
    } :
    auto main() -> int
        using namespace thrust:
        device vector \langle \mathbf{float} \rangle A(5), B(5), C(5), D(5);
        // initialize input vectors
10
        A[0] = 3; B[0] = 6; C[0] = 2;
11
        A[11] = 4: B[11] = 7: C[11] = 5:
12
13
         // apply the transformation
14
        for_each(make_zip_iterator(make_tuple(A.begin(), B.begin(), C.begin(), D.begin())),
15
                  make zip iterator(make tuple(A.end(), B.end(), C.end(), D.end())).
16
                  arbitrary functor()):
17
        for(int i = 0; i < 5; ++i)
18
             std::cout << A[i] << " + " << B[i] << " * " << C[i] << " = " << D[i] << "\n";
10
20
```

### Thrust examples

#### Exercise 4.4:

examples/thrust/matmuls.cu and examples/thrust/matmuld.cu demonstrate using CUDA blas library, CUDA random number generator and thrust together. Using thrust for the memory management parts significantly improves the readability of such programs. Learn the steps necessary to compile them, and check performance on one of the GPU nodes.

29 May - 02 June 2023



### STDPAR: standard C++ for GPUs

- NVC++, the NVIDIA HPC SDK C++ compiler
- No <<< >>>, no \_\_device\_\_ etc. Just plain C++ written with STL algorithms
- std::execution::par regions automatically translated into GPU code!
- There are restrictions, but they will likely be fewer and fewer in the future

```
std::transform_reduce(std::execution::par, R2.begin(),
   R2.end(), S12.begin(), 0., std::plus<double>{},
   [](auto r2, auto s12){
      return Vexv(r2, s12);
   });
```

```
nvc++ -O3 -std=c++17 -stdpar exvol.cc -o exvol.nv
```



### STDPAR: standard C++ for GPUs

- Only inline functions or function templates. nvc++ selects functions for GPU execution on its own, and that
  only works if it can see the definitions
- CUDA Unified Memory for all data movement between CPU and GPU: presently, only heap allocated objects in CPU code compiled by nvc++ -stdpar can be automatically managed. Stack and global storage not accessible. Even heap allocations from portions of CPU code not compiled by nvc++ -stdpar are not visible.
- Pointers dereferenced in the parallel algorithms must point to heap locations. References used must be of heap objects.
- Lambda captures by references can often entail pointer dereferencing for stack entities, which should not
  occur in parallel algorithm regions
- No function pointers: functions are compiled for CPU and GPU. Pointer can only point to one. Inside GPU code, there will then be the danger of accessing a pointer to a function with CPU code. Pass function objects or lambdas as arguments to the algorithms instead.
- Only random access iterators
- catch clauses in GPU code ignored. Fine inside CPU code.



#### Exercise 4.5:

The programs stdpardemol.cc and stdpardemol.cc are simple short programs using parallel algorithms. The second one is a slighly modified version of the exvol.cc program we used in connection with SIMD programming. Compile them with nvc++ and run them on a GPU node on JUSUF.

#### Exercise 4.6:

Two versions of a program <code>jacobi\_cxx17.cc</code> and <code>jacobi\_cxx20.cc</code> are in your examples folder. Identify the part which can be parallelized using STL parallel algorithms, and do the necessary code changs. The C++17 version can be compiled for the GPU using <code>nvc++</code>, without any code changes. Try this new way of CPU/GPU programming where the exact same code runs on both!

Slide 37



#### **Unfinished business**

#### Exercise 4.7:

The travelling salesman example from the first day is conceptually easily parallelisable problem. In examples/TsP-pst1/ you will find a version which uses parallelisation with PSTL. The parallelised version allows us to search all possible paths for 15 cities in about 30 seconds on JUSUF CPU nodes. But we used far too much C++20 there for it to be compatible with current NVidia compilers. There is a different version TsP-stdpar/, in which I have removed all unsupported C++20 features. It no longer uses modules, and is in a single file. Although the nvc++ compiler does not emit any errors, it can't generate code for the GPU using this version. Can you fix it?

