This repository wraps GPU (or, more generally, any accelerator) APIs.
The folder structure is as follows:
- algorithms: implements common parallel algorithms for the API
- cmake: contains additional CMake files where necessary
- examples: contains a basic example and a Jacobi relaxation example
- interfaces: implements the API for each specific vendor
- submodules: Git submodules required by this repository; fetch them with `git submodule init` and `git submodule update`
- tests: GoogleTest (gtest) tests for the algorithms package
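For readers new to Jacobi relaxation, the following is a minimal host-side sketch of one sweep of a generic 2D Laplace solver. It is not taken from examples/jacobi, whose problem setup may differ; the function name `jacobiSweep` and the row-major grid layout are assumptions. A host version like this can also serve as a reference when checking the output of a ported GPU kernel.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: one Jacobi sweep for the 2D Laplace equation on an
// nx-by-ny grid stored row-major (index = j * nx + i). Boundary values stay
// fixed; every interior point becomes the average of its four neighbours,
// all read from the old iterate. Returns the largest absolute change.
double jacobiSweep(const std::vector<double>& oldGrid, std::vector<double>& newGrid,
                   std::size_t nx, std::size_t ny) {
  assert(oldGrid.size() == nx * ny);
  assert(&oldGrid != &newGrid);  // Jacobi needs two separate buffers
  newGrid = oldGrid;             // copies the fixed boundary values
  double maxChange = 0.0;
  for (std::size_t j = 1; j + 1 < ny; ++j) {
    for (std::size_t i = 1; i + 1 < nx; ++i) {
      const std::size_t k = j * nx + i;
      const double updated = 0.25 * (oldGrid[k - 1] + oldGrid[k + 1] +
                                     oldGrid[k - nx] + oldGrid[k + nx]);
      maxChange = std::fmax(maxChange, std::fabs(updated - oldGrid[k]));
      newGrid[k] = updated;
    }
  }
  return maxChange;
}
```

Calling it repeatedly while swapping the two grids, until the returned change drops below a tolerance, gives the usual Jacobi iteration.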
Currently, three implementations are available:
- NVIDIA's CUDA C
- AMD's HIP
- SYCL, implemented by Intel's OneAPI and by hipSYCL (https://github.com/illuhad/hipSYCL)
The setup below uses Linux commands; for Windows, see the corresponding batch reference.
- Create a build folder: `mkdir -p build`
- Change into it: `cd build`
- Configure CMake with `cmake .. <DEVICE_OPTIONS> -DREAL_SIZE_IN_BYTES=<PRECISION>`
  - `REAL_SIZE_IN_BYTES` sets the precision of the floating-point operations; use 4 (single precision) or 8 (double precision) bytes.
  - The device options are explained below.
To run the examples, follow the instructions in the respective example package.
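How `REAL_SIZE_IN_BYTES` reaches the source code is defined by the repository's CMake files. The sketch below only illustrates a common pattern, assuming the value is forwarded as a compile definition named `REAL_SIZE` and mapped to a `real` type alias; both names, and the double-precision fallback, are hypothetical.

```cpp
#include <cassert>
#include <type_traits>

// Assumption: CMake forwards the option as a compile definition, e.g.
//   target_compile_definitions(<target> PUBLIC REAL_SIZE=${REAL_SIZE_IN_BYTES})
// REAL_SIZE and `real` are hypothetical names used for illustration only.
#ifndef REAL_SIZE
#define REAL_SIZE 8  // assumed fallback: double precision when nothing is defined
#endif

#if REAL_SIZE == 8
using real = double;
#elif REAL_SIZE == 4
using real = float;
#else
#error "REAL_SIZE_IN_BYTES must be 4 or 8"
#endif

// Catch platforms where float/double do not have the expected size.
static_assert(sizeof(real) == REAL_SIZE, "real does not match REAL_SIZE_IN_BYTES");

// Host and kernel code are then written once in terms of `real`.
inline real axpy(real a, real x, real y) { return a * x + y; }
```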
- Use `-DDEVICE_BACKEND:STRING=CUDA` to build the CUDA implementation.
  - CUDA also requires a sub-architecture for the device, for example `-DDEVICE_SUB_ARCH=sm60`; see CMakeLists.txt for all options.
  - Make sure CUDA, including `nvcc`, is installed.
  - Complete example call: `cmake .. -DDEVICE_BACKEND:STRING=CUDA -DREAL_SIZE_IN_BYTES=4 -DDEVICE_SUB_ARCH=sm60`
- Use `-DDEVICE_BACKEND:STRING=HIP` to build the HIP implementation.
  - Complete example call: `cmake .. -DDEVICE_BACKEND:STRING=HIP -DREAL_SIZE_IN_BYTES=4 -DDEVICE_SUB_ARCH=gfx906`
- Use `-DDEVICE_BACKEND:STRING=ONEAPI` to build the OneAPI implementation.
  - Currently, OneAPI is mainly used to target Intel devices; to use CUDA or HIP as the device backend, use hipSYCL and set the corresponding environment variables.
  - Set the environment variable `PREFERRED_DEVICE_TYPE` to compile for the architecture defined by `DEVICE_SUB_ARCH`.
  - If `PREFERRED_DEVICE_TYPE` is not set at build time, JIT compilation is assumed and the value of `DEVICE_SUB_ARCH` is ignored.
  - Valid values for `PREFERRED_DEVICE_TYPE` are `GPU`, `CPU`, and `FPGA`.
  - The environment variable must also be set before running any code that uses this library: the runtime needs this hint to select the device the code was compiled for. If the value is unset or invalid, the runtime applies a default selection strategy, preferring GPUs over CPUs and CPUs over the host. This may crash the application if the compilation target differs from the device selected at runtime.
  - If `PREFERRED_DEVICE_TYPE` was not set at build time but is set before running an application, the JIT compiler generates the kernels, which allows switching the device type at runtime.
  - Complete example call: `export PREFERRED_DEVICE_TYPE=GPU` followed by `cmake .. -DDEVICE_BACKEND:STRING=ONEAPI -DREAL_SIZE_IN_BYTES=4 -DDEVICE_SUB_ARCH=dg1`
- Use `-DDEVICE_BACKEND:STRING=HIPSYCL` to build for hipSYCL.
  - hipSYCL currently does not require a sub-architecture, but `DEVICE_SUB_ARCH` must still be specified in the CMake call.
  - Complete example call: `cmake .. -DDEVICE_BACKEND:STRING=HIPSYCL -DREAL_SIZE_IN_BYTES=4 -DDEVICE_SUB_ARCH=dg1`
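The runtime device selection described for the OneAPI backend can be modeled in plain C++17. This is a hypothetical sketch, not this library's API: `DeviceType`, `parsePreferredDeviceType` and `selectDevice` are illustrative names, the value comparison is assumed to be case-sensitive, and the sketch assumes that a legal preference whose device is missing also falls back to the default order GPU, CPU, host.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <optional>
#include <string>
#include <vector>

// Hypothetical device categories; the library's own names may differ.
enum class DeviceType { GPU, CPU, FPGA, Host };

// Interpret the value of PREFERRED_DEVICE_TYPE. Returns std::nullopt when the
// variable is unset (nullptr) or holds anything other than GPU, CPU or FPGA.
std::optional<DeviceType> parsePreferredDeviceType(const char* value) {
  if (value == nullptr) return std::nullopt;
  const std::string v(value);
  if (v == "GPU") return DeviceType::GPU;
  if (v == "CPU") return DeviceType::CPU;
  if (v == "FPGA") return DeviceType::FPGA;
  return std::nullopt;
}

// Honour a legal preference if such a device exists; otherwise apply the
// default order GPU, then CPU, then the host. (Falling back when the preferred
// device is missing is an assumption of this sketch.)
DeviceType selectDevice(const char* preferred, const std::vector<DeviceType>& available) {
  const auto has = [&available](DeviceType t) {
    return std::find(available.begin(), available.end(), t) != available.end();
  };
  if (const auto p = parsePreferredDeviceType(preferred); p && has(*p)) return *p;
  if (has(DeviceType::GPU)) return DeviceType::GPU;
  if (has(DeviceType::CPU)) return DeviceType::CPU;
  return DeviceType::Host;
}

// Same decision, reading the real environment variable.
DeviceType selectDeviceFromEnvironment(const std::vector<DeviceType>& available) {
  return selectDevice(std::getenv("PREFERRED_DEVICE_TYPE"), available);
}
```

A SYCL backend would then use the selected category to pick a matching SYCL device for its queue; that step is omitted here.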
To add a new API:
- Extend CMakeLists.txt with the new API.
- Add a vendor-specific CMake file that is included by CMakeLists.txt.
- Do the same for the build files in examples/basic and examples/jacobi.
- Add a folder for the new API in interfaces/.
- Copy an existing implementation into the new folder, for example, and adapt it to the new API.
- Compile and run the basic example to check that the basic concepts work.
- Implement examples/jacobi/src/gpu/kernels for the new API.
- Compile and run the Jacobi benchmark.
- Switch to the algorithms package and repeat the procedure.
- Finally, compile and run the tests in the tests/ folder.
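To illustrate the shape of a vendor implementation, the sketch below uses a hypothetical, heavily reduced interface; the actual classes in interfaces/ are larger and named differently. A host-memory stand-in such as `HostInterface` can help check the build wiring and call sites before any vendor runtime is involved; a real backend would call, for example, `cudaMalloc`/`cudaMemcpy` or `hipMalloc`/`hipMemcpy` instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <new>

// Hypothetical, heavily reduced device interface. Each backend in
// interfaces/ would implement it with its vendor runtime, e.g. cudaMalloc /
// cudaMemcpy for CUDA or hipMalloc / hipMemcpy for HIP.
class DeviceInterface {
 public:
  virtual ~DeviceInterface() = default;
  virtual void* allocate(std::size_t bytes) = 0;
  virtual void deallocate(void* ptr) = 0;
  virtual void copyToDevice(void* dst, const void* src, std::size_t bytes) = 0;
  virtual void copyFromDevice(void* dst, const void* src, std::size_t bytes) = 0;
  virtual void synchronize() = 0;
};

// Host-memory stand-in: "device" memory is ordinary heap memory and every
// operation completes immediately, so synchronize() has nothing to wait for.
class HostInterface final : public DeviceInterface {
 public:
  void* allocate(std::size_t bytes) override { return ::operator new(bytes); }
  void deallocate(void* ptr) override { ::operator delete(ptr); }
  void copyToDevice(void* dst, const void* src, std::size_t bytes) override {
    std::memcpy(dst, src, bytes);
  }
  void copyFromDevice(void* dst, const void* src, std::size_t bytes) override {
    std::memcpy(dst, src, bytes);
  }
  void synchronize() override {}
};
```

Code written against `DeviceInterface&` then runs unchanged once a vendor class replaces the stand-in.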
To add a new SYCL compiler:
- Add the new compiler as a device backend in `CMakeLists.txt` in the root folder and in examples/jacobi so that `sycl.cmake` is included for it.
- Adjust `sycl.cmake` in both folders so that the SYCL environment is set up correctly for the device and kernel target.
- If there is a reduction implementation for this compiler, add it in `algorithms` and extend the if/else in the root `CMakeLists.txt`.
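When adding a reduction for a new compiler, its result can be compared against a host reference. The sketch below models a pairwise tree reduction, where each pass folds the upper half of the active range onto the lower half; `treeReduceSum` is a hypothetical helper, not part of the algorithms package.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host model of a pairwise tree reduction. In each pass, element
// i adds element i + half, and the active range shrinks to `half` (rounded up
// so an odd leftover element survives). A halving pattern like this is common
// in GPU work-group reductions.
template <typename T>
T treeReduceSum(std::vector<T> data) {
  if (data.empty()) return T{};
  std::size_t active = data.size();
  while (active > 1) {
    const std::size_t half = (active + 1) / 2;
    for (std::size_t i = 0; i + half < active; ++i) {
      data[i] += data[i + half];
    }
    active = half;
  }
  return data[0];
}
```

For floating-point data the summation order differs between implementations, so compare device and host results with a tolerance rather than for exact equality.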