xpu is a tiny (< 5000 LOC) and lightweight C++ library designed to simplify GPU programming by providing a unified interface for the GPU programming models CUDA, HIP, and SYCL, while also providing the option to run GPU code on the CPU. This allows developers to write a single codebase that can be easily compiled and run on different hardware, using modern C++ while keeping the flexibility to use native CUDA, HIP, or SYCL code where needed.
Features include:
- Unified interface to write GPU code for CUDA, HIP, SYCL.
- Zero overhead for device code compared to native CUDA/HIP/SYCL.
- Run on CPU as fallback or for debugging.
- Compile device code for CPU with a regular C++ compiler without any additional requirements.
- RAII-based memory management while maintaining control over how, when and where memory is allocated.
- Support for native CUDA/HIP/SYCL host code via xpu::function (e.g. for use with cub device-wide functions).
- Common abstraction for constant memory.
- Profiling API to collect timings and throughput of kernel executions, host <-> device transfers, memset and wall time.
- Separate compilation of device code. Host code may call kernels from any library it's linked against.
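The profiling API reports both timings and throughput. As an illustration of the arithmetic behind a throughput figure only (this is not xpu's profiling API), the sketch below times a host-side std::memcpy with std::chrono::steady_clock as a stand-in for a host <-> device transfer and converts bytes and milliseconds into GB/s. throughput_gbps and time_copy_ms are hypothetical helpers; on a real GPU backend the elapsed time would come from device events (e.g. cudaEventElapsedTime) rather than a host clock.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: throughput in GB/s (1 GB = 1e9 bytes) for `bytes`
// moved in `ms` milliseconds. Returns 0 for a non-positive duration.
constexpr double throughput_gbps(std::size_t bytes, double ms) {
    return ms > 0.0 ? (static_cast<double>(bytes) / 1e9) / (ms / 1e3) : 0.0;
}

// Hypothetical helper: times a host-side memcpy as a stand-in for a
// host <-> device transfer and returns the elapsed wall time in milliseconds.
inline double time_copy_ms(std::vector<char> &dst, const std::vector<char> &src) {
    assert(dst.size() >= src.size()); // destination must be large enough
    const auto start = std::chrono::steady_clock::now();
    std::memcpy(dst.data(), src.data(), src.size());
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

For example, 2,000,000,000 bytes moved in 1000 ms gives throughput_gbps(2'000'000'000, 1000.0) == 2.0, i.e. 2 GB/s.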
Kernels are declared as callable objects that inherit from xpu::kernel. The kernel itself is implemented as a regular C++ function.
For example, to declare a kernel that adds two vectors in your header file:
#include <xpu/device.h>
struct DeviceLib {}; // Dummy type to match kernels to a library.
struct VectorAdd : xpu::kernel<DeviceLib> {
using context = xpu::kernel_context<xpu::no_smem>; // optional shorthand
XPU_D void operator()(context &,
xpu::buffer<const float>, xpu::buffer<const float>, xpu::buffer<float>, size_t);
};
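XPU_D marks the operator as device code. A common way to let a single annotation work with nvcc, hipcc, and a plain C++ compiler is a preprocessor switch on the compiler-defined macros __CUDACC__ and __HIPCC__. The sketch below illustrates that technique under the hypothetical name MYLIB_D; it is not xpu's actual definition of XPU_D, which may differ.

```cpp
#include <cstddef>

// Hypothetical portability macro (not xpu's XPU_D).
// nvcc defines __CUDACC__ and hipcc defines __HIPCC__; both understand __device__.
// A plain C++ compiler (and SYCL, which needs no function annotations)
// sees an empty macro, so the function below is ordinary C++.
#if defined(__CUDACC__) || defined(__HIPCC__)
#define MYLIB_D __device__
#else
#define MYLIB_D
#endif

// Written once: device-only code under nvcc/hipcc (use __host__ __device__
// to make it callable from host code as well), and a normal inline function
// with any other compiler, e.g. for a CPU fallback or for debugging.
MYLIB_D inline float add_element(const float *a, const float *b, std::size_t i) {
    return a[i] + b[i];
}
```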
Then call the kernel on the host side like this:
#include <xpu/host.h>
// ...
xpu::buffer<float> a, b, c; // Declare buffers.
xpu::queue q; // Create a queue.
// Run the kernel.
q.launch<VectorAdd>(xpu::n_threads(1000), a, b, c, 1000);
// ...
See the wiki for the full runnable example.
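On the host, xpu::n_threads(1000) requests 1000 threads. GPU backends execute threads in fixed-size blocks, so a launch is typically rounded up to a whole number of blocks, which is presumably why VectorAdd also receives the element count: threads past the end must do nothing. The plain C++ sketch below emulates such a launch serially on the CPU to make this concrete. It is not xpu's CPU backend; thread_pos, div_up, launch_on_cpu and vector_add_cpu are hypothetical names, and the block size of 64 in the usage note is an arbitrary assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical thread position, loosely modelled on CUDA's blockIdx/blockDim/threadIdx.
struct thread_pos {
    std::size_t block_idx;
    std::size_t block_dim;
    std::size_t thread_idx;
    std::size_t global() const { return block_idx * block_dim + thread_idx; }
};

// Number of blocks needed to cover n threads, rounding up.
constexpr std::size_t div_up(std::size_t n, std::size_t block_dim) {
    return (n + block_dim - 1) / block_dim;
}

// Serial CPU stand-in for a kernel launch: visits every (block, thread) pair.
template <typename Kernel, typename... Args>
void launch_on_cpu(std::size_t n_threads, std::size_t block_dim, Kernel &&kernel, Args &&...args) {
    assert(block_dim > 0);
    const std::size_t n_blocks = div_up(n_threads, block_dim);
    for (std::size_t b = 0; b < n_blocks; ++b)
        for (std::size_t t = 0; t < block_dim; ++t)
            kernel(thread_pos{b, block_dim, t}, args...);
}

// Vector addition with the same argument order as VectorAdd above.
struct vector_add_cpu {
    void operator()(thread_pos pos, const float *a, const float *b, float *c, std::size_t n) const {
        const std::size_t i = pos.global();
        if (i >= n) return; // the grid was rounded up, so some threads are idle
        c[i] = a[i] + b[i];
    }
};
```

For example, launch_on_cpu(1000, 64, vector_add_cpu{}, a, b, c, n) runs div_up(1000, 64) = 16 blocks, i.e. 1024 threads, of which the last 24 return early.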
I started development of xpu as a basis for GPU processing in the CBM experiment. That meant supporting as many platforms as possible, while also providing a fallback to run on the CPU. Additionally, I wanted RAII-style memory management for device memory while retaining control over how and when allocations happen, which is where SYCL's buffer API falls short. SYCL does solve a lot of these issues. However, the risk remains that the SYCL compiler generates less performant code for our use cases, in which case we would want to switch to a native compiler (e.g. nvcc) instead.
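The combination described above, RAII ownership while the caller still controls when and with which allocator memory is acquired, can be sketched in plain C++. This is not xpu's buffer implementation: owning_buffer and counting_host_allocator are hypothetical names, and the allocator uses the C heap where a GPU backend would call something like cudaMalloc/cudaFree.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <type_traits>
#include <utility>

// Hypothetical allocator policy. A GPU backend would call e.g. cudaMalloc/cudaFree
// here; this stand-in uses the C heap and counts live allocations.
struct counting_host_allocator {
    static inline int live = 0;

    static void *allocate(std::size_t bytes) {
        void *p = std::malloc(bytes);
        if (p == nullptr) throw std::bad_alloc{};
        ++live;
        return p;
    }
    static void deallocate(void *p) noexcept {
        assert(live > 0);
        std::free(p);
        --live;
    }
};

// Hypothetical owning buffer: memory is allocated exactly when the caller
// constructs it and released automatically when the owner goes out of scope.
// Move-only, like std::unique_ptr.
template <typename T, typename Alloc>
class owning_buffer {
    static_assert(std::is_trivially_copyable_v<T>, "raw device-style memory holds trivial types only");

public:
    owning_buffer() = default;
    explicit owning_buffer(std::size_t n)
        : m_data(n > 0 ? static_cast<T *>(Alloc::allocate(n * sizeof(T))) : nullptr), m_size(n) {}
    ~owning_buffer() { reset(); }

    owning_buffer(const owning_buffer &) = delete;
    owning_buffer &operator=(const owning_buffer &) = delete;

    owning_buffer(owning_buffer &&other) noexcept
        : m_data(std::exchange(other.m_data, nullptr)), m_size(std::exchange(other.m_size, std::size_t{0})) {}

    owning_buffer &operator=(owning_buffer &&other) noexcept {
        if (this != &other) {
            reset();
            m_data = std::exchange(other.m_data, nullptr);
            m_size = std::exchange(other.m_size, std::size_t{0});
        }
        return *this;
    }

    // Frees the memory early; safe to call more than once.
    void reset() noexcept {
        if (m_data != nullptr) Alloc::deallocate(m_data);
        m_data = nullptr;
        m_size = 0;
    }

    T *get() const noexcept { return m_data; }
    std::size_t size() const noexcept { return m_size; }

private:
    T *m_data = nullptr;
    std::size_t m_size = 0;
};
```

Constructing owning_buffer<float, counting_host_allocator> buf(1000) allocates immediately; moving it transfers ownership, and the memory is freed when the last owner is destroyed or reset() is called.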
xpu requires a C++17 capable compiler and CMake 3.11 or newer.
Additionally, GPU support requires a compiler for the respective backend:
- For Nvidia GPUs: CUDA 10.2 or newer
- For AMD GPUs: ROCm 4.5 or newer
- For Intel GPUs / SYCL Targets: Intel oneAPI DPC++ Compiler
Note: xpu can be used without a GPU backend. In this case, device code will only be compiled for the CPU.
Windows is not supported at the moment. xpu is tested on Linux and macOS.
Adding xpu to your project is as simple as adding the following to your CMakeLists.txt:
include(FetchContent)
FetchContent_Declare(xpu
GIT_REPOSITORY https://github.com/fweig/xpu
GIT_TAG v0.9.1
)
FetchContent_MakeAvailable(xpu)
Then call xpu_attach on your target:
add_library(Library SHARED ${LibrarySources}) # Works for executables as well
xpu_attach(Library ${DeviceSources}) # DeviceSources is a subset of LibrarySources that should be compiled for GPU
Enable the desired backends by passing -DXPU_ENABLE_<BACKEND>=ON to cmake (e.g. -DXPU_ENABLE_CUDA=ON for CUDA or -DXPU_ENABLE_HIP=ON for HIP).
See the wiki for all available CMake options.
The documentation is based on doxygen. To build it:
$ cd build
$ cmake .. -DXPU_BUILD_DOCS=ON
$ make docs
The API reference is generated in HTML under docs/html. You can view it with your favorite browser:
$ firefox docs/html/index.html
The API reference is also available online for the current master.
Additionally, there's the wiki that contains more detailed guides and examples.
Note: Both wiki and doxygen are still incomplete and under construction. If you have any questions, feel free to ask them in the issues.
To build the tests, pass -DXPU_ENABLE_TESTS=ON to cmake. Compiling and running the tests requires googletest, which cmake will automatically download and build.
To build and run the testbench:
$ cmake -B build -S . -DXPU_ENABLE_TESTS=ON -DXPU_ENABLE_CUDA=ON -DXPU_ENABLE_HIP=ON -DXPU_ENABLE_SYCL=ON
$ cd build
$ make
$ ctest .
Disable any backends you don't need in the first step.
Please feel free to ask any questions you have, request features, and report bugs by creating a new issue.
xpu is licensed under the MIT license. See LICENSE for details.