diff --git a/doc/rst/technotes/gpu.rst b/doc/rst/technotes/gpu.rst
index 68100c066e85..d1e81545d7b3 100644
--- a/doc/rst/technotes/gpu.rst
+++ b/doc/rst/technotes/gpu.rst
@@ -14,17 +14,6 @@ The current implementation will generate GPU kernels for certain ``forall``
 and ``foreach`` loops and launch these onto a GPU when the current locale
 (e.g. ``here``) is assigned to a special (sub)locale representing a GPU.
 
-For more information about what loops are eligible for GPU execution see the
-`Overview`_ section. For more information about what is supported see the
-requirements and `Requirements and Limitations`_ section. To see an example
-program written in Chapel that will execute on a GPU see the code listing in
-the `Examples`_ section. For more information about specific features related
-to GPU support see the subsections under `GPU Support Features`_. Additional
-information about GPU Support can be found in the "Ongoing Efforts" slide decks
-of our `release notes <https://chapel-lang.org/releaseNotes.html>`_; however,
-be aware that information presented in release notes for prior releases may be
-out-of-date.
-
 .. contents::
 
 Overview
@@ -41,7 +30,9 @@ strategies).
 Chapel will launch kernels for all eligible loops that are encountered by tasks
 executing on a GPU sublocale. Loops are eligible when:
 
-* They are order-independent (e.g., ``forall`` or ``foreach``).
+* They are order-independent, i.e., `forall
+  <../users-guide/datapar/forall.html>`_ or ``foreach`` loops over
+  iterators that are also order-independent.
 
 * They only make use of known compiler primitives that are fast and local.
   Here "fast" means "safe to run in a signal handler" and "local" means
   "doesn't cause any network communication".
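+
+For example, the following sketch satisfies both criteria and will be launched
+as a GPU kernel (the array size and element-wise operation are arbitrary,
+chosen purely for illustration):
+
+.. code-block:: chapel
+
+   on here.gpus[0] {
+     var A: [1..1024] int;      // allocated on the GPU sublocale
+     foreach i in 1..1024 do    // order-independent, fast, and local
+       A[i] = i * i;
+   }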
@@ -112,89 +103,53 @@ Examples with multiple GPUs:
 * `onAllGpusOnAllLocales `_ -- simple example using all GPUs and locales
 * `copyToLocaleThenToGpu `_ -- stream-like example (with data initialized
   on Locale 0 then transferred to other locales and GPUs)
 
-Setup and Compilation
----------------------
-
-To enable GPU support set the environment variable ``CHPL_LOCALE_MODEL=gpu``
-before building Chapel.
+Setup
+-----
 
-Chapel's build system will automatically try and deduce what type of GPU you
-have and where your installation of relevant runtime (e.g. CUDA or ROCM) are.
-If the type of GPU is not detected you may set ``CHPL_GPU_CODEGEN`` manually to
-either ``cuda`` (for NVIDIA GPUs) or ``rocm`` (for AMD GPUs). If the relevant
-runtime path is not automatically detected (or you would like to use a
-different installation) you may set ``CHPL_CUDA_PATH`` and/or
-``CHPL_ROCM_PATH``.
-
-You may also have to set the ``CHPL_GPU_ARCH`` environment variable. When
-``CHPL_GPU_CODEGEN`` is set to ``cuda`` by default we target compute capability
-6.0 (``--cuda-gpu-arch=sm_60`` in clang). When ``CHPL_GPU_CODEGEN`` is set to
-``rocm`` we assign ``CHPL_GPU_ARCH`` to ``gfx906`` by default. You may modify
-this by changing ``CHPL_GPU_ARCH`` or by passing it to ``chpl`` via
-``--gpu-arch``.
-
-We also suggest setting ``CHPL_RT_NUM_THREADS_PER_LOCALE=1`` (this is necessary
-if using CUDA 10).
-
-To compile a program simply execute ``chpl`` as normal. To ensure that a loop
-is executing on the GPU you can use the operations in the :mod:`GpuDiagnostics`
-module or use the :proc:`~GPU.assertOnGpu()` proc from the :mod:`GPU` module.
-
-Requirements and Limitations
-----------------------------
-
-Because of the early nature of the GPU support project there are a number of
-limitations. We provide a (non exhaustive) list of these limitations in this
-section; many of them will be addressed in upcoming editions.
-
-* We currently support NVIDIA and AMD GPUs
+Requirements
+~~~~~~~~~~~~
 
 * ``LLVM`` must be used as Chapel's backend compiler (i.e. ``CHPL_LLVM`` must
   be set to ``system`` or ``bundled``). For more information about these
   settings see :ref:`Optional Settings `.
 
-* If using a ``system`` LLVM it must have been built with support for the
-  relevant target of GPU you wish to generate code for (e.g. NVPTX to target
-  NVIDIA GPUs and AMDGPU to target AMD GPUs).
+  * If using a ``system`` LLVM it must have been built with support for the
+    relevant GPU target you wish to generate code for (e.g. NVPTX to target
+    NVIDIA GPUs and AMDGPU to target AMD GPUs).
 
-* If using a system install of ``LLVM`` we expect this to be the same
-  version as the bundled version (currently 14). Older versions may
-  work; however, we only make efforts to test GPU support with this version.
+  * If using a system install of ``LLVM`` we expect it to be the same
+    version as the bundled version (currently 14). Older versions may
+    work; however, we only make efforts to test GPU support with this version.
 
-* ``CHPL_TASKS=qthreads`` is required for GPU support.
+* Either ``nvcc`` (for NVIDIA) or ``hipcc`` (for AMD) must be available; Chapel
+  uses libraries included in these packages and will automatically deduce the
+  path to these libraries based on the location of the ``nvcc``/``hipcc``
+  executable. Note that the automatically deduced paths may be overridden by
+  manually setting the ``CHPL_CUDA_PATH`` or ``CHPL_ROCM_PATH`` environment
+  variables.
 
-* PGAS style communication is not available within GPU kernels; that is:
-  reading from or writing to a variable that is stored on a different locale
-  from inside a GPU eligible loop (when executing on a GPU) is not supported.
+GPU-Related Environment Variables
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-* There is no user-level feature to specify GPU block size on a
-  per-kernel basis. This can be set on a program wide basis at compile-time by
-  passing ``--gpu-block-size=size`` to the compiler or setting it with the
-  ``CHPL_GPU_BLOCK_SIZE`` environment variable.
-
-* The use of most ``extern`` functions within a GPU eligible loop is not supported
-  (a limited set of functions used by Chapel's runtime library are supported).
-
-  * Various functions within Chapel's standard modules call unsupported
-    ``extern`` functions and thus are not supported in GPU eligible loops.
-
-* Runtime checks such as bounds checks and nil-dereference checks are
-  automatically disabled for CHPL_LOCALE_MODEL=gpu.
-
-* For AMD GPUs:
-
-  * Can only be used with local builds (i.e., CHPL_COMM=none)
-
-  * Certain 64-bit math functions are unsupported. To see what does
-    and doesn't work see `this test
-    `_
-    and note which operations are executed when `excludeForRocm == true`.
+To enable GPU support, set the environment variable ``CHPL_LOCALE_MODEL=gpu``
+before building Chapel.
+
-* For loops to be considered eligible for execution on a GPU they
-  must fulfill the requirements discussed in the `Overview`_ section.
-
+Chapel's build system will automatically try to deduce what type of GPU you
+have and where the relevant runtime (e.g. CUDA or ROCm) is installed.
+If the type of GPU is not detected you may set ``CHPL_GPU_CODEGEN`` manually to
+either ``cuda`` (for NVIDIA GPUs) or ``rocm`` (for AMD GPUs). If the relevant
+runtime path is not automatically detected (or you would like to use a
+different installation) you may set ``CHPL_CUDA_PATH`` and/or
+``CHPL_ROCM_PATH``.
+
-* Associative arrays cannot be used on GPU sublocales with
-  ``CHPL_GPU_MEM_STRAGETY=array_on_device``.
-
+The ``CHPL_GPU_ARCH`` environment variable can be set to control the GPU
+architecture to compile for. The default value is ``sm_60`` for
+``CHPL_GPU_CODEGEN=cuda`` and ``gfx906`` for ``CHPL_GPU_CODEGEN=rocm``. You may
+also use the ``--gpu-arch`` compiler flag to set the GPU architecture. For a
+list of possible values please refer to the `CUDA Programming Guide
+<https://docs.nvidia.com/cuda/cuda-c-programming-guide/>`_ for NVIDIA or the
+"processor" values in `this table in the LLVM documentation
+<https://llvm.org/docs/AMDGPUUsage.html#processors>`_ for AMD.
 
 GPU Support Features
 --------------------
@@ -229,6 +184,13 @@ an error if one of the aforementioned requirements is not met. This check might
 also occur if :proc:`~GPU.assertOnGpu()` is placed elsewhere in the loop
 depending on the presence of control flow.
 
+Utilities in the :mod:`Memory.Diagnostics <Memory>` module can be used to
+monitor GPU memory allocations and detect memory leaks. For example,
+:proc:`startVerboseMem() <Memory.Diagnostics.startVerboseMem>` and
+:proc:`stopVerboseMem() <Memory.Diagnostics.stopVerboseMem>` can be used to
+enable and disable output from memory allocations and deallocations. GPU-based
+operations will be marked in the generated output.
+
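+As a minimal sketch of these diagnostics in action (the array size and kernel
+body here are arbitrary, chosen purely for illustration):
+
+.. code-block:: chapel
+
+   use GPU, GpuDiagnostics;
+
+   startGpuDiagnostics();
+   on here.gpus[0] {
+     var A: [1..1024] int;
+     foreach i in 1..1024 {
+       assertOnGpu();            // fails if this loop is not GPU eligible
+       A[i] = 2 * i;
+     }
+   }
+   stopGpuDiagnostics();
+   writeln(getGpuDiagnostics()); // per-locale counts, e.g. kernel launches
+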
 Multi-Locale Support
 ~~~~~~~~~~~~~~~~~~~~
 
@@ -246,7 +208,7 @@ An idiomatic way to use all GPUs available across locales is with nested
 
   coforall loc in Locales do on loc {
     coforall gpu in here.gpus do on gpu {
-      forall {
+      foreach i in 1..n {
        // ...
      }
    }
@@ -261,34 +223,98 @@ For more examples see the tests under |multi_locale_dir|_ available from our `pu
 
 Memory Strategies
 ~~~~~~~~~~~~~~~~~
 
-Currently by default Chapel uses unified memory feature to store data that is
-allocated on a GPU sublocale (i.e. ``here.gpus[0]``). Under unified memory the
-CUDA driver implicitly manages the migration of data to and from the GPU as
-necessary.
+The ``CHPL_GPU_MEM_STRATEGY`` environment variable can be used to choose between
+two different memory strategies.
 
-We provide an alternate memory allocation strategy that stores array data
-directly on the device and store other data on the host. There are multiple
+The current default strategy is ``unified_memory``. This strategy applies to all
+data allocated on a GPU sublocale (i.e. ``here.gpus[0]``). Under unified memory
+the underlying GPU implementation implicitly manages the migration of data to
+and from the GPU as necessary.
+
+The alternative is to set the environment variable explicitly to
+``array_on_device``. This strategy stores array data directly on the device and
+stores other data on the host in a page-locked manner. There are multiple
 benefits to using this strategy including that it enables users to have more
 explicit control over memory management, may be required for Chapel to
 interoperate with various third-party communication libraries, and may be
 necessary to achieve good performance. As such it may become the default memory
 strategy we use in the future. Be aware though that because this strategy is
-relatively new addition it hasn't been as thoroughly tested as our
-unified-memory based approach.
-
-To use this new strategy set the environment variable ``CHPL_GPU_MEM_STRATEGY``
-to ``array_on_device``. For more examples that work with this strategy see the
-tests under |page_lock_mem_dir|_ available from our `public Github repository
-<https://github.com/chapel-lang/chapel>`_.
-
-.. |page_lock_mem_dir| replace:: ``test/gpu/native/page-locked-mem/``
-.. _page_lock_mem_dir: https://github.com/chapel-lang/chapel/tree/main/test/gpu/native/page-locked-mem
+a relatively new addition, it hasn't been as thoroughly tested as our unified
+memory based approach.
 
 Note that host data can be accessed from within a GPU eligible loop running on
 the device via a direct-memory transfer.
 
-One limitation with memory access in this mode is that we do not support direct
-reads or writes from the host into individual elements of array data allocated
-on the GPU (e.g. ``use(A[i])`` or ``A[i] = ...``). Array data accessed "as a
-whole" (e.g. ``writeln(A)``) will work, however.
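+
+As a sketch of that host-access behavior (the names and sizes are arbitrary,
+and this assumes the ``array_on_device`` strategy described above):
+
+.. code-block:: chapel
+
+   var H: [1..32] int = 1;    // host-allocated data
+   on here.gpus[0] {
+     var A: [1..32] int;      // device-allocated under array_on_device
+     foreach i in 1..32 do
+       A[i] = H[i] + i;       // host data read via direct-memory transfer
+   }
+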
+Debugger and Profiler Support for NVIDIA
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+As of Chapel 1.30.0, ``cuda-gdb`` and `NVIDIA Nsight Compute
+<https://developer.nvidia.com/nsight-compute>`_ can be used to debug and
+profile GPU kernels. We have limited experience with both of these tools;
+however, compiling with ``-g`` and running the application in ``cuda-gdb`` can
+help uncover segmentation faults coming from GPU kernels.
+
+Similarly, Nsight Compute can be used to collect detailed performance metrics
+from GPU kernels generated by the Chapel compiler. Using ``-g`` enables Chapel
+line numbers to be associated with performance metrics; however, it also
+thwarts optimizations done by the backend assembler. In our experience, this
+can reduce execution performance significantly, making profiling less valuable.
+To avoid this, pass ``--gpu-ptxas-enforce-optimization`` alongside ``-g`` (and,
+of course, ``--fast``) when compiling.
+
+Known Limitations
+-----------------
+
+We are aware of the following limitations and plan to work on them among other
+improvements in the future.
+
+* Intel GPUs are not supported yet.
+
+* For AMD GPUs:
+
+  * They can only be used with local builds (i.e., ``CHPL_COMM=none``).
+
+  * Certain 64-bit math functions are unsupported. To see what does
+    and doesn't work see `this test
+    `_
+    and note which operations are executed when ``excludeForRocm == true``.
+
+* Distributed arrays cannot be used within GPU kernels.
+
+* PGAS style communication is not available within GPU kernels; that is,
+  reading from or writing to a variable that is stored on a different locale
+  from inside a GPU eligible loop (when executing on a GPU) is not supported.
+
+* Runtime checks such as bounds checks and nil-dereference checks are
+  automatically disabled for ``CHPL_LOCALE_MODEL=gpu``; i.e., ``--no-checks``
+  is implied when compiling.
+
+* The use of most ``extern`` functions within a GPU eligible loop is not
+  supported (a limited set of functions used by Chapel's runtime library are
+  supported).
+
+* Associative arrays cannot be used on GPU sublocales with
+  ``CHPL_GPU_MEM_STRATEGY=array_on_device``.
+
+* If using CUDA 10, only a single thread per locale can be used; i.e., you have
+  to set ``CHPL_RT_NUM_THREADS_PER_LOCALE=1``.
+
+* ``CHPL_TASKS=fifo`` is not supported. Note that the `fifo tasking layer
+  <../usingchapel/tasks.html#chpl-tasks-fifo>`_ is the default only on Cygwin
+  and NetBSD.
+
+Further Information
+-------------------
+
+* Please refer to issues with the `GPU Support label
+  `_ for other known limitations and issues.
+
+* Alternatively, you can add the `bug label
+  `_ to that filter to see known bugs only.
+
+* Additional information about GPU support can be found in the "Ongoing
+  Efforts" slide decks of our `release notes
+  <https://chapel-lang.org/releaseNotes.html>`_; however, be aware that
+  information presented in release notes for prior releases may be out-of-date.