CUDA Target

angavrilov edited this page Sep 13, 2010 · 2 revisions

The cuda mode targets NVidia GPU hardware via the proprietary driver API and C compiler.

Resources provided by NVidia

Vendor documentation and software can be downloaded from the NVidia web site.

In particular, the following downloads are required:

  • A compatible driver version.
  • The CUDA toolkit, which contains the compiler.

The SDK download is not necessary because it only contains C++ code examples.

Hardware requirements

Detailed error reporting from GPU code (enabled via optimize debug >= 1) requires a device with a CUDA compute capability of at least 1.1 that also supports mapping CPU memory into the GPU address space.

Device enumeration

The following functions can be used to retrieve information about available hardware:

  • (cuda-device-count) → integer
    Returns the number of CUDA-compatible devices installed.
  • (cuda-device-name device-id) → string
    Returns the name of the referenced device. The ID must be an integer between 0 and one less than the value returned by cuda-device-count.
  • (cuda-device-version device-id) → cons
    Returns the compute capability of the device as a cons of two integers.
  • (cuda-device-total-mem device-id) → integer
    Returns the total amount of physical memory on the device.
  • (cuda-device-attr device-id attr-name) → integer
    Retrieves various other attributes of the device.

The following device attributes exist:

  • :max-threads-per-block :warp-size :max-registers-per-block
  • :max-shared-memory-per-block :total-constant-memory
    Various memory and scheduler related parameters.
  • :max-block-dim-x :max-block-dim-y :max-block-dim-z
  • :max-grid-dim-x :max-grid-dim-y :max-grid-dim-z
    Supported dimensions of the thread grids.
  • :clock-rate :multiprocessor-count
    Hardware performance metrics.
  • :max-pitch :texture-alignment
    Supported alignment metrics.
  • :gpu-overlap :kernel-exec-timeout :integrated
  • :can-map-host-memory :compute-mode
    Device capability and operation mode flags.

For more information, see the documentation for cuDeviceGetAttribute in the CUDA driver API reference on the official web site.
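As a minimal sketch of how the enumeration functions above might be combined, the following collects a summary of every device and checks the requirements for detailed error reporting. The helper names (capability-at-least-p, list-cuda-devices, detailed-error-reporting-supported-p) are hypothetical. The sketch assumes that the library's symbols are accessible in the current package, that the version cons has the form (major . minor), and that flag attributes such as :can-map-host-memory are returned as a non-zero integer when set.

```lisp
;; Hypothetical helpers; only the cuda-device-* functions come from the library.

(defun capability-at-least-p (version major minor)
  "True if VERSION, assumed to be a cons (major . minor), is at least MAJOR.MINOR."
  (destructuring-bind (dev-major . dev-minor) version
    (or (> dev-major major)
        (and (= dev-major major) (>= dev-minor minor)))))

(defun list-cuda-devices ()
  "Return one property list per installed CUDA device."
  (loop for id from 0 below (cuda-device-count)
        collect (list :id id
                      :name (cuda-device-name id)
                      :version (cuda-device-version id)
                      :total-mem (cuda-device-total-mem id)
                      :multiprocessors (cuda-device-attr id :multiprocessor-count))))

(defun detailed-error-reporting-supported-p (device-id)
  "Check compute capability >= 1.1 and host memory mapping support."
  (and (capability-at-least-p (cuda-device-version device-id) 1 1)
       ;; Assumption: the flag attribute is a non-zero integer when supported.
       (plusp (cuda-device-attr device-id :can-map-host-memory))))
```

Calling (list-cuda-devices) at the REPL would then give a quick overview before choosing a device ID for context creation.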

CUDA contexts

All GPU operations require an active CUDA context. It can be created via the following function:

  • (cuda-create-context device-id &optional flags)

The device ID is the ordinal index of the device, as used by the enumeration functions. The optional flags argument is a list that may contain any of the following values:

  • :sched-spin – instructs the system to actively spin while waiting for the GPU.
  • :sched-yield – instructs the system to yield its CPU slice while waiting.
  • :blocking-sync – instructs the system to block the thread.
  • :map-host – enables mapping of CPU memory if supported by the device.

When a context is created, it is bound to the current thread and pushed onto a local stack.

The following function returns the current context of the current thread:

  • (cuda-current-context)

A context can be destroyed in the following way:

  • (cuda-destroy-context (cuda-current-context))
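The create and destroy calls above can be paired so that the context is released even when the body exits non-locally. The wrapper below is only a sketch: call-with-cuda-context is a hypothetical name, and it assumes that the library's symbols are accessible in the current package and that cuda-current-context returns the newly created context immediately after cuda-create-context, as the binding rule above suggests.

```lisp
;; Hypothetical wrapper; only the three cuda-* functions come from the library.

(defun call-with-cuda-context (thunk &key (device-id 0) flags)
  "Create a context on DEVICE-ID, call THUNK, and always destroy the context."
  (cuda-create-context device-id flags)
  ;; Remember the context now, in case THUNK changes the current one.
  (let ((context (cuda-current-context)))
    (unwind-protect
         (funcall thunk)
      (cuda-destroy-context context))))
```

For example, (call-with-cuda-context (lambda () (cuda-current-context)) :flags '(:sched-yield)) would run the body with a context that yields the CPU while waiting.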

Invoking GPU code

Every CUDA context has its own instance of every GPU code module. The instance is created when the module is first accessed while the context is active, and it is automatically reinitialized if the module is modified.

CUDA kernels accept the following predefined keyword parameters:

  • :block-cnt-x
  • :block-cnt-y
    Define the block grid.
  • :thread-cnt-x
  • :thread-cnt-y
  • :thread-cnt-z
    Define the in-block thread grid.

All of the mentioned parameters default to 1.
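Because only two block grid dimensions are available, a large one-dimensional workload has to be spread across both :block-cnt-x and :block-cnt-y. The following sketch computes such a layout with one thread per element. The function name launch-geometry is hypothetical, and the default limit of 65535 is only a placeholder; the real value should be queried with (cuda-device-attr device-id :max-grid-dim-x).

```lisp
;; Hypothetical helper: pure arithmetic, does not call the library.

(defun launch-geometry (element-count
                        &key (threads-per-block 256) (max-grid-dim-x 65535))
  "Return a plist of kernel keyword parameters that covers ELEMENT-COUNT
elements with at least one thread each."
  (let* ((blocks (max 1 (ceiling element-count threads-per-block)))
         ;; Use as few rows of blocks as the x limit allows.
         (blocks-y (ceiling blocks max-grid-dim-x))
         ;; Spread the blocks evenly over those rows.
         (blocks-x (ceiling blocks blocks-y)))
    (list :thread-cnt-x threads-per-block
          :block-cnt-x blocks-x
          :block-cnt-y blocks-y)))
```

The grid may contain a few more threads than elements, so the kernel itself still needs a bounds check. The resulting plist could be spliced into a kernel call with apply, assuming the kernel is callable as an ordinary function; this page does not show the kernel definition syntax.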

Dynamic arrays must be allocated either as CUDA linear memory or as mapped host memory buffers (see the Buffers page for details). For debugging convenience, kernel wrappers can handle any buffer type by allocating temporary areas and copying the data automatically. This feature assumes that all parameters refer to different memory areas, i.e. that they are not aliased.

Error recovery

The driver API was designed by NVidia for use in languages like C++, where fixing a bug requires recompiling and restarting the program. One of the consequences is that once a crash is detected in GPU code, the CUDA context becomes completely unusable and must be re-created from scratch.

To improve usability in a REPL environment, the library implements in-place reinitialization of the current CUDA context and all associated objects. This operation can be invoked via the following function:

  • (cuda-recover)

The operation preserves the state of the Lisp wrapper objects, but modifies the underlying low-level memory pointers and handles. Because a failed context does not even allow reading device memory, the reallocated device memory blocks are filled with zeros.

Recovery is more thorough in debug mode, which is controlled by the following global variable:

  • *cuda-debug*

When debug mode is active (this is the default), the library maintains mirrors of all allocated device memory blocks in ordinary memory. This allows their contents to be recovered, but it requires many additional memory copies and makes all kernel calls strictly synchronous. This mode also makes cuda-create-context automatically include :map-host in the flag list.

For convenience, common operations provide a recover-and-retry restart when a CUDA error condition is signalled.
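In non-interactive code, the restart could also be selected programmatically. The sketch below is one possible approach under several assumptions: the helper names are hypothetical, the restart is assumed to take no arguments, and it is located by its printed name so that no assumption about the library's package is needed. Since this page does not name the CUDA error condition type, the handler is established for error in general.

```lisp
;; Hypothetical helpers built on standard Common Lisp condition handling.

(defun find-restart-by-name (name condition)
  "Return the first active restart whose name is spelled NAME, in any package."
  (find name (compute-restarts condition)
        :key (lambda (restart) (symbol-name (restart-name restart)))
        :test #'string=))

(defun call-with-auto-recovery (thunk &key (max-retries 1))
  "Call THUNK, invoking a RECOVER-AND-RETRY restart at most MAX-RETRIES times."
  (let ((retries 0))
    (handler-bind
        ((error (lambda (condition)
                  (let ((restart (find-restart-by-name "RECOVER-AND-RETRY"
                                                       condition)))
                    ;; Declining (returning normally) lets the error propagate.
                    (when (and restart (< retries max-retries))
                      (incf retries)
                      (invoke-restart restart))))))
      (funcall thunk))))
```

Errors raised where no such restart is active, or after the retry limit is reached, propagate to the caller unchanged.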