
Version 0.6.2: Stream callback mechanism change

@eyalroz released this 18 Mar 09:51

Changes since v0.6.1:

The most significant change in this version concerns the way callbacks/host functions are supported. This change is motivated mostly by preparation for the upcoming introduction of CUDA graph support (not in this version), which will impose stricter constraints on callbacks - precluding the hack we have been using so far.

So far, a callback was any object invokable with a cuda::stream_t parameter. From now on, we support two kinds of callbacks:

  • A plain function (not a closure), which may be invoked with a pointer to an arbitrary type: cuda::stream_t::enqueue_t::host_function_call(Argument * user_data)
  • An invokable object taking no parameters (e.g. a closure), to which one cannot provide any additional information: cuda::stream_t::enqueue_t::host_invokable(Invokable& invokable)

This lets us avoid the combination of heap allocation at enqueue time and deallocation at launch time - which works well enough for now, but will not be possible when the same callback needs to be invoked multiple times. It also contradicted our principle of not adding layers of abstraction over what CUDA itself provides.
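To illustrate the distinction, here is a minimal usage sketch. It assumes the enqueue_t subobject is reached as stream.enqueue, that the plain-function variant is passed the function together with its user-data pointer, and that the umbrella header is <cuda/api.hpp>; these details are not spelled out above, so treat this as a sketch rather than authoritative usage.

```cpp
#include <cuda/api.hpp>
#include <iostream>

struct completion_info { int batch_id; };

// Kind 1: a plain function, invoked with a pointer to an arbitrary type
void on_batch_done(completion_info* info)
{
    std::cout << "Batch " << info->batch_id << " is done\n";
}

int main()
{
    auto device = cuda::device::current::get();
    auto stream = device.create_stream(cuda::stream::async);

    completion_info info { 17 };

    // ... kernel launches and copies would be enqueued on the stream here ...

    // Plain function + user-data pointer: no closure, no captures
    stream.enqueue.host_function_call(on_batch_done, &info);

    // No-argument invokable: it cannot be handed extra data, so anything it
    // needs must be captured - and it must stay alive until it is invoked
    auto invokable = [&info] {
        std::cout << "Closure sees batch " << info.batch_id << '\n';
    };
    stream.enqueue.host_invokable(invokable);

    stream.synchronize();
}
```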

Of course, the release also has the "usual" long list of minor fixes.

Changes to existing API

  • #473 Redesign of host function / callback enqueue and launch mechanism, see above
  • #459 cuda::kernel::get() now takes a device, not a context - since it can't really do anything useful with non-primary contexts (apriori-compiled kernels are only available via a device's primary context)
  • #477 When creating a new program, we default to assuming it's CUDA C++ and do not require an explicit specification of that fact.

API additions

  • #468 Added a non-CUDA memory type enum value; the memory type of any pointer can now be checked without an error being thrown.
  • #472 Can now pass cuda::memory::region_t's (and thus also cuda::span<T>'s) when enqueueing copy operations on streams; see the sketch after this list.
  • #466 Can now perform copies using cuda::memory::copy_parameters_t<N> (for N=2 or 3), a wrapper of the CUDA driver's richest parameters structure, with multiple convenience functions, for maximum configurability of a copy operation. Note, however, that this structure is not currently "fool-proof", so use it with care and initialize all relevant fields.
  • #463 Can now obtain a raw pointer's context and device without first wrapping it in a cuda::pointer_t
  • #452 Support enqueuing a memory barrier on a stream (one of the "batch stream memory operations")
  • A method of the launch configuration builder for indicating no dynamic shared memory is used
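
As a rough illustration of #472, the following sketch enqueues copies by passing regions rather than raw pointer-and-size pairs. The allocation and free calls, the region_t construction, and the exact enqueue.copy overloads are assumptions not confirmed by these notes.

```cpp
#include <cuda/api.hpp>
#include <cstddef>
#include <vector>

int main()
{
    constexpr std::size_t num_bytes = 1024;
    auto device = cuda::device::current::get();
    auto stream = device.create_stream(cuda::stream::async);

    std::vector<char> host_buffer(num_bytes, 'x');

    // Assumed: a region_t can be built from a pointer and a size in bytes
    cuda::memory::region_t host_region { host_buffer.data(), host_buffer.size() };

    // Assumed: device-side allocation yields (or converts to) a region_t
    auto device_region = cuda::memory::device::allocate(device, num_bytes);

    // With #472, regions are passed directly instead of pointer + size pairs
    stream.enqueue.copy(device_region, host_region);  // host -> device
    stream.enqueue.copy(host_region, device_region);  // device -> host

    stream.synchronize();
    cuda::memory::device::free(device_region);
}
```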

Bug fixes

  • #475 device::get() no longer incorrectly marked as noexcept
  • #467 The array-to-raw-memory copy function now determines the context for the target area, and a new variant of the function takes the context as a parameter.
  • #455 Add missing definition of allocate_managed() in context.hpp
  • #453 Now actually setting the flags when enqueueing a flush_remote_writes() operation on a stream (this is one of the "batch stream memory operations")
  • #450 Fixed an allocation-without-release in cuda::memory::virtual::set_access_mode
  • #449 apriori_compiled_kernel_t::get_attribute() was missing an inline decoration
  • #448 cuda::profiling::mark::range_start() and range_end() were calling create_attributions() the wrong way

Cleanup and warning avoidance

  • #443 Aligned member initialization order(s) in array_t with their declaration order.

Compatibility

  • #462 Can now obtain a pointer's device in CUDA 9.x (not just 10.0 and later)
  • #304 Some CUDA 9.x incompatibilities have been fixed

Other changes

  • #471 Made a few more comparison operators constexpr