Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Add device-specific synchronization API to Buffer #36103

Closed
pitrou opened this issue Jun 15, 2023 · 9 comments · Fixed by #37040
Closed

[C++] Add device-specific synchronization API to Buffer #36103

pitrou opened this issue Jun 15, 2023 · 9 comments · Fixed by #37040

Comments

@pitrou
Copy link
Member

pitrou commented Jun 15, 2023

Describe the enhancement requested

For the C device data interface, and other applications, we'll need to add optional synchronization information to buffers.

Here is for example a possible API. It tries to avoid or minimize additional footprint, especially for the CPU case:

// in device.h
class DeviceSyncEvent {
 public:
  virtual ~DeviceSyncEvent() = default;
  /// \brief Block until synchronization event is ready
  virtual Status Wait() = 0;
};

// in buffer.h
class Buffer {
 public:
  ...
  /// \brief Device-specific synchronization event
  ///
  /// If nullptr is returned (which is always the case for CPU buffers),
  /// no synchronization is required.
  virtual DeviceSyncEvent* sync_event() { return nullptr; }
  // XXX or perhaps:
  virtual std::shared_ptr<DeviceSyncEvent> sync_event() { return nullptr; }
  ...
};

// in arrow/gpu
class CudaSyncEvent : public DeviceSyncEvent {
 public:
  void* cuda_event();  // actually a CUevent
  Status Wait() override;
};

class CudaBuffer : public Buffer {
 public:
  DeviceSyncEvent* sync_event() override { return &sync_event_; }
  ...
 protected:
  CudaSyncEvent sync_event_;
  ...
};

Or, as an alternative, define a DeviceSyncStream instead of a DeviceSyncEvent.

TODO: define lifetime semantics.

Component(s)

C++

@pitrou
Copy link
Member Author

pitrou commented Jun 15, 2023

@kkraus14 @zeroshade

@zeroshade
Copy link
Member

@pitrou Is the idea that the Wait() is actually having the CPU wait for synchronization? Or that it puts a wait on the GPU stream?

@pitrou
Copy link
Member Author

pitrou commented Jun 15, 2023

Wait is a CPU wait call for convenience. You don't have to use it, you can instead use the CUDA event directly. Perhaps it's not a very useful method, in which case it can be removed.

@pitrou
Copy link
Member Author

pitrou commented Jun 15, 2023

Also cc @felipecrv , if you know a bit about GPU APIs

@zeroshade
Copy link
Member

We should probably put a void* raw_event() in the base interface, while the derived impls would have properly typed methods. i.e.: CudaSyncEvent::cuda_event would return a cudaEvent_t* and CudaSyncEvent::event would return a void*. etc. But I guess nitpicks like this can be discussed on the eventual PR for this rather than hashing it out here. I think we can agree that this is definitely needed.

@pitrou
Copy link
Member Author

pitrou commented Jun 15, 2023

AFAICT, we don't expose cuda.h in our own headers. We might want to change this if really necessary, otherwise I think it's better to keep this policy.

@pitrou
Copy link
Member Author

pitrou commented Jun 15, 2023

In any case, the right lifetime semantics will have to be decided, which may entail putting a shared_ptr somewhere in the API.

@kkraus14
Copy link
Contributor

I'm not sure if an event is the right thing for Buffers. In addition to being able to synchronize, device buffers are often managed in a stream ordered fashion. I.E. allocated and more importantly freed asynchronously from a CPU perspective with the ordering being handled via CUDA streams.

I.E. RMM (libcudf's memory manager) has a couple of stream APIs associated to its device_buffer class as well as a private stream member which gets used in its destructor for deallocating. https://github.com/rapidsai/rmm/blob/0c08dd585031f58e2a9dcfbba5608cce10c423b2/include/rmm/device_buffer.hpp#L374-L400

@pitrou
Copy link
Member Author

pitrou commented Jun 15, 2023

Ok, so instead of exposing an event-like API, we could expose a stream-like API.

@zeroshade zeroshade added this to the 14.0.0 milestone Aug 22, 2023
zeroshade added a commit that referenced this issue Aug 22, 2023
### Rationale for this change
Building on the `ArrowDeviceArray` we need to expand the abstractions for handling events and stream synchronization for devices.

### What changes are included in this PR?
Initial Abstract implementations for the new DeviceSync API and a CPU implementation. This will be followed up by a CUDA implementation in a subsequent PR.

### Are these changes tested?
Yes, tests are added for Import/Export DeviceArrays using the DeviceSync handling.

* Closes: #36103

Lead-authored-by: Matt Topol <zotthewizard@gmail.com>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
### Rationale for this change
Building on the `ArrowDeviceArray` we need to expand the abstractions for handling events and stream synchronization for devices.

### What changes are included in this PR?
Initial Abstract implementations for the new DeviceSync API and a CPU implementation. This will be followed up by a CUDA implementation in a subsequent PR.

### Are these changes tested?
Yes, tests are added for Import/Export DeviceArrays using the DeviceSync handling.

* Closes: apache#36103

Lead-authored-by: Matt Topol <zotthewizard@gmail.com>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants