ARROW-2447: [C++] Device and MemoryManager API #6295
Conversation
Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? Then could you also rename the pull request title in the following format?
See also:
Force-pushed from 6935e22 to 7b362f0
Force-pushed from b726745 to deb4ee6
As a preliminary, it would be helpful to know the microperformance implications on the zero-copy IPC hot path (i.e. the before/after of https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/read_write_benchmark.cc).
Ok, here are the benchmark results. It seems there is a slowdown on the read path:
@kou You may want to take a look.
Thanks for pinging me. I'll take a look at this later.
This seems similar in concept to umem? Just wondering if there is anything to share between the implementations.
@dhirschfeld It seems so, though
+1
I'm OK with this approach.
I'll update the GLib part as a follow-up task once this pull request is merged.
Force-pushed from deb4ee6 to a8c6337
I only quickly scanned this PR changeset, and yes, umem was introduced to tackle the same problem. While this PR is obviously tied to the Arrow code base, umem's goal was to provide a low-level library (but not "lower" than this PR, IMO) that different libraries could use for managing the memory of different devices. Also, this PR targets the communication between host RAM and the memory of GPU devices, while umem implements an abstraction that supports connecting the memories of arbitrary devices. For instance, one could connect GPU memory to storage devices using the same API used for connecting host RAM and GPU device memory. umem is implemented in C to make the umem model accessible to C libraries, like XND. However, umem also provides a C++ interface that simplifies its usage considerably.
As a logistical matter, if we were to use umem we would have to figure out how to vendor it without introducing an external dependency (since we now have a zero-external-dependency core build). There is also the question of managing the CUDA runtime dependency (e.g. currently we quarantine the runtime CUDA dependency in
Force-pushed from a8c6337 to 12f8e67
Correct me if I'm wrong, but
@pitrou, yes, that is mostly true: umem is not used in any production code atm (AFAIK), but umem, as a prototype of the current idea, is not abandoned. If there is interest, umem will be maintained. I think the problem of making the memory of devices accessible by different devices is relevant for many libraries, not just for Apache Arrow, and it would be better to tackle it in one project rather than having many library-specific implementations.
It seems there is a slight chicken-and-egg problem. Regardless of how it's implemented, I believe Apache Arrow is definitely going to have a public memory and device API that does not expose any third-party library headers, so I would suggest reviewing and merging this PR and exploring the code-sharing project as a follow-up. I'm going to work on reviewing this PR now.
Thank you for working on this, especially all the CUDA work.
I'm thinking about whether there's a way to change things to keep the cost of this memory extensibility
low for simple CPU buffers. Here is one proposal:
- Do not put the `is_cpu_` or `memory_manager_` members in `Buffer`
- Make `is_cpu` virtual, returning `true` by default (per my comment this probably would mean that the `is_cpu()` check would need to be taken out of `Buffer::data/mutable_data`)
- Make `memory_manager` virtual, returning the CPU memory manager by default
- Do not call `memory_manager()` on hot paths in `Buffer` if it can be avoided

Maybe this would cause some problems that make this not feasible? It means that subclasses of `Buffer` that need a different memory manager will have to do more work, but there will be few enough of them anyway that the benefits (restoring `Buffer` ctor performance, keeping the size of the struct the same) are worth it.
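To make the proposal concrete, here is a minimal sketch of what it could look like; this is an illustration only, not the design the PR actually implements, and the `sketch` namespace is hypothetical (only `arrow::MemoryManager` and `arrow::default_cpu_memory_manager()` are taken from the diff below).

```cpp
#include <memory>
#include "arrow/device.h"

namespace sketch {

class Buffer {
 public:
  virtual ~Buffer() = default;

  // Default: a plain Buffer is assumed to live in CPU memory,
  // so no is_cpu_ member is needed.
  virtual bool is_cpu() const { return true; }

  // Default: return the process-wide CPU memory manager.
  virtual std::shared_ptr<arrow::MemoryManager> memory_manager() const {
    return arrow::default_cpu_memory_manager();
  }
};

// Only device-resident subclasses override the virtuals and pay for the
// extra shared_ptr member.
class DeviceBuffer : public Buffer {
 public:
  explicit DeviceBuffer(std::shared_ptr<arrow::MemoryManager> mm)
      : memory_manager_(std::move(mm)) {}

  bool is_cpu() const override { return false; }
  std::shared_ptr<arrow::MemoryManager> memory_manager() const override {
    return memory_manager_;
  }

 private:
  std::shared_ptr<arrow::MemoryManager> memory_manager_;
};

}  // namespace sketch
```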
      data_(data),
      mutable_data_(NULLPTR),
      size_(size),
-     capacity_(size) {}
+     capacity_(size) {
+   SetMemoryManager(default_cpu_memory_manager());
Would there be any value in allowing `memory_manager_` to be nullptr and presuming CPU memory in that case? Do you think there would ever be more than one implementation of a CPU-based memory manager?
Note I wrote this comment before my review-level comment, so feel free to disregard
There can be several instances of CPU-based memory managers (each one with a different MemoryPool). The MemoryPool may be used when e.g. doing a buffer copy.
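For illustration, a small sketch of that point, using `CPUDevice::memory_manager(pool)` as it appears later in this diff; the choice of pools here is arbitrary.

```cpp
#include <memory>
#include "arrow/device.h"
#include "arrow/memory_pool.h"

// Two distinct CPU-based MemoryManager instances, each bound to a different
// MemoryPool; a buffer copy performed through one of them would allocate
// from that manager's pool.
void TwoCpuMemoryManagers() {
  arrow::MemoryPool* default_pool = arrow::default_memory_pool();
  arrow::MemoryPool* system_pool = arrow::system_memory_pool();

  std::shared_ptr<arrow::MemoryManager> mm_default =
      arrow::CPUDevice::memory_manager(default_pool);
  std::shared_ptr<arrow::MemoryManager> mm_system =
      arrow::CPUDevice::memory_manager(system_pool);
}
```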
#ifndef NDEBUG
    CheckCPU();
#endif
    return ARROW_PREDICT_TRUE(is_cpu_) ? data_ : NULLPTR;
Not sure if others have discussed, but this feels slightly heavy-handed to me. Is this potentially nanny-ing the user too much? It seems it would make things more awkward for GPU device users, who will need to cast to `void*` from the result of `address()`.
On the CPU, CUDA memory is typically passed as `CUdeviceptr`, which is an unsigned integer. For example, that's what cuMemAlloc returns.
OK, a fair point (my CUDA is a bit rusty but I do recall this now)
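For readers less familiar with the CUDA driver API, here is a hedged sketch of what this looks like in practice; `cuMemsetD8` is just an arbitrary driver call chosen for illustration, and the exact return type of `Buffer::address()` (an integer, per the PR description) is assumed.

```cpp
#include <memory>
#include <cuda.h>
#include "arrow/buffer.h"

// Sketch: pass a device-resident Arrow buffer to a CUDA driver call.
// Because Buffer::address() returns the address as an integer, no cast to
// void* is needed; it converts directly to CUdeviceptr (an unsigned integer).
CUresult ZeroFill(const std::shared_ptr<arrow::Buffer>& gpu_buffer) {
  CUdeviceptr dptr = static_cast<CUdeviceptr>(gpu_buffer->address());
  return cuMemsetD8(dptr, 0, static_cast<size_t>(gpu_buffer->size()));
}
```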
  static Result<std::shared_ptr<Buffer>> View(std::shared_ptr<Buffer> source,
                                              const std::shared_ptr<MemoryManager>& to);
  static Result<std::shared_ptr<Buffer>> ViewOrCopy(
      std::shared_ptr<Buffer> source, const std::shared_ptr<MemoryManager>& to);
You could use `shared_from_this` and make these instance methods. Not sure what the performance implications of that are.
Hmm, I'm not sure either. I found this comment:
> Far from improving performance, `enable_shared_from_this` has nonzero cost. In particular, it injects a `weak_ptr` into the object (this is not mandated, but strongly implied by the requirements, and it's what everyone does). That increases the object's size by two raw pointers.
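A tiny standalone program illustrating the size overhead the quote describes; `Plain` and `WithShared` are hypothetical types for the comparison, not Arrow classes.

```cpp
#include <cstdio>
#include <memory>

struct Plain {
  void* a;
  long b;
};

struct WithShared : std::enable_shared_from_this<WithShared> {
  void* a;
  long b;
};

int main() {
  // enable_shared_from_this typically embeds a weak_ptr (two raw pointers),
  // which is the per-object size increase mentioned in the quoted comment.
  std::printf("sizeof(Plain)      = %zu\n", sizeof(Plain));
  std::printf("sizeof(WithShared) = %zu\n", sizeof(WithShared));
  return 0;
}
```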
@@ -50,6 +50,8 @@ ARROW_EXPORT
Status SerializeRecordBatch(const RecordBatch& batch, CudaContext* ctx,
                            std::shared_ptr<CudaBuffer>* out);

// TODO deprecate for ipc::ReadMessage()?
Probably
// Test CudaHostBuffer

class TestCudaHostBuffer : public TestCudaBase {
 public:
nit: not needed
  if (bytes_read != flatbuffer_length) {
    return Status::Invalid("Expected to read ", flatbuffer_length,
                           " metadata bytes, but ", "only read ", bytes_read);
  }
  // The buffer could be a non-CPU buffer (e.g. CUDA)
  ARROW_ASSIGN_OR_RAISE(metadata,
                        Buffer::ViewOrCopy(metadata, CPUDevice::memory_manager(pool)));
Nice.
Note that part of the slowdown may be due to ARROW-7754.
Force-pushed from 12f8e67 to fb64386
Force-pushed from fb64386 to 5f4cd7c
Add an abstraction layer to allow safe handling of buffers residing on different devices (the CPU, a GPU...). The API provides convenience functions to view or copy a buffer from one device to the other.
Force-pushed from 5f4cd7c to c665f61
Rebased. Updated benchmarks:
The performance change seems acceptable. I assume this is ready for a final review and merge?
Yes, it should be ready.
+1. Thank you again for working on this, I think this will help a lot with mixed-device Arrow use
// Like AssertBufferEqual, but doesn't call Buffer::data()
void AssertMyBufferEqual(const Buffer& buffer, util::string_view expected) {
  ASSERT_EQ(util::string_view(buffer), expected);
It's curious that this works
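Presumably it works because `Buffer` defines an explicit conversion operator to `util::string_view` that reads the raw pointer member directly rather than going through `data()`; that is an assumption about the Arrow code, but the general pattern would look like this hypothetical stand-in (using `std::string_view` in place of `util::string_view` to keep it self-contained):

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// Hypothetical stand-in for arrow::Buffer, only to illustrate the pattern:
// an explicit conversion operator lets string_view(buffer) compile without
// ever calling a data() accessor that checks is_cpu().
class FakeBuffer {
 public:
  FakeBuffer(const uint8_t* data, int64_t size) : data_(data), size_(size) {}

  explicit operator std::string_view() const {
    return {reinterpret_cast<const char*>(data_), static_cast<size_t>(size_)};
  }

 private:
  const uint8_t* data_;
  int64_t size_;
};

int main() {
  const uint8_t bytes[] = {'a', 'b', 'c'};
  FakeBuffer buffer(bytes, 3);
  // Functional-cast syntax invokes the explicit conversion operator.
  assert(std::string_view(buffer) == "abc");
  return 0;
}
```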
Add an abstraction layer to allow safe handling of buffers residing on different devices (the CPU, a GPU...). The layer exposes two interfaces:
- the `Device` interface exposes information about a particular memory-holding device
- the `MemoryManager` interface allows allocating, copying, reading or writing memory located on a particular device

The `Buffer` API is modified so that calling `data()` fails on non-CPU buffers. A separate `address()` method returns the buffer address as an integer, and is allowed on any buffer.

The API provides convenience functions to view or copy a buffer from one device to the other. For example, an on-GPU buffer can be copied to the CPU, and in some situations a zero-copy CPU view can also be created (depending on the GPU capabilities and how the GPU memory was allocated).

An example use in the PR is IPC. On the write side, a new `SerializeRecordBatch` overload takes a `MemoryManager` argument and is able to serialize data to any kind of memory (CPU or GPU). On the read side, `ReadRecordBatch` now works on any kind of input buffer, and returns record batches backed by either CPU or GPU memory.

This introduces a slight complexity in the CUDA namespace, since there are both `CudaContext` and `CudaMemoryManager` classes. We could solve this by merging the two concepts (but doing so may break compatibility for existing users of CUDA).