
ARROW-2447: [C++] Device and MemoryManager API #6295

Closed

Conversation

pitrou
Member

@pitrou pitrou commented Jan 27, 2020

Add an abstraction layer to allow safe handling of buffers residing on different devices (the CPU, a GPU...). The layer exposes two interfaces:

  • the Device interface exposes information about a particular memory-holding device
  • the MemoryManager interface allows allocating, copying, reading or writing memory located on a particular device

The Buffer API is modified so that calling data() fails on non-CPU buffers. A separate address() method returns the buffer address as an integer, and is allowed on any buffer.

The API provides convenience functions to view or copy a buffer from one device to another. For example, an on-GPU buffer can be copied to the CPU, and in some situations a zero-copy CPU view can also be created (depending on the GPU capabilities and how the GPU memory was allocated).

An example use in the PR is IPC. On the write side, a new SerializeRecordBatch overload takes a MemoryManager argument and can serialize data to any kind of memory (CPU or GPU). On the read side, ReadRecordBatch now works on any kind of input buffer, and returns record batches backed by either CPU or GPU memory.

This introduces a slight complexity in the CUDA namespace, since there are now both CudaContext and CudaMemoryManager classes. We could solve this by merging the two concepts, but doing so may break compatibility for existing users of CUDA.
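
As an illustration, here is a minimal sketch of how the new API might be used to make an arbitrary buffer readable from the CPU. It relies on the Buffer::ViewOrCopy() and CPUDevice::memory_manager() signatures shown in the review comments below; treat it as a sketch, since the merged API may differ in details.

#include <memory>
#include <utility>

#include "arrow/buffer.h"
#include "arrow/device.h"
#include "arrow/memory_pool.h"
#include "arrow/result.h"

// Return a CPU-accessible equivalent of `buf`: a zero-copy view when the device
// supports it, otherwise a copy into CPU memory allocated from `pool`.
arrow::Result<std::shared_ptr<arrow::Buffer>> EnsureCPUAccessible(
    std::shared_ptr<arrow::Buffer> buf, arrow::MemoryPool* pool) {
  if (buf->is_cpu()) {
    return buf;  // data() is safe to call on this buffer
  }
  // For a non-CPU buffer, data() is disallowed; address() would still return
  // the device address as an integer.
  return arrow::Buffer::ViewOrCopy(std::move(buf),
                                   arrow::CPUDevice::memory_manager(pool));
}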

@github-actions

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@pitrou pitrou force-pushed the ARROW-2447-device-api-memory-manager branch 2 times, most recently from 6935e22 to 7b362f0 on January 28, 2020 10:47
@pitrou pitrou changed the title from "[WIP] ARROW-2447: [C++] Device and MemoryManager API" to "ARROW-2447: [C++] Device and MemoryManager API" on Jan 28, 2020
@pitrou pitrou force-pushed the ARROW-2447-device-api-memory-manager branch 2 times, most recently from b726745 to deb4ee6 on January 28, 2020 13:42
@wesm
Member

wesm commented Jan 28, 2020

As a preliminary, it would be helpful to know the microperformance implications on the zero-copy IPC hot path (i.e. the before/after of https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/read_write_benchmark.cc)

@pitrou
Member Author

pitrou commented Jan 29, 2020

Ok, here are the benchmark results. It seems there is a slowdown on the read path:

  • before:
------------------------------------------------------------------------------------------
Benchmark                                Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------
WriteRecordBatch/1/real_time         57037 ns        57024 ns        49637 bytes_per_second=17.1216G/s
WriteRecordBatch/4/real_time         22435 ns        22431 ns       125916 bytes_per_second=43.528G/s
WriteRecordBatch/16/real_time        24192 ns        24187 ns       115458 bytes_per_second=40.3679G/s
WriteRecordBatch/64/real_time        32327 ns        32320 ns        84962 bytes_per_second=30.2088G/s
WriteRecordBatch/256/real_time       61745 ns        61732 ns        45951 bytes_per_second=15.816G/s
WriteRecordBatch/1024/real_time     180312 ns       180268 ns        15595 bytes_per_second=5.41596G/s
WriteRecordBatch/4096/real_time     642503 ns       642354 ns         4357 bytes_per_second=1.51993G/s
WriteRecordBatch/8192/real_time    1311063 ns      1310645 ns         1967 bytes_per_second=762.74M/s
ReadRecordBatch/1/real_time            916 ns          915 ns      3036262 bytes_per_second=1066.49G/s
ReadRecordBatch/4/real_time           1599 ns         1598 ns      1745660 bytes_per_second=610.904G/s
ReadRecordBatch/16/real_time          4614 ns         4613 ns       608964 bytes_per_second=211.637G/s
ReadRecordBatch/64/real_time         16549 ns        16546 ns       169957 bytes_per_second=59.0093G/s
ReadRecordBatch/256/real_time        78461 ns        78443 ns        36050 bytes_per_second=12.4465G/s
ReadRecordBatch/1024/real_time      313487 ns       313403 ns         9045 bytes_per_second=3.11516G/s
ReadRecordBatch/4096/real_time     1392722 ns      1392241 ns         2015 bytes_per_second=718.018M/s
ReadRecordBatch/8192/real_time     2754512 ns      2753567 ns         1023 bytes_per_second=363.041M/s
  • after:
------------------------------------------------------------------------------------------
Benchmark                                Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------
WriteRecordBatch/1/real_time         56888 ns        56880 ns        49308 bytes_per_second=17.1665G/s
WriteRecordBatch/4/real_time         21998 ns        21995 ns       128021 bytes_per_second=44.3928G/s
WriteRecordBatch/16/real_time        24570 ns        24567 ns       115878 bytes_per_second=39.7461G/s
WriteRecordBatch/64/real_time        33321 ns        33315 ns        84767 bytes_per_second=29.308G/s
WriteRecordBatch/256/real_time       62866 ns        62856 ns        44577 bytes_per_second=15.534G/s
WriteRecordBatch/1024/real_time     182013 ns       181983 ns        15492 bytes_per_second=5.36534G/s
WriteRecordBatch/4096/real_time     653490 ns       653338 ns         4260 bytes_per_second=1.49438G/s
WriteRecordBatch/8192/real_time    1307078 ns      1306782 ns         2158 bytes_per_second=765.065M/s
ReadRecordBatch/1/real_time           1285 ns         1285 ns      2091915 bytes_per_second=759.691G/s
ReadRecordBatch/4/real_time           2082 ns         2082 ns      1327615 bytes_per_second=469.107G/s
ReadRecordBatch/16/real_time          5681 ns         5680 ns       490930 bytes_per_second=171.898G/s
ReadRecordBatch/64/real_time         20053 ns        20051 ns       100000 bytes_per_second=48.6979G/s
ReadRecordBatch/256/real_time        85744 ns        85733 ns        33086 bytes_per_second=11.3893G/s
ReadRecordBatch/1024/real_time      344091 ns       344038 ns         8083 bytes_per_second=2.8381G/s
ReadRecordBatch/4096/real_time     1394700 ns      1394456 ns         2034 bytes_per_second=717M/s
ReadRecordBatch/8192/real_time     2655597 ns      2655116 ns         1045 bytes_per_second=376.563M/s

@pitrou
Member Author

pitrou commented Jan 30, 2020

@kou You may want to take a look.

@kou
Member

kou commented Jan 30, 2020

Thanks for pinging me. I'll take a look at this later.

@dhirschfeld

This seems similar in concept to umem?
https://github.com/xnd-project/umem

Just wondering if there is anything to share between the implementations.

@pitrou
Member Author

pitrou commented Jan 31, 2020

@dhirschfeld It seems so, though umem seems lower-level. Perhaps @pearu can chime in, he's the author.

Member

@kou kou left a comment

+1

I'm OK with this approach.
I'll update the GLib part as a follow-up task once this pull request is merged.

cpp/src/arrow/buffer.h (outdated, resolved)
cpp/src/arrow/buffer.h (resolved)
cpp/src/arrow/buffer.h (outdated, resolved)
cpp/src/arrow/buffer.h (resolved)
cpp/src/arrow/device.cc (outdated, resolved)
cpp/src/arrow/gpu/cuda_context.cc (resolved)
cpp/src/arrow/gpu/cuda_context.cc (resolved)
cpp/src/arrow/ipc/writer.cc (outdated, resolved)
cpp/src/plasma/io.cc (resolved)
cpp/src/arrow/ipc/message.cc (resolved)
@pitrou pitrou force-pushed the ARROW-2447-device-api-memory-manager branch from deb4ee6 to a8c6337 on February 3, 2020 15:28
@pearu
Contributor

pearu commented Feb 3, 2020

I only quickly scanned this PR changeset, and yes, umem was introduced to tackle the same problem.

While this PR is obviously tied to the Arrow code base, umem's goal was to provide a low-level library (but not "lower" than this PR, IMO) that different libraries could use for managing the memory of different devices. Also, this PR targets communication between host RAM and the memory of GPU devices, while umem implements an abstraction that supports connecting the memories of arbitrary devices. For instance, one could connect GPU memory to storage devices using the same API used for connecting host RAM and GPU device memory.

umem is implemented in C to make the umem model accessible to C libraries, like XND. However, umem also provides a C++ interface that simplifies its usage considerably.

@wesm
Member

wesm commented Feb 3, 2020

As a logistical matter, if we were to use umem we would have to figure out how to vendor it without introducing an external dependency (since we now have a zero-external-dependency core build).

There is also the question of managing the CUDA runtime dependency (e.g. currently we quarantine the runtime CUDA dependency in libarrow_cuda.so).

@pitrou pitrou force-pushed the ARROW-2447-device-api-memory-manager branch from a8c6337 to 12f8e67 on February 5, 2020 16:50
@pitrou
Member Author

pitrou commented Feb 5, 2020

Correct me if I'm wrong, but umem doesn't seem to get much use or much maintenance. I'm uncertain we would gain much by relying on it.

@pearu
Contributor

pearu commented Feb 5, 2020

@pitrou, yes, that is mostly true: umem is not used in any production code atm (AFAIK), but the umem implementation, as a prototype of the idea, is not abandoned. If there is interest, umem will be maintained.

I think the problem of making the memory of devices accessible by different devices is a relevant problem for many libraries, not just for Apache Arrow, and it would be better to tackle this problem in one project rather than having many library-specific implementations.

@wesm
Member

wesm commented Feb 5, 2020

It seems there is a slight chicken-and-egg problem. Regardless of how it's implemented, I believe Apache Arrow is definitely going to have a public memory and device API that does not expose any third-party library headers, so I would suggest reviewing and merging this PR and exploring the code-sharing project as a follow-up. I'm going to work on reviewing this PR now.

Member

@wesm wesm left a comment

Thank you for working on this, especially all the CUDA work.

I'm thinking about whether there's a way to change things to keep the cost of this memory extensibility low for simple CPU buffers. Here is one proposal:

  • Do not put the is_cpu_ or memory_manager_ members in Buffer
  • Make is_cpu virtual, returning true by default (per my comment, this probably would mean that the is_cpu() check would need to be taken out of Buffer::data/mutable_data)
  • Make memory_manager virtual, return the CPU memory manager by default
  • Do not call memory_manager() on hot paths in Buffer if it can be avoided

Maybe this would cause some problems that make this not feasible? It means that subclasses of Buffer that need a different memory manager will have to do more work, but there will be few enough of them anyway that the benefits (restoring Buffer ctor performance, keeping the size of the struct the same) are worth it.
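
For reference, a rough sketch of what that proposal could look like (illustration only, not the code in this PR; default_cpu_memory_manager() is the helper visible in the diff context below):

#include <cstdint>
#include <memory>

namespace arrow {

class MemoryManager;  // introduced by this PR
std::shared_ptr<MemoryManager> default_cpu_memory_manager();

class Buffer {
 public:
  virtual ~Buffer() = default;
  // No is_cpu_ or memory_manager_ members; CPU behaviour is the default.
  virtual bool is_cpu() const { return true; }
  virtual std::shared_ptr<MemoryManager> memory_manager() const {
    return default_cpu_memory_manager();
  }
  // The hot path stays branch-free for plain CPU buffers.
  const uint8_t* data() const { return data_; }

 protected:
  const uint8_t* data_ = nullptr;
};

// A device buffer subclass (e.g. CudaBuffer) would override is_cpu() and
// memory_manager(); only those few subclasses pay the extra cost.

}  // namespace arrow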

data_(data),
mutable_data_(NULLPTR),
size_(size),
capacity_(size) {}
capacity_(size) {
SetMemoryManager(default_cpu_memory_manager());
Member

Would there be any value in allowing memory_manager_ to be nullptr and presuming CPU memory in that case? Do you think there would ever be more than one implementation of a CPU-based memory manager?

Member

Note I wrote this comment before my review-level comment, so feel free to disregard

Member Author

@pitrou pitrou Feb 6, 2020

There can be several instances of CPU-based memory managers (each one with a different MemoryPool). The MemoryPool may be used when e.g. doing a buffer copy.
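
For example (a hedged sketch; my_tracking_pool stands in for any custom MemoryPool instance):

// Two CPU memory managers backed by different pools; a buffer copy made
// through each manager allocates from its associated pool.
auto default_mm = arrow::CPUDevice::memory_manager(arrow::default_memory_pool());
auto tracked_mm = arrow::CPUDevice::memory_manager(my_tracking_pool.get());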

#ifndef NDEBUG
CheckCPU();
#endif
return ARROW_PREDICT_TRUE(is_cpu_) ? data_ : NULLPTR;
Member

Not sure if others have discussed this, but it feels slightly heavy-handed to me. Is this potentially nanny-ing the user too much? It seems it would make things more awkward for GPU device users, who will need to cast the result of address() to void*.

Member Author

On the CPU, CUDA memory is typically passed as CUdeviceptr, which is an unsigned integer. For example that's what cuMemAlloc returns.

Member

OK, a fair point (my CUDA is a bit rusty but I do recall this now)
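
To make that concrete, a hedged sketch of passing a device buffer to the CUDA driver API via address(), assuming address() returns the device address as an unsigned integer as described in the PR summary:

#include <cstddef>

#include <cuda.h>

#include "arrow/buffer.h"

// Copy `nbytes` from an Arrow device buffer to another device allocation.
// CUdeviceptr is an unsigned integer type, so no void* cast is involved.
void CopyToDevice(const arrow::Buffer& device_buffer, CUdeviceptr dst, size_t nbytes) {
  CUdeviceptr src = static_cast<CUdeviceptr>(device_buffer.address());
  cuMemcpyDtoD(dst, src, nbytes);
}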

static Result<std::shared_ptr<Buffer>> View(std::shared_ptr<Buffer> source,
                                            const std::shared_ptr<MemoryManager>& to);
static Result<std::shared_ptr<Buffer>> ViewOrCopy(
    std::shared_ptr<Buffer> source, const std::shared_ptr<MemoryManager>& to);
Member

You could use shared_from_this and make these instance methods. Not sure what the performance implications of that are

Member Author

Hmm, I'm not sure either. I found this comment:

Far from improving performance, enable_shared_from_this has nonzero cost. In particular, it injects a weak_ptr into the object (this is not mandated, but strongly implied by the requirements, and it's what everyone does). That increases the object's size by two raw pointers.
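
The quoted size claim is easy to check; a quick, self-contained example (sizes are implementation-dependent, but typical standard libraries store a weak_ptr, i.e. two pointers, in the base class):

#include <cstdio>
#include <memory>

struct Plain { void* a; void* b; };
struct WithBase : std::enable_shared_from_this<WithBase> { void* a; void* b; };

int main() {
  // Typically prints 16 vs 32 on a 64-bit platform.
  std::printf("%zu vs %zu\n", sizeof(Plain), sizeof(WithBase));
  return 0;
}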

@@ -50,6 +50,8 @@ ARROW_EXPORT
Status SerializeRecordBatch(const RecordBatch& batch, CudaContext* ctx,
                            std::shared_ptr<CudaBuffer>* out);

// TODO deprecate for ipc::ReadMessage()?

Member

Probably

// Test CudaHostBuffer

class TestCudaHostBuffer : public TestCudaBase {
public:
Member

nit: not needed

if (bytes_read != flatbuffer_length) {
  return Status::Invalid("Expected to read ", flatbuffer_length,
                         " metadata bytes, but ", "only read ", bytes_read);
}
// The buffer could be a non-CPU buffer (e.g. CUDA)
ARROW_ASSIGN_OR_RAISE(metadata,
                      Buffer::ViewOrCopy(metadata, CPUDevice::memory_manager(pool)));
Member

Nice.

@pitrou
Member Author

pitrou commented Feb 6, 2020

Note that part of the slowdown may be due to ARROW-7754 (Result<T> is slow).

Add an abstraction layer to allow safe handling of buffers residing on
different devices (the CPU, a GPU...).  The API provides convenience
functions to view or copy a buffer from one device to the other.
@pitrou pitrou force-pushed the ARROW-2447-device-api-memory-manager branch from 5f4cd7c to c665f61 on February 12, 2020 16:45
@pitrou
Member Author

pitrou commented Feb 12, 2020

Rebased. Updated benchmarks:

  • Before:
------------------------------------------------------------------------------------------
Benchmark                                Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------
WriteRecordBatch/1/real_time         56718 ns        56709 ns        37276 bytes_per_second=17.2179G/s
WriteRecordBatch/4/real_time         23413 ns        23409 ns        92371 bytes_per_second=41.7105G/s
WriteRecordBatch/16/real_time        25937 ns        25933 ns        84500 bytes_per_second=37.6511G/s
WriteRecordBatch/64/real_time        34884 ns        34878 ns        62584 bytes_per_second=27.9945G/s
WriteRecordBatch/256/real_time       63044 ns        63032 ns        33631 bytes_per_second=15.4902G/s
WriteRecordBatch/1024/real_time     183534 ns       183487 ns        11373 bytes_per_second=5.32089G/s
WriteRecordBatch/4096/real_time     649403 ns       649262 ns         3214 bytes_per_second=1.50379G/s
WriteRecordBatch/8192/real_time    1309263 ns      1308906 ns         1463 bytes_per_second=763.789M/s
ReadRecordBatch/1/real_time            828 ns          828 ns      2530510 bytes_per_second=1.1516T/s
ReadRecordBatch/4/real_time           1496 ns         1496 ns      1378741 bytes_per_second=652.623G/s
ReadRecordBatch/16/real_time          4155 ns         4154 ns       501789 bytes_per_second=235.059G/s
ReadRecordBatch/64/real_time         14983 ns        14981 ns       139940 bytes_per_second=65.1764G/s
ReadRecordBatch/256/real_time        69903 ns        69893 ns        30210 bytes_per_second=13.9702G/s
ReadRecordBatch/1024/real_time      279993 ns       279942 ns         7486 bytes_per_second=3.48782G/s
ReadRecordBatch/4096/real_time     1111390 ns      1111204 ns         1878 bytes_per_second=899.774M/s
ReadRecordBatch/8192/real_time     2584303 ns      2583353 ns          806 bytes_per_second=386.951M/s
  • After:
------------------------------------------------------------------------------------------
Benchmark                                Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------
WriteRecordBatch/1/real_time         57387 ns        57377 ns        37117 bytes_per_second=17.0172G/s
WriteRecordBatch/4/real_time         24439 ns        24435 ns        86591 bytes_per_second=39.959G/s
WriteRecordBatch/16/real_time        25210 ns        25206 ns        84164 bytes_per_second=38.7374G/s
WriteRecordBatch/64/real_time        33703 ns        33698 ns        61934 bytes_per_second=28.9755G/s
WriteRecordBatch/256/real_time       64205 ns        64193 ns        33096 bytes_per_second=15.2101G/s
WriteRecordBatch/1024/real_time     188936 ns       188888 ns        11068 bytes_per_second=5.16874G/s
WriteRecordBatch/4096/real_time     654949 ns       654809 ns         3163 bytes_per_second=1.49105G/s
WriteRecordBatch/8192/real_time    1454723 ns      1454184 ns         1594 bytes_per_second=687.416M/s
ReadRecordBatch/1/real_time           1058 ns         1058 ns      1985751 bytes_per_second=922.751G/s
ReadRecordBatch/4/real_time           1807 ns         1806 ns      1154435 bytes_per_second=540.562G/s
ReadRecordBatch/16/real_time          4982 ns         4982 ns       422006 bytes_per_second=196.009G/s
ReadRecordBatch/64/real_time         17784 ns        17781 ns       118674 bytes_per_second=54.9138G/s
ReadRecordBatch/256/real_time        80367 ns        80354 ns        26082 bytes_per_second=12.1513G/s
ReadRecordBatch/1024/real_time      321198 ns       321140 ns         6501 bytes_per_second=3.04038G/s
ReadRecordBatch/4096/real_time     1292189 ns      1291978 ns         1634 bytes_per_second=773.881M/s
ReadRecordBatch/8192/real_time     2571412 ns      2570970 ns          816 bytes_per_second=388.891M/s

@wesm
Member

wesm commented Feb 12, 2020

The performance change seems acceptable. I assume this is ready for a final review and merge?

@pitrou
Member Author

pitrou commented Feb 12, 2020

Yes, it should be ready.

Member

@wesm wesm left a comment

+1. Thank you again for working on this; I think this will help a lot with mixed-device Arrow use.

// Like AssertBufferEqual, but doesn't call Buffer::data()
void AssertMyBufferEqual(const Buffer& buffer, util::string_view expected) {
ASSERT_EQ(util::string_view(buffer), expected);
Member

It's curious that this works

@wesm wesm closed this in 9f0c70c Feb 12, 2020
@pitrou pitrou deleted the ARROW-2447-device-api-memory-manager branch February 12, 2020 23:35