
ARROW-2447: [C++] Device and MemoryManager API #6295

Closed

Conversation

pitrou
Member

@pitrou pitrou commented Jan 27, 2020

Add an abstraction layer to allow safe handling of buffers residing on different devices (the CPU, a GPU...). The layer exposes two interfaces:

  • the Device interface exposes information about a particular memory-holding device
  • the MemoryManager interface allows allocating, copying, reading or writing memory located on a particular device

The Buffer API is modified so that calling data() fails on non-CPU buffers. A separate address() method returns the buffer address as an integer, and is allowed on any buffer.

The API provides convenience functions to view or copy a buffer from one device to another. For example, an on-GPU buffer can be copied to the CPU, and in some situations a zero-copy CPU view can also be created (depending on the GPU capabilities and how the GPU memory was allocated).

An example use in the PR is IPC. On the write side, a new SerializeRecordBatch overload takes a MemoryManager argument and can serialize data to any kind of memory (CPU or GPU). On the read side, ReadRecordBatch now works on any kind of input buffer, and returns record batches backed by either CPU or GPU memory.

This introduces a slight complexity in the CUDA namespace, since there are now both CudaContext and CudaMemoryManager classes. We could solve this by merging the two concepts, but doing so may break compatibility for existing users of CUDA.
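
As an illustration, here is a minimal sketch of how the new API might be used to make an arbitrary buffer readable from the CPU. It relies on the Buffer::ViewOrCopy() and CPUDevice::memory_manager() signatures shown in the review comments below; treat it as a sketch, since the merged API may differ in details.

#include <memory>
#include <utility>

#include "arrow/buffer.h"
#include "arrow/device.h"
#include "arrow/memory_pool.h"
#include "arrow/result.h"

// Return a CPU-accessible equivalent of `buf`: a zero-copy view when the device
// supports it, otherwise a copy into CPU memory allocated from `pool`.
arrow::Result<std::shared_ptr<arrow::Buffer>> EnsureCPUAccessible(
    std::shared_ptr<arrow::Buffer> buf, arrow::MemoryPool* pool) {
  if (buf->is_cpu()) {
    return buf;  // data() is safe to call on this buffer
  }
  // For a non-CPU buffer, data() is disallowed; address() would still return
  // the device address as an integer.
  return arrow::Buffer::ViewOrCopy(std::move(buf),
                                   arrow::CPUDevice::memory_manager(pool));
}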

@github-actions

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@pitrou pitrou force-pushed the ARROW-2447-device-api-memory-manager branch 2 times, most recently from 6935e22 to 7b362f0 on January 28, 2020 10:47
@pitrou pitrou changed the title from "[WIP] ARROW-2447: [C++] Device and MemoryManager API" to "ARROW-2447: [C++] Device and MemoryManager API" on Jan 28, 2020
@pitrou pitrou force-pushed the ARROW-2447-device-api-memory-manager branch 2 times, most recently from b726745 to deb4ee6 on January 28, 2020 13:42
@wesm
Member

wesm commented Jan 28, 2020

As a preliminary, it would be helpful to know the microperformance implications on the zero-copy IPC hot path (i.e. the before/after of https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/read_write_benchmark.cc)

@pitrou
Member Author

pitrou commented Jan 29, 2020

Ok, here are the benchmark results. It seems there is a slowdown on the read path:

  • before:
------------------------------------------------------------------------------------------
Benchmark                                Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------
WriteRecordBatch/1/real_time         57037 ns        57024 ns        49637 bytes_per_second=17.1216G/s
WriteRecordBatch/4/real_time         22435 ns        22431 ns       125916 bytes_per_second=43.528G/s
WriteRecordBatch/16/real_time        24192 ns        24187 ns       115458 bytes_per_second=40.3679G/s
WriteRecordBatch/64/real_time        32327 ns        32320 ns        84962 bytes_per_second=30.2088G/s
WriteRecordBatch/256/real_time       61745 ns        61732 ns        45951 bytes_per_second=15.816G/s
WriteRecordBatch/1024/real_time     180312 ns       180268 ns        15595 bytes_per_second=5.41596G/s
WriteRecordBatch/4096/real_time     642503 ns       642354 ns         4357 bytes_per_second=1.51993G/s
WriteRecordBatch/8192/real_time    1311063 ns      1310645 ns         1967 bytes_per_second=762.74M/s
ReadRecordBatch/1/real_time            916 ns          915 ns      3036262 bytes_per_second=1066.49G/s
ReadRecordBatch/4/real_time           1599 ns         1598 ns      1745660 bytes_per_second=610.904G/s
ReadRecordBatch/16/real_time          4614 ns         4613 ns       608964 bytes_per_second=211.637G/s
ReadRecordBatch/64/real_time         16549 ns        16546 ns       169957 bytes_per_second=59.0093G/s
ReadRecordBatch/256/real_time        78461 ns        78443 ns        36050 bytes_per_second=12.4465G/s
ReadRecordBatch/1024/real_time      313487 ns       313403 ns         9045 bytes_per_second=3.11516G/s
ReadRecordBatch/4096/real_time     1392722 ns      1392241 ns         2015 bytes_per_second=718.018M/s
ReadRecordBatch/8192/real_time     2754512 ns      2753567 ns         1023 bytes_per_second=363.041M/s
  • after:
------------------------------------------------------------------------------------------
Benchmark                                Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------
WriteRecordBatch/1/real_time         56888 ns        56880 ns        49308 bytes_per_second=17.1665G/s
WriteRecordBatch/4/real_time         21998 ns        21995 ns       128021 bytes_per_second=44.3928G/s
WriteRecordBatch/16/real_time        24570 ns        24567 ns       115878 bytes_per_second=39.7461G/s
WriteRecordBatch/64/real_time        33321 ns        33315 ns        84767 bytes_per_second=29.308G/s
WriteRecordBatch/256/real_time       62866 ns        62856 ns        44577 bytes_per_second=15.534G/s
WriteRecordBatch/1024/real_time     182013 ns       181983 ns        15492 bytes_per_second=5.36534G/s
WriteRecordBatch/4096/real_time     653490 ns       653338 ns         4260 bytes_per_second=1.49438G/s
WriteRecordBatch/8192/real_time    1307078 ns      1306782 ns         2158 bytes_per_second=765.065M/s
ReadRecordBatch/1/real_time           1285 ns         1285 ns      2091915 bytes_per_second=759.691G/s
ReadRecordBatch/4/real_time           2082 ns         2082 ns      1327615 bytes_per_second=469.107G/s
ReadRecordBatch/16/real_time          5681 ns         5680 ns       490930 bytes_per_second=171.898G/s
ReadRecordBatch/64/real_time         20053 ns        20051 ns       100000 bytes_per_second=48.6979G/s
ReadRecordBatch/256/real_time        85744 ns        85733 ns        33086 bytes_per_second=11.3893G/s
ReadRecordBatch/1024/real_time      344091 ns       344038 ns         8083 bytes_per_second=2.8381G/s
ReadRecordBatch/4096/real_time     1394700 ns      1394456 ns         2034 bytes_per_second=717M/s
ReadRecordBatch/8192/real_time     2655597 ns      2655116 ns         1045 bytes_per_second=376.563M/s

@pitrou
Member Author

pitrou commented Jan 30, 2020

@kou You may want to take a look.

@kou
Member

kou commented Jan 30, 2020

Thanks for pinging me. I'll take a look at this later.

@dhirschfeld

This seems similar in concept to umem?
https://github.com/xnd-project/umem

Just wondering if there is anything to share between the implementations.

@pitrou
Member Author

pitrou commented Jan 31, 2020

@dhirschfeld It seems so, though umem seems lower-level. Perhaps @pearu can chime in, he's the author.

Member

@kou kou left a comment

+1

I'm OK with this approach.
I'll update the GLib part as a follow-up task once this pull request is merged.

cpp/src/arrow/buffer.h (outdated, resolved)
cpp/src/arrow/buffer.h (resolved)
cpp/src/arrow/buffer.h (outdated, resolved)
cpp/src/arrow/buffer.h (resolved)
cpp/src/arrow/device.cc (outdated, resolved)
cpp/src/arrow/gpu/cuda_context.cc (resolved)
cpp/src/arrow/gpu/cuda_context.cc (resolved)
cpp/src/arrow/ipc/writer.cc (outdated, resolved)
cpp/src/plasma/io.cc (resolved)
cpp/src/arrow/ipc/message.cc (resolved)
@pitrou pitrou force-pushed the ARROW-2447-device-api-memory-manager branch from deb4ee6 to a8c6337 on February 3, 2020 15:28
@pearu
Contributor

pearu commented Feb 3, 2020

I only quickly scanned this PR changeset, and yes, umem was introduced to tackle the same problem.

While this PR is obviously tied to the Arrow code base, umem's goal was to provide a low-level library (but not "lower" than this PR, IMO) that different libraries could use for managing the memory of different devices. Also, this PR targets communication between host RAM and the memory of GPU devices, while umem implements an abstraction that supports connecting the memories of arbitrary devices. For instance, one could connect GPU memory to storage devices using the same API used for connecting host RAM and GPU device memory.

umem is implemented in C to make the umem model accessible to C libraries, like XND. However, umem also provides a C++ interface that simplifies its usage considerably.

@wesm
Member

wesm commented Feb 3, 2020

As a logistical matter, if we were to use umem we would have to figure out how to vendor it without introducing an external dependency (since we now have a zero-external-dependency core build).

There is also the question of managing the CUDA runtime dependency (e.g. currently we quarantine the runtime CUDA dependency in libarrow_cuda.so).

@pitrou pitrou force-pushed the ARROW-2447-device-api-memory-manager branch from a8c6337 to 12f8e67 on February 5, 2020 16:50
@pitrou
Member Author

pitrou commented Feb 5, 2020

Correct me if I'm wrong, but umem doesn't seem to get much use or much maintenance. I'm uncertain we would gain much by relying on it.

@pearu
Contributor

pearu commented Feb 5, 2020

@pitrou, yes, that is mostly true: umem is not used in any production code atm (AFAIK), but the umem implementation, as a prototype of the idea, is not abandoned. If there is interest, umem will be maintained.

I think the problem of making the memory of devices accessible by different devices is a relevant problem for many libraries, not just for Apache Arrow, and it would be better to tackle this problem in one project rather than having many library-specific implementations.

@wesm
Member

wesm commented Feb 5, 2020

It seems there is a slight chicken-and-egg problem. Regardless of how it's implemented, I believe Apache Arrow is definitely going to have a public memory and device API that does not expose any third-party library headers, so I would suggest reviewing and merging this PR and exploring the code-sharing project as a follow-up. I'm going to work on reviewing this PR now.

Member

@wesm wesm left a comment

Thank you for working on this, especially all the CUDA work.

I'm thinking about whether there's a way to change things to keep the cost of this memory extensibility low for simple CPU buffers. Here is one proposal:

  • Do not put the is_cpu_ or memory_manager_ members in Buffer
  • Make is_cpu virtual, returning true by default (per my comment, this probably would mean that the is_cpu() check would need to be taken out of Buffer::data/mutable_data)
  • Make memory_manager virtual, return the CPU memory manager by default
  • Do not call memory_manager() on hot paths in Buffer if it can be avoided

Maybe this would cause some problems that make this not feasible? It means that subclasses of Buffer that need a different memory manager will have to do more work, but there will be few enough of them anyway that the benefits (restoring Buffer ctor performance, keeping the size of the struct the same) are worth it.
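
For reference, a rough sketch of what that proposal could look like (illustration only, not the code in this PR; default_cpu_memory_manager() is the helper visible in the diff context below):

#include <cstdint>
#include <memory>

namespace arrow {

class MemoryManager;  // introduced by this PR
std::shared_ptr<MemoryManager> default_cpu_memory_manager();

class Buffer {
 public:
  virtual ~Buffer() = default;
  // No is_cpu_ or memory_manager_ members; CPU behaviour is the default.
  virtual bool is_cpu() const { return true; }
  virtual std::shared_ptr<MemoryManager> memory_manager() const {
    return default_cpu_memory_manager();
  }
  // The hot path stays branch-free for plain CPU buffers.
  const uint8_t* data() const { return data_; }

 protected:
  const uint8_t* data_ = nullptr;
};

// A device buffer subclass (e.g. CudaBuffer) would override is_cpu() and
// memory_manager(); only those few subclasses pay the extra cost.

}  // namespace arrow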

data_(data),
mutable_data_(NULLPTR),
size_(size),
capacity_(size) {}
capacity_(size) {
SetMemoryManager(default_cpu_memory_manager());
Member

Would there be any value in allowing memory_manager_ to be nullptr and presuming CPU memory in that case? Do you think there would ever be more than one implementation of a CPU-based memory manager?

Member

Note I wrote this comment before my review-level comment, so feel free to disregard

Member Author

@pitrou pitrou Feb 6, 2020

There can be several instances of CPU-based memory managers (each one with a different MemoryPool). The MemoryPool may be used when e.g. doing a buffer copy.
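
For example (a hedged sketch; my_tracking_pool stands in for any custom MemoryPool instance):

// Two CPU memory managers backed by different pools; a buffer copy made
// through each manager allocates from its associated pool.
auto default_mm = arrow::CPUDevice::memory_manager(arrow::default_memory_pool());
auto tracked_mm = arrow::CPUDevice::memory_manager(my_tracking_pool.get());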

#ifndef NDEBUG
CheckCPU();
#endif
return ARROW_PREDICT_TRUE(is_cpu_) ? data_ : NULLPTR;
Member

Not sure if others have discussed this, but it feels slightly heavy-handed to me. Is this potentially nanny-ing the user too much? It seems it would make things more awkward for GPU device users, who will need to cast the result of address() to void*.

Member Author

On the CPU, CUDA memory is typically passed as CUdeviceptr, which is an unsigned integer. For example that's what cuMemAlloc returns.

Member

OK, a fair point (my CUDA is a bit rusty but I do recall this now)
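
To make that concrete, a hedged sketch of passing a device buffer to the CUDA driver API via address(), assuming address() returns the device address as an unsigned integer as described in the PR summary:

#include <cstddef>

#include <cuda.h>

#include "arrow/buffer.h"

// Copy `nbytes` from an Arrow device buffer to another device allocation.
// CUdeviceptr is an unsigned integer type, so no void* cast is involved.
void CopyToDevice(const arrow::Buffer& device_buffer, CUdeviceptr dst, size_t nbytes) {
  CUdeviceptr src = static_cast<CUdeviceptr>(device_buffer.address());
  cuMemcpyDtoD(dst, src, nbytes);
}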

static Result<std::shared_ptr<Buffer>> View(std::shared_ptr<Buffer> source,
                                            const std::shared_ptr<MemoryManager>& to);
static Result<std::shared_ptr<Buffer>> ViewOrCopy(
    std::shared_ptr<Buffer> source, const std::shared_ptr<MemoryManager>& to);
Member

You could use shared_from_this and make these instance methods. Not sure what the performance implications of that are

Member Author

Hmm, I'm not sure either. I found this comment:

Far from improving performance, enable_shared_from_this has nonzero cost. In particular, it injects a weak_ptr into the object (this is not mandated, but strongly implied by the requirements, and it's what everyone does). That increases the object's size by two raw pointers.
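
The quoted size claim is easy to check; a quick, self-contained example (sizes are implementation-dependent, but typical standard libraries store a weak_ptr, i.e. two pointers, in the base class):

#include <cstdio>
#include <memory>

struct Plain { void* a; void* b; };
struct WithBase : std::enable_shared_from_this<WithBase> { void* a; void* b; };

int main() {
  // Typically prints 16 vs 32 on a 64-bit platform.
  std::printf("%zu vs %zu\n", sizeof(Plain), sizeof(WithBase));
  return 0;
}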

@@ -50,6 +50,8 @@ ARROW_EXPORT
Status SerializeRecordBatch(const RecordBatch& batch, CudaContext* ctx,
                            std::shared_ptr<CudaBuffer>* out);

// TODO deprecate for ipc::ReadMessage()?

Member

Probably

// Test CudaHostBuffer

class TestCudaHostBuffer : public TestCudaBase {
public:
Member

nit: not needed

if (bytes_read != flatbuffer_length) {
  return Status::Invalid("Expected to read ", flatbuffer_length,
                         " metadata bytes, but ", "only read ", bytes_read);
}
// The buffer could be a non-CPU buffer (e.g. CUDA)
ARROW_ASSIGN_OR_RAISE(metadata,
                      Buffer::ViewOrCopy(metadata, CPUDevice::memory_manager(pool)));
Member

Nice.

@pitrou
Member Author

pitrou commented Feb 6, 2020

Note that part of the slowdown may be due to ARROW-7754 (Result<T> is slow).

Add an abstraction layer to allow safe handling of buffers residing on
different devices (the CPU, a GPU...).  The API provides convenience
functions to view or copy a buffer from one device to the other.
@pitrou pitrou force-pushed the ARROW-2447-device-api-memory-manager branch from 5f4cd7c to c665f61 on February 12, 2020 16:45
@pitrou
Member Author

pitrou commented Feb 12, 2020

Rebased. Updated benchmarks:

  • Before:
------------------------------------------------------------------------------------------
Benchmark                                Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------
WriteRecordBatch/1/real_time         56718 ns        56709 ns        37276 bytes_per_second=17.2179G/s
WriteRecordBatch/4/real_time         23413 ns        23409 ns        92371 bytes_per_second=41.7105G/s
WriteRecordBatch/16/real_time        25937 ns        25933 ns        84500 bytes_per_second=37.6511G/s
WriteRecordBatch/64/real_time        34884 ns        34878 ns        62584 bytes_per_second=27.9945G/s
WriteRecordBatch/256/real_time       63044 ns        63032 ns        33631 bytes_per_second=15.4902G/s
WriteRecordBatch/1024/real_time     183534 ns       183487 ns        11373 bytes_per_second=5.32089G/s
WriteRecordBatch/4096/real_time     649403 ns       649262 ns         3214 bytes_per_second=1.50379G/s
WriteRecordBatch/8192/real_time    1309263 ns      1308906 ns         1463 bytes_per_second=763.789M/s
ReadRecordBatch/1/real_time            828 ns          828 ns      2530510 bytes_per_second=1.1516T/s
ReadRecordBatch/4/real_time           1496 ns         1496 ns      1378741 bytes_per_second=652.623G/s
ReadRecordBatch/16/real_time          4155 ns         4154 ns       501789 bytes_per_second=235.059G/s
ReadRecordBatch/64/real_time         14983 ns        14981 ns       139940 bytes_per_second=65.1764G/s
ReadRecordBatch/256/real_time        69903 ns        69893 ns        30210 bytes_per_second=13.9702G/s
ReadRecordBatch/1024/real_time      279993 ns       279942 ns         7486 bytes_per_second=3.48782G/s
ReadRecordBatch/4096/real_time     1111390 ns      1111204 ns         1878 bytes_per_second=899.774M/s
ReadRecordBatch/8192/real_time     2584303 ns      2583353 ns          806 bytes_per_second=386.951M/s
  • After:
------------------------------------------------------------------------------------------
Benchmark                                Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------
WriteRecordBatch/1/real_time         57387 ns        57377 ns        37117 bytes_per_second=17.0172G/s
WriteRecordBatch/4/real_time         24439 ns        24435 ns        86591 bytes_per_second=39.959G/s
WriteRecordBatch/16/real_time        25210 ns        25206 ns        84164 bytes_per_second=38.7374G/s
WriteRecordBatch/64/real_time        33703 ns        33698 ns        61934 bytes_per_second=28.9755G/s
WriteRecordBatch/256/real_time       64205 ns        64193 ns        33096 bytes_per_second=15.2101G/s
WriteRecordBatch/1024/real_time     188936 ns       188888 ns        11068 bytes_per_second=5.16874G/s
WriteRecordBatch/4096/real_time     654949 ns       654809 ns         3163 bytes_per_second=1.49105G/s
WriteRecordBatch/8192/real_time    1454723 ns      1454184 ns         1594 bytes_per_second=687.416M/s
ReadRecordBatch/1/real_time           1058 ns         1058 ns      1985751 bytes_per_second=922.751G/s
ReadRecordBatch/4/real_time           1807 ns         1806 ns      1154435 bytes_per_second=540.562G/s
ReadRecordBatch/16/real_time          4982 ns         4982 ns       422006 bytes_per_second=196.009G/s
ReadRecordBatch/64/real_time         17784 ns        17781 ns       118674 bytes_per_second=54.9138G/s
ReadRecordBatch/256/real_time        80367 ns        80354 ns        26082 bytes_per_second=12.1513G/s
ReadRecordBatch/1024/real_time      321198 ns       321140 ns         6501 bytes_per_second=3.04038G/s
ReadRecordBatch/4096/real_time     1292189 ns      1291978 ns         1634 bytes_per_second=773.881M/s
ReadRecordBatch/8192/real_time     2571412 ns      2570970 ns          816 bytes_per_second=388.891M/s

@wesm
Member

wesm commented Feb 12, 2020

The performance change seems acceptable. I assume this is ready for a final review and merge?

@pitrou
Member Author

pitrou commented Feb 12, 2020

Yes, it should be ready.

Member

@wesm wesm left a comment

+1. Thank you again for working on this; I think this will help a lot with mixed-device Arrow use.

// Like AssertBufferEqual, but doesn't call Buffer::data()
void AssertMyBufferEqual(const Buffer& buffer, util::string_view expected) {
ASSERT_EQ(util::string_view(buffer), expected);
Member

It's curious that this works

@wesm wesm closed this in 9f0c70c Feb 12, 2020
@pitrou pitrou deleted the ARROW-2447-device-api-memory-manager branch February 12, 2020 23:35