
🛑 DNM Deserialization: zero-copy merge subframes when possible #5112

Closed

Conversation

@gjoseph92 (Collaborator) commented Jul 23, 2021

When merging sub-frames before deserializing, if all the sub-frames are memoryviews backed by the same underlying buffer, "merge" them without copying by making a new memoryview pointing to the same buffer, instead of always copying.

Running @crusaderky's reproducer from #5107 on main:
[screenshot: before]
With this PR:
[screenshot: after]

Thinking more about my comment #5107 (comment) (tldr: is copying unavoidable because we have to convert non-contiguous memory into contiguous memory?), I realized that in many cases, all the frames already are in contiguous memory thanks to #4506, and they're also zero-copy memoryviews thanks to #3980 (go @jakirkham!). So merging frames in that case could become a bookkeeping operation, where we just make a view of the buffer spanning all the sub-frames we want to merge.
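
A minimal sketch of that bookkeeping idea, assuming the sub-frames are adjacent slices of one underlying buffer and that their offsets/lengths are threaded down from `unpack_frames` (the function name, signature, and body here are illustrative, not the PR's actual code):

```python
def merge_subframes(buffer, offsets, lengths):
    """Zero-copy "merge": return a single memoryview of ``buffer`` that
    spans every sub-frame, instead of copying them into a new bytes object.
    """
    view = memoryview(buffer).cast("B")
    # The sub-frames must be back-to-back for one spanning view to be
    # equivalent to concatenating them.
    assert all(
        offsets[i] + lengths[i] == offsets[i + 1] for i in range(len(offsets) - 1)
    )
    return view[offsets[0] : offsets[-1] + lengths[-1]]


buf = bytearray(b"..headerAAAABBBBCCCC")  # prefix + three 4-byte sub-frames
merged = merge_subframes(buf, offsets=[8, 12, 16], lengths=[4, 4, 4])
assert bytes(merged) == b"AAAABBBBCCCC"
assert merged.obj is buf  # still the same underlying buffer, no copy
```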

This appears to work but also feels hacky/brittle to me. It requires calculating and passing down offsets into the underlying buffer, and kinda makes some assumptions about how the chain of calling functions works. I might prefer:

  • Reorganizing the functions such that unpack_frames also handles merging frames, since we already have the original buffer and all of the offsets we need there. However, that would require deserializing headers within unpack_frames, which is probably just as much of a mess.
  • A wrapper class for memoryview that tracks its index when sliced, and logic to merge those. I think we'd have to write this in C/Cython though, otherwise it wouldn't expose the buffer protocol (a toy sketch of the idea follows this list).
  • Some way to get the index/slice a memoryview object is pointing to within its buffer. But I believe there's no Python API for this.
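
A toy sketch of the wrapper-class option above, in pure Python for illustration only; as the bullet notes, a real version would need C/Cython so the wrapper could still export the buffer protocol, which this sketch does not do:

```python
class TrackedView:
    """A memoryview wrapper that remembers which byte range of its base
    buffer it covers, so adjacent slices can later be re-merged without
    copying. Purely illustrative; it does not export the buffer protocol.
    """

    def __init__(self, base, start=0, stop=None):
        self.base = base if isinstance(base, memoryview) else memoryview(base).cast("B")
        self.start = start
        self.stop = self.base.nbytes if stop is None else stop

    def __getitem__(self, s):
        # Only plain contiguous slices, to keep the sketch short.
        start, stop, step = s.indices(self.stop - self.start)
        assert step == 1
        return TrackedView(self.base, self.start + start, self.start + stop)

    def view(self):
        return self.base[self.start : self.stop]


def merge_tracked(views):
    """Merge adjacent TrackedViews over the same base buffer, zero-copy."""
    assert all(v.base is views[0].base for v in views)
    assert all(a.stop == b.start for a, b in zip(views, views[1:]))
    return TrackedView(views[0].base, views[0].start, views[-1].stop)


frames = TrackedView(bytearray(12))
a, b, c = frames[0:4], frames[4:8], frames[8:12]
assert merge_tracked([a, b, c]).view().nbytes == 12
```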

cc @madsbk

This is a hack because in real use, I am pretty sure that `frames` will be either all memoryviews or all not, since they'll either be coming from a comm or from a bytestring. But in tests where we call `loads` directly, this may not be the case.
@jakirkham (Member)

Yeah, I think reworking serialization to be a bit more efficient is one of the things that Mads was looking into. This may involve moving to msgspec. Though I should note that Mads is on PTO atm (so not trying to bug him 😉).

@gjoseph92 (Collaborator, Author)

Using msgspec would make things easier, especially since it'll give us a memoryview when decoding an extension type, which will make this zero-copy logic straightforward. I'm looking forward to that.

Since Mads is going to be refactoring all this anyway, I'd advocate for getting this in as is then, though it's a bit arcane. I have a feeling it'll help memory performance quite a lot. I think the main reason why #4967 helped with memory pressure so much is that transfers were creating these memory spikes, and it just cut down the number of transfers. Making transfers actually zero-copy (at least for NumPy arrays; not sure about pandas yet?) may make high-communication workloads run much more smoothly.
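
For reference, a rough sketch of the msgspec behavior mentioned above, based on msgspec's documented `ext_hook` callback (this is not dask/distributed code; check msgspec's docs for exact details):

```python
import msgspec


def ext_hook(code, data):
    # msgspec passes the extension payload as a memoryview into the
    # input buffer, so returning it as-is keeps deserialization zero-copy.
    return data


msg = msgspec.msgpack.encode(msgspec.msgpack.Ext(1, b"\x00" * 16))
frame = msgspec.msgpack.decode(msg, ext_hook=ext_hook)
assert isinstance(frame, memoryview)
```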

@gjoseph92 gjoseph92 changed the title Serialization: zero-copy merge subframes when possible Deserialization: zero-copy merge subframes when possible Jul 24, 2021
Resolved review threads (outdated): distributed/protocol/core.py, distributed/protocol/serialize.py (×2)
@gjoseph92 (Collaborator, Author)

Hey @crusaderky, any idea why these CI logs are truncated? I'm seeing most of the `not ci1` tests fail in `test_serialize_bytes`, but can't see the actual error. See for example https://github.com/dask/distributed/runs/3195109068.

@crusaderky (Collaborator)

@gjoseph92 looks like the whole Docker container fell over halfway through. Either that, or some issue in the GitHub wrapper around it.

@jrbourbeau (Member)

GitHub actions was down for a bit yesterday. @gjoseph92 could you push an empty commit to rerun CI?

@crusaderky (Collaborator) commented Jul 30, 2021

Maybe it's a segmentation fault that is triggered by repr() which in turn is triggered by printing the stack trace? I would try to deliberately cause one to see if it spells out SIGSEGV explicitly or if you get an equally nebulous output.

@mrocklin (Member) commented Jul 30, 2021 via email

@jakirkham (Member) commented Jul 30, 2021

If the question is when should we merge this, would it make sense to do it after cutting the release?

I should add that Mads is getting back next week, so he may have thoughts on how to build on this to cut down on memory usage.

@gjoseph92 (Collaborator, Author)

Yes, certainly don't merge this until after the release. There definitely seems to be a segfault here, which annoyingly I haven't been able to reproduce locally on macOS yet.

Less useful, but also less complex. Short of writing the frames by hand, I couldn't come up with a test that could even trigger this behavior. I'd feel uncomfortable having that code be untested and only run on what's already an edge case. So if this assert is ever triggered, we can come back and add the zero-copy and test it.
A separate `copy_frames` function makes this more readable and easier to test. Also, I came up with a test for this case that's still contrived, but not ridiculously contrived.

That said, we don't want this copy to happen. And I'm pretty confident it will never happen with real comms, because either the whole message is one buffer (TCP), or memoryviews aren't used at all. This mix-and-match only ever happens in tests; see 1869b18. So maybe we should stick with the assert as a warning to future developers, so nobody messes this up and leaves it running with a silent performance regression?
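
The `copy_frames` helper referenced above isn't shown in this thread; a minimal sketch of what such a fallback might look like (only the name comes from the commit message, the body is assumed, and it assumes the frames are contiguous byte buffers):

```python
def copy_frames(frames):
    """Copy buffer-like sub-frames into one new contiguous bytearray.

    This is the fallback when the sub-frames are *not* adjacent views of
    a single buffer, so a real merge (and hence a copy) is unavoidable.
    """
    out = bytearray(sum(memoryview(f).nbytes for f in frames))
    offset = 0
    for f in frames:
        mv = memoryview(f).cast("B")
        out[offset : offset + mv.nbytes] = mv
        offset += mv.nbytes
    return out
```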
@GPUtester (Collaborator)

Can one of the admins verify this patch?

As noted in ab6119a: I think this error is currently impossible to raise in real use, and we want to keep it that way. We'd like a future test to fail if something causes this case to happen, rather than a silent copy.
@gjoseph92 (Collaborator, Author)

This should get another review. The failing tests were because the whole buffer underlying the frames may have a short prefix from pack_frames (so frames[0] doesn't start at byte 0 in the buffer), which we need to pass all the way down.

This made me think of a strange (I think impossible actually) edge case: when frames contains memoryviews with different underlying buffers, and that initial prefix-offset is passed in. If they're different buffers, it's unlikely they both need the same prefix offset. I say this is impossible because real comms don't mix-and-match like this AFAIK. Currently:

  • TCP: all frames are memoryviews of the same buffer (so same prefix)
  • websockets: all frames are separate bytestrings (not memoryviews so irrelevant)
  • UCX: zero-copy slices of a NumPy array or bytearray? Not totally sure; seems consistent though (not memoryviews so irrelevant)
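
Condensed, the precondition being discussed looks roughly like this (illustrative only, not the PR's code):

```python
def frames_share_buffer(frames):
    """True if every frame is a memoryview over the same exporting object,
    the precondition for building one spanning view with a single prefix
    offset. With real comms this is either always true (TCP) or irrelevant
    (websockets/UCX don't hand us memoryviews), which is why a violation
    becomes an AssertionError instead of a silent copy.
    """
    return (
        len(frames) > 0
        and all(isinstance(f, memoryview) for f in frames)
        and all(f.obj is frames[0].obj for f in frames)
    )
```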

Nonetheless I didn't like leaving this case around since it could fail in very weird ways, so I added logic to copy these subframes that don't match up just in case: ab6119a. But I've changed my mind again and decided to just make this case an AssertionError, because:

  1. as explained above, it shouldn't happen in real use
  2. if some future change or Comm implementation makes it happen, that should be an explicit failure instead of a silent performance degradation.

I'll be out for the next week, so if consensus is to do the copy instead of the assert, just revert e2da610.

This has become more complicated than I wanted given that we're hopefully redoing all this soon. I really wish there was some way to get the current index into the underlying buffer that a memoryview was pointing to. That would eliminate all this passing down offsets and be more reliable. Is there any way I'm missing to do this without writing a C extension to do pointer arithmetic on ((PyMemoryViewObject *)x)->view.buf vs ((PyMemoryViewObject *)memoryview(x.obj))->view.buf?

@jakirkham (Member) commented Jul 31, 2021

UCX uses a mix of NumPy arrays and RMM DeviceBuffers depending on whether the data is in CPU or GPU memory. That said, I wouldn't worry about that. We can probably refresh that layer later to benefit from some of the same things we are doing for TCP; for example, receiving one CPU and one GPU buffer instead of many, and then just indexing into them.

As to the pointer arithmetic point: yes, one can use NumPy.

In [1]: import numpy as np

In [2]: m = memoryview(bytearray(8))

In [3]: a = np.asarray(m)

In [4]: a.__array_interface__["data"][0]
Out[4]: 140691895000496

That said, NumPy has other utility functions like may_share_memory. They are not perfect, but maybe good enough, and could simplify things. IDK if we want to depend on NumPy though. Maybe one can extract similar info out of ctypes?
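
Putting the pointer trick together, the offset of a memoryview inside its base buffer could be recovered like this (a sketch, assuming the view is contiguous; `offset_in_base` is a hypothetical helper, not an existing API):

```python
import numpy as np


def offset_in_base(view):
    """Byte offset of ``view`` within the buffer it was sliced from,
    computed by comparing data pointers via NumPy."""
    base = memoryview(view.obj).cast("B")
    view_ptr = np.asarray(view.cast("B")).__array_interface__["data"][0]
    base_ptr = np.asarray(base).__array_interface__["data"][0]
    return view_ptr - base_ptr


buf = bytearray(16)
assert offset_in_base(memoryview(buf)[5:12]) == 5
```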

@quasiben (Member) commented Aug 2, 2021

add to allowlist

@gjoseph92 gjoseph92 changed the title Deserialization: zero-copy merge subframes when possible 🛑 DNM Deserialization: zero-copy merge subframes when possible Aug 12, 2021
@gjoseph92 (Collaborator, Author)

@jakirkham thanks for the pointer 😆 about NumPy! That is exactly the sort of thing I was looking for.

Using that I was able to come up with a much more sensible implementation of this in a new PR.

@gjoseph92 gjoseph92 closed this Aug 12, 2021
@gjoseph92 gjoseph92 deleted the serialization/zero-copy-merge-frames branch August 12, 2021 21:54

Linked issue (may be closed by this pull request): Unnecessary deep copy causes memory flare on network comms