
🛑 DNM Deserialization: zero-copy merge subframes when possible #5112

Closed

Conversation

@gjoseph92 (Collaborator) commented Jul 23, 2021

When merging sub-frames before deserializing, if all the sub-frames are memoryviews backed by the same underlying buffer, "merge" them without copying by making a new memoryview pointing to the same buffer, instead of always copying.

Running @crusaderky's reproducer from #5107 on main:
[screenshot: before]
With this PR:
[screenshot: after]

Thinking more about my comment #5107 (comment) (tldr: is copying unavoidable because we have to convert non-contiguous memory into contiguous memory?), I realized that in many cases, all the frames already are in contiguous memory thanks to #4506, and they're also zero-copy memoryviews thanks to #3980 (go @jakirkham!). So merging frames in that case could become a bookkeeping operation, where we just make a view of the buffer spanning all the sub-frames we want to merge.
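
A minimal sketch of that bookkeeping idea, assuming the sub-frames are adjacent slices of one underlying buffer and that their offsets/lengths are threaded down from `unpack_frames` (the function name, signature, and body here are illustrative, not the PR's actual code):

```python
def merge_subframes(buffer, offsets, lengths):
    """Zero-copy "merge": return a single memoryview of ``buffer`` that
    spans every sub-frame, instead of copying them into a new bytes object.
    """
    view = memoryview(buffer).cast("B")
    # The sub-frames must be back-to-back for one spanning view to be
    # equivalent to concatenating them.
    assert all(
        offsets[i] + lengths[i] == offsets[i + 1] for i in range(len(offsets) - 1)
    )
    return view[offsets[0] : offsets[-1] + lengths[-1]]


buf = bytearray(b"..headerAAAABBBBCCCC")  # prefix + three 4-byte sub-frames
merged = merge_subframes(buf, offsets=[8, 12, 16], lengths=[4, 4, 4])
assert bytes(merged) == b"AAAABBBBCCCC"
assert merged.obj is buf  # still the same underlying buffer, no copy
```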

This appears to work but also feels hacky/brittle to me. It requires calculating and passing down offsets into the underlying buffer, and kinda makes some assumptions about how the chain of calling functions works. I might prefer:

  • Reorganizing the functions such that unpack_frames also handles merging frames, since we already have the original buffer and all of the offsets we need there. However, that would require deserializing headers within unpack_frames, which is probably just as much of a mess.
  • A wrapper class for memoryview that tracks its index when sliced, and logic to merge those. I think we'd have to write this in C/Cython though, otherwise it wouldn't expose the buffer protocol (a toy sketch of the idea follows this list).
  • Some way to get the index/slice a memoryview object is pointing to within its buffer. But I believe there's no Python API for this.
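
A toy sketch of the wrapper-class option above, in pure Python for illustration only; as the bullet notes, a real version would need C/Cython so the wrapper could still export the buffer protocol, which this sketch does not do:

```python
class TrackedView:
    """A memoryview wrapper that remembers which byte range of its base
    buffer it covers, so adjacent slices can later be re-merged without
    copying. Purely illustrative; it does not export the buffer protocol.
    """

    def __init__(self, base, start=0, stop=None):
        self.base = base if isinstance(base, memoryview) else memoryview(base).cast("B")
        self.start = start
        self.stop = self.base.nbytes if stop is None else stop

    def __getitem__(self, s):
        # Only plain contiguous slices, to keep the sketch short.
        start, stop, step = s.indices(self.stop - self.start)
        assert step == 1
        return TrackedView(self.base, self.start + start, self.start + stop)

    def view(self):
        return self.base[self.start : self.stop]


def merge_tracked(views):
    """Merge adjacent TrackedViews over the same base buffer, zero-copy."""
    assert all(v.base is views[0].base for v in views)
    assert all(a.stop == b.start for a, b in zip(views, views[1:]))
    return TrackedView(views[0].base, views[0].start, views[-1].stop)


frames = TrackedView(bytearray(12))
a, b, c = frames[0:4], frames[4:8], frames[8:12]
assert merge_tracked([a, b, c]).view().nbytes == 12
```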

cc @madsbk

This is a hack because in real use, I am pretty sure that `frames` will be either all memoryviews or all not, since they'll either be coming from a comm or from a bytestring. But in tests where we call `loads` directly, this may not be the case.
@jakirkham (Member)

Yeah, I think reworking serialization to be a bit more efficient is one of the things that Mads was looking into. This may involve moving to msgspec. Though I should note that Mads is on PTO atm (so not trying to bug him 😉).

@gjoseph92 (Collaborator, Author)

Using msgspec would make things easier, especially since it'll give us a memoryview when decoding an extension type, which will make this zero-copy logic straightforward. I'm looking forward to that.

Since Mads is going to be refactoring all this anyway, I'd advocate for getting this in as is then, though it's a bit arcane. I have a feeling it'll help memory performance quite a lot. I think the main reason why #4967 helped with memory pressure so much is that transfers were creating these memory spikes, and it just cut down the number of transfers. Making transfers actually zero-copy (at least for NumPy arrays; not sure about pandas yet?) may make high-communication workloads run much more smoothly.
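
For reference, a rough sketch of the msgspec behavior mentioned above, based on msgspec's documented `ext_hook` callback (this is not dask/distributed code; check msgspec's docs for exact details):

```python
import msgspec


def ext_hook(code, data):
    # msgspec passes the extension payload as a memoryview into the
    # input buffer, so returning it as-is keeps deserialization zero-copy.
    return data


msg = msgspec.msgpack.encode(msgspec.msgpack.Ext(1, b"\x00" * 16))
frame = msgspec.msgpack.decode(msg, ext_hook=ext_hook)
assert isinstance(frame, memoryview)
```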

@gjoseph92 gjoseph92 changed the title Serialization: zero-copy merge subframes when possible Deserialization: zero-copy merge subframes when possible Jul 24, 2021
Resolved review threads (outdated): distributed/protocol/core.py, distributed/protocol/serialize.py (×2)
@gjoseph92 (Collaborator, Author)

Hey @crusaderky, any idea why these CI logs are truncated? I'm seeing most of the `not ci1` tests fail in `test_serialize_bytes`, but can't see the actual error. See for example https://github.com/dask/distributed/runs/3195109068.

@crusaderky (Collaborator)

@gjoseph92 looks like the whole Docker container fell over halfway through. Either that, or some issue in the GitHub wrapper around it.

@jrbourbeau (Member)

GitHub actions was down for a bit yesterday. @gjoseph92 could you push an empty commit to rerun CI?

@crusaderky (Collaborator) commented Jul 30, 2021

Maybe it's a segmentation fault that is triggered by repr() which in turn is triggered by printing the stack trace? I would try to deliberately cause one to see if it spells out SIGSEGV explicitly or if you get an equally nebulous output.

@mrocklin (Member) commented Jul 30, 2021 via email

@jakirkham (Member) commented Jul 30, 2021

If the question is when should we merge this, would it make sense to do it after cutting the release?

I should add that Mads is getting back next week, so he may have thoughts on how to build on this to cut down on memory usage.

@gjoseph92 (Collaborator, Author)

Yes, certainly don't merge this until after the release. There definitely seems to be a segfault here, which annoyingly I haven't been able to reproduce locally on macOS yet.

Less useful, but also less complex. Short of writing the frames by hand, I couldn't come up with a test that could even trigger this behavior. I'd feel uncomfortable having that code be untested and only run on what's already an edge case. So if this assert is ever triggered, we can come back and add the zero-copy and test it.
A separate `copy_frames` function makes this more readable and easier to test. Also, I came up with a test for this case that's still contrived, but not ridiculously contrived.

That said, we don't want this copy to happen. And I'm pretty confident it will never happen with real comms, because either the whole message is one buffer (TCP), or memoryviews aren't used at all. This mix-and-match only ever happens in tests; see 1869b18. So maybe we should stick with the assert as a warning to future developers, so nobody messes this up and leaves it running with a silent performance regression?
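
The `copy_frames` helper referenced above isn't shown in this thread; a minimal sketch of what such a fallback might look like (only the name comes from the commit message, the body is assumed, and it assumes the frames are contiguous byte buffers):

```python
def copy_frames(frames):
    """Copy buffer-like sub-frames into one new contiguous bytearray.

    This is the fallback when the sub-frames are *not* adjacent views of
    a single buffer, so a real merge (and hence a copy) is unavoidable.
    """
    out = bytearray(sum(memoryview(f).nbytes for f in frames))
    offset = 0
    for f in frames:
        mv = memoryview(f).cast("B")
        out[offset : offset + mv.nbytes] = mv
        offset += mv.nbytes
    return out
```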
@GPUtester (Collaborator)

Can one of the admins verify this patch?

As noted in ab6119a: I think this error is currently impossible to raise in real use, and we want to keep it that way. We'd like a future test to fail if something causes this case to happen, rather than a silent copy.
@gjoseph92 (Collaborator, Author)

This should get another review. The failing tests were because the whole buffer underlying the frames may have a short prefix from pack_frames (so frames[0] doesn't start at byte 0 in the buffer), which we need to pass all the way down.

This made me think of a strange (I think impossible actually) edge case: when frames contains memoryviews with different underlying buffers, and that initial prefix-offset is passed in. If they're different buffers, it's unlikely they both need the same prefix offset. I say this is impossible because real comms don't mix-and-match like this AFAIK. Currently:

  • TCP: all frames are memoryviews of the same buffer (so same prefix)
  • websockets: all frames are separate bytestrings (not memoryviews so irrelevant)
  • UCX: zero-copy slices of a NumPy array or bytearray? Not totally sure; seems consistent though (not memoryviews so irrelevant)
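
Condensed, the precondition being discussed looks roughly like this (illustrative only, not the PR's code):

```python
def frames_share_buffer(frames):
    """True if every frame is a memoryview over the same exporting object,
    the precondition for building one spanning view with a single prefix
    offset. With real comms this is either always true (TCP) or irrelevant
    (websockets/UCX don't hand us memoryviews), which is why a violation
    becomes an AssertionError instead of a silent copy.
    """
    return (
        len(frames) > 0
        and all(isinstance(f, memoryview) for f in frames)
        and all(f.obj is frames[0].obj for f in frames)
    )
```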

Nonetheless I didn't like leaving this case around since it could fail in very weird ways, so I added logic to copy these subframes that don't match up just in case: ab6119a. But I've changed my mind again and decided to just make this case an AssertionError, because:

  1. as explained above, it shouldn't happen in real use
  2. if some future change or Comm implementation makes it happen, that should be an explicit failure instead of a silent performance degradation.

I'll be out for the next week, so if consensus is to do the copy instead of the assert, just revert e2da610.

This has become more complicated than I wanted given that we're hopefully redoing all this soon. I really wish there was some way to get the current index into the underlying buffer that a memoryview was pointing to. That would eliminate all this passing down offsets and be more reliable. Is there any way I'm missing to do this without writing a C extension to do pointer arithmetic on ((PyMemoryViewObject *)x)->view.buf vs ((PyMemoryViewObject *)memoryview(x.obj))->view.buf?

@jakirkham (Member) commented Jul 31, 2021

UCX uses a mix of NumPy arrays and RMM DeviceBuffers depending on whether the data is in CPU or GPU memory. That said, I wouldn't worry about that. We can probably refresh that layer later to benefit from some of the same things we are doing for TCP; for example, receiving one CPU and one GPU buffer instead of many, and then just indexing into them.

As to the pointer arithmetic point: yes, one can use NumPy.

In [1]: import numpy as np

In [2]: m = memoryview(bytearray(8))

In [3]: a = np.asarray(m)

In [4]: a.__array_interface__["data"][0]
Out[4]: 140691895000496

That said, NumPy has other utility functions like may_share_memory. They are not perfect, but maybe good enough, and could simplify things. IDK if we want to depend on NumPy though. Maybe one can extract similar info out of ctypes?
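
Putting the pointer trick together, the offset of a memoryview inside its base buffer could be recovered like this (a sketch, assuming the view is contiguous; `offset_in_base` is a hypothetical helper, not an existing API):

```python
import numpy as np


def offset_in_base(view):
    """Byte offset of ``view`` within the buffer it was sliced from,
    computed by comparing data pointers via NumPy."""
    base = memoryview(view.obj).cast("B")
    view_ptr = np.asarray(view.cast("B")).__array_interface__["data"][0]
    base_ptr = np.asarray(base).__array_interface__["data"][0]
    return view_ptr - base_ptr


buf = bytearray(16)
assert offset_in_base(memoryview(buf)[5:12]) == 5
```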

@quasiben (Member) commented Aug 2, 2021

add to allowlist

@gjoseph92 gjoseph92 changed the title Deserialization: zero-copy merge subframes when possible 🛑 DNM Deserialization: zero-copy merge subframes when possible Aug 12, 2021
@gjoseph92 (Collaborator, Author)

@jakirkham thanks for the pointer 😆 about NumPy! That is exactly the sort of thing I was looking for.

Using that I was able to come up with a much more sensible implementation of this in a new PR.

@gjoseph92 gjoseph92 closed this Aug 12, 2021
@gjoseph92 gjoseph92 deleted the serialization/zero-copy-merge-frames branch August 12, 2021 21:54

Linked issue (may be closed by this pull request): Unnecessary deep copy causes memory flare on network comms