Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minimize copying in maybe_compress & byte_sample #6273

Merged
merged 51 commits into from
May 6, 2022

Conversation

jakirkham
Copy link
Member

Currently there are a bunch of copies that occur in maybe_compress and byte_sample. Some of these are explicit (like calling ensure_bytes) and some are implicit (like slicing). In either case it would be good to avoid additional memory allocation and copying in these functions when it is not needed. After all these code paths can be triggered when sending data over the wire or spilling to disk (either could be occurring due to memory pressure that we don't want to add to).


  • Tests added / passed
  • Passes pre-commit run --all-files

@jakirkham jakirkham force-pushed the avoid_memcpy_compress branch 4 times, most recently from 2981f08 to 705399f Compare May 5, 2022 09:46
Otherwise `memoryview` will raise a `TypeError`.
No need to pay the cost for copying here. Just use an empty `bytes`
object for the `memoryview`. Should be faster in this case and saves us
a check in the `cast` case.
This is converted to `int` here, but is unused below. So go ahead and
drop it as it doesn't seem to be needed.
Also rename `nbytes` variable to `payload_nbytes` for clarity.
To allow more efficient accessing of the `payload` (like when selecting
portions in `byte_sample`), take a `memoryview` of the data. Ensure that
is 1-D contiguous `uint8` data. This makes it very similar to `bytes`,
which will work well in `byte_sample` and compressors that handle only a
narrow form of the Python Buffer Protocol.

This allows us to drop various `ensure_bytes` calls in compression that
would otherwise copy the data. Should reduce memory usage when
serializing as part of transmission or spilling.
@jakirkham
Copy link
Member Author

cc @dask/maintenance (in case anyone has thoughts on this)

Also cc-ing @madsbk given the ensure_memoryview changes

@github-actions
Copy link
Contributor

github-actions bot commented May 5, 2022

Unit Test Results

       16 files  ±  0         16 suites  ±0   7h 40m 45s ⏱️ + 19m 41s
  2 762 tests +  5    2 683 ✔️ +  4       78 💤 ±0  1 +1 
22 058 runs  +40  21 037 ✔️ +39  1 020 💤 ±0  1 +1 

For more details on these failures, see this check.

Results for commit 8661564. ± Comparison against base commit 2286896.

♻️ This comment has been updated with latest results.

Copy link
Member

@martindurant martindurant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just have a couple of thoughts, but it's definitely a good thing to do.

distributed/protocol/compression.py Outdated Show resolved Hide resolved
distributed/protocol/compression.py Outdated Show resolved Hide resolved
distributed/utils.py Show resolved Hide resolved
Co-authored-by: Martin Durant <martindurant@users.noreply.github.com>
@jakirkham
Copy link
Member Author

Planning to merge end of day tomorrow if no comments

As comparisons were effectively flipped from how they were before, these
should have `=`s as a condition as well.
This can be quite a bit faster than `append`ing each value (particularly
if resizing of the underlying array needs to occur).
These are basically unused and are expected to be `int`s internally. So
just pick default values that are `int`s to start.
Avoid repeated copies while testing that don't add value here.
Copy link
Contributor

@madsbk madsbk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jakirkham
Copy link
Member Author

Thanks Mads! 🙏

@jakirkham
Copy link
Member Author

One of the CI failures is a known flaky test that was very recently addressed ( #6233 ). The other CI failure is a new flaky test so filed as issue ( #6292 ).

@jakirkham jakirkham merged commit 4d6a438 into dask:main May 6, 2022
@jakirkham jakirkham deleted the avoid_memcpy_compress branch May 6, 2022 23:45
@jakirkham
Copy link
Member Author

Thanks all! 🙏

Going to get this in. If anything else comes up, happy to follow up separately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants