P2P offload get_output_partition #7587

fjetter · 2023-02-27T13:17:05Z

Deserialization and disk reads are currently blocking the event loop on the get_output_partition path.

https://github.com/coiled/coiled-runtime/actions/runs/4282337088

Here are a few selected workloads. The four panels show:

Top left: offloads only the deserialization part, i.e. convert_partition/to_pandas
Top right: offloads disk reads as well
Bottom left: ignore (code not shown here)
Bottom right: version 2022.11.0, i.e. just after Rewrite of P2P control flow #7268 was merged. @hendrikmakait this shows that we actually lost a bit of performance while developing the various consistency features. Most of this is likely due to the iteration and the inclusion of the input partition ID.

Note: When inspecting the memory graphs, this change appears to perform horribly. This is merely an artifact of the benchmarks. Since the output is no longer blocked on the event loop, we process output partitions more eagerly, which raises the average memory footprint because more partitions are in memory at the same time. Looking at the actual dashboard shows no problems.

I haven't run the same benchmarks for arrays, but I expect a similar improvement.
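
For readers unfamiliar with the pattern, the kind of offloading discussed above can be sketched with the standard library alone. This is not the code from this PR: `convert_partition` here is a hypothetical stand-in for the real deserialization step, and the executor setup is an assumption made only so the example is self-contained.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def convert_partition(raw: bytes) -> list[int]:
    """Hypothetical stand-in for blocking deserialization work."""
    time.sleep(0.05)  # simulate work that would otherwise block the event loop
    return list(raw)


async def get_output_partition(raw: bytes, executor: ThreadPoolExecutor) -> list[int]:
    # Run the blocking conversion in a worker thread; the coroutine only
    # awaits the result, so the event loop stays free for other tasks.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, convert_partition, raw)


async def main() -> list[list[int]]:
    with ThreadPoolExecutor(max_workers=2) as executor:
        # Several output partitions can now be in flight at the same time.
        return await asyncio.gather(
            get_output_partition(b"\x01\x02", executor),
            get_output_partition(b"\x03", executor),
        )


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Because nothing waits on the event loop anymore, more partitions can be materialized concurrently, which is consistent with the higher average memory footprint mentioned in the note above.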

fjetter · 2023-02-27T13:19:39Z

I don't have a good idea of how one would test this without reverse engineering asyncio, but I think this change is fine since the benchmarks would protect us from a regression.
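
One possible way to probe this in a test, offered only as a rough sketch and not as something this PR does, is to run a heartbeat task next to the code under test and record how late it wakes up. All names below (`max_loop_lag`, `blocking_path`, `offloaded_path`) are hypothetical, and timing-based checks like this tend to be flaky on loaded CI machines.

```python
import asyncio
import time


async def max_loop_lag(coro_fn, interval: float = 0.005) -> float:
    """Run coro_fn() while a heartbeat records its worst wake-up delay."""
    worst = 0.0

    async def heartbeat() -> None:
        nonlocal worst
        while True:
            start = time.perf_counter()
            await asyncio.sleep(interval)
            # Anything beyond `interval` is time the loop spent elsewhere.
            worst = max(worst, time.perf_counter() - start - interval)

    hb = asyncio.create_task(heartbeat())
    await asyncio.sleep(0)  # let the heartbeat start its first sleep
    try:
        await coro_fn()
    finally:
        hb.cancel()
    return worst


async def blocking_path() -> None:
    time.sleep(0.2)  # blocks the event loop directly
    await asyncio.sleep(0.02)  # yield so the heartbeat can observe the delay


async def offloaded_path() -> None:
    await asyncio.to_thread(time.sleep, 0.2)  # same work, in a worker thread


if __name__ == "__main__":
    print("blocking: ", asyncio.run(max_loop_lag(blocking_path)))
    print("offloaded:", asyncio.run(max_loop_lag(offloaded_path)))
```

The blocking variant should report a lag close to the full 0.2 s, while the offloaded variant should stay near zero.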

hendrikmakait

> bottom right shows version 2022.11.0, i.e. just after Rewrite of P2P control flow #7268 was merged. @hendrikmakait this shows that while developing the various consistency features we actually lost a bit of performance. Most of this is likely due to the iteration and inclusion of input partition ID

Lesson for next time: add integration/performance regression tests as soon as possible.

> I haven't run the same benchmarks for arrays but expect a similar improvement

I'm adding integration tests for arrays at the moment; we should have an answer once they are merged.

I've also had a look at offloading the output path today. From what I understand, the read API of our disk buffer is unfortunately not thread-safe because of side effects when updating diagnostics data:

distributed/distributed/shuffle/_buffer.py

Line 264 in f4328bb

self.diagnostics[name] += stop - start

distributed/distributed/shuffle/_disk.py

Line 97 in f4328bb

self.bytes_read += size

TL;DR: I'd recommend offloading only the conversion for the time being and refactoring the buffers to allow offloading disk I/O to threads.
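
To illustrate the concern: `+=` on a shared attribute or dict entry is a read-modify-write, so concurrent threads can lose updates. The sketch below shows one conventional way to guard such counters with a lock. `ReadStats` and `record_read` are hypothetical names, not part of `distributed`, and whether a lock is the right refactoring for the real buffers is an open question.

```python
import threading
from collections import defaultdict


class ReadStats:
    """Hypothetical stand-in for the buffers' diagnostics counters."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.bytes_read = 0
        self.diagnostics: defaultdict[str, float] = defaultdict(float)

    def record_read(self, name: str, size: int, duration: float) -> None:
        # Both updates are read-modify-write operations, so they are
        # performed together under a single lock.
        with self._lock:
            self.bytes_read += size
            self.diagnostics[name] += duration


def simulate(n_threads: int = 8, reads_per_thread: int = 1000) -> ReadStats:
    """Update the counters from several threads at once."""
    stats = ReadStats()

    def worker() -> None:
        for _ in range(reads_per_thread):
            stats.record_read("read", size=10, duration=0.5)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stats


if __name__ == "__main__":
    stats = simulate()
    print(stats.bytes_read, stats.diagnostics["read"])
```

With the lock in place the totals are deterministic regardless of how the threads interleave.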

fjetter · 2023-02-27T14:21:51Z

> TL;DR: I'd recommend offloading conversion for the time being and refactor the buffers to allow offloading disk I/O to threads.

fine by me

github-actions · 2023-02-27T14:21:59Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

26 files ±0 | 26 suites ±0 | 12h 10m 43s ⏱️ +7m 21s
3 492 tests ±0 | 3 387 ✔️ -1 | 103 💤 ±0 | 2 ❌ +1
44 136 runs ±0 | 42 066 ✔️ -2 | 2 068 💤 +1 | 2 ❌ +1

For more details on these failures, see this check.

Results for commit 7d329f3. ± Comparison against base commit e57f242.

♻️ This comment has been updated with latest results.

fjetter · 2023-03-09T16:51:04Z

@hendrikmakait I removed the disk offloading and CI is green-ish. Anything else?

fjetter requested a review from hendrikmakait February 27, 2023 13:17

hendrikmakait requested changes Feb 27, 2023

hendrikmakait approved these changes Feb 28, 2023

fjetter added 5 commits March 9, 2023 15:28

Offload deserialization in get_output_part

f0a6250

Offload disk read as well

a23b3b2

offload array as well

58b74bb

fix

b5609fe

Do not offload data

7d329f3

fjetter force-pushed the p2p_offload_disk_read branch from 57e65ba to 7d329f3 March 9, 2023 14:28

fjetter requested a review from milesgranger March 9, 2023 14:35

hendrikmakait merged commit 84169b2 into dask:main Mar 9, 2023

fjetter deleted the p2p_offload_disk_read branch March 9, 2023 16:56

fjetter mentioned this pull request Mar 14, 2023

Regression in test_join.py::test_join_big coiled/benchmarks#711

Open
