Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python] Balance distributed partition size more evenly #1135

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

atolopko-czi
Copy link
Collaborator

@atolopko-czi atolopko-czi commented May 9, 2024

Fixes #1119

Generate the partitions by splitting on obs_joinids instead of on soma chunks.

Previously, when the number of soma chunks was not evenly divisible by the partition count ("world size"), partition sizes could become drastically imbalanced.

The new algo for shuffled, partitioned ids is:

  1. Chunk global ids into soma chunks
  2. Globally shuffle the chunks (do not shuffle the chunks internally)
  3. Create a flat list of global ids again
  4. Partition the flat list of globals id, and select the ids for the current partition
  5. Chunk into soma chunks (again) within the partition

When performing partitioning without shuffling, the soma chunks from step 1 may end up being split across two partitions. This introduces a minor read performance hit, since each original soma chunk that is split across partitions will result in 2 read operations instead of 1.

When performing partitioning with shuffling, each soma chunk from step 1 may end up being split across two soma chunks in step 5. This is because the latter chunks will not be "aligned" with the former chunks when the partition sizes are not evenly divisible by the chunk size. This introduces a 2x read performance hit in the worst case. Consider that each partition-local chunk may be composed of 2 original chunks that are not stored contiguously on disk. This could be addressed by explicitly aligning the step 5 chunks with the step 1 chunks, but this has not been implemented in the current PR.

Now, the worst case imbalance is that any two partitions will only differ in size by 1 row. This is due to the fact the numpy.array_split() evenly distributes the imbalances across the splits that it produces. Since only imbalances of size 1 can occur, it should not be necessary to drop rows or pad rows in any partition.

New unit test cases are included in test_distributed__returns_data_partition_for_rank, which now exercises the imbalanced cases as well. I've also added test_distributed__returns_data_partition_for_rank_globally_shuffled to demonstrate that global shuffling is maintained.

Generate the partitions by splitting on obs_joinids instead of on soma
chunks.

Previously, when the number of soma chunks was not evenly divisble by
the partition count ("world size"), partition sizes could become drastically
imbalanced.

Now, the worst case imbalance is that any two partitions will only
differ in size by 1 row. This is due to the fact the numpy.array_split()
distributed imbalances across the splits that it produces.
@atolopko-czi atolopko-czi self-assigned this May 9, 2024
Copy link

codecov bot commented May 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.12%. Comparing base (c796baf) to head (095c34c).

❗ Current head 095c34c differs from pull request most recent head cd84ef2. Consider uploading reports for the commit cd84ef2 to get more accurate results

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1135   +/-   ##
=======================================
  Coverage   91.12%   91.12%           
=======================================
  Files          77       77           
  Lines        5857     5860    +3     
=======================================
+ Hits         5337     5340    +3     
  Misses        520      520           
Flag Coverage Δ
unittests 91.12% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ebezzi ebezzi changed the title #1119 Balance distributed partition size more evenly [python] Balance distributed partition size more evenly May 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant