bugfix in distribute_ranks_to_blocks #1919

Merged
merged 2 commits into main from rank_blocks on Nov 27, 2022

Conversation

@sbailey sbailey commented Nov 26, 2022

This PR fixes a bug in desispec.scripts.zproc.distribute_ranks_to_blocks where it was creating more blocks (i.e. MPI subcommunicators) than requested. For example:

from desispec.scripts.zproc import distribute_ranks_to_blocks
nblocks, block_size, block_rank, block_num = distribute_ranks_to_blocks(nblocks=3, rank=63, size=64)
print(f"{nblocks=} {block_size=} {block_rank=} {block_num=}")

main   --> nblocks=3 block_size=21 block_rank=0 block_num=3
thisPR --> nblocks=3 block_size=21 block_rank=20 block_num=2

Note that the block_num=3 returned by main is too big; it should have been 0, 1, or 2 given the requested nblocks=3. The consequence was that on a GPU node (size=64) the zproc script was creating 4 blocks to process 3 afterburners, leading to a race condition with ranks from two different blocks both running the MgII afterburner.
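For reference, here is the arithmetic behind that out-of-range index (a standalone sketch, not the desispec code; it assumes main first computes a fixed integer block_size, per the formula discussed below):

size, nblocks, rank = 64, 3, 63
block_size = int(size / nblocks)       # 21
print(int(rank / block_size))          # 3 -- out of range for nblocks=3 (main's behavior)
print(int(rank / (size / nblocks)))    # 2 -- valid block index (this PR's behavior)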

FWIW, I introduced a similar bug when distributing MPI ranks to GPU devices in PR #1899, so I was already familiar with the fix :)

The key difference is using block_num = int(rank / (size/nblocks)) instead of block_num = int(rank / int(size/nblocks)). A side effect is that different blocks can have different numbers of ranks when size/nblocks isn't an integer; i.e. you can't pre-decide a single integer block_size that applies to every block and still have it use up all of the ranks unless size/nblocks is an integer.
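As a minimal sketch (again not the actual desispec implementation, just the corrected division applied to every rank via a hypothetical helper), the fixed formula keeps every rank within 0..nblocks-1 at the cost of uneven block sizes:

from collections import Counter

def assign_block(rank, size, nblocks):
    # hypothetical helper illustrating the corrected formula from this PR
    return int(rank / (size / nblocks))

size, nblocks = 64, 3
counts = Counter(assign_block(r, size, nblocks) for r in range(size))
print(dict(counts))  # {0: 22, 1: 21, 2: 21} -- all 64 ranks used, no block index >= nblocks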

This PR includes both the bug fix and unit tests that fail on current main but pass with this PR. I also tested by running a coadd-redshift job to verify that it works in practice.

I intend to self-merge since I need this fix for other healpix redshifting development work, but I'm submitting this as a separate PR since it isn't specific to the healpix changes.

@sbailey sbailey added this to In progress in Himalayas via automation Nov 26, 2022
@sbailey sbailey merged commit 15a7ad3 into main Nov 27, 2022
Himalayas automation moved this from In progress to Done Nov 27, 2022
@sbailey sbailey deleted the rank_blocks branch November 27, 2022 06:17