bugfix in distribute_ranks_to_blocks #1919
Merged
This PR fixes a bug in `desispec.scripts.zproc.distribute_ranks_to_blocks` where it was creating more blocks (i.e. MPI subcommunicators) than requested. e.g. note that main returned `block_num=3`, which is too big; it should have been 0, 1, or 2 based upon the requested `nblocks=3`. The consequence was that the GPU node (size=64) zproc script was creating 4 blocks to process 3 afterburners, and there is a race condition with ranks from two different blocks both running the MgII afterburner.

FWIW, I made a similar bug in distributing MPI ranks to GPU devices in PR #1899, so I was familiar with the fix already :)
The key difference is `block_num = int(rank / (size/nblocks))` instead of `block_num = int(rank / int(size/nblocks))`, with the side effect that different blocks can have different numbers of ranks if `size/nblocks` isn't an integer, i.e. you can't pre-decide a single integer `block_size` that applies to every block and still have that use up all the ranks if `size/nblocks` is not an integer.

This PR includes both the bug fix and unit tests that fail on current main but pass with this PR. I also tested with running a coadd-redshift job to verify that it also works in practice.
I intend to self-merge since I need this fix for other healpix redshifting development work, but I'm submitting this as a separate PR since it isn't specific to the healpix changes.