bugfix in distribute_ranks_to_blocks #1919
Merged
This PR fixes a bug in `desispec.scripts.zproc.distribute_ranks_to_blocks` where it was creating more blocks (i.e. MPI subcommunicators) than requested. e.g. note that main returned `block_num=3`, which is too big; it should have been 0, 1, or 2 based upon the requested `nblocks=3`. The consequence was that the GPU node (size=64) zproc script was creating 4 blocks to process 3 afterburners, and there is a race condition with ranks from two different blocks both running the MgII afterburner.

FWIW, I made a similar bug in distributing MPI ranks to GPU devices in PR #1899, so I was familiar with the fix already :)
The key difference is `block_num = int(rank / (size/nblocks))` instead of `block_num = int(rank / int(size/nblocks))`, with the side effect that different blocks can have different numbers of ranks if `size/nblocks` isn't an integer, i.e. you can't pre-decide a single integer `block_size` that applies to every block and still have that use up all the ranks if `size/nblocks` is not an integer.

This PR includes both the bug fix and unit tests that fail on current main but pass with this PR. I also tested with running a coadd-redshift job to verify that it also works in practice.
I intend to self-merge since I need this fix for other healpix redshifting development work, but I'm submitting this as a separate PR since it isn't specific to the healpix changes.