Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix desi_proc specex wrapper MPI rank bug for single camera #1540

Merged
merged 2 commits into from Dec 20, 2021
Merged

Conversation

marcelo-alvarez
Copy link
Contributor

Fixes #1528 by ensuring that 21 MPI processes are used for a single camera.

Following tests with this branch pass:

Cameras: z1
  haswell: srun -N 1 -n 21 -c 3
      knl: srun -N 1 -n 21 -c 12
Cameras: b1
  haswell: srun -N 1 -n 21 -c 3
      knl: srun -N 1 -n 21 -c 12
Cameras: r135
  haswell: srun -N 1 -n 31 -c 2
      knl: srun -N 1 -n 31 -c 8
Cameras: a3
  haswell: srun -N 1 -n 31 -c 2
      knl: srun -N 1 -n 31 -c 8
Cameras: a0123456789
  haswell: srun -N 10 -n 301 -c 2
      knl: srun -N 10 -n 301 -c 9

Please review and merge if appropriate.

@coveralls
Copy link

coveralls commented Dec 18, 2021

Coverage Status

Coverage decreased (-0.008%) to 25.646% when pulling 8e41121 on mpirankbug into 3adabaf on master.

@sbailey
Copy link
Contributor

sbailey commented Dec 18, 2021

Thanks @marcelo-alvarez . Please check running a case with an odd number of cameras, where it selects the number of ranks as 10*n+1, e.g. the r135 case above. It seems to successfully run specex and generate the initial fit-psf-??-????????.fits files within a few minutes, but then it gets wedged before it gets to the desi_proc logic for interpolating over bad fibers and the jobs just keeps running without apparently doing anything. Running with an even number of cameras such that the selected number of ranks is 20*n+1 seems to work fine.

@marcelo-alvarez
Copy link
Contributor Author

Apologies for not verifying the previous solution actually runs without crashing. The latest commit runs without crashing for r135 and is expected to do so for all possible camera combinations, from a single camera to all 30.

@sbailey
Copy link
Contributor

sbailey commented Dec 20, 2021

Looks good. Thanks for including the comment "lowest multiple of 20 exceeding 10 per camera" explaining the magic 20 math for our future selves. Merging now.

@sbailey sbailey merged commit e822758 into master Dec 20, 2021
@sbailey sbailey deleted the mpirankbug branch December 20, 2021 23:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

desi_proc specex wrapper MPI rank bug for single camera
3 participants