desi_proc specex wrapper MPI rank bug for single camera #1528

Closed
sbailey opened this issue Dec 12, 2021 · 0 comments · Fixed by #1540
sbailey commented Dec 12, 2021

When running desi_proc --batch on an arc exposure with a single camera, the generated batch script requests the wrong number of MPI ranks, which leads to a specex wrapper failure.

Example:

desi_proc --traceshift -n 20210507 -e 87539 --cameras b5 --batch --nosubmit

Generates a script with

srun -N 1 -n 11 -c 5 desi_proc --traceshift -n 20210507 -e 87539 --cameras b5 ... -mpi

(for simplicity I dropped the full path to desi_proc and the long --timing-file option)

Note that this requests 11 instead of 21 ranks. Running it causes:

...
Traceback (most recent call last):
  File "/global/common/software/desi/cori/desiconda/20200801-1.4.0-spec/code/desispec/0.48.0/bin/desi_proc", line 7, in <module>
    proc.main(args)
  File "/global/common/software/desi/cori/desiconda/20200801-1.4.0-spec/code/desispec/0.48.0/lib/python3.8/site-packages/desispec/scripts/proc.py", line 464, in main
    desispec.scripts.specex.run(comm,cmds,args.cameras)
  File "/global/common/software/desi/cori/desiconda/20200801-1.4.0-spec/code/desispec/0.48.0/lib/python3.8/site-packages/desispec/scripts/specex.py", line 336, in run
    sc = Schedule(fitbundles,comm=comm,njobs=len(cameras),group_size=group_size)
  File "/global/common/software/desi/cori/desiconda/20200801-1.4.0-spec/code/desispec/0.48.0/lib/python3.8/site-packages/desispec/workflow/schedule.py", line 67, in __init__
    raise Exception("can't have group_size larger than world size - 1")
Exception: can't have group_size larger than world size - 1
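
The arithmetic behind that exception can be sketched as follows. This is only an illustration of the check quoted in the traceback, not the desispec Schedule code: the helper names are hypothetical, and the group size of 20 is an assumption inferred from the 21 ranks expected for one camera, not a value read from specex.py.

```python
# Hypothetical helpers illustrating the check quoted in the traceback;
# this is NOT the actual desispec.workflow.schedule implementation.

# Assumption: 20 ranks per camera group, inferred from the 21 ranks
# that the issue says a single camera should get.
ASSUMED_GROUP_SIZE = 20


def min_world_size(group_size):
    """Smallest MPI world size that passes the quoted check."""
    return group_size + 1


def check_world_size(world_size, group_size):
    """Raise with the same message as the traceback if the group cannot fit."""
    if group_size > world_size - 1:
        raise ValueError("can't have group_size larger than world size - 1")
    return True


if __name__ == "__main__":
    print(min_world_size(ASSUMED_GROUP_SIZE))
    for nranks in (21, 11):
        try:
            check_world_size(nranks, ASSUMED_GROUP_SIZE)
            print(nranks, "ranks: ok")
        except ValueError as err:
            print(nranks, "ranks:", err)
```

Under that assumption, the 11 ranks in the generated srun line fail the check while 21 ranks pass it.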

@marcelo-alvarez please update the bookkeeping and test desi_proc --batch --nosubmit ... with various combinations of single cameras, N>1 random individual cameras, complete spectrographs, and all spectrographs to confirm that it generates the intended number of ranks.
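
For checking the generated scripts across those camera combinations, one possible approach is to pull the rank count out of the srun line and compare it with the intended value. The sketch below only parses a line of text; it does not run desi_proc, and the function name srun_ntasks is hypothetical.

```python
import shlex


def srun_ntasks(line):
    """Return the rank count given to srun via -n, or None if absent.

    Only the first -n is used: the desi_proc command that follows srun
    has its own -n option (the night), which must not be mistaken for
    the rank count.
    """
    tokens = shlex.split(line)
    if not tokens or tokens[0] != "srun":
        return None
    for i, tok in enumerate(tokens[:-1]):
        if tok == "-n":
            return int(tokens[i + 1])
    return None


if __name__ == "__main__":
    # The srun line from the example above, with the elided options left out.
    line = ("srun -N 1 -n 11 -c 5 desi_proc --traceshift "
            "-n 20210507 -e 87539 --cameras b5")
    print(srun_ntasks(line))
```

Applied to the line from the example, this returns 11, which can then be compared with the 21 ranks expected for a single camera.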

Projects
Fuji
  
Done