GPU vs. CPU production running #1901
Conversation
I've flagged this as WIP until I can sort out the tests (gpu_specter is more restrictive about bundlesize, nsubbundles, and nspec) and fix the …
This is a welcome cleanup that will simplify things and make them more suitable for the new normal of perlmutter+gpus+gpu_specter. I have three inline questions I would like responses to, and one trivial fix to ensure that the dry-run testing infrastructure continues to work properly.
I have not yet tested the code. I will let Marcelo help with that as part of a larger perlmutter test. I will also do a few more sanity checks on output scripts before approving.
I'll go ahead and approve this since my comments were satisfied, but a production-style test and subsequent checking of scripts and logs would be useful before merging.
I have fixed the `--use-specter` case, updated the unit tests, and verified that …
I ran another end-to-end test successfully. Two more changes for the record: …
This PR contains a number of fixes in preparation for running productions on Perlmutter:

- GPU usage can be turned off with `--no-gpu` (single script) or by setting `$DESI_NO_GPU` (global opt-out). This replaces `--use-gpu` (stdstars) and `--gpuextract` (extractions), and the previous mix of `--use-specter` at some levels and `--gpuspecter` at others.
- `--system-name` is used by pipeline wrappers to know what kinds of batch jobs to generate and where to send them, but once the code wakes up it uses whatever resources it finds.
- The default `--system-name` automatically picks between CPU and GPU for each job type, and the jobs that are sent to GPU nodes correctly use the GPUs (fixes #1881, "desi_run_night default CPU+GPU support not using GPUs?").
- `desispec.gpu.is_gpu_available()` standardizes the logic for identifying whether a GPU is available and should be used (see the sketch after this list).
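For concreteness, here is a minimal sketch of what such a centralized check could look like, assuming cupy-based device detection plus the `$DESI_NO_GPU` opt-out described above (an illustration only, not the actual `desispec.gpu` implementation):

```python
import os

def is_gpu_available():
    """Return True if a GPU is present and GPU usage has not been disabled.

    Sketch only: mirrors the opt-out logic described in this PR,
    not the actual desispec.gpu implementation.
    """
    # Global opt-out: setting $DESI_NO_GPU disables GPU usage everywhere
    if 'DESI_NO_GPU' in os.environ:
        return False
    # Otherwise require cupy with at least one visible CUDA device
    try:
        import cupy
        return cupy.cuda.runtime.getDeviceCount() > 0
    except Exception:
        # cupy missing, or installed without a usable CUDA driver/device
        return False
```

Scripts can then call this one helper instead of each re-implementing their own flag and environment checks.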
Some other changes that came along for the ride:

- `desi_run_night` defaults to the regular queue instead of realtime, since its primary usage is production runs using the regular queue.
- `--use-specter` extractions default to `--nsubbundles=5` instead of 6, for consistency with the gpu_specter default (which can't support 6; see the sketch below).
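If the gpu_specter restriction is that each fiber bundle must split into equal-sized subbundles (an assumption made for illustration, not verified against gpu_specter's code; the 25-fiber bundle size below is likewise assumed), the incompatibility of 6 is easy to see:

```python
def split_bundle(bundlesize=25, nsubbundles=5):
    """Hypothetical check that a fiber bundle divides evenly into subbundles.

    Assumes equal-sized subbundles are required, which would explain why
    nsubbundles=6 cannot work with a 25-fiber bundle while 5 can.
    """
    if bundlesize % nsubbundles != 0:
        raise ValueError(
            f'nsubbundles={nsubbundles} does not evenly divide '
            f'bundlesize={bundlesize}'
        )
    return bundlesize // nsubbundles

split_bundle(25, 5)    # OK: 5 fibers per subbundle
# split_bundle(25, 6)  # would raise ValueError under this assumption
```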
Test cases that I checked:

- `desi_run_night ...` on Perlmutter without specifying `--system-name`: jobs are correctly sent to CPU vs. GPU nodes, and the GPU jobs actually use the GPUs.
- `desi_run_night --system-name perlmutter-cpu ...`: jobs are sent only to CPU nodes, and those jobs don't trip over the lack of GPUs.
- `desi_run_night --system-name cori-knl ...`: this PR doesn't break production running on KNL (though I hope we never need to use that again).
- `desi_run_night --use-specter`: correctly uses specter instead of gpu_specter.

TODO (doesn't work yet):
- `desi_daily_proc_manager --dry-run-level 1 --use-specter ...`: daily proc on haswell with specter generates the correct job scripts (didn't try fully running).
Test productions using various iterations of this branch are in `/global/cfs/cdirs/desi/users/sjbailey/spectro/redux/cpugpu-*` (various test cases listed above) and `h1` (7 full nights from sv1 and sv3).

@akremin and @marcelo-alvarez please take a look; also heads up @dmargala.