Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GPU vs. CPU production running #1901

Merged
merged 17 commits into from Nov 13, 2022
Merged

GPU vs. CPU production running #1901

merged 17 commits into from Nov 13, 2022

Conversation

sbailey
Copy link
Contributor

@sbailey sbailey commented Nov 8, 2022

This PR contains a number of fixes in preparation for running productions on Perlmutter:

  • Standardize GPU/CPU defaults following suggestions in issue standardize GPU options #1799. Basically GPU usage is now opt-out: if the code can find a GPU it will use it unless --no-gpu (single script) or by setting $DESI_NO_GPU (global opt-out). This replaces --use-gpu (stdstars) and --gpuextract (extractions).
  • gpu_specter is default everywhere, instead of --use-specter at some levels and --gpuspecter at others.
  • --system-name is used by pipeline wrappers to know what kinds of batch jobs to generate and where to send them, but once the code wakes up it uses whatever resources it finds
  • On Perlmutter, not specifying --system-name automatically picks between CPU and GPU for each job type, and for the jobs that are sent to GPU nodes they correctly use GPUs (fixes desi_run_night default CPU+GPU support not using GPUs? #1881)
  • From PR GPU memory and rank allocation #1899 which was spun off of this work, desispec.gpu.is_gpu_available() standardizes the logic for identifying if a GPU is available and should be used.

Some other changes that came along for the ride:

  • desi_run_night defaults to regular queue instead of realtime, since it's primary usage is production runs using the regular queue.
  • --use-specter extractions default to --nsubbundles=5 instead of 6 for consistency with the gpu_specter default (which can't support 6).
  • partial implementation of fiberflat_vs_humidity I/O improvements in improve fiberflat_vs_humidity I/O efficiency #1900
  • tilenight jobs are giving a little more timing overhead to be more robust to NERSC I/O performance fluctuations
  • --dry-run sleeps are shorter for faster iterations (but less stdout movie-like)

Test cases that I checked:

  • desi_run_night ... on Perlmutter without specifying --system-name: jobs are correctly sent to CPU vs. GPU nodes, and the GPU jobs actually use the GPUs
  • desi_run_night --system-name perlmutter-cpu ...: jobs are sent only to CPU nodes, and those jobs don't trip over the lack of GPUs
  • desi_run_night --system-name cori-knl ... : this PR doesn't break production running on KNL (though I hope we never need to use that again)
  • desi_run_night --use-specter: correctly uses specter instead of gpu_specter

TODO (doesn't work yet):

  • desi_daily_proc_manager --dry-run-level 1 --use-specter ...: daily proc on haswell with specter generates the correct jobs scripts (didn't try fully running).

Test productions using various iterations of this branch are in /global/cfs/cdirs/desi/users/sjbailey/spectro/redux/cpugpu-* (various test cases listed above) and h1 (7 full nights from sv1 and sv3).

@akremin and @marcelo-alvarez please take a look; also heads up @dmargala .

@sbailey sbailey added this to In progress in Himalayas via automation Nov 8, 2022
@sbailey sbailey added the WIP Work in Progress label Nov 8, 2022
@sbailey
Copy link
Contributor Author

sbailey commented Nov 8, 2022

I've flagged this as WIP until I can sort out the tests (gpu_specter is more restrictive about bundlesize, nsubbundles, and nspec) and fix the desi_proc_daily_manager --use-specter case, but I think this can be usefully reviewed in the meantime anyway.

@marcelo-alvarez marcelo-alvarez linked an issue Nov 8, 2022 that may be closed by this pull request
Copy link
Member

@akremin akremin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a welcome cleanup that will simplify things and make them more suitable for the new normal of perlmutter+gpus+gpu_specter. I have three inline questions I would like to get responses about and one trivial fix to ensure that the dry-run testing infrastructure continues to work properly.

I have not yet tested the code. I will let Marcelo help with that as part of a larger perlmutter test. I will also do a few more sanity checks on output scripts before approving.

py/desispec/scripts/proc.py Outdated Show resolved Hide resolved
py/desispec/workflow/desi_proc_funcs.py Show resolved Hide resolved
py/desispec/workflow/procfuncs.py Show resolved Hide resolved
py/desispec/workflow/procfuncs.py Outdated Show resolved Hide resolved
Copy link
Member

@akremin akremin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll go ahead and approve this since my comments were satisfied, but a production-style test and subsequent checking of scripts and logs would be useful before merging.

@sbailey sbailey marked this pull request as draft November 11, 2022 18:09
@sbailey
Copy link
Contributor Author

sbailey commented Nov 11, 2022

I have fixed the --use-specter case, updated the unit tests, and verified that desi_daily_proc_manager generates the correct scripts and I spot checks submitting some of them to run. When Perlmutter is back, I'll do one more end-to-end test with the latest branch before merging.

@sbailey sbailey marked this pull request as ready for review November 11, 2022 22:03
@sbailey sbailey removed the WIP Work in Progress label Nov 13, 2022
@sbailey
Copy link
Contributor Author

sbailey commented Nov 13, 2022

I ran another end-to-end test successfully. Two more changes for the record:

  • added py/desispec/test/test_gpu.py unit tests which I had previously forgotten to commit
  • updated tilenight to be consistent with non-tilenight and only process LASTSTEP=all by default instead of also trying to process LASTSTEP=skysub. desi_run_night --laststeps ... can still be used to override that.

@sbailey sbailey merged commit 8597eb1 into main Nov 13, 2022
Himalayas automation moved this from In progress to Done Nov 13, 2022
@sbailey sbailey deleted the gpucpu branch November 13, 2022 05:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

desi_run_night default CPU+GPU support not using GPUs? standardize GPU options
2 participants