GPU vs. CPU production running #1901

sbailey · 2022-11-08T22:38:59Z

This PR contains a number of fixes in preparation for running productions on Perlmutter:

Standardize GPU/CPU defaults following suggestions in issue standardize GPU options #1799. Basically GPU usage is now opt-out: if the code can find a GPU it will use it unless --no-gpu (single script) or by setting $DESI_NO_GPU (global opt-out). This replaces --use-gpu (stdstars) and --gpuextract (extractions).
gpu_specter is default everywhere, instead of --use-specter at some levels and --gpuspecter at others.
--system-name is used by pipeline wrappers to know what kinds of batch jobs to generate and where to send them, but once the code wakes up it uses whatever resources it finds
On Perlmutter, not specifying --system-name automatically picks between CPU and GPU for each job type, and for the jobs that are sent to GPU nodes they correctly use GPUs (fixes desi_run_night default CPU+GPU support not using GPUs? #1881)
From PR GPU memory and rank allocation #1899 which was spun off of this work, desispec.gpu.is_gpu_available() standardizes the logic for identifying if a GPU is available and should be used.

Some other changes that came along for the ride:

desi_run_night defaults to regular queue instead of realtime, since it's primary usage is production runs using the regular queue.
--use-specter extractions default to --nsubbundles=5 instead of 6 for consistency with the gpu_specter default (which can't support 6).
partial implementation of fiberflat_vs_humidity I/O improvements in improve fiberflat_vs_humidity I/O efficiency #1900
tilenight jobs are giving a little more timing overhead to be more robust to NERSC I/O performance fluctuations
--dry-run sleeps are shorter for faster iterations (but less stdout movie-like)

Test cases that I checked:

desi_run_night ... on Perlmutter without specifying --system-name: jobs are correctly sent to CPU vs. GPU nodes, and the GPU jobs actually use the GPUs
desi_run_night --system-name perlmutter-cpu ...: jobs are sent only to CPU nodes, and those jobs don't trip over the lack of GPUs
desi_run_night --system-name cori-knl ... : this PR doesn't break production running on KNL (though I hope we never need to use that again)
desi_run_night --use-specter: correctly uses specter instead of gpu_specter

TODO (doesn't work yet):

desi_daily_proc_manager --dry-run-level 1 --use-specter ...: daily proc on haswell with specter generates the correct jobs scripts (didn't try fully running).

Test productions using various iterations of this branch are in /global/cfs/cdirs/desi/users/sjbailey/spectro/redux/cpugpu-* (various test cases listed above) and h1 (7 full nights from sv1 and sv3).

@akremin and @marcelo-alvarez please take a look; also heads up @dmargala .

sbailey · 2022-11-08T22:53:34Z

I've flagged this as WIP until I can sort out the tests (gpu_specter is more restrictive about bundlesize, nsubbundles, and nspec) and fix the desi_proc_daily_manager --use-specter case, but I think this can be usefully reviewed in the meantime anyway.

akremin

This is a welcome cleanup that will simplify things and make them more suitable for the new normal of perlmutter+gpus+gpu_specter. I have three inline questions I would like to get responses about and one trivial fix to ensure that the dry-run testing infrastructure continues to work properly.

I have not yet tested the code. I will let Marcelo help with that as part of a larger perlmutter test. I will also do a few more sanity checks on output scripts before approving.

py/desispec/scripts/proc.py

py/desispec/workflow/desi_proc_funcs.py

py/desispec/workflow/procfuncs.py

akremin

I'll go ahead and approve this since my comments were satisfied, but a production-style test and subsequent checking of scripts and logs would be useful before merging.

sbailey · 2022-11-11T22:03:05Z

I have fixed the --use-specter case, updated the unit tests, and verified that desi_daily_proc_manager generates the correct scripts and I spot checks submitting some of them to run. When Perlmutter is back, I'll do one more end-to-end test with the latest branch before merging.

sbailey · 2022-11-13T05:32:38Z

I ran another end-to-end test successfully. Two more changes for the record:

added py/desispec/test/test_gpu.py unit tests which I had previously forgotten to commit
updated tilenight to be consistent with non-tilenight and only process LASTSTEP=all by default instead of also trying to process LASTSTEP=skysub. desi_run_night --laststeps ... can still be used to override that.

Stephen Bailey added 8 commits November 8, 2022 13:52

WIP to use GPUs when auto-deriving CPU vs. GPU

b597812

move GPU auto-detection to desispec.gpu.is_gpu_available()

00ea75e

GPU vs. CPU defaults

b3613f8

desi_run_night default regular queue

35a336b

more consistent --use-specter options

29f6606

humidity fiberflat I/O efficiency

f27a3d0

increase tilenight job runtime overhead

a72b4ac

pre-PR cleanup

f147457

sbailey added this to In progress in Himalayas via automation Nov 8, 2022

sbailey added the WIP Work in Progress label Nov 8, 2022

marcelo-alvarez mentioned this pull request Nov 8, 2022

standardize GPU options #1799

Closed

marcelo-alvarez linked an issue Nov 8, 2022 that may be closed by this pull request

standardize GPU options #1799

Closed

akremin reviewed Nov 10, 2022

View reviewed changes

py/desispec/scripts/proc.py Outdated Show resolved Hide resolved

py/desispec/workflow/desi_proc_funcs.py Show resolved Hide resolved

py/desispec/workflow/procfuncs.py Show resolved Hide resolved

py/desispec/workflow/procfuncs.py Outdated Show resolved Hide resolved

Stephen Bailey added 5 commits November 10, 2022 11:30

remove unnecessary --nsubbundles option

f32c32a

revert dry run timing change

dcd3045

Use NoGPU context manager when multiprocessing

a05bc35

update extraction tests for gpu_specter

3bf879b

add gpu_specter to requirements

76767e2

akremin approved these changes Nov 11, 2022

View reviewed changes

sbailey marked this pull request as draft November 11, 2022 18:09

Stephen Bailey added 2 commits November 11, 2022 10:56

fix --use-specter option

604a685

update DESI_NO_GPU checking

29fa5d8

sbailey marked this pull request as ready for review November 11, 2022 22:03

Stephen Bailey added 2 commits November 12, 2022 21:04

tilenight don't process laststep=skysub by default, like non-tilenight

5fbb413

adding desispec.gpu tests

a0a333b

sbailey removed the WIP Work in Progress label Nov 13, 2022

sbailey merged commit 8597eb1 into main Nov 13, 2022

Himalayas automation moved this from In progress to Done Nov 13, 2022

sbailey deleted the gpucpu branch November 13, 2022 05:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GPU vs. CPU production running #1901

GPU vs. CPU production running #1901

sbailey commented Nov 8, 2022 •

edited

sbailey commented Nov 8, 2022

akremin left a comment

akremin left a comment

sbailey commented Nov 11, 2022

sbailey commented Nov 13, 2022

GPU vs. CPU production running #1901

GPU vs. CPU production running #1901

Conversation

sbailey commented Nov 8, 2022 • edited

sbailey commented Nov 8, 2022

akremin left a comment

Choose a reason for hiding this comment

akremin left a comment

Choose a reason for hiding this comment

sbailey commented Nov 11, 2022

sbailey commented Nov 13, 2022

sbailey commented Nov 8, 2022 •

edited