
stdstar memory optimization #1820

Merged
merged 6 commits into from Aug 12, 2022
Conversation

sbailey
Contributor

@sbailey sbailey commented Aug 11, 2022

This PR fixes issues

Test runs are in /global/cfs/cdirs/desi/users/sjbailey/spectro/redux/stdmem (this branch) and stdmem-main (current master), testing

time srun -N 1 -n 30 -c 4 --cpu-bind=cores desi_proc_joint_fit --obstype science --mpi --mpistdstars \
    --cameras a0123456789 -n 20210928 -e 102103,102104,102105,102106,102107,102108 
time srun -N 1 -n 30 -c 4 --cpu-bind=cores desi_proc_joint_fit --obstype science --mpi --mpistdstars \
    --cameras a0123456789 -n 20220324 -e 127343,127344,127345,127346,127347,127348,127349,127350 

On Cori KNL, these run in 9-10 minutes using this branch, and run out of memory on current master. Switching to -n 15 on current master doesn't run out of memory but takes ~20 minutes to run. The script stdmem/compare_stdstars.py confirms that they produce the same answer except for the following differences:

  • the FIBERMAP HDU is now sorted to match the other HDUs; if both the new and old versions are sorted by FIBER, they are an identical match
  • the METADATA HDU has two more columns, TARGETID and FIBER (see stdstars output fibermap sorted differently than other HDUs #1813 for the incantations needed to align the previous METADATA HDU with the differently sorted FIBERMAP)
  • a side effect of handling the integer datatypes for TARGETID and FIBER is that the DATA_G-R column became float32 instead of float64. The previous code was accidentally upcasting it to float64 when writing out, so the actual results are the same and pass np.all(new32 == old64)
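The FIBER-based alignment and float32/float64 check described above can be sketched with plain NumPy. The arrays here are toy stand-ins for the FIBERMAP/data columns; a real comparison like stdmem/compare_stdstars.py would read the actual HDUs (e.g. with astropy.io.fits):

```python
import numpy as np

# Hypothetical FIBER columns and a data column from the "old" (master)
# and "new" (this branch) files; stand-ins for the real HDU contents.
old_fiber = np.array([30, 10, 20])
new_fiber = np.array([10, 20, 30])
old_data = np.array([3.0, 1.0, 2.0])                    # float64, as written by master
new_data = np.array([1.0, 2.0, 3.0], dtype=np.float32)  # float32, as written by this branch

# Sort both by FIBER so the rows line up, then compare.
old_order = np.argsort(old_fiber)
new_order = np.argsort(new_fiber)
assert np.all(old_fiber[old_order] == new_fiber[new_order])

# float32 vs float64: the values still compare equal after upcasting,
# mirroring the np.all(new32 == old64) check mentioned above.
assert np.all(new_data[new_order].astype(np.float64) == old_data[old_order])
```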

Unfortunately it still runs out of memory with 64 ranks on Cori KNL due to each rank needing a full copy of the stdstar templates, so the batch config still throttles this step to 32 ranks. I also confirmed that this works on Cori Haswell, Perlmutter CPU (1.8x faster than haswell), and Perlmutter GPU (3.3x faster than haswell).

Details

The main memory saving changes were:

  1. Trim each frame to just the standard-star fibers while reading it, instead of reading all N>>1 full frames and then filtering them down; do the same for the sky and fiberflat data.
  2. Convert the resolution data to the sparse Resolution object only once per fiber, instead of on-the-fly every time it is needed (this also saves time).
  3. When evaluating the stdstar models from the templates and coefficients, reuse the same memory buffer instead of re-allocating it multiple times.
  4. The final evaluated stdstar models are written by only a subset of ranks, so allocate those memory buffers only on the ranks that need them.
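Item 3, reusing one output buffer across model evaluations, can be sketched as follows. The shapes and variable names here are illustrative, not the actual desispec variables:

```python
import numpy as np

rng = np.random.default_rng(0)
nstd, ntemplates, nwave = 4, 10, 1000   # toy sizes

templates = rng.normal(size=(ntemplates, nwave))  # stdstar template spectra
coeffs = rng.normal(size=(nstd, ntemplates))      # fitted coefficients per star

# Allocate the model buffer once and reuse it for every evaluation,
# instead of letting `coeffs @ templates` allocate a fresh array each time.
model = np.empty((nstd, nwave))
for _ in range(3):  # e.g. successive iterations of the fit
    np.dot(coeffs, templates, out=model)

# Same result as the allocating form, without the repeated allocations.
assert np.allclose(model, coeffs @ templates)
```

The `out=` argument of np.dot is what avoids the per-call allocation; the same idea applies to any repeated evaluation with fixed output shape.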

@sbailey sbailey requested a review from akremin August 11, 2022 21:09
@sbailey sbailey added this to In progress in Himalayas via automation Aug 11, 2022
Member

@akremin akremin left a comment


This is essential for re-processing tiles with many input exposures, and is a welcome improvement to the memory management of the standard star fitting routines.

I checked output files for sp2 and sp7 of night 20210928 and exposures 102103-102108 from the outputs provided at /global/cfs/cdirs/desi/users/sjbailey/spectro/redux/stdmem and /global/cfs/cdirs/desi/users/sjbailey/spectro/redux/stdmem-main. The FLUX, WAVELENGTH, FIBERS, COEFF, and INPUT_FRAMES HDUs matched. The METADATA HDU had the two additional columns and no other changes. The FIBERMAP HDU was sorted differently, but had no other changes. Since these two differences were reported and justified as useful features/changes compared with main, I see no issues in the comparison.

Both the code and data look good, so I'm happy to approve and merge this.
