
stdstar memory optimization #1820

Merged
merged 6 commits into from Aug 12, 2022
Conversation

sbailey
Contributor

@sbailey sbailey commented Aug 11, 2022

This PR fixes issues

Test runs are in /global/cfs/cdirs/desi/users/sjbailey/spectro/redux/stdmem (this branch) and stdmem-main (current master), testing

time srun -N 1 -n 30 -c 4 --cpu-bind=cores desi_proc_joint_fit --obstype science --mpi --mpistdstars \
    --cameras a0123456789 -n 20210928 -e 102103,102104,102105,102106,102107,102108 
time srun -N 1 -n 30 -c 4 --cpu-bind=cores desi_proc_joint_fit --obstype science --mpi --mpistdstars \
    --cameras a0123456789 -n 20220324 -e 127343,127344,127345,127346,127347,127348,127349,127350 

On Cori KNL, these run in 9-10 minutes using this branch, and run out of memory on current master. Switching to -n 15 on current master doesn't run out of memory but takes ~20 minutes to run. The script stdmem/compare_stdstars.py confirms that they produce the same answer except for the following differences:

  • the FIBERMAP HDU is now sorted to match the other HDUs; if both the new and old versions are sorted by FIBER, they are an identical match
  • the METADATA HDU has two more columns, TARGETID and FIBER (see stdstars output fibermap sorted differently than other HDUs #1813 for the incantations needed to align the previous METADATA HDU with the differently sorted FIBERMAP)
  • a side effect of handling the integer datatypes for TARGETID and FIBER is that the DATA_G-R column became float32 instead of float64. The previous code was accidentally upcasting it to float64 when writing out, so the actual results are the same and pass np.all(new32 == old64)
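The FIBER-based alignment and float32/float64 check described above can be sketched with plain NumPy. The arrays here are toy stand-ins for the FIBERMAP/data columns; a real comparison like stdmem/compare_stdstars.py would read the actual HDUs (e.g. with astropy.io.fits):

```python
import numpy as np

# Hypothetical FIBER columns and a data column from the "old" (master)
# and "new" (this branch) files; stand-ins for the real HDU contents.
old_fiber = np.array([30, 10, 20])
new_fiber = np.array([10, 20, 30])
old_data = np.array([3.0, 1.0, 2.0])                    # float64, as written by master
new_data = np.array([1.0, 2.0, 3.0], dtype=np.float32)  # float32, as written by this branch

# Sort both by FIBER so the rows line up, then compare.
old_order = np.argsort(old_fiber)
new_order = np.argsort(new_fiber)
assert np.all(old_fiber[old_order] == new_fiber[new_order])

# float32 vs float64: the values still compare equal after upcasting,
# mirroring the np.all(new32 == old64) check mentioned above.
assert np.all(new_data[new_order].astype(np.float64) == old_data[old_order])
```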

Unfortunately it still runs out of memory with 64 ranks on Cori KNL due to each rank needing a full copy of the stdstar templates, so the batch config still throttles this step to 32 ranks. I also confirmed that this works on Cori Haswell, Perlmutter CPU (1.8x faster than haswell), and Perlmutter GPU (3.3x faster than haswell).

Details

The main memory saving changes were:

  1. Trim each frame to just the standard-star fibers while reading it, instead of reading all N>>1 full frames and then filtering them down; do the same for the sky and fiberflat data.
  2. Convert the resolution data to the sparse Resolution object only once per fiber, instead of on-the-fly every time it is needed (this also saves time).
  3. When evaluating the stdstar models from the templates and coefficients, reuse the same memory buffer instead of re-allocating it multiple times.
  4. The final evaluated stdstar models are written by only a subset of ranks, so allocate those memory buffers only on the ranks that need them.
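Item 3, reusing one output buffer across model evaluations, can be sketched as follows. The shapes and variable names here are illustrative, not the actual desispec variables:

```python
import numpy as np

rng = np.random.default_rng(0)
nstd, ntemplates, nwave = 4, 10, 1000   # toy sizes

templates = rng.normal(size=(ntemplates, nwave))  # stdstar template spectra
coeffs = rng.normal(size=(nstd, ntemplates))      # fitted coefficients per star

# Allocate the model buffer once and reuse it for every evaluation,
# instead of letting `coeffs @ templates` allocate a fresh array each time.
model = np.empty((nstd, nwave))
for _ in range(3):  # e.g. successive iterations of the fit
    np.dot(coeffs, templates, out=model)

# Same result as the allocating form, without the repeated allocations.
assert np.allclose(model, coeffs @ templates)
```

The `out=` argument of np.dot is what avoids the per-call allocation; the same idea applies to any repeated evaluation with fixed output shape.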

@sbailey sbailey requested a review from akremin August 11, 2022 21:09
@sbailey sbailey added this to In progress in Himalayas via automation Aug 11, 2022
Member

@akremin akremin left a comment


This is essential for re-processing tiles with many input exposures, and is a welcome improvement to the memory management of the standard star fitting routines.

I checked output files for sp2 and sp7 of night 20210928 and exposures 102103-102108 from the outputs provided at /global/cfs/cdirs/desi/users/sjbailey/spectro/redux/stdmem and /global/cfs/cdirs/desi/users/sjbailey/spectro/redux/stdmem-main. The FLUX, WAVELENGTH, FIBERS, COEFF, and INPUT_FRAMES HDUs matched. The METADATA HDU had the two additional columns and no other changes. The FIBERMAP HDU was sorted differently, but had no other changes. Since these two differences were reported and justified as useful features/changes compared with main, I see no issues in the comparison.

Both the code and data look good, so I'm happy to approve and merge this.
