Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

restore pipeline operations on KNL #1523

Merged
merged 8 commits into from Dec 10, 2021
Merged

restore pipeline operations on KNL #1523

merged 8 commits into from Dec 10, 2021

Conversation

sbailey
Copy link
Contributor

@sbailey sbailey commented Dec 10, 2021

This PR has a potpourri of updates that I made while trying to restore the ability of the pipeline to run efficiently on KNL. Most of them turned out to be unrelated, but I'm opening this PR to re-establish a working baseline for further updates. Changes include:

  • dashboard outputs to $DESI_SPECTRO_REDUX/$SPECPROD/dashboard by default instead of $HOME
  • more user friendly time logging in desi_proc output:
    • instead of dumping full json into the log, log only the duration.max per step (details are still in the *.json files)
    • include the total duration as the last line in the log
  • parallelize over cameras while merging the PSFs in desi_proc_joint_fit
    • The previous code was asking for 200 cores but apparently only using 1; now it asks for 30 ranks and uses all of them.
    • There has been some churn of how many ranks this step asks for, so I highlight this for attention to make sure I'm not misunderstanding the reason for earlier changes
  • add srun --cpu-bind=cores option for steps that are pure MPI (extraction, redrock), and --cpu-bind=none for steps that use multiprocessing (fluxcalib, spectra regrouping). See freeze_iers problematically slow on KNL desiutil#180 for context
  • when generating the desi_proc batch script, derive from the requested command line whether --nightlybias applies instead of passing a separate parameter through the function calls that redundantly tracks that info
  • More specex logging, e.g. log the command that will be run, not just log the commands that fail
  • Keep OMP_NUM_THREADS=(available threads per rank) for PSF fits (where we have profiled that helps), but go back to OMP_NUM_THREADS=1 for other steps.
    • This was motivated by OMP_NUM_THREADS impact on extractions #1515, though in the end my concern about OMP_NUM_THEADS>1 may have been a red herring. Regardless, I've kept here was emperically works in test runs, and then we can separately test whether re-increasing OMP_NUM_THREADS helps extractions or not.

I've tested this branch for dark+nightlybias, arc, flat, science, and redshifts in /global/cfs/cdirs/desi/users/sjbailey/spectro/redux/f2k (still running, and includes some NERSC-outage inflicted failures, but basically working).

@akremin

For the record: my biggest problem with testing turned out to be missing KMP_AFFINITY=false which is needed for using MPI+numpy on KNL. Other pieces that may have helped are being more explicit about --cpu-bind and adding the PSF merging parallelism.

@sbailey sbailey requested a review from akremin December 10, 2021 20:02
@sbailey sbailey added this to In progress in Fuji via automation Dec 10, 2021
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.01%) to 25.676% when pulling 74cbc93 on f2knl into 2de7942 on master.

Copy link
Member

@akremin akremin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A useful set of potpourri. I left two comments, one giving full context for the dashboard change to verifying our mutual thoughts and another checking if all the srun's are accounted for.

Note: I didn't do any trial runs of the code or verify that the appropriate cpu-bind is used in each circumstance.

py/desispec/desi_proc_dashboard.py Show resolved Hide resolved
@@ -560,7 +570,7 @@ def create_desi_proc_batch_script(night, exp, cameras, jobdesc, queue, nightlybi
else:
if jobdesc.lower() in ['science','prestdstar']:
fx.write('\n# Do steps through skysub at full MPI parallelism\n')
srun = f'srun -N {nodes} -n {ncores} -c {threads_per_core} {cmd} --nofluxcalib'
srun = f'srun -N {nodes} -n {ncores} -c {threads_per_core} --cpu-bind=cores {cmd} --nofluxcalib'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line 567 above is another srun where you haven't added --cpu-bind. It pertains to arcs and flats. Was that intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Full disclosure: I wasn't sure what the right answer was in that case and not specifying it was working so I left it alone. We can revisit that one later.

@sbailey sbailey merged commit 6ed41bf into master Dec 10, 2021
Fuji automation moved this from In progress to Done Dec 10, 2021
@sbailey sbailey deleted the f2knl branch December 10, 2021 22:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Fuji
  
Done
Development

Successfully merging this pull request may close these issues.

None yet

3 participants