restore pipeline operations on KNL #1523
A useful set of potpourri. I left two comments: one giving full context for the dashboard change to verify our mutual thoughts, and another checking whether all the `srun`s are accounted for.
Note: I didn't do any trial runs of the code or verify that the appropriate `cpu-bind` is used in each circumstance.
@@ -560,7 +570,7 @@ def create_desi_proc_batch_script(night, exp, cameras, jobdesc, queue, nightlybi
     else:
         if jobdesc.lower() in ['science','prestdstar']:
             fx.write('\n# Do steps through skysub at full MPI parallelism\n')
-            srun = f'srun -N {nodes} -n {ncores} -c {threads_per_core} {cmd} --nofluxcalib'
+            srun = f'srun -N {nodes} -n {ncores} -c {threads_per_core} --cpu-bind=cores {cmd} --nofluxcalib'
Line 567 above is another `srun` where you haven't added `--cpu-bind`. It pertains to arcs and flats. Was that intentional?
Full disclosure: I wasn't sure what the right answer was in that case, and leaving it unspecified was working, so I left it alone. We can revisit that one later.
This PR has a potpourri of updates that I made while trying to restore the ability of the pipeline to run efficiently on KNL. Most of them turned out to be unrelated, but I'm opening this PR to re-establish a working baseline for further updates. Changes include:
- `srun --cpu-bind=cores` option for steps that are pure MPI (extraction, redrock), and `--cpu-bind=none` for steps that use multiprocessing (fluxcalib, spectra regrouping). See "freeze_iers problematically slow on KNL" (desiutil#180) for context
- `--nightlybias` applies instead of passing a separate parameter through the function calls that redundantly tracks that info

I've tested this branch for dark+nightlybias, arc, flat, science, and redshifts in /global/cfs/cdirs/desi/users/sjbailey/spectro/redux/f2k (still running, and includes some NERSC-outage inflicted failures, but basically working).
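
To make the `--cpu-bind` convention above concrete, here is a minimal sketch of how a batch-script writer might choose the binding per step. The helper names (`cpu_bind_for`, `build_srun`) and the step labels are hypothetical and are not taken from desispec; only the mapping of pure-MPI steps to `cores` and multiprocessing steps to `none` comes from this PR description.

```python
# Hypothetical helpers for illustration; not part of desispec.
# Step labels are assumptions based on the steps named in the PR description.
PURE_MPI_STEPS = {"extraction", "redrock"}
MULTIPROCESSING_STEPS = {"fluxcalib", "spectra_regrouping"}


def cpu_bind_for(step):
    """Return the --cpu-bind value for a step, or None to leave it unset."""
    if step in PURE_MPI_STEPS:
        return "cores"   # one rank per core; no processes forked within a rank
    if step in MULTIPROCESSING_STEPS:
        return "none"    # let forked worker processes spread across cores
    return None          # unknown step: keep whatever default srun applies


def build_srun(step, nodes, ntasks, cpus_per_task, cmd):
    """Assemble an srun command line, adding --cpu-bind only when known."""
    parts = ["srun", "-N", str(nodes), "-n", str(ntasks), "-c", str(cpus_per_task)]
    bind = cpu_bind_for(step)
    if bind is not None:
        parts.append(f"--cpu-bind={bind}")
    parts.append(cmd)
    return " ".join(parts)
```

Returning `None` for unrecognized steps mirrors the choice discussed in review for the arc/flat `srun`: when the right binding is unclear, the flag is simply omitted.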
@akremin
For the record: my biggest problem with testing turned out to be missing KMP_AFFINITY=false which is needed for using MPI+numpy on KNL. Other pieces that may have helped are being more explicit about --cpu-bind and adding the PSF merging parallelism.
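
For anyone generating batch scripts by hand, a rough sketch of emitting that environment setting ahead of the `srun` lines could look like the following. The function name `write_env_header` is hypothetical, and the value `false` is taken verbatim from the comment above rather than checked against the OpenMP runtime documentation.

```python
import io


def write_env_header(fx, kmp_affinity="false"):
    """Write the KMP_AFFINITY export to an open batch-script file handle.

    Hypothetical helper; the default value is copied from the PR comment.
    """
    fx.write("# Reported in the PR discussion as needed for MPI+numpy on KNL\n")
    fx.write(f"export KMP_AFFINITY={kmp_affinity}\n")


# Usage: write into an in-memory buffer instead of a real script file.
buf = io.StringIO()
write_env_header(buf)
script_text = buf.getvalue()
```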