Check file outputs before submitting a spectro pipeline job #1217

Merged
merged 19 commits into from Mar 31, 2021

akremin
Member

@akremin akremin commented Mar 31, 2021

This pull request adds a new function, check_for_outputs_on_disk, that runs within the pipeline before a job is created or submitted to the queue during the reduction of spectral data. The motivation is that we may otherwise submit jobs that do nothing more than check whether outputs exist for several sub-tasks and then exit, filling the queue for no reason. We may also submit large 5-node jobs to process 30 cameras when only 2 still need to be processed. The new function does that checking prior to submission and requests resources and processing only for the cameras that still need it.

If all output files are present, then by default the job is not submitted. If only some files are present, by default a smaller job is submitted to process just the missing data. The expected files are determined from the PROCCAMWORD, which says which cameras should be processed, and from the type of job being submitted. The file names are generated using desispec.io.findfile. findfile was updated slightly here in the hope of updating the redshift formats, but those have since become obsolete again in PR #1192. That PR doesn't use findfile, however, so I will leave this as-is and let a future PR bring everything back in line. The redshift features included in this code are placeholders for the next PR, which will integrate redshift fitting into the nightly pipeline manager desi_daily_proc_manager, so the incorrect naming doesn't impact the code.
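To illustrate the idea of the check (this is a minimal sketch, not the code in this PR): given the list of cameras a job should produce, build the expected file names and keep only the cameras whose file is absent. The helper names expected_path and missing_cameras are hypothetical, and the path layout is an assumption modeled on the exposures/NIGHT/EXPID/ tree used in the test script; the pipeline itself gets the names from desispec.io.findfile.

```python
import os
import tempfile


def expected_path(root, night, expid, prefix, camera):
    # Assumed layout: exposures/NIGHT/EXPID/PREFIX-CAMERA-EXPID.fits
    # (hypothetical helper; the real names come from desispec.io.findfile).
    expdir = os.path.join(root, "exposures", str(night), f"{expid:08d}")
    return os.path.join(expdir, f"{prefix}-{camera}-{expid:08d}.fits")


def missing_cameras(root, night, expid, prefix, cameras):
    """Return the cameras whose expected output file is not on disk."""
    return [
        cam
        for cam in cameras
        if not os.path.exists(expected_path(root, night, expid, prefix, cam))
    ]


def demo():
    # Create one of three expected cframe files, then ask what is missing.
    with tempfile.TemporaryDirectory() as root:
        done = expected_path(root, 20201221, 67710, "cframe", "b0")
        os.makedirs(os.path.dirname(done))
        open(done, "w").close()
        return missing_cameras(root, 20201221, 67710, "cframe", ["b0", "r0", "z0"])
```

Calling demo() returns the two cameras that still need processing, which is the information needed to request a smaller job.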

I have added two command-line arguments, --dont-check-job-outputs and --dont-resubmit-partial-jobs, to both the nightly processing script desi_daily_proc_manager and the re-run script desi_run_night. The default (without either flag) is to check for files on disk: if all files exist, the job is skipped and not submitted; if some exist, a smaller job is submitted to process only the missing cameras. If --dont-resubmit-partial-jobs is set and --dont-check-job-outputs is not, the pipeline still skips jobs whose outputs all exist, but otherwise submits the full job even if outputs for some cameras already exist. If --dont-check-job-outputs is set, no checking is done and jobs are submitted to the queue even if the outputs exist. This matches the behavior prior to this addition.
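The flag combinations above reduce to a small decision table. The sketch below restates that table in Python; the function name plan_submission and its keyword arguments are hypothetical stand-ins for the two command-line flags, and this is not the implementation in procfuncs.py.

```python
def plan_submission(n_expected, n_existing, check_outputs=True, resubmit_partial=True):
    """Return 'skip', 'partial', or 'full' for a job.

    check_outputs=False    stands in for --dont-check-job-outputs
    resubmit_partial=False stands in for --dont-resubmit-partial-jobs
    """
    if not check_outputs:
        # No checking at all: always submit the full job.
        return "full"
    if n_existing >= n_expected:
        # Every expected output is already on disk.
        return "skip"
    if n_existing > 0 and resubmit_partial:
        # Some outputs exist: submit a smaller job for the missing cameras.
        return "partial"
    # Nothing exists yet, or partial resubmission is disabled.
    return "full"
```

For example, with 30 expected files and 28 already on disk, the default settings give "partial", while resubmit_partial=False gives "full".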

I tested numerous scenarios to ensure that it does the correct thing in various circumstances: I removed just the cals from a night, removed just cframe-*, removed cframe-* and stdstars-* files, removed some tiles but not others, and so on. In addition to the calls given explicitly in the script below, I also tested with and without the flags for both command-line scripts.

When it is run, it provides useful context-specific log messages to tell you what action it took (if any):

INFO:procfuncs.py:135:check_for_outputs_on_disk: prestdstar job with exposure(s) [67733] already has the desired 30 sframe's. Not submitting this job.
INFO:procfuncs.py:135:check_for_outputs_on_disk: stdstarfit job with exposure(s) [67710 67711 67712 67713] already has the desired 10 stdstars's. Not submitting this job.
INFO:procfuncs.py:135:check_for_outputs_on_disk: poststdstar job with exposure(s) [67710] already has the desired 30 cframe's. Not submitting this job.
INFO:procfuncs.py:141:check_for_outputs_on_disk: prestdstar job with exposure(s) [67684] has no existing sframe's. Submitting full camword=a0123456789.
INFO:procfuncs.py:135:check_for_outputs_on_disk: poststdstar job with exposure(s) [69584] already has some cframe's. Submitting smaller camword=a3.
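For readers unfamiliar with the camword values in these messages, here is a simplified decoder. It assumes that "a" means all three arms (b, r, z) of each spectrograph digit that follows, which matches the 30 files reported for camword=a0123456789 above. The function name and the output ordering are my own choices for illustration; desispec has its own camword utilities, which this sketch does not reproduce.

```python
def decode_camword_sketch(camword):
    """Expand a camword such as 'a0123456789' or 'a3' into camera names.

    Simplified assumption: 'a' selects arms b, r and z; 'b', 'r' or 'z'
    selects that single arm; each following digit is a spectrograph.
    """
    cameras = []
    arms = ()
    for ch in camword:
        if ch == "a":
            arms = ("b", "r", "z")
        elif ch in ("b", "r", "z"):
            arms = (ch,)
        elif ch.isdigit():
            if not arms:
                raise ValueError(f"digit before any arm letter in {camword!r}")
            cameras.extend(f"{arm}{ch}" for arm in arms)
        else:
            raise ValueError(f"unexpected character {ch!r} in {camword!r}")
    return cameras
```

Under these assumptions, the "smaller camword=a3" job in the last message covers only the three cameras of spectrograph 3.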

All the tests were performed using a test prod setup with the following script:

export DESI_SPECTRO_REDUX="/global/cfs/cdirs/desi/users/$USER/spectro/redux"
export SPECPROD=filecount
export EXPLIST="explist-$SPECPROD.txt"

mkdir -p "$DESI_SPECTRO_REDUX"
cd "$DESI_SPECTRO_REDUX"

## Get row names
head -1 /global/cfs/cdirs/desi/spectro/redux/cascades/run/scripts/tiles/explist-all-sv1.txt > "$EXPLIST"
## Get full night of good data
grep 20201214 /global/cfs/cdirs/desi/spectro/redux/cascades/run/scripts/tiles/explist-all-sv1.txt >> "$EXPLIST"
## Get full night where we'll remove just the joint cals
grep 20201216 /global/cfs/cdirs/desi/spectro/redux/cascades/run/scripts/tiles/explist-all-sv1.txt >> "$EXPLIST"
## Get only one tile from a night
grep 20201218 /global/cfs/cdirs/desi/spectro/redux/cascades/run/scripts/tiles/explist-all-sv1.txt | grep 80607 >> "$EXPLIST"
## Get only a subset of tiles from a night
grep 20201219 /global/cfs/cdirs/desi/spectro/redux/cascades/run/scripts/tiles/explist-all-sv1.txt | tail -n 10 >> "$EXPLIST"
## Get full night where we'll remove poststdstar fits
grep 20201221 /global/cfs/cdirs/desi/spectro/redux/cascades/run/scripts/tiles/explist-all-sv1.txt >> "$EXPLIST"
## Get full night where we'll remove stdstars and poststdstar fits
grep 20201222 /global/cfs/cdirs/desi/spectro/redux/cascades/run/scripts/tiles/explist-all-sv1.txt >> "$EXPLIST"
## Get full night with some missing cframes due to immobile petal
grep 20201223 /global/cfs/cdirs/desi/spectro/redux/cascades/run/scripts/tiles/explist-all-sv1.txt >> "$EXPLIST"

copyprod --explist "$EXPLIST" /global/cfs/cdirs/desi/spectro/redux/cascades "$SPECPROD"
cd "$SPECPROD"

## Include the exposure table for the 17th to generate an entire night from scratch
cp /global/cfs/cdirs/desi/spectro/redux/cascades/exposure_tables/202012/exposure_table_20201217.csv ./exposure_tables/202012/exposure_table_20201217.csv

## Remove the joint cals from 20201216
rm ./calibnight/20201216/*.fits

## Remove the poststdstar files from 20201221
rm ./exposures/20201221/*/cframe-*.fits
rm ./exposures/20201221/*/fluxcalib-*.fits

## Remove the stdstars and poststdstars from 20201222
rm ./exposures/20201222/*/stdstars-*.fits
rm ./exposures/20201222/*/cframe-*.fits
rm ./exposures/20201222/*/fluxcalib-*.fits

## Fix the file permissions on the tables
chmod u+rw ./exposure_tables/202012/exposure_table_202012*.csv

## Dry-run every night in the range, writing one log per night
for NIGHT in $(seq 20201214 20201223); do
    echo "Running $NIGHT"
    desi_run_night --night="$NIGHT" --dry-run &>"$NIGHT.log"
done
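
One way to review the per-night logs this loop writes is to tally the check_for_outputs_on_disk messages by the action taken. The sketch below is a hypothetical helper, not part of this PR; it keys off the phrases that appear in the example log lines above and assumes those phrases are stable.

```python
from collections import Counter


def classify(line):
    """Map one log line to 'skipped', 'partial', 'full', or None."""
    if "check_for_outputs_on_disk" not in line:
        return None
    if "Not submitting this job" in line:
        return "skipped"
    if "Submitting smaller camword=" in line:
        return "partial"
    if "Submitting full camword=" in line:
        return "full"
    return None


def summarize(lines):
    """Count the actions reported across an iterable of log lines."""
    return Counter(action for action in map(classify, lines) if action is not None)


def summarize_file(path):
    # Convenience wrapper for a single NIGHT.log file.
    with open(path) as fh:
        return summarize(fh)
```

For example, summarize_file("20201221.log") would report how many jobs for that night were skipped, shrunk, or submitted in full.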

@coveralls

Coverage Status

Coverage decreased (-0.05%) to 28.961% when pulling 608be87 on filecount into a01bcc0 on master.

@sbailey
Contributor

sbailey commented Mar 31, 2021

Looks good! Features sound great. For now I will trust your testing plus any mop-up of issues we discover when running Denali. Merging now.

Note: this will almost certainly create merge conflicts for the knl branch in PR #1215, so I will merge this first and rebase that branch and resubmit it.

@sbailey sbailey merged commit a0ec49b into master Mar 31, 2021
@sbailey sbailey deleted the filecount branch March 31, 2021 18:30
@sbailey sbailey mentioned this pull request Mar 31, 2021