Himalayas vetting: Anand file existence checks #1929
These are due to user (me) error: I had submitted 202105nn, 20210610, and 20220130 without any redshift jobs and then forgot to go back and add the pernight redshift jobs. I don't remember exactly why I didn't include the pernight redshift jobs in the first place, but perhaps I wanted to verify that the other steps ran end-to-end before proceeding. The one extra missing TILEQAFITS is tile 1 night 20210406, which had a transient error while fetching the imaging cutout; from himalayas/tiles/pernight/1/20210406/logs/tile-qa-1-20210406.log:
Note: on Perlmutter, mixing MPI + subprocess calls is fragile, but so far this particular call hasn't been a problem in test runs. If it does become more problematic, we'll need a deeper refactor to replace the spawned wget call, but let's not try to tackle that right now just before Iron. (I may update my opinion after investigating the missing cumulative tileqa files.)
56 of these are because I hadn't submitted cumulative redshifts for special tiles. The remainder are for a variety of reasons, including a testing-leftover bug in the cumulative redshift submission script (now fixed), incomplete cleanup after flagging some exposures as bad, and some leftover incomplete tiles that shouldn't have been submitted in the first place (also now fixed). I haven't resubmitted everything, but I don't see any red flags in the remaining missing folders, so I'm moving on to the cases where the folders exist but some files are missing.
134 - 75 = 59 tiles have tile-qa*.fits but not tile-qa*.png. One of these (tile 20736, night 20210620) is from a transient I/O failure and succeeded when rerun. Most (all?) of the others are job timeouts, driven by sv1 tiles having more exposures per tile than typical and tile_qa_plot.py re-opening all of the cframe files to get the SCORES tables for the TSNR values. Ticket #1951 has some ideas for improving that, but if we don't get those in prior to Iron then we'll have to watch out for job timeouts and re-submit as needed.
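For context only (not necessarily the approach #1951 will take), here is a minimal sketch of the per-cframe I/O pattern described above, assuming a fitsio-based read of just the SCORES HDU; the redux path and night are illustrative, and this is not the actual tile_qa_plot.py logic:

```python
# Hypothetical illustration: collect the SCORES tables (which carry the TSNR
# values) from every cframe of a tile, reading only that HDU rather than the
# whole file.  Paths are placeholders.
import glob
import fitsio

cframe_files = sorted(glob.glob(
    "/global/cfs/cdirs/desi/spectro/redux/himalayas/exposures/20210620/*/cframe-*.fits"))

scores = {}
for path in cframe_files:
    # ext="SCORES" reads just the SCORES binary-table HDU
    scores[path] = fitsio.read(path, ext="SCORES")

print(f"read SCORES from {len(scores)} cframe files")
```

Even reading a single HDU per file, an sv1 tile with many exposures still means many file opens, which is where the timeouts come from.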
59 of those are because of job timeouts in earlier steps; tile 22471 was submitted with the old bash-based scripts and I apparently forgot the --run-zmtl option. I don't have a good explanation for why that one tile was missed; the original launch was done with a wrapper script, so all tiles should have had the same options, and I don't see evidence of a cleanup+rerun. Regardless, I'm going to move on from this one.
59 of these are due to job timeouts in earlier steps. The remaining cases are job timeouts during this step. Re-running would fix them.
As Anand already identified in the original post, these are due to leftover spectra*tmp*.fits.gz files messing up the spectra accounting. After removing the tmp files, the numbers of spectra, coadd, redrock, and rrdetails files all match.
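For reference, a minimal sketch of that cleanup and recount; the tile directory is illustrative, and the glob patterns are assumptions based on the leftover filename quoted later in this issue:

```python
# Hypothetical sketch: remove leftover spectra*tmp*.fits.gz files from one
# cumulative tile directory, then check that the per-type file counts agree.
# The directory and glob patterns are illustrative.
import glob
import os

tiledir = "/global/cfs/cdirs/desi/spectro/redux/himalayas/tiles/cumulative/11239/20211031"

for tmpfile in glob.glob(os.path.join(tiledir, "spectra-*_tmp*.fits.gz")):
    os.remove(tmpfile)

# extensions differ between products, so match on the filename prefix only
counts = {prefix: len(glob.glob(os.path.join(tiledir, f"{prefix}-*")))
          for prefix in ("spectra", "coadd", "redrock", "rrdetails")}
print(counts)
assert len(set(counts.values())) == 1, f"mismatched file counts: {counts}"
```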
Summarizing: I believe all cases are understood, and they boil down to two underlying causes:
Most of these cases would have been avoided, or been much easier to clean up, if the cumulative redshift jobs were tracked in the processing tables too; I opened ticket #1952 about that. Ticket #1951 has some ideas for more efficient tile-qa I/O, which would reduce job timeouts, but if better job tracking is implemented those become much easier to identify and handle in the first place. I'll leave this open on Himalayas, but for now I'm going to move on with the fixes that will make this easier for Iron rather than patching the missing pieces for Himalayas. Thanks to @araichoor for his useful file inventory script; we'll definitely use it again for Iron.
From @araichoor on slack, with minor edits:
about himalayas sanity checks:
I've started to revive my fujalupe scripts; I just dump here a first result on the file existence, in case others have more bandwidth to look at that:
/global/homes/d/desi/labeled_proc_run_files/himalayas/qa/checks-himalayas-outputs-tiles-20221205.ecsv
I've assumed that survey=cmx/sv1/sv2/sv3 tiles have the pernight and cumulative reductions; otherwise only cumulative.
see attached file for details: this file lists the expected folders, the existence of each folder (ISFOLDER), and the number of found files for each PATTERN (spectra, coadd, etc.).
method:
I make a list of tileids from himalayas-exposures.fits
I check for:
I find overall the following “missing” stuff:
for instance, to check for “missing” folders:
=> /global/cfs/cdirs/desi/spectro/redux/himalayas/tiles/pernight/9/20210501 is missing
for instance, to check for “missing” cumulative coadd files:
=> here there are 19 spectra files!
that's because there are files like spectra-0-11239-thru20211031_tmp44025.fits.gz…
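For anyone who wants to re-run these checks, a minimal sketch of querying the ecsv above with astropy; ISFOLDER is named in the description, but the folder-path and per-pattern count column names used here (FOLDER, COADD) are guesses:

```python
# Hypothetical sketch of filtering the checks file for missing folders and for
# folders with no coadd files.  Column names other than ISFOLDER are assumed.
from astropy.table import Table

checks = Table.read(
    "/global/homes/d/desi/labeled_proc_run_files/himalayas/qa/"
    "checks-himalayas-outputs-tiles-20221205.ecsv")

isfolder = checks["ISFOLDER"].astype(bool)

# expected tile/night folders that do not exist on disk
for row in checks[~isfolder]:
    print(row["FOLDER"], "is missing")

# folders that exist but contain zero coadd files
no_coadd = checks[isfolder & (checks["COADD"] == 0)]
print(len(no_coadd), "folders with no coadd files")
```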