Himalayas vetting: Anand file existence checks #1929

Open
sbailey opened this issue Dec 8, 2022 · 7 comments

sbailey commented Dec 8, 2022

From @araichoor on slack, with minor edits:

about himalayas sanity checks:
I've started to revive my fujalupe scripts; I just dump here a first result on file existence, in case others have more bandwidth to look at it: /global/homes/d/desi/labeled_proc_run_files/himalayas/qa/checks-himalayas-outputs-tiles-20221205.ecsv

I've assumed that survey=cmx/sv1/sv2/sv3 tiles have both the pernight and cumulative reductions, and other tiles only cumulative.
see the attached file for details: it lists the expected folders, the existence of each folder (ISFOLDER), and the number of files found for each PATTERN (spectra, coadd, etc.).
method:
I make a list of tileids from himalayas-exposures.fits, and I check for:

  • the expected folder
  • the expected files (for per-petal files, comparing to the number of spectra files).

I find overall the following “missing” stuff:

# PROD REDUX PATTERN NMISS
himalayas pernight ISFOLDER 252
himalayas pernight TILEQAFITS 253
himalayas pernight TILEQAPNG 252
himalayas cumulative ISFOLDER 75
himalayas cumulative COADD 1
himalayas cumulative REDROCK 1
himalayas cumulative RRDETAILS 1
himalayas cumulative QSOQN 62
himalayas cumulative QSOMGII 60
himalayas cumulative EMLINE 63
himalayas cumulative ZMTL 60
himalayas cumulative TILEQAFITS 75
himalayas cumulative TILEQAPNG 134

for instance, to check for “missing” folders:

>>> from astropy.table import Table
>>> d = Table.read("checks-himalayas-outputs-tiles-20221205.ecsv")
>>> sel = (d["REDUX"] == "pernight") & (~d["ISFOLDER"])
>>> sel.sum()
252
>>> d[sel][0]
<Row index=0>
 REDUX   TILEID  SUBSET                                   FOLDER                                  ISFOLDER SPECTRA COADD REDROCK RRDETAILS TILEQAFITS TILEQAPNG  ZMTL QSOQN QSOMGII EMLINE
 str10   int32    str8                                    str78                                     bool    int64  int64  int64    int64     int64      int64   int64 int64  int64  int64 
-------- ------ -------- ------------------------------------------------------------------------ -------- ------- ----- ------- --------- ---------- --------- ----- ----- ------- ------
pernight      9 20210501 /global/cfs/cdirs/desi/spectro/redux/himalayas/tiles/pernight/9/20210501    False       0     0       0         0          0         0     0     0       0      0

=> /global/cfs/cdirs/desi/spectro/redux/himalayas/tiles/pernight/9/20210501 is missing

for instance, to check for “missing” cumulative coadd files:

>>> d = Table.read("checks-himalayas-outputs-tiles-20221205.ecsv")
>>> sel = (d["REDUX"] == "cumulative") & (d["COADD"] != d["SPECTRA"])
>>> sel.sum()
1
>>> d[sel][0]
<Row index=0>
  REDUX    TILEID  SUBSET                                      FOLDER                                     ISFOLDER SPECTRA COADD REDROCK RRDETAILS TILEQAFITS TILEQAPNG  ZMTL QSOQN QSOMGII EMLINE
  str10    int32    str8                                       str78                                        bool    int64  int64  int64    int64     int64      int64   int64 int64  int64  int64 
---------- ------ -------- ------------------------------------------------------------------------------ -------- ------- ----- ------- --------- ---------- --------- ----- ----- ------- ------
cumulative  11239 20211031 /global/cfs/cdirs/desi/spectro/redux/himalayas/tiles/cumulative/11239/20211031     True      19    10      10        10          1         1    10    10      10     10

=> here there are 19 spectra files!
that's because there are files like spectra-0-11239-thru20211031_tmp44025.fits.gz…
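
As a rough illustration of the kind of per-folder check described in the method above (a sketch, not Anand's actual script; the path layout follows the FOLDER column in the ecsv, and the pattern list is abbreviated relative to the full set of columns):

import glob
import os

def check_tile_folder(reduxdir, redux, tileid, subset,
                      patterns=("spectra", "coadd", "redrock", "zmtl")):
    """Count files matching each PATTERN in one tiles/{redux}/{tileid}/{subset} folder.

    Sketch only: reduxdir would be e.g.
    /global/cfs/cdirs/desi/spectro/redux/himalayas, and the pattern list
    here is shorter than the full set in the ecsv table.
    """
    folder = os.path.join(reduxdir, "tiles", redux, str(tileid), str(subset))
    row = dict(REDUX=redux, TILEID=tileid, SUBSET=subset,
               FOLDER=folder, ISFOLDER=os.path.isdir(folder))
    for pattern in patterns:
        row[pattern.upper()] = len(glob.glob(os.path.join(folder, pattern + "-*.fits*")))
    return row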

sbailey added this to To do in Himalayas via automation on Dec 8, 2022

sbailey commented Jan 3, 2023

# PROD REDUX PATTERN NMISS
himalayas pernight ISFOLDER 252
himalayas pernight TILEQAFITS 253
himalayas pernight TILEQAPNG 252

These are due to user error (mine): I had submitted 202105nn, 20210610, and 20220130 without any redshift jobs and then forgot to go back and add the pernight redshift jobs. I don't remember exactly why I didn't include the pernight redshift jobs in the first place, but perhaps I wanted to verify that the other steps ran end-to-end before proceeding.

The one extra missing TILEQAFITS is tile 1 night 20210406, which had a transient error while fetching the imaging cutout; from himalayas/tiles/pernight/1/20210406/logs/tile-qa-1-20210406.log:

  File "/global/common/software/desi/perlmutter/desiconda/20220119-2.0.1/code/desispec/main/py/desispec/tile_qa_plot.py", line 278, in get_viewer_cutout
    subprocess.check_call(tmpstr, stderr=subprocess.DEVNULL, shell=True)
...
OSError: [Errno 14] Bad address: '/bin/sh'

tmpstr is a wget call to the imaging viewer to get the cutout. The code does check for subprocess.CalledProcessError but not OSError. This failure seems rare enough that I think this is OK for now.
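
For reference, catching the OSError as well would be a small change along these lines (a sketch, not the actual desispec code; tmpstr stands for the wget command string built inside get_viewer_cutout):

import subprocess

def run_cutout_command(tmpstr):
    """Run the wget command string, tolerating both a non-zero exit code and
    OS-level failures such as the "[Errno 14] Bad address: '/bin/sh'" above.

    Sketch only: tmpstr stands for the wget command that
    desispec.tile_qa_plot.get_viewer_cutout builds.
    """
    try:
        subprocess.check_call(tmpstr, stderr=subprocess.DEVNULL, shell=True)
        return True
    except (subprocess.CalledProcessError, OSError) as err:
        # Treat a failed cutout download as non-fatal so tile QA can proceed.
        print("WARNING: viewer cutout command failed: {}".format(err))
        return False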

Note: on Perlmutter, mixing MPI + subprocess calls is fragile, but so far this particular call hasn't been a problem in test runs. If it does become more problematic, we'll need a deeper refactor to replace the spawned wget call, but let's not try to tackle that right now just before Iron. (I may update my opinion after investigating the missing cumulative tileqa files.)
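
If that refactor ever does become necessary, the simplest option is probably a pure-Python download instead of a spawned wget, roughly like this (a sketch under assumed inputs; the actual cutout URL construction lives in get_viewer_cutout and is not reproduced here):

import urllib.error
import urllib.request

def fetch_cutout(url, outfile, timeout=30):
    """Download an imaging cutout without spawning a subprocess.

    Sketch only: url and outfile are placeholders; the real cutout URL is
    built inside desispec.tile_qa_plot.get_viewer_cutout.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            with open(outfile, "wb") as out:
                out.write(response.read())
        return True
    except (urllib.error.URLError, OSError) as err:
        # URLError is a subclass of OSError, but list both for clarity.
        print("WARNING: cutout download failed: {}".format(err))
        return False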


sbailey commented Jan 3, 2023

himalayas cumulative ISFOLDER 75
himalayas cumulative TILEQAFITS 75

56 of these are because I hadn't submitted cumulative redshifts for special tiles. The remainder are for a variety of reasons, including a testing-leftover bug in the cumulative redshift submission script (now fixed), incomplete cleanup after flagging some exposures as bad, and some leftover incomplete tiles that shouldn't have been submitted in the first place (also now fixed). I haven't resubmitted everything, but I don't see any red flags in the remaining missing folders, so I'm moving on to the cases where the folders exist but some files are missing.


sbailey commented Jan 3, 2023

himalayas cumulative TILEQAFITS 75
himalayas cumulative TILEQAPNG 134

134-75 = 59 tiles have tile-qa*.fits but not tile-qa*.png. One of these (tile 20736, night 20210620) is from a transient I/O failure and succeeded when rerun. Most (all?) of the others are job timeouts, driven by sv1 tiles having more exposures per tile than typical and by tile_qa_plot.py re-opening all of the cframe files to get the SCORES tables for the TSNR values. Ticket #1951 has some ideas for improving that, but if we don't get those in prior to Iron then we'll have to watch out for job timeouts and re-submit as needed.
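
For completeness, the affected tiles can be pulled out of Anand's table with a selection like this (assuming, as in the examples above, that TILEQAFITS and TILEQAPNG are per-tile file counts):

>>> from astropy.table import Table
>>> d = Table.read("checks-himalayas-outputs-tiles-20221205.ecsv")
>>> sel = (d["REDUX"] == "cumulative") & (d["TILEQAFITS"] > 0) & (d["TILEQAPNG"] == 0)
>>> sel.sum()   # should be the 134-75 = 59 tiles with fits but no png
>>> d["TILEID", "SUBSET"][sel]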


sbailey commented Jan 4, 2023

himalayas cumulative ZMTL 60

59 of those are because of job timeouts in earlier steps; tile 22471 was submitted with the old bash-based scripts and I apparently forgot the --run-zmtl option. I don't have a good explanation for why that one tile was missed: the original launch was done with a wrapper script, so all tiles should have had the same options, and I don't see evidence of a cleanup + rerun. Regardless, I'm going to move on from this one.


sbailey commented Jan 4, 2023

himalayas cumulative QSOQN 62
himalayas cumulative QSOMGII 60
himalayas cumulative EMLINE 63

59 of these are due to job timeouts in earlier steps; the remaining cases are job timeouts during this step itself. Re-running would fix them.


sbailey commented Jan 4, 2023

himalayas cumulative COADD 1
himalayas cumulative REDROCK 1
himalayas cumulative RRDETAILS 1

As Anand already identified in the original post, these are due to leftover spectra-*_tmp*.fits.gz files messing up the spectra accounting. After removing the tmp files, the numbers of spectra, coadd, redrock, and rrdetails files all match.
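
For future vetting, those leftover files can be spotted directly with a pattern match like this (a sketch; the _tmp naming follows the example file name in the original post):

>>> import glob
>>> tmpfiles = glob.glob("/global/cfs/cdirs/desi/spectro/redux/himalayas/tiles/cumulative/*/*/spectra-*_tmp*.fits.gz")
>>> len(tmpfiles)   # should account for the extra spectra files in the counts above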


sbailey commented Jan 4, 2023

Summarizing: I believe all cases are understood and boil down to two underlying causes:

  • user error (mine) in what was submitted
  • job timeouts, especially on SV tiles with a larger-than-typical number of exposures per tile

Most of these cases would have been avoided, or would have been much easier to clean up, if the cumulative redshift jobs were also tracked in the processing tables; I opened ticket #1952 about that. Ticket #1951 has some ideas for more efficient tile-qa I/O, which would reduce job timeouts, but with better job tracking those timeouts become much easier to identify and handle in the first place.

I'll leave this open on Himalayas, but for now I'm going to move on with the fixes that will make this easier for Iron rather than patching the missing pieces for Himalayas. Thanks to @araichoor for his useful file inventory script; we'll definitely use it again for Iron.
