fail early if assemble_fibermap fails #876

Merged
merged 2 commits into master from dailyproc_fibermap on Feb 21, 2020

Conversation

sbailey (Contributor) commented Feb 14, 2020

This PR makes two changes to desi_proc to make debugging failures a bit more human-friendly:

  • if assemble_fibermap fails (e.g. due to a missing input file), exit immediately
    instead of proceeding to preprocessing and having N>>1 ranks fail with
    interleaved messages about missing fibermaps (see the sketch below).
  • science exposures split processing into two srun steps with different
    levels of parallelism -- now if the first step fails, the slurm script exits without
    running the second step (which would otherwise just fail again with confusingly similar
    log messages).
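
A minimal sketch of the first change's fail-early check (hypothetical code, not the exact desi_proc implementation; desi_proc itself wraps the call with desispec.util.runcmd as seen in the log below, and the paths and variable names here are only illustrative):

    import os
    import subprocess
    import sys

    # Illustrative night/expid/output path; in desi_proc these come from the command line.
    night, expid = 20200212, 48313
    fibermap = f'preproc/{night}/{expid:08d}/fibermap-{expid:08d}.fits'

    # Build the fibermap before preprocessing starts; if it fails, exit right away
    # so that N>>1 preproc ranks never launch and then crash on a missing fibermap.
    cmd = ['assemble_fibermap', '-n', str(night), '-e', str(expid), '-o', fibermap]
    returncode = subprocess.run(cmd).returncode
    if returncode != 0 or not os.path.exists(fibermap):
        print('CRITICAL: assemble_fibermap failed; exit now', file=sys.stderr)
        sys.exit(13)  # non-zero exit also lets the slurm script skip the second srun step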

I tested this with desi_proc --batch --traceshift --scattered-light -n 20200212 -e 48313, which indeed fails due to a science exposure with a missing input coordinates file, but now it exits more cleanly; see /global/cfs/cdirs/desi/spectro/redux/daily/run/scripts/night/20200212/science-20200212-00048313-b1b2b4b5b8r1r2r4r5r8z1z2z4z5z8-28139563.log:

INFO:util.py:74:runcmd: RUNNING: assemble_fibermap -n 20200212 -e 48313 -o /global/cfs/cdirs/desi/spectro/redux/daily/preproc/20200212/00048313/fibermap-00048313.fits
Traceback (most recent call last):
  File "/global/common/software/desi/users/sjbailey/desispec/bin/assemble_fibermap", line 21, in <module>
    fibermap = assemble_fibermap(args.night, args.expid)
  File "/global/common/software/desi/users/sjbailey/desispec/py/desispec/io/fibermap.py", line 359, in assemble_fibermap
    f'No coordinates*.fits file in fiberassign dir {dirname}')
FileNotFoundError: No coordinates*.fits file in fiberassign dir /global/cfs/cdirs/desi/spectro/data/20200212/00048313
  Outputs
    /global/cfs/cdirs/desi/spectro/redux/daily/preproc/20200212/00048313/fibermap-00048313.fits
INFO:util.py:97:runcmd: Thu Feb 13 16:12:03 2020
CRITICAL:util.py:99:runcmd: FAILED assemble_fibermap -n 20200212 -e 48313 -o /global/cfs/cdirs/desi/spectro/redux/daily/preproc/20200212/00048313/fibermap-00048313.fits
CRITICAL:desi_proc:324:<module>: assemble_fibermap failed for science exposure; exit now
srun: error: nid00535: tasks 50-74: Exited with exit code 13
srun: Terminating job step 28139563.0
srun: error: nid00534: tasks 25-49: Exited with exit code 13
srun: error: nid00536: tasks 75-99: Exited with exit code 13
srun: error: nid00533: tasks 0-24: Exited with exit code 13
FAILED: done at Thu Feb 13 16:12:04 PST 2020

@akremin please take a look

akremin (Member) commented Feb 20, 2020

Hi Stephen, I'm not 100% confident in my testing, but I'm getting a consistent error every time I try to run this branch.

This is the output:

    kremin@nid00082:~/> desi_proc --traceshift --cameras b7,r7,z7 -n 20200219 -e 50937 --mpi --nofluxcalib
    INFO:desi_proc:271:<module>: ----------
    INFO:desi_proc:272:<module>: Input /global/cfs/cdirs/desi/spectro/data/20200219/00050937/desi-00050937.fits.fz
    INFO:desi_proc:273:<module>: Night 20200219 expid 50937
    INFO:desi_proc:274:<module>: Obstype ARC
    INFO:desi_proc:275:<module>: Cameras ['b7', 'r7', 'z7']
    INFO:desi_proc:276:<module>: Output root /global/cfs/cdirs/desi/spectro/redux/kremin
    INFO:desi_proc:277:<module>: ----------
    INFO:desi_proc:299:<module>: Starting preproc at Thu Feb 20 15:32:43 2020
    Traceback (most recent call last):
      File "/global/homes/k/kremin/workspace/test_stephen/desispec/bin/desi_proc", line 316, in <module>
        fibermap_ok = os.path.exists(fibermap)
      File "/global/common/software/desi/cori/desiconda/20190804-1.3.0-spec/conda/lib/python3.6/genericpath.py", line 19, in exists
        os.stat(path)
    TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType

This occurs when using the --batch keyword, when the job is submitted directly to the queue, and in interactive mode.

sbailey (Contributor, Author) commented Feb 20, 2020

Thanks. I can reproduce that; debugging now.

sbailey (Contributor, Author) commented Feb 21, 2020

Good catch. My code only worked for OBSTYPE=SCIENCE, but failed for other types of exposures (like this arc you were testing). Changes pushed; please update and retest.
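
For reference, a sketch of the kind of guard this needs (hypothetical, not the literal diff; the function and variable names are illustrative): fibermap is None for arcs and flats, so it must not be passed straight to os.path.exists, and the existence check should only apply to science exposures.

    import os
    import sys

    def fibermap_ok_or_exit(obstype, fibermap, log):
        """Illustrative guard: arcs/flats have no fibermap (fibermap is None),
        so only science exposures require the file to exist."""
        if obstype != 'SCIENCE':
            return True
        if fibermap is None or not os.path.exists(fibermap):
            log.critical('assemble_fibermap failed for science exposure; exit now')
            sys.exit(13)
        return True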

Also note: the --mpi option alone is not sufficient to get MPI parallelism; you also need to prefix the command with srun, e.g.

srun -n 20 -c 2 desi_proc  \
    --traceshift --cameras b7,r7,z7 \
    -n 20200219 -e 50937 --mpi --nofluxcalib

though in this case the bug arose before it ever got around to needing MPI.
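
For context, here is the usual pattern (assuming the common mpi4py setup; a sketch, not the literal desi_proc code): the --mpi flag only enables use of MPI inside the script, while the actual number of ranks comes from the srun launcher.

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument('--mpi', action='store_true', help='use MPI parallelism')
    args, _ = parser.parse_known_args()

    comm, rank, size = None, 0, 1
    if args.mpi:
        from mpi4py import MPI
        comm = MPI.COMM_WORLD
        rank, size = comm.rank, comm.size  # size stays 1 unless launched with srun -n N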

akremin merged commit 45770f8 into master on Feb 21, 2020

akremin (Member) commented Feb 21, 2020

That issue is now solved. The code runs, and the pull request only affects desi_proc. Approving.

sbailey deleted the dailyproc_fibermap branch on February 21, 2020 00:28