Operate on a single exposure #604

Merged
merged 4 commits into master from issue_596 on May 4, 2018

Conversation

@tskisner (Member)

The desi_pipe tasks and chain commands now accept an exposure ID option. This can be used for fine-grained task selection. Note that not all exposure types are valid for all pipeline steps; for example, selecting psf tasks for a science exposure will return an empty list of tasks. I also addressed another issue (#591) while working on this. Closes #596 and closes #591.
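
For example (invocations of this form appear in the test runs later in this thread; the night and exposure ID values come from a specific test production and are illustrative only):

desi_pipe tasks --tasktype sky --night 20191001 --expid 3577 --state ready
desi_pipe chain --tasktypes sky,starfit,fluxcalib,cframe --night 20191001 --expid 3577 --pack --nersc edison --nersc_queue debug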

@tskisner (Member Author)

Thanks for making those options consistent. My next section of work will focus on the nightly script: bringing it up to date with desi_pipe and configuring it to take per-exposure actions based on the exposure type.

@sbailey (Contributor) commented Apr 29, 2018

I fixed an options bug in desi_start_night which I think pre-dated this PR. After that, it submitted jobs but they failed at the very first preproc step. From /global/cscratch1/sd/sjbailey/desi/spectro/redux/dailytest/run/scripts/night/20160726/preproc_20180428-214214/cori-haswell_20180428-214443.log:

INFO:pipe_exec.py:81:main:   Using spectro production dir /global/cscratch1/sd/sjbailey/desi/spectro/redux/dailytest
Traceback (most recent call last):
  File "/global/cscratch1/sd/sjbailey/desi/code/desispec/bin/desi_pipe_exec_mpi", line 42, in <module>
    pipe_exec.main(args, comm=comm)
  File "/global/cscratch1/sd/sjbailey/desi/code/desispec/py/desispec/scripts/pipe_exec.py", line 116, in main
    comm=comm, db=db)
  File "/global/cscratch1/sd/sjbailey/desi/code/desispec/py/desispec/pipeline/run.py", line 178, in run_task_list
    runtasks = [ x for x in tasklist if states[x] == "ready" ]
  File "/global/cscratch1/sd/sjbailey/desi/code/desispec/py/desispec/pipeline/run.py", line 178, in <listcomp>
    runtasks = [ x for x in tasklist if states[x] == "ready" ]
KeyError: 'preproc_20160726_z_0_00000001'
slurmstepd: error: *** STEP 11952209.0 ON nid00755 CANCELLED AT 2018-04-28T21:46:50 DUE TO TIME LIMIT ***
srun: got SIGCONT
srun: forcing job termination

The state of my database is:

[cori02 desispec] desi_pipe top --once
----------------+---------+---------+---------+---------+---------+---------+
   Task Type    | waiting | ready   | running | done    | failed  | submit  |
----------------+---------+---------+---------+---------+---------+---------+
preproc         |        0|        6|        0|        0|        0|        6|
psf             |        3|        0|        0|        0|        0|        3|
psfnight        |        3|        0|        0|        0|        0|        3|
traceshift      |        3|        0|        0|        0|        0|        3|
extract         |        3|        0|        0|        0|        0|        3|
fiberflat       |        3|        0|        0|        0|        0|        3|
fiberflatnight  |        3|        0|        0|        0|        0|        3|
sky             |        0|        0|        0|        0|        0|        0|
starfit         |        0|        0|        0|        0|        0|        0|
fluxcalib       |        0|        0|        0|        0|        0|        0|
cframe          |        0|        0|        0|        0|        0|        0|
spectra         |        0|        0|        0|        0|        0|       NA|
redshift        |        0|        0|        0|        0|        0|       NA|
----------------+---------+---------+---------+---------+---------+---------+

Notes:

  • the failed preproc steps died with a specific error, but they are still in the "ready" state (not even "running") when they should actually be in the "failed" state
  • the failure might be unrelated to this PR, or it might be caused by an update to the expid DB filtering logic; I'm reporting it in this PR because it is failing on this branch (though I haven't gone back and retried with master)
  • there were no individual task logs; this error message came from the top-level slurm log (side note: why are cori-haswell_11952209.log and cori-haswell_20180428-214443.log separate files?)

Bottom line: I never got to testing the specific desi_pipe chain --expid option. The code changes here look fine, but please take a look at whether these are quickly fixable problems to include in this PR too. And if they were caused by this PR, they certainly should be fixed before merging.
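
(For reference, a hypothetical check using only the flags exercised elsewhere in this thread: listing the preproc tasks that the database still reports as ready should show the six tasks from the table above, even though they actually died.)

desi_pipe tasks --tasktype preproc --night 20160726 --state ready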

@julienguy (Contributor)

One issue comes up when running desi_pipe chain in interactive shell script mode.
(My production can be loaded with source /global/cscratch1/sd/jguy/sim2017/redux/month9/setup.sh)

  • desi_pipe tasks --tasktype sky --night 20191001 --expid 3577 --state ready
    shows a full frame to run (30 tasks)

  • the batch jobs work fine:
    desi_pipe chain --tasktypes sky,starfit,fluxcalib,cframe --night 20191001 --expid 3577 --pack --nersc edison --nersc_queue debug

  • but the shell script fails:
    desi_pipe chain --tasktypes sky,starfit,fluxcalib,cframe --night 20191001 --expid 3577 --pack

Step(s) to run: sky,starfit,fluxcalib,cframe
logging to /global/cscratch1/sd/jguy/sim2017/redux/month9/run/scripts/sky-cframe_20180430-092809/run_20180430-092809.log
WARNING:  script /global/cscratch1/sd/jguy/sim2017/redux/month9/run/scripts/sky-cframe_20180430-092809/run.sh had return code = 1
cat /global/cscratch1/sd/jguy/sim2017/redux/month9/run/scripts/sky-cframe_20180430-092809/run_20180430-092809.log
INFO:pipe_exec.py:66:main: Python startup time: 0 min 3 sec
INFO:pipe_exec.py:79:main: Starting at Mon Apr 30 09:28:12 2018
INFO:pipe_exec.py:80:main:   Using raw dir /global/cscratch1/sd/jguy/sim2017/sim/month_cosmics
INFO:pipe_exec.py:81:main:   Using spectro production dir /global/cscratch1/sd/jguy/sim2017/redux/month9
INFO:pipe_exec.py:121:main:   0 tasks already done, 0 tasks were ready, and 0 failed
INFO:pipe_exec.py:124:main: Run time: 0 min 0 sec
/global/common/software/desi/users/jguy/desispec/py/desispec/scripts/pipe_exec.py:139: RuntimeWarning: No tasks were ready or done
  warnings.warn("No tasks were ready or done", RuntimeWarning)

@tskisner (Member Author) commented May 4, 2018

@julienguy, this latest small commit should fix the typo that caused your problem. The shell script generation in packed mode now puts all commands in the script, not just the last one in the list.
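
A rough sketch of the intended structure of the packed run.sh (hypothetical; the real generated script also sets up logging, and the actual desi_pipe_exec arguments are omitted here):

#!/bin/bash
# one desi_pipe_exec invocation per task type in the chain,
# rather than only the last one in the list
desi_pipe_exec ...   # sky tasks
desi_pipe_exec ...   # starfit tasks
desi_pipe_exec ...   # fluxcalib tasks
desi_pipe_exec ...   # cframe tasks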

@sbailey, some comments:

  • The cori-haswell_20180428-214443.log file is the output of desi_pipe_exec{_mpi} and is date-stamped with the time the job actually runs (9:44:43 PM on 4/28/18 in this case). The other log (cori-haswell_11952209.log) is the output of the overall slurm script, and includes the job ID for easy reference to logs on mynersc, etc. The slurm output name can be either a fixed string (which gets overwritten on subsequent runs of the same slurm script) or can contain the job ID (as above); this is set in the #SBATCH -o option (see the sketch after this list). We could choose not to redirect the desi_pipe_exec output to a separate log, in which case you would just have a single cori-haswell_11952209.log file with the outputs of all the packed desi_pipe_exec commands. I kind of like the individual date-stamped log files, but if that is too much clutter then we can dump it all into the main log.

  • Nightly script. Although I did port it to calling "desi_pipe chain" for the original set of steps we discussed, subsequent discussion has outlined how this script will be rewritten to perform different operations on a per-exposure basis, depending on the exposure type. It is only a couple hundred lines, and half of that is checking input parameters. This is the next task on my list (issue #597, Nightly script improvements). If something there is broken, I suggest we either (1) ignore it, since we don't yet use this script for anything and are about to replace it, or (2) do that work in this branch and wait to merge the PR until it is done. I am fine with either solution.
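
A minimal sketch of the log-naming scheme described above (hypothetical; the generated slurm scripts contain more than this):

# %j in the -o option expands to the slurm job ID, e.g. cori-haswell_11952209.log
#SBATCH -o cori-haswell_%j.log

# the pipeline step output is then redirected to its own date-stamped log
srun ... desi_pipe_exec_mpi ... > cori-haswell_$(date +%Y%m%d-%H%M%S).log 2>&1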

@julienguy (Contributor)

I verified that the first issue is fixed.
Here is another bug when running an interactive shell script, apparently in the logging redirection:

source /global/cscratch1/sd/jguy/sim2017/redux/month9/setup.sh
desi_pipe chain --tasktypes sky,starfit,fluxcalib,cframe --night 20191001 --expid 3578 --pack

cat /global/cscratch1/sd/jguy/sim2017/redux/month9/run/scripts/sky-cframe_20180504-113730/run_20180504-113730.log
INFO:pipe_exec.py:66:main: Python startup time: 0 min 4 sec
INFO:pipe_exec.py:79:main: Starting at Fri May  4 11:37:34 2018
INFO:pipe_exec.py:80:main:   Using raw dir /global/cscratch1/sd/jguy/sim2017/sim/month_cosmics
INFO:pipe_exec.py:81:main:   Using spectro production dir /global/cscratch1/sd/jguy/sim2017/redux/month9
Traceback (most recent call last):
  File "/global/common/software/desi/users/jguy/desispec/bin/desi_pipe_exec", line 14, in <module>
    pipe_exec.main(args)
  File "/global/common/software/desi/users/jguy/desispec/py/desispec/scripts/pipe_exec.py", line 116, in main
    comm=comm, db=db)
  File "/global/common/software/desi/users/jguy/desispec/py/desispec/pipeline/run.py", line 294, in run_task_list
    logfile=tasklog, db=db)
  File "/global/common/software/desi/users/jguy/desispec/py/desispec/pipeline/run.py", line 86, in run_task
    comm=comm)
  File "/global/common/software/desi/edison/desiconda/20180130-1.2.4-spec/conda/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/global/common/software/desi/users/jguy/desispec/py/desispec/parallel.py", line 403, in stdouterr_redirected
    with open(fname) as infile:
FileNotFoundError: [Errno 2] No such file or directory: '/global/cscratch1/sd/jguy/sim2017/redux/month9/run/logs/night/20191001/sky_20191001_b_7_00003578.log_0'

@tskisner (Member Author) commented May 4, 2018

Hmmm, I have not seen that one before. I have not tested the plain shell scripts much at NERSC. I can try to reproduce that on an interactive node.

@tskisner (Member Author) commented May 4, 2018

@julienguy, I just successfully ran a chain of jobs using shell scripts:

  1. Get an interactive compute node on edison:
salloc -N 1 -A desi -p debug -t 00:30:00
  2. Run (for example):
desi_pipe chain --night 20191001 --expid 3578 --pack --tasktypes starfit,fluxcalib,cframe

Was your previous test on an edison login node or a compute node? The login nodes have memory limits that might have been exceeded.

@tskisner (Member Author) commented May 4, 2018

Approved in offline discussion by @sbailey and @julienguy.

@tskisner tskisner merged commit b129297 into master May 4, 2018
@tskisner tskisner deleted the issue_596 branch May 4, 2018 21:04
Successfully merging this pull request may close these issues:

  • Job chains should have an exposure ID option
  • DataBase.getready() should optionally have a night parameter