Raise in JobQueueCluster._call when command returns with non-zero exit code #146

Merged
merged 6 commits into dask:master on Oct 17, 2018

Conversation

@lesteve
Collaborator

lesteve commented Aug 30, 2018

Fix #142.

Here is the same snippet as in #142:

from dask_jobqueue import SGECluster
from dask.distributed import Client
resource_spec = 'h_vmem=1000G,mem_req=16G'

cluster = SGECluster(queue='asdf',
                     cores=1,
                     memory='16GB',
                     resource_spec=resource_spec)
cluster.scale(2)

Error on master

Unable to run job: Job was rejected because job requests unknown queue "asdf".
Exiting.

tornado.application - ERROR - Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x2aaac6552620>, 2)
Traceback (most recent call last):
  File "/sequoia/data1/lesteve/miniconda3/envs/dask-dev/lib/python3.6/site-packages/tornado/ioloop.py", line 760, in _run_callback
    ret = callback()
  File "/sequoia/data1/lesteve/miniconda3/envs/dask-dev/lib/python3.6/site-packages/tornado/stack_context.py", line 276, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/lesteve/dev/dask-jobqueue/dask_jobqueue/core.py", line 375, in scale_up
    self.start_workers(n - active_and_pending)
  File "/home/lesteve/dev/dask-jobqueue/dask_jobqueue/core.py", line 302, in start_workers
    job = self._job_id_from_submit_output(out.decode())
  File "/home/lesteve/dev/dask-jobqueue/dask_jobqueue/core.py", line 420, in _job_id_from_submit_output
    raise ValueError(msg)
ValueError: Could not parse job id from submission command output.
Job id regexp is '(?P<job_id>\\d+)'
Submission command output is:

Error with this PR

tornado.application - ERROR - Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x2aaaeb9e3598>, 2)
Traceback (most recent call last):
  File "/sequoia/data1/lesteve/miniconda3/envs/dask-dev/lib/python3.6/site-packages/tornado/ioloop.py", line 760, in _run_callback
    ret = callback()
  File "/sequoia/data1/lesteve/miniconda3/envs/dask-dev/lib/python3.6/site-packages/tornado/stack_context.py", line 276, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/lesteve/dev/dask-jobqueue/dask_jobqueue/core.py", line 379, in scale_up
    self.start_workers(n - active_and_pending)
  File "/home/lesteve/dev/dask-jobqueue/dask_jobqueue/core.py", line 301, in start_workers
    out = self._submit_job(fn)
  File "/home/lesteve/dev/dask-jobqueue/dask_jobqueue/core.py", line 293, in _submit_job
    return self._call(shlex.split(self.submit_command) + [script_filename])
  File "/home/lesteve/dev/dask-jobqueue/dask_jobqueue/core.py", line 350, in _call
    cmd_str, out, err))
RuntimeError: Command exited with non-zero exit code.
Exit code: 1
Command:
qsub -terse /tmp/tmpj48t7zkx.sh
stdout:

stderr:
Unable to run job: Job was rejected because job requests unknown queue "asdf".
Exiting.

A few remarks:

  • Because I was touching this code, I took the opportunity to remove JobQueueCluster._calls, which ran multiple commands, since we only ever call one command at a time. Maybe @mrocklin remembers why there was _calls/_call (it may be related to the comment "and an opportunity to go asynchronous in the future", which, to be honest, I don't fully understand).
  • I am using .decode() on stdout and stderr to get better error messages; a rough sketch of the resulting _call is shown after this list. If anyone sees a reason why that could be problematic, let me know. Note: on Python 3, subprocess.Popen + communicate returns bytes, not str.
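
To make the second point concrete, here is a minimal, self-contained sketch of the behaviour this PR gives _call. It is only an illustration: the real _call is a method on JobQueueCluster in core.py and its exact code may differ.

import subprocess


def _call(cmd):
    """Run a single command and return its decoded stdout.

    Raise RuntimeError when the command exits with a non-zero exit code,
    embedding the decoded stdout and stderr in the error message.
    """
    cmd_str = ' '.join(cmd)
    proc = subprocess.Popen(cmd,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    # On Python 3, communicate() returns bytes, hence the .decode() calls
    out, err = proc.communicate()
    out, err = out.decode(), err.decode()
    if proc.returncode != 0:
        msg = ('Command exited with non-zero exit code.\n'
               'Exit code: {}\n'
               'Command:\n{}\nstdout:\n{}\nstderr:\n{}\n').format(
                   proc.returncode, cmd_str, out, err)
        raise RuntimeError(msg)
    return out

With something like this, a failing submission such as _call(shlex.split('qsub -terse') + [script_filename]) raises straight away with the qsub stderr shown above, instead of returning empty output that later breaks _job_id_from_submit_output.
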
@mrocklin

Member

mrocklin commented Aug 30, 2018

In general I really like the improved error messages.

I took the opportunity to remove JobQueueCluster._calls, which ran multiple commands, since we only ever call one command at a time. Maybe @mrocklin remembers why there was _calls/_call (it may be related to the comment "and an opportunity to go asynchronous in the future", which, to be honest, I don't fully understand).

No objection from me

@guillaumeeb
Member

guillaumeeb left a comment

Thanks for this improved message; a few comments related to LSF.

-        A list of commands, each of which is a list of strings to hand to
-        subprocess.communicate
+        cmd: List(str))
+            A command, each of which is a list of strings to hand to
+            subprocess.communicate

@guillaumeeb

guillaumeeb Sep 4, 2018

Member

subprocess.Popen ?

@@ -308,43 +310,48 @@ def scheduler(self):
         """ The scheduler of this cluster """
         return self.local_cluster.scheduler

-    def _calls(self, cmds, **kwargs):
-        """ Call a command using subprocess.communicate
+    def _call(self, cmd):

@guillaumeeb

guillaumeeb Sep 4, 2018

Member

You should keep the **kwargs parameter for LSF specificities.


proc = subprocess.Popen(cmd,
                        stdout=subprocess.PIPE,
                        stderr=subprocess.PIPE)

@guillaumeeb

guillaumeeb Sep 4, 2018

Member

Add **kwargs here, see previous comment.
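
For illustration only, keeping **kwargs essentially means forwarding it from _call to subprocess.Popen, as in the sketch below; the LSF-specific keyword arguments themselves are not shown, and the standalone function is just an assumption for readability.

import subprocess


def _call(cmd, **kwargs):
    # Extra keyword arguments are forwarded to subprocess.Popen so that
    # cluster implementations with special requirements (LSF in this
    # review thread) can customise how the submission command is run.
    proc = subprocess.Popen(cmd,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE,
                            **kwargs)
    out, err = proc.communicate()
    return out.decode(), err.decode()
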

@guillaumeeb

Member

guillaumeeb commented Oct 15, 2018

@lesteve, looks like you've been busy with other things lately. Do you plan on getting back to dask-jobqueue (this would be quite welcome 😀)? Or should we take over your ongoing work?

@guillaumeeb guillaumeeb referenced this pull request Oct 15, 2018

Closed

0.4.1 release #174

@lesteve

Collaborator Author

lesteve commented Oct 15, 2018

Indeed I have been busy with other things lately, sorry about that ...

Looks like there are non-trivial merge conflicts in this PR; I'll try to revive it. This will very likely not happen this week, but I'll try my best for next week.

guillaumeeb added some commits Oct 16, 2018

@guillaumeeb

Member

guillaumeeb commented Oct 16, 2018

Looks like there are non-trivial merge conflicts in this PR; I'll try to revive it. This will very likely not happen this week, but I'll try my best for next week.

So I've done the merge and addressed my remark; one final look from you @lesteve would be good.

@lesteve

Collaborator Author

lesteve commented Oct 16, 2018

Thanks a lot for this! There was still a merge conflict for some reason, so I merged upstream/master and pushed a new commit.

@guillaumeeb

Member

guillaumeeb commented Oct 17, 2018

Thanks @lesteve.

@guillaumeeb guillaumeeb merged commit f9a6851 into dask:master Oct 17, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed