
Make job submission asynchronous #470

Open
bbockelm opened this issue Oct 16, 2020 · 9 comments · May be fixed by #473
@bbockelm

I have noticed that execution of commands (e.g., condor_submit for the condor backend) appears to be synchronous. In fact, there's a small note about this in the code itself:

https://github.com/dask/dask-jobqueue/blob/master/dask_jobqueue/core.py#L305

We've started to notice this particularly on very busy batch schedulers. For example, when dask-labextension (https://github.com/dask/dask-labextension) is used in a Jupyter notebook, it will spawn the Dask scheduler inside the jupyter hub process (I think I got this terminology right?) and not the notebook. Because it's in the hub itself, if dask-jobqueue is non-responsive then the entire UI freezes (as no I/O is done in the event loop). This triggers user complaints of "Jupyter stops working when we use Dask".

The impact of the blocking behavior can be easily seen by replacing the submit executable with a shell script that does a sleep 20 before invoking the real submit executable.
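For illustration, a hypothetical wrapper along these lines (shown here as a Python script, though a two-line shell script works just as well; the "real" binary path is made up) reproduces the freeze:

    #!/usr/bin/env python3
    # Hypothetical stand-in for condor_submit: delay, then forward all
    # arguments to the real binary (the path is an assumption).
    import subprocess
    import sys
    import time

    time.sleep(20)
    sys.exit(subprocess.call(["/usr/bin/condor_submit.real", *sys.argv[1:]]))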

@mivade

mivade commented Oct 16, 2020

For what it's worth, this should be pretty trivial to make async using asyncio.create_subprocess_exec. Its API is nearly identical to subprocess.Popen's.
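A minimal sketch of what that could look like for _submit_job (the error handling and the exact call site are assumptions, not dask-jobqueue's actual code):

    import asyncio
    import shlex

    async def _submit_job(self, script_filename):
        # Launch the scheduler's submit command without blocking the event loop.
        proc = await asyncio.create_subprocess_exec(
            *shlex.split(self.submit_command), script_filename,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        out, err = await proc.communicate()  # yields to the loop while waiting
        if proc.returncode != 0:
            raise RuntimeError(
                f"Submit command exited with {proc.returncode}: {err.decode()}"
            )
        return out.decode()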

@mivade

mivade commented Oct 16, 2020

Or, to avoid changing anything else that uses the _call method, something like:

    async def _submit_job(self, script_filename):
        return await asyncio.get_running_loop().run_in_executor(None, self._call, ...)

(Note that run_in_executor takes the callable and its arguments separately; calling self._call(...) inline would block the loop before the executor is ever involved.)

@oshadura

@bbockelm I tested both solutions proposed by @mivade in our environment, but I still see some non-responsiveness (I will investigate).

@oshadura oshadura linked a pull request Nov 13, 2020 that will close this issue
@guillaumeeb
Member

it will spawn the Dask scheduler inside the jupyter hub process (I think I got this terminology right?)

For the terminology, dask-labextension runs in the JupyterLab UI process (the notebook server), not in the kernel (the separate process where the code, e.g. the notebook cells, gets executed).

Because it's in the hub itself, if dask-jobqueue is non-responsive then the entire UI freezes (as no I/O is done in the event loop)

So if I understand correctly, the condor_submit command takes time to run, and it therefore blocks the whole JupyterLab UI through dask-labextension.

An easy workaround for the time being is to stop using dask-labextension to start Dask clusters 😄. I understand this can be seen as a regression for users... For my part, I've never used dask-labextension to launch Dask clusters on our job scheduling system; I always do it inside a notebook cell (so in the kernel), as in the sketch below. I only use the extension to watch my computations.
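For reference, a minimal sketch of that workaround, starting the cluster from a notebook cell instead of through the extension (the resource values are illustrative):

    from dask.distributed import Client
    from dask_jobqueue import HTCondorCluster

    # Illustrative resources; use whatever your pool requires.
    cluster = HTCondorCluster(cores=1, memory="2 GB", disk="1 GB")
    cluster.scale(jobs=4)  # a slow condor_submit now only stalls this kernel
    client = Client(cluster)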

I also think that this might be a Condor issue (job submission should be almost immediate in job queueing systems), or that it could perhaps be handled in dask-labextension?

Anyway, if you find a simple way to make things asynchronous here, that would be welcome too!

@guillaumeeb
Member

Related to #567

@guillaumeeb guillaumeeb added this to the 0.9 milestone Aug 30, 2022
@jrueb
Contributor

jrueb commented Jul 4, 2023

I am experiencing a similar issue to the one described here. In fact, my workers actually exit because the main process hangs for so long, all because it is busy waiting for condor_submit to exit.
The suggestion from @mivade of using run_in_executor fixes the issue for me, and submission is now amazingly fast. With the fix, the exact code for _submit_job in core.py looks like this:

    async def _submit_job(self, script_filename):
        return await asyncio.get_running_loop().run_in_executor(
            None, self._call, shlex.split(self.submit_command) + [script_filename]
        )

Would love to see this changed!

@guillaumeeb
Member

Well, I know almost nothing about asyncio. I think we should make dask-jobqueue more compatible with it, but I'm also not sure whether we can get there just with small changes like this, or can we?

cc @jacobtomlinson.
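For context, JobQueueCluster subclasses distributed's SpecCluster, which already exposes an asynchronous mode, so from the user's side asyncio support looks roughly like the sketch below (untested, resource values illustrative; the point of this issue is that the internal submit call can still block the loop):

    import asyncio
    from dask_jobqueue import HTCondorCluster

    async def main():
        # asynchronous=True is inherited from distributed's SpecCluster.
        cluster = await HTCondorCluster(
            cores=1, memory="2 GB", disk="1 GB", asynchronous=True
        )
        cluster.scale(jobs=2)  # if _submit_job blocks internally, the
        await cluster.close()  # event loop stalls while jobs are submitted

    asyncio.run(main())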

@jacobtomlinson
Member

@jrueb it would be great to see a PR with this change. If self._call hangs for a long time with blocking IO it makes sense to run it in an executor.

@jrueb
Contributor

jrueb commented Jul 6, 2023

Okay, I will look into it and make a PR once I have a satisfying solution. It will also be interesting to see why the last PR for this was never finished.
