Stop workers politely #31
base: master
Conversation
JOB_IDS = ['JOB_ID', 'SLURM_JOB_ID', 'LSB_JOBID']
JOB_ID = ''.join('$' + s for s in JOB_IDS)
TASK_IDS = ['SGE_TASK_ID', 'SLURM_ARRAY_TASK_ID', 'LSB_JOBINDEX']
TASK_ID = ''.join('$' + s for s in TASK_IDS)
Seems much cleaner
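For readers unfamiliar with the trick in the hunk above: only the active scheduler sets its own variable, so when the combined string is expanded by the shell inside the job script, the unset names vanish and only the real job id remains. A minimal sketch, not part of the PR; the value `4242` is made up:

```python
import os
import subprocess

JOB_IDS = ['JOB_ID', 'SLURM_JOB_ID', 'LSB_JOBID']
JOB_ID = ''.join('$' + s for s in JOB_IDS)   # "$JOB_ID$SLURM_JOB_ID$LSB_JOBID"

# Pretend we are running under SGE; 4242 is a hypothetical job id.
env = dict(os.environ, JOB_ID='4242')
env.pop('SLURM_JOB_ID', None)
env.pop('LSB_JOBID', None)

# The job script's shell performs the expansion; unset names expand to nothing.
expanded = subprocess.check_output(['/bin/bash', '-c', 'echo -n %s' % JOB_ID], env=env)
print(expanded.decode())  # -> "4242"
```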
logger.info("Starting workers due to resource constraints: %s", workers)

if busy and not s.idle:
-    workers = self.cluster.start_workers(len(busy))
+    workers = yield self.cluster._start_workers(len(busy))
Whoops, thank you for catching this.
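For background on why the missing yield mattered: calling a tornado coroutine without yield just returns a Future and never waits for (or surfaces errors from) the work. A generic illustration, not code from this PR:

```python
from tornado import gen, ioloop

@gen.coroutine
def start_workers(n):
    # Stand-in for the real cluster call; simply returns n fake workers.
    raise gen.Return(list(range(n)))

@gen.coroutine
def main():
    not_waited = start_workers(3)    # a Future; the result is never awaited
    waited = yield start_workers(3)  # the actual result: [0, 1, 2]
    print(type(not_waited).__name__, waited)

ioloop.IOLoop.current().run_sync(main)
```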
raise gen.Return(workers)

@gen.coroutine
def _wait_for_started_workers(self, ids, timeout, kwargs):
In some cases we may ask for many more workers than we are likely to get from the job scheduler. Thoughts on how to make this process robust to getting only some of the requested workers? Perhaps a coroutine per worker?
You mean the job scheduler may refuse to launch certain jobs? I don't really know how job schedulers work.
Hrm, actually, I suppose that since we're doing array jobs the job scheduler will probably allocate workers in an all-or-nothing manner. I'll retract my comment. In the future we may want to change start_workers to ask for batches of workers in some ideal size.
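A rough sketch of the batching idea; none of this is in the PR, and the batch size of 10 is arbitrary:

```python
def batch_sizes(n_workers, batch_size=10):
    """Split one large worker request into several smaller array jobs."""
    return [min(batch_size, n_workers - start)
            for start in range(0, n_workers, batch_size)]

# batch_sizes(25) -> [10, 10, 5]: three array jobs instead of one job of 25
```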
Just thought I'd chime in: with array jobs you can actually get scheduled incrementally, and it would be nice for jobs to start working without waiting for the other workers.
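To make the "coroutine per worker" idea above concrete, here is a hedged sketch using tornado primitives. It is not the PR's implementation; the Event objects stand in for however worker arrival is actually signalled, and the point is only that workers which never show up do not fail the whole batch:

```python
from datetime import timedelta
from tornado import gen

@gen.coroutine
def wait_for_worker(started, timeout):
    """Wait for a single worker's 'started' Event; report False on timeout."""
    try:
        yield gen.with_timeout(timedelta(seconds=timeout), started.wait())
        raise gen.Return(True)
    except gen.TimeoutError:
        raise gen.Return(False)

@gen.coroutine
def wait_for_workers(events, timeout):
    # One coroutine per expected worker; partial arrival is tolerated.
    arrived = yield [wait_for_worker(ev, timeout) for ev in events]
    raise gen.Return([ev for ev, ok in zip(events, arrived) if ok])
```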
Closed & re-opened to try and schedule a Travis build. |
b5807f4
to
c6510ba
Compare
|
||
# We got enough new workers, see if they correspond
# to the runBulkJobs request
environs = yield client._run(get_environ, workers=worker_addresses)
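get_environ is defined elsewhere in the PR; a plausible shape for it (my guess, not a verbatim quote) is a function that each worker runs locally via Client.run so that worker addresses can be matched back to their scheduler job and task ids:

```python
import os

def get_environ():
    # Runs on each worker; returning the worker's environment lets the
    # cluster look up its SGE/SLURM/LSF job and task ids by address.
    return dict(os.environ)
```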
Instead of this, we may be able to exploit the fact that we pass the job id as --name to dask-worker?
That was my original intent with passing --name. I wasn't able to get this to work robustly, though. In the end I think I decided to just rely on the scheduler to close things down politely and avoid having to maintain a map between worker addresses and SGE job ids.
Have you found the cause of the flakiness? It seems that --name itself should be decently robust, and I would be surprised if the environment variables weren't always present -- besides, those are the same environment variables my code uses, so it would have the same problem.
I wasn't able to get the job scheduler to reliably use the environment variables. This was probably just due to my ignorance of how to use drmaa and job schedulers effectively.
You mean set them? Or are they set manually by the system administrator?
The environment variables are, I think, set by the job scheduler when creating the job. So our current approach is to do something like the following:
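The snippet originally shown here is not preserved in this view. Based on the JOB_ID/TASK_ID definitions in the diff above, it presumably looked roughly like the following reconstruction; the scheduler address and extra options are placeholders, not the PR's exact code:

```python
# JOB_ID and TASK_ID are the '$'-joined strings from the diff above; the job
# script's shell expands them, so only the active scheduler's values survive.
JOB_ID = '$JOB_ID$SLURM_JOB_ID$LSB_JOBID'
TASK_ID = '$SGE_TASK_ID$SLURM_ARRAY_TASK_ID$LSB_JOBINDEX'

worker_name = '%s.%s' % (JOB_ID, TASK_ID)
# Illustrative dask-worker invocation; the real template has more options.
args = ['dask-worker', 'tcp://scheduler:8786', '--name', worker_name]
```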
And we expect this to create a name of the form <job id>.<task index> once the shell in the job script expands the variables.
Well, if we want to "politely" stop workers, then yes it is :-) At least, the environment variables need to be set properly.
This is only if we want to politely stop workers with certain Job IDs though, yes? Is this an important feature?
Oh, I hadn't thought about that. Yes, that's a good point.
I don't know. But it had a
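For reference, a hedged sketch of what politely stopping specific workers can look like from the client side with dask.distributed; the scheduler address, worker address, and job-id mapping below are invented for the example:

```python
from dask.distributed import Client

client = Client('tcp://scheduler:8786')               # placeholder address

# Suppose we kept a mapping from scheduler job ids to worker addresses:
job_to_worker = {'4242.1': 'tcp://10.0.0.5:39217'}    # hypothetical values

# Retire only the worker belonging to that job; the scheduler moves its data
# onto the remaining workers before the worker shuts down.
client.retire_workers(workers=[job_to_worker['4242.1']])
```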
Any thoughts on the merge conflicts?
@jakirkham that depends on whether this PR is desirable at all. I see @TomAugspurger made some changes to dask-drmaa semi-recently; perhaps he has an opinion on this.
No strong thoughts. Glancing through the changes, this approach seems better than the changes in af9273f (which focused only on making sure that the temporary worker directories are cleaned up).
@jakirkham perhaps you would be interested in trying to rebase this PR / fix the conflicts?
Not right now. Maybe later. Mainly interested in getting this released and then making use of it. Afterwards we can revisit outstanding issues.
@jakirkham depending on the kind of system that you're on, you may also find this wiki page of interest: https://github.com/pangeo-data/pangeo/wiki/Getting-Started-with-Dask-on-Cheyenne
This is how I tend to operate on HPC systems these days.
That is very interesting. Thanks for the link. We do have a strategy in place for starting jobs on the cluster. For legacy reasons, it uses ipyparallel and then launches a distributed cluster on top of that. Though am now thinking that maybe we should just use distributed directly. Switching to this drmaa-based startup method looks to be a small change that will do the job. So think we'll try that near term to address our needs. If this needs to change again for some reason, will revisit other options down the road.
DRMAA seems simple enough that if it fits it's a good choice. I certainly know of groups that use this package daily. They're able to hand it to new developers who seem to find it comfortable enough. I think that challenges have arisen whenever groups have wanted to do clever things with their job scheduler and the DRMAA interface wasn't sufficiently expressive. In that case, sometimes providing a custom job script to dask-drmaa worked nicely; in other cases it was too complex.
Yeah, have used DRMAA in the past for other applications and have found it works quite well for simple tasks. Have pretty minimal requirements as to what the Distributed cluster needs to do, so think this should be ok, especially after some brief experimentation with it. This may be just my opinion; however, in cases of more complex usage, it's probably not just DRMAA that is insufficiently expressive, but the underlying scheduler as well.