
Don't make False add-keys report to scheduler #2421

Merged: 3 commits merged into dask:master on Jul 15, 2019

Conversation

tjb900
Contributor

@tjb900 tjb900 commented Dec 18, 2018

Possible fix for #2420, for now putting here mostly to trigger CI.

Cleans up (imho) a couple of things in the worker:

  • responsibility for maintaining the in_flight_tasks / in_flight_workers maps is entirely contained within the transition_dep_* functions
  • better handles the case where dependencies are released (due to e.g. stealing) after they are marked as in-flight but before or during the transfer. In particular, it removes an add-keys notification to the scheduler for keys that we might then throw in the trash because they are no longer needed (rough sketch below).
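
A rough sketch of what the second point amounts to in the flight-to-memory transition; the names here (transition_dep_flight_memory, dep_state, batched_stream) are simplified assumptions about the worker internals of the time, not the actual diff:

def transition_dep_flight_memory(self, dep, value=None):
    # whatever happens next, the dependency is no longer in flight
    self.in_flight_tasks.pop(dep, None)
    if dep in self.dep_state:
        # still wanted: store it and report add-keys to the scheduler
        self.dep_state[dep] = "memory"
        self.data[dep] = value
        self.batched_stream.send({"op": "add-keys", "keys": [dep]})
    # otherwise the dependency was released while in flight: discard it and
    # send no add-keys report for data that is immediately thrown away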

@mrocklin
Member

Thank you for submitting this @tjb900

It looks like there are a few smaller changes here. In general we're pretty conservative with changes to the scheduling logic, so I encourage you to isolate changes into separate PRs where possible, and to demonstrate with tests that they improve concrete situations. Often constructing a minimal test that shows off failing behavior can be difficult, but it is quite useful to ensure that future changes don't subtly break things again.

@tjb900
Contributor Author

tjb900 commented Dec 19, 2018

Will do @mrocklin - thanks for the feedback. Will keep this PR for the removal of the add-keys only.

@mrocklin
Member

I think that we still want to inform the scheduler that we have this data in the common case that we decide to store it.

@tjb900
Contributor Author

tjb900 commented Dec 20, 2018

I'm pretty sure that's done inside transition_dep_flight_memory?

@tjb900
Contributor Author

tjb900 commented Dec 20, 2018

(still planning to come up with a test, too - so far I have just removed all of the other changes)

Member

@mrocklin mrocklin left a comment

Thanks for following up and adding the test @tjb900 . I've added a few minor comments in-line.

One larger issue is time. We have a few thousand tests in the test suite and like to run them frequently. Ideally we would find a way to significantly reduce the time of this test (most take under a few hundred milliseconds) while still making it sensitive to the issue at hand. Thoughts?

    return 2

y = c.submit(bogus_task, x, workers=b.address)
yield c._cancel(y)
Member

You can safely drop the underscore here. The public functions will operate asynchronously in an asynchronous context.
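
For example, the cancellation in the snippet above could simply become (a sketch using the same c and y):

yield c.cancel(y)  # the public method behaves asynchronously inside the coroutine-based test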

Member

I'm slightly concerned that y may never start. Should we wait until b is executing something?

while not b.executing:
    yield gen.sleep(0.01)

Contributor Author

Actually, I think it's important that y never starts - we need to cause x to be transferred from a to b, but then have y be cancelled while that transfer is occurring - so that x is never actually placed in memory on b. This timing is hard to arrange in the test (and I suspect it's now even harder with #2428).

yield gen.sleep(5)
try:
    yield wait(y)
except:
Member

I recommend except Exception: otherwise this catches KeyboardInterrupts

Contributor Author

Yeah, I agree this except is way too broad actually - it's really for the CancelledError. Will fix.
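
For reference, the narrower handler might look roughly like this (a sketch; CancelledError is the concurrent.futures exception, and whether wait raises it in this situation is an assumption about the behaviour under test):

from concurrent.futures import CancelledError

try:
    yield wait(y)
except CancelledError:
    pass  # y was cancelled on purpose; any other exception should fail the test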

@tjb900
Contributor Author

tjb900 commented Jan 4, 2019

Thanks @mrocklin. Understood re the timing. The long waits are arbitrary and should really be replaced with something better.

The three cases I need to trigger on are:

  • gather_dep on b has started to fetch the data. This is the cue to cancel y (but note that we need to arrange for the cancellation notice from the scheduler to reach b before the data from a does).
  • gather_dep on b has finished fetching the data and enqueued any resulting communications to the scheduler.
  • those batched communications from b to the scheduler have completed

For the first two I don't have any great ideas, for now. Checking b.log seems very hacky. Ideally we could put test hooks into the RPCs so the test could control when they are allowed to return (is this something that would be welcome?).

For the third, maybe I can call the batched comm callback directly from the test rather than waiting for the periodic callback.

@tjb900
Contributor Author

tjb900 commented Jan 4, 2019

@mrocklin tbh, I'm a little concerned that this test is going to end up so targeted at the current implementation of the worker that it is highly unlikely to catch a future regression. On top of this, because it depends on the internal details of the worker rather than some kind of semi-stable API, it comes with a maintenance burden of updating the test to match any logic changes inside the worker.

Is there instead a longer running "test everything" test which we could augment to catch this situation? For instance, by verifying after a large amount of work that the scheduler and workers' ideas of what they all have are consistent with each other?

@mrocklin
Member

mrocklin commented Jan 4, 2019

For gather_dep it might work to look at Worker.in_flight_workers. This will be populated while the dependency is possibly in flight, although there is definitely a period when it's in in_flight_workers before the transfer has started.
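
For illustration, that check could look roughly like this inside the test coroutine (a sketch; it assumes a holds x, b is the receiving worker, and that in_flight_workers is keyed by peer address):

while a.address not in b.in_flight_workers:
    yield gen.sleep(0.01)
yield c.cancel(y)  # cancel once the transfer from a to b is (probably) under way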

but note that we need to arrange for the cancellation notice from s to reach b before the data from a does

As a warning, you should count on random 3ish second delays from time to time. A combination of slow containers on travis-ci and some intense GC can play havoc with any test that is depending on something happening within a finite window.

Another approach?

I've tended to avoid intricately orchestrated tests for this reason, and instead try to find statistical tests. Is there something we could do quickly a hundred times that would have some measurable negative effect?

In particular, I wonder if we might draw inspiration from your original concern:

We are working in an adaptive-scaling environment, and are hitting a race condition where sometimes workers don't have what the scheduler thinks they have - and thus frees data that it shouldn't.

Is there a lightweight system we could put together that would simulate this? Maybe we start and stop workers very quickly and submit and cancel tasks very quickly at the same time? Then we verify, at the end (or throughout), that everything is as it should be. My guess is that you could construct a system that would test what you've done here, probably test a bunch of other things at the same time, and have it finish in less time. A test like this might also survive future refactorings and keep us honest for longer.
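
For what it's worth, a very rough sketch of such a stress test against the usual s/a/b/c gen_cluster fixtures; the loop counts, the use of inc from distributed.utils_test, and the final has_what comparison are all assumptions about how the check could be put together, not an existing test:

import random
from tornado import gen
from distributed import wait
from distributed.utils_test import inc

for _ in range(100):
    futures = c.map(inc, range(20), pure=False)
    if random.random() < 0.5:
        yield c.cancel(futures[::2])               # cancel some of the work mid-flight
    yield wait([f for f in futures if not f.cancelled()])

yield gen.sleep(0.1)                               # give release messages a moment to propagate
# every key the scheduler believes a worker holds should really be there
for w in [a, b]:
    assert set(s.has_what[w.address]) == set(w.data)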

Thoughts?

@mrocklin
Member

mrocklin commented Jan 4, 2019

Whoops, I wrote my answer while you were writing yours. Sounds like we came to the same conclusion independently :)

@mrocklin
Member

mrocklin commented Jan 4, 2019

Is there instead a longer running "test everything" test which we could augment to catch this situation? For instance, by verifying after a large amount of work that the scheduler and workers' ideas of what they all have are consistent with each other?

There are a variety of such tests in test_stress.py, but obviously none of them have caught this :)

I think that a new stress test that focused on highly adaptive clusters would be useful generally if you're interested in cooking something up (no obligation though).

@mrocklin
Member

Checking in here @tjb900. Any thoughts?

@tjb900
Contributor Author

tjb900 commented Feb 1, 2019

Definitely keen to work on a test that exercises the adaptive functionality, but I'm snowed under at the moment, so it's on hold for now, sorry.

@jrbourbeau
Member

Checking in here @tjb900, just wondering what the status is on this PR. Do you plan to pick things up again?

@tjb900
Contributor Author

tjb900 commented Jun 24, 2019

Hi @jrbourbeau - sorry I've left this languishing for so long. Realistically, I don't think there's a chance I'll get to it, sorry. Happy to close.

@jrbourbeau
Member

I removed test_worker_keeps_data_gh2420 and merged the current master branch into this branch, hope that's okay @tjb900.

The proposed changes seem reasonable to me (I discussed this issue with @mrocklin yesterday). @mrocklin, when you get a moment, would you mind taking another look at this PR?

@mrocklin mrocklin changed the title from "WIP: don't make False add-keys report to scheduler" to "Don't make False add-keys report to scheduler" on Jul 15, 2019
@mrocklin mrocklin merged commit d493498 into dask:master Jul 15, 2019
muammar added a commit to muammar/distributed that referenced this pull request Jul 18, 2019
* upstream/master: (33 commits)
  SpecCluster: move init logic into start (dask#2850)
  Dont reuse closed worker in get_worker (dask#2841)
  Add alternative SSHCluster implementation (dask#2827)
  Extend prometheus metrics endpoint (dask#2792) (dask#2833)
  Include type name in SpecCluster repr (dask#2834)
  Don't make False add-keys report to scheduler (dask#2421)
  Add Nanny to worker docs (dask#2826)
  Respect security configuration in LocalCluster (dask#2822)
  bump version to 2.1.0
  Fix typo that prevented error message (dask#2825)
  Remove dask-mpi (dask#2824)
  Updates to use update_graph in task journey docs (dask#2821)
  Fix Client repr with memory_info=None (dask#2816)
  Fix case where key, rather than TaskState, could end up in ts.waiting_on (dask#2819)
  Use Keyword-only arguments (dask#2814)
  Relax check for worker references in cluster context manager (dask#2813)
  Add HTTPS support for the dashboard (dask#2812)
  CLN: Use dask.utils.format_bytes (dask#2810)
  bump version to 2.0.1
  Add python_requires entry to setup.py (dask#2807)
  ...