Don't make False add-keys report to scheduler #2421
Conversation
Thank you for submitting this, @tjb900. It looks like there are a few smaller changes here. In general we're pretty conservative with changes to the scheduling logic, so I encourage you to isolate changes into different PRs if possible, and also to demonstrate that they improve concrete situations with tests. Often constructing a minimal test that shows off failing behavior can be difficult, but it is quite useful to ensure that future changes don't subtly break things again.
Will do @mrocklin - thanks for the feedback. Will keep this PR for the removal of the add-keys notification only.
I think that we still want to inform the scheduler that we have this data in the common case that we decide to store it.
I'm pretty sure that's done inside
(still planning to come up with a test, too - so far I have just removed all of the other changes)
Thanks for following up and adding the test, @tjb900. I've added a few minor comments in-line.
One larger issue is time. We have a few thousand tests in the test suite and like to run them frequently. Ideally we would find a way to significantly reduce the time of this test (most take under a few hundred milliseconds) while still making it sensitive to the issue at hand. Thoughts?
distributed/tests/test_worker.py (outdated)

```python
    return 2

y = c.submit(bogus_task, x, workers=b.address)
yield c._cancel(y)
```
You can safely drop the underscore here. The public functions will operate asynchronously in an asynchronous context.
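For example, a minimal sketch of what that looks like (assuming the coroutine-style `gen_cluster` harness used elsewhere in this test file; the test name is made up):

```python
from distributed.utils_test import gen_cluster


@gen_cluster(client=True)
def test_cancel_public_api(c, s, a, b):
    y = c.submit(lambda: 1)
    # In a coroutine context the public method is already asynchronous,
    # so this is equivalent to `yield c._cancel(y)`:
    yield c.cancel(y)
```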
I'm slightly concerned that `y` may never start. Should we wait until `b` is executing something?

```python
while not b.executing:
    yield gen.sleep(0.01)
```
Actually, I think it's important that `y` never starts - we need to cause `x` to be transferred from `a` to `b`, but then have `y` be cancelled while that transfer is occurring, so that `x` is never actually placed in memory on `b`. This timing is hard to arrange in the test (and I suspect it is actually now even harder with #2428).
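To make the intended ordering concrete, here is a rough sketch of the race being described (illustrative only: the `sleep` is a stand-in for whatever synchronization the real test uses, and hitting the window reliably is exactly the hard part):

```python
from tornado import gen

from distributed import wait
from distributed.utils_test import gen_cluster


@gen_cluster(client=True)
def test_cancel_mid_transfer(c, s, a, b):
    # x is computed and held on worker a
    x = c.submit(lambda: 1, workers=[a.address])
    yield wait(x)
    # y depends on x but is pinned to b, forcing an a -> b transfer of x
    y = c.submit(lambda v: v + 1, x, workers=[b.address])
    yield gen.sleep(0.05)  # hope the transfer of x is now in flight
    yield c._cancel(y)     # cancel while x is (ideally) still mid-transfer
    # The bug under discussion: b could still report `add-keys` for x to
    # the scheduler even though it discards the half-transferred value.
```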
distributed/tests/test_worker.py (outdated)

```python
yield gen.sleep(5)
try:
    yield wait(y)
except:
```
I recommend `except Exception:`, otherwise this catches KeyboardInterrupts.
Yeah, I agree this `except` is way too broad actually - it's really for the `CancelledError`. Will fix.
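Concretely, the narrower handler might look like this (a sketch assuming, as the discussion implies, that the cancellation surfaces as a `concurrent.futures.CancelledError`; `wait` and `y` are as in the diff above):

```python
from concurrent.futures import CancelledError

try:
    yield wait(y)
except CancelledError:
    pass  # expected: y was cancelled above, so waiting on it raises
```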
Thanks @mrocklin. Understood re the timing. The long waits are arbitrary and should really be replaced with something better. The three cases I need to trigger on are:

For the first two I don't have any great ideas for now. For the third, maybe I can call the batched comm callback directly from the test rather than waiting for the periodic callback.
@mrocklin tbh, I'm a little concerned that this test is going to end up so targeted at the current implementation of the worker that it is highly unlikely to catch a future regression. On top of that, because it depends on the internal details of the worker rather than some kind of semi-stable API, it comes with a maintenance burden of updating the test to match any logic changes inside the worker. Is there instead a longer-running "test everything" test which we could augment to catch this situation? For instance, by verifying after a large amount of work that the scheduler's and workers' ideas of what they all have are consistent with each other?
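That final verification could be an invariant check along these lines (a hedged sketch: `s.has_what` and `w.data` are the scheduler/worker attributes of this era of distributed, and the helper name is invented):

```python
def assert_scheduler_worker_consistency(s, workers):
    # Compare the scheduler's record of which keys each worker holds
    # (s.has_what: address -> set of keys) against each worker's actual
    # local data store (w.data: key -> value).
    workers_by_addr = {w.address: w for w in workers}
    for addr, keys in s.has_what.items():
        local = set(workers_by_addr[addr].data)
        assert set(keys) == local, (addr, set(keys) ^ local)
```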
As a warning, you should count on random 3-ish second delays from time to time. A combination of slow containers on travis-ci and some intense GC can play havoc with any test that depends on something happening within a finite window.

Another approach? I've tended to avoid intricately orchestrated tests for this reason, and instead tend to try to find statistical tests. Is there something we could do quickly a hundred times that would have some measurable negative effect? In particular, I wonder if we might draw inspiration from your original concern:
Is there a lightweight system we could put together that would simulate this? Maybe we start and stop workers very quickly and submit and cancel tasks very quickly at the same time? Then we verify, at the end (or throughout), that everything is as it should be? My guess is that you could construct a system that would test what you've done here, probably test a bunch of other things at the same time as well, and have it finish in less time. A test like this might also survive future refactorings and keep us honest for longer. Thoughts?
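A hedged sketch of that shape of test (worker churn omitted for brevity; the names and counts are illustrative, and the final check may need a polling loop in practice to let scheduler/worker messages settle):

```python
import random

from distributed import wait
from distributed.utils_test import gen_cluster


@gen_cluster(client=True)
def test_stress_cancel_churn(c, s, a, b):
    futures = []
    for i in range(100):
        futures.append(c.submit(lambda x: x + 1, i, pure=False))
        if random.random() < 0.5:
            # cancel a random in-flight task, quickly and repeatedly
            yield c.cancel(futures.pop(random.randrange(len(futures))))
    yield wait(futures)
    # At the end, the scheduler's view should match the workers' reality.
    for w in (a, b):
        assert set(s.has_what[w.address]) == set(w.data)
```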
Whoops, I wrote my answer while you were writing yours. Sounds like we came to the same conclusion independently :)
There are a variety of such tests in the existing test suite. I think that a new stress test focused on highly adaptive clusters would be useful generally, if you're interested in cooking something up (no obligation though).
Checking in here @tjb900. Any thoughts?
Definitely keen to work on a test that exercises the adaptive functionality - but I'm snowed under at the moment, so it's kind of on hold, sorry.
Checking in here, @tjb900. Just wondering what the status is on this PR. Do you plan to pick things up again?
Hi @jrbourbeau - sorry I've left this languishing for so long. Realistically, I don't think there's a chance I'll get to it, sorry. Happy to close.
Possible fix for #2420; for now, putting here mostly to trigger CI.

Cleans up (imho) a couple of things in the worker:

- handling of the `in_flight_tasks`/`in_flight_workers` maps is entirely contained within the `transition_dep_*` functions
- no spurious `add-keys` notification is sent to the scheduler for keys that we might then throw in the trash due to being no longer needed (illustrated in the sketch after this list)
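The second point amounts to a guard like the following (an illustrative sketch, not the actual diff; the helper name is invented, though `worker.data` and the worker's `batched_stream` connection to the scheduler are real attributes of this era):

```python
def maybe_report_gathered_key(worker, key):
    # Only tell the scheduler we gained `key` if the transferred value was
    # actually kept. If the task needing it was cancelled mid-flight, the
    # worker discards the value, and reporting `add-keys` anyway would give
    # the scheduler a false record of where the data lives.
    if key in worker.data:
        worker.batched_stream.send({"op": "add-keys", "keys": [key]})
```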