
Retry operations on network issues #3294

Merged
2 commits merged into dask:master from the retry-operations branch on Dec 5, 2019

Conversation

jochen-ott-by
Contributor

We run distributed in the cloud and regularly see TCP connection aborts. Unfortunately, distributed often does not recover cleanly from such situations, even though a simple retry would have resolved most of them.
This PR proposes adding a more generic retry mechanism to some operations.

Notes:

  • Only some operations are retried, because triggering some operations twice may have undesired effects. There are probably more operations that can and should be retried; this is just a start, covering operations where retrying is "obviously" safe.
  • The retry parameters (maximum number of retry attempts, delay between retries) are configurable. The default is to not retry at all, so the current behavior is unchanged (some users may rely on, or prefer, seeing all connection failures immediately). A minimal sketch of such a retry wrapper follows below.
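
For illustration, here is a minimal sketch of what such a generic async retry wrapper could look like, with exponential backoff between a base and a maximum delay. The name retry_operation, its signature, and the who_has call in the usage comment are assumptions for this sketch, not necessarily what the PR implements:

import asyncio
import random

async def retry_operation(coro_factory, count=0, delay_min=1.0, delay_max=20.0):
    # Defaults mirror the configuration defaults described above (no retry).
    # coro_factory is a zero-argument callable returning a fresh coroutine,
    # so the operation can be re-issued after a connection failure.
    for attempt in range(count + 1):
        try:
            return await coro_factory()
        except OSError:  # IOError/EnvironmentError are aliases of OSError on Python 3
            if attempt == count:
                raise  # retries exhausted: surface the original error
            # Exponential backoff, capped at delay_max, plus a little jitter.
            delay = min(delay_min * 2 ** attempt, delay_max)
            await asyncio.sleep(delay * (1 + 0.1 * random.random()))

# Hypothetical usage at a call site:
# result = await retry_operation(lambda: self.scheduler.who_has(keys=keys),
#                                count=3, delay_min=1, delay_max=20)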

@jochen-ott-by force-pushed the retry-operations branch 2 times, most recently from 5a0cc70 to e81b9bc, on December 3, 2019 10:59
@mrocklin
Member

mrocklin commented Dec 3, 2019

Thank you, @byjott. In principle this seems fine to me. I'm glad to have you all making sure that Dask operates well in situations where network connections are unstable.

What do you think about adding a retry= keyword directly into the rpc or ConnectionPool API?

await self.scheduler.foo(..., retry=True)

(not a requirement, it's an honest question)

@jochen-ott-by
Contributor Author

jochen-ott-by commented Dec 4, 2019

Thanks for looking into this.

What do you think about adding a retry= keyword directly into the rpc or ConnectionPool API?

await self.scheduler.foo(..., retry=True)

Actually, that was my first implementation. It would also work and would be more concise at the call site.
I only changed it to the current approach after realizing the following (see the comparison sketch after this list):

  • IMO, it breaks the cleanliness and separation of the API: currently, all arguments to foo are forwarded to the scheduler process as part of the rpc. A retry parameter, however, would not be forwarded but would instead take effect at a different level. Mixing parameters that are forwarded with parameters that are interpreted at some intermediate layer goes against the principle of least surprise.
    As a side effect, you could no longer have a parameter actually called retry that you want forwarded to the scheduler. (This may not be a practical limitation, but such a limitation still feels arbitrary and surprising.)
  • Not all of the operations we want to retry go through the PooledRPCCall interface (see e.g. gather_deps_from_worker), so we need a more generic async wrapper that retries (or something similarly generic) anyway; why not use only that?
  • As far as I understand the code, Client.scheduler can be either an rpc or a PooledRPCCall, and I did not fully work out whether there are more options, so such a change would need to touch the (implicit) interface that rpc and PooledRPCCall implement. I avoided the risk of missing a class here and opted for the more explicit approach of wrapping the calls instead of intrusively adding parameters.
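
For comparison, the two calling conventions under discussion would look roughly like this at a call site (the method name gather, the wrapper name retry_operation, and its parameters are placeholders for this sketch):

# Option A: retry handled inside rpc / ConnectionPool via an extra keyword;
# the keyword would be consumed locally and not forwarded to the scheduler.
await self.scheduler.gather(keys=keys, retry=True)

# Option B: an explicit wrapper around the call (the approach taken here);
# the arguments of the wrapped call are forwarded unchanged, and the retry
# parameters would normally come from the configuration.
await retry_operation(lambda: self.scheduler.gather(keys=keys),
                      count=3, delay_min=1, delay_max=20)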

@fjetter
Member

fjetter commented Dec 4, 2019

I'm wondering whether there are any scenarios where we would not want to retry on IOError/EnvironmentError; i.e., what about adding this retry mechanism to the ConnectionPool/rpc without adding the explicit parameter?

@jochen-ott-by
Contributor Author

I'm wondering whether there are any scenarios where we would not want to retry on IOError/EnvironmentError; i.e., what about adding this retry mechanism to the ConnectionPool/rpc without adding the explicit parameter?

I thought about that, but as I wrote above: "only some operations are retried, because triggering some operations twice may have undesired effects". To give an example: I am not sure we really want to retry a scheduler "rebalance" operation, since that could lead to two "rebalance" operations being active at the same time, and I am not sure the code was written to make this safe. Similar concerns apply to other operations.
I guess one could make it safe by generating a unique id on the client side for each semantically distinct request and having the server ignore a second request with the same id (or, better yet, re-send the same answer if needed); a rough sketch of this idea follows below. I think that would be safe, but it would also require a major refactoring.
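
Purely for illustration, the idempotency idea mentioned above could look roughly like the following; none of these names exist in distributed, and this is not part of this PR:

import uuid

def new_request_id():
    # Client side: generate a unique id per semantically distinct request, so a
    # retried request can be recognized as a duplicate by the server.
    return str(uuid.uuid4())

# Server side: cache responses by request id and replay them for duplicates
# instead of running the operation a second time.
completed_requests = {}  # request_id -> cached response

async def handle_idempotent(request_id, handler, *args, **kwargs):
    if request_id in completed_requests:
        return completed_requests[request_id]
    result = await handler(*args, **kwargs)
    completed_requests[request_id] = result
    return result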

retry_operations: # some operations (such as gathering data) are retried with the parameters below
  max_retries: 0 # the maximum number of times to retry an operation in case of a connection problem
  base_delay: 1s # the first non-zero delay between retries
  max_delay: 20s # the maximum delay between retries
mrocklin
Member

I recommend the following names:

retry:
  count: 0
  delay:
    min: 1s
    max: 20s

Also, if you want to keep multi-word names, then I recommend using hyphens rather than underscores. My reasoning here is mostly consistency with the names that are already here.
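
For reference, a nested layout like this would typically be read back with dask.config; the key prefix distributed.comm.retry below is only an assumption about where the block might end up in the configuration tree:

import dask.config
from dask.utils import parse_timedelta

# Assumed key prefix; the final location in the configuration tree may differ.
count = dask.config.get("distributed.comm.retry.count")
delay_min = parse_timedelta(dask.config.get("distributed.comm.retry.delay.min"))
delay_max = parse_timedelta(dask.config.get("distributed.comm.retry.delay.max"))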

jochen-ott-by
Contributor Author

Alright, I just changed it to what you proposed. Thanks for the quick review!

@mrocklin
Member

mrocklin commented Dec 5, 2019

Other than the small comment about the configuration names, I'm happy to merge what's here. Thanks, @byjott, for your efforts here.

@mrocklin merged commit 4e9eb46 into dask:master on Dec 5, 2019
@mrocklin
Member

mrocklin commented Dec 5, 2019

Thank you for your effort here, @byjott. This is in.
