
Increase robustness to TimeoutError during connect #5096

Merged
merged 1 commit into dask:main on Feb 9, 2022

Conversation

@fjetter (Member) commented Jul 20, 2021

This intermediate cap is very aggressive and I believe this causes instabilities. We should be more forgiving. The initial intermediate cap is intended to account for slow or flaky DNS servers and I believe removing this artificial intermediate cap is safe since we have a backoff and jitter.
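
For context, a minimal sketch of the behaviour described here, assuming a hypothetical `attempt_connect` coroutine and made-up cap/jitter values; this is not the actual code in `distributed/comm/core.py`, only an illustration of "small cap on the first attempt, no artificial cap afterwards, backoff with jitter until the deadline":

    import asyncio
    import random
    import time

    async def connect_with_backoff(attempt_connect, timeout=10.0):
        # Sketch only: the first attempt gets a small intermediate cap (to
        # guard against a slow or flaky DNS lookup), every later attempt gets
        # all of the remaining time, and failures are retried with a
        # full-jitter backoff until the overall deadline expires.
        deadline = time.monotonic() + timeout
        cap = timeout / 5  # intermediate cap for the initial attempt only
        attempt = 0
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                raise OSError(f"Timed out trying to connect after {timeout} s")
            try:
                return await asyncio.wait_for(
                    attempt_connect(), timeout=min(cap, remaining)
                )
            except (asyncio.TimeoutError, OSError):
                cap = timeout  # effectively no intermediate cap after the first attempt
                # Full jitter: sleep a uniform random amount within a small,
                # exponentially growing (but bounded) window before retrying.
                await asyncio.sleep(random.uniform(0, min(0.1, 0.01 * 2**attempt)))
                attempt += 1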

There have been lengthy discussions around the intermediate cap on #4176

see also

cc @jacobtomlinson @jcrist @gjoseph92

This might close #5095

gjoseph92 added a commit to gjoseph92/coiled-parameter-sweep-profiling that referenced this pull request Jul 22, 2021
@mrocklin (Member)

Question: what should happen here? @jacobtomlinson, I think you were active in this code early on?

@jacobtomlinson (Member)

Honestly, I don't remember. But looking at the code around this change, we are losing one of two backoffs, which I agree is probably fine.

@fjetter (Member, Author) commented Sep 8, 2021

There has been some discussion about this intermediate cap, particularly in comment #4176 (comment) (xref #3104).

@jcrist, you seemed to have an opinion here.

@fjetter (Member, Author) commented Feb 8, 2022

Just rebased in case this is connected to our recent CI stability issues

@fjetter (Member, Author) commented Feb 8, 2022

I am still convinced this is a major improvement over the current implementation and would like to go ahead with merging this unless there are any major objections.

@graingert pointed out that an even better approach would be to not cancel the initial attempt but schedule another one and use whichever comes back first. I suggest implementing this in a follow-up.

@github-actions bot (Contributor) commented Feb 8, 2022

Unit Test Results

  18 files (+1), 18 suites (+1), 9h 26m 27s duration (+10m 12s)
  2,595 tests (±0): 2,514 passed (+6), 80 skipped (±0), 1 failed (-6)
  23,221 runs (+1,384): 21,783 passed (+1,305), 1,436 skipped (+84), 2 failed (-5)

For more details on these failures, see this check.

Results for commit 89f37cd. Comparison against base commit d0a93e9.

@fjetter fjetter self-assigned this Feb 8, 2022
@crusaderky (Collaborator)

Do I understand correctly that your change means that the code will do exactly two connection attempts?

The whole method could be rewritten without any loops:

... = await wait_for(..., timeout=timeout / 5)
backoff = random.uniform(0, min(0.01, timeout / 5 * 4))
await asyncio.sleep(backoff)
... = await wait_for(..., timeout=max(0, timeout / 5 * 4 - backoff))

+ error handling

@fjetter (Member, Author) commented Feb 8, 2022

Do I understand correctly that your change means that the code will do exactly two connection attempts?

It also retries OSErrors

@crusaderky (Collaborator) commented Feb 8, 2022

Fully functionally equivalent replacement for lines 271:317:

    # Prefer two small attempts over one long attempt. This should protect
    # primarily from DNS race conditions
    # gh3104, gh4176, gh4167, gh5096
    def try_connect(timeout):
        return asyncio.wait_for(
            connector.connect(loc, deserialize=deserialize, **connection_args),
            timeout=timeout,
        )

    try:
        comm = await try_connect(timeout / 5)
    except FatalCommClosedError:
        raise
    # Note: CommClosed inherits from OSError
    except (asyncio.TimeoutError, OSError):
        backoff = random.uniform(0, min(0.01, timeout / 5 * 4))
        logger.debug(
            "Could not connect to %s, waiting for %.3fs before retrying", loc, backoff
        )
        await asyncio.sleep(backoff)
        try:
            comm = await try_connect(timeout / 5 * 4 - backoff)
        except FatalCommClosedError:
            raise
        except (asyncio.TimeoutError, OSError) as exc:
            raise OSError(
                f"Timed out trying to connect to {addr} after {timeout} s"
            ) from exc

@fjetter (Member, Author) commented Feb 9, 2022

Fully functionally equivalent replacement for lines 271:317:

Well, it's not fully equivalent. If there is more than one OSError, your implementation will fail, while the current implementation tries again until the timeout expires. That can happen if the remote is under high load such that the TCP connection is rejected immediately (e.g. the socket backlog is full).
That behaviour is even tested in distributed/comm/tests/test_comm.py::test_retry_connect.
Also, we'd need to calculate the time left for the handshake again further down, so there is not much gained by calculating the timeouts individually as you proposed.
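
To illustrate the distinction: a rejected connection fails with an immediate OSError rather than a timeout, so a scheme with a fixed number of attempts can exhaust them within milliseconds, while a loop bounded by elapsed time keeps trying. A tiny standalone example, assuming nothing is listening on the chosen local port:

    import asyncio

    async def main():
        try:
            # With no listener on this port, the connect fails almost
            # instantly with ConnectionRefusedError (a subclass of OSError)
            # rather than taking the full timeout.
            await asyncio.open_connection("127.0.0.1", 9)
        except OSError as exc:
            print(f"connect failed immediately: {exc!r}")

    asyncio.run(main())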

Subjectively, I don't think the loop adds much complexity, but I understand your concern about this implementation. There have been long discussions about it in #4176.
I do not want to change too much about this implementation without proper cause. This area has been known to cause a lot of stability issues in the past, and I would like us to move carefully. I am convinced that the current retry+backoff algorithm works as intended, but I acknowledge that there are things to improve on.

Specifically, the questions that concern me are:

  • Should we even have an intermediate cap? What is this weird DNS situation we're trying to protect ourselves from? Is this a general enough problem that we should even fix it? Wouldn't the backoff+jitter be sufficient? Would a random small jitter at the start of the implementation simplify things and serve the same purpose?
  • Should the handshake be covered by this timeout? The handshake may be significantly slower if the remote is busy. The handshake will even be blocked if the remote is holding the GIL (I'm not sure about the initial connect; no idea what happens in kernel land vs user land).
  • Is our default connect timeout sufficiently large if we include the handshake? (See also Increase default for connect timeout #4228 and Improve CI stability #5022.)
  • Is the FullJitter approach suited for our application, or would another backoff algorithm serve us better?
  • @graingert proposed a slightly different algorithm where we would not cancel the first attempt if it times out (asyncio.wait_for cancels the future) but rather schedule a second one and take whichever returns first. AFAIK, this is a relatively common pattern when connecting to remotes, but I would like us to have this tested. (A rough sketch of this pattern follows below.)
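
On the last point, a rough sketch of that hedged-connection pattern, assuming a hypothetical attempt_connect coroutine and made-up hedge_after/timeout values; this is not distributed's implementation, and a real version would still need to decide how to surface errors from the attempt that loses the race:

    import asyncio

    async def hedged_connect(attempt_connect, hedge_after=1.0, timeout=10.0):
        # Start one attempt and give it ``hedge_after`` seconds on its own.
        attempts = {asyncio.ensure_future(attempt_connect())}
        done, pending = await asyncio.wait(attempts, timeout=hedge_after)
        if not done:
            # The first attempt is slow: start a second one *without
            # cancelling the first* and take whichever finishes first.
            pending.add(asyncio.ensure_future(attempt_connect()))
            done, pending = await asyncio.wait(
                pending,
                timeout=timeout - hedge_after,
                return_when=asyncio.FIRST_COMPLETED,
            )
        for task in pending:
            task.cancel()  # only now drop the slower attempt(s)
        if not done:
            raise OSError(f"Timed out trying to connect after {timeout} s")
        # Return the result (or re-raise the exception) of whichever attempt won.
        return done.pop().result()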

I'm open to discussing all of the above points and more, but I would suggest doing so in a dedicated issue and discussing what our requirements are first. As I said, this has been a very frequent source of instabilities and I would like to move carefully.

This change is only intended to remove the staggered intermediate caps, which doesn't change any logic but rather makes our timeouts less aggressive, which should improve robustness overall.

@fjetter fjetter changed the title Improve connect robustness Increase robustness to TimeoutError during connect Feb 9, 2022
@fjetter (Member, Author) commented Feb 9, 2022

I will go ahead and merge this now since nobody raised an objection to the specific change I am proposing, namely removing the intermediate caps after the initial connect attempt.

To discuss future improvements to the implementation, I would like to move to a different ticket, as mentioned above.

@fjetter fjetter merged commit f4da40d into dask:main Feb 9, 2022

Successfully merging this pull request may close these issues.

client.run can timeout while connecting to workers on large clusters
4 participants