
Increase robustness to TimeoutError during connect #5096

Merged
merged 1 commit into dask:main on Feb 9, 2022

Conversation

@fjetter (Member) commented Jul 20, 2021

This intermediate cap is very aggressive and I believe this causes instabilities. We should be more forgiving. The initial intermediate cap is intended to account for slow or flaky DNS servers and I believe removing this artificial intermediate cap is safe since we have a backoff and jitter.
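
For context, a minimal sketch of the behaviour described here, assuming a hypothetical `attempt_connect` coroutine and made-up cap/jitter values; this is not the actual code in `distributed/comm/core.py`, only an illustration of "small cap on the first attempt, no artificial cap afterwards, backoff with jitter until the deadline":

    import asyncio
    import random
    import time

    async def connect_with_backoff(attempt_connect, timeout=10.0):
        # Sketch only: the first attempt gets a small intermediate cap (to
        # guard against a slow or flaky DNS lookup), every later attempt gets
        # all of the remaining time, and failures are retried with a
        # full-jitter backoff until the overall deadline expires.
        deadline = time.monotonic() + timeout
        cap = timeout / 5  # intermediate cap for the initial attempt only
        attempt = 0
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                raise OSError(f"Timed out trying to connect after {timeout} s")
            try:
                return await asyncio.wait_for(
                    attempt_connect(), timeout=min(cap, remaining)
                )
            except (asyncio.TimeoutError, OSError):
                cap = timeout  # effectively no intermediate cap after the first attempt
                # Full jitter: sleep a uniform random amount within a small,
                # exponentially growing (but bounded) window before retrying.
                await asyncio.sleep(random.uniform(0, min(0.1, 0.01 * 2**attempt)))
                attempt += 1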

There have been lengthy discussions around the intermediate cap on #4176

see also

cc @jacobtomlinson @jcrist @gjoseph92

This might close #5095

gjoseph92 added a commit to gjoseph92/coiled-parameter-sweep-profiling that referenced this pull request Jul 22, 2021
@mrocklin (Member)

Question: what should happen here? @jacobtomlinson, I think you were active in this code early on?

@jacobtomlinson (Member)

Honestly, I don't remember. But looking at the code around this change, we are losing one of two backoffs, which I agree is probably fine.

@fjetter (Member, Author) commented Sep 8, 2021

There has been some discussion about this intermediate cap, particularly in comment #4176 (comment) (xref #3104).

@jcrist, you seemed to have an opinion here.

@fjetter (Member, Author) commented Feb 8, 2022

Just rebased in case this is connected to our recent CI stability issues

@fjetter (Member, Author) commented Feb 8, 2022

I am still convinced this is a major improvement over the current implementation and would like to go ahead with merging this unless there are any major objections.

@graingert pointed out that an even better approach would be to not cancel the initial attempt but schedule another one and use whichever comes back first. I suggest implementing this in a follow-up.

@github-actions bot (Contributor) commented Feb 8, 2022

Unit Test Results

  18 files (+1), 18 suites (+1), 9h 26m 27s duration (+10m 12s)
  2,595 tests (±0): 2,514 passed (+6), 80 skipped (±0), 1 failed (-6)
  23,221 runs (+1,384): 21,783 passed (+1,305), 1,436 skipped (+84), 2 failed (-5)

For more details on these failures, see this check.

Results for commit 89f37cd. Comparison against base commit d0a93e9.

@fjetter fjetter self-assigned this Feb 8, 2022
@crusaderky (Collaborator)

Do I understand correctly that your change means that the code will do exactly two connection attempts?

The whole method could be rewritten without any loops:

... = await wait_for(..., timeout=timeout / 5)
backoff = random.uniform(0, min(0.01, timeout / 5 * 4))
await asyncio.sleep(backoff)
... = await wait_for(..., timeout=max(0, timeout / 5 * 4 - backoff))

+ error handling

@fjetter (Member, Author) commented Feb 8, 2022

Do I understand correctly that your change means that the code will do exactly two connection attempts?

It also retries OSErrors

@crusaderky (Collaborator) commented Feb 8, 2022

Fully functionally equivalent replacement for lines 271:317:

    # Prefer two small attempts over one long attempt. This should protect
    # primarily from DNS race conditions
    # gh3104, gh4176, gh4167, gh5096
    def try_connect(timeout):
        return asyncio.wait_for(
            connector.connect(loc, deserialize=deserialize, **connection_args),
            timeout=timeout,
        )

    try:
        comm = await try_connect(timeout / 5)
    except FatalCommClosedError:
        raise
    # Note: CommClosed inherits from OSError
    except (asyncio.TimeoutError, OSError):
        backoff = random.uniform(0, min(0.01, timeout / 5 * 4))
        logger.debug(
            "Could not connect to %s, waiting for %.3fs before retrying", loc, backoff
        )
        await asyncio.sleep(backoff)
        try:
            comm = await try_connect(timeout / 5 * 4 - backoff)
        except FatalCommClosedError:
            raise
        except (asyncio.TimeoutError, OSError) as exc:
            raise OSError(
                f"Timed out trying to connect to {addr} after {timeout} s"
            ) from exc

@fjetter (Member, Author) commented Feb 9, 2022

Fully functionally equivalent replacement for lines 271:317:

Well, it's not fully equivalent. If there is more than one OSError, your implementation will fail, while the current implementation tries again until the timeout expires. That can happen if the remote is under high load such that the TCP connection is rejected immediately (e.g. the socket backlog is full).
That behaviour is even tested in distributed/comm/tests/test_comm.py::test_retry_connect.
Also, we'd need to calculate the time left for the handshake again further down, so there is not much gained by calculating the timeouts individually as you proposed.
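
To illustrate the distinction: a rejected connection fails with an immediate OSError rather than a timeout, so a scheme with a fixed number of attempts can exhaust them within milliseconds, while a loop bounded by elapsed time keeps trying. A tiny standalone example, assuming nothing is listening on the chosen local port:

    import asyncio

    async def main():
        try:
            # With no listener on this port, the connect fails almost
            # instantly with ConnectionRefusedError (a subclass of OSError)
            # rather than taking the full timeout.
            await asyncio.open_connection("127.0.0.1", 9)
        except OSError as exc:
            print(f"connect failed immediately: {exc!r}")

    asyncio.run(main())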

Subjectively, I don't think the loop adds much complexity, but I understand your concern about this implementation. There have been long discussions about it in #4176.
I do not want to change too much about this implementation without proper cause. This area has been known to cause a lot of stability issues in the past, and I would like us to move carefully. I am convinced that the current retry+backoff algorithm works as intended, but I acknowledge that there are things to improve on.

Specifically, the questions that concern me are:

  • Should we even have an intermediate cap? What is this weird DNS situation we're trying to protect ourselves from? Is this a general enough problem that we should even fix it? Wouldn't the backoff+jitter be sufficient? Would a random small jitter at the start of the implementation simplify things and serve the same purpose?
  • Should the handshake be covered by this timeout? The handshake may be significantly slower if the remote is busy. The handshake will even be blocked if the remote is holding the GIL (I'm not sure about the initial connect; no idea what happens in kernel land vs user land).
  • Is our default connect timeout sufficiently large if we include the handshake? (See also Increase default for connect timeout #4228 and Improve CI stability #5022.)
  • Is the FullJitter approach suited for our application, or would another backoff algorithm serve us better?
  • @graingert proposed a slightly different algorithm where we would not cancel the first attempt if it times out (asyncio.wait_for cancels the future) but rather schedule a second one and take whichever returns first. AFAIK, this is a relatively common pattern when connecting to remotes, but I would like us to have this tested. (A rough sketch of this pattern follows below.)
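
On the last point, a rough sketch of that hedged-connection pattern, assuming a hypothetical attempt_connect coroutine and made-up hedge_after/timeout values; this is not distributed's implementation, and a real version would still need to decide how to surface errors from the attempt that loses the race:

    import asyncio

    async def hedged_connect(attempt_connect, hedge_after=1.0, timeout=10.0):
        # Start one attempt and give it ``hedge_after`` seconds on its own.
        attempts = {asyncio.ensure_future(attempt_connect())}
        done, pending = await asyncio.wait(attempts, timeout=hedge_after)
        if not done:
            # The first attempt is slow: start a second one *without
            # cancelling the first* and take whichever finishes first.
            pending.add(asyncio.ensure_future(attempt_connect()))
            done, pending = await asyncio.wait(
                pending,
                timeout=timeout - hedge_after,
                return_when=asyncio.FIRST_COMPLETED,
            )
        for task in pending:
            task.cancel()  # only now drop the slower attempt(s)
        if not done:
            raise OSError(f"Timed out trying to connect after {timeout} s")
        # Return the result (or re-raise the exception) of whichever attempt won.
        return done.pop().result()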

I'm open to discussing all of the above points and more, but I would suggest doing so in a dedicated issue and discussing what our requirements are first. As I said, this has been a very frequent source of instabilities and I would like to move carefully.

This change is only intended to remove the staggered intermediate caps, which doesn't change any logic but rather makes our timeouts less aggressive, which should improve robustness overall.

@fjetter fjetter changed the title Improve connect robustness Increase robustness to TimeoutError during connect Feb 9, 2022
@fjetter (Member, Author) commented Feb 9, 2022

I will go ahead and merge this now since nobody raised an objection to the specific change I am proposing, namely removing the intermediate caps after the initial connect attempt.

To discuss future improvements to the implementation, I would like to move to a different ticket, as mentioned above.

@fjetter fjetter merged commit f4da40d into dask:main Feb 9, 2022

Successfully merging this pull request may close these issues.

client.run can timeout while connecting to workers on large clusters
4 participants