Increase robustness to TimeoutError during connect #5096
Conversation
Question: what should happen here? @jacobtomlinson, I think you were active in this code early on?
Honestly, I don't remember. But looking at the code around this change, we're losing one of two backoffs. Which I agree is probably fine.
There has been some discussion about this intermediate cap, particularly in #4176 (comment) (xref #3104). @jcrist, you seemed to have an opinion here.
Force-pushed from c05ec61 to 89f37cd
Just rebased in case this is connected to our recent CI stability issues.
I am still convinced this is a major improvement over the current implementation and would like to go ahead with merging it unless there are any major objections. @graingert pointed out that an even better approach would be to not cancel the initial attempt but to schedule another one and use whichever comes back first. I suggest implementing that in a follow-up.
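For context, the racing idea mentioned above could look roughly like the sketch below. `hedged_connect` and its parameters are hypothetical illustrations, not part of distributed; `connect` stands in for any no-argument coroutine function that establishes a connection.

```python
import asyncio


async def hedged_connect(connect, delay=0.05, timeout=1.0):
    """Start one attempt; after `delay`, start a second attempt WITHOUT
    cancelling the first, and return whichever finishes first.
    (Illustrative sketch only: names and defaults are made up.)"""
    first = asyncio.ensure_future(connect())
    done, _ = await asyncio.wait({first}, timeout=delay)
    if done:
        # First attempt finished quickly; propagate its result (or error).
        return first.result()
    # First attempt is slow: hedge with a second, parallel attempt.
    second = asyncio.ensure_future(connect())
    done, pending = await asyncio.wait(
        {first, second},
        timeout=max(0, timeout - delay),
        return_when=asyncio.FIRST_COMPLETED,
    )
    for task in pending:
        task.cancel()
    if not done:
        raise asyncio.TimeoutError(f"no attempt finished within {timeout} s")
    return done.pop().result()
```

Compared to cancel-and-retry, this never throws away work already done by the first attempt, at the cost of occasionally opening a connection that is immediately discarded.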
Do I understand correctly that your change means that the code will do exactly two connection attempts? The whole method could be rewritten without any loops:

```python
... = await wait_for(..., timeout=timeout / 5)
backoff = random.uniform(0, min(0.01, timeout / 5 * 4))
await asyncio.sleep(backoff)
... = await wait_for(..., timeout=max(0, timeout / 5 * 4 - backoff))
```

plus error handling.
It also retries.
Fully functionally equivalent replacement for lines 271:317 (note that `try_connect` applies its timeout directly, so callers pass the slice of the budget they want; the original draft divided by 5 both inside the helper and at the call sites):

```python
# Prefer two small attempts over one long attempt. This should protect
# primarily from DNS race conditions
# gh3104, gh4176, gh4167, gh5096
def try_connect(timeout):
    return asyncio.wait_for(
        connector.connect(loc, deserialize=deserialize, **connection_args),
        timeout=timeout,
    )

try:
    comm = await try_connect(timeout / 5)
except FatalCommClosedError:
    raise
# Note: CommClosed inherits from OSError
except (asyncio.TimeoutError, OSError):
    backoff = random.uniform(0, min(0.01, timeout / 5 * 4))
    logger.debug(
        "Could not connect to %s, waiting for %.3fs before retrying", loc, backoff
    )
    await asyncio.sleep(backoff)
    try:
        comm = await try_connect(timeout / 5 * 4 - backoff)
    except FatalCommClosedError:
        raise
    except (asyncio.TimeoutError, OSError) as exc:
        raise OSError(
            f"Timed out trying to connect to {addr} after {timeout} s"
        ) from exc
```
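As a self-contained illustration of that two-attempt split, here is a minimal runnable sketch. It assumes a generic no-argument `connect` coroutine and omits the distributed-specific pieces (`connector`, `FatalCommClosedError`, logging); names are illustrative only.

```python
import asyncio
import random


async def two_attempt_connect(connect, timeout):
    """Sketch: first attempt gets 1/5 of the time budget, the second
    gets the remaining 4/5 minus a small random backoff."""
    try:
        return await asyncio.wait_for(connect(), timeout=timeout / 5)
    except (asyncio.TimeoutError, OSError):
        # Jittered backoff, capped at 10 ms, before the second attempt.
        backoff = random.uniform(0, min(0.01, timeout / 5 * 4))
        await asyncio.sleep(backoff)
        try:
            return await asyncio.wait_for(
                connect(), timeout=timeout / 5 * 4 - backoff
            )
        except (asyncio.TimeoutError, OSError) as exc:
            raise OSError(
                f"Timed out trying to connect after {timeout} s"
            ) from exc
```

The 1/5 + 4/5 split keeps the total wall-clock time within the caller's `timeout` while still giving the retry most of the budget.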
Well, it's not fully equivalent. If there is more than one

Subjectively, I don't think the loop adds much complexity, but I understand your concern about this implementation. There have been long discussions about it in #4176. Specifically, the questions that concern me about this are:
I'm open to discussing all of the above points and more, but I would suggest doing so in a dedicated issue and discussing what our requirements are first. As I said, this has been a very frequent source of instability and I would like to move carefully. This change is only intended to remove the staggered intermediate caps, which doesn't change any logic but rather makes our timeouts less aggressive, which should improve overall robustness.
I will go ahead and merge this now, since nobody raised an objection to the specific change I am proposing: removing the intermediate caps after the initial connect attempt. To discuss future improvements to the implementation, I would like to move to a different ticket, as mentioned above.
This intermediate cap is very aggressive and I believe it causes instabilities; we should be more forgiving. The initial intermediate cap is intended to account for slow or flaky DNS servers, and I believe removing the artificial caps after it is safe since we have backoff and jitter.
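To make "backoff and jitter" concrete, here is a minimal sketch of a full-jitter exponential backoff schedule. The function name and default values are illustrative, not the actual constants distributed uses.

```python
import random


def backoff_schedule(base=0.01, cap=1.0, attempts=5):
    """Exponential backoff with "full jitter": each delay is drawn
    uniformly from [0, min(cap, base * 2**attempt)].
    (Illustrative names and defaults, not distributed's values.)"""
    return [
        random.uniform(0, min(cap, base * 2 ** attempt))
        for attempt in range(attempts)
    ]
```

The jitter spreads out simultaneous retries so many clients reconnecting at once don't hammer the server in lockstep, while the cap bounds the worst-case wait.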
There have been lengthy discussions around the intermediate cap on #4176
see also
cc @jacobtomlinson @jcrist @gjoseph92
This might close #5095