Skip to content

Fix "None-None" race condition with keep-alive timeout#110

Merged
mrc merged 1 commit intomasterfrom
issue-103
Aug 20, 2020
Merged

Fix "None-None" race condition with keep-alive timeout#110
mrc merged 1 commit intomasterfrom
issue-103

Conversation

@mrc
Copy link
Copy Markdown
Contributor

@mrc mrc commented Aug 16, 2020

Keep-alives are handled by a scheduled routine, which closes the
connection if the keep-alive has expired. The routine calculates the
keep-alive time remaining based on the last response time, but that's
not set until after the routine is scheduled, creating a race
condition. The fix is to set the last response time before scheduling
the keep-alive routine. Additionally the calculation is a bit more
robust now, so the scheduled routine doesn't need to deal with None.

  • Addresses issue Exception in callback HttpProtocol.keep_alive_timeout_callback() #103

  • What I did: addressed a race condition

  • How I did it: set a variable which is used inside a scheduled routine (_last_response_time used inside keep_alive_timeout_callback()) before the routine is scheduled

  • How to verify it: added a test

  • One line description for the changelog: Fix "None-None" race condition with keep-alive timeout

  • One moving picture involving robots (not mandatory but encouraged): bebot

@mrc mrc marked this pull request as ready for review August 16, 2020 20:49
@mrc mrc force-pushed the issue-103 branch 4 times, most recently from f7b3f65 to d897197 Compare August 17, 2020 21:42
time_elapsed = current_time - self._last_response_time
if time_elapsed < self.keep_alive_timeout:
time_left = self.keep_alive_timeout - time_elapsed
time_left = self.keep_alive_time_left()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @mrc - thanks for the fix. for some historical context, this code comes from https://github.com/huge-success/sanic and we copied version 0.0.4 I believe (they're on 20.6.3 now ;) because importing it added 2 seconds and was causing functions to timeout starting up... fun stuff. full details would be in #60 if I misinterpreted what we did there. sorry that we didn't really make this very clear at all, we should probably have made it more obvious at the headers of these files (ie THIS CODE IS COPIED VERBATIM DO NOT MODIFY... etc etc)

in any case, I thought it may be good to know as it seems they have fixed this upstream as well here: https://github.com/huge-success/sanic/blob/master/sanic/server.py#L271 and it may be desirable to follow upstream rather than diverge, I guess further it may be desirable to copy a newer version of sanic http libs we used into here that fixes who knows how many other bugs as well, provided someone has the time, and probably would be wise to attribute this better so that when there are bugs in the http code this is more obvious (sorry).

in any event, I think this change is fine to land esp if it's causing issues as the copying process is possibly a little bit of work (I didn't look to see if the api changed, but I'd guess that it has considering it's been a while), but hopefully the historical context is useful for the team to know should anything else come up too. cheers.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wow, very interesting! Yes I agree, copying the updated version in seems like a good approach. This change is a bit of a band aid because the error keeps popping up in our logs when _last_response_time is None right at the start (I guess, I found it hard to reproduce). As you're OK with the change as-is then I'll add a comment to the docs and files, and include context about why it's copied, which seems like a minimal change that paves the way for reimporting the current version. Thanks for the attention to detail!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I'm looking at sanic and I don't know that it does fix this upstream, but I'm not really familiar with it so let me know if I'm missing something.

The keep-alive callback reads self._last_response_time https://github.com/huge-success/sanic/blob/master/sanic/server.py#L279
but it's set after the callback is scheduled, which I think is where the race condition is:
https://github.com/huge-success/sanic/blob/master/sanic/server.py#L538
https://github.com/huge-success/sanic/blob/master/sanic/server.py#L538

What do you reckon? I will still add a signpost comment into our branch to point to the fork, but I think I can also fix it upstream, unless you see the bug is somewhere else (or my fix won't work).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean now, it's None and you're getting an exception because a connection hasn't been made to set it to 0 yet - I guess possible alternative is to set it to 0 when instantiated rather than None (it looks explicit), it'd be 'less code' but I'm not sure it's any better and wouldn't handle any case where it got set back to None (did not look, perhaps exists but I'd guess not).

you're right that doesn't look fixed upstream either. I am sure that it would be appreciated, it seems like the 'right' thing to do to update upstream and then copy it in, there's always time constraints for doing things like this so I think we can fix here and then proceed to do all of that, it takes time to land things upstream and then test updated versions, especially in this weird one where we copied some stuff and the api may have changed. your fix looks good to me! the test is nice too.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this is a bit odd... I'm going through the history of https://github.com/huge-success/sanic and I can't quite see where we would have branched it. I think it's after 0.6.0 but I am having trouble narrowing it down more than that. It looks like Denis cleaned up the files (adding a license header, class def changes like class Router():->class Router(object):, and some stuff trimmed out which I'm guessing is to help reduce cold starts). I'm glad you explained in your first comment why we copied vs imported, that takes away the temptation. I'll cut a new issue to do that later, because the trick will be to preserve Denis's optimizations.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm actually now I'm thinking about it, I don't know enough about asyncio to know if this is where the race is. It looks like the call to schedule the callback is just using call_later which isn't async/doesn't await so I'm not sure how the callback is happening. So I'm guessing that rearranging things to set the variable before scheduling the callback won't have any effect, and I can't see how the callback can happen with both the LHS & RHS as None. I still think (boneheaded maybe) that this patch will help because of the check for None, I just can't see how that happens at runtime...

Keep-alives are handled by a scheduled routine, which closes the
connection if the keep-alive has expired. The routine calculates the
keep-alive time remaining based on the last response time, but that's
not set until after the routine is scheduled, creating a race
condition. The fix is to set the last response time before scheduling
the keep-alive routine. Additionally the calculation is a bit more
robust now, so the scheduled routine doesn't need to deal with None.

The async code is largely imported from [sanic](https://github.com/huge-success/sanic),
modified and stripped-down to reduce cold start times.

Addresses issue #103
@mrc
Copy link
Copy Markdown
Contributor Author

mrc commented Aug 19, 2020

Updated with signpost.md and a note in the commit log about sanic.

@mrc mrc merged commit f44a70b into master Aug 20, 2020
@mrc mrc deleted the issue-103 branch August 20, 2020 00:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants