Conversation
f7b3f65 to
d897197
Compare
| time_elapsed = current_time - self._last_response_time | ||
| if time_elapsed < self.keep_alive_timeout: | ||
| time_left = self.keep_alive_timeout - time_elapsed | ||
| time_left = self.keep_alive_time_left() |
There was a problem hiding this comment.
hey @mrc - thanks for the fix. for some historical context, this code comes from https://github.com/huge-success/sanic and we copied version 0.0.4 I believe (they're on 20.6.3 now ;) because importing it added 2 seconds and was causing functions to timeout starting up... fun stuff. full details would be in #60 if I misinterpreted what we did there. sorry that we didn't really make this very clear at all, we should probably have made it more obvious at the headers of these files (ie THIS CODE IS COPIED VERBATIM DO NOT MODIFY... etc etc)
in any case, I thought it may be good to know as it seems they have fixed this upstream as well here: https://github.com/huge-success/sanic/blob/master/sanic/server.py#L271 and it may be desirable to follow upstream rather than diverge, I guess further it may be desirable to copy a newer version of sanic http libs we used into here that fixes who knows how many other bugs as well, provided someone has the time, and probably would be wise to attribute this better so that when there are bugs in the http code this is more obvious (sorry).
in any event, I think this change is fine to land esp if it's causing issues as the copying process is possibly a little bit of work (I didn't look to see if the api changed, but I'd guess that it has considering it's been a while), but hopefully the historical context is useful for the team to know should anything else come up too. cheers.
There was a problem hiding this comment.
Oh wow, very interesting! Yes I agree, copying the updated version in seems like a good approach. This change is a bit of a band aid because the error keeps popping up in our logs when _last_response_time is None right at the start (I guess, I found it hard to reproduce). As you're OK with the change as-is then I'll add a comment to the docs and files, and include context about why it's copied, which seems like a minimal change that paves the way for reimporting the current version. Thanks for the attention to detail!
There was a problem hiding this comment.
Hmm I'm looking at sanic and I don't know that it does fix this upstream, but I'm not really familiar with it so let me know if I'm missing something.
The keep-alive callback reads self._last_response_time https://github.com/huge-success/sanic/blob/master/sanic/server.py#L279
but it's set after the callback is scheduled, which I think is where the race condition is:
https://github.com/huge-success/sanic/blob/master/sanic/server.py#L538
https://github.com/huge-success/sanic/blob/master/sanic/server.py#L538
What do you reckon? I will still add a signpost comment into our branch to point to the fork, but I think I can also fix it upstream, unless you see the bug is somewhere else (or my fix won't work).
There was a problem hiding this comment.
I see what you mean now, it's None and you're getting an exception because a connection hasn't been made to set it to 0 yet - I guess possible alternative is to set it to 0 when instantiated rather than None (it looks explicit), it'd be 'less code' but I'm not sure it's any better and wouldn't handle any case where it got set back to None (did not look, perhaps exists but I'd guess not).
you're right that doesn't look fixed upstream either. I am sure that it would be appreciated, it seems like the 'right' thing to do to update upstream and then copy it in, there's always time constraints for doing things like this so I think we can fix here and then proceed to do all of that, it takes time to land things upstream and then test updated versions, especially in this weird one where we copied some stuff and the api may have changed. your fix looks good to me! the test is nice too.
There was a problem hiding this comment.
Yeah this is a bit odd... I'm going through the history of https://github.com/huge-success/sanic and I can't quite see where we would have branched it. I think it's after 0.6.0 but I am having trouble narrowing it down more than that. It looks like Denis cleaned up the files (adding a license header, class def changes like class Router():->class Router(object):, and some stuff trimmed out which I'm guessing is to help reduce cold starts). I'm glad you explained in your first comment why we copied vs imported, that takes away the temptation. I'll cut a new issue to do that later, because the trick will be to preserve Denis's optimizations.
There was a problem hiding this comment.
Hmm actually now I'm thinking about it, I don't know enough about asyncio to know if this is where the race is. It looks like the call to schedule the callback is just using call_later which isn't async/doesn't await so I'm not sure how the callback is happening. So I'm guessing that rearranging things to set the variable before scheduling the callback won't have any effect, and I can't see how the callback can happen with both the LHS & RHS as None. I still think (boneheaded maybe) that this patch will help because of the check for None, I just can't see how that happens at runtime...
Keep-alives are handled by a scheduled routine, which closes the connection if the keep-alive has expired. The routine calculates the keep-alive time remaining based on the last response time, but that's not set until after the routine is scheduled, creating a race condition. The fix is to set the last response time before scheduling the keep-alive routine. Additionally the calculation is a bit more robust now, so the scheduled routine doesn't need to deal with None. The async code is largely imported from [sanic](https://github.com/huge-success/sanic), modified and stripped-down to reduce cold start times. Addresses issue #103
|
Updated with signpost.md and a note in the commit log about sanic. |
Keep-alives are handled by a scheduled routine, which closes the
connection if the keep-alive has expired. The routine calculates the
keep-alive time remaining based on the last response time, but that's
not set until after the routine is scheduled, creating a race
condition. The fix is to set the last response time before scheduling
the keep-alive routine. Additionally the calculation is a bit more
robust now, so the scheduled routine doesn't need to deal with None.
Addresses issue Exception in callback HttpProtocol.keep_alive_timeout_callback() #103
What I did: addressed a race condition
How I did it: set a variable which is used inside a scheduled routine (
_last_response_timeused insidekeep_alive_timeout_callback()) before the routine is scheduledHow to verify it: added a test
One line description for the changelog: Fix "None-None" race condition with keep-alive timeout
One moving picture involving robots (not mandatory but encouraged):