New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix static file cached 304 response #1692

Merged
merged 7 commits into from Aug 14, 2018

Conversation

Projects
None yet
4 participants
@mrvanes
Contributor

mrvanes commented Feb 2, 2018

  • What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)
    Bugfix

  • What is the related issue number (starting with #)
    #1690

  • What is the current behavior? (You can also link to an open issue here)
    #1690

  • What is the new behavior (if this is a feature change)?
    No timeouts

@codecov

This comment has been minimized.

codecov bot commented Feb 2, 2018

Codecov Report

Merging #1692 into master will decrease coverage by 0.53%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #1692      +/-   ##
==========================================
- Coverage   80.83%   80.29%   -0.54%     
==========================================
  Files         105      104       -1     
  Lines       13591    13552      -39     
==========================================
- Hits        10986    10882     -104     
- Misses       2605     2670      +65

@webknjaz webknjaz requested review from jaraco, webknjaz and cherrypy/contributors Feb 2, 2018

@webknjaz

This comment has been minimized.

Member

webknjaz commented Feb 2, 2018

Thanks! I'll try to review this ASAP. But if not, it's okay to ping me.

# self.body is a generator object that must be yielded
# for _caching.tee_output() to finish it's job and
# remove orphaned threading._Event object from cache
for i in self.body:

This comment has been minimized.

@webknjaz

webknjaz Feb 3, 2018

Member

This could be replaced by self.collapse_body(), but I'm still trying to figure out a fix on a different level.

@jaraco

This change seems to have elicited a new failure in the Windows AppVeyor tests. I've restarted the tests for this PR twice and they fail consistently. We'll have to devise a fix for that failed test before we can accept this.

@jaraco jaraco force-pushed the cherrypy:master branch from 7af3dd3 to 513d4bf Feb 4, 2018

@@ -929,6 +939,7 @@ def finalize(self):
# and 304 (not modified) responses MUST NOT
# include a message-body."
dict.pop(headers, 'Content-Length', None)
self._flush_body()

This comment has been minimized.

@webknjaz

webknjaz Feb 4, 2018

Member

@jaraco do you fully agree with such approach? I'd prefer not putting the body there over consuming it here.

@jaraco

I'm still uneasy with this PR. I don't fully understand its implications and as much as I've tried to document the changes, I don't know what to write to explain the behavior.

Furthermore, this sort of change really needs a test to capture the desired behavior. The associated ticket describes a scenario which should trigger the failure, so what needs to be done is codify that scenario in a test. I'm out of time to work on this today, but I welcome others to implement the test.

@cherrypy cherrypy deleted a comment from codacy-bot Feb 4, 2018

@cherrypy cherrypy deleted a comment from codacy-bot Feb 4, 2018

@cherrypy cherrypy deleted a comment from codacy-bot Feb 4, 2018

@webknjaz

This comment has been minimized.

Member

webknjaz commented Feb 4, 2018

@jaraco it feels like you got into my head and wrote down every single thought I had regarding this PR :)
I've been trying to go through the code and trace everything which is happening during such flow, but only succeed with a few parts of it so far. I need more time to understand it in full. It seems to me that it patches the side-effect, but should fix the root cause of the issue.
And of course regression test is 100% mandatory for such a complicated case.

@mrvanes

This comment has been minimized.

Contributor

mrvanes commented Feb 4, 2018

It took me more than a month to understand the time-outs we experienced (and that included working my way into the dungeons of cherrypy). I'm glad it takes you guys a while as well ;)

The issue I linked fully describes the bug and condition, and it's hard to explain it better. It does help to exactly follow the code as I explain, but it also helps to fully grok the CachingTool.

The problem arises around a race condition that is caused by expiring a previous normally cached static file. As the MemoryCache.expire_cache daemon thread only deletes the variant and leaves the uri in the store dictionary, MemoryCache.get() will now follow variant = uricache.wait() codepath on a next request, and AntiStampedeCache will guard the cache by inserting a threading._Event object in the cache (because it didn't find a result). Now, if this next request happens to be an 'if-modified-since' request, the static.serve_fileobj() path will except (caused by cptools.validate_since()) before the body is filled with the fileObject and set headers to 304 not-modified. Response.finalize() on it's turn used to skip the tee_output generator on this and leave the threading._Event behind, holding off the next requests untill the threading._Event timed out. Causing our observed time-outs.

However, only consuming tee_output on this 304 case would fill the cache with an empty body and repsonse headers 304. This is nonsense for someone that would request the same uri/variant without an if-modified-since header, because that client would not have the file in cache, but would be served with an empty body, containing a header (not-modified) it never asked for!

I do doubt the sanity of the coarse empty body case cache deletion in tee_output. It certainly helps to solve the observed case and doesn't harm others. The only problem is, it interferes with the caching of potentially expensive empty body results, but I have no clue how to correctly address that and well, we had to move on with our project ;)

By the way: This bug also raises my question why first requests to non-existing uri's in caching.store won't receive the AntiStampedeCache guard treatment. Of all circumstances, a freshly restarted server that has empty (memory)cache should be protected against a request stampede? Of course, I see why it happens, but I think it could use some attention as well?

@mrvanes

This comment has been minimized.

Contributor

mrvanes commented Feb 12, 2018

So, did my second attempt at explaining the race help?

@webknjaz

This comment has been minimized.

Member

webknjaz commented Feb 12, 2018

I didn't have time to check this attentively just yet.

@mrvanes

This comment has been minimized.

Contributor

mrvanes commented Feb 14, 2018

I suddenly realised why I had so much trouble understanding the CachingTool in the first place and suggest another fix for that: the _caching.get() function isn't a getter, so it's name is confusing. It loads the server.response if cache is available and returns true or false on success/failure to do that. I think it should be renamed load().

@mrvanes

This comment has been minimized.

Contributor

mrvanes commented Mar 28, 2018

Still no time to review the PR?

@jaraco jaraco force-pushed the cherrypy:master branch from b7344b6 to 4ce8547 May 28, 2018

@c00kiemon5ter

This comment has been minimized.

c00kiemon5ter commented Jun 6, 2018

Hi everyone,

I think this is fine as it is now. It is a proper fix, even if it is not the best possible solution. As this does not affect any public API or method signature, I think this should be merged. If you think there is room for improvement, lets open an issue and link it to this PR with a reference to the merge commit and work on it when time allows - but until that is done we can have the fix.

What do you think?

@webknjaz

This comment has been minimized.

Member

webknjaz commented Jun 6, 2018

@c00kiemon5ter Thanks for your opinion!

The main blocker for me as a maintainer is being able to understand the implementation and what are the consequences for other parts of code.

I'm really sorry that I didn't find time to dig into it, but to me it feels irresponsible to blindly merge changes. I hope you understand this.

It's still in my long backlog. As a community member, you can try helping @jaraco and me to verify things and comment on this more, so that it will make it easier for us to do reviews and would take much less time.
Normally we have some limited amount of FOSS time available and we cannot always commit to spending a lot of it, but if it starts looking easy to go through the chances that we'll look closer and maybe reach "merge stage" are much higher. Thanks you in advance for understanding!

P.S. @mrvanes could you please rebase this PR on top of current master?

mrvanes and others added some commits Feb 2, 2018

Extract function for _flush_body. Use 'consume' to consume an iterato…
…r. Use 'iter' to make iterator from iterable.
Rewrite comments with some punctuation to make it read more clearly. …
…I still don't understand why deleting the cache is necessary. The previous comment said 'an empty response will be cached', but that seems inconsistent with the 'if body' block.
Rewrite comments with some punctuation to make it read more clearly. …
…I still don't understand why deleting the cache is necessary. The previous comment said 'an empty response will be cached', but that seems inconsistent with the 'if body' block.

@mrvanes mrvanes force-pushed the SURFscz:master branch from d517e0e to f901ad3 Jun 6, 2018

@webknjaz

This comment has been minimized.

Member

webknjaz commented Jun 7, 2018

// Side note:

def get(invalid_methods=('POST', 'PUT', 'DELETE'), debug=False, **kwargs):

it looks like the list should also include PATCH

@webknjaz

This comment has been minimized.

Member

webknjaz commented Jun 7, 2018

// Also side note:

https://tools.ietf.org/html/rfc7231#section-4.3.7

Responses to the OPTIONS method are not cacheable.

@jaraco

jaraco approved these changes Aug 14, 2018

@jaraco jaraco merged commit 157a4e4 into cherrypy:master Aug 14, 2018

1 of 4 checks passed

LGTM analysis: Python Analysis Failed (could not build the base commit (8090b79))
Details
continuous-integration/appveyor/pr Waiting for AppVeyor build to complete
Details
continuous-integration/travis-ci/pr The Travis CI build is in progress
Details
WIP ready for review
Details

@mrvanes mrvanes referenced this pull request Nov 28, 2018

Open

mdx timeout issue #121

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment