AntiStampede Cache leaves orphaned threading.Event object on 304 Not Modified response #1690
Comments
Could it be that a cached 304'd static.serve_file() never reaches _caching.tee_output in CachingTool because of the nature of the exception (it is neither caught in serve_file nor CachingTool._wrapper) and thus never replaces the stale threading.Event? |
Hi, do you have a dump of the HTTP traffic, at least? It would probably help us in the investigation. @jaraco what do you think? It looks like the info provided isn't enough. |
No, we don't have a dump of the traffic. The 304 is caused by a client we don't own and the problem is not reproducible by us, apart from the fact that it happens every now and then. |
Well, we haven't deployed the workaround yet, so I can start a tcpdump and see if the problem occurs in the next couple of days. |
The tcpdump capture for an offending request reads as follows:

GET /static/js/pyff.js HTTP/1.1
Host: mdq.*.nl
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36
Accept: */*
DNT: 1
Referer: https://mdq.*.nl/role/idp.ds?return=https%3A%2F%2Fproxy.*.nl%2FSaml2SP%2Fdisco&entityID=https%3A%2F%2Fproxy.*.nl%2FSaml2SP%2Fproxy_saml2_backend.xml
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9,nl;q=0.8,de;q=0.7,fr;q=0.6
Cookie: crowd.token_key=***
If-Modified-Since: Thu, 11 Jan 2018 12:33:22 GMT
X-Forwarded-For: 2001:*:*:*:*:*:*
X-Forwarded-Port: 443
X-Forwarded-Proto: https
Forwarded: for=2001:*:*:*:*:*:*; proto=https; by=2001:*:*:*:*:*:*

The cherrypy logs say (as said before, I added some extra logging, which shows that AntiStampedeCache.wait is setting a threading._Event object to hold off other requesters)
What stands out to me is that my explicit logging of the MemoryCache.put() method is missing after the final cherrypy.access GET 304 log line. This means that either the before_finalize hook isn't called after the 304 exception, or it doesn't contain _caching.tee_output(), or _caching.tee_output() refuses to replace the threading._Event in the cache. Looking at _cprequest.respond, I find it hard to believe that 'before_finalize' isn't run in the case of an HTTPRedirect exception:

def respond(self, path_info):
"""Generate a response for the resource at self.path_info. (Core)"""
try:
try:
try:
self._do_respond(path_info)
except (cherrypy.HTTPRedirect, cherrypy.HTTPError):
inst = sys.exc_info()[1]
inst.set_response()
self.stage = 'before_finalize (HTTPError)'
self.hooks.run('before_finalize')
cherrypy.serving.response.finalize()
    ...

So either _cptools.CachingTool fails to add the _caching.tee_output hook because the request is not cacheable, which is very unlikely, because we see 'request is not cached' in the logs, which implies not request.cached, which implies request.cacheable, which in turn takes the request.hooks.attach(tee_output) path. That leaves tee_output as the suspect, but tee_output would only fail to put the result if a 'Cache-Control' request header containing 'no-cache' or 'no-store' was set, which isn't the case. So I can't, for the life of me, understand why MemoryCache.put() isn't called in the before_finalize hook for the above 304'd GET. But the reality is that it isn't, and that causes the next request to time out for 30 seconds, as said. |
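For context on why put() can be skipped entirely: the caching tool's tee hook wraps response.body in a generator and only writes to the cache after that generator has been fully consumed. A simplified sketch of the pattern (my reading of _caching.tee_output, with the Cache-Control/Pragma checks omitted; not the verbatim source):

import cherrypy

def tee_output():
    # Attached as a 'before_finalize' hook by CachingTool when the
    # request turned out not to be served from the cache.
    response = cherrypy.serving.response

    def tee(body):
        output = []
        for chunk in body:        # never executes if the body is never iterated
            output.append(chunk)
            yield chunk
        # Only reached once the last chunk has been streamed to the client.
        body = b''.join(output)
        cherrypy._cache.put((response.status, response.headers or {},
                             body, response.time), len(body))

    response.body = tee(response.body)

If the 304's body is then discarded without ever being iterated (a 304 is sent without a body), the tail of tee() is never reached, put() never runs, and the threading.Event placeholder left by AntiStampedeCache.wait is never replaced, which would match the missing MemoryCache.put() log line.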
Could you please set |
Ok, so I now know what happens, how to reproduce it and have a proposal for a fix. How to reproduce:
What happens:
My solution:
for i in self.body:
    pass

This forces the tee(response.body) generator to execute and finish its caching job.
# save the cache data
body = ntob('').join(output)
if not body:
    cherrypy._cache.delete()
else:
    cherrypy._cache.put((response.status, response.headers or {},
                         body, response.time), len(body))

This prevents tee() from caching empty responses and solves the problem in the short term. However, I'm not convinced this is the best solution, because it undermines the caching of any empty-body response that might be expensive to generate even though it contains no body, and hence interferes with the intention of the cache. |
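To make the first half of the proposal concrete: the code after a generator's yield loop only runs once the generator has been exhausted, so a response body that is thrown away unconsumed never reaches the put(). A tiny standalone illustration (all names here are made up for the example, not CherryPy API):

import threading

cache = {}                              # stands in for the per-URI variant store
cache['variant'] = threading.Event()    # placeholder left behind by wait()

def tee(body):
    output = []
    for chunk in body:
        output.append(chunk)
        yield chunk
    cache['variant'] = b''.join(output)  # replaces the Event placeholder

wrapped = tee(iter([]))                  # a 304 has an empty body
# Without draining, the Event survives:
assert isinstance(cache['variant'], threading.Event)
# Draining the generator (the 'for i in self.body: pass' idea) lets the
# tail of tee() run and replace the placeholder:
for _ in wrapped:
    pass
assert cache['variant'] == b''

Note that with an empty body, draining alone would cache an empty entry, which is why the second half of the proposal deletes the key instead of storing it.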
@mrvanes are you saying that you use the (hot) autoreload feature, which is there for development-environment purposes only? |
I mean, what does your restart procedure consist of? Can it be distilled into a cherrypy-only example? |
I don't know where you read that I use autoreload. I said I restart pyff (which is the service we need, and which exhibits a bug that is caused by cherrypy). I assume that restarting pyff implies restarting cherrypy? You can inspect how pyff uses cherrypy here: https://github.com/leifj/pyFF. We lack the time and resources to distil this down further and, with all due respect, I don't see what more we can do. I explained exactly how the bug can be reproduced, how it is triggered in the cherrypy code, and how it can be solved. What more information do you need? |
Yeah, I see. But it would help if you pointed out how you restart that thing. |
I wrote a systemd unit file that (re)starts the pyff executable. We run pyff in the foreground under systemd so we have logging to syslog. So I restart pyff like this: "service pyff restart" |
Do you know whether this command kills the process and starts a new one, or whether it sends it some signal? Do you use |
No, as far as I know, systemd stops the foreground process and starts it from scratch, as if I did a Ctrl-C on the command line and then executed the pyff command again. |
That's weird: the cache you refer to is in-memory. It doesn't match the process you described. |
Yes, the in-memory cache is corrupted on a request that causes a 304 response. I restart the service to make sure the cache is empty to start with, so that the procedure to reproduce the problem is the same for everyone; it's not a requirement to trigger the bug. |
If you restart the process this way there's no memory footprint left from the previous process run. |
The bug is not caused by previous runs; it's caused by a bug in the CachingTool that's exposed by issuing three normal GETs against a static resource. |
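For anyone who wants to replay those GETs outside pyFF, a minimal CherryPy application with both tools enabled might look like this (the static directory and cache delay are placeholders, not pyFF's actual configuration):

import cherrypy

class Root:
    @cherrypy.expose
    def index(self):
        return 'ok'

config = {
    '/static': {
        'tools.staticdir.on': True,
        'tools.staticdir.dir': '/var/www/static',   # placeholder path
        'tools.caching.on': True,
        'tools.caching.delay': 3600,                # cache lifetime in seconds
    },
}

if __name__ == '__main__':
    cherrypy.quickstart(Root(), '/', config)

Requesting the static file once, then again with an If-Modified-Since header equal to its Last-Modified value, should exercise the 304 path described above.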
How do we proceed from here? Is there anything we can do to get the fix upstream? Would you like to see a pull-request? |
@mrvanes hello, sorry for being silent. I was going to dig into the issue to understand better what's going on and experience it in person, but I didn't have time for this. Unfortunately we're running short of maintainers (and of time we can contribute to keeping the project in good shape). |
AntiStampede Cache leaves orphaned threading.Event object on 304 Not Modified response and results in a 30-second timeout on the subsequent request.
We are not able to reliably reproduce this problem; we only know that it happens, and we have the logs to scrutinize the TOOLS.CACHING code path followed when the bug is hit.
We've narrowed down the bug to
The logs show that the 304-provoking request follows the AntiStampedeCache.wait path: it sets a threading.Event and returns None to MemoryCache.get() as the variant object, MemoryCache.get() in turn returns None for get()'s cache_data, and get() follows its 'request is not cached' path and returns False. MemoryCache.put() is never called, leaving the threading.Event object orphaned.
The subsequent request for the same static file also follows the AntiStampedeCache.wait path. It encounters the orphaned threading.Event object and fruitlessly waits 30 seconds, after which it resolves the problem by diligently populating the cache object with the (eventually) served static file.
After this, normal operation is resumed.
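For readers unfamiliar with the anti-stampede mechanism: on a cache miss the first requester parks a threading.Event in the variant slot and goes off to build the response, while later requesters that find an Event block on it until real data arrives or the timeout expires. A rough sketch of that behaviour (simplified, with an assumed 30-second timeout matching the behaviour we observed; not the verbatim CherryPy source):

import threading

def wait(variants, key, timeout=30):
    """Return cached data for key, None for the first requester on a miss,
    or block while another thread is supposedly filling the slot."""
    value = variants.get(key)
    if isinstance(value, threading.Event):
        # Another thread claimed this slot; wait for it to publish data.
        value.wait(timeout)     # blocks the full 30 s if the Event is orphaned
        value = variants.get(key)
        return None if isinstance(value, threading.Event) else value
    if value is None:
        # First requester on a miss: park an Event so followers wait
        # instead of stampeding the backend.
        variants[key] = threading.Event()
        return None
    return value

If the first requester never replaces its Event (the 304 case above), every follower pays the full timeout before giving up.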
This happens roughly once a day, probably after the cache has expired and one of the clients was the last to request a static file with an 'If-Modified-Since' header. But until now, we have not been able to come up with a clean, reproducible test case.
CherryPy is part of the Pyff daemon we deployed.
IdentityPython/pyFF#116
The logging showing the problem, which we produced by inserting extra debugging lines, looks as follows. I understand that interpreting these logs without knowing where these statements live is awkward. Nevertheless, they clearly show the timeout after the offending 304'd static file request.
What we noticed is that the cherrypy.HTTPRedirect 304 exception code path in caching.py's get() is not touched on the 304 response. This must mean that the 304 is generated in static.py's serve_file() by cptools.validate_since().
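For reference, cptools.validate_since compares the request's If-Modified-Since header with the response's Last-Modified header and short-circuits the request on a match; a condensed sketch of that logic (not the verbatim CherryPy source):

import cherrypy

def validate_since():
    """Raise 304 for a conditional GET/HEAD whose If-Modified-Since
    matches the response's Last-Modified header."""
    request = cherrypy.serving.request
    response = cherrypy.serving.response
    lastmod = response.headers.get('Last-Modified')
    since = request.headers.get('If-Modified-Since')
    if lastmod and since == lastmod:
        if request.method in ('GET', 'HEAD'):
            raise cherrypy.HTTPRedirect([], 304)
        raise cherrypy.HTTPError(412)

The HTTPRedirect([], 304) raised here propagates up to respond(), which, per the code quoted earlier in this thread, still runs before_finalize; the problem is that a 304 is sent without a body, so the tee-wrapped body is presumably discarded rather than iterated.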
Up to now, we were unable to explain why this specific 304 response would leave the cache object unpopulated with the corresponding response, even though a threading.Event had been set.
What we did see is that MemoryCache.expire_cache() deletes (del) the AntiStampedeCache variant object from the store dictionary but keeps the store's uri key. This means that after cache expiration a store[uri] key exists but has no cache object attached. This might explain the 304 following the AntiStampedeCache.wait path without replacing the threading.Event object, but we were not sure, and were unable (up to now) to force faster cache expiration to produce a test case.
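Our mental model of the store layout (an assumption from reading the caching module, not verbatim source): MemoryCache.store maps each URI to a per-variant dictionary (the AntiStampedeCache), and expire_cache() deletes only the expired variant entry, not the URI key:

# store: {uri: {variant_key: cached_tuple_or_Event}}
store = {
    '/static/js/pyff.js': {
        ('gzip, deflate, br',): ('200 OK', {}, b'...', 1515673000.0),
    },
}

# What expire_cache() effectively does for an expired entry:
uri, variant = '/static/js/pyff.js', ('gzip, deflate, br',)
del store[uri][variant]

# The URI key survives with an empty variant dict, so the next request
# for this URI goes through AntiStampedeCache.wait on the empty dict
# and parks a fresh threading.Event there.
assert store[uri] == {}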
For now, we have decided to let the bug go and work around it by letting nginx serve Pyff's static files, although this is of course a less than ideal solution.