AntiStampede Cache leaves orphaned threading.Event object on 304 Not Modified response #1690

Closed
mrvanes opened this Issue Jan 19, 2018 · 20 comments

@mrvanes
Contributor

mrvanes commented Jan 19, 2018

The AntiStampede Cache leaves an orphaned threading.Event object on a 304 Not Modified response, which results in a 30-second timeout on the subsequent request.

We are not able to reliably reproduce this problem; we only know that it happens, and we have logs with which to scrutinize the TOOLS.CACHING code path followed when hitting the bug.

We've narrowed the bug down to:

  • static files
  • an 'If-Modified-Since' request provoking a 304 Not Modified response, which leaves an orphaned threading.Event, causing
  • a 30-second timeout on the subsequent (normal) request

The logs show that the 304-provoking request follows the AntiStampedeCache.wait path, setting the threading.Event and returning None for MemoryCache.get()'s variant object. That makes MemoryCache.get() return None to get()'s cache_data, so get() follows its 'request is not cached' path and returns False; but MemoryCache.put() is never called afterwards, leaving the threading.Event object orphaned.

The subsequent request for the same static file also follows the AntiStampedeCache.wait path. There it encounters the orphaned threading.Event object, which makes it fruitlessly wait 30 seconds, after which it resolves the problem by diligently populating the cache with the (eventually) served static file.

After this, normal operation is resumed.
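
For readers without caching.py at hand, here is a simplified, paraphrased sketch of the anti-stampede mechanism as we understand it (Python 3 spelling, not the verbatim CherryPy source): the first thread to miss the cache plants a threading.Event and is expected to fill the entry via put(); followers wait on that Event. If put() never runs, the Event is orphaned and every follower waits out the full timeout.

import threading

class AntiStampedeMap(dict):
    """Toy variant map whose readers wait for an in-flight producer."""

    def wait(self, key, timeout=30):
        value = self.get(key)
        if isinstance(value, threading.Event):
            # Someone is (supposedly) producing this entry: wait for it.
            # On timeout the Event may still be in place, in which case
            # the caller proceeds as if it were a plain cache miss.
            value.wait(timeout)
            return self.get(key)
        elif value is None:
            # Cache miss: plant an Event so later readers queue up behind
            # us. If the producer never calls put(), this Event is the
            # orphan described in this issue.
            self[key] = threading.Event()
            return None
        return value

    def put(self, key, value):
        event = self.get(key)
        self[key] = value
        if isinstance(event, threading.Event):
            event.set()  # release any waiting readers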

This happens about once a day, probably after the cache has expired and one of the clients was the last to request a static file with an 'If-Modified-Since' header. But until now, we have not been able to come up with a clean, reproducible test case.

CherryPy is part of the Pyff daemon we deployed.
IdentityPython/pyFF#116

  • CherryPy version: 11.1.0
  • Python version: 2.7, due to a Pyff dependency
  • OS: Debian GNU/Linux 9
  • Browser: Chrome 63

The logging that shows the problem, produced by extra debugging lines we inserted, looks as follows. I understand that interpreting these logs without knowing the location of those statements is awkward; nevertheless, it clearly shows the timeout after the offending 304'd static-file request.

Jan 19 08:43:26 proxy2 pyffd[4100]: [19/Jan/2018:08:43:26] TOOLS.CACHING get https:/***/static/bootstrap/css/bootstrap.min.css
Jan 19 08:43:26 proxy2 pyffd[4100]: [19/Jan/2018:08:43:26] TOOLS.CACHING Wait result for key: (), type(value) <type 'NoneType'>
Jan 19 08:43:26 proxy2 pyffd[4100]: [19/Jan/2018:08:43:26] TOOLS.CACHING Cache was None, set Event <threading._Event object at 0x7f7b6c623a50>
Jan 19 08:43:26 proxy2 pyffd[4100]: [19/Jan/2018:08:43:26] TOOLS.CACHING variant found
Jan 19 08:43:26 proxy2 pyffd[4100]: [19/Jan/2018:08:43:26] TOOLS.CACHING request is not cached
[...]
Jan 19 08:43:26 proxy2 pyffd[4100]: 145.97.144.156 - - [19/Jan/2018:08:43:26] "GET /static/bootstrap/css/bootstrap.min.css HTTP/1.1" 304 - "https://***/role/idp.ds?return=https%3A%2F%2F***%2FSaml2SP%2Fdisco&entityID=https%3A%2F%2F***%2FSaml2SP%2Fproxy_saml2_backend.xml" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36"


Jan 19 08:47:52 proxy2 pyffd[4100]: [19/Jan/2018:08:47:52] TOOLS.CACHING get https://***/static/bootstrap/css/bootstrap.min.css
Jan 19 08:47:52 proxy2 pyffd[4100]: [19/Jan/2018:08:47:52] TOOLS.CACHING Wait result for key: (), type(value) <class 'threading._Event'>
Jan 19 08:47:52 proxy2 pyffd[4100]: [19/Jan/2018:08:47:52] TOOLS.CACHING Event <threading._Event object at 0x7f7b6c623a50>
Jan 19 08:47:52 proxy2 pyffd[4100]: [19/Jan/2018:08:47:52] TOOLS.CACHING Waiting up to 30 seconds
[...]
Jan 19 08:48:22 proxy2 pyffd[4100]: [19/Jan/2018:08:48:22] TOOLS.CACHING Timed out 30 seconds
Jan 19 08:48:22 proxy2 pyffd[4100]: [19/Jan/2018:08:48:22] TOOLS.CACHING variant found
Jan 19 08:48:22 proxy2 pyffd[4100]: [19/Jan/2018:08:48:22] TOOLS.CACHING request is not cached
Jan 19 08:48:22 proxy2 pyffd[4100]: [19/Jan/2018:08:48:22] TOOLS.CACHING get() status: None
Jan 19 08:48:22 proxy2 pyffd[4100]: [19/Jan/2018:08:48:22] TOOLS.CACHING Storing status: 200 OK, uri:https://***/static/bootstrap/css/bootstrap.min.css
Jan 19 08:48:22 proxy2 pyffd[4100]: 2001:610:514:172:31da:c3b9:12ad:1b76 - - [19/Jan/2018:08:48:22] "GET /static/bootstrap/css/bootstrap.min.css HTTP/1.1" 200 109518 "https://***/role/idp.ds?return=https%3A%2F%2F***%2FSaml2SP%2Fdisco&entityID=https%3A%2F%2F***%2FSaml2SP%2Fproxy_saml2_backend.xml" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"

What we noticed is that the caching.py get() code path for the cherrypy.HTTPRedirect 304 exception is not touched on the 304 response. This must mean that the 304 is generated in static.py, in serve_file(), by cptools.validate_since().
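
For context, a paraphrase of that check (simplified and from memory, not the verbatim cptools source): validate_since() compares the request's If-Modified-Since header with the Last-Modified header already set on the response, and short-circuits a matching GET with a bodiless 304 by raising HTTPRedirect:

import cherrypy

def validate_since_sketch():
    # Simplified paraphrase of cherrypy.lib.cptools.validate_since().
    request = cherrypy.serving.request
    response = cherrypy.serving.response
    lastmod = response.headers.get('Last-Modified')
    since = request.headers.get('If-Modified-Since')
    if lastmod and since == lastmod:
        # A bodiless 304 bypasses the normal body-producing path,
        # which is exactly where the cache would have been repopulated.
        raise cherrypy.HTTPRedirect([], 304)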

Up to now, we have been unable to explain why this specific 304 response causes the cache object not to be populated with the corresponding response, even though a threading.Event was set.

What we did see is that MemoryCache.expire_cache() clears (del) the AntiStampedeCache variant object in the store dictionary, but keeps the store uri key. This means that after cache expiration a store[uri] key exists but has no cache object attached. This might explain the 304 following the AntiStampedeCache.wait path without replacing the threading.Event object, but we were not sure, and (up to now) we were unable to force faster cache expiration to produce a test case. A toy model of this state is sketched below.
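
To make the suspected state concrete, here is a toy model of the two-level store after expiry (hypothetical uri; the () variant key matches the "key: ()" seen in our logs):

import threading

store = {}
uri = 'https://***/static/js/pyff.js'  # hypothetical
store[uri] = {(): ('200 OK', {}, b'...', 0.0)}  # filled by a normal GET

# What we observed expire_cache() doing: the variant entry is deleted,
# but the store[uri] key itself survives as an empty variant map.
del store[uri][()]
assert uri in store and store[uri] == {}

# The next wait() on this uri therefore misses and plants an Event...
store[uri][()] = threading.Event()
# ...and if the 304 code path never calls put(), that Event is orphaned.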

For now, we have decided to let the bug go and work around it by letting nginx serve Pyff's static files, although that is of course a less than ideal solution.

@mrvanes

Contributor

mrvanes commented Jan 20, 2018

Could it be that a cached, 304'd static.serve_file() response never reaches _caching.tee_output in CachingTool because of the nature of the exception (it is caught neither in serve_file nor in CachingTool._wrapper), and thus never replaces the stale threading.Event?

@webknjaz

Member

webknjaz commented Jan 20, 2018

Hi,

Do you have a dump of the HTTP traffic, at least? It would probably help us with the investigation.

@jaraco what do you think? It looks like the information provided isn't enough.

@mrvanes

Contributor

mrvanes commented Jan 20, 2018

No, we don't have a dump of the traffic. The 304 is caused by a client we don't own, and the problem is not reproducible by us, apart from the fact that it happens every now and then.

@baszoetekouw

baszoetekouw commented Jan 20, 2018

Well, we haven't deployed the workaround yet, so I can start a tcpdump and see if the problem occurs in the next couple of days.

@mrvanes

Contributor

mrvanes commented Jan 25, 2018

The tcpdump for an offending request reads as follows:

GET /static/js/pyff.js HTTP/1.1
Host: mdq.*.nl
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36
Accept: */*
DNT: 1
Referer: https://mdq.*.nl/role/idp.ds?return=https%3A%2F%2Fproxy.*.nl%2FSaml2SP%2Fdisco&entityID=https%3A%2F%2Fproxy.*.nl%2FSaml2SP%2Fproxy_saml2_backend.xml
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9,nl;q=0.8,de;q=0.7,fr;q=0.6
Cookie: crowd.token_key=***
If-Modified-Since: Thu, 11 Jan 2018 12:33:22 GMT
X-Forwarded-For: 2001:*:*:*:*:*:*
X-Forwarded-Port: 443
X-Forwarded-Proto: https
Forwarded: for=2001:*:*:*:*:*:*; proto=https; by=2001:*:*:*:*:*:*

The CherryPy logs say the following (as mentioned before, I added some extra logging, which shows that AntiStampedeCache.wait is setting a threading._Event object to hold off other requesters):

TOOLS.CACHING get https://mdq.*.nl/static/js/pyff.js
TOOLS.CACHING Wait result for key: (), type(value) <type 'NoneType'>
TOOLS.CACHING Cache was None, set Event <threading._Event object at 0x7fecc052fc10>
TOOLS.CACHING uricache type(variant): <type 'NoneType'>
TOOLS.CACHING request is not cached
TOOLS.CACHING get() status: None
TOOLS.STATICDIR Checking file '/opt/pyff/pyff-env/local/lib/python2.7/site-packages/pyff/site/static/js/pyff.js' to fulfill '/static/js/pyff.js'
INFO:cherrypy.access.140655269161424:2001:*:*:*:*:* - - [24/Jan/2018:11:01:58] "GET /static/js/pyff.js HTTP/1.1" 304 - "https://mdq.*.nl/role/idp.ds?return=https%3A%2F%2Fproxy.*.nl%2FSaml2SP%2Fdisco&entityID=https%3A%2F%2Fproxy.*.nl%2FSaml2SP%2Fproxy_saml2_backend.xml" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"

What stands out to me is that my explicit logging of the MemoryCache.put() method is missing after the final cherrypy.access GET 304 log line. This means that either the before_finalize hook isn't called after the 304 exception, or it doesn't contain _caching.tee_output(), or _caching.tee_output() refuses to replace the threading._Event in the cache.

Looking at _cprequest.respond, I find it hard to believe that 'before_finalize' isn't run in the case of an HTTPRedirect exception:

    def respond(self, path_info):
        """Generate a response for the resource at self.path_info. (Core)"""
        try:
            try:
                try:
                    self._do_respond(path_info)
                except (cherrypy.HTTPRedirect, cherrypy.HTTPError):
                    inst = sys.exc_info()[1]
                    inst.set_response()
                    self.stage = 'before_finalize (HTTPError)'
                    self.hooks.run('before_finalize')
                    cherrypy.serving.response.finalize()
...

So _cptools.CachingTool might fail to attach the _caching.tee_output hook because the request is not cacheable; but that is very unlikely, because we see 'request is not cached' in the logs, which implies not request.cached, which implies request.cacheable, which in turn triggers the request.hooks.attach(tee_output) path.

This leaves tee_output as a suspect, but it would only fail to put the result if a 'Cache-Control' request header containing 'no-cache' or 'no-store' were set, which isn't the case here.

So I can't, for the life of me, understand why MemoryCache.put() isn't called in the before_finalize hook for the above 304'd GET. But the reality is that it isn't, and that causes the next request to time out after 30 seconds, as described.

@webknjaz

Member

webknjaz commented Jan 25, 2018

Could you please set tools.caching.debug to True? This should produce more logging from cherrypy.lib.caching.
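
For reference, it can be set in the global config like this (a minimal sketch; where pyff actually applies its CherryPy config may differ):

import cherrypy

cherrypy.config.update({
    'tools.caching.on': True,
    'tools.caching.debug': True,  # extra log lines from cherrypy.lib.caching
})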

@mrvanes

Contributor

mrvanes commented Jan 26, 2018

OK, so I now know what happens and how to reproduce it, and I have a proposal for a fix.

How to reproduce:

  • Configure a service with a static handler and cache support (a minimal sketch follows this list)
  • Restart the CherryPy-based service (pyff in my case); this clears the cache
  • Do a normal GET request for a static resource; this fills the cache for store[uri][variant]
  • Note the Last-Modified date of the response
  • Wait for the cache to expire; this clears store[uri][variant] (but not store[uri]!)
  • Do an If-Modified-Since GET on the static resource, using the Last-Modified date noted above
  • Do a normal GET request for the static resource and experience the AntiStampedeCache timeout before the request is fulfilled (the timeout is set to 30 seconds in pyff)
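
A minimal cherrypy-only sketch of the first step (untested against 11.1.0; the directory and delays are placeholders, and I'm assuming the caching tool forwards extra kwargs such as delay and antistampede_timeout onto the MemoryCache instance):

import os
import cherrypy

class Root(object):
    pass

config = {
    '/static': {
        'tools.staticdir.on': True,
        'tools.staticdir.dir': os.path.abspath('static'),  # placeholder
        'tools.caching.on': True,
        'tools.caching.debug': True,
        # Short expiry so the "wait for the cache to expire" step is quick.
        'tools.caching.delay': 10,
        # pyff's effective anti-stampede timeout is 30 seconds.
        'tools.caching.antistampede_timeout': 30,
    },
}

cherrypy.quickstart(Root(), '/', config)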

What happens:

  • Because the expiry only cleared the variant of the cache, AntiStampedeCache is now activated and inserts a threading._Event for the request
  • As soon as static.serve_fileobj hits the If-Modified-Since check, validate_since raises the 304 exception
  • before_finalize is run and calls caching.tee_output, which replaces response.body with the tee(response.body) generator object (ultimately responsible for put()'ing the response into the cache)
  • Response.finalize() hits the elif code < 200 or code in (204, 205, 304): branch and consequently pops the Content-Length header and sets self.body to ntob(''), replacing the body generator object
  • Whoever is responsible for sending the response body now iterates over the ntob('') object, so the tee(response.body) generator is never touched, leaving the AntiStampedeCache threading._Event for this uri/variant orphaned (a toy illustration follows this list)
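
A toy illustration of that last point, in plain Python rather than CherryPy code: a generator only performs its trailing side effect when it is exhausted, so swapping it out for an empty body means the put() at its tail, and the Event release, never happen:

import threading

cache = {}
event = threading.Event()  # stands in for the anti-stampede Event

def tee(body):
    # Mimics caching.tee_output's tee(): buffer the chunks, then cache.
    output = []
    for chunk in body:
        output.append(chunk)
        yield chunk
    cache['key'] = b''.join(output)  # the put()
    event.set()                      # would release any waiting readers

body = tee(iter([b'hello']))

# What finalize() effectively does for a 304: discard the generator.
body = b''

# The tee generator was never iterated, so neither the put() nor the
# event.set() ever ran: waiters on this Event will time out.
assert 'key' not in cache and not event.is_set()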

My solution:

  • Even though 1xx, 204, 205 and 304 responses are forbidden to contain a body, it is imperative to iterate over self.body before assigning ntob('') to it, so that the CachingTool can finish its job. I therefore inserted the following lines above the self.body = ntob('') line in Response.finalize():

# Exhaust the body generator so any tee'd consumer runs to completion.
for i in self.body:
    pass

This forces the tee(response.body) generator to execute and finish its caching job.
But now the cache contains the cached 304 object, which is nonsense for a normal GET on the same resource. So I also had to insert the following code in caching.tee_output.tee():

# save the cache data
body = ntob('').join(output)
if not body:
    # An empty body means a bodiless response (e.g. a 304) was tee'd;
    # don't cache it, just drop the entry and its anti-stampede Event.
    cherrypy._cache.delete()
else:
    cherrypy._cache.put((response.status, response.headers or {},
                         body, response.time), len(body))

This prevents tee() from caching empty responses and solves the problem in the short term.

However, I'm not convinced this is the best solution: it undermines the caching of any empty-body response, and such a response might be expensive to produce even though it doesn't actually contain a body, so this interferes with the intention of the cache.
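
One possible refinement, sketched under the same assumptions as the snippet above (same tee() context, and assuming response.status is the '304 Not Modified'-style string at this point): decide based on the status codes that finalize() special-cases, rather than on body emptiness, so that legitimately empty 200 responses can still be cached:

# Alternative tail for tee(): skip caching only for statuses that are
# defined to be bodiless, instead of for any response with an empty body.
body = ntob('').join(output)
code = int(str(response.status).split(' ', 1)[0])
if code < 200 or code in (204, 205, 304):
    # Drop the entry (and its anti-stampede Event) without caching.
    cherrypy._cache.delete()
else:
    cherrypy._cache.put((response.status, response.headers or {},
                         body, response.time), len(body))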

@webknjaz

Member

webknjaz commented Jan 26, 2018

@mrvanes are you saying that you use the (hot) autoreload feature, which is there for development environments only?

@webknjaz

Member

webknjaz commented Jan 26, 2018

I mean, what does your restart procedure consist of? Can it be distilled into a cherrypy-only example?

@mrvanes

Contributor

mrvanes commented Jan 27, 2018

I don't know where you read that I use autoreload. I said I restart pyff (the service we need, which exhibits a bug that is caused by CherryPy). I assume that when I restart pyff, that implies restarting CherryPy?

You can inspect how pyff uses cherrypy here: https://github.com/leifj/pyFF. We lack the time and resources to clean this up further and, with all due respect, I don't see what more we can do. I explained exactly how the bug can be reproduced, how it is triggered in CherryPy code, and how it can be solved. What more information do you need?

@webknjaz

Member

webknjaz commented Jan 27, 2018

Yeah, I see. But it would help if you pointed out how you restart that thing.

@mrvanes

Contributor

mrvanes commented Jan 27, 2018

I wrote a systemd unit file that (re)starts the pyff executable. We run pyff in the foreground under systemd, so we get logging to syslog. So I restart pyff like this: "service pyff restart"

@webknjaz

Member

webknjaz commented Jan 27, 2018

Do you know whether this command kills the process and starts a new one, or whether it sends it some signal? Do you use -a or --autoreload there?

@mrvanes

Contributor

mrvanes commented Jan 27, 2018

No, as far as I know systemd stops the foreground process and starts it from scratch, as if I pressed Ctrl-C on the command line and then ran the pyff command again.

@webknjaz

Member

webknjaz commented Jan 27, 2018

That's weird: the cache you refer to is in-memory. That doesn't match the process you described.

@mrvanes

Contributor

mrvanes commented Jan 27, 2018

Yes, the in-memory cache is corrupted by a request that produces a 304 response. I restart the service only to make sure the cache is empty to begin with, so that the procedure to reproduce the problem is the same for everyone; it's not a requirement for triggering the bug.

@webknjaz

Member

webknjaz commented Jan 27, 2018

If you restart the process this way there's no memory footprint left from the previous process run.

@mrvanes

Contributor

mrvanes commented Jan 27, 2018

The bug is not caused by previous runs; it's caused by a bug in the CachingTool that's exposed by issuing the three GETs on a static resource described in the reproduction steps above.

@mrvanes

Contributor

mrvanes commented Feb 1, 2018

How do we proceed from here? Is there anything we can do to get the fix upstream? Would you like to see a pull request?

@webknjaz

Member

webknjaz commented Feb 1, 2018

@mrvanes hello, sorry for being silent. I was going to dig into the issue to understand better what's going on and to experience it in person, but I haven't had the time. Unfortunately, we're running out of maintainers (and of time we can contribute to keeping the project in good shape).
Yes, a PR with a regression test would be great; however, I'm not sure about the hacky approach of reading the body. That's why I wanted to look at it personally.
