Use HTTP proxy caching instead of file/disk caching #296

Closed
knowtheory opened this Issue Nov 10, 2015 · 10 comments


@knowtheory
Member

DocumentCloud has, since its outset, used Rails's page caching mechanism to store JS and JSON blobs that feed our frontend JavaScript apps.

Page caching is nice for several reasons:

  • It's easy to understand: a file is either cached on disk or it isn't.
  • It's easy to control: if you need to expire the cache, you delete the file.
  • The configuration is simple: as long as the Rails app writes files to disk in a directory structure that matches your Rails routes, NGINX will happily serve the files in those directories instead of passing requests to Rails (a sketch follows below).
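To illustrate that last point, the classic page-cache arrangement looks roughly like the following; the document root and app port are placeholders, not our actual values:

```nginx
# Serve Rails page-cache files straight from disk when they exist,
# and fall back to the Rails app when they don't.
server {
  listen 80;
  root /var/www/documentcloud/public;

  location / {
    # If public/<path> or public/<path>.html exists on disk, NGINX
    # serves it directly; otherwise the request goes to the app.
    try_files $uri $uri.html @rails;
  }

  location @rails {
    proxy_pass http://127.0.0.1:3000;
  }
}
```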

Page caching also has a variety of limitations:

  • Caching is limited to URIs that map to file paths (e.g., query params aren't respected).
  • Caching is also limited to URI paths shorter than the file system's limit on file name length.

The problem

These limitations mean that caching fails in a variety of circumstances, ranging from very long URLs (which can happen when a user embeds a document collection specified by a very long search query) to JSONP resources.

Proposal

We can and should switch to using NGINX as a caching proxy in front of the app, instead of caching resource URLs to disk.
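For concreteness, a minimal sketch of what that proxy might look like; the cache path, zone name, and upstream address are assumptions rather than our actual config:

```nginx
# Declare an on-disk cache zone for the proxy to use.
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=app_cache:10m
                 max_size=1g inactive=60m;

server {
  listen 80;

  location / {
    proxy_cache app_cache;
    # The full request URI (query string included) forms the cache key,
    # so ?q=... and JSONP-style URLs are cached correctly.
    proxy_cache_key $scheme$host$request_uri;
    proxy_pass http://127.0.0.1:3000;
  }
}
```

Because the cache key is a hashed string rather than a file path, both page-caching limitations above (ignored query params and file-name length) go away.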

What'll this entail?

This will entail rewriting our NGINX configs to set up the front-end proxy. It will also require reworking a few parts of the app platform:

  • prefer_secure, secure_only, and the structure of the proxy<->app relationship need to be addressed so that the app knows when/how to redirect users to secure resources.
  • Rework caching itself to be specified with Cache-Control headers.
  • Rework cache expiry to use proxy_cache_bypass in order to update cached resources in place (see the sketch after this list).
  • Benchmark endpoints prior to implementing HTTP caching.
  • Add tests to ensure that caching always respects login status, that cache expiry is honored, and that the system can withstand high volumes of requests.
  • Benchmark endpoints again after HTTP caching is implemented.
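As a rough illustration of the proxy_cache_bypass item above: a request matching the bypass condition goes through to the app, and the fresh response overwrites the cached copy in place. The refresh header here is a hypothetical choice, not an agreed-upon name:

```nginx
# In the caching location from the sketch above: requests carrying the
# (hypothetical) X-Cache-Refresh header skip the cached copy, hit Rails,
# and the new response is written back into the cache.
location / {
  proxy_cache        app_cache;
  proxy_cache_bypass $http_x_cache_refresh;
  proxy_pass         http://127.0.0.1:3000;
}
```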

Potential risks/drawbacks

  • HTTP caching is not a panacea, and over-caching can produce problematic end-user experiences.

More discussion to follow.

@knowtheory
Member

Forgot to drop these stats demonstrating the difference between a cached and an uncached endpoint: https://gist.github.com/knowtheory/307fcf34acd6a9427787

@reefdog
Contributor
reefdog commented Nov 13, 2015

Issues that are affected/solved by this:

@esthervillars added a commit that referenced this issue Mar 8, 2016
  064933c: Adds jmeter for #296 caching, initial test for endpoints at dci 104. Adds listeners to the testplan in order to output reports and graphs after the test runs.
@esthervillars added a commit that referenced this issue Mar 8, 2016
  49bdb08: Adds jmeter for #296 caching, initial test for endpoints at dci 104. Adds listeners to the testplan in order to output reports and graphs after the test runs.
@reefdog
Contributor
reefdog commented Mar 21, 2016

NGINX reserves key-based cache expiration for the $1,000/year tier, so we're starting with a simple five-second cache on all resources, with no platform-driven expiration. This gets us cache performance for popular resources, but with one downside: unpopular resources will always be cold.

We should factor this into our benchmarks and consider how this affects real-world use.
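For reference, this flat policy amounts to roughly the following directive inside the caching location (values illustrative):

```nginx
# Cache successful responses for five seconds; the first request after
# expiry re-fetches from the Rails app and refreshes the stored copy.
proxy_cache_valid 200 302 5s;
```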

@knowtheory
Member

Right, so generally speaking, response times tend to follow a power-law distribution.

Front-end caching doesn't obviate work that has to be done by backend services, whether the database, processing queue, or search server.

What front-end caching does is reduce duplicate requests. For now, that generally matches DocumentCloud's behavior in several important cases (for the purpose of serving web traffic).

@reefdog
Contributor
reefdog commented Mar 21, 2016

Yeah, just wanted us to stay aware, since we're transferring from a cache system that, while vastly inferior overall, was equitable in its performance distribution. I realized our benchmarks should try to take both use cases into consideration. Like BitTorrent, the new cache system will work best when it counts most, but could regress performance in the long tail.

@reefdog
Contributor
reefdog commented Apr 26, 2016

@knowtheory We can close this, yeah?

@reefdog
Contributor
reefdog commented Oct 25, 2016

Ping. We can close, yeah? There's a list of unchecked things above, which is why I'm asking before just doing it.

@knowtheory
Member

Working backwards through the list above:

  • Caching was deployed and helped us withstand the attention focused on the Panama Papers.
  • I made caching configurable per request/endpoint in order to give public search caches a shelf life longer than 10 seconds (the default).
  • I benchmarked NGINX as a reverse proxy cache for all public and anonymous resources using the NGINX configuration in the repo. We used JMeter to test this and saw substantial performance benefits (matching disk-caching performance).
  • I set up our NGINX reverse proxy configuration and platform to distinguish between authenticated and unauthenticated users via a cookie (see the cache config and cookie settings for it; a rough sketch follows below).
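In outline, the cookie check looks something like the following; the cookie name here is a stand-in for illustration, not necessarily the one our config uses:

```nginx
# Skip the shared cache for logged-in users: if the (hypothetical)
# dc_logged_in cookie is present, neither serve from nor write to the cache.
location / {
  proxy_cache        app_cache;
  proxy_cache_bypass $cookie_dc_logged_in;
  proxy_no_cache     $cookie_dc_logged_in;
  proxy_pass         http://127.0.0.1:3000;
}
```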
@knowtheory closed this Oct 25, 2016
@reefdog
Contributor
reefdog commented Oct 25, 2016 (edited)

Caching was deployed and helped us withstand the attention focused on the Panama Papers.

Haha, forgot that we got that out just before those hit. Great timing.

Right, so one thing I'd forgotten is that we only serve cached responses to unauthenticated users. Which is fine; it covers the "keep the servers up when slammed by Panama Papers traffic" intended use case, but I was getting some unexpected test results until I remembered that.
