Earlier this week we began experimenting with using Amazon CloudFront as a CDN for serving static assets. We've also rolled out some general asset delivery optimizations. Depending on how far away you are from our main Washington D.C. datacenters, you should see a nice decrease in overall page load times.
The rest of this post goes into detail on how we implemented this stuff and also a bit on how we're measuring performance around the world.
Some background: over the years we've spent a lot of time optimizing asset delivery. This includes things like js/css asset bundling and using multiple asset hosts. Recently, we started in on another round of optimizations driven by a general goal of decreasing page load times outside of the US, and also by changes in the page load performance profile due to the move to SSL for all asset delivery.
Measuring Page Load Performance
We're using BrowserMob to monitor full page load performance on a few key pages. BrowserMob is interesting for this kind of profiling for a couple of reasons. First, it measures full page load time in a real browser (including all assets and Ajax requests), as opposed to monitoring the response time of an individual request. A report like the following is available for each run:
The green portion of each bar represents connect time (more than half of which is usually attributed to "SSL Handshaking"). Purple is waiting for the response to begin. Grey is the actual receiving of the response data.
If this looks familiar, it's probably because these reports are very similar to the Network/Resource graphing tools built into most modern browsers' development tools. What's great about BrowserMob, though, is that these run at regular intervals and from multiple locations around the world. The results are then graphed on a nice timeline.
Here are the results for the past week's worth of changes for a public repository page:
Each point on the graph is the overall page load time for a run at a specific location. The big red circle areas are timeouts or other errors.
Using a Single Asset Host
The first thing we wanted to test was moving to a single asset host. i.e.,
assets.github.com instead of
Since github.com went 100% SSL, we've found that the cost of performing SSL handshakes against multiple asset hosts slightly outweighed the benefits provided by the browser's ability to do more request parallelization. This gets worse as you move further away and incur more latency. Most modern browsers support between four and eight simultaneous connections now, too, so distributing requests between asset hosts has less of a payoff in general.
The BrowserMob report shows that this didn't have a massive impact on average good page load times, but it seemed to stabilize things quite a bit. With multiple asset hosts, timeouts and drastically different load times were frequent. This leveled out after moving to a single asset host.
Problems with Multiple Points of Origin
This is something we surfaced during our research that hasn't been addressed yet. It's worth mentioning for anyone hosting assets from multiple servers in a load balancing setup.
At GitHub, we currently have six frontend servers. They run nginx and also the GitHub application code, background jobs, etc. Each asset request is routed round robin to one of these hosts. This results in assets having multiple points of origin, which, depending on your server and deployment configuration, can lead to a couple of subtle performance issues:
The last modified times on files may vary between machines based on when the assets were deployed. (This is especially true if you use git for deployment as timestamps are set to the time of last checkout.) When the same asset has different timestamps on different origin machines, conditional HTTP GET requests using If-Modified-Since can lead to full 200 OK responses instead of nice, contentless 304 Not Modified responses.
Using long-lived expiration headers avoids many of these requests altogether but browsers love to validate content in a number of circumstances, including manual refresh, hitting <ENTER> in your URL bar, and also randomly on Tuesdays.
An SSL handshake needs to be performed on each of the origin servers. A browser opening six connections may land on six different hosts and need to perform six different handshakes. Stated succinctly:
We're not doing this yet, on account of it requiring a fairly major redesign of our frontend architecture. Luckily, most CDNs like CloudFront have tuned their SSL negotiations fairly well and we're able to take advantage of that for asset requests.
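The Last-Modified mismatch described above is easy to reproduce in miniature. The following is a rough sketch, not our server code: the host names `fe1` and `fe2`, the deploy times, and the "reply 304 only if our copy is not newer" rule are all assumptions made for illustration. Some servers compare timestamps for exact equality instead, which is stricter still.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Optional

# Hypothetical checkout times for the SAME file on two origin machines.
ORIGIN_MTIMES = {
    "fe1": datetime(2011, 5, 1, 12, 0, 0, tzinfo=timezone.utc),
    "fe2": datetime(2011, 5, 1, 12, 0, 7, tzinfo=timezone.utc),
}


def last_modified(origin: str) -> str:
    """Last-Modified header value this origin would send for the file."""
    return format_datetime(ORIGIN_MTIMES[origin], usegmt=True)


def conditional_get(origin: str, if_modified_since: Optional[str]) -> int:
    """Status code for a GET, honoring If-Modified-Since when present."""
    if if_modified_since is None:
        return 200
    cached = parsedate_to_datetime(if_modified_since)
    # Assumed rule: 304 only when the origin's copy is not newer than the
    # timestamp the browser is holding.
    return 304 if ORIGIN_MTIMES[origin] <= cached else 200


# The browser first fetched the asset from fe1 and kept its Last-Modified.
validator = last_modified("fe1")

same_origin_status = conditional_get("fe1", validator)   # revalidate on fe1
other_origin_status = conditional_get("fe2", validator)  # round robin picks fe2
```

Revalidating against the machine that served the file yields a 304, while landing on the machine with the later checkout time yields a full 200 even though the bytes are identical.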
We used CloudFront's support for Custom Origins. This means we don't have to deal with shipping assets to an S3 bucket on deploy. Instead, you point the CloudFront distribution to your existing asset host (assets.github.com in our case) and then change the asset URLs referenced in page responses to the CloudFront distribution. If an asset isn't available at a CloudFront server, it will be fetched and stored for subsequent requests.
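The fetch-and-store behavior can be pictured as a pull-through cache. This is a toy model under our own assumptions, not a description of CloudFront internals: the `PullThroughCache` class, the `fetch_from_origin` function, and the file contents are hypothetical, and expiry and eviction are left out entirely.

```python
from typing import Callable, Dict


class PullThroughCache:
    """Toy edge node: serve from the local store, else pull from the origin."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch
        self._store: Dict[str, bytes] = {}
        self.origin_hits = 0  # how many requests had to go back to the origin

    def get(self, path: str) -> bytes:
        if path not in self._store:
            # Cache miss: fetch once from the origin and keep the result.
            self.origin_hits += 1
            self._store[path] = self._fetch(path)
        return self._store[path]


# Hypothetical origin contents standing in for the existing asset host.
ORIGIN_FILES = {"/stylesheets/bundle_common.css": b"body { margin: 0 }"}


def fetch_from_origin(path: str) -> bytes:
    return ORIGIN_FILES[path]


edge = PullThroughCache(fetch_from_origin)
first = edge.get("/stylesheets/bundle_common.css")   # miss, pulled from origin
second = edge.get("/stylesheets/bundle_common.css")  # hit, served from the edge
```

Only the first request for a given path reaches the origin, which is why nothing needs to be uploaded ahead of time on deploy.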
This was fairly easy to get working using Rails's built-in support for configurable asset hosts. One issue we did run into is that CloudFront ignores the query string portion of the URL, which is used to force the browser to reload cached assets when changed. We got around this by moving the asset id into the path portion of the URL. So instead of assets being referenced as /stylesheets/bundle_common.css?85e47ae, they are now referenced as /85e47ae/stylesheets/bundle_common.css. A simple Nginx rewrite handles locating the file on disk when these URLs are requested.
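To make the URL scheme concrete, here is a small sketch of both directions of the mapping. It is written in Python purely for illustration and is not the Rails or Nginx configuration we run. The function names are hypothetical, and the pattern assumes the asset id is always seven lowercase hex characters, which is inferred only from the example above.

```python
import re

# Assumption: an asset id is exactly seven lowercase hex characters.
ASSET_ID_PREFIX = re.compile(r"^/[0-9a-f]{7}(/.*)$")


def asset_url(path: str, asset_id: str) -> str:
    """Put the asset id in the first path segment instead of the query string."""
    return "/" + asset_id + "/" + path.lstrip("/")


def path_on_disk(url_path: str) -> str:
    """Strip a leading asset id segment, mimicking what a rewrite rule would do."""
    match = ASSET_ID_PREFIX.match(url_path)
    return match.group(1) if match else url_path


url = asset_url("/stylesheets/bundle_common.css", "85e47ae")
disk_path = path_on_disk(url)
```

Because the id is part of the path, a changed asset gets a distinct URL even for a cache that ignores query strings, and stripping the first segment recovers the original file location.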
One other thing worth mentioning is that, while CloudFront supports SSL, you won't be able to use a custom domain name. All of our assets are currently referenced with these ugly
NOTE: BrowserMob runs on Amazon EC2 but all CloudFront access is performed over external interfaces. Still, the relationship between the EC2 and CloudFront networks should be taken into account when interpreting these results.
Our goal in moving assets to a CDN is mostly to decrease load times around the world by serving them from hosts that are geographically nearer to the client making the request. We expected to see decent gains on the US West Coast and in Europe, and large gains as you moved closer to the other side of the world. So what actually happened?
We were disappointed in the BrowserMob results for the US West Coast and Europe. They stayed relatively flat or got worse. This may be something strange with BrowserMob's server locations, however. Running similar comparisons from a browser on my desktop in San Francisco shows good gains. We'll be in touch with BrowserMob to determine if there might be problems with their DNS not resolving hosts to nearby servers properly.
The results for Singapore were more in line with what we were hoping for. Here's a single run from before we turned CloudFront on:
And a run after we turned CloudFront on:
Connect + SSL Handshake time saw dramatic improvement. Wait time also decreased considerably. Lastly, CloudFront seems to do a significantly better job at not terminating Keep-Alive connections prematurely compared to our origin asset servers.
There's one other bit of weirdness we're noticing with CloudFront that we left off of the timeline graph above. For some reason, page load time from Dallas became extremely erratic when assets were switched over to CloudFront:
We don't want to draw any conclusions from such a small sample but this would seem to indicate that individual CloudFront nodes are susceptible to some kind of temporary overload or other source of instability.
That's where we're at. This is still very much an experiment and we'd like to compare performance of other CDNs and collect a little more data before making a final decision.