Following three months of near 100% uptime, we’ve just been through three major outages in as many days. I wanted to take some time to detail the problems and what we intend to do to prevent similar downtime in the future.
Outage #1 (02/02/2010 9:55:09AM PST) was initiated by a load spike on one of our file servers (fs1a). When a file server stops responding to heartbeat, the slave server in the pair kills the master and takes over. In this case, the master was not killed quickly enough and the storage partitions did not migrate cleanly to the slave. Cleanup on the split-blain file server pair was delayed due to some inefficient DRBD configuration that we’ve been meaning to update. By rolling out improvements to the DRBD configuration, this type of problem should be prevented from happening in the future.
Outage #2 (02/03/2010 6:10:08PM PST) looked like a power outage at first, since so many machines were affected, but the root cause was the deployment of a faulty DRBD configuration update that propagated to all machines (courtesy of Puppet) and started causing pairs of machines to halt replication to prevent corruption caused by an invalid configuration file. Eventually the load balancer pair was affected and we could no longer even serve the Angry Unicorn page. The way that the servers went down, the number of servers that went down, and the length of time it takes to resync downed pairs resulted in a lengthy outage. There are several steps to preventing this kind of outage in the future. First and most obvious is to maintain tighter control and testing of proposed system-wide configuration changes. We also plan to deploy (well-tested) changes to the DRBD configuration that will reduce cleanup times and automate the startup process for downed machines. These changes will result in shorter recovery times in the event of single failovers and wider machine-level restarts.
Outage #3 (02/04/2010 2:37:08AM PST) was caused by massive load spikes across all five file servers. To prevent extended downtime we marked all file servers as offline (preventing them from going into failover) and looking for the cause of the load. After inspecting the HTTP logs, we identified a Yahoo! spider that was making thousands of requests but never waiting for responses. After banning the spider, the load returned to normal and we were able to bring the file servers back online. We are looking at our rate limiting strategy and will be making improvements over time to get the best performance for legitimate users and the best protection from anomalous behavior.
In order to execute the improvements to various infrastructure elements, we will be having scheduled maintenance windows at 10PM PST over the next week. Most of these changes will not require any downtime, but some of them may result in temporary unavailability of file server partitions. As we perform the maintenance, we’ll keep you updated via the GitHub Twitter account, so make sure to check there for the latest maintenance news.
We sincerely apologize for the recent problems and are working very hard to address each flaw. Stability is one of our biggest goals this year, and I look forward to making your GitHub experience as flawless as possible.