GitHub.com suffered two outages early this week that resulted in one hour and 46 minutes of downtime and another hour of significantly degraded performance. This is far below our standard of quality, and for that I am truly sorry. I want to explain what happened and give you some insight into what we're doing to prevent it from happening again.
First, some background
During a maintenance window in mid-August our operations team replaced our aging pair of DRBD-backed MySQL servers with a 3-node cluster. The servers collectively present two virtual IPs to our application: one that's read/write and one that's read-only. These virtual IPs are managed by Pacemaker and Heartbeat, a high availability cluster management stack that we use heavily in our infrastructure. Coordination of MySQL replication to move 'active' (a MySQL master that accepts reads and writes) and 'standby' (a read-only MySQL slave) roles around the cluster is handled by Percona Replication Manager, a resource agent for Pacemaker. The application primarily uses the 'active' role for both reads and writes.
This new setup provides, among other things, more efficient failovers than our old DRBD setup. In our previous architecture, failing over from one database to another required a cold start of MySQL. In the new infrastructure, MySQL is running on all nodes at all times; a failover simply moves the appropriate virtual IP between nodes after flushing transactions and appropriately changing the
read_only MySQL variable.
Monday, September 10th
The events that led up to Monday's downtime began with a rather innocuous database migration. We use a two-pass migration system to allow for zero-downtime MySQL schema migration. This has been a relatively recent addition, but we've used it a handful of times without any issue.
Monday's migration caused higher load on the database than our operations team has previously seen during these sorts of migrations. So high, in fact, that they caused Percona Replication Manager's health checks to fail on the master. In response to the failed master health check, Percona Replication manager moved the 'active' role and the master database to another server in the cluster and stopped MySQL on the node it perceived as failed.
At the time of this failover, the new database selected for the 'active' role had a cold InnoDB buffer pool and performed rather poorly. The system load generated by the site's query load on a cold cache soon caused Percona Replication Manager's health checks to fail again, and the 'active' role failed back to the server it was on originally.
At this point, I decided to disable all health checks by enabling Pacemaker's
maintenance-mode; an operating mode in which no health checks or automatic failover actions are performed. Performance on the site slowly recovered as the buffer pool slowly reached normal levels.
Tuesday, September 11th
The following morning, our operations team was notified by a developer of incorrect query results returning from the node providing the 'standby' role. I investigated the situation and determined that when the cluster was placed into
maintenance-mode the day before, actions that should have caused the node elected to serve the 'standby' role to change its replication master and start replicating were prevented from occurring. I determined that the best course of action was to disable
maintenance-mode to allow Pacemaker and the Persona Replication Manager to rectify the situation.
Upon attempting to disable
maintenance-mode, a Pacemaker segfault occurred that resulted in a cluster state partition. After this update, two nodes (I'll call them 'a' and 'b') rejected most messages from the third node ('c'), while the third node rejected most messages from the other two. Despite having configured the cluster to require a majority of machines to agree on the state of the cluster before taking action, two simultaneous master election decisions were attempted without proper coordination. In the first cluster, master election was interrupted by messages from the second cluster and MySQL was stopped.
In the second, single-node cluster, node 'c' was elected at 8:19 AM, and any subsequent messages from the other two-node cluster were discarded. As luck would have it, the 'c' node was the node that our operations team previously determined to be out of date. We detected this fact and powered off this out-of-date node at 8:26 AM to end the partition and prevent further data drift, taking down all production database access and thus all access to github.com.
As a result of this data drift, inconsistencies between MySQL and other data stores in our infrastructure were possible. We use Redis to query dashboard event stream entries and repository routes from automatically generated MySQL ids. In situations where the id MySQL generated for a record is used to query data in Redis, the cross-data-store foreign key relationships became out of sync for records created during this window.
Consequentially, some events created during this window appeared on the wrong users' dashboards. Also, some repositories created during this window were incorrectly routed. We've removed all of the leaked events, and performed an audit of all repositories incorrectly routed during this window. 16 of these repositories were private, and for seven minutes from 8:19 AM to 8:26 AM PDT on Tuesday, Sept 11th, were accessible to people outside of the repository's list of collaborators or team members. We've contacted all of the owners of these repositories directly. If you haven't received a message from us, your repository was not affected.
After confirming that the out-of-date database node was properly terminated, our operations team began to recover the state of the cluster on the 'a' and 'b' nodes. The original attempt to disable
maintenance-mode was not reflected in the cluster state at this time, and subsequent attempts to make changes to the cluster state were unsuccessful. After tactical evaluation, we team determined that a Pacemaker restart was necessary to obtain a clean state.
At this point, all Pacemaker and Heartbeat processes were stopped on both nodes, then started on the 'a' node. MySQL was successfully started on the 'a' node and assumed the 'active' role. Performance on the site slowly recovered as the buffer pool slowly reached normal levels.
In summary, three primary events contributed to the downtime of the past few days. First, several failovers of the 'active' database role happened when they shouldn't have. Second, a cluster partition occurred that resulted in incorrect actions being performed by our cluster management software. Finally, the failovers triggered by these first two events impacted performance and availability more than they should have.
In ops I trust
The automated failover of our main production database could be described as the root cause of both of these downtime events. In each situation in which that occurred, if any member of our operations team had been asked if the failover should have been performed, the answer would have been a resounding no. There are many situations in which automated failover is an excellent strategy for ensuring the availability of a service. After careful consideration, we've determined that ensuring the availability of our primary production database is not one of these situations. To this end, we've made changes to our Pacemaker configuration to ensure failover of the 'active' database role will only occur when initiated by a member of our operations team.
We're also investigating solutions to ensure that these failovers don't impact performance when they must be performed, either in an emergency situation or as a part of scheduled maintenance. There are various facilities for warming the InnoDB buffer pool of slave databases that we're investigating and testing for this purpose.
Finally, our operations team is performing a full audit of our Pacemaker and Heartbeat stack focusing on the code path that triggered the segfault on Tuesday. We're also performing a strenuous round of hardware testing on the server on which the segfault occurred out of an abundance of caution.
Status Downtime on Tuesday, September 11th
We host our status site on Heroku to ensure its availability during an outage. However, during our downtime on Tuesday our status site experienced some availability issues.
As traffic to the status site began to ramp up, we increased the number of dynos running from 8 to 64 and finally 90. This had a negative effect since we were running an old development database addon (shared database). The number of dynos maxed out the available connections to the database causing additional processes to crash.
We worked with Heroku Support to bring a production database online that would be able to handle the traffic the site was receiving. Once this database was online we saw an immediate improvement to the availability of the status site.
Since the outage we've added a database slave to improve our availability options for unforeseen future events.
The recent changes made to our database stack were carefully made specifically with high availability in mind, and I can't apologize enough that they had the opposite result in this case. Our entire operations team is dedicated to providing a fast, stable GitHub experience, and we'll continue to refine our infrastructure, software, and methodologies to ensure this is the case.