In just a few short weeks we will be moving GitHub to a new home at Rackspace. We’re aware of the current stability and performance issues, and we want to let you know what we’re doing about them. After all, we’re GitHub users too! The move to Rackspace will bring about a new backend architecture and a lot more servers, leading to a much-improved user experience for everyone. Thanks for sticking with us through our growing pains!
Since we have a highly technical audience, I wanted to share some background on the reasons behind the move, what we’ve been doing to prepare for the big changes ahead, and what kinds of service improvements you can look forward to seeing on the new infrastructure.
As you may know, we’ve had a hosting partnership with Engine Yard since very early in the GitHub story. Engine Yard is dedicated to supporting open source initiatives and saw value in helping us grow the site to foster innovation within the Ruby community. For their tireless support and expertise, we are extremely grateful. We wouldn’t be where we are now without them.
The decision to move hosts is never an easy one. The logistics of migrating a site as large and complex as GitHub are intimidating. The single most important reason we’re undertaking this effort is so that we can give you, our customer, a better experience on GitHub. We’re growing at a rate of over 400 new users and 1000 new repositories every day, and these rates are only increasing with time. We need to take drastic action now to put in place the kind of infrastructure that will allow us to provide you with a top-notch user experience.
In making the decision to move hosts, we put together a set of requirements that would be necessary to ensure the viability of our business over the next ten years.
- Price. In order to keep ahead of the traffic curve, we need to have immediate access to affordable, commodity hardware. Within five years, it’s not hard to imagine that a cluster of 100 or more servers will be necessary to keep GitHub running smoothly. To guarantee a sustainable business, this amount of hardware must not be prohibitively expensive.
- Flexibility. We’ve grown to a size where it no longer makes sense to have every server virtualized. The benefits of running bare metal are obvious and have been empirically proven. We need to have the option to run bare metal when it is appropriate to the task at hand. We also need to be able to configure boxes with custom setups. If we need six large hard drives in a certain class of machine, then we must be able to get that. If we need boxes with 32GB or 64GB of RAM, those must be available.
- Capacity. It is undesirable to be the biggest fish in the pond. Our host must have experience with sites that are several orders of magnitude larger than us. We must feel comfortable knowing that all of the scalability requirements that we will encounter over the next ten years will be tractable on an available, battle-tested infrastructure.
- Control. Having direct access (via DRAC or similar) to the actual hardware means we can control every aspect of each server’s setup, from network layout and burn-in tests to operating system and RAID configuration. When system-level problems arise, we must be in a position to fix them without the need for outside intervention. At the end of the day, we should be responsible for as much of our stack as financially feasible.
- Globalization. Our long-term plan involves making GitHub faster for our international customers. Our host should have data centers in Europe and Asia so that we don’t have to look outside our primary provider to provision hardware around the world.
- Cloud. On-demand access to a cloud infrastructure will be important to us as we increase the number and variety of low-frequency but long-running jobs that we process. A provider that has a first-class cloud offering would be ideal for keeping latencies low and pricing simple.
- Trust. Our host should be a big-name player in the hosting field with an excellent reputation and multiple recommendations from other large sites. The entire future of our company rests on making GitHub stable, fast, and efficient. It is essential that our host be able to keep up with our exacting standards and provide us with competent service.
After evaluating our options, it became evident that Rackspace was the right choice for our ongoing hosting needs. They meet or exceed every one of our requirements and are the only large provider with a strong offering in both traditional and cloud services. In addition, we’ve arranged a partnership deal with Rackspace that includes discounted hardware that will allow us to bring more machines online faster than would otherwise be possible. It was important to us that this partnership not create any conflict of interest, so we’ll be true paying customers of Rackspace, just with mutually beneficial opportunities that will help keep our plans at their current low prices.
To give you a concrete idea of what this new partnership means to you, consider this: on Engine Yard we currently have the following resources (not including DB and CORAID, which are out of our control):
- 10 VMs
- 39 VCPUs
- 54GB RAM
On Rackspace, we’ll be enjoying the following setup:
- 16 physical machines
- 128 physical cores
- 288GB RAM
I think the specifications speak for themselves! Within the new hardware layout, we’re placing significant importance on high availability and redundancy. On Rackspace, every piece of our infrastructure will have failover. That means two database servers, four web servers, two GitHub Pages instances, two Gem Server instances, two Archive Download instances, distributed Job runners, three pairs of file servers, and plenty more.
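To put the two resource lists above in proportion, here is a small back-of-the-envelope calculation. The dictionary keys are labels chosen for this sketch, and note that it compares virtual machines and VCPUs on one side with physical machines and cores on the other, so the ratios are only a rough guide.

```python
# Figures copied from the two lists above; key names are illustrative only.
engine_yard = {"machines": 10, "cores": 39, "ram_gb": 54}    # VMs and VCPUs
rackspace = {"machines": 16, "cores": 128, "ram_gb": 288}    # physical hardware

# Ratio of new capacity to old, rounded to one decimal place.
ratios = {key: round(rackspace[key] / engine_yard[key], 1) for key in engine_yard}

for key, ratio in ratios.items():
    print(f"{key}: {ratio}x")
```

By this rough measure, the new setup has a bit over three times the core count and a bit over five times the RAM.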
Speaking of file servers, the move to Rackspace means we’ll finally be leaving our shared file system behind. We’ve far exceeded the normal IO tolerances of GFS and it has become the source of many of the problems in our stack. It has also prevented us from adding hardware to the site for nearly a year. Tremendous kudos go to Chris Wanstrath for his ceaseless and amazing work over the last six months to optimize the site enough that it stays running (hurray for Memcache and Redis!).
Since April I’ve been working on a brand new federated backend architecture that will allow us to store repositories on commodity file servers. When we need more storage capacity, we merely have to add more machines and update a routing table. The file servers expose an RPC interface to the Git repositories that can be accessed from anywhere in the cluster. This will allow us to horizontally and separately scale the frontend, backend, and other pieces of the infrastructure.
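To make the routing-table idea a little more concrete, here is a minimal sketch in Python. It is not the actual GitHub implementation: the class name, the host names (`fs1`, `fs2`, `fs3`), and the "fewest repositories wins" placement rule are all assumptions made for illustration. The only part taken from the description above is the shape of the idea: a table maps each repository to a file server, and adding capacity means adding a machine and updating the table.

```python
from collections import Counter


class RoutingTable:
    """Hypothetical table mapping each repository to the file server storing it."""

    def __init__(self, servers):
        self.servers = list(servers)  # made-up host names
        self.routes = {}              # "user/repo" -> host

    def add_server(self, host):
        # Adding capacity: register the new machine. Existing routes are untouched.
        self.servers.append(host)

    def assign(self, repo):
        # Assumed placement rule: put a new repository on the server that
        # currently holds the fewest repositories (ties go to the earliest server).
        if repo in self.routes:
            return self.routes[repo]
        load = Counter(self.routes.values())
        host = min(self.servers, key=lambda server: load[server])
        self.routes[repo] = host
        return host

    def lookup(self, repo):
        # A frontend or job runner would consult the table to find which
        # file server should receive its RPC call for this repository.
        return self.routes[repo]


table = RoutingTable(["fs1", "fs2"])
table.assign("alice/project")
table.assign("bob/tool")
table.add_server("fs3")        # more storage: one new machine, one table update
table.assign("carol/site")
print(table.routes)
```

Because callers only ever go through `lookup`, the frontends never need to know how many file servers exist, which is what would let each tier be scaled separately.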
There are too many improvements in too many parts of our process and infrastructure to cover in this already lengthy post. I’ll happily dive into the specifics of the new architecture and other logistics in a series of follow-up articles over the coming weeks.
Right now we’re putting the finishing touches on the production Rackspace cluster and working on the big repository data migration. I’ll be keeping you updated on the progress and the countdown to the final move. We’re aiming to restrict downtime to the day of the actual move and limit service interruption as much as possible.
It takes a lot of effort to build, maintain, and host a site like GitHub. I’d like to thank Engine Yard for getting us to where we are, Rackspace for helping us get to where we’re going, and you for making GitHub such an amazing project to work on.