Downtime last Saturday

On Saturday, December 22nd we had a significant outage and we want to take the time to explain what happened. This was one of the worst outages in the history of GitHub, and it's not at all acceptable to us. I'm very sorry that it happened and our entire team is working hard to prevent similar problems in the future.

Background

We had a scheduled maintenance window Saturday morning to perform software updates on our aggregation switches. This software update was recommended by our network vendor and was expected to address the problems that we encountered in an earlier outage. We had tested this upgrade on a number of similar devices without incident, so we had a good deal of confidence. Still, performing an update like this is always a risky proposition, so we scheduled a maintenance window and had support personnel from our vendor on the phone during the upgrade in case of unforeseen problems.

What went wrong?

In our network, each of our access switches, which our servers are connected to, is also connected to a pair of aggregation switches. These aggregation switches are installed in pairs and use a feature called MLAG to appear as a single switch to the access switches for the purposes of link aggregation, spanning tree, and other layer 2 protocols that expect to have a single master device. This allows us to perform maintenance tasks on one aggregation switch without impacting the partner switch or the connectivity for the access switches. We have used this feature successfully many times.

Our plan involved upgrading the aggregation switches one at a time, a process called in-service software upgrade. You upload new software to one switch, configure the switch to reboot on the new version, and issue a reload command. The remaining switch detects that its peer is no longer connected and begins a failover process to take control over the resources that the MLAG pair jointly managed.

We ran into some unexpected snags after the upgrade that caused 20-30 minutes of instability while we attempted to work around them within the maintenance window. Disabling the links between half of the aggregation switches and the access switches allowed us to mitigate the problems while we continued to work with our network vendor to understand the cause of the instability. This wasn't ideal, since it compromised our redundancy and only allowed us to operate at half of our uplink capacity, but our traffic was low enough at the time that it didn't pose any real problems. At 1100 PST we decided that we would revert the software update and return to a redundant state at 1300 PST if we did not have a plan for resolving the issues we were experiencing with the new version.

Beginning at 1215 PST, our network vendor began gathering some final forensic information from our switches so that they could attempt to discover the root cause for the issues we'd been seeing. Most of this information gathering was isolated to collecting log files and retrieving the current hardware status of various parts of the switches. As a final step, they wanted to gather the state of one of the agents running on a switch. This involves terminating the process and causing it to write its state in a way that can be analyzed later. Since we were performing this on the switch that had its connections to the access switches disabled, they didn't expect there to be any impact. We have performed this type of action, which is very similar to rebooting one switch in the MLAG pair, many times in the past without incident.

This is where things began going poorly. When the agent on one of the switches is terminated, the peer has a 5 second timeout period during which it waits to hear from it again. If it does not hear from the peer, but still sees active links between them, it assumes that the other switch is still running but in an inconsistent state. In this situation it is not able to safely take over the shared resources, so it defaults back to behaving as a standalone switch for purposes of link aggregation, spanning tree, and other layer 2 protocols.

Normally, this isn't a problem because the switches also watch for the links between peers to go down. When this happens, they wait 2 seconds for the link to come back up. If the links do not recover, the switch assumes that its peer has died entirely and performs a stateful takeover of the MLAG resources. This type of takeover does not trigger any layer 2 changes.

When the agent was terminated on the first switch, the links between peers did not go down, since the agent is unable to instruct the hardware to reset the links. They do not reset until the agent restarts and is again able to issue commands to the underlying switching hardware. With unlucky timing and the extra time required for the agent to record its running state for analysis, the link remained active long enough for the peer switch to detect a lack of heartbeat messages while still seeing an active link, and to fail over using the more disruptive method.
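To make the two timeout paths easier to follow, here is a minimal sketch in Python of the decision the surviving switch makes. The timer values come from the behavior described above; the function, its arguments, and its return values are purely illustrative, not the vendor's implementation.

```python
# Illustrative sketch of the two MLAG failover paths described above.
# Timer values are taken from the post; everything else is invented.

HEARTBEAT_TIMEOUT = 5.0   # seconds without heartbeats from the peer
LINK_DOWN_TIMEOUT = 2.0   # seconds to wait for the peer links to recover

def choose_failover(seconds_since_heartbeat, seconds_links_down):
    """Decide how the surviving switch reacts when its MLAG peer goes quiet.

    seconds_links_down is None while the peer links are still up.
    """
    if seconds_since_heartbeat < HEARTBEAT_TIMEOUT:
        return "no action: peer is healthy"

    if seconds_links_down is not None and seconds_links_down >= LINK_DOWN_TIMEOUT:
        # The peer links went down and stayed down: the peer is assumed dead,
        # so the survivor performs a stateful takeover (no layer 2 changes).
        return "stateful takeover"

    # No heartbeats, but the links are still up: the peer may be alive in an
    # inconsistent state, so the survivor falls back to standalone operation.
    # This is the disruptive path that forces layer 2 reconvergence.
    return "standalone fallback"

# The incident: the agent was killed (no heartbeats) while the hardware kept
# the links up, so the disruptive path was taken.
print(choose_failover(seconds_since_heartbeat=6.0, seconds_links_down=None))
```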

When this happened it caused a great deal of churn within the network as all of our aggregated links had to be re-established, leader election for spanning-tree had to take place, and all of the links in the network had to go through a spanning-tree reconvergence. This effectively caused all traffic between access switches to be blocked for roughly a minute and a half.

Fileserver Impact

Our fileserver architecture consists of a number of active/passive fileserver pairs which use Pacemaker, Heartbeat and DRBD to manage high-availability. We use DRBD from the active node in each pair to transfer a copy of any data that changes on disk to the standby node in the pair. Heartbeat and Pacemaker work together to help manage this process and to failover in the event of problems on the active node.

With DRBD, it's important to make sure that the data volumes are only actively mounted on one node in the cluster. DRBD helps protect against having the data mounted on both nodes by making the receiving side of the connection read-only. In addition to this, we use a STONITH (Shoot The Other Node In The Head) process to shut power down to the active node before failing over to the standby. We want to be certain that we don't wind up in a "split-brain" situation where data is written to both nodes simultaneously since this could result in potentially unrecoverable data corruption.

When the network froze, many of our fileservers, which are intentionally located in different racks for redundancy, exceeded their heartbeat timeouts and decided that they needed to take control of the fileserver resources. They issued STONITH commands to their partner nodes and attempted to take control of resources; however, some of those commands were not delivered due to the compromised network. When the network recovered and the cluster messaging between nodes came back, a number of pairs were in a state where both nodes expected to be active for the same resource. This resulted in a race where the nodes terminated one another, and we wound up with both nodes stopped for a number of our fileserver pairs.
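Here is a toy sketch of that race, using a deliberately simplified model rather than our actual Pacemaker/Heartbeat configuration. It only illustrates why fencing commands issued during the freeze and carried out after the network recovers can leave both nodes powered off.

```python
# Simplified model of the STONITH race described above. Node names, timeouts,
# and the "in flight" queue are invented for illustration.

def simulate_pair(network_frozen_seconds, heartbeat_timeout=2.0):
    nodes = {"active": "up", "standby": "up"}
    in_flight = []

    # During the freeze each node misses the other's heartbeats and issues a
    # STONITH command to fence its partner so it can take over the resources.
    if network_frozen_seconds > heartbeat_timeout:
        in_flight.append("standby")   # active tries to power off standby
        in_flight.append("active")    # standby tries to power off active

    # Some commands are lost while the network is down; the ones that were
    # queued are delivered and carried out once connectivity returns.
    for target in in_flight:
        nodes[target] = "powered off"

    return nodes

print(simulate_pair(network_frozen_seconds=90))
# -> {'active': 'powered off', 'standby': 'powered off'}
```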

Once we discovered this had happened, we took a number of steps immediately:

  1. We put GitHub.com into maintenance mode.
  2. We paged the entire operations team to assist with the recovery.
  3. We downgraded both aggregation switches to the previous software version.
  4. We developed a plan to restore service.
  5. We monitored the network for roughly thirty minutes to ensure that it was stable before beginning recovery.

Recovery

When both nodes are stopped in this way, it's important that the node that was active before the failure is active again when brought back online, since it has the most up-to-date view of what the current state of the filesystem should be. In most cases it was straightforward for us to determine which node was the active node when the fileserver pair went down by reviewing our centralized log data. In some cases, though, the log information was inconclusive and we had to boot up one node in the pair without starting the fileserver resources, examine its local log files, and make a determination about which node should be active.
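As a rough illustration of that decision, here is a hypothetical sketch that picks the last-active node for a pair from centralized log entries. The log format, node names, and helper function are invented for the example; our actual tooling differs.

```python
# Hypothetical example: choose which node in a fileserver pair was active last,
# based on invented centralized log lines.

from datetime import datetime

def last_active_node(log_lines):
    """Return the node with the most recent 'became active' event,
    or None if the logs are inconclusive for this pair."""
    latest = {}
    for line in log_lines:
        # Example line: "2012-12-22T11:42:03 fs12b pacemaker: became active"
        timestamp, node, message = line.split(" ", 2)
        if "became active" in message:
            latest[node] = max(latest.get(node, datetime.min),
                               datetime.fromisoformat(timestamp))
    if not latest:
        return None  # inconclusive: fall back to inspecting local log files
    return max(latest, key=latest.get)

logs = [
    "2012-12-22T09:10:11 fs12a pacemaker: became active",
    "2012-12-22T11:42:03 fs12b pacemaker: became active",
]
print(last_active_node(logs))  # -> fs12b
```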

This recovery was a very time-consuming process and we made the decision to leave the site in maintenance mode until we had recovered every fileserver pair. That process took over five hours to complete because of how widespread the problem was; we had to restart a large percentage of the entire GitHub file storage infrastructure, validate that things were working as expected, and make sure that all of the pairs were properly replicating between themselves again. This process proceeded without incident, and we returned the site to service at 20:23 PST.

Where do we go from here?

  1. We worked closely with our network vendor to identify and understand the problems that led to the failure of MLAG to fail over in the way that we expected. While it behaved as designed, our vendor plans to revisit the respective timeouts so that more time is given for link failure to be detected to guard against this type of event.
  2. We are postponing any software upgrades to the aggregation network until we have a functional duplicate of our production environment in staging to test against. This work was already underway. In the meantime, we will continue to monitor for the MAC address learning problems that we discussed in our previous report and apply a workaround as necessary.
  3. From now on, we will place our fileservers' high-availability software into maintenance mode before we perform any network changes, no matter how minor, at the switching level. This allows the servers to continue functioning, but the cluster will not take any automated failover actions.
  4. The fact that the cluster communication between fileserver nodes relies on any network infrastructure has been a known problem for some time. We're actively working with our hosting provider to address this.
  5. We are reviewing all of our high availability configurations with fresh eyes to make sure that the failover behavior is appropriate.

Summary

I couldn't be more sorry about the downtime and the impact that downtime had on our customers. We always use problems like this as an opportunity to improve, and this will be no exception. Thank you for your continued support of GitHub; we are working hard and making significant investments to make sure we live up to the trust you've placed in us.

Scheduled Maintenance Windows

As our infrastructure continues to grow and evolve, it's sometimes necessary to perform system maintenance that may cause downtime. We have a number of projects queued up over the coming months to take our infrastructure to the next level, so we are announcing a scheduled maintenance window on Saturday mornings beginning at 0500 Pacific.

We do not intend to perform maintenance every Saturday, and even when we do, most of the work will not be disruptive to customers. We are using these windows only in cases where the tasks we're performing carry a higher than normal risk of impacting the site.

We will always update our status site before we begin and again when we're done. In cases where we expect there to be more than a few minutes of disruption we will also make an announcement on the GitHub Blog by the preceding Friday.

To get things started on the right foot, we will be performing an upgrade of the software on some of our network switches this Saturday during the new maintenance window. We do not expect this to cause any visible disruption.

The Octoverse in 2012

I am continually blown away by the staggering amount of work happening on GitHub. Every day, our users commit code, open and close issues, and make plans for their software to take over the world. We track all of this activity and make the public data available via our API.

Over half a million individual events happen every day on GitHub. Here's a look into the ever-expanding Octoverse in 2012.

Push It

2012 GitHub Activity

Since the beginning of the year, we've seen a doubling in activity, with pushes alone responsible for over 60% of the events in a given day. On a typical weekday, 10k people sign up for a GitHub account, and our users:

  • push 140GB of new data
  • create 25k repositories and 7k pull requests
  • push to 125k repositories

Best of all:

  • 10k people create their very first repository

We're Growing. Fast.

Looking over the past few years, the number of people using GitHub is growing at an incredible rate; there are now 2.8MM GitHub users, which represents 133% growth in 2012 alone. Even more impressive is how much those users are doing on GitHub. In that same time period, the overall number of repositories increased 171% to 4.6MM.

Year-over-year user and repository growth

Since software is changing the world, it shouldn't be surprising that it's developed by people from all corners of the globe. While the United States is the most active country on GitHub.com, it accounts for only 28% of our traffic.

The top 10 countries visiting GitHub.com are: United States, Germany, United Kingdom, China, Japan, France, India, Canada, Russia and Brazil. The top 10 cities are: London, San Francisco, New York, Paris, Moscow, Beijing, Berlin, Bangalore, Sydney and Toronto.

Notable OSS in 2012

Stars are a way to keep track of repositories that you find interesting. These projects, all created this year, attracted the most stargazers:

  1. FortAwesome/FontAwesome: The iconic font designed for use with Twitter Bootstrap
  2. textmate/textmate: TextMate is a graphical text editor for OS X 10.7+
  3. meteor/meteor: Meteor, an ultra-simple, database-everywhere, data-on-the-wire, pure-Javascript web framework
  4. saasbook/hw3_rottenpotatoes: A project used in a free Software as a Service course taught through BerkeleyX
  5. ivaynberg/select2: Select2 is a jQuery based replacement for select boxes
  6. jkbr/httpie: HTTPie is a CLI, cURL-like tool for humans
  7. maker/ratchet: Prototype iPhone apps with simple HTML, CSS, and JS components
  8. twitter/bower: A package manager for the web
  9. Kicksend/mailcheck: Email domain spelling suggester
  10. jmechner/Prince-of-Persia-Apple-II: A running-jumping-swordfighting game for the Apple II from 1985-89

It's better to work together than to work alone. By developing software on GitHub, you're making it easy for 2.8MM people to help you out. In the past year, these projects attracted the highest numbers of unique contributors:

  1. mxcl/homebrew: The missing package manager for OS X
  2. rails/rails: Ruby on Rails
  3. CyanogenMod/android_frameworks_base: Android base frameworks
  4. CocoaPods/Specs: CocoaPods (cocoapods.org) specifications
  5. symfony/symfony: The Symfony PHP framework
  6. zendframework/zf2: Zend Framework
  7. openstack/nova: OpenStack Compute (Nova)
  8. saltstack/salt: Central system and configuration manager for infrastructure
  9. TrinityCore/TrinityCore: TrinityCore Open Source MMO Framework
  10. github/hubot-scripts: optional scripts for hubot, a customizable, kegerator-powered life embetterment robot

:heart::boom::camel:

Across commit messages, issues, pull requests, and comments, emoji is a vital part of GitHub's daily workflow. Life, and our products, just wouldn't be the same without it. When we looked at the popular emoji used on weekdays (green) versus those same emoji on weekends (blue), we saw that the :fire::fire::fire: is spreading.

During the week, the business of :ship:ing gets done, with :shipit:, :sparkles:, :-1:, and :+1: taking the lead:

Weekday and weekend emoji

The most popular emoji on the weekend paint a different picture; time for a :cocktail: under a :palm_tree::

Weekend emoji

Thank you!

We believe GitHub is the best place to build software, but it wouldn't be the same without you. Thank you for building, sharing and shipping. Thank you for proving that it's better to work together than to work alone.

From the GitHub family to you, thanks. Next year is going to be even more amazing.

Issue autocompletion

Today we're adding autocompletion for Issues and Pull Requests, similar to the @mention and emoji autocompletion you already know and love. Type # to see Issue and Pull Request suggestions, then type a number or any text to filter the list.

issue autocompletion

Customize your receipt details

Does your accountant or tax collector require specific details on your receipts? You can now add information like your legal company name, billing address, and tax ID number to your receipts. Just visit your billing page and click the magic button.

optional billing information

New Homepage

Today, we're launching a refreshed homepage that reflects the core idea of GitHub: It's better to work together than to work alone. Thank you to our awesome community for making this true.

New GitHub homepage

Goodbye, Uploads

In addition to providing downloadable source code archives, GitHub previously allowed you to upload files (separate from the versioned files in the repository) and make them available for download in the Downloads Tab. Supporting these types of uploads was a source of great confusion and pain – they were too similar to the files in a Git repository. As part of our ongoing effort to keep GitHub focused on building software, we are deprecating the Downloads Tab.

  • The ability to upload new files via the web site is disabled today.

  • Existing links to previously uploaded files will continue to work for the foreseeable future.

  • Repositories that already have uploads will continue to list their downloads for the next 90 days (tack /downloads onto the end of any repository URL to see them).

  • The Downloads API is officially deprecated and will be disabled in 90 days.

Update (December 5, 2013):

  • The Downloads API is officially deprecated and will be disabled at a future date.

Onward

We encourage you to continue distributing your code through downloadable source code archives. However, some projects need to host and distribute large binary files in addition to source archives. If this applies to you, we recommend using one of the many fantastic services that exist exactly for this purpose such as Amazon S3 / Amazon CloudFront or SourceForge. Check out our help article on distributing large binaries.

Check out Releases!

Welcome to a New Gist

At GitHub we love using Gist to share code. Whether it's a simple snippet or a full app, Gist is a great way to get your point across. And the fact that every Gist is a fully forkable git repository makes it even better.

Today we're excited to share the next generation of Gist. Rewritten from scratch using better libraries and our styleguide, the new Gist is part of our plan to make sharing code easier than ever.

What's new?

Everything, because we rewrote it. But here are some of our favorite new features.

Discover Gists

The new Gist makes it easier to find what you're looking for, whether you're browsing new code on Discover or searching by language with our new Gist Search.

Edit like an Ace

Gist, like GitHub, is now powered by the Ace editor. Syntax-highlighted, indentation-aware editing is now at your fingertips.

(Try dragging a file of code from your desktop to the editor for even more fun.)

History is written by the Gisters

You can now view the full history of every Gist, complete with diffs. Never be blamed for sloppy coding again.

Forking

The new Gist tells you which forks have activity, making it easier to find interesting changes from coworkers or complete strangers.

And more…

There's more new stuff but you'll have to poke around to find it.

We hope you enjoy the new Gist as much as we do! :heart_decoration:

GitHub system status API

Today we're releasing a new system status API to serve up status info in a delicious JSON flavor.
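As a quick illustration, here's a minimal Python sketch of polling the API. The endpoint path and response fields shown here are assumptions for the example; see the status site itself for the actual routes and formats.

```python
# Minimal sketch of consuming the status API. The URL and field names below
# are assumed for illustration; consult the status site for the real ones.

import json
import urllib.request

def github_status(url="https://status.github.com/api/status.json"):
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)

status = github_status()
print(status.get("status"), status.get("last_updated"))
```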


Issue attachments

They say a picture is worth a thousand words. Reducing words is one of our top priorities at GitHub HQ, so we're releasing image attachments for issues today.

Update: Oh, and if you're using Chrome, you can now paste an image into the comment box to upload it!

New Enterprise.GitHub.com

Today, we’re launching a completely redesigned homepage for GitHub Enterprise, the private, installable version of GitHub running on your servers. Beyond the visual changes, we’ve tightened up the copy to better communicate what Enterprise is and how it works. Current Enterprise users will see a redesigned dashboard and more when they sign in.

Check out the new GitHub Enterprise, still the best way to build and ship software on your servers.

:metal:

Creating files on GitHub

Starting today, you can create new files directly on GitHub in any of your repositories. You’ll now see a "New File" icon next to the breadcrumb whenever you’re viewing a folder’s tree listing:

Creating a new file

Clicking this icon opens a new file editor right in your browser:

New file editor

If you try to create a new file in a repository that you don’t have access to, we will even fork the project for you and help you send a pull request to the original repository with your new file.

This means you can now easily create README, LICENSE, and .gitignore files, or add other helpful documentation such as contributing guidelines without leaving the website—just use the links provided!

Create common files on GitHub

For .gitignore files, you can also select from our list of common templates to use as a starting point for your file:

Create .gitignore files using our templates

ProTip™: You can pre-fill the filename field using just the URL. Appending ?filename=yournewfile.txt to the end of the URL will pre-fill the filename field with yournewfile.txt.

Enjoy!

New Status Site

We just launched a new status.github.com.

Downtime is important

Even if you're shooting for five nines — 99.999% uptime — you're still going to have to account for about five minutes of downtime a year. As a service, what happens during those five minutes can make all the difference.
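For reference, the arithmetic behind that number is a one-liner:

```python
# 99.999% availability still leaves a little over five minutes of downtime
# per year.

uptime = 0.99999
minutes_per_year = 365.25 * 24 * 60
downtime_minutes = (1 - uptime) * minutes_per_year
print(round(downtime_minutes, 2))  # -> 5.26
```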

Sharing our tools

We track a huge amount of data internally surrounding our reliability and performance, and we wanted to make that accessible to you, too.

All of the graphs on our status page are generated from our own internal Graphite service. Data is pulled from Graphite and rendered using the excellent D3 JavaScript library.
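As a rough sketch of that pipeline, here is how time-series data can be pulled from a Graphite installation over its standard /render endpoint before being handed to D3 on the client. The host and metric name below are hypothetical, not our internal ones.

```python
# Sketch of fetching graph data from Graphite's /render endpoint as JSON.
# The host and metric names are placeholders.

import json
import urllib.request
from urllib.parse import urlencode

def fetch_series(metric, hours=24, host="https://graphite.example.com"):
    query = urlencode({"target": metric, "from": "-%dh" % hours, "format": "json"})
    with urllib.request.urlopen("%s/render?%s" % (host, query), timeout=10) as resp:
        # Graphite returns a list of {"target": ..., "datapoints": [[value, ts], ...]}
        return json.load(resp)

series = fetch_series("stats.timers.app.response_time.mean")
for s in series:
    print(s["target"], len(s["datapoints"]), "points")
```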

Adaptive UI

We rely heavily on our mobile devices these days, and we want everything we make to be beautiful and usable on as many devices as we can. We made sure the new status website would be no exception by giving it an adaptive user interface with CSS media queries, so that even on the go you'll be able to see at a glance what's going on with GitHub.


In a perfect world you'll never have to use the status site, but if you do, we hope this will give you more visibility. We'll also be automatically posting all updates to @githubstatus if you'd like notices to appear in your Twitter feed.

Network problems last Friday

On Friday, November 30th, GitHub had a rough day. We experienced 18 minutes of complete unavailability along with sporadic bursts of slow responses and intermittent errors for the entire day. I'm very sorry this happened and I want to take some time to explain what happened, how we responded, and what we're doing to help prevent a similar problem in the future.

Note: I initially forgot to mention that we had a single fileserver pair offline for a large part of the day, affecting a small percentage of repositories. This was a side effect of the network problems and their impact on the high-availability clustering between the fileserver nodes. My apologies for missing this in the initial writeup.

Background

To understand the problem on Friday, you first need to understand how our network is constructed. GitHub has grown incredibly quickly over the past few years, and a consequence of that growth is that our infrastructure has, at times, struggled to keep up.

Most recently, we've been seeing some significant problems with network performance throughout our network. Actions that should respond in under a millisecond were taking several times that long with occasional spikes to hundreds of times that long. Services that we've wanted to roll out have been blocked by scalability concerns and we've had a number of brief outages that have been the result of the network straining beyond the breaking point.

The most pressing problem was with the way our network switches were interconnected. Conceptually, each of our switches was connected to the switches in the neighboring racks. Any data that had to travel from a server on one end of the network to a server on the other end had to pass through all of the switches in between. This design often put a very large strain on the switches in the middle of the chain, and those links became saturated, slowing down any data that had to pass through them.

To solve this problem, we purchased additional switches to build what's called an aggregation network, which is more of a tree structure. Network switches at the top of the tree (aggregation switches) are directly connected to switches in each server cabinet (access switches). This topology ensures that data never has to move through more than three tiers: the switch in the originating cabinet, the aggregation switches, and the switch in the destination cabinet. This allows the links between switches to be used much more efficiently.

What went wrong?

Last week the new aggregation switches finally arrived and were installed in our datacenter. Due to the lack of available ports in our access switches, we needed to disconnect access switches, change the configuration to support the aggregation design, and then reconnect them to the aggregation switches. Fortunately, we've built our network with redundant switches in each server cabinet and each server is connected to both of these switches. We generally refer to these as "A" and "B" switches.

Our plan was to perform this operation on the B switches and observe the behavior before transitioning to the A switches and completing the migration. On Thursday, November 29th we made these changes on the B devices and despite a few small hiccups the process went essentially according to plan. We were initially encouraged by the data we were collecting and planned to make similar changes to the A switches the following morning.

On Friday morning, we began making the changes to bring the A switches into the new network. We moved one device at a time and the maintenance proceeded exactly as planned until we reached the final switch. As we connected the final A switch, we lost connectivity with the B switch in the same cabinet. Investigating further, we discovered a misconfiguration on this pair of switches that caused what's called a "bridge loop" in the network. The switches are specifically configured to detect this sort of problem and to protect the network by disabling links where they detect an issue, and that's what happened in this case.

We were able to quickly resolve the initial problem and return the affected B switch to service, completing the migration. Unfortunately, we were not seeing the performance levels we expected. As we dug deeper we saw that all of the connections between the access switches and the aggregation switches were completely saturated. We initially diagnosed this as a "broadcast storm" which is one possible consequence of a bridge loop that goes undetected.

We spent most of the day auditing our switch configurations again, going through every port trying to locate what we believed to be a loop. As part of that process we decided to disconnect individual links between the access and aggregation switches and observe behavior to see if we could narrow the scope of the problem further. When we did this, we discovered another problem: The moment we disconnected one of the access/aggregation links in a redundant pair, the access switch would disable its redundant link as well. This was unexpected and meant that we did not have the ability to withstand a failure of one of our aggregation switches.

We escalated this problem to our switch vendor and worked with them to identify a misconfiguration. We had a setting that was intended to detect partial link failure between two links. Essentially, it monitors each link to ensure that both the transmit and receive functions are working correctly. Unfortunately, this feature is not supported between the aggregation and access switch models. When we shut down an individual link, this watchdog process would erroneously trigger and force all the links to be disabled. The 18-minute period of hard downtime we had was during this troubleshooting process, when we lost connectivity to multiple switches simultaneously.

Once we removed the misconfigured setting on our access switches we were able to continue testing links and our failover functioned as expected. We were able to remove any single switch at either the aggregation or access layer without impacting the underlying servers. This allowed us to continue moving through individual links in the hunt for what we still believed was a loop induced broadcast storm.

After a couple more hours of troubleshooting we were unable to track down any problems with the configuration and again escalated to our network vendor. They immediately began troubleshooting the problem with us and escalated it to their highest severity level. We spent five hours Friday night troubleshooting the problem and eventually discovered that a bug in the aggregation switches was to blame.

When a network switch receives an ethernet frame, it inspects the contents of that frame to determine the destination MAC address. It then looks up the MAC address in an internal MAC address table to determine which port the destination device is connected to. If it finds a match for the MAC address in its table, it forwards the frame to that port. If, however, it does not have the destination MAC address in its table it is forced to "flood" that frame to all of its ports with the exception of the port that it was received from.

In the course of our troubleshooting we discovered that our aggregation switches were missing a number of MAC addresses from their tables, and thus were flooding any traffic that was sent to those devices across all of their ports. Because of these missing addresses, a large percentage of our traffic was being sent to every access switch and not just the switch that the destination device was connected to. During normal operation, the switch should "learn" which port each MAC address is connected through as it processes traffic. For some reason, our switches were unable to learn a significant percentage of our MAC addresses, and this aggregate traffic was enough to saturate all of the links between the access and aggregation switches, causing the poor performance we saw throughout the day.
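A toy model of that learning and flooding behavior, with invented port and MAC address values, may help illustrate why missing table entries turn unicast traffic into a flood:

```python
# Toy model of the switching behavior described above: a switch learns which
# port each source MAC address lives on, forwards known destinations out a
# single port, and floods frames whose destination it has not yet learned.

class Switch:
    def __init__(self, ports):
        self.ports = ports
        self.mac_table = {}          # MAC address -> port

    def receive(self, frame, ingress_port):
        # Learn (or refresh) the sender's location. The bug on our aggregation
        # switches meant many of these entries were never populated.
        self.mac_table[frame["src"]] = ingress_port

        egress = self.mac_table.get(frame["dst"])
        if egress is not None:
            return [egress]          # known destination: forward out one port
        # Unknown destination: flood out every port except the ingress port.
        return [p for p in self.ports if p != ingress_port]

sw = Switch(ports=["access1", "access2", "access3"])
print(sw.receive({"src": "aa:aa", "dst": "bb:bb"}, "access1"))  # flood
sw.receive({"src": "bb:bb", "dst": "aa:aa"}, "access2")          # learn bb:bb
print(sw.receive({"src": "aa:aa", "dst": "bb:bb"}, "access1"))  # -> ['access2']
```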

We worked with the vendor until late on Friday night to formulate a mitigation plan and to collect data for their engineering team to review. Once we had a mitigation plan, we scheduled a network maintenance window on Saturday morning at 0600 Pacific to attempt to work around the problem. The workaround involved restarting some core processes on the aggregation switches in order to attempt to allow them to learn MAC addresses again. This workaround was successful and traffic and performance returned to normal levels.

Where do we go from here?

  1. We have worked with our network vendor to provide diagnostic information which led them to discover the root cause for the MAC learning issues. We expect a final fix for this issue within the next week or so and will be deploying a software update to our switches at that time. In the meantime, we are closely monitoring our aggregation to access layer capacity and have a workaround process ready if the problem comes up again.
  2. We designed this maintenance so that it would have no impact on customers, but we clearly failed. With this in mind, we are planning to invest in a duplicate of our network stack from our routers all the way through our access layer switches to be used in a staging environment. This will allow us to more fully test these kinds of changes in the future, and hopefully detect bugs like the one that caused the problems on Friday.
  3. We are working on adding additional automated monitoring to our network to alert us sooner if we have similar issues.
  4. We need to be more mindful of tunnel-vision during incident response. We fixated for a very long time on the idea of a bridge loop and it blinded us to other possible causes. We hope to begin doing more scheduled incident response exercises in the coming months and will build scenarios that reinforce this.
  5. The very positive experience we had with our network vendor's support staff has caused us to change the way we think about engaging support. In the future, we will contact their support team at the first sign of trouble in the network.

Summary

We know you depend on GitHub and we're going to continue to work hard to live up to the trust you place in us. Incidents like the one we experienced on Friday aren't fun for anyone, but we always strive to use them as a learning opportunity and a way to improve our craft. We have many infrastructure improvements planned for the coming year and the lessons we learned from this outage will only help us as we plan them.

Finally, I'd like to personally thank the entire GitHub community for your patience and kind words while we were working through these problems on Friday.

Tidying up after Pull Requests

At GitHub, we love to use Pull Requests all day, every day. The only trouble is that we end up with a lot of defunct branches after Pull Requests have been merged or closed. From time to time, one of us would clear out these branches with a script, but we thought it would be better to take care of this step as part of our regular workflow on GitHub.com.

Starting today, after a Pull Request has been merged, you’ll see a button to delete the lingering branch:



If the Pull Request was closed without being merged, the button will look a little different to warn you about deleting unmerged commits:



Of course, you can only delete branches in repositories that you have push access to.

Enjoy your tidy repositories!