A list of post mortems!

Knight Capital. A combination of conflicting deployed versions and the reuse of a previously used flag bit caused a $460M loss.
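
As a rough illustration of the flag-reuse half of that failure (hypothetical names and flow, not Knight's actual code), the C sketch below shows how a bit repurposed in a new binary can re-activate dead code on any server still running the old one:

```c
/* Hypothetical sketch, not Knight's actual code: the same bit means
 * "use the new order router" to the new binary and "activate retired
 * Power Peg test logic" to an old binary that never got the deploy. */
#include <stdio.h>

#define FLAG_POWER_PEG  (1u << 3)  /* meaning in the old binary (retired logic) */
#define FLAG_NEW_ROUTER (1u << 3)  /* same bit, new meaning in the new binary   */

static void old_binary_handle(unsigned flags) {
    if (flags & FLAG_POWER_PEG)
        printf("old binary: executing retired Power Peg logic\n");
}

static void new_binary_handle(unsigned flags) {
    if (flags & FLAG_NEW_ROUTER)
        printf("new binary: routing order through new component\n");
}

int main(void) {
    unsigned order_flags = FLAG_NEW_ROUTER;  /* set by the new deployment */

    new_binary_handle(order_flags);  /* upgraded servers: intended behavior  */
    old_binary_handle(order_flags);  /* one stale server: dead code wakes up */
    return 0;
}
```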

Etsy. Sending multicast traffic without properly configuring switches caused an Etsy global outage.

Healthcare.gov.

Sweden. Use of different rulers by builders caused a lopsided ship, which sank in 1628.

NASA. Use of different units of measurement (metric vs. English) caused Mars Climate Orbiter to fail. This is basically the same issue Sweden ran into in 1628 with its ship.
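
A minimal C sketch of how that kind of unit mismatch slips through (hypothetical function names, not the actual spacecraft or ground software): when components exchange bare floating-point numbers, nothing forces agreement on units, and pound-force seconds versus newton-seconds differ by a factor of about 4.45.

```c
/* Hypothetical sketch, not the actual MCO software: a bare double carries no
 * unit, so a producer emitting pound-force seconds and a consumer expecting
 * newton-seconds silently disagree by a factor of ~4.45. */
#include <stdio.h>

#define LBF_S_TO_N_S 4.448222  /* 1 pound-force second expressed in newton seconds */

/* Producer reports thruster impulse in lbf*s (as the ground software did). */
static double thruster_impulse_lbf_s(void) { return 100.0; }

int main(void) {
    double reported = thruster_impulse_lbf_s();

    double assumed_n_s = reported;                 /* consumer: treats it as N*s */
    double correct_n_s = reported * LBF_S_TO_N_S;  /* what it should have used   */

    printf("assumed %.1f N*s, correct %.1f N*s (off by a factor of %.2f)\n",
           assumed_n_s, correct_n_s, correct_n_s / assumed_n_s);
    return 0;
}
```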

Stack Overflow. A bad firewall config blocked stackexchange/stackoverflow.

Microsoft. A bad config took down Azure storage.

Cloudflare. A bad config (router rule) caused all of their edge routers to crash, taking down all of Cloudflare.

Google. A bad config (autogenerated) took down most Google services.

Google. A bad config caused a quota service to fail, which caused multiple services to fail (including Gmail).

Facebook. A bad config took down both Facebook and Instagram.

Valve. Although there's no official postmortem, it looks like a bad BGP config severed Valve's connection to Level 3, Telia, and Abovenet/Zayo, which resulted in a global Steam outage.

GPS/GLONASS. A bad update that caused incorrect orbital mechanics calculations caused GPS satellites that use GLONASS to broadcast incorrect positions for 10 hours. The bug was noticed and rolled back almost immediately due to (?), but this didn't fix the issue.

Gitlab. After the primary locked up and was restarted, it was brought back up with the wrong filesystem, causing a global outage.

Heroku. Having a system that requires scheduled manual updates inevitably resulted in an error which caused US customers to be unable to scale, stop or restart dynos, or route HTTP traffic, and also prevented all customers from being able to deploy.

Bitly. Hosted source code repo contained credentials granting access to bitly backups, including hashed passwords.

Sun/Oracle. Sun famously didn't include ECC in a couple of generations of server parts, which resulted in data corruption and crashes. Following Sun's typical MO, they made customers who reported the bug sign an NDA before explaining the issue.

Kickstarter. Primary DB became inconsistent with all replicas, which wasn't detected until a query failed. This was caused by a MySQL bug which sometimes caused ORDER BY to be ignored.

AppNexus. A double free revealed by a database update caused all "impression bus" servers to crash simultaneously. This wasn't caught in staging and made it into production because a time delay is required to trigger the bug, and the staging period didn't have a built-in delay.
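
For context on the bug class (this is a generic C sketch, not AppNexus's actual code): freeing the same heap block twice corrupts allocator state and typically crashes the process, and a common mitigation is to null the pointer after the first free so a later free is a defined no-op.

```c
#include <stdlib.h>

int main(void) {
    char *buf = malloc(64);   /* e.g. a per-impression buffer */
    if (!buf) return 1;

    /* Cleanup path A: done with the buffer. */
    free(buf);
    buf = NULL;   /* without this, path B below is a double free and can crash */

    /* Cleanup path B, reached later (e.g. only after a time delay, which is
     * why this class of bug can escape a short staging period): free(NULL)
     * is a defined no-op, so this is safe. */
    free(buf);
    return 0;
}
```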

Google. Checking the vendor string instead of feature flags renders NaCl unusable on otherwise compatible non-mainstream hardware platforms.
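
A small C sketch of the distinction on x86, using GCC/Clang's <cpuid.h> (not NaCl's actual code): matching the vendor string rejects compatible CPUs from other vendors, while checking the relevant CPUID feature bit works on any CPU that implements the feature.

```c
/* Sketch: vendor-string check vs. feature-bit check for SSE2 on x86. */
#include <cpuid.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    unsigned eax, ebx, ecx, edx;
    char vendor[13] = {0};

    /* Leaf 0: vendor string, e.g. "GenuineIntel" or "AuthenticAMD". */
    __get_cpuid(0, &eax, &ebx, &ecx, &edx);
    memcpy(vendor + 0, &ebx, 4);
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);

    /* Fragile: only recognizes one vendor, even though others support SSE2. */
    int sse2_by_vendor = (strcmp(vendor, "GenuineIntel") == 0);

    /* Robust: leaf 1, EDX bit 26 is the SSE2 feature flag. */
    __get_cpuid(1, &eax, &ebx, &ecx, &edx);
    int sse2_by_flag = (edx >> 26) & 1;

    printf("vendor=%s  sse2_by_vendor=%d  sse2_by_flag=%d\n",
           vendor, sse2_by_vendor, sse2_by_flag);
    return 0;
}
```

The same principle applies off x86: test for the capability you need rather than the platform you expect.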

Etsy. First, a deploy that was supposed to be a small bugfix deploy also caused live databases to get upgraded on running production machines. To make sure that this didn't cause any corruption, Etsy stopped serving traffic to run integrity checks. Second, an overflow in ids (signed 32-bit ints) caused some database operations to fail. Etsy didn't trust that this wouldn't result in data corruption and took down the site while the upgrade got pushed.
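
The overflow half of this is plain arithmetic: a signed 32-bit id tops out at 2,147,483,647, so the next insert can't be represented. A minimal C sketch (not Etsy's schema; it uses 64-bit arithmetic for the check to avoid relying on undefined signed overflow):

```c
/* Illustration of a signed 32-bit id column hitting its ceiling. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

static int next_id_fits(int64_t current_max_id) {
    return current_max_id + 1 <= INT32_MAX;
}

int main(void) {
    int64_t almost_full = (int64_t)INT32_MAX - 1;
    int64_t full        = INT32_MAX;

    printf("max id %" PRId64 ": next insert ok? %d\n", almost_full, next_id_fits(almost_full));
    printf("max id %" PRId64 ": next insert ok? %d\n", full,        next_id_fits(full));
    return 0;
}
```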

Dropbox. This postmortem is pretty thin and I'm not sure what happened. It sounds like, maybe, a scheduled OS upgrade somehow caused some machines to get wiped out, which took out some databases.

Joyent. Operations on Manta were blocked because a lock couldn't be obtained on their PostgreSQL metadata servers. This was due to a combination of PostgreSQL's transaction wraparound maintenance taking a lock on something, and a Joyent query that unnecessarily tried to take a global lock.

CircleCI. A GitHub outage and recovery caused an unexpectedly large incoming load. For reasons that aren't specified, a large load causes CircleCI's queue system to slow down, in this case to handling one transaction per minute.

GoCardless. All queries on a critical PostgreSQL table were blocked by the combination of an extremely fast database migration and a long-running read query, causing 15 seconds of downtime.

CCP Games. A problematic logging channel caused cluster nodes to die off during the cluster start sequence after rolling out a new game patch, resulting in a day's worth of troubleshooting and extended downtime.

Amazon. Major Amazon EC2/RDS outage in US East Region, April 21 - 24, 2011. Human error during a routine networking upgrade led to a resource crunch, exacerbated by software bugs, that ultimately resulted in an outage across all US East Availability Zones, affecting many popular websites, as well as a loss of 0.07% of volumes.

Netflix. Netflix's extensive preparations enable them to gracefully handle a degradation of Amazon EBS service.

Medium. Due to a series of unfortunate events, Polish users were unable to use their "Ś" key on Medium.

Unfortunately, most of the interesting post-mortems I know about are locked inside confidential pages at Google and Microsoft. Please add more links if you know of any interesting public post mortems! This is a pretty good resource; other links to collections of post mortems are also appreciated.

Contributors

  • Dan Luu
  • Julia Hansbrough
  • Nat Welch
  • James Graham
  • Grey Baker
  • Amber Yust
  • Connor Shea
  • Matt Day
