
First stab at an incident response document

henrik committed Sep 28, 2016
1 parent a7c2f97 commit 488ec1d3e51dbe0dea45e149d668b529cb86d0d0
Showing with 42 additions and 0 deletions.
  1. +1 −0 README.md
  2. +41 −0 incidents/README.md
@@ -26,3 +26,4 @@ For reference and discussion.
* [Attending a conference](/attending_a_conference)
* [Holding a demo](/holding_a_demo)
* [Starting a project](/starting_a_project)
+* [Incident response](/incidents)
@@ -0,0 +1,41 @@
+![Barsoom](http://barsoom.se/barsoom.png)
+
+# Incident response
+
+A checklist for what to do for incidents such as site downtime.
+
+## Checklist
+
+### Until the problem is resolved
+
+- [ ] Assign an **incident lead** – a single person who is responsible for this checklist. They should delegate tasks explicitly.
+- [ ] If there are remote workers, **"get everyone in the same room"** by setting up a video and audio link, e.g. [Zoom](https://zoom.us/).
+- [ ] The incident lead should assign a **communicator**. The communicator ensures that we inform support, auction houses, customers and any other affected parties. This may be done in person, by chat, by phone, via Auctionet system messages, etc.
+  - [ ] Communicate when the problem starts.
+  - [ ] Communicate when a workaround is available.
+  - [ ] Communicate when the problem is resolved (from the affected party's standpoint).
+- [ ] The incident lead should assign a team of **deep delvers** to dig into the underlying issue.
+- [ ] The incident lead should assign a team of **quickfixers** to see what we can do right now to **minimise the impact** and **unblock affected parties**.
+- [ ] The incident lead may want to create a Trello card to keep track of things for this incident.
+
+Anyone not tapped by the incident lead is free to keep working on other things. It is the lead's responsibility to call for all hands if necessary.
+
+### Post mortem
+
+Soon after the problem is resolved, we want to hold a "post mortem" meeting.
+
+The goal of the meeting is to come up with any learnings and actions that let us do better work in future.
+
+- [ ] The CTO and product owner should attend so we can decide what resources to allocate.
+- [ ] The discussion should be facilitated (have someone managing it) to keep us on track.
+
+#### Meeting agenda
+
+- [ ] Timeline: What happened? What did we do? What happened then? Where did we leave things?
+- [ ] How did this affect end users (auction houses, buyers, sellers, support, finance, …)?
+- [ ] Reflect on the post mortem. Can we do post mortems better? Update this document with any learnings.
+
+
+## Inspiration
+
+* [SVTi incident routines](http://svti.svt.se/2016/04/tre-tekniker-och-en-bebis/) (if link is broken, try [Google's cache](http://webcache.googleusercontent.com/search?q=cache%3Asvti.svt.se%2F2016%2F04%2Ftre-tekniker-och-en-bebis%2F&oq=cache%3Asvti.svt.se%2F2016%2F04%2Ftre-tekniker-och-en-bebis%2F&aqs=chrome..69i57j69i58.790j0j4&sourceid=chrome&ie=UTF-8))
