A checklist for handling incidents such as site downtime.
Until the problem is resolved
- Assign an incident lead – a single person who is responsible for this checklist. They should delegate tasks explicitly.
- If there are remote workers, "get everyone in the same room" by setting up a video and audio link, e.g. Zoom.
- The incident lead should assign a communicator. The communicator ensures that we inform every affected party. This may be in person, by chat, by phone, via Auctionet system messages, etc.
- Consider communicating:
  - When we first notice the problems.
  - When there is some workaround.
  - When the problems are resolved (from the affected party's standpoint).
- The incident lead should assign a team of deep delvers to dig into the underlying issue.
- The incident lead should assign a team of quickfixers to see what we can do right now to minimise the impact and unblock affected parties.
- The incident lead may want to create a Trello card to keep track of things for this incident.
Anyone not tapped by the incident lead is free to keep working on other things. It is the lead's responsibility to call for all hands if necessary.
Not too long after the problem is resolved, we want a "post mortem" meeting.
The goal of the meeting is to come up with any learnings and actions that let us do better work in future.
- CTO and product owner should attend so we can decide what resources to allocate.
- The discussion should be facilitated (have someone managing it) to keep us on track.
- Timeline: What happened? What did we do? What happened then? Where did we leave things?
- How did this affect end users? Auction houses, buyers, sellers, support, finance, …. What can we then improve?
- Reflect on the post mortem. Can we do post mortems better? Update this document with any learnings.