This repository has been archived by the owner on Aug 27, 2019. It is now read-only.

cloud-gov/cg-postmortems

cloud.gov post mortems [DEPRECATED]

We no longer use this repository for postmortem work. Instead, we have team guidance in the ops docs and postmortem notes in the team Google Drive folder.

If you're experiencing a problem with cloud.gov or want to discuss an ongoing incident, check the cloud.gov StatusPage or email cloud-gov-support@gsa.gov. This repository and wiki are only for post mortems written after an incident has been closed.

We hold a post mortem as soon as possible after a cloud.gov service disruption or other incident. We use a broad definition of incident; ITIL says "Failure of a configuration item that has not yet affected service is also an incident — for example, failure of one disk from a mirror set. The ITIL incident management process ensures that normal service operation is restored as quickly as possible and the business impact is minimized."

We keep our post mortems in the wiki attached to this repo.

For more information on post mortems, check out:

How we run post mortems - the short version

Before the post mortem, we put together a timeline of the incident on the wiki, beginning when the incident was announced in Slack and ending when we declared the incident over. Everybody is welcome to add their observations to the timeline, including which actions were taken and when, the effects observed, and their understanding of the events.

  1. The facilitator starts by reading the retrospective prime directive.
  2. We review the timeline and add anything we have missed.
  3. We analyze the factors that contributed to the incident.
  4. We propose, discuss, and prioritize remediation steps that reduce the likelihood of future incidents, improve detection and response times for future incidents, and improve our incident handling processes and training. We also decide how to validate and test these remediation steps.

We add the work that comes out of the post mortem to our backlog. We then schedule a meeting two months after the incident to review our progress on that work.

Protecting sensitive information

In this public repository, we exclude any information that may be sensitive, such as information related to exploitable security flaws/vulnerabilities, sensitive infrastructure details, or PII. For details, see the 18F Open Source Policy guidelines for protecting sensitive information. If we want to reference sensitive information as part of a postmortem, we can make an access-controlled GSA Google Doc and link to it from this repository.

About

How we ensure we are always learning
