```{=latex}
\usepackage{hyperref}
\usepackage{graphicx}
\usepackage{listings}
\usepackage{textcomp}
\usepackage{fancyvrb}

\newcommand{\passthrough}[1]{\lstset{mathescape=false}#1\lstset{mathescape=true}}
\newcommand{\tightlist}{}
```

```{=latex}
\title{Getting Started with Incident Retrospectives}
\author{Moshe Zadka -- https://cobordism.com}
\date{}

\begin{document}
\begin{titlepage}
\maketitle
\end{titlepage}

\frame{\titlepage}
```

```{=latex}
\begin{frame}
\frametitle{Acknowledgement of Country}

Belmont (in San Francisco Bay Area Peninsula)

Ancestral homeland of the Ramaytush Ohlone people

\end{frame}
```

I live in Belmont,
in the San Francisco Bay Area Peninsula.
I wish to acknowledge it as the
ancestral homeland
of the
Ramaytush Ohlone people.

## The Basics

```{=latex}
\begin{frame}
\frametitle{Basic Terms Defined}

Getting on the same page

\end{frame}
```

Before talking about how to do incident retrospectives,
we need to make sure we are all on the same page.
Let's start by defining some terms.

### Incident

```{=latex}
\begin{frame}
\frametitle{Incident}

Something happened that was not meant to happen\pause

and it would be bad "enough" if it happened again

\end{frame}
```

An incident is when something happens
that should not have happened.
To be an incident it should be the case that it would be
*bad*
if it happened again.

### Incident Retrospective

```{=latex}
\begin{frame}
\frametitle{Incident Retrospective}

Analysis\pause

Joint review\pause

Agreement on next steps


\end{frame}
```

An incident retrospective is an analysis of the incident.
The analysis should be
*consensual*:
the team must agree the analysis is correct.
The team must also agree on what the next steps are.

### Goals of Incident Retrospective

```{=latex}
\begin{frame}
\frametitle{Goals of Incident Retrospective}

\pause
Understanding

\pause
Doing better

\end{frame}
```

An incident retrospective is a lot of effort.
Someone has to prepare a draft write-up,
and the team needs to comment on it.

For the retrospective to be worthwhile,
it has to give commensurate benefit.
The benefits of a retrospective include
*understanding*
the system that failed better,
and how it can fail.

The most important benefit is that retrospectives are a guide to
*doing better*.
What is doing better?

One way of doing better is to avoid such incidents altogether.
This is not the only way!

Better incidents include
faster remediation,
less impact on end users,
less effort for the team to remediate,
and other aspects.
When discussing how to do better,
it is important to understand in what sense it would be better.

### Non-Goals of Incident Retrospective

```{=latex}
\begin{frame}
\frametitle{Non-Goals of Incident Retrospective}

\pause
Deciding who's responsible

\pause
Punishing
\end{frame}
```

It is equally important to note what the non-goals are.
Incident retrospectives are a process of improvement,
not a process of discipline.

They are not used for deciding which person is responsible.
They are not used for deciding on a punishment.

### Timeline

```{=latex}
\begin{frame}
\frametitle{Timeline}

\pause
Sequence of events

\pause
Minute/second resolution

\pause
Consistent time zone

\pause
Increasing

\pause
Thorough

\end{frame}
```

Part of a retrospective is a timeline.
Timelines,
in general,
are in
"human resolution":
minutes,
or at best seconds.

A timeline should be in a consistent time zone even,
or especially,
in the face of a globally distributed team.
It should be in the order of time --
the stamps should always increase.
A timeline should also be
*thorough*:
it includes everything that was involved in the incident.
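
The timeline rules above (consistent time zone, increasing stamps) can be checked mechanically.
The following is a minimal sketch, not a prescribed tool:
the `TimelineEntry` class, the helper names, and the sample notes are all made up for illustration,
and UTC is assumed as the single time zone.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TimelineEntry:
    when: datetime  # must carry a time zone
    what: str


def normalize_timeline(entries):
    """Convert every timestamp to UTC and sort the entries by time."""
    for entry in entries:
        if entry.when.tzinfo is None:
            raise ValueError(f"timestamp without time zone: {entry.what}")
    in_utc = [
        TimelineEntry(entry.when.astimezone(timezone.utc), entry.what)
        for entry in entries
    ]
    return sorted(in_utc, key=lambda entry: entry.when)


def is_increasing(entries):
    """Check that the timestamps never go backwards."""
    return all(a.when <= b.when for a, b in zip(entries, entries[1:]))


# Made-up notes from two responders in different time zones.
pacific = timezone(timedelta(hours=-8))
notes = [
    TimelineEntry(datetime(2024, 1, 1, 10, 5, tzinfo=pacific), "Rolled back deploy"),
    TimelineEntry(datetime(2024, 1, 1, 18, 0, tzinfo=timezone.utc), "Alert fired"),
]
timeline = normalize_timeline(notes)
```

Rejecting timestamps without a time zone is a deliberate choice in this sketch:
guessing the zone is exactly how timelines from a distributed team end up out of order.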

### Action Items

```{=latex}
\begin{frame}
\frametitle{Action Items}

\pause
Concrete

\pause
Actionable

\end{frame}
```

A retrospective also includes action items,
or recommendations.
Such recommendations should be
*concrete*
and
*actionable*.

Being concrete,
without being actionable,
does not make something a recommendation.
Being actionable,
without being concrete,
makes it impossible to know if the recommendation
has been implemented.

## Process

```{=latex}
\begin{frame}
\frametitle{Incident Management Process}

\pause
Step by step

\end{frame}
```

Incident retrospective is a
*process*.
It is not only about the end-goal:
the journey is important as well.

### Incident Management Notes

```{=latex}
\begin{frame}
\frametitle{Incident Management Notes}

\pause
Rough

\pause
Immediate

\end{frame}
```

While analyzing and remediating an incident,
it is important to keep notes.
This should not delay the actual remediation process.

One way of doing it is to give the team a heads-up
on any actions needed,
or analysis results,
on a chat channel.
This is important,
in any case,
for allowing the team to help and understand what is being done.

### Assigning Responsibility

```{=latex}
\begin{frame}
\frametitle{Assigning Responsibility}

\pause
Prepare the retrospective

\pause
Not: "At fault"

\pause
Not: "Handled the incident"

\pause
Not: "Knows best"

\end{frame}
```

There should be one person assigned to the retrospective.
This assignment should be treated as
"who best to write the retrospective",
and not implied to be a "punishment".

This means it is not based on who
"caused"
the incident,
or who was assigned to remediate it.
It is also not necessarily the person who
knows the system best.

The assignment should be based on a combination of ability to do it
in terms of
availability
and skills
as well as
who would learn from it the most.
It is important that everyone on the team gains experience
in writing retrospectives.

### Research

```{=latex}
\begin{frame}
\frametitle{Research}

\pause
Prepare timeline\pause
(with references)

\pause
Understand the system

\end{frame}
```

Writing a retrospective requires doing research.
The research includes both gathering information
on exactly what happened
as well as how the system functioned,
and malfunctioned.

### Write-up

```{=latex}
\begin{frame}
\frametitle{Write-up}

\pause
...More details later

\end{frame}
```

The details of the write-up are worthy of their own section.
For now,
we will skip them.

### Review

```{=latex}
\begin{frame}
\frametitle{Review}

\pause
Team: On-call rotation

\pause
Offer corrections/amendments

\end{frame}
```

After the draft write-up is ready,
it should be put up for review by the team.
The definition of the
"team"
here
includes,
at least,
anyone in an on-call rotation for the system.

It can also include adjacent teams
as well as people who contribute code 
but are not on the rotation.

### Follow-up

```{=latex}
\begin{frame}
\frametitle{Follow-up}

\pause
Recommendation to implementation

\end{frame}
```

The final piece of the retrospective
is to make specific recommendations.
The incident is not properly
"done"
until the recommendations have been implemented.

## System Safety and Control Theory (3)

```{=latex}
\begin{frame}
\frametitle{System Safety and Control Theory}

\pause
Applications to writing incident retrospectives

\end{frame}
```

Before talking about incident retrospective write-ups,
it is important to understand the theory of systems failure.
This theory guides how to analyze the system,
what belongs in the timeline,
and how to move from analysis to recommendation.

### System

```{=latex}
\begin{frame}
\frametitle{System}

\pause
Parts we control

\end{frame}
```

The
"system"
is the part under the team's control.
The definition of the
"system"
depends on the definition of the
"team".


### Environment

```{=latex}
\begin{frame}
\frametitle{Environment}

\pause
Parts we don't control...

\pause
...that affect the system

\end{frame}
```

The
"environment"
is anything that interacts with the system
that is not under the control of the team.

This includes infrastructure the system depends on,
as well as any systems that depend on it.
The environment is also any ambient factors:
for example,
other systems using the same infrastructure.

### Control

```{=latex}
\begin{frame}
\frametitle{Control}

\pause
System responding to environment

\end{frame}
```

A
"control"
is a part of the system that takes
input from the environment
and
changes the behavior of the system.
In general,
controls
are designed to keep a system functioning inside
given parameters.

### Safety Control

```{=latex}
\begin{frame}
\frametitle{Safety Control}

\pause
Part of system

\pause
designed to avoid failure

\end{frame}
```

A
*safety control*
is a control that is designed to prevent,
or mitigate,
a system failure.
It responds to a potential issue
and changes the system to make the failure less likely.
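
As an illustration of a safety control in software,
here is a minimal sketch of a circuit breaker:
it takes input from the environment (a dependency failing)
and changes the behavior of the system (stop calling it, serve a fallback).
The class name, the threshold, and the fallback mechanism are assumptions chosen for the example.

```python
class CircuitBreaker:
    """Stop calling a dependency after too many consecutive failures."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.consecutive_failures = 0

    @property
    def is_open(self):
        # "Open" means the breaker has tripped and calls are refused.
        return self.consecutive_failures >= self.max_failures

    def call(self, func, fallback):
        """Run func(), or return the fallback if the breaker is open or func fails."""
        if self.is_open:
            return fallback
        try:
            result = func()
        except Exception:
            self.consecutive_failures += 1
            return fallback
        # A success resets the failure count.
        self.consecutive_failures = 0
        return result
```

This sketch never closes again once it has tripped.
A production version would usually retry after a cool-down period.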

### Safety Control Failure

```{=latex}
\begin{frame}
\frametitle{Safety Control Failure}

\pause
Safety control...

\pause
...environment...

\pause
...failure

\end{frame}
```

A safety control failure
happens when a safety control
encounters a problem it was designed to mitigate
and does not mitigate it appropriately.

### Further Resources

```{=latex}
\begin{frame}
\frametitle{Further Resources}

\pause
Professor Nancy Leveson

\pause
CAST Handbook

\end{frame}
```

This was a speedrun of the concepts involved in CAST,
Causal Analysis based on Systems Theory.
Professor Nancy Leveson of MIT
has published the
CAST Handbook,
based on her research.

The book is available for free on the internet.
Reading it is highly useful.

## Anatomy of a Write-up: High-level (4)

```{=latex}
\begin{frame}
\frametitle{Anatomy of a Write-up}

Table of Contents

\end{frame}
```

Now that the basics of systems theory have been established,
it is time to talk about how to apply them to writing incident retrospectives.
The first step is to design the
"table of contents"
for the write-up.

### Summary

```{=latex}
\begin{frame}
\frametitle{Summary}

\pause
One paragraph

\pause
Appears first...

\pause
...written last

\end{frame}
```

Though the summary appears first,
it is written last.
Nothing should be in the summary that is not a result
of the rest of the write-up.

### Timeline

```{=latex}
\begin{frame}
\frametitle{Timeline}

\pause
Bullet list

\pause
Time in timezone

\pause
What happened

\pause
Optional: Link to references

\end{frame}
```

The next section is the timeline.
It consists of a bullet list
of timestamps and a short description of what happened.

The timeline,
ideally,
includes links to references and context for what happened.
Alarms,
pull requests,
and more are all useful.
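
The bullet-list shape described above can be produced from structured notes.
This is only a sketch:
the tuple layout, the `render_timeline` name, and the output format are assumptions,
and the timestamps are assumed to carry a time zone.

```python
from datetime import timezone


def render_timeline(entries):
    """Render (when, what, reference-or-None) tuples as Markdown bullets."""
    lines = []
    for when, what, reference in sorted(entries, key=lambda entry: entry[0]):
        # One consistent time zone (UTC here), at second resolution.
        stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
        link = f" ([reference]({reference}))" if reference else ""
        lines.append(f"* {stamp}: {what}{link}")
    return "\n".join(lines)
```

Keeping the reference optional matches the timeline format:
an entry without a link is still better than a missing entry.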

### What Went Well

```{=latex}
\begin{frame}
\frametitle{What Went Well}

\pause
A bad time...

\pause
...great as a morale boost

\end{frame}
```

It is always useful to congratulate the team on what went well.
While not strictly useful to doing better,
it works well as a morale boost.

The rest of the incident retrospective is about problems.
It is often perceived as criticism.

Putting this front and center allows the reader to have a little
moment of relief.

### What Could Have Been Done Better

```{=latex}
\begin{frame}
\frametitle{What Could Have Been Done Better}

\pause
Appears before recommendations...

\pause
...written after

\end{frame}
```

This is not the recommendations yet.
This is not necessarily actionable --
some of the
"what could have been done better"
things could only be clear in retrospect.

This is more of a
"fantasy":
how things would go in the best of all scenarios.
This is useful for inspiration later on.

### Analysis

```{=latex}
\begin{frame}
\frametitle{Analysis}

\pause
Based on CAST

\pause
Details later

\end{frame}
```

The analysis is based on CAST:
Causal Analysis based on Systems Theory.
The details of how to write an analysis using
CAST
will be clarified later.
For now,
let's skip to the next part.

### Recommendations

```{=latex}
\begin{frame}
\frametitle{Recommendations}

\pause
Concrete...

\pause
...actionable

\pause
Open tickets (issues/bugs/...)

\end{frame}
```

As explained before,
the recommendations must be concrete and actionable.
If you take nothing else from this talk,
this is the most important part.

The goal of the retrospective is to make things better.
The way to make things better is to have specific things
that can be done.


## Anatomy of a Write-up: Details (5)

```{=latex}
\begin{frame}
\frametitle{Anatomy of a Write-up: Analysis}

\pause
...Finally, details!

\end{frame}
```

After all the build-up,
let's dive in:
how does CAST analysis work?
Here is the
step-by-step
guide.

### Safety Controls

```{=latex}
\begin{frame}
\frametitle{Safety Controls}

\pause
List safety controls

\pause
Short explanation

\end{frame}
```

Hopefully,
the system had some safety controls.
Though they failed to stop the incident,
they are still important.

List each one,
and note what it does.



### Safety Control Failures

```{=latex}
\begin{frame}
\frametitle{Safety Control Failures}

\pause
Every safety control failed

\pause
This is what
"incident"
means

\end{frame}
```

The relevant safety controls all failed.
It is important to clearly analyse
the failure.

This often requires the most research.
You will need to understand exactly how the control was supposed
to work.

### Systemic Problems

```{=latex}
\begin{frame}
\frametitle{Systemic Problems}

\pause
Beyond an individual safety control

\pause
Systemic problems leading to the incident

\end{frame}
```

Are there any systemic problems?

For example,
what is the process of deciding on safety controls?
How are they tested?



### Missing Safety Controls

```{=latex}
\begin{frame}
\frametitle{Missing Safety Controls}

\pause
Aspects that lack any safety controls

\end{frame}
```

Are there any aspects that are completely missing safety controls?
Add a list of safety controls that could have stopped or mitigated
the incident.

### Why not "root causes"?

```{=latex}
\begin{frame}
\frametitle{Why not "root causes"?}

\pause
What is the "root"?

\pause
What is special about the "root"?

\pause
"Root" often in environment

\end{frame}
```

Often,
the analysis is called
"root cause analysis"
or
"contributing causes analysis".
Root cause analysis,
properly speaking,
is one which identifies the one
"cause"
leading to the issues,
and then
*fixes it*.

This is a fundamentally broken way of working.
The problem is not the
"root cause":
the problem is that the system did not have appropriate safety controls.

The
"cause",
or
"contributing causes",
might be outside of our locus of control.
Increased customer load
or underlying infrastructure failure
can bring a system down.

We do not have control over the customers
or the infrastructure.
We can make sure that we shed load appropriately
or fail over to alternative infrastructure.
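
To make "shed load appropriately" concrete,
here is a minimal sketch of a load-shedding safety control.
The `LoadShedder` name and the in-flight limit are assumptions for illustration,
and the sketch is not thread-safe.

```python
class LoadShedder:
    """Reject new work once too many requests are in flight."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def try_acquire(self):
        """Return True if the request may proceed, False if it is shed."""
        if self.in_flight >= self.max_in_flight:
            return False  # shed: the caller should answer "try again later"
        self.in_flight += 1
        return True

    def release(self):
        """Mark one request as finished."""
        self.in_flight = max(0, self.in_flight - 1)
```

The customer load stays outside our control.
What this control changes is how the system responds to it:
some requests are refused quickly so that the rest can still be served.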

### Actionable Recommendations

```{=latex}
\begin{frame}
\frametitle{Actionable Recommendations}

\pause
Recommendation must be related to
issues identified

\pause
Additional safety control

\pause
Improvement to existing safety control

\end{frame}
```

The recommendations should be related to the analysis.
Each recommendation should be traceable to one of the specific
problems written in the analysis.

Specifically,
avoid recommendations which would
*not*
have helped in those particular circumstances.
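
One way to keep recommendations tied to the analysis is to record,
for each one,
which finding it addresses,
and then check for strays.
This is a sketch under assumptions:
the `Recommendation` fields and the finding identifiers are hypothetical,
not part of CAST or of any ticketing system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    title: str
    addresses: str  # identifier of the analysis finding this traces to
    ticket: str  # where the work is tracked


def untraceable(recommendations, finding_ids):
    """Return the recommendations that do not trace to any analysis finding."""
    known = set(finding_ids)
    return [rec for rec in recommendations if rec.addresses not in known]
```

A non-empty result is a prompt for discussion:
either the analysis is missing a finding,
or the recommendation would not have helped in this incident.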

```{=latex}
\begin{frame}
\frametitle{What Are Recommendations?}

\pause
Software changes

\pause
Configuration changes

\pause
Process changes

\pause
Clear definition of "done"

\end{frame}
```

Recommendations can include anything.
Often,
they'll include changes to software or configuration.

Recommendations can,
however,
also include
"process"
changes.
For example,
are there things that someone needs to read
before joining an on-call rotation?
Do we need to develop more training materials?

A recommendation does need to communicate what
"done"
would look like.

## Incident Retrospective Review (6)

```{=latex}
\begin{frame}
\frametitle{Incident Retrospective Review}

\pause
"Post-Mortem Meeting"?

\pause
De-emphasize the meeting!

\end{frame}
```

When we talk about incident retrospectives,
or
"post-mortems",
the usual word that follows is
"meeting".
This should,
in general,
be the least important part of the process.

This is not when the review starts,
or ends.

### Async Feedback

```{=latex}
\begin{frame}
\frametitle{Async Feedback}

\pause
Comments from the team

\pause
Give enough time

\end{frame}
```

The review starts when there is a draft of the written analysis ready.
At that point,
ask for feedback from the team.

Allow enough time,
and have some method,
for the team to collaborate on feedback.
For example,
this can be done using pull request or merge request comments,
comments in a shared document writing system,
or anything else that works.


This allows people enough time to properly consider the analysis,
read it,
and think about ways it could be done better.
Comments need to be clear about what kind of change they are asking for,
and why.


### Addressing Async Feedback

```{=latex}
\begin{frame}
\frametitle{Addressing Async Feedback}

\pause
Change

\pause
Clarify

\pause
Disagree

\end{frame}
```

For each comment,
you might want to make the change it suggests,
ask a question if it is not clear what change to make,
or
disagree that this change should be made.
All of these options are relevant and reasonable.

For some comments,
it will turn out that the clarification is harder to achieve in an
asynchronous way.
For others,
there might be fundamental disagreement.

This is the point where it is reasonable to schedule a meeting to
discuss those.


### Incident Retrospective Meeting

```{=latex}
\begin{frame}
\frametitle{Incident Retrospective Meeting}

\pause
Limit back-and-forth

\pause
Invite on an as-needed basis

\end{frame}
```

In general,
avoid more than one
back-and-forth
on a specific comment.
Any comment where the thread is at least three messages long
should be resolved during the meeting.
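
The rule above is simple enough to apply mechanically when building the meeting agenda.
A minimal sketch,
assuming the review comments have been exported as a mapping from topic to the list of entries in its thread
(the export format and the function name are hypothetical):

```python
def needs_meeting(threads, threshold=3):
    """Return the topics whose comment thread has reached the threshold."""
    return [
        topic
        for topic, entries in threads.items()
        if len(entries) >= threshold
    ]
```

The people who wrote in the returned threads,
plus the original author,
are the natural mandatory participants.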

When setting up the meeting,
only the original author and the people involved in the discussion
are mandatory participants.
Others in the team can be
"optional".

### Roles in Incident Retrospective Meeting

```{=latex}
\begin{frame}
\frametitle{Roles in Incident Retrospective Meeting}

\pause
Presenter

\pause
Moderator

\pause
Note-taker

\end{frame}
```

### Follow-up on Incident Retrospective Meeting

```{=latex}
\begin{frame}
\frametitle{Follow-up on Incident Retrospective Meeting}

\pause
Attach notes

\pause
Make agreed-on changes

\pause
Announce

\end{frame}
```

### Finalizing Write-up

```{=latex}
\begin{frame}
\frametitle{Finalize Write-Up}

\pause
Formally note
"write-up is complete"

\pause
List of
"done"
write-ups

\end{frame}
```

## Implementing Recommendations (7)

```{=latex}
\begin{frame}
\frametitle{Implementing Recommendations}

\pause
The main point!

\end{frame}
```

### Prioritizing

```{=latex}
\begin{frame}
\frametitle{Prioritizing}

\pause
Clearly decide


\end{frame}
```

### Tracking

```{=latex}
\begin{frame}
\frametitle{Tracking}

\pause
Labels/links

\pause
Searchable

\end{frame}
```

### Commitments

```{=latex}
\begin{frame}
\frametitle{Commitments}

\pause
When?

\pause
Document

\end{frame}
```

### Revisiting Recommendations

```{=latex}
\begin{frame}
\frametitle{Revisiting Recommendations}

\pause
Not written in stone

\end{frame}
```

### Common Recommendations

```{=latex}
\begin{frame}
\frametitle{Common Recommendations}

\pause
Some issues cause multiple incidents

\end{frame}
```

### Communicating Status of Recommendations

```{=latex}
\begin{frame}
\frametitle{Communicating Status of Recommendations}

\pause
Planned/in-progress/dropped

\end{frame}
```

## Onboarding (8)

```{=latex}
\begin{frame}
\frametitle{Onboarding}

\pause
From here to there
\end{frame}
```

### Documenting the Process

```{=latex}
\begin{frame}
\frametitle{Document the Process}

\pause
Step-by-step

\pause
Template

\end{frame}
```

### See One, Do One, Teach One

```{=latex}
\begin{frame}
\frametitle{See one, Do one, Teach one}

\pause
Medical school system

\pause
Watch someone else doing it

\pause
Do it with someone else watching

\pause
Reverse roles in step 1

\end{frame}
```

### Allocating Time

```{=latex}
\begin{frame}
\frametitle{Allocating time}

\pause
A new job responsibility

\end{frame}
```

### Integrating Feedback

```{=latex}
\begin{frame}
\frametitle{Integrating feedback}

\pause
A new job responsibility

\end{frame}
```

### Clear Standards

```{=latex}
\begin{frame}
\frametitle{Clear standards}

\pause
What's "good enough"?

\end{frame}
```

### Deadlines

```{=latex}
\begin{frame}
\frametitle{Deadlines}

\pause
When is it due?

\end{frame}
```

## Iterating (9)

```{=latex}
\begin{frame}
\frametitle{Iterating}

\pause
Getting better

\end{frame}
```

### Updating the Process

```{=latex}
\begin{frame}
\frametitle{Updating the Process}

\pause
Process for update

\end{frame}
```

### Communicating Changes

```{=latex}
\begin{frame}
\frametitle{Communicating changes}

\pause
Announce clearly

\end{frame}
```

### Clearly Needed Changes

```{=latex}
\begin{frame}
\frametitle{Clearly needed changes}

\pause
Missing things

\pause
Unclear

\end{frame}
```

### Analyzing Retrospectives Over Time

```{=latex}
\begin{frame}
\frametitle{Changes over time}

\pause
Common problems

\pause
Make it easy to analyze

\end{frame}
```

### Standard vs. Team-led

```{=latex}
\begin{frame}
\frametitle{Standard vs. Team-led}

\pause
Can different teams have different processes?

\end{frame}
```

### Intentional Learning

```{=latex}
\begin{frame}
\frametitle{Intentional Learning}

\pause
Research

\pause
Discuss

\end{frame}
```

## Summary (10)

```{=latex}
\begin{frame}
\frametitle{Summary}

Highlights

\end{frame}
```

### Back to basics: Blameless

```{=latex}
\begin{frame}
\frametitle{Basics: Blameless}

\pause
Avoid blame

\pause
"Humans are not the cause"

\end{frame}
```

### Consistency Matters

```{=latex}
\begin{frame}
\frametitle{Consistency}

\pause
Process

\end{frame}
```

### Measure

```{=latex}
\begin{frame}
\frametitle{Measure}

\pause
Is it useful?

\end{frame}
```

### Iterate

```{=latex}
\begin{frame}
\frametitle{Iterate}

\pause
Remove what doesn't work

\pause
Add what's needed

\end{frame}
```

### Keep the Goal in Mind

```{=latex}
\begin{frame}
\frametitle{Goal}

\pause
Better incidents!

\end{frame}
```

### Enjoy!

```{=latex}
\end{document}
```