```{=latex}
\usepackage{hyperref}
\usepackage{graphicx}
\usepackage{listings}
\usepackage{textcomp}
\usepackage{fancyvrb}

\newcommand{\passthrough}[1]{\lstset{mathescape=false}#1\lstset{mathescape=true}}
\newcommand{\tightlist}{}
```

```{=latex}
\title{Modeling Alert Quality}
\author{Moshe Zadka -- https://cobordism.com}
\date{}

\begin{document}
\begin{titlepage}
\maketitle
\end{titlepage}

\frame{\titlepage}
```

```{=latex}
\begin{frame}
\frametitle{Acknowledgement of Country}

Belmont (in San Francisco Bay Area Peninsula)

Ancestral homeland of the Ramaytush Ohlone

\end{frame}
```

## What are alerts?

Before I talk about alert
*quality*,
I want to make sure we are all on the same page:
what are alerts?
Whether good or bad,
it is important to distinguish them from other things.

```{=latex}
\begin{frame}
\frametitle{What are alerts?}

Good or bad

\end{frame}
```

### Monitoring

There can be different models of alerts.
This is a specific,
but popular model.
If alerts are to be about a system,
then the system has to send some monitoring,
or observability,
data somewhere.

If we want more than a single data point to cause an alert,
this data sink has to aggregate the data.
This is going to be the
*source*
of the alerts.

```{=latex}
\begin{frame}
\frametitle{Monitoring}

System $\to$ Aggregator

\end{frame}
```

### Event

A system that constantly alerts is as useless as a system that never alerts.
If alerts are based on data,
and should not be sent all the time,
we can model an alert as a kind of
*event*.
The definition of an event is that some query against the aggregated data
returns an atypical value.

But not all events are alerts!
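
As a minimal sketch of the event definition,
assume the aggregated data is a list of recent numeric samples,
and that "atypical" means far from the mean,
measured in standard deviations.
The function name and the threshold are illustrative assumptions,
not part of any specific monitoring product.

```python
from statistics import mean, stdev


def is_event(history, latest, z_threshold=3.0):
    """Return True when `latest` is atypical relative to `history`."""
    if len(history) < 2:
        return False  # not enough data to judge what is typical
    spread = stdev(history)
    if spread == 0:
        # Every past sample was identical: any change is atypical.
        return latest != history[0]
    return abs(latest - mean(history)) / spread > z_threshold
```

Whether such an event becomes an alert is a separate decision,
made on top of this check.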

```{=latex}
\begin{frame}
\frametitle{Event}

Aggregator query \pause atypical value

\end{frame}
```

### Alert: Low priority

A
*low priority*
alert is when an event indicates a
*problem*,
but the problem is not urgent.
This kind of alert usually shows up in e-mail,
as a task,
or in Slack.

```{=latex}
\begin{frame}
\frametitle{Low priority alert}

Bad event (not urgent)

\end{frame}
```

### Alert: High priority

A
*high-priority*
alert is an event that we have decided should require an immediate action.
This kind of alert will usually text,
ring,
or otherwise cause sounds,
lights,
and vibrations from a mobile device.


This is intended to draw immediate attention,
potentially waking someone up.
These alerts,
high priority ones,
are the focus of this talk.

It is those alerts whose quality we want to measure.

```{=latex}
\begin{frame}
\frametitle{High priority alert}

Break-fix needed! \pause

Focus of this talk

\end{frame}
```

## What is alert quality?

So now that we have decided what alerts to measure,
what exactly will we measure?
It is useful to break the measurement into measuring three kinds of alarms:

* True alarms:
  Those are alarms that indicated a real problem that needed to be fixed.
* False alarms:
  Those are alarms that happened, despite there not being a problem,
  or a problem that had to be fixed immediately.
* Missing alarms:
  Much like the curious incident of the dog in the night-time,
  alarms that
  *do not*
  exist
  are just as important as those which do.
  A missing alarm is a problem that,
  in retrospect,
  needed to be fixed urgently,
  and yet no alarm was sent.
  
Distinguishing these alerts from each other can only be done retroactively.
If you knew an alert was a false alarm,
there would be no need to send it!
This leads to an important aspect of alert quality:
it can only be measured retroactively.

```{=latex}
\begin{frame}
\frametitle{What is alert quality made of?}

\pause

True alarms \pause

False alarms \pause

Missing alarms

\end{frame}
```

### True Alarm

What parameters would make up the quality of a true alarm?
In other words,
what numbers are good when they go in one direction
and bad when they go in another?

Right now we are not focused on how to trade them off,
let alone improving!
All we want here is to capture the data.

The most important thing about a true alarm is the latency of the alert.
So important it is,
in fact,
that it is useful to break it down:

* From the beginning of the issue to the detection
* From the detection to someone acknowledging it
* From acknowledgement to having some diagnosis (enough to fix)
* From initial diagnosis to remediation.

Though the aggregate matters,
breaking it down gives better insights into where the problems are.
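
The four-way breakdown above can be captured with five timestamps per incident.
This is a sketch under assumptions:
the class and field names are hypothetical,
and the timestamps are assumed to be gathered after the fact,
for example from an incident review.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class IncidentTimeline:
    started: datetime
    detected: datetime
    acknowledged: datetime
    diagnosed: datetime
    remediated: datetime

    def breakdown(self):
        """Latency of each stage, keyed by stage name."""
        return {
            "start_to_detect": self.detected - self.started,
            "detect_to_acknowledge": self.acknowledged - self.detected,
            "acknowledge_to_diagnosis": self.diagnosed - self.acknowledged,
            "diagnosis_to_remediation": self.remediated - self.diagnosed,
        }

    def total(self):
        """Aggregate latency, from start of the issue to remediation."""
        return self.remediated - self.started
```

The stage latencies always add up to the total,
so the breakdown loses no information compared to the aggregate.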

```{=latex}
\begin{frame}
\frametitle{True Alarm}

\pause
\begin{itemize}
\item Start to detect \pause
\item Detect to acknowledge \pause
\item Acknowledge to diagnosis \pause
\item Diagnosis to remediation
\end{itemize}

\end{frame}
```

### Missing Alarm

A missing alarm,
by definition,
still caused a problem that needed fixing.
Since this is the measurement stage,
this problem has already been fixed.

This means that the very same parameters
can be measured for the missing alarm:
latency to remediation,
broken down by similar metrics.

Indeed, one of the measures of a true alarm
should be whether it improves at least one of those.
The glaring one is
"time to detect",
but this need not be the only one.

A true alarm could route better,
improving detection to acknowledgement.
It could add diagnostic information,
allowing faster diagnosis.
It could point to the right runbook,
allowing faster remediation.

```{=latex}
\begin{frame}
\frametitle{Missing Alarm}


\begin{itemize}
\item Start to detect \pause
\item Detect to acknowledge \pause
\item Acknowledge to diagnosis \pause
\item Diagnosis to remediation
\end{itemize}


\end{frame}
```

### False Alarm

A false alarm is one,
by definition,
that did not have any remediation involved.
This means that the latency for a false alarm
is from the detection to diagnosis.

In this case,
there are fewer steps.
It is still worthwhile to break down the times
from detection to acknowledgement,
and from acknowledgement to the
"all clear"
diagnosis.

```{=latex}
\begin{frame}
\frametitle{False Alarm}

Detect to acknowledgement \pause

Acknowledgement to diagnosis

\end{frame}
```

### Cost of alerting

A
**false**
alarm is one where no incident happened.
In contrast,
a
**useless**
alarm
is one that indicated an incident that
had already been alerted on.

This might mean a previous alert has already indicated the incident,
or it might mean a human became aware of the issue in some other way.
For example,
users reporting issues in an application falls
under this bucket.

Both kinds of alerts are overhead:
had there not been an alerting system at all,
they would not have been sent,
with no degradation to the service provided.
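
The distinction can be written down as a small classification,
done retroactively.
This is a sketch:
the enum and the two boolean inputs are hypothetical names
for facts established after the incident was reviewed.

```python
from enum import Enum


class AlarmKind(Enum):
    TRUE = "true"
    FALSE = "false"
    USELESS = "useless"


def classify(incident_happened, already_known):
    """Classify an alarm after the fact.

    `already_known` means an earlier alert, or a human,
    had already surfaced the same incident.
    """
    if not incident_happened:
        return AlarmKind.FALSE
    if already_known:
        return AlarmKind.USELESS
    return AlarmKind.TRUE


def is_overhead(kind):
    """False and useless alarms are both pure alerting cost."""
    return kind in (AlarmKind.FALSE, AlarmKind.USELESS)
```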

```{=latex}
\begin{frame}
\frametitle{Alerting costs}

False alarm \pause

Useless alarm

\end{frame}
```

### Cost of not alerting

If there is only measurement of costs of alerting,
then the incentive is to never alert.
Yet people show a revealed preference for building alerting systems.

The reason is that there is a cost to
*not*
alerting.
A missing alarm can result in
increased remediation latency.
Measuring both the total increase,
and breaking down the increases,
is useful.
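
One way to sketch this measurement
is to compare the stage latencies of a missing alarm
against an estimate of what they would have been had an alarm fired.
The baseline is an assumption that has to be estimated,
and the stage names here are illustrative.

```python
from datetime import timedelta

STAGES = ("detect", "acknowledge", "diagnose", "remediate")


def extra_latency(missing, baseline):
    """Per-stage and total latency increase of a missing alarm.

    Both arguments map a stage name to a timedelta.
    `baseline` is a hypothetical estimate of each stage with an alarm.
    """
    per_stage = {stage: missing[stage] - baseline[stage] for stage in STAGES}
    total = sum(per_stage.values(), timedelta())
    return per_stage, total
```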

```{=latex}
\begin{frame}
\frametitle{Non-alerting costs}

Extra time to remediate \pause

Broken down

\end{frame}
```

### Alert quality as cost

Putting those two costs together allows modeling
alert quality
**as**
cost.
The total cost of alerting,
plus the total cost of not alerting,
is an
"anti-quality"
measurement.

In order to get a quality measurement,
negate it.
If dealing with negative numbers is too depressing,
since the best is zero,
add a large constant.

The important thing about alert quality is
how to improve it,
so adding a constant value does not change the
resulting actions.
It is,
sometimes,
nicer
to avoid having to say
"OKR is to reduce alert value by 10%".

```{=latex}
\begin{frame}
\frametitle{Alert quality as value}

Cost of alerting \pause

plus cost of not alerting \pause

Negated \pause

Plus a constant

\end{frame}
```
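
As a sketch of the arithmetic,
assuming every cost has already been estimated in one shared unit
(the function names, the unit, and the numbers are illustrative):

```python
def anti_quality(alerting_costs, non_alerting_costs):
    """Total cost of alerting plus total cost of not alerting."""
    return sum(alerting_costs) + sum(non_alerting_costs)


def alert_quality(alerting_costs, non_alerting_costs, offset=0):
    """Negated total cost, shifted by an optional constant."""
    return offset - anti_quality(alerting_costs, non_alerting_costs)


# Two hypothetical quarters, scored with and without the constant.
raw_change = alert_quality([10, 20], [40]) - alert_quality([30, 20], [50])
shifted_change = (
    alert_quality([10, 20], [40], offset=1000)
    - alert_quality([30, 20], [50], offset=1000)
)
```

The change between the two periods is the same with or without the offset,
which is why the constant is only a matter of presentation.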



## Cost of alarm

In order to model alert quality as
(anti-)cost,
we need to measure cost.
*Measuring*
is always and forever a process of
*estimation*.

In other words,
measuring alerting costs
means gathering the data from the incidents,
and estimating a cost per alert.
This can mean that sometimes it's useful to give a cost
not in dollars,
but in some fake currency.

This can sometimes communicate better
*systematic*
estimation errors.
Systematic errors end up not changing the suggested actions or overall feeling,
so reducing those might not be as useful.

```{=latex}
\begin{frame}
\frametitle{Breaking down alerting costs}

Data $\to$ Estimation

\end{frame}
```

### False alarm

Since a false alarm does not result in any degradation,
any person time spent on it is
"wasted".
Because of that,
totaling the person time
invested in the diagnosis
is important.
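
A minimal sketch of that total,
assuming the time each person spent on the diagnosis was recorded
(the mapping from person to time is a hypothetical input):

```python
from datetime import timedelta


def person_time(time_per_person):
    """Sum the time every involved person spent on a false alarm."""
    return sum(time_per_person.values(), timedelta())
```

Two people spending 30 and 15 minutes
add up to 45 minutes of person time,
even if they worked in parallel.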

```{=latex}
\begin{frame}
\frametitle{False alarm}

Number of people \pause

Time

\end{frame}
```

### Convenience

Alerts,
false,
true,
or missing,
can have different levels of
"convenience".
Think of convenience as
"amount of engineer dissatisfaction"
or
"burn-out factor".

This convenience can depend on various aspects,
and ultimately on the people involved.
Is an alert on Saturday at 4pm worse
than one on Tuesday at 2am?

It can also depend on what else these people are doing.
An alert that adds distraction
or context-switching overhead
for an engineer can be especially harmful.
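
One input to convenience that is easy to record
is whether the alert arrived outside business hours.
The sketch below assumes Monday to Friday,
9:00 to 17:00,
in the responder's local time.
Those hours are an assumption,
and how much worse an off-hours alert is
remains a judgment for the people involved.

```python
from datetime import datetime


def is_off_hours(when, start_hour=9, end_hour=17):
    """True when `when` falls outside the assumed business hours."""
    is_weekend = when.weekday() >= 5  # Monday is 0, so 5 and 6 are the weekend
    return is_weekend or not (start_hour <= when.hour < end_hour)
```

Under these assumptions,
both Saturday at 4pm and Tuesday at 2am count as off hours,
so this flag alone does not say which one is worse.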



```{=latex}
\begin{frame}
\frametitle{Alarm convenience}

Off business hours? \pause

Delaying critical project?

\end{frame}
```

### People involved

The
*number*
of people involved is also an important cost metric.
How many people needed to get involved?
Across how many teams?
Was it to find the responsible party or
to get help?

```{=latex}
\begin{frame}
\frametitle{People diagnosing and remediating}

Interaction with other teams? \pause

Finding responsible party?

\end{frame}
```

### Work involved

How much work was involved in remediation?
By whom?
*Where* was this work spent:
diagnosis,
test,
deployment,
etc.
This is useful to make sure all remediation work is accounted for.


```{=latex}
\begin{frame}
\frametitle{Work diagnosing and remediating}

Work to diagnose \pause

Work to test \pause

Work to deploy

\end{frame}
```

## Cost of incident

Separate from the cost of the
*alert*
itself,
it is important to measure the cost of the incident.
An incident means that a service that someone cared about was degraded.

How much caring,
how much degradation,
and how long,
is important.
This is separate from how much
*work*
it took to remediate.

For example,
a one-line fix that took an hour for the automated deployment to finish
might have little work,
but a lot of time to remediate.
In contrast,
a fix that took five minutes to deploy,
but three people over twenty minutes to develop,
takes more work,
but takes less time to remediate.
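
The two examples can be worked through with a little arithmetic.
The text does not say how long the one-line fix took to write,
so the sketch assumes one person and five minutes.
It also assumes the automated deployment takes no human work.

```python
def incident_numbers(people, develop_minutes, deploy_minutes):
    """Return (work in person-minutes, time to remediate in minutes)."""
    work = people * develop_minutes  # deployment is automated: no work
    time_to_remediate = develop_minutes + deploy_minutes
    return work, time_to_remediate


# One-line fix, slow deployment (developer count and time are assumed).
one_line_fix = incident_numbers(people=1, develop_minutes=5, deploy_minutes=60)
# Three people for twenty minutes, fast deployment.
team_fix = incident_numbers(people=3, develop_minutes=20, deploy_minutes=5)
```

Under these assumptions,
the one-line fix costs 5 person-minutes of work but 65 minutes of degradation,
while the team fix costs 60 person-minutes of work but only 25 minutes of degradation.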


```{=latex}
\begin{frame}
\frametitle{Incident cost}

Separate from work on incident

\end{frame}
```

### Time to detect

```{=latex}
\begin{frame}
\frametitle{Time to detect}

Unknown problem

\end{frame}
```

### Time to remediate

```{=latex}
\begin{frame}
\frametitle{Time to remediate}

Known problem

\end{frame}
```

### Immediate cost

```{=latex}
\begin{frame}
\frametitle{Immediate cost}

SLA missed \pause

Business missed

\end{frame}
```

### Reputational cost

```{=latex}
\begin{frame}
\frametitle{Reputation cost}

Customer feedback \pause

Customer continued business \pause

New customer acquisition

\end{frame}
```

### Secondary incidents

```{=latex}
\begin{frame}
\frametitle{Secondary incidents cost}

Any degradation caused by remediations/mitigations
\end{frame}
```

## Balancing costs

```{=latex}
\begin{frame}
\frametitle{Balancing cost}

What would constitute "better"?

\end{frame}
```

### Gathering data

```{=latex}
\begin{frame}
\frametitle{Gather data}

Estimate when you need to

\end{frame}
```

### Deciding on priorities

```{=latex}
\begin{frame}
\frametitle{Priorities}

Strategy \pause

Tactics

\end{frame}
```

### Tracking trailing OKRs

```{=latex}
\begin{frame}
\frametitle{Tracking quality}

Actual quality: Lagging indicator

\end{frame}
```

### Tracking immediate OKRs

```{=latex}
\begin{frame}
\frametitle{Tracking quality: immediate}

Approximate quality \pause

Track that

\end{frame}
```

### Black swans

```{=latex}
\begin{frame}
\frametitle{Tracking quality: black swans}

Take into account wide "safety margins"

\end{frame}
```

### Goodhart's law

```{=latex}
\begin{frame}
\frametitle{Tracking quality: Goodhart's law}

Not a target \pause

Feedback

\end{frame}
```

## Summary

### Alert quality matters

```{=latex}
\begin{frame}
\frametitle{Alert quality matters}

Burn out \pause

Customer satisfaction

\end{frame}
```

### Alert quality takes effort to track

```{=latex}
\begin{frame}
\frametitle{Alert quality difficult to track}

Time and effort!

\end{frame}
```

### Improve and iterate

```{=latex}
\begin{frame}
\frametitle{Alert improvement}

Measure \pause

Fix \pause

Iterate

\end{frame}
```

```{=latex}
\end{document}
```