Incident management docs #1573

Merged
merged 2 commits into from
Jul 2, 2024
2 changes: 1 addition & 1 deletion docs/_snippets/cloud/features.mdx
@@ -47,7 +47,7 @@ This includes investigating issues root cause and impact, communicating issues t
Distribute highly configurable alerts to different channels and integrations.
Automatically tag owners, and enable setting status and assignees at the alert level.
</Card>
<Card title="Automated Grouping">
<Card title="Automated Grouping" href="/features/alerts-and-incidents/incidents">
Different failures that relate to the same issue are automatically grouped into a single incident.
This accelerates triage and response, and reduces alert fatigue.
</Card>
@@ -0,0 +1,7 @@
Elementary can be configured to send alerts on:

- Model run failures
- Failures and/or warnings of dbt tests (including Elementary dbt package and other packages)
- Failures and/or warnings of Elementary Anomaly Detection monitors
- Failures and/or warnings of custom SQL tests
- dbt source freshness failures
2 changes: 1 addition & 1 deletion docs/_snippets/quickstart/quickstart-cards.mdx
@@ -3,7 +3,7 @@
title="Elementary Cloud Platform"
icon="cloud"
iconType="solid"
href="https://elementary-data.frontegg.com/oauth/account/sign-up"
href="/cloud/introduction"
>
<br />
Built on top of the OSS package, ideal for teams monitoring mission-critical data pipelines that require guaranteed uptime and reliability, short time-to-value, advanced features, collaboration, and professional support.
@@ -8,14 +8,14 @@ sidebarTitle: Alerts & incidents overview
Alerts and incidents in Elementary are designed to shorten your time to response and time to resolution when data issues occur.

- **Alert -** Notification about an event that indicates a data issue.
- **Incident -** A data issue that starts with an event, but can include several events grouped to an incident. An incident has a start time, status, assignee and end time.
- **[Incident](/features/alerts-and-incidents/incidents) -** A data issue that starts with an event, but can include several events grouped to an incident. An incident has a start time, status, severity, assignee and end time.

Alerts provide information and context for recipients to quickly triage, prioritize and resolve issues.
For collaboration and promoting ownership, alerts include owners and tags.
You can create distribution rules to route alerts to the relevant people and channels, for faster response.

An alert either opens a new incident, or is automatically grouped into an ongoing incident.
From the alert itself, you can update the status and assignee of an incident. In the incidents management page,
From the alert itself, you can update the status and assignee of an incident. In the [incidents page](/features/alerts-and-incidents/incident-management),
you will be able to track all open and historical incidents, and get metrics on the quality of your response.

## Alerts & incidents core functionality
@@ -28,6 +28,10 @@ you will be able to track all open and historical incidents, and get metrics on
- **Group alerts to incidents** -
- **Alerts suppression** -

## Alert types

<Snippet file="cloud/features/alerts-and-incidents/alert-types.mdx" />

## Supported alert integrations

<Snippet file="cloud/integrations/cards-groups/alerts-destination-cards.mdx" />
54 changes: 54 additions & 0 deletions docs/features/alerts-and-incidents/incident-management.mdx
@@ -0,0 +1,54 @@
---
title: Incident Management
sidebarTitle: Incident management
---

<Snippet file="cloud/cloud-feature-tag.mdx" />

The `Incidents` page is designed to enable your team to stay on top of open incidents and collaborate on resolving them.
The page gives a comprehensive overview of all current and previous incidents, where users can view the status, prioritize, assign and resolve incidents.

## Incidents view and filters

The page provides a view of all incidents, and useful filters:

- **Quick Filters:** Preset quick filters for all, unresolved and “open and unassigned” incidents.
- **Filter:** Allows users to filter incidents based on various criteria such as status, severity, model name and assignee.
- **Time frame:** Filter incidents that were open during a given time frame.
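
The time frame filter above can be read as an interval-overlap check. The following is a minimal sketch under that assumption, not Elementary's implementation; the function name `was_open_during` and its parameters are hypothetical.

```python
from datetime import datetime
from typing import Optional


def was_open_during(
    started_at: datetime,
    ended_at: Optional[datetime],
    frame_start: datetime,
    frame_end: datetime,
) -> bool:
    """Return True if the incident was open at any point in the time frame.

    Assumption: an incident with no end time (ended_at is None) is still open.
    """
    started_before_frame_ended = started_at <= frame_end
    not_closed_before_frame_started = ended_at is None or ended_at >= frame_start
    return started_before_frame_ended and not_closed_before_frame_started


# Example: an incident that is still open matches any frame after its start.
frame = (datetime(2024, 7, 1), datetime(2024, 7, 2))
still_open = was_open_during(datetime(2024, 6, 30), None, *frame)
```

Treating the frame as an overlap test, rather than requiring the incident to start inside the frame, keeps long-running incidents visible in later time frames.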

<iframe
width="700"
height="400"
src="https://res.cloudinary.com/diuctyblm/video/upload/v1719927342/incidents-wide_ki16tu.mp4"
frameborder="0"
allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
allowfullscreen
alt="Elementary Incidents page"
></iframe>


## Interacting with Incidents

An incident has a status, assignee and severity.
These can be set in the Incidents page, or from an alert in integrations that support alert actions.

- **Incident status**: Set to `Open` by default, and can be changed to `Acknowledged` and back to `Open`. When an incident is manually or automatically set to `Resolved`, it is closed and can no longer be modified.
- **Incident assignee**: An incident can be assigned to any user on the team, and they will be notified.
- If you assign an incident to a user, it is recommended to leave the incident `Open` until the user changes status to `Acknowledged`.
- **Incident severity**: Users can set a severity level (Low, Normal, High, Critical) for an incident. _Coming soon_ Severity will be set automatically, based on an analysis of the impacted assets.
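
The status rules above can be pictured as a small state machine. This is an illustrative sketch only, not Elementary's code: the `Status`, `ALLOWED_TRANSITIONS` and `Incident` names are hypothetical, and allowing `Resolved` directly from `Open` is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    OPEN = "Open"
    ACKNOWLEDGED = "Acknowledged"
    RESOLVED = "Resolved"


# Assumed transition table: Open <-> Acknowledged, and Resolved is terminal.
ALLOWED_TRANSITIONS = {
    Status.OPEN: {Status.ACKNOWLEDGED, Status.RESOLVED},
    Status.ACKNOWLEDGED: {Status.OPEN, Status.RESOLVED},
    Status.RESOLVED: set(),
}


@dataclass
class Incident:
    status: Status = Status.OPEN  # incidents start as Open
    assignee: Optional[str] = None

    def set_status(self, new_status: Status) -> None:
        """Apply a status change, rejecting changes to a resolved incident."""
        if new_status not in ALLOWED_TRANSITIONS[self.status]:
            raise ValueError(
                f"cannot move from {self.status.value} to {new_status.value}"
            )
        self.status = new_status


incident = Incident(assignee="data-owner")
incident.set_status(Status.ACKNOWLEDGED)  # the assignee picks it up
incident.set_status(Status.RESOLVED)      # closed; no further changes allowed
```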

## Incidents overview and metrics

The top bar of the page presents aggregated metrics on incidents, to provide an overall status.
You will also be able to track your average resolution time.

_Coming soon_ The option to create and share a periodic summary of incidents.

<Frame>
<div className="dark:bg-white rounded-md p-1">
<img
src="https://res.cloudinary.com/diuctyblm/image/upload/v1719926968/Untitled_3_nvhcix.png"
alt="Incidents overview"
/>
</div>
</Frame>
53 changes: 53 additions & 0 deletions docs/features/alerts-and-incidents/incidents.mdx
@@ -0,0 +1,53 @@
---
title: Incidents in Elementary
sidebarTitle: Incidents
---

<Snippet file="cloud/cloud-feature-tag.mdx" />

One of the challenges data teams face is tracking, understanding and collaborating on the status of data issues.
Tests fail daily, pipelines are executed frequently, alerts are sent to different channels.
There is a need for a centralized place to track:
- What data issues are open? Which issues were already resolved?
- Who is on it, and what's the latest status?
- Are multiple failures part of the same issue?
- What actions and events happened since the incident started?
- Did such an issue happen before? Who resolved it and how?

In Elementary, these are solved with `Incidents`.

A comprehensive view of all incidents can be found in the [Incidents page](/features/alerts-and-incidents/incident-management).

## How do incidents work?

Every failure or warning in Elementary will automatically open a new incident or be added as an event to an ongoing incident.
Based on grouping rules, different failures are grouped into the same incident.

An incident has a [status, assignee and severity](/features/alerts-and-incidents/incident-management#interacting-with-incidents).
These can be set in the [Incidents page](/features/alerts-and-incidents/incident-management), or from an alert in integrations that support alert actions.

<Frame>
<div className="dark:bg-white rounded-md p-1">
<img
src="https://res.cloudinary.com/diuctyblm/image/upload/v1719927516/incidents_ducynb.png"
alt="Elementary Incidents"
/>
</div>
</Frame>

## How are incidents resolved?

Each incident starts at the first failure, and ends when the status is changed manually or automatically to `Resolved`.
An incident is **automatically resolved** when the failing tests, monitors and/or models succeed again.
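
As an illustration of the automatic resolution described above, here is a minimal sketch that assumes an incident tracks the set of items currently failing and resolves once that set is empty. The `TrackedIncident` class and `record_result` method are hypothetical names, not part of Elementary.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional, Set


@dataclass
class TrackedIncident:
    started_at: datetime                      # time of the first failure
    failing: Set[str] = field(default_factory=set)
    status: str = "Open"
    ended_at: Optional[datetime] = None

    def record_result(self, item: str, passed: bool, at: datetime) -> None:
        """Update the incident with a new run result for a test, monitor or model."""
        if self.status == "Resolved":
            return  # a resolved incident is closed and no longer modified
        if passed:
            self.failing.discard(item)
        else:
            self.failing.add(item)
        if not self.failing:
            # Everything that was failing now passes: resolve automatically.
            self.status = "Resolved"
            self.ended_at = at


incident = TrackedIncident(
    started_at=datetime(2024, 7, 1, 8),
    failing={"not_null_order_id", "orders_model"},
)
incident.record_result("not_null_order_id", True, datetime(2024, 7, 1, 9))
incident.record_result("orders_model", True, datetime(2024, 7, 1, 10))
```

In this sketch the incident stays open after the first passing result, because one item is still failing, and is resolved only after the second.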

## Incident grouping rules

Different failures and warnings are grouped to the same incident by the following grouping rules:

1. Additional failures of the same test / monitor on a table that has an active incident.
2. _Coming soon_ Freshness and volume issues that are downstream of an open incident on a model failure.
3. _Coming soon_ Failures of the same test / monitor that are on downstream tables of an active incident.
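
The first grouping rule can be sketched as a lookup keyed on table and test. This is a hedged illustration, not Elementary's grouping logic: it assumes "same test / monitor on a table" means an exact match on both names, and `Failure`, `GroupedIncident` and `route_failure` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Failure:
    table: str
    test: str


@dataclass
class GroupedIncident:
    table: str
    test: str
    events: List[Failure] = field(default_factory=list)
    active: bool = True


def route_failure(failure: Failure, incidents: List[GroupedIncident]) -> GroupedIncident:
    """Add the failure to a matching active incident, or open a new one."""
    for incident in incidents:
        same_target = incident.table == failure.table and incident.test == failure.test
        if incident.active and same_target:
            incident.events.append(failure)
            return incident
    new_incident = GroupedIncident(failure.table, failure.test, [failure])
    incidents.append(new_incident)
    return new_incident


incidents: List[GroupedIncident] = []
first = route_failure(Failure("orders", "not_null_order_id"), incidents)
repeat = route_failure(Failure("orders", "not_null_order_id"), incidents)
other = route_failure(Failure("orders", "unique_order_id"), incidents)
```

Here the repeated failure joins the first incident as a second event, while a different test on the same table opens a separate incident.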

## Incident deep dive

_Coming soon_
1 change: 1 addition & 0 deletions docs/mint.json
@@ -135,6 +135,7 @@
"features/alerts-and-incidents/alert-configuration"
]
},
"features/alerts-and-incidents/incidents",
"features/alerts-and-incidents/incident-management"
]
},