Alerts don't group their incidents by a dedupKey / Object ID #77772

gmmorris · 2020-09-17T14:42:32Z

This is a follow up from #76908.

Each execution of the PagerDuty action is seen by the PagerDuty service as a unique incident.
To group incidents together, a user can set the optional dedupKey parameter, but this means that by default multiple executions on the same alert will not be grouped.
This will become especially painful when the Action On Resolve issue is addressed, as a user could add two PagerDuty actions to the same alert (one to open an incident and one to resolve it) and by default they will not be grouped (meaning the resolution won't actually resolve the incident in PagerDuty).

We should group these by default (but allow users to override the default value), probably based on either the Alert ID or the AlertInstance ID (I can see arguments for both).
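To make the grouping problem concrete, here is a minimal sketch. It assumes event bodies shaped like the PagerDuty Events API v2 (`routing_key`, `event_action`, `dedup_key`, `payload`), and it uses a toy in-memory `FakeIncidentService` in place of the real service. The class, the `build_event` helper, and the sample values are hypothetical illustrations, not Kibana or PagerDuty code, and nothing is sent over the network.

```python
import uuid


def build_event(routing_key, action, summary=None, dedup_key=None):
    """Build an Events API v2 style body (illustrative only, never sent)."""
    event = {"routing_key": routing_key, "event_action": action}
    if dedup_key is not None:
        event["dedup_key"] = dedup_key
    if action == "trigger":
        event["payload"] = {"summary": summary, "source": "kibana", "severity": "error"}
    return event


class FakeIncidentService:
    """Toy stand-in for the receiving service (an assumption, not PagerDuty's logic)."""

    def __init__(self):
        self.incidents = {}  # dedup_key -> "open" | "resolved"

    def receive(self, event):
        key = event.get("dedup_key")
        if key is None:
            if event["event_action"] != "trigger":
                # Assumed here: resolving needs an explicit key to know what to close.
                raise ValueError("dedup_key is required to resolve")
            # No key supplied: a fresh one is generated, so every trigger is unique.
            key = uuid.uuid4().hex
        if event["event_action"] == "trigger":
            self.incidents[key] = "open"
        elif event["event_action"] == "resolve" and key in self.incidents:
            self.incidents[key] = "resolved"
        return key


# Two executions without a dedup key: two separate incidents.
ungrouped = FakeIncidentService()
ungrouped.receive(build_event("R", "trigger", "disk full"))
ungrouped.receive(build_event("R", "trigger", "disk full"))

# Same executions sharing a key: one incident, and the resolve finds it.
grouped = FakeIncidentService()
grouped.receive(build_event("R", "trigger", "disk full", dedup_key="instance-1"))
grouped.receive(build_event("R", "trigger", "disk full", dedup_key="instance-1"))
grouped.receive(build_event("R", "resolve", dedup_key="instance-1"))
```

In this sketch `ungrouped.incidents` ends up with two open entries, while `grouped.incidents` holds a single entry that the resolve event closes.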

EDIT: The same applies to Jira, ServiceNow and IBM Resilient with the Object ID field. This issue should fix all 4 integrations.

pmuellr · 2020-10-22T14:47:55Z

The default seems very likely to be context dependent on the customer.

It does seem like it would make sense that if a PD action was added to a "resolved" action group, and a dedup key was provided (I believe it's required when using the resolve action), that a PD action used in an "active" action group should use the same context variable as the "resolved" one. You'd almost want a validation of that. But of course, "it depends". You could certainly come up with some use case where a customer would want something different.

It feels to me like we'll need some special doc on this in the PD action - and if the other incident management actions (servicenow, resilient, jira) have similar sorts of "capabilities", we'd need it there as well. Where we can explain these flows.

Also, PD provides customization of some of this stuff on their end, regarding what happens to incidents posted with a dedupkey that have already been resolved - open it again, just append the incident but leave it resolved, etc.

It feels like setting the dedupkey to the narrowest possible value would be the "safest" thing - ie, the alert instance. Otherwise, incidents posted using the alert id, and then resolved by that same alert id, are going to end up resolving things that perhaps shouldn't be resolved. Again, "it depends".
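The risk of the wider key can be shown with a small replay, sketched here under assumptions: events are plain `(action, alert_id, instance_id)` tuples, the `replay` function and the sample IDs are hypothetical, and the incident store is just a dictionary rather than any real service.

```python
def replay(events, scope):
    """Replay (action, alert_id, instance_id) events and return incident states.

    scope is "alert" or "instance" and selects which ID acts as the dedup key.
    """
    incidents = {}
    for action, alert_id, instance_id in events:
        key = alert_id if scope == "alert" else instance_id
        if action == "trigger":
            incidents[key] = "open"
        elif action == "resolve" and key in incidents:
            incidents[key] = "resolved"
    return incidents


events = [
    ("trigger", "cpu-alert", "host-a"),
    ("trigger", "cpu-alert", "host-b"),
    ("resolve", "cpu-alert", "host-a"),  # only host-a has recovered
]

by_alert = replay(events, "alert")
by_instance = replay(events, "instance")
```

Keyed by alert ID, the single shared incident is marked resolved even though host-b is still active. Keyed by instance ID, only the host-a incident is resolved and host-b stays open.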

elasticmachine · 2020-10-26T20:22:07Z

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

flaper87 · 2020-10-28T09:32:35Z

> The default seems very likely to be context dependent on the customer.

I'd phrase this differently and say that it's dependent on the context of the alert itself, and not so much on the customer (intended as the infrastructure being monitored). The relevance of this is that the context is built from the data being monitored. Unless this is what you meant, of course. :D

> It does seem like it would make sense that if a PD action was added to a "resolved" action group, and a dedup key was provided (I believe it's required when using the resolve action), that a PD action used in an "active" action group should use the same context variable as the "resolved" one. You'd almost want a validation of that. But of course, "it depends". You could certainly come up with some use case where a customer would want something different.
>
> It feels to me like we'll need some special doc on this in the PD action - and if the other incident management actions (servicenow, resilient, jira) have similar sorts of "capabilities", we'd need it there as well. Where we can explain these flows.

I started writing a bunch of text explaining a bit more about Riemann internals and how we do things but ended up writing something way more verbose and complex than what you are probably after here. So, here are a few examples of deduplication keys that we create, based on the event data. All the deduplication keys are computed at runtime. This is to say, we don't have any hardcoded deduplication keys as they all depend on the event's data and the monitoring logic.

- Disk running out of space: This alert is based on Metricbeat data and the key looks something like "metricbeat ${metricset} ${mount_point}@${host}" (for example, metricbeat system.filesystem /mnt@host-where-metricbeat-runs).
- Not enough Logstash instances: We fire this when the number of available Logstash instances for Llama goes under a specific threshold. This is interesting because
- Heartbeat for a service: We fire this alert when at least 75% of the heartbeats have a down state. The deduplication key is computed with "heartbeat ${monitor_type}@${monitor_name}". Example: heartbeat http@infra.inventory
- Gobld errors: Triggered when a CI deployment has been failing consistently. The deduplication key is simpler in this case: "gobld errors@{CI hostname}". Example: gobld errors@apm.ci.elastic.co

All the above data exists in the event already, which is available in the context of the alert logic. This data would be indexed in the cluster if we were using Kibana Alert. The more data we can make available in the context of the alert, the better. Having access to the data in the records being queried would be key.
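As a rough illustration of computing keys at runtime from event data, the sketch below renders the example patterns above with Python's `string.Template`. This is not how Riemann does it; the field names (`mount_point`, `monitor_type`, `ci_hostname`), the `DEDUP_TEMPLATES` table, and the `dedup_key` helper are assumptions made for the example.

```python
from string import Template

# One template per alert kind; placeholders are filled from the event's fields.
DEDUP_TEMPLATES = {
    "disk": Template("metricbeat ${metricset} ${mount_point}@${host}"),
    "heartbeat": Template("heartbeat ${monitor_type}@${monitor_name}"),
    "gobld": Template("gobld errors@${ci_hostname}"),
}


def dedup_key(kind, event):
    """Render the dedup key for an event.

    substitute() raises KeyError when the event lacks a field the template
    needs, so a missing field fails loudly instead of producing a partial key.
    """
    return DEDUP_TEMPLATES[kind].substitute(event)


disk_key = dedup_key(
    "disk",
    {
        "metricset": "system.filesystem",
        "mount_point": "/mnt",
        "host": "host-where-metricbeat-runs",
    },
)
heartbeat_key = dedup_key(
    "heartbeat", {"monitor_type": "http", "monitor_name": "infra.inventory"}
)
gobld_key = dedup_key("gobld", {"ci_hostname": "apm.ci.elastic.co"})
```

The point of the design is that the key depends only on data already present in the event, so repeated events for the same host and mount point always map to the same incident.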

> Also, PD provides customization of some of this stuff on their end, regarding what happens to incidents posted with a dedupkey that have already been resolved - open it again, just append the incident but leave it resolved, etc.

In our case, a new incident is always opened if there is no open incident with the same deduplication key.

> It feels like setting the dedupkey to the narrowest possible value would be the "safest" thing - ie, the alert instance. Otherwise, incidents posted using the alert id, and then resolved by that same alert id, are going to end up resolving things that perhaps shouldn't be resolved. Again, "it depends".

If a default will be provided, then I think using the narrowest option is correct. However, I would go as far as saying that perhaps a default is not really needed and a deduplication key should be required. I think this would remove surprises for users. Regardless, it is important to allow for the deduplication key to be customized.
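A default that users can still override could look roughly like the sketch below. It assumes a mustache-style `{{alertInstanceId}}` placeholder as the default, which is the variable named in the title of the PR linked from this issue; the tiny regex renderer, the function names, and the `alertId` variable are hypothetical and are not Kibana's implementation.

```python
import re

# Assumed default: the narrowest scope, one key per alert instance.
DEFAULT_DEDUP_KEY = "{{alertInstanceId}}"

_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template, variables):
    """Replace each {{name}} with variables[name]; unknown names raise KeyError."""
    return _PLACEHOLDER.sub(lambda m: str(variables[m.group(1)]), template)


def effective_dedup_key(user_value, variables):
    """Use the user's template when provided, otherwise fall back to the default."""
    template = user_value if user_value else DEFAULT_DEDUP_KEY
    return render(template, variables)


variables = {"alertId": "cpu-alert", "alertInstanceId": "host-a"}

default_key = effective_dedup_key(None, variables)
overridden_key = effective_dedup_key("{{alertId}}", variables)
custom_key = effective_dedup_key("cpu {{alertInstanceId}}", variables)
```

Making the key required instead, as suggested above, would amount to raising an error in `effective_dedup_key` when `user_value` is empty rather than falling back.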

arisonl · 2020-10-28T09:57:57Z

> It feels like setting the dedupkey to the narrowest possible value would be the "safest" thing - ie, the alert instance.

Narrowest option sounds right, if we have to choose a default. Is an instance_id unique across alerts? Does an instance that goes from Active to Ok and then back to Active retain the same instance_id? How do IDs get generated? For example, I see that instance_ids default to * and there is an _id field too. Clear documentation would help avoid surprises.

mikecote · 2020-10-28T12:12:10Z

++ on using the instance id. We have a roadmap item to allow summarizing the instances in a single action call (#68828), which would allow creating a single incident encapsulating all the instances and letting the end user choose what they want. Also, they could change the default dedupKey / Object ID for now and get similar behaviour.

YulNaumenko · 2020-11-11T22:02:40Z

> EDIT: The same applies to Jira, ServiceNow and IBM Resilient with the Object ID field. This issue should fix all 4 integrations.

There is no ObjectID field in Jira, ServiceNow or IBM Resilient. We have a service param for internal action execution called savedObjectId, but this field is not saved in the incident itself. We can choose some existing Jira, ServiceNow and IBM Resilient fields to store this info, but they will definitely be different fields. Such fields only make sense if we are going to support deduplication for incidents in these external services.

gmmorris mentioned this issue Sep 17, 2020

[Actions] avoids setting a default dedupKey on PagerDuty #77773

Merged

2 tasks

kertal added the Feature:Alerting label Sep 23, 2020

gmmorris mentioned this issue Oct 22, 2020

[Meta] Alerting-Infra requirements #76914

Open

mikecote added the Team:ResponseOps label Oct 26, 2020

mikecote changed the title ~~Alerts don't group their Pager Duty executions by a shared dedupKey~~ Alerts don't group their incidents by a dedupKey / Object ID Oct 30, 2020

YulNaumenko self-assigned this Nov 6, 2020

YulNaumenko mentioned this issue Nov 11, 2020

Ability to resolve ServiceNow, IBM Resilient and Jira incidents #83221

Open

YulNaumenko mentioned this issue Nov 11, 2020

Added default dedupKey value as an {{alertInstanceId}} to provide grouping functionality for PagerDuty incidents. #83226

Merged

YulNaumenko closed this as completed in #83226 Nov 20, 2020

YulNaumenko mentioned this issue Dec 1, 2020

Added default dedupKey value as an {{alertInstanceId}} to provide grouping functionality for PagerDuty incidents. #84598

Merged

kobelb added the needs-team label Jan 31, 2022

botelastic bot removed the needs-team label Jan 31, 2022
