Alerts don't group their incidents by a dedupKey / Object ID #77772

Closed
gmmorris opened this issue Sep 17, 2020 · 6 comments · Fixed by #83226
Assignees
Labels
Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@gmmorris
Contributor

gmmorris commented Sep 17, 2020

This is a follow up from #76908.

Each execution of the PagerDuty action is seen by the PagerDuty service as a unique incident.
To group incidents together a user can set the optional dedupKey parameter, but this means that the default behaviour is that multiple executions on the same alert will not be grouped.
This will become especially painful when the Action On Resolve issue is addressed, as a user could drag two PagerDuty actions onto the same alert (one to open an incident and one to resolve it) and by default they will not be grouped, meaning the resolution won't actually resolve the incident in PagerDuty.

We should group these by default (but allow users to override the default value), probably based on either the Alert ID or the AlertInstance ID (I can see arguments for both).

EDIT: The same applies to Jira, ServiceNow and IBM Resilient with the Object ID field. This issue should fix all 4 integrations.
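
As a minimal sketch of the default proposed above, the dedup key could be derived from the alert and alert instance IDs unless the user supplies their own value. The `AlertExecution` shape and the `defaultDedupKey` and `buildPagerDutyEvent` helpers are hypothetical and are not Kibana's actual types or API; the `event_action`, `dedup_key` and `payload` field names are assumed to follow the PagerDuty Events API v2, with `routing_key` omitted for brevity.

```typescript
// Hypothetical shapes for illustration; these are not Kibana's actual types.
interface AlertExecution {
  alertId: string;
  alertInstanceId: string;
}

type EventAction = "trigger" | "resolve";

// Field names assumed to follow the PagerDuty Events API v2 (routing_key omitted).
interface PagerDutyEvent {
  event_action: EventAction;
  dedup_key: string;
  payload?: {
    summary: string;
    source: string;
    severity: "critical" | "error" | "warning" | "info";
  };
}

// Default key: scoped to a single alert instance (one possible choice).
function defaultDedupKey(execution: AlertExecution): string {
  return execution.alertId + ":" + execution.alertInstanceId;
}

function buildPagerDutyEvent(
  execution: AlertExecution,
  action: EventAction,
  summary: string,
  dedupKeyOverride?: string
): PagerDutyEvent {
  // A non-empty user-provided key wins; otherwise fall back to the default.
  const dedupKey =
    dedupKeyOverride !== undefined && dedupKeyOverride !== ""
      ? dedupKeyOverride
      : defaultDedupKey(execution);
  const event: PagerDutyEvent = { event_action: action, dedup_key: dedupKey };
  if (action === "trigger") {
    // Only trigger events carry a payload in this sketch.
    event.payload = { summary: summary, source: "kibana", severity: "error" };
  }
  return event;
}
```

With a shared default, a "trigger" action and a "resolve" action on the same alert instance produce the same `dedup_key`, so the resolve event can find the incident that the trigger event opened.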

@pmuellr
Member

pmuellr commented Oct 22, 2020

The default seems very likely to be context dependent on the customer.

It does seem like it would make sense that if a PD action was added to a "resolved" action group, and a dedup key was provided (I believe it's required when using the resolve action), that a PD action used in an "active" action group should use the same context variable as the "resolved" one. You'd almost want a validation of that. But of course, "it depends". You could certainly come up with some use case where a customer would want something different.

It feels to me like we'll need some special doc on this in the PD action - and if the other incident management actions (servicenow, resilient, jira) have similar sorts of "capabilities", we'd need it there as well. Where we can explain these flows.

Also, PD provides customization of some of this stuff on their end, regarding what happens to incidents posted with a dedupkey that have already been resolved - open it again, just append the incident but leave it resolved, etc.

It feels like setting the dedupkey to the narrowest possible value would be the "safest" thing - ie, the alert instance. Otherwise, incidents posted using the alert id, and then resolved by that same alert id, are going to end up resolving things that perhaps shouldn't be resolved. Again, "it depends".
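
The concern about key scope can be illustrated with a small simulation. This is only a sketch: the `keyByAlert` and `keyByInstance` functions and the in-memory record of open incidents are hypothetical, and real PagerDuty behaviour depends on how the service is configured.

```typescript
type KeyFn = (alertId: string, instanceId: string) => string;

// Coarse key: every instance of an alert shares one incident.
const keyByAlert: KeyFn = (alertId) => alertId;
// Narrow key: each alert instance gets its own incident.
const keyByInstance: KeyFn = (alertId, instanceId) => alertId + ":" + instanceId;

// Returns the dedup keys of incidents still open after host-a recovers.
function openAfterOneRecovery(keyFn: KeyFn): string[] {
  const open: Record<string, boolean> = {};
  // Two instances of the same alert become active.
  open[keyFn("cpu-alert", "host-a")] = true;
  open[keyFn("cpu-alert", "host-b")] = true;
  // Only host-a recovers, so only its key is resolved.
  delete open[keyFn("cpu-alert", "host-a")];
  return Object.keys(open);
}

const coarse = openAfterOneRecovery(keyByAlert);    // []
const narrow = openAfterOneRecovery(keyByInstance); // ["cpu-alert:host-b"]
```

With the alert-level key, resolving host-a also closes the only incident that covered host-b, even though host-b is still active. With the instance-level key, the host-b incident stays open.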

@mikecote mikecote added the Team:ResponseOps label Oct 26, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@flaper87

> The default seems very likely to be context dependent on the customer.

I'd phrase this differently and say that it's dependent on the context of the alert itself, and not so much on the customer (intended as the infrastructure being monitored). The relevance of this is that the context is built from the data being monitored. Unless this is what you meant, of course. :D

> It does seem like it would make sense that if a PD action was added to a "resolved" action group, and a dedup key was provided (I believe it's required when using the resolve action), that a PD action used in an "active" action group should use the same context variable as the "resolved" one. You'd almost want a validation of that. But of course, "it depends". You could certainly come up with some use case where a customer would want something different.
>
> It feels to me like we'll need some special doc on this in the PD action - and if the other incident management actions (servicenow, resilient, jira) have similar sorts of "capabilities", we'd need it there as well. Where we can explain these flows.

I started writing a bunch of text explaining a bit more about Riemann internals and how we do things but ended up writing something way more verbose and complex than what you are probably after here. So, here are a few examples of deduplication keys that we create, based on the event data. All the deduplication keys are computed at runtime. This is to say, we don't have any hardcoded deduplication keys as they all depend on the event's data and the monitoring logic.

  • Disk running out of space: This alert is based on metricbeat data and its key looks something like "metricbeat ${metricset} ${mount_point}@${host}" (for example, metricbeat system.filesystem /mnt@host-where-metricbeat-runs).

  • Not enough Logstash instances: We fire this when the number of available logstash instances for Llama goes under a specific threshold. This is interesting because

  • Heartbeat for a service: We fire this alert when at least 75% of the heartbeats have a down state. The deduplication key is computed with "heartbeat ${monitor type}@${monitor_name}". Example: heartbeat http@infra.inventory

  • Gobld errors: Triggered when a CI deployment has been failing consistently. The deduplication key is simpler in this case: "gobld errors@{CI hostname}". Example: gobld errors@apm.ci.elastic.co

All the above data exists in the event already, which is available in the context of the alert logic. This data would be indexed in the cluster if we were using Kibana Alert. The more data we can make available in the context of the alert, the better. Having access to the data in the records being queried would be key.
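
To make the runtime computation concrete, here is a minimal sketch of rendering a deduplication key from a template and the event's data. The `renderDedupKey` function and the flat `EventData` shape are hypothetical; this is not how Riemann or Kibana implement it, and the field names are only illustrative.

```typescript
// Hypothetical flat view of the event fields available to the alert logic.
type EventData = Record<string, string>;

// Replaces each ${field} placeholder with the matching value from the event.
function renderDedupKey(template: string, event: EventData): string {
  return template.replace(/\$\{([^}]+)\}/g, (_match: string, field: string) => {
    const value = event[field];
    if (value === undefined) {
      // Failing loudly avoids silently grouping unrelated incidents.
      throw new Error('Missing field "' + field + '" for deduplication key');
    }
    return value;
  });
}

const diskKey = renderDedupKey("metricbeat ${metricset} ${mount_point}@${host}", {
  metricset: "system.filesystem",
  mount_point: "/mnt",
  host: "host-where-metricbeat-runs",
});
// diskKey === "metricbeat system.filesystem /mnt@host-where-metricbeat-runs"
```

Throwing on a missing field is a design choice made for this sketch. Substituting an empty string instead would risk merging incidents from different hosts under one key.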

> Also, PD provides customization of some of this stuff on their end, regarding what happens to incidents posted with a dedupkey that have already been resolved - open it again, just append the incident but leave it resolved, etc.

In our case, a new incident is always opened if there is no incident opened that has the same deduplication key.
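
One reading of that behaviour, sketched as an in-memory store: a trigger is appended to an open incident with the same key, and a new incident is opened otherwise, including after an earlier incident with that key was resolved. The `IncidentStore` class is hypothetical and does not model PagerDuty's configurable handling of resolved incidents.

```typescript
interface Incident {
  id: number;
  dedupKey: string;
  status: "open" | "resolved";
  eventCount: number;
}

class IncidentStore {
  private incidents: Incident[] = [];

  // Finds the open incident for a key, if there is one.
  private findOpen(dedupKey: string): Incident | undefined {
    return this.incidents.filter(
      (incident) => incident.dedupKey === dedupKey && incident.status === "open"
    )[0];
  }

  // Appends to the open incident with this key, or opens a new one.
  trigger(dedupKey: string): Incident {
    const existing = this.findOpen(dedupKey);
    if (existing !== undefined) {
      existing.eventCount += 1;
      return existing;
    }
    const created: Incident = {
      id: this.incidents.length + 1,
      dedupKey: dedupKey,
      status: "open",
      eventCount: 1,
    };
    this.incidents.push(created);
    return created;
  }

  // Resolves the open incident with this key; returns false if none is open.
  resolve(dedupKey: string): boolean {
    const existing = this.findOpen(dedupKey);
    if (existing === undefined) {
      return false;
    }
    existing.status = "resolved";
    return true;
  }
}
```

Because resolved incidents are ignored by the lookup, a trigger that arrives after a resolve opens a fresh incident rather than reopening the old one.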

> It feels like setting the dedupkey to the narrowest possible value would be the "safest" thing - ie, the alert instance. Otherwise, incidents posted using the alert id, and then resolved by that same alert id, are going to end up resolving things that perhaps shouldn't be resolved. Again, "it depends".

If a default will be provided, then I think using the narrowest option is correct. However, I would go as far as saying that perhaps a default is not really needed and a deduplication key should be required. I think this would remove surprises for users. Regardless, it is important to allow for the deduplication key to be customized.

@arisonl
Contributor

arisonl commented Oct 28, 2020

> It feels like setting the dedupkey to the narrowest possible value would be the "safest" thing - ie, the alert instance.

The narrowest option sounds right, if we have to choose a default. Is an instance_id unique across alerts? Does an instance that goes from Active to Ok and then back to Active retain the same instance_id? How do IDs get generated? For example, I see that instance_ids default to * and there is an _id field too. Clear documentation would help avoid surprises.

Screenshot 2020-11-02 at 18 43 00

@mikecote
Contributor

mikecote commented Oct 28, 2020

++ on using the instance id. We have a roadmap item to allow summarizing the instances in a single action call (#68828), which would allow creating a single incident encapsulating all the instances and let the end user choose what they want. Also, they could change the default dedupKey / Object ID for now and get similar behaviour.

@mikecote mikecote changed the title from "Alerts don't group their Pager Duty executions by a shared dedupKey" to "Alerts don't group their incidents by a dedupKey / Object ID" Oct 30, 2020
@YulNaumenko YulNaumenko self-assigned this Nov 6, 2020
@YulNaumenko
Contributor

YulNaumenko commented Nov 11, 2020

> EDIT: The same applies to Jira, ServiceNow and IBM Resilient with the Object ID field. This issue should fix all 4 integrations.

There is no ObjectID field in Jira, ServiceNow or IBM Resilient. We have a service param for internal action execution needs called savedObjectId, but this field is not saved in the incidents themselves. We can choose some existing Jira, ServiceNow and IBM Resilient fields to store this info, but they will definitely be different fields. Such fields only make sense if we are going to support deduplication for incidents in these external services.

9 participants