slack action not reporting rate-limiting as it should #44127

Closed
pmuellr opened this issue Aug 27, 2019 · 9 comments
Labels
bug Fixes for quality problems that affect the customer experience Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

pmuellr (Member) commented Aug 27, 2019

I've set up a scenario to test what happens when Slack returns a rate-limiting response to an action.

setup:

  • create a Slack incoming webhook at slackhq
  • create an action of the .slack action type, using the webhook URL, with a secret of webhookUrl (the value being the entire URL)
  • create an alert using an always-firing alert type, with a 1s interval, so that Kibana posts to Slack as frequently as it can
  • separately, run autocannon against the Slack URL to force rate-limiting, which will then also be in effect when the alert fires

Assuming your webhook URL is available in the env var SLACK_WEBHOOKURL, these are the commands to create the action and alert, and then run autocannon:

$ kbn-action create .slack slack {} "{webhookUrl: '$SLACK_WEBHOOKURL'}"
{
    "id": "b97f9ae5-38ab-4bae-ab26-37c56a341fa7",
    "actionTypeId": ".slack",
    "description": "slack",
    "config": {}
}

$ kbn-alert create .always-firing 1s {} "[{group: default, id: 'b97f9ae5-38ab-4bae-ab26-37c56a341fa7', params: {message: via_alert}}]"
...

$ autocannon --amount 10000 --method POST --body '{"text":"via autocannon"}' $SLACK_WEBHOOKURL

When run, the following message is printed in the Kibana console:

server    log   [13:43:57.186] [error][task_manager] Task actions:.slack "3da8bc20-c8f2-11e9-a68f-d137ee6661ab" failed: Error: an error occurred in action "b97f9ae5-38ab-4bae-ab26-37c56a341fa7" posting a slack message, retry at 2019-08-27T17:43:58.185Z: An HTTP protocol error occurred: statusCode = undefined

It's partially correct: it got a retry date! The problem is the end of the message; it should indicate that the request was rate-limited, not an HTTP protocol error.

Or perhaps something more nefarious is going on, but I doubt it, as the case where a retry date is returned is pretty constrained ...
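
For reference, Slack documents its rate-limiting behavior as an HTTP 429 Too Many Requests response with a Retry-After header (in seconds). A quick way to see the raw response while the webhook is being hammered is to include the response headers in a curl call (a diagnostic sketch, reusing the SLACK_WEBHOOKURL env var from above; not part of the repro steps):

$ curl --silent --include \
    --header 'Content-Type: application/json' \
    --data '{"text":"rate-limit probe"}' \
    "$SLACK_WEBHOOKURL"

Once the limit kicks in, the status line switches from 200 to 429, and that Retry-After value is presumably where the retry date in the log message above comes from.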

elasticmachine (Contributor)

Pinging @elastic/kibana-stack-services

@pmuellr pmuellr added the release_note:skip Skip the PR/issue when compiling release notes label Aug 27, 2019
pmuellr (Member, Author) commented Aug 27, 2019

I created a better always-firing alert that provides a counter, to see if we lose any messages during the rate-limiting process. Creating it looks like this:

$ kbn-alert create .always-firing 1s {} "{group:default id:'c4bee054-f51e-4ef1-88bd-ac010e720097' params:{message: 'via alert; {{context.date}} {{context.count}}'}}"

In slack, I saw the following:

via alert; 2019-08-27T18:06:02.590Z 10
via autocannon 
... (repeated 95 times)
via alert; 2019-08-27T18:06:05.637Z 11
via autocannon
... (repeated 199 times)
via alert; 2019-08-27T18:06:08.721Z 12
via autocannon
... (repeated 23 times)
via alert; 2019-08-27T18:06:24.297Z 17
via alert; 2019-08-27T18:06:27.349Z 18

So we lost messages 13-16 during the rate-limiting; there were also 4 server log messages as shown ^^^. So it seems like the right message got printed, probably from the slack action type, but it didn't actually go into rate-limiting mode ...

@pmuellr pmuellr removed release_note:skip Skip the PR/issue when compiling release notes v7.5.0 v8.0.0 labels Aug 27, 2019
@pmuellr pmuellr added this to Backlog in Make it Action Aug 28, 2019
@mikecote mikecote moved this from Backlog to Bugs in Make it Action Dec 6, 2019
@bmcconaghy bmcconaghy added Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) and removed Team:Stack Services labels Dec 12, 2019
@mikecote mikecote added the bug Fixes for quality problems that affect the customer experience label Dec 13, 2019
@mikecote mikecote moved this from Bugs to Bugs & Code Debt in Make it Action Dec 13, 2019
@bmcconaghy bmcconaghy added (Deprecated) Team:Reporting Services Use Team:Global Experiance and Feature:Reporting instead. and removed Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Dec 20, 2019
elasticmachine (Contributor)

Pinging @elastic/kibana-reporting-services (Team:Reporting Services)

@mikecote mikecote moved this from Bugs & Code Debt to To-Do 7.7 (Ordered by priority) in Make it Action Mar 18, 2020
pmuellr (Member, Author) commented Mar 20, 2020

Thinking about this for 7.7. Without the event log in place for customers to notice that rate limiting is happening, I'm not sure it's all that important to fix this. But I want to re-test it on the latest code.

For customers: once you get rate-limited by a webhook, the channel it's posting to seems to add a message indicating that too many messages are being sent. So even if we can't be precise here, at least customers should get that feedback.

pmuellr (Member, Author) commented Mar 20, 2020

Huh, I think maybe this is fixed. autocannon does not seem to be able to post to Slack webhooks anymore (some kind of SSL self-signed cert issue or such), so I built a bash script to curl the webhook manually instead, 100 times; a sketch is below.
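
A minimal sketch of that kind of script, reusing the SLACK_WEBHOOKURL env var from the setup above (backgrounding each curl with & so the requests land close together is an assumption, not necessarily what the original script did):

#!/usr/bin/env bash
# post to the Slack incoming webhook 100 times, fast enough to trigger rate-limiting
for i in $(seq 1 100); do
  curl --silent --output /dev/null \
    --header 'Content-Type: application/json' \
    --data "{\"text\":\"via curl $i\"}" \
    "$SLACK_WEBHOOKURL" &
done
wait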

And noticed this in the Kibana log:

server    log   [13:02:00.588] [error][plugins][taskManager][taskManager] Task actions:.slack "82411f40-6acc-11ea-bd72-93d5ed836879" failed: Error: error posting a slack message, retry at 2020-03-20T17:02:01.588Z

But I also noticed this, indicating some cleanup is required in the event log (note the duplicated "retry at ..." clause in the error message):

server    log   [13:02:00.590] [info][eventLog][plugins] event logged: {
  "event": {
    "provider": "actions",
    "action": "execute",
    "start": "2020-03-20T17:02:00.266Z",
    "end": "2020-03-20T17:02:00.588Z",
    "duration": 322000000
  },
  "kibana": {
    "namespace": "default",
    "saved_objects": [
      {
        "type": "action",
        "id": "e98523ad-b13b-48cd-a8cf-343004a343ca"
      }
    ],
    "server_uuid": "5b2de169-2785-441b-ae8c-186a1936b17d"
  },
  "message": "action execution failure: .slack:e98523ad-b13b-48cd-a8cf-343004a343ca: slack-muellerware",
  "error": {
    "message": "error posting a slack message, retry at 2020-03-20T17:02:01.588Z: An HTTP protocol error occurred: statusCode = undefined; retry at 2020-03-20T17:02:01.588Z"
  },
  "@timestamp": "2020-03-20T17:02:00.588Z",
  "ecs": {
    "version": "1.3.1"
  }
}

pmuellr (Member, Author) commented Mar 20, 2020

Actually, now that I look at that event log entry above, it seems like it's coming from the action, so it's probably just SAYING it wants to retry then. I still need to figure out whether it's actually doing the retry or not. I think perhaps I can use the slack server simulator in FTS to help diagnose this; it will be easier to force a 429 there than with the real server.

@mikecote mikecote moved this from To-Do 7.7 (Ordered by priority) to Bugs & Code Debt in Make it Action Mar 24, 2020
elasticmachine (Contributor)

Pinging @elastic/kibana-app-services (Team:AppServices)

@tsullivan tsullivan added Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) and removed (Deprecated) Team:Reporting Services Use Team:Global Experiance and Feature:Reporting instead. Team:AppServices labels Dec 17, 2020
elasticmachine (Contributor)

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@mikecote mikecote removed this from Bugs & Tech Debt in Make it Action Jan 27, 2021
@mikecote mikecote added this to Bugs & Tech Debt in Kibana Alerting Jan 27, 2021
@YulNaumenko YulNaumenko removed this from Bugs & Tech Debt in Kibana Alerting Mar 10, 2021
pmuellr (Member, Author) commented Mar 16, 2021

Closing this issue - we haven't had any reports of this being a problem, and it's not even clear how involved it might be to make it work the way we wanted (if it's not already), as task manager has changed so much since this was first opened.

We're also looking at changing action execution to happen within the alert task execution, instead of as separate tasks - see issue #90888.

Assuming we go down that route, trying to manage retries will become a lot harder - presumably we'd schedule the retries as new one-off tasks? For now, per team discussion, we'll shelve that thought till we see some actual requirements for it.

@pmuellr pmuellr closed this as completed Mar 16, 2021
@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022