slack action not reporting rate-limiting as it should #44127

Closed
pmuellr opened this issue Aug 27, 2019 · 9 comments
Labels
bug Fixes for quality problems that affect the customer experience Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

pmuellr (Member) commented Aug 27, 2019

I've set up a scenario to test what happens when Slack returns a rate-limiting response to an action.

setup:

  • create a Slack incoming webhook at slackhq
  • create an action of the .slack action type, using the webhook URL, with a secret of webhookUrl (the value being the entire URL)
  • create an alert using an always-firing alert type, with a 1s interval, so that Kibana posts to Slack as frequently as it can
  • separately, run autocannon against the Slack URL to force rate-limiting, which will then also be in effect when the alert fires

Assuming your webhook URL is available in the env var SLACK_WEBHOOKURL, these are the commands to create the action and alert, and then run autocannon:

$ kbn-action create .slack slack {} "{webhookUrl: '$SLACK_WEBHOOKURL'}"
{
    "id": "b97f9ae5-38ab-4bae-ab26-37c56a341fa7",
    "actionTypeId": ".slack",
    "description": "slack",
    "config": {}
}

$ kbn-alert create .always-firing 1s {} "[{group: default, id: 'b97f9ae5-38ab-4bae-ab26-37c56a341fa7', params: {message: via_alert}}]"
...

$ autocannon --amount 10000 --method POST --body '{"text":"via autocannon"}' $SLACK_WEBHOOKURL

When run, the following message is printed in the Kibana console:

server    log   [13:43:57.186] [error][task_manager] Task actions:.slack "3da8bc20-c8f2-11e9-a68f-d137ee6661ab" failed: Error: an error occurred in action "b97f9ae5-38ab-4bae-ab26-37c56a341fa7" posting a slack message, retry at 2019-08-27T17:43:58.185Z: An HTTP protocol error occurred: statusCode = undefined

It's partially correct: it got a retry date! The problem is the end of the message; it should indicate that the request was rate-limited, not an HTTP protocol error.

Or perhaps something more nefarious is going on, but I doubt it, as the case where a retry date is returned is pretty constrained ...
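
For reference, Slack documents its rate-limiting behavior as an HTTP 429 Too Many Requests response with a Retry-After header (in seconds). A quick way to see the raw response while the webhook is being hammered is to include the response headers in a curl call (a diagnostic sketch, reusing the SLACK_WEBHOOKURL env var from above; not part of the repro steps):

$ curl --silent --include \
    --header 'Content-Type: application/json' \
    --data '{"text":"rate-limit probe"}' \
    "$SLACK_WEBHOOKURL"

Once the limit kicks in, the status line switches from 200 to 429, and that Retry-After value is presumably where the retry date in the log message above comes from.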

elasticmachine (Contributor)

Pinging @elastic/kibana-stack-services

@pmuellr pmuellr added the release_note:skip Skip the PR/issue when compiling release notes label Aug 27, 2019
pmuellr (Member, Author) commented Aug 27, 2019

I created a better always-firing alert that provides a counter, to see if we lose any messages during the rate-limiting process. Creating it looks like this:

$ kbn-alert create .always-firing 1s {} "{group:default id:'c4bee054-f51e-4ef1-88bd-ac010e720097' params:{message: 'via alert; {{context.date}} {{context.count}}'}}"

In slack, I saw the following:

via alert; 2019-08-27T18:06:02.590Z 10
via autocannon 
... (repeated 95 times)
via alert; 2019-08-27T18:06:05.637Z 11
via autocannon
... (repeated 199 times)
via alert; 2019-08-27T18:06:08.721Z 12
via autocannon
... (repeated 23 times)
via alert; 2019-08-27T18:06:24.297Z 17
via alert; 2019-08-27T18:06:27.349Z 18

So we lost messages 13-16 during the rate-limiting; there were also 4 server log messages as shown ^^^. So it seems like the right message got printed, probably from the slack action type, but it didn't actually go into rate-limiting mode ...

@pmuellr pmuellr removed release_note:skip Skip the PR/issue when compiling release notes v7.5.0 v8.0.0 labels Aug 27, 2019
@pmuellr pmuellr added this to Backlog in Make it Action Aug 28, 2019
@mikecote mikecote moved this from Backlog to Bugs in Make it Action Dec 6, 2019
@bmcconaghy bmcconaghy added Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) and removed Team:Stack Services labels Dec 12, 2019
@mikecote mikecote added the bug Fixes for quality problems that affect the customer experience label Dec 13, 2019
@mikecote mikecote moved this from Bugs to Bugs & Code Debt in Make it Action Dec 13, 2019
@bmcconaghy bmcconaghy added (Deprecated) Team:Reporting Services Use Team:Global Experiance and Feature:Reporting instead. and removed Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Dec 20, 2019
elasticmachine (Contributor)

Pinging @elastic/kibana-reporting-services (Team:Reporting Services)

@mikecote mikecote moved this from Bugs & Code Debt to To-Do 7.7 (Ordered by priority) in Make it Action Mar 18, 2020
pmuellr (Member, Author) commented Mar 20, 2020

Thinking about this for 7.7. Without the event log in place for customers to notice that rate limiting is happening, I'm not sure it's all that important to fix this. But I want to re-test it on the latest code.

For customers: once you get rate-limited by a webhook, the channel it's posting to seems to add a message indicating that too many messages are being sent. So even if we can't be precise here, at least customers should get that feedback.

pmuellr (Member, Author) commented Mar 20, 2020

Huh, I think maybe this is fixed. autocannon does not seem to be able to post to Slack webhooks anymore (some kind of SSL self-signed cert issue or such), so I built a bash script to curl the webhook manually instead, 100 times; a sketch is below.
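
A minimal sketch of that kind of script, reusing the SLACK_WEBHOOKURL env var from the setup above (backgrounding each curl with & so the requests land close together is an assumption, not necessarily what the original script did):

#!/usr/bin/env bash
# post to the Slack incoming webhook 100 times, fast enough to trigger rate-limiting
for i in $(seq 1 100); do
  curl --silent --output /dev/null \
    --header 'Content-Type: application/json' \
    --data "{\"text\":\"via curl $i\"}" \
    "$SLACK_WEBHOOKURL" &
done
wait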

And noticed this in the Kibana log:

server    log   [13:02:00.588] [error][plugins][taskManager][taskManager] Task actions:.slack "82411f40-6acc-11ea-bd72-93d5ed836879" failed: Error: error posting a slack message, retry at 2020-03-20T17:02:01.588Z

But I also noticed this, indicating some cleanup is required in the event log (note the duplicated "retry at ..." clause in the error message):

server    log   [13:02:00.590] [info][eventLog][plugins] event logged: {
  "event": {
    "provider": "actions",
    "action": "execute",
    "start": "2020-03-20T17:02:00.266Z",
    "end": "2020-03-20T17:02:00.588Z",
    "duration": 322000000
  },
  "kibana": {
    "namespace": "default",
    "saved_objects": [
      {
        "type": "action",
        "id": "e98523ad-b13b-48cd-a8cf-343004a343ca"
      }
    ],
    "server_uuid": "5b2de169-2785-441b-ae8c-186a1936b17d"
  },
  "message": "action execution failure: .slack:e98523ad-b13b-48cd-a8cf-343004a343ca: slack-muellerware",
  "error": {
    "message": "error posting a slack message, retry at 2020-03-20T17:02:01.588Z: An HTTP protocol error occurred: statusCode = undefined; retry at 2020-03-20T17:02:01.588Z"
  },
  "@timestamp": "2020-03-20T17:02:00.588Z",
  "ecs": {
    "version": "1.3.1"
  }
}

pmuellr (Member, Author) commented Mar 20, 2020

Actually, now that I look at that event log entry above, it seems like it's coming from the action, so it's probably just SAYING it wants to retry then. I still need to figure out whether it's actually doing the retry or not. I think perhaps I can use the slack server simulator in FTS to help diagnose this; it will be easier to force a 429 there than with the real server.

@mikecote mikecote moved this from To-Do 7.7 (Ordered by priority) to Bugs & Code Debt in Make it Action Mar 24, 2020
elasticmachine (Contributor)

Pinging @elastic/kibana-app-services (Team:AppServices)

@tsullivan tsullivan added Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) and removed (Deprecated) Team:Reporting Services Use Team:Global Experiance and Feature:Reporting instead. Team:AppServices labels Dec 17, 2020
elasticmachine (Contributor)

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@mikecote mikecote removed this from Bugs & Tech Debt in Make it Action Jan 27, 2021
@mikecote mikecote added this to Bugs & Tech Debt in Kibana Alerting Jan 27, 2021
@YulNaumenko YulNaumenko removed this from Bugs & Tech Debt in Kibana Alerting Mar 10, 2021
pmuellr (Member, Author) commented Mar 16, 2021

Closing this issue - we haven't had any reports of this being a problem, and it's not even clear how involved it might be to make it work the way we wanted (if it's not already), as task manager has changed so much since this was first opened.

We're also looking at changing action execution to happen within the alert task execution, instead of as separate tasks - see issue #90888.

Assuming we go down that route, trying to manage retries will become a lot harder - presumably we'd schedule the retries as new one-off tasks? For now, per team discussion, we'll shelve that thought till we see some actual requirements for it.

@pmuellr pmuellr closed this as completed Mar 16, 2021
@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022