Change interval to 1m for etcdHighNumberOfFailedGRPCRequests critical #12178

gotjosh · 2020-07-28T10:26:29Z

We've observed erratic behaviour from this alert during rollouts. On every occasion, the alert wasn't sustained for more than a few minutes and we flagged it as a false positive. I think this happens because we use the 5m rate with a 5-minute alert duration, so the alert only gets to see 1 sample.

We have a couple of options: double the alert duration (from 5m to 10m), which seems unfeasible given the criticality of this alert, or use the 1m rate for 5m. The latter feels like a minimal change, and it didn't cause the alert to fire (for false positives) on occasions where the current one would.
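One way to see why the rate window matters, separate from the sample-count explanation above, is a toy model of the failure ratio. The sketch below is an illustration only: it uses made-up per-minute request counts and a plain trailing-window sum, not Prometheus's actual `rate()` (which extrapolates between scraped samples). The traffic numbers, the 2-minute rollout, and both helper functions are assumptions introduced for this example.

```python
def windowed_error_percent(failed, total, window):
    """Percent of failed requests over the trailing `window` minutes, per minute."""
    out = []
    for t in range(len(total)):
        lo = max(0, t - window + 1)
        f = sum(failed[lo:t + 1])
        n = sum(total[lo:t + 1])
        out.append(100.0 * f / n if n else 0.0)
    return out


def longest_run_above(values, threshold):
    """Length of the longest streak of consecutive values above `threshold`."""
    best = run = 0
    for v in values:
        run = run + 1 if v > threshold else 0
        best = max(best, run)
    return best


# Hypothetical traffic: 100 requests/minute for 30 minutes, with 60 failures
# per minute during a 2-minute rollout (minutes 10 and 11).
total = [100] * 30
failed = [60 if t in (10, 11) else 0 for t in range(30)]

THRESHOLD = 5  # matches the "> 5" in the alert expression

run_5m = longest_run_above(windowed_error_percent(failed, total, 5), THRESHOLD)
run_1m = longest_run_above(windowed_error_percent(failed, total, 1), THRESHOLD)
print(run_5m, run_1m)
```

In this model the 2-minute burst keeps the 5m ratio above the threshold for 6 consecutive one-minute evaluations, which could be enough to outlast a 5-minute alert duration, while the 1m ratio is above the threshold for only 2 evaluations.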

Apologies for not creating an issue first, given the minimal diff of the change I thought this would be better discussed inside of a pull request.

current interval (5m)

proposed interval (1m)

Signed-off-by: gotjosh josue@grafana.com

We've observed erratic behaviour from this alert during rollouts. On every occasion, the alert wasn't sustained for more than a few minutes and we flagged it as a false positive. Because we're using the 5m rate for 5 minutes, the alert only gets to see 1 sample. We could double the alert duration (from 5m to 10m), but that seems unfeasible given the criticality of this alert, or use the 1m rate for 5m. The latter feels like a minimal change, and it didn't cause the alert to fire on occasions where the current one would. Signed-off-by: gotjosh <josue@grafana.com>

pracucci

👍

gotjosh · 2020-08-03T12:27:05Z

Just an FYI that the CI failure seems unrelated.

gotjosh · 2020-08-18T08:45:33Z

@xiang90 apologies for pinging you directly; any chance you could take a look at this? 🙏

stale · 2020-11-16T10:02:23Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

pracucci · 2020-11-17T09:20:55Z

Still valid for us

ptabor · 2021-02-01T20:35:07Z

Documentation/etcd-mixin/mixin.libsonnet

                /
-              sum(rate(grpc_server_handled_total{%(etcd_selector)s}[5m])) without (grpc_type, grpc_code)
+              sum(rate(grpc_server_handled_total{%(etcd_selector)s}[1m])) without (grpc_type, grpc_code)
                > 5
            ||| % $._config,
            'for': '5m',


Does this mean that, in the worst case, the service might be unhealthy 80% of the time and the alert would not trigger?

4 min of failures, 1 min below threshold, 4 min of failures, ...

Shall we decrease it to 2 or 3 min?

[marking as a requested change]
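The worst case described above can be sketched with a small simulation. This is a simplified model, not Prometheus itself: it assumes the rule is evaluated once per minute, that the 1m-rate expression is above the threshold during each failing minute, and that the alert fires once the condition has held at every evaluation for the configured duration. The function name and the one-hour pattern are hypothetical.

```python
def first_firing_minute(above, for_minutes):
    """Return the first minute at which the alert would fire, or None.

    Simplified model: the alert becomes pending when the condition first
    holds, fires once it has held continuously for `for_minutes`, and any
    evaluation below the threshold resets it.
    """
    pending_since = None
    for t, is_above in enumerate(above):
        if not is_above:
            pending_since = None
            continue
        if pending_since is None:
            pending_since = t
        if t - pending_since >= for_minutes:
            return t
    return None


# 4 minutes above the threshold, 1 minute below, repeated for an hour.
pattern = [True, True, True, True, False]
above = pattern * 12

unhealthy_share = sum(above) / len(above)
fires = {m: first_firing_minute(above, m) for m in (5, 3, 2)}
print(unhealthy_share, fires)
```

Under these assumptions the service is above the threshold 80% of the time, a 5-minute duration never fires, and a 3- or 2-minute duration fires within the first failing stretch.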

ptabor · 2021-02-01T20:37:20Z

Interesting. Thank you for the contribution.
I wonder [a little off topic]: what are the errors (grpc_code & method), and in what circumstances do they happen?
Is it a multi-node cluster being upgraded one-by-one?
I want to make sure there is no bug that we would be hiding by such an alerting change.

ptabor · 2021-02-10T11:20:22Z

[marking as a requested change]

spzala · 2021-02-10T21:43:58Z

@gotjosh thanks for the PR. FYI, the project docs have moved to https://github.com/etcd-io/website/, so this PR should be closed and reopened in the docs repo. /cc @ptabor @nate-double-u

stale · 2021-05-12T04:44:57Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

pracucci approved these changes Aug 3, 2020

stale bot added the stale label Nov 16, 2020

stale bot removed the stale label Nov 17, 2020

ptabor reviewed Feb 1, 2021

nate-double-u mentioned this pull request Feb 2, 2021

Migrate documentation: Remove docs from etcd-io/etcd #12660

Merged

ptabor suggested changes Feb 10, 2021

spzala requested changes Feb 10, 2021

stale bot added the stale label May 12, 2021

stale bot closed this Jun 3, 2021
