New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Documentation/etcd-mixin: Raise etcdHighNumberOfLeaderChanges threshold to 4 #12080
Documentation/etcd-mixin: Raise etcdHighNumberOfLeaderChanges threshold to 4 #12080
Conversation
…ld to 4 A cluster with three members could see three leader changes during a healthy rolling reboot, and we don't want to alert on that. Growing to 4 reduces false-alarms for clusters with three or fewer members, and that's probably most clusters. It will also slightly increase the risk of false-negatives, but if the cluster is struggling with high latency, it seems likely that it would quickly pass the new threshold too. The hard-coded threshold means that we are still likely to get false-positives during rolling reboots of clusters with four or more members. Ideally we'd scale this with the cluster size, or something, but I'm not sure how to do that. Three members is the minimum size for high availability, so reducing false positives for that case seems worth addressing even if we leave larger clusters largely unchanges. Also manually catch etcd3_alert.rules up to speed, since it seems to have been passed over by 16fc8a2 (Documentation/op-guide: Re-generate alert rules and dashboard from mixin, 2020-04-07, etcd-io#11768).
|
lgtm, nice one! To catch roughly the same characteristic we should probably adapt the timing as well, however, I think it's ok to start with what you have now and see if that works.
I think this alert is probably even Side note, we can probably remove the old alerting syntax rules, Prometheus has removed this syntax >2.5 years ago. |
Is that "delete |
|
Let's do that in a follow up, I don't want to distract this contribution with that :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change looks good to me thanks @wking
Semaphore CI failure is flake.
|
merging |
Frederic says [1]: > Side note, we can probably remove the old alerting syntax rules, > Prometheus has removed this syntax >2.5 years ago. [1]: etcd-io#12080 (comment)
…old to 4 Generated with: $ make jsonnet/vendor # to install jb $ jb --version v0.3.1 $ (cd jsonnet && jb update) GET https://github.com/coreos/etcd/archive/429826b46719046e43cfae3f4c37b8cd78985af6.tar.gz 200 GET https://github.com/coreos/kube-prometheus/archive/bce16b41eb10ea6810139496f4025455e70568a7.tar.gz 200 GET https://github.com/coreos/prometheus-operator/archive/8fa7349020073bcc7932abb2bc4bd08e4bc3a0ac.tar.gz 200 GET https://github.com/kubernetes-monitoring/kubernetes-mixin/archive/d353b060f1965cc8ec2dce00c6dd2d565cb1825f.tar.gz 200 GET https://github.com/openshift/openshift-state-metrics/archive/9ba3c5222f789c0b5e0b99a2a648315b39a8dc70.tar.gz 200 GET https://github.com/openshift/telemeter/archive/ca9c4f2c12e19e6fb0dc7b5bb545c7a5b8b193c8.tar.gz 200 GET https://github.com/thanos-io/kube-thanos/archive/0ebda25e5643dea0a37d401bb4a566127a0935f5.tar.gz 200 GET https://github.com/thanos-io/thanos/archive/52e10c6e0f644ea98fd057e7fbece828d8dd07c7.tar.gz 200 GET https://github.com/ksonnet/ksonnet-lib/archive/0d2f82676817bbf9e4acf6495b2090205f323b9f.tar.gz 200 GET https://github.com/prometheus/node_exporter/archive/4ef4548ad5bd4bc11c7e6b2da569cf1aa05be907.tar.gz 200 GET https://github.com/prometheus/prometheus/archive/1f73073d73bf671d545e79335a300f62d050a5f3.tar.gz 200 GET https://github.com/brancz/kubernetes-grafana/archive/57b4365eacda291b82e0d55ba7eec573a8198dda.tar.gz 200 GET https://github.com/kubernetes/kube-state-metrics/archive/d667979ed55ad1c4db44d331b51d646f5b903aa7.tar.gz 200 GET https://github.com/kubernetes/kube-state-metrics/archive/d667979ed55ad1c4db44d331b51d646f5b903aa7.tar.gz 200 GET https://github.com/grafana/grafonnet-lib/archive/5c6e8a8113486cdecd0961730aeaada3e6c69fe7.tar.gz 200 GET https://github.com/grafana/jsonnet-libs/archive/eb710690d60faf16bc29f842e1d4acf875d86826.tar.gz 200 GET https://github.com/kubernetes-monitoring/kubernetes-mixin/archive/685e8ccb62c8eaa67450b412d62163157276e1a2.tar.gz 200 $ touch jsonnet/jsonnetfile.lock.json # recover from any previous flubbed generation which might have left timestamps in place convincing Make it didn't need to rebuild bindata.go $ make generate-in-docker This pulls in: $ hack/jsonnet-lock-diff.py jsonnet/jsonnetfile.lock.json {"git": {"remote": "https://github.com/coreos/etcd", "subdir": "Documentation/etcd-mixin"}} 0c5cffc60 Documentation/etcd-mixin: Raise etcdHighNumberOfLeaderChanges threshold to 4 {"git": {"remote": "https://github.com/coreos/kube-prometheus", "subdir": "jsonnet/kube-prometheus"}} 4d6e3d5 enable etcd latency metrics in kube-apiserver bbd4e61 Bump Grafana version to v6.7.4 {"git": {"remote": "https://github.com/coreos/prometheus-operator", "subdir": "jsonnet/prometheus-operator"}} 0fc8b1be update generated files for pvc metadata fix {"git": {"remote": "https://github.com/grafana/grafonnet-lib", "subdir": "grafonnet"}} 682d817 Fix documentation typo in graph panel ae4e089 Add staircase line and axis decimals options to graph panel ef70649 Fix tablePanel.addTargets fa6b4c3 Move 'SELECT' and 'FROM' tagged query sections from objects to methods 7507d6d Add InfluxDB query editor support 133ea7f Add @name to all function docs 0905313 Fix x_axis_buckets value when histogram mode 76e4948 Add support for sideWidth in graph panels 1c04821 adds alertlist panel type {"git": {"remote": "https://github.com/grafana/jsonnet-libs", "subdir": "grafana-builder"}} 0c8e4d3 allow table link in new tab (openshift#238) {"git": {"remote": "https://github.com/kubernetes/kube-state-metrics", "subdir": "jsonnet/kube-state-metrics"}} 44ab592e Remove prefix from version label f5729a5c add leases collector c210e732 Add security context to deployment and statefulset 6be758aa jsonnet/kube-state-metrics: Add missing volumeattachments resource {"git": {"remote": "https://github.com/kubernetes/kube-state-metrics", "subdir": "jsonnet/kube-state-metrics-mixin"}} 42c8f26b jsonnet/kube-state-metrics-mixin: add default KSM selector e4e9d325 jsonnet/kube-state-metrics-mixin/alerts.libsonnet: Fix selector typo 345ded07 jsonnet,scripts: Add kube-state-metrics alerts mixin {"git": {"remote": "https://github.com/openshift/telemeter", "subdir": "jsonnet/telemeter"}} 6815a7ab jsonnet: Upgrade apps API versions used to apps/v1 48fe1cf5 Stop using service ca from service account token {"git": {"remote": "https://github.com/prometheus/node_exporter", "subdir": "docs/node-mixin"}} bd3e6d2 Add NodeTextFileCollectorScrapeError alert to mixin {"git": {"remote": "https://github.com/prometheus/prometheus", "subdir": "documentation/prometheus-mixin"}} 27b1009ac Rename the dashboard in the mixin to 'Prometheus Overview'. (#7489) 6e7554639 Update Readme since jsonnetfmt is available in the jsonnet go implementation since v0.16.0 The etcd change is from [1]. [1]: etcd-io/etcd#12080
…old to 4 Generated with: $ make jsonnet/vendor # to install jb $ jb --version v0.3.1 $ (cd jsonnet && jb update) GET https://github.com/coreos/etcd/archive/429826b46719046e43cfae3f4c37b8cd78985af6.tar.gz 200 GET https://github.com/coreos/kube-prometheus/archive/bce16b41eb10ea6810139496f4025455e70568a7.tar.gz 200 GET https://github.com/coreos/prometheus-operator/archive/8fa7349020073bcc7932abb2bc4bd08e4bc3a0ac.tar.gz 200 GET https://github.com/kubernetes-monitoring/kubernetes-mixin/archive/d353b060f1965cc8ec2dce00c6dd2d565cb1825f.tar.gz 200 GET https://github.com/openshift/openshift-state-metrics/archive/9ba3c5222f789c0b5e0b99a2a648315b39a8dc70.tar.gz 200 GET https://github.com/openshift/telemeter/archive/ca9c4f2c12e19e6fb0dc7b5bb545c7a5b8b193c8.tar.gz 200 GET https://github.com/thanos-io/kube-thanos/archive/0ebda25e5643dea0a37d401bb4a566127a0935f5.tar.gz 200 GET https://github.com/thanos-io/thanos/archive/52e10c6e0f644ea98fd057e7fbece828d8dd07c7.tar.gz 200 GET https://github.com/ksonnet/ksonnet-lib/archive/0d2f82676817bbf9e4acf6495b2090205f323b9f.tar.gz 200 GET https://github.com/prometheus/node_exporter/archive/4ef4548ad5bd4bc11c7e6b2da569cf1aa05be907.tar.gz 200 GET https://github.com/prometheus/prometheus/archive/1f73073d73bf671d545e79335a300f62d050a5f3.tar.gz 200 GET https://github.com/brancz/kubernetes-grafana/archive/57b4365eacda291b82e0d55ba7eec573a8198dda.tar.gz 200 GET https://github.com/kubernetes/kube-state-metrics/archive/d667979ed55ad1c4db44d331b51d646f5b903aa7.tar.gz 200 GET https://github.com/kubernetes/kube-state-metrics/archive/d667979ed55ad1c4db44d331b51d646f5b903aa7.tar.gz 200 GET https://github.com/grafana/grafonnet-lib/archive/5c6e8a8113486cdecd0961730aeaada3e6c69fe7.tar.gz 200 GET https://github.com/grafana/jsonnet-libs/archive/eb710690d60faf16bc29f842e1d4acf875d86826.tar.gz 200 GET https://github.com/kubernetes-monitoring/kubernetes-mixin/archive/685e8ccb62c8eaa67450b412d62163157276e1a2.tar.gz 200 $ touch jsonnet/jsonnetfile.lock.json # recover from any previous flubbed generation which might have left timestamps in place convincing Make it didn't need to rebuild bindata.go $ make generate-in-docker This pulls in: $ hack/jsonnet-lock-diff.py jsonnet/jsonnetfile.lock.json {"git": {"remote": "https://github.com/coreos/etcd", "subdir": "Documentation/etcd-mixin"}} 0c5cffc60 Documentation/etcd-mixin: Raise etcdHighNumberOfLeaderChanges threshold to 4 {"git": {"remote": "https://github.com/coreos/kube-prometheus", "subdir": "jsonnet/kube-prometheus"}} 4d6e3d5 enable etcd latency metrics in kube-apiserver bbd4e61 Bump Grafana version to v6.7.4 {"git": {"remote": "https://github.com/coreos/prometheus-operator", "subdir": "jsonnet/prometheus-operator"}} 0fc8b1be update generated files for pvc metadata fix {"git": {"remote": "https://github.com/grafana/grafonnet-lib", "subdir": "grafonnet"}} 682d817 Fix documentation typo in graph panel ae4e089 Add staircase line and axis decimals options to graph panel ef70649 Fix tablePanel.addTargets fa6b4c3 Move 'SELECT' and 'FROM' tagged query sections from objects to methods 7507d6d Add InfluxDB query editor support 133ea7f Add @name to all function docs 0905313 Fix x_axis_buckets value when histogram mode 76e4948 Add support for sideWidth in graph panels 1c04821 adds alertlist panel type {"git": {"remote": "https://github.com/grafana/jsonnet-libs", "subdir": "grafana-builder"}} 0c8e4d3 allow table link in new tab (openshift#238) {"git": {"remote": "https://github.com/kubernetes/kube-state-metrics", "subdir": "jsonnet/kube-state-metrics"}} 44ab592e Remove prefix from version label f5729a5c add leases collector c210e732 Add security context to deployment and statefulset 6be758aa jsonnet/kube-state-metrics: Add missing volumeattachments resource {"git": {"remote": "https://github.com/kubernetes/kube-state-metrics", "subdir": "jsonnet/kube-state-metrics-mixin"}} 42c8f26b jsonnet/kube-state-metrics-mixin: add default KSM selector e4e9d325 jsonnet/kube-state-metrics-mixin/alerts.libsonnet: Fix selector typo 345ded07 jsonnet,scripts: Add kube-state-metrics alerts mixin {"git": {"remote": "https://github.com/openshift/telemeter", "subdir": "jsonnet/telemeter"}} 6815a7ab jsonnet: Upgrade apps API versions used to apps/v1 48fe1cf5 Stop using service ca from service account token {"git": {"remote": "https://github.com/prometheus/node_exporter", "subdir": "docs/node-mixin"}} bd3e6d2 Add NodeTextFileCollectorScrapeError alert to mixin {"git": {"remote": "https://github.com/prometheus/prometheus", "subdir": "documentation/prometheus-mixin"}} 27b1009ac Rename the dashboard in the mixin to 'Prometheus Overview'. (#7489) 6e7554639 Update Readme since jsonnetfmt is available in the jsonnet go implementation since v0.16.0 The etcd change is from [1]. [1]: etcd-io/etcd#12080
A cluster with three members could see three leader changes during a healthy rolling reboot, and we don't want to alert on that. Growing to 4 reduces false-alarms for clusters with three or fewer members, and that's probably most clusters. It will also slightly increase the risk of false-negatives, but if the cluster is struggling with high latency, it seems likely that it would quickly pass the new threshold too.
The hard-coded threshold means that we are still likely to get false-positives during rolling reboots of clusters with four or more members. Ideally we'd scale this with the cluster size, or something, but I'm not sure how to do that. Three members is the minimum size for high availability, so reducing false positives for that case seems worth addressing even if we leave larger clusters largely unchanges.
Also manually catch
etcd3_alert.rulesup to speed, since it seems to have been passed over by 16fc8a2 (#11768).