Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Documentation/etcd-mixin: Raise etcdHighNumberOfLeaderChanges threshold to 4 #12080

Merged
merged 1 commit into from Jul 8, 2020

Conversation

wking
Copy link
Contributor

@wking wking commented Jun 25, 2020

A cluster with three members could see three leader changes during a healthy rolling reboot, and we don't want to alert on that. Growing to 4 reduces false-alarms for clusters with three or fewer members, and that's probably most clusters. It will also slightly increase the risk of false-negatives, but if the cluster is struggling with high latency, it seems likely that it would quickly pass the new threshold too.

The hard-coded threshold means that we are still likely to get false-positives during rolling reboots of clusters with four or more members. Ideally we'd scale this with the cluster size, or something, but I'm not sure how to do that. Three members is the minimum size for high availability, so reducing false positives for that case seems worth addressing even if we leave larger clusters largely unchanges.

Also manually catch etcd3_alert.rules up to speed, since it seems to have been passed over by 16fc8a2 (#11768).

…ld to 4

A cluster with three members could see three leader changes during a
healthy rolling reboot, and we don't want to alert on that.  Growing
to 4 reduces false-alarms for clusters with three or fewer members,
and that's probably most clusters.  It will also slightly increase the
risk of false-negatives, but if the cluster is struggling with high
latency, it seems likely that it would quickly pass the new threshold
too.

The hard-coded threshold means that we are still likely to get
false-positives during rolling reboots of clusters with four or more
members.  Ideally we'd scale this with the cluster size, or something,
but I'm not sure how to do that.  Three members is the minimum size
for high availability, so reducing false positives for that case seems
worth addressing even if we leave larger clusters largely unchanges.

Also manually catch etcd3_alert.rules up to speed, since it seems to
have been passed over by 16fc8a2 (Documentation/op-guide:
Re-generate alert rules and dashboard from mixin, 2020-04-07, etcd-io#11768).
@brancz
Copy link
Contributor

brancz commented Jun 26, 2020

lgtm, nice one! To catch roughly the same characteristic we should probably adapt the timing as well, however, I think it's ok to start with what you have now and see if that works.

The hard-coded threshold means that we are still likely to get false-positives during rolling reboots of clusters with four or more members. Ideally we'd scale this with the cluster size, or something, but I'm not sure how to do that.

I think this alert is probably even info severity. As unless etcd is not replying to requests correctly, this is just noise, but it's interesting when etcd doesn't reply as expected.

Side note, we can probably remove the old alerting syntax rules, Prometheus has removed this syntax >2.5 years ago.

@wking
Copy link
Contributor Author

wking commented Jun 26, 2020

Side note, we can probably remove the old alerting syntax rules...

Is that "delete etcd3_alert.rules"? Did you want me to do that here?

@brancz
Copy link
Contributor

brancz commented Jun 26, 2020

Let's do that in a follow up, I don't want to distract this contribution with that :)

Copy link
Contributor

@hexfusion hexfusion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change looks good to me thanks @wking

Semaphore CI failure is flake.

@hexfusion
Copy link
Contributor

merging TestCtlV2ClusterHealth failure in https://semaphoreci.com/etcd-io/etcd/branches/pull-request-12080/builds/1 unrelated to change.

@hexfusion hexfusion merged commit 429826b into etcd-io:master Jul 8, 2020
@wking wking deleted the raise-etcd-leader-changes-to-four branch July 8, 2020 15:00
wking added a commit to wking/etcd that referenced this pull request Jul 8, 2020
Frederic says [1]:

> Side note, we can probably remove the old alerting syntax rules,
> Prometheus has removed this syntax >2.5 years ago.

[1]: etcd-io#12080 (comment)
wking added a commit to wking/cluster-monitoring-operator that referenced this pull request Jul 8, 2020
…old to 4

Generated with:

  $ make jsonnet/vendor  # to install jb
  $ jb --version
  v0.3.1
  $ (cd jsonnet && jb update)
  GET https://github.com/coreos/etcd/archive/429826b46719046e43cfae3f4c37b8cd78985af6.tar.gz 200
  GET https://github.com/coreos/kube-prometheus/archive/bce16b41eb10ea6810139496f4025455e70568a7.tar.gz 200
  GET https://github.com/coreos/prometheus-operator/archive/8fa7349020073bcc7932abb2bc4bd08e4bc3a0ac.tar.gz 200
  GET https://github.com/kubernetes-monitoring/kubernetes-mixin/archive/d353b060f1965cc8ec2dce00c6dd2d565cb1825f.tar.gz 200
  GET https://github.com/openshift/openshift-state-metrics/archive/9ba3c5222f789c0b5e0b99a2a648315b39a8dc70.tar.gz 200
  GET https://github.com/openshift/telemeter/archive/ca9c4f2c12e19e6fb0dc7b5bb545c7a5b8b193c8.tar.gz 200
  GET https://github.com/thanos-io/kube-thanos/archive/0ebda25e5643dea0a37d401bb4a566127a0935f5.tar.gz 200
  GET https://github.com/thanos-io/thanos/archive/52e10c6e0f644ea98fd057e7fbece828d8dd07c7.tar.gz 200
  GET https://github.com/ksonnet/ksonnet-lib/archive/0d2f82676817bbf9e4acf6495b2090205f323b9f.tar.gz 200
  GET https://github.com/prometheus/node_exporter/archive/4ef4548ad5bd4bc11c7e6b2da569cf1aa05be907.tar.gz 200
  GET https://github.com/prometheus/prometheus/archive/1f73073d73bf671d545e79335a300f62d050a5f3.tar.gz 200
  GET https://github.com/brancz/kubernetes-grafana/archive/57b4365eacda291b82e0d55ba7eec573a8198dda.tar.gz 200
  GET https://github.com/kubernetes/kube-state-metrics/archive/d667979ed55ad1c4db44d331b51d646f5b903aa7.tar.gz 200
  GET https://github.com/kubernetes/kube-state-metrics/archive/d667979ed55ad1c4db44d331b51d646f5b903aa7.tar.gz 200
  GET https://github.com/grafana/grafonnet-lib/archive/5c6e8a8113486cdecd0961730aeaada3e6c69fe7.tar.gz 200
  GET https://github.com/grafana/jsonnet-libs/archive/eb710690d60faf16bc29f842e1d4acf875d86826.tar.gz 200
  GET https://github.com/kubernetes-monitoring/kubernetes-mixin/archive/685e8ccb62c8eaa67450b412d62163157276e1a2.tar.gz 200
  $ touch jsonnet/jsonnetfile.lock.json  # recover from any previous flubbed generation which might have left timestamps in place convincing Make it didn't need to rebuild bindata.go
  $ make generate-in-docker

This pulls in:

  $ hack/jsonnet-lock-diff.py jsonnet/jsonnetfile.lock.json
  {"git": {"remote": "https://github.com/coreos/etcd", "subdir": "Documentation/etcd-mixin"}}
  0c5cffc60 Documentation/etcd-mixin: Raise etcdHighNumberOfLeaderChanges threshold to 4

  {"git": {"remote": "https://github.com/coreos/kube-prometheus", "subdir": "jsonnet/kube-prometheus"}}
  4d6e3d5 enable etcd latency metrics in kube-apiserver
  bbd4e61 Bump Grafana version to v6.7.4

  {"git": {"remote": "https://github.com/coreos/prometheus-operator", "subdir": "jsonnet/prometheus-operator"}}
  0fc8b1be update generated files for pvc metadata fix

  {"git": {"remote": "https://github.com/grafana/grafonnet-lib", "subdir": "grafonnet"}}
  682d817 Fix documentation typo in graph panel
  ae4e089 Add staircase line and axis decimals options to graph panel
  ef70649 Fix tablePanel.addTargets
  fa6b4c3 Move 'SELECT' and 'FROM' tagged query sections from objects to methods
  7507d6d Add InfluxDB query editor support
  133ea7f Add @name to all function docs
  0905313 Fix x_axis_buckets value when histogram mode
  76e4948 Add support for sideWidth in graph panels
  1c04821 adds alertlist panel type

  {"git": {"remote": "https://github.com/grafana/jsonnet-libs", "subdir": "grafana-builder"}}
  0c8e4d3 allow table link in new tab (openshift#238)

  {"git": {"remote": "https://github.com/kubernetes/kube-state-metrics", "subdir": "jsonnet/kube-state-metrics"}}
  44ab592e Remove prefix from version label
  f5729a5c add leases collector
  c210e732 Add security context to deployment and statefulset
  6be758aa jsonnet/kube-state-metrics: Add missing volumeattachments resource

  {"git": {"remote": "https://github.com/kubernetes/kube-state-metrics", "subdir": "jsonnet/kube-state-metrics-mixin"}}
  42c8f26b jsonnet/kube-state-metrics-mixin: add default KSM selector
  e4e9d325 jsonnet/kube-state-metrics-mixin/alerts.libsonnet: Fix selector typo
  345ded07 jsonnet,scripts: Add kube-state-metrics alerts mixin

  {"git": {"remote": "https://github.com/openshift/telemeter", "subdir": "jsonnet/telemeter"}}
  6815a7ab jsonnet: Upgrade apps API versions used to apps/v1
  48fe1cf5 Stop using service ca from service account token

  {"git": {"remote": "https://github.com/prometheus/node_exporter", "subdir": "docs/node-mixin"}}
  bd3e6d2 Add NodeTextFileCollectorScrapeError alert to mixin

  {"git": {"remote": "https://github.com/prometheus/prometheus", "subdir": "documentation/prometheus-mixin"}}
  27b1009ac Rename the dashboard in the mixin to 'Prometheus Overview'. (#7489)
  6e7554639 Update Readme since jsonnetfmt is available in the jsonnet go implementation since v0.16.0

The etcd change is from [1].

[1]: etcd-io/etcd#12080
wking added a commit to wking/cluster-monitoring-operator that referenced this pull request Jul 8, 2020
…old to 4

Generated with:

  $ make jsonnet/vendor  # to install jb
  $ jb --version
  v0.3.1
  $ (cd jsonnet && jb update)
  GET https://github.com/coreos/etcd/archive/429826b46719046e43cfae3f4c37b8cd78985af6.tar.gz 200
  GET https://github.com/coreos/kube-prometheus/archive/bce16b41eb10ea6810139496f4025455e70568a7.tar.gz 200
  GET https://github.com/coreos/prometheus-operator/archive/8fa7349020073bcc7932abb2bc4bd08e4bc3a0ac.tar.gz 200
  GET https://github.com/kubernetes-monitoring/kubernetes-mixin/archive/d353b060f1965cc8ec2dce00c6dd2d565cb1825f.tar.gz 200
  GET https://github.com/openshift/openshift-state-metrics/archive/9ba3c5222f789c0b5e0b99a2a648315b39a8dc70.tar.gz 200
  GET https://github.com/openshift/telemeter/archive/ca9c4f2c12e19e6fb0dc7b5bb545c7a5b8b193c8.tar.gz 200
  GET https://github.com/thanos-io/kube-thanos/archive/0ebda25e5643dea0a37d401bb4a566127a0935f5.tar.gz 200
  GET https://github.com/thanos-io/thanos/archive/52e10c6e0f644ea98fd057e7fbece828d8dd07c7.tar.gz 200
  GET https://github.com/ksonnet/ksonnet-lib/archive/0d2f82676817bbf9e4acf6495b2090205f323b9f.tar.gz 200
  GET https://github.com/prometheus/node_exporter/archive/4ef4548ad5bd4bc11c7e6b2da569cf1aa05be907.tar.gz 200
  GET https://github.com/prometheus/prometheus/archive/1f73073d73bf671d545e79335a300f62d050a5f3.tar.gz 200
  GET https://github.com/brancz/kubernetes-grafana/archive/57b4365eacda291b82e0d55ba7eec573a8198dda.tar.gz 200
  GET https://github.com/kubernetes/kube-state-metrics/archive/d667979ed55ad1c4db44d331b51d646f5b903aa7.tar.gz 200
  GET https://github.com/kubernetes/kube-state-metrics/archive/d667979ed55ad1c4db44d331b51d646f5b903aa7.tar.gz 200
  GET https://github.com/grafana/grafonnet-lib/archive/5c6e8a8113486cdecd0961730aeaada3e6c69fe7.tar.gz 200
  GET https://github.com/grafana/jsonnet-libs/archive/eb710690d60faf16bc29f842e1d4acf875d86826.tar.gz 200
  GET https://github.com/kubernetes-monitoring/kubernetes-mixin/archive/685e8ccb62c8eaa67450b412d62163157276e1a2.tar.gz 200
  $ touch jsonnet/jsonnetfile.lock.json  # recover from any previous flubbed generation which might have left timestamps in place convincing Make it didn't need to rebuild bindata.go
  $ make generate-in-docker

This pulls in:

  $ hack/jsonnet-lock-diff.py jsonnet/jsonnetfile.lock.json
  {"git": {"remote": "https://github.com/coreos/etcd", "subdir": "Documentation/etcd-mixin"}}
  0c5cffc60 Documentation/etcd-mixin: Raise etcdHighNumberOfLeaderChanges threshold to 4

  {"git": {"remote": "https://github.com/coreos/kube-prometheus", "subdir": "jsonnet/kube-prometheus"}}
  4d6e3d5 enable etcd latency metrics in kube-apiserver
  bbd4e61 Bump Grafana version to v6.7.4

  {"git": {"remote": "https://github.com/coreos/prometheus-operator", "subdir": "jsonnet/prometheus-operator"}}
  0fc8b1be update generated files for pvc metadata fix

  {"git": {"remote": "https://github.com/grafana/grafonnet-lib", "subdir": "grafonnet"}}
  682d817 Fix documentation typo in graph panel
  ae4e089 Add staircase line and axis decimals options to graph panel
  ef70649 Fix tablePanel.addTargets
  fa6b4c3 Move 'SELECT' and 'FROM' tagged query sections from objects to methods
  7507d6d Add InfluxDB query editor support
  133ea7f Add @name to all function docs
  0905313 Fix x_axis_buckets value when histogram mode
  76e4948 Add support for sideWidth in graph panels
  1c04821 adds alertlist panel type

  {"git": {"remote": "https://github.com/grafana/jsonnet-libs", "subdir": "grafana-builder"}}
  0c8e4d3 allow table link in new tab (openshift#238)

  {"git": {"remote": "https://github.com/kubernetes/kube-state-metrics", "subdir": "jsonnet/kube-state-metrics"}}
  44ab592e Remove prefix from version label
  f5729a5c add leases collector
  c210e732 Add security context to deployment and statefulset
  6be758aa jsonnet/kube-state-metrics: Add missing volumeattachments resource

  {"git": {"remote": "https://github.com/kubernetes/kube-state-metrics", "subdir": "jsonnet/kube-state-metrics-mixin"}}
  42c8f26b jsonnet/kube-state-metrics-mixin: add default KSM selector
  e4e9d325 jsonnet/kube-state-metrics-mixin/alerts.libsonnet: Fix selector typo
  345ded07 jsonnet,scripts: Add kube-state-metrics alerts mixin

  {"git": {"remote": "https://github.com/openshift/telemeter", "subdir": "jsonnet/telemeter"}}
  6815a7ab jsonnet: Upgrade apps API versions used to apps/v1
  48fe1cf5 Stop using service ca from service account token

  {"git": {"remote": "https://github.com/prometheus/node_exporter", "subdir": "docs/node-mixin"}}
  bd3e6d2 Add NodeTextFileCollectorScrapeError alert to mixin

  {"git": {"remote": "https://github.com/prometheus/prometheus", "subdir": "documentation/prometheus-mixin"}}
  27b1009ac Rename the dashboard in the mixin to 'Prometheus Overview'. (#7489)
  6e7554639 Update Readme since jsonnetfmt is available in the jsonnet go implementation since v0.16.0

The etcd change is from [1].

[1]: etcd-io/etcd#12080
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants