Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Documentation/etcd-mixin: Fix etcdHighNumberOfLeaderChanges #11448

Merged
merged 1 commit into from Dec 14, 2019

Conversation

smarterclayton
Copy link
Contributor

@smarterclayton smarterclayton commented Dec 13, 2019

The etcdHighNumberOfLeaderChanges alert had a copy and paste error when it was converted from docs to mixin in #10244 - we moved from "increase over 15m > 3" to "rate over 15m > 3" which is not the same (rate is measured per second, so it should have been
"rate over 15m > (3 / 60 / 15)"). As part of fixing that, we need to capture when prometheus starts or when new etcd clusters are captured with a high leader change - i.e. if you start a new
etcd cluster and at the moment prometheus first scrapes you are already at 5 leader changes, we should fire on that transition.

This alert is also now more responsive, so if you get a quick burst of 3 leader changes we'll alert within 5m rather than 15m.

The `etcdHighNumberOfLeaderChanges` alert had a copy and paste
error when it was converted from docs to mixin in 10244 - we moved
from "increase over 15m > 3" to "rate over 15m > 3" which is not
the same (rate is measured per second, so it should have been
"rate over 15m > (3 / 60 / 15)").  As part of fixing that, we
need to capture when prometheus starts or when new etcd clusters
are captured with a high leader change - i.e. if you start a new
etcd cluster and at the moment prometheus first scrapes you are
already at 5 leader changes, we should fire on that transition.

This alert is also now more responsive, so if you get a quick
burst of 3 leader changes we'll alert within 5m rather than 15m.
@smarterclayton
Copy link
Contributor Author

@hexfusion as per our discovery today that the alert was not firing in any production clusters

@hexfusion
Copy link
Contributor

/lgtm

cc @jingyih @gyuho @xiang90

@codecov-io
Copy link

Codecov Report

Merging #11448 into master will decrease coverage by 0.22%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #11448      +/-   ##
==========================================
- Coverage   64.62%    64.4%   -0.23%     
==========================================
  Files         403      403              
  Lines       38079    38079              
==========================================
- Hits        24610    24525      -85     
- Misses      11838    11908      +70     
- Partials     1631     1646      +15
Impacted Files Coverage Δ
client/members.go 65.32% <0%> (-20.17%) ⬇️
pkg/transport/timeout_conn.go 80% <0%> (-20%) ⬇️
etcdserver/api/v3rpc/util.go 51.61% <0%> (-16.13%) ⬇️
client/keys.go 79.39% <0%> (-12.07%) ⬇️
proxy/grpcproxy/register.go 69.44% <0%> (-11.12%) ⬇️
etcdserver/api/v3rpc/lease.go 69.31% <0%> (-5.69%) ⬇️
clientv3/concurrency/mutex.go 62.16% <0%> (-5.41%) ⬇️
proxy/grpcproxy/watcher.go 89.79% <0%> (-4.09%) ⬇️
pkg/testutil/recorder.go 77.77% <0%> (-3.71%) ⬇️
etcdserver/api/v3election/election.go 61.11% <0%> (-2.78%) ⬇️
... and 23 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 378b05b...fd26d82. Read the comment docs.

@xiang90 xiang90 merged commit 322c38e into etcd-io:master Dec 14, 2019
wking added a commit to wking/cluster-monitoring-operator that referenced this pull request Jun 25, 2020
Generated with:

  $ make jsonnet/vendor  # to install jb
  $ jb --version
  v0.3.1
  $ (cd jsonnet && jb install https://github.com/coreos/etcd/Documentation/etcd-mixin)
  GET https://github.com/coreos/etcd/archive/2b79442d8e9fc54b1ac27e7e230ac0e4c132a054.tar.gz 200
  $ touch jsonnet/jsonnetfile.lock.json  # recover from any previous flubbed generation which might have left timestamps in place convincing Make it didn't need to rebuild bindata.go
  $ make generate-in-docker

This pulls in:

  $ git --no-pager log --oneline f1eca4e1fa5de962ff8079af836bb390e88d1f4c..2b79442d8e9fc54b1ac27e7e230ac0e4c132a054 -- Documentation/etcd-mixin
  2c4877064 Documentation/etcd-mixin: Use etcd_mvcc_db_total_size_in_bytes metric
  68c5f6066 Documentation/etcd-mixin: Set unique UID for Grafana dashboard
  322c38e16 Documentation/etcd-mixin: Fix etcdHighNumberOfLeaderChanges (#11448)

The first two of those are [1].  The last is [2], and as the
discussion there points out, the rate>3 approach is effectively "never
fire" (because we are unlikely to ever have more than three elections
per second).

[1]: etcd-io/etcd#11768
[2]: etcd-io/etcd#11448
wking added a commit to wking/cluster-monitoring-operator that referenced this pull request Jun 25, 2020
Generated with:

  $ make jsonnet/vendor  # to install jb
  $ jb --version
  v0.3.1
  $ (cd jsonnet && jb install https://github.com/coreos/etcd/Documentation/etcd-mixin)
  GET https://github.com/coreos/etcd/archive/2b79442d8e9fc54b1ac27e7e230ac0e4c132a054.tar.gz 200
  $ touch jsonnet/jsonnetfile.lock.json  # recover from any previous flubbed generation which might have left timestamps in place convincing Make it didn't need to rebuild bindata.go
  $ make generate-in-docker

This pulls in:

  $ git --no-pager log --oneline f1eca4e1fa5de962ff8079af836bb390e88d1f4c..2b79442d8e9fc54b1ac27e7e230ac0e4c132a054 -- Documentation/etcd-mixin
  2c4877064 Documentation/etcd-mixin: Use etcd_mvcc_db_total_size_in_bytes metric
  68c5f6066 Documentation/etcd-mixin: Set unique UID for Grafana dashboard
  322c38e16 Documentation/etcd-mixin: Fix etcdHighNumberOfLeaderChanges (#11448)

The first two of those are [1].  The last is [2], and as the
discussion there points out, the rate>3 approach is effectively "never
fire" (because we are unlikely to ever have more than three elections
per second).

Also interesting, is that f1eca4e1fa is in etcd's release-3.4 branch,
while I'm pinning the master branch.  The current coreos release-3.4
tip is etcd-io/etcd@31e49a4df30, which has no etcd-mixin changes since
f1eca4e1fa.  My impression is that mixin changes are unlikely to be
backported to release branches, and also unlikely to depend on the
intricacies of the underlying etcd version, so I'm tracking master
instead of release-3.4 in this commit.  The move to f1eca4e1fa had
landed here via 5c251d9 (jsonnet/jsonnetfile.lock.json: jb update,
2020-04-17, openshift#760).

[1]: etcd-io/etcd#11768
[2]: etcd-io/etcd#11448
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants