Documentation/etcd-mixin: Fix etcdHighNumberOfLeaderChanges #11448
Merged
The `etcdHighNumberOfLeaderChanges` alert had a copy-and-paste error when it was converted from docs to mixin in #10244: we moved from "increase over 15m > 3" to "rate over 15m > 3", which is not the same. Rate is measured per second, so the equivalent expression would have been "rate over 15m > (3 / 60 / 15)". As part of fixing that, we need to handle the case where Prometheus starts, or first scrapes a new etcd cluster, and the leader change count is already high. For example, if you start a new etcd cluster and it is already at 5 leader changes when Prometheus first scrapes it, we should fire on that transition. This alert is also now more responsive: a quick burst of 3 leader changes will alert within 5m rather than 15m.
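To make the unit mismatch concrete, here is a minimal Python sketch of the arithmetic. It is not the alert rule itself: it models a per-second rate as simply "changes in the window divided by window length", ignoring Prometheus's extrapolation, and the function names are hypothetical.

```python
# Simplified model (assumption): per-second rate = changes / window length.
# This ignores Prometheus's extrapolation at window boundaries.

WINDOW_SECONDS = 15 * 60  # the 15m window from the description


def increase_to_rate_threshold(increase_threshold: float, window_seconds: float) -> float:
    """Convert an "increase over window" threshold into a per-second rate threshold."""
    return increase_threshold / window_seconds


def fires_on_rate(leader_changes: int, window_seconds: float, threshold: float) -> bool:
    """Return True if the modeled per-second rate exceeds the threshold."""
    return leader_changes / window_seconds > threshold


# "increase over 15m > 3" expressed as a rate: 3 / 60 / 15 per second.
rate_threshold = increase_to_rate_threshold(3, WINDOW_SECONDS)

# With the buggy threshold of 3 per second, 5 changes in 15m never fires.
buggy_fires = fires_on_rate(5, WINDOW_SECONDS, 3)

# With the converted threshold, the same 5 changes do fire.
fixed_fires = fires_on_rate(5, WINDOW_SECONDS, rate_threshold)

# Leader changes the buggy rule would need to exceed within 15m.
changes_needed_for_buggy_rule = 3 * WINDOW_SECONDS
```

Under this model the buggy rule only fires after more than 2700 leader changes in 15 minutes, which matches the observation that it was not firing in practice.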
@hexfusion as per our discovery today that the alert was not firing in any production clusters
hexfusion
approved these changes
Dec 13, 2019
Codecov Report
@@            Coverage Diff             @@
##           master   #11448      +/-   ##
==========================================
- Coverage   64.62%    64.4%   -0.23%
==========================================
  Files         403      403
  Lines       38079    38079
==========================================
- Hits        24610    24525      -85
- Misses      11838    11908      +70
- Partials     1631     1646      +15
wking
added a commit
to wking/cluster-monitoring-operator
that referenced
this pull request
Jun 25, 2020
Generated with:

  $ make jsonnet/vendor  # to install jb
  $ jb --version
  v0.3.1
  $ (cd jsonnet && jb install https://github.com/coreos/etcd/Documentation/etcd-mixin)
  GET https://github.com/coreos/etcd/archive/2b79442d8e9fc54b1ac27e7e230ac0e4c132a054.tar.gz 200
  $ touch jsonnet/jsonnetfile.lock.json  # recover from any previous flubbed generation which might have left timestamps in place convincing Make it didn't need to rebuild bindata.go
  $ make generate-in-docker

This pulls in:

  $ git --no-pager log --oneline f1eca4e1fa5de962ff8079af836bb390e88d1f4c..2b79442d8e9fc54b1ac27e7e230ac0e4c132a054 -- Documentation/etcd-mixin
  2c4877064 Documentation/etcd-mixin: Use etcd_mvcc_db_total_size_in_bytes metric
  68c5f6066 Documentation/etcd-mixin: Set unique UID for Grafana dashboard
  322c38e16 Documentation/etcd-mixin: Fix etcdHighNumberOfLeaderChanges (#11448)

The first two of those are [1]. The last is [2], and as the discussion there points out, the rate>3 approach is effectively "never fire" (because we are unlikely to ever have more than three elections per second).

[1]: etcd-io/etcd#11768
[2]: etcd-io/etcd#11448
wking
added a commit
to wking/cluster-monitoring-operator
that referenced
this pull request
Jun 25, 2020
Generated with:

  $ make jsonnet/vendor  # to install jb
  $ jb --version
  v0.3.1
  $ (cd jsonnet && jb install https://github.com/coreos/etcd/Documentation/etcd-mixin)
  GET https://github.com/coreos/etcd/archive/2b79442d8e9fc54b1ac27e7e230ac0e4c132a054.tar.gz 200
  $ touch jsonnet/jsonnetfile.lock.json  # recover from any previous flubbed generation which might have left timestamps in place convincing Make it didn't need to rebuild bindata.go
  $ make generate-in-docker

This pulls in:

  $ git --no-pager log --oneline f1eca4e1fa5de962ff8079af836bb390e88d1f4c..2b79442d8e9fc54b1ac27e7e230ac0e4c132a054 -- Documentation/etcd-mixin
  2c4877064 Documentation/etcd-mixin: Use etcd_mvcc_db_total_size_in_bytes metric
  68c5f6066 Documentation/etcd-mixin: Set unique UID for Grafana dashboard
  322c38e16 Documentation/etcd-mixin: Fix etcdHighNumberOfLeaderChanges (#11448)

The first two of those are [1]. The last is [2], and as the discussion there points out, the rate>3 approach is effectively "never fire" (because we are unlikely to ever have more than three elections per second).

Also interesting, is that f1eca4e1fa is in etcd's release-3.4 branch, while I'm pinning the master branch. The current coreos release-3.4 tip is etcd-io/etcd@31e49a4df30, which has no etcd-mixin changes since f1eca4e1fa. My impression is that mixin changes are unlikely to be backported to release branches, and also unlikely to depend on the intricacies of the underlying etcd version, so I'm tracking master instead of release-3.4 in this commit. The move to f1eca4e1fa had landed here via 5c251d9 (jsonnet/jsonnetfile.lock.json: jb update, 2020-04-17, openshift#760).

[1]: etcd-io/etcd#11768
[2]: etcd-io/etcd#11448