Improve logging in LeaderChecker #78883

DaveCTurner · 2021-10-09T09:38:22Z

Today if the LeaderChecker decides it's time to restart discovery then
we log a verbose and confusing message that looks something like this:

[instance-0000000006] master node [...] failed, restarting discovery
org.elasticsearch.ElasticsearchException: node [...] failed [3] consecutive checks
    at org.elasticsearch.cluster.coordination.LeaderChecker$CheckScheduler$1.handleException(LeaderChecker.java:275) ~[elasticsearch-7.14.1.jar:7.14.1]
    ...
    at java.lang.Thread.run(Thread.java:831) [?:?]
Caused by: org.elasticsearch.transport.RemoteTransportException: [...][internal:coordination/fault_detection/leader_check]
Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: rejecting leader check since [...] has been removed from the cluster
    at org.elasticsearch.cluster.coordination.LeaderChecker.handleLeaderCheck(LeaderChecker.java:181) ~[elasticsearch-7.14.1.jar:7.14.1]
...
    at java.lang.Thread.run(Thread.java:831) ~[?:?]

There's quite a few problems with this:

We use DiscoveryNode#toString which is far too chatty.
There's basically nothing useful in these stack traces.
It's easy to miss the RemoteTransportException in the middle.
It's also easy to miss the root cause below it.
We say the master node failed which sounds very bad but, well, you
know, that's just like, uh, our opinion. The master node is often
fine, it just rejected our checks for some reason.
Reports of unstable clusters include these messages because they're
noisy and look important, but don't include the more informative ones
from the master because the master logs look quieter.

This commit reworks the logging in this area to avoid these problems:

We use DiscoveryNode#descriptionWithoutAttributes throughout.
We suppress the full stack traces unless DEBUG logging is on.
The LeaderChecker now provides the message to be logged, rather than
putting all the details into an exception that wraps around the root
cause.
The message describes the root cause rather than just saying that the
"master node failed"
We distinguish timeouts from rejections and report the count of each.
The message guides towards checking the master node logs too.

Today if the `LeaderChecker` decides it's time to restart discovery then we log a verbose and confusing message that looks something like this: [instance-0000000006] master node [...] failed, restarting discovery org.elasticsearch.ElasticsearchException: node [...] failed [3] consecutive checks at org.elasticsearch.cluster.coordination.LeaderChecker$CheckScheduler$1.handleException(LeaderChecker.java:275) ~[elasticsearch-7.14.1.jar:7.14.1] ... at java.lang.Thread.run(Thread.java:831) [?:?] Caused by: org.elasticsearch.transport.RemoteTransportException: [...][internal:coordination/fault_detection/leader_check] Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: rejecting leader check since [...] has been removed from the cluster at org.elasticsearch.cluster.coordination.LeaderChecker.handleLeaderCheck(LeaderChecker.java:181) ~[elasticsearch-7.14.1.jar:7.14.1] ... at java.lang.Thread.run(Thread.java:831) ~[?:?] There's quite a few problems with this: - We use `DiscoveryNode#toString` which is far too chatty. - There's basically nothing useful in these stack traces. - It's easy to miss the `RemoteTransportException` in the middle. - It's also easy to miss the root cause below it. - We say the master node failed which sounds very bad but, well, you know, that's just like, uh, our opinion. The master node is often fine, it just rejected our checks for some reason. - Reports of unstable clusters include these messages because they're noisy and look important, but don't include the more informative ones from the master because the master logs look quieter. This commit reworks the logging in this area to avoid these problems: - We use `DiscoveryNode#descriptionWithoutAttributes` throughout. - We suppress the full stack traces unless `DEBUG` logging is on. - The `LeaderChecker` now provides the message to be logged, rather than putting all the details into an exception that wraps around the root cause. - The message describes the root cause rather than just saying that the "master node failed" - We distinguish timeouts from rejections and report the count of each. - The message guides towards checking the master node logs too.

elasticmachine · 2021-10-11T12:07:18Z

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner · 2021-10-11T12:10:17Z

server/src/main/java/org/elasticsearch/cluster/coordination/LeaderChecker.java

    private class CheckScheduler implements Releasable {

-        private final AtomicBoolean isClosed = new AtomicBoolean();
-        private final AtomicLong failureCountSinceLastSuccess = new AtomicLong();


We don't actually need an AtomicLong here, there's only ever one request in flight so everything is properly synchronised already.

fcofdez

LGTM 👍

Today if the `LeaderChecker` decides it's time to restart discovery then we log a verbose and confusing message that looks something like this: [instance-0000000006] master node [...] failed, restarting discovery org.elasticsearch.ElasticsearchException: node [...] failed [3] consecutive checks at org.elasticsearch.cluster.coordination.LeaderChecker$CheckScheduler$1.handleException(LeaderChecker.java:275) ~[elasticsearch-7.14.1.jar:7.14.1] ... at java.lang.Thread.run(Thread.java:831) [?:?] Caused by: org.elasticsearch.transport.RemoteTransportException: [...][internal:coordination/fault_detection/leader_check] Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: rejecting leader check since [...] has been removed from the cluster at org.elasticsearch.cluster.coordination.LeaderChecker.handleLeaderCheck(LeaderChecker.java:181) ~[elasticsearch-7.14.1.jar:7.14.1] ... at java.lang.Thread.run(Thread.java:831) ~[?:?] There's quite a few problems with this: - We use `DiscoveryNode#toString` which is far too chatty. - There's basically nothing useful in these stack traces. - It's easy to miss the `RemoteTransportException` in the middle. - It's also easy to miss the root cause below it. - We say the master node failed which sounds very bad but, well, you know, that's just like, uh, our opinion. The master node is often fine, it just rejected our checks for some reason. - Reports of unstable clusters include these messages because they're noisy and look important, but don't include the more informative ones from the master because the master logs look quieter. This commit reworks the logging in this area to avoid these problems: - We use `DiscoveryNode#descriptionWithoutAttributes` throughout. - We suppress the full stack traces unless `DEBUG` logging is on. - The `LeaderChecker` now provides the message to be logged, rather than putting all the details into an exception that wraps around the root cause. - The message describes the root cause rather than just saying that the "master node failed" - We distinguish timeouts from rejections and report the count of each. - The message guides towards checking the master node logs too.

DaveCTurner added >enhancement :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v8.0.0 v7.16.0 labels Oct 9, 2021

DaveCTurner force-pushed the 2021-10-09-leader-checker-better-logging branch from babf188 to 29ba8ad Compare October 9, 2021 09:49

DaveCTurner marked this pull request as ready for review October 11, 2021 12:07

elasticmachine added the Team:Distributed Meta label for distributed team label Oct 11, 2021

DaveCTurner commented Oct 11, 2021

View reviewed changes

fcofdez self-requested a review October 14, 2021 09:55

fcofdez approved these changes Oct 15, 2021

View reviewed changes

DaveCTurner merged commit 83d96e4 into elastic:master Oct 15, 2021

DaveCTurner deleted the 2021-10-09-leader-checker-better-logging branch October 15, 2021 09:09

jakelandis added v8.0.0-beta1 and removed v8.0.0 labels Oct 27, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve logging in LeaderChecker #78883

Improve logging in LeaderChecker #78883

DaveCTurner commented Oct 9, 2021

elasticmachine commented Oct 11, 2021

DaveCTurner Oct 11, 2021

fcofdez left a comment

Improve logging in LeaderChecker #78883

Improve logging in LeaderChecker #78883

Conversation

DaveCTurner commented Oct 9, 2021

elasticmachine commented Oct 11, 2021

DaveCTurner Oct 11, 2021

Choose a reason for hiding this comment

fcofdez left a comment

Choose a reason for hiding this comment