Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve logging in LeaderChecker #78883

Conversation

DaveCTurner
Copy link
Contributor

Today if the LeaderChecker decides it's time to restart discovery then
we log a verbose and confusing message that looks something like this:

[instance-0000000006] master node [...] failed, restarting discovery
org.elasticsearch.ElasticsearchException: node [...] failed [3] consecutive checks
    at org.elasticsearch.cluster.coordination.LeaderChecker$CheckScheduler$1.handleException(LeaderChecker.java:275) ~[elasticsearch-7.14.1.jar:7.14.1]
    ...
    at java.lang.Thread.run(Thread.java:831) [?:?]
Caused by: org.elasticsearch.transport.RemoteTransportException: [...][internal:coordination/fault_detection/leader_check]
Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: rejecting leader check since [...] has been removed from the cluster
    at org.elasticsearch.cluster.coordination.LeaderChecker.handleLeaderCheck(LeaderChecker.java:181) ~[elasticsearch-7.14.1.jar:7.14.1]
...
    at java.lang.Thread.run(Thread.java:831) ~[?:?]

There's quite a few problems with this:

  • We use DiscoveryNode#toString which is far too chatty.
  • There's basically nothing useful in these stack traces.
  • It's easy to miss the RemoteTransportException in the middle.
  • It's also easy to miss the root cause below it.
  • We say the master node failed which sounds very bad but, well, you
    know, that's just like, uh, our opinion. The master node is often
    fine, it just rejected our checks for some reason.
  • Reports of unstable clusters include these messages because they're
    noisy and look important, but don't include the more informative ones
    from the master because the master logs look quieter.

This commit reworks the logging in this area to avoid these problems:

  • We use DiscoveryNode#descriptionWithoutAttributes throughout.
  • We suppress the full stack traces unless DEBUG logging is on.
  • The LeaderChecker now provides the message to be logged, rather than
    putting all the details into an exception that wraps around the root
    cause.
  • The message describes the root cause rather than just saying that the
    "master node failed"
  • We distinguish timeouts from rejections and report the count of each.
  • The message guides towards checking the master node logs too.

@DaveCTurner DaveCTurner added >enhancement :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v8.0.0 v7.16.0 labels Oct 9, 2021
Today if the `LeaderChecker` decides it's time to restart discovery then
we log a verbose and confusing message that looks something like this:

    [instance-0000000006] master node [...] failed, restarting discovery
    org.elasticsearch.ElasticsearchException: node [...] failed [3] consecutive checks
        at org.elasticsearch.cluster.coordination.LeaderChecker$CheckScheduler$1.handleException(LeaderChecker.java:275) ~[elasticsearch-7.14.1.jar:7.14.1]
        ...
        at java.lang.Thread.run(Thread.java:831) [?:?]
    Caused by: org.elasticsearch.transport.RemoteTransportException: [...][internal:coordination/fault_detection/leader_check]
    Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: rejecting leader check since [...] has been removed from the cluster
        at org.elasticsearch.cluster.coordination.LeaderChecker.handleLeaderCheck(LeaderChecker.java:181) ~[elasticsearch-7.14.1.jar:7.14.1]
    ...
        at java.lang.Thread.run(Thread.java:831) ~[?:?]

There's quite a few problems with this:

- We use `DiscoveryNode#toString` which is far too chatty.
- There's basically nothing useful in these stack traces.
- It's easy to miss the `RemoteTransportException` in the middle.
- It's also easy to miss the root cause below it.
- We say the master node failed which sounds very bad but, well, you
  know, that's just like, uh, our opinion. The master node is often
  fine, it just rejected our checks for some reason.
- Reports of unstable clusters include these messages because they're
  noisy and look important, but don't include the more informative ones
  from the master because the master logs look quieter.

This commit reworks the logging in this area to avoid these problems:

- We use `DiscoveryNode#descriptionWithoutAttributes` throughout.
- We suppress the full stack traces unless `DEBUG` logging is on.
- The `LeaderChecker` now provides the message to be logged, rather than
  putting all the details into an exception that wraps around the root
  cause.
- The message describes the root cause rather than just saying that the
  "master node failed"
- We distinguish timeouts from rejections and report the count of each.
- The message guides towards checking the master node logs too.
@DaveCTurner DaveCTurner force-pushed the 2021-10-09-leader-checker-better-logging branch from babf188 to 29ba8ad Compare October 9, 2021 09:49
@DaveCTurner DaveCTurner marked this pull request as ready for review October 11, 2021 12:07
@elasticmachine elasticmachine added the Team:Distributed Meta label for distributed team label Oct 11, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

private class CheckScheduler implements Releasable {

private final AtomicBoolean isClosed = new AtomicBoolean();
private final AtomicLong failureCountSinceLastSuccess = new AtomicLong();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't actually need an AtomicLong here, there's only ever one request in flight so everything is properly synchronised already.

@fcofdez fcofdez self-requested a review October 14, 2021 09:55
Copy link
Contributor

@fcofdez fcofdez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@DaveCTurner DaveCTurner merged commit 83d96e4 into elastic:master Oct 15, 2021
@DaveCTurner DaveCTurner deleted the 2021-10-09-leader-checker-better-logging branch October 15, 2021 09:09
DaveCTurner added a commit that referenced this pull request Oct 15, 2021
Today if the `LeaderChecker` decides it's time to restart discovery then
we log a verbose and confusing message that looks something like this:

    [instance-0000000006] master node [...] failed, restarting discovery
    org.elasticsearch.ElasticsearchException: node [...] failed [3] consecutive checks
        at org.elasticsearch.cluster.coordination.LeaderChecker$CheckScheduler$1.handleException(LeaderChecker.java:275) ~[elasticsearch-7.14.1.jar:7.14.1]
        ...
        at java.lang.Thread.run(Thread.java:831) [?:?]
    Caused by: org.elasticsearch.transport.RemoteTransportException: [...][internal:coordination/fault_detection/leader_check]
    Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: rejecting leader check since [...] has been removed from the cluster
        at org.elasticsearch.cluster.coordination.LeaderChecker.handleLeaderCheck(LeaderChecker.java:181) ~[elasticsearch-7.14.1.jar:7.14.1]
    ...
        at java.lang.Thread.run(Thread.java:831) ~[?:?]

There's quite a few problems with this:

- We use `DiscoveryNode#toString` which is far too chatty.
- There's basically nothing useful in these stack traces.
- It's easy to miss the `RemoteTransportException` in the middle.
- It's also easy to miss the root cause below it.
- We say the master node failed which sounds very bad but, well, you
  know, that's just like, uh, our opinion. The master node is often
  fine, it just rejected our checks for some reason.
- Reports of unstable clusters include these messages because they're
  noisy and look important, but don't include the more informative ones
  from the master because the master logs look quieter.

This commit reworks the logging in this area to avoid these problems:

- We use `DiscoveryNode#descriptionWithoutAttributes` throughout.
- We suppress the full stack traces unless `DEBUG` logging is on.
- The `LeaderChecker` now provides the message to be logged, rather than
  putting all the details into an exception that wraps around the root
  cause.
- The message describes the root cause rather than just saying that the
  "master node failed"
- We distinguish timeouts from rejections and report the count of each.
- The message guides towards checking the master node logs too.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >enhancement Team:Distributed Meta label for distributed team v7.16.0 v8.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants