Improve logging in LeaderChecker (#78883)
Today if the `LeaderChecker` decides it's time to restart discovery then we log a verbose and confusing message that looks something like this:

    [instance-0000000006] master node [...] failed, restarting discovery
    org.elasticsearch.ElasticsearchException: node [...] failed [3] consecutive checks
        at org.elasticsearch.cluster.coordination.LeaderChecker$CheckScheduler$1.handleException(LeaderChecker.java:275) ~[elasticsearch-7.14.1.jar:7.14.1]
        ...
        at java.lang.Thread.run(Thread.java:831) [?:?]
    Caused by: org.elasticsearch.transport.RemoteTransportException: [...][internal:coordination/fault_detection/leader_check]
    Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: rejecting leader check since [...] has been removed from the cluster
        at org.elasticsearch.cluster.coordination.LeaderChecker.handleLeaderCheck(LeaderChecker.java:181) ~[elasticsearch-7.14.1.jar:7.14.1]
        ...
        at java.lang.Thread.run(Thread.java:831) ~[?:?]

There are quite a few problems with this:

- We use `DiscoveryNode#toString`, which is far too chatty.
- There's basically nothing useful in these stack traces.
- It's easy to miss the `RemoteTransportException` in the middle.
- It's also easy to miss the root cause below it.
- We say the master node failed, which sounds very bad but is really just our opinion. The master node is often fine; it just rejected our checks for some reason.
- Reports of unstable clusters include these messages because they're noisy and look important, but don't include the more informative ones from the master because the master logs look quieter.

This commit reworks the logging in this area to avoid these problems:

- We use `DiscoveryNode#descriptionWithoutAttributes` throughout.
- We suppress the full stack traces unless `DEBUG` logging is on.
- The `LeaderChecker` now provides the message to be logged, rather than putting all the details into an exception that wraps around the root cause.
- The message describes the root cause rather than just saying that the "master node failed".
- We distinguish timeouts from rejections and report the count of each.
- The message guides towards checking the master node logs too.
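The reworked approach can be illustrated with a minimal, self-contained sketch. This is not the code from this commit: the class `LeaderCheckFailureSketch`, its methods, and the exact message wording are hypothetical, and only the ideas (separate timeout and rejection counts, a message supplied by the checker itself, stack traces only at `DEBUG`) come from the description above.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.Locale;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch only; names and message text are assumptions,
// not the actual Elasticsearch LeaderChecker implementation.
final class LeaderCheckFailureSketch {

    private final String leaderDescription; // short node description, without attributes
    private final int retryCount;           // failed checks tolerated before giving up
    private int timeouts;
    private int rejections;

    LeaderCheckFailureSketch(String leaderDescription, int retryCount) {
        this.leaderDescription = leaderDescription;
        this.retryCount = retryCount;
    }

    /** A successful check resets both counters. */
    void onSuccess() {
        timeouts = 0;
        rejections = 0;
    }

    /**
     * Records one failed check. Returns the message to log once the retry
     * limit is reached, or null while the checker should keep trying.
     */
    String onFailure(Exception e) {
        if (e instanceof TimeoutException) {
            timeouts++;
        } else {
            rejections++;
        }
        int total = timeouts + rejections;
        if (total < retryCount) {
            return null;
        }
        // The message names the root cause and points at the master's logs,
        // instead of wrapping everything in another exception.
        return String.format(Locale.ROOT,
            "[%d] consecutive checks of the master [%s] were unsuccessful "
                + "([%d] rejected, [%d] timed out), restarting discovery; "
                + "more details may be available in the master node logs "
                + "[last unsuccessful check: %s]",
            total, leaderDescription, rejections, timeouts, e.getMessage());
    }

    /** Appends the full stack trace only when debug logging is enabled. */
    static String render(String message, Throwable cause, boolean debugEnabled) {
        if (!debugEnabled) {
            return message;
        }
        StringWriter trace = new StringWriter();
        cause.printStackTrace(new PrintWriter(trace, true));
        return message + System.lineSeparator() + trace;
    }
}
```

In this sketch a caller would invoke `onFailure` for each failed check and log the returned message (passed through `render`) only when it is non-null, so a single concise line replaces the nested exception chain.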
1 parent 3eadef4, commit fdae6f1
Showing 3 changed files with 179 additions and 56 deletions.