More logging for slow cluster state application #45007
Conversation
Today the lag detector may remove nodes from the cluster if they fail to apply a cluster state within a reasonable timeframe, but it is rather unclear from the default logging that this has occurred, and there is very little extra information beyond the fact that the removed node was lagging. Moreover, the only forewarning that the lag detector might be invoked is a message indicating that cluster state publication took unreasonably long, which does not contain enough information to investigate the problem further. This commit adds a good deal more detail to make the issues of slow nodes more prominent:

- every 10 seconds (by default) we log a warning indicating that a publication is still waiting for responses from some nodes, including the identities of the problematic nodes.
- the lag detector logs a more detailed warning when a fatally-lagging node is detected.
- if applying a cluster state takes too long then the cluster applier service logs a breakdown of all the tasks it ran as part of that process.
Pinging @elastic/es-distributed
The one-line log message from the
@@ -1348,8 +1372,10 @@ public void onSuccess(String source) {
        }
    }
    lagDetector.startLagDetector(publishRequest.getAcceptedState().version());
+   logIncompleteNodes();
Because 10s divides 30s exactly, there is a chance that by default sometimes we'll log the third periodic warning and then the end-of-publication warning in quick succession. We could just not log the end-of-publication warning here.
@DaveCTurner that's not so bad imo, we'll see the thread this was logged from in the logs as well so it's not that weird? (put differently, it doesn't bother me :P and maybe by some random chance this will help debug some locking issue with mutex, who knows :D)
Thinking about this made me slightly doubt whether we should do periodic logging here based on an interval. I wonder if we should instead just allow defining two timeouts, one for debug (5s) and one for info logging (10s), and then do warn logging upon publication timeout (i.e. what we have here).
I think an earlier debug log is of limited value, because people don't often run with debug logging on the off chance that it catches something interesting like this. I do agree that it seems sufficient to emit a single info message after 10 seconds, and then a warning after 30 seconds, so I've implemented that in 16fae08.
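The scheme settled on here (a single INFO after the shorter timeout, then a WARN at the publication timeout, and nothing at all if the publication finishes quickly) can be sketched with plain JDK scheduling. The class and names below are hypothetical, not the actual Coordinator code:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not the real Coordinator code: schedule a single INFO
// report after the info timeout and a WARN at the publication timeout,
// cancelling both as soon as every node has responded to the publication.
class PublicationReporter implements AutoCloseable {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final ScheduledFuture<?> infoReport;
    private final ScheduledFuture<?> warnReport;

    PublicationReporter(long infoTimeoutMillis, long warnTimeoutMillis, Runnable logInfo, Runnable logWarn) {
        infoReport = scheduler.schedule(logInfo, infoTimeoutMillis, TimeUnit.MILLISECONDS);
        warnReport = scheduler.schedule(logWarn, warnTimeoutMillis, TimeUnit.MILLISECONDS);
    }

    @Override
    public void close() {
        // called when the last node acks: a fast publication logs nothing at all
        infoReport.cancel(false);
        warnReport.cancel(false);
        scheduler.shutdown();
    }
}
```

Cancelling on completion is what avoids the "third periodic warning plus end-of-publication warning in quick succession" problem that an interval-based reporter has.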
LGTM, maybe fix the appender stop order, but it shouldn't matter with the DeterministicTaskQueue I think.
+       defaultMillis(LagDetector.CLUSTER_FOLLOWER_LAG_TIMEOUT_SETTING), "waiting for warning to be emitted");
    mockLogAppenderForLagDetection.assertAllExpectationsMatched();
} finally {
    mockLogAppenderForLagDetection.stop();
I think it doesn't matter here, but in general you should always remove appenders and then stop them, so you don't get an error for trying to use a stopped appender (same comment applies in the other log appender spots).
Ok, thanks, adjusted in c9df59c. It doesn't matter here indeed, but a stop-before-remove in places with more concurrency is effectively an assertion that no more log messages are being emitted.
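The ordering argument can be illustrated with a minimal sketch; the types below are hypothetical stand-ins, not the real Log4j or MockLogAppender API:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical minimal types, not the real Log4j or MockLogAppender API: a
// stopped appender rejects events, so it must be unreachable from the logger
// before stop() is called. Remove-then-stop is therefore the generally safe
// order, while stop-then-remove doubles as an assertion that nothing logs
// through the appender any more.
class SketchAppender {
    private volatile boolean stopped = false;
    int events = 0;

    void append(String event) {
        if (stopped) throw new IllegalStateException("appender used after stop");
        events++;
    }

    void stop() { stopped = true; }
}

class SketchLogger {
    private final List<SketchAppender> appenders = new CopyOnWriteArrayList<>();
    void addAppender(SketchAppender a) { appenders.add(a); }
    void removeAppender(SketchAppender a) { appenders.remove(a); }
    void log(String event) { appenders.forEach(a -> a.append(event)); }
}
```

Safe teardown here is `logger.removeAppender(appender); appender.stop();`. With the reverse order, a concurrent `log` call could reach the stopped appender and trip the exception, which is exactly the assertion behaviour described in the comment above.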
@elasticmachine please run elasticsearch-ci/2
Great stuff. I've left some comments and questions for discussion.
@@ -99,6 +99,11 @@
    Setting.timeSetting("cluster.publish.timeout",
        TimeValue.timeValueMillis(30000), TimeValue.timeValueMillis(1), Setting.Property.NodeScope);

+   // the interval between reports of slow publication
+   public static final Setting<TimeValue> PUBLISH_REPORT_INTERVAL_SETTING =
The setting is not registered. Is that by design or omission?
Omission :(
We really should have build-time checks for this.
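The kind of check wished for here can be sketched as a simple key registry; this is a hypothetical illustration, not the actual ClusterSettings machinery, and the unregistered setting key in the test is illustrative only:

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of why an unregistered setting matters: a registry in
// the style of node settings validation only accepts keys it knows about, so
// a Setting constant that is never added to the registered set is either
// silently dead or rejected when a user actually tries to configure it.
class SettingsRegistry {
    private final Set<String> registeredKeys;

    SettingsRegistry(Set<String> registeredKeys) {
        this.registeredKeys = registeredKeys;
    }

    void validate(Map<String, String> userSettings) {
        for (String key : userSettings.keySet()) {
            if (!registeredKeys.contains(key)) {
                throw new IllegalArgumentException("unknown setting [" + key + "]");
            }
        }
    }
}
```

A build-time check would walk the declared `Setting` constants and assert that each one appears in the registered set, turning this class of omission into a test failure instead of a runtime surprise.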
server/src/main/java/org/elasticsearch/cluster/coordination/LagDetector.java
@@ -569,7 +569,7 @@ public TimeValue ackTimeout() {

    protected void warnAboutSlowTaskIfNeeded(TimeValue executionTime, String source) {
        if (executionTime.getMillis() > slowTaskLoggingThreshold.getMillis()) {
-           logger.warn("cluster state update task [{}] took [{}] which is above the warn threshold of {}", source, executionTime,
+           logger.warn("cluster state update task [{}] took [{}] which is above the warn threshold of [{}]", source, executionTime,
An interesting follow-up would be to change the slow logging in MasterService so that it only applies to calculateTaskOutputs, and then reduce the timeout from 30s to 10s. Currently, the slow logging in MasterService measures cluster state update calculation, publication + application. The latter two are covered by your PR here now.
++ good idea
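The follow-up idea of timing one phase in isolation can be sketched as a small helper; the class and names are hypothetical, not the real MasterService code:

```java
import java.util.function.BiConsumer;
import java.util.function.Supplier;

// Hypothetical helper, not the real MasterService code: time one phase of a
// cluster state update in isolation and report it when it exceeds a threshold
// that applies only to that phase, so e.g. a 10s threshold can cover the
// computation step alone rather than computation plus publication plus
// application under a single 30s threshold.
class PhaseTimer {
    static <T> T timePhase(String phase, long warnThresholdMillis, Supplier<T> task,
                           BiConsumer<String, Long> onSlow) {
        long startNanos = System.nanoTime();
        T result = task.get();
        long tookMillis = (System.nanoTime() - startNanos) / 1_000_000;
        if (tookMillis > warnThresholdMillis) {
            onSlow.accept(phase, tookMillis); // e.g. log a warning naming the slow phase
        }
        return result;
    }
}
```

Wrapping only the computation step this way gives a warning that points at the slow phase directly, independent of how long the subsequent publication takes.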
@elasticmachine please run elasticsearch-ci/2
LGTM
Today the lag detector may remove nodes from the cluster if they fail to apply a cluster state within a reasonable timeframe, but it is rather unclear from the default logging that this has occurred, and there is very little extra information beyond the fact that the removed node was lagging. Moreover, the only forewarning that the lag detector might be invoked is a message indicating that cluster state publication took unreasonably long, which does not contain enough information to investigate the problem further. This commit adds a good deal more detail to make the issues of slow nodes more prominent:

- after 10 seconds (by default) we log an INFO message indicating that a publication is still waiting for responses from some nodes, including the identities of the problematic nodes.
- when the publication times out after 30 seconds (by default) we log a WARN message identifying the nodes that are still pending.
- the lag detector logs a more detailed warning when a fatally-lagging node is detected.
- if applying a cluster state takes too long then the cluster applier service logs a breakdown of all the tasks it ran as part of that process.
Adds a tighter threshold for logging a warning about slowness in the `MasterService` instead of relying on the cluster service's 30-second warning threshold. This new threshold applies to the computation of the cluster state update in isolation, so we get a warning if computing a new cluster state update takes longer than 10 seconds even if it is subsequently applied quickly. It also applies independently to the length of time it takes to notify the cluster state tasks on completion of publication, in case any of these notifications holds up the master thread for too long. Relates elastic#45007
Adds a tighter threshold for logging a warning about slowness in the `MasterService` instead of relying on the cluster service's 30-second warning threshold. This new threshold applies to the computation of the cluster state update in isolation, so we get a warning if computing a new cluster state update takes longer than 10 seconds even if it is subsequently applied quickly. It also applies independently to the length of time it takes to notify the cluster state tasks on completion of publication, in case any of these notifications holds up the master thread for too long. Relates #45007 Backport of #45086