More logging for slow cluster state application #45007

Merged
merged 5 commits into elastic:master from 2019-07-30-lagging-logging on Aug 1, 2019

Conversation

DaveCTurner
Contributor

Today the lag detector may remove nodes from the cluster if they fail to apply
a cluster state within a reasonable timeframe, but it is rather unclear from
the default logging that this has occurred and there is very little extra
information beyond the fact that the removed node was lagging. Moreover the
only forewarning that the lag detector might be invoked is a message indicating
that cluster state publication took unreasonably long, which does not contain
enough information to investigate the problem further.

This commit adds a good deal more detail to make the issues of slow nodes more
prominent:

- every 10 seconds (by default) we log a warning indicating that a publication
  is still waiting for responses from some nodes, including the identities of
  the problematic nodes.

- the lag detector logs a more detailed warning when a fatally-lagging node is
  detected.

- if applying a cluster state takes too long then the cluster applier service
  logs a breakdown of all the tasks it ran as part of that process.
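
To make the last point concrete, here is a minimal, hypothetical sketch of the technique: time each applier/listener task individually and, when the whole application is slow, emit a single breakdown line. The class and method names (AppliedStateRecorder, run, breakdown) are illustrative, not the actual ClusterApplierService code.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only, not the real ClusterApplierService implementation.
class AppliedStateRecorder {
    static final class TimedTask {
        final String description;
        final long tookMillis;
        TimedTask(String description, long tookMillis) {
            this.description = description;
            this.tookMillis = tookMillis;
        }
    }

    private final List<TimedTask> tasks = new ArrayList<>();

    // Run one applier/listener task and record how long it took.
    void run(String description, Runnable task) {
        final long startNanos = System.nanoTime();
        try {
            task.run();
        } finally {
            final long tookMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
            tasks.add(new TimedTask(description, tookMillis));
        }
    }

    // Builds a breakdown in the spirit of "[running applier [...]] took [4779ms], ...".
    String breakdown() {
        final StringBuilder sb = new StringBuilder();
        for (TimedTask task : tasks) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append('[').append(task.description).append("] took [").append(task.tookMillis).append("ms]");
        }
        return sb.toString();
    }
}

If the total time across all recorded tasks exceeds the applier's warn threshold, a breakdown string like this is what gets appended to the warning.
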
@DaveCTurner DaveCTurner added >enhancement :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v8.0.0 v7.4.0 labels Jul 30, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@DaveCTurner
Contributor Author

DaveCTurner commented Jul 30, 2019

The one-line log message from the ClusterApplierService is informative but not especially pretty, because essentially no appliers or listeners implement a sensible toString() method:

[2019-07-30T14:42:47,259][WARN ][o.e.c.s.ClusterApplierService] [node-1] cluster state applier task [ApplyCommitRequest{term=1, version=2, sourceNode={node-0}{Ly-ZNAXgScKxPEjMyIsDmw}{sDmj2lBEQxaKD-iBA5bHEw}{127.0.0.1}{127.0.0.1:9300}{dim}{ml.machine_memory=17179869184, ml.max_open_jobs=20, xpack.installed=true}}] took [4.8s] which is above the warn threshold of [3s]: [running task [ApplyCommitRequest{term=1, version=2, sourceNode={node-0}{Ly-ZNAXgScKxPEjMyIsDmw}{sDmj2lBEQxaKD-iBA5bHEw}{127.0.0.1}{127.0.0.1:9300}{dim}{ml.machine_memory=17179869184, ml.max_open_jobs=20, xpack.installed=true}}]] took [0ms], [connecting to new nodes] took [3ms], [running applier [org.elasticsearch.indices.cluster.IndicesClusterStateService@7b93039f]] took [4779ms], [running applier [org.elasticsearch.script.ScriptService@6a7b0bcb]] took [0ms], [running applier [org.elasticsearch.xpack.ilm.IndexLifecycleService@437aa918]] took [0ms], [running applier [org.elasticsearch.repositories.RepositoriesService@34c89ab9]] took [0ms], [running applier [org.elasticsearch.snapshots.RestoreService@57a4e844]] took [0ms], [running applier [org.elasticsearch.ingest.IngestService@65338428]] took [0ms], [running applier [org.elasticsearch.action.ingest.IngestActionForwarder@7317d3ff]] took [0ms], [running applier [org.elasticsearch.tasks.TaskManager@37a9276]] took [0ms], [running applier [org.elasticsearch.snapshots.SnapshotsService@736cbc95]] took [0ms], [notifying listener [org.elasticsearch.cluster.InternalClusterInfoService@6e29ba2]] took [0ms], [notifying listener [org.elasticsearch.xpack.security.support.SecurityIndexManager@37f478be]] took [0ms], [notifying listener [org.elasticsearch.xpack.security.support.SecurityIndexManager@66534e86]] took [0ms], [notifying listener [org.elasticsearch.xpack.security.authc.TokenService$$Lambda$1901/0x000000080182a840@16767295]] took [0ms], [notifying listener [org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$$Lambda$1970/0x000000080185c840@4dbd472]] took [0ms], [notifying listener [org.elasticsearch.xpack.watcher.support.WatcherIndexTemplateRegistry@10fe6ddf]] took [0ms], [notifying listener [org.elasticsearch.xpack.watcher.WatcherLifeCycleService@6c253638]] took [0ms], [notifying listener [org.elasticsearch.xpack.watcher.WatcherIndexingListener@49e213f3]] took [0ms], [notifying listener [org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager@134036f9]] took [3ms], [notifying listener [org.elasticsearch.xpack.ml.datafeed.DatafeedManager$TaskRunner@67acb2d6]] took [0ms], [notifying listener [org.elasticsearch.xpack.ml.MlInitializationService@181c15e2]] took [0ms], [notifying listener [org.elasticsearch.xpack.ml.MlAssignmentNotifier@7165f508]] took [0ms], [notifying listener [org.elasticsearch.xpack.ilm.IndexLifecycleService@437aa918]] took [0ms], [notifying listener [org.elasticsearch.xpack.core.slm.history.SnapshotLifecycleTemplateRegistry@5dfb4a39]] took [0ms], [notifying listener [org.elasticsearch.xpack.slm.SnapshotLifecycleService@5f437952]] took [0ms], [notifying listener [org.elasticsearch.xpack.ccr.action.ShardFollowTaskCleaner@2708d45d]] took [0ms], [notifying listener [org.elasticsearch.cluster.metadata.TemplateUpgradeService@799ba801]] took [0ms], [notifying listener [org.elasticsearch.node.ResponseCollectorService@7eb68f89]] took [0ms], [notifying listener [org.elasticsearch.snapshots.SnapshotShardsService@3afbf0e4]] took [0ms], [notifying listener 
[org.elasticsearch.xpack.ml.action.TransportOpenJobAction$OpenJobPersistentTasksExecutor$$Lambda$2943/0x0000000801a9c840@5b3061da]] took [0ms], [notifying listener [org.elasticsearch.persistent.PersistentTasksClusterService@7f4f111d]] took [0ms], [notifying listener [org.elasticsearch.cluster.routing.DelayedAllocationService@35cf03ba]] took [0ms], [notifying listener [org.elasticsearch.indices.store.IndicesStore@5056b937]] took [0ms], [notifying listener [org.elasticsearch.gateway.DanglingIndicesState@8b03c56]] took [0ms], [notifying listener [org.elasticsearch.persistent.PersistentTasksNodeService@ff013a4]] took [0ms], [notifying listener [org.elasticsearch.license.LicenseService@42ef1333]] took [0ms], [notifying listener [org.elasticsearch.xpack.monitoring.exporter.local.LocalExporter@27959558]] took [17ms], [notifying listener [org.elasticsearch.xpack.ccr.action.AutoFollowCoordinator@164e5b2]] took [0ms], [notifying listener [org.elasticsearch.gateway.GatewayService@516305d]] took [0ms], [notifying listener [org.elasticsearch.cluster.service.ClusterApplierService$LocalNodeMasterListeners@4b9fa6a8]] took [0ms], [notifying listener [org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener@355a1a6b]] took [0ms], [notifying listener [org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener@68b69ac7]] took [0ms]
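
For comparison, a listener that overrides toString() would show up readably in that breakdown instead of as a class name plus identity hash. This is an illustrative class, not a change proposed in this PR:

import org.elasticsearch.cluster.ClusterChangedEvent;
import org.elasticsearch.cluster.ClusterStateListener;

// Illustrative only: overriding toString() turns entries like
// "org.elasticsearch.xpack.watcher.WatcherLifeCycleService@6c253638"
// into something human-readable in the timing breakdown.
public class ExampleLifecycleListener implements ClusterStateListener {
    @Override
    public void clusterChanged(ClusterChangedEvent event) {
        // react to the applied cluster state here
    }

    @Override
    public String toString() {
        return "example lifecycle listener";
    }
}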

@@ -1348,8 +1372,10 @@ public void onSuccess(String source) {
}
}
lagDetector.startLagDetector(publishRequest.getAcceptedState().version());
logIncompleteNodes();
Contributor Author

Because 10s divides 30s exactly, there is a chance that by default sometimes we'll log the third periodic warning and then the end-of-publication warning in quick succession. We could just not log the end-of-publication warning here.

Member

@DaveCTurner that's not so bad imo, we'll see the thread this was logged from in the logs as well so it's not that weird? (put differently, it doesn't bother me :P and maybe by some random chance this will help debug some locking issue with mutex who knows :D)

Contributor

Thinking about this made me slightly doubt whether we should do a periodic logging here based on an interval. I wonder if we should instead just allow to define two timeouts, one for debug (5s) and one for info logging (10s), and then do warn logging upon publication timeout (i.e. what we have here).

Contributor Author

I think an earlier debug log is of limited value, because people don't often run with debug logging on the offchance that it catches something interesting like this. I do agree that it seems sufficient to emit a single info message after 10 seconds, and then a warning after 30 seconds, so I've implemented that in 16fae08.
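
A minimal sketch of that scheme, outside the Coordinator: one INFO message if the publication is still pending after 10s, a WARN at the 30s timeout, and cancellation of both if it completes sooner. The class name, messages, logger, and the use of ScheduledExecutorService are illustrative; the real change lives in the publication code and uses Elasticsearch's own scheduling and settings.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.logging.Logger;

// Hypothetical watchdog illustrating the INFO-at-10s / WARN-at-30s scheme.
class PublicationWatchdog {
    private static final Logger logger = Logger.getLogger("publication");
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private ScheduledFuture<?> infoTask;
    private ScheduledFuture<?> warnTask;

    void publicationStarted(String description) {
        infoTask = scheduler.schedule(
            () -> logger.info("publication " + description + " is still waiting for responses after 10s"),
            10, TimeUnit.SECONDS);
        warnTask = scheduler.schedule(
            () -> logger.warning("publication " + description + " timed out after 30s"),
            30, TimeUnit.SECONDS);
    }

    void publicationCompleted() {
        // Cancel both messages once every node has responded.
        infoTask.cancel(false);
        warnTask.cancel(false);
    }
}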

Member

@original-brownbear original-brownbear left a comment

LGTM, maybe fix the appender stop order but it shouldn't matter with the DeterministicTaskQueue I think.

+ defaultMillis(LagDetector.CLUSTER_FOLLOWER_LAG_TIMEOUT_SETTING), "waiting for warning to be emitted");
mockLogAppenderForLagDetection.assertAllExpectationsMatched();
} finally {
mockLogAppenderForLagDetection.stop();
Member

I think it doesn't matter here, but in general you should always remove appenders and then stop them to not get an error for trying to use a stopped appender? (same comment in the other log appender spots)

Contributor Author

Ok, thanks, adjusted in c9df59c. It doesn't matter here indeed, but a stop-before-remove in places with more concurrency is effectively an assertion that no more log messages are being emitted.
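
A sketch of the teardown ordering being discussed, using the test helpers roughly as they appear in the diff (MockLogAppender, Loggers); the logger name, expectation, and triggerLag hook are placeholders, not the PR's actual test.

import org.apache.logging.log4j.Level;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.elasticsearch.common.logging.Loggers;
import org.elasticsearch.test.MockLogAppender;

public class LagWarningLoggingExample {
    public void assertLagWarningIsLogged(Runnable triggerLag) throws Exception {
        Logger lagLogger = LogManager.getLogger("org.elasticsearch.cluster.coordination.LagDetector");
        MockLogAppender appender = new MockLogAppender();
        appender.start();
        appender.addExpectation(new MockLogAppender.SeenEventExpectation(
            "lag warning", lagLogger.getName(), Level.WARN, "*lagging*"));
        Loggers.addAppender(lagLogger, appender);
        try {
            triggerLag.run();
            appender.assertAllExpectationsMatched();
        } finally {
            // Detach the appender from the logger first, then stop it; stopping a
            // still-attached appender risks "attempted to append to a stopped
            // appender" errors if another thread logs concurrently.
            Loggers.removeAppender(lagLogger, appender);
            appender.stop();
        }
    }
}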

@DaveCTurner
Contributor Author

@elasticmachine please run elasticsearch-ci/2

Contributor

@ywelsch ywelsch left a comment

Great stuff. I've left some comments and questions for discussion.

@@ -99,6 +99,11 @@
Setting.timeSetting("cluster.publish.timeout",
TimeValue.timeValueMillis(30000), TimeValue.timeValueMillis(1), Setting.Property.NodeScope);

// the interval between reports of slow publication
public static final Setting<TimeValue> PUBLISH_REPORT_INTERVAL_SETTING =
Contributor

the setting is not registered. Is that by design or omission?

Contributor Author

Omission :(

We really should have build-time checks for this.
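
For anyone following along: "registering" here means adding the setting to the cluster settings list; an unregistered setting compiles fine but is rejected as an unknown setting at runtime, which is why a build-time check would help. A rough sketch with a placeholder key and default (the diff above cuts off before the real ones):

import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.common.unit.TimeValue;

public class SlowPublicationLoggingSettings {
    // Placeholder key and default value, for illustration only.
    public static final Setting<TimeValue> PUBLISH_REPORT_INTERVAL_SETTING =
        Setting.timeSetting("cluster.publish.report_interval",
            TimeValue.timeValueMillis(10_000), TimeValue.timeValueMillis(1),
            Setting.Property.NodeScope);

    // The missing step: the setting must also appear in the set that
    // ClusterSettings is constructed with (BUILT_IN_CLUSTER_SETTINGS in core);
    // otherwise configuring it fails with an "unknown setting" error.
}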

@@ -569,7 +569,7 @@ public TimeValue ackTimeout() {

protected void warnAboutSlowTaskIfNeeded(TimeValue executionTime, String source) {
if (executionTime.getMillis() > slowTaskLoggingThreshold.getMillis()) {
logger.warn("cluster state update task [{}] took [{}] which is above the warn threshold of {}", source, executionTime,
logger.warn("cluster state update task [{}] took [{}] which is above the warn threshold of [{}]", source, executionTime,
Contributor

An interesting follow-up would be to change the slow logging in MasterService so that it only applies to calculateTaskOutputs, and then reduce the timeout from 30s to 10s. Currently, the slow logging in MasterService measures cluster state update calculation, publication, and application; the latter two are now covered by your PR.

Contributor Author

++ good idea

@DaveCTurner
Contributor Author

@elasticmachine please run elasticsearch-ci/2

@DaveCTurner DaveCTurner requested a review from ywelsch July 31, 2019 13:58
Contributor

@ywelsch ywelsch left a comment

LGTM

@DaveCTurner DaveCTurner merged commit 7776f75 into elastic:master Aug 1, 2019
@DaveCTurner DaveCTurner deleted the 2019-07-30-lagging-logging branch August 1, 2019 07:21
DaveCTurner added a commit that referenced this pull request Aug 1, 2019
Today the lag detector may remove nodes from the cluster if they fail to apply
a cluster state within a reasonable timeframe, but it is rather unclear from
the default logging that this has occurred and there is very little extra
information beyond the fact that the removed node was lagging. Moreover the
only forewarning that the lag detector might be invoked is a message indicating
that cluster state publication took unreasonably long, which does not contain
enough information to investigate the problem further.

This commit adds a good deal more detail to make the issues of slow nodes more
prominent:

- after 10 seconds (by default) we log an INFO message indicating that a
  publication is still waiting for responses from some nodes, including the
  identities of the problematic nodes.

- when the publication times out after 30 seconds (by default) we log a WARN
  message identifying the nodes that are still pending.

- the lag detector logs a more detailed warning when a fatally-lagging node is
  detected.

- if applying a cluster state takes too long then the cluster applier service
  logs a breakdown of all the tasks it ran as part of that process.
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Aug 1, 2019
Adds a tighter threshold for logging a warning about slowness in the
`MasterService` instead of relying on the cluster service's 30-second warning
threshold. This new threshold applies to the computation of the cluster state
update in isolation, so we get a warning if computing a new cluster state
update takes longer than 10 seconds even if it is subsequently applied quickly.
It also applies independently to the length of time it takes to notify the
cluster state tasks on completion of publication, in case any of these
notifications holds up the master thread for too long.

Relates elastic#45007
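
The idea in that follow-up generalizes to a simple pattern, sketched here outside of MasterService (the class name, logger, and thresholds are illustrative): measure each phase on its own and warn against a threshold specific to that phase, rather than one 30-second threshold that covers computation, publication, and application together.

import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;
import java.util.logging.Logger;

// Hypothetical helper showing per-phase slow-task warnings.
class PhaseTimer {
    private static final Logger logger = Logger.getLogger("master-service");

    static <T> T timePhase(String phase, long warnThresholdMillis, Supplier<T> work) {
        final long startNanos = System.nanoTime();
        try {
            return work.get();
        } finally {
            final long tookMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
            if (tookMillis > warnThresholdMillis) {
                logger.warning("[" + phase + "] took [" + tookMillis + "ms] which is above the warn threshold of ["
                    + warnThresholdMillis + "ms]");
            }
        }
    }
}

For example, wrapping only the cluster state computation with a 10s threshold warns about slow computation even when the subsequent publication and application are fast.
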
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Aug 6, 2019
Adds a tighter threshold for logging a warning about slowness in the
`MasterService` instead of relying on the cluster service's 30-second warning
threshold. This new threshold applies to the computation of the cluster state
update in isolation, so we get a warning if computing a new cluster state
update takes longer than 10 seconds even if it is subsequently applied quickly.
It also applies independently to the length of time it takes to notify the
cluster state tasks on completion of publication, in case any of these
notifications holds up the master thread for too long.

Relates elastic#45007
DaveCTurner added a commit that referenced this pull request Aug 6, 2019
Adds a tighter threshold for logging a warning about slowness in the
`MasterService` instead of relying on the cluster service's 30-second warning
threshold. This new threshold applies to the computation of the cluster state
update in isolation, so we get a warning if computing a new cluster state
update takes longer than 10 seconds even if it is subsequently applied quickly.
It also applies independently to the length of time it takes to notify the
cluster state tasks on completion of publication, in case any of these
notifications holds up the master thread for too long.

Relates #45007
DaveCTurner added a commit that referenced this pull request Aug 6, 2019
Adds a tighter threshold for logging a warning about slowness in the
`MasterService` instead of relying on the cluster service's 30-second warning
threshold. This new threshold applies to the computation of the cluster state
update in isolation, so we get a warning if computing a new cluster state
update takes longer than 10 seconds even if it is subsequently applied quickly.
It also applies independently to the length of time it takes to notify the
cluster state tasks on completion of publication, in case any of these
notifications holds up the master thread for too long.

Relates #45007
Backport of #45086
@mfussenegger mfussenegger mentioned this pull request Mar 26, 2020