Warn on slow metadata persistence #47005

DaveCTurner · 2019-09-24T11:30:23Z

Today if metadata persistence is excessively slow on a master-ineligible node
then the ClusterApplierService emits a warning indicating that the
GatewayMetaState applier was slow, but gives no further details. If it is
excessively slow on a master-eligible node then we do not see any warning at
all, although we might see other consequences such as a lagging node or a
master failure.

With this commit we emit a warning if metadata persistence takes longer than a
configurable threshold, which defaults to 10s. We also emit statistics that
record how much index metadata was persisted and how much was skipped since
this can help distinguish cases where IO was slow from cases where there are
simply too many indices involved.
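
As a rough illustration of the behaviour described above, here is a minimal Java sketch of a duration check against a warn threshold, combined with written/skipped counters. It is not the code from this PR: the class name, the `warningIfSlow` helper and the message wording are assumptions made for illustration. Only the 10s default and the `incrementIndicesWritten` method name come from this page.

```java
import java.util.Locale;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch; not the actual Elasticsearch implementation. */
public class SlowPersistenceSketch {

    /** Counts how much index metadata was written versus skipped. */
    static final class WriteStats {
        private int indicesWritten;
        private int indicesSkipped;

        void incrementIndicesWritten() { indicesWritten++; }
        void incrementIndicesSkipped() { indicesSkipped++; }
        int indicesWritten() { return indicesWritten; }
        int indicesSkipped() { return indicesSkipped; }
    }

    // Assumed constant mirroring the 10s default mentioned in the description.
    static final long DEFAULT_THRESHOLD_MILLIS = TimeUnit.SECONDS.toMillis(10);

    /** Returns an illustrative warning message if the write was slow, or null otherwise. */
    static String warningIfSlow(long durationMillis, long thresholdMillis, WriteStats stats) {
        if (durationMillis <= thresholdMillis) {
            return null; // fast enough: nothing to report
        }
        return String.format(
            Locale.ROOT,
            "writing metadata took [%dms] which is above the warn threshold of [%dms]; "
                + "wrote metadata for [%d] indices and skipped [%d] unchanged indices",
            durationMillis, thresholdMillis, stats.indicesWritten(), stats.indicesSkipped());
    }

    public static void main(String[] args) {
        WriteStats stats = new WriteStats();
        stats.incrementIndicesWritten();
        stats.incrementIndicesSkipped();
        stats.incrementIndicesSkipped();
        // 12s exceeds the 10s default, so a message is produced.
        System.out.println(warningIfSlow(12_000, DEFAULT_THRESHOLD_MILLIS, stats));
    }
}
```

Reporting both counters alongside the duration is what lets a reader tell slow IO (few indices written, long duration) apart from a write that simply touched a very large number of indices.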

elasticmachine · 2019-09-24T11:30:26Z

Pinging @elastic/es-distributed

DaveCTurner · 2019-09-24T12:36:46Z

Failure looks unrelated, I opened #47012.

@elasticmachine please run elasticsearch-ci/1.

ywelsch

Great change. Left one optional comment. Looks good o.w.

ywelsch · 2019-09-25T08:53:35Z

server/src/main/java/org/elasticsearch/gateway/IncrementalClusterStateWriter.java

@@ -320,6 +358,22 @@ void rollback() {
            rollbackCleanupActions.forEach(Runnable::run);
            finished = true;
        }
+
+        void incrementIndicesWritten() {


Perhaps we can just make the fields package-visible and operate directly on them, removing the getters and setters here, given that they are only used within IncrementalClusterStateWriter. That reduces the amount of clutter and avoids the extra test provisions for the interaction mocking.

DaveCTurner · 2019-09-25

Asserting on the interactions with the increment* methods is deliberate; it would add a good deal more noise to check that they're being called in a different way.
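
To make the trade-off in this exchange concrete, here is a small self-contained Java sketch of asserting on calls to increment-style methods. Everything in it is hypothetical: the `Stats`, `RecordingStats` and `writeIndices` names are invented for illustration, and a hand-rolled recording subclass stands in for the mocking library that the real tests appear to use.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of interaction-based assertions; not the PR's actual test code. */
public class IncrementInteractionSketch {

    /** Stats holder that exposes increment methods rather than raw fields. */
    static class Stats {
        private int written;
        private int skipped;

        void incrementIndicesWritten() { written++; }
        void incrementIndicesSkipped() { skipped++; }
    }

    /** Test double that records each increment call in order. */
    static class RecordingStats extends Stats {
        final List<String> calls = new ArrayList<>();

        @Override
        void incrementIndicesWritten() {
            calls.add("written");
            super.incrementIndicesWritten();
        }

        @Override
        void incrementIndicesSkipped() {
            calls.add("skipped");
            super.incrementIndicesSkipped();
        }
    }

    /** Invented writer: treats an index as written only when its version changed or it is new. */
    static void writeIndices(long[] previousVersions, long[] newVersions, Stats stats) {
        for (int i = 0; i < newVersions.length; i++) {
            boolean changed = i >= previousVersions.length || previousVersions[i] != newVersions[i];
            if (changed) {
                stats.incrementIndicesWritten();
            } else {
                stats.incrementIndicesSkipped();
            }
        }
    }

    public static void main(String[] args) {
        RecordingStats stats = new RecordingStats();
        writeIndices(new long[] {1, 2}, new long[] {1, 3, 1}, stats);
        System.out.println(stats.calls); // prints [skipped, written, written]
    }
}
```

With package-visible fields instead, a test would compare counter values before and after each write. Keeping the methods gives the test a single seam at which each write or skip decision can be observed.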

Backport of #47005.

DaveCTurner added the >enhancement, :Distributed/Cluster Coordination, v8.0.0 and v7.5.0 labels Sep 24, 2019

DaveCTurner requested review from andrershov and ywelsch September 24, 2019 11:30

Mockery

fc5b83d

ywelsch approved these changes Sep 25, 2019

DaveCTurner merged commit db78d30 into elastic:master Sep 25, 2019

DaveCTurner deleted the 2019-09-24-warn-on-slow-metadata-persistence branch September 25, 2019 15:46

DaveCTurner mentioned this pull request Sep 25, 2019

Warn on slow metadata persistence #47130

Merged

DaveCTurner added the backport pending label Sep 25, 2019

DaveCTurner removed the backport pending label Sep 26, 2019

mfussenegger mentioned this pull request Mar 26, 2020

ES Backports crate/crate#9796

Closed

37 tasks

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
