Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Warn on slow metadata persistence #47005

Conversation

DaveCTurner
Copy link
Contributor

Today if metadata persistence is excessively slow on a master-ineligible node
then the ClusterApplierService emits a warning indicating that the
GatewayMetaState applier was slow, but gives no further details. If it is
excessively slow on a master-eligible node then we do not see any warning at
all, although we might see other consequences such as a lagging node or a
master failure.

With this commit we emit a warning if metadata persistence takes longer than a
configurable threshold, which defaults to 10s. We also emit statistics that
record how much index metadata was persisted and how much was skipped since
this can help distinguish cases where IO was slow from cases where there are
simply too many indices involved.

Today if metadata persistence is excessively slow on a master-ineligible node
then the `ClusterApplierService` emits a warning indicating that the
`GatewayMetaState` applier was slow, but gives no further details. If it is
excessively slow on a master-eligible node then we do not see any warning at
all, although we might see other consequences such as a lagging node or a
master failure.

With this commit we emit a warning if metadata persistence takes longer than a
configurable threshold, which defaults to `10s`. We also emit statistics that
record how much index metadata was persisted and how much was skipped since
this can help distinguish cases where IO was slow from cases where there are
simply too many indices involved.
@DaveCTurner DaveCTurner added >enhancement :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v8.0.0 v7.5.0 labels Sep 24, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@DaveCTurner
Copy link
Contributor Author

Failure looks unrelated, I opened #47012.

@elasticmachine please run elasticsearch-ci/1.

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great change. Left one optional comment. Looks good o.w.

@@ -320,6 +358,22 @@ void rollback() {
rollbackCleanupActions.forEach(Runnable::run);
finished = true;
}

void incrementIndicesWritten() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps we can just make the fields package-visible and operate directly on them, removing getter and setters here, given that their are only used within IncrementalClusterStateWriter. Reduces the amount of clutter and avoids the extra test provisions for the interaction mocking.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Asserting on the interactions with the increment* methods is deliberate; it would add a good deal more noise to check that they're being called in a different way.

@DaveCTurner DaveCTurner merged commit db78d30 into elastic:master Sep 25, 2019
@DaveCTurner DaveCTurner deleted the 2019-09-24-warn-on-slow-metadata-persistence branch September 25, 2019 15:46
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Sep 25, 2019
Today if metadata persistence is excessively slow on a master-ineligible node
then the `ClusterApplierService` emits a warning indicating that the
`GatewayMetaState` applier was slow, but gives no further details. If it is
excessively slow on a master-eligible node then we do not see any warning at
all, although we might see other consequences such as a lagging node or a
master failure.

With this commit we emit a warning if metadata persistence takes longer than a
configurable threshold, which defaults to `10s`. We also emit statistics that
record how much index metadata was persisted and how much was skipped since
this can help distinguish cases where IO was slow from cases where there are
simply too many indices involved.
DaveCTurner added a commit that referenced this pull request Sep 26, 2019
Today if metadata persistence is excessively slow on a master-ineligible node
then the `ClusterApplierService` emits a warning indicating that the
`GatewayMetaState` applier was slow, but gives no further details. If it is
excessively slow on a master-eligible node then we do not see any warning at
all, although we might see other consequences such as a lagging node or a
master failure.

With this commit we emit a warning if metadata persistence takes longer than a
configurable threshold, which defaults to `10s`. We also emit statistics that
record how much index metadata was persisted and how much was skipped since
this can help distinguish cases where IO was slow from cases where there are
simply too many indices involved.

Backport of #47005.
@mfussenegger mfussenegger mentioned this pull request Mar 26, 2020
37 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >enhancement v7.5.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants