MasterService does not complete all tasks on shutdown #94930

Open
DaveCTurner opened this issue Mar 31, 2023 · 1 comment
Labels
>bug
:Distributed/Cluster Coordination (Cluster formation and cluster state publication, including cluster membership and fault detection.)
Team:Distributed (Meta label for distributed team)

Comments

@DaveCTurner
Contributor

Today when the MasterService shuts down, it fails waiting tasks but does not necessarily fail the ongoing batch of tasks. For instance, we just drop the batch on the floor here:

if (lifecycle.started() == false) {
    logger.debug("processing [{}]: ignoring, master service not started", summary);
    // the ongoing batch is dropped here: its tasks are never completed or failed
    listener.onResponse(null);
    return;
}

and we swallow rejections here:

assert publicationMayFail() || (exception instanceof EsRejectedExecutionException esre && esre.isExecutorShutdown())
    : exception;
// a rejection caused by executor shutdown passes this assertion and falls through
// to the generic failure handling below
clusterStateUpdateStatsTracker.onPublicationFailure(
    threadPool.rawRelativeTimeInMillis(),
    clusterStatePublicationEvent,
    0L
);
handleException(summary, publicationStartTime, newClusterState, exception);

This behaviour has existed for a long time (i.e. it was not introduced by recent changes in the area such as #92021 and #94325) but I still think we should improve it. Note, however, that simply failing the ongoing tasks on rejection does not work: today, with acked tasks, we call at most one of onAllNodesAcked(), onAckFailure(), onAckTimeout(), or ClusterStateTaskListener#onFailure(), and implementations rely on this fact, yet we may experience a rejection exception after acking has completed. I think that means we have to delay acking until the end of the publication, because the alternative would be to suppress onFailure() calls for acked tasks, which seems like a confusing API choice that will lead to bugs.
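For illustration, here is a minimal, self-contained sketch (not Elasticsearch code; the listener interface and method names are hypothetical stand-ins for the real acked-task callbacks) of the at-most-one-completion contract described above. The guard in atMostOnce() is essentially the "suppress onFailure() after acking" alternative, shown only to make the problem concrete:

import java.util.concurrent.atomic.AtomicBoolean;

public class AckContractSketch {

    // Hypothetical stand-in for the real acked-task listener callbacks.
    interface AckedTaskListener {
        void onAllNodesAcked();
        void onAckFailure(Exception e);
        void onAckTimeout();
        void onFailure(Exception e);
    }

    // Wrapper enforcing the at-most-one-notification contract by suppressing
    // any callback after the first one has fired.
    static AckedTaskListener atMostOnce(AckedTaskListener delegate) {
        AtomicBoolean notified = new AtomicBoolean();
        return new AckedTaskListener() {
            public void onAllNodesAcked() {
                if (notified.compareAndSet(false, true)) delegate.onAllNodesAcked();
            }
            public void onAckFailure(Exception e) {
                if (notified.compareAndSet(false, true)) delegate.onAckFailure(e);
            }
            public void onAckTimeout() {
                if (notified.compareAndSet(false, true)) delegate.onAckTimeout();
            }
            public void onFailure(Exception e) {
                if (notified.compareAndSet(false, true)) delegate.onFailure(e);
            }
        };
    }

    public static void main(String[] args) {
        AckedTaskListener task = atMostOnce(new AckedTaskListener() {
            public void onAllNodesAcked() { System.out.println("all nodes acked"); }
            public void onAckFailure(Exception e) { System.out.println("ack failure: " + e); }
            public void onAckTimeout() { System.out.println("ack timeout"); }
            public void onFailure(Exception e) { System.out.println("task failed: " + e); }
        });

        // Acking completes first ...
        task.onAllNodesAcked();
        // ... and only then is the publication rejected because the executor shut down.
        // Without the guard, this second call would notify the same task twice.
        task.onFailure(new RuntimeException("rejected on shutdown"));
    }
}

Without such a guard, the late onFailure() call would notify the same task twice, which is exactly what today's implementations do not expect; with the guard, failures after acking are silently suppressed, which is the confusing API choice mentioned above.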

@DaveCTurner DaveCTurner added >bug :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. labels Mar 31, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team label Mar 31, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)
