Fix scheduling of ClusterInfoService#refresh #59880
Today the `InternalClusterInfoService` uses the `LocalNodeMasterListener` interface to start/stop its operations. Since the `onMaster` and `offMaster` methods are called on the `MANAGEMENT` threadpool, there's no guarantee that they run in the correct sequence, which could result in an elected master failing to regularly update the cluster info.

Since this service is also a `ClusterStateListener` we may as well drop the usage of the `LocalNodeMasterListener` interface and simply update the status of the local node on the applier thread in `clusterChanged` to ensure consistency.

Additionally, today the `InternalClusterInfoService` uses a simple flag to track whether the local node is the elected master or not. If the node stops being the master and then starts again within a few seconds, then the scheduled updates from the old mastership might carry on running in addition to the ones for the new mastership. This commit addresses that by tracking the identity of the scheduled update job and creating a new job for each mastership.
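For illustration only, here is a minimal sketch of the "track the identity of the job" idea described above. It is not the code in this PR: the class `MastershipJobSketch`, the inner `RefreshJob`, and the `refreshCount` counter are hypothetical names, and the real service reschedules itself on a threadpool rather than being run by hand. The sketch assumes `clusterChanged` is only ever called from a single thread, as it would be on the applier thread.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for the service; not the Elasticsearch implementation.
class MastershipJobSketch {
    // Holds the job belonging to the current mastership, or null while not master.
    private final AtomicReference<Runnable> currentJob = new AtomicReference<>();
    // Counts refreshes that were actually performed (stands in for the real work).
    final AtomicInteger refreshCount = new AtomicInteger();

    // One instance is created per mastership; its identity is what gets compared.
    private final class RefreshJob implements Runnable {
        @Override
        public void run() {
            if (currentJob.get() != this) {
                return; // stale job from an earlier mastership: do nothing
            }
            refreshCount.incrementAndGet();
            // A real job would reschedule itself here, after the same identity check.
        }
    }

    // Assumed to be called in order on a single thread.
    Runnable clusterChanged(boolean localNodeMaster) {
        if (localNodeMaster && currentJob.get() == null) {
            Runnable job = new RefreshJob();
            currentJob.set(job); // new mastership, new job
            return job;
        }
        if (localNodeMaster == false) {
            currentJob.set(null); // invalidates whatever job was running
        }
        return currentJob.get();
    }

    public static void main(String[] args) {
        MastershipJobSketch service = new MastershipJobSketch();
        Runnable first = service.clusterChanged(true);
        first.run();                   // current job: refreshes
        service.clusterChanged(false); // lost mastership
        Runnable second = service.clusterChanged(true);
        first.run();                   // stale job: skipped
        second.run();                  // current job: refreshes
        System.out.println("refreshes performed: " + service.refreshCount.get());
    }
}
```

With a plain boolean flag, the stale `first` job would see "is master" again after re-election and keep running alongside `second`. Comparing against the job's own identity is what lets the old job notice that it has been superseded.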
Pinging @elastic/es-distributed (:Distributed/Allocation)
It's annoying that this service is blocking a management thread (and there is no reason for it to do so). Anyway, removing that would be a larger change. Getting the data race fixed is a good step.
    public void clusterChanged(ClusterChangedEvent event) {
        if (event.localNodeMaster() && refreshAndRescheduleRunnable.get() == null) {
            logger.trace("elected as master, scheduling cluster info update tasks");
            executeRefresh(clusterService.state(), "became master");
Maybe just `event.state()` here?
++ 6209182
Yeah the blocking nature of |