Skip to content

Conversation

DaveCTurner
Copy link
Contributor

Adds another layer of detection of slow activity on the cluster applier
thread. In particular this can detect activity that isn't included in an
UpdateTask, which particularly may include completing an expensive
listener attached to a ClusterStateObserver. Moreover it captures a
thread dump if slow activity is detected.

Adds another layer of detection of slow activity on the cluster applier
thread. In particular this can detect activity that isn't included in an
`UpdateTask`, which particularly may include completing an expensive
listener attached to a `ClusterStateObserver`. Moreover it captures a
thread dump if slow activity is detected.
@DaveCTurner DaveCTurner requested a review from ywangd September 9, 2025 11:57
@DaveCTurner DaveCTurner requested a review from a team as a code owner September 9, 2025 11:57
@DaveCTurner DaveCTurner added >enhancement :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. Supportability Improve our (devs, SREs, support eng, users) ability to troubleshoot/self-service product better. v9.2.0 labels Sep 9, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Sep 9, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine
Copy link
Collaborator

Hi @DaveCTurner, I've created a changelog YAML for you.

@DaveCTurner
Copy link
Contributor Author

This has required surprisingly many changes to the test suite, although they're all very simple and fall into two categories:

  1. some tests register only those settings that are actually used by the test, and these registrations needed extending because we used the new settings when constructing a ClusterApplierService.

  2. some tests use a ClusterApplierService and call DeterministicTaskQueue#runAllTasks which is incompatible with the ThreadWatchdog since the watchdog creates an infinite sequence of tasks, so must be disabled by setting its interval to zero.

Copy link
Member

@ywangd ywangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Only had minor comments.

this.interval = interval;
this.quietTime = quietTime.compareTo(interval) <= 0 ? interval : quietTime;
this.lifecycle = lifecycle;
this.logger = logger;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hot-threads dumper still links the doc for ReferenceDocs.NETWORK_THREADING_MODEL which is not applicable in the new case. Do we care?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes you're right. I can't think of a great place to document this. I mean it's kind of the same principle, this thread should be frequently idle just like the transport_worker threads. I think I'm going to say we don't care.


final AtomicBoolean completedTask = new AtomicBoolean();

clusterApplierService.runOnApplierThread("blocking task", randomFrom(Priority.values()), ignored -> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have a test for immediate listener firing in ClusterApplierService#addTimeoutListener as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes good point, see ea0fca3.

@DaveCTurner DaveCTurner added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Sep 12, 2025
@elasticsearchmachine elasticsearchmachine added the serverless-linked Added by automation, don't add manually label Oct 4, 2025
@elasticsearchmachine elasticsearchmachine merged commit 66e9e81 into elastic:main Oct 4, 2025
35 checks passed
@DaveCTurner DaveCTurner deleted the 2025/09/09/cluster-applier-thread-watchdog branch October 4, 2025 11:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >enhancement serverless-linked Added by automation, don't add manually Supportability Improve our (devs, SREs, support eng, users) ability to troubleshoot/self-service product better. Team:Distributed Coordination Meta label for Distributed Coordination team v9.3.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants