
Improve handling of CCR threadpool rejections #92449

Open
DaveCTurner opened this issue Dec 19, 2022 · 1 comment
Labels
>bug, :Distributed/CCR, Team:Distributed

Comments

@DaveCTurner (Contributor)

The CCR threadpool uses a fixed executor with a default size of 32 and a default queue length of 100, which means it rejects work when overloaded. However, we do not appear to handle these rejections very gracefully in several spots, even though the overload might be transient (see the sketch after this list):

  • ShardChangesAction.TransportAction#asyncShardOperation adds a global-checkpoint (GCP) listener that runs on the ccr pool; if that listener is rejected, the rejection looks like it might suppress some other notifications and propagate up into the ReplicationTracker.
  • ShardFollowTasksExecutor is a PersistentTasksExecutor that executes the task on the ccr pool, and on rejection the task is simply marked as failed.
  • ShardFollowTasksExecutor#nodeOperation also just fails the task on rejection.
  • ShardFollowNodeTask#scheduleBackgroundRetentionLeaseRenewal uses scheduleWithFixedDelay, which simply stops running the scheduled task on rejection.
  • AutoFollower#finalise looks like it might call itself ad infinitum on rejection?
  • CcrRepository#restoreShard uses scheduleWithFixedDelay, which simply stops running the scheduled task on rejection. Possibly this is OK? If the restore fails, I expect we will retry.
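
For illustration, here is a minimal, self-contained sketch of the failure mode, using plain `java.util.concurrent` rather than Elasticsearch's internal `EsExecutors` (so the pool construction, the 500ms backoff, and the `submitWithRetry` helper below are illustrative assumptions, not the actual CCR code): a fixed pool with a bounded queue throws on overflow, and a naive caller turns that transient overload into a permanent failure, whereas re-submitting after a delay would ride it out.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class CcrRejectionSketch {

    public static void main(String[] args) {
        // Same shape as the CCR pool described above: fixed size 32, bounded
        // queue of 100. AbortPolicy throws RejectedExecutionException when the
        // queue is full, analogous to EsRejectedExecutionException.
        ThreadPoolExecutor ccrLikePool = new ThreadPoolExecutor(
                32, 32, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(100),
                new ThreadPoolExecutor.AbortPolicy());

        // Timer used only to back off before re-submitting rejected work.
        ScheduledExecutorService retryTimer = Executors.newSingleThreadScheduledExecutor();

        Runnable work = () -> { /* e.g. renew a retention lease */ };

        // Naive submission: a transient overload becomes a permanent failure.
        // This is roughly what several of the spots listed above do today.
        try {
            ccrLikePool.execute(work);
        } catch (RejectedExecutionException e) {
            System.err.println("task failed: " + e); // task marked failed / scheduler stops
        }

        // More forgiving submission: treat the rejection as transient and
        // retry a bounded number of times with a delay before giving up.
        submitWithRetry(ccrLikePool, retryTimer, work, 3);

        ccrLikePool.shutdown();
        retryTimer.shutdown();
    }

    static void submitWithRetry(ThreadPoolExecutor pool, ScheduledExecutorService timer,
                                Runnable work, int retriesLeft) {
        try {
            pool.execute(work);
        } catch (RejectedExecutionException e) {
            if (retriesLeft > 0) {
                // Back off briefly and try again rather than failing immediately.
                timer.schedule(() -> submitWithRetry(pool, timer, work, retriesLeft - 1),
                        500, TimeUnit.MILLISECONDS);
            } else {
                System.err.println("giving up after retries: " + e);
            }
        }
    }
}
```

Something along these lines (or a rejection hook on the submitted runnable) would let the spots above tolerate a momentarily full queue instead of failing the task outright.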
@DaveCTurner added the >bug and :Distributed/CCR labels on Dec 19, 2022
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)
