
Use async OverlordClient to submit auto-compaction tasks #14537

Closed
kfaraz wants to merge 3 commits into apache:master from kfaraz:submit_compact_async

Conversation


@kfaraz kfaraz commented Jul 6, 2023

All duties on the Coordinator run on the same single-threaded executor, including historical management duties (segment loading and balancing) and auto-compaction.

If auto-compaction gets stuck for some reason (such as slowness while submitting tasks), it can block the entire coordination flow, causing delayed segment handoffs and ingestion failures.

This PR attempts to remedy the situation by using the async OverlordClient instead of IndexingServiceClient, so that the coordinator does not block while submitting compaction tasks.

Changes

  • Add new methods to OverlordClient
  • Use OverlordClient in CompactSegments
  • All operations except task submission still happen synchronously, by calling future.get().
  • Fix up tests
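The essence of the change above is that task submission returns a future the duty only logs on, while other Overlord calls still block via future.get(). A minimal sketch of that pattern, using JDK CompletableFuture as a stand-in (the real OverlordClient returns Guava ListenableFutures, and the task id and method here are hypothetical):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class AsyncSubmitSketch
{
    // Hypothetical stand-in for the async client: task submission returns a
    // future instead of blocking the caller until the Overlord responds.
    static CompletableFuture<String> runTask(ExecutorService overlord, String taskId)
    {
        return CompletableFuture.supplyAsync(() -> {
            try {
                TimeUnit.MILLISECONDS.sleep(50); // simulate a slow Overlord HTTP call
            }
            catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
            return taskId;
        }, overlord);
    }

    public static void main(String[] args) throws Exception
    {
        ExecutorService overlord = Executors.newSingleThreadExecutor();
        AtomicReference<String> logged = new AtomicReference<>();

        // Submit without blocking the coordinator duty thread; only log on completion.
        CompletableFuture<String> submission =
            runTask(overlord, "compact_wiki_2023-07-06").whenComplete(
                (taskId, error) -> logged.set(error == null ? "Submitted " + taskId : "Failed")
            );

        // The duty thread is free to continue; we wait here only to observe the result.
        submission.get(5, TimeUnit.SECONDS);
        System.out.println(logged.get());
        overlord.shutdown();
    }
}
```

Synchronous calls (e.g. fetching running compaction tasks) correspond to calling get() on the returned future immediately instead of attaching a completion callback.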

Pending validation

  • Cluster testing
  • Case: A compaction task has been submitted to the Overlord, but the request has not finished when the next duty cycle starts. The coordinator might then submit a duplicate compaction task for the same interval. Ideally, the later task should simply be cancelled if the first one is still running. Alternatively, the CompactSegments duty could track recently submitted (but not yet started) compaction tasks to avoid submitting duplicates in the first place.
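The second option above amounts to a small piece of duty-local state keyed by interval. A sketch under those assumptions (the class, method names, and latch-based "slow request" are illustrative, not Druid code):

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class DedupSubmitSketch
{
    // Intervals with an in-flight submission; cleared when the future completes.
    private final Set<String> pendingIntervals = ConcurrentHashMap.newKeySet();
    private final ExecutorService overlord = Executors.newSingleThreadExecutor();

    // Returns null when a submission for this interval is already in flight.
    CompletableFuture<Void> submitIfAbsent(String interval, CountDownLatch slowCall)
    {
        if (!pendingIntervals.add(interval)) {
            return null;
        }
        return CompletableFuture
            .runAsync(() -> {
                try {
                    slowCall.await(); // simulate an Overlord request that hasn't finished yet
                }
                catch (InterruptedException e) {
                    throw new RuntimeException(e);
                }
            }, overlord)
            .whenComplete((v, e) -> pendingIntervals.remove(interval));
    }

    public static void main(String[] args) throws Exception
    {
        DedupSubmitSketch duty = new DedupSubmitSketch();
        CountDownLatch slowCall = new CountDownLatch(1);

        CompletableFuture<Void> first = duty.submitIfAbsent("2023-07-01/2023-07-02", slowCall);
        // Next duty cycle starts before the first request finishes: duplicate is skipped.
        CompletableFuture<Void> duplicate = duty.submitIfAbsent("2023-07-01/2023-07-02", slowCall);
        System.out.println(duplicate == null ? "duplicate skipped" : "duplicate submitted");

        slowCall.countDown();
        first.get(5, TimeUnit.SECONDS);
        duty.overlord.shutdown();
    }
}
```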

Alternate approach/Further improvement

  • The coordinator duties can be moved to a completely separate executor but that would require some validation to ensure that there are no race conditions between historical management duties and coordinator duties (most likely, there shouldn't be)

Release note


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@kfaraz kfaraz removed the WIP label Jul 6, 2023
@kfaraz kfaraz closed this Jul 6, 2023
@kfaraz kfaraz reopened this Jul 6, 2023

gianm commented Jul 14, 2023

Hmm, something seems odd about firing a future and never doing anything with the return (other than logging it). What happens if another CompactSegments runs before the first one's runTask completes? Would it try to run the task again because it doesn't see it running? Would it properly track the number of concurrently running compaction tasks?

What do you think about an alternate approach using different threads for different duty cycles?


kfaraz commented Jul 15, 2023

Thanks for the feedback, @gianm .

I had similar thoughts, which is why I held off on this PR (I forgot to mark it as WIP).
My intention was to:

  • track the tasks being submitted until the future completes so that duplicate tasks are not submitted if another CompactSegments is invoked before all the futures have completed.
  • emit metrics and clean up this state on future completion
  • probably also have some kind of submission timeout so that futures don't wait indefinitely
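The submission-timeout idea in the last bullet can be expressed directly on the future. A minimal sketch using CompletableFuture.orTimeout (Java 9+); the actual client's futures and timeout value are assumptions here:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class SubmissionTimeoutSketch
{
    public static void main(String[] args) throws Exception
    {
        // A submission future that is never completed, simulating an Overlord
        // request that hangs indefinitely.
        CompletableFuture<String> submission = new CompletableFuture<>();

        // Bound the wait so tracked duty state can be cleaned up instead of leaking.
        CompletableFuture<String> bounded = submission.orTimeout(100, TimeUnit.MILLISECONDS);

        try {
            bounded.get();
            System.out.println("submitted");
        }
        catch (ExecutionException e) {
            if (e.getCause() instanceof TimeoutException) {
                System.out.println("submission timed out");
            }
        }
    }
}
```

On timeout, the duty would emit a failure metric and drop the interval from its "recently submitted" state so a later cycle can retry.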


kfaraz commented Jul 15, 2023

> What do you think about an alternate approach using different threads for different duty cycles?

Yes, that is the other approach I had in mind. I was a little concerned about race conditions between different duty groups. But the right design would be to break up the duty groups in such a way that they are completely independent and parallelizable (which is most likely already the case).

I went ahead with the current approach instead as it retained the same flow and only made task submission async. All the other overlord operations done by the CompactSegments duty such as fetching running compaction tasks are still being done synchronously. So there is still a possibility of the coordinator run being blocked.

The current approach seemed simpler originally but now it seems like a half measure and probably brings in more complications than necessary.

We already have a PR for using the OverlordClient. So I will go ahead and update this PR to simply use different threads for different duty groups (and do some validation to make sure that there cannot be any race conditions).
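The per-duty-group threading described above can be sketched as one single-threaded executor per group, so a stall in one group cannot delay another. The group names below mirror Druid's duty groups but the wiring is illustrative, not the actual coordinator code:

```java
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class DutyGroupExecutors
{
    public static void main(String[] args) throws Exception
    {
        // One single-threaded executor per duty group, so a stall in one group
        // (e.g. compaction) cannot block another (e.g. historical management).
        Map<String, ExecutorService> groupExecutors = Map.of(
            "HistoricalManagementDuties", Executors.newSingleThreadExecutor(),
            "IndexingServiceDuties", Executors.newSingleThreadExecutor()
        );

        CountDownLatch compactionStuck = new CountDownLatch(1);

        // The compaction duty blocks indefinitely until released.
        groupExecutors.get("IndexingServiceDuties").submit(() -> {
            compactionStuck.await();
            return null;
        });

        // Historical management still runs promptly on its own thread.
        Future<String> balance = groupExecutors.get("HistoricalManagementDuties")
            .submit(() -> "segments balanced");
        System.out.println(balance.get(5, TimeUnit.SECONDS));

        compactionStuck.countDown();
        groupExecutors.values().forEach(ExecutorService::shutdown);
    }
}
```

The validation mentioned above would need to confirm that no shared mutable state (e.g. the segment timeline snapshot) is written concurrently by duties in different groups.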

@kfaraz kfaraz added the WIP label Jul 15, 2023
@kfaraz
Copy link
Contributor Author

kfaraz commented Jul 26, 2023

Closing this PR as #14581 is already merged.
I will create a separate PR for using separate executor threads for each duty group.

@kfaraz kfaraz closed this Jul 26, 2023
@kfaraz kfaraz deleted the submit_compact_async branch November 9, 2023 06:40