Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix fetch of task location in SpecificTaskServiceLocator #16462

Merged
merged 3 commits into from
May 20, 2024

Conversation

kfaraz
Copy link
Contributor

@kfaraz kfaraz commented May 16, 2024

Description

The task status API was improved in #15724 to serve task statuses from the Overlord memory.
But on older Overlord versions, this API would return an unknown task status location causing failures in inter-task communications.

This was later fixed in #16227 but that introduced another bug described below. This bug is typically reproducible in MSQ controller tasks but may occur in native batch ingestion as well.

  • The SpecificTaskServiceLocator first calls the multi-task status API /druid/indexer/v1/taskStatus to determine the location of a task.
  • This API returns an unknown task location on old Overlord versions
  • The SpecificTaskServiceLocator then falls back to calling the single task status API /druid/indexer/v1/task/{taskId}/status
  • The problem is that the second API is invoked in a synchronous manner in the callback of the first future while holding a lock
  • If there are enough MSQ workers, this could cause all the ServiceClientFactory threads to be stuck waiting to get back a task location. But to fetch the task location, we need one of the ServiceClientFactory threads.

Fix

  • Invoke the single task status API in an async manner
  • Ensure that success/failure callbacks are quick and do not keep threads blocked

Testing

Local setup to reproduce issue:

  • Build Druid on commit f643abdad9 and run overlord, coordinator, broker
  • Build Druid on commit f82cc34e5b and run middle manager with 10 task slots
  • Submit a large MSQ query with 6 workers + 1 controller
  • Verify that controller remains stuck and doesn't make progress since all ServiceClientFactory threads are busy fetching task location

Local setup to verify fix:

  • Build Druid on commit f643abdad9 and run overlord, coordinator, broker
  • Checkout Druid on commit f82cc34e5b
  • Apply the changes in this PR
  • Build Druid and run middle manager with 10 task slots
  • Submit a large MSQ query with 6 workers + 1 controller
  • Verify that the controller does not get stuck and live ingestion stats are reported as expected.

@kfaraz kfaraz requested review from gianm and cryptoe May 16, 2024 05:56
@kfaraz kfaraz added this to the 30.0.0 milestone May 17, 2024
Copy link
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM other than the one exception handling thing.

taskStatusFuture = overlordClient.taskStatus(taskId);
}
catch (Exception e) {
throw new RuntimeException(e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should resolve pendingFuture in this case, I think. It would end up abandoned if an exception is actually thrown here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this. Fixed.

@kfaraz kfaraz requested a review from gianm May 18, 2024 07:03
@cryptoe cryptoe merged commit 15d27f3 into apache:master May 20, 2024
86 of 87 checks passed
@kfaraz kfaraz deleted the fix_task_location branch May 20, 2024 07:32
adarshsanjeev pushed a commit to adarshsanjeev/druid that referenced this pull request May 20, 2024
* Fix fetch of task location in SpecificTaskServiceLocator

* Resolve future if exception occurs while invoking API

* Remove unused import
adarshsanjeev added a commit that referenced this pull request May 21, 2024
…6473)

* Fix fetch of task location in SpecificTaskServiceLocator

* Resolve future if exception occurs while invoking API

* Remove unused import

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants