Fix fetch of task location in SpecificTaskServiceLocator #16462
Generally LGTM other than the one exception handling thing.
    taskStatusFuture = overlordClient.taskStatus(taskId);
  }
  catch (Exception e) {
    throw new RuntimeException(e);
Should resolve `pendingFuture` in this case, I think. It would end up abandoned if an exception is actually thrown here.
Thanks for catching this. Fixed.
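For readers following the thread, here is a minimal sketch of the pattern the reviewer asked for. It is not the code from this PR: it uses the JDK's `CompletableFuture` as a stand-in for whatever future type the locator actually uses, and the class name `PendingFutureSketch`, the method `locate`, and the `statusCall` parameter are hypothetical. The point is only that a synchronous failure while starting the call should fail `pendingFuture` rather than leave it unresolved.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

public class PendingFutureSketch
{
  /**
   * Starts an async status call and ties its outcome to pendingFuture.
   * statusCall is a hypothetical stand-in for the Overlord client call.
   */
  static CompletableFuture<String> locate(Supplier<CompletableFuture<String>> statusCall)
  {
    final CompletableFuture<String> pendingFuture = new CompletableFuture<>();
    final CompletableFuture<String> taskStatusFuture;
    try {
      taskStatusFuture = statusCall.get();
    }
    catch (Exception e) {
      // Resolve the pending future instead of rethrowing, so that no
      // caller is left waiting on a future that never completes.
      pendingFuture.completeExceptionally(e);
      return pendingFuture;
    }

    // Normal path: propagate the result or the failure of the async call.
    taskStatusFuture.whenComplete((status, error) -> {
      if (error != null) {
        pendingFuture.completeExceptionally(error);
      } else {
        pendingFuture.complete(status);
      }
    });
    return pendingFuture;
  }

  public static void main(String[] args)
  {
    CompletableFuture<String> failed = locate(() -> {
      throw new IllegalStateException("API call failed");
    });
    System.out.println(failed.isCompletedExceptionally()); // prints "true"

    CompletableFuture<String> ok = locate(() -> CompletableFuture.completedFuture("RUNNING"));
    System.out.println(ok.join()); // prints "RUNNING"
  }
}
```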
* Fix fetch of task location in SpecificTaskServiceLocator
* Resolve future if exception occurs while invoking API
* Remove unused import
Description
The task status API was improved in #15724 to serve task statuses from the Overlord memory. But on older Overlord versions, this API would return an `unknown` task status location, causing failures in inter-task communications. This was later fixed in #16227, but that introduced another bug described below. This bug is typically reproducible in MSQ controller tasks but may occur in native batch ingestion as well.
* `SpecificTaskServiceLocator` first calls the multi-task status API `/druid/indexer/v1/taskStatus` to determine the location of a task.
* `SpecificTaskServiceLocator` then falls back to calling the single task status API `/druid/indexer/v1/task/{taskId}/status`
* [...] `ServiceClientFactory` threads to be stuck waiting to get back a task location. But to fetch the task location, we need one of the `ServiceClientFactory` threads.

Fix
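The starvation described above can be reproduced in miniature. The sketch below is not the code from this PR. It assumes a single-thread JDK executor as a stand-in for the `ServiceClientFactory` thread pool, and the class name, method names, and the `host:8100` location are hypothetical. The first method blocks the only pool thread on work that needs that same pool. The second chains the lookup without blocking, which is one general way to avoid this kind of deadlock.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class PoolStarvationSketch
{
  /** Blocking variant: the only pool thread waits on work queued to the same pool. */
  static boolean blockingLookupTimesOut() throws Exception
  {
    final ExecutorService pool = Executors.newFixedThreadPool(1);
    try {
      Future<String> outer = pool.submit(() -> {
        // The location lookup is queued behind the task that is waiting for it.
        Future<String> location = pool.submit(() -> "host:8100");
        return location.get();
      });
      try {
        outer.get(500, TimeUnit.MILLISECONDS);
        return false;
      }
      catch (TimeoutException e) {
        return true; // the pool is starved, so the lookup never runs
      }
    }
    finally {
      pool.shutdownNow(); // interrupts the stuck thread
    }
  }

  /** Non-blocking variant: the lookup is chained, so the pool thread is released. */
  static String chainedLookup() throws Exception
  {
    final ExecutorService pool = Executors.newFixedThreadPool(1);
    try {
      return CompletableFuture
          .supplyAsync(() -> "taskId", pool)
          .thenCompose(id -> CompletableFuture.supplyAsync(() -> "host:8100", pool))
          .get(5, TimeUnit.SECONDS);
    }
    finally {
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception
  {
    System.out.println(blockingLookupTimesOut()); // prints "true"
    System.out.println(chainedLookup());          // prints "host:8100"
  }
}
```

With more threads in the pool the blocking variant only fails once every thread is waiting at the same time, which is consistent with the issue showing up under load rather than on every run.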
Testing
Local setup to reproduce issue:
* [...] `f643abdad9` and run overlord, coordinator, broker
* [...] `f82cc34e5b` and run middle manager with 10 task slots
* [...] `ServiceClientFactory` threads are busy fetching task location

Local setup to verify fix:
* [...] `f643abdad9` and run overlord, coordinator, broker
* [...] `f82cc34e5b`