-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Retrying with a backward compatible task type on unknown task type error in parallel indexing #8905
Conversation
…ror in parallel indexing
@Override | ||
public String getType() | ||
{ | ||
return SinglePhaseSubTask.OLD_TYPE_NAME; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this really affect how the task is serialized? I thought the type
field would end up based solely on whatever Jackson thinks the type code for that class is, based on @JsonTypeName
or @JsonSubTypes
annotations.
Could you include a test that makes sure it serializes correctly? (Maybe serialize it, then deserialize as a Map and check the type
field.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returning a different type
works, but the class should be registered on Jackson properly. I fixed it and added a unit test.
{ | ||
T task = spec.newSubTask(numAttempts); | ||
try { | ||
indexingServiceClient.runTask(task); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this approach be able to retry immediately, or will it have to exhaust retries in the indexingServiceClient
first? (The loop in DruidLeaderClient)
Ideally, this detects a problem on the first submission, and doesn't need to exhaust retries before it moves on to trying the backwards-compatible type.
Could you check it, if you haven't already?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't retry immediately but will exhaust retries in DruidLeaderClient. Yes, ideally it should be able to detect the problem before it retries, but I'm not sure whether that refactoring is worth to do because 1) it's not easy to teach the logic of the caller to DruidLeaderClient, 2) it will happen only during a particular type of rolling update, 3) and retries won't take much time compared to the total indexing time.
@gianm thanks for the review. I also tested this patch with a cluster of an overlord of 0.15.0 and middleManagers of this patch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 with the latest changes, thanks @jihoonson
…ror in parallel indexing (apache#8905) * Retrying with a backward compatible task type on unknown task type error in parallel indexing * Register legacy class; add a serde test
…ror in parallel indexing (apache#8905) * Retrying with a backward compatible task type on unknown task type error in parallel indexing * Register legacy class; add a serde test
…ask type error in parallel indexing (#8905) (#8949) * Retrying with a backward compatible task type on unknown task type error in parallel indexing (#8905) * Retrying with a backward compatible task type on unknown task type error in parallel indexing * Register legacy class; add a serde test * Backport fix, use firehoses
Fixes #8836.
Description
I think I don't like this patch much, but don't see a better solution. Please leave comments if anyone has a better idea.
TaskMonitor.submit()
creates a sub task for a given spec submits it to the overlord. Here, if the task type was unknown to the overlord (can happen during a rolling update), the overlord would return an HTTP error toTaskMonitor
.HttpIndexingServiceClient
would throw anIllegalStateException
if the HTTP response was not 200.To handle this, I added a new method
SubTaskSpec.newSubTaskWithBackwardCompatibleType()
which will be called ifSubTaskSpec.newSubTask()
fails with anIllegalStateExceptoin
with a message starting with "Could not resolve type id".This PR has:
This change is