Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle task location fetch from overlord during rolling upgrades #16227

Merged
merged 7 commits into from Apr 16, 2024

Conversation

AmatyaAvadhanula
Copy link
Contributor

Bug:

#15724 - introduced a bug where a rolling upgrade would cause all task locations returned by the Overlord on an older version to be unknown.

Prior to #15724,

  1. OverlordResource#getTaskStatus for individual tasks fetched a TaskStatusResponse containing the location.
  2. OverlordResource#getMultipleTaskStatuses fetched task statuses in a batch from the metadata store. The metadata store doesn't contain the current location of an active task. Complete tasks do contain them

After the changes,

  1. OverlordResource#getTaskStatus remains unchanged.
  2. OverlordResource#getMultipleTaskStatuses fetches task statuses for in-memory tasks from the TaskQueue and enhances them with the location from the task runner. The method fetches task statuses for completed tasks from the db.

The Overlord client was also changed to rely on the 2nd API to fetch the task status and location from memory.

During a rolling upgrade, the task is on a version with the PR's changes and queries the 2nd API. The overlord is still on the older version and fails to return the correct location for active tasks. This can lead to task failures during rolling upgrades.

Fix

The overlord client now falls back to the orignal API that always returns the task location if the 2nd API fails to return it during the rolling upgrade. After the rolling upgrade, the active tasks' statuses will be fetched from memory as expected.

Testing

The new overlord client was used on an upgraded Indexer / MM while the Overlord was on a version prior to #15724. The tasks succeeded as expected. (They would fail with a newer Indexer without this patch).

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@kfaraz
Copy link
Contributor

kfaraz commented Apr 2, 2024

I don't think it is desirable to fall back to another API within the OverlordClientImpl.
The OverlordClient methods are meant to have 1-to-1 correspondence with the Overlord APIs. Also, it is used in several places and some callers might actually want the result of the first API itself even if the location turns out to be empty.

It would make more sense to have the fallback logic in the specific task service locator class.

@AmatyaAvadhanula
Copy link
Contributor Author

It would make more sense to have the fallback logic in the specific task service locator class.

Would it be ok to add a parameter within the Overlord client to fallback to the older API? The locations are needed not only in SpecificTaskServiceLocator but also IndexerWorkerManagerClient. It seems a bit involved with added duplication to do it outside the client.

@kfaraz
Copy link
Contributor

kfaraz commented Apr 3, 2024

Would it be ok to add a parameter within the Overlord client to fallback to the older API? The locations are needed not only in SpecificTaskServiceLocator but also IndexerWorkerManagerClient. It seems a bit involved with added duplication to do it outside the client.

I feel it is okay to have duplication in this case for the time being.

@github-actions github-actions bot added Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Apr 8, 2024
).get(workerId);

if (taskStatus != null
&& TaskLocation.unknown().equals(taskStatus.getLocation())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo:

Suggested change
&& TaskLocation.unknown().equals(taskStatus.getLocation())) {
&& !TaskLocation.unknown().equals(taskStatus.getLocation())) {

Copy link
Contributor

@kfaraz kfaraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM. @AmatyaAvadhanula , could you please check the CI failures?

@kfaraz kfaraz merged commit ad6bd62 into apache:master Apr 16, 2024
85 checks passed
@kfaraz kfaraz deleted the task_status_from_memory_fix branch April 16, 2024 15:32
@adarshsanjeev adarshsanjeev added this to the 30.0.0 milestone May 6, 2024
AmatyaAvadhanula added a commit to AmatyaAvadhanula/druid that referenced this pull request May 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants