
Fix needless task shutdown on leader switch#13411

Merged
AmatyaAvadhanula merged 5 commits into apache:master from AmatyaAvadhanula:fix_needless_task_shutdown
Dec 1, 2022
Conversation

AmatyaAvadhanula (Contributor) commented Nov 22, 2022

Description

Since #13223, the first run of a supervisor tries to resume all actively reading tasks and shuts down any task that fails to resume successfully.

However, the set of active tasks also included pending / waiting tasks, which were killed because they could not resume.

The fix is to check with the TaskRunner that a task is actually running before attempting to resume it and failing it on an unsuccessful resume.
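A minimal sketch of the idea, using simplified, hypothetical types standing in for Druid's TaskRunner / work-item APIs (the class and record names below are illustrative, not the actual Druid code): take the ids the task runner reports as running and only attempt to resume tasks in that set, so PENDING / WAITING tasks are left alone instead of being killed for failing to resume.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Simplified stand-ins for Druid's TaskRunner view of its work items,
// illustrating the fix: resume only tasks the runner considers RUNNING.
public class ResumeOnlyRunningTasks
{
  // Hypothetical stand-in for a TaskRunner work item.
  record WorkItem(String taskId) {}

  static Set<String> runningTaskIds(List<WorkItem> runnerView)
  {
    // Collect the ids of tasks the task runner currently considers running
    // (per the discussion below, this includes paused tasks in Druid).
    return runnerView.stream().map(WorkItem::taskId).collect(Collectors.toSet());
  }

  static List<String> tasksToResume(List<String> activelyReadingTaskIds, Set<String> running)
  {
    // Resume only running tasks and not pending / waiting ones.
    return activelyReadingTaskIds.stream().filter(running::contains).collect(Collectors.toList());
  }

  public static void main(String[] args)
  {
    List<WorkItem> runnerView = List.of(new WorkItem("task-a"), new WorkItem("task-b"));
    // task-c is still PENDING, so the runner does not report it as running.
    List<String> activelyReading = List.of("task-a", "task-b", "task-c");
    System.out.println(tasksToResume(activelyReading, runningTaskIds(runnerView)));
    // prints [task-a, task-b]
  }
}
```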

Release note


Key changed/added classes in this PR
  • SeekableStreamSupervisor

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@abhishekagarwal87 abhishekagarwal87 added this to the 25.0 milestone Nov 22, 2022
if (activelyReadingTaskGroups.isEmpty()) {
return;
}
// Resume only running tasks and not pending / waiting ones.
Contributor

Are we guaranteed at this point that the task runner has initialized its state? (For RTR, has it synced with ZK yet? For HRTR, has it heard from each worker yet?)

Contributor Author

          leaderLifecycle.addManagedInstance(taskRunner);
          leaderLifecycle.addManagedInstance(taskQueue);
          leaderLifecycle.addManagedInstance(supervisorManager);
          leaderLifecycle.addManagedInstance(overlordHelperManager);

Handlers are added while trying to become the leader and are processed in this order, so the TaskRunner starts before the SupervisorManager. TaskRunner#start seems to sync its state for both RTR and HRTR.

}
Set<String> runningTaskIds = taskMaster.getTaskRunner()
.get()
.getRunningTasks()
Contributor

Does this also return paused tasks? According to the javadoc of this method, this method is meant to resume paused tasks (and try to resume running tasks, which should be a no-op).

Contributor Author

Yes. "Paused" is an unrelated internal task state; the task runner considers such tasks to be running as well.

kfaraz (Contributor) commented Nov 22, 2022

Just catching up here, isn't the issue that we kill off RUNNING tasks which are actively processing data/publishing segments? Or were we killing off WAITING tasks?

IIUC, the existing flow wouldn't affect PENDING or WAITING tasks anyway, as they haven't started execution yet (unless they just started execution and we don't have the synced state yet).

Contributor Author

The code introduced in #13223 tried to resume every task in the set of activelyReadingTaskGroups, irrespective of the task runner's worker status, and killed any task that failed to respond with a successful status.

The issue was that the set contained PENDING / WAITING tasks which could not respond to this request, as they hadn't even begun RUNNING, and so they were killed.

Contributor

Makes sense.

Have we seen a specific drawback of killing off PENDING/WAITING tasks or a race condition that may have adverse effects, or is this PR just a safety measure to avoid unnecessary operations?

Contributor Author

Killing pending tasks of a supervisor could lead to increased lag, as the supervisor must re-create these tasks in the next run.
The failure could also be misleading, since one cannot resume a task that hasn't begun running.

gianm (Contributor) left a comment

LGTM


gianm commented Nov 30, 2022

The CI failures are a flaky test and code coverage checks. @AmatyaAvadhanula, is this patch unit-testable? If it's not feasible to write a UT for it, then I think it's OK to bypass the coverage checks.

AmatyaAvadhanula (Contributor Author) commented

@gianm @kfaraz thanks for the review.
I've modified the existing UT to test the case where the task hasn't begun running.
However, KafkaSupervisorTest / KinesisSupervisorTest don't help with coverage of SeekableStreamSupervisor.


gianm commented Dec 1, 2022

However, KafkaSupervisorTest / KinesisSupervisorTest don't help with coverage of SeekableStreamSupervisor.

OK, in that case I suggest we should bypass the coverage check, since it's getting coverage through test cases that are "invisible" to the coverage checker.

kfaraz (Contributor) left a comment

Thanks for the fix, @AmatyaAvadhanula !

@AmatyaAvadhanula AmatyaAvadhanula merged commit cc307e4 into apache:master Dec 1, 2022