
Fix needless task shutdown on leader switch#13411

Merged
AmatyaAvadhanula merged 5 commits into apache:master from AmatyaAvadhanula:fix_needless_task_shutdown
Dec 1, 2022
Conversation

AmatyaAvadhanula (Contributor) commented Nov 22, 2022

Description

Since #13223, the first run of a supervisor tries to resume all actively reading tasks and shuts down any task that fails to resume successfully.

However, the set of active tasks also included pending / waiting tasks, which were killed because they could not resume.

The fix is to check with the TaskRunner that a task is actually running before attempting to resume it and failing it on an unsuccessful resume.
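A minimal sketch of the idea, using simplified, hypothetical types standing in for Druid's TaskRunner / work-item APIs (the class and record names below are illustrative, not the actual Druid code): take the ids the task runner reports as running and only attempt to resume tasks in that set, so PENDING / WAITING tasks are left alone instead of being killed for failing to resume.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Simplified stand-ins for Druid's TaskRunner view of its work items,
// illustrating the fix: resume only tasks the runner considers RUNNING.
public class ResumeOnlyRunningTasks
{
  // Hypothetical stand-in for a TaskRunner work item.
  record WorkItem(String taskId) {}

  static Set<String> runningTaskIds(List<WorkItem> runnerView)
  {
    // Collect the ids of tasks the task runner currently considers running
    // (per the discussion below, this includes paused tasks in Druid).
    return runnerView.stream().map(WorkItem::taskId).collect(Collectors.toSet());
  }

  static List<String> tasksToResume(List<String> activelyReadingTaskIds, Set<String> running)
  {
    // Resume only running tasks and not pending / waiting ones.
    return activelyReadingTaskIds.stream().filter(running::contains).collect(Collectors.toList());
  }

  public static void main(String[] args)
  {
    List<WorkItem> runnerView = List.of(new WorkItem("task-a"), new WorkItem("task-b"));
    // task-c is still PENDING, so the runner does not report it as running.
    List<String> activelyReading = List.of("task-a", "task-b", "task-c");
    System.out.println(tasksToResume(activelyReading, runningTaskIds(runnerView)));
    // prints [task-a, task-b]
  }
}
```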

Release note


Key changed/added classes in this PR
  • SeekableStreamSupervisor

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@abhishekagarwal87 abhishekagarwal87 added this to the 25.0 milestone Nov 22, 2022
if (activelyReadingTaskGroups.isEmpty()) {
return;
}
// Resume only running tasks and not pending / waiting ones.
Contributor

Are we guaranteed at this point that the task runner has initialized its state? (For RTR, has it synced with ZK yet? For HRTR, has it heard from each worker yet?)

Contributor Author

          leaderLifecycle.addManagedInstance(taskRunner);
          leaderLifecycle.addManagedInstance(taskQueue);
          leaderLifecycle.addManagedInstance(supervisorManager);
          leaderLifecycle.addManagedInstance(overlordHelperManager);

Handlers are added while trying to become the leader and are processed in this order, so the TaskRunner starts before the SupervisorManager. TaskRunner#start seems to sync its state for both RTR and HRTR.

}
Set<String> runningTaskIds = taskMaster.getTaskRunner()
.get()
.getRunningTasks()
Contributor

Does this also return paused tasks? According to the javadoc of this method, this method is meant to resume paused tasks (and try to resume running tasks, which should be a no-op).

Contributor Author

Yes. "Paused" is an unrelated internal task state; the task runner considers such tasks to be running as well.

kfaraz (Contributor) commented Nov 22, 2022

Just catching up here, isn't the issue that we kill off RUNNING tasks which are actively processing data/publishing segments? Or were we killing off WAITING tasks?

IIUC, the existing flow wouldn't affect PENDING or WAITING tasks anyway, as they haven't started execution yet (unless they just started execution and we don't have the synced state yet).

Contributor Author

The code introduced in #13223 tried to resume every task in the set of activelyReadingTaskGroups, irrespective of the task runner's worker status, and killed any task that failed to respond with a successful status.

The issue was that the set contained PENDING / WAITING tasks which could not respond to this request, as they hadn't even begun RUNNING, and so they were killed.

Contributor

Makes sense.

Have we seen a specific drawback of killing off PENDING/WAITING tasks or a race condition that may have adverse effects, or is this PR just a safety measure to avoid unnecessary operations?

Contributor Author

Killing pending tasks of a supervisor could lead to increased lag, as the supervisor must re-create these tasks in the next run.
The failure could also be misleading, since one cannot resume a task that hasn't begun running.

gianm (Contributor) left a comment

LGTM


gianm commented Nov 30, 2022

The CI failures are a flaky test and code coverage checks. @AmatyaAvadhanula, is this patch unit-testable? If it's not feasible to write a UT for it, then I think it's OK to bypass the coverage checks.

AmatyaAvadhanula (Contributor Author) commented

@gianm @kfaraz thanks for the review.
I've modified the existing UT to test the case where the task hasn't begun running.
However, KafkaSupervisorTest / KinesisSupervisorTest don't help with coverage of SeekableStreamSupervisor.


gianm commented Dec 1, 2022

However, KafkaSupervisorTest / KinesisSupervisorTest don't help with coverage of SeekableStreamSupervisor.

OK, in that case I suggest we should bypass the coverage check, since it's getting coverage through test cases that are "invisible" to the coverage checker.

kfaraz (Contributor) left a comment

Thanks for the fix, @AmatyaAvadhanula !

@AmatyaAvadhanula AmatyaAvadhanula merged commit cc307e4 into apache:master Dec 1, 2022