Fix needless task shutdown on leader switch#13411
AmatyaAvadhanula merged 5 commits into apache:master from
Conversation
    if (activelyReadingTaskGroups.isEmpty()) {
      return;
    }
    // Resume only running tasks and not pending / waiting ones.
Are we guaranteed at this point that the task runner has initialized its state? (For RTR, has it synced with ZK yet? For HRTR, has it heard from each worker yet?)
leaderLifecycle.addManagedInstance(taskRunner);
leaderLifecycle.addManagedInstance(taskQueue);
leaderLifecycle.addManagedInstance(supervisorManager);
leaderLifecycle.addManagedInstance(overlordHelperManager);
Handlers are added while trying to become the leader and are also processed in this order.
TaskRunner#start seems to sync its state for both RTR and HRTR.
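The ordering argument above can be sketched as a toy lifecycle. This is illustrative only (the class and interface names here are invented, not Druid's actual Lifecycle API): managed instances start in the order they were registered, so the task runner's state sync completes before the supervisor manager starts.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: managed instances start in registration order,
// so the task runner (which syncs worker state in start()) is started
// before the supervisor manager runs its first supervisor cycle.
public class LifecycleOrderSketch {
    interface Managed { void start(); }

    static class SimpleLifecycle {
        private final List<Managed> instances = new ArrayList<>();
        void addManagedInstance(Managed m) { instances.add(m); }
        // Start handlers in the order they were added.
        void start() { instances.forEach(Managed::start); }
    }

    public static void main(String[] args) {
        List<String> started = new ArrayList<>();
        SimpleLifecycle lifecycle = new SimpleLifecycle();
        lifecycle.addManagedInstance(() -> started.add("taskRunner"));       // syncs state here
        lifecycle.addManagedInstance(() -> started.add("taskQueue"));
        lifecycle.addManagedInstance(() -> started.add("supervisorManager")); // supervisors run after sync
        lifecycle.start();
        System.out.println(started);
    }
}
```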
    }
    Set<String> runningTaskIds = taskMaster.getTaskRunner()
        .get()
        .getRunningTasks()
Does this also return paused tasks? According to its javadoc, this method is meant to resume paused tasks (and to try resuming running tasks, which should be a no-op).
Yes. Paused is an internal task state unrelated to the runner's worker status, and the task runner considers such tasks to be running as well.
Just catching up here: isn't the issue that we kill off RUNNING tasks which are actively processing data or publishing segments? Or were we killing off WAITING tasks?
IIUC, the existing flow wouldn't affect PENDING or WAITING tasks anyway, as they haven't started execution yet (unless they just started execution and we don't have the synced state yet).
The code introduced in #13223 tried to resume every task in the set of activelyReadingTaskGroups irrespective of the task runner's worker status, and killed any task that failed to respond with a successful status.
The issue was that the set included PENDING / WAITING tasks which couldn't respond to this request, as they hadn't even begun RUNNING, and were therefore killed.
Makes sense.
Have we seen a specific drawback of killing off PENDING/WAITING tasks or a race condition that may have adverse effects, or is this PR just a safety measure to avoid unnecessary operations?
Killing the pending tasks of a supervisor could lead to increased lag, as it must re-create these tasks in the next run.
The failure could also be misleading, since one cannot resume a task that hasn't begun running.
The CI failures are a flaky test and code coverage checks. @AmatyaAvadhanula, is this patch unit-testable? If it's not feasible to write a UT for it, then I think it's OK to bypass the coverage checks.
OK, in that case I suggest we should bypass the coverage check, since it's getting coverage through test cases that are "invisible" to the coverage checker. |
kfaraz left a comment:
Thanks for the fix, @AmatyaAvadhanula !
Description
Since #13223, the first run of a supervisor tries to resume all actively reading tasks and shuts down any task that fails to resume successfully.
However, the set of active tasks included pending / waiting tasks, which were killed because they failed to resume.
The fix is to use the TaskRunner to check that a task is actually running before attempting to resume it (and failing it if the resume fails).
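A minimal sketch of the fix described above (the method and variable names are illustrative, not the exact Druid code): intersect the set of actively reading task IDs with the task IDs the runner reports as running, so that PENDING / WAITING tasks are skipped rather than shut down for failing to respond.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch of the fix: only tasks the runner reports as
// RUNNING are asked to resume; pending / waiting tasks are left alone
// instead of being killed for failing to respond to the resume request.
public class ResumeOnlyRunningSketch {

    static Set<String> tasksToResume(Set<String> activelyReadingTaskIds,
                                     Set<String> runningTaskIds) {
        Set<String> result = new HashSet<>(activelyReadingTaskIds);
        result.retainAll(runningTaskIds); // drop PENDING / WAITING tasks
        return result;
    }

    public static void main(String[] args) {
        Set<String> activelyReading = new HashSet<>(List.of("task-a", "task-b", "task-c"));
        // task-c is still PENDING, so the runner does not report it as running.
        Set<String> running = new HashSet<>(List.of("task-a", "task-b"));
        // TreeSet gives deterministic print order for the sketch.
        System.out.println(new TreeSet<>(tasksToResume(activelyReading, running)));
    }
}
```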
Release note
Key changed/added classes in this PR
SeekableStreamSupervisor
This PR has: