Skip to content

Fix leak in activeTasks map in TaskQueue#18031

Merged
kfaraz merged 3 commits intoapache:masterfrom
jtuglu1:fix-leak-in-task-queue
May 24, 2025
Merged

Fix leak in activeTasks map in TaskQueue#18031
kfaraz merged 3 commits intoapache:masterfrom
jtuglu1:fix-leak-in-task-queue

Conversation

@jtuglu1
Copy link
Contributor

@jtuglu1 jtuglu1 commented May 23, 2025

Currently we check only whether the timestamp on the task comes strictly before the current clock value during a syncFromStorage(). This can cause races between startPendingTasksOnRunner() and syncFromStorage() where the latter spuriously updates the clock values for all tasks in activeTasks, preventing syncFromStorage() from making any deletion progress. This can cause waiting tasks to continuously increase without GC.

Screenshot 2025-05-23 at 1 49 20 PM

Description

  • I elected to update removeTaskInternal instead of updateTaskEntry because of this being opaque to the caller. I think the invariant that if the updateOperation is called on an entry ===> its lastUpdated field should also be updated makes sense here.
  • It's my understanding that the definition of "waiting" is a task which is yet to be placed in the taskRunner queue. Therefore, I've adjusted getWaitingTaskCount to filter out the completed tasks. This should ensure that the only unknown tasks to the taskRunner are those which have not been placed in its queue (in line with the definition).
  • This change also enhances the log to print how many tasks were actually removed from activeTasks for better debugging.

Release note

Fix task leak in activeTasks map in TaskQueue and fix waiting task metric count


Key changed/added classes in this PR
  • indexing-service/src/main/java/org/apache/druid/indexing/overlord/TaskQueue.java

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@jtuglu1 jtuglu1 force-pushed the fix-leak-in-task-queue branch 2 times, most recently from 8aae114 to c36b986 Compare May 23, 2025 21:01
Currently we check only whether the timestamp on the task comes strictly before the current clock value during a syncFromStorage(). This can causes races between startPendingTasksOnRunner() and syncFromStorage() where the latter spuriously updates the clock values for all tasks in activeTasks, preventing syncFromStorage() from making any deletion progress.
@jtuglu1 jtuglu1 force-pushed the fix-leak-in-task-queue branch from c36b986 to 6a93984 Compare May 24, 2025 00:27
@jtuglu1 jtuglu1 marked this pull request as ready for review May 24, 2025 00:29
Copy link
Contributor

@kfaraz kfaraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🚀
Thanks for the fix and the tests, @jtuglu-netflix !

@kfaraz
Copy link
Contributor

kfaraz commented May 24, 2025

@jtuglu-netflix , please let me know if this is good to merge or if you intend to include some more change here.

@jtuglu1
Copy link
Contributor Author

jtuglu1 commented May 24, 2025

@jtuglu-netflix , please let me know if this is good to merge or if you intend to include some more change here.

Nothing more to this PR

@jtuglu1 jtuglu1 requested a review from kfaraz May 24, 2025 05:24
@jtuglu1
Copy link
Contributor Author

jtuglu1 commented May 24, 2025

@kfaraz all good here?

@kfaraz kfaraz merged commit 2b1f1fc into apache:master May 24, 2025
74 checks passed
@kfaraz
Copy link
Contributor

kfaraz commented May 24, 2025

@kfaraz all good here?

Yes, merged.

@capistrant capistrant added this to the 34.0.0 milestone Jul 22, 2025
@jtuglu1 jtuglu1 deleted the fix-leak-in-task-queue branch August 22, 2025 05:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants