
KAFKA-15690: Fix restoring tasks on partition loss, flaky EosIntegrationTest #14869

Merged 1 commit into apache:trunk on Dec 1, 2023

Conversation

@lucasbru (Member) commented Nov 29, 2023

The following race can happen in the state updater code path:

  • Task is restoring, owned by state updater
  • We fall out of the consumer group, lose all partitions
  • We therefore register a "TaskManager.pendingUpdateAction" to CLOSE_DIRTY
  • We also register a "StateUpdater.taskAndAction" to remove the task
  • We get the same task reassigned. Since it's still owned by the state updater, we don't do much
  • The task completes restoration
  • The "StateUpdater.taskAndAction" to remove will be ignored, since it's already restored
  • Inside "handleRestoredTasksFromStateUpdater", we close the task dirty because of the pending update action
  • We now have the task assigned, but it's closed.

To fix this particular race, we cancel the "close" pending update action. Furthermore, since we may have made progress in other threads during the missed rebalance, we need to add the task back to the state updater, to at least check whether we are still at the end of the changelog. Finally, it seems we do not need to close dirty here; it is enough to close clean when we lose the task (related to KAFKA-10532).
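For illustration, here is a minimal, self-contained sketch of that handling in plain Java. The class and method names (`TaskLifecycleSketch`, `onAllPartitionsLost`, `recheckAfterRestore`, and so on) are hypothetical stand-ins and do not mirror the real `TaskManager`/`StateUpdater` APIs; the sketch only models the two ideas above: cancel the pending close when the task is reassigned, and hand the task back to the state updater after restoration to re-check the changelog end.

```java
// Simplified, self-contained model of the race handling described above.
// NOTE: all names here are hypothetical stand-ins, not the real
// TaskManager/StateUpdater APIs.
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TaskLifecycleSketch {

    enum PendingAction { CLOSE_CLEAN, CLOSE_DIRTY }

    private final Map<String, PendingAction> pendingActions = new HashMap<>();
    private final Set<String> ownedByStateUpdater = new HashSet<>();
    private final Set<String> recheckAfterRestore = new HashSet<>();

    void startRestoring(String taskId) {
        ownedByStateUpdater.add(taskId);
    }

    // All partitions lost: the task is still restoring inside the state updater,
    // so we can only register a pending action and ask the updater to drop it.
    void onAllPartitionsLost(String taskId) {
        if (ownedByStateUpdater.contains(taskId)) {
            // Before the fix this was CLOSE_DIRTY; closing clean is enough on "lost".
            pendingActions.put(taskId, PendingAction.CLOSE_CLEAN);
        }
    }

    // The same task is assigned back before the updater processed the removal.
    void onTaskReassigned(String taskId) {
        if (ownedByStateUpdater.contains(taskId)
                && pendingActions.remove(taskId) != null) {
            // The fix: cancel the pending close so a later "restored" callback does
            // not close a task we are supposed to run, and remember to re-check the
            // changelog end offset once restoration completes.
            recheckAfterRestore.add(taskId);
        }
    }

    // The state updater reports the task as fully restored.
    void onTaskRestored(String taskId) {
        ownedByStateUpdater.remove(taskId);
        final PendingAction action = pendingActions.remove(taskId);
        if (action != null) {
            System.out.println("closing " + taskId + " " + action);
        } else if (recheckAfterRestore.remove(taskId)) {
            // Other threads may have advanced the changelog during the missed
            // rebalance: hand the task back to the updater to verify it is still
            // caught up before running it.
            ownedByStateUpdater.add(taskId);
            System.out.println("re-added " + taskId + " to the state updater");
        } else {
            System.out.println("transitioning " + taskId + " to RUNNING");
        }
    }

    public static void main(String[] args) {
        TaskLifecycleSketch sketch = new TaskLifecycleSketch();
        sketch.startRestoring("0_0");
        sketch.onAllPartitionsLost("0_0");   // fell out of the group
        sketch.onTaskReassigned("0_0");      // same task comes back
        sketch.onTaskRestored("0_0");        // with the fix: re-checked, not closed
    }
}
```

Running `main` walks through the interleaving from the list above; without the cancellation in `onTaskReassigned`, the restored task would still be closed even though it is assigned, which is the bug this PR addresses.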

This should fix the flaky EosIntegrationTest.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@lucasbru lucasbru changed the title KAFKA-15690: Fix restoring task handling on partition loss, flaky Eos… KAFKA-15690: Fix restoring tasks on partition loss, flaky EosIntegrationTest Nov 30, 2023
@lucasbru lucasbru marked this pull request as ready for review November 30, 2023 09:58
@cadonna (Contributor) left a comment


Thanks for the fix, @lucasbru !

LGTM!

@lucasbru lucasbru merged commit bfee3b3 into apache:trunk Dec 1, 2023
1 check failed
ex172000 pushed a commit to ex172000/kafka that referenced this pull request Dec 15, 2023
yyu1993 pushed a commit to yyu1993/kafka that referenced this pull request Feb 15, 2024
AnatolyPopov pushed a commit to aiven/kafka that referenced this pull request Feb 16, 2024
clolov pushed a commit to clolov/kafka that referenced this pull request Apr 5, 2024