
KAFKA-15690: Fix restoring tasks on partition loss, flaky EosIntegrationTest #14869

Merged 1 commit into apache:trunk on Dec 1, 2023

Conversation

@lucasbru (Member) commented Nov 29, 2023

The following race can happen in the state updater code path:

  • Task is restoring, owned by state updater
  • We fall out of the consumer group, lose all partitions
  • We therefore register a "TaskManager.pendingUpdateAction" to CLOSE_DIRTY
  • We also register a "StateUpdater.taskAndAction" to remove the task
  • We get the same task reassigned. Since it's still owned by the state updater, we don't do much
  • The task completes restoration
  • The "StateUpdater.taskAndAction" to remove will be ignored, since it's already restored
  • Inside "handleRestoredTasksFromStateUpdater", we close the task dirty because of the pending update action
  • We now have the task assigned, but it's closed.

To fix this particular race, we cancel the "close" pending update action. Furthermore, since we may have made progress in other threads during the missed rebalance, we need to add the task back to the state updater, to at least check whether we are still at the end of the changelog. Finally, it seems we do not need to close dirty here; it is enough to close clean when we lose the task (related to KAFKA-10532).
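For illustration, here is a minimal, self-contained sketch of that handling in plain Java. The class and method names (`TaskLifecycleSketch`, `onAllPartitionsLost`, `recheckAfterRestore`, and so on) are hypothetical stand-ins and do not mirror the real `TaskManager`/`StateUpdater` APIs; the sketch only models the two ideas above: cancel the pending close when the task is reassigned, and hand the task back to the state updater after restoration to re-check the changelog end.

```java
// Simplified, self-contained model of the race handling described above.
// NOTE: all names here are hypothetical stand-ins, not the real
// TaskManager/StateUpdater APIs.
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TaskLifecycleSketch {

    enum PendingAction { CLOSE_CLEAN, CLOSE_DIRTY }

    private final Map<String, PendingAction> pendingActions = new HashMap<>();
    private final Set<String> ownedByStateUpdater = new HashSet<>();
    private final Set<String> recheckAfterRestore = new HashSet<>();

    void startRestoring(String taskId) {
        ownedByStateUpdater.add(taskId);
    }

    // All partitions lost: the task is still restoring inside the state updater,
    // so we can only register a pending action and ask the updater to drop it.
    void onAllPartitionsLost(String taskId) {
        if (ownedByStateUpdater.contains(taskId)) {
            // Before the fix this was CLOSE_DIRTY; closing clean is enough on "lost".
            pendingActions.put(taskId, PendingAction.CLOSE_CLEAN);
        }
    }

    // The same task is assigned back before the updater processed the removal.
    void onTaskReassigned(String taskId) {
        if (ownedByStateUpdater.contains(taskId)
                && pendingActions.remove(taskId) != null) {
            // The fix: cancel the pending close so a later "restored" callback does
            // not close a task we are supposed to run, and remember to re-check the
            // changelog end offset once restoration completes.
            recheckAfterRestore.add(taskId);
        }
    }

    // The state updater reports the task as fully restored.
    void onTaskRestored(String taskId) {
        ownedByStateUpdater.remove(taskId);
        final PendingAction action = pendingActions.remove(taskId);
        if (action != null) {
            System.out.println("closing " + taskId + " " + action);
        } else if (recheckAfterRestore.remove(taskId)) {
            // Other threads may have advanced the changelog during the missed
            // rebalance: hand the task back to the updater to verify it is still
            // caught up before running it.
            ownedByStateUpdater.add(taskId);
            System.out.println("re-added " + taskId + " to the state updater");
        } else {
            System.out.println("transitioning " + taskId + " to RUNNING");
        }
    }

    public static void main(String[] args) {
        TaskLifecycleSketch sketch = new TaskLifecycleSketch();
        sketch.startRestoring("0_0");
        sketch.onAllPartitionsLost("0_0");   // fell out of the group
        sketch.onTaskReassigned("0_0");      // same task comes back
        sketch.onTaskRestored("0_0");        // with the fix: re-checked, not closed
    }
}
```

Running `main` walks through the interleaving from the list above; without the cancellation in `onTaskReassigned`, the restored task would still be closed even though it is assigned, which is the bug this PR addresses.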

This should fix the flaky EosIntegrationTest.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@lucasbru lucasbru changed the title KAFKA-15690: Fix restoring task handling on partition loss, flaky Eos… KAFKA-15690: Fix restoring tasks on partition loss, flaky EosIntegrationTest Nov 30, 2023
@lucasbru lucasbru marked this pull request as ready for review November 30, 2023 09:58
@cadonna (Contributor) left a comment


Thanks for the fix, @lucasbru !

LGTM!

@lucasbru lucasbru merged commit bfee3b3 into apache:trunk Dec 1, 2023
1 check failed
ex172000 pushed a commit to ex172000/kafka that referenced this pull request Dec 15, 2023
yyu1993 pushed a commit to yyu1993/kafka that referenced this pull request Feb 15, 2024
AnatolyPopov pushed a commit to aiven/kafka that referenced this pull request Feb 16, 2024
clolov pushed a commit to clolov/kafka that referenced this pull request Apr 5, 2024