
KAFKA-14154; Ensure AlterPartition not sent to stale controller #12499

Closed
hachikuji wants to merge 4 commits

Conversation

@hachikuji hachikuji commented Aug 10, 2022

It is currently possible for a leader to send an AlterPartition request to a stale controller which does not have the latest leader epoch that the broker discovered through a LeaderAndIsr request. In this case, the stale controller returns FENCED_LEADER_EPOCH, which causes the partition leader to get stuck. This is a change in behavior following #12032. Prior to that patch, the request would either be accepted (potentially incorrectly) if the LeaderAndIsr state matched that on the controller, or the controller would have returned NOT_CONTROLLER after it failed to apply the update to ZooKeeper.

This patch fixes the problem by ensuring that AlterPartition is sent to a controller with an epoch which is at least as large as that of the controller which sent the LeaderAndIsr request. This is achieved by tracking the controller epoch in BrokerToControllerChannelManager and ensuring that it is only updated monotonically, regardless of the source. If we find a controller epoch through LeaderAndIsr which is larger than what we have in the MetadataCache, then the cached controller node is reset and we wait until we have discovered the controller node with the higher epoch. This ensures that the FENCED_LEADER_EPOCH error from the controller can be trusted.
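
As an illustration of the monotonic-update idea described above, here is a minimal sketch in Scala. The names (ControllerInfo, ControllerEpochTracker, maybeUpdate) are placeholders for this sketch, not the actual Kafka classes; the real logic lives in BrokerToControllerChannelManager.

  // Sketch only: the cached controller epoch never moves backwards, and learning a
  // newer epoch (e.g. via LeaderAndIsr) before the metadata cache catches up clears
  // the cached node, so a FENCED_LEADER_EPOCH from the cached controller can be trusted.
  case class ControllerInfo(node: Option[String], epoch: Int)

  class ControllerEpochTracker {
    private var current = ControllerInfo(None, epoch = 0)

    // Record a controller epoch reported by any source (metadata cache or LeaderAndIsr).
    def maybeUpdate(node: Option[String], epoch: Int): Unit = synchronized {
      if (epoch > current.epoch) {
        current = ControllerInfo(node, epoch)   // newer epoch always wins
      } else if (epoch == current.epoch && current.node.isEmpty && node.isDefined) {
        current = ControllerInfo(node, epoch)   // learn the node for an epoch we already trust
      }
      // Strictly older epochs are ignored, keeping the view monotonic.
    }

    def activeController: Option[String] = synchronized(current.node)
  }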

A more elegant solution to this problem would probably be to include the controller epoch in the AlterPartition request, but this would require a version bump. Alternatively, we considered letting the controller return UNKNOWN_LEADER_EPOCH instead of FENCED_LEADER_EPOCH when the epoch is larger than what it has in its context, but this too would likely require a version bump. Finally, we considered reverting #12032, which would restore the looser validation logic that allows the controller to accept AlterPartition requests with larger leader epochs. We rejected this option because we feel it can lead to correctness violations.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

      handleResponse(request)
    ))
    controllerOpt.foreach { activeController =>
      // Broker-side check: only send once the known controller's epoch is at least
      // as large as the minimum epoch required by the request.
      if (activeController.epoch >= request.minControllerEpoch) {
Contributor
To confirm, this check is done on the broker side, right? I guess you sort of allude to this in the PR description: potentially a more ideal solution would be for the controller to do the check server-side, but that would require a version bump.

Author

Yes, that's right.
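
In other words, the broker-side gate amounts to something like the following sketch (PendingAlterPartition and maybeSend are illustrative placeholders, not the real channel-manager types):

  object AlterPartitionGate {
    case class PendingAlterPartition(minControllerEpoch: Int, send: () => Unit)

    // Defer the request until the known controller's epoch is at least the required minimum.
    def maybeSend(activeControllerEpoch: Option[Int], req: PendingAlterPartition): Boolean =
      activeControllerEpoch match {
        case Some(epoch) if epoch >= req.minControllerEpoch =>
          req.send()   // safe: this controller has seen at least our controller epoch
          true
        case _ =>
          false        // keep the request queued until a new enough controller is discovered
      }
  }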

info(s"Recorded new controller, from now on will use broker $controllerNode")
updateControllerAddress(controllerNode)
metadataUpdater.setNodes(Seq(controllerNode).asJava)
case Some(controllerNodeAndEpoch) =>
@andymg3 andymg3 (Contributor) commented Aug 10, 2022

Is this where/how eventually the LeaderAndIsr from the new controller gets applied?

@jolshan jolshan (Contributor) commented Aug 10, 2022

This patch fixes the problem by ensuring that AlterPartition is sent to a controller with an epoch which is at least as large as that of the controller which sent the LeaderAndIsr request. This ensures that the FENCED_LEADER_EPOCH error from the controller can be trusted.

So we prevent sending the request if the epoch is lower? And is it the case that there is always a controller with an epoch at least as large? Or in some cases would we need to wait/retry until such a controller exists?

@jsancio jsancio added the core (Kafka Broker), controller (Kafka Controller), and 3.3 labels Aug 10, 2022
@hachikuji (Author)
So we prevent sending the request if the epoch is lower? And is it the case that there is always a controller with an epoch at least as large? Or in some cases would we need to wait/retry until such a controller exists?

@jolshan Yes, that is right. Ensuring some level of monotonicity seems like a good general change even outside the original bug. It is weird to allow the broker to send requests to a controller that it knows for sure is stale, and it makes the system harder to reason about.

One thing I have been trying to think through is how this bug affects KRaft. The KRaft controller will also return FENCED_LEADER_EPOCH if the leader epoch is higher than what it has in its cache. But does KRaft give us a stronger guarantee to work with? I think it could, but at the moment we do not proactively reset the controller in BrokerToControllerChannelManager after we discover a new controller; we only reset it after we receive a NOT_CONTROLLER error in a request. So it seems to me that we could still hit the same problem with KRaft. As a matter of fact, I think this patch does not fix the KRaft problem because we do not propagate the controller epoch down to Partition so that it can be used in AlterPartition requests. cc @jsancio
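
(For context, the reactive reset described above amounts to roughly the following sketch; the names are placeholders, and only a NOT_CONTROLLER response causes the cached node to be forgotten.)

  object ControllerResetSketch {
    sealed trait ControllerResponse
    case object Ok extends ControllerResponse
    case object NotController extends ControllerResponse

    // The cached controller node is dropped only after a NOT_CONTROLLER response,
    // not proactively when a newer controller is discovered, which leaves the same
    // staleness window open.
    def handleResponse(resp: ControllerResponse, resetControllerNode: () => Unit): Unit =
      resp match {
        case NotController => resetControllerNode()
        case Ok            => () // nothing to reset
      }
  }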

@hachikuji (Author)

I am going to close this PR. On the one hand, it does not address the problem for KRaft; on the other, we have thought of a simpler fix for the zk controller, which I will open shortly.

@hachikuji hachikuji closed this Aug 11, 2022