KAFKA-15021; Skip leader epoch bump on ISR shrink #13765
Conversation
@@ -1087,12 +1087,14 @@ class Partition(val topicPartition: TopicPartition,
     // avoid unnecessary collection generation
     val leaderLogEndOffset = leaderLog.logEndOffsetMetadata
     var newHighWatermark = leaderLogEndOffset
-    remoteReplicasMap.values.foreach { replica =>
+    remoteReplicasMap.foreachEntry { (replicaId, replica) =>
Can we use `remoteReplicasMap.values` here and use `replica.brokerId`, similar to the `maximalIsr.contains` call?
Should we have a test in `PartitionTest` to assert that the HWM is incremented when there is a replica that is fenced but caught up?
Done and Done. I confirmed that the test I added fails for the "fenced" and "shutdown" variant against trunk.
Hey @jsancio, to achieve the objective you described, there is another way that doesn't change the definition: change how the components react when the version/epoch changes. We could choose not to restart the fetcher threads on each replica when an ISR shrink with a leader epoch change arrives for processing. Thoughts?

@divijvaidya This does not help. Restarting the fetcher threads is just a means to provide them with the new leader epoch that they have to use. Until they get it, they can't replicate. This is the annoying part. If you don't restart the fetchers and instead update the leader epoch "live", you still have a period of time during which the followers don't have the correct leader epoch. Note that the bump also has an impact on producers/consumers, as they have to refresh their metadata. Overall, I think that the goal is to only bump the leader epoch on leadership changes, to avoid all those disturbances.

Yes, that is fair too. The definition of the leader epoch in that case changes to: it represents the version of the leader after a re-election. In that case, we should also remove the epoch change during ISR expansion. My point is, let's keep the definition as either the state of the ISR (current) or the state of the leader (in which case we remove the epoch change for both expansion and shrink). Aside from that, out of curiosity, is there any other version which represents the state of the ISR in Kafka? Does the replica epoch change on every change to the ISR?
@jsancio I think that the real issue is in
I wonder if we really need this if we change
Yeah, I agree that we need to do both in order to remain consistent.
There is the partition epoch, which is incremented whenever the partition is updated. This includes ISR changes.
Thanks for your feedback @divijvaidya @dajac. I am replying to both your comments in this message.
@divijvaidya, the old code was increasing the leader epoch when the ISR shrank but not when it expanded. My understanding is that we were doing this because the old replica manager used the leader epoch bump to invalidate old fetchers. During shutdown, the fetchers needed to be invalidated to avoid having them rejoin the ISR. With KIP-841, this is no longer necessary, as we can reject brokers that are shutting down from joining the ISR and modifying the HWM. Part of the code for doing this already exists; what was missing, and what part of this PR fixes, is considering this state when advancing the HWM. The partition leader should not include shutting-down replicas that are not in the ISR when determining the HWM.

@divijvaidya, for correctness, the main requirement is that the leader epoch is increased whenever the leader changes. This is needed for log truncation and reconciliation. For log consistency, log truncation and reconciliation assume that the tuples (offset, epoch) are unique per topic partition, and that if the tuple (offset, epoch) matches in two replicas, then their logs up to that offset also match. In my opinion, for correctness Kafka doesn't require that the leader epoch be increased when the ISR changes.

@dajac, yes, I thought about this when I was implementing this feature. I decided against it because the follower (the shutting-down replica) is technically "caught up" to the leader; we simply don't want the leader to wait for the replica when computing the HWM, since we know it will soon be shutting down its fetchers.

@dajac, we need the MV check in the controller even with your suggestion. The question is "When is it beneficial for the controller to not increase the leader epoch when a replica is removed from the ISR because of shutdown?" This is only the case when the controller knows that the brokers have the replica manager fixes in this PR. That is guaranteed to be the case if the MV is greater than the MV introduced in this PR. If the brokers don't contain the fixes in this PR and the controller doesn't bump the leader epoch, PRODUCE requests will time out because the HWM increase will be delayed.
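The HWM rule described above can be sketched as a toy model (hypothetical names, not the actual `Partition.scala` code): the leader only waits for a remote replica if it is in the ISR, or if it is both caught up and eligible to rejoin the ISR. A caught-up but ineligible replica (fenced or shutting down) no longer holds the HWM back.

```java
import java.util.List;

class HwmSketch {
    // Toy remote replica: log end offset plus the two flags the leader checks.
    record Replica(int brokerId, long logEndOffset, boolean caughtUp, boolean isrEligible) {}

    // Start from the leader's log end offset and lower the HWM to the smallest
    // offset among replicas the leader must wait for: ISR members, plus replicas
    // that are caught up AND eligible to rejoin the ISR. A caught-up replica that
    // is fenced or shutting down (not eligible) is skipped, so the HWM can advance.
    static long highWatermark(long leaderLogEndOffset, List<Integer> isr, List<Replica> replicas) {
        long hwm = leaderLogEndOffset;
        for (Replica r : replicas) {
            boolean mustWait = isr.contains(r.brokerId())
                    || (r.caughtUp() && r.isrEligible());
            if (mustWait && r.logEndOffset() < hwm) {
                hwm = r.logEndOffset();
            }
        }
        return hwm;
    }

    public static void main(String[] args) {
        // Replica 2 was removed from the ISR for controlled shutdown; it still passes
        // the lag check but is ineligible, so it no longer holds back the HWM.
        List<Replica> replicas = List.of(
                new Replica(1, 10, true, true),
                new Replica(2, 8, true, false));
        System.out.println(highWatermark(10, List.of(1), replicas)); // prints 10
    }
}
```

With the pre-fix behavior (including caught-up but ineligible replicas), the HWM in the example would be stuck at 8 until `replica.lag.time.max.ms` elapsed.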
Actually, one thing I was wondering about this change, since I am not that familiar with what the metadata version gates: does the change in the PR allow leader epochs to go backwards?
Consider a sequence like the following:
- Initial partition state, leader epoch 0, ISR [0, 1, 2] and metadata version 3.5 on broker and controller. I suppose this is a PartitionRecord.
- Shrink the ISR to [0, 1], leader epoch is bumped to 1. This results in a PartitionChangeRecord
- Shrink the ISR to [0], leader epoch is bumped to 2. This results in a PartitionChangeRecord.
- Publish a message to the leader [0], leader assigns (epoch 2, offset 0).
- Update metadata version to 3.6 and restart [0]. When 0 replays the PartitionChangeRecords from steps 2, 3, the controller will end up with a leader epoch of 0 unless a PartitionRecord snapshot is generated before the restart.
- Publish a message to leader [0], leader assigns (epoch 0, offset 1), we get a backwards epoch.
Same thing applies for controller restarts/etc after MV bump.
If what I described is an issue then the PartitionChangeRecord version may need to be updated so that the controller quorum (or broker metadata log replayer) knows whether a PartitionChangeRecord was persisted with implicit leader epoch bumps on ISR shrink or not so that on record replay the controller can rebuild the correct leader epoch.
Disclaimer: I'm not familiar with KRaft internals, so this is a sort of handwavey guess of how things may go wrong.
@splett2, the important observation is that this PR doesn't change the semantics of replaying these records. I should also point out that the MV check is not required for correctness. It is there for performance, so that the PRODUCE request doesn't time out and the Kafka producer doesn't have to retry the PRODUCE message.
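The replay argument can be illustrated with a toy model (hypothetical field names; assuming, per the point above, that the leader epoch is carried explicitly in each committed change record rather than re-derived from ISR shrinks at replay time): replay just applies the recorded epoch, so records written under the old bump-on-shrink behavior still reproduce the old epochs after an MV upgrade.

```java
import java.util.List;

class ReplaySketch {
    // Toy change record: the new ISR plus an optional explicit leader epoch.
    // NO_EPOCH mirrors an unset optional field in a real change record.
    static final int NO_EPOCH = -1;
    record IsrChange(List<Integer> isr, int leaderEpoch) {}

    // Replay never infers epoch bumps from ISR shrinks; it only applies the
    // epoch that was explicitly written by the controller at commit time.
    static int replayLeaderEpoch(int initialEpoch, List<IsrChange> records) {
        int epoch = initialEpoch;
        for (IsrChange r : records) {
            if (r.leaderEpoch() != NO_EPOCH) {
                epoch = r.leaderEpoch();
            }
        }
        return epoch;
    }

    public static void main(String[] args) {
        // Records written by a pre-3.6 controller that bumped the epoch on each shrink:
        // the bumps are in the records themselves, so replay after an MV upgrade
        // still ends at epoch 2 and epochs cannot go backwards.
        List<IsrChange> oldRecords = List.of(
                new IsrChange(List.of(0, 1), 1),
                new IsrChange(List.of(0), 2));
        System.out.println(replayLeaderEpoch(0, oldRecords)); // prints 2
    }
}
```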
@jsancio
Thank you, I didn't realise that. Next,
Yes. That is what the change to
We need to keep the old behavior in ZK deployments because the ZK controller doesn't implement KIP-841, which is required for this fix to work.
@jsancio Thanks for the clarification. That makes sense to me. I left a few minor comments for consideration.
((replicaState.isCaughtUp(leaderLogEndOffset.messageOffset, currentTimeMs, replicaLagTimeMaxMs) &&
  isReplicaIsrEligible(replica.brokerId)) ||
Is it worth extracting this condition into a helper method (e.g. isIsrEligibleAndCaughtUp)? That would simplify the condition.
I agree. I called it `shouldWaitForReplicaToJoinIsr`, since I think that is what the leader is trying to do.
@@ -357,6 +363,51 @@ abstract class BaseProducerSendTest extends KafkaServerTestHarness {
     }
   }

+  @ParameterizedTest
+  @ValueSource(strings = Array("zk", "kraft"))
+  def testSendToPartitionWithFollowerShutdown(quorum: String): Unit = {
nit: *ShouldNotTimeout? It would be great to capture the issue in the test name or to add a comment about it.
Done and done.
@@ -1456,6 +1456,105 @@ class PartitionTest extends AbstractPartitionTest {
     assertEquals(alterPartitionListener.failures.get, 1)
   }

+  @ParameterizedTest
+  @ValueSource(strings = Array("fenced", "shutdown", "unfenced"))
+  def testHWMIncreasesWithFencedOrShutdownFollower(brokerState: String): Unit = {
nit: s/HWM/HighWatermark?
Done and I added a comment to the last check in the test.
 *
 * In MVs before 3.6 there was a bug (KAFKA-15021) in the brokers' replica manager
 * that required that the leader epoch be bumped whenever the ISR shrank. In MV 3.6 this
 * leader epoch bump is not required when the ISR shrinks.
 */
void triggerLeaderEpochBumpIfNeeded(PartitionChangeRecord record) {
For my understanding, do we bump the leader epoch when the ISR is expanded? My understanding is that we don't.
Correct. The `Replica.contains` check is subtle, but it returns `true` if the second list is a subset of the first list. I added a comment about this.
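As a sketch of that subtlety (a hypothetical helper, not the actual server-common code): a contains-style check that treats the second array as a candidate subset detects shrinks but not expansions, which is why an expanded ISR does not trigger an epoch bump.

```java
class SubsetCheckSketch {
    // Sketch of a contains-style check: true iff every element of b appears in a,
    // i.e. b is a subset of a (duplicates ignored for simplicity).
    static boolean containsAll(int[] a, int[] b) {
        for (int x : b) {
            boolean found = false;
            for (int y : a) {
                if (x == y) { found = true; break; }
            }
            if (!found) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        int[] oldIsr = {0, 1, 2};
        // Shrink: the new ISR is a subset of the old one.
        System.out.println(containsAll(oldIsr, new int[]{0, 1}));       // prints true
        // Expansion: the new ISR adds broker 3, so it is not a subset of the old one.
        System.out.println(containsAll(oldIsr, new int[]{0, 1, 2, 3})); // prints false
    }
}
```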
LGTM, thanks.
…kip-leader-epoch-bump
When the KRaft controller removes a replica from the ISR because of the controlled shutdown there is no need for the leader epoch to be increased by the KRaft controller. This is accurate as long as the topic partition leader doesn't add the removed replica back to the ISR.
This change also fixes a bug when computing the HWM. When computing the HWM, replicas that are not eligible to join the ISR but are caught up should not be included in the computation. Otherwise, the HWM will not increase for `replica.lag.time.max.ms`, because the shutting-down replica is not sending FETCH requests. Without this additional fix, PRODUCE requests would time out if the request timeout is greater than `replica.lag.time.max.ms`.

Because of the bug above, the KRaft controller needs to check the MV to guarantee that all brokers support this bug fix before skipping the leader epoch bump.
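The controller-side decision described above can be sketched as a toy model (names are hypothetical, not the actual `PartitionChangeBuilder` API): a leader change always bumps the epoch, while a pure ISR shrink bumps it only when the metadata version predates the fix, because older brokers still rely on the bump to reset their fetchers.

```java
class EpochBumpGateSketch {
    // Toy model of the controller's epoch decision: bump on any leader change;
    // bump on an ISR shrink only when the MV does not yet support skipping it
    // (i.e. some brokers may lack the replica manager fix).
    static int nextLeaderEpoch(int epoch, boolean leaderChanged,
                               boolean isrShrank, boolean mvSupportsSkippingBump) {
        if (leaderChanged) return epoch + 1;
        if (isrShrank && !mvSupportsSkippingBump) return epoch + 1;
        return epoch;
    }

    public static void main(String[] args) {
        System.out.println(nextLeaderEpoch(5, false, true, true));  // prints 5
        System.out.println(nextLeaderEpoch(5, false, true, false)); // prints 6
        System.out.println(nextLeaderEpoch(5, true, false, true));  // prints 6
    }
}
```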
Committer Checklist (excluded from commit message)