KAFKA-15021; Skip leader epoch bump on ISR shrink #13765

Merged
merged 8 commits into apache:trunk from the kafka--15021-skip-leader-epoch-bump branch on Jun 7, 2023

Conversation

@jsancio (Member) commented May 25, 2023

When the KRaft controller removes a replica from the ISR because of a controlled shutdown, there is no need for it to increase the leader epoch. This holds as long as the topic partition leader doesn't add the removed replica back to the ISR.

This change also fixes a bug in the HWM computation: replicas that are caught up but not eligible to join the ISR should not be included in the computation. Otherwise, the HWM does not advance for up to replica.lag.time.max.ms because the shutting-down replica is no longer sending FETCH requests. Without this additional fix, PRODUCE requests would time out if the request timeout is greater than replica.lag.time.max.ms.

Because of the bug above, the KRaft controller needs to check the metadata version (MV) to guarantee that all brokers contain this bug fix before skipping the leader epoch bump.
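
To illustrate the high-watermark rule described above, here is a minimal, self-contained sketch. The names (ReplicaView, computeHighWatermark) are invented for illustration and do not match Partition.scala: a follower only holds back the HWM if it is in the maximal ISR, or if it is caught up and still eligible to rejoin the ISR.

// Minimal sketch of the HWM rule, using invented types; not Kafka's actual code.
final case class ReplicaView(
  brokerId: Int,
  logEndOffset: Long,
  caughtUp: Boolean,     // within replica.lag.time.max.ms of the leader
  isrEligible: Boolean   // not fenced and not shutting down (KIP-841)
)

def computeHighWatermark(
  leaderLogEndOffset: Long,
  maximalIsr: Set[Int],
  followers: Seq[ReplicaView]
): Long = {
  followers.foldLeft(leaderLogEndOffset) { (hwm, replica) =>
    // Wait for a follower only if it is in the maximal ISR, or caught up AND
    // eligible to rejoin the ISR. A caught-up but shutting-down replica is
    // ignored, so the HWM can advance instead of stalling for
    // replica.lag.time.max.ms.
    val shouldWait =
      maximalIsr.contains(replica.brokerId) || (replica.caughtUp && replica.isrEligible)
    if (shouldWait) math.min(hwm, replica.logEndOffset) else hwm
  }
}

With this rule, a shutting-down follower that is still inside the lag window but no longer fetching does not pin the HWM below the leader's log end offset.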

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@jsancio jsancio requested review from dajac and hachikuji May 25, 2023 17:45
@jsancio jsancio added the kraft label May 25, 2023
@jsancio jsancio changed the title from "KAFKA-15021; Skip leader epoch bump" to "KAFKA-15021; Skip leader epoch bump on ISR shrink" May 25, 2023
@@ -1087,12 +1087,14 @@ class Partition(val topicPartition: TopicPartition,
// avoid unnecessary collection generation
val leaderLogEndOffset = leaderLog.logEndOffsetMetadata
var newHighWatermark = leaderLogEndOffset
- remoteReplicasMap.values.foreach { replica =>
+ remoteReplicasMap.foreachEntry { (replicaId, replica) =>
Contributor

Can we use remoteReplicasMap.values here and use replica.brokerId, similar to the maximalIsr.contains call?

Contributor

Should we have a test in PartitionTest to assert that the HWM is incremented when there is a replica that is fenced but caught up?

Member Author

Done and done. I confirmed that the test I added fails for the "fenced" and "shutdown" variants against trunk.

@divijvaidya (Contributor)

Hey @jsancio
With this change, we are changing the semantics of what a leadership epoch means. Prior to this change, the leadership epoch is a version number representing the membership of the ISR: as soon as the membership changes, this version changes. After this change, the definition becomes "the leadership epoch is a version number that represents the membership of the ISR, in some cases". As you can see, the new definition adds ifs and buts to the simple definition above. Hence, I am not in favour of this change.

To achieve the objective you want, there is another way that doesn't change the definition, i.e. change how the components react when the version/epoch changes. We can choose not to restart the fetcher threads on each replica when an ISR shrink with a leadership epoch change arrives for processing.

Thoughts?

@dajac (Contributor) commented May 30, 2023

To achieve the objective you want, there is another way that doesn't change the definition, i.e. change how the components react when the version/epoch changes. We can choose not to restart the fetcher threads on each replica when an ISR shrink with a leadership epoch change arrives for processing.

@divijvaidya This does not help. Restarting the fetcher threads is just a means to provide them with the new leader epoch that they have to use. Until they get it, they can't replicate. This is the annoying part. If you don't restart the fetchers and update the leader epoch "live", you still have that period of time during which the followers don't have the correct leader epoch. Note that the bump also has an impact on producers/consumers, as they have to refresh their metadata. Overall, I think that the goal is to only bump the leader epoch on leadership changes to avoid all those disturbances.

@divijvaidya (Contributor)

Overall, I think that the goal is to only bump the leader epoch on leadership changes to avoid all those disturbances.

Yes, that is fair too. The definition of the leadership epoch in that case changes to: it represents the version of the leader after a re-election. In that case, we should also remove the epoch change during ISR expansion. My point is, let's keep the definition as either the state of the ISR (current) or the state of the leader (in which case we remove the epoch change for both expansion and shrink).

Aside from that, out of curiosity, is there any other version which represents the state of the ISR in Kafka? Does the replica epoch change on every change to the ISR?

@dajac (Contributor) commented May 30, 2023

This change also fixes a bug in the HWM computation: replicas that are caught up but not eligible to join the ISR should not be included in the computation. Otherwise, the HWM does not advance for up to replica.lag.time.max.ms because the shutting-down replica is no longer sending FETCH requests. Without this additional fix, PRODUCE requests would time out if the request timeout is greater than replica.lag.time.max.ms.

@jsancio I think that the real issue is in Partition.makeLeader. As you can see there, we only reset the followers' states when the leader epoch is bumped. I suppose that this is why you stumbled upon this issue with shutting-down replicas holding back the advancement of the HWM. The issue is that the shutting-down replica's state is not reset, so it remains caught up for replica.lag.time.max.ms. I think that we need to update Partition.makeLeader to always update the followers' states. Obviously, we also need your changes to not consider fenced and shutting-down replicas in the HWM computation.

Because of the bug above, the KRaft controller needs to check the metadata version (MV) to guarantee that all brokers contain this bug fix before skipping the leader epoch bump.

I wonder if we really need this if we change Partition.makeLeader as explained. It seems to me that the change in Partition.makeLeader and Partition.maybeIncrementLeaderHW together should be backward compatible. What do you think?
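
For readers following along, here is a toy model of the behaviour being discussed (invented names; not the real Partition.makeLeader): follower state is reset only when the leader epoch is bumped, which is why a shutting-down follower can keep looking caught up once the bump is skipped.

// Toy model only; field and method names are illustrative.
final class FollowerState(var lastCaughtUpTimeMs: Long, var logEndOffset: Long)

final class ToyPartition(followers: Map[Int, FollowerState], var leaderEpoch: Int) {
  def makeLeader(newLeaderEpoch: Int, currentTimeMs: Long): Unit = {
    val epochBumped = newLeaderEpoch > leaderEpoch
    leaderEpoch = newLeaderEpoch
    // Follower replication state is reset only on an epoch bump. If the bump is
    // skipped on ISR shrink, a shutting-down follower keeps stale state and can
    // look "caught up" for replica.lag.time.max.ms, holding back the HWM.
    if (epochBumped) {
      followers.values.foreach { state =>
        state.lastCaughtUpTimeMs = currentTimeMs
        state.logEndOffset = -1L
      }
    }
    // The suggestion above amounts to performing this reset unconditionally.
  }
}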

@dajac (Contributor) commented May 30, 2023

Yes, that is fair too. The definition of the leadership epoch in that case changes to: it represents the version of the leader after a re-election. In that case, we should also remove the epoch change during ISR expansion. My point is, let's keep the definition as either the state of the ISR (current) or the state of the leader (in which case we remove the epoch change for both expansion and shrink).

Yeah, I agree that we need to do both in order to remain consistent.

Aside from that, out of curiosity, is there any other version which represents the state of the ISR in Kafka? Does the replica epoch change on every change to the ISR?

There is the partition epoch, which is incremented whenever the partition is updated. This includes ISR changes.

@jsancio (Member, Author) commented Jun 1, 2023

Thanks for your feedback @divijvaidya @dajac. I am replying to both your comments in this message.

With this change, we are changing the semantics of what a leadership epoch means. Prior to this change, the leadership epoch is a version number representing the membership of the ISR: as soon as the membership changes, this version changes.

@divijvaidya, the old code was increasing the leader epoch when the ISR shrinks but not when it expands. My understanding is that we were doing this because the old replica manager used the leader epoch bump to invalidate old fetchers. During shutdown, the fetchers needed to be invalidated to avoid having them rejoin the ISR. With KIP-841, this is no longer necessary, as we can prevent brokers that are shutting down from joining the ISR and modifying the HWM.

Part of the code for doing this already exists; what was missing, and what part of this PR fixes, is considering this state when advancing the HWM. The partition leader should not include shutting-down replicas that are not in the ISR when determining the HWM.

After this change, the definition becomes "the leadership epoch is a version number that represents the membership of the ISR, in some cases". As you can see, the new definition adds ifs and buts to the simple definition above. Hence, I am not in favour of this change.

@divijvaidya, for correctness, the main requirement is that the leader epoch is increased whenever the leader changes. This is needed for log truncation and reconciliation. For log consistency, log truncation and reconciliation assume that the (offset, epoch) tuples are unique per topic partition and that if an (offset, epoch) tuple matches in two replicas then their logs up to that offset also match. In my opinion, Kafka's correctness doesn't require that the leader epoch is increased when the ISR changes.
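
As a concrete (and heavily simplified) illustration of why truncation only needs the (offset, epoch) uniqueness property rather than ISR tracking: a follower can locate its divergence point by comparing per-epoch end offsets with the leader. The types and function below are invented for this sketch and are not Kafka's OffsetsForLeaderEpoch implementation.

// Each entry records a leader epoch and the last offset written under it.
final case class EpochEndOffset(epoch: Int, endOffset: Long)

// Walk the follower's epochs from newest to oldest until an epoch the leader also
// has is found; the logs match up to the smaller end offset for that epoch, so
// that is where the follower truncates to. Illustrative only.
def truncationOffset(
  followerEpochs: List[EpochEndOffset],
  leaderEpochs: List[EpochEndOffset]
): Long = {
  val leaderEndOffsets = leaderEpochs.map(e => e.epoch -> e.endOffset).toMap
  followerEpochs
    .sortBy(e => -e.epoch)
    .collectFirst {
      case EpochEndOffset(epoch, endOffset) if leaderEndOffsets.contains(epoch) =>
        math.min(endOffset, leaderEndOffsets(epoch))
    }
    .getOrElse(0L)
}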

As you can see there, we only reset the followers' states when the leader epoch is bumped. I suppose that this is why you stumbled upon this issue with shutting-down replicas holding back the advancement of the HWM. The issue is that the shutting-down replica's state is not reset, so it remains caught up for replica.lag.time.max.ms. I think that we need to update Partition.makeLeader to always update the followers' states. Obviously, we also need your changes to not consider fenced and shutting-down replicas in the HWM computation.

@dajac, yes, I thought about this when I was implementing this feature. I decided against it because the follower (the shutting-down replica) is technically "caught up" to the leader; we simply don't want the leader to wait for that replica when computing the HWM, since we know it will soon shut down its fetchers.

I wonder if we really need this if we change Partition.makeLeader as explained. It seems to me that the change in Partition.makeLeader and Partition.maybeIncrementLeaderHW together should be backward compatible. What do you think?

@dajac, we need the MV check in the controller even with your suggestion. The question is: "When is it beneficial for the controller to not increase the leader epoch when a replica is removed from the ISR because of shutdown?" This is only the case when the controller knows that the brokers have the replica manager fixes in this PR. That is guaranteed to be the case if the MV is greater than or equal to the MV introduced in this PR.

If the brokers don't contain the fixes in this PR and the controller doesn't bump the leader epoch, PRODUCE requests will time out because the HWM increase will be delayed.

@splett2 (Contributor) left a comment

Actually, one thing I was wondering about this change, since I am not that familiar with what the metadata version gates: does the change in the PR allow leader epochs to go backwards?

Consider a sequence like the following:

  1. Initial partition state: leader epoch 0, ISR [0, 1, 2], and metadata version 3.5 on broker and controller. I suppose this is a PartitionRecord.
  2. Shrink the ISR to [0, 1]; the leader epoch is bumped to 1. This results in a PartitionChangeRecord.
  3. Shrink the ISR to [0]; the leader epoch is bumped to 2. This results in a PartitionChangeRecord.
  4. Publish a message to the leader [0]; the leader assigns (epoch 2, offset 0).
  5. Update the metadata version to 3.6 and restart [0]. When 0 replays the PartitionChangeRecords from steps 2 and 3, the controller will end up with a leader epoch of 0 unless a PartitionRecord snapshot is generated before the restart.
  6. Publish a message to leader [0]; the leader assigns (epoch 0, offset 1), and we get a backwards epoch.

The same thing applies to controller restarts, etc., after the MV bump.

If what I described is an issue, then the PartitionChangeRecord version may need to be updated so that the controller quorum (or the broker's metadata log replayer) knows whether a PartitionChangeRecord was persisted with implicit leader epoch bumps on ISR shrink, so that on record replay the controller can rebuild the correct leader epoch.

Disclaimer: I'm not familiar with KRaft internals, so this is a sort of handwavy guess at how things may go wrong.

@jsancio (Member, Author) commented Jun 2, 2023

Actually, one thing I was wondering about this change, since I am not that familiar with what the metadata version gates: does the change in the PR allow leader epochs to go backwards?

@splett2, the important observation is that this PR doesn't change the semantics of replaying PartitionRecord and PartitionChangeRecord with respect to the leader epoch bump. When replaying a PartitionChangeRecord, the state machines (controller and broker) will increase the leader epoch if the Leader field is set. This holds true both before and after this PR. What this PR changes is when the controller sets the Leader field in the PartitionChangeRecord.

I should also point out that the MV check is not required for correctness. It is there for performance, so that PRODUCE requests don't time out and the Kafka producer doesn't have to retry them.
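
A schematic model of the replay rule described above (invented case classes; not the real controller/broker replay code): the epoch is bumped only when the record carries a Leader field, so replay is deterministic regardless of which MV the controller was running when it wrote the record.

// Sketch only; field names loosely follow PartitionChangeRecord.
final case class PartitionState(leader: Int, leaderEpoch: Int, isr: List[Int])
final case class PartitionChange(leader: Option[Int], isr: Option[List[Int]])

def replay(state: PartitionState, change: PartitionChange): PartitionState =
  change.leader match {
    // The Leader field is set: the record itself encodes a leader (epoch) change,
    // so the epoch is bumped no matter which controller version wrote the record.
    case Some(newLeader) =>
      state.copy(
        leader = newLeader,
        leaderEpoch = state.leaderEpoch + 1,
        isr = change.isr.getOrElse(state.isr)
      )
    // No Leader field: only the ISR (or other fields) changed; the epoch is untouched.
    case None =>
      state.copy(isr = change.isr.getOrElse(state.isr))
  }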

@splett2 (Contributor) commented Jun 2, 2023

@jsancio
That makes sense. Sounds good to me in that case.

@divijvaidya (Contributor)

the old code was increasing the leader epoch when the ISR shrinks but not when it expands

Thank you, I didn't realise that.

Next,

  1. We change the leader epoch when a replica is removed and the ISR shrinks (without a leader re-election). Is that correct? If yes, then should we also remove the logic that increments the epoch in such situations, to keep the definition of the leader epoch consistent?

  2. Is a similar change required for the ZK code path?

@jsancio (Member, Author) commented Jun 2, 2023

  1. We change the leader epoch when a replica is removed and the ISR shrinks (without a leader re-election). Is that correct? If yes, then should we also remove the logic that increments the epoch in such situations, to keep the definition of the leader epoch consistent?

Yes. That is what the change to PartitionChangeBuilder does. Can you point me to exactly which code you are referring to?

  2. Is a similar change required for the ZK code path?

We need to keep the old behavior in ZK deployments because the ZK controller doesn't implement KIP-841, which is required for this fix to work.

@dajac (Contributor) left a comment

@jsancio Thanks for the clarification. That makes sense to me. I left a few minor comments for consideration.

Comment on lines 1094 to 1095
((replicaState.isCaughtUp(leaderLogEndOffset.messageOffset, currentTimeMs, replicaLagTimeMaxMs) &&
isReplicaIsrEligible(replica.brokerId)) ||
Contributor

Is it worth extracting this condition into a helper method (e.g. isIsrEligibleAndCaughtUp)? That would simplify the condition.

@jsancio (Member, Author) commented Jun 5, 2023

I agree. I called it shouldWaitForReplicaToJoinIsr since I think this is what the leader is trying to do.
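
For reference, the extracted helper presumably packages the condition shown in the diff above, roughly along these lines (a guess at its shape; the actual signature in Partition.scala may differ):

// Hypothetical standalone rendering of the extracted condition.
def shouldWaitForReplicaToJoinIsr(
  isInMaximalIsr: Boolean,
  isCaughtUp: Boolean,
  isIsrEligible: Boolean
): Boolean =
  // The leader waits for a follower when computing the HWM if the follower is
  // already in the (maximal) ISR, or if it is caught up and eligible to rejoin it.
  isInMaximalIsr || (isCaughtUp && isIsrEligible)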

@@ -357,6 +363,51 @@ abstract class BaseProducerSendTest extends KafkaServerTestHarness {
}
}

@ParameterizedTest
@ValueSource(strings = Array("zk", "kraft"))
def testSendToPartitionWithFollowerShutdown(quorum: String): Unit = {
Contributor

nit: *ShouldNotTimeout? It would be great to capture the issue in the test name or to add a comment about it.

Member Author

Done and done.

@@ -1456,6 +1456,105 @@ class PartitionTest extends AbstractPartitionTest {
assertEquals(alterPartitionListener.failures.get, 1)
}

@ParameterizedTest
@ValueSource(strings = Array("fenced", "shutdown", "unfenced"))
def testHWMIncreasesWithFencedOrShutdownFollower(brokerState: String): Unit = {
Contributor

nit: s/HWM/HighWatermark?

Member Author

Done and I added a comment to the last check in the test.

*
* In MVs before 3.6 there was a bug (KAFKA-15021) in the brokers' replica manager
* that required the leader epoch to be bumped whenever the ISR shrank. In MV 3.6 this
* leader epoch bump is not required when the ISR shrinks.
*/
void triggerLeaderEpochBumpIfNeeded(PartitionChangeRecord record) {
Contributor

For my understanding, do we bump the leader epoch when the ISR is expanded? My understanding is that we don't.

Member Author

Correct. The Replicas.contains check is subtle: it returns true if the second list is a subset of the first list. I added a comment about this.
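
For illustration, here is a self-contained sketch of how a subset-style contains check (true when the second list is a subset of the first) can distinguish a pure ISR expansion from a shrink, and how the MV gate fits in. The names are invented and this is not the actual triggerLeaderEpochBumpIfNeeded code.

// Illustrative only.
def containsAll(first: Set[Int], second: Set[Int]): Boolean = second.subsetOf(first)

def needsLeaderEpochBump(
  currentIsr: Set[Int],
  targetIsr: Set[Int],
  leaderChanged: Boolean,
  mvSkipsBumpOnIsrShrink: Boolean
): Boolean = {
  // True when no replica was removed, i.e. the change is a pure expansion (or a
  // no-op), which never required a bump even before this PR.
  val nothingRemoved = containsAll(targetIsr, currentIsr)
  if (leaderChanged) true                  // an actual leadership change always bumps
  else if (mvSkipsBumpOnIsrShrink) false   // new MV: skip the bump when the ISR merely shrinks
  else !nothingRemoved                     // older MV: bump whenever a replica leaves the ISR
}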

@dajac (Contributor) left a comment

LGTM, thanks.

@jsancio jsancio merged commit 8ad0ed3 into apache:trunk Jun 7, 2023
1 check failed
@jsancio jsancio deleted the kafka--15021-skip-leader-epoch-bump branch June 7, 2023 14:20