KAFKA-9539; Add leader epoch in StopReplicaRequest (KIP-570) #8257
Conversation
core/src/main/scala/kafka/controller/ControllerChannelManager.scala
ok to test
build failures are due to #8301.
retest this please
ok to test
Thanks, left a few comments
}
assertEquals(expectedPartitions, partitions);
} else {
Map<TopicPartition, StopReplicaPartitionState> partitionStates = new HashMap<>();
Wonder if it makes sense to add this method to StopReplicaRequest. Often the code expects to work with TopicPartition.
I have considered this as well but I haven't done it because it is only used in tests so far. The downside is that using this in core is not optimal, so I am a bit reluctant to provide it: allocating and populating the Map is not necessary, especially when the controller and the brokers use the latest version of the API.
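To make the trade-off concrete, here is a minimal sketch of what the helper under discussion could look like; the types are simplified stand-ins, not the real generated StopReplicaPartitionState/TopicPartition classes. The map allocation on every call is the overhead the comment above is reluctant to pay in core.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StopReplicaHelperSketch {
    // Simplified stand-ins for Kafka's generated message and common classes;
    // the real StopReplicaPartitionState and TopicPartition differ in detail.
    record PartitionState(int partitionIndex, int leaderEpoch) {}
    record TopicState(String topicName, List<PartitionState> partitionStates) {}
    record TopicPartition(String topic, int partition) {}

    // The proposed helper: flatten the per-topic layout of the request into a
    // TopicPartition-keyed map. Note the new HashMap allocated on each call.
    static Map<TopicPartition, PartitionState> partitionStates(List<TopicState> topics) {
        Map<TopicPartition, PartitionState> result = new HashMap<>();
        for (TopicState topic : topics)
            for (PartitionState state : topic.partitionStates())
                result.put(new TopicPartition(topic.topicName(), state.partitionIndex()), state);
        return result;
    }
}
```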
topic1.partitionStates().add(new StopReplicaPartitionState()
    .setPartitionIndex(2)
    .setLeaderEpoch(2));
topic1.partitionStates().add(new StopReplicaPartitionState()
Might be worth adding one case where the epoch is -2.
Sure, done.
s"epoch $controllerEpoch for partition $topicPartition since its associated " +
s"leader epoch $requestLeaderEpoch is smaller than the current " +
s"leader epoch $currentLeaderEpoch")
responseMap.put(topicPartition, Errors.STALE_CONTROLLER_EPOCH)
Hmm.. This error seems a little inaccurate. Could we use FENCED_LEADER_EPOCH?
Good point. I completely forgot to raise it...
I use STALE_CONTROLLER_EPOCH here to stay in line with the LeaderAndIsr API, which uses it as well when the leader epoch is stale. See here: https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/server/ReplicaManager.scala#L1227
It was introduced in an old refactoring: a9ff3f2#diff-4f99f5a41c14e2a8523c03ce4ae23987L630. It seems that back in the day we had StaleLeaderEpochCode, but it got replaced by STALE_CONTROLLER_EPOCH.
I was actually wondering if we should stay in line with the current behavior of the LeaderAndIsr API or just use FENCED_LEADER_EPOCH. If there are no other reasons besides the historical one to use STALE_CONTROLLER_EPOCH, FENCED_LEADER_EPOCH indeed seems more appropriate.
I have changed it to use FENCED_LEADER_EPOCH.
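The resulting check can be pictured with a small sketch (illustrative names and structure, not the actual ReplicaManager code): a StopReplica partition state carrying a leader epoch older than the broker's current one is fenced, while the deletion sentinel (-2) always passes.

```java
public class EpochCheckSketch {
    // Sentinel used by KIP-570 when the topic is queued for deletion.
    static final int EPOCH_DURING_DELETE = -2;

    enum Error { NONE, FENCED_LEADER_EPOCH }

    // Reject a request leader epoch that is smaller than the current one,
    // unless it is the deletion sentinel, which bypasses the fencing check.
    static Error validate(int requestLeaderEpoch, int currentLeaderEpoch) {
        if (requestLeaderEpoch != EPOCH_DURING_DELETE
                && requestLeaderEpoch < currentLeaderEpoch) {
            return Error.FENCED_LEADER_EPOCH;
        }
        return Error.NONE;
    }
}
```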
try {
// Delete log and corresponding folders in case replica manager doesn't hold them anymore.
// This could happen when topic is being deleted while broker is down and recovers.
maybeCleanReplica(topicPartition, deletePartition)
This changes the order of the operations. Previously we would have stopped fetchers before attempting to delete the log directory. Are we sure this is safe?
I need to take another look at it.
I had a second look at this one and I think that the change is safe. Let me explain.
We can end up in this situation, where the replica is not known any more by the broker, in two ways:
- The broker receives a StopReplica request without having received a LeaderAndIsr request prior to it. In this case, the partition is not created and the fetchers haven't been started, so we don't need to stop them for the concerned partitions. We have one test which simulates such a scenario: https://github.com/apache/kafka/blob/trunk/core/src/test/scala/unit/kafka/admin/DeleteTopicTest.scala#L432. It fails if I comment out the cleaning logic.
- The handling of the StopReplica request fails with a storage exception when it deletes the log directory. The delete happens after the partition is effectively removed from the allPartitions map in the ReplicaManager. Note that the fetchers for the concerned partitions are already stopped at this point, as they are stopped before removing the partition from the map. If the request is retried somehow, the partition won't be there, so the cleaning would take place.
Altogether, fetchers are always started after the partition is added to the allPartitions map and always stopped before removing the partition from the map. If it is not in the map, fetchers can't be started. Thus, this seems safe to me based on my current knowledge.
The only benefit of putting it there is that the logging is better in my opinion. When the replica does not exist, we don't get the "handling StopReplica..." and "completed StopReplica..." messages but only "Ignoring StopReplica... cause replica does not exist".
I would be fine reverting to the prior behavior as it is only a cosmetic change in the end. It may be safer to do so.
retest this please
@hachikuji I think that I have answered all your comments/questions. Could you have a second look at it please?
case (topicPartition, Left(true)) if topicPartition.topic == TRANSACTION_STATE_TOPIC_NAME =>
  // The StopReplica API does not pass through the leader epoch
  txnCoordinator.onResignation(topicPartition.partition, coordinatorEpoch = None)
@hachikuji It looks like we could improve this now that we do have the leader epoch. I am not at all familiar with transactions. Can I just pass the epoch here when it is provided?
It seems safe to pass through when defined.
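The agreed behavior might be sketched like this (hypothetical names; the real coordinator code is Scala and takes an Option[Int]): the leader epoch is forwarded only when the request carried a real one, not the deletion sentinel.

```java
import java.util.Optional;

public class ResignationSketch {
    // Sentinel used when the topic is queued for deletion; in that case no
    // meaningful leader epoch is available to hand to the coordinator.
    static final int EPOCH_DURING_DELETE = -2;

    // Decide what to pass to txnCoordinator.onResignation: a defined epoch
    // when the StopReplica request carried one, otherwise nothing.
    static Optional<Integer> epochToForward(int requestLeaderEpoch) {
        return requestLeaderEpoch == EPOCH_DURING_DELETE
            ? Optional.empty()
            : Optional.of(requestLeaderEpoch);
    }
}
```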
Thanks, left a few more comments.
}

return new StopReplicaRequest(data, version);
}

private boolean deletePartitions() {
Might not be too big of a problem, but it would be nice to avoid this pass through all the partitions. As it is, we have 1) a first pass to split the partitions, 2) a second pass to validate the split, and 3) a third pass to convert to the needed type. Seems like we should be able to save some work here, perhaps by moving the conversion to the caller (even though it's annoying).
You're right. I did not realise this. I think that the best way is to move everything to the caller. I will do this.
I have refactored this and pushed all the conversion to the caller.
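The idea of collapsing the passes can be sketched as follows (illustrative types, not the actual request classes): the caller converts to TopicPartition and splits by the delete flag in a single traversal instead of three.

```java
import java.util.ArrayList;
import java.util.List;

public class SinglePassSplitSketch {
    record TopicPartition(String topic, int partition) {}
    record PartitionState(String topic, int partitionIndex, boolean deletePartition) {}
    record Split(List<TopicPartition> toDelete, List<TopicPartition> toStop) {}

    // One pass over the partition states: build the TopicPartition and route
    // it to the right bucket at the same time, avoiding separate split,
    // validation, and conversion traversals.
    static Split split(List<PartitionState> states) {
        List<TopicPartition> toDelete = new ArrayList<>();
        List<TopicPartition> toStop = new ArrayList<>();
        for (PartitionState s : states) {
            TopicPartition tp = new TopicPartition(s.topic(), s.partitionIndex());
            (s.deletePartition() ? toDelete : toStop).add(tp);
        }
        return new Split(toDelete, toStop);
    }
}
```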
s"${stopReplicaRequest.brokerEpoch} smaller than the current broker epoch ${controller.brokerEpoch}")
sendResponseExemptThrottle(request, new StopReplicaResponse(new StopReplicaResponseData().setErrorCode(Errors.STALE_BROKER_EPOCH.code)))
} else {
val (result, error) = replicaManager.stopReplicas(stopReplicaRequest)
val (result, error) = replicaManager.stopReplicas(request.context.correlationId, stopReplicaRequest)
The return type is a bit awkward. As far as I can tell, the left side is just returning the deletion status, which is taken from the request. Was this an optimization in order to avoid another traversal of the request?
That's correct. Now, we also need the leaderEpoch to pass it to the txnCoordinator so I will refactor this.
I have refactored this as well. It is much better now.
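One way to picture the refactor (purely illustrative; the actual code is Scala and its shapes differ): return a per-partition result that bundles the error with the fields the caller needs, such as the delete flag and the leader epoch, so KafkaApis no longer re-traverses the request.

```java
public class StopReplicaResultSketch {
    enum Error { NONE, FENCED_LEADER_EPOCH }

    // A per-partition result carrying the error together with the request
    // fields the caller needs downstream (e.g. for coordinator resignation),
    // instead of an awkward pair of parallel structures.
    record PartitionResult(boolean deletePartition, int leaderEpoch, Error error) {}
}
```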
try {
// Delete log and corresponding folders in case replica manager doesn't hold them anymore.
// This could happen when topic is being deleted while broker is down and recovers.
maybeCleanReplica(topicPartition, deletePartition)
I am not sure I followed your response to my previous question. My concern was actually the happy path when the partition exists locally. If we delete first before stopping replica fetchers, then would the fetcher thread handle that gracefully? By removing the fetchers first, we are guaranteed that we couldn't have a write in progress at the time of deletion.
Sorry if my comment was not clear. I was trying to argue that the fetcher thread can't be running for a given partition if the partition is not known by the ReplicaManager (case HostedPartition.None), because the fetcher thread is started after the partition is added to the allPartitions map in the ReplicaManager by the LeaderAndIsrRequest, and stopped before the partition is removed from allPartitions by the StopReplicaRequest. This is based on my current understanding of the ReplicaManager but, as it is fairly new to me, I may have missed something. Did I?
It is probably better to keep the previous behavior to be 100% safe.
I have reverted to the previous behavior to be 100% safe.
I think I get what you were saying now. You were probably right.
@hachikuji I have updated the PR based on your inputs. Could you have another look at it please?
ok to test
Just a few minor questions. This looks nearly ready to merge.
@@ -396,9 +395,23 @@ abstract class AbstractControllerBrokerRequestBatch(config: KafkaConfig,
def addStopReplicaRequestForBrokers(brokerIds: Seq[Int],
                                    topicPartition: TopicPartition,
                                    deletePartition: Boolean): Unit = {
// A sentinel (-2) is used as an epoch if the topic is queued for deletion or | |||
// does not have a leader yet. This sentinel overrides any existing epoch. |
The first part of this definitely makes sense, but what is the motivation for the second part? Why not use LeaderAndIsr.NoEpoch? Though I can't really think of what would cause this case to be hit.
Indeed, LeaderAndIsr.NoEpoch makes more sense as a default value here; I made a mistake. You're right that I don't think this should ever happen, but I went down the defensive path with a default value just in case.
I have updated the PR.
.map(_.leaderAndIsr.leaderEpoch)
.getOrElse(LeaderAndIsr.EpochDuringDelete)
}
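The epoch selection discussed in this thread can be sketched as follows. This is a hedged sketch with assumed constant values (LeaderAndIsr.NoEpoch taken as -1 and the deletion sentinel as -2, per the discussion above); the real code is Scala and uses Option rather than OptionalInt.

```java
import java.util.OptionalInt;

public class StopReplicaEpochSketch {
    static final int NO_EPOCH = -1;            // assumed value of LeaderAndIsr.NoEpoch
    static final int EPOCH_DURING_DELETE = -2; // sentinel for topics queued for deletion

    // Epoch to carry in the StopReplicaRequest: the sentinel when the topic
    // is queued for deletion (it overrides any existing epoch); otherwise
    // the current leader epoch if known, falling back to NoEpoch as a
    // defensive default for the case that should never be hit.
    static int leaderEpoch(boolean topicQueuedForDeletion, OptionalInt currentLeaderEpoch) {
        if (topicQueuedForDeletion) return EPOCH_DURING_DELETE;
        return currentLeaderEpoch.orElse(NO_EPOCH);
    }
}
```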
brokerIds.filter(_ >= 0).foreach { brokerId => |
No need to fix here, but do you know why we do this filtering?
I've asked myself the same question but I couldn't find a reason. I believe that brokerId is always >= 0 in the controller.
def stopReplica(topicPartition: TopicPartition, deletePartition: Boolean) = {
stateChangeLogger.trace(s"Handling stop replica (delete=$deletePartition) for partition $topicPartition")

def stopReplica(topicPartition: TopicPartition, deletePartition: Boolean): Unit = {
if (deletePartition) {
getPartition(topicPartition) match {
Could potentially use nonOfflinePartition(topicPartition).foreach
Actually, it does not work because we need both the reference to the hosted partition and the partition below: hostedPartition and removedPartition. nonOfflinePartition only provides the latter.
(responseMap, Errors.STALE_CONTROLLER_EPOCH)
} else {
val partitions = stopReplicaRequest.partitions.asScala.toSet
controllerEpoch = stopReplicaRequest.controllerEpoch
this.controllerEpoch = controllerEpoch
Good call updating this.
retest this please
ok to test
retest this please
LGTM. Thanks for the patch!
This PR implements KIP-570.