KAFKA-5202: Handle topic deletion while trying to send txn markers #3130
Conversation
ping @hachikuji @apurvam @dguy
@@ -208,6 +209,10 @@ class TransactionCoordinator(brokerId: Int,
    if (transactionalId == null || transactionalId.isEmpty) {
      responseCallback(Errors.INVALID_REQUEST)
    } else {
      // if there is any partitions unknown in the metadata cache, return immediately to client
      if (partitions.exists(tp => !metadataCache.contains(tp.topic, tp.partition)))
        responseCallback(Errors.UNKNOWN_TOPIC_OR_PARTITION)
Mentioned offline, but leaving it here as well: we should send this as a per-partition error, only for the partitions which are missing.
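For illustration, a minimal sketch of what a per-partition response could look like, using a hypothetical `MetadataCacheLike` stand-in for the broker's metadata cache rather than the real coordinator callback signature:

```scala
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.protocol.Errors

object PerPartitionErrorsSketch {
  // Hypothetical stand-in for the broker's MetadataCache; only the contains check is modeled.
  trait MetadataCacheLike {
    def contains(topic: String, partitionId: Int): Boolean
  }

  // Report UNKNOWN_TOPIC_OR_PARTITION only for the partitions missing from the cache,
  // instead of failing the whole AddPartitionsToTxn request with a single error.
  def partitionErrors(partitions: Set[TopicPartition],
                      metadataCache: MetadataCacheLike): Map[TopicPartition, Errors] = {
    val (known, missing) = partitions.partition(tp => metadataCache.contains(tp.topic, tp.partition))
    missing.map(tp => tp -> Errors.UNKNOWN_TOPIC_OR_PARTITION).toMap ++
      known.map(tp => tp -> Errors.NONE).toMap
  }
}
```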
Thinking about this a bit more, the behavior of the client will be tricky. Essentially, the send will block until it's expired, and yet the AddPartitions will keep retrying indefinitely. And because of https://issues.apache.org/jira/browse/KAFKA-5251, the client will keep retrying the AddPartitions even if the user aborts. If we want to keep this behavior, I think we may want to fix KAFKA-5251 as well. This way the user can at least abort the transaction properly after realizing that they are trying to send to deleted partitions.
> If we want to keep this behavior, I think we may want to fix KAFKA-5251 as well.

I think KAFKA-5251 is an optimization, while what we want to fix here is the client behavior: if the topic partition gets deleted while the producer has an ongoing transaction, the producer should be able to detect this and proceed instead of retrying, right?
You are right. In other words, the problem is that UNKNOWN_TOPIC_OR_PARTITION is a retriable error right now, so the client will keep retrying. We need to add logic to distinguish between cases where a broker is bounced and doesn't have its metadata yet and cases where the topic is truly deleted. There is no such logic in the client today, so it will retry indefinitely when a topic is deleted.
// TODO: instead of retry until succeed, we can first put it into an unknown broker queue and let the sender thread to look for its broker and migrate them
while (brokerNode.isEmpty) {
  brokerNode = metadataCache.getPartitionLeaderEndpoint(topicPartition.topic, topicPartition.partition, interBrokerListenerName)
Why can't we simply do metadataCache.contains(topicPartition.topic) to check if the topic exists?
Not sure I follow... we need to get the broker node in order to put it into the corresponding queue, right?
Hmm. I guess I need to sync face to face. But from the code it seems like if there is no leader for any single topic partition, then the operation for the entire transactionalId in the purgatory will be marked as completed? If this is true, then I have two questions:

- Would we enter this case if there are no live replicas for a particular partition? Or will metadataCache.getPartitionLeaderEndpoint only return no brokers if the topic is deleted?
- Assuming that no leader means the partition is deleted, would the current logic mark the operation in purgatory as successful if even one partition in the transaction was deleted?

As I said, we can discuss this face to face: it may be more efficient as I am not the most familiar with a bunch of this stuff.
- Good question about getPartitionLeaderEndpoint: I will use a separate queue for brokers that are known but not available yet (see the sketch below).
- That is intentional: if one of the partitions is deleted but others have successfully written the markers, we should still treat this transaction as completed, since the append of the PrepareXX txn log entry means that the txn has to be CompleteXX, even if some data partitions are deleted.
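A rough sketch of the separate-queue idea being described, using hypothetical stand-in types (`PendingMarker`, a `leaderFor` lookup) in place of the real `TxnMarkerQueue` and `MetadataCache`; the actual implementation differs in its details:

```scala
import java.util.concurrent.ConcurrentLinkedQueue
import org.apache.kafka.common.{Node, TopicPartition}
import scala.collection.concurrent.TrieMap

object UnknownBrokerQueueSketch {
  // Hypothetical marker entry; the real code carries the full TxnMarkerEntry.
  case class PendingMarker(transactionalId: String, topicPartition: TopicPartition)

  class MarkerQueues(leaderFor: TopicPartition => Option[Node]) {
    val queueForUnknownBroker = new ConcurrentLinkedQueue[PendingMarker]()
    val queuePerBroker = TrieMap.empty[Int, ConcurrentLinkedQueue[PendingMarker]]

    // Called by the sender thread on each drain: re-resolve the leader endpoint and
    // either migrate the marker to that broker's queue, keep it for the next drain,
    // or drop it if the partition no longer has a leader (likely deleted).
    def drainUnknown(): Unit = {
      val drained = Iterator.continually(queueForUnknownBroker.poll()).takeWhile(_ != null).toList
      drained.foreach { marker =>
        leaderFor(marker.topicPartition) match {
          case Some(node) if node != Node.noNode =>
            queuePerBroker
              .getOrElseUpdate(node.id, new ConcurrentLinkedQueue[PendingMarker]())
              .add(marker)
          case Some(_) =>
            queueForUnknownBroker.add(marker) // leader known but node not reachable yet
          case None => // no leader in the cache: topic was likely deleted, skip the marker
        }
      }
    }
  }
}
```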
  brokerNode.get
}
case None =>
  // if the leader of the partition is unknown, skip sending the txn marker since
@guozhangwang I guess we are assuming deletion, as we would never have made it this far if the partition wasn't previously in the cache?
The bottom line is that metadataCache's cache memory structure is "incremental": it will not remove entries once they are added, unless the tp is marked as deleted in the metadata update request. In KafkaApis we have already checked the cache and filtered out any partitions whose info is not in the cache, so if the partition is gone later in the path here, it means it was there before and has since been deleted.
I guess it is possible that the topic is deleted and recreated before we notice that it is gone. In that case, we might write the marker to the new topic. That seems fine since there would be no transactional data from the producer.
By the way, is it assumed that metadata must be updated before we can become the leader of any partition? Is there no scenario where we could see an uninitialized or stale metadata cache?
When there is a leader change, the controller will always send the "update metadata request", wait for its response, and then send the "leader and isr request" in a second round trip, so I think the answer should be yes.

> That seems fine since there would be no transactional data from the producer.

Yes, as I mentioned in the description of the PR above.
Yeah, saw your comment after I posted the review. Glad we came to the same conclusion!
Thanks for the patch. Left mostly minor comments. The only thing I wasn't too sure about is how safe it is to depend on the contents of the metadata cache. Also, if checking the cache is reliable, we could also check it when loading the transaction metadata on partition immigration before sending any markers.
@@ -156,6 +155,9 @@ class TransactionMarkerChannelManager(config: KafkaConfig,
  }

  // visible for testing
  private[transaction] def queueForUnknownBroker = markersQueueForUnknownBroker
nit: maybe you could just give markersQueueForUnknownBroker private scope in the transaction package? Also, queueForUnknownBroker is a bit easier on the eyes than markersQueueForUnknownBroker, so maybe we could remove the accessor and rename the variable?
Hmm, I felt the variable names markersQueueForUnknownBroker and markersQueuePerBroker are okay, and the point of the two functions queueForUnknownBroker and queueForBroker() is exactly to separate the actual private variables from the test-only functionality.
Hiding the field doesn't seem to have much value if we expose it directly through an accessor anyway and it's annoying to need two names for the same thing. If it only needs to be accessed from a test case, I wouldn't worry too much about violating encapsulation because we can just change the test case if the implementation changes. That said, it's just a nit, so feel free to ignore.
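To make the two options concrete, a tiny sketch with a plain mutable Queue standing in for the real TxnMarkerQueue (names here are hypothetical):

```scala
package kafka.coordinator.transaction {

  class ChannelManagerSketch {
    // Option A (as in the PR): keep the field private and expose a test-only accessor.
    private val markersQueueForUnknownBroker = new scala.collection.mutable.Queue[String]()
    // visible for testing
    private[transaction] def queueForUnknownBroker = markersQueueForUnknownBroker

    // Option B (the suggestion): make the field itself package-private, so tests in the
    // same package can read it directly and only one name is needed.
    private[transaction] val markersQueueForUnknownBrokerB = new scala.collection.mutable.Queue[String]()
  }
}
```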
@@ -128,10 +128,9 @@ class TransactionMarkerChannelManager(config: KafkaConfig,

  private val markersQueuePerBroker: concurrent.Map[Int, TxnMarkerQueue] = concurrent.TrieMap.empty[Int, TxnMarkerQueue]

  private val interBrokerListenerName: ListenerName = config.interBrokerListenerName
  private val markersQueueForUnknownBroker: TxnMarkerQueue = new TxnMarkerQueue(Node.noNode)
nit: type parameter on the lhs seems redundant.
IntelliJ states the opposite :) Anyways, I know Ismael is a strong advocate against such redundancy, so I will change it back.
@@ -16,6 +16,7 @@
 */
package kafka.coordinator.transaction

import kafka.server.MetadataCache
Seems we don't use this anywhere?
ack.
@@ -160,16 +160,20 @@ class MetadataCache(brokerId: Int) extends Logging {
    }
  }

  // if the leader is not known, return None;
  // if the leader is known and corresponding node is available, return Some(node)
  // if the leader is known but corresponding node with the listener name is not available, return Some(NO_NODE)
  def getPartitionLeaderEndpoint(topic: String, partitionId: Int, listenerName: ListenerName): Option[Node] = {
    inReadLock(partitionMetadataLock) {
      cache.get(topic).flatMap(_.get(partitionId)) match {
nit: seems this could be a map instead of a match?
Yes! ack.
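For instance, the nit amounts to something like this (a generic sketch over Option, not the actual MetadataCache code):

```scala
object OptionMatchVsMap {
  // With match: explicit Some/None cases.
  def endpointWithMatch[A, B](stateOpt: Option[A], buildNode: A => B): Option[B] =
    stateOpt match {
      case Some(state) => Some(buildNode(state))
      case None        => None
    }

  // With map: the same behavior in one expression.
  def endpointWithMap[A, B](stateOpt: Option[A], buildNode: A => B): Option[B] =
    stateOpt.map(buildNode)
}
```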
@@ -235,6 +239,8 @@ class MetadataCache(brokerId: Int) extends Logging {
    }
  }

  def contains(topic: String, partitionId: Int): Boolean = getPartitionInfo(topic, partitionId).isDefined
nit: could this accept TopicPartition instead? Seems like that's what we usually have in KafkaApis.
ack.
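A sketch of the suggested overload, with a simplified hypothetical cache (the real getPartitionInfo returns partition state, not AnyRef):

```scala
import org.apache.kafka.common.TopicPartition

// Hypothetical simplified cache: `infos` stands in for the broker's partition-state map.
class SimpleMetadataCache(infos: Map[(String, Int), AnyRef]) {
  def getPartitionInfo(topic: String, partitionId: Int): Option[AnyRef] =
    infos.get((topic, partitionId))

  def contains(topic: String, partitionId: Int): Boolean =
    getPartitionInfo(topic, partitionId).isDefined

  // The suggested convenience overload, matching what KafkaApis usually has in hand.
  def contains(tp: TopicPartition): Boolean = contains(tp.topic, tp.partition)
}
```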
@@ -254,31 +272,65 @@ class TransactionMarkerChannelManager(config: KafkaConfig,
                          result: TransactionResult, coordinatorEpoch: Int,
                          topicPartitions: immutable.Set[TopicPartition]): Unit = {
    val txnTopicPartition = txnStateManager.partitionFor(transactionalId)
    val partitionsByDestination: immutable.Map[Node, immutable.Set[TopicPartition]] = topicPartitions.groupBy { topicPartition: TopicPartition =>
    var brokerNode: Option[Node] = None
    val partitionsByDestination: immutable.Map[Option[Node], immutable.Set[TopicPartition]] = topicPartitions.groupBy { topicPartition: TopicPartition =>
nit: keying with an Option is a little odd. I wonder if it would be clearer to partition this into two separate collections, partitionsByDestination and partitionsWithUnknownDestinations?
I agree it is a bit odd, though with the above approach we would need to traverse the set twice with metadataCache.getPartitionLeaderEndpoint, once for partition and once for groupBy. Since this code is part of a sort-of critical path, I'm not sure which one is more efficient.
Yeah, not too sure how much difference it would make, but we can leave it as is.
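The two shapes under discussion, sketched with a hypothetical `leaderFor` lookup in place of `metadataCache.getPartitionLeaderEndpoint`:

```scala
import org.apache.kafka.common.{Node, TopicPartition}

object DestinationGrouping {
  // One pass over the partitions, but the map is keyed by Option[Node].
  def groupWithOptionKey(partitions: Set[TopicPartition],
                         leaderFor: TopicPartition => Option[Node]): Map[Option[Node], Set[TopicPartition]] =
    partitions.groupBy(leaderFor)

  // Two separate collections, at the cost of looking up each leader twice.
  def groupIntoTwoCollections(partitions: Set[TopicPartition],
                              leaderFor: TopicPartition => Option[Node])
      : (Map[Node, Set[TopicPartition]], Set[TopicPartition]) = {
    val (known, unknownDestinations) = partitions.partition(tp => leaderFor(tp).isDefined)
    val byDestination = known.groupBy(tp => leaderFor(tp).get)
    (byDestination, unknownDestinations)
  }
}
```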
  val marker = new TxnMarkerEntry(producerId, producerEpoch, coordinatorEpoch, result, topicPartitions.toList.asJava)
  val txnIdAndMarker = TxnIdAndMarkerEntry(transactionalId, marker)

  if (brokerNode.eq(Node.noNode)) {
nit: I understand we can use reference equality, but is it necessary? Seems a bit brittle to depend on noNode returning the same object.
Again, this is for efficiency; I can change it to equals if you feel it is not necessary.
No strong opinion either way, but I'd favor the normal == check because the performance difference is almost certainly negligible.
ack.
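For reference, a small sketch of the difference being discussed, assuming kafka-clients on the classpath; it constructs a Node equal in value to Node.noNode but as a distinct object:

```scala
import org.apache.kafka.common.Node

object NoNodeEquality extends App {
  // A Node equal in value to Node.noNode but constructed separately.
  val separate = new Node(-1, "", -1)

  println(separate.eq(Node.noNode)) // false: reference equality fails for a distinct object
  println(separate == Node.noNode)  // true: == delegates to Node.equals, so value equality suffices
}
```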
case None =>
  // if the leader of the partition is unknown, skip sending the txn marker since
  // the partition is likely to be deleted already
  info(s"Couldn't find leader endpoint for partitions $topicPartitions while trying to send transaction markers for " +
nit: Maybe we could move this to after the coordinator epoch check? Seems like it might be misleading if we end up cancelling the operation anyway.
ack.
case Right(Some(epochAndMetadata)) =>
  if (epochAndMetadata.coordinatorEpoch != coordinatorEpoch) {
    info(s"The cached metadata have been changed to $epochAndMetadata since preparing to send markers; cancel sending markers to its partition leaders")
Maybe worth mentioning the old and new coordinator epoch values?
ack.
…into K5202-handle-topic-deletion
…into K5202-handle-topic-deletion
LGTM
Tests passing locally. Merging to trunk and 0.11.0.
Here is the sketch of this proposal:

1. When it is time to send the txn markers, only look for the leader node of the partition once instead of retrying, and if that information is not available, it means the partition has very likely been removed, since it was in the cache before. In this case, we just remove the partition from the metadata object and skip putting it into the corresponding queue, and if all partitions' leader brokers are unavailable, complete this delayed operation to proceed to write the complete txn log entry.

2. If the leader id is known from the cache but the corresponding node object with the listener name is not available, it means that the leader is likely unavailable right now. Put it into a separate queue and let the sender thread retry fetching its metadata again each time upon draining the queue.

One caveat of this approach is the delete-and-recreate case, and the argument is that since all the messages are deleted anyway when deleting the topic-partition, it does not matter whether the markers are on the log partitions or not.

Author: Guozhang Wang <wangguoz@gmail.com>

Reviewers: Apurva Mehta <apurva@confluent.io>, Damian Guy <damian.guy@gmail.com>, Jason Gustafson <jason@confluent.io>

Closes #3130 from guozhangwang/K5202-handle-topic-deletion

(cherry picked from commit 80223b1)

Signed-off-by: Jason Gustafson <jason@confluent.io>
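As a rough illustration of rules (1) and (2) above, here is a compact sketch of the routing decision, with hypothetical stand-in types; the real TransactionMarkerChannelManager differs in many details:

```scala
import org.apache.kafka.common.{Node, TopicPartition}

object MarkerRoutingSketch {
  sealed trait Destination
  case class KnownBroker(node: Node) extends Destination // enqueue on this broker's queue
  case object UnknownBroker extends Destination          // leader known but node unavailable: retry queue
  case object LikelyDeleted extends Destination          // no leader in the cache: skip the marker

  // leaderFor stands in for metadataCache.getPartitionLeaderEndpoint.
  def route(tp: TopicPartition, leaderFor: TopicPartition => Option[Node]): Destination =
    leaderFor(tp) match {
      case None                              => LikelyDeleted
      case Some(node) if node == Node.noNode => UnknownBroker
      case Some(node)                        => KnownBroker(node)
    }

  def groupMarkers(partitions: Set[TopicPartition],
                   leaderFor: TopicPartition => Option[Node])
      : (Map[Node, Set[TopicPartition]], Set[TopicPartition]) = {
    val routed = partitions.groupBy(tp => route(tp, leaderFor))
    val perBroker = routed.collect { case (KnownBroker(node), tps) => node -> tps }
    val retryLater = routed.getOrElse(UnknownBroker, Set.empty[TopicPartition])
    // Partitions routed to LikelyDeleted are simply dropped; if every partition is dropped,
    // the delayed operation can complete immediately and the complete txn log entry is written.
    (perBroker, retryLater)
  }
}
```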