KAFKA-5059: Implement Transactional Coordinator #2849
Conversation
* add transaction log message format
* add transaction timeout to initPid request
* collapse to one message type
* sub-package transaction and group classes within coordinator
* add loading and cleaning up logic
* add transaction configs
* add all broker-side configs
* check for transaction timeout value
* added one more exception type
* handling add offsets to txn
* add a pending state with prepareTransition / completeTransaction / abortTransition of state
* refactor handling logic for multiple in-flight requests
… stubs and client changes.
1. Notable conflicts are with the small API changes to DelayedOperation and the newly introduced purgeDataBefore PR.
2. Jason's update to support streaming decompression required a bit of an overhaul to the way we handle aborted transactions on the consumer.
Add tests for TransactionMarkerRequestCompletionHandler
Exactly once end txn
@ijuma @junrao @guozhangwang @apurvam @mjsax for reviews please.
@dguy: Thanks for the updated patch. Made another pass of the non-test files and added some more comments. Some of the issues can potentially be addressed in a follow-up patch, as long as we mark that clearly. Also, it seems that we don't have the code to (1) abort a long-running transaction; (2) expire a transactional id that hasn't been actively used for some time?
}
def sendResponseCallback(error: Errors) {
  val responseBody = new EndTxnResponse(error)
  trace(s"Completed ${endTxnRequest.transactionalId()}'s EndTxnRequest with command: ${endTxnRequest.command()}, errors: $error from client ${request.header.clientId()}.")
Be consistent on whether to use () when calling clientId()?
}

private def loadTransactionMetadata(topicPartition: TopicPartition) {
  def highWaterMark = replicaManager.getHighWatermark(topicPartition).getOrElse(-1L)
This still needs to be addressed. The loading in GroupCoordinator has the same issue.
} else if (!txnManager.validateTransactionTimeoutMs(transactionTimeoutMs)) {
  // check transactionTimeoutMs is not larger than the broker configured maximum allowed value
  responseCallback(initTransactionError(Errors.INVALID_TRANSACTION_TIMEOUT))
} else {
I am a bit worried about all those independent checks on transactional state w/o any coordinator level locking. For example, in theory, a coordinator emigration and immigration could have happened after the check in line 104. Then, the appendMetadataToLog()/initPidWithExistingMetadata() call could mess up some state.
I was thinking that another way of doing this is to maintain a read/write lock for the coordinator partition. Immigration/emigration will hold the write lock while setting the state. Other calls like initPid will hold the read lock, do the proper coordinator state check, initiate the process like appending to the log and then release the read lock (we already have such a partition level lock in Partition, not sure if it's easily reusable). This will potentially give us better protection and make things easier to reason about.
@junrao thanks - yeah, that makes sense. I was thinking about locking too, but wasn't sure of the correct level to do it at; the partition level seems OK. Will look into it. Thanks for the suggestion.
@junrao the TC maintains multiple partitions, so we'd need to have a lock per partition. You mentioned that there is a read/write lock on Partition - I believe you are referring to leaderIsrUpdateLock ... I can't see any other locks in Partition. Anyway, do we want to expose this for other classes to use? I'd probably think not.
If we maintain a lock per partition then perhaps it should be done by the TransactionStateManager, and then we'd need to add/remove locks during immigration/emigration. I think we'd also need to add another method on TransactionStateManager, say partitionLock(partitionId), that returns an Option[ReentrantReadWriteLock]. The calls in TransactionCoordinator to isCoordinatorFor could then be replaced with calls to partitionLock(partitionId) - if the lock exists they take a read lock; if it doesn't exist then respond with Errors.NOT_COORDINATOR.
Does this seem sensible?
Agree, maybe we can have a read-write lock on the txn metadata cache and only release the read lock after the txn log has been appended locally?
@guozhangwang per partition? or a global lock?
On second thought, I'll add a single read/write lock in the coordinator, as it is much simpler than having to maintain multiple. If that is not OK, we can revisit.
Most operations just need to hold a read lock; only emigration/immigration need to hold a write lock. So perhaps having a single lock per broker is also fine, as long as we don't hold the lock for too long (i.e., we should mostly just be setting critical state while holding the lock; any expensive work should be done outside the lock).
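For reference, a minimal sketch of the single broker-level read/write lock idea discussed in this thread; the class and method names (TxnCoordinatorState, onImmigration, withReadLockIfCoordinator) are hypothetical, not the actual coordinator API:

```scala
import java.util.concurrent.locks.ReentrantReadWriteLock
import scala.collection.mutable

// Hypothetical sketch only: a single coordinator-level read/write lock.
// Immigration/emigration take the write lock while mutating ownership state;
// request paths (e.g. initPid) take the read lock while checking ownership
// and initiating the next step (such as a txn log append).
class TxnCoordinatorState {
  private val stateLock = new ReentrantReadWriteLock()
  private val ownedPartitions = mutable.Set[Int]()

  // Broker became leader for a txn log partition.
  def onImmigration(partitionId: Int): Unit = {
    stateLock.writeLock().lock()
    try ownedPartitions += partitionId
    finally stateLock.writeLock().unlock()
  }

  // Leadership for a txn log partition moved away.
  def onEmigration(partitionId: Int): Unit = {
    stateLock.writeLock().lock()
    try ownedPartitions -= partitionId
    finally stateLock.writeLock().unlock()
  }

  // Request path: hold the read lock across the ownership check and the
  // initiation of the work, so an emigration cannot slip in between.
  def withReadLockIfCoordinator(partitionId: Int)(initiate: => Unit)(notCoordinator: => Unit): Unit = {
    stateLock.readLock().lock()
    try {
      if (ownedPartitions.contains(partitionId)) initiate
      else notCoordinator // e.g. respond with Errors.NOT_COORDINATOR
    } finally stateLock.readLock().unlock()
  }
}
```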
// there might be a concurrent thread that has just updated the mapping
// with the transactional id at the same time; in this case we will
// treat it as the metadata has existed and update it accordingly
Hmm, appendMetadataToLog() is not doing exactly the same as if the metadata has existed. It seems that we will be missing all those checks on metadata state in initPidWithExistingMetadata()?
I'm not sure what was meant by the comment, but I think you are correct in that we should do initPidWithExistingMetadata() in the case that they aren't the same. @guozhangwang any thoughts?
I think @junrao's comment is that we did some checking on the txn metadata's state in initPidWithExistingMetadata, whereas we did not do such checking before calling appendMetadataToLog. I have explained to him that it is because at line 129 we are assured that the metadata has just been newly created and hence it's always Ongoing. Maybe the comment itself has become outdated after the addition of the initPidWithExistingMetadata logic.
// with the transactional id at the same time; in this case we will
// treat it as the metadata has existed and update it accordingly
metadata synchronized {
  if (!metadata.equals(newMetadata))
Should we just do an eq check (reference equality) instead?
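For the distinction being suggested, a minimal, self-contained illustration of eq (reference equality) versus equals (value equality); TxnMeta is a stand-in, not the real TransactionMetadata class:

```scala
// Minimal illustration of the difference; TxnMeta is a hypothetical stand-in.
object EqVsEquals extends App {
  case class TxnMeta(pid: Long, epoch: Short)

  val a = TxnMeta(1L, 0)
  val b = TxnMeta(1L, 0) // structurally equal, but a different instance

  println(a.equals(b)) // true  -- value equality (what the current check uses)
  println(a eq b)      // false -- reference equality: not the same object
  println(a eq a)      // true  -- only when it is literally the same instance
}
```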
case Some(partitionInfo) =>
  val brokerId = partitionInfo.leaderIsrAndControllerEpoch.leaderAndIsr.leader
  if (currentBrokers.add(brokerId)) {
    // TODO: What should we do if we get BrokerEndPointNotAvailableException?
We should just wait until the target broker is available.
  brokerId
case None =>
  // TODO: there is a rare case that the producer gets the partition info from another broker who has the newer information of the
  // partition, while the TC itself has not received the propagated metadata update; do we need to block when this happens?
Hmm, instead of throwing IllegalStateException, it seems that we should just keep retrying until successful?
  completionCallback(Errors.INVALID_TXN_STATE)
else {
  val delayedTxnMarker = new DelayedTxnMarker(metadataToWrite, completionCallback)
  txnMarkerPurgatory.tryCompleteElseWatch(delayedTxnMarker, Seq(metadata.pid))
Since the timeout in DelayedTxnMarker is infinite, I am wondering if we really need a txnMarkerPurgatory at all. In TransactionMarkerRequestCompletionHandler, we are already updating the pending partitions as the client responses come back. The response that removes the last pending partition can just trigger the calling of completionCallback.
In initPidWithExistingMetadata we also need to wait on the transaction to complete if there is an in-flight transaction in the PrepareAbort or PrepareCommit phase.
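As a rough illustration of the suggestion above (completing via the last marker response instead of a purgatory entry), here is a minimal sketch; PendingMarkers and markerAcked are hypothetical names, not the real handler API:

```scala
import scala.collection.mutable

// Hypothetical sketch: instead of a purgatory entry with an infinite timeout,
// complete the transaction from the marker response handler as soon as the
// last pending partition has been acknowledged.
class PendingMarkers(partitions: Set[String], completionCallback: () => Unit) {
  private val pending = mutable.Set(partitions.toSeq: _*)

  // Called once per partition from the WriteTxnMarkers response handler.
  def markerAcked(partition: String): Unit = {
    val completedNow = pending.synchronized {
      pending.remove(partition) && pending.isEmpty
    }
    // The response that removes the last pending partition fires the callback.
    if (completedNow)
      completionCallback()
  }
}

// Usage sketch:
// val markers = new PendingMarkers(Set("topicA-0", "topicB-1"), () => println("txn completed"))
// markers.markerAcked("topicA-0")
// markers.markerAcked("topicB-1") // last ack triggers the completion callback
```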
def appendTransactionToLog(transactionalId: String,
                           txnMetadata: TransactionMetadata,
                           responseCallback: Errors => Unit) {
Currently, if the broker's message format is < V2, when appending to the log we simply convert to the old format. In this case, we want to error out and respond to the client with a TransactionNotSupported error instead.
@junrao I'm not really sure what I'm supposed to be checking here?
This is about the inter-broker protocol version. Details are here:
http://kafka.apache.org/documentation/#upgrade_10_1
Maybe just leave a TODO marker and I can address it in a follow-up PR, so we don't drag this one out too long?
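For context, a minimal sketch of the kind of version gate being discussed; the constant and result names here are illustrative, not the actual Kafka config keys or error codes:

```scala
// Hypothetical sketch: before appending transaction state (or sending the new
// inter-broker requests), verify the broker's configured message format /
// inter-broker protocol supports transactions, and error out rather than
// silently down-converting to the old format.
object TxnVersionCheck {
  val MinTransactionalMessageFormat = 2 // V2 record format carries txn markers

  sealed trait Result
  case object Ok extends Result
  case object UnsupportedForMessageFormat extends Result

  def checkAppend(messageFormatVersion: Int): Result =
    if (messageFormatVersion < MinTransactionalMessageFormat)
      UnsupportedForMessageFormat // respond to the client with an error instead
    else
      Ok
}
```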
request.request,
now,
true,
completionHandler)
Since we are sending a new type of request across the brokers, we need to check the inter-broker protocol version and error out if the new request is not supported.
How would I do that?
Responded in another comment. Let's do this incrementally in another PR and just leave a TODO in this PR. Otherwise we would be looking at a 10K diff.
OK, I've filed: https://issues.apache.org/jira/browse/KAFKA-5128
@junrao thanks for taking the time to review again.
Correct, they have not been done yet.
Made a pass over the added non-test code beyond my commits, plus the places where I got pinged.
@@ -285,7 +285,7 @@ public void forceClose() {

 private ClientResponse sendAndAwaitInitPidRequest(Node node) throws IOException {
     String nodeId = node.idString();
-    InitPidRequest.Builder builder = new InitPidRequest.Builder(null);
+    InitPidRequest.Builder builder = new InitPidRequest.Builder(null, Integer.MAX_VALUE);
In that case, do we really need this change in this PR? Maybe we can just remove it, as it is effectively doing the same thing.
@@ -38,11 +38,12 @@

import org.apache.kafka.common.errors.InvalidReplicationFactorException;
import org.apache.kafka.common.errors.InvalidRequestException;
import org.apache.kafka.common.errors.InvalidRequiredAcksException;
import org.apache.kafka.common.errors.InvalidTxnTimeoutException;
Are these changes intentional? The original ordering seems OK to me.
if (this == o) return true;
if (o == null || getClass() != o.getClass()) return false;
final WriteTxnMarkersRequest that = (WriteTxnMarkersRequest) o;
return coordinatorEpoch == that.coordinatorEpoch &&
Thinking about this a bit more, I wonder if the coordinatorEpoch should also be in the internal entry, since different txn log partitions have different leader epochs and hence different coordinator epochs?
EDIT: in the existing branch I have already made those changes a while back: https://github.com/guozhangwang/kafka/blob/KEOS-transactions-coordinator-network-thread/clients/src/main/java/org/apache/kafka/common/requests/WriteTxnMarkerRequest.java
The design doc, however, has not been updated.
I saw you did a groupBy on the coordinatorEpoch instead, so that each write marker request will only contain one coordinatorEpoch, but since on the broker side this coordinator epoch is checked inside the Log layer anyway, I felt it is better to make this a per-marker-entry field in the protocol.
So this should change back to what you previously had? We originally had your code, but during the merge with other changes it was probably removed. That is why I did the groupBy on coordinatorEpoch.
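To make the two layouts concrete, a hedged sketch of per-request versus per-marker-entry coordinatorEpoch; the class and field names are illustrative, not the actual WriteTxnMarkersRequest schema:

```scala
// Hypothetical sketch of the two layouts being discussed.

// Option A: one coordinatorEpoch per request, markers grouped by epoch
// (roughly what the groupBy in this PR produces).
case class MarkerEntryA(pid: Long, epoch: Short, committed: Boolean, partitions: Seq[String])
case class WriteTxnMarkersRequestA(coordinatorEpoch: Int, markers: Seq[MarkerEntryA])

// Option B: coordinatorEpoch carried on each marker entry, so a single request
// can mix markers from txn log partitions with different coordinator epochs.
case class MarkerEntryB(pid: Long, epoch: Short, coordinatorEpoch: Int,
                        committed: Boolean, partitions: Seq[String])
case class WriteTxnMarkersRequestB(markers: Seq[MarkerEntryB])
```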
metrics,
time,
"txn-marker-channel",
Map("broker-id" -> config.brokerId.toString).asJava,
I did this in the original commit, but since this thread is owned by the broker only, we do not need this tag. Instead we can just pass an empty tag map.
new ManualMetadataUpdater(),
threadName,
1,
50,
Nice catch. Currently we do not have a broker-side reconnect.backoff config, so different modules just hard-code different values. Moving forward, I feel we may want to introduce a new config for inter-broker reconnect backoff.
if (responseError != Errors.NONE) {
  debug(s"Updating $transactionalId's transaction state to $txnMetadata for $transactionalId failed after the transaction message " +
    s"has been appended to the log since the metadata does not match anymore.")
The log entry was wrong; maybe just "since the appended log did not successfully replicate to all replicas". Does that sound better?
    completionHandler.onComplete(disConnectedResponse)
  }
}
networkClient.poll(pollTimeout, now)
Since we now have one queue per broker, and 1) we drain all the elements in the queue whenever trying to send, and 2) we wake up the client whenever we add new elements to the queue, I think it is not as critical to set lower values?
}

private[transaction]
def drainQueuedTransactionMarkers(txnMarkerPurgatory: DelayedOperationPurgatory[DelayedTxnMarker]): Iterable[RequestAndCompletionHandler] = {
@dguy @junrao What's the motivation for trying to drain all the queued elements? Since the max in-flight requests is only 1 in the network client, even if we construct multiple requests for a given destination, only the first request will succeed in sending, right? In that case could we just do the 1) peek first, 2) if ready, send and pop pattern?
@guozhangwang this is largely a refactoring of your code from here: https://github.com/guozhangwang/kafka/blob/KEOS-transactions-coordinator-network-thread/core/src/main/scala/kafka/coordinator/transaction/TransactionMarkerChannelManager.scala#L157 :-P
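For reference, a minimal sketch of the peek-first / if-ready-send-and-pop pattern suggested above; PerBrokerQueue and its parameters are stand-ins, not the real NetworkClient API:

```scala
import scala.collection.mutable

// Hypothetical sketch of the alternative to draining the whole per-broker
// queue: since at most one request per destination can be in flight, only the
// head of the queue is considered on each poll loop iteration.
class PerBrokerQueue[Request](isReady: () => Boolean, send: Request => Unit) {
  private val queue = mutable.Queue[Request]()

  def add(request: Request): Unit = queue.synchronized {
    queue.enqueue(request)
    // in the real code we would also wake up the network client here
  }

  // Peek first; if the destination is ready, send the head and pop it.
  def maybeSendNext(): Unit = queue.synchronized {
    if (queue.nonEmpty && isReady())
      send(queue.dequeue())
  }
}
```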
Collapsed all commits and merged to trunk.
For large PRs like this, we should run the system tests before we merge. Can we post a link to a successful run for future record (assuming we've done that)?
@ijuma I've kicked off a build here: https://jenkins.confluent.io/view/All/job/system-test-kafka-branch-builder-2/275/. Let's cross our fingers since it's already merged!
Thanks @hachikuji, build passed. :)