KAFKA-5132: abort long running transactions #2957

dguy · 2017-05-02T15:12:13Z

Abort any ongoing transactions that haven't been touched for longer than the transaction timeout

dguy · 2017-05-02T15:14:44Z

@guozhangwang @apurvam @hachikuji @mjsax should the expiry time be based on the transaction start time or the last updated time?

Also, in the Google doc it says:

If its status is PREPARE_COMMIT, then complete the committing process of the transaction.
If its status is PREPARE_ABORT, then complete the aborting process of the transaction.

Though I'm not sure why we'd need to do this. If the transaction has made it to either of those statuses, then it is going to complete anyway.

apurvam · 2017-05-02T15:52:21Z

The expiry time is based on the start time. I think we need to add that field to the messages on the transaction log and set it on the first add partitions. From then on, we use that time to determine if the transaction needs to be timed out.

Regarding your second comment, you are correct. If the transaction is in PREPARE_XXX state, it should be rolled forward or rolled back. I think the point of that passage is that we should not force abort it if it is already rolling forward and it has hit the timeout.
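The rule agreed on above (expire on start time, and leave PREPARE_XXX transactions alone) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `TxnMetadata`, `txnStartTimestampMs`, `txnTimeoutMs`, and `TxnExpiry` are hypothetical names chosen for the example.

```scala
// Hypothetical, simplified model of transaction metadata; not Kafka's actual classes.
sealed trait TxnState
case object Ongoing extends TxnState
case object PrepareCommit extends TxnState
case object PrepareAbort extends TxnState

final case class TxnMetadata(
  transactionalId: String,
  state: TxnState,
  txnStartTimestampMs: Long, // assumed to be set when the first partitions are added
  txnTimeoutMs: Long
)

object TxnExpiry {
  // Only Ongoing transactions are candidates. PREPARE_* transactions are
  // already rolling forward or back, so they are left to complete.
  def shouldExpire(txn: TxnMetadata, nowMs: Long): Boolean =
    txn.state == Ongoing && nowMs - txn.txnStartTimestampMs > txn.txnTimeoutMs

  // Selects the transactions that a periodic expiry task would abort.
  def expired(txns: Seq[TxnMetadata], nowMs: Long): Seq[TxnMetadata] =
    txns.filter(shouldExpire(_, nowMs))
}
```

Measuring from the start time rather than the last update means a producer cannot keep a transaction alive indefinitely just by continuing to touch it.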

asfbot · 2017-05-02T18:22:55Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/3376/
Test FAILed (JDK 8 and Scala 2.11).

asfbot · 2017-05-02T18:23:09Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/3367/
Test FAILed (JDK 8 and Scala 2.12).

asfbot · 2017-05-02T18:24:01Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/3370/
Test FAILed (JDK 7 and Scala 2.10).

apurvam · 2017-05-02T18:44:45Z

Hmm, there might be a resource leak somewhere, because all the tests are failing with an OutOfMemoryError.

asfbot · 2017-05-03T09:57:12Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/3405/
Test FAILed (JDK 8 and Scala 2.12).

asfbot · 2017-05-03T10:05:15Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/3408/
Test FAILed (JDK 7 and Scala 2.10).

dguy · 2017-05-03T11:47:57Z

I don't understand the build failure. I've run the Jenkins build locally and it all passes.

asfbot · 2017-05-03T12:27:10Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/3414/
Test FAILed (JDK 8 and Scala 2.11).

dguy · 2017-05-03T15:47:47Z

It is leaking threads. Looking into it.

asfbot · 2017-05-03T17:34:17Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/3422/
Test PASSed (JDK 7 and Scala 2.10).

asfbot · 2017-05-03T18:15:41Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/3419/
Test PASSed (JDK 8 and Scala 2.12).

asfbot · 2017-05-03T18:16:27Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/3428/
Test PASSed (JDK 8 and Scala 2.11).

asfbot · 2017-05-03T20:30:11Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/3443/
Test FAILed (JDK 8 and Scala 2.11).

asfbot · 2017-05-03T20:37:36Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/3437/
Test FAILed (JDK 7 and Scala 2.10).

asfbot · 2017-05-03T20:45:13Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/3434/
Test PASSed (JDK 8 and Scala 2.12).

apurvam

Looks good to me overall. My biggest comment is about the 'two phase' operation of bumping the epoch and then transitioning to PREPARE_ABORT when a transaction needs to be expired. I think we can just initiate PREPARE_ABORT with the higher epoch. Please correct me if I am wrong!

Also, out of curiosity: all the shutdown and close calls were added in response to the resource leaks that were exposed on Jenkins. But a lot of these resources seemed to be leaking anyway, independently of these changes (there was no networkClient.close, no transactionCoordinator.shutdown(), etc.). This suggests that the tests were already at a tipping point, and this new code pushed them over. Is my assessment correct?

apurvam · 2017-05-04T06:10:44Z

core/src/main/scala/kafka/coordinator/transaction/TransactionCoordinator.scala

+                idAndMetadata.metadata.producerEpoch,
+                TransactionResult.ABORT,
+                (errors: Errors) => {
+                  warn(s"rollback of transactionalId: ${idAndMetadata.transactionalId} failed during transaction expiry. errors: $errors")


Shouldn't this only be logged if there is an error?
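One way to read this suggestion is a callback that stays silent on success. The sketch below is only an illustration: `Errors` here is a hypothetical stand-in for Kafka's enum, and `warn` is passed in as a parameter so the example is self-contained.

```scala
// Hypothetical stand-ins for Kafka's Errors enum; only two values are modelled.
sealed trait Errors
object Errors {
  case object NONE extends Errors
  case object NOT_COORDINATOR extends Errors
}

object ExpiryCallback {
  // Builds a callback that reports only failed rollbacks.
  // `warn` is injected here; in the coordinator it would be the logging helper.
  def apply(transactionalId: String, warn: String => Unit): Errors => Unit =
    (errors: Errors) =>
      if (errors != Errors.NONE)
        warn(s"rollback of transactionalId: $transactionalId failed during transaction expiry. errors: $errors")
}
```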

apurvam · 2017-05-04T06:11:04Z

core/src/main/scala/kafka/coordinator/transaction/TransactionCoordinator.scala

+          idAndMetadata.metadata.prepareTransitionTo(Ongoing)
+          txnManager.appendTransactionToLog(idAndMetadata.transactionalId, idAndMetadata.metadata, (errors: Errors) => {
+            if (errors != Errors.NONE)
+            // TODO: Is this sufficient? It will be retried later if it failed


I think this is sufficient. What else can you do on a background scheduled thread anyway?

apurvam · 2017-05-04T06:13:25Z

core/src/main/scala/kafka/coordinator/transaction/TransactionCoordinator.scala

+        if (!txnManager.isCoordinatorLoadingInProgress(idAndMetadata.transactionalId)) {
+          idAndMetadata.metadata.producerEpoch = (idAndMetadata.metadata.producerEpoch + 1).toShort
+          idAndMetadata.metadata.prepareTransitionTo(Ongoing)
+          txnManager.appendTransactionToLog(idAndMetadata.transactionalId, idAndMetadata.metadata, (errors: Errors) => {


Why can't we initiate an ABORT directly with the bumped epoch? We have to write the PREPARE anyway. We might as well do it with the higher epoch.

Thinking about this a bit more, I can see the motivation for the two-phase approach: you will categorically fence off any existing producer while maintaining the Ongoing state. You can then safely begin the abort.

However, even with the two phase approach, won't you still have a race condition where the existing producer could do a PREPARE_COMMIT which gets in before your epoch bump?

Yes, the idea is to fence off any existing producer.
As for the race condition: yes, I should check that there is no pending state transition before bumping the epoch. Thanks for pointing that out.
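The guard discussed in this exchange might look roughly like the sketch below. It is an illustration under assumptions, not the PR's implementation: the class and method names are hypothetical, and the write to the transaction log is omitted.

```scala
// Hypothetical, simplified model; names mirror the diff but this is not Kafka's code.
sealed trait TxnState
case object Ongoing extends TxnState
case object PrepareCommit extends TxnState
case object PrepareAbort extends TxnState

final class TxnMetadata(var producerEpoch: Short, var state: TxnState) {
  // Set while a state change is being written to the log but not yet applied.
  var pendingState: Option[TxnState] = None
}

object ExpirySketch {
  // Phase 1: fence the current producer by bumping the epoch, but only when
  // no other transition (for example a PREPARE_COMMIT) is already in flight.
  // Returns true if the caller may go on to phase 2, the abort itself.
  def fenceForExpiry(metadata: TxnMetadata): Boolean = metadata.synchronized {
    if (metadata.state == Ongoing && metadata.pendingState.isEmpty) {
      metadata.producerEpoch = (metadata.producerEpoch + 1).toShort
      true
    } else false
  }
}
```

If a commit is already pending, the expiry task simply skips the transaction and leaves it to complete.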

apurvam · 2017-05-04T06:15:39Z

core/src/main/scala/kafka/coordinator/transaction/TransactionStateManager.scala

-    scheduler.startup()
-
+    if (!scheduler.isStarted)
+      scheduler.startup()
    // TODO: add transaction and pid expiration logic


Is this comment still valid?

half valid ;-)

dguy · 2017-05-04T08:13:42Z

@apurvam - yes the resources were leaking anyway, but this pushed it beyond the limit.

asfbot · 2017-05-04T10:25:20Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/3481/
Test PASSed (JDK 8 and Scala 2.11).

asfbot · 2017-05-04T10:29:59Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/3475/
Test FAILed (JDK 7 and Scala 2.10).

asfbot · 2017-05-04T10:43:55Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/3472/
Test FAILed (JDK 8 and Scala 2.12).

dguy · 2017-05-04T16:29:53Z

retest this please

apurvam

Looks good. Left one comment about my understanding of the assumptions around synchronization. If those are correct, I think this is pretty solid.

apurvam · 2017-05-05T17:37:18Z

core/src/main/scala/kafka/coordinator/transaction/TransactionCoordinator.scala

+          && idAndMetadata.metadata.pendingState.isEmpty) {
+          idAndMetadata.metadata.producerEpoch = (idAndMetadata.metadata.producerEpoch + 1).toShort
+          idAndMetadata.metadata.prepareTransitionTo(Ongoing)
+          txnManager.appendTransactionToLog(idAndMetadata.transactionalId, idAndMetadata.metadata, (errors: Errors) => {


This looks good. For my edification: this txnManager.appendTransactionToLog will happen under the idAndMetadata.metadata lock, which is a shared object. So any further request from the producer with this transactional id will be blocked until after the epoch is bumped, and hence will get a ProducerFencedException, correct?

@apurvam correct, they will get a ProducerFencedException.
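The synchronization assumption confirmed here can be illustrated with a small model. Everything in it is hypothetical (`FencingSketch`, `Result`, `handleRequest`); it only shows why taking the same lock for the bump and for request validation means a stale producer is rejected.

```scala
// Hypothetical model: one shared metadata object per transactional id,
// guarded by its own monitor.
final class TxnMetadata(var producerEpoch: Short)

sealed trait Result
case object Accepted extends Result
case object Fenced extends Result // stands in for the ProducerFencedException seen by the client

object FencingSketch {
  // Run by the expiry task: the bump happens while holding the metadata lock.
  def bumpEpoch(metadata: TxnMetadata): Unit = metadata.synchronized {
    metadata.producerEpoch = (metadata.producerEpoch + 1).toShort
  }

  // Run for each producer request: it takes the same lock, so it observes the
  // epoch either entirely before or entirely after the bump, never in between.
  def handleRequest(metadata: TxnMetadata, requestEpoch: Short): Result = metadata.synchronized {
    if (requestEpoch < metadata.producerEpoch) Fenced else Accepted
  }
}
```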

asfbot · 2017-05-06T12:42:49Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/3578/
Test FAILed (JDK 8 and Scala 2.12).

asfbot · 2017-05-06T12:42:56Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/3582/
Test FAILed (JDK 7 and Scala 2.10).

asfbot · 2017-05-06T12:43:09Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/3588/
Test FAILed (JDK 8 and Scala 2.11).

asfbot · 2017-05-10T16:15:16Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/3709/
Test FAILed (JDK 8 and Scala 2.11).

asfbot · 2017-05-10T16:15:18Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/3699/
Test FAILed (JDK 8 and Scala 2.12).

asfbot · 2017-05-10T16:15:19Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/3703/
Test FAILed (JDK 7 and Scala 2.10).

hachikuji · 2017-05-11T20:20:49Z

core/src/main/scala/kafka/coordinator/transaction/TransactionCoordinator.scala

+        if (!txnManager.isCoordinatorLoadingInProgress(idAndMetadata.transactionalId)
+          && idAndMetadata.metadata.pendingState.isEmpty) {
+          idAndMetadata.metadata.producerEpoch = (idAndMetadata.metadata.producerEpoch + 1).toShort
+          idAndMetadata.metadata.prepareTransitionTo(Ongoing)


Sorry if this was raised before, but why don't we bump the epoch and transition to PREPARE_ABORT in the same write? If there's an edge case we're protecting by doing the epoch bump first, it might be useful to add a comment to document it.

guozhangwang · 2017-05-12T05:38:39Z

retest this please.

asfbot · 2017-05-12T06:39:46Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/3781/
Test FAILed (JDK 8 and Scala 2.12).

asfbot · 2017-05-12T06:48:30Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/3793/
Test FAILed (JDK 7 and Scala 2.11).

guozhangwang · 2017-05-12T07:21:14Z

retest this please

ijuma · 2017-05-12T07:42:00Z

core/src/main/scala/kafka/server/KafkaServer.scala

@@ -611,6 +611,8 @@ class KafkaServer(val config: KafkaConfig, time: Time = Time.SYSTEM, threadNameP
          CoreUtils.swallow(zkUtils.close())
        if (metrics != null)
          CoreUtils.swallow(metrics.close())
+        if (transactionCoordinator != null)
+          CoreUtils.swallow(transactionCoordinator.shutdown())


This needs to be removed as we have already merged a PR that fixes this:

3085d4f

Note that the shutdown needs to happen before the LogManager is shut down (we had an issue in the past where the GroupCoordinator tried to write to an already closed LogManager).
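The ordering constraint can be sketched as below. This is a toy model, not KafkaServer: `Component` and `ShutdownSketch` are hypothetical, and `swallow` only imitates the intent of CoreUtils.swallow (log the failure and carry on).

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.control.NonFatal

// Hypothetical stand-in for a server component; it records when it was stopped.
final class Component(name: String, events: ArrayBuffer[String]) {
  def shutdown(): Unit = events += s"$name shutdown"
}

object ShutdownSketch {
  // Imitates the intent of CoreUtils.swallow: report and continue so later steps still run.
  def swallow(action: => Unit): Unit =
    try action catch { case NonFatal(e) => println(s"ignored during shutdown: ${e.getMessage}") }

  // The coordinator goes first because it may still append to the log while stopping.
  def shutdownAll(coordinator: Component, logManager: Component): Unit = {
    if (coordinator != null) swallow(coordinator.shutdown())
    if (logManager != null) swallow(logManager.shutdown())
  }
}
```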

asfbot · 2017-05-12T08:13:03Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/3798/
Test PASSed (JDK 7 and Scala 2.11).

asfbot · 2017-05-12T08:42:07Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/3786/
Test PASSed (JDK 8 and Scala 2.12).

asfbot · 2017-05-12T09:38:21Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/3794/
Test FAILed (JDK 8 and Scala 2.12).

asfbot · 2017-05-12T10:01:14Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/3802/
Test PASSed (JDK 7 and Scala 2.11).

dguy · 2017-05-12T10:27:37Z

retest this please

asfbot · 2017-05-12T10:35:22Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/3790/
Test PASSed (JDK 8 and Scala 2.12).

asfbot · 2017-05-12T10:42:32Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/3799/
Test FAILed (JDK 8 and Scala 2.12).

asfbot · 2017-05-12T11:33:46Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/3811/
Test PASSed (JDK 7 and Scala 2.11).

guozhangwang

Merged to trunk.

abort Ongoing transactions that have expired

b680920

check the coordinator is not loading

f7c1923

merge trunk

a7de6e0

dguy added 2 commits May 3, 2017 17:03

shutdown TC during KafkaServer shutdown

9e972ef

wakeup and close network client during shutdown

4830f19

use transaction start time for expiry

584a843

apurvam reviewed May 4, 2017

address Apurva's comments

9c24519

apurvam approved these changes May 5, 2017

address most comments

9b1fe29

Merge branch 'trunk' into kafka-5132

e5ff2f4

hachikuji reviewed May 11, 2017

ijuma reviewed May 12, 2017

Merge branch 'trunk' into kafka-5132

6c4c9e4

dguy added 3 commits May 12, 2017 09:56

remove tc shutdown after merging trunk

877be67

add comment about bumping epoch during expiration

4b6a625

remove second purgatory shutdown in TC

daa877d

guozhangwang approved these changes May 12, 2017

asfgit closed this in 4951849 May 12, 2017

dguy deleted the kafka-5132 branch May 16, 2017 14:04
