KAFKA-14053: Transactional producer should bump the epoch and skip ab… #12392
base: trunk
Conversation
…orting when a client side timeout is encountered

When a transactional batch encounters a delivery or request timeout, it can still be in flight. In this situation, if the transaction is aborted, the abort marker might get appended to the log earlier than the in-flight batch. This can block the LSO of a partition indefinitely, or violate the processing guarantees. To avoid this, on a client-side timeout, the transactional producer should skip aborting (EndTxnRequest) and bump the epoch instead. Since this is a fencing bump, the producer cannot safely continue, resulting in a fatal error.
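For context, a minimal sketch of a client setup that can run into this situation (the topic, bootstrap address, and timeout values are illustrative assumptions, not part of the patch):

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.serialization.StringSerializer;

public class TxnTimeoutRepro {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "demo-txn");
        // Deliberately tight timeouts, so a slow broker makes batches expire on the client
        // side while the produce request may still be in flight on the broker.
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 4_000);
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 5_000);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("demo-topic", "key", "value"));
            try {
                producer.commitTransaction();
            } catch (KafkaException e) {
                // The batch hit delivery.timeout.ms but may still be in flight. Before this
                // fix, aborting here could append the abort marker ahead of that batch.
                producer.abortTransaction();
            }
        }
    }
}
```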
Force-pushed from bcb201a to 99f6fad
clients/src/main/java/org/apache/kafka/clients/producer/internals/TransactionManager.java
```diff
@@ -728,18 +779,11 @@ synchronized void maybeResolveSequences() {
             } else {
                 // We would enter this branch if all in flight batches were ultimately expired in the producer.
                 if (isTransactional()) {
-                    // For the transactional producer, we bump the epoch if possible, otherwise we transition to a fatal error
+                    // For the transactional producer, we bump the epoch if possible, then transition to a fatal error
```
It's a behaviour change here: when the transactional producer reaches this state, we'll do an epoch bump and then it'll be a fatal error.
Could you explain how it actually changed? What's the difference between flipping the epochBumpRequired flag and going to abortable, versus going to FATAL_BUMPABLE_ERROR?
Was the producer still usable after the abortable transition (and the handled abort)?
The existing epochBumpRequired flag is used to bump the epoch after an abort. It is usually used to reset the sequence numbers for the producer, and keeps the producer in a usable state.
In the case I'm trying to fix, we have to skip the abort and immediately go to the bump. This means that the producer will bump during a transaction, which is handled as a fence by the coordinator. Because of this, there is no way to safely get a new (bumped) epoch with this specific producer instance, and we need to handle this case as a fatal error.
After the InitProducerId request is successful, we transition into the old FATAL_ERROR state.
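As a rough illustration of the difference between the two paths (a minimal sketch with made-up state names, not the actual TransactionManager internals):

```java
// Simplified sketch only; these names approximate, but are not, the real state machine.
class EpochBumpPaths {
    enum State { READY, IN_TRANSACTION, ABORTING_TRANSACTION, INITIALIZING, FATAL_ERROR }

    // Existing epochBumpRequired path: EndTxn(ABORT) completes first, then the bump merely
    // resets sequence numbers, so the producer instance stays usable.
    State recoverableBumpAfterAbort() {
        return State.READY; // ABORTING_TRANSACTION -> InitProducerId(bump) -> READY
    }

    // Path added by this fix: the abort is skipped because batches may still be in flight.
    // A bump during an open transaction is treated as fencing by the coordinator, so this
    // producer instance can never obtain a safe new epoch.
    State fencingBumpOnClientTimeout() {
        return State.FATAL_ERROR; // IN_TRANSACTION -> InitProducerId(fencing bump) -> FATAL_ERROR
    }
}
```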
clients/src/main/java/org/apache/kafka/clients/producer/internals/TransactionManager.java
…ed when in FATAL_BUMPABLE_ERROR state, fixed transaction state machine transition, fixed error message
Force-pushed from 3edb282 to cc29862
…RROR state in Sender
@hachikuji would you please review this PR as well?
@showuon @artemlivshits Can you please take a look at this PR? This is the issue we had a thread about on the dev list.
So I talked this over with Daniel on a call, and I approve the changes in this form. The summary is that although there is another option to solve this problem without running into a fatal error state, it would require increasing the epoch in the producer (plus working around the case when it reaches Short.MAX_VALUE), which to me seems like something we shouldn't do (we should leave this functionality with the brokers). Therefore, overall, running into the fatal state seems like a safer and easier option from a clean-code perspective, although keeping the producer alive might be slightly better from the users' perspective (but they need to manage errors anyway, so it doesn't seem to be a huge problem).
I'll review it this week. Sorry for the delay.
```java
// If an epoch bump is possible, try to fence the current transaction by bumping
if (canBumpEpoch()) {
    log.info("Invoking InitProducerId with current producer ID and epoch {} in order to bump the epoch to fence the current transaction", producerIdAndEpoch);
    InitProducerIdRequestData requestData = new InitProducerIdRequestData()
            .setTransactionalId(transactionalId)
            .setTransactionTimeoutMs(transactionTimeoutMs)
            .setProducerId(producerIdAndEpoch.producerId)
            .setProducerEpoch(producerIdAndEpoch.epoch);
    InitProducerIdHandler handler = new InitProducerIdHandler(new InitProducerIdRequest.Builder(requestData),
            false, true);
    enqueueRequest(handler);
} else {
    log.info("Cannot bump epoch, transitioning into fatal error");
    transitionToFatalError(failure);
```
Let me make sure I understand your problem and solution. Are you saying the issue happens only when the "timed out" transactional ID is not re-used, and the abort marker arrives earlier than the transaction records? Is my understanding correct?
And what we are trying to do is force-bump the epoch when encountering a timeout exception, to let the fencing mechanism help us abort previous in-flight transactions. And next, we enter the fatal error state as before. Is that right?
If so, then I have a question: what if the initPid request fails (i.e. we fail to bump the epoch)? What will happen? Will the pending transactions still occur?
Thank you.
Out-of-order messages can occur even when the transactional.id is reused. The issue I encountered was caused by a valid producer aborting the transaction "too soon" - where "too soon" means that all of the last batches were timed out due to delivery.timeout.ms, but they were still in flight. So the issue occurs with a single producer, without any fencing or transactional.id reuse.
Yes, that summary is right. Bump to fence the in-flight requests, then discard the producer.
If the initPid fails, there can be 2 scenarios:
- The transaction times out due to transaction.timeout.ms - in this case, the coordinator bumps the epoch, practically achieving the same fencing I am trying to implement here.
- The transactional.id is reused by a new producer instance - in this case, the usual fencing happens.
So I believe that the essential change here is that the producer must not abort when encountering a client-side timeout.
As for the producer going into a fatal state - I was thinking about a possible workaround for that, and I think the producer can be kept in a usable state, but it involves the epoch being increased on the client side. If this fatal-state solution is not acceptable, I can work on another version of the change which involves this client-side bump. I was hesitant to do so because I wasn't sure if the protocol allows such things, but since the idempotent producer does the same, my guess is that it is safe.
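For what it's worth, that client-side bump would look roughly like the sketch below (illustrative only; setProducerIdAndEpoch as used here is an assumed helper, and this is the alternative being discussed, not what the PR does):

```java
import org.apache.kafka.common.utils.ProducerIdAndEpoch;

// Sketch of a local, client-side epoch bump, in the spirit of what the idempotent
// producer does internally when it bumps its own epoch.
ProducerIdAndEpoch current = producerIdAndEpoch;
if (current.epoch == Short.MAX_VALUE) {
    // A local bump cannot wrap around; this case would need a full InitProducerId round trip.
    throw new IllegalStateException("Producer epoch exhausted; cannot bump locally");
}
setProducerIdAndEpoch(new ProducerIdAndEpoch(current.producerId, (short) (current.epoch + 1)));
```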
hi @showuon, does this explanation and solution make sense? Or should I look into the other solution, in which the producer stays in a usable state?
@urbandan , thanks for the explanation. I think this is the best solution we can think of so far. But I'd like to hear @hachikuji @dajac @guozhangwang 's thoughts. Thanks.
```java
        .setProducerEpoch(producerIdAndEpoch.epoch);
InitProducerIdHandler handler = new InitProducerIdHandler(new InitProducerIdRequest.Builder(requestData),
        false, true);
enqueueRequest(handler);
```
We enqueue the request here - when will we actually send it out?
At this point we enter FATAL_BUMPABLE_ERROR, which still allows the Sender to send requests - see the changes in the Sender class and in TransactionManager#maybeTerminateRequestWithError.
If the producer is closed gracefully, we will try to send this last InitProducerId request. After the InitProducerId request succeeds, we transition into FATAL_ERROR and won't send anything else.
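A rough sketch of that gating as described here (names are approximations of the real TransactionManager code, not the exact implementation):

```java
// Hypothetical sketch of the request gating described above.
synchronized void maybeTerminateRequestWithError(TxnRequestHandler requestHandler) {
    if (hasFatalError()) {
        requestHandler.fatalError(lastError);
    } else if (hasFatalBumpableError() && !(requestHandler instanceof InitProducerIdHandler)) {
        // In FATAL_BUMPABLE_ERROR only the final fencing InitProducerId may still go out;
        // every other request fails as if the producer were already in FATAL_ERROR.
        requestHandler.fatalError(lastError);
    }
}
```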
```diff
-assertTrue(transactionManager.hasAbortableError());
-assertTrue(transactionManager.hasOngoingTransaction());
+assertTrue(transactionManager.hasFatalBumpableError());
+assertFalse(transactionManager.hasOngoingTransaction());
 assertFalse(transactionManager.isCompleting());
 assertTrue(transactionManager.transactionContainsPartition(tp0));

-TransactionalRequestResult abortResult = transactionManager.beginAbort();
-
-prepareEndTxnResponse(Errors.NONE, TransactionResult.ABORT, producerId, epoch);
-prepareInitPidResponse(Errors.NONE, false, producerId, (short) (epoch + 1));
-runUntil(abortResult::isCompleted);
-assertTrue(abortResult.isSuccessful());
-assertFalse(transactionManager.hasOngoingTransaction());
-assertFalse(transactionManager.transactionContainsPartition(tp0));
+assertThrows(KafkaException.class, () -> transactionManager.beginAbort());
```
So, it looks like after this patch, when a batch expiration or timeout error occurs, the producer will enter the fatal error state after bumping the epoch. But before this patch, we would abort and continue the transactional work. Is that right?
Sorry, I didn't realize this situation. This will impact current user behavior, so we need more discussion. I'll ping some experts in this PR, and hope they will help provide comments.
cc @artemlivshits @ijuma
Yes, that is correct. That abort is causing the issue. The producer just assumes that the batches failed, but it is possible that they are still in-flight. When that happens, the abort marker might get processed earlier than the batch. I've seen this in action, and it corrupts the affected partition permanently.
If it is better to keep the producer in a usable state, I can give it a shot. I had one experiment in which I tried keeping the producer usable by increasing the epoch on the client side once. I believe that it is safe to do, as the fencing bump will increase the epoch, and the coordinator will never return that epoch to any clients.
Please let me know what you think @ijuma @artemlivshits @showuon
I personally like the solution of having the producer enter the fatal error state. But I'd like to hear others' opinions, since it will affect the producer's behavior.
I'm wondering, is there a way we could mitigate this on the server side? Is it possible to prevent writing the late records after the abort marker? I might be missing something though, so let me know.
@jolshan would that mean that each record sent by the producer would have to include the id of a specific transaction (not just the transactional id of the producer?)
If the transactional producer sends a transaction id to the coordinator with each record rather than just the producer's id (in which case the coordinator determines whether there is a transaction going on by the order of Start Transaction, Send Record, End Transaction), then this could work. Otherwise, I don't think it's possible to mitigate on the server side.
@showuon I believe that I've seen this bug cause violation of EOS with a transactional producer in the case of a broker failure (90% sure). I'd much rather deal with the producer crashing than deal with incorrect behavior. However, if it's possible to fix this issue without causing a producer crash that would be really nice (:
I was thinking that for some longer-term work we could potentially distinguish transactions by having a bit of extra state server-side and by bumping the epoch after each transaction. But maybe this is too large of a change for now.
I think you also came to the conclusion of an epoch bump but through a different path.
@jolshan not sure about the impact it would have on the overhead of transactions, but having a unique ID per transaction doesn't really seem necessary to me
I'm not suggesting a unique transactional ID, but simply bumping the epoch would give us a unique identifier for the transaction in combination with the producer and/or transaction ID. Again -- this is something I'm considering as a longer term change, and there could be flaws.
@jolshan sorry for the confusion, I understood that the uniqueness would be achieved through the epoch bumps - I just don't really see the added value of it.
Yeah. No worries. I think what I was thinking of would require a bit more effort -- but the idea is that if the server knew the difference between individual transactions, then it could make better decisions about new writes and markers. (I.e., potentially we could avoid appending the records of an old transaction after a marker for that transaction is appended.) But I also think this idea needs a bit more thought and could require more work than what you are proposing here.
@artemlivshits @ijuma @hachikuji Can you please take a look at this PR? Trying to fix a bug in the transactional producer. Thanks in advance!
This looks good to me. We'd probably need some committers (I'm not one) to look at it since it involves a slight change to txn producer behavior (fatal error when there's a timeout with in-flight messages as opposed to just continuing on).
Thank you for the PR, Daniel!
@dajac @hachikuji Any chance you can take a look at this? This is a painful issue in transactional producers, with some serious consequences (partition corruption).
Can you elaborate a bit more on this idea? Is this the implementation in the PR now, or was it an idea to avoid the fatal error?
@jolshan It is an idea; the first version of the PR was trying to implement that, but the current state of the PR is based on the fatal state. The idea of keeping the producer in a reusable state is kind of tricky. The issue is that to fix the bug, we need to bump the epoch instead of aborting. In short, as I wrote in the other thread:
Thanks for all the discussion here, and sorry for the late arrival. I have seen this issue in practice as well, often in the context of hanging transactions caused by late-arriving writes.

I think the basic idea in the patch here is to bump the epoch when we abort a transaction, in order to fence off writes that are in flight. Do I have that right? This is in the spirit of an idea that's been on my mind for a while. The only difference is that I was considering a server-side implementation. The basic thought is to have the coordinator bump the epoch after every transaction and return the bumped epoch in the response:

```
EndTxnResponse => ThrottleTimeMs ErrorCode ProducerId ProducerEpoch
```

The tuple of (ProducerId, ProducerEpoch) would then uniquely identify each transaction.

There is still a hole, however, which I think @jolshan was describing above. We cannot assume clients will always add partitions correctly to the transaction before beginning to write to the partition. We need server-side validation; otherwise, hanging transactions will always be possible. We have seen this so many times by now.

My suggestion here is to let us get a KIP out in the next couple of weeks with a good server-side solution. We may still need a client-side approach for compatibility with older brokers though, so maybe we can leave the PR open.
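To make the server-side idea concrete, the client's end-transaction handling could adopt the bumped epoch from the response. This is purely a sketch of the proposal; the producerId/producerEpoch accessors on EndTxnResponseData did not exist in the protocol at the time of this discussion:

```java
// Sketch of the proposed flow only; these response fields are assumptions from the proposal.
void handleEndTxnResponse(EndTxnResponseData data) {
    if (data.errorCode() == Errors.NONE.code()) {
        // The coordinator bumped the epoch while ending the transaction, so the
        // (producerId, producerEpoch) pair uniquely identifies the finished transaction.
        producerIdAndEpoch = new ProducerIdAndEpoch(data.producerId(), data.producerEpoch());
    }
}
```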
@hachikuji Thanks for the feedback. Overall I agree that a server-side solution might be safer, and I'm interested in the KIP. |
I'm not sure if the fix addresses the following scenario:
@artemlivshits The scenario you mentioned is already covered, even without this change - when a transaction times out, the transaction coordinator bumps the epoch, so it already fences off the "stuck" produce request. |
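For reference, a sketch of the standard producer setting that drives that coordinator-side safety net (the value here is illustrative; this config is not introduced by this PR):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

Properties props = new Properties();
props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "demo-txn");
// If the transaction is not completed within this window, the coordinator aborts it and
// bumps the epoch, fencing any produce requests still in flight from the old epoch.
props.put(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG, 60_000);
```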
@urbandan By the way, KIP-890 is now available to review 😄 |
…orting when a delivery timeout is encountered
When a transactional batch encounters delivery or request timeout, it can still be in-flight. In this situation, if the transaction is aborted, the abort marker might get appended to the log earlier than the in-flight batch. This can cause the LSO of a partition to be blocked infinitely, or can violate the processing guarantees.
To avoid this situation, on a client side timeout, the transactional producer should skip aborting (EndTxnRequest), and bump the epoch instead. Since this is a fencing bump, the producer cannot safely continue, resulting in a fatal error.
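For applications, the practical consequence is that this timeout now surfaces as a fatal error. A hedged sketch of a recovery pattern under that behavior (createProducer() is a hypothetical application-side factory, not a Kafka API):

```java
try {
    producer.beginTransaction();
    producer.send(record);
    producer.commitTransaction();
} catch (ProducerFencedException e) {
    producer.close();            // fenced (possibly by its own fencing bump); not reusable
    producer = createProducer(); // fresh instance calls initTransactions() for a new epoch
} catch (KafkaException e) {
    // With this change, a client-side timeout surfaces as a fatal error instead of an
    // abortable one, so even abortTransaction() would throw; recreate the producer.
    producer.close();
    producer = createProducer();
}
```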