
[KAFKA-9965] Fix accumulator tryAppend, so that fresh new producerBatch is created #12462

Open
wants to merge 2 commits into trunk
Conversation


@sudeshwasnik sudeshwasnik commented Aug 1, 2022

Issue - partitioner.partition is called once, and for that partition the accumulator tries to find an existing batch to append the record to. If there is none, it aborts the append for this partition and sets the abortForNewBatch flag.
With abortForNewBatch set, partitioner.partition is called again! RoundRobinPartitioner therefore increments its counter twice for the same record (when both the key and record.partition are null). With an even number of partitions, only half of them ever get filled.

Since only every second partition ever gets records batched, the accumulator never finds an existing batch for the other partitions, so they get skipped every time.
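
To make the double-call effect concrete, here is a minimal, self-contained sketch (not the real RoundRobinPartitioner, just a counter modulo the partition count): when the partitioner is invoked twice per record, only the odd-numbered partitions ever receive records.

    import java.util.concurrent.atomic.AtomicInteger;

    public class RoundRobinDoubleCallDemo {
        private static final AtomicInteger COUNTER = new AtomicInteger(0);
        private static final int NUM_PARTITIONS = 4;

        // Stand-in for partitioner.partition(): plain round robin over a counter.
        static int partition() {
            return COUNTER.getAndIncrement() % NUM_PARTITIONS;
        }

        public static void main(String[] args) {
            for (int r = 0; r < 6; r++) {
                int first = partition();   // initial call when the record is sent
                int second = partition();  // second call triggered by abortForNewBatch
                // The record actually lands on `second`; `first` is skipped.
                System.out.println("r" + r + " -> p" + second + " (p" + first + " skipped)");
            }
            // Prints only p1 and p3; p0 and p2 never receive a record.
        }
    }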

Solution
Here is the approach I had in mind for this PR:

Partitions -> p0,p1,p2,p3
records -> r0,r1,r2 ... ,r11

Happy case 
first iteration -> 
p0 {}, p1 {r0}, p2 {}, p3 {r1}
second iteration -> 
p0 {r2}, p1 {r0, r3}, p2 {r4}, p3 {r1, r5}
third iteration -> 
p0 {r2, r6}, p1 {r0, r3, r7}, p2 {r4, r8}, p3 {r1, r5, r9}
fourth iteration -> 
p0 {r2, r6, r10}, p1 {r0, r3, r7, r11}, p2 {r4, r8}, p3 {r1, r5, r9}

What happens with drain(s)?

first iteration -> 
p0 {}, p1 {r0}, p2 {}, p3 {r1}
second iteration -> 
p0 {r2}, p1 {r0, r3}, p2 {r4}, p3 {r1, r5}

--- drain on partition p0
p0 {}, p1 {r0, r3}, p2 {r4}, p3 {r1, r5}

third iteration -> 
p0 {}, p1 {r0, r3, r6}, p2 {r4, r7}, p3 {r1, r5, r8}
fourth iteration -> 
p0 {r9}, p1 {r0, r3, r6, r10}, p2 {r4, r7, r11}, p3 {r1, r5, r8}

[Slow producer] What happens if a drain happens AFTER the partition has been checked for its DQ's presence?
first iteration -> 
p0 {}, p1 {r0}, p2 {}, p3 {r1}
second iteration -> 
p0 {r2}, p1 {r0, r3}, p2 {r4}, p3 {r1, r5}

--- drain on partition p0 -> but after p0 was already checked to have a DQ, so unfortunately it is selected again
p0 {}, p1 {r0, r3}, p2 {r4}, p3 {r1, r5}

third iteration ->  (p0 is filled up, even if it was just drained, which is fine imo)
p0 {r6}, p1 {r0, r3, r7}, p2 {r4, r8}, p3 {r1, r5, r9}
fourth iteration -> 
p0 {r6, r10}, p1 {r0, r3, r7, r11}, p2 {r4, r8}, p3 {r1, r5, r9}

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@sudeshwasnik sudeshwasnik changed the title [KAFKA-9965] [WIP] Fix accumulator tryAppend, so that fresh new producerBat… [KAFKA-9965] [WIP] Fix accumulator tryAppend, so that fresh new producerBatch is created Aug 1, 2022
@sudeshwasnik sudeshwasnik marked this pull request as ready for review August 1, 2022 12:50
@sudeshwasnik sudeshwasnik changed the title [KAFKA-9965] [WIP] Fix accumulator tryAppend, so that fresh new producerBatch is created [KAFKA-9965] Fix accumulator tryAppend, so that fresh new producerBatch is created Aug 1, 2022
@sudeshwasnik
Author

@cmccabe @jonmcewen I see two previous PRs (#11326 and #8690) that fix the RoundRobinPartitioner. I think making the change in RecordAccumulator will naturally solve the issue for all such affected partitioners. wdyt?

Also tagging @ashwinpankaj @ijuma for reviews

@ijuma
Contributor

ijuma commented Aug 1, 2022

cc @artemlivshits @junrao since you both worked on recent partitioner changes.

Contributor

@ashwinpankaj ashwinpankaj left a comment


Thanks for picking this up @sudeshwasnik . LGTM.

            } else {
                int appendedBytes = last.estimatedSizeInBytes() - initialBytes;
                return new RecordAppendResult(future, deque.size() > 1 || last.isFull(), false, false, appendedBytes);
            }
        }
-       return null;
+       return new RecordAppendResult(null, false, true, false, 0);
Contributor


nit: add a comment that new batch needs to be created

// Return a result that will cause another call to append.
// either 1. current topicPartition producerBatch is full - return and prepare for another batch/partition.
// 2. no producerBatch existed for this topicPartition, create a new producerBatch.
if (appendResult == null && abortOnNewBatch) {
Contributor


It looks like there are still cases where the onNewBatch logic is invoked and the partition could be switched a second time, and it seems to change the behavior of DefaultPartitioner: if the batch is sent (removed from the queue) before a new record is produced, onNewBatch isn't going to get called and the producer will get stuck on the current partition.

Author


QQ - if there are no batches in the current partition, are there cases where we would still want to switch partitions? Technically the partition is completely empty at this point, and we can create and store new batches there. I'll revisit the code and get back ASAP if there is a case where we wouldn't want to fill empty partitions and would instead want to switch.
Apart from that, here is some context on the intended change, from my limited knowledge of the partitioning logic:

@artemlivshits IIUC onNewBatch should only be called when the existing batch is full.

The difference this PR is trying to make with this condition is:

  • if appendResult is null, that means the existing batch for the current partition is full -> in this case, return and call onNewBatch.
  • if appendResult isn't null at this point, that also implies appendResult.newBatchCreated is true. The producerBatch for the current partition isn't complete, so onNewBatch should not be called. Instead we should stick to this partition, create a producerBatch and fill it; we shouldn't switch partitions here.

wdyt ?

Contributor


if there are no batches in current partition, are there cases we would still want to switch partitions?

It could happen that the batch got sent out by the time the next record arrives. When linger.ms=0 (the default), the sender would try to send the batch out as soon as possible. So the sequence of events could look like this:

  1. batch queue is empty.
  2. append record: create a new batch, but don't switch partitions because queue is empty
  3. sender drains the queue, so now queue is empty
  4. append record: create a new batch, but don't switch partitions because queue is empty
    ...

Author


Makes sense! With the current approach we could get stuck on the same partition, leading to starvation.
There are now two constraints: 1. prevent even-numbered partitions from getting skipped consistently when there are no producerBatches for them; 2. do not always select the same partition when its producerBatch queue is empty, i.e. detect whether a recent drain was done for this partition and skip it in that case.

Author


With these constraints in mind, I have made some modifications to the original PR; I will summarize the new approach.

Author


@artemlivshits the approach is now a little brittle tbh. We could have had a timeout after a drain of a partition, within which the partition should get skipped, but that approach seemed to make the solution complex. Wdyt?
Is there any other approach you think we should consider instead?


@sudeshwasnik
I am following up on this fix so that Kafka does proper round robin. I had a thought: why can't we put the partition id in the record itself, so that even if the partitioner would otherwise be called twice, the partition is already present on the record and the second partitioner call doesn't happen?

https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/producer/KafkaProducer.java#L1387

    if (record.partition() != null)
        return record.partition();

record.partition() always returns null here.
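
For context, record.partition() is non-null only when the application pins the partition explicitly on the ProducerRecord, in which case the partitioner is never consulted at all. A minimal sketch (topic name and broker address are placeholders):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ExplicitPartitionExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Explicit partition: record.partition() returns 2 and the partitioner is skipped.
                producer.send(new ProducerRecord<>("my-topic", 2, "key", "value"));
            }
        }
    }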

…main null when either partition is encountered first time, batches are cleaned up after expiration OR when sender clears deque
@sudeshwasnik
Author

sudeshwasnik commented Sep 5, 2022

hey @artemlivshits, thanks for pointing out that if we just check whether a partition's producerBatches are empty, we may get stuck selecting the same partition when its producerBatches keep getting cleared up by the sender during drain.

Here is a new approach -

  1. Check whether a DQ was created in topicInfo.batches for this partition.

  2. There are two cases in which topicInfo.batches has no entry for the partition:

    • the partition is encountered for the first time;
    • the partition was recently cleared, either during a drain or during batches-expiry cleanup.

    In the first case it is fine to skip the partition (while creating an ArrayDeque for it in topicInfo.batches), as this is the current
    behaviour. The next time this partition is checked we find its DQ, the append is not aborted, and the partition will not get skipped from here on.
    In the second case the partition will be skipped, since its entry was removed from topicInfo.batches (but again an ArrayDeque is created for it). After this, the partition won't get skipped and records will be added to it.

  3. Always remove the DQ when its entries are cleared up during drain or expiry (see the sketch after this list).
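
As a rough sketch of how these three steps fit together (class and method names here are hypothetical, and topicInfo.batches is modelled as a plain ConcurrentMap from partition to deque; this is not the exact diff):

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    final class DequePresenceSketch {
        // Simplified stand-in for topicInfo.batches: partition -> deque of batches.
        private final ConcurrentMap<Integer, Deque<Object>> batches = new ConcurrentHashMap<>();

        // Steps 1 and 2: abort the append (so the partitioner can pick another partition)
        // only when no deque exists yet, and create the deque eagerly so the same
        // partition is skipped at most once in a row.
        boolean shouldAbortForNewBatch(int partition, boolean abortOnNewBatch) {
            boolean noDqForPartition = !batches.containsKey(partition);
            batches.computeIfAbsent(partition, p -> new ArrayDeque<>());
            return abortOnNewBatch && noDqForPartition;
        }

        // Step 3: when a drain or expiry leaves the partition with no batches,
        // remove the deque entry so the partition is skipped on the next append.
        void onDrainedOrExpired(int partition) {
            batches.remove(partition);
        }
    }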

example -

  1. When the producer is new: the first partition is skipped, but append is called again (with abortOnNewBatch false), so onNewBatch is called, a new partition is assigned and the next partition won't get skipped.
  2. When a drain happens for a partition: the immediately following append gets aborted, append is called again with abortOnNewBatch false, onNewBatch is called and a new partition is assigned.

Contributor

@junrao junrao left a comment


@sudeshwasnik : Thanks for finding the issue and providing a PR. Left a comment below.

// 2. The DQ was removed because all batches were cleared after expiration, or the sender cleared batches after draining.
// If so, abort and call the partitioner's onNewBatch to select another partition.
// This prevents a single partition from getting re-selected after a recent drain.
if (abortOnNewBatch && noDqForPartition) {
Contributor


I am not sure if this completely addresses the issue. For example, if the producer adds records relatively slowly, it's possible that by the time the RoundRobinPartitioner adds the record to the same partition again, the batches for that partition have already been drained.

Here is another possibility. We could change the computation of the following code in KafkaProducer: if a partitioner is defined but doesn't override the default implementation of onNewBatch (which is the case for RoundRobinPartitioner), we could also set abortOnNewBatch to false. We could use reflection to get the method and do the isDefault test.

            // A custom partitioner may take advantage on the onNewBatch callback.
            boolean abortOnNewBatch = partitioner != null;
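
As a rough illustration of that reflection check (the helper name overridesOnNewBatch and its wiring are hypothetical; the onNewBatch(String, Cluster, int) signature is the one declared on the Partitioner interface):

    import java.lang.reflect.Method;

    import org.apache.kafka.clients.producer.Partitioner;
    import org.apache.kafka.common.Cluster;

    final class PartitionerIntrospection {
        // True only if the partitioner class actually overrides onNewBatch;
        // if the method is still declared on the Partitioner interface itself,
        // the partitioner only has the default no-op implementation.
        static boolean overridesOnNewBatch(Partitioner partitioner) {
            try {
                Method m = partitioner.getClass()
                    .getMethod("onNewBatch", String.class, Cluster.class, int.class);
                return m.getDeclaringClass() != Partitioner.class;
            } catch (NoSuchMethodException e) {
                return false;
            }
        }
    }

The computation above would then become something like: boolean abortOnNewBatch = partitioner != null && PartitionerIntrospection.overridesOnNewBatch(partitioner);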

What do you think @jolshan ?

Contributor


if a partitioner is defined, but doesn't override the default implementation of onNewBatch

I think this is the only correct way to fix this without breaking partitioners that depend on onNewBatch being called to switch partitions. Given that onNewBatch is deprecated now, we should avoid the abortOnNewBatch logic unless onNewBatch is actually overridden.

Contributor


Is this an issue for all partitioners or just RoundRobinPartitioner?

Contributor


This is potentially an issue for any partitioner not overriding the default empty implementation for onNewBatch since the partitioner is unnecessarily called a second time for the same record on appending to a new batch.

Contributor


Yeah -- it seems like abortOnNewBatch being hardcoded has always been an issue. Any partitioner that doesn't implement it (and essentially batch on partitions like the old sticky partitioner) doesn't really have a use for it as far as I can see.

Contributor


The old sticky partitioner changed the behavior such that partition is called one more time when a batch is created. That is not a problem for the old sticky partitioner itself, because its partition method doesn't change state (the state is changed by onNewBatch), but it changes the behavior of pretty much all other partitioners that do change state in partition. The RoundRobinPartitioner probably has the most visible effects, but all partitioners that change state in partition are affected.

It's kind of surprising that this change of behavior wasn't noticed for a while (which probably indicates how often custom partitioners are used).

Eliminating the second call of partition would change the partition switching pattern for the old sticky partitioner, and we didn't want to do that in case people want to switch back to the default logic. So doing the reflection lookup is probably the best way to separate when we need the onNewBatch & second partition call, vs. doing one partition call per tryAppend.

Contributor

@jolshan jolshan Jan 6, 2023


It actually was noticed back in 2020. :( I just missed the comments. 717c55b#r39200381
We also saw a ticket filed for this 2 days ago. https://issues.apache.org/jira/browse/KAFKA-14573

Contributor


Using reflection is a creative option, but it does mean that if someone overrides the method to, e.g., add a log line, then suddenly things will break, which would be quite surprising (it goes without saying).

Contributor


Compatibility is tricky here. I think back when we implemented this I remember there were some concerns about the onNewBatch method since we wanted to avoid too many changes to the interface and allow for old partitioners to work as intended. I suppose if there was a way to avoid "abortOnNewBatch" and the subsequent call to partition, that would be ideal, but not sure it is something we can add now.

Contributor


onNewBatch is now deprecated, so hopefully, people would take a hint not to override it :-)

But, yeah, agree with calling it creative (I wouldn't have come up with that given my C++ background -- these ideas are blocked before they appear in my mind).

@ijuma
Contributor

ijuma commented May 16, 2023

Where are we with this?

@artemlivshits
Contributor

I think it's still a choice between breaking the old default sticky partitioner logic vs. not fixing the even older round-robin partitioner logic (which has been broken for a few years) vs. using reflection to sort of have the cake and eat it too. One thing that might have changed with time is that if people have gained confidence in the new built-in partitioner logic, maybe it's OK to break the old default sticky partitioner, but it's hard to tell.

@blue-int

blue-int commented Nov 2, 2023

I want RoundRobinPartitioner to work again. When might this PR be merged and released?

@sudeshwasnik
Author

@artemlivshits unfortunately, I won't be able to contribute to this PR anymore. Thank you everyone for the reviews, I apologize for the discontinuation.
Do let me know if we should close this PR.
