
KAFKA-20332: Fix to ensure app thread not collecting records for partitions being revoked #21897

Merged
lianetm merged 9 commits into apache:trunk from lianetm:lm-kafka-20332-fix
Apr 8, 2026

Conversation

@lianetm
Member

@lianetm lianetm commented Mar 30, 2026

This addresses race conditions where the app thread could collect/return records for revoked partitions.

The fix ensures that the app thread does not return buffered records until it has checked pending reconciliations. Once that check has happened, we know that partitions being revoked were marked as non-fetchable, so the app thread can safely move on to fetching/collecting. It also ensures that background reconciliations do not trigger revocations directly: the app thread could already hold records in memory, collected from the buffer, for those partitions, which would lead to the consumer returning records for revoked partitions if the background completed the revocation before the app thread returned.

With these fixes the app thread only collects/returns records after the partitions being revoked have been marked as non-fetchable.

This fix applies to the consumer only (the share consumer remains unchanged with this PR and can still trigger a full reconciliation & assignment update from the background).
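
As a rough illustration of the gate described above (a simplified, non-compilable sketch assembled from the snippets quoted later in this review; it is not the PR's exact code, and the fetchCollector/fetchBuffer names are assumptions):

// App-thread side: do not collect buffered records until the background has
// checked pending reconciliations and marked partitions being revoked as non-fetchable.
Fetch<K, V> collectFetch() {
    if (inflightPoll != null && !inflightPoll.isReconciliationCheckComplete()) {
        // Wait (bounded by the poll deadline) for the background to finish the check,
        // instead of collecting records that may belong to partitions being revoked.
        long timeoutMs = inflightPoll.deadlineMs() - time.milliseconds();
        if (timeoutMs > 0)
            ConsumerUtils.getResult(inflightPoll.reconciliationCheckFuture(), timeoutMs);
    }
    // Safe now: partitions pending revocation are no longer fetchable.
    return fetchCollector.collectFetch(fetchBuffer);
}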

Reviewers: Andrew Schofield <aschofield@confluent.io>, nileshkumar3 <nileshkumar3@gmail.com>, PoAn Yang <payang@apache.org>, Chia-Ping Tsai <chia7712@gmail.com>, Kirk True <ktrue@confluent.io>

@lianetm lianetm changed the title KAKFA-20332: Fix to ensure app thread not collecting records for partitions being revoked KAFKA-20332: Fix to ensure app thread not collecting records for partitions being revoked Mar 31, 2026
@lianetm lianetm requested a review from kirktrue March 31, 2026 01:34
Member

@AndrewJSchofield AndrewJSchofield left a comment


Thanks for the PR. I'm not sure whether this is tight enough. I'm not quite the expert in this area and it's certainly better, but I'm thinking about the following:

  • maybeReconcile has asynchronous actions, including committing if auto-commit is enabled.
  • This auto-commit can complete asynchronously relative to maybeReconcile.
  • The revocation doesn't happen until the commit has completed, but it is inevitable because it will be performed in the completion lambda for the commit.
  • The new flag in the AsyncPollEvent will be set as soon as maybeReconcile returns, but that's not quite the same as it having completed.

The key seems to be that the application thread has a stable view of the SubscriptionState which it can check in order to discard records from revoked partitions. Mightn't the SubscriptionState change a little later once the commit completes?

 * so the app thread can safely proceed to fetch/collect records.
 */
public boolean isReconciliationCheckComplete() {
    return pendingReconciliationChecked;
Member


To follow the convention you used with isValidatePositionsComplete, this ought to be isReconciliationCheckComplete.

@lianetm
Member Author

lianetm commented Mar 31, 2026

  • maybeReconcile has asynchronous actions, including committing if auto-commit is enabled.
  • This auto-commit can complete asynchronously relative to maybeReconcile.
  • The revocation doesn't happen until the commit has completed, but it is inevitable because it will be performed in the completion lambda for the commit.
  • The new flag in the AsyncPollEvent will be set as soon as maybeReconcile returns, but that's not quite the same as it having completed

This case is already handled, because we mark the partitions as non-fetchable right before issuing the commit (markPendingRevocationToPauseFetching, already in place before this PR), and only then allow the app thread to collect/fetch from the fetchable partitions (the part fixed in this PR).

You're totally right that if auto-commit is enabled, the revocation will happen when that commit completes, outside the app thread's scope, but that's fine: the partitions were already marked as pending revocation, which is what ensures the app thread does not fetch/collect for them while the async operations complete.
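
For illustration, the ordering on the background side is roughly the following (a simplified sketch; markPendingRevocationToPauseFetching is the existing SubscriptionState method mentioned above, while the other method names are placeholders, not the actual code):

void revoke(Set<TopicPartition> partitionsToRevoke) {
    // 1. Mark partitions as pending revocation BEFORE any async work starts, so the
    //    app thread stops fetching/collecting records for them from this point on.
    subscriptions.markPendingRevocationToPauseFetching(partitionsToRevoke);

    // 2. Auto-commit (if enabled) runs asynchronously.
    CompletableFuture<Void> commitFuture = maybeAutoCommitBeforeRevocation(partitionsToRevoke);

    // 3. The actual revocation (callback + assignment update) happens in the completion
    //    lambda, possibly much later, but step 1 already protects the app thread.
    commitFuture.whenComplete((ignored, error) -> completeRevocation(partitionsToRevoke));
}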

Makes sense?

Member

@AndrewJSchofield AndrewJSchofield left a comment


Thanks for the PR. Looks good to me.

Contributor

@kirktrue kirktrue left a comment


Thanks for the PR @lianetm.

I'm really only "concerned" about the CPU hit. I'm assuming it will be negligible, but I'm curious if there's been any testing. Correctness trumps performance, of course.

}

/**
*
Contributor


Super nitpicky: extra line.

Comment on lines +2012 to +2014
if (inflightPoll != null && !inflightPoll.isReconciliationCheckComplete()) {
    return Fetch.empty();
}
Contributor


My main concern is whether this affects CPU performance 🤔

When reconciliation hasn't occurred and this method returns an empty Fetch, the calling code will "fall through" to block on the FetchBuffer. Having the application thread block for the background thread to produce data really hurts the CPU load metrics.

But in the end, correctness wins. It would be ideal if we could see if this change increases CPU load, and if so, by how much.

Member Author


Good callout about the empty fetch leading to blocking on buffer data. It made me realize that even though we have to wait (for correctness), all we really wait for is the background doing the reconciliation check. Made the improvement, let me know your thoughts, thanks!

Contributor


It's not necessarily the length of time we wait, it's that we enter a waiting state at all. Even if the waiting time was literally 0 ns, the overhead of the thread scheduling is what matters. So blocking on the reconciliation check is still going to incur the thread scheduler's wrath 😄

That said, I'm definitely in the correctness before performance camp, so let's go ahead and merge this to resolve the core bug. It won't tank CPU load, but it might be a small step backward in those efforts.

Member Author


Yeah, agree. The good thing is that here we only wait if !inflightPoll.isReconciliationCheckComplete(), and I expect that check to be done already most of the time (the background only needs to check/set some vars to mark it done).
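
For illustration, the background marking could be as small as this (a sketch under the assumption that the event carries the reconciliationCheckFuture shown in the snippets below; the method name is a placeholder):

// Background side: after looking at pending reconciliations (and marking any partitions
// being revoked as non-fetchable), unblock the app thread right away.
void markReconciliationCheckComplete(AsyncPollEvent event) {
    event.reconciliationCheckFuture().complete(null); // app thread waiting in collectFetch resumes
}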

Comment on lines +211 to +214
public PollResult poll(final long currentTimeMs) {
    maybeReconcile(true);
    return PollResult.EMPTY;
}
Contributor


Is this directly related to this change? I assume that the changes made to maybeReconcile() trigger the need to add this.

Contributor


Also, do we need a new unit test to validate this behavior? Or is it covered by an existing test?

Member Author


This is to keep the existing behaviour in the share consumer, and it's already covered (the existing tests are actually what caught this gap for me :) many share consumer tests failed when I hadn't added this initially).

AtomicReference<AsyncPollEvent> capturedEvent = new AtomicReference<>();
doAnswer(invocation -> {
    AsyncPollEvent event = invocation.getArgument(0);
    capturedEvent.set(event);
Contributor


In some cases AsyncKafkaConsumer.poll() can technically submit more than one AsyncPollEvent. Perhaps I'm being too paranoid, but defensively, this line could ensure that doesn't happen via a CAS:

Suggested change
capturedEvent.set(event);
assertTrue(capturedEvent.compareAndSet(null, event));


@lianetm
Member Author

lianetm commented Apr 2, 2026

Thanks for the reviews! Waiting for the build...

@lianetm
Member Author

lianetm commented Apr 2, 2026

@kirktrue added a minor update to ensure that we don't propagate errors via the reconciliation check future (we want errors to keep flowing the same way they already do). I'll open a separate Jira, because this test failure made me notice that we may be able to improve things and avoid blocking on the buffer when there are metadata errors (unrelated to this PR; it's the behaviour in trunk atm). Thanks!

-- update (improvement to consider separately, unrelated to this PR, applies to trunk)
https://issues.apache.org/jira/browse/KAFKA-20397

// safely collect records from the buffer.
if (inflightPoll != null && !inflightPoll.isReconciliationCheckComplete()) {
    // If the background hasn't had the time to check for pending reconciliation,
    // we need to wait for that check before moving on (instead of returning empty righ away,
Contributor


nit: typo righ away

public void completeExceptionally(KafkaException e) {
    // Complete reconciliation future to unblock any waiters - the error will be surfaced
    // through the normal checkInflightPoll() mechanism via the error field
    reconciliationCheckFuture.complete(null);
Contributor


completeExceptionally completes reconciliationCheckFuture with a normal completion so collectFetch’s wait unblocks, but collectFetch doesn’t recheck inflightPoll.error() afterward. So we could still return buffered records from this poll() and only throw on the next poll() via checkInflightPoll. Is that ordering intentional for the public contract, or should we fail fast / skip collection once the wait returns and the inflight event already has an error?

Member Author

@lianetm lianetm Apr 5, 2026


We're intentionally not completing the reconciliation check exceptionally, because that would start propagating poll core errors via the reconciliation check. We only want to unblock the app thread from waiting on a reconciliation trigger, and let errors keep flowing to the app thread through the mechanism that was already in place (checkInflight errors). So this just keeps the behaviour we had before this PR, where, as you described, if say a metadata error is discovered after the checkInflight error check, the poll returns records and throws the error on the next poll.

I had filed https://issues.apache.org/jira/browse/KAFKA-20397 to review that, but separately, as it's not directly related to this PR. From the reconciliation point of view, it's also fine to return buffered records in this failure-on-poll-event case: if the poll event failed before the reconciliation check completed successfully, then the poll event wasn't even processed, so no reconciliation was triggered and there's no risk of an ongoing revocation (this is the failure-before-processing-events path I refer to):

if (event instanceof MetadataErrorNotifiableEvent) {
    if (maybeFailOnMetadataError(List.of(event)))
        continue;
}

Makes sense?

Contributor


Thanks, that clears it up. I’m aligned on not completing reconciliationCheckFuture exceptionally so poll errors stay on the existing checkInflight path, and on treating the post-wait / next-poll error ordering as pre-existing. Your point about failure before the poll event is processed (no reconciliation / revocation in play) is convincing for this PR. I’ve noted the metadata-error vs. buffer-blocking improvement for KAFKA-20397 and will try to narrow it down with a small test there.

@lianetm
Member Author

lianetm commented Apr 6, 2026

Thanks for the review @nileshkumar3 !

Member

@chia7712 chia7712 left a comment


@lianetm nice fix. one small question is left. PTAL

long timeoutMs = inflightPoll.deadlineMs() - time.milliseconds();
if (timeoutMs > 0) {
    try {
        ConsumerUtils.getResult(inflightPoll.reconciliationCheckFuture(), timeoutMs);
Member


Should we calculate the pollTimeout after collectFetch, since collectFetch can now be a blocking operation?
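
For illustration, that suggestion would look roughly like this (a sketch, not code from this PR; deadlineMs and the surrounding names are assumptions):

// Compute the remaining poll budget only after collectFetch() returns, since it can now
// block on the reconciliation check and that blocked time should not be counted twice.
Fetch<K, V> fetch = collectFetch();
long pollTimeoutMs = Math.max(0, deadlineMs - time.milliseconds());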

@lianetm lianetm merged commit b954b35 into apache:trunk Apr 8, 2026
22 checks passed
lianetm added a commit that referenced this pull request Apr 8, 2026
…itions being revoked (#21897)

lianetm added a commit that referenced this pull request Apr 8, 2026
…itions being revoked (#21897)

nileshkumar3 pushed a commit to nileshkumar3/kafka that referenced this pull request Apr 15, 2026
…itions being revoked (apache#21897)

