KAFKA-20237: Re-enqueue InitProducerId request after authentication failure #21715

Open
mvanhorn wants to merge 2 commits into apache:trunk from mvanhorn:osc/KAFKA-20237-fix-transactionmanager-initializing

mvanhorn commented Mar 12, 2026

JIRA: KAFKA-20237

When an idempotent (non-transactional) KafkaProducer encounters an SSL
handshake failure during the initial connection, the InitProducerId
request is dequeued from the pending requests queue but never sent. The
AuthenticationException is caught in Sender.runOnce(), which calls
transactionManager.authenticationFailed(). However, since the request
was already dequeued, authenticationFailed() iterates over an empty
queue and does nothing.

The TransactionManager remains stuck in INITIALIZING state:

  • bumpIdempotentEpochAndResetIdIfNeeded() skips re-enqueueing because
    currentState == INITIALIZING
  • nextRequest() returns null because the queue is empty
  • The producer becomes permanently unable to send messages, even after
    the SSL configuration is corrected
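
The stuck state described above can be reproduced in a small model. This is only a sketch: `StuckInitModel`, its `hasProducerId` field, and the string-valued queue are hypothetical stand-ins, not the real `TransactionManager`, and the guard is a simplification of the pre-fix condition.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical, simplified model of the pre-fix behavior.
// It is NOT the real org.apache.kafka TransactionManager.
class StuckInitModel {
    enum State { UNINITIALIZED, INITIALIZING, READY }

    State currentState = State.UNINITIALIZED;
    boolean hasProducerId = false;
    final Queue<String> pendingRequests = new ArrayDeque<>();

    // Pre-fix guard: enqueue only when the state is not already INITIALIZING.
    void bumpIdempotentEpochAndResetIdIfNeeded() {
        if (!hasProducerId && currentState != State.INITIALIZING) {
            currentState = State.INITIALIZING;
            pendingRequests.add("InitProducerId");
        }
    }

    // The sender takes the request off the queue before the connection attempt.
    String nextRequest() {
        return pendingRequests.poll();
    }

    // Fails whatever is still queued and returns how many requests were failed.
    // With an empty queue this is a no-op.
    int authenticationFailed() {
        int failed = pendingRequests.size();
        pendingRequests.clear();
        return failed;
    }
}
```

In this model, calling `bumpIdempotentEpochAndResetIdIfNeeded()`, then `nextRequest()`, then `authenticationFailed()` leaves the state at `INITIALIZING` with an empty queue, and any later call to `bumpIdempotentEpochAndResetIdIfNeeded()` changes nothing.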

Changes

Added a recovery path in bumpIdempotentEpochAndResetIdIfNeeded() that
detects when the state is INITIALIZING but the pending queue is empty
and no request is in-flight. In this case, the InitProducerId request
is re-enqueued, allowing the producer to recover without requiring a
restart.

Also extracted the InitProducerId request creation into a helper
method (enqueueInitProducerIdRequest()) to avoid duplication.
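
A minimal sketch of the recovery condition, under the same modeling assumptions: `RecoveringInitModel`, `hasInFlightRequest`, and `markSent()` are hypothetical names used for illustration, and the actual change in `TransactionManager` may be structured differently.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical, simplified model of the recovery path.
// It is NOT the real org.apache.kafka TransactionManager.
class RecoveringInitModel {
    enum State { UNINITIALIZED, INITIALIZING, READY }

    State currentState = State.UNINITIALIZED;
    boolean hasProducerId = false;
    boolean hasInFlightRequest = false;
    final Queue<String> pendingRequests = new ArrayDeque<>();

    // Single place that builds and enqueues the request, so both paths share it.
    private void enqueueInitProducerIdRequest() {
        pendingRequests.add("InitProducerId");
    }

    void bumpIdempotentEpochAndResetIdIfNeeded() {
        if (hasProducerId) {
            return;
        }
        if (currentState != State.INITIALIZING) {
            // Normal path: first initialization attempt.
            currentState = State.INITIALIZING;
            enqueueInitProducerIdRequest();
        } else if (pendingRequests.isEmpty() && !hasInFlightRequest) {
            // Recovery path: INITIALIZING, but the request was lost
            // (dequeued and never sent), so enqueue it again.
            enqueueInitProducerIdRequest();
        }
    }

    String nextRequest() {
        return pendingRequests.poll();
    }

    // Called only when the request actually goes out on the wire.
    void markSent() {
        hasInFlightRequest = true;
    }
}
```

Checking both the pending queue and the in-flight flag keeps the recovery path from enqueuing a duplicate while an earlier request is still queued or awaiting a response.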

Testing

Added testIdempotentProducerRecoversFromLostInitProducerIdRequest()
that simulates:

  1. Transitioning to INITIALIZING and enqueuing the request
  2. Dequeuing the request (as
    Sender.maybeSendAndPollTransactionalRequest() would)
  3. Triggering authenticationFailed() (which does nothing since the
    queue is empty)
  4. Verifying that the next call to
    bumpIdempotentEpochAndResetIdIfNeeded() re-enqueues the request

Impact

This enables self-recovery for idempotent producers in cloud-native
environments where certificate rotation or temporary auth server
unavailability can cause transient SSL failures. Previously, the only
workaround was to close and recreate the KafkaProducer.

This contribution was developed with AI assistance (Claude Code).

…ailure

When an idempotent (non-transactional) KafkaProducer encounters an SSL
handshake failure during the initial connection, the InitProducerId
request is dequeued from the pending requests queue but never sent. The
AuthenticationException is caught in Sender.runOnce(), which calls
transactionManager.authenticationFailed(). However, since the request
was already dequeued, authenticationFailed() iterates over an empty
queue and does nothing. The TransactionManager remains stuck in
INITIALIZING state with no pending request to complete initialization.

On subsequent Sender iterations, bumpIdempotentEpochAndResetIdIfNeeded()
skips re-enqueueing because currentState == INITIALIZING, and
nextRequest() returns null because the queue is empty. The producer
becomes permanently unable to send messages.

Fix: Add a recovery path in bumpIdempotentEpochAndResetIdIfNeeded()
that detects when the state is INITIALIZING but the pending queue is
empty and no request is in-flight. In this case, re-enqueue the
InitProducerId request to allow recovery after the connection is
restored.

Also extract the InitProducerId request creation into a helper method
to avoid duplication.

github-actions bot added the triage, producer, clients, and small labels Mar 12, 2026

…roducer

isInitializing() returns isTransactional() && currentState == INITIALIZING,
which is always false for idempotent (non-transactional) producers. Replace
state-based assertions with behavioral assertions that verify the request
enqueue/dequeue lifecycle directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
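
To illustrate why a state-based assertion on isInitializing() cannot pass for an idempotent producer, here is a small sketch. `InitializingCheck` is a hypothetical stand-in, and modeling "transactional" as a non-null `transactionalId` is an assumption of this example.

```java
// Hypothetical model of the predicate quoted in the commit message.
class InitializingCheck {
    enum State { UNINITIALIZED, INITIALIZING, READY }

    // Assumption: null means idempotent-only (non-transactional).
    final String transactionalId;
    State currentState = State.INITIALIZING;

    InitializingCheck(String transactionalId) {
        this.transactionalId = transactionalId;
    }

    boolean isTransactional() {
        return transactionalId != null;
    }

    // Mirrors: isTransactional() && currentState == INITIALIZING
    boolean isInitializing() {
        return isTransactional() && currentState == State.INITIALIZING;
    }
}
```

With `currentState` at `INITIALIZING`, `new InitializingCheck(null).isInitializing()` is false because the first operand short-circuits, which is why the test asserts on the enqueue/dequeue behavior instead.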

github-actions bot commented

A label of 'needs-attention' was automatically added to this PR in order to raise the
attention of the committers. Once this issue has been triaged, the triage label
should be removed to prevent this automation from happening again.
