Kafka Connect: Prevent zombie coordinator during rebalance by kumarpritam863 · Pull Request #16156 · apache/iceberg

kumarpritam863 · 2026-04-29T17:23:31Z

Summary

Make stopCoordinator() synchronous by joining the CoordinatorThread before clearing the reference. This ensures the old coordinator is fully stopped (Kafka clients closed, executor drained) before a new coordinator can start on the same or a different task.

Problem

During a Kafka Connect rebalance, CommitterImpl.close() calls stopCoordinator(), which sets the termination flag and shuts down the executor but returns before the CoordinatorThread has exited its run() loop and closed its Kafka producer, consumer, and admin clients. If open() is called on the new leader task before the old thread finishes, two coordinators can briefly run simultaneously:

The old coordinator's executor thread may still be inside commitToTable() executing an Iceberg commit (e.g. an HTTP call to Glue/HMS) that doesn't respond to interrupts.
The old coordinator's Kafka consumer remains in the -coord consumer group, splitting control topic partitions with the new coordinator's consumer during the overlap window.
Both coordinators can send events to the control topic and attempt Iceberg commits concurrently.

While Iceberg's optimistic concurrency (CAS) and the SnapshotAncestryValidator prevent corrupt commits, the race window can cause spurious CommitFailedException errors, partial control topic partition visibility, and unnecessary commit retries.
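The overlap window described above can be reproduced in isolation. The following is a minimal sketch, not the connector's code: ZombieWindowSketch, OldCoordinator, and the inFlightCommit latch are hypothetical names, and the latch simply stands in for a commit call that has not yet returned. It shows that a stop method which only sets a flag returns while the old thread is still alive.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-ins for illustration; not the actual Iceberg classes.
class ZombieWindowSketch {

  static class OldCoordinator extends Thread {
    private volatile boolean terminated = false;
    // Stands in for an in-flight commit that has not returned yet.
    final CountDownLatch inFlightCommit = new CountDownLatch(1);

    @Override
    public void run() {
      try {
        while (!terminated) {
          Thread.sleep(1); // stands in for one pass of the run() loop
        }
        inFlightCommit.await(); // thread cannot exit until the "commit" finishes
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }

    // Non-blocking stop: only sets the flag, does not wait for the thread.
    void terminate() {
      terminated = true;
    }
  }

  // Returns true if the old coordinator is still alive after the
  // non-blocking stop has returned, i.e. a new one could start alongside it.
  static boolean overlapsAfterNonBlockingStop() {
    OldCoordinator old = new OldCoordinator();
    old.start();
    old.terminate();
    boolean stillAlive = old.isAlive(); // old thread is parked on the latch

    // Let the simulated commit finish and clean up the thread.
    old.inFlightCommit.countDown();
    try {
      old.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return stillAlive;
  }

  public static void main(String[] args) {
    System.out.println("old coordinator alive after stop: "
        + overlapsAfterNonBlockingStop());
  }
}
```

In this sketch the old thread cannot exit until the latch is released, so the check after terminate() always observes it alive. In the connector the window depends on timing instead.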

Fix

Add coordinatorThread.join() after coordinatorThread.terminate() in stopCoordinator().
Since terminate() already calls coordinator.terminate(), which blocks for up to 1 minute while the executor drains, the additional join() only waits for the thread's final cleanup (coordinator.stop(), which closes the Kafka clients) and adds negligible overhead.
This guarantees the old coordinator is fully dead before stopCoordinator() returns, closing the race window entirely.
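The shape of the fix can be sketched as follows. This is an illustration under assumptions, not the PR's diff: StopCoordinatorSketch and the body of CoordinatorThread here are hypothetical, and the sleeps stand in for coordinator work and client shutdown. Only the terminate-then-join ordering is taken from the description above.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-ins for illustration; not the actual Iceberg classes.
class StopCoordinatorSketch {

  static class CoordinatorThread extends Thread {
    private final AtomicBoolean terminated = new AtomicBoolean(false);
    private final AtomicBoolean clientsClosed = new AtomicBoolean(false);

    @Override
    public void run() {
      try {
        while (!terminated.get()) {
          Thread.sleep(5); // stands in for one pass of coordinator work
        }
        Thread.sleep(50); // stands in for slow final cleanup
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      } finally {
        clientsClosed.set(true); // stands in for closing the Kafka clients
      }
    }

    void terminate() {
      terminated.set(true);
    }

    boolean areClientsClosed() {
      return clientsClosed.get();
    }
  }

  // Synchronous stop: signal termination, then wait for the thread to exit.
  static void stopCoordinator(CoordinatorThread thread) {
    thread.terminate();
    try {
      thread.join(); // returns only after run() has completed its cleanup
    } catch (InterruptedException e) {
      // Preserve the interrupt status for the caller.
      Thread.currentThread().interrupt();
    }
  }

  public static void main(String[] args) {
    CoordinatorThread old = new CoordinatorThread();
    old.start();
    stopCoordinator(old);
    System.out.println("alive=" + old.isAlive()
        + " clientsClosed=" + old.areClientsClosed());
  }
}
```

Once stopCoordinator() returns normally, the thread has exited and its cleanup has run, so a new coordinator started afterwards does not overlap with it. An unbounded join() can block indefinitely if the thread never exits. Thread.join(long) with a timeout is one alternative, at the cost of reopening a small window when the timeout expires.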

This reverts commit 67619ec.

This reverts commit c0a2665.

…before a new coordinator is elected

kumarpritam863 · 2026-04-29T17:24:47Z

@danielcweeks can you please review.

kumarpritam863 · 2026-05-01T04:17:30Z

@bryanck can you please review.

kumarpritam863 · 2026-05-16T14:54:15Z

@danielcweeks can you please take a look.

Pritam Kumar Mishra and others added 25 commits August 9, 2025 11:27

added metadat and data path in case of dynamic routing

c0a2665

spotless

67619ec

Revert "spotless"

6b15ae4

This reverts commit 67619ec.

Revert "added metadat and data path in case of dynamic routing"

8398e4c

This reverts commit c0a2665.

Merge branch 'apache:main' into main

fbf52a9

Merge branch 'apache:main' into main

c92ec66

Merge branch 'apache:main' into main

9392a6d

Merge branch 'apache:main' into main

ecd8b55

Merge branch 'apache:main' into main

5e76e04

Merge branch 'apache:main' into main

a1ec7e6

Merge branch 'apache:main' into main

4eaf70b

Merge branch 'apache:main' into main

1508513

Merge branch 'apache:main' into main

e5908c8

Merge branch 'apache:main' into main

cbefe9a

Merge branch 'apache:main' into main

57d4667

Merge branch 'apache:main' into main

ee658ea

Merge branch 'apache:main' into main

7c5976d

Merge branch 'apache:main' into main

8a9654f

Merge branch 'apache:main' into main

888e659

Merge branch 'apache:main' into main

1daa4dd

Merge branch 'apache:main' into main

8b7ec63

Merge branch 'apache:main' into main

92c7e89

Merge branch 'apache:main' into main

fe9e7f0

Merge branch 'apache:main' into main

aaa8f89

joining coordinator thread to make sure coordinator thread completes …

2c7a71e

…before a new coordinator is elected

github-actions Bot added the KAFKACONNECT label Apr 29, 2026

fixed spotless failures

60a4607

kumarpritam863 wants to merge 26 commits into apache:main from kumarpritam863:fix/kafka-connect-zombie-coordinator
