Kafka Connect: Fix flaky integration tests by isolating iceberg.control.topic per test #16438

Open: wombatu-kun wants to merge 3 commits into apache:main from wombatu-kun:fix/flaky-kafka-connect-integration-test

wombatu-kun (Contributor) commented May 20, 2026

Summary

TestIntegrationDynamicTable.testIcebergSink has been flaky on kafka-connect-tests — 5 of the last 10 Kafka Connect CI runs on main itself failed on the same assertion (assertThat(table.snapshots()).hasSize(1) inside an Awaitility.untilAsserted(...) block), almost always against the partitioned table test.tbl1 and concentrated on the [2] test_branch parameterization. Bumping the Awaitility budget from 30s to 60s and then to 120s did not help — ruling out a "slow first commit" cause and pointing at cross-test state pollution.

Root cause

The Iceberg sink's control topic was shared. iceberg.control.topic defaulted to control-iceberg and was reused across every connector lifecycle in every integration test method. With iceberg.kafka.auto.offset.reset=earliest, every new Coordinator joined a fresh consumer group on that topic and replayed the entire control-topic history from prior tests. Historical DATA_COMPLETE events fed into Coordinator.receive (see Coordinator.java:140-145) can satisfy isCommitReady(totalPartitionCount) and trigger commit cycles before the current test's events are processed. On a partitioned table, such a premature commit can produce a snapshot whose offsets then cause the legitimate commit to fail validation in Coordinator.offsetValidator (Coordinator.java:280). That explains both the concentration on the partitioned tbl1 (where the offset validator is strict) and the immunity to larger Awaitility budgets.
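The replay effect can be illustrated with a toy model. This is only a sketch under stated assumptions: `ControlTopicReplaySketch`, `Event`, and `staleDataComplete` are hypothetical names that do not exist in the connector, and an in-memory list stands in for the Kafka topic log. It counts how many DATA_COMPLETE records from earlier tests a consumer would see when it starts reading from the earliest offset.

```java
import java.util.ArrayList;
import java.util.List;

final class ControlTopicReplaySketch {
  // Hypothetical stand-in for a control-topic record; the connector's real event classes differ.
  static final class Event {
    final String producedByTest;
    final String type;

    Event(String producedByTest, String type) {
      this.producedByTest = producedByTest;
      this.type = type;
    }
  }

  // A consumer in a brand-new group with auto.offset.reset=earliest starts at the beginning
  // of the log, so it sees every retained record, including those written by earlier tests.
  // This counts the DATA_COMPLETE records that did not come from the current test.
  static long staleDataComplete(List<Event> topicLog, String currentTest) {
    return topicLog.stream()
        .filter(e -> e.type.equals("DATA_COMPLETE"))
        .filter(e -> !e.producedByTest.equals(currentTest))
        .count();
  }

  public static void main(String[] args) {
    // Shared topic: test-1 left two records behind before test-2 published its own.
    List<Event> sharedTopic = new ArrayList<>();
    sharedTopic.add(new Event("test-1", "DATA_COMPLETE"));
    sharedTopic.add(new Event("test-1", "DATA_COMPLETE"));
    sharedTopic.add(new Event("test-2", "DATA_COMPLETE"));

    // Isolated topic: only test-2 has ever written to it.
    List<Event> isolatedTopic = new ArrayList<>();
    isolatedTopic.add(new Event("test-2", "DATA_COMPLETE"));

    System.out.println(staleDataComplete(sharedTopic, "test-2"));
    System.out.println(staleDataComplete(isolatedTopic, "test-2"));
  }
}
```

In this model the shared topic exposes stale records to the second test while the isolated topic exposes none, which is the property the per-test control topic is meant to guarantee.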

Fix

Generate a unique control topic name per test in IntegrationTestBase#baseBefore (control-iceberg-<uuid>), pass it through createCommonConfig as iceberg.control.topic, and best-effort-delete it in baseAfter alongside the test topic so cleanup is symmetric. The control topic auto-creates on first publish so no explicit pre-creation is needed. All three integration test classes (TestIntegration, TestIntegrationMultiTable, TestIntegrationDynamicTable) route through createCommonConfig, so the override propagates uniformly. No production source changes — only test scaffolding.
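A minimal sketch of the naming scheme described above, not the actual diff: the class and method names (`ControlTopicIsolationSketch`, `newControlTopicName`, `commonConfig`) are hypothetical, and the real `createCommonConfig` sets many more properties. Only the `control-iceberg-<uuid>` pattern and the `iceberg.control.topic` key come from the description.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

final class ControlTopicIsolationSketch {
  // Builds a fresh topic name per call, following the control-iceberg-<uuid> pattern.
  // A UUID string contains only hex digits and hyphens, which are legal in Kafka topic names.
  static String newControlTopicName() {
    return "control-iceberg-" + UUID.randomUUID();
  }

  // Hypothetical reduction of the shared connector config to the one overridden key.
  static Map<String, String> commonConfig(String controlTopic) {
    Map<String, String> config = new HashMap<>();
    config.put("iceberg.control.topic", controlTopic);
    return config;
  }

  public static void main(String[] args) {
    // In the tests this would happen once per test method, in the setup hook.
    String controlTopic = newControlTopicName();
    System.out.println(commonConfig(controlTopic));
  }
}
```

Because the name is generated in the per-test setup and stored for teardown, the same value can be used both for the connector config and for the best-effort topic deletion afterwards.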

Diagnostic improvements (bundled in this PR)

Two follow-on commits add CI plumbing that made root-causing this flake possible, and that future Kafka Connect integration-test failures will benefit from automatically:

  • Capture docker container logs — attach a testcontainers withLogConsumer for the connect, kafka, and iceberg services in TestContext, writing each service's container output to ${rootDir}/build/testlogs/<service>-container.log. The Iceberg sink coordinator runs inside the Connect container, and its logs were never reaching the JVM test worker before this change. The Kafka Connect CI artifact upload already covers **/build/testlogs, so docker logs come along on failure automatically.

  • Capture per-test output and integration test reports — mirror the existing test block from the root build.gradle inside the integrationTest task (addTestOutputListener writing to ${rootDir}/build/testlogs/${project.name}-integration.log + testLogging { exceptionFormat "full" }), and extend the Kafka Connect CI artifact upload path to also include **/build/reports/tests/integrationTest. Previously a failing integration test surfaced as a bare FAILED line in CI with no stack trace, no AssertJ description, no sink-side logs — now the HTML reports with per-test stack traces are preserved.
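The container-log capture can be pictured as a per-service line sink. The commit message for that change says the log directory arrives through a `dockerLogDir` system property and that the code falls back to a no-op when it is unset. The sketch below shows only that fallback logic in plain Java; it deliberately does not use the testcontainers API, and `ContainerLogSinkSketch` and `logSinkFor` are hypothetical names rather than anything from the PR.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.function.Consumer;

final class ContainerLogSinkSketch {
  // Returns a sink that appends each line to <logDir>/<service>-container.log,
  // or a no-op sink when no directory is configured (for example under an IDE).
  static Consumer<String> logSinkFor(String service, String logDir) {
    if (logDir == null || logDir.isEmpty()) {
      return line -> {};
    }
    Path file = Paths.get(logDir, service + "-container.log");
    try {
      Files.createDirectories(file.getParent());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return line -> {
      try {
        Files.write(
            file,
            (line + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
            StandardOpenOption.CREATE,
            StandardOpenOption.APPEND);
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    };
  }

  public static void main(String[] args) {
    // Assumption: the Gradle task passes -DdockerLogDir=...; unset means logging is skipped.
    String logDir = System.getProperty("dockerLogDir");
    for (String service : new String[] {"connect", "kafka", "iceberg"}) {
      logSinkFor(service, logDir).accept("example log line from " + service);
    }
  }
}
```

Keeping the unset case as a no-op means the test context can be constructed the same way in CI and in ad-hoc local runs, with files written only when the build asks for them.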

Verification

Three consecutive Kafka Connect CI runs passed on this branch with the fix in place (one of them with ensureConnectorRemoved removed, confirming that control-topic isolation alone is sufficient and the cleanup wait was redundant). Baseline failure rate on main was 5/10, so three consecutive greens would occur by chance with probability about 1/8 (0.5^3 = 0.125), and the fix matches a hypothesis-driven root cause that explains the specific failure target.

Commits

  1. Kafka Connect: Isolate iceberg.control.topic per integration test — the actual fix
  2. Kafka Connect: Capture docker container logs for integration tests — diagnostic plumbing
  3. Kafka Connect: Capture per-test output and reports for integration tests in CI — diagnostic plumbing

wombatu-kun marked this pull request as draft May 20, 2026 05:39
wombatu-kun reopened this May 20, 2026
wombatu-kun changed the title from "Kafka Connect: Increase integration test timeout to reduce flakiness" to "Kafka Connect: Wait for connector removal in stopConnector to fix integration test flakiness" May 20, 2026
wombatu-kun marked this pull request as ready for review May 20, 2026 05:53
wombatu-kun requested a review from manuzhang May 20, 2026 10:47
manuzhang requested review from bryanck, nastra and pvary May 20, 2026 10:52
wombatu-kun marked this pull request as draft May 20, 2026 11:02
Vova Kolmakov and others added 3 commits May 20, 2026 19:26
`TestIntegrationDynamicTable#testIcebergSink` was flaky on the `kafka-connect-tests` job — 5 of the last 10 Kafka Connect CI runs on `main` failed on the same assertion (`IntegrationTestBase.java`, `assertThat(table.snapshots()).hasSize(1)` inside an `Awaitility.untilAsserted` block), almost always against the partitioned table `test.tbl1` and concentrated on the `[2] test_branch` parameterization. Larger Awaitility budgets (60s, then 120s) did not help, ruling out a "slow first commit" cause and pointing at cross-test state.

Root cause is the shared control topic. `iceberg.control.topic` defaulted to `control-iceberg` and was reused across every connector lifecycle in every integration test method. With `iceberg.kafka.auto.offset.reset=earliest`, every new Coordinator joined a fresh consumer group on that topic and replayed the entire control-topic history from prior tests. Historical `DATA_COMPLETE` events fed to `Coordinator.receive` (see `Coordinator.java:140-145`) can hit `isCommitReady(totalPartitionCount)` and trigger commit cycles before the current test's events are processed, which on the partitioned table can produce a snapshot whose offsets the legitimate commit then fails to validate against via `Coordinator.offsetValidator` (`Coordinator.java:280`). That fits both the failure target (the partitioned `tbl1`) and the immunity to larger timeouts.

Generate a unique control topic name in `IntegrationTestBase#baseBefore` (`control-iceberg-<uuid>`), pass it through `createCommonConfig` as `iceberg.control.topic`, and best-effort-delete it in `baseAfter` alongside the test topic so cleanup is symmetric. The control topic auto-creates on first publish so no explicit pre-creation is needed. All three integration test classes (`TestIntegration`, `TestIntegrationMultiTable`, `TestIntegrationDynamicTable`) route through `createCommonConfig`, so the override propagates uniformly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The integration tests drive their workloads through a testcontainers docker compose stack (kafka, connect, iceberg REST catalog, minio). The Kafka Connect coordinator does the actual snapshot commit work, and its logs live in the Connect container's stdout — never in the JVM test worker. So when an Awaitility timeout surfaced as a bare AssertionError in CI, there was no way to see why the commit did not happen.

Attach a withLogConsumer for the connect, kafka, and iceberg services in TestContext, writing each service's container output to ${rootDir}/build/testlogs/<service>-container.log. The location is passed in from the integrationTest Gradle task via a `dockerLogDir` system property and falls back to a no-op when unset (so the constructor still works under IDEs or ad-hoc runs). The Kafka Connect CI artifact upload already covers `**/build/testlogs`, so on failure the docker logs come along automatically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sts in CI

The `integrationTest` Gradle task carried no `addTestOutputListener` and no `testLogging` block, so test-process stdout/stderr was lost and the Gradle console output for CI showed only a bare `FAILED` line and the assertion source location, with no stack trace or AssertJ description. The Kafka Connect CI workflow uploaded only `**/build/testlogs`, which is populated by the unit test task in the root `build.gradle` but not by `integrationTest`.

Mirror the existing `test` block from the root `build.gradle` inside the `integrationTest` task: stream per-test output to `${rootDir}/build/testlogs/${project.name}-integration.log` (a separate file from the unit-test log), and emit verbose `testLogging` events with `exceptionFormat "full"` on CI. Extend the Kafka Connect CI artifact upload to also include `**/build/reports/tests/integrationTest` so the HTML reports with per-test stack traces are preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
wombatu-kun force-pushed the fix/flaky-kafka-connect-integration-test branch from 2cf942e to 1fb4bd3 May 20, 2026 12:28
wombatu-kun changed the title from "Kafka Connect: Wait for connector removal in stopConnector to fix integration test flakiness" to "Kafka Connect: Fix flaky integration tests by isolating iceberg.control.topic per test" May 20, 2026
wombatu-kun marked this pull request as ready for review May 20, 2026 12:34