Skip to content

[Flaky Test] Fix ExactlyOnceKafkaRealtimeClusterIntegrationTest setUp#18061

Merged
xiangfu0 merged 2 commits intoapache:masterfrom
xiangfu0:fix/flaky-ExactlyOnceKafkaIntegrationTest
Apr 7, 2026
Merged

[Flaky Test] Fix ExactlyOnceKafkaRealtimeClusterIntegrationTest setUp#18061
xiangfu0 merged 2 commits intoapache:masterfrom
xiangfu0:fix/flaky-ExactlyOnceKafkaIntegrationTest

Conversation

@xiangfu0
Copy link
Copy Markdown
Contributor

@xiangfu0 xiangfu0 commented Apr 1, 2026

Summary

  • Fix ExactlyOnceKafkaRealtimeClusterIntegrationTest#setUp which had 2 failures in the weekly CI digest (2026-03-22 to 2026-03-28)
  • Root cause: transaction coordinator not fully ready — canInitTransactions() only called initTransactions() without verifying a complete transaction cycle could succeed
  • Extend Kafka cluster readiness timeouts from 120s to 180s for 3-broker transactional setup
  • Verify full transaction cycle (begin + commit) in canInitTransactions()
  • Add read_committed isolation level metadata verification before proceeding
  • Extend topic metadata timeout from 30s to 60s

Test plan

  • Run ExactlyOnceKafkaRealtimeClusterIntegrationTest locally to verify setUp succeeds consistently
  • Monitor next week's flaky test digest for regression

🤖 Generated with Claude Code

ExactlyOnceKafkaRealtimeClusterIntegrationTest#setUp had 2 failures
in the weekly CI digest (2026-03-22 to 2026-03-28).

Root causes:
- Transaction coordinator not fully ready: canInitTransactions()
  only called initTransactions() without verifying the coordinator
  could handle a complete transaction cycle
- Insufficient Kafka cluster readiness timeout for 3-broker setup
- Topic metadata not verified with read_committed isolation level

Fixes in BaseClusterIntegrationTest:
- Extend Kafka readiness timeouts from 120s to 180s
- Increase transaction wait poll interval from 500ms to 1000ms
- Verify full transaction cycle (begin + commit) in
  canInitTransactions() instead of just initTransactions()

Fixes in ExactlyOnceKafkaRealtimeClusterIntegrationTest:
- Extend topic metadata timeout from 30s to 60s
- Add read_committed isolation level metadata verification
- Improve retry logging in addTableConfig()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 63.37%. Comparing base (fc95b79) to head (0858504).
⚠️ Report is 100 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18061      +/-   ##
============================================
+ Coverage     63.24%   63.37%   +0.13%     
- Complexity     1481     1540      +59     
============================================
  Files          3191     3200       +9     
  Lines        192593   194221    +1628     
  Branches      29537    29929     +392     
============================================
+ Hits         121797   123088    +1291     
- Misses        61239    61471     +232     
- Partials       9557     9662     +105     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.32% <ø> (+0.10%) ⬆️
java-21 63.32% <ø> (+0.17%) ⬆️
temurin 63.37% <ø> (+0.13%) ⬆️
unittests 63.37% <ø> (+0.13%) ⬆️
unittests1 55.52% <ø> (+0.04%) ⬆️
unittests2 34.28% <ø> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…erIntegrationTest

Wrap two lines that exceeded the 120-character limit:
- Split waitForCondition lambda arg to next line
- Extract groupId variable to shorten consumerProps.put call

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@xiangfu0 xiangfu0 added the flaky-test Tracks a test that intermittently fails label Apr 7, 2026
@xiangfu0 xiangfu0 merged commit 6f654ca into apache:master Apr 7, 2026
16 checks passed
@xiangfu0 xiangfu0 deleted the fix/flaky-ExactlyOnceKafkaIntegrationTest branch April 7, 2026 22:24
xiangfu0 added a commit to xiangfu0/pinot that referenced this pull request Apr 26, 2026
…sterIntegrationTest

setUp in this test has been the dominant integration-test flake on master --
16 of the last 30 failed `Pinot Tests` runs hit the same setUp failure, in two
forms:

  (a) commitTransaction() returned but the partition LSO never advanced;
      read_committed consumers (test consumer + Pinot server) saw nothing
      within the 120 s window (read_committed=0, read_uncommitted=~150k).
  (b) Records reached Kafka but the server-side consuming segment stalled
      and COUNT(*) never converged to 115 545 within the 1 200 000 ms timeout.

Root cause: the test's getKafkaExtraProperties override set
`log.flush.interval.messages=1` on the embedded broker (introduced in apache#18061
under the assumption that forced fsync would help durability). That setting
forces an fsync per record on every partition log. Each setUp pushes
~115 545 records * 2 transactions ~= 231 k records, so the broker's I/O
thread on the CI runner's shared disk was buried under hundreds of thousands
of fsync calls.

The transaction COMMIT marker writes (WriteTxnMarkers from the coordinator
to data-partition leaders) ride the same per-partition I/O queue. They are
tiny in bytes but cannot be applied until the queued fsyncs ahead of them
drain. That gives mode (a) directly. Once the markers eventually drain, the
broker is still I/O-bound; fetch responses to Pinot's consumer are delayed
and the segment never converges -- mode (b).

Kafka's transactional protocol already provides durability via acks=all and
transaction.state.log.replication.factor=3 (which the embedded cluster sets
up), so the forced fsync was redundant. Removing the override eliminates
both observed stall modes.

Also strengthen the post-push verification from "any record visible" to
"all expected records visible, with overshoot detection". The previous check
returned as soon as `read_committed > 0`, which would have hidden a
partial-commit bug; the new check confirms the actual exactly-once contract
(commit marker propagated for every committed record, no leakage from the
aborted batch).

Diff is ~30 lines in a single test file: no retry, no fresh-topic logic, no
setUp override -- the original setUp inherited from
BaseRealtimeClusterIntegrationTest is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

flaky-test Tracks a test that intermittently fails

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants