[Flaky Test] Fix ExactlyOnceKafkaRealtimeClusterIntegrationTest setUp by xiangfu0 · Pull Request #18061 · apache/pinot

xiangfu0 · 2026-04-01T01:23:53Z

Summary

Fix ExactlyOnceKafkaRealtimeClusterIntegrationTest#setUp which had 2 failures in the weekly CI digest (2026-03-22 to 2026-03-28)
Root cause: transaction coordinator not fully ready — canInitTransactions() only called initTransactions() without verifying a complete transaction cycle could succeed
Extend Kafka cluster readiness timeouts from 120s to 180s for 3-broker transactional setup
Verify full transaction cycle (begin + commit) in canInitTransactions()
Add read_committed isolation level metadata verification before proceeding
Extend topic metadata timeout from 30s to 60s

Test plan

Run ExactlyOnceKafkaRealtimeClusterIntegrationTest locally to verify setUp succeeds consistently
Monitor next week's flaky test digest for regression

🤖 Generated with Claude Code

ExactlyOnceKafkaRealtimeClusterIntegrationTest#setUp had 2 failures in the weekly CI digest (2026-03-22 to 2026-03-28). Root causes: - Transaction coordinator not fully ready: canInitTransactions() only called initTransactions() without verifying the coordinator could handle a complete transaction cycle - Insufficient Kafka cluster readiness timeout for 3-broker setup - Topic metadata not verified with read_committed isolation level Fixes in BaseClusterIntegrationTest: - Extend Kafka readiness timeouts from 120s to 180s - Increase transaction wait poll interval from 500ms to 1000ms - Verify full transaction cycle (begin + commit) in canInitTransactions() instead of just initTransactions() Fixes in ExactlyOnceKafkaRealtimeClusterIntegrationTest: - Extend topic metadata timeout from 30s to 60s - Add read_committed isolation level metadata verification - Improve retry logging in addTableConfig() Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

codecov-commenter · 2026-04-01T02:11:49Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 63.37%. Comparing base (fc95b79) to head (0858504).
⚠️ Report is 100 commits behind head on master.

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18061      +/-   ##
============================================
+ Coverage     63.24%   63.37%   +0.13%     
- Complexity     1481     1540      +59     
============================================
  Files          3191     3200       +9     
  Lines        192593   194221    +1628     
  Branches      29537    29929     +392     
============================================
+ Hits         121797   123088    +1291     
- Misses        61239    61471     +232     
- Partials       9557     9662     +105

Flag	Coverage Δ
custom-integration1	`100.00% <ø> (ø)`
integration	`100.00% <ø> (ø)`
integration1	`100.00% <ø> (ø)`
integration2	`0.00% <ø> (ø)`
java-11	`63.32% <ø> (+0.10%)`	⬆️
java-21	`63.32% <ø> (+0.17%)`	⬆️
temurin	`63.37% <ø> (+0.13%)`	⬆️
unittests	`63.37% <ø> (+0.13%)`	⬆️
unittests1	`55.52% <ø> (+0.04%)`	⬆️
unittests2	`34.28% <ø> (-0.01%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…erIntegrationTest Wrap two lines that exceeded the 120-character limit: - Split waitForCondition lambda arg to next line - Extract groupId variable to shorten consumerProps.put call Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…sterIntegrationTest setUp in this test has been the dominant integration-test flake on master -- 16 of the last 30 failed `Pinot Tests` runs hit the same setUp failure, in two forms: (a) commitTransaction() returned but the partition LSO never advanced; read_committed consumers (test consumer + Pinot server) saw nothing within the 120 s window (read_committed=0, read_uncommitted=~150k). (b) Records reached Kafka but the server-side consuming segment stalled and COUNT(*) never converged to 115 545 within the 1 200 000 ms timeout. Root cause: the test's getKafkaExtraProperties override set `log.flush.interval.messages=1` on the embedded broker (introduced in apache#18061 under the assumption that forced fsync would help durability). That setting forces an fsync per record on every partition log. Each setUp pushes ~115 545 records * 2 transactions ~= 231 k records, so the broker's I/O thread on the CI runner's shared disk was buried under hundreds of thousands of fsync calls. The transaction COMMIT marker writes (WriteTxnMarkers from the coordinator to data-partition leaders) ride the same per-partition I/O queue. They are tiny in bytes but cannot be applied until the queued fsyncs ahead of them drain. That gives mode (a) directly. Once the markers eventually drain, the broker is still I/O-bound; fetch responses to Pinot's consumer are delayed and the segment never converges -- mode (b). Kafka's transactional protocol already provides durability via acks=all and transaction.state.log.replication.factor=3 (which the embedded cluster sets up), so the forced fsync was redundant. Removing the override eliminates both observed stall modes. Also strengthen the post-push verification from "any record visible" to "all expected records visible, with overshoot detection". The previous check returned as soon as `read_committed > 0`, which would have hidden a partial-commit bug; the new check confirms the actual exactly-once contract (commit marker propagated for every committed record, no leakage from the aborted batch). Diff is ~30 lines in a single test file: no retry, no fresh-topic logic, no setUp override -- the original setUp inherited from BaseRealtimeClusterIntegrationTest is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

shauryachats approved these changes Apr 7, 2026

View reviewed changes

xiangfu0 added the flaky-test Tracks a test that intermittently fails label Apr 7, 2026

xiangfu0 merged commit 6f654ca into apache:master Apr 7, 2026
16 checks passed

xiangfu0 deleted the fix/flaky-ExactlyOnceKafkaIntegrationTest branch April 7, 2026 22:24

xiangfu0 mentioned this pull request Apr 26, 2026

[Flaky Test] ExactlyOnceKafka: fix mode (a), characterize mode (b) #18337

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Flaky Test] Fix ExactlyOnceKafkaRealtimeClusterIntegrationTest setUp#18061

[Flaky Test] Fix ExactlyOnceKafkaRealtimeClusterIntegrationTest setUp#18061
xiangfu0 merged 2 commits intoapache:masterfrom
xiangfu0:fix/flaky-ExactlyOnceKafkaIntegrationTest

xiangfu0 commented Apr 1, 2026

Uh oh!

codecov-commenter commented Apr 1, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

xiangfu0 commented Apr 1, 2026

Summary

Test plan

Uh oh!

codecov-commenter commented Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov-commenter commented Apr 1, 2026 •

edited

Loading