[Flaky Test] Fix ExactlyOnceKafkaRealtimeClusterIntegrationTest setUp#18061
Merged
xiangfu0 merged 2 commits into apache:master on Apr 7, 2026
Conversation
ExactlyOnceKafkaRealtimeClusterIntegrationTest#setUp had 2 failures in the weekly CI digest (2026-03-22 to 2026-03-28).

Root causes:
- Transaction coordinator not fully ready: canInitTransactions() only called initTransactions() without verifying the coordinator could handle a complete transaction cycle
- Insufficient Kafka cluster readiness timeout for the 3-broker setup
- Topic metadata not verified with the read_committed isolation level

Fixes in BaseClusterIntegrationTest:
- Extend Kafka readiness timeouts from 120s to 180s
- Increase the transaction wait poll interval from 500ms to 1000ms
- Verify a full transaction cycle (begin + commit) in canInitTransactions() instead of just initTransactions()

Fixes in ExactlyOnceKafkaRealtimeClusterIntegrationTest:
- Extend the topic metadata timeout from 30s to 60s
- Add read_committed isolation level metadata verification
- Improve retry logging in addTableConfig()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
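The readiness fix above amounts to polling until a full transaction cycle succeeds within a bounded timeout. A minimal sketch of that poll loop in plain Java, assuming a hypothetical `waitForCondition` helper (the 180 s / 1000 ms values mirror the description; the condition below is a stand-in, since the real check would drive a transactional KafkaProducer through initTransactions/beginTransaction/commitTransaction):

```java
import java.util.function.BooleanSupplier;

public class TxnReadiness {
  /**
   * Polls {@code condition} every {@code pollMs} milliseconds until it returns
   * true or {@code timeoutMs} elapses. Returns whether the condition was met.
   */
  public static boolean waitForCondition(BooleanSupplier condition, long timeoutMs, long pollMs) {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      if (condition.getAsBoolean()) {
        return true;
      }
      try {
        Thread.sleep(pollMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return condition.getAsBoolean();  // one last check at the deadline
  }

  public static void main(String[] args) {
    // Stand-in for canInitTransactions(): becomes ready after ~50 ms.
    long readyAt = System.currentTimeMillis() + 50;
    boolean ready = waitForCondition(() -> System.currentTimeMillis() >= readyAt, 180_000, 10);
    System.out.println("coordinator ready: " + ready);
  }
}
```

The wide timeout with a short poll interval is the usual trade-off here: the loop returns as soon as the coordinator is ready, so the 180 s bound only matters on the slow CI runners the PR targets.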
Codecov Report: ✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
## master #18061 +/- ##
============================================
+ Coverage 63.24% 63.37% +0.13%
- Complexity 1481 1540 +59
============================================
Files 3191 3200 +9
Lines 192593 194221 +1628
Branches 29537 29929 +392
============================================
+ Hits 121797 123088 +1291
- Misses 61239 61471 +232
- Partials 9557 9662 +105
…erIntegrationTest

Wrap two lines that exceeded the 120-character limit:
- Split the waitForCondition lambda argument onto the next line
- Extract a groupId variable to shorten the consumerProps.put call

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
shauryachats approved these changes on Apr 7, 2026
xiangfu0 added a commit to xiangfu0/pinot that referenced this pull request on Apr 26, 2026
…sterIntegrationTest
setUp in this test has been the dominant integration-test flake on master --
16 of the last 30 failed `Pinot Tests` runs hit the same setUp failure, in two
forms:
(a) commitTransaction() returned but the partition LSO never advanced;
read_committed consumers (test consumer + Pinot server) saw nothing
within the 120 s window (read_committed=0, read_uncommitted=~150k).
(b) Records reached Kafka but the server-side consuming segment stalled
and COUNT(*) never converged to 115 545 within the 1 200 000 ms timeout.
Root cause: the test's getKafkaExtraProperties override set
`log.flush.interval.messages=1` on the embedded broker (introduced in apache#18061
under the assumption that forced fsync would help durability). That setting
forces an fsync per record on every partition log. Each setUp pushes
~115 545 records * 2 transactions ~= 231 k records, so the broker's I/O
thread on the CI runner's shared disk was buried under hundreds of thousands
of fsync calls.
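For illustration, the broker override this commit removes likely looked roughly like the following (the property name is from the commit message; the surrounding method shape in the test is an assumption):

```java
import java.util.HashMap;
import java.util.Map;

public class KafkaOverrideSketch {
  // Sketch of the extra broker properties introduced in apache#18061 and
  // removed here (property name per the commit message; shape assumed).
  static Map<String, String> extraBrokerProps() {
    Map<String, String> props = new HashMap<>();
    // Forces the broker to fsync the partition log after every single record:
    // ~231k records per setUp => ~231k fsyncs on the CI runner's shared disk.
    props.put("log.flush.interval.messages", "1");
    return props;
  }

  public static void main(String[] args) {
    System.out.println(extraBrokerProps());
  }
}
```

Kafka's default for `log.flush.interval.messages` is effectively unbounded, deferring flushes to the OS page cache, which is why setting it to 1 is such a drastic change on slow disks.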
The transaction COMMIT marker writes (WriteTxnMarkers from the coordinator
to data-partition leaders) ride the same per-partition I/O queue. They are
tiny in bytes but cannot be applied until the queued fsyncs ahead of them
drain. That gives mode (a) directly. Once the markers eventually drain, the
broker is still I/O-bound; fetch responses to Pinot's consumer are delayed
and the segment never converges -- mode (b).
Kafka's transactional protocol already provides durability via acks=all and
transaction.state.log.replication.factor=3 (which the embedded cluster sets
up), so the forced fsync was redundant. Removing the override eliminates
both observed stall modes.
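The durability settings the commit message relies on can be sketched as plain config entries (the keys are standard Kafka config names; the transactional.id value is a placeholder):

```java
import java.util.Properties;

public class TxnDurabilitySketch {
  // Producer-side settings implied by Kafka's transactional protocol.
  static Properties producerProps() {
    Properties p = new Properties();
    p.setProperty("acks", "all");                 // wait for the full ISR to ack each write
    p.setProperty("enable.idempotence", "true");  // required for transactional producers
    p.setProperty("transactional.id", "txn-sketch");  // placeholder id
    return p;
  }

  // Broker-side setting the embedded cluster applies, per the commit message.
  static Properties brokerProps() {
    Properties p = new Properties();
    p.setProperty("transaction.state.log.replication.factor", "3");
    return p;
  }

  public static void main(String[] args) {
    System.out.println(producerProps());
    System.out.println(brokerProps());
  }
}
```

With replicated acknowledgement at both the data-partition and transaction-state-log level, an extra fsync per record adds latency without changing the durability guarantee, which is the commit's argument for removing it.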
Also strengthen the post-push verification from "any record visible" to
"all expected records visible, with overshoot detection". The previous check
returned as soon as `read_committed > 0`, which would have hidden a
partial-commit bug; the new check confirms the actual exactly-once contract
(commit marker propagated for every committed record, no leakage from the
aborted batch).
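The strengthened check can be sketched as a predicate over the read_committed count (the expected total is the COUNT(*) figure from the commit message; the method name is hypothetical):

```java
public class ExactlyOnceCheck {
  // Record count the test converges on, per the commit message.
  static final long EXPECTED = 115_545;

  /**
   * Exactly-once contract: every committed record is visible under
   * read_committed, and nothing leaked from the aborted batch (which would
   * overshoot EXPECTED). The old check passed as soon as committedCount > 0,
   * which would have hidden a partial-commit bug.
   */
  static boolean exactlyOnceSatisfied(long committedCount) {
    if (committedCount > EXPECTED) {
      throw new IllegalStateException(
          "overshoot: aborted records leaked, saw " + committedCount);
    }
    return committedCount == EXPECTED;
  }

  public static void main(String[] args) {
    System.out.println(exactlyOnceSatisfied(1));         // partial commit: false (old check passed here)
    System.out.println(exactlyOnceSatisfied(EXPECTED));  // complete: true
  }
}
```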
Diff is ~30 lines in a single test file: no retry, no fresh-topic logic, no
setUp override -- the original setUp inherited from
BaseRealtimeClusterIntegrationTest is unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- Fix ExactlyOnceKafkaRealtimeClusterIntegrationTest#setUp, which had 2 failures in the weekly CI digest (2026-03-22 to 2026-03-28)
- canInitTransactions() only called initTransactions() without verifying a complete transaction cycle could succeed; canInitTransactions() now verifies the full cycle
- Add read_committed isolation level metadata verification before proceeding

Test plan
- Ran ExactlyOnceKafkaRealtimeClusterIntegrationTest locally to verify setUp succeeds consistently

🤖 Generated with Claude Code