PHOENIX-7797 Fixing flapper test HAGroupStoreClientIT.testHAGroupStoreClientWithMultiThreadedUpdates #2402

Merged
virajjasani merged 2 commits into apache:master from ritegarg:FixFlapper
Apr 10, 2026

@ritegarg ritegarg commented Apr 7, 2026

Jira: PHOENIX-7797

Fix flaky test HAGroupStoreClientIT.testHAGroupStoreClientWithMultiThreadedUpdates

Problem

testHAGroupStoreClientWithMultiThreadedUpdates fails intermittently (~21% failure rate over 100 runs) with:

java.lang.AssertionError: 
    at HAGroupStoreClientIT.testHAGroupStoreClientWithMultiThreadedUpdates(HAGroupStoreClientIT.java:450)

The test writes 5 versioned updates to the same ZK node from multiple threads and expects exactly 5 PathChildrenCacheListener events. However, all 5 writes complete within ~14ms, and Curator's PathChildrenCache coalesces rapid updates: ZK watches are one-time triggers, so when a watch fires, the follow-up getData() reads the latest value (skipping intermediate versions) before a new watch is set. The listener therefore receives fewer events than there were writes, and the fixed-count eventsLatch times out.
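The coalescing described above can be illustrated with a small single-threaded toy model. This is only a sketch of one-shot watch semantics, not Curator or ZooKeeper code: the class `CoalescingWatchSketch` and its methods are hypothetical names invented for this illustration, and real timing depends on the client's event thread.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Toy model of a one-shot watch on a versioned node (hypothetical, not Curator code). */
class CoalescingWatchSketch {
    private int version = 0;            // current version of the simulated node
    private boolean watchArmed = true;  // a one-shot watch fires at most once until re-armed
    private final Queue<Runnable> pending = new ArrayDeque<>(); // notifications not yet delivered
    final List<Integer> observed = new ArrayList<>();           // versions the listener saw

    /** Simulated write: bumps the version and fires the watch only if it is armed. */
    void write() {
        version++;
        if (watchArmed) {
            watchArmed = false;
            pending.add(this::handleNotification);
        }
    }

    /** Simulated client callback: reads the latest version, then re-arms the watch. */
    private void handleNotification() {
        observed.add(version);
        watchArmed = true;
    }

    /** Delivers queued notifications, as a client event thread eventually would. */
    void deliverPending() {
        Runnable next;
        while ((next = pending.poll()) != null) {
            next.run();
        }
    }

    /** Performs the writes; when slow is true, notifications are delivered between writes. */
    static List<Integer> run(int writes, boolean slow) {
        CoalescingWatchSketch node = new CoalescingWatchSketch();
        for (int i = 0; i < writes; i++) {
            node.write();
            if (slow) {
                node.deliverPending();
            }
        }
        node.deliverPending();
        return node.observed;
    }

    public static void main(String[] args) {
        System.out.println("slow writes: " + run(5, true));   // prints "slow writes: [1, 2, 3, 4, 5]"
        System.out.println("fast writes: " + run(5, false));  // prints "fast writes: [5]"
    }
}
```

In this model, writes that land before the first notification is handled produce a single event carrying the final version, which is why a latch sized to the number of writes may never reach zero.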

Fix

  • Replaced eventsLatch(threadCount) with finalEventLatch(1): Instead of requiring exactly N events, wait for the event carrying the final version. This accommodates event coalescing while still ensuring all updates were processed.
  • Added inline ordering validation in the listener: Each received event version is checked against the previous using AtomicInteger. Any out-of-order delivery is recorded and asserted after the test completes.
  • Moved updateLatch.countDown() to a finally block: Previously, if createOrUpdateDataOnZookeeper threw an exception, countDown() was skipped and the exception was silently swallowed by the executor. Now the latch always decrements, and exceptions are captured and asserted separately.
  • Made shared collections thread-safe: crrEventVersions and orderingErrors use Collections.synchronizedList.
  • Added resource cleanup: executor.shutdown() and storeClient.close() to prevent leaks.
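The pattern behind these changes can be sketched with the JDK alone. The following is a minimal illustration, not the code from this PR: `FinalEventLatchSketch`, `onEvent`, and `runUpdates` are hypothetical names, the listener is a plain method rather than a PathChildrenCacheListener, and the check assumes that versions must strictly increase.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch of the latch and ordering pattern (hypothetical names, JDK only). */
class FinalEventLatchSketch {
    final int finalVersion;
    final CountDownLatch finalEventLatch = new CountDownLatch(1);
    final AtomicInteger lastSeenVersion = new AtomicInteger(0);
    final List<String> orderingErrors = Collections.synchronizedList(new ArrayList<>());

    FinalEventLatchSketch(int finalVersion) {
        this.finalVersion = finalVersion;
    }

    /** Stand-in for the listener callback; skipped versions are tolerated. */
    void onEvent(int version) {
        int previous = lastSeenVersion.getAndSet(version);
        if (version <= previous) { // assumption: versions must strictly increase
            orderingErrors.add("version " + version + " arrived after " + previous);
        }
        if (version == finalVersion) {
            finalEventLatch.countDown(); // only the final version releases the waiter
        }
    }

    /** Runs each update on a pool; the latch is decremented even when an update throws. */
    static List<Throwable> runUpdates(List<Runnable> updates) {
        ExecutorService executor = Executors.newFixedThreadPool(Math.max(1, updates.size()));
        CountDownLatch updateLatch = new CountDownLatch(updates.size());
        List<Throwable> failures = Collections.synchronizedList(new ArrayList<>());
        try {
            for (Runnable update : updates) {
                executor.submit(() -> {
                    try {
                        update.run();
                    } catch (Throwable t) {
                        failures.add(t); // captured so the caller can assert on it
                    } finally {
                        updateLatch.countDown();
                    }
                });
            }
            if (!updateLatch.await(10, TimeUnit.SECONDS)) {
                failures.add(new IllegalStateException("updates did not finish in time"));
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            failures.add(e);
        } finally {
            executor.shutdown(); // release the pool even on failure
        }
        return failures;
    }
}
```

A test built this way would await `finalEventLatch` with a timeout and then assert that both `orderingErrors` and the returned failure list are empty.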

Test plan

  • Ran the test 100 times in a loop in IntelliJ: 0 failures (previously 21/100 failures).

@ritegarg ritegarg changed the title from "Fixing flapper test HAGroupStoreClientIT.testHAGroupStoreClientWithMultiThreadedUpdates" to "PHOENIX-7797 Fixing flapper test HAGroupStoreClientIT.testHAGroupStoreClientWithMultiThreadedUpdates" Apr 10, 2026
@virajjasani virajjasani merged commit f088fde into apache:master Apr 10, 2026
1 check failed
virajjasani pushed a commit that referenced this pull request Apr 10, 2026