
[fix][test] Reduce flakiness in testLoadBalancerServiceUnitTableViewSyncer #25638

Merged
lhotari merged 1 commit into apache:master from merlimat:fix-flaky-table-view-syncer-test
May 1, 2026


@merlimat
Contributor

Summary

Restructure testLoadBalancerServiceUnitTableViewSyncer to stop chasing timing bugs.

  • Activate/deactivate the syncer by calling primaryLoadManager.monitor() directly instead of forcing leader transitions via makeSecondaryAsLeader() + makePrimaryAsLeader(). The double transition serializes playLeader() behind a still-running playFollower() on the single-threaded loadManagerExecutor, which was the root cause of repeated 30s+ timeouts. The 60s Awaitility bumps in #25596 / #25427 / #25378 were treating that symptom; calling monitor() (the same hook the periodic scheduler uses) makes activation deterministic and synchronous.
  • Drop pulsar4. The original test added two extra brokers but only one of them ever exercised the cross-impl syncer path; the other was redundant in each parametrization. Always use pulsar3 with the OTHER table view impl so both parametrizations get equivalent coverage with half the cluster work. This also removes the producer-timeout hot spot on persistent://pulsar/system/loadbalancer-service-unit-state.
  • Restore 30s Awaitility timeouts. With monitor() driving syncer state synchronously, the longer 60s budgets are no longer needed.
  • Reorganize into explicit phases (activate → cross-impl lookup → disconnect → re-register → deactivate) with the SLA-monitor-topic durability check preserved.
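
The serialization effect described in the first bullet can be reproduced outside Pulsar. The following is a minimal sketch using only java.util.concurrent; the class and method names are hypothetical, and the two tasks are stand-ins for playFollower() and playLeader(), not the actual Pulsar implementations. It shows that a task submitted to a single-threaded executor cannot start until the task ahead of it has finished.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical demo class; not part of the Pulsar code base. */
final class SingleThreadSerializationDemo {

    /**
     * Submits a slow "follower" task and then a "leader" task to the same
     * single-threaded executor, and returns how many milliseconds passed
     * between submission and the moment the leader task started running.
     */
    static long leaderStartDelayMillis(long followerWorkMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            long submittedAt = System.nanoTime();

            // Stand-in for a still-running playFollower(): occupies the only worker thread.
            Runnable followerTask = () -> {
                try {
                    Thread.sleep(followerWorkMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            };
            executor.submit(followerTask);

            // Stand-in for playLeader(): queued behind the follower task.
            Callable<Long> leaderTask = System::nanoTime;
            Future<Long> leaderStartedAt = executor.submit(leaderTask);

            long startedAt = leaderStartedAt.get(10, TimeUnit.SECONDS);
            return TimeUnit.NANOSECONDS.toMillis(startedAt - submittedAt);
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        long delay = leaderStartDelayMillis(300);
        System.out.println("leader task started after about " + delay + " ms");
    }
}
```

In this sketch the leader task's start delay tracks the follower task's duration, so a wait with a fixed timeout on the leader's side effects can expire whenever the follower work runs long.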

Net: 113 insertions(+), 235 deletions(-).

Test plan

  • Six consecutive local runs pass cleanly across both serviceUnitStateTableViewClassName parametrizations
  • CI is green

…yncer

Restructure the test so it stops chasing timing bugs:

- Activate/deactivate the syncer by calling primaryLoadManager.monitor()
  directly instead of forcing leader transitions via
  makeSecondaryAsLeader() + makePrimaryAsLeader(). The double transition
  serializes playLeader() behind a still-running playFollower() on the
  single-threaded loadManagerExecutor, which was the root cause of
  repeated 30s+ timeouts (the 60s bumps in apache#25596 / apache#25427 / apache#25378
  were treating that symptom).

- Drop pulsar4. The original test added two extra brokers but only one
  of them (pulsar3 in the metadata-store parametrization, pulsar4 in the
  system-topic parametrization) ever exercised the cross-impl syncer
  path; the other was redundant. Always use pulsar3 with the OTHER
  table view impl so both parametrizations get equivalent coverage from
  half the cluster work.

- Restore the 30s Awaitility timeouts; with monitor() driving syncer
  state synchronously, the longer 60s budgets are no longer needed.

- Reorganize into explicit phases (activate, cross-impl lookup,
  disconnect, re-register, deactivate) with the durability check on
  the SLA monitor topic preserved.

Net: 113 insertions(+), 235 deletions(-). Six consecutive local runs
pass cleanly across both parametrizations.
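
To illustrate why calling the monitor hook directly makes activation deterministic, here is a toy model. Everything in it is an assumption for illustration: ToyLoadManager, its enable flag, and the body of monitor() are hypothetical and do not reproduce the real ExtensibleLoadManagerImpl logic. The only point is that when the state change happens inside the hook, a test that invokes the hook on its own thread can assert immediately after the call returns.

```java
/** Hypothetical stand-in for a load manager with a syncer toggle; not Pulsar's class. */
final class ToyLoadManager {
    // Assumed stand-in for a dynamic configuration flag.
    private volatile boolean syncerEnabled;
    private volatile boolean syncerActive;

    void setSyncerEnabled(boolean enabled) {
        this.syncerEnabled = enabled;
    }

    /** The hook a periodic scheduler would call; a test can also call it directly. */
    void monitor() {
        // The syncer state follows the flag only when the hook runs.
        syncerActive = syncerEnabled;
    }

    boolean isSyncerActive() {
        return syncerActive;
    }
}

/** Hypothetical usage: drive the hook from the calling thread. */
final class ToyMonitorDemo {
    public static void main(String[] args) {
        ToyLoadManager loadManager = new ToyLoadManager();

        loadManager.setSyncerEnabled(true);
        loadManager.monitor(); // synchronous: state is settled when this returns
        System.out.println("active after enable: " + loadManager.isSyncerActive());

        loadManager.setSyncerEnabled(false);
        loadManager.monitor();
        System.out.println("active after disable: " + loadManager.isSyncerActive());
    }
}
```

The trade-off in this style is that the test no longer exercises the leader-transition path itself, only the state the hook produces.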
Member

@lhotari left a comment

LGTM

@lhotari merged commit 2accf43 into apache:master on May 1, 2026
79 of 82 checks passed
@lhotari added this to the 5.0.0-M1 milestone on May 1, 2026
poorbarcode pushed a commit to poorbarcode/pulsar that referenced this pull request May 6, 2026
2 participants