@shauryachats (Collaborator) commented Nov 19, 2025

Problem

The TableRebalancePauselessIntegrationTest.testForceCommit integration test was failing intermittently, timing out after 600 seconds:

TableRebalancePauselessIntegrationTest.testForceCommit:191->BaseClusterIntegrationTestSet.waitForRebalanceToComplete:850 Failed to meet condition in 600000ms, error message: Failed to complete rebalance

The test stalled while waiting for a rebalance operation to complete during the initial setup phase, in which consuming segments need to be moved between servers, and only stopped when the 600-second wait timed out.

Root Cause

The test did not set minAvailableReplicas, so the rebalancer enforced its availability constraint, and that constraint deadlocked with the force-commit of consuming segments.

Fix

Added

rebalanceConfig.setMinAvailableReplicas(0); 

to allow an aggressive rebalance with no availability guarantee. This matches the non-pauseless counterpart of this test, TableRebalanceIntegrationTest, which also sets minAvailableReplicas=0 for force-commit operations.
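
To illustrate why a minimum-availability constraint can stall a move while minAvailableReplicas=0 lets it finish, here is a toy model. It is not Pinot code: the class `MinAvailableReplicasSketch`, the method `rebalance`, and the `newReplicasCanServe` flag are all hypothetical. The flag stands in for whatever prevented the relocated consuming segments from becoming available during force commit, which this PR does not describe in detail.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model (NOT Pinot code): one segment moving from its current servers to target servers. */
public class MinAvailableReplicasSketch {

  /**
   * Moves a segment's replicas from {@code current} to {@code target} in steps.
   * An old replica may be dropped only if at least {@code minAvailableReplicas}
   * replicas would still be serving afterwards. {@code newReplicasCanServe} is an
   * assumption of this sketch: it models whether replicas added on target servers
   * ever become available. Returns true if the target is reached within maxSteps.
   */
  public static boolean rebalance(Set<String> current, Set<String> target,
      int minAvailableReplicas, boolean newReplicasCanServe, int maxSteps) {
    Set<String> assigned = new HashSet<>(current); // servers holding a replica
    Set<String> serving = new HashSet<>(current);  // replicas able to serve queries
    for (int step = 0; step < maxSteps; step++) {
      if (assigned.equals(target)) {
        return true;
      }
      // Add replicas on the target servers first.
      for (String server : target) {
        if (assigned.add(server) && newReplicasCanServe) {
          serving.add(server);
        }
      }
      // Drop replicas that are not in the target, but only while the constraint holds.
      for (String server : new HashSet<>(assigned)) {
        if (!target.contains(server)) {
          int servingAfterDrop = serving.contains(server) ? serving.size() - 1 : serving.size();
          if (servingAfterDrop >= minAvailableReplicas) {
            assigned.remove(server);
            serving.remove(server);
          }
        }
      }
    }
    return assigned.equals(target);
  }

  public static void main(String[] args) {
    Set<String> current = Set.of("server1");
    Set<String> target = Set.of("server2");
    // Constraint of 1 with new replicas never becoming available: no progress.
    System.out.println(rebalance(current, target, 1, false, 10));
    // Constraint of 0: the old replica can be dropped and the move completes.
    System.out.println(rebalance(current, target, 0, false, 10));
  }
}
```

In this model the first call never converges because the only serving replica cannot be dropped, which is the same shape of stall the test hit before the 600-second timeout. The trade-off of setting the constraint to 0 is that the segment may be unavailable during the move, which is acceptable in a test.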

@codecov-commenter commented Nov 20, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 63.15%. Comparing base (a548532) to head (c2df85a).
⚠️ Report is 7 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17240      +/-   ##
============================================
- Coverage     63.21%   63.15%   -0.07%     
+ Complexity     1433     1432       -1     
============================================
  Files          3124     3124              
  Lines        185370   185375       +5     
  Branches      28335    28335              
============================================
- Hits         117182   117065     -117     
- Misses        59143    59257     +114     
- Partials       9045     9053       +8     
| Flag | Coverage Δ |
|---|---|
| custom-integration1 | 100.00% <ø> (ø) |
| integration | 100.00% <ø> (ø) |
| integration1 | 100.00% <ø> (ø) |
| integration2 | 0.00% <ø> (ø) |
| java-11 | 63.13% <ø> (-0.04%) ⬇️ |
| java-21 | 63.13% <ø> (-0.05%) ⬇️ |
| temurin | 63.15% <ø> (-0.07%) ⬇️ |
| unittests | 63.14% <ø> (-0.07%) ⬇️ |
| unittests1 | 55.73% <ø> (-0.04%) ⬇️ |
| unittests2 | 33.76% <ø> (-0.04%) ⬇️ |


@J-HowHuang (Collaborator) left a comment

Good catch! LGTM

@cbalci merged commit 82db600 into apache:master on Nov 24, 2025 (18 checks passed)
@shauryachats deleted the tablerebalance_integfix branch on November 24, 2025 21:55

@Jackie-Jiang (Contributor) commented

@shauryachats @J-HowHuang Thanks for the fix. Could this deadlock happen in production if a user runs a force commit and a rebalance at the same time? If so, should we handle it in production code instead of fixing the test?
