Fix rebalancer failure counter in async scenario#2318
Conversation
| doGlobalRebalance(clusterData, resourceMap, algorithm, currentStateOutput, !waitForGlobalRebalance, | ||
| clusterChanges); | ||
| } catch (HelixRebalanceException e) { | ||
| _rebalanceFailureCount.increment(1L); |
There was a problem hiding this comment.
For my own understanding, how was it working in the past? Did we simply skip the error and not counting?
There was a problem hiding this comment.
Yes, Global Rebalance failure was never accounted for in the failure metric. See #2062 .
qqu0127
left a comment
There was a problem hiding this comment.
LGTM, also shall we consider cherry-pick this to the master branch?
|
@qqu0127 This will be merged into master with feature branch, which happens now. :) |
|
If both baseline and partial rebalance fail, will we count 2 times instead
of 1?
Best Regards,
Jiajun
…On Mon, Dec 12, 2022 at 4:00 PM Neal Sun ***@***.***> wrote:
Merged #2318 <#2318> into
nealsun/waged-pipeline-redesign.
—
Reply to this email directly, view it on GitHub
<#2318 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AANYM2CD4WR435E3WGXRQMTWM64CVANCNFSM6AAAAAAS4PF4OA>
.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
|
|
@jiajunwang missed your message: yes we will count 2 times. Conceptually, given that these pipelines are separate async pipelines now, counting 2 times does make sense (rebalance failed on 2 occasions with possibly different settings). Operationally, since the metric is used to indicate errors, counting 2 times is mostly similar to counting 1 time (and definitely better than counting 0 times). |
|
Agreed. Not a concern. |
Co-authored-by: Neal Sun <nesun@nesun-mn2.linkedin.biz>
Issues
Fixes #2062
Description
Async processes cannot propagate exceptions upwards, so we need to explicitly increment the failure counters during exception handling in the submission block. Since partial rebalance is turning into async, this PR fixes both the old problem (global rebalance missing rebalance failure counter) and the new problem.
Tests
(If CI test fails due to known issue, please specify the issue and test PR locally. Then copy & paste the result of "mvn test" to here.)
Changes that Break Backward Compatibility (Optional)
(Consider including all behavior changes for public methods or API. Also include these changes in merge description so that other developers are aware of these changes. This allows them to make relevant code changes in feature branches accounting for the new method/API behavior.)
Documentation (Optional)
(Link the GitHub wiki you added)
Commits
Code Quality
(helix-style-intellij.xml if IntelliJ IDE is used)