Fix rebalancer failure counter in async scenario #2318

Merged
NealSun96 merged 1 commit into apache:nealsun/waged-pipeline-redesign from NealSun96:nealsun/waged-pipeline-redesign
Dec 13, 2022

Conversation

@NealSun96
Contributor

Issues

  • My PR addresses the following Helix issues and references them in the PR description:

Fixes #2062

Description

  • Here are some details about my PR, including screenshots of any UI changes:

Exceptions thrown inside an asynchronously submitted task do not propagate up to the caller, so we need to explicitly increment the failure counters during exception handling in the submission block. Since partial rebalance is becoming asynchronous, this PR fixes both the old problem (global rebalance never incremented the rebalance failure counter) and the new problem (partial rebalance would stop incrementing it).
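A minimal sketch of the problem described above, not the Helix implementation: the class name, `failureCount`, `doRebalance`, and `submitAndCount` are hypothetical stand-ins used only for illustration. It shows that an exception thrown inside a task submitted to an `ExecutorService` is captured by the returned `Future` and never reaches a `catch` around the submission call, so the counter has to be incremented inside the task.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

class AsyncFailureCounterSketch {
  // Hypothetical stand-in for a failure metric such as _rebalanceFailureCount.
  static final AtomicLong failureCount = new AtomicLong();

  // Hypothetical stand-in for a rebalance step that always fails.
  static void doRebalance() {
    throw new IllegalStateException("simulated rebalance failure");
  }

  // Submits the failing step asynchronously and returns the number of counted failures.
  static long submitAndCount(boolean countInsideTask) {
    failureCount.set(0);
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      executor.submit(() -> {
        try {
          doRebalance();
        } catch (RuntimeException e) {
          if (countInsideTask) {
            // The approach described in the PR: count where the exception is visible.
            failureCount.incrementAndGet();
          }
          throw e; // captured by the Future, not rethrown on the submitting thread
        }
      });
    } catch (RuntimeException e) {
      // Caller-side counting: never reached for failures thrown inside the task.
      failureCount.incrementAndGet();
    } finally {
      executor.shutdown();
      try {
        // Wait for the task to finish before reading the counter.
        executor.awaitTermination(5, TimeUnit.SECONDS);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
      }
    }
    return failureCount.get();
  }

  public static void main(String[] args) {
    System.out.println(submitAndCount(false)); // prints 0: the failure goes uncounted
    System.out.println(submitAndCount(true));  // prints 1: counted inside the task
  }
}
```

The first call models the old behavior, where only the caller tries to count; the second models counting in the exception handler of the submitted block.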

Tests

  • The following is the result of the "mvn test" command on the appropriate module:

(If CI test fails due to known issue, please specify the issue and test PR locally. Then copy & paste the result of "mvn test" to here.)

Changes that Break Backward Compatibility (Optional)

  • My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:

(Consider including all behavior changes for public methods or API. Also include these changes in merge description so that other developers are aware of these changes. This allows them to make relevant code changes in feature branches accounting for the new method/API behavior.)

Documentation (Optional)

  • In case of new functionality, my PR adds documentation in the following wiki page:

(Link the GitHub wiki you added)

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

  doGlobalRebalance(clusterData, resourceMap, algorithm, currentStateOutput, !waitForGlobalRebalance,
      clusterChanges);
} catch (HelixRebalanceException e) {
  _rebalanceFailureCount.increment(1L);

Contributor

For my own understanding, how did this work in the past? Did we simply skip the error without counting it?

Contributor Author

Yes, global rebalance failures were never accounted for in the failure metric. See #2062.

Contributor

@qqu0127 qqu0127 left a comment

LGTM. Also, should we consider cherry-picking this to the master branch?

@NealSun96
Contributor Author

@qqu0127 This will be merged into master together with the feature branch, which is happening now. :)

@NealSun96 NealSun96 merged commit 47f6a3d into apache:nealsun/waged-pipeline-redesign Dec 13, 2022
@jiajunwang
Contributor

jiajunwang commented Dec 13, 2022 via email

@NealSun96
Contributor Author

@jiajunwang Sorry, I missed your message. Yes, we will count twice. Conceptually, since these are now separate async pipelines, counting twice makes sense (rebalance failed on two occasions, possibly with different settings). Operationally, since the metric is used to indicate errors, counting twice is much the same as counting once (and definitely better than not counting at all).
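To illustrate the double counting discussed here, a rough sketch under stated assumptions: `runPipeline`, `runBothPipelines`, and the "global"/"partial" labels are hypothetical names, and `CompletableFuture.runAsync` merely stands in for however the real pipelines are scheduled. Each pipeline handles its own failure and increments a shared counter, so two failing pipelines yield a count of 2.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicLong;

class DoubleCountSketch {
  // Hypothetical pipeline body: fails on demand and counts its own failure.
  static void runPipeline(String name, boolean shouldFail, AtomicLong failureCount) {
    try {
      if (shouldFail) {
        throw new IllegalStateException(name + " rebalance failed (simulated)");
      }
    } catch (IllegalStateException e) {
      // Each async pipeline increments the shared metric independently.
      failureCount.incrementAndGet();
    }
  }

  // Runs two independent async pipelines and returns the total failure count.
  static long runBothPipelines(boolean globalFails, boolean partialFails) {
    AtomicLong failureCount = new AtomicLong();
    CompletableFuture<Void> global =
        CompletableFuture.runAsync(() -> runPipeline("global", globalFails, failureCount));
    CompletableFuture<Void> partial =
        CompletableFuture.runAsync(() -> runPipeline("partial", partialFails, failureCount));
    // Wait for both pipelines before reading the counter.
    CompletableFuture.allOf(global, partial).join();
    return failureCount.get();
  }

  public static void main(String[] args) {
    System.out.println(runBothPipelines(true, true));   // prints 2
    System.out.println(runBothPipelines(true, false));  // prints 1
  }
}
```

For a metric that only signals that something went wrong, a value of 1 or 2 triggers the same investigation, which matches the operational argument in the comment.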

@jiajunwang
Contributor

Agreed. Not a concern.

Contributor

@desaikomal desaikomal left a comment

LGTM

@NealSun96 NealSun96 mentioned this pull request Dec 13, 2022
3 tasks
NealSun96 added a commit that referenced this pull request Dec 15, 2022
Co-authored-by: Neal Sun <nesun@nesun-mn2.linkedin.biz>

4 participants