Fix TestClusterAggregateMetrics #1842
This PR is ready to be merged, approved by @junkaixue. Fix TestClusterAggregateMetrics
A better way, or IMHO the correct way, is to wait until the cluster state has converged. The main reason we might get unexpected results is that the controller is still rebalancing the cluster when the assert check happens. That said, the current change will reduce the error rate by reducing the number of pipeline runs. But it is not guaranteed to work, since even a single pipeline can still run slowly. Could you please change the test logic to wait until the cluster converges and then start validation? Note that the verifier should support validating instance state in the call, so you can specify the expected disabled nodes in the same check to ensure the verifier does not return prematurely.
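The wait-then-validate idea suggested here could be sketched roughly as follows. This is a minimal, self-contained illustration and not the Helix verifier API: `ConvergenceWait`, `waitUntil`, `waitForConvergence`, and the two supplier parameters are hypothetical names standing in for whatever the test uses to check cluster state and to read the set of disabled instances.

```java
import java.util.Set;
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

public class ConvergenceWait {

  /** Polls the condition until it holds or the timeout elapses; returns whether it held. */
  public static boolean waitUntil(BooleanSupplier condition, long timeoutMs, long intervalMs) {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (true) {
      if (condition.getAsBoolean()) {
        return true;
      }
      if (System.nanoTime() >= deadline) {
        return false;
      }
      try {
        Thread.sleep(intervalMs);
      } catch (InterruptedException e) {
        // Preserve the interrupt flag and report that convergence was not observed.
        Thread.currentThread().interrupt();
        return false;
      }
    }
  }

  /**
   * Hypothetical combined check: the wait succeeds only once the disabled
   * instances match the expectation AND the cluster-state check passes, so a
   * passing cluster-state check on an intermediate state is not enough.
   */
  public static boolean waitForConvergence(BooleanSupplier clusterVerified,
      Supplier<Set<String>> disabledInstances, Set<String> expectedDisabled, long timeoutMs) {
    return waitUntil(
        () -> expectedDisabled.equals(disabledInstances.get()) && clusterVerified.getAsBoolean(),
        timeoutMs, 50L);
  }
}
```

The test would start its assertions only after `waitForConvergence` returns true, and fail outright if it returns false.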
As I commented, I think the current fix does not completely fix the problem. Please take a look and let me know if you think it makes sense : )
@jiajunwang On line 178 we have [...]. The previous test caused the verifier to mistakenly verify an intermediate state instead of the final state; adding the maintenance mode should mitigate that and let the verifier do its job.
I had the impression that this verifier is not enough if it does not validate disabled nodes during the wait. But I guess you are right that what we've done here is the best possible approach for now. Eventually, we need to fix #526.
LGTM!
Issues
Fixes #1806
Description
Fix flaky TestClusterAggregateMetrics by wrapping the batch enabling of instances in maintenance mode. Otherwise the enable events may be processed separately, resulting in an intermediate best possible state that does not represent the final state.
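A rough sketch of the wrapping pattern described above, assuming (as the description implies) that the controller defers rebalancing while maintenance mode is on. `ClusterAdmin`, `setMaintenanceMode`, `enableInstance`, and `RecordingAdmin` are hypothetical stand-ins written for this illustration; they are not the Helix admin API or the code in this PR.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchEnable {

  /** Hypothetical stand-in for the admin calls a test would make; not the Helix API. */
  public interface ClusterAdmin {
    void setMaintenanceMode(boolean enabled);

    void enableInstance(String instanceName, boolean enabled);
  }

  /**
   * Enables all instances inside one maintenance window so that the enable
   * events are handled together once the window closes.
   */
  public static void enableAll(ClusterAdmin admin, List<String> instances) {
    admin.setMaintenanceMode(true);
    try {
      for (String instance : instances) {
        admin.enableInstance(instance, true);
      }
    } finally {
      // Always leave maintenance mode, even if one of the enable calls throws.
      admin.setMaintenanceMode(false);
    }
  }

  /** In-memory fake that records the order of calls, used only for illustration. */
  public static class RecordingAdmin implements ClusterAdmin {
    public final List<String> calls = new ArrayList<>();

    @Override
    public void setMaintenanceMode(boolean enabled) {
      calls.add("maintenance:" + enabled);
    }

    @Override
    public void enableInstance(String instanceName, boolean enabled) {
      calls.add("enable:" + instanceName);
    }
  }
}
```

The `finally` block is the design point: a test that fails midway should not leave the cluster stuck in maintenance mode for the tests that follow.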