Coordinator fix balancer stuck #5987

Merged
merged 6 commits into apache:master from clintropolis:coordinator-move-stuck-fix on Jul 12, 2018

Conversation

@clintropolis
Member

commented Jul 9, 2018

This appears to happen when the balancer runs before the metadata manager has polled the metadata database, so the if statement inside the for loop is never satisfied and the loop never exits. I added a maximum iteration count so the balancer gives up and logs about it. I was able to reproduce this in a local debug cluster, which is where I should've caught it in the first place, my bad.
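To illustrate the idea of a bounded loop, here is a minimal sketch, not the actual Druid code. The class name `BoundedBalancerSketch`, the method `moveSegments`, and its parameters are all hypothetical; segments are modeled as plain strings and "available" stands in for what the metadata manager would know after polling.

```java
import java.util.List;
import java.util.Set;

class BoundedBalancerSketch {
    // Hypothetical helper: counts how many candidate segments could be "moved".
    // The counter only advances when the inner condition holds, so without the
    // iteration cap an empty availableSegments set would spin forever.
    public static int moveSegments(List<String> candidates,
                                   Set<String> availableSegments,
                                   int maxSegmentsToMove,
                                   int maxIterations) {
        int moved = 0;
        if (candidates.isEmpty()) {
            return moved;
        }
        for (int iter = 0; moved < maxSegmentsToMove; iter++) {
            if (iter >= maxIterations) {
                // Give up and log instead of looping indefinitely.
                System.err.println("Unable to select " + maxSegmentsToMove
                        + " segments after " + maxIterations + " iterations, giving up");
                break;
            }
            String segment = candidates.get(iter % candidates.size());
            if (availableSegments.contains(segment)) {
                moved++;
            }
        }
        return moved;
    }
}
```

With an empty `availableSegments` set (the "metadata not polled yet" case), the call returns 0 after `maxIterations` passes instead of hanging.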

Fixes #5981

It also fixes the counting of 'moved' and 'unmoved' segments, and optimizes cost calculation by excluding servers that already have a replica of a segment from consideration as a target for moving that segment. Previously, a server that already had the segment would sometimes be selected as the 'best' destination, but the move function would then bail out and do nothing, so the segment was incorrectly counted as 'moved' without any corresponding log.
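The destination filtering can be sketched roughly as follows. This is an illustration under assumptions, not the code in this change: `DestinationFilterSketch`, its nested `Server` type, `isServing`, and `eligibleDestinations` are hypothetical stand-ins, and segments are modeled as string ids.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class DestinationFilterSketch {
    // Hypothetical stand-in for a server and the segments it already serves.
    static final class Server {
        final String name;
        final Set<String> servedSegments;

        Server(String name, Set<String> servedSegments) {
            this.name = name;
            this.servedSegments = servedSegments;
        }

        boolean isServing(String segmentId) {
            return servedSegments.contains(segmentId);
        }
    }

    // Drop servers that already hold the segment before any cost is computed,
    // so such a server can never be picked as the 'best' destination.
    public static List<Server> eligibleDestinations(List<Server> servers, String segmentId) {
        return servers.stream()
                .filter(server -> !server.isServing(segmentId))
                .collect(Collectors.toList());
    }
}
```

Filtering up front also means the cost function runs over fewer servers, and a move that would have been a no-op is never counted as 'moved'.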

Finally, ImmutableDruidServer.equals was broken, but it doesn't appear to be called anywhere other than in this change, so maybe not a big deal (most things deal with ServerHolder, whose .equals picks some properties of ImmutableDruidServer instead of calling its .equals directly).
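For reference, a conventional value-based equals looks something like the sketch below. The exact defect in ImmutableDruidServer.equals is not shown in this conversation, so this is only a generic pattern; the class `ImmutableServer` and its fields `name`, `host`, and `maxSize` are hypothetical and are not the real class's fields.

```java
import java.util.Objects;

class ServerEqualsSketch {
    // Hypothetical immutable value type; the fields are illustrative only.
    static final class ImmutableServer {
        private final String name;
        private final String host;
        private final long maxSize;

        ImmutableServer(String name, String host, long maxSize) {
            this.name = Objects.requireNonNull(name);
            this.host = Objects.requireNonNull(host);
            this.maxSize = maxSize;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (o == null || getClass() != o.getClass()) {
                return false;
            }
            ImmutableServer that = (ImmutableServer) o;
            // Equal only when every compared field matches; inverting any of
            // these checks would make identical servers compare as unequal.
            return maxSize == that.maxSize
                    && name.equals(that.name)
                    && host.equals(that.host);
        }

        @Override
        public int hashCode() {
            // Must stay consistent with equals: same fields in both.
            return Objects.hash(name, host, maxSize);
        }
    }
}
```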

@clintropolis

Member Author

commented Jul 10, 2018

test failure appears related

@clintropolis

Member Author

commented Jul 11, 2018

Current failures seem unrelated

@jon-wei jon-wei merged commit 31c2179 into apache:master Jul 12, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
clintropolis added a commit to implydata/druid that referenced this pull request Jul 18, 2018
Coordinator fix balancer stuck (apache#5987)
* this will fix it

* filter destinations to not consider servers already serving segment

* fix it

* cleanup

* fix opposite day in ImmutableDruidServer.equals

* simplify

@clintropolis clintropolis deleted the clintropolis:coordinator-move-stuck-fix branch Aug 6, 2018

@dclim dclim added this to the 0.13.0 milestone Oct 8, 2018

3 participants