Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Coordinator primary segment assignment fix #5532

Merged
merged 3 commits into from Apr 2, 2018

Conversation

@clintropolis
Copy link
Member

commented Mar 24, 2018

Fixes #5531 by taking segment loading into account when deciding how to place segments, avoiding a bug where primary assignment would assign the segment to all historicals until it is loaded.

Reverts some code removed in #4921 for tracking currently loading segments in SegmentReplicantLookup.

@jon-wei

This comment has been minimized.

Copy link
Contributor

commented Mar 28, 2018

Fix looks good to me, would be nice if we had tests but looks like that's pretty difficult/complicated as of now

@jihoonson
Copy link
Contributor

left a comment

LGTM overall. Can we add a unit test to verify the load rule correctly considers segment loading?

@@ -33,9 +33,12 @@
*/
public class SegmentReplicantLookup
{


This comment has been minimized.

Copy link
@jihoonson

jihoonson Mar 28, 2018

Contributor

Unnecessary lines.

@jon-wei
Copy link
Contributor

left a comment

LGTM after CI

@gianm

This comment has been minimized.

Copy link
Contributor

commented Mar 31, 2018

Saw a failure in this one:

testMoveSegment(io.druid.server.coordinator.CuratorDruidCoordinatorTest)  Time elapsed: 6.71 sec  <<< ERROR!
java.lang.Exception: test timed out after 5000 milliseconds
	at io.druid.server.coordinator.CuratorDruidCoordinatorTest.testMoveSegment(CuratorDruidCoordinatorTest.java:369)

Being optimistic, I restarted the tests. Although this test does seem potentially related to this patch.

@clintropolis

This comment has been minimized.

Copy link
Member Author

commented Apr 2, 2018

@gianm the test was added in #5528, and it also failed in a definitely unrelated PR where i increased the timeout so hopefully it will happen less often. I don't think it's because of a race condition with the curator test server used in the test because I ran it in a loop where it successfully passed ~720 times, with something like this

n=1; while mvn -q -Dcheckstyle.skip=true -Dpmd.skip=true -Dtest=io.druid.server.coordinator.CuratorDruidCoordinatorTest#testMoveSegment -pl server test; do echo "success: $n"; n=$[$n + 1];:; done

but I suppose it's possible it's still racy somewhere. Also possible is that travis just takes a bit longer than my local debug machine took to run the tests, which is why the increased timeout in the other PR.

@gianm
gianm approved these changes Apr 2, 2018
@gianm

This comment has been minimized.

Copy link
Contributor

commented Apr 2, 2018

The test passed after being restarted, so let's commit the patch.

@gianm gianm merged commit 6feac20 into apache:master Apr 2, 2018

2 checks passed

Inspections: pull requests (Druid) TeamCity build finished
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
jon-wei added a commit to implydata/druid that referenced this pull request Apr 4, 2018
Coordinator primary segment assignment fix (apache#5532)
* fix issue where assign primary assigns segments to all historical servers in cluster

* fix test

* add test to ensure primary assignment will not assign to another server while loading is in progress
gianm added a commit to implydata/druid that referenced this pull request May 16, 2018
Coordinator primary segment assignment fix (apache#5532)
* fix issue where assign primary assigns segments to all historical servers in cluster

* fix test

* add test to ensure primary assignment will not assign to another server while loading is in progress

@clintropolis clintropolis deleted the clintropolis:coordinator-primary-assign-fix branch Jun 29, 2018

@jihoonson jihoonson added this to the 0.12.2 milestone Jul 5, 2018

jihoonson added a commit to jihoonson/druid that referenced this pull request Jul 5, 2018
Coordinator primary segment assignment fix (apache#5532)
* fix issue where assign primary assigns segments to all historical servers in cluster

* fix test

* add test to ensure primary assignment will not assign to another server while loading is in progress
fjy added a commit that referenced this pull request Jul 6, 2018
Coordinator primary segment assignment fix (#5532) (#5950)
* fix issue where assign primary assigns segments to all historical servers in cluster

* fix test

* add test to ensure primary assignment will not assign to another server while loading is in progress
leventov added a commit to metamx/druid that referenced this pull request Jul 20, 2018
Coordinator primary segment assignment fix (apache#5532) (apache#5950)
* fix issue where assign primary assigns segments to all historical servers in cluster

* fix test

* add test to ensure primary assignment will not assign to another server while loading is in progress
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.