Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Coordinator primary segment assignment fix #5532

Merged
merged 3 commits into from
Apr 2, 2018

Conversation

clintropolis
Copy link
Member

@clintropolis clintropolis commented Mar 24, 2018

Fixes #5531 by taking segment loading into account when deciding how to place segments, avoiding a bug where primary assignment would assign the segment to all historicals until it is loaded.

Reverts some code removed in #4921 for tracking currently loading segments in SegmentReplicantLookup.

@jon-wei
Copy link
Contributor

jon-wei commented Mar 28, 2018

Fix looks good to me, would be nice if we had tests but looks like that's pretty difficult/complicated as of now

Copy link
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall. Can we add a unit test to verify the load rule correctly considers segment loading?

@@ -33,9 +33,12 @@
*/
public class SegmentReplicantLookup
{


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary lines.

Copy link
Contributor

@jon-wei jon-wei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after CI

@gianm
Copy link
Contributor

gianm commented Mar 31, 2018

Saw a failure in this one:

testMoveSegment(io.druid.server.coordinator.CuratorDruidCoordinatorTest)  Time elapsed: 6.71 sec  <<< ERROR!
java.lang.Exception: test timed out after 5000 milliseconds
	at io.druid.server.coordinator.CuratorDruidCoordinatorTest.testMoveSegment(CuratorDruidCoordinatorTest.java:369)

Being optimistic, I restarted the tests. Although this test does seem potentially related to this patch.

@clintropolis
Copy link
Member Author

@gianm the test was added in #5528, and it also failed in a definitely unrelated PR where i increased the timeout so hopefully it will happen less often. I don't think it's because of a race condition with the curator test server used in the test because I ran it in a loop where it successfully passed ~720 times, with something like this

n=1; while mvn -q -Dcheckstyle.skip=true -Dpmd.skip=true -Dtest=io.druid.server.coordinator.CuratorDruidCoordinatorTest#testMoveSegment -pl server test; do echo "success: $n"; n=$[$n + 1];:; done

but I suppose it's possible it's still racy somewhere. Also possible is that travis just takes a bit longer than my local debug machine took to run the tests, which is why the increased timeout in the other PR.

@gianm
Copy link
Contributor

gianm commented Apr 2, 2018

The test passed after being restarted, so let's commit the patch.

@gianm gianm merged commit 6feac20 into apache:master Apr 2, 2018
jon-wei pushed a commit to implydata/druid-public that referenced this pull request Apr 4, 2018
* fix issue where assign primary assigns segments to all historical servers in cluster

* fix test

* add test to ensure primary assignment will not assign to another server while loading is in progress
gianm pushed a commit to implydata/druid-public that referenced this pull request May 16, 2018
* fix issue where assign primary assigns segments to all historical servers in cluster

* fix test

* add test to ensure primary assignment will not assign to another server while loading is in progress
@clintropolis clintropolis deleted the coordinator-primary-assign-fix branch June 29, 2018 20:10
@jihoonson jihoonson added this to the 0.12.2 milestone Jul 5, 2018
jihoonson pushed a commit to jihoonson/druid that referenced this pull request Jul 5, 2018
* fix issue where assign primary assigns segments to all historical servers in cluster

* fix test

* add test to ensure primary assignment will not assign to another server while loading is in progress
fjy pushed a commit that referenced this pull request Jul 6, 2018
* fix issue where assign primary assigns segments to all historical servers in cluster

* fix test

* add test to ensure primary assignment will not assign to another server while loading is in progress
leventov pushed a commit to metamx/druid that referenced this pull request Jul 20, 2018
* fix issue where assign primary assigns segments to all historical servers in cluster

* fix test

* add test to ensure primary assignment will not assign to another server while loading is in progress
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants