Add `RoundRobinServerSelector` to speed up segment assignments #13367

kfaraz · 2022-11-15T06:31:10Z

Description

Segment assignment can take a very long time when there are a large number of segments, because of the cost computations performed by the balancer strategy.

This PR addresses the issue by assigning segments round-robin within a tier.
Only segment balancing still makes cost-based decisions to move segments around.
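Round-robin assignment within a tier can be sketched roughly as below. This is a minimal illustration, not the RoundRobinServerSelector added in this PR. It assumes servers are identified by plain strings, and `TierRoundRobin` and `pick` are hypothetical names. A cursor persists across calls, and any server that is already serving or loading the segment is skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/**
 * Hypothetical, simplified round-robin selector for a single tier.
 * Servers are plain strings here; a real coordinator would use its own server type.
 */
class TierRoundRobin {
    private final List<String> servers;
    private int cursor = 0; // index of the next server to consider

    TierRoundRobin(List<String> servers) {
        this.servers = new ArrayList<>(servers);
    }

    /**
     * Returns up to {@code replicas} distinct servers for a segment, skipping any
     * server that already serves or is loading it. The cursor is not reset between
     * calls, so each call continues where the previous one stopped.
     */
    List<String> pick(int replicas, Set<String> alreadyHasSegment) {
        List<String> chosen = new ArrayList<>();
        int examined = 0;
        // Look at each server in the tier at most once per call.
        while (chosen.size() < replicas && examined < servers.size()) {
            String candidate = servers.get(cursor);
            cursor = (cursor + 1) % servers.size();
            examined++;
            if (!alreadyHasSegment.contains(candidate)) {
                chosen.add(candidate);
            }
        }
        return chosen;
    }
}
```

With servers `A`, `B`, `C`, two calls asking for two replicas each return `[A, B]` and then `[C, A]`. No cost is computed on this path, which is where the speed-up comes from.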

Changes

Add the dynamic config useRoundRobinSegmentAssignment, with a default value of false.
Add RoundRobinServerSelector. This does not implement BalancerStrategy because it does not conform to that contract, and it may also be used in conjunction with a strategy (round-robin for RunRules and the strategy for BalanceSegments).
Parameterize LoadRuleTest to test segment loading with both the regular balancer strategy and round-robin.
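The split described in these changes (round-robin for assignment when the flag is on, a cost-based strategy for balancing in every case) can be sketched as follows. `AssignmentRouter` and its methods are hypothetical and do not reflect Druid's actual classes. The cost function is assumed to be supplied by the caller as a stand-in for a real balancer strategy.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.function.ToDoubleFunction;

/** Hypothetical sketch of routing decisions on an assignment flag; not Druid's API. */
class AssignmentRouter {
    private final List<String> tier;
    private final boolean useRoundRobinSegmentAssignment;
    private int cursor = 0;

    AssignmentRouter(List<String> tier, boolean useRoundRobinSegmentAssignment) {
        this.tier = List.copyOf(tier);
        this.useRoundRobinSegmentAssignment = useRoundRobinSegmentAssignment;
    }

    /** Assignment (load rules): round-robin if the flag is set, otherwise lowest cost. */
    Optional<String> chooseForAssignment(Set<String> ineligible, ToDoubleFunction<String> cost) {
        if (!useRoundRobinSegmentAssignment) {
            return lowestCost(ineligible, cost);
        }
        for (int i = 0; i < tier.size(); i++) {
            String candidate = tier.get(cursor);
            cursor = (cursor + 1) % tier.size();
            if (!ineligible.contains(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty(); // every server in the tier is ineligible
    }

    /** Balancing stays cost-based regardless of the flag. */
    Optional<String> chooseForBalancing(Set<String> ineligible, ToDoubleFunction<String> cost) {
        return lowestCost(ineligible, cost);
    }

    private Optional<String> lowestCost(Set<String> ineligible, ToDoubleFunction<String> cost) {
        return tier.stream()
                .filter(s -> !ineligible.contains(s))
                .min(Comparator.comparingDouble(cost));
    }
}
```

Keeping the selector separate from the strategy mirrors the reasoning in the PR: the two paths have different contracts, and round-robin is never used for balancing.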

Changes not in this PR

Drops are still cost-based even when round-robin assignment is enabled.

Web-console change

Release note

Add a round-robin segment assignment strategy to speed up initial segment assignments.

Set useRoundRobinSegmentAssignment to true in the coordinator dynamic config to enable this feature.


rohangarg · 2022-11-15T09:28:11Z

Is there a specific reason not to try RandomBalancerStrategy for load rules?

kfaraz · 2022-11-15T10:11:00Z

@rohangarg, RandomBalancerStrategy might not give a near-uniform distribution within a tier (at least in terms of the number of segments).

This is because, even within a tier, the list of eligible servers passed to the RandomBalancerStrategy would be different in every call. Whether a server is eligible for a segment depends on whether it is already serving or loading that segment. Thus the random integer might not give a uniform distribution, since both the range of the output integer and the server at each index change from call to call. The distribution might be even less uniform with multiple datasources that have different numbers of replicas in a tier.

I have not validated this behaviour, though. It might be interesting to see whether the random strategy gives behaviour similar to round-robin. Let me know what you think.

rohangarg · 2022-11-15T10:40:55Z

I think with respect to uniform distribution, even the round-robin allocation might create holes in servers since it also skips them in case a segment is already scheduled on the server.
Although, I've also not compared the behavior of the two strategies. My question was more about that itself: given that RandomBalancerStrategy has been there for some time, is there a specific reason not to pick it?
Also, I think that if RoundRobin strategy is considered superior to Random principally and practically, we should probably implement that as a BalancerStrategy and recommend it over random too.

kfaraz · 2022-11-15T12:09:18Z

Yeah, I thought of making round-robin a balancer strategy but then decided against it as balancing is the one thing it would not be used for. Balancing with round-robin would again cause issues similar to the ones I believe random would face during assignments.
Also, the contracts for the two are fairly different. We could make them conform to each other, I guess.

> Also, I think that if RoundRobin strategy is considered superior to Random principally and practically, we should probably implement that as a BalancerStrategy and recommend it over random too.

I don't think RandomBalancerStrategy is ever recommended for a real use case. It is really a bare-bones implementation, mostly for testing purposes, and is effectively a no-op for balancing, as it always returns null there, meaning "the segment is already optimally placed".

> even the round-robin allocation might create holes in servers since it also skips them in case a segment is already scheduled on the server.

I agree, round-robin would also create holes, but these would be filled in the very next round. After any number of rounds, the distribution would be expected to be near-uniform.

Random would do it too if we always passed the complete list of servers to the strategy (rather than only the eligible servers) and kept generating random numbers until we found an eligible server. The problem is primarily the range of the generated number, which keeps changing, thus defeating the pseudo-randomness. But given enough rounds, random too might attain a uniform distribution.
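The alternative mentioned here, drawing from the complete server list and redrawing until an eligible server comes up, is rejection sampling. A minimal sketch follows, assuming servers are plain strings. `RejectionSamplingPicker` is a hypothetical name and is not part of Druid. The range of the random index stays fixed at the tier size on every draw.

```java
import java.util.List;
import java.util.Random;
import java.util.Set;

/**
 * Hypothetical sketch: pick uniformly from the whole tier and redraw until the
 * candidate is eligible (rejection sampling), so the index range never changes.
 */
class RejectionSamplingPicker {
    private final List<String> tier;
    private final Random random;

    RejectionSamplingPicker(List<String> tier, long seed) {
        this.tier = List.copyOf(tier);
        this.random = new Random(seed);
    }

    /** Returns an eligible server, or null if no server in the tier is eligible. */
    String pick(Set<String> ineligible) {
        // Without this guard the loop below would never terminate.
        if (ineligible.containsAll(tier)) {
            return null;
        }
        while (true) {
            String candidate = tier.get(random.nextInt(tier.size()));
            if (!ineligible.contains(candidate)) {
                return candidate;
            }
        }
    }
}
```

The trade-off is extra draws when most of the tier is ineligible. Round-robin avoids that with at most one pass over the tier per replica.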

But it would be best to validate this hypothesis 🙂. I will write out some simulation tests and share the results here.

rohangarg · 2022-11-15T12:43:16Z

> Yeah, I thought of making round-robin a balancer strategy but then decided against it as balancing is the one thing it would not be used for. Balancing with round-robin would again cause issues similar to the ones I believe random would face during assignments.

I'm not sure how we would balance with a round-robin strategy out of the box, without a cost function. I thought it would be like the random strategy, where balancing is no-op logic. Only the assignment APIs of BalancerStrategy are implemented, using the findNewSegmentHomeReplicator method.
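To make the contract difference in this exchange concrete, here is a heavily simplified, hypothetical interface with separate assignment and move decisions. It is not Druid's actual BalancerStrategy, and the method names are invented for illustration. The random implementation picks a server for new replicas but never proposes a move. An empty `Optional` stands in for the null that, per the discussion above, RandomBalancerStrategy returns to mean "already optimally placed".

```java
import java.util.List;
import java.util.Optional;
import java.util.Random;

/** Hypothetical, heavily simplified contract; not Druid's BalancerStrategy. */
interface SimplifiedStrategy {
    /** Chooses a server to load a new replica on (assignment). */
    Optional<String> findServerForNewReplica(List<String> eligibleServers);

    /** Chooses a server to move an existing replica to (balancing); empty means "leave it". */
    Optional<String> findServerForMove(String currentServer, List<String> eligibleServers);
}

/** Random assignment with no-op balancing, mirroring the behaviour described above. */
class RandomAssignmentNoOpBalancing implements SimplifiedStrategy {
    private final Random random;

    RandomAssignmentNoOpBalancing(long seed) {
        this.random = new Random(seed);
    }

    @Override
    public Optional<String> findServerForNewReplica(List<String> eligibleServers) {
        if (eligibleServers.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(eligibleServers.get(random.nextInt(eligibleServers.size())));
    }

    @Override
    public Optional<String> findServerForMove(String currentServer, List<String> eligibleServers) {
        // No cost function means no basis for a move: treat the segment as optimally placed.
        return Optional.empty();
    }
}
```

A round-robin implementation of the same interface would face the same gap: it has a natural answer for assignment but no principled answer for moves.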

kfaraz · 2022-11-16T07:47:08Z

@rohangarg , I ran some simulation tests with 10 historicals of equal disk capacity.
Round-robin does seem to perform better than the random strategy in the following cases, as seen from the distribution of the number of segments across the historicals.

1. datasource "wiki" with 1k segments, 2 replicas
random:      [186, 192, 192, 195, 199, 202, 207, 207, 210, 210] (max error 7%)
round-robin: [200, 200, 200, 200, 200, 200, 200, 200, 200, 200]

2. datasource "koala" with 10k segments, 2 replicas
random:       [965, 969, 989, 992, 995, 998, 1000, 1020, 1023, 1049] (max error 5%)
round-robin:  [1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000]

3. both "koala" (1 replica) and "wiki" (3 replicas) published together
random:       [1274, 1278, 1292, 1294, 1301, 1302, 1309, 1310, 1317, 1323] (max error 3%)
round-robin:  [1300, 1300, 1300, 1300, 1300, 1300, 1300, 1300, 1300, 1300]
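The "max error" figures above are not defined in the thread. The sketch below assumes they mean the largest relative deviation of any server's count from the mean count. `MaxError` is a hypothetical helper, not part of the PR's tests.

```java
import java.util.Arrays;

/** Hypothetical helper: largest relative deviation from the mean, as a percentage. */
class MaxError {
    static double maxErrorPercent(int[] counts) {
        double mean = Arrays.stream(counts).average().orElse(0.0);
        if (mean == 0.0) {
            return 0.0; // no segments assigned, so there is no error
        }
        double maxDeviation = 0.0;
        for (int c : counts) {
            maxDeviation = Math.max(maxDeviation, Math.abs(c - mean));
        }
        return 100.0 * maxDeviation / mean;
    }
}
```

Under this definition, the first random run comes out at exactly 7% (210 vs. a mean of 200). The second comes out at 4.9%, which rounds to the quoted 5%. Every round-robin run scores 0%.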

I have also included the RoundRobinAssigmentTest in this PR.
To get the above results, you can re-run the tests in that class with round-robin disabled.

Even with random, the error rate reduces with a larger number of segments.
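A rough way to explore this outside the coordinator is a toy simulation like the one below. It is not the RoundRobinAssigmentTest from this PR. It assumes equal-capacity servers, one datasource, and a single round-robin cursor over the whole tier. `AssignmentSimulation` is a hypothetical name. Each replica of a segment goes to a server that does not already hold that segment, chosen either by a random index into the eligible list or by the round-robin cursor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical toy simulation; returns the number of replicas placed on each server. */
class AssignmentSimulation {
    static int[] simulate(int numServers, int numSegments, int replicas, boolean roundRobin, long seed) {
        if (replicas > numServers) {
            throw new IllegalArgumentException("more replicas than servers");
        }
        Random random = new Random(seed);
        int[] counts = new int[numServers];
        int cursor = 0;
        for (int segment = 0; segment < numSegments; segment++) {
            boolean[] holds = new boolean[numServers]; // servers already holding this segment
            for (int r = 0; r < replicas; r++) {
                int chosen;
                if (roundRobin) {
                    // Advance the cursor past servers that already hold this segment.
                    while (holds[cursor]) {
                        cursor = (cursor + 1) % numServers;
                    }
                    chosen = cursor;
                    cursor = (cursor + 1) % numServers;
                } else {
                    // Random index into the currently eligible servers; the range shrinks per replica.
                    List<Integer> eligible = new ArrayList<>();
                    for (int s = 0; s < numServers; s++) {
                        if (!holds[s]) {
                            eligible.add(s);
                        }
                    }
                    chosen = eligible.get(random.nextInt(eligible.size()));
                }
                holds[chosen] = true;
                counts[chosen]++;
            }
        }
        return counts;
    }
}
```

With 10 servers, 1,000 segments and 2 replicas, the round-robin mode places exactly 200 replicas on each server. Running the random mode with increasing segment counts is one way to check how its spread narrows.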

Add RoundRobinServerSelector

d54cf12

kfaraz added the Area - Segment Balancing/Coordination label Nov 15, 2022

kfaraz added 2 commits November 15, 2022 13:43

Fix bug, add web-console changes

1d12d24

Revert extra change

5e84409

abhishekagarwal87 mentioned this pull request Nov 15, 2022

Coordinator Segment Handoff - Is it possible to prioritize new segments from ingestion tasks? #12898

Open

imply-cheddar approved these changes Nov 15, 2022


Prettify web-console info text

7261bfb

AmatyaAvadhanula approved these changes Nov 16, 2022


Add tests

025d601

kfaraz merged commit 71b133f into apache:master Nov 16, 2022

kfaraz deleted the assign_round_robin branch November 16, 2022 14:43

kfaraz added the Release Notes label Nov 21, 2022

kfaraz added this to the 25.0 milestone Nov 21, 2022

kfaraz mentioned this pull request Dec 5, 2022

Docs: Update docs for coordinator dynamic config #13494

Merged

This was referenced Dec 18, 2022

[Draft] 25.0.0 Release Notes #13592

Closed

Add SegmentAllocationQueue to batch allocation actions #13369

Merged

abhishekagarwal87 mentioned this pull request Jan 19, 2023

Druid Version 0.22.1 - Long Coordinator handoff times #13692

Open
