Making optimal usage of multiple segment cache locations #8038

Merged

39 commits merged into apache:master on Sep 28, 2019

Conversation

sashidhar
Contributor

@sashidhar sashidhar commented Jul 6, 2019

Design proposal for #7641.

Description

Making optimal usage of multiple segment cache locations to distribute segments. See #7641 for more details.

Proposed Algorithm

  1. Round-robin algorithm: The algorithm selects segment cache locations in a round-robin fashion to distribute the segments. This, I believe, has the nice property that writes are distributed evenly across the available drives/locations (given enough availability), thereby improving I/O throughput.
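A minimal sketch of such a round-robin selector (the class and method names are hypothetical illustrations, not the actual Druid implementation):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of round-robin location selection.
class RoundRobinSelector {
    private final List<String> locations;          // segment cache paths
    private final AtomicInteger cursor = new AtomicInteger(0);

    RoundRobinSelector(List<String> locations) {
        this.locations = locations;
    }

    // Each call advances the cursor, cycling through the locations so
    // that successive segment loads land on different drives.
    String next() {
        int i = Math.floorMod(cursor.getAndIncrement(), locations.size());
        return locations.get(i);
    }
}
```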

Alternative Algorithms Considered

The following alternative algorithms have been discussed.

  1. Least bytes used algorithm (or least-filled disk) approach: This algorithm picks the location with the least bytes used, which to me seems reasonable in most cases. See Making optimal usage of multiple segment cache locations #8038 (comment). In practice the distribution of segment sizes is not very even, for several reasons (an interval having less or more data, an improperly tuned cluster, etc.). For example, segment sizes across intervals could be anywhere from 100MB to 1GB, with most intervals having very similar segment sizes and a few having outliers of, say, 100MB or 1GB. Suppose we have 3 locations. If a location (location 1) loads a segment of size 1GB, subsequent calls to load smaller segments will be distributed between locations 2 and 3 until both of them reach/cross 1GB. This repeats every time a particular location loads a larger segment. Write throughput might not be optimal in such a scenario, though I'm not sure how much of a problem this is.

  2. Max free size algorithm: Choose the segment cache location with the max free size each time. This algorithm has a possible shortcoming, as explained in Making optimal usage of multiple segment cache locations #8038 (comment).
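The least-bytes-used approach from alternative 1 can be sketched as follows (class and method names are hypothetical, not the actual Druid classes):

```java
import java.util.Map;

// Illustrative sketch of the least-bytes-used strategy.
class LeastBytesUsedSelector {
    private final Map<String, Long> bytesUsed; // location path -> bytes currently stored

    LeastBytesUsedSelector(Map<String, Long> bytesUsed) {
        this.bytesUsed = bytesUsed;
    }

    // Pick the location currently holding the fewest bytes, then account
    // for the new segment so subsequent picks see the updated usage.
    String select(long segmentSizeBytes) {
        String best = bytesUsed.entrySet().stream()
            .min(Map.Entry.comparingByValue())
            .map(Map.Entry::getKey)
            .orElseThrow(IllegalStateException::new);
        bytesUsed.merge(best, segmentSizeBytes, Long::sum);
        return best;
    }
}
```

Note how a single large segment "parks" its location, as described above: after one location takes a 1GB segment, the following smaller segments go to the other locations until they catch up.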

New configuration

This PR introduces an optional new Historical runtime property druid.segmentCache.locationSelectorStrategy to make the segment cache location selection strategy configurable. Possible values for the above property - round-robin, least-bytes-used.
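For illustration, the new property would be set in the Historical's runtime.properties like this (a sketch using the names proposed in this PR; the final property name and values could differ):

```properties
# Proposed Historical runtime property (names per this proposal):
druid.segmentCache.locationSelectorStrategy=least-bytes-used
# or: druid.segmentCache.locationSelectorStrategy=round-robin
```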

Test plan

Unit tests to be added.

Documentation

Documentation needs to be updated with the new property if the location selection strategy is made configurable, along with release notes for the same.

@sashidhar sashidhar changed the title S3 firehose Making optimal usage of multiple segment cache locations Jul 6, 2019
@nishantmonu51
Member

To me, choosing the segment cache location with the max free size instead of round-robin makes more sense. Otherwise, we can make the segment cache location selection strategy configurable and default to max free available.

@@ -102,6 +105,8 @@ public SegmentLoaderLocalCacheManager(
);
}
locations.sort(COMPARATOR);
Member

Looks like we are already trying to sort by the available free size; the issue seems to be that the order is not updated after a segment is loaded. What do you think about sorting the locations after a segment has been loaded? I think that would probably fix the issue in #7641.

Contributor Author

@sashidhar sashidhar Jul 8, 2019

@nishantmonu51, this probably makes sense. However, one case is when the segment cache location max sizes are skewed (one or a few locations with far more availability than others). The sort strategy keeps selecting the same location again and again until its availability falls below the others', which ends up with more or less the same behaviour reported in #7641. Round-robin, on the other hand, will try to distribute the segments across multiple locations, thereby improving I/O if the locations are backed by different physical drives. However, I'm not sure whether the round-robin strategy has any implications for query performance. Let me know your thoughts.

@dclim and others, let us know your thoughts.

@sashidhar
Contributor Author

To me, choosing the segment cache location with the max free size instead of round-robin makes more sense. Otherwise, we can make the segment cache location selection strategy configurable and default to max free available.

I like the idea of making the segment cache location selection strategy configurable.

@dclim
Contributor

dclim commented Jul 10, 2019

Ah, interesting - I thought I remembered that the behavior used to be to select the least-filled disk! Looks like a regression at some point.

@sashidhar I do still think there's value in making the selector strategy configurable to something like round-robin for the reason you mentioned. An example - I was setting up a Druid cluster that had two volumes mounted (let's say they were each 10G and called /mnt and /mnt1). I was also using /mnt for other stuff - as a general scratch drive, storing intermediate indexing files, log files, etc. so I needed to reserve some space for this - let's say I reserved 2G. I had 8G left, so I set the size of the segment cache for /mnt to 8G.

Now, what do I set the size of the segment cache for /mnt1 to? If I set it to 10G to fully utilize the volume and at a point in time have less than 2G of data, it would all be on /mnt1 and potentially wouldn't be maximizing the I/O throughput available. I could instead set it to 8G to be the same as /mnt and that would evenly distribute the segments, but I'd lose those 2G unnecessarily just to coax the algorithm to utilize both locations.

A round-robin strategy (or one that selects the location with the least bytes used in absolute terms instead of relative to the capacity) would have been what I wanted.
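dclim's scenario above can be checked with a little arithmetic (sizes in GB; the class and variable names are purely illustrative):

```java
// Numeric sketch of the /mnt (8G cache) vs /mnt1 (10G cache) scenario.
class FreeSpaceExample {
    // Free-space-based selection compares remaining capacity, so the
    // larger cache keeps winning even while it already holds more data.
    static boolean maxFreePicksLargerCache(double mntMax, double mntUsed,
                                           double mnt1Max, double mnt1Used) {
        return (mnt1Max - mnt1Used) > (mntMax - mntUsed);
    }

    // Least-bytes-used compares absolute usage, so the emptier cache wins.
    static boolean leastUsedPicksEmptierCache(double mntUsed, double mnt1Used) {
        return mntUsed < mnt1Used;
    }
}
```

With /mnt empty and /mnt1 holding 1.5G, max-free still prefers /mnt1 (8.5G free vs 8G free), while least-bytes-used already prefers /mnt.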

@sashidhar
Contributor Author

sashidhar commented Jul 10, 2019

@dclim , @nishantmonu51 Here's what I'm thinking.

As discussed, the segment cache location selector strategy should be configurable. There could be 3 possible strategies currently.

  1. Round-robin selector strategy
  2. Least bytes used selector strategy
  3. Current behaviour

Questions:

  1. Default strategy - Should this be the current behaviour that is in production right now, or one of round-robin or least bytes used?
  2. Property name for the new configuration - how does druid.segmentCache.locationSelectorStrategy sound?
  3. Possible values for the above property - round-robin, least-bytes-used?

Other things to note:

  1. This PR will have to introduce an optional new Historical runtime property.
  2. Documentation for the same and mention in the release notes.

@gianm FYI.

@jihoonson
Contributor

To me, this sounds like a PR that needs a proposal.

@himanshug
Contributor

I think, ideally in all cases, we want to minimize variance(location1_usedSpace, location2_usedSpace, location3_usedSpace, ...), and LeastBytesUsed should achieve that. I can't think of use cases that wouldn't want that.

@sashidhar
Contributor Author

@jihoonson, @himanshug, thanks for your input. Should I raise a separate proposal PR or modify this PR to make it a proposal?

@jihoonson
Contributor

I think this kind of issue needs a proposal before writing code so that the author can avoid unnecessary work. However, in this case, I think you don't have to write a proposal at this point because you have already raised this PR. Still, it would be worth getting a design review from 3 or more committers; I have added the label. Also, please update the PR description accordingly once the design issue is resolved.

@sashidhar
Contributor Author

Updated the description with the proposed algorithm and the alternatives discussed. Round-robin and least-bytes-used both seem reasonable. Please review the design.

@himanshug
Contributor

himanshug commented Jul 11, 2019

It doesn't hurt to make the strategy configurable; however, I think "Least-Bytes-Used" should be the default instead of "Round-Robin".

Suppose we have 3 locations. If a location (location 1) loads a segment of size 1GB, subsequent calls to load smaller segments will be distributed between locations 2 and 3 until both of them reach/cross 1GB. This repeats every time a particular location loads a larger segment. Write throughput might not be optimal in such a scenario, though I'm not sure how much of a problem this is.

Writes happen in one or very few threads, so write throughput is not impacted; on the contrary, it improves read throughput (which has significantly higher concurrency) due to similar space utilization in each location.

Many times users add new segment locations after the node has been in use for a while and already has some data, and then restart the node. With "Round Robin", the newly added location will likely stay underutilized, so Round-Robin wouldn't solve #7641 in that case.

@sashidhar
Contributor Author

sashidhar commented Jul 12, 2019

Many times users add new segment locations after the node has been in use for a while and already has some data, and then restart the node. With "Round Robin", the newly added location will likely stay underutilized, so Round-Robin wouldn't solve #7641 in that case.

Makes sense. It seems to me that the negative case I mentioned for Least-Bytes-Used might not be much of a concern. It makes sense for Least-Bytes-Used to be the default, for the write and read throughput reasons mentioned.

@gianm gianm requested a review from dclim July 19, 2019 21:21
@sashidhar
Contributor Author

@jihoonson, @himanshug, @dclim, @nishantmonu51, have you had a chance to review this?

@dclim
Contributor

dclim commented Jul 25, 2019

hey @sashidhar - your proposal mentioned:

This PR introduces an optional new Historical runtime property druid.segmentCache.locationSelectorStrategy to make the segment cache location selection strategy configurable. Possible values for the above property - round-robin, least-bytes-used.

But I don't see that implemented (I only see the round-robin implementation). Am I missing something? You also said:

It makes sense for the Least-Bytes-Used to be the default for the write and read throughput reasons mentioned.

Which I agree with. Apologies if you were waiting on further confirmation before implementing the Least-Bytes-Used strategy. Between round-robin and Least-Bytes-Used, I would be okay if you just implemented the latter, as I think it would be the right option in the majority of cases, but I would also be okay if you implemented both and had a configuration parameter to select the strategy.

@sashidhar
Contributor Author

sashidhar commented Jul 25, 2019 via email

…rategy impl, round-robin strategy impl, locationSelectorStrategy config with least bytes used strategy as the default strategy
… least bytes used. Adding currSizeBytes() method in StorageLocation.
@sashidhar
Contributor Author

Implemented both strategies and made the strategy configurable. However, there is one implementation glitch due to which SegmentLoaderLocalCacheManagerTest.testRetrySuccessAtSecondLocation() is failing.

Here's the scenario: assume the configured strategy is least bytes used and there are two locations, loc1 and loc2, each on a different disk (disk1 and disk2 respectively), with loc1 having the least bytes used. The strategy picks loc1, but if disk1 fails or is not writable before SegmentLoaderLocalCacheManager loads the segment, the load fails. The strategy has no way (with my impl) to find out that loc1 is bad, so it picks loc1 every time and all segment load attempts fail. What is a clean way to handle this?

*
* @return The storage location to load the given segment into or null if no location has the capacity to store the given segment.
*/
StorageLocation select(DataSegment dataSegment, String storageDirStr);
Contributor

Implemented both strategies and made the strategy configurable. However, there is one implementation glitch due to which SegmentLoaderLocalCacheManagerTest.testRetrySuccessAtSecondLocation() is failing.
Here's the scenario: assume the configured strategy is least bytes used and there are two locations, loc1 and loc2, each on a different disk (disk1 and disk2 respectively), with loc1 having the least bytes used. The strategy picks loc1, but if disk1 fails or is not writable before SegmentLoaderLocalCacheManager loads the segment, the load fails. The strategy has no way (with my impl) to find out that loc1 is bad, so it picks loc1 every time and all segment load attempts fail. What is a clean way to handle this?

Good catch. I think, for that reason, the interface here should be something like:

Iterator<StorageLocation> getLocations(..)

so that the caller can go through all of them as it does currently, and the caller should be responsible for calling the reserve(..) method, not the impls of this interface.
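The iterator-based contract suggested here could look roughly like this, with the caller skipping locations that fail (simplified stand-in types and names, not the actual Druid classes):

```java
import java.util.Iterator;
import java.util.function.Predicate;

// Simplified stand-in for the strategy interface under discussion.
interface LocationSelector {
    Iterator<String> getLocations();
}

class SegmentLoaderSketch {
    private final LocationSelector strategy;

    SegmentLoaderSketch(LocationSelector strategy) {
        this.strategy = strategy;
    }

    // Walk the candidates in strategy order; a location whose disk is
    // unwritable is simply skipped instead of being retried forever.
    String loadSegment(Predicate<String> tryLoad) {
        Iterator<String> it = strategy.getLocations();
        while (it.hasNext()) {
            String loc = it.next();
            if (tryLoad.test(loc)) {
                return loc;
            }
        }
        return null; // no location could store the segment
    }
}
```

This keeps the failure-handling logic in one place (the loader) while each strategy only decides the ordering of candidates.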

Contributor Author

Changed the method contract to return an iterator of StorageLocations as suggested. If the changes look good, I will add a few more tests.

@jihoonson
Contributor

@sashidhar thanks for the quick update! I'll finish my review once #8038 (comment) is addressed. Would you take a look?

@sashidhar
Contributor Author

sashidhar commented Sep 27, 2019

@sashidhar thanks for the quick update! I'll finish my review once #8038 (comment) is addressed. Would you take a look?

Addressed the comment; please review.

*/
private StorageLocation loadSegmentWithRetry(DataSegment segment, String storageDirStr) throws SegmentLoadingException
{
for (StorageLocation loc : locations) {
Iterator<StorageLocation> locationsIterator = strategy.getLocations();
int numLocationsToTry = this.locations.size();
Contributor

numLocationsToTry is not necessary now.

Contributor Author

Oops! Will fix it.

Contributor Author

#8038 (comment) - javadocs, you mean?

Contributor Author

Removed numLocationsToTry and updated the javadocs. Let me know if the description isn't clear or any change is required.

@Override
public Iterator<StorageLocation> getLocations(DataSegment dataSegment, String storageDirStr)
{
return cyclicIterator;
Contributor

Oh, now I know what round robin you want. What I thought was that each caller would get an iterator with a different startIndex, advanced in a round-robin fashion. Okay, your implementation makes sense. Please add a more detailed description of the behavior of this strategy, especially when multiple threads use it.
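The per-caller-offset variant described in this comment could be sketched like this (hypothetical names; the PR's actual implementation instead shares one cyclic iterator):

```java
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Each caller gets its own iterator whose starting index advances in a
// round-robin fashion; the AtomicLong makes concurrent callers start at
// different offsets instead of racing on shared iterator state.
class CyclicLocations {
    private final List<String> locations;
    private final AtomicLong counter = new AtomicLong(0);

    CyclicLocations(List<String> locations) {
        this.locations = locations;
    }

    // Returns an iterator starting at the next round-robin position and
    // covering each location exactly once.
    Iterator<String> iterator() {
        final int start =
            (int) Math.floorMod(counter.getAndIncrement(), (long) locations.size());
        return new Iterator<String>() {
            private int served = 0;

            @Override
            public boolean hasNext() {
                return served < locations.size();
            }

            @Override
            public String next() {
                return locations.get((start + served++) % locations.size());
            }
        };
    }
}
```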

Contributor

@jihoonson jihoonson left a comment

+1 after CI. Thank you @sashidhar!

@sashidhar
Contributor Author

Thanks a lot @jihoonson for your thorough and patient review.

@jihoonson
Contributor

@dclim @himanshug @nishantmonu51 do you have more comments?

* https://github.com/apache/incubator-druid/pull/8038#discussion_r325520829 of PR
* https://github.com/apache/incubator-druid/pull/8038 for more details.
*/
@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "tier", defaultImpl = LeastBytesUsedStorageLocationSelectorStrategy.class)
Contributor

Is property = "tier" here required, or is it copy/pasted from another location (like TierSelectorStrategy)?

Contributor

Good catch; it should probably be "type".

Contributor Author

@dclim, let me know if it needs to be removed or changed to type.

Contributor Author

Changed it to type.

localStorageFolder1, loc1.getPath());

StorageLocation loc2 = locations.next();
Assert.assertEquals("The next element of the iterator should point to path local_storage_folder_1",
Contributor

The assert message is wrong here

Contributor Author

Fixed.

localStorageFolder2, loc2.getPath());

StorageLocation loc3 = locations.next();
Assert.assertEquals("The next element of the iterator should point to path local_storage_folder_1",
Contributor

The assert message is wrong here

Contributor Author

Fixed.

@dclim
Contributor

dclim commented Sep 27, 2019

+1 after minor assert message change

@dclim
Contributor

dclim commented Sep 27, 2019

Thank you @sashidhar and @jihoonson for working through this

@dclim dclim merged commit 51a7235 into apache:master Sep 28, 2019
@sashidhar
Contributor Author

Thanks @jihoonson, @himanshug, @dclim, @nishantmonu51 for your review and suggestions.

@sashidhar sashidhar deleted the s3_firehose branch September 28, 2019 09:37
@sashidhar
Contributor Author

Please add Release Notes label as this PR introduces a new Historical runtime property.

@jihoonson
Contributor

@sashidhar oh yes, I added it. Thank you!

Fokko pushed a commit to Fokko/druid that referenced this pull request Sep 30, 2019
* apache#7641 - Changing segment distribution algorithm to distribute segments to multiple segment cache locations

* Fixing indentation

* WIP

* Adding interface for location strategy selection, least bytes used strategy impl, round-robin strategy impl, locationSelectorStrategy config with least bytes used strategy as the default strategy

* fixing code style

* Fixing test

* Adding a method visible only for testing, fixing tests

* 1. Changing the method contract to return an iterator of locations instead of a single best location. 2. Check style fixes

* fixing the conditional statement

* Added testSegmentDistributionUsingLeastBytesUsedStrategy, fixed testSegmentDistributionUsingRoundRobinStrategy

* to trigger CI build

* Add documentation for the selection strategy configuration

* to re trigger CI build

* updated docs as per review comments, made LeastBytesUsedStorageLocationSelectorStrategy.getLocations a synchronized method, other minor fixes

* In checkLocationConfigForNull method, using getLocations() to check for null instead of directly referring to the locations variable so that tests overriding getLocations() method do not fail

* Implementing review comments. Added tests for StorageLocationSelectorStrategy

* Checkstyle fixes

* Adding java doc comments for StorageLocationSelectorStrategy interface

* checkstyle

* empty commit to retrigger build

* Empty commit

* Adding suppressions for words leastBytesUsed and roundRobin of ../docs/configuration/index.md file

* Impl review comments including updating docs as suggested

* Removing checkLocationConfigForNull(), @NotEmpty annotation serves the purpose

* Round robin iterator to keep track of the no. of iterations, impl review comments, added tests for round robin strategy

* Fixing the round robin iterator

* Removed numLocationsToTry, updated java docs

* changing property attribute value from tier to type

* Fixing assert messages
gianm pushed a commit to implydata/druid-public that referenced this pull request Oct 1, 2019
@gianm gianm added this to the 0.17.0 milestone Oct 10, 2019
@jon-wei jon-wei mentioned this pull request Dec 28, 2019