Making optimal usage of multiple segment cache locations #8038

Merged

39 commits merged into apache:master on Sep 28, 2019

Conversation

sashidhar
Contributor

@sashidhar sashidhar commented Jul 6, 2019

Design proposal for #7641.

Description

Making optimal usage of multiple segment cache locations to distribute segments. See #7641 for more details.

Proposed Algorithm

  1. Round-robin algorithm: The algorithm selects segment cache locations in a round-robin fashion to distribute the segments. This, I believe, has the nice property that writes are distributed evenly across the available drives/locations (given enough availability), thereby improving I/O throughput.
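A minimal sketch of such a round-robin selector (the class and method names are hypothetical illustrations, not the actual Druid implementation):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of round-robin location selection.
class RoundRobinSelector {
    private final List<String> locations;          // segment cache paths
    private final AtomicInteger cursor = new AtomicInteger(0);

    RoundRobinSelector(List<String> locations) {
        this.locations = locations;
    }

    // Each call advances the cursor, cycling through the locations so
    // that successive segment loads land on different drives.
    String next() {
        int i = Math.floorMod(cursor.getAndIncrement(), locations.size());
        return locations.get(i);
    }
}
```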

Alternative Algorithms Considered

The following alternative algorithms have been discussed.

  1. Least bytes used algorithm (or least-filled disk) approach: This algorithm picks the location with the least bytes used, which to me seems reasonable in most cases. See Making optimal usage of multiple segment cache locations #8038 (comment). In practice the distribution of segment sizes is not very even, for several reasons (an interval having less or more data, an improperly tuned cluster, etc.). For example, segment sizes across intervals could be anywhere from 100MB to 1GB, with most intervals having very similar segment sizes and a few having outliers of, say, 100MB or 1GB. Suppose we have 3 locations. If a location (location 1) loads a segment of size 1GB, subsequent calls to load smaller segments will be distributed between locations 2 and 3 until both of them reach/cross 1GB. This repeats every time a particular location loads a larger segment. Write throughput might not be optimal in such a scenario, though I'm not sure how much of a problem this is.

  2. Max free size algorithm: Choose the segment cache location with the max free size each time. This algorithm has a possible shortcoming, as explained in Making optimal usage of multiple segment cache locations #8038 (comment).
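The least-bytes-used approach from alternative 1 can be sketched as follows (class and method names are hypothetical, not the actual Druid classes):

```java
import java.util.Map;

// Illustrative sketch of the least-bytes-used strategy.
class LeastBytesUsedSelector {
    private final Map<String, Long> bytesUsed; // location path -> bytes currently stored

    LeastBytesUsedSelector(Map<String, Long> bytesUsed) {
        this.bytesUsed = bytesUsed;
    }

    // Pick the location currently holding the fewest bytes, then account
    // for the new segment so subsequent picks see the updated usage.
    String select(long segmentSizeBytes) {
        String best = bytesUsed.entrySet().stream()
            .min(Map.Entry.comparingByValue())
            .map(Map.Entry::getKey)
            .orElseThrow(IllegalStateException::new);
        bytesUsed.merge(best, segmentSizeBytes, Long::sum);
        return best;
    }
}
```

Note how a single large segment "parks" its location, as described above: after one location takes a 1GB segment, the following smaller segments go to the other locations until they catch up.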

New configuration

This PR introduces an optional new Historical runtime property druid.segmentCache.locationSelectorStrategy to make the segment cache location selection strategy configurable. Possible values for the above property - round-robin, least-bytes-used.
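For illustration, the new property would be set in the Historical's runtime.properties like this (a sketch using the names proposed in this PR; the final property name and values could differ):

```properties
# Proposed Historical runtime property (names per this proposal):
druid.segmentCache.locationSelectorStrategy=least-bytes-used
# or: druid.segmentCache.locationSelectorStrategy=round-robin
```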

Test plan

Unit tests to be added.

Documentation

Documentation needs to be updated with the new property if the location selection strategy is made configurable, along with release notes for the same.

@sashidhar sashidhar changed the title S3 firehose Making optimal usage of multiple segment cache locations Jul 6, 2019
@nishantmonu51
Member

To me, choosing the segment cache location with the max free size instead of round-robin makes more sense. Otherwise, we can make the segment cache location selection strategy configurable and default to max free available.

@@ -102,6 +105,8 @@ public SegmentLoaderLocalCacheManager(
);
}
locations.sort(COMPARATOR);
Member

Looks like we are already trying to sort by the available free size; the issue seems to be that the order is not updated after a segment is loaded. What do you think about sorting the locations after a segment has been loaded? I think that would probably fix the issue in #7641.

Contributor Author

@sashidhar sashidhar Jul 8, 2019

@nishantmonu51, this probably makes sense. However, one case is when the segment cache location max sizes are skewed (one or a few locations with far more availability than others). The sort strategy keeps selecting the same location again and again until its availability falls below the others', which ends up with more or less the same behaviour reported in #7641. Round-robin, on the other hand, will try to distribute the segments across multiple locations, thereby improving I/O if the locations are backed by different physical drives. However, I'm not sure whether the round-robin strategy has any implications for query performance. Let me know your thoughts.

@dclim and others, let us know your thoughts.

@sashidhar
Contributor Author

To me, choosing the segment cache location with the max free size instead of round-robin makes more sense. Otherwise, we can make the segment cache location selection strategy configurable and default to max free available.

I like the idea of making the segment cache location selection strategy configurable.

@dclim
Contributor

dclim commented Jul 10, 2019

Ah, interesting - I thought I remembered that the behavior used to be to select the least-filled disk! Looks like a regression at some point.

@sashidhar I do still think there's value in making the selector strategy configurable to something like round-robin for the reason you mentioned. An example - I was setting up a Druid cluster that had two volumes mounted (let's say they were each 10G and called /mnt and /mnt1). I was also using /mnt for other stuff - as a general scratch drive, storing intermediate indexing files, log files, etc. so I needed to reserve some space for this - let's say I reserved 2G. I had 8G left, so I set the size of the segment cache for /mnt to 8G.

Now, what do I set the size of the segment cache for /mnt1 to? If I set it to 10G to fully utilize the volume and at a point in time have less than 2G of data, it would all be on /mnt1 and potentially wouldn't be maximizing the I/O throughput available. I could instead set it to 8G to be the same as /mnt and that would evenly distribute the segments, but I'd lose those 2G unnecessarily just to coax the algorithm to utilize both locations.

A round-robin strategy (or one that selects the location with the least bytes used in absolute terms instead of relative to the capacity) would have been what I wanted.
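dclim's scenario above can be checked with a little arithmetic (sizes in GB; the class and variable names are purely illustrative):

```java
// Numeric sketch of the /mnt (8G cache) vs /mnt1 (10G cache) scenario.
class FreeSpaceExample {
    // Free-space-based selection compares remaining capacity, so the
    // larger cache keeps winning even while it already holds more data.
    static boolean maxFreePicksLargerCache(double mntMax, double mntUsed,
                                           double mnt1Max, double mnt1Used) {
        return (mnt1Max - mnt1Used) > (mntMax - mntUsed);
    }

    // Least-bytes-used compares absolute usage, so the emptier cache wins.
    static boolean leastUsedPicksEmptierCache(double mntUsed, double mnt1Used) {
        return mntUsed < mnt1Used;
    }
}
```

With /mnt empty and /mnt1 holding 1.5G, max-free still prefers /mnt1 (8.5G free vs 8G free), while least-bytes-used already prefers /mnt.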

@sashidhar
Contributor Author

sashidhar commented Jul 10, 2019

@dclim , @nishantmonu51 Here's what I'm thinking.

As discussed, the segment cache location selector strategy should be configurable. There could be 3 possible strategies currently.

  1. Round-robin selector strategy
  2. Least bytes used selector strategy
  3. Current behaviour

Questions:

  1. Default strategy - Should this be the current behaviour that is in production right now, or one of round-robin or least bytes used?
  2. Property name for the new configuration - how does druid.segmentCache.locationSelectorStrategy sound?
  3. Possible values for the above property - round-robin, least-bytes-used?

Other things to note:

  1. This PR will have to introduce an optional new Historical runtime property.
  2. Documentation for the same and mention in the release notes.

@gianm FYI.

@jihoonson
Contributor

To me, this sounds like a PR that needs a proposal.

@himanshug
Contributor

I think, ideally in all cases, we want to minimize variance(location1_usedSpace, location2_usedSpace, location3_usedSpace, ...), and LeastBytesUsed should achieve that. I can't think of use cases that wouldn't want that.

@sashidhar
Contributor Author

@jihoonson, @himanshug, thanks for your input. Should I raise a separate proposal PR or modify this PR to make it a proposal?

@jihoonson
Contributor

I think this kind of issue needs a proposal before writing code so that the author can avoid unnecessary work. However, in this case, I think you don't have to write a proposal at this point because you have already raised this PR. Still, it would be worth getting a design review from 3 or more committers; I have added the label. Also, please update the PR description accordingly once the design issue is resolved.

@sashidhar
Contributor Author

Updated the description with the proposed algorithm and the alternatives discussed. Round-robin and least-bytes-used both seem reasonable. Please review the design.

@himanshug
Contributor

himanshug commented Jul 11, 2019

It doesn't hurt to make the strategy configurable; however, I think "Least-Bytes-Used" should be the default instead of "Round-Robin".

Suppose we have 3 locations. If a location (location 1) loads a segment of size 1GB, subsequent calls to load smaller segments will be distributed between locations 2 and 3 until both of them reach/cross 1GB. This repeats every time a particular location loads a larger segment. Write throughput might not be optimal in such a scenario, though I'm not sure how much of a problem this is.

Writes happen in one or very few threads, so write throughput is not impacted; on the contrary, it improves read throughput (which has significantly higher concurrency) due to similar space utilization in each location.

Many times users add new segment locations after the node has been in use for a while and already has some data, and then restart the node. With "Round Robin", the newly added location will likely stay underutilized, so Round-Robin wouldn't solve #7641 in that case.

@sashidhar
Contributor Author

sashidhar commented Jul 12, 2019

Many times users add new segment locations after the node has been in use for a while and already has some data, and then restart the node. With "Round Robin", the newly added location will likely stay underutilized, so Round-Robin wouldn't solve #7641 in that case.

Makes sense. It seems to me that the negative case I mentioned for Least-Bytes-Used might not be much of a concern. It makes sense for Least-Bytes-Used to be the default, for the write and read throughput reasons mentioned.

@gianm gianm requested a review from dclim July 19, 2019 21:21
@sashidhar
Contributor Author

@jihoonson, @himanshug, @dclim, @nishantmonu51, have you had a chance to review this?

@dclim
Contributor

dclim commented Jul 25, 2019

hey @sashidhar - your proposal mentioned:

This PR introduces an optional new Historical runtime property druid.segmentCache.locationSelectorStrategy to make the segment cache location selection strategy configurable. Possible values for the above property - round-robin, least-bytes-used.

But I don't see that implemented (I only see the round-robin implementation). Am I missing something? You also said:

It makes sense for the Least-Bytes-Used to be the default for the write and read throughput reasons mentioned.

Which I agree with. Apologies if you were waiting on further confirmation before implementing the Least-Bytes-Used strategy. Between round-robin and Least-Bytes-Used, I would be okay if you just implemented the latter, as I think it would be the right option in the majority of cases, but I would also be okay if you implemented both and had a configuration parameter to select the strategy.

@sashidhar
Contributor Author

sashidhar commented Jul 25, 2019 via email

…rategy impl, round-robin strategy impl, locationSelectorStrategy config with least bytes used strategy as the default strategy
… least bytes used. Adding currSizeBytes() method in StorageLocation.
@sashidhar
Contributor Author

Implemented both strategies and made the strategy configurable. However, there is one implementation glitch due to which SegmentLoaderLocalCacheManagerTest.testRetrySuccessAtSecondLocation() is failing.

Here's the scenario: assume the configured strategy is least bytes used and there are two locations, loc1 and loc2, each on a different disk (disk1 and disk2 respectively), with loc1 having the least bytes used. The strategy picks loc1, but if disk1 fails or is not writable before SegmentLoaderLocalCacheManager loads the segment, the load fails. The strategy has no way (with my impl) to find out that loc1 is bad, so it picks loc1 every time and all segment load attempts fail. What is a clean way to handle this?

*
* @return The storage location to load the given segment into or null if no location has the capacity to store the given segment.
*/
StorageLocation select(DataSegment dataSegment, String storageDirStr);
Contributor

Implemented both strategies and made the strategy configurable. However, there is one implementation glitch due to which SegmentLoaderLocalCacheManagerTest.testRetrySuccessAtSecondLocation() is failing.
Here's the scenario: assume the configured strategy is least bytes used and there are two locations, loc1 and loc2, each on a different disk (disk1 and disk2 respectively), with loc1 having the least bytes used. The strategy picks loc1, but if disk1 fails or is not writable before SegmentLoaderLocalCacheManager loads the segment, the load fails. The strategy has no way (with my impl) to find out that loc1 is bad, so it picks loc1 every time and all segment load attempts fail. What is a clean way to handle this?

Good catch. I think, for that reason, the interface here should be something like:

Iterator<StorageLocation> getLocations(..)

so that the caller can go through all of them as it does currently, and the caller should be responsible for calling the reserve(..) method, not the impls of this interface.
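The iterator-based contract suggested here could look roughly like this, with the caller skipping locations that fail (simplified stand-in types and names, not the actual Druid classes):

```java
import java.util.Iterator;
import java.util.function.Predicate;

// Simplified stand-in for the strategy interface under discussion.
interface LocationSelector {
    Iterator<String> getLocations();
}

class SegmentLoaderSketch {
    private final LocationSelector strategy;

    SegmentLoaderSketch(LocationSelector strategy) {
        this.strategy = strategy;
    }

    // Walk the candidates in strategy order; a location whose disk is
    // unwritable is simply skipped instead of being retried forever.
    String loadSegment(Predicate<String> tryLoad) {
        Iterator<String> it = strategy.getLocations();
        while (it.hasNext()) {
            String loc = it.next();
            if (tryLoad.test(loc)) {
                return loc;
            }
        }
        return null; // no location could store the segment
    }
}
```

This keeps the failure-handling logic in one place (the loader) while each strategy only decides the ordering of candidates.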

Contributor Author

Changed the method contract to return an iterator of StorageLocations as suggested. If the changes look good, I will add a few more tests.

@jihoonson
Contributor

@sashidhar thanks for the quick update! I'll finish my review once #8038 (comment) is addressed. Would you take a look?

@sashidhar
Contributor Author

sashidhar commented Sep 27, 2019

@sashidhar thanks for the quick update! I'll finish my review once #8038 (comment) is addressed. Would you take a look?

Addressed the comment; please review.

*/
private StorageLocation loadSegmentWithRetry(DataSegment segment, String storageDirStr) throws SegmentLoadingException
{
for (StorageLocation loc : locations) {
Iterator<StorageLocation> locationsIterator = strategy.getLocations();
int numLocationsToTry = this.locations.size();
Contributor

numLocationsToTry is not necessary now.

Contributor Author

Oops! Will fix it.

Contributor Author

#8038 (comment) - javadocs, you mean?

Contributor Author

Removed numLocationsToTry and updated the javadocs. Let me know if the description isn't clear or any change is required.

@Override
public Iterator<StorageLocation> getLocations(DataSegment dataSegment, String storageDirStr)
{
return cyclicIterator;
Contributor

Oh, now I know what round robin you want. What I thought was that each caller would get an iterator with a different startIndex, advanced in a round-robin fashion. Okay, your implementation makes sense. Please add a more detailed description of the behavior of this strategy, especially when multiple threads use it.
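The per-caller-offset variant described in this comment could be sketched like this (hypothetical names; the PR's actual implementation instead shares one cyclic iterator):

```java
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Each caller gets its own iterator whose starting index advances in a
// round-robin fashion; the AtomicLong makes concurrent callers start at
// different offsets instead of racing on shared iterator state.
class CyclicLocations {
    private final List<String> locations;
    private final AtomicLong counter = new AtomicLong(0);

    CyclicLocations(List<String> locations) {
        this.locations = locations;
    }

    // Returns an iterator starting at the next round-robin position and
    // covering each location exactly once.
    Iterator<String> iterator() {
        final int start =
            (int) Math.floorMod(counter.getAndIncrement(), (long) locations.size());
        return new Iterator<String>() {
            private int served = 0;

            @Override
            public boolean hasNext() {
                return served < locations.size();
            }

            @Override
            public String next() {
                return locations.get((start + served++) % locations.size());
            }
        };
    }
}
```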

Contributor

@jihoonson jihoonson left a comment

+1 after CI. Thank you @sashidhar!

@sashidhar
Contributor Author

Thanks a lot @jihoonson for your thorough and patient review.

@jihoonson
Contributor

@dclim @himanshug @nishantmonu51 do you have more comments?

* https://github.com/apache/incubator-druid/pull/8038#discussion_r325520829 of PR
* https://github.com/apache/incubator-druid/pull/8038 for more details.
*/
@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "tier", defaultImpl = LeastBytesUsedStorageLocationSelectorStrategy.class)
Contributor

Is property = "tier" here required, or is it copy/pasted from another location (like TierSelectorStrategy)?

Contributor

Good catch; it should probably be "type".

Contributor Author

@dclim, let me know if it needs to be removed or changed to type.

Contributor Author

Changed it to type.

localStorageFolder1, loc1.getPath());

StorageLocation loc2 = locations.next();
Assert.assertEquals("The next element of the iterator should point to path local_storage_folder_1",
Contributor

The assert message is wrong here

Contributor Author

Fixed.

localStorageFolder2, loc2.getPath());

StorageLocation loc3 = locations.next();
Assert.assertEquals("The next element of the iterator should point to path local_storage_folder_1",
Contributor

The assert message is wrong here

Contributor Author

Fixed.

@dclim
Contributor

dclim commented Sep 27, 2019

+1 after minor assert message change

@dclim
Contributor

dclim commented Sep 27, 2019

Thank you @sashidhar and @jihoonson for working through this

@dclim dclim merged commit 51a7235 into apache:master Sep 28, 2019
@sashidhar
Contributor Author

Thanks @jihoonson, @himanshug, @dclim, @nishantmonu51 for your review and suggestions.

@sashidhar sashidhar deleted the s3_firehose branch September 28, 2019 09:37
@sashidhar
Contributor Author

Please add Release Notes label as this PR introduces a new Historical runtime property.

@jihoonson
Contributor

@sashidhar oh yes, I added it. Thank you!

Fokko pushed a commit to Fokko/druid that referenced this pull request Sep 30, 2019
* apache#7641 - Changing segment distribution algorithm to distribute segments to multiple segment cache locations

* Fixing indentation

* WIP

* Adding interface for location strategy selection, least bytes used strategy impl, round-robin strategy impl, locationSelectorStrategy config with least bytes used strategy as the default strategy

* fixing code style

* Fixing test

* Adding a method visible only for testing, fixing tests

* 1. Changing the method contract to return an iterator of locations instead of a single best location. 2. Check style fixes

* fixing the conditional statement

* Added testSegmentDistributionUsingLeastBytesUsedStrategy, fixed testSegmentDistributionUsingRoundRobinStrategy

* to trigger CI build

* Add documentation for the selection strategy configuration

* to re trigger CI build

* updated docs as per review comments, made LeastBytesUsedStorageLocationSelectorStrategy.getLocations a synchronized method, other minor fixes

* In checkLocationConfigForNull method, using getLocations() to check for null instead of directly referring to the locations variable so that tests overriding getLocations() method do not fail

* Implementing review comments. Added tests for StorageLocationSelectorStrategy

* Checkstyle fixes

* Adding java doc comments for StorageLocationSelectorStrategy interface

* checkstyle

* empty commit to retrigger build

* Empty commit

* Adding suppressions for words leastBytesUsed and roundRobin of ../docs/configuration/index.md file

* Impl review comments including updating docs as suggested

* Removing checkLocationConfigForNull(), @NotEmpty annotation serves the purpose

* Round robin iterator to keep track of the no. of iterations, impl review comments, added tests for round robin strategy

* Fixing the round robin iterator

* Removed numLocationsToTry, updated java docs

* changing property attribute value from tier to type

* Fixing assert messages
gianm pushed a commit to implydata/druid-public that referenced this pull request Oct 1, 2019
@gianm gianm added this to the 0.17.0 milestone Oct 10, 2019
@jon-wei jon-wei mentioned this pull request Dec 28, 2019