Improve heap usage for IncrementalIndex #2228

nishantmonu51 · 2016-01-07T18:49:04Z

With the current code, OnheapIncrementalIndex ends up creating a new ColumnSelectorFactory object (24 bytes) and a new ColumnSelector object (24 bytes) for every aggregator in each Druid row.

This means an overhead of 48 bytes * number of aggregators per row, which becomes significant as the number of aggregators increases. For example, for 1M rows with 20 aggregators each, it comes to roughly 960 MB (48 bytes * 20 * 1M).

This PR aims to remove this overhead by caching the selector objects, so that the ColumnSelectorFactory and ColumnSelector instances are reused instead of being recreated.

To measure the impact on heap usage for aggregators, I created an IncrementalIndex with 1M rows, each having 20 longSum aggregators and 1 dimension, and got an overall reduction in heap size from 1.9G to 1G (~50% improvement).

The actual improvement in index size will vary with the distribution of the number of aggregators and dimensions in the IncrementalIndex.
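
The caching idea described above can be illustrated with a minimal sketch. It is not the code from this PR: `LongSelector`, `SelectorFactory`, `RowHolder`, and the other names below are simplified, hypothetical stand-ins rather than Druid's actual interfaces. The sketch assumes that a selector reads through a mutable row holder, so one selector per column can be reused across rows, and it only counts how many selector objects get allocated with and without a cache.

```java
import java.util.HashMap;
import java.util.Map;

class SelectorCachingSketch {
    /** Hypothetical stand-in for a selector that reads one long metric. */
    interface LongSelector {
        long get();
    }

    /** Hypothetical stand-in for a factory that hands out selectors by column name. */
    interface SelectorFactory {
        LongSelector makeLongSelector(String column);
    }

    /** Mutable holder for the row currently being aggregated. */
    static final class RowHolder {
        Map<String, Long> current = new HashMap<>();
    }

    /** Selector bound to a holder and a column, not to any particular row. */
    static final class HolderBackedSelector implements LongSelector {
        private final RowHolder holder;
        private final String column;

        HolderBackedSelector(RowHolder holder, String column) {
            this.holder = holder;
            this.column = column;
        }

        @Override
        public long get() {
            Long value = holder.current.get(column);
            return value == null ? 0L : value;
        }
    }

    /** Allocates a new selector on every call and counts the allocations. */
    static final class AllocatingFactory implements SelectorFactory {
        private final RowHolder holder;
        int created = 0;

        AllocatingFactory(RowHolder holder) {
            this.holder = holder;
        }

        @Override
        public LongSelector makeLongSelector(String column) {
            created++;
            return new HolderBackedSelector(holder, column);
        }
    }

    /** Wraps another factory and hands back one cached selector per column. */
    static final class CachingFactory implements SelectorFactory {
        private final SelectorFactory delegate;
        private final Map<String, LongSelector> cache = new HashMap<>();

        CachingFactory(SelectorFactory delegate) {
            this.delegate = delegate;
        }

        @Override
        public LongSelector makeLongSelector(String column) {
            LongSelector selector = cache.get(column);
            if (selector == null) {
                selector = delegate.makeLongSelector(column);
                cache.put(column, selector);
            }
            return selector;
        }
    }

    /** Simulates aggregating rows and returns how many selectors were allocated. */
    public static int countAllocations(boolean cached, int rows, int aggs) {
        RowHolder holder = new RowHolder();
        AllocatingFactory allocating = new AllocatingFactory(holder);
        SelectorFactory factory = cached ? new CachingFactory(allocating) : allocating;
        for (int r = 0; r < rows; r++) {
            Map<String, Long> row = new HashMap<>();
            for (int a = 0; a < aggs; a++) {
                row.put("metric" + a, (long) r);
            }
            holder.current = row; // cached selectors see the new row through the holder
            for (int a = 0; a < aggs; a++) {
                factory.makeLongSelector("metric" + a).get();
            }
        }
        return allocating.created;
    }

    public static void main(String[] args) {
        System.out.println("uncached: " + countAllocations(false, 1000, 20)); // prints "uncached: 20000"
        System.out.println("cached:   " + countAllocations(true, 1000, 20));  // prints "cached:   20"
    }
}
```

In this sketch the uncached count grows with rows * aggregators, while the cached count stays at the number of distinct columns. The single-threaded `HashMap` cache is a simplification for illustration only.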

fjy · 2016-01-07T23:10:17Z

👍 I think this looks cool and it is crazy we missed this optimization :P

jon-wei · 2016-01-07T23:31:05Z

👍 looks good to me

xvrl · 2016-01-07T23:31:20Z

processing/src/main/java/io/druid/segment/incremental/OnheapIncrementalIndex.java

+  // Caches references to selector objects for each column instead of creating a new object each time in order to save heap space.
+  private static class ObjectCachingColumnSelectorFactory implements ColumnSelectorFactory
+  {
+    private final ConcurrentMap<String, LongColumnSelector> longColumnSelectorMap = Maps.newConcurrentMap();


Do those need to be concurrent maps? I don't think ColumnSelectors are thread-safe, but I also don't think we ever share ColumnSelectorFactory instances across multiple threads.

navis · 2016-01-07

How about using Guava's Cache, which also supports various options, including an expiration policy?

nishantmonu51 · 2016-01-08

It can be accessed by multiple threads in the case of groupBy queries.
I didn't use Guava Cache because we don't need expiration policies here: if entries expired, we would end up creating multiple selector objects.

xvrl · 2016-01-09

@nishantmonu51 Currently the ColumnSelectorFactory.makeXXXXColumnSelector methods (such as this one) are not thread-safe; this may be an oversight in the groupBy query engine. The resulting ColumnSelectors may be thread-safe, but the methods that create them are not. We may want to investigate what needs to be done there.

This doesn't need to be addressed in this PR, but I think the guarantees provided by the different methods are not clear. Maybe @cheddar has some input?

nishantmonu51 · 2016-01-09

For groupBy, I was referring specifically to the case where multiple threads add rows to the same IncrementalIndex; there, the add method of IncrementalIndex needs to be thread-safe. Traversing the rows of a segment with an XXXColumnSelector happens on a single thread and doesn't need these guarantees.

@cheddar any suggestions on whether we need to make it thread-safe or not?
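
The concurrency concern in this thread can be made concrete with a small sketch. It is not the PR's implementation: `Selector` and `getOrCreate` are hypothetical names, and the sketch uses `java.util.concurrent.ConcurrentHashMap` directly. It assumes several threads ask for the selector of the same column at once, and shows that a get-then-`putIfAbsent` pattern on a `ConcurrentMap` hands every thread the same instance without expiring or replacing entries.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ConcurrentSelectorCacheSketch {
    /** Hypothetical selector placeholder; identity is what matters here. */
    static final class Selector {
        final String column;

        Selector(String column) {
            this.column = column;
        }
    }

    private final ConcurrentMap<String, Selector> cache = new ConcurrentHashMap<>();

    /** Returns the one cached selector for a column, creating it if needed. */
    Selector getOrCreate(String column) {
        Selector existing = cache.get(column);
        if (existing != null) {
            return existing;
        }
        Selector created = new Selector(column);
        // putIfAbsent is atomic: if another thread won the race, use its instance.
        Selector raced = cache.putIfAbsent(column, created);
        return raced != null ? raced : created;
    }

    /** Lets several threads request the same column and counts distinct instances seen. */
    public static int distinctInstancesSeen(int threads) {
        final ConcurrentSelectorCacheSketch sketch = new ConcurrentSelectorCacheSketch();
        // Selector does not override equals/hashCode, so this set is identity-based.
        final Set<Selector> seen = ConcurrentHashMap.newKeySet();
        final CountDownLatch start = new CountDownLatch(1);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                start.await(); // release all workers at roughly the same time
                seen.add(sketch.getOrCreate("metric0"));
                return null;
            });
        }
        start.countDown();
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctInstancesSeen(8)); // prints "1"
    }
}
```

A losing thread may briefly allocate a selector that is then discarded, but every caller ends up sharing one instance per column. An expiring cache would not give that guarantee, which matches the reasoning given above for not using Guava Cache.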


nishantmonu51 · 2016-01-12T19:16:53Z

@xvrl Added code comments about thread safety.

xvrl · 2016-01-12T21:11:50Z

👍 looks good to me

drcrallen · 2016-01-13T21:55:40Z

Checking out the thread-safety aspect, as I'm familiar with this part of the code. Give me a min.

fjy · 2016-01-13T22:47:51Z

👍

fjy · 2016-01-13T22:47:55Z

again :D

drcrallen · 2016-01-13T22:48:02Z

The comment is correct that the implementation of adding to facts needs to be thread-safe. The column selector implementation provided to cache the column selectors looks correct for the column selectors used.


fjy · 2016-01-13T22:48:47Z

@drcrallen missed your comment, sorry!

xvrl reviewed Jan 7, 2016

nishantmonu51 force-pushed the incremental-index-mem2 branch from 4e7fdf4 to b9dec22 on January 8, 2016 06:54

nishantmonu51 added this to the 0.9.0 milestone Jan 12, 2016

cache metric selectors instead of creating new ones for every metric …

4863e2c

…in each row clear selectors on close. Add comments about thread safety.

nishantmonu51 force-pushed the incremental-index-mem2 branch from b9dec22 to 4863e2c on January 12, 2016 19:16

fjy added a commit that referenced this pull request Jan 13, 2016

Merge pull request #2228 from metamx/incremental-index-mem2

4c014c1

Improve heap usage for IncrementalIndex

fjy merged commit 4c014c1 into apache:master Jan 13, 2016

drcrallen deleted the incremental-index-mem2 branch January 13, 2016 22:48

fjy mentioned this pull request Feb 5, 2016

druid-0.9.0 release notes #2404

Closed

fjy added the Improvement label Feb 5, 2016


