Improve heap usage for IncrementalIndex #2228

Merged (1 commit) on Jan 13, 2016

nishantmonu51
Member

With the current code, OnheapIncrementalIndex creates a new ColumnSelectorFactory object (24 bytes) and a new ColumnSelector object (24 bytes) for every aggregator in every Druid row.

This means an overhead of 48 bytes * the number of aggregators per row, which becomes significant as the number of aggregators increases. For example, for 1M rows with 20 aggregators each, it comes to roughly 960 MB.
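
As a rough sanity check on that estimate, the arithmetic can be sketched in a few lines of Java. The class and method names below are hypothetical and are not part of Druid; the 24-byte object sizes are the figures quoted in this description and will vary with the JVM and its settings.

```java
// Hypothetical helper, not part of Druid: estimates the per-row selector overhead.
final class SelectorOverheadEstimate
{
  // Sizes quoted in the PR description; real object sizes depend on the JVM and its flags.
  static final long FACTORY_BYTES = 24L;
  static final long SELECTOR_BYTES = 24L;

  // One factory plus one selector is allocated per aggregator, per row.
  static long overheadBytes(long rows, long aggregatorsPerRow)
  {
    return rows * aggregatorsPerRow * (FACTORY_BYTES + SELECTOR_BYTES);
  }

  public static void main(String[] args)
  {
    long bytes = overheadBytes(1_000_000L, 20L);
    System.out.println(bytes + " bytes");                // 960000000 bytes
    System.out.println(bytes / 1_000_000L + " MB");      // 960 MB (decimal megabytes)
  }
}
```

For 1M rows and 20 aggregators the product is 960,000,000 bytes, which is in line with the heap reduction of about 0.9G measured for this change.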

This PR removes this overhead by caching the selector objects, so that the ColumnSelectorFactory and ColumnSelector instances are reused.

To measure the impact on heap usage for aggregators, I created an IncrementalIndex with 1M rows, each row having 20 longSum aggregators and 1 dimension. Overall heap size dropped from 1.9G to 1G (roughly a 50% improvement).

Actual improvements in index size will vary with the number of aggregators and dimensions in the IncrementalIndex.

@fjy
Contributor

fjy commented Jan 7, 2016

👍 I think this looks cool and it is crazy we missed this optimization :P

@jon-wei
Contributor

jon-wei commented Jan 7, 2016

👍 looks good to me

// Caches references to selector objects for each column instead of creating a new object each time, in order to save heap space.
private static class ObjectCachingColumnSelectorFactory implements ColumnSelectorFactory
{
  private final ConcurrentMap<String, LongColumnSelector> longColumnSelectorMap = Maps.newConcurrentMap();
Member

Do those need to be concurrent maps? I don't think ColumnSelectors are thread-safe, but I also don't think we ever share ColumnSelectorFactory instances across multiple threads.

Contributor

How about using Guava's Cache, which also supports various options, including an expiration policy?

Member Author

It can be accessed by multiple threads in the case of groupBy queries.
I didn't use Guava Cache because we don't need expiration policies here; if we expired entries, we would end up creating multiple selector objects again.
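
To make the caching idea concrete, here is a minimal sketch of a factory that keeps one selector per column in a ConcurrentMap and never evicts it. The interfaces `LongSelector` and `SelectorFactory` and the class `CachingSelectorFactory` are simplified, hypothetical stand-ins invented for this example; they are not Druid's real interfaces, and this is not necessarily how the PR implements it.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified stand-ins for illustration only; NOT Druid's real interfaces.
interface LongSelector
{
  long get();
}

interface SelectorFactory
{
  LongSelector makeLongSelector(String columnName);
}

// Wraps a delegate factory and hands back the same selector for a given column name.
final class CachingSelectorFactory implements SelectorFactory
{
  private final ConcurrentMap<String, LongSelector> longSelectors = new ConcurrentHashMap<>();
  private final SelectorFactory delegate;

  CachingSelectorFactory(SelectorFactory delegate)
  {
    this.delegate = delegate;
  }

  @Override
  public LongSelector makeLongSelector(String columnName)
  {
    LongSelector existing = longSelectors.get(columnName);
    if (existing != null) {
      return existing;
    }
    LongSelector created = delegate.makeLongSelector(columnName);
    // If another thread stored a selector first, keep that one and drop ours.
    LongSelector raced = longSelectors.putIfAbsent(columnName, created);
    return raced != null ? raced : created;
  }
}

final class CachingSelectorFactoryDemo
{
  public static void main(String[] args)
  {
    AtomicInteger delegateCalls = new AtomicInteger();
    // Delegate that counts how many selectors it is asked to build.
    SelectorFactory delegate = columnName -> {
      delegateCalls.incrementAndGet();
      return () -> 0L;
    };
    SelectorFactory cached = new CachingSelectorFactory(delegate);
    for (int row = 0; row < 1_000_000; row++) {
      cached.makeLongSelector("metricA");
      cached.makeLongSelector("metricB");
    }
    // Single-threaded run: one selector per distinct column, so this prints 2.
    System.out.println("selectors built: " + delegateCalls.get());
  }
}
```

Because entries are never expired, the number of selectors is bounded by the number of distinct columns rather than by the number of rows, which is the property an expiring cache would give up.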

Member

@nishantmonu51 Currently the ColumnSelectorFactory.makeXXXXColumnSelector methods, such as this one, are not thread-safe; this may be an oversight in the groupBy query engine. The resulting ColumnSelectors may be thread-safe, but the methods that create them are not. We may want to investigate what needs to be done there.

This doesn't need to be addressed in this PR, but I think the guarantees provided by the different methods are not clear. Maybe @cheddar has some input?

Member Author

For groupBy, I was referring specifically to the case where multiple threads add rows to the same IncrementalIndex, so the add method of IncrementalIndex needs to be thread-safe. Traversing the rows of a segment using an XXXColumnSelector happens on a single thread and doesn't need these guarantees.
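
The concurrent-add scenario can be illustrated with a small, self-contained experiment: several threads look up a selector for the same column at once, and a concurrent map ensures only one selector object is ever built. This is a sketch under stated assumptions, not Druid code: `ConcurrentSelectorLookupDemo` is a hypothetical class, a plain `Object` stands in for a selector, and `computeIfAbsent` (Java 8+) is used here for brevity.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class, not Druid code.
final class ConcurrentSelectorLookupDemo
{
  // Runs `threads` workers that each look up the same column `lookupsPerThread` times
  // and returns how many selector objects were actually constructed.
  static int selectorsCreated(int threads, int lookupsPerThread)
  {
    ConcurrentMap<String, Object> cache = new ConcurrentHashMap<>();
    AtomicInteger created = new AtomicInteger();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int t = 0; t < threads; t++) {
      pool.execute(() -> {
        for (int i = 0; i < lookupsPerThread; i++) {
          // ConcurrentHashMap applies the mapping function at most once per key.
          cache.computeIfAbsent("metric", name -> {
            created.incrementAndGet();
            return new Object(); // stand-in for a selector
          });
        }
      });
    }
    pool.shutdown();
    try {
      if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
        throw new IllegalStateException("workers did not finish in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return created.get();
  }

  public static void main(String[] args)
  {
    System.out.println("selectors created: " + selectorsCreated(8, 10_000));
  }
}
```

With a plain HashMap and an unsynchronized check-then-put, the same experiment could build more than one selector or corrupt the map, which is why the map type matters when rows are added from several threads.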

Member Author

@cheddar Any suggestions on whether we need to make it thread-safe or not?

@nishantmonu51 nishantmonu51 added this to the 0.9.0 milestone Jan 12, 2016
…in each row

clear selectors on close.

Add comments about thread safety.
@nishantmonu51
Member Author

@xvrl Added code comments about thread safety.

@xvrl
Member

xvrl commented Jan 12, 2016

👍 looks good to me

@drcrallen
Contributor

Checking out thread safety aspect as I'm familiar with this part of the code. Give me a min.

@fjy
Contributor

fjy commented Jan 13, 2016

👍

@fjy
Contributor

fjy commented Jan 13, 2016

again :D

@drcrallen
Contributor

The comment is correct that the implementation of add to facts needs to be thread-safe. The column selector factory implementation provided to cache the column selectors looks correct for the column selectors used here.

fjy added a commit that referenced this pull request Jan 13, 2016
Improve heap usage for IncrementalIndex
@fjy fjy merged commit 4c014c1 into apache:master Jan 13, 2016
@drcrallen drcrallen deleted the incremental-index-mem2 branch January 13, 2016 22:48
@fjy
Contributor

fjy commented Jan 13, 2016

@drcrallen missed your comment, sorry!

@fjy fjy mentioned this pull request Feb 5, 2016
@fjy fjy added the Improvement label Feb 5, 2016

6 participants