-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve heap usage for IncrementalIndex #2228
Conversation
👍 I think this looks cool and it is crazy we missed this optimization :P |
👍 looks good to me |
// Caches references to selector objetcs for each column instead of creating a new object each time in order to save heap space. | ||
private static class ObjectCachingColumnSelectorFactory implements ColumnSelectorFactory | ||
{ | ||
private final ConcurrentMap<String, LongColumnSelector> longColumnSelectorMap = Maps.newConcurrentMap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do those need to be concurrent maps? I don't think ColumnSelectors are threadsafe but I also don't think we ever share columnselectorfactories across multiple threads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about to use Cache in Guava, which also can specify various options, including expire policy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be accessed by multiple threads in case of groupBy queryies,
dint used Guava Cache as we don't need expiration policies here, since if we expire entries we will end up creating multiple selector objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nishantmonu51 Currently ColumnSelectorFactory.makeXXXXColumnSelector
– such as this one – are not thread safe, this may be an oversight in the groupBy query engine. The resulting ColumnSelectors maybe be thread safe but not the methods that create them. We may want to investigate what needs to be done there.
Doesn't need to be for this PR, but I think the guarantees provided by the different methods are not clear. Maybe @cheddar has some input?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for groupBy, I particularly referred to the case when multiple threads adds rows to the same IncrementalIndex, the add method to IncIndex needs to be thread safe, for traversing the rows from the segments using the XXXColumnSelector uses single thread and doesnt need these guarantees.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cheddar any suggestions on whether we need to make it thread safe or not ?
4e7fdf4
to
b9dec22
Compare
…in each row clear selectors on close. Add comments about thread safety.
b9dec22
to
4863e2c
Compare
@xvrl Added code comments about thread safety. |
👍 looks good to me |
Checking out thread safety aspect as I'm familiar with this part of the code. Give me a min. |
👍 |
again :D |
The comment is correct about the implementation of add to facts needing to be thread safe. The column selector impl provided to cache the column selectors looks correct for the case of column selectors used. |
Improve heap usage for IncrementalIndex
@drcrallen missed your comment, sorry! |
With current code OnheapIncrementalIndex ends up creating a new object of ColumnSelectorFactory (24 bytes each) and ColumnSelector ( 24 bytes) for every aggregator for each druid row.
This means an overhead of 48 bytes * number of aggs per row which becomes significant as the number of aggregators are increased. e.g for 1M rows each having 20 aggregators it turns out to be 800Mb.
This PR aims at removing this overhead by reusing the ColumnSelectorFactory and ColumnSelector by caching the selector objects.
For measuring the impact on heap usage for aggregators I created an IncrementalIndex with 1M rows each row having 20 longsum aggregators and 1 dimension and got an overall reduction in heap size from 1.9G to 1G. ( ~50% improvement)
Actual improvements in the index size will vary with distribution of number of aggregators and dimensions in IncrementalIndex.