
refactor druid-bloom-filter aggregators #7496

Merged
merged 10 commits into from
Apr 18, 2019

Conversation

clintropolis
Member

This PR refactors the druid-bloom-filter extension aggregators to strictly use the ByteBuffer representation of BloomKFilter, with buffer-based methods to manipulate it in place, allowing for combined Aggregator and BufferAggregator implementations and vastly simplifying the code. It also fixes a bug in the combine logic of the current implementation, caused by the mixed use of BloomKFilter and ByteBuffer (the current implementation can only combine BloomKFilter, but the broker will receive byte[] from the historicals, resulting in a ClassCastException).

Originally introduced in #6397, that PR grew organically over its 41-commit lifetime. It started out dealing only with BloomKFilter on heap and suffered a lot of serde overhead; it ended up mixing BloomKFilter and ByteBuffer after some methods were added to optimize the BufferAggregator implementation, in an attempt to reduce the cost of serializing and deserializing such potentially large values so often, and it picked up a bug or two along the way.

This new approach combines the implementations of both types of aggregator and now always operates on ByteBuffer. BloomFilterAggregatorFactory and BloomFilterMergeAggregatorFactory construct the Aggregator and BufferAggregator implementations with a perhaps not greatly named boolean constructor parameter, onHeap. If onHeap is true, the aggregator allocates an appropriately sized ByteBuffer to hold a BloomKFilter, allowing usage as an Aggregator; if not, it must rely on the ByteBuffer passed to its methods during its life as a BufferAggregator.

There is probably room to make this a bit more elegant, maybe by making the aggregator constructors private and adding static methods that more explicitly construct either an Aggregator with onHeap = true or a BufferAggregator with onHeap = false. Regardless, this eliminates all of the extra serde and should be a lot simpler to reason about and troubleshoot.
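The static-factory idea floated above could look something like this minimal sketch. The class and method names are invented for illustration and the real aggregators also take column selectors and BloomKFilter sizing parameters; this only shows the onHeap split:

```java
import java.nio.ByteBuffer;

// Minimal sketch of the static-factory idea, with invented names; not Druid's
// actual API.
class BloomFilterAggregatorSketch
{
  // Non-null only when acting as an on-heap Aggregator; as a BufferAggregator
  // the filter lives in the ByteBuffer passed to its methods instead.
  private final ByteBuffer collector;

  private BloomFilterAggregatorSketch(boolean onHeap, int maxSizeBytes)
  {
    this.collector = onHeap ? ByteBuffer.allocate(maxSizeBytes) : null;
  }

  static BloomFilterAggregatorSketch onHeap(int maxSizeBytes)
  {
    return new BloomFilterAggregatorSketch(true, maxSizeBytes);
  }

  static BloomFilterAggregatorSketch buffered()
  {
    return new BloomFilterAggregatorSketch(false, 0);
  }

  boolean isOnHeap()
  {
    return collector != null;
  }
}
```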

I have tested with top-n, timeseries, and group-by queries on a small test cluster with multiple historicals spun up on my laptop, and things look good so far.

@gianm
Contributor

gianm commented Apr 17, 2019

Going to tag this as 'bug' too since it does fix one: right now the type mismatch between what deserialize generates and what combine accepts will lead to ClassCastExceptions when merging.

Contributor

@gianm gianm left a comment

Some suggestions, and just one thing I think really should change (the try/finally).

{
final BloomKFilter collector;

protected final ByteBuffer collector;
Contributor

Could you annotate this @Nullable please?

final int oldPosition = buf.position();
buf.position(position);
bufferAdd(buf);
buf.position(oldPosition);
Contributor

The position adjustment and restore should be wrapped in a try/finally, because we don't want the buffer to be left in an invalid state if an exception is thrown.
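A sketch of the suggested fix, with bufferAdd as a stand-in for the real aggregation logic, might look like:

```java
import java.nio.ByteBuffer;

// Sketch of the suggested try/finally wrapping; bufferAdd here is only a
// stand-in for the real aggregation step.
class PositionRestore
{
  static void bufferAdd(ByteBuffer buf)
  {
    buf.put((byte) 1);
  }

  static void aggregate(ByteBuffer buf, int position)
  {
    final int oldPosition = buf.position();
    buf.position(position);
    try {
      bufferAdd(buf);
    }
    finally {
      // Restore even if bufferAdd throws, so the buffer is never left
      // pointing at the wrong position.
      buf.position(oldPosition);
    }
  }
}
```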

{
final BloomKFilter collector;

protected final ByteBuffer collector;
Contributor

Please mark this @Nullable.

{
ByteBuffer mutationBuffer = buf.duplicate();
mutationBuffer.position(position);
// | k (byte) | numLongs (int) | bitset (long[numLongs]) |
Contributor

I think this comment would make more sense attached to "computeSizeBytes"?

Member Author

Oops, yes, I switched this method to use computeSizeBytes and forgot to remove the comment (previously the size was calculated by hand).
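Given the layout in that comment, the by-hand size computation that computeSizeBytes replaced would have looked roughly like this sketch (illustrative only, not Druid's exact code):

```java
import java.nio.ByteBuffer;

// Sketch of computing the serialized filter size from the layout
// | k (byte) | numLongs (int) | bitset (long[numLongs]) |
// (illustrative only, not Druid's exact implementation).
class BloomFilterSize
{
  static int computeSizeBytes(ByteBuffer buf, int position)
  {
    // numLongs is stored right after the single k byte
    final int numLongs = buf.getInt(position + Byte.BYTES);
    return Byte.BYTES + Integer.BYTES + numLongs * Long.BYTES;
  }
}
```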


public class BloomFilterAggregateCombiner extends ObjectAggregateCombiner<BloomKFilter>
Contributor

AggregateCombiner is only used at ingestion time AFAIK. Since this aggregator isn't meant to be used at ingestion time, you might as well delete this and make makeAggregateCombiner throw an UnsupportedOperationException.

Member Author

I wasn't certain, so I left it in from the original PR. However, all of my tests pass with it removed, as do test queries against a debugging cluster, so I think you're right and it should be fine to remove.

@@ -23,20 +23,22 @@
import org.apache.druid.query.filter.BloomKFilter;
import org.apache.druid.segment.BaseDoubleColumnValueSelector;

import java.nio.ByteBuffer;

public final class DoubleBloomFilterAggregator extends BaseBloomFilterAggregator<BaseDoubleColumnValueSelector>
Contributor

I think you could collapse all of these implementations into a single one using a similar technique to CardinalityAggregator. (ColumnSelectorPluses encapsulating a strategy for reading different types)

No need to do it for this PR, IMO, unless you want to. Just wanted to mention the technique.

Member Author

Heh, originally the bloom filter aggregators were done like that, and were refactored in the original PR in this commit, but it actually takes more classes to implement it that way (especially the way it is in this PR), and it puts more indirection between the aggregators, which doesn't really seem necessary. See this comment chain in the original PR for reference, and the issue spawned from it, #6909, for more details. I personally find it cleaner this way, especially after the changes in this PR.

Contributor

Oh. Well, I think the way the CardinalityAggregator does it is cleaner, since it only splits out the logic for how to read from differently-typed inputs, and then composes that logic into a single Aggregator class, instead of using inheritance and a ton of Aggregator subclasses. I just generally find inheritance based structures less easy to follow since the logic seems 'inside out' to me.

But I guess reasonable people can differ on this point, so it's up to you how you want to structure it. I don't think this inheritance based approach is better than the type-based-input-strategy approach, but it's acceptable.

Contributor

By the way, one original reason for the CardinalityAggregator-style design choice is that the logic for choosing which implementation to use could be centralized / standardized. The equivalent of this code here: https://github.com/apache/incubator-druid/pull/7496/files#diff-06cafba60d560f4a5bb1551a56b8041dR99 is this code in CardinalityAggregatorFactory:

    ColumnSelectorPlus<CardinalityAggregatorColumnSelectorStrategy>[] selectorPluses =
        DimensionHandlerUtils.createColumnSelectorPluses(
            STRATEGY_FACTORY,
            fields,
            columnFactory
        );

Which calls out to a standard library function. One other cool thing we can do here is modify the ColumnSelectorStrategyFactory interface to look more like this:

public interface ColumnStrategizer<T>
{
  T makeDimensionStrategy(DimensionSelector selector);

  T makeFloatStrategy(ColumnValueSelector selector);

  T makeDoubleStrategy(ColumnValueSelector selector);

  T makeLongStrategy(ColumnValueSelector selector);
}

That way, there's a structure in place to make sure every type has some sort of handling. Or, if we used default methods, at least consistent exceptions would be thrown when a type is missing handling. It isn't like this yet, but it could be, and I think it would be nice.
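The default-method variant mentioned above might look like the following sketch. The selector interfaces are empty stand-ins for Druid's real ones, included only so the sketch is self-contained:

```java
// Stand-ins for Druid's selector interfaces, included only to make this
// sketch self-contained.
interface DimensionSelector {}
interface ColumnValueSelector {}

// Hypothetical default-method variant of the proposed ColumnStrategizer:
// unhandled types fail with a consistent exception instead of each factory
// rolling its own error handling.
interface ColumnStrategizer<T>
{
  T makeDimensionStrategy(DimensionSelector selector);

  default T makeFloatStrategy(ColumnValueSelector selector)
  {
    throw new UnsupportedOperationException("float columns not supported");
  }

  default T makeDoubleStrategy(ColumnValueSelector selector)
  {
    throw new UnsupportedOperationException("double columns not supported");
  }

  default T makeLongStrategy(ColumnValueSelector selector)
  {
    throw new UnsupportedOperationException("long columns not supported");
  }
}
```

Since only one method is abstract, an implementation that handles just dimensions can be a single lambda, and the numeric methods fail uniformly unless overridden.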

Member Author

I don't feel super strongly either way, but I think it's slightly fewer classes the way it is in this PR, since it's just the base aggregator and the subclasses with selector-specific handling, instead of the aggregator, strategy factory, and a strategy for each selector. Maybe there is another abstraction somewhere in between these two approaches, or maybe a way to supply strategies to a common strategy factory so you don't have to write that strategy factory boilerplate. I'll think about this a bit more and maybe follow up in the future.

Contributor

For logic as simple as this case, the strategies could probably all be lambdas inlined inside the strategy factory.

@gianm
Contributor

gianm commented Apr 18, 2019

@clintropolis checkstyle has some words for you:

[ERROR] /home/travis/build/apache/incubator-druid/extensions-core/druid-bloom-filter/src/main/java/org/apache/druid/query/aggregation/bloom/BaseBloomFilterAggregator.java:94:5: '}' at column 5 should be alone on a line. [RightCurly]

Contributor

@gianm gianm left a comment

LGTM after the style issues / CI addressed

@gianm gianm added this to the 0.15.0 milestone Apr 18, 2019
@gianm
Contributor

gianm commented Apr 18, 2019

Tagged 0.15 since it is a bug.

@gianm gianm merged commit be65cca into apache:master Apr 18, 2019
gianm pushed a commit to implydata/druid-public that referenced this pull request Apr 18, 2019
* now with 100% more buffer

* there can be only 1

* simplify

* javadoc

* clean up unused test method

* fix exception message

* style

* why does style hate javadocs

* review stuff

* style :(
@clintropolis clintropolis deleted the bloom-agg-refactor-to-buffer-only branch April 20, 2019 08:34
clintropolis added a commit that referenced this pull request Apr 24, 2019
(same commit message as above)
@clintropolis clintropolis modified the milestones: 0.15.0, 0.14.1 Apr 24, 2019