support BOOL_AND and BOOL_OR aggregate functions #9848

agavra · 2022-11-23T01:18:10Z

Support the BOOL_AND/BOOL_OR functions (see https://www.postgresql.org/docs/9.1/functions-aggregate.html for documentation on function behavior).

Review Notes

The aggregates operate on integer types because that is the stored type, but the column schema is BOOLEAN for both the intermediate and the result type.
[multistage only] PinotBoolAndAggregateFunction and PinotBoolOrAggregateFunction let us register the functions with the PinotStdOperatorTable so that they are case insensitive, and they also ensure that the types are properly propagated.
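
As a rough illustration of the first note, the sketch below aggregates over the stored int form (assuming 0 = false and 1 = true) and only presents the result as a boolean at the end. The class and method names are hypothetical and not taken from the Pinot codebase, and null handling is left out. The identity values returned for empty input are a simplification; PostgreSQL returns NULL when no rows are selected.

```java
// Illustrative sketch only; names are hypothetical and not taken from the Pinot codebase.
final class StoredBooleanAggSketch {
  // BOOLEAN values are assumed to be stored as ints: 0 = false, 1 = true.
  static int boolAndStored(int[] storedValues) {
    int agg = 1; // identity for AND
    for (int value : storedValues) {
      agg &= value;
    }
    return agg;
  }

  static int boolOrStored(int[] storedValues) {
    int agg = 0; // identity for OR
    for (int value : storedValues) {
      agg |= value;
    }
    return agg;
  }

  // The aggregate stays in stored int form and is only presented as a boolean.
  static boolean toBoolean(int storedValue) {
    return storedValue == 1;
  }

  public static void main(String[] args) {
    int[] column = {1, 0, 1};
    System.out.println(toBoolean(boolAndStored(column))); // false
    System.out.println(toBoolean(boolOrStored(column)));  // true
  }
}
```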

walterddr

lgtm overall. please take a look at the comments.

walterddr · 2022-11-28T18:22:35Z

pinot-query-runtime/src/test/resources/queries/Aggregates.json

+      },
+      {
+        "psql": "9.21.0",
+        "description": "aggregate boolean column",
+        "sql": "SELECT bool_and(bool_col), bool_or(bool_col) FROM {tbl} GROUP BY string_col"
+      },
+      {
+        "psql": "9.21.0",
+        "description": "aggregate boolean column no group by",
+        "sql": "SELECT bool_and(bool_col), bool_or(bool_col) FROM {tbl}"


Do we support SELECT bool_and(startsWith(string_col, 'foo')) FROM ... ?
Can we add a test for this type of use case?

We do not support this yet (for any aggregate function in v2). This is a huge gap, good callout.

walterddr · 2022-11-28T18:24:53Z

...st/java/org/apache/pinot/core/query/aggregation/function/AggregationFunctionFactoryTest.java

@@ -444,6 +444,20 @@ public void testGetAggregationFunction() {
    assertEquals(aggregationFunction.getType(), AggregationFunctionType.PERCENTILETDIGESTMV);
    assertEquals(aggregationFunction.getColumnName(), "percentileTDigest95.0MV_column");
    assertEquals(aggregationFunction.getResultColumnName(), "percentiletdigestmv(column, 95.0)");
+
+    function = getFunction("bool_and");


Since we are adding this feature to V1 as well, we should add tests for it in V1 too. See #9236, which did this when adding COVAR_POP.

codecov-commenter · 2022-11-28T20:49:24Z

Codecov Report

Merging #9848 (cbefbe7) into master (6ef4dfc) will increase coverage by 5.08%.
The diff coverage is 54.93%.

@@             Coverage Diff              @@
##             master    #9848      +/-   ##
============================================
+ Coverage     63.44%   68.53%   +5.08%     
+ Complexity     5326     4924     -402     
============================================
  Files          1960     1978      +18     
  Lines        105350   105976     +626     
  Branches      15960    16057      +97     
============================================
+ Hits          66840    72628    +5788     
+ Misses        33589    28218    -5371     
- Partials       4921     5130     +209

Flag	Coverage Δ
integration1	`25.06% <0.00%> (-0.15%)`	⬇️
integration2	`?`
unittests1	`67.80% <54.93%> (+<0.01%)`	⬆️
unittests2	`15.79% <3.70%> (?)`

Impacted Files	Coverage Δ
...ery/aggregation/DoubleAggregationResultHolder.java	`60.00% <0.00%> (-15.00%)`	⬇️
...ery/aggregation/ObjectAggregationResultHolder.java	`60.00% <0.00%> (-25.72%)`	⬇️
...aggregation/groupby/DoubleGroupByResultHolder.java	`82.75% <0.00%> (-6.14%)`	⬇️
...aggregation/groupby/ObjectGroupByResultHolder.java	`75.00% <0.00%> (-12.50%)`	⬇️
...ry/aggregation/groupby/IntGroupByResultHolder.java	`41.37% <41.37%> (ø)`
...re/query/aggregation/IntAggregateResultHolder.java	`60.00% <60.00%> (ø)`
...regation/function/BooleanAndAggregateFunction.java	`60.00% <60.00%> (ø)`
...gregation/function/BooleanOrAggregateFunction.java	`60.00% <60.00%> (ø)`
...egation/function/BaseBooleanAggregateFunction.java	`60.67% <60.67%> (ø)`
...inot/query/runtime/operator/AggregateOperator.java	`95.68% <75.00%> (-1.54%)`	⬇️
... and 431 more

Jackie-Jiang

We may consider using different code paths with and without null handling to avoid unnecessary boxing/unboxing. We have observed a significant performance impact when using objects instead of primitives.
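
A minimal sketch of what separate code paths could look like, assuming BOOL_OR over stored ints. The names are hypothetical, and java.util.BitSet stands in for the null bitmap (the PR itself imports RoaringBitmap). The path without null handling touches primitives only; the null-aware path boxes once, at the end, to represent an all-null input.

```java
import java.util.BitSet;

// Illustrative sketch only; names are hypothetical and not taken from the Pinot codebase.
final class NullHandlingPathsSketch {
  // Path without null handling: primitives only, no boxing anywhere.
  static int orWithoutNulls(int[] values, int length) {
    int agg = 0;
    for (int i = 0; i < length; i++) {
      agg |= values[i];
    }
    return agg;
  }

  // Null-aware path: skips null positions and returns null if every value was null.
  // The accumulator stays primitive; boxing happens once on return.
  static Integer orWithNulls(int[] values, int length, BitSet nullPositions) {
    int agg = 0;
    boolean sawNonNull = false;
    for (int i = 0; i < length; i++) {
      if (!nullPositions.get(i)) {
        agg |= values[i];
        sawNonNull = true;
      }
    }
    return sawNonNull ? agg : null;
  }
}
```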

Jackie-Jiang · 2022-11-28T23:49:53Z

...main/java/org/apache/pinot/core/query/aggregation/function/BaseBooleanAggregateFunction.java

+import org.roaringbitmap.RoaringBitmap;
+
+
+public abstract class BaseBooleanAggregateFunction extends BaseSingleInputAggregationFunction<Integer, Integer> {


Should this be BaseSingleInputAggregationFunction<Boolean, Boolean>?

That's what I did at first, but I think there was an issue with it because booleans are stored as ints. I don't remember exactly what it was, but I can try changing it back.

The problem is that lots of the reducing code expects the data to be in stored type format. If we output a boolean, that breaks (see for example usage of ColumnDataType#convert, which expects booleans to be stored as ints).

I see. We can fix it separately. Suggest adding a TODO for it

Jackie-Jiang · 2022-11-28T23:51:32Z

...main/java/org/apache/pinot/core/query/aggregation/function/BaseBooleanAggregateFunction.java

+  protected enum BooleanMerge {
+    AND {
+      @Override
+      int merge(Integer left, int right) {


For better performance, we want to avoid per-value merges and process arrays in batches instead. Null values should also be handled in batches.

Discussed offline: we'll keep the enum to avoid duplicating code. I wrote this JMH benchmark to confirm that there shouldn't be a performance impact (see below). We can always go back and optimize it if need be.

    @Setup
    public void setUp() {
      Random random = new Random();
      _ints = IntStream.generate(random::nextInt).limit(1000).toArray();
      _enum = MyEnum.ADD;
    }

    @Benchmark
    public int inline() {
      int sum = 0;
      for (int i : _ints) {
        sum = sum + i;
      }
      return sum;
    }

    @Benchmark
    public int delegate() {
      int sum = 0;
      for (int i : _ints) {
        sum = _enum.apply(sum, i);
      }
      return sum;
    }

The result was:

    Benchmark                       Mode  Cnt  Score   Error  Units
    BenchmarkEnumOverhead.delegate  avgt    5  0.306 ± 0.040  us/op
    BenchmarkEnumOverhead.inline    avgt    5  0.306 ± 0.033  us/op

Jackie-Jiang · 2022-11-28T23:53:41Z

...main/java/org/apache/pinot/core/query/aggregation/function/BaseBooleanAggregateFunction.java

+      includeDoc = docId -> true;
+    }
+
+    Integer agg = aggregationResultHolder.getResult();


We can terminate early when the previous value is already true for OR or already false for AND.

Jackie-Jiang

Mostly good

Jackie-Jiang · 2022-11-30T01:03:04Z

...main/java/org/apache/pinot/core/query/aggregation/function/BaseBooleanAggregateFunction.java

+
+    if (blockValSet.getValueType().getStoredType() != FieldSpec.DataType.INT) {
+      throw new IllegalArgumentException(
+          String.format("Unsupported data type %s for BOOL_AND", blockValSet.getValueType()));


(minor) This can also be BOOL_OR

Jackie-Jiang · 2022-11-30T01:05:16Z

...main/java/org/apache/pinot/core/query/aggregation/function/BaseBooleanAggregateFunction.java

+      Map<ExpressionContext, BlockValSet> blockValSetMap) {
+    BlockValSet blockValSet = blockValSetMap.get(_expression);
+
+    if (blockValSet.getValueType().getStoredType() != FieldSpec.DataType.INT) {


Do we want to support it on INT columns? If not, we should explicitly check the value type instead of the stored type. The current logic won't work well when the input value is not 0/1.
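
To make the distinction concrete, here is a small sketch contrasting a stored-type check with a value-type check. The DataType enum below is a hypothetical, reduced stand-in (it only assumes that BOOLEAN is stored as INT), not the real FieldSpec.DataType. The andAll helper shows why raw INT input is a problem for a bitwise merge: 2 and 1 are both non-zero, yet 2 & 1 is 0.

```java
// Illustrative sketch only; this enum is a hypothetical stand-in, not FieldSpec.DataType.
final class BooleanInputCheckSketch {
  enum DataType {
    INT, BOOLEAN, STRING;

    // Assumption for this sketch: BOOLEAN is stored as INT, other types store as themselves.
    DataType getStoredType() {
      return this == BOOLEAN ? INT : this;
    }
  }

  // Check on the stored type: accepts both BOOLEAN and plain INT columns.
  static void requireIntStoredType(DataType valueType) {
    if (valueType.getStoredType() != DataType.INT) {
      throw new IllegalArgumentException("Unsupported data type " + valueType);
    }
  }

  // Stricter check on the value type: accepts BOOLEAN only.
  static void requireBooleanValueType(DataType valueType) {
    if (valueType != DataType.BOOLEAN) {
      throw new IllegalArgumentException("Unsupported data type " + valueType);
    }
  }

  // Bitwise AND over raw ints, as a boolean merge would do on stored values.
  static int andAll(int[] values) {
    int agg = 1;
    for (int value : values) {
      agg &= value;
    }
    return agg;
  }
}
```

With the stored-type check, an INT column holding {2, 1} passes validation and aggregates to 0, even though neither input is zero.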

Jackie-Jiang · 2022-11-30T01:07:22Z

...main/java/org/apache/pinot/core/query/aggregation/function/BaseBooleanAggregateFunction.java

+    OR {
+      @Override
+      long merge(long left, int right) {
+        return left + right;


This should be left | right. For boolean result, the value should be either 0 or 1.

Nice! I used + and then just checked > 0, but this is cleaner. I also changed AND to use &.
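
A small sketch of the difference, with hypothetical names. Accumulating with + produces a count that has to be normalized with a > 0 check afterwards (and could in principle overflow), while | and & keep the accumulator in {0, 1} as long as the inputs are 0/1.

```java
// Illustrative sketch only; names are hypothetical and not taken from the Pinot codebase.
final class MergeOperatorSketch {
  // OR emulated with addition: the result is a count and needs a separate "> 0" check.
  static int orWithPlus(int[] values) {
    int agg = 0;
    for (int value : values) {
      agg = agg + value;
    }
    return agg;
  }

  // OR with bitwise |: the result stays 0 or 1 for 0/1 inputs.
  static int orWithBitwiseOr(int[] values) {
    int agg = 0;
    for (int value : values) {
      agg = agg | value;
    }
    return agg;
  }

  // AND with bitwise &: the result stays 0 or 1 for 0/1 inputs.
  static int andWithBitwiseAnd(int[] values) {
    int agg = 1;
    for (int value : values) {
      agg = agg & value;
    }
    return agg;
  }
}
```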

Jackie-Jiang · 2022-11-30T01:12:40Z

...main/java/org/apache/pinot/core/query/aggregation/function/BaseBooleanAggregateFunction.java

+  protected enum BooleanMerge {
+    AND {
+      @Override
+      long merge(long left, int right) {


We should store the value as int to avoid the implicit cast from int to long. Suggest adding the result holder for int type in this PR, and we may add it for long types if needed in the future.

Jackie-Jiang · 2022-11-30T01:15:23Z

...main/java/org/apache/pinot/core/query/aggregation/function/BaseBooleanAggregateFunction.java

+        if (!nullBitmap.contains(i)) {
+          agg = _merger.merge(agg, bools[i]);
+          aggregationResultHolder.setValue((Object) agg);
+          if (_merger.isTerminal(agg)) {


Move this check to line 117 so that the block can be skipped.
We might not want to do a per-value terminal check, because a tight loop tends to perform better. We can terminate early at the block level instead.
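
A sketch of block-level early termination under stated assumptions: the merge and isTerminal methods mirror the names visible in the diff, while identity(), the int[][] block layout, and the returned block count are hypothetical additions for illustration. The terminal check runs once per block, so the inner loop has no extra branch.

```java
// Illustrative sketch only; identity() and the block layout are hypothetical.
final class BlockEarlyTerminationSketch {
  enum BooleanMerge {
    AND {
      @Override int merge(int left, int right) { return left & right; }
      @Override boolean isTerminal(int agg) { return agg == 0; } // false can never become true again
      @Override int identity() { return 1; }
    },
    OR {
      @Override int merge(int left, int right) { return left | right; }
      @Override boolean isTerminal(int agg) { return agg == 1; } // true can never become false again
      @Override int identity() { return 0; }
    };

    abstract int merge(int left, int right);
    abstract boolean isTerminal(int agg);
    abstract int identity();
  }

  // Returns {aggregate, number of blocks actually scanned}.
  static int[] aggregate(BooleanMerge merger, int[][] blocks) {
    int agg = merger.identity();
    int blocksScanned = 0;
    for (int[] block : blocks) {
      // Block-level check: skip all remaining blocks once the result is decided.
      if (merger.isTerminal(agg)) {
        break;
      }
      // Tight inner loop with no per-value terminal check.
      for (int value : block) {
        agg = merger.merge(agg, value);
      }
      blocksScanned++;
    }
    return new int[] {agg, blocksScanned};
  }
}
```

The trade-off is that a block is always scanned to the end even if the result is decided on its first value, in exchange for a simpler inner loop.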

walterddr

lgtm

Jackie-Jiang

LGTM. Only minor comments

pinot-core/src/main/java/org/apache/pinot/core/query/aggregation/AggregationResultHolder.java

...ore/src/main/java/org/apache/pinot/core/query/aggregation/DoubleAggregationResultHolder.java

Jackie-Jiang · 2022-11-30T19:11:54Z

...main/java/org/apache/pinot/core/query/aggregation/function/BaseBooleanAggregateFunction.java

+
+      // early terminate on a per-block level to allow the
+      // loop below to be more tightly optimized (avoid a branch)
+      if (_merger.isTerminal(agg)) {


(minor) Suggest moving this above the nullBitmap check, which is much more costly.

Jackie-Jiang · 2022-11-30T19:14:13Z

...main/java/org/apache/pinot/core/query/aggregation/function/BaseBooleanAggregateFunction.java

+// TODO: change this to implement BaseSingleInputAggregationFunction<Boolean, Boolean> when we get proper
+// handling of booleans in serialization - today this would fail because ColumnDataType#convert assumes
+// that the boolean is encoded as its stored type (an integer)
+public abstract class BaseBooleanAggregateFunction extends BaseSingleInputAggregationFunction<Integer, Integer> {


(nit) We usually call it AggregationFunction instead of AggregateFunction. Keeping them consistent can help with the code search. Same for the sub-classes

Jackie-Jiang · 2022-11-30T19:16:06Z

pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/AggregationFunctionType.java

+  DISTINCT("distinct"),
+
+  // boolean aggregate functions
+  BOOLAND("bool_and"),


Keep the name camel case (boolAnd and boolOr)

agavra force-pushed the boolean_agg branch 2 times, most recently from fcbbceb to d5fdad1 Compare November 23, 2022 17:28

agavra marked this pull request as ready for review November 23, 2022 18:42

walterddr reviewed Nov 28, 2022

agavra added 2 commits November 28, 2022 10:36

support BOOL_AND and BOOL_OR aggregate functions

2fbf04a

address feedback

9b97ccb

agavra force-pushed the boolean_agg branch from d5fdad1 to 9b97ccb Compare November 28, 2022 20:09

Jackie-Jiang reviewed Nov 29, 2022

agavra added 3 commits November 29, 2022 15:43

inline null handling and add LongResultHolders

817d07d

fix checkstyle

01916f9

fix null handling and add null tests

d69191c

Jackie-Jiang reviewed Nov 30, 2022

change Long to Int and a few other minor changes

cbefbe7

agavra requested review from walterddr and Jackie-Jiang and removed request for walterddr and Jackie-Jiang November 30, 2022 18:37

walterddr approved these changes Nov 30, 2022

Jackie-Jiang approved these changes Nov 30, 2022

agavra added 3 commits November 30, 2022 13:41

address minor feedback

4ceba84

trigger build

25f5353

fix test failure

5410b20

walterddr merged commit 0a6d843 into apache:master Dec 1, 2022

agavra deleted the boolean_agg branch December 1, 2022 15:21

walterddr mentioned this pull request Apr 18, 2023

[multistage] Query with Anti Semi-Join Fails #10628

Open
