support BOOL_AND and BOOL_OR aggregate functions #9848

Merged
merged 9 commits into from Dec 1, 2022

@agavra agavra commented Nov 23, 2022

Support the BOOL_AND/BOOL_OR functions (see https://www.postgresql.org/docs/9.1/functions-aggregate.html for documentation on function behavior).
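For reference, the PostgreSQL semantics linked above can be sketched in plain Java as follows. This is only an illustration of the documented behavior (nulls are ignored, and the result is null when there is no non-null input); the class and method names are hypothetical and this is not the code added in this PR.

```java
import java.util.List;

final class BoolAggSemantics {
  /** Logical AND of the non-null inputs; null when there is no non-null input. */
  static Boolean boolAnd(List<Boolean> values) {
    Boolean result = null;
    for (Boolean v : values) {
      if (v == null) {
        continue; // nulls are ignored by the aggregate
      }
      result = (result == null) ? v : Boolean.valueOf(result && v);
    }
    return result;
  }

  /** Logical OR of the non-null inputs; null when there is no non-null input. */
  static Boolean boolOr(List<Boolean> values) {
    Boolean result = null;
    for (Boolean v : values) {
      if (v == null) {
        continue; // nulls are ignored by the aggregate
      }
      result = (result == null) ? v : Boolean.valueOf(result || v);
    }
    return result;
  }
}
```

Whether the Pinot implementation returns null for an all-null input depends on its null-handling mode, which this sketch does not model.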

Review Notes

  1. The aggregates operate on integer types because that is the stored type, but the column schema is BOOLEAN for both the intermediate and result types.
  2. [multistage only] PinotBoolAndAggregateFunction and PinotBoolOrAggregateFunction let us register the functions with the PinotStdOperatorTable so that they are case insensitive, and they also make sure the types are properly propagated.

@agavra agavra force-pushed the boolean_agg branch 2 times, most recently from fcbbceb to d5fdad1 November 23, 2022 17:28
@agavra agavra marked this pull request as ready for review November 23, 2022 18:42
@walterddr walterddr left a comment

lgtm overall. please take a look at the comments.

Comment on lines +120 to +129
},
{
  "psql": "9.21.0",
  "description": "aggregate boolean column",
  "sql": "SELECT bool_and(bool_col), bool_or(bool_col) FROM {tbl} GROUP BY string_col"
},
{
  "psql": "9.21.0",
  "description": "aggregate boolean column no group by",
  "sql": "SELECT bool_and(bool_col), bool_or(bool_col) FROM {tbl}"
Contributor

Do we support SELECT bool_and(startsWith(string_col, 'foo')) FROM ... ?
Can we add a test for this type of use case?

Contributor Author

We do not support this yet (for any aggregate function in v2). This is a huge gap, good callout.

@@ -444,6 +444,20 @@ public void testGetAggregationFunction() {
assertEquals(aggregationFunction.getType(), AggregationFunctionType.PERCENTILETDIGESTMV);
assertEquals(aggregationFunction.getColumnName(), "percentileTDigest95.0MV_column");
assertEquals(aggregationFunction.getResultColumnName(), "percentiletdigestmv(column, 95.0)");

function = getFunction("bool_and");
Contributor

Since we are adding this feature to V1 as well, we should add tests for it in V1 too. See #9236, which added COVAR_POP.

codecov-commenter commented Nov 28, 2022

Codecov Report

Merging #9848 (cbefbe7) into master (6ef4dfc) will increase coverage by 5.08%.
The diff coverage is 54.93%.

@@             Coverage Diff              @@
##             master    #9848      +/-   ##
============================================
+ Coverage     63.44%   68.53%   +5.08%     
+ Complexity     5326     4924     -402     
============================================
  Files          1960     1978      +18     
  Lines        105350   105976     +626     
  Branches      15960    16057      +97     
============================================
+ Hits          66840    72628    +5788     
+ Misses        33589    28218    -5371     
- Partials       4921     5130     +209     
Flag          Coverage Δ
integration1  25.06% <0.00%> (-0.15%) ⬇️
integration2  ?
unittests1    67.80% <54.93%> (+<0.01%) ⬆️
unittests2    15.79% <3.70%> (?)

Impacted Files                                         Coverage Δ
...ery/aggregation/DoubleAggregationResultHolder.java  60.00% <0.00%> (-15.00%) ⬇️
...ery/aggregation/ObjectAggregationResultHolder.java  60.00% <0.00%> (-25.72%) ⬇️
...aggregation/groupby/DoubleGroupByResultHolder.java  82.75% <0.00%> (-6.14%) ⬇️
...aggregation/groupby/ObjectGroupByResultHolder.java  75.00% <0.00%> (-12.50%) ⬇️
...ry/aggregation/groupby/IntGroupByResultHolder.java  41.37% <41.37%> (ø)
...re/query/aggregation/IntAggregateResultHolder.java  60.00% <60.00%> (ø)
...regation/function/BooleanAndAggregateFunction.java  60.00% <60.00%> (ø)
...gregation/function/BooleanOrAggregateFunction.java  60.00% <60.00%> (ø)
...egation/function/BaseBooleanAggregateFunction.java  60.67% <60.67%> (ø)
...inot/query/runtime/operator/AggregateOperator.java  95.68% <75.00%> (-1.54%) ⬇️
... and 431 more

@Jackie-Jiang Jackie-Jiang left a comment

We may consider using different code paths with and without null handling to avoid unnecessary boxing/unboxing. We have observed a significant performance impact when using objects instead of primitives.
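A minimal sketch of what two separate code paths could look like, assuming values arrive as an int[] of 0/1 and nulls are tracked in a bitmap. java.util.BitSet stands in for the RoaringBitmap used in the PR, and the class and method names are hypothetical. Both loops accumulate into a primitive int; boxing only happens once at the end of the null-aware path.

```java
import java.util.BitSet;

final class NullAwareBoolAnd {
  /** Fast path: null handling disabled, primitives only, no per-value branch. */
  static int andNoNulls(int[] bools, int length) {
    int agg = 1;
    for (int i = 0; i < length; i++) {
      agg &= bools[i];
    }
    return agg;
  }

  /** Null-aware path: skips null docs and returns null if every doc was null. */
  static Integer andWithNulls(int[] bools, int length, BitSet nullDocs) {
    int agg = 1;
    boolean seenValue = false;
    for (int i = 0; i < length; i++) {
      if (!nullDocs.get(i)) {
        agg &= bools[i];
        seenValue = true;
      }
    }
    // Box exactly once, after the loop.
    return seenValue ? Integer.valueOf(agg) : null;
  }
}
```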

import org.roaringbitmap.RoaringBitmap;


public abstract class BaseBooleanAggregateFunction extends BaseSingleInputAggregationFunction<Integer, Integer> {
Contributor

Should this be BaseSingleInputAggregationFunction<Boolean, Boolean>?

Contributor Author

That's what I did at the start, but I think there's some issue with it because booleans are stored as ints. I don't remember exactly what it was, but I can try changing it back.

Contributor Author

The problem is that lots of the reducing code expects the data to be in stored type format. If we output a boolean, that breaks (see for example usage of ColumnDataType#convert, which expects booleans to be stored as ints).

Contributor

I see. We can fix it separately. Suggest adding a TODO for it
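To illustrate the failure mode discussed in this thread, here is a rough sketch of a converter that assumes booleans arrive in their stored (integer) form. The convertBoolean method is a hypothetical stand-in for the BOOLEAN branch of ColumnDataType#convert, not a copy of the actual Pinot code.

```java
final class StoredBooleanSketch {
  /** Encodes a boolean in its stored form (assumed here to be 1 for true, 0 for false). */
  static int toStored(boolean value) {
    return value ? 1 : 0;
  }

  /**
   * Hypothetical stand-in for a reducer-side conversion that expects the stored type.
   * It casts to Integer, so handing it a java.lang.Boolean fails at runtime.
   */
  static Object convertBoolean(Object stored) {
    return ((Integer) stored) == 1;
  }
}
```

With this shape, an aggregation function that returns Integer keeps working with the downstream conversion, while one that returns Boolean would hit a ClassCastException until the conversion itself is changed.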

protected enum BooleanMerge {
  AND {
    @Override
    int merge(Integer left, int right) {
Contributor

For better performance, we want to avoid per-value merges and instead process the values in array batches. The null values should also be handled in batches.

Contributor Author

Discussed offline: we'll keep the enum to avoid duplicating code. I wrote this JMH benchmark to confirm that there shouldn't be a performance impact (see below). We can always go back and optimize it if need be.

  @Setup
  public void setUp() {
    Random random = new Random();
    _ints = IntStream.generate(random::nextInt).limit(1000).toArray();
    _enum = MyEnum.ADD;
  }

  @Benchmark
  public int inline() {
    int sum = 0;
    for (int i : _ints) {
      sum = sum + i;
    }
    return sum;
  }

  @Benchmark
  public int delegate() {
    int sum = 0;
    for (int i : _ints) {
      sum = _enum.apply(sum, i);
    }
    return sum;
  }

The result was:

Benchmark                       Mode  Cnt  Score   Error  Units
BenchmarkEnumOverhead.delegate  avgt    5  0.306 ± 0.040  us/op
BenchmarkEnumOverhead.inline    avgt    5  0.306 ± 0.033  us/op

includeDoc = docId -> true;
}

Integer agg = aggregationResultHolder.getResult();
Contributor

We can terminate early when the previous value is true for OR and false for AND.
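A sketch of how the terminal condition could be expressed on the merge enum, assuming values are encoded as 0/1 ints. The merge and isTerminal names follow the snippets quoted in this review; the enum name and method bodies are assumptions rather than the merged implementation.

```java
enum BooleanMergeSketch {
  AND {
    @Override
    int merge(int left, int right) {
      return left & right;
    }

    @Override
    boolean isTerminal(int agg) {
      return agg == 0; // once false, AND can never become true again
    }
  },
  OR {
    @Override
    int merge(int left, int right) {
      return left | right;
    }

    @Override
    boolean isTerminal(int agg) {
      return agg == 1; // once true, OR can never become false again
    }
  };

  abstract int merge(int left, int right);

  abstract boolean isTerminal(int agg);
}
```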

@Jackie-Jiang Jackie-Jiang left a comment

Mostly good


if (blockValSet.getValueType().getStoredType() != FieldSpec.DataType.INT) {
  throw new IllegalArgumentException(
      String.format("Unsupported data type %s for BOOL_AND", blockValSet.getValueType()));
Contributor

(minor) This can also be BOOL_OR

Map<ExpressionContext, BlockValSet> blockValSetMap) {
BlockValSet blockValSet = blockValSetMap.get(_expression);

if (blockValSet.getValueType().getStoredType() != FieldSpec.DataType.INT) {
Contributor

Do we want to support it on INT column? If not, we should explicitly check the value type instead of the stored type. The current logic won't work well when the input value is not 0/1.
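The concern can be made concrete with a small sketch. The DataType enum below is a simplified, hypothetical stand-in for FieldSpec.DataType that only models BOOLEAN being stored as INT. A check on the stored type lets a plain INT column through, and a bitwise merge then gives surprising answers for inputs other than 0/1.

```java
final class ValueTypeCheckSketch {
  /** Simplified stand-in for the real data type enum (assumption for illustration). */
  enum DataType {
    INT, BOOLEAN;

    DataType getStoredType() {
      return this == BOOLEAN ? INT : this;
    }
  }

  /** Stricter check on the logical value type instead of the stored type. */
  static void validate(DataType valueType) {
    if (valueType != DataType.BOOLEAN) {
      throw new IllegalArgumentException("Unsupported data type " + valueType + " for boolean aggregation");
    }
  }

  /** Bitwise AND over raw ints; only meaningful when every value is 0 or 1. */
  static int bitwiseAnd(int[] values) {
    int agg = 1;
    for (int v : values) {
      agg &= v;
    }
    return agg;
  }
}
```

For example, bitwiseAnd over {2, 1} yields 0 because 1 & 2 is 0, even though both inputs are non-zero.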

OR {
  @Override
  long merge(long left, int right) {
    return left + right;
Contributor

This should be left | right. For a boolean result, the value should be either 0 or 1.

Contributor Author

@agavra agavra Nov 30, 2022

Nice! I used + and then just checked > 0, but this is cleaner. I also changed AND to use &.
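A small side-by-side sketch of the two OR strategies discussed here, assuming 0/1 inputs. The class and method names are hypothetical. The sum-based version produces a count that needs a separate > 0 check, while the bitwise version keeps the aggregate in {0, 1} at every step.

```java
final class OrMergeSketch {
  /** Sum-based merge: the result is a count, so callers must compare it with 0. */
  static long sumMerge(int[] bools) {
    long agg = 0;
    for (int b : bools) {
      agg += b;
    }
    return agg;
  }

  /** Bitwise merge: the result is already a valid stored boolean (0 or 1). */
  static int orMerge(int[] bools) {
    int agg = 0;
    for (int b : bools) {
      agg |= b;
    }
    return agg;
  }
}
```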

protected enum BooleanMerge {
  AND {
    @Override
    long merge(long left, int right) {
Contributor

We should store the value as int to avoid the implicit cast from int to long. Suggest adding the result holder for int type in this PR, and we may add it for long types if needed in the future.

if (!nullBitmap.contains(i)) {
  agg = _merger.merge(agg, bools[i]);
  aggregationResultHolder.setValue((Object) agg);
  if (_merger.isTerminal(agg)) {
Contributor

Move this check to line 117 so that the block can be skipped.
We might not want to do a per-value terminal check because a tight loop tends to perform better. We can terminate early at the block level instead.
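A rough sketch of block-level early termination for AND, assuming each block is an int[] of 0/1 values. In Pinot the aggregate method is invoked once per block and state lives in a result holder; the outer loop here only simulates that, and all names are hypothetical. The terminal check runs once per block, so the inner loop stays branch-free.

```java
final class BlockLevelEarlyExit {
  /**
   * Returns {aggregate, blocksScanned}. The terminal check happens once per block,
   * never per value, which keeps the inner loop tight.
   */
  static int[] andOverBlocks(int[][] blocks) {
    int agg = 1;
    int scanned = 0;
    for (int[] block : blocks) {
      if (agg == 0) {
        break; // AND is already false: skip all remaining blocks
      }
      for (int v : block) {
        agg &= v;
      }
      scanned++;
    }
    return new int[] {agg, scanned};
  }
}
```

The trade-off is that a block containing an early false is still scanned to its end, in exchange for a simpler inner loop.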

@agavra agavra requested review from walterddr and Jackie-Jiang and removed request for walterddr and Jackie-Jiang November 30, 2022 18:37
@walterddr walterddr left a comment

lgtm

@Jackie-Jiang Jackie-Jiang left a comment

LGTM. Only minor comments


// early terminate on a per-block level to allow the
// loop below to be more tightly optimized (avoid a branch)
if (_merger.isTerminal(agg)) {
Contributor

(minor) suggest moving this above the nullBitmap check, which is much more costly

// TODO: change this to implement BaseSingleInputAggregationFunction<Boolean, Boolean> when we get proper
// handling of booleans in serialization - today this would fail because ColumnDataType#convert assumes
// that the boolean is encoded as its stored type (an integer)
public abstract class BaseBooleanAggregateFunction extends BaseSingleInputAggregationFunction<Integer, Integer> {
Contributor

(nit) We usually call it AggregationFunction instead of AggregateFunction. Keeping them consistent can help with the code search. Same for the sub-classes

DISTINCT("distinct"),

// boolean aggregate functions
BOOLAND("bool_and"),
Contributor

Keep the name camel case (boolAnd and boolOr)

@walterddr walterddr merged commit 0a6d843 into apache:master Dec 1, 2022
@agavra agavra deleted the boolean_agg branch December 1, 2022 15:21

4 participants