New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
support BOOL_AND and BOOL_OR aggregate functions #9848
Conversation
fcbbceb
to
d5fdad1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm overall. please take a look at the comments.
}, | ||
{ | ||
"psql": "9.21.0", | ||
"description": "aggregate boolean column", | ||
"sql": "SELECT bool_and(bool_col), bool_or(bool_col) FROM {tbl} GROUP BY string_col" | ||
}, | ||
{ | ||
"psql": "9.21.0", | ||
"description": "aggregate boolean column no group by", | ||
"sql": "SELECT bool_and(bool_col), bool_or(bool_col) FROM {tbl}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we support SELECT bool_and(startsWith(string_col, 'foo')) FROM ...
?
can we add a test for these type of use case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we do not support this yet (for any aggregate function in v2). This is a huge gap, good call out.
@@ -444,6 +444,20 @@ public void testGetAggregationFunction() { | |||
assertEquals(aggregationFunction.getType(), AggregationFunctionType.PERCENTILETDIGESTMV); | |||
assertEquals(aggregationFunction.getColumnName(), "percentileTDigest95.0MV_column"); | |||
assertEquals(aggregationFunction.getResultColumnName(), "percentiletdigestmv(column, 95.0)"); | |||
|
|||
function = getFunction("bool_and"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since we are adding this feature to V1 as well. we should add test for them in V1 too. see: #9236 when adding COVAR_POP
Codecov Report
@@ Coverage Diff @@
## master #9848 +/- ##
============================================
+ Coverage 63.44% 68.53% +5.08%
+ Complexity 5326 4924 -402
============================================
Files 1960 1978 +18
Lines 105350 105976 +626
Branches 15960 16057 +97
============================================
+ Hits 66840 72628 +5788
+ Misses 33589 28218 -5371
- Partials 4921 5130 +209
Flags with carried forward coverage won't be shown. Click here to find out more.
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may consider using different code path with/without null handling to avoid the unnecessary boxing/unboxing. We have observed quite big performance impact using object vs primitive
import org.roaringbitmap.RoaringBitmap; | ||
|
||
|
||
public abstract class BaseBooleanAggregateFunction extends BaseSingleInputAggregationFunction<Integer, Integer> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be BaseSingleInputAggregationFunction<Boolean, Boolean>
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's what I did at the start, but I think there's some issue with it because booleans are stored as ints. I don't remember exactly what it was, but I can try to make that change back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that lots of the reducing code expects the data to be in stored type format. If we output a boolean, that breaks (see for example usage of ColumnDataType#convert
, which expects booleans to be stored as ints).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. We can fix it separately. Suggest adding a TODO for it
protected enum BooleanMerge { | ||
AND { | ||
@Override | ||
int merge(Integer left, int right) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For better performance, we want to avoid per value merge, but do array batch processing. Also the null
values should be handled in batches
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline, we'll keep the enum to avoid duplicating code - I wrote this JMH to confirm that there shouldn't be a performance impact (see below). We can always go back and optimize it if need be.
@Setup
public void setUp() {
Random random = new Random();
_ints = IntStream.generate(random::nextInt).limit(1000).toArray();
_enum = MyEnum.ADD;
}
@Benchmark
public int inline() {
int sum = 0;
for (int i : _ints) {
sum = sum + i;
}
return sum;
}
@Benchmark
public int delegate() {
int sum = 0;
for (int i : _ints) {
sum = _enum.apply(sum, i);
}
return sum;
}
The result was:
Benchmark Mode Cnt Score Error Units
BenchmarkEnumOverhead.delegate avgt 5 0.306 ± 0.040 us/op
BenchmarkEnumOverhead.inline avgt 5 0.306 ± 0.033 us/op
includeDoc = docId -> true; | ||
} | ||
|
||
Integer agg = aggregationResultHolder.getResult(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can early terminate when the previous value is true
for OR
and false
for AND
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly good
|
||
if (blockValSet.getValueType().getStoredType() != FieldSpec.DataType.INT) { | ||
throw new IllegalArgumentException( | ||
String.format("Unsupported data type %s for BOOL_AND", blockValSet.getValueType())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(minor) This can also be BOOL_OR
Map<ExpressionContext, BlockValSet> blockValSetMap) { | ||
BlockValSet blockValSet = blockValSetMap.get(_expression); | ||
|
||
if (blockValSet.getValueType().getStoredType() != FieldSpec.DataType.INT) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to support it on INT
column? If not, we should explicitly check the value type instead of the stored type. The current logic won't work well when the input value is not 0/1.
OR { | ||
@Override | ||
long merge(long left, int right) { | ||
return left + right; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be left | right
. For boolean result, the value should be either 0 or 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice! I used + and then just checked >0
but this is cleaner. I also changed AND
to use &
protected enum BooleanMerge { | ||
AND { | ||
@Override | ||
long merge(long left, int right) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should store the value as int to avoid the implicit cast from int to long. Suggest adding the result holder for int type in this PR, and we may add it for long types if needed in the future.
if (!nullBitmap.contains(i)) { | ||
agg = _merger.merge(agg, bools[i]); | ||
aggregationResultHolder.setValue((Object) agg); | ||
if (_merger.isTerminal(agg)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this check to line 117 so that the block can be skipped.
We might not want to do per-value terminal check because tight loop tends to have better performance. We can do early terminate on block level
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Only minor comments
pinot-core/src/main/java/org/apache/pinot/core/query/aggregation/AggregationResultHolder.java
Outdated
Show resolved
Hide resolved
...ore/src/main/java/org/apache/pinot/core/query/aggregation/DoubleAggregationResultHolder.java
Outdated
Show resolved
Hide resolved
...ore/src/main/java/org/apache/pinot/core/query/aggregation/DoubleAggregationResultHolder.java
Outdated
Show resolved
Hide resolved
|
||
// early terminate on a per-block level to allow the | ||
// loop below to be more tightly optimized (avoid a branch) | ||
if (_merger.isTerminal(agg)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(minor) suggest moving this above the nullBitmap check, which is much more costly
// TODO: change this to implement BaseSingleInputAggregationFunction<Boolean, Boolean> when we get proper | ||
// handling of booleans in serialization - today this would fail because ColumnDataType#convert assumes | ||
// that the boolean is encoded as its stored type (an integer) | ||
public abstract class BaseBooleanAggregateFunction extends BaseSingleInputAggregationFunction<Integer, Integer> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) We usually call it AggregationFunction
instead of AggregateFunction
. Keeping them consistent can help with the code search. Same for the sub-classes
DISTINCT("distinct"), | ||
|
||
// boolean aggregate functions | ||
BOOLAND("bool_and"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keep the name camel case (boolAnd
and boolOr
)
Support the
BOOL_AND
/BOOL_OR
functions (see https://www.postgresql.org/docs/9.1/functions-aggregate.html for documentation on function behavior).Review Notes
PinotBoolAndAggregateFunction
andPinotBoolOrAggregateFunction
allow us to register the functions with thePinotStdOperatorTable
and have them be case insensitive, and it also makes sure the types are properly propagated