perf: Implement boolean group values #17726

ashdnazg · 2025-09-22T15:48:28Z

Which issue does this PR close?

N/A

Rationale for this change

Having more data types supported in this code path avoids converting the group by columns to rows during aggregation.

What changes are included in this PR?

Adding boolean support for single and multiple grouping values.
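
For readers unfamiliar with this code path, here is a minimal sketch of why a boolean key can be grouped without row conversion. It is an illustration under assumptions, not the implementation in this PR: `BoolGroups`, `slots`, and `intern` are hypothetical names. The point is that a nullable boolean has at most three distinct values (false, true, NULL), so three slots are enough to assign group ids.

```rust
/// Toy group-id assigner for a single nullable boolean column.
/// (Hypothetical type for illustration; not a DataFusion API.)
#[derive(Default)]
struct BoolGroups {
    /// Group id for each possible key: index 0 = false, 1 = true, 2 = NULL.
    slots: [Option<usize>; 3],
    /// Id handed out to the next key value seen for the first time.
    next_id: usize,
}

impl BoolGroups {
    /// Returns one group id per input value, creating ids on first sight.
    fn intern(&mut self, column: &[Option<bool>]) -> Vec<usize> {
        column
            .iter()
            .map(|v| {
                let slot = match v {
                    Some(false) => 0,
                    Some(true) => 1,
                    None => 2,
                };
                match self.slots[slot] {
                    Some(id) => id,
                    None => {
                        let id = self.next_id;
                        self.slots[slot] = Some(id);
                        self.next_id += 1;
                        id
                    }
                }
            })
            .collect()
    }
}
```

No hashing and no row encoding are involved in this toy version; the real builders also have to handle emitting groups and multi-column keys.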

Are these changes tested?

I added unit tests for the multi-value implementation, and both the single-value and multi-value implementations are tested in the SQL logic tests.

Are there any user-facing changes?

No

alamb

This looks great @ashdnazg -- thank you

I will try and review it carefully in the next day or two

alamb · 2025-09-23T17:39:57Z

datafusion/physical-plan/src/aggregates/group_values/mod.rs

            }
            DataType::Utf8 => {
-                return Ok(Box::new(GroupValuesByes::<i32>::new(OutputType::Utf8)));
+                return Ok(Box::new(GroupValuesBytes::<i32>::new(OutputType::Utf8)));


GroupValuesByes 🤦

alamb

Thank you @ashdnazg -- I reviewed the code and it all looks pretty good to me (follows the existing patterns)

However, I tested this locally with datafusion-cli and couldn't get it to show any difference. Do you have any benchmarks / queries that showed this would be faster?

Here is what I tried:

Table setup

> create or replace table foo as select random() as float_val, random() < 0.5 as bool_val, case (random() * 4)::integer when 0 THEN 'Foo' WHEN 1 then 'bar' when 2 then 'baz' else 'grogo' end as str_val  from generate_series(1, 1000000000);
0 row(s) fetched.
Elapsed 5.596 seconds.

> select * from foo limit 10;
+----------------------+----------+---------+
| float_val            | bool_val | str_val |
+----------------------+----------+---------+
| 0.003644060055853049 | true     | grogo   |
| 0.5430315943491387   | false    | Foo     |
| 0.0635246361601266   | false    | baz     |
| 0.09122350127376644  | true     | Foo     |
| 0.4325015383008821   | true     | baz     |
| 0.10141972676501176  | true     | Foo     |
| 0.6749389965920886   | false    | grogo   |
| 0.6319317066473308   | false    | grogo   |
| 0.07946106391385499  | true     | bar     |
| 0.5026330571571326   | true     | Foo     |
+----------------------+----------+---------+
10 row(s) fetched.
Elapsed 0.071 seconds.

Test query

And then ran

> select avg(float_val), bool_val, str_val from foo GROUP BY bool_val, str_val;

results

main: 1.645 seconds
This branch: 1.695 seconds

(basically no change)

🤔

alamb · 2025-09-25T18:08:54Z

datafusion/physical-plan/src/aggregates/group_values/multi_group_by/boolean.rs

+    nulls: MaybeNullBufferBuilder,
+}
+
+impl<const NULLABLE: bool> BooleanGroupValueBuilder<NULLABLE> {


this code is pretty similar to PrimitiveGroupValueBuilder but I don't see any obvious way to improve that

Agreed. It's a recurring theme, I feel, that the boolean implementations are very similar to the primitive ones.

ashdnazg · 2025-09-25T19:49:15Z

Thank you for benchmarking! I thought I had some promising results, but maybe I was mistaken.
I'll look into it.

kosiew

Left a question.

kosiew · 2025-09-27T03:40:16Z

datafusion/physical-plan/src/aggregates/group_values/multi_group_by/boolean.rs

+        new_builder.finish();
+
+        Arc::new(BooleanArray::new(new_builder.finish(), first_n_nulls))


Why call new_builder.finish() twice?

Thanks for catching that! I'll add a test to verify that take_n works correctly.
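
To make the concern concrete, here is a small sketch of what a doubled `finish()` does, assuming the builder resets itself when finished (as Arrow's buffer builders generally do). `ToyBoolBuilder` is a hypothetical stand-in written with the standard library only; it is not Arrow's `BooleanBufferBuilder`.

```rust
/// Toy stand-in for a builder whose `finish` hands over the accumulated
/// values and leaves the builder empty. (Hypothetical, for illustration.)
#[derive(Default)]
struct ToyBoolBuilder {
    values: Vec<bool>,
}

impl ToyBoolBuilder {
    fn append(&mut self, v: bool) {
        self.values.push(v);
    }

    /// Returns everything appended so far and resets the builder.
    fn finish(&mut self) -> Vec<bool> {
        std::mem::take(&mut self.values)
    }
}

/// Mirrors the shape of the flagged snippet: the first result is dropped,
/// so the second call only sees the already-reset builder.
fn finish_twice(input: &[bool]) -> Vec<bool> {
    let mut builder = ToyBoolBuilder::default();
    for &v in input {
        builder.append(v);
    }
    builder.finish(); // values are moved out and discarded here
    builder.finish() // nothing left to return
}

/// The corrected shape: a single call keeps the values.
fn finish_once(input: &[bool]) -> Vec<bool> {
    let mut builder = ToyBoolBuilder::default();
    for &v in input {
        builder.append(v);
    }
    builder.finish()
}
```

Under that assumption, the doubled call would hand an empty value buffer to the array constructor, which is the kind of problem a `take_n` test should catch.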

rluvaton · 2025-09-29T11:55:14Z

datafusion/physical-plan/src/aggregates/group_values/multi_group_by/boolean.rs

+        lhs_rows: &[usize],
+        array: &ArrayRef,
+        rhs_rows: &[usize],
+        equal_to_results: &mut [bool],


Because `equal_to_results` is a slice and not a bit-packed buffer, it limits the optimizations I can apply in my work on an optimized version for the all-unique case. For example, for non-nullable columns, checking whether two arrays are equal is a simple NOT XOR.

That's beyond the scope of this PR

I filed a new ticket to track these ideas so we don't forget them

Improvements to BooleanGroupValueBuilder (grouping by boolean columns) #17860
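
The NOT XOR idea from this thread can be sketched as follows. This is a standalone illustration on plain `u64` words, assuming both columns are non-nullable and bit-packed with the least significant bit first; `eq_bits` and `all_equal` are hypothetical helpers, not Arrow or DataFusion functions.

```rust
/// Bitwise equality of two bit-packed boolean columns.
/// Bit i of the result is set when bit i of `a` equals bit i of `b`.
/// Bits beyond the logical length are not meaningful.
fn eq_bits(a: &[u64], b: &[u64]) -> Vec<u64> {
    a.iter().zip(b).map(|(x, y)| !(x ^ y)).collect()
}

/// Returns true when the first `len` bits of `a` and `b` are identical.
fn all_equal(a: &[u64], b: &[u64], len: usize) -> bool {
    assert!(a.len() == b.len() && len <= a.len() * 64);
    let full_words = len / 64;
    let remaining_bits = len % 64;

    // Whole words: any set bit in the XOR marks a difference.
    if a[..full_words]
        .iter()
        .zip(&b[..full_words])
        .any(|(x, y)| x ^ y != 0)
    {
        return false;
    }
    if remaining_bits == 0 {
        return true;
    }
    // Partial last word: ignore the padding bits above `remaining_bits`.
    let mask = (1u64 << remaining_bits) - 1;
    (a[full_words] ^ b[full_words]) & mask == 0
}
```

Each word compares 64 rows at once, which is what a `&mut [bool]` result slice cannot express directly.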

rluvaton · 2025-09-29T12:00:13Z

datafusion/physical-plan/src/aggregates/group_values/multi_group_by/boolean.rs

+        lhs_rows: &[usize],
+        array: &ArrayRef,
+        rhs_rows: &[usize],
+        equal_to_results: &mut [bool],


I will try to change it to MutableBooleanBuffer or something in the future to allow for more optimizations

ashdnazg · 2025-09-29T14:20:25Z

@alamb
I added a boolean column to Query 10 in h2o_medium:

SELECT id1, id2, id3, id4, id5, id6, SUM(v3) AS v3, COUNT(*) AS count, (v1 % 2)=0 AS b1 FROM x GROUP BY id1, id2, id3, id4, id5, id6, b1;

And I get the following results locally

Without the boolean impl:

Query 1 iteration 1 took 8836.5 ms and returned 100000000 rows
Query 1 iteration 2 took 8293.7 ms and returned 100000000 rows
Query 1 iteration 3 took 7957.9 ms and returned 100000000 rows
Query 1 avg time: 8362.69 ms

With the boolean impl

Query 1 iteration 1 took 5202.1 ms and returned 100000000 rows
Query 1 iteration 2 took 4666.2 ms and returned 100000000 rows
Query 1 iteration 3 took 4577.2 ms and returned 100000000 rows
Query 1 avg time: 4815.19 ms

In general the multi group by option seems to make certain scenarios worse.
I added a flag for easy control of whether it's used in this commit: ashdnazg@6a0b13b
and then checked:

create or replace table foo as select (random() * 4)::integer as int_val, (random() * 4)::integer as int_val2, (random() * 4)::integer as int_val3  from generate_series(1, 1000000000);
set datafusion.execution.enable_multi_group_by to false;
select int_val, int_val2, int_val3 from foo GROUP BY int_val, int_val2, int_val3;
set datafusion.execution.enable_multi_group_by to true;
select int_val, int_val2, int_val3 from foo GROUP BY int_val, int_val2, int_val3;

I get "Elapsed 1.037 seconds." with the flag set to false and "Elapsed 1.382 seconds." with the flag set to true.

Perhaps the logic for when to use it should be more careful than just "whenever it's supported". But I suspect this PR is not the right spot for that discussion.

martin-g · 2025-09-30T06:57:42Z

datafusion/physical-plan/src/aggregates/group_values/multi_group_by/boolean.rs

+        );
+
+        for (&lhs_row, &rhs_row, equal_to_result) in iter {
+            // Has found not equal to in previous column, don't need to check


This comment does not add any more information than the next line.
IMO it could be removed, or updated to explain why there is no need to check.
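
As a sketch of what a more informative comment could say, here is a simplified version of the column-at-a-time comparison loop. It uses plain `bool` slices instead of Arrow arrays and ignores nulls; the function name matches the pattern discussed in this thread, but the signature is an assumption made for illustration.

```rust
/// Column-at-a-time equality check for multi-column group keys.
/// `equal_to_results[i]` starts as `true` and is narrowed column by column.
fn vectorized_equal_to(
    group_values: &[bool],          // values already stored for this column
    lhs_rows: &[usize],             // candidate group indices
    input: &[bool],                 // incoming batch for this column
    rhs_rows: &[usize],             // input row indices
    equal_to_results: &mut [bool],  // one flag per (lhs, rhs) pair
) {
    let iter = lhs_rows
        .iter()
        .zip(rhs_rows)
        .zip(equal_to_results.iter_mut());

    for ((&lhs_row, &rhs_row), equal_to_result) in iter {
        // An earlier column already proved this pair differs. A group key
        // matches only if every column matches, so the outcome cannot
        // change and comparing this column would be wasted work.
        if !*equal_to_result {
            continue;
        }
        *equal_to_result = group_values[lhs_row] == input[rhs_row];
    }
}
```

Calling it once per key column, with the same `equal_to_results` slice, leaves `true` only for pairs that matched in every column.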

martin-g · 2025-09-30T07:01:59Z

datafusion/physical-plan/src/aggregates/group_values/multi_group_by/boolean.rs

+
+        let null_count = array.null_count();
+        let num_rows = array.len();
+        let all_null_or_non_null = if null_count == 0 {


IMO this would be clearer with an enum: NO_NULLS, ALL_NULLS and SOME_NULLS
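
A rough sketch of the enum suggestion, assuming the classification depends only on `null_count` and `num_rows` as in the quoted lines. The name `Nulls`, its variants, and `classify` are placeholders chosen for illustration and may differ from whatever a follow-up PR uses.

```rust
/// The three nullability situations an input array can be in.
/// (Hypothetical names, following the suggestion above in Rust casing.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Nulls {
    /// No element is null.
    NoNulls,
    /// Every element is null.
    AllNulls,
    /// A mix of null and non-null elements.
    SomeNulls,
}

/// Classifies an array from its null count and length.
/// An empty array is treated as having no nulls.
fn classify(null_count: usize, num_rows: usize) -> Nulls {
    if null_count == 0 {
        Nulls::NoNulls
    } else if null_count == num_rows {
        Nulls::AllNulls
    } else {
        Nulls::SomeNulls
    }
}
```

A `match` on the result then makes each branch self-describing, compared with an `Option<bool>`-style `all_null_or_non_null` flag.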

alamb · 2025-09-30T18:38:44Z

Perhaps the logic for when to use it should be more careful than just "whenever it's supported". But I suspect this PR is not the right spot for that discussion.

Good call -- filed #17850

alamb

Thanks @ashdnazg

I took another look through this PR and was able to reproduce your numbers and I think it is ready to go

Here is how I benchmarked the results:

./bench.sh data h2o_medium

And then ran

datafusion-cli -c "SELECT id1, id2, id3, id4, id5, id6, SUM(v3) AS v3, COUNT(*) AS count, (v1 % 2)=0 AS b1 FROM '/Users/andrewlamb/Software/datafusion/benchmarks/data/h2o/G1_1e8_1e8_100_0.csv' GROUP BY id1, id2, id3, id4, id5, id6, b1;"

main

Elapsed 10.549 seconds.
Elapsed 8.316 seconds.
Elapsed 8.762 seconds.

This PR

Elapsed 4.625 seconds.
Elapsed 4.100 seconds.
Elapsed 3.722 seconds.

🚀

@martin-g I think your review suggestions about comments and enums are great. Perhaps you can create a follow-on PR to implement them?

alamb · 2025-10-01T13:22:31Z

datafusion/physical-plan/src/aggregates/group_values/multi_group_by/boolean.rs

+        lhs_rows: &[usize],
+        array: &ArrayRef,
+        rhs_rows: &[usize],
+        equal_to_results: &mut [bool],


I filed a new ticket to track these ideas so we don't forget them

Improvements to BooleanGroupValueBuilder (grouping by boolean columns) #17860

alamb · 2025-10-06T11:21:36Z

Thanks again @ashdnazg @kosiew and @rluvaton

…n array (#18048) * chore: Use an enum to express the different kinds of nullability in an array Follow-up of #17726 (review) * Use the Nulls enum also in .../multi_group_by/primitive.rs * Use the Nulls enum in .../multi_group_by/bytes[_view].rs as well

ashdnazg added 2 commits September 22, 2025 14:58

chore: fix typos in group_values.

24c1aee

perf: add support for boolean columns in single and multi group by va…

126817c

…lues.

alamb reviewed Sep 23, 2025

alamb added the performance Make DataFusion faster label Sep 25, 2025

Merge branch 'main' into multi-group-builders

04ee663

github-actions bot added the physical-plan Changes to the physical-plan crate label Sep 25, 2025

alamb reviewed Sep 25, 2025

kosiew reviewed Sep 27, 2025

rluvaton reviewed Sep 29, 2025

remove extra finish

c98922f

martin-g reviewed Sep 30, 2025

alamb mentioned this pull request Sep 30, 2025

In general the multi group by option seems to make certain scenarios worse. #17850

Open

alamb mentioned this pull request Oct 1, 2025

Improvements to BooleanGroupValueBuilder (grouping by boolean columns) #17860

Open

alamb approved these changes Oct 1, 2025

alamb changed the title ~~perf: boolean group values implementations~~ perf: Implement boolean group values Oct 1, 2025

alamb added this pull request to the merge queue Oct 6, 2025

Merged via the queue into apache:main with commit 04af0c6 Oct 6, 2025
29 checks passed

martin-g added a commit to martin-g/datafusion that referenced this pull request Oct 14, 2025

chore: Use an enum to express the different kinds of nullability in a…

4efb2b4

…n array Follow-up of apache#17726 (review)

This was referenced Oct 14, 2025

Use an enum to express the nullability of an array #18047

Closed

chore: Use an enum to express the different kinds of nullability in an array #18048

Merged
