Implement groups accumulator for stddev and variance #12095

eejbyfeldt · 2024-08-21T08:37:54Z

Which issue does this PR close?

Closes #12094.

Rationale for this change

Hopefully improve performance of queries using stddev and variance aggregates.

I did not find any current benchmarks that use stddev or variance, so I modified clickbench queries 31 and 32 to use STDDEV instead of AVG. These were the results:

Before

Running benchmarks with the following options: RunOpt { query: None, common: CommonOpt { iterations: 5, partitions: None, batch_size: 8192, debug: false, string_view: false }, path: "/home/eejbyfeldt/dev/apache/datafusion/benchmarks/data/hits.parquet", queries_path: "/home/eejbyfeldt/dev/apache/datafusion/benchmarks/queries/clickbench/queries.sql", output_path: Some("/home/eejbyfeldt/dev/apache/datafusion/benchmarks/results/implement-groups-accumulator-for-stddev/clickbench_1.json") }
Q0: SELECT "SearchEngineID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), STDDEV("ResolutionWidth") FROM hits WHERE "SearchPhrase" <> '' GROUP BY "SearchEngineID", "ClientIP" ORDER BY c DESC LIMIT 10;
Query 0 iteration 0 took 3945.8 ms and returned 10 rows
Query 0 iteration 1 took 3993.8 ms and returned 10 rows
Query 0 iteration 2 took 4016.9 ms and returned 10 rows
Query 0 iteration 3 took 3991.3 ms and returned 10 rows
Query 0 iteration 4 took 4024.2 ms and returned 10 rows
Q1: SELECT "WatchID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), STDDEV("ResolutionWidth") FROM hits WHERE "SearchPhrase" <> '' GROUP BY "WatchID", "ClientIP" ORDER BY c DESC LIMIT 10;
Query 1 iteration 0 took 6551.8 ms and returned 10 rows
Query 1 iteration 1 took 6642.9 ms and returned 10 rows
Query 1 iteration 2 took 6618.6 ms and returned 10 rows
Query 1 iteration 3 took 6680.7 ms and returned 10 rows
Query 1 iteration 4 took 6568.3 ms and returned 10 rows
Done

After

Running benchmarks with the following options: RunOpt { query: None, common: CommonOpt { iterations: 5, partitions: None, batch_size: 8192, debug: false, string_view: false }, path: "/home/eejbyfeldt/dev/apache/datafusion/benchmarks/data/hits.parquet", queries_path: "/home/eejbyfeldt/dev/apache/datafusion/benchmarks/queries/clickbench/queries.sql", output_path: Some("/home/eejbyfeldt/dev/apache/datafusion/benchmarks/results/implement-groups-accumulator-for-stddev/clickbench_1.json") }
Q0: SELECT "SearchEngineID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), STDDEV("ResolutionWidth") FROM hits WHERE "SearchPhrase" <> '' GROUP BY "SearchEngineID", "ClientIP" ORDER BY c DESC LIMIT 10;
Query 0 iteration 0 took 2144.0 ms and returned 10 rows
Query 0 iteration 1 took 2155.1 ms and returned 10 rows
Query 0 iteration 2 took 2095.3 ms and returned 10 rows
Query 0 iteration 3 took 2100.2 ms and returned 10 rows
Query 0 iteration 4 took 2238.7 ms and returned 10 rows
Q1: SELECT "WatchID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), STDDEV("ResolutionWidth") FROM hits WHERE "SearchPhrase" <> '' GROUP BY "WatchID", "ClientIP" ORDER BY c DESC LIMIT 10;
Query 1 iteration 0 took 3075.6 ms and returned 10 rows
Query 1 iteration 1 took 3062.4 ms and returned 10 rows
Query 1 iteration 2 took 3081.9 ms and returned 10 rows
Query 1 iteration 3 took 3105.1 ms and returned 10 rows
Query 1 iteration 4 took 3101.6 ms and returned 10 rows
Done
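
To put a single number on the runs above, the per-iteration timings can be averaged and compared. The snippet below is only a convenience sketch: the constants are copied from the log lines above, and the helper names (`mean`, `speedup`) are made up for this illustration.

```rust
/// Arithmetic mean of a set of timings in milliseconds.
fn mean(samples: &[f64]) -> f64 {
    samples.iter().sum::<f64>() / samples.len() as f64
}

/// Ratio of mean "before" time to mean "after" time (values above 1.0 mean faster).
fn speedup(before: &[f64], after: &[f64]) -> f64 {
    mean(before) / mean(after)
}

// Timings copied from the benchmark output in the PR description.
const Q0_BEFORE: [f64; 5] = [3945.8, 3993.8, 4016.9, 3991.3, 4024.2];
const Q0_AFTER: [f64; 5] = [2144.0, 2155.1, 2095.3, 2100.2, 2238.7];
const Q1_BEFORE: [f64; 5] = [6551.8, 6642.9, 6618.6, 6680.7, 6568.3];
const Q1_AFTER: [f64; 5] = [3075.6, 3062.4, 3081.9, 3105.1, 3101.6];

fn main() {
    println!("Q0 speedup: {:.2}x", speedup(&Q0_BEFORE, &Q0_AFTER));
    println!("Q1 speedup: {:.2}x", speedup(&Q1_BEFORE, &Q1_AFTER));
}
```

On these numbers Q0 comes out at a bit under 1.9x and Q1 at a bit over 2.1x.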

What changes are included in this PR?

An implementation of GroupsAccumulator for the stddev and variance aggregates.

It extracts the core logic of NullState::accumulate into a free function, accumulate, so that it can be reused when implementing GroupsAccumulators that do not need separate null tracking because nullness can be determined from some other value (in this case the count).
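
As a rough illustration of that refactoring idea, here is a simplified, std-only stand-in. It is not the DataFusion code: plain slices replace the Arrow arrays, and `count_and_sum` is a hypothetical caller invented for this sketch. The helper does nothing but walk the rows and call a closure for each valid one, which leaves null bookkeeping to the caller.

```rust
/// Simplified stand-in for a shared `accumulate` helper: calls `value_fn`
/// once per non-null, non-filtered row with that row's group index and value.
/// Assumption: slices of Option<f64> / bool stand in for Arrow arrays.
fn accumulate<F>(
    group_indices: &[usize],
    values: &[Option<f64>],
    opt_filter: Option<&[bool]>,
    mut value_fn: F,
) where
    F: FnMut(usize, f64),
{
    assert_eq!(group_indices.len(), values.len());
    for (row, (&group, value)) in group_indices.iter().zip(values).enumerate() {
        // A missing filter keeps every row.
        let keep = opt_filter.map_or(true, |f| f[row]);
        if let (true, Some(v)) = (keep, value) {
            value_fn(group, *v);
        }
    }
}

/// Hypothetical caller: per-group counts and sums. A count of zero already
/// means "no valid input for this group", so no separate null mask is kept.
fn count_and_sum(
    group_indices: &[usize],
    values: &[Option<f64>],
    num_groups: usize,
) -> (Vec<u64>, Vec<f64>) {
    let mut counts = vec![0u64; num_groups];
    let mut sums = vec![0.0; num_groups];
    accumulate(group_indices, values, None, |group, v| {
        counts[group] += 1;
        sums[group] += v;
    });
    (counts, sums)
}

fn main() {
    let groups = [0, 1, 0, 2];
    let values = [Some(1.0), None, Some(3.0), None];
    let (counts, sums) = count_and_sum(&groups, &values, 3);
    println!("counts = {counts:?}, sums = {sums:?}");
}
```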

Are these changes tested?

Added more test cases in aggregate.slt.

Are there any user-facing changes?

No.

andygrove · 2024-08-22T12:33:43Z

Thanks @eejbyfeldt. Do you have any before/after benchmark results?

eejbyfeldt · 2024-08-23T07:52:37Z

> Thanks @eejbyfeldt. Do you have any before/after benchmark results?

Yes, some benchmarks sound like a great idea. I updated the PR description with some basic queries.

As far as I could tell, stddev/variance was not used in any current benchmarks, so I modified two clickbench queries to use stddev instead of avg. The queries were chosen because they showed big improvements when the GroupsAccumulator API was first implemented (according to https://www.influxdata.com/blog/aggregating-millions-groups-fast-apache-arrow-datafusion/), so they probably show a best-case improvement.

@andygrove Let me know if you think we should be doing something more advanced.

alamb

Thank you @eejbyfeldt and @andygrove -- I reviewed this PR carefully and I think it is really nicely written and tested. I had some small suggestions on adding comments, but otherwise it is really nice: it uses the existing code beautifully.

🚀

I also made a PR with a benchmark for STDDEV and VAR in #12146

Using that benchmark, I confirmed @eejbyfeldt's results that this PR makes stddev and variance significantly faster.

On main
Elapsed 0.451 seconds.
Elapsed 0.469 seconds.

With this branch
Elapsed 0.331 seconds.
Elapsed 0.336 seconds.

datafusion/functions-aggregate-common/src/aggregate/groups_accumulator/accumulate.rs

datafusion/functions-aggregate/src/variance.rs

alamb · 2024-08-24T10:39:51Z

datafusion/functions-aggregate/src/variance.rs

+        emit_to: datafusion_expr::EmitTo,
+    ) -> (Vec<f64>, NullBuffer) {
+        let mut counts = emit_to.take_needed(&mut self.counts);
+        let _ = emit_to.take_needed(&mut self.means);


It took me a while to understand why self.means is ignored in the final calculation (and thus whether there is any need to carry it through).

However, with some study I see that the means are needed to calculate m2 during accumulation but are not used in the final output.

Maybe we could point that out in a comment (I can do it as a follow-on PR too).

Added comment in 0f1b64c
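
For readers who hit the same question, the role of the mean can be seen in a small sketch of Welford's online variance update. This assumes the accumulator follows the usual Welford/Chan formulation; the struct and method names below are invented for the illustration and are not taken from variance.rs.

```rust
/// Per-group running state for Welford's online variance algorithm.
#[derive(Debug, Default, Clone, Copy)]
struct VarianceState {
    count: u64,
    mean: f64,
    /// Sum of squared deviations from the current mean.
    m2: f64,
}

impl VarianceState {
    /// Fold one value into the state.
    fn update(&mut self, value: f64) {
        self.count += 1;
        let delta = value - self.mean;
        self.mean += delta / self.count as f64;
        // The freshly updated mean is required here to advance m2.
        self.m2 += delta * (value - self.mean);
    }

    /// Combine two partial states (parallel formula by Chan et al.).
    fn merge(&mut self, other: &VarianceState) {
        if other.count == 0 {
            return;
        }
        let total = self.count + other.count;
        let delta = other.mean - self.mean;
        self.m2 += other.m2
            + delta * delta * (self.count as f64) * (other.count as f64) / total as f64;
        self.mean += delta * other.count as f64 / total as f64;
        self.count = total;
    }

    /// Population variance: reads only `count` and `m2`, never `mean`.
    fn variance_pop(&self) -> Option<f64> {
        (self.count > 0).then(|| self.m2 / self.count as f64)
    }
}

fn main() {
    let mut state = VarianceState::default();
    for v in [1.0, 2.0, 3.0, 4.0] {
        state.update(v);
    }
    println!("{state:?} -> {:?}", state.variance_pop());
}
```

The mean has to be kept as state because both `update` and `merge` read it, yet the final division touches only `m2` and `count`.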

alamb · 2024-08-24T10:43:51Z

datafusion/functions-aggregate/src/variance.rs

+                *count -= 1;
+            });
+        }
+        let nulls = NullBuffer::from_iter(counts.iter().map(|&count| count != 0));


👍 it took me a while to figure out how nulls were handled -- LGTM
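
A std-only sketch of that null handling, for illustration: validity is derived from the (possibly decremented) count, so no separate null mask is tracked. `Vec<bool>` stands in for Arrow's `NullBuffer`, the function name is hypothetical, and `saturating_sub` is a choice made for this sketch so that an empty group stays at zero instead of underflowing; the excerpt above shows a plain `*count -= 1`.

```rust
/// Turn per-group counts and m2 values into (variance, validity).
/// `validity[i] == false` marks group `i` as NULL in the output.
/// Assumption: Vec<bool> is a stand-in for an Arrow NullBuffer.
fn finalize_variance(
    mut counts: Vec<u64>,
    m2s: Vec<f64>,
    sample: bool,
) -> (Vec<f64>, Vec<bool>) {
    if sample {
        // Sample variance divides by n - 1 instead of n.
        for count in counts.iter_mut() {
            *count = count.saturating_sub(1);
        }
    }
    // A zero divisor means the result is undefined, i.e. NULL.
    let validity: Vec<bool> = counts.iter().map(|&count| count != 0).collect();
    // Entries with a zero divisor hold NaN or infinity; validity masks them.
    let variance: Vec<f64> = m2s
        .iter()
        .zip(&counts)
        .map(|(m2, &count)| m2 / count as f64)
        .collect();
    (variance, validity)
}

fn main() {
    // Three groups with 4, 1 and 0 input rows respectively.
    let (variance, validity) = finalize_variance(vec![4, 1, 0], vec![5.0, 0.0, 0.0], true);
    println!("variance = {variance:?}, validity = {validity:?}");
}
```

With sample statistics, a group that saw exactly one row ends up NULL as well, because its divisor n - 1 is zero.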

alamb · 2024-08-24T10:44:01Z

datafusion/sqllogictest/test_files/aggregate.slt

@@ -500,6 +500,85 @@ select stddev(sq.column1) from (values (1.1), (2.0), (3.0)) as sq
 ----
 0.950438495292

+# csv_query_stddev_7


💯 for test coverage

Dandandan · 2024-08-25T05:15:16Z

datafusion/functions-aggregate/src/variance.rs

+        emit_to: datafusion_expr::EmitTo,
+    ) -> (Vec<f64>, NullBuffer) {
+        let mut counts = emit_to.take_needed(&mut self.counts);
+        let _ = emit_to.take_needed(&mut self.means);


Perhaps there could be a take_needed that doesn't generate / return the Vec

Although I think it won't gain much, so probably better to leave as is.

Yeah, there are probably no performance gains, as it would need to do the same work as take_needed.

I guess it might be worth adding if this becomes a common pattern in more implementations.
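
To make the trade-off concrete, here is a std-only sketch. The `EmitTo` enum below is a local stand-in reduced to two variants, and `discard_needed` is a hypothetical name for the variant being discussed; neither is claimed to match the library's definition exactly.

```rust
/// Local stand-in for `datafusion_expr::EmitTo`, reduced to two cases.
#[derive(Debug, Clone, Copy)]
enum EmitTo {
    /// Emit every group.
    All,
    /// Emit only the first `n` groups.
    First(usize),
}

impl EmitTo {
    /// Remove the emitted groups from `v` and return them; the remaining
    /// groups shift to the front of `v`.
    fn take_needed<T>(&self, v: &mut Vec<T>) -> Vec<T> {
        match self {
            EmitTo::All => std::mem::take(v),
            EmitTo::First(n) => {
                let mut rest = v.split_off(*n);
                std::mem::swap(v, &mut rest);
                rest
            }
        }
    }

    /// Hypothetical variant that drops the emitted groups without
    /// returning them. The remaining elements still have to be shifted.
    fn discard_needed<T>(&self, v: &mut Vec<T>) {
        match self {
            EmitTo::All => v.clear(),
            EmitTo::First(n) => {
                v.drain(..*n);
            }
        }
    }
}

fn main() {
    let mut counts = vec![1, 2, 3, 4];
    let mut means = vec![0.1, 0.2, 0.3, 0.4];
    let emitted = EmitTo::First(2).take_needed(&mut counts);
    EmitTo::First(2).discard_needed(&mut means);
    println!("emitted = {emitted:?}, counts = {counts:?}, means = {means:?}");
}
```

Either way the state vectors must stay aligned after an emit, which is why the means are taken even though their values are not used.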

alamb · 2024-08-25T15:09:11Z

I'll plan to merge this tomorrow unless there are additional comments

alamb · 2024-08-26T19:11:26Z

Thanks again @eejbyfeldt and @Dandandan

github-actions bot added sqllogictest SQL Logic Tests (.slt) functions labels Aug 21, 2024

eejbyfeldt mentioned this pull request Aug 21, 2024

Improve performance of standard deviation aggregate apache/datafusion-comet#824

Open

andygrove mentioned this pull request Aug 21, 2024

Fix performance regression with stddev being enabled by default apache/datafusion-comet#840

Closed

andygrove requested review from huaxingao and kazuyukitanimura August 22, 2024 12:32

eejbyfeldt force-pushed the implement-groups-accumulator-for-stddev branch from 847e122 to 525106d Compare August 23, 2024 07:54

alamb mentioned this pull request Aug 24, 2024

Add benchmark for STDDEV and VAR to Clickbench extended #12146

Merged

alamb approved these changes Aug 24, 2024

Dandandan reviewed Aug 25, 2024

eejbyfeldt added 4 commits August 25, 2024 08:31

Add more stddev/var tests cases

cc226ad

Add test cases for stddev_samp/pop and var_samp/pop that include a GROUP BY clause.

Implement GroupsAccumulator for stddev and variance

8b96c5a

Add cast to support all numeric types

1bf7b9b

Improve documentation and comments

0f1b64c

eejbyfeldt force-pushed the implement-groups-accumulator-for-stddev branch from 525106d to 0f1b64c Compare August 25, 2024 07:03

alamb mentioned this pull request Aug 25, 2024

Proposal: Create dfdb, a new CLI different than datafusion-cli with pre-built integrations #11979

Closed

Dandandan approved these changes Aug 25, 2024

alamb merged commit 1b875f4 into apache:main Aug 26, 2024
24 checks passed

eejbyfeldt added a commit to eejbyfeldt/datafusion that referenced this pull request Sep 25, 2024

Fix panic in variance GroupsAccumulator

eed44cc

The bug was in the original implementation in apache#12095. This fixes the issue and modifies a test case so that it would have caught it.

eejbyfeldt mentioned this pull request Sep 25, 2024

fix: Panic/correctness issue in variance GroupsAccumulator #12615

Merged

alamb pushed a commit that referenced this pull request Sep 25, 2024

Fix panic in variance GroupsAccumulator (#12615)

61e6db3

The bug was in the original implementation in #12095. This fixes the issue and modifies a test case so that it would have caught it.
