
Conversation

@crepererum crepererum commented Nov 28, 2022

Which issue does this PR close?

Closes #3940.

Rationale for this change

Ensure that users don't run out of memory while performing group-by operations. This is especially important for servers and multi-tenant systems.

What changes are included in this PR?

Same as #4371 and #4202 but for AggregateStream (used when there are no group keys).
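
For context, here is a minimal sketch of the pattern this and the earlier PRs follow: the stream tracks how large its accumulator state has grown and charges that growth against a shared byte budget, failing instead of allocating past the limit. All names below are illustrative, not the actual DataFusion types touched by this PR.

```rust
// Illustrative sketch only; type and method names are hypothetical and do not
// mirror the real DataFusion memory-manager API used by this PR.
use std::sync::atomic::{AtomicUsize, Ordering};

/// A shared byte budget that stream reservations are charged against.
struct MemoryPool {
    limit: usize,
    used: AtomicUsize,
}

impl MemoryPool {
    /// Try to reserve `bytes` more; fail with a "resources exhausted"-style
    /// error instead of exceeding the configured limit.
    fn try_grow(&self, bytes: usize) -> Result<(), String> {
        let prev = self.used.fetch_add(bytes, Ordering::SeqCst);
        if prev + bytes > self.limit {
            self.used.fetch_sub(bytes, Ordering::SeqCst); // roll back
            return Err(format!(
                "Resources exhausted: would use {} of {} bytes",
                prev + bytes,
                self.limit
            ));
        }
        Ok(())
    }
}

/// The no-group aggregation state re-checks its reservation whenever the
/// accumulators grow while consuming input batches.
struct AggregateState<'a> {
    pool: &'a MemoryPool,
    reserved: usize,
}

impl AggregateState<'_> {
    fn update_reservation(&mut self, state_size: usize) -> Result<(), String> {
        if state_size > self.reserved {
            self.pool.try_grow(state_size - self.reserved)?;
            self.reserved = state_size;
        }
        Ok(())
    }
}
```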

Are these changes tested?

The existing test_oom test was extended. Benchmark results:

❯ cargo bench -p datafusion --bench aggregate_query_sql -- --baseline issue3940e-pre
    Finished bench [optimized] target(s) in 0.09s
     Running benches/aggregate_query_sql.rs (target/release/deps/aggregate_query_sql-6efc67fad0362d05)
aggregate_query_no_group_by 15 12
                        time:   [690.60 µs 691.61 µs 692.75 µs]
                        change: [+0.2423% +0.7005% +1.1322%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe

aggregate_query_no_group_by_min_max_f64
                        time:   [628.35 µs 630.25 µs 632.41 µs]
                        change: [-0.0124% +0.5867% +1.1382%] (p = 0.04 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low severe
  4 (4.00%) high mild
  5 (5.00%) high severe

aggregate_query_no_group_by_count_distinct_wide
                        time:   [2.5104 ms 2.5292 ms 2.5484 ms]
                        change: [-0.0360% +0.9941% +2.1057%] (p = 0.07 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

Benchmarking aggregate_query_no_group_by_count_distinct_narrow: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.7s, enable flat sampling, or reduce sample count to 50.
aggregate_query_no_group_by_count_distinct_narrow
                        time:   [1.7087 ms 1.7170 ms 1.7250 ms]
                        change: [+0.7899% +1.9667% +3.0347%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

aggregate_query_group_by
                        time:   [2.2304 ms 2.2478 ms 2.2671 ms]
                        change: [-0.4475% +0.6787% +1.7891%] (p = 0.21 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

Benchmarking aggregate_query_group_by_with_filter: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.7s, enable flat sampling, or reduce sample count to 60.
aggregate_query_group_by_with_filter
                        time:   [1.1318 ms 1.1354 ms 1.1395 ms]
                        change: [-0.8011% -0.0882% +0.5275%] (p = 0.81 > 0.05)
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  9 (9.00%) high mild
  3 (3.00%) high severe

aggregate_query_group_by_u64 15 12
                        time:   [2.2595 ms 2.2757 ms 2.2925 ms]
                        change: [-0.2947% +0.7316% +1.7738%] (p = 0.15 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild

Benchmarking aggregate_query_group_by_with_filter_u64 15 12: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.8s, enable flat sampling, or reduce sample count to 60.
aggregate_query_group_by_with_filter_u64 15 12
                        time:   [1.1343 ms 1.1370 ms 1.1401 ms]
                        change: [+0.1610% +0.6333% +1.2276%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) high mild
  6 (6.00%) high severe

aggregate_query_group_by_u64_multiple_keys
                        time:   [14.649 ms 14.972 ms 15.301 ms]
                        change: [-2.8133% +0.2410% +3.4242%] (p = 0.87 > 0.05)
                        No change in performance detected.

aggregate_query_approx_percentile_cont_on_u64
                        time:   [3.7312 ms 3.7587 ms 3.7860 ms]
                        change: [-1.7540% -0.6290% +0.4997%] (p = 0.26 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild

aggregate_query_approx_percentile_cont_on_f32
                        time:   [3.1705 ms 3.1969 ms 3.2240 ms]
                        change: [-1.0292% +0.0460% +1.1440%] (p = 0.94 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

TL;DR: no relevant performance changes!

Are there any user-facing changes?

The no-group aggregation operator can now emit a ResourceExhausted error if it runs out of memory. Note that the error is somewhat nested/wrapped due to #4172.
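
Because of that wrapping, callers often cannot match a single error variant directly; a pragmatic way to detect the condition in tests is to inspect the rendered message. The exact message text below is an assumption, not a stable guarantee.

```rust
/// Illustrative helper: treat any error whose rendered message mentions
/// resource exhaustion as an out-of-memory failure from the aggregation.
/// The "Resources exhausted" text is an assumption about the Display output,
/// not a stable API contract.
fn is_resources_exhausted(err: &impl std::fmt::Display) -> bool {
    err.to_string().contains("Resources exhausted")
}
```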

@github-actions github-actions bot added the core Core DataFusion crate label Nov 28, 2022
@alamb alamb requested a review from yjshen November 28, 2022 17:11

@alamb alamb left a comment

Looks great to me -- thank you @crepererum -- I also plan to test this as part of #4404


-    for (version, aggregates) in [(1, aggregates_v1), (2, aggregates_v2)] {
+    for (version, groups, aggregates) in [
+        (0, groups_none, aggregates_v0),

👍 this is the test coverage 👍
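
The referenced change parameterizes the existing OOM test over an additional "version 0" case with no group-by keys, so the same memory-limit assertion now also covers AggregateStream. Roughly, and purely as an illustration (the real test_oom builds DataFusion plans against a memory-constrained runtime, which is elided and stubbed out here), the loop has this shape:

```rust
// Purely illustrative shape; `run_aggregation_with_tiny_memory_limit` is a
// stand-in for building and executing a plan under a small memory budget.
fn run_aggregation_with_tiny_memory_limit(_groups: &[&str]) -> Result<(), String> {
    Err("Resources exhausted: failed to allocate".to_string())
}

#[test]
fn test_oom_shape() {
    let groups_none: Vec<&str> = vec![];
    let groups_some = vec!["host"];

    // Version 0 has no group keys and therefore exercises AggregateStream;
    // the other versions exercise the grouped aggregation paths.
    for (version, groups) in [(0, groups_none), (1, groups_some)] {
        let err = run_aggregation_with_tiny_memory_limit(&groups)
            .expect_err("should run out of memory");
        assert!(
            err.contains("Resources exhausted"),
            "version {version}: unexpected error: {err}"
        );
    }
}
```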

alamb commented Nov 28, 2022

cc @milenkovicm

@alamb alamb merged commit dd3f72a into apache:master Nov 28, 2022

ursabot commented Nov 28, 2022

Benchmark runs are scheduled for baseline = 0d334cf and contender = dd3f72a. dd3f72a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

