feat: Improve metrics for aggregate streams. #18325

EmilyMatt · 2025-10-28T13:22:19Z

Which issue does this PR close?

Closes #18323 (Improve metrics for Aggregate streams).

Rationale for this change

Adds more detailed metrics, so it is easier to identify which parts of the aggregate streams are actually slow.

What changes are included in this PR?

Added a metrics struct, and used it in the functions common to the aggregate streams.

Are these changes tested?

Yes, added some tests to verify the metrics are actually updated and can be retrieved.

I've also run the groupby benchmarks to make sure the timers are not created in a way that could impact performance. It seems OK: all the changes are within what I'd expect as standard variation on a local machine.

Comparing main and agg-metrics
--------------------
Benchmark h2o.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       main ┃ agg-metrics ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 1252.42 ms │  1196.62 ms │     no change │
│ QQuery 2     │ 3976.62 ms │  3392.89 ms │ +1.17x faster │
│ QQuery 3     │ 3448.29 ms │  2918.47 ms │ +1.18x faster │
│ QQuery 4     │ 1909.15 ms │  1632.98 ms │ +1.17x faster │
│ QQuery 5     │ 3056.36 ms │  2831.82 ms │ +1.08x faster │
│ QQuery 6     │ 2663.13 ms │  2594.64 ms │     no change │
│ QQuery 7     │ 2802.28 ms │  2592.43 ms │ +1.08x faster │
│ QQuery 8     │ 4489.29 ms │  4199.00 ms │ +1.07x faster │
│ QQuery 9     │ 7001.75 ms │  6622.98 ms │ +1.06x faster │
│ QQuery 10    │ 4725.80 ms │  4619.37 ms │     no change │
└──────────────┴────────────┴─────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary          ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main)          │ 35325.09ms │
│ Total Time (agg-metrics)   │ 32601.19ms │
│ Average Time (main)        │  3532.51ms │
│ Average Time (agg-metrics) │  3260.12ms │
│ Queries Faster             │          7 │
│ Queries Slower             │          0 │
│ Queries with No Change     │          3 │
│ Queries with Failure       │          0 │
└────────────────────────────┴────────────┘

Are there any user-facing changes?

Nothing directly user-facing: additional metrics will now be available, but there are no breaking changes.

rluvaton · 2025-10-28T14:22:22Z

datafusion/physical-plan/src/aggregates/mod.rs

 }

-/// Evaluates expressions against a record batch.
-pub fn evaluate_many(


This is a breaking change, since you changed both the visibility and the arguments.

You should create a new function and leave this function as is.

Yes, that is mentioned in the PR

I don't think we should change that function, even if it is a utility.

I think this is a sure way to end up with lots of dead code because "someone surely uses it". It was simply a mistake to make such a generic utility in this file public, and the earlier it is fixed, the fewer people are affected.

This was not a mistake but an informed decision:

chore: Make GroupValues and APIs on PhysicalGroupBy aggregation APIs public #16733

Then this function should have been moved somewhere else. There are multiple copies of code that does exactly this all over the project, so keeping this specific instance pub is quite arbitrary and slightly disruptive. I will work around it so that the visibility is not modified in this PR's scope, and will open an issue to change that.

datafusion/physical-plan/src/aggregates/mod.rs

datafusion/physical-plan/src/aggregates/group_values/metrics.rs

rluvaton · 2025-10-28T14:30:49Z

datafusion/physical-plan/src/aggregates/row_hash.rs

            evaluate_optional(&self.filter_expressions, &batch)?
        };

+        let aggregation_timer = self.group_by_metrics.aggregation_time.timer();


Observation: the aggregation time also includes finding the group indices, which I don't think it should.

A possible solution is to move the timer inside the loop, but I'm not a fan of that, as it can be a hot loop.

Yes, that was a conscious choice to avoid running this once per group. Because current_group_indices is actually a scratch space, the work can't organically be separated into two loops (although creating one scratch space per group_by is also valid).

The loop also contains building and searching the hash table (self.group_values.intern), which can be expensive (e.g. when GroupValuesRows is used).

Added a metric for it, though I think this is a mistake.
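To make the placement discussed above concrete, here is a minimal, std-only sketch of a drop-based timer created once around the whole per-batch loop. `Time` and `TimerGuard` are simplified stand-ins written for this illustration, not DataFusion's metrics types, and `aggregate_batch` is a hypothetical function that sums values per group id rather than the real row_hash code.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

/// Stand-in for a time metric: accumulates elapsed nanoseconds.
#[derive(Debug, Default)]
pub struct Time {
    nanos: AtomicU64,
}

impl Time {
    /// Starts a timer that records into this metric when the guard is dropped.
    pub fn timer(&self) -> TimerGuard<'_> {
        TimerGuard { time: self, start: Instant::now() }
    }

    /// Total recorded time in nanoseconds.
    pub fn value(&self) -> u64 {
        self.nanos.load(Ordering::Relaxed)
    }
}

/// Guard that adds the elapsed time to its `Time` on drop.
pub struct TimerGuard<'a> {
    time: &'a Time,
    start: Instant,
}

impl Drop for TimerGuard<'_> {
    fn drop(&mut self) {
        // Record at least 1ns so that a timed region is always visible.
        let elapsed = self.start.elapsed().as_nanos() as u64;
        self.time.nanos.fetch_add(elapsed.max(1), Ordering::Relaxed);
    }
}

/// Hypothetical per-batch aggregation: sums `values` into `sums` by group id.
/// One timer covers the whole loop, so the clock is read twice per batch
/// regardless of how many rows or groups the batch contains.
pub fn aggregate_batch(time: &Time, group_ids: &[usize], values: &[i64], sums: &mut Vec<i64>) {
    let _timer = time.timer();
    for (&group, &value) in group_ids.iter().zip(values) {
        if group >= sums.len() {
            sums.resize(group + 1, 0);
        }
        sums[group] += value;
    }
    // `_timer` is dropped here and the elapsed time is recorded.
}
```

The cost of this placement is the one raised in the observation: everything inside the loop, including group lookup, is attributed to the same metric.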

datafusion/physical-plan/src/aggregates/row_hash.rs

datafusion/physical-plan/src/aggregates/topk_stream.rs

rluvaton · 2025-10-28T14:40:55Z

datafusion/physical-plan/src/aggregates/row_hash.rs

    baseline_metrics: BaselineMetrics,
+
+    /// Aggregation-specific metrics
+    group_by_metrics: GroupByMetrics,


Can you also add a single timer for building and searching the hash table (group_values.intern)?

Or, if you prefer, we can leave that for later.

I think this will become a performance hit. I'd rather keep the metrics concrete to avoid syscalls.
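As a rough way to reason about that cost, assume each timer reads the clock once when it starts and once when it stops. The number of clock reads then grows with how deep in the loop the timer sits. The function names below are hypothetical and only do the arithmetic:

```rust
/// Clock reads when one timer wraps each input batch
/// (assumes one read at start and one at stop).
pub fn clock_reads_per_batch_timer(batches: u64) -> u64 {
    2 * batches
}

/// Clock reads when a separate timer wraps every inner call
/// made while processing a batch.
pub fn clock_reads_per_call_timer(batches: u64, calls_per_batch: u64) -> u64 {
    2 * batches * calls_per_batch
}
```

For example, 1,000 batches with 4 timed inner calls each means 2,000 clock reads with a per-batch timer versus 8,000 with per-call timers. Whether that difference matters depends on how expensive a clock read is relative to the work being timed.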

rluvaton · 2025-10-28T14:45:38Z

datafusion/physical-plan/src/aggregates/group_values/metrics.rs

+pub struct GroupByMetrics {
+    pub aggregate_arguments_time: Time,
+    pub aggregation_time: Time,
+    pub emitting_time: Time,


Can you add a few finer-grained timers within these:

Building aggregate expressions (how much time is spent calling update_batch/merge_batch on all accumulators)

Emitting aggregate expressions (how much time is spent calling emit/state on all accumulators)

Emitting grouping columns (self.group_values.emit in row_hash is an example of that emission)

Same as previous comments, I think using clocks here is going to start encumbering the system
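For readers following along, the three coarse timers in the struct under review could map onto the phases of an aggregation roughly as sketched below. This is an illustration only: `GroupByMetricsSketch`, `timed`, and `sum_of_doubles` are hypothetical names, and plain `Duration` fields are used instead of DataFusion's `Time` metric type.

```rust
use std::time::{Duration, Instant};

/// Illustrative mirror of the three timers in the struct under review.
#[derive(Debug, Default)]
pub struct GroupByMetricsSketch {
    pub aggregate_arguments_time: Duration,
    pub aggregation_time: Duration,
    pub emitting_time: Duration,
}

/// Runs `f` and adds its wall-clock duration to `slot`.
fn timed<T>(slot: &mut Duration, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = f();
    *slot += start.elapsed();
    out
}

/// Hypothetical three-phase aggregation: evaluate arguments, aggregate, emit.
pub fn sum_of_doubles(input: &[i64], metrics: &mut GroupByMetricsSketch) -> i64 {
    // Phase 1: evaluate the aggregate's argument expression (here: x * 2).
    let args: Vec<i64> = timed(&mut metrics.aggregate_arguments_time, || {
        input.iter().map(|x| x * 2).collect()
    });
    // Phase 2: fold the evaluated arguments into the accumulator state.
    let state: i64 = timed(&mut metrics.aggregation_time, || args.iter().sum::<i64>());
    // Phase 3: emit the final value from the state.
    timed(&mut metrics.emitting_time, || state)
}
```

Each phase gets exactly one timer per call, which is the coarse granularity argued for in the replies; the finer-grained timers requested above would add more clock reads inside phases 2 and 3.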

datafusion/physical-plan/src/aggregates/group_values/metrics.rs

Co-authored-by: Raz Luvaton <16746759+rluvaton@users.noreply.github.com>

datafusion/physical-plan/src/aggregates/group_values/metrics.rs

Co-authored-by: Raz Luvaton <16746759+rluvaton@users.noreply.github.com>

rluvaton · 2025-10-28T16:57:29Z

Can you please update the PR description after your changes?

datafusion/physical-plan/src/aggregates/row_hash.rs

datafusion/physical-plan/src/aggregates/topk_stream.rs

datafusion/physical-plan/src/aggregates/group_values/metrics.rs

rluvaton

LGTM, thanks @EmilyMatt!

Co-authored-by: Raz Luvaton <16746759+rluvaton@users.noreply.github.com>
Co-authored-by: Eshed Schacham <ashdnazg@gmail.com>

github-actions bot added the physical-plan Changes to the physical-plan crate label Oct 28, 2025

EmilyMatt added 3 commits October 28, 2025 15:28

feat: Add more detailed metrics for aggregate

0829b82

add tests and license

5ede253

fix typo

c8e9c6a

EmilyMatt force-pushed the agg-metrics branch from 01ee2fa to c8e9c6a Compare October 28, 2025 13:28

rluvaton reviewed Oct 28, 2025

datafusion/physical-plan/src/aggregates/mod.rs

rluvaton reviewed Oct 28, 2025

datafusion/physical-plan/src/aggregates/group_values/metrics.rs

rluvaton reviewed Oct 28, 2025

datafusion/physical-plan/src/aggregates/group_values/metrics.rs

rluvaton reviewed Oct 28, 2025

datafusion/physical-plan/src/aggregates/row_hash.rs

rluvaton reviewed Oct 28, 2025

datafusion/physical-plan/src/aggregates/topk_stream.rs

rluvaton reviewed Oct 28, 2025

address CR

ef0885b

rluvaton reviewed Oct 28, 2025

datafusion/physical-plan/src/aggregates/group_values/metrics.rs

Update datafusion/physical-plan/src/aggregates/group_values/metrics.rs

eea5186

Co-authored-by: Raz Luvaton <16746759+rluvaton@users.noreply.github.com>

rluvaton reviewed Oct 28, 2025

datafusion/physical-plan/src/aggregates/group_values/metrics.rs

EmilyMatt and others added 4 commits October 28, 2025 17:27

address CR

75112d7

Update datafusion/physical-plan/src/aggregates/group_values/metrics.rs

fbd9d60

Co-authored-by: Raz Luvaton <16746759+rluvaton@users.noreply.github.com>

add doc

3ed0562

slightly more accurate doc

5360131

rluvaton reviewed Oct 28, 2025

datafusion/physical-plan/src/aggregates/row_hash.rs

rluvaton reviewed Oct 28, 2025

datafusion/physical-plan/src/aggregates/topk_stream.rs

rluvaton reviewed Oct 28, 2025

datafusion/physical-plan/src/aggregates/topk_stream.rs

rluvaton reviewed Oct 28, 2025

datafusion/physical-plan/src/aggregates/group_values/metrics.rs

adress CR

d3fc591

rluvaton approved these changes Oct 28, 2025


rluvaton added this pull request to the merge queue Oct 28, 2025

Merged via the queue into apache:main with commit 6cc73fa Oct 28, 2025
32 checks passed
