Use shared statistics merge for union stats by kumarUjjawal · Pull Request #21430 · apache/datafusion

kumarUjjawal · 2026-04-07T05:36:52Z

Which issue does this PR close?

Part of Consolidate statistics aggregation #8229

Rationale for this change

DataFusion already has shared logic for merging Statistics, but UnionExec and InterleaveExec still used their own local merge code.

That left a duplicated code path in the codebase and made the behavior less consistent with the other statistics aggregation paths.

What changes are included in this PR?

Reuse Statistics::try_merge_iter for UnionExec statistics merging
Reuse the same shared path for InterleaveExec statistics merging
Remove the local union-specific statistics merge helpers
Add tests for union and interleave statistics merging
Add a test for interleave partition-level statistics merging
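
As a rough illustration of what merging the statistics of several union inputs through one shared path means, here is a minimal, self-contained sketch. It does not call DataFusion: `Precision`, `TableStats`, and `merge_all` are simplified, hypothetical stand-ins, and the real `Statistics::try_merge_iter` also handles per-column statistics and the schema. The sketch only shows the table-level idea: row counts and byte sizes of the inputs add up, and the result is only as precise as the least precise input.

```rust
/// Simplified stand-in for a statistic with a known level of precision.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Precision {
    Exact(usize),
    Inexact(usize),
    Absent,
}

impl Precision {
    /// Adds two statistics, degrading precision when either side is inexact
    /// and giving up (Absent) when either side is unknown.
    fn add(self, other: Precision) -> Precision {
        match (self, other) {
            (Precision::Exact(a), Precision::Exact(b)) => {
                Precision::Exact(a.saturating_add(b))
            }
            (
                Precision::Exact(a) | Precision::Inexact(a),
                Precision::Exact(b) | Precision::Inexact(b),
            ) => Precision::Inexact(a.saturating_add(b)),
            _ => Precision::Absent,
        }
    }
}

/// Hypothetical table-level statistics (no per-column data in this sketch).
#[derive(Debug, Clone, Copy, PartialEq)]
struct TableStats {
    num_rows: Precision,
    total_byte_size: Precision,
}

/// Folds the statistics of all inputs into one value.
/// Returns None when there are no inputs.
fn merge_all(inputs: &[TableStats]) -> Option<TableStats> {
    let mut iter = inputs.iter().copied();
    let first = iter.next()?;
    Some(iter.fold(first, |acc, s| TableStats {
        num_rows: acc.num_rows.add(s.num_rows),
        total_byte_size: acc.total_byte_size.add(s.total_byte_size),
    }))
}
```

With one fold like this, both a union and an interleave operator can pass their children's statistics to the same function instead of keeping local merge helpers.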

Are these changes tested?

Yes

Are there any user-facing changes?

No

kumarUjjawal · 2026-04-07T06:30:29Z

What's up with the failing avro test? I don't think it is related to this pr.

xudong963 · 2026-04-07T07:03:16Z

    }

    #[test]
-    fn test_union_distinct_count() {


There is a significant concern with the test coverage reduction.

The old test_union_distinct_count contained 15 carefully constructed edge cases for NDV estimation: disjoint ranges, identical ranges, partial overlap, containment, constant values, absent values, and mixed precision types.

These tests validated the correctness of estimate_ndv_with_overlap as used in the union context. The new tests use a single scenario (non-overlapping ranges, single column) and rely on the assumption that try_merge_iter is already well-tested elsewhere.
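
To make the range cases named above concrete (disjoint, identical, partial overlap, containment), here is a small sketch of an overlap-based NDV estimate. It is not the body of `estimate_ndv_with_overlap`; `ColumnSummary`, `fraction_inside`, and `estimate_union_ndv` are hypothetical names, and the sketch assumes finite numeric bounds and values spread uniformly over each range.

```rust
/// Hypothetical, simplified column summary: a distinct-value count (NDV)
/// plus numeric min/max bounds.
#[derive(Debug, Clone, Copy)]
struct ColumnSummary {
    ndv: usize,
    min: f64,
    max: f64,
}

/// Fraction of `col`'s range that lies inside [lo, hi].
/// A constant column (zero-width range) counts as fully inside.
fn fraction_inside(col: &ColumnSummary, lo: f64, hi: f64) -> f64 {
    let width = col.max - col.min;
    if width == 0.0 {
        1.0
    } else {
        (hi - lo) / width
    }
}

/// Estimates the NDV of the union of two inputs.
/// Returns None when the bounds are unusable, so the caller can fall back.
fn estimate_union_ndv(left: &ColumnSummary, right: &ColumnSummary) -> Option<usize> {
    let bounds = [left.min, left.max, right.min, right.max];
    if bounds.iter().any(|v| !v.is_finite())
        || left.min > left.max
        || right.min > right.max
    {
        return None;
    }
    // Overlapping part of the two ranges, if any.
    let lo = left.min.max(right.min);
    let hi = left.max.min(right.max);
    if lo > hi {
        // Disjoint ranges cannot share values, so the counts add up.
        return Some(left.ndv.saturating_add(right.ndv));
    }
    // Distinct values each side contributes inside the overlap.
    let l_in = left.ndv as f64 * fraction_inside(left, lo, hi);
    let r_in = right.ndv as f64 * fraction_inside(right, lo, hi);
    // Values outside the overlap are unique to their side.
    let l_out = left.ndv as f64 - l_in;
    let r_out = right.ndv as f64 - r_in;
    // Inside the overlap, assume the smaller set is contained in the larger.
    Some((l_in.max(r_in) + l_out + r_out).round() as usize)
}
```

Under these assumptions, disjoint ranges give the sum of the two NDVs and identical ranges give the larger of the two, which is why a table of edge cases is a useful way to pin the behavior down.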

I have plans for the tests in the shared statistics, I am working on adding those.

nice, I converted the PR to draft now

xudong963 · 2026-04-07T07:03:38Z

is it expected?

This should not have happened, my mistake

kumarUjjawal · 2026-04-19T10:37:01Z

@xudong963 Can I get a look at this?

xudong963

I noticed the NDV fallback change from sum to max for bound-less inputs is a silent accuracy regression, wdyt

kumarUjjawal · 2026-04-20T13:27:13Z

I noticed the NDV fallback change from sum to max for bound-less inputs is a silent accuracy regression, wdyt

Yeah, I changed it.

xudong963

cc @asolimando in case you have a chance to have a look at this.

asolimando

Nice cleanup consolidating the union/interleave stats merging onto a single function @kumarUjjawal.

There is still a little gap in test coverage and I think it would be interesting to keep a customizable fallback for NDV merging so the change becomes a pure refactoring with no semantic changes.

Thanks @xudong963 for the ping!

asolimando · 2026-04-22T12:14:39Z

                    (Some(&l), Some(&r)) => Precision::Inexact(
                        estimate_ndv_with_overlap(col_stats, item_cs, l, r)
-                            .unwrap_or_else(|| usize::max(l, r)),
+                            .unwrap_or_else(|| l.saturating_add(r)),


The proposed change at this line is a semantic change. The new fallback is sensible for unions (independent streams, so summing NDVs is a good upper bound), but this function is also used to merge statistics for Parquet files (see statistics.rs#L482 and statistics.rs#L528), for which max is the more classic fallback: files from the same table are likely to share common values, so summing NDVs would generally overshoot.

One option would be to have a configurable fallback (e.g., an enum NdvFallback::Max vs NdvFallback::Sum), so the callers can choose based on their own semantics. WDYT?

This sounds like an improvement from the current approach. Thanks @asolimando
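
A minimal sketch of the configurable fallback suggested above, assuming an enum with the two variants named in the comment. Only the names `NdvFallback::Max` and `NdvFallback::Sum` come from the discussion; the `apply` method and the `merge_ndv` helper are hypothetical and do not reflect the signature that was eventually merged.

```rust
/// Fallback used when no overlap-based NDV estimate is available
/// (for example, when min/max bounds are missing).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NdvFallback {
    /// Inputs likely share values (e.g. files of the same table).
    Max,
    /// Inputs are independent streams (e.g. union children).
    Sum,
}

impl NdvFallback {
    /// Combines two distinct counts according to the chosen policy.
    fn apply(self, left: usize, right: usize) -> usize {
        match self {
            NdvFallback::Max => left.max(right),
            NdvFallback::Sum => left.saturating_add(right),
        }
    }
}

/// Hypothetical helper: prefer the overlap-based estimate, otherwise use
/// the fallback policy chosen by the caller.
fn merge_ndv(
    overlap_estimate: Option<usize>,
    left: usize,
    right: usize,
    fallback: NdvFallback,
) -> usize {
    overlap_estimate.unwrap_or_else(|| fallback.apply(left, right))
}
```

In this shape, a union-style caller would pass `NdvFallback::Sum` and a file-merging caller `NdvFallback::Max`, so each keeps its previous behavior while sharing one merge function.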

asolimando · 2026-04-22T12:18:58Z

    }

    #[test]
-    fn test_union_distinct_count() {


I think there is still some regression in terms of test coverage:
the old test_stats_union covered multi-column merging with mixed types (Int64, Utf8, Float32) and mixed absent/present stats across columns. The new tests use a single UInt32 column with all stats present.

Could you add a multi-column test case (e.g., 2-3 columns with different types, some with absent stats) to close the gap?

asolimando

My pending comments were fully addressed, thanks @kumarUjjawal!

EDIT: I noticed there is a test failure, but it seems to be a flaky test unrelated to this PR, so I am keeping my approval. I am not sure whether the guidelines require a green run; if they do, you can probably re-trigger CI with an empty commit. (I don't have permission to re-run jobs selectively, and I guess you don't either.)

kumarUjjawal · 2026-04-23T09:15:36Z

The CI failure looks like a flaky EXPLAIN ANALYZE expectation.

asolimando · 2026-04-23T09:16:47Z

The CI failure looks like a flaky EXPLAIN ANALYZE expectation.

If you have bandwidth, would you mind filing an issue for this?

kumarUjjawal · 2026-04-23T09:19:25Z

The CI failure looks like a flaky EXPLAIN ANALYZE expectation.

If you have bandwidth, would you mind filing an issue for this?

I was having an issue with this in another PR and pushed a fix there yesterday (94d8f7d); I hope that's okay.

Or should I create a new issue to address this?

asolimando · 2026-04-23T09:29:41Z

The CI failure looks like a flaky EXPLAIN ANALYZE expectation.

If you have bandwidth, would you mind filing an issue for this?

I was having an issue with this in another PR and pushed a fix there yesterday (94d8f7d); I hope that's okay.

Or should I create a new issue to address this?

No no, that's fine, I missed that and I wanted to make sure this wouldn't slip through the cracks!

github-actions Bot added the physical-plan label (Changes to the physical-plan crate) Apr 7, 2026

xudong963 requested changes Apr 7, 2026

xudong963 marked this pull request as draft April 7, 2026 07:14

github-actions Bot added the common label (Related to common crate) Apr 7, 2026

kumarUjjawal marked this pull request as ready for review April 7, 2026 08:40

kumarUjjawal requested a review from xudong963 April 7, 2026 08:50

xudong963 reviewed Apr 20, 2026

xudong963 approved these changes Apr 22, 2026

asolimando reviewed Apr 22, 2026

github-actions Bot added the datasource label (Changes to the datasource crate) Apr 23, 2026

kumarUjjawal added 5 commits April 23, 2026 09:36

Use shared statistics merge for union stats

3018d63

test: restore shared NDV coverage and reset testing submodule

4a97bdb

test: add full shared NDV edge-case table

e08e95f

fix(stats): preserve NDV fallback semantics in shared merge

a964076

explicit NDV fallback mode

891da8f

kumarUjjawal force-pushed the feat/consolidated_satistics branch from f8fd060 to 891da8f Compare April 23, 2026 04:30

asolimando approved these changes Apr 23, 2026
