Skip to content

Use shared statistics merge for union stats#21430

Open
kumarUjjawal wants to merge 5 commits intoapache:mainfrom
kumarUjjawal:feat/consolidated_satistics
Open

Use shared statistics merge for union stats#21430
kumarUjjawal wants to merge 5 commits intoapache:mainfrom
kumarUjjawal:feat/consolidated_satistics

Conversation

@kumarUjjawal
Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

DataFusion already has shared logic for merging Statistics, but UnionExec and InterleaveExec still used their own local merge code.

That left duplicated path in the codebase and kept the behavior less consistent than the other statistics aggregation paths.

What changes are included in this PR?

  • Reuse Statistics::try_merge_iter for UnionExec statistics merging
  • Reuse the same shared path for InterleaveExec statistics merging
  • Remove the local union-specific statistics merge helpers
  • Add tests for union and interleave statistics merging
  • Add a test for interleave partition-level statistics merging

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions github-actions Bot added the physical-plan Changes to the physical-plan crate label Apr 7, 2026
@kumarUjjawal
Copy link
Copy Markdown
Contributor Author

What's up with the failing avro test? I don't think it is related to this pr.

}

#[test]
fn test_union_distinct_count() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a significant concern with the test coverage reduction.

The old test_union_distinct_count contained 15 carefully constructed edge cases for NDV estimation: disjoint ranges, identical ranges, partial overlap, containment, constant values, absent values, and mixed precision types.

These tests validated the correctness of estimate_ndv_with_overlap as used in the union context. The new tests use a single scenario (non-overlapping ranges, single column) and rely on the assumption that try_merge_iter is already well-tested elsewhere.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have plans for the tests in the shared statistics, I am working on adding those.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, I converted the PR to draft now

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is still some regression in terms of test coverage:
the old test_stats_union covered multi-column merging with mixed types (Int64, Utf8, Float32) and mixed absent/present stats across columns. The new tests use a single UInt32 column with all stats present.

Could you add a multi-column test case (e.g., 2-3 columns with different types, some with absent stats) to close the gap?

Comment thread testing
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it expected?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not have happened, my mistake

@xudong963 xudong963 marked this pull request as draft April 7, 2026 07:14
@github-actions github-actions Bot added the common Related to common crate label Apr 7, 2026
@kumarUjjawal kumarUjjawal marked this pull request as ready for review April 7, 2026 08:40
@kumarUjjawal kumarUjjawal requested a review from xudong963 April 7, 2026 08:50
@kumarUjjawal
Copy link
Copy Markdown
Contributor Author

@xudong963 Can i get a look on this?

Copy link
Copy Markdown
Member

@xudong963 xudong963 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed the NDV fallback change from sum to max for bound-less inputs is a silent accuracy regression, wdyt

@kumarUjjawal
Copy link
Copy Markdown
Contributor Author

I noticed the NDV fallback change from sum to max for bound-less inputs is a silent accuracy regression, wdyt

Yeah, I changed it.

Copy link
Copy Markdown
Member

@xudong963 xudong963 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @asolimando in case you have a chance to have a look at this.

Copy link
Copy Markdown
Member

@asolimando asolimando left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice cleanup consolidating the union/interleave stats merging onto a single function @kumarUjjawal.

There is still a little gap in test coverage and I think it would be interesting to keep a customizable fallback for NDV merging so the change becomes a pure refactoring with no semantic changes.

Thanks @xudong963 for the ping!

Comment thread datafusion/common/src/stats.rs Outdated
(Some(&l), Some(&r)) => Precision::Inexact(
estimate_ndv_with_overlap(col_stats, item_cs, l, r)
.unwrap_or_else(|| usize::max(l, r)),
.unwrap_or_else(|| l.saturating_add(r)),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposed change at this line is a semantic, the proposed fallback is sensible for unions (independent streams, summing NDVs is a good upper bound) but this function is also used to share statistics for Parquet files (see statistics.rs#L482 and statistics.rs#L528), for which max is a more classic fallback (files from the same table are likely to share common values, so summing NDV would overshoot in general).

One option would be to have a configurable fallback (e.g., an enum NdvFallback::Max vs NdvFallback::Sum), so the callers can choose based on their own semantics. WDYT?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds like an improvement from the current approach. Thanks @asolimando

}

#[test]
fn test_union_distinct_count() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is still some regression in terms of test coverage:
the old test_stats_union covered multi-column merging with mixed types (Int64, Utf8, Float32) and mixed absent/present stats across columns. The new tests use a single UInt32 column with all stats present.

Could you add a multi-column test case (e.g., 2-3 columns with different types, some with absent stats) to close the gap?

@github-actions github-actions Bot added the datasource Changes to the datasource crate label Apr 23, 2026
@kumarUjjawal kumarUjjawal force-pushed the feat/consolidated_satistics branch from f8fd060 to 891da8f Compare April 23, 2026 04:30
Copy link
Copy Markdown
Member

@asolimando asolimando left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My pending comments were fully addressed, thanks @kumarUjjawal!

EDIT: I noticed there is a test failure, but it seems a flaky test, and unrelated to this PR, so I keep my approval, not sure if guidelines require a green run, in case you can probably re-trigger with an empty commit? (I don't have permission to re-run selectively, and I guess you don't either)

@kumarUjjawal
Copy link
Copy Markdown
Contributor Author

The CI failure looks lie a flaky EXPLAIN ANALYZE expectation.

@asolimando
Copy link
Copy Markdown
Member

The CI failure looks lie a flaky EXPLAIN ANALYZE expectation.

If you have bandwidth, would you mind filing an issue for this?

@kumarUjjawal
Copy link
Copy Markdown
Contributor Author

The CI failure looks lie a flaky EXPLAIN ANALYZE expectation.

If you have bandwidth, would you mind filing an issue for this?

I was having issue with this in another pr and I pushed a fix there yesterday, hope that's okay 94d8f7d

or should i create new issue to address this?

@asolimando
Copy link
Copy Markdown
Member

The CI failure looks lie a flaky EXPLAIN ANALYZE expectation.

If you have bandwidth, would you mind filing an issue for this?

I was having issue with this in another pr and I pushed a fix there yesterday, hope that's okay 94d8f7d

or should i create new issue to address this?

No no, that's fine, I missed that and I wanted to make sure this wouldn't slip through the cracks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common Related to common crate datasource Changes to the datasource crate physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants