Use shared statistics merge for union stats#21430
Use shared statistics merge for union stats#21430kumarUjjawal wants to merge 5 commits intoapache:mainfrom
Conversation
|
What's up with the failing avro test? I don't think it is related to this pr. |
| } | ||
|
|
||
| #[test] | ||
| fn test_union_distinct_count() { |
There was a problem hiding this comment.
There is a significant concern with the test coverage reduction.
The old test_union_distinct_count contained 15 carefully constructed edge cases for NDV estimation: disjoint ranges, identical ranges, partial overlap, containment, constant values, absent values, and mixed precision types.
These tests validated the correctness of estimate_ndv_with_overlap as used in the union context. The new tests use a single scenario (non-overlapping ranges, single column) and rely on the assumption that try_merge_iter is already well-tested elsewhere.
There was a problem hiding this comment.
I have plans for the tests in the shared statistics, I am working on adding those.
There was a problem hiding this comment.
nice, I converted the PR to draft now
There was a problem hiding this comment.
I think there is still some regression in terms of test coverage:
the old test_stats_union covered multi-column merging with mixed types (Int64, Utf8, Float32) and mixed absent/present stats across columns. The new tests use a single UInt32 column with all stats present.
Could you add a multi-column test case (e.g., 2-3 columns with different types, some with absent stats) to close the gap?
There was a problem hiding this comment.
This should not have happened, my mistake
|
@xudong963 Can i get a look on this? |
xudong963
left a comment
There was a problem hiding this comment.
I noticed the NDV fallback change from sum to max for bound-less inputs is a silent accuracy regression, wdyt
Yeah, I changed it. |
xudong963
left a comment
There was a problem hiding this comment.
cc @asolimando in case you have a chance to have a look at this.
asolimando
left a comment
There was a problem hiding this comment.
Nice cleanup consolidating the union/interleave stats merging onto a single function @kumarUjjawal.
There is still a little gap in test coverage and I think it would be interesting to keep a customizable fallback for NDV merging so the change becomes a pure refactoring with no semantic changes.
Thanks @xudong963 for the ping!
| (Some(&l), Some(&r)) => Precision::Inexact( | ||
| estimate_ndv_with_overlap(col_stats, item_cs, l, r) | ||
| .unwrap_or_else(|| usize::max(l, r)), | ||
| .unwrap_or_else(|| l.saturating_add(r)), |
There was a problem hiding this comment.
The proposed change at this line is a semantic, the proposed fallback is sensible for unions (independent streams, summing NDVs is a good upper bound) but this function is also used to share statistics for Parquet files (see statistics.rs#L482 and statistics.rs#L528), for which max is a more classic fallback (files from the same table are likely to share common values, so summing NDV would overshoot in general).
One option would be to have a configurable fallback (e.g., an enum NdvFallback::Max vs NdvFallback::Sum), so the callers can choose based on their own semantics. WDYT?
There was a problem hiding this comment.
This sounds like an improvement from the current approach. Thanks @asolimando
| } | ||
|
|
||
| #[test] | ||
| fn test_union_distinct_count() { |
There was a problem hiding this comment.
I think there is still some regression in terms of test coverage:
the old test_stats_union covered multi-column merging with mixed types (Int64, Utf8, Float32) and mixed absent/present stats across columns. The new tests use a single UInt32 column with all stats present.
Could you add a multi-column test case (e.g., 2-3 columns with different types, some with absent stats) to close the gap?
f8fd060 to
891da8f
Compare
There was a problem hiding this comment.
My pending comments were fully addressed, thanks @kumarUjjawal!
EDIT: I noticed there is a test failure, but it seems a flaky test, and unrelated to this PR, so I keep my approval, not sure if guidelines require a green run, in case you can probably re-trigger with an empty commit? (I don't have permission to re-run selectively, and I guess you don't either)
|
The CI failure looks lie a flaky EXPLAIN ANALYZE expectation. |
If you have bandwidth, would you mind filing an issue for this? |
I was having issue with this in another pr and I pushed a fix there yesterday, hope that's okay 94d8f7d or should i create new issue to address this? |
No no, that's fine, I missed that and I wanted to make sure this wouldn't slip through the cracks! |
Which issue does this PR close?
Rationale for this change
DataFusion already has shared logic for merging
Statistics, butUnionExecandInterleaveExecstill used their own local merge code.That left duplicated path in the codebase and kept the behavior less consistent than the other statistics aggregation paths.
What changes are included in this PR?
Statistics::try_merge_iterforUnionExecstatistics mergingInterleaveExecstatistics mergingAre these changes tested?
Yes
Are there any user-facing changes?
No