Skip to content

feat(stats): cap NDV at row count in statistics estimation#21081

Open
asolimando wants to merge 1 commit intoapache:mainfrom
asolimando:asolimando/ndv-cap-row-count
Open

feat(stats): cap NDV at row count in statistics estimation#21081
asolimando wants to merge 1 commit intoapache:mainfrom
asolimando:asolimando/ndv-cap-row-count

Conversation

@asolimando
Copy link
Member

Which issue does this PR close?

Rationale for this change

After a filter reduces a table from 100 to 10 rows, or a LIMIT 10 caps the output, the NDV (e.g. 80) should not exceed the new row count. Without capping, join cardinality estimation uses an inflated denominator, leading to inaccurate estimates.

What changes are included in this PR?

Cap distinct_count at num_rows in three places to prevent NDV from exceeding the actual row count:

  • max_distinct_count in join cardinality estimation (joins/utils.rs)
  • collect_new_statistics in filter output statistics (filter.rs)
  • Statistics::with_fetch (stats.rs), which covers GlobalLimitExec, LocalLimitExec, SortExec (with fetch), CoalescePartitionsExec (with fetch), and CoalesceBatchesExec (with fetch)

Note: NDV capping for AggregateExec is covered separately in #20926.

Are these changes tested?

  • test_filter_statistics_ndv_capped_at_row_count - verifies NDV capped at filtered row count
  • 2 new join cardinality test cases - NDV > rows on both/one side
  • Updated test_join_cardinality expected values for capped NDV
  • test_with_fetch_caps_ndv_at_row_count - verifies NDV capped after LIMIT
  • test_with_fetch_ndv_below_row_count_unchanged - verifies NDV untouched when already below row count
  • All existing with_fetch tests pass

Are there any user-facing changes?

No public API changes. Only internal statistics estimation is affected.

Disclaimer: I used AI to assist in the code generation, I have manually reviewed the output and it matches my intention and understanding.

@asolimando asolimando marked this pull request as draft March 20, 2026 17:42
@github-actions github-actions bot added common Related to common crate physical-plan Changes to the physical-plan crate labels Mar 20, 2026
@asolimando asolimando force-pushed the asolimando/ndv-cap-row-count branch from 6adef34 to ddfd8f3 Compare March 20, 2026 18:01
@github-actions github-actions bot added the core Core DataFusion crate label Mar 20, 2026
@asolimando asolimando marked this pull request as ready for review March 20, 2026 18:51
@asolimando asolimando changed the title Cap NDV at row count in statistics estimation feat: cap NDV at row count in statistics estimation Mar 20, 2026
@asolimando asolimando changed the title feat: cap NDV at row count in statistics estimation feat(stats): cap NDV at row count in statistics estimation Mar 20, 2026
@asolimando
Copy link
Member Author

cc: @jonathanc-n

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common Related to common crate core Core DataFusion crate physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant