Statistics::is_exact semantics #5613

@crepererum

Describe the bug
It is unclear what Statistics::is_exact = false means. The docs are here:

https://github.com/apache/arrow-datafusion/blob/a578150e63e344fbaa7d13eda58544482dea4729/datafusion/common/src/stats.rs#L34-L37

For this case they state:

may contain an inexact estimate and may not be the actual value

What does "inexact" mean? Some potential definitions (we only consider Some(...) fields here!):

  • underestimate: There are values within the data source that are NOT included within the statistics, i.e. the statistics do NOT cover the whole range. This could happen when you sample statistics from a larger data source.
  • overestimate: All values from the data source are covered by the statistics, but the range might be too large. This can happen when a source doesn't fold predicates into the statistics (which in general is pretty hard to do).
  • both: The statistics are only a rough guide.
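To make the three readings concrete, here is a minimal sketch that compares a min/max range against the values a source actually produces. `Range`, `Relation` and `classify` are hypothetical names made up for illustration; they are not part of DataFusion's `Statistics` API.

```rust
/// Hypothetical min/max statistics for one integer column
/// (an illustration only, not DataFusion's actual type).
#[derive(Debug, Clone, Copy)]
struct Range {
    min: i64,
    max: i64,
}

/// How a statistics range relates to the values actually produced.
#[derive(Debug, PartialEq)]
enum Relation {
    Exact,
    Underestimate,
    Overestimate,
    Both,
}

/// Classifies `stats` against the real values; returns `None` for empty input.
fn classify(stats: Range, values: &[i64]) -> Option<Relation> {
    let actual_min = *values.iter().min()?;
    let actual_max = *values.iter().max()?;
    // Some real value lies outside the statistics range.
    let misses = actual_min < stats.min || actual_max > stats.max;
    // The statistics range extends beyond the real values.
    let slack = stats.min < actual_min || stats.max > actual_max;
    Some(match (misses, slack) {
        (false, false) => Relation::Exact,
        (true, false) => Relation::Underestimate,
        (false, true) => Relation::Overestimate,
        (true, true) => Relation::Both,
    })
}
```

For example, with actual values 10, 20 and 30, a statistics range of 0..=100 is an overestimate, 15..=25 is an underestimate, and 15..=100 is both.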

I think there is an important difference between "overestimate" and "both": the former allows you to prune execution branches or entire operations (e.g. sorts in some cases), while the latter can only be used to re-order operations (e.g. joins) or to select a concrete operation from a pool (e.g. the type of join).
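A rough sketch of why the distinction matters for pruning: skipping a branch is only sound when the statistics are guaranteed to cover every value. The `Inexact` enum, the `Range` struct and `can_prune_gt` are assumptions made up for this illustration; DataFusion exposes no such interpretation flag.

```rust
/// Hypothetical labels for the three readings of `is_exact = false`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Inexact {
    Underestimate,
    Overestimate,
    Both,
}

/// Hypothetical min/max statistics for one integer column.
#[derive(Debug, Clone, Copy)]
struct Range {
    min: i64,
    max: i64,
}

/// Returns true if a branch guarded by `col > threshold` can be skipped.
/// This is only sound when the statistics cover all values (overestimate);
/// under the other readings a value above `stats.max` may still exist.
fn can_prune_gt(stats: Range, kind: Inexact, threshold: i64) -> bool {
    kind == Inexact::Overestimate && stats.max <= threshold
}
```

Under the "underestimate" and "both" readings the function never prunes, so the statistics can then only serve as a hint, e.g. for join ordering.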

Side note: Due to predicate pushdown, exact statistics are pretty unlikely for any realistic data source.

Expected behavior
Clarify behavior.

Additional context
Cross-ref #997.
