Description
Describe the bug
It is unclear what Statistics::is_exact = false means. The docs are here:
These state for this case:
may contain an inexact estimate and may not be the actual value
What does "inexact" mean? Some potential definitions (we only consider Some(...) fields here!):
- underestimate: There are values within the data source that are NOT included within the statistics, i.e. the statistics do NOT cover the whole range. This could happen when you sample statistics from a larger data source.
- overestimate: All values from the data source are covered by the statistics, but the range might be too large. This can happen when a source doesn't fold predicates into the statistics (which in general is pretty hard to do).
- both: The statistics are only a rough guide.
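To make the three definitions concrete, here is a minimal sketch that compares a claimed min/max range against the actual range of the data. The names `Inexactness` and `classify` are hypothetical and not part of DataFusion; the sketch assumes a single integer column and treats partial overlap as the "both" case.

```rust
/// Hypothetical classification of a min/max statistic (not a DataFusion type).
#[derive(Debug, PartialEq)]
enum Inexactness {
    /// The statistics match the data exactly.
    Exact,
    /// The statistics cover all values, but the range may be too wide.
    Overestimate,
    /// Some values in the data fall outside the statistics.
    Underestimate,
    /// Neither direction holds: the statistics are only a rough guide.
    RoughGuide,
}

/// Compare a claimed `(min, max)` against the actual `(min, max)` of the data.
fn classify(stats: (i64, i64), actual: (i64, i64)) -> Inexactness {
    // The claimed range contains every actual value.
    let covers = stats.0 <= actual.0 && stats.1 >= actual.1;
    // The claimed range lies entirely inside the actual range.
    let inside = stats.0 >= actual.0 && stats.1 <= actual.1;
    match (covers, inside) {
        (true, true) => Inexactness::Exact,
        (true, false) => Inexactness::Overestimate,
        (false, true) => Inexactness::Underestimate,
        (false, false) => Inexactness::RoughGuide,
    }
}
```

For example, statistics of (0, 100) over data that actually spans (5, 50) would be an overestimate, while statistics of (10, 20) sampled from data spanning (0, 100) would be an underestimate.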
I think there is a pretty important difference between "overestimate" and "both". The former allows you to prune execution branches or entire operations (e.g. sorts in some cases). The latter can only be used to re-order operations (e.g. joins) or to select a concrete operation from a pool (e.g. type of join).
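A rough sketch of why the distinction matters for pruning, assuming a filter of the form `col > threshold` and a max statistic. `Bound` and `can_prune_gt` are hypothetical names for illustration, not DataFusion APIs.

```rust
/// Hypothetical guarantee attached to a statistic (not a DataFusion type).
#[derive(Clone, Copy)]
enum Bound {
    /// The value is the actual maximum.
    Exact,
    /// The actual maximum is guaranteed to be <= the stated value.
    Overestimate,
    /// No guarantee in either direction.
    RoughGuide,
}

/// Returns true if `col > threshold` can be proven to match no rows,
/// so the branch can be pruned without reading any data.
fn can_prune_gt(max: Option<i64>, kind: Bound, threshold: i64) -> bool {
    match (max, kind) {
        // With a guaranteed upper bound, max <= threshold proves the filter is empty.
        (Some(m), Bound::Exact) | (Some(m), Bound::Overestimate) => m <= threshold,
        // A rough guide or a missing statistic proves nothing; it may still
        // inform cost-based choices such as join order.
        _ => false,
    }
}
```

With a max of 10 and a threshold of 20, the branch could be pruned under the "overestimate" reading but not under the "both" reading, even though the number is the same.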
Side note: Because of predicate pushdown, exact statistics are pretty unlikely for any realistic data source.
Expected behavior
Clarify in the docs which of these definitions Statistics::is_exact = false refers to.
Additional context
Cross-ref #997.