release-19.2: opt: improve selectivity estimation based on null counts #42436

Merged
merged 2 commits into from Nov 13, 2019

Conversation

@rytaft
Contributor

rytaft commented Nov 12, 2019

Backport:

  • 1/1 commits from "opt: improve selectivity estimation based on null counts" (#41520)
  • 1/1 commits from "opt: revert removal of code reducing columns for selectivity calculation" (#41686)

Please see individual PRs for details.

/cc @cockroachdb/release

@rytaft rytaft requested review from justinj and RaduBerinde Nov 12, 2019
@rytaft rytaft requested a review from cockroachdb/sql-opt-prs as a code owner Nov 12, 2019

rytaft added 2 commits Oct 10, 2019
This commit makes several changes to improve selectivity estimation
when there are a large number of null values.

1. It removes the code to reduce the columns used for selectivity estimation
based on functional dependencies. This reduction was only useful in a few very
specific circumstances, and added a lot of unnecessary complexity (especially
when calculating selectivity based on null values removed).

2. It combines selectivity estimation based on null counts with the other
selectivity estimation functions. This is much more accurate than doing it
separately. For example, consider a table t with column x, which has 10 rows.
7 of the rows have x=NULL, and the other 3 rows have 3 distinct values.

To estimate the selectivity of the predicate x=1 OR x IS NULL, we should use
the formula:
  sel = (new distinct / old distinct) * (1 - null fraction) + null fraction
  --> sel(x) = (1/3) * (1 - 7/10) + 7/10 = 0.8

As expected, `SELECT * FROM t WHERE x=1 OR x IS NULL` should return 8 rows.
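The arithmetic above can be checked with a short sketch. This is only an illustration of the formula quoted in the commit message, not the optimizer's actual code; the function name `selectivity_with_nulls` and its parameters are hypothetical.

```python
from fractions import Fraction


def selectivity_with_nulls(new_distinct, old_distinct, null_count, row_count):
    """Selectivity of a predicate of the form `x = <values> OR x IS NULL`.

    Hypothetical helper implementing the formula from the commit message:
        sel = (new distinct / old distinct) * (1 - null fraction) + null fraction
    """
    null_fraction = Fraction(null_count, row_count)
    # Fraction of the non-null rows that survive the equality filter.
    non_null_sel = Fraction(new_distinct, old_distinct)
    return non_null_sel * (1 - null_fraction) + null_fraction


# Table t from the example: 10 rows, 7 with x=NULL, 3 distinct non-null values.
row_count = 10
sel = selectivity_with_nulls(new_distinct=1, old_distinct=3,
                             null_count=7, row_count=row_count)
estimated_rows = sel * row_count

print(float(sel), float(estimated_rows))  # 0.8 8.0
```

Exact fractions are used here only to avoid floating-point noise in the check.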

Since it is necessary to *add* the null fraction to account for the IS NULL
part of the predicate, we must include it when we calculate the selectivity
for a column based on distinct counts or histograms. Column selectivities
are multiplied together to estimate the total selectivity, so it's not possible
to account for null values afterwards.
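As a rough illustration of why the null fraction has to be folded into each column's selectivity before multiplying, consider a sketch with a second column. Column y and its statistics are invented for this example, the helper name is hypothetical, and the columns are assumed to be independent; this is not the optimizer's implementation.

```python
from fractions import Fraction


def column_selectivity(new_distinct, old_distinct, null_fraction):
    # Per-column selectivity for `col = <value> OR col IS NULL`,
    # using the formula from the commit message.
    return Fraction(new_distinct, old_distinct) * (1 - null_fraction) + null_fraction


# Column x is taken from the commit message: 7 of 10 rows NULL, 3 distinct values.
sel_x = column_selectivity(1, 3, Fraction(7, 10))
# Column y is hypothetical: 5 of 10 rows NULL, 5 distinct values.
sel_y = column_selectivity(1, 5, Fraction(5, 10))

# Per-column selectivities are multiplied to estimate the total selectivity.
total = sel_x * sel_y

# Multiplying only the non-null parts drops every row in which either
# column is NULL, and the product keeps no per-column information that
# would allow those rows to be added back afterwards.
non_null_only = (Fraction(1, 3) * Fraction(3, 10)) * (Fraction(1, 5) * Fraction(5, 10))

print(float(total), float(non_null_only))  # 0.48 0.01
```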

Fixes #36157

Release note (performance improvement): Improved statistics estimation
during query planning for columns with many null values.

This commit reverts part of #41520, which removed the code to reduce
the number of columns used for selectivity calculation based on functional
dependencies. I removed it because I thought it added unnecessary complexity
with minimal benefit, but we have already run into a real-world example
where it does actually provide a tangible benefit in producing a better
plan.

I was able to add it back with slightly less complexity than existed
before, so I'm satisfied that adding it back is worthwhile.

Release note: None
@rytaft rytaft force-pushed the rytaft:backport19.2-41520-41686 branch from 4f07f28 to 1b2cd54 Nov 12, 2019
Member

RaduBerinde left a comment

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @justinj and @RaduBerinde)

@rytaft rytaft merged commit 1c95d3d into cockroachdb:release-19.2 Nov 13, 2019
2 checks passed
GitHub CI (Cockroach) TeamCity build finished
license/cla Contributor License Agreement is signed.
@rytaft rytaft deleted the rytaft:backport19.2-41520-41686 branch Nov 13, 2019