# release-19.2: opt: improve selectivity estimation based on null counts #42436

Merged
merged 2 commits into cockroachdb:release-19.2 from rytaft:backport19.2-41520-41686 Nov 13, 2019


## Conversation

Contributor

### rytaft commented Nov 12, 2019

Backport:
- 1/1 commits from "opt: improve selectivity estimation based on null counts" (#41520)
- 1/1 commits from "opt: revert removal of code reducing columns for selectivity calculation" (#41686)

Please see individual PRs for details.
requested review from justinj and RaduBerinde Nov 12, 2019
requested a review from cockroachdb/sql-opt-prs as a code owner Nov 12, 2019
Member

### cockroach-teamcity commented Nov 12, 2019

added 2 commits Oct 10, 2019
``` opt: improve selectivity estimation based on null counts ```
``` 04b2110 ```
```
This commit makes several changes to improve selectivity estimation
when there are a large number of null values.

1. It removes the code to reduce the columns used for selectivity estimation
based on functional dependencies. This reduction was only useful in a few very
specific circumstances, and added a lot of unnecessary complexity (especially
when calculating selectivity based on null values removed).

2. It combines selectivity estimation based on null counts with the other
selectivity estimation functions. This is much more accurate than doing it
separately. For example, consider a table t with column x, which has 10 rows.
7 of the rows have x=NULL, and the other 3 rows have 3 distinct values.

To estimate the selectivity of the predicate x=1 OR x IS NULL, we should use
the formula:
sel = (new distinct / old distinct) * (1 - null fraction) + null fraction
--> sel(x) = (1/3) * (1 - 7/10) + 7/10 = 0.8

As expected, `SELECT * FROM t WHERE x=1 OR x IS NULL` should return 8 rows.

Since it is necessary to *add* the null fraction to account for the IS NULL
part of the predicate, we must include it when we calculate the selectivity
for a column based on distinct counts or histograms. Column selectivities
are multiplied together to estimate the total selectivity, so it's not possible
to account for null values afterwards.

Fixes #36157

Release note (performance improvement): Improved statistics estimation
during query planning for columns with many null values.
```
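The arithmetic in the commit message above can be checked with a short sketch. This is not the optimizer's code: the function name `column_selectivity` and the `keeps_nulls` flag are hypothetical, and the handling of a predicate that rejects nulls (dropping the added null fraction) is an assumption made for illustration.

```python
from fractions import Fraction


def column_selectivity(new_distinct, old_distinct, null_count, row_count, keeps_nulls):
    """Per-column selectivity with the null fraction folded in.

    Hypothetical helper: mirrors the formula quoted in the commit message,
    sel = (new distinct / old distinct) * (1 - null fraction) + null fraction.
    The keeps_nulls=False branch (no null fraction added) is an assumption.
    """
    null_frac = Fraction(null_count, row_count)
    non_null_sel = Fraction(new_distinct, old_distinct) * (1 - null_frac)
    if keeps_nulls:
        return non_null_sel + null_frac
    return non_null_sel


# Table t from the commit message: 10 rows, 7 with x = NULL,
# and 3 distinct non-null values in the remaining rows.
row_count = 10

# x = 1 OR x IS NULL: one distinct value survives, and nulls are kept.
sel_or_null = column_selectivity(1, 3, 7, row_count, keeps_nulls=True)
estimated_rows = sel_or_null * row_count  # 8 rows, matching the commit message

# x = 1 alone (assumed variant): nulls are filtered out.
sel_eq = column_selectivity(1, 3, 7, row_count, keeps_nulls=False)
```

Because each column's result is later multiplied with the selectivities of other columns, the null fraction has to be added inside the per-column calculation, as the commit message argues; adding it to the final product would apply it to the wrong quantity.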
``` opt: revert removal of code reducing columns for selectivity calculation ```
``` 1b2cd54 ```
```
This commit reverts part of #41520, which removed the code to reduce
the number of columns used for selectivity calculation based on functional
dependencies. I removed it because I thought it added unnecessary complexity
with minimal benefit, but we have already run into a real-world example
where it does actually provide a tangible benefit in producing a better
plan.

I was able to add it back with slightly less complexity than existed
before, so I'm satisfied that adding it back is worthwhile.

Release note: None
```
force-pushed the rytaft:backport19.2-41520-41686 branch from `4f07f28` to `1b2cd54` Nov 12, 2019
approved these changes
Member

 Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @justinj and @RaduBerinde)
merged commit `1c95d3d` into cockroachdb:release-19.2 Nov 13, 2019
2 checks passed
GitHub CI (Cockroach) TeamCity build finished