release-19.2: opt: improve selectivity estimation based on null counts #42436
This commit makes several changes to improve selectivity estimation when there are a large number of null values.

1. It removes the code to reduce the columns used for selectivity estimation based on functional dependencies. This reduction was only useful in a few very specific circumstances, and added a lot of unnecessary complexity (especially when calculating selectivity based on null values removed).

2. It combines selectivity estimation based on null counts with the other selectivity estimation functions. This is much more accurate than doing it separately.

   For example, consider a table t with column x, which has 10 rows. 7 of the rows have x=NULL, and the other 3 rows have 3 distinct values. To estimate the selectivity of the predicate x=1 OR x IS NULL, we should use the formula:

     sel = (new distinct / old distinct) * (1 - null fraction) + null fraction
     --> sel(x) = (1/3) * (1 - 7/10) + 7/10 = 0.8

   As expected, `SELECT * FROM t WHERE x=1 OR x IS NULL` should return 8 rows.

   Since it is necessary to *add* the null fraction to account for the IS NULL part of the predicate, we must include it when we calculate the selectivity for a column based on distinct counts or histograms. Column selectivities are multiplied together to estimate the total selectivity, so it's not possible to account for null values afterwards.

Fixes #36157

Release note (performance improvement): Improved statistics estimation during query planning for columns with many null values.
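The formula in the commit message can be checked with a small sketch. This is an illustration only, not the optimizer's actual code: the function name `selectivity` and its parameters are hypothetical, and the `keeps_nulls` flag is an assumed way of expressing whether the predicate includes an IS NULL branch.

```python
def selectivity(old_distinct, new_distinct, null_fraction, keeps_nulls):
    """Estimate the selectivity of a predicate on a single column.

    Hypothetical helper: applies the distinct-count ratio to the non-null
    rows only, then adds the null fraction back if the predicate also
    accepts NULL (for example, `x = 1 OR x IS NULL`).
    """
    if old_distinct <= 0:
        non_null_sel = 0.0
    else:
        non_null_sel = min(new_distinct / old_distinct, 1.0)

    # Scale by the fraction of rows that are not NULL.
    sel = non_null_sel * (1.0 - null_fraction)

    # Add the NULL rows back when the predicate keeps them.
    if keeps_nulls:
        sel += null_fraction
    return sel


# The example from the commit message: 10 rows, 7 NULLs, 3 distinct values.
row_count = 10
sel = selectivity(old_distinct=3, new_distinct=1,
                  null_fraction=7 / 10, keeps_nulls=True)
estimated_rows = round(sel * row_count)
print(f"selectivity={sel:.2f}, estimated rows={estimated_rows}")
```

Because the null fraction is added inside the per-column estimate, the result can be multiplied with other column selectivities without a separate null adjustment afterwards.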
This commit reverts part of #41520, which removed the code to reduce the number of columns used for selectivity calculation based on functional dependencies. I removed it because I thought it added unnecessary complexity with minimal benefit, but we have already run into a real-world example where it does actually provide a tangible benefit in producing a better plan. I was able to add it back with slightly less complexity than existed before, so I'm satisfied that adding it back is worthwhile.

Release note: None
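To illustrate why reducing columns by functional dependencies can matter, here is a minimal sketch under stated assumptions. It is not the code from this PR: `closure`, `reduce_cols`, the dependency `a -> b`, and the per-column selectivities are all hypothetical, chosen only to show the general idea that a column determined by other constrained columns should not be counted again.

```python
def closure(cols, fds):
    """Return every column determined by `cols` under the dependencies `fds`.

    `fds` is a list of (determinant, dependents) pairs of frozensets.
    """
    result = set(cols)
    changed = True
    while changed:
        changed = False
        for determinant, dependents in fds:
            if determinant <= result and not dependents <= result:
                result |= dependents
                changed = True
    return result


def reduce_cols(cols, fds):
    """Drop each column that the remaining columns already determine."""
    reduced = set(cols)
    for col in sorted(cols):
        if col in closure(reduced - {col}, fds):
            reduced.discard(col)
    return reduced


# Hypothetical dependency: column a determines column b.
fds = [(frozenset({"a"}), frozenset({"b"}))]
# Hypothetical per-column selectivities for a filter on both columns.
col_sel = {"a": 0.1, "b": 0.1}

naive = 1.0
for col in col_sel:
    naive *= col_sel[col]

reduced = 1.0
for col in reduce_cols(col_sel.keys(), fds):
    reduced *= col_sel[col]

print(f"naive={naive:.2f}, reduced={reduced:.2f}")
```

In this sketch, multiplying both selectivities treats `a` and `b` as independent and underestimates the row count, while the reduced set counts only `a`.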