release-24.1: opt/memo: improve zigzag join cost and selectivity estimation with multi-column stats #123106
Thanks for opening a backport. Please check the backport criteria before merging:
If your backport adds new functionality, please ensure that the following additional criteria are satisfied:
Also, please add a brief release justification to the body of your PR to justify this
blathers-crl (bot) added the backport label on Apr 26, 2024
rytaft force-pushed the backport24.1-120805 branch from 470a452 to f2183d0 on April 26, 2024 01:58
mgartner approved these changes on Apr 29, 2024
Friendly ping @rafiss
rafiss approved these changes on Apr 30, 2024
Backport 4/4 commits from #120805.
/cc @cockroachdb/release
opt: update seek and distribution cost of zigzag join to match scan
Prior to this commit, the optimizer could prefer a zigzag join over a scan even if they produced the same number of rows. This was because scans always included the cost of at least one seek (involving random I/O) and some distribution cost, while zigzag joins did not. This commit updates the cost of zigzag joins to include seek and distribution costs so they will never be chosen over scans unless they produce fewer rows.
This change is behind the setting optimizer_use_improved_zigzag_join_costing.
Release note (performance improvement): Added a new setting optimizer_use_improved_zigzag_join_costing. When enabled, the cost of zigzag joins is updated so they will never be chosen over scans unless they produce fewer rows. This change only matters if the setting enable_zigzag_join is also true.
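
The effect of the costing change can be illustrated with a toy model. This is only a sketch: the constants, the function names (scanCost, zigzagCost, prefersZigzag), and the linear per-row cost are all hypothetical and are not taken from the optimizer's coster. It shows why omitting the fixed seek and distribution costs lets a zigzag join win at equal row counts, and why adding them removes that advantage.

```go
package main

import "fmt"

// Hypothetical cost constants, chosen only for illustration.
// The real optimizer uses different units and values.
const (
	seekCost         = 1.0  // fixed cost of one seek (random I/O)
	distributionCost = 0.5  // fixed cost of distributing the operator
	perRowCost       = 0.25 // cost of processing one row
)

// scanCost models a scan: it always pays for at least one seek and
// some distribution cost, plus per-row work.
func scanCost(rows float64) float64 {
	return seekCost + distributionCost + rows*perRowCost
}

// zigzagCost models a zigzag join. With improved == false the fixed
// costs are omitted (the behavior described as "prior to this commit").
// With improved == true the same fixed costs as a scan are included.
func zigzagCost(rows float64, improved bool) float64 {
	cost := rows * perRowCost
	if improved {
		cost += seekCost + distributionCost
	}
	return cost
}

// prefersZigzag reports whether the toy model picks the zigzag join,
// i.e. whether its cost is strictly lower than the scan's cost.
func prefersZigzag(scanRows, zigzagRows float64, improved bool) bool {
	return zigzagCost(zigzagRows, improved) < scanCost(scanRows)
}

func main() {
	// Same row count: the old costing prefers zigzag, the new one does not.
	fmt.Println(prefersZigzag(100, 100, false))
	fmt.Println(prefersZigzag(100, 100, true))
	// Fewer rows from the zigzag join: it can still be chosen.
	fmt.Println(prefersZigzag(100, 10, true))
}
```

In this model the zigzag join is chosen under the improved costing only when it produces fewer rows than the scan, which matches the behavior the commit message describes.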
opt/memo: improve selectivity estimation with multi-column stats
This commit updates correlationFromMultiColDistinctCounts in statisticsBuilder to use a tighter lower bound for the multi-column selectivity. This avoids cases where we significantly over-estimate the selectivity of a multi-column predicate.
Fixes #121397
Release note (performance improvement): Improved the selectivity estimation of multi-column filters when the multi-column distinct count is high. This avoids cases where we significantly over-estimate the selectivity of a multi-column predicate and as a result can prevent the optimizer from choosing a bad query plan.
sql: add setting optimizer_use_improved_multi_column_selectivity_estimate
Informs #121397
Release note (sql change): Added a setting optimizer_use_improved_multi_column_selectivity_estimate, which if enabled, causes the optimizer to use an improved selectivity estimate for multi-column predicates. This setting will default to true on versions 24.2+, and false on prior versions.
opt: improve variable names in selectivityFromMultiColDistinctCounts
This commit improves the variable names in selectivityFromMultiColDistinctCounts in statisticsBuilder to be more self-documenting.
Release note: None
Release justification: low-risk, high benefit change to existing functionality to unblock a customer