New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
release-21.1: opt: calculate more accurate selectivity for disjunctions #67730
release-21.1: opt: calculate more accurate selectivity for disjunctions #67730
Conversation
This commit refactors `statisticsBuilder.applyFilter` by renaming it to `applyFilters` and moving the logic in the `applyConjunction` closure to a separate `applyFiltersItem` function. Release note: None
887dc5b
to
d2b1d9b
Compare
Thanks for opening a backport. Please check the backport criteria before merging:
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
Add a brief release justification to the body of your PR to justify this backport. Some other things to consider:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r1, 5 of 5 files at r2.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @mgartner)
pkg/sql/opt/memo/statistics_builder.go, line 34 at r2 (raw file):
// efficient query plans in some cases. var improveDisjunctionSelectivityEnabled = settings.RegisterBoolSetting( "sql.defaults.optimizer_improve_disjunction_selectivity.enabled",
I think that typically when we use sql.defaults....
there is a corresponding session variable. Maybe that's worth considering here as well?
If you don't want to add a session setting (maybe it requires too many code changes?) I think you should probably use a different name. I was looking around for some naming guidelines but all I could find was this: https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20170317_settings_table.md, which doesn't explicitly include that rule about sql.defaults
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @rytaft)
pkg/sql/opt/memo/statistics_builder.go, line 34 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
I think that typically when we use
sql.defaults....
there is a corresponding session variable. Maybe that's worth considering here as well?If you don't want to add a session setting (maybe it requires too many code changes?) I think you should probably use a different name. I was looking around for some naming guidelines but all I could find was this: https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20170317_settings_table.md, which doesn't explicitly include that rule about
sql.defaults
.
My reasoning was to not add anything that's not absolutely necessary, so I didn't add a session variable.
I see what you mean about the naming. defaults
implies that it is a cluster setting to set the default value of a session setting. So maybe sql.optimizer_improve_disjunction_selectivity.enabled
would be better?
The optimal plan for a query with a disjunctive filter is often a DistinctOn+UnionAll of the results of two index scans, when two separate indexes can satisfy either side of the disjunction. For example: SELECT * FROM t WHERE a = 'foo' OR b = 'foo' LIMIT 5 limit ├── distinct-on │ └── union-all │ ├── scan t@a_idx │ │ └── constraint: /a: [/'foo' - /'foo'] │ └── scan t@a_idx │ └── constraint: /b: [/'foo' - /'foo'] └── 5 However, this optimal plan is not always chosen. A `constraint.Set` cannot be built for a disjunction involving multiple columns, so the statistics builder assumes the selectivity of the disjunction to be 1/3, often resulting in an over-estimated row count. When the DistinctOn is built during exploration it is added to the same memo group as the Select that it replaces, so it shares the same row count estimate. Even though the UnionAll's cost is low because it produces a small subset of the table, the DistinctOn's cost is high because the coster is under the assumption that the DistinctOn will produce 1/3 of the rows in the table. The overhead of producing so many rows adds significant overhead to the overall cost, preventing this plan from being chosen by the optimizer. This commit fixes the issue by attempting to build a constraint set for each side of a disjunction in a filter. By unioning the selectivity of each constraint set, a more accurate row count estimate is calculated for the filter. As a result, the cost of the DistinctOn is more accurate and the optimal plan is chosen. This fix is only enabled if the cluster setting is enabled: `sql.defaults.optimizer_improve_disjunction_selectivity.enabled`. Informs cockroachdb#58744 Release note (performance improvement): A new cluster setting `sql.defaults.optimizer_improve_disjunction_selectivity.enabled` enables more accurate selectivity estimation of query filters with OR expressions. This improves query plans in some cases. The cluster setting is disabled by default.
d2b1d9b
to
54e2c6e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner)
pkg/sql/opt/memo/statistics_builder.go, line 34 at r2 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
My reasoning was to not add anything that's not absolutely necessary, so I didn't add a session variable.
I see what you mean about the naming.
defaults
implies that it is a cluster setting to set the default value of a session setting. So maybesql.optimizer_improve_disjunction_selectivity.enabled
would be better?
That seems fine to me. Thanks!
Merging despite the failing blathers check - the requirements are met but the check is erroneously failing. |
Note that using a cluster setting like this doesn't play nice with memo caching. All other knobs are stored in Memo so |
Welp... I guess I'll have to adjust this to be a session setting that is plumbed into a memo field. Thanks for pointing this out! |
Previously, only a cluster setting could enable improved optimizer calculations for the selectivity of filters with disjunctions. This prevented the optimizer from detecting stale memos in the cache. See cockroachdb#67730 (comment). This commit adds the `optimizer_improve_disjunction_selectivity` session setting to fix this. The cluster setting, which is now called `sql.defaults.optimizer_improve_disjunction_selectivity.enable` controls the default value of the session setting. Release note: None
Previously, only a cluster setting could enable improved optimizer calculations for the selectivity of filters with disjunctions. This prevented the optimizer from detecting stale memos in the cache. See cockroachdb#67730 (comment). This commit adds the `optimizer_improve_disjunction_selectivity` session setting to fix this. The cluster setting, which is now called `sql.defaults.optimizer_improve_disjunction_selectivity.enable` controls the default value of the session setting. Release note: None
Backport 2/2 commits from #64886.
/cc @cockroachdb/release
I've changed the original commits, so please be mindful of these differences
while reviewing:
setting could be changed; optimizer tests do not currently allow
for changing cluster settings.
Release justification: This is needed to prevent the optimizer from choosing
query plans with full table scans for a customer's workload. It improves the
selectivity estimations for query filters with disjunctions. The change is gated
behind a cluster setting which is disabled by default to prevent other customers'
query plans from changing if they upgrade to a new patch release.
opt: refactor statisticsBuilder.applyFilter
This commit refactors
statisticsBuilder.applyFilter
by renaming it toapplyFilters
and moving the logic in theapplyConjunction
closure toa separate
applyFiltersItem
function.Release note: None
opt: calculate more accurate selectivity for disjunctions
The optimal plan for a query with a disjunctive filter is often a
DistinctOn+UnionAll of the results of two index scans, when two separate
indexes can satisfy either side of the disjunction. For example:
However, this optimal plan is not always chosen. A
constraint.Set
cannot be built for a disjunction involving multiple columns, so the
statistics builder assumes the selectivity of the disjunction to be 1/3,
often resulting in an over-estimated row count. When the DistinctOn is
built during exploration it is added to the same memo group as the
Select that it replaces, so it shares the same row count estimate. Even
though the UnionAll's cost is low because it produces a small subset of
the table, the DistinctOn's cost is high because the coster is under the
assumption that the DistinctOn will produce 1/3 of the rows in the
table. The overhead of producing so many rows adds significant overhead
to the overall cost, preventing this plan from being chosen by the
optimizer.
This commit fixes the issue by attempting to build a constraint set for
each side of a disjunction in a filter. By unioning the selectivity of
each constraint set, a more accurate row count estimate is calculated
for the filter. As a result, the cost of the DistinctOn is more accurate
and the optimal plan is chosen.
This fix is only enabled if the cluster setting is enabled:
sql.optimizer_improve_disjunction_selectivity.enabled
.Informs #58744
Release note (performance improvement): A new cluster setting
sql.optimizer_improve_disjunction_selectivity.enabled
enablesmore accurate selectivity estimation of query filters with OR
expressions. This improves query plans in some cases. The cluster
setting is disabled by default.