Join Order + EXPLAIN Improvements #5891
Conversation
```cpp
double inspect_result = (double)InspectTableFilters(card_after_filters, op, table_filters);
card_after_filters = MinValue(inspect_result, (double)card_after_filters);
}
if (has_logical_filter) {
	card_after_filters *= DEFAULT_SELECTIVITY;
```
I'm guessing this code caused the bug where we don't apply any filters? My initial thinking here was that if there are one or more filters on a table, we apply the default selectivity once. I didn't want to apply multiple filter selectivities independently, as that could lead to greater inaccuracies. Looks like that can still happen here?
I think the issue might then be in the function `InspectTableFilters`? Maybe the default selectivity doesn't get applied in that function for some reason?
This is indeed the code that caused the bug. It checked whether there was a `LogicalFilter`. If there was one, it would ignore the filters that were pushed into the base tables. We now apply multiple filter selectivities independently, but I don't see why that would be a problem.
In `InspectTableFilters`, we only apply the filters on the base table, and not those in the `LogicalFilter` that might be on top. So the bug is similar. Maybe we can check together how to properly fix this tomorrow?
```cpp
auto &filter_props = *result_operator->estimated_props;
auto &child_operator = *result_operator->children[0];
child_operator.estimated_props = make_unique<EstimatedProperties>(
    filter_props.GetCardinality<double>() / CardinalityEstimator::DEFAULT_SELECTIVITY, filter_props.GetCost());
```
Why do we need estimated properties on both? If I'm not mistaken, the properties of a join order node are just pulled up from all the nodes below it. Those properties are then used by the JOO.
Or is this to prevent confusion when we print join plans?
This is purely to improve `EXPLAIN ANALYZE` output.
We just went over this and agreed that the implementation is good as-is @Mytherin
lgtm 👍
Important to note that we sometimes apply the default selectivity twice in the case of two or more filters, specifically when one filter can be pushed into the sequential scan and one cannot. For the TPC-H query that Laurens mentioned in the description, the sequential scan has an equality filter on an integer column, and the logical filter is on a string column. After a brief discussion, we realized it is reasonable to assume the filters are independent in this case, so we now apply the default selectivity twice.
Thanks! LGTM. Ready to merge after tests pass.
The failing test is due to 2 plans in JOB regressing, as mentioned above.
This PR started off as improvements to the `EXPLAIN ANALYZE` output. We found that expected cardinalities were not being reported properly if we have a `FILTER` on top of a `SEQ_SCAN` with filters pushed into it. This is rare, since the expressions in the `FILTER` can usually be pushed into the `SEQ_SCAN`. However, sometimes a few filters are pushed into the scan, but others are not, causing us to report cardinality estimates for neither the `FILTER` nor the `SEQ_SCAN`.

While investigating this, I found that when this occurred, we also only applied the filters that were pushed into the `SEQ_SCAN` in our cardinality estimates, and we did not consider the `FILTER` on top of it. I've addressed this: we now apply all filters properly, and we report the estimated cardinalities properly as well (for both the `FILTER` and the `SEQ_SCAN` nodes).

While I was doing this, I also found that we do not re-order joins in sub-plans that contain a `DELIM_GET`. This is tricky to fix, but in many cases we optimize the `DELIM_GET` away with the `Deliminator` optimizer, which is invoked right after the `JoinOrderOptimizer`. Once the `DELIM_GET` is gone, we can simply optimize the join order of the sub-plan. So, I've added a call to the `JoinOrderOptimizer` within the `Deliminator` when this occurs.

Both of these optimizations are applicable to TPC-H query 2, and our `plan_cost_runner.py` benchmark script seems to be happy, as we reduce the size of the intermediates by ~11.5x.

This is the new plan for this query:
As we can see, we report cardinality estimates for the `FILTER` as well as the `SEQ_SCAN` it sits on top of. The top-most join is a comparison join with `JoinType::SINGLE`. This was a delim join before. The right-hand side child of this contained `DELIM_GET`s, and was therefore not re-ordered. Because we did not re-order this, we did not report any cardinality estimates either. As we can see, we now re-order and report cardinality estimates.

The `FILTER` optimization has implications for the Join Order Benchmark as well. The `DELIM_GET` optimization does not, as there are no correlated subqueries in JOB. Here is the output of the `plan_cost_runner.py` script:

As we can see, we improve the cost of 6 queries, while regressing on 2 queries. I'd say that the difference is an improvement overall.
Finally, I've cleaned up some code.