-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reorder semi anti joins #11573
Reorder semi anti joins #11573
Conversation
…uld be able to reorder the semi join
…so add better heuristics for determining a distinct count
…kdb into reorder-semi-and-anti-joins
…_column_lifetime_analyzer you will find why operator expressions need to be visited first. or why rilters cannot be removed
…mator which is segfaulting now
…ter/duckdb into reorder_semi_anti_joins_easier_fix
…i_joins_easier_fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome that you got this working! I have left some comments below. Regarding this:
This PR also introduced a bug where statistics propagation would rewrite a part of the plan after the projection bindings were set. The culprit was converting mark joins to semi joins. To fix this, the Filter Pushdown optimizer no has an option to not rewrite these types of joins. I am open to thinking about other ways of how to do this.
Is this an issue with projection_map
s again? Can't we check whether the projection map is empty, and decide not to perform the optimization in that case, rather than adding the bool
to the FilterPushdown
? We already have a similar check at pushdown_filter.cpp:12
, maybe we should add it in more places.
@@ -8,10 +8,11 @@ CostModel::CostModel(QueryGraphManager &query_graph_manager) | |||
: query_graph_manager(query_graph_manager), cardinality_estimator() { | |||
} | |||
|
|||
double CostModel::ComputeCost(JoinNode &left, JoinNode &right) { | |||
double CostModel::ComputeCost(JoinNode &left, JoinNode &right, JoinType join_type) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The newly added join_type
parameter seems unused here
auto cost = cost_model.ComputeCost(left, right); | ||
join_type = filter_binding->join_type; | ||
// prefer joining on semi and anti joins as they have a higher chance of being more | ||
// selective |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we compute the selectivity and select the best connection based on that? I know the default selectivity for SEMI/ANTI is set to 0.2 for now, but some INNER joins may be even more selective, in which case it would be better to perform the INNER first.
@@ -242,6 +242,7 @@ unique_ptr<NodeStatistics> StatisticsPropagator::PropagateStatistics(LogicalFilt | |||
i--; | |||
if (filter.expressions.empty()) { | |||
// just break. The physical filter planner will plan a projection instead | |||
// we don't remove the filter because it might have a projection map. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know you've only added a comment here, but we can check whether LogicalFilter::projection_map
is empty rather than assuming it's not
|
||
DenomInfo CardinalityEstimator::GetDenominator(JoinRelationSet &set) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function (previously EstimateCardinalityWithSet
, now GetDenominator
) is quite long (~170 lines), and I find it difficult to understand what is happening. Do you think it's possible to refactor this?
Perhaps we can change some of the if
/else
below to this:
switch(filter->join_type) {
case JoinType::INNER:
UpdateDenominatorInner(...);
case JoinType::SEMI:
case JoinType::ANTI:
UpdateDenominatorSemiAnti(...);
default:
D_ASSERT(filter->join_type == JoinType::INVALID);
// Not sure if it should be INVALID, I just like assertions
UpdateDenominatorCrossProduct(...);
}
I find using switches and separating logic into different functions improves readability.
To do this, we do the following,
Since this cardinality estimation method uses multiplication, it is also symmetrical, which means we don't have to worry much about different join plans for the same set of relations having different estimated cardinalities.
This PR also introduced a bug where statistics propagation would rewrite a part of the plan after the projection bindings were set. The culprit was converting mark joins to semi joins. To fix this, the Filter Pushdown optimizer no has an option to not rewrite these types of joins. I am open to thinking about other ways of how to do this.
In order to implement pushing down semi joins, we need to keep better track of relations on the left and right side and what the join type is. Cost computations now usually assume the join is an inner join unless stated otherwise.
Some other small improvements:
Another heuristic was added for determining the number of distinct elements in a column as well. For integral type columns, if the maxVal - minVal is less than the distinct count measured by HLL, then DuckDB will prefer max - min as the distinct count.
I thought this was the source of a bad join order. Turns out that wasn't the case, but I think it is still a good join heuristic.