Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reorder semi anti joins #11573

Closed

Conversation

Tmonster
Copy link
Contributor

@Tmonster Tmonster commented Apr 9, 2024

To do this, we do the following,

  1. When we look at reordering and what relations can join with what other relations, we treat semi joins and anti joins like regular inner joins.
  2. To make sure the left and right children are not flipped, we compare the left and right bindings from the original join to the left and right bindings of the recreated plan. If the children are flipped, we flip them back.
  3. When calculating the cost of a semi join we pretend the left table (i.e the table that propagates the up) is being filtered on. We take the cardinality of the left side and multiply it by 0.2.

Since this cardinality estimation method uses multiplication, it is also symmetrical, which means we don't have to worry much about different join plans for the same set of relations having different estimated cardinalities.

This PR also introduced a bug where statistics propagation would rewrite a part of the plan after the projection bindings were set. The culprit was converting mark joins to semi joins. To fix this, the Filter Pushdown optimizer no has an option to not rewrite these types of joins. I am open to thinking about other ways of how to do this.

In order to implement pushing down semi joins, we need to keep better track of relations on the left and right side and what the join type is. Cost computations now usually assume the join is an inner join unless stated otherwise.

Some other small improvements:

  1. More checks in the cardinality estimator to make sure the denominator is not 0 (since that will cause very inaccurate cardinality estimates)
  2. More checks during statistics extraction to make sure the distinct count of a column is always between 1 and the cardinality of the table.

Another heuristic was added for determining the number of distinct elements in a column as well. For integral type columns, if the maxVal - minVal is less than the distinct count measured by HLL, then DuckDB will prefer max - min as the distinct count.

I thought this was the source of a bad join order. Turns out that wasn't the case, but I think it is still a good join heuristic.

…so add better heuristics for determining a distinct count
…_column_lifetime_analyzer you will find why operator expressions need to be visited first. or why rilters cannot be removed
@Tmonster Tmonster requested a review from lnkuiper April 9, 2024 07:11
Copy link
Contributor

@lnkuiper lnkuiper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome that you got this working! I have left some comments below. Regarding this:

This PR also introduced a bug where statistics propagation would rewrite a part of the plan after the projection bindings were set. The culprit was converting mark joins to semi joins. To fix this, the Filter Pushdown optimizer no has an option to not rewrite these types of joins. I am open to thinking about other ways of how to do this.

Is this an issue with projection_maps again? Can't we check whether the projection map is empty, and decide not to perform the optimization in that case, rather than adding the bool to the FilterPushdown? We already have a similar check at pushdown_filter.cpp:12, maybe we should add it in more places.

@@ -8,10 +8,11 @@ CostModel::CostModel(QueryGraphManager &query_graph_manager)
: query_graph_manager(query_graph_manager), cardinality_estimator() {
}

double CostModel::ComputeCost(JoinNode &left, JoinNode &right) {
double CostModel::ComputeCost(JoinNode &left, JoinNode &right, JoinType join_type) {
Copy link
Contributor

@lnkuiper lnkuiper Apr 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The newly added join_type parameter seems unused here

auto cost = cost_model.ComputeCost(left, right);
join_type = filter_binding->join_type;
// prefer joining on semi and anti joins as they have a higher chance of being more
// selective
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we compute the selectivity and select the best connection based on that? I know the default selectivity for SEMI/ANTI is set to 0.2 for now, but some INNER joins may be even more selective, in which case it would be better to perform the INNER first.

@@ -242,6 +242,7 @@ unique_ptr<NodeStatistics> StatisticsPropagator::PropagateStatistics(LogicalFilt
i--;
if (filter.expressions.empty()) {
// just break. The physical filter planner will plan a projection instead
// we don't remove the filter because it might have a projection map.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know you've only added a comment here, but we can check whether LogicalFilter::projection_map is empty rather than assuming it's not


DenomInfo CardinalityEstimator::GetDenominator(JoinRelationSet &set) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function (previously EstimateCardinalityWithSet, now GetDenominator) is quite long (~170 lines), and I find it difficult to understand what is happening. Do you think it's possible to refactor this?

Perhaps we can change some of the if/else below to this:

switch(filter->join_type) {
case JoinType::INNER:
    UpdateDenominatorInner(...);
case JoinType::SEMI:
case JoinType::ANTI:
    UpdateDenominatorSemiAnti(...);
default:
    D_ASSERT(filter->join_type == JoinType::INVALID);
    // Not sure if it should be INVALID, I just like assertions
    UpdateDenominatorCrossProduct(...);
}

I find using switches and separating logic into different functions improves readability.

@Mytherin Mytherin marked this pull request as draft April 10, 2024 09:06
@Tmonster Tmonster closed this Apr 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants