Reorder semi anti joins #11573

Tmonster · 2024-04-09T07:10:51Z

To do this, we do the following:

When we look at reordering and what relations can join with what other relations, we treat semi joins and anti joins like regular inner joins.
To make sure the left and right children are not flipped, we compare the left and right bindings from the original join to the left and right bindings of the recreated plan. If the children are flipped, we flip them back.
When calculating the cost of a semi join, we pretend the left table (i.e. the table whose rows are propagated up) is being filtered. We take the cardinality of the left side and multiply it by 0.2.
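The child-order check described above could look roughly like the sketch below. It is only an illustration: `BindingSet`, `JoinChildren` and `RestoreChildOrder` are hypothetical names, not DuckDB types, and bindings are reduced to plain table indexes. It assumes the rebuilt side may contain extra relations after reordering, so it tests containment rather than equality.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>

// Hypothetical stand-in: a binding is reduced to a table index.
using BindingSet = std::set<std::size_t>;

// Hypothetical stand-in for the bindings produced by each child of a join.
struct JoinChildren {
    BindingSet left;
    BindingSet right;
};

// True when every element of `inner` also appears in `outer`.
static bool IsSubset(const BindingSet &inner, const BindingSet &outer) {
    for (auto binding : inner) {
        if (outer.count(binding) == 0) {
            return false;
        }
    }
    return true;
}

// Compares the original left bindings with the rebuilt plan and swaps the
// rebuilt children back if they ended up on the wrong side.
// Returns true when a swap happened.
bool RestoreChildOrder(const JoinChildren &original, JoinChildren &rebuilt) {
    assert(!original.left.empty());
    if (IsSubset(original.left, rebuilt.left)) {
        return false; // order preserved, nothing to do
    }
    if (IsSubset(original.left, rebuilt.right)) {
        std::swap(rebuilt.left, rebuilt.right);
        return true;
    }
    return false; // bindings not found on either side; left untouched in this sketch
}
```

For semi and anti joins the order matters because only the left side is propagated up, which is why a flipped plan has to be corrected rather than accepted.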

Since this cardinality estimation method uses multiplication, it is also symmetrical, which means we don't have to worry much about different join plans for the same set of relations having different estimated cardinalities.
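A minimal sketch of the estimate and of why multiplication makes it order-independent. The constant and function names are made up for illustration; only the 0.2 factor comes from the description above.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Assumption: the name is hypothetical; 0.2 is the factor from the description.
constexpr double kSemiAntiSelectivity = 0.2;

// Semi join estimate: the left side is treated as if it were filtered.
double EstimateSemiJoinCardinality(double left_cardinality) {
    assert(std::isfinite(left_cardinality) && left_cardinality >= 0);
    return left_cardinality * kSemiAntiSelectivity;
}

// Applies a sequence of multiplicative factors to a base cardinality.
// Because the result is a plain product, the order of the factors does not
// change the estimate (up to floating-point rounding).
double ApplyFactors(double base_cardinality, const std::vector<double> &factors) {
    double result = base_cardinality;
    for (double factor : factors) {
        result *= factor;
    }
    return result;
}
```

With a left side of 1000 rows, applying the semi join factor before or after some other factor of 0.5 gives about 100 rows either way, which is the symmetry referred to above.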

This PR also introduced a bug where statistics propagation would rewrite a part of the plan after the projection bindings were set. The culprit was converting mark joins to semi joins. To fix this, the Filter Pushdown optimizer now has an option to not rewrite these types of joins. I am open to thinking about other ways to do this.

In order to implement pushing down semi joins, we need to keep better track of relations on the left and right side and what the join type is. Cost computations now usually assume the join is an inner join unless stated otherwise.

Some other small improvements:

More checks in the cardinality estimator to make sure the denominator is not 0 (a zero denominator causes very inaccurate cardinality estimates).
More checks during statistics extraction to make sure the distinct count of a column is always between 1 and the cardinality of the table.
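The two checks could be sketched as follows. Both helpers are hypothetical, and replacing a non-positive denominator with 1 is an assumption made for the example, not necessarily what the cardinality estimator does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: clamp a reported distinct count into [1, cardinality].
// An empty table is treated as having an upper bound of 1 so the range stays valid.
uint64_t ClampDistinctCount(uint64_t reported, uint64_t cardinality) {
    uint64_t upper = std::max<uint64_t>(cardinality, 1);
    uint64_t lower_bounded = std::max<uint64_t>(reported, 1);
    return std::min(lower_bounded, upper);
}

// Hypothetical helper: numerator / denominator with a guard against a zero
// (or negative) denominator. Falling back to 1 is an assumption of this sketch.
double SafeEstimate(double numerator, double denominator) {
    assert(numerator >= 0);
    if (denominator <= 0) {
        denominator = 1;
    }
    return numerator / denominator;
}
```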

Another heuristic was added for determining the number of distinct elements in a column. For integral-type columns, if max - min is less than the distinct count measured by HLL (HyperLogLog), DuckDB prefers max - min as the distinct count.

I thought this was the source of a bad join order. Turns out that wasn't the case, but I think it is still a good join heuristic.
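As a sketch, the heuristic amounts to something like the function below. `RefineDistinctCount` is a made-up name, the range is taken literally as max - min as described, and the floor of 1 follows the rule that a distinct count is never below 1.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: prefer (max - min) over the HLL estimate when the
// value range is smaller than the estimate.
uint64_t RefineDistinctCount(int64_t min_val, int64_t max_val, uint64_t hll_estimate) {
    assert(min_val <= max_val);
    // Subtract in unsigned arithmetic so a range spanning the whole int64
    // domain does not overflow.
    uint64_t range = static_cast<uint64_t>(max_val) - static_cast<uint64_t>(min_val);
    uint64_t result = range < hll_estimate ? range : hll_estimate;
    // Keep the distinct count at 1 or above.
    return std::max<uint64_t>(result, 1);
}
```

For example, a column with min 1 and max 11 and an HLL estimate of 50 would end up with a distinct count of 10.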


lnkuiper

Awesome that you got this working! I have left some comments below. Regarding this:

This PR also introduced a bug where statistics propagation would rewrite a part of the plan after the projection bindings were set. The culprit was converting mark joins to semi joins. To fix this, the Filter Pushdown optimizer now has an option to not rewrite these types of joins. I am open to thinking about other ways to do this.

Is this an issue with projection_maps again? Can't we check whether the projection map is empty, and decide not to perform the optimization in that case, rather than adding the bool to the FilterPushdown? We already have a similar check at pushdown_filter.cpp:12, maybe we should add it in more places.

lnkuiper · 2024-04-09T12:43:00Z

src/optimizer/join_order/cost_model.cpp

@@ -8,10 +8,11 @@ CostModel::CostModel(QueryGraphManager &query_graph_manager)
    : query_graph_manager(query_graph_manager), cardinality_estimator() {
 }

-double CostModel::ComputeCost(JoinNode &left, JoinNode &right) {
+double CostModel::ComputeCost(JoinNode &left, JoinNode &right, JoinType join_type) {


The newly added join_type parameter seems unused here

lnkuiper · 2024-04-09T12:48:44Z

src/optimizer/join_order/plan_enumerator.cpp

-	auto cost = cost_model.ComputeCost(left, right);
+		join_type = filter_binding->join_type;
+		// prefer joining on semi and anti joins as they have a higher chance of being more
+		// selective


Can't we compute the selectivity and select the best connection based on that? I know the default selectivity for SEMI/ANTI is set to 0.2 for now, but some INNER joins may be even more selective, in which case it would be better to perform the INNER first.
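A rough illustration of what selecting by selectivity could look like. `JoinKind`, `Candidate` and `PickMostSelective` are hypothetical and not part of the plan enumerator; the selectivity values would have to come from the cardinality estimator.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the join types relevant here.
enum class JoinKind { INNER, SEMI, ANTI };

// Hypothetical candidate connection between two sets of relations.
struct Candidate {
    JoinKind kind;
    double selectivity; // fraction of rows kept; lower means more selective
};

// Returns the index of the most selective candidate, regardless of join type.
// Ties keep the earliest candidate.
std::size_t PickMostSelective(const std::vector<Candidate> &candidates) {
    assert(!candidates.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < candidates.size(); i++) {
        if (candidates[i].selectivity < candidates[best].selectivity) {
            best = i;
        }
    }
    return best;
}
```

With this rule a SEMI join at the default 0.2 would lose to an INNER join estimated at 0.05, which is the case described above.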

lnkuiper · 2024-04-09T13:05:40Z

src/optimizer/statistics/operator/propagate_filter.cpp

@@ -242,6 +242,7 @@ unique_ptr<NodeStatistics> StatisticsPropagator::PropagateStatistics(LogicalFilt
 			i--;
 			if (filter.expressions.empty()) {
 				// just break. The physical filter planner will plan a projection instead
+				// we don't remove the filter because it might have a projection map.


I know you've only added a comment here, but we can check whether LogicalFilter::projection_map is empty rather than assuming it's not

lnkuiper · 2024-04-09T13:51:00Z

src/optimizer/join_order/cardinality_estimator.cpp


+DenomInfo CardinalityEstimator::GetDenominator(JoinRelationSet &set) {


This function (previously EstimateCardinalityWithSet, now GetDenominator) is quite long (~170 lines), and I find it difficult to understand what is happening. Do you think it's possible to refactor this?

Perhaps we can change some of the if/else below to this:

	switch (filter->join_type) {
	case JoinType::INNER:
		UpdateDenominatorInner(...);
		break;
	case JoinType::SEMI:
	case JoinType::ANTI:
		UpdateDenominatorSemiAnti(...);
		break;
	default:
		// Not sure if it should be INVALID, I just like assertions
		D_ASSERT(filter->join_type == JoinType::INVALID);
		UpdateDenominatorCrossProduct(...);
		break;
	}

I find using switches and separating logic into different functions improves readability.

Tmonster added 30 commits January 25, 2024 17:59

I think we can reorder semi joins

18b7e51

I have access to the left and right sets and the join type. Now I sho…

77d503e

…uld be able to reorder the semi join

can now estimate a semi join to be 20% of the left table

34ad120

have most everything working. some CEs not quite. Q05 tpch needs to b…

d8f8f9f

…e looked at

Merge remote-tracking branch 'upstream/main' into reorder-semi-and-an…

8e6b730

…ti-joins

some debugging statements

4ef7415

make format-fix

d0e4892

fix code where numerator relations were not properly being merged. Al…

9eecc85

…so add better heuristics for determining a distinct count

add cross product join type

2732612

fix join is reorderabe function. generate enums

bd1d164

pausing point

8be872d

fix last join issue and better min/max stats help on distinct count

b9885f6

remove min max stats changesg

e7eaf10

add back in removed test

2183142

pausing to work on adding benchmarks for ingestion

7f49a57

trying to fix this column lifetime analyzer bug

206109e

you can also change the bindings filters return. but that's not a goo…

830cdec

…d idea

remove changing what bindings filters return

dfab1b6

check tests one more time. still need to figure out logical as of joi…

72cd1be

…n reordering

Merge branch 'reorder-semi-and-anti-joins' of github.com:Tmonster/duc…

990b296

…kdb into reorder-semi-and-anti-joins

make format-fix

333f252

might be a better solution, lets see if debug passes

7d7690c

Merge remote-tracking branch 'upstream/main' into reorder_semi_anti_j…

4414834

…oins_easier_fix

remove debugging statements

b22940f

fix logic for cross product relations

8f8edbd

skip test that I will remove eventually

d651dcf

these ideas dont work tbh

1fdb318

so close, but cant figure it out yet

8fc0048

still need to track down this bug

9827732

follow this path. Somewhere in the test visit_operator_expressions_in…

3ee5e65

…_column_lifetime_analyzer you will find why operator expressions need to be visited first. or why rilters cannot be removed

Tmonster added 19 commits April 4, 2024 15:18

disable other test, see what else fails

ee258b3

make format-fix

9910429

disable other failing test, see what else fails

a08db68

basically everything works now. Just need to fix the cardinality esti…

46d0114

…mator which is segfaulting now

fix join order optimizer seg fault failure

a26d06b

remove/clean up tests

a3adc84

clean up some unused code

5076c09

Merge branch 'reorder_semi_anti_joins_easier_fix' of github.com:Tmons…

9ab4ca4

…ter/duckdb into reorder_semi_anti_joins_easier_fix

remove randomly failing test

5a00256

more code clean up

faa0755

make format-fix

71da791

comment out ensuring statististcs because arrow has not statistics

e4352af

make format-fix (again)

1cb53ad

rewrite mark join needs to be in more places

4d7b82e

Merge remote-tracking branch 'upstream/feature' into reorder_semi_ant…

e4d78f4

…i_joins_easier_fix

distinct count should be max of 1 and the reported distinct count

9e538dd

removed unused variable

d9df4df

clean up test fileg

e01ab55

remove hashtag

790bd63

Tmonster requested a review from lnkuiper April 9, 2024 07:11

Tmonster added 2 commits April 9, 2024 10:49

clang tidy fixes #1

b6bb52a

make format-fixes 2

c5a2a42

lnkuiper suggested changes Apr 9, 2024


pr comments 1

c9f56d1

Tmonster mentioned this pull request Apr 10, 2024

No Mark to Semi join conversion in statistics propagation #11596

Merged

add more comments for readability

7dc4bcb

Mytherin marked this pull request as draft April 10, 2024 09:06

Tmonster added the Changes Requested label Apr 11, 2024

Tmonster closed this Apr 24, 2024

Tmonster mentioned this pull request Apr 24, 2024

Reorder semi and anti joins. #11815

Merged
