Reorder semi and anti joins. #11815

Tmonster · 2024-04-24T14:24:47Z

Follow-up to #11573.

To do this, we do the following:

When we look at reordering and decide which relations can join with which other relations, we treat semi joins and anti joins like regular inner joins.
To make sure the left and right children are not flipped, we compare the left and right bindings of the original join to the left and right bindings of the recreated plan. If the children are flipped, we flip them back.
When calculating the cost of a semi join, we pretend the left table (i.e. the table whose rows are propagated up) is being filtered. We take the cardinality of the left side and multiply it by 0.2.
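
The child-order check can be sketched roughly as follows. This is only an illustration, not the code in this PR: `Join`, `restore_child_order`, and the use of sets of relation indices as bindings are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Join:
    """Hypothetical stand-in for a join node in a plan."""
    join_type: str     # e.g. "INNER", "SEMI", "ANTI"
    left: frozenset    # relation bindings reachable through the left child
    right: frozenset   # relation bindings reachable through the right child


def restore_child_order(original: Join, rebuilt: Join) -> Join:
    """Flip the children of a rebuilt semi/anti join back if they were swapped."""
    if original.join_type not in ("SEMI", "ANTI"):
        # Assumption: only semi and anti joins need the fix, since their
        # result depends on which side is the left child.
        return rebuilt
    # After reordering, the left side may have gained extra relations, so we
    # only check on which side the original left bindings ended up.
    if original.left <= rebuilt.right:
        return Join(rebuilt.join_type, rebuilt.right, rebuilt.left)
    return rebuilt
```

For example, if the original plan was `{0} SEMI {2}` and the rebuilt plan comes back as `{2} SEMI {0, 1}`, the sketch returns `{0, 1} SEMI {2}`.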

Since this cardinality estimation method uses multiplication, it is also symmetrical, which means we don't have to worry about different join plans for the same set of relations having different estimated cardinalities.
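
A small sketch of why multiplication makes the estimate order-independent. The names (`SEMI_ANTI_FACTOR`, `estimate_semi_join`, `estimate_plan`) and the example numbers are made up for illustration; only the 0.2 factor comes from the description above.

```python
from fractions import Fraction
from itertools import permutations

# The 0.2 factor from the description, kept as an exact fraction so that
# results can be compared without floating-point noise.
SEMI_ANTI_FACTOR = Fraction(1, 5)


def estimate_semi_join(left_cardinality: int) -> Fraction:
    """Estimate a semi join as a filter on the left side."""
    return left_cardinality * SEMI_ANTI_FACTOR


def estimate_plan(factors) -> Fraction:
    """Multiply the factors in the order a particular plan applies them."""
    result = Fraction(1)
    for factor in factors:
        result *= factor
    return result


# Hypothetical factors for one set of relations: two table cardinalities,
# one inner-join selectivity, and one semi-join factor.
factors = [Fraction(1000), Fraction(50), Fraction(1, 50), SEMI_ANTI_FACTOR]
estimates = {estimate_plan(order) for order in permutations(factors)}
```

Every permutation of the factors yields the same product, so `estimates` contains a single value regardless of the order in which the joins are applied.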

Calculating the denominator of the estimated cardinality is easier now. It works like finding a maximum spanning tree. Assuming relations are nodes and join filters are weighted edges, the process of finding the most selective filters is exactly like a maximum spanning tree problem. The weights of the edges come from the shared total domain of the columns of the filter.
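
The spanning-tree analogy can be sketched with Kruskal's algorithm, taking the heaviest edges first. This is not the code in this PR: the function names and the edge format are hypothetical, and treating the denominator as the product of the selected edge weights is an assumption made for the example.

```python
from math import prod


def max_spanning_tree(num_relations, edges):
    """Kruskal's algorithm, heaviest edge first.

    edges: list of (weight, relation_a, relation_b), where the weight stands
    for the shared total domain of the columns in the join filter.
    """
    parent = list(range(num_relations))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    chosen = []
    for weight, a, b in sorted(edges, reverse=True):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # The edge connects two separate components, so keep it.
            parent[root_a] = root_b
            chosen.append((weight, a, b))
    return chosen


def estimate_denominator(num_relations, edges):
    # Assumption: the denominator is the product of the selected weights.
    return prod(weight for weight, _, _ in max_spanning_tree(num_relations, edges))
```

With three relations and filters weighted 1000 (0-1), 200 (0-2), and 50 (1-2), the two heaviest edges are kept and the 50-weight edge is skipped because it would close a cycle.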

Some other small improvements:

More checks in the cardinality estimator to make sure the denominator is not 0, since that causes very inaccurate cardinality estimates.
More checks during statistics extraction to make sure the distinct count of a column is always between 1 and the cardinality of the table.
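
The two checks might look something like this. The function names are hypothetical, and falling back to a denominator of 1 is an assumption made for the sketch, not necessarily what the estimator does.

```python
def clamp_distinct_count(distinct_count: int, table_cardinality: int) -> int:
    """Keep a distinct count between 1 and the table cardinality."""
    # Assumption: an empty table still reports a distinct count of 1.
    upper = max(1, table_cardinality)
    return max(1, min(distinct_count, upper))


def safe_denominator(denominator: float) -> float:
    """Guard against a zero (or negative) denominator."""
    # Assumption: fall back to 1 so the estimate equals the numerator.
    return denominator if denominator > 0 else 1.0


def estimate_cardinality(numerator: float, denominator: float) -> float:
    return numerator / safe_denominator(denominator)
```

Here `clamp_distinct_count(500, 100)` is capped at 100, and `estimate_cardinality(1000, 0)` returns 1000 instead of dividing by zero.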

Another heuristic was added for determining the number of distinct elements in a column. For integral type columns, if max - min is less than the distinct count measured by HyperLogLog (HLL), DuckDB prefers max - min as the distinct count.
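
A minimal sketch of that heuristic, following the description literally (max - min rather than max - min + 1). `refine_distinct_count` and its parameters are hypothetical names, and clamping the result to at least 1 is an assumption based on the statistics check mentioned above.

```python
def refine_distinct_count(hll_count, min_val, max_val, is_integral):
    """Prefer max - min over the HLL estimate when the value range is smaller."""
    if not is_integral or min_val is None or max_val is None:
        # Without integral min/max statistics, keep the HLL estimate.
        return hll_count
    value_range = max_val - min_val
    if value_range < hll_count:
        # Assumption: never report fewer than 1 distinct value.
        return max(1, value_range)
    return hll_count
```

For an integer column with values between 0 and 100 and an HLL estimate of 1000, the sketch returns 100. With an HLL estimate of 50 it keeps 50.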

I thought this was the source of a bad join order. Turns out that wasn't the case, but I think it is still a good join heuristic.

Tmonster · 2024-06-05T06:54:12Z

@Mytherin this is ready to go now. You reviewed once already, but I had to battle CI for a while since I also needed to patch substrait.

Mytherin · 2024-06-05T07:02:44Z

Thanks!

Tmonster added 30 commits January 25, 2024 17:59

I think we can reorder semi joins

18b7e51

I have access to the left and right sets and the join type. Now I should be able to reorder the semi join

77d503e

can now estimate a semi join to be 20% of the left table

34ad120

have most everything working. some CEs not quite. Q05 tpch needs to be looked at

d8f8f9f

Merge remote-tracking branch 'upstream/main' into reorder-semi-and-anti-joins

8e6b730

some debugging statements

4ef7415

make format-fix

d0e4892

fix code where numerator relations were not properly being merged. Also add better heuristics for determining a distinct count

9eecc85

add cross product join type

2732612

fix join is reorderabe function. generate enums

bd1d164

pausing point

8be872d

fix last join issue and better min/max stats help on distinct count

b9885f6

remove min max stats changesg

e7eaf10

add back in removed test

2183142

pausing to work on adding benchmarks for ingestion

7f49a57

trying to fix this column lifetime analyzer bug

206109e

you can also change the bindings filters return. but that's not a good idea

830cdec

remove changing what bindings filters return

dfab1b6

check tests one more time. still need to figure out logical as of join reordering

72cd1be

Merge branch 'reorder-semi-and-anti-joins' of github.com:Tmonster/duckdb into reorder-semi-and-anti-joins

990b296

make format-fix

333f252

might be a better solution, lets see if debug passes

7d7690c

Merge remote-tracking branch 'upstream/main' into reorder_semi_anti_joins_easier_fix

4414834

remove debugging statements

b22940f

fix logic for cross product relations

8f8edbd

skip test that I will remove eventually

d651dcf

these ideas dont work tbh

1fdb318

so close, but cant figure it out yet

8fc0048

still need to track down this bug

9827732

follow this path. Somewhere in the test visit_operator_expressions_in_column_lifetime_analyzer you will find why operator expressions need to be visited first. or why rilters cannot be removed

3ee5e65

Tmonster marked this pull request as ready for review May 15, 2024 12:56

make format-fix

9a02adb

duckdb-draftbot marked this pull request as draft May 15, 2024 14:08

Tmonster marked this pull request as ready for review May 16, 2024 13:45

change default semi anti selectivity to 5

82043f3

duckdb-draftbot marked this pull request as draft May 17, 2024 10:38

add test to make sure push down is happening

a7b60eb

Tmonster marked this pull request as ready for review May 17, 2024 11:26

Tmonster marked this pull request as draft May 17, 2024 12:08

Tmonster marked this pull request as ready for review May 17, 2024 12:09

duckdb-draftbot marked this pull request as draft May 17, 2024 12:10

make format-fix

72ec97a

Tmonster marked this pull request as ready for review May 21, 2024 09:14

CI should run again please

3f039bc

duckdb-draftbot marked this pull request as draft May 30, 2024 13:44

Tmonster marked this pull request as ready for review May 30, 2024 13:44

add patch for substrait for semi and anti joins

f3ffde1

duckdb-draftbot marked this pull request as draft June 4, 2024 09:06

Tmonster marked this pull request as ready for review June 4, 2024 09:08

Merge remote-tracking branch 'upstream/feature' into reorder_semi_anti_joins_easier_fix_refactor

fe56a6b

duckdb-draftbot marked this pull request as draft June 4, 2024 11:41

actually apply patches

8fa3935

Tmonster marked this pull request as ready for review June 4, 2024 11:48

fix patch fileg

336dafa

duckdb-draftbot marked this pull request as draft June 4, 2024 13:30

Tmonster marked this pull request as ready for review June 4, 2024 13:31

Tmonster requested a review from Mytherin June 5, 2024 06:52

Mytherin merged commit 0ad2ad5 into duckdb:feature Jun 5, 2024
40 checks passed
