Reorder semi and anti joins. #11815

Merged

Tmonster
Contributor

Follow-up to #11573.

To reorder semi and anti joins, we do the following:

  1. When we look at reordering and what relations can join with what other relations, we treat semi joins and anti joins like regular inner joins.
  2. To make sure the left and right children are not flipped, we compare the left and right bindings from the original join to the left and right bindings of the recreated plan. If the children are flipped, we flip them back.
  3. When calculating the cost of a semi join, we pretend the left table (i.e., the table whose rows propagate up) is being filtered. We take the cardinality of the left side and multiply it by 0.2.
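The check in step 2 can be sketched roughly as follows. This is a minimal illustration, not the DuckDB code: the `Join` class, its fields, and `restore_child_order` are hypothetical names, and bindings are modeled as plain sets of table indexes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Join:
    # Hypothetical stand-in for a logical join node.
    join_type: str        # "SEMI", "ANTI", or "INNER"
    left: frozenset       # table bindings reachable through the left child
    right: frozenset      # table bindings reachable through the right child


def restore_child_order(original: Join, rebuilt: Join) -> Join:
    """Flip the children of a rebuilt semi/anti join back if they were swapped."""
    if original.join_type not in ("SEMI", "ANTI"):
        # Inner joins are symmetric, so child order does not matter here.
        return rebuilt
    # The join was flipped if the original left bindings now sit on the right.
    flipped = original.left <= rebuilt.right and not original.left <= rebuilt.left
    if flipped:
        return Join(rebuilt.join_type, rebuilt.right, rebuilt.left)
    return rebuilt
```

The subset test (`<=`) is used instead of equality because, after reordering, a child may contain more relations than it did in the original plan.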

Since the semi join estimate in step 3 relies on multiplication, it is also symmetrical: different join plans for the same set of relations get the same estimated cardinality.
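A small sketch of the idea, assuming the estimate is built as a product of factors. The 0.2 factor comes from the description above; the function names and the example cardinalities are made up for illustration.

```python
import math
from itertools import permutations

SEMI_JOIN_FACTOR = 0.2  # factor named in the PR description


def semi_join_cardinality(left_cardinality: float) -> float:
    """Treat the semi join as a filter on the left side."""
    return left_cardinality * SEMI_JOIN_FACTOR


def combined_estimate(factors) -> float:
    """Multiply the per-relation factors; the order of factors is irrelevant."""
    return math.prod(factors)


# Hypothetical example: a semi-joined table of 1000 rows plus two other relations.
factors = [semi_join_cardinality(1000), 50, 4]
estimates = [combined_estimate(order) for order in permutations(factors)]
```

Every entry in `estimates` is the same value up to floating-point rounding, which is the symmetry property the paragraph above relies on.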

Calculating the denominator of the estimated cardinality is easier now. It works like finding a maximum spanning tree. Assuming relations are nodes and join filters are weighted edges, the process of finding the most selective filters is exactly like a maximum spanning tree problem. The weights of the edges come from the shared total domain of the columns of the filter.
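The spanning tree analogy can be sketched with Kruskal's algorithm run on descending edge weights. This is an illustration only: the `(left, right, total_domain)` tuple layout and both function names are hypothetical, and taking the denominator as the product of the chosen total domains is an assumption, not something stated above.

```python
def most_selective_filters(num_relations, filters):
    """Pick filters forming a maximum spanning tree.

    filters: list of (left_relation, right_relation, total_domain) tuples.
    """
    parent = list(range(num_relations))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    chosen = []
    # Visit the heaviest edges (largest total domain) first.
    for left, right, tdom in sorted(filters, key=lambda f: f[2], reverse=True):
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            # The filter connects two separate components, so keep it.
            parent[root_left] = root_right
            chosen.append((left, right, tdom))
    return chosen


def denominator(chosen):
    """Assumed here: the denominator is the product of the chosen total domains."""
    result = 1
    for _, _, tdom in chosen:
        result *= max(tdom, 1)  # never let a factor drop the product to 0
    return result
```

For three relations with filters `[(0, 1, 100), (1, 2, 10), (0, 2, 1000)]`, the sketch keeps the edges with weights 1000 and 100 and skips the edge with weight 10, because relations 1 and 2 are already connected by then.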

Some other small improvements:

  1. More checks in the cardinality estimator to make sure the denominator is not 0 (since that will cause very inaccurate cardinality estimates)
  2. More checks during statistics extraction to make sure the distinct count of a column is always between 1 and the cardinality of the table.
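A minimal sketch of what such guards could look like. The function names are hypothetical, and falling back to a denominator of 1 is an assumption made for the example.

```python
def clamp_distinct_count(distinct_count: int, table_cardinality: int) -> int:
    """Keep the distinct count between 1 and the table cardinality."""
    upper = max(table_cardinality, 1)  # an empty table still gets a bound of 1
    return min(max(distinct_count, 1), upper)


def safe_cardinality(numerator: float, denominator: float) -> float:
    """Avoid dividing by a zero (or negative) denominator."""
    if denominator <= 0:
        denominator = 1  # assumed fallback, not taken from the PR
    return numerator / denominator
```

For example, `clamp_distinct_count(500, 100)` yields 100 and `clamp_distinct_count(0, 100)` yields 1.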

Another heuristic was added for determining the number of distinct elements in a column. For integral type columns, if max - min is less than the distinct count measured by HLL, DuckDB prefers max - min as the distinct count.

I thought the distinct count was the source of a bad join order. It turns out that wasn't the case, but I think this is still a good heuristic to have.
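The integer-range heuristic can be sketched like this. The function name and parameters are hypothetical; the comparison itself follows the description above.

```python
def estimate_distinct_count(hll_count: int, min_val=None, max_val=None) -> int:
    """Prefer the value range over the HLL estimate when the range is smaller.

    min_val and max_val are the column's integer min/max statistics, or None
    when they are unavailable (for example, for non-integral columns).
    """
    if min_val is None or max_val is None:
        return hll_count
    value_range = max_val - min_val
    if value_range < hll_count:
        # An integer column cannot hold far more distinct values than its range.
        return value_range
    return hll_count
```

With an HLL estimate of 1200 and values between 1 and 1000, the sketch returns 999. A range of 0 would then be raised to 1 by the distinct count check listed under the small improvements.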

…so add better heuristics for determining a distinct count
…_column_lifetime_analyzer you will find why operator expressions need to be visited first. or why filters cannot be removed
@Tmonster Tmonster marked this pull request as ready for review May 15, 2024 12:56
@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 15, 2024 14:08
@Tmonster Tmonster marked this pull request as ready for review May 16, 2024 13:45
@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 17, 2024 10:38
@Tmonster Tmonster marked this pull request as ready for review May 17, 2024 11:26
@Tmonster Tmonster marked this pull request as draft May 17, 2024 12:08
@Tmonster Tmonster marked this pull request as ready for review May 17, 2024 12:09
@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 17, 2024 12:10
@Tmonster Tmonster marked this pull request as ready for review May 21, 2024 09:14
@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 30, 2024 13:44
@Tmonster Tmonster marked this pull request as ready for review May 30, 2024 13:44
@duckdb-draftbot duckdb-draftbot marked this pull request as draft June 4, 2024 09:06
@Tmonster Tmonster marked this pull request as ready for review June 4, 2024 09:08
@duckdb-draftbot duckdb-draftbot marked this pull request as draft June 4, 2024 11:41
@Tmonster Tmonster marked this pull request as ready for review June 4, 2024 11:48
@duckdb-draftbot duckdb-draftbot marked this pull request as draft June 4, 2024 13:30
@Tmonster Tmonster marked this pull request as ready for review June 4, 2024 13:31
@Tmonster Tmonster requested a review from Mytherin June 5, 2024 06:52
@Tmonster
Contributor Author

Tmonster commented Jun 5, 2024

@Mytherin this is ready to go now. You reviewed it once already, but I had to battle CI for a while since I also needed to patch substrait.

@Mytherin Mytherin merged commit 0ad2ad5 into duckdb:feature Jun 5, 2024
40 checks passed
@Mytherin
Collaborator

Mytherin commented Jun 5, 2024

Thanks!
