Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detect and deduplicate self-joins on unique keys #7709

Open
max-hoffman opened this issue Apr 8, 2024 · 1 comment
Open

Detect and deduplicate self-joins on unique keys #7709

max-hoffman opened this issue Apr 8, 2024 · 1 comment
Labels
analyzer sql Issue with SQL

Comments

@max-hoffman
Copy link
Contributor

A table that joins itself on a unique key can be deduplicated:

select * from mytable a join mytable b on a.i = b.i;
=>
select i, s, i, s from mytable;

In the above query, (i) is a unique key, and the output of the join will simply be the output of a single table.

This came up recently in a context where there was a self join on a non-unique key, but with a DISTINCT field that in practice collapses any join cardinality amplification:

select count(*) from (select distinct a.i from mytable a join mytable b on a.s = b.s);
=>
select count(*) from mytable;

In the above query, we join two tables on a non-unique key, (s). Because (s) is non-unique, the output of a self-join on s might be greater than the table on its own. The DISTINCT operator removes any of this duplication, letting us remove the self-join.

In the general case, DISTINCT only nullifies some varieties of join duplication:

select * from (select distinct a,i, count(*) as cnt from mytable a join mytable b on a.s = b.s where cnt > 3)

In the query above, we added a filter between the self-join and DISTINCT that depends on the self-join cardinality. Aggregations, windows, or scalar subqueries whose logic depends on self-join cardinality all have this issue (and potentially other operators).

I think it might be safe to apply this optimization:

  • same table joined on a unique key
  • same table joined on a non-unique key and a DISTINCT operator and no aggregation/window/scalar subquery expression between the join and DISTINCT
@max-hoffman max-hoffman added sql Issue with SQL analyzer labels Apr 8, 2024
@max-hoffman
Copy link
Contributor Author

I may be overcomplicating this, but I think the transformation needs to happen between building the memo and creating the join reorder graph. This is because we need the following to do redundancy checks:

  • list of the nodes in the graph (to partition by underlying table)
  • equivalent column sets (to check that a table is self-joined)
  • list of indexes supplemented with column ids (to check that a self-join columns complete an index column set)

Without these, we can't (1) partition the nodes by underlying table, (2) determine which tables have overlapping equivalency sets, and then (3) check whether the partitioned tables have any indexes whose columns are subsets of the equivalency sets. The only way to do this with the old transformation rule style is to basically build the memo/relProps separately from join reordering, or worse round-trip the memo twice. Collecting this info is expensive and I think it's probably easier and more desirable to just separate memo building from reorder graph building. The memo separated from reorder graph would let us perform safer and more powerful transformations.

This might take a few days, so I am going to defer for more direct client problems until this arises again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
analyzer sql Issue with SQL
Projects
None yet
Development

No branches or pull requests

1 participant