Skip to content

opt: Add JoinAssociate rule for LeftJoin + InnerJoin #38711

@andy-kimball

Description

@andy-kimball

As shown in a seminal paper on outer join optimization (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.43.2531&rep=rep1&type=pdf), the following are not equivalent:

1. A left-join (B inner-join C)
2. (A left-join B) inner-join C

However, finding a way to reorder the joins would result in a big speedup if the inner join is large. In that case, it could be much better to first combine B with A. The aforementioned paper introduces a new operator, called generalized outer join that is associative in this case.

However, introducing a new join operator carries with it a ton of work and ongoing maintenance. Ideally, we could use existing operators to perform a rewrite. Another paper proposes using the SQL Window functions to do this: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.609&rep=rep1&type=pdf.

The rewrite in this paper is general (can be used to string together multiple joins) and has an associated proof. However, it's quite complex, and requires sorting the entire result and using the max(), rownumber() and PRECEDING window functionality. That said, we should still investigate using this solution.

Separately, I think (but have not proved) that an alternate simpler formulation could be used for a JoinAssociate rule for this case. Here is an example schema:

create table c (c_id int primary key, join_date date, index (join_date));
create table o (o_id int primary key, c_id_ref int, index (c_id_ref));
create table i (i_id int primary key, o_id_ref int, index (o_id_ref));

Here is the original query, that returns up to 100 customers that joined on a given date, along with any orders they've placed:

SELECT c_id, join_date, o_id, c_id_ref, i_id, o_id_ref
FROM c
LEFT JOIN
(
	o INNER JOIN i
	ON o_id=o_id_ref
)
ON c_id=c_id_ref
WHERE join_date='2019-01-01'
LIMIT 100;

Here is an equivalent query that uses a left-join + rank function + filter:

-- Convert inner-join into a left-join. Because the join predicate is null-rejecting,
-- the left-joins can then be reordered. But it then means that extra rows can be
-- added by the new left-join, and those have to be removed later on.
--
-- Compute RANK over each group of orders. If i_id is NULL, it's a row that was
-- added by one of the left-joins. We want to remove all these rows, except for
-- the case when RANK = 1. Rows with RANK = 1 represent customers with no
-- orders, and need to be preserved.
--
-- Use `ORDER BY i_id DESC` to order NULL i_id values last in the group. This
-- guarantees that a NULL i_id value will have RANK = 1 only if it's the only row in
-- the orders group (i.e. a customer with no orders).
SELECT
	c_id,
	join_date,
	CASE WHEN i_id IS NOT NULL THEN o_id END AS o_id,
	CASE WHEN i_id IS NOT NULL THEN c_id_ref END AS c_id_ref,
	i_id,
	o_id_ref
FROM
(
	SELECT *, RANK() OVER (PARTITION BY c_id ORDER BY i_id DESC, o_id) AS rank
	FROM
	(
		SELECT * FROM
		(
			SELECT *
			FROM c
			LEFT JOIN o
			ON c_id=c_id_ref
			WHERE join_date='2019-01-01'
		)
		LEFT JOIN i
		ON o_id=o_id_ref
	)
)
WHERE i_id IS NOT NULL OR rank = 1
LIMIT 100;

The optimizer estimates the cost of the original query as ~11,000,000. By contrast, the equivalent query's estimated cost is ~30,000.

Jira issue: CRDB-5617

Metadata

Metadata

Assignees

No one assigned

    Labels

    C-enhancementSolution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)T-sql-queriesSQL Queries Team

    Type

    No type

    Projects

    Status

    Backlog

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions