Add support for nested laterals #7528

Merged
Mytherin merged 4 commits into duckdb:feature from CMU-15-745:nested_laterals
May 22, 2023

@arhamchopra

Contributor

This PR adds support for nested LATERAL joins (arbitrary nesting of subqueries and LATERAL joins) to DuckDB. In the current version of DuckDB, the following example from PR #5393 will produce a binder error:

SELECT * FROM (SELECT 42) t(i), (SELECT * FROM (SELECT 142 k) t3(k), (SELECT k+i) t4(l)) t2(j);
# Binder Error: Nested lateral joins are not supported yet

However, after this PR, DuckDB produces the correct result:

SELECT * FROM (SELECT 42) t(i), (SELECT * FROM (SELECT 142 k) t3(k), (SELECT k+i) t4(l)) t2(j);
┌───────┬───────┐
│   i   │   j   │
│ int64 │ int64 │
├───────┼───────┤
│    42 │   142 │
└───────┴───────┘

Further, after this PR, queries with correlations across LATERALs and subqueries also produce the correct result:

SELECT * FROM (SELECT 42) t4(m) WHERE m IN (SELECT i FROM (SELECT m) t(i), (SELECT i + m) t2(j));
┌───────┐
│   m   │
│ int32 │
├───────┤
│    42 │
└───────┘ 

Current Restrictions

This PR does not add support for correlated recursive CTEs. For example:

CREATE MACRO udf(x) AS (WITH RECURSIVE CTE as (SELECT 0 AS i UNION ALL SELECT i + 1 FROM CTE WHERE i <= x ) SELECT * FROM CTE);
SELECT udf(x) FROM generate_series(1,10) t(x);
# Error: Binder Error: Recursive CTEs not supported in correlated subquery

Overview of Changes

Core Ideas

  1. Changing Flattening Order: Previously, joins were both planned and flattened inside out (child nodes before parent nodes). To support flattening arbitrary LATERAL joins, we kept planning inside out but changed flattening to run outside in (as is already done for subqueries).
  2. Depth Tracking during Flattening: During flattening and planning, we keep track of the current depth information as we recursively visit the logical plan. LATERAL joins cause the depths of the left and right subtrees to differ by one due to the LateralBinder. Therefore whenever we recursively visit the right subtree of a LATERAL join, we increment the depth information. This allows easy identification of the source of correlation for each column binding. This tracking is used in detecting correlated expressions, rewriting correlated expressions, and during the pushdown of dependent joins.
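
The depth tracking described above can be sketched roughly as follows. This is a minimal illustration, not DuckDB's code: `PlanNode`, `MakeNode`, and `VisitWithLateralDepth` are hypothetical names, and the only behavior modeled is that the depth is incremented when descending into the right child of a LATERAL join.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified plan node (a stand-in, not DuckDB's LogicalOperator).
struct PlanNode {
    std::string name;
    bool is_lateral_join = false;
    std::unique_ptr<PlanNode> left;
    std::unique_ptr<PlanNode> right;
};

// Hypothetical helper to build small plan trees for the sketch.
std::unique_ptr<PlanNode> MakeNode(std::string name, bool is_lateral_join = false,
                                   std::unique_ptr<PlanNode> left = nullptr,
                                   std::unique_ptr<PlanNode> right = nullptr) {
    auto node = std::make_unique<PlanNode>();
    node->name = std::move(name);
    node->is_lateral_join = is_lateral_join;
    node->left = std::move(left);
    node->right = std::move(right);
    return node;
}

// Visit the plan in pre-order and record the lateral depth at which each node is seen.
// The depth only grows when we step into the right subtree of a LATERAL join.
void VisitWithLateralDepth(const PlanNode &node, int lateral_depth,
                           std::vector<std::pair<std::string, int>> &out) {
    out.emplace_back(node.name, lateral_depth);
    if (node.left) {
        VisitWithLateralDepth(*node.left, lateral_depth, out);
    }
    if (node.right) {
        int right_depth = node.is_lateral_join ? lateral_depth + 1 : lateral_depth;
        VisitWithLateralDepth(*node.right, right_depth, out);
    }
}
```

For a plan shaped like the nested example in this description (an outer LATERAL join whose right child is itself a LATERAL join), the innermost right-hand subquery is visited at depth 2, while the left children stay at the depth of their parent join.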

Changes to support flattening nested LATERALs:

  1. We modify the Binder::Bind for joins to ensure correct bindings for correlated columns as follows:

    • Correctly extract the correlated columns from the right side of the join.
    • Differentiate between LATERAL and subquery correlations by keeping track of correlation due to LATERAL joins.
    • Update the depths for the column bindings before passing them to the parent binder.
  2. We create a new LogicalOperator called LogicalDependentJoin to identify LATERAL joins. It keeps track of all information needed for flattening. This operator is used as follows:

    • This operator is created only when a LATERAL join needs to be planned but cannot be flattened due to an unflattened dependent join in the outer query.
    • This operator should be completely flattened by the end of the planning phase. Therefore, it should not exist during physical planning.
  3. We modify the RecursiveSubqueryPlanner to also flatten any LogicalDependentJoin that is encountered when re-iterating over the plan.

  4. We modify Binder::CreatePlan for joins to plan LATERAL joins correctly based on flags such as whether a dependent join exists in the outer query. In addition to planning and flattening the LATERAL join itself, we recursively plan and flatten its children to push down any remaining dependent joins. We handle the swap of left and right subtrees by maintaining a swapped_children flag in the LogicalOperator class so that depths are updated correctly during flattening.

  5. During flattening, any pushdown through a LogicalDependentJoin always pushes down on the left side, regardless of whether there is a correlation. All correlated bindings on the right side are then recursively updated to the new bindings from the left side.

  6. We enabled tests previously disabled in DuckDB (taken from PostgreSQL) to test the correctness of nested LATERAL joins.

  7. We also added tests that handle edge cases related to the correct execution of nested LATERAL joins.
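
The recursive binding update from item 5 can be pictured with the sketch below. It is an assumption-laden illustration rather than the PR's code: `ColumnBinding` here is a simplified stand-in, and `Node`, `BindingMap`, and `RewriteCorrelatedBindings` are hypothetical names. It only shows the idea of walking a subtree and replacing every correlated binding that has a new counterpart.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <utility>
#include <vector>

// Simplified stand-in for a column binding: (table index, column index).
struct ColumnBinding {
    int table_index;
    int column_index;
};

// Hypothetical plan node that stores the correlated bindings it references.
struct Node {
    std::vector<ColumnBinding> correlated_refs;
    std::vector<std::unique_ptr<Node>> children;
};

// Maps an old (table, column) pair to the binding that replaces it.
using BindingMap = std::map<std::pair<int, int>, ColumnBinding>;

// Recursively replace correlated bindings in the subtree rooted at `node`.
// Bindings without an entry in `replacements` are left untouched.
void RewriteCorrelatedBindings(Node &node, const BindingMap &replacements) {
    for (auto &ref : node.correlated_refs) {
        auto it = replacements.find({ref.table_index, ref.column_index});
        if (it != replacements.end()) {
            ref = it->second;
        }
    }
    for (auto &child : node.children) {
        RewriteCorrelatedBindings(*child, replacements);
    }
}
```

In this sketch, the map would be filled with the bindings produced by the pushdown on the left side, and the rewrite would then be applied to the right subtree of the dependent join.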

Acknowledgements

This PR was a joint effort between @arhamchopra (myself), @Mayank-Baranwal, and @SamArch27 as part of @apavlo's Advanced Database Systems course at Carnegie Mellon University. In addition, we want to thank @Mytherin for his patience and guidance throughout the project.

@Mytherin Mytherin changed the base branch from master to feature May 16, 2023 12:42
@Mytherin Mytherin changed the base branch from feature to master May 16, 2023 12:42
@Mytherin

Collaborator

Awesome! Great that you got this to work!

We are a bit busy this week as we are doing a release tomorrow, but I will review ASAP :)

@l1t1

This comment was marked as abuse.

@arhamchopra

Contributor Author

could you list a LATERAL joins example?

@l1t1, the example below is a LATERAL join

SELECT * FROM (SELECT 42) t(i), (SELECT 42+i) t2(j);
┌───────┬───────┐
│   i   │   j   │
│ int64 │ int64 │
├───────┼───────┤
│    42 │    84 │
└───────┴───────┘

The RHS ((SELECT 42+i) t2(j)) in the JOIN uses column i from the LHS ((SELECT 42) t(i)) of the join. So column i is correlated in the LATERAL join. This is a single-level LATERAL join.
The example in the previous message has correlations from multiple levels so that is a nested LATERAL join.
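
As a rough mental model (not how DuckDB executes it, since the point of this PR is to flatten such joins), a single-level LATERAL join behaves like a nested loop in which the right side is re-evaluated for every left row. The sketch below uses a hypothetical `LateralJoin` helper over plain integers.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Conceptual model of a LATERAL join over single-column integer inputs:
// for each left value i, evaluate the right side as a function of i and
// emit one (i, j) pair per row the right side produces.
std::vector<std::pair<int, int>> LateralJoin(
    const std::vector<int> &left,
    const std::function<std::vector<int>(int)> &right_for) {
    std::vector<std::pair<int, int>> result;
    for (int i : left) {
        for (int j : right_for(i)) {
            result.emplace_back(i, j);
        }
    }
    return result;
}
```

Calling `LateralJoin({42}, [](int i) { return std::vector<int>{42 + i}; })` mirrors the query above: the left side yields 42, and the right side computes 42 + i for that row.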

@l1t1

This comment was marked as abuse.

@arhamchopra

Contributor Author

@l1t1 The example you provided is that of a subquery.
Here, a is a table with a single column (a) containing the values [1,2,3].

In your query, the first column is just the values of a.a.
The second column comes from the correlated subquery, (SELECT a1.a FROM a AS a1). This subquery is supposed to return 3 values ([1,2,3]) when run as an independent query.
When used as a subquery, which by definition should return only 1 value, I believe that DuckDB picks the first value from the results and discards the rest of them. As per my understanding, this is why we see 1 being repeated in the results.

@l1t1

This comment was marked as abuse.

@Mytherin Mytherin left a comment

Collaborator


Thanks again for the PR! Great work. Some minor comments but otherwise I think this looks great. Good idea to explicitly model the dependent join as a node in the plan.

bool has_estimated_cardinality;
//! Flag to track if left and right child have been swapped in a join
//! Needed so that lateral_depth can be tracked and updated correctly during flattening and planning
bool swapped_children = false;

Collaborator


I don't quite follow why we need this flag - it seems that this flag is only used for lateral joins, yet it is always false for lateral joins. Could we remove it or am I missing something?

@Mayank-Baranwal Mayank-Baranwal May 19, 2023

Contributor


In the case of a right join, the optimizer swaps the left and right child and makes it a left join. Without the swapped_children flag, lateral_depth would be incremented for the call to the left child (as it has now become the right child), which triggers assertions on correlated column depth (since there is no LateralBinder on the left child). As such, we set swapped_children = true in plan_joinref.cpp, when a right join is detected and the optimizer is on.

Collaborator


We don't support right lateral joins, no? I have tried removing the flag and it seems like all tests still pass.

Contributor Author


Yes, this flag is not needed.
It was an artifact of our previous design where we used to track all the joins (not just dependent joins). In that case, we had to specially handle the right joins as the bindings would be off when the two sides were swapped.
In the new design, we only track lateral depth and don't need this flag anymore.

Thanks for pointing that out, we missed it during our code cleanup.
Will push a commit to clean this up and the rest of the comments you have added.


result->lateral = binder.HasCorrelatedColumns();
bool is_lateral = false;
auto all_correlated_columns = vector<CorrelatedColumnInfo>();

Collaborator


all_correlated_columns seems to be unused

Contributor


Thanks for pointing that out! We somehow missed it in the code cleanup.

}
// update the bindings in the correlated columns of the dependent join
if (op.type == LogicalOperatorType::LOGICAL_DEPENDENT_JOIN) {
auto &plan = (LogicalDependentJoin &)op;

Collaborator


Could we use the new Cast syntax here (op.Cast<LogicalDependentJoin>())?
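
For readers unfamiliar with this style, a checked `Cast<T>()` member can look roughly like the sketch below. This is not DuckDB's implementation: `ToyOperator`, `ToyDependentJoin`, `OpType`, and the `TYPE` constant are hypothetical, and the check is a plain `assert` on a type tag.

```cpp
#include <cassert>

enum class OpType { GET, DEPENDENT_JOIN };

// Hypothetical base class with a checked downcast helper.
struct ToyOperator {
    explicit ToyOperator(OpType type_p) : type(type_p) {}
    virtual ~ToyOperator() = default;

    OpType type;

    // Verify the type tag in debug builds, then downcast.
    // Compared to a C-style cast, a mismatch is caught instead of silently
    // reinterpreting the object.
    template <class TARGET>
    TARGET &Cast() {
        assert(type == TARGET::TYPE);
        return static_cast<TARGET &>(*this);
    }
};

// Hypothetical derived operator carrying a type tag for the check above.
struct ToyDependentJoin : ToyOperator {
    static constexpr OpType TYPE = OpType::DEPENDENT_JOIN;
    ToyDependentJoin() : ToyOperator(OpType::DEPENDENT_JOIN) {}
    int lateral_depth = 0;
};
```

With such a helper, the cast in the hunk above would read as `auto &plan = op.Cast<ToyDependentJoin>();` in terms of the toy types.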

Contributor


Sure, we'll make that change!

}

void LogicalDependentJoin::Serialize(FieldWriter &writer) const {
LogicalComparisonJoin::Serialize(writer);

Collaborator


This can likely throw an InternalException - we should never serialize a logical dependent join as it should always be eliminated before the plan comes out of the planner.
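
A minimal sketch of that suggestion, using hypothetical toy types rather than DuckDB's classes: the serialization hook of a planner-only node fails loudly instead of writing anything. `std::logic_error` stands in for DuckDB's InternalException here, and the `std::ostream` parameter stands in for the FieldWriter.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <stdexcept>

// Hypothetical planner-only node: it should be flattened away before the plan
// leaves the planner, so reaching Serialize indicates an internal bug.
struct ToyDependentJoin {
    void Serialize(std::ostream &) const {
        // Stand-in for throwing an internal error in the real code base.
        throw std::logic_error("dependent join should have been flattened before serialization");
    }
};
```

Throwing here turns a violated invariant into an immediate, visible error rather than a serialized plan that cannot be executed.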

@Mayank-Baranwal

Contributor

@Mytherin we've made all the changes you suggested. Please let us know if there's anything else you want us to look at!

@Mytherin Mytherin changed the base branch from master to feature May 22, 2023 08:44
@Mytherin Mytherin merged commit 7558597 into duckdb:feature May 22, 2023
@Mytherin

Collaborator

Awesome! Thanks for all the work, looks great!
