Skip to content

Issue #3207: ASOF JOIN Compilation #6719

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
Mar 22, 2023
Merged

Conversation

hawkfish
Copy link
Contributor

Part 1 of the ASOF JOIN implementation.

This PR adds ASOF to the grammar and converts it into an inequality join on a window:

SELECT p.key, p.ts, e.value 
FROM probe p ASOF JOIN event e
  ON p.key = e.key AND p.ts >= e.ts

-- Translates to

SELECT p.key, p.ts, e.value
FROM probe p LEFT JOIN (
  SELECT value, ts, LEAD(ts, 1, 'infinity') OVER (PARTITION BY key ORDER BY ts) AS ts_temp
  ) e
  ON p.key = e.key AND p.ts >= e.ts AND p.ts < e.ts_temp

where the left join is implemented using the IEJoin operator.

This is not very efficient (it does three sorts), and part 2 will add a new physical operator that uses only one partitioned sort and the outer partitioning from hash join. But it will hopefully make it easier to specify these joins in a future-proof manner.

Richard Wesley added 7 commits March 10, 2023 13:10
Checkpoint commit with the ASOF grammar additions.
Implement ASOF joins using IEJoin + Window.
Fix and test equality clauses in ASOF Joins.
Add missing EnumSerializer::StringToEnum case.
@hawkfish hawkfish requested review from Mytherin and Maxxen March 14, 2023 20:42
Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Looks good. Some comments:

Richard Wesley added 8 commits March 15, 2023 14:12
Move the join projection logic into PhysicalProjectionOperator.
* Decouple the JoinType from the JoinRefType.
* Convert ASOF to be a JoinRefType.
* Attach a JoinRefType to LogicalConditionalJoin to indicate how the conditions should be interpreted
* Stop the optimiser from thinking it knows how to deal with ASOF joins.
Regenerate plans file.
Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes! Looks great. Some additional comments:

LEAD(begin, 1, 'infinity'::DOUBLE) OVER (ORDER BY begin ASC) AS end
FROM events0
) e
ON p.ts >= e.begin AND p.ts < e.end
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add some extra tests:

  • Join on a timestamp range
  • Join on a string range (it is fine if that throws a non-internal exception)
  • Join on integer ranges (it is fine if that throws a non-internal exception)
  • Multiple ranges in one join condition
  • What if we invert the comparisons (e.g. e.begin <= p.ts AND e.end > p.ts)
  • What if we have a different closed range (e.g. p.ts > e.begin AND p.ts <= e.end)
  • What if we have a range that is not actually a range (e.g. p.ts >= e.begin and p.other_col < e.end)
  • ASOF join on a constant (e.g. ON 1=1)
  • Could we add some tests where we have NULL values in the tables?
  • Could we add some tests where we have infinity values (e.g. NaN, or infinity::timestamp) in the tables?
  • What happens if we use an ASOF join inside of a correlated subquery? (it is fine if that throws an exception)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiple ranges was covered by the last error test, and you can't have invalid ranges because the bounds are implied (i.e., end is synthesised). But I'll have a go at the rest.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got all of these except the last one. How would you write an ASOF join inside a correlated subquery? Aren't the joins in correlated subqueries generated by the planner?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I meant an ASOF join inside a subquery that has correlated columns in the left and/or right side, see e.g. here

@Mytherin Mytherin self-requested a review March 20, 2023 12:42
Richard Wesley added 4 commits March 21, 2023 10:06
Switch to using a separate LogicalOperator to model AsOf joins.
Beef up the test suite with multiple data types and corner cases.
Fix arbitrary expression planning issue.
Add test to show it works.
@Mytherin Mytherin merged commit d43e34e into duckdb:master Mar 22, 2023
@Mytherin
Copy link
Collaborator

Thanks for the fixes! Looks great.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants