ASOF join #3207
-
Hi @fabianoliver - thank you for your interest in and kind words for DuckDB! ASOF joins have come up a few times, and we certainly have an interest in time series operations. Some of these can already be done without too much trouble; e.g., your example could be expressed as:

```sql
SELECT prices.symbol, arg_max(prices.price, prices."when")
FROM trades, prices
WHERE trades.symbol = prices.symbol
GROUP BY 1
```

I'm looking at other temporal join operations, and right now I'm implementing a fast range intersection join, which is useful for matching events or states to state tables. I may have misunderstood your example, but if the trades have associated times, then this could be used once the price table is converted from an event table to a state table:

```sql
WITH states AS (
    SELECT symbol,
           price,
           "when" AS start,
           LEAD("when", 1, NOW()) OVER (PARTITION BY symbol ORDER BY "when") AS stop
    FROM prices
)
SELECT trades.symbol, trades."when", price
FROM trades, states
WHERE trades.symbol = states.symbol
  AND states.start <= trades."when"
  AND trades."when" < states.stop
```
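For readers who want to run the two queries above, here is a minimal, assumed toy schema with a couple of rows; the names follow the example, and none of this is from the original comment:

```sql
-- Assumed schema for the sketches above. "when" is quoted because
-- WHEN is a reserved word.
CREATE TABLE prices ("when" TIMESTAMP, symbol VARCHAR, price DOUBLE);
CREATE TABLE trades ("when" TIMESTAMP, symbol VARCHAR, volume INTEGER);
INSERT INTO prices VALUES
    ('2021-01-01 09:00:00', 'ACME', 100.0),
    ('2021-01-01 09:05:00', 'ACME', 101.5);
INSERT INTO trades VALUES
    ('2021-01-01 09:02:00', 'ACME', 10),
    ('2021-01-01 09:07:00', 'ACME', 20);
```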
-
I think the aj, aj0, afj, afj0 joins can be implemented by using a …
-
Hi @hawkfish, thanks for your thoughts! Yes, this type of join can indeed be done with fairly standard SQL (including in DuckDB). I wasn't aware of arg_max; that is a rather cool feature. I've pasted two working approaches below for reference - one using arg_max, the other using an OVER clause. I'm not quite sure whether one would be preferable to the other in terms of performance?

Having said that, I think there might still be a lot of benefit in promoting this type of join to a first-class citizen of the query syntax - if it fits within DuckDB's overall mission and syntax constraints, of course. The queries below feel a little verbose. That makes analytics a bit harder to read and understand, and harder to write as well. For example, I haven't used kdb in about two years, but I instantly recalled the signature of aj, because it's so beautifully simple. For the query below, I'd probably need to think about it again, or look it up, if I were to use it again in a week's time (and then try to assess the efficiency of the different approaches). Obviously, I can't judge whether that justifies the complexity of extending the syntax.

If DuckDB ever wanted to consider this, a few other systems have implementations that could be a good reference: for example, kdb's aj, pandas' asof, or QuestDB's ASOF join. In terms of syntax, maybe something like this could work, albeit I'm sure there'd be many (better?) ways to do it,
or maybe something even more generic, allowing the matches of the ON part to be filtered. Hypothetical sketches of both follow:
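Both forms below are purely illustrative (the ASOF JOIN keyword and the venue column are invented for this sketch; DuckDB later adopted very similar syntax):

```sql
-- Hypothetical simple form: match on symbol, joining the latest price
-- at or before each trade's "when" (the last USING column being the
-- as-of ordering column).
SELECT trades.symbol, trades."when", prices.price
FROM trades ASOF JOIN prices USING (symbol, "when");

-- Hypothetical generic form: the ON clause carries the match key,
-- the as-of inequality, and arbitrary extra filters.
SELECT t.symbol, t."when", p.price
FROM trades t
ASOF JOIN prices p
  ON t.symbol = p.symbol
 AND t."when" >= p."when"
 AND p.venue = 'XNYS';
```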
Current approaches:
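An illustrative reconstruction of the two approaches described, using the trades/prices schema from the first reply (a sketch, not the author's original code):

```sql
-- 1) arg_max: per trade, take the price carrying the greatest prior timestamp.
SELECT t.symbol, t."when",
       arg_max(p.price, p."when") AS price
FROM trades t
JOIN prices p
  ON p.symbol = t.symbol AND p."when" <= t."when"
GROUP BY t.symbol, t."when";

-- 2) OVER clause: rank candidate prices per trade, keep the most recent one.
SELECT symbol, "when", price
FROM (
    SELECT t.symbol, t."when", p.price,
           row_number() OVER (PARTITION BY t.symbol, t."when"
                              ORDER BY p."when" DESC) AS rn
    FROM trades t
    JOIN prices p
      ON p.symbol = t.symbol AND p."when" <= t."when"
) ranked
WHERE rn = 1;
```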
-
Interesting, thanks for those details @hawkfish! Not a problem if this concept doesn't directly fit the (SQL) syntax - I've added a small Python function which constructs these types of queries, and that should work well for the time being, as you say. Sort-based optimisations would definitely be interesting. (I might add a small comment to #2548 with a few ideas that would be great to consider in that context.)
-
TimescaleDB users have also requested "as of" joins.
-
Some more thoughts after a conversation with @Mytherin this AM (well, PM for him!). ASOF joins are basically a join between an event table and a state table derived from it:

```sql
CREATE VIEW states AS (
    SELECT key, value, time AS "begin",
           lead(time, 1, 'infinity'::TIMESTAMP) OVER (PARTITION BY key ORDER BY time) AS "end"
    FROM events
);
```

Then you can do a conditional join with three conditions:

```sql
SELECT p.key, value, time
FROM probes p, states s
WHERE p.key = s.key
  AND s."begin" <= time
  AND time < s."end"
```

Unfortunately, we would assume that the equality is more selective and apply the inequality conditions as a secondary filter. If you imagine the canonical case of stock valuations, where the events are 1.5B stock prices for the S&P 500 over 20 years at 1-minute intervals, then the selectivity of the equality is only about 1:500, but the selectivity of the inequality pair is about 1:3,000,000! So figuring out how to make the optimiser make the correct choice is very important here.

Even if we could do this, it would still be very inefficient, because the inequality join algorithm would materialise and sort both sides. By contrast, a true ASOF operator would do the window's partitioning and sorting on the build (events) side and then just use binary search for the probe. There might be further optimisations here for chunking/sorting the probe blocks to avoid paging at scale (like we do for hash joins). In fact, it's really just a variant of hash join where the probe step is a binary search instead of an equality test.
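For reference, this operator was eventually implemented and merged (see the end of this thread), and the three-condition join above collapses to a single ASOF join that works directly against the event table, with no state view needed. A sketch using the schema from this comment:

```sql
-- Equality on the key plus one >= inequality; for each probe row the join
-- picks the event row with the greatest time not exceeding the probe time.
SELECT p.key, e.value, p.time
FROM probes p
ASOF JOIN events e
  ON p.key = e.key AND p.time >= e.time;
```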
-
You may like this as-of join write-up (comparing Pandas, Polars, R data.table, xts, R zoo, and DuckDB) on a simple as-of join example (I can't include kdb results, but it is incredibly fast on this problem, as you would expect): https://bwlewis.github.io/duckdb_and_r/asof/asof.html That write-up includes a rather horrible-looking but correct SQL approach - that is, one without the modern CTE ideas @ttomasz suggested above (which are much more elegant). My SQL approach is slow-ish, but at least it works. What surprised me is that I've been unable to get an approach that uses arg_max to perform better on large problems. My notes on as-of joins above use a much older DuckDB version; I am in the process of updating them and will add an arg_max-style DuckDB-specific query to the mix.
-
We had a Master's student, Axel Pettersson, working on this particular problem for his thesis. His solution was an Early Stop Sort-Merge Join, inspired by how pandas and polars do as-of joins, both of which also do quite well in the post mentioned by @bwlewis.
It was designed for Spark (hence the partitioning), and I'm not sure how the partitioning aspect would translate to DuckDB. Maybe it would just be a sort on the equality-condition column rather than partitioning. There are more details in the thesis: https://payberah.github.io/files/download/students/axel_pettersson_master_thesis.pdf (don't be confused - we call it a point-in-time join, but it's the same thing as an ASOF join). And there is also an implementation here: https://github.com/Ackuq/spark-pit.
-
This has been implemented and merged, so is it time to close?
-
I agree!
-
I was comparing a development version of 0.8.0 against the other packages in the asof posting previously mentioned. I want to confirm two behaviours that I observe through the R API using the development version. First, I noticed that fractional seconds seem not to be honoured in duckdb, so the behaviour is slightly different from the other packages: a row with fractional seconds can be associated even if it actually occurred after the probe row. Second, it doesn't seem to work with Arrow tables.

```r
set.seed(1)
end <- as.POSIXct("2020-06-20")
start <- as.POSIXct("2020-01-01")
# Every minute
calendar <- data.frame(date = seq(from = start, to = end, by = "+1 min"))
N <- 5e6
data <- data.frame(date = end - runif(N) * as.integer(difftime(end, start, units = "secs")),
                   value = runif(N))
data <- data[order(data[["date"]]), ]

# fractional seconds on first row
> format(calendar$date[[1]], format = '%Y-%m-%d %H:%M:%OS5')
[1] "2020-01-01 00:00:00.00000"
> format(data$date[[1]], format = '%Y-%m-%d %H:%M:%OS5')
[1] "2020-01-01 00:00:00.56057"

# data.table
data.dt <- data.table::data.table(data, key = 'date')
calendar.dt <- data.table::data.table(calendar, key = 'date')
data.dt[calendar.dt, on = "date", roll = TRUE]
>                   date      value
> 1: 2020-01-01 00:00:00         NA
> 2: 2020-01-01 00:01:00 0.41639909
> 3: 2020-01-01 00:02:00 0.23235543
> 4: 2020-01-01 00:03:00 0.67948438
> 5: 2020-01-01 00:04:00 0.43368901

# duckdb
conn <- DBI::dbConnect(duckdb::duckdb())
# virtual tables
duckdb::duckdb_register(conn, "data_v", data)
duckdb::duckdb_register(conn, "calendar_v", calendar)
DBI::dbGetQuery(conn, "SELECT * FROM calendar_v ASOF JOIN data_v USING(date)") |> data.table::data.table()
>                   date      value
> 1: 2020-01-01 05:00:00 0.03036744
> 2: 2020-01-01 05:01:00 0.41639909
> 3: 2020-01-01 05:02:00 0.59852317
> 4: 2020-01-01 05:03:00 0.67948438
> 5: 2020-01-01 05:04:00 0.43368901

# arrow
data.path <- tempfile()
arrow::write_dataset(data, path = data.path, format = 'feather')
duckdb::duckdb_register_arrow(conn, "data_a", data.path)
calendar.path <- tempfile()
arrow::write_dataset(calendar, path = calendar.path, format = 'feather')
duckdb::duckdb_register_arrow(conn, "calendar_a", calendar.path)
DBI::dbGetQuery(conn, "SELECT * FROM calendar_a ASOF JOIN data_a USING(date)") |> data.table::data.table()
> Error in arrow_scannable$schema :
>   $ operator is invalid for atomic vectors
> Error: rapi_prepare: Failed to prepare query SELECT * FROM calendar_a ASOF JOIN data_a USING(date)
> Error: Invalid Error: std::exception
```
-
This is probably an R/arrow problem, because our timestamps support µs precision. There may be a type binding issue, because we have several timestamp precisions (s, ms, µs, ns)? You could check this by doing a simple inequality select on the Arrow data set that is sensitive to sub-second precision. In any case, this is worth investigating, so please file an issue.
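A minimal sketch of such a check, assuming the `data_a` Arrow registration from the previous comment succeeds (the table and column names come from that example):

```sql
-- If sub-second precision survives the Arrow binding, this returns rows
-- whose timestamps differ from their second-truncated values.
SELECT date
FROM data_a
WHERE date <> date_trunc('second', date)
LIMIT 5;
```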
-
I'm curious: would DuckDB ever consider supporting some dedicated syntax for such joins?
For example, kdb has its famous and extremely useful asof join (aj, aj0, afj, afj0). This is absolutely ubiquitous in finance in particular (and probably in more fields that deal a lot with time series).
A simple demonstrative use case: assuming you have a table of trades and a table of prices from some exchange, you may want to join the latest exchange price for a given security onto each of your trades.
Traditionally, this is quite painful to do in classic SQL. Lateral joins would be one way to go - and these seem to be on the roadmap already - but in my opinion their syntax is still not particularly nice for this use case (as they of course cover far more generic use cases as well); see the sketch below.
(For general context, I've just stumbled upon DuckDB for a much simpler use case, but was really impressed by the overall feature set. I've been pondering whether to look into using it for some more serious analytics. This kind of operation is so omnipresent in almost all analytics in this field, however, that not having a very clear, concise method for it is unfortunately quite prohibitive.)
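A minimal sketch of the lateral-join formulation mentioned above, assuming an invented trades/prices schema (DuckDB has since added LATERAL support; this is illustrative, not the author's code):

```sql
-- As-of via LATERAL: for each trade, pick the most recent price at or
-- before the trade's timestamp. Verbose compared to kdb's aj, which is
-- exactly the point being made above.
SELECT t.symbol, t."when", p.price
FROM trades t,
     LATERAL (
         SELECT price
         FROM prices
         WHERE symbol = t.symbol
           AND "when" <= t."when"
         ORDER BY "when" DESC
         LIMIT 1
     ) AS p;
```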