ASOF join #3207
-
Hi @fabianoliver - thank you for your interest in and kind words for DuckDB! ASOF joins have come up a few times, and we certainly have an interest in time series operations. Some of these can already be done without too much trouble; e.g., your example could be expressed as:

```sql
SELECT prices.symbol, arg_max(prices.price, prices."when")
FROM trades, prices
WHERE trades.symbol = prices.symbol
GROUP BY 1
```

I'm looking at other temporal join operations, and right now I'm implementing a fast range intersection join, which is useful for matching events or states to state tables. I may have misunderstood your example, but if the trades have associated times, then this could be used once the price table is converted from an event table to a state table:

```sql
WITH states AS (
    SELECT symbol,
           price,
           "when" AS start,
           LEAD("when", 1, NOW()) OVER (PARTITION BY symbol ORDER BY "when") AS stop
    FROM prices
)
SELECT trades.symbol, trades."when", price
FROM trades, states
WHERE trades.symbol = states.symbol
  AND states.start <= trades."when"
  AND trades."when" < states.stop
```
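For readers who want to run the two queries above, here is a minimal, assumed toy schema with a couple of rows; the names follow the example, and none of this is from the original comment:

```sql
-- Assumed schema for the sketches above. "when" is quoted because
-- WHEN is a reserved word.
CREATE TABLE prices ("when" TIMESTAMP, symbol VARCHAR, price DOUBLE);
CREATE TABLE trades ("when" TIMESTAMP, symbol VARCHAR, volume INTEGER);
INSERT INTO prices VALUES
    ('2021-01-01 09:00:00', 'ACME', 100.0),
    ('2021-01-01 09:05:00', 'ACME', 101.5);
INSERT INTO trades VALUES
    ('2021-01-01 09:02:00', 'ACME', 10),
    ('2021-01-01 09:07:00', 'ACME', 20);
```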
-
I think the aj, aj0, afj, afj0 joins can be implemented by using a …
-
Hi @hawkfish, thanks for your thoughts! Yes, this type of join can indeed be done with fairly standard SQL (including in DuckDB). I wasn't aware of arg_max; that is a rather cool feature. I've pasted two working approaches below for reference - one using arg_max, the other using an OVER clause. I'm not quite sure whether one would be preferable to the other in terms of performance?

Having said that, I think there might still be a lot of benefit in promoting this type of join to a first-class citizen of the query syntax - if it fits within DuckDB's overall mission and syntax constraints, of course. The queries below feel a little verbose. That makes analytics a bit harder to read and understand, and harder to write as well. For example, I haven't used kdb in about two years, but I instantly recalled the signature of aj, because it's so beautifully simple. For the query below, I'd probably need to think about it again, or look it up, if I were to use it again in a week's time (and then try to assess the efficiency of the different approaches). Obviously, I can't judge whether that justifies the complexity of extending the syntax.

If DuckDB ever wanted to consider this, a few other systems have implementations that could be a good reference: for example, kdb's aj, pandas' asof, or QuestDB's ASOF join. In terms of syntax, maybe something like this could work, albeit I'm sure there'd be many (better?) ways to do it,
or maybe something even more generic, allowing the matches of the ON part to be filtered. Hypothetical sketches of both follow:
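Both forms below are purely illustrative (the ASOF JOIN keyword and the venue column are invented for this sketch; DuckDB later adopted very similar syntax):

```sql
-- Hypothetical simple form: match on symbol, joining the latest price
-- at or before each trade's "when" (the last USING column being the
-- as-of ordering column).
SELECT trades.symbol, trades."when", prices.price
FROM trades ASOF JOIN prices USING (symbol, "when");

-- Hypothetical generic form: the ON clause carries the match key,
-- the as-of inequality, and arbitrary extra filters.
SELECT t.symbol, t."when", p.price
FROM trades t
ASOF JOIN prices p
  ON t.symbol = p.symbol
 AND t."when" >= p."when"
 AND p.venue = 'XNYS';
```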
Current approaches:
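An illustrative reconstruction of the two approaches described, using the trades/prices schema from the first reply (a sketch, not the author's original code):

```sql
-- 1) arg_max: per trade, take the price carrying the greatest prior timestamp.
SELECT t.symbol, t."when",
       arg_max(p.price, p."when") AS price
FROM trades t
JOIN prices p
  ON p.symbol = t.symbol AND p."when" <= t."when"
GROUP BY t.symbol, t."when";

-- 2) OVER clause: rank candidate prices per trade, keep the most recent one.
SELECT symbol, "when", price
FROM (
    SELECT t.symbol, t."when", p.price,
           row_number() OVER (PARTITION BY t.symbol, t."when"
                              ORDER BY p."when" DESC) AS rn
    FROM trades t
    JOIN prices p
      ON p.symbol = t.symbol AND p."when" <= t."when"
) ranked
WHERE rn = 1;
```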
-
Interesting, thanks for those details @hawkfish! Not a problem if this concept doesn't directly fit the (SQL) syntax - I've added a small Python function which constructs these types of queries, and that should work well for the time being, as you say. Sort-based optimisations would definitely be interesting. (I might add a small comment to #2548 with a few ideas that would be great to consider in that context.)
-
TimescaleDB users have also requested "as of" joins.
-
Some more thoughts after a conversation with @Mytherin this AM (well, PM for him!). ASOF joins are basically a join between an event table and a state table derived from it:

```sql
CREATE VIEW states AS (
    SELECT key, value, time AS "begin",
           lead(time, 1, 'infinity'::TIMESTAMP) OVER (PARTITION BY key ORDER BY time) AS "end"
    FROM events
);
```

Then you can do a conditional join with three conditions:

```sql
SELECT p.key, value, time
FROM probes p, states s
WHERE p.key = s.key
  AND s."begin" <= time
  AND time < s."end"
```

Unfortunately, we would assume that the equality is more selective and apply the inequality conditions as a secondary filter. If you imagine the canonical case of stock valuations, where the events are 1.5B stock prices for the S&P 500 over 20 years at 1-minute intervals, then the selectivity of the equality is only about 1:500, but the selectivity of the inequality pair is about 1:3,000,000! So figuring out how to make the optimiser make the correct choice is very important here.

Even if we could do this, it would still be very inefficient, because the inequality join algorithm would materialise and sort both sides. By contrast, a true ASOF operator would do the window's partitioning and sorting on the build (events) side and then just use binary search for the probe. There might be further optimisations here for chunking/sorting the probe blocks to avoid paging at scale (like we do for hash joins). In fact, it's really just a variant of hash join where the probe step is a binary search instead of an equality test.
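For reference, this operator was eventually implemented and merged (see the end of this thread), and the three-condition join above collapses to a single ASOF join that works directly against the event table, with no state view needed. A sketch using the schema from this comment:

```sql
-- Equality on the key plus one >= inequality; for each probe row the join
-- picks the event row with the greatest time not exceeding the probe time.
SELECT p.key, e.value, p.time
FROM probes p
ASOF JOIN events e
  ON p.key = e.key AND p.time >= e.time;
```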
-
You may like this as-of join write-up (comparing Pandas, Polars, R data.table, xts, R zoo, and DuckDB) on a simple as-of join example (I can't include kdb results, but it is incredibly fast on this problem, as you would expect): https://bwlewis.github.io/duckdb_and_r/asof/asof.html That write-up includes a rather horrible-looking but correct SQL approach - that is, one without the modern CTE ideas @ttomasz suggested above (which are much more elegant). My SQL approach is slow-ish, but at least it works. What surprised me is that I've been unable to get an approach that uses arg_max to perform better on large problems. My notes on as-of joins above use a much older DuckDB version; I am in the process of updating them and will add an arg_max-style DuckDB-specific query to the mix.
-
We had a Master's student, Axel Pettersson, working on this particular problem for his thesis. His solution was an Early Stop Sort-Merge Join, inspired by how pandas and polars do as-of joins, both of which also do quite well in the post mentioned by @bwlewis.
It was designed for Spark (hence the partitioning), and I'm not sure how the partitioning aspect would translate to DuckDB. Maybe it would just be a sort on the equality-condition column rather than partitioning. There are more details in the thesis: https://payberah.github.io/files/download/students/axel_pettersson_master_thesis.pdf (don't be confused - we call it a point-in-time join, but it's the same thing as an ASOF join). And there is also an implementation here: https://github.com/Ackuq/spark-pit.
-
This has been implemented and merged, so is it time to close?
-
I agree!
-
I was comparing a development version of 0.8.0 against the other packages in the asof posting previously mentioned. I want to confirm two behaviours that I observe through the R API using the development version. First, I noticed that fractional seconds seem not to be honoured in duckdb, so the behaviour is slightly different from the other packages: a row with fractional seconds can be associated even if it actually occurred after the probe row. Second, it doesn't seem to work with Arrow tables.

```r
set.seed(1)
end <- as.POSIXct("2020-06-20")
start <- as.POSIXct("2020-01-01")
# Every minute
calendar <- data.frame(date = seq(from = start, to = end, by = "+1 min"))
N <- 5e6
data <- data.frame(date = end - runif(N) * as.integer(difftime(end, start, units = "secs")),
                   value = runif(N))
data <- data[order(data[["date"]]), ]

# fractional seconds on first row
> format(calendar$date[[1]], format = '%Y-%m-%d %H:%M:%OS5')
[1] "2020-01-01 00:00:00.00000"
> format(data$date[[1]], format = '%Y-%m-%d %H:%M:%OS5')
[1] "2020-01-01 00:00:00.56057"

# data.table
data.dt <- data.table::data.table(data, key = 'date')
calendar.dt <- data.table::data.table(calendar, key = 'date')
data.dt[calendar.dt, on = "date", roll = TRUE]
>                   date      value
> 1: 2020-01-01 00:00:00         NA
> 2: 2020-01-01 00:01:00 0.41639909
> 3: 2020-01-01 00:02:00 0.23235543
> 4: 2020-01-01 00:03:00 0.67948438
> 5: 2020-01-01 00:04:00 0.43368901

# duckdb
conn <- DBI::dbConnect(duckdb::duckdb())
# virtual tables
duckdb::duckdb_register(conn, "data_v", data)
duckdb::duckdb_register(conn, "calendar_v", calendar)
DBI::dbGetQuery(conn, "SELECT * FROM calendar_v ASOF JOIN data_v USING(date)") |> data.table::data.table()
>                   date      value
> 1: 2020-01-01 05:00:00 0.03036744
> 2: 2020-01-01 05:01:00 0.41639909
> 3: 2020-01-01 05:02:00 0.59852317
> 4: 2020-01-01 05:03:00 0.67948438
> 5: 2020-01-01 05:04:00 0.43368901

# arrow
data.path <- tempfile()
arrow::write_dataset(data, path = data.path, format = 'feather')
duckdb::duckdb_register_arrow(conn, "data_a", data.path)
calendar.path <- tempfile()
arrow::write_dataset(calendar, path = calendar.path, format = 'feather')
duckdb::duckdb_register_arrow(conn, "calendar_a", calendar.path)
DBI::dbGetQuery(conn, "SELECT * FROM calendar_a ASOF JOIN data_a USING(date)") |> data.table::data.table()
> Error in arrow_scannable$schema :
>   $ operator is invalid for atomic vectors
> Error: rapi_prepare: Failed to prepare query SELECT * FROM calendar_a ASOF JOIN data_a USING(date)
> Error: Invalid Error: std::exception
```
-
This is probably an R/arrow problem, because our timestamps support µs precision. There may be a type binding issue, because we have several timestamp precisions (s, ms, µs, ns)? You could check this by doing a simple inequality select on the Arrow data set that is sensitive to sub-second precision. In any case, this is worth investigating, so please file an issue.
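A minimal sketch of such a check, assuming the `data_a` Arrow registration from the previous comment succeeds (the table and column names come from that example):

```sql
-- If sub-second precision survives the Arrow binding, this returns rows
-- whose timestamps differ from their second-truncated values.
SELECT date
FROM data_a
WHERE date <> date_trunc('second', date)
LIMIT 5;
```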
-
I'm curious: would DuckDB ever consider supporting some dedicated syntax for such joins?
For example, kdb has its famous and extremely useful asof join (aj, aj0, afj, afj0). This is absolutely ubiquitous in finance in particular (and probably in more fields that deal a lot with time series).
A simple demonstrative use case: assuming you have a table of trades and a table of prices from some exchange, you may want to join the latest exchange price for a given security onto each of your trades.
Traditionally, this is quite painful to do in classic SQL. Lateral joins would be one way to go - and these seem to be on the roadmap already - but in my opinion their syntax is still not particularly nice for this use case (as they of course cover far more generic use cases as well); see the sketch below.
(For general context, I've just stumbled upon DuckDB for a much simpler use case, but was really impressed by the overall feature set. I've been pondering whether to look into using it for some more serious analytics. This kind of operation is so omnipresent in almost all analytics in this field, however, that not having a very clear, concise method for it is unfortunately quite prohibitive.)
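A minimal sketch of the lateral-join formulation mentioned above, assuming an invented trades/prices schema (DuckDB has since added LATERAL support; this is illustrative, not the author's code):

```sql
-- As-of via LATERAL: for each trade, pick the most recent price at or
-- before the trade's timestamp. Verbose compared to kdb's aj, which is
-- exactly the point being made above.
SELECT t.symbol, t."when", p.price
FROM trades t,
     LATERAL (
         SELECT price
         FROM prices
         WHERE symbol = t.symbol
           AND "when" <= t."when"
         ORDER BY "when" DESC
         LIMIT 1
     ) AS p;
```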