-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Description
Introduction
This proposal suggests adding a new scalar function, date_bin, to DataFusion, for transforming timestamp values to arbitrary intervals for the purpose of grouping and aggregating time-series data.
Motivation
Time-series data is typically analysed in aggregate where one axis is almost always time. DataFusion's date_trunc is modelled after the PostgreSQL date_trunc function, which allows truncating a timestamp column for the purpose of grouping, however, the intervals are limited to an enumeration, such as second, minute, hour, day, week, month, quarter and year. To address this limitation, PostgreSQL 14 introduced the date_bin function, which can bin or adjust the input timestamp to arbitrary intervals.
Describe the solution you'd like
Add a new function, date_bin to DataFusion with the same semantics as the PostgreSQL function.
Name: date_bin(stride, source, origin)
Per the PostgreSQL 14 docs
The function date_bin “bins” the input timestamp into the specified interval (the stride) aligned with a specified origin.
Required arguments
| Name | Type | Description |
|---|---|---|
stride |
INTERVAL | time interval specifying the width of each bin. |
source |
TIMESTAMP | The timestamp to transform. |
origin |
TIMESTAMP | Acts as an offset applied to the transformed timestamp |
Example Usage
Demonstrate date_bin1:
SELECT
DATE_BIN(INTERVAL '15' minute, CAST(time as TIMESTAMP), TIMESTAMP '2001-01-01') AS time,
val
FROM (
VALUES
('2021-06-10 17:05:00Z', 0.5),
('2021-06-10 17:19:10Z', 0.3)
) as t (time, val);producing the following output:
+---------------------+-----+
| time | val |
+---------------------+-----+
| 2021-06-10 17:00:00 | 0.5 |
| 2021-06-10 17:15:00 | 0.3 |
+---------------------+-----+
2 rows in set. Query took 0.002 seconds.
Example Usage: time offset for origin
SELECT
DATE_BIN(INTERVAL '15' minute, CAST(time as TIMESTAMP), TIMESTAMP '2001-01-01 00:02:30') AS time,
val
FROM (
VALUES
('2021-06-10 17:05:00Z', 0.5),
('2021-06-10 17:19:10Z', 0.3)
) as t (time, val);producing the following output:
+---------------------+-----+
| time | val |
+---------------------+-----+
| 2021-06-10 17:02:30 | 0.5 |
| 2021-06-10 17:17:30 | 0.3 |
+---------------------+-----+
2 rows in set. Query took 0.002 seconds.
Describe alternatives you've considered
date_trunc, as mentioned, provides limited support for binning timestamps, but there is no alternative but to provide a native function.
Footnotes
-
DataFusion does not support "typed string" literals in a
VALUESstatement, likeVALUES ((TIMESTAMP '2021-06-10 17:05:00Z')), but feat: Enable typed strings expressions for VALUES clause #3018 will address that ↩