Skip to content

Proposal: Add date_bin function #3015

@stuartcarnie

Description

@stuartcarnie

Introduction

This proposal suggests adding a new scalar function, date_bin, to DataFusion, for transforming timestamp values to arbitrary intervals for the purpose of grouping and aggregating time-series data.

Motivation

Time-series data is typically analysed in aggregate where one axis is almost always time. DataFusion's date_trunc is modelled after the PostgreSQL date_trunc function, which allows truncating a timestamp column for the purpose of grouping, however, the intervals are limited to an enumeration, such as second, minute, hour, day, week, month, quarter and year. To address this limitation, PostgreSQL 14 introduced the date_bin function, which can bin or adjust the input timestamp to arbitrary intervals.

Describe the solution you'd like

Add a new function, date_bin to DataFusion with the same semantics as the PostgreSQL function.

Name: date_bin(stride, source, origin)

Per the PostgreSQL 14 docs

The function date_bin “bins” the input timestamp into the specified interval (the stride) aligned with a specified origin.

Required arguments

Name Type Description
stride INTERVAL time interval specifying the width of each bin.
source TIMESTAMP The timestamp to transform.
origin TIMESTAMP Acts as an offset applied to the transformed timestamp

Example Usage

Demonstrate date_bin1:

SELECT 
  DATE_BIN(INTERVAL '15' minute, CAST(time as TIMESTAMP), TIMESTAMP '2001-01-01') AS time, 
  val 
FROM (
  VALUES 
    ('2021-06-10 17:05:00Z', 0.5),
    ('2021-06-10 17:19:10Z', 0.3)
  ) as t (time, val);

producing the following output:

+---------------------+-----+
| time                | val |
+---------------------+-----+
| 2021-06-10 17:00:00 | 0.5 |
| 2021-06-10 17:15:00 | 0.3 |
+---------------------+-----+
2 rows in set. Query took 0.002 seconds.

Example Usage: time offset for origin

SELECT 
  DATE_BIN(INTERVAL '15' minute, CAST(time as TIMESTAMP), TIMESTAMP '2001-01-01 00:02:30') AS time, 
  val 
FROM (
  VALUES 
    ('2021-06-10 17:05:00Z', 0.5),
    ('2021-06-10 17:19:10Z', 0.3)
  ) as t (time, val);

producing the following output:

+---------------------+-----+
| time                | val |
+---------------------+-----+
| 2021-06-10 17:02:30 | 0.5 |
| 2021-06-10 17:17:30 | 0.3 |
+---------------------+-----+
2 rows in set. Query took 0.002 seconds.

Describe alternatives you've considered

date_trunc, as mentioned, provides limited support for binning timestamps, but there is no alternative but to provide a native function.

Footnotes

  1. DataFusion does not support "typed string" literals in a VALUES statement, like VALUES ((TIMESTAMP '2021-06-10 17:05:00Z')), but feat: Enable typed strings expressions for VALUES clause #3018 will address that

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions