Skip to content

Conversation

@vinodkc
Copy link
Contributor

@vinodkc vinodkc commented Dec 7, 2025

What changes were proposed in this pull request?

This PR introduces a new timestamp_bucket function for grouping temporal data into fixed-width intervals with configurable alignment.

SQL Signature
timestamp_bucket(bucket_width, timestamp[, origin]) -> TIMESTAMP | TIMESTAMP_NTZ

Parameters:
bucket_width: INTERVAL DAY TO SECOND - The bucket width
timestamp: DATE | TIMESTAMP | TIMESTAMP_NTZ - The timestamp to bucket
origin: TIMESTAMP (optional) - The origin for bucket alignment (default: TIMESTAMP'1970-01-01 00:00:00')

The return type depends on the input type:

  • DATE input → returns TIMESTAMP (implicitly converted)
  • TIMESTAMP input → returns TIMESTAMP
  • TIMESTAMP_NTZ input → returns TIMESTAMP_NTZ

Key Features:

  • Accepts DATE, TIMESTAMP, and TIMESTAMP_NTZ as input
  • Custom origin parameter for flexible bucket alignment (e.g., Monday-aligned weeks)

Why are the changes needed?

Temporal bucketing is a common requirement in time-series analysis and data aggregation.
Currently, users must:

  • Write complex custom logic with floor division and date arithmetic
  • Use workarounds like date_trunc which only supports fixed calendar units
  • Lack flexibility for custom bucket alignments (e.g., fiscal weeks, shift schedules)

Use Cases:

  • Arbitrary intervals: Bucket by 15 minutes, 6 hours, 7 days, etc. (not limited to calendar units)
  • Custom alignment: Align weekly buckets to Monday instead of Thursday (epoch default)
  • Time-series aggregation: Group sensor data, logs, or events into custom time windows
  • Report generation: Create custom reporting periods aligned to business schedules

Comparison to existing functions:

  • date_trunc: Limited to calendar units (year, month, day, hour), no custom alignment
  • window: For streaming/windowing operations, different use case
  • timestamp_bucket (this PR): Flexible intervals + custom alignment for batch analysis

Comparison with Other Databases:

  • PostgreSQL: date_bin(interval, timestamp, origin)
  • TimescaleDB: time_bucket(interval, timestamp, origin)
  • This PR implementation: timestamp_bucket(interval, timestamp, origin)
    The proposed function provides similar functionality to PostgreSQL's date_bin and TimescaleDB's time_bucket, making Spark more competitive for time-series analysis.

Does this PR introduce any user-facing change?

Yes. This PR adds a new SQL function and API methods:

-- Hourly bucketing (default epoch alignment)
SELECT timestamp_bucket(INTERVAL '1' HOUR, TIMESTAMP'2024-12-04 14:30:00');
-- Result: 2024-12-04 14:00:00

-- Weekly bucketing with Monday alignment
SELECT timestamp_bucket(
  INTERVAL '7' DAY, 
  DATE'2024-12-04',
  TIMESTAMP'1970-01-05 00:00:00'  -- Monday origin
);
-- Result: 2024-12-02 00:00:00

-- Aggregation example
SELECT 
  timestamp_bucket(INTERVAL '15' MINUTE, event_time) AS bucket,
  COUNT(*) AS event_count
FROM events
GROUP BY bucket
ORDER BY bucket;

Scala API Example:

import org.apache.spark.sql.functions._

// 2-argument version (default origin)
df.select(timestamp_bucket(expr("INTERVAL '1' HOUR"), col("ts")))

// 3-argument version (custom origin)
df.select(timestamp_bucket(
  expr("INTERVAL '7' DAY"), 
  col("date_col"),
  expr("TIMESTAMP'1970-01-05 00:00:00'")
))

Python API Example:

import pyspark.sql.functions as sf
import datetime

# 2-argument version
df.select(sf.timestamp_bucket(sf.expr("INTERVAL '1' HOUR"), "ts"))

# 3-argument version with custom origin
df.select(sf.timestamp_bucket(
    sf.expr("INTERVAL '7' DAY"),
    df.ts,
    sf.lit(datetime.datetime(1970, 1, 5))  # Monday origin
))

How was this patch tested?

Added test DateFunctionsSuite, date.sql,

Was this patch authored or co-authored using generative AI tooling?

No

@vinodkc vinodkc changed the title [SPARK-54630][SQL] Add date_bucket function for arbitrary interval bucketing of dates [SPARK-54630][SQL][WIP] Add date_bucket function for arbitrary interval bucketing of dates Dec 10, 2025
@vinodkc vinodkc changed the title [SPARK-54630][SQL][WIP] Add date_bucket function for arbitrary interval bucketing of dates [SPARK-54630][SQL][WIP] Add timestamp_bucket function for temporal bucketing Dec 10, 2025
@vinodkc vinodkc changed the title [SPARK-54630][SQL][WIP] Add timestamp_bucket function for temporal bucketing [SPARK-54630][SQL] Add timestamp_bucket function for temporal bucketing Dec 10, 2025
@asl3
Copy link
Contributor

asl3 commented Dec 29, 2025

cc @srielau

@github-actions
Copy link

JIRA Issue Information

=== Improvement SPARK-54630 ===
Summary: Add timestamp_bucket function for temporal bucketing with configurable alignment
Assignee: None
Status: Open
Affected: ["3.5.0","4.0.0"]


This comment was automatically generated by GitHub Actions

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants