[SPARK-54630][SQL] Add timestamp_bucket function for temporal bucketing #53376
What changes were proposed in this pull request?
This PR introduces a new timestamp_bucket function for grouping temporal data into fixed-width intervals with configurable alignment.

SQL Signature
timestamp_bucket(bucket_width, timestamp[, origin]) -> TIMESTAMP | TIMESTAMP_NTZ

Parameters:
- bucket_width: INTERVAL DAY TO SECOND - The bucket width
- timestamp: DATE | TIMESTAMP | TIMESTAMP_NTZ - The timestamp to bucket
- origin: TIMESTAMP (optional) - The origin for bucket alignment (default: TIMESTAMP'1970-01-01 00:00:00')

The return type depends on the input type:
- DATE input → returns TIMESTAMP (implicitly converted)
- TIMESTAMP input → returns TIMESTAMP
- TIMESTAMP_NTZ input → returns TIMESTAMP_NTZ

Key Features:
- DATE, TIMESTAMP, and TIMESTAMP_NTZ as input

Why are the changes needed?
Temporal bucketing is a common requirement in time-series analysis and data aggregation.
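To make the operation concrete, fixed-width bucketing with an origin can be sketched as flooring the offset from the origin to a whole multiple of the bucket width. The snippet below is a minimal plain-Python illustration, not the PR's implementation: the helper name bucket_start is hypothetical, it works on naive datetimes and ignores time zones, and both the rejection of non-positive widths and the floor behaviour for timestamps before the origin are assumptions rather than documented semantics of timestamp_bucket.

```python
from datetime import datetime, timedelta

# Default origin as stated in the PR description: TIMESTAMP'1970-01-01 00:00:00'.
DEFAULT_ORIGIN = datetime(1970, 1, 1)


def bucket_start(width: timedelta, ts: datetime,
                 origin: datetime = DEFAULT_ORIGIN) -> datetime:
    """Return the start of the fixed-width bucket that contains ts.

    Hypothetical helper for illustration only; it is not part of Spark.
    """
    # Assumption: a zero or negative width is treated as an error.
    if width <= timedelta(0):
        raise ValueError("bucket width must be positive")
    # timedelta // timedelta yields an int rounded toward negative infinity,
    # so timestamps before the origin map to the bucket starting at or before them.
    n = (ts - origin) // width
    return origin + n * width


# 15-minute buckets aligned to the default origin.
print(bucket_start(timedelta(minutes=15), datetime(2024, 1, 1, 10, 37, 12)))

# 1-hour buckets aligned to a custom origin at half past the hour.
print(bucket_start(timedelta(hours=1),
                   datetime(2024, 1, 1, 10, 17),
                   origin=datetime(2024, 1, 1, 0, 30)))
```

With the default origin, 10:37:12 lands in the bucket starting at 10:30:00. With the origin shifted to 00:30, the one-hour bucket containing 10:17 starts at 09:30, which shows how the origin controls alignment independently of the width.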
Currently, users must:
- date_trunc, which only supports fixed calendar units

Use Cases:
Comparison to existing functions:
Comparison with Other Databases:
- PostgreSQL: date_bin(interval, timestamp, origin)
- TimescaleDB: time_bucket(interval, timestamp, origin)
- Spark (this PR): timestamp_bucket(interval, timestamp, origin)

The proposed function provides similar functionality to PostgreSQL's date_bin and TimescaleDB's time_bucket, making Spark more competitive for time-series analysis.

Does this PR introduce any user-facing change?
Yes. This PR adds a new SQL function and API methods:
Scala API Example:
Python API Example:
How was this patch tested?
Added tests in DateFunctionsSuite and date.sql.

Was this patch authored or co-authored using generative AI tooling?
No