Description
What is the problem the feature request solves?
Note: This issue was generated with AI assistance. The specification details have been extracted from Spark documentation and may need verification.
Comet does not currently support Spark's to_timestamp function (the ParseToTimestamp expression), causing queries that use it to fall back to Spark's JVM execution instead of running natively on DataFusion.
ParseToTimestamp is a Spark Catalyst expression that converts string, date, timestamp, or numeric values to a timestamp data type. It supports optional format specifications for parsing string inputs and provides timezone-aware conversion capabilities with configurable error handling behavior.
Supporting this expression would allow more Spark workloads to benefit from Comet's native acceleration.
Describe the potential solution
Spark Specification
Syntax:
```sql
to_timestamp(timestamp_str[, format])
```

```scala
// DataFrame API usage
df.select(to_timestamp($"timestamp_column"))
df.select(to_timestamp($"timestamp_column", "yyyy-MM-dd HH:mm:ss"))
```

Arguments:
| Argument | Type | Description |
|---|---|---|
| left | Expression | The input expression to convert to timestamp |
| format | Option[Expression] | Optional format string for parsing input |
| dataType | DataType | Target timestamp data type |
| timeZoneId | Option[String] | Optional timezone identifier for conversion |
| failOnError | Boolean | Whether to fail on conversion errors (defaults to ANSI mode setting) |
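For reference, the arguments above map onto the Catalyst constructor. A minimal sketch of constructing the expression directly, assuming the five-argument signature shown in the table (internal API; argument order and defaults vary across Spark versions):

```scala
import org.apache.spark.sql.catalyst.expressions.{Literal, ParseToTimestamp}
import org.apache.spark.sql.types.TimestampType

// Normally the analyzer builds this from a to_timestamp(...) call;
// direct construction is shown only to illustrate the argument table.
val expr = ParseToTimestamp(
  left = Literal("12/31/2016 00:00:00"),         // input expression
  format = Some(Literal("MM/dd/yyyy HH:mm:ss")), // optional format string
  dataType = TimestampType,                      // target timestamp type
  timeZoneId = Some("UTC"),                      // optional timezone id
  failOnError = false)                           // follows ANSI mode by default
```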
Return Type: Returns a timestamp data type as specified by the dataType parameter, typically TimestampType or TimestampNTZType.
Supported Data Types:
- StringType with collation support (including trim collation)
- DateType
- TimestampType
- TimestampNTZType
- NumericType (only when target dataType is TimestampType)
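The input types above can be exercised from the DataFrame API. A minimal sketch, assuming a spark-shell session (so spark and the $ interpolator are in scope) and illustrative column names:

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("2016-12-31 00:00:00", java.sql.Date.valueOf("2016-12-31"), 1483142400L))
  .toDF("str_col", "date_col", "epoch_col")

df.select(
  to_timestamp($"str_col"),  // StringType input
  to_timestamp($"date_col"), // DateType input
  to_timestamp($"epoch_col") // NumericType input: seconds since epoch
).show(truncate = false)
```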
Edge Cases:
- Null inputs are handled gracefully and typically return null outputs
- Invalid format strings cause runtime errors when failOnError is true
- Behavior for unparseable timestamp strings depends on ANSI mode settings
- Numeric inputs are interpreted as seconds since the epoch when converting to TimestampType
- Timezone conversion edge cases (e.g., DST transitions) are handled according to Java timezone rules
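The null and ANSI-mode cases can be demonstrated directly. A short sketch, again assuming a spark-shell session; note that spark.sql.ansi.enabled defaults to false in Spark 3.x:

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(Some("2016-12-31 00:00:00"), Some("not-a-timestamp"), None)
  .toDF("ts_str")

// With ANSI mode off, both the unparseable string and the null yield null.
spark.conf.set("spark.sql.ansi.enabled", "false")
df.select(to_timestamp($"ts_str")).show()

// With ANSI mode on, the unparseable row raises a runtime error instead.
spark.conf.set("spark.sql.ansi.enabled", "true")
// df.select(to_timestamp($"ts_str")).show()  // throws on 'not-a-timestamp'
```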
Examples:
```sql
-- Basic timestamp parsing
SELECT to_timestamp('2016-12-31 00:00:00');

-- With custom format
SELECT to_timestamp('12/31/2016 00:00:00', 'MM/dd/yyyy HH:mm:ss');

-- Converting date to timestamp
SELECT to_timestamp(current_date());
```

```scala
// DataFrame API usage
import org.apache.spark.sql.functions._

// Basic conversion
df.select(to_timestamp($"timestamp_str"))

// With format specification
df.select(to_timestamp($"date_str", "yyyy-MM-dd"))

// Converting numeric epoch seconds
df.select(to_timestamp($"epoch_seconds"))
```

Implementation Approach
See the Comet guide on adding new expressions for detailed instructions.
- Scala Serde: Add an expression handler in spark/src/main/scala/org/apache/comet/serde/ (a hedged sketch follows this list)
- Register: Add it to the appropriate map in QueryPlanSerde.scala
- Protobuf: Add a message type in native/proto/src/proto/expr.proto if needed
- Rust: Implement in native/spark-expr/src/ (check if DataFusion has built-in support first)
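To make the first step concrete, here is a rough, hypothetical sketch of a Scala serde handler. The object layout, the ExprOuterClass.Expr type, and the exprToProtoInternal helper are assumptions inferred from the repository paths above, not verified API; follow the Comet guide and existing handlers for the real pattern. Because ParseToTimestamp is RuntimeReplaceable, one low-cost option is to serialize its replacement expression (GetTimestamp or Cast, see Related below) and reuse Comet's existing support for those:

```scala
// Hypothetical sketch only -- names and signatures here are assumptions,
// not Comet's actual API.
import org.apache.spark.sql.catalyst.expressions.{Attribute, ParseToTimestamp}

object CometParseToTimestamp {
  // ParseToTimestamp is RuntimeReplaceable: its replacement is GetTimestamp
  // (formatted parsing) or Cast (unformatted conversion). Delegating to the
  // replacement avoids defining a new protobuf message at all.
  def convert(
      expr: ParseToTimestamp,
      inputs: Seq[Attribute],
      binding: Boolean): Option[ExprOuterClass.Expr] = {
    // exprToProtoInternal: assumed to be the existing recursive
    // serialization entry point in QueryPlanSerde.scala.
    exprToProtoInternal(expr.replacement, inputs, binding)
  }
}
```

If the replacement-based approach proves insufficient (for example, if native format handling diverges from Spark's), a dedicated protobuf message and Rust implementation (steps 3 and 4) would be needed instead.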
Additional context
Difficulty: Medium
Spark Expression Class: org.apache.spark.sql.catalyst.expressions.ParseToTimestamp
Related:
- GetTimestamp - Underlying expression for formatted parsing
- Cast - Underlying expression for unformatted conversion
- ParseToDate - Similar expression for date parsing
- UnixTimestamp - Converts to a Unix timestamp
This issue was auto-generated from Spark reference documentation.