
[Feature] Support Spark expression: parse_to_timestamp #3109

@andygrove

Description


What is the problem the feature request solves?

Note: This issue was generated with AI assistance. The specification details have been extracted from Spark documentation and may need verification.

Comet does not currently support the Spark parse_to_timestamp function, causing queries using this function to fall back to Spark's JVM execution instead of running natively on DataFusion.

ParseToTimestamp is a Spark Catalyst expression that converts string, date, timestamp, or numeric values to a timestamp data type. It supports optional format specifications for parsing string inputs and provides timezone-aware conversion capabilities with configurable error handling behavior.

Supporting this expression would allow more Spark workloads to benefit from Comet's native acceleration.

Describe the potential solution

Spark Specification

Syntax:

to_timestamp(timestamp_str[, format])
// DataFrame API usage
df.select(to_timestamp($"timestamp_column"))
df.select(to_timestamp($"timestamp_column", "yyyy-MM-dd HH:mm:ss"))

Arguments:

Argument     Type                Description
left         Expression          The input expression to convert to a timestamp
format       Option[Expression]  Optional format string for parsing string input
dataType     DataType            Target timestamp data type
timeZoneId   Option[String]      Optional timezone identifier for conversion
failOnError  Boolean             Whether to fail on conversion errors (defaults to the ANSI mode setting)

Return Type: Returns a timestamp data type as specified by the dataType parameter, typically TimestampType or TimestampNTZType.
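The failOnError flag above decides whether a failed conversion surfaces as a runtime error (ANSI mode) or degrades to a null result. A minimal sketch of that semantics in Rust; the function and parameter names here are hypothetical, not Comet's actual API:

```rust
// Sketch of the failOnError semantics: a parse failure either propagates
// as an error (ANSI mode) or becomes a null output.
// Names are illustrative, not Comet's actual API.
fn apply_fail_on_error(
    parsed: Result<i64, String>, // timestamp micros, or a parse error message
    fail_on_error: bool,
) -> Result<Option<i64>, String> {
    match parsed {
        Ok(micros) => Ok(Some(micros)),
        Err(e) if fail_on_error => Err(e), // ANSI mode: surface the error
        Err(_) => Ok(None),                // non-ANSI: return null
    }
}

fn main() {
    assert_eq!(apply_fail_on_error(Ok(42), true), Ok(Some(42)));
    assert_eq!(apply_fail_on_error(Err("bad input".into()), false), Ok(None));
    assert!(apply_fail_on_error(Err("bad input".into()), true).is_err());
    println!("ok");
}
```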

Supported Data Types:

  • StringType with collation support (including trim collation)
  • DateType
  • TimestampType
  • TimestampNTZType
  • NumericType (only when target dataType is TimestampType)

Edge Cases:

  • Null inputs return null outputs
  • Invalid format strings cause runtime errors when failOnError is true
  • The behavior for unparseable timestamp strings depends on the ANSI mode setting (error vs. null)
  • Numeric inputs are interpreted as seconds since the Unix epoch when converting to TimestampType
  • Timezone conversion edge cases (DST transitions) are handled according to Java timezone rules
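The numeric edge case above can be made concrete: Spark represents timestamps internally as microseconds since the Unix epoch, so a numeric input interpreted as epoch seconds is scaled by 1,000,000. A hedged sketch of that scaling; the helper name is illustrative, not Comet's code:

```rust
// Illustrative helper for the numeric-to-TimestampType rule: numeric
// inputs are treated as seconds since the Unix epoch, while Spark's
// internal timestamp representation is microseconds since the epoch.
// The name is hypothetical, not Comet's actual API.
fn seconds_to_timestamp_micros(seconds: f64) -> Option<i64> {
    let micros = seconds * 1_000_000.0;
    // Reject NaN/infinity and values outside the representable i64 range.
    if !micros.is_finite() || micros < i64::MIN as f64 || micros > i64::MAX as f64 {
        return None;
    }
    Some(micros as i64)
}

fn main() {
    // 1483142400 s == 2016-12-31 00:00:00 UTC
    assert_eq!(
        seconds_to_timestamp_micros(1483142400.0),
        Some(1_483_142_400_000_000)
    );
    // Fractional seconds carry through at microsecond precision.
    assert_eq!(seconds_to_timestamp_micros(1.5), Some(1_500_000));
    println!("ok");
}
```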

Examples:

-- Basic timestamp parsing
SELECT to_timestamp('2016-12-31 00:00:00');

-- With custom format
SELECT to_timestamp('12/31/2016 00:00:00', 'MM/dd/yyyy HH:mm:ss');

-- Converting date to timestamp
SELECT to_timestamp(current_date());
// DataFrame API usage
import org.apache.spark.sql.functions._

// Basic conversion
df.select(to_timestamp($"timestamp_str"))

// With format specification
df.select(to_timestamp($"date_str", "yyyy-MM-dd"))

// Converting numeric epoch seconds
df.select(to_timestamp($"epoch_seconds"))

Implementation Approach

See the Comet guide on adding new expressions for detailed instructions.

  1. Scala Serde: Add expression handler in spark/src/main/scala/org/apache/comet/serde/
  2. Register: Add to appropriate map in QueryPlanSerde.scala
  3. Protobuf: Add message type in native/proto/src/proto/expr.proto if needed
  4. Rust: Implement in native/spark-expr/src/ (check if DataFusion has built-in support first)
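As a rough illustration of step 4, a native implementation has to bridge Spark's Java-style datetime patterns (e.g. yyyy-MM-dd HH:mm:ss) to whatever pattern syntax the Rust side uses, for instance chrono's strftime-style patterns. A minimal sketch of that token mapping, handling only a few common tokens; this is an assumption-laden illustration, not Comet's actual code:

```rust
// Hypothetical sketch: translate a Spark datetime pattern (Java
// DateTimeFormatter style, e.g. "yyyy-MM-dd HH:mm:ss") into a
// strftime-style pattern ("%Y-%m-%d %H:%M:%S"). Only a few common
// tokens are handled; a real implementation needs the full token set
// and proper tokenization rather than naive string replacement.
fn spark_pattern_to_strftime(pattern: &str) -> String {
    pattern
        .replace("yyyy", "%Y") // 4-digit year
        .replace("MM", "%m")   // 2-digit month
        .replace("dd", "%d")   // 2-digit day
        .replace("HH", "%H")   // 24-hour clock hour
        .replace("mm", "%M")   // minutes
        .replace("ss", "%S")   // seconds
}

fn main() {
    assert_eq!(
        spark_pattern_to_strftime("yyyy-MM-dd HH:mm:ss"),
        "%Y-%m-%d %H:%M:%S"
    );
    println!("ok");
}
```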

Additional context

Difficulty: Medium
Spark Expression Class: org.apache.spark.sql.catalyst.expressions.ParseToTimestamp

Related:

  • GetTimestamp - Underlying expression for formatted parsing
  • Cast - Underlying expression for unformatted conversion
  • ParseToDate - Similar expression for date parsing
  • UnixTimestamp - Converting to Unix timestamp format

This issue was auto-generated from Spark reference documentation.
