@vinodkc vinodkc commented Nov 23, 2025

What changes were proposed in this pull request?

This PR adds ORC serialization and deserialization support for Spark's TIME type.

Why are the changes needed?

TIME type currently lacks ORC support, preventing users from:

  • Reading/writing ORC files with TIME columns
  • Integrating TIME data with existing ORC-based data lakes
  • Preserving TIME precision in columnar storage

Does this PR introduce any user-facing change?

Yes. Users can now:

  1. Read ORC with TIME columns
     spark.read.format("orc").load("data.orc")
     // Returns DataFrame with TIME columns preserved
  2. Write DataFrames with TIME to ORC
     val df = spark.sql("SELECT TIME'14:30:45.123456' as shift_start")
     df.write.format("orc").save("output.orc")

Technical Details

Storage Format

ORC Column:
  Physical Type: LONG (nanoseconds since midnight)
  Custom Attribute: spark.sql.catalyst.type = "time(<precision>)"
  Value Range: 0 to 86,399,999,999,999
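
To make the encoding concrete, here is a minimal sketch of the nanoseconds-since-midnight mapping built only on `java.time`. The names `TimeNanos`, `toNanos`, and `fromNanos` are illustrative assumptions and are not taken from the patch; the sketch only mirrors the physical type and value range listed above.

```scala
import java.time.LocalTime

object TimeNanos {
  // Upper bound from the storage format above: 23:59:59.999999999.
  val MaxNanosOfDay: Long = 86399999999999L

  // Encode a wall-clock time as nanoseconds since midnight.
  def toNanos(t: LocalTime): Long = t.toNanoOfDay

  // Decode a stored LONG, rejecting values outside the documented range.
  def fromNanos(nanos: Long): LocalTime = {
    require(nanos >= 0L && nanos <= MaxNanosOfDay, s"out of range: $nanos")
    LocalTime.ofNanoOfDay(nanos)
  }
}
```

For example, `TimeNanos.toNanos(LocalTime.parse("14:30:45.123456"))` yields the LONG that would represent the `shift_start` literal from the write example under this encoding.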

Precision Handling

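| Precision | Catalyst Type | ORC Attribute | Example Value |
|-----------|---------------|---------------|---------------|
| 0 (seconds) | `TimeType(0)` | `"time(0)"` | `12:34:56` |
| 3 (millis) | `TimeType(3)` | `"time(3)"` | `12:34:56.123` |
| 6 (micros) | `TimeType(6)` | `"time(6)"` | `12:34:56.123456` |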
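
The relationship between the attribute string and the precision can be sketched as follows. This is a hypothetical helper, not the parsing code from the patch: `TimeAttribute`, `parsePrecision`, and `truncate` are made-up names, the accepted range of 0 through 6 is an assumption based on the rows above, and the truncation only illustrates what each precision means for a nanosecond value, not what Spark does on write.

```scala
object TimeAttribute {
  // Matches attribute values of the form listed above, e.g. "time(6)".
  private val Pattern = """time\((\d)\)""".r

  // Extract the fractional-second precision if the string is well formed.
  def parsePrecision(attr: String): Option[Int] = attr match {
    case Pattern(p) if p.toInt <= 6 => Some(p.toInt)
    case _ => None
  }

  // Drop digits finer than the given precision from a nanos-of-day value.
  def truncate(nanos: Long, precision: Int): Long = {
    require(precision >= 0 && precision <= 6, s"unsupported precision: $precision")
    val unit = Seq.fill(9 - precision)(10L).product // 10^(9 - precision)
    nanos - (nanos % unit)
  }
}
```

With these definitions, a value with nanosecond detail keeps only milliseconds when truncated to precision 3, which matches the `12:34:56.123` example.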

Future Compatibility

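  • Versioned via file metadata: relies on the existing org.apache.spark.version entry in the ORC file metadata, so a reader can tell which Spark version wrote a file
  • Forward-compatible: if ORC later adds a native TIME type, Spark can migrate to it and use the recorded version to tell older LONG-encoded files apart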
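
One way such a version check could look is sketched below. Everything here is hypothetical: `OrcTimeCompat`, `majorMinor`, and `usesLongEncoding` are invented names, the cutoff release is a caller-supplied parameter because no such release exists yet, and treating an unparseable version as LONG-encoded is an assumption made for the sketch.

```scala
object OrcTimeCompat {
  // Parse the leading "major.minor" of a version string such as "4.2.0".
  def majorMinor(version: String): Option[(Int, Int)] =
    version.split("[.-]").toList match {
      case maj :: min :: _
          if maj.nonEmpty && min.nonEmpty &&
            maj.forall(_.isDigit) && min.forall(_.isDigit) =>
        Some((maj.toInt, min.toInt))
      case _ => None
    }

  // True when the writer version is older than the (hypothetical) release
  // that switches to a native ORC TIME type. Unparseable versions are
  // assumed to use the LONG encoding.
  def usesLongEncoding(writerVersion: String, nativeSince: (Int, Int)): Boolean =
    majorMinor(writerVersion).forall { case (maj, min) =>
      maj < nativeSince._1 || (maj == nativeSince._1 && min < nativeSince._2)
    }
}
```

A reader would pass in the value stored under `org.apache.spark.version` and pick the decoding path from the result.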

How was this patch tested?

Added tests in OrcQuerySuite

Was this patch authored or co-authored using generative AI tooling?

Yes.
Generated-by: Claude 3.5 Sonnet

AI assistance was used for:

  • Code pattern analysis and design discussions
  • Implementation guidance following Spark conventions
  • Test case generation and organization
  • Documentation and examples

@github-actions github-actions bot added the SQL label Nov 23, 2025
@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Thank you, @vinodkc.

@dongjoon-hyun
Merged to master for Apache Spark 4.2.0.

huangxiaopingRD pushed a commit to huangxiaopingRD/spark that referenced this pull request Nov 25, 2025

Closes apache#53185 from vinodkc/br_time_orc_read_write.

Authored-by: vinodkc <vinod.kc.in@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>