Parquet reader of Int96 columns and coercion to timestamps #4075

Closed
rtyler opened this issue Apr 13, 2023 · 4 comments
Labels
parquet (Changes to the parquet crate), question (Further information is requested)

Comments

@rtyler
Contributor

rtyler commented Apr 13, 2023

Which part is this question about

I am using the parquet crate through delta-rs and am trying to understand the disconnect between Delta's interpretation of timestamps and Parquet's. For example, Delta defines timestamps as microseconds since the epoch.

Describe your question

The Parquet format docs define a dedicated timestamp logical type, which I don't believe Delta is using. The parquet files written by Delta (the Spark implementation) store timestamps as the INT96 physical type.

The parquet-tools CLI shows the column type from a .parquet file as:

############ Column(timestamp) ############
name: timestamp
path: timestamp
max_definition_level: 1
max_repetition_level: 0
physical_type: INT96
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 13%)
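
For reference, an INT96 timestamp is by convention 12 bytes: 8 little-endian bytes of nanoseconds within the day, followed by a 4-byte little-endian Julian day number, which is why it decodes naturally to nanosecond precision. A minimal decoding sketch (the names here are mine, not the parquet crate's API):

// Julian day number of 1970-01-01, the Unix epoch.
const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588;
const NANOS_PER_DAY: i64 = 86_400 * 1_000_000_000;

// Decode a raw INT96 value into nanoseconds since the Unix epoch.
fn int96_to_nanos(raw: [u8; 12]) -> i64 {
    let nanos_of_day = i64::from_le_bytes(raw[0..8].try_into().unwrap());
    let julian_day = i32::from_le_bytes(raw[8..12].try_into().unwrap()) as i64;
    (julian_day - JULIAN_DAY_OF_EPOCH) * NANOS_PER_DAY + nanos_of_day
}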

When I modify the read_parquet.rs example, the schema of the RecordBatch read from an example file with the above column is:

Field { name: "timestamp", data_type: Timestamp(Nanosecond, None), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }
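
For completeness, the check boils down to printing the schema the reader derives from the file footer; roughly (the file path is illustrative):

use std::fs::File;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Any Spark/Delta-written file with an INT96 column will do; the path is illustrative.
    let file = File::open("part-00000.snappy.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;
    // The arrow schema derived from the parquet metadata reports the
    // INT96 column as Timestamp(Nanosecond, None).
    println!("{:#?}", builder.schema());
    Ok(())
}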

I am assuming that the code which performs this conversion of the INT96 column to a timestamp is in consume_batch within primitive_array.rs, but I'm not entirely sure.

I'm hoping for some help figuring out where the disconnect might be between how Delta Lake thinks "timestamp" should look (microseconds) and the Parquet Rust reader, which coerces that INT96 to nanoseconds.

@rtyler rtyler added the question Further information is requested label Apr 13, 2023
@tustvold
Contributor

tustvold commented Apr 13, 2023

The parquet reader is returning nanoseconds because that is the precision present in the encoding. I'm not familiar with deltalake's timestamp handling, but it may be that they assume all timestamps are microseconds. As this is not actually true, delta-rs should probably add coercion logic to convert where appropriate.
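
Something like the following, using arrow's cast kernel, is a minimal sketch of that coercion (the helper name is illustrative):

use arrow::array::ArrayRef;
use arrow::compute::cast;
use arrow::datatypes::{DataType, TimeUnit};
use arrow::error::Result;

// Coerce a Timestamp(Nanosecond) column to the Timestamp(Microsecond)
// representation Delta expects; sub-microsecond precision is truncated.
fn to_micros(column: &ArrayRef) -> Result<ArrayRef> {
    cast(column, &DataType::Timestamp(TimeUnit::Microsecond, None))
}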

FWIW, the Int96 encoding has been deprecated for almost a decade; it is slightly ridiculous that Spark is still using it.

@tustvold
Contributor

apache/datafusion#5950 may be related here, FYI @wjones127

@rtyler
Contributor Author

rtyler commented Apr 13, 2023

FWIW, the Int96 encoding has been deprecated for almost a decade; it is slightly ridiculous that Spark is still using it.

Well that makes me sad 😆 but I'm not surprised.

@rtyler
Contributor Author

rtyler commented Apr 13, 2023

This link to Apache Spark code was shared with me, and it makes me so sad.

Thanks for the input @tustvold

@rtyler rtyler closed this as completed Apr 13, 2023
@tustvold tustvold added the parquet Changes to the parquet crate label May 18, 2023