Parquet reader of Int96 columns and coercion to timestamps #4075

Closed
rtyler opened this issue Apr 13, 2023 · 4 comments
Labels
parquet (Changes to the parquet crate), question (Further information is requested)

Comments

@rtyler
Contributor

rtyler commented Apr 13, 2023

Which part is this question about

I am using the parquet crate through delta-rs and am trying to understand the disconnect between Delta's interpretation of timestamps and Parquet's. For example, Delta defines timestamps as microseconds since the epoch.

Describe your question

The Parquet format docs define a dedicated timestamp logical type, which I don't believe Delta is using. The parquet files written by Delta (the Spark implementation) store timestamps as the INT96 physical type.

The parquet-tools CLI shows the column type from a .parquet file as:

############ Column(timestamp) ############
name: timestamp
path: timestamp
max_definition_level: 1
max_repetition_level: 0
physical_type: INT96
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 13%)
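
For reference, an INT96 timestamp is by convention 12 bytes: 8 little-endian bytes of nanoseconds within the day, followed by a 4-byte little-endian Julian day number, which is why it decodes naturally to nanosecond precision. A minimal decoding sketch (the names here are mine, not the parquet crate's API):

// Julian day number of 1970-01-01, the Unix epoch.
const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588;
const NANOS_PER_DAY: i64 = 86_400 * 1_000_000_000;

// Decode a raw INT96 value into nanoseconds since the Unix epoch.
fn int96_to_nanos(raw: [u8; 12]) -> i64 {
    let nanos_of_day = i64::from_le_bytes(raw[0..8].try_into().unwrap());
    let julian_day = i32::from_le_bytes(raw[8..12].try_into().unwrap()) as i64;
    (julian_day - JULIAN_DAY_OF_EPOCH) * NANOS_PER_DAY + nanos_of_day
}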

When I modify the read_parquet.rs example, the schema of the RecordBatch read from an example file with the above column is:

Field { name: "timestamp", data_type: Timestamp(Nanosecond, None), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }
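
For completeness, the check boils down to printing the schema the reader derives from the file footer; roughly (the file path is illustrative):

use std::fs::File;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Any Spark/Delta-written file with an INT96 column will do; the path is illustrative.
    let file = File::open("part-00000.snappy.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;
    // The arrow schema derived from the parquet metadata reports the
    // INT96 column as Timestamp(Nanosecond, None).
    println!("{:#?}", builder.schema());
    Ok(())
}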

I am assuming that the code which performs this conversion of the INT96 column to a timestamp is in consume_batch within primitive_array.rs, but I'm not entirely sure.

I'm hoping for some help figuring out where the disconnect might be between how Delta Lake thinks "timestamp" should look (microseconds) and the Parquet Rust reader, which coerces that INT96 to nanoseconds.

@rtyler rtyler added the question Further information is requested label Apr 13, 2023
@tustvold
Contributor

tustvold commented Apr 13, 2023

The parquet reader is returning nanoseconds because that is the precision present in the encoding. I'm not familiar with deltalake's timestamp handling, but it may be that they assume all timestamps are microseconds. As this is not actually true, delta-rs should probably add coercion logic to convert where appropriate.
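
Something like the following, using arrow's cast kernel, is a minimal sketch of that coercion (the helper name is illustrative):

use arrow::array::ArrayRef;
use arrow::compute::cast;
use arrow::datatypes::{DataType, TimeUnit};
use arrow::error::Result;

// Coerce a Timestamp(Nanosecond) column to the Timestamp(Microsecond)
// representation Delta expects; sub-microsecond precision is truncated.
fn to_micros(column: &ArrayRef) -> Result<ArrayRef> {
    cast(column, &DataType::Timestamp(TimeUnit::Microsecond, None))
}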

FWIW, the Int96 encoding has been deprecated for almost a decade; it is slightly ridiculous that Spark is still using it.

@tustvold
Contributor

apache/datafusion#5950 may be related here, FYI @wjones127

@rtyler
Contributor Author

rtyler commented Apr 13, 2023

FWIW, the Int96 encoding has been deprecated for almost a decade; it is slightly ridiculous that Spark is still using it.

Well that makes me sad 😆 but I'm not surprised.

@rtyler
Contributor Author

rtyler commented Apr 13, 2023

This link to Apache Spark code was shared with me, and it makes me so sad.

Thanks for the input @tustvold

@rtyler rtyler closed this as completed Apr 13, 2023
@tustvold tustvold added the parquet Changes to the parquet crate label May 18, 2023