
Handling Unsupported Arrow Types in Parquet #1666

Closed
tustvold opened this issue May 7, 2022 · 5 comments
Labels: parquet (Changes to the parquet crate), question (Further information is requested)

Comments

tustvold (Contributor) commented May 7, 2022

Problem

Data in parquet is stored as one of a limited number of physical types. There are then three mechanisms that an arrow reader can use to infer the type of the data:

  1. The deprecated ConvertedType enumeration stored within the parquet schema
  2. The LogicalType enumeration stored within the parquet schema
  3. An embedded arrow schema stored within the parquet file metadata

All parquet readers support 1, all v2 readers support 2, and only arrow readers support 3.
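
As a rough illustration of the resulting precedence (a sketch only, not this crate's actual reader code; `ColumnAnnotations` and the string-typed annotations are invented for illustration):

    // Prefer the most specific annotation available and fall back to the raw
    // physical type. Illustrative only; real readers work with parquet and
    // arrow schema types rather than strings.
    struct ColumnAnnotations {
        /// 3. Type recorded in the embedded arrow schema (arrow readers only)
        arrow_schema_type: Option<String>,
        /// 2. LogicalType annotation (v2 readers)
        logical_type: Option<String>,
        /// 1. Deprecated ConvertedType annotation (all readers)
        converted_type: Option<String>,
        /// Raw physical type, always present
        physical_type: String,
    }

    fn inferred_type(col: &ColumnAnnotations) -> String {
        col.arrow_schema_type
            .clone()
            .or_else(|| col.logical_type.clone())
            .or_else(|| col.converted_type.clone())
            .unwrap_or_else(|| col.physical_type.clone())
    }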

In some cases the logical type is not necessary to correctly interpret the data, e.g. String vs LargeString, but in some cases it fundamentally alters the meaning of the data.

Timestamp

A nanosecond time unit is not supported in 1, and a second time unit is only natively supported in 3.

Currently this crate encodes second time units with a logical type of milliseconds; this is likely a bug. It also does not convert nanoseconds to microseconds when using writer version 1, despite nanoseconds only being supported in format version 2.6 and later.

The python implementation will, depending on the writer version, potentially cast nanoseconds to microseconds, and seconds to milliseconds - see here
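
As a sketch of what that coercion amounts to on the raw i64 values (illustrative arithmetic only, not this crate's writer code):

    // Widening seconds to milliseconds loses nothing; truncating nanoseconds
    // to microseconds silently drops sub-microsecond precision, which is the
    // surprise that pyarrow's allow_truncated_timestamps guards against.
    fn seconds_to_millis(v: i64) -> i64 {
        v * 1_000
    }

    fn nanos_to_micros(v: i64) -> i64 {
        v / 1_000
    }

    fn main() {
        assert_eq!(seconds_to_millis(1_651_881_600), 1_651_881_600_000);
        assert_eq!(nanos_to_micros(1_651_881_600_123_456_789), 1_651_881_600_123_456);
    }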

There does not appear to be a way to round-trip timezones in 1 or 2. The C++ implementation appears to always normalize timezones to UTC and set is_adjusted_to_utc to true.

Currently this crate sets is_adjusted_to_utc to true if the timezone is set, despite the writer not actually performing the normalisation. I think this is a bug.

Date64

The arrow type Date64 is milliseconds since epoch and does not have an equivalent ConvertedType or LogicalType.

Currently this crate converts the type to Date32 on write, losing sub-day precision. This is what the C++ implementation does - see here.
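
As an illustration of the conversion (a sketch, not the crate's code), Date64 holds milliseconds since the epoch while Date32 holds days, so the cast is a floor division that discards anything below whole-day resolution:

    const MILLIS_PER_DAY: i64 = 86_400_000;

    fn date64_to_date32(millis: i64) -> i32 {
        // Floor division so dates before the epoch round towards negative infinity
        millis.div_euclid(MILLIS_PER_DAY) as i32
    }

    fn main() {
        // 2022-05-07T12:00:00 expressed as Date64 loses its intra-day component
        let d64 = 19_119 * MILLIS_PER_DAY + 12 * 3_600_000;
        assert_eq!(date64_to_date32(d64), 19_119);
    }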

Interval

The interval data type has a ConvertedType but not a LogicalType; there is a PR to add LogicalType support (apache/parquet-format#165) but it appears to have stalled somewhat.

An interval of MonthDayNano cannot be represented by the ConvertedType. The C++ implementation does not appear to support Interval types.
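
For illustration, the INTERVAL ConvertedType is a FIXED_LEN_BYTE_ARRAY(12) holding three little-endian unsigned 32-bit integers (months, days, milliseconds); the following sketch (the function name is invented) shows why a MonthDayNano value cannot round-trip through it:

    fn interval_to_converted_type(months: i32, days: i32, nanos: i64) -> Option<[u8; 12]> {
        // Nanoseconds must collapse to whole milliseconds that also fit in a u32,
        // so sub-millisecond precision (and large values) cannot be represented
        if nanos % 1_000_000 != 0 {
            return None;
        }
        let millis = u32::try_from(nanos / 1_000_000).ok()?;
        let mut out = [0u8; 12];
        out[0..4].copy_from_slice(&u32::try_from(months).ok()?.to_le_bytes());
        out[4..8].copy_from_slice(&u32::try_from(days).ok()?.to_le_bytes());
        out[8..12].copy_from_slice(&millis.to_le_bytes());
        Some(out)
    }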

Proposal

There are, broadly speaking, three ways to handle data types that cannot be represented in the parquet schema:

  • Return an error
  • Cast to a native parquet representation, potentially losing precision in the process
  • Encode the data, only encoding its logical type in the embedded arrow schema

Returning an error is the safest, but is not a great UX. Casting to a native parquet representation is in line with what other arrow implementations do and gives the best ecosystem compatibility, but also doesn't make for a great UX; just search StackOverflow for allow_truncated_timestamps to see the confusion this causes.

The final option is the simplest to implement, the least surprising to users, and what I would propose implementing. It would break ecosystem interoperability in certain cases, but I think it is more important that we faithfully round-trip data than maintain maximal compatibility.

Users who care about interoperability can explicitly cast data as appropriate, similar to python's coerce_timestamps functionality, or just restrict themselves to using a supported subset of arrow types.
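
For example, assuming the arrow crate's cast kernel, a user targeting maximum compatibility could drop nanosecond timestamps to microseconds themselves before handing the data to the writer (a sketch, not part of any proposed API):

    use std::sync::Arc;
    use arrow::array::{Array, ArrayRef, TimestampNanosecondArray};
    use arrow::compute::cast;
    use arrow::datatypes::{DataType, TimeUnit};

    fn coerce_for_compat(nanos: ArrayRef) -> arrow::error::Result<ArrayRef> {
        // Explicitly truncate to microseconds, which ConvertedType can describe,
        // rather than relying on the writer to decide
        cast(&nanos, &DataType::Timestamp(TimeUnit::Microsecond, None))
    }

    fn main() -> arrow::error::Result<()> {
        let input: ArrayRef =
            Arc::new(TimestampNanosecondArray::from(vec![1_651_881_600_123_456_789i64]));
        let coerced = coerce_for_compat(input)?;
        assert_eq!(
            coerced.data_type(),
            &DataType::Timestamp(TimeUnit::Microsecond, None)
        );
        Ok(())
    }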

Thoughts @sunchao @nevi-me @alamb @jorisvandenbossche @jorgecarleitao

Additional Context

tustvold added the question (Further information is requested) label May 7, 2022
jorgecarleitao (Member) commented May 7, 2022

Great write-up, @tustvold !

I agree with your assessment of least surprise on option 3.

The way I think about the casting option and option 3 is:

The casting option is potentially lossy on data (some bytes may be lost) and lossless on metadata (metadata is preserved); option 3 is lossless on data and potentially lossy on metadata.

In my experience metadata is easier to recover/preserve through other means (e.g. catalog, parquet's custom metadata, even column naming conventions) than data, which is usually unrecoverable. From this perspective, the casting option has a real impact on data integrity.

fwiw we use option 3 in arrow2. E.g. Date64 is encoded as Parquet's Int64 with no ConvertedType or LogicalType annotation (and the arrow schema in the metadata for arrow-aware readers).

viirya (Member) commented May 7, 2022

Thanks for summarizing the current status and the proposals. @tustvold

So it looks like we have already used option 2 and option 3 for Timestamp and Date64? I think we should settle on a consistent approach for dealing with these kinds of Arrow types, whether that is option 2 or option 3.

Cast to a native parquet representation, potentially losing precision in the process
Encode the data, only encoding its logical type in the embedded arrow schema

I think the above two options are the better choices, and they each have their own advantages. Regarding ecosystem compatibility, I feel it might also be confusing to users that Parquet files from one arrow implementation are not compatible with other implementations, especially if the users aren't aware that the produced Parquet files might not be compatible with other tools.

If there were a config/feature to explicitly enable writing such possibly incompatible Parquet files from this crate, it might be helpful, as users would know what they are doing.

alamb (Contributor) commented May 9, 2022

I think that we should come out a consistent approach on dealing with this kind of Arrow types.

100% agree with @viirya on this one

The final option is the simplest to implement, the least surprising to users, and what I would propose implementing. It would break ecosystem interoperability in certain cases, but I think it is more important that we faithfully round-trip data than maintain maximal compatibility.

I agree with @jorgecarleitao and @tustvold that this (option 3) would be best for some users (such as IOx), but as @viirya mentions it may not be ideal from a "least surprising" point of view for people using a broader selection of ecosystem tools.

Maybe we could add an option to WriterProperties that allows choosing the tradeoff desired. Something like

  /// Converts data prior to writing to maximize compatibility with other
  /// parquet implementations. This may result in losing precision for data
  /// types that cannot be natively represented, such as `Timestamp(Nanosecond)`,
  /// as parquet's ConvertedType only supports millisecond and microsecond
  /// precision timestamps natively
  maximize_ecosystem_compatibility: bool
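
To make the tradeoff concrete, here is a sketch (assuming the arrow crate's DataType) of the kind of mapping such a flag might control; the specific coercions are illustrative, not this crate's actual behaviour:

    use arrow::datatypes::{DataType, TimeUnit};

    /// Returns the type the data would be cast to when maximizing compatibility,
    /// or None when the data should be written as-is and described only by the
    /// embedded arrow schema (option 3)
    fn coerced_type(dt: &DataType, maximize_ecosystem_compatibility: bool) -> Option<DataType> {
        if !maximize_ecosystem_compatibility {
            return None;
        }
        match dt {
            // Nanoseconds truncate to microseconds, which ConvertedType can express
            DataType::Timestamp(TimeUnit::Nanosecond, tz) => {
                Some(DataType::Timestamp(TimeUnit::Microsecond, tz.clone()))
            }
            // Seconds widen to milliseconds
            DataType::Timestamp(TimeUnit::Second, tz) => {
                Some(DataType::Timestamp(TimeUnit::Millisecond, tz.clone()))
            }
            // Date64 collapses to Date32, dropping sub-day precision
            DataType::Date64 => Some(DataType::Date32),
            _ => None,
        }
    }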

tustvold added a commit to tustvold/arrow-rs that referenced this issue May 9, 2022
Don't treat embedded arrow schema as authoritative (apache#1663)

Fix projection of nested parquet files (apache#1652) (apache#1654)

Fix schema inference for repeated fields (apache#1681)

Support reading alternative list representations from parquet (apache#1680)

Consistent handling of unsupported arrow types in parquet (apache#1666)
sunchao (Member) commented May 9, 2022

Thanks for the discussion. I too think option 3 looks like the best choice, with a flag to coerce data types for interop with non-arrow Parquet readers.

tustvold (Contributor, Author)

I've created #1938 as a follow-up; thank you all for your feedback.
