
Handling Unsupported Arrow Types in Parquet #1666

Closed
tustvold opened this issue May 7, 2022 · 5 comments
Labels: parquet (Changes to the parquet crate), question (Further information is requested)

Comments

tustvold (Contributor) commented May 7, 2022

Problem

Data in parquet is stored as one of a limited number of physical types. There are then three mechanisms that an arrow reader can use to infer the type of the data:

  1. The deprecated ConvertedType enumeration stored within the parquet schema
  2. The LogicalType enumeration stored within the parquet schema
  3. An embedded arrow schema stored within the parquet file metadata

All parquet readers support 1, all v2 readers support 2, and only arrow readers support 3.
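
As a rough illustration of the resulting precedence (a sketch only, not this crate's actual reader code; `ColumnAnnotations` and the string-typed annotations are invented for illustration):

    // Prefer the most specific annotation available and fall back to the raw
    // physical type. Illustrative only; real readers work with parquet and
    // arrow schema types rather than strings.
    struct ColumnAnnotations {
        /// 3. Type recorded in the embedded arrow schema (arrow readers only)
        arrow_schema_type: Option<String>,
        /// 2. LogicalType annotation (v2 readers)
        logical_type: Option<String>,
        /// 1. Deprecated ConvertedType annotation (all readers)
        converted_type: Option<String>,
        /// Raw physical type, always present
        physical_type: String,
    }

    fn inferred_type(col: &ColumnAnnotations) -> String {
        col.arrow_schema_type
            .clone()
            .or_else(|| col.logical_type.clone())
            .or_else(|| col.converted_type.clone())
            .unwrap_or_else(|| col.physical_type.clone())
    }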

In some cases the logical type is not necessary to correctly interpret the data, e.g. String vs LargeString, but in some cases it fundamentally alters the meaning of the data.

Timestamp

A nanosecond time unit is not supported in 1, and a second time unit is only natively supported in 3.

Currently this crate encodes second time units with a logical type of milliseconds; this is likely a bug. It also does not convert nanoseconds to microseconds when using writer version 1, despite nanoseconds only being supported in format version 2.6 and later.

The python implementation will, depending on the writer version, potentially cast nanoseconds to microseconds, and seconds to milliseconds - see here
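
As a sketch of what that coercion amounts to on the raw i64 values (illustrative arithmetic only, not this crate's writer code):

    // Widening seconds to milliseconds loses nothing; truncating nanoseconds
    // to microseconds silently drops sub-microsecond precision, which is the
    // surprise that pyarrow's allow_truncated_timestamps guards against.
    fn seconds_to_millis(v: i64) -> i64 {
        v * 1_000
    }

    fn nanos_to_micros(v: i64) -> i64 {
        v / 1_000
    }

    fn main() {
        assert_eq!(seconds_to_millis(1_651_881_600), 1_651_881_600_000);
        assert_eq!(nanos_to_micros(1_651_881_600_123_456_789), 1_651_881_600_123_456);
    }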

There does not appear to be a way to round-trip timezones in 1 or 2. The C++ implementation appears to always normalize timezones to UTC and set is_adjusted_to_utc to true.

Currently this crate sets is_adjusted_to_utc to true if the timezone is set, despite the writer not actually performing the normalisation. I think this is a bug.

Date64

The arrow type Date64 is milliseconds since epoch and does not have an equivalent ConvertedType or LogicalType.

Currently this crate converts the type to Date32 on write, losing sub-day precision. This is what the C++ implementation does - see here.
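
As an illustration of the conversion (a sketch, not the crate's code), Date64 holds milliseconds since the epoch while Date32 holds days, so the cast is a floor division that discards anything below whole-day resolution:

    const MILLIS_PER_DAY: i64 = 86_400_000;

    fn date64_to_date32(millis: i64) -> i32 {
        // Floor division so dates before the epoch round towards negative infinity
        millis.div_euclid(MILLIS_PER_DAY) as i32
    }

    fn main() {
        // 2022-05-07T12:00:00 expressed as Date64 loses its intra-day component
        let d64 = 19_119 * MILLIS_PER_DAY + 12 * 3_600_000;
        assert_eq!(date64_to_date32(d64), 19_119);
    }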

Interval

The interval data type has a ConvertedType but not a LogicalType; there is a PR to add LogicalType support (apache/parquet-format#165) but it appears to have stalled somewhat.

An interval of MonthDayNano cannot be represented by the ConvertedType. The C++ implementation does not appear to support Interval types.
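
For illustration, the INTERVAL ConvertedType is a FIXED_LEN_BYTE_ARRAY(12) holding three little-endian unsigned 32-bit integers (months, days, milliseconds); the following sketch (the function name is invented) shows why a MonthDayNano value cannot round-trip through it:

    fn interval_to_converted_type(months: i32, days: i32, nanos: i64) -> Option<[u8; 12]> {
        // Nanoseconds must collapse to whole milliseconds that also fit in a u32,
        // so sub-millisecond precision (and large values) cannot be represented
        if nanos % 1_000_000 != 0 {
            return None;
        }
        let millis = u32::try_from(nanos / 1_000_000).ok()?;
        let mut out = [0u8; 12];
        out[0..4].copy_from_slice(&u32::try_from(months).ok()?.to_le_bytes());
        out[4..8].copy_from_slice(&u32::try_from(days).ok()?.to_le_bytes());
        out[8..12].copy_from_slice(&millis.to_le_bytes());
        Some(out)
    }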

Proposal

There are, broadly speaking, three ways to handle data types that cannot be represented in the parquet schema:

  • Return an error
  • Cast to a native parquet representation, potentially losing precision in the process
  • Encode the data, only encoding its logical type in the embedded arrow schema

Returning an error is the safest, but is not a great UX. Casting to a native parquet representation is in line with what other arrow implementations do and gives the best ecosystem compatibility, but also doesn't make for a great UX; just search StackOverflow for allow_truncated_timestamps to see the confusion this causes.

The final option is the simplest to implement, the least surprising to users, and what I would propose implementing. It would break ecosystem interoperability in certain cases, but I think it is more important that we faithfully round-trip data than maintain maximal compatibility.

Users who care about interoperability can explicitly cast data as appropriate, similar to python's coerce_timestamps functionality, or just restrict themselves to using a supported subset of arrow types.
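
For example, assuming the arrow crate's cast kernel, a user targeting maximum compatibility could drop nanosecond timestamps to microseconds themselves before handing the data to the writer (a sketch, not part of any proposed API):

    use std::sync::Arc;
    use arrow::array::{Array, ArrayRef, TimestampNanosecondArray};
    use arrow::compute::cast;
    use arrow::datatypes::{DataType, TimeUnit};

    fn coerce_for_compat(nanos: ArrayRef) -> arrow::error::Result<ArrayRef> {
        // Explicitly truncate to microseconds, which ConvertedType can describe,
        // rather than relying on the writer to decide
        cast(&nanos, &DataType::Timestamp(TimeUnit::Microsecond, None))
    }

    fn main() -> arrow::error::Result<()> {
        let input: ArrayRef =
            Arc::new(TimestampNanosecondArray::from(vec![1_651_881_600_123_456_789i64]));
        let coerced = coerce_for_compat(input)?;
        assert_eq!(
            coerced.data_type(),
            &DataType::Timestamp(TimeUnit::Microsecond, None)
        );
        Ok(())
    }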

Thoughts @sunchao @nevi-me @alamb @jorisvandenbossche @jorgecarleitao

Additional Context

tustvold added the question (Further information is requested) label May 7, 2022
jorgecarleitao (Member) commented May 7, 2022

Great write-up, @tustvold !

I agree with your assessment of least surprise on option 3.

The way I think about the casting option and option 3 is:

The casting option is potentially lossy on data (some bytes may be lost) and lossless on metadata (metadata is preserved); option 3 is lossless on data and potentially lossy on metadata.

In my experience metadata is easier to recover/preserve through other means (e.g. catalog, parquet's custom metadata, even column naming conventions) than data, which is usually unrecoverable. From this perspective, the casting option has a real impact on data integrity.

fwiw we use option 3 in arrow2. E.g. Date64 is encoded as Parquet's Int64 with no ConvertedType or LogicalType annotation (and the arrow schema in the metadata for arrow-aware readers).

viirya (Member) commented May 7, 2022

Thanks for summarizing the current status and the proposals. @tustvold

So it looks like we have already used option 2 and option 3 for Timestamp and Date64? I think we should settle on a consistent approach for dealing with these kinds of Arrow types, whether that is option 2 or option 3.

Cast to a native parquet representation, potentially losing precision in the process
Encode the data, only encoding its logical type in the embedded arrow schema

I think the above two options are the better choices, and they each have their own advantages. Regarding ecosystem compatibility, I feel it might also be confusing to users that Parquet files from one arrow implementation are not compatible with other implementations, especially if the users aren't aware that the produced Parquet files might not be compatible with other tools.

If there were a config/feature to explicitly enable writing such possibly incompatible Parquet files from this crate, it might be helpful, as users would know what they are doing.

alamb (Contributor) commented May 9, 2022

I think that we should come out a consistent approach on dealing with this kind of Arrow types.

100% agree with @viirya on this one

The final option is the simplest to implement, the least surprising to users, and what I would propose implementing. It would break ecosystem interoperability in certain cases, but I think it is more important that we faithfully round-trip data than maintain maximal compatibility.

I agree with @jorgecarleitao and @tustvold that this (option 3) would be best for some users (such as IOx), but as @viirya mentions it may not be ideal from a "least surprising" point of view for people using a broader selection of ecosystem tools.

Maybe we could add an option to WriterProperties that allows choosing the tradeoff desired. Something like

  /// Converts data prior to writing to maximize compatibility with other
  /// parquet implementations. This may result in losing precision for data
  /// types that cannot be natively represented, such as `Timestamp(Nanosecond)`,
  /// as parquet's ConvertedType only supports millisecond and microsecond
  /// precision timestamps natively
  maximize_ecosystem_compatibility: bool
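
To make the tradeoff concrete, here is a sketch (assuming the arrow crate's DataType) of the kind of mapping such a flag might control; the specific coercions are illustrative, not this crate's actual behaviour:

    use arrow::datatypes::{DataType, TimeUnit};

    /// Returns the type the data would be cast to when maximizing compatibility,
    /// or None when the data should be written as-is and described only by the
    /// embedded arrow schema (option 3)
    fn coerced_type(dt: &DataType, maximize_ecosystem_compatibility: bool) -> Option<DataType> {
        if !maximize_ecosystem_compatibility {
            return None;
        }
        match dt {
            // Nanoseconds truncate to microseconds, which ConvertedType can express
            DataType::Timestamp(TimeUnit::Nanosecond, tz) => {
                Some(DataType::Timestamp(TimeUnit::Microsecond, tz.clone()))
            }
            // Seconds widen to milliseconds
            DataType::Timestamp(TimeUnit::Second, tz) => {
                Some(DataType::Timestamp(TimeUnit::Millisecond, tz.clone()))
            }
            // Date64 collapses to Date32, dropping sub-day precision
            DataType::Date64 => Some(DataType::Date32),
            _ => None,
        }
    }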

tustvold added a commit to tustvold/arrow-rs that referenced this issue May 9, 2022
Don't treat embedded arrow schema as authoritative (apache#1663)

Fix projection of nested parquet files (apache#1652) (apache#1654)

Fix schema inference for repeated fields (apache#1681)

Support reading alternative list representations from parquet (apache#1680)

Consistent handling of unsupported arrow types in parquet (apache#1666)
sunchao (Member) commented May 9, 2022

Thanks for the discussion. I too think option 3 looks like the best choice, with a flag to coerce data types for interop with non-arrow Parquet readers.

tustvold (Contributor, Author)

I've created #1938 as a follow-up; thank you all for your feedback.
