[C++] [Parquet] Writing uint32 does not preserve parquet's LogicalType #28020

Closed
asfimport opened this issue Apr 5, 2021 · 7 comments

asfimport commented Apr 5, 2021

When writing a uint32 column, (parquet's) logical type is not written, limiting interoperability with other engines.

Minimal Python

import pyarrow as pa
import pyarrow.parquet as pq

# One nullable uint32 column, keyed by column name
data = {"uint32": [1, None, 0]}
schema = pa.schema([pa.field('uint32', pa.uint32())])

t = pa.table(data, schema=schema)
pq.write_table(t, "bla.parquet")

 
Inspecting it with spark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("bla.parquet")
print(df.select("uint32").schema)

shows StructType(List(StructField(uint32,LongType,true))). "LongType" indicates that the field is interpreted as a 64-bit integer. Further inspection of the metadata shows that neither convertedType nor logicalType is set. Note that this is independent of the Arrow-specific schema written in the metadata.

Reporter: Jorge Leitão / @jorgecarleitao

Note: This issue was originally created as ARROW-12201. Please see the migration documentation for further details.

Antoine Pitrou / @pitrou:
Hmm, that's a bummer. Hopefully we can fix this for 4.0.0.

Antoine Pitrou / @pitrou:
Oh, actually, you simply need to add version='2.0' to the write_table call, and the annotation will be written out.

Antoine Pitrou / @pitrou:
Will close as this works as intended (by default, maximum compatibility with old readers is ensured). However, we may want to bump the default version at some point.

Jorge Leitão / @jorgecarleitao:
Got it.

In this case, doesn't compatibility require setting ConvertedType = UINT_32?

Antoine Pitrou / @pitrou:
I have no idea why it wasn't done. This seems to date back to a 2016 PR:
https://github.com/apache/parquet-cpp/pull/158/files#r607025397

Jorge Leitão / @jorgecarleitao:
holy cow. Ok, that explains it. Thanks a lot, and sorry for the noise.

Micah Kornfield / @emkornfield:
There was another bug that discussed this at length. int64 was used for uint32 so that the values could properly round trip. But yes, this is WAI (working as intended).
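To make the round-trip argument concrete, here is a small stdlib-only sketch. It is not Arrow's or Parquet's implementation; the helper names `reinterpret_as_int32` and `store_as_int64` are hypothetical and only illustrate the bit-level reasoning. It assumes a reader that ignores the unsigned annotation and treats the stored bits as a signed integer of the physical width.

```python
import struct

UINT32_MAX = 2**32 - 1  # largest value a uint32 column can hold


def reinterpret_as_int32(value: int) -> int:
    """Hypothetical helper: what a reader sees if uint32 bits are stored in a
    32-bit physical column and the unsigned annotation is ignored."""
    return struct.unpack("<i", struct.pack("<I", value))[0]


def store_as_int64(value: int) -> int:
    """Hypothetical helper: round trip the same value through a signed
    64-bit physical column, which needs no annotation to stay correct."""
    return struct.unpack("<q", struct.pack("<q", value))[0]


values = [0, 1, 2**31 - 1, 2**31, UINT32_MAX]
for v in values:
    # Values >= 2**31 come back negative in the 32-bit case,
    # but are unchanged in the 64-bit case.
    print(v, reinterpret_as_int32(v), store_as_int64(v))
```

Every uint32 value fits in a signed 64-bit integer, so the wider physical type survives even a reader that knows nothing about unsigned annotations, at the cost of the reader reporting a 64-bit column, as Spark did above.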
