Description
Hi there,
I ran into an issue when writing a Parquet file with PyArrow.
Older versions of Hive only recognize timestamps stored as INT96, so I use write_to_dataset with the use_deprecated_int96_timestamps=True option to save the Parquet file. But Hive skips the timestamp conversion when the created_by field in the Parquet file's metadata is not "parquet-mr".
So I have to save the timestamp columns with timezone info (padded to UTC+8).
But when pyarrow.parquet reads from a directory that contains Parquet files created by both PyArrow and parquet-mr, the resulting Arrow Table ignores the timezone info for the parquet-mr files.
Maybe PyArrow could expose the created_by option (my preferred solution; parquet::WriterProperties::created_by is already available in C++).
Or could it handle timestamps with a timezone in files created by parquet-mr?
Maybe related to https://issues.apache.org/jira/browse/ARROW-14422
Reporter: nero
Note: This issue was originally created as ARROW-15492. Please see the migration documentation for further details.