[Python] handle timestamp type in parquet file for compatibility with older HiveQL #30967

@asfimport

Hi there,

I ran into an issue when writing a Parquet file with PyArrow.

Older versions of Hive can only recognize timestamps stored as INT96, so I write the table with the use_deprecated_int96_timestamps=True option (e.g. via write_to_dataset) to save the Parquet file. But HiveQL skips the timestamp conversion when the created_by field in the Parquet file metadata is not "parquet-mr".

hive/ParquetRecordReaderBase.java at f1ff99636a5546231336208a300a114bcf8c5944 · apache/hive (github.com)

As a workaround, I have to save the timestamp columns with timezone info (shifted to UTC+8).

But when pyarrow.parquet reads from a directory that contains Parquet files created by both PyArrow and parquet-mr, the resulting Arrow Table ignores the timezone info for the parquet-mr files.

Maybe PyArrow could expose the created_by option in Python (my preferred solution; parquet::WriterProperties::created_by is already available in C++).

Or could PyArrow handle timezone-aware timestamp columns in files created by parquet-mr?

Maybe related to https://issues.apache.org/jira/browse/ARROW-14422

Reporter: nero

Note: This issue was originally created as ARROW-15492. Please see the migration documentation for further details.
