
Not able to read nanosecond timestamp columns in 1.0 parquet files written by pyarrow #455

Closed · Tracked by #3148
houqp opened this issue Jun 13, 2021 · 8 comments

houqp (Member) commented Jun 13, 2021

Describe the bug

Here is a pandas dataframe with a nanosecond-timestamp Date index:

>>> hist.index
DatetimeIndex(['1986-03-13', '1986-03-14', '1986-03-17', '1986-03-18',
               '1986-03-19', '1986-03-20', '1986-03-21', '1986-03-24',
               '1986-03-25', '1986-03-26',
               ...
               '2021-05-28', '2021-06-01', '2021-06-02', '2021-06-03',
               '2021-06-04', '2021-06-07', '2021-06-08', '2021-06-09',
               '2021-06-10', '2021-06-11'],
              dtype='datetime64[ns]', name='Date', length=8885, freq=None)

When this dataframe is stored in parquet 1.0 format, pyarrow writes the Date column with microsecond unit. pyarrow is also able to load the Date column back with microsecond precision:

>>> from pyarrow.parquet import ParquetFile
>>> pp = ParquetFile("test_data/msft.parquet")
>>> pp.metadata.schema
<pyarrow._parquet.ParquetSchema object at 0x7f720d1bbac0>
required group field_id=0 schema {
  optional double field_id=1 Open;
  optional double field_id=2 High;
  optional double field_id=3 Low;
  optional double field_id=4 Close;
  optional int64 field_id=5 Volume;
  optional double field_id=6 Dividends;
  optional double field_id=7 StockSplits;
  optional int64 field_id=8 Date (Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false));
}

But when the file is loaded using the arrow parquet crate, the column is incorrectly reported as a nanosecond timestamp type.

To Reproduce

Here is a sample file to reproduce the issue: https://github.com/roapi/roapi/files/6599704/msft.parquet.zip.

The file can be reproduced with the following python code:

import yfinance as yf

# full MSFT price history; the index is a nanosecond DatetimeIndex named 'Date'
hist = yf.Ticker('MSFT').history(period="max")
hist.to_parquet('msft.parquet')
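
For anyone without yfinance, roughly the same file can be produced with stand-in data; only the nanosecond DatetimeIndex named 'Date' matters (a sketch with hypothetical values, not the real MSFT data):

import pandas as pd

# build a business-day index matching the shape of the original data
idx = pd.date_range("1986-03-13", periods=8885, freq="B", name="Date")
df = pd.DataFrame({"Open": 0.0, "Close": 0.0}, index=idx)

# pyarrow's writer defaulted to parquet format 1.0 at the time of this
# report; newer versions may need version="1.0" passed explicitly
df.to_parquet("msft_repro.parquet")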

Expected behavior

Date column should be loaded with microsecond precision.

Additional context

The arrow parquet crate handles parquet 2.0 files without any issue.
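
As a workaround, writing the file in parquet 2.0 format preserves the nanosecond unit (a sketch; pandas forwards keyword arguments to pyarrow's writer):

hist.to_parquet('msft.parquet', version='2.0')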

Initially reported in roapi/roapi#42.

Here is the decoded IPC field from the 'ARROW:schema' metadata for the Date column, as read by the arrow crate:

Field {
    name: Some(
        "Date",
    ),
    nullable: true,
    type_type: Timestamp,
    type_: Timestamp {
        unit: NANOSECOND,
        timezone: None,
    },
    dictionary: None,
    children: Some(
        [],
    ),
    custom_metadata: None,
}
houqp added the bug label Jun 13, 2021
nevi-me (Contributor) commented Jun 14, 2021

I'd have expected the ns timestamp to be written as int96, since that is the legacy nanosecond timestamp format. I'll dig into this; maybe if the format is 1.0 we should coerce the timestamp to a different resolution. That said, I'd also see this as a pyarrow quirk that could be documented somewhere.

@emkornfield do you know what we should do on the Rust side to roundtrip the file correctly?

houqp (Member, Author) commented Jun 14, 2021

The pyarrow parquet writer has an option to write ns timestamps as int96 for parquet 1.0, but it is turned off by default. It is quite strange that the decoded IPC schema comes back as a ns type even though the column is stored as µs.
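
For reference, a sketch of that option (use_deprecated_int96_timestamps is a real pyarrow.parquet.write_table parameter; the file names here are illustrative):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_pandas(hist)  # hist from the repro above

# default 1.0 behavior: ns timestamps are coerced to microseconds
pq.write_table(table, "msft_us.parquet", version="1.0")

# opt-in legacy behavior: preserve ns precision via int96 timestamps
pq.write_table(table, "msft_int96.parquet", version="1.0",
               use_deprecated_int96_timestamps=True)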

emkornfield (Contributor) commented:

> @emkornfield do you know what we should do on the Rust side to roundtrip the file correctly?

Sorry, I would have to dig into the code to have a better understanding. In general we try to avoid int96 because it is a deprecated type. If this is an issue with pyarrow, can we open a separate bug to track that?

jorgecarleitao (Member) commented:

This may be related to which schema we give priority when converting: the parquet schema or the embedded arrow schema. I would expect pyarrow to write them in a consistent manner, though, so, as @nevi-me mentioned, an arrow schema in ns alongside a parquet schema in µs does seem odd.
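
The mismatch is easy to see from Python (a sketch, assuming a pyarrow build that stores the serialized arrow schema base64-encoded under the 'ARROW:schema' metadata key, which is its documented behavior):

import base64
import pyarrow as pa
from pyarrow.parquet import ParquetFile

pf = ParquetFile("msft.parquet")

# physical parquet schema: Date is stored with microsecond unit
print(pf.metadata.schema)

# the embedded arrow schema still says timestamp[ns] for Date
raw = base64.b64decode(pf.metadata.metadata[b"ARROW:schema"])
print(pa.ipc.read_schema(pa.BufferReader(raw)))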

emkornfield (Contributor) commented:

https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/parquet/arrow/schema.cc#L215 is probably the relevant writing code for 1.0, so that might explain the discrepancy; let me see if I can dig up the reader code.

emkornfield (Contributor) commented:

Here is where C++ attempts to reuse the original metadata. It doesn't look like we would use the arrow schema in this case. I'm guessing the conversion back to the pandas type happens correctly because microseconds would be cast back to nanoseconds at the arrow-to-pandas layer.

emkornfield (Contributor) commented:

So I think there is probably a bug in pyarrow here, since we wouldn't round-trip a pure timestamp[ns] type; we don't run into it when going through pandas because the casting just works.
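
That cast is visible from Python (a sketch, reusing the msft.parquet file from the repro above):

import pandas as pd
import pyarrow.parquet as pq

# at the arrow level the column reads back with microsecond unit...
print(pq.read_table("msft.parquet").schema.field("Date").type)  # timestamp[us]

# ...but the arrow-to-pandas conversion casts it back to ns
print(pd.read_parquet("msft.parquet").index.dtype)  # datetime64[ns]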

tustvold (Contributor) commented:

I believe this is a duplicate of #1459, which was fixed by #1682. Please feel free to reopen if I am mistaken.
