[C++] Parquet: Object type and stats lost when using 96-bit timestamps #16034

Closed
asfimport opened this issue Mar 20, 2019 · 3 comments

Comments

@asfimport

Run the following code:

import datetime as dt
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

dataframe = pd.DataFrame({'foo': [dt.datetime.now()]})
table = pa.Table.from_pandas(dataframe, preserve_index=False)

pq.write_table(table, 'int64.parq')
pq.write_table(table, 'int96.parq', use_deprecated_int96_timestamps=True)

Examining the int64.parq file, we see that the column metadata includes an object type of TIMESTAMP_MICROS and also gives some stats. All is well.

file schema: schema 
--------------------------------------------------------------------------------
foo:         OPTIONAL INT64 O:TIMESTAMP_MICROS R:0 D:1

row group 1: RC:1 TS:76 OFFSET:4 
--------------------------------------------------------------------------------
foo:          INT64 SNAPPY ... ST:[min: 2019-12-31T23:59:59.999000, max: 2019-12-31T23:59:59.999000, num_nulls: 0]

However, if we look at int96.parq, that metadata appears to be lost: there is no object type and there are no column stats.

file schema: schema 
--------------------------------------------------------------------------------
foo:         OPTIONAL INT96 R:0 D:1

row group 1: RC:1 TS:58 OFFSET:4 
--------------------------------------------------------------------------------
foo:          INT96 SNAPPY ... ST:[no stats for this column]

This is a bit confusing, since the metadata for the exact same data can look different depending on whether an unrelated flag is set or cleared.

Environment: PyArrow: 0.12.1
Python: 2.7.15, 3.7.2
Pandas: 0.24.2
Reporter: Diego Argueta / @dargueta

Note: This issue was originally created as ARROW-4967. Please see the migration documentation for further details.

@asfimport

Joris Van den Bossche / @jorisvandenbossche:
@dargueta Regarding the logical type, I think this is expected: INT96 is only a physical type in the parquet format, and there is no timestamp-like logical type that uses INT96 as physical type. 

The usage of INT96 for timestamps only stems from a convention in some of the parquet implementations (I think Hive and Impala, but I am not very familiar with them), and therefore arrow has the option to write them, for compatibility with those systems. But note that this type is actually deprecated in the parquet format.

See e.g. https://stackoverflow.com/a/54665645/653364, https://stackoverflow.com/questions/42628287/sparks-int96-time-type and the discussion in apache/parquet-format#49

That's the explanation for the missing logical type. For the missing stats, I am not sure.

@asfimport

Wes McKinney / @wesm:
Computation of statistics is disabled for INT96. We don't intend to do anything about this AFAIK. cc @majetideepak

@asfimport

Deepak Majeti / @majetideepak:
The comments above are correct! The INT96 type is deprecated and its statistics are disabled by default. The timestamp byte layout in INT96 is big endian and does not comply with the standard sort orders in the spec.
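To illustrate why min/max statistics over raw INT96 values can be misleading, here is a small standard-library sketch. It assumes the layout that Impala- and Spark-style readers are commonly described as using: 8 bytes of nanoseconds within the day followed by 4 bytes of Julian day number, each stored little-endian. That layout, the constant, and the helper names `encode_int96` and `decode_int96` are assumptions for illustration and are not stated in this thread. Under that assumption, comparing the encoded bytes lexicographically does not reproduce chronological order.

```python
import datetime as dt
import struct

# Julian day number of 1970-01-01 (assumed convention for INT96 timestamps).
JULIAN_DAY_OF_UNIX_EPOCH = 2440588
UNIX_EPOCH = dt.datetime(1970, 1, 1)


def encode_int96(ts):
    """Encode a naive datetime as 12 bytes: nanos-of-day, then Julian day."""
    delta = ts - UNIX_EPOCH
    nanos_of_day = (delta.seconds * 10**6 + delta.microseconds) * 1000
    julian_day = delta.days + JULIAN_DAY_OF_UNIX_EPOCH
    # "<qi" = little-endian int64 followed by int32, no padding (12 bytes).
    return struct.pack("<qi", nanos_of_day, julian_day)


def decode_int96(raw):
    """Inverse of encode_int96, at microsecond resolution."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return UNIX_EPOCH + dt.timedelta(
        days=julian_day - JULIAN_DAY_OF_UNIX_EPOCH,
        microseconds=nanos_of_day // 1000,
    )


earlier = dt.datetime(2019, 3, 20, 23, 0, 0)
later = dt.datetime(2019, 3, 21, 0, 0, 0)

# Chronologically earlier < later, but the time-of-day field comes first in
# the encoding, so a byte-wise comparison ranks the values the other way.
bytes_disagree = encode_int96(earlier) > encode_int96(later)
```

Here `bytes_disagree` is true even though `earlier < later`, so a writer that computed min and max by comparing raw bytes would record bounds that do not match the actual timestamps.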
