[Python] Return parquet statistics min/max as values instead of strings #17966

Closed
asfimport opened this issue Jan 10, 2018 · 2 comments

@asfimport

Currently, min and max column statistics are returned as strings formatted from the physical type. This makes them awkward to use in Python, since each string has to be parsed back into the proper logical type. Observe:

In [20]: import pandas as pd

In [21]: df = pd.DataFrame({'a': [1, 2, 3],
    ...:                    'b': ['a', 'b', 'c'],
    ...:                    'c': [pd.Timestamp('1991-01-01')]*3})
    ...:

In [22]: df.to_parquet('temp.parquet', engine='pyarrow')

In [23]: from pyarrow import parquet as pq

In [24]: f = pq.ParquetFile('temp.parquet')

In [25]: rg = f.metadata.row_group(0)

In [26]: rg.column(0).statistics.min  # string instead of integer
Out[26]: '1'

In [27]: rg.column(1).statistics.min  # weird space added after value due to formatter
Out[27]: 'a '

In [28]: rg.column(2).statistics.min  # formatted as physical type (int) instead of logical (datetime)
Out[28]: '662688000000'

Since the type information is known, it should be possible to return these as Arrow values instead of strings.
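For illustration, here is a rough sketch of the parsing a caller has to do by hand today. `parse_stat` is a hypothetical helper, not part of pyarrow, and it assumes the caller already knows the column's Parquet physical type and converted type and passes them in as plain strings. It also assumes the trailing space on byte-array values always comes from the formatter, which would be wrong for strings that really do end in a space.

```python
from datetime import datetime, timedelta, timezone

# Unix epoch, used to turn integer timestamps into datetimes.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_stat(raw, physical_type, converted_type=None):
    """Hypothetical helper: turn a formatted statistics string into a Python value.

    `physical_type` and `converted_type` are assumed to be supplied by the
    caller as Parquet type names such as "INT64" or "TIMESTAMP_MILLIS".
    """
    if physical_type in ("INT32", "INT64"):
        value = int(raw)
        if converted_type == "TIMESTAMP_MILLIS":
            # Stored as milliseconds since the epoch.
            return EPOCH + timedelta(milliseconds=value)
        return value
    if physical_type == "BYTE_ARRAY":
        # Assumption: a single trailing space was added by the formatter.
        return raw[:-1] if raw.endswith(" ") else raw
    raise ValueError(f"unsupported physical type: {physical_type}")


# The three values from the session above.
print(parse_stat('1', 'INT64'))
print(parse_stat('a ', 'BYTE_ARRAY'))
print(parse_stat('662688000000', 'INT64', 'TIMESTAMP_MILLIS'))
```

Having every caller carry this kind of table around is exactly what returning typed values from the statistics object would avoid.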

Reporter: Jim Crist / @jcrist
Assignee: Wes McKinney / @wesm

PRs and other links:

Note: This issue was originally created as ARROW-1982. Please see the migration documentation for further details.

@asfimport

Wes McKinney / @wesm:
This seems easy enough to fix. Marked for 0.9.0

@asfimport

Wes McKinney / @wesm:
Issue resolved by pull request #1698.
