[Python][C++] Rowgroup statistics for pd.NaT array ill defined #22716

Closed
asfimport opened this issue Aug 23, 2019 · 4 comments

When an array containing only NaT values is written, the row group statistics are corrupt: accessing them either returns random values or raises an integer out-of-bounds exception.

import io
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"t": pd.Series([pd.NaT], dtype="datetime64[ns]")})
buf = pa.BufferOutputStream()
pq.write_table(pa.Table.from_pandas(df), buf, version="2.0")
buf = io.BytesIO(buf.getvalue().to_pybytes())
parquet_file = pq.ParquetFile(buf)
# Asserting behaviour is difficult since it is random and the state is ill defined. 
# After a few iterations an exception is raised.
while True:
    parquet_file.metadata.row_group(0).column(0).statistics.max

Reporter: Florian Jetter / @fjetter
Assignee: Uwe Korn / @xhochy

Note: This issue was originally created as ARROW-6339. Please see the migration documentation for further details.

Florian Jetter / @fjetter:
The same is true for other null values, e.g.

df = pd.DataFrame({
    "t": pd.Series([pd.NaT], dtype="datetime64[ns]"),
    # float("nan") replaces pd.np.nan, since the pd.np alias was removed in pandas 2.0
    "f": [float("nan")],
    "o": [None],
})

The statistics are usually initialised to zero, but not always.

Uwe Korn / @xhochy:
The problem here is that parquet_file.metadata.row_group(0).column(0).statistics.has_min_max is False, so .max should never be accessed. Instead of returning undefined data, we should raise an exception.
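
Until the library raises on its own, calling code can guard the access itself. The following is only a minimal sketch: safe_min_max is a hypothetical helper, not part of pyarrow, and it assumes the statistics object exposes has_min_max, min and max as described above. Stand-in objects are used so the sketch runs without a Parquet file.

```python
from types import SimpleNamespace


def safe_min_max(statistics):
    """Return (min, max) from column-chunk statistics, or None if they are not set.

    `statistics` is expected to look like the object returned by
    parquet_file.metadata.row_group(i).column(j).statistics;
    this helper is hypothetical and not part of pyarrow.
    """
    # Statistics may be absent entirely, or present without min/max
    # (for example when every value in the column chunk is null).
    if statistics is None or not statistics.has_min_max:
        return None
    return statistics.min, statistics.max


# Stand-ins for real statistics objects (assumed attribute names only).
all_null = SimpleNamespace(has_min_max=False, min=None, max=None)
with_values = SimpleNamespace(has_min_max=True, min=1, max=5)

print(safe_min_max(all_null))     # None
print(safe_min_max(with_values))  # (1, 5)
```

With a real file, the same helper would be called as safe_min_max(parquet_file.metadata.row_group(0).column(0).statistics), and a None result would signal that min and max must not be read.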

Krisztian Szucs / @kszucs:
Issue resolved by pull request #5403.

@asfimport asfimport added this to the 0.15.0 milestone Jan 11, 2023