[Python][C++] Rowgroup statistics for pd.NaT array ill defined #22716

asfimport · 2019-08-23T20:40:40Z

When an array is initialised with only NaT values, the row group statistics are corrupt: reading them either returns random values or raises an integer-out-of-bounds exception.

import io
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"t": pd.Series([pd.NaT], dtype="datetime64[ns]")})
buf = pa.BufferOutputStream()
pq.write_table(pa.Table.from_pandas(df), buf, version="2.0")
buf = io.BytesIO(buf.getvalue().to_pybytes())
parquet_file = pq.ParquetFile(buf)
# Asserting behaviour is difficult since it is random and the state is ill defined. 
# After a few iterations an exception is raised.
while True:
    parquet_file.metadata.row_group(0).column(0).statistics.max

Reporter: Florian Jetter / @fjetter
Assignee: Uwe Korn / @xhochy

PRs and other links:

Note: This issue was originally created as ARROW-6339. Please see the migration documentation for further details.

asfimport · 2019-08-24T13:21:30Z

Florian Jetter / @fjetter:
The same is true for other null values, e.g.

df = pd.DataFrame({
    "t": pd.Series([pd.NaT], dtype="datetime64[ns]"),
    "f": [float("nan")],  # same value as pd.np.nan; the pd.np alias is deprecated
    "o": [None],
})

The statistics are usually initialised to zero, but not always.

asfimport · 2019-09-09T15:09:39Z

Antoine Pitrou / @pitrou:
@jorisvandenbossche

asfimport · 2019-09-17T08:22:59Z

Uwe Korn / @xhochy:
The problem here is that parquet_file.metadata.row_group(0).column(0).statistics.has_min_max is False and thus .max should never be accessed. Instead of returning undefined data, we should raise an exception.
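
Until such an exception exists, callers can guard the access themselves. The following is a minimal sketch, not part of pyarrow: `safe_min_max` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for the statistics object so the snippet runs without a Parquet file. It assumes only the `has_min_max`, `min` and `max` attributes mentioned above.

```python
from types import SimpleNamespace


def safe_min_max(stats):
    """Return (min, max), or None when no min/max was recorded.

    `stats` is expected to look like the object returned by
    parquet_file.metadata.row_group(i).column(j).statistics,
    which may itself be None when no statistics were written.
    """
    if stats is None or not stats.has_min_max:
        return None
    return stats.min, stats.max


# Hypothetical stand-ins for real statistics objects.
all_null = SimpleNamespace(has_min_max=False, min=None, max=None)
with_values = SimpleNamespace(has_min_max=True, min=1, max=5)

print(safe_min_max(all_null))
print(safe_min_max(with_values))
```

With a real file, the same helper would be called as `safe_min_max(parquet_file.metadata.row_group(0).column(0).statistics)`.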

asfimport · 2019-09-18T11:09:12Z

Krisztian Szucs / @kszucs:
Issue resolved by pull request 5403
#5403

asfimport closed this as completed Sep 18, 2019

asfimport assigned xhochy Jan 10, 2023

asfimport added this to the 0.15.0 milestone Jan 11, 2023
