[C++][Parquet] Parquet statistics (min/max) for dictionary columns are incorrect #27497

asfimport · 2021-02-15T18:35:37Z

In the example below, I would expect to see ('A', 'A') for the first row group and ('B', 'B') for the second row group. Instead, both row groups report ('A', 'B').

I suspect this is a C++ issue, but when I went looking for where the statistics are calculated, I was unable to find it.

>>> import pyarrow as pa
>>> import pyarrow.parquet as papq
>>> d = pa.DictionaryArray.from_arrays((100*[0]) + (100*[1]),["A","B"])
>>> t = pa.table({"col":d})
>>> papq.write_table(t,'sample.parquet',row_group_size=100)
>>> f = papq.ParquetFile('sample.parquet')
>>> (f.metadata.row_group(0).column(0).statistics.min, f.metadata.row_group(0).column(0).statistics.max)
('A', 'B')
>>> (f.metadata.row_group(1).column(0).statistics.min, f.metadata.row_group(1).column(0).statistics.max)
('A', 'B')
>>> f.read_row_groups([0]).column(0)
<pyarrow.lib.ChunkedArray object at 0x7f37346abe90>
[ 
  -- dictionary:
    [
      "A",
      "B"
    ]
  -- indices:
    [
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      ...
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0
    ]
]
>>> f.read_row_groups([1]).column(0)
<pyarrow.lib.ChunkedArray object at 0x7f37346abef0>
[
  -- dictionary:
    [
      "A",
      "B"
    ]
  -- indices:
    [
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      ...
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1
    ]
]

Reporter: Daniel Nugent / @nugend
Assignee: Weston Pace / @westonpace

Related issues:

[C++] Parquet statistics wrong for dictionary type (is related to)

Note: This issue was originally created as ARROW-11634. Please see the migration documentation for further details.

asfimport · 2021-02-16T07:53:37Z

Micah Kornfield / @emkornfield:
https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L1492 is where I believe the problematic code is. The min/max statistics should be calculated by looking up the indices in the dictionary, instead of simply using the entire dictionary.
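
To illustrate the difference, here is a minimal pure-Python sketch of the idea. It is not the actual C++ writer code; the helper name `referenced_min_max` and the list-based inputs are hypothetical and only model a dictionary plus the indices belonging to one row group.

```python
from typing import Optional, Sequence, Tuple


def referenced_min_max(
    dictionary: Sequence[str],
    indices: Sequence[Optional[int]],
) -> Optional[Tuple[str, str]]:
    """Return (min, max) over the dictionary values the indices actually reference.

    Hypothetical helper for illustration only. None in `indices` models a null
    slot and is skipped. Returns None when no non-null value is referenced.
    """
    # Resolve each non-null index to its dictionary value.
    values = [dictionary[i] for i in indices if i is not None]
    if not values:
        return None
    return min(values), max(values)


dictionary = ["A", "B"]
row_group_0 = [0] * 100  # every row points at "A"
row_group_1 = [1] * 100  # every row points at "B"

# Statistics derived from the indices of each row group.
print(referenced_min_max(dictionary, row_group_0))
print(referenced_min_max(dictionary, row_group_1))

# Statistics derived from the whole dictionary, which is what the report describes.
print((min(dictionary), max(dictionary)))
```

Using the whole dictionary gives the same answer for every row group, which matches the ('A', 'B') result in the report. Resolving the indices first gives per-row-group bounds.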

asfimport · 2021-09-14T03:32:01Z

Micah Kornfield / @emkornfield:
@westonpace I think this was fixed with the other dictionary changes you made?

asfimport · 2021-09-14T08:32:07Z

Joris Van den Bossche / @jorisvandenbossche:
With latest master, I can indeed confirm the above snippet now gives the correct output:

In [20]: (f.metadata.row_group(0).column(0).statistics.min, f.metadata.row_group(0).column(0).statistics.max)
Out[20]: ('A', 'A')

In [21]: (f.metadata.row_group(1).column(0).statistics.min, f.metadata.row_group(1).column(0).statistics.max)
Out[21]: ('B', 'B')

Was this sufficiently tested in the PR that fixed this?

asfimport · 2021-09-14T18:21:22Z

Weston Pace / @westonpace:
The issue I fixed was ARROW-12513, and I did add some unit tests at the C++ level to guard against regressions in this behavior. I'd be good with closing this here.

asfimport closed this as completed Sep 15, 2021

asfimport assigned westonpace Jan 10, 2023

asfimport added this to the 6.0.0 milestone Jan 11, 2023
