Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Parquet] Parquet statistics (min/max) for dictionary columns are incorrect #27497

Closed
asfimport opened this issue Feb 15, 2021 · 4 comments

Comments

@asfimport
Copy link
Collaborator

I would expect to see ('A','A') for the first row group and ('B','B') for the second rowgroup.

I suspect this is a C++ issue, but I went looking for the way that the statistics are calculated and was unable to find them.

>>> import pyarrow as pa
>>> import pyarrow.parquet as papq
>>> d = pa.DictionaryArray.from_arrays((100*[0]) + (100*[1]),["A","B"])
>>> t = pa.table({"col":d})
>>> papq.write_table(t,'sample.parquet',row_group_size=100)
>>> f = papq.ParquetFile('sample.parquet')
>>> (f.metadata.row_group(0).column(0).statistics.min, f.metadata.row_group(0).column(0).statistics.max)
('A', 'B')
>>> (f.metadata.row_group(1).column(0).statistics.min, f.metadata.row_group(1).column(0).statistics.max)
('A', 'B')
>>> f.read_row_groups([0]).column(0)
<pyarrow.lib.ChunkedArray object at 0x7f37346abe90>
[ 
  -- dictionary:
    [
      "A",
      "B"
    ]
  -- indices:
    [
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      ...
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0
    ]
]
>>> f.read_row_groups([1]).column(0)
<pyarrow.lib.ChunkedArray object at 0x7f37346abef0>
[
  -- dictionary:
    [
      "A",
      "B"
    ]
  -- indices:
    [
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      ...
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1
    ]
]

Reporter: Daniel Nugent / @nugend
Assignee: Weston Pace / @westonpace

Related issues:

Note: This issue was originally created as ARROW-11634. Please see the migration documentation for further details.

@asfimport
Copy link
Collaborator Author

Micah Kornfield / @emkornfield:
https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L1492 is where I believe the problematic code.  min/max statistics should calculated by doing dictionary lookup with the indices instead of simply using the entire dictionary.

@asfimport
Copy link
Collaborator Author

Micah Kornfield / @emkornfield:
@westonpace  I think this was fixed with the other dictionary changes you made?

@asfimport
Copy link
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
With latest master, I can indeed confirm the above snippet now gives the correct output:

In [20]: (f.metadata.row_group(0).column(0).statistics.min, f.metadata.row_group(0).column(0).statistics.max)
Out[20]: ('A', 'A')

In [21]: (f.metadata.row_group(1).column(0).statistics.min, f.metadata.row_group(1).column(0).statistics.max)
Out[21]: ('B', 'B')

Was this sufficiently tested in the PR that fixed this?

@asfimport
Copy link
Collaborator Author

Weston Pace / @westonpace:
The issue I fixed was ARROW-12513 and I did add some unit tests at the C++ level to regress this behavior. I'd be good with closing this here.

@asfimport asfimport added this to the 6.0.0 milestone Jan 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants