Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Parquet] Parquet V2 page headers have incorrect number of rows #34086

Closed
etseidl opened this issue Feb 9, 2023 · 2 comments · Fixed by #34096
Closed

[C++][Parquet] Parquet V2 page headers have incorrect number of rows #34086

etseidl opened this issue Feb 9, 2023 · 2 comments · Fixed by #34096
Assignees
Labels
Component: C++ Component: Parquet Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. Type: bug
Milestone

Comments

@etseidl
Copy link
Contributor

etseidl commented Feb 9, 2023

Describe the bug, including details regarding any error messages, version, and platform.

When writing Parquet files with version 2 page headers, the num_rows field is incorrect. This appears to be because in column_writer.cc ColumnWriterImpl::BuildDataPageV2() num_values is passed twice to the constructor for DataPageV2. The 4th argument should be num_rows.

To reproduce:

import pyarrow.parquet as pq
import pyarrow as pa
table = pa.table({'col0': [[1,2,3]]})
pq.write_table(table, 'bug.parquet', data_page_version="2.0")

Examining with parquet-cli:

% parquet-cli pages bug.parquet
Column: col0.list.item
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-D    dict  S _  3       8.00 B     24 B      
  0-1    data  _ R  3       2.67 B     8 B        3        0       "1" / "3"

"rows" should be 1.
Rewriting the file with parquet-mr gives:

% parquet-cli pages bug-mr.parquet
Column: col0.list.element
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-0    data  _ D  3       5.00 B     15 B       1        0       

Component(s)

C++, Parquet

@kou kou changed the title Parquet V2 page headers have incorrect number of rows [C++][Parquet] Parquet V2 page headers have incorrect number of rows Feb 9, 2023
@etseidl
Copy link
Contributor Author

etseidl commented Feb 9, 2023

This might be easier to fix when #34054 is merged.

@wgtmac
Copy link
Member

wgtmac commented Feb 9, 2023

I exactly encounter this issue as well as other issues when I was implementing #34054. Will fix them shortly.

wjones127 pushed a commit that referenced this issue Feb 10, 2023
### Rationale for this change

The C++ parquet writer does not correctly fill num_rows field to DataPageV2 header.

### What changes are included in this PR?

ColumnWriter keeps track of number of rows buffered in the current data page and then fills it into header of data page v2.

### Are these changes tested?

A test case has been added to make sure the data page header has been set correctly for required, optional and repeated columns.

### Are there any user-facing changes?

No.
* Closes: #34086

Authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
@wjones127 wjones127 added this to the 12.0.0 milestone Feb 10, 2023
gringasalpastor pushed a commit to gringasalpastor/arrow that referenced this issue Feb 17, 2023
…pache#34096)

### Rationale for this change

The C++ parquet writer does not correctly fill num_rows field to DataPageV2 header.

### What changes are included in this PR?

ColumnWriter keeps track of number of rows buffered in the current data page and then fills it into header of data page v2.

### Are these changes tested?

A test case has been added to make sure the data page header has been set correctly for required, optional and repeated columns.

### Are there any user-facing changes?

No.
* Closes: apache#34086

Authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
fatemehp pushed a commit to fatemehp/arrow that referenced this issue Feb 24, 2023
…pache#34096)

### Rationale for this change

The C++ parquet writer does not correctly fill num_rows field to DataPageV2 header.

### What changes are included in this PR?

ColumnWriter keeps track of number of rows buffered in the current data page and then fills it into header of data page v2.

### Are these changes tested?

A test case has been added to make sure the data page header has been set correctly for required, optional and repeated columns.

### Are there any user-facing changes?

No.
* Closes: apache#34086

Authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
@wjones127 wjones127 added the Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. label Apr 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: C++ Component: Parquet Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. Type: bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants