[C++][Parquet] Parquet V2 page headers have incorrect number of rows #34086
Labels
Component: C++
Component: Parquet
Critical Fix
Bugfixes for security vulnerabilities, crashes, or invalid data.
Type: bug
kou
changed the title
Parquet V2 page headers have incorrect number of rows
[C++][Parquet] Parquet V2 page headers have incorrect number of rows
Feb 9, 2023
This might be easier to fix once #34054 is merged.
I ran into exactly this issue, along with some others, while implementing #34054. I will fix them shortly.
wgtmac
added a commit
to wgtmac/arrow
that referenced
this issue
Feb 9, 2023
wgtmac
added a commit
to wgtmac/arrow
that referenced
this issue
Feb 9, 2023
wjones127
pushed a commit
that referenced
this issue
Feb 10, 2023
### Rationale for this change

The C++ parquet writer does not correctly fill num_rows field to DataPageV2 header.

### What changes are included in this PR?

ColumnWriter keeps track of number of rows buffered in the current data page and then fills it into header of data page v2.

### Are these changes tested?

A test case has been added to make sure the data page header has been set correctly for required, optional and repeated columns.

### Are there any user-facing changes?

No.

* Closes: #34086

Authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
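The bookkeeping described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the actual Arrow code: `PageRowTracker`, `PageV2Counts`, `AddBatch`, and `FlushPage` are hypothetical names, and the caller is assumed to already know how many rows each batch contains.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical types for illustration only; these are not Arrow's classes.
struct PageV2Counts {
  int64_t num_values;  // values (levels) buffered in the page, nulls included
  int64_t num_rows;    // rows buffered in the page
};

class PageRowTracker {
 public:
  // Record one written batch. The two counts are kept separately because
  // they differ for repeated columns, where one row can hold many values.
  void AddBatch(int64_t num_values, int64_t num_rows) {
    num_buffered_values_ += num_values;
    num_buffered_rows_ += num_rows;
  }

  // Return the counts destined for the page header, then reset the
  // counters so the next page starts from zero.
  PageV2Counts FlushPage() {
    PageV2Counts counts{num_buffered_values_, num_buffered_rows_};
    num_buffered_values_ = 0;
    num_buffered_rows_ = 0;
    return counts;
  }

 private:
  int64_t num_buffered_values_ = 0;
  int64_t num_buffered_rows_ = 0;
};
```

With this sketch, two batches of a list column holding 3 values in 1 row and 2 values in 1 row would flush as `num_values = 5` and `num_rows = 2`, rather than reporting 5 for both.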
gringasalpastor
pushed a commit
to gringasalpastor/arrow
that referenced
this issue
Feb 17, 2023
fatemehp
pushed a commit
to fatemehp/arrow
that referenced
this issue
Feb 24, 2023
wjones127 added the Critical Fix label Apr 26, 2023
Describe the bug, including details regarding any error messages, version, and platform.
When writing Parquet files with version 2 page headers, the `num_rows` field is incorrect. This appears to be because, in `column_writer.cc`, `ColumnWriterImpl::BuildDataPageV2()` passes `num_values` twice to the constructor for `DataPageV2`. The 4th argument should be `num_rows`.

To reproduce:
Examining with parquet-cli:
"rows" should be 1.
Rewriting the file with parquet-mr gives:

% parquet-cli pages bug-mr.parquet
Column: col0.list.element
--------------------------------------------------------------------------------
page   type  enc  count   avg size   size   rows   nulls   min / max
0-0    data  _ D  3       5.00 B     15 B   1      0
Component(s)
C++, Parquet