New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-10493: [C++][Parquet] Fix offset lost in MaybeReplaceValidity #8589
ARROW-10493: [C++][Parquet] Fix offset lost in MaybeReplaceValidity #8589
Conversation
cpp/src/parquet/column_writer.cc
Outdated
return arrow::MakeArray(std::make_shared<ArrayData>( | ||
array->type(), array->length(), std::move(buffers), new_null_count)); | ||
return arrow::MakeArray(std::make_shared<ArrayData>(array->type(), array->length(), | ||
std::move(buffers), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch.
I think there is an edge case here. I think offset applies to the null buffer also, so this would cause an illegal access if any values are actually null. unfortunately, I think we need to slice all the buffers according to offset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, you are right the offset also applies to the null buffer. But I am not sure how to best slice the other buffers, since ArrayData::Slice won't help here and using SliceBuffer directly would require figuring out the physical offset from the logical offset in ArrayData. Any suggestions or pointers here would be appreciated!
Fow now, I pushed a not so elegant workaround that allocates a bitmap for the whole column instead of for only a batch_size. That way the offset for the bitmap and the other buffers should be the same, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I'm not sure I like that change that way, I'll try to make a new commit on your branch with something that I think will be cleaner (adding an API to Array to get the right sliced buffer.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see comments, the test one is more of a nit, but we should it would be useful to test with nulls.
0c7798a
to
7fcdb2d
Compare
just realized there are some constness issues I need to address. |
Okay, I'm done pushing. @chrisavl I think the new API should make the change for parquet_writer smaller (and probably more efficient). See unit tests added for intended usage. Note you probably need to check that array length > for assignment (we might need to do that for zero size also to test NA arrays). |
I will move the slice method into parquet, expect some pushes to the branch over the next couple of hours. |
This reverts commit 6a00509.
@emkornfield thanks, LGTM! |
This PR fixes the offset being lost when the validity bitmap is updated. https://issues.apache.org/jira/browse/ARROW-10493 Closes apache#8589 from chrisavl/10493-column-writer-nested-nullable-offset Lead-authored-by: Christian Lundgren <chrisavl@users.noreply.github.com> Co-authored-by: Micah Kornfield <emkornfield@gmail.com> Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
This PR fixes the offset being lost when the validity bitmap is updated. https://issues.apache.org/jira/browse/ARROW-10493 Closes apache#8589 from chrisavl/10493-column-writer-nested-nullable-offset Lead-authored-by: Christian Lundgren <chrisavl@users.noreply.github.com> Co-authored-by: Micah Kornfield <emkornfield@gmail.com> Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
This PR fixes the offset being lost when the validity bitmap is updated.
https://issues.apache.org/jira/browse/ARROW-10493