Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-10493: [C++][Parquet] Fix offset lost in MaybeReplaceValidity #8589

Conversation

chrisavl
Copy link
Contributor

@chrisavl chrisavl commented Nov 4, 2020

This PR fixes the offset being lost when the validity bitmap is updated.

https://issues.apache.org/jira/browse/ARROW-10493

@github-actions
Copy link

github-actions bot commented Nov 4, 2020

return arrow::MakeArray(std::make_shared<ArrayData>(
array->type(), array->length(), std::move(buffers), new_null_count));
return arrow::MakeArray(std::make_shared<ArrayData>(array->type(), array->length(),
std::move(buffers),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch.

I think there is an edge case here. I think offset applies to the null buffer also, so this would cause an illegal access if any values are actually null. unfortunately, I think we need to slice all the buffers according to offset.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you are right the offset also applies to the null buffer. But I am not sure how to best slice the other buffers, since ArrayData::Slice won't help here and using SliceBuffer directly would require figuring out the physical offset from the logical offset in ArrayData. Any suggestions or pointers here would be appreciated!

Fow now, I pushed a not so elegant workaround that allocates a bitmap for the whole column instead of for only a batch_size. That way the offset for the bitmap and the other buffers should be the same, I think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I'm not sure I like that change that way, I'll try to make a new commit on your branch with something that I think will be cleaner (adding an API to Array to get the right sliced buffer.)

Copy link
Contributor

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see comments, the test one is more of a nit, but we should it would be useful to test with nulls.

@chrisavl chrisavl force-pushed the 10493-column-writer-nested-nullable-offset branch from 0c7798a to 7fcdb2d Compare November 6, 2020 05:12
@emkornfield
Copy link
Contributor

@chrisavl I added a new interface that you should be able to use for replacing the second buffer for arrow writing. I'm going to push one more commit which will add tests. Could you update your code to use it.

@bkietz could you review/check the interface code?

@emkornfield
Copy link
Contributor

just realized there are some constness issues I need to address.

@emkornfield
Copy link
Contributor

Okay, I'm done pushing. @chrisavl I think the new API should make the change for parquet_writer smaller (and probably more efficient). See unit tests added for intended usage. Note you probably need to check that array length > for assignment (we might need to do that for zero size also to test NA arrays).

cpp/src/parquet/arrow/arrow_reader_writer_test.cc Outdated Show resolved Hide resolved
cpp/src/parquet/column_writer.cc Outdated Show resolved Hide resolved
cpp/src/arrow/array/array_base.h Outdated Show resolved Hide resolved
@emkornfield
Copy link
Contributor

I will move the slice method into parquet, expect some pushes to the branch over the next couple of hours.

@emkornfield
Copy link
Contributor

@chrisavl I'm done pushing. I expanded the test case to try to cover all branches in the new slice method and reverted my previous changes. Let me know if the new code looks reasonable. @bkietz could you take a look also?

@chrisavl
Copy link
Contributor Author

@emkornfield thanks, LGTM!

yordan-pavlov pushed a commit to yordan-pavlov/arrow that referenced this pull request Nov 14, 2020
This PR fixes the offset being lost when the validity bitmap is updated.

https://issues.apache.org/jira/browse/ARROW-10493

Closes apache#8589 from chrisavl/10493-column-writer-nested-nullable-offset

Lead-authored-by: Christian Lundgren <chrisavl@users.noreply.github.com>
Co-authored-by: Micah Kornfield <emkornfield@gmail.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
This PR fixes the offset being lost when the validity bitmap is updated.

https://issues.apache.org/jira/browse/ARROW-10493

Closes apache#8589 from chrisavl/10493-column-writer-nested-nullable-offset

Lead-authored-by: Christian Lundgren <chrisavl@users.noreply.github.com>
Co-authored-by: Micah Kornfield <emkornfield@gmail.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants