Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-2744: [C++] Avoid creating list arrays with a null values buffer #2243

Closed
wants to merge 1 commit into from

Conversation

Projects
None yet
3 participants
@pitrou
Copy link
Contributor

commented Jul 10, 2018

No description provided.

@codecov-io

This comment has been minimized.

Copy link

commented Jul 10, 2018

Codecov Report

Merging #2243 into master will increase coverage by 2.44%.
The diff coverage is 54.05%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2243      +/-   ##
==========================================
+ Coverage   84.34%   86.78%   +2.44%     
==========================================
  Files         281      237      -44     
  Lines       43760    41931    -1829     
==========================================
- Hits        36909    36390     -519     
+ Misses       6820     5541    -1279     
+ Partials       31        0      -31
Impacted Files Coverage Δ
cpp/src/arrow/array-test.cc 100% <100%> (ø) ⬆️
python/pyarrow/tests/test_parquet.py 96.97% <100%> (+0.01%) ⬆️
cpp/src/arrow/ipc/test-common.h 93.13% <100%> (+0.01%) ⬆️
cpp/src/arrow/builder.cc 79.76% <100%> (+0.09%) ⬆️
cpp/src/arrow/array.cc 85.97% <37.03%> (-1.37%) ⬇️
go/arrow/internal/testing/tools/bits.go
go/arrow/internal/bitutil/bitutil.go
go/arrow/memory/checked_allocator.go
go/arrow/datatype_numeric.gen.go
... and 40 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d42a7d7...715189a. Read the comment docs.

@wesm
Copy link
Member

left a comment

As a matter of usability, I don't know that we should expect all public API users to create a length-0 buffer when there is no data. I believe that code that interacts with the buffers in an array needs to treat length-0 and null equivalently.

A possibly extreme approach to resolve the issue would be to have a length-0 singleton kNullBuffer. But to really sanitize well, we would need to check every buffer going into ArrayData::Make, which would have negative performance implications. We could do the null check in the array containers, e.g. right here: https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L350.

Allowing the result of values() to be null means more tests to write to make sure that code doesn't break in that edge case.

Another side of this is that for validity bitmaps, it would be incorrect to return a length-0 buffer in the event that there are no nulls, but right now we permit that buffer to be null. In Java they allocate an array of all set bits, which I don't think we should do. So any way you slice it, some code will have to deal with the null buffer case.

My gut feeling is that we should allow the null buffers and document the issue well so that users can defend themselves from untrusted data.

@pitrou

This comment has been minimized.

Copy link
Contributor Author

commented Jul 19, 2018

@wesm

As a matter of usability, I don't know that we should expect all public API users to create a length-0 buffer when there is no data.

I don't think this PR is doing that, except in ValidateArray (which can actually be helpful for users encountering this kind of issue).

I believe that code that interacts with the buffers in an array needs to treat length-0 and null equivalently.

I agree. The parquet-cpp issue was already fixed in apache/parquet-cpp#474

However, I think it's also safer to ensure that we don't generate such buffers unwillingly. I don't think it was deliberate for ListBuilder and StructBuilder to create null buffers for arrays with 0 child values.

So any way you slice it, some code will have to deal with the null buffer case.

Yes, I agree for validity bitmaps code will have to deal with it. For actual values it is a bit unexpected, though (at least the person who wrote the parquet-cpp code clearly didn't expect it :-)).

My gut feeling is that we should allow the null buffers and document the issue well so that users can defend themselves from untrusted data.

Where would you document it? in ArrayData?

@wesm

This comment has been minimized.

Copy link
Member

commented Jul 19, 2018

I don't think this PR is doing that, except in ValidateArray (which can actually be helpful for users encountering this kind of issue).

What would you say to adding an option to ValidateArray about checking for null buffers? I agree that being able to easily check is valuable, even if it isn't strictly "wrong".

However, I think it's also safer to ensure that we don't generate such buffers unwillingly. I don't think it was deliberate for ListBuilder and StructBuilder to create null buffers for arrays with 0 child values.

Agreed

Where would you document it? in ArrayData?

I think in the APIs where std::shared_ptr<Buffer> is returned, like PrimitiveArray<T>::values(). These are already "advanced" APIs in that they do not account for a slice offset

@wesm

This comment has been minimized.

Copy link
Member

commented Jul 19, 2018

On this

What would you say to adding an option to ValidateArray about checking for null buffers? I agree that being able to easily check is valuable, even if it isn't strictly "wrong".

I'm OK with doing this later. I'll give this another review and merge since I don't think it does anything problematic

@wesm

This comment has been minimized.

Copy link
Member

commented Jul 19, 2018

Shoot I think we need to get ARROW-2822 #2239 in first. Let me see if that's ready to merge

@pitrou

This comment has been minimized.

Copy link
Contributor Author

commented Jul 23, 2018

This has conflicts now, I'm gonna rebase.

@pitrou pitrou force-pushed the pitrou:ARROW-2744 branch from 6136ecd to 715189a Jul 23, 2018

@wesm

wesm approved these changes Jul 23, 2018

Copy link
Member

left a comment

+1. Thanks @pitrou!

@wesm wesm closed this in 491114b Jul 23, 2018

@pitrou pitrou deleted the pitrou:ARROW-2744 branch Jul 24, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.