Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] BooleanArray true_count crashing in case of unknown null count (null_count = -1) without validity buffer #41016

Closed
unj1m opened this issue Apr 4, 2024 · 13 comments
Assignees
Labels
Component: Python Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. Type: bug
Milestone

Comments

@unj1m
Copy link

unj1m commented Apr 4, 2024

Applying pyarrow.compute.and_ produces corrupted arrays that segfault when you try to get their true_count.

$ python
Python 3.10.12 (main, Apr  4 2024, 12:45:28) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow.compute
>>> a1 = pyarrow.array([True]*48 + [False]*48)
>>> a2 = pyarrow.array([True, False] * 48)
>>> pyarrow.compute.and_(a1, a2).true_count
Segmentation fault (core dumped)

There segfault started in pyarrow==9.0.0.

It doesn't happen in 7 or 8.

Component(s)

Python

@unj1m unj1m added the Type: bug label Apr 4, 2024
@jorisvandenbossche jorisvandenbossche changed the title pyarrow.compute.and_ produces corrupted arrow files [C++] BooleanArray true_count crashing in case of unknown null count (null_count = -1) without validity buffer Apr 8, 2024
@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Apr 8, 2024

@unj1m Thanks for the report!

The direct cause of this crash is because inside the true_count method, we are incorrectly checking the array's null_count attribute to take a different code path, but when this attribute is still set to "unknown" (-1), this leads us to a code path that actually accesses the non-existing validity bitmap buffer.

Now, an additional underlying issue is that the binary "and" kernel (and I assume this is true for all simple binary kernels) should not produce a result with the null_count set to -1 if it didn't need to allocate any validity bitmap. I seem to recall a previous issue where this also came up, but don't directly find it.

@unj1m
Copy link
Author

unj1m commented Apr 8, 2024

Thank you for getting to the bottom of this so quickly.

Are changes also needed for the simple binary kernals?

I wonder why this isn't affecting my Arrow Rust code, which I assume is calling the same C++ code. I guess arrow-rs re-implements some functions.

@jorisvandenbossche
Copy link
Member

I wonder why this isn't affecting my Arrow Rust code, which I assume is calling the same C++ code. I guess arrow-rs re-implements some functions.

Arrow C++ (on which pyarrow is based) and Arrow Rust are two completely independent implementations, so that is expected

jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Apr 9, 2024
@jorisvandenbossche jorisvandenbossche added this to the 16.0.0 milestone Apr 9, 2024
@rohanjain101
Copy link

Is #38553 a similar issue? The error in that case is also related to nulls.

@felipecrv
Copy link
Contributor

After working for a while on the Arrow codebase, I can say that there is a lot of code that assumes the following (undocumented) invariant:

null_count can only transition from -1 to a value different from 0 if and only if the bitmap buffer present

another way to put it:

well-formed arrays that have null_count != -1 are that way for only one reason: length > 0 and there is a bitmap that must be scanned to get to the real value of null_count

The fact that this invariant is undocumented means there are defensive coding here and there to protect against violations:

bool MayHaveNulls() const {
// If an ArrayData is slightly malformed it may have kUnknownNullCount set
// but no buffer
return null_count.load() != 0 && buffers[0] != NULLPTR;
}

...but there is also code that carefully guarantees that it's preserved in constructors:

static inline void AdjustNonNullable(Type::type type_id, int64_t length,
std::vector<std::shared_ptr<Buffer>>* buffers,
int64_t* null_count) {
if (type_id == Type::NA) {
*null_count = length;
(*buffers)[0] = nullptr;
} else if (internal::HasValidityBitmap(type_id)) {
if (*null_count == 0) {
// In case there are no nulls, don't keep an allocated null bitmap around
(*buffers)[0] = nullptr;
} else if (*null_count == kUnknownNullCount && buffers->at(0) == nullptr) {
// Conversely, if no null bitmap is provided, set the null count to 0
*null_count = 0;
}
} else {
*null_count = 0;
}
}

@felipecrv
Copy link
Contributor

This discussion triggered me to create this issue: #41113

@felipecrv
Copy link
Contributor

felipecrv commented Apr 10, 2024

Is #38553 a similar issue? The error in that case is also related to nulls.

Yes. And the immediate fix is similar to what I recommended in @jorisvandenbossche PR:

Replace data->null_count != 0 checks with data->MayHaveNulls() checks. This is the code:

if (pair_data->null_count != 0) {
return Status::Invalid("Map array child array should have no nulls");
}
if (pair_data->child_data.size() != 2) {
return Status::Invalid("Map array child array should have two fields");
}
if (pair_data->child_data[0]->null_count != 0) {
return Status::Invalid("Map array keys array should have no nulls");
}

@jorisvandenbossche there needs to be an issue about investigating why pyarrow is producing these malformed arrays. Because it's unlikely that this was the case for too long and not one hit these crashes before.

EDIT: nevermind, the second issue is from November.

jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Apr 10, 2024
jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Apr 11, 2024
pitrou pushed a commit to jorisvandenbossche/arrow that referenced this issue Apr 15, 2024
pitrou pushed a commit that referenced this issue Apr 15, 2024
…1070)

### Rationale for this change

Loading the `null_count` attribute doesn't take into account the possible value of -1, leading to a code path where the validity buffer is accessed, but which is not necessarily present in that case.

### What changes are included in this PR?

Use `data->MayHaveNulls()` instead of `data->null_count.load()`

### Are these changes tested?

Yes

* GitHub Issue: #41016

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pitrou
Copy link
Member

pitrou commented Apr 15, 2024

Issue resolved by pull request 41070
#41070

@pitrou pitrou closed this as completed Apr 15, 2024
raulcd pushed a commit that referenced this issue Apr 15, 2024
…1070)

### Rationale for this change

Loading the `null_count` attribute doesn't take into account the possible value of -1, leading to a code path where the validity buffer is accessed, but which is not necessarily present in that case.

### What changes are included in this PR?

Use `data->MayHaveNulls()` instead of `data->null_count.load()`

### Are these changes tested?

Yes

* GitHub Issue: #41016

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@unj1m
Copy link
Author

unj1m commented Apr 15, 2024

Any idea when this will be released? 😃

@raulcd
Copy link
Member

raulcd commented Apr 15, 2024

This will be released with 16.0.0. The 16.0.0 has been on feature freeze for a week and we are still fixing some CI jobs and bugs. If things go as planned we probably have a Release Candidate by the end of the week. We should have have a release in ~1-2 weeks.

@unj1m
Copy link
Author

unj1m commented Apr 15, 2024

Cool, Thanks.

@amoeba amoeba added the Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. label Apr 27, 2024
tolleybot pushed a commit to tmct/arrow that referenced this issue May 2, 2024
…() (apache#41070)

### Rationale for this change

Loading the `null_count` attribute doesn't take into account the possible value of -1, leading to a code path where the validity buffer is accessed, but which is not necessarily present in that case.

### What changes are included in this PR?

Use `data->MayHaveNulls()` instead of `data->null_count.load()`

### Are these changes tested?

Yes

* GitHub Issue: apache#41016

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@kyle-hex
Copy link

kyle-hex commented May 13, 2024

Unfortunately, my coding environment won't allow me to upgrade to pyarrow v 16. Is there any workaround for this that doesn't involve iterating over the entire array and testing each value individually?

edit: I am using the compute.invert method to grab the false_count, which i'm treating as "true". if anyone has a better idea, i would appreciate it.

thanks

@jorisvandenbossche
Copy link
Member

Is there any workaround for this that doesn't involve iterating over the entire array and testing each value individually?

A workaround to avoid the segfault is adding a call to null_count before calling true_count. Editing the original example, the below does not crash on pyarrow 15.0:

import pyarrow.compute
a1 = pyarrow.array([True]*48 + [False]*48)
a2 = pyarrow.array([True, False] * 48)
res = pyarrow.compute.and_(a1, a2)
# call null_count to populate it before calling true_count
res.null_count
true_count = res.true_count

vibhatha pushed a commit to vibhatha/arrow that referenced this issue May 25, 2024
…() (apache#41070)

### Rationale for this change

Loading the `null_count` attribute doesn't take into account the possible value of -1, leading to a code path where the validity buffer is accessed, but which is not necessarily present in that case.

### What changes are included in this PR?

Use `data->MayHaveNulls()` instead of `data->null_count.load()`

### Are these changes tested?

Yes

* GitHub Issue: apache#41016

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Python Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. Type: bug
Projects
None yet
Development

No branches or pull requests

8 participants