[C++] BooleanArray true_count crashing in case of unknown null count (null_count = -1) without validity buffer #41016
Comments
@unj1m Thanks for the report! The direct cause of this crash is because inside the Now, an additional underlying issue is that the binary "and" kernel (and I assume this is true for all simple binary kernels) should not produce a result with the |
Thank you for getting to the bottom of this so quickly. Are changes also needed for the simple binary kernels? I wonder why this isn't affecting my Arrow Rust code, which I assumed was calling the same C++ code. I guess arrow-rs re-implements some functions. |
Arrow C++ (on which pyarrow is based) and Arrow Rust are two completely independent implementations, so that is expected |
Is #38553 a similar issue? The error in that case is also related to nulls. |
After working for a while on the Arrow codebase, I can say that there is a lot of code that assumes the following (undocumented) invariant:
another way to put it:
The fact that this invariant is undocumented means there is defensive coding here and there to protect against violations: arrow/cpp/src/arrow/array/data.h Lines 287 to 291 in cd607d0
...but there is also code that carefully guarantees that it's preserved in constructors: arrow/cpp/src/arrow/array/data.cc Lines 50 to 67 in cd607d0
|
This discussion triggered me to create this issue: #41113 |
Yes. And the immediate fix is similar to what I recommended in @jorisvandenbossche's PR: Replace arrow/cpp/src/arrow/array/array_nested.cc Lines 835 to 843 in cd607d0
@jorisvandenbossche EDIT: nevermind, the second issue is from November. |
…1070)

### Rationale for this change

Loading the `null_count` attribute doesn't take into account the possible value of -1, leading to a code path where the validity buffer is accessed, but which is not necessarily present in that case.

### What changes are included in this PR?

Use `data->MayHaveNulls()` instead of `data->null_count.load()`

### Are these changes tested?

Yes

* GitHub Issue: #41016

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
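The effect of that change can be illustrated with a toy model in plain Python. This is only a sketch, not Arrow's code: `ToyArrayData`, `true_count_buggy` and `true_count_fixed` are hypothetical names, and lists of booleans stand in for the bitmap buffers. It assumes that `MayHaveNulls()` returns true only when the null count is non-zero and a validity buffer is actually present.

```python
from dataclasses import dataclass
from typing import List, Optional

UNKNOWN_NULL_COUNT = -1  # sentinel meaning "null count not computed yet"


@dataclass
class ToyArrayData:
    """Hypothetical stand-in for an array's data; not Arrow's ArrayData."""
    values: List[bool]
    validity: Optional[List[bool]]  # None models a missing validity buffer
    null_count: int = UNKNOWN_NULL_COUNT

    def may_have_nulls(self) -> bool:
        # Assumed behaviour: nulls are possible only if a validity buffer exists.
        return self.null_count != 0 and self.validity is not None


def true_count_buggy(data: ToyArrayData) -> int:
    if data.null_count != 0:
        # -1 also lands here. With validity=None, zip() raises TypeError,
        # which plays the role of the null-pointer access in the real crash.
        return sum(v and ok for v, ok in zip(data.values, data.validity))
    return sum(data.values)


def true_count_fixed(data: ToyArrayData) -> int:
    if data.may_have_nulls():
        return sum(v and ok for v, ok in zip(data.values, data.validity))
    # No validity buffer (or known zero nulls): count the values directly.
    return sum(data.values)
```

With an unknown null count and no validity buffer, `true_count_buggy` fails while `true_count_fixed` simply counts the set values; both agree when a validity buffer is present.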
Issue resolved by pull request 41070 |
Any idea when this will be released? 😃 |
This will be released with 16.0.0. The 16.0.0 release has been in feature freeze for a week and we are still fixing some CI jobs and bugs. If things go as planned, we will probably have a Release Candidate by the end of the week. We should have a release in ~1-2 weeks. |
Cool, Thanks. |
Unfortunately, my coding environment won't allow me to upgrade to pyarrow v16. Is there any workaround for this that doesn't involve iterating over the entire array and testing each value individually? Edit: I am using the compute.invert method to grab the false_count, which I'm treating as "true". If anyone has a better idea, I would appreciate it. Thanks |
A workaround to avoid the segfault is adding a call to `null_count` before accessing `true_count`:
import pyarrow
import pyarrow.compute
a1 = pyarrow.array([True]*48 + [False]*48)
a2 = pyarrow.array([True, False] * 48)
res = pyarrow.compute.and_(a1, a2)
# access null_count first so the cached value is populated before true_count reads it
res.null_count
true_count = res.true_count
|
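Why reading `null_count` first helps can be sketched with a toy model in plain Python. The names here (`ToyArray`, `_null_count`) are hypothetical and this is not pyarrow's implementation. The sketch assumes the null count is computed lazily and cached on first access, so that reading it once replaces the -1 sentinel with 0 for an array without a validity buffer, and a later "non-zero null count" check no longer takes the validity-buffer path.

```python
from typing import List, Optional

UNKNOWN_NULL_COUNT = -1  # sentinel meaning "not computed yet"


class ToyArray:
    """Hypothetical model of an array with a lazily cached null count."""

    def __init__(self, values: List[bool], validity: Optional[List[bool]] = None):
        self.values = values
        self.validity = validity  # None models a missing validity buffer
        self._null_count = UNKNOWN_NULL_COUNT

    @property
    def null_count(self) -> int:
        # Compute once, then reuse the cached value on later accesses.
        if self._null_count == UNKNOWN_NULL_COUNT:
            if self.validity is None:
                self._null_count = 0
            else:
                self._null_count = self.validity.count(False)
        return self._null_count
```

After one read of `null_count`, the cached value is 0 rather than -1 for an array that has no validity buffer.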
Applying `pyarrow.compute.and_` produces corrupted arrays that segfault when you try to get their `true_count`.
The segfault started in pyarrow==9.0.0.
It doesn't happen in 7 or 8.
Component(s)
Python