[C++] BooleanArray true_count crashing in case of unknown null count (null_count = -1) without validity buffer #41016

unj1m · 2024-04-04T19:37:34Z

Applying pyarrow.compute.and_ produces corrupted arrays that segfault when you try to get their true_count.

$ python
Python 3.10.12 (main, Apr  4 2024, 12:45:28) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow.compute
>>> a1 = pyarrow.array([True]*48 + [False]*48)
>>> a2 = pyarrow.array([True, False] * 48)
>>> pyarrow.compute.and_(a1, a2).true_count
Segmentation fault (core dumped)

The segfault started in pyarrow==9.0.0.

It doesn't happen in 7 or 8.

Component(s)

Python

jorisvandenbossche · 2024-04-08T11:39:05Z

@unj1m Thanks for the report!

The direct cause of this crash is inside the true_count method: it checks the array's null_count attribute to choose a code path, but does not account for the attribute still being set to "unknown" (-1). In that case it takes the code path that reads the validity bitmap buffer, which does not exist here.
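To make the failure mode concrete, here is a minimal pure-Python sketch of the two checks. It is only a toy model: `ToyArrayData`, `true_count_buggy`, `true_count_fixed` and `may_have_nulls` are hypothetical names invented for illustration, not pyarrow or Arrow C++ APIs, and a Python `TypeError` stands in for the segfault.

```python
from dataclasses import dataclass
from typing import List, Optional

UNKNOWN_NULL_COUNT = -1  # mirrors the "unknown" sentinel described above


@dataclass
class ToyArrayData:
    """Hypothetical stand-in for an Arrow boolean array's data (not a real API)."""
    values: List[bool]                # one value per slot
    validity: Optional[List[bool]]    # one flag per slot, or None if no bitmap exists
    null_count: int = UNKNOWN_NULL_COUNT


def true_count_buggy(data: ToyArrayData) -> int:
    # Flawed check: -1 is also "!= 0", so this branch runs without a bitmap.
    if data.null_count != 0:
        # With validity=None, zip() raises TypeError (the toy analogue of the crash).
        return sum(v and ok for v, ok in zip(data.values, data.validity))
    return sum(data.values)


def may_have_nulls(data: ToyArrayData) -> bool:
    # Safe check: nulls are only possible when a bitmap is actually present.
    return data.null_count != 0 and data.validity is not None


def true_count_fixed(data: ToyArrayData) -> int:
    if may_have_nulls(data):
        return sum(v and ok for v, ok in zip(data.values, data.validity))
    return sum(data.values)
```

Feeding both functions a `ToyArrayData` with `validity=None` and the default `null_count` of -1 shows the difference: the buggy variant raises, while the fixed variant simply counts the true values.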

Now, an additional underlying issue is that the binary "and" kernel (and I assume this is true for all simple binary kernels) should not produce a result with null_count set to -1 if it didn't need to allocate a validity bitmap. I seem to recall a previous issue where this also came up, but I can't find it right now.

unj1m · 2024-04-08T14:53:34Z

Thank you for getting to the bottom of this so quickly.

Are changes also needed for the simple binary kernels?

I wonder why this isn't affecting my Arrow Rust code, which I assume is calling the same C++ code. I guess arrow-rs re-implements some functions.

jorisvandenbossche · 2024-04-09T06:39:00Z

> I wonder why this isn't affecting my Arrow Rust code, which I assume is calling the same C++ code. I guess arrow-rs re-implements some functions.

Arrow C++ (on which pyarrow is based) and Arrow Rust are two completely independent implementations, so that is expected.

rohanjain101 · 2024-04-09T18:11:15Z

Is #38553 a similar issue? The error in that case is also related to nulls.

felipecrv · 2024-04-10T01:20:29Z

After working for a while on the Arrow codebase, I can say that there is a lot of code that assumes the following (undocumented) invariant:

null_count can transition from -1 to a value different from 0 only if the bitmap buffer is present

another way to put it:

well-formed arrays that have null_count == -1 are that way for only one reason: length > 0 and there is a bitmap that must be scanned to get to the real value of null_count

The fact that this invariant is undocumented means there is defensive coding here and there to protect against violations:

arrow/cpp/src/arrow/array/data.h

Lines 287 to 291 in cd607d0

    
    bool MayHaveNulls() const {
      // If an ArrayData is slightly malformed it may have kUnknownNullCount set
      // but no buffer
      return null_count.load() != 0 && buffers[0] != NULLPTR;
    }

...but there is also code that carefully guarantees that it's preserved in constructors:

arrow/cpp/src/arrow/array/data.cc

Lines 50 to 67 in cd607d0

    
    static inline void AdjustNonNullable(Type::type type_id, int64_t length,
                                         std::vector<std::shared_ptr<Buffer>>* buffers,
                                         int64_t* null_count) {
      if (type_id == Type::NA) {
        *null_count = length;
        (*buffers)[0] = nullptr;
      } else if (internal::HasValidityBitmap(type_id)) {
        if (*null_count == 0) {
          // In case there are no nulls, don't keep an allocated null bitmap around
          (*buffers)[0] = nullptr;
        } else if (*null_count == kUnknownNullCount && buffers->at(0) == nullptr) {
          // Conversely, if no null bitmap is provided, set the null count to 0
          *null_count = 0;
        }
      } else {
        *null_count = 0;
      }
    }
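For readers who do not work in C++, the normalization performed by the quoted AdjustNonNullable can be paraphrased in Python roughly as follows. This is a sketch only: `adjust_non_nullable` and its keyword flags are hypothetical names chosen for illustration, a plain list or `None` stands in for the validity buffer, and nothing here is part of pyarrow.

```python
from typing import List, Optional, Tuple

UNKNOWN_NULL_COUNT = -1  # stands in for kUnknownNullCount


def adjust_non_nullable(
    length: int,
    validity: Optional[List[bool]],
    null_count: int,
    *,
    is_null_type: bool = False,         # models type_id == Type::NA
    has_validity_bitmap: bool = True,   # models internal::HasValidityBitmap(type_id)
) -> Tuple[Optional[List[bool]], int]:
    """Return the (validity, null_count) pair after normalization."""
    if is_null_type:
        # A null-typed array is all nulls and never carries a bitmap.
        return None, length
    if has_validity_bitmap:
        if null_count == 0:
            # No nulls: drop any allocated bitmap.
            return None, 0
        if null_count == UNKNOWN_NULL_COUNT and validity is None:
            # No bitmap to scan, so the count cannot be anything but 0.
            return None, 0
        return validity, null_count
    # Types without a validity bitmap report zero nulls; the buffer slot is untouched.
    return validity, 0
```

With this normalization, the combination "unknown null count, no bitmap" never survives construction, which is the invariant described above. For example, `adjust_non_nullable(96, None, UNKNOWN_NULL_COUNT)` yields a null count of 0 rather than -1.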

felipecrv · 2024-04-10T01:24:58Z

This discussion triggered me to create this issue: #41113

felipecrv · 2024-04-10T01:46:56Z

> Is #38553 a similar issue? The error in that case is also related to nulls.

Yes. And the immediate fix is similar to what I recommended in @jorisvandenbossche's PR:

Replace data->null_count != 0 checks with data->MayHaveNulls() checks. This is the code:

arrow/cpp/src/arrow/array/array_nested.cc

Lines 835 to 843 in cd607d0

    
    if (pair_data->null_count != 0) {
      return Status::Invalid("Map array child array should have no nulls");
    }
    if (pair_data->child_data.size() != 2) {
      return Status::Invalid("Map array child array should have two fields");
    }
    if (pair_data->child_data[0]->null_count != 0) {
      return Status::Invalid("Map array keys array should have no nulls");
    }

@jorisvandenbossche ~~there needs to be an issue about investigating why pyarrow is producing these malformed arrays. Because it's unlikely that this was the case for too long and no one hit these crashes before.~~

EDIT: never mind, the second issue is from November.

…1070)

### Rationale for this change

Loading the `null_count` attribute doesn't take into account the possible value of -1, leading to a code path where the validity buffer is accessed, but which is not necessarily present in that case.

### What changes are included in this PR?

Use `data->MayHaveNulls()` instead of `data->null_count.load()`

### Are these changes tested?

Yes

* GitHub Issue: #41016

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>

pitrou · 2024-04-15T15:30:34Z

Issue resolved by pull request 41070
#41070

unj1m · 2024-04-15T15:59:10Z

Any idea when this will be released? 😃

raulcd · 2024-04-15T16:04:02Z

This will be released with 16.0.0. The 16.0.0 release has been in feature freeze for a week and we are still fixing some CI jobs and bugs. If things go as planned we will probably have a Release Candidate by the end of the week. We should have a release in ~1-2 weeks.

unj1m · 2024-04-15T16:16:45Z

Cool, Thanks.

kyle-hex · 2024-05-13T21:54:43Z

Unfortunately, my coding environment won't allow me to upgrade to pyarrow 16. Is there any workaround for this that doesn't involve iterating over the entire array and testing each value individually?

Edit: I am using the compute.invert method to grab the false_count, which I'm treating as "true". If anyone has a better idea, I would appreciate it.

Thanks

jorisvandenbossche · 2024-05-14T07:24:10Z

> Is there any workaround for this that doesn't involve iterating over the entire array and testing each value individually?

A workaround to avoid the segfault is adding a call to null_count before calling true_count. Editing the original example, the below does not crash on pyarrow 15.0:

import pyarrow.compute
a1 = pyarrow.array([True]*48 + [False]*48)
a2 = pyarrow.array([True, False] * 48)
res = pyarrow.compute.and_(a1, a2)
# call null_count to populate it before calling true_count
res.null_count
true_count = res.true_count
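If the workaround is needed in several places, it could be wrapped in a small helper. The sketch below is an assumption-laden convenience, not part of pyarrow: `safe_true_count` is a hypothetical name, and it relies only on the behaviour described above, namely that reading `null_count` first populates it so that the later `true_count` call does not crash.

```python
def safe_true_count(arr):
    """Read null_count before true_count (hypothetical helper, not a pyarrow API).

    `arr` is expected to be any object exposing `null_count` and `true_count`
    attributes, such as the result of pyarrow.compute.and_ in the example above.
    """
    # Accessing null_count is what populates the cached value; the result
    # itself is not needed here.
    _ = arr.null_count
    return arr.true_count
```

With the example above, this would be called as `safe_true_count(pyarrow.compute.and_(a1, a2))`. On pyarrow 16.0.0 and later the helper should be unnecessary, since the fix is included there.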

unj1m added the Type: bug label Apr 4, 2024

github-actions bot added the Component: Python label Apr 4, 2024

jorisvandenbossche changed the title ~~pyarrow.compute.and_ produces corrupted arrow files~~ [C++] BooleanArray true_count crashing in case of unknown null count (null_count = -1) without validity buffer Apr 8, 2024

jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Apr 8, 2024

apacheGH-41016: [C++] Fix null count check in BooleanArray.true_count()

79d9b3c

github-actions bot mentioned this issue Apr 8, 2024

GH-41016: [C++] Fix null count check in BooleanArray.true_count() #41070

Merged

github-actions bot assigned jorisvandenbossche Apr 8, 2024

jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Apr 9, 2024

Merge remote-tracking branch 'upstream/main' into apachegh-41016

4cc4aba

jorisvandenbossche added this to the 16.0.0 milestone Apr 9, 2024

jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Apr 10, 2024

Merge remote-tracking branch 'upstream/main' into apachegh-41016

2ee2c99

jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Apr 11, 2024

Merge remote-tracking branch 'upstream/main' into apachegh-41016

93b4988

pitrou pushed a commit to jorisvandenbossche/arrow that referenced this issue Apr 15, 2024

apacheGH-41016: [C++] Fix null count check in BooleanArray.true_count()

0996ef8

pitrou closed this as completed Apr 15, 2024

amoeba added the Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. label Apr 27, 2024

amoeba mentioned this issue May 13, 2024

pyarrow.lib.BooleanArray.true_count causing kernel to die in Jupyter #41642

Closed
