Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python] pa.compute.is_null() returns incorrect answer for dense union arrays and segfaults for dense union scalars #34315

Open
dannygoldstein opened this issue Feb 23, 2023 · 3 comments · May be fixed by #35036 or #37642

Comments

@dannygoldstein
Copy link

dannygoldstein commented Feb 23, 2023

Describe the bug, including details regarding any error messages, version, and platform.

In pyarrow version 11.0.0 and 10.0.1, if I create a dense array with some null elements, pa.compute.is_null() returns that they are not null. Repro:

import pyarrow as pa
types = pa.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], type=pa.int8())
value_offsets = pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0], type=pa.int32())
array1 = pa.array([1, 2, 3, 1, None, 3, None, None, None, None, 1])
array2 = pa.array(["b"])
dense = pa.UnionArray.from_dense(types, value_offsets, [array1, array2])
print(dense)

# <pyarrow.lib.UnionArray object at 0x285dfbc40>
# -- is_valid: all not null
# -- type_ids:   [
#     0,
#     0,
#     0,
#     0,
#     0,
#     0,
#     0,
#     0,
#     0,
#     0,
#     0,
#     1
#   ]
# -- value_offsets:   [
#     0,
#     1,
#     2,
#     3,
#     4,
#     5,
#     6,
#     7,
#     8,
#     9,
#     10,
#     0
#   ]
# -- child 0 type: int64
#   [
#     1,
#     2,
#     3,
#     1,
#     null,
#     3,
#     null,
#     null,
#     null,
#     null,
#     1
#   ]
# -- child 1 type: string
#   [
#     "b"
#   ]

Illustration of the first issue:

pa.compute.is_null(dense)
# expected: BooleanArray [false, false, false, false, true, false, true, true, true, true, false, false]
# actual: 
# <pyarrow.lib.BooleanArray object at 0x285dfbdc0>
# [
#   false,
#   false,
#   false,
#   false,
#   false,
#   false,
#   false,
#   false,
#   false,
#   false,
#   false,
#   false
# ]

Illustration of the second issue:
I do pa.compute.is_null() on a null element of the array, i get a segfault:

null_element = dense[4]
print(null_element)
# <pyarrow.UnionScalar: None>
pa.compute.is_null(null_element)
# Fatal Python error: Segmentation fault

Component(s)

Python

@jorisvandenbossche jorisvandenbossche changed the title pa.compute.is_null() returns incorrect answer for dense union arrays and segfaults for dense union scalars [Python] pa.compute.is_null() returns incorrect answer for dense union arrays and segfaults for dense union scalars Feb 24, 2023
@jorisvandenbossche jorisvandenbossche added this to the 12.0.0 milestone Feb 24, 2023
@jorisvandenbossche
Copy link
Member

@dannygoldstein thanks for the report! That's indeed a bug. The problem is that for a union array, there is no top-level validity, but this is defined by the validity bitmaps of its child arrays. But the is_null kernel should take this into account, which doesn't seem to happen.

The problem is also that the null_count attribute is already wrong (and it might be that is_null is taking a shortcut because of that):

>>> dense.null_count
0

@dannygoldstein
Copy link
Author

thanks for the quick response @jorisvandenbossche! and thanks also for all the great work on arrow. it is an awesome package :)

@westonpace
Copy link
Member

Looking at the kernel it seems both problems are there. It does indeed shortcut based on null_count and, even if it didn't, there is no special logic for unions (it just grabs the validity bitmap).

zeroshade pushed a commit that referenced this issue Apr 6, 2023
…itmaps like Unions and Run-End Encoded (#34408)

Bonus: add `ArrayData::IsValid()` to make it consistent with `Array` and `ArraySpan`.

### Rationale for this change

This is the proposed fix to #34361 plus the addition of more APIs dealing with validity/nullity.

### What changes are included in this PR?

This PR changes the behavior of `IsNull` and `IsValid` in `Array`, `ArrayData`, and `ArraySpan`.

It preserves the behavior of `MayHaveNulls`, `GetNullCount` and introduces new member functions to `Array`, `ArrayData`, and `ArraySpan`:

 - `bool HasValidityBitmap() const`
 - `bool MayHaveLogicalNulls() const`
 - `int64_t ComputeLogicalNullCount() const`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes. See above.

Breakage with these changes can only happen if users rely on `IsNull(i)` always returning `true` for union types, but we have users reporting that the current behavior or broken #34315. This is why the behavior of `IsNull` and `IsValid` is changing.,

**This PR contains a "Critical Fix".**
* Closes: #34361

Authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Apr 11, 2023
jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Apr 13, 2023
@raulcd raulcd modified the milestones: 12.0.0, 13.0.0 Apr 20, 2023
jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Apr 21, 2023
jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue May 2, 2023
liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this issue May 11, 2023
…hout bitmaps like Unions and Run-End Encoded (apache#34408)

Bonus: add `ArrayData::IsValid()` to make it consistent with `Array` and `ArraySpan`.

### Rationale for this change

This is the proposed fix to apache#34361 plus the addition of more APIs dealing with validity/nullity.

### What changes are included in this PR?

This PR changes the behavior of `IsNull` and `IsValid` in `Array`, `ArrayData`, and `ArraySpan`.

It preserves the behavior of `MayHaveNulls`, `GetNullCount` and introduces new member functions to `Array`, `ArrayData`, and `ArraySpan`:

 - `bool HasValidityBitmap() const`
 - `bool MayHaveLogicalNulls() const`
 - `int64_t ComputeLogicalNullCount() const`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes. See above.

Breakage with these changes can only happen if users rely on `IsNull(i)` always returning `true` for union types, but we have users reporting that the current behavior or broken apache#34315. This is why the behavior of `IsNull` and `IsValid` is changing.,

**This PR contains a "Critical Fix".**
* Closes: apache#34361

Authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this issue May 15, 2023
…hout bitmaps like Unions and Run-End Encoded (apache#34408)

Bonus: add `ArrayData::IsValid()` to make it consistent with `Array` and `ArraySpan`.

### Rationale for this change

This is the proposed fix to apache#34361 plus the addition of more APIs dealing with validity/nullity.

### What changes are included in this PR?

This PR changes the behavior of `IsNull` and `IsValid` in `Array`, `ArrayData`, and `ArraySpan`.

It preserves the behavior of `MayHaveNulls`, `GetNullCount` and introduces new member functions to `Array`, `ArrayData`, and `ArraySpan`:

 - `bool HasValidityBitmap() const`
 - `bool MayHaveLogicalNulls() const`
 - `int64_t ComputeLogicalNullCount() const`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes. See above.

Breakage with these changes can only happen if users rely on `IsNull(i)` always returning `true` for union types, but we have users reporting that the current behavior or broken apache#34315. This is why the behavior of `IsNull` and `IsValid` is changing.,

**This PR contains a "Critical Fix".**
* Closes: apache#34361

Authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
rtpsw pushed a commit to rtpsw/arrow that referenced this issue May 16, 2023
…hout bitmaps like Unions and Run-End Encoded (apache#34408)

Bonus: add `ArrayData::IsValid()` to make it consistent with `Array` and `ArraySpan`.

### Rationale for this change

This is the proposed fix to apache#34361 plus the addition of more APIs dealing with validity/nullity.

### What changes are included in this PR?

This PR changes the behavior of `IsNull` and `IsValid` in `Array`, `ArrayData`, and `ArraySpan`.

It preserves the behavior of `MayHaveNulls`, `GetNullCount` and introduces new member functions to `Array`, `ArrayData`, and `ArraySpan`:

 - `bool HasValidityBitmap() const`
 - `bool MayHaveLogicalNulls() const`
 - `int64_t ComputeLogicalNullCount() const`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes. See above.

Breakage with these changes can only happen if users rely on `IsNull(i)` always returning `true` for union types, but we have users reporting that the current behavior or broken apache#34315. This is why the behavior of `IsNull` and `IsValid` is changing.,

**This PR contains a "Critical Fix".**
* Closes: apache#34361

Authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue May 17, 2023
jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue May 23, 2023
@raulcd raulcd modified the milestones: 13.0.0, 14.0.0 Jul 7, 2023
felipecrv pushed a commit to felipecrv/arrow that referenced this issue Sep 9, 2023
@jorisvandenbossche jorisvandenbossche modified the milestones: 14.0.0, 15.0.0 Oct 6, 2023
@raulcd raulcd modified the milestones: 15.0.0, 16.0.0 Jan 8, 2024
@raulcd raulcd modified the milestones: 16.0.0, 17.0.0 Apr 8, 2024
@raulcd raulcd modified the milestones: 17.0.0, 18.0.0 Jun 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment