Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Handling "logical" nulls: add GetLogicalNullCount / update IsNull() to check for logical null #34361

Closed
jorisvandenbossche opened this issue Feb 27, 2023 · 2 comments · Fixed by #34408
Assignees
Labels
Breaking Change Includes a breaking change to the API Component: C++ Component: Go
Milestone

Comments

@jorisvandenbossche
Copy link
Member

There are some data types where the nulls are not stored "physically" using a validity bitmap on the parent ArrayData, but through nulls in child data:

(sidenote: Dictionary arrays could be considered here as well, but are a bit a mixed bag: there is a top-level null count through nulls in the indices, but additionally also the dictionary can contain nulls. So nulls can be encoded in two different ways)

The format specification has a "null_count" (in the IPC FieldNode in the Recordbatch message, and in the C Data Interface), and in those cases this refers to the "physical" null count. And this is followed by the C++ implementation, where the base Array::null_count() (implemented by ArrayData::GetNullCount()) looks at the validity buffer (typically the first buffer) to count the the unset bits, or directly return 0 if there is no validity buffer.

However, in practice you often want to know if there are actual "logical" nulls (not considering those leads to bugs, for example #34315).

@felipecrv @westonpace and I had some discussion about this on zulip (https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/Null.20count.20of.20a.20UnionArray/near/336538844), and I think our current idea would be:

  • Add a GetLogicalNullCount to complement the existing Array::null_count() / ArrayData::GetNullCount()(changing null_count() itself might be too much of a breaking change? And would also create an inconsistency with where this is used in the specs)

  • Change Array::IsNull(i) to consider logical nulls instead of just physical nulls

@jorisvandenbossche jorisvandenbossche added Component: C++ Breaking Change Includes a breaking change to the API labels Feb 27, 2023
@felipecrv
Copy link
Contributor

take

@zeroshade
Copy link
Member

Might make sense to do the same for Go too, to keep parity in features.

@jorisvandenbossche jorisvandenbossche added this to the 12.0.0 milestone Mar 22, 2023
zeroshade pushed a commit that referenced this issue Apr 6, 2023
…itmaps like Unions and Run-End Encoded (#34408)

Bonus: add `ArrayData::IsValid()` to make it consistent with `Array` and `ArraySpan`.

### Rationale for this change

This is the proposed fix to #34361 plus the addition of more APIs dealing with validity/nullity.

### What changes are included in this PR?

This PR changes the behavior of `IsNull` and `IsValid` in `Array`, `ArrayData`, and `ArraySpan`.

It preserves the behavior of `MayHaveNulls`, `GetNullCount` and introduces new member functions to `Array`, `ArrayData`, and `ArraySpan`:

 - `bool HasValidityBitmap() const`
 - `bool MayHaveLogicalNulls() const`
 - `int64_t ComputeLogicalNullCount() const`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes. See above.

Breakage with these changes can only happen if users rely on `IsNull(i)` always returning `true` for union types, but we have users reporting that the current behavior or broken #34315. This is why the behavior of `IsNull` and `IsValid` is changing.,

**This PR contains a "Critical Fix".**
* Closes: #34361

Authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this issue May 11, 2023
…hout bitmaps like Unions and Run-End Encoded (apache#34408)

Bonus: add `ArrayData::IsValid()` to make it consistent with `Array` and `ArraySpan`.

### Rationale for this change

This is the proposed fix to apache#34361 plus the addition of more APIs dealing with validity/nullity.

### What changes are included in this PR?

This PR changes the behavior of `IsNull` and `IsValid` in `Array`, `ArrayData`, and `ArraySpan`.

It preserves the behavior of `MayHaveNulls`, `GetNullCount` and introduces new member functions to `Array`, `ArrayData`, and `ArraySpan`:

 - `bool HasValidityBitmap() const`
 - `bool MayHaveLogicalNulls() const`
 - `int64_t ComputeLogicalNullCount() const`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes. See above.

Breakage with these changes can only happen if users rely on `IsNull(i)` always returning `true` for union types, but we have users reporting that the current behavior or broken apache#34315. This is why the behavior of `IsNull` and `IsValid` is changing.,

**This PR contains a "Critical Fix".**
* Closes: apache#34361

Authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this issue May 15, 2023
…hout bitmaps like Unions and Run-End Encoded (apache#34408)

Bonus: add `ArrayData::IsValid()` to make it consistent with `Array` and `ArraySpan`.

### Rationale for this change

This is the proposed fix to apache#34361 plus the addition of more APIs dealing with validity/nullity.

### What changes are included in this PR?

This PR changes the behavior of `IsNull` and `IsValid` in `Array`, `ArrayData`, and `ArraySpan`.

It preserves the behavior of `MayHaveNulls`, `GetNullCount` and introduces new member functions to `Array`, `ArrayData`, and `ArraySpan`:

 - `bool HasValidityBitmap() const`
 - `bool MayHaveLogicalNulls() const`
 - `int64_t ComputeLogicalNullCount() const`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes. See above.

Breakage with these changes can only happen if users rely on `IsNull(i)` always returning `true` for union types, but we have users reporting that the current behavior or broken apache#34315. This is why the behavior of `IsNull` and `IsValid` is changing.,

**This PR contains a "Critical Fix".**
* Closes: apache#34361

Authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
rtpsw pushed a commit to rtpsw/arrow that referenced this issue May 16, 2023
…hout bitmaps like Unions and Run-End Encoded (apache#34408)

Bonus: add `ArrayData::IsValid()` to make it consistent with `Array` and `ArraySpan`.

### Rationale for this change

This is the proposed fix to apache#34361 plus the addition of more APIs dealing with validity/nullity.

### What changes are included in this PR?

This PR changes the behavior of `IsNull` and `IsValid` in `Array`, `ArrayData`, and `ArraySpan`.

It preserves the behavior of `MayHaveNulls`, `GetNullCount` and introduces new member functions to `Array`, `ArrayData`, and `ArraySpan`:

 - `bool HasValidityBitmap() const`
 - `bool MayHaveLogicalNulls() const`
 - `int64_t ComputeLogicalNullCount() const`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes. See above.

Breakage with these changes can only happen if users rely on `IsNull(i)` always returning `true` for union types, but we have users reporting that the current behavior or broken apache#34315. This is why the behavior of `IsNull` and `IsValid` is changing.,

**This PR contains a "Critical Fix".**
* Closes: apache#34361

Authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Breaking Change Includes a breaking change to the API Component: C++ Component: Go
Projects
None yet
3 participants