Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-37072: [C++] GetArrayView: allow non-nullable fields to be null if parent is null #37202

Closed
wants to merge 2 commits into from

Conversation

felipecrv
Copy link
Contributor

@felipecrv felipecrv commented Aug 16, 2023

Rationale for this change

The layout constraint checks in GetArrayView are too strict when they don't allow non-nullable STRUCT fields to be null since STRUCT fields can legally have any value when the parent Array describes the entry as not valid.

Reproduction script

import pyarrow as pa

struct_type = pa.struct([pa.field("x", pa.int32(), nullable=False)])
nulls = pa.nulls(5, struct_type)

# The following is an error:
nulls.cast(struct_type)

What changes are included in this PR?

Allow the execution of GetArrayView with nulls in non-nullable fields iff the parent Array is 100% null. To cover all cases, the checks would have to be done value by value, but that would be expensive. This PR covers the cases where STRUCT arrays are created by MakeArrayOfNull. Kernels producing STRUCTs with a few nulls might be wiser to set validity bits to 1 on non-nullable fields even when the parent is null.

Are these changes tested?

By existing tests and manual reproduction script.

@github-actions
Copy link

⚠️ GitHub issue #37072 has been automatically assigned in GitHub to PR creator.

@pitrou
Copy link
Member

pitrou commented Aug 16, 2023

I'm honestly not sure this is the right solution. If a field is not nullable, then ideally it shouldn't have any nulls even if it is shadowed by an all-nulls parent array.

Other implementations may also not like receiving such data over IPC or the C Data Interface. @alippai @lidavidm @paleolimbot What do you think?

Also, in any case, this should come with unit tests!

@felipecrv
Copy link
Contributor Author

If a field is not nullable, then ideally it shouldn't have any nulls even if it is shadowed by an all-nulls parent array.

But what value should a field of a struct have when the entire struct is a logical null? Isn't that completely undefined and a waste of time to be populated?

@pitrou
Copy link
Member

pitrou commented Aug 16, 2023

It's undefined but it needs to be populated anyway :-) (we can't leave memory uninitialized as it can be serialized through IPC, for example)

MakeArrayOfNull already assigns a buffer full of zeros to all array buffers.

@felipecrv
Copy link
Contributor Author

MakeArrayOfNull already assigns a buffer full of zeros to all array buffers.

But what you are asking for here requires a validity buffer set to 0xff on the children and data buffers set to 0 or some type-specific notion of default value.

@pitrou
Copy link
Member

pitrou commented Aug 16, 2023

Right, it would add a bit of complexity, hopefully not too much. Most types would probably be ok with 0xff for all buffer data. Perhaps (list/binary/dense union) offsets and run-ends would better be zero, though?

@lidavidm
Copy link
Member

Other implementations may also not like receiving such data over IPC or the C Data Interface.

Practically I think implementations are not great about using the nullability flag, instead relying mostly on the physical presence of nulls, but this is a chicken-and-egg issue.

If a field is not nullable, then ideally it shouldn't have any nulls even if it is shadowed by an all-nulls parent array.

This sounds reasonable, but we should clarify this more generally? It sounds like we aren't precise enough with that field nullability means for the physical data.

@pitrou
Copy link
Member

pitrou commented Aug 16, 2023

This sounds reasonable, but we should clarify this more generally? It sounds like we aren't precise enough with that field nullability means for the physical data.

Hmm, it seems like this could be discussed on the mailing-list. Perhaps we don't want to impose undue constraints here.

@felipecrv
Copy link
Contributor Author

Right, it would add a bit of complexity, hopefully not too much. Most types would probably be ok with 0xff for all buffer data. Perhaps (list/binary/dense union) offsets and run-ends would better be zero, though?

If we were to got that route we would have to define the "default value" for each type similar to how protobuf 3 is allowed to omit data and default values are assumed.

  • Boolean - false
  • Number - 0.0
  • List - empty list (offsets would have to be 0 and not 0xffffffff)
  • Unions - the default value of the first type

This sounds reasonable, but we should clarify this more generally? It sounds like we aren't precise enough with that field nullability means for the physical data.

Hmm, it seems like this could be discussed on the mailing-list. Perhaps we don't want to impose undue constraints here.

I vote in favor of not imposing constraints. This is why I don't think the way to go here is "fixing" MakeArrayOfNull but relaxing the constraint checking that's failing on per-spec valid data.

@pitrou
Copy link
Member

pitrou commented Aug 16, 2023

We don't need to officially define "default values", just make sure we don't produce invalid data (such as non-monotonic offsets). This is exactly the idea behind ArrayBuilder::AppendEmptyValue:

/// \brief Append a non-null value to builder
///
/// The appended value is an implementation detail, but the corresponding
/// memory slot is guaranteed to be initialized.
/// This method is useful when appending a null value to a parent nested type.
virtual Status AppendEmptyValue() = 0;

But this discussion can also regard other implementations, I think it would be useful to discuss on the ML.

@alippai
Copy link
Contributor

alippai commented Aug 16, 2023

My only insight here is that nullability in nested types and the need of fast paths are causing non-trivial performance degradation in parquet serde. So if restricting some edge cases helps here, likely it’ll help in other code paths as well.

@paleolimbot
Copy link
Member

I'm not sure I understand all of this; however, would this mean that there could be an ArrowArray exported via the C Data interface that has is marked non-nullable but contains a non-NULL validity buffer with nulls? If that is the case, I suspect it may cause a problem at some point...it is a generally useful property that arrays are valid without considering information from any parent.

@felipecrv
Copy link
Contributor Author

But this discussion can also regard other implementations, I think it would be useful to discuss on the ML.

The discussion is needed if we decide to put the definition of empty value (a better terminology than "default value" I used above) on the spec. My current position is that validation should be relaxed and not assume other implementations are producing non-null fields inside null structs.

For instance, I believe that a compute kernel, in the official implementation or other implementations, should be allowed to skip producing empty values inside of null structs. This can be a very meaningful improvement for structs with a large number of fields.

@pitrou
Copy link
Member

pitrou commented Aug 17, 2023

The discussion is needed if we decide to put the definition of empty value (a better terminology than "default value" I used above) on the spec.

No, this is entirely orthogonal.

My current position is that validation should be relaxed and not assume other implementations are producing non-null fields inside null structs.

If we follow your position, other implementations will be exposed to such data. So they will be affected anyway.

For instance, I believe that a compute kernel, in the official implementation or other implementations, should be allowed to skip producing empty values inside of null structs.

What do you mean with "skip producing empty values"? We need to initialize memory anyway. Whether it's by a single memset call or per-value appends is an implementation detail.

@felipecrv
Copy link
Contributor Author

If we follow your position, other implementations will be exposed to such data. So they will be affected anyway.

All I'm saying is that we be liberal in data that we receive (relaxed checks). We can still be strict in MakeArrayOfNull but we shouldn't rely on everyone producing non-null values in struct fields.

Starting a whole specification refinement discussion takes much more time than I thought would be required from me to fix this issue. If the relaxed checks is not an acceptable solution, I can close the PR and move on.

@pitrou
Copy link
Member

pitrou commented Aug 17, 2023

It's not only "data we receive" here, but "data we produce" (MakeArrayOfNull).

@felipecrv
Copy link
Contributor Author

My only insight here is that nullability in nested types and the need of fast paths are causing non-trivial performance degradation in parquet serde. So if restricting some edge cases helps here, likely it’ll help in other code paths as well.

@alippai on the Arrow->Parquet or on the Parquet->Arrow path?

@alippai
Copy link
Contributor

alippai commented Aug 17, 2023

This was list specific and for the parquet->arrow path: #34510

I’ve read this issue again and as I understand both of the solutions would serialize into exactly the same parquet? Not sure which version the parsing will yield

@felipecrv
Copy link
Contributor Author

This was list specific and for the parquet->arrow path: #34510

I’ve read this issue again and as I understand both of the solutions would serialize into exactly the same parquet? Not sure which version the parsing will yield

The issue there seems to be the complexity of the Parquet layout and the overhead of zero-initializing in-memory arrays, although non-zero, is not yet highlighted as an issue.

@felipecrv
Copy link
Contributor Author

Closing this in favor of #38252

@felipecrv felipecrv closed this Oct 20, 2023
@felipecrv felipecrv deleted the allow_nulls_under_nulls branch October 20, 2023 15:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

null structarrays are poorly handled by cast
5 participants