Fields within a null struct are not initialized with null values #41833
Comments
@timsaucer I see what you mean, but as far as I know, nothing in the Arrow columnar format specification requires those values to be null. Even for a primitive array containing a null, we actually put some "default" value in the null slot:
>>> import pyarrow as pa
>>> arr = pa.array([1, None, 3])
>>> arr
<pyarrow.lib.Int64Array object at 0x7f23782d5360>
[
1,
null,
3
]
# using nanoarrow to more easily view the actual buffers
>>> import nanoarrow as na
>>> na.array(arr).inspect()
<ArrowArray int64>
- length: 3
- offset: 0
- null_count: 1
- buffers[2]:
- validity <bool[1 b] 10100000>
- data <int64[24 b] 1 0 3> # <-- looking at the actual data buffer, the null slot is also filled with 0
- dictionary: NULL
- children[0]:
Similarly, in the nested struct case, those default values in the child array are masked by the validity of the parent struct array. You could argue that, specifically for this kind of conversion of Python objects to Arrow data, we could put a null in the child array as well (although that would require allocating an additional validity bitmap in this small example). However, other code should never assume this is the case, because you can easily create a StructArray in a different way (e.g. directly from the child arrays and a validity bitmap) that would not give this guarantee either.
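To make the buffer layout above concrete, here is a minimal pure-Python sketch of how an int64 array with a null could be laid out. It is a toy model, not pyarrow's or nanoarrow's implementation: `build_int64_buffers` and `bits_in_slot_order` are hypothetical helper names, and the choice of 0 as the placeholder is an assumption that simply mirrors the output shown above.

```python
from array import array


def build_int64_buffers(values):
    """Toy model: return (validity_bytes, data, null_count) for ints/None."""
    validity = bytearray((len(values) + 7) // 8)
    data = array("q")  # signed 64-bit integers
    for i, value in enumerate(values):
        if value is None:
            # Placeholder only: the slot is masked by the validity bit,
            # so its physical content carries no meaning.
            data.append(0)
        else:
            # Arrow validity bitmaps use least-significant-bit numbering.
            validity[i // 8] |= 1 << (i % 8)
            data.append(value)
    null_count = sum(value is None for value in values)
    return bytes(validity), data, null_count


def bits_in_slot_order(validity):
    """Render the bitmap slot by slot (bit 0 of each byte first)."""
    return "".join(str((byte >> bit) & 1) for byte in validity for bit in range(8))


validity, data, null_count = build_int64_buffers([1, None, 3])
print(bits_in_slot_order(validity))  # 10100000
print(list(data))                    # [1, 0, 3]
print(null_count)                    # 1
```

The point of the sketch is that "null" lives only in the validity bitmap; the data buffer always holds some value in every slot.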
Note that when it comes to accessing a subfield of a struct array, you typically (depending on the exact use case) do want to "propagate" the parent struct's null values to the child field as well. For that reason, pyarrow provides two separate APIs to get the child array (shown here on the array from your original example):
# getting the "raw" child array as stored under the hood
>>> arr.field("outer").field("inner_1")
Out[14]:
<pyarrow.lib.Int64Array object at 0x7f23734339a0>
[
1,
3,
0
]
# getting the "logical" child array
>>> import pyarrow.compute as pc
>>> pc.struct_field(arr, ["outer", "inner_1"])
Out[20]:
<pyarrow.lib.Int64Array object at 0x7f237276b3a0>
[
1,
3,
null
]
This API is far from ideal. On the C++ side, there is a I assume that on the DataFusion side, there should also be some distinction between those two ways to get a field.
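The distinction between the "raw" and the "logical" child array can be sketched in plain Python. This is only an illustrative model under stated assumptions: `ToyArray`, `ToyStruct`, `raw_field` and `logical_field` are hypothetical names, not pyarrow or DataFusion APIs, and validity is kept as a list of booleans rather than a packed bitmap.

```python
from dataclasses import dataclass


@dataclass
class ToyArray:
    values: list  # physical slots, placeholders included
    valid: list   # one bool per slot


@dataclass
class ToyStruct:
    children: dict  # field name -> ToyArray
    valid: list     # validity of the struct slots themselves


def raw_field(struct, name):
    """Return the child exactly as stored, ignoring the parent's validity."""
    return struct.children[name]


def logical_field(struct, name):
    """Return the child with the parent's nulls propagated into it."""
    child = struct.children[name]
    merged = [p and c for p, c in zip(struct.valid, child.valid)]
    return ToyArray(child.values, merged)


def to_pylist(arr):
    return [v if ok else None for v, ok in zip(arr.values, arr.valid)]


# The third struct slot is null, but the child stores a placeholder 0 there.
inner_1 = ToyArray(values=[1, 3, 0], valid=[True, True, True])
outer = ToyStruct(children={"inner_1": inner_1}, valid=[True, True, False])

print(to_pylist(raw_field(outer, "inner_1")))      # [1, 3, 0]
print(to_pylist(logical_field(outer, "inner_1")))  # [1, 3, None]
```

In this model the logical view is just the child's validity combined with the parent's via a logical AND, which is why the raw child can legitimately contain a non-null placeholder.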
Describe the bug, including details regarding any error messages, version, and platform.
When creating an array from a Python dict, the field entries of a null struct are initialized with default values rather than null, even if the field is nullable. In the minimal example below, you would expect the values of inner_1 and inner_2 in the 3rd row to be null.
Generates the following output:
Component(s)
Python