[C++] Should FieldRef::Get* for a StructArray return the "raw" child array or the "flattened" field array? #14946
Comments
I think the current semantics are ok.
Maybe FieldRef::Get -> FieldRef::GetRaw or similar? But yeah, I think the behavior is fine considering they are different contexts.
Thanks for bringing this up as I hadn't run into this nuance before. Do
I don't think we actually use the specific I have been looking through the places where we call The only place I found (based on a quick look, outside arrow/cpp/src/arrow/dataset/scanner_test.cc Lines 672 to 675 in 5c1044f
and I would say this is only "correct" because the test batch doesn't have missing values.
I actually only looked for arrow/cpp/src/arrow/compute/kernels/vector_sort.cc Lines 1236 to 1246 in 5ce8d79
And so because of Tweaking the cython SortKey bindings a little bit to allow constructing it with a FieldRef (currently in pyarrow we only allow specifying the column to sort by using a string, from C++ you can of course already do that), we can see this bug in action by sorting a RecordBatch that has a struct column that has a top-level null:
while if we compare that to sorting the struct array directly (which was just merged in #14781 and does correctly "flatten" the field when you specify to sort by a field):
In this case the null value is correctly sorted at the end.
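To make the difference concrete, here is a minimal sketch of the two sorting outcomes. It uses plain `std::vector`s as a toy stand-in for a struct column with one integer child; `SortIndicesNullsLast` and `CombineValidity` are hypothetical helpers written for this illustration and are not Arrow functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Toy helper (not Arrow's sort kernel): returns indices that order `values`
// ascending, with invalid (null) slots placed last.
std::vector<std::size_t> SortIndicesNullsLast(const std::vector<int>& values,
                                              const std::vector<bool>& valid) {
  assert(values.size() == valid.size());
  std::vector<std::size_t> idx(values.size());
  std::iota(idx.begin(), idx.end(), std::size_t{0});
  std::stable_sort(idx.begin(), idx.end(),
                   [&](std::size_t a, std::size_t b) -> bool {
                     if (valid[a] != valid[b]) return valid[a];  // valid before null
                     if (!valid[a]) return false;  // both null: keep input order
                     return values[a] < values[b];
                   });
  return idx;
}

// Toy helper: a slot is valid only if both the parent struct slot and the
// child slot are valid. This is the "flattening" step in miniature.
std::vector<bool> CombineValidity(const std::vector<bool>& parent,
                                  const std::vector<bool>& child) {
  assert(parent.size() == child.size());
  std::vector<bool> out(parent.size());
  for (std::size_t i = 0; i < parent.size(); ++i) {
    out[i] = parent[i] && child[i];
  }
  return out;
}
```

With child values `{3, 1, 2}`, an all-valid child bitmap, and a parent bitmap `{true, false, true}`, sorting on the raw child validity puts the row under the null struct first (it still holds the value 1), while sorting on the combined validity moves that row to the end.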
So as a summary: given that those FieldRef/FieldPath
FieldRef/FieldPath are public APIs, we cannot change their semantics like that even though that might benefit some of our internal usage.
As for the sorting "bug", I would argue that sorting was never intended to work on nested fields like that, and I don't see any C++ tests for it.
Assuming we want this behavior for sort (I think we do) then we should change that call to
Perhaps we can introduce
Introducing
### Rationale for this change

The current `FieldPath::Get` methods - when extracting nested child values - don't combine the child's null bitmap with higher-level parent bitmaps. While this is often preferable (it allows for zero-copy), there are cases where higher level "flattening" version is useful - e.g. when pre-processing sort keys for structs.

### What changes are included in this PR?

- Adds `FieldPath::GetFlattened` methods alongside the existing `FieldPath::Get` overloads
- Adds `GetAllFlattened`, `GetOneFlattened` and `GetOneOrNoneFlattened` methods to `FieldRef`
- Adds a couple internal helpers for dealing with both `Get` variants in templates
- Overhauls the `FieldPath` tests in an effort to improve coverage and generalize across the supported input types

More significantly, this alters the `FieldPathGetImpl` internals to use a new `NestedSelector` class. The reason for this is that the prior method required presenting a vector of instantiated child values for each depth level prior to selection. With support for flattening (and recently, `ChunkedArrays`), this becomes a problem since we need to explicitly create all of those child values for each depth level despite the fact that we're only going to select one. So these changes allow any expensive instantiations to be deferred to selection time.

This also indirectly solves an issue that surfaced in the new tests, which is that `FieldPath::Get` would return incorrect nested values when sliced `Array`s are involved. This is because the underlying child data's offset/length weren't being adjusted based on the parent.

### Are these changes tested?

Yes (tests are included)

### Are there any user-facing changes?

Yes, this adds methods to a public API

* Closes: #14946

Lead-authored-by: benibus <bpharks@gmx.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: Ben Harkins <60872452+benibus@users.noreply.github.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
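The sliced-array issue mentioned in the PR description can be sketched with a toy model. The types below (`ToyChildData`, `ToyStructData`) and all function names are hypothetical and only imitate the idea of a parent whose slice changes its own offset/length while the stored child is left untouched; they are not Arrow's `ArrayData`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-ins for illustration only; these are NOT Arrow types.
struct ToyChildData {
  std::vector<int> buffer;  // stand-in for the child's value buffer
  std::size_t offset;
  std::size_t length;
};

struct ToyStructData {
  std::size_t offset;
  std::size_t length;
  ToyChildData child;
};

// Slicing the parent only updates the parent's offset/length;
// the stored child is left as-is.
ToyStructData SliceParent(const ToyStructData& s, std::size_t off,
                          std::size_t len) {
  assert(off + len <= s.length);
  ToyStructData out = s;
  out.offset = s.offset + off;
  out.length = len;
  return out;
}

// Copies the window [offset, offset + length) out of the child's buffer.
std::vector<int> Materialize(const ToyChildData& c, std::size_t offset,
                             std::size_t length) {
  assert(offset + length <= c.buffer.size());
  auto first = c.buffer.begin() + static_cast<std::ptrdiff_t>(offset);
  return std::vector<int>(first, first + static_cast<std::ptrdiff_t>(length));
}

// Naive access: ignores the parent's offset/length (the failure mode).
std::vector<int> NaiveChildValues(const ToyStructData& s) {
  return Materialize(s.child, s.child.offset, s.child.length);
}

// Adjusted access: applies the parent's offset/length to the child.
std::vector<int> AdjustedChildValues(const ToyStructData& s) {
  return Materialize(s.child, s.child.offset + s.offset, s.length);
}
```

In this model, slicing a five-element parent to `(offset 1, length 3)` and reading the child naively still yields all five values, whereas the adjusted read yields the three values that belong to the slice.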
Related to the discussion in #14697 (comment) (but also to the python equivalent APIs like in #14781 (comment)).
Currently, the `FieldRef`/`FieldPath` methods to get a field from a StructArray return the "raw" child array (i.e. `FieldRef::GetOne/GetAll` or `FieldPath::Get()`, when passing those methods a StructArray or a RecordBatch with a struct-typed field). Basically, under the hood, it does `struct_array.data().child_data[idx]`.

On the other hand, if you use the same `FieldRef`/`FieldPath` object in a compute context, and for example pass this to the `struct_field` kernel, the result you get is the "flattened" field array (I find the "flatten" term somewhat confusing), which combines the top-level validity bitmap of the struct array with the validity bitmap of the child array, to ensure you have a correct indication of missing values in the result. Basically, under the hood this kernel is doing `struct_array.GetFlattenedField(idx)` instead.

So I am wondering if those two usage patterns of FieldRefs should be consistent? Although that probably depends on what the intended use case is for the direct `FieldRef::GetOne` / `FieldPath::Get`? (Currently, I don't see those used in actual code, except as a helper in some testing code.)
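As a rough illustration of the two behaviors, here is a toy model that does not use Arrow at all. `ToyStruct`, `ToyChild`, `GetRawChild` and `GetFlattenedChild` are hypothetical names; the sketch only assumes that "flattening" means AND-ing the parent's validity into the child's.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-ins for illustration only; these are NOT Arrow classes.
struct ToyChild {
  std::vector<int> values;
  std::vector<bool> valid;  // child-level validity
};

struct ToyStruct {
  std::vector<bool> valid;  // top-level (struct) validity
  ToyChild child;           // a single child field
};

// "Raw" access: return the child as stored, ignoring the parent's validity.
// No copy is needed, which mirrors the zero-copy argument in the thread.
const ToyChild& GetRawChild(const ToyStruct& s) { return s.child; }

// "Flattened" access: a slot is valid only if both the parent slot and the
// child slot are valid. This requires building a new validity vector.
ToyChild GetFlattenedChild(const ToyStruct& s) {
  assert(s.valid.size() == s.child.valid.size());
  ToyChild out = s.child;
  for (std::size_t i = 0; i < out.valid.size(); ++i) {
    out.valid[i] = s.valid[i] && s.child.valid[i];
  }
  return out;
}
```

For a struct whose second slot is null at the top level but whose child slot is valid, the raw child still reports that slot as valid, while the flattened child reports it as null.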
cc @pitrou @westonpace @lidavidm
Component(s)
C++