[C++] Should FieldRef::Get* for a StructArray return the "raw" child array or the "flattened" field array? #14946

jorisvandenbossche · 2022-12-14T13:07:42Z

Related to the discussion in #14697 (comment) (but also to the python equivalent APIs like in #14781 (comment)).

Currently, the FieldRef/FieldPath methods to get this field from a StructArray returns the child array (i.e. FieldRef::GetOne/GetAll or FieldPath::Get() when passing those methods a StructArray (or RecordBatch with struct typed field).
Basically, under the hood, it does struct_array.data().child_data[idx].

On the other hand, if you use the same FieldRef/FieldPath object in a compute context, and for example pass this to the struct_field kernel, the result you get is the "flattened" field array (I find the "flatten" term somewhat confusing), which combines the top-level validity bitmap of the struct array with the validity bitmap of the child array, to ensure you have a correct indication of missing values in the result.
Basically, under the hood this kernel is doing struct_array.GetFlattenedField(idx) instead.

So I am wondering if those two usage patterns of FieldRefs should be consistent? Although that probably depends on what the intended use case is for the direct FieldRef::GetOne / FieldPath::Get ? (currently, I don't see those used in actual code, except as helper in some testing code)

cc @pitrou @westonpace @lidavidm

Component(s)

C++

The text was updated successfully, but these errors were encountered:

pitrou · 2022-12-14T13:27:50Z

I think the current semantics are ok.

lidavidm · 2022-12-14T13:58:37Z

Maybe FieldRef::Get -> FieldRef::GetRaw or similar? But yeah, I think the behavior is fine considering they are different contexts.

westonpace · 2022-12-14T18:09:27Z

Thanks for bringing this up as I hadn't run into this nuance before. Do FieldRef/FieldPath get used outside of the compute context?

jorisvandenbossche · 2022-12-15T11:37:50Z

I don't think we actually use the specific Get* methods on FieldRef/FieldPath anywhere with array input, not even in the compute module (whether we use FieldRef/FieldPath in general outside of compute, not directly an idea).

I have been looking through the places where we call FieldRef/FieldPath::Get* internally, and I only see places where we call this with a type or schema (which doesn't have this nuance of raw child vs flattened).

The only place I found (based on a quick look, outside type.cc/type_test.cc) is in the tests in scanner_test.cc:

arrow/cpp/src/arrow/dataset/scanner_test.cc

Lines 672 to 675 in 5c1044f

    
           std::shared_ptr<RecordBatch> batch = MakeTestBatch(0); 
        
           ASSERT_OK_AND_ASSIGN(std::shared_ptr<Array> nested_col, FieldPath({2, 0}).Get(*batch)); 
        
           std::shared_ptr<RecordBatch> one_column = RecordBatch::Make( 
        
               schema({field("x", int32())}), batch->num_rows(), ArrayVector{nested_col});

and I would say this is only "correct" because the test batch doesn't have missing values.

jorisvandenbossche · 2022-12-15T14:21:47Z

I actually only looked for FieldPath::Get, but also taking a look at FieldRef::GetOne now, and that's something we actually do use in the compute module. We use that to get the array to sort by in the kernel to sort a RecordBatch:

arrow/cpp/src/arrow/compute/kernels/vector_sort.cc

Lines 1236 to 1246 in 5ce8d79

    
           Result<Datum> SortIndices(const RecordBatch& batch, const SortOptions& options, 
        
                                     ExecContext* ctx) const { 
        
             auto n_sort_keys = options.sort_keys.size(); 
        
             if (n_sort_keys == 0) { 
        
               return Status::Invalid("Must specify one or more sort keys"); 
        
             } 
        
             if (n_sort_keys == 1) { 
        
               ARROW_ASSIGN_OR_RAISE( 
        
                   auto array, PrependInvalidColumn(options.sort_keys[0].target.GetOne(batch))); 
        
               return SortIndices(*array, options, ctx); 
        
             }

And so because of GetOne not having the "flatten" semantics, this actually causes a bug here.

Tweaking the cython SortKey bindings a little bit to allow constructing it with a FieldRef (currently in pyarrow we only allow specifying the column to sort by using a string, from C++ you can of course already do that), we can see this bug in action by sorting a RecordBatch that has a struct column that has a top-level null:

import pyarrow as pa
import pyarrow.compute as pc
arr = pa.StructArray.from_arrays(
    [pa.array([5, 3, 4, 2, 1]), pa.array([1, 2, 3, 4, 5])], names=['a', 'b'],
    mask=pa.array([False, True, False, False, False])
)
batch = pa.table({"col": arr}).to_batches()[0]

In [4]: batch.sort_by([(pc.field("col", "a"), "ascending")]).to_pandas()
Out[4]: 
                col
0  {'a': 1, 'b': 5}
1  {'a': 2, 'b': 4}
2              None
3  {'a': 4, 'b': 3}
4  {'a': 5, 'b': 1}

while if we compare that to sorting the struct array directly (which was just merged in #14781 and does correctly "flatten" the field when you specify to sort by a field):

Out[5]: 
                col
0  {'a': 1, 'b': 5}
1  {'a': 2, 'b': 4}
2  {'a': 4, 'b': 3}
3  {'a': 5, 'b': 1}
4              None

In this case the null value is correctly sorted at the end.

jorisvandenbossche · 2022-12-15T14:24:32Z

So as a summary: given that those FieldRef/FieldPath Get* methods are mainly used in compute/dataset context, IMO we should change those to just return the correct values ("flattened") instead of the raw child.
(is there a use case for wanting the raw child with those methods? If you want that, you can use the StructArray::field method?)

pitrou · 2022-12-15T15:50:11Z

FieldRef/FieldPath are public APIs, we cannot change their semantics like that even though that might benefit some of our internal usage.

pitrou · 2022-12-15T15:51:26Z

As for the sorting "bug", I would argue that sorting was never intended to work on nested fields like that, and I don't see any C++ tests for it.

westonpace · 2022-12-15T22:24:37Z

FieldRef is used rather extensively in Acero (asof-join, aggregate, and hash-join all use field refs to refer to keys and soon scan-node will use it I think) and in all cases we should interpret the ref as a "flattened" ref. That being said, we don't have to call FeldRef::Get and can change any places where we were relying on it to use a new FieldRef::GetOneFlattened method.

westonpace · 2022-12-15T22:26:18Z

And so because of GetOne not having the "flatten" semantics, this actually causes a bug here.

Assuming we want this behavior for sort (I think we do) then we should change that call to GetOne to a call GetOneFlattened.

westonpace · 2022-12-15T22:27:28Z

Perhaps we can introduce GetOneRaw AND GetOneFlattened and then mark GetOne for deprecation. This would prevent users from running into the same mishaps in the future.

pitrou · 2022-12-15T22:31:02Z

Introducing GetOneFlattened would probably be enough of a signal that GetOne returns the physical child.

### Rationale for this change The current `FieldPath::Get` methods - when extracting nested child values - don't combine the child's null bitmap with higher-level parent bitmaps. While this is often preferable (it allows for zero-copy), there are cases where higher level "flattening" version is useful - e.g. when pre-processing sort keys for structs. ### What changes are included in this PR? - Adds `FieldPath::GetFlattened` methods alongside the existing `FieldPath::Get` overloads - Adds `GetAllFlattened`, `GetOneFlattened` and `GetOneOrNoneFlattened` methods to `FieldRef` - Adds a couple internal helpers for dealing with both `Get` variants in templates - Overhauls the `FieldPath` tests in an effort to improve coverage and generalize across the supported input types More significantly, this alters the `FieldPathGetImpl` internals to use a new `NestedSelector` class. The reason for this is that the prior method required presenting a vector of instantiated child values for each depth level prior to selection. With support for flattening (and recently, `ChunkedArrays`), this becomes a problem since we need to explicitly create all of those child values for each depth level despite the fact that we're only going to select one. So these changes allow any expensive instantiations to be deferred to selection time. This also indirectly solves an issue that surfaced in the new tests, which is that `FieldPath::Get` would return incorrect nested values when sliced `Array`s are involved. This is because the underlying child data's offset/length weren't being adjusted based on the parent. ### Are these changes tested? Yes (tests are included) ### Are there any user-facing changes? Yes, this adds methods to a public API * Closes: #14946 Lead-authored-by: benibus <bpharks@gmx.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Co-authored-by: Ben Harkins <60872452+benibus@users.noreply.github.com> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Signed-off-by: Antoine Pitrou <antoine@python.org>

jorisvandenbossche added the Component: C++ label Dec 14, 2022

jorisvandenbossche mentioned this issue Dec 15, 2022

[Python] Change StructArray.field(..) to return "flattened" field? #14970

Open

westonpace mentioned this issue Mar 24, 2023

GH-14939: [C++] Support Table lookups in FieldRef and FieldPath #34537

Merged

github-actions bot mentioned this issue Apr 17, 2023

GH-14946: [C++] Add flattening FieldPath/FieldRef::Get methods #35197

Merged

github-actions bot assigned benibus Apr 17, 2023

pitrou closed this as completed in #35197 May 22, 2023

pitrou added this to the 13.0.0 milestone May 22, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[C++] Should FieldRef::Get* for a StructArray return the "raw" child array or the "flattened" field array? #14946

[C++] Should FieldRef::Get* for a StructArray return the "raw" child array or the "flattened" field array? #14946

jorisvandenbossche commented Dec 14, 2022

pitrou commented Dec 14, 2022

lidavidm commented Dec 14, 2022

westonpace commented Dec 14, 2022

jorisvandenbossche commented Dec 15, 2022

jorisvandenbossche commented Dec 15, 2022

jorisvandenbossche commented Dec 15, 2022

pitrou commented Dec 15, 2022

pitrou commented Dec 15, 2022

westonpace commented Dec 15, 2022 •

edited

westonpace commented Dec 15, 2022 •

edited

westonpace commented Dec 15, 2022

pitrou commented Dec 15, 2022

[C++] Should FieldRef::Get* for a StructArray return the "raw" child array or the "flattened" field array? #14946

[C++] Should FieldRef::Get* for a StructArray return the "raw" child array or the "flattened" field array? #14946

Comments

jorisvandenbossche commented Dec 14, 2022

Component(s)

pitrou commented Dec 14, 2022

lidavidm commented Dec 14, 2022

westonpace commented Dec 14, 2022

jorisvandenbossche commented Dec 15, 2022

jorisvandenbossche commented Dec 15, 2022

jorisvandenbossche commented Dec 15, 2022

pitrou commented Dec 15, 2022

pitrou commented Dec 15, 2022

westonpace commented Dec 15, 2022 • edited

westonpace commented Dec 15, 2022 • edited

westonpace commented Dec 15, 2022

pitrou commented Dec 15, 2022

westonpace commented Dec 15, 2022 •

edited

westonpace commented Dec 15, 2022 •

edited