Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Should FieldRef::Get* for a StructArray return the "raw" child array or the "flattened" field array? #14946

Closed
jorisvandenbossche opened this issue Dec 14, 2022 · 12 comments · Fixed by #35197
Assignees
Milestone

Comments

@jorisvandenbossche
Copy link
Member

Related to the discussion in #14697 (comment) (but also to the python equivalent APIs like in #14781 (comment)).

Currently, the FieldRef/FieldPath methods to get this field from a StructArray returns the child array (i.e. FieldRef::GetOne/GetAll or FieldPath::Get() when passing those methods a StructArray (or RecordBatch with struct typed field).
Basically, under the hood, it does struct_array.data().child_data[idx].

On the other hand, if you use the same FieldRef/FieldPath object in a compute context, and for example pass this to the struct_field kernel, the result you get is the "flattened" field array (I find the "flatten" term somewhat confusing), which combines the top-level validity bitmap of the struct array with the validity bitmap of the child array, to ensure you have a correct indication of missing values in the result.
Basically, under the hood this kernel is doing struct_array.GetFlattenedField(idx) instead.

So I am wondering if those two usage patterns of FieldRefs should be consistent? Although that probably depends on what the intended use case is for the direct FieldRef::GetOne / FieldPath::Get ? (currently, I don't see those used in actual code, except as helper in some testing code)

cc @pitrou @westonpace @lidavidm

Component(s)

C++

@pitrou
Copy link
Member

pitrou commented Dec 14, 2022

I think the current semantics are ok.

@lidavidm
Copy link
Member

Maybe FieldRef::Get -> FieldRef::GetRaw or similar? But yeah, I think the behavior is fine considering they are different contexts.

@westonpace
Copy link
Member

Thanks for bringing this up as I hadn't run into this nuance before. Do FieldRef/FieldPath get used outside of the compute context?

@jorisvandenbossche
Copy link
Member Author

I don't think we actually use the specific Get* methods on FieldRef/FieldPath anywhere with array input, not even in the compute module (whether we use FieldRef/FieldPath in general outside of compute, not directly an idea).

I have been looking through the places where we call FieldRef/FieldPath::Get* internally, and I only see places where we call this with a type or schema (which doesn't have this nuance of raw child vs flattened).

The only place I found (based on a quick look, outside type.cc/type_test.cc) is in the tests in scanner_test.cc:

std::shared_ptr<RecordBatch> batch = MakeTestBatch(0);
ASSERT_OK_AND_ASSIGN(std::shared_ptr<Array> nested_col, FieldPath({2, 0}).Get(*batch));
std::shared_ptr<RecordBatch> one_column = RecordBatch::Make(
schema({field("x", int32())}), batch->num_rows(), ArrayVector{nested_col});

and I would say this is only "correct" because the test batch doesn't have missing values.

@jorisvandenbossche
Copy link
Member Author

I actually only looked for FieldPath::Get, but also taking a look at FieldRef::GetOne now, and that's something we actually do use in the compute module. We use that to get the array to sort by in the kernel to sort a RecordBatch:

Result<Datum> SortIndices(const RecordBatch& batch, const SortOptions& options,
ExecContext* ctx) const {
auto n_sort_keys = options.sort_keys.size();
if (n_sort_keys == 0) {
return Status::Invalid("Must specify one or more sort keys");
}
if (n_sort_keys == 1) {
ARROW_ASSIGN_OR_RAISE(
auto array, PrependInvalidColumn(options.sort_keys[0].target.GetOne(batch)));
return SortIndices(*array, options, ctx);
}

And so because of GetOne not having the "flatten" semantics, this actually causes a bug here.

Tweaking the cython SortKey bindings a little bit to allow constructing it with a FieldRef (currently in pyarrow we only allow specifying the column to sort by using a string, from C++ you can of course already do that), we can see this bug in action by sorting a RecordBatch that has a struct column that has a top-level null:

import pyarrow as pa
import pyarrow.compute as pc
arr = pa.StructArray.from_arrays(
    [pa.array([5, 3, 4, 2, 1]), pa.array([1, 2, 3, 4, 5])], names=['a', 'b'],
    mask=pa.array([False, True, False, False, False])
)
batch = pa.table({"col": arr}).to_batches()[0]

In [4]: batch.sort_by([(pc.field("col", "a"), "ascending")]).to_pandas()
Out[4]: 
                col
0  {'a': 1, 'b': 5}
1  {'a': 2, 'b': 4}
2              None
3  {'a': 4, 'b': 3}
4  {'a': 5, 'b': 1}

while if we compare that to sorting the struct array directly (which was just merged in #14781 and does correctly "flatten" the field when you specify to sort by a field):

Out[5]: 
                col
0  {'a': 1, 'b': 5}
1  {'a': 2, 'b': 4}
2  {'a': 4, 'b': 3}
3  {'a': 5, 'b': 1}
4              None

In this case the null value is correctly sorted at the end.

@jorisvandenbossche
Copy link
Member Author

So as a summary: given that those FieldRef/FieldPath Get* methods are mainly used in compute/dataset context, IMO we should change those to just return the correct values ("flattened") instead of the raw child.
(is there a use case for wanting the raw child with those methods? If you want that, you can use the StructArray::field method?)

@pitrou
Copy link
Member

pitrou commented Dec 15, 2022

FieldRef/FieldPath are public APIs, we cannot change their semantics like that even though that might benefit some of our internal usage.

@pitrou
Copy link
Member

pitrou commented Dec 15, 2022

As for the sorting "bug", I would argue that sorting was never intended to work on nested fields like that, and I don't see any C++ tests for it.

@westonpace
Copy link
Member

westonpace commented Dec 15, 2022

FieldRef is used rather extensively in Acero (asof-join, aggregate, and hash-join all use field refs to refer to keys and soon scan-node will use it I think) and in all cases we should interpret the ref as a "flattened" ref. That being said, we don't have to call FeldRef::Get and can change any places where we were relying on it to use a new FieldRef::GetOneFlattened method.

@westonpace
Copy link
Member

westonpace commented Dec 15, 2022

And so because of GetOne not having the "flatten" semantics, this actually causes a bug here.

Assuming we want this behavior for sort (I think we do) then we should change that call to GetOne to a call GetOneFlattened.

@westonpace
Copy link
Member

Perhaps we can introduce GetOneRaw AND GetOneFlattened and then mark GetOne for deprecation. This would prevent users from running into the same mishaps in the future.

@pitrou
Copy link
Member

pitrou commented Dec 15, 2022

Introducing GetOneFlattened would probably be enough of a signal that GetOne returns the physical child.

pitrou added a commit that referenced this issue May 22, 2023
### Rationale for this change

The current `FieldPath::Get` methods - when extracting nested child values - don't combine the child's null bitmap with higher-level parent bitmaps. While this is often preferable (it allows for zero-copy), there are cases where higher level "flattening" version is useful - e.g. when pre-processing sort keys for structs.

### What changes are included in this PR?

- Adds `FieldPath::GetFlattened` methods alongside the existing  `FieldPath::Get` overloads
- Adds `GetAllFlattened`, `GetOneFlattened` and `GetOneOrNoneFlattened` methods to `FieldRef`
- Adds a couple internal helpers for dealing with both `Get` variants in templates
- Overhauls the `FieldPath` tests in an effort to improve coverage and generalize across the supported input types

More significantly, this alters the `FieldPathGetImpl` internals to use a new `NestedSelector` class. The reason for this is that the prior method required presenting a vector of instantiated child values for each depth level prior to selection. With support for flattening (and recently, `ChunkedArrays`), this becomes a problem since we need to explicitly create all of those child values for each depth level despite the fact that we're only going to select one. So these changes allow any expensive instantiations to be deferred to selection time.

This also indirectly solves an issue that surfaced in the new tests, which is that `FieldPath::Get` would return incorrect nested values when sliced `Array`s are involved. This is because the underlying child data's offset/length weren't being adjusted based on the parent.

### Are these changes tested?

Yes (tests are included)

### Are there any user-facing changes?

Yes, this adds methods to a public API
* Closes: #14946

Lead-authored-by: benibus <bpharks@gmx.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: Ben Harkins <60872452+benibus@users.noreply.github.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pitrou pitrou added this to the 13.0.0 milestone May 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants