[Python] Change StructArray.field(..) to return "flattened" field? #14970

Open
jorisvandenbossche opened this issue Dec 15, 2022 · 0 comments

Related to #14946 on the C++ side, and this recently came up in #14781 (comment).

A StructArray has child arrays that make up its "fields", but it can also have a top-level validity bitmap. So when accessing a field of a StructArray that has top-level nulls, you can either retrieve the "raw" child array, or get the "logical" field array that combines the child array with the top-level bitmap.

To illustrate:

In [1]: arr = pa.StructArray.from_arrays([pa.array([5, 3, 4, 2, 1]), pa.array([1, 2, 3, 4, 5])], names=['a', 'b'], mask=pa.array([False, True, False, False, False]))

In [2]: arr.to_pandas()
Out[2]: 
0    {'a': 5, 'b': 1}
1                None
2    {'a': 4, 'b': 3}
3    {'a': 2, 'b': 4}
4    {'a': 1, 'b': 5}
dtype: object

In [3]: arr.field('a')
Out[3]: 
<pyarrow.lib.Int64Array object at 0x7f9db84cdd20>
[
  5,
  3,
  4,
  2,
  1
]

In [4]: arr.flatten()[0]
Out[4]: 
<pyarrow.lib.Int64Array object at 0x7f9db855f400>
[
  5,
  null,
  4,
  2,
  1
]
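The difference between the two outputs above can be modelled without pyarrow. The following is only a toy sketch of the semantics, assuming plain Python lists stand in for the child array and the top-level validity bitmap; `logical_field` is a hypothetical name, not a pyarrow function, and this is not how Arrow implements it.

```python
from typing import List, Optional


def logical_field(
    child: List[Optional[int]], struct_valid: List[bool]
) -> List[Optional[int]]:
    """Toy model: apply the struct's top-level validity to a raw child.

    A slot is null in the result if the struct itself is null at that
    position, or if the child value was already null (None).
    """
    if len(child) != len(struct_valid):
        raise ValueError("child and validity must have the same length")
    return [value if ok else None for value, ok in zip(child, struct_valid)]


# Raw child values for field 'a', as returned by arr.field('a') above.
raw_a = [5, 3, 4, 2, 1]

# from_arrays(mask=...) uses True to mark a null struct, so the validity
# is the inverse of the mask.
mask = [False, True, False, False, False]
struct_valid = [not m for m in mask]

flattened_a = logical_field(raw_a, struct_valid)
print(flattened_a)  # [5, None, 4, 2, 1]
```

The value 3 is still stored in the raw child, but it is hidden once the top-level bitmap is taken into account.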

Currently, the field() method on a StructArray gives you the raw child array, and the flatten() method returns the "logical" field arrays for all fields as a list of arrays.
We should have a method that returns the logical field array for a single field, so that you don't have to go through flatten(). In #14781, @amol- added _flattened_field for this (private for now, but we needed it to get the correct values to sort by):

In [5]: arr._flattened_field('a')
Out[5]: 
<pyarrow.lib.Int64Array object at 0x7f9db85d9780>
[
  5,
  null,
  4,
  2,
  1
]

We could just make that a public method. However, I have some questions/concerns about this:

  • I personally don't like the "flattened" term. I know we already use it in C++ as well (this basically just exposes the C++ StructArray::GetFlattenedField), but I don't find it very clear that it refers to this distinction.
  • We could also change field() instead? I personally think this is what people will typically want when they currently call field() (like @amol- was doing in the sort PR, to get the values of a certain field of the struct). The value in the raw child that is masked by the top-level bitmap is more or less an implementation detail, and IMO a user should not get at it that easily.
  • If we change field() to default to the "flattened" field, we need an alternative way to access the raw child. We could add a keyword for this? (but what name?) Or a separate method like child()?