Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python][Doc] Clarify docstring of FixedSizeListArray.values that it ignores the offset #41672

Open
pvardanis opened this issue May 15, 2024 · 12 comments

Comments

@pvardanis
Copy link

pvardanis commented May 15, 2024

We clarified the docstring of ListArray.values a while ago to make it clear it ignores the offset: https://arrow.apache.org/docs/python/generated/pyarrow.ListArray.html#pyarrow.ListArray.values

We should do the same for FixedSizeListArray.values, as that right now is less explicit: https://arrow.apache.org/docs/python/generated/pyarrow.FixedSizeListArray.html#pyarrow.FixedSizeListArray.values


Original report:

Describe the bug, including details regarding any error messages, version, and platform.

I have a pa.FixedSizeListArray that looks like this:

<pyarrow.lib.FixedSizeListArray object at 0x13f9b45e0>
[
  [
    [
      1,
      2
    ],
    [
      3,
      4
    ]
  ],
  [
    [
      5,
      6
    ],
    [
      7,
      8
    ]
  ]
]

I'm trying to flatten each row in the pa.FixedSizeListArray using .values. Doing array[0].values, returns:

<pyarrow.lib.FixedSizeListArray object at 0x13f8e7fa0>
[
  [
    1,
    2
  ],
  [
    3,
    4
  ]
]

but doing array[0].values.values returns the initial array flattened, not the one that I want:

[
  1,
  2,
  3,
  4,
  5,
  6,
  7,
  8
]

Weird thing is that array[0].values.flatten() returns what it's supposed to:

[
  1,
  2,
  3,
  4
]

However, I don't want to use flatten() since this discards Null values. Is this a bug or am I missing something?

Component(s)

Python

@rok
Copy link
Member

rok commented May 15, 2024

Thanks for reporting this @pvardanis. This might be expected as .values accesses underlying storage which will start at the beginning of an array.
Could you say what version of pyarrow are you on? Also your array seems nested, which I'm not sure how to reproduce, can you give an example?

@pvardanis
Copy link
Author

pvardanis commented May 15, 2024

@rok I'm using 12.0.1 but can also reproduce the bug in 16.1.0. This is how to create the array:

array_type = pa.list_(pa.list_(pa.int32(), list_size=2), list_size=2)
array = pa.array(
        [[[1, 2], [3, 4]], [[5, 6], [7, 8]]],
        type=array_type,
    )

@rok
Copy link
Member

rok commented May 15, 2024

@pvardanis I can confirm this is still behavior on main branch.
As per docs this is expected (but unintuitive):

values Return the underlying array of values which backs the FixedSizeListArray.

The reason this happens is that getting .values.values will give you a zero-copy reference to the underlying storage and loose the offset of value whose values you were getting. This has utility in that we avoid a copy. But it is somewhat unintuitive.
.flatten on the other hand creates a copy from the offset you're getting and returns the correct values.
You can work around this by either calculating your own offset and slice the .values.values or by doing a array.take(i).values (which will copy values at i) or you can use .flatten which will probably do the same as .take(i) (I'm not sure about this, maybe flatten copies the whole array).

@jorisvandenbossche would you say this is approximately right?

@pvardanis
Copy link
Author

pvardanis commented May 15, 2024

@rok

array.take([0]).values.values

seems to be working, however I was hoping for a zero copy solution.

array.slice(offset=0, length=1).values.values

still returns the whole array. Also, I don't want to use .flatten since it gets rid of Null values.

@rok
Copy link
Member

rok commented May 15, 2024

@pvardanis
By calculating your own offset I mean something like this:

i = 1
array.values.values.slice(offset=i * array.type.list_size, length=array.type.list_size)
<pyarrow.lib.Int32Array object at 0x71226477b2e0>
[
  3,
  4
]

I think this should be zero-copy.

@pvardanis
Copy link
Author

@rok that works thanks!

@rok
Copy link
Member

rok commented May 16, 2024

Great to hear! Closing the issue.
If there's more reports like this we should maybe consider some kind of .flattened_slice_view method?

@rok rok closed this as completed May 16, 2024
@felipecrv felipecrv added Type: usage Issue is a user question and removed Type: bug labels May 17, 2024
@felipecrv
Copy link
Contributor

Also, I don't want to use .flatten since it gets rid of Null values.

@pvardanis watch out for the fact that null lists can be "non-empty" in list arrays (the values array might contain values for that slot).

And in fixed_size_list arrays, the values in [i..i + fsl.list_size) are not necessarily null themselves. They could be anything.

@jorisvandenbossche
Copy link
Member

We clarified the docstring of ListArray.values a while ago to make it clear it ignores the offset: https://arrow.apache.org/docs/python/generated/pyarrow.ListArray.html#pyarrow.ListArray.values

We should do the same for FixedSizeListArray.values, as that right now is less explicit: https://arrow.apache.org/docs/python/generated/pyarrow.FixedSizeListArray.html#pyarrow.FixedSizeListArray.values

BTW, an important reason why the .values for a ListArray returns the non-offsetted child array is to ensure this still matches with the .offsets of a sliced ListArray (the offsets array itself will be sliced as well, but the actual values in it don't change. So for those offset values to be correct indices into the values array, that values array should still be the full non-sliced version).

Of course, that reason is not important for FixedSizeListArray given there are no offsets in this case, but it also makes sense to have .values be consistent between both.

@jorisvandenbossche
Copy link
Member

I am going to reopen this issue and rephrase the title as a documentation issue to clarify the docstring.

@jorisvandenbossche jorisvandenbossche changed the title Nested pa.FixedSizeListArray returns wrong flatten values [Python][Doc] Clarify docstring of FixedSizeListArray.values that it ignores the offset May 17, 2024
@jorisvandenbossche jorisvandenbossche added Component: Documentation and removed Type: usage Issue is a user question labels May 17, 2024
@felipecrv
Copy link
Contributor

Of course, that reason is not important for FixedSizeListArray given there are no offsets in this case...

Technically the offsets exist the same way, but are implicit because they can be computed with (arr.offset + i) * list_size.

Child arrays of nested layouts are required to keep the prefix-padding on the buffer when an upper level .offset > 0 to make it so that slicing an array, doesn't have to slice recursively. This makes the zero-copy .values that @pvardanis wants to do, a bit trickier, but on the other hand makes slicing arrays constant-time instead of O(depth_of_the_layout)

@felipecrv
Copy link
Contributor

@jorisvandenbossche perhaps *ListArray and FixedSizeListArray need a sliced_values() accessor. For list-views sliced_values would be hard to compute and explain to users though (because offsets are not necessarily sorted in list-views).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants