ARROW-7031: [Python] Expose the offsets of a ListArray in python #5759
python/pyarrow/array.pxi
Return the offsets as an int32 array.
"""
return Array.from_buffers(
    int32(), len(self) + 1, [None, self.buffers()[1]])
Should this rather be implemented in C++? (There is currently no method on ListArray in C++ that directly returns the offsets as an arrow::Array.)
Right, that would be better IMO.
Easy to do; picking a name for such a method is maybe the hardest part.
We have
ListArray::values -> shared_ptr<Array>
ListArray::value_offset -> int32_t
ListArray::value_offsets -> shared_ptr<Buffer>
ListArray::raw_value_offsets -> const int32_t*
Thoughts about what to call it?
ListArray::offsets_array?
As I noted in the issue, the current
@jorisvandenbossche I think it makes sense to follow the semantics of the current implementation. Could you please create an issue from your JIRA comment to track it?
Hm, I think we should probably slice the offsets so that So to summarize:
de87830 to c63ac61
c63ac61 to 6642257
@wesm I implemented the behaviour you proposed (in Python for now, first making sure I understood correctly). So we now have:
In [35]: arr = pa.ListArray.from_arrays(offsets=[0, 3, 5], values=[1, 2, 3, 4, 5])
In [36]: arr
Out[36]:
<pyarrow.lib.ListArray object at 0x7ff2e6406be8>
[
[
1,
2,
3
],
[
4,
5
]
]
In [37]: arr.values
Out[37]:
<pyarrow.lib.Int64Array object at 0x7ff2d1d813a8>
[
1,
2,
3,
4,
5
]
In [38]: arr.offsets
Out[38]:
<pyarrow.lib.Int32Array object at 0x7ff2e6406b28>
[
0,
3,
5
]
In [39]: arr2 = arr[1:]
In [40]: arr2
Out[40]:
<pyarrow.lib.ListArray object at 0x7ff2e6406c48>
[
[
4,
5
]
]
# values still returns the full values array (same as for arr)
In [41]: arr2.values
Out[41]:
<pyarrow.lib.Int64Array object at 0x7ff2e6406b88>
[
1,
2,
3,
4,
5
]
# but the offsets are sliced, so you can use this to correctly index
# the values for this sliced array
In [42]: arr2.offsets
Out[42]:
<pyarrow.lib.Int32Array object at 0x7ff2d1d811c8>
[
3,
5
]
In [43]: arr2.values[3:5]
Out[43]:
<pyarrow.lib.Int64Array object at 0x7ff2e6406dc8>
[
4,
5
]
That looks correct?
Yes, that looks right to me. +1. We can handle the C++ issue separately.
OK, created https://issues.apache.org/jira/browse/ARROW-7068 for the C++ side.
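For readers following the thread, the slicing semantics agreed on above (values stay whole, offsets are sliced) can be sketched in plain Python. This is only a toy model: `ListArrayModel` and its methods are hypothetical names made up for illustration, not pyarrow classes, and it says nothing about how pyarrow implements this internally.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ListArrayModel:
    """Toy stand-in for a list array (hypothetical, not a pyarrow class)."""

    values: Sequence[int]  # flat child values, shared unchanged across slices
    offsets: List[int]     # len(array) + 1 entries that index into values

    def __len__(self) -> int:
        return len(self.offsets) - 1

    def slice(self, start: int) -> "ListArrayModel":
        # Slicing keeps the full values and narrows only the offsets.
        return ListArrayModel(self.values, self.offsets[start:])

    def to_pylist(self) -> List[List[int]]:
        # Element i spans values[offsets[i]:offsets[i + 1]].
        return [
            list(self.values[self.offsets[i]:self.offsets[i + 1]])
            for i in range(len(self))
        ]


arr = ListArrayModel(values=[1, 2, 3, 4, 5], offsets=[0, 3, 5])
arr2 = arr.slice(1)

print(arr2.offsets)      # offsets of the sliced array
print(arr2.to_pylist())  # sublists recovered from the shared values
```

Because the sliced offsets still point into the unsliced values, `values[offsets[i]:offsets[i + 1]]` gives the right sublist for both the original and the sliced array, which is the property checked in the session above with `arr2.values[3:5]`.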
Follow-up on #5759, which was apparently merged too quickly (I only now saw that I implemented the slicing behaviour only for ListArray, and had not yet updated LargeListArray). Also added the LargeListArray.values attribute, which was missing (compared to ListArray).

Closes #5784 from jorisvandenbossche/ARROW-7031-follow-up and squashes the following commits:

b84c496 <Joris Van den Bossche> ARROW-7031: Correct LargeListArray.offsets attribute

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Wes McKinney <wesm+git@apache.org>