Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-7031: [Python] Expose the offsets of a ListArray in python #5759

Conversation

jorisvandenbossche
Copy link
Member

No description provided.

Return the offsets as an int32 array.
"""
return Array.from_buffers(
int32(), len(self) + 1, [None, self.buffers()[1]])
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this rather be implemented in C++ (there is currently no method on ListArray in C++ that directly returns the offsets as an arrow::Array)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, that would be better IMO.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Easy to do, picking a name for such a method maybe is the hardest part.

We have

ListArray::values -> shared_ptr<Array>
ListArray::value_offset -> int32_t
ListArray::value_offsets -> shared_ptr<Buffer>
ListArray::raw_value_offsets -> const int32_t*

Thoughts about what to call it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ListArray::offsets_array?

@jorisvandenbossche
Copy link
Member Author

As I noted in the issue, the current ListArray.values (and now also the ListArray.offsets) does not take the Array.offset into account. Personally, I think it would be useful to have some method to get the values/offsets that do take that into account, so the python does not need to be aware of sliced arrays.

@github-actions
Copy link

@kszucs
Copy link
Member

kszucs commented Nov 1, 2019

@jorisvandenbossche I think it makes sense to follow the semantics of the current implemenation. Could you please create an issue from your JIRA comment to track it?

@wesm
Copy link
Member

wesm commented Nov 4, 2019

Hm, I think we should probably slices the offsets so that arr.offsets[0] is the offset of the first element in arr.values. They aren't strictly comparable because the offsets are relative to values.

So to summarize:

  • Slice result of offsets
  • Do not slice result of values

@jorisvandenbossche
Copy link
Member Author

@wesm I implemented the behaviour you proposed (in Python for now, first making sure I understood correctly). So we now have:

In [35]: arr = pa.ListArray.from_arrays(offsets=[0, 3, 5], values=[1, 2, 3, 4, 5])  

In [36]: arr  
Out[36]: 
<pyarrow.lib.ListArray object at 0x7ff2e6406be8>
[
  [
    1,
    2,
    3
  ],
  [
    4,
    5
  ]
]

In [37]: arr.values  
Out[37]: 
<pyarrow.lib.Int64Array object at 0x7ff2d1d813a8>
[
  1,
  2,
  3,
  4,
  5
]

In [38]: arr.offsets 
Out[38]: 
<pyarrow.lib.Int32Array object at 0x7ff2e6406b28>
[
  0,
  3,
  5
]

In [39]: arr2 = arr[1:]   

In [40]: arr2  
Out[40]: 
<pyarrow.lib.ListArray object at 0x7ff2e6406c48>
[
  [
    4,
    5
  ]
]

# values still return the full buffer (same as for arr)
In [41]: arr2.values 
Out[41]: 
<pyarrow.lib.Int64Array object at 0x7ff2e6406b88>
[
  1,
  2,
  3,
  4,
  5
]

# but the offsets are sliced, so you can use this to correctly index 
# the values for this sliced array
In [42]: arr2.offsets 
Out[42]: 
<pyarrow.lib.Int32Array object at 0x7ff2d1d811c8>
[
  3,
  5
]

In [43]: arr2.values[3:5]  
Out[43]: 
<pyarrow.lib.Int64Array object at 0x7ff2e6406dc8>
[
  4,
  5
]

That looks correct?

@wesm
Copy link
Member

wesm commented Nov 5, 2019

Yes, that looks right to me.

+1. We can handle the C++ issue separately

@wesm wesm closed this in 09c3049 Nov 5, 2019
@jorisvandenbossche jorisvandenbossche deleted the ARROW-7031-list-array-offsets branch November 5, 2019 17:44
@jorisvandenbossche
Copy link
Member Author

OK, created https://issues.apache.org/jira/browse/ARROW-7068 for the C++ side

wesm pushed a commit that referenced this pull request Nov 7, 2019
Follow-up on #5759, which was apparently merged too quickly (I only now saw that I did the slicing behaviour only for ListArray, and not yet updated LargeListArray).

Also added the LargeListArray.values attribute which was missing (compared to ListArray)

Closes #5784 from jorisvandenbossche/ARROW-7031-follow-up and squashes the following commits:

b84c496 <Joris Van den Bossche> ARROW-7031:  Correct LargeListArray.offsets attribute

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Wes McKinney <wesm+git@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants