[C++] list_parent_indices only computes for first chunk #29317

Closed
asfimport opened this issue Aug 20, 2021 · 6 comments

@asfimport

Pyarrow version: 5.0.0. 
Python version: 3.7.9

I came across this issue due to very unexpected behaviour from the "explode" function obtained here:

https://issues.apache.org/jira/browse/ARROW-12099

That function contains this line:

  indices = pc.list_parent_indices(table[col_name])

If table[col_name] in this example contains several chunks, the indices look perfectly fine for the first chunk but are erratic and unexpected for the second chunk.
No warning or info message is given either.

A workaround that solved the problem for me is:

  indices = pc.list_parent_indices(table.combine_chunks()[col_name])

With the chunks combined first, the resulting indices change dramatically.

I'm assuming this isn't expected and should be fixed?

Reporter: Tor Eivind McKenzie-Syvertsen
Assignee: Antoine Pitrou / @pitrou


Note: This issue was originally created as ARROW-13681. Please see the migration documentation for further details.

@asfimport

Joris Van den Bossche / @jorisvandenbossche:
[~TorMcK] could you show a reproducible example?

I can't directly reproduce it with this:

In [1]: import pyarrow as pa

In [2]: import pyarrow.compute as pc

In [3]: arr = pa.array([[1, 2], [3, 4, 5]])

In [4]: pc.list_parent_indices(arr)
Out[4]: 
<pyarrow.lib.Int32Array object at 0x7f9eebd67be0>
[
  0,
  0,
  1,
  1,
  1
]

In [5]: chunked_arr = pa.chunked_array([arr, arr])

In [6]: pc.list_parent_indices(chunked_arr)
Out[6]: 
<pyarrow.lib.ChunkedArray object at 0x7f9f37c926d0>
[
  [
    0,
    0,
    1,
    1,
    1
  ],
  [
    0,
    0,
    1,
    1,
    1
  ]
]

In [7]: pa.__version__
Out[7]: '5.0.0'

where calling the compute function on the chunked array also gives a chunked array as result.

@asfimport

Antoine Pitrou / @pitrou:
Issue resolved by pull request 10985
#10985

@asfimport

Antoine Pitrou / @pitrou:
@jorisvandenbossche The results were wrong for the second chunk as they were indexed from the start of the chunk, rather than the start of the entire chunked array (think what happens if you call take() with the result indices).
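
To make the difference concrete, the following is a plain-Python sketch of the two indexing schemes. It does not use pyarrow and is not the actual C++ kernel; the function names and the `rows` labels are hypothetical and only serve the illustration.

```python
def parent_indices_per_chunk(chunks):
    """Old behaviour: parent indices restart at 0 in every chunk."""
    return [[i for i, lst in enumerate(chunk) for _ in lst] for chunk in chunks]


def parent_indices_global(chunks):
    """Fixed behaviour: indices count from the start of the whole chunked array."""
    result = []
    offset = 0  # number of list rows seen in earlier chunks
    for chunk in chunks:
        result.append([offset + i for i, lst in enumerate(chunk) for _ in lst])
        offset += len(chunk)
    return result


def take(rows, chunked_indices):
    """Pick rows by index, flattening the chunked indices first."""
    return [rows[i] for chunk in chunked_indices for i in chunk]


chunks = [[[1, 2], [3, 4, 5]], [[1, 2], [3, 4, 5]]]
rows = ["a", "b", "c", "d"]  # one hypothetical label per list row

per_chunk = parent_indices_per_chunk(chunks)
global_indices = parent_indices_global(chunks)

print(take(rows, per_chunk))       # rows "c" and "d" are never selected
print(take(rows, global_indices))  # every row is repeated once per list element
```

With chunk-relative indices, a take-style lookup against the full column maps the second chunk's elements back to rows of the first chunk, which is why an "explode" built on those indices gives wrong results.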

@asfimport

Antoine Pitrou / @pitrou:
Concretely, the (correct) result is now as follows:

>>> arr = pa.array([[1, 2], [3, 4, 5]])
>>> pc.list_parent_indices(pa.chunked_array([arr, arr]))
<pyarrow.lib.ChunkedArray object at 0x7faa18d1ba10>
[
  [
    0,
    0,
    1,
    1,
    1
  ],
  [
    2,
    2,
    3,
    3,
    3
  ]
]

@asfimport

Joris Van den Bossche / @jorisvandenbossche:
Ah, yes of course :) (I was so focused on expecting the bug to be that indices were returned only for the first chunk that I missed that the indices for the second chunk were wrong!)

@asfimport

Tor Eivind McKenzie-Syvertsen:
Thanks for fixing this :) 

@asfimport asfimport added this to the 6.0.0 milestone Jan 11, 2023