[Python] Add combine_chunks method to ChunkedArray #23640

Closed
asfimport opened this issue Dec 10, 2019 · 4 comments

asfimport commented Dec 10, 2019

flatten() doesn't work on a ChunkedArray of list type. It returns only the ChunkedArray wrapped in a list, without flattening anything.

import pyarrow as pa

aa = pa.array([[1], [2]])
bb = pa.chunked_array([aa, aa])

bb.flatten()

Out[15]:
[<pyarrow.lib.ChunkedArray object> [ [ [ 1 ], [ 2 ] ], [ [ 1 ], [ 2 ] ] ]]

Expected:
[ <pyarrow.lib.Array object> [ 1, 2 ], <pyarrow.lib.Array object> [ 1, 2 ] ]

 

Reporter: marc abboud
Assignee: Andrew Wieteska / @arw2019

Note: This issue was originally created as ARROW-7363. Please see the migration documentation for further details.

Joris Van den Bossche / @jorisvandenbossche:
From looking at the code, I think that the ChunkedArray flatten() method maps to the StructArray.flatten() method, and not to the ListArray.flatten() method.

StructArray and ListArray implement (somewhat unfortunately maybe) a different flatten method: for StructArray it returns a list of arrays (returning one individual array for each field in the struct), while ListArray returns a new Array with one level of nesting reduced (list array -> array, or list of list array -> list array, ..).

I am not fully sure how to deal with this. Should ChunkedArray.flatten do something different depending on the type? (But it's also not nice that the return type is then variable.) Should we rename the flatten() method for ListArrays?

Daniel Nugent / @nugend:
It seems like there should be some way to get to a contiguous buffer of data from a ChunkedArray, even if it involves copying. I'm looking at something right now where I want to try to produce Parquet RowGroups of identical length to an input dataset, and it'd be nice to be able to handle this in Arrow before passing it off to the analysis functions I'm using.

Could it just be called unchunk or something? (Maybe a peanut butter pun would be good: creamy)

edit: D'oh. Just realized this is on Table already as combine_chunks. It should probably just be implemented for ChunkedArray, no?

Joris Van den Bossche / @jorisvandenbossche:
There is also already pa.concat_arrays, with which you can combine the chunks:

In [42]: chunked_array = pa.chunked_array([[1, 2], [3, 4]])

In [43]: chunked_array
Out[43]:
<pyarrow.lib.ChunkedArray object at 0x7fa785879ea8>
[
  [
    1,
    2
  ],
  [
    3,
    4
  ]
]

In [44]: pa.concat_arrays(chunked_array.chunks)
Out[44]:
<pyarrow.lib.Int64Array object at 0x7fa785824468>
[
  1,
  2,
  3,
  4
]

(which in the end uses the same C++ Concatenate functionality as combine_chunks)

So maybe we could indeed expose this as a combine_chunks method on ChunkedArray as well.

Joris Van den Bossche / @jorisvandenbossche:
Issue resolved by pull request 8657
#8657
