-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-7362: [Python][C++] Added ListArray.Flatten() that properly flattens a ListArray #6006
Conversation
Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? Then could you also rename pull request title in the following format?
See also: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 -- I'm in favor of these changes. I can't remember what was the argument against this in the past but I will wait for others to chime in before merging
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I am also fine with this. I think the difference between .values
/.offsets
and flatten()
can be a bit confusing (eg that you then cannot use the result of offsets
together with the result of flatten()
), so we should document this properly.
Can you:
- add python tests that asserts the different behaviour of
values
andflatten()
- update the docstrings of the different methods/properties ?
This is a breaking change, though. In principle we could deprecate it with through a keyword in the flatten()
method, but not sure that is worth it.
@bkietz made a good point and this PR cannot proceed without a resolution of that. The most straightforward way to address that is to implement a O(N) flatten(), in which the null bitmap and the offsets are scanned, and non-null fragments are collected, and then Concatenate() is called if necessary. That implies flatten() will need to take a MemoryPool, and will need to return a Status/Result<>. However I think it's valuable to have a O(1) flatten().. Do we know how common is the case where non-empty lists are behind null elements? |
It's not extraordinarily common, but such a flatten would probably need to be strictly regarded as being "unsafe" |
@brills I don't think methods which have preconditions outside the guarantees of the arrow format should be added to |
Is the O(N) Flatten() still desirable as a method of {Large,}ListArray? |
Yes, I would say so |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think O(N) flatten is worthwhile.
Some observations to recover O(1) flattening:
- You can still apply the O(1) method when null count is 0.
- If you happen to know that your null lists are also empty you can drop the null bitmap to trivially coalesce the nulls to empty lists.
- Since the coalesced list array would have a null count of 0, calling flatten on it would trigger the O(1) path.
@bkietz @wesm arrow/cpp/src/arrow/array/validate.cc Line 411 in 860796e
It seems to me that it is required that a null value should be backed by an empty sub-list? |
That may be an error. There was prior discussion about this on the mailing list cc @pitrou |
I found https://issues.apache.org/jira/browse/ARROW-6929. I made the O(N) approach (with a shortcut) PTAL. Tests are also updated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good, just a few nits
Hmm.. maybe I missed something, but the Ursabot Python build failed and it looked like the changes in pyarrow were not picked up.. This PR builds on my local environment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
It's weird. I failed to reproduce using |
Ahah, it's a git merge error. It's merging the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few minor comments on the docstring / tests
Can you mention in the offsets
property's docstring that this are the offsets into values
and not flatten()
?
@@ -1766,47 +1763,6 @@ def test_list_array_flatten(): | |||
assert arr2.flatten().flatten().equals(arr0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now .values
is implemented separately from .flatten()
, can you add asserts for this property as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
cpp/src/arrow/array.cc
Outdated
// Shortcut: if a ListArray does not contain nulls, then simply slice its | ||
// value array with the first and the last offsets. We don't use | ||
// Array::null_count() because it's potentially O(N) (thus not faster than not | ||
// taking the shortcut). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is suboptimal. null_count()
is cached, and even the first computation is much faster than walking the null bits one by one (because it uses CPU popcount instructions over entire words).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good to know. Thanks!
AppVeyor: https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/29626801 Will merge when green. |
Because the following PRs are merged in 0.16, we are able to clean-up the code base a lot: apache/arrow#6066 apache/arrow#6006 PiperOrigin-RevId: 314947604
Currently ListArray.flatten() simply returns the child array. If a ListArray is a slice of another ListArray, they will share the same child array, however the expected behavior (I think) of flatten() should be returning an Array that's a concatenation of all the sub-lists in the ListArray, so the slicing offset should be taken into account.
For example: