ARROW-7362: [Python][C++] Added ListArray.Flatten() that properly flattens a ListArray #6006

brills · 2019-12-10T06:53:39Z

Currently, ListArray.flatten() simply returns the child array. If a ListArray is a slice of another ListArray, the two share the same child array. However, the expected behavior of flatten() (I think) is to return an Array that is the concatenation of all the sub-lists in the ListArray, so the slicing offset should be taken into account.

For example:

import pyarrow as pa

a = pa.array([[1], [2], [3]])

assert a.flatten().equals(pa.array([1, 2, 3]))

# expected:
assert a.slice(1).flatten().equals(pa.array([2, 3]))
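To make the slicing issue concrete, here is a minimal pure-Python sketch of a list array. The class `ToyListArray` and its fields are hypothetical stand-ins for illustration, not pyarrow's implementation, and nulls are ignored entirely.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyListArray:
    """Hypothetical toy model: list i spans values[offsets[i]:offsets[i + 1]]."""
    offsets: List[int]   # one more entry than the number of lists in the parent
    values: List[int]    # child array, shared between a parent and its slices
    start: int = 0       # slicing offset into `offsets`
    length: int = -1     # number of lists visible through this slice

    def __post_init__(self):
        if self.length < 0:
            self.length = len(self.offsets) - 1 - self.start

    def slice(self, start: int) -> "ToyListArray":
        # Slicing shares the buffers; only `start` and `length` change.
        return ToyListArray(self.offsets, self.values,
                            self.start + start, self.length - start)

    def child(self) -> List[int]:
        # Behaviour described as current: return the child array untouched.
        return self.values

    def flatten(self) -> List[int]:
        # Behaviour described as expected: honour the slicing offset.
        first = self.offsets[self.start]
        last = self.offsets[self.start + self.length]
        return self.values[first:last]


a = ToyListArray(offsets=[0, 1, 2, 3], values=[1, 2, 3])
sliced = a.slice(1)
```

In this model `sliced.child()` still exposes all three values, while `sliced.flatten()` only covers the lists visible through the slice.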

github-actions · 2019-12-10T07:00:44Z

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}


wesm

+1 -- I'm in favor of these changes. I can't remember what the argument against this was in the past, but I will wait for others to chime in before merging.

jorisvandenbossche

Yes, I am also fine with this. I think the difference between .values/.offsets and flatten() can be a bit confusing (e.g. you then cannot use the result of offsets together with the result of flatten()), so we should document this properly.

Can you:

- add Python tests that assert the different behaviour of values and flatten()
- update the docstrings of the different methods/properties?

This is a breaking change, though. In principle we could deprecate the old behaviour through a keyword in the flatten() method, but I'm not sure that is worth it.

github-actions · 2019-12-10T15:15:50Z

https://issues.apache.org/jira/browse/ARROW-7362

cpp/src/arrow/array.h

brills · 2019-12-10T17:27:36Z

@bkietz made a good point and this PR cannot proceed without a resolution of that.

The most straightforward way to address that is to implement an O(N) flatten(), in which the null bitmap and the offsets are scanned, the non-null fragments are collected, and Concatenate() is called if necessary.

That implies flatten() will need to take a MemoryPool, and will need to return a Status/Result<>.
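The O(N) approach described above can be sketched in pure Python as follows. This is an illustration under assumptions, not the code in this PR: `flatten_scan` is a hypothetical name, plain lists stand in for Arrow buffers, and the final loop plays the role of Concatenate().

```python
from typing import List, Optional, Sequence, Tuple


def flatten_scan(offsets: Sequence[int],
                 validity: Optional[Sequence[bool]],
                 values: Sequence[int]) -> List[int]:
    """Concatenate the value ranges of all non-null lists (hypothetical helper)."""
    fragments: List[Tuple[int, int]] = []
    for i in range(len(offsets) - 1):
        if validity is not None and not validity[i]:
            continue  # a null slot may still span child values; skip them
        begin, end = offsets[i], offsets[i + 1]
        if begin == end:
            continue  # empty list contributes nothing
        if fragments and fragments[-1][1] == begin:
            # Contiguous with the previous fragment: extend it instead of
            # starting a new one, so fewer pieces need concatenating.
            fragments[-1] = (fragments[-1][0], end)
        else:
            fragments.append((begin, end))
    out: List[int] = []
    for begin, end in fragments:  # stand-in for Concatenate()
        out.extend(values[begin:end])
    return out


# [[1], null, [3]] where the null slot is backed by a non-empty range (99).
offsets = [0, 1, 2, 3]
validity = [True, False, True]
values = [1, 99, 3]

scanned = flatten_scan(offsets, validity, values)
naive = values[offsets[0]:offsets[-1]]  # single slice, ignores the null bitmap
```

Here the single-slice result leaks the value hidden behind the null, which is the case the scan is meant to guard against.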

However, I think it's valuable to have an O(1) flatten(). Do we know how common the case is where non-empty lists are behind null elements?

wesm · 2019-12-10T17:32:34Z

> However, I think it's valuable to have an O(1) flatten(). Do we know how common the case is where non-empty lists are behind null elements?

It's not extraordinarily common, but such a flatten would probably need to be strictly regarded as being "unsafe".

bkietz · 2019-12-10T19:16:57Z

@brills I don't think methods which have preconditions outside the guarantees of the arrow format should be added to /\w*Array/, which unfortunately includes the always-O(1) Flatten. Enforcing empty lists behind nulls is a topic for the mailing list; the impact of that change would be too great to hash out here.

brills · 2019-12-10T22:11:57Z

Is the O(N) Flatten() still desirable as a method of {Large,}ListArray?

wesm · 2019-12-13T05:27:49Z

Yes, I would say so

bkietz

I think O(N) flatten is worthwhile.

Some observations to recover O(1) flattening:

- You can still apply the O(1) method when the null count is 0.
- If you happen to know that your null lists are also empty, you can drop the null bitmap to trivially coalesce the nulls to empty lists. Since the coalesced list array would have a null count of 0, calling flatten on it would trigger the O(1) path.
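A small Python sketch of these two observations, assuming a toy representation (plain lists for offsets, validity, and values). The names `flatten` and `nulls_are_empty` are hypothetical, and a Python list slice copies, whereas the Arrow slice it stands in for would be zero-copy.

```python
from typing import List, Optional, Sequence


def flatten(offsets: Sequence[int],
            validity: Optional[Sequence[bool]],
            values: Sequence[int]) -> List[int]:
    """Flatten a toy list array, taking the shortcut when there are no nulls."""
    null_count = 0 if validity is None else sum(not v for v in validity)
    if null_count == 0:
        # Shortcut: one slice between the first and the last offset.
        return list(values[offsets[0]:offsets[-1]])
    # General path: visit every slot and skip the null ones.
    out: List[int] = []
    for i, valid in enumerate(validity):
        if valid:
            out.extend(values[offsets[i]:offsets[i + 1]])
    return out


def nulls_are_empty(offsets: Sequence[int], validity: Sequence[bool]) -> bool:
    """True if every null slot spans an empty range of the child array."""
    return all(valid or offsets[i] == offsets[i + 1]
               for i, valid in enumerate(validity))


# [[1], null, [2, 3]] with the null slot backed by an empty range.
offsets = [0, 1, 1, 3]
validity = [True, False, True]
values = [1, 2, 3]

general = flatten(offsets, validity, values)
# Dropping the bitmap (validity=None) is only sound because the nulls are empty.
coalesced = flatten(offsets, None, values) if nulls_are_empty(offsets, validity) else None
```

When the precondition holds, both paths agree; when it does not, dropping the bitmap would expose the values behind the nulls.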

brills · 2019-12-16T21:10:17Z

@bkietz @wesm
Could you take a look at this logic?

arrow/cpp/src/arrow/array/validate.cc

Line 411 in 860796e

return Status::Invalid("Offset invariant failure at: ", i,

It seems to me that it is required that a null value should be backed by an empty sub-list?

wesm · 2019-12-16T21:13:34Z

> It seems to me that it is required that a null value should be backed by an empty sub-list?

That may be an error. There was prior discussion about this on the mailing list. cc @pitrou

brills · 2019-12-16T23:42:01Z

I found https://issues.apache.org/jira/browse/ARROW-6929.

I implemented the O(N) approach (with a shortcut). PTAL. The tests are also updated.

bkietz

This looks good, just a few nits

cpp/src/arrow/array.cc

brills · 2019-12-17T17:59:57Z

Hmm, maybe I missed something, but the Ursabot Python build failed, and it looks like the changes in pyarrow were not picked up. This PR builds in my local environment.

pitrou

+1

python/pyarrow/array.pxi

pitrou · 2019-12-18T10:49:27Z

> Hmm, maybe I missed something, but the Ursabot Python build failed, and it looks like the changes in pyarrow were not picked up. This PR builds in my local environment.

It's weird. I failed to reproduce using docker-compose.

pitrou · 2019-12-18T10:56:03Z

Ahah, it's a git merge error. It's merging the Flatten declaration into CFixedSizeListArray... will fix.

jorisvandenbossche

Few minor comments on the docstring / tests

Can you mention in the offsets property's docstring that these are the offsets into values and not into flatten()?
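The point about offsets can be illustrated with plain Python lists. This sketch assumes a sliced list array without nulls whose visible offsets start at a non-zero position; all variable names are made up for the example.

```python
# Child array shared with the parent, and the offsets seen through a slice
# that skips the parent's first list (hypothetical toy data, no nulls).
values = [10, 20, 30, 40]
offsets = [1, 2, 4]

# Flattening honours the slice, so its result starts at values[offsets[0]].
flattened = values[offsets[0]:offsets[-1]]

# Correct: offsets index into `values`.
lists_from_values = [values[offsets[i]:offsets[i + 1]]
                     for i in range(len(offsets) - 1)]

# Incorrect: using the same offsets against the flattened result.
misread_first_list = flattened[offsets[0]:offsets[1]]

# To index into the flattened result, rebase the offsets first.
rebased = [o - offsets[0] for o in offsets]
lists_from_flattened = [flattened[rebased[i]:rebased[i + 1]]
                        for i in range(len(rebased) - 1)]
```

Rebasing by the first offset only works here because there are no nulls; with nulls backed by non-empty ranges the flattened result can be shorter than the span between the first and last offset.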

python/pyarrow/array.pxi

jorisvandenbossche · 2019-12-18T11:17:39Z

python/pyarrow/tests/test_array.py

@@ -1766,47 +1763,6 @@ def test_list_array_flatten():
    assert arr2.flatten().flatten().equals(arr0)


Now that .values is implemented separately from .flatten(), can you add asserts for this property as well?

pitrou · 2019-12-18T13:43:06Z

cpp/src/arrow/array.cc

+  // Shortcut: if a ListArray does not contain nulls, then simply slice its
+  // value array with the first and the last offsets. We don't use
+  // Array::null_count() because it's potentially O(N) (thus not faster than not
+  // taking the shortcut).


This is suboptimal. null_count() is cached, and even the first computation is much faster than walking the null bits one by one (because it uses CPU popcount instructions over entire words).
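As a rough illustration of the difference, the sketch below counts nulls in a validity bitmap (least-significant bit first, a set bit meaning "valid") both bit by bit and eight bytes at a time. The function names are hypothetical, and `bin(word).count("1")` merely stands in for a hardware popcount instruction.

```python
def count_nulls_bitwise(bitmap: bytes, length: int) -> int:
    """Walk the validity bits one by one."""
    nulls = 0
    for i in range(length):
        if not (bitmap[i // 8] >> (i % 8)) & 1:
            nulls += 1
    return nulls


def count_nulls_wordwise(bitmap: bytes, length: int) -> int:
    """Count set bits over whole 64-bit words, then handle the tail."""
    full_bytes, rest = divmod(length, 8)
    word_end = full_bytes - full_bytes % 8
    set_bits = 0
    for start in range(0, word_end, 8):
        word = int.from_bytes(bitmap[start:start + 8], "little")
        set_bits += bin(word).count("1")  # stand-in for a popcount instruction
    for byte in bitmap[word_end:full_bytes]:
        set_bits += bin(byte).count("1")
    if rest:
        # Only the low `rest` bits of the last byte belong to the array.
        set_bits += bin(bitmap[full_bytes] & ((1 << rest) - 1)).count("1")
    return length - set_bits


bitmap = bytes([0b10110101, 0xFF, 0x00, 0b00000111]) * 3
length = 91  # deliberately not a multiple of 8
```

Both functions return the same count; the word-wise version touches one machine word per step instead of one bit.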

Good to know. Thanks!

pitrou · 2019-12-18T15:47:53Z

AppVeyor: https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/29626801

Will merge when green.


wesm changed the title ~~Arrow-7362: [Python][C++] Added ListArray.Flatten() that properly flattens a ListArray~~ ARROW-7362: [Python][C++] Added ListArray.Flatten() that properly flattens a ListArray Dec 10, 2019

wesm approved these changes Dec 10, 2019

jorisvandenbossche requested changes Dec 10, 2019

bkietz requested changes Dec 10, 2019

cpp/src/arrow/array.h (outdated)

brills requested review from bkietz and wesm December 10, 2019 23:09

bkietz requested changes Dec 13, 2019

brills requested a review from bkietz December 16, 2019 21:13

bkietz requested changes Dec 17, 2019

cpp/src/arrow/array.cc (outdated)

cpp/src/arrow/array.cc (outdated)

brills requested a review from bkietz December 17, 2019 18:00

bkietz approved these changes Dec 17, 2019

pitrou approved these changes Dec 18, 2019

python/pyarrow/array.pxi (outdated)

brills and others added 5 commits December 18, 2019 11:53

Added a C++ method Flatten() to ListArray.

a3d4d2f

Also changed the python wrapper.

3c87462

Take care of nulls correctly.

6aa3181

comments

5f3650f

Fix typo + print out conda env in conda docker builds

7d0b864

Fix git merge error

c789812

pitrou force-pushed the flatten branch from 0905b93 to c789812 Compare December 18, 2019 10:58

jorisvandenbossche reviewed Dec 18, 2019

Address review comments, add a test for non-canonical list arrays

d14210d

pitrou reviewed Dec 18, 2019

Improve implementation characteristics

4702f59

pitrou closed this in d0126e7 Dec 18, 2019

brills mentioned this pull request Mar 27, 2020

Merge pull request #11245 - Update the range for pyarrow to support new pyarrow version 0.16.0 apache/beam#11251

Merged

4 tasks

tfx-copybara pushed a commit to tensorflow/tfx-bsl that referenced this pull request Jun 5, 2020

Cleaned up some arrow utils.

eaa4c0b

Because the following PRs are merged in 0.16, we are able to clean-up the code base a lot: apache/arrow#6066 apache/arrow#6006 PiperOrigin-RevId: 314947604

asfimport mentioned this pull request Dec 18, 2019

[Python] ListArray.flatten() should take care of slicing offsets #16993

Closed
