Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-7362: [Python][C++] Added ListArray.Flatten() that properly flattens a ListArray #6006

Closed
wants to merge 8 commits into from

Conversation

brills
Copy link
Contributor

@brills brills commented Dec 10, 2019

Currently ListArray.flatten() simply returns the child array. If a ListArray is a slice of another ListArray, they will share the same child array, however the expected behavior (I think) of flatten() should be returning an Array that's a concatenation of all the sub-lists in the ListArray, so the slicing offset should be taken into account.

For example:

a = pa.array([[1], [2], [3]])

assert a.flatten().equals(pa.array([1,2,3]))

# expected:
a.slice(1).flatten().equals(pa.array([2, 3]))

@github-actions
Copy link

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@wesm wesm changed the title Arrow-7362: [Python][C++] Added ListArray.Flatten() that properly flattens a ListArray ARROW-7362: [Python][C++] Added ListArray.Flatten() that properly flattens a ListArray Dec 10, 2019
Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 -- I'm in favor of these changes. I can't remember what was the argument against this in the past but I will wait for others to chime in before merging

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I am also fine with this. I think the difference between .values/.offsets and flatten() can be a bit confusing (eg that you then cannot use the result of offsets together with the result of flatten()), so we should document this properly.

Can you:

  • add python tests that asserts the different behaviour of values and flatten()
  • update the docstrings of the different methods/properties ?

This is a breaking change, though. In principle we could deprecate it with through a keyword in the flatten() method, but not sure that is worth it.

@github-actions
Copy link

cpp/src/arrow/array.h Outdated Show resolved Hide resolved
@brills
Copy link
Contributor Author

brills commented Dec 10, 2019

@bkietz made a good point and this PR cannot proceed without a resolution of that.

The most straightforward way to address that is to implement a O(N) flatten(), in which the null bitmap and the offsets are scanned, and non-null fragments are collected, and then Concatenate() is called if necessary.

That implies flatten() will need to take a MemoryPool, and will need to return a Status/Result<>.

However I think it's valuable to have a O(1) flatten().. Do we know how common is the case where non-empty lists are behind null elements?

@wesm
Copy link
Member

wesm commented Dec 10, 2019

However I think it's valuable to have a O(1) flatten().. Do we know how common is the case where non-empty lists are behind null elements?

It's not extraordinarily common, but such a flatten would probably need to be strictly regarded as being "unsafe"

@bkietz
Copy link
Member

bkietz commented Dec 10, 2019

@brills I don't think methods which have preconditions outside the guarantees of the arrow format should be added to /\w*Array/, which unfortunately includes the always-O(1) Flatten. Enforcing empty lists behind nulls is a topic for the mailing list; the impact of that change would be too great to hash out here.

@brills
Copy link
Contributor Author

brills commented Dec 10, 2019

Is the O(N) Flatten() still desirable as a method of {Large,}ListArray?

@brills brills requested review from bkietz and wesm December 10, 2019 23:09
@wesm
Copy link
Member

wesm commented Dec 13, 2019

Yes, I would say so

Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think O(N) flatten is worthwhile.

Some observations to recover O(1) flattening:

  • You can still apply the O(1) method when null count is 0.
  • If you happen to know that your null lists are also empty you can drop the null bitmap to trivially coalesce the nulls to empty lists.
  • Since the coalesced list array would have a null count of 0, calling flatten on it would trigger the O(1) path.

@brills
Copy link
Contributor Author

brills commented Dec 16, 2019

@bkietz @wesm
Could you take a look at this logic?

return Status::Invalid("Offset invariant failure at: ", i,

It seems to me that it is required that a null value should be backed by an empty sub-list?

@brills brills requested a review from bkietz December 16, 2019 21:13
@wesm
Copy link
Member

wesm commented Dec 16, 2019

It seems to me that it is required that a null value should be backed by an empty sub-list?

That may be an error. There was prior discussion about this on the mailing list cc @pitrou

@brills
Copy link
Contributor Author

brills commented Dec 16, 2019

I found https://issues.apache.org/jira/browse/ARROW-6929.

I made the O(N) approach (with a shortcut) PTAL. Tests are also updated.

Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, just a few nits

cpp/src/arrow/array.cc Outdated Show resolved Hide resolved
cpp/src/arrow/array.cc Outdated Show resolved Hide resolved
@brills
Copy link
Contributor Author

brills commented Dec 17, 2019

Hmm.. maybe I missed something, but the Ursabot Python build failed and it looked like the changes in pyarrow were not picked up.. This PR builds on my local environment.

@brills brills requested a review from bkietz December 17, 2019 18:00
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

python/pyarrow/array.pxi Outdated Show resolved Hide resolved
@pitrou
Copy link
Member

pitrou commented Dec 18, 2019

Hmm.. maybe I missed something, but the Ursabot Python build failed and it looked like the changes in pyarrow were not picked up.. This PR builds on my local environment.

It's weird. I failed to reproduce using docker-compose.

@pitrou
Copy link
Member

pitrou commented Dec 18, 2019

Ahah, it's a git merge error. It's merging the Flatten declaration into CFixedSizeListArray... will fix.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few minor comments on the docstring / tests

Can you mention in the offsets property's docstring that this are the offsets into values and not flatten() ?

python/pyarrow/array.pxi Outdated Show resolved Hide resolved
python/pyarrow/array.pxi Outdated Show resolved Hide resolved
@@ -1766,47 +1763,6 @@ def test_list_array_flatten():
assert arr2.flatten().flatten().equals(arr0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now .valuesis implemented separately from .flatten(), can you add asserts for this property as well?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

// Shortcut: if a ListArray does not contain nulls, then simply slice its
// value array with the first and the last offsets. We don't use
// Array::null_count() because it's potentially O(N) (thus not faster than not
// taking the shortcut).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is suboptimal. null_count() is cached, and even the first computation is much faster than walking the null bits one by one (because it uses CPU popcount instructions over entire words).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to know. Thanks!

@pitrou
Copy link
Member

pitrou commented Dec 18, 2019

@pitrou pitrou closed this in d0126e7 Dec 18, 2019
tfx-copybara pushed a commit to tensorflow/tfx-bsl that referenced this pull request Jun 5, 2020
Because the following PRs are merged in 0.16, we are able to clean-up the
code base a lot:
apache/arrow#6066
apache/arrow#6006

PiperOrigin-RevId: 314947604
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants