Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-10378: [Rust] Update take() kernel with support for LargeList. #8556

Closed
wants to merge 2 commits into from

Conversation

drusso
Copy link
Contributor

@drusso drusso commented Oct 30, 2020

This change adds support for LargeList in take().

There is an additional update to the underlying implementation of take() such that the indices may be any PrimitiveArray of ArrowNumericType, rather than only UInt32Array. This change is motivated by the recursive call to take() in take_list() (here), since in order to support LargeListArray, which use i64 offsets, the recursive call must support indices arrays that are Int64Array.

@github-actions
Copy link

@jorgecarleitao
Copy link
Member

jorgecarleitao commented Oct 31, 2020

Hi @drusso , thanks a lot.

I haven't look into the changes in detail, but I always understood offsets and indices as different concepts: List vs LargeList is about the maximum value of the offset (i.e. the number of items including childs), while the indices in take concern the maximum size of an array.

AFAI understand, LargeList overflows when its maximum offset (i.e. total number of bytesentries on all entries) is larger than i64::MAX, while take overflows when the maximum len of the array is larger than i32::MAX. AFAI understand, the reason we have Large vs "small" is that if the lists themselves have a lot of entries, e.g. pixels in an image or something, the total number of entries of the array (~ len * avg(inner len)) may overflow an i32.

According to the spec,

A length as a 64-bit signed integer. Implementations are permitted to be limited to 32-bit lengths, see more on this below.

So, if anything, what we could change is to make take accept an i64 instead of an i32 (to take into account the larger size). I would probably still try to understand why the i32 was chosen in the first place, specially when lengths in Rust are currently measured in usize (see ArrayData::len). Maybe @nevi-me @sunchao or @andygrove know the answer ^_^

@drusso
Copy link
Contributor Author

drusso commented Nov 1, 2020

I'll add some background context on the implementation of take_list(), which I hope might clarify some of the changes in this pull request.

Let's start with a list array:

Lists:
  ["a" "b"] ["c" "d"] ["e"] ["f" "g" "h" "i" "j"]

Array:
  Offsets: [0 2 4 5 10]
  Values: ["a" "b" "c" "d" "e" "f" "g" "h" "i" "j"]

For this example, let's take indices [2 1 0]. The resulting array should be:

Lists:
  ["e"] ["c" "d"] ["a" "b"]

Array:
  Offsets: [0 1 3 5]
  Values: ["e" "c" "d" "a" "b"]

Let's look specifically at the input values array and the output values array:

Input:
  Values: ["a" "b" "c" "d" "e" "f" "g" "h" "i" "j"]

Output:
  Values: ["e" "c" "d" "a" "b"]

We can think of the values arrays as their sequences of indices into the input values array:

Input:
  Indices: [0   1   2   3   4   5   6   7   8   9  ]
  Values:  ["a" "b" "c" "d" "e" "f" "g" "h" "i" "j"]

Output:
  Indices: [4   2   3   0   1  ]
  Values:  ["e" "c" "d" "a" "b"]

Constructing that output values array can be accomplished by:

take(
  ["a" "b" "c" "d" "e" "f" "g" "h" "i" "j"],
  [4 2 3 0 1],
) // => ["e" "c" "d" "a" "b"]

This is what the implementation of take_list() is doing.
First [2 1 0] is mapped to [4 2 3 0 1] by take_value_indices_from_list(). Then take_list() calls take() here.

The [4 2 3 0 1] is based on the original list array's offsets. For ListArray they are i32, and for LargeListArray they are i64. And that is ultimately why I've updated the implementations of take() to accept indices as any PrimitiveArray<ArrowNumericType>, rather than UInt32Array.


I'll also note that the generic implementation is in take_impl(), but take() remains UIn32Array. If we'd like, we can replace take() with the generic form take_impl(), that way take() itself can accept indices of any type.

@jorgecarleitao
Copy link
Member

You are obviously right, @drusso. I am sorry for the confusion.

@nevi-me nevi-me self-requested a review November 2, 2020 06:54
@nevi-me
Copy link
Contributor

nevi-me commented Nov 2, 2020

I've been away for a few days, I'll review this in the evening GMT+2, as I worked on the initial take_list()

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good to me -- I had some code style questions, but all in all it looks like a nice generalization to me

rust/arrow/src/compute/kernels/take.rs Show resolved Hide resolved
rust/arrow/src/compute/util.rs Show resolved Hide resolved
Copy link
Contributor

@nevi-me nevi-me left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

let start = offsets[ix];
let end = offsets[ix + 1];
current_offset += (end - start) as i32;
current_offset = current_offset + (end - start);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I probably did this in response to a clippy lint, but it's fine as we have more lints to fix at some point

@nevi-me nevi-me closed this in e6366dc Nov 7, 2020
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
This change adds support for `LargeList` in `take()`.

There is an additional update to the underlying implementation of `take()` such that the indices may be any `PrimitiveArray` of `ArrowNumericType`, rather than only `UInt32Array`. This change is motivated by the recursive call to `take()` in `take_list()` ([here](https://github.com/apache/arrow/blob/b109195b77d85e513aab80650bd4b193e26a5471/rust/arrow/src/compute/kernels/take.rs#L324)), since in order to support `LargeListArray`, which use `i64` offsets, the recursive call must support indices arrays that are `Int64Array`.

Closes apache#8556 from drusso/ARROW-10378

Authored-by: Daniel Russo <danrusso@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants