Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++/Python] Kernel for SetItem(IntegerArray, values) ("replace_with_indices") #25505

Open
asfimport opened this issue Jul 13, 2020 · 24 comments

Comments

@asfimport
Copy link

asfimport commented Jul 13, 2020

We should have a kernel that allows overriding the values of an array using an integer array as the indexer and a scalar or array of equal length as the values.

Reporter: Uwe Korn / @xhochy

Related issues:

Note: This issue was originally created as ARROW-9431. Please see the migration documentation for further details.

@asfimport
Copy link
Author

Niranda Perera / @nirandaperera:
Identical issue

@asfimport
Copy link
Author

Joris Van den Bossche / @jorisvandenbossche:
@nirandaperera you linked ARROW-9430 as duplicate. Both are of course closely related, but there is a difference: using integer indices vs a boolean mask to select the values you want to set

@asfimport
Copy link
Author

Niranda Perera / @nirandaperera:
Thank you @jorisvandenbossche. That was clearly my mistake :(

@asfimport
Copy link
Author

Niranda Perera / @nirandaperera:
I'd like to take this one. @jorisvandenbossche  @xhochy @pitrou I'd like to get your feedback on the following.

Just to be clear, an example of this would be,

 

// code placeholder
arr1 = [a, b, c, d]
arr2 = [l, m, n, o, p]

to_replace = [0, 2, 3]

res = replace_by_index(to_replace, arr1, arr2)

# res --> [l, b, n, o]

 

What is a good name for the kernel? I'm not sure about replace_by_index because Arrays are immutable. Any suggestions?

Can we assume that 'to_replace' is a non-null Int64Array?

And should we enforce 'to_replace' being sorted (IMO this would be more efficient)?

What is the arity of this kernel? Unary or binary?

 

@asfimport
Copy link
Author

Joris Van den Bossche / @jorisvandenbossche:
I think one difference in your example is that I expect the arr2 to be of the same length as to_replace

Can we assume that 'to_replace' is a non-null Int64Array?

That might be the easiest for now. Or otherwise we can say nulls in to_replace give nulls in the output. The question is then how to interpret arr2, though (should it be of the same length as to_replace, or as the non-null values of to_replace ?)

And should we enforce 'to_replace' being sorted (IMO this would be more efficient)?

Hmm, not sure. Also for Take we generally don't require sortedness of the indices, I think?

What is the arity of this kernel? Unary or binary?

Ternary?

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
Indeed, we don't require sortedness of the indices for Take. Also, I'm not convinced it would be more efficient (why?).

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
The example doesn't quite look right to me. I would expect:

arr1 = [a, b, c, d]
arr2 = [l, m, n, o, p]

to_replace = [0, 2, 3]

res = replace_by_index(to_replace, arr1, arr2)

# res --> [l, b, m, n]

@xhochy What do you think?

@asfimport
Copy link
Author

Niranda Perera / @nirandaperera:
@pitrou my idea was something like this.

def replace_by_index(to_replace: Array, arr1: Array, arr2: Array) -> Array:
  out = arr1 # copy array
  for i in to_replace:
    out[i] = arr2[i]
  return out 

 

Its like an if-else mask, but the mask's set bits are encoded in an array ARROW-10640

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
If this is meant to emulate Numpy's advanced indexing, then I don't think this is the right semantics.

@asfimport
Copy link
Author

Niranda Perera / @nirandaperera:
I see.. Can you give an example?

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:

>>> a = np.array(["a", "b", "c", "d"])
>>> b = np.array(["m", "n", "o"])
>>> a[[0,2,3]] = b
>>> a
array(['m', 'b', 'n', 'o'], dtype='<U1')

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
@xhochy What is the intended use case?

@asfimport
Copy link
Author

Niranda Perera / @nirandaperera:
Got it. Think that makes sense. So, this will ultimately be another ternary kernel, isnt it?

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
What do you mean by "another ternary kernel"?

@asfimport
Copy link
Author

Niranda Perera / @nirandaperera:
I'm currently working on the 'if-else' function ARROW-10640. And AFAIK it is the only ternary function that is there ATM. This will be another one.

@asfimport
Copy link
Author

asfimport commented May 27, 2021

Uwe Korn / @xhochy:

@xhochy  What is the intended use case?

The intention was to support pandas' def __setitem__(self, key, value) where key is an array and value is an array of equal length. In fletcher we currently roundtrip to numpy for that: https://github.com/xhochy/fletcher/blob/master/fletcher/base.py#L998-L1018 The biggest use case for this is though the ArrowStringArray that is being built inside of pandas currently.

@asfimport
Copy link
Author

Joris Van den Bossche / @jorisvandenbossche:
It's "another ternary kernel" in the sense that it accepts three arguments, but it's quite different from "if-else" in that it is not a scalar kernel like "if-else" is (for "if-else" all inputs are of the same length (or scalars broadcasted), end then the operation is basically element-wise, while for this "setitem" that is not the case).

@nirandaperera to use your pseudo-code, the expected behaviour looks like:

def replace_by_index(to_replace: Array, arr1: Array, arr2: Array) -> Array:
  out = arr1 # copy array
  for idx, val in zip(to_replace, arr2):
    out[idx] = val
  return out 

so where to_replace and arr2 have the same length, but are not of the same length as arr1.

@asfimport
Copy link
Author

Niranda Perera / @nirandaperera:
Yes, understood. I will take this up once I have completed ARROW-10640. Still figuring out the housekeeping work for ternary kernels :)

@asfimport
Copy link
Author

Niranda Perera / @nirandaperera:
@jorisvandenbossche @xhochy  my initial idea is the following cases for the function.

def replace_by_index(source, indices, values)

array(k) --> array of size k, M <= N

|source|indices|values|output|
|
|-|-|-|-|-|
|array(N)|array(M)|array(M)|array(N)|
|
|array(N)|array(M)| scalar|array(N)|
|
|array(N)|scalar|scalar|array(N)|



That covers all the cases IMO. WDYT?|

@asfimport
Copy link
Author

Joris Van den Bossche / @jorisvandenbossche:
Yes, I think that covers all cases. I am personally not fully sure that the third case of a scalar for the indices is necessarily important to cover for an initial implementation (for example also the "take" kernel only supports an actual array for the indices)

@asfimport
Copy link
Author

@asfimport
Copy link
Author

Niranda Perera / @nirandaperera:
I'd like to start looking at this Jira again. @jorisvandenbossche  Has there been any developments for the last couple of months (affecting this requirement) or can I still go ahead? 

@asfimport
Copy link
Author

Joris Van den Bossche / @jorisvandenbossche:
I am not aware of any work that happened in the meantime. So taking a look at this issue would be very welcome!

The related ARROW-9430 has been done by now, and there we went with a name replace_with_mask for the kernel to set values with a mask. So maybe for this kernel we can then use replace_with_indices to follow the same naming scheme?

@asfimport
Copy link
Author

Todd Farmer / @toddfarmer:
This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant