Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Kernel implementations for "match" function #17575

Closed
asfimport opened this issue Sep 19, 2017 · 14 comments
Closed

[C++] Kernel implementations for "match" function #17575

asfimport opened this issue Sep 19, 2017 · 14 comments

Comments

@asfimport
Copy link

Match computes a position index array from an array values into a set of categories

match(['a', 'b', 'a', null, 'b', 'a', 'b'], ['b', 'a'])

return [1, 0, 1, null, 0, 1, 0]

Reporter: Wes McKinney / @wesm
Assignee: Preeti Suman / @psuman65

PRs and other links:

Note: This issue was originally created as ARROW-1560. Please see the migration documentation for further details.

@asfimport
Copy link
Author

Atri Sharma / @atris:
Can someone please assign this to me?

@asfimport
Copy link
Author

Uwe Korn / @xhochy:
@atris Assigned you

@asfimport
Copy link
Author

Atri Sharma / @atris:
Thanks!

@asfimport
Copy link
Author

Micah Kornfield / @emkornfield:
Is the intended output an array of the smallest numeric type capable of holding the index (or some fixed size)?

@asfimport
Copy link
Author

Uwe Korn / @xhochy:
[~emkornfield@gmail.com] This could also be an array of int64 for simplicity/as a start.

@asfimport
Copy link
Author

Francois Saint-Jacques / @fsaintjacques:
What should it return when it doesn't match?

@asfimport
Copy link
Author

Wes McKinney / @wesm:
R returns null by default

> match(c(1, 2, 3), c(2, 3, 4))
[1] NA  1  2

It's configurable though

> match(c(1, 2, 3), c(2, 3, 4), nomatch=-1)
[1] -1  1  2

@asfimport
Copy link
Author

Preeti Suman / @psuman65:
Can someone assign this to me? 

@asfimport
Copy link
Author

Wes McKinney / @wesm:
Just made you a Contributor and assigned the issue to you.

Could you describe your implementation approach before you go too far down the rabbit hole? We want to make use of the existing hashing machinery that we are using for the Unique and DictionaryEncode functions

@asfimport
Copy link
Author

Micah Kornfield / @emkornfield:
Should there be two implementations? One for small lists (linear scan) and
one with hashtable ?

@asfimport
Copy link
Author

Wes McKinney / @wesm:
Yes that sounds right to me

@asfimport
Copy link
Author

Preeti Suman / @psuman65:
For match (and isin) compute kernel , in left and right array, if there are nulls in the input,

a) Do we need to match null with null or ignore null completely?

    Example:

    match(['a', 'b', null], ['a', 'c', null])

    Expected output [0, null, 2]

b) If we need to compare, what will be the suggested way to traverse nulls if we use the VisitValue and VisitNull (using ArrayDataVisitor) for the array?

@asfimport
Copy link
Author

Wes McKinney / @wesm:
R does match NA (null-ish values) so that should probably be the default

> match(c(NA, NA, NA, NA), NA)
[1] 1 1 1 1

On the second question, I'm not sure. We aren't accounting for nulls in other hash-related functions like ValueCounts. See ARROW-4787. When you populate the hash table with the right-hand-side values, you can set a flag whether null was present or not (and at what position) and then use this when VisitNull is invoked (if using ArrayDataVisitor turns out to be the most efficient method for this, which I'm also not sure about)

@asfimport
Copy link
Author

Ben Kietzman / @bkietz:
Issue resolved by pull request 5665
#5665

@asfimport asfimport added this to the 0.17.0 milestone Jan 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant