Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Python] Add compute kernel for "dictionary_decode" #34588

Closed
jorisvandenbossche opened this issue Mar 16, 2023 · 8 comments · Fixed by #35356
Closed

[C++][Python] Add compute kernel for "dictionary_decode" #34588

jorisvandenbossche opened this issue Mar 16, 2023 · 8 comments · Fixed by #35356

Comments

@jorisvandenbossche
Copy link
Member

The pyarrow DictionaryArray class has a dictionary_decode() method, as a counterpart of the dictionary_encode(). However, this is not an actual kernel, but just implemented manually (it's also just a simple "take" of course):

arrow/python/pyarrow/array.pxi

Lines 2506 to 2510 in fe88d9a

def dictionary_decode(self):
"""
Decodes the DictionaryArray to an Array.
"""
return self.dictionary.take(self.indices)

It would be nice to make an actual kernel for this, just like we have a "dictionary_encode" kernel (and eg also "run_end_encode" and "run_end_decode" kernels). Among other things, this makes it available as a function in pyarrow.compute (and for example ensures you can call this on a chunked array more easily).

@R-JunmingChen
Copy link
Contributor

Hi, where is the code of dictionary_encode in cpp?
I search for arrow/cpp/src/arrow/compute and there is no such file like vector_dictionary_encode

@R-JunmingChen
Copy link
Contributor

take

@westonpace
Copy link
Member

westonpace commented Apr 11, 2023

Hi, where is the code of dictionary_encode in cpp?

This can be done today with the cast function (and I'm pretty sure that is how its done when we need to decode in Acero):

>>> x = pa.array(["a", "a", "b"], pa.dictionary(pa.int8(), pa.string()))
>>> x
<pyarrow.lib.DictionaryArray object at 0x7f9eaaf60900>

-- dictionary:
  [
    "a",
    "b"
  ]
-- indices:
  [
    0,
    0,
    1
  ]
>>> pc.cast(x, pa.string())
<pyarrow.lib.StringArray object at 0x7f9eaaf80b20>
[
  "a",
  "a",
  "b"
]

I agree that an explicit dictionary_decode alias would be nice. Maybe we can use a MetaFunction?

@pitrou
Copy link
Member

pitrou commented May 23, 2023

What is the point of this? Not every function needs to be wrapped in a compute function.

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented May 23, 2023

Quoting the last part of my top post:

Among other things, this makes it available as a function in pyarrow.compute (and for example ensures you can call this on a chunked array more easily).

Being able to decode a chunked array was my motivating use case to open this issue. Of course, for just that we could also add a ChunkedArray.dictionary_decode() method, but at the moment we don't really add type-specific methods on ChunkedArray.

@pitrou
Copy link
Member

pitrou commented May 23, 2023

Adding ChunkedArray.dictionary_decode() would sound a bit more logical (since Array.dictionary_decode() already exists) than creating a wrapper compute function, IMHO.

@jorisvandenbossche
Copy link
Member Author

(since Array.dictionary_decode() already exists

That might be what you meant, but to be nitpicky: that actually doesn't exist, it's only DictionaryArray.dictionary_decode() that exists (which touches on my point that up to now we haven't really been adding type-specific methods on ChunkedArray, since we only have a single chunked class and not type-specific subclasses)

Another potential (consistency) argument: run_end_decode also exists as a compute kernel.

@pitrou
Copy link
Member

pitrou commented May 23, 2023

Hmm, I see. pc.dictionary_decode kind of makes sense, then, even though it's a bit of an anti-pattern to make everything a compute function.

westonpace added a commit that referenced this issue Jul 18, 2023
…5356)

**Rationale for this change**
This PR is for [Issue-34588](#34588). Discussing with @ westonpace, a MetaFunction for "dictionary_decode" is implemented instead of adding a compute kernel.

**What changes are included in this PR?**
C++: Meta Function of dictionary_decode.
Python: Test

**Are these changes tested?**
One test in tests/test_compute.py

* Closes: #34588

Lead-authored-by: Junming Chen <junming.chen.r@outlook.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
@westonpace westonpace added this to the 14.0.0 milestone Jul 18, 2023
chelseajonesr pushed a commit to chelseajonesr/arrow that referenced this issue Jul 20, 2023
…e" (apache#35356)

**Rationale for this change**
This PR is for [Issue-34588](apache#34588). Discussing with @ westonpace, a MetaFunction for "dictionary_decode" is implemented instead of adding a compute kernel.

**What changes are included in this PR?**
C++: Meta Function of dictionary_decode.
Python: Test

**Are these changes tested?**
One test in tests/test_compute.py

* Closes: apache#34588

Lead-authored-by: Junming Chen <junming.chen.r@outlook.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
R-JunmingChen added a commit to R-JunmingChen/arrow that referenced this issue Aug 20, 2023
…e" (apache#35356)

**Rationale for this change**
This PR is for [Issue-34588](apache#34588). Discussing with @ westonpace, a MetaFunction for "dictionary_decode" is implemented instead of adding a compute kernel.

**What changes are included in this PR?**
C++: Meta Function of dictionary_decode.
Python: Test

**Are these changes tested?**
One test in tests/test_compute.py

* Closes: apache#34588

Lead-authored-by: Junming Chen <junming.chen.r@outlook.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants