Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-34588:[C++][Python] Add a MetaFunction for "dictionary_decode" #35356

Merged
merged 44 commits into from
Jul 18, 2023

Conversation

R-JunmingChen
Copy link
Contributor

@R-JunmingChen R-JunmingChen commented Apr 27, 2023

Rationale for this change
This PR is for Issue-34588. Discussing with @westonpace, a MetaFunction for "dictionary_decode" is implemented instead of adding a compute kernel.

What changes are included in this PR?
C++: Meta Function of dictionary_decode.
Python: Test

Are these changes tested?
One test in tests/test_compute.py

@github-actions
Copy link

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels May 11, 2023
@R-JunmingChen
Copy link
Contributor Author

Hi, @westonpace @AlenkaF, if you have spared time, please review this PR

Copy link
Member

@AlenkaF AlenkaF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution @R-JunmingChen !

I had a look at the Python part. Just have some minor nits.

  • Can we add a combo of encode & decode to the test?
  • There are some corrections needed from doctest and linter checks:
--- original//arrow/python/pyarrow/compute.py
+++ fixed//arrow/python/pyarrow/compute.py
@@ -387,7 +387,7 @@
     array : decoded Array
         The dictionary_decode result as a new Array
     """
-    if(not isinstance(arr.type, pa.DictionaryType)):
+    if (not isinstance(arr.type, pa.DictionaryType)):
         raise TypeError("Must pass a dictionary array")
 
     return call_function("dictionary_decode", [arr], memory_pool)
--- original//arrow/python/pyarrow/tests/test_compute.py
+++ fixed//arrow/python/pyarrow/tests/test_compute.py
@@ -1750,7 +1750,7 @@
 
 def test_dictionary_decode():
     array = pa.array(["a", "a", "b", "c", "b"])
-    dictionary_array = pa.array(["a", "a", "b", "c", "b"], 
+    dictionary_array = pa.array(["a", "a", "b", "c", "b"],
                                 pa.dictionary(pa.int8(), pa.string()))
 
     assert array != dictionary_array
_________________ [doctest] pyarrow.compute.dictionary_decode __________________
340     arr : Array-like
341     memory_pool : MemoryPool, optional
342         memory pool to use for allocations during function execution.
343 
344     Examples
345     --------
346     >>> import pyarrow as pa
347     >>> import pyarrow.compute as pc
348     >>> x = pa.array(["a", "a", "b"], pa.dictionary(pa.int8(), pa.string()))
349     >>> x
Expected:
    <pyarrow.lib.DictionaryArray object at ...>
Got:
    <pyarrow.lib.DictionaryArray object at 0x7f6adc3da050>
    <BLANKLINE>
    -- dictionary:
      [
        "a",
        "b"
      ]
    -- indices:
      [
        0,
        0,
        1
      ]

/opt/conda/envs/arrow/lib/python3.9/site-packages/pyarrow/compute.py:349: DocTestFailure
__________________ [doctest] pyarrow.lib.default_memory_pool ___________________
125 default_memory_pool()
126 
127     Return the process-global memory pool.
128 
129     Examples
130     --------
131     >>> default_memory_pool()
Expected:
    <pyarrow.MemoryPool backend_name=... bytes_allocated=0 max_memory=...>
Got:
    <pyarrow.MemoryPool backend_name=jemalloc bytes_allocated=192 max_memory=1205120>

In the latter you can use ... instead of specific bytes allocated etc.

@R-JunmingChen
Copy link
Contributor Author

Can you put the function in vector_hash next to dictionary_encode? I agree it is a bit weird since hashing is not used but I think it would better to keep the functions together than to introduce a new file.

Sure, have moved. Do we need additional comment for that?

@westonpace
Copy link
Member

Sure, have moved. Do we need additional comment for that?

I think it's ok, but if you want we can add a simple comment like

// This function does not use any hashing utilities but is kept in this file to be near dictionary_encode

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I have a few more questions but I think this is getting close.

python/pyarrow/array.pxi Outdated Show resolved Hide resolved
"""
return self.dictionary.take(self.indices)
return _pc().dictionary_decode(self)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did the old approach work? I wonder if this will introduce any subtle differences? Is there a reason we need to change this method?

Copy link
Contributor Author

@R-JunmingChen R-JunmingChen Jul 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old approach works well. I am not sure the entire risk of subtle differences. One difference is that we output the original input if the input is not dictionary_array, which has no further influence cause only DictionaryArray has a dictionary_decode function. I think it's good to unify the dictionay_decode logic in Python. But your consideration make senses, the change has a risk affecting users who used the old dictionary_decode.

I roll back to old code currently, but I think we need a further consideration on this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's good to unify the dictionay_decode logic in Python

This is a good point too. I'll let @jorisvandenbossche or @AlenkaF weigh in as well. We can move forward as-is for now and add this back if they want.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One other reason to keep the original is that it is possible to build pyarrow without compute, and then the old version would still work while using the compute version would raise (although I am not sure how important this is, as many things won't work in that case)

python/pyarrow/compute.py Outdated Show resolved Hide resolved
python/pyarrow/tests/test_compute.py Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Jul 13, 2023
Co-authored-by: Weston Pace <weston.pace@gmail.com>
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jul 13, 2023
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this addition!

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Jul 18, 2023
@westonpace westonpace merged commit c7741fb into apache:main Jul 18, 2023
36 checks passed
@westonpace westonpace removed the awaiting merge Awaiting merge label Jul 18, 2023
@github-actions github-actions bot added the awaiting changes Awaiting changes label Jul 18, 2023
chelseajonesr pushed a commit to chelseajonesr/arrow that referenced this pull request Jul 20, 2023
…e" (apache#35356)

**Rationale for this change**
This PR is for [Issue-34588](apache#34588). Discussing with @ westonpace, a MetaFunction for "dictionary_decode" is implemented instead of adding a compute kernel.

**What changes are included in this PR?**
C++: Meta Function of dictionary_decode.
Python: Test

**Are these changes tested?**
One test in tests/test_compute.py

* Closes: apache#34588

Lead-authored-by: Junming Chen <junming.chen.r@outlook.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit c7741fb.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

R-JunmingChen added a commit to R-JunmingChen/arrow that referenced this pull request Aug 20, 2023
…e" (apache#35356)

**Rationale for this change**
This PR is for [Issue-34588](apache#34588). Discussing with @ westonpace, a MetaFunction for "dictionary_decode" is implemented instead of adding a compute kernel.

**What changes are included in this PR?**
C++: Meta Function of dictionary_decode.
Python: Test

**Are these changes tested?**
One test in tests/test_compute.py

* Closes: apache#34588

Lead-authored-by: Junming Chen <junming.chen.r@outlook.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++][Python] Add compute kernel for "dictionary_decode"
5 participants