Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python] Function 'dictionary_encode' fails with DictionaryArray input (for compute kernel / ChunkedArray method) #34890

Closed
randolf-scholz opened this issue Apr 4, 2023 · 11 comments · Fixed by #38349

Comments

@randolf-scholz
Copy link

Describe the bug, including details regarding any error messages, version, and platform.

dictionary_encode should probably just return the data as-is if it is already encoded as a dictionary.

Component(s)

Python

@randolf-scholz
Copy link
Author

randolf-scholz commented Apr 4, 2023

MWE:

import pyarrow as pa

dict_array = pa.array(["banana", "orange"]).dictionary_encode()
dict_array.dictionary_encode()  # ✔
pa.compute.dictionary_encode(dict_array)  # ✘ ArrowNotImplementedError

@randolf-scholz
Copy link
Author

randolf-scholz commented Apr 4, 2023

Using chunked arrays, array.dictionary_encode() breaks as well:

import pyarrow as pa

dict_array = pa.chunked_array([["banana", "orange"], ["orange", "apple"]]).dictionary_encode()
dict_array.dictionary_encode()  # ✘ ArrowNotImplementedError
pa.compute.dictionary_encode(dict_array)  # ✘ ArrowNotImplementedError

@danepitkin
Copy link
Member

Honestly, it might be better to keep the error and remove support for the python DictionaryArray.dictionary_encode() method. The errors being thrown all come from the same compute API in the native C++ implementation. The python method is literally just

    def dictionary_encode(self):
        return self

@jorisvandenbossche jorisvandenbossche changed the title ArrowNotImplementedError: Function 'dictionary_encode' has no kernel matching input types (dictionary<values=string, indices=uint8, ordered=0>) [Python] Function 'dictionary_encode' fails with DictionaryArray input (for compute kernel / ChunkedArray method) Apr 11, 2023
@jorisvandenbossche
Copy link
Member

Indeed, and the other dictionary_encode() methods (on base class Array and on ChunkedArray) call the compute kernel, and hence give this error. Doing that for DictionaryArray.dictionary_encode as well sounds fine to me.

If we want to have support for passing a DictionaryArray to the "dictionary_encode" kernel generally, this is done in C++, and would require registering an extra version of this kernel (as a no-op for dict typed input). It seems there is actually a comment about this:

auto dict_encode = std::make_shared<VectorFunction>(
"dictionary_encode", Arity::Unary(), dictionary_encode_doc,
GetDefaultDictionaryEncodeOptions());
AddHashKernels<DictEncodeAction>(dict_encode.get(), base, DictEncodeOutput);
// Calling dictionary_encode on dictionary input not supported, but if it
// ends up being needed (or convenience), a kernel could be added to make it
// a no-op

@randolf-scholz
Copy link
Author

One case when this bug appears in the wild is when trying to pivot a pandas.DataFrame that has been loaded with the pyarrow backend:

import pandas as pd

df = (
    pd.DataFrame([("A", 1), ("B", 2), ("C", 3)], columns=["var", "val"])
    .astype({"var": "string", "val": "float32"})
    .astype({"var": "category", "val": "float32"})
)

# write and reload as parquet with pyarrow backend
df.to_parquet("demo.parquet")
df = pd.read_parquet("demo.parquet", dtype_backend="pyarrow")
print(df.dtypes)  # var is now dictionary[int32,string]

df.pivot(columns=["var"], values=["val"])  # ✘ ArrowNotImplementedError

@jorisvandenbossche
Copy link
Member

@randolf-scholz that's something that pandas can fix on their side in the meantime, so I would report that example over there (if you haven't done that yet)

@jorisvandenbossche
Copy link
Member

But it is certainly true that for convenience (to avoid having to check if the input array is already dict encoded before calling dictionary_encode), it might be nice that pyarrow.compute.dictionary_encode would just be a no-op for dictionary encoded arrays.

@ddutt
Copy link

ddutt commented May 3, 2023

I see this problem even for pandas frames with groupby followed by any operation such as nunique, count etc.

@jorisvandenbossche
Copy link
Member

@ddutt please report that to pandas

@randolf-scholz
Copy link
Author

randolf-scholz commented May 4, 2023

Imo, the preferred solution here would be:

  1. keep .dictionary_encode as an identity function (LSP!)
  2. add the no-op C++ kernel
  3. Enable conversion to dictionary via cast (why is there a separate dictionary_encode function to begin with?) - but this is a separate issue.

Functions like dictionary_encode or cast should be idempotent.

@js8544
Copy link
Collaborator

js8544 commented Oct 19, 2023

I happened to come cross this issue. I'll add a no-op kernel for dictionaries.

bkietz added a commit that referenced this issue Dec 11, 2023
…ionary) (#38349)

Added a no-op kernel for convenience as discussed in the issue.
* Closes: #34890

Lead-authored-by: Jin Shang <shangjin1997@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
@bkietz bkietz added this to the 15.0.0 milestone Dec 11, 2023
mapleFU pushed a commit to mapleFU/arrow that referenced this issue Dec 13, 2023
…e(dictionary) (apache#38349)

Added a no-op kernel for convenience as discussed in the issue.
* Closes: apache#34890

Lead-authored-by: Jin Shang <shangjin1997@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
clayburn pushed a commit to clayburn/arrow that referenced this issue Jan 23, 2024
…e(dictionary) (apache#38349)

Added a no-op kernel for convenience as discussed in the issue.
* Closes: apache#34890

Lead-authored-by: Jin Shang <shangjin1997@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…e(dictionary) (apache#38349)

Added a no-op kernel for convenience as discussed in the issue.
* Closes: apache#34890

Lead-authored-by: Jin Shang <shangjin1997@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants