[Python] Function 'dictionary_encode' fails with DictionaryArray input (for compute kernel / ChunkedArray method) #34890
Comments
MWE:

```python
import pyarrow as pa

dict_array = pa.array(["banana", "orange"]).dictionary_encode()
dict_array.dictionary_encode()            # ✔ works
pa.compute.dictionary_encode(dict_array)  # ✘ ArrowNotImplementedError
```
Using chunked arrays:

```python
import pyarrow as pa

dict_array = pa.chunked_array([["banana", "orange"], ["orange", "apple"]]).dictionary_encode()
dict_array.dictionary_encode()            # ✘ ArrowNotImplementedError
pa.compute.dictionary_encode(dict_array)  # ✘ ArrowNotImplementedError
```
Honestly, it might be better to keep the error and remove support for the Python …
Indeed. If we want to support passing a DictionaryArray to the `dictionary_encode` kernel generally, this has to be done in C++, and would require registering an extra version of the kernel (as a no-op for dictionary-typed input). There is actually a comment about this in `arrow/cpp/src/arrow/compute/kernels/vector_hash.cc`, lines 810 to 817 at e488942.
One case where this bug appears in the wild is when trying to pivot a pandas DataFrame:

```python
import pandas as pd

df = (
    pd.DataFrame([("A", 1), ("B", 2), ("C", 3)], columns=["var", "val"])
    .astype({"var": "string", "val": "float32"})
    .astype({"var": "category", "val": "float32"})
)
# write and reload as parquet with the pyarrow backend
df.to_parquet("demo.parquet")
df = pd.read_parquet("demo.parquet", dtype_backend="pyarrow")
print(df.dtypes)  # var is now dictionary[int32,string]

df.pivot(columns=["var"], values=["val"])  # ✘ ArrowNotImplementedError
```
@randolf-scholz that's something that pandas can fix on their side in the meantime, so I would report that example over there (if you haven't done that yet).
But it is certainly true that, for convenience (to avoid having to check whether the input array is already dictionary-encoded before calling `dictionary_encode`), supporting dictionary input would be useful.
I see this problem even for pandas frames with groupby followed by any operation such as nunique, count, etc.

@ddutt please report that to pandas.
IMO, the preferred solution here would be: functions like `dictionary_encode` should simply return dictionary-typed input as-is.
I happened to come across this issue. I'll add a no-op kernel for dictionaries.
…ionary) (#38349)

Added a no-op kernel for convenience as discussed in the issue.

* Closes: #34890

Lead-authored-by: Jin Shang <shangjin1997@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
Describe the bug, including details regarding any error messages, version, and platform.

`dictionary_encode` should probably just return the data as-is if it is already encoded as a dictionary.

Component(s): Python