[Python] dictionary_encode() of a slice gives wrong result #23556

Closed
asfimport opened this issue Nov 26, 2019 · 3 comments

Comments

@asfimport

Steps to reproduce:

import pyarrow as pa
arr = pa.array(["a", "b", "b", "b"])[1:]
arr.dictionary_encode()

Expected results:

-- dictionary:
  [
    "b"
  ]
-- indices:
  [
    0,
    0,
    0
  ]

Actual results:

-- dictionary:
  [
    "b",
    ""
  ]
-- indices:
  [
    0,
    0,
    1
  ]
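
For reference, the expected output above matches an encoding that assigns codes in order of first appearance. The following is a pure-Python model of that behaviour for checking results by hand; it is not pyarrow's implementation, and `reference_dictionary_encode` is a hypothetical helper name.

```python
def reference_dictionary_encode(values):
    """Return (dictionary, indices), assigning codes in order of first appearance."""
    dictionary = []   # distinct values, in the order they were first seen
    positions = {}    # value -> its index in `dictionary`
    indices = []      # one code per input value (None stays None)
    for value in values:
        if value is None:
            indices.append(None)
            continue
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


# The slice from the report: ["a", "b", "b", "b"][1:]
dictionary, indices = reference_dictionary_encode(["a", "b", "b", "b"][1:])
print(dictionary, indices)
```

Comparing `arr.dictionary_encode()` against this model on `arr.to_pylist()` is slow, but it is a simple way to confirm whether a given pyarrow version is affected.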

I don't know a workaround. Converting to pylist and back is too slow. Is there a way to copy the slice to a new offset-0 StringArray that I could then dictionary-encode? Otherwise, I'm considering building buffers by hand....

Environment: Docker on Linux 5.2.18-200.fc30.x86_64; Python 3.7.4
Reporter: Adam Hooper / @adamhooper
Assignee: Antoine Pitrou / @pitrou

PRs and other links:

Note: This issue was originally created as ARROW-7266. Please see the migration documentation for further details.

@asfimport

Adam Hooper / @adamhooper:
Ah, found a workaround that should be good enough for now: pa.serialize(arr).deserialize().dictionary_encode()

@asfimport

Joris Van den Bossche / @jorisvandenbossche:
@adamhooper Thanks for the report!

This seems to be specific to the string type, as I don't see a similar bug for the integer type:

In [7]: a = pa.array(['a', 'b', 'c', 'b'])

In [9]: a[1:].dictionary_encode()
Out[9]:
<pyarrow.lib.DictionaryArray object at 0x7f677975e128>

-- dictionary:
  [
    "c",
    "b",
    ""
  ]
-- indices:
  [
    0,
    1,
    2
  ]

In [10]: a = pa.array([1, 2, 3, 2])

In [12]: a[1:].dictionary_encode()
Out[12]:
<pyarrow.lib.DictionaryArray object at 0x7f6776f5f208>

-- dictionary:
  [
    2,
    3
  ]
-- indices:
  [
    0,
    1,
    0
  ]

> Is there a way to copy the slice to a new offset-0 StringArray that I could then dictionary-encode?

At least in the current pyarrow API, I don't think such functionality is exposed (apart from getting the buffers, slicing/copying them, and recreating the array).

@asfimport

Kouhei Sutou / @kou:
Issue resolved by pull request 6061
#6061

@asfimport added this to the 0.16.0 milestone Jan 11, 2023