[Python] dictionary_encode() of a slice gives wrong result #23556

asfimport · 2019-11-26T14:34:06Z

Steps to reproduce:

import pyarrow as pa
arr = pa.array(["a", "b", "b", "b"])[1:]
arr.dictionary_encode()

Expected results:

-- dictionary:
  [
    "b"
  ]
-- indices:
  [
    0,
    0,
    0
  ]

Actual results:

-- dictionary:
  [
    "b",
    ""
  ]
-- indices:
  [
    0,
    0,
    1
  ]

I don't know a workaround. Converting to pylist and back is too slow. Is there a way to copy the slice to a new offset-0 StringArray that I could then dictionary-encode? Otherwise, I'm considering building buffers by hand....

Environment: Docker on Linux 5.2.18-200.fc30.x86_64; Python 3.7.4
Reporter: Adam Hooper / @adamhooper
Assignee: Antoine Pitrou / @pitrou

PRs and other links:

GitHub Pull Request #6061

Note: This issue was originally created as ARROW-7266. Please see the migration documentation for further details.

asfimport · 2019-11-26T14:36:32Z

Adam Hooper / @adamhooper:
Ah, found a workaround that should be good enough for now: pa.serialize(arr).deserialize().dictionary_encode()

asfimport · 2019-11-27T14:16:41Z

Joris Van den Bossche / @jorisvandenbossche:
@adamhooper Thanks for the report!

This seems to be specific to the string type, as I don't see a similar bug for integer type:

In [7]: a = pa.array(['a', 'b', 'c', 'b'])

In [9]: a[1:].dictionary_encode()
Out[9]: 
<pyarrow.lib.DictionaryArray object at 0x7f677975e128>

-- dictionary:
  [
    "c",
    "b",
    ""
  ]
-- indices:
  [
    0,
    1,
    2
  ]

In [10]: a = pa.array([1, 2, 3, 2])

In [12]: a[1:].dictionary_encode()
Out[12]: 
<pyarrow.lib.DictionaryArray object at 0x7f6776f5f208>

-- dictionary:
  [
    2,
    3
  ]
-- indices:
  [
    0,
    1,
    0
  ]

> Is there a way to copy the slice to a new offset-0 StringArray that I could then dictionary-encode?

At least in the current pyarrow API, I don't think such functionality is exposed (apart from getting the buffers, slicing/copying them, and recreating an array).

asfimport · 2019-12-19T06:44:32Z

Kouhei Sutou / @kou:
Issue resolved by pull request 6061
#6061

asfimport closed this as completed Dec 19, 2019

asfimport assigned pitrou Jan 10, 2023

asfimport added this to the 0.16.0 milestone Jan 11, 2023
