[Python] to_pandas() not implemented on list<dictionary<values=string, indices=int32> #23225

Closed
asfimport opened this issue Oct 16, 2019 · 5 comments

@asfimport

Hi,

pyarrow.Table.to_pandas() fails on an Arrow List Vector where the data vector is of type "dictionary encoded string". Here is the table schema as printed by pyarrow:

pyarrow.Table
encodedList: list<$data$: dictionary<values=string, indices=int32, ordered=0> not null> not null
  child 0, $data$: dictionary<values=string, indices=int32, ordered=0> not null
metadata
--------
OrderedDict() 

and the data (also attached in a file to this ticket)

<pyarrow.lib.ChunkedArray object at 0x7f7ea6a748b8>
[
  [

    -- dictionary:
      [
        "a",
        "b",
        "c",
        "d"
      ]
    -- indices:
      [
        0,
        1,
        2
      ],

    -- dictionary:
      [
        "a",
        "b",
        "c",
        "d"
      ]
    -- indices:
      [
        0,
        3
      ]
  ]
] 

and the exception I got

---------------------------------------------------------------------------
ArrowNotImplementedError                  Traceback (most recent call last)
<ipython-input-10-5f865bc01df1> in <module>
----> 1 df.to_pandas()

~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/array.pxi in pyarrow.lib._PandasConvertible.to_pandas()

~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/table.pxi in pyarrow.lib.Table._to_pandas()

~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/pandas_compat.py in table_to_blockmanager(options, table, categories, ignore_metadata)
    700 
    701     _check_data_column_metadata_consistency(all_columns)
--> 702     blocks = _table_to_blocks(options, table, categories)
    703     columns = _deserialize_column_index(table, all_columns, column_indexes)
    704 

~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/pandas_compat.py in _table_to_blocks(options, block_table, categories)
    972 
    973     # Convert an arrow table to Block from the internal pandas API
--> 974     result = pa.lib.table_to_blocks(options, block_table, categories)
    975 
    976     # Defined above

~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/table.pxi in pyarrow.lib.table_to_blocks()

~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowNotImplementedError: Not implemented type for list in DataFrameBlock: dictionary<values=string, indices=int32, ordered=0> 

Note that the data vector itself can be loaded successfully by to_pandas.

It'd be great if this would be addressed in the next version of pyarrow. For now, is there anything I can do on my end to bypass this unimplemented conversion?

Thanks,

Razvan

Reporter: Razvan Chitu
Assignee: Wes McKinney / @wesm

Original Issue Attachments:

PRs and other links:

Note: This issue was originally created as ARROW-6899. Please see the migration documentation for further details.

@asfimport

Joris Van den Bossche / @jorisvandenbossche:
[~razvanch] thanks for the report. Could you provide a small script to reproduce the issue (some code that creates a Table with such a type)?

@asfimport

Joris Van den Bossche / @jorisvandenbossche:
Another thing: I suppose it should be possible to add this conversion, but the question is what it should convert to. For other List types, we convert to a numpy array of numpy arrays. For a dictionary type, though, that would mean losing the dictionary-encoding information.

@asfimport

Wes McKinney / @wesm:
Dictionary should be converted to dense / non-dictionary in this case. Marking for 1.0
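
To illustrate what "converted to dense" means for the data in this report, here is a minimal pure-Python sketch. It is not the Arrow implementation; the function name `decode_list_of_dictionary` is hypothetical, and the sketch assumes there are no null indices or null lists.

```python
def decode_list_of_dictionary(dictionary, indices, offsets):
    """Expand dictionary-encoded list values into plain Python lists."""
    # Step 1: replace every index with the dictionary value it points to.
    dense = [dictionary[i] for i in indices]
    # Step 2: cut the dense values into lists using consecutive offsets.
    return [dense[start:stop] for start, stop in zip(offsets, offsets[1:])]


# Same data as in the report: two lists sharing one dictionary.
dictionary = ["a", "b", "c", "d"]
indices = [0, 1, 2, 0, 3]
offsets = [0, 3, 5]

rows = decode_list_of_dictionary(dictionary, indices, offsets)
print(rows)  # [['a', 'b', 'c'], ['a', 'd']]
```

The result carries only the string values; the dictionary and its indices are gone, which is the information loss discussed in this thread.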

@asfimport

Razvan Chitu:
@jorisvandenbossche  sure, here's an example that throws:

import pyarrow as pa

offsets = pa.array([0, 3, 5])
values = pa.array(['a', 'b', 'c', 'a', 'd']).dictionary_encode()
pa.ListArray.from_arrays(offsets, values).to_pandas() 

@wesm is 1.0 going to be the next release? Also, I wouldn't mind trying to solve this myself, but I would need a bit of guidance!

@asfimport

Wes McKinney / @wesm:
Issue resolved by pull request 6199
#6199

@asfimport asfimport added this to the 0.16.0 milestone Jan 11, 2023