[Python] Extension array data type should default to the storage type if to_pandas_dtype is not implemented #34165

Closed
AlenkaF opened this issue Feb 13, 2023 · 0 comments · Fixed by #34559

AlenkaF commented Feb 13, 2023

Describe the bug, including details regarding any error messages, version, and platform.

When working on the extension type for tensors in PyArrow I came across a behaviour of the conversion to pandas that could be improved.

Creating an extension array (a fixed shape tensor in this case) and converting it to pandas works well:

>>> arr = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
>>> storage = pa.array(arr, pa.list_(pa.int32(), 4))
>>> tensor = pa.ExtensionArray.from_storage(tensor_type, storage)
>>> tensor.to_pandas()
0            [1, 2, 3, 4]
1        [10, 20, 30, 40]
2    [100, 200, 300, 400]
dtype: object

But creating a table with an extension array and then converting it to pandas fails:

>>> data = [
...     pa.array([1, 2, 3]),
...     pa.array(['foo', 'bar', None]),
...     pa.array([True, None, True]),
...     tensor
... ]
>>> my_schema = pa.schema([('f0', pa.int8()),
...                        ('f1', pa.string()),
...                        ('f2', pa.bool_()),
...                        ('tensors_int', tensor_type)])
>>> table = pa.Table.from_arrays(data, schema=my_schema)
>>> table.to_pandas()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/array.pxi", line 830, in pyarrow.lib._PandasConvertible.to_pandas
    return self._to_pandas(options, categories=categories,
  File "pyarrow/table.pxi", line 4004, in pyarrow.lib.Table._to_pandas
    mgr = table_to_blockmanager(
  File "/Users/alenkafrim/repos/arrow/python/pyarrow/pandas_compat.py", line 820, in table_to_blockmanager
    blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
  File "/Users/alenkafrim/repos/arrow/python/pyarrow/pandas_compat.py", line 1171, in _table_to_blocks
    return [_reconstruct_block(item, columns, extension_columns)
  File "/Users/alenkafrim/repos/arrow/python/pyarrow/pandas_compat.py", line 1171, in <listcomp>
    return [_reconstruct_block(item, columns, extension_columns)
  File "/Users/alenkafrim/repos/arrow/python/pyarrow/pandas_compat.py", line 776, in _reconstruct_block
    pandas_dtype = extension_columns[name]
KeyError: 'tensors_int'

The failure occurs because the extension type in this example does not implement the to_pandas_dtype method. As a result, the _get_extension_dtypes method never adds the name of the extension column to ext_columns:

# infer from extension type in the schema
for field in table.schema:
    typ = field.type
    if isinstance(typ, pa.BaseExtensionType):
        try:
            pandas_dtype = typ.to_pandas_dtype()
        except NotImplementedError:
            pass
        else:
            ext_columns[field.name] = pandas_dtype
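
To make the failure mode concrete, here is a minimal pure-Python model of that loop. It is only a sketch and not PyArrow code: `StorageOnlyType`, `PandasAwareType` and `collect_extension_dtypes` are hypothetical stand-ins invented for illustration. It shows that a type raising `NotImplementedError` is silently skipped, so a later strict dictionary lookup by column name raises `KeyError`.

```python
class StorageOnlyType:
    """Hypothetical stand-in for an extension type without to_pandas_dtype."""

    def to_pandas_dtype(self):
        raise NotImplementedError


class PandasAwareType:
    """Hypothetical stand-in for an extension type that maps to a pandas dtype."""

    def to_pandas_dtype(self):
        return "my_pandas_dtype"


def collect_extension_dtypes(fields):
    """Model of the loop above: map column name -> pandas dtype."""
    ext_columns = {}
    for name, typ in fields:
        try:
            pandas_dtype = typ.to_pandas_dtype()
        except NotImplementedError:
            pass  # the column is skipped without any record of it
        else:
            ext_columns[name] = pandas_dtype
    return ext_columns


fields = [("aware", PandasAwareType()), ("tensors_int", StorageOnlyType())]
ext_columns = collect_extension_dtypes(fields)

# A strict lookup such as ext_columns["tensors_int"] raises KeyError here.
# A tolerant lookup returns None, which a caller could treat as
# "no pandas dtype known, convert the storage instead".
fallback = ext_columns.get("tensors_int")
```

In this model the skipped column is indistinguishable from a column that was never an extension column, which is why the consumer of the mapping needs an explicit fallback.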

When the to_pandas_dtype method is not implemented, it would be better for the table conversion to convert the storage array instead of failing at the lookup

pandas_dtype = extension_columns[name]

similar to what the array conversion already does:

arrow/python/pyarrow/array.pxi

Lines 2888 to 2889 in 925cbd8

# otherwise convert the storage array with the base implementation
return Array._to_pandas(self.storage, options, **kwargs)

Component(s)

Python

@AlenkaF AlenkaF self-assigned this Mar 14, 2023
@AlenkaF AlenkaF added this to the 12.0.0 milestone Mar 16, 2023
jorisvandenbossche added a commit that referenced this issue Apr 5, 2023
…orage type if to_pandas_dtype is not implemented (#34559)

### Rationale for this change
Method `to_pandas` fails with a `KeyError` if a table has an extension array column whose extension type does not define `to_pandas_dtype`. In this case we should fall back to the storage type of the extension array.

### What changes are included in this PR?
Changes in `arrow_to_pandas.cc` at:
- `GetBlockType` for `ConvertTableToPandas`
- `ConvertChunkedArrayToPandas`
* Closes: #34165

Lead-authored-by: Alenka Frim <frim.alenka@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this issue May 15, 2023
…the storage type if to_pandas_dtype is not implemented (apache#34559)
rtpsw pushed a commit to rtpsw/arrow that referenced this issue May 16, 2023
…the storage type if to_pandas_dtype is not implemented (apache#34559)