[Python] Table.to_pandas fails to convert index dtype with a custom type mapper #34283

Closed
j-bennet opened this issue Feb 21, 2023 · 2 comments · Fixed by #34445

@j-bennet

Describe the bug, including details regarding any error messages, version, and platform.

When providing a custom type mapper to pyarrow.Table.to_pandas, column dtypes are converted, but not index dtypes.

Example:

import pyarrow as pa
import pandas as pd

df = pd.DataFrame({"s1": ["a", "c", "b"], "i1": [1, 2, 3], "s2": ["d", "e", "f"]}).set_index('s2')
tbl = pa.Table.from_pandas(df)
mapper = {pa.string(): pd.StringDtype("pyarrow")}.get
df2 = tbl.to_pandas(types_mapper=mapper)
print(f"{df2.dtypes}")
print(f"{df2.index.dtype = }")

This prints:

s1    string[pyarrow]
i1              int64
dtype: object
df2.index.dtype = dtype('O')

Column s1 was mapped to the string[pyarrow] dtype, but the index dtype remained object. I'd expect the type mapper to be used to convert the index dtype as well.

Platform: macOS
Pyarrow version: 11.0.0
Build: py310h89f3c6b_2_cpu
Channel: conda-forge

Component(s)

Python

@AlenkaF
Member

AlenkaF commented Feb 22, 2023

Yes, I think that the dtype of an index is never converted according to the types_mapper keyword when converting pa.Table or an array for that matter.

A single array gets converted to pandas series in _array_like_to_pandas with pd.Series which doesn't take into account the dtype of the index.

result = pandas_api.series(arr, dtype=dtype, name=name)

This would be a good addition. What is needed is to use the pandas API to convert the series index with the .astype() method after the code line linked above in arrow/python/pyarrow/array.pxi.
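The shape of that suggestion can be sketched with plain pandas. `series_with_index_dtype` and its `index_dtype` parameter are hypothetical names used only for illustration. This is not the code in array.pxi, and it assumes a pandas version that supports extension dtypes on an Index.

```python
import pandas as pd


def series_with_index_dtype(values, dtype=None, name=None, index_dtype=None):
    # Existing step: build the Series from the converted values.
    result = pd.Series(values, dtype=dtype, name=name)
    # Proposed extra step: cast the index when a target dtype is requested.
    if index_dtype is not None:
        result.index = result.index.astype(index_dtype)
    return result


s = series_with_index_dtype(["a", "b", "c"], name="s1", index_dtype="Int64")
print(s.index.dtype)
```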

As for the Table, it gets converted with the use of pandas BlockManager and I am not sure how the desired dtype could be passed to the BlockManager axes in that case:

arrow/python/pyarrow/table.pxi

Lines 4001 to 4008 in 7828165

def _to_pandas(self, options, categories=None, ignore_metadata=False,
               types_mapper=None):
    from pyarrow.pandas_compat import table_to_blockmanager
    mgr = table_to_blockmanager(
        options, self, categories,
        ignore_metadata=ignore_metadata,
        types_mapper=types_mapper)
    return pandas_api.data_frame(mgr)

def table_to_blockmanager(options, table, categories=None,
                          ignore_metadata=False, types_mapper=None):
    from pandas.core.internals import BlockManager
    all_columns = []
    column_indexes = []
    pandas_metadata = table.schema.pandas_metadata
    if not ignore_metadata and pandas_metadata is not None:
        all_columns = pandas_metadata['columns']
        column_indexes = pandas_metadata.get('column_indexes', [])
        index_descriptors = pandas_metadata['index_columns']
        table = _add_any_metadata(table, pandas_metadata)
        table, index = _reconstruct_index(table, index_descriptors,
                                          all_columns)
        ext_columns_dtypes = _get_extension_dtypes(
            table, all_columns, types_mapper)
    else:
        index = _pandas_api.pd.RangeIndex(table.num_rows)
        ext_columns_dtypes = _get_extension_dtypes(table, [], types_mapper)
    _check_data_column_metadata_consistency(all_columns)
    columns = _deserialize_column_index(table, all_columns, column_indexes)
    blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
    axes = [columns, index]
    return BlockManager(blocks, axes)

@jorisvandenbossche jorisvandenbossche added this to the 12.0.0 milestone Mar 3, 2023
jorisvandenbossche pushed a commit that referenced this issue Mar 9, 2023
…4445)

### Rationale for this change

### What changes are included in this PR?

Respects types_mapper for indexes as well

### Are these changes tested?

Yes

### Are there any user-facing changes?

Technically this breaks the API, in that types_mapper is now also respected for the index.

- [x] closes #34283

cc @ jorisvandenbossche 

Authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jorisvandenbossche
Member

Issue resolved by pull request 34445
#34445
