Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

to_parquet fails for nullable dtype index #9186

Closed
bnaul opened this issue Jun 14, 2022 · 6 comments
Closed

to_parquet fails for nullable dtype index #9186

bnaul opened this issue Jun 14, 2022 · 6 comments
Assignees
Labels
bug Something is broken needs attention It's been a while since this was pushed on. Needs attention from the owner or a maintainer. parquet upstream

Comments

@bnaul
Copy link
Contributor

bnaul commented Jun 14, 2022

git bisect traced this back to #9131:

import dask.dataframe as dd, pandas as pd

dd.from_pandas(pd.DataFrame({"a": [1, 2]}, index=pd.Index(["A", "B"], dtype="string")), npartitions=1).to_parquet("/tmp/to_parquet_test/")


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-10-d12a786a20e0> in <module>
      1 import dask.dataframe as dd, pandas as pd
----> 2 dd.from_pandas(pd.DataFrame({"a": [1, 2]}, index=pd.Index(["A", "B"], dtype="string")), npartitions=1).to_parquet("/tmp/to_parquet_test/")

~/model/.venv/lib/python3.9/site-packages/dask/dataframe/core.py in to_parquet(self, path, *args, **kwargs)

~/model/.venv/lib/python3.9/site-packages/dask/dataframe/io/parquet/core.py in to_parquet(df, path, engine, compression, write_index, append, overwrite, ignore_divisions, partition_on, storage_options, custom_metadata, write_metadata_file, compute, compute_kwargs, schema, name_function, **kwargs)

~/model/.venv/lib/python3.9/site-packages/dask/dataframe/io/parquet/arrow.py in initialize_write(cls, df, fs, path, append, partition_on, ignore_divisions, division_info, schema, index_cols, **kwargs)

~/model/.venv/lib/python3.9/site-packages/dask/utils.py in __call__(self, arg, *args, **kwargs)

~/model/.venv/lib/python3.9/site-packages/dask/dataframe/backends.py in get_pyarrow_schema_pandas(obj)

~/model/.venv/lib/python3.9/site-packages/pyarrow/types.pxi in pyarrow.lib.Schema.from_pandas()

~/model/.venv/lib/python3.9/site-packages/pyarrow/pandas_compat.py in dataframe_to_types(df, preserve_index, columns)
    527             type_ = pa.array(c, from_pandas=True).type
    528         elif _pandas_api.is_extension_array_dtype(values):
--> 529             type_ = pa.array(c.head(0), from_pandas=True).type
    530         else:
    531             values, type_ = get_datetimetz_type(values, c.dtype, None)

AttributeError: 'Index' object has no attribute 'head'

Removing dtype="string" fixes the issue.

I can't quite make out whether the mistaken assumption is in dask or pyarrow...? But regardless since it worked prior to #9131 it seems like dask should be able to work around the issue.

cc @jcrist

@github-actions github-actions bot added the needs triage Needs a response from a contributor label Jun 14, 2022
@jrbourbeau
Copy link
Member

Thanks for surfacing @bnaul. I'm able to reproduce using the latest pyarrow=8.0.0 and pandas=1.4.2 releases (with the current dask main branch).

I can't quite make out whether the mistaken assumption is in dask or pyarrow...?

I'm able to reproduce with pyarrow alone (see snippet below) which makes me think the underlying issue is related to pyarrows handling of DataFrame's whose index are using extension dtypes. cc @jorisvandenbossche for visibility

In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: df = pd.DataFrame({"a": [1, 2]}, index=pd.Index(["A", "B"], dtype="string"))

In [4]: pa.Schema.from_pandas(df)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [4], in <cell line: 1>()
----> 1 pa.Schema.from_pandas(df)

File ~/mambaforge/envs/dask/lib/python3.10/site-packages/pyarrow/types.pxi:1663, in pyarrow.lib.Schema.from_pandas()

File ~/mambaforge/envs/dask/lib/python3.10/site-packages/pyarrow/pandas_compat.py:529, in dataframe_to_types(df, preserve_index, columns)
    527     type_ = pa.array(c, from_pandas=True).type
    528 elif _pandas_api.is_extension_array_dtype(values):
--> 529     type_ = pa.array(c.head(0), from_pandas=True).type
    530 else:
    531     values, type_ = get_datetimetz_type(values, c.dtype, None)

AttributeError: 'Index' object has no attribute 'head'

Removing dtype="string" fixes the issue.

This is also true when just using pyarrow

In [5]: df = pd.DataFrame({"a": [1, 2]}, index=pd.Index(["A", "B"]))

In [6]: pa.Schema.from_pandas(df)
Out[6]:
a: int64
__index_level_0__: string
-- schema metadata --
pandas: '{"index_columns": ["__index_level_0__"], "column_indexes": [{"na' + 430

@jrbourbeau
Copy link
Member

Have a possible workaround I'll push up in a bit...

@bnaul
Copy link
Contributor Author

bnaul commented Jun 14, 2022

Thanks @jrbourbeau, I am seeing that passing schema=None seems to revert to the old behavior, so I guess the change to that default argument is really what surfaced this.

@jrbourbeau
Copy link
Member

So it turns out while

df = pd.DataFrame({"a": [1, 2]}, index=pd.Index(["A", "B"], dtype="string"))
pa.Schema.from_pandas(df)

doesn't work, creating a pyarrow.Table and then extracting the schema with

df = pd.DataFrame({"a": [1, 2]}, index=pd.Index(["A", "B"], dtype="string"))
pa.Table.from_pandas(df).schema

does kind of work. It doesn't raise an exception but also doesn't preserve the string[python] extension dtype for the index. Instead it converts to object.

@ian-r-rose
Copy link
Collaborator

Reported upstream here. I think changing to schema=None (or supplying a manual schema to ddf.to_parquet) is the correct workaround until it is fixed upstream.

@ian-r-rose ian-r-rose added parquet upstream bug Something is broken and removed needs triage Needs a response from a contributor labels Jun 15, 2022
@github-actions github-actions bot added the needs attention It's been a while since this was pushed on. Needs attention from the owner or a maintainer. label Jul 18, 2022
@jrbourbeau
Copy link
Member

Closing this issue as resolved by apache/arrow#14080. When using the latest nightly version of pyarrow the original example now passes. Thanks for reporting @bnaul

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something is broken needs attention It's been a while since this was pushed on. Needs attention from the owner or a maintainer. parquet upstream
Projects
None yet
Development

No branches or pull requests

3 participants