to_parquet fails for nullable dtype index #9186

bnaul · 2022-06-14T16:41:15Z

git bisect traced this back to #9131:

import dask.dataframe as dd, pandas as pd

dd.from_pandas(pd.DataFrame({"a": [1, 2]}, index=pd.Index(["A", "B"], dtype="string")), npartitions=1).to_parquet("/tmp/to_parquet_test/")


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-10-d12a786a20e0> in <module>
      1 import dask.dataframe as dd, pandas as pd
----> 2 dd.from_pandas(pd.DataFrame({"a": [1, 2]}, index=pd.Index(["A", "B"], dtype="string")), npartitions=1).to_parquet("/tmp/to_parquet_test/")

~/model/.venv/lib/python3.9/site-packages/dask/dataframe/core.py in to_parquet(self, path, *args, **kwargs)

~/model/.venv/lib/python3.9/site-packages/dask/dataframe/io/parquet/core.py in to_parquet(df, path, engine, compression, write_index, append, overwrite, ignore_divisions, partition_on, storage_options, custom_metadata, write_metadata_file, compute, compute_kwargs, schema, name_function, **kwargs)

~/model/.venv/lib/python3.9/site-packages/dask/dataframe/io/parquet/arrow.py in initialize_write(cls, df, fs, path, append, partition_on, ignore_divisions, division_info, schema, index_cols, **kwargs)

~/model/.venv/lib/python3.9/site-packages/dask/utils.py in __call__(self, arg, *args, **kwargs)

~/model/.venv/lib/python3.9/site-packages/dask/dataframe/backends.py in get_pyarrow_schema_pandas(obj)

~/model/.venv/lib/python3.9/site-packages/pyarrow/types.pxi in pyarrow.lib.Schema.from_pandas()

~/model/.venv/lib/python3.9/site-packages/pyarrow/pandas_compat.py in dataframe_to_types(df, preserve_index, columns)
    527             type_ = pa.array(c, from_pandas=True).type
    528         elif _pandas_api.is_extension_array_dtype(values):
--> 529             type_ = pa.array(c.head(0), from_pandas=True).type
    530         else:
    531             values, type_ = get_datetimetz_type(values, c.dtype, None)

AttributeError: 'Index' object has no attribute 'head'

Removing dtype="string" fixes the issue.

I can't quite make out whether the mistaken assumption is in dask or pyarrow...? But regardless since it worked prior to #9131 it seems like dask should be able to work around the issue.

cc @jcrist

The text was updated successfully, but these errors were encountered:

jrbourbeau · 2022-06-14T19:08:13Z

Thanks for surfacing @bnaul. I'm able to reproduce using the latest pyarrow=8.0.0 and pandas=1.4.2 releases (with the current dask main branch).

I can't quite make out whether the mistaken assumption is in dask or pyarrow...?

I'm able to reproduce with pyarrow alone (see snippet below) which makes me think the underlying issue is related to pyarrows handling of DataFrame's whose index are using extension dtypes. cc @jorisvandenbossche for visibility

In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: df = pd.DataFrame({"a": [1, 2]}, index=pd.Index(["A", "B"], dtype="string"))

In [4]: pa.Schema.from_pandas(df)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [4], in <cell line: 1>()
----> 1 pa.Schema.from_pandas(df)

File ~/mambaforge/envs/dask/lib/python3.10/site-packages/pyarrow/types.pxi:1663, in pyarrow.lib.Schema.from_pandas()

File ~/mambaforge/envs/dask/lib/python3.10/site-packages/pyarrow/pandas_compat.py:529, in dataframe_to_types(df, preserve_index, columns)
    527     type_ = pa.array(c, from_pandas=True).type
    528 elif _pandas_api.is_extension_array_dtype(values):
--> 529     type_ = pa.array(c.head(0), from_pandas=True).type
    530 else:
    531     values, type_ = get_datetimetz_type(values, c.dtype, None)

AttributeError: 'Index' object has no attribute 'head'

Removing dtype="string" fixes the issue.

This is also true when just using pyarrow

In [5]: df = pd.DataFrame({"a": [1, 2]}, index=pd.Index(["A", "B"]))

In [6]: pa.Schema.from_pandas(df)
Out[6]:
a: int64
__index_level_0__: string
-- schema metadata --
pandas: '{"index_columns": ["__index_level_0__"], "column_indexes": [{"na' + 430

jrbourbeau · 2022-06-14T19:36:16Z

Have a possible workaround I'll push up in a bit...

bnaul · 2022-06-14T19:40:43Z

Thanks @jrbourbeau, I am seeing that passing schema=None seems to revert to the old behavior, so I guess the change to that default argument is really what surfaced this.

jrbourbeau · 2022-06-14T21:18:23Z

So it turns out while

df = pd.DataFrame({"a": [1, 2]}, index=pd.Index(["A", "B"], dtype="string"))
pa.Schema.from_pandas(df)

doesn't work, creating a pyarrow.Table and then extracting the schema with

df = pd.DataFrame({"a": [1, 2]}, index=pd.Index(["A", "B"], dtype="string"))
pa.Table.from_pandas(df).schema

does kind of work. It doesn't raise an exception but also doesn't preserve the string[python] extension dtype for the index. Instead it converts to object.

ian-r-rose · 2022-06-15T17:05:35Z

Reported upstream here. I think changing to schema=None (or supplying a manual schema to ddf.to_parquet) is the correct workaround until it is fixed upstream.

jrbourbeau · 2022-09-28T14:55:03Z

Closing this issue as resolved by apache/arrow#14080. When using the latest nightly version of pyarrow the original example now passes. Thanks for reporting @bnaul

github-actions bot added the needs triage Needs a response from a contributor label Jun 14, 2022

hayesgb assigned ian-r-rose and jrbourbeau Jun 14, 2022

hayesgb unassigned jrbourbeau Jun 14, 2022

ian-r-rose added parquet upstream bug Something is broken and removed needs triage Needs a response from a contributor labels Jun 15, 2022

github-actions bot added the needs attention It's been a while since this was pushed on. Needs attention from the owner or a maintainer. label Jul 18, 2022

ian-r-rose mentioned this issue Aug 19, 2022

p2p shuffle fails when partition index is pandas extension dtype dask/distributed#6927

Closed

jrbourbeau closed this as completed Sep 28, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

to_parquet fails for nullable dtype index #9186

to_parquet fails for nullable dtype index #9186

bnaul commented Jun 14, 2022 •

edited

jrbourbeau commented Jun 14, 2022

jrbourbeau commented Jun 14, 2022

bnaul commented Jun 14, 2022

jrbourbeau commented Jun 14, 2022

ian-r-rose commented Jun 15, 2022

jrbourbeau commented Sep 28, 2022

to_parquet fails for nullable dtype index #9186

to_parquet fails for nullable dtype index #9186

Comments

bnaul commented Jun 14, 2022 • edited

jrbourbeau commented Jun 14, 2022

jrbourbeau commented Jun 14, 2022

bnaul commented Jun 14, 2022

jrbourbeau commented Jun 14, 2022

ian-r-rose commented Jun 15, 2022

jrbourbeau commented Sep 28, 2022

bnaul commented Jun 14, 2022 •

edited