to_parquet fails for nullable dtype index #9186
Thanks for surfacing @bnaul. I'm able to reproduce using the latest
I'm able to reproduce with
In [1]: import pandas as pd
In [2]: import pyarrow as pa
In [3]: df = pd.DataFrame({"a": [1, 2]}, index=pd.Index(["A", "B"], dtype="string"))
In [4]: pa.Schema.from_pandas(df)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Input In [4], in <cell line: 1>()
----> 1 pa.Schema.from_pandas(df)
File ~/mambaforge/envs/dask/lib/python3.10/site-packages/pyarrow/types.pxi:1663, in pyarrow.lib.Schema.from_pandas()
File ~/mambaforge/envs/dask/lib/python3.10/site-packages/pyarrow/pandas_compat.py:529, in dataframe_to_types(df, preserve_index, columns)
527 type_ = pa.array(c, from_pandas=True).type
528 elif _pandas_api.is_extension_array_dtype(values):
--> 529 type_ = pa.array(c.head(0), from_pandas=True).type
530 else:
531 values, type_ = get_datetimetz_type(values, c.dtype, None)
AttributeError: 'Index' object has no attribute 'head'
This is also true when just using
In [5]: df = pd.DataFrame({"a": [1, 2]}, index=pd.Index(["A", "B"]))
In [6]: pa.Schema.from_pandas(df)
Out[6]:
a: int64
__index_level_0__: string
-- schema metadata --
pandas: '{"index_columns": ["__index_level_0__"], "column_indexes": [{"na' + 430
Have a possible workaround I'll push up in a bit...
Thanks @jrbourbeau, I am seeing that passing
So it turns out while
df = pd.DataFrame({"a": [1, 2]}, index=pd.Index(["A", "B"], dtype="string"))
pa.Schema.from_pandas(df)
doesn't work, creating a
df = pd.DataFrame({"a": [1, 2]}, index=pd.Index(["A", "B"], dtype="string"))
pa.Table.from_pandas(df).schema
does kind of work. It doesn't raise an exception but also doesn't preserve the
Reported upstream here. I think changing to
Closing this issue as resolved by apache/arrow#14080. When using the latest nightly version of
git bisect traced this back to #9131:
Removing dtype="string" fixes the issue.
I can't quite make out whether the mistaken assumption is in dask or pyarrow...? But regardless, since it worked prior to #9131, it seems like dask should be able to work around the issue.
cc @jcrist