
[Python] pandas.ExtensionDtype does not survive round trip with write_to_dataset #24447

Closed
asfimport opened this issue Mar 28, 2020 · 3 comments


asfimport commented Mar 28, 2020

write_to_dataset with pandas columns that use a pandas.ExtensionDtype (nullable Int64 or string) produces Parquet files which, when read back, have different dtypes than the original DataFrame:

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

parquet_dataset = 'parquet_dataset/'
parquet_file = 'test.parquet'

df = pd.DataFrame([{'str_col': 'abc', 'int_col': 1, 'part': 1},
                   {'str_col': np.nan, 'int_col': np.nan, 'part': 1}])
df['str_col'] = df['str_col'].astype(pd.StringDtype())
df['int_col'] = df['int_col'].astype(pd.Int64Dtype())

table = pa.Table.from_pandas(df)

pq.write_to_dataset(table, root_path=parquet_dataset, partition_cols=['part'])
pq.write_table(table, where=parquet_file)

write_table handles the schema correctly; the pandas.ExtensionDtype columns survive the round trip:

pq.read_table(parquet_file).to_pandas().dtypes 
str_col string 
int_col Int64 
part int64 

However, write_to_dataset falls back to object/float64:

pq.read_table(parquet_dataset).to_pandas().dtypes 
str_col object 
int_col float64 
part category 

I have also tried writing _common_metadata at the top level of the partitioned dataset and then passing that metadata to read_table, but the results are the same as without it:

pq.write_metadata(table.schema, parquet_dataset + '_common_metadata', version='2.0')
meta = pq.read_metadata(parquet_dataset + '_common_metadata')
pq.read_table(parquet_dataset, metadata=meta).to_pandas().dtypes

This also affects pandas to_parquet when partition_cols is specified:

df.to_parquet(path=parquet_dataset, partition_cols=['part'])
pd.read_parquet(parquet_dataset).dtypes
str_col object 
int_col float64 
part category 
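
As a stopgap, the lost extension dtypes can be re-applied by hand after reading the partitioned dataset back. A minimal sketch, assuming the column-to-dtype mapping from this example (not a general fix):

# Hypothetical workaround: restore the extension dtypes manually.
# astype('Int64') turns the NaN back into pd.NA; astype('string')
# does the same for the object column.
df_back = pq.read_table(parquet_dataset).to_pandas()
df_back = df_back.astype({'str_col': 'string', 'int_col': 'Int64'})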

 

Environment: pandas 1.0.1, pyarrow 0.16
Reporter: Ged Steponavicius
Assignee: Joris Van den Bossche / @jorisvandenbossche


Note: This issue was originally created as ARROW-8251. Please see the migration documentation for further details.


Joris Van den Bossche / @jorisvandenbossche:
[~Ged.Steponavicius] Thanks for the report!

I think this might be due to the ignore_metadata=True here, where the table is converted back to a dataframe in order to use pandas' groupby:

df = table.to_pandas(ignore_metadata=True)

I am not fully sure what the original reason was for that ignore_metadata, and there also doesn't seem to be a test that fails if I remove it.

(In general, write_to_dataset is not very efficient right now, because it converts back and forth to pandas.)
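
For illustration, the effect of that flag can be reproduced directly on the table from the report above. A minimal sketch; ignore_metadata is an existing Table.to_pandas keyword, and the expected dtypes are noted as comments:

# With the pandas schema metadata, the extension dtypes are restored:
table.to_pandas().dtypes                       # str_col: string, int_col: Int64
# Ignoring the metadata falls back to the default conversion:
table.to_pandas(ignore_metadata=True).dtypes   # str_col: object, int_col: float64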


Joris Van den Bossche / @jorisvandenbossche:
ARROW-7782 might be related


Francois Saint-Jacques / @fsaintjacques:
Issue resolved by pull request #7054.
