
[Python] pandas.ExtensionDtype does not survive round trip with write_to_dataset #24447

Closed
asfimport opened this issue Mar 28, 2020 · 3 comments


asfimport commented Mar 28, 2020

write_to_dataset with pandas columns that use a pandas.ExtensionDtype (nullable Int64 or string) produces Parquet files which, when read back, have different dtypes than the original DataFrame:

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

parquet_dataset = 'parquet_dataset/'
parquet_file = 'test.parquet'

df = pd.DataFrame([{'str_col': 'abc', 'int_col': 1, 'part': 1},
                   {'str_col': np.nan, 'int_col': np.nan, 'part': 1}])
df['str_col'] = df['str_col'].astype(pd.StringDtype())
df['int_col'] = df['int_col'].astype(pd.Int64Dtype())

table = pa.Table.from_pandas(df)

pq.write_to_dataset(table, root_path=parquet_dataset, partition_cols=['part'])
pq.write_table(table, where=parquet_file)

write_table handles the schema correctly; the pandas.ExtensionDtype columns survive the round trip:

pq.read_table(parquet_file).to_pandas().dtypes 
str_col string 
int_col Int64 
part int64 

However, write_to_dataset falls back to object/float64:

pq.read_table(parquet_dataset).to_pandas().dtypes 
str_col object 
int_col float64 
part category 

I have also tried writing _common_metadata at the top level of the partitioned dataset and then passing that metadata to read_table, but the results are the same as without it:

pq.write_metadata(table.schema, parquet_dataset + '_common_metadata', version='2.0')
meta = pq.read_metadata(parquet_dataset + '_common_metadata')
pq.read_table(parquet_dataset, metadata=meta).to_pandas().dtypes

This also affects pandas to_parquet when partition_cols is specified:

df.to_parquet(path=parquet_dataset, partition_cols=['part'])
pd.read_parquet(parquet_dataset).dtypes
str_col object 
int_col float64 
part category 
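
As a stopgap, the lost extension dtypes can be re-applied by hand after reading the partitioned dataset back. A minimal sketch, assuming the column-to-dtype mapping from this example (not a general fix):

# Hypothetical workaround: restore the extension dtypes manually.
# astype('Int64') turns the NaN back into pd.NA; astype('string')
# does the same for the object column.
df_back = pq.read_table(parquet_dataset).to_pandas()
df_back = df_back.astype({'str_col': 'string', 'int_col': 'Int64'})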

 

Environment: pandas 1.0.1, pyarrow 0.16
Reporter: Ged Steponavicius
Assignee: Joris Van den Bossche / @jorisvandenbossche


Note: This issue was originally created as ARROW-8251. Please see the migration documentation for further details.


Joris Van den Bossche / @jorisvandenbossche:
[~Ged.Steponavicius] Thanks for the report!

I think this might be due to the ignore_metadata=True here, where the table is converted back to a dataframe in order to use pandas' groupby:

df = table.to_pandas(ignore_metadata=True)

I am not fully sure what the original reason was for that ignore_metadata, and there also doesn't seem to be a test that fails if I remove it.

(In general, write_to_dataset is not very efficient right now, because it converts back and forth to pandas.)
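
For illustration, the effect of that flag can be reproduced directly on the table from the report above. A minimal sketch; ignore_metadata is an existing Table.to_pandas keyword, and the expected dtypes are noted as comments:

# With the pandas schema metadata, the extension dtypes are restored:
table.to_pandas().dtypes                       # str_col: string, int_col: Int64
# Ignoring the metadata falls back to the default conversion:
table.to_pandas(ignore_metadata=True).dtypes   # str_col: object, int_col: float64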


Joris Van den Bossche / @jorisvandenbossche:
ARROW-7782 might be related


Francois Saint-Jacques / @fsaintjacques:
Issue resolved by pull request #7054.
