
[Python] Losing index information when using write_to_dataset with partition_cols #24016

Closed
asfimport opened this issue Feb 6, 2020 · 7 comments

@asfimport

One cannot save the index when using pyarrow.parquet.write_to_dataset() with the partition_cols argument. Here is a minimal example which shows the issue:

 
from pathlib import Path
import pandas as pd
from pyarrow import Table
from pyarrow.parquet import write_to_dataset, read_table

path = Path('/home/user/trials')
file_name = 'local_database.parquet'
df = pd.DataFrame({"A": [1, 2, 3], "B": ['a', 'a', 'b']},
                  index=pd.Index(['a', 'b', 'c'], name='idx'))

table = Table.from_pandas(df)
write_to_dataset(table,
                 str(path / file_name),
                 partition_cols=['B'])
df_read = read_table(str(path / file_name))
df_read.to_pandas()

 

The issue is rather important for pandas and dask users.

Environment: pyarrow==0.15.1
Reporter: Ludwik Bielczynski
Assignee: Joris Van den Bossche / @jorisvandenbossche

Note: This issue was originally created as ARROW-7782. Please see the migration documentation for further details.

@asfimport

Joris Van den Bossche / @jorisvandenbossche:
This might be solved in master (will be released as 0.16 this week):

In [1]: from pathlib import Path 
   ...: import pandas as pd 
   ...: from pyarrow import Table 
   ...: from pyarrow.parquet import write_to_dataset 
   ...: path = Path('.') 
   ...: file_name = 'trial_pq.parquet' 
   ...: df = pd.DataFrame({"A": [1, 2, 3],  
   ...:  "B": ['a', 'a', 'b'] 
   ...:  },  
   ...:  index=pd.Index(['a', 'b', 'c'], name='idx')) 
   ...:  
   ...: table = Table.from_pandas(df) 
   ...: write_to_dataset(table, str(path / file_name), partition_cols=['B'], 
   ...:  partition_filename_cb=None, filesystem=None) 

In [2]: table
Out[2]: 
pyarrow.Table
A: int64
B: string
idx: string
metadata
--------
{b'pandas': b'{"index_columns": ["idx"], "column_indexes": [{"name": null, "fi'
            b'eld_name": null, "pandas_type": "unicode", "numpy_type": "object'
            b'", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "A"'
            b', "field_name": "A", "pandas_type": "int64", "numpy_type": "int6'
            b'4", "metadata": null}, {"name": "B", "field_name": "B", "pandas_'
            b'type": "unicode", "numpy_type": "object", "metadata": null}, {"n'
            b'ame": "idx", "field_name": "idx", "pandas_type": "unicode", "num'
            b'py_type": "object", "metadata": null}], "creator": {"library": "'
            b'pyarrow", "version": "0.15.1.dev736+g46d0b7f47"}, "pandas_versio'
            b'n": "1.1.0.dev0+369.ga62dbda20"}'}

In [3]: pd.read_parquet(file_name)
Out[3]: 
   A idx  B
0  1   a  a
1  2   b  a
2  3   c  b

which seems to preserve the "idx" index, albeit as a regular column?

@asfimport

Ludwik Bielczynski:
Maybe I was not clear. The index is not lost entirely, but it is moved from the index into a regular column. That is the buggy behaviour I am describing. Version 0.9, before ARROW-2891, preserved the index as an index between writing and reading the data.

Does it make sense?

@asfimport

Joris Van den Bossche / @jorisvandenbossche:
Ah, OK. So the index information is preserved for single files (with your dataframe above):

In [4]: df
Out[4]: 
     A  B
idx      
a    1  a
b    2  a
c    3  b

In [5]: df.to_parquet("test_index.parquet")

In [6]: pd.read_parquet("test_index.parquet")
Out[6]: 
     A  B
idx      
a    1  a
b    2  a
c    3  b

but for partitioned data this is more difficult.
The problem is the current implementation of write_to_dataset, which splits the pandas DataFrame into parts using pandas' groupby and then writes those parts to Parquet. In that process, the index information is lost. It needs some more investigation to determine whether it is easy, or even possible, to preserve that information here.
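As a workaround on affected versions, the index can be demoted to an ordinary column before writing and restored explicitly after reading. A minimal sketch in plain pandas, with the write_to_dataset / read_table round trip elided as comments and assuming the index name 'idx' is known to the caller:

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3], "B": ["a", "a", "b"]},
    index=pd.Index(["a", "b", "c"], name="idx"),
)

# Demote the index to an ordinary column so the groupby-based partitioned
# write treats it like any other data column.
df_flat = df.reset_index()

# ... write_to_dataset(Table.from_pandas(df_flat), ..., partition_cols=['B'])
# ... df_flat = read_table(...).to_pandas() ...

# After reading back, restore the index explicitly.
df_round = df_flat.set_index("idx")

assert df_round.equals(df)
```

This sidesteps the bug because the index survives the partitioned write as a named column, at the cost of the caller having to remember the index name on the read side.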

@asfimport

Ludwik Bielczynski:
Thanks, Joris, for checking this out. Yes, for single files the index is preserved; however, as you are surely aware, the usual use case for Parquet datasets is not that simple. Preserving the index in one case and resetting it in the other is likely to confuse users.

Please let me know when you have more information about the feasibility of a fix.

@asfimport

Joris Van den Bossche / @jorisvandenbossche:
Issue resolved by pull request #7054

@asfimport

Tom Augspurger / @TomAugspurger:
Joris, was this fix included in 0.17.1? Or is it just for 1.0?

@asfimport

Joris Van den Bossche / @jorisvandenbossche:
I don't think we backported this to 0.17.x, so it will only be in master / 1.0.0.

@asfimport asfimport added this to the 1.0.0 milestone Jan 11, 2023