
[Python] Losing index information when using write_to_dataset with partition_cols #24016

Closed
asfimport opened this issue Feb 6, 2020 · 7 comments

@asfimport

One cannot save the index when using pyarrow.parquet.write_to_dataset() with the partition_cols argument. Here is a minimal example which shows the issue:

 
from pathlib import Path
import pandas as pd
from pyarrow import Table
from pyarrow.parquet import write_to_dataset, read_table

path = Path('/home/user/trials')
file_name = 'local_database.parquet'
df = pd.DataFrame({"A": [1, 2, 3], "B": ['a', 'a', 'b']},
                  index=pd.Index(['a', 'b', 'c'], name='idx'))

table = Table.from_pandas(df)
write_to_dataset(table,
                 str(path / file_name),
                 partition_cols=['B'])
df_read = read_table(str(path / file_name))
df_read.to_pandas()

 

The issue is rather important for pandas and dask users.

Environment: pyarrow==0.15.1
Reporter: Ludwik Bielczynski
Assignee: Joris Van den Bossche / @jorisvandenbossche

Note: This issue was originally created as ARROW-7782. Please see the migration documentation for further details.

@asfimport

Joris Van den Bossche / @jorisvandenbossche:
This might be solved in master (will be released as 0.16 this week):

In [1]: from pathlib import Path 
   ...: import pandas as pd 
   ...: from pyarrow import Table 
   ...: from pyarrow.parquet import write_to_dataset 
   ...: path = Path('.') 
   ...: file_name = 'trial_pq.parquet' 
   ...: df = pd.DataFrame({"A": [1, 2, 3],  
   ...:  "B": ['a', 'a', 'b'] 
   ...:  },  
   ...:  index=pd.Index(['a', 'b', 'c'], name='idx')) 
   ...:  
   ...: table = Table.from_pandas(df) 
   ...: write_to_dataset(table, str(path / file_name), partition_cols=['B'], 
   ...:  partition_filename_cb=None, filesystem=None) 

In [2]: table
Out[2]: 
pyarrow.Table
A: int64
B: string
idx: string
metadata
--------
{b'pandas': b'{"index_columns": ["idx"], "column_indexes": [{"name": null, "fi'
            b'eld_name": null, "pandas_type": "unicode", "numpy_type": "object'
            b'", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "A"'
            b', "field_name": "A", "pandas_type": "int64", "numpy_type": "int6'
            b'4", "metadata": null}, {"name": "B", "field_name": "B", "pandas_'
            b'type": "unicode", "numpy_type": "object", "metadata": null}, {"n'
            b'ame": "idx", "field_name": "idx", "pandas_type": "unicode", "num'
            b'py_type": "object", "metadata": null}], "creator": {"library": "'
            b'pyarrow", "version": "0.15.1.dev736+g46d0b7f47"}, "pandas_versio'
            b'n": "1.1.0.dev0+369.ga62dbda20"}'}

In [3]: pd.read_parquet(file_name)
Out[3]: 
   A idx  B
0  1   a  a
1  2   b  a
2  3   c  b

which seems to preserve the "idx" index, albeit as a regular column?

@asfimport

Ludwik Bielczynski:
Maybe I was not clear. The index is not lost entirely, but it is moved from the index into a regular column. That is the buggy behaviour I am describing. Version 0.9, before ARROW-2891, preserved the index as an index between writing and reading the data.

Does it make sense?

@asfimport

Joris Van den Bossche / @jorisvandenbossche:
Ah, OK. So the index information is preserved for single files (with your dataframe above):

In [4]: df
Out[4]: 
     A  B
idx      
a    1  a
b    2  a
c    3  b

In [5]: df.to_parquet("test_index.parquet")

In [6]: pd.read_parquet("test_index.parquet")
Out[6]: 
     A  B
idx      
a    1  a
b    2  a
c    3  b

but for partitioned data this is more difficult.
The problem is the current implementation of write_to_dataset, which splits the pandas DataFrame into parts using pandas' groupby and then writes those parts to Parquet. In that process, the index information is lost. It needs some more investigation to determine whether it is easy, or even possible, to preserve that information here.
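As a workaround on affected versions, the index can be demoted to an ordinary column before writing and restored explicitly after reading. A minimal sketch in plain pandas, with the write_to_dataset / read_table round trip elided as comments and assuming the index name 'idx' is known to the caller:

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3], "B": ["a", "a", "b"]},
    index=pd.Index(["a", "b", "c"], name="idx"),
)

# Demote the index to an ordinary column so the groupby-based partitioned
# write treats it like any other data column.
df_flat = df.reset_index()

# ... write_to_dataset(Table.from_pandas(df_flat), ..., partition_cols=['B'])
# ... df_flat = read_table(...).to_pandas() ...

# After reading back, restore the index explicitly.
df_round = df_flat.set_index("idx")

assert df_round.equals(df)
```

This sidesteps the bug because the index survives the partitioned write as a named column, at the cost of the caller having to remember the index name on the read side.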

@asfimport

Ludwik Bielczynski:
Thanks, Joris, for checking this out. Yes, for single files the index is preserved; however, as you are surely aware, the usual use case for Parquet datasets is not that simple. Preserving the index in one case and resetting it in the other is likely to confuse users.

Please let me know when you have more information about the feasibility of a fix.

@asfimport

Joris Van den Bossche / @jorisvandenbossche:
Issue resolved by pull request #7054

@asfimport

Tom Augspurger / @TomAugspurger:
Joris, was this fix included in 0.17.1? Or is it just for 1.0?

@asfimport

Joris Van den Bossche / @jorisvandenbossche:
I don't think we backported this to 0.17.x, so it will only be in master / 1.0.0.

@asfimport asfimport added this to the 1.0.0 milestone Jan 11, 2023