[Python] Losing index information when using write_to_dataset with partition_cols #24016
Comments
Joris Van den Bossche / @jorisvandenbossche: In [1]: from pathlib import Path
...: import pandas as pd
...: from pyarrow import Table
...: from pyarrow.parquet import write_to_dataset
...: path = Path('.')
...: file_name = 'trial_pq.parquet'
...: df = pd.DataFrame({"A": [1, 2, 3],
...:                    "B": ['a', 'a', 'b']
...:                   },
...:                   index=pd.Index(['a', 'b', 'c'], name='idx'))
...:
...: table = Table.from_pandas(df)
...: write_to_dataset(table, str(path / file_name), partition_cols=['B'],
...:                  partition_filename_cb=None, filesystem=None)
...:
In [2]: table
Out[2]:
pyarrow.Table
A: int64
B: string
idx: string
metadata
--------
{b'pandas': b'{"index_columns": ["idx"], "column_indexes": [{"name": null, "fi'
b'eld_name": null, "pandas_type": "unicode", "numpy_type": "object'
b'", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "A"'
b', "field_name": "A", "pandas_type": "int64", "numpy_type": "int6'
b'4", "metadata": null}, {"name": "B", "field_name": "B", "pandas_'
b'type": "unicode", "numpy_type": "object", "metadata": null}, {"n'
b'ame": "idx", "field_name": "idx", "pandas_type": "unicode", "num'
b'py_type": "object", "metadata": null}], "creator": {"library": "'
b'pyarrow", "version": "0.15.1.dev736+g46d0b7f47"}, "pandas_versio'
b'n": "1.1.0.dev0+369.ga62dbda20"}'}
In [3]: pd.read_parquet(file_name)
Out[3]:
   A idx  B
0  1   a  a
1  2   b  a
2  3   c  b
which seems to preserve the "idx" index as a column?
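The `pandas` entry in the table metadata printed above is a JSON document, and its `index_columns` key is what marks `idx` as the index rather than an ordinary column. A minimal sketch of reading that key with the standard library follows. The metadata literal is a shortened copy of the output above (only the keys used here are kept), and the variable names are illustrative.

```python
import json

# Shortened copy of the schema metadata shown in Out[2]; the real blob
# carries more keys (pandas_type, numpy_type, creator, ...).
schema_metadata = {
    b"pandas": (
        b'{"index_columns": ["idx"], '
        b'"columns": [{"name": "A", "field_name": "A"}, '
        b'{"name": "B", "field_name": "B"}, '
        b'{"name": "idx", "field_name": "idx"}]}'
    )
}

pandas_meta = json.loads(schema_metadata[b"pandas"].decode("utf-8"))

# Columns listed here are the ones to restore as the DataFrame index.
index_columns = pandas_meta["index_columns"]

# Every other field is a regular data column.
data_columns = [
    col["field_name"]
    for col in pandas_meta["columns"]
    if col["field_name"] not in index_columns
]

print(index_columns)  # ['idx']
print(data_columns)   # ['A', 'B']
```

If this metadata is not carried over to the files of a partitioned dataset, a reader has nothing that distinguishes `idx` from `A` or `B`, which would be consistent with `idx` coming back as a plain column in the output above.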
Ludwik Bielczynski: Does it make sense?
Joris Van den Bossche / @jorisvandenbossche: In [4]: df
Out[4]:
     A  B
idx
a    1  a
b    2  a
c    3  b
In [5]: df.to_parquet("test_index.parquet")
In [6]: pd.read_parquet("test_index.parquet")
Out[6]:
     A  B
idx
a    1  a
b    2  a
c    3  b
but for partitioned data this is more difficult.
Ludwik Bielczynski: Please let me know when you have more information on whether this issue can be fixed.
The index cannot be saved when using pyarrow.parquet.write_to_dataset() with the partition_cols argument. Here I have created a minimal example which shows the issue:
The issue is rather important for pandas and dask users.
Environment: pyarrow==0.15.1
Reporter: Ludwik Bielczynski
Assignee: Joris Van den Bossche / @jorisvandenbossche
Note: This issue was originally created as ARROW-7782. Please see the migration documentation for further details.