
[Python] ParquetDataset().read columns argument always returns partition column #20409

Closed

asfimport opened this issue Nov 23, 2018 · 5 comments

asfimport commented Nov 23, 2018

I just noticed that no matter which columns are specified when loading a dataset, the partition column is always returned. This might lead to strange behaviour, as the resulting DataFrame has more columns than expected:

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import os
import numpy as np
import shutil

PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'

if os.path.exists(PATH_PYARROW_MANUAL):
    shutil.rmtree(PATH_PYARROW_MANUAL)
os.mkdir(PATH_PYARROW_MANUAL)

# ragged object array mixing list values and NaN (dtype=object is needed on newer numpy)
arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan], dtype=object)
strings = np.array([np.nan, np.nan, 'a', 'b'])

df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
df.index.name = 'DPRD_ID'
df['arrays'] = pd.Series(arrays)
df['strings'] = pd.Series(strings)

my_schema = pa.schema([('DPRD_ID', pa.int64()),
                       ('partition_column', pa.int32()),
                       ('arrays', pa.list_(pa.int32())),
                       ('strings', pa.string()),
                       ('new_column', pa.string())])  # 'new_column' is intentionally not present in df

table = pa.Table.from_pandas(df, schema=my_schema)
pq.write_to_dataset(table, root_path=PATH_PYARROW_MANUAL, partition_cols=['partition_column'])

df_pq = pq.ParquetDataset(PATH_PYARROW_MANUAL).read(columns=['DPRD_ID', 'strings']).to_pandas()
# pd.read_parquet(PATH_PYARROW_MANUAL, columns=['DPRD_ID', 'strings'], engine='pyarrow')
df_pq

df_pq still contains the partition_column column, even though only DPRD_ID and strings were requested.
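
For illustration, the requested versus actually returned column sets (a sketch; the exact column order and index handling may vary with the pyarrow/pandas versions in use):

list(df_pq.columns)
# requested: ['DPRD_ID', 'strings']
# returned:  ['DPRD_ID', 'strings', 'partition_column']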

Reporter: Christian Thiel / @c-thiel
Assignee: Joris Van den Bossche / @jorisvandenbossche


Note: This issue was originally created as ARROW-3861. Please see the migration documentation for further details.


Wes McKinney / @wesm:
Should not be too difficult to fix. Patches welcome.


Joris Van den Bossche / @jorisvandenbossche:
@c-thiel note that the way you create and pass the schema (with "new" columns and the index column specified) now raises an error. I opened ARROW-5220 for that.
What was your intent in adding "new_column" to the schema? That it would be created in the actual table?
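
For reference, a minimal sketch of the situation described above (the exact exception type and message depend on the pyarrow version; this is illustrative only):

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'partition_column': [0, 0, 1, 1]})
df.index.name = 'DPRD_ID'

# 'new_column' exists only in the schema, not in the DataFrame
schema = pa.schema([('DPRD_ID', pa.int64()),
                    ('partition_column', pa.int32()),
                    ('new_column', pa.string())])

try:
    pa.Table.from_pandas(df, schema=schema)
except Exception as exc:  # recent pyarrow raises here (see ARROW-5220)
    print("schema/DataFrame mismatch:", exc)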


Christian Thiel / @c-thiel:
@jorisvandenbossche thanks for the info.

Yes, my intention with "new_column" is for it to be added. This is, however, not primarily related to this issue; the code example above is just my usual test case for my own code, which modifies the DataFrame to match a schema beforehand.

In my opinion the schema should be the single source of truth: columns of the DataFrame which are not part of the schema should be dropped (or raise an error), and columns which are not in the DataFrame should be added, filled with the null value corresponding to the schema dtype (or, again, raise an error).
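
As an illustration of that conforming step, a minimal sketch (a hypothetical helper, not part of pyarrow; the null-filling behaviour is an assumption):

import pandas as pd
import pyarrow as pa

def conform_to_schema(df: pd.DataFrame, schema: pa.Schema) -> pd.DataFrame:
    """Drop columns the schema does not know and add missing ones as all-null."""
    out = df.copy()
    # drop DataFrame columns that are not part of the schema
    out = out[[name for name in out.columns if name in schema.names]]
    # add schema columns missing from the DataFrame, filled with nulls
    for name in schema.names:
        if name not in out.columns and name != out.index.name:
            out[name] = pd.Series([None] * len(out), index=out.index)
    return out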

I am not sure how the index should be handled. I really do not like that we cannot specify the dtype there. I believe this is due to the index being saved in the Parquet metadata, which also implies that the information in the index, presumably the most important column, is not as easily available across platforms as a regular column. For all my applications I stopped writing the index to the Parquet file and use a regular Parquet column instead. If you make sure it is the first column, the performance implications when using S3 are minimal, as no seek needs to be performed. This is also supported by the fact that write_to_dataset no longer supports index preservation.
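
A minimal sketch of that pattern, reusing df and PATH_PYARROW_MANUAL from the example above (preserve_index=False is the relevant knob; everything else is unchanged):

import pyarrow as pa
import pyarrow.parquet as pq

# materialise the index as an ordinary first column instead of index metadata
df_out = df.reset_index()

table = pa.Table.from_pandas(df_out, preserve_index=False)
pq.write_to_dataset(table, root_path=PATH_PYARROW_MANUAL,
                    partition_cols=['partition_column'])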

The only other major thing bothering me is that ints can't be NaN. I really like the pandas Int64 columns; however, as this is not supported by Parquet yet as far as I know, that is a problem for another day.
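
For context, the pandas nullable integer dtype referred to here (whether it round-trips through Parquet depends on the pandas/pyarrow versions in use):

import pandas as pd

# nullable integer column: holds missing values without upcasting to float64
s = pd.Series([1, 2, None], dtype="Int64")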


Joris Van den Bossche / @jorisvandenbossche:
This now works correctly with the new Datasets API:

In [26]: pq.ParquetDataset(PATH_PYARROW_MANUAL).read(columns=['DPRD_ID', 'strings']).to_pandas()                                                                                                                                                                                           
Out[26]: 
   DPRD_ID strings partition_column
0        0     nan                0
1        1     nan                0
2        2       a                1
3        3       b                1

vs

In [28]: import pyarrow.dataset as ds                                                                                                                                                                              

In [29]: ds.dataset(PATH_PYARROW_MANUAL).to_table(columns=['DPRD_ID', 'strings']).to_pandas()                                                                                                                      
Out[29]: 
   DPRD_ID strings
0        0     nan
1        1     nan
2        2       a
3        3       b

So once we use the datasets API under the hood in pyarrow.parquet (ARROW-8039), this issue should be solved (we might want to add a test for it).
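
A rough sketch of what such a test could look like, using pytest's tmp_path fixture (not the test that was actually added; it simply checks that only the requested columns come back):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def test_read_columns_excludes_partition_column(tmp_path):
    df = pd.DataFrame({'partition_column': [0, 0, 1, 1],
                       'strings': ['a', 'b', 'c', 'd']})
    pq.write_to_dataset(pa.Table.from_pandas(df, preserve_index=False),
                        root_path=str(tmp_path),
                        partition_cols=['partition_column'])
    result = pq.ParquetDataset(str(tmp_path)).read(columns=['strings'])
    assert result.column_names == ['strings']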


Francois Saint-Jacques / @fsaintjacques:
Issue resolved by pull request #7050.
