
[Python] ParquetDataset().read columns argument always returns partition column #20409

Closed

asfimport opened this issue Nov 23, 2018 · 5 comments

asfimport commented Nov 23, 2018

I just noticed that no matter which columns are specified when loading a dataset, the partition column is always returned. This might lead to strange behaviour, as the resulting DataFrame has more columns than expected:

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import os
import numpy as np
import shutil

PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'

if os.path.exists(PATH_PYARROW_MANUAL):
    shutil.rmtree(PATH_PYARROW_MANUAL)
os.mkdir(PATH_PYARROW_MANUAL)

# ragged object array mixing list values and NaN (dtype=object is needed on newer numpy)
arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan], dtype=object)
strings = np.array([np.nan, np.nan, 'a', 'b'])

df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
df.index.name = 'DPRD_ID'
df['arrays'] = pd.Series(arrays)
df['strings'] = pd.Series(strings)

my_schema = pa.schema([('DPRD_ID', pa.int64()),
                       ('partition_column', pa.int32()),
                       ('arrays', pa.list_(pa.int32())),
                       ('strings', pa.string()),
                       ('new_column', pa.string())])  # 'new_column' is intentionally not present in df

table = pa.Table.from_pandas(df, schema=my_schema)
pq.write_to_dataset(table, root_path=PATH_PYARROW_MANUAL, partition_cols=['partition_column'])

df_pq = pq.ParquetDataset(PATH_PYARROW_MANUAL).read(columns=['DPRD_ID', 'strings']).to_pandas()
# pd.read_parquet(PATH_PYARROW_MANUAL, columns=['DPRD_ID', 'strings'], engine='pyarrow')
df_pq

df_pq still contains the partition_column column, even though only DPRD_ID and strings were requested.
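
For illustration, the requested versus actually returned column sets (a sketch; the exact column order and index handling may vary with the pyarrow/pandas versions in use):

list(df_pq.columns)
# requested: ['DPRD_ID', 'strings']
# returned:  ['DPRD_ID', 'strings', 'partition_column']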

Reporter: Christian Thiel / @c-thiel
Assignee: Joris Van den Bossche / @jorisvandenbossche


Note: This issue was originally created as ARROW-3861. Please see the migration documentation for further details.


Wes McKinney / @wesm:
Should not be too difficult to fix. Patches welcome.


Joris Van den Bossche / @jorisvandenbossche:
@c-thiel note that the way you create and pass the schema (with "new" columns and the index column specified) now raises an error. I opened ARROW-5220 for that.
What was your intent in adding "new_column" to the schema? That it would be created in the actual table?
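
For reference, a minimal sketch of the situation described above (the exact exception type and message depend on the pyarrow version; this is illustrative only):

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'partition_column': [0, 0, 1, 1]})
df.index.name = 'DPRD_ID'

# 'new_column' exists only in the schema, not in the DataFrame
schema = pa.schema([('DPRD_ID', pa.int64()),
                    ('partition_column', pa.int32()),
                    ('new_column', pa.string())])

try:
    pa.Table.from_pandas(df, schema=schema)
except Exception as exc:  # recent pyarrow raises here (see ARROW-5220)
    print("schema/DataFrame mismatch:", exc)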


Christian Thiel / @c-thiel:
@jorisvandenbossche thanks for the info.

Yes, my intention with "new_column" is for it to be added. This is, however, not primarily related to this issue; the code example above is just my usual test case for my own code, which modifies the DataFrame to match a schema beforehand.

In my opinion the schema should be the single source of truth: columns of the DataFrame which are not part of the schema should be dropped (or raise an error), and columns which are not in the DataFrame should be added, filled with the null value corresponding to the schema dtype (or, again, raise an error).
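
As an illustration of that conforming step, a minimal sketch (a hypothetical helper, not part of pyarrow; the null-filling behaviour is an assumption):

import pandas as pd
import pyarrow as pa

def conform_to_schema(df: pd.DataFrame, schema: pa.Schema) -> pd.DataFrame:
    """Drop columns the schema does not know and add missing ones as all-null."""
    out = df.copy()
    # drop DataFrame columns that are not part of the schema
    out = out[[name for name in out.columns if name in schema.names]]
    # add schema columns missing from the DataFrame, filled with nulls
    for name in schema.names:
        if name not in out.columns and name != out.index.name:
            out[name] = pd.Series([None] * len(out), index=out.index)
    return out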

I am not sure how the index should be handled. I really do not like that we cannot specify the dtype there. I believe this is due to the index being saved in the Parquet metadata, which also implies that the information in the index, presumably the most important column, is not as easily available across platforms as a regular column. For all my applications I stopped writing the index to the Parquet file and use a regular Parquet column instead. If you make sure it is the first column, the performance implications when using S3 are minimal, as no seek needs to be performed. This is also supported by the fact that write_to_dataset no longer supports index preservation.
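
A minimal sketch of that pattern, reusing df and PATH_PYARROW_MANUAL from the example above (preserve_index=False is the relevant knob; everything else is unchanged):

import pyarrow as pa
import pyarrow.parquet as pq

# materialise the index as an ordinary first column instead of index metadata
df_out = df.reset_index()

table = pa.Table.from_pandas(df_out, preserve_index=False)
pq.write_to_dataset(table, root_path=PATH_PYARROW_MANUAL,
                    partition_cols=['partition_column'])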

The only other major thing bothering me is that ints can't be NaN. I really like the pandas Int64 columns; however, as this is not supported by Parquet yet as far as I know, that is a problem for another day.
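
For context, the pandas nullable integer dtype referred to here (whether it round-trips through Parquet depends on the pandas/pyarrow versions in use):

import pandas as pd

# nullable integer column: holds missing values without upcasting to float64
s = pd.Series([1, 2, None], dtype="Int64")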


Joris Van den Bossche / @jorisvandenbossche:
This now works correctly with the new Datasets API:

In [26]: pq.ParquetDataset(PATH_PYARROW_MANUAL).read(columns=['DPRD_ID', 'strings']).to_pandas()                                                                                                                                                                                           
Out[26]: 
   DPRD_ID strings partition_column
0        0     nan                0
1        1     nan                0
2        2       a                1
3        3       b                1

vs

In [28]: import pyarrow.dataset as ds                                                                                                                                                                              

In [29]: ds.dataset(PATH_PYARROW_MANUAL).to_table(columns=['DPRD_ID', 'strings']).to_pandas()                                                                                                                      
Out[29]: 
   DPRD_ID strings
0        0     nan
1        1     nan
2        2       a
3        3       b

So once we use the datasets API under the hood in pyarrow.parquet (ARROW-8039), this issue should be solved (we might want to add a test for it).
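
A rough sketch of what such a test could look like, using pytest's tmp_path fixture (not the test that was actually added; it simply checks that only the requested columns come back):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def test_read_columns_excludes_partition_column(tmp_path):
    df = pd.DataFrame({'partition_column': [0, 0, 1, 1],
                       'strings': ['a', 'b', 'c', 'd']})
    pq.write_to_dataset(pa.Table.from_pandas(df, preserve_index=False),
                        root_path=str(tmp_path),
                        partition_cols=['partition_column'])
    result = pq.ParquetDataset(str(tmp_path)).read(columns=['strings'])
    assert result.column_names == ['strings']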


Francois Saint-Jacques / @fsaintjacques:
Issue resolved by pull request #7050.
