Pyarrow not adding partition columns when given a glob path #2147

Closed
pranav-kohli opened this issue Jun 20, 2018 · 4 comments

Comments

@pranav-kohli

I am saving a dask dataframe to parquet with two partition columns using the pyarrow engine. The problem arises when reading the data back. When I read using the directory path, I get the partition columns in the output dataframe, whereas if I read using a glob path, I don't get these columns.

pyarrow : 0.9.0.post1

dask : 0.17.1

import numpy as np
import pandas as pd
import dask.dataframe as dd

size = 20
tmpdir = '/tmp/test/outputParquet1'
engine = 'pyarrow'

d = {'signal1': np.random.normal(0, 0.3, size=size).cumsum() + 50,
     'fake_categorical1': np.random.choice(['A', 'B', 'C'], size=size),
     'fake_categorical2': np.random.choice(['D', 'E', 'F'], size=size)}

df = dd.from_pandas(pd.DataFrame(d), npartitions=2)
df.to_parquet(tmpdir, compression='snappy', write_index=True, engine=engine,
              partition_on=['fake_categorical1', 'fake_categorical2'])

# This does not pick up the partition columns
df_partitioned = dd.read_parquet(tmpdir + '/*/*/*.parquet', engine=engine)

# This fails (the partition columns are missing from the dataframe)
# df_partitioned[df_partitioned.fake_categorical1 == 'A'].compute()

# This picks up the partition columns
df_partitioned = dd.read_parquet(tmpdir, engine=engine)

df_partitioned[df_partitioned.fake_categorical1 == 'A'].compute()
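
Since reading the directory path works, one possible workaround is to reduce the glob pattern to its wildcard-free root and pass that to read_parquet instead. The sketch below is only an illustration: glob_root is a hypothetical helper, not part of dask or pyarrow, and it assumes '/'-separated paths. Note that reading the root loads every file under it, not just the files the glob would have matched.

```python
def glob_root(pattern):
    """Return the leading part of `pattern` that contains no glob wildcards.

    Hypothetical helper for illustration; assumes '/'-separated paths.
    """
    root = []
    for part in pattern.split('/'):
        # Stop at the first path component that uses glob syntax.
        if any(ch in part for ch in '*?['):
            break
        root.append(part)
    return '/'.join(root)


pattern = '/tmp/test/outputParquet1/*/*/*.parquet'
base_dir = glob_root(pattern)  # '/tmp/test/outputParquet1'
```

The result can then be passed to the call that already works in the example above, i.e. dd.read_parquet(base_dir, engine=engine).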

Fastparquet supports the glob path, but pyarrow does not seem to.

The problem lies in the _make_manifest function. For a single directory path, it calls the _visit_level function, which populates the partitions. For a glob path, it does not call _visit_level; it just creates a list of ParquetDatasetPiece objects with no partitions.
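
To make the missing step concrete, here is a minimal sketch of how partition values could be recovered from each file path when only a list of files is available. It is not pyarrow's actual implementation: partition_values is a hypothetical helper, and it assumes the key=value directory layout that partition_on produces, with an example file name (part.0.parquet) that is only illustrative.

```python
import posixpath


def partition_values(path, root):
    """Return (key, value) pairs parsed from the key=value directories
    between `root` and the file at `path`.

    Hypothetical helper for illustration; assumes '/'-separated paths.
    """
    relative = posixpath.relpath(path, root)
    pairs = []
    # Every component except the last one (the file name) is a directory.
    for segment in relative.split('/')[:-1]:
        if '=' in segment:
            key, value = segment.split('=', 1)
            pairs.append((key, value))
    return pairs


root = '/tmp/test/outputParquet1'
# Assumed example path; the actual file name may differ.
path = root + '/fake_categorical1=A/fake_categorical2=D/part.0.parquet'
pairs = partition_values(path, root)
```

Here pairs holds one (column, value) tuple per partition directory, which is the information the glob code path currently drops.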

@pranav-kohli
Author

The function in question is _make_manifest(), in this file:
https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py

@wesm
Member

wesm commented Jun 21, 2018

@pranav-kohli could you create a JIRA with this feature request? A patch would be welcome

@pranav-kohli
Author

Created the JIRA:
https://issues.apache.org/jira/browse/ARROW-2728

@wesm
Member

wesm commented Jun 21, 2018

Thanks!

@wesm wesm closed this as completed Jun 21, 2018