
[Python] Improved workflow for loading an arbitrary collection of Parquet files #19750

Closed
asfimport opened this issue Oct 3, 2018 · 3 comments


asfimport commented Oct 3, 2018

See SO question for use case: https://stackoverflow.com/questions/52613682/load-multiple-parquet-files-into-dataframe-for-analysis

Reporter: Wes McKinney / @wesm

Related issues:

Note: This issue was originally created as ARROW-3424. Please see the migration documentation for further details.


Joris Van den Bossche / @jorisvandenbossche:
A list of files is already supported in ParquetDataset, so something like this (which I think addresses the SO question) works:

import pyarrow.parquet as pq

dataset = pq.ParquetDataset(['part0.parquet', 'part1.parquet'])
dataset.read_pandas().to_pandas()

Do we think that is enough support? (If so, this issue can be closed, I think.)
Or do we want to add this to pq.read_table? (That function e.g. also accepts a directory name, which is then passed through to ParquetDataset; we could do a similar pass-through for a list of paths.)


Wes McKinney / @wesm:
Yes, that might work. I think we should hold off until we can migrate this logic into C++, though.


Joris Van den Bossche / @jorisvandenbossche:
The new Dataset API supports creating a dataset from a list of files, both in the higher-level ds.dataset(...) (which infers the schema) and in the lower-level ds.FileSystemDataset(...).
This functionality is now also exposed in pq.read_table and pq.ParquetDataset via the use_legacy_dataset=False keyword (ARROW-8039).
