
[Python] Improved workflow for loading an arbitrary collection of Parquet files #19750

Closed
asfimport opened this issue Oct 3, 2018 · 3 comments


asfimport commented Oct 3, 2018

See SO question for use case: https://stackoverflow.com/questions/52613682/load-multiple-parquet-files-into-dataframe-for-analysis

Reporter: Wes McKinney / @wesm

Related issues:

Note: This issue was originally created as ARROW-3424. Please see the migration documentation for further details.


Joris Van den Bossche / @jorisvandenbossche:
A list of files is already supported in ParquetDataset, so something like this (which I think addresses the SO question) works:

import pyarrow.parquet as pq

dataset = pq.ParquetDataset(['part0.parquet', 'part1.parquet'])
dataset.read_pandas().to_pandas()

Do we think that is enough support? (If so, this issue can be closed, I think.)
Or do we want to add this to pq.read_table? (That function e.g. also accepts a directory name, which is then passed through to ParquetDataset; we could do a similar pass-through for a list of paths.)


Wes McKinney / @wesm:
Yes, that might work. I think we should hold off until we can migrate this logic into C++, though.


Joris Van den Bossche / @jorisvandenbossche:
The new Dataset API supports creating a dataset from a list of files, both in the higher-level ds.dataset(...) (which infers the schema) and in the lower-level ds.FileSystemDataset(...).
This functionality is now also exposed in pq.read_table and pq.ParquetDataset via the use_legacy_dataset=False keyword (ARROW-8039).
