[C++][Dataset] Parquet Dataset factory from a _metadata/_common_metadata file #24275

asfimport · 2020-03-10T18:15:51Z

Partitioned parquet datasets sometimes come with _metadata / _common_metadata files. Those files include information about the schema of the full dataset and potentially all RowGroup metadata as well (for _metadata).

Using those files when creating a parquet Dataset can give a more efficient factory: it can use the stored schema instead of inferring one by unioning the schemas of all files, and it can use the recorded paths to the individual parquet files instead of crawling the directory.

Basically, based on those files, the schema, the list of paths and the partition expressions (the information that is needed to create a Dataset) could be constructed.
Such logic could be put in a separate factory class, e.g. ParquetManifestFactory (as suggested by @fsaintjacques).
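
To make the construction described above concrete, here is a minimal sketch in plain Python (standard library only, not the Arrow API). It assumes the per-row-group file paths have already been read out of `_metadata`, that they are relative to the dataset root, and that the directories use `key=value` (hive-style) naming. The function names and the example paths are hypothetical and exist only for illustration.

```python
from pathlib import PurePosixPath


def partition_expression(file_path):
    """Collect key=value directory segments from a relative file path.

    Assumption: partitions are encoded hive-style in the directory names.
    """
    directories = PurePosixPath(file_path).parts[:-1]  # drop the file name
    expression = {}
    for segment in directories:
        key, sep, value = segment.partition("=")
        if sep:  # skip directories that are not of the form key=value
            expression[key] = value
    return expression


def fragments_from_manifest(row_group_paths):
    """Map each distinct file path to its partition expression.

    A file with several row groups appears several times in the input,
    so paths are deduplicated while keeping first-seen order.
    """
    fragments = {}
    for path in row_group_paths:
        fragments.setdefault(path, partition_expression(path))
    return fragments


# Hypothetical paths, one entry per row group, as a _metadata file might list them.
row_group_paths = [
    "year=2019/part-0.parquet",
    "year=2019/part-0.parquet",
    "year=2020/part-0.parquet",
]
fragments = fragments_from_manifest(row_group_paths)
```

Together with the schema stored in the metadata file, the resulting mapping covers the list of paths and the partition expressions without listing the directory or opening the individual data files.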

Reporter: Joris Van den Bossche / @jorisvandenbossche
Assignee: Francois Saint-Jacques / @fsaintjacques

Related issues:

[Python][Dataset] Detect and use _metadata file in a list of file paths (relates to)
[C++][Dataset] Scanner::ToTable race when ScanTask exit early with an error (relates to)
[Python][C++] Possibly use _common_metadata for schema if _metadata isn't available (relates to)
[Python] Multi-file parquet loading without scan (relates to)
[C++][Dataset][Python] ParquetFileFragment should provide access to parquet FileMetadata (relates to)

PRs and other links:

GitHub Pull Request #7180

Note: This issue was originally created as ARROW-8062. Please see the migration documentation for further details.


asfimport · 2020-05-25T19:20:30Z

Francois Saint-Jacques / @fsaintjacques:
Issue resolved by pull request 7180
#7180

asfimport closed this as completed May 25, 2020

asfimport assigned fsaintjacques Jan 10, 2023

asfimport added this to the 1.0.0 milestone Jan 11, 2023
