[C++][Dataset] Parquet Dataset factory from a _metadata/_common_metadata file #24275

Closed
asfimport opened this issue Mar 10, 2020 · 1 comment


asfimport commented Mar 10, 2020

Partitioned parquet datasets sometimes come with _metadata / _common_metadata files. Those files include information about the schema of the full dataset and potentially all RowGroup metadata as well (for _metadata).

Using those files when creating a Parquet Dataset can make the factory more efficient in two ways: the stored schema can be used instead of inferring one by unifying the schemas of all files, and the paths to the individual Parquet files can be taken from the metadata instead of crawling the directory.

Basically, based on those files, the schema, the list of paths, and the partition expressions (the information that is needed to create a Dataset) could be constructed.
Such logic could be put in a separate factory class, e.g. ParquetManifestFactory (as suggested by @fsaintjacques).
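
To make the idea concrete, here is a minimal, standard-library-only sketch of how the list of paths and the partition information might be derived from the per-row-group file paths recorded in a _metadata file. It is not Arrow's API: `ManifestEntry`, `ParseHivePartition`, and `BuildManifest` are hypothetical names, and the sketch assumes that the paths are relative to the dataset root and use hive-style `key=value` directory names.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for what a manifest-based factory would need per file.
struct ManifestEntry {
  std::string path;                              // path relative to the dataset root
  std::map<std::string, std::string> partition;  // e.g. {"year": "2020"}
};

// Extract "key=value" pairs from the directory segments of a relative path.
// Assumes hive-style partitioning; the final segment (the file name) is ignored.
std::map<std::string, std::string> ParseHivePartition(const std::string& path) {
  std::map<std::string, std::string> result;
  std::size_t start = 0;
  while (true) {
    const std::size_t slash = path.find('/', start);
    if (slash == std::string::npos) break;  // remaining text is the file name
    const std::string segment = path.substr(start, slash - start);
    const std::size_t eq = segment.find('=');
    if (eq != std::string::npos && eq > 0) {
      result[segment.substr(0, eq)] = segment.substr(eq + 1);
    }
    start = slash + 1;
  }
  return result;
}

// A file with several row groups appears once per row group in the metadata,
// so collapse duplicates while keeping first-seen order.
std::vector<ManifestEntry> BuildManifest(
    const std::vector<std::string>& row_group_paths) {
  std::vector<ManifestEntry> entries;
  std::set<std::string> seen;
  for (const auto& path : row_group_paths) {
    if (seen.insert(path).second) {
      entries.push_back({path, ParseHivePartition(path)});
    }
  }
  return entries;
}
```

Calling `BuildManifest` on the row-group paths yields one entry per file, each carrying the key/value pairs that a real factory would turn into a partition expression, without listing the directory at all.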

Reporter: Joris Van den Bossche / @jorisvandenbossche
Assignee: Francois Saint-Jacques / @fsaintjacques

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-8062. Please see the migration documentation for further details.
