Partitioned parquet datasets sometimes come with _metadata / _common_metadata files. Those files include information about the schema of the full dataset and potentially all RowGroup metadata as well (for _metadata).
Using those files during the creation of a parquet Dataset can give a more efficient factory (using the stored schema instead of inferring the schema from unioning the schemas of all files + using the paths to individual parquet files instead of crawling the directory).
Basically, based on those files, the schema, the list of paths and the partition expressions (the information that is needed to create a Dataset) could be constructed.
Such logic could be put in a different factory class, e.g. ParquetManifestFactory (as suggested by @fsaintjacques).
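As a rough illustration of the idea (not the Arrow API), the sketch below derives file paths and partition key/value pairs from a list of relative paths, without listing any directory. It assumes hive-style `key=value` directory names and that the manifest records paths relative to the dataset root; `partition_keys`, `fragments_from_manifest` and the example paths are hypothetical names made up for this sketch.

```python
from pathlib import PurePosixPath


def partition_keys(relative_path):
    """Parse hive-style `key=value` directory segments from a file path."""
    keys = {}
    # Every component except the last one (the file name) is a directory.
    for segment in PurePosixPath(relative_path).parts[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key:
            keys[key] = value
    return keys


def fragments_from_manifest(base_dir, relative_paths):
    """Pair each full file path with its partition keys, without crawling."""
    fragments = []
    for rel in relative_paths:
        full_path = str(PurePosixPath(base_dir) / rel)
        fragments.append((full_path, partition_keys(rel)))
    return fragments


# Hypothetical paths, standing in for what a `_metadata` file might record.
manifest_paths = [
    "year=2019/month=12/part-0.parquet",
    "year=2020/month=1/part-0.parquet",
]
fragments = fragments_from_manifest("/data/events", manifest_paths)
for path, keys in fragments:
    print(path, keys)
```

The partition values are kept as strings here; a real factory would presumably cast them using the stored schema and turn them into partition expressions.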
Reporter: Joris Van den Bossche / @jorisvandenbossche
Assignee: Francois Saint-Jacques / @fsaintjacques
Related issues:
_common_metadata for schema if _metadata isn't available (relates to)
PRs and other links:
Note: This issue was originally created as ARROW-8062. Please see the migration documentation for further details.