Fix parallel metadata collection in pyarrow engine #9165

Merged
jrbourbeau merged 1 commit into dask:main from rjzamora:fix-partitioning-bug
Jun 6, 2022

@rjzamora rjzamora commented Jun 6, 2022

Fixes a bug in parallel metadata collection in the "pyarrow" read_parquet engine for hive-partitioned data.

pa_ds.dataset(
    files_or_frags,
    filesystem=fs,
    **dataset_options,

Review comment from the PR author (Member) on the lines above:

This is the critical change. Without these dataset options, the new fragment may be missing hive/directory-partitioning information.
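
To make the comment above concrete, here is a minimal pure-Python sketch of what "hive/directory-partitioning information" means: column values stored in `key=value` directory names rather than in the file itself. This is not pyarrow's or Dask's implementation. The helpers `parse_hive_partitions` and `describe_fragment`, the `partitioning` argument, and the example path are all hypothetical and exist only for illustration.

```python
from pathlib import PurePosixPath


def parse_hive_partitions(path):
    """Return {key: value} parsed from `key=value` directory names in `path`."""
    partitions = {}
    for part in PurePosixPath(path).parent.parts:
        key, sep, value = part.partition("=")
        if sep and key:  # skip directories that are not `key=value`
            partitions[key] = value
    return partitions


def describe_fragment(path, partitioning=None):
    """Hypothetical stand-in for rebuilding a fragment from a single path.

    Partition keys are recovered only when the caller forwards the
    partitioning option, mirroring the role of `**dataset_options`.
    """
    keys = parse_hive_partitions(path) if partitioning == "hive" else {}
    return {"path": path, "partition_keys": keys}


# Hypothetical hive-partitioned file path.
path = "data/year=2022/month=6/part.0.parquet"

without_options = describe_fragment(path)                     # options dropped
with_options = describe_fragment(path, partitioning="hive")   # options forwarded

print(without_options["partition_keys"])
print(with_options["partition_keys"])
```

In this sketch, the call that drops the option yields no partition keys, while the call that forwards it recovers `year` and `month` from the directory names. That is the kind of information the comment says can go missing when a fragment is rebuilt without the dataset options.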

@ian-r-rose ian-r-rose left a comment

LGTM, thanks @rjzamora

@jrbourbeau jrbourbeau left a comment

Thanks @rjzamora for the fix and @ian-r-rose for reviewing

@jrbourbeau jrbourbeau merged commit 8c9076a into dask:main Jun 6, 2022
@rjzamora rjzamora deleted the fix-partitioning-bug branch June 6, 2022 20:54