ARROW-3861: [Python] ParquetDataset.read() respect specified columns and not include partition columns #7050

jorisvandenbossche · 2020-04-28T07:37:32Z

This adds a test for it (covering both the legacy and the new dataset implementations), plus a small fix for the legacy path.

github-actions · 2020-04-28T07:46:46Z

https://issues.apache.org/jira/browse/ARROW-3861

jorisvandenbossche · 2020-04-30T08:58:04Z

Hmm, I should have thought about triggering the dask integration tests for this ... -> https://issues.apache.org/jira/browse/ARROW-8644

…de partition column for dask compatibility

Given that the original change (https://issues.apache.org/jira/browse/ARROW-3861 / #7050) breaks dask's reading of partitioned datasets (it doesn't add the partition column to the list of columns to read, but expects it will still be read automatically), it doesn't seem worth it to me to fix this in the "old" ParquetDataset implementation. But we can keep the "correct" behaviour in the Datasets API-based implementation going forward.

Closes #7096 from jorisvandenbossche/ARROW-8644-dask-partitioned

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Wes McKinney <wesm+git@apache.org>
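For a caller that relied on the partition column being read automatically, one way to stay compatible with the "correct" behaviour is to add the partition column names to the requested columns before reading. A minimal sketch of that idea follows; the helper name and its arguments are hypothetical and are not part of pyarrow or dask.

```python
from typing import List, Optional, Sequence


def columns_including_partitions(
    columns: Optional[Sequence[str]],
    partition_names: Sequence[str],
) -> Optional[List[str]]:
    """Hypothetical helper: extend a column selection with partition columns.

    `columns=None` is treated as "read everything" and passed through unchanged.
    """
    if columns is None:
        return None
    result = list(columns)
    for name in partition_names:
        # Append each partition column once, keeping the requested order first.
        if name not in result:
            result.append(name)
    return result


# Example: the caller asked only for "a" but also needs the partition key.
selection = columns_including_partitions(["a"], ["year"])
```

The resulting list could then be passed as the `columns` argument of the read call, so the result no longer depends on partition columns being added implicitly.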

ARROW-3861: [Python] ParquetDataset.read() respect specified columns and not include partition columns

22b27fe

fsaintjacques approved these changes Apr 29, 2020

fsaintjacques closed this in 283e188 Apr 29, 2020

jorisvandenbossche deleted the ARROW-3861-parquet-dataset-read-columns branch April 29, 2020 06:21

jorisvandenbossche mentioned this pull request May 4, 2020

ARROW-8644: [Python] Restore ParquetDataset behaviour to always include partition column for dask compatibility #7096

Closed

This was referenced Apr 30, 2020

[Python] ParquetDataset().read columns argument always returns partition column #20409

Closed

[Python] Dask integration tests failing due to change in not including partition columns #24805

Closed
