ARROW-8644: [Python] Restore ParquetDataset behaviour to always include partition column for dask compatibility

Given that the original change (https://issues.apache.org/jira/browse/ARROW-3861 / #7050) breaks dask's reading of partitioned datasets (dask doesn't add the partition column to its list of columns to read, but expects it to still be read automatically), it doesn't seem worth it to me to fix this in the "old" ParquetDataset implementation.

But we can keep the "correct" behaviour in the Datasets API-based implementation going forward.
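To illustrate the difference, a minimal sketch (the dataset path, column names, and partition values here are made up for the example):

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical dataset partitioned on a single column "foo".
table = pa.table({"foo": ["a", "a", "b"], "values": [1, 2, 3]})
pq.write_to_dataset(table, "dataset_root", partition_cols=["foo"])

# Legacy implementation: the partition column is always appended, even
# when only a subset of columns is requested (dask relies on this).
legacy = pq.ParquetDataset("dataset_root", use_legacy_dataset=True)
assert legacy.read(columns=["values"]).column_names == ["values", "foo"]

# Datasets API-based implementation: only the requested columns are
# returned; partition keys must be listed explicitly.
new_ds = pq.ParquetDataset("dataset_root", use_legacy_dataset=False)
assert new_ds.read(columns=["values"]).column_names == ["values"]
assert new_ds.read(columns=["values", "foo"]).column_names == ["values", "foo"]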

Closes #7096 from jorisvandenbossche/ARROW-8644-dask-partitioned

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Wes McKinney <wesm+git@apache.org>
jorisvandenbossche authored and wesm committed May 5, 2020
1 parent bc283e3 commit fb4d57a
Showing 3 changed files with 16 additions and 4 deletions.
6 changes: 6 additions & 0 deletions docs/source/python/parquet.rst
@@ -396,6 +396,12 @@ option was enabled on write).
   the partition keys.
 - General performance improvement and bug fixes.
 
+It also has the following changes in behaviour:
+
+- The partition keys need to be explicitly included in the ``columns``
+  keyword when you want to include them in the result while reading a
+  subset of the columns.
+
 In the future, this will be turned on by default. The new implementation
 does not yet cover all existing ParquetDataset features (e.g. specifying
 the ``metadata``, or the ``pieces`` property API). Feedback is very welcome.
6 changes: 3 additions & 3 deletions python/pyarrow/parquet.py
@@ -722,8 +722,6 @@ def read(self, columns=None, use_threads=True, partitions=None,
             # value as indicated. The distinct categories of the partition have
             # been computed in the ParquetManifest
             for i, (name, index) in enumerate(self.partition_keys):
-                if columns is not None and name not in columns:
-                    continue
                 # The partition code is the same for all values in this piece
                 indices = np.full(len(table), index, dtype='i4')
 
@@ -1418,7 +1416,9 @@ def read(self, columns=None, use_threads=True, use_pandas_metadata=False):
         Parameters
         ----------
         columns : List[str]
-            Names of columns to read from the dataset.
+            Names of columns to read from the dataset. The partition fields
+            are not automatically included (in contrast to when setting
+            ``use_legacy_dataset=True``).
         use_threads : bool, default True
             Perform multi-threaded column reads.
         use_pandas_metadata : bool, default False
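For context on the first hunk above: the legacy reader attaches each partition key as a dictionary-encoded column whose codes are constant within a piece. A standalone sketch of that mechanism, with a made-up piece table and categories (not the exact pyarrow internals):

import numpy as np
import pyarrow as pa

# One piece of a partitioned dataset, read without its partition column.
table = pa.table({"values": [1, 2, 3]})

# Suppose the manifest computed the distinct categories ['a', 'b'] for
# partition key "foo", and this piece lives under foo=a (code 0).
dictionary = pa.array(["a", "b"])
index = 0

# The partition code is the same for all rows in this piece.
indices = np.full(len(table), index, dtype="i4")
arr = pa.DictionaryArray.from_arrays(pa.array(indices), dictionary)
table = table.append_column("foo", arr)
assert table.column_names == ["values", "foo"]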
8 changes: 7 additions & 1 deletion python/pyarrow/tests/test_parquet.py
@@ -1705,7 +1705,13 @@ def test_read_partitioned_columns_selection(tempdir, use_legacy_dataset):
     dataset = pq.ParquetDataset(
         base_path, use_legacy_dataset=use_legacy_dataset)
     result = dataset.read(columns=["values"])
-    assert result.column_names == ["values"]
+    if use_legacy_dataset:
+        # ParquetDataset implementation always includes the partition columns
+        # automatically, and we can't easily "fix" this since dask relies on
+        # this behaviour (ARROW-8644)
+        assert result.column_names == ["values", "foo", "bar"]
+    else:
+        assert result.column_names == ["values"]
 
 
 @pytest.mark.pandas
