Commit

ARROW-3861: [Python] ParquetDataset.read() respect specified columns and not include partition columns

This adds a test for it (using both the legacy and the new dataset implementation), and also a small fix for the legacy path.

Closes #7050 from jorisvandenbossche/ARROW-3861-parquet-dataset-read-columns

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: François Saint-Jacques <fsaintjacques@gmail.com>
jorisvandenbossche authored and fsaintjacques committed Apr 29, 2020
1 parent 1ffb5f6 commit 283e188
Showing 2 changed files with 17 additions and 0 deletions.
2 changes: 2 additions & 0 deletions python/pyarrow/parquet.py
@@ -722,6 +722,8 @@ def read(self, columns=None, use_threads=True, partitions=None,
            # value as indicated. The distinct categories of the partition have
            # been computed in the ParquetManifest
            for i, (name, index) in enumerate(self.partition_keys):
                if columns is not None and name not in columns:
                    continue
                # The partition code is the same for all values in this piece
                indices = np.full(len(table), index, dtype='i4')
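
The filter added in this hunk can be sketched in isolation. This is a minimal, dependency-free illustration and not the pyarrow implementation: `select_partition_keys` is a hypothetical helper name, and the partition names and indices used below are made-up sample values. It only mirrors the `columns is not None and name not in columns` check from the diff.

```python
# Hypothetical sketch of the column filter from the legacy read path.
# `select_partition_keys` is an illustrative name, not a pyarrow function.

def select_partition_keys(partition_keys, columns=None):
    """Return the (name, index) partition keys that would be attached.

    With columns=None every partition key is kept. Otherwise only the
    keys whose name was explicitly requested are kept.
    """
    selected = []
    for name, index in partition_keys:
        if columns is not None and name not in columns:
            continue  # partition column was not requested, so skip it
        selected.append((name, index))
    return selected


# Made-up sample data: two partition levels with their category codes.
partition_keys = [("foo", 0), ("bar", 2)]

print(select_partition_keys(partition_keys))                     # no selection
print(select_partition_keys(partition_keys, ["values"]))         # data column only
print(select_partition_keys(partition_keys, ["values", "bar"]))  # one partition column
```

Under these assumptions, passing `columns=None` keeps every partition key, while a selection that names only data columns keeps none of them.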

15 changes: 15 additions & 0 deletions python/pyarrow/tests/test_parquet.py
@@ -1698,6 +1698,21 @@ def test_create_parquet_dataset_multi_threaded(tempdir):
    assert len(partitions.levels) == len(manifest.partitions.levels)


@pytest.mark.pandas
@parametrize_legacy_dataset
def test_read_partitioned_columns_selection(tempdir, use_legacy_dataset):
    # ARROW-3861 - do not include partition columns in resulting table when
    # `columns` keyword was passed without those columns
    fs = LocalFileSystem.get_instance()
    base_path = tempdir
    _partition_test_for_filesystem(fs, base_path)

    dataset = pq.ParquetDataset(
        base_path, use_legacy_dataset=use_legacy_dataset)
    result = dataset.read(columns=["values"])
    assert result.column_names == ["values"]


@pytest.mark.pandas
@parametrize_legacy_dataset
def test_equivalency(tempdir, use_legacy_dataset):