[Python][Parquet] read_table much slower with multiple row group files when column pruning than single row group #37666

@deanm0000

Description

Describe the bug, including details regarding any error messages, version, and platform.

Here's the setup; scroll to the bottom for the results and a summary.

import time

import fsspec
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

abfs = fsspec.filesystem()  # insert your own protocol and settings here
tab = pa.Table.from_arrays([
    pa.array(np.linspace(0, 10_000_000, 10_000_001))
    for _ in range(20)
], names=[f"x{i}" for i in range(20)])
pymultpath = ...  # placeholder: some path on cloud
pyonepath = ...  # placeholder: some path on cloud
pq.write_table(tab, pymultpath, filesystem=abfs, compression='zstd',
               row_group_size=512**2)
pq.write_table(tab, pyonepath, filesystem=abfs, compression='zstd',
               row_group_size=10_000_001)

t1 = time.time()
rtab = pq.read_table(pyonepath, filesystem=abfs)
print(f"whole pyonepath file took {round(time.time()-t1,1)} s")

t1 = time.time()
rtab = pq.read_table(pymultpath, filesystem=abfs)
print(f"whole pymultpath file took {round(time.time()-t1,1)} s")

t1 = time.time()
rtab = pq.read_table(pyonepath, filesystem=abfs, columns=['x0'])
print(f"single column pyonepath file took {round(time.time()-t1,1)} s")

t1 = time.time()
rtab = pq.read_table(pymultpath, filesystem=abfs, columns=['x0'])
print(f"single column pymultpath file took {round(time.time()-t1,1)} s")

Results:

whole pyonepath file took 45.1 s
whole pymultpath file took 85.3 s
single column pyonepath file took 4.6 s
single column pymultpath file took 46.1 s
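
As a quick sanity check on the numbers above, here is a small sketch that derives the row group count implied by `row_group_size=512**2` and expresses the single-column timings as fractions of the full-file timings. The timings are copied from the output above. The variable names are only illustrative, and the row group count assumes the writer fills each row group to `row_group_size` before starting the next.

```python
import math

# Row count and row group size taken from the write_table calls above.
n_rows = 10_000_001
row_group_size = 512 ** 2  # 262,144 rows per group

# Assumption: row groups are filled to row_group_size, so the last one is partial.
n_row_groups = math.ceil(n_rows / row_group_size)

# Timings in seconds, copied from the reported output.
whole_one, whole_mult = 45.1, 85.3
single_one, single_mult = 4.6, 46.1

# Fraction of the full-file read time spent reading only column x0.
ratio_one = single_one / whole_one
ratio_mult = single_mult / whole_mult

print(f"row groups in the multi row group file: {n_row_groups}")
print(f"single column / whole file, one row group: {ratio_one:.2f}")
print(f"single column / whole file, many row groups: {ratio_mult:.2f}")
```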

This test creates a table with 20 columns and about 10M rows (10,000,001, to be exact) and saves it twice: once as a single row group and once as 39 row groups. I read both files in their entirety as a baseline. The single row group file is 176 MiB, while the multiple row group file is 369 MiB. When I read a single column from the single row group file, it takes about 1/10th the time of reading the full file. When I read a single column from the multiple row group file, it takes more than half the time of reading the full file.

Component(s)

Parquet, Python
