[Python][Parquet] read_table much slower with multiple row group files when column pruning than single row group

### Describe the bug, including details regarding any error messages, version, and platform.

Here's the setup; scroll to the bottom for the results and a summary.


```
import time

import fsspec
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

abfs = fsspec.filesystem()  # insert your own settings here
tab = pa.Table.from_arrays([
    pa.array(np.linspace(0, 10_000_000, 10_000_001))
    for _ in range(20)
], names=[f"x{i}" for i in range(20)])
pymultpath = ...  # some path on cloud
pyonepath = ...  # some path on cloud
pq.write_table(tab, pymultpath, filesystem=abfs, compression='zstd',
               row_group_size=512**2)
pq.write_table(tab, pyonepath, filesystem=abfs, compression='zstd',
               row_group_size=10_000_001)

t1 = time.time()
rtab= pq.read_table(pyonepath, filesystem=abfs)
print(f"whole pyonepath file took {round(time.time()-t1,1)} s")

t1 = time.time()
rtab= pq.read_table(pymultpath, filesystem=abfs)
print(f"whole pymultpath file took {round(time.time()-t1,1)} s")


t1 = time.time()
rtab= pq.read_table(pyonepath, filesystem=abfs, columns=['x0'])
print(f"single column pyonepath file took {round(time.time()-t1,1)} s")

t1 = time.time()
rtab= pq.read_table(pymultpath, filesystem=abfs, columns=['x0'])
print(f"single column pymultpath file took {round(time.time()-t1,1)} s")
```

Results:

```
whole pyonepath file took 45.1 s
whole pymultpath file took 85.3 s
single column pyonepath file took 4.6 s
single column pymultpath file took 46.1 s
```
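For reference, the ratios implied by these timings can be worked out directly. This is only arithmetic on the numbers printed above; the row group count assumes the writer splits the 10,000,001 rows into chunks of at most `row_group_size=512**2` rows.

```python
import math

# Timings copied from the output above, in seconds.
timings = {
    ("one", "all"): 45.1,
    ("mult", "all"): 85.3,
    ("one", "x0"): 4.6,
    ("mult", "x0"): 46.1,
}

# Fraction of the full-read time spent when reading only column x0.
one_ratio = timings[("one", "x0")] / timings[("one", "all")]
mult_ratio = timings[("mult", "x0")] / timings[("mult", "all")]

# Assumption: one row group per chunk of at most 512**2 rows.
n_row_groups = math.ceil(10_000_001 / 512**2)

print(f"single row group: {one_ratio:.2f} of full read")
print(f"multiple row groups: {mult_ratio:.2f} of full read")
print(f"row groups in pymultpath: {n_row_groups}")
```

With 20 equally sized columns, a perfectly pruned read would be expected to cost roughly 1/20 of the full read, so the single row group case is in the right range while the multiple row group case is not.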


This test builds a table with 20 columns and 10M rows and saves it twice: once as a single row group and once as 39 row groups. I read both files in their entirety as a baseline. The single row group file is 176 MiB, while the multiple row group file is 369 MiB. Reading a single column from the single row group file takes about 1/10th of the time of reading the full file (4.6 s vs. 45.1 s). Reading a single column from the multiple row group file takes more than half the time of reading the full file (46.1 s vs. 85.3 s).


### Component(s)

Parquet, Python

[Python][Parquet] read_table much slower with multiple row group files when column pruning than single row group #37666
