Open
Description
Describe the bug, including details regarding any error messages, version, and platform.
Here's the setup, scroll to bottom for results and better summary.
import time

import fsspec
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

abfs = fsspec.filesystem()  # insert your own protocol and settings here

# 20 identical float64 columns with 10_000_001 rows each
tab = pa.Table.from_arrays([
    pa.array(np.linspace(0, 10_000_000, 10_000_001))
    for _ in range(20)
], names=[f"x{i}" for i in range(20)])

pymultpath = ...  # some path on cloud (file with multiple row groups)
pyonepath = ...  # some path on cloud (file with a single row group)

# Write the same table twice: 512**2 = 262_144 rows per group vs. one group
pq.write_table(tab, pymultpath, filesystem=abfs, compression='zstd',
               row_group_size=512**2)
pq.write_table(tab, pyonepath, filesystem=abfs, compression='zstd',
               row_group_size=10_000_001)

# Full read of each file
t1 = time.time()
rtab = pq.read_table(pyonepath, filesystem=abfs)
print(f"whole pyonepath file took {round(time.time()-t1,1)} s")
t1 = time.time()
rtab = pq.read_table(pymultpath, filesystem=abfs)
print(f"whole pymultpath file took {round(time.time()-t1,1)} s")

# Single-column read of each file
t1 = time.time()
rtab = pq.read_table(pyonepath, filesystem=abfs, columns=['x0'])
print(f"single column pyonepath file took {round(time.time()-t1,1)} s")
t1 = time.time()
rtab = pq.read_table(pymultpath, filesystem=abfs, columns=['x0'])
print(f"single column pymultpath file took {round(time.time()-t1,1)} s")
Results:
whole pyonepath file took 45.1 s
whole pymultpath file took 85.3 s
single column pyonepath file took 4.6 s
single column pymultpath file took 46.1 s
This test makes a table with 20 columns and 10M rows. It saves the table twice: once with a single row group and once with 39 row groups. I read both files in their entirety as a baseline. The single row group file is 176 MiB, while the multiple row group file is 369 MiB. When I read a single column from the single row group file, it takes about 1/10th the time of reading the full file (4.6 s vs. 45.1 s). When I read a single column from the multiple row group file, it takes more than half the time of reading the full file (46.1 s vs. 85.3 s).
Component(s)
Parquet, Python