Open
Description
Describe the bug, including details regarding any error messages, version, and platform.
dataset.head loads all the data into memory and doesn't release it, when it should load only the top n rows.
This issue started after July 17, 2023.
Versions
PyArrow: 12.0.0
Python: 3.10.6
JupyterLab: 3.3.4, running on
Docker: 4.12.0 (85629), running on
Windows 10, version: 21H2, build: 19044.3086
Sample data
https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2009-01.parquet (and the corresponding files for all the other months)
Sample Code
- Install memory_profiler
pip3 install memory_profiler
- Load the extension and check mem
%load_ext memory_profiler
%memit
peak memory: 163.00 MiB, increment: 0.21 MiB
- Create Dataset
import pyarrow.dataset as ds
data = ds.dataset('./testdata/nyc/year=2009', format='parquet', partitioning='hive')
- Check mem
%memit
peak memory: 157.97 MiB, increment: 0.01 MiB
- Count rows
data.count_rows()
170896055
- Check mem
%memit
peak memory: 170.34 MiB, increment: 0.02 MiB
- Get first 10 rows
data.head(10).to_pandas()
- Check Mem
%memit
peak memory: 11753.76 MiB, increment: 142.51 MiB
peak memory: 9914.21 MiB, increment: 0.00 MiB
peak memory: 9914.21 MiB, increment: 0.00 MiB
peak memory: 9914.21 MiB, increment: 0.00 MiB
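To make the expected behaviour concrete, here is a minimal sketch of what "just load the top n rows" would look like over a lazy stream of batches: stop pulling batches as soon as n rows have been collected. This is an illustration only, not how dataset.head is implemented. The helper name take_rows and its num_rows/truncate parameters are hypothetical, and plain Python lists stand in for record batches so the example runs without any data files.

```python
from typing import Callable, Iterable, List, TypeVar

B = TypeVar("B")


def take_rows(
    batches: Iterable[B],
    n: int,
    num_rows: Callable[[B], int] = len,
    truncate: Callable[[B, int], B] = lambda b, k: b[:k],
) -> List[B]:
    """Collect batches until n rows are gathered, then stop consuming.

    Hypothetical helper for illustration: `num_rows` reports the size of a
    batch and `truncate` keeps only its first k rows.
    """
    out: List[B] = []
    remaining = n
    if remaining <= 0:
        return out
    for batch in batches:
        size = num_rows(batch)
        if size == 0:
            continue
        if size >= remaining:
            # The last batch needed: keep only the rows still missing and stop.
            out.append(truncate(batch, remaining))
            break
        out.append(batch)
        remaining -= size
    return out


# Stand-in for a lazy scan: each "batch" is a list of 4 row ids, and
# `produced` records which batches were actually materialised.
produced: List[int] = []


def fake_batches():
    for i in range(1000):
        produced.append(i)
        yield list(range(i * 4, i * 4 + 4))


rows = take_rows(fake_batches(), 10)
print(sum(len(b) for b in rows), "rows from", len(produced), "batches")
```

With a real dataset the same helper could presumably be fed data.to_batches() with truncate=lambda b, k: b.slice(0, k), but that is untested against the data in this report, and whether it avoids the memory growth shown above is an open question.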
Component(s)
Parquet, Python