
[Python] Memory leak with dataset.head #36754

@neerajd12

Description

Describe the bug, including details regarding any error messages, version, and platform.

dataset.head() loads the entire dataset into memory and doesn't release it, when it should only load the top n rows.

This issue started after July 17, 2023.

Versions

PyArrow: 12.0.0
Python: 3.10.6
JupyterLab: 3.3.4
Docker: 4.12.0 (85629)
Windows 10, version 21H2, build 19044.3086

Sample data

https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2009-01.parquet (the same URL pattern applies for all the months)
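The URLs for the other months can be built from the pattern above (inferred from the single January URL; only the month component changes):

```python
# Build the URLs for every month of 2009 from the sample URL's pattern.
base = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2009-{:02d}.parquet"
urls = [base.format(month) for month in range(1, 13)]
print(len(urls))   # one URL per month
print(urls[0])
```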

Sample Code

1. Install memory_profiler

   pip3 install memory_profiler

2. Load the extension and check memory

   %load_ext memory_profiler
   %memit

   peak memory: 163.00 MiB, increment: 0.21 MiB

3. Create the dataset

   import pyarrow.dataset as ds
   data = ds.dataset('./testdata/nyc/year=2009', format='parquet', partitioning='hive')

4. Check memory

   %memit

   peak memory: 157.97 MiB, increment: 0.01 MiB

5. Count rows

   data.count_rows()

   170896055

6. Check memory

   %memit

   peak memory: 170.34 MiB, increment: 0.02 MiB

7. Get the first 10 rows

   data.head(10).to_pandas()

8. Check memory

   %memit

   peak memory: 11753.76 MiB, increment: 142.51 MiB
   peak memory: 9914.21 MiB, increment: 0.00 MiB
   peak memory: 9914.21 MiB, increment: 0.00 MiB
   peak memory: 9914.21 MiB, increment: 0.00 MiB

Component(s)

Parquet, Python
