[Python] Memory Leak while iterating batches of pyarrow dataset #49474

### Describe the bug, including details regarding any error messages, version, and platform.

I use pyarrow to load and filter batches of a large hive-partitioned parquet dataset on an HPC cluster.

Because of the memory limits the cluster imposes, my jobs kept getting OOM-killed. When I started investigating, I found that pyarrow kept accumulating RAM, *no matter the memory pool type*.

When I finally switched to the system memory pool, it looked like a memory leak: RSS keeps growing (blue line in the graphic below) even though pyarrow reports far less allocated memory (orange line in the graphic below).

This makes it very hard to use pyarrow in memory-constrained environments like an HPC cluster.

My questions are:

1. Did I make any obvious mistake in using pyarrow?
2. Can this be prevented so that HPC jobs run stably?

Memory usage over time (RSS in blue, memory reported by pyarrow in orange):

<img width="1800" height="600" alt="Image" src="https://github.com/user-attachments/assets/eb714ee1-ed76-45ae-9e99-e07ebb28a3c6" />

Code to reproduce:

[minimal_working_example.py](https://github.com/user-attachments/files/25839547/minimal_working_example.py)

(DISCLAIMER: I wrote this code to be similar to my actual use case. The issue probably still persists when stripping away things like the filters and the other columns, but it is easy to reproduce with the code provided.)
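
For readers who cannot open the attachment, here is a minimal sketch of this kind of measurement. It is not the attached script. The function names `rss_bytes`, `sample_memory`, and `scan_dataset`, as well as the `column` and `threshold` parameters, are hypothetical, and RSS is read from `/proc/self/statm`, which assumes Linux.

```python
import os


def rss_bytes():
    """Current resident set size of this process in bytes (Linux only)."""
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])  # second field: resident pages
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


def sample_memory(batches, allocated_bytes, every=100):
    """Consume `batches`, recording (index, RSS, allocator bytes) every `every` items.

    `allocated_bytes` is any zero-argument callable that returns the number
    of bytes the allocator currently reports as in use.
    """
    samples = []
    for i, batch in enumerate(batches):
        del batch  # drop our reference to the batch before sampling
        if i % every == 0:
            samples.append((i, rss_bytes(), allocated_bytes()))
    return samples


def scan_dataset(path, column, threshold):
    """Hypothetical driver: filter a hive-partitioned parquet dataset batch by batch."""
    # Imported here so the two helpers above stay usable without pyarrow.
    import pyarrow as pa
    import pyarrow.dataset as ds

    pool = pa.system_memory_pool()
    pa.set_memory_pool(pool)  # make the system allocator the default pool

    dataset = ds.dataset(path, format="parquet", partitioning="hive")
    batches = dataset.to_batches(filter=ds.field(column) > threshold)
    return sample_memory(batches, pool.bytes_allocated)
```

Plotting the second and third element of each sample against the first gives two curves like the ones in the diagram: process RSS versus the bytes the memory pool reports.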

PyArrow and system versions used:

```
PyArrow : 23.0.1
Python  : 3.13.5
OS      : Linux-6.12.73-1-MANJARO-x86_64-with-glibc2.43
BuildInfo(build_type='release', cpp_build_info=CppBuildInfo(version='23.0.1', version_info=VersionInfo(major=23, minor=0, patch=1), so_version='2300', full_so_version='2300.1.0', compiler_id='GNU', compiler_version='14.2.1', compiler_flags=' -Wno-noexcept-type -Wno-self-move -Wno-subobject-linkage  -fdiagnostics-color=always  -Wall -fno-semantic-interposition -msse4.2 ', git_id='', git_description='', package_kind='python-wheel-manylinux228', build_type='release'))
```


### Component(s)

Python, Parquet
