Description
Describe the bug, including details regarding any error messages, version, and platform.
I use pyarrow to load and filter batches of a large hive-partitioned Parquet dataset on an HPC cluster.
Because of the memory limits the cluster imposes, my jobs kept getting OOM-killed. When I started investigating, I found that pyarrow kept accumulating RAM regardless of the memory pool type.
When I finally switched to the system memory pool, it looked like a memory leak: RSS memory keeps growing (blue line in the graphic) even though pyarrow reports far less allocated memory (orange line in the graphic):
This makes it very hard to use pyarrow in memory-constrained environments like an HPC cluster.
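For readers who want to check the same two quantities, here is a minimal sketch of how the comparison could be sampled. It is not the code used for the graphic above. The helper names `rss_bytes`, `memory_gap`, and `demo` are made up for illustration, the RSS reading assumes Linux (`/proc/self/statm`), and the table built in `demo` is only a placeholder workload, not the Parquet dataset from this report.

```python
import os


def rss_bytes() -> int:
    """Resident set size of this process in bytes (Linux only, read from /proc)."""
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


def memory_gap(pool) -> dict:
    """Compare OS-level RSS with what an Arrow memory pool reports.

    `pool` is any object with a bytes_allocated() method,
    for example pyarrow.system_memory_pool().
    """
    rss = rss_bytes()
    allocated = pool.bytes_allocated()
    return {"rss": rss, "arrow_allocated": allocated, "gap": rss - allocated}


def demo() -> None:
    # Imported here so the helpers above stay usable without pyarrow installed.
    import pyarrow as pa

    pool = pa.system_memory_pool()
    pa.set_memory_pool(pool)  # route subsequent Arrow allocations through malloc

    before = memory_gap(pool)
    table = pa.table({"x": list(range(1_000_000))})  # placeholder workload
    after = memory_gap(pool)
    del table
    released = memory_gap(pool)

    for label, sample in (("before", before), ("after", after), ("released", released)):
        print(label, sample)
```

Calling `demo()` prints three samples. A `gap` that keeps growing across many iterations of a real workload while `arrow_allocated` stays flat would match the behaviour described above, although RSS also includes the Python interpreter and allocator overhead, so a nonzero gap on its own is expected.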
My questions are:
- Did I make any obvious mistake in using pyarrow?
- Can this be prevented so that HPC jobs run stably?
Code used to reproduce the issue (and generate the diagram):
(Disclaimer: I wrote this code to resemble my actual use case. The issue probably still persists after stripping away things like the filters and the other columns, but it is easy to reproduce with the code provided.)
System:
Pyarrow and system version used:
PyArrow : 23.0.1
Python : 3.13.5
OS : Linux-6.12.73-1-MANJARO-x86_64-with-glibc2.43
BuildInfo(build_type='release', cpp_build_info=CppBuildInfo(version='23.0.1', version_info=VersionInfo(major=23, minor=0, patch=1), so_version='2300', full_so_version='2300.1.0', compiler_id='GNU', compiler_version='14.2.1', compiler_flags=' -Wno-noexcept-type -Wno-self-move -Wno-subobject-linkage -fdiagnostics-color=always -Wall -fno-semantic-interposition -msse4.2 ', git_id='', git_description='', package_kind='python-wheel-manylinux228', build_type='release'))
Component(s)
Python, Parquet