[Python] Memory Leak while iterating batches of pyarrow dataset #49474

@yum-yab

Describe the bug, including details regarding any error messages, version, and platform.

I use pyarrow to load and filter batches of a large hive-partitioned Parquet dataset on an HPC cluster.

Because of the memory limits the cluster imposes, my jobs kept getting OOM-killed. When I started investigating, I found that pyarrow kept accumulating RAM regardless of the memory pool type.

When I finally switched to the system memory pool, it looked like a memory leak: RSS memory keeps growing (blue line in the graphic) even though pyarrow reports comparatively little allocated RAM (orange line in the graphic):

This makes it very hard to use pyarrow in memory-constrained environments like an HPC cluster.

My questions are:

  1. Did I make any obvious mistake in using pyarrow?
  2. Can this be prevented so that HPC jobs run stably?

[Image: plot of RSS memory (blue line) and memory allocated as reported by pyarrow (orange line)]

Code to reproduce the issue (and generate the diagram):

minimal_working_example.py

(DISCLAIMER: I wrote this code to be similar to my actual use case. The issue probably still persists when stripping away things like the filters and the other columns, but it is easy to reproduce with the code provided.)
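
For readers without the attachment, the comparison described above (process RSS versus pyarrow's own accounting while iterating batches) could be sketched roughly as follows. This is not the attached script. The helper names `rss_bytes`, `track_memory` and `track_dataset`, the `every` sampling interval and the `path` argument are all hypothetical, and RSS is read from `/proc/self/statm`, so the sketch assumes Linux.

```python
import os


def rss_bytes() -> int:
    """Return the current resident set size of this process in bytes (Linux only)."""
    # /proc/self/statm: the second field is the number of resident pages.
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


def track_memory(batches, allocated_bytes, every=100):
    """Consume `batches`, sampling (index, RSS, allocator-reported bytes).

    `allocated_bytes` is any zero-argument callable,
    for example pyarrow's `pa.total_allocated_bytes`.
    """
    samples = []
    for i, batch in enumerate(batches):
        del batch  # drop the reference right away so the batch can be freed
        if i % every == 0:
            samples.append((i, rss_bytes(), allocated_bytes()))
    return samples


def track_dataset(path, columns=None, filter_expr=None, every=100):
    """Hypothetical driver; `path` is a placeholder for a hive-partitioned Parquet dataset."""
    # Imported lazily so the helpers above work without pyarrow installed.
    import pyarrow as pa
    import pyarrow.dataset as ds

    # Use the system allocator, as in the measurement described in this report.
    pa.set_memory_pool(pa.system_memory_pool())
    dataset = ds.dataset(path, format="parquet", partitioning="hive")
    batches = dataset.to_batches(columns=columns, filter=filter_expr)
    return track_memory(batches, pa.total_allocated_bytes, every=every)
```

If the second value in the samples keeps climbing while the third stays roughly flat, that matches the behaviour shown in the graphic.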

System:

Pyarrow and system version used:

PyArrow : 23.0.1
Python  : 3.13.5
OS      : Linux-6.12.73-1-MANJARO-x86_64-with-glibc2.43
BuildInfo(build_type='release', cpp_build_info=CppBuildInfo(version='23.0.1', version_info=VersionInfo(major=23, minor=0, patch=1), so_version='2300', full_so_version='2300.1.0', compiler_id='GNU', compiler_version='14.2.1', compiler_flags=' -Wno-noexcept-type -Wno-self-move -Wno-subobject-linkage  -fdiagnostics-color=always  -Wall -fno-semantic-interposition -msse4.2 ', git_id='', git_description='', package_kind='python-wheel-manylinux228', build_type='release'))

Component(s)

Python, Parquet
