Cannot perform gc on pyarrow object #36019

@ysq151944

Bug

I cannot free the memory held by a table: after `del table` and `gc.collect()`, the process's resident memory barely drops. This is a big problem because I continuously load and process large datasets.


Platform and Versions:

Platform: Linux-5.15.0-57-generic-x86_64-with-glibc2.35
Python: 3.11.3 (main, May 15 2023, 15:45:52) [GCC 11.2.0]
pyarrow: 12.0.0


Data:

https://drive.google.com/file/d/1kigHHHcyx2hWi4xnctG0om3G_X2azcS6/view?usp=drive_link

Code

import gc
import os

import psutil
import pyarrow.feather as feather


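def show_memory_info(msg):
    # System-wide memory statistics (not specific to this process).
    info = psutil.virtual_memory()

    # Resident set size (RSS) of this process, in MB.
    print(f'\n{msg} -- current(MB): {psutil.Process(os.getpid()).memory_info().rss / 1024**2:.3f}')
    # Total physical memory of the machine, in MB.
    print(f'{msg} -- total(MB): {info.total / 1024**2:.3f}')
    # NOTE: info.percent is system-wide memory usage in percent, not MB.
    print(f'{msg} -- account(MB): {info.percent:.3f}')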


def main():
    show_memory_info('before loading:')
    # NOTE: read_feather returns a pandas.DataFrame; feather.read_table returns a pyarrow.Table.
    table = feather.read_feather('data.feather')

    show_memory_info('after loading:')

    del table
    gc.collect()

    show_memory_info('after del and collect:')


if __name__ == '__main__':
    main()
    show_memory_info('final:')

Print

before loading: -- current(MB): 67.430
before loading: -- total(MB): 64052.953
before loading: -- account(MB): 2.500

after loading: -- current(MB): 918.852
after loading: -- total(MB): 64052.953
after loading: -- account(MB): 3.800

after del and collect: -- current(MB): 861.238
after del and collect: -- total(MB): 64052.953
after del and collect: -- account(MB): 3.700

final: -- current(MB): 861.238
final: -- total(MB): 64052.953
final: -- account(MB): 3.700

Process finished with exit code 0

Component(s)

Python
