[pyarrow] fixed size binary arrays with nulls consume too much memory #36158

@convoi

Description

Describe the bug, including details regarding any error messages, version, and platform.

Fixed-size binary arrays allocate the full fixed width for every null entry, so an array with many nulls uses far more memory than its data requires.
The following snippet demonstrates the issue:

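import pyarrow as pa

size = 1_000_000  # width of each value in bytes (1 million); avoids shadowing the builtin len
nulls = 1000      # number of null entries
x = [b'0' * size] + [None] * nulls  # one value of 1 MB, followed by 1000 nulls
arr = pa.array(x, pa.binary(size))
print(f"allocated: {pa.total_allocated_bytes()}")  # expected about 1 MB, but prints about 1001 * 1 MB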

The same thing happens when I load Parquet data that has fixed-size binary columns with many nulls. In practice this makes those files unreadable, because memory consumption becomes excessive.
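To put numbers on the report, here is a rough back-of-the-envelope sketch of buffer sizes. It assumes a fixed-width layout reserves one slot of the full width for every entry, null or not, while a variable-length layout stores 32-bit offsets plus only the bytes that are actually present. The helper names `fixed_size_binary_bytes` and `variable_binary_bytes` are hypothetical and not part of pyarrow, and the estimate ignores buffer padding and alignment, so real allocations will differ slightly.

```python
from math import ceil
from typing import Optional, Sequence


def fixed_size_binary_bytes(values: Sequence[Optional[bytes]], width: int) -> int:
    """Estimate bytes for a fixed-width layout (hypothetical helper, no padding)."""
    n = len(values)
    validity = ceil(n / 8)  # one validity bit per slot
    data = n * width        # every slot, null or not, reserves `width` bytes
    return validity + data


def variable_binary_bytes(values: Sequence[Optional[bytes]]) -> int:
    """Estimate bytes for a variable-length layout (hypothetical helper, no padding)."""
    n = len(values)
    validity = ceil(n / 8)  # one validity bit per slot
    offsets = 4 * (n + 1)   # assumed 32-bit offsets, one more than the slot count
    data = sum(len(v) for v in values if v is not None)  # nulls contribute no data bytes
    return validity + offsets + data


width = 1_000_000
values = [b"0" * width] + [None] * 1000  # same shape as the example in this report

fixed_estimate = fixed_size_binary_bytes(values, width)
variable_estimate = variable_binary_bytes(values)

print(f"fixed-width estimate:     {fixed_estimate} bytes")
print(f"variable-length estimate: {variable_estimate} bytes")
```

Under these assumptions the fixed-width estimate comes out at roughly 1001 times the width, which matches the allocation reported above, whereas the variable-length estimate stays close to 1 MB.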

Component(s)

Python
