BaseFile.splitOffsets() allocates a new List on every call #15622

@RussellSpitzer

Description

Feature Request / Improvement

The reason we care about this: when we have Parquet manifests, we cannot reuse the immutable list returned by the "get" method of BaseFile. That means we leak an object for every manifest. Not a huge deal, but we should probably do something there.

--

BaseFile stores split offsets internally as a long[], but splitOffsets() wraps that array in a new List<Long> via ArrayUtil.toUnmodifiableLongList on every invocation. When file metadata is read and rewritten (e.g., during manifest rewriting or format conversion), each entry needlessly allocates a list that is immediately consumed and discarded.

Other fields like partitionData are stored and returned as-is. Split offsets could similarly cache or reuse the List<Long> representation, or callers within the core module could use the existing package-private splitOffsetArray() to pass the raw long[] through without conversion.
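
To make the caching option concrete, here is a minimal sketch. It does not use the actual Iceberg classes: `CachedOffsetsFile` and `setSplitOffsets` are hypothetical stand-ins for BaseFile and whatever code path assigns the array, and the list is built with plain JDK collections rather than ArrayUtil.toUnmodifiableLongList. The idea is only to show the list being built once and invalidated when the backing array changes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for BaseFile; not the real Iceberg class.
class CachedOffsetsFile {
  private long[] splitOffsets;
  // Cached List view of splitOffsets, built lazily on first access.
  private transient List<Long> splitOffsetsList;

  CachedOffsetsFile(long[] splitOffsets) {
    this.splitOffsets = splitOffsets;
  }

  // Hypothetical setter: any write to the array must drop the cached list.
  void setSplitOffsets(long[] newOffsets) {
    this.splitOffsets = newOffsets;
    this.splitOffsetsList = null;
  }

  // Returns the same unmodifiable list on repeated calls.
  List<Long> splitOffsets() {
    if (splitOffsets == null) {
      return null;
    }
    if (splitOffsetsList == null) {
      List<Long> boxed = new ArrayList<>(splitOffsets.length);
      for (long offset : splitOffsets) {
        boxed.add(offset);
      }
      splitOffsetsList = Collections.unmodifiableList(boxed);
    }
    return splitOffsetsList;
  }

  // Raw accessor for callers that can consume the long[] directly.
  long[] splitOffsetArray() {
    return splitOffsets;
  }
}
```

With this shape, repeated calls to splitOffsets() on the same instance return the same List, and callers that can take a long[] skip the conversion entirely. The lazy initialization is not synchronized; two racing threads would at worst each build an equivalent list, which may or may not be acceptable depending on how these objects are shared.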

Query engine

None

Willingness to contribute

  • I can contribute this improvement/feature independently
  • I would be willing to contribute this improvement/feature with guidance from the Iceberg community
  • I cannot contribute this improvement/feature at this time

Metadata

Assignees

No one assigned

Labels

improvement (PR that improves existing functionality)
