
Perf: avoid materializing all block dicts and factory closures at registration time #129

@alxmrs


Problem

read_xarray_table() forces all partition state into memory at registration time:

blocks = list(block_slices(ds, chunks))          # 732,072 Python dicts
metadata = partition_metadata(ds, blocks)          # 732,072 metadata dicts
factories = [make_partition_factory(b) for b in blocks]  # 732,072 closures

For ARCO-ERA5 (732,072 partitions):

  • Block dicts: ~400 bytes × 732,072 ≈ 300 MB
  • Metadata dicts: ~800 bytes × 732,072 ≈ 586 MB
  • Python closures: ~200 bytes × 732,072 ≈ 150 MB
  • Rust Arc<PyArrowStreamPartition> + PartitionMetadata: ~900 bytes × 732,072 ≈ 660 MB
  • Total: ~1.7 GB RAM consumed at registration before any query runs
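The total above can be checked with a few lines of arithmetic. This is only a sketch: the per-partition byte counts are the rough estimates listed in this issue, not measured values, and the dictionary keys are labels chosen here for illustration.

```python
# Rough per-partition sizes in bytes, copied from the estimates above.
N_PARTITIONS = 732_072

per_partition_bytes = {
    "block_dict": 400,
    "metadata_dict": 800,
    "python_closure": 200,
    "rust_partition_and_metadata": 900,
}

# Memory per component, then the grand total across all partitions.
component_bytes = {
    name: size * N_PARTITIONS for name, size in per_partition_bytes.items()
}
total_bytes = sum(component_bytes.values())

for name, size in component_bytes.items():
    print(f"{name}: {size / 1e6:.0f} MB")
print(f"total: {total_bytes / 1e9:.2f} GB")
```

Using decimal units (1 GB = 10^9 bytes), the sum is consistent with the ~1.7 GB figure.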

This also causes 732,072 Python::attach() GIL acquisitions in convert_python_metadata — one per partition — serializing all metadata conversion through the GIL.

Proposed fix

Either:

  1. Accept a Python callable that generates (factory, metadata) pairs on demand, or
  2. Expose a streaming registration API that adds partitions incrementally

Either option would make registration O(1) in memory rather than O(N_partitions), because no full list of blocks, metadata, or factories is ever held at once.
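A minimal sketch of option 1, assuming partitions can be described by per-dimension sizes and chunk lengths. The names `iter_block_slices` and `iter_partitions` are hypothetical stand-ins, not the project's actual `block_slices`, `partition_metadata`, or `make_partition_factory`, and the metadata layout is invented for illustration.

```python
import itertools
from typing import Callable, Dict, Iterator, Tuple

Block = Dict[str, slice]


def iter_block_slices(sizes: Dict[str, int], chunks: Dict[str, int]) -> Iterator[Block]:
    """Lazily yield one {dim: slice} block at a time (hypothetical helper)."""
    dims = list(sizes)
    # Only the per-dimension slice lists are held in memory; their total
    # length is the sum of chunk counts per dimension, not the product.
    per_dim = [
        [slice(start, min(start + chunks[dim], sizes[dim]))
         for start in range(0, sizes[dim], chunks[dim])]
        for dim in dims
    ]
    for combo in itertools.product(*per_dim):
        yield dict(zip(dims, combo))


def iter_partitions(
    sizes: Dict[str, int], chunks: Dict[str, int]
) -> Iterator[Tuple[Callable[[], Block], dict]]:
    """Yield (factory, metadata) pairs on demand instead of building lists."""
    for block in iter_block_slices(sizes, chunks):
        # Illustrative metadata: the index bounds of the block per dimension.
        metadata = {dim: (s.start, s.stop) for dim, s in block.items()}

        # Bind the current block as a default argument so each factory keeps
        # its own block rather than the loop's last value.
        def factory(block: Block = block) -> Block:
            # A real factory would open a record-batch stream for this block;
            # here it just returns the block so the sketch stays runnable.
            return block

        yield factory, metadata


# Usage: a consumer pulls pairs one at a time, so only one pair is alive at once.
pairs = iter_partitions({"time": 10, "lat": 4}, {"time": 4, "lat": 2})
first_factory, first_metadata = next(pairs)
remaining = sum(1 for _ in pairs)
```

Whether the Rust side can consume such an iterator without materializing every partition itself is a separate question that this sketch does not address.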

Parent: #126

Labels: enhancement (New feature or request)
