Problem
read_xarray_table() forces all partition state into memory at registration time:
blocks = list(block_slices(ds, chunks)) # 732,072 Python dicts
metadata = partition_metadata(ds, blocks) # 732,072 metadata dicts
factories = [make_partition_factory(b) for b in blocks] # 732,072 closures
For ARCO-ERA5 (732,072 partitions):
- Block dicts: ~400 bytes × 732,072 ≈ 300 MB
- Metadata dicts: ~800 bytes × 732,072 ≈ 586 MB
- Python closures: ~200 bytes × 732,072 ≈ 150 MB
- Rust Arc<PyArrowStreamPartition> + PartitionMetadata: ~900 bytes × 732,072 ≈ 660 MB
- Total: ~1.7 GB RAM consumed at registration before any query runs
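The per-dict estimates above can be sanity-checked directly. This is a rough sketch, not the actual block-dict layout: the keys and values below are illustrative, and `sys.getsizeof` on the dict alone undercounts, so the referenced keys and values are summed shallowly as well.

```python
import sys

# Hypothetical shape of one block dict (illustrative, not the real
# read_xarray_table() internals): slice bounds per dimension plus variables.
block = {
    "time": (0, 24),
    "latitude": (0, 721),
    "longitude": (0, 1440),
    "variables": ["t2m", "u10", "v10"],
}

def shallow_size(d):
    # sys.getsizeof(d) covers only the dict header and hash table;
    # add the keys and values it references (one level deep).
    total = sys.getsizeof(d)
    for k, v in d.items():
        total += sys.getsizeof(k) + sys.getsizeof(v)
    return total

per_block = shallow_size(block)
print(per_block)                    # a few hundred bytes per dict
print(per_block * 732_072 / 1e6)    # hundreds of MB across all partitions
```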
This also causes 732,072 Python::attach() GIL acquisitions in convert_python_metadata — one per partition — serializing all metadata conversion through the GIL.
Proposed fix
Either:
- Accept a Python callable that generates (factory, metadata) pairs on demand, or
- Expose a streaming registration API that adds partitions incrementally
This would make registration O(1) in memory rather than O(N_partitions).
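The first option could look roughly like the sketch below. All names here (`block_slices`, the per-block metadata helper, the registration entry point) are illustrative stand-ins, not the library's real API; the point is the shape: registration holds only a generator object, and each (factory, metadata) pair is built when the engine pulls it.

```python
from typing import Any, Callable, Iterator, Tuple

def block_slices(n_blocks: int) -> Iterator[dict]:
    """Stand-in for the real per-chunk slicing: yields one block
    descriptor at a time instead of building a 732,072-element list."""
    for i in range(n_blocks):
        yield {"block_id": i, "time": (i * 24, (i + 1) * 24)}

def iter_partitions(n_blocks: int) -> Iterator[Tuple[Callable[[], Any], dict]]:
    for block in block_slices(n_blocks):
        # Bind the loop variable as a default so each closure captures
        # its own block, not the last one.
        factory = lambda b=block: b           # would open/read the real chunk
        metadata = {"rows": 24, **block}      # per-partition stats, built lazily
        yield factory, metadata

# Registration keeps only the generator -- O(1) memory, not three
# parallel O(N) lists:
pairs = iter_partitions(732_072)
factory, metadata = next(pairs)               # engine pulls on demand
print(metadata["block_id"])                   # 0 -- first partition only
```

The streaming-registration option is the same idea turned inside out: instead of the table pulling from a generator, the caller pushes partitions one at a time through an incremental add-partition call, with the same O(1) memory property.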
Parent: #126