Problem
The current data path in each partition factory is:
xarray arrays → ds.to_dataframe() → pandas MultiIndex → reset_index() → flat DataFrame → pa.RecordBatch.from_pandas()
This is a 3-copy chain. For a 38.4M-row, 5-column partition:
- xarray in-memory: ~307 MB
- pandas MultiIndex DataFrame: ~614 MB (copy 1)
- flat DataFrame after reset_index(): ~614 MB (copy 2)
- Arrow RecordBatch: ~307 MB (copy 3)
- Peak memory: ~1.5 GB to produce ~307 MB of output
Proposed fix
Build the RecordBatch directly from numpy arrays, bypassing pandas entirely:
- For each dimension column: pa.array(ds.coords[dim].values[slc]), broadcast to the full row count
- For each data variable: pa.array(ds[var].values.ravel())
- Then: pa.RecordBatch.from_arrays([...], schema=schema)
This requires computing broadcasted coordinate arrays (xarray stores coordinates once per axis, not per row) but avoids all pandas overhead.
Impact
For ERA5-scale datasets this change is necessary to avoid OOM during query execution. The pandas round-trip is the primary memory bottleneck per partition.
Parent: #126