Perf: replace pandas round-trip with direct numpy → Arrow RecordBatch construction #127

@alxmrs

Description

@alxmrs

Problem

The current data path in each partition factory is:

xarray arrays → ds.to_dataframe() → pandas MultiIndex → reset_index() → flat DataFrame → pa.RecordBatch.from_pandas()

This is a 3-copy chain. For a 38.4M-row, 5-column partition:

  • xarray in-memory: ~307 MB
  • pandas MultiIndex DataFrame: ~614 MB (copy 1)
  • flat DataFrame after reset_index(): ~614 MB (copy 2)
  • Arrow RecordBatch: ~307 MB (copy 3)
  • Peak memory: ~1.5 GB to produce ~307 MB of output

Proposed fix

Build the RecordBatch directly from numpy arrays, bypassing pandas entirely:

  1. For each dimension column: pa.array(ds.coords[dim].values[slc]), broadcast to the full row count
  2. For each data variable: pa.array(ds[var].values.ravel())
  3. pa.RecordBatch.from_arrays([...], schema=schema)

This requires computing broadcasted coordinate arrays (xarray stores coordinates once per axis, not per row) but avoids all pandas overhead.

Impact

For ERA5-scale datasets this change is necessary to avoid OOM during query execution. The pandas round-trip is the primary memory bottleneck per partition.

Parent: #126

Metadata

Labels

enhancement (New feature or request)
