Skip to content

/load should accept inline Arrow IPC / Parquet bytes (alongside path) #768

@paddymul

Description

@paddymul

Problem

POST /load in buckaroo.server.handlers.LoadHandler requires path: str — a filesystem path the server can read. For host integrations that produce DataFrames in memory (Rust, Node, Electron renderers, etc.) without going through disk, this forces a temp-file dance:

  1. Serialize bytes to a temp .parquet somewhere mutually visible
  2. POST /load with that path
  3. Clean up the temp file later

Concrete pain: the buckaroo-tauri integration shape today is

host (Rust)
  ├─ produce arrow/parquet bytes
  ├─ write temp file
  ├─ invoke buckaroo_load_path({ path: tempfile })
  └─ (eventually) delete temp file

Filesystem visibility between renderer/Rust/Python is a per-platform headache (Tauri sandboxes, Electron contextIsolation, Docker mounts, etc.). And for hosts that have already serialized to memory there's no reason to round-trip through disk.

Where this came from

xorq-desktop integration spike — initially planned to send Arrow IPC bytes straight to buckaroo. We worked around it by pointing /load at xorq's snapshot-cache parquet directly (which happens to live on disk because of xorq's ParquetSnapshotCache). That's a convenient coincidence — most hosts won't have it.

Proposal

Extend LoadHandler.post to accept one of:

{ "path": "/abs/path.parquet", ... }       // existing
{ "arrow_ipc_b64": "<base64>", ... }       // new — Arrow IPC stream bytes
{ "parquet_b64": "<base64>", ... }         // new — raw Parquet bytes

Validation: exactly one of path / arrow_ipc_b64 / parquet_b64 required. Error if multiple or none.

Server-side dispatch in _load_file_with_error_handling:

if path:
    df = load_file(path)
elif arrow_ipc_b64:
    df = load_arrow_bytes(base64.b64decode(arrow_ipc_b64))
elif parquet_b64:
    df = load_parquet_bytes(base64.b64decode(parquet_b64))
metadata = get_metadata(df, label or "<inline>")

The "filename" / "path" fields in the response can be a synthesized label (passed in or derived) when bytes are used. Everything downstream (session, ws push, browser window) keeps working unchanged.

Why both arrow_ipc and parquet

Arrow IPC is faster to encode for hosts that already hold an Arrow-backed table (no parquet write overhead). Parquet is what most hosts already have on disk and want to keep compatibility with. Supporting both is a tiny dispatch cost and avoids forcing one transport.

Companion plugin change

buckaroo-tauri::buckaroo_load_path would gain arrow_ipc_b64 / parquet_b64 variants in LoadPathArgs. The plugin already does base64 on the binary frame relay (supervisor.rs::258-323), so the encoding/decoding tooling is in scope.

Out of scope

  • Streaming inline data. Single-shot encode/decode is fine for the target use cases (< 100k rows ish).
  • WebSocket push of inline data. Same — /load is the right place.

Happy to draft the buckaroo PR + the buckaroo-tauri companion if direction is acceptable.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions