
Use obstore as the object storage backend? #47

@turban

Overview

obstore is a high-performance Python library for cloud object storage, backed by the Rust object_store crate. It provides a single interface with native async support across S3, GCS, Azure, HTTP, and local storage, which makes it a strong candidate for our storage backend for both ingested datasets and external Zarr stores (#46).

Why obstore over fsspec

Zarr reads from S3 consist of many small, concurrent requests, roughly one GET per chunk (or one byte-range request per chunk when chunks are packed into shards). obstore is designed for exactly this pattern:

  • Stateless, atomic per-object API that mirrors how cloud storage actually works, vs. fsspec's file-like (seek/read cursor) abstraction
  • Up to ~9x higher throughput than fsspec reported for async workloads (to be confirmed on our own data)
  • First-class async support, with no thread pool workarounds
  • Automatic credential refresh and multipart uploads built in
  • One interface for all cloud providers, which is relevant for external stores (#46) that may live on S3, GCS, or Azure

Relevance to this project

  • Ingested datasets: faster chunk reads from our own S3 storage during API queries
  • External Zarr stores (#46): one interface handles any provider, without per-provider client libraries or credential handling
  • Icechunk alignment: Icechunk uses the same underlying Rust object_store crate internally, so adopting obstore would give us a shared storage layer if we go that route

Caveat to verify

zarr-python v3 supports pluggable Store backends, and obstore provides an fsspec compatibility shim. We should verify that our full stack (xarray, zarr-python, any ingestion tooling) can wire up to obstore directly rather than falling back through fsspec; otherwise part of the performance gain is lost.

Suggested next steps

  • Benchmark obstore vs. fsspec/s3fs for representative Zarr chunk reads on our S3 data
  • Confirm zarr-python v3 + xarray can use obstore natively (not via the fsspec shim)
  • Use obstore in the ingestion and serving pipeline
  • Use obstore for external store connections in #46 (multi-cloud, single interface)
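For the benchmarking step, one option is a small backend-agnostic harness like the sketch below. It uses only the standard library: `time_concurrent_reads` takes any async `fetch(key)` callable, so the same code could time an obstore-based fetcher and an fsspec/s3fs-based one against the same keys. The function names, the key layout, and `fake_fetch` are hypothetical placeholders, not part of either library.

```python
import asyncio
import time


async def time_concurrent_reads(fetch, keys, concurrency=32):
    """Fetch all keys with bounded concurrency; return (seconds, total_bytes)."""
    sem = asyncio.Semaphore(concurrency)

    async def one(key):
        async with sem:
            return len(await fetch(key))

    start = time.perf_counter()
    sizes = await asyncio.gather(*(one(k) for k in keys))
    return time.perf_counter() - start, sum(sizes)


async def fake_fetch(key):
    # Hypothetical stand-in for a real backend call (obstore or fsspec/s3fs).
    await asyncio.sleep(0.001)
    return b"\x00" * 1024


def run_demo(n=100):
    keys = [f"c/0/{i}" for i in range(n)]  # hypothetical chunk keys
    return asyncio.run(time_concurrent_reads(fake_fetch, keys))


if __name__ == "__main__":
    elapsed, total = run_demo()
    print(f"{total} bytes in {elapsed:.3f}s")
```

Running each backend several times on representative chunk sizes, and from the same network location as the API, would make the comparison more meaningful than a single pass.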
