
Add optional S3-compatible object storage as primary persistence backend #3228

@jayakasadev

Description

iggy currently persists everything to local disk: per-partition append-only `.log` segments, sparse `.index` files, an append-only state log, per-consumer offset files, system info, and tokens. This issue proposes an opt-in mode where an S3-compatible object store is the only persistence medium — useful for ephemeral / scale-to-zero compute, durable-by-default archive, and deployments where local NVMe provisioning is the bottleneck.

Local-disk mode is unchanged and remains the default; the S3 backend ships behind a default-off cargo feature so fs-only deployments aren't affected.

Component

Iggy server

Proposed solution

A new `ObjectStorage` trait abstracts persistence, rolled out incrementally across phases 0–10. Each phase migrates one persistence subsystem onto the trait and is independently mergeable.
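
A minimal sketch of the seam, assuming native `async fn` in traits (Rust ≥ 1.75); the method names and error type are illustrative, not the final surface:

```rust
/// Placeholder error type for this sketch.
#[derive(Debug)]
pub struct StorageError(pub String);

/// Illustrative shape of the persistence seam. `CompioFsStorage`,
/// `InMemoryStorage`, and `S3Storage` would each implement it, so
/// callers never know which backend sits underneath.
#[allow(async_fn_in_trait)] // sketch only; Send-bound concerns ignored
pub trait ObjectStorage {
    async fn put(&self, key: &str, data: Vec<u8>) -> Result<(), StorageError>;
    async fn get(&self, key: &str) -> Result<Vec<u8>, StorageError>;
    /// Ranged read backing Phase 4's segment reads.
    async fn get_range(&self, key: &str, offset: u64, len: u64) -> Result<Vec<u8>, StorageError>;
    async fn delete(&self, key: &str) -> Result<(), StorageError>;
    async fn list(&self, prefix: &str) -> Result<Vec<String>, StorageError>;
}
```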

The S3 client is compio-native: built on `rusty-s3` (sans-IO SigV4 + request shaping) + `cyper` (compio HTTP client, rustls TLS). Critically, this avoids reintroducing into the data path the tokio runtime that #2020 just removed.
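
For flavor, a hedged sketch of the request path: rusty-s3 shapes and presigns the request without doing any I/O, and cyper performs the HTTP exchange on compio. The bucket name, key, and the reqwest-style cyper builder calls are assumptions, not settled API:

```rust
use std::time::Duration;
use rusty_s3::{Bucket, Credentials, S3Action, UrlStyle};

async fn get_object_sketch() -> Result<Vec<u8>, Box<dyn std::error::Error>> {
    // rusty-s3 is sans-IO: it only builds and signs URLs; no sockets here.
    let endpoint = "https://s3.us-east-1.amazonaws.com".parse()?;
    let bucket = Bucket::new(endpoint, UrlStyle::VirtualHost, "iggy-data", "us-east-1")?;
    let creds = Credentials::new("ACCESS_KEY", "SECRET_KEY"); // placeholders

    // Presign a GET valid for 60 s; still no network traffic.
    let url = bucket
        .get_object(Some(&creds), "partitions/0/segment.log")
        .sign(Duration::from_secs(60));

    // cyper drives the actual HTTP call on the compio runtime
    // (builder API assumed to mirror reqwest's).
    let resp = cyper::Client::new().get(url.as_str())?.send().await?;
    Ok(resp.bytes().await?.to_vec())
}
```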

Phase plan

| Phase | Scope |
| --- | --- |
| 0 | Pre-flight feasibility spike — validate rusty-s3 + cyper + compio + rustls against real AWS S3. Done — 5/5 scenarios passed. |
| 1 | `ObjectStorage` trait + `CompioFsStorage` + `InMemoryStorage` (test) + `S3Storage` (feature-gated) + `BufferedMultipartWriter`. Seam only — no production callers yet. ← this issue's first deliverable |
| 2 | State log + `FileSystemInfoStorage` + tokens onto the trait (journal + snapshot model on object backend). |
| 3 | Segment writes via multipart upload. |
| 4 | Segment reads via ranged GET + LRU byte cache (active-segment reads keep using the in-flight buffer); sketched after this table. |
| 5 | Bootstrap, directory ops, per-partition versioned manifests for fast boot. |
| 6 | Consumer offsets repacked into one binary object per partition. |
| 7 | Retention + segment deletion. |
| 8 | Per-partition lease object (S3 conditional PUT) for split-brain safety. |
| 9 | Hardening — retries, Prometheus metrics, IAM template, perf benchmarks. |
| 10 | Documentation, sample config, release notes. |
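
As a taste of the Phase 4 read path referenced above, a hedged sketch: reads are block-aligned, fronted by an LRU byte cache, and fall back to one HTTP range GET per missing block. The block size and the `lru` crate are illustrative choices, not decisions:

```rust
use std::num::NonZeroUsize;
use lru::LruCache; // hypothetical cache crate for the sketch

const BLOCK_SIZE: u64 = 256 * 1024; // illustrative 256 KiB cache blocks

/// Sketch: serve segment reads from cache, else via ranged S3 GET.
struct SegmentReader {
    cache: LruCache<(String, u64), Vec<u8>>, // (object key, block index) -> bytes
}

impl SegmentReader {
    fn new(capacity_blocks: usize) -> Self {
        Self { cache: LruCache::new(NonZeroUsize::new(capacity_blocks).unwrap()) }
    }

    async fn read_block(&mut self, key: &str, block: u64) -> Vec<u8> {
        if let Some(bytes) = self.cache.get(&(key.to_owned(), block)) {
            return bytes.clone(); // hit: no network round-trip
        }
        // Miss: fetch exactly one block, e.g. `Range: bytes=start-end`
        // on a presigned GET URL.
        let start = block * BLOCK_SIZE;
        let bytes = ranged_get(key, start, start + BLOCK_SIZE - 1).await;
        self.cache.put((key.to_owned(), block), bytes.clone());
        bytes
    }
}

// Placeholder for the actual rusty-s3 + cyper ranged GET.
async fn ranged_get(_key: &str, _start: u64, _end: u64) -> Vec<u8> { Vec::new() }
```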

The S3 backend ships behind a default-off `object-storage` cargo feature, so fs-only deployments don't pull `rusty-s3` / `cyper` / `url` into their dependency graph.
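
The wiring might look like the following; module and function names are assumptions for illustration:

```rust
// The S3 backend and its dependency subtree compile only when the
// default-off `object-storage` feature is enabled.
#[cfg(feature = "object-storage")]
mod s3_storage {
    // S3Storage lives here and is what pulls in rusty-s3 + cyper + url.
}

mod compio_fs_storage {
    // CompioFsStorage: local-disk backend, always compiled, still the default.
}

fn backend_from_config(name: &str) -> Result<&'static str, String> {
    match name {
        "fs" => Ok("CompioFsStorage"),
        #[cfg(feature = "object-storage")]
        "s3" => Ok("S3Storage"),
        other => Err(format!("backend `{other}` unknown or compiled out")),
    }
}
```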

Phase 0 spike outcome

A throwaway feasibility spike (~330 LoC) ran against real AWS S3 in an ephemeral bucket (us-east-1; 1-day lifecycle backstop; bucket torn down on exit). All five scenarios passed:

| Scenario | Latency |
| --- | --- |
| PUT 1 KiB | 108 ms |
| Range-GET 256 B | 33 ms |
| Multipart 12 MiB upload (3 parts) | 1555 ms |
| Full GET + byte-compare 12 MiB | 919 ms |
| Conditional PUT race (`If-None-Match: *`) | 62 ms; loser fenced cleanly with HTTP 412 |

Three correctness findings, baked into Phase 1:

  1. rustls 0.23 needs an explicit `CryptoProvider::install_default()` call when cyper is built with `default-features = false`.
  2. AWS returns ETags wrapped in quotes, and `rusty-s3`'s `complete_multipart_upload` re-wraps them when serializing the XML body, so each ETag must be stripped with `.trim_matches('"')` before being passed back; otherwise S3 answers `400 InvalidPart`.
  3. The multipart minimum part size is 5 MiB for all but the final part; iggy's typical sub-MiB flushes need a buffering layer (`BufferedMultipartWriter`) to coalesce them into legal parts. A sketch follows this list.
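
A hedged sketch of the coalescing layer from finding 3, with the ETag fix from finding 2 folded in; the names and flush policy are assumptions:

```rust
const MIN_PART: usize = 5 * 1024 * 1024; // S3 floor for all but the final part

/// Sketch: buffer sub-part-size writes and emit a legal multipart part
/// only once 5 MiB has accumulated (or unconditionally on close).
struct BufferedMultipartWriter {
    buf: Vec<u8>,
    part_number: u16, // S3 part numbers start at 1
    etags: Vec<String>,
}

impl BufferedMultipartWriter {
    async fn write(&mut self, bytes: &[u8]) {
        self.buf.extend_from_slice(bytes);
        if self.buf.len() >= MIN_PART {
            self.flush_part().await;
        }
    }

    async fn flush_part(&mut self) {
        let part = std::mem::take(&mut self.buf);
        self.part_number += 1;
        let raw_etag = upload_part(self.part_number, part).await;
        // Finding 2: AWS returns `"abc123"`; rusty-s3 re-quotes on
        // serialization, so strip the quotes or S3 answers 400 InvalidPart.
        self.etags.push(raw_etag.trim_matches('"').to_owned());
    }

    async fn close(mut self) -> Vec<String> {
        if !self.buf.is_empty() {
            self.flush_part().await; // final part may legally be < 5 MiB
        }
        self.etags // handed to complete_multipart_upload
    }
}

// Placeholder for the signed UploadPart PUT; returns the ETag header value.
async fn upload_part(_n: u16, _body: Vec<u8>) -> String { String::new() }
```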

Phase 1 PRs

Alternatives considered

  1. Build on `opendal` instead of `rusty-s3` + `cyper`. Tempting because opendal supports more backends out of the box (native GCS, native Azure), but opendal is tokio-native: pulling it in would mean either reintroducing tokio into iggy's data-path runtime (undoing #2020, "feat(io_uring): replace tokio s3 crate") or running opendal behind a per-call thread bridge (channel-hop overhead on every S3 call, plus an extra runtime). Rejected.
  2. Use `rust-s3` (already a transitive dep via the iceberg sink). Also tokio-based via reqwest. Same problem. Left in place for the existing iceberg connector; not used for the new path.
  3. Roll a thin compio-native HTTP client over `compio::net::TcpStream` + `rustls` directly. Minimal external surface but ~300 LoC of HTTP/SigV4 plumbing to maintain in-tree. Reserved as a fallback if `cyper` later turns out to have sharp edges; the Phase 0 spike confirmed it doesn't, today.

Open questions for maintainers

  • Issue cadence. Happy to file separate issues per phase (one-issue-one-PR) if you prefer that to a single umbrella. This issue is intended to anchor design discussion for the milestone; each phase still ships in its own PR(s).
  • Default `multipart_part_size`. Currently 8 MiB (configurable). AWS minimum is 5 MiB except final. Smaller → finer durability + more S3 PUTs; larger → fewer PUTs + larger memory buffers. 8 MiB is a starting guess; alternatives 5 / 16 / 32.
  • `ack_after_upload` default. True (producer ack waits for part-upload success — durable before producer learns) is the safe default. False is faster but loses messages on crash before next flush; intended for testing only. Reasonable?
  • GCS S3-compat support. Phase 8 (fencing) uses S3 `If-None-Match: *` conditional PUT. AWS / MinIO / R2 / Tigris support this; GCS's S3-compat layer does not. OK to document GCS as unsupported in Phase 8 and revisit later, or should we adopt a different fencing mechanism (a paid coordination service)? A sketch of the conditional-PUT fence follows this list.
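
For concreteness on the last question, a sketch of the Phase 8 fence under the semantics the spike validated; the helper is a placeholder, and what matters is the create-only PUT plus treating HTTP 412 as "another node holds the lease":

```rust
/// Sketch: acquire a per-partition lease by creating the lease object
/// only if it does not already exist. Exactly one writer can win.
async fn try_acquire_lease(partition: u32, node_id: &str) -> bool {
    let key = format!("leases/partition-{partition}");
    // PUT with `If-None-Match: *` succeeds only when `key` is absent.
    match put_if_absent(&key, node_id.as_bytes()).await {
        200 => true,  // lease created: safe to write this partition
        412 => false, // PreconditionFailed: this node is fenced cleanly
        status => panic!("unexpected status {status}"), // retries are Phase 9 work
    }
}

// Placeholder for the signed conditional PUT. Per the spike: AWS, MinIO,
// R2, and Tigris honor `If-None-Match: *`; GCS's S3-compat layer does not.
async fn put_if_absent(_key: &str, _body: &[u8]) -> u16 { 412 }
```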
