v1.1.0 — adoption tooling + small-object compression

@masumi-ryugo masumi-ryugo released this 11 Jun 03:42
· 18 commits to main since this release

v1.1 — adoption tooling + small-object compression. Six additive
features (s4 estimate / s4 migrate / zstd dictionaries +
s4 train-dict / s4fs fsspec adapter / s4 recompact / GPU batched
small-PUT compression) hardened by a 3-round dual-reviewer audit
(Claude ×3 + Codex; findings 20 → 7 → 5, P1/P2 zero at round 3). The
v1.0 freeze contract holds: every change below is additive and
default-off; flag-less PUT/GET behavior is bit-for-bit unchanged.

Fixed (audit round 2 — adversarial verification of the round-1 fix wave)

  • P2 CreateMultipartUpload now strips client-supplied s4-*
    metadata like put_object does — a forged x-amz-meta-s4-encrypted
    could otherwise survive onto a completed multipart object and 5xx a
    flag-less GET (multipart re-open of the round-1 PUT fix).
  • P2 migrate / recompact no longer hard-fail every object when
    GetObjectTagging is denied or unimplemented: such objects skip as
    tags-unreadable (data is never rewritten tag-less), NoSuchTagSet
    counts as "no tags", and a new --no-tags flag opts out of tag
    inheritance entirely. Transient tagging errors still fail hard.
  • P2 Version-pinned CopyObject (?versionId=) probes the pinned
    source version — not the latest — for both the REPLACE metadata merge
    and cross-bucket dictionary propagation.
  • P3 Dictionary size cap (1 MiB) is now one consistent contract:
    train-dict --max-dict-bytes and --zstd-dict boot preload reject
    what a flag-less gateway's lazy fetch would refuse.
  • P3 Boot-preloaded dictionaries are bucket-scoped, fetched per
    (bucket, id) with s4-dict-sha256 verification, and the server
    refuses to boot when one dict-id resolves to different bytes across
    buckets (16-hex prefix collision).
  • P3 s4 estimate excludes already-S4 objects (gateway metadata or
    S4F2/S4P1/S4E* magic) from sampling so re-estimating a
    gateway-operated bucket doesn't measure framed/encrypted bytes as if
    they were compressible plaintext (already_s4 count + note).
  • P3 (s4fs) the sidecar staleness check reuses a cached live-info
    snapshot instead of issuing a second backend HEAD per info().
    Trade-off disclosed: external overwrites during one filesystem
    instance's lifetime are detected on the next invalidate_cache() /
    new instance, not per-read (same contract as the metadata cache).
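The tag-inheritance rules for migrate / recompact in this round amount to a small classification of GetObjectTagging outcomes. The sketch below illustrates that decision in Python. It assumes the "denied or unimplemented" cases surface as the S3 error codes AccessDenied and NotImplemented (the notes describe the behaviour, not the exact codes), and classify_tagging_outcome is a hypothetical helper, not part of s4.

```python
from typing import Optional

# Assumption: these two S3 error codes stand in for "denied or unimplemented".
UNREADABLE_CODES = {"AccessDenied", "NotImplemented"}


def classify_tagging_outcome(error_code: Optional[str], no_tags: bool = False) -> str:
    """Decide what a bulk rewrite does after trying to read an object's tags.

    error_code is None when GetObjectTagging succeeded.
    """
    if no_tags:
        # --no-tags opts out of tag inheritance entirely.
        return "rewrite-without-tags"
    if error_code is None:
        return "rewrite-with-tags"
    if error_code == "NoSuchTagSet":
        # Counts as "no tags": nothing to inherit, the rewrite may proceed.
        return "rewrite-without-tags"
    if error_code in UNREADABLE_CODES:
        # Never rewrite tag-less when tags exist but cannot be read.
        return "skip:tags-unreadable"
    # Anything else is treated as transient and still fails hard.
    return "fail"
```

The point of the middle branch is that an unreadable tag set is a per-object skip rather than a run-wide failure, while data is still never rewritten without its tags unless the operator asks for that explicitly.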

Fixed (audit round 3 — convergence check)

  • P3 s4 estimate's already-S4 body detection is structurally
    validated (known codec id + payload fits the object for S4F2,
    plausible padding length for S4P1) so customer data that merely
    starts with the 4-byte magic isn't silently dropped from sampling.
  • P3 README/CHANGELOG drift from the round-1/2 fixes corrected:
    dictionary 1 MiB cap is documented as one three-surface contract,
    migrate/recompact sample outputs show the full current skip taxonomy,
    --no-tags / tags-unreadable / already-s4 estimate exclusions
    documented.

Fixed (audit round 1 — 4 reviewers over v1.0.0..HEAD, 2026-06-11)

  • P1 s4 migrate could rewrite .s4dict/<id> dictionary objects as
    S4F2-framed data, breaking every cpu-zstd-dict object in the bucket
    (lazy fetch fails fingerprint verification). All three bulk tools
    (estimate / migrate / recompact) now exclude S4-internal keys:
    *.s4index, .s4dict/, and *.__s4ver__/* versioning shadows.
  • P1 A client-supplied x-amz-meta-s4-dict-id on a plain PUT made
    the subsequent GET fail 5xx even with --zstd-dict unset (default-off
    behavior regression). The GET dict branch is now gated on the
    gateway-managed manifest codec (cpu-zstd-dict), and put_object
    strips client-supplied s4-* metadata keys up front.
  • P1 (s4fs) SSE-encrypted objects could return AES-GCM ciphertext
    bytes silently (passthrough + SSE). s4fs now refuses with
    NotImplementedError via three layers: s4-encrypted metadata,
    sidecar SSE binding, and S4E1–S4E6 magic sniff.
  • P1 (s4fs) <key>.__s4ver__/<version> shadow objects were not
    hidden from ls/find/glob (prefix check instead of infix), so
    directory dataset scans could silently include stale versions.
  • P2 migrate / recompact rewrites dropped the source object's
    storage class (silent promotion to STANDARD) and object tags; both
    are now inherited. ACLs / Object Lock retention remain uninherited
    (stated in report notes).
  • P2 migrate treated a roundtrip-verify failure as a skip
    (exit 0); it is now a hard failure (exit 1), matching recompact.
    The skipped_verify_failed JSON field remains (always 0) for shape
    compatibility.
  • P2 Cross-bucket CopyObject of a dict-compressed object now
    propagates .s4dict/<id> to the destination bucket (idempotent,
    content-addressed); previously the copy succeeded but every GET on
    the destination failed 5xx.
  • P2 .s4dict/ joined the reserved-key guard: gateway PUT / DELETE
    are rejected with InvalidObjectName (reads still allowed) so a
    bucket-wide dictionary can't be destroyed through the data path.
  • P2 (s4fs) info() no longer trusts a stale sidecar for object
    size (staleness-checked first), and binding-less legacy v1 sidecars
    are no longer used for size or partial range reads.
  • P2 (s4fs) dependency floor corrected to s4-codec>=1.1.0,<2:
    the binding APIs s4fs imports don't exist in the 1.0.0 wheel.
  • P3 estimate no longer aborts the whole run when a sampled
    object 404s mid-run (skip + note); module/report now disclose the
    single-stream measurement bias vs the server's 4 MiB chunking.
  • P3 migrate / recompact enforce --max-body-bytes from the
    GET Content-Length before buffering; migrate now also cleans up a
    stale multi-frame sidecar when its rewrite comes out single-frame.
  • P3 recompact no longer auto-promotes backend-written framed
    objects that lack gateway metadata (unstamped-framed skip; opt back
    in with --assume-unstamped-framed).
  • P3 Dict hardening: DictCache is bucket-scoped, train-dict
    stamps s4-dict-sha256 (full-digest verification when present), and
    lazy fetch caps dictionaries at 1 MiB. (s4fs) open() on a framed
    object with inexact size raises instead of silently truncating
    (allow_inexact_open=True restores the old clamp).
  • P3 nvcomp_batched validates device-reported chunk sizes on the
    host before the unsafe copy (typed per-item error instead of a
    potential OOB read on driver misbehavior).
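Two of the P1 fixes above rely on recognising S4-internal keys: the bulk tools exclude *.s4index, .s4dict/ and *.__s4ver__/* keys, and s4fs had to switch from a prefix check to an infix check for version shadows. A minimal Python sketch of such a filter follows. The function names are hypothetical, and it assumes .s4dict/ sits at the bucket root, as the .s4dict/<id> paths in these notes suggest.

```python
def is_s4_internal(key: str) -> bool:
    """True for keys that bulk tools and listings should leave alone."""
    return (
        key.endswith(".s4index")      # *.s4index sidecars
        or key.startswith(".s4dict/")  # bucket-wide dictionaries
        or ".__s4ver__/" in key        # <key>.__s4ver__/<version> shadows (infix)
    )


def naive_prefix_check(key: str) -> bool:
    """The kind of prefix-only check that misses nested shadow objects."""
    return key.startswith(".__s4ver__/")


keys = [
    "data/part-0.parquet",
    "data/part-0.parquet.s4index",
    "data/part-0.parquet.__s4ver__/3",
    ".s4dict/0123456789abcdef",
]
visible = [k for k in keys if not is_s4_internal(k)]
```

With the infix test only data/part-0.parquet stays visible. The prefix-only variant would let data/part-0.parquet.__s4ver__/3 through, which is the stale-version leak described for ls/find/glob.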

Added

  • --gpu-batch-small-puts (opt-in, requires the nvcomp-gpu build +
    a CUDA-capable GPU at boot — the server refuses to start otherwise) —
    batch concurrent small PUTs into a single nvCOMP batched-zstd
    kernel launch so the GPU pays its fixed launch + PCIe cost once per
    batch instead of once per object. Eligibility: sampling dispatcher
    picked cpu-zstd, no --zstd-dict prefix match, declared
    Content-Length in the half-open range [--gpu-batch-floor-bytes
    (default 4 KiB), --gpu-min-bytes (default 1 MiB)). Companion knobs:
    --gpu-batch-max-items (flush at N pending bodies, default 32) and
    --gpu-batch-window-ms (flush after T ms, default 4 — also the
    worst-case latency the batch path adds to a PUT). Wire format is
    unchanged: batched objects are byte-layout-identical standard
    nvcomp-zstd bodies (same FCG1 framing + CodecKind::NvcompZstd
    manifest as the per-object GPU path; no new codec id, no new
    metadata) and the GET path has zero batch awareness — proven by
    GPU-gated tests that decompress batch output through the unmodified
    per-object path, plus a MinIO e2e (tests/gpu_batch_e2e.rs).
    Fail-open semantics: queue full (backpressure), GPU error, or a
    batched result that is not smaller than the input all fall back to
    the pre-existing cpu-zstd framed path — observable via the new
    s4_gpu_batch_total{result="batched"|"fallback"} counter. Measured
    on 1000 × 8 KiB log-like objects (RTX 4070 Ti SUPER, nvCOMP
    5.2.0.10): batched GPU = 29.7 ms vs 702 ms per-object GPU (~24×) vs
    15.7–19.5 ms single-thread cpu-zstd-3; GPU output ~10% smaller
    (12.31× vs 11.14× ratio). Honest verdict in README §"GPU small-PUT
    batching": this offloads CPU and improves ratio — it does not beat a
    free CPU core on raw wall time at 8 KiB. New public surface:
    s4_codec::nvcomp_batched::NvcompZstdBatchEncoder (feature-gated),
    s4_server::gpu_batch (aggregator + GpuBatchHandle),
    S4Service::with_gpu_batch, and the gpu_small_batch bench. Flag
    off (default) = bit-for-bit unchanged PUT behaviour.
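The eligibility rule and the two flush triggers for --gpu-batch-small-puts can be written down as plain predicates. This is a hedged Python sketch, not the gateway's code. BatchConfig, eligible and should_flush are hypothetical names, and it assumes the flush window is measured from the oldest pending body, which the notes imply but do not state.

```python
from dataclasses import dataclass
from typing import Optional

KIB = 1024
MIB = 1024 * KIB


@dataclass(frozen=True)
class BatchConfig:
    floor_bytes: int = 4 * KIB    # --gpu-batch-floor-bytes default
    gpu_min_bytes: int = 1 * MIB  # --gpu-min-bytes default
    max_items: int = 32           # --gpu-batch-max-items default
    window_ms: int = 4            # --gpu-batch-window-ms default


def eligible(picked_codec: str, has_dict_match: bool,
             content_length: Optional[int], cfg: BatchConfig = BatchConfig()) -> bool:
    """A PUT joins the batch only if all three stated conditions hold."""
    if picked_codec != "cpu-zstd" or has_dict_match or content_length is None:
        return False
    # Lower bound inclusive, upper bound exclusive.
    return cfg.floor_bytes <= content_length < cfg.gpu_min_bytes


def should_flush(pending_items: int, oldest_age_ms: float,
                 cfg: BatchConfig = BatchConfig()) -> bool:
    """Flush at N pending bodies, or once the oldest has waited T ms."""
    if pending_items == 0:
        return False
    return pending_items >= cfg.max_items or oldest_age_ms >= cfg.window_ms
```

Under these assumptions a lone 8 KiB PUT waits at most window_ms before it is flushed, which matches the stated worst-case added latency.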
  • s4 recompact <bucket>[/prefix] --endpoint-url <BACKEND> [--execute]:
    rewrite cpu-zstd framed objects at a higher zstd level during a quiet
    window (LSM-compaction for S3). The gateway's PUT path favours latency
    (--zstd-level, default 3); recompact decodes each S4-framed cpu-zstd
    object in-process (same FrameIter walk as the GET path — doubles as
    an integrity check on the stored frames), re-frames the original bytes
    with the same streaming_compress_to_frames + pick_chunk_size pair
    the PUT path uses at --target-zstd-level (default 19), and overwrites
    only when the new frames shrink the stored bytes by
    --min-gain-percent (default 3%). Rewritten objects are stamped with
    new s4-zstd-level metadata (recompact-only stamp — the gateway
    neither reads nor writes it), making re-runs idempotent
    (already-compacted skip) with no checkpoint file.
    --older-than <DUR> (30d / 12h / 45m / 90s) restricts the run
    to cold objects by backend LastModified. Dry-run by default;
    mandatory decompress-roundtrip byte comparison before every write (no
    off switch) and a pre-PUT HEAD ETag re-check (narrows, does not close,
    the concurrent-writer race). Skip taxonomy: not-s4 (run s4 migrate
    first) / already-compacted / unsupported-codec (passthrough,
    cpu-gzip, nvcomp-*, cpu-zstd-dict — this tool is cpu-zstd →
    cpu-zstd only) / unstamped-framed (audit round 1: backend-written
    frames without gateway metadata; opt in with
    --assume-unstamped-framed) / insufficient-gain / too-large
    (--max-body-bytes, default 5 GiB) / etag-raced / too-recent /
    tags-unreadable (audit round 2; --no-tags opts out of tag
    inheritance). Multi-frame rewrites
    refresh the <key>.s4index sidecar; single-frame rewrites delete a
    now-stale one. --concurrency (default 4), --max-objects,
    --format table|json; exit 1 iff any object failed. SSE-enabled
    deployments are rejected (same guard as migrate). New library module
    s4_server::recompact (run_recompact, RecompactParams,
    RecompactReport, RecompactError #[non_exhaustive],
    parse_duration_suffix). Additive only — no existing flag, metadata
    key, or default changed (s4-server internals: a handful of private
    migrate helpers became pub(crate) for reuse, behaviour unchanged).
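Two of recompact's gates are simple enough to sketch: the --older-than suffix grammar (30d / 12h / 45m / 90s) and the --min-gain-percent overwrite threshold. The Python below is an illustration only. It is not the parse_duration_suffix implementation from s4_server::recompact, and it assumes the threshold comparison is inclusive, which the notes do not specify.

```python
import re
from datetime import timedelta

_UNITS = {"d": "days", "h": "hours", "m": "minutes", "s": "seconds"}


def parse_duration(text: str) -> timedelta:
    """Parse '<integer><d|h|m|s>' into a timedelta."""
    match = re.fullmatch(r"(\d+)([dhms])", text)
    if match is None:
        raise ValueError(f"expected <N>d|h|m|s, got {text!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def worth_rewriting(stored_bytes: int, new_bytes: int,
                    min_gain_percent: float = 3.0) -> bool:
    """Overwrite only when the new frames shrink the stored bytes enough."""
    if stored_bytes <= 0:
        return False
    saved = stored_bytes - new_bytes
    # Compare saved/stored against the percentage without dividing.
    return saved * 100 >= min_gain_percent * stored_bytes
```

For example, a 1000-byte object recompressed to 900 bytes clears the default 3% bar, while one that only drops to 990 bytes would be reported as insufficient-gain.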
  • s4 estimate <bucket>[/prefix] --endpoint-url <BACKEND> — read-only
    pre-deployment savings simulator. Lists the bucket (.s4index excluded,
    capped at --max-list-keys), stratifies objects by extension, samples
    --samples-per-stratum objects per stratum (size-weighted, deterministic
    under --seed), compresses the sampled bytes with the same
    SamplingDispatcher pick the gateway would make at PUT time (honoring
    --codec / --dispatcher / --zstd-level / --gpu-min-bytes /
    --prefer-columnar-gpu), and extrapolates projected storage bytes and
    $/month (--price-per-gb-month, default 0.023). --format table|json.
    Never executes GPU codecs: nvcomp-* picks are measured via a cpu-zstd
    proxy with an explicit report note. New library module
    s4_server::estimate (run_estimate, EstimateParams,
    EstimateReport, EstimateError #[non_exhaustive]). Additive only —
    no existing flag or default changed.
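The estimate pipeline (stratify by extension, take a size-weighted deterministic sample, extrapolate bytes and $/month) can be outlined in a few lines of Python. This is a sketch under stated assumptions, not the s4_server::estimate code: the weighted sampler is the Efraimidis-Spirakis method chosen for illustration, 1 GB is taken as 10^9 bytes, and all function names are hypothetical.

```python
import random
from collections import defaultdict
from pathlib import PurePosixPath


def stratify(objects):
    """Group (key, size) pairs by lower-cased file extension."""
    strata = defaultdict(list)
    for key, size in objects:
        ext = PurePosixPath(key).suffix.lower() or "(none)"
        strata[ext].append((key, size))
    return dict(strata)


def sample_stratum(items, k, seed):
    """Size-weighted sample without replacement, reproducible per seed."""
    rng = random.Random(seed)
    # Larger objects tend to draw keys closer to 1 and are picked more often.
    keyed = [(rng.random() ** (1.0 / max(size, 1)), key, size)
             for key, size in sorted(items)]
    keyed.sort(reverse=True)
    return [(key, size) for _, key, size in keyed[:k]]


def project(total_bytes, sampled_original, sampled_compressed,
            price_per_gb_month=0.023):
    """Scale the sampled ratio to the whole stratum; return (bytes, $/month)."""
    if sampled_original <= 0:
        raise ValueError("no sampled bytes to extrapolate from")
    projected = total_bytes * sampled_compressed / sampled_original
    return projected, projected / 1e9 * price_per_gb_month
```

Sorting the items before drawing keeps the sample independent of listing order, so the same seed gives the same sample across runs.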
  • s4 migrate <bucket>[/prefix] --endpoint-url <BACKEND> [--execute]:
    bulk retro-compression of pre-existing objects into the gateway's S4F2
    framed format (same SamplingDispatcher decision, same
    streaming_compress_to_frames framing + chunk-size policy, same
    s4-codec/s4-framed metadata and <key>.s4index sidecar contract as
    the PUT path — gateway GETs decompress migrated objects transparently).
    Dry-run by default; --execute to write. Already-S4 objects (frame
    magic or s4-codec metadata) are skipped, so re-runs resume
    automatically without a checkpoint file. Every write requires an
    in-process decompress-roundtrip byte comparison (no off switch) and a
    pre-PUT HEAD ETag re-check (narrows, does not close, the concurrent-
    writer race — documented). Skip taxonomy: already-s4 /
    not-compressible (passthrough pick or no size gain; object untouched)
    / too-large (--max-body-bytes, default 5 GiB) / etag-raced /
    tags-unreadable (audit round 2; --no-tags opts out of tag
    inheritance). A roundtrip-verify failure is a hard failure (exit 1)
    since the round-1 audit — the skipped_verify_failed JSON field
    remains for shape compatibility but is always 0.
    --concurrency (default 4), --max-objects,
    --format table|json; exit 1 iff any object failed. GPU / cpu-gzip
    dispatcher picks are actually written with cpu-zstd at --zstd-level
    (reported as picked != wrote_with). SSE-configured invocations are
    rejected; versioning-Enabled buckets get a double-billing WARNING
    note. New library module s4_server::migrate (run_migrate,
    MigrateParams, MigrateReport, MigrateError / SkipReason
    #[non_exhaustive]). Additive only — no existing flag, default, or
    PUT/GET behavior changed.
  • Shared zstd dictionaries for small objects (s4 train-dict +
    --zstd-dict) — new codec cpu-zstd-dict (codec id 8; additive:
    the S4F2 frame layout is unchanged, only a new id is allocated).
    s4 train-dict <bucket>/<prefix> --endpoint-url <BACKEND> [--max-samples 1000] [--max-dict-bytes 112640] [--min-samples 8] [--sample-max-bytes 65536] samples small raw objects under the prefix
    (already-S4 bodies skipped), trains a stock zstd dictionary
    (zstd::dict::from_samples / ZDICT), stores it at the content-addressed
    in-bucket object .s4dict/<dict-id> (<dict-id> = first 16 hex characters of the
    dictionary's SHA-256; immutable, idempotent re-train), and prints the
    gateway flag. The gateway flag --zstd-dict '<bucket>/<key-prefix>=<dict-id>' (repeatable; dictionaries fetched +
    fingerprint-verified at boot, missing dict = boot error) makes
    single-PUT cpu-zstd bodies ≤ --zstd-dict-max-bytes (default 1 MiB)
    whose key longest-prefix-matches compress against the dictionary —
    only when it actually beats dict-less cpu-zstd (both compressed and
    compared per small PUT; ties / losses fall back to a plain cpu-zstd
    frame with no dict reference). The dictionary id travels in the new
    s4-dict-id object-metadata key, never in the frame. GETs resolve the
    dictionary preloaded → LRU → lazy backend fetch of .s4dict/<id>
    (fingerprint-verified, ~16-entry cache), so a gateway booted without
    the flag still reads dict-compressed objects; fetch failures are 5xx +
    the new s4_dict_fetch_total{result} counter. .s4dict/ keys are
    hidden from gateway listings (same treatment as .s4index /
    .__s4ver__/). Measured on the minio E2E (100 × ~300-byte
    same-schema JSON events): 8 903 bytes stored vs 21 923 dict-less =
    2.46×. No lock-in: the payload is a stock zstd frame and .s4dict/<id>
    is raw zstd dictionary bytes — zstd -D <dictfile> -d decodes without
    any S4 software (pinned by the E2E against the real CLI). New modules
    s4_codec::cpu_zstd_dict (CpuZstdDict, train_from_samples,
    blocking helpers) and s4_server::dict (DictStore, DictCache,
    run_train_dict, …). Compatibility note: pre-v1.1 readers fail a
    GET of a cpu-zstd-dict object with the existing unknown codec id
    error (graceful typed failure, no silent corruption) — roll mixed
    fleets forward before enabling the flag. Multipart parts and
    s4-codec-wasm native decode are out of scope (follow-ups);
    s4-codec-py decodes dict objects via the CpuZstdDict binding
    added in this release, and cross-bucket CopyObject propagates the
    dictionary (see Fixed). Without --zstd-dict, PUT/GET behavior is
    bit-for-bit unchanged.
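Three mechanics in the dictionary entry are easy to show concretely: the content-addressed <dict-id>, the longest-prefix match over repeated --zstd-dict values, and the rule that the dictionary is used only when it wins. The Python sketch below uses only hashlib and takes the two compressed sizes as inputs instead of running zstd. All helper names are hypothetical and the flag parsing is an assumption about the '<bucket>/<key-prefix>=<dict-id>' shape.

```python
import hashlib


def dict_id(dict_bytes: bytes) -> str:
    """First 16 hex characters of the dictionary's SHA-256."""
    return hashlib.sha256(dict_bytes).hexdigest()[:16]


def parse_flag(value: str):
    """Split '<bucket>/<key-prefix>=<dict-id>' into its three parts."""
    target, _, did = value.rpartition("=")
    bucket, _, prefix = target.partition("/")
    return bucket, prefix, did


def match_dict(rules, bucket: str, key: str):
    """Return the dict-id of the longest matching key prefix, or None."""
    best = None
    for rule_bucket, prefix, did in rules:
        if rule_bucket == bucket and key.startswith(prefix):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, did)
    return best[1] if best else None


def pick_codec(plain_zstd_len: int, dict_zstd_len: int) -> str:
    """Ties and losses fall back to a plain cpu-zstd frame."""
    return "cpu-zstd-dict" if dict_zstd_len < plain_zstd_len else "cpu-zstd"
```

Because the id is derived from the bytes, re-training on identical samples maps to the same .s4dict/<dict-id> key, which is what makes the store idempotent.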
  • s4fs — fsspec filesystem for reading S4 objects without the gateway
    (new pure-Python package python/s4fs/, protocol
    s4://). pandas / pyarrow / DuckDB / Polars read gateway-written
    objects straight off the backend: S4F2 frames are decoded transparently
    (passthrough / cpu-zstd / cpu-gzip / cpu-zstd-dict, with
    .s4dict/<id> fetch + SHA-256-fingerprint verify), unframed
    metadata-manifest objects (cpu-gzip, legacy raw zstd) decode via the
    s4-codec / s4-original-size / s4-crc32c stamps, and non-S4
    objects pass through byte-for-byte. ls / info hide .s4index /
    .s4dict/ / .__s4ver__/ internals and report original
    (decompressed) sizes (sidecar → s4-original-size metadata →
    compressed size with s4_size_exact: False). Range reads / seeks use
    the .s4index sidecar (with source-ETag staleness check) to fetch only
    the overlapping frames; verified by the MinIO e2e to transfer fewer
    backend bytes than a full read. Read-only by design — every write API
    raises NotImplementedError("s4fs is read-only; write through the S4 gateway"); GPU frames (nvcomp-* / dietgpu-ans) raise
    NotImplementedError instead of decoding wrong. The underlying
    filesystem defaults to s3fs ([s3] extra) and is injectable
    (S4FileSystem(fs=...)). Unit fixtures are real gateway-written bytes
    captured off MinIO (tests/fixtures/generate_fixtures.py); e2e
    (pytest -m e2e) covers pandas / pyarrow / DuckDB round-trips against
    MinIO + the real gateway.
  • s4-codec Python binding: wire-format read helpers (additive,
    crates/s4-codec-py) — read_frame(bytes) / frame_iter(bytes)
    (S4F2 frame parse, S4P1 padding skipped; header dicts carry
    codec / original_size / compressed_size / crc32c),
    decode_index(bytes) (.s4index sidecar v1/v2/v3 → dict with
    entries / total_original_size / source_etag / sse),
    crc32c(bytes), the CpuZstdDict(dict_bytes, level=3) codec class
    (same compress / decompress shape as CpuZstd), module constants
    FRAME_MAGIC / PADDING_MAGIC / FRAME_HEADER_BYTES /
    SIDECAR_SUFFIX, and exception classes S4FrameError / S4IndexError
    (⊂ S4Error). Existing API unchanged.