Release v1.1.0 — adoption tooling + small-object compression · abyo-software/s4

v1.1 — adoption tooling + small-object compression. Six additive
features (s4 estimate / s4 migrate / zstd dictionaries +
s4 train-dict / s4fs fsspec adapter / s4 recompact / GPU batched
small-PUT compression) hardened by a 3-round dual-reviewer audit
(Claude ×3 + Codex; findings 20 → 7 → 5, P1/P2 zero at round 3). The
v1.0 freeze contract holds: every change below is additive and
default-off; flag-less PUT/GET behavior is bit-for-bit unchanged.

Fixed (audit round 2 — adversarial verification of the round-1 fix wave)

P2 CreateMultipartUpload now strips client-supplied s4-*
metadata like put_object does — a forged x-amz-meta-s4-encrypted
could otherwise survive onto a completed multipart object and 5xx a
flag-less GET (multipart re-open of the round-1 PUT fix).
P2 migrate / recompact no longer hard-fail every object when
GetObjectTagging is denied or unimplemented: such objects skip as
tags-unreadable (data is never rewritten tag-less), NoSuchTagSet
counts as "no tags", and a new --no-tags flag opts out of tag
inheritance entirely. Transient tagging errors still fail hard.
P2 Version-pinned CopyObject (?versionId=) probes the pinned
source version — not the latest — for both the REPLACE metadata merge
and cross-bucket dictionary propagation.
P3 Dictionary size cap (1 MiB) is now one consistent contract:
train-dict --max-dict-bytes and --zstd-dict boot preload reject
what a flag-less gateway's lazy fetch would refuse.
P3 Boot-preloaded dictionaries are bucket-scoped, fetched per
(bucket, id) with s4-dict-sha256 verification, and the server
refuses to boot when one dict-id resolves to different bytes across
buckets (16-hex prefix collision).
P3 s4 estimate excludes already-S4 objects (gateway metadata or
S4F2/S4P1/S4E* magic) from sampling so re-estimating a
gateway-operated bucket doesn't measure framed/encrypted bytes as if
they were compressible plaintext (already_s4 count + note).
P3 (s4fs) The sidecar staleness check reuses a cached live-info
snapshot instead of issuing a second backend HEAD per info().
Trade-off disclosed: external overwrites during one filesystem
instance's lifetime are detected on the next invalidate_cache() /
new instance, not per-read (same contract as the metadata cache).
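
The tagging behavior described in this round reads as a small decision table. The sketch below is illustrative Python, not the tools' actual Rust code: the function name, the returned labels other than tags-unreadable, and the exact set of error codes treated as "denied or unimplemented" are assumptions.

```python
from typing import Optional

# Assumed S3 error codes for "denied or unimplemented"; the real tools
# may match a different set.
DENIED_OR_UNIMPLEMENTED = {"AccessDenied", "NotImplemented"}


def classify_tagging_result(error_code: Optional[str], no_tags: bool = False) -> str:
    """Map a GetObjectTagging outcome to an action for migrate / recompact.

    error_code is None when the call succeeded.
    """
    if no_tags:
        # --no-tags opts out of tag inheritance; this sketch assumes the
        # tagging call is then irrelevant to the outcome.
        return "rewrite-without-tags"
    if error_code is None:
        return "inherit-tags"
    if error_code == "NoSuchTagSet":
        # Counts as "no tags": the rewrite proceeds with nothing to inherit.
        return "no-tags"
    if error_code in DENIED_OR_UNIMPLEMENTED:
        # Data is never rewritten tag-less: the object is skipped instead.
        return "skip:tags-unreadable"
    # Anything else is treated as transient and still fails hard.
    return "fail"
```

With this shape, a backend that does not implement tagging produces skips rather than a run where every object fails.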

Fixed (audit round 3 — convergence check)

P3 s4 estimate's already-S4 body detection is structurally
validated (known codec id + payload fits the object for S4F2,
plausible padding length for S4P1) so customer data that merely
starts with the 4-byte magic isn't silently dropped from sampling.
P3 README/CHANGELOG drift from the round-1/2 fixes corrected:
dictionary 1 MiB cap is documented as one three-surface contract,
migrate/recompact sample outputs show the full current skip taxonomy,
--no-tags / tags-unreadable / already-s4 estimate exclusions
documented.

Fixed (audit round 1 — 4 reviewers over v1.0.0..HEAD, 2026-06-11)

P1 s4 migrate could rewrite .s4dict/<id> dictionary objects as
S4F2-framed data, breaking every cpu-zstd-dict object in the bucket
(lazy fetch fails fingerprint verification). All three bulk tools
(estimate / migrate / recompact) now exclude S4-internal keys:
*.s4index, .s4dict/, and *.__s4ver__/* versioning shadows.
P1 A client-supplied x-amz-meta-s4-dict-id on a plain PUT made
the subsequent GET fail 5xx even with --zstd-dict unset (default-off
behavior regression). The GET dict branch is now gated on the
gateway-managed manifest codec (cpu-zstd-dict), and put_object
strips client-supplied s4-* metadata keys up front.
P1 (s4fs) SSE-encrypted objects could return AES-GCM ciphertext
bytes silently (passthrough + SSE). s4fs now refuses with
NotImplementedError via three layers: s4-encrypted metadata,
sidecar SSE binding, and S4E1–S4E6 magic sniff.
P1 (s4fs) <key>.__s4ver__/<version> shadow objects were not
hidden from ls/find/glob (prefix check instead of infix), so
directory dataset scans could silently include stale versions.
P2 migrate / recompact rewrites dropped the source object's
storage class (silent promotion to STANDARD) and object tags; both
are now inherited. ACLs / Object Lock retention remain uninherited
(stated in report notes).
P2 migrate treated a roundtrip-verify failure as a skip
(exit 0); it is now a hard failure (exit 1), matching recompact.
The skipped_verify_failed JSON field remains (always 0) for shape
compatibility.
P2 Cross-bucket CopyObject of a dict-compressed object now
propagates .s4dict/<id> to the destination bucket (idempotent,
content-addressed); previously the copy succeeded but every GET on
the destination failed 5xx.
P2 .s4dict/ joined the reserved-key guard: gateway PUT / DELETE
are rejected with InvalidObjectName (reads still allowed) so a
bucket-wide dictionary can't be destroyed through the data path.
P2 (s4fs) info() no longer trusts a stale sidecar for object
size (staleness-checked first), and binding-less legacy v1 sidecars
are no longer used for size or partial range reads.
P2 (s4fs) dependency floor corrected to s4-codec>=1.1.0,<2 —
the binding APIs s4fs imports don't exist in the 1.0.0 wheel.
P3 estimate no longer aborts the whole run when a sampled
object 404s mid-run (skip + note); module/report now disclose the
single-stream measurement bias vs the server's 4 MiB chunking.
P3 migrate / recompact enforce --max-body-bytes from the
GET Content-Length before buffering; migrate now also cleans up a
stale multi-frame sidecar when its rewrite comes out single-frame.
P3 recompact no longer auto-promotes backend-written framed
objects that lack gateway metadata (unstamped-framed skip; opt back
in with --assume-unstamped-framed).
P3 Dict hardening: DictCache is bucket-scoped, train-dict
stamps s4-dict-sha256 (full-digest verification when present), and
lazy fetch caps dictionaries at 1 MiB. (s4fs) open() on a framed
object with inexact size raises instead of silently truncating
(allow_inexact_open=True restores the old clamp).
P3 nvcomp_batched validates device-reported chunk sizes on the
host before the unsafe copy (typed per-item error instead of a
potential OOB read on driver misbehavior).
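
Several round-1 fixes come down to recognizing S4-internal keys consistently. A minimal Python sketch of such a filter follows. The function name is hypothetical, and treating .s4dict/ as a bucket-root prefix is an assumption based on the .s4dict/<id> object path used throughout these notes.

```python
def is_s4_internal_key(key: str) -> bool:
    """Return True for keys the bulk tools and s4fs listings should ignore."""
    # Frame-index sidecars: <key>.s4index
    if key.endswith(".s4index"):
        return True
    # Dictionary objects: .s4dict/<dict-id> (assumed to live at the bucket root)
    if key.startswith(".s4dict/"):
        return True
    # Versioning shadows: <key>.__s4ver__/<version>. The marker sits in the
    # middle of the key, so this must be an infix check, not a prefix check.
    return ".__s4ver__/" in key
```

The last line is the point of the s4fs listing fix: a prefix test never matches logs/a.json.__s4ver__/v1, so the shadow object leaks into ls/find/glob results.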

Added

--gpu-batch-small-puts (opt-in, requires the nvcomp-gpu build +
a CUDA-capable GPU at boot — the server refuses to start otherwise) —
batch concurrent small PUTs into a single nvCOMP batched-zstd
kernel launch so the GPU pays its fixed launch + PCIe cost once per
batch instead of once per object. Eligibility: sampling dispatcher
picked cpu-zstd, no --zstd-dict prefix match, declared
Content-Length in [--gpu-batch-floor-bytes (default 4 KiB), --gpu-min-bytes (default 1 MiB)). Companion knobs:
--gpu-batch-max-items (flush at N pending bodies, default 32) and
--gpu-batch-window-ms (flush after T ms, default 4 — also the
worst-case latency the batch path adds to a PUT). Wire format is
unchanged: batched objects are byte-layout-identical standard
nvcomp-zstd bodies (same FCG1 framing + CodecKind::NvcompZstd
manifest as the per-object GPU path; no new codec id, no new
metadata) and the GET path has zero batch awareness — proven by
GPU-gated tests that decompress batch output through the unmodified
per-object path, plus a MinIO e2e (tests/gpu_batch_e2e.rs).
Fail-open semantics: queue full (backpressure), GPU error, or a
batched result that is not smaller than the input all fall back to
the pre-existing cpu-zstd framed path — observable via the new
s4_gpu_batch_total{result="batched"|"fallback"} counter. Measured
on 1000 × 8 KiB log-like objects (RTX 4070 Ti SUPER, nvCOMP
5.2.0.10): batched GPU = 29.7 ms vs 702 ms per-object GPU (~24×) vs
15.7–19.5 ms single-thread cpu-zstd-3; GPU output ~10% smaller
(12.31× vs 11.14× ratio). Honest verdict in README §"GPU small-PUT
batching": this offloads CPU and improves ratio — it does not beat a
free CPU core on raw wall time at 8 KiB. New public surface:
s4_codec::nvcomp_batched::NvcompZstdBatchEncoder (feature-gated),
s4_server::gpu_batch (aggregator + GpuBatchHandle),
S4Service::with_gpu_batch, and the gpu_small_batch bench. Flag
off (default) = bit-for-bit unchanged PUT behavior.
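
The eligibility rule above can be written as a single predicate. This is an illustrative Python sketch rather than the server's code: the function and parameter names are hypothetical, and only the codec name, the two defaults, and the half-open size interval come from the text.

```python
from typing import Optional

KIB = 1024
MIB = 1024 * KIB


def is_gpu_batch_eligible(
    picked_codec: str,
    dict_prefix_match: bool,
    content_length: Optional[int],
    floor_bytes: int = 4 * KIB,    # --gpu-batch-floor-bytes default
    gpu_min_bytes: int = 1 * MIB,  # --gpu-min-bytes default
) -> bool:
    """Decide whether a PUT may join a GPU batch."""
    if content_length is None:
        # The rule requires a declared Content-Length.
        return False
    return (
        picked_codec == "cpu-zstd"
        and not dict_prefix_match
        # Half-open interval: floor inclusive, per-object GPU threshold exclusive.
        and floor_bytes <= content_length < gpu_min_bytes
    )
```

For example, is_gpu_batch_eligible("cpu-zstd", False, 8 * KIB) is true, while a 1 MiB body is not eligible because it already qualifies for the per-object GPU path.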
s4 recompact <bucket>[/prefix] --endpoint-url <BACKEND> [--execute] —
rewrite cpu-zstd framed objects at a higher zstd level during a quiet
window (LSM-compaction for S3). The gateway's PUT path favors latency
(--zstd-level, default 3); recompact decodes each S4-framed cpu-zstd
object in-process (same FrameIter walk as the GET path — doubles as
an integrity check on the stored frames), re-frames the original bytes
with the same streaming_compress_to_frames + pick_chunk_size pair
the PUT path uses at --target-zstd-level (default 19), and overwrites
only when the new frames shrink the stored bytes by
--min-gain-percent (default 3%). Rewritten objects are stamped with
new s4-zstd-level metadata (recompact-only stamp — the gateway
neither reads nor writes it), making re-runs idempotent
(already-compacted skip) with no checkpoint file.
--older-than <DUR> (30d / 12h / 45m / 90s) restricts the run
to cold objects by backend LastModified. Dry-run by default;
mandatory decompress-roundtrip byte comparison before every write (no
off switch) and a pre-PUT HEAD ETag re-check (narrows, does not close,
the concurrent-writer race). Skip taxonomy: not-s4 (run s4 migrate
first) / already-compacted / unsupported-codec (passthrough,
cpu-gzip, nvcomp-*, cpu-zstd-dict — this tool is cpu-zstd →
cpu-zstd only) / unstamped-framed (audit round 1: backend-written
frames without gateway metadata; opt in with
--assume-unstamped-framed) / insufficient-gain / too-large
(--max-body-bytes, default 5 GiB) / etag-raced / too-recent /
tags-unreadable (audit round 2; --no-tags opts out of tag
inheritance). Multi-frame rewrites
refresh the <key>.s4index sidecar; single-frame rewrites delete a
now-stale one. --concurrency (default 4), --max-objects,
--format table|json; exit 1 iff any object failed. SSE-enabled
deployments are rejected (same guard as migrate). New library module
s4_server::recompact (run_recompact, RecompactParams,
RecompactReport, RecompactError #[non_exhaustive],
parse_duration_suffix). Additive only — no existing flag, metadata
key, or default changed (s4-server internals: a handful of private
migrate helpers became pub(crate) for reuse, behavior unchanged).
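
Two of the recompact rules are easy to show in isolation: the --older-than suffix grammar and the --min-gain-percent overwrite test. The Python below is a sketch under stated assumptions, not a port of s4_server::recompact. It assumes exactly one integer followed by one of d/h/m/s, that gain is measured against the currently stored bytes, and that the threshold is inclusive.

```python
import re
from datetime import timedelta

_UNITS = {"d": "days", "h": "hours", "m": "minutes", "s": "seconds"}


def parse_duration_suffix(text: str) -> timedelta:
    """Parse values such as 30d, 12h, 45m, or 90s."""
    match = re.fullmatch(r"([0-9]+)([dhms])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(value)})


def should_overwrite(
    stored_bytes: int, recompacted_bytes: int, min_gain_percent: float = 3.0
) -> bool:
    """True when the rewrite shrinks the stored bytes by at least the threshold."""
    if stored_bytes <= 0:
        return False
    gain_percent = (stored_bytes - recompacted_bytes) * 100.0 / stored_bytes
    # Inclusive comparison is an assumption; the notes do not say which way
    # an exact 3% gain falls.
    return gain_percent >= min_gain_percent
```

An object stored at 1000 bytes that recompacts to 980 bytes gains 2% and would be reported as insufficient-gain under the default.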
s4 estimate <bucket>[/prefix] --endpoint-url <BACKEND> — read-only
pre-deployment savings simulator. Lists the bucket (.s4index excluded,
capped at --max-list-keys), stratifies objects by extension, samples
--samples-per-stratum objects per stratum (size-weighted, deterministic
under --seed), compresses the sampled bytes with the same
SamplingDispatcher pick the gateway would make at PUT time (honoring
--codec / --dispatcher / --zstd-level / --gpu-min-bytes /
--prefer-columnar-gpu), and extrapolates projected storage bytes and
$/month (--price-per-gb-month, default 0.023). --format table|json.
Never executes GPU codecs: nvcomp-* picks are measured via a cpu-zstd
proxy with an explicit report note. New library module
s4_server::estimate (run_estimate, EstimateParams,
EstimateReport, EstimateError #[non_exhaustive]). Additive only —
no existing flag or default changed.
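
The extrapolation step can be illustrated with a few lines of arithmetic. This Python sketch is an assumption about the shape of the calculation, not the estimate module's implementation: it scales each stratum by the compression ratio observed on its samples, and it assumes 2**30 bytes per billed GB. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class StratumSample:
    total_bytes: int               # bytes listed for this extension
    sampled_original_bytes: int    # raw size of the sampled objects
    sampled_compressed_bytes: int  # size after the dispatcher's codec pick


def project_savings(
    strata: Iterable[StratumSample],
    price_per_gb_month: float = 0.023,  # --price-per-gb-month default
    bytes_per_gb: int = 2**30,          # assumed billing unit
) -> Dict[str, float]:
    strata = list(strata)
    current = float(sum(s.total_bytes for s in strata))
    projected = 0.0
    for s in strata:
        if s.sampled_original_bytes == 0:
            # Nothing sampled: conservatively assume no savings.
            projected += s.total_bytes
        else:
            ratio = s.sampled_compressed_bytes / s.sampled_original_bytes
            projected += s.total_bytes * ratio
    return {
        "current_bytes": current,
        "projected_bytes": projected,
        "current_usd_month": current / bytes_per_gb * price_per_gb_month,
        "projected_usd_month": projected / bytes_per_gb * price_per_gb_month,
    }
```

Under these assumptions, a 10 GiB stratum whose samples compress 4:1 projects to 2.5 GiB, or about $0.0575 per month at the default price.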
s4 migrate <bucket>[/prefix] --endpoint-url <BACKEND> [--execute] —
bulk retro-compression of pre-existing objects into the gateway's S4F2
framed format (same SamplingDispatcher decision, same
streaming_compress_to_frames framing + chunk-size policy, same
s4-codec/s4-framed metadata and <key>.s4index sidecar contract as
the PUT path — gateway GETs decompress migrated objects transparently).
Dry-run by default; --execute to write. Already-S4 objects (frame
magic or s4-codec metadata) are skipped, so re-runs resume
automatically without a checkpoint file. Every write requires an
in-process decompress-roundtrip byte comparison (no off switch) and a
pre-PUT HEAD ETag re-check (narrows, does not close, the concurrent-
writer race — documented). Skip taxonomy: already-s4 /
not-compressible (passthrough pick or no size gain; object untouched)
/ too-large (--max-body-bytes, default 5 GiB) / etag-raced /
tags-unreadable (audit round 2; --no-tags opts out of tag
inheritance). A roundtrip-verify failure is a hard failure (exit 1)
since the round-1 audit — the skipped_verify_failed JSON field
remains for shape compatibility but is always 0.
--concurrency (default 4), --max-objects,
--format table|json; exit 1 iff any object failed. GPU / cpu-gzip
dispatcher picks fall back to cpu-zstd at --zstd-level
(reported as picked != wrote_with). SSE-configured invocations are
rejected; versioning-Enabled buckets get a double-billing WARNING
note. New library module s4_server::migrate (run_migrate,
MigrateParams, MigrateReport, MigrateError / SkipReason
#[non_exhaustive]). Additive only — no existing flag, default, or
PUT/GET behavior changed.
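
The mandatory roundtrip check is the safety property that matters most here, so a sketch may help. The Python below uses zlib from the standard library purely as a stand-in for the zstd framing; the function name, exception name, and the order of the two checks are assumptions.

```python
import zlib
from typing import Callable, Optional, Tuple


class RoundtripVerifyError(Exception):
    """Decoded bytes differ from the source object (a hard failure)."""


def plan_rewrite(
    original: bytes,
    compress: Callable[[bytes], bytes] = zlib.compress,      # stand-in codec
    decompress: Callable[[bytes], bytes] = zlib.decompress,  # stand-in codec
) -> Tuple[str, Optional[bytes]]:
    """Return ("write", body) or ("not-compressible", None)."""
    compressed = compress(original)
    if len(compressed) >= len(original):
        # No size gain: the object is left untouched.
        return "not-compressible", None
    # Decode in-process and compare byte-for-byte before any write.
    if decompress(compressed) != original:
        raise RoundtripVerifyError("roundtrip mismatch; object left untouched")
    return "write", compressed
```

Raising instead of returning a skip label mirrors the round-1 change: a verification mismatch should surface as a failed run, not as a quiet skip.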
Shared zstd dictionaries for small objects (s4 train-dict +
--zstd-dict) — new codec cpu-zstd-dict (codec id 8; additive:
the S4F2 frame layout is unchanged, only a new id is allocated).
s4 train-dict <bucket>/<prefix> --endpoint-url <BACKEND> [--max-samples 1000] [--max-dict-bytes 112640] [--min-samples 8] [--sample-max-bytes 65536] samples small raw objects under the prefix
(already-S4 bodies skipped), trains a stock zstd dictionary
(zstd::dict::from_samples / ZDICT), stores it at the content-addressed
in-bucket object .s4dict/<dict-id> (<dict-id> = first 16 hex of the
dictionary's SHA-256; immutable, idempotent re-train), and prints the
gateway flag. The gateway flag --zstd-dict '<bucket>/<key-prefix>=<dict-id>' (repeatable; dictionaries fetched +
fingerprint-verified at boot, missing dict = boot error) makes
single-PUT cpu-zstd bodies ≤ --zstd-dict-max-bytes (default 1 MiB)
whose key longest-prefix-matches compress against the dictionary —
only when it actually beats dict-less cpu-zstd (both compressed and
compared per small PUT; ties / losses fall back to a plain cpu-zstd
frame with no dict reference). The dictionary id travels in the new
s4-dict-id object-metadata key, never in the frame. GETs resolve the
dictionary preloaded → LRU → lazy backend fetch of .s4dict/<id>
(fingerprint-verified, ~16-entry cache), so a gateway booted without
the flag still reads dict-compressed objects; fetch failures are 5xx +
the new s4_dict_fetch_total{result} counter. .s4dict/ keys are
hidden from gateway listings (same treatment as .s4index /
.__s4ver__/). Measured on the MinIO E2E (100 × ~300-byte
same-schema JSON events): 8 903 bytes stored vs 21 923 dict-less =
2.46×. No lock-in: the payload is a stock zstd frame and .s4dict/<id>
is raw zstd dictionary bytes — zstd -D <dictfile> -d decodes without
any S4 software (pinned by the E2E against the real CLI). New modules
s4_codec::cpu_zstd_dict (CpuZstdDict, train_from_samples,
blocking helpers) and s4_server::dict (DictStore, DictCache,
run_train_dict, …). Compatibility note: pre-v1.1 readers fail a
GET of a cpu-zstd-dict object with the existing unknown codec id
error (graceful typed failure, no silent corruption) — roll mixed
fleets forward before enabling the flag. Multipart parts and
s4-codec-wasm native decode are out of scope (follow-ups);
s4-codec-py decodes dict objects via the CpuZstdDict binding
added in this release, and cross-bucket CopyObject propagates the
dictionary (see Fixed). Without --zstd-dict, PUT/GET behavior is
bit-for-bit unchanged.
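
Two mechanics in this entry are simple enough to sketch: deriving a dict-id from dictionary bytes, and longest-prefix matching of --zstd-dict rules. The Python below is illustrative only. The helper names are hypothetical, and splitting the flag at the last "=" and the first "/" is an assumption about how the value is parsed.

```python
import hashlib
from typing import Iterable, Optional, Tuple

Rule = Tuple[str, str, str]  # (bucket, key_prefix, dict_id)


def dict_id(dict_bytes: bytes) -> str:
    """First 16 hex characters of the dictionary's SHA-256."""
    return hashlib.sha256(dict_bytes).hexdigest()[:16]


def parse_zstd_dict_flag(value: str) -> Rule:
    """Parse '<bucket>/<key-prefix>=<dict-id>'."""
    target, sep, did = value.rpartition("=")
    if not sep or not target or not did:
        raise ValueError(f"malformed --zstd-dict value: {value!r}")
    bucket, _, prefix = target.partition("/")
    return bucket, prefix, did


def match_dict(rules: Iterable[Rule], bucket: str, key: str) -> Optional[str]:
    """Return the dict-id of the longest matching key prefix, if any."""
    best: Optional[Tuple[str, str]] = None
    for rule_bucket, prefix, did in rules:
        if rule_bucket == bucket and key.startswith(prefix):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, did)
    return best[1] if best else None
```

Because the id is derived from the content, re-training on identical samples resolves to the same .s4dict/<dict-id> object, which is what makes the store idempotent.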
s4fs — fsspec filesystem for reading S4 objects without the gateway
(new pure-Python package python/s4fs/, protocol
s4://). pandas / pyarrow / DuckDB / Polars read gateway-written
objects straight off the backend: S4F2 frames are decoded transparently
(passthrough / cpu-zstd / cpu-gzip / cpu-zstd-dict, with
.s4dict/<id> fetch + SHA-256-fingerprint verify), unframed
metadata-manifest objects (cpu-gzip, legacy raw zstd) decode via the
s4-codec / s4-original-size / s4-crc32c stamps, and non-S4
objects pass through byte-for-byte. ls / info hide .s4index /
.s4dict/ / .__s4ver__/ internals and report original
(decompressed) sizes (sidecar → s4-original-size metadata →
compressed size with s4_size_exact: False). Range reads / seeks use
the .s4index sidecar (with source-ETag staleness check) to fetch only
the overlapping frames; verified by the MinIO e2e to transfer fewer
backend bytes than a full read. Read-only by design — every write API
raises NotImplementedError("s4fs is read-only; write through the S4 gateway"); GPU frames (nvcomp-* / dietgpu-ans) raise
NotImplementedError instead of decoding wrong. The underlying
filesystem defaults to s3fs ([s3] extra) and is injectable
(S4FileSystem(fs=...)). Unit fixtures are real gateway-written bytes
captured off MinIO (tests/fixtures/generate_fixtures.py); e2e
(pytest -m e2e) covers pandas / pyarrow / DuckDB round-trips against
MinIO + the real gateway.
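
The range-read behavior depends on selecting only the frames that overlap the requested bytes. A minimal Python sketch of that selection follows. The entry fields are hypothetical (the real .s4index layout is not described in these notes), and the request is taken as a half-open range in decompressed offsets.

```python
from typing import List, NamedTuple


class IndexEntry(NamedTuple):
    # Hypothetical per-frame fields, for illustration only.
    original_offset: int    # where the frame starts in the decompressed object
    original_size: int      # decompressed length of the frame
    compressed_offset: int  # where the frame starts in the stored object
    compressed_size: int    # stored length of the frame


def frames_for_range(entries: List[IndexEntry], start: int, end: int) -> List[IndexEntry]:
    """Return the frames overlapping the decompressed byte range [start, end)."""
    return [
        e
        for e in entries
        if e.original_offset < end and e.original_offset + e.original_size > start
    ]
```

A reader then fetches just the compressed spans of the returned entries, decodes them, and trims the result to the requested range, so a 1 MiB read inside a large object does not pull the whole body.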
s4-codec Python binding: wire-format read helpers (additive,
crates/s4-codec-py) — read_frame(bytes) / frame_iter(bytes)
(S4F2 frame parse, S4P1 padding skipped; header dicts carry
codec / original_size / compressed_size / crc32c),
decode_index(bytes) (.s4index sidecar v1/v2/v3 → dict with
entries / total_original_size / source_etag / sse),
crc32c(bytes), the CpuZstdDict(dict_bytes, level=3) codec class
(same compress / decompress shape as CpuZstd), module constants
FRAME_MAGIC / PADDING_MAGIC / FRAME_HEADER_BYTES /
SIDECAR_SUFFIX, and exception classes S4FrameError / S4IndexError
(⊂ S4Error). Existing API unchanged.
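
For cross-checking checksums without the binding installed, a pure-Python CRC-32C is short enough to include. This is a bitwise reference sketch, and it assumes the crc32c values in these notes are the standard Castagnoli CRC-32C, which the name suggests but the notes do not state.

```python
def crc32c_reference(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right one bit; fold in the polynomial when the low bit was set.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF
```

The standard check value for this algorithm is 0xE3069283 for the ASCII input "123456789". The loop is slow on large bodies, so it is only suitable for spot checks.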
