v1.1.0 — adoption tooling + small-object compression
v1.1 — adoption tooling + small-object compression. Six additive
features (s4 estimate / s4 migrate / zstd dictionaries +
s4 train-dict / s4fs fsspec adapter / s4 recompact / GPU batched
small-PUT compression) hardened by a 3-round dual-reviewer audit
(Claude ×3 + Codex; findings 20 → 7 → 5, P1/P2 zero at round 3). The
v1.0 freeze contract holds: every change below is additive and
default-off; flag-less PUT/GET behavior is bit-for-bit unchanged.
Fixed (audit round 2 — adversarial verification of the round-1 fix wave)
- P2 `CreateMultipartUpload` now strips client-supplied `s4-*`
metadata like `put_object` does. A forged `x-amz-meta-s4-encrypted`
could otherwise survive onto a completed multipart object and 5xx a
flag-less GET (multipart re-open of the round-1 PUT fix).
- P2 `migrate`/`recompact` no longer hard-fail every object when
`GetObjectTagging` is denied or unimplemented: such objects skip as
`tags-unreadable` (data is never rewritten tag-less), `NoSuchTagSet`
counts as "no tags", and a new `--no-tags` flag opts out of tag
inheritance entirely. Transient tagging errors still fail hard.
- P2 Version-pinned CopyObject (`?versionId=`) probes the pinned
source version, not the latest, for both the REPLACE metadata merge
and cross-bucket dictionary propagation.
- P3 Dictionary size cap (1 MiB) is now one consistent contract:
`train-dict --max-dict-bytes` and `--zstd-dict` boot preload reject
what a flag-less gateway's lazy fetch would refuse.
- P3 Boot-preloaded dictionaries are bucket-scoped, fetched per
`(bucket, id)` with `s4-dict-sha256` verification, and the server
refuses to boot when one dict-id resolves to different bytes across
buckets (16-hex prefix collision).
- P3 `s4 estimate` excludes already-S4 objects (gateway metadata or
`S4F2`/`S4P1`/`S4E*` magic) from sampling so re-estimating a
gateway-operated bucket doesn't measure framed/encrypted bytes as if
they were compressible plaintext (`already_s4` count + note).
- P3 (s4fs) the sidecar staleness check reuses a cached live-info
snapshot instead of issuing a second backend HEAD per `info()`.
Trade-off disclosed: external overwrites during one filesystem
instance's lifetime are detected on the next `invalidate_cache()` /
new instance, not per-read (same contract as the metadata cache).
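
The tag-inheritance rules for `migrate`/`recompact` listed in this round can be read as a small decision table. The following is a minimal sketch of that table, not the project's implementation: `classify_tagging`, `TagAction`, and the exact set of error codes treated as "denied or unimplemented" are assumptions made for illustration.

```python
from enum import Enum
from typing import Optional


class TagAction(Enum):
    INHERIT = "inherit"        # tags were read; copy them onto the rewritten object
    NO_TAGS = "no-tags"        # nothing to inherit; the rewrite proceeds
    SKIP = "tags-unreadable"   # object is left untouched, never rewritten tag-less
    FAIL = "fail"              # hard failure for this object


# Assumption: these S3-style codes stand in for "denied or unimplemented".
UNREADABLE_CODES = {"AccessDenied", "NotImplemented"}


def classify_tagging(error_code: Optional[str], no_tags_flag: bool = False) -> TagAction:
    """Map the outcome of a GetObjectTagging call to an action (hypothetical helper)."""
    if no_tags_flag:
        return TagAction.NO_TAGS   # --no-tags opts out of tag inheritance entirely
    if error_code is None:
        return TagAction.INHERIT   # the call succeeded
    if error_code == "NoSuchTagSet":
        return TagAction.NO_TAGS   # counts as "no tags"
    if error_code in UNREADABLE_CODES:
        return TagAction.SKIP
    return TagAction.FAIL          # anything else, e.g. a transient error, fails hard


print(classify_tagging("AccessDenied").value)
```

The point of the split is that an unreadable tag set is a property of the deployment (skip and report), while an unexpected error is a property of the run (fail and retry).
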
Fixed (audit round 3 — convergence check)
- P3 `s4 estimate`'s already-S4 body detection is structurally
validated (known codec id + payload fits the object for `S4F2`,
plausible padding length for `S4P1`) so customer data that merely
starts with the 4-byte magic isn't silently dropped from sampling.
- P3 README/CHANGELOG drift from the round-1/2 fixes corrected:
dictionary 1 MiB cap is documented as one three-surface contract,
migrate/recompact sample outputs show the full current skip taxonomy,
`--no-tags` / `tags-unreadable` / `already-s4` estimate exclusions
documented.
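
To illustrate why a structural check is stricter than a magic sniff, here is a rough sketch of the idea for `S4F2`. The header layout used below (magic, one-byte codec id, 32-bit little-endian payload length) and the set of known codec ids are placeholders invented for this example; the real S4F2 header is not described in these notes, which only state that `cpu-zstd-dict` is codec id 8.

```python
import struct

MAGIC_S4F2 = b"S4F2"
# Placeholder set: only id 8 (cpu-zstd-dict) is named in these notes.
KNOWN_CODEC_IDS = {0, 1, 8}
# Hypothetical header layout: magic (4 bytes) | codec id (u8) | payload length (u32 LE).
HEADER = struct.Struct("<4sBI")


def looks_like_s4f2(body: bytes) -> bool:
    """Return True only if the body passes the magic AND the structural checks."""
    if len(body) < HEADER.size:
        return False
    magic, codec_id, payload_len = HEADER.unpack_from(body)
    if magic != MAGIC_S4F2:
        return False
    if codec_id not in KNOWN_CODEC_IDS:
        return False                      # unknown codec id: treat as customer data
    return HEADER.size + payload_len <= len(body)   # payload must fit in the object


framed = HEADER.pack(MAGIC_S4F2, 8, 3) + b"abc"
customer = b"S4F2 is how this customer log line happens to begin"
print(looks_like_s4f2(framed), looks_like_s4f2(customer))
```

Under these assumptions the customer body is kept in the sample because the byte after the magic is not a known codec id, even though the first four bytes match.
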
Fixed (audit round 1 — 4 reviewers over v1.0.0..HEAD, 2026-06-11)
- P1 `s4 migrate` could rewrite `.s4dict/<id>` dictionary objects as
S4F2-framed data, breaking every `cpu-zstd-dict` object in the bucket
(lazy fetch fails fingerprint verification). All three bulk tools
(estimate/migrate/recompact) now exclude S4-internal keys:
`*.s4index`, `.s4dict/`, and `*.__s4ver__/*` versioning shadows.
- P1 A client-supplied `x-amz-meta-s4-dict-id` on a plain PUT made
the subsequent GET fail 5xx even with `--zstd-dict` unset (default-off
behavior regression). The GET dict branch is now gated on the
gateway-managed manifest codec (`cpu-zstd-dict`), and `put_object`
strips client-supplied `s4-*` metadata keys up front.
- P1 (s4fs) SSE-encrypted objects could return AES-GCM ciphertext
bytes silently (`passthrough` + SSE). s4fs now refuses with
`NotImplementedError` via three layers: `s4-encrypted` metadata,
sidecar SSE binding, and `S4E1`–`S4E6` magic sniff.
- P1 (s4fs) `<key>.__s4ver__/<version>` shadow objects were not
hidden from `ls`/`find`/`glob` (prefix check instead of infix), so
directory dataset scans could silently include stale versions.
- P2 `migrate`/`recompact` rewrites dropped the source object's
storage class (silent promotion to STANDARD) and object tags; both
are now inherited. ACLs / Object Lock retention remain uninherited
(stated in report notes).
- P2 `migrate` treated a roundtrip-verify failure as a skip
(exit 0); it is now a hard failure (exit 1), matching `recompact`.
The `skipped_verify_failed` JSON field remains (always 0) for shape
compatibility.
- P2 Cross-bucket CopyObject of a dict-compressed object now
propagates `.s4dict/<id>` to the destination bucket (idempotent,
content-addressed); previously the copy succeeded but every GET on
the destination failed 5xx.
- P2 `.s4dict/` joined the reserved-key guard: gateway PUT / DELETE
are rejected with `InvalidObjectName` (reads still allowed) so a
bucket-wide dictionary can't be destroyed through the data path.
- P2 (s4fs) `info()` no longer trusts a stale sidecar for object
size (staleness-checked first), and binding-less legacy v1 sidecars
are no longer used for size or partial range reads.
- P2 (s4fs) dependency floor corrected to `s4-codec>=1.1.0,<2`:
the binding APIs s4fs imports don't exist in the 1.0.0 wheel.
- P3 `estimate` no longer aborts the whole run when a sampled
object 404s mid-run (skip + note); module/report now disclose the
single-stream measurement bias vs the server's 4 MiB chunking.
- P3 `migrate`/`recompact` enforce `--max-body-bytes` from the
GET `Content-Length` before buffering; `migrate` now also cleans up a
stale multi-frame sidecar when its rewrite comes out single-frame.
- P3 `recompact` no longer auto-promotes backend-written framed
objects that lack gateway metadata (`unstamped-framed` skip; opt back
in with `--assume-unstamped-framed`).
- P3 Dict hardening: `DictCache` is bucket-scoped, `train-dict`
stamps `s4-dict-sha256` (full-digest verification when present), and
lazy fetch caps dictionaries at 1 MiB. (s4fs) `open()` on a framed
object with inexact size raises instead of silently truncating
(`allow_inexact_open=True` restores the old clamp).
- P3 `nvcomp_batched` validates device-reported chunk sizes on the
host before the unsafe copy (typed per-item error instead of a
potential OOB read on driver misbehavior).
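
The S4-internal key patterns named in this round (`*.s4index`, `.s4dict/`, `*.__s4ver__/*`) and the prefix-versus-infix bug in the s4fs listing can be sketched as follows. This is an illustrative filter only: `is_s4_internal_key` and `buggy_is_shadow` are hypothetical names, and the sketch assumes `.s4dict/` lives at the bucket root, as the `.s4dict/<id>` key form suggests.

```python
def is_s4_internal_key(key: str) -> bool:
    """Hypothetical filter for keys the bulk tools and listings should skip."""
    if key.endswith(".s4index"):      # <key>.s4index sidecar
        return True
    if key.startswith(".s4dict/"):    # .s4dict/<id> dictionary objects
        return True
    # Versioning shadows are <key>.__s4ver__/<version>, so the marker sits in
    # the middle of the key: an infix test is required.
    return ".__s4ver__/" in key


def buggy_is_shadow(key: str) -> bool:
    """The kind of prefix-only check that lets shadows leak into listings."""
    return key.startswith(".__s4ver__/")


shadow = "data/part-0.parquet.__s4ver__/3"
print(is_s4_internal_key(shadow), buggy_is_shadow(shadow))
```

A prefix test only catches a shadow of a key that itself starts with the marker, which is why nested dataset paths slipped through.
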
Added
- `--gpu-batch-small-puts` (opt-in; requires the `nvcomp-gpu` build +
a CUDA-capable GPU at boot, otherwise the server refuses to start):
batch concurrent small PUTs into a single nvCOMP batched-zstd
kernel launch so the GPU pays its fixed launch + PCIe cost once per
batch instead of once per object. Eligibility: sampling dispatcher
picked `cpu-zstd`, no `--zstd-dict` prefix match, declared
`Content-Length` in `[--gpu-batch-floor-bytes (default 4 KiB), --gpu-min-bytes (default 1 MiB))`. Companion knobs:
`--gpu-batch-max-items` (flush at N pending bodies, default 32) and
`--gpu-batch-window-ms` (flush after T ms, default 4; also the
worst-case latency the batch path adds to a PUT). Wire format is
unchanged: batched objects are byte-layout-identical standard
`nvcomp-zstd` bodies (same FCG1 framing + `CodecKind::NvcompZstd`
manifest as the per-object GPU path; no new codec id, no new
metadata) and the GET path has zero batch awareness, proven by
GPU-gated tests that decompress batch output through the unmodified
per-object path, plus a MinIO e2e (`tests/gpu_batch_e2e.rs`).
Fail-open semantics: queue full (backpressure), GPU error, or a
batched result that is not smaller than the input all fall back to
the pre-existing cpu-zstd framed path, observable via the new
`s4_gpu_batch_total{result="batched"|"fallback"}` counter. Measured
on 1000 × 8 KiB log-like objects (RTX 4070 Ti SUPER, nvCOMP
5.2.0.10): batched GPU = 29.7 ms vs 702 ms per-object GPU (~24×) vs
15.7–19.5 ms single-thread cpu-zstd-3; GPU output ~10% smaller
(12.31× vs 11.14× ratio). Honest verdict in README §"GPU small-PUT
batching": this offloads CPU and improves ratio; it does not beat a
free CPU core on raw wall time at 8 KiB. New public surface:
`s4_codec::nvcomp_batched::NvcompZstdBatchEncoder` (feature-gated),
`s4_server::gpu_batch` (aggregator + `GpuBatchHandle`),
`S4Service::with_gpu_batch`, and the `gpu_small_batch` bench. Flag
off (default) = bit-for-bit unchanged PUT behavior.
- `s4 recompact <bucket>[/prefix] --endpoint-url <BACKEND> [--execute]`:
rewrite cpu-zstd framed objects at a higher zstd level during a quiet
window (LSM-compaction for S3). The gateway's PUT path favours latency
(`--zstd-level`, default 3); recompact decodes each S4-framed cpu-zstd
object in-process (same `FrameIter` walk as the GET path, which doubles
as an integrity check on the stored frames), re-frames the original bytes
with the same `streaming_compress_to_frames` + `pick_chunk_size` pair
the PUT path uses at `--target-zstd-level` (default 19), and overwrites
only when the new frames shrink the stored bytes by
`--min-gain-percent` (default 3%). Rewritten objects are stamped with
new `s4-zstd-level` metadata (recompact-only stamp: the gateway
neither reads nor writes it), making re-runs idempotent
(`already-compacted` skip) with no checkpoint file.
`--older-than <DUR>` (`30d`/`12h`/`45m`/`90s`) restricts the run
to cold objects by backend `LastModified`. Dry-run by default;
mandatory decompress-roundtrip byte comparison before every write (no
off switch) and a pre-PUT HEAD ETag re-check (narrows, does not close,
the concurrent-writer race). Skip taxonomy: `not-s4` (run `s4 migrate`
first) / `already-compacted` / `unsupported-codec` (`passthrough`,
`cpu-gzip`, `nvcomp-*`, `cpu-zstd-dict`; this tool is cpu-zstd →
cpu-zstd only) / `unstamped-framed` (audit round 1: backend-written
frames without gateway metadata; opt in with
`--assume-unstamped-framed`) / `insufficient-gain` / `too-large`
(`--max-body-bytes`, default 5 GiB) / `etag-raced` / `too-recent` /
`tags-unreadable` (audit round 2; `--no-tags` opts out of tag
inheritance). Multi-frame rewrites
refresh the `<key>.s4index` sidecar; single-frame rewrites delete a
now-stale one. `--concurrency` (default 4), `--max-objects`,
`--format table|json`; exit 1 iff any object failed. SSE-enabled
deployments are rejected (same guard as migrate). New library module
`s4_server::recompact` (`run_recompact`, `RecompactParams`,
`RecompactReport`, `RecompactError` `#[non_exhaustive]`,
`parse_duration_suffix`). Additive only: no existing flag, metadata
key, or default changed (`s4-server` internals: a handful of private
`migrate` helpers became `pub(crate)` for reuse, behavior unchanged).
- `s4 estimate <bucket>[/prefix] --endpoint-url <BACKEND>`: read-only
pre-deployment savings simulator. Lists the bucket (`.s4index` excluded,
capped at `--max-list-keys`), stratifies objects by extension, samples
`--samples-per-stratum` objects per stratum (size-weighted, deterministic
under `--seed`), compresses the sampled bytes with the same
`SamplingDispatcher` pick the gateway would make at PUT time (honoring
`--codec` / `--dispatcher` / `--zstd-level` / `--gpu-min-bytes` /
`--prefer-columnar-gpu`), and extrapolates projected storage bytes and
$/month (`--price-per-gb-month`, default 0.023). `--format table|json`.
Never executes GPU codecs: `nvcomp-*` picks are measured via a cpu-zstd
proxy with an explicit report note. New library module
`s4_server::estimate` (`run_estimate`, `EstimateParams`,
`EstimateReport`, `EstimateError` `#[non_exhaustive]`). Additive only:
no existing flag or default changed.
- `s4 migrate <bucket>[/prefix] --endpoint-url <BACKEND> [--execute]`:
bulk retro-compression of pre-existing objects into the gateway's S4F2
framed format (same `SamplingDispatcher` decision, same
`streaming_compress_to_frames` framing + chunk-size policy, same
`s4-codec` / `s4-framed` metadata and `<key>.s4index` sidecar contract as
the PUT path, so gateway GETs decompress migrated objects transparently).
Dry-run by default; `--execute` to write. Already-S4 objects (frame
magic or `s4-codec` metadata) are skipped, so re-runs resume
automatically without a checkpoint file. Every write requires an
in-process decompress-roundtrip byte comparison (no off switch) and a
pre-PUT HEAD ETag re-check (narrows, does not close, the
concurrent-writer race; documented). Skip taxonomy: `already-s4` /
`not-compressible` (passthrough pick or no size gain; object untouched)
/ `too-large` (`--max-body-bytes`, default 5 GiB) / `etag-raced` /
`tags-unreadable` (audit round 2; `--no-tags` opts out of tag
inheritance). A roundtrip-verify failure is a hard failure (exit 1)
since the round-1 audit; the `skipped_verify_failed` JSON field
remains for shape compatibility but is always 0.
`--concurrency` (default 4), `--max-objects`,
`--format table|json`; exit 1 iff any object failed. GPU / `cpu-gzip`
dispatcher picks fall back to `cpu-zstd` at `--zstd-level`
(reported as `picked != wrote_with`). SSE-configured invocations are
rejected; versioning-Enabled buckets get a double-billing `WARNING`
note. New library module `s4_server::migrate` (`run_migrate`,
`MigrateParams`, `MigrateReport`, `MigrateError` / `SkipReason`
`#[non_exhaustive]`). Additive only: no existing flag, default, or
PUT/GET behavior changed.
- Shared zstd dictionaries for small objects (`s4 train-dict` +
`--zstd-dict`): new codec `cpu-zstd-dict` (codec id 8; additive:
the S4F2 frame layout is unchanged, only a new id is allocated).
`s4 train-dict <bucket>/<prefix> --endpoint-url <BACKEND> [--max-samples 1000] [--max-dict-bytes 112640] [--min-samples 8] [--sample-max-bytes 65536]` samples small raw objects under the prefix
(already-S4 bodies skipped), trains a stock zstd dictionary
(`zstd::dict::from_samples` / ZDICT), stores it at the content-addressed
in-bucket object `.s4dict/<dict-id>` (`<dict-id>` = first 16 hex of the
dictionary's SHA-256; immutable, idempotent re-train), and prints the
gateway flag. The gateway flag `--zstd-dict '<bucket>/<key-prefix>=<dict-id>'` (repeatable; dictionaries fetched +
fingerprint-verified at boot, missing dict = boot error) makes
single-PUT cpu-zstd bodies ≤ `--zstd-dict-max-bytes` (default 1 MiB)
whose key longest-prefix-matches compress against the dictionary,
but only when it actually beats dict-less cpu-zstd (both compressed and
compared per small PUT; ties / losses fall back to a plain `cpu-zstd`
frame with no dict reference). The dictionary id travels in the new
`s4-dict-id` object-metadata key, never in the frame. GETs resolve the
dictionary preloaded → LRU → lazy backend fetch of `.s4dict/<id>`
(fingerprint-verified, ~16-entry cache), so a gateway booted without
the flag still reads dict-compressed objects; fetch failures are 5xx +
the new `s4_dict_fetch_total{result}` counter. `.s4dict/` keys are
hidden from gateway listings (same treatment as `.s4index` /
`.__s4ver__/`). Measured on the MinIO E2E (100 × ~300-byte
same-schema JSON events): 8 903 bytes stored vs 21 923 dict-less =
2.46×. No lock-in: the payload is a stock zstd frame and `.s4dict/<id>`
is raw zstd dictionary bytes, so `zstd -D <dictfile> -d` decodes without
any S4 software (pinned by the E2E against the real CLI). New modules
`s4_codec::cpu_zstd_dict` (`CpuZstdDict`, `train_from_samples`,
blocking helpers) and `s4_server::dict` (`DictStore`, `DictCache`,
`run_train_dict`, …). Compatibility note: pre-v1.1 readers fail a
GET of a `cpu-zstd-dict` object with the existing unknown codec id
error (graceful typed failure, no silent corruption), so roll mixed
fleets forward before enabling the flag. Multipart parts and
`s4-codec-wasm` native decode are out of scope (follow-ups);
`s4-codec-py` decodes dict objects via the `CpuZstdDict` binding
added in this release, and cross-bucket CopyObject propagates the
dictionary (see Fixed). Without `--zstd-dict`, PUT/GET behavior is
bit-for-bit unchanged.
- `s4fs`: fsspec filesystem for reading S4 objects without the gateway
(new pure-Python package `python/s4fs/`, protocol
`s4://`). pandas / pyarrow / DuckDB / Polars read gateway-written
objects straight off the backend: S4F2 frames are decoded transparently
(`passthrough` / `cpu-zstd` / `cpu-gzip` / `cpu-zstd-dict`, with
`.s4dict/<id>` fetch + SHA-256-fingerprint verify), unframed
metadata-manifest objects (cpu-gzip, legacy raw zstd) decode via the
`s4-codec` / `s4-original-size` / `s4-crc32c` stamps, and non-S4
objects pass through byte-for-byte. `ls` / `info` hide `.s4index` /
`.s4dict/` / `.__s4ver__/` internals and report original
(decompressed) sizes (sidecar → `s4-original-size` metadata →
compressed size with `s4_size_exact: False`). Range reads / seeks use
the `.s4index` sidecar (with source-ETag staleness check) to fetch only
the overlapping frames; verified by the MinIO e2e to transfer fewer
backend bytes than a full read. Read-only by design: every write API
raises `NotImplementedError("s4fs is read-only; write through the S4 gateway")`; GPU frames (`nvcomp-*` / `dietgpu-ans`) raise
`NotImplementedError` instead of decoding wrong. The underlying
filesystem defaults to s3fs (`[s3]` extra) and is injectable
(`S4FileSystem(fs=...)`). Unit fixtures are real gateway-written bytes
captured off MinIO (`tests/fixtures/generate_fixtures.py`); e2e
(`pytest -m e2e`) covers pandas / pyarrow / DuckDB round-trips against
MinIO + the real gateway.
- `s4-codec` Python binding: wire-format read helpers (additive,
`crates/s4-codec-py`): `read_frame(bytes)` / `frame_iter(bytes)`
(S4F2 frame parse, S4P1 padding skipped; header dicts carry
`codec` / `original_size` / `compressed_size` / `crc32c`),
`decode_index(bytes)` (`.s4index` sidecar v1/v2/v3 → dict with
`entries` / `total_original_size` / `source_etag` / `sse`),
`crc32c(bytes)`, the `CpuZstdDict(dict_bytes, level=3)` codec class
(same `compress` / `decompress` shape as `CpuZstd`), module constants
`FRAME_MAGIC` / `PADDING_MAGIC` / `FRAME_HEADER_BYTES` /
`SIDECAR_SUFFIX`, and exception classes `S4FrameError` / `S4IndexError`
(⊂ `S4Error`). Existing API unchanged.
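
Two rules from the shared-dictionary entry above are easy to state in code: the dictionary id is the first 16 hex characters of the dictionary's SHA-256, and a `--zstd-dict '<bucket>/<key-prefix>=<dict-id>'` rule applies to the key with the longest matching prefix. The sketch below uses only the Python standard library; `dict_id`, `parse_rule`, and `pick_dict` are hypothetical helper names, and the parsing assumes bucket names contain no `/` and dict ids contain no `=`.

```python
import hashlib
from typing import List, Optional, Tuple

Rule = Tuple[str, str, str]  # (bucket, key prefix, dict id)


def dict_id(dict_bytes: bytes) -> str:
    """First 16 hex characters of the dictionary's SHA-256 (content-addressed id)."""
    return hashlib.sha256(dict_bytes).hexdigest()[:16]


def parse_rule(flag_value: str) -> Rule:
    """Split '<bucket>/<key-prefix>=<dict-id>' into its three parts."""
    target, _, did = flag_value.rpartition("=")
    bucket, _, prefix = target.partition("/")
    return bucket, prefix, did


def pick_dict(rules: List[Rule], bucket: str, key: str) -> Optional[str]:
    """Return the dict id of the longest key prefix that matches, if any."""
    best: Optional[Tuple[str, str]] = None
    for rule_bucket, prefix, did in rules:
        if rule_bucket == bucket and key.startswith(prefix):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, did)
    return best[1] if best else None


rules = [
    parse_rule("logs/events/=aaaaaaaaaaaaaaaa"),
    parse_rule("logs/events/json/=bbbbbbbbbbbbbbbb"),
]
print(pick_dict(rules, "logs", "events/json/2026/06/11/e1.json"))
print(len(dict_id(b"example dictionary bytes")))
```

Because the id is derived from the bytes, re-training on identical samples maps to the same `.s4dict/<dict-id>` key, which is what makes the re-train idempotent. A 16-hex id is a 64-bit prefix, so the full-digest `s4-dict-sha256` check remains the authoritative comparison.
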