Release tridec v0.2.1 · bledden/tridec

0.2.1 — 2026-06-14

Validated on all three platforms: Metal (M4 Max), NVIDIA H200 (CUDA), and AMD
MI300X (ROCm/gfx942). The CUDA/ROCm kernels are byte-identical to 0.2.0 (the
megakernel lift is Metal-only and auto-dispatch only changes routing), so the
0.2.0 performance receipts stand; no regression was re-confirmed on H200/MI300X.

Statistical-tier gate uses Wilson-CI overlap (#1). The cross-decoder /
cross-platform / vs-oracle gates (where exact failure counts can't match by
construction) now assert a single sample-size-aware
validation.wilson_consistent(f1, n1, f2, n2) (two rates are consistent iff their
95% Wilson score intervals overlap). This replaces the ad-hoc
abs(diff) <= max(5, 5%) count bars, which were too loose at high N/LER and too
strict at low counts. New helper and unit test; applied across the
numpy-vs-ldpc, no-regression statistical tier, and relay-vs-relay_bp-oracle
gates (Metal/CUDA/ROCm). The fp32-near-tie-flip
bars (same-algo, ≈0 flips) are intentionally left as tight absolute tolerances.

Relay-BP auto-dispatch (#5). tridec.from_dem(..., algorithm="relay") and
RelayBpDecoder now use the single-launch megakernel by default on GPU —
it wins decisively on relay (9–32× on CUDA/ROCm, ~197× on Metal vs the v0.1
two-kernel host loop, with per-shot early exit and LER matching the relay_bp
oracle). Pass megakernel=False for the v0.1 two-kernel path. The dispatch is
GPU-gated by construction — RelayBpDecoder only accepts the triton
(CUDA/ROCm) and metal backends, so the megakernel is never built on CPU. The
megakernel is a drop-in RelayBpTriton subclass (overrides _relay_posteriors
only), so decode_batch and the no-observable path are unchanged. BP keeps
the two-kernel default (BpMegaTriton stays opt-in via
tridec.backends.megakernel) — the plain-BP megakernel is a single-shot
latency tool that loses at batch throughput. Validated on Metal + CPU (full
suite 99 passed / 6 skipped); CUDA/ROCm dispatch is a pre-publish confirmation
(the megakernel's kernel correctness there is already receipted).
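
The dispatch shape described above (a GPU-only backend gate, a megakernel default with an opt-out flag, and a subclass that overrides a single method) can be sketched with toy stand-ins. Every class and function name below is hypothetical and none of it is tridec code; only the backend names, the megakernel flag, and the _relay_posteriors / decode_batch method names are taken from the note.

```python
GPU_BACKENDS = {"triton", "metal"}  # backend names taken from the note


class ToyTwoKernelRelay:
    """Hypothetical stand-in for the two-kernel relay path."""

    path = "two-kernel"

    def _relay_posteriors(self, syndromes):
        # Placeholder values; a real decoder would run message passing here.
        return [[0.0 for _ in shot] for shot in syndromes]

    def decode_batch(self, syndromes):
        # Shared post-processing that subclasses do not touch.
        posteriors = self._relay_posteriors(syndromes)
        return [[int(p > 0.0) for p in shot] for shot in posteriors]


class ToyMegakernelRelay(ToyTwoKernelRelay):
    """Hypothetical drop-in subclass: only the posterior step is overridden."""

    path = "megakernel"

    def _relay_posteriors(self, syndromes):
        # A real implementation would issue a single fused launch here.
        return super()._relay_posteriors(syndromes)


def make_relay_decoder(backend, megakernel=True):
    """Hypothetical factory: GPU-gated, megakernel by default."""
    if backend not in GPU_BACKENDS:
        raise ValueError(f"relay decoding needs a GPU backend, got {backend!r}")
    return ToyMegakernelRelay() if megakernel else ToyTwoKernelRelay()
```

Since the backend check runs before either class is chosen, a CPU backend can never reach the megakernel branch, and both paths share the same decode_batch.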

Metal megakernel fully lifted off the BLOCK=32 pin — both kernels at
BLOCK=256. The triton-metal codegen gaps that forced BLOCK=32 in 0.2.0
(silently-dropped tl.debug_barrier; cross-lane reduction-in-loop) are fixed
upstream. Metal now runs BP at (256) (20 → 12 ms, 1.67×) and relay at
(256, num_warps=8) (441 → 152 ms, 2.89×), lifting the Metal relay
headline from 65× → ~197× vs the v0.1 two-kernel path (30.0 s → 0.152 s /
2000 shots), relay bit-identical to BLOCK=128. The relay num_warps=8 sets
num_threads = 256 = BLOCK so each thread handles one element (n=1); at n>1
triton-metal's base path under-covers a BLOCK-wide store and now loudly
refuses (MetalNonRecoverableError, never silent-wrong). All 4 Metal gates
pass at the new defaults; relay validated deterministic + bit-identical to
BLOCK=128 + LER-vs-oracle over repeated runs. Requires triton-metal with the
in-loop-reduction + n=1-store fixes (on older builds, relay@256 loudly refuses).
Receipt: bench/receipts/megakernel_metal_lift.{md,json}.
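
The thread-coverage arithmetic behind the relay num_warps choice can be worked through in a few lines. This is a sketch that assumes 32 threads per warp, consistent with the num_warps=block/32 rule mentioned in these notes; the function names are hypothetical and the num_warps=4 case is only a hypothetical comparison point.

```python
THREADS_PER_WARP = 32  # assumed width, matching the num_warps = block/32 rule


def elements_per_thread(block, num_warps):
    """Elements each thread must cover for a BLOCK-wide store (the n in the notes)."""
    num_threads = num_warps * THREADS_PER_WARP
    if block % num_threads:
        raise ValueError("block must be a multiple of the thread count")
    return block // num_threads


def num_warps_for_one_element(block):
    """num_warps that gives exactly one element per thread (n = 1)."""
    if block % THREADS_PER_WARP:
        raise ValueError("block must be a multiple of the warp width")
    return block // THREADS_PER_WARP


# BLOCK=256 with num_warps=8 gives 256 threads, so n = 1.
n_relay = elements_per_thread(256, 8)
# A hypothetical num_warps=4 launch would give n = 2, the case that now refuses.
n_hypothetical = elements_per_thread(256, 4)
```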

Process note: an intermediate upstream build made relay@256 silently-wrong +
racy (base-path n>1 under-coverage); caught by tridec's repeated-run
determinism gate, root-caused, and resolved with num_warps=block/32 + a loud
upstream refusal before any lift shipped (the kernel never produced wrong output
through the default path).
