tridec v0.2.1

@bledden bledden released this 14 Jun 10:51

0.2.1 — 2026-06-14

Validated on all three platforms: Metal (M4 Max), NVIDIA H200 (CUDA), and AMD
MI300X (ROCm/gfx942). The CUDA/ROCm kernels are byte-identical to 0.2.0 (the
megakernel lift is Metal-only and auto-dispatch is routing only), so the 0.2.0
performance receipts stand; re-confirmed no regression on H200/MI300X.

Statistical-tier gate uses Wilson-CI overlap (#1). The cross-decoder /
cross-platform / vs-oracle gates (where exact failure counts can't match by
construction) now assert a single sample-size-aware
validation.wilson_consistent(f1, n1, f2, n2) (two rates consistent iff their
95% Wilson score intervals overlap), replacing the ad-hoc
abs(diff) <= max(5, 5%) count bars, which were too loose at high N/LER and too
strict at low counts. New helper + unit test; applied across the numpy-vs-ldpc,
no-regression statistical tier, and relay-vs-relay_bp-oracle gates
(metal/CUDA/ROCm). The fp32-near-tie-flip bars (same-algo, ≈0 flips) are
intentionally left as tight absolute tolerances.
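
For readers unfamiliar with the check, here is a minimal sketch of a Wilson-overlap consistency test. It is not tridec's implementation: only the name wilson_consistent and its (f1, n1, f2, n2) argument order come from the note above. The wilson_interval helper, the z parameter, and the error handling are assumptions made for illustration.

```python
import math

# z-score for a two-sided 95% interval (assumed default; hypothetical parameter).
Z_95 = 1.959963984540054


def wilson_interval(failures, shots, z=Z_95):
    """Return the (low, high) Wilson score interval for failures/shots."""
    if shots <= 0:
        raise ValueError("shots must be positive")
    if not 0 <= failures <= shots:
        raise ValueError("failures must lie in [0, shots]")
    p = failures / shots
    z2 = z * z
    denom = 1.0 + z2 / shots
    center = (p + z2 / (2.0 * shots)) / denom
    half = z * math.sqrt(p * (1.0 - p) / shots + z2 / (4.0 * shots * shots)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


def wilson_consistent(f1, n1, f2, n2, z=Z_95):
    """True iff the two Wilson intervals overlap (rates are consistent)."""
    lo1, hi1 = wilson_interval(f1, n1, z)
    lo2, hi2 = wilson_interval(f2, n2, z)
    return lo1 <= hi2 and lo2 <= hi1
```

With this sketch, 0 vs 3 failures over 2000 shots each counts as consistent, while 100 vs 200 failures over 1000 shots each does not. The interval width scales with the sample size, which is the property a fixed count bar lacks.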

Relay-BP auto-dispatch (#5). tridec.from_dem(..., algorithm="relay") and
RelayBpDecoder now use the single-launch megakernel by default on GPU, where
it wins decisively on relay (9–32× on CUDA/ROCm, ~197× on Metal vs the v0.1
two-kernel host loop, with per-shot early exit and LER matching the relay_bp
oracle). Pass megakernel=False for the v0.1 two-kernel path. The dispatch is
GPU-gated by construction: RelayBpDecoder only accepts the triton
(CUDA/ROCm) and metal backends, so the megakernel is never built on CPU. The
megakernel is a drop-in RelayBpTriton subclass (overrides _relay_posteriors
only), so decode_batch and the no-observable path are unchanged. BP keeps
the two-kernel default (BpMegaTriton stays opt-in via
tridec.backends.megakernel): the plain-BP megakernel is a single-shot
latency tool that loses at batch throughput. Validated on Metal + CPU (full
suite 99 passed / 6 skipped); CUDA/ROCm dispatch is a pre-publish confirmation
(the megakernel's kernel correctness there is already receipted).

Metal megakernel fully lifted off the BLOCK=32 pin: both kernels at BLOCK=256.
The triton-metal codegen gaps that forced BLOCK=32 in 0.2.0
(silently-dropped tl.debug_barrier; cross-lane reduction-in-loop) are fixed
upstream. Metal now runs BP at (256) (20 → 12 ms, 1.67×) and relay at
(256, num_warps=8) (441 → 152 ms, 2.89×), lifting the Metal relay
headline from 65× → ~197× vs the v0.1 two-kernel path (30.0 s → 0.152 s /
2000 shots), relay bit-identical to BLOCK=128. The relay num_warps=8 sets
num_threads = 256 = BLOCK so each thread handles one element (n=1); at n>1
triton-metal's base path under-covers a BLOCK-wide store and now loudly
refuses (MetalNonRecoverableError, never silent-wrong). All 4 Metal gates
pass at the new defaults; relay validated deterministic + bit-identical to
BLOCK=128 + LER-vs-oracle over repeated runs. Requires triton-metal with the
in-loop-reduction + n=1-store fixes (older → relay@256 loudly refuses).
Receipt: bench/receipts/megakernel_metal_lift.{md,json}.
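
The num_warps/BLOCK relationship above reduces to simple arithmetic. The sketch below assumes 32 threads per warp, as implied by num_threads = 256 at num_warps=8. The helper names are hypothetical and are not part of tridec or triton-metal.

```python
WARP_SIZE = 32  # assumption: 32 threads per warp (8 warps -> 256 threads)


def elements_per_thread(block, num_warps, warp_size=WARP_SIZE):
    """Elements each thread must cover (n) for a BLOCK-wide access."""
    num_threads = num_warps * warp_size
    if block % num_threads != 0:
        raise ValueError("block must be a multiple of num_warps * warp_size")
    return block // num_threads


def warps_for_one_element_per_thread(block, warp_size=WARP_SIZE):
    """num_warps that makes num_threads == block, i.e. n == 1."""
    if block % warp_size != 0:
        raise ValueError("block must be a multiple of warp_size")
    return block // warp_size


# Relay default from these notes: BLOCK=256 with num_warps=8 gives n == 1.
n_default = elements_per_thread(256, 8)
# Fewer warps at the same BLOCK gives n > 1, the case the note says is refused.
n_half = elements_per_thread(256, 4)
```

Here n_default is 1 and n_half is 2, so BLOCK=256 needs num_warps=8 to stay on the one-element-per-thread path.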

Process note: an intermediate upstream build made relay@256 silently-wrong +
racy (base-path n>1 under-coverage); caught by tridec's repeated-run
determinism gate, root-caused, and resolved with num_warps=block/32 + a loud
upstream refusal before any lift shipped (the kernel never produced wrong output
through the default path).