v0.3.0
What's New
Scaling Infrastructure (M32-M34)
- Mixture of Experts — `@moe` decorator with top-k gating, capacity-based routing, and load-balancing aux loss
- Speculative Decoding — `@speculative` with tree attention, rejection sampling, and `@medusa` multi-head prediction
- Ring Attention — `@context_parallel(ring_size=N)` for cross-GPU sequence parallelism with causal masking
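
To make the Mixture of Experts routing terms concrete, here is a minimal pure-Python sketch of top-k gating with a per-expert capacity limit and a Switch-Transformer-style load-balancing loss. It is an illustration only: the function names (`route_top_k`, `load_balance_loss`) are hypothetical, and the notes do not say which loss formulation or drop policy NSL actually uses.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k, capacity):
    """Send each token to at most k experts, dropping slots past capacity.

    router_logits holds one list of per-expert logits per token.
    Returns (assignments, load): assignments[t] is a list of
    (expert, weight) pairs, load[e] is the number of tokens expert e accepted.
    """
    num_experts = len(router_logits[0])
    load = [0] * num_experts
    assignments = []
    for logits in router_logits:
        probs = softmax(logits)
        # The k experts with the highest router probability for this token.
        top = sorted(range(num_experts), key=lambda e: probs[e], reverse=True)[:k]
        kept = []
        for e in top:
            if load[e] < capacity:  # expert still has room
                load[e] += 1
                kept.append(e)
        # Renormalize gate weights over the experts that accepted the token.
        norm = sum(probs[e] for e in kept)
        assignments.append([(e, probs[e] / norm) for e in kept])
    return assignments, load

def load_balance_loss(router_logits):
    """Assumed formulation: num_experts * sum_e(f_e * P_e), where f_e is the
    fraction of tokens whose top-1 expert is e and P_e is the mean router
    probability for e."""
    num_experts = len(router_logits[0])
    num_tokens = len(router_logits)
    frac = [0.0] * num_experts
    mean_prob = [0.0] * num_experts
    for logits in router_logits:
        probs = softmax(logits)
        top1 = max(range(num_experts), key=lambda e: probs[e])
        frac[top1] += 1.0 / num_tokens
        for e in range(num_experts):
            mean_prob[e] += probs[e] / num_tokens
    return num_experts * sum(f * p for f, p in zip(frac, mean_prob))
```

In this sketch a token whose chosen experts are all full ends up with an empty assignment list, which is one common way to handle overflow; other policies (for example rerouting to the next-best expert) are equally plausible.
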
Quantization (M35)
- FP8 Compute — `@fp8_compute` decorator with E4M3/E5M2 scale management and automatic Tensor Core dispatch
- AWQ 4-bit — `quant { dtype: awq4 }` with in-register dequantize-in-GEMM (zero memory round-trip)
- GPTQ 4-bit/8-bit — `quant { dtype: gptq4 }` with Hessian-based optimal quantization
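
For readers new to 4-bit weight formats, the sketch below shows the storage side that schemes like these build on: group-wise quantization to 4-bit codes with a scale and offset, two codes packed per byte, and dequantization back to floats. It uses plain round-to-nearest, not AWQ's activation-aware scaling or GPTQ's Hessian-based updates, and every name in it is hypothetical rather than taken from NSL.

```python
def quantize_group_4bit(weights):
    """Asymmetric round-to-nearest quantization of one weight group."""
    lo, hi = min(weights), max(weights)
    # 4 bits give 16 levels (codes 0..15); fall back to 1.0 for constant groups.
    scale = (hi - lo) / 15 or 1.0
    codes = [min(15, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo

def pack_nibbles(codes):
    """Pack pairs of 4-bit codes into bytes, low nibble first."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even count without mutating the input
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))

def unpack_nibbles(packed, count):
    """Inverse of pack_nibbles; count trims any padding nibble."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return codes[:count]

def dequantize_group(codes, scale, lo):
    """Map 4-bit codes back to approximate float weights."""
    return [lo + c * scale for c in codes]
```

With round-to-nearest, the reconstruction error of each weight is bounded by half the group's scale, which is why smaller groups (tighter min/max ranges) usually reconstruct more accurately at the cost of storing more scales.
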
Compiler Intelligence (M36-M37)
- Memory Planning — compile-time tensor liveness analysis, interference graph, first-fit-decreasing slab assignment with 256-byte GPU alignment
- Roofline Cost Model — per-operation FLOP/byte/arithmetic-intensity analysis against a built-in GPU database (A100, H100, RTX-4090, RTX-3090, L40S); table, JSON, and Chrome tracing output formats
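
The memory-planning pipeline named above (liveness, interference, first-fit-decreasing placement, 256-byte alignment) can be sketched in a few lines. This is an assumed, simplified model: live ranges are inclusive operation indices, interference is derived directly from overlapping ranges instead of an explicit graph, and `plan_memory` is a hypothetical name, not the compiler's API.

```python
ALIGNMENT = 256  # bytes, the GPU alignment stated in the notes

def align_up(n, alignment=ALIGNMENT):
    return (n + alignment - 1) // alignment * alignment

def plan_memory(tensors):
    """Assign slab offsets so tensors with overlapping lifetimes never overlap.

    tensors maps name -> (size_bytes, first_use, last_use).
    Returns (offsets, total_bytes).
    """
    # Largest tensors first: the "decreasing" in first-fit-decreasing.
    order = sorted(tensors, key=lambda name: tensors[name][0], reverse=True)
    placed = []  # (offset, aligned_size, first_use, last_use)
    offsets = {}
    for name in order:
        size, first, last = tensors[name]
        size = align_up(size)
        # Placed tensors whose live range overlaps this one interfere with it.
        busy = sorted(
            (off, off + sz)
            for off, sz, f, l in placed
            if not (last < f or l < first)
        )
        offset = 0
        for start, end in busy:
            if offset + size <= start:
                break  # first gap that is large enough
            offset = max(offset, end)
        offsets[name] = offset
        placed.append((offset, size, first, last))
    total = max((off + sz for off, sz, _, _ in placed), default=0)
    return offsets, total
```

Because every size is rounded up to a multiple of 256 and offsets only ever land at 0 or at the end of an aligned block, all offsets stay 256-byte aligned without a separate alignment pass.
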
Language Features (M38-M40)
- Linear Types — ownership checker with use-after-move detection, branch consumption symmetry, loop consumption prevention, and `@shared` escape hatch
- Autodiff Safety — BackwardAccess classification for all 36 TapeOp variants (ShapeOnly/DataRequired/AuxDataRequired)
- vmap — `@vmap(batch_dim=0)` automatic batching with batch-variant/invariant tracking, dimension shifting, and matmul rewrite classification
- Source-to-Source AD — Wengert list extraction, 18 reverse-mode adjoint rules (reviewer-verified correct), dead gradient elimination, saved tensor analysis
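
As background for the Source-to-Source AD item, the following sketch shows what a Wengert list and reverse-mode adjoint rules look like on scalars. It covers only three operations (`add`, `mul`, `sin`) with hypothetical names and a tuple-based tape format invented for this example; it interprets the list at runtime, whereas the feature described above works at compile time.

```python
import math

def forward(tape, inputs):
    """Evaluate a Wengert list: (result, op, args) triples in execution order."""
    values = dict(inputs)
    for result, op, args in tape:
        a = [values[name] for name in args]
        if op == "add":
            values[result] = a[0] + a[1]
        elif op == "mul":
            values[result] = a[0] * a[1]
        elif op == "sin":
            values[result] = math.sin(a[0])
        else:
            raise ValueError(f"unknown op: {op}")
    return values

def backward(tape, values, output):
    """Walk the list in reverse, applying one adjoint rule per operation."""
    grads = {output: 1.0}

    def accumulate(name, g):
        grads[name] = grads.get(name, 0.0) + g

    for result, op, args in reversed(tape):
        if result not in grads:
            continue  # never feeds the output, so it has no gradient to propagate
        g = grads[result]
        if op == "add":
            accumulate(args[0], g)
            accumulate(args[1], g)
        elif op == "mul":
            x, y = args
            accumulate(x, g * values[y])  # needs the saved value of y
            accumulate(y, g * values[x])  # needs the saved value of x
        elif op == "sin":
            accumulate(args[0], g * math.cos(values[args[0]]))
    return grads

# f(x, y) = x * y + sin(x), plus one entry that does not reach the output.
tape = [
    ("t0", "mul", ("x", "y")),
    ("t1", "sin", ("x",)),
    ("t2", "add", ("t0", "t1")),
    ("unused", "mul", ("y", "y")),
]
values = forward(tape, {"x": 2.0, "y": 3.0})
grads = backward(tape, values, "t2")
```

The sketch also hints at why the other listed analyses matter: `add` needs no saved forward values while `mul` and `sin` do, and the `unused` entry is skipped entirely during the backward walk.
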
Stats
- 408 unit tests passing across all crates
- 4,738 new lines of code across 42 files
- 18 new source modules + 7 implementation plans
- Clippy clean, release build verified
Breaking Changes
None. All new features are additive. Existing code compiles unchanged.
Known Limitations
- `--linear-types`, `--vram-budget`, `--perf` CLI flags are parsed but not yet wired through `compile_entry()` (same status as `--fusion-report`)
- Source AD and vmap are infrastructure-only (analysis libraries complete, codegen integration in progress)
- E2E tests that invoke `nsl run` require a C compiler (gcc/clang/MSVC) in PATH for linking
Full Changelog: v0.2.0...v0.3.0