v0.3.0

@bwiemz bwiemz released this 18 Mar 02:56
· 1912 commits to main since this release

What's New

Scaling Infrastructure (M32-M34)

  • Mixture of Experts: @moe decorator with top-k gating, capacity-based routing, and load-balancing aux loss
  • Speculative Decoding: @speculative with tree attention, rejection sampling, and @medusa multi-head prediction
  • Ring Attention: @context_parallel(ring_size=N) for cross-GPU sequence parallelism with causal masking
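
To make the Mixture of Experts routing terms concrete, here is a minimal Python sketch of top-k gating with a per-expert capacity limit and a load-balancing auxiliary loss. It is an illustration only, not the compiler's implementation: the function names, the drop-on-overflow policy, and the Switch-Transformer-style loss formula are all assumptions chosen for the example.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(token_logits, k, capacity):
    """Hypothetical router: send each token to its top-k experts.

    Assignments that would exceed an expert's capacity are dropped.
    Returns (assignments, aux_loss); assignments[t] is a list of
    (expert, gate_weight) pairs accepted for token t.
    """
    num_experts = len(token_logits[0])
    load = [0] * num_experts          # accepted tokens per expert
    first_choice = [0] * num_experts  # tokens whose top-1 pick is this expert
    prob_sum = [0.0] * num_experts    # summed gate probability per expert
    assignments = []
    for logits in token_logits:
        probs = softmax(logits)
        ranked = sorted(range(num_experts), key=lambda e: probs[e], reverse=True)
        chosen = ranked[:k]
        first_choice[chosen[0]] += 1
        for e in range(num_experts):
            prob_sum[e] += probs[e]
        norm = sum(probs[e] for e in chosen)  # renormalize over the top-k
        accepted = []
        for e in chosen:
            if load[e] < capacity:
                load[e] += 1
                accepted.append((e, probs[e] / norm))
        assignments.append(accepted)
    n = len(token_logits)
    # Assumed loss form: E * sum_e (fraction routed to e) * (mean gate prob of e).
    # It equals 1.0 under perfectly uniform routing and grows with imbalance.
    aux_loss = num_experts * sum(
        (first_choice[e] / n) * (prob_sum[e] / n) for e in range(num_experts)
    )
    return assignments, aux_loss

# Three identical tokens all prefer experts 0 and 1; with capacity 2
# the third token overflows both experts and is dropped.
assignments, aux_loss = route_top_k([[2.0, 1.0, 0.0]] * 3, k=2, capacity=2)
```

Dropping overflow tokens is the simplest capacity policy; real systems often reroute them or pass them through a residual path instead.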

Quantization (M35)

  • FP8 Compute: @fp8_compute decorator with E4M3/E5M2 scale management and automatic Tensor Core dispatch
  • AWQ 4-bit: quant { dtype: awq4 } with in-register dequantize-in-GEMM (zero memory round-trip)
  • GPTQ 4-bit/8-bit: quant { dtype: gptq4 } with Hessian-based optimal quantization
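
As a rough picture of the storage format behind 4-bit weights, the Python sketch below quantizes one weight group to signed 4-bit codes with a shared scale and packs two codes per byte. It deliberately uses plain round-to-nearest, not AWQ's activation-aware scaling or GPTQ's Hessian-based updates, and all function names are hypothetical.

```python
def quantize_group_4bit(weights):
    """Symmetric round-to-nearest quantization of one group to codes in [-8, 7]."""
    amax = max(abs(w) for w in weights)
    scale = amax / 7.0 if amax > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_group(codes, scale):
    """Recover approximate weights; a fused kernel would do this in registers."""
    return [c * scale for c in codes]

def pack_nibbles(codes):
    """Pack pairs of signed 4-bit codes into bytes, low nibble first."""
    if len(codes) % 2:
        raise ValueError("need an even number of codes")
    out = bytearray()
    for lo, hi in zip(codes[0::2], codes[1::2]):
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)

def unpack_nibbles(packed):
    """Inverse of pack_nibbles, sign-extending each nibble."""
    codes = []
    for byte in packed:
        for nib in (byte & 0xF, byte >> 4):
            codes.append(nib - 16 if nib >= 8 else nib)
    return codes

weights = [0.7, -0.35, 0.1, 0.0]
codes, scale = quantize_group_4bit(weights)
packed = pack_nibbles(codes)               # 4 weights stored in 2 bytes
restored = dequantize_group(unpack_nibbles(packed), scale)
```

Each restored weight differs from the original by at most half the group scale, which is the error that calibration-based methods such as AWQ and GPTQ try to place where it hurts accuracy least.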

Compiler Intelligence (M36-M37)

  • Memory Planning — compile-time tensor liveness analysis, interference graph, first-fit-decreasing slab assignment with 256-byte GPU alignment
  • Roofline Cost Model — per-operation FLOP/byte/arithmetic-intensity analysis against a built-in GPU database (A100, H100, RTX-4090, RTX-3090, L40S); table, JSON, and Chrome tracing output formats
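
The memory-planning steps can be sketched in a few lines of Python: derive interference from liveness intervals, then place tensors largest-first at the lowest 256-byte-aligned offset that does not collide with any tensor live at the same time. This is a simplified illustration under stated assumptions (a single slab, inclusive step indices, hypothetical names), not the planner shipped in this release.

```python
ALIGN = 256  # GPU alignment from the release notes

def align_up(n, align=ALIGN):
    return (n + align - 1) // align * align

def plan_slab(tensors):
    """tensors: name -> (size_bytes, first_use, last_use), steps inclusive.

    Returns (offsets, slab_size) for a single shared slab.
    """
    def interferes(a, b):
        # Two tensors interfere when their live ranges overlap.
        _, a0, a1 = tensors[a]
        _, b0, b1 = tensors[b]
        return a0 <= b1 and b0 <= a1

    offsets = {}
    # First-fit decreasing: largest tensors first, names break ties.
    order = sorted(tensors, key=lambda n: (-tensors[n][0], n))
    for name in order:
        size = align_up(tensors[name][0])
        # Address ranges held by already-placed tensors that are live concurrently.
        busy = sorted(
            (offsets[o], offsets[o] + align_up(tensors[o][0]))
            for o in offsets
            if interferes(name, o)
        )
        offset = 0
        for start, end in busy:
            if offset + size <= start:
                break  # found a gap large enough before this range
            offset = max(offset, end)
        offsets[name] = offset
    slab = max((offsets[n] + align_up(tensors[n][0]) for n in tensors), default=0)
    return offsets, slab

# "a" and "c" are never live together, so they share offset 0.
offsets, slab = plan_slab({
    "a": (1000, 0, 1),
    "b": (300, 1, 2),
    "c": (1000, 2, 3),
})
# offsets == {"a": 0, "c": 0, "b": 1024}; slab == 1536 (versus 2560 without reuse)
```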

Language Features (M38-M40)

  • Linear Types — ownership checker with use-after-move detection, branch consumption symmetry, loop consumption prevention, and @shared escape hatch
  • Autodiff Safety — BackwardAccess classification for all 36 TapeOp variants (ShapeOnly/DataRequired/AuxDataRequired)
  • vmap: @vmap(batch_dim=0) automatic batching with batch-variant/invariant tracking, dimension shifting, and matmul rewrite classification
  • Source-to-Source AD — Wengert list extraction, 18 reverse-mode adjoint rules (reviewer-verified correct), dead gradient elimination, saved tensor analysis
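
For readers unfamiliar with the Source-to-Source AD terminology, the Python sketch below shows a tiny Wengert list (a straight-line sequence of primitive operations) and a reverse sweep that applies one adjoint rule per operation. It covers only three illustrative ops with hypothetical names, and its skip of entries with no adjoint is a crude stand-in for dead gradient elimination, not the analysis implemented in this release.

```python
import math

def forward(wengert, inputs):
    """Evaluate a Wengert list of (output, op, args) entries in order."""
    env = dict(inputs)
    for out, op, args in wengert:
        vals = [env[a] for a in args]
        if op == "add":
            env[out] = vals[0] + vals[1]
        elif op == "mul":
            env[out] = vals[0] * vals[1]
        elif op == "sin":
            env[out] = math.sin(vals[0])
        else:
            raise ValueError(f"unknown op {op}")
    return env

def backward(wengert, env, output):
    """Reverse sweep: propagate adjoints from `output` back to the inputs."""
    adj = {output: 1.0}
    for out, op, args in reversed(wengert):
        g = adj.get(out)
        if g is None:
            continue  # value never reaches the output, so no gradient work
        if op == "add":
            contribs = [g, g]
        elif op == "mul":
            contribs = [g * env[args[1]], g * env[args[0]]]
        elif op == "sin":
            contribs = [g * math.cos(env[args[0]])]
        else:
            raise ValueError(f"unknown op {op}")
        for a, c in zip(args, contribs):
            adj[a] = adj.get(a, 0.0) + c  # accumulate over all uses
    return adj

# z = sin(x * y) + x, plus one entry that does not feed z.
wengert = [
    ("t1", "mul", ("x", "y")),
    ("t2", "sin", ("t1",)),
    ("z", "add", ("t2", "x")),
    ("dead", "mul", ("y", "y")),
]
env = forward(wengert, {"x": 2.0, "y": 3.0})
adj = backward(wengert, env, "z")
# adj["x"] is y*cos(x*y) + 1 and adj["y"] is x*cos(x*y)
```

The forward values that the `mul` and `sin` rules read from `env` are what a saved tensor analysis would need to keep alive for the backward pass; `add` needs none.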

Stats

  • 408 unit tests passing across all crates
  • 4,738 new lines of code across 42 files
  • 18 new source modules + 7 implementation plans
  • Clippy clean, release build verified

Breaking Changes

None. All new features are additive. Existing code compiles unchanged.

Known Limitations

  • --linear-types, --vram-budget, --perf CLI flags are parsed but not yet wired through compile_entry() (same status as --fusion-report)
  • Source AD and vmap are infrastructure-only (analysis libraries complete, codegen integration in progress)
  • E2E tests that invoke nsl run require a C compiler (gcc/clang/MSVC) in PATH for linking

Full Changelog: v0.2.0...v0.3.0