Releases: bwiemz/NSL
v0.9.0
Full Changelog: v0.8.0...v0.9.0
v0.8.0: Full Roadmap Complete — M9 through M51
Milestone: Full Roadmap Delivered
NeuralScript v0.8.0 marks the completion of the entire M9–M51 roadmap. Every milestone now has its infrastructure layer implemented with analysis modules, runtime FFI, semantic validation, and unit tests.
New in v0.8.0 (Phase 9: Type System Extensions)
M49: Shape Algebra
- Symbolic dimension solver with equality, divisibility, and range proofs
- DimExpr extended with Mod variant + Eq/Hash derives
- shape_assert decorator recognition
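As an illustration of the M49 items above, here is a minimal sketch of a symbolic dimension type with a Mod variant and Eq/Hash derives. The `Dim` enum and the `eval` and `divides` helpers are hypothetical names, not NSL's actual DimExpr API. The sketch only checks divisibility for one concrete binding, whereas a real solver would presumably prove it symbolically.

```rust
use std::collections::HashMap;

/// Hypothetical symbolic dimension expression (not NSL's actual `DimExpr`).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum Dim {
    Const(u64),
    Sym(String),
    Add(Box<Dim>, Box<Dim>),
    Mul(Box<Dim>, Box<Dim>),
    Mod(Box<Dim>, Box<Dim>),
}

/// Evaluate an expression under concrete bindings.
/// Returns `None` for an unbound symbol, arithmetic overflow, or modulo by zero.
fn eval(d: &Dim, env: &HashMap<String, u64>) -> Option<u64> {
    match d {
        Dim::Const(c) => Some(*c),
        Dim::Sym(s) => env.get(s).copied(),
        Dim::Add(a, b) => eval(a, env)?.checked_add(eval(b, env)?),
        Dim::Mul(a, b) => eval(a, env)?.checked_mul(eval(b, env)?),
        Dim::Mod(a, b) => eval(a, env)?.checked_rem(eval(b, env)?),
    }
}

/// Check `lhs % rhs == 0` for one concrete binding of the symbols.
fn divides(lhs: &Dim, rhs: &Dim, env: &HashMap<String, u64>) -> Option<bool> {
    let rem = Dim::Mod(Box::new(lhs.clone()), Box::new(rhs.clone()));
    eval(&rem, env).map(|r| r == 0)
}
```

With `H` bound to 768, `divides` reports that 12 divides `H` and 5 does not. An unbound symbol yields `None` rather than a guess.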
M50: Sparse Tensors
- NslSparseTensor repr(C) struct with COO/CSR/CSC/BSR format support
- Format-aware kernel dispatch (SparseOp × Format × Device)
- Sparsity-preserving type inference rules
- sparse(pattern="2:4") decorator validation
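The "2:4" pattern conventionally means at most 2 non-zero values in every contiguous group of 4. The sketch below shows how such a pattern string might be parsed and checked against a row of values. `parse_pattern` and `satisfies_pattern` are hypothetical helpers for illustration, and the release notes do not describe what the actual decorator validation checks.

```rust
/// Parse an "N:M" pattern such as "2:4" into (n, m).
/// Returns `None` if the string is malformed, m is zero, or n > m.
fn parse_pattern(s: &str) -> Option<(usize, usize)> {
    let (n, m) = s.split_once(':')?;
    let n: usize = n.trim().parse().ok()?;
    let m: usize = m.trim().parse().ok()?;
    if m == 0 || n > m {
        return None;
    }
    Some((n, m))
}

/// True if every contiguous group of `m` values holds at most `n` non-zeros.
/// Rows whose length is not a multiple of `m` are rejected.
fn satisfies_pattern(row: &[f32], n: usize, m: usize) -> bool {
    m > 0
        && row.len() % m == 0
        && row
            .chunks(m)
            .all(|group| group.iter().filter(|v| **v != 0.0).count() <= n)
}
```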
M51: Effect System
- EffectSet bitset tracking IO, Random, Mutation, Communication
- 3-phase EffectChecker: local inference → call graph propagation → assertion validation
- pure enforcement (no effects), checkpoint (requires pure), deterministic (no Random)
- ~40 known-pure builtins, conservative default for unknowns
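The propagation phase described above can be sketched as a fixed-point iteration over a call graph, with effects stored as bits. This is a simplified illustration under assumed names (`propagate`, `is_pure`, `is_deterministic`, and the bit constants are hypothetical) and is not the actual EffectChecker. Unknown callees conservatively receive every effect, mirroring the "conservative default for unknowns" item.

```rust
use std::collections::HashMap;

// Hypothetical effect bits; the real EffectSet layout is not shown in these notes.
const IO: u8 = 1 << 0;
const RANDOM: u8 = 1 << 1;
const MUTATION: u8 = 1 << 2;
const COMMUNICATION: u8 = 1 << 3;
const ALL: u8 = IO | RANDOM | MUTATION | COMMUNICATION;

/// Propagate locally inferred effects through the call graph until nothing changes.
/// Terminates because effect sets only grow and the bitset is finite.
fn propagate(
    local: &HashMap<&str, u8>,
    calls: &HashMap<&str, Vec<&str>>,
) -> HashMap<String, u8> {
    let mut total: HashMap<String, u8> =
        local.iter().map(|(k, v)| (k.to_string(), *v)).collect();
    let mut changed = true;
    while changed {
        changed = false;
        for (caller, callees) in calls {
            let mut eff = total.get(*caller).copied().unwrap_or(0);
            for callee in callees {
                // Conservative default: an unknown callee may have any effect.
                eff |= total.get(*callee).copied().unwrap_or(ALL);
            }
            if total.get(*caller).copied() != Some(eff) {
                total.insert(caller.to_string(), eff);
                changed = true;
            }
        }
    }
    total
}

/// Assertion checks corresponding to "pure" and "deterministic".
fn is_pure(effects: u8) -> bool {
    effects == 0
}

fn is_deterministic(effects: u8) -> bool {
    (effects & RANDOM) == 0
}
```

A function that calls a sampler inherits the Random bit and therefore fails the deterministic check, while a function that only calls known-pure helpers stays pure.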
Code Quality (from external review)
- All CLI flags now wired through to compiler
- 5 hotspot files refactored (tensor.rs, expr.rs, compiler.rs, checker.rs, autodiff.rs)
- 14 panic points replaced with graceful error handling
- Clippy strict clean (--all-targets)
- CHANGELOG covers all versions
- README honestly separates shipped vs infrastructure features
Stats
- 726 unit tests passing
- 43 milestones (M9–M51) with infrastructure complete
- Clippy strict clean
Full Changelog: v0.7.0...v0.8.0
v0.7.0: Phase 8 — Developer Experience, Debugging & Multimodal
What's New
Tensor Debugger (M45)
- Binary trace recording (124-byte fixed-size entries) with per-op stats (min/max/mean/std)
- NaN/Inf sentinel detection with automatic halt
- Compile-time NaN risk analysis (log/sqrt/div patterns)
- Trace diffing for non-determinism diagnosis
- Chrome tracing export
- @no_trace and @trace_breakpoint decorators
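To make the per-op stats and the NaN/Inf sentinel concrete, here is a minimal sketch that summarizes one op's output and stops at the first non-finite value. `OpStats`, `TraceError`, and `summarize` are hypothetical names. The real 124-byte trace entry layout is not described in these notes, and the standard deviation here is the population form, which is an assumption.

```rust
/// Hypothetical per-op summary (not the actual trace entry layout).
#[derive(Debug, Clone, Copy, PartialEq)]
struct OpStats {
    min: f32,
    max: f32,
    mean: f32,
    std: f32,
}

#[derive(Debug, PartialEq)]
enum TraceError {
    Empty,
    /// Index of the first NaN or Inf value, where a debugger could halt.
    NonFinite(usize),
}

/// Compute min/max/mean/std, or report the first non-finite element.
fn summarize(values: &[f32]) -> Result<OpStats, TraceError> {
    if values.is_empty() {
        return Err(TraceError::Empty);
    }
    if let Some(i) = values.iter().position(|v| !v.is_finite()) {
        return Err(TraceError::NonFinite(i));
    }
    // Accumulate in f64 to limit rounding error on large tensors.
    let n = values.len() as f64;
    let mean = values.iter().map(|&v| f64::from(v)).sum::<f64>() / n;
    let var = values
        .iter()
        .map(|&v| (f64::from(v) - mean).powi(2))
        .sum::<f64>()
        / n;
    let min = values.iter().copied().fold(f32::INFINITY, f32::min);
    let max = values.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    Ok(OpStats { min, max, mean: mean as f32, std: var.sqrt() as f32 })
}
```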
Reproducibility Mode (M46)
- --deterministic flag with compile-time non-determinism detection
- 4 non-determinism categories: GPU atomics (auto-fixed), algorithm selection (auto-fixed), implicit RNG (error), external (warning)
- Deterministic kernel variant selection (sort-based reduction, fixed cuBLAS)
- RNG seed tracking (ExplicitSeed/Derived/Implicit)
- Graph hash computation for checkpoint fingerprinting
Multimodal Primitives (M48)
- PatchEmbed config with compile-time validation (image_size % patch_size)
- MelSpectrogram with compile-time mel filterbank (hz-to-mel triangular filters)
- CrossAttention config with Q/K dim matching and head divisibility
- Modality classification heuristic (Vision/Audio/Text by rank+dtype)
- @multimodal decorator validation
- 7 preprocessing FFI stubs (patch_embed, mel, cross_attention, resize, normalize, stft, resample)
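Two of the checks above are simple enough to sketch: the PatchEmbed divisibility rule and the hz-to-mel mapping used to place triangular filters. The function names are hypothetical, and the HTK-style mel formula is an assumption, since the notes do not say which mel scale variant NSL uses.

```rust
/// Validate `image_size % patch_size == 0` and return patches per side.
fn patches_per_side(image_size: u32, patch_size: u32) -> Result<u32, String> {
    if patch_size == 0 || image_size % patch_size != 0 {
        return Err(format!(
            "patch_size {patch_size} must be non-zero and divide image_size {image_size}"
        ));
    }
    Ok(image_size / patch_size)
}

/// HTK-style conversion from Hz to mel (assumed variant).
fn hz_to_mel(hz: f64) -> f64 {
    2595.0 * (1.0 + hz / 700.0).log10()
}

/// Inverse of `hz_to_mel`.
fn mel_to_hz(mel: f64) -> f64 {
    700.0 * (10f64.powf(mel / 2595.0) - 1.0)
}
```

For a 224-pixel image and 16-pixel patches this gives 14 patches per side, while a 15-pixel patch size is rejected.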
Stats
- 678 unit tests passing
- Clippy clean
Full Changelog: v0.5.0...v0.7.0
v0.5.0: Phase 6 — Deployment, Portability & Testing Infrastructure
Multi-Backend KIR Foundation (M47a)
- Kernel IR — 40+ instruction SSA-form intermediate representation
- PTX Backend — KIR to PTX lowering with typed register allocation
- GpuTarget — CUDA/ROCm/Metal/WebGPU with per-backend feature capability tables
- GpuBackend trait — alloc/free/copy/launch/sync interface for all backends
- target(backend) — conditional compilation per GPU target
- --target — CLI flag for backend selection
vmap AST Transform (M39b)
- VmapTransformer — FnDef-to-FnDef AST rewriting producing _batched variants
- Matmul/reduction/transpose rewriting with batch status propagation
- nsl_vmap_check_batch runtime FFI
Testing Infrastructure
- Snapshot testing (insta) — 7 PTX/KIR/fusion snapshots catching silent codegen regressions
- Differential oracle testing — same script with/without --disable-fusion, assert numerical equivalence
Full Changelog: v0.4.0...v0.5.0
v0.4.0
v0.4.0: Phase 5 — Inference Optimization & Compile-Time Moat Features
Milestones M41, M42, M44 complete (M36, M37 shipped in v0.3.0).
New in v0.4.0:
- M41: Disaggregated inference (prefill/decode worker separation, KV transfer, router scheduling)
- M42: KV-cache compression (INT8/INT4/FP8 quantization, sliding window, H2O eviction)
- M44: Constrained decoding (compiled FSM, JSON Schema/BNF grammars, token-level DFA, logit masking)
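The logit masking step of constrained decoding (M44) can be sketched with a small token-level DFA: tokens with no transition from the current state get a logit of negative infinity, so they can never be selected. `TokenDfa` and `argmax` are hypothetical names for illustration. The compiled FSM in NSL is presumably more elaborate.

```rust
/// Hypothetical token-level DFA: `transitions[state][token]` is the next
/// state if the token is allowed in that state, or `None` if it is not.
struct TokenDfa {
    transitions: Vec<Vec<Option<usize>>>,
}

impl TokenDfa {
    /// Next state after consuming `token`, or `None` if it is disallowed.
    fn step(&self, state: usize, token: usize) -> Option<usize> {
        self.transitions.get(state)?.get(token).copied().flatten()
    }

    /// Set the logits of all disallowed tokens to negative infinity.
    fn mask_logits(&self, state: usize, logits: &mut [f32]) {
        for (token, logit) in logits.iter_mut().enumerate() {
            if self.step(state, token).is_none() {
                *logit = f32::NEG_INFINITY;
            }
        }
    }
}

/// Greedy choice over the finite (unmasked) logits.
fn argmax(logits: &[f32]) -> Option<usize> {
    logits
        .iter()
        .enumerate()
        .filter(|(_, v)| v.is_finite())
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}
```

In state 0 of a DFA that only allows token 0, masking forces greedy decoding to pick token 0 even when another token has a higher raw logit.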
Full Changelog: v0.3.0...v0.4.0
v0.3.0
What's New
Scaling Infrastructure (M32-M34)
- Mixture of Experts: @moe decorator with top-k gating, capacity-based routing, and load-balancing aux loss
- Speculative Decoding: @speculative with tree attention, rejection sampling, and @medusa multi-head prediction
- Ring Attention: @context_parallel(ring_size=N) for cross-GPU sequence parallelism with causal masking
Quantization (M35)
- FP8 Compute: @fp8_compute decorator with E4M3/E5M2 scale management and automatic Tensor Core dispatch
- AWQ 4-bit: quant { dtype: awq4 } with in-register dequantize-in-GEMM (zero memory round-trip)
- GPTQ 4-bit/8-bit: quant { dtype: gptq4 } with Hessian-based optimal quantization
Compiler Intelligence (M36-M37)
- Memory Planning — compile-time tensor liveness analysis, interference graph, first-fit-decreasing slab assignment with 256-byte GPU alignment
- Roofline Cost Model — per-operation FLOP/byte/arithmetic-intensity analysis against a built-in GPU database (A100, H100, RTX-4090, RTX-3090, L40S); table, JSON, and Chrome tracing output formats
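The roofline analysis above rests on one formula: attainable throughput is the smaller of peak compute and memory bandwidth times arithmetic intensity (FLOPs per byte). The sketch below applies it to an f32 matmul. `Device`, `matmul_cost`, and the other names are hypothetical, the byte count assumes each operand is read once and the output written once, and no real GPU figures are used.

```rust
/// Hypothetical device description; callers supply their own figures.
struct Device {
    peak_flops: f64,    // FLOP/s
    mem_bandwidth: f64, // bytes/s
}

/// FLOPs and bytes moved for an f32 matmul of (m x k) by (k x n),
/// assuming each operand is read once and the output is written once.
fn matmul_cost(m: u64, k: u64, n: u64) -> (f64, f64) {
    let (m, k, n) = (m as f64, k as f64, n as f64);
    let flops = 2.0 * m * k * n;
    let bytes = 4.0 * (m * k + k * n + m * n);
    (flops, bytes)
}

/// Roofline bound: min(peak compute, bandwidth * arithmetic intensity).
fn attainable_flops(dev: &Device, flops: f64, bytes: f64) -> f64 {
    let intensity = flops / bytes;
    dev.peak_flops.min(dev.mem_bandwidth * intensity)
}

/// An op is memory-bound when its intensity is below the ridge point.
fn is_memory_bound(dev: &Device, flops: f64, bytes: f64) -> bool {
    flops / bytes < dev.peak_flops / dev.mem_bandwidth
}
```

On a toy device with a ridge point of 10 FLOPs per byte, a 1x1024x1024 vector-matrix product (about 0.5 FLOPs per byte) is memory-bound, while a 1024x1024x1024 matmul (about 171 FLOPs per byte) is compute-bound.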
Language Features (M38-M40)
- Linear Types: ownership checker with use-after-move detection, branch consumption symmetry, loop consumption prevention, and @shared escape hatch
- Autodiff Safety: BackwardAccess classification for all 36 TapeOp variants (ShapeOnly/DataRequired/AuxDataRequired)
- vmap: @vmap(batch_dim=0) automatic batching with batch-variant/invariant tracking, dimension shifting, and matmul rewrite classification
- Source-to-Source AD: Wengert list extraction, 18 reverse-mode adjoint rules (reviewer-verified correct), dead gradient elimination, saved tensor analysis
Stats
- 408 unit tests passing across all crates
- 4,738 new lines of code across 42 files
- 18 new source modules + 7 implementation plans
- Clippy clean, release build verified
Breaking Changes
None. All new features are additive. Existing code compiles unchanged.
Known Limitations
- --linear-types, --vram-budget, --perf CLI flags are parsed but not yet wired through compile_entry() (same status as --fusion-report)
- Source AD and vmap are infrastructure-only (analysis libraries complete, codegen integration in progress)
- E2E tests that invoke nsl run require a C compiler (gcc/clang/MSVC) in PATH for linking
Full Changelog: v0.2.0...v0.3.0
v0.2.0
Full Changelog: v0.1.0...v0.2.0