Releases: bwiemz/NSL

v0.9.0

19 Mar 17:52

Full Changelog: v0.8.0...v0.9.0

v0.8.0: Full Roadmap Complete — M9 through M51

19 Mar 00:08

Milestone: Full Roadmap Delivered

NeuralScript v0.8.0 marks the completion of the entire M9–M51 roadmap. Every milestone now has its infrastructure layer implemented with analysis modules, runtime FFI, semantic validation, and unit tests.

New in v0.8.0 (Phase 9: Type System Extensions)

M49: Shape Algebra

  • Symbolic dimension solver with equality, divisibility, and range proofs
  • DimExpr extended with Mod variant + Eq/Hash derives
  • shape_assert decorator recognition
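
As a rough illustration of what a divisibility proof over symbolic dimensions can look like, here is a minimal sketch. The `DimExpr` variants other than `Mod` and the `provably_divisible` helper are assumptions made for this example, not the actual NSL definitions. The check is deliberately conservative: it returns `true` only when divisibility holds for every value of the symbols.

```rust
/// Hypothetical symbolic dimension expression. Only the `Mod` variant and
/// the Eq/Hash derives are named in the release notes; the rest is assumed.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum DimExpr {
    Const(i64),
    Sym(String),
    Add(Box<DimExpr>, Box<DimExpr>),
    Mul(Box<DimExpr>, Box<DimExpr>),
    Mod(Box<DimExpr>, Box<DimExpr>),
}

/// Returns true only if `expr` is divisible by `d` for all symbol values.
/// Sound but incomplete: a `false` result means "not proven", not "disproven".
pub fn provably_divisible(expr: &DimExpr, d: i64) -> bool {
    if d == 0 {
        return false; // divisibility by zero is undefined
    }
    if d == 1 || d == -1 {
        return true; // every integer is divisible by +/-1
    }
    match expr {
        DimExpr::Const(c) => c % d == 0,
        // Nothing is known about a bare symbol.
        DimExpr::Sym(_) => false,
        // A sum is divisible if both terms are.
        DimExpr::Add(a, b) => provably_divisible(a, d) && provably_divisible(b, d),
        // A product is divisible if either factor is.
        DimExpr::Mul(a, b) => provably_divisible(a, d) || provably_divisible(b, d),
        // No rule for remainders in this sketch.
        DimExpr::Mod(_, _) => false,
    }
}
```

With this sketch, `8 * heads` is provably divisible by 4, while a bare symbol `n` is not provably divisible by 2.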

M50: Sparse Tensors

  • NslSparseTensor repr(C) struct with COO/CSR/CSC/BSR format support
  • Format-aware kernel dispatch (SparseOp × Format × Device)
  • Sparsity-preserving type inference rules
  • sparse(pattern="2:4") decorator validation

M51: Effect System

  • EffectSet bitset tracking IO, Random, Mutation, Communication
  • 3-phase EffectChecker: local inference → call graph propagation → assertion validation
  • pure enforcement (no effects), checkpoint (requires pure), deterministic (no Random)
  • ~40 known-pure builtins, conservative default for unknowns
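
The bitset and the call-graph propagation phase can be sketched as follows. This is an illustrative approximation only: the bit layout, the `propagate` function, and the use of string function names are assumptions, not the actual `EffectChecker` code. Callees with no known effect set get every effect, mirroring the conservative default described above.

```rust
use std::collections::HashMap;

/// Hypothetical effect bitset; the effect names follow the release notes,
/// the bit assignments are assumed.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct EffectSet(u8);

impl EffectSet {
    pub const PURE: EffectSet = EffectSet(0);
    pub const IO: EffectSet = EffectSet(1 << 0);
    pub const RANDOM: EffectSet = EffectSet(1 << 1);
    pub const MUTATION: EffectSet = EffectSet(1 << 2);
    pub const COMMUNICATION: EffectSet = EffectSet(1 << 3);
    pub const ALL: EffectSet = EffectSet(0b1111);

    pub fn union(self, other: EffectSet) -> EffectSet {
        EffectSet(self.0 | other.0)
    }
    pub fn contains(self, other: EffectSet) -> bool {
        self.0 & other.0 == other.0
    }
    pub fn is_pure(self) -> bool {
        self.0 == 0
    }
}

/// Propagates effects from callees to callers until nothing changes.
/// `local` holds locally inferred effects; `calls` maps caller -> callees.
/// Functions missing from `local` are treated as having all effects.
pub fn propagate(
    local: &HashMap<String, EffectSet>,
    calls: &HashMap<String, Vec<String>>,
) -> HashMap<String, EffectSet> {
    let mut total = local.clone();
    let mut changed = true;
    while changed {
        changed = false;
        for (caller, callees) in calls {
            let mut acc = total.get(caller).copied().unwrap_or(EffectSet::ALL);
            for callee in callees {
                acc = acc.union(total.get(callee).copied().unwrap_or(EffectSet::ALL));
            }
            if total.get(caller) != Some(&acc) {
                total.insert(caller.clone(), acc);
                changed = true;
            }
        }
    }
    total
}
```

An assertion phase could then reject a `pure` function whose propagated set is not empty, or a `deterministic` function whose set contains `RANDOM`. The loop terminates because effect sets only grow and the lattice is finite.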

Code Quality (from external review)

  • All CLI flags now wired through to compiler
  • 5 hotspot files refactored (tensor.rs, expr.rs, compiler.rs, checker.rs, autodiff.rs)
  • 14 panic points replaced with graceful error handling
  • Clippy strict clean (--all-targets)
  • CHANGELOG covers all versions
  • README honestly separates shipped vs infrastructure features

Stats

  • 726 unit tests passing
  • 43 milestones (M9–M51) with infrastructure complete
  • Clippy strict clean

Full Changelog: v0.7.0...v0.8.0

v0.7.0: Phase 8 — Developer Experience, Debugging & Multimodal

18 Mar 22:11

What's New

Tensor Debugger (M45)

  • Binary trace recording (124-byte fixed-size entries) with per-op stats (min/max/mean/std)
  • NaN/Inf sentinel detection with automatic halt
  • Compile-time NaN risk analysis (log/sqrt/div patterns)
  • Trace diffing for non-determinism diagnosis
  • Chrome tracing export
  • @no_trace and @trace_breakpoint decorators
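
To make the per-op statistics and the NaN/Inf check concrete, here is a minimal sketch. The `OpStats` struct and `summarize` function are hypothetical names for this example; the real trace entry layout (124 bytes) is not reproduced here. Statistics are computed over finite values only, and the standard deviation is the population form.

```rust
/// Hypothetical per-op summary; the field set follows the release notes.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct OpStats {
    pub min: f32,
    pub max: f32,
    pub mean: f32,
    pub std: f32,
    /// True if any element was NaN or +/-Inf.
    pub has_nan_or_inf: bool,
}

/// Summarizes a tensor's values. Returns None for an empty slice.
/// Non-finite values set the sentinel flag and are excluded from the stats.
pub fn summarize(values: &[f32]) -> Option<OpStats> {
    if values.is_empty() {
        return None;
    }
    let (mut min, mut max) = (f32::INFINITY, f32::NEG_INFINITY);
    let (mut sum, mut sum_sq) = (0.0f64, 0.0f64);
    let mut finite = 0usize;
    let mut bad = false;
    for &v in values {
        if !v.is_finite() {
            bad = true;
            continue;
        }
        min = min.min(v);
        max = max.max(v);
        sum += v as f64;
        sum_sq += (v as f64) * (v as f64);
        finite += 1;
    }
    if finite == 0 {
        // Every element was NaN/Inf: no meaningful statistics exist.
        let nan = f32::NAN;
        return Some(OpStats { min: nan, max: nan, mean: nan, std: nan, has_nan_or_inf: true });
    }
    let n = finite as f64;
    let mean = sum / n;
    // Population variance, clamped to guard against rounding below zero.
    let var = (sum_sq / n - mean * mean).max(0.0);
    Some(OpStats { min, max, mean: mean as f32, std: var.sqrt() as f32, has_nan_or_inf: bad })
}
```

A debugger built on this could halt as soon as `has_nan_or_inf` is set for any recorded op.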

Reproducibility Mode (M46)

  • --deterministic flag with compile-time non-determinism detection
  • 4 non-determinism categories: GPU atomics (auto-fixed), algorithm selection (auto-fixed), implicit RNG (error), external (warning)
  • Deterministic kernel variant selection (sort-based reduction, fixed cuBLAS)
  • RNG seed tracking (ExplicitSeed/Derived/Implicit)
  • Graph hash computation for checkpoint fingerprinting

Multimodal Primitives (M48)

  • PatchEmbed config with compile-time validation (image_size % patch_size)
  • MelSpectrogram with compile-time mel filterbank (hz-to-mel triangular filters)
  • CrossAttention config with Q/K dim matching and head divisibility
  • Modality classification heuristic (Vision/Audio/Text by rank+dtype)
  • @multimodal decorator validation
  • 7 preprocessing FFI stubs (patch_embed, mel, cross_attention, resize, normalize, stft, resample)
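
The validation rules named above (patch divisibility, Q/K dimension matching, head divisibility) amount to a few integer checks. The sketch below shows one plausible shape for them. The struct names, field names, and error strings are assumptions, and square images are assumed for simplicity.

```rust
/// Hypothetical patch-embedding config (square images assumed).
#[derive(Debug, Clone, Copy)]
pub struct PatchEmbedConfig {
    pub image_size: usize,
    pub patch_size: usize,
}

/// Returns the number of patches, or an error if the image is not
/// evenly divisible into patches.
pub fn num_patches(cfg: PatchEmbedConfig) -> Result<usize, String> {
    if cfg.patch_size == 0 {
        return Err("patch_size must be non-zero".to_string());
    }
    if cfg.image_size % cfg.patch_size != 0 {
        return Err(format!(
            "image_size {} is not divisible by patch_size {}",
            cfg.image_size, cfg.patch_size
        ));
    }
    let per_side = cfg.image_size / cfg.patch_size;
    Ok(per_side * per_side)
}

/// Hypothetical cross-attention config.
#[derive(Debug, Clone, Copy)]
pub struct CrossAttentionConfig {
    pub q_dim: usize,
    pub k_dim: usize,
    pub num_heads: usize,
}

/// Returns the per-head dimension after checking Q/K agreement and
/// head divisibility.
pub fn head_dim(cfg: CrossAttentionConfig) -> Result<usize, String> {
    if cfg.q_dim != cfg.k_dim {
        return Err(format!("q_dim {} != k_dim {}", cfg.q_dim, cfg.k_dim));
    }
    if cfg.num_heads == 0 || cfg.q_dim % cfg.num_heads != 0 {
        return Err(format!(
            "q_dim {} is not divisible by num_heads {}",
            cfg.q_dim, cfg.num_heads
        ));
    }
    Ok(cfg.q_dim / cfg.num_heads)
}
```

For example, a 224-pixel image with 16-pixel patches yields 14 × 14 = 196 patches, while a 15-pixel patch size is rejected.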

Stats

  • 678 unit tests passing
  • Clippy clean

Full Changelog: v0.5.0...v0.7.0

v0.5.0: Phase 6 — Deployment, Portability & Testing Infrastructure

18 Mar 20:14

Multi-Backend KIR Foundation (M47a)

  • Kernel IR — 40+ instruction SSA-form intermediate representation
  • PTX Backend — KIR to PTX lowering with typed register allocation
  • GpuTarget — CUDA/ROCm/Metal/WebGPU with per-backend feature capability tables
  • GpuBackend trait — alloc/free/copy/launch/sync interface for all backends
  • target(backend) — conditional compilation per GPU target
  • --target — CLI flag for backend selection

vmap AST Transform (M39b)

  • VmapTransformer — FnDef-to-FnDef AST rewriting producing _batched variants
  • Matmul/reduction/transpose rewriting with batch status propagation
  • nsl_vmap_check_batch runtime FFI

Testing Infrastructure

  • Snapshot testing (insta) — 7 PTX/KIR/fusion snapshots catching silent codegen regressions
  • Differential oracle testing — same script with/without --disable-fusion, assert numerical equivalence

Full Changelog: v0.4.0...v0.5.0

v0.4.0

18 Mar 18:16

v0.4.0: Phase 5 — Inference Optimization & Compile-Time Moat Features

Milestones M41, M42, M44 complete (M36, M37 shipped in v0.3.0).

New in v0.4.0:

  • M41: Disaggregated inference (prefill/decode worker separation, KV transfer, router scheduling)
  • M42: KV-cache compression (INT8/INT4/FP8 quantization, sliding window, H2O eviction)
  • M44: Constrained decoding (compiled FSM, JSON Schema/BNF grammars, token-level DFA, logit masking)
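
For M44, the core of logit masking against a token-level DFA can be sketched in a few lines: at each decoding step, any token with no outgoing transition from the current state has its logit set to negative infinity, so it can never be sampled. The `TokenDfa` type and its transition-table representation below are assumptions for this example, not the compiled FSM format NSL uses.

```rust
use std::collections::HashMap;

/// Hypothetical token-level DFA: (state, token id) -> next state.
pub struct TokenDfa {
    pub transitions: HashMap<(usize, u32), usize>,
}

impl TokenDfa {
    /// Sets the logit of every token that is not allowed in `state`
    /// to negative infinity. Token ids are indices into `logits`.
    pub fn mask_logits(&self, state: usize, logits: &mut [f32]) {
        for (tok, logit) in logits.iter_mut().enumerate() {
            if !self.transitions.contains_key(&(state, tok as u32)) {
                *logit = f32::NEG_INFINITY;
            }
        }
    }

    /// Advances the DFA after a token is sampled; None means the token
    /// was not allowed in `state`.
    pub fn step(&self, state: usize, token: u32) -> Option<usize> {
        self.transitions.get(&(state, token)).copied()
    }
}
```

A production implementation would more likely precompute a per-state bitmask over the vocabulary instead of probing a hash map per token; the hash map keeps the sketch short.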

Full Changelog: v0.3.0...v0.4.0

v0.3.0

18 Mar 02:56

What's New

Scaling Infrastructure (M32-M34)

  • Mixture of Experts: @moe decorator with top-k gating, capacity-based routing, and load-balancing aux loss
  • Speculative Decoding: @speculative with tree attention, rejection sampling, and @medusa multi-head prediction
  • Ring Attention: @context_parallel(ring_size=N) for cross-GPU sequence parallelism with causal masking

Quantization (M35)

  • FP8 Compute: @fp8_compute decorator with E4M3/E5M2 scale management and automatic Tensor Core dispatch
  • AWQ 4-bit: quant { dtype: awq4 } with in-register dequantize-in-GEMM (zero memory round-trip)
  • GPTQ 4-bit/8-bit: quant { dtype: gptq4 } with Hessian-based optimal quantization

Compiler Intelligence (M36-M37)

  • Memory Planning — compile-time tensor liveness analysis, interference graph, first-fit-decreasing slab assignment with 256-byte GPU alignment
  • Roofline Cost Model — per-operation FLOP/byte/arithmetic-intensity analysis against a built-in GPU database (A100, H100, RTX-4090, RTX-3090, L40S); table, JSON, and Chrome tracing output formats
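
The roofline analysis reduces to one formula: an operation's attainable throughput is the smaller of the device's peak FLOP/s and its arithmetic intensity (FLOPs per byte moved) times memory bandwidth. The sketch below illustrates this. All type and function names are hypothetical, and the device figures in the usage note are round placeholder numbers, not entries from the built-in GPU database.

```rust
/// Hypothetical device description: peak compute in FLOP/s and
/// memory bandwidth in bytes/s.
pub struct GpuSpec {
    pub peak_flops: f64,
    pub mem_bandwidth: f64,
}

/// Work and memory traffic of a single operation.
pub struct OpCost {
    pub flops: f64,
    pub bytes: f64,
}

impl OpCost {
    /// FLOPs performed per byte moved.
    pub fn arithmetic_intensity(&self) -> f64 {
        self.flops / self.bytes
    }
}

/// Roofline bound: min(peak, intensity * bandwidth).
pub fn attainable_flops(op: &OpCost, gpu: &GpuSpec) -> f64 {
    gpu.peak_flops.min(op.arithmetic_intensity() * gpu.mem_bandwidth)
}

/// An op is memory-bound when its intensity is below the ridge point
/// peak_flops / mem_bandwidth.
pub fn is_memory_bound(op: &OpCost, gpu: &GpuSpec) -> bool {
    op.arithmetic_intensity() < gpu.peak_flops / gpu.mem_bandwidth
}

/// Idealized cost of an (m x k) * (k x n) matmul: 2*m*n*k FLOPs, with
/// each operand and the output touched exactly once.
pub fn matmul_cost(m: u64, n: u64, k: u64, bytes_per_elem: u64) -> OpCost {
    let flops = 2.0 * m as f64 * n as f64 * k as f64;
    let elems = (m * k + k * n + m * n) as f64;
    OpCost { flops, bytes: elems * bytes_per_elem as f64 }
}
```

With a placeholder device of 100 TFLOP/s and 1 TB/s (ridge point 100 FLOP/byte), a 1024 × 1024 × 1024 matmul at 2 bytes per element has an intensity of about 341 FLOP/byte and is compute-bound, while an elementwise add at roughly 1/12 FLOP/byte is memory-bound.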

Language Features (M38-M40)

  • Linear Types — ownership checker with use-after-move detection, branch consumption symmetry, loop consumption prevention, and @shared escape hatch
  • Autodiff Safety — BackwardAccess classification for all 36 TapeOp variants (ShapeOnly/DataRequired/AuxDataRequired)
  • vmap: @vmap(batch_dim=0) automatic batching with batch-variant/invariant tracking, dimension shifting, and matmul rewrite classification
  • Source-to-Source AD — Wengert list extraction, 18 reverse-mode adjoint rules (reviewer-verified correct), dead gradient elimination, saved tensor analysis

Stats

  • 408 unit tests passing across all crates
  • 4,738 new lines of code across 42 files
  • 18 new source modules + 7 implementation plans
  • Clippy clean, release build verified

Breaking Changes

None. All new features are additive. Existing code compiles unchanged.

Known Limitations

  • --linear-types, --vram-budget, --perf CLI flags are parsed but not yet wired through compile_entry() (same status as --fusion-report)
  • Source AD and vmap are infrastructure-only (analysis libraries complete, codegen integration in progress)
  • E2E tests that invoke nsl run require a C compiler (gcc/clang/MSVC) in PATH for linking

Full Changelog: v0.2.0...v0.3.0

v0.2.0

15 Mar 18:26

Full Changelog: v0.1.0...v0.2.0

v0.1.0

13 Mar 03:12

What's Changed

  • feat(m19): Data Pipeline + Inference Sampling by @bwiemz in #1

New Contributors

  • @bwiemz made their first contribution in #1

Full Changelog: https://github.com/bwiemz/NSL/commits/v0.1.0