Skip to content

ghuntley/mly

 
 

Repository files navigation

mly — Verified ML Framework

v0.2.0-dev | Apache-2.0 | Repo

Current import path: give mly a torch.export graph plus safetensors weights and get back an improving Rust/Metal import-and-compile bridge, with machine-readable ConvertReport output and optional proof/report hooks where they are actually wired today, including current composition-bounds method/soundness/proof-strength when that verifier path runs.

mly is building toward a Rust-native ML stack where model code, GPU kernels, and proof tooling live together. Today the concrete production story is Kani-backed kernel verification, an exported-artifact compiler bridge that keeps getting better, partial gamma-crown integration, and Metal as the real backend in production use.

The direct one-function PyTorch/ONNX compiler vision is not here yet. Today the import path is torch.export JSON + safetensors, and the mly CLI can shell out to mly_export.py to produce those artifacts for you.

PyTorch export + weights
        |
        v
  mly::convert() bridge
        |
        ├── Parse torch.export graph
        ├── Map aten ops → imported graph / mly TraceOps
        ├── Load safetensors weights
        ├── Trace → compile → constant-fold / fuse / optimize
        ├── Compile to GPU (Metal MSL today; HIP / SPIR-V planned)
        ├── Structured `ConvertReport` + optional verification hooks where wired
        |
        v
  CompiledModel + `ConvertReport`
        |
        ├── execute_dyn(&inputs) → outputs  (GPU-optimized inference)
        └── report / parity artifacts       (when configured and available)

This is the exported-artifact bridge that exists today, not yet a universal one-call PyTorch/ONNX compiler for arbitrary models.

Quick Start

# Build / trace / execute on Metal
mly = { git = "https://github.com/andrewdyates/mly", features = ["metal"] }

# Compiled exported-artifact conversion via `mly::convert()`
mly = { git = "https://github.com/andrewdyates/mly", features = ["import-metal"] }

# Same compiled import bridge plus available verification hooks
mly = { git = "https://github.com/andrewdyates/mly", features = ["import-full"] }

# Backend-agnostic `convert_from_trace()`;
# combine with `metal` for one-function exported-artifact Metal helpers
mly = { git = "https://github.com/andrewdyates/mly", features = ["metal", "convert-model"] }

Option A: Build a model from scratch

use mly::{DType, Device, DynTensor, Module, Result, VarBuilder};
use mly::{linear, embedding, layer_norm};

fn main() -> Result<()> {
    let device = Device::Cpu;
    let vb = VarBuilder::zeros(DType::F32, &device);

    let emb = embedding(1000, 128, &vb.pp("embed"))?;
    let ln = layer_norm(128, Default::default(), &vb.pp("ln"))?;
    let proj = linear(128, 32, &vb.pp("proj"))?;

    let ids = DynTensor::from_vec_u32(vec![1, 42, 7], &[3], &device)?;
    let x = emb.forward(&ids)?;   // [3, 128]
    let x = ln.forward(&x)?;      // [3, 128]
    let out = proj.forward(&x)?;  // [3, 32]

    assert_eq!(out.dims(), &[3, 32]);
    Ok(())
}

Option B: Compile exported PyTorch artifacts to Metal

use std::path::Path;

use mly::{convert, metal::PipelineCache, OptLevel};

let cache = PipelineCache::new()?;
let result = convert(
    Path::new("model_graph.json"),    // from torch.export
    Path::new("weights.safetensors"), // from PyTorch
    &cache,
)
.optimize(OptLevel::Full)
.build()?;

// Optional import-time verification/report hooks exist on `import-full`,
// but they are still partial and model-dependent today.

// result.result.model: CompiledModel — optimized GPU inference
// result.result.proof: optional report/parity artifacts when configured
let output = result.result.model.execute_dyn(&cache, &[&input])?;

mly::convert() is gated by import-metal and returns the builder-style Metal bridge shown above. If you want the backend-agnostic imported-model surface instead, enable convert-model and call mly::convert_from_trace(). If you want the newer one-function exported-artifact Metal helpers, enable metal + convert-model and call mly::compile_metal_from_exported_artifacts() or mly::compile_metal_from_exported_artifacts_with_report().

mly::compile_metal_from_exported_artifacts_with_report() returns the compiled Metal model plus the same structured ConvertReport used by the CLI. Both mly convert and mly compile can persist that JSON schema with --report-output and print it with --json. The populated verification/report fields are feature- and pipeline-dependent today; when the current composition-bounds verifier path runs, the JSON can record composition_method, composition_soundness_mode, and composition_proof_strength for that run. This is audit-friendly reporting for the exported-artifact path, not a finished proof-powered compiler certificate.

For automation, mly compile --output model.mlyc --report-output model.compile.json --json persists the same structured ConvertReport JSON to disk and stdout. If --report-output is omitted, mly compile still writes a default sibling report next to the .mlyc output and prints that report path on stderr.

Broad direct PyTorch/ONNX intake is still roadmap work.

Option C: Trace and compile any mly model

use mly_core::dyn_tensor::trace::trace_graph;
use mly_metal::compiled_model::CompiledModel;

// Trace a forward pass (like torch.compile)
let (_output, graph) = trace_graph(|| model.forward(&input))?;

// Compile: fuse ops, optimize, upload weights to GPU
let compiled = CompiledModel::builder(&graph, &cache).build()?;

// Subsequent calls are fast — pre-compiled GPU dispatch
let result = compiled.execute_dyn(&cache, &[&input])?;

Features

Feature What it enables
metal Apple Metal GPU backend (macOS)
dsl #[mly::kernel] and #[mly::model] proc-macros
training Autodiff (reverse-mode AD) + optimizers (AdamW, SGD, LoRA)
verify metal-backed verification surfaces plus Kani / gamma-crown / z4 integrations
import Exported-artifact parsing and graph-building re-exports (torch.export JSON, op mapping, imported graphs)
import-metal Compiled exported-artifact conversion via mly::convert() / mly::import::convert()
import-full import-metal plus the available import-time verification hooks
convert-model Backend-agnostic mly::convert_from_trace()ConvertedModel; combine with metal for mly::compile_metal_from_exported_artifacts() / mly::compile_metal_from_exported_artifacts_with_report()
whisper Whisper speech-to-text model
qwen3 Qwen3 LLM
glm5 GLM-4/5 (ChatGLM) LLM
models Kokoro TTS, HTDemucs, ECAPA-TDNN, WeSpeaker, STFT/iSTFT
reftest Reference tensor comparison tooling
tts-verify Audio-domain quality verification for TTS/SVS pipelines

What You Get

nn layers and building blocks — Linear, Conv1d/2d, ConvTranspose1d/2d, LSTM, BiLSTM, Embedding, LayerNorm, GroupNorm, InstanceNorm, BatchNorm, RMSNorm, AdaIN, AdaLN, MultiHeadAttention, RoPE, SwiGLU, MoE, and more.

Model implementations — Kokoro TTS (flagship integration target), Silero VAD, HTDemucs, Whisper, Qwen3, GLM-5, ECAPA-TDNN, and additional import/model-conversion paths under active development.

DynTensor — dynamic-rank tensor with 60+ ops, GPU dispatch, BF16/F16/F32 support. API mirrors candle for easy migration.

Compilation pipelinetrace_graph() captures computation graphs, compile_trace() lowers to GPU dispatch plans, and configurable fusion / peephole / native-op passes reduce dispatch overhead before execution.

Fused NativeOps — LSTM sequence, Flash Attention, SIMD GEMM, fused norm+activation+conv, batched projections, and related dispatch-collapsing kernels.

Verified kernels#[mly::kernel] generates GPU code + Kani proof harness + differential test from one Rust function. Live proof totals are tracked in operational_state.json.

Metal GPU backend — precompiled Metal kernels, lazy command buffer batching, activation arena for buffer reuse, simdgroup matrix multiply, and F16 autocast.

Certification pipelinecertify_model() orchestrates trace, verify, and proof bundling. Certificate coverage is still model-dependent today; Kokoro/Moonshot-style checks are partial, not a universal end-to-end guarantee.

Training — reverse-mode autodiff with gradient tape, AdamW/SGD optimizers, LoRA adapters, learning rate schedules, gradient clipping/scaling.

Current Status

Fast-moving repo totals live in operational_state.json rather than this README. Current tracked counts there are:

Metric Value
Rust source LOC 1,450,257
Tests 46,803
Kani proof harnesses 11,475
Bounds registry entries 22
nn Module impls 55

Kokoro TTS (flagship model)

Kokoro is mly's primary integration target and the model that drove most of the GPU/compiler work. The single-voice path is useful today, but the chorus path is still below the production bar:

What Kokoro drove Framework benefit
LSTM GPU kernel Any model with LSTMs runs on GPU
Flash Attention Any transformer model benefits
Lazy command buffer batching All models: fewer GPU sync points
Activation arena All models: GPU buffer reuse
Trace-compile pipeline trace_graph() → CompiledModel works for any model
Fusion / peephole passes General: AddLayerNorm, Linear+activation, batched Q/K/V, NormLinear fusion. Kokoro-specific: ResBlock, style projection, conv/norm fusion work
iSTFT GPU Any vocoder or audio model
F16 autocast All models: mixed-precision inference
VarBuilder weight loading All models: safetensors import

Kokoro scorecards are split. The live machine-readable single-voice production scorecard is operational_state.json (kokoro_rtf_tracking / kokoro_performance), which records production RTF, dispatch, flush, and related gate metadata for the single-voice path only. The current chorus-side report path is reports/2026-03-29-gate-measurements.md, which captures chorus-adjacent gate results such as TTFA snapshots and chorus test outcomes. That report is useful for execution-state honesty, but it is not a proof that chorus quality or production throughput is solved. Multi-voice chorus quality and throughput are still not production-good.

The production Kokoro gates also expose stable machine-readable scrape surfaces for local automation: single-voice RTF uses KOKORO_RTF_SCORECARD_JSON= / KOKORO_RTF_SCORECARD_OUT, production chorus uses KOKORO_PRODUCTION_CHORUS_SCORECARD_JSON= / KOKORO_PRODUCTION_CHORUS_SCORECARD_OUT, and streaming chorus uses KOKORO_STREAMING_CHORUS_SCORECARD_JSON= / KOKORO_STREAMING_CHORUS_SCORECARD_OUT. Those are scorecard/report surfaces, not proof that chorus is production-good. The repo also now has explicit recommended-autocast guard coverage on batch/shared-encode/streaming chorus surfaces. Those tests are regression guardrails for finite-audio, parity, and chunk-contract stability under the recommended config, not a certificate of chorus quality or throughput.

Near-Term Execution

  • Exported-artifact compiler bridge: keep expanding supported torch.export coverage on real models, persist real ConvertReport JSON from mly convert / mly compile, and land reusable equivalence checks for exported graphs. Raw PyTorch/ONNX intake is still future work.
  • Verification: raise gamma-crown from selective IBP/CROWN on current segments to broader composed single-voice coverage. The verifier path is useful today, but it is still partial rather than a finished end-to-end proof-powered compiler.
  • Chorus: keep the scorecard split honest until there is a real chorus production benchmark. operational_state.json is the live single-voice production scorecard; reports/2026-03-29-gate-measurements.md is still a chorus TTFA/test snapshot, not proof that chorus quality or throughput is solved.

How Verification Works

mly combines GPU kernel formal verification (Kani/CBMC model checking) with partial neural-network bound propagation via gamma-crown. Proofs operate on Rust source; Metal is the production backend today while broader model-property coverage is still in progress.

System Verifies Trusts (unverified)
alpha-beta-CROWN ONNX graph bounds PyTorch→ONNX export, CUDA runtime, GPU arithmetic
Marabou ONNX graph via SMT Same
ERAN ONNX graph via abstract interpretation Same
mly Rust source, GPU kernels, selected model bounds Hardware; broader gamma-crown/model coverage still in progress

Five verification levels:

Level Method What it catches
0 Type system Rank mismatches (compile error)
1 Static analysis Dimension mismatches (runtime)
2 IBP (gamma-crown) Unbounded outputs, NaN/Inf
3 CROWN (gamma-crown) Where wired, tighter-than-IBP correlations
4 SMT (z4) Complete verification of small subgraphs

Architecture

┌──────────────────────────────────────────────────────────────────────┐
│  MODEL DEFINITION                                                    │
│                                                                      │
│  Option A: Rust with proc-macros                                     │
│    #[mly::kernel] fn snake(x: f32, alpha: f32) -> f32 { ... }       │
│    #[mly::model] fn encoder(audio: Tensor<3, F32>) -> ...           │
│                                                                      │
│  Option B: Exported-artifact import                                  │
│    mly::convert(graph.json, weights.safetensors, &cache).build()    │
│                                                                      │
│  Option C: Trace compilation                                         │
│    trace_graph(|| model.forward(&x)) → CompiledModel                │
└───────────────────────────────┬──────────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────────┐
│  OPTIMIZATION (fusion + peephole/native-op passes)                   │
│                                                                      │
│  Elementwise chain fusion, ResBlock fusion, Linear+activation,       │
│  AddLayerNorm, batched Q/K/V projection, NormLinear, flip+LSTM,     │
│  Conv+LayerNorm, style projection batching                           │
└───────────────────────────────┬──────────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────────┐
│  VERIFICATION (on Rust source, backend-agnostic)                     │
│                                                                      │
│  Kani:        Each kernel proved correct (no overflow, no NaN)       │
│  gamma-crown: Partial model bounds via IBP/CROWN where wired         │
│  z4:          Complete SMT verification of small components          │
└───────────────────────────────┬──────────────────────────────────────┘
                                │
              ┌─────────────────┼─────────────────┐
              ▼                 ▼                 ▼
┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐
│  Metal (Apple)   │  │  HIP (AMD)       │  │  Vulkan          │
│  MSL → .metallib │  │  HIP codegen     │  │  SPIR-V          │
│  Production      │  │  Codegen only    │  │  Planned         │
└──────────────────┘  └──────────────────┘  └──────────────────┘

Crates

Core

Crate Description
mly Top-level re-exports with feature gates
mly-core DynTensor, Tensor, nn modules, VarBuilder
mly-dsl #[kernel] proc-macro, KernelIR, tensor IR, trace compiler, fusion / peephole infrastructure
mly-macros Proc-macro entry point for #[mly::kernel] and #[mly::model]

Backends

Crate Description
mly-metal Metal GPU backend — compiled model execution, arena, lazy batching
mly-cuda CUDA/HIP backend — HIP codegen from DispatchStep IR
mly-cpu CPU backend (NEON/AVX2 SIMD) — planned
mly-vulkan Vulkan backend (SPIR-V) — planned

Verification

Crate Description
mly-verify Kani harness generation, gamma-crown bridge (partial), z4 SMT encoding, fusion proofs
mly-reftest Reference tensor comparison, differential testing, spectral comparison

Models

Crate Description
mly-models HTDemucs, Silero VAD, Kokoro TTS, ECAPA-TDNN, WeSpeaker, STFT/iSTFT
mly-whisper Whisper speech-to-text (encoder-decoder transformer, mel spectrogram)
mly-qwen3 Qwen3 LLM (decoder-only transformer with RoPE, GQA, SwiGLU)
mly-glm5 GLM-4/5 decoder-only LLM

Training & Import

Crate Description
mly-autodiff Reverse-mode automatic differentiation
mly-optim AdamW, SGD, LoRA, learning rate schedules
mly-import Exported PyTorch artifact bridge — torch.export graphs / segmented bundles → ComputationGraph / Metal-compiled segments
mly-tts-verify Audio-domain quality verification for TTS/SVS pipelines

Building

cargo check --workspace              # Type-check all crates
cargo test --workspace               # Run all tests
cargo test -p mly-core               # Test core tensor types and nn layers
cargo test -p mly-metal              # Test Metal GPU backend
cargo test -p mly-verify             # Test verification infrastructure

Roadmap

Milestone Status Description
Core types + Metal backend Complete DynTensor, Tensor, nn layers/modules, Metal dispatch
Kernel DSL + verification Kani complete Counts live in operational_state.json; gamma-crown IBP/CROWN coverage remains partial
Training Complete Reverse-mode AD, AdamW/SGD/LoRA, Kani-verified gradients
Proof certificates Partial certify_model() orchestrator and checker exist; end-to-end property coverage still varies by model
Trace-compile pipeline Complete trace_graph() → fusion / optimization → CompiledModel → GPU
PyTorch import Bridge improving convert() handles supported torch.export artifacts; not yet a universal one-function PyTorch/ONNX compiler
Kokoro TTS pipeline In progress Strong integration target; multi-voice chorus quality/perf still below the production bar
Multi-backend Metal prod Metal is the production backend; CUDA/HIP codegen exists but runtime work remains
General model compilation Next Broaden convert() and compiler coverage beyond Kokoro and selected paths
Verified quantization Planned Prove bf16/int8 preserves bounds for all inputs

References

  • Kani — Rust model checker used for kernel verification
  • alpha-beta-CROWN — VNN-COMP winner 2021-2025, NN bound propagation
  • Marabou — SMT-based NN verification
  • ERAN — Abstract interpretation for NNs
  • VNN-COMP — Annual NN verification competition

License

Copyright 2026 Dropbox, Inc.. Licensed under Apache-2.0.

Author: Andrew Yates

About

AI-native ML framework for Apple Silicon - Rust-first with verification

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Rust 99.6%
  • Other 0.4%