v0.2.0-dev | Apache-2.0 | Repo
Current import path: give mly a torch.export graph plus safetensors weights and get back an improving Rust/Metal import-and-compile bridge, with machine-readable ConvertReport output and optional proof/report hooks where they are actually wired today, including current composition-bounds method/soundness/proof-strength when that verifier path runs.
mly is building toward a Rust-native ML stack where model code, GPU kernels, and proof tooling live together. Today the concrete production story is Kani-backed kernel verification, an exported-artifact compiler bridge that keeps getting better, partial gamma-crown integration, and Metal as the real backend in production use.
The direct one-function PyTorch/ONNX compiler vision is not here yet. Today the import path is torch.export JSON + safetensors, and the mly CLI can shell out to mly_export.py to produce those artifacts for you.
PyTorch export + weights
|
v
mly::convert() bridge
|
├── Parse torch.export graph
├── Map aten ops → imported graph / mly TraceOps
├── Load safetensors weights
├── Trace → compile → constant-fold / fuse / optimize
├── Compile to GPU (Metal MSL today; HIP / SPIR-V planned)
├── Structured `ConvertReport` + optional verification hooks where wired
|
v
CompiledModel + `ConvertReport`
|
├── execute_dyn(&inputs) → outputs (GPU-optimized inference)
└── report / parity artifacts (when configured and available)
This is the exported-artifact bridge that exists today, not yet a universal one-call PyTorch/ONNX compiler for arbitrary models.
# Build / trace / execute on Metal
mly = { git = "https://github.com/andrewdyates/mly", features = ["metal"] }
# Compiled exported-artifact conversion via `mly::convert()`
mly = { git = "https://github.com/andrewdyates/mly", features = ["import-metal"] }
# Same compiled import bridge plus available verification hooks
mly = { git = "https://github.com/andrewdyates/mly", features = ["import-full"] }
# Backend-agnostic `convert_from_trace()`;
# combine with `metal` for one-function exported-artifact Metal helpers
mly = { git = "https://github.com/andrewdyates/mly", features = ["metal", "convert-model"] }use mly::{DType, Device, DynTensor, Module, Result, VarBuilder};
use mly::{linear, embedding, layer_norm};
fn main() -> Result<()> {
let device = Device::Cpu;
let vb = VarBuilder::zeros(DType::F32, &device);
let emb = embedding(1000, 128, &vb.pp("embed"))?;
let ln = layer_norm(128, Default::default(), &vb.pp("ln"))?;
let proj = linear(128, 32, &vb.pp("proj"))?;
let ids = DynTensor::from_vec_u32(vec![1, 42, 7], &[3], &device)?;
let x = emb.forward(&ids)?; // [3, 128]
let x = ln.forward(&x)?; // [3, 128]
let out = proj.forward(&x)?; // [3, 32]
assert_eq!(out.dims(), &[3, 32]);
Ok(())
}use std::path::Path;
use mly::{convert, metal::PipelineCache, OptLevel};
let cache = PipelineCache::new()?;
let result = convert(
Path::new("model_graph.json"), // from torch.export
Path::new("weights.safetensors"), // from PyTorch
&cache,
)
.optimize(OptLevel::Full)
.build()?;
// Optional import-time verification/report hooks exist on `import-full`,
// but they are still partial and model-dependent today.
// result.result.model: CompiledModel — optimized GPU inference
// result.result.proof: optional report/parity artifacts when configured
let output = result.result.model.execute_dyn(&cache, &[&input])?;mly::convert() is gated by import-metal and returns the builder-style Metal bridge shown above. If you want the backend-agnostic imported-model surface instead, enable convert-model and call mly::convert_from_trace(). If you want the newer one-function exported-artifact Metal helpers, enable metal + convert-model and call mly::compile_metal_from_exported_artifacts() or mly::compile_metal_from_exported_artifacts_with_report().
mly::compile_metal_from_exported_artifacts_with_report() returns the compiled Metal model plus the same structured ConvertReport used by the CLI. Both mly convert and mly compile can persist that JSON schema with --report-output and print it with --json. The populated verification/report fields are feature- and pipeline-dependent today; when the current composition-bounds verifier path runs, the JSON can record composition_method, composition_soundness_mode, and composition_proof_strength for that run. This is audit-friendly reporting for the exported-artifact path, not a finished proof-powered compiler certificate.
For automation, mly compile --output model.mlyc --report-output model.compile.json --json persists the same structured ConvertReport JSON to disk and stdout. If --report-output is omitted, mly compile still writes a default sibling report next to the .mlyc output and prints that report path on stderr.
Broad direct PyTorch/ONNX intake is still roadmap work.
use mly_core::dyn_tensor::trace::trace_graph;
use mly_metal::compiled_model::CompiledModel;
// Trace a forward pass (like torch.compile)
let (_output, graph) = trace_graph(|| model.forward(&input))?;
// Compile: fuse ops, optimize, upload weights to GPU
let compiled = CompiledModel::builder(&graph, &cache).build()?;
// Subsequent calls are fast — pre-compiled GPU dispatch
let result = compiled.execute_dyn(&cache, &[&input])?;| Feature | What it enables |
|---|---|
metal |
Apple Metal GPU backend (macOS) |
dsl |
#[mly::kernel] and #[mly::model] proc-macros |
training |
Autodiff (reverse-mode AD) + optimizers (AdamW, SGD, LoRA) |
verify |
metal-backed verification surfaces plus Kani / gamma-crown / z4 integrations |
import |
Exported-artifact parsing and graph-building re-exports (torch.export JSON, op mapping, imported graphs) |
import-metal |
Compiled exported-artifact conversion via mly::convert() / mly::import::convert() |
import-full |
import-metal plus the available import-time verification hooks |
convert-model |
Backend-agnostic mly::convert_from_trace() → ConvertedModel; combine with metal for mly::compile_metal_from_exported_artifacts() / mly::compile_metal_from_exported_artifacts_with_report() |
whisper |
Whisper speech-to-text model |
qwen3 |
Qwen3 LLM |
glm5 |
GLM-4/5 (ChatGLM) LLM |
models |
Kokoro TTS, HTDemucs, ECAPA-TDNN, WeSpeaker, STFT/iSTFT |
reftest |
Reference tensor comparison tooling |
tts-verify |
Audio-domain quality verification for TTS/SVS pipelines |
nn layers and building blocks — Linear, Conv1d/2d, ConvTranspose1d/2d, LSTM, BiLSTM, Embedding, LayerNorm, GroupNorm, InstanceNorm, BatchNorm, RMSNorm, AdaIN, AdaLN, MultiHeadAttention, RoPE, SwiGLU, MoE, and more.
Model implementations — Kokoro TTS (flagship integration target), Silero VAD, HTDemucs, Whisper, Qwen3, GLM-5, ECAPA-TDNN, and additional import/model-conversion paths under active development.
DynTensor — dynamic-rank tensor with 60+ ops, GPU dispatch, BF16/F16/F32 support. API mirrors candle for easy migration.
Compilation pipeline — trace_graph() captures computation graphs, compile_trace() lowers to GPU dispatch plans, and configurable fusion / peephole / native-op passes reduce dispatch overhead before execution.
Fused NativeOps — LSTM sequence, Flash Attention, SIMD GEMM, fused norm+activation+conv, batched projections, and related dispatch-collapsing kernels.
Verified kernels — #[mly::kernel] generates GPU code + Kani proof harness + differential test from one Rust function. Live proof totals are tracked in operational_state.json.
Metal GPU backend — precompiled Metal kernels, lazy command buffer batching, activation arena for buffer reuse, simdgroup matrix multiply, and F16 autocast.
Certification pipeline — certify_model() orchestrates trace, verify, and proof bundling. Certificate coverage is still model-dependent today; Kokoro/Moonshot-style checks are partial, not a universal end-to-end guarantee.
Training — reverse-mode autodiff with gradient tape, AdamW/SGD optimizers, LoRA adapters, learning rate schedules, gradient clipping/scaling.
Fast-moving repo totals live in operational_state.json rather than this README. Current tracked counts there are:
| Metric | Value |
|---|---|
| Rust source LOC | 1,450,257 |
| Tests | 46,803 |
| Kani proof harnesses | 11,475 |
| Bounds registry entries | 22 |
nn Module impls |
55 |
Kokoro is mly's primary integration target and the model that drove most of the GPU/compiler work. The single-voice path is useful today, but the chorus path is still below the production bar:
| What Kokoro drove | Framework benefit |
|---|---|
| LSTM GPU kernel | Any model with LSTMs runs on GPU |
| Flash Attention | Any transformer model benefits |
| Lazy command buffer batching | All models: fewer GPU sync points |
| Activation arena | All models: GPU buffer reuse |
| Trace-compile pipeline | trace_graph() → CompiledModel works for any model |
| Fusion / peephole passes | General: AddLayerNorm, Linear+activation, batched Q/K/V, NormLinear fusion. Kokoro-specific: ResBlock, style projection, conv/norm fusion work |
| iSTFT GPU | Any vocoder or audio model |
| F16 autocast | All models: mixed-precision inference |
| VarBuilder weight loading | All models: safetensors import |
Kokoro scorecards are split. The live machine-readable single-voice production scorecard is operational_state.json (kokoro_rtf_tracking / kokoro_performance), which records production RTF, dispatch, flush, and related gate metadata for the single-voice path only. The current chorus-side report path is reports/2026-03-29-gate-measurements.md, which captures chorus-adjacent gate results such as TTFA snapshots and chorus test outcomes. That report is useful for execution-state honesty, but it is not a proof that chorus quality or production throughput is solved. Multi-voice chorus quality and throughput are still not production-good.
The production Kokoro gates also expose stable machine-readable scrape surfaces for local automation: single-voice RTF uses KOKORO_RTF_SCORECARD_JSON= / KOKORO_RTF_SCORECARD_OUT, production chorus uses KOKORO_PRODUCTION_CHORUS_SCORECARD_JSON= / KOKORO_PRODUCTION_CHORUS_SCORECARD_OUT, and streaming chorus uses KOKORO_STREAMING_CHORUS_SCORECARD_JSON= / KOKORO_STREAMING_CHORUS_SCORECARD_OUT. Those are scorecard/report surfaces, not proof that chorus is production-good.
The repo also now has explicit recommended-autocast guard coverage on batch/shared-encode/streaming chorus surfaces. Those tests are regression guardrails for finite-audio, parity, and chunk-contract stability under the recommended config, not a certificate of chorus quality or throughput.
- Exported-artifact compiler bridge: keep expanding supported
torch.exportcoverage on real models, persist realConvertReportJSON frommly convert/mly compile, and land reusable equivalence checks for exported graphs. Raw PyTorch/ONNX intake is still future work. - Verification: raise gamma-crown from selective IBP/CROWN on current segments to broader composed single-voice coverage. The verifier path is useful today, but it is still partial rather than a finished end-to-end proof-powered compiler.
- Chorus: keep the scorecard split honest until there is a real chorus production benchmark.
operational_state.jsonis the live single-voice production scorecard;reports/2026-03-29-gate-measurements.mdis still a chorus TTFA/test snapshot, not proof that chorus quality or throughput is solved.
mly combines GPU kernel formal verification (Kani/CBMC model checking) with partial neural-network bound propagation via gamma-crown. Proofs operate on Rust source; Metal is the production backend today while broader model-property coverage is still in progress.
| System | Verifies | Trusts (unverified) |
|---|---|---|
| alpha-beta-CROWN | ONNX graph bounds | PyTorch→ONNX export, CUDA runtime, GPU arithmetic |
| Marabou | ONNX graph via SMT | Same |
| ERAN | ONNX graph via abstract interpretation | Same |
| mly | Rust source, GPU kernels, selected model bounds | Hardware; broader gamma-crown/model coverage still in progress |
Five verification levels:
| Level | Method | What it catches |
|---|---|---|
| 0 | Type system | Rank mismatches (compile error) |
| 1 | Static analysis | Dimension mismatches (runtime) |
| 2 | IBP (gamma-crown) | Unbounded outputs, NaN/Inf |
| 3 | CROWN (gamma-crown) | Where wired, tighter-than-IBP correlations |
| 4 | SMT (z4) | Complete verification of small subgraphs |
┌──────────────────────────────────────────────────────────────────────┐
│ MODEL DEFINITION │
│ │
│ Option A: Rust with proc-macros │
│ #[mly::kernel] fn snake(x: f32, alpha: f32) -> f32 { ... } │
│ #[mly::model] fn encoder(audio: Tensor<3, F32>) -> ... │
│ │
│ Option B: Exported-artifact import │
│ mly::convert(graph.json, weights.safetensors, &cache).build() │
│ │
│ Option C: Trace compilation │
│ trace_graph(|| model.forward(&x)) → CompiledModel │
└───────────────────────────────┬──────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────┐
│ OPTIMIZATION (fusion + peephole/native-op passes) │
│ │
│ Elementwise chain fusion, ResBlock fusion, Linear+activation, │
│ AddLayerNorm, batched Q/K/V projection, NormLinear, flip+LSTM, │
│ Conv+LayerNorm, style projection batching │
└───────────────────────────────┬──────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────┐
│ VERIFICATION (on Rust source, backend-agnostic) │
│ │
│ Kani: Each kernel proved correct (no overflow, no NaN) │
│ gamma-crown: Partial model bounds via IBP/CROWN where wired │
│ z4: Complete SMT verification of small components │
└───────────────────────────────┬──────────────────────────────────────┘
│
┌─────────────────┼─────────────────┐
▼ ▼ ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ Metal (Apple) │ │ HIP (AMD) │ │ Vulkan │
│ MSL → .metallib │ │ HIP codegen │ │ SPIR-V │
│ Production │ │ Codegen only │ │ Planned │
└──────────────────┘ └──────────────────┘ └──────────────────┘
| Crate | Description |
|---|---|
| mly | Top-level re-exports with feature gates |
| mly-core | DynTensor, Tensor, nn modules, VarBuilder |
| mly-dsl | #[kernel] proc-macro, KernelIR, tensor IR, trace compiler, fusion / peephole infrastructure |
| mly-macros | Proc-macro entry point for #[mly::kernel] and #[mly::model] |
| Crate | Description |
|---|---|
| mly-metal | Metal GPU backend — compiled model execution, arena, lazy batching |
| mly-cuda | CUDA/HIP backend — HIP codegen from DispatchStep IR |
| mly-cpu | CPU backend (NEON/AVX2 SIMD) — planned |
| mly-vulkan | Vulkan backend (SPIR-V) — planned |
| Crate | Description |
|---|---|
| mly-verify | Kani harness generation, gamma-crown bridge (partial), z4 SMT encoding, fusion proofs |
| mly-reftest | Reference tensor comparison, differential testing, spectral comparison |
| Crate | Description |
|---|---|
| mly-models | HTDemucs, Silero VAD, Kokoro TTS, ECAPA-TDNN, WeSpeaker, STFT/iSTFT |
| mly-whisper | Whisper speech-to-text (encoder-decoder transformer, mel spectrogram) |
| mly-qwen3 | Qwen3 LLM (decoder-only transformer with RoPE, GQA, SwiGLU) |
| mly-glm5 | GLM-4/5 decoder-only LLM |
| Crate | Description |
|---|---|
| mly-autodiff | Reverse-mode automatic differentiation |
| mly-optim | AdamW, SGD, LoRA, learning rate schedules |
| mly-import | Exported PyTorch artifact bridge — torch.export graphs / segmented bundles → ComputationGraph / Metal-compiled segments |
| mly-tts-verify | Audio-domain quality verification for TTS/SVS pipelines |
cargo check --workspace # Type-check all crates
cargo test --workspace # Run all tests
cargo test -p mly-core # Test core tensor types and nn layers
cargo test -p mly-metal # Test Metal GPU backend
cargo test -p mly-verify # Test verification infrastructure| Milestone | Status | Description |
|---|---|---|
| Core types + Metal backend | Complete | DynTensor, Tensor, nn layers/modules, Metal dispatch |
| Kernel DSL + verification | Kani complete | Counts live in operational_state.json; gamma-crown IBP/CROWN coverage remains partial |
| Training | Complete | Reverse-mode AD, AdamW/SGD/LoRA, Kani-verified gradients |
| Proof certificates | Partial | certify_model() orchestrator and checker exist; end-to-end property coverage still varies by model |
| Trace-compile pipeline | Complete | trace_graph() → fusion / optimization → CompiledModel → GPU |
| PyTorch import | Bridge improving | convert() handles supported torch.export artifacts; not yet a universal one-function PyTorch/ONNX compiler |
| Kokoro TTS pipeline | In progress | Strong integration target; multi-voice chorus quality/perf still below the production bar |
| Multi-backend | Metal prod | Metal is the production backend; CUDA/HIP codegen exists but runtime work remains |
| General model compilation | Next | Broaden convert() and compiler coverage beyond Kokoro and selected paths |
| Verified quantization | Planned | Prove bf16/int8 preserves bounds for all inputs |
- Kani — Rust model checker used for kernel verification
- alpha-beta-CROWN — VNN-COMP winner 2021-2025, NN bound propagation
- Marabou — SMT-based NN verification
- ERAN — Abstract interpretation for NNs
- VNN-COMP — Annual NN verification competition
Copyright 2026 Dropbox, Inc.. Licensed under Apache-2.0.
Author: Andrew Yates