Releases · Epistates/pmetal

08 May 11:38

github-actions

v0.5.0

0891716

PMetal v0.5.0 Latest


[0.5.0] - 2026-05-07

Added

Distributed inference & training

pmetal-distributed crate (Phases 1-4 + 7): Thunderbolt-fabric-aware multi-Mac cluster runtime with feature-gated tensor, expert, context, ZeRO, and pipeline parallelism modules
- pmetal cluster CLI: per-node launch, ring/mesh topology discovery, fabric handshake
- Pipeline harness with overlap of computation and Thunderbolt transfers
- Canonical expert-rank mapping + per-architecture MoE/MLA tensor-parallel plans
- Ring all-reduce / all-gather with corrected chunk indexing
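The ring all-reduce entry above depends on getting the chunk index right at every step. Below is a minimal single-process simulation of a sum all-reduce over a ring (reduce-scatter followed by all-gather). It is a sketch of the general technique, not pmetal-distributed's code: the function name, the list-of-lists layout, and the rule for spreading the remainder over the first chunks are all assumptions made for illustration.

```python
def ring_all_reduce(buffers):
    """Sum-reduce equal-length lists held by n simulated ranks in a ring."""
    n = len(buffers)
    size = len(buffers[0])
    # Chunk c covers [bounds[c], bounds[c + 1]); the first `extra` chunks
    # take one additional element when size is not divisible by n.
    base, extra = divmod(size, n)
    bounds = [0]
    for c in range(n):
        bounds.append(bounds[-1] + base + (1 if c < extra else 0))
    data = [list(b) for b in buffers]

    # Reduce-scatter: at step s, rank r sends chunk (r - s) mod n to rank
    # r + 1, which adds it into its own copy of that chunk.
    for s in range(n - 1):
        sends = []
        for r in range(n):
            c = (r - s) % n
            sends.append((c, data[r][bounds[c]:bounds[c + 1]]))
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % n
            for i, v in enumerate(payload):
                data[dst][bounds[c] + i] += v

    # Rank r now holds the fully reduced chunk (r + 1) mod n.
    # All-gather: at step s, rank r forwards chunk (r + 1 - s) mod n and the
    # receiver overwrites its stale copy.
    for s in range(n - 1):
        sends = []
        for r in range(n):
            c = (r + 1 - s) % n
            sends.append((c, data[r][bounds[c]:bounds[c + 1]]))
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % n
            data[dst][bounds[c]:bounds[c + 1]] = payload
    return data
```

Each rank sends and receives exactly one chunk per step, so the per-link traffic is independent of the number of ranks. An off-by-one in either `(r - s) % n` or `(r + 1 - s) % n` leaves some chunks only partially reduced.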

TurboQuant KV cache (production-ready)

TurboQuant KV cache quantization: Provably near-optimal KV cache compression based on random rotation + Lloyd-Max scalar quantization + QJL residual for unbiased inner products (arXiv:2504.19874). Achieves 4-6x KV cache compression with near-zero quality loss. Available via --kv-turboquant or presets --kv-turboquant-preset q3_5 (near-lossless) / q2_5 (6.4x compression)
- Separate key/value runtimes with independent bit widths and outlier-aware mixed-precision
- Direct attention path for single-token decode avoids full cache dequantization
- Data-oblivious (no calibration data required) — quantizes KV entries online as generated
- Precomputed codebooks via Lloyd-Max algorithm for Beta distribution (deterministic from seed)
- Metal kernel backend with CPU fallback
- Phase 0: split monolithic mod.rs (6101 → 222 LOC) into config/core/state/bits/math submodules
- Phase 3: GPU-resident hot/cold pipeline + Mixed K/V storage; mixed_score as layout oracle
- Phase A/B: QJL ablation harness (feature-gated) + per-row key_slot_scale codebook adaptation
- Phase C/C′: Variant F drop-QJL opt-in path; d128/d256 no_qjl_2pass fast paths (4..=8 bits)
- Phase D: TurboQuantPackMode config + Fullbyte dense-values kernel
- Phase E: TurboQuantOutlierMode — encode-side top-K outlier storage, zero pre-quant + decode override, outlier-bias on d128/d256 fullbyte score kernel; CPU mirror in scalar encode/decode
- Phase F: Hamming skip-list dispatch — skiplist_threshold config, GPU sign_hash buffer, Metal Hamming-distances kernel + FFI, GQA support
- Mixed-precision attention parity baseline; defensive residual-norm clamp + NaN-safe encode
Asymmetric K/V head dimensions: KV cache, TurboQuant, and fused attention now support models where key and value projections have different widths (e.g. DeepSeek MLA with qk_head_dim != v_head_dim)
pmetal serve --kv-turboquant: TurboQuant KV cache in the serving engine with --kv-turboquant-preset q3_5 for near-lossless 4.6x KV compression in production
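The "precomputed codebooks via Lloyd-Max algorithm for Beta distribution (deterministic from seed)" item can be illustrated with a small scalar quantizer fit. This is a hedged sketch only: it fits the codebook to seeded samples rather than to an analytic density, and the Beta parameters, sample count, iteration count, and function names are illustrative assumptions, not values taken from pmetal.

```python
import random

def lloyd_max_codebook(bits, seed, alpha=2.0, beta=2.0, n_samples=4000, iters=30):
    """Fit 2**bits scalar quantization levels to seeded Beta(alpha, beta) samples."""
    rng = random.Random(seed)  # a fixed seed makes the codebook reproducible
    samples = sorted(rng.betavariate(alpha, beta) for _ in range(n_samples))
    levels = 1 << bits
    # Start from evenly spaced sample quantiles.
    centroids = [samples[(2 * i + 1) * n_samples // (2 * levels)] for i in range(levels)]
    for _ in range(iters):
        # Decision thresholds sit halfway between neighbouring centroids.
        thresholds = [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]
        sums = [0.0] * levels
        counts = [0] * levels
        cell = 0
        for x in samples:  # samples are sorted, so the cell index only grows
            while cell < levels - 1 and x > thresholds[cell]:
                cell += 1
            sums[cell] += x
            counts[cell] += 1
        # Each centroid moves to the mean of its cell; empty cells keep their value.
        centroids = [s / c if c else old for s, c, old in zip(sums, counts, centroids)]
    return centroids

def quantize(x, centroids):
    """Return the index of the nearest codebook level."""
    return min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
```

Because the codebook depends only on the seed and the assumed distribution, no calibration data is needed, which matches the data-oblivious property listed above.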

Quantization & model formats

Optimized FP8 checkpoint loading: Hugging Face FP8 weight_scale_inv sidecars are dequantized or repacked into MLX mxfp8 weights for Qwen3-family native paths; mode-aware quantized matmul plumbing handles floating-point quantized weights without dense fallback
Expanded GGUF quantization/export: pmetal quantize now writes standard GGUF metadata from Hugging Face configs, tokenizer/pre-tokenizer metadata, HF-to-GGUF tensor names, stacked MoE expert tensors, and method-specific file types
Broader GGUF format coverage: quantization/dequantization support now includes K-quants, legacy Q4/Q5/Q8 variants, Q1_0, TQ1_0/TQ2_0, MXFP4, NVFP4, BF16, F16, and F32 round trips
MLX safetensors quantization path: quality-based bit allocation with --target-bpw, GPU-resident weight loading, and tokenizer/config sidecar copy for MLX-format quantized exports

Inference server (OpenAI- + Anthropic-compatible)

Continuous batching with paged-KV-style admission + shared prefix cache in pmetal-serve: per-request slot scheduling, KV-cache prefix sharing, concurrent decode for many simultaneous chats
- Token-block admission budget (--cb-block-size, --cb-max-blocks) prevents over-admitting active contexts and skips head-of-line requests when a smaller queued request fits the remaining block budget
- Continuous batching now reuses the shared prompt prefix cache, prefills only uncached suffix tokens, and saves extended prefixes after final prefill
- Continuous batching derives the same cache mode as the single-request serving path, honoring --kv-quant and --kv-turboquant
- Hybrid/recurrent models are rejected from continuous batching instead of silently running without recurrent state
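The token-block admission rule above (budget in blocks, with head-of-line skipping) can be sketched as follows. The queue layout and the function name are hypothetical; only the two knobs, block size and maximum block count, correspond to the `--cb-block-size` and `--cb-max-blocks` flags named in the notes, and the real scheduler may weigh additional factors.

```python
from collections import deque

def admit(queue, block_size, max_blocks, blocks_in_use):
    """Admit queued (request_id, n_tokens) pairs that fit the free block budget.

    Requests that do not fit stay queued in their original order, so a large
    request at the head does not block smaller ones behind it.
    """
    free = max_blocks - blocks_in_use
    admitted = []
    still_waiting = deque()
    for req_id, n_tokens in queue:
        need = -(-n_tokens // block_size)  # ceiling division without floats
        if need <= free:
            admitted.append(req_id)
            free -= need
        else:
            still_waiting.append((req_id, n_tokens))
    queue.clear()
    queue.extend(still_waiting)
    return admitted
```

With 10 free blocks of 16 tokens, a 900-token request at the head is skipped while the 100-token and 40-token requests behind it are admitted.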
Anthropic-compatible /v1/messages endpoint: streaming message_start → content_block_start → content_block_delta* → content_block_stop → message_delta → message_stop events; non-streaming JSON path
/v1/embeddings endpoint: 17 architectures supported via forward_hidden (Llama/Llama4/Qwen2/Qwen3/Qwen3MoE/Qwen3Next/Mistral/Gemma/Gemma4/Phi/Phi4/DeepSeek/Cohere/Granite/GptOss/NemotronH/BERT) — pooling via pmetal_models::pooling
Token logprobs: SamplingParams.logprobs_top_n plumbed end-to-end through non-streaming and SSE streaming on both /v1/chat/completions and /v1/completions. New pmetal_models::generation::token_logprobs primitive; ANE/CPU paths emit logprob: None
Best-effort tool calling on /v1/chat/completions: try_parse_tool_calls accepts {name, arguments} or {tool_calls: [...]}. ChatCompletionRequest.tools gates the attempt; chat templating threads tool defs into the rendered prompt
IncrementalDecoder<Aux> SSE buffer: shared UTF-8 boundary buffer + per-token aux pipelining (used for logprobs alignment) across chat/completions/anthropic streams
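The UTF-8 boundary buffering described for the SSE streams can be shown with Python's standard incremental decoder. This is an illustrative sketch of the idea, not the Rust `IncrementalDecoder<Aux>` type: the class name `BoundaryBuffer`, its `push` method, and the way aux values are grouped are assumptions.

```python
import codecs

class BoundaryBuffer:
    """Hold back bytes until they form complete UTF-8 characters.

    Auxiliary per-token values (for example logprobs) are queued alongside
    and released together with the text they belong to.
    """

    def __init__(self):
        self._decoder = codecs.getincrementaldecoder("utf-8")()
        self._pending_aux = []

    def push(self, token_bytes, aux=None):
        self._pending_aux.append(aux)
        text = self._decoder.decode(token_bytes)
        if not text:
            return None  # still in the middle of a multi-byte character
        aux_items, self._pending_aux = self._pending_aux, []
        return text, aux_items
```

A token that ends halfway through a multi-byte character produces no SSE chunk; the next token completes the character and carries both tokens' aux values, which keeps text and logprobs aligned.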

Job orchestration substrate (TUI / GUI / MCP / CLI parity)

JobSpec substrate: 16 canonical spec types in pmetal-core (Train, Distill, GRPO, Bench, Eval, Pretrain, Tokenize, Serve, Generate, RLKD, EmbedTrain, DFlash, Memory, Ollama, …) with #[derive(JobSpec)] proc-macro
JobEvent canonical streaming protocol: progress / metric / log / artefact / complete / failed events emitted by all 4 surfaces (CLI, TUI, GUI, MCP)
CLI: 8 specced Commands variants flattened — 613 LOC removed from main.rs; cli/<sub>.rs Args structs and JobSpec argv round-trip tests; --log-events flag stub
TUI: 14 tabs with full CLI parity, ?-key help overlay, Ctrl+1..9 tab jump, active-job footer badge, descriptor-driven forms with shared FormTabState primitive; channel-based metrics streaming (ChannelMetricsCallback) for direct-path train/distill/grpo/bench/eval/pretrain
GUI (Tauri): complete 9-DTO frontend-lockstep migration to *Spec types; Serve, Bench, Eval, Jobs, Pretrain pages; embed-train + rlkd + ollama routes; channel-based metrics streaming
MCP: 51-tool server with migrated train/pretrain/tokenize/memory/dflash/generate coverage, allowlisted CLI passthrough tools for newly added CLI flags, and a JobEvent JSONL consumer for managed background jobs

SOTA distillation (`pmetal-distill`)

Universal Logit Distillation (ULD) — Wasserstein-1 over sorted logit distributions for cross-tokenizer KD (Boizard et al. 2024); optional top_k truncation; permutation-invariant by design
Generalized Knowledge Distillation (GKD) — λ-weighted off-policy + on-policy KL blend (Agarwal et al. 2024); OnPolicySampler trait with GreedySampler reference impl; compute_full(t_off, s_off, t_on, s_on, T)
MiniLLM — reverse-KL with optional teacher-mix target = mix·T + (1-mix)·S (Gu et al. 2024)
Skewed JSD (DistiLLM-2) — α·KL(T||M_α) + (1-α)·KL(S||M_α) with M_α = α·T + (1-α)·S, log-sum-exp computation; α=0.5 reduces to standard symmetric JSD (Ko et al. 2024)
Attention-transfer loss + weighted Metal path for hidden-state distillation
Offline teacher-logit caching: pmetal distill --offline-cache <path> precomputes teacher logits to disk; new Int8PerToken compressed-block variant replaces NaN-sentinel scheme with explicit per_token_meta field (legacy Int8 variant retained for read-back)
DistillLossOutput.metrics: HashMap<&'static str, f32>: lazily-evaluated teacher_entropy, student_entropy, kl_per_token, top1_agreement exposed to trainer JSONL/TUI streaming
TAID difficulty-aware observability: alpha_var surfaced for per-step monitoring
Configurable ignore_index: PyTorch-standard -100 default on TrainingConfig; safe label clamping before gather
Hidden-state shape assertions before matmul (clear error vs. silent broadcast bug)
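The Skewed JSD formula in this section, α·KL(T||M_α) + (1-α)·KL(S||M_α) with M_α = α·T + (1-α)·S, can be checked numerically with a short log-space sketch. It operates on a single token's logits in plain Python; the function names are hypothetical and it assumes 0 < α < 1 and finite logits.

```python
import math

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def logaddexp(a, b):
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def skewed_jsd(teacher_logits, student_logits, alpha=0.5):
    """alpha*KL(T||M) + (1-alpha)*KL(S||M), with M = alpha*T + (1-alpha)*S."""
    log_t = log_softmax(teacher_logits)
    log_s = log_softmax(student_logits)
    log_a, log_b = math.log(alpha), math.log(1.0 - alpha)
    # log M is computed with log-sum-exp so tiny probabilities do not underflow.
    log_m = [logaddexp(log_a + t, log_b + s) for t, s in zip(log_t, log_s)]
    kl_t = sum(math.exp(t) * (t - m) for t, m in zip(log_t, log_m))
    kl_s = sum(math.exp(s) * (s - m) for s, m in zip(log_s, log_m))
    return alpha * kl_t + (1.0 - alpha) * kl_s
```

At α = 0.5 the two KL terms get equal weight and M is the plain average, so swapping teacher and student leaves the value unchanged and the result is bounded by ln 2, as expected for the symmetric JSD.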

SOTA model merging (`pmetal-merge`)

Fisher merging (Matena & Raffel 2022): diagonal-Fisher-weighted average θ = Σ F_i⊙θ_i / (Σ F_i + ε); lazy-loaded Fisher safetensors; fallback_to_mean for tensors without Fisher entries
RegMean (Jin et al. 2023): closed-form linear-layer merge W = (Σ G_i)⁻¹ · (Σ G_i W_i) via hand-rolled Gauss-Jordan pseudo_inverse_2d with Tikhonov ridge; falls back to mean for non-2D weights
MoE expert permutation alignment: per-(model, layer) Hungarian solver (Jonker-Volgenant style, O(N³)) over L2-normalized cosine similarity of expert fingerprints; tensor-name remapping experts.{i}. → experts.{π(i)}. before merge; gated by align_moe_experts
Honor config.dtype in save path: MergeBuilder.dtype builder, TensorWriter::with_dtype plumbing, per-dtype byte packing for F16/BF16/F32; previously hardcoded to F16
Cross-model dtype consistency check: verify_source_dtypes errors on mismatch unless allow_mixed_dtype is set
Tied-embedding detection: lm_head.weight and embed_tokens.weight aliasing detected and merged once under canonical name
Tokenizer + config sidecar copy: tokenizer.json, tokenizer_config.json, special_tokens_map.json, config.json, generation_config.json copied on full-model merge; config.json.torch_dtype patched to match output dtype
Post-merge sanity sweep (`Sa...
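The Fisher merging formula listed in this section, θ = Σ F_i⊙θ_i / (Σ F_i + ε), reduces to an element-wise weighted average. A minimal sketch over flat Python lists follows; the function name, the `None` convention for a missing Fisher tensor, and the default ε are assumptions, not pmetal-merge's API.

```python
def fisher_merge(params, fishers=None, eps=1e-8):
    """Merge one tensor (flattened to a list) across several models.

    params:  list of per-model parameter lists
    fishers: matching list of diagonal Fisher estimates, or None to fall
             back to a plain mean (mirrors the fallback_to_mean idea)
    """
    n_models = len(params)
    width = len(params[0])
    if fishers is None:
        return [sum(p[j] for p in params) / n_models for j in range(width)]
    merged = []
    for j in range(width):
        numerator = sum(f[j] * p[j] for p, f in zip(params, fishers))
        denominator = sum(f[j] for f in fishers) + eps
        merged.append(numerator / denominator)
    return merged
```

Where one model's Fisher value is zero for an element, the merged value follows the other models, which is the point of weighting by estimated parameter importance.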


24 Mar 04:05

github-actions

v0.4.0

a91acf7

PMetal v0.4.0

[0.4.0] - 2026-03-23

Added

pmetal-mcp crate: Full MCP (Model Context Protocol) server exposing 45 tools for Claude Desktop and other MCP clients. Covers all pmetal functionality — training, inference, distillation, GRPO, RLKD, quantization, model merging, dataset operations, evaluation, benchmarking, model search, and Ollama export
- Device & models: device_info, search_models, download_model, list_local_models, model_fit, model_info
- Inference: generate (blocking), chat (via running serve instance), start_serve, benchmark, bench_train, bench_gen, bench_corpus
- Training: train, distill, grpo, rlkd, embed_train — all as background jobs with full parameter coverage matching the CLI
- Runtime training control: job_set_lr, job_reduce_lr, job_reset_lr, job_save_checkpoint, job_graceful_stop — LLM-driven adaptive training via the control file protocol
- Job management: list_jobs, job_status, job_logs, stop_job
- Dataset ops: dataset_analyze, dataset_preview, dataset_validate, dataset_download, dataset_convert, dataset_filter, dataset_split, dataset_merge, dataset_sample, dataset_template, dataset_prepare
- Quantization & conversion: quantize, fuse_lora, merge_models, pack_experts, ollama_create, ollama_modelfile
- Evaluation: eval_perplexity
- All tools include rich #[description] annotations for parameter documentation in the MCP schema
- Standalone binary (pmetal-mcp) for Claude Desktop + pmetal mcp subcommand (behind mcp feature flag)
- Uses turbomcp v3.0.7 from crates.io
Runtime training control protocol: Extended the control file protocol (.lr_control.json) with SaveCheckpoint and GracefulStop commands. The adaptive LR controller now polls the control file before checking its enabled flag, so external agents (MCP, TUI) can always send commands regardless of whether automatic detection is active
--no-adaptive-lr flag: Disables automatic spike/plateau/divergence detection while keeping the control file protocol active. Enables fully LLM-driven learning rate control — the agent observes loss via job_status and manually adjusts LR via job_set_lr/job_reduce_lr
UltraFusion execution planner (pmetal-distributed): Per-die stage planner for M-series Ultra Macs with in-memory channel transport backend for same-process links, avoiding TCP overhead on UltraFusion interconnect
MPP FlashAttention for head_dim 64/96: Metal 4 MPP flash attention kernel now supports head_dim 64, 96, and 128 with stride-2/stride-3 SIMD lane packing and causal/non-causal variants
Tuna persistent disk cache: The auto-tuner now persists benchmark results to disk, avoiding re-tuning on restart. Expanded search covers FlashAttention, FusedCrossEntropy, FusedNormLora, and FusedSwiGLU via function constants
MoE GPU top-k selection: Expert top-k selection moved from CPU sort to GPU argpartition_axis, eliminating a sync point in the MoE forward path
bench-workload CLI command: Benchmark a real cached workload for inference and short LoRA training with named presets (--preset dense-qwen3, --preset hybrid-qwen3next)
KV cache quantization auto-select: --kv-quant is now optional — omitting it auto-selects the fastest quantization mode that fits the device memory budget
UltraFusion info display: pmetal info shows UltraFusion topology, die count, and local executor plan on Ultra Macs
Qwen3 LoRA RoPE reset: Qwen3 LoRA and QLoRA gain dense attention and RoPE reset support
ANE real-time evaluation: Experimental _ANEClient real-time dispatch with automatic fallback to standard evaluation on failure. Propagated via --ane-real-time CLI flag
bench-corpus CLI command: Structured kernel benchmarking with device-tier-aware test cases, JSON reporting, and --quick/--output flags
GPU memory bandwidth probing: Real GPU copy benchmark replaces static spec-table lookup, with disk-cached results and spec-table fallback
Persistent runtime kernel backend selection: Benchmark-and-persist infrastructure races MLX vs MPP backends on Apple10/M5, validates numerical agreement, and caches the winner to disk for 4-bit quantized linear, fused attention, and LoRA matmul
MPP kernel tile variants: Metal 4 GEMM supports parameterized tile variants (32x32, 64x32, 32x64, 64x64) with Tuna auto-tuner selection per device and problem shape
Serve ANE/CPU-hybrid engine caching: Serve engine auto-selects optimal backend (ANE, CPU-hybrid, GPU) at startup with permanent downgrade on failure. Compiled engines cached across requests
Rollback enabled by default for LoRA: Best-loss checkpoint rollback now defaults to on with an extended warmup grace period. The snapshot is persisted to disk via atomic write. A for_lora() factory provides the recommended defaults
Extended StepMetrics: gpu_fwd_bwd_ms, optimizer_ms, io_staging_ms, overhead_ms fields for fine-grained training profiling
Zero-copy MoE expert dispatch: ExpertBufferPool with read_experts_aligned + encode_expert_aligned for pread-to-Metal expert weight dispatch. Auto-enable KV-Q8 when memory-constrained
ANE dual-die support: On UltraFusion chips, compile variant-B kernel set with distinct MIL hashes and alternate per step for dual-die thermal distribution. Auto-recompile on throughput degradation (>15% or >25K dispatches)
Batched parameter eval: Model dispatcher evaluates parameters in batches of 128 tensors per sync instead of all-at-once, reducing peak memory during model loading
Architecture enhancements: DeepSeek V3/V3.2, GPT-OSS, Jamba, Llama 4, Qwen3, and Qwen3-MoE model improvements and weight sanitization refinements
Third-party attribution: Complete THIRD_PARTY_NOTICES with entries for mlx-lm, llama.cpp/GGML, Candle, and Burn

Changed

ANE is now opt-in: The --no-ane flag has been replaced with --ane across CLI, TUI, orchestrator, and MCP. ANE training is experimental and limited to small models, so it defaults to off. The orchestrator's DispatchConfig now sets ane: false by default
Gradient checkpointing support corrected: Qwen3 and Qwen3Next no longer claim gradient checkpointing support (was incorrectly advertised)
Training loop refactored: Gradient checkpointing helper extracted, step logging tracks step numbers correctly, training loop tests expanded

Removed

Merge methods: Removed merge methods with incompatible licenses. Cleaned up related references across documentation and configuration

Fixed

MetalSampler use-after-free: Retained source logits array until GPU completion in serve engine
Fused merge Tuna cache: Now uses persistent disk cache instead of ephemeral per-session tuning

Downloads

| Asset | Description |
| --- | --- |
| `pmetal-*-aarch64-apple-darwin.tar.gz` | CLI binary + mlx.metallib (Apple Silicon) |
| `PMetal--aarch64-apple-darwin-.dmg` | Desktop GUI app (Apple Silicon) |
| `mlx.metallib` | MLX Metal shader library (standalone) |

CLI Quick Start

tar xzf pmetal-*-aarch64-apple-darwin.tar.gz
./pmetal train --model Qwen/Qwen3-0.6B --dataset train.jsonl --output ./output

GUI

Mount the DMG and drag PMetal to Applications.

Full Changelog: v0.3.13...v0.4.0


16 Mar 18:52

github-actions

v0.3.7

071dd7e

PMetal v0.3.7

[0.3.7] - 2026-03-16

Added

pmetal merge CLI command: Model merging exposed as a first-class CLI command supporting all merge methods (Linear, SLERP, TIES, DARE, DELLA, NearSwap, Model Stock) with --method, --base, --t, --weight-a, --weight-b, --density, and --dtype flags
pmetal eval CLI command: Dataset evaluation command — measures loss/perplexity over a validation set with optional LoRA adapter, --num-samples cap, and --json output
pmetal info CLI command: Prints device and runtime information; --json flag emits structured JSON for scripting
pmetal search --json output: Structured JSON output mode for search results including fit estimates, download counts, parameter estimates, and tags — enables scripting and GUI integration
QuantizeMethod enum: Replaces the string --method argument for pmetal quantize with a typed enum (dynamic, q8_0, q4_k_m, etc.) — invalid methods now fail at argument parsing rather than deep inside the quantizer
GRPO CLI arguments: --epochs, --lora-r, --lora-alpha, --max-completion-length, and --seed exposed as CLI arguments, replacing previous hardcoded defaults
loraplus_lr_ratio and neftune_noise_alpha: New fields on training loop configurations — enables LoRA+ differential learning rates and NEFTune noise injection directly from config
trainable_params() helper: New utility in pmetal-lora for counting total vs. trainable parameter counts, useful for logging and memory estimation
lora_alpha: f32: Distillation CLI and run_distillation_cli now accept lora_alpha as f32 instead of usize for finer-grained scaling control
seed parameter in distillation and GRPO CLI: Reproducible runs via explicit --seed flag in all training entry points
Gemma3 sliding window auto-detection: DynamicModel loader now reads model_type == "gemma3" and sets is_gemma3 = true on the config, enabling the correct every-6th-layer global attention pattern without manual config overrides
KV cache support for more architectures: DynamicModel::forward_with_cache now routes DeepSeek, Cohere, StarCoder2, and Llama4 to their native caching paths; RecurrentGemma and Jamba now get clear error messages that they require forward() directly; hybrid models (NemotronH, Qwen3Next) get a descriptive error directing to forward_with_hybrid_cache
Speculative decoding greedy path: SpeculativeDecoder::verify_greedy() — exact-correct verification for temperature=0 decoding using argmax equality; avoids the numerically unstable rejection-sampling limit as temperature→0
Hub cache management (pmetal-hub): New cache.rs module with cache inspection, eviction, and size-reporting helpers
Shared model utilities (pmetal-models/utils.rs): Common helpers extracted from per-architecture modules to reduce duplication
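The greedy verification path listed in this section replaces rejection sampling with a plain argmax comparison when temperature is 0. A minimal sketch of that comparison is below; it is not the `SpeculativeDecoder::verify_greedy()` implementation, and the input layout (one target logit row per draft position plus one extra row) is an assumption.

```python
def argmax(row):
    return max(range(len(row)), key=row.__getitem__)

def verify_greedy(draft_tokens, target_logits):
    """Return the tokens to emit after checking a draft against the target model.

    target_logits[i] is the target model's logit row for position i, given the
    prefix plus draft_tokens[:i]; it has len(draft_tokens) + 1 rows.
    """
    output = []
    for i, token in enumerate(draft_tokens):
        best = argmax(target_logits[i])
        if best != token:
            output.append(best)  # first mismatch: emit the target's token, stop
            return output
        output.append(token)
    # Every draft token matched, so the extra row yields one bonus token.
    output.append(argmax(target_logits[len(draft_tokens)]))
    return output
```

The output is always identical to what the target model would have produced by greedy decoding alone, and no probabilities are divided, so nothing degenerates as temperature approaches 0.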

Fixed

Scale factor broadcasting in distillation: squeeze applied to the scale factor dimension so it broadcasts correctly across batch and sequence axes — previously caused shape mismatches on non-unit batch sizes
TAID mean_alpha forcing GPU sync: TaidLossOutput::mean_alpha changed from f32 to a lazy Array — the .eval() call is deferred until callers explicitly call .item::<f32>(), removing a forced GPU-CPU sync before the backward pass
SLERP numerical stability: Added epsilon clamping in the SLERP merge path to prevent NaN when interpolation parameter is at the boundary values (0.0 or 1.0)
Llama LoRA trainable_params / gradient application: Replaced 100+ lines of repeated field accesses with an insert_adapter! macro and a loop over projection names, fixing a bug where the DoRA magnitude parameter was silently dropped from gradient maps
GaLore improvements: Corrected projection matrix update schedule and subspace dimensionality handling
Distillation hidden-state loss: Refactored alignment computation to correctly handle variable-rank teacher/student hidden state tensors
Jensen-Shannon / KL divergence loss: Numerical stability improvements — log-sum-exp stabilization applied consistently across all reduction paths
Offline distillation: Fixed logit cache loading to handle both single-file and sharded cache layouts
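The SLERP stability fix in this list concerns degenerate inputs to the interpolation formula. The notes do not say exactly where pmetal applies its epsilon clamp, so the sketch below only shows the guards commonly used for this formula: clamping the cosine into [-1, 1] and falling back to linear interpolation when sin θ is near zero. The function name and ε value are assumptions.

```python
import math

def slerp(a, b, t, eps=1e-6):
    """Spherical interpolation between two flat weight vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))

    def lerp():
        return [(1.0 - t) * x + t * y for x, y in zip(a, b)]

    if norm_a < eps or norm_b < eps:
        return lerp()  # the angle is undefined for a zero vector
    cos_theta = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_theta = max(-1.0, min(1.0, cos_theta))  # rounding can push it past 1
    theta = math.acos(cos_theta)
    sin_theta = math.sin(theta)
    if sin_theta < eps:
        return lerp()  # nearly parallel: avoid dividing by ~0
    w_a = math.sin((1.0 - t) * theta) / sin_theta
    w_b = math.sin(t * theta) / sin_theta
    return [w_a * x + w_b * y for x, y in zip(a, b)]
```

With these guards, t = 0.0 and t = 1.0 return the endpoints, and identical inputs return themselves instead of NaN.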

Changed

lm_groups.rs / LoRA+ optimizer groups: build_lora_param_groups significantly reworked — LoRA+ differential LR ratio (loraplus_lr_ratio) applied to lora_b parameters, NEFTune noise injection integrated into group construction
GRPO trainer: epochs, lora_r, lora_alpha, max_completion_length, and seed plumbed through from CLI args; previously these were hardcoded to 1, 16, 32, 512, and a fixed seed
Training loop: loraplus_lr_ratio and neftune_noise_alpha read from config and forwarded to optimizer group construction
pmetal-core config / scheduler / traits: Config structs gained loraplus_lr_ratio and neftune_noise_alpha fields; scheduler types and learning rate trait bounds refined; TrainingCallback trait extended with blanket impls for boxed callbacks
Data pipeline: Tokenizer, packing, vocab_compact, dataset, and chat template modules updated — minor correctness and efficiency fixes accumulated across the release cycle
GGUF reader / writer / quantize: Reader handles additional tensor metadata fields; writer improves alignment padding; quantize module uses QuantizeMethod enum instead of string matching
Hub search: search_models returns richer result structs used by both the human-readable table and the new --json output path; upload path fixes for large model shards
Metal kernels: GDN, LoRA, grouped GEMM, and fused SwiGLU Metal shaders updated — improved numerical correctness and register pressure
GUI app icons and Tauri config: Updated icons (32×32, 128×128, 128×128@2x, icns, ico) and tauri.conf.json for the 0.3.7 release build; Python vocoder easy API additions and mel spectrogram fix
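For the LoRA+ optimizer-group change in this section, the core idea is that `lora_b` parameters train at the base learning rate multiplied by `loraplus_lr_ratio`. A hedged sketch of that grouping follows; the function name, the dictionary layout, and matching on the substring `lora_b` are illustrative assumptions, not the signature of `build_lora_param_groups`.

```python
def sketch_lora_param_groups(param_names, base_lr, loraplus_lr_ratio):
    """Split parameter names into two optimizer groups with different LRs."""
    groups = {
        "default": {"lr": base_lr, "params": []},
        "lora_b": {"lr": base_lr * loraplus_lr_ratio, "params": []},
    }
    for name in param_names:
        key = "lora_b" if "lora_b" in name else "default"
        groups[key]["params"].append(name)
    return groups
```

A ratio of 1.0 makes both groups use the same learning rate, which recovers ordinary LoRA training.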

Downloads

| Asset | Description |
| --- | --- |
| `pmetal-*-aarch64-apple-darwin.tar.gz` | CLI binary + mlx.metallib (Apple Silicon) |
| `PMetal--aarch64-apple-darwin-.dmg` | Desktop GUI app (Apple Silicon) |
| `mlx.metallib` | MLX Metal shader library (standalone) |

CLI Quick Start

tar xzf pmetal-*-aarch64-apple-darwin.tar.gz
./pmetal train --model Qwen/Qwen3-0.6B --dataset train.jsonl --output ./output

GUI

Mount the DMG and drag PMetal to Applications.

Full Changelog: v0.3.6...v0.3.7


15 Mar 23:56

github-actions

v0.3.6

118f20f

PMetal v0.3.6

[0.3.6] - 2026-03-15

Added

Desktop GUI (Tauri + Svelte): Full desktop application for model management, training, distillation, GRPO, inference, merging, and quantization. 10 pages: Dashboard, Models, Datasets, Training, Distillation, GRPO, Inference, Merging, Quantize, Settings. Real-time training metrics with live loss charts via broadcast events. Model download with HuggingFace Hub integration, dataset browser, and inference chat interface with streaming token display
GUI in-process execution: Training, distillation, GRPO, inference, model merging, LoRA fuse, and quantization run as direct library calls instead of shelling out to the pmetal binary. Eliminates binary discovery issues, reduces process overhead, and enables richer progress reporting. Device info and model metadata also read from library APIs
easy::dpo() / easy::simpo() / easy::orpo() / easy::kto() builders: PreferenceTuneBuilder in easy.rs for preference optimization methods. Full pipeline: model download → tokenizer → dataset loading → LoRA setup → training loop → weight saving. Supports method-specific config (DPO beta/loss type, SimPO gamma/CPO, ORPO beta, KTO desirable/undesirable weights)
easy::infer().generate_streaming(): Streaming inference API with per-delta callback. Supports both base models and LoRA adapters. Return false from the callback to cancel early. The ANE fallback emits the full result as a single delta
Preference trainer train() methods: DPO, KTO, ORPO, and SimPO trainers now have self-contained train() methods with optimizer integration, batching, epoch loops, callback lifecycle, and metrics collection. Previously only exposed per-step primitives
TrainingCallback::should_stop(): Clean cancellation mechanism. A callback returns true to request that the training loop finish the current step and exit with a Cancelled error. Checked after every step in all 5 TrainingLoop::run* methods, all 4 preference trainer train() loops, and GrpoTrainer::run()
PMetalError::Cancelled: New error variant for clean training cancellation. Corresponding Cancelled variants added to SftError, DpoError, KtoError, OrpoError, SimpoError, and GrpoError
Preference batch padding utilities: pad_u32_sequences, pad_i64_sequences, pad_f32_sequences in preference_batch.rs for batching variable-length preference pairs
NemotronH runtime FP8 quantization: quantize_fp8() converts float weights to FP8 (E4M3) at runtime for all four block types (Mamba, attention, MLP, MoE). Shared helpers materialize_linear_weight and linear_forward_with_optional_fp8 consolidate FP8 dequantization across the model. MoE weights are restacked after quantization for batched dispatch
FluxPipeline::from_pretrained: Load Flux diffusion pipelines from HuggingFace-style model directories. Discovers components via model_index.json, parses both native and diffusers-style config keys for CLIP, T5, FluxDiT, and VAE
Python training callbacks: Trainer.add_callback() now wires callbacks into the training loop. Built-in ProgressCallback, LoggingCallback, and MetricsJsonCallback map to native Rust implementations; arbitrary Python objects bridge through PythonCallbackBridge
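The cancellation flow introduced with `TrainingCallback::should_stop()` in this section (poll after each step, then exit with a dedicated Cancelled error) can be sketched in a few lines. Everything below is a toy model of that control flow: the class names, the `on_step_end` hook, and the placeholder loss are hypothetical and do not reflect the Rust trait's methods or the Python bindings.

```python
class Cancelled(Exception):
    """Raised when a callback asks the loop to stop."""

class StopAfter:
    """Toy callback that requests a stop once a given step has finished."""

    def __init__(self, last_step):
        self.last_step = last_step
        self.seen = 0

    def on_step_end(self, step, loss):
        self.seen = step

    def should_stop(self):
        return self.seen >= self.last_step

def run(total_steps, callbacks):
    """Run a dummy training loop, polling should_stop() after every step."""
    for step in range(1, total_steps + 1):
        loss = 1.0 / step  # placeholder for the real forward/backward pass
        for cb in callbacks:
            cb.on_step_end(step, loss)
        if any(cb.should_stop() for cb in callbacks):
            # The current step has fully completed before the loop exits.
            raise Cancelled(f"cancelled after step {step}")
    return total_steps
```

Because the request is checked only between steps, the loop never unwinds in the middle of a step, which is the property the earlier panic-based approach could not guarantee.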

Fixed

Training cancellation via panic_any replaced: GUI and TUI previously used std::panic::panic_any(CancelledRun) + catch_unwind to abort training — fragile, UB-prone through FFI, and could be swallowed by intermediate catch_unwind. Replaced with TrainingCallback::should_stop() returning a clean Err(Cancelled) from the training loop
GUI QLoRA silently failed on non-Llama models: run_qlora_training_in_process hardcoded LlamaConfig deserialization, causing confusing errors or silent misconfiguration for Gemma/Qwen/Phi models. Now detects model_type from config.json and returns a clear error for unsupported architectures
GUI resume_from silently ignored: Training config accepted resume_from but discarded it (let _ = eval). Now returns an error directing users to the CLI
GUI GRPO with no reward function produced noise: DummyReward returning constant 0.1 for all completions made GRPO training meaningless when reasoning rewards were disabled. Now requires explicit reward configuration
Preference trainers doubled compute per step: DPO, KTO, ORPO, and SimPO train() methods ran a second full forward pass after the gradient step solely for logging metrics. Replaced with RefCell side-channels that capture metric arrays from within the autograd closure — same metrics, zero extra compute
Base model thinking mode: Auto-detect base vs instruct models and disable <think> tag prefill for base models. Base models don't understand thinking tags, causing infinite generation without a closing tag
Fused model 5x slower than LoRA: Skip ANE-hybrid path for models under 2B parameters where GPU KV-cache decode is significantly faster (115 vs 20 tok/s). ANE-hybrid benefits larger models where prefill dominates
DataLoader panics on bad images: Replace panic!() in VLM batch construction with proper DataLoaderError enum and try_next_batch() method. Image preprocessing failures and missing-image errors now propagate as Result instead of crashing
Division by zero with log_every=0: Clamp log_every and save_every to minimum 1 across TrainingLoop, LoggingCallback, CheckpointCallback, and CLI
LoRA scaling with rank 0: LoraConfig::scaling() returns 0.0 when rank is 0 instead of dividing by zero
BF16 LoRA weights: sanitize_loaded_weights() converts BF16 tensors to FP16 since MLX doesn't natively support BF16 on Apple Silicon
Qwen3Next silent weight mismatch: Weight loading now returns errors for unmatched or missing parameters instead of logging a warning and continuing with a partially loaded model
Dataset download only fetched README: download_dataset() now enumerates repo files and downloads actual data files (parquet, json, jsonl, csv, arrow, etc.) with split-aware filtering
Model download silent failures: download_model() tracks per-file failures and reports them instead of silently skipping failed downloads
Flux loading via DynamicModel: DynamicModel::load() for Flux now returns an error directing to FluxPipeline instead of incorrectly loading a diffusion model as a causal LM
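Two of the fixes in this list are simple guards against dividing by zero: the LoRA scaling factor with rank 0, and logging intervals of 0. A minimal sketch of both guards follows; the function names are hypothetical, and alpha / rank is the standard LoRA scaling rather than a confirmed detail of `LoraConfig::scaling()` (which may differ under RSLoRA, for example).

```python
def lora_scaling(alpha, rank):
    """Standard LoRA scaling alpha / rank, defined as 0.0 for rank 0."""
    return 0.0 if rank == 0 else alpha / rank

def should_log(step, log_every):
    """Clamp the interval to at least 1 so `step % log_every` is always valid."""
    return step % max(log_every, 1) == 0
```

With the clamp, `log_every=0` behaves like logging on every step rather than raising a division error.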

Changed

GUI architecture: library calls replace subprocess spawning: Training, distillation, GRPO, inference, merge, fuse, and quantize commands now call pmetal library functions directly instead of spawning pmetal CLI as a child process. System info reads from MetalContext::global() instead of parsing pmetal memory stdout. Removes which and futures-util dependencies
TUI direct training execution: command_runner.rs dispatches train, distill, and grpo commands as in-process library calls via run_direct_command(), falling back to subprocess for other commands. Training parameters parsed from CommandSpec args with parse_arg/required_arg/optional_arg helpers
ORPO loss computation refactored: compute_orpo_loss_static now contains the full computation directly instead of creating a throwaway OrpoTrainer instance. The instance method compute_orpo_loss delegates to it
SimPO gradient-safe loss path: New compute_loss_with_cpo_for_grad static method keeps the computation graph lazy (no .eval()/.item() calls) for correct autograd. The existing compute_loss_with_cpo remains for non-grad contexts
FinetuneBuilder expanded: New builder methods — lora_dropout(), use_rslora(), use_dora(), gradient_checkpointing_layers(), callback(), metrics_path(). LoRA config now forwards dropout, RSLoRA, and DoRA settings
GRPO CLI gains new parameters: epochs, lora_r, lora_alpha, max_completion_length exposed as CLI arguments and TUI form fields. GRPO now saves adapter_config.json alongside LoRA weights
CLI emit_console_output flag: Training, distillation, and GRPO CLI functions accept emit_console_output: bool and extra_callbacks: Vec<Box<dyn TrainingCallback>> to suppress terminal output when called from GUI/TUI
DataLoader error handling: New DataLoaderError enum with Mlx, ImagePreprocess, and MissingImages variants. All 7 training loop entry points migrated from next_batch() to try_next_batch()
AdapterManager validation: load() now validates path existence, checks for adapter artifacts in directories, and rejects unsupported file types
Metal shader build isolation: Shader compiler cache redirected to build output directory, preventing pollution of user's home directory
unsafe_code lint scoping: Moved blanket #![allow(unsafe_code)] from crate-level lib.rs into individual modules that contain unsafe blocks across pmetal-metal, pmetal-mlx, pmetal-models, pmetal-trainer, pmetal-distill, and pmetal-distributed

Downloads

| Asset | Description |
| --- | --- |
| `pmetal-*-aarch64-apple-darwin.tar.gz` | CLI binary + mlx.metallib (Apple Silicon) |
| `PMetal--aarch64-apple-darwin-.dmg` | Desktop GUI app (Apple Silicon) |
| `mlx.metallib` | MLX Metal shader library (standalone) |

CLI Quick Start

tar xzf pmetal-*-aarch64-apple-darwin.tar.gz
./pmetal train --model Qwen/Qwen3-0.6B --dataset train.jsonl --output ./output

GUI

Mount the DMG and drag PMetal to Applications.

Full Changelog: v0.3.5...v0.3.6


15 Mar 03:16

github-actions

v0.3.5

807f444

v0.3.5

Full Changelog: v0.3.4...v0.3.5


14 Mar 20:08

github-actions

v0.3.4

d8488cb

v0.3.4

[0.3.4] - 2026-03-14

Added

Mixture-of-Depths (MoD) for Llama 4: Proper implementation per Raposo et al. (2024) — lightweight router with argpartition_axis top-k, gather-before-compute on sub-batch, scatter-after, BCE auxiliary loss. Configurable capacity factor and per-layer selection
Llama 4 RoPE: Real RoPE implementation via pmetal_mlx::kernels::rope::apply_rope (Metal-accelerated), replacing the placeholder stub. Correctly wired into iRoPE layer dispatch — RoPE layers get rotary embeddings, NoPE layers skip them
Llama 4 temperature scaling: Per Meta's formula log(floor((pos+1)/floor_scale) + 1) * attn_scale + 1.0, applied to Q states in NoPE layers before QK matmul for long-context attention stabilization
Llama 4 GQA: KV-head broadcast expansion for grouped-query attention — enables Scout (40 Q / 8 KV) and Maverick configs
MoE top-k > 1: Llama4Router uses argpartition_axis for O(n) expert selection with L1-normalized weights and per-slot dispatch loop, replacing hardcoded argmax
ANE fused kernels: gen_dynamic_sdpa_fwd (single-kernel attention: RMSNorm + QKV + SDPA + Wo) and gen_dynamic_ffn_w13 (single-kernel FFN: RMSNorm + W1 + W3 + SiLU), replacing 6+ separate ANE evaluations per layer
ANE fused backward: gen_dynamic_ffn_bwd_w2t and gen_dynamic_ffn_bwd_w13t for fused FFN backward pass
Metal dequantization kernels: Q4_0 and IQ4_XS Metal compute shaders, verified correct per GGML spec. Bridge methods in MlxMetalBridge for GPU-accelerated dequantization
Cancellation safety infrastructure: CompletionToken::Drop guard in AsyncScheduler waits for in-flight GPU commands; retain_resource() / as_retained() for Metal buffer lifetime extension
IoSurface helpers: write_f32_strided_at, write_f32_at_col_offset, zero_channel_range_f32 for fused backward kernel IO
CloudBridge: Complete training state export (weights, optimizer state, RNG, dataloader position, metadata) with working Python bootstrap scripts for FSDP/DeepSpeed cluster resumption and Rust-side loader functions
Formal verification: cargo-kani proofs for ring all-reduce chunk arithmetic (95 checks) and k-ary tree topology consistency (607 checks), with justfile recipes
Reasoning templates: MathReasoningTemplate (GRPO + accuracy/format rewards) and CodeReasoningTemplate (structural code fence + test case matching)
Reasoning dataset auto-detection: pmetal dataset prepare automatically detects problem/thinking/solution columns and formats them as <think> tagged ChatML conversations
--columns flag: General column remapping for dataset prepare (e.g., --columns "instruction=question,output=answer")
adapter_config.json: Saved alongside LoRA weights during training (r, alpha, target_modules, use_rslora). Loaded automatically at inference and fuse time — eliminates config guesswork
Supply chain: cargo-vet initialized with Mozilla, Google, and Bytecode Alliance audit imports; 17 workspace crates covered; 5 transitive dependency exemptions with exact lockfile versions
Tracing spans: 6 info_span! markers in Python trainer for phase-level observability (model_resolve, load_tokenizer, load_dataset, load_model, training_loop, save_weights)
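
The Llama 4 temperature-scaling formula quoted above can be sketched as a small helper. This is only an illustration of the arithmetic, not PMetal's implementation: the function names are hypothetical, and the `floor_scale` / `attn_scale` values used in the usage note are illustrative assumptions rather than values stated in these notes.

```rust
/// Per-position attention temperature scale, following the formula
/// log(floor((pos + 1) / floor_scale) + 1) * attn_scale + 1.0.
/// Hypothetical helper; `floor_scale` and `attn_scale` come from the model config.
fn attn_temperature_scale(pos: usize, floor_scale: f32, attn_scale: f32) -> f32 {
    let bucket = ((pos as f32 + 1.0) / floor_scale).floor();
    (bucket + 1.0).ln() * attn_scale + 1.0
}

/// Scale a query vector in place. Per the release note this would apply
/// only to NoPE layers, before the QK matmul.
fn scale_query(q: &mut [f32], pos: usize, floor_scale: f32, attn_scale: f32) {
    let s = attn_temperature_scale(pos, floor_scale, attn_scale);
    for x in q.iter_mut() {
        *x *= s;
    }
}
```

With `floor_scale = 8192.0` and `attn_scale = 0.1` (assumed values), every position below 8191 gets a scale of exactly 1.0, so short contexts are unaffected and the scaling only grows logarithmically for long contexts.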

Fixed

LoRA inference garbage output: Merged LoRA weights into base model at inference time (W += scale*B@A), matching mlx-lm's pattern. The separate-forward path had dtype mismatch issues (BF16 base × F32 LoRA)
Auto-chat mode regression: Removed heuristic that forced chat template on base models just because their tokenizer has <|im_end|>. Chat mode now requires explicit --chat or an instruction-tuned model
Missing EOS in training data: Training sequences now end with the model's actual EOS token (e.g., <|endoftext|> for Qwen). Previously they ended only with the turn delimiter (<|im_end|>), so the model never learned to stop generating
Fuse command wrong alpha/rank: pmetal fuse now reads adapter_config.json for correct alpha and rank instead of defaulting to scale=1.0. Also filters MLP LoRA weights (rank=0) when auto-detecting rank from shapes
ANE x2norm backward bug: FFN weight gradients (dW1, dW3) were computed against the wrong pre-norm tensor (xnorm from attention block instead of x2norm from FFN block). Restored x2norm field and CPU RMSNorm recomputation for gradient correctness
ANE sdpa_bwd surface dtype: Backward SDPA output surfaces were allocated as fp32 but ANE kernels produce fp16 — stride mismatch corrupted dV/dQ/dK gradients. Fixed to IoSurface::for_tensor() (fp16)
MoD argpartition sign: Router negated weights before argpartition_axis, selecting bottom-k (least important) tokens instead of top-k. Removed negation
MLX bridge copy_as_f32 regression: Renamed methods dropped auto dtype conversion — callers passing wrong dtype would panic. Restored copy_as_f32 / copy_as_f16 with auto-conversion
MLX bridge view_f32 eval: An earlier change had removed the .eval() call before accessing the data pointer, so unevaluated arrays returned null. Restored the defensive eval
Python API surface: Restored ProgressCallback, LoggingCallback(log_every=10), __version__, and PythonCallbackBridge that were deleted during PyO3 migration
TUI training completion: Reads final metrics from JSONL file on disk (immune to polling lag). Shows actual loss and step count instead of 0.0000 / sample count
TUI Steps/min overflow: Guards against divide-by-zero when total_ms=0 and shows a placeholder dash ("—") instead of 60000
Dataset prepare panic: Empty results no longer crash with index-out-of-bounds. Shows diagnostic message with format hints
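
The merge described in the LoRA inference fix (W += scale*B@A) can be sketched on plain row-major slices. This is a minimal illustration, not the project's `merge_lora()`: the function name `merge_lora_into`, the matrix layout, and the use of `f32` throughout are assumptions made for the example.

```rust
/// Merge a LoRA update into a dense weight matrix in place: W += scale * (B @ A).
/// All matrices are row-major (an assumption for this sketch):
///   W is out_dim x in_dim, B is out_dim x rank, A is rank x in_dim.
fn merge_lora_into(
    w: &mut [f32],
    b: &[f32],
    a: &[f32],
    out_dim: usize,
    in_dim: usize,
    rank: usize,
    scale: f32,
) {
    assert_eq!(w.len(), out_dim * in_dim);
    assert_eq!(b.len(), out_dim * rank);
    assert_eq!(a.len(), rank * in_dim);
    for i in 0..out_dim {
        for k in 0..rank {
            // Fold the scale into B's element once per (i, k) pair.
            let b_ik = scale * b[i * rank + k];
            for j in 0..in_dim {
                w[i * in_dim + j] += b_ik * a[k * in_dim + j];
            }
        }
    }
}
```

Because the result is a single matrix per layer, the forward pass never multiplies tensors of two different dtypes, which is the failure mode the fix above attributes to the separate-forward path.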

Changed

LoRA inference uses merge: merge_lora() is called before generation, producing a single merged weight matrix per layer. This is equivalent to the fuse command but happens in-memory without saving
PyO3 0.23 → 0.28: allow_threads → detach, with_gil → attach, from_py_object on all pyclass types, Bound<'py, PyDict> return types
tokio 1.49 → 1.50
unsafe_code lint: Escalated from warn to deny workspace-wide

Full Changelog: v0.3.3...v0.3.4

13 Mar 03:11

github-actions

v0.3.3

f6c29f2

v0.3.3

[0.3.3] - 2026-03-12

Added

Self-contained binary: mlx.metallib is now gzip-compressed and embedded into the pmetal binary at build time via build.rs + include_bytes!. On first run it extracts to ~/.cache/pmetal/lib/ if not already present. cargo install pmetal-cli now produces a fully self-contained binary with no external metallib dependency (~31MB added to binary, 70% smaller than the raw 102MB metallib)
Adaptive LR rollback: When divergence is detected and rollback_enabled = true, the adaptive LR controller emits LrEvent::RollbackTriggered — the training loop restores LoRA weights from the best in-memory EMA snapshot, resets optimizer momentum, and continues with a halved LR multiplier
Early-stop on repeated divergence: Once max_rollbacks rollbacks have been used up, the controller emits LrEvent::EarlyStop, and the training loop saves a final checkpoint and exits cleanly instead of spiraling deeper into loss divergence
In-memory LoRA snapshot: TrainingLoop holds the best LoRA weight snapshot in RAM via snapshot_best_weights() / restore_best_weights(). LoRA params are typically 1–20 MB, making this negligible overhead vs checkpoint I/O
AdaptiveAction enum: apply_adaptive_lr() now returns AdaptiveAction::Continue | Rollback | EarlyStop so training loops can react to controller decisions without re-parsing event strings
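
The rollback and early-stop behaviour above can be sketched as a small decision function. The three variant names come from the release note; everything else (the `decide` function, its parameters, and the assumption that the variants carry no data) is hypothetical and only illustrates the control flow.

```rust
/// Variant names as listed in the release note; payload-free variants are an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AdaptiveAction {
    Continue,
    Rollback,
    EarlyStop,
}

/// Hypothetical stand-in for the controller's decision on one training step.
fn decide(
    diverged: bool,
    rollback_enabled: bool,
    has_best_snapshot: bool,
    rollbacks_used: u32,
    max_rollbacks: u32,
) -> AdaptiveAction {
    // Without divergence, or without a snapshot to restore, the loop just
    // continues (a plain LR reduction may still happen inside the controller).
    if !diverged || !rollback_enabled || !has_best_snapshot {
        return AdaptiveAction::Continue;
    }
    if rollbacks_used >= max_rollbacks {
        AdaptiveAction::EarlyStop
    } else {
        AdaptiveAction::Rollback
    }
}
```

A training loop would then `match` on the returned value: restore the best snapshot and reset optimizer momentum on `Rollback`, save a final checkpoint and exit on `EarlyStop`.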

Fixed

apply_adaptive_lr return type: Previously returned (), discarding rollback/early-stop events — callers had no way to react. Now returns AdaptiveAction
Divergence rollback vs plain reduction ambiguity: Divergence path now checks rollback_enabled and has_best_snapshot before deciding between rollback and plain LR reduction — prevents silent rollback when no snapshot exists
EMA state reset on rollback: Spike EMA and variance are reset alongside LR multiplier on rollback so z-score anomaly detection re-stabilizes correctly after weight restoration
total_steps in metrics: run_standard() and run_jit_compiled() previously computed total_steps as max_steps.unwrap_or(0). They now estimate it from dataset.len() / batch_size * epochs when max_steps is None, giving accurate progress in the TUI
stats_summary missing rollback count: AdaptiveLrController::stats_summary() now includes rollbacks=N in its output string

Improved

Rollback tests: Four new unit tests — test_rollback_triggered_on_divergence, test_early_stop_after_max_rollbacks, test_rollback_disabled_falls_through_to_divergence, test_should_snapshot_best_tracks_ema_improvement

Full Changelog: v0.3.2...v0.3.3

12 Mar 02:10

github-actions

v0.3.2

9edad5f

v0.3.2

[0.3.2] - 2026-03-11

Added

Adaptive learning rate controller: EMA-based z-score spike detection, patience-based plateau detection, and linear regression divergence detection — automatically adjusts LR multiplier during training to recover from loss spikes, reduce LR on plateaus, and halt on divergence
Manual LR override via TUI: Press L in Training, Distillation, or GRPO tabs to set a custom learning rate mid-run; uses atomic control file protocol ({output_dir}/.lr_control.json) for safe subprocess communication
WSD (Warmup-Stable-Decay) scheduler: New LrSchedulerType::Wsd with configurable stable_ratio — holds peak LR for a plateau phase before linear decay, popular for large-scale pretraining
GRPO adaptive LR + callbacks: GrpoTrainer now supports adaptive LR, TrainingCallback lifecycle events, and StepMetrics emission for live TUI monitoring
HuggingFace Hub search (pmetal search): CLI command and TUI integration (press S in Models tab) to search HF Hub for text-generation models with download counts, parameter estimates, and memory fit assessment
Memory fit estimation: New pmetal-hub module estimates inference/training memory requirements, tok/s throughput, and color-coded fit levels (green/yellow/red) based on device specs and model architecture
Model detail panel: Models tab shows memory breakdown — weights, KV cache, overhead, training estimate, and recommended batch size
Distillation metrics callbacks: DistillationTrainer now emits step-by-step metrics via TrainingCallback, enabling live TUI dashboard during distillation runs
Command logging in Jobs tab: Spawned commands are logged with the full CLI invocation for easier debugging
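
As an illustration of the WSD (Warmup-Stable-Decay) schedule listed above, here is a minimal sketch. It is not the `LrSchedulerType::Wsd` implementation: the function name and signature are hypothetical, and it assumes that `stable_ratio` is the fraction of post-warmup steps held at peak LR and that the decay runs linearly to zero.

```rust
/// Hypothetical WSD schedule: linear warmup, flat plateau at `peak_lr`,
/// then linear decay to zero at `total_steps`.
/// Assumption: `stable_ratio` is a fraction of the post-warmup steps.
fn wsd_lr(
    step: usize,
    total_steps: usize,
    warmup_steps: usize,
    stable_ratio: f32,
    peak_lr: f32,
) -> f32 {
    if step >= total_steps {
        return 0.0;
    }
    if step < warmup_steps {
        // Warmup: ramp linearly from 0 towards peak_lr.
        return peak_lr * step as f32 / warmup_steps as f32;
    }
    let post_warmup = total_steps - warmup_steps;
    let stable_steps = (post_warmup as f32 * stable_ratio.clamp(0.0, 1.0)) as usize;
    let stable_end = warmup_steps + stable_steps;
    if step < stable_end {
        // Stable phase: hold the peak learning rate.
        return peak_lr;
    }
    // Decay phase: linear from peak_lr at `stable_end` to 0 at `total_steps`.
    let decay_steps = total_steps - stable_end;
    peak_lr * (total_steps - step) as f32 / decay_steps as f32
}
```

For example, with 110 total steps, 10 warmup steps, and `stable_ratio = 0.5`, the LR ramps up over steps 0 to 9, stays at peak through step 59, and decays linearly over the final 50 steps.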

Fixed

NaN/Inf loss guard: Adaptive LR skips EMA updates on non-finite losses to prevent EMA poisoning — returns scheduled LR unchanged
EMA variance bias correction: Early-training z-scores now use bias-corrected variance (raw_var / (1 - alpha^n)), matching Adam's moment correction — prevents false spike detection in first ~20 steps
Zero-variance z-score fallback: When loss variance is near zero (std_dev < 1e-8), uses absolute deviation threshold instead of division-by-zero; returns z=10 for >50% deviation, z=0 otherwise
Atomic control file protocol: LR control file is renamed to .lr_control.claimed before reading and deleted after — prevents race conditions between TUI writer and training subprocess reader
Distillation metrics LR: Distillation step metrics now report post-adaptive LR instead of pre-adjustment scheduled LR
Adaptive LR in all training paths: apply_adaptive_lr() now called in run_metal_fused(), run_compiled(), run_jit_compiled(), and run_packed() paths (was only in run_standard())
TUI LR override validation: LR range check now accepts 1.0 (was exclusive upper bound); shows error modal on invalid input instead of silent log warning
Distillation/GRPO job routing: Status updates were always routed to the Training tab regardless of job type. Added active_job_type tracking to route metrics, completion, and failure to the correct tab (Distill, GRPO, or Training)
Distillation CLI args: TUI sent --lora-alpha and --log-metrics flags that the CLI didn't accept, causing immediate exit code 2. Added both args to the Distill command and --log-metrics to Grpo
Parquet dataset support in distill/GRPO: Distillation and GRPO commands only supported JSONL datasets. Now auto-detect .parquet files and route to the parquet loader, matching the training command's behavior
Tab click targeting: Mouse clicks on Monitor, Inference, and Jobs tabs selected the wrong tab due to hardcoded fixed-width hit-testing. Now computes actual tab widths from rendered text
Error diagnostics: Failed jobs now show the last 5 stderr lines in the tab status panel instead of just "Process exited with code N", with a hint to check the Jobs tab for full output
UTF-8 safe string truncation: truncate_str used byte indexing which panics on multi-byte characters; switched to chars() iterator
Leaked channel in HF search: search_hf() created a sender/receiver pair even without a CommandRunner, silently dropping results
Integer overflow in fit estimation: estimate_params_from_config used plain multiplication; switched to saturating_mul/saturating_add
Context length truncation: u64→u32 cast could wrap for extreme values; capped at 1M before cast
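
The three z-score fixes above (non-finite guard, bias-corrected variance, zero-variance fallback) fit together roughly as follows. This is a sketch under assumptions, not the controller's code: the function name is hypothetical, `alpha` is taken to be the EMA decay factor (the weight on the previous value, as with Adam's beta), and the ">50% deviation" check is interpreted as deviation relative to the EMA mean.

```rust
/// Hypothetical spike z-score with the three guards described above.
/// Returns None for non-finite losses so the caller can skip the EMA update.
fn spike_z_score(loss: f32, ema_mean: f32, raw_var: f32, alpha: f32, n: u32) -> Option<f32> {
    // NaN/Inf guard: never let a non-finite loss poison the statistics.
    if !loss.is_finite() {
        return None;
    }
    // Bias correction: raw_var / (1 - alpha^n), as in Adam's moment correction.
    let correction = 1.0 - alpha.powi(n as i32);
    let var = if correction > 0.0 { raw_var / correction } else { raw_var };
    let std_dev = var.max(0.0).sqrt();
    if std_dev < 1e-8 {
        // Zero-variance fallback: use a relative-deviation threshold
        // instead of dividing by (almost) zero.
        let deviation = if ema_mean != 0.0 {
            ((loss - ema_mean) / ema_mean).abs()
        } else {
            (loss - ema_mean).abs()
        };
        return Some(if deviation > 0.5 { 10.0 } else { 0.0 });
    }
    Some((loss - ema_mean) / std_dev)
}
```

Early in training `alpha^n` is close to 1, so the correction inflates the raw variance and keeps z-scores small, which is what prevents false spikes in the first steps.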

Improved

TUI tab ordering: System (formerly Device) is now the default first tab; Dashboard renamed to Monitor
Empty state messaging: Monitor tab shows actionable guidance ("Start a run from Training, Distill, or GRPO tab") instead of "Waiting for training data..."
Idle state hint: Tabs show "Press S to start" instead of "Press S to start training" (generic across all job types)

Security

Bounded API responses: bounded_json() caps HF API response bodies at 4MB to prevent heap exhaustion
Model ID validation: is_valid_model_id() rejects path traversal, URL injection, and malformed values in HF API paths
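
A validator in the spirit of the model ID check above might look like the sketch below. The rules shown (at most one `/`, a conservative character set, no leading dot or `..`, a length cap of 128) are assumptions chosen for illustration; the actual rules in `is_valid_model_id()` are not stated in these notes, and the function name here is deliberately different.

```rust
/// Hypothetical model ID check: accepts "name" or "org/name" built from
/// ASCII alphanumerics, '-', '_' and '.', and rejects anything that could
/// act as a path traversal or URL when interpolated into an API path.
fn is_plausible_model_id(id: &str) -> bool {
    if id.is_empty() || id.len() > 128 {
        return false;
    }
    let segments: Vec<&str> = id.split('/').collect();
    if segments.len() > 2 {
        return false;
    }
    segments.iter().all(|seg| {
        !seg.is_empty()
            && !seg.starts_with('.')
            && !seg.contains("..")
            && seg
                .chars()
                .all(|c| c.is_ascii_alphanumeric() || matches!(c, '-' | '_' | '.'))
    })
}
```

An allow-list of characters is used rather than a deny-list, so characters such as `?`, `:`, `#`, and `%` are rejected without having to enumerate them.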

Full Changelog: v0.3.1...v0.3.2

11 Mar 15:43

github-actions

v0.3.1

fc4a676

v0.3.1

[0.3.1] - 2026-03-11

Added

M5 / Apple10 device detection: GPU family Apple10 with architecture generation 17, NAX (Neural Accelerators in GPU) availability flag, and NAX-aware tile size tuning (M5 Max/Ultra get 128×64×32)
UltraFusion topology detection: sysctl hw.packages detects multi-die Ultra chips; is_ultra_fusion and die_count fields on DeviceProperties
GPU and ANE core count estimation: Per-chip core counts derived from device name and tier, with UltraFusion die multiplication
Memory bandwidth estimation: Tier + GPU family lookup table for estimated bandwidth (GB/s)
ANE performance stats API: evaluate_with_stats() on AneModel uses _ANEPerformanceStats with hwExecutionTime for nanosecond-precision hardware timing
TUI device tab enhancements: GPU core counts (with per-die breakdown for Ultra), ANE core counts, memory bandwidth, architecture generation, NAX and UltraFusion feature flags
crates/pmetal/README.md: Crate-level README with feature flags table, quick start examples, hardware support summary, and re-export reference

Fixed

AppleGPUFamily::Unknown ordering bug: Unknown was declared last in the enum, causing derived Ord to rank it above Apple10 — unknown GPUs incorrectly got has_dynamic_caching, has_nax, etc. set to true. Fixed by moving Unknown to first position
Future chip name collision: name.contains("M1") matched "M10"; replaced with has_chip_id() that checks the character after the match isn't a digit
Dead sysctl subprocess in query_memory_bandwidth: Spawned sysctl whose result was discarded; removed and renamed to estimate_memory_bandwidth() using tier-based lookup
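
Two of the fixes above are easy to reproduce in isolation. The sketch below shows why declaration order matters for a derived `Ord`, and one way to match a chip ID without colliding with longer numbers. The enum is a reduced stand-in for `AppleGPUFamily`, and the `has_chip_id` signature is an assumption; only the behaviour (the character after the match must not be a digit) comes from the release note.

```rust
/// Reduced stand-in for AppleGPUFamily. Derived Ord follows declaration
/// order, so Unknown must come first to compare below every real family.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum GpuFamily {
    Unknown,
    Apple9,
    Apple10,
}

/// True if `name` contains `chip` and the character right after some
/// occurrence is not an ASCII digit (so "M1" does not match "M10").
fn has_chip_id(name: &str, chip: &str) -> bool {
    name.match_indices(chip).any(|(idx, matched)| {
        let rest = &name[idx + matched.len()..];
        !rest.chars().next().map_or(false, |c| c.is_ascii_digit())
    })
}
```

With `Unknown` declared first, a feature gate such as `family >= GpuFamily::Apple9` evaluates to false for unrecognized GPUs, which is the behaviour the ordering fix restores.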

Improved

README updates: Root README now documents hardware support matrix (M1–M5), 9 TUI tabs (was 7), 16 crates (was 15), all fused Metal kernels (GDN, SwiGLU, RMSNorm+LoRA), ANE perf stats and M1–M5 compatibility
Hardware support docs: Complete M1–M5 chip matrix with arch gen, core counts, bandwidth, ANE TFLOPS measurements; NAX kernel integration roadmap; UltraFusion distributed roadmap

Full Changelog: v0.3.0...v0.3.1

11 Mar 03:27

github-actions

v0.3.0

1e65de7

v0.3.0

[0.3.0] - 2026-03-10

Added

TUI Control Center (pmetal tui): Full terminal interface with 9 tabs — Dashboard, Device, Models, Datasets, Training, Distillation, GRPO, Inference, Jobs. Async event loop with crossterm/ratatui, modal system (confirm, text input, model picker, dataset picker, error, progress), and reusable form field widgets
Live job integration: Training, distillation, and GRPO tabs spawn pmetal subprocesses and stream metrics in real time via CommandRunner + JSONL polling
LoRA fuse command (pmetal fuse): Merge LoRA adapter weights into base model, with optional fuse-then-quantize pipeline
Chat template support for Llama 4, DeepSeek, and Cohere: Full template formatting, Jinja detection, model name heuristics, stop tokens, and inference formatting for all three model families
Llama 4 template: <|header_start|>/<|header_end|>/<|eot|> tokens (distinct from Llama 3's <|start_header_id|>/<|end_header_id|>/<|eot_id|>)
DeepSeek template: Full-width unicode tokens (<｜begin▁of▁sentence｜>, <｜User｜>, <｜Assistant｜>) with thinking mode support (<think>/</think> prefill)
Cohere Command R template: <|START_OF_TURN_TOKEN|>, <|USER_TOKEN|>, <|CHATBOT_TOKEN|>, <|END_OF_TURN_TOKEN|> tokens
Comprehensive stop token collection: collect_all_stop_tokens() now probes 11 well-known special tokens across all model families (added <|eot|>, <|end|>, <|return|>, <|END_OF_TURN_TOKEN|>, <｜end▁of▁sentence｜>)
LoRA inference auto-chat detection: Probes vocabulary for <|im_end|>/<|eot_id|> to auto-enable chat mode on base models fine-tuned with LoRA
Streaming generation support: GenerationConfig streaming extensions in pmetal-models
Epoch/total_steps in StepMetrics: Training progress now flows through entire pipeline (training loop → JSONL callback → TUI) showing step X/Y and epoch M/N
Hardware support documentation: Apple Silicon hardware matrix and tuning reference (docs/hardware-support.md)

Fixed

TUI inference word wrap: Model output now wraps correctly within the terminal width instead of clipping off-screen; normalize_code_fences() preprocessor ensures ``` markers always appear on their own line even when the model emits text without newlines
TUI inference code block rendering: Fenced code blocks (```python, etc.) now render properly with distinct styling even when the token stream lacks explicit newline characters
TUI UTF-8 safe text handling: Word wrap and code block truncation now use char-count width instead of byte length, preventing panics on multi-byte characters
GRPO accuracy reward — last-occurrence extraction: AccuracyReward now uses rfind() for <answer> tags and \boxed{}, correctly grabbing the final answer when the model retries within chain-of-thought
GRPO accuracy reward — broken fallback: Old code compared the entire completion (including reasoning) against the answer when no <answer> tags were found; now falls back to last non-empty line
GRPO accuracy reward — whitespace normalization: Answer comparison now collapses internal whitespace runs to single space, preventing false negatives from formatting differences
LoRA inference stop tokens: run_inference_with_lora now uses full chat template + comprehensive stop token collection instead of just tokenizer EOS — fixes infinite generation on chat-finetuned models
LoRA inference missing parameters: All sampling parameters (top_k, top_p, min_p, penalties, seed) now passed through to LoRA inference path
Llama 4 misdetection: Model name heuristic now correctly routes llama-4/llama4 to Llama 4 template (was incorrectly using Llama 3 tokens)

Added

GRPO \boxed{} answer extraction: AccuracyReward now extracts answers from LaTeX \boxed{...} expressions with brace-depth tracking, standard for math GRPO (DeepSeek-R1 style)
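
Last-occurrence \boxed{} extraction with brace-depth tracking, together with the whitespace normalization mentioned in the fixes above, can be sketched as follows. The function names are hypothetical and this is not the `AccuracyReward` code; it only illustrates the `rfind()` plus depth-counter approach the notes describe.

```rust
/// Return the contents of the last `\boxed{...}` in `text`, tracking brace
/// depth so nested braces such as `\frac{1}{2}` are kept intact.
/// Returns None if there is no marker or the braces never balance.
fn extract_last_boxed(text: &str) -> Option<&str> {
    const MARKER: &str = "\\boxed{";
    let start = text.rfind(MARKER)? + MARKER.len();
    let mut depth = 1usize;
    for (i, c) in text[start..].char_indices() {
        match c {
            '{' => depth += 1,
            '}' => {
                depth -= 1;
                if depth == 0 {
                    return Some(&text[start..start + i]);
                }
            }
            _ => {}
        }
    }
    None
}

/// Collapse runs of whitespace to single spaces and trim both ends.
fn normalize_ws(s: &str) -> String {
    s.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

Searching from the end means that when a model writes an early guess and then revises it inside its chain of thought, only the final boxed answer is compared against the reference.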

Improved

TUI replaces legacy dashboard: pmetal tui provides full control center; legacy pmetal dashboard retained for simple metrics monitoring
Chat template Jinja detection: Ordered detection ensures DeepSeek (full-width unicode), Cohere, Llama 4 are matched before generic patterns
EOS token stripping: strip_eos_tokens() now handles all model-family EOS tokens

Full Changelog: v0.2.1...v0.3.0
