Releases: Epistates/pmetal

PMetal v0.5.0

08 May 11:38

[0.5.0] - 2026-05-07

Added

Distributed inference & training

  • pmetal-distributed crate (Phases 1-4 + 7): Thunderbolt-fabric-aware multi-Mac cluster runtime with feature-gated tensor, expert, context, ZeRO, and pipeline parallelism modules
    • pmetal cluster CLI: per-node launch, ring/mesh topology discovery, fabric handshake
    • Pipeline harness with overlap of computation and Thunderbolt transfers
    • Canonical expert-rank mapping + per-architecture MoE/MLA tensor-parallel plans
    • Ring all-reduce / all-gather with corrected chunk indexing
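
The chunk indexing behind a ring all-reduce can be sketched as a small single-process simulation. This is an illustration of the textbook algorithm (reduce-scatter followed by all-gather), not the pmetal-distributed implementation; the function name and the list-of-lists representation are assumptions made for the example.

```python
def ring_all_reduce(buffers):
    """Simulate a ring all-reduce (sum) over equally sized per-rank lists."""
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "sketch assumes the length divides evenly by rank count"
    chunk = size // n
    data = [list(b) for b in buffers]

    def span(c):
        # Element indices covered by chunk c.
        return range(c * chunk, (c + 1) * chunk)

    # Reduce-scatter: at step s, rank r sends chunk (r - s) mod n to rank r + 1.
    for s in range(n - 1):
        sent = [[data[r][i] for i in span((r - s) % n)] for r in range(n)]
        for r in range(n):
            src = (r - 1) % n
            for k, i in enumerate(span((src - s) % n)):
                data[r][i] += sent[src][k]

    # Rank r now holds the full sum for chunk (r + 1) mod n.
    # All-gather: at step s, rank r forwards chunk (r + 1 - s) mod n.
    for s in range(n - 1):
        sent = [[data[r][i] for i in span((r + 1 - s) % n)] for r in range(n)]
        for r in range(n):
            src = (r - 1) % n
            for k, i in enumerate(span((src + 1 - s) % n)):
                data[r][i] = sent[src][k]
    return data
```

Each rank sends and receives exactly one chunk per step, so an off-by-one in either modular index silently mixes partial sums from different chunks.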

TurboQuant KV cache (production-ready)

  • TurboQuant KV cache quantization: Provably near-optimal KV cache compression based on random rotation + Lloyd-Max scalar quantization + QJL residual for unbiased inner products (arXiv:2504.19874). Achieves 4-6x KV cache compression with near-zero quality loss. Available via --kv-turboquant or presets --kv-turboquant-preset q3_5 (near-lossless) / q2_5 (6.4x compression)

    • Separate key/value runtimes with independent bit widths and outlier-aware mixed-precision
    • Direct attention path for single-token decode avoids full cache dequantization
    • Data-oblivious (no calibration data required) — quantizes KV entries online as generated
    • Precomputed codebooks via Lloyd-Max algorithm for Beta distribution (deterministic from seed)
    • Metal kernel backend with CPU fallback
    • Phase 0: split monolithic mod.rs (6101 → 222 LOC) into config/core/state/bits/math submodules
    • Phase 3: GPU-resident hot/cold pipeline + Mixed K/V storage; mixed_score as layout oracle
    • Phase A/B: QJL ablation harness (feature-gated) + per-row key_slot_scale codebook adaptation
    • Phase C/C′: Variant F drop-QJL opt-in path; d128/d256 no_qjl_2pass fast paths (4..=8 bits)
    • Phase D: TurboQuantPackMode config + Fullbyte dense-values kernel
    • Phase E: TurboQuantOutlierMode — encode-side top-K outlier storage, zero pre-quant + decode override, outlier-bias on d128/d256 fullbyte score kernel; CPU mirror in scalar encode/decode
    • Phase F: Hamming skip-list dispatch — skiplist_threshold config, GPU sign_hash buffer, Metal Hamming-distances kernel + FFI, GQA support
    • Mixed-precision attention parity baseline; defensive residual-norm clamp + NaN-safe encode
  • Asymmetric K/V head dimensions: KV cache, TurboQuant, and fused attention now support models where key and value projections have different widths (e.g. DeepSeek MLA with qk_head_dim != v_head_dim)

  • pmetal serve --kv-turboquant: TurboQuant KV cache in the serving engine with --kv-turboquant-preset q3_5 for near-lossless 4.6x KV compression in production
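
The precomputed-codebook idea can be illustrated with a one-dimensional Lloyd-Max iteration. The sketch below is not TurboQuant itself: it approximates the post-rotation coordinate distribution by sampling coordinates of random unit vectors, omits the QJL residual, and uses hypothetical function names throughout. It only shows how a scalar codebook can be derived deterministically from a seed.

```python
import math
import random

def sample_rotated_coords(dim, count, seed):
    """First coordinate of `count` random unit vectors in `dim` dimensions."""
    rng = random.Random(seed)
    out = []
    for _ in range(count):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        out.append(v[0] / norm)
    return out

def lloyd_max(samples, bits, iters=50):
    """Alternate nearest-centroid assignment and centroid update in 1-D."""
    levels = 1 << bits
    xs = sorted(samples)
    # Initialise centroids at evenly spaced sample quantiles.
    codebook = [xs[(2 * i + 1) * len(xs) // (2 * levels)] for i in range(levels)]
    for _ in range(iters):
        # Decision boundaries are midpoints between adjacent centroids.
        bounds = [(a + b) / 2 for a, b in zip(codebook, codebook[1:])]
        sums = [0.0] * levels
        counts = [0] * levels
        j = 0
        for x in xs:  # xs is sorted, so the cell index only moves forward
            while j < levels - 1 and x > bounds[j]:
                j += 1
            sums[j] += x
            counts[j] += 1
        codebook = [sums[i] / counts[i] if counts[i] else codebook[i]
                    for i in range(levels)]
    return codebook

def quantize(x, codebook):
    """Index of the nearest codebook entry."""
    return min(range(len(codebook)), key=lambda i: abs(x - codebook[i]))

def mse(samples, codebook):
    """Mean squared reconstruction error over the samples."""
    return sum((x - codebook[quantize(x, codebook)]) ** 2 for x in samples) / len(samples)
```

Because the codebook depends only on the dimension, bit width, and seed, it can be computed once and reused for every KV entry without calibration data.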

Quantization & model formats

  • Optimized FP8 checkpoint loading: Hugging Face FP8 weight_scale_inv sidecars are dequantized or repacked into MLX mxfp8 weights for Qwen3-family native paths; mode-aware quantized matmul plumbing handles floating-point quantized weights without dense fallback
  • Expanded GGUF quantization/export: pmetal quantize now writes standard GGUF metadata from Hugging Face configs, tokenizer/pre-tokenizer metadata, HF-to-GGUF tensor names, stacked MoE expert tensors, and method-specific file types
  • Broader GGUF format coverage: quantization/dequantization support now includes K-quants, legacy Q4/Q5/Q8 variants, Q1_0, TQ1_0/TQ2_0, MXFP4, NVFP4, BF16, F16, and F32 round trips
  • MLX safetensors quantization path: quality-based bit allocation with --target-bpw, GPU-resident weight loading, and tokenizer/config sidecar copy for MLX-format quantized exports
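
As an illustration of what a quantization round trip involves, the sketch below follows the block layout commonly documented for GGML's Q8_0 (32 values per block, one scale per block, signed 8-bit codes). It is an assumption-laden sketch rather than pmetal's code: the scale is kept as a Python float instead of the fp16 value stored on disk, and the function names are hypothetical.

```python
BLOCK = 32  # Q8_0 groups weights into blocks of 32 values

def q8_0_quantize(values):
    """Return a list of (scale, int codes) pairs, one per 32-value block."""
    assert len(values) % BLOCK == 0, "sketch assumes whole blocks"
    blocks = []
    for start in range(0, len(values), BLOCK):
        chunk = values[start:start + BLOCK]
        amax = max(abs(v) for v in chunk)
        scale = amax / 127.0
        inv = 1.0 / scale if scale else 0.0  # all-zero block: codes stay 0
        codes = [round(v * inv) for v in chunk]
        blocks.append((scale, codes))
    return blocks

def q8_0_dequantize(blocks):
    """Reconstruct floats as scale * code for every element."""
    return [scale * q for scale, codes in blocks for q in codes]
```

The per-element reconstruction error is bounded by half the block scale, which is why blocks with a single large outlier lose precision on their small values.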

Inference server (OpenAI- + Anthropic-compatible)

  • Continuous batching with paged-KV-style admission + shared prefix cache in pmetal-serve: per-request slot scheduling, KV-cache prefix sharing, concurrent decode for many simultaneous chats
    • Token-block admission budget (--cb-block-size, --cb-max-blocks) prevents over-admitting active contexts and skips head-of-line requests when a smaller queued request fits the remaining block budget
    • Continuous batching now reuses the shared prompt prefix cache, prefills only uncached suffix tokens, and saves extended prefixes after final prefill
    • Continuous batching derives the same cache mode as the single-request serving path, honoring --kv-quant and --kv-turboquant
    • Hybrid/recurrent models are rejected from continuous batching instead of silently running without recurrent state
  • Anthropic-compatible /v1/messages endpoint: streaming message_start → content_block_start → content_block_delta* → content_block_stop → message_delta → message_stop events; non-streaming JSON path
  • /v1/embeddings endpoint: 17 architectures supported via forward_hidden (Llama/Llama4/Qwen2/Qwen3/Qwen3MoE/Qwen3Next/Mistral/Gemma/Gemma4/Phi/Phi4/DeepSeek/Cohere/Granite/GptOss/NemotronH/BERT) — pooling via pmetal_models::pooling
  • Token logprobs: SamplingParams.logprobs_top_n plumbed end-to-end through non-streaming and SSE streaming on both /v1/chat/completions and /v1/completions. New pmetal_models::generation::token_logprobs primitive; ANE/CPU paths emit logprob: None
  • Best-effort tool calling on /v1/chat/completions: try_parse_tool_calls accepts {name, arguments} or {tool_calls: [...]}. ChatCompletionRequest.tools gates the attempt; chat templating threads tool defs into the rendered prompt
  • IncrementalDecoder<Aux> SSE buffer: shared UTF-8 boundary buffer + per-token aux pipelining (used for logprobs alignment) across chat/completions/anthropic streams
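
The token-block admission budget described for continuous batching can be sketched as follows. The parameter names mirror the --cb-block-size and --cb-max-blocks flags, but the Request type, the admit function, and the policy details are assumptions for illustration, not the pmetal-serve scheduler.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    name: str
    tokens: int  # prompt plus reserved generation tokens (assumed accounting)

def blocks_needed(tokens, block_size):
    """Ceiling division: number of fixed-size token blocks a request occupies."""
    return -(-tokens // block_size)

def admit(queue, block_size, max_blocks, used_blocks=0):
    """Admit queued requests in order, skipping any that exceed the free budget."""
    admitted, waiting = [], deque()
    free = max_blocks - used_blocks
    for req in queue:
        need = blocks_needed(req.tokens, block_size)
        if need <= free:
            admitted.append(req)
            free -= need
        else:
            # Too large for the remaining budget: leave it queued and keep
            # scanning so that a smaller request behind it can still run.
            waiting.append(req)
    return admitted, waiting, free
```

A production scheduler would also need some fairness rule so that a large request is not starved indefinitely by a stream of small ones; that is outside this sketch.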

Job orchestration substrate (TUI / GUI / MCP / CLI parity)

  • JobSpec substrate: 16 canonical spec types in pmetal-core (Train, Distill, GRPO, Bench, Eval, Pretrain, Tokenize, Serve, Generate, RLKD, EmbedTrain, DFlash, Memory, Ollama, …) with #[derive(JobSpec)] proc-macro
  • JobEvent canonical streaming protocol: progress / metric / log / artefact / complete / failed events emitted by all 4 surfaces (CLI, TUI, GUI, MCP)
  • CLI: 8 specced Commands variants flattened — 613 LOC removed from main.rs; cli/<sub>.rs Args structs and JobSpec argv round-trip tests; --log-events flag stub
  • TUI: 14 tabs with full CLI parity, ?-key help overlay, Ctrl+1..9 tab jump, active-job footer badge, descriptor-driven forms with shared FormTabState primitive; channel-based metrics streaming (ChannelMetricsCallback) for direct-path train/distill/grpo/bench/eval/pretrain
  • GUI (Tauri): complete 9-DTO frontend-lockstep migration to *Spec types; Serve, Bench, Eval, Jobs, Pretrain pages; embed-train + rlkd + ollama routes; channel-based metrics streaming
  • MCP: 51-tool server with migrated train/pretrain/tokenize/memory/dflash/generate coverage, allowlisted CLI passthrough tools for newly added CLI flags, and a JobEvent JSONL consumer for managed background jobs

SOTA distillation (pmetal-distill)

  • Universal Logit Distillation (ULD) — Wasserstein-1 over sorted logit distributions for cross-tokenizer KD (Boizard et al. 2024); optional top_k truncation; permutation-invariant by design
  • Generalized Knowledge Distillation (GKD) — λ-weighted off-policy + on-policy KL blend (Agarwal et al. 2024); OnPolicySampler trait with GreedySampler reference impl; compute_full(t_off, s_off, t_on, s_on, T)
  • MiniLLM — reverse-KL with optional teacher-mix target = mix·T + (1-mix)·S (Gu et al. 2024)
  • Skewed JSD (DistiLLM-2): α·KL(T||M_α) + (1-α)·KL(S||M_α) with M_α = α·T + (1-α)·S, log-sum-exp computation; α=0.5 reduces to standard symmetric JSD (Ko et al. 2024)
  • Attention-transfer loss + weighted Metal path for hidden-state distillation
  • Offline teacher-logit caching: pmetal distill --offline-cache <path> precomputes teacher logits to disk; new Int8PerToken compressed-block variant replaces NaN-sentinel scheme with explicit per_token_meta field (legacy Int8 variant retained for read-back)
  • DistillLossOutput.metrics: HashMap<&'static str, f32>: lazily-evaluated teacher_entropy, student_entropy, kl_per_token, top1_agreement exposed to trainer JSONL/TUI streaming
  • TAID difficulty-aware observability: alpha_var surfaced for per-step monitoring
  • Configurable ignore_index: PyTorch-standard -100 default on TrainingConfig; safe label clamping before gather
  • Hidden-state shape assertions before matmul (clear error vs. silent broadcast bug)
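
The skewed JSD formula from the list above can be written out directly. This sketch works in probability space for readability, whereas the changelog describes a log-sum-exp computation; the function names are hypothetical and the code assumes 0 < α < 1.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def skewed_jsd(teacher, student, alpha):
    """alpha * KL(T || M) + (1 - alpha) * KL(S || M), M = alpha*T + (1-alpha)*S."""
    mix = [alpha * t + (1 - alpha) * s for t, s in zip(teacher, student)]
    return alpha * kl(teacher, mix) + (1 - alpha) * kl(student, mix)
```

With α = 0.5 the mixture is the plain average of the two distributions, so the expression is symmetric in teacher and student and bounded by ln 2, matching the standard Jensen-Shannon divergence.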

SOTA model merging (pmetal-merge)

  • Fisher merging (Matena & Raffel 2022): diagonal-Fisher-weighted average θ = Σ F_i⊙θ_i / (Σ F_i + ε); lazy-loaded Fisher safetensors; fallback_to_mean for tensors without Fisher entries
  • RegMean (Jin et al. 2023): closed-form linear-layer merge W = (Σ G_i)⁻¹ · (Σ G_i W_i) via hand-rolled Gauss-Jordan pseudo_inverse_2d with Tikhonov ridge; falls back to mean for non-2D weights
  • MoE expert permutation alignment: per-(model, layer) Hungarian solver (Jonker-Volgenant style, O(N³)) over L2-normalized cosine similarity of expert fingerprints; tensor-name remapping experts.{i}. → experts.{π(i)}. before merge; gated by align_moe_experts
  • Honor config.dtype in save path: MergeBuilder.dtype builder, TensorWriter::with_dtype plumbing, per-dtype byte packing for F16/BF16/F32; previously hardcoded to F16
  • Cross-model dtype consistency check: verify_source_dtypes errors on mismatch unless allow_mixed_dtype is set
  • Tied-embedding detection: lm_head.weight and embed_tokens.weight aliasing detected and merged once under canonical name
  • Tokenizer + config sidecar copy: tokenizer.json, tokenizer_config.json, special_tokens_map.json, config.json, generation_config.json copied on full-model merge; config.json.torch_dtype patched to match output dtype
  • Post-merge sanity sweep (`Sa...
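
The diagonal-Fisher-weighted average from the Fisher merging entry, θ = Σ F_i⊙θ_i / (Σ F_i + ε), can be sketched on flat parameter lists. The function name and the plain-list tensors are assumptions for illustration; the real path operates on safetensors.

```python
def fisher_merge(thetas, fishers, eps=1e-8):
    """Element-wise Fisher-weighted average of several parameter vectors.

    thetas[i][j]  : parameter j of model i
    fishers[i][j] : diagonal Fisher estimate for that parameter
    """
    merged = []
    for j in range(len(thetas[0])):
        num = sum(f[j] * t[j] for f, t in zip(fishers, thetas))
        den = sum(f[j] for f in fishers) + eps  # eps guards all-zero Fisher
        merged.append(num / den)
    return merged
```

When every model reports the same Fisher value for a parameter, the result reduces to the ordinary mean (up to ε), which is also the natural fallback for tensors without Fisher entries.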

PMetal v0.4.0

24 Mar 04:05

[0.4.0] - 2026-03-23

Added

  • pmetal-mcp crate: Full MCP (Model Context Protocol) server exposing 45 tools for Claude Desktop and other MCP clients. Covers all pmetal functionality — training, inference, distillation, GRPO, RLKD, quantization, model merging, dataset operations, evaluation, benchmarking, model search, and Ollama export

    • Device & models: device_info, search_models, download_model, list_local_models, model_fit, model_info
    • Inference: generate (blocking), chat (via running serve instance), start_serve, benchmark, bench_train, bench_gen, bench_corpus
    • Training: train, distill, grpo, rlkd, embed_train — all as background jobs with full parameter coverage matching the CLI
    • Runtime training control: job_set_lr, job_reduce_lr, job_reset_lr, job_save_checkpoint, job_graceful_stop — LLM-driven adaptive training via the control file protocol
    • Job management: list_jobs, job_status, job_logs, stop_job
    • Dataset ops: dataset_analyze, dataset_preview, dataset_validate, dataset_download, dataset_convert, dataset_filter, dataset_split, dataset_merge, dataset_sample, dataset_template, dataset_prepare
    • Quantization & conversion: quantize, fuse_lora, merge_models, pack_experts, ollama_create, ollama_modelfile
    • Evaluation: eval_perplexity
    • All tools include rich #[description] annotations for parameter documentation in the MCP schema
    • Standalone binary (pmetal-mcp) for Claude Desktop + pmetal mcp subcommand (behind mcp feature flag)
    • Uses turbomcp v3.0.7 from crates.io
  • Runtime training control protocol: Extended the control file protocol (.lr_control.json) with SaveCheckpoint and GracefulStop commands. The adaptive LR controller now polls the control file before checking its enabled flag, so external agents (MCP, TUI) can always send commands regardless of whether automatic detection is active

  • --no-adaptive-lr flag: Disables automatic spike/plateau/divergence detection while keeping the control file protocol active. Enables fully LLM-driven learning rate control — the agent observes loss via job_status and manually adjusts LR via job_set_lr/job_reduce_lr

  • UltraFusion execution planner (pmetal-distributed): Per-die stage planner for M-series Ultra Macs with in-memory channel transport backend for same-process links, avoiding TCP overhead on UltraFusion interconnect

  • MPP FlashAttention for head_dim 64/96: Metal 4 MPP flash attention kernel now supports head_dim 64, 96, and 128 with stride-2/stride-3 SIMD lane packing and causal/non-causal variants

  • Tuna persistent disk cache: The auto-tuner now persists benchmark results to disk, avoiding re-tuning on restart. Expanded search covers FlashAttention, FusedCrossEntropy, FusedNormLora, and FusedSwiGLU via function constants

  • MoE GPU top-k selection: Expert top-k selection moved from CPU sort to GPU argpartition_axis, eliminating a sync point in the MoE forward path

  • bench-workload CLI command: Benchmark a real cached workload for inference and short LoRA training with named presets (--preset dense-qwen3, --preset hybrid-qwen3next)

  • KV cache quantization auto-select: --kv-quant is now optional — omitting it auto-selects the fastest quantization mode that fits the device memory budget

  • UltraFusion info display: pmetal info shows UltraFusion topology, die count, and local executor plan on Ultra Macs

  • Qwen3 LoRA RoPE reset: Qwen3 LoRA and QLoRA gain dense attention and RoPE reset support

  • ANE real-time evaluation: Experimental _ANEClient real-time dispatch with automatic fallback to standard evaluation on failure. Propagated via --ane-real-time CLI flag

  • bench-corpus CLI command: Structured kernel benchmarking with device-tier-aware test cases, JSON reporting, and --quick/--output flags

  • GPU memory bandwidth probing: Real GPU copy benchmark replaces static spec-table lookup, with disk-cached results and spec-table fallback

  • Persistent runtime kernel backend selection: Benchmark-and-persist infrastructure races MLX vs MPP backends on Apple10/M5, validates numerical agreement, and caches the winner to disk for 4-bit quantized linear, fused attention, and LoRA matmul

  • MPP kernel tile variants: Metal 4 GEMM supports parameterized tile variants (32x32, 64x32, 32x64, 64x64) with Tuna auto-tuner selection per device and problem shape

  • Serve ANE/CPU-hybrid engine caching: Serve engine auto-selects optimal backend (ANE, CPU-hybrid, GPU) at startup with permanent downgrade on failure. Compiled engines cached across requests

  • Rollback enabled by default for LoRA: Best-loss checkpoint rollback now defaults to on with extended warmup grace period. Persistent snapshot to disk via atomic write. for_lora() factory for recommended defaults

  • Extended StepMetrics: gpu_fwd_bwd_ms, optimizer_ms, io_staging_ms, overhead_ms fields for fine-grained training profiling

  • Zero-copy MoE expert dispatch: ExpertBufferPool with read_experts_aligned + encode_expert_aligned for pread-to-Metal expert weight dispatch. Auto-enable KV-Q8 when memory-constrained

  • ANE dual-die support: On UltraFusion chips, compile variant-B kernel set with distinct MIL hashes and alternate per step for dual-die thermal distribution. Auto-recompile on throughput degradation (>15% or >25K dispatches)

  • Batched parameter eval: Model dispatcher evaluates parameters in batches of 128 tensors per sync instead of all-at-once, reducing peak memory during model loading

  • Architecture enhancements: DeepSeek V3/V3.2, GPT-OSS, Jamba, Llama 4, Qwen3, and Qwen3-MoE model improvements and weight sanitization refinements

  • Third-party attribution: Complete THIRD_PARTY_NOTICES with entries for mlx-lm, llama.cpp/GGML, Candle, and Burn
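
The ordering described for the runtime training control protocol (poll the control file first, consult the enabled flag second) can be sketched as below. Everything concrete here is hypothetical: the JSON field names, the "SetLr" command string, the consume-and-delete behaviour, and the class itself are assumptions, since the changelog does not give the schema of .lr_control.json.

```python
import json
import os

class AdaptiveLrController:
    """Hypothetical controller; schema and names are illustrative only."""

    def __init__(self, lr, control_path, enabled=True):
        self.lr = lr
        self.control_path = control_path
        self.enabled = enabled
        self.stop_requested = False

    def _poll_control_file(self):
        # Consume the control file if present (assumed: one command per file).
        if not os.path.exists(self.control_path):
            return None
        with open(self.control_path) as fh:
            command = json.load(fh)
        os.remove(self.control_path)
        return command

    def on_step(self, loss):
        # 1. External commands are honoured even when detection is disabled.
        command = self._poll_control_file()
        if command is not None:
            if command["command"] == "SetLr":
                self.lr = command["lr"]
            elif command["command"] == "GracefulStop":
                self.stop_requested = True
        # 2. Only now consult the enabled flag for automatic detection.
        if not self.enabled:
            return
        if loss != loss:  # NaN check standing in for divergence detection
            self.lr *= 0.5
```

Checking the flag before polling would make a --no-adaptive-lr run deaf to external agents, which is the behaviour the changelog says was avoided.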

Changed

  • ANE is now opt-in: The --no-ane flag has been replaced with --ane across CLI, TUI, orchestrator, and MCP. ANE training is experimental and limited to small models, so it defaults to off. The orchestrator's DispatchConfig now sets ane: false by default
  • Gradient checkpointing support corrected: Qwen3 and Qwen3Next no longer claim gradient checkpointing support (was incorrectly advertised)
  • Training loop refactored: Gradient checkpointing helper extracted, step logging tracks step numbers correctly, training loop tests expanded

Removed

  • Merge methods: Removed merge methods with incompatible licenses. Cleaned up related references across documentation and configuration

Fixed

  • MetalSampler use-after-free: Retained source logits array until GPU completion in serve engine
  • Fused merge Tuna cache: Now uses persistent disk cache instead of ephemeral per-session tuning

Downloads

Asset                                   Description
pmetal-*-aarch64-apple-darwin.tar.gz    CLI binary + mlx.metallib (Apple Silicon)
PMetal-*-aarch64-apple-darwin-*.dmg     Desktop GUI app (Apple Silicon)
mlx.metallib                            MLX Metal shader library (standalone)

CLI Quick Start

tar xzf pmetal-*-aarch64-apple-darwin.tar.gz
./pmetal train --model Qwen/Qwen3-0.6B --dataset train.jsonl --output ./output

GUI

Mount the DMG and drag PMetal to Applications.

Full Changelog: v0.3.13...v0.4.0

PMetal v0.3.7

16 Mar 18:52

[0.3.7] - 2026-03-16

Added

  • pmetal merge CLI command: Model merging exposed as a first-class CLI command supporting all merge methods (Linear, SLERP, TIES, DARE, DELLA, NearSwap, Model Stock) with --method, --base, --t, --weight-a, --weight-b, --density, and --dtype flags
  • pmetal eval CLI command: Dataset evaluation command — measures loss/perplexity over a validation set with optional LoRA adapter, --num-samples cap, and --json output
  • pmetal info CLI command: Prints device and runtime information; --json flag emits structured JSON for scripting
  • pmetal search --json output: Structured JSON output mode for search results including fit estimates, download counts, parameter estimates, and tags — enables scripting and GUI integration
  • QuantizeMethod enum: Replaces the string --method argument for pmetal quantize with a typed enum (dynamic, q8_0, q4_k_m, etc.) — invalid methods now fail at argument parsing rather than deep inside the quantizer
  • GRPO CLI arguments: --epochs, --lora-r, --lora-alpha, --max-completion-length, and --seed exposed as CLI arguments, replacing previous hardcoded defaults
  • loraplus_lr_ratio and neftune_noise_alpha: New fields on training loop configurations — enables LoRA+ differential learning rates and NEFTune noise injection directly from config
  • trainable_params() helper: New utility in pmetal-lora for counting total vs. trainable parameter counts, useful for logging and memory estimation
  • lora_alpha: f32: Distillation CLI and run_distillation_cli now accept lora_alpha as f32 instead of usize for finer-grained scaling control
  • seed parameter in distillation and GRPO CLI: Reproducible runs via explicit --seed flag in all training entry points
  • Gemma3 sliding window auto-detection: DynamicModel loader now reads model_type == "gemma3" and sets is_gemma3 = true on the config, enabling the correct every-6th-layer global attention pattern without manual config overrides
  • KV cache support for more architectures: DynamicModel::forward_with_cache now routes DeepSeek, Cohere, StarCoder2, and Llama4 to their native caching paths; RecurrentGemma and Jamba now get clear error messages that they require forward() directly; hybrid models (NemotronH, Qwen3Next) get a descriptive error directing to forward_with_hybrid_cache
  • Speculative decoding greedy path: SpeculativeDecoder::verify_greedy() — exact-correct verification for temperature=0 decoding using argmax equality; avoids the numerically unstable rejection-sampling limit as temperature→0
  • Hub cache management (pmetal-hub): New cache.rs module with cache inspection, eviction, and size-reporting helpers
  • Shared model utilities (pmetal-models/utils.rs): Common helpers extracted from per-architecture modules to reduce duplication
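
Greedy verification for speculative decoding, as described in the SpeculativeDecoder::verify_greedy() entry, can be sketched with plain lists. This is the generic argmax-equality scheme under assumed inputs, not the crate's actual signature: target_logits is taken to hold one row per draft position plus one extra row for the bonus token.

```python
def argmax(xs):
    """Index of the largest element (first one on ties)."""
    return max(range(len(xs)), key=xs.__getitem__)

def verify_greedy(draft_tokens, target_logits):
    """Accept draft tokens while they match the target model's argmax.

    Returns the tokens to emit: the accepted prefix followed by either the
    target's correction at the first mismatch or a bonus token if all match.
    """
    assert len(target_logits) == len(draft_tokens) + 1
    emitted = []
    for tok, logits in zip(draft_tokens, target_logits):
        best = argmax(logits)
        if best != tok:
            emitted.append(best)  # target disagrees: emit its token and stop
            return emitted
        emitted.append(tok)
    emitted.append(argmax(target_logits[-1]))  # every draft accepted
    return emitted
```

Because only argmax comparisons are involved, the output is identical to what plain greedy decoding with the target model would produce, with no division by a vanishing temperature.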

Fixed

  • Scale factor broadcasting in distillation: squeeze applied to the scale factor dimension so it broadcasts correctly across batch and sequence axes — previously caused shape mismatches on non-unit batch sizes
  • TAID mean_alpha forcing GPU sync: TaidLossOutput::mean_alpha changed from f32 to a lazy Array — the .eval() call is deferred until callers explicitly call .item::<f32>(), removing a forced GPU-CPU sync before the backward pass
  • SLERP numerical stability: Added epsilon clamping in the SLERP merge path to prevent NaN when interpolation parameter is at the boundary values (0.0 or 1.0)
  • Llama LoRA trainable_params / gradient application: Replaced 100+ lines of repeated field accesses with an insert_adapter! macro and loop over projection names, fixing DoRA magnitude parameter that was silently dropped from gradient maps
  • GaLore improvements: Corrected projection matrix update schedule and subspace dimensionality handling
  • Distillation hidden-state loss: Refactored alignment computation to correctly handle variable-rank teacher/student hidden state tensors
  • Jensen-Shannon / KL divergence loss: Numerical stability improvements — log-sum-exp stabilization applied consistently across all reduction paths
  • Offline distillation: Fixed logit cache loading to handle both single-file and sharded cache layouts
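
A SLERP with epsilon guards, in the spirit of the numerical stability fix above, might look like the following. The specific guards (clamping the cosine to [-1, 1] and falling back to linear interpolation when sin θ is tiny) are common practice and an assumption here; the changelog does not say exactly where pmetal clamps.

```python
import math

def slerp(a, b, t, eps=1e-6):
    """Spherical linear interpolation between vectors a and b at fraction t."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    cos_theta = sum(x * y for x, y in zip(a, b)) / max(na * nb, eps)
    # Rounding can push the cosine slightly outside [-1, 1], where acos fails.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    theta = math.acos(cos_theta)
    sin_theta = math.sin(theta)
    if sin_theta < eps:
        # Nearly colinear inputs: dividing by sin(theta) would blow up,
        # so fall back to ordinary linear interpolation.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    wa = math.sin((1 - t) * theta) / sin_theta
    wb = math.sin(t * theta) / sin_theta
    return [wa * x + wb * y for x, y in zip(a, b)]
```

At t = 0 and t = 1 the weights collapse to (1, 0) and (0, 1), so the endpoints are returned unchanged as long as the division is guarded.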

Changed

  • lm_groups.rs / LoRA+ optimizer groups: build_lora_param_groups significantly reworked — LoRA+ differential LR ratio (loraplus_lr_ratio) applied to lora_b parameters, NEFTune noise injection integrated into group construction
  • GRPO trainer: epochs, lora_r, lora_alpha, max_completion_length, and seed plumbed through from CLI args; previously these were hardcoded to 1, 16, 32, 512, and a fixed seed
  • Training loop: loraplus_lr_ratio and neftune_noise_alpha read from config and forwarded to optimizer group construction
  • pmetal-core config / scheduler / traits: Config structs gained loraplus_lr_ratio and neftune_noise_alpha fields; scheduler types and learning rate trait bounds refined; TrainingCallback trait extended with blanket impls for boxed callbacks
  • Data pipeline: Tokenizer, packing, vocab_compact, dataset, and chat template modules updated — minor correctness and efficiency fixes accumulated across the release cycle
  • GGUF reader / writer / quantize: Reader handles additional tensor metadata fields; writer improves alignment padding; quantize module uses QuantizeMethod enum instead of string matching
  • Hub search: search_models returns richer result structs used by both the human-readable table and the new --json output path; upload path fixes for large model shards
  • Metal kernels: GDN, LoRA, grouped GEMM, and fused SwiGLU Metal shaders updated — improved numerical correctness and register pressure
  • GUI app icons and Tauri config: Updated icons (32×32, 128×128, 128×128@2x, icns, ico) and tauri.conf.json for the 0.3.7 release build; Python vocoder easy API additions and mel spectrogram fix
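
The LoRA+ grouping mentioned for build_lora_param_groups (a higher learning rate for lora_b parameters, scaled by loraplus_lr_ratio) can be sketched as a name-based split. The parameter names, the dictionary layout, and the matching rule are hypothetical; only the idea of multiplying the base learning rate by the ratio for the B matrices comes from the changelog.

```python
def build_param_groups(param_names, base_lr, loraplus_lr_ratio):
    """Split parameters into a base group and a lora_b group with scaled LR."""
    groups = {
        "base": {"lr": base_lr, "params": []},
        "lora_b": {"lr": base_lr * loraplus_lr_ratio, "params": []},
    }
    for name in param_names:
        # Assumed naming convention: B matrices contain "lora_b" in their name.
        key = "lora_b" if "lora_b" in name.lower() else "base"
        groups[key]["params"].append(name)
    return groups
```

A ratio of 1.0 makes both groups use the same learning rate, which recovers ordinary LoRA training.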

Downloads

Asset                                   Description
pmetal-*-aarch64-apple-darwin.tar.gz    CLI binary + mlx.metallib (Apple Silicon)
PMetal-*-aarch64-apple-darwin-*.dmg     Desktop GUI app (Apple Silicon)
mlx.metallib                            MLX Metal shader library (standalone)

CLI Quick Start

tar xzf pmetal-*-aarch64-apple-darwin.tar.gz
./pmetal train --model Qwen/Qwen3-0.6B --dataset train.jsonl --output ./output

GUI

Mount the DMG and drag PMetal to Applications.

Full Changelog: v0.3.6...v0.3.7

PMetal v0.3.6

15 Mar 23:56

[0.3.6] - 2026-03-15

Added

  • Desktop GUI (Tauri + Svelte): Full desktop application for model management, training, distillation, GRPO, inference, merging, and quantization. 10 pages: Dashboard, Models, Datasets, Training, Distillation, GRPO, Inference, Merging, Quantize, Settings. Real-time training metrics with live loss charts via broadcast events. Model download with HuggingFace Hub integration, dataset browser, and inference chat interface with streaming token display
  • GUI in-process execution: Training, distillation, GRPO, inference, model merging, LoRA fuse, and quantization run as direct library calls instead of shelling out to the pmetal binary. Eliminates binary discovery issues, reduces process overhead, and enables richer progress reporting. Device info and model metadata also read from library APIs
  • easy::dpo() / easy::simpo() / easy::orpo() / easy::kto() builders: PreferenceTuneBuilder in easy.rs for preference optimization methods. Full pipeline: model download → tokenizer → dataset loading → LoRA setup → training loop → weight saving. Supports method-specific config (DPO beta/loss type, SimPO gamma/CPO, ORPO beta, KTO desirable/undesirable weights)
  • easy::infer().generate_streaming(): Streaming inference API with per-delta callback. Supports both base models and LoRA adapters. Return false from the callback to cancel early. The ANE fallback emits the full result as a single delta
  • Preference trainer train() methods: DPO, KTO, ORPO, and SimPO trainers now have self-contained train() methods with optimizer integration, batching, epoch loops, callback lifecycle, and metrics collection. Previously only exposed per-step primitives
  • TrainingCallback::should_stop(): Clean cancellation mechanism. Callbacks return true to request that the training loop finish the current step and exit with a Cancelled error. Checked after every step in all 5 TrainingLoop::run* methods, all 4 preference trainer train() loops, and GrpoTrainer::run()
  • PMetalError::Cancelled: New error variant for clean training cancellation. Corresponding Cancelled variants added to SftError, DpoError, KtoError, OrpoError, SimpoError, and GrpoError
  • Preference batch padding utilities: pad_u32_sequences, pad_i64_sequences, pad_f32_sequences in preference_batch.rs for batching variable-length preference pairs
  • NemotronH runtime FP8 quantization: quantize_fp8() converts float weights to FP8 (E4M3) at runtime for all four block types (Mamba, attention, MLP, MoE). Shared helpers materialize_linear_weight and linear_forward_with_optional_fp8 consolidate FP8 dequantization across the model. MoE weights are restacked after quantization for batched dispatch
  • FluxPipeline::from_pretrained: Load Flux diffusion pipelines from HuggingFace-style model directories. Discovers components via model_index.json, parses both native and diffusers-style config keys for CLIP, T5, FluxDiT, and VAE
  • Python training callbacks: Trainer.add_callback() now wires callbacks into the training loop. Built-in ProgressCallback, LoggingCallback, and MetricsJsonCallback map to native Rust implementations; arbitrary Python objects bridge through PythonCallbackBridge
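
The should_stop() cancellation pattern can be sketched in a few lines. This is a language-neutral illustration in Python, not the Rust trait: the class names, the on_step_end hook, and the placeholder loss are all assumptions.

```python
class Cancelled(Exception):
    """Raised when a callback asks the loop to stop."""

class StopAfter:
    """Example callback that requests cancellation after a step budget."""

    def __init__(self, max_steps):
        self.max_steps = max_steps
        self.seen = 0

    def on_step_end(self, step, loss):
        self.seen = step

    def should_stop(self):
        return self.seen >= self.max_steps

def run_training(num_steps, callbacks):
    """Run a toy loop, checking for cancellation after every completed step."""
    completed = 0
    for step in range(1, num_steps + 1):
        loss = 1.0 / step  # placeholder for the real forward/backward pass
        completed = step
        for cb in callbacks:
            cb.on_step_end(step, loss)
        # The current step always finishes before the loop exits.
        if any(cb.should_stop() for cb in callbacks):
            raise Cancelled(f"stopped after step {step}")
    return completed
```

Returning an ordinary error value instead of unwinding through a panic keeps the exit path visible to every caller, including ones across an FFI boundary.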

Fixed

  • Training cancellation via panic_any replaced: GUI and TUI previously used std::panic::panic_any(CancelledRun) + catch_unwind to abort training — fragile, UB-prone through FFI, and could be swallowed by intermediate catch_unwind. Replaced with TrainingCallback::should_stop() returning a clean Err(Cancelled) from the training loop
  • GUI QLoRA silently failed on non-Llama models: run_qlora_training_in_process hardcoded LlamaConfig deserialization, causing confusing errors or silent misconfiguration for Gemma/Qwen/Phi models. Now detects model_type from config.json and returns a clear error for unsupported architectures
  • GUI resume_from silently ignored: Training config accepted resume_from but discarded it (let _ = eval). Now returns an error directing users to the CLI
  • GUI GRPO with no reward function produced noise: DummyReward returning constant 0.1 for all completions made GRPO training meaningless when reasoning rewards were disabled. Now requires explicit reward configuration
  • Preference trainers doubled compute per step: DPO, KTO, ORPO, and SimPO train() methods ran a second full forward pass after the gradient step solely for logging metrics. Replaced with RefCell side-channels that capture metric arrays from within the autograd closure — same metrics, zero extra compute
  • Base model thinking mode: Auto-detect base vs instruct models and disable <think> tag prefill for base models. Base models don't understand thinking tags, causing infinite generation without a closing tag
  • Fused model 5x slower than LoRA: Skip ANE-hybrid path for models under 2B parameters where GPU KV-cache decode is significantly faster (115 vs 20 tok/s). ANE-hybrid benefits larger models where prefill dominates
  • DataLoader panics on bad images: Replace panic!() in VLM batch construction with proper DataLoaderError enum and try_next_batch() method. Image preprocessing failures and missing-image errors now propagate as Result instead of crashing
  • Division by zero with log_every=0: Clamp log_every and save_every to minimum 1 across TrainingLoop, LoggingCallback, CheckpointCallback, and CLI
  • LoRA scaling with rank 0: LoraConfig::scaling() returns 0.0 when rank is 0 instead of dividing by zero
  • BF16 LoRA weights: sanitize_loaded_weights() converts BF16 tensors to FP16 since MLX doesn't natively support BF16 on Apple Silicon
  • Qwen3Next silent weight mismatch: Weight loading now returns errors for unmatched or missing parameters instead of logging a warning and continuing with a partially loaded model
  • Dataset download only fetched README: download_dataset() now enumerates repo files and downloads actual data files (parquet, json, jsonl, csv, arrow, etc.) with split-aware filtering
  • Model download silent failures: download_model() tracks per-file failures and reports them instead of silently skipping failed downloads
  • Flux loading via DynamicModel: DynamicModel::load() for Flux now returns an error directing to FluxPipeline instead of incorrectly loading a diffusion model as a causal LM
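
The two divide-by-zero guards in the list above (LoRA scaling at rank 0, and log_every or save_every set to 0) amount to the following. The function names are hypothetical, and the scaling shown is the standard alpha / rank rule rather than the RSLoRA variant.

```python
def lora_scaling(alpha, rank):
    """Standard LoRA scaling alpha / rank, defined as 0.0 for rank 0."""
    return 0.0 if rank == 0 else alpha / rank

def clamp_interval(every):
    """Treat 0 (or negative) intervals as 'every step' to avoid modulo by zero."""
    return max(1, every)

def should_log(step, log_every):
    """True when `step` falls on the (clamped) logging interval."""
    return step % clamp_interval(log_every) == 0
```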

Changed

  • GUI architecture: library calls replace subprocess spawning: Training, distillation, GRPO, inference, merge, fuse, and quantize commands now call pmetal library functions directly instead of spawning pmetal CLI as a child process. System info reads from MetalContext::global() instead of parsing pmetal memory stdout. Removes which and futures-util dependencies
  • TUI direct training execution: command_runner.rs dispatches train, distill, and grpo commands as in-process library calls via run_direct_command(), falling back to subprocess for other commands. Training parameters parsed from CommandSpec args with parse_arg/required_arg/optional_arg helpers
  • ORPO loss computation refactored: compute_orpo_loss_static now contains the full computation directly instead of creating a throwaway OrpoTrainer instance. The instance method compute_orpo_loss delegates to it
  • SimPO gradient-safe loss path: New compute_loss_with_cpo_for_grad static method keeps the computation graph lazy (no .eval()/.item() calls) for correct autograd. The existing compute_loss_with_cpo remains for non-grad contexts
  • FinetuneBuilder expanded: New builder methods — lora_dropout(), use_rslora(), use_dora(), gradient_checkpointing_layers(), callback(), metrics_path(). LoRA config now forwards dropout, RSLoRA, and DoRA settings
  • GRPO CLI gains new parameters: epochs, lora_r, lora_alpha, max_completion_length exposed as CLI arguments and TUI form fields. GRPO now saves adapter_config.json alongside LoRA weights
  • CLI emit_console_output flag: Training, distillation, and GRPO CLI functions accept emit_console_output: bool and extra_callbacks: Vec<Box<dyn TrainingCallback>> to suppress terminal output when called from GUI/TUI
  • DataLoader error handling: New DataLoaderError enum with Mlx, ImagePreprocess, and MissingImages variants. All 7 training loop entry points migrated from next_batch() to try_next_batch()
  • AdapterManager validation: load() now validates path existence, checks for adapter artifacts in directories, and rejects unsupported file types
  • Metal shader build isolation: Shader compiler cache redirected to build output directory, preventing pollution of user's home directory
  • unsafe_code lint scoping: Moved blanket #![allow(unsafe_code)] from crate-level lib.rs into individual modules that contain unsafe blocks across pmetal-metal, pmetal-mlx, pmetal-models, pmetal-trainer, pmetal-distill, and pmetal-distributed

Downloads

Asset                                   Description
pmetal-*-aarch64-apple-darwin.tar.gz    CLI binary + mlx.metallib (Apple Silicon)
PMetal-*-aarch64-apple-darwin-*.dmg     Desktop GUI app (Apple Silicon)
mlx.metallib                            MLX Metal shader library (standalone)

CLI Quick Start

tar xzf pmetal-*-aarch64-apple-darwin.tar.gz
./pmetal train --model Qwen/Qwen3-0.6B --dataset train.jsonl --output ./output

GUI

Mount the DMG and drag PMetal to Applications.

Full Changelog: v0.3.5...v0.3.6

v0.3.5

15 Mar 03:16

Full Changelog: v0.3.4...v0.3.5

v0.3.4

14 Mar 20:08

[0.3.4] - 2026-03-14

Added

  • Mixture-of-Depths (MoD) for Llama 4: Proper implementation per Raposo et al. (2024) — lightweight router with argpartition_axis top-k, gather-before-compute on sub-batch, scatter-after, BCE auxiliary loss. Configurable capacity factor and per-layer selection
  • Llama 4 RoPE: Real RoPE implementation via pmetal_mlx::kernels::rope::apply_rope (Metal-accelerated), replacing the placeholder stub. Correctly wired into iRoPE layer dispatch — RoPE layers get rotary embeddings, NoPE layers skip them
  • Llama 4 temperature scaling: Per Meta's formula log(floor((pos+1)/floor_scale) + 1) * attn_scale + 1.0, applied to Q states in NoPE layers before QK matmul for long-context attention stabilization
  • Llama 4 GQA: KV-head broadcast expansion for grouped-query attention — enables Scout (40 Q / 8 KV) and Maverick configs
  • MoE top-k > 1: Llama4Router uses argpartition_axis for O(n) expert selection with L1-normalized weights and per-slot dispatch loop, replacing hardcoded argmax
  • ANE fused kernels: gen_dynamic_sdpa_fwd (single-kernel attention: RMSNorm + QKV + SDPA + Wo) and gen_dynamic_ffn_w13 (single-kernel FFN: RMSNorm + W1 + W3 + SiLU), replacing 6+ separate ANE evaluations per layer
  • ANE fused backward: gen_dynamic_ffn_bwd_w2t and gen_dynamic_ffn_bwd_w13t for fused FFN backward pass
  • Metal dequantization kernels: Q4_0 and IQ4_XS Metal compute shaders, verified correct per GGML spec. Bridge methods in MlxMetalBridge for GPU-accelerated dequantization
  • Cancellation safety infrastructure: CompletionToken::Drop guard in AsyncScheduler waits for in-flight GPU commands; retain_resource() / as_retained() for Metal buffer lifetime extension
  • IoSurface helpers: write_f32_strided_at, write_f32_at_col_offset, zero_channel_range_f32 for fused backward kernel IO
  • CloudBridge: Complete training state export (weights, optimizer state, RNG, dataloader position, metadata) with working Python bootstrap scripts for FSDP/DeepSpeed cluster resumption and Rust-side loader functions
  • Formal verification: cargo-kani proofs for ring all-reduce chunk arithmetic (95 checks) and k-ary tree topology consistency (607 checks), with justfile recipes
  • Reasoning templates: MathReasoningTemplate (GRPO + accuracy/format rewards) and CodeReasoningTemplate (structural code fence + test case matching)
  • Reasoning dataset auto-detection: pmetal dataset prepare automatically detects problem/thinking/solution columns and formats them as <think> tagged ChatML conversations
  • --columns flag: General column remapping for dataset prepare (e.g., --columns "instruction=question,output=answer")
  • adapter_config.json: Saved alongside LoRA weights during training (r, alpha, target_modules, use_rslora). Loaded automatically at inference and fuse time — eliminates config guesswork
  • Supply chain: cargo-vet initialized with Mozilla, Google, and Bytecode Alliance audit imports; 17 workspace crates covered; 5 transitive dependency exemptions with exact lockfile versions
  • Tracing spans: 6 info_span! markers in Python trainer for phase-level observability (model_resolve, load_tokenizer, load_dataset, load_model, training_loop, save_weights)
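
The Llama 4 temperature-scaling formula quoted above can be sketched in a few lines. This is a minimal illustration, not the pmetal implementation: the function names are hypothetical, and the values 8192.0 and 0.1 in the usage note are only example inputs.

```rust
/// Per-position scale from the formula in the changelog entry:
/// log(floor((pos + 1) / floor_scale) + 1) * attn_scale + 1.0
fn attn_temperature_scale(pos: usize, floor_scale: f32, attn_scale: f32) -> f32 {
    let bucket = ((pos as f32 + 1.0) / floor_scale).floor();
    (bucket + 1.0).ln() * attn_scale + 1.0
}

/// Apply the scale to one query vector before the QK matmul.
/// (Hypothetical helper; a real kernel would do this on the GPU.)
fn scale_query(q: &mut [f32], pos: usize, floor_scale: f32, attn_scale: f32) {
    let s = attn_temperature_scale(pos, floor_scale, attn_scale);
    for v in q.iter_mut() {
        *v *= s;
    }
}
```

With floor_scale = 8192.0, every position below 8191 falls in bucket 0, so the scale is exactly 1.0 and short contexts are unaffected; the scale only grows logarithmically once the position crosses multiples of floor_scale.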

Fixed

  • LoRA inference garbage output: Merged LoRA weights into base model at inference time (W += scale*B@A), matching mlx-lm's pattern. The separate-forward path had dtype mismatch issues (BF16 base × F32 LoRA)
  • Auto-chat mode regression: Removed heuristic that forced chat template on base models just because their tokenizer has <|im_end|>. Chat mode now requires explicit --chat or an instruction-tuned model
  • Missing EOS in training data: Training sequences now end with the model's actual EOS token (e.g., <|endoftext|> for Qwen). Previously only had turn delimiter (<|im_end|>) — model never learned to stop generating
  • Fuse command wrong alpha/rank: pmetal fuse now reads adapter_config.json for correct alpha and rank instead of defaulting to scale=1.0. Also filters MLP LoRA weights (rank=0) when auto-detecting rank from shapes
  • ANE x2norm backward bug: FFN weight gradients (dW1, dW3) were computed against the wrong pre-norm tensor (xnorm from attention block instead of x2norm from FFN block). Restored x2norm field and CPU RMSNorm recomputation for gradient correctness
  • ANE sdpa_bwd surface dtype: Backward SDPA output surfaces were allocated as fp32 but ANE kernels produce fp16 — stride mismatch corrupted dV/dQ/dK gradients. Fixed to IoSurface::for_tensor() (fp16)
  • MoD argpartition sign: Router negated weights before argpartition_axis, selecting bottom-k (least important) tokens instead of top-k. Removed negation
  • MLX bridge copy_as_f32 regression: Renamed methods dropped auto dtype conversion — callers passing wrong dtype would panic. Restored copy_as_f32 / copy_as_f16 with auto-conversion
  • MLX bridge view_f32 eval: The .eval() call before accessing the data pointer had been removed, so unevaluated arrays returned null. Restored the defensive eval
  • Python API surface: Restored ProgressCallback, LoggingCallback(log_every=10), __version__, and PythonCallbackBridge that were deleted during PyO3 migration
  • TUI training completion: Reads final metrics from JSONL file on disk (immune to polling lag). Shows actual loss and step count instead of 0.0000 / sample count
  • TUI Steps/min overflow: Guards against divide-by-zero when total_ms=0, so the display no longer shows a bogus 60000
  • Dataset prepare panic: Empty results no longer crash with index-out-of-bounds. Shows diagnostic message with format hints
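
The LoRA merge mentioned in the first fix (W += scale*B@A) can be illustrated on plain row-major slices. This is a sketch under stated assumptions: merge_lora_dense is a hypothetical name, everything is kept in f32, and the shapes are W [out, inp], B [out, r], A [r, inp]. The actual pmetal code operates on MLX arrays.

```rust
/// Merge a LoRA update into a dense weight in place: W += scale * (B @ A).
/// All matrices are row-major: W is [out, inp], B is [out, r], A is [r, inp].
fn merge_lora_dense(
    w: &mut [f32],
    b: &[f32],
    a: &[f32],
    out: usize,
    inp: usize,
    r: usize,
    scale: f32,
) {
    assert_eq!(w.len(), out * inp);
    assert_eq!(b.len(), out * r);
    assert_eq!(a.len(), r * inp);
    for i in 0..out {
        for k in 0..r {
            // Fold the scale into the B element once per (i, k) pair.
            let b_ik = scale * b[i * r + k];
            for j in 0..inp {
                w[i * inp + j] += b_ik * a[k * inp + j];
            }
        }
    }
}
```

Because the merged matrix has a single dtype, the forward pass no longer mixes a BF16 base path with an F32 adapter path, which is the mismatch the entry describes.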

Changed

  • LoRA inference uses merge: merge_lora() is called before generation, producing a single merged weight matrix per layer. This is equivalent to the fuse command but happens in-memory without saving
  • PyO3 0.23 → 0.28: allow_threads → detach, with_gil → attach, from_py_object on all pyclass types, Bound<'py, PyDict> return types
  • tokio 1.49 → 1.50
  • unsafe_code lint: Escalated from warn to deny workspace-wide

Full Changelog: v0.3.3...v0.3.4

v0.3.3

13 Mar 03:11

[0.3.3] - 2026-03-12

Added

  • Self-contained binary: mlx.metallib is now gzip-compressed and embedded into the pmetal binary at build time via build.rs + include_bytes!. On first run it extracts to ~/.cache/pmetal/lib/ if not already present. cargo install pmetal-cli now produces a fully self-contained binary with no external metallib dependency (~31MB added to binary, 70% smaller than the raw 102MB metallib)
  • Adaptive LR rollback: When divergence is detected and rollback_enabled = true, the adaptive LR controller emits LrEvent::RollbackTriggered — the training loop restores LoRA weights from the best in-memory EMA snapshot, resets optimizer momentum, and continues with a halved LR multiplier
  • Early-stop on repeated divergence: Once max_rollbacks rollbacks have been used up, the controller emits LrEvent::EarlyStop; the training loop saves a final checkpoint and exits cleanly instead of spiraling deeper into loss divergence
  • In-memory LoRA snapshot: TrainingLoop holds the best LoRA weight snapshot in RAM via snapshot_best_weights() / restore_best_weights(). LoRA params are typically 1–20 MB, making this negligible overhead vs checkpoint I/O
  • AdaptiveAction enum: apply_adaptive_lr() now returns AdaptiveAction::Continue | Rollback | EarlyStop so training loops can react to controller decisions without re-parsing event strings
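
A training loop might react to the three AdaptiveAction variants roughly as follows. Only the variant names and the snapshot_best_weights / restore_best_weights method names come from the notes above; ToyLoop, its fields, and the place where the LR multiplier is halved are assumptions made for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AdaptiveAction {
    Continue,
    Rollback,
    EarlyStop,
}

/// Hypothetical stand-in for a training loop holding LoRA weights in RAM.
struct ToyLoop {
    weights: Vec<f32>,
    best: Option<Vec<f32>>,
    lr_multiplier: f32,
    stopped: bool,
}

impl ToyLoop {
    fn snapshot_best_weights(&mut self) {
        self.best = Some(self.weights.clone());
    }

    fn restore_best_weights(&mut self) {
        if let Some(best) = &self.best {
            self.weights = best.clone();
        }
    }

    /// React to the controller's decision for the current step.
    fn react(&mut self, action: AdaptiveAction) {
        match action {
            AdaptiveAction::Continue => {}
            AdaptiveAction::Rollback => {
                self.restore_best_weights();
                // Assumed here for illustration; the notes say training
                // continues with a halved LR multiplier after a rollback.
                self.lr_multiplier *= 0.5;
            }
            AdaptiveAction::EarlyStop => self.stopped = true,
        }
    }
}
```

Returning an enum rather than () lets the compiler check that every caller handles all three outcomes.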

Fixed

  • apply_adaptive_lr return type: Previously returned (), discarding rollback/early-stop events — callers had no way to react. Now returns AdaptiveAction
  • Divergence rollback vs plain reduction ambiguity: Divergence path now checks rollback_enabled and has_best_snapshot before deciding between rollback and plain LR reduction — prevents silent rollback when no snapshot exists
  • EMA state reset on rollback: Spike EMA and variance are reset alongside LR multiplier on rollback so z-score anomaly detection re-stabilizes correctly after weight restoration
  • total_steps in metrics: run_standard() and run_jit_compiled() computed total_steps: max_steps.unwrap_or(0) — now estimates from dataset.len() / batch_size * epochs when max_steps is None, giving accurate progress in the TUI
  • stats_summary missing rollback count: AdaptiveLrController::stats_summary() now includes rollbacks=N in its output string
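
The total_steps fix can be sketched as a small pure function. The formula dataset.len() / batch_size * epochs is taken from the entry above; the function name, the zero-batch guard, and the use of integer (floor) division are assumptions.

```rust
/// Use max_steps when given, otherwise estimate from dataset size.
fn estimate_total_steps(
    max_steps: Option<usize>,
    dataset_len: usize,
    batch_size: usize,
    epochs: usize,
) -> usize {
    match max_steps {
        Some(n) => n,
        // Guard added for this sketch: avoid dividing by zero.
        None if batch_size == 0 => 0,
        None => (dataset_len / batch_size).saturating_mul(epochs),
    }
}
```

For example, 1000 samples at batch size 8 over 3 epochs gives 125 * 3 = 375 steps, where the old unwrap_or(0) would have reported 0.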

Improved

  • Rollback tests: Four new unit tests — test_rollback_triggered_on_divergence, test_early_stop_after_max_rollbacks, test_rollback_disabled_falls_through_to_divergence, test_should_snapshot_best_tracks_ema_improvement

Full Changelog: v0.3.2...v0.3.3

v0.3.2

12 Mar 02:10

[0.3.2] - 2026-03-11

Added

  • Adaptive learning rate controller: EMA-based z-score spike detection, patience-based plateau detection, and linear regression divergence detection — automatically adjusts LR multiplier during training to recover from loss spikes, reduce LR on plateaus, and halt on divergence
  • Manual LR override via TUI: Press L in Training, Distillation, or GRPO tabs to set a custom learning rate mid-run; uses atomic control file protocol ({output_dir}/.lr_control.json) for safe subprocess communication
  • WSD (Warmup-Stable-Decay) scheduler: New LrSchedulerType::Wsd with configurable stable_ratio — holds peak LR for a plateau phase before linear decay, popular for large-scale pretraining
  • GRPO adaptive LR + callbacks: GrpoTrainer now supports adaptive LR, TrainingCallback lifecycle events, and StepMetrics emission for live TUI monitoring
  • HuggingFace Hub search (pmetal search): CLI command and TUI integration (press S in Models tab) to search HF Hub for text-generation models with download counts, parameter estimates, and memory fit assessment
  • Memory fit estimation: New pmetal-hub module estimates inference/training memory requirements, tok/s throughput, and color-coded fit levels (green/yellow/red) based on device specs and model architecture
  • Model detail panel: Models tab shows memory breakdown — weights, KV cache, overhead, training estimate, and recommended batch size
  • Distillation metrics callbacks: DistillationTrainer now emits step-by-step metrics via TrainingCallback, enabling live TUI dashboard during distillation runs
  • Command logging in Jobs tab: Spawned commands are logged with the full CLI invocation for easier debugging
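
A Warmup-Stable-Decay schedule of the kind described above can be sketched as a function of the step index. This is one plausible reading, not the LrSchedulerType::Wsd implementation: it assumes linear warmup, interprets stable_ratio as the fraction of post-warmup steps held at peak, and decays linearly to zero. All parameter names other than stable_ratio are hypothetical.

```rust
/// Learning rate at `step` for a warmup -> stable -> linear-decay schedule.
fn wsd_lr(
    step: usize,
    total_steps: usize,
    warmup_steps: usize,
    stable_ratio: f32,
    peak_lr: f32,
) -> f32 {
    let step = step.min(total_steps);
    if step < warmup_steps {
        // Linear warmup that reaches peak_lr on the last warmup step.
        return peak_lr * (step as f32 + 1.0) / warmup_steps as f32;
    }
    let after_warmup = total_steps.saturating_sub(warmup_steps);
    let stable_steps = (after_warmup as f32 * stable_ratio.clamp(0.0, 1.0)) as usize;
    let decay_start = warmup_steps + stable_steps;
    if step < decay_start {
        return peak_lr; // plateau phase
    }
    let decay_steps = total_steps - decay_start;
    if decay_steps == 0 {
        return peak_lr;
    }
    let progress = (step - decay_start) as f32 / decay_steps as f32;
    peak_lr * (1.0 - progress)
}
```

With 100 total steps, 10 warmup steps, and stable_ratio = 0.5, the rate holds at peak from step 9 through step 55 and then falls linearly to zero at step 100.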

Fixed

  • NaN/Inf loss guard: Adaptive LR skips EMA updates on non-finite losses to prevent EMA poisoning — returns scheduled LR unchanged
  • EMA variance bias correction: Early-training z-scores now use bias-corrected variance (raw_var / (1 - alpha^n)), matching Adam's moment correction — prevents false spike detection in first ~20 steps
  • Zero-variance z-score fallback: When loss variance is near zero (std_dev < 1e-8), uses absolute deviation threshold instead of division-by-zero; returns z=10 for >50% deviation, z=0 otherwise
  • Atomic control file protocol: LR control file is renamed to .lr_control.claimed before reading and deleted after — prevents race conditions between TUI writer and training subprocess reader
  • Distillation metrics LR: Distillation step metrics now report post-adaptive LR instead of pre-adjustment scheduled LR
  • Adaptive LR in all training paths: apply_adaptive_lr() now called in run_metal_fused(), run_compiled(), run_jit_compiled(), and run_packed() paths (was only in run_standard())
  • TUI LR override validation: LR range check now accepts 1.0 (was exclusive upper bound); shows error modal on invalid input instead of silent log warning
  • Distillation/GRPO job routing: Status updates were always routed to the Training tab regardless of job type. Added active_job_type tracking to route metrics, completion, and failure to the correct tab (Distill, GRPO, or Training)
  • Distillation CLI args: TUI sent --lora-alpha and --log-metrics flags that the CLI didn't accept, causing immediate exit code 2. Added both args to the Distill command and --log-metrics to Grpo
  • Parquet dataset support in distill/GRPO: Distillation and GRPO commands only supported JSONL datasets. Now auto-detect .parquet files and route to the parquet loader, matching the training command's behavior
  • Tab click targeting: Mouse clicks on Monitor, Inference, and Jobs tabs selected the wrong tab due to hardcoded fixed-width hit-testing. Now computes actual tab widths from rendered text
  • Error diagnostics: Failed jobs now show the last 5 stderr lines in the tab status panel instead of just "Process exited with code N", with a hint to check the Jobs tab for full output
  • UTF-8 safe string truncation: truncate_str used byte indexing which panics on multi-byte characters; switched to chars() iterator
  • Leaked channel in HF search: search_hf() created a sender/receiver pair even without a CommandRunner, silently dropping results
  • Integer overflow in fit estimation: estimate_params_from_config used plain multiplication; switched to saturating_mul/saturating_add
  • Context length truncation: u64→u32 cast could wrap for extreme values; capped at 1M before cast
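
The three adaptive-LR numerical fixes in this list (non-finite guard, bias-corrected variance, zero-variance fallback) fit together roughly as below. The constants 1e-8, 50%, z=10, and the raw_var / (1 - alpha^n) correction come from the entries above; the LossEma type, its field names, and measuring the deviation relative to the EMA mean are assumptions for this sketch. It expects alpha strictly between 0 and 1.

```rust
/// Hypothetical EMA tracker for training loss.
struct LossEma {
    alpha: f64,    // decay weight on the previous EMA value, e.g. 0.9
    raw_mean: f64, // uncorrected EMA of the loss
    raw_var: f64,  // uncorrected EMA of squared deviations
    n: u32,        // number of finite losses seen so far
}

impl LossEma {
    fn new(alpha: f64) -> Self {
        Self { alpha, raw_mean: 0.0, raw_var: 0.0, n: 0 }
    }

    /// 1 - alpha^n, the same correction Adam applies to its moments.
    fn correction(&self) -> f64 {
        1.0 - self.alpha.powi(self.n as i32)
    }

    fn mean(&self) -> f64 {
        if self.n == 0 { 0.0 } else { self.raw_mean / self.correction() }
    }

    fn z_score(&self, loss: f64) -> f64 {
        if self.n == 0 || !loss.is_finite() {
            return 0.0;
        }
        let mean = self.mean();
        let std_dev = (self.raw_var / self.correction()).max(0.0).sqrt();
        if std_dev < 1e-8 {
            // Near-zero variance: fall back to a relative-deviation threshold.
            let rel = (loss - mean).abs() / mean.abs().max(f64::MIN_POSITIVE);
            return if rel > 0.5 { 10.0 } else { 0.0 };
        }
        (loss - mean) / std_dev
    }

    fn update(&mut self, loss: f64) {
        if !loss.is_finite() {
            return; // skip NaN/Inf so the EMA is not poisoned
        }
        let prev_mean = if self.n == 0 { loss } else { self.mean() };
        self.n += 1;
        self.raw_mean = self.alpha * self.raw_mean + (1.0 - self.alpha) * loss;
        let d = loss - prev_mean;
        self.raw_var = self.alpha * self.raw_var + (1.0 - self.alpha) * d * d;
    }
}
```

Without the correction, the zero-initialized EMAs are biased toward zero during the first steps, which is the source of the false spike detections the entry mentions.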

Improved

  • TUI tab ordering: System (formerly Device) is now the default first tab; Dashboard renamed to Monitor
  • Empty state messaging: Monitor tab shows actionable guidance ("Start a run from Training, Distill, or GRPO tab") instead of "Waiting for training data..."
  • Idle state hint: Tabs show "Press S to start" instead of "Press S to start training" (generic across all job types)

Security

  • Bounded API responses: bounded_json() caps HF API response bodies at 4MB to prevent heap exhaustion
  • Model ID validation: is_valid_model_id() rejects path traversal, URL injection, and malformed values in HF API paths

Full Changelog: v0.3.1...v0.3.2

v0.3.1

11 Mar 15:43

[0.3.1] - 2026-03-11

Added

  • M5 / Apple10 device detection: GPU family Apple10 with architecture generation 17, NAX (Neural Accelerators in GPU) availability flag, and NAX-aware tile size tuning (M5 Max/Ultra get 128×64×32)
  • UltraFusion topology detection: sysctl hw.packages detects multi-die Ultra chips; is_ultra_fusion and die_count fields on DeviceProperties
  • GPU and ANE core count estimation: Per-chip core counts derived from device name and tier, with UltraFusion die multiplication
  • Memory bandwidth estimation: Tier + GPU family lookup table for estimated bandwidth (GB/s)
  • ANE performance stats API: evaluate_with_stats() on AneModel uses _ANEPerformanceStats with hwExecutionTime for nanosecond-precision hardware timing
  • TUI device tab enhancements: GPU core counts (with per-die breakdown for Ultra), ANE core counts, memory bandwidth, architecture generation, NAX and UltraFusion feature flags
  • crates/pmetal/README.md: Crate-level README with feature flags table, quick start examples, hardware support summary, and re-export reference

Fixed

  • AppleGPUFamily::Unknown ordering bug: Unknown was declared last in the enum, causing derived Ord to rank it above Apple10 — unknown GPUs incorrectly got has_dynamic_caching, has_nax, etc. set to true. Fixed by moving Unknown to first position
  • Future chip name collision: name.contains("M1") matched "M10"; replaced with has_chip_id() that checks the character after the match isn't a digit
  • Dead sysctl subprocess in query_memory_bandwidth: Spawned sysctl whose result was discarded; removed and renamed to estimate_memory_bandwidth() using tier-based lookup
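
The first two fixes above both come down to small Rust details, sketched here. The enum is a cut-down hypothetical stand-in for AppleGPUFamily, and this has_chip_id is an illustrative reimplementation based only on the description "checks the character after the match isn't a digit".

```rust
/// Derived Ord follows declaration order, so Unknown must come first
/// to compare below every real GPU family.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum GpuFamily {
    Unknown,
    Apple7,
    Apple8,
    Apple9,
    Apple10,
}

/// Feature gate written as a "this family or newer" comparison.
fn has_nax(family: GpuFamily) -> bool {
    family >= GpuFamily::Apple10
}

/// True if `name` contains `chip` and the match is not followed by a digit,
/// so "M1" matches "Apple M1 Max" but not "Apple M10".
fn has_chip_id(name: &str, chip: &str) -> bool {
    name.match_indices(chip).any(|(start, m)| {
        let rest = &name[start + m.len()..];
        !rest.chars().next().map_or(false, |c| c.is_ascii_digit())
    })
}
```

Had Unknown been declared last, has_nax(GpuFamily::Unknown) would return true, which is the behavior the ordering fix removes.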

Improved

  • README updates: Root README now documents hardware support matrix (M1–M5), 9 TUI tabs (was 7), 16 crates (was 15), all fused Metal kernels (GDN, SwiGLU, RMSNorm+LoRA), ANE perf stats and M1–M5 compatibility
  • Hardware support docs: Complete M1–M5 chip matrix with arch gen, core counts, bandwidth, ANE TFLOPS measurements; NAX kernel integration roadmap; UltraFusion distributed roadmap

Full Changelog: v0.3.0...v0.3.1

v0.3.0

11 Mar 03:27

[0.3.0] - 2026-03-10

Added

  • TUI Control Center (pmetal tui): Full terminal interface with 9 tabs — Dashboard, Device, Models, Datasets, Training, Distillation, GRPO, Inference, Jobs. Async event loop with crossterm/ratatui, modal system (confirm, text input, model picker, dataset picker, error, progress), and reusable form field widgets
  • Live job integration: Training, distillation, and GRPO tabs spawn pmetal subprocesses and stream metrics in real time via CommandRunner + JSONL polling
  • LoRA fuse command (pmetal fuse): Merge LoRA adapter weights into base model, with optional fuse-then-quantize pipeline
  • Chat template support for Llama 4, DeepSeek, and Cohere: Full template formatting, Jinja detection, model name heuristics, stop tokens, and inference formatting for all three model families
  • Llama 4 template: <|header_start|>/<|header_end|>/<|eot|> tokens (distinct from Llama 3's <|start_header_id|>/<|end_header_id|>/<|eot_id|>)
  • DeepSeek template: Full-width unicode tokens (<|begin▁of▁sentence|>, <|User|>, <|Assistant|>) with thinking mode support (<think>/</think> prefill)
  • Cohere Command R template: <|START_OF_TURN_TOKEN|>, <|USER_TOKEN|>, <|CHATBOT_TOKEN|>, <|END_OF_TURN_TOKEN|> tokens
  • Comprehensive stop token collection: collect_all_stop_tokens() now probes 11 well-known special tokens across all model families (added <|eot|>, <|end|>, <|return|>, <|END_OF_TURN_TOKEN|>, <|end▁of▁sentence|>)
  • LoRA inference auto-chat detection: Probes vocabulary for <|im_end|>/<|eot_id|> to auto-enable chat mode on base models fine-tuned with LoRA
  • Streaming generation support: GenerationConfig streaming extensions in pmetal-models
  • Epoch/total_steps in StepMetrics: Training progress now flows through entire pipeline (training loop → JSONL callback → TUI) showing step X/Y and epoch M/N
  • Hardware support documentation: Apple Silicon hardware matrix and tuning reference (docs/hardware-support.md)

Fixed

  • TUI inference word wrap: Model output now wraps correctly within the terminal width instead of clipping off-screen; normalize_code_fences() preprocessor ensures ``` markers always appear on their own line even when the model emits text without newlines
  • TUI inference code block rendering: Fenced code blocks (```python, etc.) now render properly with distinct styling even when the token stream lacks explicit newline characters
  • TUI UTF-8 safe text handling: Word wrap and code block truncation now use char-count width instead of byte length, preventing panics on multi-byte characters
  • GRPO accuracy reward — last-occurrence extraction: AccuracyReward now uses rfind() for <answer> tags and \boxed{}, correctly grabbing the final answer when the model retries within chain-of-thought
  • GRPO accuracy reward — broken fallback: Old code compared the entire completion (including reasoning) against the answer when no <answer> tags were found; now falls back to last non-empty line
  • GRPO accuracy reward — whitespace normalization: Answer comparison now collapses internal whitespace runs to single space, preventing false negatives from formatting differences
  • LoRA inference stop tokens: run_inference_with_lora now uses full chat template + comprehensive stop token collection instead of just tokenizer EOS — fixes infinite generation on chat-finetuned models
  • LoRA inference missing parameters: All sampling parameters (top_k, top_p, min_p, penalties, seed) now passed through to LoRA inference path
  • Llama 4 misdetection: Model name heuristic now correctly routes llama-4/llama4 to Llama 4 template (was incorrectly using Llama 3 tokens)
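
The three AccuracyReward fixes above (last-occurrence extraction, last-line fallback, whitespace normalization) can be combined in one short sketch. The function names are hypothetical, and the handling of an unclosed <answer> tag (take everything to the end) is an assumption; only the use of rfind, the fallback to the last non-empty line, and the whitespace collapsing come from the notes.

```rust
/// Collapse every run of whitespace to a single space and trim the ends.
fn normalize_ws(s: &str) -> String {
    s.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Extract the final answer from a completion.
fn extract_answer(completion: &str) -> Option<String> {
    const OPEN: &str = "<answer>";
    const CLOSE: &str = "</answer>";
    // rfind picks the last <answer> tag, so a retried answer wins.
    if let Some(start) = completion.rfind(OPEN) {
        let body = &completion[start + OPEN.len()..];
        let end = body.find(CLOSE).unwrap_or(body.len());
        return Some(normalize_ws(&body[..end]));
    }
    // No tags: fall back to the last non-empty line, not the whole text.
    completion
        .lines()
        .rev()
        .map(str::trim)
        .find(|line| !line.is_empty())
        .map(normalize_ws)
}
```

Comparing normalize_ws(expected) against the extracted string then avoids false negatives caused purely by spacing.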

Added

  • GRPO \boxed{} answer extraction: AccuracyReward now extracts answers from LaTeX \boxed{...} expressions with brace-depth tracking, standard for math GRPO (DeepSeek-R1 style)
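
Brace-depth tracking for \boxed{...} can be sketched as a single scan. extract_boxed is a hypothetical name; taking the last \boxed occurrence mirrors the rfind behavior described for AccuracyReward elsewhere in these notes, and returning None for unbalanced braces is an assumption.

```rust
/// Return the contents of the last \boxed{...}, honoring nested braces.
fn extract_boxed(text: &str) -> Option<&str> {
    const MARK: &str = "\\boxed{";
    let start = text.rfind(MARK)? + MARK.len();
    let mut depth = 1usize; // we are already inside the opening brace
    for (i, c) in text[start..].char_indices() {
        match c {
            '{' => depth += 1,
            '}' => {
                depth -= 1;
                if depth == 0 {
                    return Some(&text[start..start + i]);
                }
            }
            _ => {}
        }
    }
    None // braces never balanced
}
```

A plain search for the first closing brace would truncate an answer such as \boxed{\frac{1}{2}} to "\frac{1", which is why the depth counter is needed.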

Improved

  • TUI replaces legacy dashboard: pmetal tui provides full control center; legacy pmetal dashboard retained for simple metrics monitoring
  • Chat template Jinja detection: Ordered detection ensures DeepSeek (full-width unicode), Cohere, Llama 4 are matched before generic patterns
  • EOS token stripping: strip_eos_tokens() now handles all model-family EOS tokens

Full Changelog: v0.2.1...v0.3.0