Releases: Epistates/pmetal
PMetal v0.5.0
[0.5.0] - 2026-05-07
Added
Distributed inference & training
- `pmetal-distributed` crate (Phases 1-4 + 7): Thunderbolt-fabric-aware multi-Mac cluster runtime with feature-gated tensor, expert, context, ZeRO, and pipeline parallelism modules
- `pmetal cluster` CLI: per-node launch, ring/mesh topology discovery, fabric handshake
- Pipeline harness with overlap of computation and Thunderbolt transfers
- Canonical expert-rank mapping + per-architecture MoE/MLA tensor-parallel plans
- Ring all-reduce / all-gather with corrected chunk indexing
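For readers unfamiliar with why chunk indexing matters in the ring all-reduce entry above, here is a minimal in-process sketch of the textbook algorithm (reduce-scatter followed by all-gather). It is not the `pmetal-distributed` implementation: the function name `ring_all_reduce`, the in-memory "ranks", and the chunk-boundary rule are all assumptions made for illustration.

```rust
/// Simulated ring all-reduce (sum) over `n` ranks held in one process.
/// `bufs[r]` is rank r's buffer; on return every rank holds the element-wise sum.
/// Hypothetical helper for illustration, not a pmetal API.
fn ring_all_reduce(bufs: &mut [Vec<f32>]) {
    let n = bufs.len();
    if n < 2 {
        return;
    }
    let len = bufs[0].len();
    assert!(bufs.iter().all(|b| b.len() == len));
    // Chunk c covers [c*len/n, (c+1)*len/n), which also handles uneven splits.
    let bounds = |c: usize| (c * len / n, (c + 1) * len / n);

    // Phase 1, reduce-scatter: at step s, rank r sends chunk (r - s) mod n
    // to rank r+1, which adds it into its own copy of that chunk.
    for step in 0..n - 1 {
        // Snapshot outgoing data first so all sends in a step are simultaneous.
        let sends: Vec<(usize, usize, Vec<f32>)> = (0..n)
            .map(|r| {
                let c = (r + n - step) % n;
                let (lo, hi) = bounds(c);
                ((r + 1) % n, lo, bufs[r][lo..hi].to_vec())
            })
            .collect();
        for (dst, lo, data) in sends {
            for (i, v) in data.into_iter().enumerate() {
                bufs[dst][lo + i] += v;
            }
        }
    }

    // Phase 2, all-gather: rank r now owns the fully reduced chunk (r + 1) mod n.
    // At step s it forwards chunk (r + 1 - s) mod n, and the receiver overwrites.
    for step in 0..n - 1 {
        let sends: Vec<(usize, usize, Vec<f32>)> = (0..n)
            .map(|r| {
                let c = (r + 1 + n - step) % n;
                let (lo, hi) = bounds(c);
                ((r + 1) % n, lo, bufs[r][lo..hi].to_vec())
            })
            .collect();
        for (dst, lo, data) in sends {
            for (i, v) in data.into_iter().enumerate() {
                bufs[dst][lo + i] = v;
            }
        }
    }
}
```

An off-by-one in either chunk formula leaves some rank with a partially reduced chunk, which is the class of bug a chunk-indexing fix addresses.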
TurboQuant KV cache (production-ready)
- TurboQuant KV cache quantization: Provably near-optimal KV cache compression based on random rotation + Lloyd-Max scalar quantization + QJL residual for unbiased inner products (arXiv:2504.19874). Achieves 4-6x KV cache compression with near-zero quality loss. Available via `--kv-turboquant` or presets `--kv-turboquant-preset q3_5` (near-lossless) / `q2_5` (6.4x compression)
  - Separate key/value runtimes with independent bit widths and outlier-aware mixed-precision
  - Direct attention path for single-token decode avoids full cache dequantization
  - Data-oblivious (no calibration data required): quantizes KV entries online as generated
  - Precomputed codebooks via Lloyd-Max algorithm for Beta distribution (deterministic from seed)
  - Metal kernel backend with CPU fallback
  - Phase 0: split monolithic `mod.rs` (6101 → 222 LOC) into config/core/state/bits/math submodules
  - Phase 3: GPU-resident hot/cold pipeline + Mixed K/V storage; `mixed_score` as layout oracle
  - Phase A/B: QJL ablation harness (feature-gated) + per-row `key_slot_scale` codebook adaptation
  - Phase C/C′: Variant F drop-QJL opt-in path; d128/d256 `no_qjl_2pass` fast paths (4..=8 bits)
  - Phase D: `TurboQuantPackMode` config + Fullbyte dense-values kernel
  - Phase E: `TurboQuantOutlierMode` (encode-side top-K outlier storage, zero pre-quant + decode override, outlier-bias on d128/d256 fullbyte score kernel); CPU mirror in scalar encode/decode
  - Phase F: Hamming skip-list dispatch (`skiplist_threshold` config, GPU `sign_hash` buffer, Metal Hamming-distances kernel + FFI, GQA support)
  - Mixed-precision attention parity baseline; defensive residual-norm clamp + NaN-safe encode
- Asymmetric K/V head dimensions: KV cache, TurboQuant, and fused attention now support models where key and value projections have different widths (e.g. DeepSeek MLA with `qk_head_dim != v_head_dim`)
- `pmetal serve --kv-turboquant`: TurboQuant KV cache in the serving engine with `--kv-turboquant-preset q3_5` for near-lossless 4.6x KV compression in production
Quantization & model formats
- Optimized FP8 checkpoint loading: Hugging Face FP8 `weight_scale_inv` sidecars are dequantized or repacked into MLX `mxfp8` weights for Qwen3-family native paths; mode-aware quantized matmul plumbing handles floating-point quantized weights without dense fallback
- Expanded GGUF quantization/export: `pmetal quantize` now writes standard GGUF metadata from Hugging Face configs, tokenizer/pre-tokenizer metadata, HF-to-GGUF tensor names, stacked MoE expert tensors, and method-specific file types
- Broader GGUF format coverage: quantization/dequantization support now includes K-quants, legacy Q4/Q5/Q8 variants, Q1_0, TQ1_0/TQ2_0, MXFP4, NVFP4, BF16, F16, and F32 round trips
- MLX safetensors quantization path: quality-based bit allocation with `--target-bpw`, GPU-resident weight loading, and tokenizer/config sidecar copy for MLX-format quantized exports
Inference server (OpenAI- + Anthropic-compatible)
- Continuous batching with paged-KV-style admission + shared prefix cache in `pmetal-serve`: per-request slot scheduling, KV-cache prefix sharing, concurrent decode for many simultaneous chats
  - Token-block admission budget (`--cb-block-size`, `--cb-max-blocks`) prevents over-admitting active contexts and skips head-of-line requests when a smaller queued request fits the remaining block budget
  - Continuous batching now reuses the shared prompt prefix cache, prefills only uncached suffix tokens, and saves extended prefixes after final prefill
  - Continuous batching derives the same cache mode as the single-request serving path, honoring `--kv-quant` and `--kv-turboquant`
  - Hybrid/recurrent models are rejected from continuous batching instead of silently running without recurrent state
- Anthropic-compatible `/v1/messages` endpoint: streaming `message_start` → `content_block_start` → `content_block_delta*` → `content_block_stop` → `message_delta` → `message_stop` events; non-streaming JSON path
- `/v1/embeddings` endpoint: 17 architectures supported via `forward_hidden` (Llama/Llama4/Qwen2/Qwen3/Qwen3MoE/Qwen3Next/Mistral/Gemma/Gemma4/Phi/Phi4/DeepSeek/Cohere/Granite/GptOss/NemotronH/BERT); pooling via `pmetal_models::pooling`
- Token logprobs: `SamplingParams.logprobs_top_n` plumbed end-to-end through non-streaming and SSE streaming on both `/v1/chat/completions` and `/v1/completions`. New `pmetal_models::generation::token_logprobs` primitive; ANE/CPU paths emit `logprob: None`
- Best-effort tool calling on `/v1/chat/completions`: `try_parse_tool_calls` accepts `{name, arguments}` or `{tool_calls: [...]}`. `ChatCompletionRequest.tools` gates the attempt; chat templating threads tool defs into the rendered prompt
- `IncrementalDecoder<Aux>` SSE buffer: shared UTF-8 boundary buffer + per-token aux pipelining (used for logprobs alignment) across chat/completions/anthropic streams
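The token-block admission budget described above can be pictured with a small sketch. This is not the `pmetal-serve` scheduler: the `Pending` struct, `blocks_for`, and `admit` are hypothetical names, and the sketch assumes `block_size` plays the role of `--cb-block-size` while `free_blocks` is whatever remains of the `--cb-max-blocks` budget.

```rust
/// Hypothetical queued request; only its token count matters for admission.
#[derive(Debug, Clone, PartialEq)]
struct Pending {
    id: u32,
    tokens: usize,
}

/// Blocks needed to hold `tokens` at `block_size` tokens per block (ceiling division).
fn blocks_for(tokens: usize, block_size: usize) -> usize {
    assert!(block_size > 0);
    (tokens + block_size - 1) / block_size
}

/// Walk the queue in FIFO order and admit every request that still fits.
/// A request that is too large is skipped rather than blocking the ones
/// behind it, and stays in the queue in its original position.
fn admit(queue: &mut Vec<Pending>, block_size: usize, free_blocks: &mut usize) -> Vec<u32> {
    let mut admitted = Vec::new();
    let mut remaining = Vec::new();
    for req in queue.drain(..) {
        let need = blocks_for(req.tokens, block_size);
        if need <= *free_blocks {
            *free_blocks -= need;
            admitted.push(req.id);
        } else {
            remaining.push(req);
        }
    }
    *queue = remaining;
    admitted
}
```

With 16-token blocks and 8 free blocks, a 200-token request at the head of the queue (13 blocks) is skipped while a 64-token request behind it (4 blocks) is admitted. A production scheduler would typically also bound how long a large request can be bypassed; that policy is not described in these notes and is left out here.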
Job orchestration substrate (TUI / GUI / MCP / CLI parity)
- `JobSpec` substrate: 16 canonical spec types in `pmetal-core` (Train, Distill, GRPO, Bench, Eval, Pretrain, Tokenize, Serve, Generate, RLKD, EmbedTrain, DFlash, Memory, Ollama, …) with `#[derive(JobSpec)]` proc-macro
- `JobEvent` canonical streaming protocol: progress / metric / log / artefact / complete / failed events emitted by all 4 surfaces (CLI, TUI, GUI, MCP)
- CLI: 8 specced `Commands` variants flattened; 613 LOC removed from `main.rs`; `cli/<sub>.rs` Args structs and JobSpec argv round-trip tests; `--log-events` flag stub
- TUI: 14 tabs with full CLI parity, `?`-key help overlay, `Ctrl+1..9` tab jump, active-job footer badge, descriptor-driven forms with shared `FormTabState` primitive; channel-based metrics streaming (`ChannelMetricsCallback`) for direct-path train/distill/grpo/bench/eval/pretrain
- GUI (Tauri): complete 9-DTO frontend-lockstep migration to `*Spec` types; Serve, Bench, Eval, Jobs, Pretrain pages; embed-train + rlkd + ollama routes; channel-based metrics streaming
- MCP: 51-tool server with migrated train/pretrain/tokenize/memory/dflash/generate coverage, allowlisted CLI passthrough tools for newly added CLI flags, and a JobEvent JSONL consumer for managed background jobs
SOTA distillation (pmetal-distill)
- Universal Logit Distillation (ULD): Wasserstein-1 over sorted logit distributions for cross-tokenizer KD (Boizard et al. 2024); optional `top_k` truncation; permutation-invariant by design
- Generalized Knowledge Distillation (GKD): λ-weighted off-policy + on-policy KL blend (Agarwal et al. 2024); `OnPolicySampler` trait with `GreedySampler` reference impl; `compute_full(t_off, s_off, t_on, s_on, T)`
- MiniLLM: reverse-KL with optional teacher-mix `target = mix·T + (1-mix)·S` (Gu et al. 2024)
- Skewed JSD (DistiLLM-2): `α·KL(T||M_α) + (1-α)·KL(S||M_α)` with `M_α = α·T + (1-α)·S`, log-sum-exp computation; α=0.5 reduces to standard symmetric JSD (Ko et al. 2024)
- Attention-transfer loss + weighted Metal path for hidden-state distillation
- Offline teacher-logit caching: `pmetal distill --offline-cache <path>` precomputes teacher logits to disk; new `Int8PerToken` compressed-block variant replaces NaN-sentinel scheme with explicit `per_token_meta` field (legacy `Int8` variant retained for read-back)
- `DistillLossOutput.metrics: HashMap<&'static str, f32>`: lazily-evaluated `teacher_entropy`, `student_entropy`, `kl_per_token`, `top1_agreement` exposed to trainer JSONL/TUI streaming
- TAID difficulty-aware observability: `alpha_var` surfaced for per-step monitoring
- Configurable `ignore_index`: PyTorch-standard `-100` default on `TrainingConfig`; safe label clamping before gather
- Hidden-state shape assertions before matmul (clear error vs. silent broadcast bug)
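The skewed JSD formula in the list above can be evaluated directly on small probability vectors. The sketch below is only an illustration of the formula: it works on already-normalized probabilities in `f64` rather than on logits with log-sum-exp as the entry describes, and the names `kl` and `skewed_jsd` are assumptions, not `pmetal-distill` functions.

```rust
/// KL(p || q) for discrete distributions, skipping zero-probability terms of p.
fn kl(p: &[f64], q: &[f64]) -> f64 {
    let mut total = 0.0;
    for (&pi, &qi) in p.iter().zip(q.iter()) {
        if pi > 0.0 {
            total += pi * (pi / qi).ln();
        }
    }
    total
}

/// alpha * KL(T || M) + (1 - alpha) * KL(S || M), with M = alpha * T + (1 - alpha) * S.
/// `t` is the teacher distribution, `s` the student distribution, 0 < alpha < 1.
fn skewed_jsd(t: &[f64], s: &[f64], alpha: f64) -> f64 {
    assert_eq!(t.len(), s.len());
    let m: Vec<f64> = t
        .iter()
        .zip(s.iter())
        .map(|(&ti, &si)| alpha * ti + (1.0 - alpha) * si)
        .collect();
    alpha * kl(t, &m) + (1.0 - alpha) * kl(s, &m)
}
```

At `alpha = 0.5` the mixture is the plain midpoint, so swapping teacher and student gives the same value, which is the symmetric-JSD special case the entry mentions.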
SOTA model merging (pmetal-merge)
- Fisher merging (Matena & Raffel 2022): diagonal-Fisher-weighted average `θ = Σ F_i⊙θ_i / (Σ F_i + ε)`; lazy-loaded Fisher safetensors; `fallback_to_mean` for tensors without Fisher entries
- RegMean (Jin et al. 2023): closed-form linear-layer merge `W = (Σ G_i)⁻¹ · (Σ G_i W_i)` via hand-rolled Gauss-Jordan `pseudo_inverse_2d` with Tikhonov ridge; falls back to mean for non-2D weights
- MoE expert permutation alignment: per-(model, layer) Hungarian solver (Jonker-Volgenant style, O(N³)) over L2-normalized cosine similarity of expert fingerprints; tensor-name remapping `experts.{i}.` → `experts.{π(i)}.` before merge; gated by `align_moe_experts`
- Honor `config.dtype` in save path: `MergeBuilder.dtype` builder, `TensorWriter::with_dtype` plumbing, per-dtype byte packing for F16/BF16/F32; previously hardcoded to F16
- Cross-model dtype consistency check: `verify_source_dtypes` errors on mismatch unless `allow_mixed_dtype` is set
- Tied-embedding detection: `lm_head.weight` and `embed_tokens.weight` aliasing detected and merged once under canonical name
- Tokenizer + config sidecar copy: `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`, `config.json`, `generation_config.json` copied on full-model merge; `config.json.torch_dtype` patched to match output dtype
- Post-merge sanity sweep (`Sa...
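As a worked illustration of the Fisher-merging formula quoted in the list above, `θ = Σ F_i⊙θ_i / (Σ F_i + ε)`, the sketch below merges flat parameter vectors element-wise. The function name `fisher_merge` and the flat `Vec<f32>` representation are assumptions for this example; the `pmetal-merge` implementation operates on safetensors and is not shown here.

```rust
/// Diagonal-Fisher-weighted average of several models' parameters.
/// `params[i]` and `fishers[i]` are model i's flattened weights and its
/// diagonal Fisher estimates; `eps` keeps the denominator away from zero.
fn fisher_merge(params: &[Vec<f32>], fishers: &[Vec<f32>], eps: f32) -> Vec<f32> {
    assert!(!params.is_empty());
    assert_eq!(params.len(), fishers.len());
    let len = params[0].len();
    let mut num = vec![0.0f32; len];
    let mut den = vec![eps; len];
    for (theta, f) in params.iter().zip(fishers) {
        assert_eq!(theta.len(), len);
        assert_eq!(f.len(), len);
        for j in 0..len {
            num[j] += f[j] * theta[j]; // Σ F_i ⊙ θ_i
            den[j] += f[j]; //            Σ F_i (+ ε from the initial value)
        }
    }
    num.iter().zip(&den).map(|(n, d)| n / d).collect()
}
```

With equal Fisher values the result is the plain mean; a parameter with a larger Fisher value in one model is pulled toward that model's weight.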
PMetal v0.4.0
[0.4.0] - 2026-03-23
Added
- `pmetal-mcp` crate: Full MCP (Model Context Protocol) server exposing 45 tools for Claude Desktop and other MCP clients. Covers all pmetal functionality: training, inference, distillation, GRPO, RLKD, quantization, model merging, dataset operations, evaluation, benchmarking, model search, and Ollama export
  - Device & models: `device_info`, `search_models`, `download_model`, `list_local_models`, `model_fit`, `model_info`
  - Inference: `generate` (blocking), `chat` (via running serve instance), `start_serve`, `benchmark`, `bench_train`, `bench_gen`, `bench_corpus`
  - Training: `train`, `distill`, `grpo`, `rlkd`, `embed_train`, all as background jobs with full parameter coverage matching the CLI
  - Runtime training control: `job_set_lr`, `job_reduce_lr`, `job_reset_lr`, `job_save_checkpoint`, `job_graceful_stop` for LLM-driven adaptive training via the control file protocol
  - Job management: `list_jobs`, `job_status`, `job_logs`, `stop_job`
  - Dataset ops: `dataset_analyze`, `dataset_preview`, `dataset_validate`, `dataset_download`, `dataset_convert`, `dataset_filter`, `dataset_split`, `dataset_merge`, `dataset_sample`, `dataset_template`, `dataset_prepare`
  - Quantization & conversion: `quantize`, `fuse_lora`, `merge_models`, `pack_experts`, `ollama_create`, `ollama_modelfile`
  - Evaluation: `eval_perplexity`
  - All tools include rich `#[description]` annotations for parameter documentation in the MCP schema
  - Standalone binary (`pmetal-mcp`) for Claude Desktop + `pmetal mcp` subcommand (behind `mcp` feature flag)
  - Uses `turbomcp` v3.0.7 from crates.io
- Runtime training control protocol: Extended the control file protocol (`.lr_control.json`) with `SaveCheckpoint` and `GracefulStop` commands. The adaptive LR controller now polls the control file before checking its `enabled` flag, so external agents (MCP, TUI) can always send commands regardless of whether automatic detection is active
- `--no-adaptive-lr` flag: Disables automatic spike/plateau/divergence detection while keeping the control file protocol active. Enables fully LLM-driven learning rate control: the agent observes loss via `job_status` and manually adjusts LR via `job_set_lr` / `job_reduce_lr`
- UltraFusion execution planner (`pmetal-distributed`): Per-die stage planner for M-series Ultra Macs with in-memory channel transport backend for same-process links, avoiding TCP overhead on UltraFusion interconnect
- MPP FlashAttention for head_dim 64/96: Metal 4 MPP flash attention kernel now supports head_dim 64, 96, and 128 with stride-2/stride-3 SIMD lane packing and causal/non-causal variants
- Tuna persistent disk cache: The auto-tuner now persists benchmark results to disk, avoiding re-tuning on restart. Expanded search covers FlashAttention, FusedCrossEntropy, FusedNormLora, and FusedSwiGLU via function constants
- MoE GPU top-k selection: Expert top-k selection moved from CPU sort to GPU `argpartition_axis`, eliminating a sync point in the MoE forward path
- `bench-workload` CLI command: Benchmark a real cached workload for inference and short LoRA training with named presets (`--preset dense-qwen3`, `--preset hybrid-qwen3next`)
- KV cache quantization auto-select: `--kv-quant` is now optional; omitting it auto-selects the fastest quantization mode that fits the device memory budget
- UltraFusion info display: `pmetal info` shows UltraFusion topology, die count, and local executor plan on Ultra Macs
- Qwen3 LoRA RoPE reset: Qwen3 LoRA and QLoRA gain dense attention and RoPE reset support
- ANE real-time evaluation: Experimental `_ANEClient` real-time dispatch with automatic fallback to standard evaluation on failure. Propagated via `--ane-real-time` CLI flag
- `bench-corpus` CLI command: Structured kernel benchmarking with device-tier-aware test cases, JSON reporting, and `--quick` / `--output` flags
- GPU memory bandwidth probing: Real GPU copy benchmark replaces static spec-table lookup, with disk-cached results and spec-table fallback
- Persistent runtime kernel backend selection: Benchmark-and-persist infrastructure races MLX vs MPP backends on Apple10/M5, validates numerical agreement, and caches the winner to disk for 4-bit quantized linear, fused attention, and LoRA matmul
- MPP kernel tile variants: Metal 4 GEMM supports parameterized tile variants (32x32, 64x32, 32x64, 64x64) with Tuna auto-tuner selection per device and problem shape
- Serve ANE/CPU-hybrid engine caching: Serve engine auto-selects optimal backend (ANE, CPU-hybrid, GPU) at startup with permanent downgrade on failure. Compiled engines cached across requests
- Rollback enabled by default for LoRA: Best-loss checkpoint rollback now defaults to on with extended warmup grace period. Persistent snapshot to disk via atomic write. `for_lora()` factory for recommended defaults
- Extended StepMetrics: `gpu_fwd_bwd_ms`, `optimizer_ms`, `io_staging_ms`, `overhead_ms` fields for fine-grained training profiling
- Zero-copy MoE expert dispatch: `ExpertBufferPool` with `read_experts_aligned` + `encode_expert_aligned` for pread-to-Metal expert weight dispatch. Auto-enable KV-Q8 when memory-constrained
- ANE dual-die support: On UltraFusion chips, compile variant-B kernel set with distinct MIL hashes and alternate per step for dual-die thermal distribution. Auto-recompile on throughput degradation (>15% or >25K dispatches)
- Batched parameter eval: Model dispatcher evaluates parameters in batches of 128 tensors per sync instead of all-at-once, reducing peak memory during model loading
- Architecture enhancements: DeepSeek V3/V3.2, GPT-OSS, Jamba, Llama 4, Qwen3, and Qwen3-MoE model improvements and weight sanitization refinements
- Third-party attribution: Complete THIRD_PARTY_NOTICES with entries for mlx-lm, llama.cpp/GGML, Candle, and Burn
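The "MoE GPU top-k selection" entry above replaces a full sort with a partition-style selection. The GPU call itself (`argpartition_axis`) is not reproduced here; as a CPU analogy, the Rust standard library's `select_nth_unstable_by` performs the same kind of partial ordering. The function name `top_k_experts` and the flat score slice are assumptions for this sketch.

```rust
/// Return the indices of the `k` highest router scores, in ascending index order.
/// Uses a partition (average O(n)) instead of a full O(n log n) sort.
fn top_k_experts(scores: &[f32], k: usize) -> Vec<usize> {
    assert!(k >= 1 && k <= scores.len());
    let mut idx: Vec<usize> = (0..scores.len()).collect();
    // Descending comparator: after this call idx[..k] holds the k best
    // indices in unspecified order, and idx[k..] holds the rest.
    idx.select_nth_unstable_by(k - 1, |&a, &b| scores[b].total_cmp(&scores[a]));
    idx.truncate(k);
    idx.sort_unstable(); // only k elements, purely for deterministic output
    idx
}
```

For example, `top_k_experts(&[0.1, 0.7, 0.05, 0.9, 0.3], 2)` selects experts 1 and 3.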
Changed
- ANE is now opt-in: The `--no-ane` flag has been replaced with `--ane` across CLI, TUI, orchestrator, and MCP. ANE training is experimental and limited to small models, so it defaults to off. The orchestrator's `DispatchConfig` now sets `ane: false` by default
- Gradient checkpointing support corrected: Qwen3 and Qwen3Next no longer claim gradient checkpointing support (was incorrectly advertised)
- Training loop refactored: Gradient checkpointing helper extracted, step logging tracks step numbers correctly, training loop tests expanded
Removed
- Merge methods: Removed merge methods with incompatible licenses. Cleaned up related references across documentation and configuration
Fixed
- MetalSampler use-after-free: Retained source logits array until GPU completion in serve engine
- Fused merge Tuna cache: Now uses persistent disk cache instead of ephemeral per-session tuning
Downloads
| Asset | Description |
|---|---|
| `pmetal-*-aarch64-apple-darwin.tar.gz` | CLI binary + mlx.metallib (Apple Silicon) |
| `PMetal-*-aarch64-apple-darwin-*.dmg` | Desktop GUI app (Apple Silicon) |
| `mlx.metallib` | MLX Metal shader library (standalone) |
CLI Quick Start

    tar xzf pmetal-*-aarch64-apple-darwin.tar.gz
    ./pmetal train --model Qwen/Qwen3-0.6B --dataset train.jsonl --output ./output

GUI
Mount the DMG and drag PMetal to Applications.
Full Changelog: v0.3.13...v0.4.0
PMetal v0.3.7
[0.3.7] - 2026-03-16
Added
- `pmetal merge` CLI command: Model merging exposed as a first-class CLI command supporting all merge methods (Linear, SLERP, TIES, DARE, DELLA, NearSwap, Model Stock) with `--method`, `--base`, `--t`, `--weight-a`, `--weight-b`, `--density`, and `--dtype` flags
- `pmetal eval` CLI command: Dataset evaluation command that measures loss/perplexity over a validation set with optional LoRA adapter, `--num-samples` cap, and `--json` output
- `pmetal info` CLI command: Prints device and runtime information; `--json` flag emits structured JSON for scripting
- `pmetal search --json` output: Structured JSON output mode for search results including fit estimates, download counts, parameter estimates, and tags; enables scripting and GUI integration
- `QuantizeMethod` enum: Replaces the string `--method` argument for `pmetal quantize` with a typed enum (`dynamic`, `q8_0`, `q4_k_m`, etc.); invalid methods now fail at argument parsing rather than deep inside the quantizer
- GRPO CLI arguments: `--epochs`, `--lora-r`, `--lora-alpha`, `--max-completion-length`, and `--seed` exposed as CLI arguments, replacing previous hardcoded defaults
- `loraplus_lr_ratio` and `neftune_noise_alpha`: New fields on training loop configurations; they enable LoRA+ differential learning rates and NEFTune noise injection directly from config
- `trainable_params()` helper: New utility in `pmetal-lora` for counting total vs. trainable parameters, useful for logging and memory estimation
- `lora_alpha: f32`: Distillation CLI and `run_distillation_cli` now accept `lora_alpha` as `f32` instead of `usize` for finer-grained scaling control
- `seed` parameter in distillation and GRPO CLI: Reproducible runs via explicit `--seed` flag in all training entry points
- Gemma3 sliding window auto-detection: `DynamicModel` loader now reads `model_type == "gemma3"` and sets `is_gemma3 = true` on the config, enabling the correct every-6th-layer global attention pattern without manual config overrides
- KV cache support for more architectures: `DynamicModel::forward_with_cache` now routes DeepSeek, Cohere, StarCoder2, and Llama4 to their native caching paths; RecurrentGemma and Jamba now get clear error messages that they require `forward()` directly; hybrid models (NemotronH, Qwen3Next) get a descriptive error directing to `forward_with_hybrid_cache`
- Speculative decoding greedy path: `SpeculativeDecoder::verify_greedy()` provides exact-correct verification for temperature=0 decoding using argmax equality; avoids the numerically unstable rejection-sampling limit as temperature→0
- Hub cache management (`pmetal-hub`): New `cache.rs` module with cache inspection, eviction, and size-reporting helpers
- Shared model utilities (`pmetal-models/utils.rs`): Common helpers extracted from per-architecture modules to reduce duplication
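The greedy verification idea behind the speculative decoding entry above can be sketched in a few lines: at temperature 0 a drafted token is accepted exactly when it equals the target model's argmax at that position. The free function below is a hypothetical stand-in; the real `SpeculativeDecoder::verify_greedy()` signature is not given in these notes, and the `Vec<f32>`-per-position logits layout is an assumption.

```rust
/// Index of the largest logit (first one wins on ties).
fn argmax(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best
}

/// Greedy verification of `draft` tokens against target-model logits.
/// `target_logits` has `draft.len() + 1` rows: row i is the target's
/// prediction for position i given the prefix plus `draft[..i]`.
/// Returns the tokens to emit: the accepted draft prefix followed by one
/// token chosen by the target (a correction, or a bonus if all matched).
fn verify_greedy(draft: &[usize], target_logits: &[Vec<f32>]) -> Vec<usize> {
    assert_eq!(target_logits.len(), draft.len() + 1);
    let mut out = Vec::with_capacity(draft.len() + 1);
    for (i, &d) in draft.iter().enumerate() {
        let t = argmax(&target_logits[i]);
        out.push(t);
        if t != d {
            // First mismatch: keep the target's token, discard the rest of the draft.
            return out;
        }
    }
    // Every draft token matched; the extra row yields one more token for free.
    out.push(argmax(&target_logits[draft.len()]));
    out
}
```

Because acceptance is a pure equality test, the output is identical to what the target model would have produced on its own with greedy decoding, with no probability ratios involved.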
Fixed
- Scale factor broadcasting in distillation: `squeeze` applied to the scale factor dimension so it broadcasts correctly across batch and sequence axes; previously caused shape mismatches on non-unit batch sizes
- TAID `mean_alpha` forcing GPU sync: `TaidLossOutput::mean_alpha` changed from `f32` to a lazy `Array`; the `.eval()` call is deferred until callers explicitly call `.item::<f32>()`, removing a forced GPU-CPU sync before the backward pass
- SLERP numerical stability: Added epsilon clamping in the SLERP merge path to prevent NaN when interpolation parameter is at the boundary values (0.0 or 1.0)
- Llama LoRA `trainable_params` / gradient application: Replaced 100+ lines of repeated field accesses with an `insert_adapter!` macro and loop over projection names, fixing DoRA `magnitude` parameter that was silently dropped from gradient maps
- GaLore improvements: Corrected projection matrix update schedule and subspace dimensionality handling
- Distillation hidden-state loss: Refactored alignment computation to correctly handle variable-rank teacher/student hidden state tensors
- Jensen-Shannon / KL divergence loss: Numerical stability improvements — log-sum-exp stabilization applied consistently across all reduction paths
- Offline distillation: Fixed logit cache loading to handle both single-file and sharded cache layouts
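To show where NaN can arise in SLERP and what an epsilon guard looks like, here is a generic sketch. It is not the `pmetal-merge` code and may differ from the fix described in the SLERP entry above: the function name `slerp`, the `eps` parameter, and the choice to fall back to linear interpolation are assumptions reflecting a common way of guarding the `sin(ω)` denominator.

```rust
/// Spherical linear interpolation between two weight vectors with guards
/// against division by a vanishing sin(omega).
fn slerp(a: &[f32], b: &[f32], t: f32, eps: f32) -> Vec<f32> {
    assert_eq!(a.len(), b.len());
    let t = t.clamp(0.0, 1.0);
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    // Plain linear interpolation, used whenever the angle is degenerate.
    let lerp = || -> Vec<f32> {
        a.iter().zip(b).map(|(x, y)| (1.0 - t) * x + t * y).collect()
    };
    if na < eps || nb < eps {
        return lerp();
    }
    // Rounding can push the cosine slightly outside [-1, 1]; acos would then be NaN.
    let cos = (dot / (na * nb)).clamp(-1.0, 1.0);
    let omega = cos.acos();
    let sin_omega = omega.sin();
    if sin_omega.abs() < eps {
        // (Anti-)parallel inputs: sin(omega) is ~0, so the usual weights are 0/0.
        return lerp();
    }
    let wa = ((1.0 - t) * omega).sin() / sin_omega;
    let wb = (t * omega).sin() / sin_omega;
    a.iter().zip(b).map(|(x, y)| wa * x + wb * y).collect()
}
```

Without the two guards, merging a tensor with an identical copy of itself produces `0.0 / 0.0` for both weights.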
Changed
- `lm_groups.rs` / LoRA+ optimizer groups: `build_lora_param_groups` significantly reworked; LoRA+ differential LR ratio (`loraplus_lr_ratio`) applied to `lora_b` parameters, NEFTune noise injection integrated into group construction
- GRPO trainer: `epochs`, `lora_r`, `lora_alpha`, `max_completion_length`, and `seed` plumbed through from CLI args; previously these were hardcoded to `1`, `16`, `32`, `512`, and a fixed seed
- Training loop: `loraplus_lr_ratio` and `neftune_noise_alpha` read from config and forwarded to optimizer group construction
- `pmetal-core` config / scheduler / traits: Config structs gained `loraplus_lr_ratio` and `neftune_noise_alpha` fields; scheduler types and learning rate trait bounds refined; `TrainingCallback` trait extended with blanket impls for boxed callbacks
- Data pipeline: Tokenizer, packing, `vocab_compact`, dataset, and chat template modules updated; minor correctness and efficiency fixes accumulated across the release cycle
- GGUF reader / writer / quantize: Reader handles additional tensor metadata fields; writer improves alignment padding; quantize module uses `QuantizeMethod` enum instead of string matching
- Hub search: `search_models` returns richer result structs used by both the human-readable table and the new `--json` output path; upload path fixes for large model shards
- Metal kernels: GDN, LoRA, grouped GEMM, and fused SwiGLU Metal shaders updated; improved numerical correctness and register pressure
- GUI app icons and Tauri config: Updated icons (32×32, 128×128, 128×128@2x, icns, ico) and `tauri.conf.json` for the 0.3.7 release build; Python vocoder `easy` API additions and mel spectrogram fix
Full Changelog: v0.3.6...v0.3.7
PMetal v0.3.6
[0.3.6] - 2026-03-15
Added
- Desktop GUI (Tauri + Svelte): Full desktop application for model management, training, distillation, GRPO, inference, merging, and quantization. 10 pages: Dashboard, Models, Datasets, Training, Distillation, GRPO, Inference, Merging, Quantize, Settings. Real-time training metrics with live loss charts via broadcast events. Model download with HuggingFace Hub integration, dataset browser, and inference chat interface with streaming token display
- GUI in-process execution: Training, distillation, GRPO, inference, model merging, LoRA fuse, and quantization run as direct library calls instead of shelling out to the `pmetal` binary. Eliminates binary discovery issues, reduces process overhead, and enables richer progress reporting. Device info and model metadata are also read from library APIs
- `easy::dpo()` / `easy::simpo()` / `easy::orpo()` / `easy::kto()` builders: `PreferenceTuneBuilder` in `easy.rs` for preference optimization methods. Full pipeline: model download → tokenizer → dataset loading → LoRA setup → training loop → weight saving. Supports method-specific config (DPO beta/loss type, SimPO gamma/CPO, ORPO beta, KTO desirable/undesirable weights)
- `easy::infer().generate_streaming()`: Streaming inference API with per-delta callback. Supports both base models and LoRA adapters. Return `false` from the callback to cancel early. ANE fallback emits full result as single delta
- Preference trainer `train()` methods: DPO, KTO, ORPO, and SimPO trainers now have self-contained `train()` methods with optimizer integration, batching, epoch loops, callback lifecycle, and metrics collection. Previously only per-step primitives were exposed
- `TrainingCallback::should_stop()`: Clean cancellation mechanism; callbacks return `true` to request that the training loop finish the current step and exit with a `Cancelled` error. Checked after every step in all 5 `TrainingLoop::run*` methods, all 4 preference trainer `train()` loops, and `GrpoTrainer::run()`
- `PMetalError::Cancelled`: New error variant for clean training cancellation. Corresponding `Cancelled` variants added to `SftError`, `DpoError`, `KtoError`, `OrpoError`, `SimpoError`, and `GrpoError`
- Preference batch padding utilities: `pad_u32_sequences`, `pad_i64_sequences`, `pad_f32_sequences` in `preference_batch.rs` for batching variable-length preference pairs
- NemotronH runtime FP8 quantization: `quantize_fp8()` converts float weights to FP8 (E4M3) at runtime for all four block types (Mamba, attention, MLP, MoE). Shared helpers `materialize_linear_weight` and `linear_forward_with_optional_fp8` consolidate FP8 dequantization across the model. MoE weights are restacked after quantization for batched dispatch
- FluxPipeline::from_pretrained: Load Flux diffusion pipelines from HuggingFace-style model directories. Discovers components via `model_index.json`, parses both native and diffusers-style config keys for CLIP, T5, FluxDiT, and VAE
- Python training callbacks: `Trainer.add_callback()` now wires callbacks into the training loop. Built-in `ProgressCallback`, `LoggingCallback`, and `MetricsJsonCallback` map to native Rust implementations; arbitrary Python objects bridge through `PythonCallbackBridge`
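The cooperative cancellation pattern behind `TrainingCallback::should_stop()` can be sketched as follows. This is a simplified, hypothetical shape: the real trait has more hooks, and the names `TrainError`, `StopAfter`, and `run` exist only in this example.

```rust
/// Simplified error type; stands in for the `Cancelled` variants listed above.
#[derive(Debug, PartialEq)]
enum TrainError {
    Cancelled,
}

/// Reduced, hypothetical version of a training callback trait.
trait TrainingCallback {
    fn on_step_end(&mut self, _step: usize) {}
    /// Return `true` to ask the loop to stop after the current step.
    fn should_stop(&self) -> bool {
        false
    }
}

/// Example callback that requests cancellation after `limit` steps.
struct StopAfter {
    limit: usize,
    seen: usize,
}

impl TrainingCallback for StopAfter {
    fn on_step_end(&mut self, _step: usize) {
        self.seen += 1;
    }
    fn should_stop(&self) -> bool {
        self.seen >= self.limit
    }
}

/// Training loop skeleton: the step always completes before cancellation
/// is checked, so no state is torn down mid-step and no panic is involved.
fn run(
    total_steps: usize,
    callbacks: &mut [Box<dyn TrainingCallback>],
) -> Result<usize, TrainError> {
    for step in 0..total_steps {
        // ... forward pass, backward pass, and optimizer update would go here ...
        for cb in callbacks.iter_mut() {
            cb.on_step_end(step);
        }
        if callbacks.iter().any(|cb| cb.should_stop()) {
            return Err(TrainError::Cancelled);
        }
    }
    Ok(total_steps)
}
```

A GUI cancel button would typically flip a shared flag that a callback reads in `should_stop()`; the caller then matches on the `Cancelled` error instead of unwinding.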
Fixed
- Training cancellation via `panic_any` replaced: GUI and TUI previously used `std::panic::panic_any(CancelledRun)` + `catch_unwind` to abort training, which was fragile, UB-prone through FFI, and could be swallowed by an intermediate `catch_unwind`. Replaced with `TrainingCallback::should_stop()` returning a clean `Err(Cancelled)` from the training loop
- GUI QLoRA silently failed on non-Llama models: `run_qlora_training_in_process` hardcoded `LlamaConfig` deserialization, causing confusing errors or silent misconfiguration for Gemma/Qwen/Phi models. Now detects `model_type` from config.json and returns a clear error for unsupported architectures
- GUI `resume_from` silently ignored: Training config accepted `resume_from` but discarded it (`let _ = eval`). Now returns an error directing users to the CLI
- GUI GRPO with no reward function produced noise: `DummyReward` returning constant 0.1 for all completions made GRPO training meaningless when reasoning rewards were disabled. Now requires explicit reward configuration
- Preference trainers doubled compute per step: DPO, KTO, ORPO, and SimPO `train()` methods ran a second full forward pass after the gradient step solely for logging metrics. Replaced with `RefCell` side-channels that capture metric arrays from within the autograd closure: same metrics, zero extra compute
- Base model thinking mode: Auto-detect base vs instruct models and disable `<think>` tag prefill for base models. Base models don't understand thinking tags, causing infinite generation without a closing tag
- Fused model 5x slower than LoRA: Skip ANE-hybrid path for models under 2B parameters where GPU KV-cache decode is significantly faster (115 vs 20 tok/s). ANE-hybrid benefits larger models where prefill dominates
- DataLoader panics on bad images: Replace `panic!()` in VLM batch construction with proper `DataLoaderError` enum and `try_next_batch()` method. Image preprocessing failures and missing-image errors now propagate as `Result` instead of crashing
- Division by zero with log_every=0: Clamp `log_every` and `save_every` to minimum 1 across `TrainingLoop`, `LoggingCallback`, `CheckpointCallback`, and CLI
- LoRA scaling with rank 0: `LoraConfig::scaling()` returns 0.0 when rank is 0 instead of dividing by zero
- BF16 LoRA weights: `sanitize_loaded_weights()` converts BF16 tensors to FP16 since MLX doesn't natively support BF16 on Apple Silicon
- Qwen3Next silent weight mismatch: Weight loading now returns errors for unmatched or missing parameters instead of logging a warning and continuing with a partially loaded model
- Dataset download only fetched README: `download_dataset()` now enumerates repo files and downloads actual data files (parquet, json, jsonl, csv, arrow, etc.) with split-aware filtering
- Model download silent failures: `download_model()` tracks per-file failures and reports them instead of silently skipping failed downloads
- Flux loading via DynamicModel: `DynamicModel::load()` for Flux now returns an error directing to `FluxPipeline` instead of incorrectly loading a diffusion model as a causal LM
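Two of the fixes above, the `log_every=0` clamp and the rank-0 LoRA scaling guard, are small enough to show in full. The struct below is a reduced, hypothetical `LoraConfig` with only the two fields needed, using the standard `alpha / rank` scaling; the real type has more fields (for example RSLoRA settings), and `should_log` is a made-up helper.

```rust
/// Reduced stand-in for a LoRA configuration (the real type has more fields).
struct LoraConfig {
    rank: usize,
    alpha: f32,
}

impl LoraConfig {
    /// Standard LoRA scaling alpha / rank, defined as 0.0 for rank 0 so the
    /// adapter contributes nothing instead of producing inf or NaN.
    fn scaling(&self) -> f32 {
        if self.rank == 0 {
            0.0
        } else {
            self.alpha / self.rank as f32
        }
    }
}

/// Whether to log at `step`; a configured interval of 0 is treated as 1,
/// which avoids the `step % 0` panic.
fn should_log(step: usize, log_every: usize) -> bool {
    step % log_every.max(1) == 0
}
```

In Rust, integer remainder by zero panics at runtime while float division by zero yields `inf` or `NaN`, so the two guards protect against different failure modes.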
Changed
- GUI architecture: library calls replace subprocess spawning: Training, distillation, GRPO, inference, merge, fuse, and quantize commands now call `pmetal` library functions directly instead of spawning `pmetal` CLI as a child process. System info reads from `MetalContext::global()` instead of parsing `pmetal memory` stdout. Removes `which` and `futures-util` dependencies
- TUI direct training execution: `command_runner.rs` dispatches `train`, `distill`, and `grpo` commands as in-process library calls via `run_direct_command()`, falling back to subprocess for other commands. Training parameters parsed from `CommandSpec` args with `parse_arg` / `required_arg` / `optional_arg` helpers
- ORPO loss computation refactored: `compute_orpo_loss_static` now contains the full computation directly instead of creating a throwaway `OrpoTrainer` instance. The instance method `compute_orpo_loss` delegates to it
- SimPO gradient-safe loss path: New `compute_loss_with_cpo_for_grad` static method keeps the computation graph lazy (no `.eval()` / `.item()` calls) for correct autograd. The existing `compute_loss_with_cpo` remains for non-grad contexts
- `FinetuneBuilder` expanded: New builder methods `lora_dropout()`, `use_rslora()`, `use_dora()`, `gradient_checkpointing_layers()`, `callback()`, `metrics_path()`. LoRA config now forwards dropout, RSLoRA, and DoRA settings
- GRPO CLI gains new parameters: `epochs`, `lora_r`, `lora_alpha`, `max_completion_length` exposed as CLI arguments and TUI form fields. GRPO now saves `adapter_config.json` alongside LoRA weights
- CLI `emit_console_output` flag: Training, distillation, and GRPO CLI functions accept `emit_console_output: bool` and `extra_callbacks: Vec<Box<dyn TrainingCallback>>` to suppress terminal output when called from GUI/TUI
- DataLoader error handling: New `DataLoaderError` enum with `Mlx`, `ImagePreprocess`, and `MissingImages` variants. All 7 training loop entry points migrated from `next_batch()` to `try_next_batch()`
- AdapterManager validation: `load()` now validates path existence, checks for adapter artifacts in directories, and rejects unsupported file types
- Metal shader build isolation: Shader compiler cache redirected to build output directory, preventing pollution of user's home directory
- unsafe_code lint scoping: Moved blanket `#![allow(unsafe_code)]` from crate-level `lib.rs` into individual modules that contain unsafe blocks across pmetal-metal, pmetal-mlx, pmetal-models, pmetal-trainer, pmetal-distill, and pmetal-distributed
Full Changelog: v0.3.5...v0.3.6
v0.3.5
Full Changelog: v0.3.4...v0.3.5
v0.3.4
[0.3.4] - 2026-03-14
Added
- Mixture-of-Depths (MoD) for Llama 4: Proper implementation per Raposo et al. (2024): lightweight router with `argpartition_axis` top-k, gather-before-compute on sub-batch, scatter-after, BCE auxiliary loss. Configurable capacity factor and per-layer selection
- Llama 4 RoPE: Real RoPE implementation via `pmetal_mlx::kernels::rope::apply_rope` (Metal-accelerated), replacing the placeholder stub. Correctly wired into iRoPE layer dispatch: RoPE layers get rotary embeddings, NoPE layers skip them
- Llama 4 temperature scaling: Per Meta's formula `log(floor((pos+1)/floor_scale) + 1) * attn_scale + 1.0`, applied to Q states in NoPE layers before QK matmul for long-context attention stabilization
- Llama 4 GQA: KV-head broadcast expansion for grouped-query attention; enables Scout (40 Q / 8 KV) and Maverick configs
- MoE top-k > 1: `Llama4Router` uses `argpartition_axis` for O(n) expert selection with L1-normalized weights and per-slot dispatch loop, replacing hardcoded argmax
- ANE fused kernels: `gen_dynamic_sdpa_fwd` (single-kernel attention: RMSNorm + QKV + SDPA + Wo) and `gen_dynamic_ffn_w13` (single-kernel FFN: RMSNorm + W1 + W3 + SiLU), replacing 6+ separate ANE evaluations per layer
- ANE fused backward: `gen_dynamic_ffn_bwd_w2t` and `gen_dynamic_ffn_bwd_w13t` for fused FFN backward pass
- Metal dequantization kernels: Q4_0 and IQ4_XS Metal compute shaders, verified correct per GGML spec. Bridge methods in `MlxMetalBridge` for GPU-accelerated dequantization
- Cancellation safety infrastructure: `CompletionToken::Drop` guard in `AsyncScheduler` waits for in-flight GPU commands; `retain_resource()` / `as_retained()` for Metal buffer lifetime extension
- IoSurface helpers: `write_f32_strided_at`, `write_f32_at_col_offset`, `zero_channel_range_f32` for fused backward kernel IO
- CloudBridge: Complete training state export (weights, optimizer state, RNG, dataloader position, metadata) with working Python bootstrap scripts for FSDP/DeepSpeed cluster resumption and Rust-side loader functions
- Formal verification: `cargo-kani` proofs for ring all-reduce chunk arithmetic (95 checks) and k-ary tree topology consistency (607 checks), with justfile recipes
- Reasoning templates: `MathReasoningTemplate` (GRPO + accuracy/format rewards) and `CodeReasoningTemplate` (structural code fence + test case matching)
- Reasoning dataset auto-detection: `pmetal dataset prepare` automatically detects `problem`/`thinking`/`solution` columns and formats them as `<think>` tagged ChatML conversations
- `--columns` flag: General column remapping for `dataset prepare` (e.g., `--columns "instruction=question,output=answer"`)
- `adapter_config.json`: Saved alongside LoRA weights during training (r, alpha, target_modules, use_rslora). Loaded automatically at inference and fuse time, which eliminates config guesswork
- Supply chain: `cargo-vet` initialized with Mozilla, Google, and Bytecode Alliance audit imports; 17 workspace crates covered; 5 transitive dependency exemptions with exact lockfile versions
- Tracing spans: 6 `info_span!` markers in Python trainer for phase-level observability (model_resolve, load_tokenizer, load_dataset, load_model, training_loop, save_weights)
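The temperature-scaling formula quoted above can be evaluated directly. The following is a minimal sketch, not pmetal's implementation: the function names are hypothetical, `log` is assumed to be the natural logarithm, and positions are assumed to be 0-based.

```rust
/// Attention temperature scale for the token at 0-based position `pos`,
/// following the formula quoted in the release notes:
///   log(floor((pos + 1) / floor_scale) + 1) * attn_scale + 1.0
/// Assumption: `log` is the natural logarithm.
fn attn_temperature_scale(pos: u64, floor_scale: f64, attn_scale: f64) -> f64 {
    let bucket = ((pos as f64 + 1.0) / floor_scale).floor();
    (bucket + 1.0).ln() * attn_scale + 1.0
}

/// Scale one query vector in place, as a NoPE layer would do before the
/// QK matmul. Hypothetical helper operating on a plain slice.
fn scale_query(q: &mut [f32], pos: u64, floor_scale: f64, attn_scale: f64) {
    let s = attn_temperature_scale(pos, floor_scale, attn_scale) as f32;
    for v in q.iter_mut() {
        *v *= s;
    }
}
```

With illustrative values `floor_scale = 8192.0` and `attn_scale = 0.1`, every position below 8191 keeps a scale of exactly 1.0, so the scaling only takes effect at long context.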
Fixed
- LoRA inference garbage output: Merged LoRA weights into base model at inference time (`W += scale*B@A`), matching mlx-lm's pattern. The separate-forward path had dtype mismatch issues (BF16 base × F32 LoRA)
- Auto-chat mode regression: Removed heuristic that forced chat template on base models just because their tokenizer has `<|im_end|>`. Chat mode now requires explicit `--chat` or an instruction-tuned model
- Missing EOS in training data: Training sequences now end with the model's actual EOS token (e.g., `<|endoftext|>` for Qwen). Previously they only had the turn delimiter (`<|im_end|>`), so the model never learned to stop generating
- Fuse command wrong alpha/rank: `pmetal fuse` now reads `adapter_config.json` for correct alpha and rank instead of defaulting to `scale=1.0`. Also filters MLP LoRA weights (rank=0) when auto-detecting rank from shapes
- ANE `x2norm` backward bug: FFN weight gradients (`dW1`, `dW3`) were computed against the wrong pre-norm tensor (`xnorm` from attention block instead of `x2norm` from FFN block). Restored `x2norm` field and CPU RMSNorm recomputation for gradient correctness
- ANE `sdpa_bwd` surface dtype: Backward SDPA output surfaces were allocated as fp32 but ANE kernels produce fp16; the stride mismatch corrupted dV/dQ/dK gradients. Fixed to `IoSurface::for_tensor()` (fp16)
- MoD argpartition sign: Router negated weights before `argpartition_axis`, selecting bottom-k (least important) tokens instead of top-k. Removed negation
- MLX bridge `copy_as_f32` regression: Renamed methods dropped auto dtype conversion, so callers passing the wrong dtype would panic. Restored `copy_as_f32`/`copy_as_f16` with auto-conversion
- MLX bridge `view_f32` eval: The `.eval()` call before accessing the data pointer had been removed, so unevaluated arrays returned null. Restored defensive eval
- Python API surface: Restored `ProgressCallback`, `LoggingCallback(log_every=10)`, `__version__`, and `PythonCallbackBridge` that were deleted during PyO3 migration
- TUI training completion: Reads final metrics from JSONL file on disk (immune to polling lag). Shows actual loss and step count instead of `0.0000` / sample count
- TUI Steps/min overflow: Guards against divide-by-zero when `total_ms=0`; shows `—` instead of `60000`
- Dataset prepare panic: Empty results no longer crash with index-out-of-bounds. Shows diagnostic message with format hints
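The MoD argpartition sign fix above comes down to the direction of a partial selection. A rough sketch of the idea, using the standard library's `select_nth_unstable_by` as a stand-in for `argpartition_axis`; the function `top_k_indices` is hypothetical and not part of pmetal:

```rust
/// Indices of the `k` largest scores, in unspecified order.
/// Average O(n) partial selection rather than a full sort.
fn top_k_indices(scores: &[f32], k: usize) -> Vec<usize> {
    let k = k.min(scores.len());
    if k == 0 {
        return Vec::new();
    }
    let mut idx: Vec<usize> = (0..scores.len()).collect();
    // Descending comparator: after partitioning, positions 0..k hold the
    // indices of the k largest scores.
    idx.select_nth_unstable_by(k - 1, |&a, &b| scores[b].total_cmp(&scores[a]));
    idx.truncate(k);
    idx
}
```

Passing negated scores to the same routine returns the indices of the k smallest original scores, which is the failure mode the fix describes.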
Changed
- LoRA inference uses merge: `merge_lora()` is called before generation, producing a single merged weight matrix per layer. This is equivalent to the fuse command but happens in-memory without saving
- PyO3 0.23 → 0.28: `allow_threads` → `detach`, `with_gil` → `attach`, `from_py_object` on all pyclass types, `Bound<'py, PyDict>` return types
- tokio 1.49 → 1.50
- `unsafe_code` lint: Escalated from `warn` to `deny` workspace-wide
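The merge step `W += scale*B@A` can be illustrated on plain row-major slices. This is a sketch under stated assumptions, not pmetal's `merge_lora()`: the names `merge_lora_into` and `lora_scale` are hypothetical, and the scale convention (alpha / r, or alpha / sqrt(r) with rsLoRA) is the usual LoRA one rather than something these notes confirm.

```rust
/// Merge a LoRA update into a row-major weight matrix: W += scale * (B @ A).
/// Shapes: W is out_dim x in_dim, B is out_dim x rank, A is rank x in_dim.
fn merge_lora_into(
    w: &mut [f32],
    a: &[f32],
    b: &[f32],
    out_dim: usize,
    in_dim: usize,
    rank: usize,
    scale: f32,
) {
    assert_eq!(w.len(), out_dim * in_dim);
    assert_eq!(a.len(), rank * in_dim);
    assert_eq!(b.len(), out_dim * rank);
    for o in 0..out_dim {
        for r in 0..rank {
            // Each (o, r) pair contributes one scaled row of A to row o of W.
            let coeff = scale * b[o * rank + r];
            for i in 0..in_dim {
                w[o * in_dim + i] += coeff * a[r * in_dim + i];
            }
        }
    }
}

/// Assumed scale convention: alpha / r, or alpha / sqrt(r) with rsLoRA.
fn lora_scale(alpha: f32, rank: usize, use_rslora: bool) -> f32 {
    let r = rank as f32;
    if use_rslora { alpha / r.sqrt() } else { alpha / r }
}
```

Casting all three operands to one dtype before merging avoids the mixed BF16/F32 path that the separate-forward approach ran into.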
Full Changelog: v0.3.3...v0.3.4
v0.3.3
[0.3.3] - 2026-03-12
Added
- Self-contained binary: `mlx.metallib` is now gzip-compressed and embedded into the `pmetal` binary at build time via `build.rs` + `include_bytes!`. On first run it extracts to `~/.cache/pmetal/lib/` if not already present. `cargo install pmetal-cli` now produces a fully self-contained binary with no external metallib dependency (~31MB added to binary, 70% smaller than the raw 102MB metallib)
- Adaptive LR rollback: When divergence is detected and `rollback_enabled = true`, the adaptive LR controller emits `LrEvent::RollbackTriggered`; the training loop restores LoRA weights from the best in-memory EMA snapshot, resets optimizer momentum, and continues with a halved LR multiplier
- Early-stop on repeated divergence: After `max_rollbacks` exhausted rollbacks, the controller emits `LrEvent::EarlyStop`; the training loop saves a final checkpoint and exits cleanly instead of spiraling deeper into loss divergence
- In-memory LoRA snapshot: `TrainingLoop` holds the best LoRA weight snapshot in RAM via `snapshot_best_weights()` / `restore_best_weights()`. LoRA params are typically 1–20 MB, making this negligible overhead vs checkpoint I/O
- `AdaptiveAction` enum: `apply_adaptive_lr()` now returns `AdaptiveAction::Continue | Rollback | EarlyStop` so training loops can react to controller decisions without re-parsing event strings
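The rollback and early-stop decisions described above can be sketched as a small state machine. Only the `AdaptiveAction` variant names and the `rollback_enabled` / `max_rollbacks` settings come from these notes; the `RollbackPolicy` struct, its fields, and the 0.5 factor on the plain-reduction path are assumptions made for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum AdaptiveAction {
    Continue,
    Rollback,
    EarlyStop,
}

/// Hypothetical controller state; not pmetal's actual type.
struct RollbackPolicy {
    rollback_enabled: bool,
    max_rollbacks: u32,
    rollbacks: u32,
    lr_multiplier: f32,
}

impl RollbackPolicy {
    /// Decide what the training loop should do once divergence is detected.
    fn on_divergence(&mut self, has_best_snapshot: bool) -> AdaptiveAction {
        if !(self.rollback_enabled && has_best_snapshot) {
            // No snapshot to restore: plain LR reduction (factor assumed).
            self.lr_multiplier *= 0.5;
            return AdaptiveAction::Continue;
        }
        if self.rollbacks >= self.max_rollbacks {
            // Rollback budget used up: ask the loop to checkpoint and exit.
            return AdaptiveAction::EarlyStop;
        }
        self.rollbacks += 1;
        self.lr_multiplier *= 0.5; // continue with a halved LR multiplier
        AdaptiveAction::Rollback
    }
}
```

A training loop would match on the returned value: restore the best snapshot and reset optimizer momentum on `Rollback`, save a final checkpoint and break on `EarlyStop`.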
Fixed
- `apply_adaptive_lr` return type: Previously returned `()`, discarding rollback/early-stop events, so callers had no way to react. Now returns `AdaptiveAction`
- Divergence rollback vs plain reduction ambiguity: Divergence path now checks `rollback_enabled` and `has_best_snapshot` before deciding between rollback and plain LR reduction, which prevents silent rollback when no snapshot exists
- EMA state reset on rollback: Spike EMA and variance are reset alongside LR multiplier on rollback so z-score anomaly detection re-stabilizes correctly after weight restoration
- `total_steps` in metrics: `run_standard()` and `run_jit_compiled()` computed `total_steps: max_steps.unwrap_or(0)`; they now estimate it from `dataset.len() / batch_size * epochs` when `max_steps` is `None`, giving accurate progress in the TUI
- `stats_summary` missing rollback count: `AdaptiveLrController::stats_summary()` now includes `rollbacks=N` in its output string
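The `total_steps` estimate above is simple arithmetic. A sketch of it follows; the function name is hypothetical, integer division is assumed (a trailing partial batch is not counted), and the zero-batch-size guard is an addition for safety.

```rust
/// Estimate the total step count for progress reporting.
/// Uses `max_steps` when given, else dataset_len / batch_size * epochs.
fn estimate_total_steps(
    max_steps: Option<usize>,
    dataset_len: usize,
    batch_size: usize,
    epochs: usize,
) -> usize {
    match max_steps {
        Some(n) => n,
        // Guard added in this sketch: avoid dividing by a zero batch size.
        None if batch_size == 0 => 0,
        None => (dataset_len / batch_size).saturating_mul(epochs),
    }
}
```

For example, 1000 samples at batch size 8 over 3 epochs gives 125 * 3 = 375 steps, where the earlier `unwrap_or(0)` behavior would have reported 0.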
Improved
- Rollback tests: Four new unit tests: `test_rollback_triggered_on_divergence`, `test_early_stop_after_max_rollbacks`, `test_rollback_disabled_falls_through_to_divergence`, `test_should_snapshot_best_tracks_ema_improvement`
Full Changelog: v0.3.2...v0.3.3
v0.3.2
[0.3.2] - 2026-03-11
Added
- Adaptive learning rate controller: EMA-based z-score spike detection, patience-based plateau detection, and linear regression divergence detection — automatically adjusts LR multiplier during training to recover from loss spikes, reduce LR on plateaus, and halt on divergence
- Manual LR override via TUI: Press `L` in Training, Distillation, or GRPO tabs to set a custom learning rate mid-run; uses atomic control file protocol (`{output_dir}/.lr_control.json`) for safe subprocess communication
- WSD (Warmup-Stable-Decay) scheduler: New `LrSchedulerType::Wsd` with configurable `stable_ratio`; holds peak LR for a plateau phase before linear decay, popular for large-scale pretraining
- GRPO adaptive LR + callbacks: `GrpoTrainer` now supports adaptive LR, `TrainingCallback` lifecycle events, and `StepMetrics` emission for live TUI monitoring
- HuggingFace Hub search (`pmetal search`): CLI command and TUI integration (press `S` in Models tab) to search HF Hub for text-generation models with download counts, parameter estimates, and memory fit assessment
- Memory fit estimation: New `pmetal-hub` module estimates inference/training memory requirements, tok/s throughput, and color-coded fit levels (green/yellow/red) based on device specs and model architecture
- Model detail panel: Models tab shows memory breakdown: weights, KV cache, overhead, training estimate, and recommended batch size
- Distillation metrics callbacks: `DistillationTrainer` now emits step-by-step metrics via `TrainingCallback`, enabling live TUI dashboard during distillation runs
- Command logging in Jobs tab: Spawned commands are logged with the full CLI invocation for easier debugging
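The Warmup-Stable-Decay shape described above can be sketched as a pure function of the step index. This is a hedged illustration, not the `LrSchedulerType::Wsd` implementation: it assumes linear warmup, that `stable_ratio` is the fraction of post-warmup steps held at peak, and that the decay runs linearly to zero.

```rust
/// Learning rate at `step` for a warmup-stable-decay schedule.
/// Assumptions (not confirmed by the release notes): linear warmup,
/// `stable_ratio` applies to the steps after warmup, decay ends at 0.
fn wsd_lr(
    step: usize,
    total_steps: usize,
    warmup_steps: usize,
    stable_ratio: f32,
    peak_lr: f32,
) -> f32 {
    let step = step.min(total_steps);
    if step < warmup_steps {
        // Linear warmup reaching peak_lr on the last warmup step.
        return peak_lr * (step as f32 + 1.0) / warmup_steps as f32;
    }
    let after_warmup = total_steps.saturating_sub(warmup_steps);
    let stable_steps = (after_warmup as f32 * stable_ratio.clamp(0.0, 1.0)) as usize;
    let decay_start = warmup_steps + stable_steps;
    if step < decay_start {
        return peak_lr; // plateau phase
    }
    let decay_steps = total_steps - decay_start;
    if decay_steps == 0 {
        return peak_lr;
    }
    let progress = (step - decay_start) as f32 / decay_steps as f32;
    peak_lr * (1.0 - progress)
}
```

With 100 total steps, 10 warmup steps, and `stable_ratio = 0.5`, the plateau covers steps 10 through 54 and the linear decay covers the remaining 45 steps.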
Fixed
- NaN/Inf loss guard: Adaptive LR skips EMA updates on non-finite losses to prevent EMA poisoning — returns scheduled LR unchanged
- EMA variance bias correction: Early-training z-scores now use bias-corrected variance (`raw_var / (1 - alpha^n)`), matching Adam's moment correction; prevents false spike detection in first ~20 steps
- Zero-variance z-score fallback: When loss variance is near zero (std_dev < 1e-8), uses absolute deviation threshold instead of division-by-zero; returns z=10 for >50% deviation, z=0 otherwise
- Atomic control file protocol: LR control file is renamed to `.lr_control.claimed` before reading and deleted after, which prevents race conditions between TUI writer and training subprocess reader
- Distillation metrics LR: Distillation step metrics now report post-adaptive LR instead of pre-adjustment scheduled LR
- Adaptive LR in all training paths: `apply_adaptive_lr()` now called in `run_metal_fused()`, `run_compiled()`, `run_jit_compiled()`, and `run_packed()` paths (was only in `run_standard()`)
- TUI LR override validation: LR range check now accepts 1.0 (was exclusive upper bound); shows error modal on invalid input instead of silent log warning
- Distillation/GRPO job routing: Status updates were always routed to the Training tab regardless of job type. Added `active_job_type` tracking to route metrics, completion, and failure to the correct tab (Distill, GRPO, or Training)
- Distillation CLI args: TUI sent `--lora-alpha` and `--log-metrics` flags that the CLI didn't accept, causing immediate exit code 2. Added both args to the `Distill` command and `--log-metrics` to `Grpo`
- Parquet dataset support in distill/GRPO: Distillation and GRPO commands only supported JSONL datasets. Now auto-detect `.parquet` files and route to the parquet loader, matching the training command's behavior
- Tab click targeting: Mouse clicks on Monitor, Inference, and Jobs tabs selected the wrong tab due to hardcoded fixed-width hit-testing. Now computes actual tab widths from rendered text
- Error diagnostics: Failed jobs now show the last 5 stderr lines in the tab status panel instead of just "Process exited with code N", with a hint to check the Jobs tab for full output
- UTF-8 safe string truncation: `truncate_str` used byte indexing which panics on multi-byte characters; switched to `chars()` iterator
- Leaked channel in HF search: `search_hf()` created a sender/receiver pair even without a CommandRunner, silently dropping results
- Integer overflow in fit estimation: `estimate_params_from_config` used plain multiplication; switched to `saturating_mul`/`saturating_add`
- Context length truncation: u64→u32 cast could wrap for extreme values; capped at 1M before cast
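Three of the fixes above (the NaN/Inf guard, the bias-corrected variance, and the zero-variance fallback) interact in the z-score computation. The sketch below combines them under assumptions: the `EmaStats` type is hypothetical, `alpha` is treated as the decay coefficient (weight on the old value, as with Adam's betas), and the mean is seeded from the first sample.

```rust
/// Hypothetical running statistics for loss spike detection.
struct EmaStats {
    alpha: f64, // decay coefficient: weight kept on the old estimate
    mean: f64,
    raw_var: f64, // zero-initialized EMA of squared deviations
    var_updates: u32,
    seeded: bool,
}

impl EmaStats {
    fn new(alpha: f64) -> Self {
        Self { alpha, mean: 0.0, raw_var: 0.0, var_updates: 0, seeded: false }
    }

    /// Returns the z-score of `loss` against the statistics seen so far,
    /// then folds `loss` in. Returns None for non-finite losses (state is
    /// left untouched) and for the very first sample.
    fn observe(&mut self, loss: f64) -> Option<f64> {
        if !loss.is_finite() {
            return None; // guard: do not poison the EMA
        }
        if !self.seeded {
            self.mean = loss;
            self.seeded = true;
            return None;
        }
        let std_dev = if self.var_updates == 0 {
            0.0
        } else {
            // Bias correction: raw_var / (1 - alpha^n)
            let correction = 1.0 - self.alpha.powi(self.var_updates as i32);
            (self.raw_var / correction).sqrt()
        };
        let delta = loss - self.mean;
        let z = if std_dev < 1e-8 {
            // Zero-variance fallback: absolute 50% deviation threshold.
            if delta.abs() > 0.5 * self.mean.abs() { 10.0 } else { 0.0 }
        } else {
            delta / std_dev
        };
        self.mean = self.alpha * self.mean + (1.0 - self.alpha) * loss;
        self.raw_var = self.alpha * self.raw_var + (1.0 - self.alpha) * delta * delta;
        self.var_updates += 1;
        Some(z)
    }
}
```

Without the correction term, the zero-initialized `raw_var` underestimates the true variance early on, which inflates z-scores and is consistent with the false spike detections the fix mentions.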
Improved
- TUI tab ordering: System (formerly Device) is now the default first tab; Dashboard renamed to Monitor
- Empty state messaging: Monitor tab shows actionable guidance ("Start a run from Training, Distill, or GRPO tab") instead of "Waiting for training data..."
- Idle state hint: Tabs show "Press S to start" instead of "Press S to start training" (generic across all job types)
Security
- Bounded API responses: `bounded_json()` caps HF API response bodies at 4MB to prevent heap exhaustion
- Model ID validation: `is_valid_model_id()` rejects path traversal, URL injection, and malformed values in HF API paths
Full Changelog: v0.3.1...v0.3.2
v0.3.1
[0.3.1] - 2026-03-11
Added
- M5 / Apple10 device detection: GPU family `Apple10` with architecture generation 17, NAX (Neural Accelerators in GPU) availability flag, and NAX-aware tile size tuning (M5 Max/Ultra get 128×64×32)
- UltraFusion topology detection: `sysctl hw.packages` detects multi-die Ultra chips; `is_ultra_fusion` and `die_count` fields on `DeviceProperties`
- GPU and ANE core count estimation: Per-chip core counts derived from device name and tier, with UltraFusion die multiplication
- Memory bandwidth estimation: Tier + GPU family lookup table for estimated bandwidth (GB/s)
- ANE performance stats API: `evaluate_with_stats()` on `AneModel` uses `_ANEPerformanceStats` with `hwExecutionTime` for nanosecond-precision hardware timing
- TUI device tab enhancements: GPU core counts (with per-die breakdown for Ultra), ANE core counts, memory bandwidth, architecture generation, NAX and UltraFusion feature flags
- `crates/pmetal/README.md`: Crate-level README with feature flags table, quick start examples, hardware support summary, and re-export reference
Fixed
- `AppleGPUFamily::Unknown` ordering bug: `Unknown` was declared last in the enum, causing derived `Ord` to rank it above `Apple10`, so unknown GPUs incorrectly got `has_dynamic_caching`, `has_nax`, etc. set to `true`. Fixed by moving `Unknown` to first position
- Future chip name collision: `name.contains("M1")` matched "M10"; replaced with `has_chip_id()` that checks the character after the match isn't a digit
- Dead `sysctl` subprocess in `query_memory_bandwidth`: Spawned `sysctl` whose result was discarded; removed and renamed to `estimate_memory_bandwidth()` using tier-based lookup
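The chip-name collision fix above can be illustrated with a short matcher. The signature below is an assumption; only the rule itself (the character following the match must not be a digit) comes from these notes.

```rust
/// True if `name` contains `chip` as a chip identifier, i.e. the match is
/// not immediately followed by another digit ("M1" must not match "M10").
fn has_chip_id(name: &str, chip: &str) -> bool {
    if chip.is_empty() {
        return false;
    }
    name.match_indices(chip).any(|(start, m)| {
        let end = start + m.len();
        // Accept the match when it ends the string or the next char is a non-digit.
        !name[end..].chars().next().map_or(false, |c| c.is_ascii_digit())
    })
}
```

A plain `name.contains("M1")` returns true for a hypothetical "Apple M10", while this check rejects it and still accepts "Apple M1 Max".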
Improved
- README updates: Root README now documents hardware support matrix (M1–M5), 9 TUI tabs (was 7), 16 crates (was 15), all fused Metal kernels (GDN, SwiGLU, RMSNorm+LoRA), ANE perf stats and M1–M5 compatibility
- Hardware support docs: Complete M1–M5 chip matrix with arch gen, core counts, bandwidth, ANE TFLOPS measurements; NAX kernel integration roadmap; UltraFusion distributed roadmap
Full Changelog: v0.3.0...v0.3.1
v0.3.0
[0.3.0] - 2026-03-10
Added
- TUI Control Center (`pmetal tui`): Full terminal interface with 9 tabs: Dashboard, Device, Models, Datasets, Training, Distillation, GRPO, Inference, Jobs. Async event loop with crossterm/ratatui, modal system (confirm, text input, model picker, dataset picker, error, progress), and reusable form field widgets
- Live job integration: Training, distillation, and GRPO tabs spawn pmetal subprocesses and stream metrics in real time via `CommandRunner` + JSONL polling
- LoRA fuse command (`pmetal fuse`): Merge LoRA adapter weights into base model, with optional fuse-then-quantize pipeline
- Chat template support for Llama 4, DeepSeek, and Cohere: Full template formatting, Jinja detection, model name heuristics, stop tokens, and inference formatting for all three model families
  - Llama 4 template: `<|header_start|>`/`<|header_end|>`/`<|eot|>` tokens (distinct from Llama 3's `<|start_header_id|>`/`<|end_header_id|>`/`<|eot_id|>`)
  - DeepSeek template: Full-width unicode tokens (`<|begin▁of▁sentence|>`, `<|User|>`, `<|Assistant|>`) with thinking mode support (`<think>`/`</think>` prefill)
  - Cohere Command R template: `<|START_OF_TURN_TOKEN|>`, `<|USER_TOKEN|>`, `<|CHATBOT_TOKEN|>`, `<|END_OF_TURN_TOKEN|>` tokens
- Comprehensive stop token collection: `collect_all_stop_tokens()` now probes 11 well-known special tokens across all model families (added `<|eot|>`, `<|end|>`, `<|return|>`, `<|END_OF_TURN_TOKEN|>`, `<|end▁of▁sentence|>`)
- LoRA inference auto-chat detection: Probes vocabulary for `<|im_end|>`/`<|eot_id|>` to auto-enable chat mode on base models fine-tuned with LoRA
- Streaming generation support: `GenerationConfig` streaming extensions in `pmetal-models`
- Epoch/total_steps in StepMetrics: Training progress now flows through entire pipeline (training loop → JSONL callback → TUI) showing step X/Y and epoch M/N
- Hardware support documentation: Apple Silicon hardware matrix and tuning reference (`docs/hardware-support.md`)
Fixed
- TUI inference word wrap: Model output now wraps correctly within the terminal width instead of clipping off-screen; `normalize_code_fences()` preprocessor ensures ``` markers always appear on their own line even when the model emits text without newlines
- TUI inference code block rendering: Fenced code blocks (```python, etc.) now render properly with distinct styling even when the token stream lacks explicit newline characters
- TUI UTF-8 safe text handling: Word wrap and code block truncation now use char-count width instead of byte length, preventing panics on multi-byte characters
- GRPO accuracy reward (last-occurrence extraction): `AccuracyReward` now uses `rfind()` for `<answer>` tags and `\boxed{}`, correctly grabbing the final answer when the model retries within chain-of-thought
- GRPO accuracy reward (broken fallback): Old code compared the entire completion (including reasoning) against the answer when no `<answer>` tags were found; now falls back to last non-empty line
- GRPO accuracy reward (whitespace normalization): Answer comparison now collapses internal whitespace runs to single space, preventing false negatives from formatting differences
- LoRA inference stop tokens: `run_inference_with_lora` now uses full chat template + comprehensive stop token collection instead of just tokenizer EOS, which fixes infinite generation on chat-finetuned models
- LoRA inference missing parameters: All sampling parameters (top_k, top_p, min_p, penalties, seed) now passed through to LoRA inference path
- Llama 4 misdetection: Model name heuristic now correctly routes `llama-4`/`llama4` to Llama 4 template (was incorrectly using Llama 3 tokens)
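The last-occurrence extraction and last-non-empty-line fallback described in the accuracy reward fixes can be sketched as follows. The function name is hypothetical, and the handling of an unclosed `<answer>` tag (take the rest of the text) is an assumption.

```rust
/// Extract the final answer from a completion: the body of the last
/// `<answer>` tag if present, else the last non-empty line.
fn extract_answer(completion: &str) -> Option<&str> {
    const OPEN: &str = "<answer>";
    const CLOSE: &str = "</answer>";
    if let Some(open) = completion.rfind(OPEN) {
        let body = &completion[open + OPEN.len()..];
        // Assumption: an unclosed tag runs to the end of the completion.
        let end = body.find(CLOSE).unwrap_or(body.len());
        return Some(body[..end].trim());
    }
    // Fallback: last non-empty line rather than the whole completion.
    completion.lines().rev().map(str::trim).find(|l| !l.is_empty())
}
```

Using `rfind` means a model that answers, reconsiders, and answers again is scored on its final attempt.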
Added
- GRPO `\boxed{}` answer extraction: `AccuracyReward` now extracts answers from LaTeX `\boxed{...}` expressions with brace-depth tracking, standard for math GRPO (DeepSeek-R1 style)
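Brace-depth tracking matters because boxed answers often contain nested braces such as `\frac{1}{2}`. A minimal sketch of the technique, with hypothetical function names, pairs it with a whitespace normalizer of the kind an answer comparison might use:

```rust
/// Contents of the last `\boxed{...}` in `text`, honoring nested braces.
/// Returns None if there is no marker or the braces never balance.
fn extract_last_boxed(text: &str) -> Option<&str> {
    const MARKER: &str = "\\boxed{";
    let start = text.rfind(MARKER)? + MARKER.len();
    let mut depth = 1usize;
    for (i, c) in text[start..].char_indices() {
        match c {
            '{' => depth += 1,
            '}' => {
                depth -= 1;
                if depth == 0 {
                    return Some(&text[start..start + i]);
                }
            }
            _ => {}
        }
    }
    None
}

/// Collapse runs of whitespace to single spaces and trim the ends.
fn normalize_ws(s: &str) -> String {
    s.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

Stopping at the first `}` instead would truncate `\boxed{\frac{1}{2}}` to `\frac{1`.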
Improved
- TUI replaces legacy dashboard: `pmetal tui` provides full control center; legacy `pmetal dashboard` retained for simple metrics monitoring
- Chat template Jinja detection: Ordered detection ensures DeepSeek (full-width unicode), Cohere, Llama 4 are matched before generic patterns
- EOS token stripping: `strip_eos_tokens()` now handles all model-family EOS tokens
Full Changelog: v0.2.1...v0.3.0