v0.1.7

@beamivalice released this 15 Jun 17:27 · 48 commits to master since this release

Changelog

All notable changes to PonyExl3 are documented here.

[0.1.7] — 2026-06-13

  • Gemma4-26B-A4B EXL3 support (model_type gemma4 / gemma4_text)
  • Gemma4 MoE: EXL3Gemma4MoEBlock (compiled router + stacked experts; shared MLP separate)
  • Gemma4 routed experts use GeGLU (gelu_approx) in MoE kernels, matching exllamav3
  • Gemma4 sibling fusion: attn qkv + full-layer qk (40 MB threshold; MLP gate+up unfused)
  • Fusion parity test vs unfused logits (tests/test_gemma4_model.py)
  • Fix Gemma4 generation stop: merge top-level + text_config eos_token_id (honors <turn|>)
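
The stop-token fix above can be pictured with a small sketch. It assumes the model config is a plain dict where `eos_token_id` may appear at the top level and again under `text_config`, each as a single id or a list. The helper name `merge_eos_token_ids` and the ids in the usage note are hypothetical and not taken from PonyExl3.

```python
def merge_eos_token_ids(config):
    """Collect EOS ids from the top level and from text_config.

    Hypothetical helper: order is preserved and duplicates are dropped,
    so generation can stop on any id that either section declares.
    """
    merged = []
    for section in (config, config.get("text_config") or {}):
        value = section.get("eos_token_id")
        if value is None:
            continue
        # The field may hold one int or a list of ints.
        ids = value if isinstance(value, (list, tuple)) else [value]
        for token_id in ids:
            if token_id not in merged:
                merged.append(token_id)
    return merged
```

With made-up ids, `merge_eos_token_ids({"eos_token_id": 1, "text_config": {"eos_token_id": [1, 106]}})` yields `[1, 106]`, so a turn-end token declared only in `text_config` would still end generation.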

[0.1.6] — 2026-06-13

  • CLI validation: model dir, Metal, context limits, empty prompts, spec-flag warnings
  • ponyexl3-generate-bench: text-repeat prefill padding, cache clear between rows
  • Generation guards for prefill_chunk, num_draft, and max_position_embeddings
  • CLI edge-case tests (tests/test_cli.py, tests/test_generate_validation.py)
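
One way to read "text-repeat prefill padding" is that a short prompt is repeated until it reaches the prefill length a benchmark row asks for. The sketch below shows that idea on a list of token ids. Working on tokens rather than raw text, and the name `pad_by_repeating`, are assumptions made for illustration.

```python
def pad_by_repeating(tokens, target_len):
    """Repeat `tokens` cyclically and cut the result to exactly `target_len`.

    Hypothetical helper: a prompt longer than `target_len` is truncated.
    """
    if not tokens:
        raise ValueError("cannot pad an empty prompt")
    if target_len < 0:
        raise ValueError("target_len must be non-negative")
    repeats = -(-target_len // len(tokens))  # ceiling division
    return (tokens * repeats)[:target_len]
```

For example, `pad_by_repeating([1, 2, 3], 7)` gives `[1, 2, 3, 1, 2, 3, 1]`. Rejecting the empty prompt up front mirrors the empty-prompt validation listed in this release.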

[0.1.5] — 2026-06-13

  • ponyexl3-generate-bench: prefill sweep (1k–32k) with 128-token decode per row
  • Shared generate CLI setup (--mtp, --dflash, --eagle3, --lookup, engines, etc.)
  • Default prompt file: README.md (--prompt-file to override)

[0.1.3] — 2026-06-13

  • MTP speculative decoding: temperature-aware verify (Leviathan–Chen rejection sampling)
  • README benchmark tables (M5 Max, M1 Max, RTX 4090 comparison)
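
The temperature-aware verify step follows the speculative sampling rule from Leviathan et al. and Chen et al.: accept a draft token with probability min(1, p/q), otherwise resample from the normalized positive part of p - q. The sketch below uses only the standard library. The function names are illustrative and this is not PonyExl3's actual MTP code.

```python
import math
import random


def softmax(logits, temperature):
    """Temperature-scaled softmax over a list of logits (temperature > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def verify_draft_token(draft_token, target_probs, draft_probs, rng):
    """Return (accepted, token) for one drafted token.

    target_probs (p) and draft_probs (q) are distributions over the same
    vocabulary, both computed at the same sampling temperature.
    draft_token must have been sampled from draft_probs, so q > 0 for it.
    """
    p = target_probs[draft_token]
    q = draft_probs[draft_token]
    if rng.random() < min(1.0, p / q):
        return True, draft_token
    # Rejected: resample from max(0, p - q). A rejection implies p < q at
    # the draft token, so some other token has p > q and the sum is positive.
    residual = [max(0.0, pt - qt) for pt, qt in zip(target_probs, draft_probs)]
    token = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return False, token
```

When the target and draft distributions agree, every draft token is accepted. When they differ, the accept-or-resample rule keeps the output distributed according to the target model, which is why both distributions have to be computed at the temperature actually used for sampling.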

[0.1.2] — 2026-06-13

  • Fix load transient memory / MLX buffer cache growth on 32 GB Macs
  • Wired-memory cap via PONYEXL3_MEM_LIMIT_GB (92% of device recommended working set)
  • M1 Max benchmark numbers in README
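
As a rough illustration of the memory cap, the sketch below resolves a limit in bytes. It assumes that `PONYEXL3_MEM_LIMIT_GB` overrides a default of 92% of the device's recommended working set and that a GB means 1024³ bytes. Both are guesses from the changelog wording, and `resolve_mem_limit_bytes` is a hypothetical name. The recommended working-set size is passed in rather than queried, so no MLX call appears here.

```python
import os

GIB = 1024 ** 3  # assumption: the variable is interpreted in GiB


def resolve_mem_limit_bytes(recommended_bytes, env=None):
    """Return the wired-memory cap in bytes.

    Hypothetical policy: use PONYEXL3_MEM_LIMIT_GB when it is set,
    otherwise fall back to 92% of the recommended working set.
    """
    env = os.environ if env is None else env
    raw = env.get("PONYEXL3_MEM_LIMIT_GB")
    if raw:
        limit_gb = float(raw)
        if limit_gb <= 0:
            raise ValueError("PONYEXL3_MEM_LIMIT_GB must be positive")
        return int(limit_gb * GIB)
    # Integer arithmetic avoids floating-point rounding surprises.
    return recommended_bytes * 92 // 100
```

Passing `env` explicitly keeps the function testable without touching the real process environment.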

[0.1.1] — 2026-06-13

  • First 32 GB memory fix (load peak ~27.5 GB for 27B 4.15bpw)

[0.1.0] — 2026-06-13

  • Initial public release: EXL3 inference on Apple Silicon via MLX
  • CPU ref/ golden codec + MLX Metal runtime
  • Model loader for Qwen3.5 / Qwen3.6 dense and MoE
  • Speculative decoding: MTP, DFlash, EAGLE-3, n-gram lookup (verify-gated)
  • CLIs: ponyexl3-generate, ponyexl3-compare-layer, ponyexl3-compare-engines
  • Cross-platform reference export/compare scripts (ponyexl3/reference/)
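
Of the speculative decoding modes listed for the initial release, n-gram lookup is the simplest to sketch: the last few tokens of the context are searched for earlier in the same context, and whatever followed that earlier match is proposed as the draft. The version below is a generic illustration with hypothetical names, not PonyExl3's implementation. Drafts produced this way still need to be checked by the target model before they are kept.

```python
def ngram_lookup_draft(tokens, ngram_size, num_draft):
    """Propose up to `num_draft` tokens by matching the trailing n-gram.

    Searches backwards for the most recent earlier occurrence of the last
    `ngram_size` tokens and returns the tokens that followed it.
    Returns an empty list when there is no earlier match.
    """
    if ngram_size < 1 or num_draft < 1 or len(tokens) <= ngram_size:
        return []
    suffix = tokens[-ngram_size:]
    # Start one position before the suffix itself so at least one
    # token follows any match.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == suffix:
            follow_from = start + ngram_size
            return tokens[follow_from:follow_from + num_draft]
    return []
```

For the context `[1, 2, 3, 4, 1, 2]` with `ngram_size=2` and `num_draft=2`, the trailing `[1, 2]` matches the start of the sequence and the draft is `[3, 4]`.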