Changelog
All notable changes to PonyExl3 are documented here.
[0.1.7] — 2026-06-13
- Gemma4-26B-A4B EXL3 support (`model_type` `gemma4`/`gemma4_text`)
- Gemma4 MoE: `EXL3Gemma4MoEBlock` (compiled router + stacked experts; shared MLP separate)
- Gemma4 routed experts use GeGLU (`gelu_approx`) in MoE kernels, matching exllamav3
- Gemma4 sibling fusion: attn qkv + full-layer qk (40 MB threshold; MLP gate+up unfused)
- Fusion parity test vs unfused logits (`tests/test_gemma4_model.py`)
- Fix Gemma4 generation stop: merge top-level + `text_config` `eos_token_id` (honors `<turn|>`)
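
The stop fix above merges two `eos_token_id` sources. A minimal sketch of what such a merge might look like follows; it is not taken from the PonyExl3 source. The function names `merge_eos_token_ids` and `_as_id_list` are hypothetical, the config is assumed to be a plain dict, and the token ids are placeholders rather than real Gemma4 values.

```python
from typing import Any


def _as_id_list(value: Any) -> list[int]:
    """Normalize an eos_token_id field (missing, single int, or list) to a list."""
    if value is None:
        return []
    if isinstance(value, int):
        return [value]
    return [int(v) for v in value]


def merge_eos_token_ids(config: dict[str, Any]) -> list[int]:
    """Union of top-level and text_config eos ids, order-preserving and de-duplicated.

    Hypothetical helper: the real loader may structure this differently.
    """
    nested = config.get("text_config") or {}
    candidates = _as_id_list(config.get("eos_token_id")) + _as_id_list(
        nested.get("eos_token_id")
    )
    merged: list[int] = []
    for token_id in candidates:
        if token_id not in merged:
            merged.append(token_id)
    return merged


# Placeholder ids for illustration only.
example_config = {"eos_token_id": 1, "text_config": {"eos_token_id": [1, 106]}}
stop_ids = merge_eos_token_ids(example_config)
```

Generation would then stop on any id in `stop_ids`, so a turn-end token listed in only one of the two places is still honored.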
[0.1.6] — 2026-06-13
- CLI validation: model dir, Metal, context limits, empty prompts, spec-flag warnings
- `ponyexl3-generate-bench`: text-repeat prefill padding, cache clear between rows
- Generation guards for `prefill_chunk`, `num_draft`, and `max_position_embeddings`
- CLI edge-case tests (`tests/test_cli.py`, `tests/test_generate_validation.py`)
[0.1.5] — 2026-06-13
- `ponyexl3-generate-bench`: prefill sweep (1k–32k) with 128-token decode per row
- Shared generate CLI setup (`--mtp`, `--dflash`, `--eagle3`, `--lookup`, engines, etc.)
- Default prompt file: `README.md` (`--prompt-file` to override)
[0.1.3] — 2026-06-13
- MTP speculative decoding: temperature-aware verify (Leviathan–Chen rejection sampling)
- README benchmark tables (M5 Max, M1 Max, RTX 4090 comparison)
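
For readers unfamiliar with temperature-aware verification, here is a minimal sketch of the standard speculative-sampling accept/reject rule for one drafted token. It is an illustration, not PonyExl3's implementation. The names `softmax` and `verify_token` are hypothetical, logits are plain Python lists, and `temperature > 0` is assumed (greedy decoding would need a separate argmax comparison).

```python
import math
import random


def softmax(logits: list[float], temperature: float) -> list[float]:
    """Temperature-scaled softmax; assumes temperature > 0."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def verify_token(
    draft_token: int,
    draft_logits: list[float],
    target_logits: list[float],
    temperature: float,
    rng: random.Random,
) -> tuple[bool, int]:
    """Return (accepted, token) for one drafted token.

    Accept with probability min(1, p/q); on rejection, resample from the
    normalized residual max(0, p - q), which keeps the output distributed as p.
    """
    q = softmax(draft_logits, temperature)   # draft distribution
    p = softmax(target_logits, temperature)  # target distribution
    if rng.random() < min(1.0, p[draft_token] / q[draft_token]):
        return True, draft_token
    # A rejection implies p < q at the drafted token, so the residual has mass elsewhere.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    token = rng.choices(range(len(p)), weights=residual, k=1)[0]
    return False, token


rng = random.Random(0)
same_accepted, same_token = verify_token(2, [0.1, 0.2, 0.3], [0.1, 0.2, 0.3], 0.8, rng)
```

When draft and target logits agree, the ratio is 1 and the token is always accepted. The more the two distributions diverge at the sampled temperature, the more often the drafted token is replaced.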
[0.1.2] — 2026-06-13
- Fix load transient memory / MLX buffer cache growth on 32 GB Macs
- Wired-memory cap via `PONYEXL3_MEM_LIMIT_GB` (92% of device recommended working set)
- M1 Max benchmark numbers in README
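
One possible reading of the wired-memory cap entry is sketched below, under stated assumptions: the default cap is 92% of the device's recommended working set, `PONYEXL3_MEM_LIMIT_GB` overrides it when set, and "GB" is treated as GiB. The helper `resolve_wired_limit_bytes` is hypothetical and the actual resolution logic may differ.

```python
import os
from typing import Mapping, Optional

DEFAULT_FRACTION = 0.92  # fraction quoted in the changelog entry
GIB = 1024 ** 3          # assumption: the env var is interpreted in GiB


def resolve_wired_limit_bytes(
    recommended_bytes: int,
    env: Optional[Mapping[str, str]] = None,
) -> int:
    """Pick a wired-memory cap in bytes (hypothetical helper).

    recommended_bytes: the device's recommended working set, obtained elsewhere.
    env: environment mapping; defaults to os.environ.
    """
    env = os.environ if env is None else env
    raw = env.get("PONYEXL3_MEM_LIMIT_GB")
    if raw:
        # Assumed behavior: an explicit setting replaces the default.
        return int(float(raw) * GIB)
    return int(recommended_bytes * DEFAULT_FRACTION)


default_cap = resolve_wired_limit_bytes(100 * GIB, env={})
override_cap = resolve_wired_limit_bytes(100 * GIB, env={"PONYEXL3_MEM_LIMIT_GB": "24"})
```

Passing the environment as a parameter keeps the helper easy to test without mutating the process environment.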
[0.1.1] — 2026-06-13
- First 32 GB memory fix (load peak ~27.5 GB for 27B 4.15bpw)
[0.1.0] — 2026-06-13
- Initial public release: EXL3 inference on Apple Silicon via MLX
- CPU `ref`/golden codec + MLX Metal runtime
- Model loader for Qwen3.5 / Qwen3.6 dense and MoE
- Speculative decoding: MTP, DFlash, EAGLE-3, n-gram lookup (verify-gated)
- CLIs: `ponyexl3-generate`, `ponyexl3-compare-layer`, `ponyexl3-compare-engines`
- Cross-platform reference export/compare scripts (`ponyexl3/reference/`)