Releases: beamivalice/PonyExl3
v0.3.0
Summary
- One-command converter: `ponyexl3-convert --in-dir SOURCE --out-dir OUT --bits 4.15` runs plan → calibration → measured bit allocation → LDLQ → resumable HF shards
- Self-converted Qwen3.6-27B @ 4.15bpw: KLD parity vs bf16; better ΔPPL (+0.015 vs +0.169) and p99 (0.548 vs 0.592) than UnstableLlama 4.15bpw
- `ponyexl3-convert-advanced`: low-level/oracle path; `ponyexl3-convert-e2e` is a deprecated alias
- GPU-residency: MLX LDLQ, sibling batching, parallel measurement, layer reuse
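For reference, the ΔPPL figures above can be read as the perplexity of the quantized model minus the perplexity of the bf16 baseline on the same text. A minimal sketch, assuming perplexity is the exponential of the mean negative log-likelihood per token; the function names are illustrative and not part of the PonyExl3 API:

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) over tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def delta_ppl(ref_logprobs, quant_logprobs):
    """Perplexity of the quantized model minus perplexity of the reference.

    Both inputs are per-token log-probabilities assigned to the same
    ground-truth tokens (assumed evaluation setup).
    """
    return perplexity(quant_logprobs) - perplexity(ref_logprobs)
```

A positive value means the quantized model is less certain about the reference text than the baseline, so smaller is better.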
Install
pip install "ponyexl3 @ git+https://github.com/beamivalice/PonyExl3.git@v0.3.0"
Convert
ponyexl3-convert --in-dir /path/to/Qwen3.6-27B \
  --out-dir /path/to/Qwen3.6-27B-PonyExl3-4.15bpw --bits 4.15
Test plan
- `pytest`: 293 passed, 13 skipped
- `uv build`: `ponyexl3-0.3.0` sdist + wheel
v0.2.1
Summary
- Source-only quantization planning: `--init-quant-config` builds `quantization_config.json` from BF16 weights (no turboderp oracle)
- Plan-only conversion: use the plan dir as `--oracle-dir` with `--scale-mode computed`
- Bit budget: `--bits`, `--head-bits`, `--use-bit-allocation`, `--layer-bits REGEX:K`
- `.work/` gitignored
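One way to picture the `--layer-bits REGEX:K` override is as a per-tensor lookup layered on top of the global `--bits` value. The sketch below is only an illustration: the parsing rule (split on the last colon), the "last matching override wins" precedence, and the tensor names are all assumptions, not documented PonyExl3 behavior.

```python
import re

def parse_layer_bits(spec):
    """Split 'REGEX:K' on the last colon so the regex may contain colons."""
    pattern, _, bits = spec.rpartition(":")
    if not pattern:
        raise ValueError(f"expected REGEX:K, got {spec!r}")
    return re.compile(pattern), float(bits)

def resolve_bits(tensor_name, default_bits, overrides):
    """Return the bit width for one tensor (hypothetical precedence rule)."""
    bits = default_bits
    for pattern, k in overrides:
        if pattern.search(tensor_name):
            bits = k  # assumed: later overrides take precedence
    return bits

overrides = [parse_layer_bits(r"mlp\.down_proj:5")]
down_bits = resolve_bits("model.layers.0.mlp.down_proj", 4.15, overrides)
attn_bits = resolve_bits("model.layers.0.self_attn.q_proj", 4.15, overrides)
```

Here `down_bits` picks up the override and `attn_bits` falls back to the global budget.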
Install
pip install "ponyexl3 @ git+https://github.com/beamivalice/PonyExl3.git@v0.2.1"
Test plan
- `pytest tests/test_convert*.py`: 53 passed
- `uv build`: `ponyexl3-0.2.1` sdist + wheel
v0.2.0 - Exl3 Converter on Apple Silicon
Inference
- MiniCPM5-1B EXL3 support (`model_type` `llama`)
- ~152 tok/s greedy decode on M5 Max; ~0.9 GB resident
Converter (ponyexl3-convert)
- HF → EXL3 conversion on Metal: trellis search, Hessian/LDLQ, regularization, calibration, allocation
- Full-model MiniCPM5-1B in ~7 min (direct path)
- KLD vs bf16 matches turboderp/MiniCPM5-1B-exl3 4.00bpw (KLD 0.0422 vs 0.0428)
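The KLD comparison above measures how far the quantized model's next-token distribution drifts from the bf16 reference. A minimal sketch of that kind of metric, assuming it is the mean per-token KL(P_bf16 || P_quant) computed from logits; the helper names are illustrative and this is not the project's evaluation code:

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def mean_kl_divergence(ref_logits, quant_logits):
    """Mean KL(P_ref || P_quant) in nats over token positions.

    Each argument is a list of rows, one row of vocabulary logits per token.
    """
    total = 0.0
    for ref_row, quant_row in zip(ref_logits, quant_logits):
        ref_logp = log_softmax(ref_row)
        quant_logp = log_softmax(quant_row)
        total += sum(math.exp(p) * (p - q) for p, q in zip(ref_logp, quant_logp))
    return total / len(ref_logits)
```

Identical logits give a divergence of zero, and any mismatch gives a positive value, so two quantizations can be ranked against the same reference.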
Install
pip install "ponyexl3 @ git+https://github.com/beamivalice/PonyExl3.git@v0.2.0"
Full changelog: CHANGELOG.md
v0.1.7
Changelog
All notable changes to PonyExl3 are documented here.
[0.1.7] — 2026-06-13
- Gemma4-26B-A4B EXL3 support (`model_type` `gemma4` / `gemma4_text`)
- Gemma4 MoE: `EXL3Gemma4MoEBlock` (compiled router + stacked experts; shared MLP separate)
- Gemma4 routed experts use GeGLU (`gelu_approx`) in MoE kernels, matching exllamav3
- Gemma4 sibling fusion: attn qkv + full-layer qk (40 MB threshold; MLP gate+up unfused)
- Fusion parity test vs unfused logits (`tests/test_gemma4_model.py`)
- Fix Gemma4 generation stop: merge top-level + `text_config` `eos_token_id` (honors `<turn|>`)
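The stop-token fix above amounts to taking the union of the `eos_token_id` values found at the top level of the model config and under `text_config`. A rough sketch, assuming a Hugging Face style config dict in which either field may be a single integer or a list; the function name and the token ids in the usage line are made up for illustration:

```python
def merged_eos_token_ids(config):
    """Order-preserving union of top-level and text_config eos_token_id."""
    merged = []
    for section in (config, config.get("text_config") or {}):
        value = section.get("eos_token_id")
        if value is None:
            continue
        # Normalize a bare int to a one-element list
        ids = value if isinstance(value, (list, tuple)) else [value]
        for token_id in ids:
            if token_id not in merged:
                merged.append(token_id)
    return merged

# Illustrative ids only; real values come from the model's config.json
stop_ids = merged_eos_token_ids(
    {"eos_token_id": 1, "text_config": {"eos_token_id": [1, 106]}}
)
```

Generation would then stop on any id in `stop_ids` rather than only on the top-level value.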
[0.1.6] — 2026-06-13
- CLI validation: model dir, Metal, context limits, empty prompts, spec-flag warnings
- `ponyexl3-generate-bench`: text-repeat prefill padding, cache clear between rows
- Generation guards for `prefill_chunk`, `num_draft`, and `max_position_embeddings`
- CLI edge-case tests (`tests/test_cli.py`, `tests/test_generate_validation.py`)
[0.1.5] — 2026-06-13
- `ponyexl3-generate-bench`: prefill sweep (1k–32k) with 128-token decode per row
- Shared generate CLI setup (`--mtp`, `--dflash`, `--eagle3`, `--lookup`, engines, etc.)
- Default prompt file: `README.md` (`--prompt-file` to override)
[0.1.3] — 2026-06-13
- MTP speculative decoding: temperature-aware verify (Leviathan–Chen rejection sampling)
- README benchmark tables (M5 Max, M1 Max, RTX 4090 comparison)
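The temperature-aware verify step follows the standard speculative sampling rule: accept a draft token with probability min(1, p/q), and on rejection resample from the normalized residual max(0, p - q). A minimal pure-Python sketch of that rule, assuming both distributions were produced at the same temperature; the function names and argument layout are illustrative and not taken from PonyExl3:

```python
import math
import random

def softmax(logits, temperature):
    """Temperature-scaled softmax; temperature must be > 0."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def verify_draft(draft_tokens, draft_probs, target_probs, rng):
    """Return the accepted prefix plus one corrected or bonus token.

    draft_probs[i] and target_probs[i] are distributions at draft position i;
    target_probs carries one extra row used when every draft token is accepted.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        # q[tok] > 0 because tok was sampled from q
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
            continue
        # Rejected: resample from the residual distribution and stop
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        accepted.append(rng.choices(range(len(p)), weights=residual)[0])
        return accepted
    # All draft tokens accepted: sample a bonus token from the target
    bonus = target_probs[len(draft_tokens)]
    accepted.append(rng.choices(range(len(bonus)), weights=bonus)[0])
    return accepted
```

Greedy decoding (temperature 0) does not fit this formula and would instead compare the draft token against the target argmax.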
[0.1.2] — 2026-06-13
- Fix load transient memory / MLX buffer cache growth on 32 GB Macs
- Wired-memory cap via `PONYEXL3_MEM_LIMIT_GB` (92% of device recommended working set)
- M1 Max benchmark numbers in README
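As a rough illustration of how such a cap could be resolved, the sketch below assumes that `PONYEXL3_MEM_LIMIT_GB`, when set, overrides a default of 92% of the device's recommended working set. That precedence, the GiB interpretation of "GB", and the function name are assumptions made for the example, not confirmed behavior:

```python
import os

GIB = 1024 ** 3

def wired_limit_bytes(recommended_working_set_bytes, env=None):
    """Resolve a wired-memory cap in bytes (hypothetical helper)."""
    env = os.environ if env is None else env
    override = env.get("PONYEXL3_MEM_LIMIT_GB")
    if override:
        # Assumed: an explicit value in the environment wins over the default
        return int(float(override) * GIB)
    # Default: 92% of the recommended working set, in integer arithmetic
    return recommended_working_set_bytes * 92 // 100
```

The recommended working set itself would come from the Metal device query at runtime; it is passed in here so the arithmetic stays easy to check.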
[0.1.1] — 2026-06-13
- First 32 GB memory fix (load peak ~27.5 GB for 27B 4.15bpw)
[0.1.0] — 2026-06-13
- Initial public release: EXL3 inference on Apple Silicon via MLX
- CPU `ref/` golden codec + MLX Metal runtime
- Model loader for Qwen3.5 / Qwen3.6 dense and MoE
- Speculative decoding: MTP, DFlash, EAGLE-3, n-gram lookup (verify-gated)
- CLIs: `ponyexl3-generate`, `ponyexl3-compare-layer`, `ponyexl3-compare-engines`
- Cross-platform reference export/compare scripts (`ponyexl3/reference/`)
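Of the speculative decoding modes listed for this release, n-gram lookup is the simplest to sketch: the draft comes from the context itself, by finding an earlier occurrence of the most recent tokens and proposing whatever followed it. The version below is a generic illustration of that idea with hypothetical parameter names, not the project's implementation; the proposed tokens would still need to pass the verify step before being emitted.

```python
def ngram_lookup_draft(tokens, ngram_size=3, num_draft=4):
    """Propose draft tokens by matching the trailing n-gram earlier in context."""
    if len(tokens) < ngram_size:
        return []
    suffix = tokens[-ngram_size:]
    # Scan right to left so the most recent earlier match wins,
    # skipping the position of the suffix itself
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == suffix:
            follow = start + ngram_size
            return tokens[follow:follow + num_draft]
    return []

draft = ngram_lookup_draft([1, 2, 3, 4, 5, 1, 2, 3], ngram_size=3, num_draft=2)
```

In this example the trailing `[1, 2, 3]` also appears at the start of the context, so the draft is the two tokens that followed it there.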