Releases: beamivalice/PonyExl3

v0.3.0

20 Jun 16:08

Summary

  • One-command converter: ponyexl3-convert --in-dir SOURCE --out-dir OUT --bits 4.15 runs plan → calibration → measured bit allocation → LDLQ → resumable HF shards
  • Self-converted Qwen3.6-27B @ 4.15bpw — KLD parity vs bf16; better ΔPPL (+0.015 vs +0.169) and p99 (0.548 vs 0.592) than UnstableLlama 4.15bpw
  • ponyexl3-convert-advanced — low-level/oracle path; ponyexl3-convert-e2e is a deprecated alias
  • GPU-residency: MLX LDLQ, sibling batching, parallel measurement, layer reuse

Install

pip install "ponyexl3 @ git+https://github.com/beamivalice/PonyExl3.git@v0.3.0"

Convert

ponyexl3-convert --in-dir /path/to/Qwen3.6-27B \
  --out-dir /path/to/Qwen3.6-27B-PonyExl3-4.15bpw --bits 4.15

Test plan

  • pytest — 293 passed, 13 skipped
  • uv build: ponyexl3-0.3.0 sdist + wheel

v0.2.1

18 Jun 23:13

Summary

  • Source-only quantization planning: --init-quant-config builds quantization_config.json from BF16 weights (no turboderp oracle)
  • Plan-only conversion — use the plan dir as --oracle-dir with --scale-mode computed
  • Bit budget: --bits, --head-bits, --use-bit-allocation, --layer-bits REGEX:K
  • .work/ gitignored

Install

pip install "ponyexl3 @ git+https://github.com/beamivalice/PonyExl3.git@v0.2.1"

Test plan

  • pytest tests/test_convert*.py — 53 passed
  • uv build: ponyexl3-0.2.1 sdist + wheel

v0.2.0 - Exl3 Converter on Apple Silicon

18 Jun 12:10

Inference

  • MiniCPM5-1B EXL3 support (model_type llama)
  • ~152 tok/s greedy decode on M5 Max; ~0.9 GB resident

Converter (ponyexl3-convert)

  • HF → EXL3 conversion on Metal: trellis search, Hessian/LDLQ, regularization, calibration, allocation
  • Full-model MiniCPM5-1B in ~7 min (direct path)
  • KLD vs bf16 matches turboderp/MiniCPM5-1B-exl3 4.00bpw (KLD 0.0422 vs 0.0428)
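
The KLD figures above compare a quantized model's next-token distribution against the bf16 reference. As a rough illustration of that kind of metric, here is a minimal sketch that assumes per-position logits are available as plain Python lists. The function names are illustrative and not part of PonyExl3, and the project's own evaluation may differ in direction, averaging, or dataset.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized row."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl_divergence(ref_logits, test_logits):
    """KL(ref || test) in nats for a single token position."""
    ref_lp = log_softmax(ref_logits)
    test_lp = log_softmax(test_logits)
    return sum(math.exp(r) * (r - t) for r, t in zip(ref_lp, test_lp))

def mean_kld(ref_rows, test_rows):
    """Average the per-position KL over a sequence of positions."""
    kls = [kl_divergence(r, t) for r, t in zip(ref_rows, test_rows)]
    return sum(kls) / len(kls)
```

Identical logits give a KLD of zero, so lower values mean the quantized model tracks the reference more closely.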

Install

pip install "ponyexl3 @ git+https://github.com/beamivalice/PonyExl3.git@v0.2.0"

Full changelog: CHANGELOG.md

v0.1.7

15 Jun 17:27

Changelog

All notable changes to PonyExl3 are documented here.

[0.1.7] — 2026-06-13

  • Gemma4-26B-A4B EXL3 support (model_type gemma4 / gemma4_text)
  • Gemma4 MoE: EXL3Gemma4MoEBlock (compiled router + stacked experts; shared MLP separate)
  • Gemma4 routed experts use GeGLU (gelu_approx) in MoE kernels, matching exllamav3
  • Gemma4 sibling fusion: attn qkv + full-layer qk (40 MB threshold; MLP gate+up unfused)
  • Fusion parity test vs unfused logits (tests/test_gemma4_model.py)
  • Fix Gemma4 generation stop: merge top-level + text_config eos_token_id (honors <turn|>)
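
The stop fix in the last item merges end-of-sequence ids from two levels of the model config. A minimal sketch of such a merge, assuming a Hugging Face style config dict in which eos_token_id may be an int, a list, or absent at either level. The helper name is hypothetical and the token ids in the usage note are placeholders, not real Gemma4 ids.

```python
def merge_eos_token_ids(config):
    """Collect EOS ids from the top-level config and its nested text_config.

    Order is preserved (top-level first) and duplicates are dropped.
    """
    merged = []
    sources = (config, config.get("text_config") or {})
    for source in sources:
        value = source.get("eos_token_id")
        if value is None:
            continue
        # eos_token_id may be a single int or a list of ints
        ids = value if isinstance(value, (list, tuple)) else [value]
        for token_id in ids:
            if token_id not in merged:
                merged.append(token_id)
    return merged
```

For example, `merge_eos_token_ids({"eos_token_id": 1, "text_config": {"eos_token_id": [1, 106]}})` returns `[1, 106]`, so generation stops on either id.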

[0.1.6] — 2026-06-13

  • CLI validation: model dir, Metal, context limits, empty prompts, spec-flag warnings
  • ponyexl3-generate-bench: text-repeat prefill padding, cache clear between rows
  • Generation guards for prefill_chunk, num_draft, and max_position_embeddings
  • CLI edge-case tests (tests/test_cli.py, tests/test_generate_validation.py)

[0.1.5] — 2026-06-13

  • ponyexl3-generate-bench: prefill sweep (1k–32k) with 128-token decode per row
  • Shared generate CLI setup (--mtp, --dflash, --eagle3, --lookup, engines, etc.)
  • Default prompt file: README.md (--prompt-file to override)

[0.1.3] — 2026-06-13

  • MTP speculative decoding: temperature-aware verify (Leviathan–Chen rejection sampling)
  • README benchmark tables (M5 Max, M1 Max, RTX 4090 comparison)
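
Rejection-sampling verification, as named in the first item, accepts a drafted token x with probability min(1, p(x)/q(x)), where p is the target model's distribution and q is the draft's, both taken at the sampling temperature. On rejection it resamples from the normalized residual max(0, p − q). The sketch below shows one verification step under those assumptions; the function names are illustrative and this is not PonyExl3's implementation.

```python
import random

def residual_distribution(p, q):
    """Normalized max(0, p - q), used to resample after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    if total == 0.0:
        # p and q are identical; any draft token is accepted anyway
        return list(p)
    return [d / total for d in diff]

def verify_draft_token(token, p, q, rng=random):
    """Return (accepted, token) for one drafted token.

    `token` is assumed to have been sampled from q, so q[token] > 0.
    """
    accept_prob = min(1.0, p[token] / q[token])
    if rng.random() < accept_prob:
        return True, token
    residual = residual_distribution(p, q)
    resampled = rng.choices(range(len(p)), weights=residual, k=1)[0]
    return False, resampled
```

With this rule the emitted token is distributed according to p regardless of how good the draft is, which is why sampled output at a nonzero temperature can match non-speculative decoding in distribution.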

[0.1.2] — 2026-06-13

  • Fix load transient memory / MLX buffer cache growth on 32 GB Macs
  • Wired-memory cap via PONYEXL3_MEM_LIMIT_GB (92% of device recommended working set)
  • M1 Max benchmark numbers in README
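
One way to picture the memory cap described above: use PONYEXL3_MEM_LIMIT_GB when it is set, otherwise fall back to 92% of the device's recommended working set. Only the variable name and the 92% figure come from the changelog; the function name, the override precedence, and the reading of "GB" as GiB are assumptions made for this sketch.

```python
import os

DEFAULT_FRACTION = 0.92  # fraction quoted in the changelog entry

def wired_limit_bytes(recommended_working_set_bytes, env=None):
    """Pick a wired-memory cap in bytes (hypothetical helper)."""
    env = os.environ if env is None else env
    override = env.get("PONYEXL3_MEM_LIMIT_GB")
    if override:
        # Assumption: the variable is a number of GiB, e.g. "24" or "27.5"
        return int(float(override) * 1024**3)
    return int(recommended_working_set_bytes * DEFAULT_FRACTION)
```

In MLX the recommended working set is reported as `max_recommended_working_set_size` in `mx.metal.device_info()`, which could supply the first argument.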

[0.1.1] — 2026-06-13

  • First 32 GB memory fix (load peak ~27.5 GB for 27B 4.15bpw)

[0.1.0] — 2026-06-13

  • Initial public release: EXL3 inference on Apple Silicon via MLX
  • CPU ref/ golden codec + MLX Metal runtime
  • Model loader for Qwen3.5 / Qwen3.6 dense and MoE
  • Speculative decoding: MTP, DFlash, EAGLE-3, n-gram lookup (verify-gated)
  • CLIs: ponyexl3-generate, ponyexl3-compare-layer, ponyexl3-compare-engines
  • Cross-platform reference export/compare scripts (ponyexl3/reference/)