20 Jun 16:08

55b5410

v0.3.0 Latest

Summary

One-command converter — ponyexl3-convert --in-dir SOURCE --out-dir OUT --bits 4.15 runs plan → calibration → measured bit allocation → LDLQ → resumable HF shards
Self-converted Qwen3.6-27B at 4.15bpw: KLD vs bf16 on par with the UnstableLlama 4.15bpw quant, with lower ΔPPL (+0.015 vs +0.169) and lower p99 (0.548 vs 0.592)
ponyexl3-convert-advanced — low-level/oracle path; ponyexl3-convert-e2e is a deprecated alias
GPU-residency: MLX LDLQ, sibling batching, parallel measurement, layer reuse
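
As a rough sanity check on what a 4.15bpw target implies for output size, the arithmetic can be sketched as below. This assumes exactly 27e9 parameters, all stored at the average bit rate; real output will differ because of head bits, embeddings, scales, and file metadata, none of which are modeled here. The function name is illustrative and is not part of PonyExl3.

```python
def estimated_weight_bytes(num_params: float, bits_per_weight: float) -> float:
    """Approximate quantized weight storage: parameters * bits / 8 bytes."""
    return num_params * bits_per_weight / 8.0


# Assumed parameter count for a "27B" model; the true count is not stated in the notes.
size_bytes = estimated_weight_bytes(27e9, 4.15)
print(f"{size_bytes / 1e9:.2f} GB")
```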

Install

pip install "ponyexl3 @ git+https://github.com/beamivalice/PonyExl3.git@v0.3.0"

Convert

ponyexl3-convert --in-dir /path/to/Qwen3.6-27B \
  --out-dir /path/to/Qwen3.6-27B-PonyExl3-4.15bpw --bits 4.15

Test plan

pytest — 293 passed, 13 skipped
uv build — ponyexl3-0.3.0 sdist + wheel

18 Jun 23:13

beamivalice

v0.2.1

41f386a

v0.2.1

Summary

Source-only quantization planning — --init-quant-config builds quantization_config.json from BF16 weights (no turboderp oracle)
Plan-only conversion — use the plan dir as --oracle-dir with --scale-mode computed
Bit budget — --bits, --head-bits, --use-bit-allocation, --layer-bits REGEX:K
.work/ gitignored
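
To illustrate how a REGEX:K override such as the one accepted by --layer-bits could be interpreted, here is a minimal sketch. It is an assumption about the semantics, not the converter's actual code: it assumes K is an integer bit width, that the split happens at the last colon (so the regex itself may contain colons), and that the first matching override wins. The helper names and the layer names in the usage lines are hypothetical.

```python
import re


def parse_layer_bits(spec: str) -> tuple[re.Pattern, int]:
    """Split a 'REGEX:K' string at its last colon into (compiled regex, bits)."""
    pattern, sep, bits = spec.rpartition(":")
    if not sep or not pattern:
        raise ValueError(f"expected REGEX:K, got {spec!r}")
    return re.compile(pattern), int(bits)


def bits_for(name: str, overrides: list, default: float) -> float:
    """Return the bit width of the first override whose regex matches the name."""
    for pattern, bits in overrides:
        if pattern.search(name):
            return bits
    return default


# Hypothetical override: give attention projections 6 bits, everything else 4.
overrides = [parse_layer_bits(r"\.self_attn\.:6")]
attn_bits = bits_for("model.layers.0.self_attn.q_proj", overrides, default=4)
mlp_bits = bits_for("model.layers.0.mlp.up_proj", overrides, default=4)
```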

Install

pip install "ponyexl3 @ git+https://github.com/beamivalice/PonyExl3.git@v0.2.1"

Test plan

pytest tests/test_convert*.py — 53 passed
uv build — ponyexl3-0.2.1 sdist + wheel

18 Jun 12:10

beamivalice

v0.2.0

ec6cb8c

v0.2.0 - EXL3 Converter on Apple Silicon

Inference

MiniCPM5-1B EXL3 support (model_type llama)
~152 tok/s greedy decode on M5 Max; ~0.9 GB resident

Converter (ponyexl3-convert)

HF → EXL3 conversion on Metal: trellis search, Hessian/LDLQ, regularization, calibration, allocation
Full-model MiniCPM5-1B in ~7 min (direct path)
KLD vs bf16 matches turboderp/MiniCPM5-1B-exl3 4.00bpw (KLD 0.0422 vs 0.0428)
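
For readers unfamiliar with the KLD figures quoted above, the following is a minimal sketch of a per-token KL divergence between reference (bf16) logits and quantized-model logits, averaged over positions. The evaluation set, the divergence direction, and the averaging used for the reported numbers are not stated in these notes, so KL(reference || quantized) in nats with a plain mean is an assumption here.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary row."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(ref_logits, test_logits):
    """KL(P_ref || P_test) in nats for a single token position."""
    lp = log_softmax(ref_logits)
    lq = log_softmax(test_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def mean_kld(ref_rows, test_rows):
    """Average the per-position KL divergence over a sequence of logit rows."""
    values = [kl_divergence(r, t) for r, t in zip(ref_rows, test_rows)]
    return sum(values) / len(values)
```

A value near zero means the quantized model's next-token distribution closely tracks the reference; identical logits give exactly zero up to floating-point error.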

Install

pip install "ponyexl3 @ git+https://github.com/beamivalice/PonyExl3.git@v0.2.0"

Full changelog: CHANGELOG.md

15 Jun 17:27

beamivalice

v0.1.7

32a09c4

v0.1.7

Changelog

All notable changes to PonyExl3 are documented here.

[0.1.7] — 2026-06-13

Gemma4-26B-A4B EXL3 support (model_type gemma4 / gemma4_text)
Gemma4 MoE: EXL3Gemma4MoEBlock (compiled router + stacked experts; shared MLP separate)
Gemma4 routed experts use GeGLU (gelu_approx) in MoE kernels, matching exllamav3
Gemma4 sibling fusion: attn qkv + full-layer qk (40 MB threshold; MLP gate+up unfused)
Fusion parity test vs unfused logits (tests/test_gemma4_model.py)
Fix Gemma4 generation stop: merge top-level + text_config eos_token_id (honors <turn|>)
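
The EOS fix in the last item can be pictured with a small sketch: collect eos_token_id from both the top-level config and the nested text_config, accept either a single integer or a list, and de-duplicate while preserving order. This is an illustrative reconstruction, not the loader's actual code; the function name and the token IDs in the usage line are hypothetical.

```python
def merged_eos_ids(config: dict) -> list:
    """Union of eos_token_id values from the top level and text_config."""
    ids = []
    for source in (config, config.get("text_config") or {}):
        value = source.get("eos_token_id")
        if value is None:
            continue
        # eos_token_id may be a single int or a list of ints.
        candidates = [value] if isinstance(value, int) else list(value)
        for token_id in candidates:
            if token_id not in ids:
                ids.append(token_id)
    return ids


# Hypothetical IDs, chosen only to show the merge.
stop_ids = merged_eos_ids({"eos_token_id": 1, "text_config": {"eos_token_id": [1, 106]}})
```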

[0.1.6] — 2026-06-13

CLI validation: model dir, Metal, context limits, empty prompts, spec-flag warnings
ponyexl3-generate-bench: text-repeat prefill padding, cache clear between rows
Generation guards for prefill_chunk, num_draft, and max_position_embeddings
CLI edge-case tests (tests/test_cli.py, tests/test_generate_validation.py)

[0.1.5] — 2026-06-13

ponyexl3-generate-bench: prefill sweep (1k–32k) with 128-token decode per row
Shared generate CLI setup (--mtp, --dflash, --eagle3, --lookup, engines, etc.)
Default prompt file: README.md (--prompt-file to override)

[0.1.3] — 2026-06-13

MTP speculative decoding: temperature-aware verify (Leviathan–Chen rejection sampling)
README benchmark tables (M5 Max, M1 Max, RTX 4090 comparison)
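
The temperature-aware verify step mentioned for MTP follows the general speculative rejection-sampling scheme: a draft token x is accepted with probability min(1, p(x)/q(x)), and on rejection a replacement is drawn from the normalized residual max(0, p - q). The sketch below shows that scheme for a single token in pure Python. It assumes target and draft distributions are both computed at the same temperature greater than zero; how PonyExl3 handles greedy decoding, batching, or multiple draft tokens is not covered, and the function names are illustrative.

```python
import math
import random


def softmax_t(logits, temperature):
    """Softmax of logits scaled by 1/temperature (temperature must be > 0)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def verify_token(draft_token, p_target, q_draft, rng):
    """Return (accepted, token) for one drafted token.

    draft_token is assumed to have been sampled from q_draft, so q_draft[draft_token] > 0.
    """
    accept_prob = min(1.0, p_target[draft_token] / q_draft[draft_token])
    if rng.random() < accept_prob:
        return True, draft_token
    # Rejected: resample from the residual distribution max(0, p - q).
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    token = rng.choices(range(len(residual)), weights=residual)[0]
    return False, token


rng = random.Random(0)
p = softmax_t([2.0, 1.0, 0.0], temperature=0.7)
accepted, token = verify_token(0, p, p, rng)  # identical distributions always accept
```

Accepting and resampling this way keeps the output distributed according to the target model, which is why the sampling temperature has to be applied consistently to both distributions.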

[0.1.2] — 2026-06-13

Fix load transient memory / MLX buffer cache growth on 32 GB Macs
Wired-memory cap via PONYEXL3_MEM_LIMIT_GB (92% of device recommended working set)
M1 Max benchmark numbers in README
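
One way to read the wired-memory cap item is sketched below: an explicit PONYEXL3_MEM_LIMIT_GB value takes precedence, and otherwise the cap is 92% of the device's recommended working set. Both the precedence and the unit (GiB rather than 10^9 bytes) are assumptions, and the function is hypothetical; the recommended working-set size is passed in as a plain number so the example does not depend on MLX.

```python
import os


def wired_limit_bytes(recommended_bytes: int, env=None, fraction: float = 0.92) -> int:
    """Pick a memory cap: explicit override in GiB, else a fraction of the recommendation."""
    env = os.environ if env is None else env
    override = env.get("PONYEXL3_MEM_LIMIT_GB")
    if override:
        # Assumption: the variable is interpreted in GiB.
        return int(float(override) * 1024**3)
    return int(recommended_bytes * fraction)


default_cap = wired_limit_bytes(1000, env={}, fraction=0.5)
override_cap = wired_limit_bytes(1000, env={"PONYEXL3_MEM_LIMIT_GB": "24"})
```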

[0.1.1] — 2026-06-13

First 32 GB memory fix (load peak ~27.5 GB for 27B 4.15bpw)

[0.1.0] — 2026-06-13

Initial public release: EXL3 inference on Apple Silicon via MLX
CPU ref/ golden codec + MLX Metal runtime
Model loader for Qwen3.5 / Qwen3.6 dense and MoE
Speculative decoding: MTP, DFlash, EAGLE-3, n-gram lookup (verify-gated)
CLIs: ponyexl3-generate, ponyexl3-compare-layer, ponyexl3-compare-engines
Cross-platform reference export/compare scripts (ponyexl3/reference/)

Releases: beamivalice/PonyExl3
