MLX benchmarking for open-source models and Ailiance fine-tunes on Mac Apple Silicon. Evaluation via perplexity on 5 embedded niches (spice, stm32, kicad, embedded_iot, emc_power).
- Live demo & cockpit: https://www.ailiance.fr
- Status dashboard: https://home.saillant.cc
- HuggingFace IP source-of-truth: https://huggingface.co/electron-rare
- HuggingFace product distribution: https://huggingface.co/Ailiance-fr
- Audit-grade bench validators: https://github.com/ailiance/iact-bench
- Benchmark results: https://github.com/ailiance/ailiance-bench
Ailiance is the EU-sovereign LLM serving stack of L'Electron Rare, a French SME. Multi-model, audit-grade, EU AI Act Art. 13/15/52/53 transparency.
| Model | Provider | Notes |
|---|---|---|
| ailiance | Ailiance | Custom fine-tune; 52% gsm-S / 78.5% arc / 58% mmluPro |
| mascarade | Ailiance | Custom fine-tune |
| base | Ailiance | Gemma 3 4B vanilla (pre-fine-tune reference) |
| gemma3-4b | Gemma 3 4B | |
| ministral-3b | Mistral | Ministral 3B |
| ministral-3-8b | Mistral | Ministral 3-8B |
| qwen-coder-3b | Alibaba | Qwen Coder 3B |
| qwen3.5-9b | Alibaba | Qwen 3.5 9B |
| llama-3.2-3b | Meta | Llama 3.2 3B |
| helium-1-2b | Helium | Helium 1 2B |
| granite-4.1-3b | IBM | Granite 4.1 3B |
| jackrong-9b-opus | Jackrong | Jackrong 9B Opus |
Full scores: see `bench-results/BENCH_TABLE.md`.
Base = `gemma-e4b-eu-kiki-base`. 4 LoRA adapters compared across 7 tasks.
Source of truth: `bench-results/compare_base_vs_lora.md`.
| Phase | Tâche | base | +eu-kiki | +mascarade | +aggro | +kicad9plus |
|---|---|---|---|---|---|---|
| P1 | kicad-dsl | 0.090 | 0.640 (+55) | 0.090 | 0.090 | 0.090 |
| P1 | kicad-pcb | 0.010 | 0.430 (+42) | 0.010 | 0.010 | 0.015 |
| P1 | spice-sim | 0.425 | 0.676 (+25) | 0.176 (−25) | 0.189 (−24) | 0.268 (−16) |
| P2 | kicad-sch-gen | 0.420 | 0.220 (−20) | 0.400 (−2) | 0.320 (−10) | 0.180 (−24) |
| P3 | kicad-sch-extract | 0.308 | 0.690 (+38) | 0.785 (+48) | 0.350 (+4) | 0.000 (−31) |
| P4 | kicad-erc-abs | 0.060 | 0.057 | 0.060 | 0.060 | 0.033 (−3) |
| P5 | kicad-erc-delta | 0.060 | 0.057 | 0.060 | 0.060 | 0.033 (−3) |
- 🥇 eu-kiki: all-round champion (4/7 tasks, peak P1-DSL +55 pts)
- 🥇 mascarade: targeted champion on P3 extraction (+48 pts)
- ⚠️ aggro: neutral (sanity check)
- ❌ kicad9plus: catastrophic forgetting — regression on SPICE/P2/P3. Use only when the context is exclusively KiCad permissive.
- 🚫 From-scratch `.kicad_sch` generation: unsolved (`parse_kicad=0` everywhere on P4/P5) — bottleneck is the absence of KiCad 6+ S-expressions in pre-training, not the chat mode.
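The parenthesized deltas in the table above are adapter-minus-base score differences expressed in points (score × 100). A minimal sketch of that formatting (the helper name is illustrative, not taken from the repo's scripts):

```python
def delta_points(base: float, adapter: float) -> str:
    """Format an adapter score with its delta vs base in points (score x 100),
    matching the '(+55)' / '(-25)' annotations in the comparison table."""
    pts = round((adapter - base) * 100)
    if pts == 0:
        return f"{adapter:.3f}"
    sign = "+" if pts > 0 else "-"
    return f"{adapter:.3f} ({sign}{abs(pts)})"

# P1 kicad-dsl row: base 0.090, eu-kiki 0.640
print(delta_points(0.090, 0.640))  # → 0.640 (+55)
# P1 spice-sim row: base 0.425, mascarade 0.176
print(delta_points(0.425, 0.176))  # → 0.176 (-25)
```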
lm-eval-harness (100 examples, seed 0):
- `gsm8k_cot` (8-shot by default)
- `arc_easy` (0-shot)
- `mmlu_pro_computer_science` (0-shot)

MLX perplexity (20 samples × 1024 seq):
- Niches: `spice`, `stm32`, `kicad`, `embedded_iot`, `emc_power`
- HF datasets: `Ailiance-fr/mascarade-*-dataset`
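Perplexity here is the exponential of the mean negative log-likelihood per token over the sampled sequences. A framework-agnostic sketch of that definition (not the actual `bench_new_models.py` code):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token).
    `token_logprobs` holds the natural-log probability the model
    assigned to each observed token in the evaluation samples."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Sanity check: a uniform model over 4 tokens has perplexity exactly 4
uniform = [math.log(0.25)] * 10
print(round(perplexity(uniform), 6))  # → 4.0
```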
```sh
git clone https://github.com/ailiance/ailiance-bench.git
cd ailiance-bench
python3.12 -m venv .venv && source .venv/bin/activate
pip install -U uv
uv pip install -r requirements.txt
python scripts/bench_new_models.py
```

Results are append-only: `bench-results/all_models.txt`.
Regenerate the markdown table: `python scripts/regen_bench_table.py`.
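The regeneration step turns the append-only result lines into a markdown table. A minimal sketch of the idea — the tab-separated line format below is an assumption for illustration; the real format is whatever `bench_new_models.py` appends to `bench-results/all_models.txt`:

```python
def to_markdown(lines: list[str]) -> str:
    """Render result lines ('model<TAB>niche<TAB>ppl', assumed format)
    as a markdown table, skipping blank lines."""
    rows = [ln.split("\t") for ln in lines if ln.strip()]
    out = ["| Model | Niche | PPL |", "|---|---|---|"]
    out += [f"| {m} | {n} | {p} |" for m, n, p in rows]
    return "\n".join(out)

print(to_markdown(["gemma3-4b\tspice\t12.34", "", "ministral-3b\tkicad\t9.87"]))
```

Because the source file is append-only, regeneration is idempotent: rerunning it after new bench runs simply produces a longer table.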
- Mac Apple Silicon (M1/M2/M3+)
- Python 3.12+
- ≥16 GB RAM (32 GB recommended — 8-9B models can OOM on 16 GB)
- Full Xcode + Metal Toolchain (optional, for the mlx fork wheel):
  `xcodebuild -downloadComponent MetalToolchain` (~688 MB)
- `ministral-3-8b`/`gsm8k_cot` 8-shot: Metal OOM (GPU cap ~499K handles). mlx fork branch `metal-3x-buffer-limit`: the ×1.5 cap (748K) is enough for `qwen3.5-9b`/`helium`, ×3 (1497K) is needed for `ministral-3-8b`.
- `qwen3.5-9b`: standard 600 s timeout too short; needs 1200-1500 s or the mlx fork.
- `helium-1-2b`: NO_RESULT (chat template/tokenizer issue); retest with the fork.
- 8-bit QuantizedKVCache: blocked upstream in mlx_lm 0.31.3 (no `.merge()`). See `scripts/bench_oom_retry.py` (patched workaround, unusable until the method lands).
- Symptoms: `ValueError: Received 126 parameters not in model`
- Affected: any machine using vanilla `mlx-lm==0.31.3` (PyPI) to load Gemma 4 (E2B, E4B, etc.) in 4-bit / 8-bit quantization (`lmstudio-community/gemma-4-*-MLX-*bit`, `mlx-community/gemma-4-*-*bit`).
- Status: already fixed upstream in `ml-explore/mlx-lm` PR #1240 (merged 2026-05-04), tracking issue #1242. Waiting for the next PyPI release > 0.31.3. Patch documented locally in `docs/mlx_lm_gemma4_kv_shared_fix.md` and `patches/mlx_lm_gemma4_text_kv_shared.patch`.
- Quick workaround: `uv pip install --force-reinstall "mlx-lm @ git+https://github.com/ml-explore/mlx-lm@main"`
- MLX fork: https://github.com/electron-rare/mlx (branch `metal-3x-buffer-limit`)
- CI wheel builds: https://github.com/electron-rare/mlx/actions/workflows/build-wheels.yml
- HF datasets: https://huggingface.co/Ailiance-fr
Apache-2.0 (code). Results are published as-is; models remain under their original licenses.
scripts/bench_gateway.py — stdlib-only OpenAI-compat client to bench any
/v1/chat/completions endpoint. Configurable via --endpoint, --models,
--rounds, --max-tokens, optional --out JSON dump. Used to characterize
the ailiance gateway (electron-server:9300) and direct workers
(Tower :9304, kxkm-ai tunnel :8002).
```sh
# Bench all 11 ailiance routes via the gateway
python3 scripts/bench_gateway.py --rounds 3 --max-tokens 64

# Bench a worker directly
python3 scripts/bench_gateway.py \
  --endpoint http://tower:9304/v1/chat/completions \
  --models ailiance-gemma --rounds 3
```

Results: `bench-results/gateway-*.json` (per-route p50 latency, tps, errors).
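The core of a stdlib-only OpenAI-compatible bench client is just a timed POST per round plus a median over the collected latencies. A minimal sketch of that idea — this is not the actual `bench_gateway.py`, and the endpoint/model values are placeholders:

```python
import json
import time
import urllib.request
from statistics import median

def bench_once(endpoint: str, model: str, max_tokens: int = 64) -> float:
    """One round-trip against an OpenAI-compatible /v1/chat/completions
    endpoint; returns wall-clock latency in seconds."""
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": max_tokens,
    }).encode()
    req = urllib.request.Request(
        endpoint, data=payload, headers={"Content-Type": "application/json"})
    t0 = time.monotonic()
    with urllib.request.urlopen(req, timeout=120) as resp:
        resp.read()
    return time.monotonic() - t0

def p50(latencies: list[float]) -> float:
    """Median latency over all rounds for one route."""
    return median(latencies)
```

Tokens-per-second and error counts can be accumulated the same way, by parsing the JSON response body of each round instead of discarding it.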
docs/SYNTHESIS_2026-05-10.md consolidates bench artefacts from macM1,
Studio, electron-server (124-cell 31_domains_baseline.json + 12-model
BENCH_TABLE.md + 11-route gateway + Phase 1 quick eval).
bench-results/aggregated/synthesis_31_domains.{csv,json} — 124 rows
flattened (model, domain, ppl, stderr, status) for the 4-of-8 v2-baseline
matrix on macM1 (gemma E2B/E4B + ministral 3B/3-8B).
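Flattening the per-model, per-domain cells into those 124 `(model, domain, ppl, stderr, status)` rows is a straightforward nested-dict walk. A sketch under an assumed nesting (the real shape of `31_domains_baseline.json` may differ):

```python
import csv
import io

# Assumed nested shape (illustrative only):
# {model: {domain: {"ppl": float, "stderr": float, "status": str}}}
def flatten(nested: dict) -> str:
    """Flatten the per-model/per-domain cells into CSV rows."""
    buf = io.StringIO()
    w = csv.writer(buf)
    w.writerow(["model", "domain", "ppl", "stderr", "status"])
    for model, domains in nested.items():
        for domain, cell in domains.items():
            w.writerow([model, domain, cell["ppl"], cell["stderr"], cell["status"]])
    return buf.getvalue()

sample = {"gemma-e4b": {"spice": {"ppl": 11.2, "stderr": 0.4, "status": "ok"}}}
print(flatten(sample))
```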
Also pushed to grist.saillant.cc (dhyrySCayizD1PNqCNhCPN) as 4 tables:
Bench_31_domains (124), Bench_public (12), Bench_niches_ppl (8),
Bench_gateway (11).
See docs/SYNTHESIS_2026-05-10.md and
docs/2026-05-10-phase2-training-gaps.md for the gap analysis and the
follow-up training plan.