feat: runtime model config from HuggingFace config.json #3
Open
Alexintosh wants to merge 7 commits into danveloper:main from
Conversation
Spec for replacing ~40 hardcoded #define model constants with a runtime ModelConfig struct populated from HuggingFace config.json, enabling model switching via --model flag without recompilation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add missing arrays (g_lz4_index, g_pred_experts, g_pred_count, stack VLAs), full_attn_interval fallback, thread safety invariant, MODEL_PATH_DEFAULT handling, MAX_BATCH_SLOTS coupling note, and clarify chat.m needs zero changes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds ModelConfig struct, compute_expert_offsets(), and load_model_config() that parses HuggingFace config.json + tokenizer.json via NSJSONSerialization. Old #defines still present. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove ~54 model-specific #define constants and replace ~960 occurrences with cfg.* runtime struct fields. Convert 13 static/ stack arrays to dynamic allocation. Parse config.json + tokenizer.json at startup via NSJSONSerialization. Expert byte offsets computed from model dimensions and quantization params. Switching models now requires only --model flag, no recompilation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Generalize file header comment to describe multi-model support. Update startup banner from hardcoded model name to "Flash-MoE" with dynamic config path display. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e models

Lists local HF-cached models with compatibility check, searches HuggingFace for compatible Qwen3.5 MoE models (35B-A3B, 122B-A10B, 397B-A17B) with MLX quantization, and supports downloading via huggingface-cli or huggingface_hub.

Usage:
  python model_manager.py              # list local + remote
  python model_manager.py --local      # local only
  python model_manager.py --search     # remote only
  python model_manager.py --download <repo>
  python model_manager.py --check <path>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
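A minimal sketch of the compatibility check such a script might run, assuming it keys off config.json fields; the rule set and field names here are guesses, not the actual model_manager.py logic:

```python
import json
from pathlib import Path

# Hypothetical compatibility rules; the real script may check more fields.
COMPATIBLE_MODEL_TYPES = {"qwen3_moe", "qwen3_5_moe"}

def is_compatible(config: dict) -> bool:
    """True if a HuggingFace config.json describes a supported MoE model."""
    if config.get("model_type") not in COMPATIBLE_MODEL_TYPES:
        return False
    quant = config.get("quantization", {})
    # The engine expects MLX-style group-quantization metadata.
    return "bits" in quant and "group_size" in quant

def check_local(model_dir: str) -> bool:
    """Load <model_dir>/config.json and run the compatibility check."""
    cfg_path = Path(model_dir) / "config.json"
    if not cfg_path.is_file():
        return False
    return is_compatible(json.loads(cfg_path.read_text()))

if __name__ == "__main__":
    good = {"model_type": "qwen3_5_moe",
            "quantization": {"bits": 4, "group_size": 64}}
    bad = {"model_type": "llama"}
    print(is_compatible(good), is_compatible(bad))  # True False
```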
Add compatible models table, model manager usage instructions, updated quick start with --model flag and FLASH_MOE_MODEL env var, revised project structure, and generalized architecture description. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
seslly pushed a commit to shipstuff/flash-moe that referenced this pull request on Apr 13, 2026
Phase 0 of the ANE offload strategy scoped in docs/2026-04-11-ane-offload-scoping.md. Adds two measurement harnesses under ane_bench/ and a README documenting the methodology and results.

Harnesses:
- ane_dispatch_bench.m: loads one pre-compiled LUT4 super-block from anemll-qwen35 and runs 200 back-to-back predictionFromFeatures: calls, reporting min/p50/p90/p99/max latency.
- ane_transfer_bench.m: 10k trials each of four MLMultiArray/MTLBuffer interaction patterns (zero-copy wrap, alloc+memcpy, readback, alloc only).

Results on M4 Pro mini-01 (2026-04-11):

Gate danveloper#1 — Swift CoreML per-prediction dispatch overhead:
  p50: 6.642 ms per super-block (4 layers bundled: DDDA)
  p99: 7.444 ms, max: 7.586 ms across 200 calls
  This is FASTER than the anemll-qwen35 reference of 9.28 ms (their measurement included Python coremltools overhead; raw Obj-C is tighter). No thermal throttling visible.

Gate danveloper#2 — MTLBuffer ↔ MLMultiArray transfer cost:
  Zero-copy wrap (initWithDataPointer): 0.5 us mean, p99 1.07 us
  Alloc + memcpy (naive fallback): 1.05 us mean, p99 4.05 us
  All paths sub-5-microseconds. Per-token overhead across 45 layer transitions = ~22 us total = completely negligible.

Gate danveloper#3 — GPU + ANE simultaneous load contention:
  Ran ane_dispatch_bench concurrent with TQ_KV=1 ./infer --tokens 128.
  ANE under concurrent GPU load: p50 +4%, p90 +5%, p99 +14%.
  GPU inference: 6.05 tok/s vs historical 5.65-5.91 tok/s baseline at TQ_KV=1 128tok — within noise. Unified memory bandwidth is not a bottleneck for the two engines.

Verdict: ALL THREE DECISION GATES PASSED. Proceed to Phase 1.

Revised per-token budget for the full ANE offload port:
  ANE path: 15 super-blocks × ~6.97 ms (with concurrent penalty) = ~104 ms
  GPU path: 60 MoE dispatches in parallel = ~78 ms
  Wall clock: max(104, 78) = ~104 ms/token
  vs current warm-cache baseline: ~150 ms/token
  Expected speedup: ~31% wall-clock

This is materially better than the scoping doc's 10-15% estimate. The scoping doc was written with conservative assumptions about CoreML dispatch overhead; the actual measurement is 2.6 ms *below* the reference, not above it.

The superblock0.mlmodelc/ bundle (413 MB) is gitignored; reproduce via:
  rsync -a carl@192.168.0.62:/Users/carl/models/anemll-qwen3.5-9b/qwen3_5_superblock0_lut4.mlmodelc/ ane_bench/superblock0.mlmodelc/

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
rrr3try pushed a commit to Graf-RAGov/flash-moe-mlx that referenced this pull request on Apr 17, 2026
Upstream + fork + issue context compiled for the port effort: PR diffs (danveloper#3 runtime config, danveloper#11 perf wins, danveloper#13 Qwen3-Coder-Next, danveloper#14 8-bit dequant), fork summaries (nerds-odd-e, gorroai), issue captures (danveloper#15 setup gotchas, danveloper#17 expert_index scope bug, danveloper#20 other Qwen models), target architecture spec (qwen3.6-35b-a3b-arch.md), hardcoded-constants map of upstream flash-moe, condensed port plan. Plus benchmark results, parallelism exploration, 10x optimization ideas. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Replaces the hardcoded #define model constants with a runtime ModelConfig struct populated from HuggingFace config.json at startup via NSJSONSerialization. Switch between Qwen3.5 models (35B, 122B, 397B) with just --model <path> — no recompilation needed.

Adds a model_manager.py utility to list local compatible models, search HuggingFace for MLX-quantized Qwen3.5 MoE models, download them, and validate compatibility.

README updates: --model flag usage and the FLASH_MOE_MODEL env var.

What changed in
infer.m

- ModelConfig struct + load_model_config() parses config.json (architecture, quantization, layer types, RoPE, EOS tokens) and tokenizer.json (think tokens)
- compute_expert_offsets() derives all expert byte offsets from dimensions + quantization params
- alloc_tracking_arrays() dynamically allocates all tracking arrays (expert freq, cache state, predictions, layer cache) previously sized by compile-time constants
- #define references replaced with cfg.* fields via helper macros (FREQ(), CACHE_SEEN(), PRED_EXPERT(), etc.)
- MetalCtx buffer arrays converted from fixed-size to dynamically allocated (__strong ARC pointers)

Test plan
- cd metal_infer && make
- ./infer --model ~/.cache/huggingface/hub/models--mlx-community--Qwen3.5-35B-A3B-4bit --prompt "What is 2+2?" --tokens 20
- python model_manager.py --local to list cached models
- python model_manager.py --search to find remote models
- FLASH_MOE_MODEL env var as default model path

🤖 Generated with Claude Code