Fix palettize_weights enable_per_channel_scale=True crashing MPSGraph on Apple Neural Engine (macOS 26) #2688
Open
john-rocky wants to merge 1 commit into apple:main
When OpPalettizerConfig is configured with enable_per_channel_scale=True,
palettize_weights wraps the constexpr_lut_to_dense output in a
constexpr_blockwise_shift_scale op (data=<dense fp16 weight>, scale=<per-channel
fp16>). On macOS 26, the MPSGraph backend lowering for that constexpr op fails
verification when targeting the Apple Neural Engine:
'mps.dequantize' op operand #2 must be tensor of quantized values,
but got 'tensor<1xf16>'
... failed assertion `original module failed verification'
The MPSGraph lowering of constexpr_blockwise_shift_scale assumes the data
operand is a quantized integer tensor (it lowers to mps.dequantize); with
enable_per_channel_scale=True, the data is the dense fp16 weight, which fails
that assumption. CPU and GPU compute units accept the wrapper and predict
correctly; only the ANE-targeted MIL -> MPSGraph dispatch is broken.
Fix: bake per_channel_scale into the LUT entries at compile time and re-emit
constexpr_lut_to_dense, instead of leaving the scale as a runtime constexpr.
Both data and scale are fp16 and the wrapper's only effect is data * scale, so
the fold is mathematically identical. The failing MPSGraph dispatch is
eliminated entirely, and CPU / GPU numerics stay bit-identical with the prior
behavior. Resulting graph also has one fewer runtime constexpr per palettized
const.
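The fold itself is a broadcast multiply over the palette entries. A minimal
numpy sketch, assuming a per-tensor LUT of shape (1, 1, 1, 1, num_palettes,
vector_size) and a conv-style per-output-channel scale of shape
(C_out, 1, 1, 1); shapes and names are assumptions for illustration, not the
exact coremltools internals:

```python
import numpy as np

def fold_scale_into_lut(lut: np.ndarray, per_channel_scale: np.ndarray) -> np.ndarray:
    """Pre-scale LUT entries so no runtime scale wrapper is needed."""
    c_out = per_channel_scale.shape[0]
    # Expand the shared LUT to one LUT block per output channel ...
    lut = np.broadcast_to(lut, (c_out,) + lut.shape[1:]).copy()
    # ... and scale each channel's palette. Since lut[idx] * s == (lut * s)[idx]
    # for a per-channel scalar s, the fold is mathematically exact.
    return lut * per_channel_scale.reshape((c_out,) + (1,) * (lut.ndim - 1))
```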
Test updated: TestPalettizeWeights::test_palettization_pcs previously asserted
that the constexpr_blockwise_shift_scale wrapper was emitted; it now asserts
the wrapper is absent (the LUT is pre-scaled). Numerical equivalence vs the
unpalettized model is verified by the existing verify_model_outputs call on
macOS 15+.
Tested:
- test_palettization_pcs: PASS
- All 155 TestPalettizeWeights / TestJointCompressWeights: PASS
- Manual: Qwen3-VL 2B stateful chunk on macOS 26 + M4 ANE:
MPSGraph verification crash gone (was reproducible at every load).
Summary
OpPalettizerConfig(enable_per_channel_scale=True) produces an .mlpackage that fails MPSGraph verification at model-load time on the Apple Neural Engine on macOS 26, with:

'mps.dequantize' op operand #2 must be tensor of quantized values, but got 'tensor<1xf16>'

CPU and GPU compute units accept the same .mlpackage and predict correctly; only the ANE-targeted MIL → MPSGraph dispatch is broken. The flag has been in the public API since 8.0b2 (#2308) and is documented as supported for iOS18+, but no test covered the ANE-load path on the macOS that ships that backend.
This PR fixes the MIL emission so that the failing MPSGraph dispatch is never produced, by folding per_channel_scale into the LUT entries at compile time rather than emitting a runtime constexpr_blockwise_shift_scale wrapper around the dense fp16 weight. The math is identical (both data and scale are fp16 and the wrapper's only effect is data * scale), so CPU / GPU numerics stay bit-identical with the prior behavior.
Root cause
coremltools/optimize/coreml/_quantization_passes.py:1142 wraps the constexpr_lut_to_dense output in a constexpr_blockwise_shift_scale op. The MIL spec for iOS18.compression.constexpr_blockwise_shift_scale allows SrcT ∈ {int4, uint4, int8, uint8, fp16, fp32}, so the op itself is well-typed. The MPSGraph backend lowering, however, lowers this constexpr to mps.dequantize, whose operand #2 is required to be a quantized integer tensor; with enable_per_channel_scale=True the data operand is fp16. The lowering passes a placeholder tensor<1xf16>, which fails verification.
The failing op chain in MIL, and the emission after this PR, are sketched below: one fewer runtime constexpr per palettized weight, and no path through the failing MPSGraph op.
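An illustrative rendering of the two emission patterns (variable names are placeholders, not the exact MIL listing):

```
# Before (enable_per_channel_scale=True):
w_dense = constexpr_lut_to_dense(indices=idx, lut=lut)              # fp16 dense weight
w       = constexpr_blockwise_shift_scale(data=w_dense, scale=pcs)  # lowers to mps.dequantize -> fails on ANE

# After this PR:
w = constexpr_lut_to_dense(indices=idx, lut=lut_prescaled)          # per_channel_scale folded into the LUT
```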
Trigger
Pure-synthetic single-nn.Conv2d(2048, 2048, 1) and 7-layer multi-nn.Conv2d models with enable_per_channel_scale=True load fine on ANE on macOS 26; the MPSGraph dispatch only chooses the failing path for some specific combinations of op count / state ops / shape patterns inside larger graphs. Confirmed reproducible on a Qwen3-VL 2B stateful body chunk (42 conv ops + 14 matmul + 14 slice_update + write_state + 7 layer_norm, etc.). I did not narrow the trigger further than "real LLM chunks fail; small standalone Conv2d cases don't."
The fix is independent of the trigger: removing the runtime constexpr_blockwise_shift_scale removes the MPSGraph dispatch entirely, so the failing path can no longer be reached for any input model.
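For reference, a minimal sketch of how this path is exercised through the public coremltools API (paths and compression settings are placeholders; per the trigger notes above, the failing load only reproduced on sufficiently large graphs):

```python
import coremltools as ct
import coremltools.optimize.coreml as cto

# Palettize an existing mlprogram model with per-channel scales enabled.
op_config = cto.OpPalettizerConfig(mode="kmeans", nbits=4, enable_per_channel_scale=True)
config = cto.OptimizationConfig(global_config=op_config)

mlmodel = ct.models.MLModel("model.mlpackage")  # placeholder path
compressed = cto.palettize_weights(mlmodel, config)
compressed.save("model_palettized.mlpackage")

# Loading with the ANE in the compute-unit set is what hit the MPSGraph
# verification failure before this PR.
model = ct.models.MLModel(
    "model_palettized.mlpackage",
    compute_units=ct.ComputeUnit.CPU_AND_NE,
)
```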
Tests
Updated tests/optimize/coreml/test_post_training_quantization.py::TestPalettizeWeights::test_palettization_pcs to assert the new (correct) MIL emission: there should be zero constexpr_blockwise_shift_scale ops, with per_channel_scale baked into the constexpr_lut_to_dense LUT entries. Numerical equivalence vs the un-palettized model is verified by the existing verify_model_outputs(...) call on macOS 15+.
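The shape of the updated assertion, sketched with coremltools' test helper (assert_pcs_folded is a hypothetical wrapper; the real test body may differ):

```python
from coremltools.converters.mil.testing_utils import get_op_types_in_program

def assert_pcs_folded(prog):
    """prog: the MIL program of the palettized model."""
    ops = get_op_types_in_program(prog)
    # The per-channel scale is folded into the LUT, so the wrapper op is gone ...
    assert "constexpr_blockwise_shift_scale" not in ops
    # ... and the palettized weight is still a LUT decompression constexpr.
    assert "constexpr_lut_to_dense" in ops
```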
End-to-end manually verified on macOS 26 + Apple M4:
- enable_per_channel_scale=True + compute_units=CPU_AND_NE: MPSGraph verification crash gone (was 100% reproducible before this PR).
- CPU_ONLY and CPU_AND_GPU: numerics finite and consistent (see the sketch after this list).
- nn.Conv2d(2048, 2048, 1) + enable_per_channel_scale=True + ANE inference: numerically equivalent to the pre-PR enable_per_channel_scale=False baseline (within fp16 rounding).
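A minimal sketch of that cross-compute-unit check (input/output feature names and shapes are placeholders):

```python
import numpy as np
import coremltools as ct

x = np.random.rand(1, 2048, 8, 8).astype(np.float32)  # placeholder input

outs = {}
for cu in (ct.ComputeUnit.CPU_ONLY, ct.ComputeUnit.CPU_AND_GPU, ct.ComputeUnit.CPU_AND_NE):
    model = ct.models.MLModel("model_palettized.mlpackage", compute_units=cu)
    outs[cu] = model.predict({"input": x})["output"]

ref = outs[ct.ComputeUnit.CPU_ONLY]
for cu, y in outs.items():
    assert np.isfinite(y).all()                         # numerics finite
    assert np.allclose(y, ref, rtol=1e-2, atol=1e-2)    # within fp16 rounding
```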
Test plan
- test_palettization_pcs updated and passes
- TestPalettizeWeights / TestJointCompressWeights tests pass (no regressions)

System tested
- main @ HEAD

Related
- coremltools/converters/mil/mil/ops/defs/iOS18/compression.py:24 (constexpr_blockwise_shift_scale, allows SrcT=fp16)
- coremltools/optimize/coreml/_quantization_passes.py:1136