
Fix palettize_weights enable_per_channel_scale=True crashing MPSGraph on Apple Neural Engine (macOS 26)#2688

Open
john-rocky wants to merge 1 commit into apple:main from john-rocky:fix-palettize-pcs-mpsgraph-ane

Conversation

@john-rocky
Contributor

Summary

OpPalettizerConfig(enable_per_channel_scale=True) produces an .mlpackage that fails MPSGraph verification at model-load time on the Apple Neural Engine on macOS 26, with:

'mps.dequantize' op operand #2 must be tensor of quantized values, but got 'tensor<1xf16>'
... failed assertion `original module failed verification'

CPU and GPU compute units accept the same .mlpackage and predict correctly; only the ANE-targeted MIL → MPSGraph dispatch is broken. The flag has been in the public API since 8.0b2 (#2308) and is documented as supported for iOS18+, but no test covered the ANE load path on a macOS release that ships that backend.

This PR fixes the MIL emission so that the failing MPSGraph dispatch is never produced, by folding per_channel_scale into the LUT entries at compile time rather than emitting a runtime constexpr_blockwise_shift_scale wrapper around the dense fp16 weight. The math is identical (both data and scale are fp16 and the wrapper's only effect is data * scale), so CPU / GPU numerics stay bit-identical with the prior behavior.
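The equivalence claim can be sanity-checked with a small NumPy sketch (the shapes and 2-bit LUT size here are illustrative assumptions, not the coremltools internals): scaling each channel's LUT entries once and then expanding via the indices yields the same fp16 values as expanding first and applying the per-channel scale at runtime.

```python
# Sketch only: demonstrates why folding per_channel_scale into the LUT entries
# is numerically identical to the prior dense-then-scale emission.
# All names and shapes are illustrative, not coremltools internals.
import numpy as np

rng = np.random.default_rng(0)
n_channels, n_weights, n_entries = 8, 16, 4  # 4 entries ~ 2-bit palettization

lut = rng.standard_normal((n_channels, n_entries)).astype(np.float16)
indices = rng.integers(0, n_entries, size=(n_channels, n_weights))
scale = rng.standard_normal((n_channels, 1)).astype(np.float16)

# Prior emission: constexpr_lut_to_dense produces the dense fp16 weight,
# then constexpr_blockwise_shift_scale multiplies by the per-channel scale.
dense_then_scale = np.take_along_axis(lut, indices, axis=1) * scale

# This PR: scale the LUT entries once at compile time, then expand as before.
folded_lut = lut * scale
scaled_dense = np.take_along_axis(folded_lut, indices, axis=1)

# Same fp16 multiply happens either way, so results are bit-identical.
assert np.array_equal(dense_then_scale, scaled_dense)
```

Because every dense element is `lut[c, k] * scale[c]` under both orderings and both operands are fp16, the single rounding step is identical, which is why CPU/GPU numerics are unchanged.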

Root cause

coremltools/optimize/coreml/_quantization_passes.py:1142 wraps the constexpr_lut_to_dense output:

new_var = mb.constexpr_blockwise_shift_scale(
    data=new_var,                # dense fp16 weight produced by constexpr_lut_to_dense
    scale=per_channel_scale,     # per-output-channel fp16
    offset=None,
    ...
)

The MIL spec for iOS18.compression.constexpr_blockwise_shift_scale allows SrcT ∈ {int4, uint4, int8, uint8, fp16, fp32}, so the op itself is well-typed. The MPSGraph backend, however, lowers this constexpr to mps.dequantize, whose operand #2 must be a quantized integer tensor; with enable_per_channel_scale=True the data operand is fp16, and the lowering passes a placeholder tensor<1xf16>, which fails verification.

The failing op chain in MIL is:

const(palettized LUT, dtype int4)
  └─→ constexpr_lut_to_dense          (output: dense fp16 weight)
        └─→ constexpr_blockwise_shift_scale  (data=<dense fp16>, scale=<per-channel fp16>)  ← rejected by MPSGraph on ANE
              └─→ conv

After this PR:

const(palettized LUT, dtype int4, with per-channel-scale baked into entries)
  └─→ constexpr_lut_to_dense          (output: dense fp16 weight, already scale-corrected)
        └─→ conv

One fewer runtime constexpr per palettized weight, and no path through the failing MPSGraph op.

Trigger

Purely synthetic models (a single nn.Conv2d(2048, 2048, 1), and a 7-layer stack of nn.Conv2d layers) with enable_per_channel_scale=True load fine on the ANE on macOS 26 — the MPSGraph dispatch only chooses the failing path for specific combinations of op count, state ops, and shape patterns inside larger graphs. Confirmed reproducible on a Qwen3-VL 2B stateful body chunk (42 conv ops + 14 matmul + 14 slice_update + write_state + 7 layer_norm, etc.). I did not narrow the trigger further than "real LLM chunks fail; small standalone Conv2d cases don't."

The fix is independent of the trigger: removing the runtime constexpr_blockwise_shift_scale removes the MPSGraph dispatch entirely, so the failing path can no longer be reached for any input model.

Tests

Updated tests/optimize/coreml/test_post_training_quantization.py::TestPalettizeWeights::test_palettization_pcs to assert the new (correct) MIL emission: there should be zero constexpr_blockwise_shift_scale ops, with per_channel_scale baked into the constexpr_lut_to_dense LUT entries. Numerical equivalence vs the un-palettized model is verified by the existing verify_model_outputs(...) call on macOS 15+.
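The shape of the updated structural assertion can be sketched with hand-written op lists (a self-contained stand-in taken from the op chains shown above; the real test walks the MIL program with coremltools' test utilities):

```python
# Stand-in op lists transcribed from the before/after op chains in this PR
# description; not produced by running coremltools.
pre_pr_ops = ["const", "constexpr_lut_to_dense",
              "constexpr_blockwise_shift_scale", "conv"]
post_pr_ops = ["const", "constexpr_lut_to_dense", "conv"]

# Old assertion: the runtime wrapper was expected to be present.
assert "constexpr_blockwise_shift_scale" in pre_pr_ops
# New assertion: zero wrappers; the LUT entries themselves carry the scale.
assert "constexpr_blockwise_shift_scale" not in post_pr_ops
assert post_pr_ops.count("constexpr_lut_to_dense") == 1
```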

$ pytest -k test_palettization_pcs
1 passed

$ pytest -k 'Palettize'
155 passed, 0 failed

End-to-end manually verified on macOS 26 + Apple M4:

  • Qwen3-VL 2B stateful chunk + enable_per_channel_scale=True + compute_units=CPU_AND_NE: MPSGraph verification crash gone (was 100% reproducible before this PR).
  • Same chunk on CPU_ONLY and CPU_AND_GPU: numerics finite and consistent.
  • Standalone nn.Conv2d(2048, 2048, 1) + enable_per_channel_scale=True + ANE inference: numerically equivalent to the pre-PR enable_per_channel_scale=False baseline (within fp16 rounding).

Test plan

  • test_palettization_pcs updated and passes
  • All 155 TestPalettizeWeights / TestJointCompressWeights tests pass (no regressions)
  • End-to-end load on Apple Neural Engine on macOS 26 succeeds for a real LLM chunk that previously crashed
  • CPU / GPU numerics bit-identical with prior behavior

System tested

  • macOS 26.0 (Darwin 25.0.0)
  • Apple Silicon (M4)
  • coremltools main @ HEAD
  • Python 3.12, PyTorch 2.11.0, NumPy 2.1.3

Related

  • Per-channel-scale support was introduced in 8.0b2 (8.0b2 Release #2308, Henry Tao 2024-08-15)
  • MIL spec: coremltools/converters/mil/mil/ops/defs/iOS18/compression.py:24 (constexpr_blockwise_shift_scale, allows SrcT=fp16)
  • Emitter (this PR's edit point): coremltools/optimize/coreml/_quantization_passes.py:1136

john-rocky force-pushed the fix-palettize-pcs-mpsgraph-ane branch from 5b398bd to 01069d5 on May 3, 2026 at 12:11
…ANE (macOS 26)

When OpPalettizerConfig is configured with enable_per_channel_scale=True,
palettize_weights wraps the constexpr_lut_to_dense output in a
constexpr_blockwise_shift_scale op (data=<dense fp16 weight>, scale=<per-channel
fp16>). On macOS 26, the MPSGraph backend lowering for that constexpr op fails
verification when targeting the Apple Neural Engine:

    'mps.dequantize' op operand #2 must be tensor of quantized values,
    but got 'tensor<1xf16>'
    ... failed assertion `original module failed verification'

The MPSGraph lowering of constexpr_blockwise_shift_scale assumes the data
operand is a quantized integer tensor (it lowers to mps.dequantize); with
enable_per_channel_scale=True, the data is the dense fp16 weight, which fails
that assumption. CPU and GPU compute units accept the wrapper and predict
correctly; only the ANE-targeted MIL -> MPSGraph dispatch is broken.

Fix: bake per_channel_scale into the LUT entries at compile time and re-emit
constexpr_lut_to_dense, instead of leaving the scale as a runtime constexpr.
Both data and scale are fp16 and the wrapper's only effect is data * scale, so
the fold is mathematically identical. The failing MPSGraph dispatch is
eliminated entirely, and CPU / GPU numerics stay bit-identical with the prior
behavior. Resulting graph also has one fewer runtime constexpr per palettized
const.

Test updated: TestPalettizeWeights::test_palettization_pcs previously asserted
that the constexpr_blockwise_shift_scale wrapper was emitted; it now asserts
the wrapper is absent (the LUT is pre-scaled). Numerical equivalence vs the
unpalettized model is verified by the existing verify_model_outputs call on
macOS 15+.

Tested:
  - test_palettization_pcs:                                    PASS
  - All 155 TestPalettizeWeights / TestJointCompressWeights:   PASS
  - Manual: Qwen3-VL 2B stateful chunk on macOS 26 + M4 ANE:
    MPSGraph verification crash gone (was reproducible at every load).
