
Fix palettize_weights enable_per_channel_scale=True crashing MPSGraph on Apple Neural Engine (macOS 26)#2688

Open
john-rocky wants to merge 1 commit into apple:main from john-rocky:fix-palettize-pcs-mpsgraph-ane

Conversation

@john-rocky
Contributor

Summary

OpPalettizerConfig(enable_per_channel_scale=True) produces an .mlpackage that fails MPSGraph verification at model-load time on the Apple Neural Engine on macOS 26, with:

'mps.dequantize' op operand #2 must be tensor of quantized values, but got 'tensor<1xf16>'
... failed assertion `original module failed verification'

CPU and GPU compute units accept the same .mlpackage and predict correctly; only the ANE-targeted MIL → MPSGraph dispatch is broken. The flag has been in the public API since 8.0b2 (#2308) and is documented as supported for iOS18+, but no test covered the ANE load path on a macOS release that ships that backend.

This PR fixes the MIL emission so that the failing MPSGraph dispatch is never produced, by folding per_channel_scale into the LUT entries at compile time rather than emitting a runtime constexpr_blockwise_shift_scale wrapper around the dense fp16 weight. The math is identical (both data and scale are fp16 and the wrapper's only effect is data * scale), so CPU / GPU numerics stay bit-identical with the prior behavior.
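The equivalence claim can be sanity-checked with a small NumPy sketch (the shapes and 2-bit LUT size here are illustrative assumptions, not the coremltools internals): scaling each channel's LUT entries once and then expanding via the indices yields the same fp16 values as expanding first and applying the per-channel scale at runtime.

```python
# Sketch only: demonstrates why folding per_channel_scale into the LUT entries
# is numerically identical to the prior dense-then-scale emission.
# All names and shapes are illustrative, not coremltools internals.
import numpy as np

rng = np.random.default_rng(0)
n_channels, n_weights, n_entries = 8, 16, 4  # 4 entries ~ 2-bit palettization

lut = rng.standard_normal((n_channels, n_entries)).astype(np.float16)
indices = rng.integers(0, n_entries, size=(n_channels, n_weights))
scale = rng.standard_normal((n_channels, 1)).astype(np.float16)

# Prior emission: constexpr_lut_to_dense produces the dense fp16 weight,
# then constexpr_blockwise_shift_scale multiplies by the per-channel scale.
dense_then_scale = np.take_along_axis(lut, indices, axis=1) * scale

# This PR: scale the LUT entries once at compile time, then expand as before.
folded_lut = lut * scale
scaled_dense = np.take_along_axis(folded_lut, indices, axis=1)

# Same fp16 multiply happens either way, so results are bit-identical.
assert np.array_equal(dense_then_scale, scaled_dense)
```

Because every dense element is `lut[c, k] * scale[c]` under both orderings and both operands are fp16, the single rounding step is identical, which is why CPU/GPU numerics are unchanged.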

Root cause

coremltools/optimize/coreml/_quantization_passes.py:1142 wraps the constexpr_lut_to_dense output:

new_var = mb.constexpr_blockwise_shift_scale(
    data=new_var,                # dense fp16 weight produced by constexpr_lut_to_dense
    scale=per_channel_scale,     # per-output-channel fp16
    offset=None,
    ...
)

The MIL spec for iOS18.compression.constexpr_blockwise_shift_scale allows SrcT ∈ {int4, uint4, int8, uint8, fp16, fp32}, so the op itself is well-typed. The MPSGraph backend, however, lowers this constexpr to mps.dequantize, whose operand #2 must be a quantized integer tensor; with enable_per_channel_scale=True the data operand is fp16, and the lowering passes a placeholder tensor<1xf16>, which fails verification.

The failing op chain in MIL is:

const(palettized LUT, dtype int4)
  └─→ constexpr_lut_to_dense          (output: dense fp16 weight)
        └─→ constexpr_blockwise_shift_scale  (data=<dense fp16>, scale=<per-channel fp16>)  ← rejected by MPSGraph on ANE
              └─→ conv

After this PR:

const(palettized LUT, dtype int4, with per-channel-scale baked into entries)
  └─→ constexpr_lut_to_dense          (output: dense fp16 weight, already scale-corrected)
        └─→ conv

One fewer runtime constexpr per palettized weight, and no path through the failing MPSGraph op.

Trigger

Purely synthetic models (a single nn.Conv2d(2048, 2048, 1), and a 7-layer stack of nn.Conv2d layers) with enable_per_channel_scale=True load fine on the ANE on macOS 26 — the MPSGraph dispatch only chooses the failing path for specific combinations of op count, state ops, and shape patterns inside larger graphs. Confirmed reproducible on a Qwen3-VL 2B stateful body chunk (42 conv ops + 14 matmul + 14 slice_update + write_state + 7 layer_norm, etc.). I did not narrow the trigger further than "real LLM chunks fail; small standalone Conv2d cases don't."

The fix is independent of the trigger: removing the runtime constexpr_blockwise_shift_scale removes the MPSGraph dispatch entirely, so the failing path can no longer be reached for any input model.

Tests

Updated tests/optimize/coreml/test_post_training_quantization.py::TestPalettizeWeights::test_palettization_pcs to assert the new (correct) MIL emission: there should be zero constexpr_blockwise_shift_scale ops, with per_channel_scale baked into the constexpr_lut_to_dense LUT entries. Numerical equivalence vs the un-palettized model is verified by the existing verify_model_outputs(...) call on macOS 15+.
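The shape of the updated structural assertion can be sketched with hand-written op lists (a self-contained stand-in taken from the op chains shown above; the real test walks the MIL program with coremltools' test utilities):

```python
# Stand-in op lists transcribed from the before/after op chains in this PR
# description; not produced by running coremltools.
pre_pr_ops = ["const", "constexpr_lut_to_dense",
              "constexpr_blockwise_shift_scale", "conv"]
post_pr_ops = ["const", "constexpr_lut_to_dense", "conv"]

# Old assertion: the runtime wrapper was expected to be present.
assert "constexpr_blockwise_shift_scale" in pre_pr_ops
# New assertion: zero wrappers; the LUT entries themselves carry the scale.
assert "constexpr_blockwise_shift_scale" not in post_pr_ops
assert post_pr_ops.count("constexpr_lut_to_dense") == 1
```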

$ pytest -k test_palettization_pcs
1 passed

$ pytest -k 'Palettize'
155 passed, 0 failed

End-to-end manually verified on macOS 26 + Apple M4:

  • Qwen3-VL 2B stateful chunk + enable_per_channel_scale=True + compute_units=CPU_AND_NE: MPSGraph verification crash gone (was 100% reproducible before this PR).
  • Same chunk on CPU_ONLY and CPU_AND_GPU: numerics finite and consistent.
  • Standalone nn.Conv2d(2048, 2048, 1) + enable_per_channel_scale=True + ANE inference: numerically equivalent to the pre-PR enable_per_channel_scale=False baseline (within fp16 rounding).

Test plan

  • test_palettization_pcs updated and passes
  • All 155 TestPalettizeWeights / TestJointCompressWeights tests pass (no regressions)
  • End-to-end load on Apple Neural Engine on macOS 26 succeeds for a real LLM chunk that previously crashed
  • CPU / GPU numerics bit-identical with prior behavior

System tested

  • macOS 26.0 (Darwin 25.0.0)
  • Apple Silicon (M4)
  • coremltools main @ HEAD
  • Python 3.12, PyTorch 2.11.0, NumPy 2.1.3

Related

  • Per-channel-scale support was introduced in 8.0b2 (8.0b2 Release #2308, Henry Tao 2024-08-15)
  • MIL spec: coremltools/converters/mil/mil/ops/defs/iOS18/compression.py:24 (constexpr_blockwise_shift_scale, allows SrcT=fp16)
  • Emitter (this PR's edit point): coremltools/optimize/coreml/_quantization_passes.py:1136

john-rocky force-pushed the fix-palettize-pcs-mpsgraph-ane branch from 5b398bd to 01069d5 on May 3, 2026 at 12:11
…ANE (macOS 26)

When OpPalettizerConfig is configured with enable_per_channel_scale=True,
palettize_weights wraps the constexpr_lut_to_dense output in a
constexpr_blockwise_shift_scale op (data=<dense fp16 weight>, scale=<per-channel
fp16>). On macOS 26, the MPSGraph backend lowering for that constexpr op fails
verification when targeting the Apple Neural Engine:

    'mps.dequantize' op operand #2 must be tensor of quantized values,
    but got 'tensor<1xf16>'
    ... failed assertion `original module failed verification'

The MPSGraph lowering of constexpr_blockwise_shift_scale assumes the data
operand is a quantized integer tensor (it lowers to mps.dequantize); with
enable_per_channel_scale=True, the data is the dense fp16 weight, which fails
that assumption. CPU and GPU compute units accept the wrapper and predict
correctly; only the ANE-targeted MIL -> MPSGraph dispatch is broken.

Fix: bake per_channel_scale into the LUT entries at compile time and re-emit
constexpr_lut_to_dense, instead of leaving the scale as a runtime constexpr.
Both data and scale are fp16 and the wrapper's only effect is data * scale, so
the fold is mathematically identical. The failing MPSGraph dispatch is
eliminated entirely, and CPU / GPU numerics stay bit-identical with the prior
behavior. Resulting graph also has one fewer runtime constexpr per palettized
const.

Test updated: TestPalettizeWeights::test_palettization_pcs previously asserted
that the constexpr_blockwise_shift_scale wrapper was emitted; it now asserts
the wrapper is absent (the LUT is pre-scaled). Numerical equivalence vs the
unpalettized model is verified by the existing verify_model_outputs call on
macOS 15+.

Tested:
  - test_palettization_pcs:                                    PASS
  - All 155 TestPalettizeWeights / TestJointCompressWeights:   PASS
  - Manual: Qwen3-VL 2B stateful chunk on macOS 26 + M4 ANE:
    MPSGraph verification crash gone (was reproducible at every load).
