Add GLM-4.7-Flash model cards (4bit, 5bit, 6bit, 8bit) #1214
Motivation
Add support for GLM-4.7-Flash, a lighter variant of GLM-4.7 with the `glm4_moe_lite` architecture. These models are smaller and faster while maintaining good performance.

Changes
Added 4 new model cards for GLM-4.7-Flash variants (a rough card sketch follows this change list):

- `glm-4.7-flash-4bit` (~18 GB)
- `glm-4.7-flash-5bit` (~21 GB)
- `glm-4.7-flash-6bit` (~25 GB)
- `glm-4.7-flash-8bit` (~32 GB)

All variants have:

- `n_layers: 47` (vs 91 in GLM-4.7)
- `hidden_size: 2048` (vs 5120 in GLM-4.7)
- `supports_tensor: True` (native `shard()` method)

Other changes:

- Bumped mlx from 0.30.1 to 0.30.3 - required by mlx-lm 0.30.4
- Updated mlx-lm from 0.30.2 to 0.30.4 - adds `glm4_moe_lite` architecture support
- Added type ignores in `auto_parallel.py` for the stricter type annotations in the new mlx-lm
- Fixed EOS token IDs for GLM-4.7-Flash - it uses a different tokenizer with IDs `[154820, 154827, 154829]` vs other GLM models' `[151336, 151329, 151338]`
- Renamed `MLX_IBV_DEVICES` to `MLX_JACCL_DEVICES` - the env var name changed in the new mlx (a fallback sketch follows below)
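For reference, here is a rough sketch of what one of the new cards could look like, using the attributes listed above. The field names `model_id`, `approx_size_gb`, and `eos_token_ids` are illustrative assumptions, not this repo's exact card schema:

```python
# Hypothetical card layout; the real model-card schema in this repo may differ.
GLM_4_7_FLASH_4BIT = {
    "model_id": "glm-4.7-flash-4bit",
    "approx_size_gb": 18,
    "n_layers": 47,           # vs 91 in GLM-4.7
    "hidden_size": 2048,      # vs 5120 in GLM-4.7
    "supports_tensor": True,  # native shard() arrives with mlx-lm 0.30.4
    "eos_token_ids": [154820, 154827, 154829],  # GLM-4.7-Flash tokenizer
}
```

On the env var rename: this PR simply switches the name, but if backward compatibility with an older mlx were needed, a fallback read is one option. `_jaccl_devices` is a hypothetical helper, not code from this PR:

```python
import os

def _jaccl_devices() -> str | None:
    # Prefer the new name used by current mlx; fall back to the old one so
    # environments still exporting MLX_IBV_DEVICES keep working.
    return os.environ.get("MLX_JACCL_DEVICES") or os.environ.get("MLX_IBV_DEVICES")
```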
Why It Works

The model cards follow the same pattern as the existing GLM-4.7 models. Tensor parallel support is enabled because GLM-4.7-Flash implements the native `shard()` method in mlx-lm 0.30.4, which is automatically detected in `auto_parallel.py`.
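A minimal sketch of that detection, assuming it reduces to probing the loaded model for a callable `shard` attribute; the actual logic in `auto_parallel.py` may be more involved:

```python
def has_native_shard(model) -> bool:
    # Models that implement native tensor parallelism in mlx-lm 0.30.4
    # expose a shard() method; probing for it is enough to flag support.
    return callable(getattr(model, "shard", None))
```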
GLM-4.7-Flash uses a new tokenizer with different special token IDs. Without the correct EOS tokens, generation wouldn't stop properly.
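To make the failure mode concrete, here is a small sketch of EOS-based stopping. The constant names are hypothetical; the ID lists are the ones from this PR:

```python
GLM_4_7_FLASH_EOS_IDS = {154820, 154827, 154829}  # new GLM-4.7-Flash tokenizer
OTHER_GLM_EOS_IDS = {151336, 151329, 151338}      # older GLM models

def should_stop(token_id: int, eos_ids: set[int]) -> bool:
    # With the wrong set, no generated token ever matches, so decoding
    # never stops at end-of-sequence and runs to the max-token limit.
    return token_id in eos_ids
```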
Test Plan
Manual Testing
Tested generation with GLM-4.7-Flash-4bit; it now correctly stops at EOS tokens.
Automated Testing
- `basedpyright`: 0 errors
- `ruff check`: All checks passed
- `pytest`: 162/162 tests pass (excluding pre-existing `test_distributed_fix.py` timeout failures)

🤖 Generated with Claude Code