
Conversation

@AlexCheema (Contributor) commented on Jan 20, 2026

Motivation

Add support for GLM-4.7-Flash, a lighter variant of GLM-4.7 with the glm4_moe_lite architecture. These models are smaller and faster while maintaining good performance.

Changes

  1. Added 4 new model cards for GLM-4.7-Flash variants:

    • glm-4.7-flash-4bit (~18 GB)
    • glm-4.7-flash-5bit (~21 GB)
    • glm-4.7-flash-6bit (~25 GB)
    • glm-4.7-flash-8bit (~32 GB)

    All variants have:

    • n_layers: 47 (vs 91 in GLM-4.7)
    • hidden_size: 2048 (vs 5120 in GLM-4.7)
    • supports_tensor: True (native shard() method; see the sketch after this list)
  2. Bumped mlx from 0.30.1 to 0.30.3 - required by mlx-lm 0.30.4

  3. Updated mlx-lm from 0.30.2 to 0.30.4 - adds glm4_moe_lite architecture support

  4. Added type ignores in auto_parallel.py to satisfy the stricter type annotations in the new mlx-lm

  5. Fixed EOS token IDs for GLM-4.7-Flash - it uses a different tokenizer with IDs [154820, 154827, 154829], vs [151336, 151329, 151338] in other GLM models

  6. Renamed MLX_IBV_DEVICES to MLX_JACCL_DEVICES - env var name changed in new mlx
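
For item 1, a minimal sketch of what one of the new model cards carries is shown below. The field names and structure are illustrative assumptions, not copied from the repository's model registry; only the values come from this PR:

```python
# Hypothetical model card for the smallest GLM-4.7-Flash variant.
# Field names are assumed for illustration.
GLM_4_7_FLASH_4BIT = {
    "id": "glm-4.7-flash-4bit",
    "architecture": "glm4_moe_lite",   # added in mlx-lm 0.30.4
    "n_layers": 47,                    # vs 91 in full GLM-4.7
    "hidden_size": 2048,               # vs 5120 in full GLM-4.7
    "supports_tensor": True,           # model exposes a native shard() method
    "approx_size_gb": 18,
}
```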

Why It Works

The model cards follow the same pattern as existing GLM-4.7 models. Tensor parallel support is enabled because GLM-4.7-Flash implements the native shard() method in mlx-lm 0.30.4, which is automatically detected in auto_parallel.py.
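
A minimal sketch of that detection, assuming a helper along these lines in auto_parallel.py (the function and argument names are placeholders, not the actual code):

```python
def supports_tensor_parallel(model) -> bool:
    # Hypothetical check: treat a model as tensor-parallel capable when its
    # mlx-lm implementation exposes a callable shard() method, as
    # glm4_moe_lite does starting with mlx-lm 0.30.4.
    return callable(getattr(model, "shard", None))
```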

GLM-4.7-Flash uses a new tokenizer with different special token IDs. Without the correct EOS tokens, generation wouldn't stop properly.
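
In other words, the stop check only fires when a generated token ID is in the model's EOS set. A sketch with the IDs from this PR (the names are placeholders):

```python
# EOS token IDs for GLM-4.7-Flash; other GLM models use {151336, 151329, 151338}.
GLM_4_7_FLASH_EOS_IDS = {154820, 154827, 154829}

def should_stop(token_id: int) -> bool:
    # With the wrong ID set this never returns True, so generation runs on
    # until the max-token limit is hit.
    return token_id in GLM_4_7_FLASH_EOS_IDS
```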

Test Plan

Manual Testing

Tested generation with GLM-4.7-Flash-4bit - it now correctly stops at EOS tokens.
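
One way to reproduce that check with the upstream mlx-lm Python API is sketched below; the model path is a placeholder and the published repository name may differ:

```python
from mlx_lm import load, generate

# Placeholder model path; substitute the actual GLM-4.7-Flash-4bit card/repo.
model, tokenizer = load("mlx-community/GLM-4.7-Flash-4bit")

# If the EOS IDs are wired up correctly, the output ends at an EOS token
# instead of running all the way to max_tokens.
print(generate(model, tokenizer, prompt="Say hello and stop.", max_tokens=64))
```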

Automated Testing

  • basedpyright: 0 errors
  • ruff check: All checks passed
  • pytest: 162/162 tests pass (excluding pre-existing test_distributed_fix.py timeout failures)

🤖 Generated with Claude Code

@AlexCheema force-pushed the add-glm-4.7-flash-model-cards branch 2 times, most recently from 4ba22ab to 35ec820 on January 20, 2026 at 03:19
- Add model cards for GLM-4.7-Flash variants (4bit, 5bit, 6bit, 8bit)
- Bump mlx from 0.30.1 to 0.30.3
- Update mlx-lm from 0.30.2 to 0.30.4 for glm4_moe_lite architecture support
- Add EOS token IDs for GLM-4.7-Flash (different tokenizer than other GLM models)
- Add type ignores for stricter type annotations in new mlx-lm
- Rename MLX_IBV_DEVICES to MLX_JACCL_DEVICES env var

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@AlexCheema force-pushed the add-glm-4.7-flash-model-cards branch from 35ec820 to 06a05b1 on January 20, 2026 at 03:27
@AlexCheema merged commit 176ab5b into main on Jan 20, 2026
8 checks passed
@AlexCheema deleted the add-glm-4.7-flash-model-cards branch on January 20, 2026 at 03:58

```python
# TODO: update once upstream fixes
logger.info(
    f"rank {rank} MLX_IBV_DEVICES: {coordination_file} with devices: {jaccl_devices_json}"
)
```

Member:
this is plain wrong last i checked...

Member:
update: it will just break next week
