Conversation

@hhzhang16 hhzhang16 commented Nov 3, 2025

Overview:

Profiling throws an error when num_physical_experts % ep_size != 0 — e.g., num_physical_experts = 256 with an ep_size range of [6, 14, 22, 30], none of which divide 256. Instead of failing, grab num_experts info from the model if possible (fall back to defaults otherwise) and filter the GPU counts accordingly.
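The divisor filtering this PR introduces can be sketched as follows. This is a hypothetical sketch — the function and variable names (filter_gpu_counts, gpu_counts) are illustrative, not the exact code in profile_sla.py:

```python
# Hypothetical sketch of the GPU-count filtering described in this PR.
# Names are illustrative; the real implementation lives in profile_sla.py.
def filter_gpu_counts(gpu_counts, num_experts):
    """Keep only GPU counts that evenly divide the expert count."""
    if num_experts is None:
        # No MoE expert info available: leave the search space unchanged.
        return list(gpu_counts)
    valid = [n for n in gpu_counts if num_experts % n == 0]
    if not valid:
        raise ValueError(
            f"No GPU count in {gpu_counts} evenly divides num_experts={num_experts}"
        )
    return valid
```

With the motivating example from the overview, filter_gpu_counts([6, 14, 22, 30], 256) raises a ValueError because none of those counts divide 256, surfacing the misconfiguration up front instead of deep inside profiling.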

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features

    • Added automatic expert count detection for Mixture of Experts (MoE) models during profiling.
  • Bug Fixes

    • GPU count validation now ensures counts evenly divide the number of experts in MoE models, raising informative errors when invalid configurations are detected.
  • Chores

    • Enhanced logging to include expert count information during profiling preparation.

Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
@hhzhang16 hhzhang16 requested review from a team as code owners November 3, 2025 11:43
@github-actions github-actions bot added the feat label Nov 3, 2025
coderabbitai bot commented Nov 3, 2025

Walkthrough

The changes add support for handling the num_experts parameter when profiling Mixture of Experts (MoE) models. The num_experts value is extracted from model configs, logged during processing, and used to filter valid GPU counts during profiling to ensure counts evenly divide the expert count.

Changes

  • MoE Expert Count Handling — benchmarks/profiler/profile_sla.py, benchmarks/profiler/utils/model_info.py, benchmarks/profiler/utils/search_space_autogen.py: Adds detection of num_experts from model config attributes, filters GPU counts to only valid divisors of num_experts, and propagates the value through the profiling pipeline with logging and error handling for invalid configurations.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Filtering logic and error handling: Review the divisor-filtering algorithm and ValueError construction in profile_sla.py for correctness and edge cases
  • Attribute prioritization: Verify the fallback chain for expert count extraction in model_info.py captures all relevant MoE config variants
  • Integration consistency: Ensure num_experts flows correctly across all three files and state synchronization with args.num_experts
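The fallback chain mentioned above can be sketched as follows. This is a hypothetical sketch of attribute extraction from a HuggingFace-style config object — the attribute names shown are common MoE config fields, not necessarily the exact list used in model_info.py:

```python
from types import SimpleNamespace

# Hypothetical sketch of the expert-count fallback chain; the attribute
# names are common MoE config fields, not the verbatim list in model_info.py.
def get_num_experts(config):
    """Return the first non-None expert-count attribute, or None."""
    expert_attrs = ("num_experts", "n_routed_experts", "num_local_experts")
    for attr in expert_attrs:
        value = getattr(config, attr, None)
        if value is not None:
            return value
    return None  # not an MoE model, or the config does not expose a count


# Example: some configs expose n_routed_experts rather than num_experts.
cfg = SimpleNamespace(n_routed_experts=64)
```

Checking each attribute in priority order and skipping None values is what the reviewer is asked to verify: a config that sets an earlier attribute to None should still fall through to a later one.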

Poem

🐰 Through expert routers we now peer,
Counting each local engineer!
Filter GPUs by experts' song,
Divisors guide the profiling throng! 🧮✨

Pre-merge checks

✅ Passed checks (2 passed)
  • Title check: ✅ Passed — The pull request title 'feat: grab num_experts info from model info if possible' clearly and concisely describes the main change in the changeset. It directly relates to the core functionality added across the three modified files: extracting num_experts from model configuration when available and using it to filter GPU counts during MoE profiling. The title accurately captures the primary objective without being vague or off-topic.
  • Description check: ✅ Passed — The pull request description provides an overview explaining the problem context (num_physical_experts % ep_size != 0 errors) and the solution approach (grab num_experts from model info). However, the 'Details' and 'Where should the reviewer start?' sections are not filled out; they remain as template placeholders with only comments. The 'Related Issues' section also contains a placeholder (#xxx) rather than an actual issue reference. While the overview conveys the intent adequately, the missing details and reviewer guidance represent incomplete adherence to the template structure.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
benchmarks/profiler/utils/model_info.py (1)

132-146: Consider logging when num_experts cannot be determined for MoE models.

The extraction logic is well-structured with appropriate attribute prioritization. However, when a model is identified as MoE but num_experts cannot be determined (remains None), this could lead to issues during profiling.

Consider adding a warning log when config.is_moe is True but num_experts is None after the loop, as this helps operators understand potential profiling constraints.

Example addition after line 146:

```diff
         for attr in expert_attrs:
             if hasattr(config, attr):
                 value = getattr(config, attr)
                 if value is not None:
                     num_experts = value
                     break
+
+        if num_experts is None:
+            logger.warning(
+                f"MoE model detected but num_experts could not be determined from config attributes: {expert_attrs}"
+            )
```

Note: You'll need to add import logging at the top and initialize a logger if not already present.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 30d8588 and 185bd56.

📒 Files selected for processing (3)
  • benchmarks/profiler/profile_sla.py (1 hunks)
  • benchmarks/profiler/utils/model_info.py (1 hunks)
  • benchmarks/profiler/utils/search_space_autogen.py (2 hunks)
🧰 Additional context used
🪛 Ruff (0.14.2)
benchmarks/profiler/profile_sla.py

126-126: Abstract raise to an inner function

(TRY301)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (4)
benchmarks/profiler/utils/search_space_autogen.py (2)

63-70: LGTM! Clean conditional logging implementation.

The conditional string construction properly checks for num_experts presence and includes it in the log message only when available. The approach is clear and maintainable.
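The conditional construction praised above can be sketched as follows. This is a hypothetical sketch — the function name and message format are illustrative, not the exact code in search_space_autogen.py:

```python
# Hypothetical sketch of conditional log-message construction: mention
# num_experts only when the model config exposed one. Names are illustrative.
def format_profiling_log(model_name, model_info):
    """Build a profiling-prep log line, appending num_experts when known."""
    num_experts = model_info.get("num_experts")
    note = f" (num_experts={num_experts})" if num_experts is not None else ""
    return f"Preparing profiling for {model_name}{note}"
```

Using dict.get keeps the non-MoE path silent: a missing key yields None, the note collapses to an empty string, and the log line stays unchanged for dense models.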


96-96: LGTM! Proper attribute assignment with None handling.

Using model_info.get("num_experts") correctly handles cases where num_experts isn't available, allowing downstream code to check for None and proceed accordingly.

benchmarks/profiler/utils/model_info.py (1)

148-153: LGTM! Proper dictionary extension.

Adding num_experts to the returned dictionary is straightforward and maintains consistency with the existing return structure. The None value is appropriately handled by downstream consumers.

benchmarks/profiler/profile_sla.py (1)

127-131: LGTM! Helpful logging of filtering transformation.

The informational log clearly communicates when and why GPU counts were filtered, showing both the original and filtered sets along with the constraint. This aids debugging and understanding of the profiling flow.

@tedzhouhk tedzhouhk marked this pull request as draft November 3, 2025 17:12
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
@tedzhouhk tedzhouhk marked this pull request as ready for review November 3, 2025 18:12
@tedzhouhk tedzhouhk merged commit f31a1da into main Nov 3, 2025
34 of 35 checks passed
@tedzhouhk tedzhouhk deleted the bug-5630802-fix branch November 3, 2025 20:59
tedzhouhk added a commit that referenced this pull request Nov 3, 2025
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Co-authored-by: hongkuanz <hongkuanz@nvidia.com>
Co-authored-by: Hongkuan Zhou <tedzhouhk@gmail.com>