Conversation

@hhzhang16 hhzhang16 commented Nov 3, 2025

Overview:

Profiling throws an error when num_physical_experts % ep_size != 0 — e.g., num_physical_experts = 256 with an ep_size range of [6, 14, 22, 30], none of which divide 256. Instead of failing, grab num_experts info from the model if possible (fall back to defaults otherwise) and filter the GPU counts accordingly.
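The divisor filtering this PR introduces can be sketched as follows. This is a hypothetical sketch — the function and variable names (filter_gpu_counts, gpu_counts) are illustrative, not the exact code in profile_sla.py:

```python
# Hypothetical sketch of the GPU-count filtering described in this PR.
# Names are illustrative; the real implementation lives in profile_sla.py.
def filter_gpu_counts(gpu_counts, num_experts):
    """Keep only GPU counts that evenly divide the expert count."""
    if num_experts is None:
        # No MoE expert info available: leave the search space unchanged.
        return list(gpu_counts)
    valid = [n for n in gpu_counts if num_experts % n == 0]
    if not valid:
        raise ValueError(
            f"No GPU count in {gpu_counts} evenly divides num_experts={num_experts}"
        )
    return valid
```

With the motivating example from the overview, filter_gpu_counts([6, 14, 22, 30], 256) raises a ValueError because none of those counts divide 256, surfacing the misconfiguration up front instead of deep inside profiling.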

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features

    • Added automatic expert count detection for Mixture of Experts (MoE) models during profiling.
  • Bug Fixes

    • GPU count validation now ensures counts evenly divide the number of experts in MoE models, raising informative errors when invalid configurations are detected.
  • Chores

    • Enhanced logging to include expert count information during profiling preparation.

Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
@hhzhang16 hhzhang16 requested review from a team as code owners November 3, 2025 11:43
@github-actions github-actions bot added the feat label Nov 3, 2025
coderabbitai bot commented Nov 3, 2025

Walkthrough

The changes add support for handling the num_experts parameter when profiling Mixture of Experts (MoE) models. The num_experts value is extracted from model configs, logged during processing, and used to filter valid GPU counts during profiling to ensure counts evenly divide the expert count.

Changes

  • MoE Expert Count Handling — benchmarks/profiler/profile_sla.py, benchmarks/profiler/utils/model_info.py, benchmarks/profiler/utils/search_space_autogen.py: Adds detection of num_experts from model config attributes, filters GPU counts to only valid divisors of num_experts, and propagates the value through the profiling pipeline with logging and error handling for invalid configurations.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Filtering logic and error handling: Review the divisor-filtering algorithm and ValueError construction in profile_sla.py for correctness and edge cases
  • Attribute prioritization: Verify the fallback chain for expert count extraction in model_info.py captures all relevant MoE config variants
  • Integration consistency: Ensure num_experts flows correctly across all three files and state synchronization with args.num_experts
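The fallback chain mentioned above can be sketched as follows. This is a hypothetical sketch of attribute extraction from a HuggingFace-style config object — the attribute names shown are common MoE config fields, not necessarily the exact list used in model_info.py:

```python
from types import SimpleNamespace

# Hypothetical sketch of the expert-count fallback chain; the attribute
# names are common MoE config fields, not the verbatim list in model_info.py.
def get_num_experts(config):
    """Return the first non-None expert-count attribute, or None."""
    expert_attrs = ("num_experts", "n_routed_experts", "num_local_experts")
    for attr in expert_attrs:
        value = getattr(config, attr, None)
        if value is not None:
            return value
    return None  # not an MoE model, or the config does not expose a count


# Example: some configs expose n_routed_experts rather than num_experts.
cfg = SimpleNamespace(n_routed_experts=64)
```

Checking each attribute in priority order and skipping None values is what the reviewer is asked to verify: a config that sets an earlier attribute to None should still fall through to a later one.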

Poem

🐰 Through expert routers we now peer,
Counting each local engineer!
Filter GPUs by experts' song,
Divisors guide the profiling throng! 🧮✨

Pre-merge checks

✅ Passed checks (2 passed)
  • Title check: ✅ Passed — The pull request title 'feat: grab num_experts info from model info if possible' clearly and concisely describes the main change in the changeset. It directly relates to the core functionality added across the three modified files: extracting num_experts from model configuration when available and using it to filter GPU counts during MoE profiling. The title accurately captures the primary objective without being vague or off-topic.
  • Description check: ✅ Passed — The pull request description provides an overview explaining the problem context (num_physical_experts % ep_size != 0 errors) and the solution approach (grab num_experts from model info). However, the 'Details' and 'Where should the reviewer start?' sections are not filled out; they remain as template placeholders with only comments. The 'Related Issues' section also contains a placeholder (#xxx) rather than an actual issue reference. While the overview conveys the intent adequately, the missing details and reviewer guidance represent incomplete adherence to the template structure.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
benchmarks/profiler/utils/model_info.py (1)

132-146: Consider logging when num_experts cannot be determined for MoE models.

The extraction logic is well-structured with appropriate attribute prioritization. However, when a model is identified as MoE but num_experts cannot be determined (remains None), this could lead to issues during profiling.

Consider adding a warning log when config.is_moe is True but num_experts is None after the loop, as this helps operators understand potential profiling constraints.

Example addition after line 146:

```diff
         for attr in expert_attrs:
             if hasattr(config, attr):
                 value = getattr(config, attr)
                 if value is not None:
                     num_experts = value
                     break
+
+        if num_experts is None:
+            logger.warning(
+                f"MoE model detected but num_experts could not be determined from config attributes: {expert_attrs}"
+            )
```

Note: You'll need to add import logging at the top and initialize a logger if not already present.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 30d8588 and 185bd56.

📒 Files selected for processing (3)
  • benchmarks/profiler/profile_sla.py (1 hunks)
  • benchmarks/profiler/utils/model_info.py (1 hunks)
  • benchmarks/profiler/utils/search_space_autogen.py (2 hunks)
🧰 Additional context used
🪛 Ruff (0.14.2)
benchmarks/profiler/profile_sla.py

126-126: Abstract raise to an inner function

(TRY301)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (4)
benchmarks/profiler/utils/search_space_autogen.py (2)

63-70: LGTM! Clean conditional logging implementation.

The conditional string construction properly checks for num_experts presence and includes it in the log message only when available. The approach is clear and maintainable.
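The conditional construction praised above can be sketched as follows. This is a hypothetical sketch — the function name and message format are illustrative, not the exact code in search_space_autogen.py:

```python
# Hypothetical sketch of conditional log-message construction: mention
# num_experts only when the model config exposed one. Names are illustrative.
def format_profiling_log(model_name, model_info):
    """Build a profiling-prep log line, appending num_experts when known."""
    num_experts = model_info.get("num_experts")
    note = f" (num_experts={num_experts})" if num_experts is not None else ""
    return f"Preparing profiling for {model_name}{note}"
```

Using dict.get keeps the non-MoE path silent: a missing key yields None, the note collapses to an empty string, and the log line stays unchanged for dense models.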


96-96: LGTM! Proper attribute assignment with None handling.

Using model_info.get("num_experts") correctly handles cases where num_experts isn't available, allowing downstream code to check for None and proceed accordingly.

benchmarks/profiler/utils/model_info.py (1)

148-153: LGTM! Proper dictionary extension.

Adding num_experts to the returned dictionary is straightforward and maintains consistency with the existing return structure. The None value is appropriately handled by downstream consumers.

benchmarks/profiler/profile_sla.py (1)

127-131: LGTM! Helpful logging of filtering transformation.

The informational log clearly communicates when and why GPU counts were filtered, showing both the original and filtered sets along with the constraint. This aids debugging and understanding of the profiling flow.

@tedzhouhk tedzhouhk marked this pull request as draft November 3, 2025 17:12
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
@tedzhouhk tedzhouhk marked this pull request as ready for review November 3, 2025 18:12
@tedzhouhk tedzhouhk merged commit f31a1da into main Nov 3, 2025
34 of 35 checks passed
@tedzhouhk tedzhouhk deleted the bug-5630802-fix branch November 3, 2025 20:59
tedzhouhk added a commit that referenced this pull request Nov 3, 2025
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Co-authored-by: hongkuanz <hongkuanz@nvidia.com>
Co-authored-by: Hongkuan Zhou <tedzhouhk@gmail.com>