
Require --instance-type when specifying accelerator resources (#317) #393

Open

FarhanTejani wants to merge 1 commit into aws:main from FarhanTejani:fix/accelerators-require-instance-type

Conversation

@FarhanTejani
Member

What's changing and why?

Closes #317 (the --node-count mutual exclusivity part was fixed in #383)

When --accelerators or --accelerators-limit is specified without --instance-type, the CLI silently produces nvidia.com/gpu: 0 in the k8s resource spec instead of the user's requested value. This happens because _set_default_accelerators_val needs the instance type to determine the accelerator key (nvidia.com/gpu vs aws.amazon.com/neuroncore), and when it can't determine it, it silently returns None for both values.

This change adds an early validation that raises a clear error when accelerators are specified without --instance-type, and removes the dead code path in _get_limits that hardcoded nvidia.com/gpu: 0.
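A minimal sketch of the fail-fast guard described above (the function name, parameter names, and error message here are illustrative assumptions, not the PR's exact implementation):

```python
def validate_accelerator_args(accelerators, accelerators_limit, instance_type):
    """Fail fast when accelerator resources are requested without an
    instance type: without it, the accelerator key (nvidia.com/gpu vs.
    aws.amazon.com/neuroncore) cannot be determined, which previously
    led to a silent nvidia.com/gpu: 0 in the rendered spec."""
    if (accelerators is not None or accelerators_limit is not None) and not instance_type:
        raise ValueError(
            "--instance-type is required when specifying accelerator resources"
        )
```

Running such a check early, before any spec rendering, is what turns the silent-zero behavior into the explicit error shown in the "After" transcript.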

Before/After UX

Before:

$ hyp create hyp-pytorch-job --accelerators 4
# Silently generates: nvidia.com/gpu: '0'

After:

$ hyp create hyp-pytorch-job --accelerators 4
# Error: --instance-type is required when specifying accelerator resources

How was this change tested?

  • Unit tests pass

Are unit tests added?

Updated 3 tests that asserted the old behavior (nvidia.com/gpu: 0 on CPU-only / invalid instances)

Are integration tests added?

N/A

Reviewer Guidelines

‼️ Merge Requirements: PRs with failing integration tests cannot be merged without justification.

One of the following must be true:

  • All automated PR checks pass
  • Failed tests include local run results/screenshots proving they work
  • Changes are documentation-only

@FarhanTejani FarhanTejani requested a review from a team as a code owner March 19, 2026 00:20
spec = template.get('spec', {})
node_selector = spec.get('nodeSelector', {})
instance_type = node_selector.get(INSTANCE_TYPE_LABEL) if node_selector else None
if not instance_type:
Collaborator

If instance_type is None, the method returns None before ever reaching the new check at line 152. Move your validation earlier.
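The control flow the reviewer is pointing at looks roughly like this (a simplified paraphrase; the label constant's value and the function name are assumptions):

```python
INSTANCE_TYPE_LABEL = "node.kubernetes.io/instance-type"  # assumed label key

def resolve_instance_type(template):
    """Paraphrase of the flow under review: when the template carries no
    instance type, the method returns None immediately, so any validation
    placed after this point never executes."""
    spec = template.get('spec', {})
    node_selector = spec.get('nodeSelector', {})
    instance_type = node_selector.get(INSTANCE_TYPE_LABEL) if node_selector else None
    if not instance_type:
        return None  # early return: checks added below this line are dead code
    return instance_type
```

This is exactly the case the PR targets (accelerators given, instance type absent), so the guard has to run before this early return.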

@mollyheamazon
Collaborator

Can you re-enable these unit tests as part of your fix:

class TestPyTorchJobConfigEFA(unittest.TestCase):
    """Test EFA resource allocation in PyTorchJobConfig"""

    # def test_single_node_no_efa(self):
    #     """Test that single-node jobs don't get EFA resources"""
    #     config = PyTorchJobConfig(
    #         job_name="test-single-node",
    #         image="pytorch:latest",
    #         node_count=1,
    #         accelerators=2,
    #         instance_type="ml.p4d.24xlarge"
    #     )
    #     job = config.to_domain()
    #     container = job.replicaSpecs[0].template.spec.containers[0]
    #     # Should not have EFA resources
    #     self.assertNotIn("vpc.amazonaws.com/efa", container.resources.requests)
    #     self.assertNotIn("vpc.amazonaws.com/efa", container.resources.limits)
    #     # Should have GPU resources
    #     self.assertEqual(container.resources.requests["nvidia.com/gpu"], "2")

    # def test_multi_node_with_efa(self):
    #     """Test that multi-node jobs automatically get EFA resources"""
    #     config = PyTorchJobConfig(
    #         job_name="test-multi-node",
    #         image="pytorch:latest",
    #         node_count=4,
    #         accelerators=8,
    #         instance_type="ml.p4d.24xlarge"
    #     )
    #     job = config.to_domain()
    #     container = job.replicaSpecs[0].template.spec.containers[0]
    #     # Should have EFA resources
    #     self.assertEqual(container.resources.requests["vpc.amazonaws.com/efa"], "1")
    #     self.assertEqual(container.resources.limits["vpc.amazonaws.com/efa"], "1")
    #     # Should also have GPU resources
    #     self.assertEqual(container.resources.requests["nvidia.com/gpu"], "8")

    # def test_multi_node_with_memory_and_cpu(self):
    #     """Test EFA with other resource types"""
    #     config = PyTorchJobConfig(
    #         job_name="test-multi-resources",
    #         image="pytorch:latest",
    #         node_count=2,
    #         accelerators=4,
    #         vcpu=16.0,
    #         memory=64.0,
    #         instance_type="ml.p4d.24xlarge"
    #     )
    #     job = config.to_domain()
    #     container = job.replicaSpecs[0].template.spec.containers[0]
    #     # Should have all resources including EFA
    #     self.assertEqual(container.resources.requests["vpc.amazonaws.com/efa"], "1")
    #     self.assertEqual(container.resources.requests["nvidia.com/gpu"], "4")
    #     self.assertEqual(container.resources.requests["cpu"], "16.0")
    #     self.assertEqual(container.resources.requests["memory"], "64.0Gi")

    # def test_accelerators_without_instance_type(self):
    #     """Test that accelerators work without instance_type (fixes the main issue)"""
    #     config = PyTorchJobConfig(
    #         job_name="test-no-instance-type",
    #         image="pytorch:latest",
    #         accelerators=4
    #         # No instance_type specified
    #     )
    #     job = config.to_domain()
    #     container = job.replicaSpecs[0].template.spec.containers[0]
    #     # Should respect accelerators value even without instance_type
    #     self.assertEqual(container.resources.requests["nvidia.com/gpu"], "4")
    #     # Limits should default to "0" since accelerators_limit not specified
    #     self.assertEqual(container.resources.limits["nvidia.com/gpu"], "0")
They were commented out previously for a merge conflict resolution.


Development

Successfully merging this pull request may close these issues.

hyp-pytorch-job does not correctly use --accelerators/--accelerators-limits to specify resource requests and limits
