
Require --instance-type when specifying accelerator resources (#317) #393

Open

FarhanTejani wants to merge 1 commit into aws:main from FarhanTejani:fix/accelerators-require-instance-type

Conversation

@FarhanTejani
Member

What's changing and why?

Closes #317 (the --node-count mutual exclusivity part was fixed in #383)

When --accelerators or --accelerators-limit is specified without --instance-type, the CLI silently produces nvidia.com/gpu: 0 in the k8s resource spec instead of the user's requested value. This happens because _set_default_accelerators_val needs the instance type to determine the accelerator key (nvidia.com/gpu vs aws.amazon.com/neuroncore), and when it can't determine it, it silently returns None for both values.

This change adds an early validation that raises a clear error when accelerators are specified without --instance-type, and removes the dead code path in _get_limits that hardcoded nvidia.com/gpu: 0.
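A minimal sketch of the fail-fast guard described above (the function name, parameter names, and error message here are illustrative assumptions, not the PR's exact implementation):

```python
def validate_accelerator_args(accelerators, accelerators_limit, instance_type):
    """Fail fast when accelerator resources are requested without an
    instance type: without it, the accelerator key (nvidia.com/gpu vs.
    aws.amazon.com/neuroncore) cannot be determined, which previously
    led to a silent nvidia.com/gpu: 0 in the rendered spec."""
    if (accelerators is not None or accelerators_limit is not None) and not instance_type:
        raise ValueError(
            "--instance-type is required when specifying accelerator resources"
        )
```

Running such a check early, before any spec rendering, is what turns the silent-zero behavior into the explicit error shown in the "After" transcript.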

Before/After UX

Before:

$ hyp create hyp-pytorch-job --accelerators 4
# Silently generates: nvidia.com/gpu: '0'

After:

$ hyp create hyp-pytorch-job --accelerators 4
# Error: --instance-type is required when specifying accelerator resources

How was this change tested?

  • Unit tests pass

Are unit tests added?

Updated 3 tests that asserted the old behavior (nvidia.com/gpu: 0 on CPU-only / invalid instances)

Are integration tests added?

N/A

Reviewer Guidelines

‼️ Merge Requirements: PRs with failing integration tests cannot be merged without justification.

One of the following must be true:

  • All automated PR checks pass
  • Failed tests include local run results/screenshots proving they work
  • Changes are documentation-only

@FarhanTejani FarhanTejani requested a review from a team as a code owner March 19, 2026 00:20
spec = template.get('spec', {})
node_selector = spec.get('nodeSelector', {})
instance_type = node_selector.get(INSTANCE_TYPE_LABEL) if node_selector else None
if not instance_type:
Collaborator

If instance_type is None, the method returns None before ever reaching the new check at line 152. Move your validation earlier.
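The control flow the reviewer is pointing at looks roughly like this (a simplified paraphrase; the label constant's value and the function name are assumptions):

```python
INSTANCE_TYPE_LABEL = "node.kubernetes.io/instance-type"  # assumed label key

def resolve_instance_type(template):
    """Paraphrase of the flow under review: when the template carries no
    instance type, the method returns None immediately, so any validation
    placed after this point never executes."""
    spec = template.get('spec', {})
    node_selector = spec.get('nodeSelector', {})
    instance_type = node_selector.get(INSTANCE_TYPE_LABEL) if node_selector else None
    if not instance_type:
        return None  # early return: checks added below this line are dead code
    return instance_type
```

This is exactly the case the PR targets (accelerators given, instance type absent), so the guard has to run before this early return.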

@mollyheamazon
Collaborator

Can you re-enable these unit tests as part of your fix:

class TestPyTorchJobConfigEFA(unittest.TestCase):
    """Test EFA resource allocation in PyTorchJobConfig"""

    # def test_single_node_no_efa(self):
    #     """Test that single-node jobs don't get EFA resources"""
    #     config = PyTorchJobConfig(
    #         job_name="test-single-node",
    #         image="pytorch:latest",
    #         node_count=1,
    #         accelerators=2,
    #         instance_type="ml.p4d.24xlarge"
    #     )
    #     job = config.to_domain()
    #     container = job.replicaSpecs[0].template.spec.containers[0]
    #     # Should not have EFA resources
    #     self.assertNotIn("vpc.amazonaws.com/efa", container.resources.requests)
    #     self.assertNotIn("vpc.amazonaws.com/efa", container.resources.limits)
    #     # Should have GPU resources
    #     self.assertEqual(container.resources.requests["nvidia.com/gpu"], "2")

    # def test_multi_node_with_efa(self):
    #     """Test that multi-node jobs automatically get EFA resources"""
    #     config = PyTorchJobConfig(
    #         job_name="test-multi-node",
    #         image="pytorch:latest",
    #         node_count=4,
    #         accelerators=8,
    #         instance_type="ml.p4d.24xlarge"
    #     )
    #     job = config.to_domain()
    #     container = job.replicaSpecs[0].template.spec.containers[0]
    #     # Should have EFA resources
    #     self.assertEqual(container.resources.requests["vpc.amazonaws.com/efa"], "1")
    #     self.assertEqual(container.resources.limits["vpc.amazonaws.com/efa"], "1")
    #     # Should also have GPU resources
    #     self.assertEqual(container.resources.requests["nvidia.com/gpu"], "8")

    # def test_multi_node_with_memory_and_cpu(self):
    #     """Test EFA with other resource types"""
    #     config = PyTorchJobConfig(
    #         job_name="test-multi-resources",
    #         image="pytorch:latest",
    #         node_count=2,
    #         accelerators=4,
    #         vcpu=16.0,
    #         memory=64.0,
    #         instance_type="ml.p4d.24xlarge"
    #     )
    #     job = config.to_domain()
    #     container = job.replicaSpecs[0].template.spec.containers[0]
    #     # Should have all resources including EFA
    #     self.assertEqual(container.resources.requests["vpc.amazonaws.com/efa"], "1")
    #     self.assertEqual(container.resources.requests["nvidia.com/gpu"], "4")
    #     self.assertEqual(container.resources.requests["cpu"], "16.0")
    #     self.assertEqual(container.resources.requests["memory"], "64.0Gi")

    # def test_accelerators_without_instance_type(self):
    #     """Test that accelerators work without instance_type (fixes the main issue)"""
    #     config = PyTorchJobConfig(
    #         job_name="test-no-instance-type",
    #         image="pytorch:latest",
    #         accelerators=4
    #         # No instance_type specified
    #     )
    #     job = config.to_domain()
    #     container = job.replicaSpecs[0].template.spec.containers[0]
    #     # Should respect accelerators value even without instance_type
    #     self.assertEqual(container.resources.requests["nvidia.com/gpu"], "4")
    #     # Limits should default to "0" since accelerators_limit not specified
    #     self.assertEqual(container.resources.limits["nvidia.com/gpu"], "0")
They were commented out previously for a merge conflict resolution.


Development

Successfully merging this pull request may close these issues.

hyp-pytorch-job does not correctly use --accelerators/--accelerators-limits to specify resource requests and limits
