
HyperPod PyTorch job does not request EFA resources #306

@giuseppeporcelli

Description


When submitting a PyTorch job with:

hyp create hyp-pytorch-job \
    .... \
    --instance-type ml.g5.12xlarge \
    --node-count 2 \

The EFA resources are not requested, so the job does not use EFA. The generated YAML does not contain vpc.amazonaws.com/efa: 1 in the container resources, and as a consequence the job falls back to TCP.

Example:

hyp create hyp-pytorch-job \
    --debug True \
    --job-name qwen3-4b-thinking-2507-fsdp \
    --image <account_id>.dkr.ecr.us-west-2.amazonaws.com/qwen3-finetuning:pytorch2.8-cu129 \
    --command '[hyperpodrun, --nnodes=2:2, --nproc_per_node=4, /data/qwen-cli-example/scripts/train.py]' \
    --args '[--config, /data/qwen-cli-example/args.yaml]' \
    --environment '{"LOGLEVEL": "INFO", "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True", "NCCL_DEBUG": "INFO", "NCCL_SOCKET_IFNAME": "^lo", "TORCH_NCCL_ASYNC_ERROR_HANDLING": "1", "FI_PROVIDER": "efa", "FI_EFA_FORK_SAFE": "1", "NCCL_PROTO": "simple"}' \
    --pull-policy "IfNotPresent" \
    --instance-type ml.g5.12xlarge \
    --node-count 2 \
    --tasks-per-node 4 \
    --deep-health-check-passed-nodes-only false \
    --max-retry 100 \
    --volume name=shmem,type=hostPath,mount_path=/dev/shm,path=/dev/shm,read_only=false \
    --volume name=local,type=hostPath,mount_path=/local,path=/mnt/k8s-disks/0,read_only=false \
    --volume name=fsx-volume,type=pvc,mount_path=/data,claim_name=fsx-claim,read_only=false

The resulting YAML is:

apiVersion: sagemaker.amazonaws.com/v1
kind: HyperPodPyTorchJob
metadata:
  name: qwen3-4b-thinking-2507-fsdp
  namespace: default
spec:
  nprocPerNode: '4'
  replicaSpecs:
  - name: pod
    replicas: 2
    spares: 0
    template:
      metadata:
        name: qwen3-4b-thinking-2507-fsdp
      spec:
        containers:
        - args:
          - --config
          - /data/qwen-cli-example/args.yaml
          command:
          - hyperpodrun
          - --nnodes=2:2
          - --nproc_per_node=4
          - /data/qwen-cli-example/scripts/train.py
          env:
          - name: LOGLEVEL
            value: INFO
          - name: PYTORCH_CUDA_ALLOC_CONF
            value: expandable_segments:True
          - name: NCCL_DEBUG
            value: INFO
          - name: NCCL_SOCKET_IFNAME
            value: ^lo
          - name: TORCH_NCCL_ASYNC_ERROR_HANDLING
            value: '1'
          - name: FI_PROVIDER
            value: efa
          - name: FI_EFA_FORK_SAFE
            value: '1'
          - name: NCCL_PROTO
            value: simple
          image: <account_id>.dkr.ecr.us-west-2.amazonaws.com/qwen3-finetuning:pytorch2.8-cu129
          imagePullPolicy: IfNotPresent
          name: pytorch-job-container
          resources:
            limits:
              memory: 164Gi
              nvidia.com/gpu: 4
            requests:
              cpu: '44'
              memory: 164Gi
              nvidia.com/gpu: 4
          volumeMounts:
          - mountPath: /dev/shm
            name: shmem
          - mountPath: /local
            name: local
          - mountPath: /data
            name: fsx-volume
        nodeSelector:
          node.kubernetes.io/instance-type: ml.g5.12xlarge
        volumes:
        - hostPath:
            path: /dev/shm
          name: shmem
        - hostPath:
            path: /mnt/k8s-disks/0
          name: local
        - name: fsx-volume
          persistentVolumeClaim:
            claimName: fsx-claim
            readOnly: false
  runPolicy:
    cleanPodPolicy: None
    jobMaxRetryCount: 100
    ttlSecondsAfterFinished: 0

As you can see, the requests and limits do not include the EFA resource, so the job uses TCP, as confirmed by the following messages in the log:

qwen3-4b-thinking-2507-fsdp-pod-0:63:133 [2] NCCL INFO NET/IB : No device found.
qwen3-4b-thinking-2507-fsdp-pod-0:63:133 [2] NCCL INFO NET/IB : Using [RO]; OOB eth0:10.1.213.110<0>
qwen3-4b-thinking-2507-fsdp-pod-0:63:133 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^lo
qwen3-4b-thinking-2507-fsdp-pod-0:63:133 [2] NCCL INFO NET/Socket : Using [0]eth0:10.1.213.110<0>
qwen3-4b-thinking-2507-fsdp-pod-0:63:133 [2] NCCL INFO Initialized NET plugin Socket
qwen3-4b-thinking-2507-fsdp-pod-0:63:133 [2] NCCL INFO Assigned NET plugin Socket to comm
qwen3-4b-thinking-2507-fsdp-pod-0:63:133 [2] NCCL INFO Using network Socket
qwen3-4b-thinking-2507-fsdp-pod-0:62:131 [1] NCCL INFO NET/OFI No eligible providers were found
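
The nodes themselves advertise the EFA device (the EFA Kubernetes device plugin is deployed on the cluster; otherwise the manual workaround below would not schedule). A quick way to verify the allocatable resource on a node:

# Verify that the node exposes the EFA extended resource
# (<node-name> is a placeholder for one of the ml.g5.12xlarge nodes)
kubectl describe node <node-name> | grep -i "vpc.amazonaws.com/efa"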

If I add the EFA resource explicitly to the generated YAML and submit it with kubectl, it works as expected.
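
For reference, the manual change is roughly the following (a sketch of what I add by hand to the container resources of the generated manifest; the count of 1 corresponds to the vpc.amazonaws.com/efa: 1 mentioned above, i.e. one EFA interface per ml.g5.12xlarge node):

          resources:
            limits:
              memory: 164Gi
              nvidia.com/gpu: 4
              vpc.amazonaws.com/efa: 1   # added manually
            requests:
              cpu: '44'
              memory: 164Gi
              nvidia.com/gpu: 4
              vpc.amazonaws.com/efa: 1   # added manually

The edited manifest is then submitted directly, bypassing the CLI:

kubectl apply -f <edited-job-manifest>.yaml

With the EFA resource requested, the job no longer falls back to TCP.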
