
HyperPod PyTorch job does not request EFA resources #306

@giuseppeporcelli

Description


When submitting a PyTorch job with:

hyp create hyp-pytorch-job \
    .... \
    --instance-type ml.g5.12xlarge \
    --node-count 2 \

The EFA resources are not requested, so the job does not use EFA. The generated YAML does not contain vpc.amazonaws.com/efa: 1 in the container resources, and as a consequence the job falls back to TCP.

Example:

hyp create hyp-pytorch-job \
    --debug True \
    --job-name qwen3-4b-thinking-2507-fsdp \
    --image <account_id>.dkr.ecr.us-west-2.amazonaws.com/qwen3-finetuning:pytorch2.8-cu129 \
    --command '[hyperpodrun, --nnodes=2:2, --nproc_per_node=4, /data/qwen-cli-example/scripts/train.py]' \
    --args '[--config, /data/qwen-cli-example/args.yaml]' \
    --environment '{"LOGLEVEL": "INFO", "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True", "NCCL_DEBUG": "INFO", "NCCL_SOCKET_IFNAME": "^lo", "TORCH_NCCL_ASYNC_ERROR_HANDLING": "1", "FI_PROVIDER": "efa", "FI_EFA_FORK_SAFE": "1", "NCCL_PROTO": "simple"}' \
    --pull-policy "IfNotPresent" \
    --instance-type ml.g5.12xlarge \
    --node-count 2 \
    --tasks-per-node 4 \
    --deep-health-check-passed-nodes-only false \
    --max-retry 100 \
    --volume name=shmem,type=hostPath,mount_path=/dev/shm,path=/dev/shm,read_only=false \
    --volume name=local,type=hostPath,mount_path=/local,path=/mnt/k8s-disks/0,read_only=false \
    --volume name=fsx-volume,type=pvc,mount_path=/data,claim_name=fsx-claim,read_only=false

The resulting YAML is:

apiVersion: sagemaker.amazonaws.com/v1
kind: HyperPodPyTorchJob
metadata:
  name: qwen3-4b-thinking-2507-fsdp
  namespace: default
spec:
  nprocPerNode: '4'
  replicaSpecs:
  - name: pod
    replicas: 2
    spares: 0
    template:
      metadata:
        name: qwen3-4b-thinking-2507-fsdp
      spec:
        containers:
        - args:
          - --config
          - /data/qwen-cli-example/args.yaml
          command:
          - hyperpodrun
          - --nnodes=2:2
          - --nproc_per_node=4
          - /data/qwen-cli-example/scripts/train.py
          env:
          - name: LOGLEVEL
            value: INFO
          - name: PYTORCH_CUDA_ALLOC_CONF
            value: expandable_segments:True
          - name: NCCL_DEBUG
            value: INFO
          - name: NCCL_SOCKET_IFNAME
            value: ^lo
          - name: TORCH_NCCL_ASYNC_ERROR_HANDLING
            value: '1'
          - name: FI_PROVIDER
            value: efa
          - name: FI_EFA_FORK_SAFE
            value: '1'
          - name: NCCL_PROTO
            value: simple
          image: <account_id>.dkr.ecr.us-west-2.amazonaws.com/qwen3-finetuning:pytorch2.8-cu129
          imagePullPolicy: IfNotPresent
          name: pytorch-job-container
          resources:
            limits:
              memory: 164Gi
              nvidia.com/gpu: 4
            requests:
              cpu: '44'
              memory: 164Gi
              nvidia.com/gpu: 4
          volumeMounts:
          - mountPath: /dev/shm
            name: shmem
          - mountPath: /local
            name: local
          - mountPath: /data
            name: fsx-volume
        nodeSelector:
          node.kubernetes.io/instance-type: ml.g5.12xlarge
        volumes:
        - hostPath:
            path: /dev/shm
          name: shmem
        - hostPath:
            path: /mnt/k8s-disks/0
          name: local
        - name: fsx-volume
          persistentVolumeClaim:
            claimName: fsx-claim
            readOnly: false
  runPolicy:
    cleanPodPolicy: None
    jobMaxRetryCount: 100
    ttlSecondsAfterFinished: 0

As you can see, the requests and limits do not include the EFA resource, so the job uses TCP, as confirmed by the following messages in the log:

qwen3-4b-thinking-2507-fsdp-pod-0:63:133 [2] NCCL INFO NET/IB : No device found.
qwen3-4b-thinking-2507-fsdp-pod-0:63:133 [2] NCCL INFO NET/IB : Using [RO]; OOB eth0:10.1.213.110<0>
qwen3-4b-thinking-2507-fsdp-pod-0:63:133 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^lo
qwen3-4b-thinking-2507-fsdp-pod-0:63:133 [2] NCCL INFO NET/Socket : Using [0]eth0:10.1.213.110<0>
qwen3-4b-thinking-2507-fsdp-pod-0:63:133 [2] NCCL INFO Initialized NET plugin Socket
qwen3-4b-thinking-2507-fsdp-pod-0:63:133 [2] NCCL INFO Assigned NET plugin Socket to comm
qwen3-4b-thinking-2507-fsdp-pod-0:63:133 [2] NCCL INFO Using network Socket
qwen3-4b-thinking-2507-fsdp-pod-0:62:131 [1] NCCL INFO NET/OFI No eligible providers were found
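
The nodes themselves advertise the EFA device (the EFA Kubernetes device plugin is deployed on the cluster; otherwise the manual workaround below would not schedule). A quick way to verify the allocatable resource on a node:

# Verify that the node exposes the EFA extended resource
# (<node-name> is a placeholder for one of the ml.g5.12xlarge nodes)
kubectl describe node <node-name> | grep -i "vpc.amazonaws.com/efa"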

If I add the EFA resource explicitly to the generated YAML and submit it with kubectl, it works as expected.
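
For reference, the manual change is roughly the following (a sketch of what I add by hand to the container resources of the generated manifest; the count of 1 corresponds to the vpc.amazonaws.com/efa: 1 mentioned above, i.e. one EFA interface per ml.g5.12xlarge node):

          resources:
            limits:
              memory: 164Gi
              nvidia.com/gpu: 4
              vpc.amazonaws.com/efa: 1   # added manually
            requests:
              cpu: '44'
              memory: 164Gi
              nvidia.com/gpu: 4
              vpc.amazonaws.com/efa: 1   # added manually

The edited manifest is then submitted directly, bypassing the CLI:

kubectl apply -f <edited-job-manifest>.yaml

With the EFA resource requested, the job no longer falls back to TCP.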
