When submitting a PyTorch job with:
hyp create hyp-pytorch-job
....
--instance-type ml.g5.12xlarge \
--node-count 2 \
the EFA resources are not requested, so the job does not use EFA. The generated YAML does not contain vpc.amazonaws.com/efa: 1, and as a consequence the job falls back to TCP.
Example:
hyp create hyp-pytorch-job \
--debug True \
--job-name qwen3-4b-thinking-2507-fsdp \
--image <account_id>.dkr.ecr.us-west-2.amazonaws.com/qwen3-finetuning:pytorch2.8-cu129 \
--command '[hyperpodrun, --nnodes=2:2, --nproc_per_node=4, /data/qwen-cli-example/scripts/train.py]' \
--args '[--config, /data/qwen-cli-example/args.yaml]' \
--environment '{"LOGLEVEL": "INFO", "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True", "NCCL_DEBUG": "INFO", "NCCL_SOCKET_IFNAME": "^lo", "TORCH_NCCL_ASYNC_ERROR_HANDLING": "1", "FI_PROVIDER": "efa", "FI_EFA_FORK_SAFE": "1", "NCCL_PROTO": "simple"}' \
--pull-policy "IfNotPresent" \
--instance-type ml.g5.12xlarge \
--node-count 2 \
--tasks-per-node 4 \
--deep-health-check-passed-nodes-only false \
--max-retry 100 \
--volume name=shmem,type=hostPath,mount_path=/dev/shm,path=/dev/shm,read_only=false \
--volume name=local,type=hostPath,mount_path=/local,path=/mnt/k8s-disks/0,read_only=false \
--volume name=fsx-volume,type=pvc,mount_path=/data,claim_name=fsx-claim,read_only=false
The resulting YAML is:
apiVersion: sagemaker.amazonaws.com/v1
kind: HyperPodPyTorchJob
metadata:
  name: qwen3-4b-thinking-2507-fsdp
  namespace: default
spec:
  nprocPerNode: '4'
  replicaSpecs:
  - name: pod
    replicas: 2
    spares: 0
    template:
      metadata:
        name: qwen3-4b-thinking-2507-fsdp
      spec:
        containers:
        - args:
          - --config
          - /data/qwen-cli-example/args.yaml
          command:
          - hyperpodrun
          - --nnodes=2:2
          - --nproc_per_node=4
          - /data/qwen-cli-example/scripts/train.py
          env:
          - name: LOGLEVEL
            value: INFO
          - name: PYTORCH_CUDA_ALLOC_CONF
            value: expandable_segments:True
          - name: NCCL_DEBUG
            value: INFO
          - name: NCCL_SOCKET_IFNAME
            value: ^lo
          - name: TORCH_NCCL_ASYNC_ERROR_HANDLING
            value: '1'
          - name: FI_PROVIDER
            value: efa
          - name: FI_EFA_FORK_SAFE
            value: '1'
          - name: NCCL_PROTO
            value: simple
          image: <account_id>.dkr.ecr.us-west-2.amazonaws.com/qwen3-finetuning:pytorch2.8-cu129
          imagePullPolicy: IfNotPresent
          name: pytorch-job-container
          resources:
            limits:
              memory: 164Gi
              nvidia.com/gpu: 4
            requests:
              cpu: '44'
              memory: 164Gi
              nvidia.com/gpu: 4
          volumeMounts:
          - mountPath: /dev/shm
            name: shmem
          - mountPath: /local
            name: local
          - mountPath: /data
            name: fsx-volume
        nodeSelector:
          node.kubernetes.io/instance-type: ml.g5.12xlarge
        volumes:
        - hostPath:
            path: /dev/shm
          name: shmem
        - hostPath:
            path: /mnt/k8s-disks/0
          name: local
        - name: fsx-volume
          persistentVolumeClaim:
            claimName: fsx-claim
            readOnly: false
  runPolicy:
    cleanPodPolicy: None
    jobMaxRetryCount: 100
    ttlSecondsAfterFinished: 0
As you can see, the requests and limits do not include the EFA resource, so the job falls back to TCP, as confirmed by the following log messages:
qwen3-4b-thinking-2507-fsdp-pod-0:63:133 [2] NCCL INFO NET/IB : No device found.
qwen3-4b-thinking-2507-fsdp-pod-0:63:133 [2] NCCL INFO NET/IB : Using [RO]; OOB eth0:10.1.213.110<0>
qwen3-4b-thinking-2507-fsdp-pod-0:63:133 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^lo
qwen3-4b-thinking-2507-fsdp-pod-0:63:133 [2] NCCL INFO NET/Socket : Using [0]eth0:10.1.213.110<0>
qwen3-4b-thinking-2507-fsdp-pod-0:63:133 [2] NCCL INFO Initialized NET plugin Socket
qwen3-4b-thinking-2507-fsdp-pod-0:63:133 [2] NCCL INFO Assigned NET plugin Socket to comm
qwen3-4b-thinking-2507-fsdp-pod-0:63:133 [2] NCCL INFO Using network Socket
qwen3-4b-thinking-2507-fsdp-pod-0:62:131 [1] NCCL INFO NET/OFI No eligible providers were found
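The missing request can also be confirmed directly on the scheduled pod (a quick check, assuming the pod naming seen in the logs above):
kubectl get pod qwen3-4b-thinking-2507-fsdp-pod-0 -o jsonpath='{.spec.containers[0].resources}'
This prints only the nvidia.com/gpu, cpu, and memory entries shown in the generated manifest; there is no vpc.amazonaws.com/efa entry.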
If I add the EFA resource explicitly and submit the YAML with kubectl, it works as expected.
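For reference, the manual workaround is roughly this change to the generated container resources (a sketch; the count of 1 follows the vpc.amazonaws.com/efa: 1 mentioned above and may need to match the number of EFA interfaces available on the chosen instance type):
          resources:
            limits:
              memory: 164Gi
              nvidia.com/gpu: 4
              vpc.amazonaws.com/efa: 1
            requests:
              cpu: '44'
              memory: 164Gi
              nvidia.com/gpu: 4
              vpc.amazonaws.com/efa: 1
It would be good if the CLI added this request automatically (or exposed a flag for it) when the selected instance type supports EFA.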