Add NCCL_PROTO=simple environment variable to handle the out-of-order data delivery from EFA (#196)

Co-authored-by: prasadru <frodo@ip-10-0-12-252.us-west-2.compute.internal>
ruhanprasad and prasadru authored Oct 16, 2023
1 parent 09540ec commit 9a5ed9f
Showing 1 changed file with 2 additions and 0 deletions.
2 changes: 2 additions & 0 deletions src/sagemaker_training/torch_distributed.py
@@ -75,7 +75,9 @@ def _setup(self):
if self._instance_type in SM_EFA_RDMA_INSTANCES:
# Use EFA's RDMA functionality for one-sided and two-sided transfer
os.environ["FI_EFA_USE_DEVICE_RDMA"] = "1"
os.environ["RDMAV_FORK_SAFE"] = "1"
os.environ["NCCL_SOCKET_IFNAME"] = str(self._network_interface_name)
os.environ["NCCL_PROTO"] = "simple"

def _create_command(self):
"""
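For readers who want to try the same settings outside the toolkit, the hunk above can be approximated by the standalone sketch below. It is not the toolkit's code: the function name `build_nccl_efa_env` and the set `EXAMPLE_EFA_RDMA_INSTANCES` are hypothetical, and because the diff as shown has lost its indentation, placing `NCCL_SOCKET_IFNAME` and `NCCL_PROTO` outside the `if` block is an assumption.

```python
import os

# Hypothetical stand-in for the toolkit's SM_EFA_RDMA_INSTANCES; the real
# contents of that list are not shown in this diff.
EXAMPLE_EFA_RDMA_INSTANCES = {"ml.p4d.24xlarge"}


def build_nccl_efa_env(instance_type, network_interface_name,
                       rdma_instances=EXAMPLE_EFA_RDMA_INSTANCES):
    """Return the environment variables set in the hunk above as a dict."""
    env = {}
    if instance_type in rdma_instances:
        # Use EFA's RDMA functionality for one-sided and two-sided transfer
        env["FI_EFA_USE_DEVICE_RDMA"] = "1"
        env["RDMAV_FORK_SAFE"] = "1"
    # Assumption: these two are applied regardless of instance type.
    env["NCCL_SOCKET_IFNAME"] = str(network_interface_name)
    # Restrict NCCL to the simple protocol, which the commit message says
    # handles out-of-order data delivery from EFA.
    env["NCCL_PROTO"] = "simple"
    return env


if __name__ == "__main__":
    # Apply the variables before any process group is initialized.
    os.environ.update(build_nccl_efa_env("ml.p4d.24xlarge", "eth0"))
    print({k: os.environ[k] for k in ("NCCL_PROTO", "NCCL_SOCKET_IFNAME")})
```

Returning a dict instead of writing to `os.environ` directly keeps the sketch easy to inspect; the caller decides when to apply it.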
