14 changes: 5 additions & 9 deletions docs/docs/concepts/fleets.md
@@ -38,23 +38,19 @@ Define a fleet configuration as a YAML file in your project directory. The file

</div>

#### Placement
#### Placement { #cloud-placement }

To ensure instances are interconnected (e.g., for
[distributed tasks](tasks.md#distributed-tasks)), set `placement` to `cluster`.
This ensures all instances are provisioned in the same backend and region with optimal inter-node connectivity.
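
For example, a minimal cloud fleet configuration with cluster placement might look like the sketch below (the fleet name and resources are illustrative, not prescribed):

```yaml
type: fleet
# A hypothetical fleet name used for illustration
name: my-gpu-cluster

# Provision two interconnected instances in the same backend and region
nodes: 2
placement: cluster

resources:
  gpu: 24GB
```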

??? info "AWS"
`dstack` automatically enables [Elastic Fabric Adapter :material-arrow-top-right-thin:{ .external }](https://aws.amazon.com/hpc/efa/){:target="_blank"}
for the instance types that support it:
`p5.48xlarge`, `p4d.24xlarge`, `g4dn.12xlarge`, `g4dn.16xlarge`, `g4dn.8xlarge`, `g4dn.metal`,
`g5.12xlarge`, `g5.16xlarge`, `g5.24xlarge`, `g5.48xlarge`, `g5.8xlarge`, `g6.12xlarge`,
`g6.16xlarge`, `g6.24xlarge`, `g6.48xlarge`, `g6.8xlarge`, and `gr6.8xlarge`.

`dstack` automatically enables the Elastic Fabric Adapter for all
[EFA-capable instance types :material-arrow-top-right-thin:{ .external }](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-instance-types){:target="_blank"}.
    Currently, only one EFA interface is enabled per instance, regardless of the maximum number of interfaces the instance type supports.
This will change once [this issue :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/issues/1804){:target="_blank"} is resolved.

> The `cluster` placement is supported only for `aws`, `azure`, `gcp`, and `oci`
> The `cluster` placement is supported only for `aws`, `azure`, `gcp`, `oci`, and `vultr`
> backends.

#### Resources
@@ -245,7 +241,7 @@ Define a fleet configuration as a YAML file in your project directory. The file

3.&nbsp;The specified user should have passwordless `sudo` access.

#### Placement
#### Placement { #ssh-placement }

If the hosts are interconnected (i.e. share the same network), set `placement` to `cluster`.
This is required if you'd like to use the fleet for [distributed tasks](tasks.md#distributed-tasks).
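
As a sketch, an SSH fleet with cluster placement could be defined as below (the user, identity file, and host IPs are placeholders):

```yaml
type: fleet
name: my-ssh-cluster

placement: cluster

ssh_config:
  user: ubuntu
  identity_file: ~/.ssh/id_rsa
  hosts:
    - 10.0.0.1
    - 10.0.0.2
```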
56 changes: 41 additions & 15 deletions docs/docs/concepts/tasks.md
@@ -71,7 +71,7 @@ application.
By default, a task runs on a single node.
However, you can run it on a cluster of nodes by specifying `nodes`.

<div editor-title="examples/fine-tuning/train.dstack.yml">
<div editor-title="train.dstack.yml">

```yaml
type: task
@@ -81,33 +81,59 @@ name: train-distrib
# The size of the cluster
nodes: 2

python: "3.10"
python: "3.12"

# Commands of the task
# Commands to run on each node
commands:
- git clone https://github.com/pytorch/examples.git
- cd examples/distributed/ddp-tutorial-series
- pip install -r requirements.txt
- torchrun
--nproc_per_node=$DSTACK_GPUS_PER_NODE
--node_rank=$DSTACK_NODE_RANK
--nproc-per-node=$DSTACK_GPUS_PER_NODE
--node-rank=$DSTACK_NODE_RANK
--nnodes=$DSTACK_NODES_NUM
--master_addr=$DSTACK_MASTER_NODE_IP
--master_port=8008 resnet_ddp.py
--num_epochs 20
--master-addr=$DSTACK_MASTER_NODE_IP
--master-port=12345
multinode.py 50 10

resources:
gpu: 24GB
# Uncomment if using multiple GPUs
#shm_size: 24GB
```

</div>

All you need to do is pass the corresponding environment variables such as
`DSTACK_GPUS_PER_NODE`, `DSTACK_NODE_RANK`, `DSTACK_NODES_NUM`,
`DSTACK_MASTER_NODE_IP`, and `DSTACK_GPUS_NUM` (see [System environment variables](#system-environment-variables)).
Nodes can communicate using their private IP addresses.
Use `DSTACK_MASTER_NODE_IP`, `DSTACK_NODE_RANK`, and other
[System environment variables](#system-environment-variables)
to discover IP addresses and other details.

??? info "Network interface"
Distributed frameworks usually detect the correct network interface automatically,
but sometimes you need to specify it explicitly.

For example, with PyTorch and the NCCL backend, you may need
to add these commands to tell NCCL to use the private interface:

```yaml
commands:
- apt-get install -y iproute2
- >
if [[ $DSTACK_NODE_RANK == 0 ]]; then
export NCCL_SOCKET_IFNAME=$(ip -4 -o addr show | fgrep $DSTACK_MASTER_NODE_IP | awk '{print $2}')
else
export NCCL_SOCKET_IFNAME=$(ip route get $DSTACK_MASTER_NODE_IP | sed -E 's/.*?dev (\S+) .*/\1/;t;d')
fi
# ... The rest of the commands
```

!!! info "Fleets"
To ensure all nodes are provisioned into a cluster placement group and to enable the highest level of inter-node
connectivity (incl. support for [EFA :material-arrow-top-right-thin:{ .external }](https://aws.amazon.com/hpc/efa/){:target="_blank"}),
create a [fleet](fleets.md) via a configuration before running a disstributed task.
Distributed tasks can only run on fleets with
[cluster placement](fleets.md#cloud-placement).
While `dstack` can provision such fleets automatically, it is
recommended to create them via a fleet configuration
to ensure the highest level of inter-node connectivity.

`dstack` is easy to use with `accelerate`, `torchrun`, Ray, Spark, and any other distributed framework.

@@ -303,7 +329,7 @@ If you don't assign a value to an environment variable (see `HF_TOKEN` above),
| `DSTACK_NODES_NUM` | The number of nodes in the run |
| `DSTACK_GPUS_PER_NODE` | The number of GPUs per node |
| `DSTACK_NODE_RANK` | The rank of the node |
| `DSTACK_MASTER_NODE_IP` | The internal IP address the master node |
| `DSTACK_MASTER_NODE_IP` | The internal IP address of the master node |
| `DSTACK_NODES_IPS` | The list of internal IP addresses of all nodes delimited by "\n" |

### Spot policy
44 changes: 23 additions & 21 deletions docs/docs/reference/environment-variables.md
@@ -45,31 +45,33 @@ tasks, and services:
- `DSTACK_NODES_NUM`{ #DSTACK_NODES_NUM } – The number of nodes in the run
- `DSTACK_GPUS_PER_NODE`{ #DSTACK_GPUS_PER_NODE } – The number of GPUs per node
- `DSTACK_NODE_RANK`{ #DSTACK_NODE_RANK } – The rank of the node
- `DSTACK_NODE_RANK`{ #DSTACK_NODE_RANK } – The internal IP address the master node.
- `DSTACK_MASTER_NODE_IP`{ #DSTACK_MASTER_NODE_IP } – The internal IP address of the master node.

Below is an example of using `DSTACK_NODES_NUM`, `DSTACK_GPUS_PER_NODE`, `DSTACK_NODE_RANK`, and `DSTACK_NODE_RANK`
Below is an example of using `DSTACK_NODES_NUM`, `DSTACK_GPUS_PER_NODE`, `DSTACK_NODE_RANK`, and `DSTACK_MASTER_NODE_IP`
for distributed training:

```yaml
type: task
name: train-distrib

# The number of instances in the cluster
nodes: 2

python: "3.10"
commands:
- pip install -r requirements.txt
- torchrun
--nproc_per_node=$DSTACK_GPUS_PER_NODE
--node_rank=$DSTACK_NODE_RANK
--nnodes=$DSTACK_NODES_NUM
--master_addr=$DSTACK_MASTER_NODE_IP
--master_port=8008
resnet_ddp.py --num_epochs 20

resources:
gpu: 24GB
type: task
name: train-distrib

nodes: 2
python: "3.12"

commands:
- git clone https://github.com/pytorch/examples.git
- cd examples/distributed/ddp-tutorial-series
- pip install -r requirements.txt
- torchrun
--nproc-per-node=$DSTACK_GPUS_PER_NODE
--node-rank=$DSTACK_NODE_RANK
--nnodes=$DSTACK_NODES_NUM
--master-addr=$DSTACK_MASTER_NODE_IP
--master-port=12345
multinode.py 50 10

resources:
gpu: 24GB
shm_size: 24GB
```

- `DSTACK_NODES_IPS`{ #DSTACK_NODES_IPS } – The list of internal IP addresses of all nodes delimited by `"\n"`.
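
For example, a minimal sketch of a task (assuming a two-node run) that prints every node's internal IP by iterating over the newline-delimited list:

```yaml
type: task
name: print-node-ips

nodes: 2

commands:
  - |
    # DSTACK_NODES_IPS is delimited by "\n"; read it line by line
    echo "$DSTACK_NODES_IPS" | while read -r ip; do
      echo "node ip: $ip"
    done
```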