14 changes: 5 additions & 9 deletions docs/docs/concepts/fleets.md
@@ -38,23 +38,19 @@ Define a fleet configuration as a YAML file in your project directory. The file

</div>

#### Placement
#### Placement { #cloud-placement }

To ensure instances are interconnected (e.g., for
[distributed tasks](tasks.md#distributed-tasks)), set `placement` to `cluster`.
This ensures all instances are provisioned in the same backend and region with optimal inter-node connectivity.
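
For example, a minimal cloud fleet configuration with cluster placement might look like the sketch below (the fleet name and resources are illustrative, not prescribed):

```yaml
type: fleet
# A hypothetical fleet name used for illustration
name: my-gpu-cluster

# Provision two interconnected instances in the same backend and region
nodes: 2
placement: cluster

resources:
  gpu: 24GB
```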

??? info "AWS"
`dstack` automatically enables [Elastic Fabric Adapter :material-arrow-top-right-thin:{ .external }](https://aws.amazon.com/hpc/efa/){:target="_blank"}
for the instance types that support it:
`p5.48xlarge`, `p4d.24xlarge`, `g4dn.12xlarge`, `g4dn.16xlarge`, `g4dn.8xlarge`, `g4dn.metal`,
`g5.12xlarge`, `g5.16xlarge`, `g5.24xlarge`, `g5.48xlarge`, `g5.8xlarge`, `g6.12xlarge`,
`g6.16xlarge`, `g6.24xlarge`, `g6.48xlarge`, `g6.8xlarge`, and `gr6.8xlarge`.

`dstack` automatically enables the Elastic Fabric Adapter for all
[EFA-capable instance types :material-arrow-top-right-thin:{ .external }](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-instance-types){:target="_blank"}.
    Currently, only one EFA interface is enabled per instance, regardless of the maximum number of interfaces the instance type supports.
This will change once [this issue :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/issues/1804){:target="_blank"} is resolved.

> The `cluster` placement is supported only for `aws`, `azure`, `gcp`, and `oci`
> The `cluster` placement is supported only for `aws`, `azure`, `gcp`, `oci`, and `vultr`
> backends.

#### Resources
@@ -245,7 +241,7 @@ Define a fleet configuration as a YAML file in your project directory. The file

3.&nbsp;The specified user should have passwordless `sudo` access.

#### Placement
#### Placement { #ssh-placement }

If the hosts are interconnected (i.e. share the same network), set `placement` to `cluster`.
This is required if you'd like to use the fleet for [distributed tasks](tasks.md#distributed-tasks).
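
As a sketch, an SSH fleet with cluster placement could be defined as below (the user, identity file, and host IPs are placeholders):

```yaml
type: fleet
name: my-ssh-cluster

placement: cluster

ssh_config:
  user: ubuntu
  identity_file: ~/.ssh/id_rsa
  hosts:
    - 10.0.0.1
    - 10.0.0.2
```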
56 changes: 41 additions & 15 deletions docs/docs/concepts/tasks.md
@@ -71,7 +71,7 @@ application.
By default, a task runs on a single node.
However, you can run it on a cluster of nodes by specifying `nodes`.

<div editor-title="examples/fine-tuning/train.dstack.yml">
<div editor-title="train.dstack.yml">

```yaml
type: task
@@ -81,33 +81,59 @@ name: train-distrib
# The size of the cluster
nodes: 2

python: "3.10"
python: "3.12"

# Commands of the task
# Commands to run on each node
commands:
- git clone https://github.com/pytorch/examples.git
- cd examples/distributed/ddp-tutorial-series
- pip install -r requirements.txt
- torchrun
--nproc_per_node=$DSTACK_GPUS_PER_NODE
--node_rank=$DSTACK_NODE_RANK
--nproc-per-node=$DSTACK_GPUS_PER_NODE
--node-rank=$DSTACK_NODE_RANK
--nnodes=$DSTACK_NODES_NUM
--master_addr=$DSTACK_MASTER_NODE_IP
--master_port=8008 resnet_ddp.py
--num_epochs 20
--master-addr=$DSTACK_MASTER_NODE_IP
--master-port=12345
multinode.py 50 10

resources:
gpu: 24GB
# Uncomment if using multiple GPUs
#shm_size: 24GB
```

</div>

All you need to do is pass the corresponding environment variables such as
`DSTACK_GPUS_PER_NODE`, `DSTACK_NODE_RANK`, `DSTACK_NODES_NUM`,
`DSTACK_MASTER_NODE_IP`, and `DSTACK_GPUS_NUM` (see [System environment variables](#system-environment-variables)).
Nodes can communicate using their private IP addresses.
Use `DSTACK_MASTER_NODE_IP`, `DSTACK_NODE_RANK`, and other
[System environment variables](#system-environment-variables)
to discover IP addresses and other details.

??? info "Network interface"
Distributed frameworks usually detect the correct network interface automatically,
but sometimes you need to specify it explicitly.

For example, with PyTorch and the NCCL backend, you may need
to add these commands to tell NCCL to use the private interface:

```yaml
commands:
- apt-get install -y iproute2
- >
if [[ $DSTACK_NODE_RANK == 0 ]]; then
export NCCL_SOCKET_IFNAME=$(ip -4 -o addr show | fgrep $DSTACK_MASTER_NODE_IP | awk '{print $2}')
else
export NCCL_SOCKET_IFNAME=$(ip route get $DSTACK_MASTER_NODE_IP | sed -E 's/.*?dev (\S+) .*/\1/;t;d')
fi
# ... The rest of the commands
```

!!! info "Fleets"
To ensure all nodes are provisioned into a cluster placement group and to enable the highest level of inter-node
connectivity (incl. support for [EFA :material-arrow-top-right-thin:{ .external }](https://aws.amazon.com/hpc/efa/){:target="_blank"}),
create a [fleet](fleets.md) via a configuration before running a disstributed task.
Distributed tasks can only run on fleets with
[cluster placement](fleets.md#cloud-placement).
While `dstack` can provision such fleets automatically, it is
recommended to create them via a fleet configuration
to ensure the highest level of inter-node connectivity.

`dstack` is easy to use with `accelerate`, `torchrun`, Ray, Spark, and any other distributed framework.

@@ -303,7 +329,7 @@ If you don't assign a value to an environment variable (see `HF_TOKEN` above),
| `DSTACK_NODES_NUM` | The number of nodes in the run |
| `DSTACK_GPUS_PER_NODE` | The number of GPUs per node |
| `DSTACK_NODE_RANK` | The rank of the node |
| `DSTACK_MASTER_NODE_IP` | The internal IP address the master node |
| `DSTACK_MASTER_NODE_IP` | The internal IP address of the master node |
| `DSTACK_NODES_IPS` | The list of internal IP addresses of all nodes delimited by "\n" |

### Spot policy
44 changes: 23 additions & 21 deletions docs/docs/reference/environment-variables.md
@@ -45,31 +45,33 @@ tasks, and services:
- `DSTACK_NODES_NUM`{ #DSTACK_NODES_NUM } – The number of nodes in the run
- `DSTACK_GPUS_PER_NODE`{ #DSTACK_GPUS_PER_NODE } – The number of GPUs per node
- `DSTACK_NODE_RANK`{ #DSTACK_NODE_RANK } – The rank of the node
- `DSTACK_NODE_RANK`{ #DSTACK_NODE_RANK } – The internal IP address the master node.
- `DSTACK_MASTER_NODE_IP`{ #DSTACK_MASTER_NODE_IP } – The internal IP address of the master node.

Below is an example of using `DSTACK_NODES_NUM`, `DSTACK_GPUS_PER_NODE`, `DSTACK_NODE_RANK`, and `DSTACK_NODE_RANK`
Below is an example of using `DSTACK_NODES_NUM`, `DSTACK_GPUS_PER_NODE`, `DSTACK_NODE_RANK`, and `DSTACK_MASTER_NODE_IP`
for distributed training:

```yaml
type: task
name: train-distrib

# The number of instances in the cluster
nodes: 2

python: "3.10"
commands:
- pip install -r requirements.txt
- torchrun
--nproc_per_node=$DSTACK_GPUS_PER_NODE
--node_rank=$DSTACK_NODE_RANK
--nnodes=$DSTACK_NODES_NUM
--master_addr=$DSTACK_MASTER_NODE_IP
--master_port=8008
resnet_ddp.py --num_epochs 20

resources:
gpu: 24GB
type: task
name: train-distrib

nodes: 2
python: "3.12"

commands:
- git clone https://github.com/pytorch/examples.git
- cd examples/distributed/ddp-tutorial-series
- pip install -r requirements.txt
- torchrun
--nproc-per-node=$DSTACK_GPUS_PER_NODE
--node-rank=$DSTACK_NODE_RANK
--nnodes=$DSTACK_NODES_NUM
--master-addr=$DSTACK_MASTER_NODE_IP
--master-port=12345
multinode.py 50 10

resources:
gpu: 24GB
shm_size: 24GB
```

- `DSTACK_NODES_IPS`{ #DSTACK_NODES_IPS } – The list of internal IP addresses of all nodes delimited by `"\n"`.
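
For example, a minimal sketch of a task (assuming a two-node run) that prints every node's internal IP by iterating over the newline-delimited list:

```yaml
type: task
name: print-node-ips

nodes: 2

commands:
  - |
    # DSTACK_NODES_IPS is delimited by "\n"; read it line by line
    echo "$DSTACK_NODES_IPS" | while read -r ip; do
      echo "node ip: $ip"
    done
```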