0.18.0
RunPod
The update adds the long-awaited integration with RunPod, a distributed GPU cloud that offers GPUs at affordable prices.
To use RunPod, specify your RunPod API key in `~/.dstack/server/config.yml`:
```yaml
projects:
- name: main
  backends:
  - type: runpod
    creds:
      type: api_key
      api_key: US9XTPDIV8AR42MMINY8TCKRB8S4E7LNRQ6CAUQ9
```
Once the server is restarted, go ahead and run workloads.
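If you'd rather not type the key into the file by hand, the config can be generated. The following is a minimal, stdlib-only sketch that renders the same YAML as above for a given key. The helper name `render_runpod_config` and reading the key from a `RUNPOD_API_KEY` environment variable are assumptions for illustration, not dstack features.

```python
import os


def render_runpod_config(api_key: str, project: str = "main") -> str:
    """Render a dstack server config with one RunPod backend (hypothetical helper).

    The layout matches the config.yml example: one project with a single
    `runpod` backend authenticated by an API key.
    """
    if not api_key:
        raise ValueError("RunPod API key must not be empty")
    return (
        "projects:\n"
        f"- name: {project}\n"
        "  backends:\n"
        "  - type: runpod\n"
        "    creds:\n"
        "      type: api_key\n"
        f"      api_key: {api_key}\n"
    )


if __name__ == "__main__":
    # RUNPOD_API_KEY is an assumed variable name. The YAML goes to stdout so
    # you can review it before saving it as ~/.dstack/server/config.yml.
    key = os.environ.get("RUNPOD_API_KEY")
    if key:
        print(render_runpod_config(key), end="")
```

Printing instead of writing the file directly avoids silently overwriting an existing server config that may already define other backends.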
Clusters
Another major change in this update is the ability to run multi-node tasks across an interconnected cluster of instances. The example below runs distributed PyTorch training on two nodes:
```yaml
type: task

nodes: 2

commands:
  - git clone https://github.com/r4victor/pytorch-distributed-resnet.git
  - cd pytorch-distributed-resnet
  - mkdir -p data
  - cd data
  - wget -c --quiet https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
  - tar -xvzf cifar-10-python.tar.gz
  - cd ..
  - pip3 install -r requirements.txt torch
  - mkdir -p saved_models
  - torchrun --nproc_per_node=$DSTACK_GPUS_PER_NODE
      --node_rank=$DSTACK_NODE_RANK
      --nnodes=$DSTACK_NODES_NUM
      --master_addr=$DSTACK_MASTER_NODE_IP
      --master_port=8008 resnet_ddp.py
      --num_epochs 20

resources:
  gpu: 1
```
Currently, this feature is supported on AWS, GCP, and Azure.
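The `torchrun` flags in the task are filled in from environment variables that dstack sets on each node. As a rough sketch of what those values imply, the Python below derives the cluster's world size and the global ranks of the processes on one node. It assumes homogeneous nodes, where torchrun assigns `rank = node_rank * nproc_per_node + local_rank`; `world_size` and `global_ranks` are hypothetical helpers, not part of dstack.

```python
import os
from typing import List, Mapping


def world_size(env: Mapping[str, str]) -> int:
    """Total worker processes across the cluster: nodes x GPUs per node."""
    return int(env["DSTACK_NODES_NUM"]) * int(env["DSTACK_GPUS_PER_NODE"])


def global_ranks(env: Mapping[str, str]) -> List[int]:
    """Global ranks of the processes torchrun starts on this node.

    Assumes homogeneous nodes: rank = node_rank * nproc_per_node + local_rank.
    """
    nproc = int(env["DSTACK_GPUS_PER_NODE"])
    node_rank = int(env["DSTACK_NODE_RANK"])
    return [node_rank * nproc + local_rank for local_rank in range(nproc)]


if __name__ == "__main__":
    # Only meaningful inside a dstack multi-node task, where these variables are set.
    if "DSTACK_NODE_RANK" in os.environ:
        print("world size:", world_size(os.environ))
        print("ranks on this node:", global_ranks(os.environ))
```

With the task above (`nodes: 2`, `gpu: 1`), the world size is 2 and the node with `DSTACK_NODE_RANK=1` hosts global rank 1; on nodes with four GPUs each, the same node would host ranks 4 through 7.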
Other
- The `commands` property is no longer required for tasks and services if you use an `image` that has a default entrypoint configured.
- The permissions required for using `dstack` with GCP are now more granular.
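The rule in the first item can be modeled as a small check. This is a sketch, not dstack's actual validation code: `start_source` and its `image_has_entrypoint` flag are hypothetical, and how dstack actually inspects an image for an entrypoint is not shown here.

```python
from typing import Any, Dict


def start_source(config: Dict[str, Any], image_has_entrypoint: bool) -> str:
    """Decide what a task or service starts with (hypothetical validation rule).

    Models the behavior described above: `commands` may be omitted
    only when the image defines a default entrypoint.
    """
    if config.get("commands"):
        return "commands"
    if image_has_entrypoint:
        return "entrypoint"
    raise ValueError(
        "`commands` is required unless the image has a default entrypoint"
    )
```

Explicit `commands` are checked first, so configurations written for earlier versions behave the same way regardless of the image.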
What's changed
- Add `username` filter to `/api/runs/list` by @r4victor in #1068
- Inherit core models from `DualBaseModel` by @r4victor in #967
- Fixed the YAML schema validation for `replicas` by @peterschmidt85 in #1055
- Improve the `server/config.yml` reference documentation by @peterschmidt85 in #1077
- Add the `runpod` backend by @Bihan in #1063
- Support JSON log handler by @TheBits in #1085
- Added lock to `terminate_idle_instance` by @TheBits in #1081
- Fix `dstack init` not working with a remote Git repo by @peterschmidt85 in #1090
- Minor improvements to `dstack server` output by @peterschmidt85 in #1088
- Return error information from `dstack-shim` by @TheBits in #1061
- Replace `RetryPolicy.limit` with `RetryPolicy.duration` by @TheBits in #1074
- Make `dstack version` configurable when deploying docs by @peterschmidt85 in #1095
- Fix `dstack init` not working with a local Git repo by @peterschmidt85 in #1096
- Fix infinite `create_instance()` on the `cudo` provider by @r4victor in #1082
- Do not update the `latest` Docker image and YAML schema for pre-release builds by @peterschmidt85 in #1099
- Support multi-node tasks by @r4victor in #1103
- Make `commands` optional in run configurations by @jvstme in #1104
- Allow the `cudo` backend to use non-GPU instances by @Bihan in #1092
- Make GCP permissions more granular by @r4victor in #1107
Full changelog: 0.17.0...0.18.0