27 Jun 12:14

f6395c6

0.18.4 Latest

Latest

Google Cloud TPU

This update introduces initial support for Google Cloud TPU.

To request a TPU, specify the TPU architecture prefixed by tpu- (in gpu under resources):

type: task

python: "3.11"

commands:
  - pip install torch~=2.3.0 torch_xla[tpu]~=2.3.0 torchvision -f https://storage.googleapis.com/libtpu-releases/index.html
  - git clone --recursive https://github.com/pytorch/xla.git
  - python3 xla/test/test_train_mp_imagenet.py --fake_data --model=resnet50 --num_epochs=1

resources:
  gpu:  tpu-v2-8

Important

Currently, you can't specify other than 8 TPU cores. This means only single TPU device workloads are supported. Support for multiple TPU devices is coming soon.

Major bug-fixes

Besides TPU, the update fixes a few important bugs.

Fix cudo backend stuck && Improve docs for cudo by @smokfyz in #1347
Fix nvidia-smi not available on lambda by @r4victor in #1357
Respect registry_auth for RunPod by @smokfyz in #1333
Support multi-node tasks on oci by @jvstme in #1334

Other

Show warning on required ssh version by @loghijiaha in #1313
Add OCI packer templates by @jvstme in #1316
Support oci Bare Metal instances by @jvstme in #1325
Support oci BM.Optimized3.36 instance by @jvstme in #1328
[Docs] Update dstack pool docs by @jvstme in #1329
Add TPU support in gcp by @Bihan in #1323
Fix failing runner-test workflow by @r4victor in #1336
Document OCI permissions by @jvstme in #1338
Limit the gateway's open ports to 22, 80, and 443 by @smokfyz in #1335
Update serve.dstack.yml - infinity by @michaelfeil in #1340
Support instances without public IP for GCP by @smokfyz in #1341
[Internal] Automate OCI images publishing by @jvstme in #1346
Fix slow /api/pools/list_instances by @r4victor in #1320
Respect gcp VPC config when provisioning TPUs by @r4victor in #1332
[Internal] Fix linter errors by @jvstme in #1322
TPU support enhancements by @r4victor in #1330
TPU initial release by @Bihan in #1354
TPUs fixes by @r4victor in #1360
Minor refactoring to support custom backends in dstack Sky by @r4victor in #1319
Even more flexible OCI client credentials by @jvstme in #1317

New contributors

@loghijiaha made their first contribution in #1313
@smokfyz made their first contribution in #1333
@michaelfeil made their first contribution in #1340

Full changelog: 0.18.3...0.18.4

Contributors

Bihan, r4victor, and 4 other contributors

Assets 2

26 Jun 14:49

peterschmidt85

0.18.4rc3

3e89218

0.18.4rc3 Pre-release

Pre-release

This is a preview build of the upcoming 0.18.4 release. See below to see what's new.

TPU

One of the major new features in this update is the initial support for Google Cloud TPU.

To request a TPU, you simply need to specify the system architecture of the required TPU prefixed by tpu- in gpu:

type: task

python: "3.11"

commands:
  - pip install torch~=2.3.0 torch_xla[tpu]~=2.3.0 torchvision -f https://storage.googleapis.com/libtpu-releases/index.html
  - git clone --recursive https://github.com/pytorch/xla.git
  - python3 xla/test/test_train_mp_imagenet.py --fake_data --model=resnet50 --num_epochs=1

resources:
  gpu:  tpu-v2-8

Important

You cannot request multiple nodes (for running parallel on multiple TPU devices) for tasks. This feature is coming soon.

You're very welcome to try the initial support and share your feedback.

Major bug-fixes

Besides TPU, the update fixes a few important bugs.

Fix cudo backend stuck && Improve docs for cudo by @smokfyz in #1347
Fix nvidia-smi not available on lambda by @r4victor in #1357
Respect registry_auth for RunPod by @smokfyz in #1333
Support multi-node tasks on oci by @jvstme in #1334

Other

Show warning on required ssh version by @loghijiaha in #1313
Add OCI packer templates by @jvstme in #1316
Support oci Bare Metal instances by @jvstme in #1325
Support oci BM.Optimized3.36 instance by @jvstme in #1328
[Docs] Update dstack pool docs by @jvstme in #1329
Add TPU support in gcp by @Bihan in #1323
Fix failing runner-test workflow by @r4victor in #1336
Document OCI permissions by @jvstme in #1338
Limit the gateway's open ports to 22, 80, and 443 by @smokfyz in #1335
Update serve.dstack.yml - infinity by @michaelfeil in #1340
Support instances without public IP for GCP by @smokfyz in #1341
[Internal] Automate OCI images publishing by @jvstme in #1346
Fix slow /api/pools/list_instances by @r4victor in #1320
Respect gcp VPC config when provisioning TPUs by @r4victor in #1332
[Internal] Fix linter errors by @jvstme in #1322
TPU support enhancements by @r4victor in #1330
TPU initial release by @Bihan in #1354
TPUs fixes by @r4victor in #1360
Minor refactoring to support custom backends in dstack Sky by @r4victor in #1319
Even more flexible OCI client credentials by @jvstme in #1317

New contributors

@loghijiaha made their first contribution in #1313
@smokfyz made their first contribution in #1333
@michaelfeil made their first contribution in #1340

Full changelog: 0.18.3...0.18.4rc3

Contributors

Bihan, r4victor, and 4 other contributors

Assets 2

06 Jun 10:55

peterschmidt85

0.18.3

36655f0

0.18.3

Oracle Cloud Infrastructure

With the new update, it is now possible to run workloads with your Oracle Cloud Infrastructure (OCI) account. The backend is called oci and can be configured as follows:

projects:
  - name: main
    backends:
      - type: oci
        creds:
          type: default

The supported credential types include default and client. In case default is used, dstack automatically picks the default OCI credentials from ~/.oci/config.

Just like other backends, oci supports dev environments, tasks, and services:

Note

Support for spot instances, multi-node tasks, and gateways is coming soon.

Find more documentation on using Oracle Cloud Infrastructure on the reference page.

Retry policy

We have reworked how to configure the retry policy and how it is applied to runs. Here's an example:

type: task

commands: 
  - python train.py

retry:
  on_events: [no-capacity]
  duration: 2h

Now, if you run such a task, dstack will keep trying to find capacity within 2 hours. Once capacity is found, dstack will run the task.

The on_events property also supports error (in case the run fails with an error) and interruption (if the run is using a spot instance and it was interrupted).

Previously, dstack only allowed retries when spot instances were interrupted.

RunPod

Previously, the runpod backend only allowed the use of Docker images with /bin/bash or /bin/sh as the entrypoint. Thanks to the fix on the RunPod's side, dstack now allows the use of any Docker images.

Additionally, the runpod backend now also supports spot instances.

GCP

The gcp backend now also allows configuring VPCs:

projects:
  - name: main
    backends:
      - type: gcp

        project_id: my-awesome-project
        creds:
          type: default

        vpc_name: my-custom-vpc

The VPC should belong to the same project. If you would like to use a shared VPC from another project, you can also specify vpc_project_id.

AWS

Last but not least, for the aws backend, it is now possible to configure VPCs for selected regions and use the default VPC in other regions:

projects:
  - name: main
    backends:
      - type: aws
        creds:
          type: default

        vpc_ids:
          us-east-1: vpc-0a2b3c4d5e6f7g8h

        default_vpcs: true

You just need to set default_vpcs to true.

Other changes

Fix reverse server-gateway ssh tunnel by @r4victor in #1303
Respect run filters for the ssh backend by @r4victor in #1278
Support resubmitted runs in dstack run attached mode by @r4victor in #1285
Do not run jobs on unreachable instances by @r4victor in #1286
Show job termination reason in dstack ps -v by @r4victor in #1301
Rename dstack destroy to dstack delete by @r4victor in #1275
Prepare OCI backend for release by @jvstme in #1308
[Docs] Improve the documentation of the Pydantic models #1293 by @peterschmidt85 in #1295
[Docs] Fix Authorization header by @jvstme in #1305

Contributors

r4victor, jvstme, and peterschmidt85

Assets 2

05 Jun 09:37

peterschmidt85

0.18.3rc1

9d7c42f

0.18.3rc1 Pre-release

Pre-release

OCI

With the new update, it is now possible to run workloads with your Oracle Cloud Infrastructure (OCI) account. The backend is called oci and can be configured as follows:

projects:
  - name: main
    backends:
      - type: oci
        creds:
          type: default

The supported credential types include default and client. In case default is used, dstack automatically picks the default OCI credentials from ~/.oci/config.

Warning

OCI support does not yet include spot instances, multi-node tasks, and gateways. These features will be added in upcoming updates.

Retry policy

We have reworked how to configure the retry policy and how it is applied to runs. Here's an example:

type: task

commands: 
  - python train.py

retry:
  on_events: [no-capacity]
  duration: 2h

Now, if you run such a task, dstack will keep trying to find capacity within 2 hours. Once capacity is found, dstack will run the task.

The on_events property also supports error (in case the run fails with an error) and interruption (if the run is using a spot instance and it was interrupted).

Previously, dstack only allowed retries when spot instances were interrupted.

VPC

GCP

The gcp backend now also allows configuring VPCs:

projects:
  - name: main
    backends:
      - type: gcp

        project_id: my-awesome-project
        creds:
          type: default

        vpc_name: my-custom-vpc

The VPC should belong to the same project. If you would like to use a shared VPC from another project, you can also specify vpc_project_id.

AWS

Last but not least, for the aws backend, it is now possible to configure VPCs for selected regions and use the default VPC in other regions:

projects:
  - name: main
    backends:
      - type: aws
        creds:
          type: default

        vpc_ids:
          us-east-1: vpc-0a2b3c4d5e6f7g8h

        default_vpcs: true

You just need to set default_vpcs to true.

Other changes

Fix reverse server-gateway ssh tunnel by @r4victor in #1303
Respect run filters for the ssh backend by @r4victor in #1278
Support resubmitted runs in dstack run attached mode by @r4victor in #1285
Do not run jobs on unreachable instances by @r4victor in #1286
Show job termination reason in dstack ps -v by @r4victor in #1301
Rename dstack destroy to dstack delete by @r4victor in #1275
Prepare OCI backend for release by @jvstme in #1308
[Docs] Improve the documentation of the Pydantic models #1293 by @peterschmidt85 in #1295

Full changelog: 0.18.2...0.18.3rc1

Warning

This is an RC build. Please report any bugs to the issue tracker. The final release is planned for later this week, and the official documentation and examples will be updated then.

Contributors

r4victor, jvstme, and peterschmidt85

Assets 2

13 May 12:30

peterschmidt85

0.18.2

86b41b2

0.18.2

On-prem clusters

Network

The dstack pool add-ssh command now supports the --network argument. Use this argument if you want to use multiple instances that share the same private network as a cluster to run multi-node tasks.

The --network argument accepts the IP address range (CIDR) of the private network of the instance.

Example:

dstack pool add-ssh -i ~/.ssh/id_rsa ubuntu@141.144.229.104 --network 10.0.0.0/24

Once you've added multiple instances with the same network value, you'll be able to use them as a cluster to run multi-node tasks.

Private subnets

By default, dstack uses public IPs for SSH access to running instances, requiring public subnets in the VPC. The new update allows AWS instances to use private subnets instead.

To create instances only in private subnets, set public_ips to false in the AWS backend settings:

type: aws
  creds:
    type: default
  vpc_ids:
     ...
  public_ips: false

Note

Both dstack server and the dstack CLI should have access to the private subnet to access instances.
If you want running instances to access the Internet, the private subnets need to have a NAT gateway.

Gateways

`dstack apply`

Previously, to create or update gateways, one had to use the dstack gateway create or dstack gateway update commands.
Now, it's possible to define a gateway configuration via YAML and create or update it using the dstack apply command.

Example:

type: gateway
name: example-gateway

backend: gcp
region: europe-west1
domain: example.com

dstack apply -f examples/deployment/gateway.dstack.yml

For now, the dstack apply command only supports the gateway configuration type. Soon, it will also support dev-environment, task, and service, replacing the dstack run command.

The dstack destroy command can be used to delete resources.

Private gateways

By default, gateways are deployed using public subnets. Since 0.18.2, it is now possible to deploy gateways using private subnets. To do this, you need to set public_ips to false and specify the ARN of a certificate from AWS Certificate Manager.

type: gateway
name: example-gateway

backend: aws
region: eu-west-1
domain: "example.com"

public_ip: false
certificate:
  type: acm
  arn: "arn:aws:acm:eu-west-1:3515152512515:certificate/3251511125--1241-1224-121251515125"

In this case, dstack will deploy the gateway in a private subnet behind a load balancer using the specified certificate.

Note

Private gateways are currently supported only for AWS.

What's changed

Support multi-node tasks with dstack pool add-ssh instances by @TheBits in #1189
Fixed the JSON schema errors by @r4victor in #1193
Support spot instances with runpod by @Bihan in #1119
Speed up AWS VPC validation by @r4victor in #1196
[Internal] Optimize ProjectModel loading by @r4victor in #1199
Support provisioning instances without public IPs on AWS by @r4victor in #1203
Minor improvements of dstack pool add-ssh by @TheBits in #1202
Instances cannot be reused by other users by @TheBits in #1204
Do not create AWS instance profile when launching instances by @r4victor in #1212
Allow running services without https by @r4victor in #1217
Implement dstack apply for gateways by @r4victor in #1223
Support gateways without public IPs on AWS by @r4victor in #1224
Support --network with dstack pool add-ssh by @TheBits in #1225
[Internal] Make gateway creation async by @r4victor in #1236
Using a more resourceful VM type by default for GCP gateway by @r4victor in #1237
Handle properly if the network passed to dstack pool add-ssh is not correct by @TheBits in #1233
Use valid GCP resource names by @r4victor in #1248
Always try to restart dstack-shim.service with dstack pool add-ssh by @TheBits in #1253
[Internal] Improve instance processing by @r4victor in #1251
Changed dstack pool remove to rm by @muddi900 in #1258
Support gateways behind ALB with ACM certificate by @r4victor in #1264
Support IP addresses with --network by @TheBits in #1263
[Internal] Fix double unlocking when processing runs and instances by @r4victor in #1268
Add dstack destroy command and improve dstack apply by @r4victor in #1271
Fix instances from pools ignoring regions by @r4victor in #1272
Add the axolotl example by @deep-diver in #1187

New Contributors

@muddi900 made their first contribution in #1258

Full Changelog: 0.18.1...0.18.2

Contributors

TheBits, Bihan, and 3 other contributors

Assets 2

29 Apr 15:47

peterschmidt85

0.18.1

2761a90

0.18.1

On-prem servers

Now you can add your own servers as pool instances:

dstack pool add-ssh -i ~/.ssh/id_rsa ubuntu@54.73.155.119

Note

The server should be pre-installed with CUDA 12.1 and NVIDIA Docker.

Configuration

All .dstack/profiles.yml properties now can be specified via run configurations:

type: dev-environment

ide: vscode

spot_policy: auto
backends: ["aws"]

regions: ["eu-west-1", "eu-west-2"]

instance_types: ["p3.8xlarge", "p3.16xlarge"]
max_price: 2.0

max_duration: 1d

New examples 🔥🔥

Thanks to the contribution from @deep-diver, we got two new examples:

Other

Configuring VPCs using their IDs (via vpc_ids in server/config.yml)
Support for global profiles (via ~/.dstack/profiles.yml)
Updated the default environment variables (DSTACK_RUN_NAME, DSTACK_GPUS_NUM, DSTACK_NODES_NUM, DSTACK_NODE_RANK, and DSTACK_MASTER_NODE_IP)
It’s now possible to use NVIDIA A10 GPU on Azure
More granular permissions for Azure

What's changed

Fix server freeze on terminate instance by @jvstme in #1132
Support profile params in run configurations by @r4victor in #1131
Support global .dstack/profiles.yml by @r4victor in #1134
Fix No such profile: None when missing .dstack/profiles.yml by @r4victor in #1135
Make Azure permissions more granular by @r4victor in #1139
Validate min disk size by @r4victor in #1146
Fix unexpected error if system Python version is unknown by @r4victor in #1147
Add request timeouts to prevent code freezes by @jvstme in #1140
Refactor backends to wait for instance IP address outside run_job/create_instance by @r4victor in #1149
Fix provisioning Azure instances with A10 GPU by @jvstme in #1150
[Internal] Move packer -> scripts/packer by @jvstme in #1153
Added the ability of adding own instances by @TheBits in #1115
An issue with the executor_error check being falsely positive by @TheBits in #1160
Make user project quota configurable by @r4victor in #1161
Configure CORS headers on gateway by @r4victor in #1166
Allow to configure AWS vpc_ids by @r4victor in #1170
[Internal] Show dstack version in Sentry issues by @jvstme in #1167
Fix KeyError: 'IpPermissions' when using AWS by @jvstme in #1169
Create public ssh key is it not exist in dstack pool add-ssh by @TheBits in #1173
Fixed is the environment file upload by @TheBits in #1175
Updated shim status processing by @TheBits in #1174
Fix bugs in dstack pool add-ssh by @TheBits in #1178
Fix Cudo Create VM response error by @Bihan in #1179
Implement API for configuring backends via yaml by @r4victor in #1181
Allow running gated models with HUGGING_FACE_HUB_TOKEN by @r4victor in #1184
Pass all dstack runner envs as DSTACK_* by @r4victor in #1185
Improve the retries in the get_host_info and get_shim_healthcheck by @TheBits in #1183
Example/h4alignment handbook by @deep-diver in #1180
The deploy is launched in ThreadPoolExecutor by @TheBits in #1186

Full Changelog: 0.18.0...0.18.1rc2

Contributors

TheBits, Bihan, and 3 other contributors

Assets 2

10 Apr 15:46

peterschmidt85

0.18.0

f9b941c

0.18.0

RunPod

The update adds the long-awaited integration with RunPod, a distributed GPU cloud that offers GPUs at affordable prices.

To use RunPod, specify your RunPod API key in ~/.dstack/server/config.yml:

projects:
- name: main
  backends:
  - type: runpod
    creds:
      type: api_key
      api_key: US9XTPDIV8AR42MMINY8TCKRB8S4E7LNRQ6CAUQ9

Once the server is restarted, go ahead and run workloads.

Clusters

Another major change with the update is the ability to run multi-node tasks over an interconnected cluster of instances.

type: task

nodes: 2

commands:
  - git clone https://github.com/r4victor/pytorch-distributed-resnet.git
  - cd pytorch-distributed-resnet
  - mkdir -p data
  - cd data
  - wget -c --quiet https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
  - tar -xvzf cifar-10-python.tar.gz
  - cd ..
  - pip3 install -r requirements.txt torch
  - mkdir -p saved_models
  - torchrun --nproc_per_node=$DSTACK_GPUS_PER_NODE 
     --node_rank=$DSTACK_NODE_RANK 
     --nnodes=$DSTACK_NODES_NUM
     --master_addr=$DSTACK_MASTER_NODE_IP
     --master_port=8008 resnet_ddp.py 
     --num_epochs 20

resources:
  gpu: 1

Currently supported providers for this feature include AWS, GCP, and Azure.

Other

The commands property is now not required for tasks and services if you use an image that has a default entrypoint configured.
The permissions required for using dstack with GCP are more granular.

What's changed

Add username filter to /api/runs/list by @r4victor in #1068
Inherit core models from DualBaseModel by @r4victor in #967
Fixed the YAML schema validation for replicas by @peterschmidt85 in #1055
Improve the server/config.yml reference documentation by @peterschmidt85 in #1077
Add the runpod backend by @Bihan in #1063
Support JSON log handler by @TheBits in #1085
Added lock to the terminate_idle_instance by @TheBits in #1081
dstack init doesn't work with a remote Git repo by @peterschmidt85 in #1090
Minor improvements of dstack server output by @peterschmidt85 in #1088
Return an error information from dstack-shim by @TheBits in #1061
Replace RetryPolicy.limit to RetryPolicy.duration by @TheBits in #1074
Make dstack version configurable when deploying docs by @peterschmidt85 in #1095
dstack init doesn't work with a local Git repo by @peterschmidt85 in #1096
Fix infinite create_instance() on the cudo provider by @r4victor in #1082
Do not update the latest Docker image and YAML scheme for pre-release builds by @peterschmidt85 in #1099
Support multi-node tasks by @r4victor in #1103
Make commands optional in run configurations by @jvstme in #1104
Allow the cudo backend use non-gpu instances by @Bihan in #1092
Make GCP permissions more granular by @r4victor in #1107

Full changelog: 0.17.0...0.18.0

Contributors

TheBits, Bihan, and 3 other contributors

Assets 2

03 Apr 10:20

peterschmidt85

0.17.0

f14ddf5

0.17.0

Service auto-scaling

Previously, dstack always served services as single replicas. While this is suitable for development, in production, the service must automatically scale based on the load.

That's why in 0.17.0, we extended dstack with the capability to configure replicas (the number of replicas) as well as scaling (the auto-scaling policy).

Regions and instance types

The update brings support for specifying regions and instance types (in dstack run and .dstack/profiles.yml)

Environment variables

Firstly, it's now possible to configure an environment variable in the configuration without hardcoding its value. Secondly, dstack run now inherits environment variables from the current process.

For more details on these new features, check the changelog.

What's changed

Support running multiple replicas for a service by @Egor-S in #986 and #1015
Allow to specify instance_type via CLI and profiles by @r4victor in #1023
Allow to specify regions via CLI and profiles by @r4victor in #947
Allow specifying required env variables by @spott in #1003
Allow configuring CA for gateways by @jvstme in #1022
Support Python 3.12 by @peterschmidt85 in #1031
The shm_size property in resources doesn't take effect by @peterschmidt85 in #1007
Sometimes, runs get stuck at pulling by @TheBits in #1035
vastai doesn't show any offers since 0.16.0 by @iRohith in #959
It's not possible to configure projects other than main by @peterschmidt85 in #992
Spot instances don't work on GCP by @peterschmidt85 in #996

New contributors

@iRohith made their first contribution in #959
@Bihan made their first contribution in #928

Full changelog: 0.16.5...0.17.0

Contributors

spott, TheBits, and 6 other contributors

Assets 2

26 Mar 07:59

peterschmidt85

0.16.5

7ab0cff

0.16.5

Bug-fixes

Docker pull related issues #1025

Full changelog: 0.16.4...0.16.5

Assets 2

18 Mar 13:25

peterschmidt85

0.16.4

05bde51

0.16.4

CUDO Compute

The 0.16.4 update introduces the cudo backend, which allows running workloads with CUDO Compute, a cloud GPU marketplace.

To configure the cudo backend, you simply need to specify your CUDO Compute project ID and API key:

projects:
- name: main
  backends:
  - type: cudo
    project_id: my-cudo-project
    creds:
      type: api_key
      api_key: 7487240a466624b48de22865589

Once it's done, you can restart the dstack server and use the dstack CLI or API to run workloads.

Note

Limitations

The dstack gateway feature is not yet compatible with cudo, but it is expected to be supported in version 0.17.0,
planned for release within a week.
The cudo backend cannot yet be used with dstack Sky, but it will also be enabled within a week.

Full changelog: 0.16.3...0.16.4

Assets 2

Releases: dstackai/dstack

0.18.4

Google Cloud TPU

Major bug-fixes

Other

New contributors

Contributors

0.18.4rc3

TPU

Major bug-fixes

Other

New contributors

Contributors

0.18.3

Oracle Cloud Infrastructure

Retry policy

RunPod

GCP

AWS

Other changes

Contributors

0.18.3rc1

OCI

Retry policy

VPC

GCP

AWS

Other changes

Contributors

0.18.2

On-prem clusters

Network

Private subnets

Gateways

dstack apply

Private gateways

What's changed

New Contributors

Contributors

0.18.1

On-prem servers

Configuration

New examples 🔥🔥

Other

What's changed

Contributors

0.18.0

RunPod

Clusters

Other

What's changed

Contributors

0.17.0

Service auto-scaling

Regions and instance types

Environment variables

What's changed

New contributors

Contributors

0.16.5

Bug-fixes

0.16.4

CUDO Compute

`dstack apply`