diff --git a/docs/blog/posts/docker-inside-containers.md b/docs/blog/posts/docker-inside-containers.md index 699e75fe70..1a88f20b13 100644 --- a/docs/blog/posts/docker-inside-containers.md +++ b/docs/blog/posts/docker-inside-containers.md @@ -12,18 +12,26 @@ To run containers with `dstack`, you can use your own Docker image (or the defau directly with Docker. However, some existing code may require direct use of Docker or Docker Compose. That's why, in our latest release, we've added this option. -
- -```yaml +
+ +```yaml type: task -name: chat-ui-task +name: compose-task image: dstackai/dind privileged: true -working_dir: examples/misc/docker-compose commands: - start-dockerd + - | + cat > compose.yaml <<'EOF' + services: + web: + image: python:3.11-slim + command: python -m http.server 9000 + ports: + - "9000:9000" + EOF - docker compose up ports: [9000] @@ -37,7 +45,7 @@ resources: ## How it works -To use Docker or Docker Compose with your `dstack` configuration, set `image` to `dstackai/dind`, `privileged` to +To use Docker or Docker Compose with your `dstack` configuration, set `image` to `dstackai/dind`, `privileged` to `true`, and add the `start-dockerd` command. After this command, you can use Docker or Docker Compose directly. @@ -45,7 +53,7 @@ For dev environments, add `start-dockerd` as the first command in the `init` property. ??? info "Dev environment" -
+
```yaml type: dev-environment @@ -59,7 +67,7 @@ in the `init` property. - start-dockerd resources: - gpu: 16GB..24GB + gpu: 16GB..24GB ```
@@ -71,15 +79,15 @@ With this setup, you don’t have to worry about configuration—both Docker and
 support GPU usage.
 
 !!! info "Backends"
-    Note that the `privileged` option is only supported by VM-based backends. This does not include `runpod`, `vastai`, 
+    Note that the `privileged` option is only supported by VM-based backends. This does not include `runpod`, `vastai`,
     and `kubernetes`. All other backends support it.
 
 ## When using it
 
 ### docker compose
 
-One of the obvious use cases for this feature is when you need to use Docker Compose. 
-For example, the Hugging Face Chat UI requires a MongoDB database, so using Docker Compose to run it is 
+One of the obvious use cases for this feature is when you need to use Docker Compose.
+For example, the Hugging Face Chat UI requires a MongoDB database, so using Docker Compose to run it is
 the easiest way:
 
 
@@ -92,14 +100,10 @@ Another use case for this feature is when you need to build a custom Docker imag
 
 Last but not least, you can, of course, use the `docker run` command, for example, if your existing code requires it.
 
-## Examples
-
-A few examples of using this feature can be found in [`examples/misc/docker-compose`](https://github.com/dstackai/dstack/blob/master/examples/misc/docker-compose).
-
 ## Feedback
 
 If you find something not working as intended, please be sure to report it to
-our [bug tracker](https://github.com/dstackai/dstack/issues){:target="_ blank"}. 
-Your feedback and feature requests are also very welcome on both 
+our [bug tracker](https://github.com/dstackai/dstack/issues){:target="_blank"}.
+Your feedback and feature requests are also very welcome on both
 [Discord](https://discord.gg/u8SmfwPpMd) and the [issue tracker](https://github.com/dstackai/dstack/issues).
 
diff --git a/docs/blog/posts/nvidia-and-amd-on-vultr.md b/docs/blog/posts/nvidia-and-amd-on-vultr.md index 512d316f8b..e3961b37a9 100644 --- a/docs/blog/posts/nvidia-and-amd-on-vultr.md +++ b/docs/blog/posts/nvidia-and-amd-on-vultr.md @@ -67,8 +67,6 @@ projects: For more details, refer to [Installation](../../docs/installation.md). -> Interested in fine-tuning or deploying DeepSeek on Vultr? Check out the corresponding [example](../../examples/llms/deepseek/index.md). - !!! info "What's next?" 1. Refer to [Quickstart](../../docs/quickstart.md) 2. Sign up with [Vultr](https://www.vultr.com/) diff --git a/docs/blog/posts/volumes-on-runpod.md b/docs/blog/posts/volumes-on-runpod.md index c17faf7b13..08f2e19126 100644 --- a/docs/blog/posts/volumes-on-runpod.md +++ b/docs/blog/posts/volumes-on-runpod.md @@ -20,7 +20,7 @@ deploying a model on Runpod. Suppose you want to deploy Llama 3.1 on Runpod as a [service](../../docs/concepts/services.md): -
+
```yaml type: service @@ -63,7 +63,7 @@ Great news: Runpod supports network volumes, which we can use for caching models With `dstack`, you can create a Runpod volume using the following configuration: -
+
```yaml type: volume @@ -83,7 +83,7 @@ Go ahead and create it via `dstack apply`:
```shell -$ dstack apply -f examples/mist/volumes/runpod.dstack.yml +$ dstack apply -f runpod-volume.dstack.yml ```
@@ -91,7 +91,7 @@ $ dstack apply -f examples/mist/volumes/runpod.dstack.yml Once the volume is created, attach it to your service by updating the configuration file and mapping the volume name to the `/data` path. -
+
```yaml type: service diff --git a/docs/docs/concepts/services.md b/docs/docs/concepts/services.md index 39e8f0b868..fd0d2a2dc2 100644 --- a/docs/docs/concepts/services.md +++ b/docs/docs/concepts/services.md @@ -15,57 +15,116 @@ Services allow you to deploy models or web apps as secure and scalable endpoints First, define a service configuration as a YAML file in your project folder. The filename must end with `.dstack.yml` (e.g. `.dstack.yml` or `dev.dstack.yml` are both acceptable). -
+=== "NVIDIA" -```yaml -type: service -name: llama31 +
-# If `image` is not specified, dstack uses its default image -python: 3.12 -env: - - HF_TOKEN - - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct - - MAX_MODEL_LEN=4096 -commands: - - uv pip install vllm - - vllm serve $MODEL_ID - --max-model-len $MAX_MODEL_LEN - --tensor-parallel-size $DSTACK_GPUS_NUM -port: 8000 -# (Optional) Register the model -model: meta-llama/Meta-Llama-3.1-8B-Instruct + ```yaml + type: service + name: qwen397 -# Uncomment to leverage spot instances -#spot_policy: auto + image: lmsysorg/sglang:v0.5.10.post1 -resources: - gpu: 24GB -``` + commands: + - | + sglang serve \ + --model-path Qwen/Qwen3.5-397B-A17B-FP8 \ + --port 30000 \ + --tp $DSTACK_GPUS_NUM \ + --reasoning-parser qwen3 \ + --tool-call-parser qwen3_coder \ + --enable-flashinfer-allreduce-fusion \ + --mem-fraction-static 0.8 + + port: 30000 + model: Qwen/Qwen3.5-397B-A17B-FP8 + + volumes: + # Optional instance volume for model and runtime caches + - instance_path: /root/.cache + path: /root/.cache + optional: true -
+ resources: + cpu: x86:96.. + memory: 512GB.. + shm_size: 16GB + disk: 500GB.. + gpu: H100:80GB:8 + ``` + +
+ +=== "AMD" + +
+ + ```yaml + type: service + name: qwen397 + + image: lmsysorg/sglang:v0.5.10.post1-rocm720-mi30x + + env: + - HIP_FORCE_DEV_KERNARG=1 + - SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 + - SGLANG_DISABLE_CUDNN_CHECK=1 + - SGLANG_INT4_WEIGHT=0 + - SGLANG_MOE_PADDING=1 + - SGLANG_ROCM_DISABLE_LINEARQUANT=0 + - SGLANG_ROCM_FUSED_DECODE_MLA=1 + - SGLANG_SET_CPU_AFFINITY=1 + - SGLANG_USE_AITER=1 + - SGLANG_USE_ROCM700A=1 + + commands: + - | + sglang serve \ + --model-path Qwen/Qwen3.5-397B-A17B-FP8 \ + --tp $DSTACK_GPUS_NUM \ + --reasoning-parser qwen3 \ + --tool-call-parser qwen3_coder \ + --mem-fraction-static 0.8 \ + --context-length 262144 \ + --attention-backend triton \ + --disable-cuda-graph \ + --fp8-gemm-backend aiter \ + --port 30000 + + port: 30000 + model: Qwen/Qwen3.5-397B-A17B-FP8 + + volumes: + # Optional instance volume for model and runtime caches + - instance_path: /root/.cache + path: /root/.cache + optional: true + + resources: + cpu: x86:52.. + memory: 700GB.. + shm_size: 16GB + disk: 600GB.. + gpu: MI300X:192GB:4 + ``` + +
To run a service, pass the configuration to [`dstack apply`](../reference/cli/dstack/apply.md):
```shell
-$ HF_TOKEN=... $ dstack apply -f .dstack.yml
-
- #  BACKEND  REGION    RESOURCES                    SPOT  PRICE
- 1  runpod   CA-MTL-1  18xCPU, 100GB, A5000:24GB:2  yes   $0.22
- 2  runpod   EU-SE-1   18xCPU, 100GB, A5000:24GB:2  yes   $0.22
- 3  gcp      us-west4  27xCPU, 150GB, A5000:24GB:3  yes   $0.33
-
-Submit the run llama31? [y/n]: y
+$ dstack apply -f .dstack.yml
+
+Submit the run qwen397? [y/n]: y
 
 Provisioning...
 ---> 100%
 
-Service is published at: 
-  http://localhost:3000/proxy/services/main/llama31/
-Model meta-llama/Meta-Llama-3.1-8B-Instruct is published at:
+Service is published at:
+  http://localhost:3000/proxy/services/main/qwen397/
+Model Qwen/Qwen3.5-397B-A17B-FP8 is published at:
   http://localhost:3000/proxy/models/main/
```

@@ -79,11 +138,11 @@ If you do not have a [gateway](gateways.md) created, the service endpoint will b
```shell -$ curl http://localhost:3000/proxy/services/main/llama31/v1/chat/completions \ +$ curl http://localhost:3000/proxy/services/main/qwen397/v1/chat/completions \ -H 'Content-Type: application/json' \ -H 'Authorization: Bearer <dstack token>' \ -d '{ - "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "model": "Qwen/Qwen3.5-397B-A17B-FP8", "messages": [ { "role": "user", @@ -95,6 +154,10 @@ $ curl http://localhost:3000/proxy/services/main/llama31/v1/chat/completions \
+The request and response format depends on the serving framework used by the +service. Even for OpenAI-compatible endpoints, the format may vary slightly +across frameworks. + If [authorization](#authorization) is not disabled, the service endpoint requires the `Authorization` header with `Bearer `. ## Configuration options @@ -144,38 +207,115 @@ $ curl https://llama31.example.com/v1/chat/completions \ By default, `dstack` runs a single replica of the service. You can configure the number of replicas as well as the auto-scaling rules. -
+=== "NVIDIA" -```yaml -type: service -name: llama31-service +
-python: 3.12 + ```yaml + type: service + name: qwen397-service -env: - - HF_TOKEN -commands: - - uv pip install vllm - - vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --max-model-len 4096 -port: 8000 + image: lmsysorg/sglang:v0.5.10.post1 -resources: - gpu: 24GB + commands: + - | + sglang serve \ + --model-path Qwen/Qwen3.5-397B-A17B-FP8 \ + --port 30000 \ + --tp $DSTACK_GPUS_NUM \ + --reasoning-parser qwen3 \ + --tool-call-parser qwen3_coder \ + --enable-flashinfer-allreduce-fusion \ + --mem-fraction-static 0.8 + + port: 30000 + model: Qwen/Qwen3.5-397B-A17B-FP8 + + volumes: + # Optional instance volume for model and runtime caches + - instance_path: /root/.cache + path: /root/.cache + optional: true -replicas: 1..4 -scaling: - # Requests per seconds - metric: rps - # Target metric value - target: 10 -``` + resources: + cpu: x86:96.. + memory: 512GB.. + shm_size: 16GB + disk: 500GB.. + gpu: H100:80GB:8 -
+ replicas: 1..2 + scaling: + metric: rps + target: 1 + ``` + +
+ +=== "AMD" + +
+ + ```yaml + type: service + name: qwen397-service + + image: lmsysorg/sglang:v0.5.10.post1-rocm720-mi30x + + env: + - HIP_FORCE_DEV_KERNARG=1 + - SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 + - SGLANG_DISABLE_CUDNN_CHECK=1 + - SGLANG_INT4_WEIGHT=0 + - SGLANG_MOE_PADDING=1 + - SGLANG_ROCM_DISABLE_LINEARQUANT=0 + - SGLANG_ROCM_FUSED_DECODE_MLA=1 + - SGLANG_SET_CPU_AFFINITY=1 + - SGLANG_USE_AITER=1 + - SGLANG_USE_ROCM700A=1 + + commands: + - | + sglang serve \ + --model-path Qwen/Qwen3.5-397B-A17B-FP8 \ + --tp $DSTACK_GPUS_NUM \ + --reasoning-parser qwen3 \ + --tool-call-parser qwen3_coder \ + --mem-fraction-static 0.8 \ + --context-length 262144 \ + --attention-backend triton \ + --disable-cuda-graph \ + --fp8-gemm-backend aiter \ + --port 30000 + + port: 30000 + model: Qwen/Qwen3.5-397B-A17B-FP8 + + volumes: + # Optional instance volume for model and runtime caches + - instance_path: /root/.cache + path: /root/.cache + optional: true + + resources: + cpu: x86:52.. + memory: 700GB.. + shm_size: 16GB + disk: 600GB.. + gpu: MI300X:192GB:4 + + replicas: 1..2 + scaling: + metric: rps + target: 1 + ``` + +
The [`replicas`](../reference/dstack.yml/service.md#replicas) property can be a number or a range. -The [`metric`](../reference/dstack.yml/service.md#metric) property of [`scaling`](../reference/dstack.yml/service.md#scaling) only supports the `rps` metric (requests per second). In this -case `dstack` adjusts the number of replicas (scales up or down) automatically based on the load. +The [`metric`](../reference/dstack.yml/service.md#metric) property of [`scaling`](../reference/dstack.yml/service.md#scaling) only supports the `rps` metric (requests per second). In this +case `dstack` adjusts the number of replicas (scales up or down) automatically based on the load. Setting the minimum number of replicas to `0` allows the service to scale down to zero when there are no requests. @@ -186,13 +326,13 @@ Setting the minimum number of replicas to `0` allows the service to scale down t ??? info "Replica groups" A service can include multiple replica groups. Each group can define its own `commands`, `resources` requirements, and `scaling` rules. -
+
```yaml type: service name: llama-8b-service - image: lmsysorg/sglang:latest + image: lmsysorg/sglang:v0.5.10.post1 env: - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-8B @@ -239,75 +379,74 @@ Since 0.20.17, `dstack` supports serving a model using PD disaggregation. To use Below is an example for running `zai-org/GLM-4.5-Air-FP8`: -
+=== "NVIDIA" -```yaml -type: service -name: prefill-decode -image: lmsysorg/sglang:latest +
-env: - - HF_TOKEN - - MODEL_ID=zai-org/GLM-4.5-Air-FP8 + ```yaml + type: service + name: prefill-decode + image: lmsysorg/sglang:v0.5.10.post1 -replicas: - - count: 1 - # For now replica group with router must have count: 1 - commands: - - pip install sglang_router - - | - python -m sglang_router.launch_router \ - --host 0.0.0.0 \ - --port 8000 \ - --pd-disaggregation \ - --prefill-policy cache_aware - router: - type: sglang - resources: - cpu: 4 + env: + - HF_TOKEN + - MODEL_ID=zai-org/GLM-4.5-Air-FP8 - - count: 1..4 - scaling: - metric: rps - target: 3 - commands: - - | - python -m sglang.launch_server \ - --model-path $MODEL_ID \ - --disaggregation-mode prefill \ - --disaggregation-transfer-backend nixl \ - --host 0.0.0.0 \ - --port 8000 \ - --disaggregation-bootstrap-port 8998 - resources: - gpu: H200 + replicas: + - count: 1 + # For now replica group with router must have count: 1 + commands: + - pip install sglang_router + - | + python -m sglang_router.launch_router \ + --port 8000 \ + --pd-disaggregation \ + --prefill-policy cache_aware + router: + type: sglang + resources: + cpu: 4 - - count: 1..8 - scaling: - metric: rps - target: 2 - commands: - - | - python -m sglang.launch_server \ - --model-path $MODEL_ID \ - --disaggregation-mode decode \ - --disaggregation-transfer-backend nixl \ - --host 0.0.0.0 \ - --port 8000 - resources: - gpu: H200 + - count: 1..4 + scaling: + metric: rps + target: 3 + commands: + - | + python -m sglang.launch_server \ + --model-path $MODEL_ID \ + --disaggregation-mode prefill \ + --disaggregation-transfer-backend nixl \ + --port 8000 \ + --disaggregation-bootstrap-port 8998 + resources: + gpu: H200 -port: 8000 -model: zai-org/GLM-4.5-Air-FP8 + - count: 1..8 + scaling: + metric: rps + target: 2 + commands: + - | + python -m sglang.launch_server \ + --model-path $MODEL_ID \ + --disaggregation-mode decode \ + --disaggregation-transfer-backend nixl \ + --port 8000 + resources: + gpu: H200 -# Custom probe is required for PD 
disaggregation. -probes: - - type: http - url: /health - interval: 15s -``` + port: 8000 + model: zai-org/GLM-4.5-Air-FP8 -
+ # Custom probe is required for PD disaggregation. + probes: + - type: http + url: /health + interval: 15s + ``` + +
!!! info "Cluster" PD disaggregation requires the service to run in a fleet with `placement` set to `cluster`, because the replicas require an interconnect between instances. @@ -319,7 +458,7 @@ probes: By default, the service enables authorization, meaning the service endpoint requires a `dstack` user token. This can be disabled by setting `auth` to `false`. -
+
```yaml type: service @@ -410,7 +549,7 @@ Probes are executed for each service replica while the replica is `running`. A p
??? info "Model" - If you set the [`model`](#model) property but don't explicitly configure `probes`, + If you set the [`model`](#model) property but don't explicitly configure `probes`, `dstack` automatically configures a default probe that tests the model using the `/v1/chat/completions` API. To disable probes entirely when `model` is set, explicitly set `probes` to an empty list. @@ -423,7 +562,7 @@ If your `dstack` project doesn't have a [gateway](gateways.md), services are hos When running web apps, you may need to set some app-specific settings so that browser-side scripts and CSS work correctly with the path prefix. -
+
```yaml type: service @@ -462,7 +601,7 @@ on a dedicated domain name by setting up a [gateway](gateways.md). If you have a [gateway](gateways.md), you can configure rate limits for your service using the [`rate_limits`](../reference/dstack.yml/service.md#rate_limits) property. -
+
```yaml type: service @@ -488,7 +627,7 @@ Limits apply to the whole service (all replicas) and per client (by IP). Clients Instead of partitioning requests by client IP address, you can choose to partition by the value of a header. -
+
```yaml
type: service
@@ -508,7 +647,7 @@ Limits apply to the whole service (all replicas) and per client (by IP). Clients
 
 ### Model
 
-If the service runs a model with an OpenAI-compatible interface, you can set the [`model`](#model) property to make the model accessible through `dstack`'s chat UI on the `Models` page. 
+If the service runs a model with an OpenAI-compatible interface, you can set the [`model`](#model) property to make the model accessible through `dstack`'s chat UI on the `Models` page.
 In this case, `dstack` will use the service's `/v1/chat/completions` endpoint.
 
 When `model` is set, `dstack` automatically configures [`probes`](#probes) to verify model health.
@@ -516,10 +655,10 @@ To customize or disable this, set `probes` explicitly.
 
 ### Resources
 
-If you specify memory size, you can either specify an explicit size (e.g. `24GB`) or a 
+If you specify memory size, you can either specify an explicit size (e.g. `24GB`) or a
 range (e.g. `24GB..`, or `24GB..80GB`, or `..80GB`).
 
-
+
```yaml type: service @@ -550,10 +689,10 @@ resources:
-The `cpu` property lets you set the architecture (`x86` or `arm`) and core count — e.g., `x86:16` (16 x86 cores), `arm:8..` (at least 8 ARM cores). +The `cpu` property lets you set the architecture (`x86` or `arm`) and core count — e.g., `x86:16` (16 x86 cores), `arm:8..` (at least 8 ARM cores). If not set, `dstack` infers it from the GPU or defaults to `x86`. -The `gpu` property lets you specify vendor, model, memory, and count — e.g., `nvidia` (one NVIDIA GPU), `A100` (one A100), `A10G,A100` (either), `A100:80GB` (one 80GB A100), `A100:2` (two A100), `24GB..40GB:2` (two GPUs with 24–40GB), `A100:40GB:2` (two 40GB A100s). +The `gpu` property lets you specify vendor, model, memory, and count — e.g., `nvidia` (one NVIDIA GPU), `A100` (one A100), `A10G,A100` (either), `A100:80GB` (one 80GB A100), `A100:2` (two A100), `24GB..40GB:2` (two GPUs with 24–40GB), `A100:40GB:2` (two 40GB A100s). If vendor is omitted, `dstack` infers it from the model or defaults to `nvidia`. @@ -563,7 +702,7 @@ If vendor is omitted, `dstack` infers it from the model or defaults to `nvidia`. ```yaml type: service name: llama31-service-optimum-tpu - + image: dstackai/optimum-tpu:llama31 env: - HF_TOKEN @@ -575,7 +714,7 @@ If vendor is omitted, `dstack` infers it from the model or defaults to `nvidia`. port: 8000 # Register the model model: meta-llama/Meta-Llama-3.1-8B-Instruct - + resources: gpu: v5litepod-4 ``` @@ -583,7 +722,7 @@ If vendor is omitted, `dstack` infers it from the model or defaults to `nvidia`. Currently, only 8 TPU cores can be specified, supporting single TPU device workloads. Multi-TPU support is coming soon. --> ??? info "Shared memory" - If you are using parallel communicating processes (e.g., dataloaders in PyTorch), you may need to configure + If you are using parallel communicating processes (e.g., dataloaders in PyTorch), you may need to configure `shm_size`, e.g. set it to `16GB`. 
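The `gpu` shorthand described above (model, per-GPU memory, and count separated by colons, e.g. `A100:40GB:2`) can be illustrated with a toy parser. This is a hypothetical sketch for intuition only — not dstack's actual parser, which also handles vendors, lists of alternative models, and memory ranges:

```python
# Hypothetical sketch (not dstack's real implementation) of how a spec
# such as "A100:40GB:2" breaks into model, per-GPU memory, and count:
# tokens ending in "GB" are memory, purely numeric tokens are a count,
# and anything else is treated as a model name.

def parse_gpu_spec(spec: str) -> dict:
    result = {"model": None, "memory": None, "count": 1}
    for token in spec.split(":"):
        if token.isdigit():
            result["count"] = int(token)
        elif token.upper().endswith("GB"):
            result["memory"] = token
        else:
            result["model"] = token
    return result

print(parse_gpu_spec("A100:40GB:2"))  # {'model': 'A100', 'memory': '40GB', 'count': 2}
print(parse_gpu_spec("H100:80GB:8"))  # {'model': 'H100', 'memory': '80GB', 'count': 8}
```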
> If you’re unsure which offers (hardware configurations) are available from the configured backends, use the @@ -594,18 +733,18 @@ If vendor is omitted, `dstack` infers it from the model or defaults to `nvidia`. #### Default image -If you don't specify `image`, `dstack` uses its [base](https://github.com/dstackai/dstack/tree/master/docker/base) Docker image pre-configured with - `uv`, `python`, `pip`, essential CUDA drivers, `mpirun`, and NCCL tests (under `/opt/nccl-tests/build`). +If you don't specify `image`, `dstack` uses its [base](https://github.com/dstackai/dstack/tree/master/docker/base) Docker image pre-configured with + `uv`, `python`, `pip`, essential CUDA drivers, `mpirun`, and NCCL tests (under `/opt/nccl-tests/build`). Set the `python` property to pre-install a specific version of Python. -
+
```yaml type: service -name: http-server-service +name: http-server-service python: 3.12 @@ -618,16 +757,16 @@ port: 8000 #### NVCC -By default, the base Docker image doesn’t include `nvcc`, which is required for building custom CUDA kernels. +By default, the base Docker image doesn’t include `nvcc`, which is required for building custom CUDA kernels. If you need `nvcc`, set the [`nvcc`](../reference/dstack.yml/dev-environment.md#nvcc) property to true. -
+
```yaml type: service -name: http-server-service +name: http-server-service python: 3.12 nvcc: true @@ -650,7 +789,7 @@ If you want, you can specify your own Docker image via `image`. name: http-server-service image: python - + commands: - python3 -m http.server port: 8000 @@ -662,18 +801,26 @@ If you want, you can specify your own Docker image via `image`. Set `docker` to `true` to enable the `docker` CLI in your service, e.g., to run Docker images or use Docker Compose. -
+
```yaml type: service -name: chat-ui-task +name: compose-service auth: false docker: true -working_dir: examples/misc/docker-compose commands: + - | + cat > compose.yaml <<'EOF' + services: + web: + image: python:3.11-slim + command: python -m http.server 9000 + ports: + - "9000:9000" + EOF - docker compose up port: 9000 ``` @@ -689,8 +836,8 @@ To enable privileged mode, set [`privileged`](../reference/dstack.yml/dev-enviro Not supported with `runpod`, `vastai`, and `kubernetes`. #### Private registry - -Use the [`registry_auth`](../reference/dstack.yml/dev-environment.md#registry_auth) property to provide credentials for a private Docker registry. + +Use the [`registry_auth`](../reference/dstack.yml/dev-environment.md#registry_auth) property to provide credentials for a private Docker registry. ```yaml type: service @@ -711,7 +858,7 @@ model: deepseek-ai/deepseek-r1-distill-llama-8b resources: gpu: H100:1 ``` - + ### Environment variables
@@ -741,7 +888,7 @@ resources: ??? info "System environment variables" The following environment variables are available in any run by default: - + | Name | Description | |-------------------------|--------------------------------------------------| | `DSTACK_RUN_NAME` | The name of the run | @@ -768,7 +915,7 @@ Sometimes, when you run a service, you may want to mount local files. This is po -
+
```yaml type: service @@ -801,7 +948,7 @@ The container path is optional. If not specified, it will be automatically calcu -
+
```yaml
type: service
@@ -840,7 +987,7 @@ Imagine you have a Git repo (cloned locally) containing an `examples` subdirectory

-
+
```yaml type: service @@ -874,10 +1021,10 @@ The local path can be either relative to the configuration file or absolute. By default, `dstack` clones the repo to the [working directory](#working-directory). - + You can override the repo directory using either a relative or an absolute path: -
+
```yaml type: service @@ -905,18 +1052,18 @@ The local path can be either relative to the configuration file or absolute. > If the repo directory is relative, it is resolved against [working directory](#working-directory). - If the repo directory is not empty, the run will fail with a runner error. + If the repo directory is not empty, the run will fail with a runner error. To override this behavior, you can set `if_exists` to `skip`: ```yaml type: service - name: llama-2-7b-service - + name: llama-2-7b-service + repos: - local_path: .. path: /my-repo if_exists: skip - + python: 3.12 env: @@ -932,7 +1079,7 @@ The local path can be either relative to the configuration file or absolute. ``` ??? info "Repo size" - The repo size is not limited. However, local changes are limited to 2MB. + The repo size is not limited. However, local changes are limited to 2MB. To avoid exceeding this limit, exclude unnecessary files using `.gitignore` or `.dstackignore`. You can increase the 2MB limit by setting the `DSTACK_SERVER_CODE_UPLOAD_LIMIT` environment variable. @@ -941,7 +1088,7 @@ The local path can be either relative to the configuration file or absolute. -
+
```yaml type: service @@ -979,7 +1126,7 @@ Currently, you can configure up to one repo per run configuration. By default, if `dstack` can't find capacity, or the service exits with an error, or the instance is interrupted, the run will fail. -If you'd like `dstack` to automatically retry, configure the +If you'd like `dstack` to automatically retry, configure the [retry](../reference/dstack.yml/service.md#retry) property accordingly: @@ -1075,7 +1222,7 @@ The `schedule` property can be combined with `max_duration` or `utilization_poli ??? info "Cron syntax" `dstack` supports [POSIX cron syntax](https://pubs.opengroup.org/onlinepubs/9699919799/utilities/crontab.html#tag_20_25_07). One exception is that days of the week are started from Monday instead of Sunday so `0` corresponds to Monday. - + The month and day of week fields accept abbreviated English month and weekday names (`jan–dec` and `mon–sun`) respectively. A cron expression consists of five fields: @@ -1105,8 +1252,8 @@ The `schedule` property can be combined with `max_duration` or `utilization_poli !!! info "Reference" Services support many more configuration options, - incl. [`backends`](../reference/dstack.yml/service.md#backends), - [`regions`](../reference/dstack.yml/service.md#regions), + incl. [`backends`](../reference/dstack.yml/service.md#backends), + [`regions`](../reference/dstack.yml/service.md#regions), [`max_price`](../reference/dstack.yml/service.md#max_price), and among [others](../reference/dstack.yml/service.md). @@ -1133,7 +1280,7 @@ Update the run? [y/n]: If approved, `dstack` gradually updates the service replicas. To update a replica, `dstack` starts a new replica, waits for it to become `running` and for all of its [probes](#probes) to pass, then terminates the old replica. This process is repeated for each replica, one at a time. -You can track the progress of rolling deployment in both `dstack apply` or `dstack ps`. 
+You can track the progress of rolling deployment in both `dstack apply` and `dstack ps`.
 Older replicas have lower `deployment` numbers; newer ones have higher.
diff --git a/examples/accelerators/amd/README.md b/examples/accelerators/amd/README.md
index 3f6b4966b1..36be8044e8 100644
--- a/examples/accelerators/amd/README.md
+++ b/examples/accelerators/amd/README.md
@@ -16,7 +16,7 @@ Llama 3.1 70B in FP16 using [vLLM](https://docs.vllm.ai/en/latest/getting_starte
 
 === "vLLM"
 
-
+
```yaml type: service @@ -69,9 +69,7 @@ Llama 3.1 70B in FP16 using [vLLM](https://docs.vllm.ai/en/latest/getting_starte Note, maximum size of vLLM’s `KV cache` is 126192, consequently we must set `MAX_MODEL_LEN` to 126192. Adding `/opt/conda/envs/py_3.10/bin` to PATH ensures we use the Python 3.10 environment necessary for the pre-built binaries compiled specifically for this version. - > To speed up the `vLLM-ROCm` installation, we use a pre-built binary from S3. - > You can find the task to build and upload the binary in - > [`examples/inference/vllm/amd/`](https://github.com/dstackai/dstack/blob/master/examples/inference/vllm/amd/). + > To speed up the `vLLM-ROCm` installation, this example uses a pre-built binary from S3. !!! info "Docker image" If you want to use AMD, specifying `image` is currently required. This must be an image that includes @@ -87,7 +85,7 @@ To request multiple GPUs, specify the quantity after the GPU name, separated by and the [`mlabonne/guanaco-llama2-1k`](https://huggingface.co/datasets/mlabonne/guanaco-llama2-1k) dataset. -
+
```yaml type: task @@ -133,7 +131,7 @@ To request multiple GPUs, specify the quantity after the GPU name, separated by and the [tatsu-lab/alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) dataset. -
+
```yaml type: task @@ -187,13 +185,12 @@ To request multiple GPUs, specify the quantity after the GPU name, separated by Note, to support ROCm, we need to checkout to commit `d4f6c65`. This commit eliminates the need to manually modify the Axolotl source code to make xformers compatible with ROCm, as described in the [xformers workaround](https://docs.axolotl.ai/docs/amd_hpc.html#apply-xformers-workaround). This installation approach is also followed for building Axolotl ROCm docker image. [(See Dockerfile)](https://github.com/ROCm/rocm-blogs/blob/release/blogs/artificial-intelligence/axolotl/src/Dockerfile.rocm). - > To speed up installation of `flash-attention` and `xformers `, we use pre-built binaries uploaded to S3. - > You can find the tasks that build and upload the binaries - > in [`examples/single-node-training/axolotl/amd/`](https://github.com/dstackai/dstack/blob/master/examples/single-node-training/axolotl/amd/). + > To speed up installation of `flash-attention` and `xformers`, we use pre-built binaries uploaded to S3. ## Running a configuration -Once the configuration is ready, run `dstack apply -f `, and `dstack` will automatically provision the +Once a configuration is ready, save it to a `.dstack.yml` file, then run +`dstack apply -f `, and `dstack` will automatically provision the cloud resources and run the configuration.
@@ -204,18 +201,11 @@ $ WANDB_API_KEY=... $ WANDB_PROJECT=... $ WANDB_NAME=axolotl-amd-llama31-train $ HUB_MODEL_ID=... -$ dstack apply -f examples/inference/vllm/amd/.dstack.yml +$ dstack apply -f service.dstack.yml ```
-## Source code - -The source-code of this example can be found in -[`examples/inference/vllm/amd`](https://github.com/dstackai/dstack/blob/master/examples/inference/vllm/amd), -[`examples/single-node-training/axolotl/amd`](https://github.com/dstackai/dstack/blob/master/examples/single-node-training/axolotl/amd) and -[`examples/single-node-training/trl/amd`](https://github.com/dstackai/dstack/blob/master/examples/single-node-training/trl/amd) - ## What's next? 1. Browse [vLLM](https://docs.vllm.ai/en/latest/getting_started/amd-installation.html#build-from-source-rocm), diff --git a/examples/accelerators/tpu/README.md b/examples/accelerators/tpu/README.md index 98982a919a..53f31b93bd 100644 --- a/examples/accelerators/tpu/README.md +++ b/examples/accelerators/tpu/README.md @@ -29,7 +29,7 @@ and [vLLM](https://github.com/vllm-project/vllm). === "Optimum TPU" -
+
```yaml type: service @@ -61,7 +61,7 @@ and [vLLM](https://github.com/vllm-project/vllm). the official Docker image can be used. === "vLLM" -
+
```yaml type: service @@ -184,13 +184,6 @@ Note, `v5litepod` is optimized for fine-tuning transformer-based models. Each co | **TRL** | bfloat16 | To fine-tune using TRL, Optimum TPU is recommended. TRL doesn't support Llama 3.1 out of the box. | | **Pytorch XLA** | bfloat16 | | -## Source code - -The source-code of this example can be found in -[`examples/inference/tgi/tpu`](https://github.com/dstackai/dstack/blob/master/examples/inference/tgi/tpu), -[`examples/inference/vllm/tpu`](https://github.com/dstackai/dstack/blob/master/examples/inference/vllm/tpu), -and [`examples/single-node-training/optimum-tpu`](https://github.com/dstackai/dstack/blob/master/examples/single-node-training/trl). - ## What's next? 1. Browse [Optimum TPU](https://github.com/huggingface/optimum-tpu), diff --git a/examples/clusters/nccl-rccl-tests/README.md b/examples/clusters/nccl-rccl-tests/README.md index 1d9166591e..a9cadd82e8 100644 --- a/examples/clusters/nccl-rccl-tests/README.md +++ b/examples/clusters/nccl-rccl-tests/README.md @@ -115,7 +115,6 @@ Here's an example of a task that runs AllReduce test on 2 nodes, each with 4 GPU kernel driver `bnxt_re`. To ensure this compatibility, we mount `libbnxt_re-rdmav34.so` from the host and preload it using `LD_PRELOAD` when running MPI. - !!! info "Privileged" In some cases, the backend (e.g., `kubernetes`) may require `privileged: true` to access the high-speed interconnect (e.g., InfiniBand). @@ -138,11 +137,6 @@ Submit the run nccl-tests? [y/n]: y
-## Source code - -The source-code of this example can be found in -[`examples/clusters/nccl-rccl-tests`](https://github.com/dstackai/dstack/blob/master/examples/clusters/nccl-rccl-tests). - ## What's next? 1. Check [dev environments](https://dstack.ai/docs/concepts/dev-environments), [tasks](https://dstack.ai/docs/concepts/tasks), diff --git a/examples/distributed-training/axolotl/README.md b/examples/distributed-training/axolotl/README.md index 593d4142b2..cd7be95e4c 100644 --- a/examples/distributed-training/axolotl/README.md +++ b/examples/distributed-training/axolotl/README.md @@ -94,11 +94,6 @@ Provisioning... ```
-## Source code - -The source-code of this example can be found in -[`examples/distributed-training/axolotl`](https://github.com/dstackai/dstack/blob/master/examples/distributed-training/axolotl). - !!! info "What's next?" 1. Read the [clusters](https://dstack.ai/docs/guides/clusters) guide 2. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/concepts/tasks), diff --git a/examples/distributed-training/trl/README.md b/examples/distributed-training/trl/README.md index 84ac7dd6ba..47d3f6f888 100644 --- a/examples/distributed-training/trl/README.md +++ b/examples/distributed-training/trl/README.md @@ -154,11 +154,6 @@ Provisioning... ```
-## Source code - -The source-code of this example can be found in -[`examples/distributed-training/trl`](https://github.com/dstackai/dstack/blob/master/examples/distributed-training/trl). - !!! info "What's next?" 1. Read the [clusters](https://dstack.ai/docs/guides/clusters) guide 2. Check [dev environments](https://dstack.ai/docs/concepts/dev-environments), [tasks](https://dstack.ai/docs/concepts/tasks), diff --git a/examples/inference/nim/.dstack.yml b/examples/inference/nim/.dstack.yml deleted file mode 100644 index 4a7d33406b..0000000000 --- a/examples/inference/nim/.dstack.yml +++ /dev/null @@ -1,27 +0,0 @@ -type: service -name: serve-distill-deepseek - -image: nvcr.io/nim/deepseek-ai/deepseek-r1-distill-llama-8b -env: - - NGC_API_KEY - - NIM_MAX_MODEL_LEN=4096 -registry_auth: - username: $oauthtoken - password: ${{ env.NGC_API_KEY }} -port: 8000 -# Register the model -model: deepseek-ai/deepseek-r1-distill-llama-8b - -# Uncomment to leverage spot instances -#spot_policy: auto - -# Cache downloaded models -volumes: - - instance_path: /root/.cache/nim - path: /opt/nim/.cache - optional: true - -resources: - gpu: A100:40GB - # Uncomment if using multiple GPUs - #shm_size: 16GB diff --git a/examples/inference/nim/README.md b/examples/inference/nim/README.md index 356492a49a..680c51f498 100644 --- a/examples/inference/nim/README.md +++ b/examples/inference/nim/README.md @@ -1,11 +1,11 @@ --- title: NVIDIA NIM -description: Deploying DeepSeek-R1-Distill-Llama-8B using NVIDIA NIM +description: Deploying Nemotron-3-Super-120B-A12B using NVIDIA NIM --- # NVIDIA NIM -This example shows how to deploy DeepSeek-R1-Distill-Llama-8B using [NVIDIA NIM](https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html) and `dstack`. +This example shows how to deploy Nemotron-3-Super-120B-A12B using [NVIDIA NIM](https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html) and `dstack`. ??? 
info "Prerequisites" Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples. @@ -21,60 +21,46 @@ This example shows how to deploy DeepSeek-R1-Distill-Llama-8B using [NVIDIA NIM] ## Deployment -Here's an example of a service that deploys DeepSeek-R1-Distill-Llama-8B using NIM. +Here's an example of a service that deploys Nemotron-3-Super-120B-A12B using NIM. -
+
```yaml type: service -name: serve-distill-deepseek +name: nemotron120 -image: nvcr.io/nim/deepseek-ai/deepseek-r1-distill-llama-8b +image: nvcr.io/nim/nvidia/nemotron-3-super-120b-a12b:1.8.0 env: - NGC_API_KEY - - NIM_MAX_MODEL_LEN=4096 registry_auth: username: $oauthtoken password: ${{ env.NGC_API_KEY }} port: 8000 -# Register the model -model: deepseek-ai/deepseek-r1-distill-llama-8b - -# Uncomment to leverage spot instances -#spot_policy: auto - -# Cache downloaded models +model: nvidia/nemotron-3-super-120b-a12b volumes: - instance_path: /root/.cache/nim path: /opt/nim/.cache optional: true resources: - gpu: A100:40GB - # Uncomment if using multiple GPUs - #shm_size: 16GB + cpu: x86:96.. + memory: 512GB.. + shm_size: 16GB + disk: 500GB.. + gpu: H100:80GB:8 ```
### Running a configuration -To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command. +Save the configuration above as `nemotron120.dstack.yml`, then use the +[`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command.
```shell $ NGC_API_KEY=... -$ dstack apply -f examples/inference/nim/.dstack.yml - - # BACKEND REGION RESOURCES SPOT PRICE - 1 vultr ewr 6xCPU, 60GB, 1xA100 (40GB) no $1.199 - 2 vultr ewr 6xCPU, 60GB, 1xA100 (40GB) no $1.199 - 3 vultr nrt 6xCPU, 60GB, 1xA100 (40GB) no $1.199 - -Submit the run serve-distill-deepseek? [y/n]: y - -Provisioning... ----> 100% +$ dstack apply -f nemotron120.dstack.yml ```
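Once the run is up, the service is reachable through the `dstack` server's proxy. The endpoint URL is composed from the server URL, the project name, and the run name; here is a small illustrative helper (the function and its name are an assumption for this sketch, not part of `dstack`'s API):

```python
def service_endpoint(server_url: str, project: str, run_name: str) -> str:
    """Compose the proxy endpoint used when no gateway is configured."""
    return f"{server_url.rstrip('/')}/proxy/services/{project}/{run_name}"

# The curl example below targets exactly this URL.
url = service_endpoint("http://127.0.0.1:3000", "main", "nemotron120")
# → "http://127.0.0.1:3000/proxy/services/main/nemotron120"
```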
@@ -83,12 +69,12 @@ If no gateway is created, the service endpoint will be available at ` ```shell -$ curl http://127.0.0.1:3000/proxy/services/main/serve-distill-deepseek/v1/chat/completions \ +$ curl http://127.0.0.1:3000/proxy/services/main/nemotron120/v1/chat/completions \ -X POST \ -H 'Authorization: Bearer <dstack token>' \ -H 'Content-Type: application/json' \ -d '{ - "model": "meta/llama3-8b-instruct", + "model": "nvidia/nemotron-3-super-120b-a12b", "messages": [ { "role": "system", @@ -105,14 +91,9 @@ $ curl http://127.0.0.1:3000/proxy/services/main/serve-distill-deepseek/v1/chat/
-When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint will be available at `https://serve-distill-deepseek.<gateway domain>/`.
-
-## Source code
-
-The source-code of this example can be found in
-[`examples/inference/nim`](https://github.com/dstackai/dstack/blob/master/examples/inference/nim).
+When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint will be available at `https://nemotron120.<gateway domain>/`.
 
 ## What's next?
 
 1. Check [services](https://dstack.ai/docs/services)
-2. Browse the [DeepSeek AI NIM](https://build.nvidia.com/deepseek-ai)
+2. Browse the [Nemotron-3-Super-120B-A12B model page](https://build.nvidia.com/nvidia/nemotron-3-super-120b-a12b)
diff --git a/examples/inference/sglang/README.md b/examples/inference/sglang/README.md
index 8dbfe347f1..9d08fe09cb 100644
--- a/examples/inference/sglang/README.md
+++ b/examples/inference/sglang/README.md
@@ -1,84 +1,121 @@
 ---
 title: SGLang
-description: Deploying DeepSeek-R1-Distill-Llama models using SGLang on NVIDIA and AMD GPUs
+description: Deploying Qwen3.5-397B-A17B-FP8 using SGLang on NVIDIA and AMD GPUs
 ---
 
 # SGLang
 
-This example shows how to deploy DeepSeek-R1-Distill-Llama 8B and 70B using [SGLang](https://github.com/sgl-project/sglang) and `dstack`.
+This example shows how to deploy `Qwen/Qwen3.5-397B-A17B-FP8` using
+[SGLang](https://github.com/sgl-project/sglang) and `dstack`.
 
 ## Apply a configuration
 
-Here's an example of a service that deploys DeepSeek-R1-Distill-Llama 8B and 70B using SGLang.
+Here's an example of a service that deploys
+`Qwen/Qwen3.5-397B-A17B-FP8` using SGLang.
 
 === "NVIDIA"
 
-
+
```yaml type: service - name: deepseek-r1 + name: qwen397 - image: lmsysorg/sglang:latest - env: - - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-8B + image: lmsysorg/sglang:v0.5.10.post1 commands: - - python3 -m sglang.launch_server - --model-path $MODEL_ID - --port 8000 - --trust-remote-code - - port: 8000 - model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + - | + sglang serve \ + --model-path Qwen/Qwen3.5-397B-A17B-FP8 \ + --port 30000 \ + --tp $DSTACK_GPUS_NUM \ + --reasoning-parser qwen3 \ + --tool-call-parser qwen3_coder \ + --enable-flashinfer-allreduce-fusion \ + --mem-fraction-static 0.8 + + port: 30000 + model: Qwen/Qwen3.5-397B-A17B-FP8 + + volumes: + - instance_path: /root/.cache + path: /root/.cache + optional: true resources: - gpu: 24GB + cpu: x86:96.. + memory: 512GB.. + shm_size: 16GB + disk: 500GB.. + gpu: H100:80GB:8 ```
=== "AMD" -
+
```yaml type: service - name: deepseek-r1 + name: qwen397 + + image: lmsysorg/sglang:v0.5.10.post1-rocm720-mi30x - image: lmsysorg/sglang:v0.4.1.post4-rocm620 env: - - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B + - HIP_FORCE_DEV_KERNARG=1 + - SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 + - SGLANG_DISABLE_CUDNN_CHECK=1 + - SGLANG_INT4_WEIGHT=0 + - SGLANG_MOE_PADDING=1 + - SGLANG_ROCM_DISABLE_LINEARQUANT=0 + - SGLANG_ROCM_FUSED_DECODE_MLA=1 + - SGLANG_SET_CPU_AFFINITY=1 + - SGLANG_USE_AITER=1 + - SGLANG_USE_ROCM700A=1 commands: - - python3 -m sglang.launch_server - --model-path $MODEL_ID - --port 8000 - --trust-remote-code - - port: 8000 - model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B + - | + sglang serve \ + --model-path Qwen/Qwen3.5-397B-A17B-FP8 \ + --tp $DSTACK_GPUS_NUM \ + --reasoning-parser qwen3 \ + --tool-call-parser qwen3_coder \ + --mem-fraction-static 0.8 \ + --context-length 262144 \ + --attention-backend triton \ + --disable-cuda-graph \ + --fp8-gemm-backend aiter \ + --port 30000 + + port: 30000 + model: Qwen/Qwen3.5-397B-A17B-FP8 + + volumes: + - instance_path: /root/.cache + path: /root/.cache + optional: true resources: - gpu: MI300x - disk: 300GB + cpu: x86:52.. + memory: 700GB.. + shm_size: 16GB + disk: 600GB.. + gpu: MI300X:192GB:4 ```
-To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command.
+The AMD example uses a validated MI300X configuration for this model,
+including the ROCm/AITER environment settings required for stable FP8 serving.
+
+Save one of the configurations above as `qwen397.dstack.yml`, then use the
+[`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command.

```shell -$ dstack apply -f examples/llms/deepseek/sglang/amd/.dstack.yml - - # BACKEND REGION RESOURCES SPOT PRICE - 1 runpod EU-RO-1 24xCPU, 283GB, 1xMI300X (192GB) no $2.49 - -Submit the run deepseek-r1? [y/n]: y - -Provisioning... ----> 100% +$ dstack apply -f qwen397.dstack.yml ``` +
If no gateway is created, the service endpoint will be available at
`<dstack server URL>/proxy/services/<project name>/<run name>/`.

@@ -86,29 +123,26 @@ If no gateway is created, the service endpoint will be available at `

 ```shell
-curl http://127.0.0.1:3000/proxy/services/main/deepseek-r1/v1/chat/completions \
+curl http://127.0.0.1:3000/proxy/services/main/qwen397/v1/chat/completions \
   -X POST \
   -H 'Authorization: Bearer <dstack token>' \
   -H 'Content-Type: application/json' \
   -d '{
-    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
+    "model": "Qwen/Qwen3.5-397B-A17B-FP8",
     "messages": [
-      {
-        "role": "system",
-        "content": "You are a helpful assistant."
-      },
       {
         "role": "user",
-        "content": "What is Deep Learning?"
+        "content": "A bat and a ball cost $1.10 total. The bat costs $1.00 more than the ball. How much does the ball cost? Answer with just the dollar amount."
       }
     ],
-    "stream": true,
-    "max_tokens": 512
+    "chat_template_kwargs": {"enable_thinking": true},
+    "separate_reasoning": true,
+    "max_tokens": 1024
 }'
 ```
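The same request body can be assembled in Python. `chat_template_kwargs` and `separate_reasoning` are passed through to SGLang exactly as in the curl example above; actually sending the payload (e.g. with `requests.post`) is left out so the sketch stays self-contained:

```python
import json

def chat_request(model: str, question: str, max_tokens: int = 1024) -> str:
    """Build the JSON body for an OpenAI-compatible /v1/chat/completions call."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        # SGLang extensions used in the curl example above:
        "chat_template_kwargs": {"enable_thinking": True},
        "separate_reasoning": True,
        "max_tokens": max_tokens,
    }
    return json.dumps(body)

payload = chat_request(
    "Qwen/Qwen3.5-397B-A17B-FP8",
    "A bat and a ball cost $1.10 total. The bat costs $1.00 more than the ball. "
    "How much does the ball cost? Answer with just the dollar amount.",
)
```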
-> If a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured (e.g. to enable auto-scaling, HTTPS, rate limits, etc.), the service endpoint will be available at `https://deepseek-r1./`. +> If a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured (e.g. to enable auto-scaling, HTTPS, rate limits, etc.), the service endpoint will be available at `https://qwen397./`. ## Configuration options @@ -116,75 +150,77 @@ curl http://127.0.0.1:3000/proxy/services/main/deepseek-r1/v1/chat/completions \ To run SGLang with [PD disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html), use replicas groups: one for a router (for example, [SGLang Model Gateway](https://docs.sglang.io/advanced_features/sgl_model_gateway.html)), one for prefill workers, and one for decode workers. -
- -```yaml -type: service -name: prefill-decode -image: lmsysorg/sglang:latest - -env: - - HF_TOKEN - - MODEL_ID=zai-org/GLM-4.5-Air-FP8 +=== "NVIDIA" -replicas: - - count: 1 - # For now replica group with router must have count: 1 - commands: - - pip install sglang_router - - | - python -m sglang_router.launch_router \ - --host 0.0.0.0 \ - --port 8000 \ - --pd-disaggregation \ - --prefill-policy cache_aware - router: - type: sglang - resources: - cpu: 4 +
- - count: 1..4 - scaling: - metric: rps - target: 3 - commands: - - | - python -m sglang.launch_server \ - --model-path $MODEL_ID \ - --disaggregation-mode prefill \ - --disaggregation-transfer-backend nixl \ - --host 0.0.0.0 \ - --port 8000 \ - --disaggregation-bootstrap-port 8998 - resources: - gpu: H200 + ```yaml + type: service + name: prefill-decode + image: lmsysorg/sglang:v0.5.10.post1 - - count: 1..8 - scaling: - metric: rps - target: 2 - commands: - - | - python -m sglang.launch_server \ - --model-path $MODEL_ID \ - --disaggregation-mode decode \ - --disaggregation-transfer-backend nixl \ - --host 0.0.0.0 \ - --port 8000 - resources: - gpu: H200 + env: + - HF_TOKEN + - MODEL_ID=zai-org/GLM-4.5-Air-FP8 + + replicas: + - count: 1 + # For now replica group with router must have count: 1 + commands: + - pip install sglang_router + - | + python -m sglang_router.launch_router \ + --host 0.0.0.0 \ + --port 8000 \ + --pd-disaggregation \ + --prefill-policy cache_aware + router: + type: sglang + resources: + cpu: 4 + + - count: 1..4 + scaling: + metric: rps + target: 3 + commands: + - | + python -m sglang.launch_server \ + --model-path $MODEL_ID \ + --disaggregation-mode prefill \ + --disaggregation-transfer-backend nixl \ + --host 0.0.0.0 \ + --port 8000 \ + --disaggregation-bootstrap-port 8998 + resources: + gpu: H200 + + - count: 1..8 + scaling: + metric: rps + target: 2 + commands: + - | + python -m sglang.launch_server \ + --model-path $MODEL_ID \ + --disaggregation-mode decode \ + --disaggregation-transfer-backend nixl \ + --host 0.0.0.0 \ + --port 8000 + resources: + gpu: H200 -port: 8000 -model: zai-org/GLM-4.5-Air-FP8 + port: 8000 + model: zai-org/GLM-4.5-Air-FP8 -# Custom probe is required for PD disaggregation. -probes: - - type: http - url: /health - interval: 15s -``` + # Custom probe is required for PD disaggregation. + probes: + - type: http + url: /health + interval: 15s + ``` -
+
Currently, auto-scaling only supports `rps` as the metric. TTFT and ITL metrics are coming soon. @@ -193,12 +229,7 @@ Currently, auto-scaling only supports `rps` as the metric. TTFT and ITL metrics While the prefill and decode replicas run on GPUs, the router replica requires a CPU instance in the same cluster. -## Source code - -The source-code of these examples can be found in -[`examples/llms/deepseek/sglang`](https://github.com/dstackai/dstack/blob/master/examples/llms/deepseek/sglang) and [`examples/inference/sglang`](https://github.com/dstackai/dstack/blob/master/examples/inference/sglang). - ## What's next? 1. Read about [services](https://dstack.ai/docs/concepts/services) and [gateways](https://dstack.ai/docs/concepts/gateways) -2. Browse the [SgLang DeepSeek Usage](https://docs.sglang.ai/references/deepseek.html), [Supercharge DeepSeek-R1 Inference on AMD Instinct MI300X](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1-Part2/README.html) +2. Browse the [Qwen 3.5 SGLang cookbook](https://cookbook.sglang.io/autoregressive/Qwen/Qwen3.5) and the [SGLang server arguments reference](https://docs.sglang.ai/advanced_features/server_arguments.html) diff --git a/examples/inference/sglang/pd-disagg.fleet.dstack.yml b/examples/inference/sglang/pd-disagg.fleet.dstack.yml deleted file mode 100644 index 6cb839d98b..0000000000 --- a/examples/inference/sglang/pd-disagg.fleet.dstack.yml +++ /dev/null @@ -1,12 +0,0 @@ -type: fleet -name: pd-disagg - -placement: cluster - -ssh_config: - user: ubuntu - identity_file: ~/.ssh/id_rsa - hosts: - - 89.169.108.16 # CPU Host (router) - - 89.169.123.100 # GPU Host (prefill/decode workers) - - 89.169.110.65 # GPU Host (prefill/decode workers) diff --git a/examples/inference/sglang/pd.deprecated.dstack.yml b/examples/inference/sglang/pd.deprecated.dstack.yml deleted file mode 100644 index ff62080ae9..0000000000 --- a/examples/inference/sglang/pd.deprecated.dstack.yml +++ /dev/null @@ -1,54 +0,0 @@ -# DEPRECATED: 
Gateway-based PD disaggregation config. -# Use `pd.dstack.yml` instead (router runs as a replica). - -type: service -name: prefill-decode -image: lmsysorg/sglang:latest - -env: - - HF_TOKEN - - MODEL_ID=zai-org/GLM-4.5-Air-FP8 - -replicas: - - count: 1..4 - scaling: - metric: rps - target: 3 - commands: - - | - python -m sglang.launch_server \ - --model-path $MODEL_ID \ - --disaggregation-mode prefill \ - --disaggregation-transfer-backend mooncake \ - --host 0.0.0.0 \ - --port 8000 \ - --disaggregation-bootstrap-port 8998 - resources: - gpu: 1 - - - count: 1..8 - scaling: - metric: rps - target: 2 - commands: - - | - python -m sglang.launch_server \ - --model-path $MODEL_ID \ - --disaggregation-mode decode \ - --disaggregation-transfer-backend mooncake \ - --host 0.0.0.0 \ - --port 8000 - resources: - gpu: 1 - -port: 8000 -model: zai-org/GLM-4.5-Air-FP8 - -probes: - - type: http - url: /health_generate - interval: 15s - -router: - type: sglang - pd_disaggregation: true diff --git a/examples/inference/sglang/pd.dstack.yml b/examples/inference/sglang/pd.dstack.yml deleted file mode 100644 index c026bab242..0000000000 --- a/examples/inference/sglang/pd.dstack.yml +++ /dev/null @@ -1,62 +0,0 @@ -type: service -name: prefill-decode -image: lmsysorg/sglang:latest - -env: - - HF_TOKEN - - MODEL_ID=zai-org/GLM-4.5-Air-FP8 - -replicas: - - count: 1 - # For now replica group with router must have count: 1 - commands: - - pip install sglang_router - - | - python -m sglang_router.launch_router \ - --host 0.0.0.0 \ - --port 8000 \ - --pd-disaggregation \ - --prefill-policy cache_aware - router: - type: sglang - resources: - cpu: 4 - - - count: 1..4 - scaling: - metric: rps - target: 3 - commands: - - | - python -m sglang.launch_server \ - --model-path $MODEL_ID \ - --disaggregation-mode prefill \ - --disaggregation-transfer-backend nixl \ - --host 0.0.0.0 \ - --port 8000 \ - --disaggregation-bootstrap-port 8998 - resources: - gpu: H200 - - - count: 1..8 - scaling: - metric: rps 
- target: 2 - commands: - - | - python -m sglang.launch_server \ - --model-path $MODEL_ID \ - --disaggregation-mode decode \ - --disaggregation-transfer-backend nixl \ - --host 0.0.0.0 \ - --port 8000 - resources: - gpu: H200 - -port: 8000 -model: zai-org/GLM-4.5-Air-FP8 - -probes: - - type: http - url: /health - interval: 15s diff --git a/examples/inference/trtllm/README.md b/examples/inference/trtllm/README.md index 31ddb5b0c3..ae3666d225 100644 --- a/examples/inference/trtllm/README.md +++ b/examples/inference/trtllm/README.md @@ -1,331 +1,66 @@ --- title: TensorRT-LLM -description: Deploying DeepSeek R1 and distilled models using NVIDIA TensorRT-LLM with Triton +description: Deploying Qwen3-235B-A22B-FP8 using NVIDIA TensorRT-LLM on NVIDIA GPUs --- # TensorRT-LLM -This example shows how to deploy both DeepSeek R1 and its distilled version -using [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) and `dstack`. +This example shows how to deploy `nvidia/Qwen3-235B-A22B-FP8` using +[TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) and `dstack`. -??? info "Prerequisites" - Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples. +## Apply a configuration -
- - ```shell - $ git clone https://github.com/dstackai/dstack - $ cd dstack - ``` - -
+Here's an example of a service that deploys +`nvidia/Qwen3-235B-A22B-FP8` using TensorRT-LLM. -## Deployment - -### DeepSeek R1 - -We normally use Triton with the TensorRT-LLM backend to serve models. While this works for the distilled Llama-based -version, DeepSeek R1 isn’t yet compatible. So, for DeepSeek R1, we’ll use `trtllm-serve` with the PyTorch backend instead. - -To use `trtllm-serve`, we first need to build the TensorRT-LLM Docker image from the `main` branch. - -#### Build a Docker image - -Here’s the task config that builds the image and pushes it using the provided Docker credentials. - -
+
```yaml -type: task -name: build-image +type: service +name: qwen235 + +image: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc11 -privileged: true -image: dstackai/dind env: - - DOCKER_USERNAME - - DOCKER_PASSWORD + - HF_HUB_ENABLE_HF_TRANSFER=1 + commands: - - start-dockerd - - apt update && apt-get install -y build-essential make git git-lfs - - git lfs install - - git clone https://github.com/NVIDIA/TensorRT-LLM.git - - cd TensorRT-LLM - - git submodule update --init --recursive - - git lfs pull - # Limit compilation to Hopper for a smaller image - - make -C docker release_build CUDA_ARCHS="90-real" - - docker tag tensorrt_llm/release:latest $DOCKER_USERNAME/tensorrt_llm:latest - - echo "$DOCKER_PASSWORD" | docker login -u "$DOCKER_USERNAME" --password-stdin - - docker push "$DOCKER_USERNAME/tensorrt_llm:latest" + - pip install hf_transfer + - | + trtllm-serve serve nvidia/Qwen3-235B-A22B-FP8 \ + --host 0.0.0.0 \ + --port 8000 \ + --backend pytorch \ + --tp_size $DSTACK_GPUS_NUM \ + --max_batch_size 32 \ + --max_num_tokens 4096 \ + --kv_cache_free_gpu_memory_fraction 0.75 + +port: 8000 +model: nvidia/Qwen3-235B-A22B-FP8 + +volumes: + - instance_path: /root/.cache + path: /root/.cache + optional: true resources: - cpu: 8 - disk: 500GB.. + cpu: 96.. + memory: 512GB.. + shm_size: 32GB + disk: 1000GB.. + gpu: H100:8 ```
-To run it, pass the task configuration to `dstack apply`. +Apply it with [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md):
```shell -$ dstack apply -f examples/inference/trtllm/build-image.dstack.yml - - # BACKEND REGION RESOURCES SPOT PRICE - 1 cudo ca-montreal-2 8xCPU, 25GB, (500.0GB) yes $0.1073 - -Submit the run build-image? [y/n]: y - -Provisioning... ----> 100% +$ dstack apply -f qwen235.dstack.yml ``` -
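The `gpu: H100:8` requirement can be sanity-checked with back-of-the-envelope arithmetic: FP8 stores roughly one byte per parameter, so the 235B-parameter weights need about 235 GB, which fits in 8×80 GB with room left for the KV cache pool governed by `--kv_cache_free_gpu_memory_fraction`. A sketch of that estimate (rough assumptions, not measured values):

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight footprint: parameters × bytes per parameter."""
    return params_billion * bytes_per_param

weights_gb = weight_memory_gb(235, 1.0)  # FP8 ≈ 1 byte/param → ~235 GB
capacity_gb = 8 * 80                     # 8×H100:80GB → 640 GB total
headroom_gb = capacity_gb - weights_gb   # left for KV cache and activations
assert weights_gb < capacity_gb
```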
-#### Deploy the model - -Below is the service configuration that deploys DeepSeek R1 using the built TensorRT-LLM image. - -
- - ```yaml - type: service - name: serve-r1 - - # Specify the image built with `examples/inference/trtllm/build-image.dstack.yml` - image: dstackai/tensorrt_llm:9b931c0f6305aefa3660e6fb84a76a42c0eef167 - env: - - MAX_BATCH_SIZE=256 - - MAX_NUM_TOKENS=16384 - - MAX_SEQ_LENGTH=16384 - - EXPERT_PARALLEL=4 - - PIPELINE_PARALLEL=1 - - HF_HUB_ENABLE_HF_TRANSFER=1 - commands: - - pip install -U "huggingface_hub[cli]" - - pip install hf_transfer - - huggingface-cli download deepseek-ai/DeepSeek-R1 --local-dir DeepSeek-R1 - - trtllm-serve - --backend pytorch - --max_batch_size $MAX_BATCH_SIZE - --max_num_tokens $MAX_NUM_TOKENS - --max_seq_len $MAX_SEQ_LENGTH - --tp_size $DSTACK_GPUS_NUM - --ep_size $EXPERT_PARALLEL - --pp_size $PIPELINE_PARALLEL - DeepSeek-R1 - port: 8000 - model: deepseek-ai/DeepSeek-R1 - - resources: - gpu: 8:H200 - shm_size: 32GB - disk: 2000GB.. - ``` -
- - -To run it, pass the configuration to `dstack apply`. - -
- -```shell -$ dstack apply -f examples/inference/trtllm/serve-r1.dstack.yml - - # BACKEND REGION RESOURCES SPOT PRICE - 1 vastai is-iceland 192xCPU, 2063GB, 8xH200 (141GB) yes $25.62 - -Submit the run serve-r1? [y/n]: y - -Provisioning... ----> 100% -``` -
- - -### DeepSeek R1 Distill Llama 8B - -To deploy DeepSeek R1 Distill Llama 8B, follow the steps below. - -#### Convert and upload checkpoints - -Here’s the task config that converts a Hugging Face model to a TensorRT-LLM checkpoint format -and uploads it to S3 using the provided AWS credentials. - -
- - ```yaml - type: task - name: convert-model - - image: nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3 - env: - - HF_TOKEN - - MODEL_REPO=https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B - - S3_BUCKET_NAME - - AWS_ACCESS_KEY_ID - - AWS_SECRET_ACCESS_KEY - - AWS_DEFAULT_REGION - commands: - # nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3 container uses TensorRT-LLM version 0.17.0, - # therefore we are using branch v0.17.0 - - git clone --branch v0.17.0 --depth 1 https://github.com/triton-inference-server/tensorrtllm_backend.git - - git clone --branch v0.17.0 --single-branch https://github.com/NVIDIA/TensorRT-LLM.git - - git clone https://github.com/triton-inference-server/server.git - - cd TensorRT-LLM/examples/llama - - apt-get -y install git git-lfs - - git lfs install - - git config --global credential.helper store - - huggingface-cli login --token $HF_TOKEN --add-to-git-credential - - git clone $MODEL_REPO - - python3 convert_checkpoint.py --model_dir DeepSeek-R1-Distill-Llama-8B --output_dir tllm_checkpoint_${DSTACK_GPUS_NUM}gpu_bf16 --dtype bfloat16 --tp_size $DSTACK_GPUS_NUM - # Download the AWS CLI - - curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" - - unzip awscliv2.zip - - ./aws/install - - aws s3 sync tllm_checkpoint_${DSTACK_GPUS_NUM}gpu_bf16 s3://${S3_BUCKET_NAME}/tllm_checkpoint_${DSTACK_GPUS_NUM}gpu_bf16 --acl public-read - - resources: - gpu: A100:40GB - - ``` -
- - -To run it, pass the configuration to `dstack apply`. - -
- -```shell -$ dstack apply -f examples/inference/trtllm/convert-model.dstack.yml - - # BACKEND REGION RESOURCES SPOT PRICE - 1 vastai us-iowa 12xCPU, 85GB, 1xA100 (40GB) yes $0.66904 - -Submit the run convert-model? [y/n]: y - -Provisioning... ----> 100% -``` -
- - -#### Build and upload the model - -Here’s the task config that builds a TensorRT-LLM model and uploads it to S3 with the provided AWS credentials. - -
- - ```yaml - type: task - name: build-model - - image: nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3 - env: - - MODEL=deepseek-ai/DeepSeek-R1-Distill-Llama-8B - - S3_BUCKET_NAME - - AWS_ACCESS_KEY_ID - - AWS_SECRET_ACCESS_KEY - - AWS_DEFAULT_REGION - - MAX_SEQ_LEN=8192 # Sum of Max Input Length & Max Output Length - - MAX_INPUT_LEN=4096 - - MAX_BATCH_SIZE=256 - - TRITON_MAX_BATCH_SIZE=1 - - INSTANCE_COUNT=1 - - MAX_QUEUE_DELAY_MS=0 - - MAX_QUEUE_SIZE=0 - - DECOUPLED_MODE=true # Set true for streaming - commands: - - huggingface-cli download $MODEL --exclude '*.safetensors' --local-dir tokenizer_dir - - curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" - - unzip awscliv2.zip - - ./aws/install - - aws s3 sync s3://${S3_BUCKET_NAME}/tllm_checkpoint_${DSTACK_GPUS_NUM}gpu_bf16 ./tllm_checkpoint_${DSTACK_GPUS_NUM}gpu_bf16 - - trtllm-build --checkpoint_dir tllm_checkpoint_${DSTACK_GPUS_NUM}gpu_bf16 --gemm_plugin bfloat16 --output_dir tllm_engine_${DSTACK_GPUS_NUM}gpu_bf16 --max_seq_len $MAX_SEQ_LEN --max_input_len $MAX_INPUT_LEN --max_batch_size $MAX_BATCH_SIZE --gpt_attention_plugin bfloat16 --use_paged_context_fmha enable - - git clone --branch v0.17.0 --single-branch https://github.com/NVIDIA/TensorRT-LLM.git - - python3 TensorRT-LLM/examples/run.py --engine_dir tllm_engine_${DSTACK_GPUS_NUM}gpu_bf16 --max_output_len 40 --tokenizer_dir tokenizer_dir --input_text "What is Deep Learning?" 
- - git clone --branch v0.17.0 --depth 1 https://github.com/triton-inference-server/tensorrtllm_backend.git - - mkdir triton_model_repo - - cp -r tensorrtllm_backend/all_models/inflight_batcher_llm/* triton_model_repo/ - - python3 tensorrtllm_backend/tools/fill_template.py -i triton_model_repo/ensemble/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},logits_datatype:TYPE_BF16 - - python3 tensorrtllm_backend/tools/fill_template.py -i triton_model_repo/preprocessing/config.pbtxt tokenizer_dir:tokenizer_dir,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT} - - python3 tensorrtllm_backend/tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:tllm_engine_${DSTACK_GPUS_NUM}gpu_bf16,max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching,max_queue_size:${MAX_QUEUE_SIZE},encoder_input_features_data_type:TYPE_BF16,logits_datatype:TYPE_BF16 - - python3 tensorrtllm_backend/tools/fill_template.py -i triton_model_repo/postprocessing/config.pbtxt tokenizer_dir:tokenizer_dir,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT},max_queue_size:${MAX_QUEUE_SIZE} - - python3 tensorrtllm_backend/tools/fill_template.py -i triton_model_repo/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT},logits_datatype:TYPE_BF16 - - aws s3 sync triton_model_repo s3://${S3_BUCKET_NAME}/triton_model_repo --acl public-read - - aws s3 sync tllm_engine_${DSTACK_GPUS_NUM}gpu_bf16 s3://${S3_BUCKET_NAME}/tllm_engine_${DSTACK_GPUS_NUM}gpu_bf16 --acl public-read - - resources: - gpu: A100:40GB - ``` -
- -To run it, pass the configuration to `dstack apply`. - -
- -```shell -$ dstack apply -f examples/inference/trtllm/build-model.dstack.yml - - # BACKEND REGION RESOURCES SPOT PRICE - 1 vastai us-iowa 12xCPU, 85GB, 1xA100 (40GB) yes $0.66904 - -Submit the run build-model? [y/n]: y - -Provisioning... ----> 100% -``` -
- -#### Deploy the model - -Below is the service configuration that deploys DeepSeek R1 Distill Llama 8B. - -
- -```yaml - type: service - name: serve-distill - - image: nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3 - env: - - MODEL=deepseek-ai/DeepSeek-R1-Distill-Llama-8B - - S3_BUCKET_NAME - - AWS_ACCESS_KEY_ID - - AWS_SECRET_ACCESS_KEY - - AWS_DEFAULT_REGION - - commands: - - huggingface-cli download $MODEL --exclude '*.safetensors' --local-dir tokenizer_dir - - curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" - - unzip awscliv2.zip - - ./aws/install - - aws s3 sync s3://${S3_BUCKET_NAME}/tllm_engine_1gpu_bf16 ./tllm_engine_1gpu_bf16 - - git clone https://github.com/triton-inference-server/server.git - - python3 server/python/openai/openai_frontend/main.py --model-repository s3://${S3_BUCKET_NAME}/triton_model_repo --tokenizer tokenizer_dir --openai-port 8000 - port: 8000 - model: ensemble - - resources: - gpu: A100:40GB - -``` -
- -To run it, pass the configuration to `dstack apply`. - -
- -```shell -$ dstack apply -f examples/inference/trtllm/serve-distill.dstack.yml - - # BACKEND REGION RESOURCES SPOT PRICE - 1 vastai us-iowa 12xCPU, 85GB, 1xA100 (40GB) yes $0.66904 - -Submit the run serve-distill? [y/n]: y - -Provisioning... ----> 100% -```
## Access the endpoint @@ -335,38 +70,30 @@ If no gateway is created, the service endpoint will be available at ` ```shell -$ curl http://127.0.0.1:3000/proxy/services/main/serve-distill/v1/chat/completions \ +$ curl http://127.0.0.1:3000/proxy/services/main/qwen235/v1/chat/completions \ -X POST \ -H 'Authorization: Bearer <dstack token>' \ -H 'Content-Type: application/json' \ -d '{ - "model": "deepseek-ai/DeepSeek-R1", + "model": "nvidia/Qwen3-235B-A22B-FP8", "messages": [ - { - "role": "system", - "content": "You are a helpful assistant." - }, { "role": "user", - "content": "What is Deep Learning?" + "content": "A bat and a ball cost $1.10 total. The bat costs $1.00 more than the ball. How much does the ball cost?" } ], - "stream": true, - "max_tokens": 128 + "chat_template_kwargs": {"enable_thinking": true}, + "max_tokens": 1024, + "temperature": 0.0 }' ```
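When thinking is enabled, reasoning-capable servers typically return the trace in a separate field of the chat-completions response. Here is a sketch of pulling both parts out of a response dict — the `reasoning_content` field name and the sample values are assumptions following the common OpenAI-style schema, not guaranteed output of this deployment:

```python
def split_completion(response: dict):
    """Return (reasoning trace, final answer) from a chat-completions response."""
    message = response["choices"][0]["message"]
    return message.get("reasoning_content"), message["content"]

# Hypothetical response to the bat-and-ball question above:
sample = {
    "choices": [
        {
            "message": {
                "reasoning_content": "Let x be the ball. x + (x + 1.00) = 1.10, so x = 0.05.",
                "content": "$0.05",
            }
        }
    ]
}
reasoning, answer = split_completion(sample)
# answer == "$0.05"
```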
-When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint will be available at `https://serve-distill.<gateway domain>/`.
-
-## Source code
-
-The source-code of this example can be found in
-[`examples/inference/trtllm`](https://github.com/dstackai/dstack/blob/master/examples/inference/trtllm).
+When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint will be available at `https://qwen235.<gateway domain>/`.
 
 ## What's next?
 
-1. Check [services](https://dstack.ai/docs/services)
-2. Browse [Tensorrt-LLM DeepSeek-R1 with PyTorch Backend](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/deepseek_v3) and [Prepare the Model Repository](https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#prepare-the-model-repository)
-3. See also [`trtllm-serve`](https://nvidia.github.io/TensorRT-LLM/commands/trtllm-serve.html#trtllm-serve)
+1. Read about [services](https://dstack.ai/docs/concepts/services) and [gateways](https://dstack.ai/docs/concepts/gateways)
+2. Browse the [TensorRT-LLM deployment guides](https://nvidia.github.io/TensorRT-LLM/deployment-guide/index.html) and the [Qwen3 deployment guide](https://nvidia.github.io/TensorRT-LLM/deployment-guide/deployment-guide-for-qwen3-on-trtllm.html)
+3. 
See the [`trtllm-serve` reference](https://nvidia.github.io/TensorRT-LLM/commands/trtllm-serve/trtllm-serve.html) diff --git a/examples/inference/trtllm/build-image.dstack.yml b/examples/inference/trtllm/build-image.dstack.yml deleted file mode 100644 index 6761379299..0000000000 --- a/examples/inference/trtllm/build-image.dstack.yml +++ /dev/null @@ -1,25 +0,0 @@ -type: task -name: build-image - -privileged: true -image: dstackai/dind -env: - - DOCKER_USERNAME - - DOCKER_PASSWORD -commands: - - start-dockerd - - apt update && apt-get install -y build-essential make git git-lfs - - git lfs install - - git clone https://github.com/NVIDIA/TensorRT-LLM.git - - cd TensorRT-LLM - - git submodule update --init --recursive - - git lfs pull - # Limit compilation to Hopper for a smaller image - - make -C docker release_build CUDA_ARCHS="90-real" - - docker tag tensorrt_llm/release:latest $DOCKER_USERNAME/tensorrt_llm:latest - - echo "$DOCKER_PASSWORD" | docker login -u "$DOCKER_USERNAME" --password-stdin - - docker push "$DOCKER_USERNAME/tensorrt_llm:latest" - -resources: - cpu: 8 - disk: 500GB.. 
diff --git a/examples/inference/trtllm/build-model.dstack.yml b/examples/inference/trtllm/build-model.dstack.yml deleted file mode 100644 index b02b87644d..0000000000 --- a/examples/inference/trtllm/build-model.dstack.yml +++ /dev/null @@ -1,44 +0,0 @@ -type: task -name: build-model - -image: nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3 - - -env: - - MODEL=deepseek-ai/DeepSeek-R1-Distill-Llama-8B - - S3_BUCKET_NAME - - AWS_ACCESS_KEY_ID - - AWS_SECRET_ACCESS_KEY - - AWS_DEFAULT_REGION - - MAX_SEQ_LEN=8192 # Sum of Max Input Length & Max Output Length - - MAX_INPUT_LEN=4096 - - MAX_BATCH_SIZE=256 - - TRITON_MAX_BATCH_SIZE=1 - - INSTANCE_COUNT=1 - - MAX_QUEUE_DELAY_MS=0 - - MAX_QUEUE_SIZE=0 - - DECOUPLED_MODE=true # Set true for streaming - -commands: - - huggingface-cli download $MODEL --exclude '*.safetensors' --local-dir tokenizer_dir - - curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" - - unzip awscliv2.zip - - ./aws/install - - aws s3 sync s3://${S3_BUCKET_NAME}/tllm_checkpoint_${DSTACK_GPUS_NUM}gpu_bf16 ./tllm_checkpoint_${DSTACK_GPUS_NUM}gpu_bf16 - - trtllm-build --checkpoint_dir tllm_checkpoint_${DSTACK_GPUS_NUM}gpu_bf16 --gemm_plugin bfloat16 --output_dir tllm_engine_${DSTACK_GPUS_NUM}gpu_bf16 --max_seq_len $MAX_SEQ_LEN --max_input_len $MAX_INPUT_LEN --max_batch_size $MAX_BATCH_SIZE --gpt_attention_plugin bfloat16 --use_paged_context_fmha enable - - git clone --branch v0.17.0 --single-branch https://github.com/NVIDIA/TensorRT-LLM.git - - python3 TensorRT-LLM/examples/run.py --engine_dir tllm_engine_${DSTACK_GPUS_NUM}gpu_bf16 --max_output_len 40 --tokenizer_dir tokenizer_dir --input_text "What is Deep Learning?" 
- - git clone --branch v0.17.0 --depth 1 https://github.com/triton-inference-server/tensorrtllm_backend.git - - mkdir triton_model_repo - - cp -r tensorrtllm_backend/all_models/inflight_batcher_llm/* triton_model_repo/ - - python3 tensorrtllm_backend/tools/fill_template.py -i triton_model_repo/ensemble/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},logits_datatype:TYPE_BF16 - - python3 tensorrtllm_backend/tools/fill_template.py -i triton_model_repo/preprocessing/config.pbtxt tokenizer_dir:tokenizer_dir,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT} - - python3 tensorrtllm_backend/tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:tllm_engine_${DSTACK_GPUS_NUM}gpu_bf16,max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching,max_queue_size:${MAX_QUEUE_SIZE},encoder_input_features_data_type:TYPE_BF16,logits_datatype:TYPE_BF16 - - python3 tensorrtllm_backend/tools/fill_template.py -i triton_model_repo/postprocessing/config.pbtxt tokenizer_dir:tokenizer_dir,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT},max_queue_size:${MAX_QUEUE_SIZE} - - python3 tensorrtllm_backend/tools/fill_template.py -i triton_model_repo/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT},logits_datatype:TYPE_BF16 - - aws s3 sync triton_model_repo s3://${S3_BUCKET_NAME}/triton_model_repo --acl public-read - - aws s3 sync tllm_engine_${DSTACK_GPUS_NUM}gpu_bf16 s3://${S3_BUCKET_NAME}/tllm_engine_${DSTACK_GPUS_NUM}gpu_bf16 --acl public-read - - -resources: - gpu: A100:40GB diff --git a/examples/inference/trtllm/convert-model.dstack.yml b/examples/inference/trtllm/convert-model.dstack.yml deleted file mode 100644 index 
262e8f2945..0000000000 --- a/examples/inference/trtllm/convert-model.dstack.yml +++ /dev/null @@ -1,34 +0,0 @@ -type: task -name: convert-model - -image: nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3 - -env: - - HF_TOKEN - - MODEL_REPO=https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B - - S3_BUCKET_NAME - - AWS_ACCESS_KEY_ID - - AWS_SECRET_ACCESS_KEY - - AWS_DEFAULT_REGION - -commands: - # nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3 container uses TensorRT-LLM version 0.17.0, - # therefore we are using branch v0.17.0 - - git clone --branch v0.17.0 --depth 1 https://github.com/triton-inference-server/tensorrtllm_backend.git - - git clone --branch v0.17.0 --single-branch https://github.com/NVIDIA/TensorRT-LLM.git - - git clone https://github.com/triton-inference-server/server.git - - cd TensorRT-LLM/examples/llama - - apt-get -y install git git-lfs - - git lfs install - - git config --global credential.helper store - - huggingface-cli login --token $HF_TOKEN --add-to-git-credential - - git clone $MODEL_REPO - - python3 convert_checkpoint.py --model_dir DeepSeek-R1-Distill-Llama-8B --output_dir tllm_checkpoint_${DSTACK_GPUS_NUM}gpu_bf16 --dtype bfloat16 --tp_size $DSTACK_GPUS_NUM - # Download the AWS CLI - - curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" - - unzip awscliv2.zip - - ./aws/install - - aws s3 sync tllm_checkpoint_${DSTACK_GPUS_NUM}gpu_bf16 s3://${S3_BUCKET_NAME}/tllm_checkpoint_${DSTACK_GPUS_NUM}gpu_bf16 --acl public-read - -resources: - gpu: A100:40GB diff --git a/examples/inference/trtllm/serve-distill.dstack.yml b/examples/inference/trtllm/serve-distill.dstack.yml deleted file mode 100644 index bc5ad6d028..0000000000 --- a/examples/inference/trtllm/serve-distill.dstack.yml +++ /dev/null @@ -1,28 +0,0 @@ -type: service -name: serve-distill - -image: nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3 - -env: - - MODEL=deepseek-ai/DeepSeek-R1-Distill-Llama-8B - - S3_BUCKET_NAME - - 
AWS_ACCESS_KEY_ID - - AWS_SECRET_ACCESS_KEY - - AWS_DEFAULT_REGION - -commands: - - huggingface-cli download $MODEL --exclude '*.safetensors' --local-dir tokenizer_dir - - curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" - - unzip awscliv2.zip - - ./aws/install - - aws s3 sync s3://${S3_BUCKET_NAME}/tllm_engine_1gpu_bf16 ./tllm_engine_1gpu_bf16 - - git clone https://github.com/triton-inference-server/server.git - - python3 server/python/openai/openai_frontend/main.py --model-repository s3://${S3_BUCKET_NAME}/triton_model_repo --tokenizer tokenizer_dir --openai-port 8000 - - -port: 8000 - -model: ensemble - -resources: - gpu: A100:40GB diff --git a/examples/inference/trtllm/serve-r1.dstack.yml b/examples/inference/trtllm/serve-r1.dstack.yml deleted file mode 100644 index f5e9091720..0000000000 --- a/examples/inference/trtllm/serve-r1.dstack.yml +++ /dev/null @@ -1,32 +0,0 @@ -type: service -name: serve-r1 - -# Specify the image built with `examples/inference/trtllm/build-image.dstack.yml` -image: dstackai/tensorrt_llm:9b931c0f6305aefa3660e6fb84a76a42c0eef167 -env: - - MAX_BATCH_SIZE=256 - - MAX_NUM_TOKENS=16384 - - MAX_SEQ_LENGTH=16384 - - EXPERT_PARALLEL=4 - - PIPELINE_PARALLEL=1 - - HF_HUB_ENABLE_HF_TRANSFER=1 -commands: - - pip install -U "huggingface_hub[cli]" - - pip install hf_transfer - - huggingface-cli download deepseek-ai/DeepSeek-R1 --local-dir DeepSeek-R1 - - trtllm-serve - --backend pytorch - --max_batch_size $MAX_BATCH_SIZE - --max_num_tokens $MAX_NUM_TOKENS - --max_seq_len $MAX_SEQ_LENGTH - --tp_size $DSTACK_GPUS_NUM - --ep_size $EXPERT_PARALLEL - --pp_size $PIPELINE_PARALLEL - DeepSeek-R1 -port: 8000 -model: deepseek-ai/DeepSeek-R1 - -resources: - gpu: 8:H200 - shm_size: 32GB - disk: 2000GB.. 
diff --git a/examples/inference/vllm/.dstack.yml b/examples/inference/vllm/.dstack.yml deleted file mode 100644 index 9060cff5ab..0000000000 --- a/examples/inference/vllm/.dstack.yml +++ /dev/null @@ -1,28 +0,0 @@ -type: service -name: llama31 - -python: "3.11" -env: - - HF_TOKEN - - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct - - MAX_MODEL_LEN=4096 -commands: - - pip install vllm - - vllm serve $MODEL_ID - --max-model-len $MAX_MODEL_LEN - --tensor-parallel-size $DSTACK_GPUS_NUM -port: 8000 -# Register the model -model: meta-llama/Meta-Llama-3.1-8B-Instruct - -# Uncomment to leverage spot instances -#spot_policy: auto - -# Uncomment to cache downloaded models -#volumes: -# - /root/.cache/huggingface/hub:/root/.cache/huggingface/hub - -resources: - gpu: 24GB - # Uncomment if using multiple GPUs - #shm_size: 24GB diff --git a/examples/inference/vllm/README.md b/examples/inference/vllm/README.md index ce77e31782..7497af6698 100644 --- a/examples/inference/vllm/README.md +++ b/examples/inference/vllm/README.md @@ -1,82 +1,68 @@ --- title: vLLM -description: Deploying Llama 3.1 8B using vLLM with OpenAI-compatible API +description: Deploying Qwen3.5-397B-A17B-FP8 using vLLM on NVIDIA GPUs --- # vLLM -This example shows how to deploy Llama 3.1 8B with `dstack` using [vLLM](https://docs.vllm.ai/en/latest/). +This example shows how to deploy `Qwen/Qwen3.5-397B-A17B-FP8` using +[vLLM](https://docs.vllm.ai/en/latest/) and `dstack`. -??? info "Prerequisites" - Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples. +## Apply a configuration -
- - ```shell - $ git clone https://github.com/dstackai/dstack - $ cd dstack - ``` - -
+Here's an example of a service that deploys +`Qwen/Qwen3.5-397B-A17B-FP8` using vLLM. -## Deployment - -Here's an example of a service that deploys Llama 3.1 8B using vLLM. - -
- -```yaml -type: service -name: llama31 - -python: "3.11" -env: - - HF_TOKEN - - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct - - MAX_MODEL_LEN=4096 -commands: - - pip install vllm - - vllm serve $MODEL_ID - --max-model-len $MAX_MODEL_LEN - --tensor-parallel-size $DSTACK_GPUS_NUM -port: 8000 -# Register the model -model: meta-llama/Meta-Llama-3.1-8B-Instruct - -# Uncomment to leverage spot instances -#spot_policy: auto - -# Uncomment to cache downloaded models -#volumes: -# - /root/.cache/huggingface/hub:/root/.cache/huggingface/hub - -resources: - gpu: 24GB - # Uncomment if using multiple GPUs - #shm_size: 24GB -``` +=== "NVIDIA" -
+
-### Running a configuration + ```yaml + type: service + name: qwen397 -To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command. + image: vllm/vllm-openai:v0.19.1 -
+ commands: + - | + vllm serve Qwen/Qwen3.5-397B-A17B-FP8 \ + --port 8000 \ + --tensor-parallel-size $DSTACK_GPUS_NUM \ + --max-model-len 262144 \ + --reasoning-parser qwen3 \ + --language-model-only -```shell -$ dstack apply -f examples/inference/vllm/.dstack.yml + port: 8000 + model: Qwen/Qwen3.5-397B-A17B-FP8 + + volumes: + - instance_path: /root/.cache + path: /root/.cache + optional: true + + resources: + cpu: x86:96.. + memory: 512GB.. + shm_size: 16GB + disk: 500GB.. + gpu: H100:80GB:8 + ``` - # BACKEND REGION RESOURCES SPOT PRICE - 1 runpod CA-MTL-1 18xCPU, 100GB, A5000:24GB yes $0.12 - 2 runpod EU-SE-1 18xCPU, 100GB, A5000:24GB yes $0.12 - 3 gcp us-west4 27xCPU, 150GB, A5000:24GB:2 yes $0.23 +
+ +The NVIDIA example serves `Qwen/Qwen3.5-397B-A17B-FP8` on `8x H100` GPUs using +vLLM with tensor parallelism enabled. It uses `--language-model-only` because +`Qwen/Qwen3.5-397B-A17B-FP8` is a text-only model. -Submit a new run? [y/n]: y +Save the configuration above as `qwen397.dstack.yml`, then use the +[`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command. -Provisioning... ----> 100% +
+ +```shell +$ dstack apply -f qwen397.dstack.yml ``` +
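The configuration above pins `gpu: H100:80GB:8`. As a back-of-envelope check of that number: FP8 weights take roughly one byte per parameter, and only part of each GPU is left for weights once the KV cache, activations, and CUDA context are accounted for. The 80% usable-memory fraction below is an illustrative assumption, not a vLLM constant:

```python
import math


def min_tensor_parallel_size(params_billion, bytes_per_param, gpu_mem_gb,
                             usable_fraction=0.8):
    """Smallest GPU count whose combined usable memory holds the model weights.

    `usable_fraction` reserves headroom for the KV cache, activations, and the
    CUDA context; 0.8 is an illustrative assumption, not a measured figure.
    """
    weights_gb = params_billion * bytes_per_param  # 1e9 params x bytes/param, in GB
    return math.ceil(weights_gb / (gpu_mem_gb * usable_fraction))


# ~397B total parameters at FP8 (~1 byte/param) on 80 GB H100s:
print(min_tensor_parallel_size(397, 1.0, 80))  # 7 -- so 8 GPUs leave extra KV-cache room
```

The same arithmetic shows why a BF16 copy of the weights (`bytes_per_param=2.0`) would not fit on a single eight-GPU node without offloading.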
If no gateway is created, the service endpoint will be available at `<dstack server URL>/proxy/services/<project name>/<run name>/`. @@ -84,39 +70,27 @@ If no gateway is created, the service endpoint will be available at ` ```shell -$ curl http://127.0.0.1:3000/proxy/services/main/llama31/v1/chat/completions \ +$ curl http://127.0.0.1:3000/proxy/services/main/qwen397/v1/chat/completions \ -X POST \ -H 'Authorization: Bearer <dstack token>' \ -H 'Content-Type: application/json' \ -d '{ - "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "model": "Qwen/Qwen3.5-397B-A17B-FP8", "messages": [ - { - "role": "system", - "content": "You are a helpful assistant." - }, { "role": "user", - "content": "What is Deep Learning?" + "content": "A bat and a ball cost $1.10 total. The bat costs $1.00 more than the ball. How much does the ball cost?" } ], - "max_tokens": 128 + "max_tokens": 1024 }' ```
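Since the service speaks the OpenAI chat-completions protocol, the response is plain JSON; when a reasoning parser is enabled (as with `--reasoning-parser qwen3` in the configuration above), vLLM places the model's thinking in a separate `reasoning_content` field next to `content`. A small sketch of unpacking both — the sample payload is made up for illustration:

```python
import json


def unpack_chat_completion(payload):
    """Return (reasoning, answer) from an OpenAI-style chat-completion body.

    vLLM only emits `reasoning_content` when a reasoning parser is enabled,
    so fall back to empty strings for servers that omit either field.
    """
    message = json.loads(payload)["choices"][0]["message"]
    return message.get("reasoning_content") or "", message.get("content") or ""


# A made-up, abridged response body:
sample = json.dumps({"choices": [{"message": {
    "role": "assistant",
    "reasoning_content": "Let x be the ball's price: x + (x + 1.00) = 1.10, so x = 0.05.",
    "content": "The ball costs $0.05.",
}}]})
reasoning, answer = unpack_chat_completion(sample)
print(answer)  # The ball costs $0.05.
```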
-When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint will be available at `https://llama31.<gateway domain>/`. - -## Source code - -The source-code of this example can be found in -[`examples/inference/vllm`](https://github.com/dstackai/dstack/blob/master/examples/inference/vllm). +> If a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured (e.g. to enable auto-scaling, HTTPS, rate limits, etc.), the service endpoint will be available at `https://qwen397.<gateway domain>/`. ## What's next? -1. Check [services](https://dstack.ai/docs/services) -2. Browse the [Llama 3.1](https://dstack.ai/examples/llms/llama31/) and - [NIM](https://dstack.ai/examples/inference/nim/) examples -3. See also [AMD](https://dstack.ai/examples/accelerators/amd/) and - [TPU](https://dstack.ai/examples/accelerators/tpu/) +1. Read about [services](https://dstack.ai/docs/concepts/services) and [gateways](https://dstack.ai/docs/concepts/gateways) +2. Browse the [SGLang](https://dstack.ai/examples/inference/sglang/) and [NIM](https://dstack.ai/examples/inference/nim/) examples diff --git a/examples/inference/vllm/amd/.dstack.yml b/examples/inference/vllm/amd/.dstack.yml deleted file mode 100644 index f0c488d755..0000000000 --- a/examples/inference/vllm/amd/.dstack.yml +++ /dev/null @@ -1,43 +0,0 @@ -type: service -name: llama31-service-vllm-amd - -image: runpod/pytorch:2.4.0-py3.10-rocm6.1.0-ubuntu22.04 -env: - - HF_TOKEN - - MODEL_ID=meta-llama/Meta-Llama-3.1-70B-Instruct - - MAX_MODEL_LEN=126192 -commands: - - export PATH=/opt/conda/envs/py_3.10/bin:$PATH - - wget https://github.com/ROCm/hipBLAS/archive/refs/tags/rocm-6.1.0.zip - - unzip rocm-6.1.0.zip - - cd hipBLAS-rocm-6.1.0 - - python rmake.py - - cd ..
- - git clone https://github.com/vllm-project/vllm.git - - cd vllm - - pip install triton - - pip uninstall torch -y - - pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.1 - - pip install /opt/rocm/share/amd_smi - - pip install --upgrade numba scipy huggingface-hub[cli] - - pip install "numpy<2" - - pip install -r requirements-rocm.txt - - wget -N https://github.com/ROCm/vllm/raw/fa78403/rocm_patch/libamdhip64.so.6 -P /opt/rocm/lib - - rm -f "$(python3 -c 'import torch; print(torch.__path__[0])')"/lib/libamdhip64.so* - - export PYTORCH_ROCM_ARCH="gfx90a;gfx942" - - wget https://dstack-binaries.s3.amazonaws.com/vllm-0.6.0%2Brocm614-cp310-cp310-linux_x86_64.whl - - pip install vllm-0.6.0+rocm614-cp310-cp310-linux_x86_64.whl - - vllm serve $MODEL_ID - --max-model-len $MAX_MODEL_LEN - --port 8000 -# Expose the vllm server port -port: 8000 -# Register the model -model: meta-llama/Meta-Llama-3.1-70B-Instruct - -# Uncomment to leverage spot instances -#spot_policy: auto - -resources: - gpu: MI300X - disk: 200GB diff --git a/examples/inference/vllm/amd/build-vllm.dstack.yml b/examples/inference/vllm/amd/build-vllm.dstack.yml deleted file mode 100644 index 3510d0f6d6..0000000000 --- a/examples/inference/vllm/amd/build-vllm.dstack.yml +++ /dev/null @@ -1,47 +0,0 @@ -type: task -name: build-vllm-rocm - -image: runpod/pytorch:2.4.0-py3.10-rocm6.1.0-ubuntu22.04 - -env: - - HF_TOKEN - - AWS_ACCESS_KEY_ID - - AWS_SECRET_ACCESS_KEY - - AWS_REGION - - BUCKET_NAME - -command: - - apt-get update -y - - apt-get install awscli -y - - aws configure set aws_access_key_id $AWS_ACCESS_KEY_ID - - aws configure set aws_secret_access_key $AWS_SECRET_ACCESS_KEY - - aws configure set region $AWS_REGION - - export PATH=/opt/conda/envs/py_3.10/bin:$PATH - - wget https://github.com/ROCm/hipBLAS/archive/refs/tags/rocm-6.1.0.zip - - unzip rocm-6.1.0.zip - - cd hipBLAS-rocm-6.1.0 - - python rmake.py - - cd .. 
- - git clone https://github.com/vllm-project/vllm.git - - cd vllm - - pip install triton - - pip uninstall torch -y - - pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.1 - - pip install /opt/rocm/share/amd_smi - - pip install --upgrade numba scipy huggingface-hub[cli] - - pip install "numpy<2" - - pip install -r requirements-rocm.txt - - wget -N https://github.com/ROCm/vllm/raw/fa78403/rocm_patch/libamdhip64.so.6 -P /opt/rocm/lib - - rm -f "$(python3 -c 'import torch; print(torch.__path__[0])')"/lib/libamdhip64.so* - - export PYTORCH_ROCM_ARCH="gfx90a;gfx942" - - pip install wheel setuptools setuptools_scm ninja - - python setup.py bdist_wheel -d dist/ - - cd dist - - aws s3 cp "$(ls -1 | head -n 1)" s3://$BUCKET_NAME/ --acl public-read - -# Uncomment to leverage spot instances -#spot_policy: auto - -resources: - gpu: MI300X - disk: 150GB diff --git a/examples/inference/vllm/tpu/.dstack.yml b/examples/inference/vllm/tpu/.dstack.yml deleted file mode 100644 index 5a637ca797..0000000000 --- a/examples/inference/vllm/tpu/.dstack.yml +++ /dev/null @@ -1,23 +0,0 @@ -type: service -# The name is optional, if not specified, generated randomly -name: llama31-service-vllm-tpu -image: vllm/vllm-tpu:nightly -env: - - HF_TOKEN - - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct - - MAX_MODEL_LEN=4096 -commands: - - vllm serve $MODEL_ID - --tensor-parallel-size 4 - --max-model-len $MAX_MODEL_LEN - --port 8000 -# Expose the vllm server port -port: 8000 -# Register the model -model: meta-llama/Meta-Llama-3.1-8B-Instruct - -# Uncomment to leverage spot instances -#spot_policy: auto - -resources: - gpu: v5litepod-4 diff --git a/examples/llms/deepseek/README.md b/examples/llms/deepseek/README.md deleted file mode 100644 index ae467891fc..0000000000 --- a/examples/llms/deepseek/README.md +++ /dev/null @@ -1,457 +0,0 @@ -# Deepseek - -This example walks you through how to deploy and -train [Deepseek](https://huggingface.co/deepseek-ai) 
-models with `dstack`. - -> We used Deepseek-R1 distilled models and Deepseek-V2-Lite, a 16B model with the same architecture as Deepseek-R1 (671B). Deepseek-V2-Lite retains MLA and DeepSeekMoE but requires less memory, making it ideal for testing and fine-tuning on smaller GPUs. - -??? info "Prerequisites" - Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples. - -
- - ```shell - $ git clone https://github.com/dstackai/dstack - $ cd dstack - ``` -
- -## Deployment - -### AMD - -Here's an example of a service that deploys `Deepseek-R1-Distill-Llama-70B` using [SGLang](https://github.com/sgl-project/sglang) and [vLLM](https://github.com/vllm-project/vllm) with AMD `MI300X` GPU. The below configurations also support `Deepseek-V2-Lite`. - -=== "SGLang" - -
- - ```yaml - type: service - name: deepseek-r1-amd - - image: lmsysorg/sglang:v0.4.1.post4-rocm620 - env: - - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B - commands: - - python3 -m sglang.launch_server - --model-path $MODEL_ID - --port 8000 - --trust-remote-code - - port: 8000 - model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B - - resources: - gpu: MI300X - disk: 300Gb - - ``` -
- -=== "vLLM" - -
- - ```yaml - type: service - name: deepseek-r1-amd - - image: rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4 - env: - - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B - - MAX_MODEL_LEN=126432 - commands: - - vllm serve $MODEL_ID - --max-model-len $MAX_MODEL_LEN - --trust-remote-code - port: 8000 - - model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B - - resources: - gpu: MI300X - disk: 300Gb - ``` -
- -Note, when using `Deepseek-R1-Distill-Llama-70B` with `vLLM` with a 192GB GPU, we must limit the context size to 126432 tokens to fit the memory. - -### NVIDIA - -Here's an example of a service that deploys `Deepseek-R1-Distill-Llama-8B` -using [SGLang](https://github.com/sgl-project/sglang) -and [vLLM](https://github.com/vllm-project/vllm) with NVIDIA GPUs. -Both SGLang and vLLM also support `Deepseek-V2-Lite`. - -=== "SGLang" -
- - ```yaml - type: service - name: deepseek-r1 - - image: lmsysorg/sglang:latest - env: - - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-8B - commands: - - python3 -m sglang.launch_server - --model-path $MODEL_ID - --port 8000 - --trust-remote-code - - port: 8000 - model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B - - resources: - gpu: 24GB - ``` -
- -=== "vLLM" -
- - ```yaml - type: service - name: deepseek-r1 - - image: vllm/vllm-openai:latest - env: - - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-8B - - MAX_MODEL_LEN=4096 - commands: - - vllm serve $MODEL_ID - --max-model-len $MAX_MODEL_LEN - port: 8000 - model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B - - resources: - gpu: 24GB - ``` -
- -Note, to run `Deepseek-R1-Distill-Llama-8B` with `vLLM` with a 24GB GPU, we must limit the context size to 4096 tokens to fit the memory. - -> To run `Deepseek-V2-Lite` with `vLLM`, we must use 40GB GPU and to run `Deepseek-V2-Lite` with SGLang, we must use -> 80GB GPU. For more details on SGlang's memory requirements you can refer to -> this [issue](https://github.com/sgl-project/sglang/issues/3451). - -### Memory requirements - -Approximate memory requirements for loading the model (excluding context and CUDA/ROCm kernel reservations). - -| Model | Size | FP16 | FP8 | INT4 | -|-----------------------------|----------|--------|--------|--------| -| `Deepseek-R1` | **671B** | 1.35TB | 671GB | 336GB | -| `DeepSeek-R1-Distill-Llama` | **70B** | 161GB | 80.5GB | 40B | -| `DeepSeek-R1-Distill-Qwen` | **32B** | 74GB | 37GB | 18.5GB | -| `DeepSeek-V2-Lite` | **16B** | 35GB | 17.5GB | 8.75GB | -| `DeepSeek-R1-Distill-Qwen` | **14B** | 32GB | 16GB | 8GB | -| `DeepSeek-R1-Distill-Llama` | **8B** | 18GB | 9GB | 4.5GB | -| `DeepSeek-R1-Distill-Qwen` | **7B** | 16GB | 8GB | 4GB | - -For example, the FP8 version of Deepseek-R1 671B fits on a single node of MI300X with eight 192GB GPUs, a single node of -H200 with eight 141GB GPUs. - -### Applying the configuration - -To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command. - -
- -```shell -$ dstack apply -f examples/llms/deepseek/sglang/amd/.dstack.yml - - # BACKEND REGION RESOURCES SPOT PRICE - 1 runpod EU-RO-1 24xCPU, 283GB, 1xMI300X (192GB) no $2.49 - -Submit the run deepseek-r1? [y/n]: y - -Provisioning... ----> 100% -``` -
- -If no gateway is created, the service endpoint will be available at `/proxy/services///`. - -
- -```shell -curl http://127.0.0.1:3000/proxy/services/main/deepseek-r1/v1/chat/completions \ - -X POST \ - -H 'Authorization: Bearer <dstack token>' \ - -H 'Content-Type: application/json' \ - -d '{ - "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B", - "messages": [ - { - "role": "system", - "content": "You are a helpful assistant." - }, - { - "role": "user", - "content": "What is Deep Learning?" - } - ], - "stream": true, - "max_tokens": 512 - }' -``` -
- -When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint will be available at `https://deepseek-r1./`. - -## Fine-tuning - -### AMD - -Here are the examples of LoRA fine-tuning of `Deepseek-V2-Lite` and GRPO fine-tuning of `DeepSeek-R1-Distill-Qwen-1.5B` on `MI300X` GPU using HuggingFace's [TRL](https://github.com/huggingface/trl). - -=== "LoRA" - -
- - ```yaml - type: task - name: trl-train - - image: rocm/pytorch:rocm6.2.3_ubuntu22.04_py3.10_pytorch_release_2.3.0 - - env: - - WANDB_API_KEY - - WANDB_PROJECT - - MODEL_ID=deepseek-ai/DeepSeek-V2-Lite - - ACCELERATE_USE_FSDP=False - commands: - - git clone https://github.com/huggingface/peft.git - - pip install trl - - pip install "numpy<2" - - pip install peft - - pip install wandb - - cd peft/examples/sft - - python train.py - --seed 100 - --model_name_or_path "deepseek-ai/DeepSeek-V2-Lite" - --dataset_name "smangrul/ultrachat-10k-chatml" - --chat_template_format "chatml" - --add_special_tokens False - --append_concat_token False - --splits "train,test" - --max_seq_len 512 - --num_train_epochs 1 - --logging_steps 5 - --log_level "info" - --logging_strategy "steps" - --eval_strategy "epoch" - --save_strategy "epoch" - --hub_private_repo True - --hub_strategy "every_save" - --packing True - --learning_rate 1e-4 - --lr_scheduler_type "cosine" - --weight_decay 1e-4 - --warmup_ratio 0.0 - --max_grad_norm 1.0 - --output_dir "deepseek-sft-lora" - --per_device_train_batch_size 8 - --per_device_eval_batch_size 8 - --gradient_accumulation_steps 4 - --gradient_checkpointing True - --use_reentrant True - --dataset_text_field "content" - --use_peft_lora True - --lora_r 16 - --lora_alpha 16 - --lora_dropout 0.05 - --lora_target_modules "all-linear" - - resources: - gpu: MI300X - disk: 150GB - ``` -
- -=== "GRPO" - -
- ```yaml - type: task - name: trl-train-grpo - - image: rocm/pytorch:rocm6.2.3_ubuntu22.04_py3.10_pytorch_release_2.3.0 - - env: - - WANDB_API_KEY - - WANDB_PROJECT - - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B - files: - - grpo_train.py - commands: - - pip install trl - - pip install datasets - # numPy version less than 2 is required for the scipy installation with AMD. - - pip install "numpy<2" - - mkdir -p grpo_example - - cp grpo_train.py grpo_example/grpo_train.py - - cd grpo_example - - python grpo_train.py - --model_name_or_path $MODEL_ID - --dataset_name trl-lib/tldr - --per_device_train_batch_size 2 - --logging_steps 25 - --output_dir Deepseek-Distill-Qwen-1.5B-GRPO - --trust_remote_code - - resources: - gpu: MI300X - disk: 150GB - ``` -
- -Note, the `GRPO` fine-tuning of `DeepSeek-R1-Distill-Qwen-1.5B` consumes up to 135GB of VRAM. - -### NVIDIA - -Here are examples of LoRA fine-tuning of `DeepSeek-R1-Distill-Qwen-1.5B` and QLoRA fine-tuning of `DeepSeek-V2-Lite` -on NVIDIA GPU using HuggingFace's [TRL](https://github.com/huggingface/trl) library. - -=== "LoRA" -
- - ```yaml - type: task - name: trl-train - - python: 3.12 - - env: - - WANDB_API_KEY - - WANDB_PROJECT - - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B - commands: - - git clone https://github.com/huggingface/trl.git - - pip install trl - - pip install peft - - pip install wandb - - cd trl/trl/scripts - - python sft.py - --model_name_or_path $MODEL_ID - --dataset_name trl-lib/Capybara - --learning_rate 2.0e-4 - --num_train_epochs 1 - --packing - --per_device_train_batch_size 2 - --gradient_accumulation_steps 8 - --gradient_checkpointing - --logging_steps 25 - --eval_strategy steps - --eval_steps 100 - --use_peft - --lora_r 32 - --lora_alpha 16 - --report_to wandb - --output_dir DeepSeek-R1-Distill-Qwen-1.5B-SFT - - resources: - gpu: 24GB - ``` -
- -=== "QLoRA" -
- - ```yaml - type: task - name: trl-train-deepseek-v2 - - python: 3.12 - nvcc: true - env: - - WANDB_API_KEY - - WANDB_PROJECT - - MODEL_ID=deepseek-ai/DeepSeek-V2-Lite - - ACCELERATE_USE_FSDP=False - commands: - - git clone https://github.com/huggingface/peft.git - - pip install trl - - pip install peft - - pip install wandb - - pip install bitsandbytes - - cd peft/examples/sft - - python train.py - --seed 100 - --model_name_or_path "deepseek-ai/DeepSeek-V2-Lite" - --dataset_name "smangrul/ultrachat-10k-chatml" - --chat_template_format "chatml" - --add_special_tokens False - --append_concat_token False - --splits "train,test" - --max_seq_len 512 - --num_train_epochs 1 - --logging_steps 5 - --log_level "info" - --logging_strategy "steps" - --eval_strategy "epoch" - --save_strategy "epoch" - --hub_private_repo True - --hub_strategy "every_save" - --bf16 True - --packing True - --learning_rate 1e-4 - --lr_scheduler_type "cosine" - --weight_decay 1e-4 - --warmup_ratio 0.0 - --max_grad_norm 1.0 - --output_dir "mistral-sft-lora" - --per_device_train_batch_size 8 - --per_device_eval_batch_size 8 - --gradient_accumulation_steps 4 - --gradient_checkpointing True - --use_reentrant True - --dataset_text_field "content" - --use_peft_lora True - --lora_r 16 - --lora_alpha 16 - --lora_dropout 0.05 - --lora_target_modules "all-linear" - --use_4bit_quantization True - --use_nested_quant True - --bnb_4bit_compute_dtype "bfloat16" - - resources: - # Consumes ~25GB of VRAM for QLoRA fine-tuning deepseek-ai/DeepSeek-V2-Lite - gpu: 48GB - ``` -
- -### Memory requirements - -| Model | Size | Full fine-tuning | LoRA | QLoRA | -|-----------------------------|----------|------------------|-------|-------| -| `Deepseek-R1` | **671B** | 10.5TB | 1.4TB | 442GB | -| `DeepSeek-R1-Distill-Llama` | **70B** | 1.09TB | 151GB | 46GB | -| `DeepSeek-R1-Distill-Qwen` | **32B** | 512GB | 70GB | 21GB | -| `DeepSeek-V2-Lite` | **16B** | 256GB | 35GB | 11GB | -| `DeepSeek-R1-Distill-Qwen` | **14B** | 224GB | 30GB | 9GB | -| `DeepSeek-R1-Distill-Llama` | **8B** | 128GB | 17GB | 5GB | -| `DeepSeek-R1-Distill-Qwen` | **7B** | 112GB | 15GB | 4GB | -| `DeepSeek-R1-Distill-Qwen` | **1.5B** | 24GB | 3.2GB | 1GB | - -The memory requirements assume low-rank update matrices are 1% of model parameters. In practice, a 7B model with QLoRA -needs 7–10GB due to intermediate hidden states. - -| Fine-tuning type | Calculation | -|------------------|--------------------------------------------------| -| Full fine-tuning | 671B × 16 bytes = 10.48TB | -| LoRA | 671B × 2 bytes + 1% of 671B × 16 bytes = 1.41TB | -| QLoRA(4-bit) | 671B × 0.5 bytes + 1% of 671B × 16 bytes = 442GB | - -## Source code - -The source-code of this example can be found in -[`examples/llms/deepseek`](https://github.com/dstackai/dstack/blob/master/examples/llms/deepseek). - -!!! info "What's next?" - 1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks), - [services](https://dstack.ai/docs/services), and [protips](https://dstack.ai/docs/protips). 
diff --git a/examples/llms/deepseek/sglang/amd/.dstack.yml b/examples/llms/deepseek/sglang/amd/.dstack.yml deleted file mode 100644 index 99a19bfee3..0000000000 --- a/examples/llms/deepseek/sglang/amd/.dstack.yml +++ /dev/null @@ -1,18 +0,0 @@ -type: service -name: deepseek-r1-amd - -image: lmsysorg/sglang:v0.4.1.post4-rocm620 -env: - - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B -commands: - - python3 -m sglang.launch_server - --model-path $MODEL_ID - --port 8000 - --trust-remote-code - -port: 8000 -model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B - -resources: - gpu: mi300x - disk: 300Gb diff --git a/examples/llms/deepseek/sglang/amd/deepseek_v2_lite.dstack.yml b/examples/llms/deepseek/sglang/amd/deepseek_v2_lite.dstack.yml deleted file mode 100644 index 01ef71a6be..0000000000 --- a/examples/llms/deepseek/sglang/amd/deepseek_v2_lite.dstack.yml +++ /dev/null @@ -1,18 +0,0 @@ -type: service -name: deepseek-v2-lite-amd - -image: lmsysorg/sglang:v0.4.1.post4-rocm620 -env: - - MODEL_ID=deepseek-ai/DeepSeek-V2-Lite -commands: - - python3 -m sglang.launch_server - --model-path $MODEL_ID - --port 8000 - --trust-remote-code - -port: 8000 -model: deepseek-ai/DeepSeek-V2-Lite - -resources: - gpu: mi300x - disk: 150Gb diff --git a/examples/llms/deepseek/sglang/nvidia/.dstack.yml b/examples/llms/deepseek/sglang/nvidia/.dstack.yml deleted file mode 100644 index d1c92b64d1..0000000000 --- a/examples/llms/deepseek/sglang/nvidia/.dstack.yml +++ /dev/null @@ -1,18 +0,0 @@ -type: service -name: deepseek-r1-nvidia - -image: lmsysorg/sglang:latest -env: - - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-8B -commands: - - python3 -m sglang.launch_server - --model-path $MODEL_ID - --port 8000 - --trust-remote-code - -port: 8000 - -model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B - -resources: - gpu: 24GB diff --git a/examples/llms/deepseek/sglang/nvidia/deepseek_v2_lite.dstack.yml b/examples/llms/deepseek/sglang/nvidia/deepseek_v2_lite.dstack.yml deleted file mode 100644 index 
8c0adaa41b..0000000000 --- a/examples/llms/deepseek/sglang/nvidia/deepseek_v2_lite.dstack.yml +++ /dev/null @@ -1,19 +0,0 @@ -# Not Working https://github.com/sgl-project/sglang/issues/3451 -type: service -name: deepseek-v2-lite-nvidia - -image: lmsysorg/sglang:latest -env: - - MODEL_ID=deepseek-ai/DeepSeek-V2-Lite -commands: - - python3 -m sglang.launch_server - --model-path $MODEL_ID - --port 8000 - --trust-remote-code - -port: 8000 - -model: deepseek-ai/DeepSeek-V2-Lite - -resources: - gpu: 80GB diff --git a/examples/llms/deepseek/vllm/amd/.dstack.yml b/examples/llms/deepseek/vllm/amd/.dstack.yml deleted file mode 100644 index 23bfb033c0..0000000000 --- a/examples/llms/deepseek/vllm/amd/.dstack.yml +++ /dev/null @@ -1,19 +0,0 @@ -type: service -name: deepseek-r1-amd - -image: rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4 -env: - - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B - - MAX_MODEL_LEN=126432 -commands: - - vllm serve $MODEL_ID - --max-model-len $MAX_MODEL_LEN - --trust-remote-code -port: 8000 - -model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B - - -resources: - gpu: mi300x - disk: 300Gb diff --git a/examples/llms/deepseek/vllm/amd/deepseek_v2_lite.dstack.yml b/examples/llms/deepseek/vllm/amd/deepseek_v2_lite.dstack.yml deleted file mode 100644 index 8937e95266..0000000000 --- a/examples/llms/deepseek/vllm/amd/deepseek_v2_lite.dstack.yml +++ /dev/null @@ -1,18 +0,0 @@ -type: service -name: deepseek-v2-lite-amd - -image: rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4 -env: - - MODEL_ID=deepseek-ai/DeepSeek-V2-Lite -commands: - - vllm serve $MODEL_ID - --trust-remote-code - -port: 8000 - -model: deepseek-ai/DeepSeek-V2-Lite - - -resources: - gpu: mi300x - disk: 150Gb diff --git a/examples/llms/deepseek/vllm/nvidia/.dstack.yml b/examples/llms/deepseek/vllm/nvidia/.dstack.yml deleted file mode 100644 index e623b182c4..0000000000 --- a/examples/llms/deepseek/vllm/nvidia/.dstack.yml +++ /dev/null @@ -1,17 +0,0 @@ -type: service -name: 
deepseek-r1-nvidia - -image: vllm/vllm-openai:latest -env: - - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-8B - - MAX_MODEL_LEN=4096 -commands: - - vllm serve $MODEL_ID - --max-model-len $MAX_MODEL_LEN - -port: 8000 - -model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B - -resources: - gpu: 24GB diff --git a/examples/llms/deepseek/vllm/nvidia/deepseek_v2_lite.dstack.yml b/examples/llms/deepseek/vllm/nvidia/deepseek_v2_lite.dstack.yml deleted file mode 100644 index 06e78f379b..0000000000 --- a/examples/llms/deepseek/vllm/nvidia/deepseek_v2_lite.dstack.yml +++ /dev/null @@ -1,19 +0,0 @@ -type: service -name: deepseek-v2-lite-nvidia - -image: vllm/vllm-openai:latest -env: - - MODEL_ID=deepseek-ai/DeepSeek-V2-Lite - - MAX_MODEL_LEN=4096 -commands: - - vllm serve $MODEL_ID - --max-model-len $MAX_MODEL_LEN - --tensor-parallel-size $DSTACK_GPUS_NUM - --trust-remote-code - -port: 8000 - -model: deepseek-ai/DeepSeek-V2-Lite - -resources: - gpu: 48GB diff --git a/examples/llms/llama/README.md b/examples/llms/llama/README.md deleted file mode 100644 index 3f2b8ab54d..0000000000 --- a/examples/llms/llama/README.md +++ /dev/null @@ -1,288 +0,0 @@ -# Llama - -This example walks you through how to deploy Llama 4 Scout model with `dstack`. - -??? info "Prerequisites" - Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples. - -
- - ```shell - $ git clone https://github.com/dstackai/dstack - $ cd dstack - ``` - -
- -## Deployment - -### AMD -Here's an example of a service that deploys -[`Llama-4-Scout-17B-16E-Instruct`](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct) -using [vLLM](https://github.com/vllm-project/vllm) -with AMD `MI300X` GPUs. - -
- -```yaml -type: service -name: llama4-scout - -image: rocm/vllm-dev:llama4-20250407 -env: - - HF_TOKEN - - MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct - - VLLM_WORKER_MULTIPROC_METHOD=spawn - - VLLM_USE_MODELSCOPE=False - - VLLM_USE_TRITON_FLASH_ATTN=0 - - MAX_MODEL_LEN=256000 - -commands: - - | - vllm serve $MODEL_ID \ - --tensor-parallel-size $DSTACK_GPUS_NUM \ - --max-model-len $MAX_MODEL_LEN \ - --kv-cache-dtype fp8 \ - --max-num-seqs 64 \ - --override-generation-config='{"attn_temperature_tuning": true}' - - -port: 8000 -# Register the model -model: meta-llama/Llama-4-Scout-17B-16E-Instruct - -resources: - gpu: Mi300x:2 - disk: 500GB.. -``` -
- -### NVIDIA -Here's an example of a service that deploys -[`Llama-4-Scout-17B-16E-Instruct`](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct) -using [SGLang](https://github.com/sgl-project/sglang) and [vLLM](https://github.com/vllm-project/vllm) -with NVIDIA `H200` GPUs. - -=== "SGLang" - -
- - ```yaml - type: service - name: llama4-scout - - image: lmsysorg/sglang - env: - - HF_TOKEN - - MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct - - CONTEXT_LEN=256000 - commands: - - python3 -m sglang.launch_server - --model-path $MODEL_ID - --tp $DSTACK_GPUS_NUM - --context-length $CONTEXT_LEN - --kv-cache-dtype fp8_e5m2 - --port 8000 - - port: 8000 - ## Register the model - model: meta-llama/Llama-4-Scout-17B-16E-Instruct - - resources: - gpu: H200:2 - disk: 500GB.. - ``` -
- -=== "vLLM" - -
- - ```yaml - type: service - name: llama4-scout - - image: vllm/vllm-openai - env: - - HF_TOKEN - - MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct - - VLLM_DISABLE_COMPILE_CACHE=1 - - MAX_MODEL_LEN=256000 - commands: - - | - vllm serve $MODEL_ID \ - --tensor-parallel-size $DSTACK_GPUS_NUM \ - --max-model-len $MAX_MODEL_LEN \ - --kv-cache-dtype fp8 \ - --override-generation-config='{"attn_temperature_tuning": true}' - - port: 8000 - # Register the model - model: meta-llama/Llama-4-Scout-17B-16E-Instruct - - resources: - gpu: H200:2 - disk: 500GB.. - ``` -
- -!!! info "NOTE:" - With vLLM, add `--override-generation-config='{"attn_temperature_tuning": true}'` to - improve accuracy for [contexts longer than 32K tokens](https://blog.vllm.ai/2025/04/05/llama4.html). - -### Memory requirements - -Below are the approximate memory requirements for loading the model. -This excludes memory for the model context and CUDA kernel reservations. - -| Model | Size | FP16 | FP8 | INT4 | -|---------------|----------|--------|--------|--------| -| `Behemoth` | **2T** | 4TB | 2TB | 1TB | -| `Maverick` | **400B** | 800GB | 200GB | 100GB | -| `Scout` | **109B** | 218GB | 109GB | 54.5GB | - - -### Running a configuration - -To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command. - -
- -```shell -$ HF_TOKEN=... -$ dstack apply -f examples/llms/llama/sglang/nvidia/.dstack.yml - - # BACKEND REGION RESOURCES SPOT PRICE - 1 vastai is-iceland 48xCPU, 128GB, 2xH200 (140GB) no $7.87 - 2 runpod EU-SE-1 40xCPU, 128GB, 2xH200 (140GB) no $7.98 - - -Submit the run llama4-scout? [y/n]: y - -Provisioning... ----> 100% -``` - -
- -Once the service is up, it will be available via the service endpoint -at `<dstack server URL>/proxy/services/<project name>/<run name>/`. - -
- -```shell -curl http://127.0.0.1:3000/proxy/services/main/llama4-scout/v1/chat/completions \ - -X POST \ - -H 'Authorization: Bearer <dstack token>' \ - -H 'Content-Type: application/json' \ - -d '{ - "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct", - "messages": [ - { - "role": "system", - "content": "You are a helpful assistant." - }, - { - "role": "user", - "content": "What is Deep Learning?" - } - ], - "stream": true, - "max_tokens": 512 - }' -``` - -
- -When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint -is available at `https://<run name>.<gateway domain>/`. - -[//]: # (TODO: https://github.com/dstackai/dstack/issues/1777) - -## Fine-tuning - -Here's an example of FSDP and QLoRA fine-tuning of the 4-bit quantized [Llama-4-Scout-17B-16E](https://huggingface.co/axolotl-quants/Llama-4-Scout-17B-16E-Linearized-bnb-nf4-bf16) on 2xH100 NVIDIA GPUs using [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl). - -
- -```yaml -type: task -# The name is optional, if not specified, generated randomly -name: axolotl-nvidia-llama-scout-train - -# Using the official Axolotl's Docker image -image: axolotlai/axolotl:main-latest - -# Required environment variables -env: - - HF_TOKEN - - WANDB_API_KEY - - WANDB_PROJECT - - WANDB_NAME=axolotl-nvidia-llama-scout-train - - HUB_MODEL_ID -# Commands of the task -commands: - - wget https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/examples/llama-4/scout-qlora-fsdp1.yaml - - axolotl train scout-qlora-fsdp1.yaml - --wandb-project $WANDB_PROJECT - --wandb-name $WANDB_NAME - --hub-model-id $HUB_MODEL_ID - -resources: - # Two GPU (required by FSDP) - gpu: H100:2 - # Shared memory size for inter-process communication - shm_size: 24GB - disk: 500GB.. -``` -
- -The task uses Axolotl's Docker image, with Axolotl pre-installed. - -### Memory requirements - -Below are the approximate memory requirements for fine-tuning the model. -This excludes memory for the model context and CUDA kernel reservations. - -| Model | Size | Full fine-tuning | LoRA | QLoRA | -|---------------|----------|--------------------|--------|--------| -| `Behemoth` | **2T** | 32TB | 4.3TB | 1.3TB | -| `Maverick` | **400B** | 6.5TB | 864GB | 264GB | -| `Scout` | **109B** | 1.75TB | 236GB | 72GB | - -The memory estimates assume FP16 precision for model weights, with low-rank adaptation (LoRA/QLoRA) layers comprising 1% of the total model parameters. - -| Fine-tuning type | Calculation | -|------------------|--------------------------------------------------| -| Full fine-tuning | 2T × 16 bytes = 32TB | -| LoRA | 2T × 2 bytes + 1% of 2T × 16 bytes = 4.3TB | -| QLoRA(4-bit) | 2T × 0.5 bytes + 1% of 2T × 16 bytes = 1.3TB | - -### Running a configuration - -Once the configuration is ready, run `dstack apply -f <configuration file>`, and `dstack` will automatically provision the -cloud resources and run the configuration. - -
- -```shell -$ HF_TOKEN=... -$ WANDB_API_KEY=... -$ WANDB_PROJECT=... -$ WANDB_NAME=axolotl-nvidia-llama-scout-train -$ HUB_MODEL_ID=... -$ dstack apply -f examples/single-node-training/axolotl/.dstack.yml -``` - -
- -## Source code - -The source-code for deployment examples can be found in -[`examples/llms/llama`](https://github.com/dstackai/dstack/blob/master/examples/llms/llama) and the source-code for the finetuning example can be found in [`examples/single-node-training/axolotl`](https://github.com/dstackai/dstack/blob/master/examples/single-node-training/axolotl). - -## What's next? - -1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks), - [services](https://dstack.ai/docs/services), and [protips](https://dstack.ai/docs/protips). -2. Browse [Llama 4 with SGLang](https://github.com/sgl-project/sglang/blob/main/docs/references/llama4.md), [Llama 4 with vLLM](https://blog.vllm.ai/2025/04/05/llama4.html), [Llama 4 with AMD](https://rocm.blogs.amd.com/artificial-intelligence/llama4-day-0-support/README.html) and [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl). diff --git a/examples/llms/llama/sglang/nvidia/.dstack.yml b/examples/llms/llama/sglang/nvidia/.dstack.yml deleted file mode 100644 index b1aea2ce51..0000000000 --- a/examples/llms/llama/sglang/nvidia/.dstack.yml +++ /dev/null @@ -1,23 +0,0 @@ -type: service -name: llama4-scout - -image: lmsysorg/sglang -env: - - HF_TOKEN - - MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct - - CONTEXT_LEN=256000 -commands: - - python3 -m sglang.launch_server - --model-path $MODEL_ID - --tp $DSTACK_GPUS_NUM - --context-length $CONTEXT_LEN - --port 8000 - --kv-cache-dtype fp8_e5m2 - -port: 8000 -## Register the model -model: meta-llama/Llama-4-Scout-17B-16E-Instruct - -resources: - gpu: H200:2 - disk: 500GB.. 
diff --git a/examples/llms/llama/vllm/amd/.dstack.yml b/examples/llms/llama/vllm/amd/.dstack.yml deleted file mode 100644 index eaaf1e6aeb..0000000000 --- a/examples/llms/llama/vllm/amd/.dstack.yml +++ /dev/null @@ -1,29 +0,0 @@ -type: service -name: llama4-scout - -image: rocm/vllm-dev:llama4-20250407 -env: - - HF_TOKEN - - MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct - - VLLM_WORKER_MULTIPROC_METHOD=spawn - - VLLM_USE_MODELSCOPE=False - - VLLM_USE_TRITON_FLASH_ATTN=0 - - MAX_MODEL_LEN=256000 - -commands: - - | - vllm serve $MODEL_ID \ - --tensor-parallel-size $DSTACK_GPUS_NUM \ - --max-model-len $MAX_MODEL_LEN \ - --kv-cache-dtype fp8 \ - --max-num-seqs 64 \ - --override-generation-config='{"attn_temperature_tuning": true}' - - -port: 8000 -# Register the model -model: meta-llama/Llama-4-Scout-17B-16E-Instruct - -resources: - gpu: Mi300x:2 - disk: 500GB.. diff --git a/examples/llms/llama/vllm/nvidia/.dstack.yml b/examples/llms/llama/vllm/nvidia/.dstack.yml deleted file mode 100644 index c1f8e4919d..0000000000 --- a/examples/llms/llama/vllm/nvidia/.dstack.yml +++ /dev/null @@ -1,24 +0,0 @@ -type: service -name: llama4-scout - -image: vllm/vllm-openai -env: - - HF_TOKEN - - MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct - - VLLM_DISABLE_COMPILE_CACHE=1 - - MAX_MODEL_LEN=256000 -commands: - - | - vllm serve $MODEL_ID \ - --tensor-parallel-size $DSTACK_GPUS_NUM \ - --max-model-len $MAX_MODEL_LEN \ - --kv-cache-dtype fp8 \ - --override-generation-config='{"attn_temperature_tuning": true}' - -port: 8000 -# Register the model -model: meta-llama/Llama-4-Scout-17B-16E-Instruct - -resources: - gpu: H200:2 - disk: 500GB.. diff --git a/examples/llms/llama31/README.md b/examples/llms/llama31/README.md deleted file mode 100644 index b99362cde0..0000000000 --- a/examples/llms/llama31/README.md +++ /dev/null @@ -1,384 +0,0 @@ -# Llama 3.1 - -This example walks you through how to deploy and fine-tune Llama 3.1 with `dstack`. - -??? 
info "Prerequisites" - Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples. - -
- - ```shell - $ git clone https://github.com/dstackai/dstack - $ cd dstack - ``` - -
- -## Deployment - -You can use any serving framework. -Here's an example of a service that deploys Llama 3.1 8B using vLLM, TGI, and NIM. - -=== "vLLM" - -
- - ```yaml - type: service - name: llama31 - - python: "3.11" - env: - - HF_TOKEN - - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct - - MAX_MODEL_LEN=4096 - commands: - - pip install vllm - - vllm serve $MODEL_ID - --max-model-len $MAX_MODEL_LEN - --tensor-parallel-size $DSTACK_GPUS_NUM - port: 8000 - # Register the model - model: meta-llama/Meta-Llama-3.1-8B-Instruct - - # Uncomment to leverage spot instances - #spot_policy: auto - - # Uncomment to cache downloaded models - #volumes: - # - /root/.cache/huggingface/hub:/root/.cache/huggingface/hub - - resources: - gpu: 24GB - # Uncomment if using multiple GPUs - #shm_size: 24GB - ``` - -
- -=== "TGI" - -
- - ```yaml - type: service - name: llama31 - - image: ghcr.io/huggingface/text-generation-inference:latest - env: - - HF_TOKEN - - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct - - MAX_INPUT_LENGTH=4000 - - MAX_TOTAL_TOKENS=4096 - commands: - - NUM_SHARD=$DSTACK_GPUS_NUM text-generation-launcher - port: 80 - # Register the model - model: meta-llama/Meta-Llama-3.1-8B-Instruct - - # Uncomment to leverage spot instances - #spot_policy: auto - - # Uncomment to cache downloaded models - #volumes: - # - /data:/data - - resources: - gpu: 24GB - # Uncomment if using multiple GPUs - #shm_size: 24GB - ``` - -
- -=== "NIM" - -
- - ```yaml - type: service - name: llama31 - - image: nvcr.io/nim/meta/llama-3.1-8b-instruct:latest - env: - - NGC_API_KEY - - NIM_MAX_MODEL_LEN=4096 - registry_auth: - username: $oauthtoken - password: ${{ env.NGC_API_KEY }} - port: 8000 - # Register the model - model: meta/llama-3.1-8b-instruct - - # Uncomment to leverage spot instances - #spot_policy: auto - - # Cache downloaded models - volumes: - - /root/.cache/nim:/opt/nim/.cache - - resources: - gpu: 24GB - # Uncomment if using multiple GPUs - #shm_size: 24GB - ``` - -
- -Note that when using Llama 3.1 8B with a 24GB GPU, we must limit the context size to 4096 tokens to fit into memory. - -### Memory requirements - -Below are the approximate memory requirements for loading the model. -This excludes memory for the model context and CUDA kernel reservations. - -| Model size | FP16 | FP8 | INT4 | -|------------|-------|-------|-------| -| **8B** | 16GB | 8GB | 4GB | -| **70B** | 140GB | 70GB | 35GB | -| **405B** | 810GB | 405GB | 203GB | - -For example, the FP16 version of Llama 3.1 405B won't fit into a single machine with eight 80GB GPUs, so we'd need at least two -nodes. - -### Quantization - -The INT4 version of Llama 3.1 70B can fit into two 40GB GPUs. - -[//]: # (TODO: Example: INT4 / 70B / 40GB:2) - -The INT4 version of Llama 3.1 405B can fit into eight 40GB GPUs. - -[//]: # (TODO: Example: INT4 / 405B / 40GB:8) - -Useful links: - - * [Meta's official FP8 quantized version of Llama 3.1 405B](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct-FP8) (with minimal accuracy degradation) - * [Llama 3.1 Quantized Models](https://huggingface.co/collections/hugging-quants/llama-31-gptq-awq-and-bnb-quants-669fa7f50f6e713fd54bd198) with quantized checkpoints - -### Running a configuration - -To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command. - -
- -```shell -$ HF_TOKEN=... -$ dstack apply -f examples/llms/llama31/vllm/.dstack.yml - - # BACKEND REGION RESOURCES SPOT PRICE - 1 runpod CA-MTL-1 18xCPU, 100GB, A5000:24GB yes $0.12 - 2 runpod EU-SE-1 18xCPU, 100GB, A5000:24GB yes $0.12 - 3 gcp us-west4 27xCPU, 150GB, A5000:24GB:2 yes $0.23 - -Submit the run llama31? [y/n]: y - -Provisioning... ----> 100% -``` - -
- -If no gateway is created, the service endpoint will be available at `<dstack server URL>/proxy/services/<project name>/<run name>/`. - -
- -```shell -$ curl http://127.0.0.1:3000/proxy/services/main/llama31/v1/chat/completions \ - -X POST \ - -H 'Authorization: Bearer <dstack token>' \ - -H 'Content-Type: application/json' \ - -d '{ - "model": "llama3.1", - "messages": [ - { - "role": "system", - "content": "You are a helpful assistant." - }, - { - "role": "user", - "content": "What is Deep Learning?" - } - ], - "max_tokens": 128 - }' -``` - -
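The fit claims in the quantization section above (e.g., the INT4 version of 70B fitting into two 40GB GPUs) can be checked with the same bytes-per-parameter arithmetic; a rough sketch, where the `headroom` factor is our illustrative allowance for context and CUDA overhead:

```python
# Approximate bytes per parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def fits(params_b, precision, gpu_gb, gpu_count, headroom=0.9):
    """True if the weights alone fit into `gpu_count` GPUs of `gpu_gb` GB each,
    keeping ~10% free for model context and CUDA kernel reservations."""
    return params_b * BYTES_PER_PARAM[precision] <= gpu_gb * gpu_count * headroom

print(fits(70, "int4", 40, 2))   # True -- 35GB of weights across 2x40GB
print(fits(405, "fp16", 80, 8))  # False -- 810GB exceeds a single 8x80GB node
```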
- -When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint will be available at `https://llama31.<gateway domain>/`. - -[//]: # (TODO: How to prompting and tool calling) - -[//]: # (TODO: Synthetic data generation) - -## Fine-tuning - -### Running on multiple GPUs - -Below is the task configuration file for fine-tuning Llama 3.1 8B using TRL on the -[`OpenAssistant/oasst_top1_2023-08-25`](https://huggingface.co/datasets/OpenAssistant/oasst_top1_2023-08-25) dataset. - -
- -```yaml -type: task -name: trl-train - -python: 3.12 -# Ensure nvcc is installed (req. for Flash Attention) -nvcc: true -env: - - HF_TOKEN - - WANDB_API_KEY -commands: - - pip install "transformers>=4.43.2" - - pip install bitsandbytes - - pip install flash-attn --no-build-isolation - - pip install peft - - pip install wandb - - git clone https://github.com/huggingface/trl - - cd trl - - pip install . - - accelerate launch - --config_file=examples/accelerate_configs/multi_gpu.yaml - --num_processes $DSTACK_GPUS_PER_NODE - examples/scripts/sft.py - --model_name meta-llama/Meta-Llama-3.1-8B - --dataset_name OpenAssistant/oasst_top1_2023-08-25 - --dataset_text_field="text" - --per_device_train_batch_size 1 - --per_device_eval_batch_size 1 - --gradient_accumulation_steps 4 - --learning_rate 2e-4 - --report_to wandb - --bf16 - --max_seq_length 1024 - --lora_r 16 --lora_alpha 32 - --lora_target_modules q_proj k_proj v_proj o_proj - --load_in_4bit - --use_peft - --attn_implementation "flash_attention_2" - --logging_steps=10 - --output_dir models/llama31 - --hub_model_id peterschmidt85/FineLlama-3.1-8B - -resources: - gpu: - # 24GB or more VRAM - memory: 24GB.. - # One or more GPU - count: 1.. - # Shared memory (for multi-gpu) - shm_size: 24GB -``` - -
- -Change the `resources` property to specify more GPUs. - -### Memory requirements - -Below are the approximate memory requirements for fine-tuning Llama 3.1. - -| Model size | Full fine-tuning | LoRA | QLoRA | -|------------|------------------|-------|-------| -| **8B** | 60GB | 16GB | 6GB | -| **70B** | 500GB | 160GB | 48GB | -| **405B** | 3.25TB | 950GB | 250GB | - -The requirements can be significantly reduced with certain optimizations. - -### DeepSpeed - -For more memory-efficient use of multiple GPUs, consider using DeepSpeed and ZeRO Stage 3. - -To do this, use the `examples/accelerate_configs/deepspeed_zero3.yaml` configuration file instead of -`examples/accelerate_configs/multi_gpu.yaml`. - -### Running on multiple nodes - -In case the model doesn't fit into a single GPU, consider running a `dstack` task on multiple nodes. -Below is the corresponding task configuration file. - -
- -```yaml -type: task -name: trl-train-distrib - -# Size of the cluster -nodes: 2 - -python: "3.10" -# Ensure nvcc is installed (req. for Flash Attention) -nvcc: true - -env: - - HF_TOKEN - - WANDB_API_KEY -commands: - - pip install "transformers>=4.43.2" - - pip install bitsandbytes - - pip install flash-attn --no-build-isolation - - pip install peft - - pip install wandb - - git clone https://github.com/huggingface/trl - - cd trl - - pip install . - - accelerate launch - --config_file=examples/accelerate_configs/fsdp_qlora.yaml - --main_process_ip=$DSTACK_MASTER_NODE_IP - --main_process_port=8008 - --machine_rank=$DSTACK_NODE_RANK - --num_processes=$DSTACK_GPUS_NUM - --num_machines=$DSTACK_NODES_NUM - examples/scripts/sft.py - --model_name meta-llama/Meta-Llama-3.1-8B - --dataset_name OpenAssistant/oasst_top1_2023-08-25 - --dataset_text_field="text" - --per_device_train_batch_size 1 - --per_device_eval_batch_size 1 - --gradient_accumulation_steps 4 - --learning_rate 2e-4 - --report_to wandb - --bf16 - --max_seq_length 1024 - --lora_r 16 --lora_alpha 32 - --lora_target_modules q_proj k_proj v_proj o_proj - --load_in_4bit - --use_peft - --attn_implementation "flash_attention_2" - --logging_steps=10 - --output_dir models/llama31 - --hub_model_id peterschmidt85/FineLlama-3.1-8B - --torch_dtype bfloat16 - --use_bnb_nested_quant - -resources: - gpu: - # 24GB or more VRAM - memory: 24GB.. - # One or more GPU - count: 1.. - # Shared memory (for multi-gpu) - shm_size: 24GB -``` - -
- -[//]: # (TODO: Find a better example for a multi-node training) - -## Source code - -The source-code of this example can be found in -[`examples/llms/llama31`](https://github.com/dstackai/dstack/blob/master/examples/llms/llama31) and [`examples/single-node-training/trl`](https://github.com/dstackai/dstack/blob/master/examples/single-node-training/trl). - -## What's next? - -1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks), - [services](https://dstack.ai/docs/services), and [protips](https://dstack.ai/docs/protips). -2. Browse [Llama 3.1 on HuggingFace](https://huggingface.co/collections/meta-llama/llama-31-669fc079a0c406a149a5738f), - [HuggingFace's Llama recipes](https://github.com/huggingface/huggingface-llama-recipes), - [Meta's Llama recipes](https://github.com/meta-llama/llama-recipes) - and [Llama Agentic System](https://github.com/meta-llama/llama-agentic-system/). diff --git a/examples/llms/llama32/README.md b/examples/llms/llama32/README.md deleted file mode 100644 index c607139326..0000000000 --- a/examples/llms/llama32/README.md +++ /dev/null @@ -1,132 +0,0 @@ -# Llama 3.2 - -This example walks you through how to deploy Llama 3.2 vision model with `dstack` using `vLLM`. - -??? info "Prerequisites" - Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples. - -
- - ```shell - $ git clone https://github.com/dstackai/dstack - $ cd dstack - ``` - -
- -## Deployment - -Here's an example of a service that deploys Llama 3.2 11B using vLLM. - -
- -```yaml -type: service -name: llama32 - -image: vllm/vllm-openai:latest -env: - - HF_TOKEN - - MODEL_ID=meta-llama/Llama-3.2-11B-Vision-Instruct - - MAX_MODEL_LEN=4096 - - MAX_NUM_SEQS=8 -commands: - - vllm serve $MODEL_ID - --max-model-len $MAX_MODEL_LEN - --max-num-seqs $MAX_NUM_SEQS - --enforce-eager - --disable-log-requests - --limit-mm-per-prompt "image=1" - --tensor-parallel-size $DSTACK_GPUS_NUM -port: 8000 -# Register the model -model: meta-llama/Llama-3.2-11B-Vision-Instruct - -# Uncomment to cache downloaded models -#volumes: -# - /root/.cache/huggingface/hub:/root/.cache/huggingface/hub - -resources: - gpu: 40GB..48GB -``` -
- -[//]: # (TODO: Comment on MAX_MODEL_LEN and MAX_NUM_SEQS) - -### Memory requirements - -Below are the approximate memory requirements for loading the model. -This excludes memory for the model context and CUDA kernel reservations. - -| Model size | FP16 | -|------------|-------| -| **11B** | 40GB | -| **90B** | 180GB | - -[//]: # (TODO: Quantization mention) - -### Running a configuration - -To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command. - -
- -```shell -$ HF_TOKEN=... -$ dstack apply -f examples/llms/llama32/vllm/.dstack.yml - - # BACKEND REGION RESOURCES SPOT PRICE - 1 runpod CA-MTL-1 9xCPU, 50GB, 1xA40 (48GB) yes $0.24 - 2 runpod EU-SE-1 9xCPU, 50GB, 1xA40 (48GB) yes $0.24 - 3 runpod EU-SE-1 9xCPU, 50GB, 1xA6000 (48GB) yes $0.25 - - -Submit the run llama32? [y/n]: y - -Provisioning... ----> 100% -``` - -
- -Once the service is up, it will be available via the service endpoint -at `<dstack server URL>/proxy/services/<project name>/<run name>/`. - -
- -```shell -$ curl http://127.0.0.1:3000/proxy/services/main/llama32/v1/chat/completions \ - -H 'Content-Type: application/json' \ - -H 'Authorization: Bearer token' \ - --data '{ - "model": "meta-llama/Llama-3.2-11B-Vision-Instruct", - "messages": [ - { - "role": "user", - "content": [ - {"type" : "text", "text": "Describe the image."}, - {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/e/ea/Bento_at_Hanabishi%2C_Koyasan.jpg"}} - ] - }], - "max_tokens": 2048 - }' -``` - -
- -When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint -is available at `https://<run name>.<gateway domain>/`. - -[//]: # (TODO: https://github.com/dstackai/dstack/issues/1777) - -## Source code - -The source-code of this example can be found in -[`examples/llms/llama32`](https://github.com/dstackai/dstack/blob/master/examples/llms/llama32). - -## What's next? - -1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks), - [services](https://dstack.ai/docs/services), and [protips](https://dstack.ai/docs/protips). -2. Browse [Llama 3.2 on HuggingFace](https://huggingface.co/collections/meta-llama/llama-32-66f448ffc8c32f949b04c8cf) - and [Llama 3.2 on vLLM](https://docs.vllm.ai/en/latest/models/supported_models.html#multimodal-language-models). diff --git a/examples/llms/llama32/vllm/.dstack.yml b/examples/llms/llama32/vllm/.dstack.yml deleted file mode 100644 index 3712a64f80..0000000000 --- a/examples/llms/llama32/vllm/.dstack.yml +++ /dev/null @@ -1,27 +0,0 @@ -type: service -name: llama32 - -image: vllm/vllm-openai:latest -env: - - HF_TOKEN - - MODEL_ID=meta-llama/Llama-3.2-11B-Vision-Instruct - - MAX_MODEL_LEN=4096 - - MAX_NUM_SEQS=8 -commands: - - vllm serve $MODEL_ID - --max-model-len $MAX_MODEL_LEN - --max-num-seqs $MAX_NUM_SEQS - --enforce-eager - --disable-log-requests - --limit-mm-per-prompt "image=1" - --tensor-parallel-size $DSTACK_GPUS_NUM -port: 8000 -# Register the model -model: meta-llama/Llama-3.2-11B-Vision-Instruct - -# Uncomment to cache downloaded models -#volumes: -# - /root/.cache/huggingface/hub:/root/.cache/huggingface/hub - -resources: - gpu: 40GB..48GB diff --git a/examples/misc/docker-compose/.dstack.yml b/examples/misc/docker-compose/.dstack.yml deleted file mode 100644 index 0967f72b81..0000000000 --- a/examples/misc/docker-compose/.dstack.yml +++ /dev/null @@ -1,16 +0,0 @@ -type: dev-environment -name: vscode-docker - -docker: true -env: - - 
MODEL_ID=meta-llama/Llama-3.2-3B-Instruct - - HF_TOKEN -ide: vscode -files: - - compose.yaml - -# Uncomment to leverage spot instances -#spot_policy: auto - -resources: - gpu: 24GB diff --git a/examples/misc/docker-compose/README.md b/examples/misc/docker-compose/README.md deleted file mode 100644 index d74dba0304..0000000000 --- a/examples/misc/docker-compose/README.md +++ /dev/null @@ -1,180 +0,0 @@ -# Docker Compose - -All backends except `runpod`, `vastai`, and `kubernetes` allow using [Docker and Docker Compose](https://dstack.ai/docs/guides/protips#docker-and-docker-compose) inside `dstack` runs. - -This example shows how to deploy Hugging Face [Chat UI](https://huggingface.co/docs/chat-ui/index) -with [TGI](https://huggingface.co/docs/text-generation-inference/en/index) -serving [Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) -using [Docker Compose](https://docs.docker.com/compose/). - -??? info "Prerequisites" - Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples. - -
- - ```shell - $ git clone https://github.com/dstackai/dstack - $ cd dstack - ``` - -
- -## Deployment - -### Running as a task - -=== "`task.dstack.yml`" - -
- - ```yaml - type: task - name: chat-ui-task - - docker: true - env: - - MODEL_ID=meta-llama/Llama-3.2-3B-Instruct - - HF_TOKEN - files: - - compose.yaml - commands: - - docker compose up - ports: - - 9000 - - resources: - gpu: "nvidia:24GB" - ``` - -
- -=== "`compose.yaml`" - -
- - ```yaml - services: - app: - image: ghcr.io/huggingface/chat-ui:sha-bf0bc92 - command: - - bash - - -c - - | - echo MONGODB_URL=mongodb://db:27017 > .env.local - echo MODELS='`[{ - "name": "${MODEL_ID?}", - "endpoints": [{"type": "tgi", "url": "http://tgi:8000"}] - }]`' >> .env.local - exec ./entrypoint.sh - ports: - - 127.0.0.1:9000:3000 - depends_on: - - tgi - - db - - tgi: - image: ghcr.io/huggingface/text-generation-inference:sha-704a58c - volumes: - - tgi_data:/data - environment: - HF_TOKEN: ${HF_TOKEN?} - MODEL_ID: ${MODEL_ID?} - PORT: 8000 - deploy: - resources: - reservations: - devices: - - driver: nvidia - count: all - capabilities: [gpu] - - db: - image: mongo:latest - volumes: - - db_data:/data/db - - volumes: - tgi_data: - db_data: - ``` - -
- -### Deploying as a service - -If you'd like to deploy Chat UI as an auto-scalable and secure endpoint, -use the service configuration. You can find it at [`examples/misc/docker-compose/service.dstack.yml`](https://github.com/dstackai/dstack/blob/master/examples/misc/docker-compose/service.dstack.yml). - -### Running a configuration - -To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command. - -
- -```shell -$ HF_TOKEN=... -$ dstack apply -f examples/misc/docker-compose/task.dstack.yml - - # BACKEND REGION RESOURCES SPOT PRICE - 1 runpod CA-MTL-1 18xCPU, 100GB, A5000:24GB yes $0.12 - 2 runpod EU-SE-1 18xCPU, 100GB, A5000:24GB yes $0.12 - 3 gcp us-west4 27xCPU, 150GB, A5000:24GB:2 yes $0.23 - -Submit the run chat-ui-task? [y/n]: y - -Provisioning... ----> 100% -``` - -
- -## Persisting data - -To persist data between runs, create a [volume](https://dstack.ai/docs/concepts/volumes/) and attach it to the run -configuration. - -
- -```yaml -type: task -name: chat-ui-task - -privileged: true -image: dstackai/dind -env: - - MODEL_ID=meta-llama/Llama-3.2-3B-Instruct - - HF_TOKEN -files: - - compose.yaml -commands: - - start-dockerd - - docker compose up -ports: - - 9000 - -# Uncomment to leverage spot instances -#spot_policy: auto - -resources: - # Required resources - gpu: "nvidia:24GB" - -volumes: - - name: my-dind-volume - path: /var/lib/docker -``` - -
- -With this change, all Docker data—pulled images, containers, and crucially, volumes for database and model storage—will -be persisted. - -## Source code - -The source-code of this example can be found in -[`examples/misc/docker-compose`](https://github.com/dstackai/dstack/blob/master/examples/misc/docker-compose). - -## What's next? - -1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks), - [services](https://dstack.ai/docs/services), and [protips](https://dstack.ai/docs/protips). diff --git a/examples/misc/docker-compose/compose.yaml b/examples/misc/docker-compose/compose.yaml deleted file mode 100644 index c5c843667c..0000000000 --- a/examples/misc/docker-compose/compose.yaml +++ /dev/null @@ -1,42 +0,0 @@ -services: - app: - image: ghcr.io/huggingface/chat-ui-db:0.9.5 - environment: - HF_TOKEN: ${HF_TOKEN?} - MONGODB_URL: mongodb://db:27017 - MODELS: | - [{ - "name": "${MODEL_ID?}", - "endpoints": [{"type": "openai", "baseURL": "http://tgi:8000/v1"}] - }] - ports: - - 127.0.0.1:9000:3000 - depends_on: - - tgi - - db - - tgi: - image: ghcr.io/huggingface/text-generation-inference:3.3.4 - volumes: - - tgi_data:/data - environment: - HF_TOKEN: ${HF_TOKEN?} - MODEL_ID: ${MODEL_ID?} - PORT: 8000 - shm_size: 1g - deploy: - resources: - reservations: - devices: - - driver: nvidia - count: all - capabilities: [gpu] - - db: - image: mongo:latest - volumes: - - db_data:/data/db - -volumes: - tgi_data: - db_data: diff --git a/examples/misc/docker-compose/service.dstack.yml b/examples/misc/docker-compose/service.dstack.yml deleted file mode 100644 index b33b900fbd..0000000000 --- a/examples/misc/docker-compose/service.dstack.yml +++ /dev/null @@ -1,26 +0,0 @@ -type: service -name: chat-ui-service - -docker: true -env: - - MODEL_ID=meta-llama/Llama-3.2-3B-Instruct - - HF_TOKEN -files: - - compose.yaml -commands: - - docker compose up -port: 9000 -auth: false - -# Uncomment to leverage spot instances -#spot_policy: 
auto - -resources: - # Required resources - gpu: 1 - -# Cache the Docker data -volumes: - - instance_path: /root/.cache/docker-data - path: /var/lib/docker - optional: true diff --git a/examples/misc/docker-compose/task.dstack.yml b/examples/misc/docker-compose/task.dstack.yml deleted file mode 100644 index e7af43f383..0000000000 --- a/examples/misc/docker-compose/task.dstack.yml +++ /dev/null @@ -1,25 +0,0 @@ -type: task -name: chat-ui-task - -docker: true -env: - - MODEL_ID=meta-llama/Llama-3.2-3B-Instruct - - HF_TOKEN -files: - - compose.yaml -commands: - - docker compose up -ports: - - 9000 - -# Use either spot or on-demand instances -spot_policy: auto - -resources: - gpu: 1 - -# Cache the Docker data -volumes: - - instance_path: /root/.cache/docker-data - path: /var/lib/docker - optional: true diff --git a/examples/misc/docker-compose/volume.dstack.yml b/examples/misc/docker-compose/volume.dstack.yml deleted file mode 100644 index d5ddc1cc67..0000000000 --- a/examples/misc/docker-compose/volume.dstack.yml +++ /dev/null @@ -1,8 +0,0 @@ -type: volume -name: my-dind-volume - -backend: aws -region: eu-west-1 - -# Required size -size: 100GB diff --git a/examples/models/wan22/.dstack.yml b/examples/models/wan22/.dstack.yml deleted file mode 100644 index 528e11e0d8..0000000000 --- a/examples/models/wan22/.dstack.yml +++ /dev/null @@ -1,63 +0,0 @@ -type: task -name: wan22 - -repos: - # Clones it to `/workflow` (the default working directory) - - https://github.com/Wan-Video/Wan2.2.git - -python: 3.12 -nvcc: true - -env: - - PROMPT="Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." - # Required for storing cache on a volume - - UV_LINK_MODE=copy -commands: - # Install flash-attn - - | - uv pip install torch - uv pip install flash-attn --no-build-isolation - # Install dependencies - - | - uv pip install . 
decord librosa - uv pip install "huggingface_hub[cli]" - hf download Wan-AI/Wan2.2-T2V-A14B --local-dir /root/.cache/Wan2.2-T2V-A14B - # Generate video - - | - if [ ${DSTACK_GPUS_NUM} -gt 1 ]; then - torchrun \ - --nproc_per_node=${DSTACK_GPUS_NUM} \ - generate.py \ - --task t2v-A14B \ - --size 1280*720 \ - --ckpt_dir /root/.cache/Wan2.2-T2V-A14B \ - --dit_fsdp --t5_fsdp --ulysses_size ${DSTACK_GPUS_NUM} \ - --save_file ${DSTACK_RUN_NAME}.mp4 \ - --prompt "${PROMPT}" - else - python generate.py \ - --task t2v-A14B \ - --size 1280*720 \ - --ckpt_dir /root/.cache/Wan2.2-T2V-A14B \ - --offload_model True \ - --convert_model_dtype \ - --save_file ${DSTACK_RUN_NAME}.mp4 \ - --prompt "${PROMPT}" - fi - # Upload video - - curl https://bashupload.com/ -T ./${DSTACK_RUN_NAME}.mp4 - -resources: - gpu: - name: [H100, H200] - count: 1..8 - disk: 300GB - -# Change to on-demand for disabling spot -spot_policy: auto - -volumes: - # Cache pip packages and HF models - - instance_path: /root/dstack-cache - path: /root/.cache/ - optional: true diff --git a/examples/models/wan22/README.md b/examples/models/wan22/README.md deleted file mode 100644 index c99856fbf4..0000000000 --- a/examples/models/wan22/README.md +++ /dev/null @@ -1,145 +0,0 @@ ---- -title: Wan2.2 -description: Text-to-video generation using the Wan2.2 T2V-A14B foundational video model ---- - -# Wan2.2 - -[Wan2.2](https://github.com/Wan-Video/Wan2.2) is an open-source SOTA foundational video model. This example shows how to run the T2V-A14B model variant via `dstack` for text-to-video generation. - -??? info "Prerequisites" - Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples. - -
- - ```shell - $ git clone https://github.com/dstackai/dstack - $ cd dstack - ``` - -
- -## Define a configuration - -Below is a task configuration that generates a video using Wan2.2, uploads it, and provides the download link. - -
- -```yaml -type: task -name: wan22 - -repos: - # Clones it to `/workflow` (the default working directory) - - https://github.com/Wan-Video/Wan2.2.git - -python: 3.12 -nvcc: true - -env: - - PROMPT="Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." - # Required for storing cache on a volume - - UV_LINK_MODE=copy -commands: - # Install flash-attn - - | - uv pip install torch - uv pip install flash-attn --no-build-isolation - # Install dependencies - - | - uv pip install . decord librosa - uv pip install "huggingface_hub[cli]" - hf download Wan-AI/Wan2.2-T2V-A14B --local-dir /root/.cache/Wan2.2-T2V-A14B - # Generate video - - | - if [ ${DSTACK_GPUS_NUM} -gt 1 ]; then - torchrun \ - --nproc_per_node=${DSTACK_GPUS_NUM} \ - generate.py \ - --task t2v-A14B \ - --size 1280*720 \ - --ckpt_dir /root/.cache/Wan2.2-T2V-A14B \ - --dit_fsdp --t5_fsdp --ulysses_size ${DSTACK_GPUS_NUM} \ - --save_file ${DSTACK_RUN_NAME}.mp4 \ - --prompt "${PROMPT}" - else - python generate.py \ - --task t2v-A14B \ - --size 1280*720 \ - --ckpt_dir /root/.cache/Wan2.2-T2V-A14B \ - --offload_model True \ - --convert_model_dtype \ - --save_file ${DSTACK_RUN_NAME}.mp4 \ - --prompt "${PROMPT}" - fi - # Upload video - - curl https://bashupload.com/ -T ./${DSTACK_RUN_NAME}.mp4 - -resources: - gpu: - name: [H100, H200] - count: 1..8 - disk: 300GB - -# Change to on-demand for disabling spot -spot_policy: auto - -volumes: - # Cache pip packages and HF models - - instance_path: /root/dstack-cache - path: /root/.cache/ - optional: true -``` - -
- -## Run the configuration - -Once the configuration is ready, run `dstack apply -f `, and `dstack` will automatically provision the -cloud resources and run the configuration. - -
- -```shell -$ dstack apply -f examples/models/wan22/.dstack.yml - - # BACKEND RESOURCES INSTANCE TYPE PRICE - 1 verda (FIN-01) cpu=30 mem=120GB disk=200GB H100:80GB:1 (spot) 1H100.80S.30V $0.99 - 2 verda (FIN-01) cpu=30 mem=120GB disk=200GB H100:80GB:1 (spot) 1H100.80S.30V $0.99 - 3 verda (FIN-02) cpu=44 mem=182GB disk=200GB H200:141GB:1 (spot) 1H200.141S.44V $0.99 - ----> 100% - -Uploaded 1 file, 8 375 523 bytes - -wget https://bashupload.com/fIo7l/wan22.mp4 -``` - -
- -If you want you can override the default GPU, spot policy, and even the prompt via the CLI. - -
- -```shell -$ PROMPT=... -$ dstack apply -f examples/models/wan22/.dstack.yml --spot --gpu H100,H200:8 - - # BACKEND RESOURCES INSTANCE TYPE PRICE - 1 aws (us-east-2) cpu=192 mem=2048GB disk=300GB H100:80GB:8 (spot) p5.48xlarge $6.963 - 2 verda (FIN-02) cpu=176 mem=1480GB disk=300GB H100:80GB:8 (spot) 8H100.80S.176V $7.93 - 3 verda (ICE-01) cpu=176 mem=1450GB disk=300GB H200:141GB:8 (spot) 8H200.141S.176V $7.96 - ----> 100% - -Uploaded 1 file, 8 375 523 bytes - -wget https://bashupload.com/fIo7l/wan22.mp4 -``` - -
- -## Source code - -The source-code of this example can be found in -[`examples/models/wan22`](https://github.com/dstackai/dstack/blob/master/examples/models/wan22). diff --git a/examples/single-node-training/axolotl/README.md b/examples/single-node-training/axolotl/README.md index f8ae04f7ce..7781139e0b 100644 --- a/examples/single-node-training/axolotl/README.md +++ b/examples/single-node-training/axolotl/README.md @@ -92,11 +92,6 @@ Provisioning...
-## Source code - -The source-code of this example can be found in -[`examples/single-node-training/axolotl`](https://github.com/dstackai/dstack/blob/master/examples/single-node-training/axolotl) and [`examples/distributed-training/axolotl`](https://github.com/dstackai/dstack/blob/master/examples/distributed-training/axolotl). - ## What's next? 1. Browse the [Axolotl distributed training](https://dstack.ai/docs/examples/distributed-training/axolotl) example diff --git a/examples/single-node-training/trl/README.md b/examples/single-node-training/trl/README.md index f5cf7f5a4a..82dca87a98 100644 --- a/examples/single-node-training/trl/README.md +++ b/examples/single-node-training/trl/README.md @@ -108,11 +108,6 @@ Provisioning...
-## Source code - -The source-code of this example can be found in -[`examples/llms/llama31`](https://github.com/dstackai/dstack/blob/master/examples/llms/llama31) and [`examples/single-node-training/trl`](https://github.com/dstackai/dstack/blob/master/examples/single-node-training/trl). - ## What's next? 1. Browse the [TRL distributed training](https://dstack.ai/docs/examples/distributed-training/trl) example diff --git a/mkdocs.yml b/mkdocs.yml index 34437b1799..8dbe0ad85e 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -98,10 +98,10 @@ plugins: "docs/tasks.md": "docs/concepts/tasks.md" "docs/services.md": "docs/concepts/services.md" "docs/fleets.md": "docs/concepts/fleets.md" - "docs/examples/llms/llama31.md": "examples/llms/llama/index.md" - "docs/examples/llms/llama32.md": "examples/llms/llama/index.md" - "examples/llms/llama31/index.md": "examples/llms/llama/index.md" - "examples/llms/llama32/index.md": "examples/llms/llama/index.md" + "docs/examples/llms/llama31.md": "examples/inference/vllm/index.md" + "docs/examples/llms/llama32.md": "examples/inference/vllm/index.md" + "examples/llms/llama31/index.md": "examples/inference/vllm/index.md" + "examples/llms/llama32/index.md": "examples/inference/vllm/index.md" "docs/examples/accelerators/amd/index.md": "examples/accelerators/amd/index.md" "docs/examples/deployment/nim/index.md": "examples/inference/nim/index.md" "docs/examples/deployment/vllm/index.md": "examples/inference/vllm/index.md" @@ -308,8 +308,6 @@ nav: - AMD: examples/accelerators/amd/index.md - TPU: examples/accelerators/tpu/index.md - Tenstorrent: examples/accelerators/tenstorrent/index.md - - Models: - - Wan2.2: examples/models/wan22/index.md - Blog: - blog/index.md - Case studies: blog/case-studies.md
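
A note on the final `mkdocs.yml` hunk above: the quoted `"old: new"` pairs being repointed are, by their shape, entries of the `redirect_maps` table used by the `mkdocs-redirects` plugin (an assumption from the syntax; the plugin declaration itself is outside this hunk). Retiring the Llama 3.1/3.2 pages therefore only requires changing each entry's value to the surviving vLLM page, with both paths relative to `docs/`. A minimal sketch of that convention:

```yaml
# Sketch of the redirect configuration the hunk edits, assuming the
# mkdocs-redirects plugin is enabled elsewhere in mkdocs.yml.
# Each key is a removed page; each value is where readers land instead.
plugins:
  - redirects:
      redirect_maps:
        "examples/llms/llama31/index.md": "examples/inference/vllm/index.md"
        "examples/llms/llama32/index.md": "examples/inference/vllm/index.md"
```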