diff --git a/docs/blog/posts/docker-inside-containers.md b/docs/blog/posts/docker-inside-containers.md index 699e75fe70..1a88f20b13 100644 --- a/docs/blog/posts/docker-inside-containers.md +++ b/docs/blog/posts/docker-inside-containers.md @@ -12,18 +12,26 @@ To run containers with `dstack`, you can use your own Docker image (or the defau directly with Docker. However, some existing code may require direct use of Docker or Docker Compose. That's why, in our latest release, we've added this option. -
- -```yaml +
+ +```yaml type: task -name: chat-ui-task +name: compose-task image: dstackai/dind privileged: true -working_dir: examples/misc/docker-compose commands: - start-dockerd + - | + cat > compose.yaml <<'EOF' + services: + web: + image: python:3.11-slim + command: python -m http.server 9000 + ports: + - "9000:9000" + EOF - docker compose up ports: [9000] @@ -37,7 +45,7 @@ resources: ## How it works -To use Docker or Docker Compose with your `dstack` configuration, set `image` to `dstackai/dind`, `privileged` to +To use Docker or Docker Compose with your `dstack` configuration, set `image` to `dstackai/dind`, `privileged` to `true`, and add the `start-dockerd` command. After this command, you can use Docker or Docker Compose directly. @@ -45,7 +53,7 @@ For dev environments, add `start-dockerd` as the first command in the `init` property. ??? info "Dev environment" -
+
```yaml type: dev-environment @@ -59,7 +67,7 @@ in the `init` property. - start-dockerd resources: - gpu: 16GB..24GB + gpu: 16GB..24GB ```
@@ -71,15 +79,15 @@ With this setup, you don’t have to worry about configuration—both Docker and
 support GPU usage.
 
 !!! info "Backends"
-    Note that the `privileged` option is only supported by VM-based backends. This does not include `runpod`, `vastai`, 
+    Note that the `privileged` option is only supported by VM-based backends. This does not include `runpod`, `vastai`,
     and `kubernetes`. All other backends support it.
 
 ## When using it
 
 ### docker compose
 
-One of the obvious use cases for this feature is when you need to use Docker Compose. 
-For example, the Hugging Face Chat UI requires a MongoDB database, so using Docker Compose to run it is 
+One of the obvious use cases for this feature is when you need to use Docker Compose.
+For example, the Hugging Face Chat UI requires a MongoDB database, so using Docker Compose to run it is
 the easiest way:
 
 
@@ -92,14 +100,10 @@ Another use case for this feature is when you need to build a custom Docker imag
 
 Last but not least, you can, of course, use the `docker run` command, for example, if your existing code requires it.
 
-## Examples
-
-A few examples of using this feature can be found in [`examples/misc/docker-compose`](https://github.com/dstackai/dstack/blob/master/examples/misc/docker-compose).
-
 ## Feedback
 
 If you find something not working as intended, please be sure to report it to
-our [bug tracker](https://github.com/dstackai/dstack/issues){:target="_ blank"}. 
-Your feedback and feature requests are also very welcome on both 
+our [bug tracker](https://github.com/dstackai/dstack/issues){:target="_blank"}.
+Your feedback and feature requests are also very welcome on both
 [Discord](https://discord.gg/u8SmfwPpMd) and the [issue tracker](https://github.com/dstackai/dstack/issues).
 
diff --git a/docs/blog/posts/nvidia-and-amd-on-vultr.md b/docs/blog/posts/nvidia-and-amd-on-vultr.md index 512d316f8b..e3961b37a9 100644 --- a/docs/blog/posts/nvidia-and-amd-on-vultr.md +++ b/docs/blog/posts/nvidia-and-amd-on-vultr.md @@ -67,8 +67,6 @@ projects: For more details, refer to [Installation](../../docs/installation.md). -> Interested in fine-tuning or deploying DeepSeek on Vultr? Check out the corresponding [example](../../examples/llms/deepseek/index.md). - !!! info "What's next?" 1. Refer to [Quickstart](../../docs/quickstart.md) 2. Sign up with [Vultr](https://www.vultr.com/) diff --git a/docs/blog/posts/volumes-on-runpod.md b/docs/blog/posts/volumes-on-runpod.md index c17faf7b13..08f2e19126 100644 --- a/docs/blog/posts/volumes-on-runpod.md +++ b/docs/blog/posts/volumes-on-runpod.md @@ -20,7 +20,7 @@ deploying a model on Runpod. Suppose you want to deploy Llama 3.1 on Runpod as a [service](../../docs/concepts/services.md): -
+
```yaml type: service @@ -63,7 +63,7 @@ Great news: Runpod supports network volumes, which we can use for caching models With `dstack`, you can create a Runpod volume using the following configuration: -
+
```yaml type: volume @@ -83,7 +83,7 @@ Go ahead and create it via `dstack apply`:
```shell -$ dstack apply -f examples/mist/volumes/runpod.dstack.yml +$ dstack apply -f runpod-volume.dstack.yml ```
@@ -91,7 +91,7 @@ $ dstack apply -f examples/mist/volumes/runpod.dstack.yml Once the volume is created, attach it to your service by updating the configuration file and mapping the volume name to the `/data` path. -
+
```yaml type: service diff --git a/docs/docs/concepts/services.md b/docs/docs/concepts/services.md index 39e8f0b868..fd0d2a2dc2 100644 --- a/docs/docs/concepts/services.md +++ b/docs/docs/concepts/services.md @@ -15,57 +15,116 @@ Services allow you to deploy models or web apps as secure and scalable endpoints First, define a service configuration as a YAML file in your project folder. The filename must end with `.dstack.yml` (e.g. `.dstack.yml` or `dev.dstack.yml` are both acceptable). -
+=== "NVIDIA" -```yaml -type: service -name: llama31 +
-# If `image` is not specified, dstack uses its default image -python: 3.12 -env: - - HF_TOKEN - - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct - - MAX_MODEL_LEN=4096 -commands: - - uv pip install vllm - - vllm serve $MODEL_ID - --max-model-len $MAX_MODEL_LEN - --tensor-parallel-size $DSTACK_GPUS_NUM -port: 8000 -# (Optional) Register the model -model: meta-llama/Meta-Llama-3.1-8B-Instruct + ```yaml + type: service + name: qwen397 -# Uncomment to leverage spot instances -#spot_policy: auto + image: lmsysorg/sglang:v0.5.10.post1 -resources: - gpu: 24GB -``` + commands: + - | + sglang serve \ + --model-path Qwen/Qwen3.5-397B-A17B-FP8 \ + --port 30000 \ + --tp $DSTACK_GPUS_NUM \ + --reasoning-parser qwen3 \ + --tool-call-parser qwen3_coder \ + --enable-flashinfer-allreduce-fusion \ + --mem-fraction-static 0.8 + + port: 30000 + model: Qwen/Qwen3.5-397B-A17B-FP8 + + volumes: + # Optional instance volume for model and runtime caches + - instance_path: /root/.cache + path: /root/.cache + optional: true -
+ resources: + cpu: x86:96.. + memory: 512GB.. + shm_size: 16GB + disk: 500GB.. + gpu: H100:80GB:8 + ``` + +
+ +=== "AMD" + +
+ + ```yaml + type: service + name: qwen397 + + image: lmsysorg/sglang:v0.5.10.post1-rocm720-mi30x + + env: + - HIP_FORCE_DEV_KERNARG=1 + - SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 + - SGLANG_DISABLE_CUDNN_CHECK=1 + - SGLANG_INT4_WEIGHT=0 + - SGLANG_MOE_PADDING=1 + - SGLANG_ROCM_DISABLE_LINEARQUANT=0 + - SGLANG_ROCM_FUSED_DECODE_MLA=1 + - SGLANG_SET_CPU_AFFINITY=1 + - SGLANG_USE_AITER=1 + - SGLANG_USE_ROCM700A=1 + + commands: + - | + sglang serve \ + --model-path Qwen/Qwen3.5-397B-A17B-FP8 \ + --tp $DSTACK_GPUS_NUM \ + --reasoning-parser qwen3 \ + --tool-call-parser qwen3_coder \ + --mem-fraction-static 0.8 \ + --context-length 262144 \ + --attention-backend triton \ + --disable-cuda-graph \ + --fp8-gemm-backend aiter \ + --port 30000 + + port: 30000 + model: Qwen/Qwen3.5-397B-A17B-FP8 + + volumes: + # Optional instance volume for model and runtime caches + - instance_path: /root/.cache + path: /root/.cache + optional: true + + resources: + cpu: x86:52.. + memory: 700GB.. + shm_size: 16GB + disk: 600GB.. + gpu: MI300X:192GB:4 + ``` + +
To run a service, pass the configuration to [`dstack apply`](../reference/cli/dstack/apply.md):
```shell
-$ HF_TOKEN=... $ dstack apply -f .dstack.yml
-
- #  BACKEND  REGION    RESOURCES                    SPOT  PRICE
- 1  runpod   CA-MTL-1  18xCPU, 100GB, A5000:24GB:2  yes   $0.22
- 2  runpod   EU-SE-1   18xCPU, 100GB, A5000:24GB:2  yes   $0.22
- 3  gcp      us-west4  27xCPU, 150GB, A5000:24GB:3  yes   $0.33
-
-Submit the run llama31? [y/n]: y
+$ dstack apply -f .dstack.yml
+
+Submit the run qwen397? [y/n]: y
 
 Provisioning...
 ---> 100%
 
-Service is published at: 
-  http://localhost:3000/proxy/services/main/llama31/
-Model meta-llama/Meta-Llama-3.1-8B-Instruct is published at:
+Service is published at:
+  http://localhost:3000/proxy/services/main/qwen397/
+Model Qwen/Qwen3.5-397B-A17B-FP8 is published at:
   http://localhost:3000/proxy/models/main/
```

@@ -79,11 +138,11 @@ If you do not have a [gateway](gateways.md) created, the service endpoint will b
```shell -$ curl http://localhost:3000/proxy/services/main/llama31/v1/chat/completions \ +$ curl http://localhost:3000/proxy/services/main/qwen397/v1/chat/completions \ -H 'Content-Type: application/json' \ -H 'Authorization: Bearer <dstack token>' \ -d '{ - "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "model": "Qwen/Qwen3.5-397B-A17B-FP8", "messages": [ { "role": "user", @@ -95,6 +154,10 @@ $ curl http://localhost:3000/proxy/services/main/llama31/v1/chat/completions \
+The request and response format depends on the serving framework used by the +service. Even for OpenAI-compatible endpoints, the format may vary slightly +across frameworks. + If [authorization](#authorization) is not disabled, the service endpoint requires the `Authorization` header with `Bearer `. ## Configuration options @@ -144,38 +207,115 @@ $ curl https://llama31.example.com/v1/chat/completions \ By default, `dstack` runs a single replica of the service. You can configure the number of replicas as well as the auto-scaling rules. -
+=== "NVIDIA" -```yaml -type: service -name: llama31-service +
-python: 3.12 + ```yaml + type: service + name: qwen397-service -env: - - HF_TOKEN -commands: - - uv pip install vllm - - vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --max-model-len 4096 -port: 8000 + image: lmsysorg/sglang:v0.5.10.post1 -resources: - gpu: 24GB + commands: + - | + sglang serve \ + --model-path Qwen/Qwen3.5-397B-A17B-FP8 \ + --port 30000 \ + --tp $DSTACK_GPUS_NUM \ + --reasoning-parser qwen3 \ + --tool-call-parser qwen3_coder \ + --enable-flashinfer-allreduce-fusion \ + --mem-fraction-static 0.8 + + port: 30000 + model: Qwen/Qwen3.5-397B-A17B-FP8 + + volumes: + # Optional instance volume for model and runtime caches + - instance_path: /root/.cache + path: /root/.cache + optional: true -replicas: 1..4 -scaling: - # Requests per seconds - metric: rps - # Target metric value - target: 10 -``` + resources: + cpu: x86:96.. + memory: 512GB.. + shm_size: 16GB + disk: 500GB.. + gpu: H100:80GB:8 -
+ replicas: 1..2 + scaling: + metric: rps + target: 1 + ``` + +
+ +=== "AMD" + +
+ + ```yaml + type: service + name: qwen397-service + + image: lmsysorg/sglang:v0.5.10.post1-rocm720-mi30x + + env: + - HIP_FORCE_DEV_KERNARG=1 + - SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 + - SGLANG_DISABLE_CUDNN_CHECK=1 + - SGLANG_INT4_WEIGHT=0 + - SGLANG_MOE_PADDING=1 + - SGLANG_ROCM_DISABLE_LINEARQUANT=0 + - SGLANG_ROCM_FUSED_DECODE_MLA=1 + - SGLANG_SET_CPU_AFFINITY=1 + - SGLANG_USE_AITER=1 + - SGLANG_USE_ROCM700A=1 + + commands: + - | + sglang serve \ + --model-path Qwen/Qwen3.5-397B-A17B-FP8 \ + --tp $DSTACK_GPUS_NUM \ + --reasoning-parser qwen3 \ + --tool-call-parser qwen3_coder \ + --mem-fraction-static 0.8 \ + --context-length 262144 \ + --attention-backend triton \ + --disable-cuda-graph \ + --fp8-gemm-backend aiter \ + --port 30000 + + port: 30000 + model: Qwen/Qwen3.5-397B-A17B-FP8 + + volumes: + # Optional instance volume for model and runtime caches + - instance_path: /root/.cache + path: /root/.cache + optional: true + + resources: + cpu: x86:52.. + memory: 700GB.. + shm_size: 16GB + disk: 600GB.. + gpu: MI300X:192GB:4 + + replicas: 1..2 + scaling: + metric: rps + target: 1 + ``` + +
The [`replicas`](../reference/dstack.yml/service.md#replicas) property can be a number or a range. -The [`metric`](../reference/dstack.yml/service.md#metric) property of [`scaling`](../reference/dstack.yml/service.md#scaling) only supports the `rps` metric (requests per second). In this -case `dstack` adjusts the number of replicas (scales up or down) automatically based on the load. +The [`metric`](../reference/dstack.yml/service.md#metric) property of [`scaling`](../reference/dstack.yml/service.md#scaling) only supports the `rps` metric (requests per second). In this +case `dstack` adjusts the number of replicas (scales up or down) automatically based on the load. Setting the minimum number of replicas to `0` allows the service to scale down to zero when there are no requests. @@ -186,13 +326,13 @@ Setting the minimum number of replicas to `0` allows the service to scale down t ??? info "Replica groups" A service can include multiple replica groups. Each group can define its own `commands`, `resources` requirements, and `scaling` rules. -
+
```yaml type: service name: llama-8b-service - image: lmsysorg/sglang:latest + image: lmsysorg/sglang:v0.5.10.post1 env: - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-8B @@ -239,75 +379,74 @@ Since 0.20.17, `dstack` supports serving a model using PD disaggregation. To use Below is an example for running `zai-org/GLM-4.5-Air-FP8`: -
+=== "NVIDIA" -```yaml -type: service -name: prefill-decode -image: lmsysorg/sglang:latest +
-env: - - HF_TOKEN - - MODEL_ID=zai-org/GLM-4.5-Air-FP8 + ```yaml + type: service + name: prefill-decode + image: lmsysorg/sglang:v0.5.10.post1 -replicas: - - count: 1 - # For now replica group with router must have count: 1 - commands: - - pip install sglang_router - - | - python -m sglang_router.launch_router \ - --host 0.0.0.0 \ - --port 8000 \ - --pd-disaggregation \ - --prefill-policy cache_aware - router: - type: sglang - resources: - cpu: 4 + env: + - HF_TOKEN + - MODEL_ID=zai-org/GLM-4.5-Air-FP8 - - count: 1..4 - scaling: - metric: rps - target: 3 - commands: - - | - python -m sglang.launch_server \ - --model-path $MODEL_ID \ - --disaggregation-mode prefill \ - --disaggregation-transfer-backend nixl \ - --host 0.0.0.0 \ - --port 8000 \ - --disaggregation-bootstrap-port 8998 - resources: - gpu: H200 + replicas: + - count: 1 + # For now replica group with router must have count: 1 + commands: + - pip install sglang_router + - | + python -m sglang_router.launch_router \ + --port 8000 \ + --pd-disaggregation \ + --prefill-policy cache_aware + router: + type: sglang + resources: + cpu: 4 - - count: 1..8 - scaling: - metric: rps - target: 2 - commands: - - | - python -m sglang.launch_server \ - --model-path $MODEL_ID \ - --disaggregation-mode decode \ - --disaggregation-transfer-backend nixl \ - --host 0.0.0.0 \ - --port 8000 - resources: - gpu: H200 + - count: 1..4 + scaling: + metric: rps + target: 3 + commands: + - | + python -m sglang.launch_server \ + --model-path $MODEL_ID \ + --disaggregation-mode prefill \ + --disaggregation-transfer-backend nixl \ + --port 8000 \ + --disaggregation-bootstrap-port 8998 + resources: + gpu: H200 -port: 8000 -model: zai-org/GLM-4.5-Air-FP8 + - count: 1..8 + scaling: + metric: rps + target: 2 + commands: + - | + python -m sglang.launch_server \ + --model-path $MODEL_ID \ + --disaggregation-mode decode \ + --disaggregation-transfer-backend nixl \ + --port 8000 + resources: + gpu: H200 -# Custom probe is required for PD 
disaggregation. -probes: - - type: http - url: /health - interval: 15s -``` + port: 8000 + model: zai-org/GLM-4.5-Air-FP8 -
+ # Custom probe is required for PD disaggregation. + probes: + - type: http + url: /health + interval: 15s + ``` + +
!!! info "Cluster" PD disaggregation requires the service to run in a fleet with `placement` set to `cluster`, because the replicas require an interconnect between instances. @@ -319,7 +458,7 @@ probes: By default, the service enables authorization, meaning the service endpoint requires a `dstack` user token. This can be disabled by setting `auth` to `false`. -
+
```yaml type: service @@ -410,7 +549,7 @@ Probes are executed for each service replica while the replica is `running`. A p
??? info "Model" - If you set the [`model`](#model) property but don't explicitly configure `probes`, + If you set the [`model`](#model) property but don't explicitly configure `probes`, `dstack` automatically configures a default probe that tests the model using the `/v1/chat/completions` API. To disable probes entirely when `model` is set, explicitly set `probes` to an empty list. @@ -423,7 +562,7 @@ If your `dstack` project doesn't have a [gateway](gateways.md), services are hos When running web apps, you may need to set some app-specific settings so that browser-side scripts and CSS work correctly with the path prefix. -
+
```yaml type: service @@ -462,7 +601,7 @@ on a dedicated domain name by setting up a [gateway](gateways.md). If you have a [gateway](gateways.md), you can configure rate limits for your service using the [`rate_limits`](../reference/dstack.yml/service.md#rate_limits) property. -
+
```yaml type: service @@ -488,7 +627,7 @@ Limits apply to the whole service (all replicas) and per client (by IP). Clients Instead of partitioning requests by client IP address, you can choose to partition by the value of a header. -
+
```yaml
type: service
@@ -508,7 +647,7 @@ Limits apply to the whole service (all replicas) and per client (by IP). Clients
 
 ### Model
 
-If the service runs a model with an OpenAI-compatible interface, you can set the [`model`](#model) property to make the model accessible through `dstack`'s chat UI on the `Models` page. 
+If the service runs a model with an OpenAI-compatible interface, you can set the [`model`](#model) property to make the model accessible through `dstack`'s chat UI on the `Models` page.
 In this case, `dstack` will use the service's `/v1/chat/completions` endpoint.
 
 When `model` is set, `dstack` automatically configures [`probes`](#probes) to verify model health.
@@ -516,10 +655,10 @@ To customize or disable this, set `probes` explicitly.
 
 ### Resources
 
-If you specify memory size, you can either specify an explicit size (e.g. `24GB`) or a 
+If you specify memory size, you can either specify an explicit size (e.g. `24GB`) or a
 range (e.g. `24GB..`, or `24GB..80GB`, or `..80GB`).
 
-
+
```yaml type: service @@ -550,10 +689,10 @@ resources:
-The `cpu` property lets you set the architecture (`x86` or `arm`) and core count — e.g., `x86:16` (16 x86 cores), `arm:8..` (at least 8 ARM cores). +The `cpu` property lets you set the architecture (`x86` or `arm`) and core count — e.g., `x86:16` (16 x86 cores), `arm:8..` (at least 8 ARM cores). If not set, `dstack` infers it from the GPU or defaults to `x86`. -The `gpu` property lets you specify vendor, model, memory, and count — e.g., `nvidia` (one NVIDIA GPU), `A100` (one A100), `A10G,A100` (either), `A100:80GB` (one 80GB A100), `A100:2` (two A100), `24GB..40GB:2` (two GPUs with 24–40GB), `A100:40GB:2` (two 40GB A100s). +The `gpu` property lets you specify vendor, model, memory, and count — e.g., `nvidia` (one NVIDIA GPU), `A100` (one A100), `A10G,A100` (either), `A100:80GB` (one 80GB A100), `A100:2` (two A100), `24GB..40GB:2` (two GPUs with 24–40GB), `A100:40GB:2` (two 40GB A100s). If vendor is omitted, `dstack` infers it from the model or defaults to `nvidia`. @@ -563,7 +702,7 @@ If vendor is omitted, `dstack` infers it from the model or defaults to `nvidia`. ```yaml type: service name: llama31-service-optimum-tpu - + image: dstackai/optimum-tpu:llama31 env: - HF_TOKEN @@ -575,7 +714,7 @@ If vendor is omitted, `dstack` infers it from the model or defaults to `nvidia`. port: 8000 # Register the model model: meta-llama/Meta-Llama-3.1-8B-Instruct - + resources: gpu: v5litepod-4 ``` @@ -583,7 +722,7 @@ If vendor is omitted, `dstack` infers it from the model or defaults to `nvidia`. Currently, only 8 TPU cores can be specified, supporting single TPU device workloads. Multi-TPU support is coming soon. --> ??? info "Shared memory" - If you are using parallel communicating processes (e.g., dataloaders in PyTorch), you may need to configure + If you are using parallel communicating processes (e.g., dataloaders in PyTorch), you may need to configure `shm_size`, e.g. set it to `16GB`. 
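The `gpu` shorthand described above (model, per-GPU memory, and count separated by colons, e.g. `A100:40GB:2`) can be illustrated with a toy parser. This is a hypothetical sketch for intuition only — not dstack's actual parser, which also handles vendors, lists of alternative models, and memory ranges:

```python
# Hypothetical sketch (not dstack's real implementation) of how a spec
# such as "A100:40GB:2" breaks into model, per-GPU memory, and count:
# tokens ending in "GB" are memory, purely numeric tokens are a count,
# and anything else is treated as a model name.

def parse_gpu_spec(spec: str) -> dict:
    result = {"model": None, "memory": None, "count": 1}
    for token in spec.split(":"):
        if token.isdigit():
            result["count"] = int(token)
        elif token.upper().endswith("GB"):
            result["memory"] = token
        else:
            result["model"] = token
    return result

print(parse_gpu_spec("A100:40GB:2"))  # {'model': 'A100', 'memory': '40GB', 'count': 2}
print(parse_gpu_spec("H100:80GB:8"))  # {'model': 'H100', 'memory': '80GB', 'count': 8}
```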
> If you’re unsure which offers (hardware configurations) are available from the configured backends, use the @@ -594,18 +733,18 @@ If vendor is omitted, `dstack` infers it from the model or defaults to `nvidia`. #### Default image -If you don't specify `image`, `dstack` uses its [base](https://github.com/dstackai/dstack/tree/master/docker/base) Docker image pre-configured with - `uv`, `python`, `pip`, essential CUDA drivers, `mpirun`, and NCCL tests (under `/opt/nccl-tests/build`). +If you don't specify `image`, `dstack` uses its [base](https://github.com/dstackai/dstack/tree/master/docker/base) Docker image pre-configured with + `uv`, `python`, `pip`, essential CUDA drivers, `mpirun`, and NCCL tests (under `/opt/nccl-tests/build`). Set the `python` property to pre-install a specific version of Python. -
+
```yaml type: service -name: http-server-service +name: http-server-service python: 3.12 @@ -618,16 +757,16 @@ port: 8000 #### NVCC -By default, the base Docker image doesn’t include `nvcc`, which is required for building custom CUDA kernels. +By default, the base Docker image doesn’t include `nvcc`, which is required for building custom CUDA kernels. If you need `nvcc`, set the [`nvcc`](../reference/dstack.yml/dev-environment.md#nvcc) property to true. -
+
```yaml type: service -name: http-server-service +name: http-server-service python: 3.12 nvcc: true @@ -650,7 +789,7 @@ If you want, you can specify your own Docker image via `image`. name: http-server-service image: python - + commands: - python3 -m http.server port: 8000 @@ -662,18 +801,26 @@ If you want, you can specify your own Docker image via `image`. Set `docker` to `true` to enable the `docker` CLI in your service, e.g., to run Docker images or use Docker Compose. -
+
```yaml type: service -name: chat-ui-task +name: compose-service auth: false docker: true -working_dir: examples/misc/docker-compose commands: + - | + cat > compose.yaml <<'EOF' + services: + web: + image: python:3.11-slim + command: python -m http.server 9000 + ports: + - "9000:9000" + EOF - docker compose up port: 9000 ``` @@ -689,8 +836,8 @@ To enable privileged mode, set [`privileged`](../reference/dstack.yml/dev-enviro Not supported with `runpod`, `vastai`, and `kubernetes`. #### Private registry - -Use the [`registry_auth`](../reference/dstack.yml/dev-environment.md#registry_auth) property to provide credentials for a private Docker registry. + +Use the [`registry_auth`](../reference/dstack.yml/dev-environment.md#registry_auth) property to provide credentials for a private Docker registry. ```yaml type: service @@ -711,7 +858,7 @@ model: deepseek-ai/deepseek-r1-distill-llama-8b resources: gpu: H100:1 ``` - + ### Environment variables
@@ -741,7 +888,7 @@ resources: ??? info "System environment variables" The following environment variables are available in any run by default: - + | Name | Description | |-------------------------|--------------------------------------------------| | `DSTACK_RUN_NAME` | The name of the run | @@ -768,7 +915,7 @@ Sometimes, when you run a service, you may want to mount local files. This is po -
+
```yaml type: service @@ -801,7 +948,7 @@ The container path is optional. If not specified, it will be automatically calcu -
+
```yaml
type: service
@@ -840,7 +987,7 @@ Imagine you have a Git repo (cloned locally) containing an `examples` subdirectory

-
+
```yaml type: service @@ -874,10 +1021,10 @@ The local path can be either relative to the configuration file or absolute. By default, `dstack` clones the repo to the [working directory](#working-directory). - + You can override the repo directory using either a relative or an absolute path: -
+
```yaml type: service @@ -905,18 +1052,18 @@ The local path can be either relative to the configuration file or absolute. > If the repo directory is relative, it is resolved against [working directory](#working-directory). - If the repo directory is not empty, the run will fail with a runner error. + If the repo directory is not empty, the run will fail with a runner error. To override this behavior, you can set `if_exists` to `skip`: ```yaml type: service - name: llama-2-7b-service - + name: llama-2-7b-service + repos: - local_path: .. path: /my-repo if_exists: skip - + python: 3.12 env: @@ -932,7 +1079,7 @@ The local path can be either relative to the configuration file or absolute. ``` ??? info "Repo size" - The repo size is not limited. However, local changes are limited to 2MB. + The repo size is not limited. However, local changes are limited to 2MB. To avoid exceeding this limit, exclude unnecessary files using `.gitignore` or `.dstackignore`. You can increase the 2MB limit by setting the `DSTACK_SERVER_CODE_UPLOAD_LIMIT` environment variable. @@ -941,7 +1088,7 @@ The local path can be either relative to the configuration file or absolute. -
+
```yaml type: service @@ -979,7 +1126,7 @@ Currently, you can configure up to one repo per run configuration. By default, if `dstack` can't find capacity, or the service exits with an error, or the instance is interrupted, the run will fail. -If you'd like `dstack` to automatically retry, configure the +If you'd like `dstack` to automatically retry, configure the [retry](../reference/dstack.yml/service.md#retry) property accordingly: @@ -1075,7 +1222,7 @@ The `schedule` property can be combined with `max_duration` or `utilization_poli ??? info "Cron syntax" `dstack` supports [POSIX cron syntax](https://pubs.opengroup.org/onlinepubs/9699919799/utilities/crontab.html#tag_20_25_07). One exception is that days of the week are started from Monday instead of Sunday so `0` corresponds to Monday. - + The month and day of week fields accept abbreviated English month and weekday names (`jan–dec` and `mon–sun`) respectively. A cron expression consists of five fields: @@ -1105,8 +1252,8 @@ The `schedule` property can be combined with `max_duration` or `utilization_poli !!! info "Reference" Services support many more configuration options, - incl. [`backends`](../reference/dstack.yml/service.md#backends), - [`regions`](../reference/dstack.yml/service.md#regions), + incl. [`backends`](../reference/dstack.yml/service.md#backends), + [`regions`](../reference/dstack.yml/service.md#regions), [`max_price`](../reference/dstack.yml/service.md#max_price), and among [others](../reference/dstack.yml/service.md). @@ -1133,7 +1280,7 @@ Update the run? [y/n]: If approved, `dstack` gradually updates the service replicas. To update a replica, `dstack` starts a new replica, waits for it to become `running` and for all of its [probes](#probes) to pass, then terminates the old replica. This process is repeated for each replica, one at a time. -You can track the progress of rolling deployment in both `dstack apply` or `dstack ps`. 
+You can track the progress of rolling deployment in both `dstack apply` and `dstack ps`.
 Older replicas have lower `deployment` numbers; newer ones have higher.
diff --git a/examples/accelerators/amd/README.md b/examples/accelerators/amd/README.md
index 3f6b4966b1..36be8044e8 100644
--- a/examples/accelerators/amd/README.md
+++ b/examples/accelerators/amd/README.md
@@ -16,7 +16,7 @@ Llama 3.1 70B in FP16 using [vLLM](https://docs.vllm.ai/en/latest/getting_starte
 
 === "vLLM"
 
-
+
```yaml type: service @@ -69,9 +69,7 @@ Llama 3.1 70B in FP16 using [vLLM](https://docs.vllm.ai/en/latest/getting_starte Note, maximum size of vLLM’s `KV cache` is 126192, consequently we must set `MAX_MODEL_LEN` to 126192. Adding `/opt/conda/envs/py_3.10/bin` to PATH ensures we use the Python 3.10 environment necessary for the pre-built binaries compiled specifically for this version. - > To speed up the `vLLM-ROCm` installation, we use a pre-built binary from S3. - > You can find the task to build and upload the binary in - > [`examples/inference/vllm/amd/`](https://github.com/dstackai/dstack/blob/master/examples/inference/vllm/amd/). + > To speed up the `vLLM-ROCm` installation, this example uses a pre-built binary from S3. !!! info "Docker image" If you want to use AMD, specifying `image` is currently required. This must be an image that includes @@ -87,7 +85,7 @@ To request multiple GPUs, specify the quantity after the GPU name, separated by and the [`mlabonne/guanaco-llama2-1k`](https://huggingface.co/datasets/mlabonne/guanaco-llama2-1k) dataset. -
+
```yaml type: task @@ -133,7 +131,7 @@ To request multiple GPUs, specify the quantity after the GPU name, separated by and the [tatsu-lab/alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) dataset. -
+
```yaml type: task @@ -187,13 +185,12 @@ To request multiple GPUs, specify the quantity after the GPU name, separated by Note, to support ROCm, we need to checkout to commit `d4f6c65`. This commit eliminates the need to manually modify the Axolotl source code to make xformers compatible with ROCm, as described in the [xformers workaround](https://docs.axolotl.ai/docs/amd_hpc.html#apply-xformers-workaround). This installation approach is also followed for building Axolotl ROCm docker image. [(See Dockerfile)](https://github.com/ROCm/rocm-blogs/blob/release/blogs/artificial-intelligence/axolotl/src/Dockerfile.rocm). - > To speed up installation of `flash-attention` and `xformers `, we use pre-built binaries uploaded to S3. - > You can find the tasks that build and upload the binaries - > in [`examples/single-node-training/axolotl/amd/`](https://github.com/dstackai/dstack/blob/master/examples/single-node-training/axolotl/amd/). + > To speed up installation of `flash-attention` and `xformers`, we use pre-built binaries uploaded to S3. ## Running a configuration -Once the configuration is ready, run `dstack apply -f `, and `dstack` will automatically provision the +Once a configuration is ready, save it to a `.dstack.yml` file, then run +`dstack apply -f `, and `dstack` will automatically provision the cloud resources and run the configuration.
@@ -204,18 +201,11 @@ $ WANDB_API_KEY=... $ WANDB_PROJECT=... $ WANDB_NAME=axolotl-amd-llama31-train $ HUB_MODEL_ID=... -$ dstack apply -f examples/inference/vllm/amd/.dstack.yml +$ dstack apply -f service.dstack.yml ```
-## Source code - -The source-code of this example can be found in -[`examples/inference/vllm/amd`](https://github.com/dstackai/dstack/blob/master/examples/inference/vllm/amd), -[`examples/single-node-training/axolotl/amd`](https://github.com/dstackai/dstack/blob/master/examples/single-node-training/axolotl/amd) and -[`examples/single-node-training/trl/amd`](https://github.com/dstackai/dstack/blob/master/examples/single-node-training/trl/amd) - ## What's next? 1. Browse [vLLM](https://docs.vllm.ai/en/latest/getting_started/amd-installation.html#build-from-source-rocm), diff --git a/examples/accelerators/tpu/README.md b/examples/accelerators/tpu/README.md index 98982a919a..53f31b93bd 100644 --- a/examples/accelerators/tpu/README.md +++ b/examples/accelerators/tpu/README.md @@ -29,7 +29,7 @@ and [vLLM](https://github.com/vllm-project/vllm). === "Optimum TPU" -
+
```yaml type: service @@ -61,7 +61,7 @@ and [vLLM](https://github.com/vllm-project/vllm). the official Docker image can be used. === "vLLM" -
+
```yaml type: service @@ -184,13 +184,6 @@ Note, `v5litepod` is optimized for fine-tuning transformer-based models. Each co | **TRL** | bfloat16 | To fine-tune using TRL, Optimum TPU is recommended. TRL doesn't support Llama 3.1 out of the box. | | **Pytorch XLA** | bfloat16 | | -## Source code - -The source-code of this example can be found in -[`examples/inference/tgi/tpu`](https://github.com/dstackai/dstack/blob/master/examples/inference/tgi/tpu), -[`examples/inference/vllm/tpu`](https://github.com/dstackai/dstack/blob/master/examples/inference/vllm/tpu), -and [`examples/single-node-training/optimum-tpu`](https://github.com/dstackai/dstack/blob/master/examples/single-node-training/trl). - ## What's next? 1. Browse [Optimum TPU](https://github.com/huggingface/optimum-tpu), diff --git a/examples/clusters/nccl-rccl-tests/README.md b/examples/clusters/nccl-rccl-tests/README.md index 1d9166591e..a9cadd82e8 100644 --- a/examples/clusters/nccl-rccl-tests/README.md +++ b/examples/clusters/nccl-rccl-tests/README.md @@ -115,7 +115,6 @@ Here's an example of a task that runs AllReduce test on 2 nodes, each with 4 GPU kernel driver `bnxt_re`. To ensure this compatibility, we mount `libbnxt_re-rdmav34.so` from the host and preload it using `LD_PRELOAD` when running MPI. - !!! info "Privileged" In some cases, the backend (e.g., `kubernetes`) may require `privileged: true` to access the high-speed interconnect (e.g., InfiniBand). @@ -138,11 +137,6 @@ Submit the run nccl-tests? [y/n]: y
-## Source code - -The source-code of this example can be found in -[`examples/clusters/nccl-rccl-tests`](https://github.com/dstackai/dstack/blob/master/examples/clusters/nccl-rccl-tests). - ## What's next? 1. Check [dev environments](https://dstack.ai/docs/concepts/dev-environments), [tasks](https://dstack.ai/docs/concepts/tasks), diff --git a/examples/distributed-training/axolotl/README.md b/examples/distributed-training/axolotl/README.md index 593d4142b2..cd7be95e4c 100644 --- a/examples/distributed-training/axolotl/README.md +++ b/examples/distributed-training/axolotl/README.md @@ -94,11 +94,6 @@ Provisioning... ```
-## Source code - -The source-code of this example can be found in -[`examples/distributed-training/axolotl`](https://github.com/dstackai/dstack/blob/master/examples/distributed-training/axolotl). - !!! info "What's next?" 1. Read the [clusters](https://dstack.ai/docs/guides/clusters) guide 2. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/concepts/tasks), diff --git a/examples/distributed-training/trl/README.md b/examples/distributed-training/trl/README.md index 84ac7dd6ba..47d3f6f888 100644 --- a/examples/distributed-training/trl/README.md +++ b/examples/distributed-training/trl/README.md @@ -154,11 +154,6 @@ Provisioning... ```
-## Source code - -The source-code of this example can be found in -[`examples/distributed-training/trl`](https://github.com/dstackai/dstack/blob/master/examples/distributed-training/trl). - !!! info "What's next?" 1. Read the [clusters](https://dstack.ai/docs/guides/clusters) guide 2. Check [dev environments](https://dstack.ai/docs/concepts/dev-environments), [tasks](https://dstack.ai/docs/concepts/tasks), diff --git a/examples/inference/nim/.dstack.yml b/examples/inference/nim/.dstack.yml deleted file mode 100644 index 4a7d33406b..0000000000 --- a/examples/inference/nim/.dstack.yml +++ /dev/null @@ -1,27 +0,0 @@ -type: service -name: serve-distill-deepseek - -image: nvcr.io/nim/deepseek-ai/deepseek-r1-distill-llama-8b -env: - - NGC_API_KEY - - NIM_MAX_MODEL_LEN=4096 -registry_auth: - username: $oauthtoken - password: ${{ env.NGC_API_KEY }} -port: 8000 -# Register the model -model: deepseek-ai/deepseek-r1-distill-llama-8b - -# Uncomment to leverage spot instances -#spot_policy: auto - -# Cache downloaded models -volumes: - - instance_path: /root/.cache/nim - path: /opt/nim/.cache - optional: true - -resources: - gpu: A100:40GB - # Uncomment if using multiple GPUs - #shm_size: 16GB diff --git a/examples/inference/nim/README.md b/examples/inference/nim/README.md index 356492a49a..680c51f498 100644 --- a/examples/inference/nim/README.md +++ b/examples/inference/nim/README.md @@ -1,11 +1,11 @@ --- title: NVIDIA NIM -description: Deploying DeepSeek-R1-Distill-Llama-8B using NVIDIA NIM +description: Deploying Nemotron-3-Super-120B-A12B using NVIDIA NIM --- # NVIDIA NIM -This example shows how to deploy DeepSeek-R1-Distill-Llama-8B using [NVIDIA NIM](https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html) and `dstack`. +This example shows how to deploy Nemotron-3-Super-120B-A12B using [NVIDIA NIM](https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html) and `dstack`. ??? 
info "Prerequisites" Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples. @@ -21,60 +21,46 @@ This example shows how to deploy DeepSeek-R1-Distill-Llama-8B using [NVIDIA NIM] ## Deployment -Here's an example of a service that deploys DeepSeek-R1-Distill-Llama-8B using NIM. +Here's an example of a service that deploys Nemotron-3-Super-120B-A12B using NIM. -
+
```yaml type: service -name: serve-distill-deepseek +name: nemotron120 -image: nvcr.io/nim/deepseek-ai/deepseek-r1-distill-llama-8b +image: nvcr.io/nim/nvidia/nemotron-3-super-120b-a12b:1.8.0 env: - NGC_API_KEY - - NIM_MAX_MODEL_LEN=4096 registry_auth: username: $oauthtoken password: ${{ env.NGC_API_KEY }} port: 8000 -# Register the model -model: deepseek-ai/deepseek-r1-distill-llama-8b - -# Uncomment to leverage spot instances -#spot_policy: auto - -# Cache downloaded models +model: nvidia/nemotron-3-super-120b-a12b volumes: - instance_path: /root/.cache/nim path: /opt/nim/.cache optional: true resources: - gpu: A100:40GB - # Uncomment if using multiple GPUs - #shm_size: 16GB + cpu: x86:96.. + memory: 512GB.. + shm_size: 16GB + disk: 500GB.. + gpu: H100:80GB:8 ```
### Running a configuration -To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command. +Save the configuration above as `nemotron120.dstack.yml`, then use the +[`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command.
```shell $ NGC_API_KEY=... -$ dstack apply -f examples/inference/nim/.dstack.yml - - # BACKEND REGION RESOURCES SPOT PRICE - 1 vultr ewr 6xCPU, 60GB, 1xA100 (40GB) no $1.199 - 2 vultr ewr 6xCPU, 60GB, 1xA100 (40GB) no $1.199 - 3 vultr nrt 6xCPU, 60GB, 1xA100 (40GB) no $1.199 - -Submit the run serve-distill-deepseek? [y/n]: y - -Provisioning... ----> 100% +$ dstack apply -f nemotron120.dstack.yml ```
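Once the run is up, the service is reachable through the `dstack` server's proxy. The endpoint URL is composed from the server URL, the project name, and the run name; here is a small illustrative helper (the function and its name are an assumption for this sketch, not part of `dstack`'s API):

```python
def service_endpoint(server_url: str, project: str, run_name: str) -> str:
    """Compose the proxy endpoint used when no gateway is configured."""
    return f"{server_url.rstrip('/')}/proxy/services/{project}/{run_name}"

# The curl example below targets exactly this URL.
url = service_endpoint("http://127.0.0.1:3000", "main", "nemotron120")
# → "http://127.0.0.1:3000/proxy/services/main/nemotron120"
```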
@@ -83,12 +69,12 @@ If no gateway is created, the service endpoint will be available at ` ```shell -$ curl http://127.0.0.1:3000/proxy/services/main/serve-distill-deepseek/v1/chat/completions \ +$ curl http://127.0.0.1:3000/proxy/services/main/nemotron120/v1/chat/completions \ -X POST \ -H 'Authorization: Bearer <dstack token>' \ -H 'Content-Type: application/json' \ -d '{ - "model": "meta/llama3-8b-instruct", + "model": "nvidia/nemotron-3-super-120b-a12b", "messages": [ { "role": "system", @@ -105,14 +91,9 @@ $ curl http://127.0.0.1:3000/proxy/services/main/serve-distill-deepseek/v1/chat/
-When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint will be available at `https://serve-distill-deepseek.<gateway domain>/`.
-
-## Source code
-
-The source-code of this example can be found in
-[`examples/inference/nim`](https://github.com/dstackai/dstack/blob/master/examples/inference/nim).
+When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint will be available at `https://nemotron120.<gateway domain>/`.
 
 ## What's next?
 
 1. Check [services](https://dstack.ai/docs/services)
-2. Browse the [DeepSeek AI NIM](https://build.nvidia.com/deepseek-ai)
+2. Browse the [Nemotron-3-Super-120B-A12B model page](https://build.nvidia.com/nvidia/nemotron-3-super-120b-a12b)
diff --git a/examples/inference/sglang/README.md b/examples/inference/sglang/README.md
index 8dbfe347f1..9d08fe09cb 100644
--- a/examples/inference/sglang/README.md
+++ b/examples/inference/sglang/README.md
@@ -1,84 +1,121 @@
 ---
 title: SGLang
-description: Deploying DeepSeek-R1-Distill-Llama models using SGLang on NVIDIA and AMD GPUs
+description: Deploying Qwen3.5-397B-A17B-FP8 using SGLang on NVIDIA and AMD GPUs
 ---
 
 # SGLang
 
-This example shows how to deploy DeepSeek-R1-Distill-Llama 8B and 70B using [SGLang](https://github.com/sgl-project/sglang) and `dstack`.
+This example shows how to deploy `Qwen/Qwen3.5-397B-A17B-FP8` using
+[SGLang](https://github.com/sgl-project/sglang) and `dstack`.
 
 ## Apply a configuration
 
-Here's an example of a service that deploys DeepSeek-R1-Distill-Llama 8B and 70B using SGLang.
+Here's an example of a service that deploys
+`Qwen/Qwen3.5-397B-A17B-FP8` using SGLang.
 
 === "NVIDIA"
 
-
+
```yaml type: service - name: deepseek-r1 + name: qwen397 - image: lmsysorg/sglang:latest - env: - - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-8B + image: lmsysorg/sglang:v0.5.10.post1 commands: - - python3 -m sglang.launch_server - --model-path $MODEL_ID - --port 8000 - --trust-remote-code - - port: 8000 - model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + - | + sglang serve \ + --model-path Qwen/Qwen3.5-397B-A17B-FP8 \ + --port 30000 \ + --tp $DSTACK_GPUS_NUM \ + --reasoning-parser qwen3 \ + --tool-call-parser qwen3_coder \ + --enable-flashinfer-allreduce-fusion \ + --mem-fraction-static 0.8 + + port: 30000 + model: Qwen/Qwen3.5-397B-A17B-FP8 + + volumes: + - instance_path: /root/.cache + path: /root/.cache + optional: true resources: - gpu: 24GB + cpu: x86:96.. + memory: 512GB.. + shm_size: 16GB + disk: 500GB.. + gpu: H100:80GB:8 ```
=== "AMD" -
+
```yaml type: service - name: deepseek-r1 + name: qwen397 + + image: lmsysorg/sglang:v0.5.10.post1-rocm720-mi30x - image: lmsysorg/sglang:v0.4.1.post4-rocm620 env: - - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B + - HIP_FORCE_DEV_KERNARG=1 + - SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 + - SGLANG_DISABLE_CUDNN_CHECK=1 + - SGLANG_INT4_WEIGHT=0 + - SGLANG_MOE_PADDING=1 + - SGLANG_ROCM_DISABLE_LINEARQUANT=0 + - SGLANG_ROCM_FUSED_DECODE_MLA=1 + - SGLANG_SET_CPU_AFFINITY=1 + - SGLANG_USE_AITER=1 + - SGLANG_USE_ROCM700A=1 commands: - - python3 -m sglang.launch_server - --model-path $MODEL_ID - --port 8000 - --trust-remote-code - - port: 8000 - model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B + - | + sglang serve \ + --model-path Qwen/Qwen3.5-397B-A17B-FP8 \ + --tp $DSTACK_GPUS_NUM \ + --reasoning-parser qwen3 \ + --tool-call-parser qwen3_coder \ + --mem-fraction-static 0.8 \ + --context-length 262144 \ + --attention-backend triton \ + --disable-cuda-graph \ + --fp8-gemm-backend aiter \ + --port 30000 + + port: 30000 + model: Qwen/Qwen3.5-397B-A17B-FP8 + + volumes: + - instance_path: /root/.cache + path: /root/.cache + optional: true resources: - gpu: MI300x - disk: 300GB + cpu: x86:52.. + memory: 700GB.. + shm_size: 16GB + disk: 600GB.. + gpu: MI300X:192GB:4 ```
-To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command.
+The AMD example uses a validated MI300X configuration for this model,
+including the ROCm/AITER environment settings required for stable FP8 serving.
+
+Save one of the configurations above as `qwen397.dstack.yml`, then use the
+[`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command.

```shell -$ dstack apply -f examples/llms/deepseek/sglang/amd/.dstack.yml - - # BACKEND REGION RESOURCES SPOT PRICE - 1 runpod EU-RO-1 24xCPU, 283GB, 1xMI300X (192GB) no $2.49 - -Submit the run deepseek-r1? [y/n]: y - -Provisioning... ----> 100% +$ dstack apply -f qwen397.dstack.yml ``` +
If no gateway is created, the service endpoint will be available at
`<dstack server URL>/proxy/services/<project name>/<run name>/`.

@@ -86,29 +123,26 @@ If no gateway is created, the service endpoint will be available at `

 ```shell
-curl http://127.0.0.1:3000/proxy/services/main/deepseek-r1/v1/chat/completions \
+curl http://127.0.0.1:3000/proxy/services/main/qwen397/v1/chat/completions \
   -X POST \
   -H 'Authorization: Bearer <dstack token>' \
   -H 'Content-Type: application/json' \
   -d '{
-    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
+    "model": "Qwen/Qwen3.5-397B-A17B-FP8",
     "messages": [
-      {
-        "role": "system",
-        "content": "You are a helpful assistant."
-      },
       {
         "role": "user",
-        "content": "What is Deep Learning?"
+        "content": "A bat and a ball cost $1.10 total. The bat costs $1.00 more than the ball. How much does the ball cost? Answer with just the dollar amount."
       }
     ],
-    "stream": true,
-    "max_tokens": 512
+    "chat_template_kwargs": {"enable_thinking": true},
+    "separate_reasoning": true,
+    "max_tokens": 1024
 }'
 ```
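The same request body can be assembled in Python. `chat_template_kwargs` and `separate_reasoning` are passed through to SGLang exactly as in the curl example above; actually sending the payload (e.g. with `requests.post`) is left out so the sketch stays self-contained:

```python
import json

def chat_request(model: str, question: str, max_tokens: int = 1024) -> str:
    """Build the JSON body for an OpenAI-compatible /v1/chat/completions call."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        # SGLang extensions used in the curl example above:
        "chat_template_kwargs": {"enable_thinking": True},
        "separate_reasoning": True,
        "max_tokens": max_tokens,
    }
    return json.dumps(body)

payload = chat_request(
    "Qwen/Qwen3.5-397B-A17B-FP8",
    "A bat and a ball cost $1.10 total. The bat costs $1.00 more than the ball. "
    "How much does the ball cost? Answer with just the dollar amount.",
)
```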
-> If a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured (e.g. to enable auto-scaling, HTTPS, rate limits, etc.), the service endpoint will be available at `https://deepseek-r1./`. +> If a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured (e.g. to enable auto-scaling, HTTPS, rate limits, etc.), the service endpoint will be available at `https://qwen397./`. ## Configuration options @@ -116,75 +150,77 @@ curl http://127.0.0.1:3000/proxy/services/main/deepseek-r1/v1/chat/completions \ To run SGLang with [PD disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html), use replicas groups: one for a router (for example, [SGLang Model Gateway](https://docs.sglang.io/advanced_features/sgl_model_gateway.html)), one for prefill workers, and one for decode workers. -
- -```yaml -type: service -name: prefill-decode -image: lmsysorg/sglang:latest - -env: - - HF_TOKEN - - MODEL_ID=zai-org/GLM-4.5-Air-FP8 +=== "NVIDIA" -replicas: - - count: 1 - # For now replica group with router must have count: 1 - commands: - - pip install sglang_router - - | - python -m sglang_router.launch_router \ - --host 0.0.0.0 \ - --port 8000 \ - --pd-disaggregation \ - --prefill-policy cache_aware - router: - type: sglang - resources: - cpu: 4 +
- - count: 1..4 - scaling: - metric: rps - target: 3 - commands: - - | - python -m sglang.launch_server \ - --model-path $MODEL_ID \ - --disaggregation-mode prefill \ - --disaggregation-transfer-backend nixl \ - --host 0.0.0.0 \ - --port 8000 \ - --disaggregation-bootstrap-port 8998 - resources: - gpu: H200 + ```yaml + type: service + name: prefill-decode + image: lmsysorg/sglang:v0.5.10.post1 - - count: 1..8 - scaling: - metric: rps - target: 2 - commands: - - | - python -m sglang.launch_server \ - --model-path $MODEL_ID \ - --disaggregation-mode decode \ - --disaggregation-transfer-backend nixl \ - --host 0.0.0.0 \ - --port 8000 - resources: - gpu: H200 + env: + - HF_TOKEN + - MODEL_ID=zai-org/GLM-4.5-Air-FP8 + + replicas: + - count: 1 + # For now replica group with router must have count: 1 + commands: + - pip install sglang_router + - | + python -m sglang_router.launch_router \ + --host 0.0.0.0 \ + --port 8000 \ + --pd-disaggregation \ + --prefill-policy cache_aware + router: + type: sglang + resources: + cpu: 4 + + - count: 1..4 + scaling: + metric: rps + target: 3 + commands: + - | + python -m sglang.launch_server \ + --model-path $MODEL_ID \ + --disaggregation-mode prefill \ + --disaggregation-transfer-backend nixl \ + --host 0.0.0.0 \ + --port 8000 \ + --disaggregation-bootstrap-port 8998 + resources: + gpu: H200 + + - count: 1..8 + scaling: + metric: rps + target: 2 + commands: + - | + python -m sglang.launch_server \ + --model-path $MODEL_ID \ + --disaggregation-mode decode \ + --disaggregation-transfer-backend nixl \ + --host 0.0.0.0 \ + --port 8000 + resources: + gpu: H200 -port: 8000 -model: zai-org/GLM-4.5-Air-FP8 + port: 8000 + model: zai-org/GLM-4.5-Air-FP8 -# Custom probe is required for PD disaggregation. -probes: - - type: http - url: /health - interval: 15s -``` + # Custom probe is required for PD disaggregation. + probes: + - type: http + url: /health + interval: 15s + ``` -
+
Currently, auto-scaling only supports `rps` as the metric. TTFT and ITL metrics are coming soon. @@ -193,12 +229,7 @@ Currently, auto-scaling only supports `rps` as the metric. TTFT and ITL metrics While the prefill and decode replicas run on GPUs, the router replica requires a CPU instance in the same cluster. -## Source code - -The source-code of these examples can be found in -[`examples/llms/deepseek/sglang`](https://github.com/dstackai/dstack/blob/master/examples/llms/deepseek/sglang) and [`examples/inference/sglang`](https://github.com/dstackai/dstack/blob/master/examples/inference/sglang). - ## What's next? 1. Read about [services](https://dstack.ai/docs/concepts/services) and [gateways](https://dstack.ai/docs/concepts/gateways) -2. Browse the [SgLang DeepSeek Usage](https://docs.sglang.ai/references/deepseek.html), [Supercharge DeepSeek-R1 Inference on AMD Instinct MI300X](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1-Part2/README.html) +2. Browse the [Qwen 3.5 SGLang cookbook](https://cookbook.sglang.io/autoregressive/Qwen/Qwen3.5) and the [SGLang server arguments reference](https://docs.sglang.ai/advanced_features/server_arguments.html) diff --git a/examples/inference/sglang/pd-disagg.fleet.dstack.yml b/examples/inference/sglang/pd-disagg.fleet.dstack.yml deleted file mode 100644 index 6cb839d98b..0000000000 --- a/examples/inference/sglang/pd-disagg.fleet.dstack.yml +++ /dev/null @@ -1,12 +0,0 @@ -type: fleet -name: pd-disagg - -placement: cluster - -ssh_config: - user: ubuntu - identity_file: ~/.ssh/id_rsa - hosts: - - 89.169.108.16 # CPU Host (router) - - 89.169.123.100 # GPU Host (prefill/decode workers) - - 89.169.110.65 # GPU Host (prefill/decode workers) diff --git a/examples/inference/sglang/pd.deprecated.dstack.yml b/examples/inference/sglang/pd.deprecated.dstack.yml deleted file mode 100644 index ff62080ae9..0000000000 --- a/examples/inference/sglang/pd.deprecated.dstack.yml +++ /dev/null @@ -1,54 +0,0 @@ -# DEPRECATED: 
Gateway-based PD disaggregation config. -# Use `pd.dstack.yml` instead (router runs as a replica). - -type: service -name: prefill-decode -image: lmsysorg/sglang:latest - -env: - - HF_TOKEN - - MODEL_ID=zai-org/GLM-4.5-Air-FP8 - -replicas: - - count: 1..4 - scaling: - metric: rps - target: 3 - commands: - - | - python -m sglang.launch_server \ - --model-path $MODEL_ID \ - --disaggregation-mode prefill \ - --disaggregation-transfer-backend mooncake \ - --host 0.0.0.0 \ - --port 8000 \ - --disaggregation-bootstrap-port 8998 - resources: - gpu: 1 - - - count: 1..8 - scaling: - metric: rps - target: 2 - commands: - - | - python -m sglang.launch_server \ - --model-path $MODEL_ID \ - --disaggregation-mode decode \ - --disaggregation-transfer-backend mooncake \ - --host 0.0.0.0 \ - --port 8000 - resources: - gpu: 1 - -port: 8000 -model: zai-org/GLM-4.5-Air-FP8 - -probes: - - type: http - url: /health_generate - interval: 15s - -router: - type: sglang - pd_disaggregation: true diff --git a/examples/inference/sglang/pd.dstack.yml b/examples/inference/sglang/pd.dstack.yml deleted file mode 100644 index c026bab242..0000000000 --- a/examples/inference/sglang/pd.dstack.yml +++ /dev/null @@ -1,62 +0,0 @@ -type: service -name: prefill-decode -image: lmsysorg/sglang:latest - -env: - - HF_TOKEN - - MODEL_ID=zai-org/GLM-4.5-Air-FP8 - -replicas: - - count: 1 - # For now replica group with router must have count: 1 - commands: - - pip install sglang_router - - | - python -m sglang_router.launch_router \ - --host 0.0.0.0 \ - --port 8000 \ - --pd-disaggregation \ - --prefill-policy cache_aware - router: - type: sglang - resources: - cpu: 4 - - - count: 1..4 - scaling: - metric: rps - target: 3 - commands: - - | - python -m sglang.launch_server \ - --model-path $MODEL_ID \ - --disaggregation-mode prefill \ - --disaggregation-transfer-backend nixl \ - --host 0.0.0.0 \ - --port 8000 \ - --disaggregation-bootstrap-port 8998 - resources: - gpu: H200 - - - count: 1..8 - scaling: - metric: rps 
- target: 2 - commands: - - | - python -m sglang.launch_server \ - --model-path $MODEL_ID \ - --disaggregation-mode decode \ - --disaggregation-transfer-backend nixl \ - --host 0.0.0.0 \ - --port 8000 - resources: - gpu: H200 - -port: 8000 -model: zai-org/GLM-4.5-Air-FP8 - -probes: - - type: http - url: /health - interval: 15s diff --git a/examples/inference/trtllm/README.md b/examples/inference/trtllm/README.md index 31ddb5b0c3..ae3666d225 100644 --- a/examples/inference/trtllm/README.md +++ b/examples/inference/trtllm/README.md @@ -1,331 +1,66 @@ --- title: TensorRT-LLM -description: Deploying DeepSeek R1 and distilled models using NVIDIA TensorRT-LLM with Triton +description: Deploying Qwen3-235B-A22B-FP8 using NVIDIA TensorRT-LLM on NVIDIA GPUs --- # TensorRT-LLM -This example shows how to deploy both DeepSeek R1 and its distilled version -using [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) and `dstack`. +This example shows how to deploy `nvidia/Qwen3-235B-A22B-FP8` using +[TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) and `dstack`. -??? info "Prerequisites" - Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples. +## Apply a configuration -
- - ```shell - $ git clone https://github.com/dstackai/dstack - $ cd dstack - ``` - -
+Here's an example of a service that deploys +`nvidia/Qwen3-235B-A22B-FP8` using TensorRT-LLM. -## Deployment - -### DeepSeek R1 - -We normally use Triton with the TensorRT-LLM backend to serve models. While this works for the distilled Llama-based -version, DeepSeek R1 isn’t yet compatible. So, for DeepSeek R1, we’ll use `trtllm-serve` with the PyTorch backend instead. - -To use `trtllm-serve`, we first need to build the TensorRT-LLM Docker image from the `main` branch. - -#### Build a Docker image - -Here’s the task config that builds the image and pushes it using the provided Docker credentials. - -
+
```yaml -type: task -name: build-image +type: service +name: qwen235 + +image: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc11 -privileged: true -image: dstackai/dind env: - - DOCKER_USERNAME - - DOCKER_PASSWORD + - HF_HUB_ENABLE_HF_TRANSFER=1 + commands: - - start-dockerd - - apt update && apt-get install -y build-essential make git git-lfs - - git lfs install - - git clone https://github.com/NVIDIA/TensorRT-LLM.git - - cd TensorRT-LLM - - git submodule update --init --recursive - - git lfs pull - # Limit compilation to Hopper for a smaller image - - make -C docker release_build CUDA_ARCHS="90-real" - - docker tag tensorrt_llm/release:latest $DOCKER_USERNAME/tensorrt_llm:latest - - echo "$DOCKER_PASSWORD" | docker login -u "$DOCKER_USERNAME" --password-stdin - - docker push "$DOCKER_USERNAME/tensorrt_llm:latest" + - pip install hf_transfer + - | + trtllm-serve serve nvidia/Qwen3-235B-A22B-FP8 \ + --host 0.0.0.0 \ + --port 8000 \ + --backend pytorch \ + --tp_size $DSTACK_GPUS_NUM \ + --max_batch_size 32 \ + --max_num_tokens 4096 \ + --kv_cache_free_gpu_memory_fraction 0.75 + +port: 8000 +model: nvidia/Qwen3-235B-A22B-FP8 + +volumes: + - instance_path: /root/.cache + path: /root/.cache + optional: true resources: - cpu: 8 - disk: 500GB.. + cpu: 96.. + memory: 512GB.. + shm_size: 32GB + disk: 1000GB.. + gpu: H100:8 ```
-To run it, pass the task configuration to `dstack apply`. +Apply it with [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md):
```shell -$ dstack apply -f examples/inference/trtllm/build-image.dstack.yml - - # BACKEND REGION RESOURCES SPOT PRICE - 1 cudo ca-montreal-2 8xCPU, 25GB, (500.0GB) yes $0.1073 - -Submit the run build-image? [y/n]: y - -Provisioning... ----> 100% +$ dstack apply -f qwen235.dstack.yml ``` -
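The `gpu: H100:8` requirement can be sanity-checked with back-of-the-envelope arithmetic: FP8 stores roughly one byte per parameter, so the 235B-parameter weights need about 235 GB, which fits in 8×80 GB with room left for the KV cache pool governed by `--kv_cache_free_gpu_memory_fraction`. A sketch of that estimate (rough assumptions, not measured values):

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight footprint: parameters × bytes per parameter."""
    return params_billion * bytes_per_param

weights_gb = weight_memory_gb(235, 1.0)  # FP8 ≈ 1 byte/param → ~235 GB
capacity_gb = 8 * 80                     # 8×H100:80GB → 640 GB total
headroom_gb = capacity_gb - weights_gb   # left for KV cache and activations
assert weights_gb < capacity_gb
```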
-#### Deploy the model - -Below is the service configuration that deploys DeepSeek R1 using the built TensorRT-LLM image. - -
- - ```yaml - type: service - name: serve-r1 - - # Specify the image built with `examples/inference/trtllm/build-image.dstack.yml` - image: dstackai/tensorrt_llm:9b931c0f6305aefa3660e6fb84a76a42c0eef167 - env: - - MAX_BATCH_SIZE=256 - - MAX_NUM_TOKENS=16384 - - MAX_SEQ_LENGTH=16384 - - EXPERT_PARALLEL=4 - - PIPELINE_PARALLEL=1 - - HF_HUB_ENABLE_HF_TRANSFER=1 - commands: - - pip install -U "huggingface_hub[cli]" - - pip install hf_transfer - - huggingface-cli download deepseek-ai/DeepSeek-R1 --local-dir DeepSeek-R1 - - trtllm-serve - --backend pytorch - --max_batch_size $MAX_BATCH_SIZE - --max_num_tokens $MAX_NUM_TOKENS - --max_seq_len $MAX_SEQ_LENGTH - --tp_size $DSTACK_GPUS_NUM - --ep_size $EXPERT_PARALLEL - --pp_size $PIPELINE_PARALLEL - DeepSeek-R1 - port: 8000 - model: deepseek-ai/DeepSeek-R1 - - resources: - gpu: 8:H200 - shm_size: 32GB - disk: 2000GB.. - ``` -
- - -To run it, pass the configuration to `dstack apply`. - -
- -```shell -$ dstack apply -f examples/inference/trtllm/serve-r1.dstack.yml - - # BACKEND REGION RESOURCES SPOT PRICE - 1 vastai is-iceland 192xCPU, 2063GB, 8xH200 (141GB) yes $25.62 - -Submit the run serve-r1? [y/n]: y - -Provisioning... ----> 100% -``` -
- - -### DeepSeek R1 Distill Llama 8B - -To deploy DeepSeek R1 Distill Llama 8B, follow the steps below. - -#### Convert and upload checkpoints - -Here’s the task config that converts a Hugging Face model to a TensorRT-LLM checkpoint format -and uploads it to S3 using the provided AWS credentials. - -
- - ```yaml - type: task - name: convert-model - - image: nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3 - env: - - HF_TOKEN - - MODEL_REPO=https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B - - S3_BUCKET_NAME - - AWS_ACCESS_KEY_ID - - AWS_SECRET_ACCESS_KEY - - AWS_DEFAULT_REGION - commands: - # nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3 container uses TensorRT-LLM version 0.17.0, - # therefore we are using branch v0.17.0 - - git clone --branch v0.17.0 --depth 1 https://github.com/triton-inference-server/tensorrtllm_backend.git - - git clone --branch v0.17.0 --single-branch https://github.com/NVIDIA/TensorRT-LLM.git - - git clone https://github.com/triton-inference-server/server.git - - cd TensorRT-LLM/examples/llama - - apt-get -y install git git-lfs - - git lfs install - - git config --global credential.helper store - - huggingface-cli login --token $HF_TOKEN --add-to-git-credential - - git clone $MODEL_REPO - - python3 convert_checkpoint.py --model_dir DeepSeek-R1-Distill-Llama-8B --output_dir tllm_checkpoint_${DSTACK_GPUS_NUM}gpu_bf16 --dtype bfloat16 --tp_size $DSTACK_GPUS_NUM - # Download the AWS CLI - - curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" - - unzip awscliv2.zip - - ./aws/install - - aws s3 sync tllm_checkpoint_${DSTACK_GPUS_NUM}gpu_bf16 s3://${S3_BUCKET_NAME}/tllm_checkpoint_${DSTACK_GPUS_NUM}gpu_bf16 --acl public-read - - resources: - gpu: A100:40GB - - ``` -
- - -To run it, pass the configuration to `dstack apply`. - -
- -```shell -$ dstack apply -f examples/inference/trtllm/convert-model.dstack.yml - - # BACKEND REGION RESOURCES SPOT PRICE - 1 vastai us-iowa 12xCPU, 85GB, 1xA100 (40GB) yes $0.66904 - -Submit the run convert-model? [y/n]: y - -Provisioning... ----> 100% -``` -
- - -#### Build and upload the model - -Here’s the task config that builds a TensorRT-LLM model and uploads it to S3 with the provided AWS credentials. - -
- - ```yaml - type: task - name: build-model - - image: nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3 - env: - - MODEL=deepseek-ai/DeepSeek-R1-Distill-Llama-8B - - S3_BUCKET_NAME - - AWS_ACCESS_KEY_ID - - AWS_SECRET_ACCESS_KEY - - AWS_DEFAULT_REGION - - MAX_SEQ_LEN=8192 # Sum of Max Input Length & Max Output Length - - MAX_INPUT_LEN=4096 - - MAX_BATCH_SIZE=256 - - TRITON_MAX_BATCH_SIZE=1 - - INSTANCE_COUNT=1 - - MAX_QUEUE_DELAY_MS=0 - - MAX_QUEUE_SIZE=0 - - DECOUPLED_MODE=true # Set true for streaming - commands: - - huggingface-cli download $MODEL --exclude '*.safetensors' --local-dir tokenizer_dir - - curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" - - unzip awscliv2.zip - - ./aws/install - - aws s3 sync s3://${S3_BUCKET_NAME}/tllm_checkpoint_${DSTACK_GPUS_NUM}gpu_bf16 ./tllm_checkpoint_${DSTACK_GPUS_NUM}gpu_bf16 - - trtllm-build --checkpoint_dir tllm_checkpoint_${DSTACK_GPUS_NUM}gpu_bf16 --gemm_plugin bfloat16 --output_dir tllm_engine_${DSTACK_GPUS_NUM}gpu_bf16 --max_seq_len $MAX_SEQ_LEN --max_input_len $MAX_INPUT_LEN --max_batch_size $MAX_BATCH_SIZE --gpt_attention_plugin bfloat16 --use_paged_context_fmha enable - - git clone --branch v0.17.0 --single-branch https://github.com/NVIDIA/TensorRT-LLM.git - - python3 TensorRT-LLM/examples/run.py --engine_dir tllm_engine_${DSTACK_GPUS_NUM}gpu_bf16 --max_output_len 40 --tokenizer_dir tokenizer_dir --input_text "What is Deep Learning?" 
- - git clone --branch v0.17.0 --depth 1 https://github.com/triton-inference-server/tensorrtllm_backend.git - - mkdir triton_model_repo - - cp -r tensorrtllm_backend/all_models/inflight_batcher_llm/* triton_model_repo/ - - python3 tensorrtllm_backend/tools/fill_template.py -i triton_model_repo/ensemble/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},logits_datatype:TYPE_BF16 - - python3 tensorrtllm_backend/tools/fill_template.py -i triton_model_repo/preprocessing/config.pbtxt tokenizer_dir:tokenizer_dir,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT} - - python3 tensorrtllm_backend/tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:tllm_engine_${DSTACK_GPUS_NUM}gpu_bf16,max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching,max_queue_size:${MAX_QUEUE_SIZE},encoder_input_features_data_type:TYPE_BF16,logits_datatype:TYPE_BF16 - - python3 tensorrtllm_backend/tools/fill_template.py -i triton_model_repo/postprocessing/config.pbtxt tokenizer_dir:tokenizer_dir,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT},max_queue_size:${MAX_QUEUE_SIZE} - - python3 tensorrtllm_backend/tools/fill_template.py -i triton_model_repo/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT},logits_datatype:TYPE_BF16 - - aws s3 sync triton_model_repo s3://${S3_BUCKET_NAME}/triton_model_repo --acl public-read - - aws s3 sync tllm_engine_${DSTACK_GPUS_NUM}gpu_bf16 s3://${S3_BUCKET_NAME}/tllm_engine_${DSTACK_GPUS_NUM}gpu_bf16 --acl public-read - - resources: - gpu: A100:40GB - ``` -
- -To run it, pass the configuration to `dstack apply`. - -
- -```shell -$ dstack apply -f examples/inference/trtllm/build-model.dstack.yml - - # BACKEND REGION RESOURCES SPOT PRICE - 1 vastai us-iowa 12xCPU, 85GB, 1xA100 (40GB) yes $0.66904 - -Submit the run build-model? [y/n]: y - -Provisioning... ----> 100% -``` -
- -#### Deploy the model - -Below is the service configuration that deploys DeepSeek R1 Distill Llama 8B. - -
- -```yaml - type: service - name: serve-distill - - image: nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3 - env: - - MODEL=deepseek-ai/DeepSeek-R1-Distill-Llama-8B - - S3_BUCKET_NAME - - AWS_ACCESS_KEY_ID - - AWS_SECRET_ACCESS_KEY - - AWS_DEFAULT_REGION - - commands: - - huggingface-cli download $MODEL --exclude '*.safetensors' --local-dir tokenizer_dir - - curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" - - unzip awscliv2.zip - - ./aws/install - - aws s3 sync s3://${S3_BUCKET_NAME}/tllm_engine_1gpu_bf16 ./tllm_engine_1gpu_bf16 - - git clone https://github.com/triton-inference-server/server.git - - python3 server/python/openai/openai_frontend/main.py --model-repository s3://${S3_BUCKET_NAME}/triton_model_repo --tokenizer tokenizer_dir --openai-port 8000 - port: 8000 - model: ensemble - - resources: - gpu: A100:40GB - -``` -
- -To run it, pass the configuration to `dstack apply`. - -
- -```shell -$ dstack apply -f examples/inference/trtllm/serve-distill.dstack.yml - - # BACKEND REGION RESOURCES SPOT PRICE - 1 vastai us-iowa 12xCPU, 85GB, 1xA100 (40GB) yes $0.66904 - -Submit the run serve-distill? [y/n]: y - -Provisioning... ----> 100% -```
## Access the endpoint @@ -335,38 +70,30 @@ If no gateway is created, the service endpoint will be available at ` ```shell -$ curl http://127.0.0.1:3000/proxy/services/main/serve-distill/v1/chat/completions \ +$ curl http://127.0.0.1:3000/proxy/services/main/qwen235/v1/chat/completions \ -X POST \ -H 'Authorization: Bearer <dstack token>' \ -H 'Content-Type: application/json' \ -d '{ - "model": "deepseek-ai/DeepSeek-R1", + "model": "nvidia/Qwen3-235B-A22B-FP8", "messages": [ - { - "role": "system", - "content": "You are a helpful assistant." - }, { "role": "user", - "content": "What is Deep Learning?" + "content": "A bat and a ball cost $1.10 total. The bat costs $1.00 more than the ball. How much does the ball cost?" } ], - "stream": true, - "max_tokens": 128 + "chat_template_kwargs": {"enable_thinking": true}, + "max_tokens": 1024, + "temperature": 0.0 }' ```
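When thinking is enabled, reasoning-capable servers typically return the trace in a separate field of the chat-completions response. Here is a sketch of pulling both parts out of a response dict — the `reasoning_content` field name and the sample values are assumptions following the common OpenAI-style schema, not guaranteed output of this deployment:

```python
def split_completion(response: dict):
    """Return (reasoning trace, final answer) from a chat-completions response."""
    message = response["choices"][0]["message"]
    return message.get("reasoning_content"), message["content"]

# Hypothetical response to the bat-and-ball question above:
sample = {
    "choices": [
        {
            "message": {
                "reasoning_content": "Let x be the ball. x + (x + 1.00) = 1.10, so x = 0.05.",
                "content": "$0.05",
            }
        }
    ]
}
reasoning, answer = split_completion(sample)
# answer == "$0.05"
```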
-When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint will be available at `https://serve-distill.<gateway domain>/`.
-
-## Source code
-
-The source-code of this example can be found in
-[`examples/inference/trtllm`](https://github.com/dstackai/dstack/blob/master/examples/inference/trtllm).
+When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint will be available at `https://qwen235.<gateway domain>/`.
 
 ## What's next?
 
-1. Check [services](https://dstack.ai/docs/services)
-2. Browse [Tensorrt-LLM DeepSeek-R1 with PyTorch Backend](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/deepseek_v3) and [Prepare the Model Repository](https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#prepare-the-model-repository)
-3. See also [`trtllm-serve`](https://nvidia.github.io/TensorRT-LLM/commands/trtllm-serve.html#trtllm-serve)
+1. Read about [services](https://dstack.ai/docs/concepts/services) and [gateways](https://dstack.ai/docs/concepts/gateways)
+2. Browse the [TensorRT-LLM deployment guides](https://nvidia.github.io/TensorRT-LLM/deployment-guide/index.html) and the [Qwen3 deployment guide](https://nvidia.github.io/TensorRT-LLM/deployment-guide/deployment-guide-for-qwen3-on-trtllm.html)
+3. 
See the [`trtllm-serve` reference](https://nvidia.github.io/TensorRT-LLM/commands/trtllm-serve/trtllm-serve.html) diff --git a/examples/inference/trtllm/build-image.dstack.yml b/examples/inference/trtllm/build-image.dstack.yml deleted file mode 100644 index 6761379299..0000000000 --- a/examples/inference/trtllm/build-image.dstack.yml +++ /dev/null @@ -1,25 +0,0 @@ -type: task -name: build-image - -privileged: true -image: dstackai/dind -env: - - DOCKER_USERNAME - - DOCKER_PASSWORD -commands: - - start-dockerd - - apt update && apt-get install -y build-essential make git git-lfs - - git lfs install - - git clone https://github.com/NVIDIA/TensorRT-LLM.git - - cd TensorRT-LLM - - git submodule update --init --recursive - - git lfs pull - # Limit compilation to Hopper for a smaller image - - make -C docker release_build CUDA_ARCHS="90-real" - - docker tag tensorrt_llm/release:latest $DOCKER_USERNAME/tensorrt_llm:latest - - echo "$DOCKER_PASSWORD" | docker login -u "$DOCKER_USERNAME" --password-stdin - - docker push "$DOCKER_USERNAME/tensorrt_llm:latest" - -resources: - cpu: 8 - disk: 500GB.. 
diff --git a/examples/inference/trtllm/build-model.dstack.yml b/examples/inference/trtllm/build-model.dstack.yml deleted file mode 100644 index b02b87644d..0000000000 --- a/examples/inference/trtllm/build-model.dstack.yml +++ /dev/null @@ -1,44 +0,0 @@ -type: task -name: build-model - -image: nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3 - - -env: - - MODEL=deepseek-ai/DeepSeek-R1-Distill-Llama-8B - - S3_BUCKET_NAME - - AWS_ACCESS_KEY_ID - - AWS_SECRET_ACCESS_KEY - - AWS_DEFAULT_REGION - - MAX_SEQ_LEN=8192 # Sum of Max Input Length & Max Output Length - - MAX_INPUT_LEN=4096 - - MAX_BATCH_SIZE=256 - - TRITON_MAX_BATCH_SIZE=1 - - INSTANCE_COUNT=1 - - MAX_QUEUE_DELAY_MS=0 - - MAX_QUEUE_SIZE=0 - - DECOUPLED_MODE=true # Set true for streaming - -commands: - - huggingface-cli download $MODEL --exclude '*.safetensors' --local-dir tokenizer_dir - - curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" - - unzip awscliv2.zip - - ./aws/install - - aws s3 sync s3://${S3_BUCKET_NAME}/tllm_checkpoint_${DSTACK_GPUS_NUM}gpu_bf16 ./tllm_checkpoint_${DSTACK_GPUS_NUM}gpu_bf16 - - trtllm-build --checkpoint_dir tllm_checkpoint_${DSTACK_GPUS_NUM}gpu_bf16 --gemm_plugin bfloat16 --output_dir tllm_engine_${DSTACK_GPUS_NUM}gpu_bf16 --max_seq_len $MAX_SEQ_LEN --max_input_len $MAX_INPUT_LEN --max_batch_size $MAX_BATCH_SIZE --gpt_attention_plugin bfloat16 --use_paged_context_fmha enable - - git clone --branch v0.17.0 --single-branch https://github.com/NVIDIA/TensorRT-LLM.git - - python3 TensorRT-LLM/examples/run.py --engine_dir tllm_engine_${DSTACK_GPUS_NUM}gpu_bf16 --max_output_len 40 --tokenizer_dir tokenizer_dir --input_text "What is Deep Learning?" 
- - git clone --branch v0.17.0 --depth 1 https://github.com/triton-inference-server/tensorrtllm_backend.git - - mkdir triton_model_repo - - cp -r tensorrtllm_backend/all_models/inflight_batcher_llm/* triton_model_repo/ - - python3 tensorrtllm_backend/tools/fill_template.py -i triton_model_repo/ensemble/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},logits_datatype:TYPE_BF16 - - python3 tensorrtllm_backend/tools/fill_template.py -i triton_model_repo/preprocessing/config.pbtxt tokenizer_dir:tokenizer_dir,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT} - - python3 tensorrtllm_backend/tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:tllm_engine_${DSTACK_GPUS_NUM}gpu_bf16,max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching,max_queue_size:${MAX_QUEUE_SIZE},encoder_input_features_data_type:TYPE_BF16,logits_datatype:TYPE_BF16 - - python3 tensorrtllm_backend/tools/fill_template.py -i triton_model_repo/postprocessing/config.pbtxt tokenizer_dir:tokenizer_dir,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT},max_queue_size:${MAX_QUEUE_SIZE} - - python3 tensorrtllm_backend/tools/fill_template.py -i triton_model_repo/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT},logits_datatype:TYPE_BF16 - - aws s3 sync triton_model_repo s3://${S3_BUCKET_NAME}/triton_model_repo --acl public-read - - aws s3 sync tllm_engine_${DSTACK_GPUS_NUM}gpu_bf16 s3://${S3_BUCKET_NAME}/tllm_engine_${DSTACK_GPUS_NUM}gpu_bf16 --acl public-read - - -resources: - gpu: A100:40GB diff --git a/examples/inference/trtllm/convert-model.dstack.yml b/examples/inference/trtllm/convert-model.dstack.yml deleted file mode 100644 index 
262e8f2945..0000000000 --- a/examples/inference/trtllm/convert-model.dstack.yml +++ /dev/null @@ -1,34 +0,0 @@ -type: task -name: convert-model - -image: nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3 - -env: - - HF_TOKEN - - MODEL_REPO=https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B - - S3_BUCKET_NAME - - AWS_ACCESS_KEY_ID - - AWS_SECRET_ACCESS_KEY - - AWS_DEFAULT_REGION - -commands: - # nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3 container uses TensorRT-LLM version 0.17.0, - # therefore we are using branch v0.17.0 - - git clone --branch v0.17.0 --depth 1 https://github.com/triton-inference-server/tensorrtllm_backend.git - - git clone --branch v0.17.0 --single-branch https://github.com/NVIDIA/TensorRT-LLM.git - - git clone https://github.com/triton-inference-server/server.git - - cd TensorRT-LLM/examples/llama - - apt-get -y install git git-lfs - - git lfs install - - git config --global credential.helper store - - huggingface-cli login --token $HF_TOKEN --add-to-git-credential - - git clone $MODEL_REPO - - python3 convert_checkpoint.py --model_dir DeepSeek-R1-Distill-Llama-8B --output_dir tllm_checkpoint_${DSTACK_GPUS_NUM}gpu_bf16 --dtype bfloat16 --tp_size $DSTACK_GPUS_NUM - # Download the AWS CLI - - curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" - - unzip awscliv2.zip - - ./aws/install - - aws s3 sync tllm_checkpoint_${DSTACK_GPUS_NUM}gpu_bf16 s3://${S3_BUCKET_NAME}/tllm_checkpoint_${DSTACK_GPUS_NUM}gpu_bf16 --acl public-read - -resources: - gpu: A100:40GB diff --git a/examples/inference/trtllm/serve-distill.dstack.yml b/examples/inference/trtllm/serve-distill.dstack.yml deleted file mode 100644 index bc5ad6d028..0000000000 --- a/examples/inference/trtllm/serve-distill.dstack.yml +++ /dev/null @@ -1,28 +0,0 @@ -type: service -name: serve-distill - -image: nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3 - -env: - - MODEL=deepseek-ai/DeepSeek-R1-Distill-Llama-8B - - S3_BUCKET_NAME - - 
AWS_ACCESS_KEY_ID - - AWS_SECRET_ACCESS_KEY - - AWS_DEFAULT_REGION - -commands: - - huggingface-cli download $MODEL --exclude '*.safetensors' --local-dir tokenizer_dir - - curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" - - unzip awscliv2.zip - - ./aws/install - - aws s3 sync s3://${S3_BUCKET_NAME}/tllm_engine_1gpu_bf16 ./tllm_engine_1gpu_bf16 - - git clone https://github.com/triton-inference-server/server.git - - python3 server/python/openai/openai_frontend/main.py --model-repository s3://${S3_BUCKET_NAME}/triton_model_repo --tokenizer tokenizer_dir --openai-port 8000 - - -port: 8000 - -model: ensemble - -resources: - gpu: A100:40GB diff --git a/examples/inference/trtllm/serve-r1.dstack.yml b/examples/inference/trtllm/serve-r1.dstack.yml deleted file mode 100644 index f5e9091720..0000000000 --- a/examples/inference/trtllm/serve-r1.dstack.yml +++ /dev/null @@ -1,32 +0,0 @@ -type: service -name: serve-r1 - -# Specify the image built with `examples/inference/trtllm/build-image.dstack.yml` -image: dstackai/tensorrt_llm:9b931c0f6305aefa3660e6fb84a76a42c0eef167 -env: - - MAX_BATCH_SIZE=256 - - MAX_NUM_TOKENS=16384 - - MAX_SEQ_LENGTH=16384 - - EXPERT_PARALLEL=4 - - PIPELINE_PARALLEL=1 - - HF_HUB_ENABLE_HF_TRANSFER=1 -commands: - - pip install -U "huggingface_hub[cli]" - - pip install hf_transfer - - huggingface-cli download deepseek-ai/DeepSeek-R1 --local-dir DeepSeek-R1 - - trtllm-serve - --backend pytorch - --max_batch_size $MAX_BATCH_SIZE - --max_num_tokens $MAX_NUM_TOKENS - --max_seq_len $MAX_SEQ_LENGTH - --tp_size $DSTACK_GPUS_NUM - --ep_size $EXPERT_PARALLEL - --pp_size $PIPELINE_PARALLEL - DeepSeek-R1 -port: 8000 -model: deepseek-ai/DeepSeek-R1 - -resources: - gpu: 8:H200 - shm_size: 32GB - disk: 2000GB.. 
diff --git a/examples/inference/vllm/.dstack.yml b/examples/inference/vllm/.dstack.yml deleted file mode 100644 index 9060cff5ab..0000000000 --- a/examples/inference/vllm/.dstack.yml +++ /dev/null @@ -1,28 +0,0 @@ -type: service -name: llama31 - -python: "3.11" -env: - - HF_TOKEN - - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct - - MAX_MODEL_LEN=4096 -commands: - - pip install vllm - - vllm serve $MODEL_ID - --max-model-len $MAX_MODEL_LEN - --tensor-parallel-size $DSTACK_GPUS_NUM -port: 8000 -# Register the model -model: meta-llama/Meta-Llama-3.1-8B-Instruct - -# Uncomment to leverage spot instances -#spot_policy: auto - -# Uncomment to cache downloaded models -#volumes: -# - /root/.cache/huggingface/hub:/root/.cache/huggingface/hub - -resources: - gpu: 24GB - # Uncomment if using multiple GPUs - #shm_size: 24GB diff --git a/examples/inference/vllm/README.md b/examples/inference/vllm/README.md index ce77e31782..7497af6698 100644 --- a/examples/inference/vllm/README.md +++ b/examples/inference/vllm/README.md @@ -1,82 +1,68 @@ --- title: vLLM -description: Deploying Llama 3.1 8B using vLLM with OpenAI-compatible API +description: Deploying Qwen3.5-397B-A17B-FP8 using vLLM on NVIDIA GPUs --- # vLLM -This example shows how to deploy Llama 3.1 8B with `dstack` using [vLLM](https://docs.vllm.ai/en/latest/). +This example shows how to deploy `Qwen/Qwen3.5-397B-A17B-FP8` using +[vLLM](https://docs.vllm.ai/en/latest/) and `dstack`. -??? info "Prerequisites" - Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples. +## Apply a configuration -
- - ```shell - $ git clone https://github.com/dstackai/dstack - $ cd dstack - ``` - -
+Here's an example of a service that deploys +`Qwen/Qwen3.5-397B-A17B-FP8` using vLLM. -## Deployment - -Here's an example of a service that deploys Llama 3.1 8B using vLLM. - -
- -```yaml -type: service -name: llama31 - -python: "3.11" -env: - - HF_TOKEN - - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct - - MAX_MODEL_LEN=4096 -commands: - - pip install vllm - - vllm serve $MODEL_ID - --max-model-len $MAX_MODEL_LEN - --tensor-parallel-size $DSTACK_GPUS_NUM -port: 8000 -# Register the model -model: meta-llama/Meta-Llama-3.1-8B-Instruct - -# Uncomment to leverage spot instances -#spot_policy: auto - -# Uncomment to cache downloaded models -#volumes: -# - /root/.cache/huggingface/hub:/root/.cache/huggingface/hub - -resources: - gpu: 24GB - # Uncomment if using multiple GPUs - #shm_size: 24GB -``` +=== "NVIDIA" -
+
-### Running a configuration + ```yaml + type: service + name: qwen397 -To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command. + image: vllm/vllm-openai:v0.19.1 -
+ commands: + - | + vllm serve Qwen/Qwen3.5-397B-A17B-FP8 \ + --port 8000 \ + --tensor-parallel-size $DSTACK_GPUS_NUM \ + --max-model-len 262144 \ + --reasoning-parser qwen3 \ + --language-model-only -```shell -$ dstack apply -f examples/inference/vllm/.dstack.yml + port: 8000 + model: Qwen/Qwen3.5-397B-A17B-FP8 + + volumes: + - instance_path: /root/.cache + path: /root/.cache + optional: true + + resources: + cpu: x86:96.. + memory: 512GB.. + shm_size: 16GB + disk: 500GB.. + gpu: H100:80GB:8 + ``` - # BACKEND REGION RESOURCES SPOT PRICE - 1 runpod CA-MTL-1 18xCPU, 100GB, A5000:24GB yes $0.12 - 2 runpod EU-SE-1 18xCPU, 100GB, A5000:24GB yes $0.12 - 3 gcp us-west4 27xCPU, 150GB, A5000:24GB:2 yes $0.23 +
+ +The NVIDIA example serves `Qwen/Qwen3.5-397B-A17B-FP8` on `8x H100` GPUs using +vLLM with tensor parallelism enabled. It uses `--language-model-only` because +`Qwen/Qwen3.5-397B-A17B-FP8` is a text-only model. -Submit a new run? [y/n]: y +Save the configuration above as `qwen397.dstack.yml`, then use the +[`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command. -Provisioning... ----> 100% +
+ +```shell +$ dstack apply -f qwen397.dstack.yml ``` +
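The configuration above pins `gpu: H100:80GB:8`. As a back-of-envelope check of that number: FP8 weights take roughly one byte per parameter, and only part of each GPU is left for weights once the KV cache, activations, and CUDA context are accounted for. The 80% usable-memory fraction below is an illustrative assumption, not a vLLM constant:

```python
import math


def min_tensor_parallel_size(params_billion, bytes_per_param, gpu_mem_gb,
                             usable_fraction=0.8):
    """Smallest GPU count whose combined usable memory holds the model weights.

    `usable_fraction` reserves headroom for the KV cache, activations, and the
    CUDA context; 0.8 is an illustrative assumption, not a measured figure.
    """
    weights_gb = params_billion * bytes_per_param  # 1e9 params x bytes/param, in GB
    return math.ceil(weights_gb / (gpu_mem_gb * usable_fraction))


# ~397B total parameters at FP8 (~1 byte/param) on 80 GB H100s:
print(min_tensor_parallel_size(397, 1.0, 80))  # 7 -- so 8 GPUs leave extra KV-cache room
```

The same arithmetic shows why a BF16 copy of the weights (`bytes_per_param=2.0`) would not fit on a single eight-GPU node without offloading.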
If no gateway is created, the service endpoint will be available at `<dstack server URL>/proxy/services/<project name>/<run name>/`. @@ -84,39 +70,27 @@ If no gateway is created, the service endpoint will be available at ` ```shell -$ curl http://127.0.0.1:3000/proxy/services/main/llama31/v1/chat/completions \ +$ curl http://127.0.0.1:3000/proxy/services/main/qwen397/v1/chat/completions \ -X POST \ -H 'Authorization: Bearer <dstack token>' \ -H 'Content-Type: application/json' \ -d '{ - "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "model": "Qwen/Qwen3.5-397B-A17B-FP8", "messages": [ - { - "role": "system", - "content": "You are a helpful assistant." - }, { "role": "user", - "content": "What is Deep Learning?" + "content": "A bat and a ball cost $1.10 total. The bat costs $1.00 more than the ball. How much does the ball cost?" } ], - "max_tokens": 128 + "max_tokens": 1024 }' ```
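Since the service speaks the OpenAI chat-completions protocol, the response is plain JSON; when a reasoning parser is enabled (as with `--reasoning-parser qwen3` in the configuration above), vLLM places the model's thinking in a separate `reasoning_content` field next to `content`. A small sketch of unpacking both — the sample payload is made up for illustration:

```python
import json


def unpack_chat_completion(payload):
    """Return (reasoning, answer) from an OpenAI-style chat-completion body.

    vLLM only emits `reasoning_content` when a reasoning parser is enabled,
    so fall back to empty strings for servers that omit either field.
    """
    message = json.loads(payload)["choices"][0]["message"]
    return message.get("reasoning_content") or "", message.get("content") or ""


# A made-up, abridged response body:
sample = json.dumps({"choices": [{"message": {
    "role": "assistant",
    "reasoning_content": "Let x be the ball's price: x + (x + 1.00) = 1.10, so x = 0.05.",
    "content": "The ball costs $0.05.",
}}]})
reasoning, answer = unpack_chat_completion(sample)
print(answer)  # The ball costs $0.05.
```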
-When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint will be available at `https://llama31.<gateway domain>/`. - -## Source code - -The source-code of this example can be found in -[`examples/inference/vllm`](https://github.com/dstackai/dstack/blob/master/examples/inference/vllm). +> If a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured (e.g. to enable auto-scaling, HTTPS, rate limits, etc.), the service endpoint will be available at `https://qwen397.<gateway domain>/`. ## What's next? -1. Check [services](https://dstack.ai/docs/services) -2. Browse the [Llama 3.1](https://dstack.ai/examples/llms/llama31/) and - [NIM](https://dstack.ai/examples/inference/nim/) examples -3. See also [AMD](https://dstack.ai/examples/accelerators/amd/) and - [TPU](https://dstack.ai/examples/accelerators/tpu/) +1. Read about [services](https://dstack.ai/docs/concepts/services) and [gateways](https://dstack.ai/docs/concepts/gateways) +2. Browse the [SGLang](https://dstack.ai/examples/inference/sglang/) and [NIM](https://dstack.ai/examples/inference/nim/) examples diff --git a/examples/inference/vllm/amd/.dstack.yml b/examples/inference/vllm/amd/.dstack.yml deleted file mode 100644 index f0c488d755..0000000000 --- a/examples/inference/vllm/amd/.dstack.yml +++ /dev/null @@ -1,43 +0,0 @@ -type: service -name: llama31-service-vllm-amd - -image: runpod/pytorch:2.4.0-py3.10-rocm6.1.0-ubuntu22.04 -env: - - HF_TOKEN - - MODEL_ID=meta-llama/Meta-Llama-3.1-70B-Instruct - - MAX_MODEL_LEN=126192 -commands: - - export PATH=/opt/conda/envs/py_3.10/bin:$PATH - - wget https://github.com/ROCm/hipBLAS/archive/refs/tags/rocm-6.1.0.zip - - unzip rocm-6.1.0.zip - - cd hipBLAS-rocm-6.1.0 - - python rmake.py - - cd ..
- - git clone https://github.com/vllm-project/vllm.git - - cd vllm - - pip install triton - - pip uninstall torch -y - - pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.1 - - pip install /opt/rocm/share/amd_smi - - pip install --upgrade numba scipy huggingface-hub[cli] - - pip install "numpy<2" - - pip install -r requirements-rocm.txt - - wget -N https://github.com/ROCm/vllm/raw/fa78403/rocm_patch/libamdhip64.so.6 -P /opt/rocm/lib - - rm -f "$(python3 -c 'import torch; print(torch.__path__[0])')"/lib/libamdhip64.so* - - export PYTORCH_ROCM_ARCH="gfx90a;gfx942" - - wget https://dstack-binaries.s3.amazonaws.com/vllm-0.6.0%2Brocm614-cp310-cp310-linux_x86_64.whl - - pip install vllm-0.6.0+rocm614-cp310-cp310-linux_x86_64.whl - - vllm serve $MODEL_ID - --max-model-len $MAX_MODEL_LEN - --port 8000 -# Expose the vllm server port -port: 8000 -# Register the model -model: meta-llama/Meta-Llama-3.1-70B-Instruct - -# Uncomment to leverage spot instances -#spot_policy: auto - -resources: - gpu: MI300X - disk: 200GB diff --git a/examples/inference/vllm/amd/build-vllm.dstack.yml b/examples/inference/vllm/amd/build-vllm.dstack.yml deleted file mode 100644 index 3510d0f6d6..0000000000 --- a/examples/inference/vllm/amd/build-vllm.dstack.yml +++ /dev/null @@ -1,47 +0,0 @@ -type: task -name: build-vllm-rocm - -image: runpod/pytorch:2.4.0-py3.10-rocm6.1.0-ubuntu22.04 - -env: - - HF_TOKEN - - AWS_ACCESS_KEY_ID - - AWS_SECRET_ACCESS_KEY - - AWS_REGION - - BUCKET_NAME - -command: - - apt-get update -y - - apt-get install awscli -y - - aws configure set aws_access_key_id $AWS_ACCESS_KEY_ID - - aws configure set aws_secret_access_key $AWS_SECRET_ACCESS_KEY - - aws configure set region $AWS_REGION - - export PATH=/opt/conda/envs/py_3.10/bin:$PATH - - wget https://github.com/ROCm/hipBLAS/archive/refs/tags/rocm-6.1.0.zip - - unzip rocm-6.1.0.zip - - cd hipBLAS-rocm-6.1.0 - - python rmake.py - - cd .. 
- - git clone https://github.com/vllm-project/vllm.git - - cd vllm - - pip install triton - - pip uninstall torch -y - - pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.1 - - pip install /opt/rocm/share/amd_smi - - pip install --upgrade numba scipy huggingface-hub[cli] - - pip install "numpy<2" - - pip install -r requirements-rocm.txt - - wget -N https://github.com/ROCm/vllm/raw/fa78403/rocm_patch/libamdhip64.so.6 -P /opt/rocm/lib - - rm -f "$(python3 -c 'import torch; print(torch.__path__[0])')"/lib/libamdhip64.so* - - export PYTORCH_ROCM_ARCH="gfx90a;gfx942" - - pip install wheel setuptools setuptools_scm ninja - - python setup.py bdist_wheel -d dist/ - - cd dist - - aws s3 cp "$(ls -1 | head -n 1)" s3://$BUCKET_NAME/ --acl public-read - -# Uncomment to leverage spot instances -#spot_policy: auto - -resources: - gpu: MI300X - disk: 150GB diff --git a/examples/inference/vllm/tpu/.dstack.yml b/examples/inference/vllm/tpu/.dstack.yml deleted file mode 100644 index 5a637ca797..0000000000 --- a/examples/inference/vllm/tpu/.dstack.yml +++ /dev/null @@ -1,23 +0,0 @@ -type: service -# The name is optional, if not specified, generated randomly -name: llama31-service-vllm-tpu -image: vllm/vllm-tpu:nightly -env: - - HF_TOKEN - - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct - - MAX_MODEL_LEN=4096 -commands: - - vllm serve $MODEL_ID - --tensor-parallel-size 4 - --max-model-len $MAX_MODEL_LEN - --port 8000 -# Expose the vllm server port -port: 8000 -# Register the model -model: meta-llama/Meta-Llama-3.1-8B-Instruct - -# Uncomment to leverage spot instances -#spot_policy: auto - -resources: - gpu: v5litepod-4 diff --git a/examples/llms/deepseek/README.md b/examples/llms/deepseek/README.md deleted file mode 100644 index ae467891fc..0000000000 --- a/examples/llms/deepseek/README.md +++ /dev/null @@ -1,457 +0,0 @@ -# Deepseek - -This example walks you through how to deploy and -train [Deepseek](https://huggingface.co/deepseek-ai) 
-models with `dstack`. - -> We used Deepseek-R1 distilled models and Deepseek-V2-Lite, a 16B model with the same architecture as Deepseek-R1 (671B). Deepseek-V2-Lite retains MLA and DeepSeekMoE but requires less memory, making it ideal for testing and fine-tuning on smaller GPUs. - -??? info "Prerequisites" - Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples. - -
- - ```shell - $ git clone https://github.com/dstackai/dstack - $ cd dstack - ``` -
- -## Deployment - -### AMD - -Here's an example of a service that deploys `Deepseek-R1-Distill-Llama-70B` using [SGLang](https://github.com/sgl-project/sglang) and [vLLM](https://github.com/vllm-project/vllm) with AMD `MI300X` GPU. The below configurations also support `Deepseek-V2-Lite`. - -=== "SGLang" - -
- - ```yaml - type: service - name: deepseek-r1-amd - - image: lmsysorg/sglang:v0.4.1.post4-rocm620 - env: - - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B - commands: - - python3 -m sglang.launch_server - --model-path $MODEL_ID - --port 8000 - --trust-remote-code - - port: 8000 - model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B - - resources: - gpu: MI300X - disk: 300Gb - - ``` -
- -=== "vLLM" - -
- - ```yaml - type: service - name: deepseek-r1-amd - - image: rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4 - env: - - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B - - MAX_MODEL_LEN=126432 - commands: - - vllm serve $MODEL_ID - --max-model-len $MAX_MODEL_LEN - --trust-remote-code - port: 8000 - - model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B - - resources: - gpu: MI300X - disk: 300Gb - ``` -
- -Note, when using `Deepseek-R1-Distill-Llama-70B` with `vLLM` with a 192GB GPU, we must limit the context size to 126432 tokens to fit the memory. - -### NVIDIA - -Here's an example of a service that deploys `Deepseek-R1-Distill-Llama-8B` -using [SGLang](https://github.com/sgl-project/sglang) -and [vLLM](https://github.com/vllm-project/vllm) with NVIDIA GPUs. -Both SGLang and vLLM also support `Deepseek-V2-Lite`. - -=== "SGLang" -
- - ```yaml - type: service - name: deepseek-r1 - - image: lmsysorg/sglang:latest - env: - - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-8B - commands: - - python3 -m sglang.launch_server - --model-path $MODEL_ID - --port 8000 - --trust-remote-code - - port: 8000 - model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B - - resources: - gpu: 24GB - ``` -
- -=== "vLLM" -
- - ```yaml - type: service - name: deepseek-r1 - - image: vllm/vllm-openai:latest - env: - - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-8B - - MAX_MODEL_LEN=4096 - commands: - - vllm serve $MODEL_ID - --max-model-len $MAX_MODEL_LEN - port: 8000 - model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B - - resources: - gpu: 24GB - ``` -
- -Note, to run `Deepseek-R1-Distill-Llama-8B` with `vLLM` with a 24GB GPU, we must limit the context size to 4096 tokens to fit the memory. - -> To run `Deepseek-V2-Lite` with `vLLM`, we must use 40GB GPU and to run `Deepseek-V2-Lite` with SGLang, we must use -> 80GB GPU. For more details on SGlang's memory requirements you can refer to -> this [issue](https://github.com/sgl-project/sglang/issues/3451). - -### Memory requirements - -Approximate memory requirements for loading the model (excluding context and CUDA/ROCm kernel reservations). - -| Model | Size | FP16 | FP8 | INT4 | -|-----------------------------|----------|--------|--------|--------| -| `Deepseek-R1` | **671B** | 1.35TB | 671GB | 336GB | -| `DeepSeek-R1-Distill-Llama` | **70B** | 161GB | 80.5GB | 40B | -| `DeepSeek-R1-Distill-Qwen` | **32B** | 74GB | 37GB | 18.5GB | -| `DeepSeek-V2-Lite` | **16B** | 35GB | 17.5GB | 8.75GB | -| `DeepSeek-R1-Distill-Qwen` | **14B** | 32GB | 16GB | 8GB | -| `DeepSeek-R1-Distill-Llama` | **8B** | 18GB | 9GB | 4.5GB | -| `DeepSeek-R1-Distill-Qwen` | **7B** | 16GB | 8GB | 4GB | - -For example, the FP8 version of Deepseek-R1 671B fits on a single node of MI300X with eight 192GB GPUs, a single node of -H200 with eight 141GB GPUs. - -### Applying the configuration - -To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command. - -
- -```shell -$ dstack apply -f examples/llms/deepseek/sglang/amd/.dstack.yml - - # BACKEND REGION RESOURCES SPOT PRICE - 1 runpod EU-RO-1 24xCPU, 283GB, 1xMI300X (192GB) no $2.49 - -Submit the run deepseek-r1? [y/n]: y - -Provisioning... ----> 100% -``` -
- -If no gateway is created, the service endpoint will be available at `/proxy/services///`. - -
- -```shell -curl http://127.0.0.1:3000/proxy/services/main/deepseek-r1/v1/chat/completions \ - -X POST \ - -H 'Authorization: Bearer <dstack token>' \ - -H 'Content-Type: application/json' \ - -d '{ - "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B", - "messages": [ - { - "role": "system", - "content": "You are a helpful assistant." - }, - { - "role": "user", - "content": "What is Deep Learning?" - } - ], - "stream": true, - "max_tokens": 512 - }' -``` -
- -When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint will be available at `https://deepseek-r1./`. - -## Fine-tuning - -### AMD - -Here are the examples of LoRA fine-tuning of `Deepseek-V2-Lite` and GRPO fine-tuning of `DeepSeek-R1-Distill-Qwen-1.5B` on `MI300X` GPU using HuggingFace's [TRL](https://github.com/huggingface/trl). - -=== "LoRA" - -
- - ```yaml - type: task - name: trl-train - - image: rocm/pytorch:rocm6.2.3_ubuntu22.04_py3.10_pytorch_release_2.3.0 - - env: - - WANDB_API_KEY - - WANDB_PROJECT - - MODEL_ID=deepseek-ai/DeepSeek-V2-Lite - - ACCELERATE_USE_FSDP=False - commands: - - git clone https://github.com/huggingface/peft.git - - pip install trl - - pip install "numpy<2" - - pip install peft - - pip install wandb - - cd peft/examples/sft - - python train.py - --seed 100 - --model_name_or_path "deepseek-ai/DeepSeek-V2-Lite" - --dataset_name "smangrul/ultrachat-10k-chatml" - --chat_template_format "chatml" - --add_special_tokens False - --append_concat_token False - --splits "train,test" - --max_seq_len 512 - --num_train_epochs 1 - --logging_steps 5 - --log_level "info" - --logging_strategy "steps" - --eval_strategy "epoch" - --save_strategy "epoch" - --hub_private_repo True - --hub_strategy "every_save" - --packing True - --learning_rate 1e-4 - --lr_scheduler_type "cosine" - --weight_decay 1e-4 - --warmup_ratio 0.0 - --max_grad_norm 1.0 - --output_dir "deepseek-sft-lora" - --per_device_train_batch_size 8 - --per_device_eval_batch_size 8 - --gradient_accumulation_steps 4 - --gradient_checkpointing True - --use_reentrant True - --dataset_text_field "content" - --use_peft_lora True - --lora_r 16 - --lora_alpha 16 - --lora_dropout 0.05 - --lora_target_modules "all-linear" - - resources: - gpu: MI300X - disk: 150GB - ``` -
- -=== "GRPO" - -
- ```yaml - type: task - name: trl-train-grpo - - image: rocm/pytorch:rocm6.2.3_ubuntu22.04_py3.10_pytorch_release_2.3.0 - - env: - - WANDB_API_KEY - - WANDB_PROJECT - - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B - files: - - grpo_train.py - commands: - - pip install trl - - pip install datasets - # numPy version less than 2 is required for the scipy installation with AMD. - - pip install "numpy<2" - - mkdir -p grpo_example - - cp grpo_train.py grpo_example/grpo_train.py - - cd grpo_example - - python grpo_train.py - --model_name_or_path $MODEL_ID - --dataset_name trl-lib/tldr - --per_device_train_batch_size 2 - --logging_steps 25 - --output_dir Deepseek-Distill-Qwen-1.5B-GRPO - --trust_remote_code - - resources: - gpu: MI300X - disk: 150GB - ``` -
- -Note, the `GRPO` fine-tuning of `DeepSeek-R1-Distill-Qwen-1.5B` consumes up to 135GB of VRAM. - -### NVIDIA - -Here are examples of LoRA fine-tuning of `DeepSeek-R1-Distill-Qwen-1.5B` and QLoRA fine-tuning of `DeepSeek-V2-Lite` -on NVIDIA GPU using HuggingFace's [TRL](https://github.com/huggingface/trl) library. - -=== "LoRA" -
- - ```yaml - type: task - name: trl-train - - python: 3.12 - - env: - - WANDB_API_KEY - - WANDB_PROJECT - - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B - commands: - - git clone https://github.com/huggingface/trl.git - - pip install trl - - pip install peft - - pip install wandb - - cd trl/trl/scripts - - python sft.py - --model_name_or_path $MODEL_ID - --dataset_name trl-lib/Capybara - --learning_rate 2.0e-4 - --num_train_epochs 1 - --packing - --per_device_train_batch_size 2 - --gradient_accumulation_steps 8 - --gradient_checkpointing - --logging_steps 25 - --eval_strategy steps - --eval_steps 100 - --use_peft - --lora_r 32 - --lora_alpha 16 - --report_to wandb - --output_dir DeepSeek-R1-Distill-Qwen-1.5B-SFT - - resources: - gpu: 24GB - ``` -
- -=== "QLoRA" -
- - ```yaml - type: task - name: trl-train-deepseek-v2 - - python: 3.12 - nvcc: true - env: - - WANDB_API_KEY - - WANDB_PROJECT - - MODEL_ID=deepseek-ai/DeepSeek-V2-Lite - - ACCELERATE_USE_FSDP=False - commands: - - git clone https://github.com/huggingface/peft.git - - pip install trl - - pip install peft - - pip install wandb - - pip install bitsandbytes - - cd peft/examples/sft - - python train.py - --seed 100 - --model_name_or_path "deepseek-ai/DeepSeek-V2-Lite" - --dataset_name "smangrul/ultrachat-10k-chatml" - --chat_template_format "chatml" - --add_special_tokens False - --append_concat_token False - --splits "train,test" - --max_seq_len 512 - --num_train_epochs 1 - --logging_steps 5 - --log_level "info" - --logging_strategy "steps" - --eval_strategy "epoch" - --save_strategy "epoch" - --hub_private_repo True - --hub_strategy "every_save" - --bf16 True - --packing True - --learning_rate 1e-4 - --lr_scheduler_type "cosine" - --weight_decay 1e-4 - --warmup_ratio 0.0 - --max_grad_norm 1.0 - --output_dir "mistral-sft-lora" - --per_device_train_batch_size 8 - --per_device_eval_batch_size 8 - --gradient_accumulation_steps 4 - --gradient_checkpointing True - --use_reentrant True - --dataset_text_field "content" - --use_peft_lora True - --lora_r 16 - --lora_alpha 16 - --lora_dropout 0.05 - --lora_target_modules "all-linear" - --use_4bit_quantization True - --use_nested_quant True - --bnb_4bit_compute_dtype "bfloat16" - - resources: - # Consumes ~25GB of VRAM for QLoRA fine-tuning deepseek-ai/DeepSeek-V2-Lite - gpu: 48GB - ``` -
- -### Memory requirements - -| Model | Size | Full fine-tuning | LoRA | QLoRA | -|-----------------------------|----------|------------------|-------|-------| -| `Deepseek-R1` | **671B** | 10.5TB | 1.4TB | 442GB | -| `DeepSeek-R1-Distill-Llama` | **70B** | 1.09TB | 151GB | 46GB | -| `DeepSeek-R1-Distill-Qwen` | **32B** | 512GB | 70GB | 21GB | -| `DeepSeek-V2-Lite` | **16B** | 256GB | 35GB | 11GB | -| `DeepSeek-R1-Distill-Qwen` | **14B** | 224GB | 30GB | 9GB | -| `DeepSeek-R1-Distill-Llama` | **8B** | 128GB | 17GB | 5GB | -| `DeepSeek-R1-Distill-Qwen` | **7B** | 112GB | 15GB | 4GB | -| `DeepSeek-R1-Distill-Qwen` | **1.5B** | 24GB | 3.2GB | 1GB | - -The memory requirements assume low-rank update matrices are 1% of model parameters. In practice, a 7B model with QLoRA -needs 7–10GB due to intermediate hidden states. - -| Fine-tuning type | Calculation | -|------------------|--------------------------------------------------| -| Full fine-tuning | 671B × 16 bytes = 10.48TB | -| LoRA | 671B × 2 bytes + 1% of 671B × 16 bytes = 1.41TB | -| QLoRA(4-bit) | 671B × 0.5 bytes + 1% of 671B × 16 bytes = 442GB | - -## Source code - -The source-code of this example can be found in -[`examples/llms/deepseek`](https://github.com/dstackai/dstack/blob/master/examples/llms/deepseek). - -!!! info "What's next?" - 1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks), - [services](https://dstack.ai/docs/services), and [protips](https://dstack.ai/docs/protips). 
diff --git a/examples/llms/deepseek/sglang/amd/.dstack.yml b/examples/llms/deepseek/sglang/amd/.dstack.yml deleted file mode 100644 index 99a19bfee3..0000000000 --- a/examples/llms/deepseek/sglang/amd/.dstack.yml +++ /dev/null @@ -1,18 +0,0 @@ -type: service -name: deepseek-r1-amd - -image: lmsysorg/sglang:v0.4.1.post4-rocm620 -env: - - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B -commands: - - python3 -m sglang.launch_server - --model-path $MODEL_ID - --port 8000 - --trust-remote-code - -port: 8000 -model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B - -resources: - gpu: mi300x - disk: 300Gb diff --git a/examples/llms/deepseek/sglang/amd/deepseek_v2_lite.dstack.yml b/examples/llms/deepseek/sglang/amd/deepseek_v2_lite.dstack.yml deleted file mode 100644 index 01ef71a6be..0000000000 --- a/examples/llms/deepseek/sglang/amd/deepseek_v2_lite.dstack.yml +++ /dev/null @@ -1,18 +0,0 @@ -type: service -name: deepseek-v2-lite-amd - -image: lmsysorg/sglang:v0.4.1.post4-rocm620 -env: - - MODEL_ID=deepseek-ai/DeepSeek-V2-Lite -commands: - - python3 -m sglang.launch_server - --model-path $MODEL_ID - --port 8000 - --trust-remote-code - -port: 8000 -model: deepseek-ai/DeepSeek-V2-Lite - -resources: - gpu: mi300x - disk: 150Gb diff --git a/examples/llms/deepseek/sglang/nvidia/.dstack.yml b/examples/llms/deepseek/sglang/nvidia/.dstack.yml deleted file mode 100644 index d1c92b64d1..0000000000 --- a/examples/llms/deepseek/sglang/nvidia/.dstack.yml +++ /dev/null @@ -1,18 +0,0 @@ -type: service -name: deepseek-r1-nvidia - -image: lmsysorg/sglang:latest -env: - - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-8B -commands: - - python3 -m sglang.launch_server - --model-path $MODEL_ID - --port 8000 - --trust-remote-code - -port: 8000 - -model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B - -resources: - gpu: 24GB diff --git a/examples/llms/deepseek/sglang/nvidia/deepseek_v2_lite.dstack.yml b/examples/llms/deepseek/sglang/nvidia/deepseek_v2_lite.dstack.yml deleted file mode 100644 index 
8c0adaa41b..0000000000 --- a/examples/llms/deepseek/sglang/nvidia/deepseek_v2_lite.dstack.yml +++ /dev/null @@ -1,19 +0,0 @@ -# Not Working https://github.com/sgl-project/sglang/issues/3451 -type: service -name: deepseek-v2-lite-nvidia - -image: lmsysorg/sglang:latest -env: - - MODEL_ID=deepseek-ai/DeepSeek-V2-Lite -commands: - - python3 -m sglang.launch_server - --model-path $MODEL_ID - --port 8000 - --trust-remote-code - -port: 8000 - -model: deepseek-ai/DeepSeek-V2-Lite - -resources: - gpu: 80GB diff --git a/examples/llms/deepseek/vllm/amd/.dstack.yml b/examples/llms/deepseek/vllm/amd/.dstack.yml deleted file mode 100644 index 23bfb033c0..0000000000 --- a/examples/llms/deepseek/vllm/amd/.dstack.yml +++ /dev/null @@ -1,19 +0,0 @@ -type: service -name: deepseek-r1-amd - -image: rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4 -env: - - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B - - MAX_MODEL_LEN=126432 -commands: - - vllm serve $MODEL_ID - --max-model-len $MAX_MODEL_LEN - --trust-remote-code -port: 8000 - -model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B - - -resources: - gpu: mi300x - disk: 300Gb diff --git a/examples/llms/deepseek/vllm/amd/deepseek_v2_lite.dstack.yml b/examples/llms/deepseek/vllm/amd/deepseek_v2_lite.dstack.yml deleted file mode 100644 index 8937e95266..0000000000 --- a/examples/llms/deepseek/vllm/amd/deepseek_v2_lite.dstack.yml +++ /dev/null @@ -1,18 +0,0 @@ -type: service -name: deepseek-v2-lite-amd - -image: rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4 -env: - - MODEL_ID=deepseek-ai/DeepSeek-V2-Lite -commands: - - vllm serve $MODEL_ID - --trust-remote-code - -port: 8000 - -model: deepseek-ai/DeepSeek-V2-Lite - - -resources: - gpu: mi300x - disk: 150Gb diff --git a/examples/llms/deepseek/vllm/nvidia/.dstack.yml b/examples/llms/deepseek/vllm/nvidia/.dstack.yml deleted file mode 100644 index e623b182c4..0000000000 --- a/examples/llms/deepseek/vllm/nvidia/.dstack.yml +++ /dev/null @@ -1,17 +0,0 @@ -type: service -name: 
deepseek-r1-nvidia - -image: vllm/vllm-openai:latest -env: - - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-8B - - MAX_MODEL_LEN=4096 -commands: - - vllm serve $MODEL_ID - --max-model-len $MAX_MODEL_LEN - -port: 8000 - -model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B - -resources: - gpu: 24GB diff --git a/examples/llms/deepseek/vllm/nvidia/deepseek_v2_lite.dstack.yml b/examples/llms/deepseek/vllm/nvidia/deepseek_v2_lite.dstack.yml deleted file mode 100644 index 06e78f379b..0000000000 --- a/examples/llms/deepseek/vllm/nvidia/deepseek_v2_lite.dstack.yml +++ /dev/null @@ -1,19 +0,0 @@ -type: service -name: deepseek-v2-lite-nvidia - -image: vllm/vllm-openai:latest -env: - - MODEL_ID=deepseek-ai/DeepSeek-V2-Lite - - MAX_MODEL_LEN=4096 -commands: - - vllm serve $MODEL_ID - --max-model-len $MAX_MODEL_LEN - --tensor-parallel-size $DSTACK_GPUS_NUM - --trust-remote-code - -port: 8000 - -model: deepseek-ai/DeepSeek-V2-Lite - -resources: - gpu: 48GB diff --git a/examples/llms/llama/README.md b/examples/llms/llama/README.md deleted file mode 100644 index 3f2b8ab54d..0000000000 --- a/examples/llms/llama/README.md +++ /dev/null @@ -1,288 +0,0 @@ -# Llama - -This example walks you through how to deploy Llama 4 Scout model with `dstack`. - -??? info "Prerequisites" - Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples. - -
- - ```shell - $ git clone https://github.com/dstackai/dstack - $ cd dstack - ``` - -
- -## Deployment - -### AMD -Here's an example of a service that deploys -[`Llama-4-Scout-17B-16E-Instruct`](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct) -using [vLLM](https://github.com/vllm-project/vllm) -with AMD `MI300X` GPUs. - -
- -```yaml -type: service -name: llama4-scout - -image: rocm/vllm-dev:llama4-20250407 -env: - - HF_TOKEN - - MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct - - VLLM_WORKER_MULTIPROC_METHOD=spawn - - VLLM_USE_MODELSCOPE=False - - VLLM_USE_TRITON_FLASH_ATTN=0 - - MAX_MODEL_LEN=256000 - -commands: - - | - vllm serve $MODEL_ID \ - --tensor-parallel-size $DSTACK_GPUS_NUM \ - --max-model-len $MAX_MODEL_LEN \ - --kv-cache-dtype fp8 \ - --max-num-seqs 64 \ - --override-generation-config='{"attn_temperature_tuning": true}' - - -port: 8000 -# Register the model -model: meta-llama/Llama-4-Scout-17B-16E-Instruct - -resources: - gpu: Mi300x:2 - disk: 500GB.. -``` -
- -### NVIDIA -Here's an example of a service that deploys -[`Llama-4-Scout-17B-16E-Instruct`](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct) -using [SGLang](https://github.com/sgl-project/sglang) and [vLLM](https://github.com/vllm-project/vllm) -with NVIDIA `H200` GPUs. - -=== "SGLang" - -
- - ```yaml - type: service - name: llama4-scout - - image: lmsysorg/sglang - env: - - HF_TOKEN - - MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct - - CONTEXT_LEN=256000 - commands: - - python3 -m sglang.launch_server - --model-path $MODEL_ID - --tp $DSTACK_GPUS_NUM - --context-length $CONTEXT_LEN - --kv-cache-dtype fp8_e5m2 - --port 8000 - - port: 8000 - ## Register the model - model: meta-llama/Llama-4-Scout-17B-16E-Instruct - - resources: - gpu: H200:2 - disk: 500GB.. - ``` -
- -=== "vLLM" - -
- - ```yaml - type: service - name: llama4-scout - - image: vllm/vllm-openai - env: - - HF_TOKEN - - MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct - - VLLM_DISABLE_COMPILE_CACHE=1 - - MAX_MODEL_LEN=256000 - commands: - - | - vllm serve $MODEL_ID \ - --tensor-parallel-size $DSTACK_GPUS_NUM \ - --max-model-len $MAX_MODEL_LEN \ - --kv-cache-dtype fp8 \ - --override-generation-config='{"attn_temperature_tuning": true}' - - port: 8000 - # Register the model - model: meta-llama/Llama-4-Scout-17B-16E-Instruct - - resources: - gpu: H200:2 - disk: 500GB.. - ``` -
- -!!! info "NOTE:" - With vLLM, add `--override-generation-config='{"attn_temperature_tuning": true}'` to - improve accuracy for [contexts longer than 32K tokens](https://blog.vllm.ai/2025/04/05/llama4.html). - -### Memory requirements - -Below are the approximate memory requirements for loading the model. -This excludes memory for the model context and CUDA kernel reservations. - -| Model | Size | FP16 | FP8 | INT4 | -|---------------|----------|--------|--------|--------| -| `Behemoth` | **2T** | 4TB | 2TB | 1TB | -| `Maverick` | **400B** | 800GB | 200GB | 100GB | -| `Scout` | **109B** | 218GB | 109GB | 54.5GB | - - -### Running a configuration - -To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command. - -
- -```shell -$ HF_TOKEN=... -$ dstack apply -f examples/llms/llama/sglang/nvidia/.dstack.yml - - # BACKEND REGION RESOURCES SPOT PRICE - 1 vastai is-iceland 48xCPU, 128GB, 2xH200 (140GB) no $7.87 - 2 runpod EU-SE-1 40xCPU, 128GB, 2xH200 (140GB) no $7.98 - - -Submit the run llama4-scout? [y/n]: y - -Provisioning... ----> 100% -``` - -
- -Once the service is up, it will be available via the service endpoint -at `<dstack server URL>/proxy/services/<project name>/<run name>/`. - -
- -```shell -curl http://127.0.0.1:3000/proxy/services/main/llama4-scout/v1/chat/completions \ - -X POST \ - -H 'Authorization: Bearer <dstack token>' \ - -H 'Content-Type: application/json' \ - -d '{ - "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct", - "messages": [ - { - "role": "system", - "content": "You are a helpful assistant." - }, - { - "role": "user", - "content": "What is Deep Learning?" - } - ], - "stream": true, - "max_tokens": 512 - }' -``` - -
- -When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint -is available at `https://<run name>.<gateway domain>/`. - -[//]: # (TODO: https://github.com/dstackai/dstack/issues/1777) - -## Fine-tuning - -Here's an example of FSDP and QLoRA fine-tuning of the 4-bit quantized [Llama-4-Scout-17B-16E](https://huggingface.co/axolotl-quants/Llama-4-Scout-17B-16E-Linearized-bnb-nf4-bf16) on 2xH100 NVIDIA GPUs using [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl). - -
- -```yaml -type: task -# The name is optional, if not specified, generated randomly -name: axolotl-nvidia-llama-scout-train - -# Using the official Axolotl's Docker image -image: axolotlai/axolotl:main-latest - -# Required environment variables -env: - - HF_TOKEN - - WANDB_API_KEY - - WANDB_PROJECT - - WANDB_NAME=axolotl-nvidia-llama-scout-train - - HUB_MODEL_ID -# Commands of the task -commands: - - wget https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/examples/llama-4/scout-qlora-fsdp1.yaml - - axolotl train scout-qlora-fsdp1.yaml - --wandb-project $WANDB_PROJECT - --wandb-name $WANDB_NAME - --hub-model-id $HUB_MODEL_ID - -resources: - # Two GPU (required by FSDP) - gpu: H100:2 - # Shared memory size for inter-process communication - shm_size: 24GB - disk: 500GB.. -``` -
- -The task uses Axolotl's Docker image, with Axolotl pre-installed. - -### Memory requirements - -Below are the approximate memory requirements for fine-tuning the model. -This excludes memory for the model context and CUDA kernel reservations. - -| Model | Size | Full fine-tuning | LoRA | QLoRA | -|---------------|----------|--------------------|--------|--------| -| `Behemoth` | **2T** | 32TB | 4.3TB | 1.3TB | -| `Maverick` | **400B** | 6.5TB | 864GB | 264GB | -| `Scout` | **109B** | 1.75TB | 236GB | 72GB | - -The memory estimates assume FP16 precision for model weights, with low-rank adaptation (LoRA/QLoRA) layers comprising 1% of the total model parameters. - -| Fine-tuning type | Calculation | -|------------------|--------------------------------------------------| -| Full fine-tuning | 2T × 16 bytes = 32TB | -| LoRA | 2T × 2 bytes + 1% of 2T × 16 bytes = 4.3TB | -| QLoRA(4-bit) | 2T × 0.5 bytes + 1% of 2T × 16 bytes = 1.3TB | - -### Running a configuration - -Once the configuration is ready, run `dstack apply -f <configuration file>`, and `dstack` will automatically provision the -cloud resources and run the configuration. - -
- -```shell -$ HF_TOKEN=... -$ WANDB_API_KEY=... -$ WANDB_PROJECT=... -$ WANDB_NAME=axolotl-nvidia-llama-scout-train -$ HUB_MODEL_ID=... -$ dstack apply -f examples/single-node-training/axolotl/.dstack.yml -``` - -
- -## Source code - -The source-code for deployment examples can be found in -[`examples/llms/llama`](https://github.com/dstackai/dstack/blob/master/examples/llms/llama) and the source-code for the finetuning example can be found in [`examples/single-node-training/axolotl`](https://github.com/dstackai/dstack/blob/master/examples/single-node-training/axolotl). - -## What's next? - -1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks), - [services](https://dstack.ai/docs/services), and [protips](https://dstack.ai/docs/protips). -2. Browse [Llama 4 with SGLang](https://github.com/sgl-project/sglang/blob/main/docs/references/llama4.md), [Llama 4 with vLLM](https://blog.vllm.ai/2025/04/05/llama4.html), [Llama 4 with AMD](https://rocm.blogs.amd.com/artificial-intelligence/llama4-day-0-support/README.html) and [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl). diff --git a/examples/llms/llama/sglang/nvidia/.dstack.yml b/examples/llms/llama/sglang/nvidia/.dstack.yml deleted file mode 100644 index b1aea2ce51..0000000000 --- a/examples/llms/llama/sglang/nvidia/.dstack.yml +++ /dev/null @@ -1,23 +0,0 @@ -type: service -name: llama4-scout - -image: lmsysorg/sglang -env: - - HF_TOKEN - - MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct - - CONTEXT_LEN=256000 -commands: - - python3 -m sglang.launch_server - --model-path $MODEL_ID - --tp $DSTACK_GPUS_NUM - --context-length $CONTEXT_LEN - --port 8000 - --kv-cache-dtype fp8_e5m2 - -port: 8000 -## Register the model -model: meta-llama/Llama-4-Scout-17B-16E-Instruct - -resources: - gpu: H200:2 - disk: 500GB.. 
diff --git a/examples/llms/llama/vllm/amd/.dstack.yml b/examples/llms/llama/vllm/amd/.dstack.yml deleted file mode 100644 index eaaf1e6aeb..0000000000 --- a/examples/llms/llama/vllm/amd/.dstack.yml +++ /dev/null @@ -1,29 +0,0 @@ -type: service -name: llama4-scout - -image: rocm/vllm-dev:llama4-20250407 -env: - - HF_TOKEN - - MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct - - VLLM_WORKER_MULTIPROC_METHOD=spawn - - VLLM_USE_MODELSCOPE=False - - VLLM_USE_TRITON_FLASH_ATTN=0 - - MAX_MODEL_LEN=256000 - -commands: - - | - vllm serve $MODEL_ID \ - --tensor-parallel-size $DSTACK_GPUS_NUM \ - --max-model-len $MAX_MODEL_LEN \ - --kv-cache-dtype fp8 \ - --max-num-seqs 64 \ - --override-generation-config='{"attn_temperature_tuning": true}' - - -port: 8000 -# Register the model -model: meta-llama/Llama-4-Scout-17B-16E-Instruct - -resources: - gpu: Mi300x:2 - disk: 500GB.. diff --git a/examples/llms/llama/vllm/nvidia/.dstack.yml b/examples/llms/llama/vllm/nvidia/.dstack.yml deleted file mode 100644 index c1f8e4919d..0000000000 --- a/examples/llms/llama/vllm/nvidia/.dstack.yml +++ /dev/null @@ -1,24 +0,0 @@ -type: service -name: llama4-scout - -image: vllm/vllm-openai -env: - - HF_TOKEN - - MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct - - VLLM_DISABLE_COMPILE_CACHE=1 - - MAX_MODEL_LEN=256000 -commands: - - | - vllm serve $MODEL_ID \ - --tensor-parallel-size $DSTACK_GPUS_NUM \ - --max-model-len $MAX_MODEL_LEN \ - --kv-cache-dtype fp8 \ - --override-generation-config='{"attn_temperature_tuning": true}' - -port: 8000 -# Register the model -model: meta-llama/Llama-4-Scout-17B-16E-Instruct - -resources: - gpu: H200:2 - disk: 500GB.. diff --git a/examples/llms/llama31/README.md b/examples/llms/llama31/README.md deleted file mode 100644 index b99362cde0..0000000000 --- a/examples/llms/llama31/README.md +++ /dev/null @@ -1,384 +0,0 @@ -# Llama 3.1 - -This example walks you through how to deploy and fine-tune Llama 3.1 with `dstack`. - -??? 
info "Prerequisites" - Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples. - -
- - ```shell - $ git clone https://github.com/dstackai/dstack - $ cd dstack - ``` - -
- -## Deployment - -You can use any serving framework. -Here's an example of a service that deploys Llama 3.1 8B using vLLM, TGI, and NIM. - -=== "vLLM" - -
- - ```yaml - type: service - name: llama31 - - python: "3.11" - env: - - HF_TOKEN - - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct - - MAX_MODEL_LEN=4096 - commands: - - pip install vllm - - vllm serve $MODEL_ID - --max-model-len $MAX_MODEL_LEN - --tensor-parallel-size $DSTACK_GPUS_NUM - port: 8000 - # Register the model - model: meta-llama/Meta-Llama-3.1-8B-Instruct - - # Uncomment to leverage spot instances - #spot_policy: auto - - # Uncomment to cache downloaded models - #volumes: - # - /root/.cache/huggingface/hub:/root/.cache/huggingface/hub - - resources: - gpu: 24GB - # Uncomment if using multiple GPUs - #shm_size: 24GB - ``` - -
- -=== "TGI" - -
- - ```yaml - type: service - name: llama31 - - image: ghcr.io/huggingface/text-generation-inference:latest - env: - - HF_TOKEN - - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct - - MAX_INPUT_LENGTH=4000 - - MAX_TOTAL_TOKENS=4096 - commands: - - NUM_SHARD=$DSTACK_GPUS_NUM text-generation-launcher - port: 80 - # Register the model - model: meta-llama/Meta-Llama-3.1-8B-Instruct - - # Uncomment to leverage spot instances - #spot_policy: auto - - # Uncomment to cache downloaded models - #volumes: - # - /data:/data - - resources: - gpu: 24GB - # Uncomment if using multiple GPUs - #shm_size: 24GB - ``` - -
- -=== "NIM" - -
- - ```yaml - type: service - name: llama31 - - image: nvcr.io/nim/meta/llama-3.1-8b-instruct:latest - env: - - NGC_API_KEY - - NIM_MAX_MODEL_LEN=4096 - registry_auth: - username: $oauthtoken - password: ${{ env.NGC_API_KEY }} - port: 8000 - # Register the model - model: meta/llama-3.1-8b-instruct - - # Uncomment to leverage spot instances - #spot_policy: auto - - # Cache downloaded models - volumes: - - /root/.cache/nim:/opt/nim/.cache - - resources: - gpu: 24GB - # Uncomment if using multiple GPUs - #shm_size: 24GB - ``` - -
- -Note that when using Llama 3.1 8B with a 24GB GPU, we must limit the context size to 4096 tokens to fit into memory. - -### Memory requirements - -Below are the approximate memory requirements for loading the model. -This excludes memory for the model context and CUDA kernel reservations. - -| Model size | FP16 | FP8 | INT4 | -|------------|-------|-------|-------| -| **8B** | 16GB | 8GB | 4GB | -| **70B** | 140GB | 70GB | 35GB | -| **405B** | 810GB | 405GB | 203GB | - -For example, the FP16 version of Llama 3.1 405B won't fit into a single machine with eight 80GB GPUs, so we'd need at least two -nodes. - -### Quantization - -The INT4 version of Llama 3.1 70B can fit into two 40GB GPUs. - -[//]: # (TODO: Example: INT4 / 70B / 40GB:2) - -The INT4 version of Llama 3.1 405B can fit into eight 40GB GPUs. - -[//]: # (TODO: Example: INT4 / 405B / 40GB:8) - -Useful links: - - * [Meta's official FP8 quantized version of Llama 3.1 405B](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct-FP8) (with minimal accuracy degradation) - * [Llama 3.1 Quantized Models](https://huggingface.co/collections/hugging-quants/llama-31-gptq-awq-and-bnb-quants-669fa7f50f6e713fd54bd198) with quantized checkpoints - -### Running a configuration - -To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command. - -
- -```shell -$ HF_TOKEN=... -$ dstack apply -f examples/llms/llama31/vllm/.dstack.yml - - # BACKEND REGION RESOURCES SPOT PRICE - 1 runpod CA-MTL-1 18xCPU, 100GB, A5000:24GB yes $0.12 - 2 runpod EU-SE-1 18xCPU, 100GB, A5000:24GB yes $0.12 - 3 gcp us-west4 27xCPU, 150GB, A5000:24GB:2 yes $0.23 - -Submit the run llama31? [y/n]: y - -Provisioning... ----> 100% -``` - -
- -If no gateway is created, the service endpoint will be available at `<dstack server URL>/proxy/services/<project name>/<run name>/`. - -
- -```shell -$ curl http://127.0.0.1:3000/proxy/services/main/llama31/v1/chat/completions \ - -X POST \ - -H 'Authorization: Bearer <dstack token>' \ - -H 'Content-Type: application/json' \ - -d '{ - "model": "llama3.1", - "messages": [ - { - "role": "system", - "content": "You are a helpful assistant." - }, - { - "role": "user", - "content": "What is Deep Learning?" - } - ], - "max_tokens": 128 - }' -``` - -
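The fit claims in the quantization section above (e.g., the INT4 version of 70B fitting into two 40GB GPUs) can be checked with the same bytes-per-parameter arithmetic; a rough sketch, where the `headroom` factor is our illustrative allowance for context and CUDA overhead:

```python
# Approximate bytes per parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def fits(params_b, precision, gpu_gb, gpu_count, headroom=0.9):
    """True if the weights alone fit into `gpu_count` GPUs of `gpu_gb` GB each,
    keeping ~10% free for model context and CUDA kernel reservations."""
    return params_b * BYTES_PER_PARAM[precision] <= gpu_gb * gpu_count * headroom

print(fits(70, "int4", 40, 2))   # True -- 35GB of weights across 2x40GB
print(fits(405, "fp16", 80, 8))  # False -- 810GB exceeds a single 8x80GB node
```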
- -When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint will be available at `https://llama31.<gateway domain>/`. - -[//]: # (TODO: How to prompting and tool calling) - -[//]: # (TODO: Synthetic data generation) - -## Fine-tuning - -### Running on multiple GPUs - -Below is the task configuration file for fine-tuning Llama 3.1 8B using TRL on the -[`OpenAssistant/oasst_top1_2023-08-25`](https://huggingface.co/datasets/OpenAssistant/oasst_top1_2023-08-25) dataset. - -
- -```yaml -type: task -name: trl-train - -python: 3.12 -# Ensure nvcc is installed (req. for Flash Attention) -nvcc: true -env: - - HF_TOKEN - - WANDB_API_KEY -commands: - - pip install "transformers>=4.43.2" - - pip install bitsandbytes - - pip install flash-attn --no-build-isolation - - pip install peft - - pip install wandb - - git clone https://github.com/huggingface/trl - - cd trl - - pip install . - - accelerate launch - --config_file=examples/accelerate_configs/multi_gpu.yaml - --num_processes $DSTACK_GPUS_PER_NODE - examples/scripts/sft.py - --model_name meta-llama/Meta-Llama-3.1-8B - --dataset_name OpenAssistant/oasst_top1_2023-08-25 - --dataset_text_field="text" - --per_device_train_batch_size 1 - --per_device_eval_batch_size 1 - --gradient_accumulation_steps 4 - --learning_rate 2e-4 - --report_to wandb - --bf16 - --max_seq_length 1024 - --lora_r 16 --lora_alpha 32 - --lora_target_modules q_proj k_proj v_proj o_proj - --load_in_4bit - --use_peft - --attn_implementation "flash_attention_2" - --logging_steps=10 - --output_dir models/llama31 - --hub_model_id peterschmidt85/FineLlama-3.1-8B - -resources: - gpu: - # 24GB or more VRAM - memory: 24GB.. - # One or more GPU - count: 1.. - # Shared memory (for multi-gpu) - shm_size: 24GB -``` - -
- -Change the `resources` property to specify more GPUs. - -### Memory requirements - -Below are the approximate memory requirements for fine-tuning Llama 3.1. - -| Model size | Full fine-tuning | LoRA | QLoRA | -|------------|------------------|-------|-------| -| **8B** | 60GB | 16GB | 6GB | -| **70B** | 500GB | 160GB | 48GB | -| **405B** | 3.25TB | 950GB | 250GB | - -The requirements can be significantly reduced with certain optimizations. - -### DeepSpeed - -For more memory-efficient use of multiple GPUs, consider using DeepSpeed and ZeRO Stage 3. - -To do this, use the `examples/accelerate_configs/deepspeed_zero3.yaml` configuration file instead of -`examples/accelerate_configs/multi_gpu.yaml`. - -### Running on multiple nodes - -In case the model doesn't fit into a single GPU, consider running a `dstack` task on multiple nodes. -Below is the corresponding task configuration file. - -
- -```yaml -type: task -name: trl-train-distrib - -# Size of the cluster -nodes: 2 - -python: "3.10" -# Ensure nvcc is installed (req. for Flash Attention) -nvcc: true - -env: - - HF_TOKEN - - WANDB_API_KEY -commands: - - pip install "transformers>=4.43.2" - - pip install bitsandbytes - - pip install flash-attn --no-build-isolation - - pip install peft - - pip install wandb - - git clone https://github.com/huggingface/trl - - cd trl - - pip install . - - accelerate launch - --config_file=examples/accelerate_configs/fsdp_qlora.yaml - --main_process_ip=$DSTACK_MASTER_NODE_IP - --main_process_port=8008 - --machine_rank=$DSTACK_NODE_RANK - --num_processes=$DSTACK_GPUS_NUM - --num_machines=$DSTACK_NODES_NUM - examples/scripts/sft.py - --model_name meta-llama/Meta-Llama-3.1-8B - --dataset_name OpenAssistant/oasst_top1_2023-08-25 - --dataset_text_field="text" - --per_device_train_batch_size 1 - --per_device_eval_batch_size 1 - --gradient_accumulation_steps 4 - --learning_rate 2e-4 - --report_to wandb - --bf16 - --max_seq_length 1024 - --lora_r 16 --lora_alpha 32 - --lora_target_modules q_proj k_proj v_proj o_proj - --load_in_4bit - --use_peft - --attn_implementation "flash_attention_2" - --logging_steps=10 - --output_dir models/llama31 - --hub_model_id peterschmidt85/FineLlama-3.1-8B - --torch_dtype bfloat16 - --use_bnb_nested_quant - -resources: - gpu: - # 24GB or more VRAM - memory: 24GB.. - # One or more GPU - count: 1.. - # Shared memory (for multi-gpu) - shm_size: 24GB -``` - -
- -[//]: # (TODO: Find a better example for a multi-node training) - -## Source code - -The source-code of this example can be found in -[`examples/llms/llama31`](https://github.com/dstackai/dstack/blob/master/examples/llms/llama31) and [`examples/single-node-training/trl`](https://github.com/dstackai/dstack/blob/master/examples/single-node-training/trl). - -## What's next? - -1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks), - [services](https://dstack.ai/docs/services), and [protips](https://dstack.ai/docs/protips). -2. Browse [Llama 3.1 on HuggingFace](https://huggingface.co/collections/meta-llama/llama-31-669fc079a0c406a149a5738f), - [HuggingFace's Llama recipes](https://github.com/huggingface/huggingface-llama-recipes), - [Meta's Llama recipes](https://github.com/meta-llama/llama-recipes) - and [Llama Agentic System](https://github.com/meta-llama/llama-agentic-system/). diff --git a/examples/llms/llama32/README.md b/examples/llms/llama32/README.md deleted file mode 100644 index c607139326..0000000000 --- a/examples/llms/llama32/README.md +++ /dev/null @@ -1,132 +0,0 @@ -# Llama 3.2 - -This example walks you through how to deploy Llama 3.2 vision model with `dstack` using `vLLM`. - -??? info "Prerequisites" - Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples. - -
- - ```shell - $ git clone https://github.com/dstackai/dstack - $ cd dstack - ``` - -
- -## Deployment - -Here's an example of a service that deploys Llama 3.2 11B using vLLM. - -
- -```yaml -type: service -name: llama32 - -image: vllm/vllm-openai:latest -env: - - HF_TOKEN - - MODEL_ID=meta-llama/Llama-3.2-11B-Vision-Instruct - - MAX_MODEL_LEN=4096 - - MAX_NUM_SEQS=8 -commands: - - vllm serve $MODEL_ID - --max-model-len $MAX_MODEL_LEN - --max-num-seqs $MAX_NUM_SEQS - --enforce-eager - --disable-log-requests - --limit-mm-per-prompt "image=1" - --tensor-parallel-size $DSTACK_GPUS_NUM -port: 8000 -# Register the model -model: meta-llama/Llama-3.2-11B-Vision-Instruct - -# Uncomment to cache downloaded models -#volumes: -# - /root/.cache/huggingface/hub:/root/.cache/huggingface/hub - -resources: - gpu: 40GB..48GB -``` -
- -[//]: # (TODO: Comment on MAX_MODEL_LEN and MAX_NUM_SEQS) - -### Memory requirements - -Below are the approximate memory requirements for loading the model. -This excludes memory for the model context and CUDA kernel reservations. - -| Model size | FP16 | -|------------|-------| -| **11B** | 40GB | -| **90B** | 180GB | - -[//]: # (TODO: Quantization mention) - -### Running a configuration - -To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command. - -
- -```shell -$ HF_TOKEN=... -$ dstack apply -f examples/llms/llama32/vllm/.dstack.yml - - # BACKEND REGION RESOURCES SPOT PRICE - 1 runpod CA-MTL-1 9xCPU, 50GB, 1xA40 (48GB) yes $0.24 - 2 runpod EU-SE-1 9xCPU, 50GB, 1xA40 (48GB) yes $0.24 - 3 runpod EU-SE-1 9xCPU, 50GB, 1xA6000 (48GB) yes $0.25 - - -Submit the run llama32? [y/n]: y - -Provisioning... ----> 100% -``` - -
- -Once the service is up, it will be available via the service endpoint -at `<dstack server URL>/proxy/services/<project name>/<run name>/`. - -
- -```shell -$ curl http://127.0.0.1:3000/proxy/services/main/llama32/v1/chat/completions \ - -H 'Content-Type: application/json' \ - -H 'Authorization: Bearer token' \ - --data '{ - "model": "meta-llama/Llama-3.2-11B-Vision-Instruct", - "messages": [ - { - "role": "user", - "content": [ - {"type" : "text", "text": "Describe the image."}, - {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/e/ea/Bento_at_Hanabishi%2C_Koyasan.jpg"}} - ] - }], - "max_tokens": 2048 - }' -``` - -
- -When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint -is available at `https://<run name>.<gateway domain>/`. - -[//]: # (TODO: https://github.com/dstackai/dstack/issues/1777) - -## Source code - -The source-code of this example can be found in -[`examples/llms/llama32`](https://github.com/dstackai/dstack/blob/master/examples/llms/llama32). - -## What's next? - -1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks), - [services](https://dstack.ai/docs/services), and [protips](https://dstack.ai/docs/protips). -2. Browse [Llama 3.2 on HuggingFace](https://huggingface.co/collections/meta-llama/llama-32-66f448ffc8c32f949b04c8cf) - and [Llama 3.2 on vLLM](https://docs.vllm.ai/en/latest/models/supported_models.html#multimodal-language-models). diff --git a/examples/llms/llama32/vllm/.dstack.yml b/examples/llms/llama32/vllm/.dstack.yml deleted file mode 100644 index 3712a64f80..0000000000 --- a/examples/llms/llama32/vllm/.dstack.yml +++ /dev/null @@ -1,27 +0,0 @@ -type: service -name: llama32 - -image: vllm/vllm-openai:latest -env: - - HF_TOKEN - - MODEL_ID=meta-llama/Llama-3.2-11B-Vision-Instruct - - MAX_MODEL_LEN=4096 - - MAX_NUM_SEQS=8 -commands: - - vllm serve $MODEL_ID - --max-model-len $MAX_MODEL_LEN - --max-num-seqs $MAX_NUM_SEQS - --enforce-eager - --disable-log-requests - --limit-mm-per-prompt "image=1" - --tensor-parallel-size $DSTACK_GPUS_NUM -port: 8000 -# Register the model -model: meta-llama/Llama-3.2-11B-Vision-Instruct - -# Uncomment to cache downloaded models -#volumes: -# - /root/.cache/huggingface/hub:/root/.cache/huggingface/hub - -resources: - gpu: 40GB..48GB diff --git a/examples/misc/docker-compose/.dstack.yml b/examples/misc/docker-compose/.dstack.yml deleted file mode 100644 index 0967f72b81..0000000000 --- a/examples/misc/docker-compose/.dstack.yml +++ /dev/null @@ -1,16 +0,0 @@ -type: dev-environment -name: vscode-docker - -docker: true -env: - - 
MODEL_ID=meta-llama/Llama-3.2-3B-Instruct - - HF_TOKEN -ide: vscode -files: - - compose.yaml - -# Uncomment to leverage spot instances -#spot_policy: auto - -resources: - gpu: 24GB diff --git a/examples/misc/docker-compose/README.md b/examples/misc/docker-compose/README.md deleted file mode 100644 index d74dba0304..0000000000 --- a/examples/misc/docker-compose/README.md +++ /dev/null @@ -1,180 +0,0 @@ -# Docker Compose - -All backends except `runpod`, `vastai`, and `kubernetes` allow using [Docker and Docker Compose](https://dstack.ai/docs/guides/protips#docker-and-docker-compose) inside `dstack` runs. - -This example shows how to deploy Hugging Face [Chat UI](https://huggingface.co/docs/chat-ui/index) -with [TGI](https://huggingface.co/docs/text-generation-inference/en/index) -serving [Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) -using [Docker Compose](https://docs.docker.com/compose/). - -??? info "Prerequisites" - Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples. - -
- - ```shell - $ git clone https://github.com/dstackai/dstack - $ cd dstack - ``` - -
- -## Deployment - -### Running as a task - -=== "`task.dstack.yml`" - -
- - ```yaml - type: task - name: chat-ui-task - - docker: true - env: - - MODEL_ID=meta-llama/Llama-3.2-3B-Instruct - - HF_TOKEN - files: - - compose.yaml - commands: - - docker compose up - ports: - - 9000 - - resources: - gpu: "nvidia:24GB" - ``` - -
- -=== "`compose.yaml`" - -
- - ```yaml - services: - app: - image: ghcr.io/huggingface/chat-ui:sha-bf0bc92 - command: - - bash - - -c - - | - echo MONGODB_URL=mongodb://db:27017 > .env.local - echo MODELS='`[{ - "name": "${MODEL_ID?}", - "endpoints": [{"type": "tgi", "url": "http://tgi:8000"}] - }]`' >> .env.local - exec ./entrypoint.sh - ports: - - 127.0.0.1:9000:3000 - depends_on: - - tgi - - db - - tgi: - image: ghcr.io/huggingface/text-generation-inference:sha-704a58c - volumes: - - tgi_data:/data - environment: - HF_TOKEN: ${HF_TOKEN?} - MODEL_ID: ${MODEL_ID?} - PORT: 8000 - deploy: - resources: - reservations: - devices: - - driver: nvidia - count: all - capabilities: [gpu] - - db: - image: mongo:latest - volumes: - - db_data:/data/db - - volumes: - tgi_data: - db_data: - ``` - -
- -### Deploying as a service - -If you'd like to deploy Chat UI as an auto-scalable and secure endpoint, -use the service configuration. You can find it at [`examples/misc/docker-compose/service.dstack.yml`](https://github.com/dstackai/dstack/blob/master/examples/misc/docker-compose/service.dstack.yml). - -### Running a configuration - -To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command. - -
- -```shell -$ HF_TOKEN=... -$ dstack apply -f examples/misc/docker-compose/task.dstack.yml - - # BACKEND REGION RESOURCES SPOT PRICE - 1 runpod CA-MTL-1 18xCPU, 100GB, A5000:24GB yes $0.12 - 2 runpod EU-SE-1 18xCPU, 100GB, A5000:24GB yes $0.12 - 3 gcp us-west4 27xCPU, 150GB, A5000:24GB:2 yes $0.23 - -Submit the run chat-ui-task? [y/n]: y - -Provisioning... ----> 100% -``` - -
- -## Persisting data - -To persist data between runs, create a [volume](https://dstack.ai/docs/concepts/volumes/) and attach it to the run -configuration. - -
- -```yaml -type: task -name: chat-ui-task - -privileged: true -image: dstackai/dind -env: - - MODEL_ID=meta-llama/Llama-3.2-3B-Instruct - - HF_TOKEN -files: - - compose.yaml -commands: - - start-dockerd - - docker compose up -ports: - - 9000 - -# Uncomment to leverage spot instances -#spot_policy: auto - -resources: - # Required resources - gpu: "nvidia:24GB" - -volumes: - - name: my-dind-volume - path: /var/lib/docker -``` - -
- -With this change, all Docker data—pulled images, containers, and crucially, volumes for database and model storage—will -be persisted. - -## Source code - -The source-code of this example can be found in -[`examples/misc/docker-compose`](https://github.com/dstackai/dstack/blob/master/examples/misc/docker-compose). - -## What's next? - -1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks), - [services](https://dstack.ai/docs/services), and [protips](https://dstack.ai/docs/protips). diff --git a/examples/misc/docker-compose/compose.yaml b/examples/misc/docker-compose/compose.yaml deleted file mode 100644 index c5c843667c..0000000000 --- a/examples/misc/docker-compose/compose.yaml +++ /dev/null @@ -1,42 +0,0 @@ -services: - app: - image: ghcr.io/huggingface/chat-ui-db:0.9.5 - environment: - HF_TOKEN: ${HF_TOKEN?} - MONGODB_URL: mongodb://db:27017 - MODELS: | - [{ - "name": "${MODEL_ID?}", - "endpoints": [{"type": "openai", "baseURL": "http://tgi:8000/v1"}] - }] - ports: - - 127.0.0.1:9000:3000 - depends_on: - - tgi - - db - - tgi: - image: ghcr.io/huggingface/text-generation-inference:3.3.4 - volumes: - - tgi_data:/data - environment: - HF_TOKEN: ${HF_TOKEN?} - MODEL_ID: ${MODEL_ID?} - PORT: 8000 - shm_size: 1g - deploy: - resources: - reservations: - devices: - - driver: nvidia - count: all - capabilities: [gpu] - - db: - image: mongo:latest - volumes: - - db_data:/data/db - -volumes: - tgi_data: - db_data: diff --git a/examples/misc/docker-compose/service.dstack.yml b/examples/misc/docker-compose/service.dstack.yml deleted file mode 100644 index b33b900fbd..0000000000 --- a/examples/misc/docker-compose/service.dstack.yml +++ /dev/null @@ -1,26 +0,0 @@ -type: service -name: chat-ui-service - -docker: true -env: - - MODEL_ID=meta-llama/Llama-3.2-3B-Instruct - - HF_TOKEN -files: - - compose.yaml -commands: - - docker compose up -port: 9000 -auth: false - -# Uncomment to leverage spot instances -#spot_policy: 
auto - -resources: - # Required resources - gpu: 1 - -# Cache the Docker data -volumes: - - instance_path: /root/.cache/docker-data - path: /var/lib/docker - optional: true diff --git a/examples/misc/docker-compose/task.dstack.yml b/examples/misc/docker-compose/task.dstack.yml deleted file mode 100644 index e7af43f383..0000000000 --- a/examples/misc/docker-compose/task.dstack.yml +++ /dev/null @@ -1,25 +0,0 @@ -type: task -name: chat-ui-task - -docker: true -env: - - MODEL_ID=meta-llama/Llama-3.2-3B-Instruct - - HF_TOKEN -files: - - compose.yaml -commands: - - docker compose up -ports: - - 9000 - -# Use either spot or on-demand instances -spot_policy: auto - -resources: - gpu: 1 - -# Cache the Docker data -volumes: - - instance_path: /root/.cache/docker-data - path: /var/lib/docker - optional: true diff --git a/examples/misc/docker-compose/volume.dstack.yml b/examples/misc/docker-compose/volume.dstack.yml deleted file mode 100644 index d5ddc1cc67..0000000000 --- a/examples/misc/docker-compose/volume.dstack.yml +++ /dev/null @@ -1,8 +0,0 @@ -type: volume -name: my-dind-volume - -backend: aws -region: eu-west-1 - -# Required size -size: 100GB diff --git a/examples/models/wan22/.dstack.yml b/examples/models/wan22/.dstack.yml deleted file mode 100644 index 528e11e0d8..0000000000 --- a/examples/models/wan22/.dstack.yml +++ /dev/null @@ -1,63 +0,0 @@ -type: task -name: wan22 - -repos: - # Clones it to `/workflow` (the default working directory) - - https://github.com/Wan-Video/Wan2.2.git - -python: 3.12 -nvcc: true - -env: - - PROMPT="Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." - # Required for storing cache on a volume - - UV_LINK_MODE=copy -commands: - # Install flash-attn - - | - uv pip install torch - uv pip install flash-attn --no-build-isolation - # Install dependencies - - | - uv pip install . 
decord librosa - uv pip install "huggingface_hub[cli]" - hf download Wan-AI/Wan2.2-T2V-A14B --local-dir /root/.cache/Wan2.2-T2V-A14B - # Generate video - - | - if [ ${DSTACK_GPUS_NUM} -gt 1 ]; then - torchrun \ - --nproc_per_node=${DSTACK_GPUS_NUM} \ - generate.py \ - --task t2v-A14B \ - --size 1280*720 \ - --ckpt_dir /root/.cache/Wan2.2-T2V-A14B \ - --dit_fsdp --t5_fsdp --ulysses_size ${DSTACK_GPUS_NUM} \ - --save_file ${DSTACK_RUN_NAME}.mp4 \ - --prompt "${PROMPT}" - else - python generate.py \ - --task t2v-A14B \ - --size 1280*720 \ - --ckpt_dir /root/.cache/Wan2.2-T2V-A14B \ - --offload_model True \ - --convert_model_dtype \ - --save_file ${DSTACK_RUN_NAME}.mp4 \ - --prompt "${PROMPT}" - fi - # Upload video - - curl https://bashupload.com/ -T ./${DSTACK_RUN_NAME}.mp4 - -resources: - gpu: - name: [H100, H200] - count: 1..8 - disk: 300GB - -# Change to on-demand for disabling spot -spot_policy: auto - -volumes: - # Cache pip packages and HF models - - instance_path: /root/dstack-cache - path: /root/.cache/ - optional: true diff --git a/examples/models/wan22/README.md b/examples/models/wan22/README.md deleted file mode 100644 index c99856fbf4..0000000000 --- a/examples/models/wan22/README.md +++ /dev/null @@ -1,145 +0,0 @@ ---- -title: Wan2.2 -description: Text-to-video generation using the Wan2.2 T2V-A14B foundational video model ---- - -# Wan2.2 - -[Wan2.2](https://github.com/Wan-Video/Wan2.2) is an open-source SOTA foundational video model. This example shows how to run the T2V-A14B model variant via `dstack` for text-to-video generation. - -??? info "Prerequisites" - Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples. - -
- - ```shell - $ git clone https://github.com/dstackai/dstack - $ cd dstack - ``` - -
- -## Define a configuration - -Below is a task configuration that generates a video using Wan2.2, uploads it, and provides the download link. - -
- -```yaml -type: task -name: wan22 - -repos: - # Clones it to `/workflow` (the default working directory) - - https://github.com/Wan-Video/Wan2.2.git - -python: 3.12 -nvcc: true - -env: - - PROMPT="Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." - # Required for storing cache on a volume - - UV_LINK_MODE=copy -commands: - # Install flash-attn - - | - uv pip install torch - uv pip install flash-attn --no-build-isolation - # Install dependencies - - | - uv pip install . decord librosa - uv pip install "huggingface_hub[cli]" - hf download Wan-AI/Wan2.2-T2V-A14B --local-dir /root/.cache/Wan2.2-T2V-A14B - # Generate video - - | - if [ ${DSTACK_GPUS_NUM} -gt 1 ]; then - torchrun \ - --nproc_per_node=${DSTACK_GPUS_NUM} \ - generate.py \ - --task t2v-A14B \ - --size 1280*720 \ - --ckpt_dir /root/.cache/Wan2.2-T2V-A14B \ - --dit_fsdp --t5_fsdp --ulysses_size ${DSTACK_GPUS_NUM} \ - --save_file ${DSTACK_RUN_NAME}.mp4 \ - --prompt "${PROMPT}" - else - python generate.py \ - --task t2v-A14B \ - --size 1280*720 \ - --ckpt_dir /root/.cache/Wan2.2-T2V-A14B \ - --offload_model True \ - --convert_model_dtype \ - --save_file ${DSTACK_RUN_NAME}.mp4 \ - --prompt "${PROMPT}" - fi - # Upload video - - curl https://bashupload.com/ -T ./${DSTACK_RUN_NAME}.mp4 - -resources: - gpu: - name: [H100, H200] - count: 1..8 - disk: 300GB - -# Change to on-demand for disabling spot -spot_policy: auto - -volumes: - # Cache pip packages and HF models - - instance_path: /root/dstack-cache - path: /root/.cache/ - optional: true -``` - -
- -## Run the configuration - -Once the configuration is ready, run `dstack apply -f `, and `dstack` will automatically provision the -cloud resources and run the configuration. - -
- -```shell -$ dstack apply -f examples/models/wan22/.dstack.yml - - # BACKEND RESOURCES INSTANCE TYPE PRICE - 1 verda (FIN-01) cpu=30 mem=120GB disk=200GB H100:80GB:1 (spot) 1H100.80S.30V $0.99 - 2 verda (FIN-01) cpu=30 mem=120GB disk=200GB H100:80GB:1 (spot) 1H100.80S.30V $0.99 - 3 verda (FIN-02) cpu=44 mem=182GB disk=200GB H200:141GB:1 (spot) 1H200.141S.44V $0.99 - ----> 100% - -Uploaded 1 file, 8 375 523 bytes - -wget https://bashupload.com/fIo7l/wan22.mp4 -``` - -
- -If you want you can override the default GPU, spot policy, and even the prompt via the CLI. - -
- -```shell -$ PROMPT=... -$ dstack apply -f examples/models/wan22/.dstack.yml --spot --gpu H100,H200:8 - - # BACKEND RESOURCES INSTANCE TYPE PRICE - 1 aws (us-east-2) cpu=192 mem=2048GB disk=300GB H100:80GB:8 (spot) p5.48xlarge $6.963 - 2 verda (FIN-02) cpu=176 mem=1480GB disk=300GB H100:80GB:8 (spot) 8H100.80S.176V $7.93 - 3 verda (ICE-01) cpu=176 mem=1450GB disk=300GB H200:141GB:8 (spot) 8H200.141S.176V $7.96 - ----> 100% - -Uploaded 1 file, 8 375 523 bytes - -wget https://bashupload.com/fIo7l/wan22.mp4 -``` - -
- -## Source code - -The source-code of this example can be found in -[`examples/models/wan22`](https://github.com/dstackai/dstack/blob/master/examples/models/wan22). diff --git a/examples/single-node-training/axolotl/README.md b/examples/single-node-training/axolotl/README.md index f8ae04f7ce..7781139e0b 100644 --- a/examples/single-node-training/axolotl/README.md +++ b/examples/single-node-training/axolotl/README.md @@ -92,11 +92,6 @@ Provisioning...
-## Source code - -The source-code of this example can be found in -[`examples/single-node-training/axolotl`](https://github.com/dstackai/dstack/blob/master/examples/single-node-training/axolotl) and [`examples/distributed-training/axolotl`](https://github.com/dstackai/dstack/blob/master/examples/distributed-training/axolotl). - ## What's next? 1. Browse the [Axolotl distributed training](https://dstack.ai/docs/examples/distributed-training/axolotl) example diff --git a/examples/single-node-training/trl/README.md b/examples/single-node-training/trl/README.md index f5cf7f5a4a..82dca87a98 100644 --- a/examples/single-node-training/trl/README.md +++ b/examples/single-node-training/trl/README.md @@ -108,11 +108,6 @@ Provisioning...
-## Source code - -The source-code of this example can be found in -[`examples/llms/llama31`](https://github.com/dstackai/dstack/blob/master/examples/llms/llama31) and [`examples/single-node-training/trl`](https://github.com/dstackai/dstack/blob/master/examples/single-node-training/trl). - ## What's next? 1. Browse the [TRL distributed training](https://dstack.ai/docs/examples/distributed-training/trl) example diff --git a/mkdocs.yml b/mkdocs.yml index 34437b1799..8dbe0ad85e 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -98,10 +98,10 @@ plugins: "docs/tasks.md": "docs/concepts/tasks.md" "docs/services.md": "docs/concepts/services.md" "docs/fleets.md": "docs/concepts/fleets.md" - "docs/examples/llms/llama31.md": "examples/llms/llama/index.md" - "docs/examples/llms/llama32.md": "examples/llms/llama/index.md" - "examples/llms/llama31/index.md": "examples/llms/llama/index.md" - "examples/llms/llama32/index.md": "examples/llms/llama/index.md" + "docs/examples/llms/llama31.md": "examples/inference/vllm/index.md" + "docs/examples/llms/llama32.md": "examples/inference/vllm/index.md" + "examples/llms/llama31/index.md": "examples/inference/vllm/index.md" + "examples/llms/llama32/index.md": "examples/inference/vllm/index.md" "docs/examples/accelerators/amd/index.md": "examples/accelerators/amd/index.md" "docs/examples/deployment/nim/index.md": "examples/inference/nim/index.md" "docs/examples/deployment/vllm/index.md": "examples/inference/vllm/index.md" @@ -308,8 +308,6 @@ nav: - AMD: examples/accelerators/amd/index.md - TPU: examples/accelerators/tpu/index.md - Tenstorrent: examples/accelerators/tenstorrent/index.md - - Models: - - Wan2.2: examples/models/wan22/index.md - Blog: - blog/index.md - Case studies: blog/case-studies.md
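
A note on the final `mkdocs.yml` hunk above: the quoted `"old: new"` pairs being repointed are, by their shape, entries of the `redirect_maps` table used by the `mkdocs-redirects` plugin (an assumption from the syntax; the plugin declaration itself is outside this hunk). Retiring the Llama 3.1/3.2 pages therefore only requires changing each entry's value to the surviving vLLM page, with both paths relative to `docs/`. A minimal sketch of that convention:

```yaml
# Sketch of the redirect configuration the hunk edits, assuming the
# mkdocs-redirects plugin is enabled elsewhere in mkdocs.yml.
# Each key is a removed page; each value is where readers land instead.
plugins:
  - redirects:
      redirect_maps:
        "examples/llms/llama31/index.md": "examples/inference/vllm/index.md"
        "examples/llms/llama32/index.md": "examples/inference/vllm/index.md"
```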