support per-replica-group image, docker, python, nvcc, privileged by Bihan · Pull Request #3832 · dstackai/dstack

Bihan · 2026-04-28T05:38:26Z

The driving case for this PR is prefill/decode (PD) disaggregation with NVIDIA Dynamo: one replica is a Dynamo frontend (router) that has to bring up NATS/etcd, while the other replicas are GPU workers running the SGLang prefill/decode backend.

Here is what a possible service configuration looks like:

type: service
name: test1-pd-router-replica
python: 3.12
https: false

env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Llama-3.2-3B-Instruct
  - ETCD_ENDPOINTS="http://192.168.0.53:2379" # http://<Internal_IP_ROUTER_MACHINE>:<etcd port>
  - NATS_SERVER="nats://192.168.0.53:4222" # nats://<Internal_IP_ROUTER_MACHINE>:<nats port>

replicas:
  - count: 1
    docker: true
    commands:
      - apt-get update
      - apt-get install -y python3-dev python3-venv
      - python3 -m venv ~/dyn-venv
      - source ~/dyn-venv/bin/activate
      - pip install -U pip
      - pip install --pre "ai-dynamo[sglang]"
      - git clone https://github.com/ai-dynamo/dynamo.git
      - docker compose -f dynamo/deploy/docker-compose.yml up -d
      - |
        python3 -m dynamo.frontend \
          --http-host 0.0.0.0 \
          --http-port 8000 \
          --discovery-backend etcd \
          --router-mode kv
    router:
      type: dynamo
    resources:
      cpu: 4

  - count: 1
    python: 3.12
    nvcc: true
    commands:
      - pip install "ai-dynamo[sglang]"
      - |
        python3 -m dynamo.sglang \
          --model-path $MODEL_ID \
          --served-model-name $MODEL_ID \
          --discovery-backend etcd \
          --host 0.0.0.0 \
          --page-size 64 \
          --disaggregation-mode prefill \
          --disaggregation-transfer-backend nixl
    resources:
      gpu: 1

  - count: 1..2
    python: 3.12
    nvcc: true
    scaling:
      metric: rps
      target: 3
    commands:
      - pip install "ai-dynamo[sglang]"
      - |
        python3 -m dynamo.sglang \
          --model-path $MODEL_ID \
          --served-model-name $MODEL_ID \
          --discovery-backend etcd \
          --host 0.0.0.0 \
          --page-size 64 \
          --disaggregation-mode decode \
          --disaggregation-transfer-backend nixl
    resources:
      gpu: 1

port: 8000
model: meta-llama/Llama-3.2-3B-Instruct

jvstme

Looks good overall, but I would suggest further improving validation; see my comments. Validation will be difficult or impossible to extend later, once the database already holds potentially invalid configurations, so I'd suggest doing it before merging.

jvstme · 2026-04-28T08:28:36Z

+            (
+                "docker",
+                values.get("docker") is True,
+                lambda g: g.docker is True,
+            ),
+            (
+                "privileged",
+                values.get("privileged") is True,
+                lambda g: g.privileged is not None,
+            ),
+            (
+                "python",
+                values.get("python") is not None,
+                lambda g: g.python is not None,
+            ),
+            (
+                "nvcc",
+                values.get("nvcc") is True,
+                lambda g: g.nvcc is True,
+            ),


For docker and nvcc — why is only True forbidden? I would expect anything except None to fail. For example, I wouldn't expect this configuration to pass validation, but currently it does:

type: service
port: 80
nvcc: true
replicas:
  - count: 1
    nvcc: false
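The check being asked for here can be sketched in a few lines. This is only an illustration on plain dicts, not the validator from the PR: `IMAGE_SOURCE_FIELDS` and `find_duplicated_fields` are hypothetical names, and the field list is simply taken from the PR title. The point is that a field counts as "set" whenever it is not `None`, so `nvcc: false` on a replica group still collides with `nvcc: true` on the service.

```python
from typing import Any

# Hypothetical field list for illustration; these are the fields named in the PR title.
IMAGE_SOURCE_FIELDS = ("image", "docker", "python", "nvcc", "privileged")


def find_duplicated_fields(
    service: dict[str, Any], groups: list[dict[str, Any]]
) -> list[str]:
    """Return fields that are set (to anything but None) at both levels."""
    duplicated = []
    for field in IMAGE_SOURCE_FIELDS:
        set_on_service = service.get(field) is not None
        # `is not None` rather than `is True`: an explicit False is still "set".
        set_on_group = any(group.get(field) is not None for group in groups)
        if set_on_service and set_on_group:
            duplicated.append(field)
    return duplicated


service = {"type": "service", "port": 80, "nvcc": True}
groups = [{"count": 1, "nvcc": False}]
print(find_duplicated_fields(service, groups))
```

With an `is True` comparison on the group side, the same input would produce no conflicts, which is the gap the comment points at.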

jvstme · 2026-04-29T07:49:35Z

+    def test_replica_group_with_only_python_no_commands_allowed(self):
+        parse_run_configuration(
+            {
+                "type": "service",
+                "port": 8000,
+                "replicas": [{"count": 1, "python": "3.12"}],
+            }
+        )
+
+    def test_replica_group_with_only_nvcc_no_commands_allowed(self):
+        parse_run_configuration(
+            {
+                "type": "service",
+                "port": 8000,
+                "replicas": [{"count": 1, "nvcc": True}],
+            }
+        )


Why are these allowed though? Without replica groups, we don't allow such configurations:

$ cat test.dstack.yml
type: service
port: 8000
python: 3.11
$ dstack apply -f test.dstack.yml
1 validation error for BaseApplyConfigurationResponse
__root__ -> ServiceConfigurationResponse -> __root__
  Either `commands` or `image` must be set (type=value_error)

The idea is that each job should have something useful to run — either a custom image or custom commands on top of the default image. Just running the default image is not useful, so it can indicate a misconfiguration
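A rough sketch of the same rule applied per replica group, again on plain dicts. The helper name `groups_without_workload` is hypothetical, and the sketch assumes that a group may rely on `commands` or `image` set at the service level; whether dstack resolves it that way is not stated in this thread.

```python
from typing import Any


def _has_workload(config: dict[str, Any]) -> bool:
    """True if the config sets a non-empty `commands` list or an `image`."""
    return bool(config.get("commands")) or config.get("image") is not None


def groups_without_workload(
    service: dict[str, Any], groups: list[dict[str, Any]]
) -> list[int]:
    """Return indices of replica groups that have nothing useful to run."""
    # Assumption: service-level `commands`/`image` apply to every group.
    if _has_workload(service):
        return []
    return [index for index, group in enumerate(groups) if not _has_workload(group)]


service = {"type": "service", "port": 8000}
groups = [{"count": 1, "python": "3.12"}, {"count": 1, "commands": ["x"]}]
print(groups_without_workload(service, groups))
```

Under this rule the two test cases quoted above (only `python`, only `nvcc`) would be rejected rather than allowed.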

jvstme · 2026-04-29T08:00:49Z

One more misconfiguration case not covered by validation is conflicting image sources on the service and replica level. For example, I wouldn't expect this configuration to pass validation, but currently it does:

type: service
port: 8000
image: alpine
replicas:
  - count: 1
    commands: ["x"]
    nvcc: true  # conflicts with `image`
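One way to catch this case is to look at the combined view of the service and each replica group, and reject any group where `image` is set on either level together with an option that only makes sense for the default image. The sketch below is illustrative only: `DEFAULT_IMAGE_OPTIONS` and `image_source_conflicts` are hypothetical names, and treating `python` as conflicting with `image` alongside `nvcc` is an assumption (the comment above only names `nvcc`).

```python
from typing import Any

# Assumption: options that configure the default image and so clash with `image`.
DEFAULT_IMAGE_OPTIONS = ("python", "nvcc")


def image_source_conflicts(
    service: dict[str, Any], groups: list[dict[str, Any]]
) -> list[tuple[int, str]]:
    """Return (group index, option) pairs that clash with `image` across levels."""
    conflicts = []
    for index, group in enumerate(groups):

        def is_set(field: str) -> bool:
            # A field counts as set if either level gives it a non-None value.
            return service.get(field) is not None or group.get(field) is not None

        if is_set("image"):
            conflicts.extend(
                (index, option) for option in DEFAULT_IMAGE_OPTIONS if is_set(option)
            )
    return conflicts


service = {"type": "service", "port": 8000, "image": "alpine"}
groups = [{"count": 1, "commands": ["x"], "nvcc": True}]
print(image_source_conflicts(service, groups))
```

The check is symmetric, so it also flags `nvcc` on the service combined with `image` on a replica group.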

Bihan requested a review from jvstme April 28, 2026 05:38

jvstme approved these changes Apr 29, 2026

Bihan Rana added 3 commits April 30, 2026 16:14

support per-replica-group image, docker, python, nvcc, privileged

26c8e8f

Merge Conflict Resolved

b17f4f6

Resolve Review Comments

f8a61f6

Bihan force-pushed the feat/replica-group-image-sources branch from e1bbb8e to f8a61f6 Compare April 30, 2026 10:57

jvstme approved these changes Apr 30, 2026

Bihan merged commit 0585e95 into dstackai:master May 1, 2026
25 checks passed

Bihan merged 3 commits into dstackai:master from Bihan:feat/replica-group-image-sources
