support per-replica-group image, docker, python, nvcc, privileged #3832

Merged
Bihan merged 3 commits into dstackai:master from Bihan:feat/replica-group-image-sources
May 1, 2026

Collaborator

@Bihan commented Apr 28, 2026

The driving case for this PR is prefill/decode (PD) disaggregation with NVIDIA Dynamo: one replica is a Dynamo frontend (router) that has to bring up NATS and etcd, while the other replicas are GPU workers running the SGLang prefill/decode backend.

Here is what a possible service configuration looks like:

type: service
name: test1-pd-router-replica
python: 3.12
https: false

env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Llama-3.2-3B-Instruct
  - ETCD_ENDPOINTS="http://192.168.0.53:2379" # http://<Internal_IP_ROUTER_MACHINE>:<etcd port>
  - NATS_SERVER="nats://192.168.0.53:4222" # nats://<Internal_IP_ROUTER_MACHINE>:<nats port>

replicas:
  # Router: Dynamo frontend, also brings up NATS/etcd via docker compose
  - count: 1
    docker: true
    commands:
      - apt-get update
      - apt-get install -y python3-dev python3-venv
      - python3 -m venv ~/dyn-venv
      - source ~/dyn-venv/bin/activate
      - pip install -U pip
      - pip install --pre "ai-dynamo[sglang]"
      - git clone https://github.com/ai-dynamo/dynamo.git
      - docker compose -f dynamo/deploy/docker-compose.yml up -d
      - |
        python3 -m dynamo.frontend \
          --http-host 0.0.0.0 \
          --http-port 8000 \
          --discovery-backend etcd \
          --router-mode kv
    router:
      type: dynamo
    resources:
      cpu: 4

  # Prefill worker
  - count: 1
    python: 3.12
    nvcc: true
    commands:
      - pip install "ai-dynamo[sglang]"
      - |
        python3 -m dynamo.sglang \
          --model-path $MODEL_ID \
          --served-model-name $MODEL_ID \
          --discovery-backend etcd \
          --host 0.0.0.0 \
          --page-size 64 \
          --disaggregation-mode prefill \
          --disaggregation-transfer-backend nixl
    resources:
      gpu: 1

  # Decode workers, autoscaled on requests per second
  - count: 1..2
    python: 3.12
    nvcc: true
    scaling:
      metric: rps
      target: 3
    commands:
      - pip install "ai-dynamo[sglang]"
      - |
        python3 -m dynamo.sglang \
          --model-path $MODEL_ID \
          --served-model-name $MODEL_ID \
          --discovery-backend etcd \
          --host 0.0.0.0 \
          --page-size 64 \
          --disaggregation-mode decode \
          --disaggregation-transfer-backend nixl
    resources:
      gpu: 1
port: 8000
model: meta-llama/Llama-3.2-3B-Instruct

@Bihan requested a review from jvstme on April 28, 2026 05:38

Collaborator

@jvstme left a comment

Looks good overall, but I would suggest improving validation further; see my comments. Validation will be difficult or impossible to tighten later, once the database already holds potentially invalid configurations, so I'd suggest doing it before merging.

Comment on lines +1115 to +1134
    (
        "docker",
        values.get("docker") is True,
        lambda g: g.docker is True,
    ),
    (
        "privileged",
        values.get("privileged") is True,
        lambda g: g.privileged is not None,
    ),
    (
        "python",
        values.get("python") is not None,
        lambda g: g.python is not None,
    ),
    (
        "nvcc",
        values.get("nvcc") is True,
        lambda g: g.nvcc is True,
    ),

Collaborator

For docker and nvcc — why is only True forbidden? I would expect anything except None to fail. For example, I wouldn't expect this configuration to pass validation, but currently it does:

type: service
port: 80

nvcc: true
replicas:
- count: 1
  nvcc: false
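
The stricter rule suggested here can be sketched as follows. This is a minimal illustration on plain dicts, not the PR's actual validator: `find_level_conflicts` and `OVERRIDABLE_FIELDS` are hypothetical names, and the only assumption is that "set" means "is not None", so an explicit `false` on either level counts as set.

```python
from typing import Any

# Hypothetical field list; the names mirror the tuples quoted in the diff above.
OVERRIDABLE_FIELDS = ("docker", "privileged", "python", "nvcc")


def find_level_conflicts(service: dict[str, Any]) -> list[str]:
    """Return fields set both on the service and on at least one replica group.

    "Set" means "is not None", so an explicit `false` counts as set.
    """
    replicas = service.get("replicas")
    # `replicas` may also be a plain count; only a list holds replica groups.
    groups = replicas if isinstance(replicas, list) else []
    conflicts = []
    for field in OVERRIDABLE_FIELDS:
        if service.get(field) is None:
            continue
        if any(group.get(field) is not None for group in groups):
            conflicts.append(field)
    return conflicts


config = {
    "type": "service",
    "port": 80,
    "nvcc": True,
    "replicas": [{"count": 1, "nvcc": False}],
}
print(find_level_conflicts(config))  # ['nvcc']
```

With an `is True` comparison instead of `is not None`, the group-level `nvcc: false` above would slip through, which is the case the comment describes.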

Collaborator Author

Done

Comment on lines +473 to +489
    def test_replica_group_with_only_python_no_commands_allowed(self):
        parse_run_configuration(
            {
                "type": "service",
                "port": 8000,
                "replicas": [{"count": 1, "python": "3.12"}],
            }
        )

    def test_replica_group_with_only_nvcc_no_commands_allowed(self):
        parse_run_configuration(
            {
                "type": "service",
                "port": 8000,
                "replicas": [{"count": 1, "nvcc": True}],
            }
        )

Collaborator

Why are these allowed though? Without replica groups, we don't allow such configurations:

$ cat test.dstack.yml
type: service
port: 8000
python: 3.11

$ dstack apply -f test.dstack.yml
1 validation error for BaseApplyConfigurationResponse
__root__ -> ServiceConfigurationResponse -> __root__
  Either `commands` or `image` must be set (type=value_error)

The idea is that each job should have something useful to run — either a custom image or custom commands on top of the default image. Just running the default image is not useful, so it can indicate a misconfiguration
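
A per-group version of the "either `commands` or `image`" rule might look like the sketch below. It works on plain dicts, `groups_missing_workload` is a hypothetical helper, and the fallback from a group-level value to the service-level value is an assumption made for illustration rather than something confirmed by the PR.

```python
from typing import Any


def groups_missing_workload(service: dict[str, Any]) -> list[int]:
    """Return indexes of replica groups with neither `commands` nor `image`.

    Assumption: a group-level value wins; otherwise the service-level
    value is used.
    """
    replicas = service.get("replicas")
    # `replicas` may also be a plain count; only a list holds replica groups.
    groups = replicas if isinstance(replicas, list) else []
    missing = []
    for index, group in enumerate(groups):
        commands = group.get("commands") or service.get("commands")
        image = group.get("image") or service.get("image")
        if not commands and not image:
            missing.append(index)
    return missing


python_only = {
    "type": "service",
    "port": 8000,
    "replicas": [{"count": 1, "python": "3.12"}],
}
print(groups_missing_workload(python_only))  # [0]
```

Under this sketch, the two configurations from the quoted tests would be reported as invalid, matching the behavior for services without replica groups.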

Collaborator Author

Done

Collaborator

One more misconfiguration case not covered by validation is conflicting image sources on the service and replica level. For example, I wouldn't expect this configuration to pass validation, but currently it does:

type: service
port: 8000

image: alpine
replicas:
- count: 1
  commands: ["x"]
  nvcc: true  # conflicts with `image`
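
A cross-level check for this case could be sketched as below. It is an illustration on plain dicts, not the merged implementation: `cross_level_image_conflicts` and `DEFAULT_IMAGE_FIELDS` are hypothetical names, and the exclusivity rule (a custom `image` cannot be combined with `python`, `nvcc`, or `docker`) is assumed here and applied in both directions.

```python
from typing import Any

# Assumed exclusivity rule: a custom `image` cannot be combined with
# `python`, `nvcc`, or `docker`, which all rely on a dstack-provided image.
DEFAULT_IMAGE_FIELDS = ("python", "nvcc", "docker")


def cross_level_image_conflicts(service: dict[str, Any]) -> list[tuple[int, str]]:
    """Return (group index, description) for image sources that clash across levels."""
    replicas = service.get("replicas")
    # `replicas` may also be a plain count; only a list holds replica groups.
    groups = replicas if isinstance(replicas, list) else []
    conflicts = []
    for index, group in enumerate(groups):
        for field in DEFAULT_IMAGE_FIELDS:
            # Custom image on one level, default-image option on the other.
            if service.get("image") is not None and group.get(field) is not None:
                conflicts.append((index, f"service `image` vs group `{field}`"))
            if group.get("image") is not None and service.get(field) is not None:
                conflicts.append((index, f"group `image` vs service `{field}`"))
    return conflicts


config = {
    "type": "service",
    "port": 8000,
    "image": "alpine",
    "replicas": [{"count": 1, "commands": ["x"], "nvcc": True}],
}
print(cross_level_image_conflicts(config))  # [(0, 'service `image` vs group `nvcc`')]
```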

Collaborator Author

Done

@Bihan force-pushed the feat/replica-group-image-sources branch from e1bbb8e to f8a61f6 on April 30, 2026 10:57
@Bihan merged commit 0585e95 into dstackai:master on May 1, 2026
25 checks passed