support per-replica-group image, docker, python, nvcc, privileged#3832
support per-replica-group image, docker, python, nvcc, privileged#3832Bihan merged 3 commits intodstackai:masterfrom
Conversation
jvstme
left a comment
There was a problem hiding this comment.
Looks good overall, except I would suggest to further improve validation, see my comments. Validation will be difficult or impossible to extend later, when the database already holds potentially invalid configurations, so I'd suggest to do it before merging
| ( | ||
| "docker", | ||
| values.get("docker") is True, | ||
| lambda g: g.docker is True, | ||
| ), | ||
| ( | ||
| "privileged", | ||
| values.get("privileged") is True, | ||
| lambda g: g.privileged is not None, | ||
| ), | ||
| ( | ||
| "python", | ||
| values.get("python") is not None, | ||
| lambda g: g.python is not None, | ||
| ), | ||
| ( | ||
| "nvcc", | ||
| values.get("nvcc") is True, | ||
| lambda g: g.nvcc is True, | ||
| ), |
There was a problem hiding this comment.
For docker and nvcc — why is only True forbidden? I would expect anything except None to fail. For example, I wouldn't expect this configuration to pass validation, but currently it does:
type: service
port: 80
nvcc: true
replicas:
- count: 1
nvcc: false| def test_replica_group_with_only_python_no_commands_allowed(self): | ||
| parse_run_configuration( | ||
| { | ||
| "type": "service", | ||
| "port": 8000, | ||
| "replicas": [{"count": 1, "python": "3.12"}], | ||
| } | ||
| ) | ||
|
|
||
| def test_replica_group_with_only_nvcc_no_commands_allowed(self): | ||
| parse_run_configuration( | ||
| { | ||
| "type": "service", | ||
| "port": 8000, | ||
| "replicas": [{"count": 1, "nvcc": True}], | ||
| } | ||
| ) |
There was a problem hiding this comment.
Why are these allowed though? Without replica groups, we don't allow such configurations:
$ cat test.dstack.yml
type: service
port: 8000
python: 3.11
$ dstack apply -f test.dstack.yml
1 validation error for BaseApplyConfigurationResponse
__root__ -> ServiceConfigurationResponse -> __root__
Either `commands` or `image` must be set (type=value_error)The idea is that each job should have something useful to run — either a custom image or custom commands on top of the default image. Just running the default image is not useful, so it can indicate a misconfiguration
There was a problem hiding this comment.
One more misconfiguration case not covered by validation is conflicting image sources on the service and replica level. For example, I wouldn't expect this configuration to pass validation, but currently it does:
type: service
port: 8000
image: alpine
replicas:
- count: 1
commands: ["x"]
nvcc: true # conflicts with `image`e1bbb8e to
f8a61f6
Compare
The driving case for this PR is
PD disaggregation NVIDIA-Dynamo: one replica is a dynamo frontend (router) that has to bring up aNATS/etcd, while the other replicas are GPU workers running the SGLang prefill/decode backend.Here is how the possible
Service ConfigurationLooks like