Merged
25 changes: 20 additions & 5 deletions README.md
@@ -72,10 +72,13 @@ Pull the latest image from GHCR:
docker pull ghcr.io/alez007/modelship:latest
```

-Grab an example config for your GPU and edit it to your liking:
+Create a `models.yaml` config file (see [config/models.yaml](config/models.yaml) for an example):

-```bash
-docker run --rm ghcr.io/alez007/modelship:latest cat /modelship/config/models.example.16GB.yaml > models.yaml
-```
+```yaml
+models:
+  - name: qwen
+    model: Qwen/Qwen3-0.6B
+    loader: vllm
+```

Start the server:
@@ -104,7 +107,19 @@ curl http://localhost:8000/v1/chat/completions \
- Prometheus metrics: `http://localhost:8079`
- Ray dashboard: `http://localhost:8265`
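For programmatic access, the chat endpoint can also be called from Python with only the standard library. This is a sketch under two assumptions: the gateway at `localhost:8000` accepts the usual OpenAI chat schema, and a model named `qwen` is deployed as in the config example.

```python
import json
import urllib.request

# "qwen" must match a `name` entry from models.yaml, not the HF model id.
payload = {
    "model": "qwen",
    "messages": [{"role": "user", "content": "Say hello."}],
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Requires a running server; the response follows the OpenAI shape:
# with urllib.request.urlopen(req) as resp:
#     reply = json.load(resp)["choices"][0]["message"]["content"]
```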

-Example configs are included for 8 GB, 16 GB, 24 GB, and 2×16 GB GPU setups.
+### Additive Deploys

By default, running `start.py` with a new config adds models to the running cluster without disrupting existing deployments:

```bash
# Deploy LLMs
python start.py --config config/llm.yaml

# Later, add TTS models — LLMs keep running
python start.py --config config/tts.yaml
```

Use `--redeploy` to tear down everything and start fresh. See [Model Configuration](docs/model-configuration.md) for the full CLI reference.
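The config files named in the example are not part of this diff; a minimal sketch of what `config/llm.yaml` might contain, following the `models.yaml` schema shown earlier (entries are illustrative):

```yaml
# config/llm.yaml (illustrative contents)
models:
  - name: qwen
    model: Qwen/Qwen3-0.6B
    loader: vllm
```

A `config/tts.yaml` would follow the same shape, listing TTS models with a TTS-capable loader; deploying it later adds those models without touching the running LLM deployments.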

## Plugin Support

@@ -155,7 +170,7 @@ See the full [Production Readiness Plan](docs/production-readiness.md) for details
| Resilience | 5/10 | Good shutdown, weak self-healing |
| Testing | 3/10 | Config tests only, no integration/API tests |
| DevOps Experience | 5/10 | Good docs, no K8s/Helm, no runbooks |
-| Update/Deploy Strategy | 3/10 | No rolling updates, no hot-reload |
+| Update/Deploy Strategy | 5/10 | Additive deploys supported, no rolling updates for existing models |

### Critical items before production

44 changes: 0 additions & 44 deletions config/models.example.16GB.yaml

This file was deleted.

52 changes: 0 additions & 52 deletions config/models.example.24GB.yaml

This file was deleted.

55 changes: 0 additions & 55 deletions config/models.example.2x16GB.yaml

This file was deleted.

42 changes: 0 additions & 42 deletions config/models.example.8GB.yaml

This file was deleted.

38 changes: 0 additions & 38 deletions config/models.example.ha.yaml

This file was deleted.

4 changes: 3 additions & 1 deletion docs/architecture.md
@@ -25,7 +25,9 @@ Each model in `models.yaml` becomes an isolated Ray Serve deployment (`ModelDepl
- **Independent lifecycle** — one model crashing doesn't affect others
- **Per-model GPU budgeting** — `num_gpus` controls VRAM allocation (e.g. 0.70 for 70%)
- **Sequential startup** — models deploy one at a time to prevent memory spikes, ordered by tensor parallelism size (TP > 1 first)
- **Additive deploys** — by default, `start.py` adds models to a running cluster without disrupting existing deployments, enabling incremental composition from multiple config files. Use `--redeploy` to tear down and start fresh
- **Multi-deployment routing** — the same model name can appear multiple times with different configs (e.g. GPU + CPU). The gateway round-robins requests across all deployments sharing a name. Each deployment also supports `num_replicas` for scaling identical copies via Ray Serve's built-in load balancing
- **Multi-gateway support** — multiple independent gateways can run on the same cluster via `--gateway-name`, each managing its own set of models
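The multi-deployment round-robin described above can be sketched in a few lines. This is illustrative only: the registry shape and handle names are assumptions, not the gateway's actual internals.

```python
from itertools import cycle

# Hypothetical registry: one model name maps to every deployment
# serving it (e.g. a GPU-backed and a CPU-backed copy).
deployments = {"qwen": ["qwen-gpu", "qwen-cpu"]}

# One cycling iterator per name yields round-robin selection.
_round_robin = {name: cycle(handles) for name, handles in deployments.items()}

def pick_deployment(model_name: str) -> str:
    """Return the next deployment handle for the given model name."""
    return next(_round_robin[model_name])

# Successive calls alternate: qwen-gpu, qwen-cpu, qwen-gpu, ...
```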

### Inference Loaders

@@ -57,7 +59,7 @@ See [Plugin Development](plugins.md) for details.

| File | Purpose |
|------|---------|
-| `start.py` | Entry point — initializes Ray, deploys models |
+| `start.py` | Entry point — initializes Ray, deploys models additively (or fresh with `--redeploy`) |
| `modelship/openai/api.py` | FastAPI gateway with OpenAI endpoints |
| `modelship/infer/model_deployment.py` | Ray Serve deployment actor |
| `modelship/infer/infer_config.py` | Pydantic config models and protocols |