Merged
25 changes: 20 additions & 5 deletions README.md
@@ -72,10 +72,13 @@ Pull the latest image from GHCR:
docker pull ghcr.io/alez007/modelship:latest
```

-Grab an example config for your GPU and edit it to your liking:
+Create a `models.yaml` config file (see [config/models.yaml](config/models.yaml) for an example):

-```bash
-docker run --rm ghcr.io/alez007/modelship:latest cat /modelship/config/models.example.16GB.yaml > models.yaml
-```
+```yaml
+models:
+  - name: qwen
+    model: Qwen/Qwen3-0.6B
+    loader: vllm
+```

Start the server:
@@ -104,7 +107,19 @@ curl http://localhost:8000/v1/chat/completions \
- Prometheus metrics: `http://localhost:8079`
- Ray dashboard: `http://localhost:8265`
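For programmatic access, the chat endpoint can also be called from Python with only the standard library. This is a sketch under two assumptions: the gateway at `localhost:8000` accepts the usual OpenAI chat schema, and a model named `qwen` is deployed as in the config example.

```python
import json
import urllib.request

# "qwen" must match a `name` entry from models.yaml, not the HF model id.
payload = {
    "model": "qwen",
    "messages": [{"role": "user", "content": "Say hello."}],
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Requires a running server; the response follows the OpenAI shape:
# with urllib.request.urlopen(req) as resp:
#     reply = json.load(resp)["choices"][0]["message"]["content"]
```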

-Example configs are included for 8 GB, 16 GB, 24 GB, and 2×16 GB GPU setups.
+### Additive Deploys

By default, running `start.py` with a new config adds models to the running cluster without disrupting existing deployments:

```bash
# Deploy LLMs
python start.py --config config/llm.yaml

# Later, add TTS models — LLMs keep running
python start.py --config config/tts.yaml
```

Use `--redeploy` to tear down everything and start fresh. See [Model Configuration](docs/model-configuration.md) for the full CLI reference.
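The config files named in the example are not part of this diff; a minimal sketch of what `config/llm.yaml` might contain, following the `models.yaml` schema shown earlier (entries are illustrative):

```yaml
# config/llm.yaml (illustrative contents)
models:
  - name: qwen
    model: Qwen/Qwen3-0.6B
    loader: vllm
```

A `config/tts.yaml` would follow the same shape, listing TTS models with a TTS-capable loader; deploying it later adds those models without touching the running LLM deployments.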

## Plugin Support

@@ -155,7 +170,7 @@ See the full [Production Readiness Plan](docs/production-readiness.md) for details
| Resilience | 5/10 | Good shutdown, weak self-healing |
| Testing | 3/10 | Config tests only, no integration/API tests |
| DevOps Experience | 5/10 | Good docs, no K8s/Helm, no runbooks |
-| Update/Deploy Strategy | 3/10 | No rolling updates, no hot-reload |
+| Update/Deploy Strategy | 5/10 | Additive deploys supported, no rolling updates for existing models |

### Critical items before production

44 changes: 0 additions & 44 deletions config/models.example.16GB.yaml

This file was deleted.

52 changes: 0 additions & 52 deletions config/models.example.24GB.yaml

This file was deleted.

55 changes: 0 additions & 55 deletions config/models.example.2x16GB.yaml

This file was deleted.

42 changes: 0 additions & 42 deletions config/models.example.8GB.yaml

This file was deleted.

38 changes: 0 additions & 38 deletions config/models.example.ha.yaml

This file was deleted.

4 changes: 3 additions & 1 deletion docs/architecture.md
@@ -25,7 +25,9 @@ Each model in `models.yaml` becomes an isolated Ray Serve deployment (`ModelDepl
- **Independent lifecycle** — one model crashing doesn't affect others
- **Per-model GPU budgeting** — `num_gpus` controls VRAM allocation (e.g. 0.70 for 70%)
- **Sequential startup** — models deploy one at a time to prevent memory spikes, ordered by tensor parallelism size (TP > 1 first)
- **Additive deploys** — by default, `start.py` adds models to a running cluster without disrupting existing deployments, enabling incremental composition from multiple config files. Use `--redeploy` to tear down and start fresh
- **Multi-deployment routing** — the same model name can appear multiple times with different configs (e.g. GPU + CPU). The gateway round-robins requests across all deployments sharing a name. Each deployment also supports `num_replicas` for scaling identical copies via Ray Serve's built-in load balancing
- **Multi-gateway support** — multiple independent gateways can run on the same cluster via `--gateway-name`, each managing its own set of models
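The multi-deployment round-robin described above can be sketched in a few lines. This is illustrative only: the registry shape and handle names are assumptions, not the gateway's actual internals.

```python
from itertools import cycle

# Hypothetical registry: one model name maps to every deployment
# serving it (e.g. a GPU-backed and a CPU-backed copy).
deployments = {"qwen": ["qwen-gpu", "qwen-cpu"]}

# One cycling iterator per name yields round-robin selection.
_round_robin = {name: cycle(handles) for name, handles in deployments.items()}

def pick_deployment(model_name: str) -> str:
    """Return the next deployment handle for the given model name."""
    return next(_round_robin[model_name])

# Successive calls alternate: qwen-gpu, qwen-cpu, qwen-gpu, ...
```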

### Inference Loaders

@@ -57,7 +59,7 @@ See [Plugin Development](plugins.md) for details.

| File | Purpose |
|------|---------|
-| `start.py` | Entry point — initializes Ray, deploys models |
+| `start.py` | Entry point — initializes Ray, deploys models additively (or fresh with `--redeploy`) |
| `modelship/openai/api.py` | FastAPI gateway with OpenAI endpoints |
| `modelship/infer/model_deployment.py` | Ray Serve deployment actor |
| `modelship/infer/infer_config.py` | Pydantic config models and protocols |