Benchmark and deploy optimized LLMs on GPU servers with vLLM or SGLang. Choose from a list of optimized recipes for popular models or create your own with custom configurations. Run benchmarks across different GPU types and configurations, track results, and share experiments with the community.
- deplodock/ — Python package
- deplodock.py — CLI entrypoint
- logging_setup.py — CLI logging configuration
- hardware.py — GPU specs and instance type mapping
- commands/ — CLI layer (thin argparse handlers, see ARCHITECTURE.md)
- deploy/ — `deploy local`, `deploy ssh`, `deploy cloud` commands
- bench/ — `bench` command
- teardown.py — `teardown` command
- vm/ — `vm create`/`vm delete` commands (GCP, CloudRift)
- recipe/ — Recipe loading, dataclass types, engine flag mapping (see ARCHITECTURE.md)
- deploy/ — Compose generation, deploy orchestration
- provisioning/ — Cloud provisioning, SSH transport, VM lifecycle
- benchmark/ — Benchmark tracking, config, task enumeration, execution
- planner/ — Groups benchmark tasks into execution groups for VM allocation
- recipes/ — Model deploy recipes (YAML configs per model)
- experiments/ — Experiment parameter sweeps (self-contained recipe + results)
- docs/ — Technical notes and engine-specific guides
- sglang-awq-moe.md — SGLang quantization for AWQ MoE models
- tests/ — pytest tests (see ARCHITECTURE.md)
- scripts/ — Analysis and visualization scripts
- utils/ — Standalone utility scripts
- config.yaml — Benchmark configuration
- Makefile — Build automation
- pyproject.toml — Package metadata and tool config
```
git clone https://github.com/cloudrift-ai/deplodock.git
cd deplodock
make setup
```

Deploy a recipe to a remote server:

```
deplodock deploy ssh \
  --recipe recipes/GLM-4.6-FP8 \
  --server user@host
```

Deploy locally:

```
deplodock deploy local \
  --recipe recipes/Qwen3-Coder-30B-A3B-Instruct-AWQ
```

Tear down a deployment:

```
deplodock deploy ssh \
  --recipe recipes/GLM-4.6-FP8 \
  --server user@host \
  --teardown
```

Preview commands without executing:

```
deplodock deploy ssh \
  --recipe recipes/GLM-4.6-FP8 \
  --server user@host \
  --dry-run
```

Recipes are declarative YAML configs in `recipes/<model>/recipe.yaml`. Each recipe defines a model, engine settings, and a `matrices` section for benchmark configurations.
```yaml
model:
  huggingface: "org/model-name"
engine:
  llm:
    tensor_parallel_size: 8
    pipeline_parallel_size: 1
    gpu_memory_utilization: 0.9
    context_length: 16384
    max_concurrent_requests: 512
    vllm:
      image: "vllm/vllm-openai:latest"
      extra_args: "--kv-cache-dtype fp8"  # Flags not covered by named fields
benchmark:
  max_concurrency: 128
  num_prompts: 256
  random_input_len: 8000
  random_output_len: 8000
matrices:
  # Simple single-point entry
  - deploy.gpu: "NVIDIA H200 141GB"
    deploy.gpu_count: 8
  # Override engine and benchmark settings
  - deploy.gpu: "NVIDIA H100 80GB"
    deploy.gpu_count: 8
    engine.llm.max_concurrent_requests: 256
    benchmark.max_concurrency: 64
  # Concurrency sweep (8 runs from one entry)
  - deploy.gpu: "NVIDIA GeForce RTX 5090"
    benchmark.max_concurrency: [1, 2, 4, 8, 16, 32, 64, 128]
  # Correlated engine+bench sweep (3 zip runs)
  - deploy.gpu: "NVIDIA GeForce RTX 5090"
    engine.llm.max_concurrent_requests: [128, 256, 512]
    benchmark.max_concurrency: [128, 256, 512]
```

Matrix entries use dot-notation for all parameter paths. Scalars are broadcast; lists are zipped (all lists in one entry must have the same length). `deploy.gpu` is required in each entry.
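The broadcast/zip expansion can be sketched in a few lines of Python. This illustrates the rule, not deplodock's actual implementation, and `expand_matrix_entry` is a hypothetical name:

```python
def expand_matrix_entry(entry: dict) -> list[dict]:
    """Expand one matrix entry into concrete runs: scalar values are
    broadcast to every run, list values are zipped position-by-position."""
    lengths = {len(v) for v in entry.values() if isinstance(v, list)}
    if len(lengths) > 1:
        raise ValueError(f"lists in one entry must share a length, got {lengths}")
    n = lengths.pop() if lengths else 1  # no lists -> a single run
    return [
        {key: (val[i] if isinstance(val, list) else val) for key, val in entry.items()}
        for i in range(n)
    ]

runs = expand_matrix_entry({
    "deploy.gpu": "NVIDIA GeForce RTX 5090",
    "engine.llm.max_concurrent_requests": [128, 256, 512],
    "benchmark.max_concurrency": [128, 256, 512],
})
print(len(runs))                             # 3
print(runs[0]["benchmark.max_concurrency"])  # 128
```

The last entry in the YAML above expands to exactly these three runs, with the scalar `deploy.gpu` repeated in each.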
Engine-agnostic fields (`tensor_parallel_size`, `context_length`, etc.) live at `engine.llm`. Engine-specific fields (`image`, `extra_args`) nest under `engine.llm.vllm` or `engine.llm.sglang`.
To benchmark with SGLang alongside vLLM, add a matrix entry with `engine.llm.sglang.*` overrides:

```yaml
matrices:
  - deploy.gpu: "NVIDIA GeForce RTX 5090"
    deploy.gpu_count: 1
  - deploy.gpu: "NVIDIA GeForce RTX 5090"
    deploy.gpu_count: 1
    engine.llm.sglang.image: "lmsysorg/sglang:latest"
```

| Recipe YAML key | vLLM CLI flag | SGLang CLI flag |
|---|---|---|
| `tensor_parallel_size` | `--tensor-parallel-size` | `--tp` |
| `pipeline_parallel_size` | `--pipeline-parallel-size` | `--dp` |
| `gpu_memory_utilization` | `--gpu-memory-utilization` | `--mem-fraction-static` |
| `context_length` | `--max-model-len` | `--context-length` |
| `max_concurrent_requests` | `--max-num-seqs` | `--max-running-requests` |
These flags must not appear in `extra_args` — `load_recipe()` validates this and raises an error on duplicates.
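A minimal sketch of that duplicate-flag check, mirroring the mapping table above; `NAMED_FLAGS` and `check_extra_args` are hypothetical names, not the package's API:

```python
# Flags already covered by named recipe fields, per engine (from the table above).
NAMED_FLAGS = {
    "vllm": {"--tensor-parallel-size", "--pipeline-parallel-size",
             "--gpu-memory-utilization", "--max-model-len", "--max-num-seqs"},
    "sglang": {"--tp", "--dp", "--mem-fraction-static",
               "--context-length", "--max-running-requests"},
}

def check_extra_args(engine: str, extra_args: str) -> None:
    """Reject extra_args flags that duplicate named recipe fields."""
    duplicates = NAMED_FLAGS[engine] & set(extra_args.split())
    if duplicates:
        raise ValueError(f"use named fields instead of extra_args for: {sorted(duplicates)}")

check_extra_args("vllm", "--kv-cache-dtype fp8")  # fine: not a named-field flag
# check_extra_args("vllm", "--max-num-seqs 64")   # would raise ValueError
```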
Experiments are self-contained parameter sweeps that live in `experiments/`. Each experiment directory contains a `recipe.yaml` and stores its results alongside it. The directory structure follows `experiments/{model_name}/{experiment_name}/`.
```
deplodock bench experiments/Qwen3-Coder-30B-A3B-Instruct-AWQ/optimal_mcr_rtx5090
```

Results are saved directly in the experiment directory:
```
experiments/Qwen3-Coder-30B-A3B-Instruct-AWQ/optimal_mcr_rtx5090/
  recipe.yaml
  2026-02-24_19-13-50_abc12345/
    tasks.json
    recipe.yaml
    RTX5090_mcr8_c8_vllm_benchmark.txt
    RTX5090_mcr12_c12_vllm_benchmark.txt
    ...
```
External developers can submit experiments via pull requests. A maintainer triggers benchmarks by commenting `/run-experiment` on the PR.
- Submit a PR with an experiment definition in `experiments/{model}/{experiment}/recipe.yaml`
- A maintainer reviews and comments `/run-experiment` on the PR
- CI runs benchmarks on cloud GPUs and commits results back to the PR branch
- Review results in the PR comment summary and committed files
```
/run-experiment                                                        # Auto-detect: benchmarks all experiments changed in the PR
/run-experiment experiments/MyModel/my_experiment                      # Explicit: benchmark specific experiment(s)
/run-experiment experiments/MyModel/my_experiment --gpu-concurrency 2  # Split groups across 2 VMs each
```
Only users with write or admin access to the repository can trigger benchmarks.
For the workflow to push results back to a fork's branch, the PR must have "Allow edits from maintainers" checked (this is the GitHub default). If unchecked, results are still available as downloadable workflow artifacts.
Runs docker compose directly on the current machine.

```
deplodock deploy local --recipe <path> [--dry-run]
```

Deploys to a remote server via SSH + SCP.

```
deplodock deploy ssh --recipe <path> --server user@host [--dry-run]
```

Provisions a cloud VM based on recipe GPU requirements (from the `deploy` section), then deploys via SSH.

```
deplodock deploy cloud --recipe <path> [--name <vm-name>] [--dry-run]
```

| Flag | Required | Default | Description |
|---|---|---|---|
| `--recipe` | Yes | - | Path to recipe directory |
| `--hf-token` | No | `$HF_TOKEN` | HuggingFace token |
| `--model-dir` | No | `/mnt/models` | Model cache dir |
| `--teardown` | No | false | Stop containers instead of deploying |
| `--dry-run` | No | false | Print commands without executing |
| Flag | Required | Default | Description |
|---|---|---|---|
| `--server` | Yes | - | SSH address (user@host) |
| `--ssh-key` | No | `~/.ssh/id_ed25519` | SSH key path |
| `--ssh-port` | No | 22 | SSH port |
| Flag | Required | Default | Description |
|---|---|---|---|
| `--name` | No | `cloud-deploy` | VM name prefix |
| `--ssh-key` | No | `~/.ssh/id_ed25519` | SSH private key path |
The `vm` command manages cloud GPU VM lifecycles. Supports GCP and CloudRift providers. Instances are ephemeral — `delete` removes them entirely.
```
deplodock vm create gcp --instance my-gpu-vm --zone us-central1-a --machine-type a2-highgpu-1g
deplodock vm create gcp --instance my-gpu-vm --zone us-central1-a --machine-type e2-micro --wait-ssh
deplodock vm create gcp --instance my-gpu-vm --zone us-central1-a --machine-type e2-micro --gcloud-args "--no-service-account --no-scopes" --dry-run
deplodock vm delete gcp --instance my-gpu-vm --zone us-central1-a
```

| Flag | Default | Description |
|---|---|---|
| `--instance` | (required) | GCP instance name |
| `--zone` | (required) | GCP zone (e.g. us-central1-a) |
| `--machine-type` | (required) | Machine type (e.g. a2-highgpu-1g) |
| `--provisioning-model` | `FLEX_START` | Provisioning model (FLEX_START, SPOT, or STANDARD) |
| `--max-run-duration` | `7d` | Max VM run time (10m–7d) |
| `--request-valid-for-duration` | `2h` | How long to wait for capacity |
| `--termination-action` | `DELETE` | Action when max-run-duration expires (STOP or DELETE) |
| `--image-family` | `debian-12` | Boot disk image family |
| `--image-project` | `debian-cloud` | Boot disk image project |
| `--gcloud-args` | - | Extra args passed to `gcloud compute instances create` |
| `--timeout` | `14400` | How long to poll for RUNNING status (seconds) |
| `--wait-ssh` | false | Wait for SSH after VM is RUNNING |
| `--wait-ssh-timeout` | `300` | SSH wait timeout in seconds |
| `--ssh-gateway` | - | SSH gateway host for ProxyJump (e.g. gcp-ssh-gateway) |
| `--dry-run` | false | Print commands without executing |
| Flag | Default | Description |
|---|---|---|
| `--instance` | (required) | GCP instance name |
| `--zone` | (required) | GCP zone (e.g. us-central1-a) |
| `--dry-run` | false | Print commands without executing |
GCP project is inferred from `gcloud` config (no `--project` flag needed).
```
deplodock vm create cloudrift --instance-type rtx4090.1 --ssh-key ~/.ssh/id_ed25519.pub
deplodock vm delete cloudrift --instance-id <id>
```

| Flag | Default | Description |
|---|---|---|
| `--instance-type` | (required) | Instance type (e.g. rtx4090.1) |
| `--ssh-key` | (required) | Path to SSH public key file |
| `--api-key` | `$CLOUDRIFT_API_KEY` | CloudRift API key |
| `--image-url` | Ubuntu 24.04 | VM image URL |
| `--ports` | `22,8000` | Comma-separated ports to open |
| `--timeout` | `600` | Seconds to wait for Active status |
| `--dry-run` | false | Print requests without executing |
| Flag | Default | Description |
|---|---|---|
| `--instance-id` | (required) | CloudRift instance ID |
| `--api-key` | `$CLOUDRIFT_API_KEY` | CloudRift API key |
| `--dry-run` | false | Print requests without executing |
The `bench` command accepts recipe directories as positional arguments. It loads each recipe, provisions cloud VMs, deploys the model, runs `vllm bench serve`, captures results, and tears down. Recipes sharing the same model and GPU type are grouped onto the same VM.
```
deplodock bench recipes/*                              # Run all recipes (results in each recipe dir)
deplodock bench experiments/.../optimal_mcr_rtx5090    # Run an experiment
deplodock bench recipes/* --gpu-concurrency 4          # Number of VMs per GPU type to spin up
deplodock bench recipes/* --dry-run                    # Preview commands
```

| Flag | Default | Description |
|---|---|---|
| `recipes` | (required) | Recipe directories (positional args) |
| `--ssh-key` | `~/.ssh/id_ed25519` | SSH private key path |
| `--config` | `config.yaml` | Path to configuration file |
| `--max-workers` | num groups | Max parallel execution groups |
| `--gpu-concurrency` | 1 | Split each (model, GPU) group across up to N VMs |
| `--dry-run` | false | Print commands without executing |
| `--no-teardown` | false | Skip teardown and VM deletion (saves instances.json for later cleanup) |
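The (model, GPU) grouping described above can be sketched as follows; the task dict fields and helper name are illustrative, not deplodock's internal API:

```python
from collections import defaultdict

def group_benchmark_tasks(tasks: list[dict]) -> dict[tuple, list[dict]]:
    """Group tasks by (model, gpu) so each group can share one provisioned VM."""
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for task in tasks:
        groups[(task["model"], task["gpu"])].append(task)
    return dict(groups)

tasks = [
    {"model": "GLM-4.6-FP8", "gpu": "NVIDIA H200 141GB", "max_concurrency": 64},
    {"model": "GLM-4.6-FP8", "gpu": "NVIDIA H200 141GB", "max_concurrency": 128},
    {"model": "GLM-4.6-FP8", "gpu": "NVIDIA H100 80GB", "max_concurrency": 64},
]
print(len(group_benchmark_tasks(tasks)))  # 2 groups -> 2 VMs
```

With `--gpu-concurrency N`, each of these groups would then be split across up to N VMs.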
Results are always stored in `{recipe_dir}/{timestamp}_{hash}/` — each recipe directory holds its own run directories alongside `recipe.yaml`.
Clean up VMs left running by `bench --no-teardown`:

```
deplodock teardown results/intermediate/2026-02-24_12-00-00_abc12345
deplodock teardown results/intermediate/2026-02-24_12-00-00_abc12345 --ssh-key ~/.ssh/id_ed25519
```

| Flag | Default | Description |
|---|---|---|
| `run_dir` | (required) | Run directory with instances.json (positional arg) |
| `--ssh-key` | `~/.ssh/id_ed25519` | SSH private key path |
```
make test
```

The project uses Ruff for linting and formatting. Configuration is in `pyproject.toml`.

```
make lint    # check for lint errors and formatting issues
make format  # auto-fix formatting and lint violations
```