Benchmark and deploy optimized LLMs on GPU servers with vLLM or SGLang. Choose from a list of optimized recipes for popular models or create your own with custom configurations. Run benchmarks across different GPU types and configurations, track results, and share experiments with the community.
- deplodock/ — Python package
- deplodock.py — CLI entrypoint
- logging_setup.py — CLI logging configuration
- hardware.py — GPU specs and instance type mapping
- commands/ — CLI layer (thin argparse handlers, see ARCHITECTURE.md)
- deploy/ — `deploy local`, `deploy ssh`, `deploy cloud` commands
- bench/ — `bench` command
- teardown.py — `teardown` command
- vm/ — `vm create`/`vm delete` commands (GCP, CloudRift)
- recipe/ — Recipe loading, dataclass types, engine flag mapping (see ARCHITECTURE.md)
- deploy/ — Compose generation, deploy orchestration
- provisioning/ — Cloud provisioning, SSH transport, VM lifecycle
- benchmark/ — Benchmark tracking, config, task enumeration, execution
- planner/ — Groups benchmark tasks into execution groups for VM allocation
- recipes/ — Model deploy recipes (YAML configs per model)
- experiments/ — Experiment parameter sweeps (self-contained recipe + results)
- docs/ — Technical notes and engine-specific guides
- sglang-awq-moe.md — SGLang quantization for AWQ MoE models
- tests/ — pytest tests (see ARCHITECTURE.md)
- scripts/ — Analysis and visualization scripts
- utils/ — Standalone utility scripts
- config.yaml — Benchmark configuration
- Makefile — Build automation
- pyproject.toml — Package metadata and tool config
```
git clone https://github.com/cloudrift-ai/deplodock.git
cd deplodock
make setup
```

Deploy a recipe to a remote server:

```
deplodock deploy ssh \
  --recipe recipes/GLM-4.6-FP8 \
  --server user@host
```

Deploy locally:

```
deplodock deploy local \
  --recipe recipes/Qwen3-Coder-30B-A3B-Instruct-AWQ
```

Tear down a deployment:

```
deplodock deploy ssh \
  --recipe recipes/GLM-4.6-FP8 \
  --server user@host \
  --teardown
```

Preview commands without executing:

```
deplodock deploy ssh \
  --recipe recipes/GLM-4.6-FP8 \
  --server user@host \
  --dry-run
```

Recipes are declarative YAML configs in `recipes/<model>/recipe.yaml`. Each recipe defines a model, engine settings, and a `matrices` section for benchmark configurations.
```yaml
model:
  huggingface: "org/model-name"
engine:
  llm:
    tensor_parallel_size: 8
    pipeline_parallel_size: 1
    gpu_memory_utilization: 0.9
    context_length: 16384
    max_concurrent_requests: 512
    vllm:
      image: "vllm/vllm-openai:latest"
      extra_args: "--kv-cache-dtype fp8"  # Flags not covered by named fields
benchmark:
  max_concurrency: 128
  num_prompts: 256
  random_input_len: 8000
  random_output_len: 8000
matrices:
  # Simple single-point entry
  - deploy.gpu: "NVIDIA H200 141GB"
    deploy.gpu_count: 8
  # Override engine and benchmark settings
  - deploy.gpu: "NVIDIA H100 80GB"
    deploy.gpu_count: 8
    engine.llm.max_concurrent_requests: 256
    benchmark.max_concurrency: 64
  # Concurrency sweep (8 runs from one entry)
  - deploy.gpu: "NVIDIA GeForce RTX 5090"
    benchmark.max_concurrency: [1, 2, 4, 8, 16, 32, 64, 128]
  # Correlated engine+bench sweep (3 zip runs)
  - deploy.gpu: "NVIDIA GeForce RTX 5090"
    engine.llm.max_concurrent_requests: [128, 256, 512]
    benchmark.max_concurrency: [128, 256, 512]
```

Matrix entries use dot-notation for all parameter paths. Scalars are broadcast; lists are zipped (all lists in one entry must have the same length). `deploy.gpu` is required in each entry.
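The broadcast/zip expansion can be sketched in a few lines of Python. This illustrates the rule, not deplodock's actual implementation, and `expand_matrix_entry` is a hypothetical name:

```python
def expand_matrix_entry(entry: dict) -> list[dict]:
    """Expand one matrix entry into concrete runs: scalar values are
    broadcast to every run, list values are zipped position-by-position."""
    lengths = {len(v) for v in entry.values() if isinstance(v, list)}
    if len(lengths) > 1:
        raise ValueError(f"lists in one entry must share a length, got {lengths}")
    n = lengths.pop() if lengths else 1  # no lists -> a single run
    return [
        {key: (val[i] if isinstance(val, list) else val) for key, val in entry.items()}
        for i in range(n)
    ]

runs = expand_matrix_entry({
    "deploy.gpu": "NVIDIA GeForce RTX 5090",
    "engine.llm.max_concurrent_requests": [128, 256, 512],
    "benchmark.max_concurrency": [128, 256, 512],
})
print(len(runs))                             # 3
print(runs[0]["benchmark.max_concurrency"])  # 128
```

The last entry in the YAML above expands to exactly these three runs, with the scalar `deploy.gpu` repeated in each.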
Engine-agnostic fields (`tensor_parallel_size`, `context_length`, etc.) live at `engine.llm`. Engine-specific fields (`image`, `extra_args`) nest under `engine.llm.vllm` or `engine.llm.sglang`.
To benchmark with SGLang alongside vLLM, add a matrix entry with `engine.llm.sglang.*` overrides:

```yaml
matrices:
  - deploy.gpu: "NVIDIA GeForce RTX 5090"
    deploy.gpu_count: 1
  - deploy.gpu: "NVIDIA GeForce RTX 5090"
    deploy.gpu_count: 1
    engine.llm.sglang.image: "lmsysorg/sglang:latest"
```

| Recipe YAML key | vLLM CLI flag | SGLang CLI flag |
|---|---|---|
| `tensor_parallel_size` | `--tensor-parallel-size` | `--tp` |
| `pipeline_parallel_size` | `--pipeline-parallel-size` | `--dp` |
| `gpu_memory_utilization` | `--gpu-memory-utilization` | `--mem-fraction-static` |
| `context_length` | `--max-model-len` | `--context-length` |
| `max_concurrent_requests` | `--max-num-seqs` | `--max-running-requests` |
These flags must not appear in `extra_args` — `load_recipe()` validates this and raises an error on duplicates.
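A minimal sketch of that duplicate-flag check, mirroring the mapping table above; `NAMED_FLAGS` and `check_extra_args` are hypothetical names, not the package's API:

```python
# Flags already covered by named recipe fields, per engine (from the table above).
NAMED_FLAGS = {
    "vllm": {"--tensor-parallel-size", "--pipeline-parallel-size",
             "--gpu-memory-utilization", "--max-model-len", "--max-num-seqs"},
    "sglang": {"--tp", "--dp", "--mem-fraction-static",
               "--context-length", "--max-running-requests"},
}

def check_extra_args(engine: str, extra_args: str) -> None:
    """Reject extra_args flags that duplicate named recipe fields."""
    duplicates = NAMED_FLAGS[engine] & set(extra_args.split())
    if duplicates:
        raise ValueError(f"use named fields instead of extra_args for: {sorted(duplicates)}")

check_extra_args("vllm", "--kv-cache-dtype fp8")  # fine: not a named-field flag
# check_extra_args("vllm", "--max-num-seqs 64")   # would raise ValueError
```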
Experiments are self-contained parameter sweeps that live in `experiments/`. Each experiment directory contains a `recipe.yaml` and stores its results alongside it. The directory structure follows `experiments/{model_name}/{experiment_name}/`.
```
deplodock bench experiments/Qwen3-Coder-30B-A3B-Instruct-AWQ/optimal_mcr_rtx5090
```

Results are saved directly in the experiment directory:
```
experiments/Qwen3-Coder-30B-A3B-Instruct-AWQ/optimal_mcr_rtx5090/
  recipe.yaml
  2026-02-24_19-13-50_abc12345/
    tasks.json
    recipe.yaml
    RTX5090_mcr8_c8_vllm_benchmark.txt
    RTX5090_mcr12_c12_vllm_benchmark.txt
    ...
```
External developers can submit experiments via pull requests. A maintainer triggers benchmarks by commenting `/run-experiment` on the PR.
- Submit a PR with an experiment definition in `experiments/{model}/{experiment}/recipe.yaml`
- A maintainer reviews and comments `/run-experiment` on the PR
- CI runs benchmarks on cloud GPUs and commits results back to the PR branch
- Review results in the PR comment summary and committed files
```
/run-experiment                                                        # Auto-detect: benchmarks all experiments changed in the PR
/run-experiment experiments/MyModel/my_experiment                      # Explicit: benchmark specific experiment(s)
/run-experiment experiments/MyModel/my_experiment --gpu-concurrency 2  # Split groups across 2 VMs each
```
Only users with write or admin access to the repository can trigger benchmarks.
For the workflow to push results back to a fork's branch, the PR must have "Allow edits from maintainers" checked (this is the GitHub default). If unchecked, results are still available as downloadable workflow artifacts.
Runs docker compose directly on the current machine.

```
deplodock deploy local --recipe <path> [--dry-run]
```

Deploys to a remote server via SSH + SCP.

```
deplodock deploy ssh --recipe <path> --server user@host [--dry-run]
```

Provisions a cloud VM based on recipe GPU requirements (from the `deploy` section), then deploys via SSH.

```
deplodock deploy cloud --recipe <path> [--name <vm-name>] [--dry-run]
```

| Flag | Required | Default | Description |
|---|---|---|---|
| `--recipe` | Yes | - | Path to recipe directory |
| `--hf-token` | No | `$HF_TOKEN` | HuggingFace token |
| `--model-dir` | No | `/mnt/models` | Model cache dir |
| `--teardown` | No | false | Stop containers instead of deploying |
| `--dry-run` | No | false | Print commands without executing |
| Flag | Required | Default | Description |
|---|---|---|---|
| `--server` | Yes | - | SSH address (user@host) |
| `--ssh-key` | No | `~/.ssh/id_ed25519` | SSH key path |
| `--ssh-port` | No | 22 | SSH port |
| Flag | Required | Default | Description |
|---|---|---|---|
| `--name` | No | `cloud-deploy` | VM name prefix |
| `--ssh-key` | No | `~/.ssh/id_ed25519` | SSH private key path |
The `vm` command manages cloud GPU VM lifecycles. Supports GCP and CloudRift providers. Instances are ephemeral — `delete` removes them entirely.
```
deplodock vm create gcp --instance my-gpu-vm --zone us-central1-a --machine-type a2-highgpu-1g
deplodock vm create gcp --instance my-gpu-vm --zone us-central1-a --machine-type e2-micro --wait-ssh
deplodock vm create gcp --instance my-gpu-vm --zone us-central1-a --machine-type e2-micro --gcloud-args "--no-service-account --no-scopes" --dry-run
deplodock vm delete gcp --instance my-gpu-vm --zone us-central1-a
```

| Flag | Default | Description |
|---|---|---|
| `--instance` | (required) | GCP instance name |
| `--zone` | (required) | GCP zone (e.g. us-central1-a) |
| `--machine-type` | (required) | Machine type (e.g. a2-highgpu-1g) |
| `--provisioning-model` | `FLEX_START` | Provisioning model (FLEX_START, SPOT, or STANDARD) |
| `--max-run-duration` | `7d` | Max VM run time (10m–7d) |
| `--request-valid-for-duration` | `2h` | How long to wait for capacity |
| `--termination-action` | `DELETE` | Action when max-run-duration expires (STOP or DELETE) |
| `--image-family` | `debian-12` | Boot disk image family |
| `--image-project` | `debian-cloud` | Boot disk image project |
| `--gcloud-args` | - | Extra args passed to `gcloud compute instances create` |
| `--timeout` | `14400` | How long to poll for RUNNING status (seconds) |
| `--wait-ssh` | false | Wait for SSH after VM is RUNNING |
| `--wait-ssh-timeout` | `300` | SSH wait timeout in seconds |
| `--ssh-gateway` | - | SSH gateway host for ProxyJump (e.g. gcp-ssh-gateway) |
| `--dry-run` | false | Print commands without executing |
| Flag | Default | Description |
|---|---|---|
| `--instance` | (required) | GCP instance name |
| `--zone` | (required) | GCP zone (e.g. us-central1-a) |
| `--dry-run` | false | Print commands without executing |
GCP project is inferred from `gcloud` config (no `--project` flag needed).
```
deplodock vm create cloudrift --instance-type rtx4090.1 --ssh-key ~/.ssh/id_ed25519.pub
deplodock vm delete cloudrift --instance-id <id>
```

| Flag | Default | Description |
|---|---|---|
| `--instance-type` | (required) | Instance type (e.g. rtx4090.1) |
| `--ssh-key` | (required) | Path to SSH public key file |
| `--api-key` | `$CLOUDRIFT_API_KEY` | CloudRift API key |
| `--image-url` | Ubuntu 24.04 | VM image URL |
| `--ports` | `22,8000` | Comma-separated ports to open |
| `--timeout` | `600` | Seconds to wait for Active status |
| `--dry-run` | false | Print requests without executing |
| Flag | Default | Description |
|---|---|---|
| `--instance-id` | (required) | CloudRift instance ID |
| `--api-key` | `$CLOUDRIFT_API_KEY` | CloudRift API key |
| `--dry-run` | false | Print requests without executing |
The `bench` command accepts recipe directories as positional arguments. It loads each recipe, provisions cloud VMs, deploys the model, runs `vllm bench serve`, captures results, and tears down. Recipes sharing the same model and GPU type are grouped onto the same VM.
```
deplodock bench recipes/*                              # Run all recipes (results in each recipe dir)
deplodock bench experiments/.../optimal_mcr_rtx5090    # Run an experiment
deplodock bench recipes/* --gpu-concurrency 4          # Number of VMs per GPU type to spin up
deplodock bench recipes/* --dry-run                    # Preview commands
```

| Flag | Default | Description |
|---|---|---|
| `recipes` | (required) | Recipe directories (positional args) |
| `--ssh-key` | `~/.ssh/id_ed25519` | SSH private key path |
| `--config` | `config.yaml` | Path to configuration file |
| `--max-workers` | num groups | Max parallel execution groups |
| `--gpu-concurrency` | 1 | Split each (model, GPU) group across up to N VMs |
| `--dry-run` | false | Print commands without executing |
| `--no-teardown` | false | Skip teardown and VM deletion (saves instances.json for later cleanup) |
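The (model, GPU) grouping described above can be sketched as follows; the task dict fields and helper name are illustrative, not deplodock's internal API:

```python
from collections import defaultdict

def group_benchmark_tasks(tasks: list[dict]) -> dict[tuple, list[dict]]:
    """Group tasks by (model, gpu) so each group can share one provisioned VM."""
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for task in tasks:
        groups[(task["model"], task["gpu"])].append(task)
    return dict(groups)

tasks = [
    {"model": "GLM-4.6-FP8", "gpu": "NVIDIA H200 141GB", "max_concurrency": 64},
    {"model": "GLM-4.6-FP8", "gpu": "NVIDIA H200 141GB", "max_concurrency": 128},
    {"model": "GLM-4.6-FP8", "gpu": "NVIDIA H100 80GB", "max_concurrency": 64},
]
print(len(group_benchmark_tasks(tasks)))  # 2 groups -> 2 VMs
```

With `--gpu-concurrency N`, each of these groups would then be split across up to N VMs.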
Results are always stored in `{recipe_dir}/{timestamp}_{hash}/` — each recipe directory holds its own run directories alongside `recipe.yaml`.
Clean up VMs left running by `bench --no-teardown`:

```
deplodock teardown results/intermediate/2026-02-24_12-00-00_abc12345
deplodock teardown results/intermediate/2026-02-24_12-00-00_abc12345 --ssh-key ~/.ssh/id_ed25519
```

| Flag | Default | Description |
|---|---|---|
| `run_dir` | (required) | Run directory with instances.json (positional arg) |
| `--ssh-key` | `~/.ssh/id_ed25519` | SSH private key path |
```
make test
```

The project uses Ruff for linting and formatting. Configuration is in `pyproject.toml`.

```
make lint    # check for lint errors and formatting issues
make format  # auto-fix formatting and lint violations
```