monogpu

Pack many small (≈1B-parameter) quantized models onto a single GPU and serve them efficiently, reusing the Shepherd Model Gateway (SMG) for routing and adding only the thin piece SMG lacks: GPU-aware worker placement.

Status: Phase 0 — scaffold. No engine code yet. See docs/design.md for the full design and rationale.

The idea in one paragraph

A router routes; it does not place. SMG already does multi-backend, cache-aware routing across vLLM / SGLang / TensorRT-LLM workers — but it assumes those workers already exist. monogpu supplies the missing half: a small, shell-first launcher that probes the GPU, sizes and starts N memory-capped model workers so they co-fit one GPU (optionally under MPS), and registers them with SMG. That's the whole MVP — no Rust, no Python code from us.
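As a rough illustration of the "sizes N memory-capped workers" step, the sketch below splits whatever memory is left after a safety reserve evenly across the workers. The function name `per_worker_fraction`, the reserve size, and the even split are all assumptions made for this example; no launcher code exists in the repo yet, and the real sizing policy is described in docs/design.md.

```shell
#!/usr/bin/env bash
# Hypothetical sizing helper; not part of monogpu (Phase 0 has no engine code).
# Splits the GPU memory left after a safety reserve evenly across N workers
# and prints each worker's share as a fraction of total GPU memory.
per_worker_fraction() {
  local total_mib=$1 reserve_mib=$2 n_workers=$3
  awk -v t="$total_mib" -v r="$reserve_mib" -v n="$n_workers" 'BEGIN {
    # Reject inputs that cannot produce a usable positive share.
    if (n < 1 || t <= 0 || r >= t) { exit 1 }
    printf "%.3f\n", (t - r) / t / n
  }'
}

# Illustrative numbers only: 16384 MiB total, 2048 MiB reserve, 4 workers.
per_worker_fraction 16384 2048 4   # prints 0.219
```

A fraction of this kind is the sort of value inference engines commonly accept as a memory cap (vLLM, for example, has `--gpu-memory-utilization`). Whether the launcher will pass it that way is an assumption here, not something the README states.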

Why this is interesting on real hardware

The two target GPUs have opposite scarce resources, and the design adapts to whichever bites:

| GPU | Memory capacity | Memory bandwidth | Scarce resource |
| --- | --- | --- | --- |
| RTX 5080 | 16 GB | 960 GB/s | capacity (fits few; runs fast) |
| DGX Spark (GB10) | 128 GB | 273 GB/s | bandwidth (fits many; time-share the bus) |
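The trade-off in the table can be made concrete with a back-of-envelope estimate. The sketch below assumes a per-worker footprint of 1.5 GB (weights plus KV cache and runtime overhead) and 0.5 GB of weights re-read once per decoded token. Both figures are illustrative guesses for a 4-bit ≈1B model, not measurements, and the helper names `max_fit` and `decode_ceiling` are hypothetical.

```shell
#!/usr/bin/env bash
# Back-of-envelope estimate; every input is an illustrative assumption.

# max_fit CAPACITY_GB FOOTPRINT_GB
# How many workers of a given memory footprint fit in the capacity.
max_fit() {
  awk -v c="$1" -v f="$2" 'BEGIN { printf "%d\n", int(c / f) }'
}

# decode_ceiling BANDWIDTH_GBPS WEIGHTS_GB
# Upper bound on aggregate decode tokens/s if every token re-reads the
# weights once from memory (ignores batching, caches, and compute limits).
decode_ceiling() {
  awk -v b="$1" -v w="$2" 'BEGIN { printf "%d\n", int(b / w) }'
}

echo "RTX 5080:  fits $(max_fit 16 1.5),  ceiling $(decode_ceiling 960 0.5) tok/s"
echo "DGX Spark: fits $(max_fit 128 1.5), ceiling $(decode_ceiling 273 0.5) tok/s"
```

Under these assumptions the RTX 5080 fits about 10 workers with a 1920 tok/s shared ceiling, while the DGX Spark fits about 85 workers that share a 546 tok/s ceiling. That is the sense in which capacity is scarce on the first and bandwidth on the second.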

Quick start (Phase 1 target — not yet implemented)

scripts/env.sh            # apply sm_121 ptxas fix etc. (see docs)
scripts/mps.sh start      # optional: CUDA MPS for kernel overlap
scripts/launch_workers.sh # probe GPU, start N memory-capped workers, register with SMG
scripts/run_gateway.sh    # run the SMG binary in front of them
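Since none of these scripts exist yet, here is a minimal sketch of what the optional MPS step might wrap. `nvidia-cuda-mps-control -d` starts the MPS control daemon and sending it `quit` stops it; the function name `mps_ctl` and the `DRY_RUN` switch are assumptions made for this example and are not the repo's interface.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for scripts/mps.sh (not yet implemented in the repo).
# DRY_RUN=1 prints the command that would run instead of executing it.
mps_ctl() {
  local action=${1:-}
  case "$action" in
    start) set -- nvidia-cuda-mps-control -d ;;                   # start daemon
    stop)  set -- sh -c 'echo quit | nvidia-cuda-mps-control' ;;  # ask it to exit
    *)     echo "usage: mps_ctl start|stop" >&2; return 2 ;;
  esac
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Show what "start" would do without touching the GPU.
DRY_RUN=1 mps_ctl start
```

The dry-run mode is there so the wrapper can be exercised on a machine without a GPU or driver installed.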

Non-negotiables (see CLAUDE.md)

Reuse-first — prefer SMG / existing engines; justify any from-scratch code.
No Python by default — added only when a concrete need proves it.
Optimize with evidence — every perf change ships with a benchmark delta.
Portable — no hardcoded GPU assumptions; probe at runtime.
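The "probe at runtime" rule could look something like the sketch below, which reads the GPU name and memory from `nvidia-smi` rather than hardcoding them. The helper name `parse_gpu_line` is hypothetical. The parser deliberately fails when memory is not reported as a number, which may happen on some unified-memory systems; a real launcher would need a fallback for that case.

```shell
#!/usr/bin/env bash
# Hypothetical probe helper. Parses one CSV line as produced by:
#   nvidia-smi --query-gpu=name,memory.total,memory.free \
#              --format=csv,noheader,nounits
# Exits non-zero if either memory field is not a plain integer (MiB).
parse_gpu_line() {
  awk -F', *' '{
    if ($2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/) { exit 1 }
    printf "name=%s total_mib=%s free_mib=%s\n", $1, $2, $3
  }'
}

# Only probe when the NVIDIA driver tools are actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,memory.total,memory.free \
             --format=csv,noheader,nounits | head -n 1 | parse_gpu_line \
    || echo "probe failed: memory not reported numerically" >&2
fi
```

Keeping the parser separate from the `nvidia-smi` call means it can be tested with canned input on a machine that has no GPU.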

Layout

docs/      design, governance, benchmarking
scripts/   shell: env / MPS / worker launcher / gateway runner
bench/     multi-tenant benchmark harness (reuses sibling dgx-spark-benchmark methodology)

License

Apache-2.0 (matches SMG, which this builds around).

