This proxy dynamically selects hardware resources (CPU/GPU/accelerators) for incoming workloads to maximize cost efficiency.
This proxy uses llama-server as the backend inference engine, and orchestrates the servers with docker containers.
- Docker installed and running
- UV package manager
Install UV if you haven't already:
# Install UV
curl -LsSf https://astral.sh/uv/install.sh | shInstall project dependencies:
# Install dependencies and create virtual environment
uv sync- Create a
modelsdirectory in the project root:
mkdir -p models- Place your llama.cpp compatible model files (
.ggufformat) in themodelsdirectory:
models/
├── model1.gguf
├── model2.gguf
└── ...Run the main proxy server:
uv run main.pyAn OpenAI-compatible proxy server will start on http://localhost:8000 by default.
The test suite validates proxy functionality, model discovery, and container management.
- Ensure the proxy server is running on
http://localhost:8000 - Have at least one model available (tests expect
01-DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_Mby default)
uv run pytestuv run pytest test_proxy.py::TestProxy -v
uv run pytest test_proxy.py::TestModelDiscovery -v
uv run pytest test_proxy.py::TestContainerManagement -v