# EdgeInfer

On-prem LLM inference server for CPU-only enterprise environments. Supports GGUF and ONNX model formats with an OpenAI-compatible REST API.
## Architecture
```
+------------------------------------------------------------------+
|                                                                  |
|   Client (curl / SDK / CLI)                                      |
|        |                                                         |
|        v                                                         |
|   +---------------------------+     +-------------------------+  |
|   | REST API Server           |     | Prometheus Metrics      |  |
|   | /v1/completions           |---->| /metrics                |  |
|   | /v1/chat/completions      |     +-------------------------+  |
|   | /v1/models                |                                  |
|   | /v1/benchmark             |                                  |
|   +-------------+-------------+                                  |
|                 |                                                |
|       +---------+---------+                                      |
|       |                   |                                      |
|       v                   v                                      |
|   +-----------+    +----------------+                            |
|   | Inference |    | Resource       |                            |
|   | Engine    |    | Manager        |                            |
|   |           |    | CPU/Mem Budget |                            |
|   | Mock      |    | Queue Mgmt     |                            |
|   | Latency   |    | Load/Unload    |                            |
|   +-----+-----+    +-------+--------+                            |
|         |                  |                                     |
|         v                  v                                     |
|   +-----------------------------------+                          |
|   | Model Registry                    |                          |
|   | GGUF / ONNX metadata              |                          |
|   | Versioning + Rollback             |                          |
|   | Quantization tracking             |                          |
|   +-----------------------------------+                          |
|                     |                                            |
|                     v                                            |
|   +-----------------------------------+                          |
|   | Model Store (Volume)              |                          |
|   | /models/*.gguf   /models/*.onnx   |                          |
|   +-----------------------------------+                          |
|                                                                  |
+------------------------------------------------------------------+
```
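The Resource Manager's "Queue Mgmt" box can be illustrated with a small priority-queue sketch. This is a minimal illustration only; `RequestQueue` and its methods are hypothetical names, not EdgeInfer's actual API:

```python
import heapq
import itertools
from typing import Optional


class RequestQueue:
    """Hypothetical sketch of priority queueing for concurrent inference
    requests: lower priority number is served first, FIFO within a tier."""

    def __init__(self) -> None:
        self._heap: list = []
        self._counter = itertools.count()  # tie-breaker preserves FIFO order

    def submit(self, request_id: str, priority: int = 10) -> None:
        """Enqueue a request; the counter breaks ties between equal priorities."""
        heapq.heappush(self._heap, (priority, next(self._counter), request_id))

    def next_request(self) -> Optional[str]:
        """Pop the highest-priority (lowest number) request, or None if empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

A high-priority request submitted after several low-priority ones would still be dequeued first, while equal-priority requests keep arrival order.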
## Features

- Model Registry -- Register GGUF and ONNX models with metadata (size, quantization level, task type). Full version history with rollback support.
- Inference Engine -- Mock inference with configurable latency and quality based on model size and quantization. Batch inference and streaming stubs.
- REST API -- OpenAI-compatible `/v1/completions` and `/v1/chat/completions` endpoints. Token counting. Model load/unload management.
- Resource Management -- CPU/memory budget tracking. Model loading/unloading based on demand. Priority queue for concurrent requests.
- Health Monitoring -- Latency percentiles (P50/P90/P99), throughput, error rates, queue depth. Prometheus-compatible `/metrics` endpoint.
- Benchmarking -- Compare models on latency vs quality vs memory footprint. Generate ranked reports. Available via API and CLI.
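The memory-budget side of resource management can be sketched roughly as follows. This is an illustrative assumption, not EdgeInfer's implementation; `ResourceBudget` and its methods are hypothetical names, and the least-recently-used eviction policy is one plausible choice:

```python
from collections import OrderedDict


class ResourceBudget:
    """Hypothetical sketch of CPU/memory budget tracking with
    least-recently-used eviction of loaded models."""

    def __init__(self, max_memory_mb: int, max_models: int) -> None:
        self.max_memory_mb = max_memory_mb
        self.max_models = max_models
        self.loaded: "OrderedDict[str, int]" = OrderedDict()  # model id -> size in MB

    @property
    def used_mb(self) -> int:
        return sum(self.loaded.values())

    def load(self, model_id: str, size_mb: int) -> list:
        """Load a model, evicting least-recently-used models while the
        memory or model-count budget would be exceeded. Returns the ids
        of any evicted models."""
        evicted = []
        while self.loaded and (
            self.used_mb + size_mb > self.max_memory_mb
            or len(self.loaded) >= self.max_models
        ):
            victim, _ = self.loaded.popitem(last=False)  # oldest entry first
            evicted.append(victim)
        self.loaded[model_id] = size_mb
        return evicted

    def touch(self, model_id: str) -> None:
        """Mark a model as recently used so it is evicted last."""
        self.loaded.move_to_end(model_id)
```

Under this sketch, `edgeinfer serve --max-memory-mb 16384 --max-models 4` would correspond to `ResourceBudget(16384, 4)`: a fifth load, or one that pushes usage past 16 GB, evicts the coldest model first.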
## Installation

```bash
pip install edgeinfer
```

For development:

```bash
pip install -e ".[dev]"
```

## Quick Start

Start the server:

```bash
edgeinfer serve --port 8080 --max-memory-mb 16384 --max-models 4
```

```bash
# List models
curl http://localhost:8080/v1/models

# Load a model
curl -X POST http://localhost:8080/v1/models/my-model/load

# Run inference
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "Explain zero-trust networking."}],
    "max_tokens": 256
  }'
```

## CLI

```bash
# Start server
edgeinfer serve

# List models (requires a running server)
edgeinfer models
edgeinfer models --format json

# Run benchmarks
edgeinfer benchmark
edgeinfer benchmark --format json --iterations 5

# Check health
edgeinfer health
edgeinfer health --format json
```

## Docker

```bash
docker build -t edgeinfer .
docker run -p 8080:8080 -v /path/to/models:/models edgeinfer
```

Or with Docker Compose:

```bash
docker compose up -d
```

Services:

- `inference-server` -- Main API server on port 8080
- `model-store` -- Shared volume for model files
- `metrics` -- Dedicated metrics instance on port 9090
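The service layout above might correspond to a compose file along these lines. This is a hedged sketch only; the repository's actual `docker-compose.yml` may differ in image names, commands, and options (only the service names, ports, and the `/models` volume come from the list above):

```yaml
services:
  inference-server:
    build: .
    ports:
      - "8080:8080"
    volumes:
      - model-store:/models

  metrics:
    build: .
    command: ["edgeinfer", "serve", "--port", "9090"]
    ports:
      - "9090:9090"
    volumes:
      - model-store:/models

volumes:
  model-store:
```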
## API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Server info |
| `/health` | GET | Health status and metrics |
| `/v1/models` | GET | List all models |
| `/v1/models/{id}` | GET | Model details |
| `/v1/models/{id}/load` | POST | Load model into memory |
| `/v1/models/{id}/unload` | POST | Unload model from memory |
| `/v1/completions` | POST | Text completion (OpenAI-compatible) |
| `/v1/chat/completions` | POST | Chat completion (OpenAI-compatible) |
| `/v1/token/count` | POST | Estimate token count |
| `/v1/benchmark` | POST | Run model benchmarks |
| `/v1/resources` | GET | Resource budget and usage |
| `/metrics` | GET | Prometheus metrics |
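For a feel of what `/v1/token/count` might compute, a common rough heuristic (not necessarily the one EdgeInfer uses; `estimate_tokens` is a hypothetical name) assumes about four characters per token for English text:

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate: ~4 characters per token is a common rule of
    thumb for English text. Real tokenizers vary by model and language."""
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)


print(estimate_tokens("Explain zero-trust networking."))  # 30 chars -> 8
```

A real endpoint would tokenize with the target model's vocabulary; a heuristic like this is only useful for budget checks before sending a request.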
## Development

```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest -v

# Lint
ruff check src/ tests/
ruff format src/ tests/
```

## License

MIT License. Copyright (c) 2026 Corey Wade.