cwccie/edgeinfer

EdgeInfer

On-prem LLM inference server for CPU-only enterprise environments. Supports GGUF and ONNX model formats with an OpenAI-compatible REST API.

Architecture

                         EdgeInfer Architecture
    +------------------------------------------------------------------+
    |                                                                  |
    |   Client (curl / SDK / CLI)                                      |
    |       |                                                          |
    |       v                                                          |
    |   +---------------------------+    +-------------------------+   |
    |   |     REST API Server       |    |   Prometheus Metrics    |   |
    |   |  /v1/completions          |--->|   /metrics              |   |
    |   |  /v1/chat/completions     |    +-------------------------+   |
    |   |  /v1/models               |                                  |
    |   |  /v1/benchmark            |                                  |
    |   +-------------+-------------+                                  |
    |                 |                                                |
    |       +---------+---------+                                      |
    |       |                   |                                      |
    |       v                   v                                      |
    |   +-----------+    +-----------------+                           |
    |   | Inference |    |    Resource     |                           |
    |   | Engine    |    |    Manager      |                           |
    |   |           |    | CPU/Mem Budget  |                           |
    |   |  Mock     |    | Queue Mgmt      |                           |
    |   |  Latency  |    | Load/Unload     |                           |
    |   +-----+-----+    +--------+--------+                           |
    |         |                   |                                    |
    |         v                   v                                    |
    |   +-----------------------------------+                          |
    |   |         Model Registry            |                          |
    |   |  GGUF / ONNX metadata             |                          |
    |   |  Versioning + Rollback            |                          |
    |   |  Quantization tracking            |                          |
    |   +-----------------------------------+                          |
    |         |                                                        |
    |         v                                                        |
    |   +-----------------------------------+                          |
    |   |       Model Store (Volume)        |                          |
    |   |  /models/*.gguf  /models/*.onnx   |                          |
    |   +-----------------------------------+                          |
    |                                                                  |
    +------------------------------------------------------------------+

Features

  • Model Registry -- Register GGUF and ONNX models with metadata (size, quantization level, task type). Full version history with rollback support.
  • Inference Engine -- Mock inference with configurable latency and quality based on model size and quantization. Batch inference and streaming stubs.
  • REST API -- OpenAI-compatible /v1/completions and /v1/chat/completions endpoints. Token counting. Model load/unload management.
  • Resource Management -- CPU/memory budget tracking. Model loading/unloading based on demand. Priority queue for concurrent requests.
  • Health Monitoring -- Latency percentiles (P50/P90/P99), throughput, error rates, queue depth. Prometheus-compatible /metrics endpoint.
  • Benchmarking -- Compare models on latency vs quality vs memory footprint. Generate ranked reports. Available via API and CLI.
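The resource manager's priority queue for concurrent requests can be sketched with a standard min-heap; the `RequestQueue` class and priority values below are illustrative, not EdgeInfer's actual internals.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class QueuedRequest:
    priority: int                        # lower value = served first
    seq: int                             # tie-breaker keeps FIFO order within a priority
    payload: dict = field(compare=False)  # request body, excluded from ordering


class RequestQueue:
    """Min-heap of pending inference requests, highest priority first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def submit(self, payload, priority=10):
        heapq.heappush(
            self._heap, QueuedRequest(priority, next(self._counter), payload)
        )

    def next_request(self):
        return heapq.heappop(self._heap).payload


q = RequestQueue()
q.submit({"model": "my-model", "prompt": "background job"}, priority=10)
q.submit({"model": "my-model", "prompt": "interactive"}, priority=1)
print(q.next_request()["prompt"])  # the priority-1 request is served first
```

The sequence counter matters: without it, two requests at the same priority would be compared by payload, which is both unordered and unwanted.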

Installation

pip install edgeinfer

For development:

pip install -e ".[dev]"

Quick Start

Start the server

edgeinfer serve --port 8080 --max-memory-mb 16384 --max-models 4

Register and load a model (via API)

# List models
curl http://localhost:8080/v1/models

# Load a model
curl -X POST http://localhost:8080/v1/models/my-model/load

# Run inference
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "Explain zero-trust networking."}],
    "max_tokens": 256
  }'
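The same chat request can be issued from Python using only the standard library. The payload mirrors the curl call above; the response shape shown in the comment follows the OpenAI chat-completion convention, and a server started with `edgeinfer serve --port 8080` is assumed.

```python
import json
from urllib import request


def build_chat_request(model, content, max_tokens=256):
    """Assemble an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }


payload = build_chat_request("my-model", "Explain zero-trust networking.")

# POST to a running EdgeInfer server:
req = request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```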

CLI Commands

# Start server
edgeinfer serve

# List models (requires running server)
edgeinfer models
edgeinfer models --format json

# Run benchmarks
edgeinfer benchmark
edgeinfer benchmark --format json --iterations 5

# Check health
edgeinfer health
edgeinfer health --format json

Docker

Single container

docker build -t edgeinfer .
docker run -p 8080:8080 -v /path/to/models:/models edgeinfer

Docker Compose (full stack)

docker compose up -d

Services:

  • inference-server -- Main API server on port 8080
  • model-store -- Shared volume for model files
  • metrics -- Dedicated metrics instance on port 9090
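A minimal compose file matching these services might look like the sketch below. The image build context, volume name, and the metrics instance's flags are assumptions for illustration, not the repository's actual docker-compose.yml.

```yaml
services:
  inference-server:
    build: .
    ports:
      - "8080:8080"
    volumes:
      - model-store:/models

  metrics:
    build: .
    command: ["edgeinfer", "serve", "--port", "9090"]  # dedicated metrics instance (assumed flags)
    ports:
      - "9090:9090"

volumes:
  model-store:   # shared volume for model files
```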

API Reference

Endpoint                  Method  Description
/                         GET     Server info
/health                   GET     Health status and metrics
/v1/models                GET     List all models
/v1/models/{id}           GET     Model details
/v1/models/{id}/load      POST    Load model into memory
/v1/models/{id}/unload    POST    Unload model from memory
/v1/completions           POST    Text completion (OpenAI-compatible)
/v1/chat/completions      POST    Chat completion (OpenAI-compatible)
/v1/token/count           POST    Estimate token count
/v1/benchmark             POST    Run model benchmarks
/v1/resources             GET     Resource budget and usage
/metrics                  GET     Prometheus metrics
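The /metrics endpoint emits Prometheus exposition text. For ad-hoc checks without a full Prometheus stack, a few lines of Python can pull out plain values; the metric names in the sample are illustrative, not EdgeInfer's actual series.

```python
def parse_prometheus(text):
    """Parse Prometheus exposition text into {metric_name: value}.

    Comments and HELP/TYPE lines are skipped; labels are dropped, so a
    labeled series keeps only its last value. Enough for quick checks.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]
        metrics[name] = float(value)
    return metrics


# Sample exposition text (metric names are assumed for illustration):
sample = """\
# HELP edgeinfer_requests_total Total requests served.
edgeinfer_requests_total 42
edgeinfer_latency_seconds{quantile="0.99"} 0.35
"""
print(parse_prometheus(sample))
```

To use it against a live server, fetch `http://localhost:8080/metrics` and pass the response body to `parse_prometheus`.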

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest -v

# Lint
ruff check src/ tests/
ruff format src/ tests/

License

MIT License. Copyright (c) 2026 Corey Wade.
