cwccie/edgeinfer

EdgeInfer

On-prem LLM inference server for CPU-only enterprise environments. Supports GGUF and ONNX model formats with an OpenAI-compatible REST API.

Architecture

                         EdgeInfer Architecture
    +------------------------------------------------------------------+
    |                                                                  |
    |   Client (curl / SDK / CLI)                                      |
    |       |                                                          |
    |       v                                                          |
    |   +---------------------------+    +-------------------------+   |
    |   |     REST API Server       |    |   Prometheus Metrics    |   |
    |   |  /v1/completions          |--->|   /metrics              |   |
    |   |  /v1/chat/completions     |    +-------------------------+   |
    |   |  /v1/models               |                                  |
    |   |  /v1/benchmark            |                                  |
    |   +-------------+-------------+                                  |
    |                 |                                                |
    |       +---------+---------+                                      |
    |       |                   |                                      |
    |       v                   v                                      |
    |   +-----------+    +-----------------+                           |
    |   | Inference |    |    Resource     |                           |
    |   | Engine    |    |    Manager      |                           |
    |   |           |    | CPU/Mem Budget  |                           |
    |   |  Mock     |    | Queue Mgmt      |                           |
    |   |  Latency  |    | Load/Unload     |                           |
    |   +-----+-----+    +--------+--------+                           |
    |         |                   |                                    |
    |         v                   v                                    |
    |   +-----------------------------------+                          |
    |   |         Model Registry            |                          |
    |   |  GGUF / ONNX metadata             |                          |
    |   |  Versioning + Rollback            |                          |
    |   |  Quantization tracking            |                          |
    |   +-----------------------------------+                          |
    |         |                                                        |
    |         v                                                        |
    |   +-----------------------------------+                          |
    |   |       Model Store (Volume)        |                          |
    |   |  /models/*.gguf  /models/*.onnx   |                          |
    |   +-----------------------------------+                          |
    |                                                                  |
    +------------------------------------------------------------------+

Features

  • Model Registry -- Register GGUF and ONNX models with metadata (size, quantization level, task type). Full version history with rollback support.
  • Inference Engine -- Mock inference with configurable latency and quality based on model size and quantization. Batch inference and streaming stubs.
  • REST API -- OpenAI-compatible /v1/completions and /v1/chat/completions endpoints. Token counting. Model load/unload management.
  • Resource Management -- CPU/memory budget tracking. Model loading/unloading based on demand. Priority queue for concurrent requests.
  • Health Monitoring -- Latency percentiles (P50/P90/P99), throughput, error rates, queue depth. Prometheus-compatible /metrics endpoint.
  • Benchmarking -- Compare models on latency vs quality vs memory footprint. Generate ranked reports. Available via API and CLI.
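The resource manager's priority queue for concurrent requests can be sketched with a standard min-heap; the `RequestQueue` class and priority values below are illustrative, not EdgeInfer's actual internals.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class QueuedRequest:
    priority: int                        # lower value = served first
    seq: int                             # tie-breaker keeps FIFO order within a priority
    payload: dict = field(compare=False)  # request body, excluded from ordering


class RequestQueue:
    """Min-heap of pending inference requests, highest priority first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def submit(self, payload, priority=10):
        heapq.heappush(
            self._heap, QueuedRequest(priority, next(self._counter), payload)
        )

    def next_request(self):
        return heapq.heappop(self._heap).payload


q = RequestQueue()
q.submit({"model": "my-model", "prompt": "background job"}, priority=10)
q.submit({"model": "my-model", "prompt": "interactive"}, priority=1)
print(q.next_request()["prompt"])  # the priority-1 request is served first
```

The sequence counter matters: without it, two requests at the same priority would be compared by payload, which is both unordered and unwanted.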

Installation

pip install edgeinfer

For development:

pip install -e ".[dev]"

Quick Start

Start the server

edgeinfer serve --port 8080 --max-memory-mb 16384 --max-models 4

Register and load a model (via API)

# List models
curl http://localhost:8080/v1/models

# Load a model
curl -X POST http://localhost:8080/v1/models/my-model/load

# Run inference
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "Explain zero-trust networking."}],
    "max_tokens": 256
  }'
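The same chat request can be issued from Python using only the standard library. The payload mirrors the curl call above; the response shape shown in the comment follows the OpenAI chat-completion convention, and a server started with `edgeinfer serve --port 8080` is assumed.

```python
import json
from urllib import request


def build_chat_request(model, content, max_tokens=256):
    """Assemble an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }


payload = build_chat_request("my-model", "Explain zero-trust networking.")

# POST to a running EdgeInfer server:
req = request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```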

CLI Commands

# Start server
edgeinfer serve

# List models (requires running server)
edgeinfer models
edgeinfer models --format json

# Run benchmarks
edgeinfer benchmark
edgeinfer benchmark --format json --iterations 5

# Check health
edgeinfer health
edgeinfer health --format json

Docker

Single container

docker build -t edgeinfer .
docker run -p 8080:8080 -v /path/to/models:/models edgeinfer

Docker Compose (full stack)

docker compose up -d

Services:

  • inference-server -- Main API server on port 8080
  • model-store -- Shared volume for model files
  • metrics -- Dedicated metrics instance on port 9090
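A minimal compose file matching these services might look like the sketch below. The image build context, volume name, and the metrics instance's flags are assumptions for illustration, not the repository's actual docker-compose.yml.

```yaml
services:
  inference-server:
    build: .
    ports:
      - "8080:8080"
    volumes:
      - model-store:/models

  metrics:
    build: .
    command: ["edgeinfer", "serve", "--port", "9090"]  # dedicated metrics instance (assumed flags)
    ports:
      - "9090:9090"

volumes:
  model-store:   # shared volume for model files
```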

API Reference

Endpoint                  Method  Description
/                         GET     Server info
/health                   GET     Health status and metrics
/v1/models                GET     List all models
/v1/models/{id}           GET     Model details
/v1/models/{id}/load      POST    Load model into memory
/v1/models/{id}/unload    POST    Unload model from memory
/v1/completions           POST    Text completion (OpenAI-compatible)
/v1/chat/completions      POST    Chat completion (OpenAI-compatible)
/v1/token/count           POST    Estimate token count
/v1/benchmark             POST    Run model benchmarks
/v1/resources             GET     Resource budget and usage
/metrics                  GET     Prometheus metrics
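The /metrics endpoint emits Prometheus exposition text. For ad-hoc checks without a full Prometheus stack, a few lines of Python can pull out plain values; the metric names in the sample are illustrative, not EdgeInfer's actual series.

```python
def parse_prometheus(text):
    """Parse Prometheus exposition text into {metric_name: value}.

    Comments and HELP/TYPE lines are skipped; labels are dropped, so a
    labeled series keeps only its last value. Enough for quick checks.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]
        metrics[name] = float(value)
    return metrics


# Sample exposition text (metric names are assumed for illustration):
sample = """\
# HELP edgeinfer_requests_total Total requests served.
edgeinfer_requests_total 42
edgeinfer_latency_seconds{quantile="0.99"} 0.35
"""
print(parse_prometheus(sample))
```

To use it against a live server, fetch `http://localhost:8080/metrics` and pass the response body to `parse_prometheus`.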

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest -v

# Lint
ruff check src/ tests/
ruff format src/ tests/

License

MIT License. Copyright (c) 2026 Corey Wade.
