# LocalForge

**Self-Hosted AI Control Plane for Intelligent Local LLM Orchestration**

A production-grade platform for running, routing, benchmarking, and finetuning local LLMs.

Drop-in OpenAI-compatible API · Intelligent multi-model routing · LoRA finetuning with live monitoring
LocalForge is a self-hosted AI control plane that transforms your GPU workstation into an intelligent LLM serving infrastructure. Instead of manually managing model files, writing inference scripts, and guessing which model fits which task — LocalForge automates the entire lifecycle:
- Browse & Download GGUF models from HuggingFace with automatic VRAM compatibility filtering
- Serve models via a fully OpenAI-compatible `/v1/chat/completions` endpoint
- Route queries to the optimal model using ML-powered task classification + multi-signal scoring
- Learn from usage patterns via a vector-based memory layer that improves routing over time
- Benchmark models against standard evaluations (MMLU-Pro, HumanEval, GSM8K, GPQA, MT-Bench)
- Finetune models with LoRA/QLoRA via a managed subprocess pipeline with live loss streaming
- Augment responses with a RAG knowledge base layer for domain-specific context injection
**Architecture**

```
┌──────────────────────────────────────────────────────────────────┐
│                         Next.js Frontend                         │
│   Dashboard · Models · Benchmarks · Traces · Memory · Finetune   │
└────────────────────────────┬─────────────────────────────────────┘
                             │ REST + SSE
┌────────────────────────────▼─────────────────────────────────────┐
│                         FastAPI Backend                          │
│ ┌──────────┐ ┌─────────┐ ┌──────────┐ ┌────────┐ ┌────────────┐  │
│ │  Router  │ │Lifecycle│ │Inference │ │ Memory │ │  Finetune  │  │
│ │  Engine  │ │ Manager │ │  Engine  │ │ Layer  │ │   Engine   │  │
│ └────┬─────┘ └────┬────┘ └────┬─────┘ └───┬────┘ └─────┬──────┘  │
│      │            │           │           │            │         │
│ ┌────▼─────┐ ┌────▼────┐  ┌───▼────┐ ┌────▼────┐ ┌─────▼──────┐  │
│ │Classifier│ │ SQLite  │  │ llama  │ │ Qdrant  │ │  Training  │  │
│ │ (TF-IDF) │ │  (WAL)  │  │  .cpp  │ │(Vector) │ │   Worker   │  │
│ └──────────┘ └─────────┘  │ server │ └─────────┘ │(Subprocess │  │
│                           └────────┘             │ PEFT/TRL)  │  │
│ ┌──────────┐ ┌──────────┐ ┌──────────────────┐   └────────────┘  │
│ │Benchmark │ │   Auth   │ │    RAG Layer     │                   │
│ │ Fetcher  │ │ (Bearer) │ │  (LlamaIndex +   │                   │
│ └──────────┘ └──────────┘ │     Qdrant)      │                   │
│                           └──────────────────┘                   │
└──────────────────────────────────────────────────────────────────┘
```
**Intelligent Routing**

- ML-Powered Task Classification — TF-IDF + Logistic Regression classifier categorizes queries into `coding`, `math`, `reasoning`, `instruction`, `hard_reasoning`, or `general` with ~85% accuracy and <5 ms inference
- Multi-Signal Scoring — Routes based on a weighted combination of benchmark scores (40%), memory-based success history (30%), latency (15%), and user feedback (15%)
- Memory-Enhanced Routing — Qdrant vector store indexes past query→model outcomes; recency-weighted exponential decay ensures fresh interactions matter more
- Fallback Evidence — When routing confidence is low, the system checks for historical evidence of any model succeeding on similar queries
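The weighted combination and recency decay above can be sketched as follows. The weights match the documented defaults (40/30/15/15); the decay half-life, the score normalization, and the candidate numbers are illustrative assumptions, not LocalForge's actual implementation.

```python
# Illustrative sketch of the multi-signal routing score described above.
# Weights are the documented defaults; everything else here is made up.
WEIGHTS = {"benchmark": 0.40, "memory": 0.30, "latency": 0.15, "feedback": 0.15}

def recency_weight(age_seconds: float, half_life: float = 7 * 86400.0) -> float:
    """Exponential decay so fresh memory entries count more (half-life assumed)."""
    return 0.5 ** (age_seconds / half_life)

def route_score(signals: dict[str, float]) -> float:
    """Weighted combination of per-model signals, each normalized to [0, 1]."""
    return sum(WEIGHTS[k] * signals[k] for k in WEIGHTS)

# Hypothetical per-model signal values, already normalized.
candidates = {
    "qwen2.5-7b": {"benchmark": 0.82, "memory": 0.90, "latency": 0.70, "feedback": 0.95},
    "llama3-8b":  {"benchmark": 0.78, "memory": 0.60, "latency": 0.85, "feedback": 0.80},
}
best = max(candidates, key=lambda name: route_score(candidates[name]))
```

A model with a strong benchmark profile can still lose to one with a better recent success history, since the memory signal carries 30% of the score.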
**Model Lifecycle Management**

- One-Click Downloads from HuggingFace with VRAM-aware filtering
- Hot-Swap Architecture — Single-model-hot constraint for consumer hardware; atomic state transitions (UNLOADED → LOADING → HOT → UNLOADING)
- Resident Model — Most frequently used model auto-detected and kept loaded
- Finetuning Lock — Models being finetuned are excluded from routing
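A minimal sketch of the atomic state machine above; the legal edges follow the documented UNLOADED → LOADING → HOT → UNLOADING cycle, while the LOADING → UNLOADED edge (a failed load) is an assumption:

```python
from enum import Enum

class ModelState(Enum):
    UNLOADED = "unloaded"
    LOADING = "loading"
    HOT = "hot"
    UNLOADING = "unloading"

# Legal edges of the single-hot-model lifecycle described above.
# LOADING -> UNLOADED (a failed load) is an assumed failure edge.
TRANSITIONS = {
    ModelState.UNLOADED: {ModelState.LOADING},
    ModelState.LOADING: {ModelState.HOT, ModelState.UNLOADED},
    ModelState.HOT: {ModelState.UNLOADING},
    ModelState.UNLOADING: {ModelState.UNLOADED},
}

def transition(current: ModelState, target: ModelState) -> ModelState:
    """Move to `target` only along a legal edge; reject illegal jumps."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Rejecting illegal jumps (e.g. UNLOADED straight to HOT) is what keeps a single-GPU box from ever holding two hot models.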
**OpenAI-Compatible API**

- Drop-in replacement for `openai.ChatCompletion.create()`
- Streaming (SSE) and non-streaming responses
- Bearer token authentication with auto-generated API keys (`lf-{hex}` format)
- Custom headers expose routing metadata (`X-LocalForge-Model`, `X-LocalForge-Task`)
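Because the endpoint is OpenAI-compatible, any OpenAI client works against it. The stdlib sketch below just builds such a request; the API key is a placeholder in the documented `lf-{hex}` format, and the send is left commented out so it only runs against a live server:

```python
import json
import urllib.request

API_URL = "http://127.0.0.1:8010/v1/chat/completions"
API_KEY = "lf-0123456789abcdef"  # placeholder; use a key from the Keys page

payload = {
    "model": "auto",  # let the router pick the best local model
    "messages": [{"role": "user", "content": "Write a haiku about GPUs"}],
    "stream": False,
}
req = urllib.request.Request(
    API_URL,
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",
    },
)
# With the backend running, send it and inspect the routing metadata headers:
# with urllib.request.urlopen(req) as resp:
#     print(resp.headers.get("X-LocalForge-Model"))
#     print(json.load(resp)["choices"][0]["message"]["content"])
```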
**Benchmarking**

- Automated Fetch — Pulls scores from HuggingFace model cards and the Open LLM Leaderboard
- Local Mini-Eval — Runs curated questions per task type through the inference engine for models without published scores
- Multi-Model Comparison — Side-by-side radar charts across MMLU-Pro, HumanEval, GSM8K, GPQA, MT-Bench
**LoRA/QLoRA Finetuning**

- Managed Training — Background subprocess with full lifecycle control (start, monitor, cancel)
- Live Loss Streaming — SSE-powered real-time loss curves via JSONL log tailing
- Dual Backend — Unsloth (2× faster, 60% less VRAM) or standard PEFT + TRL
- Automatic GGUF Export — Finetuned models exported and auto-registered in the model registry
- Before/After Comparison — Generates side-by-side outputs on held-out validation samples
- Dataset Validation — Supports CSV, JSONL, Alpaca JSON with preview and error reporting
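For the dataset formats above, an Alpaca-style JSONL file might look like the sketch below. The exact field names the validator expects are an assumption based on the common `instruction`/`input`/`output` convention:

```python
import json

# Two made-up Alpaca-style rows (field names assumed, not confirmed).
rows = [
    {"instruction": "Summarize the text.",
     "input": "LocalForge routes queries to local models.",
     "output": "A router for local LLMs."},
    {"instruction": "Translate to French.", "input": "Hello", "output": "Bonjour"},
]

with open("train.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# A validation pass similar in spirit to the upload endpoint's checks:
required = {"instruction", "output"}
with open("train.jsonl") as f:
    for lineno, line in enumerate(f, 1):
        record = json.loads(line)
        missing = required - record.keys()
        assert not missing, f"line {lineno}: missing fields {missing}"
```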
**RAG Knowledge Bases**

- Document Ingestion — Upload PDFs and text files; chunked via LlamaIndex `SentenceSplitter`
- Semantic Search — Embedded chunks stored in Qdrant; retrieved at query time and injected into the system prompt
- Task-Aware KB Routing — Router automatically selects the matching knowledge base by task type
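Context injection into the system prompt can be sketched roughly like this; it illustrates the pattern, not LocalForge's actual prompt template:

```python
def inject_context(chunks: list[str], user_query: str) -> list[dict]:
    """Build an OpenAI-style message list with retrieved chunks in the system prompt."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, 1))
    system = "Answer using the context below when it is relevant.\n\nContext:\n" + context
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_query},
    ]
```

Numbering the chunks lets the model cite which retrieved passage it drew from.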
**Observability**

- Real-time hardware profiling (GPU, VRAM, RAM, CPU via pynvml/psutil)
- Request volume, latency trends, model distribution charts
- Full routing decision traces with per-candidate scoring breakdowns
- Memory layer statistics with per-model success rates
**Prerequisites**

- Python 3.10+ with a virtual environment
- Node.js 20+ and npm
- NVIDIA GPU (recommended) with compatible drivers
- ~4GB disk for the smallest GGUF model
**Backend Setup**

```bash
git clone https://github.com/al1-nasir/LocalForge.git
cd LocalForge/backend
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
# Edit .env — set LOCALFORGE_SECRET_KEY, optionally add HF_TOKEN for gated models
uvicorn app.main:app --port 8010
```

The API is now live at http://127.0.0.1:8010. Visit http://127.0.0.1:8010/docs for interactive API documentation.
**Frontend Setup**

```bash
cd ../frontend
npm install
# Create .env.local pointing to your backend
echo "NEXT_PUBLIC_API_URL=http://127.0.0.1:8010" > .env.local
npm run dev
```

Open http://localhost:3000 to access the dashboard.
**Download a Model**

Navigate to Models in the dashboard, search for a model (e.g. Qwen2.5), and click download. The system auto-filters GGUF files that fit your hardware's VRAM.
**Send a Chat Request**

```bash
curl http://127.0.0.1:8010/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Explain quantum computing in one paragraph"}]
  }'
```

**Core Endpoints**

| Endpoint | Method | Description |
|---|---|---|
| `/v1/chat/completions` | POST | OpenAI-compatible chat (streaming + non-streaming) |
| `/health` | GET | System health check |
| `/docs` | GET | Interactive Swagger documentation |
**Model Management**

| Endpoint | Method | Description |
|---|---|---|
| `/api/models` | GET | List all registered models |
| `/api/models/browse` | GET | Search HuggingFace for GGUF models |
| `/api/models/download` | POST | Download a model from HuggingFace |
| `/api/models/load` | POST | Load a model into GPU memory |
| `/api/models/unload` | POST | Unload the current hot model |
| `/api/models/{id}` | GET / DELETE | Get or remove a specific model |
**Benchmarks**

| Endpoint | Method | Description |
|---|---|---|
| `/api/benchmarks/{id}` | GET | Get benchmark scores for a model |
| `/api/benchmarks/{id}/fetch` | POST | Fetch scores from HuggingFace/Leaderboard |
| `/api/benchmarks/{id}/eval` | POST | Run local mini-evaluation |
| `/api/benchmarks/compare/models` | GET | Multi-model benchmark comparison |
**Finetuning**

| Endpoint | Method | Description |
|---|---|---|
| `/api/finetune/backend` | GET | Check available training backend |
| `/api/finetune/upload` | POST | Upload and validate a dataset |
| `/api/finetune/start` | POST | Start a finetuning job |
| `/api/finetune/{id}` | GET | Get job status with live loss data |
| `/api/finetune/{id}/stream` | GET | SSE stream for real-time loss updates |
| `/api/finetune/{id}/cancel` | POST | Cancel a running job |
**Knowledge Bases (RAG)**

| Endpoint | Method | Description |
|---|---|---|
| `/api/knowledge-bases` | GET / POST | List or create knowledge bases |
| `/api/knowledge-bases/{id}` | DELETE | Delete a knowledge base |
| `/api/knowledge-bases/{id}/documents` | GET / POST | List or upload documents |
**Dashboard & System**

| Endpoint | Method | Description |
|---|---|---|
| `/api/dashboard/stats` | GET | Aggregate dashboard statistics |
| `/api/dashboard/traces` | GET | Recent routing decision traces |
| `/api/dashboard/memory-stats` | GET | Memory layer statistics |
| `/api/hardware` | GET | Current hardware profile |
| `/api/keys` | GET / POST | Manage API keys |
All settings use the `LOCALFORGE_` prefix (except `HF_TOKEN`) and can be set via environment variables or `.env`:
| Variable | Default | Description |
|---|---|---|
| `LOCALFORGE_PORT` | `8000` | Backend server port |
| `LOCALFORGE_SECRET_KEY` | — | Secret for API key hashing |
| `LOCALFORGE_DB_PATH` | `data/localforge.db` | SQLite database location |
| `LOCALFORGE_MODELS_DIR` | `data/models` | Downloaded model storage |
| `LOCALFORGE_DEFAULT_CTX_SIZE` | `4096` | Default context window |
| `LOCALFORGE_DEFAULT_N_GPU_LAYERS` | `-1` | GPU layers (-1 = all) |
| `LOCALFORGE_ROUTER_BENCHMARK_WEIGHT` | `0.4` | Benchmark signal weight |
| `LOCALFORGE_ROUTER_MEMORY_WEIGHT` | `0.3` | Memory signal weight |
| `LOCALFORGE_MEMORY_EMBEDDING_MODEL` | `nomic-ai/nomic-embed-text-v1.5` | Embedding model |
| `LOCALFORGE_FINETUNE_MAX_SEQ_LENGTH` | `2048` | Max sequence length for training |
| `HF_TOKEN` | — | HuggingFace token for gated models |
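Put together, a minimal `.env` using the variables above might look like this (all values are placeholders):

```ini
LOCALFORGE_PORT=8010
LOCALFORGE_SECRET_KEY=replace-with-a-long-random-string
LOCALFORGE_MODELS_DIR=data/models
LOCALFORGE_DEFAULT_CTX_SIZE=8192
LOCALFORGE_ROUTER_BENCHMARK_WEIGHT=0.4
# Only needed for gated models
HF_TOKEN=hf_your_token_here
```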
**Backend**

| Component | Technology |
|---|---|
| API Framework | FastAPI 0.115+ |
| Database | SQLite (aiosqlite, WAL mode) |
| Inference | llama.cpp (via llama-cpp-python) |
| Vector Store | Qdrant (disk-persisted, no Docker) |
| Embeddings | nomic-embed-text-v1.5 (sentence-transformers) |
| Finetuning | PEFT + TRL (or Unsloth) |
| RAG | LlamaIndex Core |
| Task Classifier | scikit-learn (TF-IDF + LogReg) |
| Hardware Detection | pynvml + psutil |
**Frontend**

| Component | Technology |
|---|---|
| Framework | Next.js 16 (Turbopack) |
| UI | React 19 + Lucide Icons |
| Charts | Recharts |
| Styling | Vanilla CSS with design tokens |
| Typography | Inter + JetBrains Mono (Google Fonts) |
```
LocalForge/
├── backend/
│   ├── app/
│   │   ├── main.py                  # FastAPI entry point & lifespan
│   │   ├── config.py                # Pydantic settings (env-driven)
│   │   ├── database.py              # SQLite schema & connection
│   │   ├── schemas.py               # Request/response Pydantic models
│   │   ├── api/                     # Route handlers
│   │   │   ├── chat.py              # /v1/chat/completions (OpenAI-compat)
│   │   │   ├── models.py            # Model CRUD, browse, download
│   │   │   ├── benchmarks.py        # Benchmark fetch & local eval
│   │   │   ├── finetune.py          # Finetune job management
│   │   │   ├── knowledge.py         # RAG knowledge base management
│   │   │   ├── dashboard.py         # Stats, traces, trends
│   │   │   ├── hardware.py          # GPU/RAM detection
│   │   │   ├── keys.py              # API key management
│   │   │   └── feedback.py          # Thumbs up/down feedback
│   │   └── core/                    # Business logic engines
│   │       ├── router.py            # Multi-signal model router
│   │       ├── lifecycle.py         # Model state machine
│   │       ├── inference.py         # llama.cpp server management
│   │       ├── memory.py            # Qdrant-backed memory layer
│   │       ├── finetune_engine.py   # Finetune orchestrator
│   │       ├── _train_worker.py     # Training subprocess
│   │       ├── rag.py               # Document ingestion & retrieval
│   │       ├── query_classifier.py  # TF-IDF task classifier
│   │       ├── benchmark_fetcher.py # HF/Leaderboard score fetch
│   │       ├── local_eval.py        # Local benchmark evaluation
│   │       ├── model_browser.py     # HuggingFace GGUF search
│   │       ├── hardware.py          # GPU/VRAM detection
│   │       └── auth.py              # API key auth
│   ├── data/                        # Runtime data (DB, models, etc.)
│   ├── requirements.txt
│   └── .env
├── frontend/
│   ├── src/
│   │   ├── app/                     # Next.js pages
│   │   │   ├── page.tsx             # Dashboard
│   │   │   ├── models/              # Model browser & registry
│   │   │   ├── benchmarks/          # Benchmark comparison
│   │   │   ├── traces/              # Routing decision traces
│   │   │   ├── memory/              # Memory layer stats
│   │   │   ├── knowledge/           # Knowledge base management
│   │   │   ├── finetune/            # Finetuning UI
│   │   │   └── keys/                # API key management
│   │   ├── components/
│   │   │   └── Sidebar.tsx          # Navigation sidebar
│   │   └── lib/
│   │       └── api.ts               # Typed API client
│   ├── package.json
│   └── .env.local
└── README.md
```
**Running Tests**

```bash
cd backend
source venv/bin/activate

# Test API endpoints
python test_endpoints.py

# Test RAG pipeline
python ../test_rag.py
```

**Adding a New Feature**

- Define Pydantic schemas in `backend/app/schemas.py`
- Add business logic in `backend/app/core/`
- Create a route handler in `backend/app/api/`
- Register the router in `backend/app/main.py`
- Add the frontend API call in `frontend/src/lib/api.ts`
**Roadmap**

- Multi-GPU support and model parallelism
- Cloud fallback (OpenAI/Gemini) when local models are insufficient
- Automated A/B testing between model versions
- Plugin system for custom routing strategies
- Docker Compose deployment with Qdrant server mode
- RLHF data collection from user feedback
MIT License — see LICENSE for details.
Built with 🔥 by the LocalForge team
