

⚡ LocalForge

Self-Hosted AI Control Plane for Intelligent Local LLM Orchestration

A production-grade platform for running, routing, benchmarking, and finetuning local LLMs.
Drop-in OpenAI-compatible API · Intelligent multi-model routing · LoRA finetuning with live monitoring.


Overview

LocalForge is a self-hosted AI control plane that transforms your GPU workstation into an intelligent LLM serving infrastructure. Instead of manually managing model files, writing inference scripts, and guessing which model fits which task — LocalForge automates the entire lifecycle:

  1. Browse & Download GGUF models from HuggingFace with automatic VRAM compatibility filtering
  2. Serve models via a fully OpenAI-compatible /v1/chat/completions endpoint
  3. Route queries to the optimal model using ML-powered task classification + multi-signal scoring
  4. Learn from usage patterns via a vector-based memory layer that improves routing over time
  5. Benchmark models against standard evaluations (MMLU-Pro, HumanEval, GSM8K, GPQA, MT-Bench)
  6. Finetune models with LoRA/QLoRA via a managed subprocess pipeline with live loss streaming
  7. Augment responses with a RAG knowledge base layer for domain-specific context injection

Architecture

┌───────────────────────────────────────────────────────────────────┐
│                        Next.js Frontend                           │
│   Dashboard · Models · Benchmarks · Traces · Memory · Finetune    │
└────────────────────────────┬──────────────────────────────────────┘
                             │ REST + SSE
┌────────────────────────────▼──────────────────────────────────────┐
│                       FastAPI Backend                             │
│  ┌──────────┐ ┌─────────┐ ┌──────────┐ ┌────────┐ ┌────────────┐  │
│  │  Router  │ │Lifecycle│ │Inference │ │ Memory │ │  Finetune  │  │
│  │  Engine  │ │ Manager │ │  Engine  │ │ Layer  │ │   Engine   │  │
│  └────┬─────┘ └────┬────┘ └────┬─────┘ └───┬────┘ └─────┬──────┘  │
│       │            │           │           │            │         │
│  ┌────▼─────┐ ┌────▼────┐ ┌────▼─────┐ ┌───▼────┐ ┌─────▼──────┐  │
│  │Classifier│ │ SQLite  │ │  llama   │ │ Qdrant │ │  Training  │  │
│  │(TF-IDF)  │ │  (WAL)  │ │  .cpp    │ │(Vector)│ │   Worker   │  │
│  └──────────┘ └─────────┘ │  server  │ └────────┘ │(Subprocess │  │
│                           └──────────┘            │ PEFT/TRL)  │  │
│  ┌──────────┐  ┌──────────┐  ┌──────────────────┐ └────────────┘  │
│  │Benchmark │  │   Auth   │  │   RAG Layer      │                 │
│  │ Fetcher  │  │ (Bearer) │  │ (LlamaIndex +    │                 │
│  └──────────┘  └──────────┘  │  Qdrant)         │                 │
│                              └──────────────────┘                 │
└───────────────────────────────────────────────────────────────────┘

Features

🧠 Intelligent Multi-Model Router

  • ML-Powered Task Classification — TF-IDF + Logistic Regression classifier categorizes queries into coding, math, reasoning, instruction, hard_reasoning, or general with ~85% accuracy and <5ms inference
  • Multi-Signal Scoring — Routes based on a weighted combination of benchmark scores (40%), memory-based success history (30%), latency (15%), and user feedback (15%)
  • Memory-Enhanced Routing — Qdrant vector store indexes past query→model outcomes; recency-weighted exponential decay ensures fresh interactions matter more
  • Fallback Evidence — When routing confidence is low, the system checks for historical evidence of any model succeeding on similar queries
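As a rough sketch (not LocalForge's actual implementation), the weighted combination described above could look like this; the field names and 0–1 normalization are illustrative assumptions:

```python
# Illustrative multi-signal router scoring; weights match the documented
# defaults (benchmark 0.4, memory 0.3, latency 0.15, feedback 0.15).
from dataclasses import dataclass

WEIGHTS = {"benchmark": 0.40, "memory": 0.30, "latency": 0.15, "feedback": 0.15}

@dataclass
class Candidate:
    name: str
    benchmark: float  # normalized benchmark score for the classified task type
    memory: float     # recency-weighted success rate from the memory layer
    latency: float    # normalized speed signal (higher = faster)
    feedback: float   # normalized user feedback score

def score(c: Candidate) -> float:
    """Weighted sum of the four routing signals."""
    return (WEIGHTS["benchmark"] * c.benchmark
            + WEIGHTS["memory"] * c.memory
            + WEIGHTS["latency"] * c.latency
            + WEIGHTS["feedback"] * c.feedback)

def route(candidates: list[Candidate]) -> Candidate:
    """Pick the highest-scoring candidate model."""
    return max(candidates, key=score)
```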

📦 Model Lifecycle Management

  • One-Click Downloads from HuggingFace with VRAM-aware filtering
  • Hot-Swap Architecture — Single-model-hot constraint for consumer hardware; atomic state transitions (UNLOADED → LOADING → HOT → UNLOADING)
  • Resident Model — Most frequently used model auto-detected and kept loaded
  • Finetuning Lock — Models being finetuned are excluded from routing

🔥 OpenAI-Compatible API

  • Drop-in replacement for openai.ChatCompletion.create()
  • Streaming (SSE) and non-streaming responses
  • Bearer token authentication with auto-generated API keys (lf-{hex} format)
  • Custom headers expose routing metadata (X-LocalForge-Model, X-LocalForge-Task)

📊 Benchmarking & Evaluation

  • Automated Fetch — Pulls scores from HuggingFace model cards and Open LLM Leaderboard
  • Local Mini-Eval — Runs curated questions per task type through the inference engine for models without published scores
  • Multi-Model Comparison — Side-by-side radar charts across MMLU-Pro, HumanEval, GSM8K, GPQA, MT-Bench

🧬 LoRA Finetuning Pipeline

  • Managed Training — Background subprocess with full lifecycle control (start, monitor, cancel)
  • Live Loss Streaming — SSE-powered real-time loss curves via JSONL log tailing
  • Dual Backend — Unsloth (2× faster, 60% less VRAM) or standard PEFT + TRL
  • Automatic GGUF Export — Finetuned models exported and auto-registered in the model registry
  • Before/After Comparison — Generates side-by-side outputs on held-out validation samples
  • Dataset Validation — Supports CSV, JSONL, Alpaca JSON with preview and error reporting
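For reference, a minimal check of an Alpaca-style JSONL record might look like this; the exact fields and rules LocalForge's validator enforces may differ:

```python
# Illustrative Alpaca-style JSONL validation; field names follow the common
# Alpaca schema, which is an assumption about the accepted format.
import json

REQUIRED_FIELDS = {"instruction", "output"}  # "input" is optional in Alpaca format

def validate_line(line: str) -> bool:
    """Return True if the line is a JSON object with the required fields."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    return isinstance(record, dict) and REQUIRED_FIELDS <= record.keys()

sample = '{"instruction": "Summarize the text.", "input": "LocalForge is a control plane.", "output": "A self-hosted LLM platform."}'
```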

📚 RAG Knowledge Base

  • Document Ingestion — Upload PDFs and text files; chunked via LlamaIndex SentenceSplitter
  • Semantic Search — Embedded chunks stored in Qdrant; retrieved at query time and injected into system prompt
  • Task-Aware KB Routing — Router automatically selects the matching knowledge base by task type
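Conceptually, the retrieval step ends with context injection along these lines; the prompt template and chunk formatting are assumptions, not LocalForge's actual template:

```python
# Sketch of RAG context injection into the system prompt.
def inject_context(system_prompt: str, chunks: list[str]) -> str:
    """Append retrieved knowledge-base chunks to the system prompt."""
    numbered = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        f"{system_prompt}\n\n"
        f"Use the following retrieved context when relevant:\n{numbered}"
    )
```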

🖥️ Dashboard & Observability

  • Real-time hardware profiling (GPU, VRAM, RAM, CPU via pynvml/psutil)
  • Request volume, latency trends, model distribution charts
  • Full routing decision traces with per-candidate scoring breakdowns
  • Memory layer statistics with per-model success rates

Quick Start

Prerequisites

  • Python 3.10+ with a virtual environment
  • Node.js 20+ and npm
  • NVIDIA GPU (recommended) with compatible drivers
  • ~4GB disk for the smallest GGUF model

1. Clone & Install Backend

git clone https://github.com/al1-nasir/LocalForge.git
cd LocalForge/backend

python3 -m venv venv
source venv/bin/activate

pip install -r requirements.txt

2. Configure Environment

cp .env.example .env
# Edit .env — set LOCALFORGE_SECRET_KEY, optionally add HF_TOKEN for gated models

3. Start the Backend

uvicorn app.main:app --port 8010

The API is now live at http://127.0.0.1:8010. Visit http://127.0.0.1:8010/docs for interactive API documentation.

4. Install & Start Frontend

cd ../frontend
npm install

# Create .env.local pointing to your backend
echo "NEXT_PUBLIC_API_URL=http://127.0.0.1:8010" > .env.local

npm run dev

Open http://localhost:3000 to access the dashboard.

5. Download Your First Model

Navigate to Models in the dashboard, search for a model (e.g. Qwen2.5), and click download. The system will auto-filter GGUF files that fit your hardware's VRAM.

6. Send Your First Request

curl http://127.0.0.1:8010/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Explain quantum computing in one paragraph"}]
  }'
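The same request can be sent from Python using only the standard library; the endpoint and lf-... Bearer key format come from the sections above, while the helper name is purely illustrative:

```python
# Minimal stdlib client for the curl request above; build_chat_request is an
# illustrative helper, not part of LocalForge.
import json
import urllib.request

def build_chat_request(base_url: str, prompt: str, api_key: str = "") -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    body = json.dumps({
        "model": "auto",  # let the router pick the best local model
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    headers = {"Content-Type": "application/json"}
    if api_key:  # keys use the lf-{hex} format described above
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions", data=body, headers=headers, method="POST"
    )

# With the backend running:
# with urllib.request.urlopen(build_chat_request("http://127.0.0.1:8010", "Hi")) as r:
#     print(json.load(r)["choices"][0]["message"]["content"])
```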

API Reference

Core Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /v1/chat/completions | POST | OpenAI-compatible chat (streaming + non-streaming) |
| /health | GET | System health check |
| /docs | GET | Interactive Swagger documentation |

Model Management

| Endpoint | Method | Description |
|---|---|---|
| /api/models | GET | List all registered models |
| /api/models/browse | GET | Search HuggingFace for GGUF models |
| /api/models/download | POST | Download a model from HuggingFace |
| /api/models/load | POST | Load a model into GPU memory |
| /api/models/unload | POST | Unload the current hot model |
| /api/models/{id} | GET / DELETE | Get or remove a specific model |

Benchmarking

| Endpoint | Method | Description |
|---|---|---|
| /api/benchmarks/{id} | GET | Get benchmark scores for a model |
| /api/benchmarks/{id}/fetch | POST | Fetch scores from HuggingFace/Leaderboard |
| /api/benchmarks/{id}/eval | POST | Run local mini-evaluation |
| /api/benchmarks/compare/models | GET | Multi-model benchmark comparison |

Finetuning

| Endpoint | Method | Description |
|---|---|---|
| /api/finetune/backend | GET | Check available training backend |
| /api/finetune/upload | POST | Upload and validate a dataset |
| /api/finetune/start | POST | Start a finetuning job |
| /api/finetune/{id} | GET | Get job status with live loss data |
| /api/finetune/{id}/stream | GET | SSE stream for real-time loss updates |
| /api/finetune/{id}/cancel | POST | Cancel a running job |
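A client can consume the finetune stream endpoint as standard Server-Sent Events. This minimal parser assumes each event's data payload is a JSON object with step/loss fields, which is an assumption about the event format:

```python
# Minimal SSE parser for the live loss stream.
import json

def parse_sse_events(raw: str) -> list[dict]:
    """Extract JSON payloads from the 'data: ...' lines of an SSE stream."""
    events = []
    for line in raw.splitlines():
        if line.startswith("data: "):
            events.append(json.loads(line[len("data: "):]))
    return events

# e.g. plot [(e["step"], e["loss"]) for e in parse_sse_events(buffered_text)]
```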

Knowledge Base (RAG)

| Endpoint | Method | Description |
|---|---|---|
| /api/knowledge-bases | GET / POST | List or create knowledge bases |
| /api/knowledge-bases/{id} | DELETE | Delete a knowledge base |
| /api/knowledge-bases/{id}/documents | GET / POST | List or upload documents |

Dashboard & Observability

| Endpoint | Method | Description |
|---|---|---|
| /api/dashboard/stats | GET | Aggregate dashboard statistics |
| /api/dashboard/traces | GET | Recent routing decision traces |
| /api/dashboard/memory-stats | GET | Memory layer statistics |
| /api/hardware | GET | Current hardware profile |
| /api/keys | GET / POST | Manage API keys |

Configuration

All settings use the LOCALFORGE_ prefix and can be set via environment variables or .env:

| Variable | Default | Description |
|---|---|---|
| LOCALFORGE_PORT | 8000 | Backend server port |
| LOCALFORGE_SECRET_KEY | (none) | Secret for API key hashing |
| LOCALFORGE_DB_PATH | data/localforge.db | SQLite database location |
| LOCALFORGE_MODELS_DIR | data/models | Downloaded model storage |
| LOCALFORGE_DEFAULT_CTX_SIZE | 4096 | Default context window |
| LOCALFORGE_DEFAULT_N_GPU_LAYERS | -1 | GPU layers (-1 = all) |
| LOCALFORGE_ROUTER_BENCHMARK_WEIGHT | 0.4 | Benchmark signal weight |
| LOCALFORGE_ROUTER_MEMORY_WEIGHT | 0.3 | Memory signal weight |
| LOCALFORGE_MEMORY_EMBEDDING_MODEL | nomic-ai/nomic-embed-text-v1.5 | Embedding model |
| LOCALFORGE_FINETUNE_MAX_SEQ_LENGTH | 2048 | Max sequence length for training |
| HF_TOKEN | (none) | HuggingFace token for gated models |
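For example, a minimal .env might look like the following; the values are placeholders, and any variable left out keeps the default listed above:

```
LOCALFORGE_PORT=8010
LOCALFORGE_SECRET_KEY=replace-with-a-long-random-string
LOCALFORGE_DEFAULT_CTX_SIZE=8192

# Optional: only needed for gated HuggingFace models
HF_TOKEN=...
```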

Tech Stack

Backend

| Component | Technology |
|---|---|
| API Framework | FastAPI 0.115+ |
| Database | SQLite (aiosqlite, WAL mode) |
| Inference | llama.cpp (via llama-cpp-python) |
| Vector Store | Qdrant (disk-persisted, no Docker) |
| Embeddings | nomic-embed-text-v1.5 (sentence-transformers) |
| Finetuning | PEFT + TRL (or Unsloth) |
| RAG | LlamaIndex Core |
| Task Classifier | scikit-learn (TF-IDF + LogReg) |
| Hardware Detection | pynvml + psutil |

Frontend

| Component | Technology |
|---|---|
| Framework | Next.js 16 (Turbopack) |
| UI | React 19 + Lucide Icons |
| Charts | Recharts |
| Styling | Vanilla CSS with design tokens |
| Typography | Inter + JetBrains Mono (Google Fonts) |

Project Structure

LocalForge/
├── backend/
│   ├── app/
│   │   ├── main.py              # FastAPI entry point & lifespan
│   │   ├── config.py            # Pydantic settings (env-driven)
│   │   ├── database.py          # SQLite schema & connection
│   │   ├── schemas.py           # Request/response Pydantic models
│   │   ├── api/                 # Route handlers
│   │   │   ├── chat.py          # /v1/chat/completions (OpenAI-compat)
│   │   │   ├── models.py        # Model CRUD, browse, download
│   │   │   ├── benchmarks.py    # Benchmark fetch & local eval
│   │   │   ├── finetune.py      # Finetune job management
│   │   │   ├── knowledge.py     # RAG knowledge base management
│   │   │   ├── dashboard.py     # Stats, traces, trends
│   │   │   ├── hardware.py      # GPU/RAM detection
│   │   │   ├── keys.py          # API key management
│   │   │   └── feedback.py      # Thumbs up/down feedback
│   │   └── core/                # Business logic engines
│   │       ├── router.py        # Multi-signal model router
│   │       ├── lifecycle.py     # Model state machine
│   │       ├── inference.py     # llama.cpp server management
│   │       ├── memory.py        # Qdrant-backed memory layer
│   │       ├── finetune_engine.py # Finetune orchestrator
│   │       ├── _train_worker.py # Training subprocess
│   │       ├── rag.py           # Document ingestion & retrieval
│   │       ├── query_classifier.py # TF-IDF task classifier
│   │       ├── benchmark_fetcher.py # HF/Leaderboard score fetch
│   │       ├── local_eval.py    # Local benchmark evaluation
│   │       ├── model_browser.py # HuggingFace GGUF search
│   │       ├── hardware.py      # GPU/VRAM detection
│   │       └── auth.py          # API key auth
│   ├── data/                    # Runtime data (DB, models, etc.)
│   ├── requirements.txt
│   └── .env
├── frontend/
│   ├── src/
│   │   ├── app/                 # Next.js pages
│   │   │   ├── page.tsx         # Dashboard
│   │   │   ├── models/          # Model browser & registry
│   │   │   ├── benchmarks/      # Benchmark comparison
│   │   │   ├── traces/          # Routing decision traces
│   │   │   ├── memory/          # Memory layer stats
│   │   │   ├── knowledge/       # Knowledge base management
│   │   │   ├── finetune/        # Finetuning UI
│   │   │   └── keys/            # API key management
│   │   ├── components/
│   │   │   └── Sidebar.tsx      # Navigation sidebar
│   │   └── lib/
│   │       └── api.ts           # Typed API client
│   ├── package.json
│   └── .env.local
└── README.md

Development

Running Tests

cd backend
source venv/bin/activate

# Test API endpoints
python test_endpoints.py

# Test RAG pipeline
python ../test_rag.py

Adding a New API Endpoint

  1. Define Pydantic schemas in backend/app/schemas.py
  2. Add business logic in backend/app/core/
  3. Create route handler in backend/app/api/
  4. Register the router in backend/app/main.py
  5. Add frontend API call in frontend/src/lib/api.ts

Roadmap

  • Multi-GPU support and model parallelism
  • Cloud fallback (OpenAI/Gemini) when local models are insufficient
  • Automated A/B testing between model versions
  • Plugin system for custom routing strategies
  • Docker Compose deployment with Qdrant server mode
  • RLHF data collection from user feedback

License

MIT License — see LICENSE for details.


Built with 🔥 by the LocalForge team
