

⚡ LocalForge

Self-Hosted AI Control Plane for Intelligent Local LLM Orchestration

A production-grade platform for running, routing, benchmarking, and finetuning local LLMs.
Drop-in OpenAI-compatible API · Intelligent multi-model routing · LoRA finetuning with live monitoring.


Overview

LocalForge is a self-hosted AI control plane that transforms your GPU workstation into an intelligent LLM serving infrastructure. Instead of manually managing model files, writing inference scripts, and guessing which model fits which task — LocalForge automates the entire lifecycle:

  1. Browse & Download GGUF models from HuggingFace with automatic VRAM compatibility filtering
  2. Serve models via a fully OpenAI-compatible /v1/chat/completions endpoint
  3. Route queries to the optimal model using ML-powered task classification + multi-signal scoring
  4. Learn from usage patterns via a vector-based memory layer that improves routing over time
  5. Benchmark models against standard evaluations (MMLU-Pro, HumanEval, GSM8K, GPQA, MT-Bench)
  6. Finetune models with LoRA/QLoRA via a managed subprocess pipeline with live loss streaming
  7. Augment responses with a RAG knowledge base layer for domain-specific context injection

Architecture

┌───────────────────────────────────────────────────────────────────┐
│                        Next.js Frontend                           │
│   Dashboard · Models · Benchmarks · Traces · Memory · Finetune    │
└────────────────────────────┬──────────────────────────────────────┘
                             │ REST + SSE
┌────────────────────────────▼──────────────────────────────────────┐
│                       FastAPI Backend                             │
│  ┌──────────┐ ┌─────────┐ ┌──────────┐ ┌────────┐ ┌────────────┐  │
│  │  Router  │ │Lifecycle│ │Inference │ │ Memory │ │  Finetune  │  │
│  │  Engine  │ │ Manager │ │  Engine  │ │ Layer  │ │   Engine   │  │
│  └────┬─────┘ └────┬────┘ └────┬─────┘ └───┬────┘ └─────┬──────┘  │
│       │            │           │           │            │         │
│  ┌────▼─────┐ ┌────▼────┐ ┌────▼─────┐ ┌───▼────┐ ┌─────▼──────┐  │
│  │Classifier│ │ SQLite  │ │  llama   │ │ Qdrant │ │  Training  │  │
│  │(TF-IDF)  │ │  (WAL)  │ │  .cpp    │ │(Vector)│ │   Worker   │  │
│  └──────────┘ └─────────┘ │  server  │ └────────┘ │(Subprocess │  │
│                           └──────────┘            │ PEFT/TRL)  │  │
│  ┌──────────┐  ┌──────────┐  ┌──────────────────┐ └────────────┘  │
│  │Benchmark │  │   Auth   │  │   RAG Layer      │                 │
│  │ Fetcher  │  │ (Bearer) │  │ (LlamaIndex +    │                 │
│  └──────────┘  └──────────┘  │  Qdrant)         │                 │
│                              └──────────────────┘                 │
└───────────────────────────────────────────────────────────────────┘

Features

🧠 Intelligent Multi-Model Router

  • ML-Powered Task Classification — TF-IDF + Logistic Regression classifier categorizes queries into coding, math, reasoning, instruction, hard_reasoning, or general with ~85% accuracy and <5ms inference
  • Multi-Signal Scoring — Routes based on a weighted combination of benchmark scores (40%), memory-based success history (30%), latency (15%), and user feedback (15%)
  • Memory-Enhanced Routing — Qdrant vector store indexes past query→model outcomes; recency-weighted exponential decay ensures fresh interactions matter more
  • Fallback Evidence — When routing confidence is low, the system checks for historical evidence of any model succeeding on similar queries
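As a rough sketch (not LocalForge's actual implementation), the weighted combination described above could look like this; the field names and 0–1 normalization are illustrative assumptions:

```python
# Illustrative multi-signal router scoring; weights match the documented
# defaults (benchmark 0.4, memory 0.3, latency 0.15, feedback 0.15).
from dataclasses import dataclass

WEIGHTS = {"benchmark": 0.40, "memory": 0.30, "latency": 0.15, "feedback": 0.15}

@dataclass
class Candidate:
    name: str
    benchmark: float  # normalized benchmark score for the classified task type
    memory: float     # recency-weighted success rate from the memory layer
    latency: float    # normalized speed signal (higher = faster)
    feedback: float   # normalized user feedback score

def score(c: Candidate) -> float:
    """Weighted sum of the four routing signals."""
    return (WEIGHTS["benchmark"] * c.benchmark
            + WEIGHTS["memory"] * c.memory
            + WEIGHTS["latency"] * c.latency
            + WEIGHTS["feedback"] * c.feedback)

def route(candidates: list[Candidate]) -> Candidate:
    """Pick the highest-scoring candidate model."""
    return max(candidates, key=score)
```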

📦 Model Lifecycle Management

  • One-Click Downloads from HuggingFace with VRAM-aware filtering
  • Hot-Swap Architecture — Single-model-hot constraint for consumer hardware; atomic state transitions (UNLOADED → LOADING → HOT → UNLOADING)
  • Resident Model — Most frequently used model auto-detected and kept loaded
  • Finetuning Lock — Models being finetuned are excluded from routing

🔥 OpenAI-Compatible API

  • Drop-in replacement for openai.ChatCompletion.create()
  • Streaming (SSE) and non-streaming responses
  • Bearer token authentication with auto-generated API keys (lf-{hex} format)
  • Custom headers expose routing metadata (X-LocalForge-Model, X-LocalForge-Task)

📊 Benchmarking & Evaluation

  • Automated Fetch — Pulls scores from HuggingFace model cards and Open LLM Leaderboard
  • Local Mini-Eval — Runs curated questions per task type through the inference engine for models without published scores
  • Multi-Model Comparison — Side-by-side radar charts across MMLU-Pro, HumanEval, GSM8K, GPQA, MT-Bench

🧬 LoRA Finetuning Pipeline

  • Managed Training — Background subprocess with full lifecycle control (start, monitor, cancel)
  • Live Loss Streaming — SSE-powered real-time loss curves via JSONL log tailing
  • Dual Backend — Unsloth (2× faster, 60% less VRAM) or standard PEFT + TRL
  • Automatic GGUF Export — Finetuned models exported and auto-registered in the model registry
  • Before/After Comparison — Generates side-by-side outputs on held-out validation samples
  • Dataset Validation — Supports CSV, JSONL, Alpaca JSON with preview and error reporting
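For reference, a minimal check of an Alpaca-style JSONL record might look like this; the exact fields and rules LocalForge's validator enforces may differ:

```python
# Illustrative Alpaca-style JSONL validation; field names follow the common
# Alpaca schema, which is an assumption about the accepted format.
import json

REQUIRED_FIELDS = {"instruction", "output"}  # "input" is optional in Alpaca format

def validate_line(line: str) -> bool:
    """Return True if the line is a JSON object with the required fields."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    return isinstance(record, dict) and REQUIRED_FIELDS <= record.keys()

sample = '{"instruction": "Summarize the text.", "input": "LocalForge is a control plane.", "output": "A self-hosted LLM platform."}'
```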

📚 RAG Knowledge Base

  • Document Ingestion — Upload PDFs and text files; chunked via LlamaIndex SentenceSplitter
  • Semantic Search — Embedded chunks stored in Qdrant; retrieved at query time and injected into system prompt
  • Task-Aware KB Routing — Router automatically selects the matching knowledge base by task type
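Conceptually, the retrieval step ends with context injection along these lines; the prompt template and chunk formatting are assumptions, not LocalForge's actual template:

```python
# Sketch of RAG context injection into the system prompt.
def inject_context(system_prompt: str, chunks: list[str]) -> str:
    """Append retrieved knowledge-base chunks to the system prompt."""
    numbered = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        f"{system_prompt}\n\n"
        f"Use the following retrieved context when relevant:\n{numbered}"
    )
```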

🖥️ Dashboard & Observability

  • Real-time hardware profiling (GPU, VRAM, RAM, CPU via pynvml/psutil)
  • Request volume, latency trends, model distribution charts
  • Full routing decision traces with per-candidate scoring breakdowns
  • Memory layer statistics with per-model success rates

Quick Start

Prerequisites

  • Python 3.10+ with a virtual environment
  • Node.js 20+ and npm
  • NVIDIA GPU (recommended) with compatible drivers
  • ~4GB disk for the smallest GGUF model

1. Clone & Install Backend

git clone https://github.com/al1-nasir/LocalForge.git
cd LocalForge/backend

python3 -m venv venv
source venv/bin/activate

pip install -r requirements.txt

2. Configure Environment

cp .env.example .env
# Edit .env — set LOCALFORGE_SECRET_KEY, optionally add HF_TOKEN for gated models

3. Start the Backend

uvicorn app.main:app --port 8010

The API is now live at http://127.0.0.1:8010. Visit http://127.0.0.1:8010/docs for interactive API documentation.

4. Install & Start Frontend

cd ../frontend
npm install

# Create .env.local pointing to your backend
echo "NEXT_PUBLIC_API_URL=http://127.0.0.1:8010" > .env.local

npm run dev

Open http://localhost:3000 to access the dashboard.

5. Download Your First Model

Navigate to Models in the dashboard, search for a model (e.g. Qwen2.5), and click download. The system will auto-filter GGUF files that fit your hardware's VRAM.

6. Send Your First Request

curl http://127.0.0.1:8010/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Explain quantum computing in one paragraph"}]
  }'
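The same request can be sent from Python using only the standard library; the endpoint and lf-... Bearer key format come from the sections above, while the helper name is purely illustrative:

```python
# Minimal stdlib client for the curl request above; build_chat_request is an
# illustrative helper, not part of LocalForge.
import json
import urllib.request

def build_chat_request(base_url: str, prompt: str, api_key: str = "") -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    body = json.dumps({
        "model": "auto",  # let the router pick the best local model
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    headers = {"Content-Type": "application/json"}
    if api_key:  # keys use the lf-{hex} format described above
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions", data=body, headers=headers, method="POST"
    )

# With the backend running:
# with urllib.request.urlopen(build_chat_request("http://127.0.0.1:8010", "Hi")) as r:
#     print(json.load(r)["choices"][0]["message"]["content"])
```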

API Reference

Core Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /v1/chat/completions | POST | OpenAI-compatible chat (streaming + non-streaming) |
| /health | GET | System health check |
| /docs | GET | Interactive Swagger documentation |

Model Management

| Endpoint | Method | Description |
|---|---|---|
| /api/models | GET | List all registered models |
| /api/models/browse | GET | Search HuggingFace for GGUF models |
| /api/models/download | POST | Download a model from HuggingFace |
| /api/models/load | POST | Load a model into GPU memory |
| /api/models/unload | POST | Unload the current hot model |
| /api/models/{id} | GET / DELETE | Get or remove a specific model |

Benchmarking

| Endpoint | Method | Description |
|---|---|---|
| /api/benchmarks/{id} | GET | Get benchmark scores for a model |
| /api/benchmarks/{id}/fetch | POST | Fetch scores from HuggingFace/Leaderboard |
| /api/benchmarks/{id}/eval | POST | Run local mini-evaluation |
| /api/benchmarks/compare/models | GET | Multi-model benchmark comparison |

Finetuning

| Endpoint | Method | Description |
|---|---|---|
| /api/finetune/backend | GET | Check available training backend |
| /api/finetune/upload | POST | Upload and validate a dataset |
| /api/finetune/start | POST | Start a finetuning job |
| /api/finetune/{id} | GET | Get job status with live loss data |
| /api/finetune/{id}/stream | GET | SSE stream for real-time loss updates |
| /api/finetune/{id}/cancel | POST | Cancel a running job |
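A client can consume the finetune stream endpoint as standard Server-Sent Events. This minimal parser assumes each event's data payload is a JSON object with step/loss fields, which is an assumption about the event format:

```python
# Minimal SSE parser for the live loss stream.
import json

def parse_sse_events(raw: str) -> list[dict]:
    """Extract JSON payloads from the 'data: ...' lines of an SSE stream."""
    events = []
    for line in raw.splitlines():
        if line.startswith("data: "):
            events.append(json.loads(line[len("data: "):]))
    return events

# e.g. plot [(e["step"], e["loss"]) for e in parse_sse_events(buffered_text)]
```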

Knowledge Base (RAG)

| Endpoint | Method | Description |
|---|---|---|
| /api/knowledge-bases | GET / POST | List or create knowledge bases |
| /api/knowledge-bases/{id} | DELETE | Delete a knowledge base |
| /api/knowledge-bases/{id}/documents | GET / POST | List or upload documents |

Dashboard & Observability

| Endpoint | Method | Description |
|---|---|---|
| /api/dashboard/stats | GET | Aggregate dashboard statistics |
| /api/dashboard/traces | GET | Recent routing decision traces |
| /api/dashboard/memory-stats | GET | Memory layer statistics |
| /api/hardware | GET | Current hardware profile |
| /api/keys | GET / POST | Manage API keys |

Configuration

All settings use the LOCALFORGE_ prefix and can be set via environment variables or .env:

| Variable | Default | Description |
|---|---|---|
| LOCALFORGE_PORT | 8000 | Backend server port |
| LOCALFORGE_SECRET_KEY | (none) | Secret for API key hashing |
| LOCALFORGE_DB_PATH | data/localforge.db | SQLite database location |
| LOCALFORGE_MODELS_DIR | data/models | Downloaded model storage |
| LOCALFORGE_DEFAULT_CTX_SIZE | 4096 | Default context window |
| LOCALFORGE_DEFAULT_N_GPU_LAYERS | -1 | GPU layers (-1 = all) |
| LOCALFORGE_ROUTER_BENCHMARK_WEIGHT | 0.4 | Benchmark signal weight |
| LOCALFORGE_ROUTER_MEMORY_WEIGHT | 0.3 | Memory signal weight |
| LOCALFORGE_MEMORY_EMBEDDING_MODEL | nomic-ai/nomic-embed-text-v1.5 | Embedding model |
| LOCALFORGE_FINETUNE_MAX_SEQ_LENGTH | 2048 | Max sequence length for training |
| HF_TOKEN | (none) | HuggingFace token for gated models |
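For example, a minimal .env might look like the following; the values are placeholders, and any variable left out keeps the default listed above:

```
LOCALFORGE_PORT=8010
LOCALFORGE_SECRET_KEY=replace-with-a-long-random-string
LOCALFORGE_DEFAULT_CTX_SIZE=8192

# Optional: only needed for gated HuggingFace models
HF_TOKEN=...
```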

Tech Stack

Backend

| Component | Technology |
|---|---|
| API Framework | FastAPI 0.115+ |
| Database | SQLite (aiosqlite, WAL mode) |
| Inference | llama.cpp (via llama-cpp-python) |
| Vector Store | Qdrant (disk-persisted, no Docker) |
| Embeddings | nomic-embed-text-v1.5 (sentence-transformers) |
| Finetuning | PEFT + TRL (or Unsloth) |
| RAG | LlamaIndex Core |
| Task Classifier | scikit-learn (TF-IDF + LogReg) |
| Hardware Detection | pynvml + psutil |

Frontend

| Component | Technology |
|---|---|
| Framework | Next.js 16 (Turbopack) |
| UI | React 19 + Lucide Icons |
| Charts | Recharts |
| Styling | Vanilla CSS with design tokens |
| Typography | Inter + JetBrains Mono (Google Fonts) |

Project Structure

LocalForge/
├── backend/
│   ├── app/
│   │   ├── main.py              # FastAPI entry point & lifespan
│   │   ├── config.py            # Pydantic settings (env-driven)
│   │   ├── database.py          # SQLite schema & connection
│   │   ├── schemas.py           # Request/response Pydantic models
│   │   ├── api/                 # Route handlers
│   │   │   ├── chat.py          # /v1/chat/completions (OpenAI-compat)
│   │   │   ├── models.py        # Model CRUD, browse, download
│   │   │   ├── benchmarks.py    # Benchmark fetch & local eval
│   │   │   ├── finetune.py      # Finetune job management
│   │   │   ├── knowledge.py     # RAG knowledge base management
│   │   │   ├── dashboard.py     # Stats, traces, trends
│   │   │   ├── hardware.py      # GPU/RAM detection
│   │   │   ├── keys.py          # API key management
│   │   │   └── feedback.py      # Thumbs up/down feedback
│   │   └── core/                # Business logic engines
│   │       ├── router.py        # Multi-signal model router
│   │       ├── lifecycle.py     # Model state machine
│   │       ├── inference.py     # llama.cpp server management
│   │       ├── memory.py        # Qdrant-backed memory layer
│   │       ├── finetune_engine.py # Finetune orchestrator
│   │       ├── _train_worker.py # Training subprocess
│   │       ├── rag.py           # Document ingestion & retrieval
│   │       ├── query_classifier.py # TF-IDF task classifier
│   │       ├── benchmark_fetcher.py # HF/Leaderboard score fetch
│   │       ├── local_eval.py    # Local benchmark evaluation
│   │       ├── model_browser.py # HuggingFace GGUF search
│   │       ├── hardware.py      # GPU/VRAM detection
│   │       └── auth.py          # API key auth
│   ├── data/                    # Runtime data (DB, models, etc.)
│   ├── requirements.txt
│   └── .env
├── frontend/
│   ├── src/
│   │   ├── app/                 # Next.js pages
│   │   │   ├── page.tsx         # Dashboard
│   │   │   ├── models/          # Model browser & registry
│   │   │   ├── benchmarks/      # Benchmark comparison
│   │   │   ├── traces/          # Routing decision traces
│   │   │   ├── memory/          # Memory layer stats
│   │   │   ├── knowledge/       # Knowledge base management
│   │   │   ├── finetune/        # Finetuning UI
│   │   │   └── keys/            # API key management
│   │   ├── components/
│   │   │   └── Sidebar.tsx      # Navigation sidebar
│   │   └── lib/
│   │       └── api.ts           # Typed API client
│   ├── package.json
│   └── .env.local
└── README.md

Development

Running Tests

cd backend
source venv/bin/activate

# Test API endpoints
python test_endpoints.py

# Test RAG pipeline
python ../test_rag.py

Adding a New API Endpoint

  1. Define Pydantic schemas in backend/app/schemas.py
  2. Add business logic in backend/app/core/
  3. Create route handler in backend/app/api/
  4. Register the router in backend/app/main.py
  5. Add frontend API call in frontend/src/lib/api.ts

Roadmap

  • Multi-GPU support and model parallelism
  • Cloud fallback (OpenAI/Gemini) when local models are insufficient
  • Automated A/B testing between model versions
  • Plugin system for custom routing strategies
  • Docker Compose deployment with Qdrant server mode
  • RLHF data collection from user feedback

License

MIT License — see LICENSE for details.


Built with 🔥 by the LocalForge team
