A web application for MINT (Mixed-precision Integer quantization via kNapsack opTimization) on Apple Silicon. Quantize any HuggingFace LLM to mixed-precision, serve it locally with an OpenAI-compatible API, and chat with it — all from your browser.
- macOS with Apple Silicon (M1/M2/M3/M4)
- Python 3.10+
- 16 GB+ unified memory (64 GB+ recommended for large models)
```shell
pip install -e .
```

Note: All dependencies (FastAPI, MLX, PyTorch, etc.) are installed automatically. PyTorch is only used during RD analysis — not needed at inference time.
```shell
# Install from source
git clone https://github.com/baa-ai/MINT-UI.git
cd MINT-UI
pip install -e .

# Install macOS app (optional — adds to Applications, Spotlight, Dock)
./scripts/install-app.sh

# Launch
mint-ui
```

Opens http://localhost:8800 in your browser.
| Feature | Description |
|---|---|
| Quick Launch | One-click load and chat with any local quantized model |
| MINT Wizard | Guided 6-step pipeline from HuggingFace model to local chat |
| OpenAI API | Serve models via standard /v1/chat/completions endpoint |
| Budget Optimizer | Interactive quality-vs-size chart with knee-point detection |
| MLX + GGUF | Convert to MLX (Apple Silicon) or GGUF (llama.cpp) format |
| Memory-Aware | KV cache estimation, budget recommendations, resource warnings |
| Context Compression | Rolling conversation summary reduces tokens sent per message as chats grow |
| Model Library | Auto-discovers models from HuggingFace cache, grouped by org |
| Session Resume | Resume any MINT pipeline step — analysis, budget, conversion |
| Thinking Filter | Hides model reasoning/thinking tokens, toggle to show |
| Auto-Update | Checks GitHub for new releases on startup |
| macOS App | Launch from Applications, Spotlight, or Dock |
| Built-in Docs | Help documentation served at /docs/ |
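The memory-aware checks above hinge on KV cache size, which grows linearly with context length. A minimal sketch of that estimate (the formula is the standard one for transformer KV caches; the model dimensions below are illustrative, not read from MINT-UI):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """Estimate KV cache size: keys + values for every layer and position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch

# Example: a Llama-3-8B-like config (32 layers, 8 KV heads, head_dim 128)
# at 8k context in fp16
gib = kv_cache_bytes(32, 8, 128, 8192) / 2**30
print(f"{gib:.2f} GiB")  # → 1.00 GiB
```

Doubling the context doubles this figure, which is why the system check accounts for it separately from the weights.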
- Launch `mint-ui` (or open MINT-UI from Applications)
- Go to the Models tab — your quantized MLX/GGUF models are listed
- Click Load on any model
- Chat in the built-in UI, or use the API:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")
response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

- Click MINT New in the top nav
- Select — Search HuggingFace or browse baa-ai models (grouped by org)
- System Check — Review memory, disk, KV cache requirements
- Analyze — RD curve computation with live progress and logs
- Budget — Pick target size on the interactive chart (or auto-select optimal)
- Convert — Allocate, build manifest, convert (MLX or GGUF)
- Serve — Load model and chat
Resume any step from a previous session — no need to re-run analysis.
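The Analyze step scores each tensor under candidate quantization configs using NRMSE and SQNR. A sketch of those two metrics with NumPy (the exact normalization and calibration MINT uses may differ; the 4-bit round-to-nearest quantizer below is only a toy example):

```python
import numpy as np

def nrmse(x, x_hat):
    """Normalized RMSE: quantization error relative to the tensor's RMS."""
    err = np.sqrt(np.mean((x - x_hat) ** 2))
    return err / (np.sqrt(np.mean(x ** 2)) + 1e-12)

def sqnr_db(x, x_hat):
    """Signal-to-quantization-noise ratio in dB (higher is better)."""
    noise = np.mean((x - x_hat) ** 2)
    return 10 * np.log10(np.mean(x ** 2) / (noise + 1e-12))

# Toy example: symmetric round-to-nearest 4-bit quantization of one tensor
rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
scale = np.abs(w).max() / 7            # 4-bit signed levels: -8..7
w_hat = np.round(w / scale).clip(-8, 7) * scale
print(nrmse(w, w_hat), sqnr_db(w, w_hat))
```

Running each tensor through a grid of such configs yields its rate-distortion curve: size on one axis, error on the other.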
MINT-UI wraps the full MINT pipeline in a guided web wizard:
```
HuggingFace Model
        |
        v
[Step 1] Select model from HF or local disk
        |
        v
[Step 2] System assessment — memory, disk, KV cache estimates
        |
        v
[Step 3] Rate-distortion analysis — NRMSE + SQNR at 13 configs per tensor
        |
        v
[Step 4] Budget selection — interactive quality-vs-size chart (MCKP solver)
        |
        v
[Step 5] Conversion — MLX or GGUF with per-tensor mixed-precision
        |
        v
[Step 6] Serve & Chat — OpenAI-compatible API + built-in chat UI
```
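Budget selection is a multiple-choice knapsack problem (MCKP): for each tensor, pick exactly one (size, distortion) option so that total size fits the budget with minimal total distortion. A small dynamic-programming sketch over abstract integer size units (illustrative only; MINT's actual solver and cost model may differ):

```python
def mckp(tensors, budget):
    """tensors: list of option lists [(size, distortion), ...], one list per
    tensor; pick exactly one option per tensor within the size budget.
    Returns (min total distortion, chosen option index per tensor)."""
    INF = float("inf")
    dp = [INF] * (budget + 1)               # dp[s] = best distortion at size s
    dp[0] = 0.0
    choice = [[-1] * (budget + 1) for _ in tensors]
    for t, opts in enumerate(tensors):
        ndp = [INF] * (budget + 1)
        for s in range(budget + 1):
            if dp[s] == INF:
                continue
            for i, (sz, d) in enumerate(opts):
                if s + sz <= budget and dp[s] + d < ndp[s + sz]:
                    ndp[s + sz] = dp[s] + d
                    choice[t][s + sz] = i
        dp = ndp
    best = min(range(budget + 1), key=lambda s: dp[s])
    # Walk back to recover which option each tensor used
    picks, s = [], best
    for t in range(len(tensors) - 1, -1, -1):
        i = choice[t][s]
        picks.append(i)
        s -= tensors[t][i][0]
    return dp[best], picks[::-1]

# Two tensors, each with (size, distortion) options for e.g. 2/4/8-bit
tensors = [[(1, 0.9), (2, 0.3), (4, 0.05)],
           [(1, 0.8), (2, 0.2), (4, 0.04)]]
print(mckp(tensors, budget=5))
```

With a budget of 5 units the solver picks the middle option for both tensors: spending everything on one tensor leaves the other at its worst distortion.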
OpenAI-compatible endpoints provided by mlx_lm.server (MLX) or llama-server (GGUF):
| Endpoint | Method | Description |
|---|---|---|
| `/v1/chat/completions` | POST | Chat completions (streaming supported) |
| `/v1/completions` | POST | Text completions |
| `/v1/models` | GET | List loaded models |
| `/health` | GET | Health check |
| Endpoint | Method | Description |
|---|---|---|
| `/api/models/quantized` | GET | List local quantized models |
| `/api/models/search?q=...` | GET | Search HuggingFace + local models |
| `/api/models/baa-ai` | GET | List baa-ai organization models |
| `/api/serve/start` | POST | Load a model |
| `/api/serve/stop` | POST | Unload model |
| `/api/serve/status` | GET | Current serving status |
| `/api/system/assess` | POST | Memory, disk, KV cache assessment |
| `/api/version` | GET | Version info + update check |
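The management endpoints above can be scripted with nothing but the standard library. A minimal sketch (it assumes the default UI port 8800 and returns `None` when MINT-UI is not running; consult the built-in docs at `/docs/` for exact response schemas):

```python
import json
from urllib import request, error

BASE = "http://localhost:8800/api"  # assumes the default UI port

def get_json(path):
    """GET a management endpoint; returns None if MINT-UI is not running."""
    try:
        with request.urlopen(f"{BASE}{path}", timeout=5) as resp:
            return json.load(resp)
    except (error.URLError, OSError, ValueError):
        return None

# Version info (includes the update check) and current serving status
print(get_json("/version"))
print(get_json("/serve/status"))
```

POST endpoints such as `/api/serve/start` take a JSON body; see the API Reference in the built-in docs for the request fields.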
curl:

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 512}'
```

Python:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")
resp = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```

JavaScript:
```javascript
const resp = await fetch("http://localhost:8080/v1/chat/completions", {
  method: "POST",
  headers: {"Content-Type": "application/json"},
  body: JSON.stringify({
    messages: [{role: "user", content: "Hello!"}],
    stream: true,
  }),
});

// Consume the SSE stream the request asked for
const reader = resp.body.getReader();
const decoder = new TextDecoder();
for (;;) {
  const {done, value} = await reader.read();
  if (done) break;
  console.log(decoder.decode(value));
}
```

```
mint-ui [options]

Options:
  --host HOST       Web UI host (default: 127.0.0.1)
  --port PORT       Web UI port (default: 8800)
  --models-dir DIR  Additional directory to scan for models
  --no-browser      Don't auto-open browser on start
```
| Variable | Default | Description |
|---|---|---|
| `MINT_UI_MODELS_DIR` | `~/models` | Additional model scan directory |
| `HF_READ_TOKEN` | *(unset)* | HuggingFace token for gated models |
```
MINT-UI/
  mint_ui/
    app.py             # FastAPI application entry point
    config.py          # Configuration and paths
    routes/            # API endpoints (models, system, analysis, budget, conversion, serve)
    services/          # Business logic (HF, system, RD curves, allocator, serving)
    tasks/             # Background task management with WebSocket progress
    pipeline/          # Bundled MINT quantization pipeline
    templates/         # HTML templates (SPA)
    static/            # CSS + JavaScript
    docs/              # Built-in documentation (served at /docs/)
  scripts/
    install-app.sh     # macOS app bundle installer
  tests/
    test_routes.py     # API route tests (18 tests)
```
Built-in docs are served at http://localhost:8800/docs/ when the app is running:
- Quick Start — Install and chat in 5 minutes
- How MINT Works — Rate-distortion analysis, MCKP allocation, conversion
- API Reference — All endpoints with code examples
- Research — Published results and methodology
- baa.ai — Project website
- baa-ai/MINT — Core MINT pipeline
- baa-ai/MINT-UI — This repository
- baa-ai on HuggingFace — Published MINT models
PolyForm Noncommercial 1.0.0 — free for personal, research, and noncommercial use.