MINT-UI: Web Interface for Mixed-Precision LLM Quantization

A web application for MINT (Mixed-precision Integer quantization via kNapsack opTimization) on Apple Silicon. Quantize any HuggingFace LLM to mixed-precision, serve it locally with an OpenAI-compatible API, and chat with it — all from your browser.

Requirements

  • macOS with Apple Silicon (M1/M2/M3/M4)
  • Python 3.10+
  • 16 GB+ unified memory (64 GB+ recommended for large models)
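The memory guidance above is driven mostly by model weights plus the KV cache, which MINT-UI estimates during its system check. The arithmetic is roughly the following (a sketch with illustrative model shapes, not MINT-UI's actual code):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Rough KV cache size: keys + values for every layer at a given context length."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Example: Llama-3-8B-like shapes (32 layers, 8 KV heads, head_dim 128)
# at an 8192-token context in fp16:
size = kv_cache_bytes(32, 8, 128, 8192, 2)
print(f"{size / 2**30:.2f} GiB")  # 1.00 GiB
```

This is why long contexts on large models push the 64 GB+ recommendation: the cache grows linearly with both context length and layer count, on top of the quantized weights.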

Install dependencies

From the repository root:

```shell
pip install -e .
```

Note: All dependencies (FastAPI, MLX, PyTorch, etc.) are installed automatically. PyTorch is only used during RD analysis; it is not needed at inference time.


Quick Start

```shell
# Install from source
git clone https://github.com/baa-ai/MINT-UI.git
cd MINT-UI
pip install -e .

# Install macOS app (optional; adds to Applications, Spotlight, Dock)
./scripts/install-app.sh

# Launch
mint-ui
```

Opens http://localhost:8800 in your browser.


Features

| Feature | Description |
| --- | --- |
| Quick Launch | One-click load and chat with any local quantized model |
| MINT Wizard | Guided 6-step pipeline from HuggingFace model to local chat |
| OpenAI API | Serve models via standard `/v1/chat/completions` endpoint |
| Budget Optimizer | Interactive quality-vs-size chart with knee-point detection |
| MLX + GGUF | Convert to MLX (Apple Silicon) or GGUF (llama.cpp) format |
| Memory-Aware | KV cache estimation, budget recommendations, resource warnings |
| Context Compression | Rolling conversation summary reduces tokens sent per message as chats grow |
| Model Library | Auto-discovers models from the HuggingFace cache, grouped by org |
| Session Resume | Resume any MINT pipeline step: analysis, budget, conversion |
| Thinking Filter | Hides model reasoning/thinking tokens; toggle to show |
| Auto-Update | Checks GitHub for new releases on startup |
| macOS App | Launch from Applications, Spotlight, or Dock |
| Built-in Docs | Help documentation served at `/docs/` |
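The Context Compression feature keeps per-message token usage flat as conversations grow. A minimal sketch of the idea (rolling summary plus a verbatim window of recent turns; the function and parameter names are illustrative, not MINT-UI's internals):

```python
def compress_history(messages, summary, keep_last=4):
    """Replace older turns with a running summary; keep the most recent turns verbatim."""
    old, recent = messages[:-keep_last], messages[-keep_last:]
    if old:
        # In the real app an LLM would fold `old` into the summary text;
        # here we only record how many turns were folded in.
        summary = (summary + f" [{len(old)} earlier turns summarized]").strip()
    compact = []
    if summary:
        compact.append({"role": "system", "content": f"Conversation so far: {summary}"})
    return compact + recent, summary

msgs = [{"role": "user", "content": f"msg {i}"} for i in range(10)]
compact, summary = compress_history(msgs, "", keep_last=4)
print(len(compact))  # 5: one summary message + 4 recent turns
```

However it is implemented internally, the effect is the same: the prompt sent per message stays bounded instead of growing with the whole chat history.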

Usage

Chat with an Existing Model

  1. Launch mint-ui (or open MINT-UI from Applications)
  2. Go to the Models tab — your quantized MLX/GGUF models are listed
  3. Click Load on any model
  4. Chat in the built-in UI, or use the API:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")
response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

MINT a New Model

  1. Click MINT New in the top nav
  2. Select — Search HuggingFace or browse baa-ai models (grouped by org)
  3. System Check — Review memory, disk, KV cache requirements
  4. Analyze — RD curve computation with live progress and logs
  5. Budget — Pick target size on the interactive chart (or auto-select optimal)
  6. Convert — Allocate, build manifest, convert (MLX or GGUF)
  7. Serve — Load model and chat

Resume any step from a previous session — no need to re-run analysis.
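The Analyze step scores every tensor at multiple bit-width configs using NRMSE and SQNR. The metrics themselves are standard; a stdlib-only sketch under assumed definitions (NRMSE normalized by the RMS of the original tensor, SQNR in decibels):

```python
import math

def nrmse(orig, quant):
    """Root-mean-square quantization error, normalized by the RMS of the original."""
    mse = sum((o - q) ** 2 for o, q in zip(orig, quant)) / len(orig)
    rms = math.sqrt(sum(o * o for o in orig) / len(orig))
    return math.sqrt(mse) / rms

def sqnr_db(orig, quant):
    """Signal-to-quantization-noise ratio in decibels."""
    signal = sum(o * o for o in orig)
    noise = sum((o - q) ** 2 for o, q in zip(orig, quant))
    return 10 * math.log10(signal / noise)

orig = [1.0, -2.0, 3.0, -4.0]
quant = [1.1, -1.9, 3.1, -3.9]   # toy "quantized" values
print(f"NRMSE={nrmse(orig, quant):.4f}  SQNR={sqnr_db(orig, quant):.1f} dB")
```

Lower NRMSE and higher SQNR mean a tensor tolerates that bit width well; MINT's RD curves are these scores collected across all configs per tensor.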


Pipeline Overview

MINT-UI wraps the full MINT pipeline in a guided web wizard:

```
HuggingFace Model
    |
    v
[Step 1] Select model from HF or local disk
    |
    v
[Step 2] System assessment: memory, disk, KV cache estimates
    |
    v
[Step 3] Rate-distortion analysis: NRMSE + SQNR at 13 configs per tensor
    |
    v
[Step 4] Budget selection: interactive quality-vs-size chart (MCKP solver)
    |
    v
[Step 5] Conversion: MLX or GGUF with per-tensor mixed precision
    |
    v
[Step 6] Serve & Chat: OpenAI-compatible API + built-in chat UI
```
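Step 4 is a multiple-choice knapsack problem (MCKP): choose exactly one bit-width config per tensor so that total size fits the budget and total distortion is minimized. A dynamic-programming sketch over toy data (integer size units; MINT's actual solver and data structures may differ):

```python
def mckp(tensors, budget):
    """tensors: per-tensor list of (size, distortion) options.
    Pick exactly one option per tensor with total size <= budget,
    minimizing total distortion. Sizes must be integers (e.g. MB units)."""
    INF = float("inf")
    best = [0.0] + [INF] * budget       # best[s] = min distortion at total size s
    for options in tensors:
        nxt = [INF] * (budget + 1)
        for s in range(budget + 1):
            if best[s] == INF:
                continue
            for size, dist in options:
                if s + size <= budget:
                    nxt[s + size] = min(nxt[s + size], best[s] + dist)
        best = nxt
    return min(best)

# Two toy tensors; each option is (size units, distortion) at some bit width.
tensors = [
    [(4, 0.01), (2, 0.10), (1, 0.40)],
    [(4, 0.02), (2, 0.05), (1, 0.30)],
]
print(round(mckp(tensors, budget=5), 3))  # 0.15: 2 units each is the best fit
```

Sweeping `budget` over a range of sizes and plotting the resulting minimum distortion is exactly what the interactive quality-vs-size chart visualizes.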

API Reference

Model Serving (port 8080)

OpenAI-compatible endpoints provided by mlx_lm.server (MLX) or llama-server (GGUF):

| Endpoint | Method | Description |
| --- | --- | --- |
| `/v1/chat/completions` | POST | Chat completions (streaming supported) |
| `/v1/completions` | POST | Text completions |
| `/v1/models` | GET | List loaded models |
| `/health` | GET | Health check |

MINT-UI Management (port 8800)

| Endpoint | Method | Description |
| --- | --- | --- |
| `/api/models/quantized` | GET | List local quantized models |
| `/api/models/search?q=...` | GET | Search HuggingFace + local models |
| `/api/models/baa-ai` | GET | List baa-ai organization models |
| `/api/serve/start` | POST | Load a model |
| `/api/serve/stop` | POST | Unload the current model |
| `/api/serve/status` | GET | Current serving status |
| `/api/system/assess` | POST | Memory, disk, and KV cache assessment |
| `/api/version` | GET | Version info + update check |

Code Examples

curl:

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 512}'
```

Python:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")
resp = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```

JavaScript:

```javascript
const resp = await fetch("http://localhost:8080/v1/chat/completions", {
    method: "POST",
    headers: {"Content-Type": "application/json"},
    body: JSON.stringify({
        messages: [{role: "user", content: "Hello!"}],
        stream: true,
    }),
});
```
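With `stream: true` the server replies with Server-Sent Events, each `data:` line carrying a JSON chunk whose delta holds a fragment of the reply. A sketch of reassembling the streamed text in Python (the chunk shape follows the standard OpenAI streaming format; error handling omitted):

```python
import json

def collect_stream(lines):
    """Join content deltas from OpenAI-style SSE 'data:' lines into one string."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue                      # skip keep-alives and blank lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break                         # end-of-stream sentinel
        delta = json.loads(payload)["choices"][0]["delta"]
        parts.append(delta.get("content") or "")   # some chunks carry only a role
    return "".join(parts)

sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo!"}}]}',
    "data: [DONE]",
]
print(collect_stream(sample))  # Hello!
```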

Configuration

Command Line

```
mint-ui [options]

Options:
  --host HOST         Web UI host (default: 127.0.0.1)
  --port PORT         Web UI port (default: 8800)
  --models-dir DIR    Additional directory to scan for models
  --no-browser        Don't auto-open browser on start
```

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| `MINT_UI_MODELS_DIR` | `~/models` | Additional model scan directory |
| `HF_READ_TOKEN` | (unset) | HuggingFace token for gated models |
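For illustration, these variables could be resolved with their documented defaults like this (a hypothetical sketch, not the contents of `mint_ui/config.py`):

```python
import os
from pathlib import Path

def load_config(env=None):
    """Resolve MINT-UI settings from the environment, with documented defaults."""
    if env is None:
        env = os.environ
    return {
        "models_dir": Path(env.get("MINT_UI_MODELS_DIR", str(Path.home() / "models"))),
        "hf_token": env.get("HF_READ_TOKEN"),  # None unless set; needed for gated models
    }

cfg = load_config({"MINT_UI_MODELS_DIR": "/tmp/models"})
print(cfg["models_dir"], cfg["hf_token"])  # /tmp/models None
```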

Project Structure

```
MINT-UI/
  mint_ui/
    app.py              # FastAPI application entry point
    config.py           # Configuration and paths
    routes/             # API endpoints (models, system, analysis, budget, conversion, serve)
    services/           # Business logic (HF, system, RD curves, allocator, serving)
    tasks/              # Background task management with WebSocket progress
    pipeline/           # Bundled MINT quantization pipeline
    templates/          # HTML templates (SPA)
    static/             # CSS + JavaScript
  docs/                 # Built-in documentation (served at /docs/)
  scripts/
    install-app.sh      # macOS app bundle installer
  tests/
    test_routes.py      # API route tests (18 tests)
```

Documentation

Built-in docs are served at http://localhost:8800/docs/ when the app is running:

  • Quick Start — Install and chat in 5 minutes
  • How MINT Works — Rate-distortion analysis, MCKP allocation, conversion
  • API Reference — All endpoints with code examples
  • Research — Published results and methodology

License

PolyForm Noncommercial 1.0.0 — free for personal, research, and noncommercial use.
