A web application for MINT (Mixed-precision Integer quantization via kNapsack opTimization) on Apple Silicon. Quantize any HuggingFace LLM to mixed-precision, serve it locally with an OpenAI-compatible API, and chat with it — all from your browser.
- macOS with Apple Silicon (M1/M2/M3/M4)
- Python 3.10+
- 16 GB+ unified memory (64 GB+ recommended for large models)
```shell
pip install -e .
```

Note: All dependencies (FastAPI, MLX, PyTorch, etc.) are installed automatically. PyTorch is only used during RD analysis — not needed at inference time.
```shell
# Install from source
git clone https://github.com/baa-ai/MINT-UI.git
cd MINT-UI
pip install -e .

# Install macOS app (optional — adds to Applications, Spotlight, Dock)
./scripts/install-app.sh

# Launch
mint-ui
```

Opens http://localhost:8800 in your browser.
| Feature | Description |
|---|---|
| Quick Launch | One-click load and chat with any local quantized model |
| MINT Wizard | Guided 6-step pipeline from HuggingFace model to local chat |
| OpenAI API | Serve models via standard /v1/chat/completions endpoint |
| Budget Optimizer | Interactive quality-vs-size chart with knee-point detection |
| MLX + GGUF | Convert to MLX (Apple Silicon) or GGUF (llama.cpp) format |
| Memory-Aware | KV cache estimation, budget recommendations, resource warnings |
| Context Compression | Rolling conversation summary reduces tokens sent per message as chats grow |
| Model Library | Auto-discovers models from HuggingFace cache, grouped by org |
| Session Resume | Resume any MINT pipeline step — analysis, budget, conversion |
| Thinking Filter | Hides model reasoning/thinking tokens, toggle to show |
| Auto-Update | Checks GitHub for new releases on startup |
| macOS App | Launch from Applications, Spotlight, or Dock |
| Built-in Docs | Help documentation served at /docs/ |
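The memory-aware checks above hinge on KV cache size, which grows linearly with context length. A minimal sketch of that estimate (the formula is the standard one for transformer KV caches; the model dimensions below are illustrative, not read from MINT-UI):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """Estimate KV cache size: keys + values for every layer and position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch

# Example: a Llama-3-8B-like config (32 layers, 8 KV heads, head_dim 128)
# at 8k context in fp16
gib = kv_cache_bytes(32, 8, 128, 8192) / 2**30
print(f"{gib:.2f} GiB")  # → 1.00 GiB
```

Doubling the context doubles this figure, which is why the system check accounts for it separately from the weights.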
- Launch `mint-ui` (or open MINT-UI from Applications)
- Go to the Models tab — your quantized MLX/GGUF models are listed
- Click Load on any model
- Chat in the built-in UI, or use the API:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")
response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

- Click MINT New in the top nav
- Select — Search HuggingFace or browse baa-ai models (grouped by org)
- System Check — Review memory, disk, KV cache requirements
- Analyze — RD curve computation with live progress and logs
- Budget — Pick target size on the interactive chart (or auto-select optimal)
- Convert — Allocate, build manifest, convert (MLX or GGUF)
- Serve — Load model and chat
Resume any step from a previous session — no need to re-run analysis.
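The Analyze step scores each tensor under candidate quantization configs using NRMSE and SQNR. A sketch of those two metrics with NumPy (the exact normalization and calibration MINT uses may differ; the 4-bit round-to-nearest quantizer below is only a toy example):

```python
import numpy as np

def nrmse(x, x_hat):
    """Normalized RMSE: quantization error relative to the tensor's RMS."""
    err = np.sqrt(np.mean((x - x_hat) ** 2))
    return err / (np.sqrt(np.mean(x ** 2)) + 1e-12)

def sqnr_db(x, x_hat):
    """Signal-to-quantization-noise ratio in dB (higher is better)."""
    noise = np.mean((x - x_hat) ** 2)
    return 10 * np.log10(np.mean(x ** 2) / (noise + 1e-12))

# Toy example: symmetric round-to-nearest 4-bit quantization of one tensor
rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
scale = np.abs(w).max() / 7            # 4-bit signed levels: -8..7
w_hat = np.round(w / scale).clip(-8, 7) * scale
print(nrmse(w, w_hat), sqnr_db(w, w_hat))
```

Running each tensor through a grid of such configs yields its rate-distortion curve: size on one axis, error on the other.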
MINT-UI wraps the full MINT pipeline in a guided web wizard:
```
HuggingFace Model
        |
        v
[Step 1] Select model from HF or local disk
        |
        v
[Step 2] System assessment — memory, disk, KV cache estimates
        |
        v
[Step 3] Rate-distortion analysis — NRMSE + SQNR at 13 configs per tensor
        |
        v
[Step 4] Budget selection — interactive quality-vs-size chart (MCKP solver)
        |
        v
[Step 5] Conversion — MLX or GGUF with per-tensor mixed-precision
        |
        v
[Step 6] Serve & Chat — OpenAI-compatible API + built-in chat UI
```
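Budget selection is a multiple-choice knapsack problem (MCKP): for each tensor, pick exactly one (size, distortion) option so that total size fits the budget with minimal total distortion. A small dynamic-programming sketch over abstract integer size units (illustrative only; MINT's actual solver and cost model may differ):

```python
def mckp(tensors, budget):
    """tensors: list of option lists [(size, distortion), ...], one list per
    tensor; pick exactly one option per tensor within the size budget.
    Returns (min total distortion, chosen option index per tensor)."""
    INF = float("inf")
    dp = [INF] * (budget + 1)               # dp[s] = best distortion at size s
    dp[0] = 0.0
    choice = [[-1] * (budget + 1) for _ in tensors]
    for t, opts in enumerate(tensors):
        ndp = [INF] * (budget + 1)
        for s in range(budget + 1):
            if dp[s] == INF:
                continue
            for i, (sz, d) in enumerate(opts):
                if s + sz <= budget and dp[s] + d < ndp[s + sz]:
                    ndp[s + sz] = dp[s] + d
                    choice[t][s + sz] = i
        dp = ndp
    best = min(range(budget + 1), key=lambda s: dp[s])
    # Walk back to recover which option each tensor used
    picks, s = [], best
    for t in range(len(tensors) - 1, -1, -1):
        i = choice[t][s]
        picks.append(i)
        s -= tensors[t][i][0]
    return dp[best], picks[::-1]

# Two tensors, each with (size, distortion) options for e.g. 2/4/8-bit
tensors = [[(1, 0.9), (2, 0.3), (4, 0.05)],
           [(1, 0.8), (2, 0.2), (4, 0.04)]]
print(mckp(tensors, budget=5))
```

With a budget of 5 units the solver picks the middle option for both tensors: spending everything on one tensor leaves the other at its worst distortion.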
OpenAI-compatible endpoints provided by mlx_lm.server (MLX) or llama-server (GGUF):
| Endpoint | Method | Description |
|---|---|---|
| `/v1/chat/completions` | POST | Chat completions (streaming supported) |
| `/v1/completions` | POST | Text completions |
| `/v1/models` | GET | List loaded models |
| `/health` | GET | Health check |
| Endpoint | Method | Description |
|---|---|---|
| `/api/models/quantized` | GET | List local quantized models |
| `/api/models/search?q=...` | GET | Search HuggingFace + local models |
| `/api/models/baa-ai` | GET | List baa-ai organization models |
| `/api/serve/start` | POST | Load a model |
| `/api/serve/stop` | POST | Unload model |
| `/api/serve/status` | GET | Current serving status |
| `/api/system/assess` | POST | Memory, disk, KV cache assessment |
| `/api/version` | GET | Version info + update check |
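The management endpoints above can be scripted with nothing but the standard library. A minimal sketch (it assumes the default UI port 8800 and returns `None` when MINT-UI is not running; consult the built-in docs at `/docs/` for exact response schemas):

```python
import json
from urllib import request, error

BASE = "http://localhost:8800/api"  # assumes the default UI port

def get_json(path):
    """GET a management endpoint; returns None if MINT-UI is not running."""
    try:
        with request.urlopen(f"{BASE}{path}", timeout=5) as resp:
            return json.load(resp)
    except (error.URLError, OSError, ValueError):
        return None

# Version info (includes the update check) and current serving status
print(get_json("/version"))
print(get_json("/serve/status"))
```

POST endpoints such as `/api/serve/start` take a JSON body; see the API Reference in the built-in docs for the request fields.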
curl:

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 512}'
```

Python:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")
resp = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```

JavaScript:
```javascript
const resp = await fetch("http://localhost:8080/v1/chat/completions", {
  method: "POST",
  headers: {"Content-Type": "application/json"},
  body: JSON.stringify({
    messages: [{role: "user", content: "Hello!"}],
    stream: true,
  }),
});

// Consume the SSE stream the request asked for
const reader = resp.body.getReader();
const decoder = new TextDecoder();
for (;;) {
  const {done, value} = await reader.read();
  if (done) break;
  console.log(decoder.decode(value));
}
```

```
mint-ui [options]

Options:
  --host HOST       Web UI host (default: 127.0.0.1)
  --port PORT       Web UI port (default: 8800)
  --models-dir DIR  Additional directory to scan for models
  --no-browser      Don't auto-open browser on start
```
| Variable | Default | Description |
|---|---|---|
| `MINT_UI_MODELS_DIR` | `~/models` | Additional model scan directory |
| `HF_READ_TOKEN` | *(unset)* | HuggingFace token for gated models |
```
MINT-UI/
  mint_ui/
    app.py             # FastAPI application entry point
    config.py          # Configuration and paths
    routes/            # API endpoints (models, system, analysis, budget, conversion, serve)
    services/          # Business logic (HF, system, RD curves, allocator, serving)
    tasks/             # Background task management with WebSocket progress
    pipeline/          # Bundled MINT quantization pipeline
    templates/         # HTML templates (SPA)
    static/            # CSS + JavaScript
    docs/              # Built-in documentation (served at /docs/)
  scripts/
    install-app.sh     # macOS app bundle installer
  tests/
    test_routes.py     # API route tests (18 tests)
```
Built-in docs are served at http://localhost:8800/docs/ when the app is running:
- Quick Start — Install and chat in 5 minutes
- How MINT Works — Rate-distortion analysis, MCKP allocation, conversion
- API Reference — All endpoints with code examples
- Research — Published results and methodology
- baa.ai — Project website
- baa-ai/MINT — Core MINT pipeline
- baa-ai/MINT-UI — This repository
- baa-ai on HuggingFace — Published MINT models
PolyForm Noncommercial 1.0.0 — free for personal, research, and noncommercial use.