Skip to content

defilantech/mlx-server

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

mlx-server

OpenAI-compatible HTTP server for mlx-swift-lm on Apple Silicon.

Status

Phase 1 + tool calling: working. Loads an MLX model and serves /v1/chat/completions (streaming and non-streaming), /v1/models, and /health, with OpenAI-compatible tool calling. Validated end-to-end against Qwen3-4B and Qwen3.6-35B-A3B (MoE). See the roadmap for what is next.

Why this exists

mlx-swift-lm is a high-performance LLM inference library for Apple Silicon, with state-of-the-art speculative decoding, prefix KV-cache, and hybrid-architecture state replay landing in its Tier 1 roadmap. It is, by design, a Swift library, not a server. mlx-server is the missing HTTP layer.

The end goal is to be a drop-in replacement for llama-server in LLMKube's Apple Silicon path, with the perf characteristics of mlx-swift-lm.

Goals

  • OpenAI-compatible API surface: /v1/chat/completions, /v1/completions, /v1/models, /v1/embeddings
  • Streaming via Server-Sent Events
  • Tool calling and structured outputs
  • Vision-language model support (Qwen-VL, Gemma 4 VL, etc.)
  • Multi-slot concurrency with longest-prefix KV cache reuse
  • One-command install via Homebrew, plus prebuilt GitHub releases

Non-goals

  • Cross-platform support. Apple Silicon (arm64 macOS) only. If you need x86_64 or Linux, look at llama.cpp.
  • vLLM compatibility. For that path see vllm-swift (Python API + Swift compute) or vllm-metal.
  • Heavy abstractions over mlx-swift-lm. This is a thin HTTP wrapper, not a framework.

Prior art

TheTom's MLXServer (abandoned in favor of vllm-swift) was the proof-of-concept that an MLX-swift HTTP server is feasible. Several design decisions here, particularly around the slot manager and longest-prefix KV cache, are informed by his approach. The decision to rebuild rather than fork is mainly because his original used hand-rolled socket code; this repo uses Hummingbird for the HTTP layer.

Install

brew install defilantech/tap/mlx-server

Apple Silicon, macOS 14 (Sonoma) or later. This installs the latest prebuilt release; older versions and the raw tarballs are on the releases page. Then:

mlx-server --model /path/to/mlx-model-dir --port 8080

Build from source

Requires:

  • macOS 14 (Sonoma) or later, Apple Silicon
  • Swift 6.0 or later (Xcode 16+)

swift build compiles the project (and is what CI runs), but SwiftPM cannot compile mlx-swift's Metal shaders — a binary built that way fails at runtime with Failed to load the default metallib. To run the server, build with xcodebuild, which compiles and bundles the Metal library next to the binary:

xcodebuild -scheme mlx-server -destination 'platform=macOS,arch=arm64' \
  -configuration Debug -derivedDataPath .build/xcode -skipMacroValidation build

.build/xcode/Build/Products/Debug/mlx-server \
  --model /path/to/mlx-model-dir --port 8080

--model takes a local MLX model directory or a HuggingFace id. Other flags:

  • --host, --port, --max-slots
  • --tool-call-format — e.g. xml_function for Qwen3.5 / Qwen3-Coder; auto-inferred when unset
  • --reasoning — how thinking output is split into reasoning_content vs content: auto (default; splits on a literal <think>/</think>), prefilled (output starts mid-thought — use for Qwen3.5 / Qwen3.6), or off

Roadmap

Phase Scope Status
0 Scaffolding, CI, /health endpoint, dependency wiring Done
1 /v1/chat/completions (streaming + non-streaming), /v1/models, single-slot model loading Done
2 Multi-slot SlotManager, longest-prefix prompt cache, Prometheus /metrics, structured logging, graceful shutdown
3 Tool calling, thinking-model support, vision-language models, speculative decoding knobs, /v1/embeddings Tool calling done
4 LLMKube runtime: mlx-server integration

License

Apache 2.0. See LICENSE.

About

OpenAI-compatible HTTP server for mlx-swift-lm on Apple Silicon. Drop-in replacement for llama-server with state-of-the-art speculative decoding.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages