Implement Phase 1 + tool calling #1
Merged
Build mlx-server out from the Phase 0 scaffold into a working OpenAI-compatible inference server on mlx-swift-lm.

- Wire mlx-swift-lm v3.32.1-alpha; split into a testable MLXServerKit library plus a thin MLXServer executable
- Load an MLX model from a local directory or a HuggingFace id
- GET /v1/models
- POST /v1/chat/completions, both streaming (SSE) and non-streaming
- OpenAI tool calling via mlx-swift-lm's chat-template parsers
- Inferencing protocol so the HTTP layer is testable without a model
- 20 tests: OpenAI types, mapping, SSE framing, HTTP routes
- CI builds with xcodebuild so the Metal shaders are verified, not just compilation

Validated end-to-end against Qwen3-4B-4bit and Qwen3.6-35B-A3B-8bit. Running requires an xcodebuild build: SwiftPM cannot compile mlx-swift's Metal shaders.
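To illustrate how an Inferencing protocol can make the streaming path testable without a model, here is a minimal sketch. The protocol shape, `StubInferencer`, `Chunk`, `sseFrame`, and `sseBody` are all hypothetical names chosen for illustration; the PR does not show the actual signatures in MLXServerKit, and real chat-completion chunks carry more fields than this stand-in. The only parts taken as given are the SSE framing (`data: <payload>` followed by a blank line) and the `[DONE]` sentinel used by OpenAI-style streams.

```swift
import Foundation

// Hypothetical shape; the PR names an `Inferencing` protocol but does not show it.
protocol Inferencing {
    /// Calls `onPiece` once per generated text piece, in order.
    func generate(prompt: String, onPiece: (String) -> Void)
}

// Model-free stand-in for tests: replays fixed pieces instead of running a model.
struct StubInferencer: Inferencing {
    let pieces: [String]

    func generate(prompt: String, onPiece: (String) -> Void) {
        pieces.forEach(onPiece)
    }
}

// Simplified stand-in for a streaming chunk; real chunks carry more fields.
struct Chunk: Encodable {
    let content: String
}

// Frames one payload as a server-sent event: "data: <payload>" plus a blank line.
func sseFrame(_ payload: String) -> String {
    "data: \(payload)\n\n"
}

// Builds the SSE body a streaming handler would write, ending with the
// "[DONE]" sentinel that OpenAI-style streams use.
func sseBody<Engine: Inferencing>(prompt: String, engine: Engine) throws -> String {
    let encoder = JSONEncoder()
    var frames: [String] = []
    var failure: Error?
    engine.generate(prompt: prompt) { piece in
        do {
            let data = try encoder.encode(Chunk(content: piece))
            frames.append(sseFrame(String(decoding: data, as: UTF8.self)))
        } catch {
            failure = error  // remember the first encoding error and rethrow below
        }
    }
    if let failure = failure { throw failure }
    frames.append(sseFrame("[DONE]"))
    return frames.joined()
}
```

Because the handler depends only on the protocol, a test can pass `StubInferencer(pieces: ["Hel", "lo"])` and assert on the exact bytes of the SSE body, with no model weights or Metal library involved.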
What
Builds mlx-server out from the Phase 0 scaffold into a working OpenAI-compatible inference server on `mlx-swift-lm`.

- `mlx-swift-lm` v3.32.1-alpha wired in; package split into a testable `MLXServerKit` library + a thin `MLXServer` executable
- `GET /v1/models`
- `POST /v1/chat/completions`, streaming (SSE) and non-streaming
- OpenAI tool calling via `mlx-swift-lm`'s chat-template parsers (`tools`/`tool_choice` in, `tool_calls` out)
- `Inferencing` protocol so the HTTP layer is testable without a model

Why
Phase 0 was a scaffold serving only `/health`. This is "where we left off": the README's Phase 1, plus tool calling pulled forward from Phase 3 so the server is actually usable by agentic clients (opencode).

How it was verified
End-to-end against two local MLX models, `Qwen3-4B-4bit` and `Qwen3.6-35B-A3B-8bit` (Qwen3.5-MoE): model load, non-streaming completion with token usage, streaming SSE, and tool calls.

Notes
- Running requires an `xcodebuild` build, not `swift build`: SwiftPM cannot compile mlx-swift's Metal shaders, so a `swift build` binary fails at runtime with `Failed to load the default metallib`. CI now builds with `xcodebuild` to verify the real artifact; `swift test` runs the (model-free) suite.
- Model `<think>` output currently goes into the response `content`; routing it to `reasoning_content` is a planned follow-up.

Roadmap
Phase 0 and Phase 1 (+ tool calling) are now done. Next: reasoning/content split, then LLMKube `runtime: mlx-server` integration (Phase 4).
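As a rough idea of what the reasoning/content split could look like for a non-streaming response, the sketch below separates a `<think>…</think>` span from the rest of the text. `splitReasoning` is a hypothetical helper, not code from this PR, and it assumes at most one complete `<think>` block in a fully generated string; a streaming implementation would need to track partial tags across chunks.

```swift
import Foundation

// Hypothetical helper: separates the first <think>…</think> span from the answer.
// Returns `reasoning == nil` when no complete block is present.
func splitReasoning(_ text: String) -> (reasoning: String?, content: String) {
    guard let open = text.range(of: "<think>"),
          let close = text.range(of: "</think>", range: open.upperBound..<text.endIndex)
    else {
        return (nil, text)
    }
    // Text between the tags becomes the reasoning.
    let reasoning = String(text[open.upperBound..<close.lowerBound])
        .trimmingCharacters(in: .whitespacesAndNewlines)
    // Everything outside the tags stays in the visible content.
    let outside = String(text[..<open.lowerBound]) + String(text[close.upperBound...])
    let content = outside.trimmingCharacters(in: .whitespacesAndNewlines)
    return (reasoning, content)
}
```

The two results would map onto `reasoning_content` and `content` respectively; text without a closed `<think>` block is passed through unchanged so that a truncated generation is never silently dropped.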