Skip to content

Implement Phase 1 + tool calling#1

Merged
Defilan merged 1 commit into
mainfrom
feat/phase1-tool-calling
May 15, 2026
Merged

Implement Phase 1 + tool calling#1
Defilan merged 1 commit into
mainfrom
feat/phase1-tool-calling

Conversation

@Defilan
Copy link
Copy Markdown
Member

@Defilan Defilan commented May 15, 2026

What

Builds mlx-server out from the Phase 0 scaffold into a working OpenAI-compatible inference server on mlx-swift-lm.

  • mlx-swift-lm v3.32.1-alpha wired in; package split into a testable MLXServerKit library + a thin MLXServer executable
  • Model loading from a local MLX directory or a HuggingFace id
  • GET /v1/models
  • POST /v1/chat/completions — streaming (SSE) and non-streaming
  • OpenAI tool calling via mlx-swift-lm's chat-template parsers (tools / tool_choice in, tool_calls out)
  • Inferencing protocol so the HTTP layer is testable without a model
  • 20 tests: OpenAI type encode/decode, mapping, SSE framing, HTTP routes (stub engine)

Why

Phase 0 was a scaffold serving only /health. This is "where we left off" — the README's Phase 1, plus tool calling pulled forward from Phase 3 so the server is actually usable by agentic clients (opencode).

How it was verified

End-to-end against two local MLX models — Qwen3-4B-4bit and Qwen3.6-35B-A3B-8bit (Qwen3.5-MoE): model load, non-streaming completion with token usage, streaming SSE, and tool calls.

Notes

  • Running requires an xcodebuild build, not swift build: SwiftPM cannot compile mlx-swift's Metal shaders, so a swift build binary fails at runtime with Failed to load the default metallib. CI now builds with xcodebuild to verify the real artifact; swift test runs the (model-free) suite.
  • Reasoning models currently emit <think> content into the response content — routing it to reasoning_content is a planned follow-up.

Roadmap

Phase 0 and Phase 1 (+ tool calling) are now done. Next: reasoning/content split, then LLMKube runtime: mlx-server integration (Phase 4).

Build mlx-server out from the Phase 0 scaffold into a working
OpenAI-compatible inference server on mlx-swift-lm.

- Wire mlx-swift-lm v3.32.1-alpha; split into a testable MLXServerKit
  library plus a thin MLXServer executable
- Load an MLX model from a local directory or a HuggingFace id
- GET /v1/models
- POST /v1/chat/completions, both streaming (SSE) and non-streaming
- OpenAI tool calling via mlx-swift-lm's chat-template parsers
- Inferencing protocol so the HTTP layer is testable without a model
- 20 tests: OpenAI types, mapping, SSE framing, HTTP routes
- CI builds with xcodebuild so the Metal shaders are verified, not
  just compilation

Validated end-to-end against Qwen3-4B-4bit and Qwen3.6-35B-A3B-8bit.

Running requires an xcodebuild build: SwiftPM cannot compile
mlx-swift's Metal shaders.
@Defilan Defilan merged commit 4a0f547 into main May 15, 2026
1 check passed
@Defilan Defilan deleted the feat/phase1-tool-calling branch May 15, 2026 17:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant