Skip to content

8.8 Run MLX on Apple Silicon with Harbor

av edited this page Jun 14, 2026 · 2 revisions

Run MLX on Apple Silicon with Harbor: Local LLMs and HuggingFace Models

Harbor makes it simple to run Apple's MLX framework for high-performance local LLM inference on M-series Macs. MLX delivers excellent Metal-accelerated performance and memory efficiency on Apple Silicon, while Harbor keeps the rest of your local AI stack (chat UI, web search, coding agents, image tools) wired up through the same CLI and Docker Compose workflow.

This guide covers installation, starting the backend, finding and pulling models from HuggingFace, configuration, and using MLX inside a full Harbor local LLM stack.

Why MLX for Local LLMs on Apple Silicon

Apple Silicon Macs have unified memory and a powerful GPU architecture. MLX is purpose-built by Apple to take full advantage of that hardware:

  • Optimized Metal kernels for transformer workloads
  • Efficient handling of large context and quantized models
  • Direct integration with the host's memory and accelerators

Harbor's mlx backend runs the official mlx-lm server natively on macOS (for full Metal access) and exposes an OpenAI-compatible API to containers and host tools through a lightweight proxy. You get the performance of a host-native runtime with the convenience of Harbor service management, cross-service integrations, and one-command workflows.

Prerequisites

  • Apple Silicon Mac (M1 or newer recommended)
  • macOS 11 (Big Sur) or later
  • Docker Desktop for Mac (Apple Silicon version)
  • uv package manager (required for Harbor's automatic host lifecycle management of mlx-lm)
  • 16GB+ unified memory recommended for comfortable 7B-14B model usage

Install uv if you do not have it:

# Homebrew
brew install uv

# Or official installer
curl -LsSf https://astral.sh/uv/install.sh | sh

Install or update Harbor following the installation guide.

Start the MLX Backend

harbor up mlx

Harbor performs the following when HARBOR_MLX_MANAGE_HOST=true (the default):

  1. Ensures the workspace at services/mlx/ has a Python virtual environment with mlx-lm[server]
  2. Starts mlx-lm.server on the host (default port 8095) pointing at the configured model
  3. Brings up a Caddy proxy container that forwards traffic from the Compose network and host port 34930

The first start downloads the default model from HuggingFace if it is not already cached.

Open the proxy health endpoint to confirm it is ready:

curl http://localhost:34930/health

Finding MLX Models on HuggingFace

MLX works best with models that have been converted to the MLX format (weights + tokenizer in a layout optimized for the framework). The easiest place to find them is the mlx-community organization on HuggingFace:

Recommended starting points for Apple Silicon (good speed/quality balance in 2025):

  • Small and fast: mlx-community/Qwen3.5-4B-4bit, mlx-community/Qwen3-4B-4bit
  • Strong general performance: mlx-community/Meta-Llama-3.1-8B-Instruct-4bit, mlx-community/Qwen2.5-7B-Instruct-4bit
  • Larger: 14B and 32B 4bit/8bit variants when you have 24GB+ unified memory

Many community converters publish both 4-bit and 8-bit quantized versions. 4-bit variants usually offer the best tokens-per-second on Apple hardware while staying within reasonable memory limits.

You can also use the HuggingFace model search with the mlx tag filter or look for repositories that contain mlx in the name or config.json with "model_type": "mlx" signals.

Pull Models with Harbor

Use the unified harbor models command or the mlx-specific subcommand:

# Unified interface (recommended)
harbor models pull --source mlx mlx-community/Qwen3.5-4B-4bit

# Direct mlx command (also works)
harbor mlx pull mlx-community/Qwen3.5-4B-4bit

# Alias form
harbor models mlx pull mlx-community/Qwen3.5-4B-4bit

Harbor delegates the download to hf download inside the managed uv environment in services/mlx/. Weights land in the standard HuggingFace cache (~/.cache/huggingface by default).

Model removal is not supported through Harbor (harbor mlx rm and harbor models rm --source mlx are unavailable). Delete cached weights from ~/.cache/huggingface manually when you need to reclaim disk space.

List what your mlx-lm instance currently sees:

harbor models ls --source mlx
# or
harbor mlx ls

Change the Default Model

Update both the short name Harbor uses internally and the HuggingFace repository path:

harbor config set mlx.model mlx-community/Qwen3.5-4B-4bit
harbor config set mlx.hf.path mlx-community/Qwen3.5-4B-4bit

Restart the host runner (the proxy container can keep running):

harbor mlx stop
harbor mlx start

The next harbor up mlx (or any stack that includes mlx) will also launch the new model after a full stack restart.

Run a Full Local LLM Stack with MLX

Start Open WebUI pointed at MLX:

harbor up webui mlx
harbor open

Open WebUI will automatically discover the models served by the MLX backend and let you chat, use tools, RAG, and voice features while the actual inference stays accelerated by MLX on the host.

Use MLX from host-side coding agents and tools:

harbor launch --backend mlx --model mlx-community/Qwen3.5-4B-4bit codex
harbor launch --backend mlx --model mlx-community/Qwen3.5-4B-4bit claude

The --web flag also works to add SearXNG-powered web search to the launched agent:

harbor launch --backend mlx --web codex

Direct API Access

  • From the Mac host: http://localhost:34930/v1
  • From other Harbor containers: http://mlx:8080/v1

The endpoint implements the standard OpenAI chat completions, completions, and models routes, so any client that speaks OpenAI can target it.

Configuration Reference

Key variables (set via harbor config set):

  • mlx.model — short name shown in UIs and launchers
  • mlx.hf.path — HuggingFace repository to load on start
  • mlx.host.port / mlx.runner.port — ports for proxy and host process
  • mlx.manage.host — set to false if you want to run mlx-lm yourself and only use Harbor for proxying

See the full MLX backend documentation for the complete list and volume notes.

Troubleshooting

mlx-lm fails to start

  • Confirm uv is on your PATH: uv --version
  • Check the log: harbor mlx logs
  • The first run can take time while uv creates the venv and downloads mlx-lm plus the model weights.

Out of memory or slow performance

  • Switch to a smaller 4-bit model (3B-4B range is very responsive on base M1/M2 machines)
  • Close memory-heavy applications; Apple Silicon shares memory between CPU and GPU

Proxy reports unhealthy

  • Verify the host process is listening: curl http://localhost:8095/v1/models
  • If you manage mlx-lm manually, disable host management and point the proxy at your endpoint:
harbor config set mlx.manage.host false
harbor config set mlx.upstream.url http://host.docker.internal:8095

Next Steps

With Harbor and MLX you get a complete, reproducible local LLM environment that feels native to your Mac while still giving you the full power of the broader Harbor ecosystem.

Clone this wiki locally