Distributed LLM Inference for Apple Silicon Clusters
RUN BIG MODELS | RUN LONG CONTEXT | MAXIMIZE UTILIZATION
dnet runs LLMs across Apple Silicon devices. Modular execution strategies, automatic device profiling, drop-in OpenAI API.
Execution
- No Memory Ceiling: Run models that exceed total cluster memory; compute/I/O overlap keeps data flowing (see the sketch after this list)
- UMA-Specific: Designed around Apple Silicon's unified memory for efficient layer swapping
- OpenAI-Compatible: Drop-in /v1/chat/completions endpoint
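The memory-ceiling point relies on overlapping compute with weight I/O. The sketch below is an illustration only, not dnet's implementation: a double-buffering loop that prefetches the next layer's weights while the current layer runs, so only about two layers need to be resident at a time.

```python
# Minimal illustration of compute/I/O overlap (not dnet's code): prefetch the
# next layer's weights from disk while the current layer is computing.
from concurrent.futures import ThreadPoolExecutor

def load_layer(i: int) -> str:
    # Stand-in for reading one layer's weights from disk.
    return f"weights[{i}]"

def run_layer(weights: str, x: int) -> int:
    # Stand-in for the layer's forward pass.
    return x + 1

def forward(x: int, n_layers: int) -> int:
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(load_layer, 0)
        for i in range(n_layers):
            weights = pending.result()                  # wait for this layer's weights
            if i + 1 < n_layers:
                pending = io.submit(load_layer, i + 1)  # prefetch the next layer during compute
            x = run_layer(weights, x)
    return x

print(forward(0, n_layers=8))
```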
Cluster Management
- Automatic Discovery: Nodes find each other; no manual topology configuration
- Thunderbolt Detection: Automatically utilizes Thunderbolt for high-bandwidth inter-device communication
Workload Assignment
- Device Profiling: Measures FLOPs, memory, and inter-device latency per node
- Model Profiling: Analyzes compute and memory requirements per layer
- Heterogeneity-Aware Solver: Topology-aware assignment that accounts for device capability, network speed, KV cache size, and disk speed (a toy illustration follows below)
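For rough intuition only (this is not the distilp solver, and it ignores network and disk speed; device names and numbers are made up), a heterogeneity-aware assignment can be pictured as splitting layers in proportion to each device's measured compute while respecting its memory budget:

```python
# Toy illustration only, not the actual distilp solver: split layers across
# devices in proportion to measured compute, capped by each device's memory.
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    tflops: float  # measured compute throughput
    mem_gb: float  # memory available for weights

def assign_layers(devices: list[Device], n_layers: int, layer_gb: float) -> dict[str, int]:
    total = sum(d.tflops for d in devices)
    remaining = n_layers
    plan: dict[str, int] = {}
    for d in sorted(devices, key=lambda d: d.tflops, reverse=True):
        want = round(n_layers * d.tflops / total)  # proportional to compute
        fit = int(d.mem_gb // layer_gb)            # memory cap
        plan[d.name] = min(want, fit, remaining)
        remaining -= plan[d.name]
    if remaining:  # leftovers go to the fastest device; a real solver could also stream them from disk
        plan[max(devices, key=lambda d: d.tflops).name] += remaining
    return plan

print(assign_layers([Device("node-a", 20.0, 48.0), Device("node-b", 5.0, 16.0)],
                    n_layers=36, layer_gb=1.0))
```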
✅ Pipelined-ring - Run >32B 8-bit models across devices with insufficient total memory
🚧 Long context - Make >128K context windows a reality for home clusters
🚧 High throughput - Maximize throughput via tensor parallelism
🚧 Unified backend - A single optimized backend for Apple Silicon, NVIDIA, and AMD (currently Apple Silicon only, via MLX)
dnet requires several submodules, which can all be cloned with the following command:
git clone --recurse-submodules https://github.com/firstbatchxyz/dnet.git
dnet uses uv, so make sure it is installed. You can check for uv with the command below, and follow the installation guide if you do not have it.
uv --version
dnet currently only supports MLX on Apple Silicon. To install, run:
uv sync --extra mac
After syncing dependencies, generate protos:
uv run ./scripts/generate_protos.py
Alternatively, dnet supports make commands; run make protos to generate the protos.
dnet uses a dynamic topology approach: nodes start without models, then the API discovers devices and distributes layers optimally using distilp.
- Start Shards: Launch shard nodes on each device.
- Start API: Launch the API node; one of the shards should run on the same device.
- Prepare Topology: API discovers devices and solves for optimal layer distribution.
- Load Model: API instructs shards to load their assigned layers.
- Inference: Use the /v1/chat/completions endpoint for generation.
See catalog for supported models.
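As a minimal end-to-end sketch of the Prepare Topology, Load Model, and Inference steps above (assuming shards and the API node are already running, the API listens on localhost:8080, and the requests package is installed):

```python
# Sketch of the prepare -> load -> generate flow over plain HTTP.
# Assumes the API node is reachable at localhost:8080 and `requests` is installed.
import requests

API = "http://localhost:8080/v1"
MODEL = "Qwen/Qwen3-4B-MLX-4bit"

# Prepare Topology: discover devices and solve the layer distribution.
topology = requests.post(f"{API}/prepare_topology", json={"model": MODEL}).json()

# Load Model: the prepare_topology response is posted back as the load_model body.
requests.post(f"{API}/load_model", json=topology).raise_for_status()

# Inference: OpenAI-compatible chat completion (standard OpenAI response shape assumed).
resp = requests.post(f"{API}/chat/completions", json={
    "model": MODEL,
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 100,
})
print(resp.json()["choices"][0]["message"]["content"])
```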
dnet comes with a TUI built in Rust, providing a neat interface for you to load models, view the topology and chat with the loaded models.
Install the TUI with:
cargo install --git https://github.com/firstbatchxyz/dnet-tui.git
Then simply run it with:
dnet-tui
For more details, check out the repository.
Start a shard node with gRPC and HTTP ports:
uv run dnet-shard --http-port 8081 --grpc-port 58081
Each shard should be started on a different device and with a different port (try incrementing by one for each shard), like the following:
uv run dnet-shard --http-port 8082 --grpc-port 58082
Start the API node:
uv run dnet-api --http-port 8080 --grpc-port 58080
To run inference, we must first prepare the topology (discover nodes) and then load the model itself. After that, we can call the completions endpoint as usual.
Tip
We have a script that can prepare the model and load it at once:
uv run ./scripts/prepare_model.py Qwen/Qwen3-4B-MLX-4bit
Discover devices and compute the optimal layer distribution:
curl -X POST http://localhost:8080/v1/prepare_topology \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-4B-MLX-4bit"
}'
The response will be the optimal topology (as given by the solver) for the discovered devices.
Note
Once the topology is prepared, you can fetch it later via the /v1/topology endpoint:
curl http://localhost:8080/v1/topology \
-H "Content-Type: application/json" \Load the model on shards with prepared topology:
curl -X POST http://localhost:8080/v1/load_model \
-H "Content-Type: application/json" \
-d "$OUTPUT_FROM_PREPARE_TOPOLOGY"
Generate text using the loaded model:
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen2.5-0.5B-Instruct-4bit",
"messages": [
{"role": "user", "content": "What is the capital of France?"}
],
"max_tokens": 100
}'
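Since the endpoint is OpenAI-compatible, you can also point the official openai Python client at the API node. A minimal sketch, assuming the API runs on localhost:8080 and no API key is enforced:

```python
# Sketch: using the OpenAI Python SDK against the dnet API node.
# Assumes the API runs on localhost:8080 and does not require an API key.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
reply = client.chat.completions.create(
    model="Qwen/Qwen3-4B-MLX-4bit",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=100,
)
print(reply.choices[0].message.content)
```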
You can get the list of discoverable devices with:
curl http://localhost:8080/v1/devices \
-H "Content-Type: application/json"
Before testing, make sure to install the dev extras:
uv sync --extra dev --extra mac
You can run Pytest tests via:
uv run pytest -v
You can check linting and formatting via Ruff:
# lint
uvx ruff check
# format
uvx ruff format --diff
# typecheck
uv run mypy .
Tip
If you are using VS Code, we have prepared tasks that you can run easily from the Command Palette > Tasks: Run Task.
dnet is built on top of MLX and inspired by pioneering work in distributed inference:
Prima.cpp: Fast 30-70B LLM Inference on Heterogeneous and Low-Resource Home Clusters
Exo: Run your own AI cluster at home with everyday devices
Petals: Collaborative Inference for Large Language Models
You can find the license here.
If you have used this work, please feel free to cite us!

