
dnet

Distributed LLM Inference for Apple Silicon Clusters

License: Apache-2.0 · Tests

RUN BIG MODELS | RUN LONG CONTEXT | MAXIMIZE UTILIZATION

dnet runs LLMs across Apple Silicon devices. Modular execution strategies, automatic device profiling, drop-in OpenAI API.

Features

  • Execution

    • No Memory Ceiling: Run models that exceed total cluster memory—compute/I/O overlap keeps data flowing
    • UMA-Specific: Designed around Apple Silicon's unified memory architecture for efficient layer swapping
    • OpenAI-Compatible: Drop-in /v1/chat/completions endpoint
  • Cluster Management

    • Automatic Discovery: Nodes find each other; no manual topology configuration
    • Thunderbolt Detection: Automatically utilizes Thunderbolt for high-bandwidth inter-device communication
  • Workload Assignment

    • Device Profiling: Measures FLOPs, memory, and inter-device latency per node
    • Model Profiling: Analyzes compute and memory requirements per layer
    • Heterogeneity-Aware Solver: Topology-aware assignment that accounts for device capability, network speed, KV cache size, and disk speed
  • Pipelined-ring - Run >32B 8-bit models even when the cluster's total memory cannot hold them

  • 🚧 Long context - Make >128K context windows a reality for home clusters

  • 🚧 High throughput - Maximize throughput via tensor parallelism

  • 🚧 Unified backend - A single optimized backend for Apple Silicon, NVIDIA, and AMD (currently Apple Silicon only, via MLX)

Installation

dnet requires several submodules, which can all be cloned with the following command:

git clone --recurse-submodules https://github.com/firstbatchxyz/dnet.git

dnet uses uv, so make sure it is installed. You can check for uv with the command below, and follow the installation guide if you do not have it.

uv --version

dnet currently only supports MLX on Apple Silicon. To install, run:

uv sync --extra mac

After syncing dependencies, generate protos:

uv run ./scripts/generate_protos.py

dnet also provides make targets; run make protos to generate the protos.

Usage

dnet uses a dynamic topology: nodes start without any model loaded, then the API discovers the devices and distributes layers optimally using distilp. The flow is as follows (a condensed end-to-end example follows the steps):

  1. Start Shards: Launch shard nodes on each device.
  2. Start API: Launch the API node; one of the shards should reside on the same device.
  3. Prepare Topology: API discovers devices and solves for optimal layer distribution.
  4. Load Model: API instructs shards to load their assigned layers.
  5. Inference: Use /v1/chat/completions endpoint for generation.

See catalog for supported models.
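
For a quick single-machine trial, the whole flow condenses to the commands below (the ports and the model are examples; the sections that follow cover each step in detail):

# Example ports and model; adjust to your setup
# In one terminal: start a shard
uv run dnet-shard --http-port 8081 --grpc-port 58081

# In another terminal: start the API on the same machine
uv run dnet-api --http-port 8080 --grpc-port 58080

# Prepare the topology and load the model in one step
uv run ./scripts/prepare_model.py Qwen/Qwen3-4B-MLX-4bit

# Generate text via the OpenAI-compatible endpoint
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-4B-MLX-4bit", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 50}'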

image of dnet TUI

Viewing dnet TUI

dnet comes with a TUI built in Rust, providing a neat interface to load models, view the topology, and chat with the loaded models.

Install the TUI with:

cargo install --git https://github.com/firstbatchxyz/dnet-tui.git

Then simply run with:

dnet-tui

For more details, check out the repository.

Running a Shard

Start a shard node with gRPC and HTTP ports:

uv run dnet-shard --http-port 8081 --grpc-port 58081

Each shard should be started on a different device and with different ports (for example, increment by one for each shard), like the following:

uv run dnet-shard --http-port 8082 --grpc-port 58082

Running an API

Start the API node:

uv run dnet-api --http-port 8080 --grpc-port 58080

To run inference, we first prepare the topology (discover nodes) and then load the model itself. After that, we can call the completions endpoint as usual.

Tip

We have a script that prepares the topology and loads the model in one step:

uv run ./scripts/prepare_model.py Qwen/Qwen3-4B-MLX-4bit

Prepare Topology

Discover devices and compute optimal layer distribution:

curl -X POST http://localhost:8080/v1/prepare_topology \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-4B-MLX-4bit"
  }'

The response will be the optimal topology (as computed by the solver) for the discovered devices.

Note

Once the topology is prepared, you can fetch it later via the /topology endpoint:

curl http://localhost:8080/v1/topology \
  -H "Content-Type: application/json"

Load Model

Load the model on the shards using the prepared topology:

curl -X POST http://localhost:8080/v1/load_model \
  -H "Content-Type: application/json" \
  -d "$OUTPUT_FROM_PREPARE_TOPOLOGY"
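
For example, a minimal sketch (assuming a POSIX-compatible shell): capture the solver output from prepare_topology in a variable and pass it straight to load_model:

# Assumes a POSIX-compatible shell
# Capture the topology returned by the solver
TOPOLOGY=$(curl -s -X POST http://localhost:8080/v1/prepare_topology \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-4B-MLX-4bit"}')

# Instruct the shards to load their assigned layers
curl -X POST http://localhost:8080/v1/load_model \
  -H "Content-Type: application/json" \
  -d "$TOPOLOGY"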

a shard with a loaded model

Chat Completions

Generate text using the loaded model:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen2.5-0.5B-Instruct-4bit",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 100
  }'
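
Since the endpoint follows the OpenAI chat completions format, the assistant reply can be pulled out of the JSON response; the sketch below assumes the standard OpenAI response shape and that jq is installed:

# Assumes the standard OpenAI response shape; requires jq
curl -s -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-4B-MLX-4bit",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 100
  }' | jq -r '.choices[0].message.content'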

Devices

You can get the list of discoverable devices with:

curl http://localhost:8080/v1/devices \
  -H "Content-Type: application/json"

Testing

Before testing, make sure to install the dev extras:

uv sync --extra dev --extra mac

You can run Pytest tests via:

uv run pytest -v

You can check linting, formatting, and types via Ruff and mypy:

# lint
uvx ruff check

# format
uvx ruff format --diff

# typecheck
uv run mypy .

Tip

If you are using VS Code, we have prepared tasks that you can run from the Command Palette > Tasks: Run Task.

Acknowledgements

dnet is built on top of MLX and inspired by pioneering work in distributed inference:

Prima.cpp: Fast 30-70B LLM Inference on Heterogeneous and Low-Resource Home Clusters

Exo: Run your own AI cluster at home with everyday devices

Petals: Collaborative Inference for Large Language Models

License

You can find the license here.

Cite

If you have used this work, please feel free to cite us!
