Run a transformer that doesn't fit on your Mac across your Mac and your iPhone.
A two-week solo investigation: split Qwen 2.5 7B (FP16, ~15 GB of weights) between an M1 Pro MacBook (16 GB) and an iPhone 13 Pro (6 GB), route hidden activations over Wi-Fi, and measure what happens. Working prototype, with the receipts attached.
The split pipeline runs 1.68× faster than solo FP16 on the same Mac under memory pressure (1.59×–1.85× across days, depending on Wi-Fi state). The full paper, with methods, regime analysis, and limitations, is in PAPER.md.
Distribution helps when the model doesn't fit on one device. If your model fits (e.g. you can quantize 7B to Q4_K_M and run it in Ollama at ~29 tok/s), distribution loses by ~850×. DIM addresses one specific regime — FP16 required, weights spilling to swap — and that's the regime the paper measures. See Regime Analysis.
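The regime boundary can be sketched as a back-of-envelope check. This is a hypothetical helper using the numbers quoted above, not code from the repo:

```python
# Back-of-envelope check for when splitting a model across devices can pay
# off. Hypothetical sketch, not code from the repo: distribution only wins
# when FP16 weights spill to swap AND quantization is off the table.

def should_distribute(weight_bytes: int, free_ram_bytes: int,
                      quantized_fits: bool) -> bool:
    """True only in DIM's regime: FP16 required, weights spilling to swap."""
    fits_in_ram = weight_bytes <= free_ram_bytes
    return not fits_in_ram and not quantized_fits

# Qwen 2.5 7B in FP16: ~15 GB of weights on a 16 GB M1 Pro, with macOS and
# other apps resident (assumed ~10 GB usable) -- the weights spill.
weights = 15 * 1024**3
usable_ram = 10 * 1024**3
print(should_distribute(weights, usable_ram, quantized_fits=False))  # → True
```

If a Q4_K_M quantization fits in RAM (the ~29 tok/s Ollama case), the same check returns False and running locally wins.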
```
Mac (host, PyTorch FP16)          iPhone (CoreML FP16)            Mac (host, cont'd)
─────────────────────────         ─────────────────────           ──────────────────
Embedding (INT8)
Head blocks 0..10 (FP16) ──TCP──► Middle blocks 11..16  ──TCP──► Tail blocks 17..27
                                  (FP16 forward)                  LM head (INT8)
                                                                  → next-token logits
```
The Mac runs the embedding, head transformer blocks, tail transformer blocks, and LM head. The iPhone runs six contiguous middle transformer blocks via CoreML. Hidden activations cross over Wi-Fi as FP16 tensors (~896 KB per token). Full design: docs/ARCHITECTURE.md.
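The split and the activation hand-off can be sketched as follows. This is illustrative code only: the block indices match the README (0..10 / 11..16 / 17..27), but the framing protocol shown here is an assumption, and the real pipeline and wire format live in system/pipeline/ and docs/ARCHITECTURE.md:

```python
# Sketch of the head/middle/tail block split and an FP16 activation
# hand-off over TCP. Illustrative only; the repo's actual wire format
# is defined in docs/ARCHITECTURE.md.
import socket
import struct

import numpy as np

# Block assignment per device (28 transformer blocks total in Qwen 2.5 7B).
HEAD, MIDDLE, TAIL = range(0, 11), range(11, 17), range(17, 28)

def send_activations(sock: socket.socket, hidden: np.ndarray) -> None:
    """Frame = 4-byte big-endian payload length, then raw FP16 bytes."""
    payload = hidden.astype(np.float16).tobytes()
    sock.sendall(struct.pack(">I", len(payload)) + payload)

def recv_activations(sock: socket.socket, shape: tuple[int, ...]) -> np.ndarray:
    """Read one frame and reinterpret it with the agreed-upon shape."""
    (size,) = struct.unpack(">I", _recv_exact(sock, 4))
    return np.frombuffer(_recv_exact(sock, size), dtype=np.float16).reshape(shape)

def _recv_exact(sock: socket.socket, n: int) -> bytes:
    """recv() may return partial data; loop until exactly n bytes arrive."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed mid-frame")
        buf += chunk
    return buf
```

Raw FP16 bytes keep the per-hop payload small relative to the weights themselves; the trade-offs against INT8 on the wire are discussed in docs/ARCHITECTURE.md.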
```
DIM/
├── PAPER.md        ← research-paper-style writeup, the main read
├── README.md       ← you are here
├── system/         ← the working distributed system (Python + Swift)
│   ├── pipeline/       ← head/middle/tail Qwen pipeline + evaluation harness
│   ├── ios_worker/     ← iOS app (Xcode project, Swift, CoreML inference)
│   ├── swift_worker/   ← macOS Swift CLI worker (early prototype)
│   └── benchmarks/     ← single-device baselines
├── simulation/     ← Python cluster simulator (devices.json, scenarios)
├── analysis/       ← figure + table generation pipeline
│   ├── figures/        ← all 8 PNGs in PAPER.md
│   └── tables/         ← run summary table
├── docs/           ← ARCHITECTURE, SETUP, REPRODUCING, LIMITATIONS
│   └── notes/          ← internal lab notes (kept for completeness)
└── reports/        ← 20 dated lab-notebook entries from the 2-week build
```
Three steps to reproduce the headline figure on your own machines (not the full setup — that's in docs/SETUP.md):
```sh
git clone https://github.com/dannydyl/DIM.git
cd DIM
python -m venv .venv && source .venv/bin/activate
pip install matplotlib numpy
# Regenerate every figure and table from the committed JSONL data
python analysis/generate_all.py
```

To run the system end-to-end (split a model, build the iOS app, take measurements), follow docs/SETUP.md and docs/REPRODUCING.md.
| Component | Requirement |
|---|---|
| macOS | 14+ |
| Xcode | 15+ |
| Python | 3.11+ |
| iPhone | 12 Pro or later, iOS 17+ |
| Apple Developer account | Free personal team is enough |
| Disk | ~60 GB for model weights |
| Hugging Face token | Required for downloading Qwen 2.5 7B |
- The full paper, with figures and methodology: PAPER.md
- Why FP16 over the wire and not INT8: docs/ARCHITECTURE.md
- Step-by-step setup, including iOS signing: docs/SETUP.md
- Exact reproduction commands: docs/REPRODUCING.md
- Honest list of what this work does not do: docs/LIMITATIONS.md
- Two weeks of dated lab notes (dead ends included): reports/
Working prototype, end of the initial two-week sprint. Not maintained as a product; not packaged for non-technical users. Open-sourced as a faithful record of one specific configuration that worked, intended for anyone curious about distributed local inference on consumer Apple hardware.
MIT — see LICENSE.
If this work is useful in your own writing, citing the repo URL and the commit hash is fine. There is no formal publication.
