Still a work in progress and at a very early stage. This is a tutorial on LLM serving using MLX for system engineers. The codebase is built (almost!) solely on MLX array/matrix APIs without any high-level neural network APIs, so that we can build the model serving infrastructure from scratch and dig into the optimizations.
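For a taste of that style, scaled dot-product attention can be written directly with MLX array operations. The snippet below is a minimal sketch for illustration, assuming the standard `mlx.core` API; it is not the exact code from the book.

```python
import math
import mlx.core as mx

def scaled_dot_product_attention(q: mx.array, k: mx.array, v: mx.array) -> mx.array:
    # q, k, v: (..., seq_len, head_dim); no masking, single head for simplicity.
    scale = 1.0 / math.sqrt(q.shape[-1])
    scores = mx.matmul(q * scale, mx.swapaxes(k, -2, -1))  # (..., seq_len, seq_len)
    weights = mx.softmax(scores, axis=-1)
    return mx.matmul(weights, v)

# Hypothetical shapes for a quick sanity check: (batch, seq_len, head_dim).
q = k = v = mx.random.normal(shape=(2, 8, 64))
print(scaled_dot_product_attention(q, k, v).shape)  # (2, 8, 64)
```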
The goal is to learn the techniques behind efficiently serving a large language model (i.e., Qwen2 models).
Why MLX: nowadays it's easier to get a macOS-based local development environment than to set up an NVIDIA GPU.
Why Qwen2: it was the first LLM I interacted with -- it's the go-to example in the vLLM documentation. I spent some time reading the vLLM source code and built up some knowledge around it.
The tiny-llm book is available at https://skyzh.github.io/tiny-llm/. You can follow the guide and start building.
You may join skyzh's Discord server and study with the tiny-llm community.
| Week + Chapter | Topic | Code | Test | Doc |
|---|---|---|---|---|
| 1.1 | Attention | ✅ | ✅ | ✅ |
| 1.2 | RoPE | ✅ | ✅ | ✅ |
| 1.3 | Grouped Query Attention | ✅ | 🚧 | 🚧 |
| 1.4 | RMSNorm and MLP | ✅ | 🚧 | 🚧 |
| 1.5 | Transformer Block | ✅ | 🚧 | 🚧 |
| 1.6 | Load the Model | ✅ | 🚧 | 🚧 |
| 1.7 | Generate Responses (aka Decoding) | ✅ | ✅ | 🚧 |
| 2.1 | KV Cache | ✅ | 🚧 | 🚧 |
| 2.2 | Quantized Matmul and Linear - CPU | ✅ | 🚧 | 🚧 |
| 2.3 | Quantized Matmul and Linear - GPU | ✅ | 🚧 | 🚧 |
| 2.4 | Flash Attention - CPU | ✅ | 🚧 | 🚧 |
| 2.5 | Flash Attention - GPU | 🚧 | 🚧 | 🚧 |
| 2.6 | Continuous Batching | 🚧 | 🚧 | 🚧 |
| 2.7 | Speculative Decoding | 🚧 | 🚧 | 🚧 |
| 3.1 | Paged Attention - Part 1 | 🚧 | 🚧 | 🚧 |
| 3.2 | Paged Attention - Part 2 | 🚧 | 🚧 | 🚧 |
| 3.3 | MoE (Mixture of Experts) | 🚧 | 🚧 | 🚧 |
| 3.4 | Prefill-Decode Separation | 🚧 | 🚧 | 🚧 |
| 3.5 | Scheduler | 🚧 | 🚧 | 🚧 |
| 3.6 | AI Agent | 🚧 | 🚧 | 🚧 |
| 3.7 | Streaming API Server | 🚧 | 🚧 | 🚧 |
Other topics not covered: quantized/compressed KV cache.