Still a work in progress and at a very early stage. This is a tutorial on LLM serving using MLX for system engineers. The codebase is built (almost!) solely on MLX array/matrix APIs without any high-level neural network APIs, so that we can build the model serving infrastructure from scratch and dig into the optimizations.
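For a taste of that style, scaled dot-product attention can be written directly with MLX array operations. The snippet below is a minimal sketch for illustration, assuming the standard `mlx.core` API; it is not the exact code from the book.

```python
import math
import mlx.core as mx

def scaled_dot_product_attention(q: mx.array, k: mx.array, v: mx.array) -> mx.array:
    # q, k, v: (..., seq_len, head_dim); no masking, single head for simplicity.
    scale = 1.0 / math.sqrt(q.shape[-1])
    scores = mx.matmul(q * scale, mx.swapaxes(k, -2, -1))  # (..., seq_len, seq_len)
    weights = mx.softmax(scores, axis=-1)
    return mx.matmul(weights, v)

# Hypothetical shapes for a quick sanity check: (batch, seq_len, head_dim).
q = k = v = mx.random.normal(shape=(2, 8, 64))
print(scaled_dot_product_attention(q, k, v).shape)  # (2, 8, 64)
```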
The goal is to learn the techniques behind efficiently serving a large language model (i.e., Qwen2 models).
Why MLX: nowadays it's easier to get a macOS-based local development environment than to set up an NVIDIA GPU.
Why Qwen2: it was the first LLM I interacted with -- it's the go-to example in the vLLM documentation. I spent some time reading the vLLM source code and built up some knowledge around it.
The tiny-llm book is available at https://skyzh.github.io/tiny-llm/. You can follow the guide and start building.
You may join skyzh's Discord server and study with the tiny-llm community.
| Week + Chapter | Topic | Code | Test | Doc |
|---|---|---|---|---|
| 1.1 | Attention | ✅ | ✅ | ✅ |
| 1.2 | RoPE | ✅ | ✅ | ✅ |
| 1.3 | Grouped Query Attention | ✅ | 🚧 | 🚧 |
| 1.4 | RMSNorm and MLP | ✅ | 🚧 | 🚧 |
| 1.5 | Transformer Block | ✅ | 🚧 | 🚧 |
| 1.6 | Load the Model | ✅ | 🚧 | 🚧 |
| 1.7 | Generate Responses (aka Decoding) | ✅ | ✅ | 🚧 |
| 2.1 | KV Cache | ✅ | 🚧 | 🚧 |
| 2.2 | Quantized Matmul and Linear - CPU | ✅ | 🚧 | 🚧 |
| 2.3 | Quantized Matmul and Linear - GPU | ✅ | 🚧 | 🚧 |
| 2.4 | Flash Attention - CPU | ✅ | 🚧 | 🚧 |
| 2.5 | Flash Attention - GPU | 🚧 | 🚧 | 🚧 |
| 2.6 | Continuous Batching | 🚧 | 🚧 | 🚧 |
| 2.7 | Speculative Decoding | 🚧 | 🚧 | 🚧 |
| 3.1 | Paged Attention - Part 1 | 🚧 | 🚧 | 🚧 |
| 3.2 | Paged Attention - Part 2 | 🚧 | 🚧 | 🚧 |
| 3.3 | MoE (Mixture of Experts) | 🚧 | 🚧 | 🚧 |
| 3.4 | Prefill-Decode Separation | 🚧 | 🚧 | 🚧 |
| 3.5 | Scheduler | 🚧 | 🚧 | 🚧 |
| 3.6 | AI Agent | 🚧 | 🚧 | 🚧 |
| 3.7 | Streaming API Server | 🚧 | 🚧 | 🚧 |
Other topics not covered: quantized/compressed KV cache.