CLO: Efficient LLM Inference System with CPU-Light KVCache Offloading via Algorithm-System Co-Design
Requirements
- CUDA >= 12.4
- Transformers == 4.47.1
- torch == 2.4.0
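A quick sanity check of the environment (a minimal sketch using standard CUDA/Python commands; it is not part of the repository's scripts):

```bash
# Quick check of the pinned dependencies (illustrative, not part of the repo).
nvcc --version
python -c "import transformers, torch; print(transformers.__version__, torch.__version__)"
```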
GDRCopy
Please refer to the official repository of GDRCopy.
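A typical source build looks roughly like the sketch below; the install prefix and CUDA path are assumptions, so follow the official GDRCopy README for the exact steps on your system.

```bash
# Rough sketch of a GDRCopy source build; prefix/CUDA paths are assumptions --
# see the official NVIDIA/gdrcopy repository for the authoritative instructions.
git clone https://github.com/NVIDIA/gdrcopy.git
cd gdrcopy
make prefix=/usr/local CUDA=/usr/local/cuda all install
sudo ./insmod.sh   # load the gdrdrv kernel module
```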
Others
pip install -r requirements.txt
pip install flashinfer -i https://flashinfer.ai/whl/cu124/torch2.4
cd 3rdparty
bash download.sh
cd ..
git submodule update --init --recursive
bash install.sh

Next, profile the bandwidth and prefetching behavior of your machine to generate the configuration:
cd config
bash run_profile_bandwidth.sh
bash run_profile_prefetch.sh

For the speedup evaluation:
cd speedup
bash run_n2n_*.sh

For a latency breakdown, run these scripts under nsys profile.
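For example, a possible nsys invocation (the concrete script name below is only illustrative) is:

```bash
# Illustrative nsys usage for a latency breakdown; the script name is a placeholder.
nsys profile -o n2n_breakdown --trace=cuda,nvtx,osrt bash run_n2n_llama.sh
nsys stats n2n_breakdown.nsys-rep   # summarize the collected trace
```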
For the speedup ablation study:
cd speedup
bash run_ablation_*.sh

For the accuracy evaluation, please refer to the official repository of RULER.
cd accuracy
bash run_hash_offloading_*
bash run_fullattn_*
bash run_hash_original_*

Here, hash_offloading, hash_original, and fullattn denote CLO+HATA, the original HATA, and dense attention with the full KVCache kept in GPU HBM, respectively.
Besides HATA, CLO additionally integrates Quest and Loki as alternative top-k attention algorithms; simply replace hash in the above script names with quest or loki to use them, as sketched below.
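For instance, the substitution looks like this (the exact script suffixes are illustrative; use the file names that actually exist under accuracy/):

```bash
# Illustrative only: swap "hash" for "quest" or "loki" in the script name.
bash run_hash_offloading_llama.sh    # CLO + HATA
bash run_quest_offloading_llama.sh   # CLO + Quest
bash run_loki_offloading_llama.sh    # CLO + Loki
```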
We also provide a custom implementation of InfiniGen's accuracy evaluation, because its open-source code lacks support for Llama and Qwen:
cd accuracy
bash run_infinigen_*

For the accuracy ablation study:
cd accuracy
bash run_ablation_*