CLO: Efficient LLM Inference System with CPU-Light KVCache Offloading via Algorithm-System Co-Design
Requirements
- CUDA >= 12.4
- Transformers == 4.47.1
- torch == 2.4.0
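A quick sanity check of the environment (a minimal sketch using standard CUDA/Python commands; it is not part of the repository's scripts):

```bash
# Quick check of the pinned dependencies (illustrative, not part of the repo).
nvcc --version
python -c "import transformers, torch; print(transformers.__version__, torch.__version__)"
```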
GDRCopy
Please refer to the official repository of GDRCopy.
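A typical source build looks roughly like the sketch below; the install prefix and CUDA path are assumptions, so follow the official GDRCopy README for the exact steps on your system.

```bash
# Rough sketch of a GDRCopy source build; prefix/CUDA paths are assumptions --
# see the official NVIDIA/gdrcopy repository for the authoritative instructions.
git clone https://github.com/NVIDIA/gdrcopy.git
cd gdrcopy
make prefix=/usr/local CUDA=/usr/local/cuda all install
sudo ./insmod.sh   # load the gdrdrv kernel module
```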
Others
pip install -r requirements.txt
pip install flashinfer -i https://flashinfer.ai/whl/cu124/torch2.4
cd 3rdparty
bash download.sh
cd ..
git submodule update --init --recursive
bash install.sh

Next, profile the bandwidth and prefetching behavior of your machine to generate the configuration:
cd config
bash run_profile_bandwidth.sh
bash run_profile_prefetch.sh

For the speedup evaluation:
cd speedup
bash run_n2n_*.sh

For a latency breakdown, run these scripts under nsys profile.
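For example, a possible nsys invocation (the concrete script name below is only illustrative) is:

```bash
# Illustrative nsys usage for a latency breakdown; the script name is a placeholder.
nsys profile -o n2n_breakdown --trace=cuda,nvtx,osrt bash run_n2n_llama.sh
nsys stats n2n_breakdown.nsys-rep   # summarize the collected trace
```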
For the speedup ablation study:
cd speedup
bash run_ablation_*.sh

For the accuracy evaluation, please refer to the official repository of RULER.
cd accuracy
bash run_hash_offloading_*
bash run_fullattn_*
bash run_hash_original_*

Here, hash_offloading, hash_original, and fullattn denote CLO+HATA, the original HATA, and dense attention with the full KVCache kept in GPU HBM, respectively.
Besides HATA, CLO additionally integrates Quest and Loki as alternative top-k attention algorithms; simply replace hash in the above script names with quest or loki to use them, as sketched below.
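For instance, the substitution looks like this (the exact script suffixes are illustrative; use the file names that actually exist under accuracy/):

```bash
# Illustrative only: swap "hash" for "quest" or "loki" in the script name.
bash run_hash_offloading_llama.sh    # CLO + HATA
bash run_quest_offloading_llama.sh   # CLO + Quest
bash run_loki_offloading_llama.sh    # CLO + Loki
```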
We also provide a custom implementation of InfiniGen's accuracy evaluation, because its open-source code lacks support for Llama and Qwen:
cd accuracy
bash run_infinigen_*

For the accuracy ablation study:
cd accuracy
bash run_ablation_*