
CLO: Efficient LLM Inference System with CPU-Light KVCache Offloading via Algorithm-System Co-Design

Install

Dependencies

  • Requirements (a quick version check is sketched after this list)

    • CUDA >= 12.4
    • Transformers == 4.47.1
    • torch == 2.4.0
  • GDRCopy

    Please refer to the official repository of GDRCopy; a typical build-and-load sketch follows this list.

  • Others

    pip install -r requirements.txt
    pip install flashinfer -i https://flashinfer.ai/whl/cu124/torch2.4
    cd 3rdparty
    bash download.sh
    cd ..
    git submodule update --init --recursive
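
Before building, it can help to confirm that the pinned versions are actually installed. The following is a convenience sketch, not a script shipped with this repository:

# Print the installed torch version and its CUDA build, the transformers
# version, and confirm that flashinfer imports cleanly.
python -c "import torch; print(torch.__version__, torch.version.cuda)"
python -c "import transformers; print(transformers.__version__)"
python -c "import flashinfer"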
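
For GDRCopy, a typical build-and-load flow looks like the sketch below; the kernel-module step varies by distribution, so follow the official repository (https://github.com/NVIDIA/gdrcopy) for your platform:

git clone https://github.com/NVIDIA/gdrcopy.git
cd gdrcopy
# Build the library and tools against the local CUDA toolkit
make prefix=/usr/local CUDA=/usr/local/cuda all install
# Load the gdrdrv kernel module (requires root)
sudo ./insmod.sh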

Build

bash install.sh
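
As a smoke test, check that the built extension imports cleanly. The module name clo below is hypothetical; see install.sh for the actual package name:

# Hypothetical module name; replace "clo" with whatever install.sh installs.
python -c "import clo"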

Preparations for Running

cd config
bash run_profile_bandwidth.sh
bash run_profile_prefetch.sh
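
As an optional cross-check of the measured numbers, the bandwidthTest utility from NVIDIA's cuda-samples gives an independent host-device bandwidth reference (it assumes cuda-samples is built separately):

# Independent PCIe bandwidth reference using pinned host memory
# (see https://github.com/NVIDIA/cuda-samples):
./bandwidthTest --memory=pinned --mode=quick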

Performance

End-to-End Performance

cd speedup
bash run_n2n_*.sh

For a latency breakdown, run these scripts under nsys profile, as shown below.
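
For example, a typical Nsight Systems invocation (adjust the trace options and output name as needed):

# Capture CUDA kernel and NVTX activity for one end-to-end run;
# substitute a concrete run_n2n_*.sh script for <script>.
nsys profile --trace=cuda,nvtx -o clo_e2e bash <script>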

Ablation Study

cd speedup
bash run_ablation_*.sh

Accuracy

Build RULER Dataset

Please refer to the official repository of RULER.
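
The official RULER repository is hosted by NVIDIA; clone it and follow its data-preparation scripts to generate the evaluation data:

git clone https://github.com/NVIDIA/RULER.git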

Main Results

cd accuracy
bash run_hash_offloading_*
bash run_fullattn_*
bash run_hash_original_*

Here, hash_offloading, hash_original, and fullattn denote CLO+HATA, the original HATA, and dense attention with the full KVCache in GPU HBM, respectively.

Besides HATA, CLO additionally integrates Quest and Loki as alternative top-k attention algorithms. To use them, replace hash in the above script names with quest or loki, as shown below.
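
For example, to evaluate Quest in place of HATA:

cd accuracy
bash run_quest_offloading_*
bash run_quest_original_*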

We also provide a custom implementation of InfiniGen's accuracy evaluation, because its open-source code lacks support for Llama and Qwen:

cd accuracy
bash run_infinigen_*

Ablation Study

cd accuracy
bash run_ablation_*
