KVCache-Factory is a unified playground for KV cache compression, retrieval, merging, and quantization methods for long-context LLM inference. It started from PyramidKV and now includes multiple KV cache baselines under one evaluation interface.
- 2024-11-28: Renamed the project to KVCache-Factory to reflect the broader goal of supporting diverse KV cache compression methods.
- 2024-06-25: Added multi-GPU inference support for large LLMs, including Llama-3-70B-Instruct.
- 2024-06-10: Added FlashAttention v2 and SDPA paths for PyramidKV, SnapKV, H2O, and StreamingLLM. On GPUs without FlashAttention v2 support, set --attn_implementation sdpa.
| Method | Type | Notes |
|---|---|---|
| FullKV | Baseline | Keeps the full KV cache. |
| StreamingLLM | Compression/eviction | Attention-sink plus sliding-window cache. |
| H2O | Retrieval/compression | Heavy-hitter token retention. |
| SnapKV | Retrieval/compression | Observation-window attention pooling. |
| Quest | Query-aware retrieval | Page-level key min/max metadata and query-aware page/token selection. |
| NACL | Encoding-time eviction | Proxy-token score reduction with optional random eviction. |
| Scissorhands | Persistence-based eviction | Historical importance accumulation with fixed-budget pivotal-token selection. |
| MiniCache | Cross-layer compression | Adjacent-layer SLERP direction sharing with magnitude restore and token retention. |
| PyramidKV | Compression/budget allocation | Layer-wise pyramidal cache budget. |
| CAM | Merge/compression | Cache merging with attention-informed value aggregation. |
| L2Norm | Retrieval/compression | Norm-based token selection. |
| AdaKV | Adaptive compression | Head-adaptive KV cache budgets. |
| HeadKV | Adaptive retrieval/compression | Head-aware retrieval/reasoning cache allocation. |
| ThinK | Key-cache pruning | Query-driven key-channel pruning for Llama LongBench runs. |
| MInference | Sparse prefill acceleration | Optional integration through the MInference dependency. |
| KIVI / KVQuant / GEAR | Quantization | Enabled with --quant_method kivi, --quant_method kvquant, or --quant_method gear. GEAR additionally accepts --rank and --outlier_ratio. |
Llama and Mistral attention paths are supported for the main compression methods. Some newer methods currently have narrower runner/model coverage; check the runner argument choices before launching large jobs.
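To make the "attention-sink plus sliding-window" entry in the table concrete, here is a minimal sketch of which token positions such a policy would keep. It is an illustration only, not the code in this repository: the function name streaming_keep_indices and the num_sink default of 4 are assumptions.

```python
def streaming_keep_indices(seq_len: int, budget: int, num_sink: int = 4) -> list[int]:
    """Return the token positions kept by a sink-plus-window policy.

    Hypothetical helper for illustration; num_sink=4 is an assumed default.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    if seq_len <= budget:
        # Nothing to evict: the whole prompt fits in the budget.
        return list(range(seq_len))
    num_sink = min(num_sink, budget)
    window = budget - num_sink
    sinks = list(range(num_sink))                    # first tokens (attention sinks)
    recent = list(range(seq_len - window, seq_len))  # most recent tokens
    return sinks + recent


kept = streaming_keep_indices(seq_len=10, budget=6, num_sink=2)
```

With a 10-token prompt, a budget of 6, and 2 sink tokens, the kept positions are the first two tokens plus the last four.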

LongBench performance comparison.

Needle-in-a-haystack retrieval results.
The following attention map shows a Llama model attending over a prompt with three documents.
Use examples/visualization.ipynb and the utilities under pyramidkv/viztools/ to reproduce or customize attention visualizations. Generated attention maps are stored under ./attention by default.
transformers==4.44.2
torch
flash-attn>=2.4.0.post1
Install the complete dependency set with requirements.txt. flash-attn is optional when using --attn_implementation sdpa or eager, but required for FlashAttention v2 experiments; install it manually after torch with pip install flash-attn --no-build-isolation. The optional MInference integration (--method minference) is kept out of the base requirements; install it with pip install -r requirements-minference.txt.
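Because flash-attn is optional, a launcher script may want to fall back to SDPA when the package is missing. The sketch below shows one way to do that with the standard library; pick_attn_implementation is a hypothetical helper, not part of this repository, and it only checks that the flash_attn package is importable, not that the GPU supports it.

```python
import importlib.util


def pick_attn_implementation(preferred="flash_attention_2",
                             find_spec=importlib.util.find_spec):
    """Return a value for --attn_implementation (hypothetical helper).

    Falls back to "sdpa" when FlashAttention v2 is requested but the
    flash_attn package cannot be found in the current environment.
    """
    if preferred == "flash_attention_2" and find_spec("flash_attn") is None:
        return "sdpa"
    return preferred


attn = pick_attn_implementation()
print(f"--attn_implementation {attn}")
```

The find_spec parameter exists only so the lookup can be replaced in tests; normal callers can leave it at its default.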
git clone https://github.com/Zefan-Cai/KVCache-Factory.git
cd KVCache-Factory
pip install -r requirements.txt
export PYTHONPATH="$PWD:${PYTHONPATH}"
Edit scripts/scripts_longBench/eval.sh or call run_longbench.py directly:
export CUDA_VISIBLE_DEVICES=0
python3 run_longbench.py \
--method pyramidkv \
--model_path /path/to/Llama-3-8B-Instruct \
--max_capacity_prompts 128 \
--attn_implementation flash_attention_2 \
--save_dir ./results_long_bench \
--use_cache True
The quickstart uses a budget of 128; the PyramidKV paper reports results at budgets of 128 and 2048.
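For intuition about how a per-layer budget such as 128 can be turned into a pyramidal allocation, the sketch below spreads the budget linearly so that lower layers keep more tokens than upper layers while the average stays near the requested value. This is an assumed, simplified schedule: the function pyramid_budgets and its ratio parameter are hypothetical and do not reproduce the exact formula used by this repository or the paper.

```python
def pyramid_budgets(num_layers: int, avg_budget: int, ratio: float = 0.5) -> list[int]:
    """Linearly decreasing per-layer budgets with a mean close to avg_budget.

    ratio is a hypothetical steepness knob: the first layer gets about
    avg_budget * (1 + ratio) and the last layer about avg_budget * (1 - ratio).
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    if not 0 <= ratio < 1:
        raise ValueError("ratio must be in [0, 1)")
    if num_layers == 1:
        return [avg_budget]
    largest = avg_budget * (1 + ratio)
    smallest = avg_budget * (1 - ratio)
    step = (largest - smallest) / (num_layers - 1)
    # Rounding means the total can differ slightly from num_layers * avg_budget.
    return [round(largest - i * step) for i in range(num_layers)]


budgets = pyramid_budgets(num_layers=4, avg_budget=128)
```

For four layers and an average budget of 128, the first layer keeps 192 entries and the last keeps 64, so the total stays close to 4 × 128.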
Common arguments:
- --method: FullKV, pyramidkv, snapkv, streamingllm, h2o, cam, l2norm, adakv, headkv, think, or minference.
- --model_path: local or Hugging Face model path.
- --datasets: comma-separated LongBench datasets to evaluate (e.g. --datasets narrativeqa,qasper); defaults to the full 16-dataset list.
- --attn_implementation: flash_attention_2, sdpa, or eager. --method think requires eager.
- --dtype: float16 (default), bfloat16, or auto.
- --max_capacity_prompts: target KV cache budget per layer. PyramidKV redistributes the total budget across layers.
- --kv_cache_granularity: query_head (default, legacy layout) or kv_head (GQA-efficient layout; supported for snapkv, pyramidkv, h2o, streamingllm, cam, l2norm). See docs/gqa_cache_layout.md.
- --gqa_score_agg: mean (default), max, or sum; how per-query-head scores are aggregated per KV head when --kv_cache_granularity kv_head.
- --merge: optional merge strategy, pivot or weighted.
- --quant_method: optional quantized cache path, kivi, kvquant, or gear.
- --nbits: quantization bit width when --quant_method is set.
- --quant_backend: quantized cache backend, hqq by default.
- --quant_residual_length: full-precision residual cache window; defaults to max_new_tokens.
- --q_group_size, --axis_key, --axis_value: advanced quantization layout controls. KIVI defaults to key axis 1 and value axis 0.
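The --gqa_score_agg option describes reducing per-query-head scores to one score per KV head. A minimal pure-Python sketch of that reduction follows. It assumes that consecutive query heads share a KV head (the usual grouped-query layout); the function aggregate_gqa_scores is hypothetical and operates on nested lists rather than tensors.

```python
from statistics import fmean


def aggregate_gqa_scores(scores, num_kv_heads, mode="mean"):
    """Reduce [num_q_heads][seq_len] scores to [num_kv_heads][seq_len].

    Hypothetical helper: assumes query heads are grouped consecutively,
    so heads [0, g) map to KV head 0, heads [g, 2g) to KV head 1, etc.
    """
    num_q_heads = len(scores)
    if num_kv_heads <= 0 or num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    reducers = {"mean": fmean, "max": max, "sum": sum}
    if mode not in reducers:
        raise ValueError(f"unknown mode: {mode}")
    reduce_fn = reducers[mode]
    group = num_q_heads // num_kv_heads
    out = []
    for kv in range(num_kv_heads):
        heads = scores[kv * group:(kv + 1) * group]
        # zip(*heads) walks token positions; each item holds one score per head.
        out.append([reduce_fn(per_token) for per_token in zip(*heads)])
    return out


q_scores = [[1.0, 2.0], [3.0, 6.0], [0.0, 1.0], [2.0, 3.0]]  # 4 query heads, 2 tokens
kv_scores = aggregate_gqa_scores(q_scores, num_kv_heads=2, mode="mean")
```

Using max instead of mean keeps a token whenever any query head in the group scores it highly, while mean and sum favor tokens that the whole group agrees on.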
The helper script accepts:
bash scripts/scripts_longBench/eval.sh \
0 pyramidkv 128 flash_attention_2 ./ /path/to/model none none 8
Argument order is CUDA_VISIBLE_DEVICES, method, max_capacity_prompts, attn_implementation, source_path, model_path, merge_method, quant_method, nbits.
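Nine positional arguments are easy to misorder. The sketch below pairs each value with the name given above so a wrapper can validate or log a call before launching it; name_eval_args is a hypothetical helper and is not shipped with the repository.

```python
# Positional argument names, in the order documented for eval.sh.
EVAL_ARG_NAMES = (
    "CUDA_VISIBLE_DEVICES",
    "method",
    "max_capacity_prompts",
    "attn_implementation",
    "source_path",
    "model_path",
    "merge_method",
    "quant_method",
    "nbits",
)


def name_eval_args(argv):
    """Map positional eval.sh arguments to their names (hypothetical helper)."""
    if len(argv) != len(EVAL_ARG_NAMES):
        raise ValueError(
            f"expected {len(EVAL_ARG_NAMES)} arguments, got {len(argv)}"
        )
    return dict(zip(EVAL_ARG_NAMES, argv))


args = name_eval_args(
    "0 pyramidkv 128 flash_attention_2 ./ /path/to/model none none 8".split()
)
for name, value in args.items():
    print(f"{name} = {value}")
```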
Edit scripts/scripts_needle/eval.sh or run:
python -u run_needle_in_haystack.py \
--s_len 1000 \
--e_len 8001 \
--model_provider LLaMA3 \
--model_name /path/to/Llama-3-8B-Instruct \
--attn_implementation flash_attention_2 \
--step 100 \
--method pyramidkv \
--max_capacity_prompt 96 \
--model_version Llama3_pyramidkv_96_test
Supported --method values for this runner are full, pyramidkv, snapkv, streamingllm, h2o, and cam.
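To estimate how many context lengths the example command sweeps, the sketch below enumerates them under the assumption that --s_len is inclusive, --e_len is exclusive, and --step is the increment. That reading is an assumption based on the 1000/8001/100 values, not a statement about the runner's internals; sweep_lengths is a hypothetical helper.

```python
def sweep_lengths(s_len: int, e_len: int, step: int) -> list[int]:
    """Context lengths from s_len up to (not including) e_len.

    Hypothetical helper; assumes range-style semantics for the sweep.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    return list(range(s_len, e_len, step))


lengths = sweep_lengths(1000, 8001, 100)
print(len(lengths), lengths[0], lengths[-1])
```

Under that assumption the example covers 71 context lengths, from 1000 to 8000 tokens.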
After inference, update FOLDER_PATH in scripts/scripts_needle/visualize.py, then run:
python scripts/scripts_needle/visualize.py
Edit scripts/scripts_ruler/eval.sh or call run_ruler.py directly. The RULER path shares the same core arguments as LongBench and supports snapkv, pyramidkv, h2o, cam, l2norm, streamingllm, plus optional quantized cache runs.
To compare decoding latency and peak memory for one prompt:
python scripts/benchmark_latency_memory.py \
--model_path /path/to/model \
--method pyramidkv \
--attn_implementation flash_attention_2 \
--max_capacity_prompt 512 \
--max_new_tokens 256 \
--repeat 3
- Llama-3 LongBench runs now apply the official LongBench chat template (<|begin_of_text|>...<|eot_id|> user/assistant wrap on non-few-shot datasets) and stop on both Llama-3 terminators (<|eot_id|> and <|end_of_text|>). Earlier revisions used a single EOS id and no Llama-3 chat wrap, which depressed scores (issue #46); scores from earlier revisions are not directly comparable.
- transformers is pinned to 4.44.2; the attention monkeypatches are version-sensitive.
- KV quantization (--quant_method) and KV merging (--merge) are OFF by default; the helper scripts only enable them when the corresponding arguments are set to something other than none.
- Each LongBench run writes a run_meta.json (git commit, full argument list, library versions, timestamp) into the per-model results directory.
- --eval_batch_size must stay at 1; batching is not supported yet and larger values are rejected at startup.
- LongBench and RULER use 7500 as the default Llama-3 prompt truncation length, matching the LongBench safety margin for 8K-context Llama-3 models.
- Needle-in-a-haystack context files are loaded in sorted order so runs do not depend on filesystem-specific glob ordering.
- The monkey-patched generation path resets per-layer kv_seq_len whenever a new empty cache is prepared. This prevents stale sequence-length state from leaking across independent generate() calls on the same model instance.
- The Mistral CAM monkeypatch patches Mistral attention classes directly; it no longer redirects CAM to Llama attention classes.
- Per-cluster max_capacity_prompt diagnostics are silent by default. Set KVCACHE_FACTORY_DEBUG=1 to print them; leaving it unset keeps benchmark stdout clean so predictions are easy to inspect.
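The 7500-token truncation length in the notes above can be illustrated with the middle-truncation scheme used by the reference LongBench prediction script, which keeps the head and tail of an over-long prompt and drops the middle. Whether this repository truncates in exactly this way is an assumption; truncate_middle is a hypothetical helper that works on a plain list of token ids.

```python
def truncate_middle(token_ids, max_length=7500):
    """Keep the first and last max_length // 2 tokens of an over-long prompt.

    Hypothetical helper sketching LongBench-style middle truncation.
    """
    if max_length < 2:
        raise ValueError("max_length must be at least 2")
    if len(token_ids) <= max_length:
        return list(token_ids)
    half = max_length // 2
    # Instructions usually sit at the start and the question at the end,
    # so the middle of the context is what gets dropped.
    return list(token_ids[:half]) + list(token_ids[-half:])


prompt = list(range(10_000))        # stand-in for tokenized prompt ids
truncated = truncate_middle(prompt)
```

A 10,000-token prompt is cut to 7500 tokens: the first 3750 and the last 3750.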
- Support StreamingLLM, H2O, SnapKV, and PyramidKV.
- Support Mistral models.
- Support Needle-in-a-haystack evaluation.
- Support SDPA cache compression for GPUs without FlashAttention v2.
- Support multi-GPU inference for 70B Llama-3.
- Add cache quantization options.
- Add explicit KIVI-style asymmetric quantized cache configuration.
- Add KV merge options.
- Add KVMerger-style weighted nearest-neighbor merge and merge-shape tests.
- Add a tested Quest-style query-aware page/token selector contract.
- Add a tested NACL-style proxy/random eviction selector contract.
- Add a tested Scissorhands-style persistence selector contract.
- Add a tested MiniCache-style cross-layer merge/restore contract.
- Add more representative high-citation/high-star KV cache algorithms.
- Support Mixtral.
- Support batch inference.
- Wire more decode-stage KV cache compression methods into runtime attention hot paths.
- Port the algorithm suite to nano-vllm and mini-sglang runtimes.
If you find PyramidKV or this project useful, please cite:
@article{cai2024pyramidkv,
title={Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling},
author={Cai, Zefan and Zhang, Yichi and Gao, Bofei and Liu, Yuliang and Liu, Tianyu and Lu, Keming and Xiong, Wayne and Dong, Yue and Chang, Baobao and Hu, Junjie and Xiao, Wen},
journal={arXiv preprint arXiv:2406.02069},
year={2024}
}
@article{fu2024not,
title={Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning},
author={Fu, Yu and Cai, Zefan and Asi, Abedelkadir and Xiong, Wayne and Dong, Yue and Xiao, Wen},
journal={arXiv preprint arXiv:2410.19258},
year={2024}
}
Thanks to SnapKV, H2O, StreamingLLM, Quest, NACL, Scissorhands, MiniCache, AdaKV, and related open-source KV cache projects for making this research area easier to build on.


