KVCache-Factory

KVCache-Factory is a unified playground for KV cache compression, retrieval, merging, and quantization methods for long-context LLM inference. It started from PyramidKV and now includes multiple KV cache baselines under one evaluation interface.

News

2024-11-28: Renamed the project to KVCache-Factory to reflect the broader goal of supporting diverse KV cache compression methods.
2024-06-25: Added multi-GPU inference support for large LLMs, including Llama-3-70B-Instruct.
2024-06-10: Added FlashAttention v2 and SDPA paths for PyramidKV, SnapKV, H2O, and StreamingLLM. On GPUs without FlashAttention v2 support, set --attn_implementation sdpa.

Supported Methods

Method	Type	Notes
`FullKV`	Baseline	Keeps the full KV cache.
`StreamingLLM`	Compression/eviction	Attention-sink plus sliding-window cache.
`H2O`	Retrieval/compression	Heavy-hitter token retention.
`SnapKV`	Retrieval/compression	Observation-window attention pooling.
`Quest`	Query-aware retrieval	Page-level key min/max metadata and query-aware page/token selection.
`NACL`	Encoding-time eviction	Proxy-token score reduction with optional random eviction.
`Scissorhands`	Persistence-based eviction	Historical importance accumulation with fixed-budget pivotal-token selection.
`MiniCache`	Cross-layer compression	Adjacent-layer SLERP direction sharing with magnitude restore and token retention.
`PyramidKV`	Compression/budget allocation	Layer-wise pyramidal cache budget.
`CAM`	Merge/compression	Cache merging with attention-informed value aggregation.
`L2Norm`	Retrieval/compression	Norm-based token selection.
`AdaKV`	Adaptive compression	Head-adaptive KV cache budgets.
`HeadKV`	Adaptive retrieval/compression	Head-aware retrieval/reasoning cache allocation.
`ThinK`	Key-cache pruning	Query-driven key-channel pruning for Llama LongBench runs.
`MInference`	Sparse prefill acceleration	Optional integration through the MInference dependency.
`KIVI` / `KVQuant` / `GEAR`	Quantization	Enabled with `--quant_method kivi`, `--quant_method kvquant`, or `--quant_method gear`. GEAR additionally accepts `--rank` and `--outlier_ratio`.

Llama and Mistral attention paths are supported for the main compression methods. Some newer methods currently have narrower runner/model coverage; check the runner argument choices before launching large jobs.
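
To make the StreamingLLM row in the table above concrete, here is a minimal sketch of an attention-sink plus sliding-window retention rule. It only illustrates the idea; the function name and the `num_sink` and `window` parameters are hypothetical and do not come from this repository's code.

```python
from typing import List


def streaming_keep_indices(seq_len: int, num_sink: int, window: int) -> List[int]:
    """Return the cache positions kept by a sink-plus-window policy.

    Illustrative only: `num_sink` and `window` are assumed parameter names.
    """
    if min(seq_len, num_sink, window) < 0:
        raise ValueError("seq_len, num_sink, and window must be non-negative")
    # Nothing needs to be evicted while the sequence fits in the budget.
    if seq_len <= num_sink + window:
        return list(range(seq_len))
    sinks = list(range(num_sink))                     # earliest tokens (attention sinks)
    recent = list(range(seq_len - window, seq_len))   # most recent tokens
    return sinks + recent


print(streaming_keep_indices(seq_len=12, num_sink=2, window=4))
# -> [0, 1, 8, 9, 10, 11]
```

Everything between the sink tokens and the recent window is dropped, so the cache size stays at `num_sink + window` no matter how long the sequence grows.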

Results

Figure: LongBench performance comparison.

Figure: Needle-in-a-haystack retrieval results.

PyramidKV Overview

Visualization: Inefficient Attention

Figure: attention map of a Llama model attending over a prompt that contains three documents.

Use examples/visualization.ipynb and the utilities under pyramidkv/viztools/ to reproduce or customize attention visualizations. Generated attention maps are stored under ./attention by default.

Requirements

transformers==4.44.2
torch
flash-attn>=2.4.0.post1

Install the complete dependency set with pip install -r requirements.txt. flash-attn is optional when --attn_implementation is sdpa or eager, but FlashAttention v2 experiments require it; install it manually after torch with pip install flash-attn --no-build-isolation. The optional MInference integration (--method minference) is kept out of the base requirements; install it with pip install -r requirements-minference.txt.
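
The choice between flash_attention_2 and sdpa can be scripted. The sketch below is an assumption-laden helper, not part of this repository: `pick_attn_implementation` is a hypothetical name, and the threshold of compute capability 8.0 reflects the Ampere-or-newer requirement stated in the upstream FlashAttention-2 documentation, as far as I know.

```python
import importlib.util
from typing import Optional


def pick_attn_implementation(
    compute_capability_major: int,
    flash_attn_installed: Optional[bool] = None,
) -> str:
    """Suggest a value for --attn_implementation (hypothetical helper).

    Assumes FlashAttention v2 needs compute capability >= 8.0.
    """
    if flash_attn_installed is None:
        # Detect the optional dependency without importing it.
        flash_attn_installed = importlib.util.find_spec("flash_attn") is not None
    if flash_attn_installed and compute_capability_major >= 8:
        return "flash_attention_2"
    # sdpa ships with torch and needs no extra package.
    return "sdpa"


print(pick_attn_implementation(compute_capability_major=7, flash_attn_installed=True))
# -> sdpa
```

With torch installed, the major version can be read from `torch.cuda.get_device_capability()`, which returns a `(major, minor)` tuple.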

Installation

git clone https://github.com/Zefan-Cai/KVCache-Factory.git
cd KVCache-Factory
pip install -r requirements.txt
export PYTHONPATH="$PWD:${PYTHONPATH}"

LongBench

Edit scripts/scripts_longBench/eval.sh or call run_longbench.py directly:

export CUDA_VISIBLE_DEVICES=0

python3 run_longbench.py \
  --method pyramidkv \
  --model_path /path/to/Llama-3-8B-Instruct \
  --max_capacity_prompts 128 \
  --attn_implementation flash_attention_2 \
  --save_dir ./results_long_bench \
  --use_cache True

The quickstart uses a budget of 128; the PyramidKV paper reports results at budgets of 128 and 2048.
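
To show what a pyramidal budget means for a setting such as 128, the sketch below spreads an average per-layer budget linearly across layers, giving lower layers more cache than upper ones. This is a simplified illustration under stated assumptions, not the repository's allocation formula; `pyramid_budgets` and its `ratio` parameter (bottom-layer budget divided by top-layer budget) are hypothetical.

```python
from typing import List


def pyramid_budgets(avg_budget: int, num_layers: int, ratio: float = 3.0) -> List[int]:
    """Linearly decreasing per-layer budgets whose mean is about `avg_budget`.

    Illustrative only: the real method may use a different schedule.
    """
    if num_layers < 1 or ratio < 1.0:
        raise ValueError("need num_layers >= 1 and ratio >= 1")
    if num_layers == 1:
        return [avg_budget]
    # Solve (top + bottom) / 2 == avg_budget with bottom == ratio * top.
    top = 2 * avg_budget / (1 + ratio)
    bottom = ratio * top
    return [
        round(bottom + (top - bottom) * i / (num_layers - 1))
        for i in range(num_layers)
    ]


budgets = pyramid_budgets(avg_budget=128, num_layers=4)
print(budgets, sum(budgets))
# -> [192, 149, 107, 64] 512
```

The total across layers stays close to `avg_budget * num_layers`, so the overall memory footprint matches a uniform per-layer budget while the distribution changes.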

Common arguments:

--method: FullKV, pyramidkv, snapkv, streamingllm, h2o, cam, l2norm, adakv, headkv, think, or minference.
--model_path: local or Hugging Face model path.
--datasets: comma-separated LongBench datasets to evaluate (e.g. --datasets narrativeqa,qasper); defaults to the full 16-dataset list.
--attn_implementation: flash_attention_2, sdpa, or eager. --method think requires eager.
--dtype: float16 (default), bfloat16, or auto.
--max_capacity_prompts: target KV cache budget per layer. PyramidKV redistributes the total budget across layers.
--kv_cache_granularity: query_head (default, legacy layout) or kv_head (GQA-efficient layout; supported for snapkv, pyramidkv, h2o, streamingllm, cam, l2norm). See docs/gqa_cache_layout.md.
--gqa_score_agg: mean (default), max, or sum; how per-query-head scores are aggregated per KV head when --kv_cache_granularity kv_head.
--merge: optional merge strategy, pivot or weighted.
--quant_method: optional quantized cache path, kivi, kvquant, or gear.
--nbits: quantization bit width when --quant_method is set.
--quant_backend: quantized cache backend, hqq by default.
--quant_residual_length: full-precision residual cache window; defaults to max_new_tokens.
--q_group_size, --axis_key, --axis_value: advanced quantization layout controls. KIVI defaults to key axis 1 and value axis 0.
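
The --gqa_score_agg options above can be pictured with a small pure-Python sketch. It assumes the usual grouped-query layout in which consecutive query heads share one KV head; the function name and the nested-list score layout are illustrative assumptions, not the repository's tensors.

```python
from typing import Callable, Dict, List, Sequence


def aggregate_gqa_scores(
    scores: List[List[float]], num_kv_heads: int, how: str = "mean"
) -> List[List[float]]:
    """Reduce per-query-head token scores to one score row per KV head.

    `scores` has shape [num_query_heads][seq_len] (assumed layout).
    """
    reducers: Dict[str, Callable[[Sequence[float]], float]] = {
        "mean": lambda xs: sum(xs) / len(xs),
        "max": max,
        "sum": sum,
    }
    if how not in reducers:
        raise ValueError(f"unknown aggregation: {how}")
    if num_kv_heads < 1 or len(scores) % num_kv_heads != 0:
        raise ValueError("query heads must divide evenly across KV heads")
    group = len(scores) // num_kv_heads
    out = []
    for kv in range(num_kv_heads):
        heads = scores[kv * group:(kv + 1) * group]  # query heads sharing this KV head
        # zip(*heads) walks token positions; reduce across the group at each one.
        out.append([reducers[how](column) for column in zip(*heads)])
    return out


scores = [[1.0, 4.0], [3.0, 2.0], [0.0, 0.0], [2.0, 6.0]]  # 4 query heads, 2 tokens
print(aggregate_gqa_scores(scores, num_kv_heads=2, how="max"))
# -> [[3.0, 4.0], [2.0, 6.0]]
```

With one score row per KV head, a single token selection can be shared by every query head in the group, which is what lets the kv_head layout avoid storing a separate cache per query head.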

The helper script takes nine positional arguments:

bash scripts/scripts_longBench/eval.sh \
  0 pyramidkv 128 flash_attention_2 ./ /path/to/model none none 8

Argument order is CUDA_VISIBLE_DEVICES, method, max_capacity_prompts, attn_implementation, source_path, model_path, merge_method, quant_method, nbits.
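
Because the helper script is purely positional, a swapped argument fails silently. The sketch below builds the command from named values in the order listed above; `build_eval_command` and the lowercase field names are hypothetical conveniences, not part of the repository.

```python
import shlex
from typing import List

# Positional order documented for scripts/scripts_longBench/eval.sh.
EVAL_ARG_ORDER = (
    "cuda_visible_devices", "method", "max_capacity_prompts",
    "attn_implementation", "source_path", "model_path",
    "merge_method", "quant_method", "nbits",
)


def build_eval_command(
    script: str = "scripts/scripts_longBench/eval.sh", **values: object
) -> List[str]:
    """Return an argv list with every value placed in its documented slot."""
    missing = [name for name in EVAL_ARG_ORDER if name not in values]
    unknown = [name for name in values if name not in EVAL_ARG_ORDER]
    if missing or unknown:
        raise ValueError(f"missing={missing} unknown={unknown}")
    return ["bash", script] + [str(values[name]) for name in EVAL_ARG_ORDER]


cmd = build_eval_command(
    cuda_visible_devices=0, method="pyramidkv", max_capacity_prompts=128,
    attn_implementation="flash_attention_2", source_path="./",
    model_path="/path/to/model", merge_method="none", quant_method="none", nbits=8,
)
print(shlex.join(cmd))
# -> bash scripts/scripts_longBench/eval.sh 0 pyramidkv 128 flash_attention_2 ./ /path/to/model none none 8
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.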

Needle In A Haystack

Edit scripts/scripts_needle/eval.sh or run:

python -u run_needle_in_haystack.py \
  --s_len 1000 \
  --e_len 8001 \
  --model_provider LLaMA3 \
  --model_name /path/to/Llama-3-8B-Instruct \
  --attn_implementation flash_attention_2 \
  --step 100 \
  --method pyramidkv \
  --max_capacity_prompt 96 \
  --model_version Llama3_pyramidkv_96_test

Supported --method values for this runner are full, pyramidkv, snapkv, streamingllm, h2o, and cam.
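
The --s_len, --e_len, and --step values in the command above define a sweep of context lengths. The sketch below assumes the runner treats them as a half-open range, which would explain the end value of 8001 (it makes 8000 the last tested length); that interpretation is an assumption, and `needle_context_lengths` is a hypothetical name.

```python
from typing import List


def needle_context_lengths(s_len: int, e_len: int, step: int) -> List[int]:
    """Context lengths from s_len up to, but not including, e_len."""
    if step <= 0:
        raise ValueError("step must be positive")
    return list(range(s_len, e_len, step))


lengths = needle_context_lengths(s_len=1000, e_len=8001, step=100)
print(len(lengths), lengths[0], lengths[-1])
# -> 71 1000 8000
```

Under this reading the example command evaluates 71 context lengths, so raising --step is the quickest way to shorten a trial run.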

After inference, update FOLDER_PATH in scripts/scripts_needle/visualize.py, then run:

python scripts/scripts_needle/visualize.py

RULER

Edit scripts/scripts_ruler/eval.sh or call run_ruler.py directly. The RULER path shares the same core arguments as LongBench and supports snapkv, pyramidkv, h2o, cam, l2norm, streamingllm, plus optional quantized cache runs.

Latency And Memory Benchmark

To compare decoding latency and peak memory for one prompt:

python scripts/benchmark_latency_memory.py \
  --model_path /path/to/model \
  --method pyramidkv \
  --attn_implementation flash_attention_2 \
  --max_capacity_prompt 512 \
  --max_new_tokens 256 \
  --repeat 3
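
The measurement pattern behind a repeat-based benchmark like this one can be sketched with the standard library alone. The code below is a generic illustration, not the script's implementation: it times an arbitrary callable and tracks peak host (Python heap) memory with `tracemalloc`, which does not see GPU memory.

```python
import statistics
import time
import tracemalloc
from typing import Callable, Dict


def benchmark(fn: Callable[[], object], repeat: int = 3) -> Dict[str, float]:
    """Run `fn` `repeat` times; report latency and peak traced host memory."""
    if repeat < 1:
        raise ValueError("repeat must be at least 1")
    latencies = []
    peak_bytes = 0
    for _ in range(repeat):
        tracemalloc.start()  # note: tracing adds overhead to the timing
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        peak_bytes = max(peak_bytes, peak)
    return {
        "mean_latency_s": statistics.mean(latencies),
        "min_latency_s": min(latencies),
        "peak_host_bytes": float(peak_bytes),
    }


print(benchmark(lambda: [0] * 100_000, repeat=3))
```

For GPU runs the same loop would typically call `torch.cuda.reset_peak_memory_stats()` before each repetition, `torch.cuda.synchronize()` before reading the clock, and `torch.cuda.max_memory_allocated()` afterwards.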

Reproducibility Notes

Llama-3 LongBench runs now apply the official LongBench chat template (<|begin_of_text|>...<|eot_id|> user/assistant wrap on non-few-shot datasets) and stop on both Llama-3 terminators (<|eot_id|> and <|end_of_text|>). Earlier revisions used a single EOS id and no Llama-3 chat wrap, which depressed scores (issue #46); scores from earlier revisions are not directly comparable.
transformers is pinned to 4.44.2; the attention monkeypatches are version-sensitive.
KV quantization (--quant_method) and KV merging (--merge) are OFF by default; the helper scripts only enable them when the corresponding arguments are set to something other than none.
Each LongBench run writes a run_meta.json (git commit, full argument list, library versions, timestamp) into the per-model results directory.
--eval_batch_size must stay at 1; batching is not supported yet and larger values are rejected at startup.
LongBench and RULER use 7500 as the default Llama-3 prompt truncation length, matching the LongBench safety margin for 8K-context Llama-3 models.
Needle-in-a-haystack context files are loaded in sorted order so runs do not depend on filesystem-specific glob ordering.
The monkey-patched generation path resets per-layer kv_seq_len whenever a new empty cache is prepared. This prevents stale sequence-length state from leaking across independent generate() calls on the same model instance.
The Mistral CAM monkeypatch patches Mistral attention classes directly; it no longer redirects CAM to Llama attention classes.
Per-cluster max_capacity_prompt diagnostics are silent by default. Set KVCACHE_FACTORY_DEBUG=1 to print them; leaving it unset keeps benchmark stdout clean so predictions are easy to inspect.
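
As an illustration of the run_meta.json note above, the sketch below records the same kinds of fields (git commit, arguments, library versions, timestamp). The key names and the `write_run_meta` helper are assumptions made for this example; the file the runner actually writes may be laid out differently.

```python
import json
import subprocess
from datetime import datetime, timezone
from importlib import metadata
from pathlib import Path
from typing import Dict, Optional, Sequence


def _git_commit() -> Optional[str]:
    """Current commit hash, or None outside a git checkout."""
    try:
        out = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True, stderr=subprocess.DEVNULL
        )
        return out.strip()
    except (OSError, subprocess.CalledProcessError):
        return None


def _version(package: str) -> Optional[str]:
    """Installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def write_run_meta(
    results_dir: str,
    args: Dict[str, object],
    packages: Sequence[str] = ("torch", "transformers"),
) -> Path:
    """Write run_meta.json (hypothetical schema) into `results_dir`."""
    meta = {
        "git_commit": _git_commit(),
        "args": args,
        "library_versions": {name: _version(name) for name in packages},
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    path = Path(results_dir) / "run_meta.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(meta, indent=2, sort_keys=True))
    return path
```

Calling `write_run_meta("./results_long_bench/my_model", {"method": "pyramidkv", "max_capacity_prompts": 128})` would leave a small JSON file next to the predictions, which makes it easier to tell later which revision and settings produced a given score.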

Roadmap

Citation

If you find PyramidKV or this project useful, please cite:

@article{cai2024pyramidkv,
  title={Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling},
  author={Cai, Zefan and Zhang, Yichi and Gao, Bofei and Liu, Yuliang and Liu, Tianyu and Lu, Keming and Xiong, Wayne and Dong, Yue and Chang, Baobao and Hu, Junjie and Xiao, Wen},
  journal={arXiv preprint arXiv:2406.02069},
  year={2024}
}

@article{fu2024not,
  title={Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning},
  author={Fu, Yu and Cai, Zefan and Asi, Abedelkadir and Xiong, Wayne and Dong, Yue and Xiao, Wen},
  journal={arXiv preprint arXiv:2410.19258},
  year={2024}
}

Acknowledgement

Thanks to SnapKV, H2O, StreamingLLM, Quest, NACL, Scissorhands, MiniCache, AdaKV, and related open-source KV cache projects for making this research area easier to build on.
