Self-Speculative Decoding

Code associated with the paper:

Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding

Self-Speculative Decoding is a novel inference scheme for accelerating Large Language Models (LLMs) that requires neither additional neural network training nor extra memory footprint. It leaves output quality unchanged and works with existing models as they are, making it a plug-and-play, cost-effective solution for LLM inference acceleration.

Self-Speculative Decoding involves a two-stage process:

Drafting stage: Generates draft tokens by selectively skipping certain intermediate layers.

Verification stage: Employs the original LLM to validate draft tokens in one forward pass.
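
The two stages can be illustrated with a minimal, self-contained sketch. This is not the code in decoding.py: the "model" here is a toy deterministic function, and the names `next_token`, `MODULI`, `draft_len`, and the skipped layer indices are all hypothetical. It covers greedy decoding only and omits the extra token that a real verification pass can yield when every draft token is accepted. It is meant only to show why the output matches plain decoding with the full model.

```python
from typing import AbstractSet, List, Sequence, Tuple

VOCAB_SIZE = 50
# Hypothetical toy "layers": layer i adds (h0 * (i + 2)) % MODULI[i] to the
# hidden state. Layers 1, 3 and 5 contribute little, so skipping them
# changes the predicted token only occasionally.
MODULI = [97, 4, 89, 4, 83, 4, 79, 73]


def next_token(context: Sequence[int], skip: AbstractSet[int] = frozenset()) -> int:
    """Greedy next token of the toy model; layers listed in `skip` are bypassed."""
    h0 = sum((pos + 1) * tok for pos, tok in enumerate(context)) + 7
    h = h0
    for i, m in enumerate(MODULI):
        if i in skip:
            continue  # drafting: this layer is skipped
        h += (h0 * (i + 2)) % m
    return (h // 16) % VOCAB_SIZE


def greedy_decode(prompt: Sequence[int], max_new_tokens: int) -> List[int]:
    """Baseline: one full-model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens[len(prompt):]


def self_speculative_decode(
    prompt: Sequence[int],
    max_new_tokens: int,
    skip: AbstractSet[int],
    draft_len: int = 4,
) -> Tuple[List[int], int]:
    """Draft with skipped layers, then verify with the full model.

    Returns the generated tokens and the number of verification rounds.
    """
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    verify_rounds = 0
    while len(tokens) < target_len:
        # Drafting stage: generate up to `draft_len` tokens with layers skipped.
        k = min(draft_len, target_len - len(tokens))
        draft: List[int] = []
        for _ in range(k):
            draft.append(next_token(tokens + draft, skip))

        # Verification stage: the full model checks every draft position.
        # A real transformer scores all positions in one forward pass;
        # the toy model needs a loop to do the same thing.
        verify_rounds += 1
        accepted: List[int] = []
        for i in range(k):
            target = next_token(tokens + draft[:i])
            accepted.append(target)  # equals draft[i] on a match
            if target != draft[i]:
                break  # first mismatch: keep the correction, drop the rest
        tokens.extend(accepted)
    return tokens[len(prompt):], verify_rounds


if __name__ == "__main__":
    prompt = [3, 1, 4, 1, 5]
    reference = greedy_decode(prompt, 40)
    output, rounds = self_speculative_decode(prompt, 40, skip={1, 3, 5})
    print("identical to full-model decoding:", output == reference)
    print("verification rounds:", rounds, "for", len(output), "tokens")
```

Every accepted token is one the full model would have produced at that position, so the result is identical to the baseline no matter which layers are skipped. The choice of skipped layers affects only how many draft tokens survive verification, and therefore the speed.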

Cite Our Paper

If you find this code and paper useful in your research, please consider citing:

@article{zhang2023draft,
      title={Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding}, 
      author={Jun Zhang and Jue Wang and Huan Li and Lidan Shou and Ke Chen and Gang Chen and Sharad Mehrotra},
      year={2023},
      eprint={2309.08168},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Requirements

PyTorch
Transformers
NumPy
See ssd.yml for the full list

Files

searching.py: Selection of skipped layers by Bayesian optimization
decoding.py: Core process of self-speculative decoding
modeling_llama.py: Model structure with self-speculative decoding
search.ipynb: Main script that searches for the layers to skip
evaluate_sum.ipynb: Main script that evaluates self-speculative decoding on a text generation (summarization) task
evaluate_code.ipynb: Main script that evaluates self-speculative decoding on a code generation task
skip_layers.json: Layers skipped by draft models corresponding to different base models
ssd.yml: Relevant environment

Usage

Set up the environment according to ssd.yml;
Execute search.ipynb to find the layers to skip, which define the draft model;
Execute evaluate_sum.ipynb to evaluate self-speculative decoding on summarization;
Execute evaluate_code.ipynb to evaluate self-speculative decoding on code generation.
