# TensorRT-LLM Speculative Decoding 
### Boost AI Inference Throughput by Up to 3.6x

In this notebook, you'll learn how to use NVIDIA's [TensorRT-LLM](https://developer.nvidia.com/tensorrt#section-inference-for-llms) to boost inference throughput with speculative decoding. TensorRT-LLM is an open-source library that provides high-performance inference for many popular large language models [(LLMs)](https://www.nvidia.com/en-us/glossary/large-language-models/) on NVIDIA GPUs. With support for speculative decoding on a single GPU and on single-node multi-GPU systems, the library extends its set of optimizations for generative AI applications.

Speculative decoding, also referred to as [speculative sampling](https://arxiv.org/abs/2302.01318), pays a small additional compute cost to have a smaller draft model speculatively generate the next several tokens. The target model then checks those tokens in a built-in verification step, which preserves the quality of the generated output while increasing throughput.
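
The draft-then-verify loop described above can be sketched in plain Python. This is a minimal illustration, not TensorRT-LLM's implementation: the two "models" are hypothetical toy functions that map a token list to the next token, and acceptance uses simple greedy token matching rather than the sampling-based acceptance rule from the cited paper. All function names here are assumptions made for the example.

```python
from typing import Callable, List, Tuple

# A "model" here is any function that returns the greedy next token for a context.
NextTokenFn = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft_next: NextTokenFn,
                     target_next: NextTokenFn, draft_len: int) -> List[int]:
    """Run one draft-then-verify step and return the tokens it produces."""
    # 1) The draft model proposes draft_len tokens autoregressively.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(draft_len):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2) The target model verifies the proposals. A real engine scores all
    #    positions in one forward pass; this sketch calls it per position.
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        target_tok = target_next(ctx)
        if target_tok != tok:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(target_tok)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3) Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(ctx))
    return accepted


def generate(prompt: List[int], draft_next: NextTokenFn, target_next: NextTokenFn,
             draft_len: int, max_new_tokens: int) -> Tuple[List[int], int]:
    """Generate tokens speculatively; return (tokens, number of verify steps)."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, draft_next, target_next, draft_len))
        steps += 1
    return out[:len(prompt) + max_new_tokens], steps


def toy_target(ctx: List[int]) -> int:
    """Hypothetical target model: always continues the count."""
    return ctx[-1] + 1


def toy_draft(ctx: List[int]) -> int:
    """Hypothetical draft model: agrees with the target except after every third token."""
    return 0 if ctx[-1] % 3 == 2 else ctx[-1] + 1


tokens, steps = generate([0], toy_draft, toy_target, draft_len=4, max_new_tokens=12)
print(tokens, steps)
```

Because every emitted token is either confirmed or supplied by the target model, the output matches what the target would produce on its own with greedy decoding, while the number of target verification steps is smaller than the number of generated tokens.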


<div style="text-align: center; font-size: 16px;">
    <p><b>Figure 1. Speculative decoding algorithm</b></p>
    <img src="Draft Target Speculative Decoding.jpg" width="600" height="500" align="center"/>
</div>

<div style="text-align: center; font-size: 16px;">
    <p><b>Figure 2. Throughput Speedups with Llama 3.1 405B Target and Different Draft Models</b></p>
    <img src="Speculative decoding throughput speedups.jpg" width="500" height="400" align="center"/>
</div>
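
As rough intuition for why speedups like those in Figure 2 depend on the draft model, the sketch below estimates how many tokens one target verification pass yields on average. It assumes a simplified model that is not taken from this notebook: each draft token is accepted independently with the same probability, and the target always contributes one token per pass. The function name and the sample acceptance rates are hypothetical, and the result says nothing about measured throughput, which also depends on the cost of running the draft model.

```python
def expected_tokens_per_target_pass(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens per target pass under an independent-acceptance assumption.

    Equals 1 + a + a^2 + ... + a^draft_len, where a is the acceptance rate.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    return sum(acceptance_rate ** i for i in range(draft_len + 1))


# Illustrative acceptance rates only; these are not measurements.
for rate in (0.5, 0.7, 0.9):
    print(rate, round(expected_tokens_per_target_pass(rate, draft_len=10), 2))
```

Under this assumption, a draft model that agrees with the target more often produces more tokens per verification pass, and longer drafts give diminishing returns when the acceptance rate is low.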

[TensorRT-LLM speculative decoding in action](https://gitlab-master.nvidia.com/anjshah/llm_inference/-/blob/main/speculative_decoding_trt-llm.mp4)

In [6]:
%matplotlib notebook
from IPython.display import Video
Video('./speculative_decoding_trt-llm.mp4', width=800, height=500, embed=True)

### Steps to run speculative decoding in TensorRT-LLM

Complete the following steps before launching this notebook on a Linux machine. They start the required Docker container and install the libraries that TensorRT-LLM needs. The same steps are described in the [installation guide](https://nvidia.github.io/TensorRT-LLM/installation/linux.html).

- docker run --rm -it --ipc=host --net=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all --volume ${PWD}:/workspace --workdir /workspace nvidia/cuda:12.4.1-devel-ubuntu22.04

- apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev git git-lfs

- pip install jupyterlab

#### Install TensorRT-LLM

In [None]:
!pip install -q ipywidgets
!pip install tensorrt_llm -U -q --extra-index-url https://pypi.nvidia.com

!wget https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/examples/run.py -P .
!wget https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/examples/utils.py -P .
!wget https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/examples/quantization/quantize.py -P .


#### Download draft and target models

In [None]:
# Download target model
!git clone https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct

# Download draft models
!git clone https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct
!git clone https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct
!git clone https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct

#### Quantize models

In [None]:
# Create FP8 checkpoints
# quantize.py was downloaded to the current directory in the install step.

# Draft model checkpoint
!python3 quantize.py \
    --model_dir <path to draft model repo> \
    --output_dir ./ckpt-draft \
    --dtype float16 --qformat fp8 --kv_cache_dtype fp8 \
    --calib_size 512 --tp_size 4

# Target model checkpoint
!python3 quantize.py \
    --model_dir <path to target model repo> \
    --output_dir ./ckpt-target-405b \
    --dtype float16 --qformat fp8 --kv_cache_dtype fp8 \
    --calib_size 512 --tp_size 4


#### Build engines

In [None]:
# Build draft and target engines
# Important flags for the engine build process:
# --use_paged_context_fmha=enable must be specified for both engines, since KV cache reuse is needed for the draft and target models.
# --speculative_decoding_mode=draft_tokens_external and --max_draft_len must be specified for the target engine.

!trtllm-build \
    --checkpoint_dir ./ckpt-draft \
    --output_dir=./draft-engine \
    --gpt_attention_plugin float16 \
    --workers 4 \
    --gemm_plugin=fp8 \
    --reduce_fusion disable \
    --use_paged_context_fmha=enable \
    --use_fused_mlp enable \
    --multiple_profiles enable \
    --max_batch_size=32 \
    --max_num_tokens=8192 \
    --max_seq_len=131072

!trtllm-build \
    --checkpoint_dir=./ckpt-target-405b \
    --output_dir=./target-engine \
    --gpt_attention_plugin float16 \
    --workers 4 \
    --gemm_plugin=fp8 \
    --use_paged_context_fmha=enable \
    --use_fused_mlp enable \
    --multiple_profiles enable \
    --max_batch_size=32 \
    --max_num_tokens=8192 \
    --max_seq_len=131072 \
    --low_latency_gemm_plugin fp8 \
    --speculative_decoding_mode=draft_tokens_external \
    --max_draft_len 10



#### Run speculative decoding

In [None]:
# Run decoding

# Important flags to set for the run:
# --draft_engine_dir and --engine_dir must point to the draft and target engines, respectively.

# --draft_target_model_config sets the draft-target model configuration. For example, [4,[0],[1],False] means draft_len=4, the draft model runs on GPU 0, the target model runs on GPU 1, and acceptance is based on tokens rather than logits.

# Only the C++ session (which uses the executor as its low-level API) is supported; the Python session (--use_py_session) is not.

# Run with 405B target model

!mpirun -n 8 --allow-run-as-root python3 ./run.py \
    --tokenizer_dir <path to draft model repo> \
    --draft_engine_dir ./draft-engine \
    --engine_dir ./target-engine \
    --draft_target_model_config="[10,[0,1,2,3,4,5,6,7],[0,1,2,3,4,5,6,7],False]" \
    --kv_cache_free_gpu_memory_fraction=0.35 \
    --max_output_len=1024 \
    --kv_cache_enable_block_reuse \
    --input_text="Implement a program to find the common elements in two arrays without using any extra data structures."
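
The `--draft_target_model_config` value is written as a Python-style list literal, so it can be sanity-checked before launching a long multi-GPU run. The helper below is a hypothetical sketch for that purpose and is not part of `run.py`. The field names and the `max_draft_len` check are assumptions based on the flag description above and on the `--max_draft_len 10` used when building the target engine.

```python
import ast
from typing import List, NamedTuple, Optional


class DraftTargetConfig(NamedTuple):
    draft_len: int             # number of draft tokens proposed per step
    draft_devices: List[int]   # GPU IDs used by the draft model
    target_devices: List[int]  # GPU IDs used by the target model
    use_logits: bool           # False means tokens, not logits, decide acceptance


def parse_draft_target_config(text: str,
                              max_draft_len: Optional[int] = None) -> DraftTargetConfig:
    """Parse and validate a string such as "[4,[0],[1],False]"."""
    value = ast.literal_eval(text)  # safely evaluates literals only
    if not isinstance(value, list) or len(value) != 4:
        raise ValueError("expected a list of four items")
    draft_len, draft_devices, target_devices, use_logits = value

    if isinstance(draft_len, bool) or not isinstance(draft_len, int) or draft_len <= 0:
        raise ValueError("draft_len must be a positive integer")
    for devices in (draft_devices, target_devices):
        if not isinstance(devices, list) or not devices:
            raise ValueError("device lists must be non-empty lists")
        if not all(isinstance(d, int) and not isinstance(d, bool) for d in devices):
            raise ValueError("device IDs must be integers")
    if not isinstance(use_logits, bool):
        raise ValueError("the last item must be True or False")
    # Assumption: the draft length should not exceed the engine's --max_draft_len.
    if max_draft_len is not None and draft_len > max_draft_len:
        raise ValueError("draft_len exceeds max_draft_len")

    return DraftTargetConfig(draft_len, draft_devices, target_devices, use_logits)


cfg = parse_draft_target_config(
    "[10,[0,1,2,3,4,5,6,7],[0,1,2,3,4,5,6,7], False]", max_draft_len=10)
print(cfg)
```

Checking the string up front catches malformed values, such as a missing bracket or a draft length larger than the one the target engine was built for, before any engine is loaded.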