# Run Hugging Face `Llama-3.1-70B-Instruct` + `Llama-3.2-1B-Instruct` Speculative Decoding on Trn1 with `transformers-neuronx` and `vLLM`

In this tutorial we use [transformers-neuronx](https://github.com/aws-neuron/transformers-neuronx) and the [vLLM](https://docs.vllm.ai/en/latest/) serving framework to compile and deploy two instruction-tuned [Llama models](https://www.llama.com/) for inference in a speculative decoding configuration.

Speculative decoding is a token generation optimization technique that uses a small draft model to generate `K` tokens autoregressively and a larger target model to verify them, deciding which draft tokens to accept in a single forward pass. For more information on speculative decoding, please see:
- Leviathan, Yaniv, Matan Kalman, and Yossi Matias. ["Fast inference from transformers via speculative decoding."](https://arxiv.org/abs/2211.17192) International Conference on Machine Learning. PMLR, 2023.
- Chen, Charlie, et al. ["Accelerating large language model decoding with speculative sampling."](https://arxiv.org/pdf/2302.01318) arXiv preprint arXiv:2302.01318 (2023).

In this exercise, we use the following models:

- **Target Model**: [meta-llama/Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct)
- **Draft Model**: [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)

This tutorial proceeds in the following main sections:

1. Set up the Jupyter Notebook.
2. Install dependencies.
3. Access and download the target and draft models.
4. Perform speculative decoding inference using `transformers-neuronx` and `vLLM`.

This notebook is intended for a Trn1 `trn1.32xlarge` instance.

*Note: The models in this tutorial require 315 GB of total disk space. Please ensure that your instance has sufficient storage to download and store Llama-3.1-70B-Instruct and Llama-3.2-1B-Instruct before proceeding.*
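
The storage requirement above can be checked before downloading anything. The following is a minimal sketch using only the Python standard library; the helper name `has_enough_disk` is hypothetical, and the check assumes the models will be stored on the same filesystem as the current working directory.

```python
import shutil

# Disk space required by this tutorial, in GB (taken from the note above).
REQUIRED_GB = 315


def has_enough_disk(path=".", required_gb=REQUIRED_GB):
    """Return (ok, free_gb) for the filesystem that contains `path`."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb, free_gb


ok, free_gb = has_enough_disk()
status = "sufficient" if ok else "insufficient"
print(f"Free space: {free_gb:.1f} GB ({status} for {REQUIRED_GB} GB of model files)")
```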

## Set Up the Jupyter Notebook

The following steps set up Jupyter Notebook and launch this tutorial:
1. Clone the [AWS Neuron Samples](https://github.com/aws-neuron/aws-neuron-samples) repo to your instance using:
```
git clone https://github.com/aws-neuron/aws-neuron-samples.git
```
2. Navigate to the `transformers-neuronx` inference samples folder:
```
cd aws-neuron-samples/torch-neuronx/transformers-neuronx/inference
```
3. Follow the instructions in [Jupyter Notebook QuickStart](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/notebook/setup-jupyter-notebook-steps-troubleshooting.html) to run Jupyter Notebook on your instance.
4. Locate this tutorial in your Jupyter Notebook session (`llama-3.1-70b-speculative-decoding.ipynb`) and launch it. Follow the rest of the instructions in this tutorial.

## Install Dependencies
This tutorial requires the following `pip` packages:
- `torch-neuronx`
- `neuronx-cc`
- `sentencepiece`
- `transformers`
- `transformers-neuronx`

Most of these packages will be installed when configuring your environment using the [torch-neuronx Inference Setup Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/torch-neuronx.html#setup-torch-neuronx). `transformers-neuronx` and additional dependencies can be installed as follows:

In [None]:
!pip install transformers-neuronx sentencepiece "transformers>=4.43.3" # Recent transformers version required for RoPE scaling in Llama 3.1/3.2

### Installing vLLM
Neuron maintains a fork of vLLM (v0.6.2) that contains the necessary changes to support inference with `transformers-neuronx`. Neuron is working with the vLLM community to upstream these changes to make them available in a future version.

*Important: Please follow the vLLM installation instructions below. Do not install vLLM from PyPI or the official vLLM repository.*

In [None]:
!git clone -b v0.6.x-neuron https://github.com/aws-neuron/upstreaming-to-vllm.git
!cd upstreaming-to-vllm && pip install -r requirements-neuron.txt
!cd upstreaming-to-vllm && VLLM_TARGET_DEVICE="neuron" pip install -e .

## Access and Download the Target and Draft Models

The Meta-Llama-3.1-70B-Instruct and Llama-3.2-1B-Instruct models must be downloaded prior to running this tutorial. 

**Meta-Llama-3.1-70B-Instruct:** Use of the Meta-Llama-3.1-70B-Instruct model is governed by the Llama 3.1 Community License Agreement. Please follow the steps described in [meta-llama/Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) to gain access to this model.

**Llama-3.2-1B-Instruct:** Use of the Llama-3.2-1B-Instruct model is governed by the Llama 3.2 Community License Agreement. Please follow the steps described in [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) to gain access to this model.

*Note:* For this tutorial, we assume you have access to the Hugging Face models above and that they are saved in the following directories:
- `Meta-Llama-3.1-70B-Instruct`
- `Llama-3.2-1B-Instruct`
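
If the models are not yet on disk, one possible way to fetch them into the directories listed above is `snapshot_download` from the `huggingface_hub` package. This is a sketch rather than the tutorial's required method: it assumes `huggingface_hub` is installed (it is normally pulled in with `transformers`), that you have accepted both licenses, and that you have authenticated, for example with `huggingface-cli login`. The helper names `download_models` and `missing_models` are hypothetical.

```python
from pathlib import Path

# Hugging Face repo ID -> local directory expected by this tutorial.
MODEL_DIRS = {
    "meta-llama/Meta-Llama-3.1-70B-Instruct": "Meta-Llama-3.1-70B-Instruct",
    "meta-llama/Llama-3.2-1B-Instruct": "Llama-3.2-1B-Instruct",
}


def missing_models(model_dirs=MODEL_DIRS, root="."):
    """List the local directories that do not yet contain a config.json."""
    return [
        local_dir
        for local_dir in model_dirs.values()
        if not (Path(root) / local_dir / "config.json").is_file()
    ]


def download_models(model_dirs=MODEL_DIRS):
    """Download each gated model repo into its local directory."""
    # Imported lazily so the helpers above work without huggingface_hub.
    from huggingface_hub import snapshot_download

    for repo_id, local_dir in model_dirs.items():
        snapshot_download(repo_id=repo_id, local_dir=local_dir)


print("Models still to download:", missing_models())
```

Calling `download_models()` starts the actual download, which is large, so it is left out of the snippet's top level.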

## Perform Speculative Decoding Inference Using `transformers-neuronx` and `vLLM`

In this tutorial, we use `transformers-neuronx` and `vLLM`'s `LLM` entrypoint to perform offline batched inference. We apply speculative decoding by passing both the target and draft model paths as arguments to the entrypoint. The number of draft tokens to generate is specified by `num_speculative_tokens`.

### Speculative Decoding Overview
The core intuition behind speculative decoding is that a simple model (the draft) can generate certain elements in a sequence as accurately as a more complex model (the target), and significantly faster.

In the speculative decoding procedure, the draft model first produces `num_speculative_tokens` tokens autoregressively. The target model then produces logits for each generated draft token (note that the target model can produce all logits in a single forward pass). We then iterate through the generated sequence's draft and target logits and perform a rejection sampling procedure to accept or reject draft tokens in a manner that accords with the target model's per-token probability distribution (for more details on the rejection sampling procedure, please see Theorem 1 in [Chen et al., 2023](https://arxiv.org/pdf/2302.01318)). When a draft token is rejected, a replacement token is sampled from an adjusted target distribution and the remaining draft tokens are discarded. If all draft tokens are accepted, a final token is sampled from the target model. Because of this, each iteration of speculative decoding is guaranteed to yield between 1 and `num_speculative_tokens + 1` tokens, and the sampled tokens are distributed according to the target distribution.
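
The accept/reject loop described above can be illustrated with a short pure-Python sketch. It follows the rejection sampling scheme from Chen et al. (2023) on toy probability lists and is not the code that `transformers-neuronx` or `vLLM` actually runs; the function names and the list-based representation of distributions are assumptions made for illustration.

```python
import random


def sample(weights, rng=random):
    """Sample an index in proportion to the given non-negative weights."""
    return rng.choices(range(len(weights)), weights=weights)[0]


def speculative_step(draft_probs, target_probs, draft_tokens, rng=random):
    """Run one accept/reject pass over K draft tokens.

    draft_probs[i]  : draft distribution that draft_tokens[i] was sampled from.
    target_probs[i] : target distribution at the same position. It holds K + 1
                      entries; the last one follows all K draft tokens.
    Returns the emitted tokens (between 1 and K + 1 of them).
    """
    emitted = []
    for i, token in enumerate(draft_tokens):
        p = target_probs[i][token]
        q = draft_probs[i][token]
        # Accept the draft token with probability min(1, p / q).
        if rng.random() < min(1.0, p / q):
            emitted.append(token)
            continue
        # Rejected: resample from the residual max(0, p - q) and stop.
        residual = [max(0.0, pt - qt) for pt, qt in zip(target_probs[i], draft_probs[i])]
        emitted.append(sample(residual, rng))
        return emitted
    # Every draft token was accepted: sample one extra token from the target.
    emitted.append(sample(target_probs[-1], rng))
    return emitted


# Toy example with a 3-token vocabulary and K = 2 draft tokens.
draft = [[0.5, 0.5, 0.0], [0.5, 0.5, 0.0]]
target = [[0.5, 0.5, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]
print(speculative_step(draft, target, draft_tokens=[0, 1]))
```

In this toy example the draft and target distributions agree, so both draft tokens are accepted and one extra token is sampled from the final target distribution.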

### Creating the `LLM` Entrypoint
As a first step, we create the `vLLM` `LLM` entrypoint. Internally, this compiles the Neuron draft and target models and prepares them for use with `vLLM`'s continuous batching system. For more information, see Kwon, Woosuk, et al. ["Efficient memory management for large language model serving with PagedAttention."](https://arxiv.org/pdf/2309.06180) Proceedings of the 29th Symposium on Operating Systems Principles. 2023. Neuron currently supports `vLLM` continuous batching with a block size equal to the model's maximum sequence length, so we set `block_size`, `max_model_len`, and `speculative_max_model_len` to the same value (1024 tokens in this tutorial). We configure speculative decoding to sample 4 draft tokens per iteration by setting `num_speculative_tokens=4`. The maximum number of sequences `vLLM` will process concurrently is also set to 4 with `max_num_seqs=4`.

In [None]:
import time

from vllm import LLM, SamplingParams

In [None]:
target_model_path = "Meta-Llama-3.1-70B-Instruct"
draft_model_path = "Llama-3.2-1B-Instruct"

max_model_len = 1024

llm = LLM(
    model=target_model_path,
    speculative_model=draft_model_path,
    block_size=max_model_len,
    device="neuron",
    dtype="bfloat16",
    max_model_len=max_model_len,
    max_num_seqs=4,
    num_speculative_tokens=4,
    speculative_max_model_len=max_model_len,
    swap_space=0,
    tensor_parallel_size=32,
    use_v2_block_manager=True,
)
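
To build intuition for the choice of `num_speculative_tokens`, the sketch below evaluates the expected number of tokens produced per iteration from Leviathan et al. (2023), `(1 - a^(K+1)) / (1 - a)`, where `a` is the per-token acceptance rate and `K` is the number of draft tokens. The formula assumes acceptances are independent with a constant rate, and the acceptance rates used here are hypothetical values, not measurements of the models in this tutorial.

```python
def expected_tokens_per_iteration(acceptance_rate, num_speculative_tokens):
    """Expected tokens emitted per speculative decoding iteration.

    Assumes each draft token is accepted independently with the same
    probability `acceptance_rate` (a simplifying assumption).
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if acceptance_rate == 1.0:
        # Every draft token is accepted, plus the extra target token.
        return float(num_speculative_tokens + 1)
    return (1 - acceptance_rate ** (num_speculative_tokens + 1)) / (1 - acceptance_rate)


# Hypothetical acceptance rates, for illustration only.
for rate in (0.5, 0.7, 0.9):
    tokens = expected_tokens_per_iteration(rate, num_speculative_tokens=4)
    print(f"acceptance rate {rate:.1f}: {tokens:.2f} expected tokens per iteration")
```

Under this model, raising `num_speculative_tokens` gives diminishing returns when the acceptance rate is low, because later draft tokens are rarely reached.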

### Generate Prompts

After this step, the models are ready to be used for batched inference with `vLLM`. We now assemble a collection of prompts. The target and draft models are instruction-tuned, so we apply the Llama 3.1 prompt template to each prompt. We also initialize our vLLM `SamplingParams`. For this exercise, we use greedy sampling.

In [None]:
# Gather sample prompts for batched inference.
prompts = [
    "Who are you?",
    "What is the capital of France?",
    "What is the future of AI?",
    "What is Llama?"
]

# Apply the Llama 3.1 prompt template to each prompt.
# See https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/prompt_format.md
llama_prompt_template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

{0}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""
prompts = [llama_prompt_template.format(prompt) for prompt in prompts]

# Set sampling parameters.
sampling_params = SamplingParams(temperature=0, top_p=1.0, top_k=1, max_tokens=256)

### Perform Batched Inference

Finally, we use the `LLM` entrypoint to perform batched inference:

In [None]:
# Perform offline batched inference
start = time.time()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.time() - start

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}\nGenerated text: {generated_text!r}")
    print()
print('-' * 40)
print(f"Inference Elapsed Time: {elapsed:.3f} seconds")
print('-' * 40)
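
The elapsed time above can be turned into a rough throughput figure by counting generated tokens. The sketch below assumes each result exposes its generated token IDs as `output.outputs[0].token_ids`, as vLLM's `RequestOutput` does; the helper name is hypothetical, and stand-in objects are used so the snippet runs without a compiled model. Because `elapsed` also covers prompt processing, the result is an end-to-end estimate rather than a pure decoding rate.

```python
from types import SimpleNamespace


def generated_tokens_per_second(outputs, elapsed_seconds):
    """Total generated tokens across all requests divided by wall-clock time."""
    total_tokens = sum(len(output.outputs[0].token_ids) for output in outputs)
    return total_tokens / elapsed_seconds


# Stand-ins that mimic the shape of vLLM results (illustrative values only).
fake_outputs = [
    SimpleNamespace(outputs=[SimpleNamespace(token_ids=[1, 2, 3, 4])]),
    SimpleNamespace(outputs=[SimpleNamespace(token_ids=[5, 6])]),
]
print(f"{generated_tokens_per_second(fake_outputs, 2.0):.1f} tokens/s")
```

With the real results, the call would be `generated_tokens_per_second(outputs, elapsed)`.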