# Run Hugging Face `Llama-3.1-70B-Instruct` EAGLE Speculative Decoding on Trn1 with `transformers-neuronx` and `vLLM`

In this tutorial we use [transformers-neuronx](https://github.com/aws-neuron/transformers-neuronx) and the [vLLM](https://docs.vllm.ai/en/latest/) serving framework to compile and deploy an instruction-tuned [Llama model](https://www.llama.com/) and corresponding EAGLE draft model for inference in an EAGLE speculative decoding configuration.

Speculative decoding is a token generation optimization technique in which a small draft model generates `K` tokens autoregressively and a larger target model then verifies all `K` draft tokens in a single forward pass to determine which ones to accept. For more information on speculative decoding, please see:
- Leviathan, Yaniv, Matan Kalman, and Yossi Matias. ["Fast inference from transformers via speculative decoding."](https://arxiv.org/abs/2211.17192) International Conference on Machine Learning. PMLR, 2023.
- Chen, Charlie, et al. ["Accelerating large language model decoding with speculative sampling."](https://arxiv.org/pdf/2302.01318) arXiv preprint arXiv:2302.01318 (2023).
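
To make the draft-then-verify loop concrete, here is a minimal sketch of one greedy speculative decoding iteration. It is not the `transformers-neuronx` implementation: `speculative_step`, `toy_draft`, and `toy_target` are hypothetical names, the "models" are plain Python functions over integer tokens, and the verification loop calls the target once per position where a real implementation would use one batched forward pass. The papers above also cover the non-greedy case, which this sketch omits.

```python
from typing import Callable, List

# A "model" here maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken, target: NextToken, k: int) -> List[int]:
    """Run one draft-then-verify iteration and return the newly accepted tokens."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target checks each proposed position in order.
    accepted: List[int] = []
    for i in range(k):
        target_token = target(prefix + proposed[:i])
        if target_token != proposed[i]:
            # First mismatch: keep the target's token and drop the remaining drafts.
            accepted.append(target_token)
            return accepted
        accepted.append(proposed[i])

    # 3. Every draft token was accepted, so the target contributes one extra token.
    accepted.append(target(prefix + proposed))
    return accepted


def toy_target(tokens: List[int]) -> int:
    # Toy target: always counts up by one.
    return tokens[-1] + 1


def toy_draft(tokens: List[int]) -> int:
    # Toy draft: agrees with the target unless the last token is a multiple of 3.
    return tokens[-1] + (2 if tokens[-1] % 3 == 0 else 1)


sequence = [1]
while len(sequence) < 12:
    sequence += speculative_step(sequence, toy_draft, toy_target, k=4)
print(sequence)
```

In this greedy variant the final sequence is identical to what the target would generate on its own; the draft only changes how many target passes are needed.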

EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) extends this technique by:
- Reducing sampling uncertainty by using the next autoregressively sampled token and the current feature map as draft model inputs.
- Utilizing a specially trained EAGLE draft model that predicts feature outputs through an Autoregression Head and next token outputs through an LM Head.

For more information on EAGLE speculative decoding, please see:
- Li, Yuhui, et al. ["Eagle: Speculative sampling requires rethinking feature uncertainty."](https://arxiv.org/pdf/2401.15077) arXiv preprint arXiv:2401.15077 (2024).  
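
As a rough illustration of the first point, the sketch below pairs each target feature with the token sampled one step later, which is our reading of the input layout described in the EAGLE paper. The function name `eagle_draft_inputs` and the string placeholders are hypothetical; a real draft model consumes hidden-state tensors and token embeddings, not strings.

```python
from typing import List, Sequence, Tuple


def eagle_draft_inputs(features: Sequence[str], tokens: Sequence[str]) -> List[Tuple[str, str]]:
    """Pair the feature at position i with the token at position i + 1."""
    # The token list is one element longer than the feature list: the most
    # recently sampled token has no target feature yet, and predicting that
    # feature is the draft model's job.
    if len(tokens) != len(features) + 1:
        raise ValueError("expected exactly one more token than features")
    return list(zip(features, tokens[1:]))


# Placeholder names only: f1..f3 stand for target features, t1..t4 for tokens.
pairs = eagle_draft_inputs(["f1", "f2", "f3"], ["t1", "t2", "t3", "t4"])
print(pairs)
```

From the last pair, the draft would predict the next feature, and the LM head would turn that feature into the next draft token.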

In this exercise, we use the following models:

- **Target Model**: [meta-llama/Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct)
- **Draft Model**: [Llama-3.1-70B-Instruct EAGLE draft](https://huggingface.co/yuhuili/EAGLE-LLaMA3-Instruct-70B)
    - Please note that this draft model is not an official Meta Llama release. It is provided by the EAGLE authors as compatible with a `Meta-Llama-3.1-70B-Instruct` target.

This tutorial proceeds in the following main sections:

1. Set up the Jupyter Notebook.
2. Install dependencies.
3. Access and download the target and draft models.
4. Perform EAGLE speculative decoding inference using `transformers-neuronx` and `vLLM`.

This notebook is intended for a Trn1 `trn1.32xlarge` instance.

*Note: The models in this tutorial require 322.2 GB of total disk space. Please ensure that your instance has sufficient storage to download and store Meta-Llama-3.1-70B-Instruct and the draft model before proceeding.*
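
One quick way to check the available storage from Python is sketched below. It assumes the models will be stored under the current working directory and that the figure above is in decimal gigabytes; `free_disk_gb` is a hypothetical helper name.

```python
import shutil

REQUIRED_GB = 322.2  # figure quoted in the note above (assumed decimal GB)


def free_disk_gb(path: str = ".") -> float:
    """Return the free space, in decimal GB, on the filesystem containing path."""
    return shutil.disk_usage(path).free / 1e9


free_gb = free_disk_gb(".")
status = "OK" if free_gb >= REQUIRED_GB else "INSUFFICIENT"
print(f"Free space: {free_gb:.1f} GB (need {REQUIRED_GB} GB) -> {status}")
```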

## Set Up the Jupyter Notebook

The following steps set up Jupyter Notebook and launch this tutorial:
1. Clone the [AWS Neuron Samples](https://github.com/aws-neuron/aws-neuron-samples) repo to your instance using:
```
git clone https://github.com/aws-neuron/aws-neuron-samples.git
```
2. Navigate to the `transformers-neuronx` inference samples folder:
```
cd aws-neuron-samples/torch-neuronx/transformers-neuronx/inference
```
3. Follow the instructions in [Jupyter Notebook QuickStart](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/notebook/setup-jupyter-notebook-steps-troubleshooting.html) to run Jupyter Notebook on your instance.
4. Locate this tutorial in your Jupyter Notebook session (`llama-3.1-70b-eagle-speculative-decoding.ipynb`) and launch it. Follow the rest of the instructions in this tutorial.

## Install Dependencies
This tutorial requires the following `pip` packages:
- `torch-neuronx`
- `neuronx-cc`
- `sentencepiece`
- `transformers`
- `transformers-neuronx`

Most of these packages will be installed when configuring your environment using the [torch-neuronx Inference Setup Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/torch-neuronx.html#setup-torch-neuronx). `transformers-neuronx` and additional dependencies can be installed as follows:

In [None]:
!pip install transformers-neuronx sentencepiece "transformers>=4.43.3" # Recent transformers version required for RoPE scaling in Llama 3.1/3.2
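
After installation, a short check like the following can confirm which of the packages listed above are present and whether `transformers` meets the minimum version. This is a sketch using only the standard library; `parse_version` and `installed_version` are hypothetical helpers, and the version parsing is deliberately simple (it keeps only the leading numeric components).

```python
import re
from importlib import metadata

MIN_TRANSFORMERS = (4, 43, 3)
PACKAGES = ["torch-neuronx", "neuronx-cc", "sentencepiece", "transformers", "transformers-neuronx"]


def parse_version(text):
    """Turn '4.43.3' or '4.44.0.dev0' into a comparable tuple such as (4, 43, 3)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        return ()
    return tuple(int(part) for part in match.group().split("."))


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for package in PACKAGES:
    print(f"{package}: {installed_version(package) or 'NOT INSTALLED'}")

found = installed_version("transformers")
if found is not None and parse_version(found) < MIN_TRANSFORMERS:
    print(f"transformers {found} is older than 4.43.3; please upgrade.")
```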

### Installing vLLM
Neuron maintains a fork of vLLM (v0.6.2) that contains the necessary changes to support inference with `transformers-neuronx`. Neuron is working with the vLLM community to upstream these changes to make them available in a future version.

*Important: Please follow the vLLM installation instructions below. Do not install vLLM from PyPI or the official vLLM repository.*

In [None]:
!git clone -b v0.6.x-neuron https://github.com/aws-neuron/upstreaming-to-vllm.git
!cd upstreaming-to-vllm && pip install -r requirements-neuron.txt
!cd upstreaming-to-vllm && VLLM_TARGET_DEVICE="neuron" pip install -e .

## Access and Download the Target and Draft Models

Meta-Llama-3.1-70B-Instruct and the draft model must be downloaded prior to running this tutorial. 

**Meta-Llama-3.1-70B-Instruct:** Use of the Meta-Llama 3.1 70B-Instruct model is governed by the Llama 3.1 Community License Agreement. Please follow the steps described in [meta-llama/Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) to gain access to this model.

**Llama-3.1-70B-Instruct EAGLE Draft:** Use of the Llama-3.1-70B-Instruct EAGLE Draft is governed by the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0). Download the model from [Hugging Face](https://huggingface.co/yuhuili/EAGLE-LLaMA3-Instruct-70B).

*Note:* For this sample, we assume you have access to the Hugging Face models above and that they are saved in the following directories:
- `Meta-Llama-3.1-70B-Instruct`
- `Llama-3.1-70B-Instruct-EAGLE-Draft`
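
Before continuing, it can help to confirm that both directories look like complete Hugging Face checkpoints. The sketch below only checks for a `config.json` and at least one `.safetensors` file, which is an assumption about what the later cells need; `missing_model_files` is a hypothetical helper.

```python
import os


def missing_model_files(model_dir):
    """Return a list of problems found in a local Hugging Face model directory."""
    if not os.path.isdir(model_dir):
        return [f"directory not found: {model_dir}"]
    problems = []
    names = os.listdir(model_dir)
    if "config.json" not in names:
        problems.append("config.json is missing")
    if not any(name.endswith(".safetensors") for name in names):
        problems.append("no .safetensors weight files found")
    return problems


for model_dir in ["Meta-Llama-3.1-70B-Instruct", "Llama-3.1-70B-Instruct-EAGLE-Draft"]:
    problems = missing_model_files(model_dir)
    print(model_dir, "->", "OK" if not problems else "; ".join(problems))
```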

## Perform EAGLE Speculative Decoding Using `transformers-neuronx` and `vLLM`

In this tutorial, we use `transformers-neuronx` and `vLLM`'s `LLM` entrypoint to perform offline batched inference. We apply EAGLE speculative decoding to this entrypoint by passing both the target and draft model paths in as arguments. The number of draft tokens to generate is specified by `num_speculative_tokens`.

In [None]:
import json
import os
import time

import torch
from safetensors import safe_open
from safetensors.torch import save_file
from vllm import LLM, SamplingParams

target_model_path = "Meta-Llama-3.1-70B-Instruct"
draft_model_path = "Llama-3.1-70B-Instruct-EAGLE-Draft"

max_model_len=1024

### Add the Target LM Head to the EAGLE Draft

This EAGLE draft checkpoint expects the target model LM head to be used as its own LM head. To achieve this, we copy the target model LM head to the draft model using the cell below:

**Important Note** *The following code cell overwrites the draft model Hugging Face `model.safetensors` file with a new `safetensors` file that additionally contains the target LM head.*
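
Because the overwrite is destructive, one option is to keep a copy of the original draft weights before running the overwrite cell below. This is an optional sketch, not part of the original workflow; `backup_once` and the `.orig` suffix are hypothetical choices, and the path assumes the draft directory name used in this tutorial.

```python
import os
import shutil


def backup_once(path, suffix=".orig"):
    """Copy path to path + suffix unless that backup already exists."""
    backup_path = path + suffix
    if not os.path.exists(backup_path):
        # copy2 also preserves file metadata such as modification time.
        shutil.copy2(path, backup_path)
    return backup_path


draft_weights = os.path.join("Llama-3.1-70B-Instruct-EAGLE-Draft", "model.safetensors")
if os.path.exists(draft_weights):
    print("Backup at:", backup_once(draft_weights))
```

Since an existing backup is never overwritten, re-running the notebook keeps the first (unmodified) copy.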

In [None]:
DRAFT_MODEL_SAFETENSORS_NAME = "model.safetensors"
LM_HEAD_WEIGHT_TENSOR_NAME = "lm_head.weight"
TARGET_MODEL_SAFETENSORS_INDEX_NAME = "model.safetensors.index.json"

def find_lm_head_safetensors_location(model_dir):
    model_index_location_path = os.path.join(model_dir, TARGET_MODEL_SAFETENSORS_INDEX_NAME)

    with open(model_index_location_path, 'r') as f:
        model_index_locations = json.load(f)

    lm_head_safetensors_name = model_index_locations["weight_map"][LM_HEAD_WEIGHT_TENSOR_NAME]

    return lm_head_safetensors_name

# Find the target model `lm_head.weight` location in safetensors
target_lm_head_safetensors_name = find_lm_head_safetensors_location(target_model_path)
target_lm_head_safetensors_path = os.path.join(target_model_path, target_lm_head_safetensors_name)

# Open the target model .safetensors shard containing `lm_head.weight`
with safe_open(target_lm_head_safetensors_path, framework="pt") as f:
    target_lm_head = f.get_tensor(LM_HEAD_WEIGHT_TENSOR_NAME)

# Collect all tensors in the draft model
draft_model_safetensors_path = os.path.join(draft_model_path, DRAFT_MODEL_SAFETENSORS_NAME)
tensors = {}
with safe_open(draft_model_safetensors_path, framework="pt") as f:
    for key in f.keys():
        tensors[key] = f.get_tensor(key)

# Add the LM head weights and save out the new draft model.safetensors file
tensors[LM_HEAD_WEIGHT_TENSOR_NAME] = target_lm_head.type(torch.float16)
save_file(tensors, draft_model_safetensors_path)
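
The lookup in `find_lm_head_safetensors_location` relies on the layout of a sharded checkpoint's `model.safetensors.index.json`, whose `weight_map` maps each tensor name to the shard file that stores it. The sketch below reproduces that lookup against a tiny made-up index so the structure is visible; the shard file names are hypothetical, and `shard_for_tensor` is an illustrative variant that raises a clearer error when the tensor is absent.

```python
import json
import os
import tempfile

# A made-up, two-entry index in the same shape as a sharded Hugging Face checkpoint.
example_index = {
    "metadata": {"total_size": 0},
    "weight_map": {
        "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
        "lm_head.weight": "model-00002-of-00002.safetensors",
    },
}


def shard_for_tensor(model_dir, tensor_name):
    """Return the shard file name that stores tensor_name."""
    index_path = os.path.join(model_dir, "model.safetensors.index.json")
    with open(index_path, "r") as f:
        weight_map = json.load(f)["weight_map"]
    if tensor_name not in weight_map:
        raise KeyError(f"{tensor_name} is not listed in {index_path}")
    return weight_map[tensor_name]


with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "model.safetensors.index.json"), "w") as f:
        json.dump(example_index, f)
    shard = shard_for_tensor(tmp, "lm_head.weight")

print(shard)
```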

### Update the Draft Hugging Face JSON Configuration

The `vLLM` fork uses a Boolean flag, `is_eagle`, in the draft model's Hugging Face `config.json` to determine whether the draft model is an EAGLE draft. If the draft model `config.json` does not already set `is_eagle` to `true`, run the cell below to add it. *Note: Please do not alter the target model Hugging Face configuration JSON file.*

In [None]:
draft_model_config_path = os.path.join(draft_model_path, "config.json")

with open(draft_model_config_path, 'r') as f:
    draft_config = json.load(f)

draft_config["is_eagle"] = True

with open(draft_model_config_path, 'w') as f:
    json.dump(draft_config, f, indent=2)  # indent keeps config.json human-readable
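
If you prefer to touch the file only when the flag is actually missing, the same update can be written as a small idempotent helper. This is a sketch, not part of the `vLLM` fork; `ensure_config_flag` is a hypothetical name, and the demonstration runs against a throwaway file instead of the real draft configuration.

```python
import json
import os
import tempfile


def ensure_config_flag(config_path, key, value):
    """Set config[key] = value on disk; return True only if the file was changed."""
    with open(config_path, "r") as f:
        config = json.load(f)
    if config.get(key) == value:
        return False  # already set, leave the file untouched
    config[key] = value
    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)
    return True


# Demonstration on a temporary config file with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "config.json")
    with open(demo_path, "w") as f:
        json.dump({"model_type": "llama"}, f)
    first_call_changed = ensure_config_flag(demo_path, "is_eagle", True)
    second_call_changed = ensure_config_flag(demo_path, "is_eagle", True)

print(first_call_changed, second_call_changed)
```

For the tutorial's draft model, the call would be `ensure_config_flag(draft_model_config_path, "is_eagle", True)`.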

### Creating the `LLM` Entrypoint
As a next step, we create the `vLLM` `LLM` entrypoint. Internally, this compiles the Neuron draft and target models and prepares them for use with `vLLM`'s continuous batching system. For more information, see Kwon, Woosuk, et al. ["Efficient memory management for large language model serving with pagedattention."](https://arxiv.org/pdf/2309.06180) Proceedings of the 29th Symposium on Operating Systems Principles. 2023.

Neuron currently supports `vLLM` continuous batching with a block size equal to the model's maximum sequence length, so we set `block_size`, `max_model_len`, and `speculative_max_model_len` to the same value (1024 tokens in this tutorial). We configure speculative decoding to sample 4 draft tokens per iteration by setting `num_speculative_tokens=4`. We also set `max_num_seqs=4`, so `vLLM` processes at most 4 sequences concurrently.

If the draft model Hugging Face `config.json` file contains `is_eagle=True`, EAGLE speculative decoding will be applied within the entrypoint.
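
When choosing `num_speculative_tokens`, a useful back-of-the-envelope estimate comes from Leviathan et al. (cited above): if each draft token is accepted independently with probability α, the expected number of tokens produced per target forward pass with `K` draft tokens is (1 − α^(K+1)) / (1 − α). The sketch below evaluates that formula; the acceptance rates are illustrative values, not measurements from this tutorial, and `expected_tokens_per_target_pass` is a hypothetical helper name.

```python
def expected_tokens_per_target_pass(acceptance_rate, num_speculative_tokens):
    """Expected tokens per target pass under an i.i.d. acceptance-rate assumption."""
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if num_speculative_tokens < 0:
        raise ValueError("num_speculative_tokens must be non-negative")
    k = num_speculative_tokens
    if acceptance_rate == 1.0:
        # Limit of the formula: all k drafts plus the target's extra token.
        return float(k + 1)
    return (1 - acceptance_rate ** (k + 1)) / (1 - acceptance_rate)


# Illustrative acceptance rates only.
for alpha in (0.6, 0.8, 0.9):
    estimate = expected_tokens_per_target_pass(alpha, 4)
    print(f"alpha={alpha}: {estimate:.2f} tokens per target pass")
```

The estimate ignores the cost of running the draft model, so it is an upper bound on the achievable speedup, not a prediction of it.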

In [None]:
llm = LLM(
    model=target_model_path,
    speculative_model=draft_model_path,
    block_size=max_model_len,
    device="neuron",
    dtype="bfloat16",
    max_model_len=max_model_len,
    max_num_seqs=4,
    num_speculative_tokens=4,
    speculative_max_model_len=max_model_len,
    swap_space=0,
    tensor_parallel_size=32,
    use_v2_block_manager=True,
)
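
The constraints described above are easy to get wrong when adapting this cell to other sequence lengths. As a hedged sketch, the hypothetical helper below checks a dictionary of the same keyword arguments before they are passed to `LLM`. The default of 32 NeuronCores corresponds to a `trn1.32xlarge` instance; the checks themselves are our own sanity rules, not validation performed by `vLLM`.

```python
def check_neuron_spec_config(kwargs, neuron_cores=32):
    """Return a list of problems in LLM keyword arguments for this tutorial's setup."""
    problems = []

    # block_size, max_model_len and speculative_max_model_len must all match.
    length_keys = ("block_size", "max_model_len", "speculative_max_model_len")
    lengths = {key: kwargs.get(key) for key in length_keys}
    if len(set(lengths.values())) != 1:
        problems.append(f"length settings differ: {lengths}")

    # The tensor parallel degree cannot exceed the available NeuronCores.
    if kwargs.get("tensor_parallel_size", 1) > neuron_cores:
        problems.append("tensor_parallel_size exceeds available NeuronCores")

    # Speculative decoding needs at least one draft token per iteration.
    if kwargs.get("num_speculative_tokens", 0) < 1:
        problems.append("num_speculative_tokens must be at least 1")

    return problems


tutorial_kwargs = {
    "block_size": 1024,
    "max_model_len": 1024,
    "speculative_max_model_len": 1024,
    "tensor_parallel_size": 32,
    "num_speculative_tokens": 4,
}
print(check_neuron_spec_config(tutorial_kwargs) or "configuration looks consistent")
```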

### Generate Prompts

After this step, the target and draft models are ready to be used for batched inference with `vLLM`. We now assemble a collection of prompts. The target is instruction-tuned and the draft is trained through EAGLE to match target feature and token output distributions, so we apply the Llama 3.1 prompt template to each prompt. We also initialize our `vLLM` `SamplingParams`. For this exercise, we use greedy sampling.

In [None]:
# Gather sample prompts for batched inference.
prompts = [
    "Who are you?",
    "What is the capital of France?",
    "What is the future of AI?",
    "What is Llama?"
]

# Apply the Llama 3.1 prompt template to each prompt.
# See https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/prompt_format.md
llama_prompt_template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

{0}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""
prompts = [llama_prompt_template.format(prompt) for prompt in prompts]

# Set sampling parameters.
sampling_params = SamplingParams(temperature=0, top_p=1.0, top_k=1, max_tokens=256)

### Perform Batched Inference

Finally, we use the `LLM` entrypoint to perform batched inference:

In [None]:
# Perform offline batched inference and time it with a monotonic clock
start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}\nGenerated text: {generated_text!r}")
    print()
print('-' * 40)
print(f"Inference Elapsed Time: {elapsed:.3f} seconds")
print('-' * 40)
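
Elapsed time alone is hard to compare across runs with different output lengths, so it can be useful to convert it into generated tokens per second. The sketch below uses a hypothetical helper, `tokens_per_second`, with made-up token counts. With the results above, the counts could be collected as `[len(output.outputs[0].token_ids) for output in outputs]`.

```python
def tokens_per_second(token_counts, elapsed_seconds):
    """Aggregate generation throughput across all sequences in a batch."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return sum(token_counts) / elapsed_seconds


# Made-up example: four sequences generated in 10 seconds.
example_counts = [256, 12, 256, 180]
throughput = tokens_per_second(example_counts, 10.0)
print(f"Throughput: {throughput:.1f} generated tokens/second")
```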