# Run speculative sampling on Meta Llama models

In speculative sampling, a smaller draft model speculates several future tokens. These tokens are then sent to the larger target model, which accepts or rejects each of them.

For a more detailed explanation, refer to the original DeepMind paper, ["Accelerating Large Language Model Decoding with Speculative Sampling"](https://arxiv.org/abs/2302.01318).
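
The accept/reject step can be sketched in plain Python. This is a minimal illustration of the rule described in the paper, not the `transformers-neuronx` implementation; the function name `accept_or_resample` and the toy distributions are assumptions made for this example. A drafted token `x` is accepted with probability `min(1, q(x) / p(x))`, where `p` is the draft distribution and `q` is the target distribution. On rejection, a replacement token is sampled from the normalized residual `max(0, q - p)`.

```python
import random

def accept_or_resample(draft_token, draft_probs, target_probs, rng):
    """Return (token, accepted) for one drafted token.

    draft_probs (p) and target_probs (q) are probability lists over the
    same toy vocabulary; draft_token is assumed to have been sampled from p.
    """
    p = draft_probs[draft_token]
    q = target_probs[draft_token]
    # Accept the drafted token with probability min(1, q / p).
    if rng.random() < min(1.0, q / p):
        return draft_token, True
    # Otherwise resample from the normalized residual max(0, q - p).
    residual = [max(0.0, qi - pi) for qi, pi in zip(target_probs, draft_probs)]
    total = sum(residual)
    weights = [r / total for r in residual]
    token = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return token, False

rng = random.Random(0)
draft_probs = [0.5, 0.5]   # toy draft distribution p
target_probs = [1.0, 0.0]  # toy target distribution q
print(accept_or_resample(0, draft_probs, target_probs, rng))  # (0, True)
print(accept_or_resample(1, draft_probs, target_probs, rng))  # (0, False)
```

Tokens produced this way follow the target distribution `q`, which is why the draft model can speed up decoding without changing what the target model would have sampled.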

In this example we perform speculative sampling using the Hugging Face ["meta-llama/Llama-2-70b"](https://huggingface.co/meta-llama/Llama-2-70b) model and Hugging Face ["meta-llama/Llama-2-7b"](https://huggingface.co/meta-llama/Llama-2-7b).
Here, the 70b model is considered the target model and the 7b model is considered the draft model.

The example has the following main sections:

1. Set up the Jupyter Notebook
2. Install dependencies
3. Download and construct the model
4. Split the model `state_dict` into multiple files
5. Perform speculative sampling

This Jupyter Notebook should be run on a Trn1 instance (`trn1.32xlarge`). Running it on an Inf2 instance (`inf2.48xlarge`) requires changing the `tp_degree` specified in the compilation section.
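
As a rough sketch of that change, the tensor-parallel degree can be kept in one place and chosen per instance type. The NeuronCore counts below (32 on `trn1.32xlarge`, 24 on `inf2.48xlarge`) are the totals for those instance types. Using the full core count as `tp_degree` is an assumption of this sketch, and `pick_tp_degree` is a hypothetical helper, not part of any Neuron library.

```python
# NeuronCores available per instance type (two NeuronCores per accelerator).
NEURON_CORES = {
    "trn1.32xlarge": 32,
    "inf2.48xlarge": 24,
}

def pick_tp_degree(instance_type):
    """Return a tp_degree for the instance type (assumption: use every core)."""
    try:
        return NEURON_CORES[instance_type]
    except KeyError:
        raise ValueError(f"Unknown instance type: {instance_type}") from None

tp_degree = pick_tp_degree("inf2.48xlarge")
print(tp_degree)  # 24
```

The resulting value would then be passed as `tp_degree` when loading both the draft and the target model.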

## Set up the Jupyter Notebook

The following steps set up Jupyter Notebook and launch this tutorial:

1. Clone the ["AWS Neuron Samples"](https://github.com/aws-neuron/aws-neuron-samples) repo to your instance using:
```
git clone https://github.com/aws-neuron/aws-neuron-samples.git
```
2. Navigate to the `transformers-neuronx` inference samples folder:
```
cd aws-neuron-samples/torch-neuronx/transformers-neuronx/inference
```
3. Follow the instructions in ["Jupyter Notebook Quickstart"](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/notebook/setup-jupyter-notebook-steps-troubleshooting.html) to run Jupyter Notebook on your instance.

4. Locate this tutorial in your Jupyter Notebook session (`speculative_sampling.ipynb`) and launch it. Follow the rest of the instructions in this tutorial.




## Install Dependencies

This tutorial requires the following pip packages:

- `torch-neuronx`
- `neuronx-cc`
- `sentencepiece`
- `transformers`
- `transformers-neuronx`

Most of these packages will be installed when configuring your environment using the ["torch-neuronx inference setup guide"](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/torch-neuronx.html#setup-torch-neuronx). The additional dependencies must be installed here:


In [None]:
!pip install transformers-neuronx sentencepiece
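
To confirm that the environment has everything before continuing, a quick check along these lines may help. It uses only the standard-library `importlib.metadata` module; the helper name `installed_versions` is made up for this sketch.

```python
from importlib import metadata

REQUIRED = [
    "torch-neuronx",
    "neuronx-cc",
    "sentencepiece",
    "transformers",
    "transformers-neuronx",
]

def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```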

## Download the model

Use of the Llama 2 models is governed by the Meta license. The models must be downloaded and converted to the standard Hugging Face format before running this sample.

Follow the steps described in ["meta-llama/Llama-2-70b"](https://huggingface.co/meta-llama/Llama-2-70b) and ["meta-llama/Llama-2-7b"](https://huggingface.co/meta-llama/Llama-2-7b) to get access to the Llama 2 models from Meta and download the weights and tokenizer.

After gaining access to the model checkpoints, you should be able to use the already converted checkpoints. Otherwise, if you are converting your own model, feel free to use the ["conversion script"](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py). The script can be called with the following (example) command:

```
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir /path/to/downloaded/llama/weights --model_size 70Bf --output_dir ./Llama-2-70b
```

Note: For the purposes of this sample, we assume you have saved the Llama-2-70b model and the Llama-2-7b model in separate directories called `Llama-2-70b` and `Llama-2-7b`.

## Construct the model

We load and construct the draft and target models from the local checkpoint directories using the Hugging Face `from_pretrained` method.


In [None]:
from transformers import LlamaForCausalLM

draft_model = LlamaForCausalLM.from_pretrained('Llama-2-7b')
target_model = LlamaForCausalLM.from_pretrained('Llama-2-70b')

## Split the model state_dict into multiple files

For the sake of reducing host memory usage, it is recommended to save the model `state_dict` as multiple files, as opposed to one monolithic file given by `torch.save`. This "split-format" `state_dict` can be created using the `save_pretrained_split` function. With this checkpoint format, the Neuron model loader can load parameters to the Neuron device high-bandwidth memory (HBM) directly by keeping at most one layer of model parameters in the CPU main memory.

In [None]:
import torch
import os
import re
import json
from transformers_neuronx.module import save_pretrained_split

save_pretrained_split(draft_model, './Llama-2-7b-split')
save_pretrained_split(target_model, './Llama-2-70b-split')
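
The idea behind the split format can be illustrated without any Neuron libraries. The sketch below is not the on-disk format that `save_pretrained_split` produces. It uses `pickle`, made-up helper names (`save_split`, `iter_split`), and small Python lists standing in for tensors, purely to show why one file per parameter lets a loader hold a single entry in host memory at a time.

```python
import os
import pickle
import tempfile

def save_split(state_dict, directory):
    """Write each entry of state_dict to its own file and return the key order."""
    os.makedirs(directory, exist_ok=True)
    keys = list(state_dict)
    for index, key in enumerate(keys):
        with open(os.path.join(directory, f"p{index}.pkl"), "wb") as f:
            pickle.dump(state_dict[key], f)
    # A small index file records which key each numbered file belongs to.
    with open(os.path.join(directory, "index.pkl"), "wb") as f:
        pickle.dump(keys, f)
    return keys

def iter_split(directory):
    """Yield (key, value) pairs one at a time instead of loading everything."""
    with open(os.path.join(directory, "index.pkl"), "rb") as f:
        keys = pickle.load(f)
    for index, key in enumerate(keys):
        with open(os.path.join(directory, f"p{index}.pkl"), "rb") as f:
            yield key, pickle.load(f)

toy_state_dict = {"layer0.weight": [0.1, 0.2], "layer1.weight": [0.3, 0.4]}
with tempfile.TemporaryDirectory() as tmp:
    save_split(toy_state_dict, tmp)
    for key, value in iter_split(tmp):
        # A real loader would copy the value to the device here, then drop it.
        print(key, value)
```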

## Perform speculative sampling

We now load and compile the draft model and the target model.
We use the Neuron `LlamaForSampling` class to load both models. Without extra configuration, autoregressive sampling is used by default.

Since we need to perform regular autoregressive sampling in the draft model, we load and compile it using the default options.
For the target model, we need to explicitly enable speculative decoding by calling `enable_speculative_decoder(k)`, which compiles the model to compute a window of `k` tokens at a time.

Note that when loading the models, we must use the same `tp_degree`. Attempting to use a different value for the draft/target model will result in a load failure.

In [None]:
import time
import torch
from transformers import AutoTokenizer
from transformers_neuronx.llama.model import LlamaForSampling

print("\nStarting to compile Draft Model....")
# Load draft model
draft_neuron_model = LlamaForSampling.from_pretrained('./Llama-2-7b-split', n_positions=128, batch_size=1, tp_degree=32, amp='f32')
# compile to neuron 
draft_neuron_model.to_neuron()
print("\nCompleted compilation of Draft Model")

print("\nStarting to compile Target Model....")
# Load target model
target_neuron_model = LlamaForSampling.from_pretrained('./Llama-2-70b-split', n_positions=128, batch_size=1, tp_degree=32, amp='f32')
# Enable speculative decoder
target_neuron_model.enable_speculative_decoder(7)
# compile to neuron 
target_neuron_model.to_neuron()
print("\nCompleted compilation of Target Model")

Next, we initialize the tokenizer and the text prompt. 

We then initialize the `SpeculativeGenerator` class and pass the draft model, target model and speculation length as arguments. We can use this to call the `sample()` function and get the final sampled tokens after using the tokenizer to decode them. 

Comparing the response generation time between speculative sampling and autoregressive sampling, we see that speculative sampling is faster than autoregressive sampling.

In [None]:
from transformers_neuronx.speculation import SpeculativeGenerator, DraftModelForSpeculation, DefaultTokenAcceptor
import sentencepiece
from transformers import LlamaTokenizer

#Initialize tokenizer and text prompt
tokenizer = LlamaTokenizer.from_pretrained("Llama-2-70b")
prompt = "Hello, I'm a generative AI language model."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# create SpeculativeGenerator
spec_gen = SpeculativeGenerator(draft_neuron_model, target_neuron_model, 7)

# call speculative sampling on given input
start_spec_timer = time.time()

print("Starting to call Speculative Sampling..")
response = spec_gen.sample(
    input_ids=input_ids,
    sequence_length=50,
)
end_spec_timer = time.time()

generated_text = tokenizer.decode(response[0])
print(f"\nDecoded tokens: {generated_text}")

print(f"\nSpeculative sampling response generation took {end_spec_timer - start_spec_timer:.2f} seconds")

start_auto_r_timer = time.time()
autor_response = target_neuron_model.sample(input_ids=input_ids, sequence_length=50)
end_auto_r_timer = time.time()

print(f"\nAutoregressive sampling response generation took {end_auto_r_timer - start_auto_r_timer:.2f} seconds")
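
A single timed run can be noisy, for example because the first call may include warm-up cost. One way to make the comparison more robust is sketched below. `time_generation` is a hypothetical helper, and the commented usage assumes the `spec_gen`, `target_neuron_model`, and `input_ids` objects from the cells above are in scope.

```python
import statistics
import time

def time_generation(generate, repeats=3, warmup=1):
    """Call generate() several times and return (mean_seconds, last_result)."""
    result = None
    for _ in range(warmup):
        result = generate()  # untimed warm-up call
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        result = generate()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), result

# Stand-in workload so the sketch runs anywhere.
mean_seconds, output = time_generation(lambda: sum(range(100_000)))
print(f"mean latency: {mean_seconds:.4f} seconds")

# With the objects from this notebook (assumed to be in scope):
# spec_s, _ = time_generation(lambda: spec_gen.sample(input_ids=input_ids, sequence_length=50))
# auto_s, _ = time_generation(lambda: target_neuron_model.sample(input_ids=input_ids, sequence_length=50))
# print(f"speedup: {auto_s / spec_s:.2f}x")
```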

