# Llama-3.2-1b Inference

In this example we compile and deploy the Hugging Face [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) model for tensor parallel inference on Neuron using the `Neuronx-Distributed` package.

> Note: This model is not currently optimized for performance on neuronx-distributed. For optimized Llama-2 inference, use transformers-neuronx.

The example has the following main sections:

1. Set up the Jupyter Notebook
1. Install dependencies
1. Download the model
1. Trace the model
1. Perform greedy sampling
1. Benchmark sampling

This Jupyter Notebook can be run on a Trn1 instance (`trn1.32xlarge`). 

## Set up the Jupyter Notebook

The following steps set up Jupyter Notebook and launch this tutorial:
1. Clone the [Neuronx-Distributed](https://github.com/aws-neuron/neuronx-distributed.git) repo to your instance using
```
git clone https://github.com/aws-neuron/neuronx-distributed.git
```

2. Navigate to the `examples/inference` samples folder
```
cd neuronx-distributed/examples/inference/
```

3. Download the tutorial notebook `llama2_inference.ipynb` into the `examples/inference/` directory.
```
wget https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/master/src/examples/pytorch/neuronx_distributed/llama/llama2_inference.ipynb
```

4. Follow the instructions in [Jupyter Notebook QuickStart](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/notebook/setup-jupyter-notebook-steps-troubleshooting.html) to run Jupyter Notebook on your instance.



## Install Dependencies
This tutorial requires the following pip packages:

 - `torch-neuronx`
 - `neuronx-cc`
 - `sentencepiece`
 - `transformers`
 - `neuronx-distributed`

You can install `neuronx-distributed` using the [setup guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/index.html). Most of the other packages are installed when you configure your environment using the [torch-neuronx inference setup guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/torch-neuronx.html#setup-torch-neuronx). The additional dependencies must be installed here:

In [1]:
! pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
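
After installation, it can help to confirm that every package listed above is actually present in the active environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the distribution names are taken from the list above.

```python
from importlib import metadata

# Distribution names taken from the dependency list in this section.
REQUIRED = [
    "torch-neuronx",
    "neuronx-cc",
    "sentencepiece",
    "transformers",
    "neuronx-distributed",
]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name:22s} {version or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` should be installed through the setup guides linked above before continuing.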


## Download the model 
Use of this model is governed by the Meta license. To download the model weights and tokenizer, follow the instructions in [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf).

Once granted access, you can download the model. Save it to a local directory and make sure `model_path` in the next code cell points to that directory. For the purposes of this sample we assume a directory such as `models/Llama-2-7b-chat-hf` with the following layout:

```
Llama-2-7b-chat-hf
 ├── LICENSE.txt
 ├── README.md
 ├── USE_POLICY.md
 ├── config.json
 ├── generation_config.json
 ├── model-00001-of-00002.safetensors
 ├── model-00002-of-00002.safetensors
 ├── model.safetensors.index.json
 ├── pytorch_model-00001-of-00002.bin
 ├── pytorch_model-00002-of-00002.bin
 ├── pytorch_model.bin.index.json
 ├── special_tokens_map.json
 ├── tokenizer.json
 ├── tokenizer.model
 └── tokenizer_config.json
```
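
Before tracing, a quick sanity check of the download can save a failed run. The sketch below is an assumption about what a reasonable check looks like, not part of the tutorial's tooling: it looks for `config.json`, `tokenizer_config.json`, and at least one weight file, based on the layout shown above. The function name `missing_model_files` is hypothetical.

```python
from pathlib import Path


def missing_model_files(model_dir):
    """Return a list of problems found in a downloaded checkpoint directory."""
    root = Path(model_dir)
    problems = []
    # Files the layout above shows at the top level of the checkpoint.
    for name in ("config.json", "tokenizer_config.json"):
        if not (root / name).is_file():
            problems.append(f"missing {name}")
    # Weights may be stored as safetensors shards or as PyTorch .bin shards.
    has_weights = any(root.glob("*.safetensors")) or any(root.glob("*.bin"))
    if not has_weights:
        problems.append("no *.safetensors or *.bin weight files")
    return problems
```

An empty list means the directory has the files this check looks for; it does not validate their contents.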

By default, this model's `config.json` specifies `float16` precision, which is not supported for this model at this time. Open `config.json` and change the `torch_dtype` field to `bfloat16`.
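
If you prefer to make this change from the notebook rather than in an editor, the edit can be scripted. This is a minimal sketch using only the standard library; the helper name `set_torch_dtype` is hypothetical, and it assumes `config.json` sits directly inside the model directory as in the layout above.

```python
import json
from pathlib import Path


def set_torch_dtype(model_dir, dtype="bfloat16"):
    """Rewrite `torch_dtype` in <model_dir>/config.json and return the previous value."""
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text())
    previous = config.get("torch_dtype")
    config["torch_dtype"] = dtype
    # Write the file back with all other fields preserved.
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return previous
```

Calling `set_torch_dtype(model_path)` once `model_path` is defined returns the old value (for example `"float16"`), which makes it easy to confirm that the change took effect.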

In [2]:
model_path = "/home/ubuntu/models/llama-3.2-1b"
traced_model_path = "/home/ubuntu/models/llama-3.2-1b-trace"

## Trace and load the model

Now we can trace the model using the LlamaRunner script. This saves the model to the `traced_model_path`. Tracing the 7b model can take up to 70 minutes. After tracing, the model can be loaded.

In this sample we use a tensor parallelism degree of 32 to optimize performance on trn1.32xlarge.
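
To make the idea of a tensor parallelism degree concrete, here is a conceptual sketch in plain Python. It illustrates column-wise sharding of a single weight matrix: each rank multiplies the input by its own slice of the columns, and the partial outputs are concatenated. This is an illustration only and is not how Neuronx-Distributed is implemented; the function names are hypothetical.

```python
def matmul(x, w):
    """Multiply a row vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def column_shards(w, tp_degree):
    """Split the columns of w into tp_degree equal, contiguous shards."""
    cols = len(w[0])
    if cols % tp_degree != 0:
        raise ValueError("output dimension must be divisible by tp_degree")
    width = cols // tp_degree
    return [[row[r * width:(r + 1) * width] for row in w] for r in range(tp_degree)]


def tensor_parallel_matmul(x, w, tp_degree):
    """Each simulated rank handles one shard; results are concatenated in rank order."""
    out = []
    for shard in column_shards(w, tp_degree):
        out.extend(matmul(x, shard))
    return out


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

# Sharding across 2 simulated ranks gives the same result as the unsharded product.
assert tensor_parallel_matmul(x, w, tp_degree=2) == matmul(x, w)
```

The divisibility check mirrors a general constraint of this style of sharding: the dimension being split has to divide evenly across the ranks.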


In [3]:
from llama2.llama2_runner import LlamaRunner

max_context_length = 128
max_new_tokens = 384
batch_size = 2
tp_degree = 32

runner = LlamaRunner(model_path=model_path, 
                     tokenizer_path=model_path)

runner.trace(traced_model_path=traced_model_path,
             tp_degree=tp_degree,
             batch_size=batch_size,
             context_lengths=max_context_length,
             new_token_counts=max_new_tokens,
             on_device_sampling=True)





no generation config input
original eos token : 128001


NeuronLlamaForCausalLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
root: Unsupported nprocs (32), ignoring...


2024-10-22 01:44:10.000921:  2282947  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-10-22 01:44:10.000921:  2282957  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-10-22 01:44:10.000921:  2282960  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-10-22 01:44:10.000921:  2282944  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-10-22 01:44:10.000921:  2282947  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.15.128.0+56dc5a86/MODULE_10114637376880686083+d7517139/model.neff. Exiting with a successfully compiled graph.
2024-10-22 01:44:10.000921:  2282957  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.15.128.0+56dc5a86/MODULE_10114637376880686083+d7517139/model.neff. Exiting with a successfully compiled graph.
2024-10-22 01:44:10.000922:  2282960  INFO ||NEURON_CC_WRAPPER||: Using 

root: Unsupported nprocs (32), ignoring...


2024-10-22 01:44:55.000268:  2295684  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-10-22 01:44:55.000269:  2295684  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.15.128.0+56dc5a86/MODULE_10114637376880686083+d7517139/model.neff. Exiting with a successfully compiled graph.
2024-10-22 01:44:55.000271:  2295682  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-10-22 01:44:55.000272:  2295682  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.15.128.0+56dc5a86/MODULE_10114637376880686083+d7517139/model.neff. Exiting with a successfully compiled graph.
2024-10-22 01:44:55.000288:  2295687  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-10-22 01:44:55.000288:  2295687  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.15.128.0+56dc5a86/MODULE_10114637376880686083+d75171



......
Compiler status PASS

Compiler status PASS
.
Compiler status PASS
pad token id:  None
On device sampling:  True

## Inference 

Now let's use the model to perform autoregressive sampling.
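
The generation config printed in this section uses `top_k: 1`, which means the highest-scoring token is chosen at every step (greedy decoding). The loop below is a toy sketch of that procedure, not the runner's implementation: `next_token_scores` stands in for a real model, and `toy_scores` is a hypothetical scorer invented for the example.

```python
def greedy_decode(next_token_scores, prompt_ids, max_new_tokens, eos_id):
    """Repeatedly append the highest-scoring next token (equivalent to top_k=1)."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        scores = next_token_scores(ids)  # one score per vocabulary entry
        next_id = max(range(len(scores)), key=scores.__getitem__)
        ids.append(next_id)
        if next_id == eos_id:
            break  # stop as soon as the end-of-sequence token is produced
    return ids


def toy_scores(ids, vocab_size=5):
    """Hypothetical model: always prefers (last token + 1) modulo the vocabulary size."""
    favorite = (ids[-1] + 1) % vocab_size
    return [1.0 if t == favorite else 0.0 for t in range(vocab_size)]


print(greedy_decode(toy_scores, [0], max_new_tokens=10, eos_id=4))
```

Each step feeds the full sequence so far back into the scorer, which is what makes the procedure autoregressive.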

In [4]:
neuron_model = runner.load_neuron_model(traced_model_path)

generation config:  {'context_lengths': 128, 'do_sample': True, 'max_length': 256, 'new_token_counts': 384, 'on_device_sampling': True, 'pad_token_id': 128001, 'top_k': 1}


In [5]:
prompt = ["I believe the meaning of life is", "The color of the sky is"]

generate_ids, outputs = runner.generate_on_neuron(prompt, neuron_model)

for idx, output in enumerate(outputs):
    print(f"output {idx}: {output}")

2024-Oct-22 01:47:55.0111 2282769:2309312 [29] nccl_net_ofi_create_plugin:207 CCOM WARN NET/OFI Failed to initialize sendrecv protocol
2024-Oct-22 01:47:55.0119 2282769:2309312 [29] nccl_net_ofi_create_plugin:259 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2024-Oct-22 01:47:55.0126 2282769:2309312 [29] net_plugin.cc:94 CCOM WARN OFI plugin initNet() failed is EFA enabled?
output 0: I believe the meaning of life is to find your passion and to live it. I believe that the best way to do that is to find your purpose. I believe that the best way to find your purpose is to find your passion. I believe that the best way to find your passion is to find your purpose. I believe that the best way to find your purpose is to find your passion. I believe that the best way to find your purpose is to find your passion. I believe that the best way to find your purpose is to find your passion. I believe that the best way to find your purpose is to find your passion. I believe that the best way 

## Benchmarking 

Here we benchmark the per-token latency for greedy sampling.
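
A benchmark of this kind comes down to timing repeated calls and summarizing the samples. The sketch below shows one way to do that; it is not the runner's implementation. The `latency_ms_*` key names mirror the report printed by the benchmark cell, while the interpolation method and the throughput formula (tokens per call divided by average latency in seconds) are assumptions.

```python
import time


def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 100]) of an already sorted list."""
    if not sorted_values:
        raise ValueError("no samples")
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def summarize_latencies(latencies_ms, tokens_per_call):
    """Build a latency/throughput report from per-call latencies in milliseconds."""
    ordered = sorted(latencies_ms)
    avg = sum(ordered) / len(ordered)
    report = {f"latency_ms_p{q}": percentile(ordered, q) for q in (50, 90, 95, 99, 100)}
    report["latency_ms_avg"] = avg
    report["throughput"] = tokens_per_call / (avg / 1000.0)  # tokens per second
    return report


def time_calls(fn, n_runs):
    """Call fn() n_runs times and return each wall-clock latency in milliseconds."""
    latencies = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies
```

The assumed throughput formula is consistent with the figures in this notebook: with a batch size of 2 and a `max_length` of 256, 512 tokens divided by the average end-to-end latency of about 0.5645 s gives roughly 907 tokens per second.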

In [6]:
results = runner.benchmark_sampling(neuron_model)

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

Benchmark completed and its result is as following
{
    "e2e_model": {
        "latency_ms_p50": 573.0949640274048,
        "latency_ms_p90": 575.0540494918823,
        "latency_ms_p95": 575.3099679946899,
        "latency_ms_p99": 575.3675889968872,
        "latency_ms_p100": 575.3819942474365,
        "latency_ms_avg": 564.4593238830566,
        "throughput": 907.0627029027072
    },
    "context_encoding_model": {
        "latency_ms_p50": 6.815195083618164,
        "latency_ms_p90": 6.882119178771973,
        "latency_ms_p95": 6.976807117462159,
        "latency_ms_p99": 7.186453342437744,
        "latency_ms_p100": 7.238864898681641,
        "latency_ms_avg": 6.825506687164307,
        "throughput": 75012.74608123096
    },
    "token_generation_model": {
        "latency_ms_p50": 4.126787185668945,
        "latency_ms_p90": 4.174494743347168,
        "latency_ms_p95": 4.215049743652343,
        "latency_ms_p99": 4.416244029998782,
        "latency_ms_p100": 4.905462265014648,
  