# Llama-3.2-1b Inference

In this example we compile and deploy the Hugging Face [meta-llama/Llama-3.2-1b](https://huggingface.co/meta-llama/Llama-3.2-1B) model for tensor parallel inference on Neuron using the `Neuronx-Distributed` package.

> Note: This model is not currently optimized for performance on `neuronx-distributed`. It serves as a tutorial for hosting a 1B Llama model on AWS Trainium and integrating it with kernels from [NKI](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/index.html).

The example has the following main sections:

1. Set up the Jupyter Notebook
1. Install dependencies
1. Download the model
1. Trace the model
1. Perform greedy sampling
1. Benchmark sampling

This Jupyter Notebook can be run on a Trn1 instance (`trn1.2xlarge`). 

## Set up the Jupyter Notebook

The following steps set up Jupyter Notebook and launch this tutorial:
1. Clone the [Neuronx-Distributed](https://github.com/aws-neuron/neuronx-distributed.git) repo to your instance:
```
git clone https://github.com/aws-neuron/neuronx-distributed.git
```

2. Navigate to the `examples/inference` samples folder
```
cd neuronx-distributed/examples/inference/
```

3. Follow the instructions in `examples/inference/README.md` to set up the virtual environment.

4. Follow the instructions in [Jupyter Notebook QuickStart](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/notebook/setup-jupyter-notebook-steps-troubleshooting.html) to run Jupyter Notebook on your instance, or run using VSCode.



## Install Dependencies
This tutorial requires the following pip packages:

 - `torch-neuronx`
 - `neuronx-cc`
 - `sentencepiece`
 - `transformers`
 - `neuronx-distributed`

You can install `neuronx-distributed` using the [setup guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/index.html). Most of the other packages are installed when you configure your environment using the [torch-neuronx inference setup guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/torch-neuronx.html#setup-torch-neuronx). Install the remaining dependencies here:

In [None]:
! pip install -r requirements.txt
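
As a quick sanity check after installation, the following sketch reports which of the packages listed above are present in the current environment. It relies only on the standard library's `importlib.metadata`; the helper name `installed_versions` is made up for this example and is not part of any Neuron package.

```python
from importlib import metadata

# Distribution names copied from the dependency list above.
REQUIRED = [
    "torch-neuronx",
    "neuronx-cc",
    "sentencepiece",
    "transformers",
    "neuronx-distributed",
]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` should be installed before continuing.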

## Download the model 
Use of this model is governed by the Meta license. To download the model weights and tokenizer, follow the instructions in [meta-llama/Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B).

Once granted access, you can download the model. For the purposes of this sample we assume you have saved the Llama-3.2-1b model in a directory called `models/llama-3.2-1b` with the following layout:

```
llama-3.2-1b
 ├── original
 │    ├── consolidated.00.pth
 │    ├── params.json
 │    └── tokenizer.model
 ├── config.json
 ├── generation_config.json
 ├── LICENSE.txt
 ├── model.safetensors
 ├── README.md
 ├── special_tokens_map.json
 ├── tokenizer_config.json
 ├── tokenizer.json
 └── USE_POLICY.md
```

In [None]:
model_path = "/home/ubuntu/models/llama-3.2-1b"
traced_model_path = "/home/ubuntu/models/llama-3.2-1b-trace"
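
Before tracing, it can help to confirm that the download matches the layout shown above. The sketch below checks for a subset of the listed files. Which files the runner actually reads is not stated in this tutorial, so the `EXPECTED_FILES` list is an assumption, and `missing_files` is a hypothetical helper written for this example.

```python
from pathlib import Path

# Assumed subset of the directory listing above; adjust it to your needs.
EXPECTED_FILES = [
    "config.json",
    "generation_config.json",
    "model.safetensors",
    "special_tokens_map.json",
    "tokenizer_config.json",
    "tokenizer.json",
]


def missing_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


missing = missing_files("/home/ubuntu/models/llama-3.2-1b")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All expected files found.")
```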

## Trace and load the model

Now we can trace the model using the LlamaRunner script. This saves the model to the `traced_model_path`. After tracing, the model can be loaded.

In this sample we use a tensor parallelism degree of 2, which shards the model across both NeuronCores of a `trn1.2xlarge`.
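
To give a feel for what a tensor parallelism degree of 2 means, here is a toy, pure-Python sketch of column-parallel sharding: the weight matrix is split by columns into two shards, each shard computes its slice of the output independently, and the slices are concatenated. This is a generic illustration only. It does not reflect how `neuronx-distributed` actually partitions the Llama layers, and the helper names are invented for this example.

```python
def matmul(x, w):
    """Multiply a row vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, tp_degree):
    """Split the columns of w into tp_degree equal, contiguous shards."""
    cols = len(w[0])
    assert cols % tp_degree == 0, "column count must be divisible by tp_degree"
    width = cols // tp_degree
    return [[row[r * width:(r + 1) * width] for row in w] for r in range(tp_degree)]


x = [1.0, 2.0, 3.0]
w = [
    [1.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 1.0, 3.0],
    [2.0, 1.0, 0.0, 1.0],
]

# Each "rank" holds one shard and computes its part of the output on its own.
shards = split_columns(w, tp_degree=2)
partial_outputs = [matmul(x, shard) for shard in shards]

# Concatenating the partial outputs reproduces the unsharded result.
gathered = [value for part in partial_outputs for value in part]
assert gathered == matmul(x, w)
print(gathered)  # [7.0, 5.0, 4.0, 10.0]
```

Each rank stores only half of the weight columns, which is why sharding reduces the memory needed per core.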


In [None]:
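from llama3.llama3_runner import LlamaRunner

max_context_length = 2048  # context length passed to trace()
max_new_tokens = 256       # number of new tokens passed to trace()
batch_size = 1
tp_degree = 2              # tensor parallelism degree

# The tokenizer files live in the same directory as the model weights.
runner = LlamaRunner(model_path=model_path,
                     tokenizer_path=model_path)

# Trace the model and save the result to traced_model_path.
runner.trace(traced_model_path=traced_model_path,
             tp_degree=tp_degree,
             batch_size=batch_size,
             context_lengths=max_context_length,
             new_token_counts=max_new_tokens,
             on_device_sampling=True)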



## Inference 

Now let's use the model to perform autoregressive sampling.

In [None]:
neuron_model = runner.load_neuron_model(traced_model_path)

In [None]:
prompt = ["I believe the meaning of life is"]

generate_ids, outputs = runner.generate_on_neuron(prompt, neuron_model)

for idx, output in enumerate(outputs):
    print(f"output {idx}: {output}")
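
For readers new to the term, autoregressive greedy sampling can be sketched in a few lines: at every step the model scores all candidate next tokens, the highest-scoring token is appended to the sequence, and the extended sequence is fed back in. The code below is a conceptual sketch with a toy stand-in for the model. `greedy_generate` and `toy_logits` are hypothetical names, and none of this is the `LlamaRunner` implementation.

```python
def greedy_generate(next_token_logits, prompt_ids, max_new_tokens, eos_id=None):
    """Repeatedly append the highest-scoring next token to the sequence."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)
        # Greedy choice: index of the largest logit (first one wins on ties).
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
        if next_id == eos_id:
            break  # stop early once the end-of-sequence token is produced
    return ids


def toy_logits(ids, vocab_size=5):
    """Toy stand-in for a model: favors (last token + 1) mod vocab_size."""
    target = (ids[-1] + 1) % vocab_size
    return [1.0 if token == target else 0.0 for token in range(vocab_size)]


print(greedy_generate(toy_logits, [0], max_new_tokens=6, eos_id=4))  # [0, 1, 2, 3, 4]
```

Because the choice at each step is deterministic, the same prompt always yields the same output under greedy sampling.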

## Benchmarking 

Here we benchmark the per-token latency for greedy sampling.

In [None]:
results = runner.benchmark_sampling(neuron_model)
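
The structure of `results` is not described in this tutorial. As a generic illustration of how per-token latency is commonly measured and summarized, the sketch below times a step function repeatedly and reports the mean and nearest-rank percentiles. The helpers `measure_latencies` and `summarize` are hypothetical and unrelated to what `benchmark_sampling` does internally.

```python
import math
import statistics
import time


def measure_latencies(step, num_tokens):
    """Time num_tokens calls of step() and return the latencies in milliseconds."""
    latencies_ms = []
    for _ in range(num_tokens):
        start = time.perf_counter()
        step()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def summarize(latencies_ms):
    """Return the mean, median (p50), and p99 latency using nearest-rank percentiles."""
    ordered = sorted(latencies_ms)

    def percentile(p):
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]

    return {
        "mean_ms": statistics.mean(ordered),
        "p50_ms": percentile(50),
        "p99_ms": percentile(99),
    }


# Stand-in workload; with a real model, step() would generate a single token.
latencies = measure_latencies(lambda: sum(range(1000)), num_tokens=20)
print(summarize(latencies))
```

Reporting percentiles alongside the mean makes occasional slow steps visible instead of averaging them away.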