# Create your own chatbot with Llama-2 13B on AWS Inferentia

This guide details how to export, deploy and run a **Llama-2 13B** chat model on AWS Inferentia.

You will learn how to:
- set up your AWS instance,
- export the Llama-2 model to the Neuron format,
- push the exported model to the Hugging Face Hub,
- deploy the model and use it in a chat application.

Note: This tutorial was created on an inf2.48xlarge AWS EC2 instance.

## Prerequisite: Set up the AWS environment

*You can skip this section if you are already running this notebook on your instance.*

In this example, we will use the *inf2.48xlarge* instance with 12 Neuron devices, corresponding to 24 Neuron Cores and the [Hugging Face Neuron Deep Learning AMI](https://aws.amazon.com/marketplace/pp/prodview-gr3e6yiscria2).

This guide doesn’t cover how to create the instance in detail. You can refer to the [official documentation](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EC2_GetStarted.html). At step 4, select the
[Hugging Face Neuron Deep Learning AMI](https://aws.amazon.com/marketplace/pp/prodview-gr3e6yiscria2), and at step 5, select an *inf2* instance type.

Once the instance is up and running, you can ssh into it. Instead of working in a terminal, however, you need to launch a Jupyter server on the instance to run this notebook.

To reach that server from your machine, first add port forwarding to the ssh command, which tunnels local traffic on port 8080 to the AWS instance.

From a local terminal, type the following commands:

```shell
HOSTNAME="" # IP address, e.g. ec2-3-80-....
KEY_PATH="" # local path to key, e.g. ssh/trn.pem

ssh -L 8080:localhost:8080 -i "$KEY_PATH" ubuntu@"$HOSTNAME"
```

On the instance, you can now start the Jupyter server.

```shell
python -m notebook --allow-root --port=8080
```

You should see the familiar Jupyter output, including a URL.

Click on that URL to open the Jupyter environment in your local browser.

You can then browse to this notebook (`notebooks/text-generation/llama2-13b-chatbot`) to continue with the guide.



In [1]:
# Special widgets are required for a nicer display
import sys
!{sys.executable} -m pip install ipywidgets

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com


## 1. Export the Llama 2 model to Neuron

For this guide, we will use the non-gated [NousResearch/Llama-2-13b-chat-hf](https://huggingface.co/NousResearch/Llama-2-13b-chat-hf) model, which is functionally equivalent to the original [meta-llama/Llama-2-13b-chat-hf](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf).

This model is part of the **Llama 2** family of models, and has been tuned to recognize chat interactions
between a *user* and an *assistant* (more on that later).

As explained in the [optimum-neuron documentation](https://huggingface.co/docs/optimum-neuron/guides/export_model#why-compile-to-neuron-model),
models need to be compiled and exported to a serialized format before running them on Neuron devices.

Fortunately, 🤗 **optimum-neuron** offers a [very simple API](https://huggingface.co/docs/optimum-neuron/guides/models#configuring-the-export-of-a-generative-model)
to export standard 🤗 [transformers models](https://huggingface.co/docs/transformers/index) to the Neuron format.

When exporting the model, we will specify two sets of parameters:

- using *compiler_args*, we specify how many cores the model should be deployed on (each Neuron device has two cores), and with which precision (here *float16*),
- using *input_shapes*, we set the static input and output dimensions of the model. All model compilers require static shapes, and Neuron is no exception. Note that the
*sequence_length* not only constrains the length of the input context, but also the length of the Key/Value cache, and thus the output length.
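As a rough sketch, the two parameter sets described above could be passed to the export as shown here. The keyword names `num_cores` and `auto_cast_type` are assumptions based on the optimum-neuron API available when this guide was written; check the documentation of your installed version, since later releases may name them differently. The helper `export_llama` is a hypothetical wrapper, and the values (24 cores, *float16*, batch size 1, sequence length 2048) are the ones discussed in this guide.

```python
# Sketch only: keyword names are assumptions, check your optimum-neuron version.
MODEL_ID = "NousResearch/Llama-2-13b-chat-hf"

# Compiler arguments: number of NeuronCores and numerical precision.
compiler_args = {"num_cores": 24, "auto_cast_type": "fp16"}

# Static input shapes: fixed at compile time, cannot be changed afterwards.
input_shapes = {"batch_size": 1, "sequence_length": 2048}


def export_llama(model_id=MODEL_ID):
    """Compile the model for Neuron (requires an Inferentia2 host)."""
    # Imported lazily so the settings above can be inspected on any machine.
    from optimum.neuron import NeuronModelForCausalLM

    return NeuronModelForCausalLM.from_pretrained(
        model_id, export=True, **compiler_args, **input_shapes
    )
```

Calling `export_llama()` on the instance triggers the compilation; the returned model can then be saved and pushed like any other exported model.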

Depending on your choice of parameters and Inferentia host, this may take from a few minutes to more than an hour.

For your convenience, we host a pre-compiled version of that model on the Hugging Face hub, so you can skip the export and start using the model immediately in section 2.

In [5]:
from optimum.neuron import NeuronModelForSentenceTransformers

# Sentence Transformers model from HuggingFace
model_id = "BAAI/bge-m3"
input_shapes = {"batch_size": 1, "sequence_length": 1024}  # mandatory shapes

# Load Transformers model and export it to AWS Inferentia2
model = NeuronModelForSentenceTransformers.from_pretrained(
    model_id, export=True, cache_dir="/tmp/bge-m3-cache", **input_shapes
)

# Save model to disk
model.save_pretrained("bge_m3_inf2/")

2024-05-02 05:08:00.000242:  9292  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache


neuronxcc-2.13.66.0+6dfecc895/MODULE_7ec41fbc85504de0b4d5 not found in aws-neuron/optimum-neuron-cache: 404 Client Error. (Request ID: Root=1-66331fb0-382a5c505472e8fe3990b034;9192fbbb-4b24-4666-ab7c-63677523be50)

Entry Not Found for url: https://huggingface.co/api/models/aws-neuron/optimum-neuron-cache/tree/main/neuronxcc-2.13.66.0%2B6dfecc895%2FMODULE_7ec41fbc85504de0b4d5?recursive=True&expand=False.
neuronxcc-2.13.66.0+6dfecc895/MODULE_7ec41fbc85504de0b4d5 does not exist on "main" 
The model will be recompiled.


***** Compiling bge-m3 *****


2024-05-02T05:08:18Z Running DoNothing
2024-05-02T05:08:18Z DoNothing finished after 0.000 seconds
2024-05-02T05:08:18Z Running AliasDependencyInduction
2024-05-02T05:08:18Z AliasDependencyInduction finished after 0.006 seconds
2024-05-02T05:08:18Z Running CanonicalizeIR
2024-05-02T05:08:18Z CanonicalizeIR finished after 0.021 seconds
2024-05-02T05:08:18Z Running LegalizeCCOpLayout
2024-05-02T05:08:18Z LegalizeCCOpLayout finished after 0.022 seconds
2024-05-02T05:08:18Z Running ResolveComplicatePredicates
2024-05-02T05:08:18Z ResolveComplicatePredicates finished after 0.020 seconds
2024-05-02T05:08:18Z Running AffinePredicateResolution
2024-05-02T05:08:18Z AffinePredicateResolution finished after 0.021 seconds
2024-05-02T05:08:18Z Running EliminateDivs
2024-05-02T05:08:18Z EliminateDivs finished after 0.019 seconds
2024-05-02T05:08:18Z Running PerfectLoopNest
2024-05-02T05:08:18Z PerfectLoopNest finished after 0.020 seconds
2024-05-02T05:08:18Z Running Simplifier
2024-05-02T05:08:18Z S

[Compilation Time] 141.89 seconds.
[Total compilation Time] 141.89 seconds.


2024-05-02 05:10:26.000737:  9292  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-05-02 05:10:26.000740:  9292  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache


Model cached in: /var/tmp/neuron-compile-cache/neuronxcc-2.13.66.0+6dfecc895/MODULE_24e2f560880289a86b4e.


This probably took a while.

Fortunately, you will need to do this only once because you can save your model and reload it later.

In [6]:
model.save_pretrained("bge_m3_inf2")

Even better, you can push it to the [Hugging Face Hub](https://huggingface.co/models).

For that, you need to be logged in to a [Hugging Face account](https://huggingface.co/join).

If you are not already logged in on your instance, you will now be prompted for an access token.

In [7]:
from huggingface_hub import notebook_login

notebook_login(new_session=False)

User is already logged in.


By default, the model will be uploaded to your account (organization equal to your user name).

Feel free to edit the cell below if you want to upload the model to a specific [Hugging Face organization](https://huggingface.co/docs/hub/organizations).

In [9]:
from huggingface_hub import whoami

org = whoami()["name"]

repo_id = f"{org}/bge_m3_inf2"

model.push_to_hub("bge_m3_inf2", repository_id=repo_id)

model.neuron:   0%|          | 0.00/2.27G [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.1M [00:00<?, ?B/s]

### A few more words about export parameters

The minimum memory required to load a model can be computed with:

```
   memory = bytes per parameter * number of parameters
```

The **Llama 2 13B** model uses *float16* weights (stored on 2 bytes) and has 13 billion parameters, which means it requires at least 2 bytes × 13 billion, or about 26GB, of memory to store its weights.

Each NeuronCore has 16GB of memory, which means that a 26GB model cannot fit on a single NeuronCore.
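The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the helper names are hypothetical, sizes use decimal gigabytes to match the rounded figures in the text, and the resulting core count is only a lower bound for the weights, ignoring every other allocation.

```python
import math

GB = 1e9  # decimal gigabytes, matching the rounded figures in the text


def weights_memory_gb(num_parameters, bytes_per_parameter):
    """Minimum memory needed to hold the model weights."""
    return num_parameters * bytes_per_parameter / GB


def min_cores_for_weights(memory_gb, core_memory_gb=16):
    """Smallest number of NeuronCores whose combined memory holds the weights."""
    return math.ceil(memory_gb / core_memory_gb)


llama_13b_gb = weights_memory_gb(13e9, 2)   # 13 billion float16 parameters
print(llama_13b_gb)                         # 26.0
print(min_cores_for_weights(llama_13b_gb))  # 2
```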

In reality, the total space required is greater than the weights alone, because the attention layer projections are cached (KV caching).
This caching mechanism grows memory allocations linearly with sequence length and batch size.

Here we set the *batch_size* to 1, meaning that we can only process one input prompt at a time. We set the *sequence_length* to 2048, which corresponds to half the model's maximum capacity (4096).

The formula to evaluate the size of the KV cache is more involved as it also depends on parameters related to the model architecture, such as the width of the embeddings and the number of decoder blocks.
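A common back-of-the-envelope estimate is sketched below. It assumes standard multi-head attention, where each decoder block stores one key and one value vector of the embedding width per token, and it plugs in the published Llama 2 13B configuration (40 decoder blocks, embedding width 5120). The function name is hypothetical, and the memory actually allocated by the Neuron runtime may differ from this estimate.

```python
def kv_cache_bytes(batch_size, sequence_length, num_layers, hidden_size,
                   bytes_per_value=2):
    """Estimate the KV cache size: keys + values for every layer and token."""
    # Factor 2: one tensor for keys and one for values in each decoder block.
    return (2 * num_layers * batch_size * sequence_length
            * hidden_size * bytes_per_value)


size = kv_cache_bytes(batch_size=1, sequence_length=2048,
                      num_layers=40, hidden_size=5120)
print(f"{size / 1e9:.2f} GB")  # 1.68 GB
```

Doubling either `batch_size` or `sequence_length` doubles the result, which is the linear growth described above.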

The bottom line is that, to get very large language models to fit, tensor parallelism is used to split weights, data, and compute across multiple NeuronCores, keeping in mind that the memory on each core cannot exceed 16GB.

Note that increasing the number of cores beyond the minimum requirement almost always results in a faster model.
Increasing the tensor parallelism degree improves memory bandwidth which improves model performance.

To optimize performance it's recommended to use all cores available on the instance.

In this guide we use all 24 cores of the *inf2.48xlarge*, but this should be changed to 12 if you are
using an *inf2.24xlarge* instance.
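One way to avoid hard-coding the core count is a small lookup keyed by instance type. This is only a sketch: the table and the `num_cores_for` helper are hypothetical, and the table contains just the two instance types mentioned in this guide.

```python
# NeuronCore counts as stated in this guide; extend as needed.
NEURON_CORES = {"inf2.48xlarge": 24, "inf2.24xlarge": 12}


def num_cores_for(instance_type):
    """Return the number of NeuronCores to use for a known instance type."""
    try:
        return NEURON_CORES[instance_type]
    except KeyError:
        raise ValueError(f"Unknown instance type: {instance_type!r}") from None


print(num_cores_for("inf2.24xlarge"))  # 12
```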

## 2. Generate text using Llama 2 on AWS Inferentia2

Once your model has been exported, you can generate text using the transformers library, as described in detail in [this post](https://huggingface.co/blog/how-to-generate).

If you skipped the first section as suggested, don't worry: we will use a precompiled model already present on the hub instead.

In [10]:
from optimum.neuron import NeuronModelForCausalLM

try:
    model
except NameError:
    # Edit this to use another base model
    model = NeuronModelForCausalLM.from_pretrained(
        "aws-neuron/Llama-2-13b-chat-hf-neuron-latency"
    )

We will need a *Llama 2* tokenizer to convert the prompt strings to text tokens.

In [11]:
from transformers import AutoTokenizer

tokenizer_id = "BAAI/bge-m3"

tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)

The following generation strategies are supported:

- greedy search,
- multinomial sampling with top-k and top-p (with temperature).

Most logits pre-processing/filters (such as repetition penalty) are supported.
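The two strategies listed above map onto the standard `generate` keyword arguments of the transformers library. The sketch below assumes `model` is a causal language model exposing `generate` (such as an exported Llama 2 model) and `tokenizer` is its matching tokenizer. The helper name `generate_text` is hypothetical, and the sampling values are illustrative, not recommendations.

```python
# Greedy search: always pick the most likely next token.
GREEDY = {"do_sample": False}

# Multinomial sampling restricted by top-k / top-p, with temperature.
SAMPLING = {"do_sample": True, "top_k": 50, "top_p": 0.9, "temperature": 0.7}


def generate_text(model, tokenizer, prompt, max_new_tokens=128, **strategy):
    """Tokenize a prompt, generate a continuation and decode it to text."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, **strategy
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
```

For example, `generate_text(model, tokenizer, "What is deep learning?", **SAMPLING)` would sample a continuation, while passing `**GREEDY` makes the output deterministic. Remember that the prompt plus the generated tokens must fit within the *sequence_length* chosen at export time.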

In [13]:
# Run inference
prompt = "I like to eat apples"
encoded_input = tokenizer(prompt, return_tensors="pt")
outputs = model(**encoded_input)

token_embeddings = outputs.token_embeddings
sentence_embedding = outputs.sentence_embedding

print(f"token embeddings: {token_embeddings.shape}")  # torch.Size([1, 8, 1024])
print(f"sentence_embedding: {sentence_embedding.shape}")  # torch.Size([1, 1024])
print(token_embeddings)
print(sentence_embedding)

Padding input tensors, the padding side is: right.


token embeddings: torch.Size([1, 8, 1024])
sentence_embedding: torch.Size([1, 1024])
tensor([[[ 7.2326e-02,  5.7962e-01, -1.0161e+00,  ..., -9.5186e-02,
          -8.3638e-01,  4.9044e-01],
         [-2.9362e-02,  5.0794e-02, -4.4895e-01,  ...,  7.0900e-02,
          -6.4030e-01,  4.0967e-01],
         [ 7.3407e-01, -1.4117e-01, -5.7145e-01,  ...,  3.4220e-01,
          -5.4862e-01,  7.3620e-01],
         ...,
         [ 1.0254e+00, -3.7437e-01, -2.2843e-01,  ...,  1.9539e-02,
          -4.5130e-01, -5.5243e-02],
         [ 2.2791e-01,  1.3206e-01, -9.9534e-02,  ..., -7.7430e-04,
          -8.7546e-01,  5.6866e-01],
         [-3.7476e-01,  3.4859e-01,  1.5830e-01,  ...,  6.5258e-01,
          -5.3151e-01,  5.7769e-01]]])
tensor([[ 0.0027,  0.0217, -0.0380,  ..., -0.0036, -0.0313,  0.0183]])
