# Deploying Mixtral on AWS Inferentia2

## 1. Spin up an EC2 instance

Here we'll be using:
- [Hugging Face Neuron Deep Learning AMI](https://aws.amazon.com/marketplace/pp/prodview-gr3e6yiscria2)
- `inf2.24xlarge` instance type (6 Inferentia2 devices, i.e. 12 Neuron cores)

## 2. Convert `Mixtral-8x7B-Instruct-v0.1` to AWS Neuron with `optimum-neuron`

In [None]:
# optimum[neuronx] comes pre-installed, but upgrade to the latest version just in case:
# https://huggingface.co/docs/optimum-neuron/installation
!pip install --upgrade --upgrade-strategy eager "optimum[neuronx]"

In [1]:
import sys
sys.path.append("/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages")

In [None]:
# login for gated repo
from huggingface_hub import interpreter_login

interpreter_login()

**NOTE:**
1. The Mixtral implementation in `aws-neuron/transformers-neuronx` has [a requirement](https://github.com/aws-neuron/transformers-neuronx/blob/0623de20a3934f8d1b3cb73e1672138657134d7f/src/transformers_neuronx/mixtral/config.py#L57) on the value of `tp_degree`.
   - `tp_degree` is [auto-populated](https://github.com/huggingface/optimum-neuron/blob/7439a2d32808ce19ddece2d9a34d047af46dd1b1/optimum/neuron/modeling_decoder.py#L174) in `optimum-neuron` to be equal to `num_cores`.
   - For this reason, you can only compile the model with `num_cores` set to 8 or 16 (32 is only available on Trainium instances).

2. When compiling with `sequence_length=32768` (32k), I hit the following error:
   - > Estimated peak HBM usage (32.264610) exceeds 16GB. Neff won't be able to load on chip - Please open a support ticket at https://github.com/aws-neuron/aws-neuron-sdk/issues/new
   - Each decoder block needs to allocate a static KV cache proportional to `batch_size * sequence_length`, and it seems that 32k will not fit on these Neuron devices. I used 8k here, but a higher value may also work (this still needs testing).
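
To make the scaling in point 2 concrete, here is a rough back-of-the-envelope sketch of the KV cache size. It assumes the published Mixtral-8x7B configuration (32 decoder layers, 8 key/value heads, head dimension 128) and 2 bytes per value for bf16; the function name `kv_cache_bytes` is illustrative. It covers only the KV cache, so it will not reproduce the compiler's peak-HBM estimate, which presumably also accounts for weights and intermediate buffers.

```python
def kv_cache_bytes(sequence_length, batch_size=1, num_layers=32,
                   num_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Approximate total KV cache size in bytes across all decoder layers."""
    # Factor 2 accounts for storing both keys and values.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * sequence_length * batch_size


for seq_len in (8192, 32768):
    gib = kv_cache_bytes(seq_len) / 1024**3
    print(f"sequence_length={seq_len}: ~{gib:.1f} GiB of KV cache at batch_size=1")
```

Under these assumptions the cache grows linearly, from about 1 GiB at 8k to about 4 GiB at 32k, and doubles again with each doubling of `batch_size`.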

In [None]:
from transformers import AutoTokenizer
from optimum.neuron import NeuronModelForCausalLM

# model id you want to compile
vanilla_model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

# compilation settings
compiler_args = {"num_cores": 8, "auto_cast_type": "bf16"}
input_shapes = {
    "sequence_length": 8192,  # max total sequence length (prompt + generated tokens); 32768 did not fit
    "batch_size": 1,  # static batch size the model is compiled for
}

llm = NeuronModelForCausalLM.from_pretrained(vanilla_model_id, export=True, **input_shapes, **compiler_args)
tokenizer = AutoTokenizer.from_pretrained(vanilla_model_id)

# Save locally or upload to the Hugging Face Hub
save_directory = "./mixtral-neuron-bs1-sl8k"
llm.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)
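
Before uploading, it can be worth checking that the saved artifacts load and generate. The following is a minimal sketch, assuming it runs on the same Neuron instance and that the directory saved above still contains everything written by `save_pretrained`; the helper name `smoke_test` is hypothetical.

```python
def smoke_test(model_dir, prompt, max_new_tokens=64):
    """Load a compiled model from disk and generate a short completion."""
    # Imported lazily so the function can be defined on machines without the Neuron packages.
    from transformers import AutoTokenizer
    from optimum.neuron import NeuronModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    # No export=True here: the already-compiled artifacts are reused.
    model = NeuronModelForCausalLM.from_pretrained(model_dir)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example (requires a Neuron instance and the directory saved above):
# print(smoke_test("./mixtral-neuron-bs1-sl8k", "What is AWS Inferentia2?"))
```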

## 3. Push to Hub

In [7]:
# optional: remove the original safetensors checkpoint first to reduce the upload size
import os
import shutil

shutil.rmtree(os.path.join(save_directory, "checkpoint"))

In [None]:
llm.push_to_hub(repository_id="andrewrreed/Mixtral-8x7B-Instruct-v0.1-neuron-bs1-sl8k", save_directory=save_directory, use_auth_token=True)

## 4. Deploy with TGI

**Note:** We expose only the first 4 Neuron devices to the container, which matches `num_cores=8` above (each Inferentia2 device has two Neuron cores).

```
docker run -p 8080:80 \
       -v $(pwd)/data:/data \
       --device=/dev/neuron0 \
       --device=/dev/neuron1 \
       --device=/dev/neuron2 \
       --device=/dev/neuron3 \
       -e HF_TOKEN=${HF_TOKEN} \
       ghcr.io/huggingface/neuronx-tgi:latest \
       --model-id andrewrreed/Mixtral-8x7B-Instruct-v0.1-neuron-bs1-sl8k \
       --max-batch-size 1 \
       --max-input-length 8191 \
       --max-total-tokens 8192
```
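
The device count in the note before the command follows from simple arithmetic: with two Neuron cores per device, 8 cores need 4 devices. A small sketch that derives the `--device` flags for a given core count (the helper name `neuron_device_flags` is illustrative and not part of any tool):

```python
import math

CORES_PER_DEVICE = 2  # each Inferentia2 device exposes two Neuron cores


def neuron_device_flags(num_cores):
    """Return the docker --device flags needed to expose `num_cores` Neuron cores."""
    num_devices = math.ceil(num_cores / CORES_PER_DEVICE)
    return [f"--device=/dev/neuron{i}" for i in range(num_devices)]


print(neuron_device_flags(8))
```

If the model is recompiled with a different `num_cores`, the list of devices passed to `docker run` should change accordingly.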

## 5. Query the model

In [4]:
!curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is AWS inferentia2 and why would I want to use it?","parameters":{"max_new_tokens":1024}}' \
    -H 'Content-Type: application/json'

{"generated_text":"\n\nAWS Inferentia2 is a new machine learning inference chip designed and built by Amazon Web Services (AWS). It is optimized for deep learning inference workloads and is designed to deliver high performance and low latency at a low cost.\n\nThere are several reasons why you might want to use AWS Inferentia2:\n\n1. High performance: AWS Inferentia2 is designed to deliver high performance for deep learning inference workloads. It can process up to 128 TOPS (tera operations per second) and has a memory bandwidth of up to 3,000 GB/s. This allows it to handle large, complex models and deliver fast, accurate results.\n2. Low latency: AWS Inferentia2 is also designed to deliver low latency for inference workloads. It has a latency of less than 1 millisecond for many common inference tasks, which is important for applications that require real-time responses.\n3. Low cost: AWS Inferentia2 is designed to be cost-effective, with a price that is significantly lower than other 
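
The same request can also be sent from Python. This is a minimal sketch using only the standard library; it assumes the endpoint and payload shape of the curl call above, and the helper names `build_generate_request` and `generate` are illustrative.

```python
import json
import urllib.request


def build_generate_request(prompt, max_new_tokens=1024,
                           url="http://127.0.0.1:8080/generate"):
    """Build a POST request equivalent to the curl call above."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the `generated_text` field of the response."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as response:
        return json.loads(response.read())["generated_text"]


# Example (requires the TGI container from step 4 to be running):
# print(generate("What is AWS inferentia2 and why would I want to use it?"))
```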