# Deploying Mixtral on AWS Inferentia2

## 1. Spin up an EC2 instance

Here we'll be using:
- [Hugging Face Neuron Deep Learning AMI](https://aws.amazon.com/marketplace/pp/prodview-gr3e6yiscria2)
- `inf2.24xlarge` instance type (6 Inferentia2 devices, 12 NeuronCores)

## 2. Convert `Mixtral-8x7B-Instruct-v0.1` to AWS Neuron with `optimum-neuron`

In [None]:
# optimum[neuronx] comes pre-installed, but upgrade to the latest version just in case:
# https://huggingface.co/docs/optimum-neuron/installation
!pip install --upgrade --upgrade-strategy eager "optimum[neuronx]"

In [1]:
import sys
sys.path.append("/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages")

In [2]:
# login for gated repo
from huggingface_hub import interpreter_login

interpreter_login()


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token is valid (permission: write).
Your token has been saved in your configured git credential helpers (store).
Your 

**NOTE:**
1. The Mixtral implementation in `aws-neuron/transformers-neuronx` places [a requirement](https://github.com/aws-neuron/transformers-neuronx/blob/0623de20a3934f8d1b3cb73e1672138657134d7f/src/transformers_neuronx/mixtral/config.py#L57) on the value of `tp_degree`.
   - `optimum-neuron` [auto-populates](https://github.com/huggingface/optimum-neuron/blob/7439a2d32808ce19ddece2d9a34d047af46dd1b1/optimum/neuron/modeling_decoder.py#L174) `tp_degree` with the value of `num_cores`.
   - For this reason, the model can only be compiled with `num_cores` set to 8 or 16 (32 is only available on Trainium instances).

2. When compiling with `sequence_length=32768` (32k), I hit the following error:
   - > Estimated peak HBM usage (32.264610) exceeds 16GB. Neff won't be able to load on chip - Please open a support ticket at https://github.com/aws-neuron/aws-neuron-sdk/issues/new
   - Each decoder block allocates a static KV cache whose size is proportional to `batch_size * sequence_length`, and it seems that 32k does not fit on these Neuron devices. I used 8k here; a larger value may work, but I have not tested it.
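To make the scaling with `batch_size * sequence_length` concrete, here is a rough back-of-the-envelope sketch. The layer count, number of KV heads, and head dimension are assumptions taken from the published `Mixtral-8x7B` model config (32 layers, 8 KV heads, head dimension 128), and bf16 is assumed to use 2 bytes per value. It counts only the KV cache, not model weights or activation buffers, so it will not reproduce the compiler's peak HBM estimate.

```python
def kv_cache_bytes(
    batch_size: int,
    sequence_length: int,
    num_layers: int = 32,      # assumed from the Mixtral-8x7B config
    num_kv_heads: int = 8,     # assumed from the Mixtral-8x7B config
    head_dim: int = 128,       # assumed: hidden_size 4096 / 32 attention heads
    bytes_per_value: int = 2,  # bf16
) -> int:
    """Total static KV cache size in bytes, summed over all decoder blocks."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * batch_size * sequence_length


for seq_len in (8192, 32768):
    gib = kv_cache_bytes(batch_size=1, sequence_length=seq_len) / 2**30
    print(f"sequence_length={seq_len}: ~{gib:.0f} GiB of KV cache in total")
```

Under these assumptions the cache grows fourfold between 8k and 32k, and it grows linearly again with every increase in `batch_size`.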

In [6]:
from transformers import AutoTokenizer
from optimum.neuron import NeuronModelForCausalLM

# model id you want to compile
vanilla_model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

# configs for compiling model
compiler_args = {"num_cores": 8, "auto_cast_type": "bf16"}
input_shapes = {
    "sequence_length": 8192,  # max total tokens (prompt + generated); 32768 exceeded HBM (see note above)
    "batch_size": 1,  # batch size for the model
}

llm = NeuronModelForCausalLM.from_pretrained(vanilla_model_id, export=True, **input_shapes, **compiler_args)
tokenizer = AutoTokenizer.from_pretrained(vanilla_model_id)

# Save locally or upload to the Hugging Face Hub
save_directory = "./mixtral-neuron-bs1-sl8k"
llm.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)

config.json:   0%|          | 0.00/720 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/92.7k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/19 [00:00<?, ?it/s]

model-00001-of-00019.safetensors:   0%|          | 0.00/4.89G [00:00<?, ?B/s]

model-00002-of-00019.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00003-of-00019.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00004-of-00019.safetensors:   0%|          | 0.00/4.90G [00:00<?, ?B/s]

model-00005-of-00019.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00006-of-00019.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00007-of-00019.safetensors:   0%|          | 0.00/4.90G [00:00<?, ?B/s]

model-00008-of-00019.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00009-of-00019.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00010-of-00019.safetensors:   0%|          | 0.00/4.90G [00:00<?, ?B/s]

model-00011-of-00019.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00012-of-00019.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00013-of-00019.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00014-of-00019.safetensors:   0%|          | 0.00/4.90G [00:00<?, ?B/s]

model-00015-of-00019.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00016-of-00019.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00017-of-00019.safetensors:   0%|          | 0.00/4.90G [00:00<?, ?B/s]

model-00018-of-00019.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00019-of-00019.safetensors:   0%|          | 0.00/4.22G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/19 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

2024-05-22 15:10:53.000769:  3594  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-05-22 15:11:45.000387:  7713  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache


neuronxcc-2.13.66.0+6dfecc895/MODULE_4d814281da2495847317+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.


2024-05-22 15:11:45.000518:  7714  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache


neuronxcc-2.13.66.0+6dfecc895/MODULE_4d814281da2495847317+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.13.66.0+6dfecc895/MODULE_4d814281da2495847317+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.


2024-05-22 15:11:45.000550:  7713  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --target=trn1 --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/264b382f-800a-4644-9542-4bd331cc07a4/model.MODULE_4d814281da2495847317+2c2d707e.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/264b382f-800a-4644-9542-4bd331cc07a4/model.MODULE_4d814281da2495847317+2c2d707e.neff --model-type=transformer --auto-cast=none --verbose=35
2024-05-22 15:11:45.000598:  7715  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache


neuronxcc-2.13.66.0+6dfecc895/MODULE_68a36cfa373ded34f5dd+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.13.66.0+6dfecc895/MODULE_68a36cfa373ded34f5dd+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.


2024-05-22 15:11:45.000685:  7716  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache


neuronxcc-2.13.66.0+6dfecc895/MODULE_68a36cfa373ded34f5dd+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.


2024-05-22 15:11:45.000697:  7714  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --target=trn1 --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/48f22398-3dfe-4684-ae2d-f2d0d5c225c7/model.MODULE_68a36cfa373ded34f5dd+2c2d707e.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/48f22398-3dfe-4684-ae2d-f2d0d5c225c7/model.MODULE_68a36cfa373ded34f5dd+2c2d707e.neff --model-type=transformer --auto-cast=none --verbose=35


neuronxcc-2.13.66.0+6dfecc895/MODULE_bcb91f2271dd1a021698+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.13.66.0+6dfecc895/MODULE_bcb91f2271dd1a021698+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.


2024-05-22 15:11:45.000773:  7717  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache


neuronxcc-2.13.66.0+6dfecc895/MODULE_bcb91f2271dd1a021698+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.


2024-05-22 15:11:45.000794:  7715  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --target=trn1 --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/c8f76d93-abaa-4464-b065-e03ab21acd30/model.MODULE_bcb91f2271dd1a021698+2c2d707e.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/c8f76d93-abaa-4464-b065-e03ab21acd30/model.MODULE_bcb91f2271dd1a021698+2c2d707e.neff --model-type=transformer --auto-cast=none --verbose=35
2024-05-22 15:11:45.000804:  7718  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache


neuronxcc-2.13.66.0+6dfecc895/MODULE_9ac1f331374a321df585+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.13.66.0+6dfecc895/MODULE_9ac1f331374a321df585+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.


2024-05-22 15:11:45.000889:  7719  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache


neuronxcc-2.13.66.0+6dfecc895/MODULE_cfae1beea2c1f1b45827+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.13.66.0+6dfecc895/MODULE_9ac1f331374a321df585+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.


2024-05-22 15:11:45.000901:  7716  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --target=trn1 --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/b568a123-80fb-403e-b58e-54fe551a0f38/model.MODULE_9ac1f331374a321df585+2c2d707e.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/b568a123-80fb-403e-b58e-54fe551a0f38/model.MODULE_9ac1f331374a321df585+2c2d707e.neff --model-type=transformer --auto-cast=none --verbose=35


neuronxcc-2.13.66.0+6dfecc895/MODULE_cfae1beea2c1f1b45827+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.13.66.0+6dfecc895/MODULE_17b74ae2e9db9b7611df+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.13.66.0+6dfecc895/MODULE_cfae1beea2c1f1b45827+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.13.66.0+6dfecc895/MODULE_17b74ae2e9db9b7611df+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.


2024-05-22 15:11:45.000990:  7717  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --target=trn1 --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/e3821e64-925a-41a6-b721-eb612b8acf58/model.MODULE_cfae1beea2c1f1b45827+2c2d707e.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/e3821e64-925a-41a6-b721-eb612b8acf58/model.MODULE_cfae1beea2c1f1b45827+2c2d707e.neff --model-type=transformer --auto-cast=none --verbose=35


neuronxcc-2.13.66.0+6dfecc895/MODULE_a80e9b0727c1903a01d8+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.13.66.0+6dfecc895/MODULE_17b74ae2e9db9b7611df+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.


2024-05-22 15:11:46.000021:  7718  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --target=trn1 --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/7bbad063-8bc4-4cd9-b3c7-2ca88cdc26d9/model.MODULE_17b74ae2e9db9b7611df+2c2d707e.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/7bbad063-8bc4-4cd9-b3c7-2ca88cdc26d9/model.MODULE_17b74ae2e9db9b7611df+2c2d707e.neff --model-type=transformer --auto-cast=none --verbose=35


neuronxcc-2.13.66.0+6dfecc895/MODULE_a80e9b0727c1903a01d8+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.


2024-05-22 15:11:46.000072:  7720  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache


neuronxcc-2.13.66.0+6dfecc895/MODULE_a80e9b0727c1903a01d8+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.


2024-05-22 15:11:46.000106:  7719  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --target=trn1 --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/6d0282c9-91dc-4ac4-a61c-40413e441e08/model.MODULE_a80e9b0727c1903a01d8+2c2d707e.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/6d0282c9-91dc-4ac4-a61c-40413e441e08/model.MODULE_a80e9b0727c1903a01d8+2c2d707e.neff --model-type=transformer --auto-cast=none --verbose=35


neuronxcc-2.13.66.0+6dfecc895/MODULE_f47abf736d6ded2bb16f+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.13.66.0+6dfecc895/MODULE_f47abf736d6ded2bb16f+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.13.66.0+6dfecc895/MODULE_f47abf736d6ded2bb16f+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.


2024-05-22 15:11:46.000280:  7720  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --target=trn1 --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/2d477ec4-1c94-4c19-9c8c-9dfdffe725c5/model.MODULE_f47abf736d6ded2bb16f+2c2d707e.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/2d477ec4-1c94-4c19-9c8c-9dfdffe725c5/model.MODULE_f47abf736d6ded2bb16f+2c2d707e.neff --model-type=transformer --auto-cast=none --verbose=35
2024-05-22 15:11:46.000291:  7721  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-05-22 15:11:46.000319:  7722  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache


neuronxcc-2.13.66.0+6dfecc895/MODULE_12bcbdb90af04ab7e2b9+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.13.66.0+6dfecc895/MODULE_ea0ece2bdf519f2e0403+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.


2024-05-22 15:11:46.000512:  7723  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache


neuronxcc-2.13.66.0+6dfecc895/MODULE_ea0ece2bdf519f2e0403+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.13.66.0+6dfecc895/MODULE_12bcbdb90af04ab7e2b9+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.13.66.0+6dfecc895/MODULE_ea0ece2bdf519f2e0403+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.


2024-05-22 15:11:46.000542:  7722  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --target=trn1 --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/a5d15712-f74f-45e1-ac75-37e35d8749d0/model.MODULE_ea0ece2bdf519f2e0403+2c2d707e.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/a5d15712-f74f-45e1-ac75-37e35d8749d0/model.MODULE_ea0ece2bdf519f2e0403+2c2d707e.neff --model-type=transformer --auto-cast=none --verbose=35


neuronxcc-2.13.66.0+6dfecc895/MODULE_12bcbdb90af04ab7e2b9+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.


2024-05-22 15:11:46.000548:  7721  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --target=trn1 --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/5d8da60e-3613-4a5d-92ad-3a8f33ccebcf/model.MODULE_12bcbdb90af04ab7e2b9+2c2d707e.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/5d8da60e-3613-4a5d-92ad-3a8f33ccebcf/model.MODULE_12bcbdb90af04ab7e2b9+2c2d707e.neff --model-type=transformer --auto-cast=none --verbose=35


neuronxcc-2.13.66.0+6dfecc895/MODULE_a1b119d0049ffc6d6844+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.13.66.0+6dfecc895/MODULE_a1b119d0049ffc6d6844+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.13.66.0+6dfecc895/MODULE_a1b119d0049ffc6d6844+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.


2024-05-22 15:11:46.000725:  7723  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --target=trn1 --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/78d181cb-6388-4fb3-90cd-dbcea19e0f13/model.MODULE_a1b119d0049ffc6d6844+2c2d707e.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/78d181cb-6388-4fb3-90cd-dbcea19e0f13/model.MODULE_a1b119d0049ffc6d6844+2c2d707e.neff --model-type=transformer --auto-cast=none --verbose=35
2024-05-22 15:11:46.000730:  7724  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-05-22 15:11:46.000807:  7726  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache


neuronxcc-2.13.66.0+6dfecc895/MODULE_5c5b0185242d62a5fe07+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.


2024-05-22 15:11:46.000886:  7725  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache


neuronxcc-2.13.66.0+6dfecc895/MODULE_5c5b0185242d62a5fe07+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.13.66.0+6dfecc895/MODULE_5c5b0185242d62a5fe07+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.


2024-05-22 15:11:46.000942:  7724  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --target=trn1 --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/06329c53-85d3-4556-9342-7e906c9b8e72/model.MODULE_5c5b0185242d62a5fe07+2c2d707e.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/06329c53-85d3-4556-9342-7e906c9b8e72/model.MODULE_5c5b0185242d62a5fe07+2c2d707e.neff --model-type=transformer --auto-cast=none --verbose=35


neuronxcc-2.13.66.0+6dfecc895/MODULE_67f9203ce8026a3b1e65+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.13.66.0+6dfecc895/MODULE_67f9203ce8026a3b1e65+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.13.66.0+6dfecc895/MODULE_67f9203ce8026a3b1e65+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.


2024-05-22 15:11:47.000004:  7726  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --target=trn1 --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/f87ca4b9-acce-42e3-b79b-1de59884eb7e/model.MODULE_67f9203ce8026a3b1e65+2c2d707e.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/f87ca4b9-acce-42e3-b79b-1de59884eb7e/model.MODULE_67f9203ce8026a3b1e65+2c2d707e.neff --model-type=transformer --auto-cast=none --verbose=35


neuronxcc-2.13.66.0+6dfecc895/MODULE_42fdcc460043dc6159be+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.13.66.0+6dfecc895/MODULE_42fdcc460043dc6159be+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.13.66.0+6dfecc895/MODULE_42fdcc460043dc6159be+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.


2024-05-22 15:11:47.000092:  7725  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --target=trn1 --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/bb689ad4-06ba-4f16-93a2-8a7857308bf6/model.MODULE_42fdcc460043dc6159be+2c2d707e.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/bb689ad4-06ba-4f16-93a2-8a7857308bf6/model.MODULE_42fdcc460043dc6159be+2c2d707e.neff --model-type=transformer --auto-cast=none --verbose=35
..........................................performing partition vectorization on AG_2[[0, 25, 0, 0]]{1 nodes (1 sources, 0 stops)}. dags covered: {dag_25_TC_DST}
performing partition vectorization on AG_2[[0, 29, 0, 0, 0]]{1 nodes (1 sources, 0 stops)}. dags covered: {dag_29_TC_DST}
performing partition vectorization on AG_2[[0, 28, 0, 0]]{3 nodes (1 sources, 0 stops)}. dags covered: {dag_33_TC_SRC, dag_32, dag_28}
performing partition vectorization on AG_2[[0, 25, 0, 0]]{1 nodes (1 sources, 0 stops)}. dags covered: {dag_25_TC_DST}
performing

tokenizer_config.json:   0%|          | 0.00/1.46k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/72.0 [00:00<?, ?B/s]

('./mixtral-neuron-bs1-sl8k/tokenizer_config.json',
 './mixtral-neuron-bs1-sl8k/special_tokens_map.json',
 './mixtral-neuron-bs1-sl8k/tokenizer.model',
 './mixtral-neuron-bs1-sl8k/added_tokens.json',
 './mixtral-neuron-bs1-sl8k/tokenizer.json')

## 3. Push to Hub

In [7]:
# optional: first remove safetensors checkpoint
import os
import shutil

shutil.rmtree(os.path.join(save_directory, "checkpoint"))

In [13]:
llm.push_to_hub(repository_id="andrewrreed/Mixtral-8x7B-Instruct-v0.1-neuron-bs1-sl8k", save_directory=save_directory, use_auth_token=True)

64de654e0a6a8a80a36b.neff:   0%|          | 0.00/15.6M [00:00<?, ?B/s]

27b9ae2747e681472f0f.neff:   0%|          | 0.00/12.8M [00:00<?, ?B/s]

002e47ecb433d86b46c1.neff:   0%|          | 0.00/12.0M [00:00<?, ?B/s]

6d25eb142954e695ea56.neff:   0%|          | 0.00/3.53M [00:00<?, ?B/s]

347ecd13e9a213b3c6a3.neff:   0%|          | 0.00/12.1M [00:00<?, ?B/s]

Upload 15 LFS files:   0%|          | 0/15 [00:00<?, ?it/s]

82e01244d3f6ba0ce352.neff:   0%|          | 0.00/13.0M [00:00<?, ?B/s]

8fe9a7a8f8fc6a776493.neff:   0%|          | 0.00/12.9M [00:00<?, ?B/s]

9dfc3aef573e69e28c6a.neff:   0%|          | 0.00/6.23M [00:00<?, ?B/s]

b2b22c0802c289f93ee9.neff:   0%|          | 0.00/13.7M [00:00<?, ?B/s]

be7b02fb092f8eb4557b.neff:   0%|          | 0.00/13.3M [00:00<?, ?B/s]

c9fd176cc710e975296d.neff:   0%|          | 0.00/13.1M [00:00<?, ?B/s]

d43a5de0ce956dc53000.neff:   0%|          | 0.00/12.9M [00:00<?, ?B/s]

d85f08804b747035c4fe.neff:   0%|          | 0.00/13.0M [00:00<?, ?B/s]

de4141881c11ee945be9.neff:   0%|          | 0.00/15.1M [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

## 4. Deploy with TGI

**Note:** We expose only the first 4 Neuron devices to the container. This matches `num_cores=8` above, because each device has two cores.

docker run -p 8080:80 \
       -v $(pwd)/data:/data \
       --device=/dev/neuron0 \
       --device=/dev/neuron1 \
       --device=/dev/neuron2 \
       --device=/dev/neuron3 \
       -e HF_TOKEN=${HF_TOKEN} \
       ghcr.io/huggingface/neuronx-tgi:latest \
       --model-id andrewrreed/Mixtral-8x7B-Instruct-v0.1-neuron-bs1-sl8k \
       --max-batch-size 1 \
       --max-input-length 8191 \
       --max-total-tokens 8192
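The relationship between the compile-time settings and the flags in the command above can be sketched as follows. The helper names are hypothetical and not part of TGI or `optimum-neuron`. The sketch assumes two cores per device, as noted above, and that `--max-input-length` is one less than `--max-total-tokens`, as in the command.

```python
import math
from typing import List

CORES_PER_DEVICE = 2  # each Neuron device exposes two cores


def neuron_device_flags(num_cores: int) -> List[str]:
    """Return --device flags for the first devices needed to cover num_cores."""
    if num_cores < 1:
        raise ValueError("num_cores must be positive")
    num_devices = math.ceil(num_cores / CORES_PER_DEVICE)
    return [f"--device=/dev/neuron{i}" for i in range(num_devices)]


def tgi_length_flags(sequence_length: int, batch_size: int) -> List[str]:
    """Return TGI limits that match the shapes the model was compiled with."""
    return [
        "--max-batch-size", str(batch_size),
        "--max-input-length", str(sequence_length - 1),
        "--max-total-tokens", str(sequence_length),
    ]


print(" ".join(neuron_device_flags(8)))
print(" ".join(tgi_length_flags(8192, 1)))
```

If the model were recompiled with `num_cores=16`, the same helper would list eight devices, which is more than the six available on an `inf2.24xlarge`.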

## 5. Query the model

In [3]:
!curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is AWS inferentia2 and why would I want to use it?","parameters":{"max_new_tokens":1024}}' \
    -H 'Content-Type: application/json'

{"generated_text":"\n\nAWS Inferentia2 is a new machine learning inference chip designed and built by Amazon Web Services (AWS). It is optimized for deep learning inference workloads and is designed to deliver high performance and low latency at a low cost.\n\nThere are several reasons why you might want to use AWS Inferentia2:\n\n1. High performance: AWS Inferentia2 is designed to deliver high performance for deep learning inference workloads. It can process up to 128 TOPS (tera operations per second) and has a memory bandwidth of up to 3,000 GB/s. This allows it to handle large, complex models and deliver fast, accurate results.\n2. Low latency: AWS Inferentia2 is also designed to deliver low latency for inference workloads. It has a latency of less than 1 millisecond for many common inference tasks, which is important for applications that require real-time responses.\n3. Low cost: AWS Inferentia2 is designed to be cost-effective, with a price that is significantly lower than other 
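The same request can also be sent from Python. This is a minimal sketch using only the standard library. It assumes the container from step 4 is listening on `127.0.0.1:8080`, and the function names are illustrative rather than part of any client library.

```python
import json
import urllib.request


def build_generate_request(
    prompt: str,
    max_new_tokens: int = 1024,
    base_url: str = "http://127.0.0.1:8080",  # assumed local TGI endpoint
) -> urllib.request.Request:
    """Build the POST request for the /generate route used in the curl call."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, **kwargs) -> str:
    """Send the request and return the generated text from the JSON response."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["generated_text"]


# Example (requires the server to be running):
# print(generate("What is AWS inferentia2 and why would I want to use it?"))
```

Keeping request construction separate from the network call makes the payload easy to inspect without a running server.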