# Deploying Mixtral on AWS Inferentia2

## 1. Spin up an EC2 instance

Here we'll be using:
- [Hugging Face Neuron Deep Learning AMI](https://aws.amazon.com/marketplace/pp/prodview-gr3e6yiscria2)
- `inf2.24xlarge` instance type

## 2. Convert `Mixtral-8x7B-Instruct-v0.1` to AWS Neuron with `optimum-neuron`

In [1]:
# optimum[neuronx] comes pre-installed, but upgrade to the latest version:
# https://huggingface.co/docs/optimum-neuron/installation
!pip install --upgrade --upgrade-strategy eager "optimum[neuronx]" ipywidgets

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com


In [6]:
# login for gated repo
from huggingface_hub import interpreter_login

interpreter_login()


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token is valid (permission: write).
Your token has been saved in your configured git credential helpers (store).
Your 

In [1]:
import sys
sys.path.append("/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages")

**NOTE:** The Mixtral implementation in `aws-neuron/transformers-neuronx` places [a requirement](https://github.com/aws-neuron/transformers-neuronx/blob/0623de20a3934f8d1b3cb73e1672138657134d7f/src/transformers_neuronx/mixtral/config.py#L57) on the value of `tp_degree`.
- `optimum-neuron` [auto-populates](https://github.com/huggingface/optimum-neuron/blob/7439a2d32808ce19ddece2d9a34d047af46dd1b1/optimum/neuron/modeling_decoder.py#L174) `tp_degree` with the value of `num_cores`.
- For this reason, you can only compile the model with `num_cores` set to 8 or 16 (32 is only available on Trainium instances).
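
The constraint above can be checked before starting a long export. The following is a minimal sketch: `check_num_cores` and `ALLOWED_NUM_CORES` are hypothetical names, not part of `optimum-neuron` or `transformers-neuronx`, and the allowed values restate the note above (with 8 and 16 assumed to be valid on Trainium as well) rather than being read from the library.

```python
# Hypothetical pre-flight check. The allowed values restate the note above;
# they are an assumption and are not read from transformers-neuronx itself.
ALLOWED_NUM_CORES = {
    "inferentia2": (8, 16),
    "trainium": (8, 16, 32),
}


def check_num_cores(num_cores: int, platform: str = "inferentia2") -> int:
    """Return num_cores if it is an accepted tp_degree for Mixtral, else raise."""
    allowed = ALLOWED_NUM_CORES[platform]
    if num_cores not in allowed:
        raise ValueError(
            f"num_cores={num_cores} is not supported for Mixtral on {platform}; "
            f"expected one of {allowed}"
        )
    return num_cores


# Build the compiler arguments only after the value has been validated.
compiler_args = {"num_cores": check_num_cores(8), "auto_cast_type": "bf16"}
print(compiler_args)
```

An `inf2.24xlarge` exposes 12 NeuronCores, so 8 is the only one of these values that fits on that instance type; 16 would need a larger Inferentia2 instance.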

In [8]:
from transformers import AutoTokenizer
from optimum.neuron import NeuronModelForCausalLM

# model id you want to compile
vanilla_model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

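# compilation settings: num_cores is used as tp_degree, weights are cast to bf16
compiler_args = {"num_cores": 8, "auto_cast_type": "bf16"}
input_shapes = {
    "sequence_length": 32768,  # maximum total sequence length (prompt + generated tokens)
    "batch_size": 1,  # static batch size the model is compiled for
}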

llm = NeuronModelForCausalLM.from_pretrained(vanilla_model_id, export=True, **input_shapes, **compiler_args)
tokenizer = AutoTokenizer.from_pretrained(vanilla_model_id)

# Save the compiled model and tokenizer locally (they can be pushed to the Hugging Face Hub afterwards)
save_directory = "./mixtral_neuron2"
llm.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)

Loading checkpoint shards:   0%|          | 0/19 [00:00<?, ?it/s]

2024-05-21 19:38:41.000650:  20050  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache


2024-May-21 19:40:11.264821 20050:20050 ERROR  TDRV:dmem_alloc_internal                     Failed to alloc DEVICE memory: 117440512
2024-May-21 19:40:11.389363 20050:20050 ERROR  TDRV:dml_dump                                Wrote nrt memory alloc debug info to /tmp/nrt_mem_log_device_0_664cf89b.csv
2024-May-21 19:40:11.560334 20050:20050 ERROR  TDRV:log_dev_mem                             Failed to allocate 112.000MB (usage: tensors) on ND 0:NC 0, current utilization:
	* total: 15.903GB
	* tensors: 15.903GB
	* runtime: 1.062KB
	* dma rings: 32.000KB

2024-May-21 19:40:11.888299 20050:20050 ERROR  TDRV:tensor_allocate                         Failed to allocate 117440512 bytes on DEVICE for tensor UNKNOWN.


RuntimeError: nrt_tensor_allocate status=4 message="Allocation Failure"
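
The allocation failure above is consistent with a rough memory budget: the log reports about 16 GB of device memory on the NeuronCore, and Mixtral's bf16 weights alone take a large share of that when split across 8 cores. The sketch below is back-of-the-envelope arithmetic, not a measurement. It assumes roughly 46.7B total parameters, the published Mixtral-8x7B configuration (32 layers, 8 key/value heads, head dimension 128), and an even split across cores, and it ignores any extra buffers the runtime allocates.

```python
# Back-of-the-envelope memory budget per NeuronCore (assumed values, not measured).
GIB = 1024**3

num_params = 46.7e9      # assumed total parameter count of Mixtral-8x7B
bytes_per_value = 2      # bf16
num_cores = 8            # tp_degree used in the cell above
num_layers = 32          # assumed from the published model config
num_kv_heads = 8         # assumed from the published model config
head_dim = 128           # assumed: hidden size 4096 / 32 attention heads
sequence_length = 32768
batch_size = 1

# Weights are assumed to be sharded evenly across the cores.
weights_per_core = num_params * bytes_per_value / num_cores

# Keys and values are both cached, hence the leading factor of 2.
kv_cache_total = (
    2 * num_layers * num_kv_heads * head_dim
    * sequence_length * batch_size * bytes_per_value
)
kv_cache_per_core = kv_cache_total / num_cores

print(f"weights per core:  {weights_per_core / GIB:.1f} GiB")
print(f"KV cache per core: {kv_cache_per_core / GIB:.2f} GiB")
```

Under these assumptions, weights plus cache come to roughly 11.4 GiB per core, which leaves limited headroom within the 15.9 GB the log reports. Anything else resident on the device, such as buffers left over from an earlier attempt in the same kernel, could plausibly exhaust it. This is one possible explanation, not a confirmed diagnosis. Raising `batch_size` to 4 scales the cache term by four under the same arithmetic.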

In [7]:
from transformers import AutoTokenizer
from optimum.neuron import NeuronModelForCausalLM

# model id you want to compile
vanilla_model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

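# compilation settings: num_cores is used as tp_degree, weights are cast to bf16
compiler_args = {"num_cores": 8, "auto_cast_type": "bf16"}
input_shapes = {
    "sequence_length": 32768,  # maximum total sequence length (prompt + generated tokens)
    "batch_size": 4,  # static batch size the model is compiled for
}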

llm = NeuronModelForCausalLM.from_pretrained(vanilla_model_id, export=True, **input_shapes, **compiler_args)
tokenizer = AutoTokenizer.from_pretrained(vanilla_model_id)

# Save the compiled model and tokenizer locally (they can be pushed to the Hugging Face Hub afterwards)
save_directory = "./mixtral_neuron"
llm.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)

Loading checkpoint shards:   0%|          | 0/19 [00:00<?, ?it/s]

2024-05-21 16:43:38.000891:  20050  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-05-21 16:49:44.000719:  20650  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-05-21 16:49:44.000807:  20651  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache


neuronxcc-2.13.66.0+6dfecc895/MODULE_0ee6a82628ec518e67cf+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.13.66.0+6dfecc895/MODULE_e899a005793b73bcce61+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.

2024-05-21 16:49:45.000179:  20650  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --target=trn1 --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/dd215b5d-e505-476c-9154-4ffa49973022/model.MODULE_0ee6a82628ec518e67cf+2c2d707e.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/dd215b5d-e505-476c-9154-4ffa49973022/model.MODULE_0ee6a82628ec518e67cf+2c2d707e.neff --model-type=transformer --auto-cast=none --verbose=35


neuronxcc-2.13.66.0+6dfecc895/MODULE_e899a005793b73bcce61+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.


2024-05-21 16:49:45.000235:  20651  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --target=trn1 --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/0219b7b4-c825-4abf-b53a-3262744521de/model.MODULE_e899a005793b73bcce61+2c2d707e.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/0219b7b4-c825-4abf-b53a-3262744521de/model.MODULE_e899a005793b73bcce61+2c2d707e.neff --model-type=transformer --auto-cast=none --verbose=35
....performing partition vectorization on AG_2[[0, 25, 0, 0, 0]]{1 nodes (1 sources, 0 stops)}. dags covered: {dag_25_TC_DST}
performing partition vectorization on AG_2[[0, 28, 0, 0, 0]]{3 nodes (1 sources, 0 stops)}. dags covered: {dag_28, dag_33_TRANSPOSE_DST, dag_34_TC_SRC}

CalledProcessError: Command '['neuronx-cc', 'compile', '--target=trn1', '--framework=XLA', '/tmp/ubuntu/neuroncc_compile_workdir/dd215b5d-e505-476c-9154-4ffa49973022/model.MODULE_0ee6a82628ec518e67cf+2c2d707e.hlo_module.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/dd215b5d-e505-476c-9154-4ffa49973022/model.MODULE_0ee6a82628ec518e67cf+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']' returned non-zero exit status 70.

In [3]:
llm

NameError: name 'llm' is not defined

In [None]:
llm.config