# Run Hugging Face `mistralai/Mixtral-8x7B-v0.1` autoregressive sampling on Inf2 & Trn1

In this example, we compile and deploy the Hugging Face [mistralai/Mixtral-8x7B-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1) model for tensor parallel inference on AWS Neuron devices using the `transformers-neuronx` package.

The example has the following main sections:
1. Set up the Jupyter Notebook
1. Install dependencies
1. Perform autoregressive sampling

This Jupyter Notebook can be run on an Inf2 instance (`inf2.48xlarge`) or Trn1 instance (`trn1.32xlarge`).

## Set up the Jupyter Notebook

The following steps set up Jupyter Notebook and launch this tutorial:
1. Clone the [AWS Neuron Samples](https://github.com/aws-neuron/aws-neuron-samples) repo to your instance using
```
git clone https://github.com/aws-neuron/aws-neuron-samples.git
```
2. Navigate to the `transformers-neuronx` inference samples folder
```
cd aws-neuron-samples/torch-neuronx/transformers-neuronx/inference
```
3. Follow the instructions in [Jupyter Notebook QuickStart](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/notebook/setup-jupyter-notebook-steps-troubleshooting.html) to run Jupyter Notebook on your instance.
4. Locate this tutorial in your Jupyter Notebook session (`mixtral-8x7b-sampling.ipynb`) and launch it. Follow the remaining instructions in this tutorial.

## Install Dependencies
This tutorial requires the following pip packages:

 - `torch-neuronx`
 - `neuronx-cc`
 - `sentencepiece`
 - `transformers`
 - `transformers-neuronx`


Most of these packages will be installed when configuring your environment using the [torch-neuronx inference setup guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/torch-neuronx.html#setup-torch-neuronx). The additional dependencies must be installed here:

In [None]:
!pip install transformers-neuronx
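
As an optional sanity check after installation, the sketch below reports which of the packages listed above are visible to the current Python environment. It uses only the standard library (`importlib.metadata`); the helper name `installed_versions` is hypothetical and not part of any Neuron package.

```python
from importlib.metadata import PackageNotFoundError, version

# Package names taken from the dependency list above.
REQUIRED = [
    "torch-neuronx",
    "neuronx-cc",
    "sentencepiece",
    "transformers",
    "transformers-neuronx",
]

def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions(REQUIRED).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` should be installed before continuing.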

## Perform autoregressive sampling

Before running autoregressive sampling, we first consider the model memory footprint and the tensor parallelism (TP) degree to use. Due to the model size and the mixture-of-experts (MoE) implementation in `transformers-neuronx`, the supported TP degrees are {8, 16, 32}. A detailed analysis follows.

The memory required to host a model can be computed as:
```
total memory = bytes per parameter * number of parameters
```
The `mistralai/Mixtral-8x7B-v0.1` model consists of 46.7 billion parameters. With weights cast to `float16` (2 bytes per parameter), we need 93.4GB to store the model weights. In practice, the total space required is often greater than the model parameters alone because attention layer projections are cached (KV caching). This cache grows linearly with sequence length and batch size. The exact calculation can be found in the [AWS Neuron documentation page](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/appnotes/transformers-neuronx/generative-llm-inference-with-neuron.html).
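
The arithmetic above can be reproduced with a few lines of Python. This is a rough sketch rather than the exact calculation from the Neuron documentation: the KV cache estimate uses the common formula of one key and one value tensor per layer, and the layer count, KV head count, and head dimension (32, 8, 128) are assumptions taken from the published model configuration, not from this tutorial.

```python
NUM_PARAMS = 46.7e9   # parameter count quoted in the text
BYTES_FLOAT16 = 2     # bytes per float16 value

def weight_memory_gb(num_params, bytes_per_param):
    """total memory = bytes per parameter * number of parameters, in GB."""
    return num_params * bytes_per_param / 1e9

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=BYTES_FLOAT16):
    """Approximate KV cache size; the factor 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

weights_gb = weight_memory_gb(NUM_PARAMS, BYTES_FLOAT16)

# Assumed Mixtral-8x7B shape: 32 layers, 8 KV heads, head dimension 128.
kv_mib = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                        seq_len=1024, batch_size=1) / 2**20

print(f"weights: {weights_gb:.1f} GB, KV cache: {kv_mib:.0f} MiB")
```

Under these assumptions the KV cache at a sequence length of 1024 and batch size 1 is small next to the weights, but it scales linearly with both sequence length and batch size.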

To get very large language models to fit on Inf2 & Trn1, tensor parallelism is used to split weights, data, and compute across multiple NeuronCores, each equipped with 16GB high-bandwidth memory (HBM). For this model, we need at least 6 NeuronCores. 
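
The minimum core count follows from dividing the weight footprint by the memory available per NeuronCore and rounding up. The sketch below illustrates that step only; it ignores the KV cache and any runtime overhead, so the real minimum may be higher. The helper name `min_neuron_cores` is hypothetical.

```python
import math

HBM_PER_CORE_GB = 16.0  # per-NeuronCore memory quoted in the text
WEIGHTS_GB = 93.4       # float16 weight footprint quoted in the text

def min_neuron_cores(model_gb, hbm_per_core_gb=HBM_PER_CORE_GB):
    """Smallest number of cores whose combined memory holds the weights."""
    return math.ceil(model_gb / hbm_per_core_gb)

print(min_neuron_cores(WEIGHTS_GB))  # 93.4 / 16 = 5.84, rounded up to 6
```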

The `mistralai/Mixtral-8x7B-v0.1` model adopts the MoE architecture with 8 experts in total. `transformers-neuronx` in Neuron SDK 2.18 employs expert parallelism for the MoE architecture, splitting the 8 experts across multiple NeuronCores. Increasing the TP degree beyond the minimum requirement almost always improves model performance because more compute and memory bandwidth are available. For better performance, use a higher TP degree, for example 32 on `trn1.32xlarge`. TP degree 24 is not supported on `inf2.48xlarge` for this model, so the maximum TP degree on `inf2.48xlarge` is 16. When running this model with TP degree 8, you can use [int8 weight storage](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/transformers-neuronx-developer-guide.html) to reduce the model memory footprint.
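
To make the TP degree guidance concrete, the sketch below picks the largest supported TP degree for each instance type and estimates the weight share per NeuronCore. The NeuronCore counts per instance (24 and 32) are assumptions not stated in this tutorial, the helper names are hypothetical, and the per-core figure assumes an idealized even split of the weights.

```python
SUPPORTED_TP_DEGREES = (8, 16, 32)  # supported degrees listed in the text

# Assumed NeuronCore counts per instance type (not stated in this tutorial).
NEURON_CORES = {"inf2.48xlarge": 24, "trn1.32xlarge": 32}

def best_tp_degree(instance_type):
    """Largest supported TP degree that fits on the instance."""
    cores = NEURON_CORES[instance_type]
    return max(tp for tp in SUPPORTED_TP_DEGREES if tp <= cores)

def weights_per_core_gb(num_params, bytes_per_param, tp_degree):
    """Weight memory per core in GB, assuming an even split across cores."""
    return num_params * bytes_per_param / tp_degree / 1e9

for instance in NEURON_CORES:
    print(instance, "-> tp_degree", best_tp_degree(instance))

# TP degree 8: float16 (2 bytes) versus int8 (1 byte) weight storage.
print(f"float16: {weights_per_core_gb(46.7e9, 2, 8):.2f} GB per core")
print(f"int8:    {weights_per_core_gb(46.7e9, 1, 8):.2f} GB per core")
```

At TP degree 8, float16 weights leave comparatively little of each core's 16GB for the KV cache and runtime buffers, which is why int8 weight storage is suggested for that configuration.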

Starting from Neuron SDK 2.18, `transformers-neuronx` supports directly loading Hugging Face models in safetensors format, and `save_pretrained_split` will be deprecated. Below, we use the `MixtralForSampling` class in `transformers-neuronx` to create the model from the checkpoint loaded from Hugging Face. We enable tensor parallelism with `tp_degree=16`, select the `float16` data type with `amp='f16'`, and set the maximum sequence length with `n_positions=1024`.

In [None]:
import os
import time
import torch
from transformers import AutoTokenizer
from transformers_neuronx.mixtral.model import MixtralForSampling

# set the directory for storing compiled model files
os.environ['NEURON_COMPILE_CACHE_URL'] = './neuron_cache'

# load mistralai/Mixtral-8x7B-v0.1 to the NeuronCores with 16-way tensor parallelism
neuron_model = MixtralForSampling.from_pretrained(
    'mistralai/Mixtral-8x7B-v0.1',
    batch_size=1,
    tp_degree=16,
    n_positions=1024,
    amp='f16')

# compile model
neuron_model.to_neuron()

# construct a tokenizer and encode prompt text
tokenizer = AutoTokenizer.from_pretrained('mistralai/Mixtral-8x7B-v0.1')
prompt = "Hello, I'm a language model,"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# run inference with top-k sampling (top_k=1 always picks the most likely token, i.e. greedy decoding)
with torch.inference_mode():
    start = time.time()
    generated_sequences = neuron_model.sample(input_ids, sequence_length=512, top_k=1) # sequence_length <= n_positions
    elapsed = time.time() - start

generated_sequences = [tokenizer.decode(seq) for seq in generated_sequences]
print(f'generated sequences {generated_sequences} in {elapsed} seconds')
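
The elapsed time printed above can be turned into a rough throughput figure. The sketch below is a standalone helper with a hypothetical name, `tokens_per_second`; it assumes that `sequence_length` counts the prompt tokens as well as the generated ones, so the number of new tokens is the difference between the two. Note that the measured time covers both prompt processing and token generation.

```python
def tokens_per_second(prompt_tokens, sequence_length, elapsed_seconds, batch_size=1):
    """Generated tokens per second, assuming sequence_length includes the prompt."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    new_tokens = (sequence_length - prompt_tokens) * batch_size
    return new_tokens / elapsed_seconds

# Illustrative numbers only: an 8-token prompt extended to 512 tokens in 10 seconds.
print(f"{tokens_per_second(8, 512, 10.0):.1f} tokens/s")
```

In the notebook, this could be called as `tokens_per_second(input_ids.shape[-1], 512, elapsed)` after the sampling cell has run.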