# Run Hugging Face mistralai/Mistral-7B-Instruct-v0.2 autoregressive sampling on Inf2 & Trn1

In this example we compile and deploy the Hugging Face [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) model for tensor parallel inference on Neuron using the `transformers-neuronx` package.

The example has the following main sections:
1. Set up the Jupyter Notebook
1. Install dependencies
1. Load the model
1. Perform autoregressive sampling using tensor parallelism

This Jupyter Notebook should be run on an Inf2 instance (`inf2.48xlarge`). To run it on a larger Trn1 instance (`trn1.32xlarge`), change the `tp_degree` specified in the "Load the model" section.

## Set up the Jupyter Notebook

The following steps set up Jupyter Notebook and launch this tutorial:
1. Clone the [AWS Neuron Samples](https://github.com/aws-neuron/aws-neuron-samples) repo to your instance:
```
git clone https://github.com/aws-neuron/aws-neuron-samples.git
```
2. Navigate to the `transformers-neuronx` inference samples folder:
```
cd aws-neuron-samples/torch-neuronx/transformers-neuronx/inference
```
3. Follow the instructions in [Jupyter Notebook QuickStart](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/notebook/setup-jupyter-notebook-steps-troubleshooting.html) to run Jupyter Notebook on your instance.
4. Locate this tutorial in your Jupyter Notebook session (`mistralai-Mistral-7b-Instruct-v0.2.ipynb`) and launch it. Follow the rest of the instructions in this tutorial. 

## Install dependencies
This tutorial requires the following pip packages:

 - `torch-neuronx`
 - `neuronx-cc`
 - `sentencepiece`
 - `transformers`
 - `transformers-neuronx`


Most of these packages are installed when you configure your environment using the [torch-neuronx inference setup guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/torch-neuronx.html#setup-torch-neuronx). Install the remaining dependencies here:

In [None]:
!pip install transformers-neuronx sentencepiece -U

## Load the model

The memory required to host any model can be computed with:
```
total memory = bytes per parameter * number of parameters
```
With weights cast to a 16-bit type such as `float16` or `bfloat16` (2 bytes per parameter), a 7 billion parameter model needs `2 * 7B` bytes, or about 14 GB, for its weights. In theory, this means the weights alone fit on a single NeuronCore (16 GB capacity), although the KV cache and activations also need memory at run time. In this example, we split the compute across 8 NeuronCores.
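As a quick check of the formula above, the arithmetic can be written as a small helper. This is only a sketch: `weight_memory_gb` is a hypothetical name, the parameter count is the nominal 7 billion rather than the exact checkpoint size, and the result covers weights only.

```python
def weight_memory_gb(num_parameters: float, bytes_per_parameter: int) -> float:
    """Return the memory needed for the weights alone, in decimal gigabytes."""
    return num_parameters * bytes_per_parameter / 1e9


# Nominal 7 billion parameters at 2 bytes each (float16 or bfloat16).
print(weight_memory_gb(7e9, 2))  # 14.0

# The same model kept in float32 (4 bytes per parameter).
print(weight_memory_gb(7e9, 4))  # 28.0
```

The second call shows why 16-bit casting matters here: at 4 bytes per parameter the weights would no longer fit in a single 16 GB NeuronCore.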

Increasing the `tp_degree` beyond the minimum requirement for a model almost always results in a faster model. A higher tensor parallelism degree increases both the available compute power and the memory bandwidth, both of which improve model performance. To minimize model latency, use the highest tensor parallelism degree that the instance supports.
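The trade-off between `tp_degree` and the number of model replicas is plain division, sketched below. The helper name `plan_tensor_parallelism` and the `NEURON_CORES` table are illustrative assumptions (24 NeuronCores for `inf2.48xlarge`, 32 for `trn1.32xlarge`), and the helper does not check whether a given `tp_degree` is actually supported for this model or instance.

```python
# Assumed NeuronCore counts per instance type; verify against the instance
# specifications for your environment before relying on them.
NEURON_CORES = {"inf2.48xlarge": 24, "trn1.32xlarge": 32}


def plan_tensor_parallelism(instance_type: str, tp_degree: int, weights_gb: float = 14.0):
    """Return (replicas that fit on the instance, weight GB held by each core)."""
    cores = NEURON_CORES[instance_type]
    if not 1 <= tp_degree <= cores:
        raise ValueError(f"tp_degree must be between 1 and {cores}")
    replicas = cores // tp_degree          # whole replicas only
    per_core_gb = weights_gb / tp_degree   # weights are sharded evenly across cores
    return replicas, per_core_gb


print(plan_tensor_parallelism("inf2.48xlarge", 8))  # (3, 1.75)
print(plan_tensor_parallelism("trn1.32xlarge", 8))  # (4, 1.75)
```

With `tp_degree=8`, each core holds about 1.75 GB of the roughly 14 GB of weights, which leaves most of the per-core memory for the KV cache and activations.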

In the following code, we use the `NeuronAutoModelForCausalLM` class to load a checkpoint directly from the Hugging Face Hub. The default model config supports sampling up to a sequence length of 2048. Tensor parallelism is enabled through the argument `tp_degree=8`, and `bfloat16` casting is enabled with the `amp='bf16'` flag. The model's computational graph is compiled by `neuronx-cc` for optimized inference on Neuron.

In [None]:
from transformers_neuronx import NeuronAutoModelForCausalLM

name = 'mistralai/Mistral-7B-Instruct-v0.2'

model = NeuronAutoModelForCausalLM.from_pretrained(
    name,           # The reference to the Hugging Face model
    tp_degree=8,    # The number of NeuronCores to shard the model across. Using 8 means 3 replicas can be used on an inf2.48xlarge
    amp='bf16',     # Ensure the model weights/compute are bfloat16 for faster compute
)
model.to_neuron()

## Perform autoregressive sampling using tensor parallelism

The following code uses the model to answer a prompt and streams the output token by token as it is produced. Tokens are selected with Top-K sampling.

In [None]:
import torch
from transformers import AutoTokenizer, TextStreamer

tokenizer = AutoTokenizer.from_pretrained(name)
streamer = TextStreamer(tokenizer)

prompt = "[INST] What is your favourite condiment? [/INST]"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

with torch.inference_mode():
    generated_sequences = model.sample(input_ids, sequence_length=2048, top_k=50, streamer=streamer)
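To make the `top_k=50` argument above more concrete, here is a minimal standard-library sketch of Top-K sampling for a single decoding step. It illustrates the general technique only and is not the `transformers-neuronx` implementation; `top_k_sample` and the toy `logits` list are hypothetical.

```python
import math
import random


def top_k_sample(logits, k, rng=random):
    """Sample one token id from the k highest-scoring logits."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and len(logits)")
    # Keep the indices of the k largest logits, best first.
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits only (subtract the max for numerical stability).
    max_logit = logits[top_ids[0]]
    weights = [math.exp(logits[i] - max_logit) for i in top_ids]
    # Draw one id in proportion to its renormalized probability.
    return rng.choices(top_ids, weights=weights, k=1)[0]


logits = [2.0, 0.5, 3.0, -1.0, 1.0]  # toy scores for a 5-token vocabulary
rng = random.Random(0)

print(top_k_sample(logits, k=1, rng=rng))  # 2 (k=1 always picks the argmax)
samples = {top_k_sample(logits, k=2, rng=rng) for _ in range(100)}
print(samples <= {0, 2})                   # True (only the two best ids can appear)
```

A smaller `k` makes the output more deterministic, while a larger `k` lets lower-probability tokens through and increases variety.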