# Guide to deploy, benchmark, and profile Mistral Small 2501 with NXDI and VLLM on Trn1

This notebook provides a step-by-step guide for serving, profiling, and benchmarking the Mistral Small 24B model on a **Trn1** instance.

## Mistral Small 2501

[Mistral Small 3.0](https://mistral.ai/news/mistral-small-3) is a 24B-parameter language model from Mistral AI optimized for low-latency performance across common AI tasks. Released under the Apache 2.0 license, it features both pre-trained and instruction-tuned versions designed for efficient local deployment. The model achieves 81% accuracy on the MMLU benchmark and performs competitively with larger models like Llama 3.3 70B and Qwen 32B, while operating at three times the speed on equivalent hardware.

## Neuronx-Distributed-Inference (NxDI)

[NxD Inference](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/nxdi-overview.html#nxdi-overview) (where NxD stands for NeuronX Distributed) is an open-source PyTorch-based inference library that simplifies deep learning model deployment on AWS Inferentia and Trainium instances. Introduced with the Neuron SDK 2.21 release, it offers advanced inference capabilities, including continuous batching and speculative decoding for high-performance inference. Additionally, it can serve as an inference engine for vLLM, for seamless integration with the majority of customers' production deployment systems. ML developers can use the NxD Inference library at the level of abstraction that fits their inference use case.

## Overview

1. **Check/Install Dependencies** for AWS Neuron (NXDI, vLLM fork, etc.).
2. **Optional**: Install additional utilities (`inference-benchmarking` (lm_eval), InfluxDB, `llmperf` for performance benchmarking, etc.).
3. **Download** the Mistral Small 24B Instruct model.
4. **Spin Up** a VLLM server, benchmark and pull a profile.
   
### Prerequisites

- **Amazon EC2 Trn1.32xlarge instance** with AWS Neuron drivers and recommended PyTorch environment.
- **NXDI virtual environment** (e.g., `aws_neuronx_venv_pytorch_2_5_nxd_inference`) is required.

- To request a quota increase for `trn1.32xlarge` on EC2, follow these steps:

1. Navigate to the [Service Quotas console](https://console.aws.amazon.com/servicequotas/).
2. Choose Amazon EC2.
3. Review your default quota for the following resources:
   - `trn1.32xlarge` for ec2 on-demand use
4. If needed, request a quota increase for these resources.


<div class="alert alert-block alert-warning"> 

<b>NOTE:</b> To make sure that you have enough quotas to support your usage requirements, it's a best practice to monitor and manage your service quotas. Requests for Amazon EC2 service quota increases are subject to review by AWS engineering teams. Also, service quota increase requests aren't immediately processed when you submit a request. After your request is processed, you receive an email notification.
</div>

### Create Your EC2 instance

Follow the steps below to set up your EC2 instance:

#### Steps:
- Navigate to the EC2 dashboard from the AWS Management Console and launch your instance.
- Search for the Ubuntu 22.04 Neuron DLAMI.
- Choose Trn1.32xlarge as the instance size, or any other Neuron-based instance that can fit the model.
- Set the inbound rule for SSH to your local machine's IP address, or to anywhere (note that allowing traffic from any IPv4 address is not in accordance with security best practices; please secure these ports once you are done testing).
- Create and specify your SSH key in the instance configuration step. You will need your .pem file.
- Create your instance.
- Once you have launched your instance, open either your terminal or VS Code and follow the steps below:

#### SSH from PowerShell:

`$PUBLIC_DNS="paste your public ipv4 dns here" # public ipv4 DNS, e.g. ec2-3-80-.... from ec2 console`
`$KEY_PATH="paste ssh key path here" # local path to key, e.g. ssh/trn.pem`

`ssh -i $KEY_PATH -L 8888:127.0.0.1:8888 -L 8000:127.0.0.1:8000 -L 8086:127.0.0.1:8086 -L 3001:127.0.0.1:3001 ubuntu@$PUBLIC_DNS`

#### SSH from Linux/macOS:

`export PUBLIC_DNS="paste your public ipv4 dns here" # public ipv4 DNS, e.g. ec2-3-80-.... from ec2 console`
`export KEY_PATH="paste ssh key path here" # local path to key, e.g. ssh/trn.pem`

`ssh -i $KEY_PATH -L 8888:127.0.0.1:8888 -L 8000:127.0.0.1:8000 -L 8086:127.0.0.1:8086 -L 3001:127.0.0.1:3001 ubuntu@$PUBLIC_DNS`

You should now be connected to your EC2 instance over SSH.
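The four `-L` options in the command above each forward one local port to the same port on the instance. As a rough sketch, the command can be assembled from a table of ports, which also records what each port is used for later in this guide. The service labels and the helper name `build_ssh_command` are this sketch's own; 8086 is the default InfluxDB HTTP port, and the login user depends on the AMI (Ubuntu images normally use `ubuntu`).

```python
import shlex

# Local ports forwarded in this guide and the service each one reaches.
# The labels are this sketch's reading of the guide, not an official mapping.
FORWARDED_PORTS = {
    8888: "JupyterLab",
    8000: "vLLM OpenAI-compatible server",
    8086: "InfluxDB (default HTTP port)",
    3001: "neuron-profile view web UI",
}


def build_ssh_command(key_path: str, public_dns: str, user: str) -> str:
    """Return an ssh command line that forwards every port in FORWARDED_PORTS."""
    parts = ["ssh", "-i", key_path]
    for port in FORWARDED_PORTS:
        parts += ["-L", f"{port}:127.0.0.1:{port}"]
    parts.append(f"{user}@{public_dns}")
    # shlex.quote protects paths that contain spaces or shell metacharacters.
    return " ".join(shlex.quote(p) for p in parts)


# Placeholder key path and host name; substitute your own values.
print(build_ssh_command("ssh/trn.pem", "ec2-host.example.com", "ubuntu"))
```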

- Activate your NXDI venv:

`source /opt/aws_neuronx_venv_pytorch_2_5_nxd_inference/bin/activate`

- Start the Jupyter server:

`jupyter lab --no-browser --port 8888 --ip 0.0.0.0`

You should see a familiar jupyter output with a URL to the notebook.

`http://localhost:8888/....`

We can click on it, and a jupyter environment opens in our local browser. Upload this notebook to your jupyter environment and run the steps in the cells below.

---

## Install and Set up Dependencies

### 1. Validate / Activate Python Environment

Inside a Jupyter notebook, running `source myenv/bin/activate` in a cell will not persist the environment in subsequent cells, because each shell cell runs in its own subprocess. Please activate the venv in the terminal before starting the Jupyter server.

In [None]:
%%bash
# (Optional) Uncomment or modify the following line to activate a custom environment.
#source /opt/aws_neuronx_venv_pytorch_2_5_nxd_inference/bin/activate

echo 'Python environment check:'
which python
python --version
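Because a `%%bash` cell reports the interpreter of its own subprocess, it can also help to ask the notebook kernel directly. The following is a minimal sketch that compares the kernel's `sys.prefix` with the NxDI venv path used in this guide; the helper name `kernel_uses_venv` is hypothetical.

```python
import sys
from pathlib import Path


def kernel_uses_venv(expected_venv: str, prefix: str = sys.prefix) -> bool:
    """Return True when the interpreter prefix is the expected virtual environment."""
    return Path(prefix).resolve() == Path(expected_venv).resolve()


# Path of the NxDI venv used in this guide.
EXPECTED = "/opt/aws_neuronx_venv_pytorch_2_5_nxd_inference"

print("interpreter prefix:", sys.prefix)
print("kernel runs inside the NxDI venv:", kernel_uses_venv(EXPECTED))
```

If this prints `False`, the Jupyter server was most likely started before the venv was activated.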

In [None]:
%%writefile requirements.txt
torch==2.5.1
transformers==4.45.2
huggingface_hub
git-lfs

In [None]:
!pip install -U -r requirements.txt --quiet

In [None]:
! pip list | grep neuron

---

### 2. Install Neuron vLLM Fork

If you would like to serve your model via [vLLM](https://vllm.readthedocs.io/en/latest/) specialized for Neuron-based inference, you can install AWS Neuron's vLLM fork. NxD Inference integrates into vLLM by extending the model execution components responsible for loading and invoking models used in vLLM’s LLMEngine (see [link](https://docs.vllm.ai/en/latest/design/arch_overview.html#llm-engine) for more details on vLLM architecture). This means input processing, scheduling and output processing follow the default vLLM behavior.

You enable the Neuron integration in vLLM by setting the device type used by `vLLM` to `neuron`.

Currently, we support continuous batching and streaming generation in the NxD Inference vLLM integration. We are working with the vLLM community to enable support for other vLLM features like PagedAttention and Chunked Prefill on Neuron instances through NxD Inference in upcoming releases.


Skip this step if you do not need the vLLM server. Cloning and installing vLLM takes 8-10 minutes.


In [None]:
%%bash
set -euxo pipefail

if [ -d "/home/ubuntu/upstreaming-to-vllm" ]; then
    echo "Neuron vLLM fork already cloned. Skipping."
else
    echo "Cloning and installing AWS Neuron vLLM fork..."
    cd /home/ubuntu/
    git clone -b neuron-2.22-vllm-v0.7.2 https://github.com/aws-neuron/upstreaming-to-vllm.git #neuron 2.22 vllm version
    cd upstreaming-to-vllm
    pip install -r requirements-neuron.txt --quiet

    # Install in editable mode with device set to neuron
    VLLM_TARGET_DEVICE="neuron" pip install -e . --quiet
fi

---

### 3. (Optional) Install benchmarking and profiling tools

#### 3.1 Install llmperf

If you'd like to run benchmarks or load tests, you can install [llmperf](https://github.com/ray-project/llmperf). Skip if not needed.


In [None]:
%%bash
if pip show llmperf > /dev/null 2>&1; then
    echo "llmperf is already installed. Skipping."
else
    echo "Installing llmperf..."
    cd /home/ubuntu/
    git clone --quiet https://github.com/ray-project/llmperf.git > /dev/null 2>&1
    cd llmperf
    pip install -e . --quiet
fi

In [None]:
!pip list| grep neuron

#### 3.2 Install AWS Neuron Tools (If Needed)

This cell installs the Neuron packages for profiling and other tooling. If already installed, the script checks and skips. For more information, see [Installing Neuron Tools](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/index.html).

> **Note**: If you have your apt sources already configured and have installed the Neuron packages, you can skip this step.


In [None]:
%%bash
set -euxo pipefail

# Check if aws-neuronx-tools is installed
if dpkg -s aws-neuronx-tools > /dev/null 2>&1; then
    echo "aws-neuronx-tools is already installed. Skipping."
else
    echo "Installing aws-neuronx-tools..."
    . /etc/os-release

    sudo tee /etc/apt/sources.list.d/neuron.list > /dev/null <<EOF
deb https://apt.repos.neuron.amazonaws.com ${VERSION_CODENAME} main
EOF

    wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | sudo apt-key add -
    sudo apt-get update -y
    sudo apt-get install -y aws-neuronx-runtime-lib aws-neuronx-dkms aws-neuronx-tools
fi


#### 3.3 (Optional) Install InfluxDB 2.x

Install InfluxDB if using the Neuron Profiler

In [None]:
%%bash
if dpkg -s influxdb2 > /dev/null 2>&1; then
    echo "InfluxDB2 is already installed, skipping."
    if systemctl is-active --quiet influxdb; then
        echo "InfluxDB is already running."
    else
        sudo systemctl start influxdb
        echo "Setting up InfluxDB ..."
        # influx setup
    fi
else
    # Install InfluxDB
    wget -q https://repos.influxdata.com/influxdata-archive_compat.key
    echo '393e8779c89ac8d958f81f942f9ad7fb82a25e133faddaf92e15b16e6ac9ce4c influxdata-archive_compat.key' | sha256sum -c && \
      cat influxdata-archive_compat.key | gpg --dearmor | sudo tee /etc/apt/trusted.gpg.d/influxdata-archive_compat.gpg > /dev/null
    echo 'deb [signed-by=/etc/apt/trusted.gpg.d/influxdata-archive_compat.gpg] https://repos.influxdata.com/debian stable main' | sudo tee /etc/apt/sources.list.d/influxdata.list
    
    sudo apt-get update && sudo apt-get install influxdb2 influxdb2-cli -y
    sudo systemctl start influxdb
    
    # Run non-interactive influx setup with all necessary flags.
    # Replace the placeholder values below with your own credentials.
    influx setup \
      --username admin \
      --password testpassowrd \
      --org yourorg \
      --bucket yourbucket \
      --token yoursupersecrettoken \
      --force

fi

#### 3.4 Accuracy-benchmarking with lm_eval


Copy the [inference-benchmarking](https://github.com/aws-neuron/aws-neuron-samples/tree/master/inference-benchmarking/) directory to some location on your instance. 

In [None]:
! git clone https://github.com/aws-neuron/aws-neuron-samples.git

Change directory to your copy of inference-benchmarking. Install the other required dependencies in the same Python environment (e.g. aws_neuron_venv_pytorch if you followed the manual NxD Inference install) by running:

In [None]:
%%bash
cd /home/ubuntu/aws-neuron-samples/inference-benchmarking/
pip install -r requirements.txt --quiet

---

## 4. Download or Provide Your Model

Below is a template for downloading the model. You can skip or adjust if you already have a local model.

For more information on model checkpoint usage, see the [NxDI inference with Hugging Face-based models](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/feature-guide.html).

You will need to log in to Hugging Face. Get your token from https://huggingface.co/settings/tokens and paste it into the prompt that `notebook_login()` displays below.

In [None]:
!git config --global credential.helper store
from huggingface_hub import notebook_login
notebook_login()

In [None]:
#run the following code in the terminal to install git-lfs

`sudo apt-get update`

`sudo apt-get install git-lfs`

`git lfs install`

In [None]:
#check that git lfs is installed on path

In [None]:
!git lfs version

In [None]:
#start a tmux session and run the following code in the terminal:

`sudo apt-get update`

`sudo apt-get install tmux`

`tmux new -s mysession`

In [None]:
# run the following code to download the model in a tmux session since this may take a while - run in terminal

`git clone https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501`

In [None]:
!du -sh /home/ubuntu/Mistral-Small-24B-Instruct-2501/ #check if the full model was copied in
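A small total size from `du` usually means that Git LFS was not active during the clone, in which case the weight files are short text pointers rather than the real tensors. The sketch below looks for such pointer stubs by reading the first bytes of each file. It assumes the weights are stored as `*.safetensors` files in the top level of the repository; the helper name `find_lfs_pointers` is hypothetical.

```python
from pathlib import Path

# Every Git LFS pointer file starts with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(model_dir: str, pattern: str = "*.safetensors") -> list:
    """Return the names of weight files that are still Git LFS pointer stubs."""
    stubs = []
    for path in sorted(Path(model_dir).glob(pattern)):
        with path.open("rb") as fh:
            head = fh.read(len(LFS_POINTER_PREFIX))
        if head == LFS_POINTER_PREFIX:
            stubs.append(path.name)
    return stubs


# Only runs the check when the directory from this guide is present.
model_dir = Path("/home/ubuntu/Mistral-Small-24B-Instruct-2501")
if model_dir.is_dir():
    print("pointer stubs:", find_lfs_pointers(str(model_dir)))
```

An empty list suggests the weights were fetched; any listed file can typically be repaired by running `git lfs pull` inside the cloned repository.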

---

## 5. Compile and save the model, then run generation with HuggingFaceGenerationAdapter (`inference_demo.py`)

NxD Inference supports running inference with the HuggingFace generate interface. To use HuggingFace-style generation, create a HuggingFaceGenerationAdapter that wraps a Neuron application model. Then, you can call generate on the adapted model. In the cell below, we use the `inference_demo` script that NxDI provides to compile, save, and run a prompt with our Mistral Small 24B model. For more information on the flags we set, refer to the [NxDI API reference guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/api-guides/api-guide.html).

In [None]:
%%bash
# Replace MODEL_PATH with the path where you downloaded and saved the model files.
# COMPILED_MODEL_PATH is where the compiled model is saved; reuse the same path when launching the vLLM server.
MODEL_PATH="/home/ubuntu/Mistral-Small-24B-Instruct-2501/"
COMPILED_MODEL_PATH="/home/ubuntu/traced_model/Mistral-Small-24B-Instruct-2501/"
TP_DEGREE=32

inference_demo \
    --model-type llama \
    --task-type causal-lm \
        run \
        --model-path $MODEL_PATH \
        --compiled-model-path $COMPILED_MODEL_PATH \
        --torch-dtype bfloat16 \
        --start_rank_id 0 \
        --tp-degree $TP_DEGREE \
        --batch-size 1 \
        --max-context-length 12288 \
        --seq-len 12800 \
        --on-device-sampling \
        --top-k 1 \
        --do-sample \
        --fused-qkv \
        --sequence-parallel-enabled \
        --pad-token-id 2 \
        --enable-bucketing \
        --context-encoding-buckets 2048 4096 8192 12288 \
        --token-generation-buckets 2048 4096 8192 12800 \
        --prompt "What is annapurna labs?" 2>&1 | tee log
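The `--enable-bucketing` flags above compile the model for a fixed set of sequence lengths. The general idea behind bucketing is that an input is padded up to the smallest bucket that can hold it, so short prompts do not pay for the largest shape. The sketch below illustrates that idea with the bucket lists from the command; `pick_bucket` is a hypothetical helper, and the selection logic inside NxDI may differ in detail.

```python
from bisect import bisect_left

# Bucket sizes copied from the inference_demo command in this section.
CONTEXT_ENCODING_BUCKETS = [2048, 4096, 8192, 12288]
TOKEN_GENERATION_BUCKETS = [2048, 4096, 8192, 12800]


def pick_bucket(length: int, buckets: list) -> int:
    """Return the smallest bucket (from a sorted list) that can hold `length` tokens."""
    idx = bisect_left(buckets, length)
    if idx == len(buckets):
        raise ValueError(f"{length} tokens exceeds the largest bucket ({buckets[-1]})")
    return buckets[idx]


for prompt_len in (100, 3000, 12288):
    bucket = pick_bucket(prompt_len, CONTEXT_ENCODING_BUCKETS)
    print(f"prompt of {prompt_len} tokens -> bucket {bucket}, padding {bucket - prompt_len}")
```

Note that the largest context encoding bucket equals `--max-context-length` (12288) and the largest token generation bucket equals `--seq-len` (12800).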

---

## 6. vLLM demo and perf benchmarking - standalone model 

#### 6.1 Run Mistral Small 2501 on Trainium

This section runs online inference with Mistral Small 2501 and collects some performance results. The model must be compiled before it can be served; step 5 did this with the `inference_demo` command installed by neuronx-distributed-inference, which compiles the model and runs generation on the given input prompt. Note the path used to save the compiled model. Use the same path when launching the vLLM server so that the compiled model can be loaded without recompilation. Please refer to the [NxD Inference API Reference](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/api-guides/api-guide.html) and the [vLLM user guide for NxDI](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/vllm-user-guide.html) for more information on these `inference_demo` flags.

In [None]:
!pip list | grep neuron

In [None]:
# RUN THE FOLLOWING CELL IN A TERMINAL - spin up the vllm server

In [None]:
# These should be the same paths used when compiling the model. - command for terminal
MODEL_PATH="/home/ubuntu/Mistral-Small-24B-Instruct-2501/"
COMPILED_MODEL_PATH="/home/ubuntu/traced_model/Mistral-Small-24B-Instruct-2501/"

export VLLM_NEURON_FRAMEWORK="neuronx-distributed-inference"
export NEURON_COMPILED_ARTIFACTS=$COMPILED_MODEL_PATH
VLLM_RPC_TIMEOUT=100000 python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_PATH \
    --max-num-seqs 1 \
    --max-model-len 12800 \
    --tensor-parallel-size 32 \
    --device neuron \
    --use-v2-block-manager \
    --port 8000 &
PID=$!
echo "vLLM server started with PID $PID"
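The server started above runs in the background and needs time to load the compiled model before it accepts requests. A minimal sketch of a readiness check follows: it polls the `/v1/models` route (the same route that `client.models.list()` uses) until it answers. The helper names, timeout, and polling interval are this sketch's own choices.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(url: str, timeout_s: float = 1800.0, interval_s: float = 10.0,
                     probe: Callable[[str], bool] = http_ok) -> bool:
    """Poll `url` until the probe succeeds or `timeout_s` seconds have passed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example call once the server process has been launched:
# wait_until_ready("http://localhost:8000/v1/models")
```

The `probe` argument exists so the polling loop can be exercised without a running server.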

Let's send a quick request with a python client to the server:

In [None]:
from openai import OpenAI

# Client Setup
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model_name = models.data[0].id

# Sampling Parameters
max_tokens = 1024
temperature = 1.0
top_p = 1.0
top_k = 50
stream = False

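# Chat Completion Request (passes the sampling parameters defined above)
response = client.chat.completions.create(
    model=model_name,
    messages=[
       {"role": "system", "content": "You are a helpful AI assistant."},
       {"role": "user", "content": "What is AWS Neuron?"}
    ],
    max_tokens=max_tokens,
    temperature=temperature,
    top_p=top_p,
    stream=stream,
    # top_k is not an OpenAI API field; vLLM accepts it through extra_body
    extra_body={"top_k": top_k},
)

# Parse the response
generated_text = response.choices[0].message.content

print(generated_text)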

In [None]:
!neuron-ls # show running processes - vllm server is still running

----

#### 6.2 llmperf: let's run some quick benchmarks

After the above steps, the vLLM server should be running. You can now measure the performance using LLMPerf. Before we can use the llmperf package, we need to make a few changes to its code. Follow the "Benchmarking with LLMPerf" guide to apply the code changes.

Below is a sample shell script to run LLMPerf. To provide the model with 10000 tokens as input and generate 1500 tokens as output on average, we use the following parameters from LLMPerf:

In [None]:
%%bash
cd /home/ubuntu/llmperf/

MODEL_PATH="/home/ubuntu/Mistral-Small-24B-Instruct-2501/"
COMPILED_MODEL_PATH="/home/ubuntu/traced_model/Mistral-Small-24B-Instruct-2501/"
OUTPUT_PATH=llmperf-results-sonnets

export OPENAI_API_BASE="http://localhost:8000/v1"
export OPENAI_API_KEY="mock_key"

python token_benchmark_ray.py \
    --model $MODEL_PATH \
    --mean-input-tokens 10000 \
    --stddev-input-tokens 0 \
    --mean-output-tokens 1500 \
    --stddev-output-tokens 0 \
    --num-concurrent-requests 1 \
    --timeout 3600 \
    --max-num-completed-requests 50 \
    --additional-sampling-params '{}' \
    --results-dir $OUTPUT_PATH \
    --llm-api "openai"

In [None]:
!sudo kill 55509 # stop the server; replace 55509 with the PID printed when the vLLM server started

Summarized results:

| Scenario                                                                  | TTFT (p50 ms) | TPOT (p50 ms) | Output-token Throughput (tokens/s, p50) |
|---------------------------------------------------------------------------|---------------|---------------|-----------------------------------------|
| Mistral-Small-24B-Instruct-2501 on Trainium (OpenAI-style API)            | 347           | 10.55         | 107.35                                  |
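To relate the latency columns to what a user experiences, a back-of-the-envelope estimate can help: the first token costs one TTFT, and each further token costs roughly one TPOT. The sketch below applies this simple model to the p50 values in the table and the 1500 output tokens requested in the benchmark. It is an assumption-based approximation, not the way LLMPerf derives its own throughput column, so the two figures need not coincide.

```python
# p50 values copied from the results table above.
TTFT_MS = 347.0
TPOT_MS = 10.55
OUTPUT_TOKENS = 1500  # --mean-output-tokens in the LLMPerf command


def decode_rate_tokens_per_s(tpot_ms: float) -> float:
    """Steady-state decode rate implied by the time per output token."""
    return 1000.0 / tpot_ms


def estimated_latency_s(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """TTFT covers the first token; each remaining token costs one TPOT."""
    return (ttft_ms + (output_tokens - 1) * tpot_ms) / 1000.0


print(f"decode rate: {decode_rate_tokens_per_s(TPOT_MS):.2f} tokens/s")
print(f"estimated end-to-end latency: {estimated_latency_s(TTFT_MS, TPOT_MS, OUTPUT_TOKENS):.2f} s")
```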


---

#### 6.3 Running Evaluations

There are two ways to use the evaluation scripts to run your evaluation. For more information, check out the [inference-benchmarking](https://github.com/aws-neuron/aws-neuron-samples/tree/master/inference-benchmarking/) directory and the [tutorials](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/tutorials/trn1-llama3.1-70b-instruct-accuracy-eval-tutorial.html) in NxDI.

1. Using a `yaml` configuration file and `accuracy.py` script

2. Writing your own python script that uses several components provided in `accuracy.py` and `server_config.py`

In this notebook we only demonstrate running an eval with the `yaml` config file.

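With this method, all you need is a yaml config file that specifies the server configuration and the testing scenario you want to run. Create `mistral_config.yaml` with the following content. The `server_port` must be free on the instance, so choose a different value if another process, such as the Jupyter server started earlier on port 8888, is already listening on it.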

In [None]:
%%writefile mistral_config.yaml

server:
  name: "Mistral-Small-24B-Instruct"
  model_path: "/home/ubuntu/Mistral-Small-24B-Instruct-2501/"
  model_s3_path: null
  compiled_model_path: "/home/ubuntu/traced_model/Mistral-Small-24B-Instruct-2501/"
  max_seq_len: 12800
  context_encoding_len: 12288
  tp_degree: 32
  n_vllm_threads: 32
  server_port: 8888
  continuous_batch_size: 1

test:
  accuracy:
    mytest:
      client: "lm_eval"
      datasets: ["gsm8k_cot"]
      max_concurrent_requests: 1
      timeout: 3600
      client_params:
        limit: 200
        use_chat: True
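Before launching the evaluation, a few cheap consistency checks on the `server` section can catch mistakes early. The sketch below mirrors the values from the config as a plain dictionary and checks them against each other. The checks and the helper name `check_server_config` are this sketch's own and are not performed by `accuracy.py`; the list of ports in use is an assumption based on the Jupyter server started on port 8888 earlier in this guide.

```python
def check_server_config(server: dict, ports_in_use: tuple = ()) -> list:
    """Return a list of human-readable problems found in the server settings."""
    problems = []
    if server["context_encoding_len"] > server["max_seq_len"]:
        problems.append("context_encoding_len exceeds max_seq_len")
    if server["tp_degree"] < 1:
        problems.append("tp_degree must be at least 1")
    if server["server_port"] in ports_in_use:
        problems.append(f"server_port {server['server_port']} is already in use")
    return problems


# Values copied from the `server` section of mistral_config.yaml.
server = {
    "max_seq_len": 12800,
    "context_encoding_len": 12288,
    "tp_degree": 32,
    "server_port": 8888,
    "continuous_batch_size": 1,
}

# Assumption: JupyterLab from the setup steps is still listening on 8888.
for problem in check_server_config(server, ports_in_use=(8888,)):
    print("warning:", problem)
```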

In [None]:
%%bash
if test -f "/home/ubuntu/aws-neuron-samples/inference-benchmarking/mistral_config.yaml"; then
   echo "config file exists."
else 
   echo "Moving config file."
   mv /home/ubuntu/mistral_config.yaml /home/ubuntu/aws-neuron-samples/inference-benchmarking/
fi

In [None]:
%%bash
cd /home/ubuntu/aws-neuron-samples/inference-benchmarking/
python accuracy.py --config mistral_config.yaml

Results Summary:

Accuracy_mytest_gsm8k_cot:
    Saved at  results/accuracy/mytest/gsm8k_cot/__home__ubuntu__Mistral-Small-24B-Instruct-2501__/results_2025-04-26T20-02-47.843052.json:
    
    Metrics: {'gsm8k_cot': {'AccuracyExactMatchStrictMatch': 39.5, 'AccuracyExactMatchStrictMatchStderr': 3.46537, 'AccuracyExactMatchFlexibleExtract': 78.5, 'AccuracyExactMatchFlexibleExtractStderr': 2.91224}}
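The `Stderr` fields put the accuracy figures in context: with `limit: 200`, each score is estimated from only 200 samples. As an observation drawn from the numbers themselves (not from the lm_eval documentation), the reported values are consistent with the sample standard error of a proportion, sqrt(p(1-p)/(n-1)). The sketch below recomputes them and derives approximate 95% confidence intervals using the normal approximation.

```python
import math


def binomial_stderr_pct(accuracy_pct: float, n: int) -> float:
    """Sample standard error of an accuracy given in percent, over n samples."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / (n - 1))


def ci95(accuracy_pct: float, stderr_pct: float) -> tuple:
    """Approximate 95% confidence interval (normal approximation)."""
    half_width = 1.96 * stderr_pct
    return accuracy_pct - half_width, accuracy_pct + half_width


N_SAMPLES = 200  # `limit` in mistral_config.yaml
for name, acc in (("strict-match", 39.5), ("flexible-extract", 78.5)):
    se = binomial_stderr_pct(acc, N_SAMPLES)
    low, high = ci95(acc, se)
    print(f"{name}: {acc} +/- {se:.5f} (95% CI roughly {low:.1f} to {high:.1f})")
```

The gap between the two metrics comes from how the answer is extracted from the model output, so the flexible-extract score is usually the more informative one for chat-style responses.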

---

#### 6.4 Profiling with `neuron-profile`

`neuron-profile` helps developers identify performance bottlenecks and optimize their workloads for NeuronDevices. `neuron-profile` provides insights into NeuronDevice activity including the instructions executed on each compute engine (e.g., Tensor engine, Vector engine, etc.), DMA data movement activity, and performance metrics such as engine utilization, DMA throughput, memory usage, and more. NeuronDevice activity is collected by the `neuron-profile` capture command which runs the model with tracing enabled. Profiling typically has near zero overhead because NeuronDevices have dedicated on-chip hardware profiling.

Let's cd into `/tmp/nxd_model`, the compiler working directory. It contains the context encoding and token generation directories (for example `context_encoding_model`), which hold the NEFFs for the context encoding and token generation buckets we configured earlier. The neuron-profile tool can both capture and post-process profiling information. neuron-profile takes a compiled model (a NEFF), executes it, and saves the profile results to an NTFF (profile.ntff by default).

In [None]:
%%bash 
cd /tmp/nxd_model/
ls #list directories
cd context_encoding_model
ls 

In [None]:
#_tp0_bk0  _tp0_bk1  _tp0_bk2  _tp0_bk3 - are the context encoding buckets

##### Capturing profiles for multi-worker jobs
`neuron-profile` can capture profiles for collectives-enabled NEFFs running across multiple NeuronCores, NeuronDevices, or even nodes. This is useful for understanding performance and communication overheads when deploying larger distributed models.

The following example performs a distributed run across all NeuronDevices and NeuronCores on our trn1.32xlarge instance, capturing profiles for all 32 workers (one for each NeuronCore).

In [None]:
%%bash
# 1. Make sure the directory exists and is writable
mkdir -p /tmp/output/          

cd /tmp/nxd_model/context_encoding_model/_tp0_bk1/
# 2. Run the capture, pointing -s at that directory
neuron-profile capture \
  -n graph.neff \
  --collectives-workers-per-node 32 \
  -s /tmp/output/profile.ntff        


If we now check the output directory, we find one profile saved for each worker.

In [None]:
%%bash 
cd /tmp/output/
ls
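With 32 workers it is easy to miss a rank whose profile was not written. The sketch below scans a list of file names for per-rank profiles and reports any gaps. It assumes the files follow the `profile_rank_<N>.ntff` pattern seen later in this guide and that ranks are numbered from 0 to 31; both helper names are hypothetical.

```python
import os
import re

# Assumed naming pattern, based on the profile_rank_2.ntff file used in this guide.
RANK_FILE = re.compile(r"^profile_rank_(\d+)\.ntff$")


def ranks_present(filenames) -> list:
    """Return the sorted rank numbers found among the given file names."""
    ranks = []
    for name in filenames:
        match = RANK_FILE.match(name)
        if match:
            ranks.append(int(match.group(1)))
    return sorted(ranks)


def missing_ranks(filenames, world_size: int = 32) -> list:
    """Return the ranks in range(world_size) that have no profile file."""
    return sorted(set(range(world_size)) - set(ranks_present(filenames)))


# Only inspects the output directory when it exists on this machine.
if os.path.isdir("/tmp/output"):
    print("missing ranks:", missing_ranks(os.listdir("/tmp/output")))
```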

##### Viewing profiles for multi-worker jobs
Profiles from multi-worker jobs (i.e. more than one NeuronCore) can either be viewed individually or in a combined collectives view. Since profile data is often similar between workers and processing profile data for all workers can be time-consuming, it is recommended to first explore the profile for a single worker or small subset of workers. Viewing the profile for a specific worker is the same as for single-worker profiles.

At the start, we forwarded ports 3001 and 8086. Because `neuron-profile view` runs on the remote instance, port forwarding is needed to access the profiles from a local browser.


Viewing the profile for a specific worker is as below.

In [None]:
%%bash
cd /tmp/nxd_model/context_encoding_model/_tp0_bk1/
neuron-profile view -n graph.neff -s /tmp/output/profile_rank_2.ntff

You will see output like: `View profile at http://localhost:3001/profile/n_a1143c514431fb4c23b7aae9208fd1a89cad42f6`

![image-profile](imgs/img-neff.png)

To view the profile for multiple workers, pass the directory containing all worker profiles to neuron-profile.

In [None]:
%%bash
cd /tmp/nxd_model/context_encoding_model/_tp0_bk1/
neuron-profile view -n graph.neff -d /tmp/output

For more on profiling with Neuron and understanding profiles, check out the [`neuron-profile` user guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-profile-user-guide.html).

---

## Conclusion

In this notebook, we walked through deploying and benchmarking Mistral Small 2501 on Trn1, and capturing and viewing profiles for its NEFFs.

---

#### Distributors
- AWS
- Mistral