# Guide to deploy and benchmark Devstral Small 2505 with NxDI and vLLM on Inf2

## Devstral Small 2505 
Official model card: <https://huggingface.co/mistralai/Devstral-Small-2505>

## NeuronX Distributed Inference (NxDI)  
[NxDI](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/index.html) is an open-source PyTorch-based inference library that simplifies deep learning model deployment on AWS Inferentia and Trainium instances. Introduced with the Neuron SDK 2.21 release, it provides high-performance inference features such as continuous batching and speculative decoding.

## Overview
1. **Install dependencies** – NxDI, the Neuron vLLM fork, and supporting libraries.  
2. **(Optional)** Install benchmarking / evaluation utilities (`llmperf`, `lm_eval`).  
3. **Download** the Devstral Small 2505 base model weights.  
4. **Compile and save** the model with `inference_demo` and verify generation.  
5. **Deploy** the model behind a vLLM server.  
6. **Benchmark** latency and throughput with `llmperf`.  
7. **Evaluate accuracy** with `lm_eval`.

### Prerequisites

- **Amazon EC2 inf2.24xlarge instance** running the Ubuntu 22.04 Neuron DLAMI.
- **NxDI virtual environment** (e.g., `aws_neuronx_venv_pytorch_2_7_nxd_inference`; the exact name depends on your DLAMI version).

To request a quota increase for `inf2.24xlarge` on EC2, follow these steps:

1. Navigate to the [Service Quotas console](https://console.aws.amazon.com/servicequotas/).
2. Choose Amazon EC2.
3. Review your default quota for the following resources:
   - `inf2.24xlarge` for ec2 on-demand use
4. If needed, request a quota increase for these resources.

<div class="alert alert-block alert-warning"> 

<b>NOTE:</b> To make sure that you have enough quotas to support your usage requirements, it's a best practice to monitor and manage your service quotas. Requests for Amazon EC2 service quota increases are subject to review by AWS engineering teams. Also, service quota increase requests aren't immediately processed when you submit a request. After your request is processed, you receive an email notification.
</div>

### Create Your EC2 instance and ssh into it

Follow these steps to set up your EC2 instance:

#### Steps:
- Navigate to the EC2 dashboard in the AWS Management Console and launch an instance.
- Search for the Ubuntu 22.04 Neuron DLAMI.
- Choose `inf2.24xlarge` as the instance type, or any other Neuron-based instance large enough to fit the model.
- Set the inbound rule for SSH to your local machine's IP address. Allowing traffic from any IPv4 address also works, but it is not a recommended security practice; if you do so for testing, restrict the rule once you are done.
- Create and specify your SSH key pair in the instance configuration step. You will need the `.pem` file to connect.
- Create your instance.
- Once the instance is running, open a terminal or VS Code and follow the steps below:

#### ssh for powershell:

`$PUBLIC_DNS="paste your public ipv4 dns here" # public ipv4 DNS, e.g. ec2-3-80-.... from ec2 console`
`$KEY_PATH="paste ssh key path here" # local path to key, e.g. ssh/inf.pem`

`ssh -i $KEY_PATH -L 8888:127.0.0.1:8888 -L 8000:127.0.0.1:8000 -L 8086:127.0.0.1:8086 -L 3001:127.0.0.1:3001 ubuntu@$PUBLIC_DNS`

#### ssh for linux/macOS:

`export PUBLIC_DNS="paste your public ipv4 dns here" # public ipv4 DNS, e.g. ec2-3-80-.... from ec2 console`
`export KEY_PATH="paste ssh key path here" # local path to key, e.g. ssh/inf.pem`

`ssh -i $KEY_PATH -L 8888:127.0.0.1:8888 -L 8000:127.0.0.1:8000 -L 8086:127.0.0.1:8086 -L 3001:127.0.0.1:3001 ubuntu@$PUBLIC_DNS`

You should now be connected to your EC2 instance over SSH.

- Activate your NXDI venv:

`source /opt/aws_neuronx_venv_pytorch_2_7_nxd_inference/bin/activate`

- Start the Jupyter server:

`jupyter lab --no-browser --port 8888 --ip 0.0.0.0`

You should see a familiar jupyter output with a URL to the notebook.

`http://localhost:8888/....`

Open the URL, and a Jupyter environment opens in your local browser. Upload this notebook to your Jupyter environment and run the steps in the cells below.

---

## Install and Set up Dependencies

### 1. Validate / Activate Python Environment

Inside a Jupyter notebook, running `source /opt/aws_neuronx_venv_pytorch_2_7_nxd_inference/bin/activate` in a cell does not persist the environment to subsequent cells, because each `%%bash` cell or `!` command runs in its own subshell. Instead, activate the venv in the terminal before starting the Jupyter server.

In [None]:
%%bash
# (Optional) Uncomment or modify the following line to activate a custom environment
# for this cell only (it does not carry over to other cells).
#source /opt/aws_neuronx_venv_pytorch_2_7_nxd_inference/bin/activate

echo 'Python environment check:'
which python
python3 --version
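
As a complement to the shell check above, the kernel itself can report which environment it runs in. This is a minimal sketch; the helper names are hypothetical, and the heuristic assumes the venv folder name contains `nxd_inference`, as the DLAMI venv names in this guide do.

```python
import sys
from pathlib import Path


def env_name(prefix: str = sys.prefix) -> str:
    """Return the last path component of the interpreter prefix (the venv folder name)."""
    return Path(prefix).name


def looks_like_nxdi_env(prefix: str = sys.prefix) -> bool:
    """Heuristic (assumption): NxDI venv folder names contain 'nxd_inference'."""
    return "nxd_inference" in env_name(prefix)


print(f"Kernel interpreter prefix: {sys.prefix}")
if not looks_like_nxdi_env():
    print("Warning: this kernel does not appear to run inside an NxDI virtual environment.")
```

If the warning appears, stop the Jupyter server, activate the venv in the terminal, and start the server again.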

In [None]:
%%writefile requirements.txt
transformers==4.45.2
huggingface_hub

In [None]:
!pip install -U -r requirements.txt --quiet

In [None]:
! pip list | grep neuron
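
The same check can be done from Python, which is convenient when you want the versions as data rather than as text. The sketch below uses only the standard library; `installed_packages` is a hypothetical helper name.

```python
from importlib.metadata import distributions


def installed_packages(substring: str) -> dict[str, str]:
    """Map distribution name -> version for every installed package whose name contains `substring`."""
    found = {}
    for dist in distributions():
        name = dist.metadata.get("Name", "") or ""
        if substring.lower() in name.lower():
            found[name] = dist.version
    return found


# Print the Neuron-related packages visible to this kernel.
for name, version in sorted(installed_packages("neuron").items()):
    print(f"{name}=={version}")
```

An empty result usually means the kernel is not running inside the NxDI virtual environment.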

---

### 2. Install Neuron vLLM Fork

If you would like to serve your model via [vLLM](https://vllm.readthedocs.io/en/latest/) specialized for Neuron-based inference, you can install AWS Neuron's vLLM fork. NxD Inference integrates into vLLM by extending the model execution components responsible for loading and invoking models used in vLLM's LLMEngine (see [link](https://docs.vllm.ai/en/latest/design/arch_overview.html#llm-engine) for more details on vLLM architecture). This means input processing, scheduling and output processing follow the default vLLM behavior.

You enable the Neuron integration in vLLM by setting the device type used by `vLLM` to `neuron`.

Currently, we support continuous batching and streaming generation in the NxD Inference vLLM integration. We are working with the vLLM community to enable support for other vLLM features like PagedAttention and Chunked Prefill on Neuron instances through NxD Inference in upcoming releases.


Skip this step if you do not need the vLLM server. Cloning and installing vLLM takes about 8-10 minutes.


In [None]:
%%bash
set -euxo pipefail

if [ -d "/home/ubuntu/upstreaming-to-vllm" ]; then
    echo "Neuron vLLM fork already cloned. Skipping."
else
    echo "Cloning and installing AWS Neuron vLLM fork..."
    cd /home/ubuntu/
    git clone -b neuron-2.24-vllm-v0.7.2 https://github.com/aws-neuron/upstreaming-to-vllm.git #neuron 2.24 vllm version
    cd upstreaming-to-vllm
    pip install -r requirements-neuron.txt --quiet

    # Install in editable mode with device set to neuron
    VLLM_TARGET_DEVICE="neuron" pip install -e . --quiet
fi

---

### 3. (Optional) Install accuracy and perf benchmarking tools

#### 3.1 Install llmperf

If you'd like to run benchmarks or load tests, you can install [llmperf](https://github.com/ray-project/llmperf). Skip if not needed.


In [None]:
%%bash
if pip show llmperf > /dev/null 2>&1; then
    echo "llmperf is already installed. Skipping."
else
    echo "Installing llmperf..."
    cd /home/ubuntu/
    # --quiet suppresses progress output but still surfaces errors
    git clone --quiet https://github.com/ray-project/llmperf.git
    cd llmperf
    pip install -e . --quiet
fi

In [None]:
!pip list| grep neuron

#### 3.2 Accuracy-benchmarking with lm_eval


Clone the `aws-neuron-samples` repo to your instance

In [None]:
! git clone https://github.com/aws-neuron/aws-neuron-samples.git

The benchmarking scripts live in the repo's [inference-benchmarking](https://github.com/aws-neuron/aws-neuron-samples/tree/master/inference-benchmarking/) directory. You can copy it to another location on your instance if you prefer; adjust the paths below accordingly.

Change into that directory and install the remaining dependencies in the same Python environment (for example, `aws_neuron_venv_pytorch` if you installed NxD Inference manually):

In [None]:
%%bash
cd /home/ubuntu/aws-neuron-samples/inference-benchmarking/
pip install -r requirements.txt --quiet

---

## 4. Download or Provide Your Model

Below is a template for downloading the Devstral Small 2505 model. You can skip or adjust if you already have a local model.

You need to log in to Hugging Face before downloading. Create or copy an access token from https://huggingface.co/settings/tokens and paste it into the prompt that `notebook_login()` displays below.

In [None]:
!git config --global credential.helper store
from huggingface_hub import notebook_login
notebook_login()

In [None]:
#run the following code in the terminal to install git-lfs

`sudo apt-get update`

`sudo apt-get install git-lfs`

`git lfs install`

In [None]:
#check that git lfs is installed on path

In [None]:
!git lfs version

In [None]:
#start a tmux session and run the following code in the terminal:

`sudo apt-get update`

`sudo apt-get install tmux`

`tmux new -s mysession`

In [None]:
# The download may take a while, so run the following command in the terminal inside the tmux session.
# Run it from /home/ubuntu so the model lands at the path used in the rest of this guide.

`git clone https://huggingface.co/mistralai/Devstral-Small-2505`

In [None]:
!du -sh /home/ubuntu/Devstral-Small-2505/ #check if the full model was copied in
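
Directory size alone can be misleading: if `git lfs install` was skipped, `git clone` leaves small Git LFS pointer files in place of the real weights. The sketch below flags such files by checking for the standard pointer header. It assumes the weights are stored as `*.safetensors` files, and `find_lfs_pointers` is a hypothetical helper name.

```python
from pathlib import Path

# Every Git LFS pointer file starts with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(model_dir: str, pattern: str = "*.safetensors") -> list[str]:
    """Return the names of files matching `pattern` that are still LFS pointers."""
    pointers = []
    for path in sorted(Path(model_dir).glob(pattern)):
        with path.open("rb") as handle:
            head = handle.read(len(LFS_POINTER_PREFIX))
        if head == LFS_POINTER_PREFIX:
            pointers.append(path.name)
    return pointers
```

Calling `find_lfs_pointers("/home/ubuntu/Devstral-Small-2505")` should return an empty list once the download is complete; any names it returns can be fetched by running `git lfs pull` inside the model directory.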

---

## 5. Compile and save the model

Use the `inference_demo` command that ships with **NeuronX Distributed Inference** to compile the model for Inferentia2 and generate a quick sample response. Compiled artifacts (NEFF files) are stored under the `--compiled-model-path` you provide and can be reused later.

In [None]:
# LOCAL_WORLD_SIZE is a global setting required for distributed inference/training.
# We compile with a tensor-parallel degree (TP) of 4, so it must be set to 4.
import os
os.environ["LOCAL_WORLD_SIZE"] = "4"
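
To see why a tensor-parallel degree of 4 is a reasonable starting point, a back-of-the-envelope estimate of the weight memory per NeuronCore helps. The sketch below assumes roughly 24 billion parameters (taken from the name of the base model, Mistral-Small-3.1-24B) and 2 bytes per parameter for bfloat16. It ignores the KV cache and activations, so treat the result as a lower bound.

```python
def weight_gb_per_core(num_params: float, bytes_per_param: int, tp_degree: int) -> float:
    """Approximate weight memory (in GB, 1 GB = 1e9 bytes) held by each NeuronCore."""
    return num_params * bytes_per_param / tp_degree / 1e9


# Assumptions: ~24e9 parameters, bfloat16 (2 bytes each), sharded across 4 cores.
per_core = weight_gb_per_core(num_params=24e9, bytes_per_param=2, tp_degree=4)
print(f"{per_core:.1f} GB of weights per NeuronCore")  # prints: 12.0 GB of weights per NeuronCore
```

Compare this figure with the device memory available per NeuronCore on your instance type, and leave headroom for the KV cache, which grows with batch size and sequence length.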

In [None]:
# Devstral is fine-tuned from Mistral Small 3.1 (https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503),
# so we download the Mistral Small tokenizer files into the Devstral model directory.

from huggingface_hub import hf_hub_download
import os

def download_tokenizer_files(
    repo_id="mistralai/Mistral-Small-3.1-24B-Base-2503",
    output_dir="/home/ubuntu/Devstral-Small-2505"
):
    # Verify output directory exists
    if not os.path.exists(output_dir):
        print(f"Error: Directory {output_dir} does not exist!")
        return [], ["Output directory not found"]
    
    # Only the files that exist
    tokenizer_files = [
        "tokenizer.json",
        "tokenizer_config.json"
    ]
    
    downloaded_files = []
    errors = []
    
    print(f"Downloading tokenizer files from {repo_id}")
    print(f"To existing directory: {output_dir}")
    
    for file in tokenizer_files:
        try:
            print(f"\nDownloading {file}...")
            path = hf_hub_download(
                repo_id=repo_id,
                filename=file,
                local_dir=output_dir
            )
            downloaded_files.append(path)
            print(f"✓ Successfully downloaded {file}")
        except Exception as e:
            errors.append(f"Error downloading {file}: {str(e)}")
    
    # Verify downloads
    print("\nVerifying downloads:")
    for file in tokenizer_files:
        file_path = os.path.join(output_dir, file)
        if os.path.exists(file_path):
            size = os.path.getsize(file_path)
            print(f"✓ {file} ({size:,} bytes)")
            print(f"  Location: {file_path}")
        else:
            print(f"✗ Missing: {file}")
    
    return downloaded_files, errors

# Run the function
files, errors = download_tokenizer_files()

if errors:
    print("\nErrors encountered:")
    for error in errors:
        print(error)
else:
    print("\nAll files downloaded successfully!")


In [None]:
%%bash
# Replace MODEL_PATH with the path where you downloaded and saved the model files.
# COMPILED_MODEL_PATH is where the compiled artifacts are written; reuse the same paths in later steps.
MODEL_PATH="/home/ubuntu/Devstral-Small-2505/"
COMPILED_MODEL_PATH="/home/ubuntu/traced_model/Devstral-small-2505/"
TP_DEGREE=4

inference_demo \
    --model-type llama \
    --task-type causal-lm \
        run \
        --model-path $MODEL_PATH \
        --compiled-model-path $COMPILED_MODEL_PATH \
        --torch-dtype bfloat16 \
        --start_rank_id 0 \
        --tp-degree $TP_DEGREE \
        --batch-size 4 \
        --max-context-length 4096 \
        --seq-len 4096 \
        --on-device-sampling \
        --top-k 1 \
        --do-sample \
        --fused-qkv \
        --sequence-parallel-enabled \
        --pad-token-id 2 \
        --enable-bucketing \
        --context-encoding-buckets 1024 2048 4096 \
        --token-generation-buckets 1024 2048 4096 \
        --prompt "Write a Python function that generates a Fibonacci sequence." 2>&1 | tee log
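
The bucket flags in the command above compile the model for a fixed set of sequence lengths. Conceptually, an input is padded up to the smallest bucket that can hold it. The sketch below illustrates that selection rule only; it is not NxDI's actual implementation, and `pick_bucket` and `padding_tokens` are hypothetical names.

```python
from bisect import bisect_left


def pick_bucket(length: int, buckets: list[int]) -> int:
    """Return the smallest bucket that is >= length."""
    ordered = sorted(buckets)
    index = bisect_left(ordered, length)
    if index == len(ordered):
        raise ValueError(f"length {length} exceeds the largest bucket {ordered[-1]}")
    return ordered[index]


def padding_tokens(length: int, buckets: list[int]) -> int:
    """Number of padding tokens needed to fill the selected bucket."""
    return pick_bucket(length, buckets) - length


buckets = [1024, 2048, 4096]  # same values as the flags above
print(pick_bucket(1500, buckets))     # prints: 2048
print(padding_tokens(1500, buckets))  # prints: 548
```

More buckets reduce wasted padding for short prompts, at the cost of compiling one graph per bucket.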

---

## 6. Deploy the model using vLLM

#### 6.1 Run Devstral Small 2505 on Inferentia2

The Neuron‑aware vLLM fork can load the **pre‑compiled** artifacts produced in step 5.

If pre-compiled artifacts are provided, then configurations passed through the vllm API will not be used.

If they are absent, vLLM automatically triggers a one‑time compilation on first launch.  
See the [vLLM user guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/vllm-user-guide.html#loading-pre-compiled-models-serialization-support) for details.

Key CLI flags:

* `--max-num-seqs` – maximum batch size in NxDI.  
* `--max-model-len` – maximum context length (tokens) per sequence.  
* `--tensor-parallel-size` – number of NeuronCores across which the model is sharded.  
* `--override-neuron-config` – accepts a dictionary that can be provided to change the default configurations in NxDI while compiling the model for deployment.

Example:

```bash
python -m vllm.entrypoints.openai.api_server \
  --model /home/ubuntu/Devstral-Small-2505 \
  --max-num-seqs 16 \
  --max-model-len 8192 \
  --tensor-parallel-size 4 \
  --compiled-model-path /home/ubuntu/traced_model/Devstral-small-2505 \
  --override-neuron-config /home/ubuntu/traced_model/Devstral-small-2505/neuron_config.json
```
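
Since `--override-neuron-config` is described above as accepting a dictionary, one way to avoid shell-quoting mistakes is to build the argument in Python and serialize it with `json.dumps`. This is a sketch under assumptions: the override keys below simply mirror the `inference_demo` bucketing flags and may not match the option names your NxDI version expects, and `build_server_command` is a hypothetical helper.

```python
import json
import shlex

# Assumed keys, mirroring the inference_demo bucketing flags; verify against your NxDI version.
overrides = {
    "enable_bucketing": True,
    "context_encoding_buckets": [1024, 2048, 4096],
    "token_generation_buckets": [1024, 2048, 4096],
}


def build_server_command(model_path: str, neuron_overrides: dict, tp_degree: int = 4) -> list[str]:
    """Assemble the vLLM server argv with the overrides serialized as a JSON string."""
    return [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model_path,
        "--tensor-parallel-size", str(tp_degree),
        "--override-neuron-config", json.dumps(neuron_overrides),
    ]


command = build_server_command("/home/ubuntu/Devstral-Small-2505", overrides)
print(shlex.join(command))  # shell-safe form that can be pasted into a terminal
```

`shlex.join` quotes the JSON string so the shell passes it to the server as a single argument.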

In the steps below, we use the precompiled model artifacts saved by the earlier `inference_demo` run, and we set `VLLM_NEURON_FRAMEWORK="neuronx-distributed-inference"` to override the default value.

In [None]:
!pip list | grep neuron

In [None]:
# RUN THE FOLLOWING CELL IN A TERMINAL - spin up the vllm server

In [None]:
# Run these commands in the terminal. The paths should match the ones used when compiling the model.
MODEL_PATH="/home/ubuntu/Devstral-Small-2505/"
COMPILED_MODEL_PATH="/home/ubuntu/traced_model/Devstral-small-2505/"

export LOCAL_WORLD_SIZE=4
export VLLM_NEURON_FRAMEWORK="neuronx-distributed-inference"
export NEURON_COMPILED_ARTIFACTS=$COMPILED_MODEL_PATH

VLLM_RPC_TIMEOUT=100000 python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_PATH \
    --max-num-seqs 4 \
    --max-model-len 2048 \
    --tensor-parallel-size 4 \
    --device neuron \
    --use-v2-block-manager \
    --port 8000 \
    --chat-template "{% for message in messages %}{% if message['role'] == 'user' %}[INST] {{ message['content'] }} [/INST]{% elif message['role'] == 'assistant' %}{{ message['content'] }}</s>{% endif %}{% endfor %}" &

PID=$!
echo "vLLM server started with PID $PID"
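
The server can take a while to load the compiled model, so a request sent too early fails with a connection error. A minimal way to wait, sketched below with the standard library only, is to poll until the port accepts TCP connections. `wait_for_port` is a hypothetical helper, and an open port is only a proxy for readiness.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 600.0, interval_s: float = 2.0) -> bool:
    """Poll until a TCP connection to (host, port) succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Server not accepting connections yet; wait and retry.
            time.sleep(interval_s)
    return False
```

For example, `wait_for_port("localhost", 8000)` returns `True` once the server started above is listening, or `False` after ten minutes.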

Let's send a quick request with a python client to the server:

In [None]:
from openai import OpenAI

# Client Setup
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model_name = models.data[0].id

# Sampling Parameters
max_tokens = 1024
temperature = 1.0
top_p = 1.0
top_k = 50
stream = False

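# Chat Completion Request (passes the sampling parameters defined above)
response = client.chat.completions.create(
    model=model_name,
    messages=[
       {"role": "system", "content": "You are a helpful AI assistant."},
       {"role": "user", "content": "Write a Python function to implement binary search."}
    ],
    max_tokens=max_tokens,
    temperature=temperature,
    top_p=top_p,
    stream=stream,
    # top_k is not part of the OpenAI schema; vLLM accepts it as an extra body field
    extra_body={"top_k": top_k},
)

# Parse the response (non-streaming)
generated_text = response.choices[0].message.content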

print(generated_text)
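
Note that the `--chat-template` passed to the server handles only the `user` and `assistant` roles. The pure-Python sketch below mirrors that template's logic to show the prompt text it produces; `render_prompt` is a hypothetical helper for illustration, not part of vLLM.

```python
def render_prompt(messages: list[dict]) -> str:
    """Mirror the Jinja chat template used when starting the server."""
    parts = []
    for message in messages:
        if message["role"] == "user":
            parts.append(f"[INST] {message['content']} [/INST]")
        elif message["role"] == "assistant":
            parts.append(f"{message['content']}</s>")
        # Any other role (e.g. "system") matches neither branch and is dropped.
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Hello"},
]
print(render_prompt(messages))  # prints: [INST] Hello [/INST]
```

With this template the system message never reaches the model. If you rely on a system prompt, extend the template with a branch for the `system` role.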

In [None]:
!neuron-ls # list Neuron devices and the processes using them; the vLLM server should still be running

---

#### 6.2 Benchmarking with llmperf

Follow the [LLMPerf on Trainium guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/llm-inference-benchmarking-guide.html) to install and configure the tool.

Below is a sample shell script that targets the vLLM server started in the previous step:

In [None]:
%%bash
cd /home/ubuntu/llmperf/

MODEL_PATH="/home/ubuntu/Devstral-Small-2505/"
COMPILED_MODEL_PATH="/home/ubuntu/traced_model/Devstral-small-2505/"
OUTPUT_PATH=llmperf-results-sonnets

export OPENAI_API_BASE="http://localhost:8000/v1"
export OPENAI_API_KEY="mock_key"

python token_benchmark_ray.py \
    --model $MODEL_PATH \
    --mean-input-tokens 1000 \
    --stddev-input-tokens 0 \
    --mean-output-tokens 500 \
    --stddev-output-tokens 0 \
    --num-concurrent-requests 4 \
    --timeout 3600 \
    --max-num-completed-requests 10 \
    --additional-sampling-params '{}' \
    --results-dir $OUTPUT_PATH \
    --llm-api "openai"
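
When reading the results, it helps to keep per-request latency and aggregate throughput apart: with 4 concurrent requests, throughput is the total number of output tokens divided by wall-clock time, not by the sum of latencies. The sketch below shows the arithmetic. `summarize_run` is a hypothetical helper, and the numbers are made-up placeholders chosen only to exercise the function; they are not measurements.

```python
import statistics


def summarize_run(latencies_s: list[float], output_tokens_per_request: int, wall_clock_s: float) -> dict:
    """Compute simple latency statistics and aggregate output-token throughput."""
    total_output_tokens = output_tokens_per_request * len(latencies_s)
    return {
        "mean_latency_s": statistics.fmean(latencies_s),
        "median_latency_s": statistics.median(latencies_s),
        "max_latency_s": max(latencies_s),
        "output_tokens_per_s": total_output_tokens / wall_clock_s,
    }


# PLACEHOLDER inputs (not measured): four requests of 500 output tokens each,
# completed concurrently within 25 seconds of wall-clock time.
placeholder_latencies = [20.0, 22.0, 21.0, 23.0]
summary = summarize_run(placeholder_latencies, output_tokens_per_request=500, wall_clock_s=25.0)
for key, value in summary.items():
    print(f"{key}: {value:.2f}")
```

Replace the placeholder inputs with the values from the files that `llmperf` writes to the results directory.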

In [None]:
%%bash
sudo kill $(pgrep -f "vllm.entrypoints.openai.api_server")  # Stop the vLLM server

---

#### 6.3 Accuracy evaluation with lm_eval

This approach expands on the accuracy evaluation using logits and enables you to evaluate accuracy using open source datasets like MMLU and GSM8K for tasks such as instruction following and mathematical reasoning.

Under the hood, this accuracy suite uses a vLLM server to serve the model and can use benchmarking clients such as [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate on their supported datasets. Refer to the [Accuracy eval](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/accuracy-eval-with-datasets.html) guide in the Neuron docs for more details.

In [None]:
%%writefile devstral_config.yaml

server:
  name: "Devstral-Small-2505"
  model_path: "/home/ubuntu/Devstral-Small-2505/"
  model_s3_path: null
  compiled_model_path: "/home/ubuntu/traced_model/Devstral-small-2505/"
  max_seq_len: 4096
  context_encoding_len: 4096
  tp_degree: 4
  n_vllm_threads: 32
  server_port: 8000  # must not clash with JupyterLab, which listens on 8888 in this guide
  continuous_batch_size: 1

test:
  accuracy:
    mytest:
      client: "lm_eval"
      datasets: ["gsm8k_cot"]
      max_concurrent_requests: 1
      timeout: 3600
      client_params:
        limit: 200
        use_chat: True
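
Before launching the accuracy run, it can help to confirm that the configured `server_port` is not already in use, for example by JupyterLab or a leftover vLLM server. The sketch below tries to bind the port locally; `port_is_free` is a hypothetical helper name.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is currently bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True
```

Call `port_is_free(...)` with the `server_port` value from the config above; if it returns `False`, stop the process holding the port or pick another port.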

In [None]:
%%bash
if test -f "/home/ubuntu/aws-neuron-samples/inference-benchmarking/devstral_config.yaml"; then
   echo "config file exists."
else 
   echo "Moving config file."
   mv /home/ubuntu/devstral_config.yaml /home/ubuntu/aws-neuron-samples/inference-benchmarking/
fi

In [None]:
%%bash
export LOCAL_WORLD_SIZE=4
cd /home/ubuntu/aws-neuron-samples/inference-benchmarking/
python accuracy.py --config devstral_config.yaml

---

## Conclusion

In this notebook we:

* Compiled **Devstral Small 2505** for Inferentia2 with `inference_demo`.
* Served the model through the Neuron‑enabled **vLLM** server.
* Measured latency and throughput using **llmperf**.
* Verified accuracy with **lm_eval**.

You can now adapt these steps to deploy and benchmark Devstral Small 2505 on Inf2 with your own prompts and workloads.

---

#### Distributors
- AWS
- Mistral