# NXDI Deployment Patterns for Llama models

This notebook demonstrates how to set up an environment for **Amazon EC2 Trn1** (Trainium) instances using the [AWS Neuron SDK](https://awsdocs-neuron.readthedocs-hosted.com/) and shows a small Llama-based example of compiling and running inference on a distributed model.

## Overview
1. **Check/Install Dependencies** for AWS Neuron (tools, vLLM fork, etc.).
2. **Optional**: Install additional utilities (InfluxDB, `llmperf` for performance benchmarking, etc.).
3. **Download** an example model (or place your own model in the correct path).
4. **Run** a short Python script that:
   - Loads a Llama model from a Hugging Face path.
   - Compiles it for Trainium.
   - Runs a few prompts.
   - Demonstrates on-device sampling.

### Prerequisites
- **Amazon EC2 Trn1.2xlarge instance** with AWS Neuron drivers and recommended PyTorch environment.
- **A Python virtual environment** (e.g., `aws_neuronx_venv_pytorch_2_5_nxd_inference`) is highly recommended.
- **Neuron tools** (e.g., `neuron-ls`, `neuron-top`, `neuron-profile`) installed, along with the necessary apt repo definitions. (We show how to install these if needed.)

For more details, see [Install Guide for AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-install-guide/).


## Install and Set up Dependencies

### 1. Validate / Activate Python Environment

Inside a Jupyter notebook, `source myenv/bin/activate` does not persist into subsequent cells, because each `%%bash` cell runs in its own subshell. To activate the environment for interactive work, run the following in a terminal instead:

In [2]:
%%bash
# (Optional) Uncomment or modify the following line to activate a custom environment.
source /opt/aws_neuronx_venv_pytorch_2_5_nxd_inference/bin/activate

echo 'Python environment check:'
which python
python --version

Python environment check:
/opt/aws_neuronx_venv_pytorch_2_5_nxd_inference/bin/python
Python 3.10.12
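
The check above runs inside a `%%bash` subshell, so it does not say which interpreter the notebook kernel itself is using. A minimal sketch of a kernel-side check using only the standard library follows; the helper name `describe_python_env` is illustrative and not part of any Neuron package.

```python
import sys

def describe_python_env() -> dict:
    """Report which interpreter this kernel runs and whether it is a virtual environment."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        # In a venv, sys.prefix points at the venv, while sys.base_prefix
        # points at the interpreter the venv was created from.
        "in_virtualenv": sys.prefix != sys.base_prefix,
    }

info = describe_python_env()
for key, value in info.items():
    print(f"{key}: {value}")
```

If `executable` does not point into the expected environment directory, the kernel was probably started from a different interpreter than the one activated in the terminal.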


In [3]:
%%writefile requirements.txt
torch==2.5.1
transformers==4.45.2
huggingface_hub
git-lfs

Overwriting requirements.txt


In [4]:
!pip install -U -r requirements.txt --quiet


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


### 2. Install AWS Neuron Tools (If Needed)

This cell installs the Neuron packages for profiling and other tooling. If already installed, the script checks and skips. For more information, see [Installing Neuron Tools](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-install-guide/install-aws-neuronx-tools.html).

> **Note**: If you have your apt sources already configured and have installed the Neuron packages, you can skip this step.


In [5]:
%%bash
set -euxo pipefail

# Check if aws-neuronx-tools is installed
if dpkg -s aws-neuronx-tools > /dev/null 2>&1; then
    echo "aws-neuronx-tools is already installed. Skipping."
else
    echo "Installing aws-neuronx-tools..."
    . /etc/os-release

    sudo tee /etc/apt/sources.list.d/neuron.list > /dev/null <<EOF
deb https://apt.repos.neuron.amazonaws.com ${VERSION_CODENAME} main
EOF

    wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | sudo apt-key add -
    sudo apt-get update -y
    sudo apt-get install -y aws-neuronx-runtime-lib aws-neuronx-dkms aws-neuronx-tools
fi


+ dpkg -s aws-neuronx-tools
+ echo 'aws-neuronx-tools is already installed. Skipping.'


aws-neuronx-tools is already installed. Skipping.
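
Once the package is installed, it can be useful to confirm that the individual tools are reachable from the notebook's `PATH`. The sketch below looks up the tool names listed in the prerequisites; the helper name `locate_tools` is illustrative. If a tool is reported as missing even though the package is installed, the install directory (commonly `/opt/aws/neuron/bin`, which is an assumption to verify on your instance) may not be on `PATH`.

```python
import shutil

# Tool names taken from the prerequisites section of this notebook.
NEURON_TOOLS = ("neuron-ls", "neuron-top", "neuron-profile")

def locate_tools(names=NEURON_TOOLS) -> dict:
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}

for name, path in locate_tools().items():
    print(f"{name}: {path or 'not found on PATH'}")
```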


### 3. (Optional) Install Neuron vLLM Fork

If you would like to serve your model via [vLLM](https://vllm.readthedocs.io/en/latest/) specialized for Neuron-based inference, you can install AWS Neuron's vLLM fork. NxD Inference integrates into vLLM by extending the model execution components responsible for loading and invoking models used in vLLM’s LLMEngine (see [link](https://docs.vllm.ai/en/latest/design/arch_overview.html#llm-engine) for more details on vLLM architecture). This means input processing, scheduling and output processing follow the default vLLM behavior.

You enable the Neuron integration in vLLM by setting the device type used by `vLLM` to `neuron`.

Currently, we support continuous batching and streaming generation in the NxD Inference vLLM integration. We are working with the vLLM community to enable support for other vLLM features like PagedAttention and Chunked Prefill on Neuron instances through NxD Inference in upcoming releases.


Skip this step if you do not need the vLLM server. Cloning and installing vLLM takes 8-10 minutes.


In [6]:
%%bash
set -euxo pipefail

if [ -d "/home/ubuntu/upstreaming-to-vllm" ]; then
    echo "Neuron vLLM fork already cloned. Skipping."
else
    echo "Cloning and installing AWS Neuron vLLM fork..."
    cd /home/ubuntu/
    git clone -b v0.6.x-neuron https://github.com/aws-neuron/upstreaming-to-vllm.git
    cd upstreaming-to-vllm
    pip install -r requirements-neuron.txt

    # Install in editable mode with device set to neuron
    VLLM_TARGET_DEVICE="neuron" pip install -e .
fi

+ '[' -d /home/ubuntu/upstreaming-to-vllm ']'
+ echo 'Cloning and installing AWS Neuron vLLM fork...'
+ cd /home/ubuntu/
+ git clone -b v0.6.x-neuron https://github.com/aws-neuron/upstreaming-to-vllm.git


Cloning and installing AWS Neuron vLLM fork...


Cloning into 'upstreaming-to-vllm'...
+ cd upstreaming-to-vllm
+ pip install -r requirements-neuron.txt


Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Ignoring fastapi: markers 'python_version < "3.9"' don't match your environment
Ignoring six: markers 'python_version > "3.11"' don't match your environment
Ignoring setuptools: markers 'python_version > "3.11"' don't match your environment
Collecting py-cpuinfo (from -r /home/ubuntu/upstreaming-to-vllm/requirements-common.txt (line 6))
  Downloading py_cpuinfo-9.0.0-py3-none-any.whl.metadata (794 bytes)
Collecting openai>=1.40.0 (from -r /home/ubuntu/upstreaming-to-vllm/requirements-common.txt (line 13))
  Downloading openai-1.65.4-py3-none-any.whl.metadata (27 kB)
Collecting prometheus-fastapi-instrumentator>=7.0.0 (from -r /home/ubuntu/upstreaming-to-vllm/requirements-common.txt (line 18))
  Downloading prometheus_fastapi_instrumentator-7.0.2-py3-none-any.whl.metadata (13 kB)
Collecting tiktoken>=0.6.0 (from -r /home/ubuntu/upstreaming-to-vllm/requirements-common.txt (line 19))
  Downloading tiktoken


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
+ VLLM_TARGET_DEVICE=neuron
+ pip install -e .


Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Obtaining file:///home/ubuntu/upstreaming-to-vllm
  Installing build dependencies: started
  Installing build dependencies: still running...
  Installing build dependencies: finished with status 'done'
  Checking if build backend supports build_editable: started
  Checking if build backend supports build_editable: finished with status 'done'
  Getting requirements to build editable: started
  Getting requirements to build editable: finished with status 'done'
  Preparing editable metadata (pyproject.toml): started
  Preparing editable metadata (pyproject.toml): finished with status 'done'
Building wheels for collected packages: vllm
  Building editable for vllm (pyproject.toml): started
  Building editable for vllm (pyproject.toml): finished with status 'done'
  Created wheel for vllm: filename=vllm-0.1.dev2830+g22c56ee.neuron216-0.editable-py3-none-any.whl size=11425 sha256=a94a5945e453c76ee17dd975d315


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


### 4. (Optional) Install llmperf

If you'd like to run benchmarks or load tests, you can install [llmperf](https://github.com/ray-project/llmperf). Skip if not needed.


In [7]:
%%bash
if pip show llmperf > /dev/null 2>&1; then
    echo "llmperf is already installed. Skipping."
else
    echo "Installing llmperf..."
    cd /home/ubuntu/
    git clone https://github.com/ray-project/llmperf.git > /dev/null 2>&1
    cd llmperf
    pip install -e .
fi

Installing llmperf...


Cloning into 'llmperf'...


Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Obtaining file:///home/ubuntu/llmperf
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Checking if build backend supports build_editable: started
  Checking if build backend supports build_editable: finished with status 'done'
  Getting requirements to build editable: started
  Getting requirements to build editable: finished with status 'done'
  Preparing editable metadata (pyproject.toml): started
  Preparing editable metadata (pyproject.toml): finished with status 'done'
Collecting pydantic<2.5 (from LLMPerf==0.1.0)
  Downloading pydantic-2.4.2-py3-none-any.whl.metadata (158 kB)
Collecting ray (from LLMPerf==0.1.0)
  Downloading ray-2.43.0-cp310-cp310-manylinux2014_x86_64.whl.metadata (19 kB)
Collecting typer>=0.4 (from LLMPerf==0.1.0)
  Downloading typer-0.15.2-py3-none-any.whl.metadata (15 kB)
Collecting litellm>=0.1.738 (from LLMPerf==0.1.0)
 

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
mistral-common 1.5.3 requires pydantic<3.0,>=2.7, but you have pydantic 2.4.2 which is incompatible.
vllm 0.1.dev2830+g22c56ee.neuron216 requires pydantic>=2.9, but you have pydantic 2.4.2 which is incompatible.


Successfully installed LLMPerf-0.1.0 cachetools-5.5.2 docopt-0.6.2 docstring-parser-0.16 google-api-core-2.24.1 google-auth-2.38.0 google-cloud-aiplatform-1.83.0 google-cloud-bigquery-3.30.0 google-cloud-core-2.4.2 google-cloud-resource-manager-1.14.1 google-cloud-storage-2.19.0 google-crc32c-1.6.0 google-resumable-media-2.7.2 googleapis-common-protos-1.69.0 grpc-google-iam-v1-0.14.1 grpcio-1.70.0 grpcio-status-1.70.0 litellm-1.63.2 msgpack-1.1.0 num2words-0.5.14 proto-plus-1.26.0 pydantic-2.4.2 pydantic-core-2.10.1 ray-2.43.0 shapely-2.0.7 shellingham-1.5.4 typer-0.15.2
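
The resolver output above reports that installing llmperf (which requires `pydantic<2.5`) replaced pydantic with 2.4.2, while the vLLM fork requires `pydantic>=2.9`. Before running the vLLM examples later in this notebook, it may be worth checking which versions ended up installed. This is a minimal sketch using the standard library; the helper name `installed_version` is illustrative.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package: str):
    """Return the installed version string for a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# Packages named in the dependency-conflict message above.
for package in ("pydantic", "vllm", "mistral-common", "LLMPerf"):
    print(f"{package}: {installed_version(package) or 'not installed'}")
```

If the conflict causes problems, one option is to install llmperf in a separate virtual environment so that its pin does not affect the environment used for inference.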




### 5. (Optional) Install InfluxDB 2.x

Install InfluxDB if you plan to use the Neuron Profiler.

In [8]:
%%bash
if dpkg -s influxdb2 > /dev/null 2>&1; then
    echo "InfluxDB2 is already installed, skipping."
    if systemctl is-active --quiet influxdb; then
        echo "InfluxDB is already running."
    else
        sudo systemctl start influxdb
        echo "Setting up InfluxDB ..."
        # influx setup
    fi
else
    # Install InfluxDB
    wget -q https://repos.influxdata.com/influxdata-archive_compat.key
    echo '393e8779c89ac8d958f81f942f9ad7fb82a25e133faddaf92e15b16e6ac9ce4c influxdata-archive_compat.key' | sha256sum -c && \
      cat influxdata-archive_compat.key | gpg --dearmor | sudo tee /etc/apt/trusted.gpg.d/influxdata-archive_compat.gpg > /dev/null
    echo 'deb [signed-by=/etc/apt/trusted.gpg.d/influxdata-archive_compat.gpg] https://repos.influxdata.com/debian stable main' | sudo tee /etc/apt/sources.list.d/influxdata.list
    
    sudo apt-get update && sudo apt-get install influxdb2 influxdb2-cli -y
    sudo systemctl start influxdb
    
    # Run a non-interactive influx setup.
    # Replace the placeholder values below with your own credentials.
    influx setup \
      --username admin \
      --password testpassword \
      --org yourorg \
      --bucket yourbucket \
      --token yoursupersecrettoken \
      --force

fi

influxdata-archive_compat.key: OK
deb [signed-by=/etc/apt/trusted.gpg.d/influxdata-archive_compat.gpg] https://repos.influxdata.com/debian stable main
Hit:1 http://us-west-2.ec2.archive.ubuntu.com/ubuntu jammy InRelease
Get:2 http://us-west-2.ec2.archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:3 http://us-west-2.ec2.archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Get:4 https://repos.influxdata.com/debian stable InRelease [6907 B]
Hit:5 https://download.docker.com/linux/ubuntu jammy InRelease
Get:6 https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/amd64  InRelease [1484 B]
Hit:7 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  InRelease
Hit:8 https://apt.repos.neuron.amazonaws.com jammy InRelease
Hit:9 https://nvidia.github.io/nvidia-docker/ubuntu18.04/amd64  InRelease
Get:10 https://repos.influxdata.com/debian stable/main amd64 Packages [14.6 kB]
Get:11 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 k

W: https://download.docker.com/linux/ubuntu/dists/jammy/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
W: https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/amd64/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
W: https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
W: https://apt.repos.neuron.amazonaws.com/dists/jammy/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
W: https://nvidia.github.io/nvidia-docker/ubuntu18.04/amd64/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.


Reading package lists...
Building dependency tree...
Reading state information...
The following NEW packages will be installed:
  influxdb2 influxdb2-cli
0 upgraded, 2 newly installed, 0 to remove and 29 not upgraded.
Need to get 61.3 MB of archives.
After this operation, 147 MB of additional disk space will be used.
Get:1 https://repos.influxdata.com/debian stable/main amd64 influxdb2 amd64 2.7.11-1 [49.6 MB]
Get:2 https://repos.influxdata.com/debian stable/main amd64 influxdb2-cli amd64 2.7.5-1 [11.7 MB]


dpkg-preconfigure: unable to re-open stdin: No such file or directory


Fetched 61.3 MB in 0s (123 MB/s)
Selecting previously unselected package influxdb2.
(Reading database ... 124863 files and directories currently installed.)
Preparing to unpack .../influxdb2_2.7.11-1_amd64.deb ...
Unpacking influxdb2 (2.7.11-1) ...
Selecting previously unselected package influxdb2-cli.
Preparing to unpack .../influxdb2-cli_2.7.5-1_amd64.deb ...
Unpacking influxdb2-cli (2.7.5-1) ...
Setting up influxdb2 (2.7.11-1) ...
Created symlink /etc/systemd/system/influxd.service → /lib/systemd/system/influxdb.service.
Created symlink /etc/systemd/system/multi-user.target.wants/influxdb.service → /lib/systemd/system/influxdb.service.
Setting up influxdb2-cli (2.7.5-1) ...


Deep recursion on subroutine "NeedRestart::Interp::Python::_scan" at /usr/share/perl5/NeedRestart/Interp/Python.pm line 78.


Running kernel seems to be up-to-date.

Services to be restarted:
 systemctl restart acpid.service
 systemctl restart chrony.service
 systemctl restart containerd.service
 systemctl restart cron.service
 systemctl restart irqbalance.service
 systemctl restart multipathd.service
 systemctl restart packagekit.service
 systemctl restart polkit.service
 systemctl restart rpcbind.service
 systemctl restart rsyslog.service
 systemctl restart serial-getty@ttyS0.service
 systemctl restart ssh.service
 systemctl restart systemd-journald.service
 /etc/needrestart/restart.d/systemd-manager
 systemctl restart systemd-networkd.service
 systemctl restart systemd-resolved.service
 systemctl restart systemd-udevd.service

Service restarts being deferred:
 /etc/needrestart/restart.d/dbus.service
 systemctl restart docker.service
 systemctl restart getty@tty1.service
 systemctl restart networkd-dispatcher.service
 systemctl restart systemd-logind.service
 systemctl restart unattended-upgrades.service
 

In [9]:
!pip list| grep neuron

libneuronxla                      2.1.714.0
neuronx-cc                        2.16.372.0+4a9b2326
neuronx-distributed               0.10.1
neuronx-distributed-inference     0.1.1
torch-neuronx                     2.5.1.2.4.0
vllm                              0.1.dev2830+g22c56ee.neuron216 /home/ubuntu/upstreaming-to-vllm


## 6. Download or Provide Your Model

The cells below clone a model from the Hugging Face Hub. You can skip or adjust them if you already have a local model.

For more information on model checkpoint usage, see the [Neuron Inference with Hugging Face-based models](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/pytorch/).

You will need to log in to Hugging Face. Get an access token from https://huggingface.co/settings/tokens and paste it into the prompt displayed by the cell below.

In [15]:
!git config --global credential.helper store
from huggingface_hub import notebook_login
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…
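
When the interactive widget is inconvenient (for example, in a headless session), the token can instead be read from an environment variable and passed to `huggingface_hub.login`. This is a minimal sketch: the helper name `hf_login_from_env` is illustrative, and using `HF_TOKEN` as the variable name is an assumption about how you export your token.

```python
import os

def hf_login_from_env(env_var: str = "HF_TOKEN") -> bool:
    """Log in to the Hugging Face Hub using a token from the environment.

    Returns False without contacting the Hub when the variable is unset or empty.
    """
    token = os.environ.get(env_var)
    if not token:
        return False
    # Imported lazily so the helper can be defined even before the package is installed.
    from huggingface_hub import login
    login(token=token)
    return True
```

Calling `hf_login_from_env()` after exporting the token in the shell that launched Jupyter avoids pasting the token into the notebook itself.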

In [None]:
# Run the following in a terminal:

In [None]:
sudo apt-get update
sudo apt-get install git-lfs
git lfs install

In [None]:
# Check that git-lfs is installed and on PATH

In [21]:
!git lfs version

git-lfs/3.0.2 (GitHub; linux amd64; go 1.18.1)


In [22]:
!git clone https://huggingface.co/meta-llama/Llama-3.2-1B

Git LFS initialized.
Cloning into 'Llama-3.2-1B'...
remote: Enumerating objects: 76, done.
remote: Counting objects: 100% (73/73), done.
remote: Compressing objects: 100% (73/73), done.
remote: Total 76 (delta 33), reused 0 (delta 0), pack-reused 3 (from 1)
Unpacking objects: 100% (76/76), 2.27 MiB | 3.26 MiB/s, done.
Filtering content: 100% (3/3), 4.60 GiB | 94.51 MiB/s, done.


In [None]:
# Check that the full model was downloaded

In [27]:
!du -sh /home/ubuntu/Llama-3.2-1B/

9.3G	/home/ubuntu/Llama-3.2-1B/
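
The directory size alone does not prove that the weights were fetched. The 9.3G figure is roughly twice the 4.60 GiB of LFS content reported during the clone, which is likely because `du` also counts the copy that Git LFS keeps under `.git`. A more direct check is to look for weight files that are still Git LFS pointer stubs, which are small text files beginning with a fixed version line. The sketch below does that; the helper name `find_lfs_pointers` and the list of weight suffixes are illustrative assumptions.

```python
from pathlib import Path

# Git LFS pointer files start with this line instead of binary weight data.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"

def find_lfs_pointers(model_dir: str, suffixes=(".safetensors", ".bin", ".pth")) -> list:
    """Return weight files that are still Git LFS pointer stubs rather than real data."""
    pointers = []
    for path in sorted(Path(model_dir).rglob("*")):
        # Skip Git's internal storage and anything that is not a weight file.
        if ".git" in path.parts or not path.is_file() or path.suffix not in suffixes:
            continue
        with path.open("rb") as handle:
            if handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX:
                pointers.append(str(path))
    return pointers

model_dir = Path("/home/ubuntu/Llama-3.2-1B/")
if model_dir.is_dir():
    stubs = find_lfs_pointers(str(model_dir))
    print("Pointer stubs found:" if stubs else "No pointer stubs found.", *stubs)
```

An empty result suggests the weight files contain real data; any listed file would need `git lfs pull` inside the repository.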


## 7. Llama Python Generation Demo

Below is a Python script that uses **AWS Neuron** libraries to:
- Import libraries including `torch`, `transformers`, and specialized Neuron classes.
- Load a **Llama** model from a local Hugging Face directory.
- Compile and store the traced model to a specified location.
- Reload from the compiled artifacts.
- Generate text based on a couple of example prompts.

This process illustrates how [NeuronX Distributed Inference](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/pytorch/torch-neuronx-distributed-inference/index.html) compiles a model for a chosen tensor-parallel degree (`tp_degree`), for example 2 on a trn1.2xlarge or 32 on a trn1.32xlarge, and runs sampling on the device.

> **Important**: Ensure the `tp_degree` in `NeuronConfig` does not exceed the number of NeuronCores available on your instance.
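
A small guard can catch a mismatched `tp_degree` before a long compile starts. This is a minimal sketch: `validate_tp_degree` is a hypothetical helper, and the core count is passed in by hand rather than queried from the hardware.

```python
def validate_tp_degree(tp_degree: int, available_cores: int) -> None:
    """Raise ValueError if tp_degree cannot be placed on the available NeuronCores."""
    if tp_degree < 1:
        raise ValueError("tp_degree must be at least 1")
    if tp_degree > available_cores:
        raise ValueError(
            f"tp_degree={tp_degree} exceeds the {available_cores} NeuronCores available"
        )

# A trn1.2xlarge exposes 2 NeuronCores, so the tp_degree=2 used in this notebook fits.
validate_tp_degree(tp_degree=2, available_cores=2)
print("tp_degree OK")
```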

Once you run this code cell, it will:
- Set a random seed.
- Compile the Llama model and store the compiled artifacts in `traced_model_path`.
- Reload the compiled checkpoint.
- Generate text from the prompts.

If you would like to modify the prompts, simply edit them in the `prompts` list.
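
The script sets `do_sample=True` together with `top_k=1`. With a single candidate, sampling always returns the highest-scoring token, so the result matches greedy decoding. The pure-Python sketch below illustrates that behaviour; it is not the Neuron on-device implementation, and `sample_top_k` is a hypothetical helper that assumes `temperature > 0`.

```python
import math
import random

def sample_top_k(logits, top_k, temperature=1.0, rng=random):
    """Sample an index from the top_k highest logits after temperature scaling."""
    # Keep only the indices of the top_k largest logits.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    scaled = [logits[i] / temperature for i in ranked]
    peak = max(scaled)
    # Unnormalized softmax weights, shifted by the peak for numerical stability.
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(ranked, weights=weights, k=1)[0]

logits = [0.1, 2.5, -1.0, 1.7]
greedy = max(range(len(logits)), key=lambda i: logits[i])
draws = {sample_top_k(logits, top_k=1) for _ in range(100)}
print(f"greedy index: {greedy}, indices drawn with top_k=1: {sorted(draws)}")
```

Raising `top_k` above 1 lets lower-ranked tokens appear, with `temperature` controlling how strongly the top candidates are favoured.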


In [28]:
import torch
from transformers import AutoTokenizer, GenerationConfig

from neuronx_distributed_inference.models.config import NeuronConfig, OnDeviceSamplingConfig
from neuronx_distributed_inference.models.llama.modeling_llama import LlamaInferenceConfig, NeuronLlamaForCausalLM
from neuronx_distributed_inference.utils.hf_adapter import HuggingFaceGenerationAdapter, load_pretrained_config
from neuronx_distributed_inference.modules.generation.sampling import prepare_sampling_params

# Modify these paths as needed (this example uses Llama-3.2-1B).
model_path = "/home/ubuntu/Llama-3.2-1B/"       # The original HF directory for your Llama model
traced_model_path = "/home/ubuntu/traced_model/Llama-3.2-1B/"  # Where to store the compiled artifacts

torch.manual_seed(0)

def run_llama_generate():
    # Initialize configs and tokenizer.
    generation_config = GenerationConfig.from_pretrained(model_path)
    # Some sample overrides for generation
    generation_config_kwargs = {
        "do_sample": True,
        "top_k": 1,
        "pad_token_id": generation_config.eos_token_id,
    }
    generation_config.update(**generation_config_kwargs)

    # Set up the Neuron config (tensor parallel = 2, batch = 2, etc.)
    neuron_config = NeuronConfig(
        tp_degree=2,
        batch_size=2,
        max_context_length=32,
        seq_len=64,
        on_device_sampling_config=OnDeviceSamplingConfig(top_k=1),
        enable_bucketing=True,
        flash_decoding_enabled=False
    )

    # Build the Llama Inference config
    config = LlamaInferenceConfig(
        neuron_config,
        load_config=load_pretrained_config(model_path),
    )

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side="right")
    tokenizer.pad_token = tokenizer.eos_token

    # Compile and save model.
    print("\nCompiling and saving model...")
    model = NeuronLlamaForCausalLM(model_path, config)
    model.compile(traced_model_path)
    tokenizer.save_pretrained(traced_model_path)

    # Load from compiled checkpoint.
    print("\nLoading model from compiled checkpoint...")
    model = NeuronLlamaForCausalLM(traced_model_path)
    model.load(traced_model_path)
    tokenizer = AutoTokenizer.from_pretrained(traced_model_path)

    # Generate outputs.
    print("\nGenerating outputs...")
    prompts = ["I believe the meaning of life is", "The color of the sky is"]

    # Example: per-prompt sampling parameters (one value per batch element)
    sampling_params = prepare_sampling_params(batch_size=neuron_config.batch_size,
                                             top_k=[10, 5],
                                             top_p=[0.5, 0.9],
                                             temperature=[0.9, 0.5])

    print(f"Prompts: {prompts}")
    inputs = tokenizer(prompts, padding=True, return_tensors="pt")
    generation_model = HuggingFaceGenerationAdapter(model)
    outputs = generation_model.generate(
        inputs.input_ids,
        generation_config=generation_config,
        attention_mask=inputs.attention_mask,
        max_length=model.config.neuron_config.max_length,
        sampling_params=sampling_params,
    )
    output_tokens = tokenizer.batch_decode(outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False)

    print("Generated outputs:")
    for i, output_token in enumerate(output_tokens):
        print(f"Output {i}: {output_token}")

# Run example if you wish.
if __name__ == "__main__":
    run_llama_generate()

INFO:Neuron:Generating HLOs for the following models: ['context_encoding_model', 'token_generation_model']



Compiling and saving model...
[2025-03-06 19:03:24.043: I neuronx_distributed/parallel_layers/parallel_state.py:518] > initializing tensor model parallel with size 2
[2025-03-06 19:03:24.043: I neuronx_distributed/parallel_layers/parallel_state.py:519] > initializing pipeline model parallel with size 1
[2025-03-06 19:03:24.044: I neuronx_distributed/parallel_layers/parallel_state.py:520] > initializing data parallel with size 1
[2025-03-06 19:03:24.044: I neuronx_distributed/parallel_layers/parallel_state.py:521] > initializing world size to 2
[2025-03-06 19:03:24.046: I neuronx_distributed/parallel_layers/parallel_state.py:307] [rank_0_pp-1_tp-1_dp-1] Chosen Logic for replica groups ret_logic=<PG_Group_Logic.LOGIC1: (<function ascending_ring_PG_group at 0x7f17f4b24d30>, 'Ascending Ring PG Group')>
[2025-03-06 19:03:24.047: I neuronx_distributed/parallel_layers/parallel_state.py:557] [rank_0_pp-1_tp-1_dp-1] tp_groups: replica_groups.tp_groups=[[0, 1]]
[2025-03-06 19:03:24.047: I neuro

INFO:Neuron:Generating 1 hlos for key: context_encoding_model
INFO:Neuron:Started loading module context_encoding_model
INFO:Neuron:Finished loading module context_encoding_model in 0.07097268104553223 seconds
INFO:Neuron:generating HLO: context_encoding_model, input example shape = torch.Size([2, 32])
INFO:Neuron:Generating 1 hlos for key: token_generation_model
INFO:Neuron:Started loading module token_generation_model
INFO:Neuron:Finished loading module token_generation_model in 0.05785727500915527 seconds
INFO:Neuron:generating HLO: token_generation_model, input example shape = torch.Size([2, 1])
INFO:Neuron:Started compilation for all HLOs


..

INFO:Neuron:Done compilation for the priority HLO


.
Compiler status PASS


INFO:Neuron:Updating the hlo module with optimized layout
INFO:Neuron:Done optimizing weight layout for all HLOs


.

INFO:Neuron:Finished Compilation for all HLOs



Compiler status PASS
..

INFO:Neuron:Done preparing weight layout transformation
INFO:Neuron:Sharding Weights for ranks: 0...1



Compiler status PASS
[2025-03-06 19:04:48.577: I neuronx_distributed/parallel_layers/parallel_state.py:518] > initializing tensor model parallel with size 2
[2025-03-06 19:04:48.577: I neuronx_distributed/parallel_layers/parallel_state.py:519] > initializing pipeline model parallel with size 1
[2025-03-06 19:04:48.578: I neuronx_distributed/parallel_layers/parallel_state.py:520] > initializing data parallel with size 1
[2025-03-06 19:04:48.578: I neuronx_distributed/parallel_layers/parallel_state.py:521] > initializing world size to 2
[2025-03-06 19:04:48.580: I neuronx_distributed/parallel_layers/parallel_state.py:307] [rank_0_pp-1_tp-1_dp-1] Chosen Logic for replica groups ret_logic=<PG_Group_Logic.LOGIC1: (<function ascending_ring_PG_group at 0x7f17f4b24d30>, 'Ascending Ring PG Group')>
[2025-03-06 19:04:48.580: I neuronx_distributed/parallel_layers/parallel_state.py:557] [rank_0_pp-1_tp-1_dp-1] tp_groups: replica_groups.tp_groups=[[0, 1]]
[2025-03-06 19:04:48.581: I neuronx_distri



[2025-03-06 19:04:50.449: I neuronx_distributed/parallel_layers/parallel_state.py:518] > initializing tensor model parallel with size 2
[2025-03-06 19:04:50.450: I neuronx_distributed/parallel_layers/parallel_state.py:519] > initializing pipeline model parallel with size 1
[2025-03-06 19:04:50.450: I neuronx_distributed/parallel_layers/parallel_state.py:520] > initializing data parallel with size 1
[2025-03-06 19:04:50.451: I neuronx_distributed/parallel_layers/parallel_state.py:521] > initializing world size to 2
[2025-03-06 19:04:50.453: I neuronx_distributed/parallel_layers/parallel_state.py:307] [rank_0_pp-1_tp-1_dp-1] Chosen Logic for replica groups ret_logic=<PG_Group_Logic.LOGIC1: (<function ascending_ring_PG_group at 0x7f17f4b24d30>, 'Ascending Ring PG Group')>
[2025-03-06 19:04:50.453: I neuronx_distributed/parallel_layers/parallel_state.py:557] [rank_0_pp-1_tp-1_dp-1] tp_groups: replica_groups.tp_groups=[[0, 1]]
[2025-03-06 19:04:50.453: I neuronx_distributed/parallel_layers/

INFO:Neuron:Done Sharding weights



Loading model from compiled checkpoint...


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.



Generating outputs...
Prompts: ['I believe the meaning of life is', 'The color of the sky is']
2025-Mar-06 19:05:13.0730 21230:26358 [0] nccl_net_ofi_rdma_init:7734 CCOM WARN NET/OFI OFI fi_getinfo() call failed: No data available
2025-Mar-06 19:05:13.0740 21230:26358 [0] nccl_net_ofi_create_plugin:251 CCOM WARN NET/OFI Unable to find a protocol that worked.  Failing initialization.
2025-Mar-06 19:05:13.0745 21230:26358 [0] nccl_net_ofi_create_plugin:316 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2025-Mar-06 19:05:13.0750 21230:26358 [0] nccl_net_ofi_init:139 CCOM WARN NET/OFI Initializing plugin failed
2025-Mar-06 19:05:13.0754 21230:26358 [0] net_plugin.cc:94 CCOM WARN OFI plugin initNet() failed is EFA enabled?
Generated outputs:
Output 0: I believe the meaning of life is to find your passion and to live it. I believe that the best way to do that is to find your purpose. I believe that the best way to find your purpose is to find your passion. I believe that the best way 

## 8. Running the Llama Example
If you haven’t already, **run the cell above**. It compiles the model and generates sample text. The `if __name__ == "__main__":` guard at the end of the cell calls `run_llama_generate()` as soon as the cell executes.

After the model is compiled, you should see logs similar to:

```
Generating outputs...
Prompts: ['I believe the meaning of life is', 'The color of the sky is']
2025-Mar-06 19:05:13.0730 21230:26358 [0] nccl_net_ofi_rdma_init:7734 CCOM WARN NET/OFI OFI fi_getinfo() call failed: No data available
2025-Mar-06 19:05:13.0740 21230:26358 [0] nccl_net_ofi_create_plugin:251 CCOM WARN NET/OFI Unable to find a protocol that worked.  Failing initialization.
2025-Mar-06 19:05:13.0745 21230:26358 [0] nccl_net_ofi_create_plugin:316 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2025-Mar-06 19:05:13.0750 21230:26358 [0] nccl_net_ofi_init:139 CCOM WARN NET/OFI Initializing plugin failed
2025-Mar-06 19:05:13.0754 21230:26358 [0] net_plugin.cc:94 CCOM WARN OFI plugin initNet() failed is EFA enabled?
Generated outputs:
Output 0: I believe the meaning of life is to find your passion and to live it. I believe that the best way to do that is to find your purpose. I believe that the best way to find your purpose is to find your passion. I believe that the best way to find your passion is to find your purpose.
Output 1: The color of the sky is blue, the color of the water is green, the color of the grass is yellow, the color of the trees is red, the color of the flowers is white, the color of the clouds is black, the color of the sun is yellow, the color of the moon is
```

That’s it! You have successfully compiled a Llama model for inference with AWS Neuron.

Feel free to edit **`tp_degree`**, **`batch_size`**, **`seq_len`**, or the **prompts** to reflect your needs. For more advanced usage, see the [AWS Neuron Distributed Inference Documentation](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/pytorch/torch-neuronx-distributed-inference/index.html).
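
When changing `seq_len`, keep its relationship to `max_context_length` in mind. Assuming `seq_len` bounds the prompt plus generated tokens and `max_context_length` bounds the prompt alone, the difference is the number of tokens left for generation. The sketch below works this out for the values used in this notebook; `generation_budget` is a hypothetical helper, not part of the NxD Inference API.

```python
def generation_budget(seq_len: int, max_context_length: int) -> int:
    """Tokens available for generation when the prompt uses the full context length."""
    if max_context_length > seq_len:
        raise ValueError("max_context_length must not exceed seq_len")
    return seq_len - max_context_length

# Values from the NeuronConfig in this notebook: seq_len=64, max_context_length=32.
print(generation_budget(seq_len=64, max_context_length=32))  # 32
```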

In [34]:
import sys
print(sys.executable)
!which pip
!pip list | grep vllm

/opt/aws_neuronx_venv_pytorch_2_5_nxd_inference/bin/python3
/opt/aws_neuronx_venv_pytorch_2_5_nxd_inference/bin/pip


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


vllm                              0.1.dev2830+g22c56ee.neuron216 /home/ubuntu/upstreaming-to-vllm
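
The `huggingface/tokenizers` message in the output above is a warning, not an error: each `!` shell command forks the notebook process after the tokenizer has already used parallelism. As the message itself suggests, setting `TOKENIZERS_PARALLELISM` explicitly avoids it. A minimal sketch, assuming it runs in a cell before the tokenizer is first used (ideally the notebook's first cell):

```python
import os

# Set the variable before the tokenizer is first used so the choice is explicit.
# "false" turns off the tokenizer's internal parallelism; use "true" to keep it.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```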


## 9. vLLM demo

### 9.1 Offline Inference Example

The following example runs offline inference with vLLM. Bucketing is disabled here only to demonstrate how to override Neuron configuration values through `override_neuron_config`; keeping it enabled generally delivers better performance.

In [1]:
!pip list | grep neuron

libneuronxla                      2.1.714.0
neuronx-cc                        2.16.372.0+4a9b2326
neuronx-distributed               0.10.1
neuronx-distributed-inference     0.1.1
torch-neuronx                     2.5.1.2.4.0
vllm                              0.1.dev2830+g22c56ee.neuron216 /home/ubuntu/upstreaming-to-vllm


In [6]:
import os

os.environ['VLLM_NEURON_FRAMEWORK'] = "neuronx-distributed-inference"
# Point to the directory of your already-compiled artifacts:
os.environ['NEURON_COMPILED_ARTIFACTS'] = "/home/ubuntu/traced_model/Llama-3.2-1B/"

from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(top_k=1)

# Create an LLM.
llm = LLM(
    model="meta-llama/Llama-3.2-1B-Instruct",
    max_num_seqs=1,
    max_model_len=32,
    override_neuron_config={
        "enable_bucketing":False,
    },
    device="neuron",
    tensor_parallel_size=2)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

INFO 03-06 19:47:42 config.py:901] Defaulting to use mp for distributed inference
INFO 03-06 19:47:42 llm_engine.py:226] Initializing an LLM engine (v0.1.dev2830+g22c56ee) with config: model='meta-llama/Llama-3.2-1B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.2-1B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={'enable_bucketing': False}, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_mo

Processed prompts: 100%|██████████████████████████████████████████████| 3/3 [00:02<00:00,  1.44it/s, est. speed input: 9.59 toks/s, output: 23.01 toks/s]

Prompt: 'The president of the United States is', Generated text: ' the head of state of the United States of America. The president is the commander'
Prompt: 'The capital of France is', Generated text: ' Paris. The city is located in the north of the country. The city is'
Prompt: 'The future of AI is', Generated text: ' here. It’s not just a buzzword anymore. It’s a reality that'





In [3]:
!neuron-ls

instance-type: trn1.2xlarge
instance-id: i-06f522ff86769dc0f
+--------+--------+--------+---------+
| NEURON | NEURON | NEURON |   PCI   |
| DEVICE | CORES  | MEMORY |   BDF   |
+--------+--------+--------+---------+
| 0      | 2      | 32 GB  | 00:1e.0 |
+--------+--------+--------+---------+


In [None]:
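# Replace #PID with the ID of the process to stop (for example, one that still holds the NeuronCores).
!sudo kill #PID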

In [None]:
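# Run in a terminal.
# Set the environment variable for the compiled artifacts (NEURON_COMPILED_ARTIFACTS) again there,
# since a new shell does not inherit values set with os.environ in the notebook.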

In [None]:
# Shell command: run in a terminal.

VLLM_NEURON_FRAMEWORK='neuronx-distributed-inference' python -m vllm.entrypoints.openai.api_server \
    --model="meta-llama/Llama-3.2-1B" \
    --max-num-seqs=1 \
    --max-model-len=32 \
    --tensor-parallel-size=2 \
    --port=8888 \
    --device "neuron" \
    --override-neuron-config "{\"enable_bucketing\":false}"
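
Once the server reports that it is listening, it can be queried from a second terminal or a notebook cell. The sketch below uses only the Python standard library. It assumes the server started above is reachable on `localhost:8888` and exposes vLLM's OpenAI-compatible `/v1/completions` route under the model name passed to `--model`. The helper names `build_completion_request` and `complete` are illustrative, and `max_tokens` is kept small so that prompt plus output stays within `--max-model-len=32`.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8888/v1"   # assumption: matches --port above
MODEL_NAME = "meta-llama/Llama-3.2-1B"  # assumption: matches --model above


def build_completion_request(prompt, max_tokens=16):
    """Build (but do not send) a POST request for the completions route."""
    payload = {"model": MODEL_NAME, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, max_tokens=16, timeout=60):
    """Send the request and return the generated text of the first choice."""
    request = build_completion_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server running, `print(complete("The capital of France is"))` sends one prompt and prints the continuation. Nothing is sent when the module is merely imported.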

In [2]:
%%bash
# This should be the same path to which the model was downloaded (also used in the above steps).
MODEL_PATH="/home/ubuntu/models/Llama-3.2-1B-Instruct/"
# This is the name of the directory where the test results will be saved.
OUTPUT_PATH=llmperf-results-sonnets

export OPENAI_API_BASE="http://0.0.0.0:8888"
export OPENAI_API_KEY="mock_key"

python token_benchmark_ray.py \
    --model $MODEL_PATH \
    --mean-input-tokens 32 \
    --stddev-input-tokens 0 \
    --mean-output-tokens 32 \
    --stddev-output-tokens 0 \
    --num-concurrent-requests 1 \
    --timeout 3600 \
    --max-num-completed-requests 50 \
    --tokenizer $MODEL_PATH \
    --additional-sampling-params '{}' \
    --results-dir $OUTPUT_PATH \
    --llm-api "openai"

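
After a benchmark run completes, the results directory can be inspected from Python. The sketch below assumes that `llmperf` writes one or more `*_summary.json` files into the directory given by `--results-dir`. It makes no assumption about individual metric names and simply lists every top-level numeric entry it finds. `load_summaries` and `numeric_metrics` are hypothetical helper names.

```python
import json
from pathlib import Path


def load_summaries(results_dir):
    """Map each *_summary.json file name in results_dir to its parsed content."""
    return {
        path.name: json.loads(path.read_text())
        for path in sorted(Path(results_dir).glob("*_summary.json"))
    }


def numeric_metrics(summary):
    """Keep only the top-level numeric entries of a summary dict (booleans excluded)."""
    return {
        key: value
        for key, value in summary.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }


if __name__ == "__main__":
    # Directory name taken from OUTPUT_PATH in the benchmark cell.
    for name, summary in load_summaries("llmperf-results-sonnets").items():
        print(name)
        for key, value in sorted(numeric_metrics(summary).items()):
            print(f"  {key}: {value}")
```

If the directory is empty or missing, the loop prints nothing, which is a quick hint that the benchmark did not finish.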

# Notebook Wrap-Up
You now have:
1. Installed the core AWS Neuron tools and optional packages (`vLLM`, `llmperf`, `InfluxDB`).
2. Optionally downloaded or placed your Llama model in a local directory.
3. Compiled and run a short demonstration of Llama-based text generation on Trainium.

For more advanced topics:
- **Profiling**: See [Neuron Profiling Tools](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-profile/index.html).
- **Distributed Serving**: Explore vLLM or other serving frameworks.
- **Performance Benchmarking**: Use `llmperf` or custom scripts.
