# Hello World Example

This is a simple Jupyter Notebook that walks through the 4 steps of compiling and running a PyTorch model on the embedded Neural Processing Unit (NPU) in your AMD Ryzen AI enabled PC. The steps are as follows:

1. Get model - download or create a PyTorch model that we will run on the NPU.
2. Export to ONNX - convert the PyTorch model to ONNX format.
3. Quantize - optimize the model for faster inference on the NPU by reducing its precision to INT8.
4. Run Model on CPU and NPU - compare performance between running the model on the CPU and on the NPU.

In [2]:
# Before starting, be sure you've installed the requirements listed in the requirements.txt file:
!python -m pip install -r requirements.txt



### 0. Imports & Environment Variables

We'll use the following imports in our example. `torch` and `torch.nn` are used for building and running ML models; we'll use them to define a small neural network and to generate the model weights. `os` is used for interacting with the operating system: environment variables, file paths, and directories. `subprocess` allows us to retrieve the hardware information. `onnx` and `onnxruntime` are used to work with our model in the ONNX format and to run inference. `numpy` generates the random input data, `shutil` removes the compilation cache, and `timeit` supplies the timer for measuring execution time. `vai_q_onnx` is part of the Vitis AI Quantizer for ONNX models; its import is commented out because this notebook quantizes with AMD Quark (`quark.onnx`), which is imported in step 3.

In [1]:
import torch
import torch.nn as nn
import os
import subprocess
import onnxruntime
import numpy as np
import onnx
import shutil
from timeit import default_timer as timer
#import vai_q_onnx

We also want to set the environment variables based on the NPU device in our PC. For more information about NPU configurations, refer to the official [AMD Ryzen AI Documentation](https://ryzenai.docs.amd.com/en/latest/runtime_setup.html).

In [None]:
# This function detects the APU (NPU) type in your system to configure environment variables for hardware-specific optimization.
def get_npu_info():
    # Run pnputil as a subprocess to enumerate PCI devices
    command = r'pnputil /enum-devices /bus PCI /deviceids '
    process = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    # Check for supported Hardware IDs
    npu_type = ''
    if 'PCI\\VEN_1022&DEV_1502&REV_00' in stdout.decode(): npu_type = 'PHX/HPT'
    if 'PCI\\VEN_1022&DEV_17F0&REV_00' in stdout.decode(): npu_type = 'STX'
    if 'PCI\\VEN_1022&DEV_17F0&REV_10' in stdout.decode(): npu_type = 'STX'
    if 'PCI\\VEN_1022&DEV_17F0&REV_11' in stdout.decode(): npu_type = 'STX'
    if 'PCI\\VEN_1022&DEV_17F0&REV_20' in stdout.decode(): npu_type = 'KRK'
    return npu_type

npu_type = get_npu_info()
print(f"NPU Type: {npu_type}")

NPU Type: KRK
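
The chain of `if` statements in `get_npu_info()` can also be written as a lookup table, which makes it easier to add a device ID later. The sketch below is one possible arrangement rather than the notebook's own code: `HARDWARE_IDS` and `classify_npu` are hypothetical names, and the function takes the `pnputil` output as a string so we can try it without the hardware. The sample string is made up for illustration.

```python
# Hardware IDs copied from get_npu_info(), mapped to the NPU type they indicate.
HARDWARE_IDS = {
    r"PCI\VEN_1022&DEV_1502&REV_00": "PHX/HPT",
    r"PCI\VEN_1022&DEV_17F0&REV_00": "STX",
    r"PCI\VEN_1022&DEV_17F0&REV_10": "STX",
    r"PCI\VEN_1022&DEV_17F0&REV_11": "STX",
    r"PCI\VEN_1022&DEV_17F0&REV_20": "KRK",
}

def classify_npu(pnputil_output: str) -> str:
    """Return the NPU type for the last matching ID, or '' if none match."""
    npu_type = ""
    for hardware_id, name in HARDWARE_IDS.items():
        if hardware_id in pnputil_output:
            npu_type = name  # later entries win, as in the original if-chain
    return npu_type

# Made-up sample line standing in for real pnputil output.
sample = r"Hardware IDs: PCI\VEN_1022&DEV_17F0&REV_20"
print(classify_npu(sample))  # KRK
```

In the notebook, `stdout.decode()` from the `pnputil` subprocess would be passed in place of `sample`.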


In [None]:
# XLNX_VART_FIRMWARE - Specifies the firmware file used by the NPU for runtime execution
# NUM_OF_DPU_RUNNERS - Specifies the number of DPU runners (processing cores) available for execution
# XLNX_TARGET_NAME - Name of the target hardware configuration

def set_environment_variable(npu_type):

    install_dir = os.environ['RYZEN_AI_INSTALLATION_PATH']
    match npu_type:
        case 'PHX/HPT':
            print("Setting environment for PHX/HPT")
            os.environ['XLNX_VART_FIRMWARE']= os.path.join(install_dir, 'voe-4.0-win_amd64', 'xclbins', 'phoenix', '4x4.xclbin')
            os.environ['NUM_OF_DPU_RUNNERS']='1'
            os.environ['XLNX_TARGET_NAME']='AMD_AIE2_Nx4_Overlay'
        case 'STX' | 'KRK':
            print("Setting environment for STX/KRK")
            os.environ['XLNX_VART_FIRMWARE']= os.path.join(install_dir, 'voe-4.0-win_amd64', 'xclbins', 'strix', 'AMD_AIE2P_4x4_Overlay.xclbin')
            os.environ['NUM_OF_DPU_RUNNERS']='1'
            os.environ['XLNX_TARGET_NAME']='AMD_AIE2_Nx4_Overlay'
        case _:
            print("Unrecognized APU type. Exiting.")
            exit()
    print('XLNX_VART_FIRMWARE=', os.environ['XLNX_VART_FIRMWARE'])
    print('NUM_OF_DPU_RUNNERS=', os.environ['NUM_OF_DPU_RUNNERS'])
    print('XLNX_TARGET_NAME=', os.environ['XLNX_TARGET_NAME'])

set_environment_variable(npu_type)

Setting environment for STX/KRK
XLNX_VART_FIRMWARE= C:\Program Files\RyzenAI\1.6.0\voe-4.0-win_amd64\xclbins\strix\AMD_AIE2P_4x4_Overlay.xclbin
NUM_OF_DPU_RUNNERS= 1
XLNX_TARGET_NAME= AMD_AIE2_Nx4_Overlay
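
If `RYZEN_AI_INSTALLATION_PATH` is not set, `os.environ[...]` raises a bare `KeyError`, and a wrong install path only surfaces later when the NPU session fails to load the firmware. A small check can report both problems early. This is a minimal sketch, assuming we only want to verify that the file exists; `resolve_xclbin` is a hypothetical helper, not part of the Ryzen AI software.

```python
import os

def resolve_xclbin(install_dir, *parts):
    """Join install_dir with the path parts and confirm the file exists."""
    if not install_dir:
        raise RuntimeError("RYZEN_AI_INSTALLATION_PATH is not set")
    path = os.path.join(install_dir, *parts)
    if not os.path.isfile(path):
        raise FileNotFoundError(f"xclbin not found: {path}")
    return path

# Usage on a Ryzen AI machine (path parts taken from the cell above):
# firmware = resolve_xclbin(os.environ.get('RYZEN_AI_INSTALLATION_PATH'),
#                           'voe-4.0-win_amd64', 'xclbins', 'strix',
#                           'AMD_AIE2P_4x4_Overlay.xclbin')
# os.environ['XLNX_VART_FIRMWARE'] = firmware
```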


### 1. Get Model
Here, we'll use the PyTorch library to define and instantiate a simple neural network model called `SmallModel` as a starting point. You can swap this model with any custom model, but make sure the input/output shapes remain compatible.

In [4]:
torch.manual_seed(0)

class SmallModel(nn.Module):
    def __init__(self):
        super(SmallModel, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1)
        self.conv3 = nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1)
        self.conv4 = nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.conv1(x)
        x = self.relu(x)
        
        x = self.conv2(x)
        x = self.relu(x) 
        
        x = self.conv3(x)
        x = self.relu(x) 
        
        x = self.conv4(x)
        x = self.relu(x) 
        
        x = torch.add(x, 1)
        
        return x

# Instantiate the model
pytorch_model = SmallModel()

pytorch_model.eval()

# Print the model architecture
print(pytorch_model)

SmallModel(
  (conv1): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv2): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv3): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv4): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (relu): ReLU()
)
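
Because every convolution uses a 3x3 kernel with stride 1 and padding 1, the spatial size of the input is preserved through all four layers; only the channel count grows. The sketch below checks this with the standard convolution size formula (dilation 1) for the 224x224 input used in the export step, and counts the trainable parameters. The helper names are hypothetical.

```python
def conv2d_out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution with dilation 1."""
    return (size + 2 * padding - kernel) // stride + 1

def conv2d_params(in_ch, out_ch, kernel):
    """Weights plus one bias per output channel."""
    return out_ch * in_ch * kernel * kernel + out_ch

# (in_channels, out_channels) for conv1..conv4 in SmallModel
layers = [(3, 32), (32, 64), (64, 128), (128, 256)]

size = 224
for _ in layers:
    size = conv2d_out_size(size, kernel=3, stride=1, padding=1)

total_params = sum(conv2d_params(i, o, 3) for i, o in layers)
print(size, total_params)  # 224 388416
```

The result matches the `[1, 256, 224, 224]` output shape reported by the exporter.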


### 2. Export to ONNX

The following code exports the PyTorch model (`pytorch_model`) to the ONNX (Open Neural Network Exchange) format. ONNX is an open format that facilitates interoperability between different AI frameworks. Ryzen AI uses ONNX as the input format for quantization.

In [5]:
# Generate dummy input data
batch_size = 1
input_channels = 3
input_size = 224
dummy_input = torch.rand(batch_size, input_channels, input_size, input_size)

# Prep for ONNX export
dynamic_axes = {'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}}
tmp_model_path = "models/helloworld.onnx"
os.makedirs(os.path.dirname(tmp_model_path), exist_ok=True)  # make sure the output folder exists

# Call export function
torch.onnx.export(
        pytorch_model,
        dummy_input,
        tmp_model_path,
        export_params=True,
        opset_version=17,  # Recommended opset
        input_names=['input'],
        output_names=['output'],
        dynamic_axes=dynamic_axes,
    )

  torch.onnx.export(
W1213 23:10:49.984000 18628 site-packages\torch\onnx\_internal\exporter\_compat.py:114] Setting ONNX exporter to use operator set version 18 because the requested opset_version 17 is a lower version than we have implementations for. Automatic version conversion will be performed, which may not be successful at converting to the requested version. If version conversion is unsuccessful, the opset version of the exported model will be kept at 18. Please consider setting opset_version >=18 to leverage latest ONNX features
W1213 23:10:51.139000 18628 site-packages\torch\onnx\_internal\exporter\_registration.py:107] torchvision is not installed. Skipping torchvision::nms


[torch.onnx] Obtain model graph for `SmallModel([...]` with `torch.export.export(..., strict=False)`...
[torch.onnx] Obtain model graph for `SmallModel([...]` with `torch.export.export(..., strict=False)`... ✅
[torch.onnx] Run decomposition...


The model version conversion is not supported by the onnxscript version converter and fallback is enabled. The model will be converted using the onnx C API (target version: 17).


[torch.onnx] Run decomposition... ✅
[torch.onnx] Translate the graph into ONNX...
[torch.onnx] Translate the graph into ONNX... ✅


ONNXProgram(
    model=
        <
            ir_version=10,
            opset_imports={'': 17},
            producer_name='pytorch',
            producer_version='2.9.1+cpu',
            domain=None,
            model_version=None,
        >
        graph(
            name=main_graph,
            inputs=(
                %"input"<FLOAT,[s77,3,224,224]>
            ),
            outputs=(
                %"output"<FLOAT,[1,256,224,224]>
            ),
            initializers=(
                %"conv1.weight"<FLOAT,[32,3,3,3]>{TorchTensor(...)},
                %"conv1.bias"<FLOAT,[32]>{TorchTensor(...)},
                %"conv2.bias"<FLOAT,[64]>{TorchTensor(...)},
                %"conv3.bias"<FLOAT,[128]>{TorchTensor(...)},
                %"conv4.bias"<FLOAT,[256]>{TorchTensor(...)},
                %"conv2.weight"<FLOAT,[64,32,3,3]>{TorchTensor(...)},
                %"conv3.weight"<FLOAT,[128,64,3,3]>{TorchTensor(...)},
                %"conv4.weight"<FLOAT,[256,128,3,3]>{TorchTe

### 3. Quantize Model

Using the static quantization method provided by the AMD Quark Quantizer and providing the newly exported ONNX model, we'll quantize the model to INT8. Quantization reduces the precision of model weights and activations from 32-bit floating point (FP32) to 8-bit integers (INT8). This compression allows the model to run faster on hardware accelerators like NPUs, while maintaining nearly the same accuracy. For more information on this quantization method, see [AMD Quark Quantization](https://ryzenai.docs.amd.com/en/latest/modelport.html).
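
To make the FP32-to-INT8 idea concrete, here is a small numeric sketch of symmetric quantization with a power-of-two scale. It is an illustration only, not Quark's implementation: the quantizer log in this notebook mentions a `PowerOfTwoMethod`, and we assume from that name that scales are restricted to powers of two. The function names and sample weights are made up.

```python
import math

def quantize_int8(values, scale):
    """Map floats to integers in [-128, 127] using q = round(x / scale)."""
    return [max(-128, min(127, round(v / scale))) for v in values]

def dequantize(q_values, scale):
    """Recover approximate floats from the integer codes."""
    return [q * scale for q in q_values]

def power_of_two_scale(values):
    """Smallest power-of-two scale whose int8 range covers max |x| (assumes max |x| > 0)."""
    max_abs = max(abs(v) for v in values)
    return 2.0 ** math.ceil(math.log2(max_abs / 127))

weights = [0.5, -1.0, 0.25, 0.8]
scale = power_of_two_scale(weights)      # 2**-6 = 0.015625
codes = quantize_int8(weights, scale)    # [32, -64, 16, 51]
approx = dequantize(codes, scale)        # 0.8 comes back as 0.796875
print(scale, codes, approx)
```

Each value is stored in one byte instead of four, and the rounding error per value is at most half the scale, which is why accuracy usually stays close to the FP32 model.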

In [6]:
from quark.onnx.quantization.config import Config, get_default_config
from quark.onnx import ModelQuantizer

# `input_model_path` is the path to the original, unquantized ONNX model.
input_model_path = "models/helloworld.onnx"

# `output_model_path` is the path where the quantized model will be saved.
output_model_path = "models/helloworld_quantized.onnx"

# Use default quantization configuration
quant_config = get_default_config("XINT8")
quant_config.extra_options["UseRandomData"] = True
# Defines the quantization configuration for the whole model
config = Config(global_quant_config=quant_config)
print("The configuration of the quantization is {}".format(config))

# Create an ONNX Quantizer
quantizer = ModelQuantizer(config)

# Quantize the ONNX model
quant_model = quantizer.quantize_model(model_input = input_model_path,
                                       model_output = output_model_path,
                                       calibration_data_path = None)

print('Calibrated and quantized model saved at:', output_model_path)

[QUARK-INFO]: Checking custom ops library ...
[QUARK-INFO]: The CPU version of custom ops library already exists.
[QUARK-INFO]: Checked custom ops library.
  from .autonotebook import tqdm as notebook_tqdm
[QUARK-INFO]: The input ONNX model can create InferenceSession successfully
[QUARK-INFO]: Random input name input shape [1, 3, 224, 224] type <class 'numpy.float32'>
[QUARK-INFO]: Obtained calibration data with 1 iters


[QUARK_INFO]: Time information:
2025-12-13 23:11:15.395783
[QUARK_INFO]: OS and CPU information:
                                        system --- Windows
                                          node --- windel
                                       release --- 11
                                       version --- 10.0.26200
                                       machine --- AMD64
                                     processor --- AMD64 Family 26 Model 96 Stepping 0, AuthenticAMD
[QUARK_INFO]: Tools version information:
                                        python --- 3.12.11
                                          onnx --- 1.18.0
                                   onnxruntime --- 1.23.0.dev20250928
                                    quark.onnx --- 0.10+db671e3+db671e3
[QUARK_INFO]: Quantized Configuration information:
                                   model_input --- models/helloworld.onnx
                                  model_output --- models/helloworld_quantized.onnx
   

[QUARK-INFO]: Removed initializers from input
[QUARK-INFO]: Simplified model sucessfully
[QUARK-INFO]: Loading model...
[QUARK-INFO]: The input ONNX model can run inference successfully
[QUARK-INFO]: Start CrossLayerEqualization...
[QUARK-INFO]: CrossLayerEqualization pattern num: 3
[QUARK-INFO]: Total CrossLayerEqualization steps: 1
[QUARK-INFO]: CrossLayerEqualization Done.
[QUARK-INFO]: optimize the model for better hardware compatibility.
[QUARK-INFO]: Start calibration...
[QUARK-INFO]: Start collecting data, runtime depends on your model size and the number of calibration dataset.
[QUARK-INFO]: Finding optimal threshold for each tensor using PowerOfTwoMethod.MinMSE algorithm ...
[QUARK-INFO]: Use all calibration data to calculate min mse
Computing range: 100%|██████████| 10/10 [00:05<00:00,  1.95tensor/s]
[QUARK-INFO]: Finished the calibrati

[QUARK-INFO]: The quantized information for all operation types is shown in the table below.
[QUARK-INFO]: The discrepancy between the operation types in the quantized model and the float model is due to the application of graph optimization.


Calibrated and quantized model saved at: models/helloworld_quantized.onnx


### 4. Run Model

#### CPU Run

Before running the model on the NPU, we'll run it on the CPU and record the execution time for comparison with the NPU.

In [None]:
# Specify the path to the quantized ONNX model
quantized_model_path = r'./models/helloworld_quantized.onnx'
model = onnx.load(quantized_model_path)

# Create some random input data for testing
input_data = np.random.uniform(low=-1, high=1, size=(batch_size, input_channels, input_size, input_size)).astype(np.float32)

cpu_options = onnxruntime.SessionOptions()

# Create Inference Session to run the quantized model on the CPU
cpu_session = onnxruntime.InferenceSession(
    model.SerializeToString(),
    providers = ['CPUExecutionProvider'],
    sess_options=cpu_options,
)

# Run Inference
start = timer()
cpu_results = cpu_session.run(None, {'input': input_data})
cpu_total = timer() - start

#### NPU Run

Now, we'll run it on the NPU and time the execution so that we can compare the results with the CPU.
If the model has already been compiled, it won't recompile unless you delete the generated cache folder using the following cell.

In [None]:
# We want to make sure we compile every time; otherwise the tools will use the cached version
# Get the current working directory
current_directory = os.getcwd()
directory_path = os.path.join(current_directory,  r'cache\hello_cache')
cache_directory = os.path.join(current_directory,  r'cache')

# Check if the directory exists and delete it if it does
if os.path.exists(directory_path):
    shutil.rmtree(directory_path)
    print("Directory deleted successfully. Starting fresh.")
else:
    print(f"Directory '{directory_path}' does not exist.")

Directory 'c:\Users\kfreidank\projects\amd_demos\RyzenAI-SW\tutorial\hello_world\cache\hello_cache' does not exist.
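
The cell above builds the cache path with a Windows-style backslash. If we want the same cleanup to work regardless of path separator, `pathlib` is one option. This is a sketch under that assumption; `clear_cache` is a hypothetical helper name.

```python
import shutil
from pathlib import Path

def clear_cache(cache_root, cache_key):
    """Delete cache_root/cache_key if it exists; return True if it was removed."""
    target = Path(cache_root) / cache_key
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False

# Usage, mirroring the cell above:
# removed = clear_cache(Path.cwd() / "cache", "hello_cache")
# print("Starting fresh." if removed else "No cache to delete.")
```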


#### Compile and run

On the first run, the model is compiled for the NPU before inference can execute, so this step takes longer than usual. Rerun the following cell if you want to see better times, since later runs reuse the cached compilation.

In [None]:
install_dir = os.environ['RYZEN_AI_INSTALLATION_PATH']
config_file_path = os.path.join(install_dir, 'voe-4.0-win_amd64', 'vaip_config.json') # Path to the NPU config file
xclbin_file = ''
provider_options = []
match npu_type:
    case 'PHX/HPT':
        print("Setting xclbin file for PHX/HPT")
        xclbin_file = os.path.join(install_dir, 'voe-4.0-win_amd64', 'xclbins', 'phoenix', '4x4.xclbin')
        provider_options = [{
                        'target': 'X1',
                        'xclbin': xclbin_file,
                        'log_level':'info',
                    }]
    case 'STX' | 'KRK':
        provider_options = [{
                'log_level':'info',
            }]
    case _:
        print("Unrecognized APU type. Exiting.")
        exit()
aie_options = onnxruntime.SessionOptions()

aie_session = onnxruntime.InferenceSession(
    model.SerializeToString(),
    providers=['VitisAIExecutionProvider'],
    sess_options=aie_options,
    provider_options = provider_options
)
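
If the Vitis AI execution provider is not available in the installed `onnxruntime` build, it helps to find that out before timing anything. The sketch below shows one way to choose the provider list from what is available; `pick_providers` is a hypothetical helper, and in the notebook the `available` list would come from `onnxruntime.get_available_providers()`.

```python
def pick_providers(available, preferred="VitisAIExecutionProvider"):
    """Put the preferred provider first if present; always keep the CPU fallback."""
    providers = []
    if preferred in available:
        providers.append(preferred)
    providers.append("CPUExecutionProvider")
    return providers

# Example inputs standing in for onnxruntime.get_available_providers().
print(pick_providers(["VitisAIExecutionProvider", "CPUExecutionProvider"]))
print(pick_providers(["CPUExecutionProvider"]))
```

Printing the chosen list makes it obvious when the "NPU" session would in fact run entirely on the CPU, which would make the timing comparison meaningless.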



In [10]:
# Run Inference
start = timer()
npu_results = aie_session.run(None, {'input': input_data})
npu_total = timer() - start

Let's gather our results and see what we have:

In [11]:
print(f"CPU Execution Time: {cpu_total}")
print(f"NPU Execution Time: {npu_total}")

CPU Execution Time: 0.17882769999999937
NPU Execution Time: 0.20666400000001772


**Note:** For a model this small, you likely won't see much of a performance gain from the NPU over the CPU.
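
Speed is only half of the comparison; it is also worth checking that the CPU and NPU produce similar outputs. The sketch below computes two common agreement measures on flat sequences. The function names and the sample values are made up for illustration.

```python
import math

def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two flat sequences."""
    return max(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two flat sequences (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up values standing in for flattened CPU and NPU outputs.
cpu_out = [1.0, 2.0, 3.0, 4.0]
npu_out = [1.0, 2.5, 3.0, 4.0]
print(max_abs_diff(cpu_out, npu_out))       # 0.5
print(cosine_similarity(cpu_out, npu_out))
```

In the notebook, the inputs would be `cpu_results[0].ravel()` and `npu_results[0].ravel()`. For tensors of that size, pure Python loops are slow, so the NumPy equivalent `np.abs(a - b).max()` is the more practical choice.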

Let's run the model on the NPU many times so that we can see the NPU being utilized.
To do this, keep Task Manager open in a window you can see while you run the next cell.

In [12]:
iterations = 50 # edit this for more or less

npu_total = cpu_total = 0
for i in range(iterations):
    start = timer()
    npu_results = aie_session.run(None, {'input': input_data})
    npu_total += timer() - start
    start = timer()
    cpu_results = cpu_session.run(None, {'input': input_data})
    cpu_total += timer() - start

print(f"For {iterations} iterations of a small model:")
print(f"- CPU Execution Time: {cpu_total}")
print(f"- NPU Execution Time: {npu_total}")

For 50 iterations of a small model:
- CPU Execution Time: 8.32508400000006
- NPU Execution Time: 8.485271200000227
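
Summed totals hide run-to-run variation and include any one-time warm-up cost. A common alternative is to discard a few warm-up calls and report the median per-call time. The sketch below is a generic helper under those assumptions; `benchmark` is a hypothetical name, and the workload is a stand-in so the block runs without an NPU.

```python
import statistics
from timeit import default_timer as timer

def benchmark(fn, warmup=3, runs=20):
    """Call fn() warmup times untimed, then return per-run timings in seconds."""
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        start = timer()
        fn()
        timings.append(timer() - start)
    return timings

# Stand-in workload; in the notebook this would be, for example,
# lambda: aie_session.run(None, {'input': input_data})
timings = benchmark(lambda: sum(range(10_000)))
print(f"median: {statistics.median(timings):.6f} s, "
      f"min: {min(timings):.6f} s over {len(timings)} runs")
```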


And there you have it: your first model running on the NPU. We recommend trying a more complex model like ResNet50 or a custom model to compare performance and accuracy on the NPU.
