# Hello World Example

This is a simple Jupyter Notebook that walks through the 4 steps of compiling and running a PyTorch model on the embedded Neural Processing Unit (NPU) in your AMD Ryzen AI enabled PC. The steps are as follows:

1. Get model - download or create a PyTorch model to run on the NPU.
2. Export to ONNX - convert the PyTorch model to ONNX format.
3. Quantize - optimize the model for faster inference on the NPU by reducing its precision to INT8.
4. Run Model on CPU and NPU - compare performance between running the model on the CPU and on the NPU.

In [2]:
# Before starting, be sure you've installed the requirements listed in the requirements.txt file:
!python -m pip install -r requirements.txt
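As a quick sanity check after installation, the sketch below reports which of the notebook's packages cannot be found in the current environment. The helper name `missing_packages` is hypothetical, and the package list is taken from the imports used later in this notebook. Note that `importlib.util.find_spec` only checks that a top-level package can be located, not that its version is compatible.

```python
import importlib.util

# Top-level packages imported later in this notebook.
REQUIRED = ["torch", "onnx", "onnxruntime", "numpy", "vai_q_onnx"]

def missing_packages(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages were found.")
```

If anything is reported missing, re-run the install cell above before continuing.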



### 0. Imports & Environment Variables

We'll use the following imports in our example. `torch` and `torch.nn` are used for building and running ML models; we'll use them to define a small neural network and generate the model weights. `os` manages our environment variables, file paths, and directories, and `shutil` removes the cached compilation output. `subprocess` lets us retrieve hardware information. `numpy` generates random input data, and `timeit` times each inference. `onnx` and `onnxruntime` are used to work with our model in the ONNX format and to run inference. `vai_q_onnx` is part of the Vitis AI Quantizer for ONNX models; we use it to perform quantization, converting the model into an INT8 format that is optimized for the NPU.

In [4]:
import torch
import torch.nn as nn
import os
import subprocess
import onnxruntime
import numpy as np
import onnx
import shutil
from timeit import default_timer as timer
import vai_q_onnx

We also want to set the environment variables based on the NPU device in our PC. For more information about NPU configurations, refer to the official [AMD Ryzen AI Documentation](https://ryzenai.docs.amd.com/en/latest/runtime_setup.html).

In [None]:
# This function detects the APU (NPU) type in your system to configure environment variables for hardware-specific optimization.
def get_apu_info():
    # Run pnputil as a subprocess to enumerate PCI devices
    command = r'pnputil /enum-devices /bus PCI /deviceids '
    process = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    # Decode with error handling
    apu_type = ''
    try:  
        stdout = stdout.decode('utf-8', errors='ignore')  
    except Exception as e:  
        # Log error message and return an empty string  
        print(f"An error occurred while decoding output: {e}") 
        return apu_type
    # Check for supported Hardware IDs
    if 'PCI\\VEN_1022&DEV_1502&REV_00' in stdout: apu_type = 'PHX/HPT'  
    if 'PCI\\VEN_1022&DEV_17F0&REV_00' in stdout: apu_type = 'STX'  
    if 'PCI\\VEN_1022&DEV_17F0&REV_10' in stdout: apu_type = 'STX' 
    if 'PCI\\VEN_1022&DEV_17F0&REV_11' in stdout: apu_type = 'STX' 
    return apu_type

apu_type = get_apu_info()
print(f"APU Type: {apu_type}")

APU Type: PHX/HPT
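The matching step in `get_apu_info` can also be written as a pure function over the `pnputil` text, which makes it possible to try it without the hardware. This is a minimal sketch: `classify_apu` and `HARDWARE_IDS` are hypothetical names, the hardware IDs are copied from the cell above, and the sample string is made up for illustration.

```python
# Hardware ID substrings copied from get_apu_info above.
HARDWARE_IDS = {
    r'PCI\VEN_1022&DEV_1502&REV_00': 'PHX/HPT',
    r'PCI\VEN_1022&DEV_17F0&REV_00': 'STX',
    r'PCI\VEN_1022&DEV_17F0&REV_10': 'STX',
    r'PCI\VEN_1022&DEV_17F0&REV_11': 'STX',
}

def classify_apu(pnputil_output):
    """Return the APU type found in the device listing, or '' if none match."""
    apu_type = ''
    for hardware_id, name in HARDWARE_IDS.items():
        if hardware_id in pnputil_output:
            apu_type = name  # a later match overrides an earlier one, as in the original
    return apu_type

# Made-up fragment of a device listing, for illustration only.
sample = 'Hardware IDs: PCI\\VEN_1022&DEV_1502&REV_00'
print(f"APU Type: {classify_apu(sample)}")
```

Keeping the IDs in one table means a new device revision needs one extra entry rather than another `if` statement.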


In [6]:
# XLNX_VART_FIRMWARE - Specifies the firmware file used by the NPU for runtime execution
# NUM_OF_DPU_RUNNERS - Specifies the number of DPU runners (processing cores) available for execution
# XLNX_TARGET_NAME - Name of the target hardware configuration

def set_environment_variable(apu_type):

    install_dir = os.environ['RYZEN_AI_INSTALLATION_PATH']
    match apu_type:
        case 'PHX/HPT':
            print("Setting environment for PHX/HPT")
            os.environ['XLNX_VART_FIRMWARE']= os.path.join(install_dir, 'voe-4.0-win_amd64', 'xclbins', 'phoenix', '1x4.xclbin')
            os.environ['NUM_OF_DPU_RUNNERS']='1'
            os.environ['XLNX_TARGET_NAME']='AMD_AIE2_Nx4_Overlay'
        case 'STX':
            print("Setting environment for STX")
            os.environ['XLNX_VART_FIRMWARE']= os.path.join(install_dir, 'voe-4.0-win_amd64', 'xclbins', 'strix', 'AMD_AIE2P_Nx4_Overlay.xclbin')
            os.environ['NUM_OF_DPU_RUNNERS']='1'
            os.environ['XLNX_TARGET_NAME']='AMD_AIE2_Nx4_Overlay'
        case _:
            print("Unrecognized APU type. Exiting.")
            exit()
    print('XLNX_VART_FIRMWARE=', os.environ['XLNX_VART_FIRMWARE'])
    print('NUM_OF_DPU_RUNNERS=', os.environ['NUM_OF_DPU_RUNNERS'])
    print('XLNX_TARGET_NAME=', os.environ['XLNX_TARGET_NAME'])

set_environment_variable(apu_type)

Setting environment for PHX/HPT
XLNX_VART_FIRMWARE= C:\Program Files\RyzenAI\1.2.0\voe-4.0-win_amd64\xclbins\phoenix\1x4.xclbin
NUM_OF_DPU_RUNNERS= 1
XLNX_TARGET_NAME= AMD_AIE2_Nx4_Overlay
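`set_environment_variable` reads `RYZEN_AI_INSTALLATION_PATH` with `os.environ[...]`, which raises a bare `KeyError` if the variable is unset. A minimal sketch of a friendlier check follows; `require_env` and `require_file` are hypothetical helpers, not part of the Ryzen AI tooling.

```python
import os

def require_env(name):
    """Return the value of environment variable `name`, or raise a descriptive error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set.")
    return value

def require_file(path):
    """Return `path` if it points to an existing file, or raise a descriptive error."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Expected file not found: {path}")
    return path

# In the notebook, the firmware path could be validated after it is set, e.g.
#   require_file(require_env('XLNX_VART_FIRMWARE'))
try:
    install_dir = require_env('RYZEN_AI_INSTALLATION_PATH')
    print("Ryzen AI installation:", install_dir)
except RuntimeError as err:
    print(err)
```

Failing early with a clear message is usually easier to debug than a missing `.xclbin` file surfacing later, when the inference session is created.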


### 1. Get Model
Here, we'll use the PyTorch library to define and instantiate a simple neural network model called `SmallModel` as a starting point. You can swap this model with any custom model, but make sure the input/output shapes remain compatible.

In [7]:
torch.manual_seed(0)

class SmallModel(nn.Module):
    def __init__(self):
        super(SmallModel, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1)
        self.conv3 = nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1)
        self.conv4 = nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.conv1(x)
        x = self.relu(x)
        
        x = self.conv2(x)
        x = self.relu(x) 
        
        x = self.conv3(x)
        x = self.relu(x) 
        
        x = self.conv4(x)
        x = self.relu(x) 
        
        x = torch.add(x, 1)
        
        return x

# Instantiate the model
pytorch_model = SmallModel()

pytorch_model.eval()

# Print the model architecture
print(pytorch_model)

SmallModel(
  (conv1): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv2): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv3): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv4): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (relu): ReLU()
)
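To see why the input and output shapes matter when swapping in another model, it helps to work out what `SmallModel` produces. The sketch below uses the standard convolution output-size formula (with the default dilation of 1) and counts the trainable parameters; the helper names are hypothetical, and the 224x224 input size is the one used for the dummy input in the export step.

```python
def conv2d_output_size(size, kernel_size, stride, padding):
    """Spatial output size: floor((size + 2*padding - kernel_size) / stride) + 1."""
    return (size + 2 * padding - kernel_size) // stride + 1

def conv2d_param_count(in_channels, out_channels, kernel_size):
    """Weights (out * in * k * k) plus one bias per output channel."""
    return out_channels * in_channels * kernel_size * kernel_size + out_channels

# (in_channels, out_channels) for conv1..conv4 in SmallModel
layers = [(3, 32), (32, 64), (64, 128), (128, 256)]

size = 224  # height and width of the dummy input used in the export step
total_params = 0
for in_ch, out_ch in layers:
    size = conv2d_output_size(size, kernel_size=3, stride=1, padding=1)
    total_params += conv2d_param_count(in_ch, out_ch, kernel_size=3)

print("output shape:", (1, layers[-1][1], size, size))  # (1, 256, 224, 224)
print("total parameters:", total_params)                # 388416
```

With a 3x3 kernel, stride 1, and padding 1, every layer preserves the spatial size, so only the channel count changes from 3 to 256.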


### 2. Export to ONNX

The following code exports the PyTorch model (`pytorch_model`) to the ONNX (Open Neural Network Exchange) format. ONNX is an open format that facilitates interoperability between different AI frameworks. Ryzen AI uses ONNX as the input format for quantization with the Vitis AI Quantizer.

In [8]:
# Generate dummy input data
batch_size = 1
input_channels = 3
input_size = 224
dummy_input = torch.rand(batch_size, input_channels, input_size, input_size)

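# Prep for ONNX export
# Pass the model's positional inputs as a tuple, the form documented for torch.onnx.export
inputs = (dummy_input,)
dynamic_axes = {'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}}
tmp_model_path = "models/helloworld.onnx"
os.makedirs(os.path.dirname(tmp_model_path), exist_ok=True)  # make sure the output directory exists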

# Call export function
torch.onnx.export(
        pytorch_model,
        inputs,
        tmp_model_path,
        export_params=True,
        opset_version=17,  # Recommended opset
        input_names=['input'],
        output_names=['output'],
        dynamic_axes=dynamic_axes,
    )

### 3. Quantize Model

Using the static quantization method provided by the Vitis AI Quantizer, we'll quantize the newly exported ONNX model to INT8. Quantization reduces the precision of model weights and activations from 32-bit floating point (FP32) to 8-bit integers (INT8). This compression allows the model to run faster on hardware accelerators like NPUs, typically at the cost of a small loss in accuracy. Because `calibration_data_reader` is `None` here, the quantizer calibrates on random data, as the log output reports; for a real model, supply representative inputs instead. For more information on this quantization method, see [Vitis AI ONNX Quantization](https://ryzenai.docs.amd.com/en/latest/vai_quant/vai_q_onnx.html).
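To make the FP32-to-INT8 mapping concrete, here is a simplified, pure-Python illustration of symmetric quantization with a power-of-two scale chosen to minimize mean squared error. It is only a sketch of the general idea behind a "power-of-two, min-MSE" calibration, not the Vitis AI Quantizer's actual algorithm; all function names and the sample weights are made up, and it uses signed INT8 throughout.

```python
def quantize(values, scale, qmin=-128, qmax=127):
    """Map floats to integer codes: round(v / scale), clamped to [qmin, qmax]."""
    return [max(qmin, min(qmax, round(v / scale))) for v in values]

def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def best_power_of_two_scale(values, exponents=range(-12, 5)):
    """Try scales 2**e and keep the one with the lowest reconstruction error."""
    return min(
        (2.0 ** e for e in exponents),
        key=lambda s: mse(values, dequantize(quantize(values, s), s)),
    )

weights = [0.82, -0.41, 0.05, -1.37, 0.66]  # made-up FP32 values
scale = best_power_of_two_scale(weights)
codes = quantize(weights, scale)
restored = dequantize(codes, scale)

print("scale:", scale)
print("codes:", codes)
print("max abs error:", max(abs(w - r) for w, r in zip(weights, restored)))
```

A scale that is too small clips the largest values, while one that is too large wastes precision on rounding; the search picks the power of two that balances the two effects.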

In [9]:
# `input_model_path` is the path to the original, unquantized ONNX model.
input_model_path = "models/helloworld.onnx"

# `output_model_path` is the path where the quantized model will be saved.
output_model_path = "models/helloworld_quantized.onnx"

vai_q_onnx.quantize_static(
    input_model_path,
    output_model_path,
    calibration_data_reader=None,
    quant_format=vai_q_onnx.QuantFormat.QDQ,
    calibrate_method=vai_q_onnx.PowerOfTwoMethod.MinMSE,
    activation_type=vai_q_onnx.QuantType.QUInt8,
    weight_type=vai_q_onnx.QuantType.QInt8,
    enable_ipu_cnn=True,
    extra_options={'ActivationSymmetric': True}
)

print('Calibrated and quantized model saved at:', output_model_path)

INFO:vai_q_onnx.quantize:calibration_data_reader is None, using random data for calibration
INFO:vai_q_onnx.quant_utils:The input ONNX model models/helloworld.onnx can create InferenceSession successfully
INFO:vai_q_onnx.quant_utils:Random input name input shape [1, 3, 224, 224] type <class 'numpy.float32'> 
INFO:vai_q_onnx.quant_utils:Obtained calibration data with 1 iters
INFO:vai_q_onnx.quantize:Removed initializers from input
INFO:vai_q_onnx.quantize:Simplified model sucessfully
INFO:vai_q_onnx.quantize:Loading model...


[VAI_Q_ONNX_INFO]: Time information:
2024-08-23 10:12:35.362481
[VAI_Q_ONNX_INFO]: OS and CPU information:
                                        system --- Windows
                                          node --- vgodsoe-ryzen
                                       release --- 10
                                       version --- 10.0.26100
                                       machine --- AMD64
                                     processor --- AMD64 Family 25 Model 116 Stepping 1, AuthenticAMD
[VAI_Q_ONNX_INFO]: Tools version information:
                                        python --- 3.10.14
                                          onnx --- 1.16.2
                                   onnxruntime --- 1.17.0
                                    vai_q_onnx --- 1.17.0+511d6f4
[VAI_Q_ONNX_INFO]: Quantized Configuration information:
                                   model_input --- models/helloworld.onnx
                                  model_output --- models/helloworld_quantize

INFO:vai_q_onnx.quant_utils:The input ONNX model C:/Users/vgods/AppData/Local/Temp/vai.simp.kpf9kmm3/model_simp.onnx can run inference successfully
INFO:vai_q_onnx.quantize:optimize the model for better hardware compatibility.
INFO:vai_q_onnx.quantize:Start calibration...
INFO:vai_q_onnx.quantize:Start collecting data, runtime depends on your model size and the number of calibration dataset.
INFO:vai_q_onnx.calibrate:Finding optimal threshold for each tensor using PowerOfTwoMethod.MinMSE algorithm ...
INFO:vai_q_onnx.calibrate:Use all calibration data to calculate min mse
Computing range: 100%|██████████| 10/10 [00:04<00:00,  2.30tensor/s]
INFO:vai_q_onnx.quantize:Finished the calibration of PowerOfTwoMethod.MinMSE which costs 4.6s
INFO:vai_q_onnx.qdq_quantizer:Remove QuantizeLinear & DequantizeLinear on certain operations(such as conv-relu).
INFO:vai_q_onnx.refine:Adjust the quantize info to meet the compiler constraints


Calibrated and quantized model saved at: models/helloworld_quantized.onnx


### 4. Run Model

#### CPU Run

Before running the model on the NPU, we'll run it on the CPU and record the execution time for comparison with the NPU.

In [10]:
# Specify the path to the quantized ONNX model
quantized_model_path = r'./models/helloworld_quantized.onnx'
model = onnx.load(quantized_model_path)

# Create some random input data for testing
input_data = np.random.uniform(low=-1, high=1, size=(batch_size, input_channels, input_size, input_size)).astype(np.float32)

cpu_options = onnxruntime.SessionOptions()

# Create Inference Session to run the quantized model on the CPU
cpu_session = onnxruntime.InferenceSession(
    model.SerializeToString(),
    providers = ['CPUExecutionProvider'],
    sess_options=cpu_options,
)

# Run Inference
start = timer()
cpu_results = cpu_session.run(None, {'input': input_data})
cpu_total = timer() - start

#### NPU Run

Now, we'll run it on the NPU and time the execution so that we can compare the results with the CPU.
If the model has already been compiled, it won't recompile unless you delete the generated cache folder using the following cell.

In [11]:
# Delete any cached compilation so the model is recompiled every time; otherwise the tools reuse the cached version
# Get the current working directory
current_directory = os.getcwd()
directory_path = os.path.join(current_directory, 'cache', 'hello_cache')
cache_directory = os.path.join(current_directory, 'cache')

# Check if the directory exists and delete it if it does.
if os.path.exists(directory_path):
    shutil.rmtree(directory_path)
    print("Directory deleted successfully. Starting Fresh.")
else:
    print(f"Directory '{directory_path}' does not exist.")

Directory deleted successfully. Starting Fresh.
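If you clear the cache from more than one place, the same logic can be wrapped in a small reusable helper. This is a sketch only: `reset_cache` is a hypothetical name, its two arguments mirror the `cacheDir` and `cacheKey` values used when the NPU session is created, and the demonstration runs against a temporary directory rather than the real cache.

```python
import os
import shutil
import tempfile

def reset_cache(cache_dir, cache_key):
    """Delete cache_dir/cache_key if it exists. Return True if something was removed."""
    target = os.path.join(cache_dir, cache_key)
    if os.path.isdir(target):
        shutil.rmtree(target)
        return True
    return False

# Demonstrate on a throwaway directory rather than the real cache.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'hello_cache'))
    print(reset_cache(tmp, 'hello_cache'))  # True: the directory existed and was removed
    print(reset_cache(tmp, 'hello_cache'))  # False: nothing left to remove
```

Returning a boolean instead of printing lets the caller decide how to report the result.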


#### Compile and run

On the first run, the model is compiled for the NPU when the inference session is created. The compiled result is cached, so running the following cell again skips compilation and completes faster.

In [13]:
install_dir = os.environ['RYZEN_AI_INSTALLATION_PATH']
config_file_path = os.path.join(install_dir, 'voe-4.0-win_amd64', 'vaip_config.json') # Path to the NPU config file

aie_options = onnxruntime.SessionOptions()

aie_session = onnxruntime.InferenceSession(
    model.SerializeToString(),
    providers=['VitisAIExecutionProvider'],
    sess_options=aie_options,
    provider_options = [{'config_file': config_file_path,
                         'cacheDir': cache_directory,
                         'cacheKey': 'hello_cache'}]
)
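Creating the session above assumes that `VitisAIExecutionProvider` is present in your ONNX Runtime build. A small sketch for choosing providers defensively is shown below; `pick_providers` is a hypothetical helper, and in the notebook you would pass it the result of `onnxruntime.get_available_providers()` rather than the sample list used here.

```python
def pick_providers(available, preferred=("VitisAIExecutionProvider", "CPUExecutionProvider")):
    """Return the preferred providers that are actually available, in priority order."""
    chosen = [name for name in preferred if name in available]
    if not chosen:
        raise RuntimeError(f"None of {list(preferred)} are available; found {list(available)}")
    return chosen

# Sample list standing in for onnxruntime.get_available_providers().
sample_available = ["CPUExecutionProvider"]
print(pick_providers(sample_available))  # ['CPUExecutionProvider']
```

If you pass more than one provider to `InferenceSession` together with `provider_options`, supply one options dictionary per provider (an empty one for the CPU provider), since the two sequences are expected to line up.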



In [14]:
# Run Inference
start = timer()
npu_results = aie_session.run(None, {'input': input_data})
npu_total = timer() - start

Let's gather our results and see what we have.

In [15]:
print(f"CPU Execution Time: {cpu_total}")
print(f"NPU Execution Time: {npu_total}")

CPU Execution Time: 0.11257850000004055
NPU Execution Time: 0.08555689999997185
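A single timed call like the ones above can include one-time costs and scheduling noise. A common way to get a steadier number is to discard a few warm-up runs and report the median of several timed runs. The sketch below shows that approach; `benchmark` is a hypothetical helper, and a stand-in workload replaces the inference call so that the block runs anywhere.

```python
import statistics
from timeit import default_timer as timer

def benchmark(run, warmup=3, repeats=20):
    """Call run() `warmup` times untimed, then return the median time in seconds of `repeats` timed calls."""
    for _ in range(warmup):
        run()
    timings = []
    for _ in range(repeats):
        start = timer()
        run()
        timings.append(timer() - start)
    return statistics.median(timings)

# Stand-in workload. In the notebook you would pass, for example,
#   lambda: cpu_session.run(None, {'input': input_data})
median_s = benchmark(lambda: sum(i * i for i in range(10_000)))
print(f"median latency: {median_s * 1e3:.3f} ms")
```

The median is less sensitive than the mean to an occasional slow run caused by other activity on the machine.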


**Note:** For a model this small in size, you likely won't see much of a performance gain when using the NPU versus the CPU. 

Let's run the model on the NPU many times so that we can see the NPU being utilized.
To do this, keep Task Manager open in a visible window while you run the next cell.

In [16]:
iterations = 50  # increase or decrease to change the number of runs

for i in range(iterations):
    npu_results = aie_session.run(None, {'input': input_data})
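Besides timing, it is worth checking that the CPU and NPU runs produce similar outputs for the same input. The sketch below computes the maximum absolute difference and the RMS difference between two flat sequences of numbers; `compare_outputs` is a hypothetical helper and the sample values are made up. On the notebook's actual arrays, the NumPy expression `np.abs(cpu_results[0] - npu_results[0]).max()` gives the first of these numbers much faster.

```python
import math

def compare_outputs(reference, candidate):
    """Return (max absolute difference, RMS difference) for two equal-length sequences."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    rms = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return max(diffs), rms

# Made-up values standing in for flattened CPU and NPU outputs.
cpu_like = [1.00, 1.50, 2.25, 3.00]
npu_like = [1.00, 1.50, 2.50, 3.00]
max_diff, rms_diff = compare_outputs(cpu_like, npu_like)
print(f"max abs diff: {max_diff}, RMS diff: {rms_diff}")
```

What counts as an acceptable difference depends on the model and the task, so treat any threshold as something to validate on your own data.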



And there you have it: your first model running on the NPU. As a next step, we recommend trying a more complex model, such as ResNet50 or a custom model, to compare performance and accuracy on the NPU.
