Ryzen 7940HS Inference: Why is CPU Faster than GPU/NPU for Quantized ONNX Models? (AMD RyzenAI)

Hi all,
I’m running image-classification inference with quantized ONNX models on a Ryzen 7940HS device (8C/16T, integrated GPU and NPU). According to my understanding and most docs, the expected ordering is:

- CPU < GPU < NPU in terms of speed and throughput (images/sec), so the NPU should be fastest,
- and the reverse for latency (ms/img).

But in my benchmarks, the CPU is often faster than both the GPU and the NPU, especially with quantized models run through AMD’s official RyzenAI reference script ([AMD GitHub example](https://github.com/amd/RyzenAI-SW/blob/main/tutorial/torchvision_inference/classification.py)). Here’s a sample of my results (all numbers are for 50 images):

| Model | CPU (ms/img) | CPU (img/s) | GPU (ms/img) | GPU (img/s) | NPU (ms/img) | NPU (img/s) |
| --- | --- | --- | --- | --- | --- | --- |
| alexnet_quantized.onnx | 24.57 | 40.70 | 25.21 | 39.66 | 58.25 | 17.17 |
| googlenet_quantized.onnx | 17.67 | 56.58 | 22.43 | 44.58 | 27.48 | 36.39 |
| mobilenetv2_050_quantized.onnx | 10.87 | 92.04 | 9.98 | 100.24 | 13.93 | 71.80 |
| cspresnet50_quantized.onnx | 38.35 | 26.07 | 49.14 | 20.35 | 63.55 | 15.74 |
| squeezenet1_0_quantized.onnx | 7.66 | 130.51 | 10.24 | 97.62 | 12.86 | 77.79 |

Questions:

1. Why is my CPU inference often faster than GPU and NPU inference, even though the expectation is the opposite?
2. Does the Ryzen 7940HS being an “older” chip (Zen 4, released in 2023), or its 8C/16T configuration, play a significant role in this?
3. Is there anything I should tune or check in my setup, quantization, or ONNX Runtime configuration to get better NPU/GPU results?
4. Has anyone else had a similar experience with RyzenAI on this or similar hardware?
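
To put numbers on the gap in the results table, here is a minimal sketch in plain Python. The values are copied from the table; the names `RESULTS`, `slowdown_vs_cpu`, and `max_reciprocal_error` are made up for illustration and are not part of AMD's example. It computes how much slower the GPU and NPU are relative to the CPU, and checks whether the reported img/s is simply 1000 divided by ms/img.

```python
# (ms/img, img/s) per device, copied from the results table in this post.
RESULTS = {
    "alexnet_quantized.onnx": {"cpu": (24.57, 40.70), "gpu": (25.21, 39.66), "npu": (58.25, 17.17)},
    "googlenet_quantized.onnx": {"cpu": (17.67, 56.58), "gpu": (22.43, 44.58), "npu": (27.48, 36.39)},
    "mobilenetv2_050_quantized.onnx": {"cpu": (10.87, 92.04), "gpu": (9.98, 100.24), "npu": (13.93, 71.80)},
    "cspresnet50_quantized.onnx": {"cpu": (38.35, 26.07), "gpu": (49.14, 20.35), "npu": (63.55, 15.74)},
    "squeezenet1_0_quantized.onnx": {"cpu": (7.66, 130.51), "gpu": (10.24, 97.62), "npu": (12.86, 77.79)},
}


def slowdown_vs_cpu(results):
    """Return {model: {device: device_latency / cpu_latency}} for GPU and NPU.

    A value above 1.0 means the device is slower than the CPU for that model.
    """
    out = {}
    for model, devices in results.items():
        cpu_ms = devices["cpu"][0]
        out[model] = {
            dev: devices[dev][0] / cpu_ms for dev in ("gpu", "npu")
        }
    return out


def max_reciprocal_error(results):
    """Largest relative gap between reported img/s and 1000 / (ms/img)."""
    worst = 0.0
    for devices in results.values():
        for ms_per_img, img_per_s in devices.values():
            derived = 1000.0 / ms_per_img
            worst = max(worst, abs(img_per_s - derived) / img_per_s)
    return worst


if __name__ == "__main__":
    for model, ratios in slowdown_vs_cpu(RESULTS).items():
        print(f"{model}: GPU x{ratios['gpu']:.2f}, NPU x{ratios['npu']:.2f} vs CPU")
    print(f"max |img/s - 1000/ms| / img/s = {max_reciprocal_error(RESULTS):.5f}")
```

On these figures the NPU takes roughly 1.3x to 2.4x the CPU latency, and the GPU beats the CPU only for `mobilenetv2_050_quantized.onnx`. The throughput column matches 1000 / latency to within about 0.05%, which is consistent with images being processed one at a time rather than in batches.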

Notes:

- I followed AMD’s official quantization/inference code.
- All drivers and firmware are up to date.
- Let me know if you need more info about my setup!
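
For anyone trying to reproduce the numbers, here is a minimal timing sketch. It is an assumption about how such a measurement could be structured, not a description of what AMD's `classification.py` does. The `benchmark` helper and its parameters are hypothetical; it discards a few warm-up calls (the first calls on an accelerator may include one-time setup cost) and then reports mean, median, and worst-case latency over the timed runs.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(run_once: Callable[[], object],
              n_warmup: int = 5,
              n_timed: int = 50) -> Dict[str, float]:
    """Time a zero-argument inference callable, excluding warm-up runs."""
    for _ in range(n_warmup):
        run_once()  # discarded: may include one-time setup cost

    samples_ms = []
    for _ in range(n_timed):
        start = time.perf_counter()
        run_once()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    mean_ms = statistics.fmean(samples_ms)
    return {
        "mean_ms": mean_ms,
        "median_ms": statistics.median(samples_ms),
        "max_ms": max(samples_ms),
        # Single-stream throughput derived from the mean latency.
        "img_per_s": 1000.0 / mean_ms if mean_ms > 0 else float("inf"),
    }


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without any model.
    # With ONNX Runtime this could be something like
    #   lambda: session.run(None, {input_name: image})
    # where session, input_name, and image are placeholders for
    # an InferenceSession, its input name, and a preprocessed image.
    stats = benchmark(lambda: sum(i * i for i in range(10_000)))
    for key, value in stats.items():
        print(f"{key}: {value:.3f}")
```

Comparing the median against the mean and maximum shows whether a few slow outliers dominate the average. With ONNX Runtime, `session.get_providers()` reports which execution providers a session actually uses, which helps confirm that a "GPU" or "NPU" run did not silently fall back to the CPU.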

