
[Feature request] Any plans for AMD XDNA AI Engine support on Ryzen 7x40 processors? #1499

Open
KarmaMonk opened this issue May 17, 2023 · 54 comments

@KarmaMonk

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Enhancement

Are there any plans to support the AMD XDNA AI Engine (in AMD Ryzen 7x40 (x = 6,8,9) processors)?

@AlphaAtlas

AlphaAtlas commented May 19, 2023

Is there any API or example code of how to use it out in the wild?

I can't even find a whitepaper or anything like that.

@SlyEcho
Sponsor Collaborator

SlyEcho commented May 19, 2023

If there is a BLAS or even a matrix multiplication API available, it should be easy to add.
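
For context, this is roughly the shape such an integration takes: llama.cpp's existing BLAS path hands large float32 matrix multiplications to a cblas_sgemm-style call, so a matrix multiplication API for the XDNA engine would only need to expose an equivalent entry point. A minimal sketch, assuming a CBLAS-compatible library such as OpenBLAS (the wrapper name is illustrative):

// Minimal sketch: hand a row-major float32 matmul to a CBLAS-style sgemm.
// An XDNA BLAS would only need to provide an equivalent of cblas_sgemm.
#include <cblas.h>
#include <vector>

// C = A (m x k) * B (k x n), all row-major float32
static void matmul_blas(const float *A, const float *B, float *C, int m, int n, int k) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k,
                1.0f, A, k,
                      B, n,
                0.0f, C, n);
}

int main() {
    const int m = 4, n = 4, k = 8;
    std::vector<float> A(m * k, 1.0f), B(k * n, 1.0f), C(m * n, 0.0f);
    matmul_blas(A.data(), B.data(), C.data(), m, n, k);  // every entry of C ends up as 8
}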

@Azeirah
Contributor

Azeirah commented May 21, 2023

Do the non-mobile Ryzen 7x00 CPUs also have this feature?

@KarmaMonk
Author

KarmaMonk commented May 21, 2023

Do the non-mobile Ryzen 7x00 CPUs also have this feature?

As far as I know: no, only the mobile Phoenix 7040 series; see Wikipedia.

@GreyXor

GreyXor commented May 25, 2023

https://github.com/amd/RyzenAI-cloud-to-client-demo

@Dampfinchen

Dampfinchen commented May 31, 2023

I second this. With the rise of ML accelerators in PCs, starting with Ryzen AI and the Meteor Lake VPUs, using them might result in big efficiency gains and speedups.

I'm also sure that once memory bottlenecks are reduced, more can be done by using tensor cores and the new RDNA3 AI accelerators more efficiently.

Then you also have NPUs in modern smartphones that can be leveraged. Hardware acceleration is the way forward.

@AlphaAtlas

AlphaAtlas commented May 31, 2023

An issue is that inference either has to run entirely on the XPU (excluding the possibility of partial OpenCL/CUDA acceleration), or the XPU has to support zero-copy/unified memory to avoid cost-prohibitive copies. iGPU acceleration is similarly problematic.

It's theoretically not impossible... For instance, Tencent ncnn has some kind of Vulkan zero-copy mechanism for "unified memory" devices.

I think the library AMD uses for the demo is this: https://github.com/Xilinx/Vitis-AI

@ggerganov
Owner

Don't know the specifics of this hardware, but most likely you need to follow the steps in #1642 to make it work

@EwoutH
Contributor

EwoutH commented Jun 13, 2023

AMD's Unified Inference Frontend (UIF) might also be useful.

@audiovention

Here's some more information:
https://onnxruntime.ai/docs/execution-providers/Vitis-AI-ExecutionProvider.html
This onnxruntime provider is what the only AMD-provided example uses. Things seem to be at a very early stage (and considering AMD's history, they might forever remain at that stage).
I suspect it's more likely that these accelerators would have to be utilized similarly to the Apple Neural Engine.

@marty1885

Second this. I'm willing to look into it and actually write it. But where TF is the API?

@EwoutH
Contributor

EwoutH commented Jun 28, 2023

Second this. I'm willing to look into it and actually write it. But where TF is the API?

I think it's here: https://www.xilinx.com/products/design-tools/vitis/vitis-ai.html

@marty1885

marty1885 commented Jun 29, 2023

@EwoutH Ohh.. thanks. Dang, I used to work with Vitis in college. It'll be quite a task to gain access to the low-level API on that.

Update: Waiting for Acer to release the new Swift Edge in my country.

@EwoutH
Contributor

EwoutH commented Sep 26, 2023

There is now Ryzen AI Software Platform documentation: GitHub, ReadTheDocs.

The Ryzen AI cloud-to-client demo has also been updated over the last few weeks.

@marty1885

That looks grim. AMD does not expose low-level (directly callable) access to the accelerator, and everything has to go through ONNX.

@flobeier

@marty1885 I believe ONNX uses this provider to access the accelerator. Does that help in any way?

@marty1885

@flobeier Unfortunately no. That ONNX provider calls Xilinx's (now AMD's) Vitis AI library. Vitis also works on a graph-based system, and the provider is basically a translator from the ONNX graph to AMD's XIR graph. I searched the entire documentation; Vitis doesn't seem to expose low-level functions like GEMM, convolution, pooling, etc. directly to the user, which is what we need for GGML to work with it.

You can find the documents here
https://docs.xilinx.com/r/en-US/ug1414-vitis-ai/Vitis-AI-Overview

@flobeier

@marty1885 if my understanding of the hierarchy of the libraries presented here is correct, then the lowest abstraction is the Xilinx Runtime Library (XRT). Here is the documentation: https://xilinx.github.io/XRT/master/html/index.html
Does that look better?

@shtirlic

shtirlic commented Oct 5, 2023

JFYI, https://ryzenai.docs.amd.com/en/latest/modelcompat.html
From that page:

In the Ryzen AI workflow, the quantized model is converted into ONNX format for deployment. Currently, the IPU supports a subset of ONNX operators.
The current list of the IPU-supported ONNX operators is as follows:
Add
And
Average pool
BatchNorm
Clip
Concat
Conv
ConvTranspose
Gemm
GlobalAveragePool
GlobalMaxPool
Identity
LayerNorm
LSTM
Max
MaxPool
Mul
Pad
Prelu
Relu
Resize
RNN
Sigmoid
Slice
Softmax
Split
Squeeze
Unsqueeze
Upsample
Spacetodepth

@marty1885

marty1885 commented Oct 6, 2023

The operation set is OK; it's good enough to accelerate LLaMA. The problem is that "the quantized model is converted into ONNX format for deployment" (the API they provide is not low-level enough). GGML works on the premise that we can call the accelerator whenever we want to offload an operation (e.g. matmul), but Vitis forces us to build a compute graph beforehand.

It's possible to hack around it, but it's likely to cause huge overhead and it's just too much work for me :(
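
To spell out the mismatch: GGML wants to offload individual ops on demand, while Vitis AI wants the whole graph declared and compiled up front. A purely illustrative contrast, where every npu_* name is hypothetical and stubbed out so the snippet compiles (no such functions exist in any AMD SDK):

// Illustrative only: npu_matmul, npu_compile and npu_execute are hypothetical stubs,
// not real Vitis AI or ONNX Runtime functions.
#include <functional>
#include <utility>
#include <vector>

// What GGML expects: offload one op, synchronously, whenever the graph walker reaches it.
void npu_matmul(const float *, const float *, float *, int, int, int) { /* stub */ }

// What Vitis AI offers: declare the whole network, compile it, then run it as a unit.
struct NpuGraph { std::vector<std::function<void()>> nodes; };
NpuGraph npu_compile(std::vector<std::function<void()>> nodes) { return { std::move(nodes) }; }
void npu_execute(const NpuGraph &g) { for (const auto &node : g.nodes) node(); }

int main() {
    float A[4 * 8] = {}, B[8 * 4] = {}, C[4 * 4] = {};

    // GGML-style: per-op, eager offload.
    npu_matmul(A, B, C, 4, 4, 8);

    // Vitis-style: build the compute graph first, then execute it.
    NpuGraph g = npu_compile({ [&] { npu_matmul(A, B, C, 4, 4, 8); } });
    npu_execute(g);
}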

@arlo-phoenix
Contributor

arlo-phoenix commented Dec 30, 2023

If there is a BLAS or even a matrix multiplication API available, it should be easy to add.

There's Vitis BLAS, but no idea if it actually works. It's under the Apache 2 license, so it would be compatible with llama.cpp. This would only require the Xilinx runtime, which from what I can see should work.


Some of the results of me browsing around:

RyzenAI-SW

The main Ryzen AI github repo https://github.com/amd/RyzenAI-SW:

  • Only supports Windows
  • From what I see, this is just a copy of an ONNX tutorial; the special thing is that you can now run ONNX on Vitis graphs
  • Not interesting at all for llama.cpp

If you can convert ONNX to Vitis graphs and run them, the runtime and everything it uses should probably also work on its own:

Xilinx Runtime

Directly coding Vitis

  • large tutorial collection here

I don't have a newer Ryzen laptop, but will be getting one early next year (waiting for Hawk Point or more Phoenix laptops) and will look deeper into this then, if no one starts working on it before me (if it's even possible; AMD docs on point again...)

@arlo-phoenix
Contributor

arlo-phoenix commented Jan 1, 2024

Looked a bit more, and Vitis BLAS isn't the correct library. The high-performance AIE-ML-compatible library is under https://github.com/Xilinx/Vitis_Libraries/tree/main/dsp. This doesn't have L3 support yet, so it would be necessary to directly design graphs using the L2 components, which doesn't seem impossible, but is definitely harder.

I also don't know how many instructions are actually supported by the Ryzen AI, or whether it's possible to code custom kernels for it so that dequantization also runs inside the graph, or if we only have matmul support. I won't research this further before it's even confirmed that Vitis AI is the right spot and someone manages to run it on Linux. The entire development suite is Linux-only and all the tutorials are made for it as well.

Edit: Found this forum post, "HLS not supported on V70", so HLS isn't supported and the lowest-level access is just XIR after all... The supported operators also don't seem promising (matmul translates to a conv2d, which I have no idea how that would work, but probably not as efficiently as a normal matmul). The actually supported operators are here: https://docs.xilinx.com/r/en-US/ug1414-vitis-ai/Currently-Supported-Operators

You can write custom operators, but without HLS they run only on the CPU. That makes sense, since otherwise it wouldn't really be possible to share the NPU between different processes, as HLS just generates an xclbin for flashing.

The only lead to actual matmul access I have is the Windows IPU driver. It has an aieml_gemm_vm_phx_4x4.xclbin which is probably used by ONNX somewhere for GEMM support. This would still mean that, even with a very tricky setup, we could at most support matrix multiplication acceleration. Reverse engineering where that is used would probably be too much for me, though; maybe AMD themselves will provide a new API at some point. The XIR operators are catered towards vision AI (makes sense, since all the videos showcasing Ryzen AI were vision-related except LLaMA) and are very hard to use for transformers. I might try to convert a LLaMA model to ONNX -> xmodel and look at the graph to see what they do. But I agree with previous commenters that this will likely lead to a lot of overhead, and it's also beyond me to reverse engineer that.

Kinda sad, since the hardware seemed really cool to play around with: hardware used by XDNA/IPU and HLS. I thought we could use it just like an ALU for vectors and matrices, but I guess not .-.

@stefanos-kalantzis

https://github.com/amd/xdna-driver

@arlo-phoenix
Contributor

https://github.com/amd/xdna-driver

Thanks for the link, nice to see things moving forward. Since the example only uses XRT directly, I decided to dig through the RyzenAI-SW repo again to see whether there are any other examples of matmul with XRT inside it --- and found something promising: example/transformers/ops/cpp/qlinear_2/qlinear_2.hpp. If someone is fine with testing around a bunch, it should be doable to accelerate matmuls with int8 x int8 NPU kernels; the code is semi-well documented, and for testing around you can start with the Python API as well.

I'll probably not look into this anymore for a while / can't, thought Hawk Point laptops would come out faster -.- so good luck if anyone else wants to try this! The issue with creating this without an existing API is that you'd need to pack all of that project's dependencies into the .cpp file, which isn't the worst, but if they update or have bugs this is going to be annoying to maintain. The file's doc comment still says experimental as well, and I assume at some point there will be an official API for this anyway, so this would be more experimenting/trying it out rather than something that should be made into a PR, imo.

It also isn't clear to me whether an application/project should provide the xclbin for supported targets itself (it uses this one for Phoenix) or whether they will be packed in a common folder like the overlays in the /opt/xilinx/overlaybins/{device_id} folder. As said previously, you currently cannot (and likely never will be able to) create custom kernels, so these will probably be packed with a driver/library install at some point, so I'd wait a bit longer for things to become more stable/clear unless you want to test around a bunch.

@Vorlent

Vorlent commented Mar 5, 2024

The Ryzen 7 8700G APU can have its DDR5 memory channels overclocked up to DDR5-10600. This would result in a purely theoretical memory bandwidth of around 140 GB/s, which is roughly half of a low-end GPU such as the 7600 XT. In exchange, you could easily get up to 48 GB of system RAM using 2x 24 GB sticks. Populating all slots will cause the frequency to decrease.

Ok, back to the topic at hand. Ryzen APUs come with the Ryzen AI Engine. From my research based on arlo-phoenix's references, you really only need XRT to launch DPU kernels. You can think of XRT as the equivalent of CUDA or OpenCL for the DPU. A DPU kernel in the repo has a corresponding .txt file (yes, the instructions are hex encoded) such as "a8w8acc32_8_2k_2k.txt", referring to an 8-bit 2048x2048 matrix multiplication.

So the two remaining questions are:

  1. How is data loaded into the DPU?
  2. Is there a way to compile our own kernels for the DPU?

XRT uses the concept of buffer objects (BO): https://xilinx.github.io/XRT/2023.2/html/xrt_native.main.html#buffer-apis
You can reserve a buffer object and then get a pointer to the memory region accessible to the DPU to which you can write your data as usual.
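
For reference, a minimal sketch of that buffer-object flow with the XRT native C++ API. The xclbin path and the kernel name are placeholders; the real overlay, kernel signature, and buffer sizes depend on what AMD ships:

// Sketch of the XRT native C++ buffer-object flow described above.
// "dpu_overlay.xclbin" and "dpu_gemm" are placeholders, not real artifact names.
#include <xrt/xrt_device.h>
#include <xrt/xrt_kernel.h>
#include <xrt/xrt_bo.h>
#include <cstdint>

int main() {
    xrt::device device(0);                                 // open the first NPU/DPU device
    auto uuid = device.load_xclbin("dpu_overlay.xclbin");  // load a pre-built overlay
    xrt::kernel gemm(device, uuid, "dpu_gemm");            // hypothetical kernel name

    const size_t bytes = 2048 * 2048;                      // one int8 2048x2048 operand
    xrt::bo a(device, bytes, gemm.group_id(0));            // buffer objects visible to the DPU
    xrt::bo b(device, bytes, gemm.group_id(1));
    xrt::bo c(device, bytes, gemm.group_id(2));

    auto *a_host = a.map<int8_t *>();                      // host pointers into the BOs
    auto *b_host = b.map<int8_t *>();
    // ... fill a_host and b_host with quantized weights/activations ...
    (void)a_host; (void)b_host;
    a.sync(XCL_BO_SYNC_BO_TO_DEVICE);
    b.sync(XCL_BO_SYNC_BO_TO_DEVICE);

    auto run = gemm(a, b, c);                              // enqueue the kernel
    run.wait();                                            // wait for completion
    c.sync(XCL_BO_SYNC_BO_FROM_DEVICE);                    // read results back on the host
}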

This is speculation from my side, but Apache TVM integration already exists for the DPUs in Xilinx's FPGAs. We don't know which exact DPU is in the Ryzen, but here is a list of supported DPUs: https://tvm.apache.org/docs/how_to/deploy/vitis_ai.html

Apache TVM by itself does not actually compile anything to the DPU instructions. https://github.com/apache/tvm/blob/main/python/tvm/relay/op/contrib/vitis_ai.py

It simply delegates the actual work to https://github.com/Xilinx/pyxir

PyXIR is a Neural Network Intermediate Representation (IR) for deep learning. It is designed to be an interface between deep learning frameworks and neural network hardware accelerators, specifically Xilinx Vitis-AI FPGA-based accelerators like the DPU. It works with the following Vitis-AI accelerators:

DPUCADX8G (formerly DPUv1)
DPUCZDX8G (formerly DPUv2)

Ok, so we are now down to the final question. Does pyxir actually generate the code, or will it call some proprietary library to which we have no access behind the scenes?

I haven't dug too deep but here is a list of implemented DPU operations: https://github.com/Xilinx/pyxir/tree/master/python/pyxir/graph/ops

Here is a description of the XIR-based flow: https://docs.xilinx.com/r/1.1-English/ug1414-vitis-ai/XIR-Based-Flow-for-DPUv3
From what I can tell, pyxir delegates the actual compilation to vai_c_tensorflow, which belongs to a local Vitis AI installation, so as long as you can get your hands on that, it should be possible to develop new kernels. https://docs.xilinx.com/r/en-US/ug1414-vitis-ai/VAI_C-Usage

@Vorlent

Vorlent commented Mar 5, 2024

Considering the extreme similarity in the micro-architecture of the Versal AI engine versus the Ryzen AI engine and the extensive documentation of the Versal AI engine, this isn't going to be a ROCm dud. I'm making a bold prediction: The market for AI inference using consumer GPUs will disappear by the end of 2025 and nobody will ever consider using CUDA or ROCm for AI inference on a PC again.

@Vorlent

Vorlent commented Mar 7, 2024

Ok, pyxir was a red herring, but my reasoning was the following: there are a few compiled .dll files containing a reference to Apache TVM, and therefore that was my starting point, and it was a decent one.

Overview of the hardware architecture of the Ryzen AI NPU

The Ryzen 7040 officially has 10 TOPS and 20 tiles, i.e. 0.5 TOPS per tile, and both the Xilinx and Ryzen AI documentation confirm this. The Vitis documentation refers to two types of AI Engine tiles: the old ones, which do not even support bfloat16, and the newer ones called AI Engine-ML, or AIE-ML for short. The bfloat16 performance is one quarter of the advertised integer TOPS, namely 2.5 TFLOPS (4 TFLOPS for the 8700G). Each tile contains a fully C-programmable RISC/VLIW processor with 16 KB of instruction memory and 64 KB of scratchpad memory, and is also capable of accessing the memory of neighboring tiles. The tiles are arranged in columns of 4 tiles plus 1 memory tile that has 512 KB of SRAM, with a memory bandwidth of at least 30 GB/s per memory tile. This means the total SRAM in the 8700G NPU will be 2 MB of local tile memory and another 4 MB in the memory tiles.


Resources:
Versal Adaptive SoC AIE-ML Architecture Manual (AM020)
AI Engine-ML Intrinsics User Guide
AIE Tile VS AIE-ML Tile
Versal ACAP AI Engine

Riallto - an exploration framework for the AMD Ryzen™ AI NPU
Design Rationale of Two Generations of AI Engines (PDF slides)

How to program the AIE-ML Tiles

As mentioned above, there are extreme similarities between the AI Engine tiles used by Xilinx and the new AIE-ML tiles developed after the AMD acquisition. Xilinx AI Engine tiles were always C-programmable using Vitis AI, but AMD has decided to go a different route with Ryzen AI. To get an overview, I highly recommend watching this video: https://www.youtube.com/watch?v=pfazqbOODIU.

AMD has doubled down on building an MLIR (Multi-Level Intermediate Representation) backend for Ryzen AI called mlir-aie, which is available at https://github.com/Xilinx/mlir-aie.

This repository contains an [MLIR-based](https://mlir.llvm.org/) toolchain for AI Engine-enabled devices, such as [AMD Ryzen™ AI](https://www.amd.com/en/products/ryzen-ai) and [Versal™](https://www.xilinx.com/products/technology/ai-engine.html).

So, first of all, this toolchain still depends on Vitis, but you can get a free license. Therefore it is not fully open source, but you no longer need an FPGA license for access, and the proprietary components are kept to a minimum. So, how do you generate the MLIR? For that you need a frontend: just as clang is a C/C++ frontend for LLVM, Polygeist is a C/C++ frontend for MLIR.
You can find Polygeist here: https://github.com/llvm/Polygeist. Alternatively, anything that targets MLIR can compile kernels, e.g. OpenAI's Triton, which lets you program kernels in Python.

Here is a guide on how to do that: https://riallto.ai/4_2_write_your_kernel.html, and here is the example kernel they show:

%%kernel
void passthrough(uint8_t *in_buffer, uint8_t *out_buffer, uint32_t nbytes)
{
    for(int i=0; i<nbytes; i++) {
        out_buffer[i] = in_buffer[i];
    }
}

So it is just simple C code plus some vector intrinsics. To make life easier for kernel developers, they have also developed automatic vectorization: https://xilinx.github.io/mlir-aie/AIEVectorization

void conv2d(int img_in[17][272], int kernel_coeff[3][3], int img_out[16][256]) {
    for(int r = 0; r < 16; r++)
        for(int c = 0; c < 256; c++) {
            int acc = 0;
            for(int i = 0; i < 3; i++)
                for(int j = 0; j < 3; j++) {
                    acc += img_in[r+i][c+j] * kernel_coeff[i][j];
                }
            img_out[r][c] = acc;
        }
}

which is correctly vectorized, via the pipeline below, into this C++ code built on the AIE vector intrinsics:

mlir-clang --function=conv2d conv2d_i32.c -S --raise-scf-to-affine | aie-opt --affine-loop-unroll="unroll-full unroll-full-threshold=3" --canonicalize -affine-super-vectorize="virtual-vector-size=8 vectorize-reductions" --aie-vectorize | aie-translate --aievec-to-cpp
void conv2d(int32_t * restrict v4, size_t m1, int32_t * restrict v5, size_t m2, int32_t * restrict v6, size_t m3) {
  size_t v7 = 0;
  size_t v8 = 2;
  v8int32 v9 = *(v8int32 *)(v5 + 3*v7+v7);
  v8int32 v10 = *(v8int32 *)(v5 + 3*v8+v8);
  size_t v11 = 0;
  size_t v12 = 16;
  size_t v13 = 1;
  for (size_t v14 = v11; v14 < v12; v14 += v13)
  chess_prepare_for_pipelining
  chess_loop_range(16, 16)
  {
    size_t v15 = 1;
    size_t v16 = v14 + v15;
    size_t v17 = 2;
    size_t v18 = v14 + v17;
    size_t v19 = 0;
    size_t v20 = 256;
    size_t v21 = 8;
    for (size_t v22 = v19; v22 < v20; v22 += v21)
    chess_prepare_for_pipelining
    chess_loop_range(32, 32)
    {
      v16int32 v23;
      int32_t * restrict r_v23_v4 = v4;
      v23 = upd_w(v23, 0, *(v8int32 *)(r_v23_v4 + 272*v14+v22));
      v8acc80 v24 = lmul8(v23, 0, 0x76543210, v9, 0, 0x00000000);
      size_t v25 = 1;
      size_t v26 = v22 + v25;
      v23 = upd_w(v23, 1, *(v8int32 *)(r_v23_v4 + 272*v14+v26 + 7));
      v24 = lmac8(v24, v23, 1, 0x76543210, v9, 1, 0x00000000);
      v24 = lmac8(v24, v23, 2, 0x76543210, v9, 2, 0x00000000);
      v16int32 v27;
      int32_t * restrict r_v27_v4 = v4;
      v27 = upd_w(v27, 0, *(v8int32 *)(r_v27_v4 + 272*v16+v22));
      v24 = lmac8(v24, v27, 0, 0x76543210, v9, 3, 0x00000000);
      v27 = upd_w(v27, 1, *(v8int32 *)(r_v27_v4 + 272*v16+v26 + 7));
      v24 = lmac8(v24, v27, 1, 0x76543210, v9, 4, 0x00000000);
      v24 = lmac8(v24, v27, 2, 0x76543210, v9, 5, 0x00000000);
      v16int32 v28;
      int32_t * restrict r_v28_v4 = v4;
      v28 = upd_w(v28, 0, *(v8int32 *)(r_v28_v4 + 272*v18+v22));
      v24 = lmac8(v24, v28, 0, 0x76543210, v9, 6, 0x00000000);
      v28 = upd_w(v28, 1, *(v8int32 *)(r_v28_v4 + 272*v18+v26 + 7));
      v24 = lmac8(v24, v28, 1, 0x76543210, v9, 7, 0x00000000);
      v24 = lmac8(v24, v28, 2, 0x76543210, v10, 0, 0x00000000);
      v8int32 v29 = srs(v24, 0);
      *(v8int32 *)(v6 + 256*v14+v22) = v29;
    }
  }
  return;
}

Resources:
AI Engine Kernel Coding Best Practices Guide (UG1079)
T2: Leveraging MLIR to Design for AI Engines

MLIR-based AIEngine toolchain
AIE Build License

Recommendations for llama.cpp kernel developers

Linux Setup and Build Instructions
Windows Setup and Build Instructions

I don't think I have the time to work on this, but I have looked at the intrinsics. These are the most relevant pages that you need to have glanced over at least once.

Vector operations: Bitwise logic is essential for dequantization. Each AI Engine tile has 64 vector lanes for the int8 data type. You do not get anything smaller or bigger than this, but the good news is that all common bitwise operators are supported, such as bit shifting, AND masking, bitwise OR, etc. This means dequantization is possible. These operators do not exist in the old AI Engine tiles, which only support 8 bits, take it or leave it. So the takeaway is that the hardware is not gimped.
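
To make that concrete, here is a scalar reference for what a dequantization kernel has to do for one GGML Q4_0-style block (32 weights packed as 4-bit values plus one scale; the real block stores the scale as fp16, a plain float is used here to keep the sketch self-contained). Only shifts, masks, a subtraction, and a multiply are needed:

// Scalar reference for dequantizing one GGML Q4_0-style block: 32 weights packed as
// 4-bit values plus one scale. (Real blocks store the scale as fp16; a float is used
// here to keep the sketch self-contained.) Only shifts, masks and multiplies are needed,
// all of which the AIE-ML vector unit provides for int8 lanes.
#include <cstdint>

struct BlockQ4 {
    float   d;       // scale
    uint8_t qs[16];  // 32 x 4-bit quantized weights, two per byte
};

void dequantize_q4_block(const BlockQ4 &b, float out[32]) {
    for (int j = 0; j < 16; ++j) {
        const int x0 = (b.qs[j] & 0x0F) - 8;   // low nibble  -> elements 0..15
        const int x1 = (b.qs[j] >> 4) - 8;     // high nibble -> elements 16..31
        out[j]      = x0 * b.d;
        out[j + 16] = x1 * b.d;
    }
}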

Now onto the more LLM-focused part. You can multiply a 4x8 matrix with an 8x4 matrix, or multiply 16 1x2 matrices with 16 2x1 matrices (equivalent to 16 dot products of two 2-element vectors). This gives you the theoretical 4 TFLOPS of performance. Any other operation will give you less, except for 4-bit and 8-bit integers of course.
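
That 4x8 times 8x4 shape maps directly onto the aie::mmul helper in the AIE API. A rough single-tile sketch, assuming the API behaves as its documentation describes and ignoring how A and B get into tile memory:

// Rough sketch of a 4x8 * 8x4 int8 tile multiply with the AIE API's aie::mmul helper
// (AIE-ML target). Data movement into tile memory is omitted; A and B are assumed to be
// laid out as contiguous 4x8 and 8x4 tiles per K-step.
#include <aie_api/aie.hpp>

void gemm_4x8x4_tile(const int8_t *A, const int8_t *B, int8_t *C, int k_tiles) {
    using MMUL = aie::mmul<4, 8, 4, int8, int8>;    // M=4, K=8, N=4
    MMUL acc;

    aie::vector<int8, 32> a = aie::load_v<32>(A);   // first 4x8 tile of A
    aie::vector<int8, 32> b = aie::load_v<32>(B);   // first 8x4 tile of B
    acc.mul(a, b);

    for (int t = 1; t < k_tiles; ++t) {             // accumulate the remaining K tiles
        a = aie::load_v<32>(A + t * 32);
        b = aie::load_v<32>(B + t * 32);
        acc.mac(a, b);
    }

    aie::store_v(C, acc.to_vector<int8>(0));        // shift/round/saturate back to int8
}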

So what should be accelerated? I am not exactly an expert in the LLaMA-2 architecture, but you should focus on plain old matrix-matrix and matrix-vector multiplication. You should put the columns of the second matrix (for matrix-matrix multiplication) in the memory tiles, keep the intermediate results in the SRAM, and stream the quantized matrix weights directly from DRAM, dequantizing them without spilling to the local scratchpad SRAM. The actual multiplication itself shouldn't be a big deal, but you might have to accumulate the results. The 16-dot-product strategy might be easy to whip up, and that should already get you 64 flops/cycle, at which point your bottleneck will never be compute again, only memory bandwidth. The CPU cores will stay mostly idle, and we will run LLMs like Mixtral 8x7B locally on our Strix Point laptops with 32 to 64 GB for probably under 1500€, instead of on absurdly expensive Nvidia GPUs where just the GPU costs you 2000€ with only 24 GB. I'm sure there will be enthusiasts stacking three to four GPUs so that they can run Goliath 120B at more than 1 token per second, but I would honestly be happy if I could get 8-bit Mixtral working.

Personal remarks

By the way, the reason why I am sort of obsessed with this is that I originally had the idea to use Efinix FPGAs, not because of the compute but rather because you can just keep adding memory channels and connect FPGAs to each other, similar to Groq. This idea is dead in the water for two reasons: 1. Efinix FPGAs run their memory at really low frequencies and only support DDR4, so you only get 28 GB/s per FPGA. 2. The total bfloat16 FLOPS are 0.5 TFLOPS. This means a single AI Engine-ML tile will roast the FPGA, and you get 32 of them per chip. So really, the only benefit of the FPGA approach would be that you could stack the memory all the way to 256 GB of RAM while staying under 5000€ per node. That is such a small niche that I recommend everyone stop bothering with it. If you want to compete with Ryzen AI, build 256-bit memory channels onto your AI chip and put the DRAM over the AI accelerator via package-on-package, multi-chip module, or system-in-package, or at least put the DRAM chips on the opposite side of the PCB so that the vias go straight from the AI chip to the DRAM chip on the other side. Literally nobody is doing this except Apple. The NVIDIA way (HBM) is too expensive, and AMD is too stupid to get their GPU drivers fixed, so their GPUs are a big no-no. The AI Engines were originally developed by Xilinx, so they aren't cursed.

@dkuku

dkuku commented Mar 7, 2024 via email

@Vorlent

Vorlent commented Mar 8, 2024

@dkuku I admit software issues are going to cripple this, especially on Linux, but from what I know, they specifically tell you to compile the kernel yourself on Linux to enable IOMMU SVA support: https://github.com/Xilinx/mlir-aie/blob/main/docs/buildHostLin.md#update-linux . The article https://www.phoronix.com/news/AMD-IOMMU-SVA-Nears mentions that IOMMU SVA support will only ship with Linux 6.8. The latest release, from 5 days ago, is still only 6.7.9.

Also, it might be more interesting to use XDNA/Ryzen AI to accelerate convolutions first, because when I used LLaVA 1.6, it was significantly slower during image processing and then much faster during text generation. Having fast local vision LLMs might be more important than generative AI, e.g. to preserve privacy.

@marty1885

I'm waiting for Arch to ship the 6.8 kernel. I have a laptop that should have the NPU, and I've ported another NPU to GGML in the past (not upstreamed; SDK limitations crippled that).

TBH the Intel NPU seems more promising. The SDK looks much easier than AMD's, and there's less middleware.

@Vorlent

Vorlent commented Mar 12, 2024

Strix Halo (sorry, not Strix Point) is supposed to have more memory channels. Both NPUs are way beyond what the system memory is going to handle, so even a poorly optimized implementation is going to outrun DRAM. Meteor Lake has soldered RAM, but other than overclocking, they don't seem to have used this to double the number of memory channels? https://ark.intel.com/content/www/us/en/ark/products/236848/intel-core-ultra-5-processor-125h-18m-cache-up-to-4-50-ghz.html

If it is easier to make it work on Intel then go ahead with that!

@marty1885

marty1885 commented Mar 12, 2024

I don't have the hardware. I can only speculate, unless Intel or someone wants to donate one or give me remote access to their systems.

AMD's Riallto is interesting, but.. it only supports Windows. Writing a BLAS library for it is going to be fun... I don't think it's possible to fully utilize the FLOPS given the VLIW nature and the lack of ISA documentation. We'll have to see where it lands.

Now the blockers:

  • Riallto only has Windows support.
  • Riallto is in Python, but GGML is C.
  • Still waiting for kernel 6.8 for Ryzen AI support.

@vid

vid commented Mar 12, 2024

I bought a Lenovo with a 7840HS and 32 GB of RAM crazy cheap, for under 600 GBP, and Ryzen AI is not showing up as a device under Linux.

Here's a list of known support for AMD's IPU: amd/RyzenAI-SW#18

There have been cases of users organizing and petitioning vendors to enable this in the BIOS. Could start here.

I bought a P16s Gen 2; it has the 7840U, which is supported. Hoping the community will build on this, since I hope to wait a couple of years to upgrade, yet definitely want to run some local models.

@uniartisan

AMD has provided LLM demos with PyTorch and ONNX, but they are really hard to use. Ryzen AI with PyTorch supports INT3/4 quantization, but the ONNX provider only supports INT8 quantization.

I adapted some models to onnxruntime and tried to run them on Ryzen AI, but compatibility is very poor: most models cannot be compiled or run correctly, and some models get stuck repeating themselves or produce nonsense even when run on the CPU. It's a dilemma. Also, they are quite slow.

So it would be useful if ggml could support Ryzen AI. I think Windows support will be more important for this backend because more users are on Windows, although AMD released the Linux driver a few weeks earlier.

@Vorlent

Vorlent commented Mar 17, 2024

@marty1885 Linux 6.8.1 is out on Arch Linux: https://archlinux.org/packages/core/x86_64/linux/

I could swear there was a CLI tool to check AMD NPU utilization on Linux, but I don't remember where I saw that.

Edit:
Here is how you check that the NPU is working (after installing the XDNA Driver):

source /opt/xilinx/xrt/setup.sh
xbutil examine

@Vorlent

Vorlent commented Mar 18, 2024

@uniartisan llama.cpp, or more specifically the GGUF file format, has a variety of custom quantization schemes. We're going to need custom kernels specifically written for llama.cpp to make use of Ryzen AI. The challenge is writing kernels in C/C++ that get vectorized by MLIR-AIE with good performance. I am not exactly sure why Vitis AI is mandatory, but once someone has compiled binaries for Ryzen AI using it, those binaries will most likely be identical for both Windows and Linux. Assuming the XRT runtime library works on both Windows and Linux, both will be supported at roughly the same time.

Regarding VLIW, I do not necessarily think that this is going to be a hurdle for performance; rather, it is going to be a hurdle for backwards compatibility, since static scheduling requires you to recompile the code for each new micro-architecture to get better performance. If AMD keeps releasing new NPUs, then the number of binaries that need to be shipped may increase. This could lead to a situation similar to ROCm, where AMD is very eager to drop support for old GPUs. Nvidia solved this by first compiling to PTX and letting the driver compile to the specific GPU microarchitecture.

@github-actions github-actions bot added the stale label Apr 17, 2024
@marty1885

I did some more digging into the XDNA NPU. More roadblocks:

  1. There is no public C++ API. AMD's Riallto SDK uses a blob of Python modules.
  2. The official example running LLaMA 2 loads a DLL from their repository. There's no way I can compile my own stuff.
    • I can always just use theirs. But.. welp.
  3. The toolchain seems to mostly run on Linux, but the development code is on Windows.

@cyber-tooth01

I could be very wrong, but I did some looking, and I came across some AIE documentation with mentions of a C++ API.

This one specifically references low-level development with the NPU.
https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_api/aie_api/doc/index.html

https://xilinx.github.io/mlir-aie/

And this one talks about kernel development.
https://riallto.ai/4_2_write_your_kernel.html

Could we reach out to AMD and ask them if they know anything about an implementation?

@marty1885

marty1885 commented Apr 24, 2024

Could we reach out to AMD and ask them if they know anything about an implementation?

We can. But I think it's difficult to get good responses if we approach them with "we have this idea and we want to build it. Give us docs". There must at least be something to show. I actually asked AMD's sales team today at AI Expo Taipei. Their engineers also agree that the documentation is sparse at best (and they promise to improve in the future, but I doubt that will materialize soon).

This one specifically references low-level development with the NPU.

Cool! I didn't find this. I looked into the mlir-aie repo this morning. It looks promising and seems to be what Riallto is using under the hood. But there is still a large knowledge gap between what it provides and how Riallto's API works. I'll try to make a Hello World program first and see where it goes from there.

@EwoutH
Contributor

EwoutH commented Apr 24, 2024

@geohot has been somewhat effective in getting AMD to (promise to) release stuff.

@github-actions github-actions bot removed the stale label Apr 25, 2024
@github-actions github-actions bot added the stale label May 26, 2024
@AntonIXO

AntonIXO commented Jun 8, 2024

Here's some news from AMD: https://www.phoronix.com/news/AMD-Peano-LLVM-Ryzen-AI
This might be helpful.

@github-actions github-actions bot removed the stale label Jun 9, 2024
@thatbuzzguy

So there has been some movement. Open source tooling is being made available: https://discourse.llvm.org/t/peano-llvm-support-for-amd-xilinx-ai-engine-processors/79458

@Vorlent

Vorlent commented Jul 16, 2024

Sooo, does anyone care about this? The Peano compiler is nice, but even without it, nothing stops anyone from compiling their own kernels via Vitis AI as of now. I don't have any Ryzen AI hardware, so I can only run the example mlir-aie kernels in simulation.

Also, the VLIW problem I mentioned before is rearing its ugly head. The "AMD Ryzen™ AI 9 HX 370" (damn, what a name!) will have the same 8 x 4 tile arrangement, but with triple the TOPS. This means that they either cranked up the frequency by a factor of 3 (unlikely), or they increased the vector width from 1024 bits to 2048 bits and raised the frequency to 1.5 GHz (more likely).

This means there will be a major split between the current 8700G and 7840HS NPUs, which primarily differ in tile count, and the next-generation HX 370 NPUs, so the kernels have to be updated every time a new generation comes out and there is no binary backwards compatibility.

An upgrade to the Ryzen 8700G would require me to buy a new mainboard and new DDR5 RAM. I need a better excuse than doing unpaid labor for AMD to support LLMs on their NPUs to drop 600€ on a CPU upgrade (anything less than 96 GB of RAM does not make sense in 2024) that will become obsolete for development purposes as soon as next year. I wouldn't have hesitated if this were an upgrade to DDR6. If I had known that AMD started this hackster.io contest (https://www.hackster.io/contests/amd2023), I would have applied for the Ryzen AI kit.

If anyone wants help installing XDNA and XRT, or has run into problems installing Vitis AI, they can throw some questions at me. I have created a sort-of-unfinished PKGBUILD file with some minor patches to get XRT/XDNA compiling on Arch Linux. I can't guarantee that I will respond very quickly, though, or that I will upload an XRT/XDNA package to the AUR.

Also, the intrinsics reference is actually not what you want for developing your own kernels. Instead, check out the AI Engine API User Guide: https://www.xilinx.com/htmldocs/xilinx2022_2/aiengine_api/aie_api/doc/group__group__basic__types__initialization.html

Writing a C++ kernel is actually the easy part here. The harder part is that you must understand how object_fifo and the NPU DMA are set up in the Python-based domain-specific language. You will need to understand how to use the XRT API, how to connect your tiles using aie.py, and finally how to write a kernel in C++ using aie::vector as your main datatype.
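
To illustrate the compute side only: in the mlir-aie examples the kernel usually ends up as a small C-callable function over aie::vector chunks, with all the data movement handled by the object_fifo/DMA setup outside of it. A minimal sketch (the function name is arbitrary, and n is assumed to be a multiple of 32):

// Minimal AIE-ML kernel sketch in the style of the mlir-aie examples: a plain C-callable
// function that the Python-side object_fifo/DMA plumbing feeds with buffers.
// The function name is arbitrary; n is assumed to be a multiple of 32.
#include <aie_api/aie.hpp>
#include <cstdint>

extern "C" void add_int8(const int8_t *a, const int8_t *b, int8_t *out, int32_t n) {
    for (int32_t i = 0; i < n; i += 32) {
        aie::vector<int8, 32> va = aie::load_v<32>(a + i);  // load 32 int8 lanes
        aie::vector<int8, 32> vb = aie::load_v<32>(b + i);
        aie::store_v(out + i, aie::add(va, vb));            // elementwise add, store result
    }
}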

@marty1885

I'm waiting for:

  • RyzenAI mainline Linux support (currently needs a custom kernel)
  • A working RyzenAI hello world in C/C++ on Linux
  • XRT to compile with GCC 14
  • The list goes on

In the meantime, I'm working on a backend for Tenstorrent's hardware.

@GreyXor

GreyXor commented Jul 21, 2024

https://www.phoronix.com/news/AMD-XDNA-Ryzen-AI-Driver-Patch

@sailorbob74133

AMD is claiming to have partnered with LM Studio to run Llama 3.1 on the "Ryzen AI™ series of processors." But I don't see that either the AMD NPUs or iGPUs are supported in LM Studio, just discrete graphics cards. So are they lying?

https://community.amd.com/t5/ai/llama-3-1-ready-to-run-on-amd-platforms-from-data-center-edge-to/ba-p/697323

@GrayXu

GrayXu commented Jul 24, 2024

@sailorbob74133 LM Studio is a wrapper around llama.cpp, and currently llama.cpp does not support the AMD NPU, so...

@shtirlic

It works perfectly fine with ROCm offloading on the 7840U, which is an APU and not a dGPU.

@czAdamV

czAdamV commented Jul 24, 2024

@shtirlic ROCm offloads to the GPU though (iGPU in case of the 7840U), not the NPU, and this thread is specifically about the latter.

@shtirlic

shtirlic commented Jul 24, 2024

@czAdamV yes, sorry, that was the answer to @sailorbob74133's message. I still don't expect the NPU kernel module's performance to be superior to the current ROCm implementation; time will tell.

@audiovention

@shtirlic can you elaborate - how do you get ROCm to work on the 7840U?
Sorry for the off-topic question.

@shtirlic

@audiovention as always, you fake it like this:

export HSA_OVERRIDE_GFX_VERSION=11.0.2

@AldarisX

finally https://github.com/amd/RyzenAI-SW/tree/main/example/transformers/ext/llama.cpp
