
NPU support in whisper.cpp #1557

Open · 3 tasks
bobqianic opened this issue Nov 27, 2023 · 10 comments
Labels: good first issue · performance · research🔬

@bobqianic (Collaborator) commented Nov 27, 2023

Christmas is coming soon, and I want to take some time to research something interesting, such as low-power edge inference. Although the current whisper.cpp can run on a Raspberry Pi, its inference performance cannot achieve real-time transcription. Fortunately, there are now development boards built around processors with NPUs, which could be used to achieve real-time transcription with larger models. My primary goal is to support the RK3566 and RK3588 first.

Roadmap:

  • MatMul offloading
  • Conv-Gelu offloading
  • LayerNorm offloading
    ...

Reference:

https://github.com/rockchip-linux/rknpu2

bobqianic added the good first issue, performance, and research🔬 labels on Nov 27, 2023
@ggerganov (Owner) commented:

Would be great if we can find a way to utilize the NPUs! Keep us in the loop!

@Leeviber commented Nov 30, 2023

I tried converting the Whisper encoder model to the RKNPU format (.rknn). The conversion succeeded, but the estimated runtime is quite slow, even slower than running on the CPU. I think the NPU doesn't fully support transformers, so some operators still run on the CPU.
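
For reference, this is roughly how such a conversion looks with Rockchip's rknn-toolkit2 Python API. The ONNX file name and the choice to skip quantization here are assumptions for illustration, not what was actually run:

```python
# Minimal sketch: converting an ONNX export of the Whisper encoder to .rknn
# with rknn-toolkit2. File names and settings are illustrative assumptions.
from rknn.api import RKNN

rknn = RKNN()
rknn.config(target_platform='rk3588')        # or 'rk3566'
ret = rknn.load_onnx(model='whisper_encoder.onnx')
assert ret == 0, 'failed to load ONNX model'
ret = rknn.build(do_quantization=False)      # FP16 build; INT8 would need a calibration dataset
assert ret == 0, 'failed to build RKNN model'
rknn.export_rknn('whisper_encoder.rknn')
rknn.release()
```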

@RoboMagus commented:

Some interesting development was done here: https://github.com/usefulsensors/useful-transformers.

However, not everything runs on the NPU, and I've personally had mixed success running non-English models.

@bobqianic (Collaborator, Author) commented:

> Some interesting development was done here: https://github.com/usefulsensors/useful-transformers.

Yes, I've seen that. But I'm looking to enhance the ggml tensor library by adding some operators, so that not only whisper.cpp but also other ggml-based projects like llama.cpp will be able to utilize the NPU. I've ordered an OrangePi 5 Plus with 32 GiB of RAM from AliExpress, which is still in transit :)

> However, not everything runs on the NPU, and I've personally had mixed success running non-English models.

Hopefully, we'll be able to run all models, regardless of their size and whether they are English-only or multilingual.

@bobqianic (Collaborator, Author) commented:

The most challenging aspect I've encountered thus far is finding an appropriate driver for the RK3588 & RK3566 NPU. Most Linux distributions don't include an NPU driver, with the one below being the notable exception.

https://github.com/unifreq/linux-5.10.y-rk35xx/tree/main/drivers/rknpu

@bobqianic (Collaborator, Author) commented:

> I tried converting the Whisper encoder model to the RKNPU format (.rknn). The conversion succeeded, but the estimated runtime is quite slow, even slower than running on the CPU. I think the NPU doesn't fully support transformers, so some operators still run on the CPU.

You're right. From my experiments, it seems the NPU on the RK3588 is only effective for 3×3 convolutions. Unfortunately, its GEMM performance is quite poor: despite being advertised as a 3×2 TOPS NPU, each core only delivers about 10 GFLOPS for FP16 GEMM or 20 GOPS for INT8 GEMM. It's quite a letdown. I regret to share such disappointing news during the holiday.

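To put those numbers in perspective, here is a back-of-the-envelope utilization estimate, assuming "3×2 TOPS" means three NPU cores at roughly 2 TOPS INT8 peak each, with FP16 peak taken as half of that:

```python
# Rough utilization estimate from the figures above. Assumption: 3 NPU cores
# at ~2 TOPS INT8 peak each; FP16 peak taken as half of the INT8 peak.
int8_peak_gops_per_core = 2000.0    # 2 TOPS
fp16_peak_gflops_per_core = 1000.0  # assumed half of INT8 peak

measured_fp16_gflops = 10.0
measured_int8_gops = 20.0

print(f"FP16 GEMM utilization: {measured_fp16_gflops / fp16_peak_gflops_per_core:.1%}")  # 1.0%
print(f"INT8 GEMM utilization: {measured_int8_gops / int8_peak_gops_per_core:.1%}")      # 1.0%
```

In other words, measured GEMM throughput comes out to around 1% of the advertised per-core peak under these assumptions.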

bobqianic changed the title from "Rockchip NPU support in whisper.cpp" to "NPU support in whisper.cpp" on Dec 23, 2023
@bobqianic (Collaborator, Author) commented Dec 24, 2023

I discovered that someone else has already tried exactly the same thing, but didn't find success. @ggerganov

The challenge with the Rockchip NPU stems from its peculiar input and output layouts. To attain maximum speed, you have to transform each 2D matrix into a particular tiled layout; if you don't, the driver takes over, but it operates much slower. After processing, you also need to convert the result back to the original layout. This round trip is quite inefficient, and I'm sharing this to save others from spending unnecessary time trying to implement it.

With the RK3588, when you're working with a matrix A of size (N, K) and a matrix B of size (K, M), you need to reshape matrix A to the dimensions (K/8, N, 8) and matrix B to (M/16, K/32, 16, 32). After these transformations, the resulting output matrix C has the dimensions (N/4, M, 4) instead of the expected (N, M).
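
Here is a NumPy sketch of the layout transforms just described. The tile shapes match the dimensions given above, but the exact element order inside each tile is an assumption (the real order is defined by the NPU driver):

```python
# Produce the NPU-friendly layouts described above. Tile shapes follow the
# RK3588 description; the in-tile element order is an assumption.
import numpy as np

N, K, M = 64, 64, 64
A = np.random.rand(N, K).astype(np.float16)   # (N, K)
B = np.random.rand(K, M).astype(np.float16)   # (K, M)

# A: (N, K) -> (K/8, N, 8)
A_npu = A.reshape(N, K // 8, 8).transpose(1, 0, 2)

# B: (K, M) -> (M/16, K/32, 16, 32)
B_npu = B.reshape(K // 32, 32, M // 16, 16).transpose(2, 0, 3, 1)

# The NPU returns C as (N/4, M, 4); undo the tiling to recover (N, M).
C_npu = np.zeros((N // 4, M, 4), dtype=np.float32)  # stand-in for NPU output
C = C_npu.transpose(0, 2, 1).reshape(N, M)

assert A_npu.shape == (K // 8, N, 8)
assert B_npu.shape == (M // 16, K // 32, 16, 32)
assert C.shape == (N, M)
```

The round-trip reordering on every matmul is exactly the overhead described above.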

Links:
https://clehaxze.tw/gemlog/2023/12-17-update-on-ggml-rknpu2-backend-and-rknpu2-1_6_0.gmi
https://github.com/marty1885/llama.cpp/tree/rknpu2-backend

Matrix A: [image]

Matrix B: [image]

Matrix C: [image]

@solarsamuel commented:

@bobqianic this is a great idea. The question is: how can we implement whisper.cpp on an NPU/TPU on an embedded device?

I have an OrangePi 5 and was hoping the NPU would provide benefits, but it looks like it won't be very useful. Thank you for looking into it.

I have one idea that may be theoretically possible, but it would require a good amount of work and $$$: use four Google Coral Edge TPUs in a pipeline (see the pipeline example here: https://coral.ai/examples/) and, in essence, jailbreak them to run models other than TensorFlow ones, for example Whisper models (George Hotz is working on this in these videos: https://www.youtube.com/watch?v=rArv2NUXGU8). The Coral Edge TPUs would take up all of the USB slots on a Raspberry Pi (maybe a USB hub could be used too), so there would be a bandwidth constraint. Each TPU has up to 8 MB of SRAM to store models, but in reality it's more like 6.5 MB each, so probably a maximum model size of about 26 MB across four units. The quantized 4-bit tiny model comes in under this. The entire setup may be possible and run quickly, but the accuracy of the tiny model isn't that great.
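
For the pipelining side, a hypothetical sketch based on Coral's pycoral 2.x pipelining examples. Whisper models are not supported by the Edge TPU toolchain today, so this only shows how a TFLite model compiled into four segments would run across four TPUs; the segment file names (output of `edgetpu_compiler --num_segments=4 model.tflite`) are assumptions:

```python
# Hypothetical sketch of Coral model pipelining across 4 Edge TPUs, following
# the pycoral 2.x examples. Segment files are assumed to come from
# `edgetpu_compiler --num_segments=4 model.tflite`.
import numpy as np
from pycoral.utils.edgetpu import make_interpreter
from pycoral.pipeline.pipelined_model_runner import PipelinedModelRunner

interpreters = []
for i in range(4):
    interp = make_interpreter(f'model_segment_{i}_edgetpu.tflite', device=f':{i}')
    interp.allocate_tensors()
    interpreters.append(interp)

runner = PipelinedModelRunner(interpreters)
details = interpreters[0].get_input_details()[0]
runner.push({details['name']: np.zeros(details['shape'], dtype=np.uint8)})  # one dummy input
result = runner.pop()  # dict of output tensors from the last segment
runner.push({})        # empty push signals end of input
```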

Another idea would be to take TPUs or FPGAs and connect them to a Raspberry Pi via USB or as a Raspberry Pi HAT. That would be bandwidth-limited by the communication protocol (serial, I2C, etc.).

Maybe one day when chips like this come out things will be easier for embedded AI: https://www.arm.com/products/silicon-ip-cpu/ethos/ethos-u55

@ggerganov (Owner) commented:

@bobqianic Thank you for the updates! The work in marty1885/llama.cpp@rknpu2-backend is interesting and I will be following the progress.

@marty1885 commented:

For reference: people have worked around the matrix reordering specifically for Whisper by designing the entire implementation around that constraint.

useful-transformers is a very successful example: https://github.com/usefulsensors/useful-transformers
