FFT HW Acceleration #190
To me, it seems that the ESP-DSP "C" examples are merely ASM-optimized and do not use any DSP functions. Is that the case? |
Since these extra instructions are all integer instructions, only fixed-point FFT uses them: https://github.com/espressif/esp-dsp/blob/71514173b58b960173b40c4ade9d15d372770a74/modules/fft/fixed/dsps_fft2r_sc16_aes3.S |
Yes, that makes sense. This means that the DSP primitives have already been spotted in the wild and are not only described in the above paper. My approach now would be to fork or extend e.g. https://gitlab.com/teskje/microfft-rs so that, similar to the SHA HW acceleration, it uses the DSP primitives. Specifically, I would want to replace the radix-2 butterfly computation with the DSP HW functions. |
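For orientation, here is a minimal scalar sketch of the radix-2 butterfly that such a change would target, written for Q15 complex samples. The type `Cq15` and the functions `cmul_q15` and `butterfly_q15` are hypothetical names for illustration, the halving of both outputs is an assumed overflow-avoidance convention, and none of this is claimed to match the exact semantics of `EE.FFT.R2BF.S16` or `EE.CMUL.S16`.

```rust
/// Complex number with Q15 fixed-point parts (value = raw / 32768).
/// Hypothetical helper type for this sketch.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Cq15 {
    pub re: i16,
    pub im: i16,
}

/// Saturate a wide intermediate to the i16 range.
fn sat16(x: i64) -> i16 {
    x.clamp(i16::MIN as i64, i16::MAX as i64) as i16
}

/// Q15 complex multiply with round-half-up, using 64-bit intermediates
/// so that the sum of two products cannot overflow.
pub fn cmul_q15(a: Cq15, w: Cq15) -> Cq15 {
    let (ar, ai) = (a.re as i64, a.im as i64);
    let (wr, wi) = (w.re as i64, w.im as i64);
    let re = (ar * wr - ai * wi + (1 << 14)) >> 15;
    let im = (ar * wi + ai * wr + (1 << 14)) >> 15;
    Cq15 { re: sat16(re), im: sat16(im) }
}

/// Radix-2 decimation-in-time butterfly:
///   top = (a + w*b) / 2,  bot = (a - w*b) / 2
/// The division by 2 per stage is an assumption of this sketch.
pub fn butterfly_q15(a: Cq15, b: Cq15, w: Cq15) -> (Cq15, Cq15) {
    let t = cmul_q15(b, w);
    let top = Cq15 {
        re: ((a.re as i32 + t.re as i32) >> 1) as i16,
        im: ((a.im as i32 + t.im as i32) >> 1) as i16,
    };
    let bot = Cq15 {
        re: ((a.re as i32 - t.re as i32) >> 1) as i16,
        im: ((a.im as i32 - t.im as i32) >> 1) as i16,
    };
    (top, bot)
}
```

A scalar version like this is mainly useful as a reference to compare a hardware-accelerated butterfly against on the same inputs.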
Another (maybe stupid) idea could be to use the inline assembler asm!{} macro, especially because the S3 target supports, besides the DSP, some nice SIMD operations. The following compiles and even runs on the Xtensa platform:
Now there is a lot between the above code and the universe that finally runs my code, especially the LLVM. I have dug into the esp Rust branch a bit and it looks like at least something was done in that direction:
rust/compiler/rustc_target/src/asm/mod.rs Line 201 in ed3726b
Is that enough? I had naively just used a random DSP mnemonic - the error message is "mnemonic unknown". Does anyone have an idea on the topic? Thanks, |
@ramtej AFAIK, LLVM simply doesn't support/recognize the hundreds of DSP instructions, so using inline assembly would require either adding them all, or using the escaped binary opcode (but that also means specifying registers by number). I wish support could be added, but it's a non-trivial amount of search and replace in the llvm-project codebase. @MabezDev showed me this "trick" a good while ago |
@zRedShift Is the ".byte 0x00, 0x30, 0x00" sequence an inline assembly instruction specified as a sequence of bytes, serving as a workaround when the desired assembly instruction isn't supported by LLVM? Indeed, this is a neat trick; I'll give it a try. Thanks! |
There is a nice crate that does the '.byte' encoding for the RISC-V V extension instructions - https://github.com/cryptape/rvv-encoder. It would be exciting to have something similar for the Xtensa extension instructions.
|
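As a rough idea of what the smallest building block of such an encoder crate could look like, the sketch below formats a 24-bit instruction word as a `.byte` directive. The function name `byte_directive_24` is hypothetical, and the byte order (least-significant byte first, i.e. a little-endian core whose disassembler prints the word most-significant byte first) is an assumption of this sketch. Producing the instruction word itself from mnemonics and register numbers is the hard part and is not shown.

```rust
/// Format a 24-bit instruction word as a `.byte` directive string,
/// least-significant byte first (assumed little-endian layout).
/// Hypothetical helper, not part of any existing crate.
pub fn byte_directive_24(word: u32) -> String {
    assert!(word <= 0x00FF_FFFF, "instruction word must fit in 24 bits");
    let b = word.to_le_bytes();
    format!(".byte 0x{:02x}, 0x{:02x}, 0x{:02x}", b[0], b[1], b[2])
}
```

Under that assumption, the word `0x003000` yields the `.byte 0x00, 0x30, 0x00` sequence quoted earlier in the thread, and the resulting string could be pasted into an `asm!` template.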
@ramtej awesome find. This will also help with ESP32P4's custom RISC V DSP extensions, when it comes out. I haven't investigated it yet (I was actually planning on just writing |
Yes, probably the easiest thing for now will be to just link in the |
I've been warming up with some Xtensa LLVM backend contributions over the last few days. My end goal is fast (scalar or vector) DSP for ESP32S3 in Rust, since I'm working on real-time audio processing/encoding and I need to maximize the performance to cram as much processing as possible into the pipeline. This is obviously DCT/FFT/FIR etc. heavy, among other things. So a few more scalar instruction PRs, and I will move on to adding the 128-bit registers and instructions, so that the inline assembly can be supported directly. It's been a dream of mine for about a year to do it, at the time the ESP32S3 technical manual still didn't include the extensions, and reverse engineering Maybe even add auto-vectorization support in the (far) future, but it's a daunting task since the instructions are pipelined, the cost tables need to be populated, and there's a huge amount of user registers that control the runtime behavior of the instructions. |
Just in case, cc @sstefan1 who is planning to merge the initial support for ESP32-S3 DSP instructions into Espressif's LLVM fork soon. |
Don't want to step on anyone's toes, if @sstefan1 has already started work on this, I won't pursue, unless I can somehow assist? |
Currently we have all ESP32-S3 DSP instructions implemented in LLVM. All instructions are available in clang through clang's builtins, which translate to llvm intrinsics and then to appropriate instructions. For example: This work should be merged soon. I need to investigate how that should be done in rust, though. If anybody already knows, please let me know. |
@sstefan1 Well, we don't currently have any xtensa intrinsics support, but it would go to stdarch/core_arch here. It lives out of tree, so will need to be forked by the For inline assembly support, it's much simpler since most of the base work has already been done by @MabezDev here If you can get a branch/PR running on espressif/llvm-project, I can start work on initial support/testing on Rust for this. I'm already working on those files, adding rust support for the |
Ok, it looks to me like there is enough incentive and brainpower to tackle the ESP+Rust+DSP challenge. How are we going to coordinate this? Who does what? |
@ramtej as soon as the esp32s3 changes land on espressif/llvm-project or one of its branches, I'll start working on the PR for esp-rs/rust. |
I've merged the initial support for ESP32S3 DSP instructions in llvm. Just to keep in mind, builtins support is not yet very well tested. I will be doing testing in the following weeks. |
One more note, llvm-objdump currently doesn't work correctly with DSP instructions. To check generated assembly, it is best to use llc. |
Here's my branch with experimental support of this in Rust. |
declare void @llvm.xtensa.ee.vld.128.ip(i32, i32, i32) nounwind
define void @test2(i32 %p){
tail call void @llvm.xtensa.ee.vld.128.ip(i32 5, i32 %p, i32 16)
ret void
}
If I generate assembly with
// llc -O1 -mtriple=xtensa -mcpu=esp32s3 < xtensa-s3-ee-vld-128-ip.ll
.text
.file "<stdin>"
.global test2 # -- Begin function test2
.p2align 2
.type test2,@function
test2: # @test2
.cfi_startproc
# %bb.0:
entry a1, 32
.cfi_def_cfa_offset 32
ee.vld.128.ip q5, a2, 16
retw.n
.Lfunc_end0:
.size test2, .Lfunc_end0-test2
.cfi_endproc
# -- End function
.section ".note.GNU-stack","",@progbits
But if I generate an object file with llc and run gcc objdump (
We get 0x100 instead of 0x10, and this happens to all other numbers, only 0 is unaffected. |
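The single data point above (0x10 expected, 0x100 observed, 0 unaffected) is consistent with the immediate being off by a power-of-two factor. The sketch below is only a way to check that reading against more disassembled values, not a diagnosis of the backend; the function name `extra_shift` is hypothetical.

```rust
/// Returns Some(k) if `observed == expected << k` for some 1 <= k < 16.
/// A zero `expected` is skipped because it is unchanged by any shift.
/// Hypothetical helper for comparing disassembler output.
pub fn extra_shift(expected: u32, observed: u32) -> Option<u32> {
    if expected == 0 {
        return None;
    }
    (1..16).find(|&k| expected << k == observed)
}
```

For the values reported here, `extra_shift(0x10, 0x100)` gives `Some(4)`, i.e. a factor of 16.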
I will have to look into it. BTW, I'm not sure if xtensa-esp32s3-elf-objdump disassembles DSP instructions correctly either. I had some problems while testing. I will check the llvm as well, but just mentioning I had issues with disassembling too. |
Sure, I didn't trust the objdump blindly, I made sure I tested it on one of my ESP32S3 chips. I ran this memcpy test:
#[repr(align(16))]
pub struct AlignedArray<const N: usize>([u8; N]);

#[inline(never)]
pub unsafe fn aligned_memcpy_test<const N: usize>(dst: &mut AlignedArray<N>, src: &AlignedArray<N>) {
    // Mutable because the post-increment instructions advance both addresses.
    let mut src_addr = src.0.as_ptr();
    let mut dst_addr = dst.0.as_mut_ptr();
    // `is_aligned_to` needs the unstable `pointer_is_aligned_to` feature.
    assert!(src_addr.is_aligned_to(16));
    assert!(dst_addr.is_aligned_to(16));
    assert_eq!(N % 32, 0);
    // Each iteration moves 32 bytes through q0 and q1.
    for _ in 0..N / 32 {
        core::arch::asm!(
            r#"
            ee.vld.128.ip q0, {src_addr}, 16
            ee.vld.128.ip q1, {src_addr}, 16
            ee.vst.128.ip q0, {dst_addr}, 16
            ee.vst.128.ip q1, {dst_addr}, 16
            "#,
            // `inout`: the instructions write the incremented address back to the register.
            src_addr = inout(reg) src_addr,
            dst_addr = inout(reg) dst_addr,
        );
    }
}
src: [0, 1, 2, 3, 4, 5, 6, 7, ...., 0, 1, 2, 3, 4, 5,...] |
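A host-side reference can make this kind of on-chip test easier to judge. The sketch below copies in 16-byte chunks using only stable, portable Rust, so its output can be compared with what the 128-bit load/store version produces. The names `reference_copy` and `pattern` are hypothetical, and the repeating 0, 1, 2, ... fill is an assumption based on the `src` dump above.

```rust
/// 16-byte aligned byte buffer, mirroring the type used in the test.
#[repr(align(16))]
pub struct AlignedArray<const N: usize>(pub [u8; N]);

/// Copy `src` into `dst` in 16-byte chunks. This is a portable stand-in
/// for the 128-bit load/store pairs, intended only as a reference.
pub fn reference_copy<const N: usize>(dst: &mut AlignedArray<N>, src: &AlignedArray<N>) {
    // Stable alignment check: the address must be a multiple of 16.
    assert_eq!(src.0.as_ptr() as usize % 16, 0);
    assert_eq!(dst.0.as_mut_ptr() as usize % 16, 0);
    assert_eq!(N % 32, 0);
    for (d, s) in dst.0.chunks_exact_mut(16).zip(src.0.chunks_exact(16)) {
        d.copy_from_slice(s);
    }
}

/// Build a buffer filled with a repeating 0, 1, 2, ..., 255 byte pattern.
pub fn pattern<const N: usize>() -> AlignedArray<N> {
    let mut a = AlignedArray([0u8; N]);
    for (i, b) in a.0.iter_mut().enumerate() {
        *b = (i % 256) as u8;
    }
    a
}
```

Running both versions on the same `pattern` buffer and comparing the destination arrays byte for byte would catch both wrong immediates and addresses that fail to advance.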
Hi @zRedShift, I was on vacation and wasn't able to look at this earlier. LLVM backend was encoding the |
@sstefan1 Thank you. I suspected something like that but didn't have time to look into it over the last few weeks. Glad it's been resolved. I remember also encountering that it was impossible to use |
Bumping this issue as I was researching
I (404) Memory Copy: Allocating 2 x 100kb in IRAM, alignment: 32 bytes
I (464) Memory Copy: 8-bit for loop copy IRAM->IRAM took 819922 CPU cycles = 28.59 MB/s
I (514) Memory Copy: 16-bit for loop copy IRAM->IRAM took 205776 CPU cycles = 113.90 MB/s
I (564) Memory Copy: 32-bit for loop copy IRAM->IRAM took 103383 CPU cycles = 226.71 MB/s
I (614) Memory Copy: 64-bit for loop copy IRAM->IRAM took 77682 CPU cycles = 301.71 MB/s
I (664) Memory Copy: memcpy IRAM->IRAM took 64323 CPU cycles = 364.37 MB/s
I (714) Memory Copy: async_memcpy IRAM->IRAM took 408520 CPU cycles = 57.37 MB/s
I (764) Memory Copy: PIE 128-bit (16 byte loop) IRAM->IRAM took 19498 CPU cycles = 1202.05 MB/s
I (814) Memory Copy: PIE 128-bit (32 byte loop) IRAM->IRAM took 13095 CPU cycles = 1789.81 MB/s
I (864) Memory Copy: DSP AES3 IRAM->IRAM took 15813 CPU cycles = 1482.17 MB/s
It would be great if we can have this in Rust (or is it already done?) |
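The throughput column in this log can be reproduced from the cycle counts. The sketch below assumes a 240 MHz CPU clock, a 100 KiB (102400 byte) buffer, and that "MB/s" means MiB/s; these values are not stated in the log and were inferred because they reproduce the printed figures. The function names are hypothetical.

```rust
/// Throughput in MiB/s for `bytes` copied in `cycles` CPU cycles
/// at a clock of `cpu_hz` (assumed 240 MHz for the log above).
pub fn throughput_mib_s(bytes: u64, cycles: u64, cpu_hz: u64) -> f64 {
    let seconds = cycles as f64 / cpu_hz as f64;
    bytes as f64 / seconds / (1024.0 * 1024.0)
}

/// How many times faster the candidate is than the baseline,
/// given the cycle counts for the same amount of data.
pub fn speedup(baseline_cycles: u64, candidate_cycles: u64) -> f64 {
    baseline_cycles as f64 / candidate_cycles as f64
}
```

With these assumptions, 13095 cycles works out to roughly 1789.8 MiB/s, and the 32-byte PIE loop is about 4.9 times faster than the `memcpy` row (64323 cycles).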
Just tested this on current |
Hi, I'm attempting to use the SIMD instructions with the latest 1.80 release. Most instructions work as intended, but I've encountered a number of misassemblies, especially on the arithmetic+load/store instructions. As an example:
asm!(
"NOP",
"NOP",
"EE.VADDS.S8.LD.INCP q0, a15, q1, q2, q3",
"NOP",
"NOP",
)
Ends up as
420876c5: 0020f0 nop
420876c8: 0020f0 nop
420876cb: cf .byte 0xcf
420876cc: 0299 s32i.n a9, a2, 0
420876ce: f01c movi.n a0, 31
420876d0: f00020 subx8 a0, a0, a2
420876d3: f00020 subx8 a0, a0, a2
(as disassembled by So far I've observed this for instructions in the |
@Noxime The latest 1.82 toolchain includes LLVM 18, which I believe has more (all?) of these instructions implemented. Please retry and file a new issue if it's still occurring.
memcpy is a weak symbol in compiler builtins, you can override it (we already use the ROM memcpy which might already do this btw) Closing this for now. |
I am currently benchmarking some DSP routines on the ESP32 S3 platform in Rust. Several issues have already arisen, see #180.
Upon reading the 'ESP32-S3 Technical Reference Manual', it became apparent that the S3 platform implements some SIMD as well as DSP operations in hardware, such as EE.FFT.R2BF.S16 or EE.CMUL.S16. I would be willing to invest some time and implement a hardware-accelerated FFT. I am more of a mathematician and do not know the ESP32 architecture well enough, so I need some support. Would it therefore be possible for someone to guide me and show me where I need to start? I looked at the SHA HW acceleration and can understand most things, but not all.
Thanks, Jiri