FFT HW Acceleration #190

ramtej · 2023-06-22T09:44:39Z

I am currently benchmarking some DSP routines on the ESP32 S3 platform in Rust. Several issues have already arisen, see #180.

Upon reading the 'ESP32-S3 Technical Reference Manual', it became apparent that the S3 platform implements some SIMD as well as DSP operations in hardware, such as EE.FFT.R2BF.S16 or EE.CMUL.S16. I would be willing to invest some time and implement a hardware-accelerated FFT. I am more of a mathematician and do not know the ESP32 architecture well enough, so I need some support.

Would it therefore be possible for someone to guide me and show me where I need to start? I looked at the SHA HW acceleration and can understand most things, but not all.

Thanks, Jiri

ramtej · 2023-06-22T10:16:17Z

It seems to me that the ESP-DSP "C" examples are merely ASM-optimized and do not use any of the DSP instructions. Is that the case?

See https://github.com/espressif/esp-dsp/blob/71514173b58b960173b40c4ade9d15d372770a74/modules/fft/float/dsps_fft2r_fc32_ae32_.S#L60

igrr · 2023-06-22T10:46:26Z

Since these extra instructions are all integer instructions, only fixed-point FFT uses them: https://github.com/espressif/esp-dsp/blob/71514173b58b960173b40c4ade9d15d372770a74/modules/fft/fixed/dsps_fft2r_sc16_aes3.S

ramtej · 2023-06-22T11:46:47Z

Yes, that makes sense. This means that the DSP primitives have already been spotted in the wild and not only described in the above paper.

My approach now would be to fork or extend e.g. https://gitlab.com/teskje/microfft-rs so that, similar to the SHA HW acceleration, it uses the DSP primitives. Specifically, I would want to replace the radix-2 butterfly computation with the DSP HW functions.
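For orientation, here is a minimal software sketch of the kind of radix-2 butterfly that would be replaced. It is a generic Q15 fixed-point butterfly with a halving step to limit overflow. The type `Cplx16` and the functions `cmul_q15` and `butterfly` are hypothetical names for illustration. It is not taken from microfft-rs and does not claim to match the exact semantics of EE.FFT.R2BF.S16 or EE.CMUL.S16.

```rust
/// Complex number with 16-bit fixed-point (Q15) components.
/// Hypothetical type for illustration only.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Cplx16 {
    re: i16,
    im: i16,
}

/// Q15 complex multiply using 32-bit intermediates.
/// Saturation is left out of this sketch, so extreme inputs can wrap.
fn cmul_q15(a: Cplx16, b: Cplx16) -> Cplx16 {
    let re = (a.re as i32 * b.re as i32 - a.im as i32 * b.im as i32) >> 15;
    let im = (a.re as i32 * b.im as i32 + a.im as i32 * b.re as i32) >> 15;
    Cplx16 { re: re as i16, im: im as i16 }
}

/// Radix-2 butterfly: returns ((a + w*b) / 2, (a - w*b) / 2).
/// The division by two is a common scaling choice in fixed-point FFTs.
fn butterfly(a: Cplx16, b: Cplx16, w: Cplx16) -> (Cplx16, Cplx16) {
    let t = cmul_q15(b, w);
    let top = Cplx16 {
        re: ((a.re as i32 + t.re as i32) >> 1) as i16,
        im: ((a.im as i32 + t.im as i32) >> 1) as i16,
    };
    let bottom = Cplx16 {
        re: ((a.re as i32 - t.re as i32) >> 1) as i16,
        im: ((a.im as i32 - t.im as i32) >> 1) as i16,
    };
    (top, bottom)
}
```

With the twiddle factor w = -j (`Cplx16 { re: 0, im: -32768 }`) and a = b = 1000 + 0j, `butterfly` returns (500 - 500j, 500 + 500j).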

ramtej · 2023-06-23T21:11:36Z

Another (maybe stupid) idea could be to use the inline assembly asm! macro, especially because the S3 target supports some nice SIMD operations besides the DSP ones.

The following compiles and even runs on the Xtensa platform:

    use core::arch::asm;

    std::hint::black_box(unsafe {
        asm!("nop");
    });

Now there is a lot between the above code and the universe that finally runs my code, especially LLVM. I have dug into the esp Rust branch a bit, and it looks like at least something has been done in that direction:

pub enum InlineAsmArch {
    X86,
    X86_64,
    Arm,
    // ..
    Xtensa,
    // ..
}

(rust/compiler/rustc_target/src/asm/mod.rs, line 201 in ed3726b)

Is that enough? I had naively just used a random DSP mnemonic - the error message is "mnemonic unknown".

Does anyone have an idea on the topic?

Thanks,
Jiri

zRedShift · 2023-07-06T17:37:18Z

@ramtej AFAIK, LLVM simply doesn't support/recognize the hundreds of DSP instructions, so using them in inline assembly would require either adding them all or emitting the escaped binary opcode (which also means specifying registers by number).

I wish support could be added, but it's a non-trivial amount of search and replace in the llvm-project codebase.

https://github.com/esp-rs/xtensa-lx-rt/blob/39256baa9ff78950f502262fdbd4bce77bb31e76/src/exception/assembly_lx6.rs#L424

@MabezDev showed me this "trick" a good while ago

ramtej · 2023-07-09T08:08:26Z

@zRedShift Is the ".byte 0x00, 0x30, 0x00" sequence an inline assembly instruction specified as a sequence of bytes, serving as a workaround when the desired assembly instruction isn't supported by the LLVM? Indeed, this is a neat trick; I'll give it a try.
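As a rough illustration of the idea, the sketch below turns raw instruction bytes into a `.byte` directive string that could be pasted into an inline assembly template. The helper name `byte_directive` is hypothetical, and the sketch does not encode any Xtensa instruction itself: the bytes have to come from the reference manual or an existing assembler.

```rust
/// Format raw instruction bytes as a GNU-assembler `.byte` directive.
/// Hypothetical helper for illustration; it performs no instruction encoding.
fn byte_directive(bytes: &[u8]) -> String {
    let list: Vec<String> = bytes
        .iter()
        .map(|b| format!("0x{:02x}", b)) // two hex digits per byte
        .collect();
    format!(".byte {}", list.join(", "))
}
```

Calling `byte_directive(&[0x00, 0x30, 0x00])` yields the string `.byte 0x00, 0x30, 0x00`, the sequence quoted above.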

Thanks!

ramtej · 2023-07-09T12:44:02Z

There is a nice crate that does the '.byte' encoding for the RISC-V V extension instructions - https://github.com/cryptape/rvv-encoder. It would be exciting to have something similar for the Xtensa extension instructions.

unsafe {
    xtensa_asm::asm!(
         ..
        "ee.cmul.s16	q3,q2,q1,3",
        ..
    );
}

zRedShift · 2023-07-09T12:49:38Z

@ramtej awesome find. This will also help with ESP32P4's custom RISC V DSP extensions, when it comes out.

I haven't investigated it yet (I was actually planning on just writing .S files and compiling/linking them), but I wonder whether there is support for the rur.*/wsr.* etc. instructions in LLVM (chapter 1.6.10 in the ESP32-S3 technical reference, Processor Control Instructions). They are necessary to manipulate special registers that control things like the FFT width in fixed-point mode.

ramtej · 2023-07-09T13:11:45Z

Yes, probably the easiest thing for now will be to just link in the .S files. I think with the rur.*/wsr.* it will be similar to the other instructions. I need (i)FFT for my application and therefore I try to get the maximum out of the S3. The DSP benchmarks are promising, but I need the functions on Rust level. Maybe it makes sense to develop an esp-dsp-rs crate for the current Xtensa and the future RISC V DSP extensions.

zRedShift · 2023-07-16T16:23:57Z

I've been warming up with some Xtensa LLVM backend contributions over the last few days. My end goal is fast (scalar or vector) DSP for the ESP32-S3 in Rust, since I'm working on real-time audio processing/encoding and I need to maximize performance to cram as much processing as possible into the pipeline. This is obviously DCT/FFT/FIR etc. heavy, among other things.

So a few more scalar instruction PRs, and then I will move on to adding the 128-bit registers and instructions so that inline assembly can be supported directly. It's been a dream of mine for about a year to do this. At the time, the ESP32-S3 technical manual still didn't include the extensions, and reverse engineering bfd/xtensa-modules.c was a major pain; by the time they released the instructions in the reference, priorities had shifted. Now is the time to do the work.

Maybe even add auto-vectorization support in the (far) future, but it's a daunting task since the instructions are pipelined, the cost tables need to be populated, and there's a huge amount of user registers that control the runtime behavior of the instructions.

igrr · 2023-07-16T16:38:51Z

Just in case, cc @sstefan1 who is planning to merge the initial support for ESP32-S3 DSP instructions into Espressif's LLVM fork soon.

zRedShift · 2023-07-16T20:47:38Z

I don't want to step on anyone's toes: if @sstefan1 has already started work on this, I won't pursue it, unless I can somehow assist?

sstefan1 · 2023-07-17T06:23:30Z

Currently we have all ESP32-S3 DSP instructions implemented in LLVM. All instructions are available in clang through clang's builtins, which translate to llvm intrinsics and then to appropriate instructions.

For example:
__builtin_xtensa_ee_vld_128_ip(1, data, 0); --> ee.vld.128.ip q1, a9, 0

This work should be merged soon.

I need to investigate how that should be done in rust, though. If anybody already knows, please let me know.

zRedShift · 2023-07-17T10:13:24Z

@sstefan1 Well, we don't currently have any xtensa intrinsics support, but it would go to stdarch/core_arch here. It lives out of tree, so will need to be forked by the esp-rs org and the .gitmodules at esp-rs/rust should point to it. Then they can be added just like in clang. This can be done separately, whenever, and not a blocker for initial support.

For inline assembly support, it's much simpler since most of the base work has already been done by @MabezDev here
All that needs to be done is add the qregs support, and the user regs (FFT_WIDTH, QACC_H_0, etc.).

If you can get a branch/PR running on espressif/llvm-project, I can start work on initial support/testing on Rust for this. I'm already working on those files, adding rust support for the clamps/minmax features based on this PR.

ramtej · 2023-07-19T13:03:29Z

Ok, it looks to me like there is enough incentive and brainpower to tackle the ESP+Rust+DSP challenge. How are we going to coordinate this? Who does what?

zRedShift · 2023-07-20T15:28:53Z

@ramtej as soon as the esp32s3 changes land on espressif/llvm-project or one of its branches, I'll start working on the PR for esp-rs/rust.

sstefan1 · 2023-07-21T12:42:41Z

I've merged the initial support for ESP32S3 DSP instructions in llvm. Just to keep in mind, builtins support is not yet very well tested. I will be doing testing in the following weeks.

sstefan1 · 2023-07-21T13:41:38Z

One more note, llvm-objdump currently doesn't work correctly with DSP instructions. To check generated assembly, it is best to use llc.

zRedShift · 2023-07-22T04:00:22Z

Here's my branch with experimental support of this in Rust.
I ran into some issues/funky business (with the immediate addressing constant in ee.vld.128.ip, which I think is an issue with llvm, since the constant is correct in the generated llvm ir, I'll investigate later), but the core of it works.

zRedShift · 2023-07-22T13:02:51Z

@sstefan1

declare void @llvm.xtensa.ee.vld.128.ip(i32, i32, i32) nounwind
define void @test2(i32 %p){
    tail call void @llvm.xtensa.ee.vld.128.ip(i32 5, i32 %p, i32 16)
    ret void
}

If I generate assembly with llc I get the correct assembly:

// llc -O1 -mtriple=xtensa -mcpu=esp32s3 < xtensa-s3-ee-vld-128-ip.ll
        .text
        .file   "<stdin>"
        .global test2                           # -- Begin function test2
        .p2align        2
        .type   test2,@function
test2:                                  # @test2
        .cfi_startproc
# %bb.0:
        entry   a1, 32
        .cfi_def_cfa_offset 32
        ee.vld.128.ip    q5, a2, 16
        retw.n
.Lfunc_end0:
        .size   test2, .Lfunc_end0-test2
        .cfi_endproc
                                        # -- End function
        .section        ".note.GNU-stack","",@progbits

But if I generate an object file with llc and run the GCC objdump (xtensa-esp32s3-elf-objdump), I get an issue:

// llc -O1 -mtriple=xtensa -mcpu=esp32s3 -filetype=obj < xtensa-s3-ee-vld-128-ip.ll > test.o
// xtensa-esp32s3-elf-objdump -D test.o
test.o:     file format elf32-xtensa-le


Disassembly of section .text:

00000000 <test2>:
   0:   004136          entry   a1, 32
   3:   a39024          ee.vld.128.ip   q5, a2, 0x100
   6:   f01d            retw.n

We get 0x100 instead of 0x10, and the same happens with all other values; only 0 is unaffected.

sstefan1 · 2023-07-23T08:16:06Z

I will have to look into it. BTW, I'm not sure if xtensa-esp32s3-elf-objdump disassembles DSP instructions correctly either. I had some problems while testing. I will check the llvm as well, but just mentioning I had issues with disassembling too.

zRedShift · 2023-07-23T13:30:58Z

Sure, I didn't trust the objdump blindly; I made sure to test it on one of my ESP32-S3 chips. I ran this memcpy test:

#[repr(align(16))]
pub struct AlignedArray<const N: usize>([u8; N]);

#[inline(never)]
pub unsafe fn aligned_memcpy_test<const N: usize>(dst: &mut AlignedArray<N>, src: &AlignedArray<N>) {
    // The post-incrementing loads/stores modify the address registers, so the
    // pointers are passed as `inout` operands and carried across iterations.
    let mut src_addr = src.0.as_ptr();
    let mut dst_addr = dst.0.as_mut_ptr();
    assert!(src_addr.is_aligned_to(16));
    assert!(dst_addr.is_aligned_to(16));
    assert_eq!(N % 32, 0);
    for _ in 0..N / 32 {
        core::arch::asm!(
            r#"
                ee.vld.128.ip    q0,  {src_addr},  16
                ee.vld.128.ip    q1,  {src_addr},  16
                ee.vst.128.ip    q0,  {dst_addr},  16
                ee.vst.128.ip    q1,  {dst_addr},  16
            "#,
            src_addr = inout(reg) src_addr,
            dst_addr = inout(reg) dst_addr,
        );
    }
}

src: [0, 1, 2, 3, 4, 5, 6, 7, ...., 0, 1, 2, 3, 4, 5,...]
dst: [0, 1, 2, 3, ..., 14, 15, 0, 0, 0 ..., 0, 1, 2, 3, ... , 14, 15, ..., 0, 0, 0, ...]
So the loop is jumping with an offset 320 instead of 32, just like the objdump.

sstefan1 · 2023-08-07T09:11:56Z

Hi @zRedShift, I was on vacation and wasn't able to look at this earlier. The LLVM backend was encoding the imm16 offset as the actual immediate value, but it should encode the offset as a multiple of 16. So for 32 it should encode 0x02, for 48 it should encode 0x03, and so on. I will post a fix internally and we should have it on the GitHub repo soon.
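The scaling described here can be sketched in a few lines. The function names `encode_offset_field` and `decode_offset_field` are hypothetical, and range checking of the field is deliberately left out; this only illustrates the arithmetic, not the actual LLVM fix.

```rust
/// The offset field counts 16-byte steps rather than bytes.
const STEP: i32 = 16;

/// Convert a byte offset into the value stored in the instruction's offset field.
/// Returns `None` when the offset is not a multiple of 16.
/// Range checking of the field is omitted in this sketch.
fn encode_offset_field(byte_offset: i32) -> Option<i32> {
    if byte_offset % STEP == 0 {
        Some(byte_offset / STEP)
    } else {
        None
    }
}

/// Byte offset that a stored field value stands for.
fn decode_offset_field(field: i32) -> i32 {
    field * STEP
}
```

Storing the raw value 16 in the field, as the buggy encoder did, decodes as `decode_offset_field(16) == 256`, i.e. 0x100, which matches the objdump output earlier in the thread. An offset of 0 is unaffected either way.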

zRedShift · 2023-08-07T09:30:45Z

@sstefan1 Thank you. I suspected something like that but didn't have time to look into it over the last few weeks. Glad it's been resolved.

I remember also encountering that it was impossible to use loop/loopnez/loopgtz in the inline assembly, but since it's not related to the DSP instructions, and hardware loops are already planned to be fixed/included in the future, I didn't investigate it further.

ProfFan · 2024-07-20T03:13:17Z

Bumping this issue as I was researching memcpy on ESP32-S3. It turns out that aligned memcpy with EE.VLD instructions can be roughly 5 times faster than regular memcpy (and about 6 times faster than a 64-bit copy loop), quoting https://github.com/project-x51/esp32-s3-memorycopy:

I (404) Memory Copy: Allocating 2 x 100kb in IRAM, alignment: 32 bytes
I (464) Memory Copy: 8-bit for loop copy IRAM->IRAM took 819922 CPU cycles = 28.59 MB/s
I (514) Memory Copy: 16-bit for loop copy IRAM->IRAM took 205776 CPU cycles = 113.90 MB/s
I (564) Memory Copy: 32-bit for loop copy IRAM->IRAM took 103383 CPU cycles = 226.71 MB/s
I (614) Memory Copy: 64-bit for loop copy IRAM->IRAM took 77682 CPU cycles = 301.71 MB/s
I (664) Memory Copy: memcpy IRAM->IRAM took 64323 CPU cycles = 364.37 MB/s
I (714) Memory Copy: async_memcpy IRAM->IRAM took 408520 CPU cycles = 57.37 MB/s
I (764) Memory Copy: PIE 128-bit (16 byte loop) IRAM->IRAM took 19498 CPU cycles = 1202.05 MB/s
I (814) Memory Copy: PIE 128-bit (32 byte loop) IRAM->IRAM took 13095 CPU cycles = 1789.81 MB/s
I (864) Memory Copy: DSP AES3 IRAM->IRAM took 15813 CPU cycles = 1482.17 MB/s
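The throughput figures in this log can be reproduced from the cycle counts. The sketch below assumes a 240 MHz CPU clock and a buffer of 100 KiB, with "MB" meaning 2^20 bytes. None of this is stated in the log, but these assumptions are consistent with the printed numbers. The names `CPU_HZ`, `BUFFER_BYTES`, `throughput_mib_s` and `speedup` are hypothetical.

```rust
/// Assumed CPU clock in Hz (not stated in the log; chosen to match its figures).
const CPU_HZ: f64 = 240_000_000.0;
/// Assumed buffer size: 100 KiB.
const BUFFER_BYTES: f64 = 100.0 * 1024.0;

/// Throughput in MiB/s for one copy of the buffer taking `cycles` CPU cycles.
fn throughput_mib_s(cycles: u64) -> f64 {
    let seconds = cycles as f64 / CPU_HZ;
    BUFFER_BYTES / seconds / (1024.0 * 1024.0)
}

/// How many times faster the `fast` variant is than the `baseline`.
fn speedup(baseline_cycles: u64, fast_cycles: u64) -> f64 {
    baseline_cycles as f64 / fast_cycles as f64
}
```

Under these assumptions, `throughput_mib_s(64323)` gives about 364.37 for memcpy, `speedup(64323, 13095)` gives about 4.9 for the 32-byte PIE loop over memcpy, and `speedup(77682, 13095)` gives about 5.9 over the 64-bit loop.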

It would be great if we can have this in Rust (or is it already done?)

ProfFan · 2024-07-20T03:25:13Z

Just tested @zRedShift's aligned_memcpy_test (from the 2023-07-23 comment above) on current esp-rs/rust and it seems to work well.

Noxime · 2024-08-30T19:09:33Z

Hi, I'm attempting to use the SIMD instructions with the latest 1.80 release. Most instructions work as intended, but I've encountered a number of misassemblies, especially on the arithmetic+load/store instructions. As an example:

asm!(
        "NOP",
        "NOP",
        "EE.VADDS.S8.LD.INCP q0, a15, q1, q2, q3",
        "NOP",
        "NOP",
)

Ends up as

420876c5:	0020f0               	nop
420876c8:	0020f0               	nop
420876cb:	cf                      	.byte	0xcf
420876cc:	0299                	s32i.n	a9, a2, 0
420876ce:	f01c                	movi.n	a0, 31
420876d0:	f00020               	subx8	a0, a0, a2
420876d3:	f00020               	subx8	a0, a0, a2

(as disassembled by xtensa-esp-elf-objdump). As you can see, the instruction bytes are quite wrong, and executing does lead to IllegalInstruction exceptions.

So far I've observed this for instructions in the *.LD/ST.INCP group at least, but I would not be surprised if more were broken.

MabezDev · 2024-10-28T14:51:05Z

@Noxime The latest 1.82 toolchain includes LLVM 18, which I believe has more (all?) of these instructions implemented. Please retry and file a new issue if it's still occurring.

@ProfFan It would be great if we can have this in Rust (or is it already done?)

memcpy is a weak symbol in compiler-builtins, so you can override it (we already use the ROM memcpy, which might already do this, btw).

Closing this for now.

jessebraham added the enhancement New feature or request label Jul 27, 2023

MabezDev transferred this issue from esp-rs/esp-hal Aug 24, 2023

kassane mentioned this issue Mar 1, 2024

LLVM ERROR: floatFromInt cast kassane/zig-espressif-bootstrap#2

Closed

kassane mentioned this issue Apr 3, 2024

SIMD: ESP32-S3 kassane/zig-esp-idf-sample#8

Open

MabezDev closed this as completed Oct 28, 2024
