
FFT HW Acceleration #190

Closed
ramtej opened this issue Jun 22, 2023 · 28 comments
Labels
enhancement New feature or request

Comments

@ramtej

ramtej commented Jun 22, 2023

I am currently benchmarking some DSP routines on the ESP32 S3 platform in Rust. Several issues have already arisen, see #180.

Upon reading the 'ESP32-S3 Technical Reference Manual', it became apparent that the S3 platform implements some SIMD as well as DSP operations in hardware, such as EE.FFT.R2BF.S16 or EE.CMUL.S16. I would be willing to invest some time and implement a hardware-accelerated FFT. I am more of a mathematician and do not know the ESP32 architecture well enough, so I need some support.

Would it therefore be possible for someone to guide me and show me where I need to start? I looked at the SHA HW acceleration and can understand most things, but not all.

Thanks, Jiri
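For context, EE.FFT.R2BF.S16 accelerates the radix-2 butterfly that dominates FFT inner loops. A plain-Rust Q15 fixed-point sketch of that butterfly (illustrative only; the exact rounding/saturation behaviour of the hardware instruction is defined by the TRM, and this function is not part of any existing crate):

```rust
/// One radix-2 decimation-in-time butterfly on Q15 complex samples:
/// (a, b) -> (a + w*b, a - w*b), where w is the twiddle factor.
/// Scalar reference sketch only, not the hardware semantics.
fn butterfly_q15(a: (i16, i16), b: (i16, i16), w: (i16, i16)) -> ((i16, i16), (i16, i16)) {
    // Complex multiply w * b in Q15, rescaled by >> 15.
    let t_re = ((w.0 as i32 * b.0 as i32 - w.1 as i32 * b.1 as i32) >> 15) as i16;
    let t_im = ((w.0 as i32 * b.1 as i32 + w.1 as i32 * b.0 as i32) >> 15) as i16;
    (
        (a.0.wrapping_add(t_re), a.1.wrapping_add(t_im)),
        (a.0.wrapping_sub(t_re), a.1.wrapping_sub(t_im)),
    )
}
```

Replacing the inner loop of a pure-Rust FFT with a hardware version of this operation is the kind of substitution at stake here.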

@ramtej
Author

ramtej commented Jun 22, 2023

To me, it seems that the ESP-DSP "C" examples are merely ASM-optimized and do not use any DSP instructions!? Is that the case?

See https://github.com/espressif/esp-dsp/blob/71514173b58b960173b40c4ade9d15d372770a74/modules/fft/float/dsps_fft2r_fc32_ae32_.S#L60

@igrr

igrr commented Jun 22, 2023

Since these extra instructions are all integer instructions, only fixed-point FFT uses them: https://github.com/espressif/esp-dsp/blob/71514173b58b960173b40c4ade9d15d372770a74/modules/fft/fixed/dsps_fft2r_sc16_aes3.S

@ramtej
Author

ramtej commented Jun 22, 2023

Yes, that makes sense. This means that the DSP primitives have already been spotted in the wild, and are not only described in the reference manual above.

My approach now would be to fork or extend e.g. https://gitlab.com/teskje/microfft-rs so that, similar to the SHA HW acceleration, it uses the DSP primitives. Specifically, I would want to replace the radix-2 butterfly computation with the DSP HW instructions.

@ramtej
Author

ramtej commented Jun 23, 2023

Another (maybe stupid) idea could be to use the inline assembler asm!() macro, especially because the S3 target supports, besides the DSP instructions, some nice SIMD operations.

The following compiles and even runs on the Xtensa platform:

    std::hint::black_box(unsafe {
        asm!("nop");
    });

Now there is a lot between the above code and what finally runs my code, especially LLVM. I have dug into the esp Rust branch a bit, and it looks like at least something was done in that direction:

pub enum InlineAsmArch {
    X86,
    X86_64,
    Arm,
  ..
    Xtensa,
  ..
}

Is that enough? I had naively just used a random DSP mnemonic - the error message is "mnemonic unknown".

Does anyone have an idea on the topic?

Thanks,
Jiri

@zRedShift

zRedShift commented Jul 6, 2023

@ramtej AFAIK, LLVM plain old doesn't support/recognize the hundreds of DSP instructions, so to use inline assembly it would require either adding them all, or using the escaped binary opcode (but that also means specifying registers by number)

I wish support could be added, but it's a non-trivial amount of search and replace in the llvm-project codebase.

https://github.com/esp-rs/xtensa-lx-rt/blob/39256baa9ff78950f502262fdbd4bce77bb31e76/src/exception/assembly_lx6.rs#L424

@MabezDev showed me this "trick" a good while ago

@ramtej
Author

ramtej commented Jul 9, 2023

@zRedShift Is the ".byte 0x00, 0x30, 0x00" sequence an inline assembly instruction specified as a sequence of bytes, serving as a workaround when the desired assembly instruction isn't supported by the LLVM? Indeed, this is a neat trick; I'll give it a try.

Thanks!
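If it helps, the mechanism looks roughly like this. The function name is made up for illustration, and the three bytes are the sequence quoted above from xtensa-lx-rt, not a DSP instruction; the encoding of the instruction you actually need would have to come from the ISA description or a GNU-as listing:

```rust
// Sketch of the `.byte` escape hatch: emit a raw opcode when LLVM's
// integrated assembler doesn't know the mnemonic. Substitute the
// byte encoding of the DSP instruction you actually need.
#[cfg(target_arch = "xtensa")]
unsafe fn emit_raw_opcode() {
    core::arch::asm!(".byte 0x00, 0x30, 0x00");
}
```

The obvious downside is that register operands are baked into the encoded bytes, so the compiler cannot allocate registers for you.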

@ramtej
Author

ramtej commented Jul 9, 2023

There is a nice crate that does the '.byte' encoding for the RISC-V V extension instructions - https://github.com/cryptape/rvv-encoder. It would be exciting to have something similar for the Xtensa extension instructions.

unsafe {
    xtensa_asm::asm!(
         ..
        "ee.cmul.s16	q3,q2,q1,3",
        ..
    );
}

@zRedShift

@ramtej awesome find. This will also help with the ESP32-P4's custom RISC-V DSP extensions, when it comes out.

I haven't investigated it yet (I was actually planning on just writing .S files and compiling/linking them), but I wonder: is there support for the rur.*/wsr.* etc. instructions in LLVM (chapter 1.6.10 in the ESP32-S3 technical reference, Processor Control Instructions)? They are necessary to manipulate the special registers that control things like FFT width in fixed-point mode.

@ramtej
Author

ramtej commented Jul 9, 2023

Yes, probably the easiest thing for now will be to just link in the .S files. I think with the rur.*/wsr.* it will be similar to the other instructions. I need an (i)FFT for my application, and therefore I am trying to get the maximum out of the S3. The DSP benchmarks are promising, but I need the functions at the Rust level. Maybe it makes sense to develop an esp-dsp-rs crate for the current Xtensa and the future RISC-V DSP extensions.
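One way to link .S files into a Cargo project is a build script using the cc crate. A sketch, assuming a hypothetical src/fft_aes3.S and the Xtensa cross-toolchain set up by cargo/espup (the file name and library name here are placeholders):

```rust
// build.rs — assemble and link a hand-written assembly file.
// `src/fft_aes3.S` is a hypothetical file name; the `cc` crate picks
// up the cross assembler from the target/environment cargo provides.
fn main() {
    cc::Build::new()
        .file("src/fft_aes3.S")
        .compile("dsp_asm"); // produces libdsp_asm.a and links it
    println!("cargo:rerun-if-changed=src/fft_aes3.S");
}
```

The assembly routines are then declared on the Rust side with an `extern "C"` block and called through a safe wrapper.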

@zRedShift

I've been warming up with some Xtensa LLVM backend contributions over the last few days. My end goal is fast (scalar or vector) DSP for the ESP32-S3 in Rust, since I'm working on real-time audio processing/encoding and need to maximize performance to cram as much processing into the pipeline as possible. This is obviously DCT/FFT/FIR etc. heavy, among other things.

So a few more scalar instruction PRs, and I will move on to adding the 128-bit registers and instructions, so that the inline assembly can be supported directly. It's been a dream of mine for about a year; at the time, the ESP32-S3 technical manual still didn't include the extensions, and reverse-engineering bfd/xtensa-modules.c was a major pain. By the time they released the instructions in the reference, priorities had shifted. Now is the time to do the work.

Maybe even add auto-vectorization support in the (far) future, but it's a daunting task, since the instructions are pipelined, the cost tables need to be populated, and there's a huge number of user registers that control the runtime behavior of the instructions.

@igrr

igrr commented Jul 16, 2023

Just in case, cc @sstefan1 who is planning to merge the initial support for ESP32-S3 DSP instructions into Espressif's LLVM fork soon.

@zRedShift

Don't want to step on anyone's toes, if @sstefan1 has already started work on this, I won't pursue, unless I can somehow assist?

@sstefan1

Currently we have all ESP32-S3 DSP instructions implemented in LLVM. All instructions are available in clang through clang's builtins, which translate to llvm intrinsics and then to appropriate instructions.

For example:
__builtin_xtensa_ee_vld_128_ip(1, data, 0); --> ee.vld.128.ip q1, a9, 0

This work should be merged soon.

I need to investigate how that should be done in rust, though. If anybody already knows, please let me know.

@zRedShift

@sstefan1 Well, we don't currently have any Xtensa intrinsics support, but it would go to stdarch/core_arch here. It lives out of tree, so it will need to be forked by the esp-rs org, and the .gitmodules at esp-rs/rust should point to it. Then they can be added just like in clang. This can be done separately, whenever, and it is not a blocker for initial support.

For inline assembly support, it's much simpler since most of the base work has already been done by @MabezDev here
All that needs to be done is add the qregs support, and the user regs (FFT_WIDTH, QACC_H_0, etc.).

If you can get a branch/PR running on espressif/llvm-project, I can start work on initial support/testing on Rust for this. I'm already working on those files, adding rust support for the clamps/minmax features based on this PR.

@ramtej
Author

ramtej commented Jul 19, 2023

Ok, it looks to me like there is enough incentive and brainpower to tackle the ESP+Rust+DSP challenge. How are we going to coordinate this? Who does what?

@zRedShift

@ramtej as soon as the esp32s3 changes land on espressif/llvm-project or one of its branches, I'll start working on the PR for esp-rs/rust.

@sstefan1

I've merged the initial support for ESP32S3 DSP instructions in llvm. Just to keep in mind, builtins support is not yet very well tested. I will be doing testing in the following weeks.

@sstefan1

One more note, llvm-objdump currently doesn't work correctly with DSP instructions. To check generated assembly, it is best to use llc.

@zRedShift

Here's my branch with experimental support of this in Rust.
I ran into some issues/funky business (with the immediate addressing constant in ee.vld.128.ip, which I think is an issue with LLVM, since the constant is correct in the generated LLVM IR; I'll investigate later), but the core of it works.

@zRedShift

@sstefan1

declare void @llvm.xtensa.ee.vld.128.ip(i32, i32, i32) nounwind
define void @test2(i32 %p){
    tail call void @llvm.xtensa.ee.vld.128.ip(i32 5, i32 %p, i32 16)
    ret void
}

If I generate assembly with llc I get the correct assembly:

// llc -O1 -mtriple=xtensa -mcpu=esp32s3 < xtensa-s3-ee-vld-128-ip.ll                                                                                                                                                                                                                    
        .text
        .file   "<stdin>"
        .global test2                           # -- Begin function test2
        .p2align        2
        .type   test2,@function
test2:                                  # @test2
        .cfi_startproc
# %bb.0:
        entry   a1, 32
        .cfi_def_cfa_offset 32
        ee.vld.128.ip    q5, a2, 16
        retw.n
.Lfunc_end0:
        .size   test2, .Lfunc_end0-test2
        .cfi_endproc
                                        # -- End function
        .section        ".note.GNU-stack","",@progbits

But if I generate an object file with llc and run GCC objdump (xtensa-esp32s3-elf-objdump), I get an issue:

// llc -O1 -mtriple=xtensa -mcpu=esp32s3 -filetype=obj < xtensa-s3-ee-vld-128-ip.ll > test.o
// xtensa-esp32s3-elf-objdump -D test.o
test.o:     file format elf32-xtensa-le


Disassembly of section .text:

00000000 <test2>:
   0:   004136          entry   a1, 32
   3:   a39024          ee.vld.128.ip   q5, a2, 0x100
   6:   f01d            retw.n

We get 0x100 instead of 0x10, and this happens for all other values; only 0 is unaffected.

@sstefan1

I will have to look into it. BTW, I'm not sure xtensa-esp32s3-elf-objdump disassembles DSP instructions correctly either; I had some problems while testing. I will check LLVM as well, but I'm just mentioning that I had issues with disassembling too.

@zRedShift

Sure, I didn't trust the objdump blindly; I made sure I tested it on one of my ESP32-S3 chips. I ran this memcpy test:

#[repr(align(16))]
pub struct AlignedArray<const N: usize>([u8; N]);

#[inline(never)]
pub unsafe fn aligned_memcpy_test<const N: usize>(dst: &mut AlignedArray<N>, src: &AlignedArray<N>) {
    let src_addr = src.0.as_ptr();
    let dst_addr = dst.0.as_mut_ptr();
    assert!(src_addr.is_aligned_to(16));
    assert!(dst_addr.is_aligned_to(16));
    assert_eq!(N % 32, 0);
    for _ in 0..N / 32 {
        core::arch::asm!(
            r#"
                ee.vld.128.ip    q0,  {src_addr},  16
                ee.vld.128.ip    q1,  {src_addr},  16
                ee.vst.128.ip    q0,  {dst_addr},  16
                ee.vst.128.ip    q1,  {dst_addr},  16
            "#,
            src_addr = in(reg) src_addr,
            dst_addr = in(reg) dst_addr,
        );
    }
}

src: [0, 1, 2, 3, 4, 5, 6, 7, ...., 0, 1, 2, 3, 4, 5,...]
dst: [0, 1, 2, 3, ..., 14, 15, 0, 0, 0 ..., 0, 1, 2, 3, ... , 14, 15, ..., 0, 0, 0, ...]
So the loop is jumping with an offset 320 instead of 32, just like the objdump.
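In plain Rust, the behaviour the asm loop above is meant to have — copying 32 bytes per iteration as two 16-byte vectors — can be modelled as (a safe reference sketch, not the vectorized implementation):

```rust
/// Reference model of the intended copy: two 16-byte chunks per
/// iteration, mirroring the pair of ee.vld/ee.vst instructions.
fn chunked_copy(dst: &mut [u8], src: &[u8]) {
    assert_eq!(dst.len(), src.len());
    assert_eq!(src.len() % 32, 0);
    for (d, s) in dst.chunks_exact_mut(32).zip(src.chunks_exact(32)) {
        d[..16].copy_from_slice(&s[..16]); // first "q0" vector
        d[16..].copy_from_slice(&s[16..]); // second "q1" vector
    }
}
```

The observed corruption (only the first 16 bytes of each 256-byte stride copied) matches the pointer advancing by 256 instead of 16 after each load/store.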

@jessebraham jessebraham added the enhancement New feature or request label Jul 27, 2023
@sstefan1

sstefan1 commented Aug 7, 2023

> Sure, I didn't trust the objdump blindly; I made sure I tested it on one of my ESP32-S3 chips. [memcpy test quoted in full above] So the loop is jumping with an offset 320 instead of 32, just like the objdump.

Hi @zRedShift, I was on vacation and wasn't able to look at this earlier. The LLVM backend was encoding the imm16 offset as the actual immediate value, but it should actually encode the multiple of 16: for 32 it should encode 0x02, for 48 it should encode 0x03, and so on. I will post a fix internally, and we should have it on the GitHub repo soon.
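The fix described above boils down to a scaled immediate field. A tiny sketch of the intended encode/decode relationship (field placement and widths are illustrative, not the actual instruction format):

```rust
/// The imm16 field of ee.vld.128.ip stores the byte offset divided
/// by 16, not the raw byte offset.
fn encode_imm16(byte_offset: i32) -> i32 {
    debug_assert!(byte_offset % 16 == 0);
    byte_offset / 16 // 16 -> 0x01, 32 -> 0x02, 48 -> 0x03, ...
}

/// The hardware multiplies the field back up when executing.
fn decode_imm16(field: i32) -> i32 {
    field * 16
}
```

Encoding the raw value instead (16 as 0x10) is exactly what produced the observed 16x stride: the hardware decoded 0x10 as 16 * 16 = 256 = 0x100.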

@zRedShift

@sstefan1 Thank you. I suspected something like that but didn't have time to look into it over the last few weeks. Glad it's been resolved.

I remember also encountering that it was impossible to use loop/loopnez/loopgtz in the inline assembly, but since it's not related to the DSP instructions, and hardware loops are already planned to be fixed/included in the future, I didn't investigate it further.

@ProfFan

ProfFan commented Jul 20, 2024

Bumping this issue as I was researching memcpy on the ESP32-S3. It turns out that aligned memcpy with EE.VLD instructions can be roughly five times faster than regular memcpy, quoting https://github.com/project-x51/esp32-s3-memorycopy:

I (404) Memory Copy: Allocating 2 x 100kb in IRAM, alignment: 32 bytes
I (464) Memory Copy: 8-bit for loop copy IRAM->IRAM took 819922 CPU cycles = 28.59 MB/s
I (514) Memory Copy: 16-bit for loop copy IRAM->IRAM took 205776 CPU cycles = 113.90 MB/s
I (564) Memory Copy: 32-bit for loop copy IRAM->IRAM took 103383 CPU cycles = 226.71 MB/s
I (614) Memory Copy: 64-bit for loop copy IRAM->IRAM took 77682 CPU cycles = 301.71 MB/s
I (664) Memory Copy: memcpy IRAM->IRAM took 64323 CPU cycles = 364.37 MB/s
I (714) Memory Copy: async_memcpy IRAM->IRAM took 408520 CPU cycles = 57.37 MB/s
I (764) Memory Copy: PIE 128-bit (16 byte loop) IRAM->IRAM took 19498 CPU cycles = 1202.05 MB/s
I (814) Memory Copy: PIE 128-bit (32 byte loop) IRAM->IRAM took 13095 CPU cycles = 1789.81 MB/s
I (864) Memory Copy: DSP AES3 IRAM->IRAM took 15813 CPU cycles = 1482.17 MB/s

It would be great if we could have this in Rust (or is it already done?).
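As a sanity check on the log above, the MB/s figures follow directly from the cycle counts, assuming a 240 MHz CPU clock and a 100 KiB copy (both inferred from the log, not stated in it; MB here is 2^20 bytes):

```rust
/// Convert a cycle count for one 100 KiB copy into MB/s, assuming a
/// 240 MHz clock. Assumption-laden helper for checking the benchmark
/// log, not part of any crate.
fn mb_per_s(cycles: u64) -> f64 {
    const BYTES: f64 = 100.0 * 1024.0;
    const CLOCK_HZ: f64 = 240_000_000.0;
    BYTES * CLOCK_HZ / cycles as f64 / (1024.0 * 1024.0)
}
```

With these assumptions, 819922 cycles comes out at about 28.59 MB/s and 13095 cycles at about 1789.81 MB/s, matching the log, which supports the ~5x gap between the PIE 128-bit loop and regular memcpy.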

@ProfFan

ProfFan commented Jul 20, 2024

> Sure, I didn't trust the objdump blindly; I made sure I tested it on one of my ESP32-S3 chips. [memcpy test quoted in full above] So the loop is jumping with an offset 320 instead of 32, just like the objdump.

Just tested this on current esp-rs/rust and it seems to work well.

@Noxime

Noxime commented Aug 30, 2024

Hi, I'm attempting to use the SIMD instructions with the latest 1.80 release. Most instructions work as intended, but I've encountered a number of misassemblies, especially in the arithmetic+load/store instructions. As an example:

asm!(
        "NOP",
        "NOP",
        "EE.VADDS.S8.LD.INCP q0, a15, q1, q2, q3",
        "NOP",
        "NOP",
)

Ends up as

420876c5:	0020f0               	nop
420876c8:	0020f0               	nop
420876cb:	cf                      	.byte	0xcf
420876cc:	0299                	s32i.n	a9, a2, 0
420876ce:	f01c                	movi.n	a0, 31
420876d0:	f00020               	subx8	a0, a0, a2
420876d3:	f00020               	subx8	a0, a0, a2

(as disassembled by xtensa-esp-elf-objdump). As you can see, the instruction bytes are quite wrong, and executing does lead to IllegalInstruction exceptions.

So far I've observed this for instructions in the *.LD/ST.INCP group at least, but I would not be surprised if more were broken.

@MabezDev
Member

@Noxime The latest 1.82 toolchain includes LLVM 18, which I believe has more (all?) of these instructions implemented - please retry and file a new issue if it's still occurring.

@ProfFan It would be great if we can have this in Rust (or is it already done?)

memcpy is a weak symbol in compiler-builtins, so you can override it (we already use the ROM memcpy, which might already do this, BTW).
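A minimal sketch of such an override for a no_std target binary (illustrative only; the byte-wise volatile loop is there so LLVM doesn't recognize the loop as a memcpy idiom and emit a recursive call, and a real implementation would add a 16-byte-aligned vector fast path):

```rust
use core::ptr;

/// Sketch: override the weak `memcpy` symbol from compiler-builtins.
/// The signature must match the C ABI exactly.
#[no_mangle]
pub unsafe extern "C" fn memcpy(dst: *mut u8, src: *const u8, n: usize) -> *mut u8 {
    // Volatile per-byte copy prevents LLVM from turning this loop
    // back into a `memcpy` call, which would recurse infinitely.
    for i in 0..n {
        ptr::write_volatile(dst.add(i), ptr::read_volatile(src.add(i)));
    }
    dst // C memcpy returns the destination pointer
}
```

An aligned ee.vld/ee.vst fast path would go before the byte loop, falling back to bytes for unaligned heads/tails.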

Closing this for now.
