Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[agb-fixnum] Implement checked, overflowing, saturating and wrapping operations and a lot more #706

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

Chiptun3r
Copy link

My original plan was just to implement wrapping_add, but the more I worked on it, the more I found things to fix, so here we are.
In order:

  • I fixed the upcast multiplication for high precision numbers, since it didn't work sometime if the base type was 32 bit and the decimal part was 16 or more (since the multiplication of the fract part could overflow). I implemented it with a bit of assembly, so we can take advantage of the gba cpu without any more hacks like before.
  • The num! macro didn't work with high precision numbers, so I fixed that too.
  • Same thing with the formatter.
  • FixedWidthUnsignedInteger and FixedWidthSignedInteger were quite a mess (they appears as mutually exclusive, bu the signed one actually derived from the unsigned one) and did not took advantage of the num_traits crate as much as they could, so I removed FixedWidthSignedInteger and renamed FixedWidthUnsignedInteger into FixedWidthInteger, since it is what it was.
  • Now you can make fixnums with i8 as concrete type too (and i8 and u8 now uses only a i16 and u16 for the upcast multiplication).
  • Finally add all the checked, overflowing, saturating and wrapping operations I found useful (and that num_traits supports, because, for example, checked_div is the only division it supports).

All the changes are backed by some tests, that I also run on mgba to be sure the i32/u32 multiplications work on there too, since the code is different for arm (btw, is there a better way then just copy the tests as functions and put them in a rom as I did?).
Hopefully this pull request isn't too big, I tried to split all the changes in their own commit to make it easier to review.

@Chiptun3r Chiptun3r force-pushed the fixnum_revamped branch 3 times, most recently from da59045 to a13537e Compare May 23, 2024 22:05
@corwinkuiper
Copy link
Collaborator

Thanks so much for your PR, it's quite late right now, so I can't fully review it yet. I do have some bits to highlight first.

I implemented it with a bit of assembly, so we can take advantage of the gba cpu without any more hacks like before.

We tried using long multiplication and found it unacceptable at the time (#443). At the time:

  • LLVM never inlines different instruction sets, this means that multiplication of constants by constant propagation could not be performed at compile time.
  • It benchmarked slower anyway.

Now you can make fixnums with i8 as concrete type too (and i8 and u8 now uses only a i16 and u16 for the upcast multiplication).

Given the word size is 32 bits, is this an improvement?


I wrote the following "benchmark", note that the 1000 repetition isn't for measurement purposes, the GBA doesn't need that (no cache, no complex pipeline, no branch prediction, etc.), it's just to make any fixed overhead of calling the test function negligible.

use core::hint::black_box;

use agb_fixnum::{num, Num};

#[test_case]
fn bench_fixed_num_multiplication_i32(_: &mut crate::Gba) {
    let a: Num<i32, 8> = black_box(num!(1_000.235));
    let b: Num<i32, 8> = black_box(num!(1_000.235));

    for _ in 0..1000 {
        let a = black_box(a);
        let b = black_box(b);
        black_box(a * b);
    }
}

#[test_case]
fn bench_fixed_num_multiplication_u8(_: &mut crate::Gba) {
    let a: Num<u8, 4> = black_box(num!(1.2));
    let b: Num<u8, 4> = black_box(num!(2.7));

    for _ in 0..1000 {
        let a = black_box(a);
        let b = black_box(b);
        black_box(a * b);
    }
}

the results of which are

# master, debug
agb::benches::bench_fixed_num_multiplication_i32...[ok: 48038 c ≈ 0 s]
agb::benches::bench_fixed_num_multiplication_u8...[ok: 30027 c ≈ 0 s]

# master, release
agb::benches::bench_fixed_num_multiplication_i32...[ok: 48038 c ≈ 0 s]
agb::benches::bench_fixed_num_multiplication_u8...[ok: 30027 c ≈ 0 s]

# this pr, debug
agb::benches::bench_fixed_num_multiplication_i32...[ok: 111036 c ≈ 0.01 s]
agb::benches::bench_fixed_num_multiplication_u8...[ok: 34027 c ≈ 0 s]

# this pr, release
agb::benches::bench_fixed_num_multiplication_i32...[ok: 103036 c ≈ 0.01 s]
agb::benches::bench_fixed_num_multiplication_u8...[ok: 30027 c ≈ 0 s]

Note that these benchmarks may still be flawed, it is a micro-benchmark after all. They show the existing i32 multiplication to be significantly faster in both debug and release. While I'm not certain, I explain this by the overhead of calling an arm function from thumb and the shifting is more expensive. I also see that the existing u8 multiplication to be faster in debug and equivalent with this PR in release mode.


From your description, the rest all sounds great. I'll have a look at the code for it in more detail when I get the chance.

Thanks again!

@Chiptun3r
Copy link
Author

I'll take a better look in the following days, but for now:

I implemented it with a bit of assembly, so we can take advantage of the gba cpu without any more hacks like before.

We tried using long multiplication and found it unacceptable at the time (#443). At the time:

* LLVM never inlines different instruction sets, this means that multiplication of constants by constant propagation could not be performed at compile time.

* It benchmarked slower anyway.

That's unfortunate, I thought it would have add a bit of overhead for the jump from thumb to arm and vice versa, but still be faster than three MUL (+ stuff). In this case I'd propose to keep the old implementation for fixnum with 15 or less bits of precision, and mine for the others, since the old one is not correct for those cases anyway.

Now you can make fixnums with i8 as concrete type too (and i8 and u8 now uses only a i16 and u16 for the upcast multiplication).

Given the word size is 32 bits, is this an improvement?
I thought it would may have given the compiler a bit more slack for some magic optimization I cannot even thing of. Anyway, I think the difference between debug and release just come from the new code that checks for overflow and panics in that case and has nothing to do with the upcast type, but I don't have a strong opinion if you want it back to i32/u32.

In any case, thanks a lot for the benchmarks, they are very useful.

@gwilymk
Copy link
Contributor

gwilymk commented May 27, 2024

@Chiptun3r I've cherry picked the changes from this PR into #711 if you'd like to check that it does what you're looking for :)

@Chiptun3r
Copy link
Author

@Chiptun3r I've cherry picked the changes from this PR into #711 if you'd like to check that it does what you're looking for :)

I gave a quick look at it and I noticed that you're relying on upcast to 64 bits for multiplications when you need to check if they are overflowing, and that is quite suboptimal. From my tests (same setup as @corwinkuiper's, in release mode):

agb_template::bench_fixed_num_multiplication_i32_fast...[ok: 48038 c ≈ 0 s]
agb_template::bench_fixed_num_multiplication_i32_smull...[ok: 103036 c ≈ 0.01 s]
agb_template::bench_fixed_num_multiplication_i32_upcast...[ok: 169036 c ≈ 0.01 s]

Also, the fast solution is wrong with high precision numbers, since, if the multiplication of the fract parts overflow the result is just incorrect and not the wrapping behavior you would expect.
Right now I have implemented a double solution (fast for less then 16 bits of precision, umull/smull otherwise), I'm just trying to figure out how to check if a overflow happened with the fast implementation (and then add some other tests for good measure) before pushing the changes.

@Chiptun3r
Copy link
Author

Now it uses the fast path if the precision is <= 16 (in that case the multiplication between the fract parts can't overflow in any case, so the result is always the expected, wrapping, one).
Tests:

#[test_case]
fn bench_fixed_num_multiplication_i32(_: &mut agb::Gba) {
    let a: Num<i32, 8> = black_box(num!(100.235));
    let b: Num<i32, 8> = black_box(num!(10.235));

    for _ in 0..1000 {
        let a = black_box(a);
        let b = black_box(b);
        black_box(a * b);
    }
}

#[test_case]
fn bench_fixed_num_multiplication_u32_high_precision(_: &mut agb::Gba) {
    let a: Num<u32, 18> = black_box(num!(100.235));
    let b: Num<u32, 18> = black_box(num!(10.235));

    for _ in 0..1000 {
        let a = black_box(a);
        let b = black_box(b);
        black_box(a * b);
    }
}

Results (debug):

agb_template::bench_fixed_num_multiplication_i32...[ok: 279113 c ≈ 0.02 s]
agb_template::bench_fixed_num_multiplication_i32_high_precision...[ok: 111065 c ≈ 0.01 s]

Results (release):

agb_template::bench_fixed_num_multiplication_i32...[ok: 48049 c ≈ 0 s]
agb_template::bench_fixed_num_multiplication_i32_high_precision...[ok: 103047 c ≈ 0.01 s]

Sadly the debug build is much slower now (even tho the dev profile has opt-level = 3 as default in the template repo), if someone wants to try to speed it up they're welcome.
I also noticed that the division is broken too for high precision numbers (I guess it needs upcasts too), but that's for another time.
@gwilymk I haven't pushed any of your latest commits from #711, you can do it after this one or I can add them here if you prefer.

@gwilymk
Copy link
Contributor

gwilymk commented Jun 4, 2024

Thanks for taking this further.

I'm happy to do the later commits and PR them separately once this is merged.

My current worry is that the asm block will stop constant multiplication optimisations e.g. multiplying by 2 gets turned into a shift. At least in our games, we rely on that a lot.

Ideally we'd want to use something like https://doc.rust-lang.org/nightly/std/intrinsics/fn.is_val_statically_known.html but I don't think that'll ever get stabilised.

I've experimented in the past with defining the abi method __aeabi_lmul (and similar) in agb itself but didn't have masses of success. But maybe with your changes here it'll work a bit better? That way (in theory) rust will keep optimisations and change things to a shift etc but call the lmul and ulmul instructions when it actually needs to do a 32x32 -> 64 bit multiply. But we'd need to confirm it actually makes the call when it does so.

@Chiptun3r
Copy link
Author

I checked on godbolt and it doesn't do any kind of optimization even with O3 optimizations enabled. The problem is not only the asm! macro, but also the switch to arm32 instructions that disrupt any optimization (at least currently on llvm). So, even with a nice wide_mul intrinsic, we couldn't get any optimization out of it (maybe if the intrinsic was universal for both arm and thumb and the compiler was smart enough). Right now I would suggest to use the shift operator instead of a multiplication for your case (if you use more than 16 bits of precision), but it doesn't indeed optimize way neither the case of the multiplication between two constants, and that's not good

@Chiptun3r Chiptun3r closed this Jun 5, 2024
@Chiptun3r Chiptun3r reopened this Jun 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants