[agb-fixnum] Implement checked, overflowing, saturating and wrapping operations and a lot more #706

Chiptun3r · 2024-05-23T19:37:00Z

My original plan was just to implement wrapping_add, but the more I worked on it, the more I found things to fix, so here we are.
In order:

I fixed the upcast multiplication for high precision numbers, since it sometimes didn't work if the base type was 32 bits and the fractional part was 16 bits or more (the multiplication of the fractional parts could overflow). I implemented it with a bit of assembly, so we can take advantage of the gba cpu without any more hacks like before.
The num! macro didn't work with high precision numbers, so I fixed that too.
Same thing with the formatter.
FixedWidthUnsignedInteger and FixedWidthSignedInteger were quite a mess (they appear to be mutually exclusive, but the signed one actually derived from the unsigned one) and did not take advantage of the num_traits crate as much as they could, so I removed FixedWidthSignedInteger and renamed FixedWidthUnsignedInteger to FixedWidthInteger, since that is what it was.
Now you can make fixnums with i8 as concrete type too (and i8 and u8 now use only i16 and u16 for the upcast multiplication).
Finally, I added all the checked, overflowing, saturating and wrapping operations I found useful (and that num_traits supports; for example, checked_div is the only division it supports).
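For readers unfamiliar with the four families of operations, here is a minimal sketch of how they differ. The `Fixed` type below is a hypothetical stand-in invented for illustration, not the agb-fixnum `Num` type; it simply forwards to the standard integer methods on the raw value.

```rust
/// Hypothetical fixed-point number with N fractional bits stored in an i32.
/// This is an illustration only, not the agb-fixnum `Num` type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Fixed<const N: u32>(pub i32);

impl<const N: u32> Fixed<N> {
    /// Returns `None` if the raw sum does not fit in an i32.
    pub fn checked_add(self, rhs: Self) -> Option<Self> {
        self.0.checked_add(rhs.0).map(Self)
    }

    /// Returns the wrapped sum together with a flag saying whether it overflowed.
    pub fn overflowing_add(self, rhs: Self) -> (Self, bool) {
        let (raw, overflowed) = self.0.overflowing_add(rhs.0);
        (Self(raw), overflowed)
    }

    /// Clamps the sum to the representable range instead of overflowing.
    pub fn saturating_add(self, rhs: Self) -> Self {
        Self(self.0.saturating_add(rhs.0))
    }

    /// Wraps around on overflow (two's complement).
    pub fn wrapping_add(self, rhs: Self) -> Self {
        Self(self.0.wrapping_add(rhs.0))
    }
}
```

Addition is the easy case because the raw values can be added directly; multiplication needs a wider intermediate, which is where the upcast discussion comes in.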

All the changes are backed by some tests, which I also ran on mgba to be sure the i32/u32 multiplications work there too, since the code is different for arm (btw, is there a better way than just copying the tests as functions and putting them in a rom as I did?).
Hopefully this pull request isn't too big. I tried to split each change into its own commit to make it easier to review.

corwinkuiper · 2024-05-23T22:56:21Z

Thanks so much for your PR. It's quite late right now, so I can't fully review it yet, but I do have some bits to highlight first.

> I implemented it with a bit of assembly, so we can take advantage of the gba cpu without any more hacks like before.

We tried using long multiplication and found it unacceptable at the time (#443). At the time:

* LLVM never inlines different instruction sets; this means that multiplication of constants by constant propagation could not be performed at compile time.
* It benchmarked slower anyway.

> Now you can make fixnums with i8 as concrete type too (and i8 and u8 now use only i16 and u16 for the upcast multiplication).

Given the word size is 32 bits, is this an improvement?

I wrote the following "benchmark". Note that the 1000 repetitions aren't for measurement purposes, since the GBA doesn't need that (no cache, no complex pipeline, no branch prediction, etc.); they are just there to make any fixed overhead of calling the test function negligible.

use core::hint::black_box;

use agb_fixnum::{num, Num};

#[test_case]
fn bench_fixed_num_multiplication_i32(_: &mut crate::Gba) {
    let a: Num<i32, 8> = black_box(num!(1_000.235));
    let b: Num<i32, 8> = black_box(num!(1_000.235));

    for _ in 0..1000 {
        let a = black_box(a);
        let b = black_box(b);
        black_box(a * b);
    }
}

#[test_case]
fn bench_fixed_num_multiplication_u8(_: &mut crate::Gba) {
    let a: Num<u8, 4> = black_box(num!(1.2));
    let b: Num<u8, 4> = black_box(num!(2.7));

    for _ in 0..1000 {
        let a = black_box(a);
        let b = black_box(b);
        black_box(a * b);
    }
}

The results are:

# master, debug
agb::benches::bench_fixed_num_multiplication_i32...[ok: 48038 c ≈ 0 s]
agb::benches::bench_fixed_num_multiplication_u8...[ok: 30027 c ≈ 0 s]

# master, release
agb::benches::bench_fixed_num_multiplication_i32...[ok: 48038 c ≈ 0 s]
agb::benches::bench_fixed_num_multiplication_u8...[ok: 30027 c ≈ 0 s]

# this pr, debug
agb::benches::bench_fixed_num_multiplication_i32...[ok: 111036 c ≈ 0.01 s]
agb::benches::bench_fixed_num_multiplication_u8...[ok: 34027 c ≈ 0 s]

# this pr, release
agb::benches::bench_fixed_num_multiplication_i32...[ok: 103036 c ≈ 0.01 s]
agb::benches::bench_fixed_num_multiplication_u8...[ok: 30027 c ≈ 0 s]

Note that these benchmarks may still be flawed; it is a micro-benchmark after all. They show the existing i32 multiplication to be significantly faster in both debug and release. While I'm not certain, I would explain this by the overhead of calling an arm function from thumb and by the shifting being more expensive. I also see the existing u8 multiplication to be faster in debug and equivalent to this PR in release mode.

From your description, the rest all sounds great. I'll have a look at the code for it in more detail when I get the chance.

Thanks again!

Chiptun3r · 2024-05-23T23:47:51Z

I'll take a better look in the following days, but for now:

>> I implemented it with a bit of assembly, so we can take advantage of the gba cpu without any more hacks like before.

> We tried using long multiplication and found it unacceptable at the time (#443). At the time:
> * LLVM never inlines different instruction sets; this means that multiplication of constants by constant propagation could not be performed at compile time.
> * It benchmarked slower anyway.

That's unfortunate. I thought it would add a bit of overhead for the jump from thumb to arm and vice versa, but still be faster than three MULs (+ stuff). In this case I'd propose keeping the old implementation for fixnums with 15 or fewer bits of precision, and mine for the others, since the old one is not correct for those cases anyway.

>> Now you can make fixnums with i8 as concrete type too (and i8 and u8 now use only i16 and u16 for the upcast multiplication).

> Given the word size is 32 bits, is this an improvement?

I thought it might give the compiler a bit more slack for some magic optimization I cannot even think of. Anyway, I think the difference between debug and release just comes from the new code that checks for overflow and panics in that case, and has nothing to do with the upcast type, but I don't have a strong opinion if you want it back to i32/u32.
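To make the upcast question concrete, here is a minimal sketch of an 8-bit fixed-point multiplication that widens only to u16. The function name `mul_u8` is hypothetical and this is not the code from the PR; it only shows why a 16-bit intermediate is sufficient for 8-bit operands.

```rust
/// Hypothetical sketch: multiply two unsigned 8-bit fixed-point raw values
/// with N fractional bits, using a u16 intermediate. Assumes N < 16.
pub fn mul_u8<const N: u32>(a: u8, b: u8) -> u8 {
    // 255 * 255 = 65025 fits in a u16, so the widened product cannot overflow.
    let wide = u16::from(a) * u16::from(b);
    // Drop N fractional bits and truncate back to 8 bits (wrapping behaviour).
    (wide >> N) as u8
}
```

With 4 fractional bits, the raw values 19 and 43 (roughly 1.2 and 2.7) multiply to raw 51, roughly 3.19. Whether the narrower intermediate is any faster than a 32-bit one on a 32-bit CPU is exactly what the benchmarks above are probing.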

In any case, thanks a lot for the benchmarks, they are very useful.

gwilymk · 2024-05-27T09:59:59Z

@Chiptun3r I've cherry picked the changes from this PR into #711 if you'd like to check that it does what you're looking for :)

Chiptun3r · 2024-05-27T10:25:43Z

> @Chiptun3r I've cherry picked the changes from this PR into #711 if you'd like to check that it does what you're looking for :)

I took a quick look at it and noticed that you're relying on an upcast to 64 bits for multiplications when you need to check whether they overflow, and that is quite suboptimal. From my tests (same setup as @corwinkuiper's, in release mode):

agb_template::bench_fixed_num_multiplication_i32_fast...[ok: 48038 c ≈ 0 s]
agb_template::bench_fixed_num_multiplication_i32_smull...[ok: 103036 c ≈ 0.01 s]
agb_template::bench_fixed_num_multiplication_i32_upcast...[ok: 169036 c ≈ 0.01 s]
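For context, an overflow check based on a 64-bit upcast can be sketched as follows. The name `checked_mul_upcast` is hypothetical and this is not the code from #711; it only illustrates the general approach being measured in the `_upcast` row.

```rust
/// Hypothetical sketch: checked multiplication of two unsigned fixed-point
/// raw values with N fractional bits, using a 64-bit intermediate.
/// Assumes N < 64.
pub fn checked_mul_upcast<const N: u32>(a: u32, b: u32) -> Option<u32> {
    // The full 32x32 product always fits in 64 bits.
    let wide = (u64::from(a) * u64::from(b)) >> N;
    // The result overflowed if it no longer fits in the 32-bit base type.
    if wide <= u64::from(u32::MAX) {
        Some(wide as u32)
    } else {
        None
    }
}
```

On a target without a native 64-bit multiply in the current instruction set, the compiler typically has to call a library routine for the wide product, which would be consistent with the higher cycle count above.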

Also, the fast solution is wrong with high precision numbers: if the multiplication of the fractional parts overflows, the result is simply incorrect rather than the wrapping behavior you would expect.
Right now I have implemented a double solution (fast for fewer than 16 bits of precision, umull/smull otherwise). I'm just trying to figure out how to check whether an overflow happened with the fast implementation (and then add some other tests for good measure) before pushing the changes.
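The failure mode can be reproduced with a small sketch. Both functions below are hypothetical reconstructions written for illustration, not the agb-fixnum code: `fast_mul` splits each operand into integer and fractional parts and stays in 32 bits, while `wide_mul` uses a 64-bit product as the reference.

```rust
/// Hypothetical 32-bit-only multiplication of raw fixed-point values with
/// N fractional bits (assumes N < 32). Only the fractional product can go wrong.
pub fn fast_mul<const N: u32>(a: u32, b: u32) -> u32 {
    let mask = (1u32 << N) - 1;
    let (a_int, a_frac) = (a >> N, a & mask);
    let (b_int, b_frac) = (b >> N, b & mask);

    (a_int.wrapping_mul(b_int) << N)
        .wrapping_add(a_int.wrapping_mul(b_frac))
        .wrapping_add(b_int.wrapping_mul(a_frac))
        // For N > 16 this product can exceed 32 bits, and the bits lost
        // here are ones that should have survived the shift.
        .wrapping_add(a_frac.wrapping_mul(b_frac) >> N)
}

/// Reference: full 64-bit product, then shift and truncate (wrapping result).
pub fn wide_mul<const N: u32>(a: u32, b: u32) -> u32 {
    ((u64::from(a) * u64::from(b)) >> N) as u32
}
```

With 18 fractional bits, 0.75 is raw 196608; `wide_mul::<18>(196608, 196608)` gives 147456 (0.5625), whereas `fast_mul::<18>` gives 0 because the fractional product 9 * 2^32 wraps to zero. With 16 fractional bits the two agree, since each fractional part is below 2^16 and their product fits in 32 bits.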

Chiptun3r · 2024-05-27T21:11:44Z

Now it uses the fast path if the precision is <= 16 (in that case the multiplication between the fractional parts can't overflow, so the result is always the expected, wrapping, one).
Tests:

#[test_case]
fn bench_fixed_num_multiplication_i32(_: &mut agb::Gba) {
    let a: Num<i32, 8> = black_box(num!(100.235));
    let b: Num<i32, 8> = black_box(num!(10.235));

    for _ in 0..1000 {
        let a = black_box(a);
        let b = black_box(b);
        black_box(a * b);
    }
}

#[test_case]
fn bench_fixed_num_multiplication_u32_high_precision(_: &mut agb::Gba) {
    let a: Num<u32, 18> = black_box(num!(100.235));
    let b: Num<u32, 18> = black_box(num!(10.235));

    for _ in 0..1000 {
        let a = black_box(a);
        let b = black_box(b);
        black_box(a * b);
    }
}

Results (debug):

agb_template::bench_fixed_num_multiplication_i32...[ok: 279113 c ≈ 0.02 s]
agb_template::bench_fixed_num_multiplication_i32_high_precision...[ok: 111065 c ≈ 0.01 s]

Results (release):

agb_template::bench_fixed_num_multiplication_i32...[ok: 48049 c ≈ 0 s]
agb_template::bench_fixed_num_multiplication_i32_high_precision...[ok: 103047 c ≈ 0.01 s]

Sadly the debug build is much slower now (even though the dev profile has opt-level = 3 by default in the template repo); if someone wants to try to speed it up, they're welcome.
I also noticed that the division is broken too for high precision numbers (I guess it needs upcasts too), but that's for another time.
@gwilymk I haven't pushed any of your latest commits from #711, you can do it after this one or I can add them here if you prefer.

gwilymk · 2024-06-04T19:53:26Z

Thanks for taking this further.

I'm happy to do the later commits and PR them separately once this is merged.

My current worry is that the asm block will stop constant multiplication optimisations e.g. multiplying by 2 gets turned into a shift. At least in our games, we rely on that a lot.

Ideally we'd want to use something like https://doc.rust-lang.org/nightly/std/intrinsics/fn.is_val_statically_known.html but I don't think that'll ever get stabilised.

I've experimented in the past with defining the abi method __aeabi_lmul (and similar) in agb itself but didn't have masses of success. But maybe with your changes here it'll work a bit better? That way (in theory) rust will keep optimisations and change things to a shift etc but call the lmul and ulmul instructions when it actually needs to do a 32x32 -> 64 bit multiply. But we'd need to confirm it actually makes the call when it does so.
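As a rough illustration of what is at stake, a multiplication written in plain Rust can be a `const fn`, so the compiler can evaluate it at compile time when both operands are constants; an `asm!` block cannot be used that way. The `wide_mul` function and the constants below are hypothetical names for this sketch, not part of agb-fixnum.

```rust
/// Hypothetical sketch: fixed-point multiplication of raw values with
/// N fractional bits, written in plain Rust so it is usable in const contexts.
pub const fn wide_mul<const N: u32>(a: u32, b: u32) -> u32 {
    (((a as u64) * (b as u64)) >> N) as u32
}

/// 0.5 with 18 fractional bits.
pub const HALF: u32 = 1 << 17;

/// Evaluated entirely at compile time: 0.5 * 0.5 = 0.25.
pub const QUARTER: u32 = wide_mul::<18>(HALF, HALF);
```

Multiplying by the constant 2.0 (raw `2 << 18`) gives the same result as shifting the raw value left by one, as long as nothing overflows; that is the kind of rewrite an optimiser can make when it can see through the multiplication.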

Chiptun3r · 2024-06-05T10:36:19Z

I checked on godbolt and it doesn't do any kind of optimization, even with O3 enabled. The problem is not only the asm! macro, but also the switch to arm32 instructions, which disrupts any optimization (at least currently on llvm). So, even with a nice wide_mul intrinsic, we couldn't get any optimization out of it (unless perhaps the intrinsic were universal for both arm and thumb and the compiler were smart enough). Right now I would suggest using the shift operator instead of a multiplication for your case (if you use more than 16 bits of precision), but indeed it doesn't optimize away the multiplication between two constants either, and that's not good.


Chiptun3r force-pushed the fixnum_revamped branch 3 times, most recently from da59045 to a13537e Compare May 23, 2024 22:05

gwilymk mentioned this pull request May 27, 2024: Fixnum chiptu3r improvements cherry picked #711 (closed)

Chiptun3r force-pushed the fixnum_revamped branch from 5605cfc to 15f7ce3 Compare May 27, 2024 21:02

Chiptun3r closed this Jun 5, 2024

Chiptun3r reopened this Jun 5, 2024

Chiptun3r added 11 commits August 28, 2024 15:12

* Fix upcast multiplication and use native arm long multiplication when possible (182e72b)
* Fix num!() for high precision numbers (d108fef)
* Simplify FixedWidthUnsignedInteger name and trait bounds (43069bd)
* Fix print for high precision numbers (dbd5558)
* Small clean up (7e3193d)
* Reduce unneeded upcast size for u8 and add support to i8 (2d151af)
* Remove FixedWidthSignedInteger (dfa50dd)
* Implement the various checked, overflowing, saturating and wrapping operations (e03b518)
* Make the CI happy (7df43a7)
* Make the "n" argument in upcast_multiply* a const generic (7e0d58f)
* Add fast multiplication for low precision 32bits fixnums (5116d37)

gwilymk force-pushed the fixnum_revamped branch from 15f7ce3 to 5116d37 Compare August 28, 2024 14:12
