Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SIMD is broken (Int32x4 and Float32x4) #53662

Open
maxim-saplin opened this issue Sep 30, 2023 · 6 comments
Open

SIMD is broken (Int32x4 and Float32x4) #53662

maxim-saplin opened this issue Sep 30, 2023 · 6 comments
Labels
area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends. P3 A lower priority bug or feature request triaged Issue has been triaged by sub team type-bug Incorrect behavior (everything from a crash to more subtle misbehavior)

Comments

@maxim-saplin
Copy link

maxim-saplin commented Sep 30, 2023

STEPS:

  1. Copy the file: https://github.com/maxim-saplin/mandelbrot/blob/main/_optimized/mandelbrot.dart
  2. RUN it in VM dart mandelbrot.dart and remember the results (execution time and check sum)
  3. Build it with AoT dart compile exe mandelbrot.dart, run and remember the result

EXPECTED:

  • VM and AoT versions have similar performance (i.e. +/- 10%)
  • VM and AoT have correct check sum (+/- 1% from 78513425)

ACTUAL:

  • AoT mode is ~40x slower on both Apple Silicon and Intel
  • AoT with Intel has wrong sum 87667671
// M1 Pro, sum 78513692
// dart mandelbrot.dart - Avg: 93.4ms, StdDev: 1.6119%
// dart compile exe mandelbrot.dart - Avg: 4038.5ms, StdDev: 0.6437%
// Intel
// dart mandelbrot.dart - !! sum 87667671 Avg: 162.9ms, StdDev: 7.5598%
// dart compile exe mandelbrot.dart - sum 78513692, Avg: 8806.0ms, StdDev: 4.4871%

ENVIRONMENTS:

  • macOS 13.6, Dart SDK version: 3.1.2, MacBook Pro with M1 Pro CPU
  • Ubuntu 22.04.3 LTS, 64 bit, Dart SDK version: 3.1.0, Intel Core i5-8257U @ 1.4GHz x 2, VMWare Workstation Player 17.0.1
@maxim-saplin maxim-saplin changed the title SIMD is broken SIMD is broken (Int32x4 and Float32x4) Sep 30, 2023
@lrhn
Copy link
Member

lrhn commented Sep 30, 2023

I notice that the first three rounds have a different checksum on JIT too, suggesting that the code is susceptible to differences in rounding of intermediate values.

That's a little worrisome, since the code appears like it should be deterministic. Every operation is a SIMD 32-bit floating point addition or multiplication, or a comparison.

Getting different results anyway suggests either not using the same SIMD operations before and after JIT optimization, having different rounding behavior, not using SIMD before optimizing at all, or just something being bugged.

I get the same slowdown on Windows. For JIT run, I get:

0         Execution Time: 126                       87666736
1         Execution Time: 125                       87662921
2         Execution Time: 126                       87661033
3         Execution Time: 119                       87667671
...

It's stable from there.

When compiled with dart compile exe, I get:

0         Execution Time: 7556                       78513692
...

stable all the way through, compiled wioth

(Dart SDK version: 3.2.0-202.0.dev (dev) (Tue Sep 26 21:06:42 2023 -0700) on "windows_x64")

That's a nice 64 times slowdown, and getting the same sum as you.
So AOT is predictable and deterministic, and very slow.

@maxim-saplin
Copy link
Author

maxim-saplin commented Sep 30, 2023

Here're a SO thread, a few guys checked assembly and told it is horrible (check comments): https://stackoverflow.com/questions/77201568/dart-simd-extensions-int32x4-float32x4-going-crazy-slow-in-aot-different-re

Regarding the the correctness of the results, what is not OK is the result on Intel, it is way to far from what I would expect (10%). JIC, I've tested same problem with 8 different languages (22 configs) they all differ slightly in precision and check sum: https://github.com/maxim-saplin/mandelbrot/tree/main

And performance in AoT is another problem, relevant for both Intel and Apple Silicon.

@lrhn
Copy link
Member

lrhn commented Sep 30, 2023

The difference in sum is around ~1M, which is on the order of one iteration per pixel. That is a very big difference, bigger than what (I think) can reasonably be explained by rounding errors.
Every iteration does a squaring, which should quench low-bit noise rather than amplify it. (If you multiply two 24 bit integers, then keep the top 24 bits of the result, that should be less affected by low-bit changes than addition.)

Since the result changes when the JIT optimizes, it seems the value problem could be the optimization.
And indeed, if I run without optimization, I get the expected result. Very slowly.

>dart --optimization_counter_threshold=-1 mandelo.dart
0         Execution Time: 9639                       78513692
1         Execution Time: 9654                       78513692
...

And ... the bug seems to be in greaterThan. If I change the comparison line to do it in the other direction:

        Int32x4 breakCondition = sum.lessThan(escapeThreshold); // (escapeThreshold).greaterThan(sum);

then the result becomes 78513692 on JIT as well.

It seems to be NaN-related. If I look at the produced values, it turns out that some of the iterations end up with a value of NaN in the comparison, also in the "working" versions.

It seems zzx and zzy both overflow to infinity, likely because of continuing to iterate on one coordinate even after it leaves the 2.0-circle, as long as another coordinate is still inside. Then zzx - zzy becomes NaN at that coordinate.
That propagates into sum in the next iteration, and (4.0...).greaterThan(NaN...) should give false (0), but in optimized code it gives true (-1), in the compare mask.

The swapped lessThan gives the correct result (false).

(Our Float32x4.greaterThan method is undocumented. Let's assume the same as the corresponding JavaScript method, which returns Bool32x4. After trying to read that spec, I'm fairly sure that comparing to NaN should give false. We don't give Bool32x4, instead we give a Int32x4 usable as a bit-mask, so true is all-ones and false is all-zeros. So the result of that greaterThan should be zero, and it is all-ones in optimized code.)

@lrhn lrhn added area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends. type-bug Incorrect behavior (everything from a crash to more subtle misbehavior) labels Sep 30, 2023
@a-siva
Copy link
Contributor

a-siva commented Oct 2, 2023

//cc @rmacnak-google @alexmarkov

copybara-service bot pushed a commit that referenced this issue Oct 3, 2023
…give the expected result for NaNs.

TEST=lib/typed_data/float32x4_compare_test
Bug: #53662
Change-Id: Ia5e1e5088fde84c60d30e0a1d50d1d2d3b50f2f0
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/328768
Reviewed-by: Alexander Markov <alexmarkov@google.com>
Commit-Queue: Ryan Macnak <rmacnak@google.com>
@a-siva a-siva added the P3 A lower priority bug or feature request label Oct 12, 2023
@a-siva
Copy link
Contributor

a-siva commented Oct 12, 2023

The correctness issue has been fixed, the performance issue is still open.
#31487 is tracking the need for const constructors so Flutter code could use SIMD.

@a-siva a-siva added the triaged Issue has been triaged by sub team label Oct 12, 2023
@gyrdym
Copy link

gyrdym commented Jun 1, 2024

Hi everyone,

Are there any updates on this issue?

I'm experiencing the same problem: my library, which heavily utilizes SIMD, is extremely slow in AoT mode.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends. P3 A lower priority bug or feature request triaged Issue has been triaged by sub team type-bug Incorrect behavior (everything from a crash to more subtle misbehavior)
Projects
None yet
Development

No branches or pull requests

4 participants