Path finding broken on Raspberry Pi #2088

MarcelHB · 2024-05-18T10:26:05Z

Bug description

I have found some mysterious bug, and only when running GemRB on a RPi.

Essentially almost all movement attempts fail on all the (IWD2) maps I've been trying, just leaving this log line as a clue:

gemrb/gemrb/core/Scriptable/Scriptable.cpp

Line 2226 in dbda9db

    
           Log(DEBUG, "WalkTo", "{} re-pathing ignoring actors", fmt::WideToChar{actor->GetShortName()});

Yet the 2nd attempt won't work, too.

It's bad at master and 43647ce already and I'll try a bisection but that takes a little time since there were also compiler errors around.

GemRB version (check as many as you know apply)

master as of this issue
0.9.2
0.9.1
0.9.0

Video Driver (check as many as you know apply)

SDL1.2
SDL2 built with OPENGL_BACKEND enabled
SDL2 without OPENGL_BACKEND enabled

The text was updated successfully, but these errors were encountered:

MarcelHB · 2024-05-18T11:05:14Z

Appears that 43647ce has introduced the problem.

MarcelHB · 2024-05-18T12:23:08Z

Simple fix proposal that works, since everything else looked tedious to check also w. r. t. compiler settings and compilation times on RPi:

Make NormalizeDeltas a template function.
Replace every occurrence of float_t in PathFinder.* by double.

lynxlynxlynx · 2024-05-18T14:41:54Z

Why 1 though? Everyone passes in the same type ...

MarcelHB · 2024-05-18T15:07:42Z

I'm seeing float_t-users in Map.cpp and Scriptable.cpp.

bradallred · 2024-05-18T15:13:17Z

I'm having difficulty conceiving NormalizeDeltas producing a meaningful difference between float and double. If true tho, you should be able to repro on another system by changing it to float and deducing why.

MarcelHB · 2024-05-18T15:48:56Z

When I discovered that, I wrote a test on Linux on both x86_64 and RPi-ARM that report sizeof(float_t) and FLT_EVAL_METHOD with 4, 0, with -O3 -ffast-math GCC v12/13. I assume that both platforms go float here but I can't reproduce this outside of the RPi.

Can try to write a test case later to look for differences between float and double on both platforms.

MarcelHB · 2024-05-18T18:56:57Z

What did I do: rotate vector (2, 2) by 1° up to 360 times and then put x, y every time into NormalizeDeltas. One calculation is done over float types, one over double. I found two cases at 15° and 75° where float and double form a different pair between ARM and x86_64.

Compiled over: -O3 -ffast-math -frounding-math

So yes, there are cases where the downgrade from double to float give us a different output on different platforms here.

bradallred · 2024-05-18T19:05:27Z

It might help if you elaborated a bit. Have you ruled out -ffast-math and -frounding-math? It would still be nice to know the specifics so we can consider implications beyond NormalizeDeltas.

MarcelHB · 2024-05-18T19:25:17Z

-ffast-math -frounding-math: ARM and x86_64 show different results in the float calculation
-frounding-math: ARM and x86_64 show the same differences, and on each platform there is the same one difference in the float calculation compared to -ffast-math
no add. flags, just -O3: suprisingly, still a diff between them
-O0: still different output

Let's try being explicit about the rounding mode, I pick FE_DOWNWARD: Whoop, one of the two cases has disappeared, one line of diff is left. 🤷

MarcelHB · 2024-05-18T19:58:35Z

std::hypot generates some diff noise when using it, in the less significant parts of course but I better don't try to analyze the numerical impact here now.

bradallred · 2024-05-18T20:15:21Z

these type of comparisons seems suspicious dx == 0.0. I believe thats the sort of thing thats not going to be well defined with fast-math and also that 0.0 is a double, not a float.

MarcelHB · 2024-05-18T20:49:16Z

Looking at godbolt.org assembly output for such a check, that does not seem to make any difference (using -ffast-math or not).

bradallred · 2024-05-18T21:17:16Z

it looks like there is a bug here: dy = STEP_RADIUS;. If I'm not mistaken, that should be dy = STEP_RADIUS * 0.75;, no?

MarcelHB · 2024-05-18T21:30:36Z

I guess so.

bradallred · 2024-05-18T22:02:52Z

std::hypot generates some diff noise when using it, in the less significant parts of course but I better don't try to analyze the numerical impact here now.

If the problem is std::hypot, we could use an approximation replacement like alpha beta.

MarcelHB · 2024-05-18T23:58:14Z

Sounds somewhat counter-intuitive to fix a problem of small errors with even bigger errors?

Also, not sure if NormalizeDeltas is even the root cause here, given that the numbers we put in there shouldn't be that large to suddenly result in the blue and make the algorithm fail that largely. It's just an easy method to quickly validate whether there are platform discrepancies in the type downgrade between the two archs.

bradallred · 2024-05-19T00:08:00Z

Sounds somewhat counter-intuitive to fix a problem of small errors with even bigger errors?

I don't know why you're insisting the problem is one of precision. just because the problem "goes away" when you use a double means nothing. The math shouldn't require that, all we are doing is scaling a velocity along a vector. A better performing implementation for something on a hot path is always welcome too, provided it is accurate enough and I don't see why it wouldn't be.

lynxlynxlynx · 2024-05-19T06:57:11Z

The problem goes away also when using floats, so I wouldn't discount precision so easily. I can't make sense of that at FLT_EVAL_METHOD being 0 though.

MarcelHB · 2024-05-19T15:35:37Z

I don't know why you're insisting the problem is one of precision. just because the problem "goes away" when you use a double means nothing.

In 2nd paragraph I meant to say that if we have precision problem here on a such smale scale (affecting our calculations in maybe 2/360 cases) and this ends up making 95% of movements invalid already, less precision doesn't look like the fix. But also, the whole image does not fit since the errors still look too small for that impact, and hypot and NormalizeDelta was just an unfortunate test to see that some math issues aren't portable.

My gut feeling was somewhat right, this is probably not the place to focus on, at least for this problem. I put traces into the code and tested a case with the same journey on one PC on a map with no moving actors. So full determinism, if I'm not wrong.

The result shows some interesting problem. This my desktop:

[FindPath/DEBUG]: s = (1019, 808), d = (684, 684), caller = Nummer Zwei, dist = 0, size = 2
[FindPath/DEBUG]: nmptCurrent: 1019, 808
[FindPath/DEBUG]: smptCurrent: 63, 67
[FindPath/DEBUG]: Idx: 5423
[FindPath/DEBUG]: nmptChild: 1035, 808 (0, 1)
[FindPath/DEBUG]: smptChildIdx: 67 * 80 + 64 = 5424
[FindPath/DEBUG]: layz T*: 1019, 808, 1035, 808
[FindPath/DEBUG]: nmptChild: 1019, 820 (1, 0)
[FindPath/DEBUG]: smptChildIdx: 68 * 80 + 63 = 5503
[FindPath/DEBUG]: layz T*: 1019, 808, 1019, 820
[FindPath/DEBUG]: nmptChild: 1003, 808 (2, -1)
[FindPath/DEBUG]: smptChildIdx: 67 * 80 + 62 = 5422
[FindPath/DEBUG]: Next: child blocked
[FindPath/DEBUG]: nmptChild: 1019, 796 (3, 0)
[FindPath/DEBUG]: smptChildIdx: 66 * 80 + 63 = 5343
[FindPath/DEBUG]: layz T*: 1019, 808, 1019, 796
[FindPath/DEBUG]: nmptCurrent: 1019, 796
[FindPath/DEBUG]: smptCurrent: 63, 66
[FindPath/DEBUG]: Idx: 5343

And this is the Raspberry:

[FindPath/DEBUG]: s = (1019, 808), d = (684, 684), caller = Nummer Zwei, dist = 0, size = 2
[FindPath/DEBUG]: nmptCurrent: 1019, 808
[FindPath/DEBUG]: smptCurrent: 63, 67
[FindPath/DEBUG]: Idx: 5423
[FindPath/DEBUG]: nmptChild: 1019, 820 (0, 0)
[FindPath/DEBUG]: smptChildIdx: 68 * 80 + 63 = 5503
[FindPath/DEBUG]: layz T*: 1019, 808, 1019, 820
[FindPath/DEBUG]: nmptChild: 1019, 820 (1, 0)
[FindPath/DEBUG]: smptChildIdx: 68 * 80 + 63 = 5503
[FindPath/DEBUG]: layz T*: 1019, 808, 1019, 820
[FindPath/DEBUG]: nmptChild: 1019, 820 (2, 0)
[FindPath/DEBUG]: smptChildIdx: 68 * 80 + 63 = 5503
[FindPath/DEBUG]: layz T*: 1019, 808, 1019, 820
[FindPath/DEBUG]: nmptChild: 1019, 820 (3, 0)
[FindPath/DEBUG]: smptChildIdx: 68 * 80 + 63 = 5503
[FindPath/DEBUG]: layz T*: 1019, 808, 1019, 820
[FindPath/DEBUG]: nmptCurrent: 1019, 820
[FindPath/DEBUG]: smptCurrent: 63, 68
[FindPath/DEBUG]: Idx: 5503

We can focus on [FindPath/DEBUG]: nmptChild: 1019, 820 (0, 0), that's the result of

NavmapPoint nmptChild(nmptCurrent.x + 16 * dxAdjacent[i], nmptCurrent.y + 12 * dyAdjacent[i]);
Log(DEBUG, "FindPath", "nmptChild: {}, {} ({}, {})", nmptChild.x, nmptChild.y, i, int8_t{dxAdjacent[i]});

What we see is that on the RPi none of the accesses to dxAdjacent[i] is correct, its [0, 0, 0, 0] where it should have been [0, 1, 0, -1] ... if we use float_t instead of double in this file.

With RelWithDebInfo and breaking into this line, the structure of dxAdjacent looks correct though. So far now,

no idea why values are wrong, maybe some bogus assembly code? Snippet for ref below
no idea how double <-> float_t has its hands in there,
couldn't reproduce in a minimal test case yet

ARMv8 assemler code in PathFinder.cpp:374

cmp     x20, #0x0
add     x0, x20, #0x3f
ldr     x2, [sp, #200]
csel    x0, x0, x20, lt // lt = tstop
negs    x1, x20
and     x20, x20, #0x3f
asr     x0, x0, #6
and     x1, x1, #0x3f
csneg   x1, x20, x1, mi // mi = first
add     x0, x2, x0, lsl #3
tbz     x1, #63, 0x7ff7e48fe4
add     x1, x1, #0x40
sub     x0, x0, #0x8
ldr     x2, [x0]
mov     x3, #0x1                        // #1
adrp    x4, 0x7ff7f0b000
add     x4, x4, #0x138
lsl     x1, x3, x1
orr     x1, x2, x1
str     x1, [x0]
add     x0, x23, #0x10
stp     x4, x0, [sp, #160]
add     x24, sp, #0x1c8
ldr     x4, [sp, #264]
add     x19, x4, #0x138
add     x21, x4, #0x140
ldrsb   w3, [x21]
mov     x0, x24
ldrsb   w4, [x19] // <---- PC at L374
ldr     w1, [sp, #440]
ldr     w2, [sp, #444]
add     w3, w3, w3, lsl #1
add     w1, w1, w4, lsl #4
add     w2, w2, w3, lsl #2
bl      0x7ff7cccf50 <_ZN5GemRB5PointC1Eii@plt>
mov     x0, x24
bl      0x7ff7ccc570 <_ZN5GemRB3Map18ConvertCoordToTileERKNS_5PointE@plt>

lynxlynxlynx · 2024-05-19T18:28:00Z

Wow. Have you tried making them a static const (since constexpr working on classes is apparently a c++20 feature) or compiling with clang?

MarcelHB · 2024-05-19T19:14:47Z

static did not work
plain old C arrays did not work
replacing 12/16 by prime numbers did not work (this looks like a valid and desperate attempt when looking at the ARM math code that is being generated here)
replacing the index access by a function with some case statements works

Will try clang later.

MarcelHB · 2024-05-19T19:45:43Z

Eh ... godbolt put me in some good direction. First I was thinking that GCC was outsmarting me with taking all these 0, 1, -1 into its emitted assembly code or somthing since there was something disappearing at higher levels of O (see below). But no, this is a minimal case roughly looking looking like our problem (this is -O2):

Details of the assembly are irrelevant, I guess, and also constexpr or static don't make a difference. What does make a difference is dropping static/constexpr altogether, and suddenly something re-appears at the bottom:

Although there are no apparent references to dxAdjacent in the code (probably relative to that .LANCHOR0), that made me puzzled a little. Turns out then it works for this bug ...

Looking at the result of -S in both cases locally (to rule out godbolt isn't just cutting things away by some unknown fashion), the array data is there but the code is somewhat different – slightly, visible in the screenshots, too.

bradallred · 2024-05-19T20:39:28Z

The problem goes away also when using floats, so I wouldn't discount precision so easily. I can't make sense of that at FLT_EVAL_METHOD being 0 though.

you're misunderstanding me. What I'm saying is what we want to do here (normalize a velocity along a vector) does not require double precision. We should change our algorithm if it requires that. we would probably get better performance using another (approximate) method too.

We also shouldn't discount the sort of assumptions fast-math makes. Remember that these std math functions are templates or compiler builtins, so their internals are also subject to fast-math.

MarcelHB · 2024-05-19T22:04:42Z

Clang v14 is good. So what to do? My best guess is that there is some kind of problem that is isolated to GCC (v12) on (RPi-)ARM.

Live with it and recommend Clang if somebody complains
Investigate deeper and find a fix, given that it is not a GCC bug
Don't investigate deeper and write some macros to make it work somehow

Indications of some kind of bug in GCC are strong but breaking this down into a small test case only gives clues but no reproducibility yet.

lynxlynxlynx · 2024-05-20T12:41:36Z

Can you try with 13 or 14? Ideally it'd be a versioned problem and we could simply handle it in cmake.

MarcelHB · 2024-05-20T13:04:52Z

Hm, don't know. Raspbian Bookworm only provides v12 for now.

MarcelHB · 2024-05-20T22:26:41Z

Docker to the rescue again, although this isn't much fun on such a device. No improvement with GCC v13.

lynxlynxlynx · 2024-05-21T07:29:04Z

Ok, thanks. I think the sanest thing for now is to block gcc RPI builds and if anyone external gets annoyed enough to investigate further successfully, we can lift the ban. Should be trivial to do in CONFIGURE_RPI_SPECIFICS, since by the time it's ran, the compiler has already been detected. Let me know if you want me to handle it.

cf. gemrb#2088

MarcelHB added bug system: pathfinding labels May 18, 2024

MarcelHB changed the title ~~Path finding broken on Rasberry Pi~~ Path finding broken on Raspberry Pi May 18, 2024

MarcelHB added this to the 0.9.3 - TBN milestone May 18, 2024

MarcelHB added a commit to MarcelHB/gemrb that referenced this issue May 21, 2024

cmake: bans GCC on Raspberry Pi for now

686c6a9

cf. gemrb#2088

MarcelHB added a commit to MarcelHB/gemrb that referenced this issue May 21, 2024

cmake: bans GCC on Raspberry Pi for now

0d19967

cf. gemrb#2088

MarcelHB added a commit to MarcelHB/gemrb that referenced this issue May 21, 2024

cmake: bans GCC on Raspberry Pi for now

b15a1c9

cf. gemrb#2088

MarcelHB mentioned this issue May 21, 2024

Banning GCC on Raspberry Pi #2090

Merged

5 tasks

lynxlynxlynx linked a pull request May 21, 2024 that will close this issue

Banning GCC on Raspberry Pi #2090

Merged

5 tasks

lynxlynxlynx closed this as completed in #2090 May 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Path finding broken on Raspberry Pi #2088

Path finding broken on Raspberry Pi #2088

MarcelHB commented May 18, 2024

MarcelHB commented May 18, 2024

MarcelHB commented May 18, 2024

lynxlynxlynx commented May 18, 2024

MarcelHB commented May 18, 2024

bradallred commented May 18, 2024

MarcelHB commented May 18, 2024

MarcelHB commented May 18, 2024

bradallred commented May 18, 2024

MarcelHB commented May 18, 2024 •

edited

MarcelHB commented May 18, 2024 •

edited

bradallred commented May 18, 2024

MarcelHB commented May 18, 2024

bradallred commented May 18, 2024

MarcelHB commented May 18, 2024

bradallred commented May 18, 2024

MarcelHB commented May 18, 2024

bradallred commented May 19, 2024

lynxlynxlynx commented May 19, 2024

MarcelHB commented May 19, 2024 •

edited

lynxlynxlynx commented May 19, 2024

MarcelHB commented May 19, 2024

MarcelHB commented May 19, 2024 •

edited

bradallred commented May 19, 2024 •

edited

MarcelHB commented May 19, 2024

lynxlynxlynx commented May 20, 2024

MarcelHB commented May 20, 2024

MarcelHB commented May 20, 2024

lynxlynxlynx commented May 21, 2024

Path finding broken on Raspberry Pi #2088

Path finding broken on Raspberry Pi #2088

Comments

MarcelHB commented May 18, 2024

Bug description

MarcelHB commented May 18, 2024

MarcelHB commented May 18, 2024

lynxlynxlynx commented May 18, 2024

MarcelHB commented May 18, 2024

bradallred commented May 18, 2024

MarcelHB commented May 18, 2024

MarcelHB commented May 18, 2024

bradallred commented May 18, 2024

MarcelHB commented May 18, 2024 • edited

MarcelHB commented May 18, 2024 • edited

bradallred commented May 18, 2024

MarcelHB commented May 18, 2024

bradallred commented May 18, 2024

MarcelHB commented May 18, 2024

bradallred commented May 18, 2024

MarcelHB commented May 18, 2024

bradallred commented May 19, 2024

lynxlynxlynx commented May 19, 2024

MarcelHB commented May 19, 2024 • edited

lynxlynxlynx commented May 19, 2024

MarcelHB commented May 19, 2024

MarcelHB commented May 19, 2024 • edited

bradallred commented May 19, 2024 • edited

MarcelHB commented May 19, 2024

lynxlynxlynx commented May 20, 2024

MarcelHB commented May 20, 2024

MarcelHB commented May 20, 2024

lynxlynxlynx commented May 21, 2024

MarcelHB commented May 18, 2024 •

edited

MarcelHB commented May 18, 2024 •

edited

MarcelHB commented May 19, 2024 •

edited

MarcelHB commented May 19, 2024 •

edited

bradallred commented May 19, 2024 •

edited