
EIDM has platform-dependent behavior #8921

Open · namdre opened this issue Aug 6, 2021 · 23 comments

@namdre (Contributor) commented Aug 6, 2021

Test cfmodel/drive_in_circles_small/EIDM generates

  • 25 collisions on my local RHEL6 machine
  • 5 collisions on my Ubuntu 20.04 machine
  • 12 collisions on our Ubuntu test server
  • 6 collisions on my Windows machine

On each machine the runs are stable (same result for 100 repeats) and also consistent across release/debug/clang builds.

@Domsall (Contributor) commented Aug 6, 2021

I ran a simulation of mine on a Linux system (openSUSE) and on a Windows system and got the same fcd-output.
Do you know where this could come from? RNG-values? The tanh-function?

@namdre (Contributor, Author) commented Aug 6, 2021

It must be something subtle that requires many RNG-calls:

  • when disabling the 4 RandHelper calls, the differences disappear
  • when only enabling the random minGap, there is still no difference
  • when enabling minGap and myw_gap, the difference shows up after 2500 steps at precision 4 (210 steps at precision 20)
  • when enabling minGap, myw_gap and myw_speed, the difference shows up after 2100 steps at precision 4 (98 steps at precision 20)

@behrisch behrisch added this to the 1.11.0 milestone Aug 10, 2021
@Domsall (Contributor) commented Aug 30, 2021

I took a deeper look into the issue with RNG-calls and added a driverstate-device to Krauss-vehicles.
The fcd-outputs of the Linux simulation and the Windows simulation also differ.

I also tested if it has something to do with collisions, but even without collisions the values are not the same.

Could you check this behavior on your machines?

The scenario:
circles_collisions_EIDM_and_platform_dependancy.zip

@Domsall (Contributor) commented Sep 1, 2021

Update:

  • All RNG-calls are the same (values, call and rng-number).
  • Very small differences start showing at some point (position is different after 10-15th decimal)
  • At some point the position difference between Linux and Windows creates a step where one vehicle is on lane "X" in Linux and on lane "Y" in Windows. In that step the RNG-call of the Linux vehicle and that of the Windows vehicle are different, because the RNGs belong to a lane and not to a vehicle.
  • After this step, the simulations start drifting away from each other

@namdre (Contributor, Author) commented Sep 1, 2021

Since the "normal" floating-point math governing vehicle positions should work the same on all machines (and seems to do so for other models), I suspect that it's tanh or some other math library function used so far only by EIDM.

@namdre (Contributor, Author) commented Sep 1, 2021

This appears to be "normal": https://stackoverflow.com/questions/21183477/windows-vs-linux-math-result-difference though it wasn't a problem for us so far.

@Domsall (Contributor) commented Sep 6, 2021

As mentioned above, the DriverState-Device suffers from a precision leak similar to that of the EIDM.

I tracked the issue down to the log-function call in randNorm. Summary from my understanding:

  • the processors often use extended double values between calculations
  • it then sometimes happens that on different systems the intermediate values are not perfectly the same
  • the log-function then outputs slightly different values (after 16 decimals)
  • This is often not a problem, but for the random walks, each error influences all future values
  • To make sure both systems behave the same, I "forced" the log-function output to the double precision of 15 decimals and now get the same fcd-output (precision 6) for each system with the following small patch:
    log_patch.txt

This solution is not really elegant and may still drift away after some time, but works as intended for the above examples (circle example with the DriverState Krauss or the EIDM).
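(For illustration, a minimal sketch of the idea the patch describes; the 1e15 scale and the int64 cast here are assumptions, the actual log_patch.txt may differ:)

```cpp
#include <cmath>
#include <cstdint>

// Sketch of the workaround described above: discard the unreliable last
// bits of log() by keeping only ~15 decimals, so a +-1-ulp difference
// between platforms is cut off before it can enter the random walk.
double roundedLog(double q) {
    const double raw = std::log(q);
    // multiply-intcast-divide; int64 is needed so the scaled value fits
    return static_cast<double>(static_cast<int64_t>(raw * 1e15)) / 1e15;
}
```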

@namdre (Contributor, Author) commented Sep 6, 2021

As far as my understanding of floating point rounding goes, there are many cases where the given approach fails and may actually increase the deviation between platforms.
Consider the case where one implementation returns the binary equivalent of 1.0 and the other 0.999...9. If these numbers are rounded by discarding some digits (which is approximately what your multiply-intcast-divide does), then the difference between the values increases.
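(A small demonstration of this failure mode, with illustrative values; cut15 is a hypothetical stand-in for the multiply-intcast-divide:)

```cpp
#include <cstdint>
#include <cstdio>

// Truncating to 15 decimals can widen a 1-ulp gap to a full decimal digit.
static double cut15(double x) {
    return static_cast<double>(static_cast<int64_t>(x * 1e15)) / 1e15;
}

int main() {
    const double a = 1.0;                 // result on platform A
    const double b = 0.9999999999999999;  // platform B: one ulp below 1.0
    std::printf("before: %.17g\n", a - b);                // ~1.1e-16
    std::printf("after:  %.17g\n", cut15(a) - cut15(b));  // ~1.0e-15, larger
    return 0;
}
```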

@namdre (Contributor, Author) commented Sep 7, 2021

On further thought, my example is maybe "proving too much" (namely, that rounding generally doesn't work). Every rounding algorithm has edge cases where nearby values that fall on different sides of a threshold are rounded away from each other. Nevertheless, there are more cases where rounding serves to reduce the difference (as you yourself have observed).

@Domsall (Contributor) commented Sep 7, 2021

I compared the values from the randNorm-function and could see that "q" is always the same on each system and between 0 and 1.
But the output of the log-function is sometimes different (probably a "bit shift" to the next higher/lower double value). Nonetheless, the double precision float format guarantees 15 significant decimal digits of precision, so if I round to that digit, both values should still be the same. I couldn't find any other idea when searching the internet.

@behrisch (Contributor) commented Sep 7, 2021

I agree that rounding does not solve it, but it may be good enough. I would just prefer to do the rounding on the final result instead of just the log; this would also mask out deviations stemming from different sqrt behavior. And maybe there is a more efficient way of zeroing out the last bits of the mantissa (https://stackoverflow.com/a/5672983/5731587).
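(A sketch of the bit-masking idea from the linked answer, not the code that was committed; values one ulp apart collapse to the same double unless they straddle a mask boundary:)

```cpp
#include <cstdint>
#include <cstring>

// Zero the lowest 'bits' bits of the mantissa instead of decimal rounding.
double maskMantissa(double x, int bits) {
    uint64_t u;
    std::memcpy(&u, &x, sizeof u);  // bit-copy, avoids aliasing UB
    u &= ~((static_cast<uint64_t>(1) << bits) - 1);
    std::memcpy(&x, &u, sizeof u);
    return x;
}
```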

@namdre (Contributor, Author) commented Sep 7, 2021

I'm pretty sure that sqrt behaves the same on all platforms or we'd have noticed (though I don't have a hard source for this).
Curiously, lots of input on this topic comes from game developers that try to optimize their network code: https://gafferongames.com/post/floating_point_determinism/

@Domsall (Contributor) commented Sep 7, 2021

Indeed, recasting and changing bits instead of using a "round"-function would work a lot better. But I also agree that sqrt should not be a problem.

@behrisch (Contributor) commented Sep 8, 2021

> I'm pretty sure that sqrt behaves the same on all platforms or we'd have noticed (though I don't have a hard source for this).
> Curiously, lots of input on this topic comes from game developers that try to optimize their network code: https://gafferongames.com/post/floating_point_determinism/

But this thread looks like we could also get near it with some compiler options and/or using _controlfp?

@behrisch (Contributor):

_controlfp is probably not the way to go. If we want to use it, we need to enable /fp:strict, which changes several test results even if I only run the netgen tests. And even if I enable it, it does not solve #8973.
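(For reference, a sketch of that knob; precision control exists only for the x87 FPU, i.e. 32-bit x86 builds, while x64 uses SSE2 math where _MCW_PC is not supported:)

```cpp
#include <float.h>

// MSVC only: force the x87 FPU to round every intermediate result to
// 53 bits, suppressing the extended-precision effects described above.
void forceDoublePrecision() {
#if defined(_MSC_VER) && defined(_M_IX86)
    unsigned int current;
    _controlfp_s(&current, _PC_53, _MCW_PC);
#endif
}
```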

@Domsall (Contributor) commented Nov 25, 2021

I dug a bit deeper and it all comes down to the log-function. Unfortunately, I could not find a good solution and hope I understood everything correctly.
Here are my findings:

The algorithms used for the log-calculation are compiler- and hardware-dependent. Changing options via _controlfp and /fp:strict (on Windows) or fesetround and -frounding-math/-ffloat-store/-fexcess-precision=style (on Linux) did not change anything for me.
As in your link above, there are many different ways to cope with this problem.
One solution is to use a platform-independent math library (see https://stackoverflow.com/questions/1129032/platform-independent-math-library), which you are already kind of doing in places (for example with your own RandomNumber-functions). But that slows things down.
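(To illustrate what such a software-based log could look like, a sketch that is not proposed for SUMO: it is deterministic under strict IEEE-754 double semantics, i.e. no fast-math or FMA contraction, at the cost of being a few ulps less accurate and slower than the platform libm:)

```cpp
#include <cmath>  // only for frexp, which is an exact bit operation

// Tiny deterministic log() sketch: reduce x to m * 2^e with m in
// [sqrt(1/2), sqrt(2)), then evaluate log(m) = 2*atanh(s) with
// s = (m-1)/(m+1) via its Taylor series. Only +,-,*,/ are used, which
// IEEE 754 requires to be correctly rounded, so the result is
// bit-identical on every conforming platform. Assumes x > 0, which
// suffices for log(q) with q in (0, 1) as in randNorm.
double portableLog(double x) {
    int e;
    double m = std::frexp(x, &e);       // x = m * 2^e, m in [0.5, 1)
    if (m < 0.70710678118654752) {      // shift m into [sqrt(.5), sqrt(2))
        m *= 2.0;
        --e;
    }
    const double s = (m - 1.0) / (m + 1.0);  // |s| < 0.1716
    const double s2 = s * s;
    double term = s;
    double sum = 0.0;
    for (int k = 1; k < 40; k += 2) {   // s^39 ~ 1e-30, well below 1 ulp
        sum += term / k;
        term *= s2;
    }
    const double LN2 = 0.69314718055994530942;
    return 2.0 * sum + e * LN2;
}
```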

The resulting difference in the log-calculation happens approx. every 100th call and consists of a rounding error (±1 bit).

To give this issue a bump: which approach would you prefer going forward?

@namdre (Contributor, Author) commented Nov 25, 2021

I think most users do not need to replicate the same simulation on different machines. The annoyance comes mostly from the developer side when trying to reproduce user examples. I would just document the platform-dependency of EIDM including the fact that it comes from log.

@behrisch (Contributor):

But if you have a patch ready which solves it on the platforms you tested, feel free to submit a PR.

@Domsall (Contributor) commented Nov 29, 2021

The workaround I wrote about is here:

> As mentioned above, the DriverState-Device suffers from a precision leak similar to that of the EIDM.
>
> I tracked the issue down to the log-function call in randNorm. Summary from my understanding:
>
>   • the processors often use extended double values between calculations
>   • it then sometimes happens that on different systems the intermediate values are not perfectly the same
>   • the log-function then outputs slightly different values (after 16 decimals)
>   • This is often not a problem, but for the random walks, each error influences all future values
>   • To make sure both systems behave the same, I "forced" the log-function output to the double precision of 15 decimals and now get the same fcd-output (precision 6) for each system with the following small patch:
>     log_patch.txt
>
> This solution is not really elegant and may still drift away after some time, but works as intended for the above examples (circle example with the DriverState Krauss or the EIDM).

I also tried a bit-shift approach, but this did not work as intended. The log()-function only varies by ±1 bit (rounding of the internal log algorithm), so if I make sure I cut off the number representation of this bit, both results are the same.

Like I stated above, it is not a great method, but I could not find any better one (except by adding a C-software-based log()-function). So if someone absolutely needs the platform independence, they can use this approach.

@behrisch (Contributor):

I just applied the patch with a small adaptation. Please recheck whether it still works with your setup.

namdre added a commit that referenced this issue Nov 30, 2021
namdre added a commit that referenced this issue Nov 30, 2021
namdre added a commit that referenced this issue Dec 1, 2021
@Domsall (Contributor) commented Dec 6, 2021

First of all, I am sorry for the erroneous patch. I just realized that back then, after testing, I added an int32 instead of an int64... Secondly, I must admit that the workaround does not fully solve the problem, it just slows down the drift. But you probably already know that.

Your patch works on my side. For information: I am now getting approx. 1 dissimilar log-return value (between the platforms) per 50,000 calls. Previously it was approx. 1 dissimilar value per 100-1000 calls.

The different return value every 50,000 calls stems in part from the rounding issue already posted by @namdre:

> As far as my understanding of floating point rounding goes, there are many cases where the given approach fails and may actually increase the deviation between platforms. Consider the case where one implementation returns the binary equivalent of 1.0 and the other 0.999...9. If these numbers are rounded by discarding some digits (which is approximately what your multiply-intcast-divide does), then the difference between values increases.

One example is:

  • One platform outputs "-0.136581060337999993237190210493"
  • The other platform outputs "-0.136581060337999965481614594864"
  • When we multiply those values by 1e12/1e13/etc., the last digits of the first value get rounded to 80 and the digits of the second value to 79.
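(Indeed, the two quoted outputs are exactly one ulp apart, which can be checked directly, assuming the printed digits round back to the original doubles:)

```cpp
#include <cmath>
#include <cstdio>

int main() {
    const double v1 = -0.136581060337999993237190210493;  // platform 1
    const double v2 = -0.136581060337999965481614594864;  // platform 2
    // v2 is the next representable double toward zero from v1:
    std::printf("%s\n", std::nextafter(v1, 0.0) == v2 ? "one ulp apart"
                                                      : "further apart");
    return 0;
}
```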

So after some time the results of this solution will drift. That is why I would call it a "workaround", but not a solution.

From my view, the only "real" solution would be to add a platform independent math library.

@namdre (Contributor, Author) commented Jan 6, 2022

Maybe https://www.swmath.org/software/12390 (though this is LGPL).
Anyway, I'd rate this as a low priority now.

@namdre namdre modified the milestones: 1.11.0, 2.0.0 Jan 6, 2022
@behrisch (Contributor) commented Jan 6, 2022

> Maybe https://www.swmath.org/software/12390 (though this is LGPL).

and it does not look very well maintained.
