Major random bug causes the engine to quickly go on a downward spiral until it crashes #508
Comments
Update: After noticing that it became slower after {D}, I added two prints inside it, one at the beginning and one at the end. You can see it here: https://pastebin.com/sKXA3M4q
I then ran the server again until it died today, and these were the results. As always, everything was normal until suddenly:
...some time later...
As you can see, both m_pairCount and m_contactManager.m_contactList suddenly increase at the same time, and both keep increasing until the server crashes. Again, any kind of help, even about where I can look for a cause, is extremely welcome. I'm desperate.
Could it be because Box2D isn't designed for top-down games? Anyway, the strange thing is that it works perfectly until something suddenly happens.
I restarted it last night, and it crashed again today:
...a while after...
I'm guessing you have many overlapping shapes creating a quadratic explosion of contacts. I recommend logging b2ContactManager::m_contactCount. Do you have any game logic that could cause all the bodies to go to the origin? Also, if you are concerned about TOI, you can disable continuous collision to rule that out. I'm guessing it won't affect your game too much.
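To make the "quadratic explosion" concrete: if n shapes all overlap one another, the broad phase can report up to n(n-1)/2 pairs. The following is a minimal sketch of that arithmetic only. The class and method names are hypothetical and nothing here calls Box2D.

```java
public class ContactGrowth {
    // Upper bound on contact pairs when n shapes all mutually overlap: n * (n - 1) / 2.
    static long maxPairs(long n) {
        return n * (n - 1) / 2;
    }

    public static void main(String[] args) {
        // Illustrative body counts, not measurements from the server.
        for (long n : new long[] {10, 100, 1000}) {
            System.out.println(n + " overlapping shapes -> up to " + maxPairs(n) + " pairs");
        }
    }
}
```

Under this bound, a pile of 1000 mutually overlapping shapes could produce close to half a million pairs, which would be consistent with a pair count that suddenly dwarfs its normal value.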
I'll monitor contactCount and body positions, and disable CCD. About the latter, will turning off CCD also turn it off for bullet bodies? PS: Thanks for answering!! You gave me a ray of hope.
Not sure, but it should be easy to modify the source so that only bullets use CCD.
The server has been running with CCD disabled for 9 days, so I'm almost sure that the problem is gone. Do you have any idea where the problem could be? m_contactCount is around 1000 on average.
Well, 13 days later it finally did crash. Some details: everything was correct:
Then...
Since you asked whether there was any chance of bodies moving to the origin, I printed the position of all bodies after this (from Java, printing the Vector2 result of body.getPosition, which calls a JNI method into the native library and contains two floats). This is what I found: https://drive.google.com/file/d/1V6YebNSJpFMWNPXwadPCsRt8oUd-K7kx/view?usp=sharing
4 bodies at position (0.0, 0.0), 64 bodies at position (NaN, NaN).
A few seconds later: 2 bodies at position (0.0, 0.0), 104 bodies at position (NaN, NaN).
The latter keeps growing in later prints, until at the end most of the bodies are at (NaN, NaN). contactCount also grows exponentially; I don't know what that means. But at the time the contact explosion happens, the bodies are basically spread all over the world, with only 4 at the origin and 64 at (NaN, NaN). Do you have any idea what the problem could be? Thanks
I'm guessing all the bodies at [NaN NaN] are considered overlapping. I'm guessing there is an instability in your simulation. Do you have large mass ratios between rigid bodies? You can get instabilities if you connect rigid bodies together by joints and they have a large mass ratio. Anything above 10 to 1 can give you trouble. Are there any large forces in your game?
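One way to catch the first NaN body as early as possible is to scan positions after every step and log the first offenders. This is a minimal sketch in plain Java. It assumes the positions have already been copied into a hypothetical `float[][]` of `{x, y}` pairs (for example from `body.getPosition()`), and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class NanScan {
    // Returns the indices of bodies whose x or y is NaN or infinite.
    // positions[i] is assumed to hold {x, y} for body i.
    static List<Integer> findInvalid(float[][] positions) {
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < positions.length; i++) {
            float x = positions[i][0];
            float y = positions[i][1];
            if (!Float.isFinite(x) || !Float.isFinite(y)) {
                bad.add(i);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        float[][] positions = {
            {12.5f, -3.0f},
            {Float.NaN, Float.NaN},
            {0.0f, 0.0f},
            {Float.POSITIVE_INFINITY, 1.0f}
        };
        System.out.println("Invalid body indices: " + findInvalid(positions));
    }
}
```

Logging which body goes invalid first, and what it was jointed to or touching at that moment, narrows the search far more than a count taken after the explosion.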
There are two types of joints in the game, both weld joints. One is between a "grabber" (something like a fishing rod's hook) and fishable entities. The grabber has a mass of 6.28 and the heaviest fishable has a mass of 0.3, so the ratio is bigger than 10 to 1, but the grabber is the one that moves the other. Is there a problem with that? Also, the grabber is kinematic while fishable entities are dynamic. The other joint is between rafts. Rafts all have the same mass and can be welded to other rafts to make bigger rafts; you can see how that works by looking for videos of the game on YouTube. About large forces, what is considered large? The largest forces are the ones applied to grabbers, which they then apply to grabbed objects, and the ones people apply to rafts and rafts apply to welded rafts, but they aren't really large.
The mass of a kinematic body is not used in the solver. So that mass ratio doesn't matter. Can someone make a huge raft? How many segments? Is it possible to get overlapping and colliding raft segments that are fighting against the weld joints? How are the rafts moved? By force, impulse, setting velocity, or setting position? Can the grabber collide with anything? How about the fishables? Applying a force to a kinematic body doesn't do anything. You should only be setting the velocity of a kinematic body. Could multiple grabbers grab the same fishable?
Hi:
You seem to have a couple of things going on that could lead to instabilities (and thus NaNs).
1a. The best solution for raft merging is to create multiple fixtures on a single b2Body. This removes the need for the weld joint. You will have to do some bookkeeping.
1b. If you really need to use the weld joints, then try using a negative unique collision group for all fixtures on a single raft. See b2Filter::groupIndex.
1a and 1b should both prevent the bug in the video.
Do you use a variable time step? What is the largest time step you could use? I recommend clamping this to no larger than 1/20 (20Hz).
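For a server that does use a variable time step, the clamp suggested above could look like the sketch below. It is plain Java with hypothetical names; the returned value is what would be passed as the first argument to the world's step call.

```java
public class StepClamp {
    // Largest step allowed, following the 1/20 s (20Hz) recommendation above.
    static final float MAX_STEP = 1f / 20f;

    // Returns a step size that is never negative, never NaN, and never above MAX_STEP.
    static float clampStep(float dt) {
        if (Float.isNaN(dt) || dt < 0f) {
            return 0f;
        }
        return Math.min(dt, MAX_STEP);
    }

    public static void main(String[] args) {
        float[] frameTimes = {1f / 60f, 0.5f, -1f, Float.NaN};
        for (float dt : frameTimes) {
            System.out.println("measured dt = " + dt + " -> step with " + clampStep(dt));
        }
    }
}
```

Clamping trades accuracy of game time for stability: after a long stall the simulation falls behind the wall clock instead of taking one huge step.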
I'll try 1b, as it doesn't require changing how the raft system works. What is the limit on the number of different negative groupIndex values I can set? (I need different raft groups to be able to collide with each other.) I'll fix 2 too. I call world.step with a fixed timestep (world.step(3/60f, 3, 3);). Thanks for your help
The groupIndex is 16 bits, so 32K negative numbers are possible. You can change it to a 32 bit number if you like. Correct usage of b2Filter prevents the b2Contact object from being created. You are simulating at 20Hz. This makes the solver even more sensitive to mass ratios and inconsistent constraints. Are you using a non-zero b2WeldJointDef::frequencyHz?
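The group rule can be sketched in plain Java as follows. `shouldCollide` mirrors the rule of Box2D's default contact filter: two fixtures sharing the same non-zero group collide only if the group is positive, and otherwise the category and mask bits decide. The allocator that hands out one negative index per raft group is a hypothetical helper, not part of Box2D or gdx-box2d.

```java
public class GroupFilter {
    // Same non-zero group: positive means always collide, negative means never collide.
    // Otherwise fall back to the category/mask test.
    static boolean shouldCollide(short groupA, int categoryA, int maskA,
                                 short groupB, int categoryB, int maskB) {
        if (groupA == groupB && groupA != 0) {
            return groupA > 0;
        }
        return (maskA & categoryB) != 0 && (categoryA & maskB) != 0;
    }

    private short next = 0;

    // Hypothetical allocator: returns -1, -2, ... down to -32768, then refuses.
    short nextRaftGroup() {
        if (next == Short.MIN_VALUE) {
            throw new IllegalStateException("out of negative group indices");
        }
        return --next;
    }

    public static void main(String[] args) {
        GroupFilter alloc = new GroupFilter();
        short raftOne = alloc.nextRaftGroup();
        short raftTwo = alloc.nextRaftGroup();
        int cat = 0x0001, mask = 0xFFFF;
        System.out.println("segments of the same raft collide: "
                + shouldCollide(raftOne, cat, mask, raftOne, cat, mask));
        System.out.println("segments of different rafts collide: "
                + shouldCollide(raftOne, cat, mask, raftTwo, cat, mask));
    }
}
```

When two raft groups merge, every fixture of one group would need to be reassigned to the other group's index, which is part of the bookkeeping this approach requires.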
So I should run step more often?
I implemented both changes (a negative group index for groups of rafts, and only one grabber welding a fishable at a time), but today the server crashed again :( I thought it was fixed this time.
It is the same case as before, where the number of bodies with NaN positions starts to grow. I do not want to give up, because I put a lot of effort into this game, but I don't know what could be causing this.
I think you have some instability causing the simulation to blow up. I recommend adding some code in b2Island.cpp to detect large positions and/or velocities. Probably in the loop at b2Island.cpp:275. You can then set up some code to call b2World::Dump. Then we can look at what the contents of the world are in the testbed.
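The suggestion above is to add the check inside b2Island.cpp. As a rough Java-side analogue, a watchdog can flag any body whose per-step translation or rotation exceeds a limit. This is only a sketch: the names and the threshold parameters are hypothetical, and the velocities are assumed to have been read from each body after a step.

```java
public class MotionWatchdog {
    // Returns true if the body would move or rotate more than the given limits
    // in a single step of length dt, or if any velocity component is NaN/infinite.
    static boolean isSuspicious(float vx, float vy, float omega, float dt,
                                float maxTranslation, float maxRotation) {
        if (!Float.isFinite(vx) || !Float.isFinite(vy) || !Float.isFinite(omega)) {
            return true;
        }
        float tx = vx * dt;
        float ty = vy * dt;
        float rotation = omega * dt;
        return tx * tx + ty * ty > maxTranslation * maxTranslation
                || rotation * rotation > maxRotation * maxRotation;
    }

    public static void main(String[] args) {
        float dt = 3f / 60f;       // the fixed step mentioned in this thread
        float maxT = 2.0f;         // metres per step, illustrative threshold
        float maxR = 1.5f;         // radians per step, illustrative threshold
        System.out.println("walking body: " + isSuspicious(1f, 0f, 0f, dt, maxT, maxR));
        System.out.println("runaway body: " + isSuspicious(100f, 0f, 0f, dt, maxT, maxR));
    }
}
```

The first time this returns true is the moment to dump the world, before the values have grown into NaNs.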
I've started logging big translations and rotations, and it seems it is indeed related to that. Just a few seconds before the contact explosion, a few out-of-limit rotations and translations started to appear:
A few seconds later, the rotations and translation grew to huge numbers:
I've also saved a few b2World::Dump outputs from when the max rotations and translations started appearing, you can download them here: How do I open those dumps in the testbed?
Grab branch issue508 and run the testbed. I recommend hitting pause (P), then restart (R), then single step (O). I see some problems with the collision filtering. There seem to be large static regions that push everything out. Is this a problem with your collision masks? The big problem I see for one of the dumps (I call it b2Dump02.txt) is this: b2WeldJointDef jd; As you can see, the local anchors are huge. If you create the weld joint with b2WeldJointDef::Initialize, then this means you are feeding in an anchor that is very far from the bodies. No doubt this is causing huge instabilities.
Do the localAnchors change over time? Or must the anchors you see there have been set when creating the joint? I create the joints like this:
The anchors should be at most 1 meter away from either of the bodies, unless there's a weird bug in my code. If you confirm that the anchor must have been set when creating the joint, I'll check in what situations that could be happening. About the large static regions, those may be the islands, but only rafts collide with islands. I haven't loaded the dumps in the testbed yet; I'll do it as soon as I can.
The local joint anchors are not modified by Box2D. I'm guessing there is a problem with jointReq.getTarget(). You could check to see if the target is ever far from bodyA or bodyB.
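A check along those lines can run just before the joint is created. The sketch below is plain Java with hypothetical names. It assumes the world-space anchor and both body positions are available as floats, and uses the 1 meter limit mentioned earlier in the thread as the example threshold.

```java
public class AnchorCheck {
    // Euclidean distance between two points.
    static float dist(float x1, float y1, float x2, float y2) {
        return (float) Math.hypot(x1 - x2, y1 - y2);
    }

    // Returns true only if the anchor is finite and within maxDist of BOTH bodies.
    static boolean isAnchorNear(float anchorX, float anchorY,
                                float bodyAx, float bodyAy,
                                float bodyBx, float bodyBy,
                                float maxDist) {
        if (!Float.isFinite(anchorX) || !Float.isFinite(anchorY)) {
            return false;
        }
        return dist(anchorX, anchorY, bodyAx, bodyAy) <= maxDist
                && dist(anchorX, anchorY, bodyBx, bodyBy) <= maxDist;
    }

    public static void main(String[] args) {
        float maxDist = 1.0f; // "at most 1 meter away", per the comment above
        System.out.println("anchor between bodies: "
                + isAnchorNear(0.5f, 0f, 0f, 0f, 1f, 0f, maxDist));
        System.out.println("anchor far away: "
                + isAnchorNear(500f, 0f, 0f, 0f, 1f, 0f, maxDist));
    }
}
```

Rejecting and logging the joint request when this returns false, rather than silently correcting the anchor, would also record which code path produced the bad target.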
I've added code that checks for far anchors before joint creation, and it detected and corrected anchors that were far from the bodies, so I did have a problem there. Even so, the server crashed again today. I had increased the threshold for dump creation, so only 3 dumps were made, long after the server started going crazy, so they contain a lot of:
And the testbed can't read them. I opened the issue508 testbed. About the bodies being pushed out of the islands: they are the humans/animals. They can collide with islands, but the contacts are disabled in the collision callbacks. I guess the problem might indeed be related to rafts, as sometimes the MAXT (big translation) messages start after joining rafts. I should implement the solution you suggested before, adding additional rafts as fixtures instead of joining bodies, and see if the bug keeps happening. That's the only thing I think is left to try now.
Any more on this? Otherwise I will close this soon.
I'm using Box2D through the gdx-box2d Java library, which calls the native code through JNI.
I've used it without problems in my other games, server-side, as they are multiplayer MMO games, but in my newest game, after a random amount of time (usually 1 or 2 days), it suddenly gets really slow until it crashes.
The game is this one (you can try it without download)
After trying to solve it on the Java side, I decided to start touching the native code, basically adding some printf calls in world.step and related functions.
b2World.cpp: https://pastebin.com/d6fn0mLd (I added printfs at the end of step displaying m_profiler times, and at different places in SolveTOI)
b2BroadPhase.h: https://pastebin.com/WTihivb1 (added prints at updatepairs showing moveCount and pairCount)
After it crashes, I analyse the log, and this is what I see:
Everything works fine until it suddenly slows down and the pairCount increases substantially.
...a few seconds...
...minutes...
...minutes... at this point I see a lot of "STUCK at colliding" messages that come from here:
`
gameServer.getWorld().getBox2dWorld().QueryAABB(
then
...minutes...
CRASH
Sometimes it crashes showing an error in a memmove (or something like that) call, and sometimes it doesn't show anything.
A second before that, these were the world stats (normal):
Any kind of help, even about where to continue looking for the cause, is welcome. I'm hopeless right now, and the game is stuck, as I don't want to continue updating or advertising it while it crashes every 1 or 2 days.
I've already checked whether there were any updates to the related code in current box2d master that weren't in box2d version, but there aren't.
Thanks in advance.