
Destroying while a Connection is Pending Causes Crashes #56

Closed
LordNed opened this issue Jul 25, 2017 · 9 comments

@LordNed commented Jul 25, 2017

This is an interesting one. The callstack points to the engine running out of memory while trying to allocate a huge block, triggered by broadcasting a multicast delegate. It seems to only happen if you start the game, attempt a connection to a websocket server, and then close the game before the server responds.

After many hours of searching, logging, and breakpointing, I think it comes down to this case:
The SocketIONative class starts a long-running background thread for connecting, where it registers a couple of handlers (such as OnFailCallback). Somehow, when UninitializeComponent runs, it ends up invoking OnFailCallback. That callback is listened to in the SocketIOClientComponent. By the time the callback is invoked (and the callback itself does look valid at that point), the SocketIOClientComponent's NativeClient pointer is null, i.e. when Line 195 is hit, NativeClient reports as null. It then starts a lambda on the game thread to broadcast the failure.

By the time the game-thread lambda actually runs, OnFail appears to be destructed or partially destructed. Broadcasting it triggers an array reallocation that requests a huge amount of memory, and the engine crashes.

The point of opening this ticket is to determine the best course of action for a fix. On one hand, it looks like I could drop an `if (NativeClient == nullptr) { return; }` in before we start the RunShortLambdaOnGameThread, skipping the callback if the object has been destroyed. This is untested, but it would presumably fix the issue.
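For illustration, here is a minimal sketch of that guard. The handler wiring and surrounding code are assumed rather than quoted from the plugin, and the stock AsyncTask call stands in for the plugin's RunShortLambdaOnGameThread helper:

```cpp
// Sketch only: the handler wiring below is assumed, not quoted from the plugin.
NativeClient->OnFailCallback = [this]()
{
	// Proposed guard: if the component has already released its native client,
	// skip scheduling the broadcast instead of touching a dying object.
	if (!NativeClient.IsValid()) // the "NativeClient == nullptr" check described above
	{
		return;
	}

	// Marshal the broadcast onto the game thread (the plugin uses its own
	// RunShortLambdaOnGameThread helper; stock AsyncTask from Async/Async.h shown here).
	AsyncTask(ENamedThreads::GameThread, [this]()
	{
		OnFail.Broadcast(); // any delegate parameters omitted for brevity
	});
};
```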

The other option is to work out how OnFailCallback is being called while NativeClient reports nullptr. It might be reporting nullptr because of how MakeShareable/TSharedPtr is implemented. It also looks like when the SocketIONative client is destructed (i.e. assigned to nullptr in UninitializeComponent) it doesn't run ClearCallbacks(), which may be why OnFailCallback is still getting called. Perhaps clearing the list of callbacks when the object is destructed is the right course of action?

Unfortunately, both of these changes mean that you won't receive potentially expected callbacks (such as OnDisconnected or OnFail) from a native client that is being destroyed, but I figure that's better as a known issue (you can hook the OnDestroyed event to know when the component is about to be destroyed and use that as a "hey, we got disconnected" message instead).

The most reliable way to repro this is to put an actor in your scene, have it connect to something in BeginPlay (no async disconnect, no auto-connect), hit Play, and then immediately hit Escape. Sometimes it doesn't crash, but doing it repeatedly will. That said, you can also hit the crash late in the game when the actor is destroyed, e.g. 2-3 minutes later if it still hasn't connected to a websocket server.

This is becoming a blocker for our project, so I may apply one of the above patches myself to see if it fixes things, but I'd like to know the ideal course of action here so I can turn it into a PR and fix it for everyone.

@LordNed (Author) commented Jul 25, 2017

I came up with a reasonable-sounding workaround that will undergo more extensive testing over the coming weeks.

In SocketIONative.cpp, I added a destructor and simply call ClearCallbacks() in it. This prevents the native client from invoking the callbacks while it is being destructed. It seems to fix the crash for now, with the downside that you won't get any messages when your socket client is destroyed; but if you're destroying it, you should be handling "connection closed"-type events yourself anyway, imo.
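For reference, the change amounts to something like this (a sketch; the matching declaration in SocketIONative.h is assumed):

```cpp
// SocketIONative.cpp: sketch of the workaround described above, a destructor
// that drops all registered handlers so a client torn down mid-connect can no
// longer call back into a component that has already gone away.
FSocketIONative::~FSocketIONative()
{
	ClearCallbacks();
}
```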

@getnamo (Owner) commented Jul 25, 2017

I think that's the right approach; something akin to https://github.com/getnamo/socketio-client-ue4/blob/master/Source/SocketIOClient/Private/SocketIONative.cpp#L128 should work. The async nature of scheduling a callback means that the object that should listen to it may no longer exist by the time the callback actually runs.

getnamo added the bug label on Jul 25, 2017

@getnamo (Owner) commented Jul 25, 2017

Just wanted to add that an OOM error from using an invalid pointer is typical, and async bugs can produce different kinds of crashes depending on what uses that pointer and when. This strongly points to an object reference that has already been released by the time the callback gets scheduled on the task graph's game-thread pool.

@LordNed (Author) commented Jul 25, 2017

It's actually unrelated to the async disconnect callback; it crashes in both cases. I agree that it's caused by the object being released by the time the callback gets scheduled, but I'm not entirely sure how the native client can be calling the OnFail callback while the native client is nullptr (who is the caller?).

Looking at the code you linked, it's interesting that the Disconnect call doesn't clear callbacks, but SyncDisconnect does. Also, the destructor doesn't try to disconnect on its own if it's still connected?
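If that observation turned into a change, it might look roughly like this; a sketch only, with the existing body of Disconnect assumed rather than quoted:

```cpp
// Sketch: mirror what SyncDisconnect appears to do and clear callbacks in the
// async Disconnect path as well, so a fire-and-forget disconnect cannot call
// back into an object that is about to be released.
void FSocketIONative::Disconnect()
{
	ClearCallbacks();

	// ...existing asynchronous close of the underlying socket.io connection...
}
```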

I'll submit a PR in a bit; I'm going to leave the destructor/ClearCallbacks change in our project for a while first to make sure it's actually a long-term fix.

LordNed changed the title from "Out of Memory Crash" to "Destroying while a Connection is Pending Causes Crashes" on Jul 26, 2017

@getnamo (Owner) commented Jul 29, 2017

I've been thinking about this for a bit and remembered that there has been quite a lot of previous work trying to get async disconnect to behave properly, especially when transitioning between levels. The more I think about it, the more I feel FSocketIONative should live in plugin-scoped memory rather than being owned by the component. That would let the component clean up and cancel its callbacks while the plugin handles async disconnects in the background, without being forced to release its memory, lock, or come back with a released pointer in a callback.

I would also emit an immediate synthetic disconnect call when the component gets closed (e.g. on uninitialize) to allow the desired cleanup to take place within the connect/disconnect callbacks.
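Roughly, that could look like the following sketch; the component-side member and delegate names (NativeClient, bIsConnected, OnDisconnected) are assumed for illustration rather than taken from the current code:

```cpp
// Sketch: emit a synthetic disconnect while the component shuts down, before
// releasing the native client, so listeners still get a "connection closed"
// signal even though the real socket close finishes in the background.
void USocketIOClientComponent::UninitializeComponent()
{
	if (NativeClient.IsValid())
	{
		// Stop any late async callbacks from reaching this component.
		NativeClient->ClearCallbacks();

		if (bIsConnected)
		{
			OnDisconnected.Broadcast(); // synthetic disconnect; parameters omitted
			bIsConnected = false;
		}

		NativeClient = nullptr;
	}

	Super::UninitializeComponent();
}
```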

I'll try to schedule this work for my 4.17 update, as it will take quite a bit of change to get it working as expected.

@LordNed (Author) commented Jul 29, 2017

I agree with emitting a synthetic disconnect call when the component is uninitialized, assuming you can call BP events from that thread (and that we don't run into the same issue when enqueuing it onto the game thread).
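For the game-thread hop specifically, one generic UE pattern (not the plugin's current code) is to capture a weak pointer so the deferred broadcast no-ops if the component is already gone by the time the task runs:

```cpp
// Generic pattern: a stale weak pointer means the component was destroyed,
// so the deferred broadcast silently does nothing instead of crashing.
TWeakObjectPtr<USocketIOClientComponent> WeakThis(this);

AsyncTask(ENamedThreads::GameThread, [WeakThis]()
{
	if (USocketIOClientComponent* Component = WeakThis.Get())
	{
		Component->OnFail.Broadcast(); // parameters omitted for brevity
	}
});
```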

Is it likely that someone would ever want to connect to multiple websocket servers at once? Moving it up to plugin-scoped memory does sound better on the whole, though, since that gives the native client a memory scope of the entire game's lifespan. As long as the component makes it connect/disconnect at the expected times, it shouldn't be an issue. This would let you put your component on your GameInstance/GameMode/Object and the connection lifetime would still behave as expected.

@getnamo (Owner) commented Jul 29, 2017

The lifetimes would remain the same; this merely fixes delayed destruction by letting it happen in a space that doesn't depend on game-world lifetimes.

Imagine an array or set of FSocketIONative pointers that get allocated and destroyed on demand. The command to disconnect would still be triggered when the component gets uninitialized, the callbacks would get cleared, and the plugin would have plenty of time to asynchronously close the connection cleanly.
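All names in the sketch below are illustrative, not the eventual implementation; it only shows the ownership shape being described:

```cpp
// Illustrative sketch: the module owns the native clients, components only
// borrow them, so teardown of a pending connection happens at plugin scope
// instead of inside a dying component.
class FSocketIOClientModule : public IModuleInterface
{
public:
	TSharedPtr<FSocketIONative> CreateNativeClient()
	{
		// Construction details assumed; the point is that the module holds a reference.
		TSharedPtr<FSocketIONative> Client = MakeShareable(new FSocketIONative());
		ActiveClients.Add(Client);
		return Client;
	}

	void ReleaseNativeClient(const TSharedPtr<FSocketIONative>& Client)
	{
		// The owning component is going away: detach its callbacks right now,
		// then let the module finish the async disconnect in the background.
		Client->ClearCallbacks();
		Client->Disconnect();

		// In practice removal would be deferred until the disconnect completes;
		// it is shown inline here only to keep the sketch short.
		ActiveClients.Remove(Client);
	}

private:
	// Plugin-scoped storage: lifetime tied to the module, not to any game world.
	TArray<TSharedPtr<FSocketIONative>> ActiveClients;
};
```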

Testing obviously needed to confirm the above assertions...

@LordNed (Author) commented Jul 29, 2017

> and the plugin would have plenty of time to asynchronously close the connection cleanly.

I'm not entirely sure that's true. When the native client is destroyed while a connection is pending, it invokes the callbacks from a different thread, so even if it had plugin scope we'd still have to enqueue the OnFail-type callbacks onto the main thread, and that's where we hit the crash.

I suspect Uninitialize and plugin Shutdown would be called very close together when the game exits, which leaves you with a race to see whether disconnecting from the socket finishes before the plugin shuts down.

I can see combining the destructor fix I came up with with a plugin-scoped lifetime, but once the destructor fix is applied I'm not sure you still need to move it to plugin scope. I do agree that my fix should synthesize a Disconnect event, though!

@getnamo (Owner) commented Nov 11, 2017

Fixed in f17df08.

Try it out and re-open the issue if the new architecture still has this problem.

getnamo closed this on Nov 11, 2017
