Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race Condition Crash in FSIOLambdaRunnable #60

Closed
LordNed opened this issue Aug 9, 2017 · 3 comments

Comments

Projects
None yet
2 participants
@LordNed
Copy link

commented Aug 9, 2017

This was a fun one to track down. It only happens in builds and it was very sporadic. I'd add a space and recompile and it'd crash constantly every time the Connect was called. I'd remove the space and compile again and it'd go away.

I finally tracked it down to FSIOLambdaRunnable (one of which is created upon calling the Connect function). The issue is that FSIOLambdaRunnable calls "delete this" in the Exit() function.

The issue is that calling Thread = FRunnableThread::Create(this, *threadStatGroup, 0, TPri_BelowNormal); (source) can lead to a race condition whereif the runnable thread is created, executes the lambda and returns before the main thread finishes the call to ::Create, you have now deleted the FSIOLambdaRunnable while... still inside of it. This finishes the function call on released memory and then the engine crashes a moment later with random differing callstack as it tries to use the memory that you helpfully overwrote.

I came up with a temporary solution so we could ship, but I can't submit it as a PR without some additional discussion as I think it leads to an additional edge case. The idea is that you remove "delete this" from the exit function (so that if the lambda finishes before the main thread completes, we don't delete the FSIOLambdarUnnable)

void FSIOLambdaRunnable::Exit()
{
	Finished = true;
        //delete this; <- Don't delete FSIOLambdaRunnable on thread exit as thread can exit before constructor finishes on main thread.
}

This means we now are responsible for the lifecycle of FSIOLambdaRunnable. It is inadvisable that FSIOLambdaRunnable controls the lifecycle of both itself and its FRunnableThread. Then, when you use the FSIOLambdaRunnable you need to store the pointer to the runnable and delete it in the deconstructor, and before you use it again. (Source)

Example:

FSocketIONative::~FSocketIONative()
{
	ClearCallbacks();
	delete ConnectionThread;
}
void FSocketIONative::Connect(const FString& InAddressAndPort, const TSharedPtr<FJsonObject>& Query /*= nullptr*/, const TSharedPtr<FJsonObject>& Headers /*= nullptr*/)
{
        delete ConnectionThread;
        ConnectionThread = FSIOLambdaRunnable::RunLambdaOnBackGroundThread([&,Query, Headers]() {});
}

This technically lets the Runnable exist in memory for longer than strictly needed but will be cleaned up when FSocketIONative is removed. However, this leads to a second issue - you need to delete ConnectionThread before you re-assign it otherwise you leak memory, thus the second delete call.

This however leads to another issue - if you delete ConnectionThread from the main thread while the FSIOLambdaRunnable is still executing something you crash again! I think the fix for this would be effectively:

void FSocketIONative::Connect(const FString& InAddressAndPort, const TSharedPtr<FJsonObject>& Query /*= nullptr*/, const TSharedPtr<FJsonObject>& Headers /*= nullptr*/)
{
        if(ConnectionThread)
        {
                ConnectionThread->WaitForCompletion();
                delete ConnectionThread;
        }
        ConnectionThread = FSIOLambdaRunnable::RunLambdaOnBackGroundThread([&,Query, Headers]() {});
}

This has the disadvantage of stalling the main thread until your background thread completes which kind of defeats the purpose of a background thread, but it'd only happen if you called the function twice in a row before the existing one completed. The only other way I can think of is finding a way to safely delete a thread while it's in the middle of executing a function, but I imagine that can still leave things in a bad state.

The other solution I could think of would be to safeguard inside the code that uses FSIOLambdaRunnable, ie: It sets a "IsConnecting" bool to true and the next time you try to call Connect if it's already connecting it just early outs and warns, and then a callback on the exit of the Runnable sets IsConnecting to false once the function actually completes the first time. Then it'd be safe to just delete ConnectionThread before assigning it, since it'd never get to that code if the current ConnectionThread was still in use.

I'm opening this as an issue and not a PR to start a discussion about the best way to fix it, because I'm honestly not sure. ¯\_(ツ)_/¯

@getnamo getnamo added the bug label Aug 9, 2017

@getnamo

This comment has been minimized.

Copy link
Owner

commented Aug 12, 2017

I'm still convinced we can encapsulate the life cycle inside the lambda, but I need to be able to replicate this issue on my end to explore ways to achieve this. The usual way is to add a lock or two somewhere ensuring things happen according to the desired order.

Do you have a minimal fresh project you can make that will cause this condition?

@LordNed

This comment has been minimized.

Copy link
Author

commented Aug 15, 2017

Don't have a minimal project, but suggestions for replication is:
Pass nullptr to the current FSIOLambdaRunnable in the Connect function, make a shipping build and run that. It only shows up in shipping builds, probably due to optimization changes.

@getnamo

This comment has been minimized.

Copy link
Owner

commented Nov 11, 2017

Fixed in f17df08.

The plugin now uses plugin scoped memory and the race conditions should be a thing of the past (hopefully).

Try it out and re-open the issue if the new architecture still has this problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.