Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issue 17037 - std.concurrency has random segfaults #5004

Merged
merged 1 commit into from
Oct 19, 2021

Conversation

WalterWaldron
Copy link
Contributor

@WalterWaldron WalterWaldron commented Dec 30, 2016

@dlang-bot
Copy link
Contributor

dlang-bot commented Dec 30, 2016

Thanks for your pull request and interest in making D better, @WalterWaldron! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.
Please verify that your PR follows this checklist:

  • My PR is fully covered with tests (you can see the coverage diff by visiting the details link of the codecov check)
  • My PR is as minimal as possible (smaller, focused PRs are easier to review than big ones)
  • I have provided a detailed rationale explaining my changes
  • New or modified functions have Ddoc comments (with Params: and Returns:)

Please see CONTRIBUTING.md for more information.


If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment.

Bugzilla references

Auto-close Bugzilla Severity Description
17037 major std.concurrency has random segfaults

Testing this PR locally

If you don't have a local development environment setup, you can use Digger to test this PR:

dub run digger -- build "master + phobos#5004"

std/concurrency.d Outdated Show resolved Hide resolved
Thread.sleep(dur!("msecs")( 10 ));
else
dosleep = true;
GC.collect;
Copy link
Contributor

@John-Colvin John-Colvin Mar 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A comment would be nice to explain why doing a collection helps here as it's not immediately obvious.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std.concurrency is designed around having a global variable (scheduler) set once at the outset. There is no synchronization for this variable and the implementation does not appear to support changing it on the fly.

However we need to test with both implementations of Scheduler: ThreadScheduler and FiberScheduler.

This function waits until it is the only thread before modifying scheduler (i.e. it's a mutual exclusion hack.)
Collection helps because threads can wait on the finalization action of other threads (e.g. waiting for OwnerTerminated exceptions initiated by static ~this.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation, I actually meant a comment in the source. I suggest something like // wait for all other threads to terminate, using GC.collect to trigger finalizers which may terminate threads (e.g. OwnerTerminated or LinkTerminated) at the top of the loop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know, I was giving the explanation as interim to updating the PR.

@WalterWaldron
Copy link
Contributor Author

Updated according to feedback.

@MetaLang
Copy link
Member

@wilzbach in the same vein as #5515 (comment) why hasn't the bot suggested a reviewer?

@WalterWaldron
Copy link
Contributor Author

@MartinNowak ping please! this has been open for 7 months!
I'm pinging you since you're slated to be std.concurrency code owner.

@wilzbach
Copy link
Member

wilzbach commented Jul 8, 2017

@wilzbach in the same vein as #5515 (comment) why hasn't the bot suggested a reviewer?

Same answer as in #5515 (comment) (we turned the feature of due to too much noise), but #5573 looks very promising.

@wilzbach wilzbach requested a review from MartinNowak July 9, 2017 16:11
@MetaLang
Copy link
Member

MetaLang commented Jul 9, 2017

This has been all green for awhile now which I think should be a pretty good indicator that it at least shouldn't break anything, and if it does we can revert it. Unfortunately Martin is a pretty busy guy so it's hard to say when he'll get to this.

@WalterWaldron
Copy link
Contributor Author

WalterWaldron commented Jul 9, 2017

This has been all green for awhile now which I think should be a pretty good indicator that it at least shouldn't break anything, and if it does we can revert it.

It can't break code because it only modifies the unittests. My changes are:

  • Add changeScheduler inside version(unittest) block: This function is a hack to make changing the scheduler in unittests more sane.
  • Modify problematic unit test: failures in other threads are reported to the main thread which executes the assertion:
    • this avoids depending on exceptions thrown in other threads to signal test failure.
    • this allows blocking the main thread until test completion to avoid interference

@MetaLang
Copy link
Member

MetaLang commented Jul 9, 2017

It can't break any code because it only modifies the unittests

What I was getting at is any unittest build would fail if one of the tests was broken. It's pretty common for people to run the full test suite for Phobos locally, and would also break the auto tester.

Copy link
Member

@JackStouffer JackStouffer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any problems with this. I'll leave this open for two or three more days and merge if no one has any more comments.

Copy link
Member

@MartinNowak MartinNowak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How exactly does the race condition manifest?

Add changeScheduler inside version(unittest) block: This function is a hack to make changing the scheduler in unittests more sane.

From a first look, I'd say we should make the tests more sane instead.

Thread.sleep(dur!("msecs")( 10 ));
else
sleepFirst = true;
GC.collect;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks frightening? Do we really only send Owner/LinkTerminated messages when the thread object get's collected? Sounds horribly unreliable as the other peer might hold some (implicit) reference to the thread.
If so we should add some onThreadExit hook to core.thread or wrap the thread function with some scope (exit) guard.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really only send Owner/LinkTerminated messages when the thread object get's collected?

They get sent when the module destructor is run for threads, and via scope(exit) for fibers, so the comment I added must be wrong.

I have look at the code again (it's been so long since I made this PR) to see whether this was just a hack for force bad tests to hang (instead of random failures,) or whether it was necessary.


changeScheduler(new ThreadScheduler);
scheduler.spawn(testdg);
assert(receiveOnly!bool());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhat unclear, is this really the last life-signal of thread being spawned?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I made it so that the test result (failure/success) was communicated back to the main thread instead of relying on exceptions being re-thrown (like it had been previously.)

@WalterWaldron
Copy link
Contributor Author

WalterWaldron commented Jul 14, 2017

How exactly does the race condition manifest?

Typically it was failing like this:

@property ref ThreadInfo thisInfo() nothrow
{
    if (scheduler is null) // scheduler is modified from non-null to null after this test
        return ThreadInfo.thisInfo;
    return scheduler.thisInfo; // Scheduler is null
}

scheduler was concurrently modified by the unit test code.

From a first look, I'd say we should make the tests more sane instead.

The problem is that the code being tested references the global variable (__gshared Scheduler scheduler;) and we want to test both ThreadScheduler and FiberScheduler implementations. The code assumes scheduler will be set once and left.

@WalterWaldron
Copy link
Contributor Author

WalterWaldron commented Jul 14, 2017

I've removed the changeScheduler hack, it wasn't necessary (it must have been left over from when I was debugging because it would make bad tests hang instead of pass & randomly segfault.)

@WalterWaldron
Copy link
Contributor Author

WalterWaldron commented Jul 14, 2017

I think it's still necessary to serialize changing the scheduler, however I don't think the GC.collect part should be present (Thread.pause until other threads have exited.)
With the modified test code, I think the following could still happen:

Main Thread              |    Thread 1    |    Thread 2
                         |  static ~this  | 
                         |    cleanup     |    Owner/Link Terminated
assert(receiveOnly...    |                | 
scheduler = new ...      |                |
scheduler.start( ...     |                |
assert(receiveOnly...    |                |    static ~this
scheduler = null         |                |    thisInfo()      (data race on scheduler)

@Imperatorn
Copy link
Contributor

Is this still "frightening". More review needed?

@dlang-bot dlang-bot merged commit 32ecd42 into dlang:master Oct 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants