Fix issue 17037 - std.concurrency has random segfaults #5004

WalterWaldron · 2016-12-30T04:41:33Z

https://issues.dlang.org/show_bug.cgi?id=17037

dlang-bot · 2016-12-30T04:41:34Z

Thanks for your pull request and interest in making D better, @WalterWaldron! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.
Please verify that your PR follows this checklist:

My PR is fully covered with tests (you can see the coverage diff by visiting the details link of the codecov check)
My PR is as minimal as possible (smaller, focused PRs are easier to review than big ones)
I have provided a detailed rationale explaining my changes
New or modified functions have Ddoc comments (with Params: and Returns:)

Please see CONTRIBUTING.md for more information.

If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment.

Bugzilla references

Auto-close	Bugzilla	Severity	Description
✓	17037	major	std.concurrency has random segfaults

Testing this PR locally

If you don't have a local development environment set up, you can use Digger to test this PR:

dub run digger -- build "master + phobos#5004"

John-Colvin · 2017-03-07T00:06:53Z

std/concurrency.d

+                Thread.sleep(dur!("msecs")( 10 ));
+            else
+                dosleep = true;
+            GC.collect;


A comment would be nice to explain why doing a collection helps here as it's not immediately obvious.

std.concurrency is designed around having a global variable (scheduler) set once at the outset. There is no synchronization for this variable and the implementation does not appear to support changing it on the fly.

However we need to test with both implementations of Scheduler: ThreadScheduler and FiberScheduler.

This function waits until it is the only thread before modifying scheduler (i.e. it's a mutual exclusion hack.)
Collection helps because threads can wait on the finalization action of other threads (e.g. waiting for OwnerTerminated exceptions initiated by static ~this.)

Thanks for the explanation, but I actually meant a comment in the source. I suggest putting something like this at the top of the loop: // wait for all other threads to terminate, using GC.collect to trigger finalizers which may terminate threads (e.g. OwnerTerminated or LinkTerminated)

I know, I was giving the explanation in the interim, before updating the PR.

WalterWaldron · 2017-03-18T19:43:58Z

Updated according to feedback.

MetaLang · 2017-06-29T02:04:11Z

@wilzbach in the same vein as #5515 (comment) why hasn't the bot suggested a reviewer?

WalterWaldron · 2017-07-08T18:50:42Z

@MartinNowak ping please! this has been open for 7 months!
I'm pinging you since you're slated to be std.concurrency code owner.

wilzbach · 2017-07-08T22:21:28Z

@wilzbach in the same vein as #5515 (comment) why hasn't the bot suggested a reviewer?

Same answer as in #5515 (comment) (we turned the feature off due to too much noise), but #5573 looks very promising.

MetaLang · 2017-07-09T18:01:55Z

This has been all green for a while now, which I think should be a pretty good indicator that it at least shouldn't break anything, and if it does we can revert it. Unfortunately Martin is a pretty busy guy, so it's hard to say when he'll get to this.

WalterWaldron · 2017-07-09T20:39:07Z

This has been all green for a while now, which I think should be a pretty good indicator that it at least shouldn't break anything, and if it does we can revert it.

It can't break code because it only modifies the unittests. My changes are:

Add changeScheduler inside version(unittest) block: This function is a hack to make changing the scheduler in unittests more sane.
Modify problematic unit test: failures in other threads are reported to the main thread which executes the assertion:
- this avoids depending on exceptions thrown in other threads to signal test failure.
- this allows blocking the main thread until test completion to avoid interference

MetaLang · 2017-07-09T20:41:09Z

It can't break any code because it only modifies the unittests

What I was getting at is that any unittest build would fail if one of the tests was broken. It's pretty common for people to run the full test suite for Phobos locally, and a broken test would also break the auto tester.

JackStouffer

I don't see any problems with this. I'll leave this open for two or three more days and merge if no one has any more comments.

MartinNowak

How exactly does the race condition manifest?

Add changeScheduler inside version(unittest) block: This function is a hack to make changing the scheduler in unittests more sane.

From a first look, I'd say we should make the tests more sane instead.

MartinNowak · 2017-07-14T12:56:01Z

std/concurrency.d

+                Thread.sleep(dur!("msecs")( 10 ));
+            else
+                sleepFirst = true;
+            GC.collect;


That looks frightening. Do we really only send Owner/LinkTerminated messages when the thread object gets collected? That sounds horribly unreliable, as the other peer might hold some (implicit) reference to the thread.
If so, we should add some onThreadExit hook to core.thread or wrap the thread function with some scope(exit) guard.

Do we really only send Owner/LinkTerminated messages when the thread object gets collected?

They get sent when the module destructor is run for threads, and via scope(exit) for fibers, so the comment I added must be wrong.

I have to look at the code again (it's been so long since I made this PR) to see whether this was just a hack to force bad tests to hang (instead of failing randomly), or whether it was necessary.

MartinNowak · 2017-07-14T13:02:00Z

std/concurrency.d

+
+    changeScheduler(new ThreadScheduler);
+    scheduler.spawn(testdg);
+    assert(receiveOnly!bool());


Somewhat unclear: is this really the last life-signal of the thread being spawned?

Yes, I made it so that the test result (failure/success) was communicated back to the main thread instead of relying on exceptions being re-thrown (like it had been previously.)
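
As a rough illustration of the reporting pattern described here (not the PR's actual test code), the spawned thread can catch its own failures and send a single bool back to its owner, which is the only place that asserts. The names worker and runWorkerAndWait are hypothetical, and plain spawn is used instead of scheduler.spawn to keep the sketch short.

```d
import std.concurrency : ownerTid, receiveOnly, send, spawn;
import std.exception : enforce;

// Hypothetical worker: runs its checks, then reports success or failure
// to the owner instead of letting an exception escape the thread.
void worker()
{
    bool ok;
    try
    {
        // Stand-in for the real test body.
        enforce(1 + 1 == 2, "check failed");
        ok = true;
    }
    catch (Exception e)
    {
        ok = false;
    }
    ownerTid.send(ok);
}

// Hypothetical driver: blocks the calling thread until the worker has
// reported, so the caller knows the worker reached its final send.
bool runWorkerAndWait()
{
    spawn(&worker);
    return receiveOnly!bool();
}

void main()
{
    assert(runWorkerAndWait());
}
```

Because the owner blocks in receiveOnly!bool, it cannot move on to the next scheduler change before the worker has sent its result, although the worker's thread-exit cleanup may still be running at that point.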

WalterWaldron · 2017-07-14T15:05:29Z

How exactly does the race condition manifest?

Typically it was failing like this:

@property ref ThreadInfo thisInfo() nothrow
{
    if (scheduler is null) // check passes: scheduler is still non-null here
        return ThreadInfo.thisInfo;
    // another thread sets scheduler to null between the check and this use
    return scheduler.thisInfo; // null dereference
}

scheduler was concurrently modified by the unit test code.

From a first look, I'd say we should make the tests more sane instead.

The problem is that the code being tested references the global variable (__gshared Scheduler scheduler;) and we want to test both ThreadScheduler and FiberScheduler implementations. The code assumes scheduler will be set once and left.

WalterWaldron · 2017-07-14T16:14:16Z

I've removed the changeScheduler hack, it wasn't necessary (it must have been left over from when I was debugging because it would make bad tests hang instead of pass & randomly segfault.)

WalterWaldron · 2017-07-14T17:43:04Z

I think it's still necessary to serialize changing the scheduler, however I don't think the GC.collect part should be present (Thread.pause until other threads have exited.)
With the modified test code, I think the following could still happen:

Main Thread              |    Thread 1    |    Thread 2
                         |  static ~this  | 
                         |    cleanup     |    Owner/Link Terminated
assert(receiveOnly...    |                | 
scheduler = new ...      |                |
scheduler.start( ...     |                |
assert(receiveOnly...    |                |    static ~this
scheduler = null         |                |    thisInfo()      (data race on scheduler)
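
The serialization argued for above, without the GC.collect step, might look roughly like the following. This is only a sketch under stated assumptions, not the code that was merged: changeSchedulerSketch is a hypothetical name, the 10 ms poll interval is taken from the diff hunks quoted earlier, and the approach assumes that only the calling thread spawns new threads while it waits.

```d
import core.thread : Thread;
import core.time : msecs;
import std.concurrency : FiberScheduler, Scheduler, ThreadScheduler, scheduler;

// Hypothetical helper: wait until the caller is the only thread known to
// the runtime, then swap the global scheduler.
void changeSchedulerSketch(Scheduler next)
{
    // Thread.getAll() includes the calling thread, so a length of 1
    // means no other runtime-managed thread is still alive.
    while (Thread.getAll().length > 1)
        Thread.sleep(10.msecs);
    scheduler = next;
}

void main()
{
    changeSchedulerSketch(new ThreadScheduler);
    assert(cast(ThreadScheduler) scheduler !is null);

    changeSchedulerSketch(new FiberScheduler);
    assert(cast(FiberScheduler) scheduler !is null);

    // Restore the default (no custom scheduler) when done.
    changeSchedulerSketch(null);
    assert(scheduler is null);
}
```

Polling like this is still a test-only hack: it does not make scheduler safe to change in general, it only ensures that no other thread is running a static ~this or calling thisInfo() at the moment of the assignment.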

Imperatorn · 2021-10-19T08:35:41Z

Is this still "frightening"? Is more review needed?

Addressed

WalterWaldron force-pushed the fix17037 branch 3 times, most recently from 4766211 to 3bd0258 Compare January 4, 2017 21:23

JackStouffer added the Bug Fix label Jan 5, 2017

JackStouffer added the Needs Review label Jan 18, 2017

WalterWaldron mentioned this pull request Jan 24, 2017

combine std.array.split() overloads into one #5063

Closed

John-Colvin reviewed Mar 7, 2017

John-Colvin reviewed Mar 7, 2017

WalterWaldron force-pushed the fix17037 branch from 3bd0258 to aafa6d9 Compare March 18, 2017 19:43

WalterWaldron mentioned this pull request Mar 28, 2017

Fix issue 8411 - add opCast!bool support for Duration. dlang/druntime#1793

Merged

dlang-bot added Needs Work stalled labels Jun 22, 2017

WalterWaldron force-pushed the fix17037 branch from aafa6d9 to 26a102e Compare June 27, 2017 01:37

MetaLang mentioned this pull request Jul 1, 2017

Generator implements InputRange interface #5515

Merged

wilzbach requested a review from MartinNowak July 9, 2017 16:11

JackStouffer approved these changes Jul 10, 2017

MartinNowak previously requested changes Jul 14, 2017

JackStouffer self-assigned this Jul 14, 2017

WalterWaldron force-pushed the fix17037 branch from 26a102e to bc7550f Compare July 14, 2017 16:11

dlang-bot added Needs Rebase and removed Needs Rebase labels Dec 29, 2017

Fix issue 17037 - std.concurrency has random segfaults

2c6051d

RazvanN7 force-pushed the fix17037 branch from bc7550f to 2c6051d Compare October 16, 2021 10:35

dlang-bot removed Needs Work stalled labels Oct 16, 2021

RazvanN7 removed the Needs Review label Oct 16, 2021

RazvanN7 approved these changes Oct 16, 2021

RazvanN7 added the auto-merge label Oct 16, 2021

dlang-bot added the stalled label Oct 16, 2021

dlang-bot merged commit 32ecd42 into dlang:master Oct 19, 2021
