
MT GC #7546

Merged
merged 13 commits into crystal-lang:master from bcardiff:mt/gc, Apr 16, 2019

Conversation

@bcardiff (Member) commented Mar 14, 2019

This PR allows the GC to work with threads running different fibers. It does not change the scheduling at all. A sample program is added that manually schedules fibers to threads.
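As a rough illustration of the idea, a minimal sketch of driving fibers from an explicitly created thread (a hypothetical shape that glosses over scheduler details; the real program is samples/mt_gc_test.cr):

results = Array(Int32).new(4, 0)
queue = Deque(Fiber).new
4.times do |i|
  # Each fiber allocates, so a collection can happen while it runs.
  queue << Fiber.new { results[i] = Array.new(1000) { i }.sum }
end

# Resume the fibers from a manually created thread instead of the
# default scheduler.
thread = Thread.new do
  while fiber = queue.shift?
    fiber.resume
  end
end
thread.join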

To test this branch you need a custom build of bdwgc.

The path used to find libgc.a is hard-coded for now.

To be able to run this you need to check out and build bdwgc in a specific location.
In your crystal working directory, with the mt/gc branch checked out:

$ pushd ..
$ git clone http://github.com/crystal-lang/bdwgc
$ cd bdwgc
$ git checkout release-7_6-crystal
$ git clone https://github.com/ivmai/libatomic_ops
$ cd libatomic_ops
$ git checkout v7.6.10
$ cd ..
$ autoreconf -vif
$ ./configure --disable-debug --disable-dependency-tracking --disable-shared --enable-large-config
$ make 
$ ls .libs/
libcord.a   libcord.la  libcord.lai libgc.a     libgc.la    libgc.lai

$ popd # Go back to the crystal working directory

$ make clean deps

# To run the samples/mt_gc_test.cr (no need to build the compiler)
$ ./bin/crystal samples/mt_gc_test.cr -D preview_mt_gc

# To build the compiler itself
$ make crystal FLAGS="-D preview_mt_gc"
  • The PR adds a simple flag to check whether a fiber is running or not. This is only used for GC purposes, and the GC disable/enable around the context switch ensures the information is only read in a consistent state. Yet that check might be affected by the work in #7407.
  • One of the extensions to bdwgc, GC_get_stackbottom, gets the stack bottom of the current thread. That might not be needed if #6970 evolves successfully.
  • The other extension to bdwgc, GC_switch_to_corutine, and how it is used in the Scheduler/Fiber, is the real game changer here (a hypothetical binding sketch follows this list).
  • The whole patch to bdwgc can be found at crystal-lang/bdwgc@417756b
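For reference, a hypothetical Crystal binding for the two extensions named above. The signatures here are assumptions; the authoritative definitions are in the linked bdwgc patch:

lib LibGC
  # Returns the stack bottom registered for the current thread.
  fun get_stackbottom = GC_get_stackbottom : Void*
  # Declares which coroutine (fiber) stack is now live on this thread.
  fun switch_to_corutine = GC_switch_to_corutine(stack_bottom : Void*)
end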

This is based on some of the past efforts of (and present talks with 😄) @ggiraldez and @waj

@@ -57,9 +57,15 @@ class Crystal::Scheduler
end

protected def resume(fiber : Fiber) : Nil
GC.disable

@yxhuvud (Contributor) commented Mar 14, 2019

I wonder, would it be a good idea to have a block version of GC.disable that re-enables the GC when the block is exited?
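For illustration, a minimal sketch of what such a block overload could look like (hypothetical; not part of this PR). It relies on Crystal dispatching the block and non-block overloads separately:

module GC
  def self.disable(&)
    disable      # existing non-block overload
    begin
      yield
    ensure
      enable     # re-enable even if the block raises
    end
  end
end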

@bcardiff (Author, Member) commented Mar 14, 2019

Due to how things work, I would prefer not to use a block version. This GC.disable is paired with the GC.enable in Fiber#run or the trailing GC.enable in Scheduler#resume, depending on whether the fiber argument is new or not.
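A condensed sketch of the pairing described here (method shapes from the PR, bodies heavily abbreviated):

# Scheduler#resume
protected def resume(fiber : Fiber) : Nil
  GC.disable            # no collection may run while stacks are half-swapped
  # ... save the current context and switch to `fiber` ...
  GC.enable             # reached when an existing fiber resumes here
end

# Fiber#run
def run
  GC.enable             # a brand-new fiber re-enables the GC here instead
  # ... invoke the fiber's proc ...
end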

@vladfaust (Contributor) commented Mar 14, 2019

So, the Thread.new method has already been available before, but new threads were not GC'ed, right? And if this gets merged, then we can create true multi-threaded applications?

If I understand correctly, #7214 was about "silently" running fibers in different threads, while this PR is about GC'ing these threads. So these two PRs are kind of complementary?

I'm sorry if I'm mistaken 😅

@ysbaddaden (Member) commented Mar 14, 2019

Do you get acceptable performance?

I initially used GC.disable and GC.enable while running a context switch, but the performance was awful. I also tried masking the signal used to stop the world, which improved performance to "not that horrible" with multiple threads (but very bad with a single one); it was Linux-only though, so not portable. See #7214 (comment) for benchmarks.

I eventually did what the single-msqueue branch did: use a rwlock, and performance got surprisingly acceptable (not using ConcurrencyKit but a custom multi-reader single-writer lock). See ysbaddaden@ab7bfb4 and a benchmark comparable to the previous disable/sigmask link (1, 2, or 4 threads switching between 128 fibers):

switch[1/128]: crystal: 10000000 yields in 411 ms, 24285940 yields per second
switch[2/128]: crystal: 10000000 yields in 991 ms, 10082371 yields per second
switch[4/128]: crystal: 10000000 yields in 949 ms, 10526425 yields per second

I also push thread stacks on GC collect, not on each and every context switch. I never benchmarked it, but since there are only a few threads, I assume it's faster.
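For illustration, a minimal multi-reader single-writer spin lock in Crystal (a sketch of the general technique, not the implementation in ysbaddaden@ab7bfb4):

class RWLock
  def initialize
    @writer = Atomic(Int32).new(0)
    @readers = Atomic(Int32).new(0)
  end

  def lock_read
    loop do
      @readers.add(1)              # optimistically announce ourselves as a reader
      return if @writer.get == 0   # no writer: read lock acquired
      @readers.sub(1)              # a writer is active: back off and wait
      while @writer.get != 0
      end
    end
  end

  def unlock_read
    @readers.sub(1)
  end

  def lock_write
    while @writer.swap(1) != 0     # claim the writer flag
    end
    while @readers.get != 0        # wait for in-flight readers to drain
    end
  end

  def unlock_write
    @writer.set(0)
  end
end

In the GC scenario, each context switch would take the read lock (many threads may switch concurrently) and the collector the write lock, so a collection only starts once no thread is mid-switch.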

@bcardiff (Author, Member) commented Mar 14, 2019

@ysbaddaden currently I'm just using the std specs, the compiler specs and the sample. I didn't experience a slowdown, but probably because the only MT code checked is the sample (which is unable to run in ST).

The GC.disable/GC.enable calls could be wrapped in flag?(:preview_mt_gc) if needed, to allow checking things in MT before enabling the feature.

I tried to avoid calling GC_switch_to_corutine (aka set_thread_stack_bottom) in the context switch, calling it just in GC.before_collect instead, but that didn't work. Some collections can be triggered by the GC after a context switch without firing the before_collect callback, and then the stack information in the GC was wrong, causing a lovely segfault. The sample script with 8 threads and 100 fibers consistently triggered a collection with wrong stack information for one of the threads. Having a set_thread_stack_bottom(thread, stack_bottom) would be better in terms of performance, but I didn't succeed, so I ended up using a more declarative API like GC_switch_to_corutine.

  • Would it be better, then, to wrap the GC.disable/GC.enable calls, to at least avoid dragging down ST performance for now?

  • Should I use :mt or :preview_mt as a flag directly, so other MT features can be enabled all together?

  • From ysbaddaden@ab7bfb4 we could include the RWLock and add the integration with boehm. I am not sure it's sound to change atomic.cr from :sequentially_consistent to :acquire/:release. It seems the former puts stronger constraints:

    Sequentially-consistent ordering
    Atomic operations tagged memory_order_seq_cst not only order memory the same way as release/acquire ordering (everything that happened-before a store in one thread becomes a visible side effect in the thread that did a load), but also establish a single total modification order of all atomic operations that are so tagged.

  • I will submit another set of instructions that will require creating a .pc file, and remove the hack path, so we have better chances of merging and testing this.

  • I would like to have some specific benchmark ready to run, to better decide whether the rwlock is worth it. Do you have the HTTP stress test that you used?

@bcardiff (Author, Member) commented Mar 14, 2019

@vladfaust Thread.new was already there, just not as a public API. As long as you don't use fibers/IO on secondary threads, everything is fine. So, don't use it.

@Fryguy (Contributor) commented Mar 14, 2019

Minor, but is corutine a typo (expected coroutine), or was that intentional for name collision reasons?

@bcardiff (Author, Member) commented Mar 14, 2019

I need to learn English. Thanks @Fryguy

@ysbaddaden (Member) commented Mar 15, 2019

It took me some time to understand the atomic memory ordering constraints, but AFAIK the memory ordering declares memory barriers for LLVM's optimizer to not reorder across atomic instructions. Since I only use Atomic::Flag for creating locks, I only care that instructions within :acquire and :release are never moved before or after the atomic instructions; instructions outside of the atomics can be reordered within. :seq_cst is more strict, and nothing can cross the memory barriers.

I believe that acquire/release is fine for locks. I'm not sure there are other use cases for Atomic::Flag that would need a strict sequential memory ordering?
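For reference, the lock pattern under discussion, sketched with the stdlib's Atomic::Flag (test_and_set is the acquire point, clear the release point):

class SpinLock
  @flag = Atomic::Flag.new

  def lock
    until @flag.test_and_set
      # spin: another thread holds the lock
    end
  end

  def unlock
    @flag.clear   # publishes the critical section's writes to the next locker
  end
end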

@bcardiff bcardiff force-pushed the bcardiff:mt/gc branch from 0a64948 to c2e05f4 Mar 22, 2019

@bcardiff (Author, Member) commented Mar 22, 2019

I've rebased this on (almost) master, integrated changes from #7551 and #7551 (comment), and compared with the GC aspects of #7547.
#7547 helped me identify the problem mentioned at #7546 (comment): I had missed swapping the proc calls in GC.before_collect.

I've created a couple of branches to compare strategies (more on that later).

To check this you first need to build the GC with custom extensions. It matches this patch.

$ pushd ..
$ git clone http://github.com/crystal-lang/bdwgc
$ cd bdwgc
$ git checkout mt-gc-benchmark
$ git clone https://github.com/ivmai/libatomic_ops
$ cd libatomic_ops
$ git checkout v7.6.10
$ cd ..
$ autoreconf -vif
$ ./configure --disable-debug --disable-dependency-tracking --disable-shared --enable-large-config
$ make 
$ ls .libs/
libcord.a   libcord.la  libcord.lai libgc.a     libgc.la    libgc.lai

Also, create the following gc.pc file, replacing /path/to/bdwgc with the full path of the bdwgc working directory.

$ cat gc.pc
prefix=/path/to/bdwgc
libdir=${prefix}/.libs

Name: bdwgc
Description:
Version: 7.6.12+mt
Libs: ${libdir}/libgc.a

Then we build samples/mt_gc_test.cr; it now has some command line options to measure its performance.
I've checked some alternatives, each in its own branch.

  • branch: mt/gc-switch_to_coroutine. The GC_switch_to_coroutine is used in Fiber.resume. GC_enable/GC_disable for locking.
  • branch: mt/gc-set_stackbottom-in-resume-lock. The GC_set_stackbottom is used in Fiber.resume with Thread.current. GC_enable/GC_disable for locking.
  • branch: mt/gc-set_stackbottom-lock. The GC_set_stackbottom is used in GC.before_collect. GC_enable/GC_disable for locking.
  • branch: mt/gc-set_stackbottom-rwlock. The GC_set_stackbottom is used in GC.before_collect. Custom GC::RWLock for locking.

Note: these branches are in my fork

The last three use the same GC extension, so I wanted to check whether there was any benefit to using the implicit current thread or not.
If not, it is better to have a more flexible API, since it can be used, as shown, in different ways (Fiber.resume vs GC.before_collect).
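For illustration, a minimal sketch of the before_collect variant. The iteration helper, binding name and fiber accessors here are assumptions, not the PR's actual code:

GC.before_collect do
  Thread.unsafe_each do |thread|    # iteration helper name assumed
    # Point the GC at the stack that is currently live on each thread.
    LibGC.set_stackbottom(thread.to_unsafe, thread.current_fiber.@stack_bottom)
  end
end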

After checking out each STRATEGY branch, to build the mt_gc_test the following can be used:

$ rm -rf ~/.cache/crystal/ # clear temp files because reasons
$ (export PKG_CONFIG_PATH=/path/to/bdwgc:$PKG_CONFIG_PATH && ./bin/crystal build samples/mt_gc_test.cr --release -D preview_mt -o gc-STRATEGY)

Note: I changed the flag to preview_mt.

Additionally, as a baseline, a version with -D gc_none can also be compiled:

$ rm -rf ~/.cache/crystal/ # clear temp files because reasons
$ ./bin/crystal build samples/mt_gc_test.cr --release -D gc_none -o gc-none

Finally, things like the following can be used to measure the performance of different strategies.

$ ./gc-none -m -f 1000 -t 2 -l 20
$ ./gc-switch_to_coroutine -m -f 1000 -t 2 -l 20
$ ./gc-set_stackbottom-in-resume-lock -m -f 1000 -t 2 -l 20
$ ./gc-set_stackbottom-lock -m -f 1000 -t 2 -l 20
$ ./gc-set_stackbottom-rwlock -m -f 1000 -t 2 -l 20

I've tested this on OSX for a couple of variables. Results can be viewed here.

So far it seems that the best alternative is gc-set_stackbottom-in-resume-lock, but they are all very similar.
The gc-set_stackbottom-rwlock variant seems to hang on some examples, but if that strategy ends up working, the same GC extension could be used.

I have moved the mt/gc branch to mt/gc-set_stackbottom-in-resume-lock so that it can be reviewed/approved in this PR.

Of course, if more experiments show that another alternative is better, we can switch to that. But for now, that strategy seems enough.

@bcardiff bcardiff marked this pull request as ready for review Mar 22, 2019

@bcardiff bcardiff requested a review from ysbaddaden Mar 22, 2019

@RX14

RX14 approved these changes Mar 23, 2019

@bcardiff bcardiff force-pushed the bcardiff:mt/gc branch from c2e05f4 to 582a755 Mar 25, 2019

@bcardiff (Author, Member) commented Mar 25, 2019

I did the edits in the sample code that @ysbaddaden suggested. The most notable change is that the gc_none version is even faster; the others didn't change much for now. So I will stick with the previous decision of using mt/gc-set_stackbottom-in-resume-lock as a starting point.

@ysbaddaden (Member) commented Mar 27, 2019

I tweaked my implementation a bit to try different lock strategies. See ysbaddaden@0930600 and ysbaddaden@58af519.

Using my HTTP::Server with different tests (IO bound, GC bound, CPU bound) I got slightly better values with pthread_rwlock and my custom rwlock than with GC.disable, but nothing really significant. I think the overhead mostly lies in libevent and IO syscalls, which somewhat erases the context switch overhead.

Yet, using my context switch benchmarks, the overhead of a context switch (including the resumable CAS 1 to 2) is significantly different across strategies. See this plot, using gc_none as a reference (zero overhead):
[plot omitted: per-strategy context switch overhead]

I'm worried about the hangup you got with my rwlock, but I couldn't reproduce it, despite running HTTP::Server with a GC collect running every few requests :/

@ysbaddaden (Member) commented Mar 27, 2019

I'm fine with using GC.disable for now, but I'd love the GC to have the 4 methods lock_read, lock_write, unlock_read and unlock_write, so we could change what's used.
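A minimal sketch of that surface, with the bodies mapped onto GC.disable/GC.enable as placeholders (this mapping is an assumption; a rwlock strategy would only swap these bodies):

module GC
  # Read side: taken around every context switch.
  def self.lock_read;    disable; end
  def self.unlock_read;  enable;  end
  # Write side: taken by the collector itself.
  def self.lock_write;   disable; end
  def self.unlock_write; enable;  end
end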

@bcardiff bcardiff added this to the 0.28.0 milestone Mar 27, 2019

@bcardiff bcardiff force-pushed the bcardiff:mt/gc branch from 582a755 to a8b0cdb Apr 1, 2019

@vladfaust (Contributor) commented Apr 1, 2019

I'm so happy to see the 0.28 milestone on it! 🚀

Drum beat intensifies 🥁

@bcardiff bcardiff force-pushed the bcardiff:mt/gc branch from a8b0cdb to 8c3cc4b Apr 1, 2019

@bcardiff (Author, Member) commented Apr 1, 2019

@ysbaddaden with the last push:

  • The GC has the 4 methods lock_read, lock_write, unlock_read and unlock_write, yet only one strategy is used.
  • Changing the strategy should imply changes only in boehm.cr.
  • I fixed the hang problems. There was a sync issue with the worker threads when finishing/starting each round of the stress test; it is fixed in 81783cf.
  • Thread.current is already used several times in a context switch: to access the scheduler, and to set the stack_bottom. Accessing it one more time to check the single resume isn't problematic. I still prefer to have the Fiber::Context as an @[Extern].

@bcardiff bcardiff force-pushed the bcardiff:mt/gc branch from 8c3cc4b to 01ace5b Apr 1, 2019

@bcardiff bcardiff referenced this pull request Apr 3, 2019

Merged

Bump distribution-scripts #7622

@ysbaddaden (Member) commented Apr 10, 2019

I don't have time to review properly. From your last comment, the patch sounds acceptable.

About Thread.current: there should ideally be a single call to it during a context switch, to reach the current thread's scheduler. Setting the stack bottom can be moved to GC.before_collect without delaying GC marker threads (see 0930600#diff-859e2b09d96c61167d13e65208236dd8R332).

@bcardiff bcardiff force-pushed the bcardiff:mt/gc branch from 01ace5b to a3632e1 Apr 11, 2019

@bcardiff (Author, Member) commented Apr 11, 2019

As I mentioned before, setting the stack bottom in the context switch or in before_collect didn't show an impact in the current tests. Adding the change seems reasonable.

bcardiff added some commits Mar 20, 2019

Fiber: Refactor GC hook
Avoid iterating fibers if no gc is used
Remove GC.before_collect as a private API
Introduce Fiber::Context struct
The fiber context holds the current stack top pointer (for a saved
stack), as well as a resumable flag that tells whether the fiber can
be resumed (its context was fully saved), is currently running, or
is dead (its proc returned).

The resumable flag is required to prevent a fiber from being resumed
while it's still running. This should usually not happen with the
monothreaded scheduler, but it is still useful for finding a buggy
synchronisation primitive that would enqueue the same fiber multiple
times (e.g. the current `select` implementation). This will be very
frequent with a multithreaded scheduler, where a thread B could try
to resume a fiber that has just been enqueued by thread A but hasn't
fully stored its context yet.

The fiber status, running or dead, is monitored using an `@alive`
boolean. We can't set `context.resumable` to another value, because
the scheduler eventually swaps the context to another fiber, which
would overwrite `context.resumable`.

The scheduler now aborts with a fatal error whenever it would resume
a running or dead fiber, which would otherwise lead to a segfault.
It doesn't just ignore the fiber, since enqueueing a fiber twice, or
enqueueing a dead fiber, is an error that must be fixed.
Samples: stress test of the GC with multi-threading
* CLI options for the sample to run as a benchmark
* Keep a private queue per thread to avoid contention
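A sketch of the Fiber::Context described in the commit message above (field types are assumptions; the @[Extern] annotation matches the preference stated earlier in the thread):

@[Extern]
struct Fiber::Context
  # Top of the saved stack; only meaningful while the fiber is suspended.
  property stack_top : Void*
  # Non-zero when the context is fully saved and may be resumed.
  property resumable : LibC::Long

  def initialize(@stack_top : Void* = Pointer(Void).null)
    @resumable = LibC::Long.new(0)
  end
end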

@bcardiff bcardiff force-pushed the bcardiff:mt/gc branch from a3632e1 to a14b4aa Apr 12, 2019

GC: Added missing call to set the GC stackbottom on single thread mode. Use preview_mt flag to disable validations in single thread mode.

@bcardiff bcardiff force-pushed the bcardiff:mt/gc branch from be449ae to 24a7615 Apr 15, 2019

@bcardiff (Author, Member) commented Apr 15, 2019

Today we've been reviewing this PR with @waj and we found a missing piece introduced in the last refactor at a14b4aa: for single thread we were not setting the stackbottom.

Now that it is fixed, single thread apps should see no impact.

@ysbaddaden (Member) commented Apr 15, 2019

Can we just use mt instead of preview_mt?

Thread: deterministic detach. Avoid finalize warnings because of reference loops between threads and fibers.
@waj (Member) commented Apr 16, 2019

I prefer to use preview_mt for now. It requires less explanation about why mt is not usable yet.

@waj

waj approved these changes Apr 16, 2019

@bcardiff bcardiff merged commit 55f1735 into crystal-lang:master Apr 16, 2019

5 checks passed

ci/circleci: check_format Your tests passed on CircleCI!
ci/circleci: test_darwin Your tests passed on CircleCI!
ci/circleci: test_linux Your tests passed on CircleCI!
ci/circleci: test_linux32 Your tests passed on CircleCI!
continuous-integration/travis-ci/pr The Travis CI build passed

@bcardiff bcardiff deleted the bcardiff:mt/gc branch Apr 18, 2019
